    1. Thank you for submitting this paper. I think the paper requires substantial, major revisions to be published. Throughout the paper I noted many instances where references or examples would help make the intent clear. I also think the message of the paper would benefit from several figures to demonstrate workflows or ideas. The figures presented are essentially tables, and I think the message could be made clearer for the reader if they were presented as flow charts or at least with clear numbering to hook the ideas to the reader - e.g., Figures 1 & 2 would benefit from having numbers on the key ideas.

      The paper lacks citations in many places, and at times reads as though it is an essay delivering an opinion. I'm not sure if this is the type of article that the journal would like, but two examples of sentences missing citations are:

      1. "Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection." (Introduction, page 2)

      2. "A large number of examples cited in this context involves faulty software or inappropriate use of software" (Introduction, page 3)

      Two examples of sentences missing examples are:

      1. "Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete" (Mature vs. experimental software, page 7). Could the author provide more examples of what "experimental software" is? There is also consistent use of universal terms like "...is rarely up to date or complete", which would be better phrased as "is often not up to date or complete".

      2. "There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification." (Use Digital Scientific Notations, page 18)

      Overall the paper introduces many new concepts, and I think it would greatly benefit from being made shorter and more concise, with some key figures added for the reader to refer back to when working through these new ideas. The paper is well written, and it is clear the author is a great writer who has put a lot of thought into the ideas. However, it is my opinion that because these ideas are so big and require so much unpacking, they are also harder to understand. The reader would benefit from more guidance to come back to in order to understand these ideas.

      I hope this review is helpful to the author.

      Review comments

      Introduction

      Highlight [page 2]: Ever since the beginnings of organized science in the 17th century, researchers are expected to put all facts supporting their conclusions on the table, and allow their peers to inspect them for accuracy, pertinence, completeness, and bias. Since the 1950s, critical inspection has become an integral part of the publication process in the form of peer review, which is still widely regarded as a key criterion for trustworthy results.

      • and Note [page 2]: Both of these statements feel like they should be supported by peer-reviewed work or some other reference. What were the beginnings of organised science in the 1600s? Why since the 1950s? Why not sooner? What happened then?

      Highlight [page 2]: Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection.

      Highlight [page 2]: In the quantitative sciences, almost all of today’s research critically relies on computational techniques, even when they are not the primary tool for investigation

      • and Note [page 2]: Again, it does feel like it would be great to acknowledge research in this space.

      Highlight [page 2]: But then, scientists mostly abandoned doubting.

      • and Note [page 2]: This reads like an essay. What evidence supports a statement like this?

      Highlight [page 2]: Automation bias

      • and Note [page 2]: What is automation bias?

      Highlight [page 3]: A large number of examples cited in this context involves faulty software or inappropriate use of software

      • and Note [page 3]: Can you provide some examples of the examples cited that you are referring to here?

      Highlight [page 3]: A particularly frequent issue is the inappropriate use of statistical inference techniques.

      • and Note [page 3]: Please provide citations to these frequent issues.

      Highlight [page 3]: The Open Science movement has made a first step towards dealing with automated reasoning in insisting on the necessity to publish scientific software, and ideally making the full development process transparent by the adoption of Open Source practices

      • and Note [page 3]: Could you provide an example of one of these Open Science movements?

      Highlight [page 3]: Almost no scientific software is subjected to independent review today.

      • and Note [page 3]: How can you justify this claim?

      Highlight [page 3]: In fact, we do not even have established processes for performing such reviews

      Highlight [page 3]: as I will show

      • and Note [page 3]: How will you show this?

      Highlight [page 3]: is as much a source of mistakes as defects in the software itself

      • and Note [page 3]: Again, this feels like a statement of fact without evidence or citation.

      Highlight [page 3]: This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.

      • and Note [page 3]: The same can be said of assumptions for equations and mathematics - the problem here is dealing with abstraction of complexity and the potential unintended consequences.

      Highlight [page 4]: the preservation of epistemic diversity

      • and Note [page 4]: Please define epistemic diversity
      Reviewability of automated reasoning systems

      Highlight [page 5]: The five dimensions of scientific software that influence its reviewability.

      • and Note [page 5]: It might be clearer to number these in the figure. I would also suggest changing the term “convivial”, as it is a pretty unusual word.
      Wide-spectrum vs. situated software

      Highlight [page 6]: In between these extremes, we have in particular domain libraries and tools, which play a very important role in computational science, i.e. in studies where computational techniques are the principal means of investigation

      • and Note [page 6]: I’m not very clear on this example - can you provide an example of a “domain library” or “domain tool” ?

      Highlight [page 6]: Situated software is smaller and simpler, which makes it easier to understand and thus to review.

      • and Note [page 6]: I’m not sure I agree it is always smaller and simpler - the custom code for a new method could be incredibly complicated.

      Highlight [page 6]: Domain tools and libraries

      • and Note [page 6]: Can you give an example of this?
      Mature vs. experimental software

      Highlight [page 7]: Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete

      • and Note [page 7]: Could the author provide more examples of what “experimental software” is? There is also consistent use of universal terms like “…is rarely up to date or complete”, which would be better phrased as “is often not up to date or complete”

      Highlight [page 7]: An extreme case of experimental software is machine learning models that are constantly updated with new training data.

      • and Note [page 7]: Such as…

      Highlight [page 7]: interlocutor

      • and Note [page 7]: suggest “middle man” or “mediator”, ‘interlocutor’ isn’t a very common word

      Highlight [page 7]: A grey zone

      • and Note [page 7]: I think it would be helpful to discuss black and white zones before this.

      Highlight [page 7]: The libraries of the scientific Python ecosystem

      • and Note [page 7]: Do you mean SciPy? https://scipy.org/. Can you provide an example of the frequent changes that break backward compatibility?

      Highlight [page 7]: too late that some of their critical dependencies are not as mature as they seemed to be

      • and Note [page 7]: Again, can you provide some evidence for this?

      Highlight [page 7]: The main difference in practice is the widespread use of experimental software by unsuspecting scientists who believe it to be mature, whereas users of instrument prototypes are usually well aware of the experimental status of their equipment.

      • and Note [page 7]: Again this feels like an assertion without evidence. Is this an essay, or a research paper?
      Convivial vs. proprietary software

      Highlight [page 8]: Convivial software [Kell 2020], named in reference to Ivan Illich’s book “Tools for conviviality” [Illich 1973], is software that aims at augmenting its users’ agency over their computation

      • and Note [page 8]: It would be really helpful if the author would define the word, “convivial” here. It would also be very useful if they went on to give an example of what they meant by: “…software that aims at augmenting its users’ agency over their computation.” How does it augment the users agency?

      Highlight [page 8]: Shaw recently proposed the less pejorative term vernacular developers [Shaw 2022]

      • and Note [page 8]: Could you provide an example of what makes “vernacular developers” different, or just what they mean by this term?

      Highlight [page 8]: which Illich has described in detail

      • and Note [page 8]: Should this have a citation to Illich then in this sentence?

      Highlight [page 8]: what has happened with computing technology for the general public

      • and Note [page 8]: Can you give an example of this. Do you mean the rise of Apple and Windows? MS Word? Facebook? A couple of examples would be really useful to make this point clear.

      Highlight [page 8]: tech corporations

      • and Note [page 8]: Suggest “tech corporations” be “technology corporations”.

      Highlight [page 8]: Some research communities have fallen into this trap as well, by adopting proprietary tools such as MATLAB as a foundation for their computational tools and models.

      • and Note [page 8]: Can you provide an example of the alternative here, what would be the way to avoid this trap - use software such as Octave, or?

      Highlight [page 8]: Historically, the Free Software movement was born in a universe of convivial technology.

      • and Note [page 8]: If it is historic, can you please provide a reference to this?

      Highlight [page 8]: most of the software they produced and used was placed in the public domain

      • and Note [page 8]: Can you provide an example of this? I’m also curious how the software was placed in the public domain if there was no way to distribute it via the internet.

      Highlight [page 8]: as they saw legal constraints as the main obstacle to preserving conviviality

      • and Note [page 8]: Again, these are conjectures that are lacking a reference or example, can you provide some examples of references of this?

      Highlight [page 9]: Software complexity has led to a creeping loss of user agency, to the point that even building and installing Open Source software from its source code is often no longer accessible to non-experts, making them dependent not only on the development communities, but also on packaging experts. An experience report on building the popular machine learning library PyTorch from source code nicely illustrates this point [Courtès 2021].

      • and Note [page 9]: Can you summarise what makes it difficult to install Open Source Software? Again, this statement feels like it is making a strong generalisation without clear evidence to support it. The article by Courtès (https://hpc.guix.info/blog/2021/09/whats-in-a-package/) actually notes that it’s straightforward to install PyTorch via pip, but that using an alternative package manager causes difficulty. The point you are making here seems to be that building and installing most open source software is almost prohibitive, but I don’t think you’ve given strong evidence for this claim, and I don’t understand how this builds into your overall argument.

      Highlight [page 9]: It survives mainly in communities whose technology has its roots in the 1980s, such as programming systems inheriting from Smalltalk (e.g. Squeak, Pharo, and Cuis), or the programmable text editor GNU Emacs.

      • and Note [page 9]: Can you give an example of how it survives in these communities?

      Highlight [page 9]: FLOSS has been rapidly gaining in popularity, and receives strong support from the Open Science movement

      • and Note [page 9]: Can you provide some evidence to back this statement up?

      Highlight [page 9]: the traditional values of scientific research.

      • and Note [page 9]: Can you state what you mean by “traditional values of scientific research”

      Highlight [page 9]: always been convivial

      • and Note [page 9]: Can you provide a further explanation of what makes them convivial?
      Transparent vs. opaque software

      Highlight [page 9]: Transparent software

      • and Note [page 9]: It might be useful to explain a distinction between transparent and open software - or to perhaps open with a statement for why we are talking about transparent and opaque software.

      Highlight [page 9]: Large language models are an extreme example.

      • and Note [page 9]: Based on your definition of transparent software - every action produces a visible result. If I type something into an LLM and get an immediate and visible result, how is this different? It is possible you are stating that the behaviour is able to be easily interpreted, or perhaps the behaviour is easy to understand?

      Highlight [page 10]: Even highly interactive software, for example in data analysis, performs nonobvious computations, yielding output that an experienced user can perhaps judge for plausibility, but not for correctness.

      • and Note [page 10]: Could you give a small example of this?

      Highlight [page 10]: It is much easier to develop trust in transparent than in opaque software.

      • and Note [page 10]: Can you state why it is easier to develop this trust?

      Highlight [page 10]: but also less important

      • and Note [page 10]: Can you state why it is less important?

      Highlight [page 10]: even a very weak trustworthiness indicator such as popularity becomes sufficient

      • and Note [page 10]: becomes sufficient for what? Reviewing? Why does it become sufficient?

      Highlight [page 10]: This is currently a much discussed issue with machine learning models,

      • and Note [page 10]: Given it is currently much discussed, could you link to at least 2 research articles discussing this point?

      Highlight [page 10]: treated extensively in the philosophy of science.

      • and Note [page 10]: Given that it has been treated extensively, can you please provide some key references after this statement? You do go on to cite one paper, but it would be helpful to mention at least a few key articles.
      Size of the minimal execution environment

      Highlight [page 11]: The importance of this execution environment is not sufficiently appreciated by most researchers today, who tend to consider it a technical detail

      • and Note [page 11]: This statement is a bit of a sweeping generalisation - why is it not sufficiently appreciated? What evidence do you have of this?

      Highlight [page 11]: Software environments have only recently been recognized as highly relevant for automated reasoning in science and beyond

      • and Note [page 11]: Where have they been only recently recognised?

      Highlight [page 11]: However, they have not yet found their way into mainstream computational science.

      • and Note [page 11]: Could you provide an example of what it might look like if they were in mainstream computational science? For example, https://github.com/ropensci/rix implements reproducible environments for R using Nix. What makes this not mainstream? Are you talking about mainstream in the sense of MS Excel? SPSS/SAS/STATA?
      Analogies in experimental and theoretical science

      Highlight [page 12]: Non-industrial components are occasionally made for special needs, but this is discouraged by their high manufacturing cost

      • and Note [page 12]: Can you provide an example of this?

      Highlight [page 12]: cables

      • and Note [page 12]: What do you mean by a cable? As in a computer cable? An electricity cable?

      Highlight [page 13]: which an experienced microscopist will recognize. Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.

      • and Note [page 13]: I don’t think this is a fair comparison. Surely there must be instances of experienced microscopists not identifying defects? Similarly, why can’t there be examples of a domain expert or professional programmer/computer scientist identifying errors? Don’t unit tests help protect us against some of our errors? Granted, they aren’t bulletproof, and perhaps act more like guard rails.

      Highlight [page 13]: where “traditional” means not relying on any form of automated reasoning.

      • and Note [page 13]: Can you give an example of what a “traditional” scientific model or theory is?
      Improving the reviewability of automated reasoning systems

      Highlight [page 14]: Figure 2: Four measures that can be taken to make scientific software more trustworthy.

      • and Note [page 14]: Could the author perhaps instead call these “four measures” or perhaps give them a better name, and number them?
      Review the reviewable

      Highlight [page 14]: mature wide-spectrum software

      • and Note [page 14]: Can you give an example of what “mature wide-spectrum software” is?

      Highlight [page 15]: The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.

      Science vs. the software industry

      Highlight [page 15]: Many computers, operating systems, and compilers were designed specifically for the needs of scientists.

      • and Note [page 15]: Could you give an example of this? E.g., FORTRAN? COBOL?

      Highlight [page 15]: Today, scientists use mostly commodity hardware

      • and Note [page 15]: Can you explain what you mean by “commodity hardware”, and give an example.

      Highlight [page 15]: even considered advantageous if it also creates a barrier to reverse-engineering of the software by competitors

      • and Note [page 15]: Can you give an example of this?

      Highlight [page 15]: few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for

      • and Note [page 15]: What about software like SPSS/STATA/SAS - surely many many industries, and also researchers will pay for software like this that is considered mature?
      Emphasize situated and convivial software

      Highlight [page 16]: a convivial collection of more situated modules, possibly supported by a shared wide-spectrum layer.

      • and Note [page 16]: Could you give an example of what this might look like practically? Are you saying things like SciPy would be restructured into many separate modules, or?

      Highlight [page 16]: In terms of FLOSS jargon, users make a partial fork of the project. Version control systems ensure provenance tracking and support the discovery of other forks. Keeping up to date with relevant forks of one’s software, and with the motivations for them, is part of everyday research work at the same level as keeping up to date with publications in one’s wider community. In fact, another way to describe this approach is full integration of scientific software development into established research practices, rather than keeping it a distinct activity governed by different rules.

      • and Note [page 16]: Could the author provide a diagram or schematic to more clearly show how such a system would work with forks etc?

      Highlight [page 17]: a universe is very

      • and Note [page 17]: Perhaps this could be “would be very different” - since this doesn’t yet exist, right?

      Highlight [page 17]: Improvement thus happens by small-step evolution rather than by large-scale design. While this may look strange to anyone used to today’s software development practices, it is very similar to how scientific models and theories have evolved in the pre-digital era.

      • and Note [page 17]: I think some kind of schematic or workflow to compare existing practices to this new practice would be really useful to articulate these points. I also think this new method of development you are proposing should have a concrete name.

      Highlight [page 17]: Existing code refactoring tools can probably be adapted to support application-specific forks, for example via code specialization. But tools for working with the forks, i.e. discovering, exploring, and comparing code from multiple forks, are so far lacking. The ideal toolbox should support both forking and merging, where merging refers to creating consensual code versions from multiple forks. Such maintenance by consensus would probably be much slower than maintenance performed by a coordinated team.

      • and Note [page 17]: Perhaps an example or screenshot of a diff could be used to demonstrate that we can compare changes between two branches/commits, but that comparing multiple is challenging?
      Make scientific software explainable

      Highlight [page 18]: An interesting line of research in software engineering is exploring possibilities to make complete software systems explainable [Nierstrasz and Girba 2022]. Although motivated by situated business applications, the basic ideas should be transferable to scientific computing

      • and Note [page 18]: Is this similar to concepts such as “X-AI” or “X-ML” - that is, “Explainable” Artificial Intelligence or Machine Learning?

      Highlight [page 18]: Unlike traditional notebooks, Glamorous Toolkit [feenk.com 2023],

      • and Note [page 18]: It appears that you have introduced “Glamorous Toolkit” as an example of these three principles? It feels like it should be introduced earlier in this paragraph?

      Highlight [page 18]: In Glamorous Toolkit, whenever you look at some code, you can access corresponding examples (and also other references to the code) with a few mouse clicks

      • and Note [page 18]: I think it would be very beneficial to show screenshots of what the author means. While I can follow the link to Glamorous Toolkit, bitrot is a thing, and that link might go away, so it would be good to see exactly what the author means when they discuss these examples.
      Use Digital Scientific Notations

      Highlight [page 18]: There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification

      • and Note [page 18]: Can you give an example of these techniques?

      Highlight [page 18]: The use of these tools is, for now, reserved to software that is critical for safety or security,

      • and Note [page 18]: Again, could you give an example of this point? Which tools, and which software is critical for safety or security?

      Highlight [page 19]: formal specifications

      • and Note [page 19]: It would be really helpful if you could demonstrate an example of a formal specification so we can understand how they could be considered constraints.

      Highlight [page 19]: All of them are much more elaborate than the specification of the result they produce. They are also rather opaque.

      • and Note [page 19]: It isn’t clear to me how these are opaque - if the algorithm is defined, it can be understood, how is it opaque?

      Highlight [page 19]: Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023]

      • and Note [page 19]: A tight example of this would be really useful to make this point clear. Perhaps with a figure of a specification alongside an algorithm.
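      As a rough illustration of the kind of side-by-side meant here, the sketch below contrasts a specification of sorting with one algorithm that satisfies it. It is not taken from the manuscript: the function names `satisfies_sort_spec` and `insertion_sort` are hypothetical, and Python is used only as a convenient notation.

```python
from collections import Counter


def satisfies_sort_spec(inp, out):
    """Specification: the output is ordered and is a permutation of the input."""
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    same_items = Counter(inp) == Counter(out)
    return ordered and same_items


def insertion_sort(items):
    """One of many algorithms that satisfy the specification above."""
    result = list(items)
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift larger elements one slot to the right to make room for `current`.
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result
```

      The specification consists of two independent conditions that can be read and checked separately, whereas the algorithm interleaves index bookkeeping with the comparison logic.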

      Highlight [page 19]: In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification.

      • and Note [page 19]: Is test-driven development an example of this?

      Highlight [page 19]: A formal specification has to evolve in the same way, and is best seen as the formalization of the scientific knowledge. Change can flow from specification to software, but also in the opposite direction.

      • and Note [page 19]: Again, I think a good figure here would be very helpful in articulating this clearly.

      Highlight [page 19]: My own experimental Digital Scientific Notation, Leibniz [Hinsen 2024], is intended to resemble traditional mathematical notation as used e.g. in physics. Its statements are embeddable into a narrative, such as a journal article, and it intentionally lacks typical programming language features such as scopes that do not exist in natural language, nor in mathematical notation.

      • and Note [page 19]: Could we see an example of what this might look like?
      Conclusion

      Highlight [page 20]: Situated software is easy to recognize.

      • and Note [page 20]: Could you provide some examples?

      Highlight [page 20]: Examples from the reproducibility crisis support this view

      • and Note [page 20]: Can you provide some example papers that you mention here?

      Highlight [page 21]: The ideal structure for a reliable scientific software stack would thus consist of a foundation of mature software, on top of which a transparent layer of situated software, such as a script, a notebook, or a workflow, orchestrates the computations that together answer a specific scientific question. Both layers of such a stack are reviewable, as I have explained in section 3.1, but adequate reviewing processes remain to be enacted.

      • and Note [page 21]: Again, I think it would be very insightful for the reader to have a clear figure to rest these ideas upon.

      Highlight [page 21]: has been neglected by research institutions all around the world

      • and Note [page 21]: I do not think this is true. Could you perhaps say “neglected by most/many” instead?
    2. Dear editors and reviewers, Thank you for your careful reading of my manuscript and the detailed and insightful feedback. It has contributed significantly to the improvements in the revised version. Please find my detailed responses below.

      1 Reviewer 1

      Thank you for this helpful review, and in particular for pointing out the need for more references, illustrations, and examples in various places of my manuscript. In the case of the section on experimental software, the search for examples made clear to me that the label was in fact badly chosen. I have relabeled the dimension as “stable vs. evolving software”, and rewritten the section almost entirely. Another major change motivated by your feedback is the addition of a figure showing the structure of a typical scientific software stack (Fig. 2), and of three case studies (section 2.7) in which I evaluate scientific software packages according to my five dimensions of reviewability. The discussion of conviviality (section 2.4), a concept that is indeed not widely known yet, has been much expanded.

      I have followed the advice to add references in many places. I have been more hesitant to follow the requests for additional examples and illustrations, because of the inevitable conflict with the equally understandable request to make the paper more compact. In many cases, I have preferred to refer to examples discussed in the literature.

      A few comments deserve a more detailed reply:

      Introduction

      Highlight [page 3]: In fact, we do not even have established processes for performing such reviews

      and Note [page 3]: I disagree, there is the Journal of Open Source Software: https://joss.theoj.org/, rOpenSci has a guide for development of peer review of statistical software: https://github.com/ropensci/statistical-software-review-book, and also maintains a very clear process of software review: https://ropensci.org/software-review/

      As I say in the section “Review the reviewable”, these reviews are not independent critical examination of the software as I define it. Reviewers are not asked to evaluate the software’s correctness or appropriateness for any specific purpose. They are expected to comment only on formal characteristics of the software publication process (e.g. “is there a license?”), and on a few software engineering quality indicators (“is there a test suite?”).

      Highlight [page 3]: This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.

      and Note [page 3]: The same can be said of assumptions for equations and mathematics - the problem here is dealing with abstraction of complexity and the potential unintended consequences.

      Indeed. That’s why we need someone other than the authors to go through mathematical reasoning and verify it. Which we do.

      Reviewability of automated reasoning systems

      Wide-spectrum vs. situated software

      Highlight [page 6]: Situated software is smaller and simpler, which makes it easier to understand and thus to review.

      and Note [page 6]: I’m not sure I agree it is always smaller and simpler - the custom code for a new method could be incredibly complicated.

      The comparison is between situated software and more generic software performing the same operation. For example, a script reading one specific CSV file compared to a subroutine reading arbitrary CSV files. I have yet to see a case in which abstraction from a concrete to a generic function makes code smaller or simpler.
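      The CSV comparison can be sketched as follows. This is an illustration under stated assumptions, not code from the manuscript: the sample data, the column names, and the functions `read_my_measurements` and `read_csv_generic` are all hypothetical.

```python
import csv
import io

# Hypothetical file contents, for illustration only.
SAMPLE = "time,temperature\n0.0,20.5\n1.0,21.0\n"


def read_my_measurements(text):
    """Situated: relies on this one file having a header line and two float columns."""
    rows = text.strip().splitlines()[1:]
    return [tuple(float(x) for x in row.split(",")) for row in rows]


def read_csv_generic(text, delimiter=",", has_header=True, converters=None):
    """More generic: configurable delimiter, optional header, per-column converters."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip empty lines
    header = rows[0] if has_header else None
    if has_header:
        rows = rows[1:]
    converters = converters or {}
    records = []
    for row in rows:
        record = []
        for index, value in enumerate(row):
            # Columns are addressed by name if there is a header, else by position.
            key = header[index] if header else index
            convert = converters.get(key, str)
            record.append(convert(value))
        records.append(tuple(record))
    return header, records
```

      Even this modest generalization adds parameters, branches, and a richer return value, each of which a reviewer would have to understand and check.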

      Convivial vs. proprietary software

      Highlight [page 8]: most of the software they produced and used was placed in the public domain

      and Note [page 8]: Can you provide an example of this? I’m also curious how the software was placed in the public domain if there was no way to distribute it via the internet.

      Software distribution in science was well organized long before the Internet; it was just slower and more expensive. Both decks of punched cards and magnetic tapes were routinely sent by mail. The earliest organized software distribution for science I am aware of was the DECUS Software Library in the early 1960s.

      Size of the minimal execution environment

      Note [page 11]: Could you provide an example of what it might look like if they were in mainstream computational science? For example, https://github.com/ropensci/rix implements reproducible environments for R using Nix. What makes this not mainstream? Are you talking about mainstream in the sense of MS Excel? SPSS/SAS/STATA?

      I have looked for quantitative studies on software use in science that would allow me to give a precise meaning to “mainstream”, but I have not been able to find any. Based on my personal experience, mostly with teaching MOOCs on computational science in which students are asked about the software they use, the most widely used platform is Microsoft Windows. Linux is already a minority platform (though overrepresented in computer science), and Nix users are again a small minority among Linux users.

      Analogies in experimental and theoretical science

      Highlight [page 13]: which an experienced microscopist will recognize. Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.

      and Note [page 13]: I don’t think this is a fair comparison. Surely there must be instances of experienced microscopists not identifying defects? Similarly, why can’t there be examples of a domain expert or professional programmer/computer scientist identifying errors? Don’t unit tests help protect us against some of our errors? Granted, they aren’t bulletproof, and perhaps act more like guard rails.

      There are probably cases of microscopists not noticing defects, but my point is that if you ask them to look for defects, they know what to do (and I have made this clearer in my text). For contrast, take GROMACS (one of my case studies in the revised manuscript) and ask either an expert programmer or an experienced computational biophysicist if it correctly implements, say, the AMBER force field. They wouldn’t know what to do to answer that question, both because it is ill-defined (there is no precise definition of the AMBER force field) and because the number of possible mistakes and symptoms of mistakes is enormous. I have seen a protein simulation program fail for proteins whose number of atoms was in a narrow interval, defined by the size that a compiler attributed to a specific data structure. I was able to catch and track down this failure only because a result was obviously wrong for my use case. I have never heard of similar issues with microscopes.

      Improving the reviewability of automated reasoning systems

      Review the reviewable

      Highlight [page 15]: The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.

      and Note [page 15]: I disagree. Monash provides an example here where they view software as a first class research output: https://robjhyndman.com/files/EBS_research_software.pdf

      This example is about superficial reviews in the context of career evaluation. Other institutions have similar processes. As far as I know, none of them ask reviewers to look at the actual code and comment on its correctness or its suitability for some specific purpose.

      Science vs. the software industry

      Highlight [page 15]: few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for

      and Note [page 15]: What about software like SPSS/STATA/SAS- surely many many industries, and also researchers will pay for software like this that is considered mature?

      I could indeed extend the list of examples to include various industries. Compared to the huge number of individuals using PCs and smartphones, that’s still few customers.

      Emphasize situated and convivial software

      Note [page 16]: Could the author provide a diagram or schematic to more clearly show how such a system would work with forks etc?

      I have decided the contrary: I have significantly shortened this section, removing all speculation about how the ideas could be turned into concrete technology. The reason is that I have been working on this topic since I wrote the reviewed version of this manuscript, and I have a lot more to say about it than would be reasonable to include in this work. This will become a separate article.

      Make scientific software explainable

      Note [page 18]: I think it would be very beneficial to show screenshots of what the author means- while I can follow the link to Glamorous Toolkit, bitrot is a thing, and that might go away, so it would good to see exactly what the author means when they discuss these examples.

      Unfortunately, static screenshots can only convey a limited impression of Glamorous Toolkit, but I agree that they are a more stable support than the software itself. Rather than adding my own screenshots, I refer to a recent paper by the authors of Glamorous Toolkit that includes many screenshots for illustration.

      Use Digital Scientific Notations

      Highlight [page 19]: formal specifications

      and Note [page 19]: It would be really helpful if you could demonstrate an example of a formal specification so we can understand how they could be considered constraints.

      Highlight [page 19]: Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023]

      and Note [page 19]: A tight example of this would be really useful to make this point clear. Perhaps with a figure of a specification alongside an algorithm.

      I do give an example: sorting a list. To write down an actual formalized version, I’d have to introduce a formal specification language and explain it, which I think goes well beyond the scope of this article. Illustrating modularity requires an even larger example. This is, however, an interesting challenge which I’d be happy to take up in a future article.

      Highlight [page 19]: In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification.

      and Note [page 19]: Is an example of this test-driven development?

      Not exactly, though the underlying idea is similar: provide a condition that a result must satisfy as evidence for being correct. With testing, the condition is spelt out for one specific input. In a formal specification, the condition is written down for all possible inputs.
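
      The difference can be illustrated for the sorting example with a short Python sketch. It is not a formal specification in any specification language: the names are hypothetical, `my_sort` merely stands in for an implementation under scrutiny, and since Python cannot prove a condition for all inputs, the sketch falls back on checking it exhaustively for all short lists over a small alphabet.

```python
from collections import Counter
from itertools import product

def my_sort(items):
    # Stand-in for the implementation under scrutiny (hypothetical name).
    return sorted(items)

# A test: the condition is spelt out for one specific input.
def test_one_input():
    return my_sort([3, 1, 2]) == [1, 2, 3]

# A specification-style condition: it is stated for any input/output pair.
def satisfies_sort_spec(inp, out):
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    same_elements = Counter(inp) == Counter(out)   # output is a permutation of input
    return ordered and same_elements

# Approximation of "for all inputs": every list of length <= max_len
# with elements drawn from a small alphabet.
def check_small_inputs(max_len=4, alphabet=(0, 1, 2)):
    return all(
        satisfies_sort_spec(list(inp), my_sort(list(inp)))
        for n in range(max_len + 1)
        for inp in product(alphabet, repeat=n)
    )
```

      The test mentions a concrete expected result, whereas the condition in `satisfies_sort_spec` never names one: it only constrains what any acceptable result must look like.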

      2 Reviewer 2

      First of all, I would like to thank the reviewer for this thoughtful review. It addresses many points that require clarification in my article, which I hope to have provided adequately in the revised version.

      One such point is the role and form of reviewing processes for software. I have made it clearer that I take “review” to mean “critical independent inspection”. It could be performed by the user of a piece of software, but the standard case should be a review performed by experts at the request of some institution that then publishes the reviewer’s findings. There is no notion of gatekeeping attached to such reviews. Users are free to ignore them. Given that today we publish and use scientific software without any review at all, a shift to the opposite extreme of reviewers becoming gatekeepers seems unlikely to me.

      Your comment on users being software developers addresses another important point that I had failed to make clear: conviviality is all about diminishing the distinction between developers and users. Users gain agency over their computations at the price of taking on more of a developer role. This is now stated explicitly in the revised article. Your hypothesis that I want scientific software to be convivial is only partially true. I want convivially structured software to be an option for scientists, with adequate infrastructure and tooling support, but I do not consider it to be the best approach for all scientific software.

      The paragraph on the relevance and importance of reviewing in your comment is a valid point of view but, unsurprisingly, not mine. In the grand scheme of science, no specific quality assurance measure is strictly necessary. There is always another layer above that will catch mistakes that weren’t detected in the layer below. It is thus unlikely that unreliable software will cause all of science to crumble. But from many perspectives, including overall efficiency, personal satisfaction of practitioners, and insight derived from the process, it is preferable to catch mistakes as closely as possible to their source. Pre-digital theoreticians have always double-checked their manual calculations before submitting their papers, rather than sending off unchecked results and counting on confrontation with experiment for finding mistakes. I believe that we should follow this same approach with software. The cost of mistakes can be quite high. Consider the story of the five retracted protein structures that I cite in my article (Miller, 2006, 10.1126/science.314.5807.1856). The five publications that were retracted involved years of work by researchers, reviewers, and editors. In between their publication and their retraction, other protein crystallographers saw their work rejected because it was in contradiction with the high-profile articles that later turned out to be wrong. The whole story has probably involved a few ruined careers in addition to its monetary cost. In contrast, independent critical examination of the software and the research processes in which it was used would likely have spotted the problem rather quickly (Matthews, 2007).

      You point out that reviewability is also a criterion in choosing software to build on, and I agree. Building on other people’s software requires trusting it. Incorporating it into one’s own work (the core principle of convivial software) requires understanding it. This is in fact what motivated my reflections on this topic. I am not much interested in neatly separating epistemic and practical issues. I am a practitioner; my interest in epistemology comes from a desire for improving practices.

      Review holism is something I have not thought about before. I consider it both impossible to apply in practice and of little practical value. What I am suggesting, and I hope to have made this clearer in my revision, is that reviewing must take into account the dependency graph. Reviewing software X requires a prior review of its dependencies (possibly already done by someone else), and a consideration of how each dependency influences the software under consideration. However, I do not consider Donoho’s “frictionless reproducibility” a sufficient basis for trust. It has the same problem as the widespread practice of tacitly assuming a piece of software to be correct because it is widely used. This reasoning is valid only if mistakes have a high chance of being noticed, and that’s in my experience not true for many kinds of research software. “It works”, when pronounced by a computational scientist, really means “There is no evidence that it doesn’t work”.
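
      The idea that a review of software X presupposes reviews of its dependencies can be sketched as an ordering problem on the dependency graph. The following Python fragment is only an illustration under stated assumptions: the package names are made up, and the standard-library `graphlib` module (Python 3.9 or later) is used to compute an order in which every package comes after the packages it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each package maps to the packages it depends on.
DEPENDENCIES = {
    "analysis_script": {"simulation_lib", "plotting_lib"},
    "simulation_lib": {"linear_algebra_lib"},
    "plotting_lib": {"linear_algebra_lib"},
    "linear_algebra_lib": set(),
}

def review_order(dependencies, already_reviewed=()):
    """Return packages so that dependencies precede their dependents,
    skipping packages that someone else has already reviewed."""
    order = TopologicalSorter(dependencies).static_order()
    return [pkg for pkg in order if pkg not in already_reviewed]

print(review_order(DEPENDENCIES))
print(review_order(DEPENDENCIES, already_reviewed={"linear_algebra_lib"}))
```

      The ordering covers only the first half of the requirement. How each dependency influences the software under consideration remains a matter of expert judgement that this sketch does not capture.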

      This is also why I point out the chaotic nature of computation. It is not about Humphreys’ “strange errors”, for which I have no solution to offer. It is about the fact that looking for mistakes requires some prior idea of what the symptoms of a mistake might be. Experienced researchers do have such prior ideas for scientific instruments, and also e.g. for numerical algorithms. They come from an understanding of the instruments and their use, including in particular a knowledge of how they can go wrong. But once your substrate is a Turing-complete language, no such understanding is possible any more. Every programmer has had the experience of chasing down some bug that at first sight seems impossible. My long-term hope is that scientific computing will move towards domain-specific languages that are explicitly not Turing-complete, and offer useful guarantees in exchange. Unfortunately, I am not aware of any research in this space.

      I fully agree with you that internalist justifications are preferable to reliabilistic ones. But being fundamentally a pragmatist, I don’t care much about that distinction. Indisputable justification doesn’t really exist anywhere in science. I am fine with trust that has a solid basis, even if there remains a chance of failure. I’d already be happy if every researcher could answer the question “why do you trust your computational results?” in a way that shows signs of critical reflection.

      What I care about ultimately is improving practices in computational science. Over the last 30 years, I have seen numerous mistakes being discovered by chance, often leading to abandoned research projects. Some of these mistakes were due to software bugs, but the most common cause was an incorrect mental model of what the software does. I believe that the best technique we have found so far to spot mistakes in science is critical independent inspection. That’s why I am hoping to see it applied more widely to computation.

      2.1 References

      Miller, G. (2006) A Scientist’s Nightmare: Software Problem Leads to Five Retractions. Science 314, 1856. https://doi.org/10.1126/science.314.5807.1856

      Matthews, B.W. (2007) Five retracted structure reports: Inverted or incorrect? Protein Science 16, 1013. https://doi.org/10.1110/ps.072888607

      3 Editor

      Bayesian methods often use MCMC, which is often slow and creates long chains of estimates; however, the chains will show if the likelihood does not have a clear maximum, which is usually from a badly specified model...

      That is an interesting observation I haven’t seen mentioned before. I agree that Bayesian inference is particularly amenable to inspection. One more reason to normalize inspection and inspectability in computational science.

      Some reflection on the growing use of AI to write software may be worthwhile.

      The use of AI in writing and reviewing software is a topic I have considered for this review, since the technology has evolved enormously since I wrote the current version of the manuscript. However, in view of reviewer 1’s constant admonition to back up statements with citations, I refrained from delving into this topic. We all know it’s happening, but it’s too early to observe a clear impact on research software. I have therefore limited myself to a short comment in the Conclusion section.

      I wondered if highly-used software should get more scrutiny.

      This is an interesting suggestion. If and when we get serious about reviewing code, resource allocation will become an important topic. For getting started, it’s probably more productive to review newly published code than heavily used code, because there is a better chance that authors actually act on the feedback and improve their code before it has many users. That in turn will help improve the reviewing process, which is what matters most right now, in my opinion.

      “supercomputers are rare”, should this be “relatively rare” or am I speaking from a privileged university where I’ve always had access to supercomputers.

      If you have easy access to a supercomputer, you should indeed consider yourself privileged. But did you ever use supercomputer time for reviewing someone else’s work? I have relatively easy access to supercomputers as well, but I do have to make a request and promise to do innovative research with the allocated resources.

      I did think about “testthat” at multiple points whilst reading the paper (https://testthat.r-lib.org/)

      I hadn’t seen “testthat” before, not being much of a user of R. It looks interesting, and reminds me of similar test support features in Smalltalk which I found very helpful. Improving testing culture is definitely a valuable contribution to improving computational practices.

      Can badges on github about downloads and maturity help (page 7)?

      Badges can help, on GitHub or elsewhere, e.g. in scientific software catalogs. I see them as a coarse-grained output of reviewing. The right balance to find is between the visibility of a badge and the precision of a carefully written review report. One risk with badges is the temptation to automate the evaluation that leads to it. This is fine for quantitative measures such as test coverage, but what we mostly lack today is human expert judgement on software.

    1. Reviewer #1 (Public review):

      This paper describes a number of patterns of epistasis in a large fitness landscape dataset recently published by Papkou et al. The paper is motivated by an important goal in the field of evolutionary biology to understand the statistical structure of epistasis in protein fitness landscapes, and it capitalizes on the unique opportunities presented by this new dataset to address this problem.

      The paper reports some interesting previously unobserved patterns that may have implications for our understanding of fitness landscapes and protein evolution. In particular, Figure 5 is very intriguing. However, I have two major concerns detailed below. First, I found the paper rather descriptive (it makes little attempt to gain deeper insights into the origins of the observed patterns) and unfocused (it reports what appears to be a disjointed collection of various statistics without a clear narrative). Second, I have concerns with the statistical rigor of the work.

      (1) I think Figures 5 and 7 are the main, most interesting, and novel results of the paper. However, I don't think that the statement "Only a small fraction of mutations exhibit global epistasis" accurately describes what we see in Figure 5. To me, the most striking feature of this figure is that the effects of most mutations at all sites appear to be a mixture of three patterns. The most interesting pattern noted by the authors is of course the "strong" global epistasis, i.e., when the effect of a mutation is highly negatively correlated with the fitness of the background genotype. The second pattern is a "weak" global epistasis, where the correlation with background fitness is much weaker or non-existent. The third pattern is the vertically spread-out cluster at low-fitness backgrounds, i.e., a mutation has a wide range of mostly positive effects that are clearly not correlated with fitness. What is very interesting to me is that all background genotypes fall into these three groups with respect to almost every mutation, but the proportions of the three groups are different for different mutations. In contrast to the authors' statement, it seems to me that almost all mutations display strong global epistasis in at least a subset of backgrounds. A clear example is C>A mutation at site 3.

      1a. I think the authors ought to try to dissect these patterns and investigate them separately rather than lumping them all together and declaring that global epistasis is rare. For example, I would like to know whether those backgrounds in which mutations exhibit strong global epistasis are the same for all mutations or whether they are mutation- or perhaps position-specific. Both answers could be potentially very interesting, either pointing to some specific site-site interactions or, alternatively, suggesting that the statistical patterns are conserved despite variation in the underlying interactions.

      1b. Another rather remarkable feature of this plot is that the slopes of the strong global epistasis patterns seem to be very similar across mutations. Is this the case? Is there anything special about this slope? For example, does this slope simply reflect the fact that a given mutation becomes essentially lethal (i.e., produces the same minimal fitness) in a certain set of background genotypes?

      1c. Finally, how consistent are these patterns with some null expectations? Specifically, would one expect the same distribution of global epistasis slopes on an uncorrelated landscape? Are the pivot points unusually clustered relative to an expectation on an uncorrelated landscape?

      1d. The shapes of the DFE shown in Figure 7 are also quite interesting, particularly the bimodal nature of the DFE in high-fitness (HF) backgrounds. I think this bimodality must be a reflection of the clustering of mutation-background combinations mentioned above. I think the authors ought to draw this connection explicitly. Do all HF backgrounds have a bimodal DFE? What mutations occupy the "moving" peak?

      1e. In several figures, the authors compare the patterns for HF and low-fitness (LF) genotypes. In some cases, there are some stark differences between these two groups, most notably in the shape of the DFE (Figure 7B, C). But there is no discussion about what could underlie these differences. Why are the statistics of epistasis different for HF and LF genotypes? Can the authors at least speculate about possible reasons? Why do HF and LF genotypes have qualitatively different DFEs? I actually don't quite understand why the transition between bimodal DFE in Figure 7B and unimodal DFE in Figure 7C is so abrupt. Is there something biologically special about the threshold that separates LF and HF genotypes? My understanding was that this was just a statistical cutoff. Perhaps the authors can plot the DFEs for all backgrounds on the same plot and just draw a line that separates HF and LF backgrounds so that the reader can better see whether the DFE shape changes gradually or abruptly.

      1f. The analysis of the synonymous mutations is also interesting. However, I think a few additional analyses are necessary to clarify what is happening here. I would like to know the extent to which synonymous mutations are more often neutral compared to non-synonymous ones. Then, do synonymous pairs interact in the same way as non-synonymous pairs (i.e., plot Figure 1 for synonymous pairs)? Do synonymous or non-synonymous mutations that are neutral exhibit less epistasis than non-neutral ones? Finally, do non-synonymous mutations alter epistasis among other mutations more often than synonymous mutations do? What about synonymous-neutral versus synonymous-non-neutral mutations? Basically, I'd like to understand the extent to which a mutation that is neutral in a given background is more or less likely to alter epistasis between other mutations than a non-neutral mutation in the same background.

      (2) I have two related methodological concerns. First, in several analyses, the authors employ thresholds that appear to be arbitrary. And second, I did not see any account of measurement errors. For example, the authors chose the 0.05 threshold to distinguish between epistasis and no epistasis, but why this particular threshold was chosen is not justified. Another example: whether the product s12 × (s1 + s2) is greater or smaller than zero for any given mutation is uncertain due to measurement errors. Presumably, how to classify each pair of mutations should depend on the precision with which the fitness of mutants is measured. These thresholds could well be different across mutants. We know, for example, that low-fitness mutants typically have noisier fitness estimates than high-fitness mutants. I think the authors should use a statistically rigorous procedure to categorize mutations and their epistatic interactions. I think it is very important to address this issue. I got very concerned about it when I saw on LL 383-388 that synonymous stop codon mutations appear to modulate epistasis among other mutations. This seems very strange to me and makes me quite worried that this is a result of noise in LF genotypes.
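
      One way such an error-aware classification could look is sketched below in Python. It is an illustration only, not the procedure of the paper under review or of the reviewer. It assumes that s1, s2 and s12 come with standard errors, that their errors are independent and Gaussian (a strong simplification, since in practice they are derived from shared fitness measurements), and that a pair is classified only when at least 95% of simulated draws agree on the sign of s12 × (s1 + s2). All names and numbers are hypothetical.

```python
import random

def classify_sign(s1, s2, s12, se1, se2, se12, n_draws=2000, level=0.95, seed=0):
    """Classify the sign of s12 * (s1 + s2) by Monte Carlo error propagation.

    Assumes independent Gaussian measurement errors (a simplification).
    Returns "positive", "negative" or "uncertain".
    """
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    positive = 0
    for _ in range(n_draws):
        d1 = rng.gauss(s1, se1)
        d2 = rng.gauss(s2, se2)
        d12 = rng.gauss(s12, se12)
        if d12 * (d1 + d2) > 0:
            positive += 1
    fraction_positive = positive / n_draws
    if fraction_positive >= level:
        return "positive"
    if fraction_positive <= 1 - level:
        return "negative"
    return "uncertain"

# Precisely measured effects give a clear call; a noisy s12 near zero does not.
print(classify_sign(0.5, 0.5, 0.5, 0.01, 0.01, 0.01))
print(classify_sign(0.5, 0.5, 0.01, 0.01, 0.01, 0.1))
```

      With such a scheme the effective threshold differs from mutant to mutant, because it follows from each mutant's own measurement precision rather than from a single fixed cutoff.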

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary: 

      The idea is appealing, but the authors have not sufficiently demonstrated the utility of this approach.

      Strengths: 

      Novelty of the approach, potential implications for discovering novel interactions

      Weaknesses:

      The Duong lab introduced their highly elegant peptidisc approach several years ago. In the present work, they combine it with thermal proteome profiling (TPP) and attempt to demonstrate the utility of this combination for identifying novel membrane protein-ligand interactions.

      While I find this idea intriguing, and the approach potentially useful, I do not feel that the authors had sufficiently demonstrated the utility of this approach. My main concern is that no novel interactions are identified and validated. For the presentation of any new methodology, I think this is quite necessary. In addition, except for MsbA, no orthogonal methods are used to support the conclusions, and the authors rely entirely on quantifying rather small differences in abundances using either iBAQ or LFQ.

      We thank the reviewer for their thoughtful comments. In this revision, we have experimentally addressed the reviewer’s concerns in three ways:

      (1) To demonstrate the utility of our MM-TPP method over the detergent-based TPP workflow (termed DB-TPP), we performed a side-by-side comparison using ATP–VO₄ at 51 °C (Figure 3B and Figure 4A). From the DB-TPP dataset, 7.4% of all identified proteins were annotated as ATP-binding, while 6.4% of proteins differentially stabilized were annotated as ATP-binding. In contrast, in the MM-TPP dataset, 9.3% of all identified proteins were annotated as ATP-binding proteins, while 17% of proteins differentially stabilized were annotated as ATP-binding. The lack of enrichment in the detergent-based approach indicates that the observed differences are likely stochastic, rather than a result of specific ATP–VO₄-mediated stabilization as found with MM-TPP. For instance, several key proteins—BCS1, P2RY6, SLC27A2, ABCB1, ABCC2, and ABCC9— found differentially stabilized using the MM-TPP method showed no such pattern in the DB-TPP dataset. This divergence strongly supports the specificity and utility of our Peptidisc approach. 
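
      The enrichment argument in point (1) can be made explicit with a short calculation. The Python sketch below uses only the percentages quoted above; the function name is hypothetical, and because the underlying protein counts are not given here, it computes a simple ratio rather than a significance test.

```python
def fold_enrichment(pct_among_stabilized, pct_among_all):
    """Share of ATP-binding proteins among differentially stabilized proteins,
    divided by their share among all identified proteins."""
    return pct_among_stabilized / pct_among_all

db_tpp = fold_enrichment(6.4, 7.4)    # detergent-based workflow
mm_tpp = fold_enrichment(17.0, 9.3)   # peptidisc-based workflow
print(f"DB-TPP: {db_tpp:.2f}x, MM-TPP: {mm_tpp:.2f}x")
```

      A ratio close to or below 1 (about 0.86 for DB-TPP) means no enrichment of ATP-binding proteins among the stabilized set, whereas MM-TPP shows a ratio of about 1.83.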

      (2) To demonstrate that MM-TPP can resolve not only the broader effects of ATP–VO₄ but also specific ligand–protein interactions, we employed 2-methylthio-ADP (2-MeS-ADP), a selective agonist of the P2RY12 receptor [PMID: 24784220]. In that case, we observed clear thermal stabilization of P2RY12, with more than a 6-fold increase in stability at both 51 °C and 57 °C (–log₁₀ p > 5.97; Figure 4B and Figure S4). Notably, no other proteins, including the structurally related but non-responsive P2RY6 receptor, showed a comparable stabilization fold change at these temperatures.

      (3) To further probe the reproducibility of the method, we performed an independent MM-TPP evaluation with ATP–VO₄ at 51 °C using data-independent acquisition (DIA), in contrast to the data-dependent acquisition (DDA) approach used in the initial study (Figure S5). Overall, 7.8% of all identified proteins were annotated as ATP-binding, and as before, this proportion increased to 17% among proteins with log₂ fold changes greater than 0.5. Specifically, BCS1 and SLC27A2 exhibited strong stabilization (log₂ fold change > 1), while P2RY6, ABCB11, ABCC2, and ABCG2 showed moderate stabilization (log₂ fold changes between 0.5 and 1), and consistent with previous results, P2RX4 was destabilized, with a log₂ fold change below –1. These findings support the consistency and reproducibility of the method across distinct data acquisition methods.

      My main concern is that no novel interactions are identified and validated. For the presentation of any new methodology, I think this is quite necessary.  

      The primary objective of our study is to establish and benchmark the MM-TPP workflow using known targets, rather than to discover novel ligand–protein interactions. Identifying new binders requires extensive screening and downstream validations, which we believe is beyond the scope of this methodological report. Instead, our study highlights the sensitivity and reliability of the MM-TPP approach by demonstrating consistent and reproducible results with well-characterized interactions.

      We respectfully disagree with the notion that introducing a new methodology must necessarily include the discovery of novel interactions. For instance, Martinez Molina et al. [PMID: 23828940] introduced the cellular thermal shift assay (CETSA) by validating established targets such as MetAP2 with TNP-470 and CDK2 with AZD-5438, without identifying novel protein–ligand pairs. Similarly, Kalxdorf et al. [PMID: 33398190] published their cell-surface thermal proteome profiling (CS-TPP) using Ouabain to stabilize the Na⁺/K⁺-ATPase pump in K562 cells, and SB431542 to stabilize its canonical target JAG1. In fact, when these methods revealed additional stabilizations, these were not validated but instead interpreted through reasoning grounded in the literature. For instance, they attributed the SB431542-induced stabilization of MCT1 to its reported role in cell migration and tumor invasiveness, and explained that SLC1A2 stabilization is related to the disruption of Na⁺/K⁺-ATPase activity by Ouabain. In the same way, our interpretation of ATP-VO₄–mediated stabilization of Mao-B is justified by AlphaFold-3 predictions rather than direct orthogonal assays, which are beyond the scope of our methodological presentation.

      Collectively, the influential studies cited above have set methodological precedents by prioritizing validation and proof-of-concept over merely finding uncharacterized binders. In the same spirit, our work is centred on establishing MM-TPP as a robust platform for probing membrane protein–ligand interactions in a water-soluble format. The discovery of novel binders remains an exciting future direction—one that will build upon the methodological foundation laid by the present study.

      In addition, except for MsbA, no orthogonal methods are used to support the conclusions, and the authors rely entirely on quantifying rather small differences in abundances using either iBAQ or LFQ.

      We deliberately began this study with our model protein, MsbA, examined under both native and overexpressed conditions, to establish agreement between MM-TPP (Figure 2D) and biochemical stability assays (Figure 2A). This validation has provided us with the foundation to confidently extend MM-TPP to the mouse organ proteome. To demonstrate the validity of our workflow, we have used ATP-VO₄ because it has expected targets.

      We note that orthogonal validation often requires overproduction and purification of the candidate proteins, including suitable antibodies, which is a true challenge for membrane proteins. Here, we demonstrate that MM-TPP can detect ligand-induced thermal shifts directly in native membrane preparations, without requiring protein overproduction or purification. We also emphasize several influential studies in TPP, including Martinez Molina et al. (PMID: 23828940) and Fang et al. (PMID: 34188175), which focused primarily on establishing and benchmarking the methodology, rather than on extensive orthogonal validation. In the same spirit, our study prioritizes methodological development, and accordingly, several orthogonal validations are now included in this revision.

      [...] and the authors rely entirely on quantifying rather small differences in abundances using either iBAQ or LFQ.

      To clarify, all analyses on ligand-induced stabilization or destabilization were carried out using LFQ values. The sole exception is in Figure 2B, where we used iBAQ values to depict the relative abundance of proteins within a single sample; this was done to show MsbA's relative level within the E. coli peptidisc library.

      Respectfully, we disagree with the assertion that we are “quantifying rather small differences in abundances using either iBAQ or LFQ.” We were able to clearly distinguish between stabilizations driven by specific ligands binding to their targets versus those caused by non-specific ligands with broader activity. This is further confirmed by comparing 2-MeS-ADP, a selective ligand for P2RY12, with ATP-VO₄, a highly promiscuous ligand, and AMP-PNP, which exhibits intermediate breadth. When tested in triplicate at 51 °C, 2-MeS-ADP significantly altered the thermal stability of 27 proteins, AMP-PNP 44 proteins, and ATP-VO₄ 230 proteins, consistent with the expectation that broader ligands stabilize more proteins nonspecifically. Importantly, 2-MeS-ADP produced markedly stronger stabilization of its intended target, P2RY12 (–log₁₀ p = 9.32), than the top stabilized proteins for ATP–VO₄ (DNAJB3, –log₁₀ p = 5.87) or AMP-PNP (FTH1, –log₁₀ p = 5.34). Moreover, 2-MeS-ADP did not significantly stabilize proteins that were consistently stabilized by the broad ligands, such as SLC27A2, which was strongly stabilized by both ATP-VO₄ and AMP-PNP (–log₁₀ p > 2.5). Together, these findings demonstrate that MM-TPP can robustly distinguish between broad-spectrum and target-specific ligands, with selective ligands inducing stronger and more physiologically meaningful stabilization at their intended targets compared to promiscuous ligands.

      Finally, we emphasize that our findings are not marginal, but meet quantitative and statistical rigor consistent with best practices in proteomics. We apply dual thresholds combining effect size (|log₂FC| ≥ 1, i.e., at least a two-fold change) with statistical significance (FDR-adjusted p ≤ 0.05)—criteria commonly used in proteomics methodology studies (e.g., PMID: 24942700, 38724498). Moreover, the stabilization and destabilization events we report are reproducible across biological replicates (n = 3), consistent across adjacent temperatures for most targets, and technically robust across acquisition modes (DDA vs. DIA). Taken together, these results reflect statistically valid and biologically meaningful effects, fully aligned with standards set by prior published proteomics studies.
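As an illustration only, and not the analysis code used in the study, the dual-threshold criterion described above can be sketched in a few lines. The sketch assumes that the FDR adjustment is the Benjamini-Hochberg procedure (the response does not name the method), and the protein names, fold changes, and p-values are hypothetical placeholders.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used here as a common FDR procedure; the source text
    only says "FDR-adjusted p".
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def dual_threshold_hits(names, log2_fc, pvalues, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Keep proteins with |log2FC| >= fc_cutoff and adjusted p <= fdr_cutoff."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        name
        for name, fc, q in zip(names, log2_fc, adjusted)
        if abs(fc) >= fc_cutoff and q <= fdr_cutoff
    ]


# Hypothetical example values, for illustration only.
example_names = ["protein_A", "protein_B", "protein_C", "protein_D"]
example_log2_fc = [1.8, 0.4, -1.2, 1.1]
example_pvalues = [0.001, 0.002, 0.01, 0.2]
example_hits = dual_threshold_hits(example_names, example_log2_fc, example_pvalues)
print(example_hits)
```

In this toy input, a protein with a small p-value but a sub-two-fold change is excluded, as is a protein with a two-fold change that fails the adjusted significance cutoff; only proteins meeting both criteria are retained.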

      Furthermore, the reported changes in abundances are solely based on iBAQ or LFQ analysis. This must be supported by a more quantitative approach such as SILAC or labeled peptides. In summary, I think this story requires a stronger and broader demonstration of the ability of peptidisc-TPP to identify novel physiologically/pharmacologically relevant interactions.

      With respect to labeling strategies, we deliberately avoided using TMT due to concerns about both cost and potential data quality issues. Some recent studies have documented the drawbacks of TMT in contexts directly relevant to our work. For example, a benchmarking study of LiP-MS workflows showed that although TMT increased proteome depth and reduced technical variance, it was less accurate in identifying true drug–protein interactions and produced weaker dose–response correlations compared with label-free DIA approaches [PMID: 40089063]. More broadly, technical reviews have highlighted that isobaric tagging is intrinsically prone to ratio compression and reporter-ion interference due to co-isolation and co-fragmentation of peptides, which flatten measured fold-changes and obscure biologically meaningful differences [PMID: 22580419, 22036744]. In terms of SILAC, the technique requires metabolic incorporation of heavy amino acids, which is feasible in cultured cells but not in physiologically relevant tissues such as the liver organ used here. SILAC mouse models exist, but they are expensive and time-consuming [PMID: 18662549, 21909926]. We are not a mouse lab, and introducing liver organ SILAC labeling in our workflow is beyond the scope of these revisions. We also note that several hallmark TPP studies have been successfully carried out using label-free quantification [PMID: 25278616, 26379230, 33398190, 23828940], establishing this as an accepted and widely applied approach in the field.

      To further support our conclusions, we added controls showing that detergent solubilization of mouse liver membranes followed by SP4 cleanup fails to detect ATP-VO₄-mediated stabilization of ATP-binding proteins, underscoring the necessity of Peptidisc reconstitution for capturing ligand-induced thermal stabilization. We also present new data demonstrating selective stabilization of the P2Y12 receptor by its agonist 2-MeS-ADP, providing orthogonal, receptor-specific validation within the MM-TPP framework. Finally, an orthogonal DIA acquisition on separate replicates confirmed robust ATP-vanadate stabilization of ATP-binding proteins, including BCS1L and SLC27A2. Together, these additions reinforce that the observed stabilizations are genuine, physiologically relevant ligand–protein interactions and highlight the unique advantage of the Peptidisc-based workflow in capturing such events.

      Cited Reference:

      24784220: Zhang J, Zhang K, Gao ZG, et al. Agonist-bound structure of the human P2Y₁₂ receptor. Nature.  2014;509(7498):119-122. doi:10.1038/nature13288. 

      23828940: Martinez Molina D, Jafari R, Ignatushchenko M, et al. Monitoring drug target engagement in cells and tissues using the cellular thermal shift assay. Science. 2013;341(6141):84-87. doi:10.1126/science.1233606.

      33398190: Kalxdorf M, Günthner I, Becher I, et al. Cell surface thermal proteome profiling tracks perturbations and drug targets on the plasma membrane. Nat Methods. 2021;18(1):84-91. doi:10.1038/s41592-020-01022-1.

      34188175: Fang S, Kirk PDW, Bantscheff M, Lilley KS, Crook OM. A Bayesian semi-parametric model for thermal proteome profiling. Commun Biol. 2021;4(1):810. doi:10.1038/s42003-021-02306-8.

      24942700: Cox J, Hein MY, Luber CA, Paron I, Nagaraj N, Mann M. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol Cell Proteomics. 2014;13(9):2513-2526. doi:10.1074/mcp.M113.031591.

      38724498: Peng H, Wang H, Kong W, Li J, Goh WWB. Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference. Nat Commun. 2024;15(1):3922. doi:10.1038/s41467-024-47899-w.

      40089063: Koudelka T, Bassot C, Piazza I. Benchmarking of quantitative proteomics workflows for limited proteolysis mass spectrometry. Mol Cell Proteomics. 2025;24(4):100945. doi:10.1016/j.mcpro.2025.100945.

      22580419: Christoforou AL, Lilley KS. Isobaric tagging approaches in quantitative proteomics: the ups and downs. Anal Bioanal Chem. 2012;404(4):1029-1037. doi:10.1007/s00216-012-6012-9. 

      22036744: Christoforou AL, Lilley KS. Isobaric tagging approaches in quantitative proteomics: the ups and downs. Anal Bioanal Chem. 2012;404(4):1029-1037. doi:10.1007/s00216-012-6012-9. 

      18662549: Krüger M, Moser M, Ussar S, et al. SILAC mouse for quantitative proteomics uncovers kindlin-3 as an essential factor for red blood cell function. Cell. 2008;134(2):353-364. doi:10.1016/j.cell.2008.05.033.

      21909926: Zanivan S, Krueger M, Mann M. In vivo quantitative proteomics: the SILAC mouse. Methods Mol Biol. 2012;757:435-450. doi:10.1007/978-1-61779-166-6_25. 

      25278616: Kalxdorf M, Becher I, Savitski MM, et al. Temperature-dependent cellular protein stability enables high-precision proteomics profiling. Nat Methods. 2015;12(12):1147-1150. doi:10.1038/nmeth.3651.

      26379230: Savitski MM, Reinhard FBM, Franken H, et al. Tracking cancer drugs in living cells by thermal profiling of the proteome. Science. 2015;346(6205):1255784. doi:10.1126/science.1255784. 

      33452728: Leuenberger P, Ganscha S, Kahraman A, et al. Cell-wide analysis of protein thermal unfolding reveals determinants of thermostability. Science. 2020;355(6327):eaai7825. doi:10.1126/science.aai7825. 

      23066101: Savitski MM, Zinn N, Faelth-Savitski M, et al. Quantitative thermal proteome profiling reveals ligand interactions and thermal stability changes in cells. Nat Methods. 2013;10(12):1094-1096. doi:10.1038/nmeth.2766.  

      30858367: Piazza I, Kochanowski K, Cappelletti V, et al. A machine learning-based chemoproteomic approach to identify drug targets and binding sites in complex proteomes. Nat Commun. 2019;10(1):1216. doi:10.1038/s41467-019-09199-0.

      Reviewer #2 (Public Review):

      Summary:

      The membrane mimetic thermal proteome profiling (MM-TPP) presented by Jandu et al. seems to be a useful way to minimize the interference of detergents in efficient mass spectrometry analysis of membrane proteins. Thermal proteome profiling is a mass spectrometric method that measures binding of a drug to different proteins in a cell lysate by monitoring thermal stabilization of the proteins because of the interaction with the ligands that are being studied. This method has been underexplored for membrane proteome because of the inefficient mass spectrometric detection of membrane proteins and because of the interference from detergents that are used often for membrane protein solubilization.

      Strengths:

      In this report the binding of ligands to membrane protein targets has been monitored in crude membrane lysates or tissue homogenates exalting the efficacy of the method to detect both intended and off-target binding events in a complex physiologically relevant sample setting.

      The manuscript is lucidly written and the data presented seems clear. The only insignificant grammatical error I found was that the 'P' in the word peptidisc is not capitalized in the beginning of the methods section "MM-TPP profiling on membrane proteomes". The clear writing made it easy to understand and evaluate what has been presented. Kudos to the authors.

      Weaknesses:

      While this is a solid report and a promising tool for analyzing membrane protein drug interactions, addressing some of the minor caveats listed below could make it much more impactful.

      The authors claim that MM-TPP is done by "completely circumventing structural perturbations invoked by detergents[1] ". This may not be entirely accurate, because before reconstitution of the membrane proteins in peptidisc, the membrane fractions are solubilized by 1% DDM. The solubilization and following centrifugation step lasts at least for 45 min. It is less likely that all the structural perturbations caused by DDM to various membrane proteins and their transient interactions become completely reversed or rescued by peptidisc reconstitution.

      We thank the reviewer for this insightful comment. In response, we have revised the sentence and expanded the discussion to clarify that the Peptidisc provides a complementary approach to detergent-based preparations for studying membrane proteins, preserving native lipid–protein interactions and stabilization effects that may be diminished in detergent.

      To further address the structural perturbations invoked by detergents, and as already detailed in our response to Reviewer 1, we have compared the thermal profile of the Peptidisc library to the mouse liver membranes solubilized with 1% DDM, after incubation with ATP–VO₄ at 51 °C (Figure 4A). The results with the detergent extract revealed random patterns of stabilization and destabilization, with only 6.4% of differentially stabilized proteins being ATP-binding—comparable to the 7.4% observed in the background. In contrast, in the Peptidisc library, 17% of differentially stabilized proteins were ATP-binding, compared to 9.3% in the background. Thus, while Peptidisc reconstitution does not fully avoid initial detergent exposure, these findings underscore the importance of implementing Peptidisc in the TPP workflow when dealing with membrane proteins.

      In the introduction, the authors make statements such as "..it is widely acknowledged that even mild detergents can disrupt protein structures and activities, leading to challenges in accurately identifying drug targets.." and "[peptidisc] libraries are instrumental in capturing and stabilizing IMPs in their functional states while preserving their interactomes and lipid allosteric modulators...'. These need to be rephrased, as it has been shown by countless studies that even with membrane protein suspended in micelles robust ligand binding assays and binding kinetics have been performed leading to physiologically relevant conclusions and identification of protein-protein and protein-ligand interactions.

      We thank the reviewer for this valuable feedback and fully agree with the point raised. In response, we have revised the Introduction and conclusion to moderate the language concerning the limitations of detergent use. We now explicitly acknowledge that numerous studies have successfully used detergent micelles for ligand-binding assays and kinetic analyses, yielding physiologically relevant insights into both protein–protein and protein–ligand interactions [e.g., PMID: 22004748, 26440106, 31776188].

      At the same time, we clarify that the Peptidisc method offers a complementary advantage, particularly in the context of thermal proteome profiling (TPP), which involves mass spectrometry workflows that are incompatible with detergents. In this setting, Peptidiscs facilitate the detection of ligand-binding events that may be more difficult to observe in detergent micelles.

      We have reframed our discussion accordingly to present Peptidiscs not as a replacement for detergent-based methods, but rather as a complementary tool that broadens the available methodological landscape for studying membrane protein interactions.

      If the method involves detergent solubilization, for example using 1% DDM, it is a bit disingenuous to argue that 'interactomes and lipid allosteric modulators' characterized by low-affinity interactions will remain intact or can be rescued upon detergent removal. Authors should discuss this or at least highlight the primary caveat of the peptidisc method of membrane protein reconstitution - which is that it begins with detergent solubilization of the proteome and does not completely circumvent structural perturbations invoked by detergents.

      We would like to clarify that, in our current workflow, ligand incubation occurs after reconstitution into Peptidiscs. As such, the method is designed to circumvent the negative effects of detergent during the critical steps involving low-affinity interactions.

      That said, we fully acknowledge that Peptidisc reconstitution begins with detergent solubilization (e.g., 1% DDM), and we have revised the conclusion to explicitly state this important caveat. As the reviewer correctly points out, this initial step may introduce some structural perturbations or result in the loss of weakly associated lipid modulators.

      However, reconstitution into Peptidiscs rapidly restores a detergent-free environment for membrane proteins, which has been shown in our previous studies [PMID: 38577106, 38232390, 31736482, 31364989] to mitigate these effects. Specifically, we have demonstrated that time-limited DDM exposure, followed by Peptidisc reconstitution, minimizes membrane protein delipidation, enhances thermal stability, retains functionality, and preserves multi-protein assemblies.

      It would also be important to test detergents that are even milder than 1% DDM and ones which are harsher than 1% DDM to show that this method of reconstitution can indeed rescue the perturbations to the structure and interactions of the membrane protein done by detergents during solubilization step. 

      We selected 1% DDM based on our previous work [PMID: 37295717, 39313981, 38232390], where it consistently enabled robust and reproducible solubilization for Peptidisc reconstitution. We agree that comparing milder detergents (e.g., LMNG) and harsher ones (e.g., SDC) would provide valuable insights into how detergent strength influences structural perturbations, and how effectively these can be mitigated by Peptidisc reconstitution. Preliminary data (not shown) from mouse liver membranes indicate broadly similar proteomic profiles following solubilization with DDM, LMNG, and SDC, although potential differences in functional activity or ligand binding remain to be investigated.

      Based on the methods provided, it appears that the final amount of detergent in peptidisc membrane protein library was 0.008%, which is ~150 uM. The CMC of DDM depending on the amount of NaCl could be between 120-170 uM.

      While we cannot entirely rule out the presence of residual DDM (0.008%) in the raw library, its free concentration may be lower than initially estimated. This is related to the formation of mixed micelles with the amphipathic peptide scaffold, which is supplied in excess during reconstitution. These mixed micelles are subsequently removed during the ultrafiltration step. Furthermore, in related work using His-tagged Peptidiscs [PMID: 32364744], we purified the library by nickel-affinity chromatography following a 5× dilution into a detergent-free buffer. Although this purification step reduced the number of soluble proteins, the same membrane proteins were retained, suggesting that any residual detergent does not significantly interfere with Peptidisc reconstitution. Supporting this, our MM-TPP assays on purified libraries (data not shown) consistently demonstrated stabilization of ATP-binding proteins (e.g., SLC27A2, DNAJB3), indicating that the observed ligand–protein interactions result from successful incorporation into Peptidiscs.
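For readers who wish to check the concentration estimate discussed in this exchange, the conversion from percent to micromolar can be sketched as below. This is a worked illustration under two assumptions that are not stated in the text: that 0.008% is a weight/volume percentage, and that the molar mass of DDM (n-dodecyl-β-D-maltoside, C24H46O11) is taken as approximately 510.6 g/mol. The function name is hypothetical.

```python
# Assumption: molar mass of n-dodecyl-beta-D-maltoside (C24H46O11), g/mol.
DDM_MOLAR_MASS_G_PER_MOL = 510.6


def percent_wv_to_micromolar(percent_wv: float, molar_mass_g_per_mol: float) -> float:
    """Convert a weight/volume percentage into a micromolar concentration."""
    # 1% (w/v) corresponds to 1 g per 100 mL, i.e. 10 g per litre.
    grams_per_litre = percent_wv * 10.0
    moles_per_litre = grams_per_litre / molar_mass_g_per_mol
    return moles_per_litre * 1e6


# Residual detergent level quoted by the reviewer (assumed to be w/v).
residual_ddm_um = percent_wv_to_micromolar(0.008, DDM_MOLAR_MASS_G_PER_MOL)

# CMC range quoted by the reviewer, in micromolar.
CMC_LOW_UM, CMC_HIGH_UM = 120.0, 170.0
within_cmc_range = CMC_LOW_UM <= residual_ddm_um <= CMC_HIGH_UM

print(f"Residual DDM: about {residual_ddm_um:.0f} uM; within quoted CMC range: {within_cmc_range}")
```

Under these assumptions the total residual DDM comes to roughly 157 µM, in line with the reviewer's figure of ~150 µM. The calculation gives the total concentration only; it does not account for detergent sequestered in mixed micelles with the peptide scaffold, which is the point made in the response above.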

      Perhaps, to completely circumvent the perturbations from detergents other methods of detergent-free solubilization such as using SMA polymers and SMALP reconstitution could be explored for a comparison. Moreover, a comparison of the peptidisc reconstitution with detergent-free extraction strategies, such as SMA copolymers, could lend more strength to the presented method.

      We agree that detergent-free methods such as SMA polymers hold promise for membrane protein solubilization. However, in preliminary single-replicate experiments using SMA2000 at 51 °C in the presence of ATP–VO₄ (data not shown), we observed broad, non-specific stabilization effects. Of the 2,287 quantified proteins, 9.3% were annotated as ATP-binding, yet 9.9% of the 101 proteins showing a log₂ fold change >1 or <–1 were ATP-binding, indicating no meaningful enrichment. Given this lack of specificity and the limited dataset, we chose not to pursue further SMA experiments and have not included them here. However, in a recent study (https://doi.org/10.1101/2025.08.25.672181), we directly compared Peptidisc, SMA, and nanodiscs for liver membrane proteome profiling. In that work, Peptidisc outperformed both SMA and nanodiscs in detecting membrane protein dysregulation between healthy and diseased liver. By extension, we expect Peptidisc to offer superior sensitivity and specificity for detecting ligand-induced stabilization events, such as those observed here with ATP–vanadate.
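The "no meaningful enrichment" statement for the SMA experiment can be illustrated with a simple over-representation calculation. The sketch below is not the authors' analysis: the protein counts are back-calculated from the rounded percentages quoted above (an assumption, so they may differ slightly from the true counts), and the hypergeometric test is one common choice for this kind of comparison rather than a method named in the response.

```python
from math import comb


def fold_enrichment(hits_annotated, hits_total, background_annotated, background_total):
    """Ratio of the annotated fraction among hits to that in the background."""
    return (hits_annotated / hits_total) / (background_annotated / background_total)


def hypergeom_upper_tail(population, annotated, drawn, observed):
    """P(X >= observed) when `drawn` proteins are sampled without replacement
    from `population` proteins, of which `annotated` carry the annotation."""
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(observed, upper + 1)
    )
    return favourable / comb(population, drawn)


# Counts reconstructed from the rounded percentages in the text (assumption).
QUANTIFIED = 2287                          # proteins quantified with SMA2000
ATP_BINDING = round(0.093 * QUANTIFIED)    # about 9.3% annotated as ATP-binding
HITS = 101                                 # proteins with |log2 fold change| > 1
ATP_BINDING_HITS = round(0.099 * HITS)     # about 9.9% of the hits

sma_fold = fold_enrichment(ATP_BINDING_HITS, HITS, ATP_BINDING, QUANTIFIED)
sma_pvalue = hypergeom_upper_tail(QUANTIFIED, ATP_BINDING, HITS, ATP_BINDING_HITS)

print(f"fold enrichment = {sma_fold:.2f}, one-sided p = {sma_pvalue:.2f}")
```

With these reconstructed counts the fold enrichment is close to 1, which is what one would expect if the ATP-binding proteins among the hits were drawn at random from the quantified proteome. The same calculation could be applied to the detergent and Peptidisc percentages reported earlier, provided the underlying counts are available.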

      Cross-verification of the identified interactions, and subsequent stabilization or destabilizations, should be demonstrated by other in vitro methods of thermal stability and ligand binding analysis using purified protein to support the efficacy of the MM-TPP method. An example cross-verification using SDS-PAGE, of the well-studied MsbA, is shown in Figure 2. In a similar fashion, other discussed targets such as, BCS1L, P2RX4, DgkA, Mao-B, and some un-annotated IMPs shown in supplementary figure 3 that display substantial stabilization or destabilization should be cross-verified.

      We appreciate this suggestion and note that a similar point was raised in R1’s comment “In addition, except for MsbA, no orthogonal methods are used to support the conclusions, and the authors rely entirely on quantifying rather small differences in abundances using either iBAQ or LFQ.” We have developed a detailed response to R1 on this matter, which equally applies here. 

      Cited Reference:

      35616533: Young JW, Wason IS, Zhao Z, et al. Development of a Method Combining Peptidiscs and Proteomics to Identify, Stabilize, and Purify a Detergent-Sensitive Membrane Protein Assembly. J Proteome Res. 2022;21(7):1748-1758. doi:10.1021/acs.jproteome.2c00129. PMID: 35616533.

      31364989: Carlson ML, Stacey RG, Young JW, et al. Profiling the Escherichia coli membrane protein interactome captured in Peptidisc libraries. Elife. 2019;8:e46615. doi:10.7554/eLife.46615. 

      22004748: O'Malley MA, Helgeson ME, Wagner NJ, Robinson AS. Toward rational design of protein detergent complexes: determinants of mixed micelles that are critical for the in vitro stabilization of a G-protein coupled receptor. Biophys J. 2011;101(8):1938-1948. doi:10.1016/j.bpj.2011.09.018.

      26440106: Allison TM, Reading E, Liko I, Baldwin AJ, Laganowsky A, Robinson CV. Quantifying the stabilizing effects of protein-ligand interactions in the gas phase. Nat Commun. 2015;6:8551. doi:10.1038/ncomms9551.

      31776188: Beckner RL, Zoubak L, Hines KG, Gawrisch K, Yeliseev AA. Probing thermostability of detergent-solubilized CB2 receptor by parallel G protein-activation and ligand-binding assays. J Biol Chem. 2020;295(1):181-190. doi:10.1074/jbc.RA119.010696.

      38577106: Jandu RS, Yu H, Zhao Z, Le HT, Kim S, Huan T, Duong van Hoa F. Capture of endogenous lipids in peptidiscs and effect on protein stability and activity. iScience. 2024;27(4):109382. doi:10.1016/j.isci.2024.109382.

      38232390: Antony F, Brough Z, Zhao Z, Duong van Hoa F. Capture of the Mouse Organ Membrane Proteome Specificity in Peptidisc Libraries. J Proteome Res. 2024;23(2):857-867. doi:10.1021/acs.jproteome.3c00825.

      31736482: Saville JW, Troman LA, Duong Van Hoa F. PeptiQuick, a one-step incorporation of membrane proteins into biotinylated peptidiscs for streamlined protein binding assays. J Vis Exp. 2019;(153). doi:10.3791/60661. 

      37295717: Zhao Z, Khurana A, Antony F, et al. A Peptidisc-Based Survey of the Plasma Membrane Proteome of a Mammalian Cell. Mol Cell Proteomics. 2023;22(8):100588. doi:10.1016/j.mcpro.2023.100588. 

      39313981: Antony F, Brough Z, Orangi M, Al-Seragi M, Aoki H, Babu M, Duong van Hoa F. Sensitive Profiling of Mouse Liver Membrane Proteome Dysregulation Following a High-Fat and Alcohol Diet Treatment. Proteomics. 2024;24(23-24):e202300599. doi:10.1002/pmic.202300599. 

      32364744: Young JW, Wason IS, Zhao Z, Rattray DG, Foster LJ, Duong Van Hoa F. His-Tagged Peptidiscs Enable Affinity Purification of the Membrane Proteome for Downstream Mass Spectrometry Analysis. J Proteome Res. 2020;19(7):2553-2562. doi:10.1021/acs.jproteome.0c00022.

      32591519: The M, Käll L. Focus on the spectra that matter by clustering of quantification data in shotgun proteomics. Nat Commun. 2020;11(1):3234. doi:10.1038/s41467-020-17037-3. 

      33188197: Kurzawa N, Becher I, Sridharan S, et al. A computational method for detection of ligand-binding proteins from dose range thermal proteome profiles. Nat Commun. 2020;11(1):5783. doi:10.1038/s41467-020-19529-8.

      26524241: Reinhard FBM, Eberhard D, Werner T, et al. Thermal proteome profiling monitors ligand interactions with cellular membrane proteins. Nat Methods. 2015;12(12):1129-1131. doi:10.1038/nmeth.3652. 

      23828940: Martinez Molina D, Jafari R, Ignatushchenko M, et al. Monitoring drug target engagement in cells and tissues using the cellular thermal shift assay. Science. 2013;341(6141):84-87. doi:10.1126/science.1233606. 

      32133759: Mateus A, Kurzawa N, Becher I, et al. Thermal proteome profiling for interrogating protein interactions. Mol Syst Biol. 2020;16(3):e9232. doi:10.15252/msb.20199232. 

      14755328: Dorsam RT, Kunapuli SP. Central role of the P2Y12 receptor in platelet activation. J Clin Invest. 2004;113(3):340-345. doi:10.1172/JCI20986. 

      Reviewer #1 (Recommendations for the authors):

      “The authors use iBAC or LFQ to compare across samples. This inconsistency is puzzling. As far as I know, LFQ should always be used when comparing across samples”

      As mentioned above, we use iBAQ only in Fig. 2B to illustrate within-sample relative abundance; all comparative analyses elsewhere use LFQ. We have updated the Fig. 2B legend to state this explicitly.

      We used iBAQ in Fig. 2B as it provides a notion of protein abundance within a sample, normalizing the summed peptide intensities by the number of theoretically observable peptides. This normalization facilitates comparisons between proteins within the same sample, offering a clearer understanding of their relative molar proportions [PMID: 33452728]. LFQ, by contrast, is optimized for comparing the same protein across different samples. It achieves this by performing delayed normalization to reduce run-to-run variability and by applying maximal peptide ratio extraction, which integrates pairwise peptide intensity ratios across all samples to build a consistent protein-level quantification matrix [PMID: 24942700]. These features make LFQ more robust to missing values and technical variation, thereby enabling accurate detection of relative abundance changes in the same protein under different experimental conditions. This distinction is well supported by the proteomics literature: Smits et al. [PMID: 23066101] used iBAQ specifically to determine the relative abundance of proteins within one sample, whereas LFQ was applied for comparative analyses between conditions.
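To make the iBAQ normalization concrete, a minimal sketch follows. It implements only the definition given above (summed peptide intensities divided by the number of theoretically observable peptides) and a within-sample relative proportion; it does not reproduce the MaxLFQ algorithm, and the protein names, intensities, and peptide counts are hypothetical.

```python
from typing import Dict, List, Tuple


def ibaq(peptide_intensities: List[float], n_theoretical_peptides: int) -> float:
    """Summed peptide intensity divided by the number of theoretically
    observable peptides for the protein."""
    if n_theoretical_peptides <= 0:
        raise ValueError("n_theoretical_peptides must be positive")
    return sum(peptide_intensities) / n_theoretical_peptides


def relative_ibaq(sample: Dict[str, Tuple[List[float], int]]) -> Dict[str, float]:
    """Express each protein's iBAQ value as a fraction of the sample total,
    which approximates relative molar proportions within one sample."""
    values = {name: ibaq(ints, n) for name, (ints, n) in sample.items()}
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


# Hypothetical single-sample input: (observed peptide intensities,
# number of theoretically observable peptides).
example_sample = {
    "protein_X": ([4.0e6, 2.0e6], 20),  # larger protein, more possible peptides
    "protein_Y": ([3.0e6], 5),          # smaller protein, fewer possible peptides
}
example_fractions = relative_ibaq(example_sample)
print(example_fractions)
```

In this toy case protein_X has the higher summed intensity, but protein_Y has the higher iBAQ value once the number of observable peptides is taken into account. This is the sense in which iBAQ supports comparisons between different proteins within a single sample.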

      “[Regarding Figure 2A] Why does the control also contain ATP-vanadate? Also, I am not aware of a commercially available chemical "ATP-VO4". I assume this is a mistake”

      The control condition in Figure 2A was mislabeled, and the figure has been corrected to remove this discrepancy. In our experiments, ATP and orthovanadate (VO₄) were added together, and for simplicity this was annotated as “ATP-VO₄.”

      “[Regarding Figure 2B] What is the fold change in MsbA iBAQ values? It seems that the differences are quite small, and as such require a more quantitative approach than iBAQ (e.g SILAC or some other internal standard). In addition, what information does this panel add relative to 2C”

      The figure has been updated to clarify that the values shown are log₂-transformed iBAQ intensities. Figures 2B and 2C are complementary: Figure 2B shows that in the control sample, MsbA’s peptide abundance decreases with increasing temperature (51, 56, and 61 °C) relative to the remaining bulk proteins. Figure 2C shows the specific thermal profiles of MsbA in control and ATP–vanadate conditions. To make this clearer, we have added a sentence to the Results section explaining the specific role of Figure 2B.

      Together, these panels indicate that the method can identify ligand-induced stabilization even for proteins whose abundance decreases faster than the bulk during the TPP assay. We have provided the rationale for not using SILAC or TMT labeling in our public response.

      “[Regarding Figure 2C] Although not mentioned in the legend, I assume this is iBAQ quantification, which as mentioned above isn't accurate enough for such small differences. In addition, I find this data confusing: why is MsbA more stable at the lower temperatures in the absence of ATP-vanadate? The smoothed-line representation is misleading, certainly given the low number of data points”

      The data presented represent LFQ values for MsbA, and we have updated the figure legend to clearly indicate this. Additionally, as suggested, we have removed the smoothing line to more accurately reflect the data. Regarding the reviewer’s concern about stability at lower temperatures, we note that MsbA exhibits comparable abundance at 38 °C and 46 °C under both conditions, with overlapping error bars. We therefore interpret these data as indicating no significant difference in stability at the lower temperatures, with ligand-dependent stabilization becoming apparent only at elevated temperatures. We do not exclude the possibility that MsbA stability at these temperatures is affected by the conformational dynamics of this ABC transporter upon ATP binding and hydrolysis.

      “[Regarding Figure 3A] is this raw LFQ data? Why did the authors suddenly change from iBAQ to LFQ? I find this inconsistency puzzling”

      To clarify, all analyses of protein stabilization or destabilization presented in the manuscript are based on LFQ values. The only instance where iBAQ was used is Figure 2B, where it served to illustrate the relative peptide abundance of MsbA within the same sample. We have revised the figure legends and text to make this distinction explicit and ensure consistency in presentation.

      “[Regarding Figure 3B] The non-specific ATP-dependent stabilization increases the likelihood of false positive hits. This limitation is not mentioned by the authors. I think it is important to show other small molecules, in addition to ATP. The authors suggest that their approach is highly relevant for drug screening. Therefore, a good choice is to test an effect of a known stabilizing drug (eg VX-809 and CFTR)”

      We thank the reviewer for this suggestion. As noted in the manuscript (results and discussion sections), ATP is a natural hydrotrope and is therefore expected to induce broad, non-specific stabilization effects, a phenomenon also observed in previous proteome-wide studies, which demonstrated ATP’s widespread influence on cytosolic protein solubility and thermal stability (PMID: 30858367). To demonstrate that MM-TPP can resolve specific ligand–protein interactions beyond these global ATP effects, we tested 2-methylthio-ADP (2-MeS-ADP), a selective agonist of P2RY12 (PMID: 14755328). In these experiments, we observed robust and reproducible stabilization of P2RY12 at both 51 °C and 57 °C, with no consistent stabilization of unrelated proteins across temperatures. This provides direct evidence that our workflow can distinguish specific from non-specific ligand-induced effects. We selected 2-MeS-ADP because of its structural stability and its higher receptor affinity relative to ADP, allowing us to extend our existing workflow while testing a receptor-specific interaction. We agree that extending this approach to clinically relevant small-molecule drugs, such as VX-809 with CFTR, would further underscore the pharmacological potential of MM-TPP, and we have now noted this as an important avenue for future studies.

      “X axis of Figure 3B: Log 2 fold difference of what? iBAQ? LFQ? Similar ambiguity regarding the Y axis of 3E. What peptide? And why the constant changes in estimating abundances?”

      We thank the reviewer for pointing out these inaccuracies in the figure annotations. As mentioned above, all analyses (except Figure 2B) are based on LFQ values. We have revised the figure legends and text to make this clear.

      In Figure 3E, “peptide intensity” refers to log₂ LFQ peptide intensities derived from the BCS1L protein, as indicated in the figure caption.

      “The authors suggest that P2RY6 and P2RY12 are stabilized by ADP, the hydrolysis product of ATP. Currently, the support for this suggestion is highly indirect. To support this claim, the authors need to directly show the effect of ADP. In reference to the alpha fold results shown in Figure 4D, the authors state that "Collectively, these data highlight the ability of MM-TPP to detect the side effects of parent compounds, an important consideration for drug development". To support this claim, it is necessary to show that Mao-B is indeed best stabilized with ADP or AMP, rather than ATP.”

      In this revision, we chose not to test ADP directly, as it is a broadly binding, relatively weak ligand that would likely stabilize many proteins without revealing clear target-specific effects. Since we had already evaluated ATP-VO₄, a similarly broad, non-specific ligand, additional testing with ADP would provide limited additional insight. Instead, we prioritized 2-methylthio-ADP, a selective agonist of P2RY12, to more effectively demonstrate the specificity of MM-TPP. With this ligand, we observed clear and reproducible stabilization of P2RY12, underscoring the ability of MM-TPP to resolve receptor–ligand interactions beyond ATP’s broad hydrotropic effects. Importantly, and as expected, we did not observe stabilization of the related purinergic receptor P2RY6, further supporting the specificity of the observed effect.

      We have also revised the AlphaFold-related statement in Figure 4D to adopt a more cautious tone: “Collectively, these data suggest that MM-TPP may detect potential side effects of parent compounds, an important consideration for drug development.” In this context, we use AlphaFold not as a validation tool, but rather as a structural aid to help rationalize why certain off-target proteins (e.g., ATP with Mao-B) exhibit stabilization.

      Reviewer #2 (Recommendations for the authors):

      “In the main text, it will be useful to include the unique peptides table of at least the targets discussed in the manuscript. For example, in presence of AMP-PNP at 51 °C P2RY6 shows 4-6 peptides in all n=3 positive & negative ionization modes. But, for P2RY12 only 1-3 peptides were observed. Depending on the sequence length and the relative abundance in the cell of a protein of interest, the number of peptides observed could vary a lot per protein. Given the unique peptide abundance reported in the supplementary file, for various proteins in different conditions, it appears the threshold of observation of two unique peptides for a protein to be analyzed seems less stringent.”

      By applying a filter requiring at least two unique peptides in at least one replicate, we exclude, on average, 15–20% of the total identified proteins. We consider this a reasonable level of stringency that balances confidence in protein identification with the retention of relevant data. This threshold was selected because it aligns with established LC-MS/MS data analysis practices (PMID: 32591519, 33188197, 26524241), and we have included these references in the Methods section to justify our approach. We have included in this revision a Supplemental Table 2 showing the unique peptide counts for proteins highlighted in this study.  
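      The filtering rule described above (keep a protein if it has at least two unique peptides in at least one replicate) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the table layout, and the peptide counts are hypothetical and are not taken from the authors' pipeline or data.

```python
from typing import Dict, List


def passes_unique_peptide_filter(counts: List[int], min_unique: int = 2) -> bool:
    """True if any replicate reaches the minimum number of unique peptides."""
    return any(c >= min_unique for c in counts)


def filter_proteins(table: Dict[str, List[int]], min_unique: int = 2) -> Dict[str, List[int]]:
    """Keep only proteins that pass the filter in at least one replicate."""
    return {
        protein: counts
        for protein, counts in table.items()
        if passes_unique_peptide_filter(counts, min_unique)
    }


# Hypothetical unique-peptide counts for three replicates (not real data).
example = {
    "PROT_A": [4, 5, 6],  # well covered in every replicate
    "PROT_B": [1, 3, 2],  # passes because at least one replicate has >= 2
    "PROT_C": [1, 1, 0],  # never reaches 2 unique peptides, so it is dropped
}
kept = filter_proteins(example)
excluded_fraction = 1 - len(kept) / len(example)
```

      Because the rule only needs one qualifying replicate, a low-abundance protein with sparse coverage (like the hypothetical `PROT_B`) is retained, which is the trade-off between stringency and data retention discussed in the response.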

      “It appears that the time of heat treatment for peptidisc library subjected to MM-TPP profiling was chosen as 3 min based on the results presented in Supplementary Figure 1A, especially the loss of MsbA observed in 1% DDM after 3 min heat perturbation. However, when reconstituted in peptidisc there seems to be no loss in MsbA even after 12 mins at 45oC. So, perhaps a longer heat treatment would be a more efficient perturbation.”

      Previous studies indicate that heat exposure of 3–5 minutes is optimal for visualizing protein denaturation (PMID: 23828940, 32133759). We have added a statement to the Results section to justify our choice of heat exposure. Although MsbA remains stable at 45 °C for extended periods, higher temperatures allow for more effective perturbation to reveal destabilization. Supplementary Figure 1A specifically illustrates MsbA instability in detergent environments.

      “Some of the stabilized temperatures listed in Table 1 are a bit confusing. For example, ABCC3 and ABCG2. In the case of ABCC3 stabilization was observed at 51oC and 60oC, but 56oC is not mentioned. In the same way, 51oC is not mentioned for ABCG2. You would expect protein to be stabilized at 56oC if it is stabilized at both 51oC and 60oC. So, it is unclear if the stabilizations were not monitored for these proteins at the missing temperatures in the table or if no peptides could be recorded at these temperatures as in the case of P2RX4 at 60oC in Figure 4C.”

      Both scenarios are represented in our data. For some proteins, like ABCG2, sufficient peptide coverage was achieved, but no stabilization was observed at intermediate temperatures (e.g., 56 °C), likely because the perturbation was not strong enough to reveal an effect. In other cases, such as ABCC3 at 56 °C or P2RX4 at 60 °C, the proteins were not detected due to insufficient peptide identifications at those temperatures, which explains their omission from the table. 

      “In Figure 4C, it is perplexing to note that despite n = 3 there were no peptide fragments detected for P2RX4 at 60oC in presence of ATP-VO4, but they were detected in presence of AMP-PNP. It will be useful to learn authors explanation for this, especially because both of these ligands destabilize P2RX4. In Figure 4B, it would have been great to see the effect of ADP too, to corroborate the theory that ATP metabolites could impact the thermal stability.”

      In Figure 4C, the absence of P2RX4 peptide detection at 60 °C with ATP–VO₄ mirrors variability observed in the corresponding control (n = 6). Specifically, neither the control nor ATP–VO₄ produced unique peptides for P2RX4 at 60 °C in that replicate, whereas peptides were detected at 60 °C in other replicates for both the control and AMP-PNP, and at 64 °C for ATP–VO₄, the controls, and AMP-PNP. Such missing values are a natural feature of MS-based proteomics and can arise from multiple technical factors, including inconsistent heating, incomplete digestion, stochastic MS injection, or interference from Peptidisc peptides. We therefore interpret the absence of peptides in this replicate as a technical artifact rather than evidence against protein destabilization. Importantly, the overall dataset consistently shows that both ATP–VO₄ and AMP-PNP destabilize P2RX4, supporting their characterization as broad, non-specific ligands with off-target effects.

      Because ATP and ADP belong to the same class of broadly binding, non-specific ligands, additional testing with ADP would not provide meaningful mechanistic insight. Instead, we chose to test 2-methylthio-ADP, a selective P2RY12 agonist. This experiment revealed robust, reproducible stabilization of P2RY12, without consistent effects on unrelated proteins at 51 °C and 57 °C, thereby demonstrating the ability of MM-TPP to detect specific receptor–ligand interactions.

      Finally, we note that P2RX4 is not a primary target of ATP–VO<sub>4</sub> or AMP-PNP. Consequently, the observed destabilization of P2RX4 is expected to be less pronounced than the strong, physiologically consistent stabilization of ABC transporters by ATP–VO<sub>4</sub>, as shown in Figure 3D, where the majority of ABC transporters are thermally stabilized across all tested temperatures.

      “As per Figure 4, P2Y receptors P2RY6 and P2RY12 both showed great thermal stability in presence of ATP-VO4 despite their preference for ADP. The authors argue this could be because of ATP metabolism, and binding of the resultant ADP to the P2RY6. If P2RX4 prefers ATP and not the metabolized product ADP that apparently is available, ideally you should not see a change in stability. A stark destabilization would indicate interaction of some sorts. P2X receptors are activated by ATP and are not naturally activated by AMP-PNP. So, destabilization of P2RX4 upon binding to ATP that can activate P2X receptors is conceivable. However, destabilization both in presence of ATP-VO4 and AMP-PNP is unclear. It is perhaps useful to test effect of ADP using this method, and maybe even compare some antagonists such as TNPATP.”

      In this study, we did not directly test ADP, as we had already demonstrated that MM-TPP detects stabilization by broad-binding ligands such as ATP–VO₄. Instead, we focused on a more selective ligand, 2-MeS-ADP, a specific agonist of P2RY12 [PMID: 14755328]. Here, we observed robust and reproducible stabilization of P2RY12 at 51 °C and 57 °C, while P2RY6 showed no significant changes, and no other proteins were consistently stabilized (Figure 4B, S4). This confirms that MM-TPP can distinguish specific ligand–receptor interactions from broader ATP-induced effects. To further explore the assay’s nuance and sensitivity, testing additional nucleotide ligands—including antagonists like TNP-ATP or ATPγS—would provide valuable insights, and we have identified this as an important future direction.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The paper presents a model for sequence generation in the zebra finch HVC, which adheres to cellular properties measured experimentally. However, the model is fine-tuned and exhibits limited robustness to noise inherent in the inhibitory interneurons within the HVC, as well as to fluctuations in connectivity between neurons. Although the proposed microcircuits are introduced as units for sub-syllabic segments (SSS), the backbone of the network remains a feedforward chain of HVC_RA neurons, similar to previous models.

      Strengths:

      The model incorporates all three of the major types of HVC neurons. The ion channels used and their kinetics are based on experimental measurements. The connection patterns of the neurons are also constrained by the experiments.

      Weaknesses:

      The model is described as consisting of micro-circuits corresponding to SSS. This presentation gives the impression that the model's structure is distinct from previous models, which connected HVC_RA neurons in feedforward chain networks (Jin et al 2007, Li & Greenside, 2006; Long et al 2010; Egger et al 2020). However, the authors implement single HVC_RA neurons into chain networks within each micro-circuit and then connect the end of the chain to the start of the chain in the subsequent micro-circuit. Thus, the HVC_RA neuron in their model forms a single-neuron chain. This structure is essentially a simplified version of earlier models.

      In the model of the paper, the chain network drives the HVC_I and HVC_X neurons. The role of the micro-circuits is more significant in organizing the connections: specifically, from HVC_RA neurons to HVC_I neurons, and from HVC_I neurons to both HVC_X and HVC_RA neurons.

      We thank Reviewer 1 for their thoughtful comments.

      While the reviewer is correct that the propagation of sequential activity in this model is primarily carried by HVC<sub>RA</sub> neurons in a feed-forward manner, we need to emphasize that this is true only if there is no intrinsic or synaptic perturbation to the HVC network. For example, we showed in Figures 10 and 12 how altering the intrinsic properties of HVC<sub>X</sub> neurons or of interneurons disrupts sequence propagation. In other words, while HVC<sub>RA</sub> neurons are the key forces that carry the chain forward, the interplay between excitation and inhibition in our network, as well as the intrinsic parameters of all classes of HVC neurons, are equally important forces in carrying the chain of activity forward. Thus, the stability of activity propagation necessary for song production depends on a finely balanced network of HVC neurons, with all classes contributing to the overall dynamics. Moreover, all existing models that describe premotor sequence generation in the HVC either assume a distributed model (Elmaleh et al., 2021), in which local HVC circuitry is not sufficient to advance the sequence but rather depends upon moment-to-moment feedback through Uva (Hamaguchi et al., 2016), or assume models that rely on intrinsic connections within HVC to propagate sequential activity. In the latter case, some models assume that HVC is composed of multiple discrete subnetworks that encode individual song elements (Glaze & Troyer, 2013; Long & Fee, 2008; Wang et al., 2008) but lack the local connectivity to link the subnetworks, while other models assume that HVC may have sufficient information in its intrinsic connections to form a single continuous network sequence (Long et al. 2010). The HVC model we present extends the concept of a feedforward network by incorporating additional neuronal classes that influence the propagation of activity (interneurons and HVC<sub>X</sub> neurons). We have shown that any disturbance of the intrinsic or synaptic conductances of these latter neurons will disrupt activity in the circuit even when the properties of HVC<sub>RA</sub> neurons are maintained.

      In regard to the similarities between our model and earlier models, several aspects of our model distinguish it from prior work. In short, while several models of how sequence is generated within HVC have been proposed (Cannon et al., 2015; Drew & Abbott, 2003; Egger et al., 2020; Elmaleh et al., 2021; Galvis et al., 2018; Gibb et al., 2009a, 2009b; Hamaguchi et al., 2016; Jin, 2009; Long & Fee, 2008; Markowitz et al., 2015), all the models proposed either rely on intrinsic HVC circuitry to propagate sequential activity, rely on extrinsic feedback to advance the sequence, or rely on both. These models do not capture the complex details of spike morphology, do not include the right ionic currents, do not incorporate all classes of HVC neurons, or do not generate realistic firing patterns as seen in vivo. Our model is the first biophysically realistic model that incorporates all classes of HVC neurons and their intrinsic properties. We tuned the intrinsic and the synaptic properties based on the traces collected by Daou et al. (2013) and Mooney and Prather (2005), as shown in Figure 3. The three classes of model neurons incorporated into our network, as well as the synaptic currents that connect them, are based on Hodgkin-Huxley formalisms that contain ion channels and synaptic currents which had been pharmacologically identified. This is an advancement over prior models that primarily focused on the role of synaptic interactions or external inputs. The model is based on a feedforward chain of microcircuits that encode the different sub-syllabic segments and that interact with each other through structured feedback inhibition, defining an ordered sequence of cell firing. Moreover, while several models highlight the critical role of inhibitory interneurons in shaping the timing and propagation of bursts of activity in HVC<sub>RA</sub> neurons, our work offers an intricate and comprehensive model that helps to explain this critical role played by inhibition in shaping song dynamics and ensuring sequence propagation.

      How useful is this concept of micro-circuits? HVC neurons fire continuously even during the silent gaps. There are no SSS during these silent gaps.

      Regarding the concern about the usefulness of the 'microcircuit' concept in our study, we appreciate the comment and we are glad to clarify its relevance in our network. While we acknowledge that HVC<sub>RA</sub> neurons interconnect microcircuits, our model's dynamics are still best described within the framework of microcircuitry particularly due to the firing behavior of HVC<sub>X</sub> neurons and interneurons. Here, we are referring to microcircuits in a more functional sense, rather than rigid, isolated spatial divisions (Cannon et al. 2015), and we now make this clear on page 21. A microcircuit in our model reflects the local rules that govern the interaction between all HVC neuron classes within the broader network, and that are essential for proper activity propagation. For example, HVC<sub>INT</sub> neurons belonging to any microcircuit burst densely and at times other than the moments when the corresponding encoded SSS is being “sung”. What makes a particular interneuron belong to this microcircuit or the other is merely the fact that it cannot inhibit HVC<sub>RA</sub> neurons that are housed in the microcircuit it belongs to. In particular, if HVC<sub>INT</sub> inhibits HVC<sub>RA</sub> in the same microcircuit, some of the HVC<sub>RA</sub> bursts in the microcircuit might be silenced by the dense and strong HVC<sub>INT</sub> inhibition breaking the chain of activity again. Similarly, HVC<sub>X</sub> neurons were selected to be housed within microcircuits due to the following reason: if an HVC<sub>X</sub> neuron belonging to microcircuit i sends excitatory input to an HVC<sub>INT</sub> neuron in microcircuit j, and that interneuron happens to select an HVC<sub>RA</sub> neuron from microcircuit i, then the propagation of sequential activity will halt, and we’ll be in a scenario similar to what was described earlier for HVC<sub>INT</sub> neurons inhibiting HVC<sub>RA</sub> neurons in the same microcircuit.
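      The first wiring rule described above (an interneuron may not inhibit HVC<sub>RA</sub> neurons housed in its own microcircuit) can be illustrated with a minimal sketch. The data layout, function names, and the number of microcircuits are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
from typing import List, Tuple


def inhibition_allowed(int_microcircuit: int, ra_microcircuit: int) -> bool:
    """An interneuron may inhibit HVC_RA neurons only outside its own microcircuit."""
    return int_microcircuit != ra_microcircuit


def allowed_int_to_ra_pairs(n_microcircuits: int) -> List[Tuple[int, int]]:
    """List (interneuron microcircuit, HVC_RA microcircuit) pairs permitted by the rule."""
    return [
        (i, j)
        for i in range(n_microcircuits)
        for j in range(n_microcircuits)
        if inhibition_allowed(i, j)
    ]


# Hypothetical network with four microcircuits (one per sub-syllabic segment).
pairs = allowed_int_to_ra_pairs(4)
```

      With four microcircuits, the rule removes the four same-microcircuit pairs and leaves twelve candidate interneuron-to-HVC<sub>RA</sub> targets, from which an actual connectivity pattern could then be drawn.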

      We agree that there are no sub-syllabic segments described during the silent gaps, and we thank the reviewer for pointing this out. Although silent gaps are integral to the overall process of song production, we have not elaborated on them in this model due to the lack of a clear, biophysically grounded representation for the gaps themselves at the level of HVC. Our primary focus has been on modeling the active, syllable-producing phases of the song, where the HVC network’s sequential dynamics are critical for song. However, one can think of silent gaps as being encoded via mechanisms similar to those that encode SSSs, where each gap is encoded by similar microcircuits composed of the three classes of HVC neurons (let’s call them GAP rather than SSS) that are active only during the silent gaps. In this case, the propagation of sequential activity is carried throughout the GAPs from the last SSS of the previous syllable to the first SSS of the subsequent syllable. This is now described more clearly on page 22 of the manuscript.

      A significant issue of the current model is that the HVC_RA to HVC_RA connections require fine-tuning, with the network functioning only within a narrow range of g_AMPA (Figure 2B). Similarly, the connections from HVC_I neurons to HVC_RA neurons also require fine-tuning. This sensitivity arises because the somatic properties of HVC_RA neurons are insufficient to produce the stereotypical bursts of spikes observed in recordings from singing birds, as demonstrated in previous studies (Jin et al 2007; Long et al 2010). In these previous works, to address this limitation, a dendritic spike mechanism was introduced to generate an intrinsic bursting capability, which is absent in the somatic compartment of HVC_RA neurons. This dendritic mechanism significantly enhances the robustness of the chain network, eliminating the need to fine-tune any synaptic conductances, including those from HVC_I neurons (Long et al 2010). Why is it important that the model should NOT be sensitive to the connection strengths?

      We thank the reviewer for the comment. While mathematical models designed for highly complex nonlinear biological processes only approximate biological realism, the current network is the first sufficiently biologically realistic network model of HVC that explains sequence propagation. We do not include dendritic processes in our network, although doing so would make the dynamics more realistic, for several reasons. 1) The ion channels we integrated into the somatic compartment are known pharmacologically (Daou et al. 2013), but we do not know the intrinsic properties of the dendritic compartments of HVC neurons or the cocktail of ion channels expressed there. 2) We are able to generate realistic bursting in HVC<sub>RA</sub> neurons despite the single compartment, and the main emphasis in this network is on the interactions between excitation and inhibition, the effects of ion channels in modulating sequence propagation, and related questions. 3) The network model already incorporates thousands of ODEs that govern the dynamics of each of the HVC neurons, so we did not want to add more complexity to the network, especially since we do not know the biophysical properties of the dendritic compartments.

      Therefore, our present focus is on somatic dynamics and the interaction between HVC<sub>RA</sub> and HVC<sub>INT</sub> neurons, but we acknowledge the importance of these processes in enhancing network resiliency. Although we agree that adding dendritic processes improves robustness, we still think that somatic processes alone can offer insightful information on the sequential dynamics of the HVC network. While the network should be robust across a wide range of parameters, it is also essential that certain parameters are designed to filter out weaker signals, ensuring that only reliable, precise patterns of activity propagate. Hence, we specifically chose to make the HVC<sub>RA</sub>-to-HVC<sub>RA</sub> excitatory connections more sensitive (narrow range of values) such that only strong, precise and meaningful stimuli can propagate through the network representing the high stereotypy and precision seen in song production.

      First, the firing of HVC_I neurons is highly noisy and unreliable. HVC_I neurons fire spontaneous, random spikes under baseline conditions. During singing, their spike timing is imprecise and can vary significantly from trial to trial, with spikes appearing or disappearing across different trials. As a result, their inputs to HVC_RA neurons are inherently noisy. If the model relies on precisely tuned inputs from HVC_I neurons, the natural fluctuations in HVC_I firing would render the model non-functional. The authors should incorporate noisy HVC_I neurons into their model to evaluate whether this noise would render the model non-functional.

      We acknowledge that under baseline and singing conditions, interneurons fire in an extremely noisy and imprecise manner, although they exhibit time-locked episodes in their activity (Hahnloser et al 2002, Kozhevnikov and Fee 2007). In order to mimic the biological variability of these neurons, our model does, in fact, include a stochastic current to reflect the intrinsic noise and random variations in interneuron firing shown in vivo (and we highlight this in the Methods). To make sure the network is resilient to this randomness in interneuron firing, we introduced a stochastic input current of the form I<sub>noise</sub>(t) = σ·ξ(t), where ξ(t) is Gaussian white noise with zero mean and unit variance, and σ is the noise amplitude. This stochastic drive was introduced to every model neuron, and it mimics the fluctuations in synaptic input arising from random presynaptic activity and background noise. For values of σ within 1-5% of the mean synaptic conductance, the stochastic current has no effect on network propagation. For larger values of σ, the desired network activity was disrupted or halted. We now discuss this on page 22 of the manuscript.
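      As a minimal sketch of how a noise term of the form I<sub>noise</sub>(t) = σ·ξ(t) could be sampled in a discrete-time simulation, consider the following. Several points are assumptions rather than details from the manuscript: one independent standard-normal draw is taken per time step with no step-size scaling, σ is treated as a dimensionless fraction of a mean synaptic conductance normalized to 1, and all function and parameter names are hypothetical.

```python
import random
import statistics
from typing import List


def sigma_from_fraction(mean_g_syn: float, fraction: float) -> float:
    """Noise amplitude expressed as a fraction of the mean synaptic conductance."""
    return fraction * mean_g_syn


def noise_current(sigma: float, n_steps: int, seed: int = 0) -> List[float]:
    """Sample I_noise = sigma * xi, with xi ~ N(0, 1) drawn independently per step."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [sigma * rng.gauss(0.0, 1.0) for _ in range(n_steps)]


# Upper end of the 1-5% range quoted in the response, with mean conductance set to 1.
sigma = sigma_from_fraction(mean_g_syn=1.0, fraction=0.05)
samples = noise_current(sigma, n_steps=20000)

sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
```

      In a full simulation, each sample would be added to the total input current of a model neuron at the corresponding time step; sweeping `fraction` upward is one way to probe where propagation starts to fail.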

      Second, Kosche et al. (2015) demonstrated that reducing inhibition by suppressing HVC_I neuron activity makes HVC_RA firing less sparse but does not compromise the temporal precision of the bursts. In this experiment, the local application of gabazine should have severely disrupted HVC_I activity. However, it did not affect the timing precision of HVC_RA neuron firing, emphasizing the robustness of the HVC timing circuit. This robustness is inconsistent with the predictions of the current model, which depends on finely tuned inputs and should, therefore, be vulnerable to such disruptions.

      We thank the reviewer for the comment. The differences between the Kosche et al. (2015) findings and the predictions of our model arise from differences in the aspect of HVC function we are modeling. Our model is more sensitive to inhibition, which is a designed mechanism for achieving precise song patterning. This is a modeling simplification we adopted to capture specific characteristics of HVC function. Hence, the Kosche et al. (2015) findings do not invalidate the approach of our model, but highlight that HVC likely operates with several redundant mechanisms that together ensure temporal precision.

      Third, the reliance on fine-tuning of HVC_RA connections becomes problematic if the model is scaled up to include groups of HVC_RA neurons forming a chain network, rather than the single HVC_RA neurons used in the current work. With groups of HVC_RA neurons, the summation of presynaptic inputs to each HVC_RA neuron would need to be precisely maintained for the model to function. However, experimental evidence shows that the HVC circuit remains functional despite perturbations, such as a few degrees of cooling, micro-lesions, or turnover of HVC_RA neurons. Such robustness cannot be accounted for by a model that depends on finely tuned connections, as seen in the current implementation.

      Our model of individual HVC<sub>RA</sub> neurons is, as stated previously, a reductive model that focuses on understanding the mechanisms that govern sequential neural activity. We agree that scaling the model to include many HVC<sub>RA</sub> neurons poses challenges, specifically concerning the summation of presynaptic inputs. However, our model can still be adapted to a larger network without requiring the level of fine-tuning currently needed. In fact, the current fine-tuning of synaptic connections in the model is a reflection of fundamental network mechanisms rather than a limitation when scaling to a larger network. Besides, one important feature of this neural network is redundancy. Even if some neurons or synaptic connections are impaired, other neurons or pathways can compensate for these changes, allowing the activity propagation to remain intact.

      The authors examined how altering the channel properties of neurons affects the activity in their model. While this approach is valid, many of the observed effects may stem from the delicate balancing required in their model for proper function. In the current model, HVC_X neurons burst as a result of rebound activity driven by the I_H current. Rebound bursts mediated by the I_H current typically require a highly hyperpolarized membrane potential. However, this mechanism would fail if the reversal potential of inhibition is higher than the required level of hyperpolarization. Furthermore, Mooney (2000) demonstrated that depolarizing the membrane potential of HVC_X neurons did not prevent bursts of these neurons during forward playback of the bird's own song, suggesting that these bursts (at least under anesthesia, which may be a different state altogether) are not necessarily caused by rebound activity. This discrepancy should be addressed or considered in the model.

      In our HVC network model, one goal with HVC<sub>X</sub> neurons is to generate bursts in their underlying neuron population. Since HVC<sub>X</sub> neurons in our model receive only inhibitory inputs from interneurons, we rely on inhibition followed by rebound bursts orchestrated by the I<sub>H</sub> and the I<sub>CaT</sub> currents to achieve this goal. The interplay between the T-type Ca<sup>++</sup> current and the H current in our model is fundamental to generate their corresponding bursts, as they are sufficient for producing the desired behavior in the network. Due to this interplay, we do not need significant inhibition to generate rebound bursts, because the T-type Ca<sup>++</sup> current’s conductance can be stronger, leading to robust rebound bursting even when the degree of inhibition is not very strong. This is now highlighted on page 42 in the revised version.

      Some figures contain direct copies of figures from published papers. It is perhaps a better practice to replace them with schematics if possible.

      We deliberately kept the results of Mooney and Prather (2005) as is, in order to compare them with our model simulations and highlight the degree of resemblance. We believe that creating schematics of the Mooney and Prather (2005) results would not have the same impact; similarly, creating a schematic of the Hahnloser et al (2002) results would not help much. However, if the reviewer still believes that we should do that, we’re happy to do it.

      Reviewer #2 (Public review):

      Summary:

      In this paper, the authors use numerical simulations to try to understand better a major experimental discovery in songbird neuroscience from 2002 by Richard Hahnloser and collaborators. The 2002 paper found that a certain class of projection neurons in the premotor nucleus HVC of adult male zebra finch songbirds, the neurons that project to another premotor nucleus RA, fired sparsely (once per song motif) and precisely (to about 1 ms accuracy) during singing.

      The experimental discovery is important to understand since it initially suggested that the sparsely firing RA-projecting neurons acted as a simple clock that was localized to HVC and that controlled all details of the temporal hierarchy of singing: notes, syllables, gaps, and motifs. Later experiments suggested that the initial interpretation might be incomplete: that the temporal structure of adult male zebra finch songs instead emerged in a more complicated and distributed way, still not well understood, from the interaction of HVC with multiple other nuclei, including auditory and brainstem areas. So at least two major questions remain unanswered more than two decades after the 2002 experiment: What is the neurobiological mechanism that produces the sparse precise bursting: is it a local circuit in HVC or is it some combination of external input to HVC and local circuitry? And how is the sparse precise bursting in HVC related to a songbird's vocalizations? The authors only investigate part of the first question, whether the mechanism for sparse precise bursts is local to HVC. They do so indirectly, by using conductance-based Hodgkin-Huxley-like equations to simulate the spiking dynamics of a simplified network that includes three known major classes of HVC neurons and such that all neurons within a class are assumed to be identical. A strength of the calculations is that the authors include known biophysically deduced details of the different conductances of the three major classes of HVC neurons, and they take into account what is known, based on sparse paired recordings in slices, about how the three classes connect to one another. One weakness of the paper is that the authors make arbitrary and not well-motivated assumptions about the network geometry, and they do not use the flexibility of their simulations to study how their results depend on their network assumptions. 
A second weakness is that they ignore many known experimental details such as projections into HVC from other nuclei, dendritic computations (the somas and dendrites are treated by the authors as point-like isopotential objects), the role of neuromodulators, and known heterogeneity of the interneurons. These weaknesses make it difficult for readers to know the relevance of the simulations for experiments and for advancing theoretical understanding.

      Strengths:

      The authors use conductance-based Hodgkin-Huxley-like equations to simulate spiking activity in a network of neurons intended to model more accurately songbird nucleus HVC of adult male zebra finches. Spiking models are much closer to experiments than models based on firing rates or on 2-state neurons.

      The authors include information deduced from modeling experimental current-clamp data such as the types and properties of conductances. They also take into account how neurons in one class connect to neurons in other classes via excitatory or inhibitory synapses, based on sparse paired recordings in slices by other researchers. The authors obtain some new results of modest interest such as how changes in the maximum conductances of four key channels (e.g., A-type K+ currents or Ca-dependent K+ currents) influence the structure and propagation of bursts, while simultaneously being able to mimic accurately current-clamp voltage measurements.

      Weaknesses:

      One weakness of this paper is the lack of a clearly stated, interesting, and relevant scientific question to try to answer. In the introduction, the authors do not discuss adequately which questions recent experimental and theoretical work have failed to explain adequately, concerning HVC neural dynamics and its role in producing vocalizations. The authors do not discuss adequately why they chose the approach of their paper and how their results address some of these questions.

      For example, the authors need to explain in more detail how their calculations relate to the works of Daou et al, J. Neurophys. 2013 (which already fitted spiking models to neuronal data and identified certain conductances), to Jin et al J. Comput. Neurosci. 2007 (which already discussed how to get bursts using some experimental details), and to the rather similar paper by E. Armstrong and H. Abarbanel, J. Neurophys 2016, which already postulated and studied sequences of microcircuits in HVC. This last paper is not even cited by the authors.

      We thank the reviewer for this valuable comment, and we agree that we did not clarify enough throughout the paper the utility of our model or how it advanced our understanding of the HVC dynamics and circuitry. To that end, we revised several places of the manuscript and made sure to cite and highlight the relevance and relatedness of the mentioned papers.

      In short, and as mentioned to Reviewer 1, while several models of how sequence is generated within HVC have been proposed (Cannon et al., 2015; Drew & Abbott, 2003; Egger et al., 2020; Elmaleh et al., 2021; Galvis et al., 2018; Gibb et al., 2009a, 2009b; Hamaguchi et al., 2016; Jin, 2009; Long & Fee, 2008; Markowitz et al., 2015; Jin et al., 2007), all the models proposed either rely on intrinsic HVC circuitry to propagate sequential activity, rely on extrinsic feedback to advance the sequence or rely on both. These models do not capture the complex details of spike morphology, do not include the right ionic currents, do not incorporate all classes of HVC neurons, or do not generate realistic firing patterns as seen in vivo. Our model is the first biophysically realistic model that incorporates all classes of HVC neurons and their intrinsic properties. 

      No existing hypothesis has been challenged by our model; rather, our model is a distillation of the various models that have been proposed for the HVC network. We go over this in detail in the Discussion. We believe that the network model we developed provides a step forward in describing the biophysics of HVC circuitry, and may shed new light on certain dynamics in the mammalian brain, particularly in the motor cortex and hippocampus, where precisely timed sequential activity is crucial. We suggest that temporally precise sequential activity may be a manifestation of neural networks composed of a chain of microcircuits, each containing pools of excitatory and inhibitory neurons, with local interplay among neurons of the same microcircuit and global interplay across the various microcircuits, and with structured inhibition as well as intrinsic properties synchronizing the neuronal pools and stabilizing timing within a firing sequence.

      The authors' main achievement is to show that simulations of a certain simplified and idealized network of spiking neurons, which includes some experimental details but ignores many others, match some experimental results like current-clamp-derived voltage time series for the three classes of HVC neurons (although this was already reported in earlier work by Daou and collaborators in 2013), and simultaneously the robust propagation of bursts with properties similar to those observed in experiments. The authors also present results about how certain neuronal details and burst propagation change when certain key maximum conductances are varied. However, these are weak conclusions for two reasons. First, the authors did not do enough calculations to allow the reader to understand how many parameters were needed to obtain these fits and whether simpler circuits, say with fewer parameters and simpler network topology, could do just as well. Second, many previous researchers have demonstrated robust burst propagation in a variety of feed-forward models. So what is new and important about the authors' results compared to the previous computational papers?

      A major novelty of our work is the incorporation of experimental data with detailed network models. While earlier works have established robust burst propagation, our model uses realistic ion channel kinetics and feedback inhibition not only to reproduce experimental neural activity patterns but also to suggest prospective mechanisms for song sequence production in the most biophysical way possible. This aspect distinguishes our work from other feed-forward models. We go over this in detail in the Discussion. However, the reviewer is right regarding the details of the calculations conducted for the fits; we will make sure to describe these in more detail in the Methods and throughout the manuscript.

      Also missing is a discussion, or at least an acknowledgment, of the fact that not all of the fine experimental details of undershoots, latencies, spike structure, spike accommodation, etc., may be relevant for understanding vocalization. While it is nice to know that some models can match these experimental details and produce realistic bursts, that does not mean that all of these details are relevant for the function of producing precise vocalizations. Scientific insights in biology often require exploring which of the many observed details can be ignored and especially identifying the few that are essential for answering some questions. As one example, if HVC-X neurons are completely removed from the authors' model, does one still get robust and reasonable burst propagation of HVC-RA neurons? While part of the nucleus HVC acts as a premotor circuit that drives the nucleus RA, part of HVC is also related to learning. It is not clear that HVC-X neurons, which carry out some unknown calculation and transmit information to area X in a learning pathway, are relevant for burst production and propagation of HVC-RA neurons, and so relevant for vocalization. Simulations provide a convenient and direct way to explore questions of this kind.

      One key question to answer is whether the bursting of HVC-RA projection neurons is based on a mechanism local to HVC or is some combination of external driving (say from auditory nuclei) and local circuitry. The authors do not contribute to answering this question because they ignore external driving and assume that the mechanism is some kind of intrinsic feed-forward circuit, which they put in by hand in a rather arbitrary and poorly justified way, by assuming the existence of small microcircuits consisting of a few HVC-RA, HVC-X, and HVC-I neurons that somehow correspond to "sub-syllabic segments". To my knowledge, experiments do not suggest the existence of such microcircuits nor does theory suggest the need for such microcircuits. 

      Recent results showed a tight correlation between the intrinsic properties of neurons and features of song (Daou and Margoliash 2020, Medina and Margoliash 2024), where adult birds that exhibit similar songs tend to have similar intrinsic properties. While this is relevant, we acknowledge that not all details may be necessary for every aspect of vocalization, and future models could concentrate on core dynamics and exclude certain features while still providing insights into the primary mechanisms.

      On the question of whether HVC<sub>X</sub> neurons are relevant for burst propagation, given that our model includes these neurons as part of the network for completeness, the reviewer is correct: the propagation of sequential activity in this model is primarily carried by HVC<sub>RA</sub> neurons in a feed-forward manner, but only if there is no perturbation to the HVC network. For example, we have shown how altering the intrinsic properties of HVC<sub>X</sub> neurons or of interneurons disrupts sequence propagation. In other words, while HVC<sub>RA</sub> neurons are the key forces carrying the chain forward, the interplay between excitation and inhibition in our network as well as the intrinsic parameters for all classes of HVC neurons are equally important forces in carrying the chain of activity forward. Thus, the stability of activity propagation necessary for song production depends on a finely balanced network of HVC neurons, with all classes contributing to the overall dynamics.

      We agree with the reviewer, however, that a potential drawback of our model is that its sole focus is on local excitatory connectivity within the HVC (Kornfeld et al., 2017; Long et al., 2010), while HVC neurons receive afferent excitatory connections (Akutagawa & Konishi, 2010; Nottebohm et al., 1982) that play significant roles in their local dynamics. For example, the excitatory inputs that HVC neurons receive from Uvaeformis may be crucial in initiating (Andalman et al., 2011; Danish et al., 2017; Galvis et al., 2018) or sustaining (Hamaguchi et al., 2016) the sequential activity. While we acknowledge this limitation, our main contribution in this work is the biophysical insight into how the patterning activity in HVC is largely shaped by the intrinsic properties of the individual neurons as well as the synaptic properties, where excitation and inhibition play a major role in enabling neurons to generate their characteristic bursts during singing. This holds irrespective of whether an external drive is injected into the microcircuits or not. We elaborated on this further in the Discussion of the revised version.

      Another weakness of this paper is an unsatisfactory discussion of how the model was obtained, validated, and simulated. The authors should state as clearly as possible, in one location such as an appendix, what is the total number of independent parameters for the entire network and how parameter values were deduced from data or assigned by hand. With enough parameters and variables, many details can be fit arbitrarily accurately so researchers have to be careful to avoid overfitting. If parameter values were obtained by fitting to data, the authors should state clearly what the fitting algorithm was (some iterative nonlinear method, whose results can depend on the initial choice of parameters), what the error function used for fitting (sum of least squares?) was, and what data were used for the fitting.

      The authors should also state clearly the dynamical state of the network, the vector of quantities that evolve over time. (What is the dimension of that vector, which is also the number of ordinary differential equations that have to be integrated?) The authors do not mention what initial state was used to start the numerical integrations, whether transient dynamics were observed and what were their properties, or how the results depended on the choice of the initial state. The authors do not discuss how they determined that their model was programmed correctly (it is difficult to avoid typing errors when writing several pages or more of a code in any language) or how they determined the accuracy of the numerical integration method beyond fitting to experimental data, say by varying the time step size over some range or by comparing two different integration algorithms.
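
      The step-size check suggested above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: it uses a hypothetical single leaky-membrane equation with a known exact solution, forward Euler as the integrator, and illustrative parameter values. For a first-order method, halving the time step should roughly halve the error, so ratios near 2 indicate the integrator is behaving as expected.

```python
import math


def euler_leaky(v0, e_leak, tau_ms, t_end_ms, dt_ms):
    """Forward-Euler integration of dV/dt = -(V - e_leak) / tau_ms.

    All parameters are hypothetical values chosen for illustration.
    """
    n_steps = round(t_end_ms / dt_ms)
    v = v0
    for _ in range(n_steps):
        v += dt_ms * (-(v - e_leak) / tau_ms)
    return v


def exact_leaky(v0, e_leak, tau_ms, t_end_ms):
    """Closed-form solution of the same equation, used as the reference."""
    return e_leak + (v0 - e_leak) * math.exp(-t_end_ms / tau_ms)


def step_halving_errors(dts, v0=-50.0, e_leak=-70.0, tau_ms=10.0, t_end_ms=20.0):
    """Absolute error at t_end_ms for each time step in dts."""
    ref = exact_leaky(v0, e_leak, tau_ms, t_end_ms)
    return [abs(euler_leaky(v0, e_leak, tau_ms, t_end_ms, dt) - ref) for dt in dts]


# Each step is half the previous one; a first-order scheme should give
# error ratios close to 2 between consecutive entries.
errors = step_halving_errors([0.4, 0.2, 0.1, 0.05])
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(errors, ratios)
```

      For a full network model with no closed-form solution, the same idea would apply with the finest-step run (or a second integration algorithm) standing in for the exact reference.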

      We thank the reviewer again. The fitting process in our model occurred only at the first stage, where the synaptic parameters were fit to the Mooney and Prather as well as the Kosche results. No data were shared with us; we merely examined the figures in those papers, checked the amplitudes of the elicited currents, the magnitudes of DC-evoked excitations, etc., and replicated them in our model. While this is suboptimal, it was better for us to start with it rather than simply taking equations for synaptic currents from the literature for other types of neurons (that are not even HVC neurons or from songbirds) and integrating them into our network model. The number of ODEs that govern the dynamics of every model neuron is listed on page 10 of the manuscript as well as in the Appendix. Moreover, we highlighted the details of this fitting process in the revised version.

      Also disappointing is that the authors do not make any predictions to test, except rather weak ones such as that varying a maximum conductance sufficiently (which might be possible by using dynamic clamps) might cause burst propagation to stop or change its properties. Based on their results, the authors do not make suggestions for further experiments or calculations, but they should.

      We agree that making experimentally testable predictions is crucial for the advancement of the model. Our predictions include testing whether eradication of a class of neurons such as HVC<sub>X</sub> neurons disrupts activity propagation, which can be done through targeted neuron elimination. This can also be done by preventing rebound bursting in HVC<sub>X</sub> neurons through pharmacological blockade of the I<sub>h</sub> channels. Others include downregulation of certain ion channels (done pharmacologically through ion channel blockers) and testing which current is fundamental for song production (there are plenty of candidates to test based on our results, such as the SK current, the T-type Ca<sup>2+</sup> current, the A-type K<sup>+</sup> current, etc.). We incorporated these into the Discussion of the revised manuscript to better demonstrate the model's applicability and to guide future research directions.

      Main issues:

      (1) Parameters are overly fine-tuned and often do not match known biology to generate chains. This fine-tuning does not reveal fundamental insights.

      (1a) Specific conductances (e.g. AMPA) are finely tweaked to generate bursts, in part due to a lack of a dendritic mechanism for burst generation. A dendritic mechanism likely reflects the true biology of HVC neurons.

      We acknowledge that the model does not include active dendritic processes and we do not regard this as a limitation. In fact, our present approach, although simplified, is intended to focus on somatic mechanisms to identify minimal conditions required for stable sequential propagation. We know HVC<sub>RA</sub> neurons possess thin, spiny dendrites which can contribute to burst initiation and shaping. Future models that include such nonlinear dendritic mechanisms would likely reduce the need for fine tuning of specific conductances at the soma and consequently better match the known biology of HVC<sub>RA</sub> neurons. 

      In text: “While our simplified, somatically driven architecture enables better exploration of mechanisms for sequence propagation, future extensions of the model will incorporate dendritic compartments to more accurately reflect the intrinsic bursting mechanisms observed in HVC<sub>RA</sub> neurons.”

      (1b) In this paper, microcircuits are simulated and then concatenated to make the HVC chain, resulting in no representations during silent gaps. This is out of touch with the known HVC function. There is no anatomical nor functional evidence for microcircuits of the kind discussed in this paper or in the earlier and rather similar paper by Eve Armstrong and Henry Abarbanel (J. Neurophys. 2016). One can write a large number of papers in which one makes arbitrary unconstrained guesses of network structure in HVC and, unless they reveal some novel principle or surprising detail, they are all going to be weak.

      Although the model is composed of sequentially activated microcircuits, the gaps between each microcircuit’s output do not represent complete silence in the network. During these periods, other neurons such as those in other microcircuits may still exhibit bursting activity. Thus, what may appear as a 'silent gap' from the perspective of a given output microcircuit is, in fact, part of the ongoing background dynamics of the larger HVC neuron network. We fully acknowledge the reviewer's point that there is no direct anatomical or physiological evidence supporting the presence of microcircuits with this structure in HVC. Our intention was not to propose the existence of such a physical model but to use it as a computational simplification to make precise sequential bursting activity feasible given the biologically realistic neuronal dynamics used. Hence, our use of 'microcircuits' refers to a modeling construct rather than a structural hypothesis. Even if the network topology is hypothetical, we still believe that the temporal structuring suggested allows us to generate specific predictions for future work about burst timing and neuronal connections.

      (1c) HVC interneuron discharge in the authors' model is overly precise; address the observation that these neurons can exhibit noisy discharge. Real HVC interneurons are noisy. This issue is critical: All reviewers strongly recommend that the authors should, at the minimum in a revision, focus on incorporating HVC-I noise in their model.

      We agree that capturing the variability in interneuron bursting is critical for biological realism. In our model, HVC interneurons receive stochastic background current that introduces variability in their firing patterns as observed in vivo. This variability is seen in our simulations and produces more biologically realistic dynamics while maintaining sequence propagation. We clarify this implementation in the Methods section. 

      (1d) Address the finding that Kosche et al show that even with reduced inhibition, HVCra neuronal timing is preserved; it is the burst pattern that is affected.

      The differences between the Kosche et al. (2015) findings and the predictions of our model arise from differences in the aspect of HVC function we are modeling. Our model is more sensitive to inhibition, which is a designed mechanism for achieving precise song patterning. This is a modeling simplification we adopted to capture specific characteristics of HVC function. 

      We acknowledged this point in the discussion: “While findings of Kosche et al. (2015) emphasize the robustness of the HVC timing circuit to inhibition, our model is more sensitive to inhibition, highlighting that HVC likely operates with several, redundant mechanisms that overall ensure temporal precision.”

      (1e) The real HVC is robust to microlesions, cooling, and HVCra neuron turnover. The model in this paper relies on precise HVCra connectivity and is not robust.

      Although our model is grounded in the biologically observed behavior of HVC neurons in vivo, we don’t claim that it fully captures the resilience seen in the HVC network. Instead, we see this as a simplified framework that helps us explore the basic principles of sequential activity. In the future, adding features like recurrent excitation, synaptic plasticity, or homeostatic mechanisms could make the model more robust.

      (1f) There is unclear motivation for Ih-driven HVCx bursting, given past findings from the Mooney group.

      Daou et al. (2013) noticed that the sag observed in HVC<sub>X</sub> and HVC<sub>INT</sub> neurons in response to hyperpolarizing current pulses (Dutar et al. 1998; Kubota and Saito 1991; Kubota and Taniguchi 1998) was completely abolished after the application of the drug ZD 7288 in all of the neurons tested, indicating that the sag in these HVC neurons is due to the hyperpolarization-activated inward current (I<sub>h</sub>). In addition, the sag and the rebound seen in these two neuron groups were larger for larger hyperpolarizing current pulses.

      (1g) The initial conditions of the network and its activity under those conditions, as well as the possible reliance on external inputs, are not defined.

      In our model, network activity is initiated through a brief, stochastic excitatory input to a small HVC<sub>RA</sub> neuron of one microcircuit. This drive represents a simplified version of external input from upstream brain regions known to project to HVC, including nuclei in its auditory pathways such as Nif and Uva. Modeling the activity of these upstream regions and their influence on HVC dynamics is ongoing work to be published in the future.

      (1h) It has been known from the time of Hodgkin and Huxley how to include temperature dependences for neuronal dynamics so another suggestion is for the authors to add such dependences for the three classes of neurons and see if their simulation causes burst frequencies to speed up or slow down as T is varied.

      We added this as a limitation to the Discussion section: “Our model was run at a fixed physiological temperature, but it's well known going all the way back to Hodgkin and Huxley that both ion channel activity and synaptic dynamics can change with temperature. In future work, adding temperature scaling (like Q10 factors) could help us explore how burst timing and sequence speed change with temperature changes, and how neural activity in HVC would/would not preserve its precision under different physiological conditions.”
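
      The Q10 scaling mentioned above can be sketched briefly. This is a minimal illustration, assuming the commonly used form in which rate constants are multiplied by a factor Q10^((T - T_ref)/10), so gating time constants are divided by the same factor. The function names, the reference temperature of 40 °C, and the Q10 value of 3 are illustrative assumptions, not values taken from the manuscript.

```python
def q10_factor(temp_c, ref_temp_c, q10=3.0):
    """Multiplicative speed-up of rate constants at temp_c relative to ref_temp_c.

    q10 is the fold-change in rate per 10 degrees Celsius (assumed value).
    """
    return q10 ** ((temp_c - ref_temp_c) / 10.0)


def scaled_time_constant(tau_ref_ms, temp_c, ref_temp_c, q10=3.0):
    """Gating time constant at temp_c, given its value tau_ref_ms at ref_temp_c."""
    return tau_ref_ms / q10_factor(temp_c, ref_temp_c, q10)


# Hypothetical example: cooling by 10 degrees with Q10 = 3 slows a 10 ms
# gating time constant threefold.
tau_cooled_ms = scaled_time_constant(10.0, temp_c=30.0, ref_temp_c=40.0)
print(tau_cooled_ms)
```

      In a conductance-based network, each channel type could in principle carry its own Q10, which is one reason the effect of cooling on burst timing is not obvious without simulation.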

      (2) The scope of the paper and its objectives must be clearly defined. Defining the scope and providing caveats for what is not considered will help the reader contextualize this study with other work.

      (2a) The paper does not consider the role of external inputs to HVC, which are very likely important for the capacity of the HVC chain to tile the entire song, including silent gaps.

      The role of afferent input to HVC, particularly from nuclei such as Uva and Nif, is critical in shaping the timing and initiation of HVC sequences throughout the song, including silent intervals. In fact, external inputs are likely involved in more than just triggering sequences; they may also influence the continuity of activity across motifs. However, in this study, we chose to focus on the intrinsic dynamics of HVC as a step toward understanding the internal mechanisms required for generating temporally precise sequences, and for this reason we used a simplified external input only to initiate activity in the chain.

      (2b) The paper does not consider important dendritic mechanisms that almost certainly facilitate the all-or-none bursting behavior of HVC projection neurons. The authors need to mention and discuss that current-clamped neuronal response - in which an electrode is inserted into the soma and then a constant current-step is applied - bypasses dendritic structure and dendritic processing and so is an incomplete way to characterize a neuron's properties. In particular, claiming to fit current-clamp data accurately and then claiming that one now has a biophysically accurate network model, as the authors do, is greatly misleading.

      While we addressed this in 1a, we do not suggest that our model is a fully accurate biophysical representation of the HVC network. Instead, we see it as a simplified framework that helps reveal how much of HVC’s sequential activity can be explained by somatic properties and synaptic interactions alone. However, additional biological mechanisms, like dendritic processing, are likely to play an important role and should be explored in future work.

      (2c) The introduction does not provide a clear motivation for the paper - what hypotheses are being tested? What is at stake in the model outcomes? It is not inherently informative to take a known biological representation and fine-tune a limited model to replicate that representation.

      We explicitly added the hypotheses to the revised introduction.

      (2d) There have been several published modeling efforts applied to the HVC chain (Seung, Fee, Long, Greenside, Jin, Margoliash, Abarbanel). These and others need to be introduced adequately, and it needs to be crystal clear what, if anything, the present study is adding to the canon.

      While several influential models have explored how HVC might generate sequences ranging from synfire chains to recurrent dynamics or externally driven sequences (e.g., Seung, Fee, Long, Greenside, Jin, Abarbanel, and others), these models could not capture the detailed dynamics observed in vivo. Our aim was to bridge a gap in the modeling literature by exploring how far biophysically grounded intrinsic properties and experimentally supported synaptic connections that are local to the HVC can alone produce temporally precise sequences. We have proven that these mechanisms are sufficient to generate these sequences, although some missing components (such as dendritic mechanisms or external inputs) might be needed to fully capture the complexity and robustness of HVC function.

      (2e) The authors mention learning prominently in the abstract, summary, and introduction but this paper has nothing to do with learning. Most or all mentions of learning should be deleted since they are misleading.

      We appreciate the reviewer’s observation; however, our intent in referencing learning was not to suggest that our model directly simulates learning processes, but rather to place HVC function within the broader context of song learning and production, where temporal sequencing plays a fundamental role. Yet, repeated references to learning may be misleading given that our current model does not incorporate plasticity, synaptic modification, or developmental changes. Hence, we have carefully revised the manuscript to rephrase mentions of learning unless directly relevant to context.

      (3) Using the model for hypothesis generation and prediction of experimental results.

      (3a) The utility of a model is to provide conceptual insight into how or why the real HVC functions as it does, or to predict outcomes in yet-to-be conducted experiments to help motivate future studies. This paper does not adequately achieve these goals.

      We revised the Discussion of the manuscript to better emphasize potential contributions and point out many experiments that could validate or challenge the model’s predictions. These include dynamic clamp or ion channel blockers targeting A-type K<sup>+</sup> in HVC<sub>RA</sub> neurons to assess their impact on burst precision, optogenetic disruption of inhibitory interneurons to observe changes in burst timing and sequence propagation, pharmacological modulation of I<sub>h</sub> or I<sub>CaT</sub> in HVC<sub>X</sub> and interneurons etc. 

      (3b) Additionally, it can be interesting to conduct an experiment on an existing model; for example, what happens to the HVCra chain in your model if you delete the HVCx neurons? What happens if you block NMDA receptors? Such an approach in a modeling paper can help motivate hypotheses and endow the paper with a sense of purpose.

      We agree that running targeted experiments to test our computational model such as removing an HVC neuron population or blocking a synaptic receptor can be a powerful way to generate new ideas and guide future experiments. While we didn’t include these specific tests in the current study, the model is well suited for this kind of exploration. For instance, removing interneurons could help us better understand their role in shaping the timing of HVC<sub>RA</sub> bursts. These are great directions for future experiments, and we now highlight this in the discussion as a way the model could be used to guide experiments.

      (4) Changes to the paper's organization may improve clarity.

      (4a) Nearly all equations should be moved to an Appendix so that the main part of the paper can focus on the science: assumptions made, details of simulations, conclusions obtained, and their significance. The authors present many equations without discussion which weakens the paper.

      Equations moved to appendix.

      (4b) There are many grammatical errors, e.g., verbs do not match the subject in terms of being single or plural. The authors need to run their manuscript through a grammar checker.

      Done.

      (4c) Many of the figures are poorly designed and should be substantially modified. E.g. in Figure 1B, too many colors are used, making it hard to grasp what is being plotted, and the colors are not needed. Figures 1C and 1D are entire figures taken from other papers, and there is no way a reader will be able to see or appreciate all the details when this figure is published on a single page. Figure 2 uses colors for dots that are almost identical, and the colors could be avoided by using different symbols. Figure 5 fills an entire page but most of the figure conveys no information; there is no need to show the same details for all 120 neurons, just show the top 1/3 of this figure. The same applies to Figure 7, where a lot of unnecessary information is being included. In Figure 10, the bottom time series of spikes should be replaced with a time series of rates, since one cannot extract useful information from it.

      Adjusted as requested. 

      (4d) Table 1 is long and largely uninteresting, and should be moved to an appendix.

      Table 1 moved to appendix.

      (4e) Many sentences are not carefully written, which greatly weakens the paper. As one typical example, the first sentence in the Discussion section reads "In this study, we have designed a neural network model that describes [sic] zebra finch song production in the HVC." This is inaccurate: the model does not describe song production, it just explores some properties of one nucleus involved with song production. Just one or a few sentences like this would be OK, but there are so many sentences of this kind that the reader loses faith in the authors.

      Thank you for raising this point. We revised the manuscript to improve the precision of the writing. We replaced the first sentence of the Discussion with this: "In this study, we developed a biophysically realistic neural network model to explore how intrinsic neuronal properties and local connectivity within the songbird nucleus HVC may support the generation of temporally precise activity sequences associated with zebra finch song."

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary

      The authors previously published a study of RGC boutons in the dLGN in developing wild-type mice and developing mutant mice with disrupted spontaneous activity. In the current manuscript, they have broken down their analysis of RGC boutons according to the number of Homer/Bassoon puncta associated with each vGluT2 cluster.

      The authors find that, in the first post-natal week, RGC boutons with multiple active zones (mAZs) are about a third as common as boutons with a single active zone (sAZ). The size of the vGluT2 cluster associated with each bouton was proportional to the number of active zones present in each bouton. Within the authors' ability to estimate these values (n=3 per group, 95% of results expected to be within ~2.5 standard deviations), these results are consistent across groups: 1) dominant eye vs. nondominant eye, 2) wild-type mice vs. mice with activity blocked, and 3) at ages P2, P4, and P8. The authors also found that mAZs and sAZs have roughly the same number (about 1.5) of sAZs clustered around them (within 1.5 um).

      However, the authors do not interpret this consistency between groups as evidence that active zone clustering is not a specific marker or driver of activity dependent synaptic segregation. Rather, the authors perform a large number of tests for statistical significance and cite the presence or absence of statistical significance as evidence that "Eye-specific active zone clustering underlies synaptic competition in the developing visual system (title)". I don't believe this conclusion is supported by the evidence.

      We have revised the title to be descriptive: "Eye-specific differences in active zone addition during synaptic competition in the developing visual system." While our correlative approach does not establish direct causality, our findings provide important structural evidence that complements existing functional studies of activity-dependent synaptic refinement. We have carefully revised the text throughout to avoid causal language, focusing instead on the developmental patterns we observe.

      Strengths

      The source dataset is high resolution data showing the colocalization of multiple synaptic proteins across development. Added to this data is labeling that distinguishes axons from the right eye from axons from the left eye. The first order analysis of this data showing changes in synapse density and in the occurrence of multi-active zone synapses is useful information about the development of an important model for activity dependent synaptic remodeling.

      Weaknesses

      In my previous review I argued that it was not possible to determine, from their analysis, whether the differences they were reporting between groups were important to the biology of the system. The authors have made some changes to their statistics (paired t-tests) and use some less derived measures of clustering. However, they still fail to present a meaningfully quantitative argument that the observed group differences are important. The authors base most of their claims on small differences between groups. There are two big problems with this practice. First, the differences between groups appear too small to be biologically important. Second, the differences between groups that are used as evidence for how the biology works are generally smaller than the precision of the authors' sampling. That is, the differences are as likely to be false positives as true positives.

      (1) Effect size. The title claims: "Eye-specific active zone clustering underlies synaptic competition in the developing visual system". Such a claim might be supported if the authors found that mAZs are only found in dominant-eye RGCs and that eye-specific segregation doesn't begin until some threshold of mAZ frequency is reached. Instead, the behavior of mAZs is roughly the same across all conditions. For example, the clear trend in Figure 4C and D is that measures of clustering between mAZ and sAZ are as similar as could reasonably be expected by the experimental design. However, some of the comparisons of very similar values produced p-values < 0.05. The authors use this fact to argue that the negligible differences between mAZ and sAZs explain the development of the dramatic differences in the distribution of ipsilateral and contralateral RGCs.

      We have changed the title to avoid implying a causal relationship between clustering and eye-specific segregation. Our key findings in Figures 4C and 4D demonstrate effect sizes >2.0 with high statistical power (Supplemental Table S2). While the absolute magnitude of differences is modest (5-7%), these high effect sizes combined with low inter-animal variability demonstrate consistent, reproducible biological phenomena. During development, small differences during critical periods can have profound downstream consequences for synaptic refinement outcomes.

      We acknowledge that significance in Figure 4 arises due to low variance between biological replicates rather than large mean differences. We have revised the text to describe these as "slight" differences and that "WT mice show a tendency toward forming more synapses near mAZ inputs," reflecting appropriate caution in our interpretation while maintaining the statistical robustness of our findings.
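      The relationship described above, where a modest mean difference yields a large effect size because replicates vary little, can be illustrated with a short sketch. It assumes a paired effect size of the Cohen's d_z form (mean of the paired differences divided by their standard deviation); the response does not state which definition was used, and the values below are hypothetical, not taken from the manuscript.

```python
from statistics import mean, stdev

def paired_effect_size(a, b):
    """Cohen's d_z (assumed definition): mean paired difference / SD of differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)

# Hypothetical fractions of clustered inputs in three animals (illustrative only).
maz = [0.56, 0.58, 0.60]
saz = [0.51, 0.52, 0.53]

mean_diff = mean([x - y for x, y in zip(maz, saz)])  # about 0.06, i.e. ~6 percentage points
d_z = paired_effect_size(maz, saz)                   # about 6, driven by the small SD (0.01)
```

      In this toy case the standardized effect is large only because the three paired differences are nearly identical; the statistic by itself says nothing about whether a difference of about six percentage points matters biologically.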

      (2) Sample size. Performing a large number of significance tests and comparing p-values is not hypothesis testing and is not descriptive science. At best, with large sample sizes and controls for multiple tests, this approach could be considered exploratory. With n=3 for each group, many comparisons of many derived measures, among many groups, and no control for multiple testing, this approach constitutes a random result generator.

      The authors argue that n=3 is a large sample size for the type of high resolution / large volume data being used. It is true that many electron microscopy studies with n=1 are used to reveal the patterns of organization that are possible within an individual. However, such studies cannot control individual variation and are, therefore, not appropriate for identifying subtle differences between groups.

      In response to previous critiques along these lines, the authors argue they have dealt with this issue by limiting their analysis to within-individual paired comparisons. There are several problems with their thinking in this approach. The main problem is that they did not change the logic of their arguments, only which direction they pointed the t-tests. Instead of claiming that two groups are different because p < 0.05, they say that two groups are different because one produced p < 0.05 and the other produced p > 0.05. These arguments are not statistically valid or biologically meaningful.

      We have implemented rigorous statistical controls, applying false discovery rate (FDR) correction using the Benjamini-Hochberg method (α = 0.05) within each experimental condition (age × genotype combination). This correction strategy treats each condition as addressing a distinct experimental question: “What synaptic properties differ between left eye and right eye inputs in this specific developmental stage and genotype?” The approach appropriately controls for multiple testing while preserving power to detect biologically meaningful differences. We applied FDR correction separately to the ~20-34 measurements (varying by age and genotype) within each of the six experimental conditions, resulting in condition-specific adjusted p-values reported in updated Supplemental Table S2. This correction confirmed the robustness of our key findings. We do not base conclusions solely on comparing p-values across conditions. Our interpretations focus on effect sizes, confidence intervals, and consistent patterns within each condition, with statistical significance providing supporting evidence rather than the primary basis for biological conclusions.
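      For readers unfamiliar with the procedure, the per-condition correction described above can be sketched as follows. This is a minimal pure-Python version of the Benjamini-Hochberg adjustment, applied separately to each condition; the condition labels and raw p-values are hypothetical and the function name is our own, not the analysis code used for the manuscript.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values grouped by condition (age x genotype); illustrative only.
raw = {
    "P4-WT": [0.001, 0.008, 0.039, 0.041, 0.60],
    "P4-b2KO": [0.02, 0.30, 0.75],
}
alpha = 0.05
adjusted = {cond: benjamini_hochberg(p) for cond, p in raw.items()}
significant = {cond: [q <= alpha for q in qs] for cond, qs in adjusted.items()}
```

      In this toy input, two raw p-values just under 0.05 in the first condition no longer pass after adjustment. Because the correction is applied within each condition, the number of tests m is the per-condition count rather than the total across all six conditions.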

      To the best of my understanding, the results are consistent with the following model:

      RGCs form mAZs at large boutons (known)

      About a quarter of week-one RGC boutons are mAZs (new observation)

      Vesicle clustering is proportional to active zone number (~new observation)

      RGC synapse density increases during the first postnatal week (known)

      Blocking activity reduces synapse density (known)

      Contralateral eye RGCs form more and larger synapses in the lateral dLGN (known)

      While mAZ formation is known in adult and juvenile dLGN, the formation of mAZ boutons during eye-specific competition represents new information with important functional implications. Synapses with multiple release sites should be stronger than single-active-zone synapses, suggesting a structural correlate for competitive advantage during refinement.

      We demonstrate distinct developmental patterns for sAZ versus mAZ contacts during the first postnatal week. Multi-active zone density favors the dominant eye, while single active-zone synapse density from the competing eye increases from P2-P4 to match dominant-eye levels. This reveals that newly formed synapses from the competing eye predominantly contain single release sites, marking P4-P8 as a critical window for understanding molecular mechanisms driving synaptic elimination.

      Our results show that altered retinal activity patterns (β2KO mice) reduce synapse density during eye-specific competition. We relied on β2 knockout mice, which retain retinal waves and spontaneous spike activity but with disrupted patterns and output levels compared to controls. We make no claims about complete activity blockade. Previous studies using different activity manipulations (epibatidine, TTX) have examined terminal morphology, but effects on synapse density during competition remain largely unknown. Achieving complete retinal activity blockade is technically challenging, making it of interest to revisit the role of activity using more precise manipulations to control spike output and relative timing.

      With n=3 and effect sizes smaller than 1 standard deviation, a statistically significant result is about as likely to be a false positive as a true positive.

      A true-positive statistically significant result is not evidence of a meaningful deviation from a biological model.

      Our conclusions are based on results with effect sizes substantially larger than 1. Key findings demonstrate effect sizes exceeding 2.0. These large effect sizes, combined with rigorous FDR correction and low inter-animal variability, provide evidence against false positive results. During critical developmental periods, consistent structural differences, even those modest in absolute magnitude, can reflect important regulatory mechanisms that influence refinement outcomes. All statistical results, effect sizes, and power analyses are reported in Supplementary Table S2, with confidence intervals in Supplementary Table S3. We have revised the text in several places where small differences are presented to reflect appropriate caution in our interpretation.

      Providing plots that show the number of active zones present in boutons across these various conditions is useful. However, I could find no compelling deviation from the above default predictions that would influence how I see the role of mAZs in activity dependent eye-specific segregation.

      Below are critiques of most of the claims of the manuscript.

      Claim (abstract): individual retinogeniculate boutons begin forming multiple nearby presynaptic active zones during the first postnatal week.

      Confirmed by data.

      Claim (abstract): the dominant-eye forms more numerous mAZ contacts,

      Misleading: The dominant-eye (by definition) forms more contacts than the nondominant eye. That includes mAZ.

      While the dominant eye forms more total contacts, the pattern depends critically on contact type and developmental stage. The dominant eye forms more mAZ contacts across all ages (Figures 2 and S1). However, for sAZ contacts, the two eyes form similar numbers at P4, with the non-dominant eye showing increased sAZ formation during this critical period. This differential pattern by synapse type represents an important aspect of how synaptic competition unfolds structurally.

      Claim (abstract): At the height of competition, the non-dominant-eye projection adds many single active zone (sAZ) synapses

      Weak: While the individual observation is strong, it is a surprising deviation based on a single n=3 experiment in a study that performed twelve such experiments (six ages, mutant/wildtype, sAZ/mAZ)

      The difference in eye-specific sAZ formation at P2 and P8 had effect sizes of ~5.3 and ~2.7 respectively (after FDR correction the difference was still significant at P2 and trending at P8). At P4, no effect was observed by paired t-test and the 5/95% confidence intervals ranged from -0.021 to 0.008 synapses/μm³. The consistency of this pattern across P2 and P8, combined with the large effect sizes, supports the reliability of this developmental finding. We report all effect sizes and power test analyses in Supplemental Table S2, and confidence intervals in Supplemental Table S3.
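      As an illustration of how an interval of this kind can span zero with n=3, the sketch below computes a t-based interval on the mean paired difference. It assumes that "5/95%" denotes the 5th and 95th percentile bounds (a 90% two-sided interval) and that a Student's t interval is appropriate; the response does not specify how the intervals were computed, and the densities below are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# 95th percentile of Student's t with 2 degrees of freedom; valid only for n = 3 pairs.
T_CRIT_DF2 = 2.919986

def paired_interval(a, b, t_crit=T_CRIT_DF2):
    """5%/95% bounds on the mean paired difference (assumed t-based construction)."""
    diffs = [x - y for x, y in zip(a, b)]
    centre = mean(diffs)
    half_width = t_crit * stdev(diffs) / sqrt(len(diffs))
    return centre - half_width, centre + half_width

# Hypothetical synapse densities for three animals (illustrative only).
dominant = [0.030, 0.041, 0.036]
nondominant = [0.035, 0.040, 0.049]

low, high = paired_interval(dominant, nondominant)
spans_zero = low < 0 < high  # True here: no detectable eye-specific difference
```

      An interval that includes zero is compatible with no difference, but with three animals it is also wide, so it cannot by itself show that the two eyes are equivalent.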

      Claim (abstract): Together, these findings reveal eye-specific differences in release site addition during synaptic competition in circuits essential for visual perception and behavior.

      False: This claim is unambiguously false. The above findings, even if true, do not argue for any functional significance to active zone clustering.

      Our phrasing “circuits essential for visual perception and behavior” referred to the general importance of binocular organization in the retinogeniculate system for visual processing and we did not intend to claim direct functional significance of our structural data. For clarity we have deleted the latter part of this sentence. In lines 35-37, the abstract now reads “Together, these findings reveal eye-specific differences in release site addition that correlate with axonal refinement outcomes during retinogeniculate refinement.”

      Claim (line 84): "At the peak of synaptic competition midway through the first postnatal week, the non-dominant-eye formed numerous sAZ inputs, equalizing the global synapse density between the two eyes"

      Weak: At one of twelve measures (age, bouton type, genotype) performed with 3 mice each, one density measure was about twice as high as expected.

      The difference in eye-specific sAZ formation at P2 and P8 had effect sizes of ~5.3 and ~2.7 respectively (after FDR correction the difference was still significant at P2 and trending at P8). At P4, no effect was observed by paired t-test and the 5/95% confidence intervals ranged from -0.021 to 0.008 synapses/μm³. The consistency of this pattern across P2 and P8, combined with the large effect sizes, supports the reliability of this developmental finding. We report all effect sizes and power test analyses in Supplemental Table S2, and confidence intervals in Supplemental Table S3.

      Claim (line 172): "In WT mice, both mAZ (Fig. 3A, left) and sAZ (Fig. 3B, left) inputs showed significant eye-specific volume differences at each age."

      Questionable: There appears to be a trend, but the size and consistency is unclear.

      Claim (line 175): "the median VGluT2 cluster volume in dominant-eye mAZ inputs was 3.72 fold larger than that of non-dominant-eye inputs (Fig. 3A, left)."

      Cherry picking. Twelve differences were measured with an n of 3, 3 each time. The biggest difference of the group was cited. No analysis is provided for the range of uncertainty about this measure (2.5 standard deviations) as an individual sample or as one of twelve comparisons.

      Claim (line 174): "In the middle of eye-specific competition at P4 in WT mice, the median VGluT2 cluster volume in dominant-eye mAZ inputs was 3.72 fold larger than that of non-dominant-eye inputs (Fig. 3A, left). In contrast, β2KO mice showed a smaller 1.1 fold difference at the same age (Fig. 3A, right panel). For sAZ synapses at P4, the magnitudes of eye-specific differences in VGluT2 volume were smaller: 1.35-fold in WT (Fig. 3B, left) and 0.41-fold in β2KO mice (Fig. 3B, right). Thus, both mAZ and sAZ input size favors the dominant eye, with larger eye-specific differences seen in WT mice (see Table S3)."

      No way to judge the reliability of the analysis and trivial conclusion: To analyze effect size the authors choose the median value of three measures (whatever the middle value is). They then make four comparisons at the time point where they observed the biggest difference in favor of their hypothesis. There is no way to determine how much we should trust these numbers besides spending time with the mislabeled scatter plots. The authors then claim that this analysis provides evidence that there is a difference in vGluT2 cluster volume between dominant and non-dominant RGCs and that that difference is activity dependent. The conclusion that dominant axons have bigger boutons and that mutants that lack the property that would drive segregation would show less of a difference is very consistent with the literature. Moreover, there is no context provided about what 1.35 or 1.1 fold difference means for the biology of the system.

      We focused on P4 for biological reasons rather than post-hoc selection. P4 represents the established peak of synaptic competition when eye-specific synapse densities are globally equivalent. This is a timepoint consistently highlighted throughout our manuscript and supported by previous literature. We have modified our presentation from fold changes to measured eye-specific differences in volume (mean ± standard error) and added confidence intervals in Supplemental Table S3. The effect sizes for eye-specific differences in VGluT2 volume at P4 are robust: ~2.3 and ~1.5 for mAZ and sAZ measurements in WT mice, and ~2.5 and ~1.8 in β2KO mice, with all analyses well-powered (Supplemental Table S2).

      We were unable to identify any mislabeled scatter plots and believe all figures are correctly labeled. While dominant-eye advantage in bouton size is consistent with previous literature, our study provides the first detailed analysis of how this develops specifically during the critical period of competition, with distinct patterns for single versus multi-active zone contacts. Our data show that dominant-eye inputs have larger vesicle pools that scale with active zone number. While this suggests enhanced transmission capacity, we make no direct physiological claims based on structural data alone.

      Claim (line 189): "This shows that vesicle docking at release sites favors the dominant-eye as we previously reported but is similar for like eye type inputs regardless of AZ number."

      Contradicts core claim of manuscript: Consistent with previous literature, there is an activity dependent relative increase in vGlut2 clustering of dominant eye RGCs. The new information is that that activity dependence is more or less the same in sAZ and mAZ. The only plausible alternative is that vGlut2 scaling only increases in mAZ which would be consistent with the claims of their paper. That is not what they found. To the extent that the analysis presented in this manuscript tests a hypothesis, this is it. The claim of the title has been refuted by figure 3.

      We report the volume of docked vesicle signal (VGluT2) near each active zone, finding this is greater for dominant-eye synapses. Within each eye-specific synapse population, vesicle signal per active zone is similar regardless of whether these are part of single- or multi-active zone contacts. This is consistent with a modular program of active zone assembly and maintenance: core molecular programs facilitate docking at each AZ similarly regardless of how many AZs are nearby.

      This finding does not contradict our main conclusions but rather provides insight into how synaptic advantages are structured. The dominant eye's advantage may arise in part from forming more multi-AZ contacts (which have proportionally more docked vesicles) rather than from enhanced vesicle loading per individual active zone. This organization may reflect how developmental competition operates through contact number and active zone addition rather than fundamental changes to individual release site properties.

      We have changed the title to be descriptive rather than mechanistic.

      Claim (line 235): "For the non-dominant eye projection, however, clustered mAZ inputs outnumbered clustered sAZ inputs at P4 (Fig. 4C, bottom left panel), the age when this eye adds sAZ synapses (Fig. 2C)."

      Misleading: The overwhelming trend across 24 comparisons is that the sAZ clustering looks like mAZ clustering. That is the objective and unambiguous result. Among these 24 underpowered tests (n=3), there were a few p-values < 0.05. The authors base their interpretation of cell behavior on these crossings.

      In Figures 4C and 4D we report significant results with high effect sizes (effect sizes all greater than 2; see Supplemental Table S2). The mean differences are modest (5-7%) and significance arises due to low variance between biological replicates. We acknowledge that clustering patterns are generally similar between mAZ and sAZ inputs across most conditions. We have revised the text to describe these as “slight” differences and that “WT mice show a tendency toward forming more synapses near mAZ inputs”, reflecting appropriate caution in our interpretation while noting the statistical consistency of these patterns.

      Claim (line 328): "The failure to add synapses reduced synaptic clustering and more inputs formed in isolation in the mutants compared to controls."

      Trivially true: Density was lower in mutant.

      We have rewritten the sentence for clarity: “The failure to add synapses could explain the observation that synaptic clustering was reduced and more inputs formed in isolation in the mutants compared to controls.”

      Claim (line 332): "While our findings support a role for spontaneous retinal activity in presynaptic release site addition and clustering..."

      Not meaningfully supported by evidence: I could not find meaningful differences between WT and mutant beside the already known dramatic difference in synapse density.

      We have changed the sentence to avoid overinterpreting the results. The new sentence in lines 415-417 reads: “While our results highlight developmental changes in presynaptic release site addition and clustering, activity-dependent postsynaptic mechanisms also influence input refinement at later stages.”

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Zhang and Speer examine changes in the spatial organization of synaptic proteins during eye specific segregation, a developmental period when axons from the two eyes initially mingle and gradually segregate into eye-specific regions of the dorsal lateral geniculate. The authors use STORM microscopy and immunostain presynaptic (VGluT2, Bassoon) and postsynaptic (Homer) proteins to identify synaptic release sites. Activity-dependent changes of this spatial organization are identified by comparing the β2KO mice to WT mice. They describe two types of synapses based on Bassoon clustering: the multiple active zone (mAZ) synapse and single active zone (sAZ) synapse. In this revision, the authors have added EM data to support the idea that mAZ synapses represent boutons with multiple release sites. They have also reanalyzed their data set with different statistical approaches.

      Strengths:

      The data presented is of good quality and provides an unprecedented view at high resolution of the presynaptic components of the retinogeniculate synapse during active developmental remodeling. This approach offers an advance over previous mouse EM studies of this synapse because the CTB label allows identification of the eye from which the presynaptic terminal arises.

      Weaknesses:

      While the interpretation of this data set is much more grounded in this second revised submission, some of the authors' conclusions/statements still lack convincing supporting evidence. In particular, the data does not support the title: "Eye-specific active zone clustering underlies synaptic competition in the developing visual system". The data show that there are fewer synapses made for both contra- and ipsi- inputs in the β2KO mice-- this fact alone can account for the differences in clustering. There is no evidence linking clustering to synaptic competition. Moreover, the findings of differences in AZ# or distance between AZs that the authors report are quite small and it is not clear whether they are functionally meaningful.

      We thank the reviewer for their helpful suggestions that improved the manuscript in this revision. We have changed the title to remove the reference to “clustering” and to avoid implying any causal relationships. The new title is descriptive: “Eye-specific differences in active zone addition during synaptic competition in the developing visual system”.

      To further address the reviewers' comments, we have removed the remaining references to activity-dependent effects on synaptic development (line 36, line 96, line 415). We have also modified the text in lines 411-413 to state that “The failure to add synapses could explain the observation that synaptic clustering was reduced and more inputs formed in isolation in the mutants compared to controls.”

      We have also updated our presentation of results for Figure 4 to ensure that we do not causally link clustering to synaptic competition. In Figures 4C and 4D we report significant results with high effect sizes (effect sizes all greater than 2; see Supplemental Table S2). The mean differences are modest (5-7%) and significance arises due to low variance between biological replicates. We acknowledge that clustering patterns are generally similar between mAZ and sAZ inputs across most conditions. We have revised the text to describe these as “slight” differences and that “WT mice show a tendency toward forming more synapses near mAZ inputs”, reflecting appropriate caution in our interpretation while noting the statistical consistency of these patterns.

      Reviewer #3 (Public review):

      This study is a follow-up to a recent study of synaptic development based on a powerful data set that combines anterograde labeling, immunofluorescence labeling of synaptic proteins, and STORM imaging (Cell Reports, 2023). Specifically, they use anti-VGluT2 label to determine the size of the presynaptic structure (which they describe as the vesicle pool size), anti-Bassoon to label active zones with the resolution to count them, and anti-Homer to identify postsynaptic densities. Their previous study compared the detailed synaptic structure across the development of synapses made with contra-projecting vs. ipsi-projecting RGCs and compared this developmental profile with a mouse model with reduced retinal waves. In this study, they produce a new detailed analysis on the same data set in which they classify synapses into "multi-active zone" vs. "single-active zone" synapses and assess the number and spacing of these synapses. The authors use measurements to make conclusions about the role of retinal waves in the generation of same-eye synaptic clusters. The authors interpret these results as providing insight into how neural activity drives synapse maturation; however, the strength of their conclusions is not directly tested by their analysis.

      Strengths:

      This is a fantastic data set for describing the structural details of synapse development in a part of the brain undergoing activity-dependent synaptic rearrangements. The fact that they can differentiate the eye of origin is what makes this data set unique over previous structural work. The addition of example images from the EM dataset provides confidence in their categorization scheme.

      Weaknesses:

      Though the descriptions of single vs multi-active zone synapses are important and represent a significant advance, the authors continue to make unsupported conclusions regarding the biological processes driving these changes. Although this revision includes additional information about the populations tested and the tests conducted, the authors do not address the issue raised by previous reviews. Specifically, they provide no assessment of what effect size represents a biologically meaningful result. For example, a more appropriate title is "The distribution of eye-specific single vs multi-active zone is altered in mice with reduced spontaneous activity" rather than concluding that this difference in clustering is somehow related to synaptic competition. Of course, the authors are free to speculate, but many of the conclusions of the paper are not supported by their results.

      We appreciate the reviewer’s helpful critique. We have changed the title to be descriptive and avoid implying causal relationships. 

      We have applied false discovery rate (FDR) correction using the Benjamini-Hochberg method with α = 0.05 within each experimental condition (age × genotype combination). The FDR correction treats each condition as addressing a distinct experimental question: 'What synaptic properties differ between left eye and right eye inputs in this specific developmental stage and genotype?'

      This correction strategy is appropriate because: 1) we focus our statistical comparisons within each age/genotype; 2) each age-genotype combination represents a separate biological context where different synaptic properties between eye-of-origin may be relevant; and 3) this approach controls for multiple testing within each experimental question while maintaining statistical power to detect meaningful biological differences.

      We applied FDR correction separately to the ~20-34 measurements (varying with age and genotype) within each of the six experimental conditions (P2-WT, P2-β2, P4-WT, P4-β2, P8-WT, P8-β2), resulting in condition-specific adjusted p-values. These are reported in the updated Supplemental Table S2. Figures have also been updated to reflect the FDR-adjusted values. Selected between-genotype comparisons are presented descriptively using 5/95% confidence intervals. This correction confirmed the robustness of our key findings.

      With regard to the biological significance of effect sizes, our key findings demonstrate effect sizes >2.0, indicating robust effects. During critical developmental periods, consistent structural differences, even those modest in absolute magnitude, can reflect important regulatory mechanisms that influence refinement outcomes. The differences in synaptic organization we observe occur during the first postnatal week when eye-specific competition is active, suggesting these patterns may be relevant to understanding how structural advantages emerge during synaptic refinement.

      Reviewer #1 (Recommendations for the authors):

      I have tried to understand the analysis and biology of this manuscript as best I can. I believe the analytical approach taken is not reliable and I have explained why in my public comments. I don't believe this manuscript is unique in taking this approach. I have recently published a paper on how common this approach is and why it doesn't work. I don't want to give the impression that the problem with the analysis was that it was not computationally sophisticated enough or that you did not jump through a specific statistical hoop. If I strip out the arguments that depend on misinterpretations of p-values and -instead- look at the scatterplots, I come up with a very different view of the data than what is described in the paper.

      The information in the plots could be translated into a rigorous statistical analysis of estimated differences between groups given the uncertainties of the experimental design. I don't really think that analysis would be useful. I think it would have been enough to publish the plots and report your estimates of the number of active zones in RGCs during development. I don't see evidence of an additional effect.

      We appreciate the reviewer’s helpful comments throughout the review process. Mean active zone numbers per mAZ contact are presented in Figure S2D/E. We look forward to further technical and computational advances that will help us increase our data acquisition throughput and sample sizes when designing future studies. 

      Reviewer #2 (Recommendations for the authors):

      The authors should modify the title and other text to be more consistent with the data. There is no evidence that active zone clustering has any direct relationship to synaptic competition.

      We appreciate the reviewer’s helpful suggestions to ensure appropriate language around causal effects. We have modified the title to accurately reflect the results: "Eye-specific differences in active zone addition during synaptic competition in the developing visual system." We have revised the text in the abstract, introduction, and results section for Figure 4 to be consistent with the data and not imply causality of synapse clustering on segregation phenotypes.

      Reviewer #3 (Recommendations for the authors):

      Change the title.

      We appreciate the reviewer’s feedback throughout the review process. We have modified the title to accurately reflect the results: "Eye-specific differences in active zone addition during synaptic competition in the developing visual system."

    1. Unclear Privacy Rules: Sometimes privacy rules aren’t made clear to the people using a system. For example: If you send “private” messages on a work system, your boss might be able to read them [i19]. When Elon Musk purchased Twitter, he also was purchasing access to all Twitter Direct Messages [i20]. Others Posting Without Permission: Someone may post something about another person without their permission. See in particular: The perils of ‘sharenting’: The parents who share too much [i21]. Metadata: Sometimes the metadata that comes with content might violate someone’s privacy. For example, in 2012, former tech CEO John McAfee was a suspect in a murder in Belize [i22] and hid out in secret. But when Vice magazine wrote an article about him, the photos in the story contained metadata with the exact location in Guatemala [i23]. Deanonymizing Data: Sometimes companies or researchers release datasets that have been “anonymized,” meaning that things like names have been removed, so you can’t directly see who the data is about. But sometimes people can still deduce who the anonymized data is about. This happened when Netflix released anonymized movie ratings data sets, but at least some users’ data could be traced back to them [i24]. Inferred Data: Sometimes information that doesn’t directly exist can be inferred through data mining (as we saw last chapter), and the creation of that new information could be a privacy violation. This includes the creation of Shadow Profiles [i25], which are information about the user that the user didn’t provide or consent to. Non-User Information: Social Media sites migh

      This section makes me think that on the internet nowadays, there is absolutely no way to keep your information to yourself. People's information is held by so many different companies, and users do not know how their information is being used either. Users have no control over their own privacy, even though it concerns information about themselves.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      We thank the reviewer for the very enthusiastic and supportive comments on our manuscript.

      Summary:

      This manuscript presents a compelling and innovative approach that combines Track2p neuronal tracking with advanced analytical methods to investigate early postnatal brain development. The work provides a powerful framework for exploring complex developmental processes such as the emergence of sensory representations, cognitive functions, and activity-dependent circuit formation. By enabling the tracking of the same neurons over extended developmental periods, this methodology sets the stage for mechanistic insights that were previously inaccessible.

      Strengths:

      (1) Innovative Methodology:

      The integration of Track2p with longitudinal calcium imaging offers a unique capability to follow individual neurons across critical developmental windows.

      (2) High Conceptual Impact:

      The manuscript outlines a clear path for using this approach to study foundational developmental questions, such as how early neuronal activity shapes later functional properties and network assembly.

      (3) Future Experimental Potential:

      The authors convincingly argue for the feasibility of extending this tracking into adulthood and combining it with targeted manipulations, which could significantly advance our understanding of causality in developmental processes.

      (4) Broad Applicability:

      The proposed framework can be adapted to a wide range of experimental designs and questions, making it a valuable resource for the field.

      Weaknesses:

      No major weaknesses were identified by this reviewer. The manuscript is conceptually strong and methodologically sound. Future studies will need to address potential technical limitations of long-term tracking, but this does not detract from the current work's significance and clarity of vision.

      Reviewer #2 (Public review):

      Summary:

      The manuscript by Majnik and colleagues introduces "Track2p", a new tool designed to track neurons across imaging sessions of two-photon calcium imaging in developing mice. The method addresses the challenge of tracking cells in the growing brain of developing mice. The authors showed that "Track2p" successfully tracks hundreds of neurons in the barrel cortex across multiple days during the second postnatal week. This enabled the identification of the emergence of behavioral state modulation and desynchronization of spontaneous network activity around postnatal day 11.

      Strengths:

      The manuscript is well written, and the analysis pipeline is clearly described. Moreover, the dataset used for validation is of high quality, considering the technical challenges associated with longitudinal two-photon recordings in mouse pups. The authors provide a convincing comparison of both manual annotation and "CellReg" to demonstrate the tracking performance of "Track2p". Applying this tracking algorithm, Majnik and colleagues characterized hallmark developmental changes in spontaneous network activity, highlighting the impact of longitudinal imaging approaches in developmental neuroscience. Additionally, the code is available on GitHub, along with helpful documentation, which will facilitate accessibility and usability by other researchers.

      Weaknesses:

      (1) The main critique of the "Track2p" package is that, in its current implementation, it is dependent on the outputs of "Suite2p". This limits adoption by researchers who use alternative pipelines or custom code. One potential solution would be to generalize the accepted inputs beyond the fixed format of "Suite2p", for instance, by accepting NumPy arrays (e.g., ROIs, deltaF/F traces, images, etc.) from files generated by other software. Otherwise, the tool may remain more of a useful add-on to "Suite2p" (see https://github.com/MouseLand/suite2p/issues/933) rather than a fully standalone tool.

      We thank the reviewer for this excellent suggestion. 

      We have now implemented this feature: Track2p is compatible with ‘raw’ NumPy arrays for the three types of inputs. For more information, please check the updated documentation: https://track2p.github.io/run_inputs_and_parameters.html#raw-npy-arrays. We have also tested this feature with a custom segmentation and trace extraction pipeline that uses Cellpose for segmentation.

      (2) Further benchmarking would strengthen the validation of "Track2p", particularly against "CaImAn" (Giovannucci et al., eLife, 2019), which is widely used in the field and implements a distinct registration approach.

      The reviewer suggested further benchmarking of Track2p. Ideally, we would want to benchmark Track2p against the current state-of-the-art method. However, the field currently lacks consensus on which algorithm performs best, with multiple methods available including CaImAn, SCOUT (Johnston et al. 2022), ROICaT (Nguyen et al. 2023), ROIMatchPub (recommended by Suite2p documentation and recently used by Hasegawa et al. 2024), and custom pipelines such as those described by Sun et al. 2025. The absence of systematic benchmarking studies, particularly for custom tracking pipelines, makes it impossible to identify the current state of the art for comparison with Track2p. While comparing Track2p against all available methods would provide a comprehensive evaluation, such an analysis falls beyond the scope of this paper.

      We selected CellReg for our primary comparison because it has been validated under similar experimental conditions—specifically, 2-photon calcium imaging in developing hippocampus between P17-P25 (Wang et al. 2024)—making it the most relevant benchmark for our developmental neocortex dataset.

      That said, to support further benchmarking in mouse neocortex (P8-P14), we will publicly release our ground truth tracking dataset.

      (3) The authors might also consider evaluating performance using non-consecutive recordings (e.g., alternate days or only three time points across the week) to demonstrate utility in other experimental designs.

      Thank you for your suggestion. We performed a similar analysis prior to submission, but we decided against including it in the final manuscript, to keep the evaluation brief and to avoid confusing the reader with too many different evaluation methods. We have included the results in Author response images 1 and 2 below.

      To evaluate performance in experimental designs with larger time spans between recordings (>1 day), we performed an additional evaluation of tracking from P8 to each of the subsequent days while omitting the intermediate days (e.g., P8 to P9, P8 to P10 … P8 to P14). The performance for the three mice from the manuscript is shown below:

      Author response image 1.

      As expected, performance drops significantly as the time difference between the two recordings increases (dropping to effectively zero for 2 out of 3 mice). This could also explain why CellReg struggles to track cells across all days, since it takes P8 as a reference and attempts to register all subsequent days to that time point before matching, instead of performing registration and matching in consecutive pairs of recordings (P8-P9, P9-P10 … P13-P14) as we do.

      Finally, for one of the three mice, we performed an additional test asking how much adding one intermediate recording day rescues the P8-P14 tracking performance. This addresses the reviewer’s comment directly: if only three days of recording are possible, which additional day gives the best tracking performance?
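
      The difference between matching to a fixed reference and matching in consecutive pairs can be illustrated with a minimal sketch. It assumes that each pair of consecutive sessions has already produced a mapping from ROI indices in one session to ROI indices in the next; the function name `chain_matches` and the dictionary representation are hypothetical and are not taken from Track2p.

```python
def chain_matches(pairwise_matches):
    """Compose consecutive-pair matches into tracks spanning all sessions.

    pairwise_matches: list of dicts; the i-th dict maps an ROI index in
    session i to its matched ROI index in session i + 1.
    Returns a list of tracks (one ROI index per session) for cells that
    were matched in every consecutive pair.
    """
    tracks = []
    for start in pairwise_matches[0]:
        track = [start]
        for mapping in pairwise_matches:
            nxt = mapping.get(track[-1])
            if nxt is None:
                break  # the cell was lost between two sessions
            track.append(nxt)
        else:
            # Loop finished without a break: the cell was followed throughout
            tracks.append(track)
    return tracks


# Three sessions, two consecutive pairs; ROI 1 is lost after the second session
print(chain_matches([{0: 5, 1: 6}, {5: 2}]))
```

      In this representation a single failed link ends a track, which is one reason the number of fully tracked cells can only decrease as more sessions are chained.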

      Author response image 2.

      As can be seen from the plot, adding the P10 or P11 recording yields the largest improvement in tracking performance; however, the performance is still significantly lower than when including all days (see Fig. 4). This test suggests that including a day that is slightly skewed to earlier ages might improve the performance more than simply choosing the middle day between the two extremes. This would also be consistent with the qualitative observation that the FOV seems to show more drastic day-to-day changes at earlier ages in our recording conditions.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript, Majnik et al. developed a computational algorithm to track individual developing interneurons in the rodent cortex at postnatal stages. Considerable development in cortical networks takes place during the first postnatal weeks; however, tools to study them longitudinally at a single-cell level are scarce. This paper provides a valuable approach to study both single-cell dynamics across days and state-driven network changes. The authors used Gad67Cre mice together with virally introduced TdTom to track interneurons based on their anatomical location in the FOV and AAVSynGCaMP8m to follow their activity across the second postnatal week, a period during which the cortex is known to undergo marked decorrelation in spontaneous activity. Using Track2P, the authors show the feasibility of tracking populations of neurons in the same mice, capturing with their analysis previously described developmental decorrelation and uncovering stable representations of neuronal activity, coincident with the onset of spontaneous active movement. The quality of the imaging data is compelling, and the computational analysis is thorough, providing a widely applicable tool for the analysis of emerging neuronal activity in the cortex. Below are some points for the authors to consider.

      We thank the reviewer for a constructive and positive evaluation of our MS. 

      Major points:

      (1) The authors used 20 neurons to generate a ground truth dataset. The rationale for this sample size is unclear. Figure 1 indicates the capability to track ~728 neurons. A larger ground truth data set will increase the robustness of the conclusions.

      We think this was a misunderstanding of our ground truth dataset analysis, which included 192 and not 20 neurons. Indeed, as explained in the methods section, since manually tracking all cells would require prohibitive amounts of time, we decided to generate sparse manual annotations, only tracking a subset of all cells from the first recording day onwards. To do this, we took the first recording (s0), defined a grid of 64 equidistant points over the FOV and, for each point, identified the closest ROI in terms of Euclidean distance from the median pixel of the ROI (see Fig. S3A). We then manually tracked these 64 ROIs across subsequent days. Only neurons that were detected and tracked across all sessions were taken into account and referred to as our ground truth dataset (‘GT’ in Fig. 4). This was done for 3 mice, hence 3 × 64 neurons and not 20 were used to generate our GT dataset.

      (2) It is unclear how movement was scored in the analysis shown in Figure 5A. Was the time that the mouse spent moving scored after visual inspection of the videos? Were whisker and muscle twitches scored as movement, or was movement quantified as the amount of time during which the treadmill was displaced?

      Movement was scored using a ‘motion energy’ metric as in Stringer et al. 2019 (V1) or Inácio et al. 2025 (S1). This metric takes each pair of consecutive frames of the videography recordings and computes the difference between them by summing the squared pixelwise differences between the two images. We made the appropriate changes in the manuscript to further clarify this in the main text and methods in order to avoid confusion.

      Since this metric quantifies global movements, it is inherently biased to whole-body movements causing more significant changes in pixel values around the whole FOV of the camera. Slight twitches of a single limb, or the whisker pad would thus contribute much less to this metric, since these are usually slight displacements in a small region of the camera FOV. Additionally, comparing neural activity across all time points (using correlation or R²) also favours movements that last longer (such as wake movements / prolonged periods of high arousal) since each time point is treated equally.

      As we suggested in the discussion, it would be interesting in further analysis to look at the link between twitches and neural activity, but this would likely require extensive manual scoring. We could then treat movements not as continuous across all time points but as discrete events, for example using peri-movement time histograms for different types of movements at different ages. This is, however, outside the scope of this study.
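
      As a minimal sketch of the metric as described here (sum of squared pixelwise differences between consecutive frames), assuming the video is available as an array of shape (n_frames, height, width). The function name `motion_energy` is illustrative, and any normalisation or cropping used in the cited papers is not reproduced.

```python
import numpy as np


def motion_energy(frames):
    """Sum of squared pixelwise differences between consecutive frames.

    frames: array of shape (n_frames, height, width).
    Returns an array of length n_frames - 1.
    """
    # Cast to float first so that uint8 video data does not wrap around
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.diff(frames, axis=0)
    return (diffs ** 2).sum(axis=(1, 2))


video = np.zeros((3, 2, 2), dtype=np.uint8)
video[1, 0, 0] = 2  # a single pixel changes in the middle frame
print(motion_energy(video))
```

      Because every pixel contributes equally to the sum, a movement that changes many pixels dominates one that changes few, whatever its behavioural relevance.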

      (3) The rationale for binning the data analysis in early P11 is unclear. As the authors acknowledged, it is likely that the decoder captured active states from P11 onwards. Because active whisking begins around P14, it is unlikely to drive this change in network dynamics at P11. Does pupil dilation in the pups change during locomotor and resting states? Does the arousal state of the pups abruptly change at P11?

      We agree that P11 does not match any change in mouse behavior that we have been able to capture. However, arousal state in mice does change around postnatal day 11. This period marks a transition from immature, fragmented states to more organized and regulated sleep-wake patterns, along with increasing influence from neuromodulatory and sensory systems. All of these changes have been recently reviewed in Wu et al. 2024 (see also Martini et al. 2021). In addition, in the developing somatosensory system, before postnatal day 11 (P11), wake-related movements (reafference) are actively gated and blocked by the external cuneate nucleus (ECN, Tiriac et al. 2016 and all the excellent recent work from the Blumberg lab). This gating prevents sensory feedback from wake movements from reaching the cortex, ensuring that only sleep-related twitches drive neural responses. However, around P11, this gating mechanism abruptly lifts, enabling sensory signals from wake movements to influence cortical processing and signaling a dramatic developmental shift (Wu et al. 2024).

      Reviewer #1 (Recommendations for the authors):

      This manuscript represents a significant advancement in the field of developmental neuroscience, offering a powerful and elegant framework for longitudinal cellular tracking using the Track2p method combined with robust analytical approaches. The authors convincingly demonstrate that this integrated methodology provides an invaluable template for investigating complex developmental processes, including the emergence of sensory representations and higher cognitive functions.

      A major strength of this work is its emphasis on the power of longitudinal imaging to illuminate activity-dependent development. By tracking the same neurons over time, the authors open up new possibilities to uncover how early activity patterns shape later functional outcomes and the organization of neuronal assemblies, insights that would be inaccessible using conventional cross-sectional designs.

      Importantly, the manuscript highlights the potential for this approach to be extended even further, enabling continuous tracking into adulthood and thus offering an unprecedented window into long-term developmental trajectories. The authors also underscore the exciting opportunity to incorporate targeted perturbation experiments, allowing researchers to causally link early circuit dynamics to later outcomes.

      Given the increasing recognition that early postnatal alterations can underlie the etiology of various neurodevelopmental disorders, this work is especially timely. The methods and perspectives presented here are poised to catalyze a new generation of developmental studies that can reveal mechanistic underpinnings of both typical and atypical brain development.

      In summary, this is a technically impressive and conceptually forward-looking study that sets the stage for transformative advances in developmental neuroscience.

      Thank you for the thoughtful feedback—it's greatly appreciated!

      Reviewer #2 (Recommendations for the authors):

      Minor points:

      (1) Figure 1. Consider merging or moving to Supplemental, as its rationale is well described in the text.

      We would like to retain the current figure as we believe it provides an effective visual illustration of our rationale that will capture readers' attention and could serve as a valuable reference for others seeking to justify longitudinal tracking of the developing brain. We hope the reviewer will understand our decision.

      (2) Some axis labels and panels are difficult to read due to small font sizes (e.g. smaller panels in Figures 5-7).

      Modified, thanks 

      (3) Supplementary Figures. The order of appearance in the main text is occasionally inconsistent.

      This was modified, thanks

      (4) Line 132. Add a reference to the registration toolbox used (elastix). A brief description of the affine transformation would also be helpful, either here or in the Methods section (p. 27).

      We have added reference to Ntatsis et al. 2023 and described affine transformation in the main text (lines 133-135): 

      Firstly, we estimate the spatial transformation between s0 and s1 using affine image registration (i.e. allowing shifting, rotation, scaling and shearing, see Fig. 2B, the transformation is denoted as T).
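
      To make the four components named above concrete, the following sketch builds a 2D affine matrix from a shift, rotation, scaling and shear and applies it to point coordinates. It is not the elastix registration itself, which estimates such a transformation from the images. The function names and the parameterisation are illustrative assumptions.

```python
import numpy as np


def affine_matrix(shift=(0.0, 0.0), angle=0.0, scale=(1.0, 1.0), shear=0.0):
    """Build a 3x3 homogeneous affine matrix.

    The linear part applies scaling, then shearing, then rotation
    (angle in radians); the shift is added last.
    """
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    shearing = np.array([[1.0, shear], [0.0, 1.0]])
    scaling = np.diag(scale)
    T = np.eye(3)
    T[:2, :2] = rotation @ shearing @ scaling
    T[:2, 2] = shift
    return T


def apply_affine(T, points):
    """Apply T to an array of 2D points of shape (n_points, 2)."""
    points = np.asarray(points, dtype=float)
    return points @ T[:2, :2].T + T[:2, 2]


T = affine_matrix(shift=(2.0, 3.0))
print(apply_affine(T, [[1.0, 1.0]]))
```

      A pure shift moves every point by the same offset; the other three components change distances and angles between points. A growing field of view plausibly calls for the scaling and shearing terms that a rigid transformation lacks.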

      (5) Lines 147-151. If this method is adapted from another work, please cite the source.

      Computing the intersection over union of two ROIs for tracking is a widely established and intuitive method used across numerous studies, representing standard practice rather than requiring specific citation. We have however included the reference to the paper describing the algorithm we use to solve the linear sum assignment problem used for matching neurons across a pair of consecutive days (Crouse 2016).
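
      The matching step described here (intersection over union followed by linear sum assignment) can be sketched as follows, assuming ROIs are boolean masks that have already been registered into a common frame. The function names and the `min_iou` threshold are illustrative and not taken from Track2p. The solver is SciPy’s `linear_sum_assignment`, whose documentation cites Crouse (2016).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou_matrix(masks_a, masks_b):
    """Pairwise intersection over union of two sets of boolean ROI masks.

    masks_a, masks_b: arrays of shape (n_rois, height, width).
    """
    a = masks_a.reshape(len(masks_a), -1).astype(np.int64)
    b = masks_b.reshape(len(masks_b), -1).astype(np.int64)
    intersection = a @ b.T
    union = a.sum(axis=1)[:, None] + b.sum(axis=1)[None, :] - intersection
    return intersection / np.maximum(union, 1)  # guard against empty masks


def match_rois(masks_a, masks_b, min_iou=0.5):
    """One-to-one matching that maximises total IoU, then drops weak pairs."""
    iou = iou_matrix(masks_a, masks_b)
    rows, cols = linear_sum_assignment(iou, maximize=True)
    keep = iou[rows, cols] >= min_iou  # hypothetical acceptance threshold
    return list(zip(rows[keep].tolist(), cols[keep].tolist()))


masks_day0 = np.zeros((2, 4, 4), dtype=bool)
masks_day0[0, :2, :2] = True
masks_day0[1, 2:, 2:] = True
masks_day1 = masks_day0[::-1].copy()  # same ROIs, listed in reverse order
print(match_rois(masks_day0, masks_day1))
```

      Solving the assignment globally, rather than greedily taking the best overlap for each ROI in turn, prevents two ROIs from one day being matched to the same ROI on the next.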

      (6) Line 218. "classical" or automatic?

      We meant “classical” in the sense of widely used. 

      (7) Lines 220-231. Did the authors find significant variability of successfully tracked neurons across mice? While the data for successfully tracked cells is reported (Figure 5B), the proportions are not. Could differences in neuron dropout across days and mice affect the analysis of neuronal activity statistics?

      We thank the reviewer for raising this important point. We computed the fraction of successfully tracked cells in our dataset and found substantial variability:

      Cells detected on day 0: [607, 1849, 2190, 1988, 1316, 2138] 

      Proportion successfully tracked: [0.47, 0.20, 0.36, 0.37, 0.41, 0.19]

      Notably, the number of cells detected on the first day varies considerably (607–2190 cells). There appears to be a trend whereby datasets with fewer initially detected cells show higher tracking success rates, potentially because only highly active cells are identified in these cases.
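
      One simple way to examine the trend mentioned above is to correlate the two lists quoted in this response. The sketch below uses only those quoted values; with six mice the correlation coefficient is descriptive rather than a statistical test, and the derived cell counts are approximate because the proportions are rounded.

```python
import numpy as np

# Values quoted in the response above (one entry per mouse)
cells_day0 = np.array([607, 1849, 2190, 1988, 1316, 2138])
prop_tracked = np.array([0.47, 0.20, 0.36, 0.37, 0.41, 0.19])

# Approximate number of successfully tracked cells in each mouse
n_tracked = np.round(cells_day0 * prop_tracked).astype(int)

# Pearson correlation between initial cell count and tracking success
r = np.corrcoef(cells_day0, prop_tracked)[0, 1]
print(n_tracked, f"r = {r:.2f}")
```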

      To draw more definitive conclusions about the proportion of active cells and tracking dropout rates, we would require activity-independent cell detection methods (such as Cellpose applied to isosbestic 830 nm fluorescence, or ideally a pan-neuronal marker in a separate channel, e.g., tdTomato). We have incorporated the tracking success proportions into the revised manuscript.

      (8) Line 260. Please briefly explain, here or in the Methods, the rationale for using data from only 3 mice (rather than all 6) for evaluating tracking performance.

      We used three mice for this analysis due to the labor-intensive nature of manually annotating 64 ROIs across several days. Given the time constraints of this manual process, we determined that three subjects would provide adequate data to reliably assess tracking performance.

      (9) Line 277. Consider clarifying or rephrasing the phrase "across progressively shorter time intervals"? Do you mean across consecutive days?

      This has been rephrased as follows: 

      Additionally, to assess tracking performance over time, we quantified the proportion of reconstructed ground truth tracks over progressively longer time intervals (first two days, first three days etc. ‘Prop. correct’ in Fig. 4C-F, see Methods). This allowed us to understand how tracking accuracy depends on the number of successive sessions, as well as at which time points the algorithm might fail to successfully track cells.
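
      One possible reading of this ‘Prop. correct’ measure is sketched below, assuming each track is stored as one ROI index per session and that a ground truth track counts as reconstructed over the first k sessions when some predicted track has the same first k entries. The function name and data layout are hypothetical, and the exact definition in the Methods may differ.

```python
import numpy as np


def prop_correct_by_span(gt_tracks, predicted_tracks):
    """Fraction of ground truth tracks reconstructed over the first k sessions.

    gt_tracks, predicted_tracks: integer arrays of shape
    (n_cells, n_sessions) holding one ROI index per session.
    Returns a list with one value for each k = 2 .. n_sessions.
    """
    gt = np.asarray(gt_tracks)
    pred = np.asarray(predicted_tracks)
    props = []
    for k in range(2, gt.shape[1] + 1):
        # All predicted tracks truncated to the first k sessions
        predicted_prefixes = {tuple(row[:k]) for row in pred}
        hits = sum(tuple(row[:k]) in predicted_prefixes for row in gt)
        props.append(hits / len(gt))
    return props


gt = [[0, 1, 2], [3, 4, 5]]
pred = [[0, 1, 2], [3, 4, 9]]  # second cell is mismatched in the last session
print(prop_correct_by_span(gt, pred))
```

      Under this definition the values cannot increase with k, since a track that is wrong over the first k sessions remains wrong over any longer span.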

      (10) Line 306. "we also provide additional resources and documentation". Please add a reference or link.

      Done, thanks

      Track2p  

      (11) Lines 342-344. Specify that the raster plots refer to one example mouse, not the entire sample.

      Done, thanks.

      (12) Lines 996-1002. Please confirm whether only successfully tracked neurons were used to compute the Pearson correlations between all pairs.

      Yes of course, this only applies to tracked neurons as it is impossible to compute this for non-tracked pairs.
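
      For reference, pairwise Pearson correlations restricted to tracked neurons can be computed as in the sketch below, assuming the activity of the tracked cells in one session is stored as an array of shape (n_cells, n_timepoints). The function name is illustrative, and any preprocessing applied by the authors is not reproduced.

```python
import numpy as np


def pairwise_pearson(traces):
    """Pearson correlation for every unordered pair of cells.

    traces: array of shape (n_cells, n_timepoints).
    Returns a 1D array of length n_cells * (n_cells - 1) / 2.
    """
    traces = np.asarray(traces, dtype=float)
    corr = np.corrcoef(traces)  # rows are treated as variables (cells)
    i, j = np.triu_indices(len(traces), k=1)  # each pair once, no diagonal
    return corr[i, j]


traces = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
print(pairwise_pearson(traces))
```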

      (13) Line 1003. Add a reference to scikit-learn.

      Reference was added to: 

      Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830. 

      (14) Typos. Correct spacing between numeric values and units.

      We did not find many typos regarding spacing between the numerical value and the unit symbol (degrees and percent should not be spaced, right?).

      Reviewer #3 (Recommendations for the authors):

      The font size in many of the figures is too small. For example, it is difficult to follow individual ROIs in Figure S3.

      Figure font size has been increased, thanks. In Figure S3 there might have been a misunderstanding, since the three FOV images do not correspond to the FOV of the same mouse across three days but rather to the first recording for each of the three mice used in evaluation (the ROIs can thus not be followed across images since they correspond to a different mouse). To avoid confusion we have labelled each of the FOV images with the corresponding mouse identifier (same as in Fig. 4 and 5).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      In this manuscript, the authors explore the role of the conserved transcription factor POU4-2 in planarian maintenance and regeneration of mechanosensory neurons. The authors explore the role of this transcription factor and identify potential targets of this transcription factor. Importantly, many genes discovered in this work are deeply conserved, with roles in mechanosensation and hearing, indicating that planarians may be a useful model with which to study the roles of these key molecules. This work is important within the field of regenerative neurobiology, but also impactful for those studying the evolution of the machinery that is important for human hearing. 

      Strengths: 

      The paper is rigorous and thorough, with convincing support for the conclusions of the work. 

      Weaknesses: 

      Weaknesses are relatively minor and could be addressed with additional experiments or changes in writing.

      Reviewer #2 (Public review): 

      Summary: 

      In this manuscript, the authors investigate the role of the transcription factor Smed-pou4-2 in the maintenance, regeneration, and function of mechanosensory neurons in the freshwater planarian Schmidtea mediterranea. First, they characterize the expression of pou4-2 in mechanosensory neurons during both homeostasis and regeneration, and examine how its expression is affected by the knockdown of soxB1-2, a previously identified transcription factor essential for the maintenance and regeneration of these neurons. Second, the authors assess whether pou4-2 is functionally required for the maintenance and regeneration of mechanosensory neurons.

      Strengths: 

      The study provides some new insights into the regulatory role of pou4-2 in the differentiation, maintenance, and regeneration of ciliated mechanosensory neurons in planarians. 

      Weaknesses: 

      The overall scope is relatively limited. The manuscript lacks clear organization, and many of the conclusions would benefit from additional experiments and more rigorous quantification to enhance their strength and impact. 

      Reviewing Editor Comments: 

      (1) Quantification of pou4-2(+) cells that express (or do not express) hmcn-1-L and/or pkd1L-2(-) is a common suggestion amongst reviewers. It is recognized that Ross et al. (2018) showed that pkd1L-2 and hmcn-1-L expression is detected in separate cells by double FISH, and the analysis presented in Supplementary Figure S3 is helpful in showing that some cells expressing pou4-2 (magenta) are not labeled by the combined signal of pkd1L-2 and hmcn-1-L riboprobes (green). However, I am not sure that we can conclude that pkd1L-2 and hmcn-1-L are effectively detected when riboprobes are combined in the analysis. Therefore, quantification of labeled cells as proposed by Reviewers 1 and 2 would help.

      Combining riboprobes is a standard approach in the field, and we chose this method as a direct way to determine which cells lack expression of both genes. We agree that providing the raw quantification data would be helpful for readers, and we included this data in Supplementary File S7; the file contains the quantification information for this dFISH experiment represented in Supplementary Figure 3.

      (2) It may be helpful to comment on changes (or lack of changes) in atoh gene RNA levels in RNAseq analyses of pou4-2 animals. As mentioned by one of the reviewers, in situs that don't show signal are inconclusive in this regard. 

      We fully agree with both reviewers. Two of the planarian atonal homologs are difficult to detect and produce background signals, as we previously reported in Cowles et al., Development (2013). We conceived the reciprocal RNAi/in situ experiments out of curiosity, given the reported role of atonal in the pou4 cascade in other organisms. However, these exploratory experiments lacked a strong rationale for inclusion, particularly given that pou4-2 and the atonal homologs do not share expression patterns, co-expression, or differential expression in our RNA-seq dataset. Therefore, we decided to omit the atonal in situs following pou4-2 RNAi. We retained the experiments showing that knockdown of the atonal genes does not show robust effects on the mechanosensory neuron pattern, as expected. We thank the reviewing editor and reviewers for pinpointing the concern. We agree that additional experiments, such as qPCR, would be needed to reach firm conclusions. We reasoned that while these additional experiments could be informative, they are unlikely to substantially alter the key conclusions of this study.

      (3) There seem to be typos at bottom of page 10 and top of page 11 when referencing to Figure 4B (should be to 5B instead): "While mechanosensory neuronal patterned expression of Eph1 was downregulated after pou4-2 and soxB1-2 inhibition, low expression in the brain branches of the ventral cephalic ganglia persisted (Figure 4B)."

      Thank you! We have fixed those.

      (4) Typo (page 13; kernel?): "...to test to what extent the Pou4 gene regulatory kernel is conserved among these widely divergent animals." 

      Regulatory kernels are defined as the minimal sets of interacting genes that drive developmental processes and are the core circuits within a gene regulatory network, but we recognize that this might not be as well known, so we have changed the term to “network” for clarity.

      Reviewer #1 (Recommendations for the authors): 

      (1) The authors indicate that they are interested in finding out whether POU4-2 is important in the creation of mechanosensory neurons in adulthood as well as in embryogenesis (in other words, whether the mechanism is "reused during adult tissue maintenance and regeneration"). The manuscript clearly shows that planarian POU4-2 is important in adult neurogenesis in planarians, but there is no evidence presented to show that this is a recapitulation of embryogenesis. Is pou4-2 expressed in the planarian embryo? This might be possible to examine by ISH or through the evaluation of sequencing data that already exists in the literature.

      We agree that these statements should be precise. We have clarified when we make comparisons to the role of Pou4 in sensory system development in other organisms versus its role in the adult planarian. We examined its expression using the existing database of embryonic gene expression. Thanks for hinting at this idea. We performed BLAST in Planosphere (Davies et al., 2017) to cross-reference our clone matching dd_Smed_v6_30562_0_1, which is identical to SMED30002016. The embryonic gene expression for SMED30002016 indicates this gene is expressed at the expected stages given prior knowledge of the timing of organ development in Schmidtea mediterranea (a positive trend begins at Stage 5, with a marked increase by Stage 6 that remains comparable to the asexual expression levels shown). We thank the reviewer for pointing out this oversight. We have incorporated this result in the paper as a Supplementary Figure and discuss how we can only speculate that it has a similar role as we detect in the adult asexual worms.

      (2) Can it be determined whether the punctate pou4-2+ cells outside of the stripes are progenitors or other neural cell types? Are there pou4-2+ neurons that are not mechanosensory cell types? Could there be other roles for POU4-2 in the neurogenesis of other cell types? It might help to show percentages of overlap in Figure 4A and discuss whether the two populations add up to 100% of cells. 

      These are good questions that arise in part from other statements that need clarification in the text (pointed out by Reviewer 2). We think some of the dorsal pou4-2+ cells might represent progenitor cells undergoing terminal differentiation (see Supplementary Figure 4). We attempted BrdU pulse-chase experiments but were not successful in consistently detecting pou4-2 at sufficient levels with our protocol. In response to this helpful comment, we have included this question as a future direction in the revised Discussion. Finally, we have edited our description of the expression pattern. We already pointed out that there are other cells on the ventral side that are not affected when soxB1-2 is knocked down. We attempted to resolve the potential identity of those cells using existing scRNA-seq data in collaboration with colleagues, but their low abundance made it difficult to distinguish other populations. While we acknowledge this interesting possibility, we have chosen to focus this report on the role of pou4-2 downstream of soxB1-2, as this represents the most well-supported aspect of the dataset and was positively highlighted by both the reviewer and editor.

      (3) The authors discuss many genes from their analysis that play conserved roles in mechanosensation and hearing. Were there any conserved genes that came up in the analysis of pou4-2(RNAi) planarians that have not yet been studied in human hearing and neurodevelopment? I am wondering the extent to which planarians could be used as a discovery system for mechanosensory neuron function and development, and discussion of this point might increase the impact of this paper or provide critical rationale for expanding work on planarian mechanosensation. 

      Indeed, we agree that planarians could be used to identify conserved genes with roles in mechanosensation and have included this point in the Discussion. In this study, we have focused on demonstrating the conservation of gene regulation. While this study was initially based on a graduate thesis project, we have since generated a more comprehensive dataset from isolated heads, which we are currently analyzing. This has been emphasized in the revised Discussion.

      Minor: 

      (1) For Figure 6E, the authors could consider showing data along a negative axis to indicate a decrease in length in response to vibration and to more clearly show that this decrease doesn't occur as strongly after pou4-2(RNAi). 

      We displayed this behavior as the percent change, as this is a standard way to represent this data. As the percent change is a positive value, we represent the data as these positive values.

      (2) The authors should consider quantifying the decrease of pou4-2 mRNA after atonal(RNAi) conditions, either by RT-qPCR or cell quantification. Visually, the signal in the stripes after atoh8-2(RNAi) seems lower, particularly in the tail. The punctate pattern outside the stripes may also be decreased after atoh8-1(RNAi). But quantification might strengthen the argument. 

      We agree with the reviewer and acknowledge that we should have been more cautious in interpreting these results. Those two genes are difficult to detect and did not show specific patterns in Cowles et al. (2013). The reviewer is correct that additional experiments are necessary before reaching conclusions, but, as discussed earlier, we do not think new experiments would provide insights for the major conclusions. These experiments were exploratory in nature and tangential to our main conclusions, especially in the absence of reciprocal evidence (e.g., shared expression patterns, co-expression, or differential expression in our RNA-seq data). Therefore, we decided to eliminate the atonal in situs following pou4-2 RNAi.

      Reviewer #2 (Recommendations for the authors): 

      A. Expression of pou4-2 in ciliated mechanosensory neurons: 

      (1) The conclusion that pou4-2 is expressed in ciliated mechanosensory neurons is primarily based on co-expression analysis using a published single-cell dataset. Although the authors later show that a subset of pou4-2 cells also express pkd1L-2 (Figure 4A), a known marker of ciliated mechanosensory neurons, this finding is not properly quantified. I recommend moving Figure 4A to earlier in the manuscript (e.g., to Figure 2) and expanding the analysis to include additional known markers of this cell type. Proper quantification of the extent of co-localization is necessary to support the claim robustly. 

      As pointed out by the reviewer, there is substantive evidence from our lab and other reports. King et al. also showed pou4-2 and pkd1L-2 ‘regulation’ by their scRNA-seq data, and this function is conserved in the acoel Hofstenia miamia (Hulett et al., PNAS 2024). Our analysis shows convincing co-localization by scRNA-seq and expression of soxB1-2 and neural markers in the respective populations. Furthermore, we included colocalization of pou4-2 with mechanosensory genes using fluorescence in situ hybridization (Figure 3B, Supplementary Figure 4, and Supplementary File S7). We are confident the data conclusively show pou4-2 regulates pkd1L-2 expression in a subset of mechanosensory neurons. Given the strength of existing observations and previously published data, we believe that additional staining experiments are not essential to support this conclusion.

      (2) There appears to be a conceptual inconsistency in the interpretation of pou4-2 expression dynamics. On one hand, the authors suggest that delayed pou4-2 expression indicates a role in late-stage differentiation (p.6). On the other hand, they propose that pou4-2 may be expressed in undifferentiated progenitors to initiate downstream transcriptional programs (p.8). These interpretations should be reconciled. Additionally, claims regarding pou4-2 expression in progenitor populations should be supported by co-localization with established stem cell or progenitor markers, rather than inferred from signal intensity alone. 

      This is an excellent point, and we agree with the reviewer that this section requires editing. As described in response to Reviewer 1, we attempted BrdU pulse chase experiments but were not successful in consistently detecting pou4-2 at sufficient levels with our protocol. Furthermore, we could not obtain strong signals in double labeling experiments in pou4-2 in situs combined with piwi-1 or PIWI-1 antibodies. We will include those experiments as a future direction and amend our conclusions accordingly.

      (3) The expression pattern shown in Figure 1B raises questions about the precise anatomical localization of pou4-2 cells. It is unclear whether these cells reside in the subepidermal plexus or the deeper submuscular plexus, which represent distinct neuronal layers (Ross et al., 2017). The observed signals near the ventral nerve cords could suggest submuscular localization. To clarify this, higher-resolution imaging and co-staining with region-specific neural markers are recommended. 

      In Ross et al. (2018), we showed that the pkd1L-2<sup>+</sup> cells are located submuscularly. The pkd1L-2<sup>+</sup> cells express pou4-2; thus, the pou4-2<sup>+</sup> cells are found in the same location. Based on the co-expression data, including co-expression with PKD genes, we are confident that these cells are submuscular.

      B. The functional requirements of pou4-2 in the maintenance of mechanosensory neurons: 

      (1) To evaluate the functional role of pou4-2 in maintaining mechanosensory neurons, the authors performed whole-animal RNA-seq on pou4-2(RNAi) and control animals, identifying a significant downregulation of genes associated with mechanosensory neuron expression. However, the presentation of these findings is fragmented across Figures 3, 4, and 5. I recommend consolidating the RNA-seq results (Figure 3) and the subsequent validation of downregulated genes (Figures 4 and 5) into a single, cohesive figure. This would improve the logical flow and clarity of the manuscript. 

      As suggested by the reviewer, we have combined Figures 3 and 4 (new Figure 3), which we believe improves the flow. We decided to keep Figure 5 (new Figure 4) as a standalone because it focuses on the characterization of new genes revealed by RNA-seq and scRNA-seq data mining that were not previously reported in Ross et al. 2018 and 2024.

      (2) In pou4-2(RNAi) animals, pkd1L-2 expression appears to be entirely lost, while hmcn-1-L shows faint expression in scattered peripheral regions. The authors suggest that an extended RNAi treatment might be necessary to fully eliminate hmcn-1-L expression. However, an alternative explanation is that pou4-2 is not essential for maintaining all hmcn-1-L cells, particularly if pou4-2 expression does not fully overlap with that of hmcn-1-L. This possibility should be acknowledged and discussed. 

      We agree and have acknowledged this point in the revised text.

      (3) On page 9, the section title claims that "Smed-pou4-2 regulates genes involved in ciliated cell structure organization, cell adhesion, and nervous system development." While some differentially expressed genes are indeed annotated with these functions based on homology, the manuscript does not provide experimental evidence supporting their roles in these biological processes in planarians. The title should be revised to avoid overstatement, and the limitations of extrapolating a function solely from gene annotation should be acknowledged. 

      Excellent point. We have edited the text to indicate that the genes were annotated or implicated.

      (4) The cilia staining presented in Figure 6B to support the claim that pou4-2 is required for ciliated cell structure organization is unconvincing. Improved imaging and more targeted analysis (e.g., co-labeling with mechanosensory markers) are needed to support this conclusion. 

      We have addressed this concern by adjusting the language to be more precise and indicate that the stereotypical banded pattern is disrupted with decreased cilia labeling along the dorsal ciliated stripe. Indeed, our conclusion overstated the observations made with the staining and imaging resolution. Thank you.

      C. The functional requirements of pou4-2 in the regeneration of mechanosensory neurons: 

      To evaluate the role of pou4-2 in the regeneration of mechanosensory neurons, the authors performed amputations on pou4-2(RNAi) and control(RNAi) animals and assessed the expression of mechanosensory markers (pkd1L-2, hmcn-1-L) alongside a functional assay. However, the results shown in Figure 4B indicate the presence of numerous pkd1L-2 and hmcn-1-L cells in the blastema of pou4-2(RNAi) animals. This observation raises the possibility that pou4-2 may not be essential for the regeneration of these mechanosensory neurons. The authors should address this alternative interpretation. 

      Our interpretation is that there were very few cells expressing the markers compared to controls. The pattern was predominantly lost, which is consistent with other experiments shown in the paper. However, we have added the additional caveat suggested by the reviewer.

      Minor points: 

      (1) On p.8, the authors wrote "every 12 hours post-irradiation". However, this is not consistent with the figure, which only shows 0, 3, 4, 4.5, 5, and 5.5 dpi. 

      We corrected this. Thank you for catching the mistake!

      (2) On p.12, the authors wrote "Analysis of pou4-2 RNAi data revealed differentially expressed genes with known roles in mechanosensory functions, such as loxhd-1, cdh23, and myo7a. Mutations in these genes can cause a loss of mechanosensation/transduction". This is misleading because, to my knowledge, the role of these genes in planarians is unknown. If the authors meant other model systems, they should clearly state this in the text and include proper references. 

      The reviewer is correct that we are referencing findings from other organisms. We have clarified this point in the revised text. The appropriate references were included and cited in the first version.

      (3) On p.7, the authors wrote, "conversely, the expression of atonal genes was unaffected in pou4-2 RNAi-treated regenerates (Supplementary Figure S2B)". However, it is unclear whether the Atoh8-1 and Atoh8-2 signals are real, as the quality of the in situ results is too low to distinguish between real signals and background noise/non-specific staining. 

      This valid concern was addressed in our response to Reviewer 1. We have adjusted the figure and the text accordingly.

      (4) On p.6 the authors wrote "pinpointed time points wherein the pou4-2 transcripts were robustly downregulated". However, the current version of the manuscript does not provide data explaining why Pou4-2 transcripts are robustly downregulated on day 12. 

      Yes, we determined the appropriate time points using qPCR for all sample extractions. As an example, see the figure for qPCR validation at day 12 showing that pou4-2 and pkd1L-2 are downregulated.

      Author response image 1.

      In this graph, samples labeled “G” represent four biological controls of gfp(RNAi) control animals, and samples labeled “P” represent four biological controls of pou4-2(RNAi) animals at day 12 in the RNAi protocol.

      (5) On p.13, the authors wrote "collecting RNA from how animals." Is this a typo? 

      Thanks for catching the typo. It should read “whole” animals. We have corrected this.

      (6) On p.14, the authors wrote "but the expression patterns of planarian atonal genes indicated that they represent completely different cell populations from pou4-2-regulated mechanosensory neurons". However, this is unclear from the images, as the in situ stainings of Atoh8-1 and Atoh8-2 are potentially failed stainings. 

      We agree. We have edited accordingly.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The manuscript "Lifestyles shape genome size and gene content in fungal pathogens" by Fijarczyk et al. presents a comprehensive analysis of a large dataset of fungal genomes to investigate what genomic features correlate with pathogenicity and insect associations. The authors focus on a single class of fungi, due to the diversity of lifestyles and availability of genomes. They analyze a set of 12 genomic features for correlations with either pathogenicity or insect association and find that, contrary to previous assertions, repeat content does not associate with pathogenicity. They discover that the number of protein-coding genes, including the total size of non-repetitive DNA, does correlate with pathogenicity. However, unique features are associated with insect associations. This work represents an important contribution to the attempts to understand what features of genomic architecture impact the evolution of pathogenicity in fungi.

      Strengths:

      The statistical methods appear to be properly employed and analyses thoroughly conducted. The manuscript is well written and the information, while dense, is generally presented in a clear manner.

      Weaknesses:

      My main concerns all involve the genomic data, how they were annotated, and the biases this could impart to the downstream analyses. The three main features I'm concerned with are sequencing technology, gene annotation, and repeat annotation.

      We thank the reviewer for all the comments. We are aware that the genome assemblies are of heterogeneous quality since they come from many sources. The goal of this study was to make the best use of the existing assemblies, with the assumption that noise introduced by the heterogeneity of sequencing methods should be overcome by the robustness of evolutionary trends and the breadth and number of analyzed assemblies. Therefore, at worst, we would expect a decrease in the power to detect existing trends. It is important to note that the only way to confidently remove all potential biases would be to sequence and analyze all species in the same way; this would require a complete study and is beyond the scope of the work presented here. Nevertheless, some biases could affect the results in a negative way, e.g., if they affect fungal lifestyles differently. We therefore made an attempt to explore the impact of sequencing technology and of the gene and repeat annotation approach among genomes of different fungal lifestyles. Details are described in Supplementary Results and below. Overall, even though the assembly size and annotations conducted with Augustus can sometimes vary compared to annotations from other resources, such as JGI Mycocosm, we do not observe a bias associated with fungal lifestyles. Comparison of annotations conducted with Augustus and the JGI Mycocosm dataset revealed variation in gene-related features that reflects biological differences rather than issues with annotation.

      The collection of genomes is diverse and includes assemblies generated from multiple sequencing technologies including both short- and long-read technologies. Not only has the impact of the sequencing method not been evaluated, but the technology is not even listed in Table S1. From the number of scaffolds it is clear that the quality of the assemblies varies dramatically. This is going to impact many of the values important for this study, including genome size, repeat content, and gene number.

      We have now added sequencing technology in Table S1 as it was reported in NCBI. We evaluated the impact of long-read (Nanopore, PacBio, Sanger) vs. short-read assemblies in Supplementary Results. In short, the proportions of different lifestyles (pathogenic vs. non-pathogenic, IA vs. non-IA) were the same for short- and long-read assemblies. Indeed, long-read assemblies were longer, had a higher fraction of repeats and fewer genes on average, but the differences between pathogenic vs. non-pathogenic (or IA vs. non-IA) species were in the same direction for the two sequencing technologies and in line with our results. There were some discrepancies, e.g., mean intron length was longer for pathogens with long-read assemblies, but slightly shorter on average for short-read assemblies (and, to a lesser extent, GC and pseudo tRNA count), which could explain weaker or mixed results in our study for these features.

      Additionally, since some filtering was employed for small contigs, this could also bias the results.

      The reason behind setting the lower contig length threshold was the fact that assemblies submitted to NCBI have varying lower-length thresholds. This is because assemblers do not output contigs below a certain length, and this threshold can be manipulated by the user. Setting a common minimum contig length was meant to remove this variation, knowing that any length cut-off will have a larger effect on short-read-based assemblies than on long-read-based assemblies. Notably, genome assemblies of corresponding species in JGI Mycocosm have a minimum contig length of 865 bp, not much lower than in our dataset. Importantly, in response to a comment from a previous reviewer, repeat content was recalculated on raw assembly lengths instead of on filtered assembly lengths.
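
      The filtering step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy sequences, and the threshold value are hypothetical, and it assumes repeats are soft-masked (lowercase bases), which the response does not state. It only shows why a common minimum contig length changes the assembly length, and why repeat content computed on the raw assembly can differ from repeat content computed on the filtered one.

```python
def filter_contigs(contigs, min_len):
    """Keep only contigs whose length is at least min_len (in bp)."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_len}


def repeat_fraction(contigs):
    """Fraction of bases that are soft-masked (lowercase); assumed to mark repeats."""
    total = sum(len(seq) for seq in contigs.values())
    masked = sum(1 for seq in contigs.values() for base in seq if base.islower())
    return masked / total if total else 0.0


# Toy assembly (hypothetical): one short, fully repetitive contig among longer ones.
contigs = {"c1": "ACGTacgtAC", "c2": "acg", "c3": "ACGTAC"}
MIN_LEN = 5  # hypothetical threshold, far smaller than any real cut-off

filtered = filter_contigs(contigs, MIN_LEN)
raw_repeats = repeat_fraction(contigs)        # computed on the raw assembly
filtered_repeats = repeat_fraction(filtered)  # computed after length filtering
```

      In this toy case the short contig is entirely repetitive, so dropping it lowers the apparent repeat fraction, which is the kind of effect that computing repeat content on raw assembly lengths avoids.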

      I have considerable worries that the gene annotation methods could impart biases that significantly affect the main conclusions. Only 5 reference training sets were used for the Sordariomycetes and these are unequally distributed across the phylogeny. Augustus obviously performed less than ideally, as the authors reported that it under-annotated the genomes by 10%. I suspect it will have performed worse with increasing phylogenetic distance from the reference genomes. None of the species used for training were insect-associated, except for those generated by the authors for this study. As this feature was used to split the data it could impact the results. Some major results rely explicitly on having good gene annotations, like exon length, adding to these concerns. Looking manually at Table S1 at Ophiostoma, it does seem to be a general trend that the genomes annotated with Magnaporthe grisea have shorter exons than those annotated with H294. I also wonder if many of the trends evident in Figure 5 are also the result of these biases. Clades H1 and G each contain a species used in the training and have an increase in genes for example.

      We have applied 6 different reference training sets (instead of one) precisely to address the problem of increasing phylogenetic distance of annotated species. To further investigate the impact of the species chosen for training, we plotted five gene features (number of genes, number of introns, intron length, exon length, fraction of genes with introns) as a function of branch length distance from the species (or genus) used as a training set for annotation. We don’t see systematic biases across different training sets. However, trends are very clear for clades annotated with fusarium. This set of species includes Hypocreales and Microascales, which is indeed unfortunate since Microascales is an IA group and at the same time the most distant from the Fusarium genus in this set. To clarify whether this trend is related to annotation bias or is a biological trend, we compared gene annotations with those of Mycocosm, between Hypocreales Fusarium species, Hypocreales non-Fusarium species, and Microascales, and we observe exactly the same trends in all gene features.

      Similarly, among species that were annotated with magnaporthe_grisea, Ophiostomatales (another IA group) are among the most distant from the training set species. Here, however, another order, Diaporthales, is similarly distant, yet the two orders display different feature ranges. In terms of exon length, the top two species in this training set include Ophiostoma, and they reach exon lengths similar to those of the Ophiostoma species annotated using H294 as a training set. In summary, it is possible that the choice of annotation species has some effect on feature values; however, in this dataset, these biases are likely mitigated by biological differences among lifestyles and clades.

      Unfortunately, the genomes available from NCBI will vary greatly in the quality of their repeat masking. While some will have been masked using custom libraries generated with software like RepeatModeler, others will probably have been masked with public databases like Repbase. As public databases are again biased towards certain species (Fusarium is well represented in Repbase for example), this could have significant impacts on estimating repeat content. Additionally, even custom libraries can be problematic as some software (like RepeatModeler) will include multicopy host genes leading to bona fide genes being masked if proper filtering is not employed. A more consistent repeat masking pipeline would add to the robustness of the conclusions.

      We have searched for the same species in JGI Mycocosm and were able to retrieve 58 genome assemblies with matching species, with 19 of them belonging to the same strain as in our dataset. Overall we found no differences in genome assembly length. Interestingly, repeat content was slightly higher for NCBI genome assemblies compared to JGI Mycocosm assemblies, perhaps due to masking of host multicopy genes, as the reviewer mentioned. By comparing pathogenic and non-pathogenic species for the same 19 strains, we observe that JGI Mycocosm annotates fewer repeats in pathogenic species than Augustus annotations (but trends are similar when taking into account 58 matching species). Given a small number of samples, it is hard to draw any strong conclusions; however, the differences that we see are in favor of our general results showing no (or negative) correlation of repeat content with pathogenicity. 

      To a lesser degree, I wonder what impact the use of representative genomes for a species has on the analyses. Some species vary greatly in genome size, repeat content, and architecture among strains. I understand that it is difficult to address in this type of analysis, but it could be discussed.

      In our case, the use of protein sequences could underestimate divergence between closely related strains from the same species. We also excluded strains of the same species to avoid overrepresentation of closely related strains with similar lifestyle traits. We agree that some changes in the genome architecture can occur very rapidly, even at the species level, though analyzing the emergence of, e.g., pathogenicity at the population level would require a slightly different approach that accounts for population-level processes.

      Reviewer #2 (Public review):

      Summary:

      In this paper, the authors report on the genomic correlates of the transition to the pathogenic lifestyle in Sordariomycetes. The pathogenic lifestyle was found to be better explained by the number of genes, and in particular effectors and tRNAs, but this was modulated by the type of interacting host (insect or not insect) and the ability to be vectored by insects.

      Strengths:

      The main strength of this study lies in the size of the dataset, and the potentially high number of lifestyle transitions in Sordariomycetes.

      Weaknesses:

      The main strength of the study is not the clarity of the conclusions.

      (1) This is due firstly to the presentation of the hypotheses. The introduction is poorly structured and contradictory in some places. It is also incomplete since, for example, fungus-insect associations are not mentioned in the introduction even though they are explicitly considered in the analyses.

      We thank the reviewer for pointing this out. We strove to address all comments and suggestions of the reviewer to clarify the message and remove the contradictions. We also added information about why we included the insect-association trait in our analysis.

      (2) The lack of clarity also stems from certain biases that are challenging to control in microbial comparative genomics. Indeed, defining lifestyles is complicated because many fungi exhibit different lifestyles throughout their life cycles (for instance, symbiotic phases interspersed with saprotrophic phases). In numerous fungi, the lifestyle referenced in the literature is merely the sampling substrate (such as wood or dung), which doesn't mean that this substrate is a crucial aspect of the life cycle. This issue is discussed by the authors, but they do not eliminate the underlying uncertainties.

      We agree with the reviewer that lack of certainty in the lifestyle or range of possible lifestyles of studied species is a weakness in this analysis. We are limited by the information available in the literature. We hope that our study will increase interest in collecting such data in the future.

      Reviewer #3 (Public review):

      Summary:

      This important study combines comparative genomics with other validation methods to identify the factors that mediate genome size evolution in Sordariomycetes fungi and their relationship with lifestyle. The study provides insights into genome architecture traits in this Ascomycete group, finding that, rather than transposons, the size of their genomes is often influenced by gene gain and loss. With an excellent dataset and robust statistical support, this work contributes valuable insights into genome size evolution in Sordariomycetes, a topic of interest to both the biological and bioinformatics communities.

      Strengths:

      This study is complete and well-structured.

      Bioinformatics analysis is always backed by good sampling and statistical methods. Also, the graphic part is intuitive and complementary to the text.

      Weaknesses:

      The work is great in general, I just had issues with the Figure 1B interpretation.

      I struggled a bit to find the correspondence between this sentence: "Most genomic features were correlated with genome size and with each other, with the strongest positive correlation observed between the size of the assembly excluding repeats and the number of genes (Figure 1B)." and the Figure 1B. Perhaps highlighting the key p values in the figure could help.

      We thank the reviewer for pointing out this sentence. Perhaps the misunderstanding comes from the fact that in this sentence one variable is missing. The correct version should be “Most genomic features were correlated with genome size and with each other, with the strongest positive correlation observed between the genome size, the genome size excluding repeats and the number of genes (Figure 1B)”. Also, the variable names now correspond better to those shown on the figure.

      Reviewer #1 (Recommendations for the authors):

      The authors have clearly done a lot of good work, and I think this study is worthwhile. I understand that my concerns about the underlying data could necessitate rerunning the entire analysis with better gene models, but there may be another option. JGI has a fairly standard pipeline for gene and repeat annotation. Their gene predictions are based on RNA data from the sequenced strain and should be quite good in general. One could either compare the annotations from this manuscript to those in mycocosm for genomes that are identical and see if there are systematic biases, or rerun some analyses on a subset of genomes from mycocosm. Indeed, it's possible that the large dataset used here compensates for the above concerns, but without some attempt to evaluate these issues, it's difficult to have confidence in the results.

      We greatly appreciate the positive reception of our manuscript. Following the reviewer’s comments, we have investigated gene annotations in comparison with those of JGI Mycocosm, even though only 58 species were matching and only 19 of them were from the same strain. This dataset is not representative of the Sordariomycetes diversity (most species come from one clade) and therefore will not reflect the results we obtained in this study. Of note, the reason for not choosing JGI Mycocosm in the first place was the poor representation of the insect-associated species, which we found key in this study. In general, we found that assembly lengths were nearly identical, the number of genes was higher, and the repeat content was lower for the JGI Mycocosm dataset. When comparing different lifestyles (in particular pathogens vs. non-pathogens), we found the same differences for our and JGI Mycocosm annotations, with one exception being the repeat content. In the small subset (19 same-strain assemblies), our dataset showed the same level of repeats between the two lifestyles, whereas JGI Mycocosm showed lower repeat content for pathogens (but notably, for all 58 species, the trend was the same for our and JGI Mycocosm annotations). None of these observations are in conflict with our results, where we find no or negative association of repeat content with pathogens.

      The figures are very information-dense. While I accept that this is somewhat of a necessity for presenting this type of study, if the authors could summarize the important information in easier-to-interpret plots, that could help improve readability.

      We put a lot of effort into showing these complicated results in as approachable a manner as possible. Given that other reviewers find them intuitive, we decided to keep most of them as they are. To add more clarification, we added one supplementary figure showing distributions of genomic traits across lifestyles. Moreover, in Figure 5, a phylogenetic tree was added with the positions of selected clades, as well as a scatterplot showing distributions of mean values for genome size and number of genes for those clades. If the reviewer has any specific suggestions on what to improve and in which figure, we’re happy to consider them.

      Reviewer #2 (Recommendations for the authors):

      I have no major comments on the analyses, which have already been extensively revised. My major criticism is the presentation of the background, which is very insufficient to understand the importance or relevance of the results presented fully.

      Lines are not numbered, unfortunately, which will not help the reading of my review.

      (1) The introduction could better present the background and hypotheses:

      (a) After reading the introduction, I still didn't have a clear understanding of the specific 'genome features' the study focuses on. The introduction fails to clearly outline the current knowledge about the genetic basis of the pathogenic lifestyle: What is known, what remains unknown, what constitutes a correlation, and what has been demonstrated? This lack of clarity makes reading difficult.

      We thank the reviewer for pointing this out. We have now included in the introduction a list of genomic traits we focus on. We also tried to be more precise about demonstrated pathogenic traits and other correlated traits in the introduction. 

      (b) Page 3. « Various features of the genome have been implicated in the evolution of the pathogenic lifestyle. » The cited studies did not genuinely link genome features to lifestyle, so the authors can't use « implicated in » - correlation does not imply causation.

      This sentence also somehow contradicts the one at the end of the paragraph: « we still have limited knowledge of which genomic features are specific to pathogenic lifestyle ».

      We thank the reviewer for this comment. We added a phrase “correlated with or implicated in” and changed the last sentence of the paragraph into “Yet we still have limited knowledge of how important and frequent different genomic processes are in the evolution of pathogenicity across phylogenetically distinct groups of fungi and whether we can use genomic signatures left by some of these processes as predictors of pathogenic state.”.

      (c) Page 3: « Fungal pathogen genomes, and in particular fungal plant pathogen genomes have been often linked to large sizes with expansions of TEs, and a unique presence of a compartmentalized genome with fast and slow evolving regions or chromosomes » Do the authors really need to say « often »? Do they really know how often?

      We removed “often”.

      (d) « Such accessory genomic compartments were shown to facilitate the fast evolution of effectors (Dong, Raffaele, and Kamoun 2015) ». The cited paper doesn't « show » that genomic compartments facilitate the fast evolution of effectors. It's just an observation that there might be a correlation. It's an opinion piece, not a research manuscript.

      We changed the sentence to “Such accessory genomic compartments could facilitate the fast evolution of effectors”.

      (e) « even though such architecture can facilitate pathogen evolution, it is currently recognized more as a side effect of a species evolutionary history rather than a pathogenicity related trait ». This sentence somehow contradicts the following one: « Such accessory genomic compartments were shown to facilitate the fast evolution of effectors ».

      Here we wanted to point out that, even though accessory genome compartments and TE expansions can facilitate pathogen evolution, the origin of such architecture is not linked to pathogenicity. We reformulated the sentence to “Even though such architecture can facilitate pathogen evolution, it is currently recognized that its origin is more likely a side effect of a species evolutionary history rather than being caused by pathogenicity”.

      (f) « As the number of genes is strongly correlated with fungal genome size (Stajich 2017), such expansions could be a major contributor to fungal genome size. » This sentence suggests that pathogens might have bigger genomes because they have more effectors. This is contradictory to the sentence right after « At the end of the spectrum are the endoparasites Microsporidia, which have among the smallest known fungal genomes ».

      The authors state that pathogens have bigger genomes and then they take an example of a pathogen that has a minimal genome. I know it's probably because they lost genes following the transition to endoparasitism and not related to their capacity to cause disease. I just want to point out that their writing could be more precise. I invite authors to think of young scholars who are new to the field of fungal evolutionary genomics.

      We thank the reviewer for prompting us to clarify the text. We rewrote this short extract as follows “Notably, not all pathogenic species experience genome or gene expansions, or show compartmentalized genome architecture. While gene family expansions are important for some pathogens, the contrary can be observed in others, such as Microsporidia. Due to transition to obligatory intracellular lifestyle these fungi show signatures of strong genome contractions and reduced gene repertoire (Katinka et al. 2001) without compromising their ability to induce disease in the host. This raises questions about universal genomic mechanisms of transition to pathogenic state.”

      (g) I find it strange that the authors do not cite - and do not present the major results of two other studies that use the same type of approach and ask the same type of question in Sordariomycetes, although not focusing on pathogenicity:

      Hensen et al.: https://pubmed.ncbi.nlm.nih.gov/37820761/

      Shen et al.: https://pubmed.ncbi.nlm.nih.gov/33148650/

      We thank the reviewer for pointing out this omission. We have now added more information to the introduction to highlight the importance of the phylogenetic context in studying genome evolution, as demonstrated by these studies. The following passage was added to the introduction: “Other phylogenomic studies investigating a wide range of Ascomycete species, while not explicitly focusing on the neutral evolution hypothesis, have found strong phylogenetic signals in genome evolution, reflected in distinct genome characteristics (e.g., genome size, gene number, intron number, repeat content) across lineages or families (Shen et al. 2020; Hensen et al. 2023). Variation in genome size has been shown to correlate with the activity of the repeat-induced point mutation (RIP) mechanism (Hensen et al. 2023; Badet and Croll 2025), by which repeated DNA is targeted and mutated. RIP can potentially lead to a slower rate of emergence of new genes via duplication (Galagan et al. 2003), and hinder TE proliferation limiting genome size expansion (Badet and Croll 2025). Variation in genome dynamics across lineages has also been suggested to result from environmental context and lifestyle strategies (Shen et al. 2020), with Saccharomycotina yeast fungi showing reductive genome evolution and Pezizomycotina filamentous fungi exhibiting frequent gene family expansions. Given the strong impact of phylogenetic membership, demographic history (Ne) and host-specific adaptations of pathogens on their genomes, we reasoned that further examination of genomic sequences in groups of species with various lifestyles can generate predictions regarding the architecture of pathogenic genomes.”

      (h) Genome defense mechanisms against repeated elements, such as RIP, are not mentioned while they could have a major impact on genome size (Hensen et al cited above; Badet and Croll https://www.biorxiv.org/content/10.1101/2025.01.10.632494v1.full).

      This citation is added in the text above.

      (i) Should the reader assume that the genome features to be examined are those mentioned in the first paragraph or those in the penultimate one?

      In the last paragraph of the introduction we included the complete list of investigated genomic traits.

      (j) The insect-associated lifestyle is mentioned only in the research questions on page 4, but not earlier in the introduction. Why should we care about insect-associated fungi?

      We apologize for this omission. We added a sentence explaining how neutral evolution hypotheses can explain patterns of genome evolution in endoparasites and species with specialized vectors (traits present in insect-associated species) and added a sentence in the last paragraph that this is the reason why we have selected this trait for analysis.  

      (2) Why use concatenation to infer phylogeny?

      (a) Kapli et al. https://pubmed.ncbi.nlm.nih.gov/32424311/ « Analyses of both simulated and empirical data suggest that full likelihood methods are superior to the approximate coalescent methods and to concatenation »

      (b) It also seems that a homogeneous model was used, and not a partitioned model, while the latter are more powerful. Why?

      We thank the reviewer for the comment. When we reconstructed the phylogenetic tree, we were not aware of this publication, and we followed common practices from the literature for phylogenetic tree reconstruction, even though they are no longer regarded as optimal. In fact, in the first round of submission, we included a concatenation method, a multispecies coalescent method based on 1000 BUSCO sequences, and a concatenation method with different partitions for 250 BUSCO sequences. All three methods produced similar topologies. Since the results were concordant, we chose to omit these analyses from the manuscript to streamline the presentation and focus on the most important results.

      (3) Other comments:

      Is there a table listing lifestyles?

      Yes, lifestyles (pathogenicity and insect-association) are listed in Supplementary Table S1. 

      (4) Summary:

      (a) « seemingly similar pathogens »: meaning unclear; on what basis are they similar? why « seemingly »?

      We removed “seemingly” from the sentence.

      (b) Page 4: what's the difference between genome feature and genome trait?

      There is no difference. We apologize for the confusion. We changed “feature” to “trait” whenever it refers to the specific 13 genomic traits analyzed in this study.

      (c) Page 22: Braker, not Breaker

      Corrected.

      What do the authors mean when they write that genes were predicted with Augustus and Braker? Do they mean that the two sets of gene models were combined? Gene counts are based on Augustus (P24): why not Braker?

      We only meant here that gene annotation was performed using the Braker pipeline, which uses a particular version of Augustus. We corrected the sentence.

      (d) Figure 2B and 2C:

      'Undetermined sign' or 'Positive/Negative' would be better than « YES » or it's just impossible to understand the figure without reading the legend.

      We changed “YES” to “UNDETERMINED SIGN” as suggested by the reviewer.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In the current article, Octavia Soegyono and colleagues study "The influence of nucleus accumbens shell D1 and D2 neurons on outcome-specific Pavlovian instrumental transfer", building on extensive findings from the same lab. While there is a consensus about the specific involvement of the Shell part of the Nucleus Accumbens (NAc) in specific stimulus-based actions in choice settings (and not in General Pavlovian instrumental transfer - gPIT, as opposed to the Core part of the NAc), mechanisms at the cellular and circuitry levels remain to be explored. In the present work, using sophisticated methods (rat Cre-transgenic lines from both sexes, optogenetics, and the well-established behavioral paradigm outcome-specific PIT-sPIT), Octavia Soegyono and colleagues decipher the differential contribution of dopamine receptors D1 and D2 expressing spiny projection neurons (SPNs).

      After validating the viral strategy and the specificity of the targeting (immunochemistry and electrophysiology), the authors demonstrate that while both NAc Shell D1- and D2-SPNs participate in mediating sPIT, NAc Shell D1-SPNs projections to the Ventral Pallidum (VP, previously demonstrated as crucial for sPIT), but not D2-SPNs, mediates sPIT. They also show that these effects were specific to stimulus-based actions, as value-based choices were left intact in all manipulations.

      This is a well-designed study, and the results are well supported by the experimental evidence. The paper is extremely pleasant to read and adds to the current literature.

      We thank the Reviewer for their positive assessment. 

      Reviewer 2 (Public Review):

      Summary: 

      This manuscript by Soegyono et al. describes a series of experiments designed to probe the involvement of dopamine D1 and D2 neurons within the nucleus accumbens shell in outcome-specific Pavlovian-instrumental transfer (osPIT), a well-controlled assay of cue-guided action selection based on congruent outcome associations. They used an optogenetic approach to phasically silence NAc shell D1 (D1-Cre mice) or D2 (A2a-Cre mice) neurons during a subset of osPIT trials. Both manipulations disrupted cue-guided action selection but had no effects on negative control measures/tasks (concomitant approach behavior, separate valued guided choice task), nor were any osPIT impairments found in reporter-only control groups. Separate experiments revealed that selective inhibition of NAc shell D1 but not D2 inputs to ventral pallidum was required for osPIT expression, thereby advancing understanding of the basal ganglia circuitry underpinning this important aspect of decision making.

      Strengths: 

      The combinatorial viral and optogenetic approaches used here were convincingly validated through anatomical tract-tracing and ex vivo electrophysiology. The behavioral assays are sophisticated and well-controlled to parse cue and value-guided action selection. The inclusion of reporter-only control groups is rigorous and rules out nonspecific effects of the light manipulation. The findings are novel and address a critical question in the literature. Prior work using less decisive methods had implicated NAc shell D1 neurons in osPIT but suggested that D2 neurons may not be involved. The optogenetic manipulations used in the current study provide a more direct test of their involvement and convincingly demonstrate that both populations play an important role. Prior work had also implicated NAc shell connections to ventral pallidum in osPIT, but the current study reveals the selective involvement of D1 but not D2 neurons in this circuit. The authors do a good job of discussing their findings, including their nuanced interpretation that NAc shell D2 neurons may contribute to osPIT through their local regulation of NAc shell microcircuitry.

      We thank the Reviewer for their positive assessment. 

      Weaknesses: 

      The current study exclusively used an optogenetic approach to probe the function of D1 and D2 NAc shell neurons. Providing a complementary assessment with chemogenetics or other appropriate methods would strengthen conclusions, particularly the novel demonstration of D2 NAc shell involvement. Likewise, the null result of optically inhibiting D2 inputs to the ventral pallidum leaves open the possibility that a more complete or sustained disruption of this pathway may have impaired osPIT.

      We acknowledge the reviewer's valuable suggestion that demonstrating NAc-S D1- and D2-SPNs engagement in outcome-specific PIT through another technique would strengthen our optogenetic findings. Several approaches could provide this validation. Chemogenetic manipulation, as the reviewer suggested, represents one compelling option. Alternatively, immunohistochemical assessment of phosphorylated histone H3 at serine 10 (P-H3) offers another promising avenue, given its established utility in reporting striatal SPNs plasticity in the dorsal striatum (Matamales et al., 2020). We hope to complete such an assessment in future work since it would address the limitations of previous work that relied solely on ERK1/2 phosphorylation measures in NAc-S SPNs (Laurent et al., 2014). The manuscript was modified to report these future avenues of research (page 12).

      Regarding the null result from optical silencing of D2 terminals in the ventral pallidum, we agree with the reviewer's assessment. While we acknowledge this limitation in the current manuscript (page 13), we aim to address this gap in future studies to provide a more complete mechanistic understanding of the circuit.

      Reviewer 3 (Public Review):

      Summary:

      The authors present data demonstrating that optogenetic inhibition of either D1- or D2-MSNs in the NAc Shell attenuates expression of sensory-specific PIT while largely sparing value-based decision on an instrumental task. They also provide evidence that SS-PIT depends on D1-MSN projections from the NAc-Shell to the VP, whereas projections from D2-MSNs to the VP do not contribute to SS-PIT.

      Strengths:

      This is clearly written. The evidence largely supports the authors' interpretations, and these effects are somewhat novel, so they help advance our understanding of PIT and NAc-Shell function.

      We thank the Reviewer for their positive assessment. 

      Weaknesses:

      I think the interpretation of some of the effects (specifically the claim that D1-MSNs do not contribute to value-based decision making) is not fully supported by the data presented.

      We appreciate the reviewer's comment regarding the marginal attenuation of value-based choice observed following NAc-S D1-SPN silencing. While this manipulation did produce a slight reduction in choice performance, the behavior remained largely intact. We are hesitant to interpret this marginal effect as evidence for a direct role of NAc-S D1-SPNs in value-based decision-making, particularly given the substantial literature demonstrating that NAc-S manipulations typically preserve such choice behavior (Corbit et al., 2001; Corbit & Balleine, 2011; Laurent et al., 2012). Furthermore, previous work has shown that NAc-S D1 receptor blockade impairs outcome-specific PIT while leaving value-based choice unaffected (Laurent et al., 2014). We favor an alternative explanation for our observed marginal reduction. As documented in Supplemental Figure 1, viral transduction extended slightly into the nucleus accumbens core (NAc-C), a region established as critical for value-based decision-making (Corbit et al., 2001; Corbit & Balleine, 2011; Laurent et al., 2012; Parkes et al., 2015). The marginal impairment may therefore reflect inadvertent silencing of a small number of NAc-C D1-SPNs rather than a functional contribution from NAc-S D1-SPNs. Future studies specifically targeting larger NAc-C D1-SPN populations would help clarify this possibility and provide definitive resolution of this question.

      Reviewer 1 (Recommendations for the Author):

      My main concerns and comments are listed below.

      (1) Could the authors provide the "raw" data of the PIT tests, such as PreSame vs Same vs PreDifferent vs Different? Could the authors clarify how the Net responding was calculated? Was it Same minus PreSame & Different minus PreDifferent, or was the average of PreSame and PreDifferent used in this calculation?

      The raw data for PIT testing across all experiments are now included in the Supplemental Figures (Supplemental Figures S1E, S2E, S3E, and S4E). Baseline responding was quantified as the average number of lever presses per minute for both actions during the two-minute period (i.e., average of PreSame and PreDifferent) preceding each stimulus presentation. This methodology has been clarified in the revised manuscript (page 7).
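
      The baseline described above can be sketched as follows. This is only an illustrative sketch, not the authors' analysis code: it assumes that net responding is the rate during the stimulus minus the shared baseline, and the function name, argument names, and example rates are all hypothetical.

```python
from statistics import mean


def net_responding(pre_same, pre_different, same, different):
    """Return (net_same, net_different) in lever presses per minute.

    Assumption: the baseline is the mean of the two pre-stimulus rates
    (PreSame and PreDifferent), and net responding is the rate during
    the stimulus minus that shared baseline.
    """
    baseline = mean([pre_same, pre_different])
    return same - baseline, different - baseline


# Hypothetical rates (presses per minute), for illustration only.
net_same, net_different = net_responding(
    pre_same=4.0, pre_different=6.0, same=12.0, different=5.5
)
print(net_same, net_different)
```

      Using one shared baseline for both actions means that the difference between net same and net different responding equals the difference between the raw Same and Different rates.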

      (2) While both sexes are utilized in the current study, no statistical analysis is provided. Can the authors please comment on this point and provide these analyses (for both training and tests)?

      As noted in the original manuscript, the final sample sizes for female and male rats were insufficient to provide adequate statistical power for sex-based analyses (page 15). To address this limitation, we have now cited a previous study from our laboratory (Burton et al., 2014) that conducted such analyses with sufficient power in identical behavioural tasks. That study identified only marginal sex differences in performance, with female rats exhibiting slightly higher magazine entry rates during Pavlovian conditioning. Importantly, no differences were observed in outcome-specific PIT or value-based choice performance between sexes.

      (3) Regarding Figure 1 - Anterograde tracing in D1-Cre and A2a-Cre rats (from line 976), I have one major and one minor question:

      (3.1) I do not understand the rationale of showing anterograde tracing from the Dorsal Striatum (DS) as this region is not studied in the current work. Moreover, sagittal micrographs of D1-Cre and A2a-Cre would be relevant here. Could the authors please provide these micrographs and explain the rationale for doing tracing in DS?

      We included dorsal striatum (DS) tracing data as a reference because the projection patterns of D1 and D2 SPNs in this region are well-established and extensively characterized, in contrast to the more limited literature on these cell types in the NAc-S. Regarding the comment about sagittal micrographs, we are uncertain of the specific concern as these images are presented in Figure 1B.

      If the reviewer is requesting sagittal micrographs for NAc-S anterograde tracing, we did not employ this approach because: (1) the NAc-S and ventral pallidum are anatomically adjacent regions and (2) the medial-lateral coordinates of the ventral pallidum and lateral hypothalamus do not align optimally with those of the NAc-S, limiting the utility of sagittal analysis for these projections.

      (3.2) There is no description about how the quantifications were done: manually? Automatically? What script or plugin was used? If automated, what were the thresholding conditions? How many brain sections along the anteroposterior axis? What was the density of these subpopulations? Can the authors include a methodological section to address this point?

      We apologize for the omission of quantification methods used to assess viral transduction specificity. This methodological description has now been added to the revised manuscript (page 22). Briefly, we employed a manual procedure in two sections per rat, and cell counts were completed in a defined region of interest located around the viral infusion site.

      (4) Lex A & Hauber (2008) Dopamine D1 and D2 receptors in the nucleus accumbens core and shell mediate Pavlovian-instrumental transfer. Learning & memory 15:483- 491, should be cited and discussed. It also seems that the contribution of the main dopaminergic source of the brain, the ventral tegmental area, is not cited, while it has been investigated in PIT in at least 3 studies regarding sPIT only, notably the VP-VTA pathway (Leung & Balleine 2015, accurately cited already).

      We did not include the Lex & Hauber (2008) study because its experimental design (single lever and single outcome) prevents differentiation between the effects of Pavlovian stimuli on action performance (general PIT) versus action selection (outcome-specific PIT, as examined in the present study). Drawing connections between their findings and our results would require speculative interpretations regarding whether their observed effects reflect general or outcome-specific PIT mechanisms, which could distract from the core findings reported in the article.

      Several studies examining the role of the VTA in outcome-specific PIT were referenced in the manuscript's introduction. Following the reviewer's recommendation, these references have also been incorporated into the discussion section (page 13). 

      (5) While not directly the focus of this study, it would be interesting to highlight the accumbens dissociation between General vs Specific PIT, and how the dopaminergic system (differentially?) influences both forms of PIT.

      We agree with the reviewer that the double dissociation between nucleus accumbens core/shell function and general/specific PIT is an interesting topic. However, the present manuscript does not examine this dissociation, the nucleus accumbens core, or general PIT. Similarly, our study does not directly investigate the dopaminergic system per se. We believe that discussing these topics would distract from our core findings and substantially increase manuscript length without contributing novel data directly relevant to these areas. 

      (6) While authors indicate that conditioned responses to auditory stimuli (magazine visits) are preserved in all groups, suggesting intact sensitivity to the general motivational properties of reward-predictive stimuli (lines 344, 360), authors can't conclude about the specificity of this behavior, i.e. does the subject use a mental representation of O1 when experiencing S1, leading to a magazine visit to retrieve O1 (and same for S2-O2), or not? Two food ports would be needed to address this question; also, authors should comment on the fact that competition between instrumental & Pavlovian responses does not explain the deficits observed.

      We agree with the Reviewer that magazine entry data cannot be used to draw conclusions about specificity, and we do not make such claims in our manuscript. We are therefore unclear about the specific concern being raised. Following the Reviewer’s recommendation, we have commented on the fact that response competition could not explain the results obtained (page 11, see also supplemental discussion). 

      The minor comments are listed below.

      (7) A high number of rats were excluded (> 32 total), and the number of rats excluded for NAc-S D1-SPNs-VP is not indicated.

      We apologize for omitting the number of rats excluded from the experiment examining NAc-S D1-SPN projections to the ventral pallidum. This information has been added to the revised manuscript (page 22).

      (7.1) Can authors please comment on the elevated number of exclusions?

      A total of 133 rats were used across the reported experiments, with 40 rats excluded based on post-mortem analyses. This represents an attrition rate of approximately 30%, which we consider reasonable given that most animals received two separate viral infusions and two separate fiber-optic cannula implantations, and that the inclusion of both female and male rats contributed to some variability in coordinates and, in turn, in targeting.

      (7.2) Can authors please present the performance of these animals during the tasks (OFF conditions, and for control ones, both ON & OFF conditions)?

      Rats were excluded after assessing the spread of viral infusions, placement of fibre-optic cannulas and potential damage due to the surgical procedures (page 21). The requested data are presented below and plotted in the same manner as in Figures 3-6. The pattern of performance in excluded animals was highly variable. 

      Author response image 1.

       

      (8) For tracing, only males were used, and for electrophysiology, only females were used.

      (8.1) Can authors please comment on not using both sexes in these experiments? 

      We agree that equal allocation of female and male rats in the experiments presented in Figures 1-2 would have been preferable. Animal availability was the sole factor determining these allocations. Importantly, both female and male D1-Cre and A2A-Cre rats were used for the NAc-S tracing studies, and no sex differences were observed in the projection patterns. The article describing the two transgenic lines of rats did not report any sex difference (Pettibone et al., 2019).

      (8.2) Is there evidence in the literature that the electrophysiological properties of female versus male SPNs could differ?

      The literature indicates that there is no sex difference in the electrophysiological properties of NAc-S SPNs (Cao et al., 2018; Willett et al., 2016).

      (8.3) It seems like there is a discrepancy between the number of animals used as presented in the Figure 2 legend versus what is described in the main text. In the Figure legend, I understand that 5 animals were used for D1-Cre/DIO-eNpHR3.0 validation, and 7 animals for A2a-Cre/DIO-eNpHR3.0; however, the main text indicates the use of a total of 8 animals instead of the 12 presented in the Figure legend. Can authors please address this mismatch or clarify?

      The number of rats reported in the main text and Figure 2 legend was correct. However, recordings sometimes involved multiple cells from the same animal, and this aspect of the data was incorrectly reported and generated confusion. We have clarified the numbers in both the main text and Figure 2 legend to distinguish between animal counts and cell counts. 

      (9) Overall, in the study, have the authors checked for outliers?

      Performance across all training and testing stages was inspected to identify potential behavioral outliers in each experiment. Abnormal performance during a single session within a multi-session stage was not considered sufficient grounds for outlier designation. Based on these criteria, no subjects remaining after post-mortem analyses exhibited performance patterns warranting exclusion through statistical outlier analysis. However, we have conducted the specific analyses requested by the Reviewer, as described below.

      (9.1) In Figure 3, it seems that one female in the eYFP group, in the OFF situation, for the different condition, has a higher level of responding than the others. Can authors please confirm or refute this visual observation with the appropriate statistical analysis?

      Statistical analysis (z-score) confirmed the reviewer's observation regarding responding of the different action in the OFF condition for this subject (|z| = 2.58). Similar extreme responding was observed in the ON condition (|z| = 2.03). Analyzing responding on the different action in isolation is not informative in the context of outcome-specific PIT. Additional analyses revealed |z| < 2 when examining the magnitude of choice discrimination in outcome-specific PIT (i.e., net same versus net different responding) in both ON and OFF conditions. Furthermore, this subject showed |z| < 2 across all other experimental stages. Based on these analyses, we conclude that the subject should be kept in all analyses.
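
      A z-score screen of this kind can be sketched as below. This is an illustration under stated assumptions rather than the authors' procedure: the response does not say whether the sample or population standard deviation was used, or which cutoff defined an outlier, so the sample standard deviation and the |z| >= 2 cutoff are assumptions, and the function names and data are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize each value against the group mean and sample SD.

    Assumption: the sample standard deviation (n - 1 denominator) is
    used; the response does not state which estimator was applied.
    """
    group_mean = mean(values)
    group_sd = stdev(values)
    return [(value - group_mean) / group_sd for value in values]


def flag_outliers(values, threshold=2.0):
    """Return the indices of values whose |z| meets or exceeds threshold.

    Assumption: |z| >= 2 is the cutoff, inferred from the response
    describing |z| < 2 as normal responding.
    """
    return [i for i, z in enumerate(z_scores(values)) if abs(z) >= threshold]


# Hypothetical lever-press rates (presses per minute) for one group.
rates = [3.0, 4.0, 3.5, 4.5, 3.8, 4.2, 3.6, 12.0]
print(flag_outliers(rates))
```

      With small groups, a single extreme value inflates the standard deviation it is judged against, which limits how large |z| can become; this is one reason to examine several measures rather than relying on one condition.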

      (9.2) In Figure 5, it seems that one male, in the ON situation, in the different condition, has a quite higher level of responding - is this subject an outlier? If so, how does it affect the statistical analysis after being removed? And who is this subject in the OFF condition?

      The reviewer has identified two different male rats infused with the eNpHR3.0 virus and has asked for a closer examination of their performance.

      The first rat showed outlier-level responding on the different action in the ON condition (|z| = 2.89) but normal responding for all other measures across LED conditions (|z| < 2). Additional analyses revealed |z| = 2.55 when examining choice discrimination magnitude in outcome-specific PIT during the ON condition but not during the OFF condition (|z| = 0.62). This subject exhibited |z| < 2 across all other experimental stages.

      The second rat showed outlier-level responding on the same action in the OFF condition (|z| = 2.02) but normal responding for all other measures across LED conditions (|z| < 2). Additional analyses revealed |z| = 2.12 when examining choice discrimination magnitude in outcome-specific PIT during the OFF condition but not during the ON condition (|z| = 0.67). This subject also exhibited |z| < 2 across all other experimental stages.

      We excluded these two subjects and conducted the same analyses as described in the original manuscript. Baseline responding did not differ between groups (p = 0.14), allowing us to examine the net effect of the stimuli. Overall lever presses were greater in the eYFP rats (Group: F(1,16) = 6.08, p < 0.05; η² = 0.28) and were reduced by LED activation (LED: F(1,16) = 9.52, p < 0.01; η² = 0.44) and this reduction depended on the group considered (Group x LED: F(1,16) = 12.125, p < 0.001; η² = 0.43). Lever press rates were higher on the action earning the same outcome as the stimuli compared to the action earning the different outcome (Lever: F(1,16) = 49.32; η² = 0.76; p < 0.001), regardless of group (Group x Lever: p = 0.14). There was a Lever by LED light condition interaction (Lever x LED: F(1,16) = 5.25; η² = 0.24; p < 0.05) but no interaction between group, LED light condition, and Lever during the presentation of the predictive stimuli (p = 0.10). Given the significant Group x LED and Lever x LED interactions, additional analyses were conducted to determine the source of these interactions. In eYFP rats, LED activation had no effect (LED: p = 0.70) and lever presses were greater on the same action (Lever: F(1,9) = 23.94, p < 0.001; η² = 0.79) regardless of LED condition (LED x Lever: p = 0.72). By contrast, in eNpHR3.0 rats, lever presses were reduced by LED activation (LED: F(1,9) = 23.97, p < 0.001; η² = 0.73), were greater on the same action (Lever: F(1,9) = 16.920, p < 0.001; η² = 0.65) and the two factors interacted (LED x Lever: F(1,9) = 9.12, p < 0.01; η² = 0.50). These rats demonstrated outcome-specific PIT in the OFF condition (F(1,9) = 27.26, p < 0.001; η² = 0.75) but not in the ON condition (p = 0.08).

      Overall, excluding these two rats altered the statistical analyses, but both the original and revised analyses yielded the same outcome: silencing the NAc-S D1-SPN to VP pathway disrupted PIT. More importantly, we do not believe there are sufficient grounds to exclude the two rats identified by the reviewer. These animals did not display outlier-level responding across training stages or during the choice test. Their potential classification as outliers would be based on responding during only one LED condition and not the other, with notably opposite patterns between the two rats despite belonging to the same experimental group.

      (10) I think it would be appreciable if in the cartoons from Figure 5.A and 6.A, the SPNs neurons were color-coded as in the results (test plots) and the supplementary figures (histological color-coding), such as D1- in blue & D2-SPNs in red.

      Our current color-coding system uses blue for D1-SPNs transduced with eNpHR3.0 and red for D2-SPNs transduced with eNpHR3.0. The D1-SPNs and D2-SPNs shown in Figures 5A and 6A represent cells transduced with either eYFP (control) or eNpHR3.0 virus and therefore cannot be assigned the blue or red color, which is reserved for eNpHR3.0-transduced cells specifically. The micrographs in the Supplemental Figures maintain consistency with the color-coding established in the main figures.

      (11) As there are (relatively small) variations in the control performance in term of Net responding (from ~3 to ~7 responses per min), I wonder what would be the result of pooling eYFP groups from the two first experiments (Figures 3 & 4) and from the two last ones (Figures 5 & 6) - would the same statically results stand or vary (as eYFP vs D1-Cre vs A2a-Cre rats)? In particular for Figures 3 & 4, with and without the potential outlier, if it's indeed an outlier.

      We considered the Reviewer’s recommendation but do not believe the requested analysis is appropriate. The Reviewer is requesting the pooling of data from subjects of distinct transgenic strains (D1-Cre and A2A-Cre rats) that underwent surgical and behavioral procedures at different time points, sometimes months apart. Each experiment was designed with necessary controls to enable adequate statistical analyses for testing our specific hypotheses.

      (12) Presence of cameras in operant cages is mentioned in methods, but no data is presented regarding recordings, though authors mention that they allow for real-time observations of behavior. I suggest removing "to record" or adding a statement about the fact that no videos were recorded or used in the present study.

      We have removed “to record” from the manuscript (page 18). 

      (13) In all supplementary Figures, "F" is wrongly indicated as "E".

      We thank the Reviewer for reporting these errors, which have been corrected. 

      (14) While the authors acknowledge that the efficacy of optogenetic inhibition of terminals is questionable, I think that more details are required to address this point in the discussion (existing literature?). Maybe, the combination of an anterograde tracer from SPNs to VP, to label VP neurons (to facilitate patching these neurons), and the Cre-dependent inhibitory opsin in the NAc Shell, with optogenetic illumination at the level of the VP, along with electrophysiological recordings of VP neurons, could help address this question but may, reasonably, seem challenging technically.

      Our manuscript does not state that optogenetic inhibition of terminals is questionable. It acknowledges that we do not provide any evidence about the efficacy of the approach. Regardless, we have provided additional details and suggestions to address this lack of evidence (page 13).

      (15) A nice addition could be an illustration of the proposed model (from line 374), but it may be unnecessary.

      We have carefully considered the reviewer's recommendation. The proposed model is detailed in three published articles, including one that is freely accessible, which we have cited when presenting the model in our manuscript (page 14). This reference should provide interested readers with easy access to a comprehensive illustration of the model.

      Reviewer 2 (Recommendations for the Author):

      As noted in my public comments, this is a truly excellent and compelling study. I have only a few minor comments.

      (1) I could not find the coordinates/parameters for the dorsal striatal AAV injections for that component of the tract tracing experiment.

      We apologize for this omission, which has now been corrected (page 16). 

      (2) Please add the final group sizes to the figure captions.

      We followed the Reviewer’s recommendation and added group sizes in the main figure captions. 

      (3) The discussion of group exclusions (p 21 line 637) seems to accidentally omit (n = X) the number of NAc-S D1-SPNs-VP mice excluded.

      We apologize for this omission, which has now been corrected (page 22). 

      (4) There were some labeling issues in the supplementary figures (perhaps elsewhere, too). Specifically, panel E was listed twice (once for F) in captions.

      We apologize for this error, which has now been corrected.  

      (5) Inspection of the magazine entry data from PIT tests suggests that the optogenetic manipulations may have had some effects on this behavior and would encourage the authors to probe further. There was a significant group difference for D1-SPN inhibition and a marginal group effect for D2-SPNs. The fact that these effects were in opposite directions is intriguing, although not easily interpreted based on the canonical D1/D2 model. Of course, the effects are not specific to the light-on trials, but this could be due to carryover into light-off trials. An analysis of trial-order effects seems crucial for interpreting these effects. One might also consider normalizing for pre-test baseline performance. Response rates during Pavlovian conditioning seem to suggest that D2-eNpHR mice showed slightly higher conditioned responding during training, which contrasts with their low entry rates at test. I don't see any of this as problematic -- but more should be done to interpret these findings.

      We thank the reviewer for raising this interesting point regarding magazine entry rates. Since these data are presented in the Supplemental Figures, we have added a section in the Supplemental Material file that elaborates on these findings. This section does not address trial order effects, as trial order was fully counterbalanced in our experiments and the relevant statistical analyses would lack adequate power. Baseline normalization was not conducted because the reviewer's suggestion was based on their assumption that eNpHR3.0 rats in the D2-SPNs experiment showed slightly higher magazine entries during Pavlovian training. However, this was not the case. In fact, like the eNpHR3.0 rats in the D1-SPNs experiment, they tended to display lower magazine entries during training. The added section therefore focuses on the potential role of response competition during outcome-specific PIT tests. Although we concluded that response competition cannot explain our findings, we believe it may complicate interpretation of magazine entry behavior. Thus, we recommend that future studies examine the role of NAc-S SPNs using purely Pavlovian tasks. It is worth noting that we have recently completed experiments (unpublished) examining NAc-S D1- and D2-SPN silencing during stimulus presentation in a Pavlovian task identical to the one used here. Silencing of either SPN population had no effect on magazine entry behavior.

      Reviewer 3 (Recommendations for the Author):

      Broad comments:

      Throughout the manuscript, the authors draw parallels between the effect established via pharmacological manipulations and those shown here with optogenetic manipulation. I understand using the pharmacological data to launch this investigation, but these two procedures address very different physiological questions. In the case of a pharmacological manipulation, the targets are receptors, wherever they are expressed, and in the case of D2 receptors, this means altering function in both pre-synaptically expressed autoreceptors and post-synaptically expressed D2 MSN receptors. In the case of an optogenetic approach, the target is a specific cell population with a high degree of temporal control. So I would just caution against comparing results from these types of studies too closely.

      Related to this point is the consideration of the physiological relevance of the manipulation. Under normal conditions, dopamine acts at D1-like receptors to increase the probability of cell firing via Ga signaling. In contrast, dopamine binding of D2-like receptors decreases the cell's firing probability (signaling via Gi/o). Thus, shunting D1-MSN activation provides a clear impression of the role of these cells and, putatively, the role of dopamine acting on these cells. However, inhibiting D2-MSNs more closely mimics these cells' response to dopamine (though optogenetic manipulations are likely far more impactful than Gi signaling). All this is to say that when we consider the results presented here in Experiment 2, it might suggest that during PIT testing, normal performance may require a halting of DA release onto D2-MSNs. This is highly speculative, of course, just a thought worth considering.

      We agree with the comments made by the Reviewer, and the original manuscript included statements acknowledging that pharmacological approaches are limited in their capacity to inform about the function of NAc-S SPNs (pages 4 and 9). As noted by the Reviewer, these limitations are especially salient when considering NAc-S D2-SPNs. Based on the Reviewer’s comment, we have modified our discussion to further underscore these limitations (page 12). Finally, we agree with the suggestion that PIT may require a halting of DA release onto D2-SPNs. This is consistent with the model presented, whereby D2-SPN function is required to trigger enkephalin release (page 13).

      Section-Specific Comments and Questions:

      Results:

      Anterograde tracing and ex vivo cell recordings in D1 Cre and A2a Cre rats: Why are there no statistics reported for the e-phys data in this section? Was this merely a qualitative demonstration? I realize that the A2a-Cre condition only shows 3 recordings, so I appreciate the limitations in analyzing the data presented.

      The reviewer is correct that we initially intended to provide a qualitative demonstration. However, we have now included statistical analyses for the ex vivo recordings. It is important to note that there were at least 5 recordings per condition, though overlapping data points may give the impression of fewer recordings in certain conditions. We have provided the exact number of recordings in both the main text (page 5) and figure legend. 

      What does trial by trial analysis look like, because in addition to the effects of extinction, do you know if the responsiveness of the opsin to light stimulation is altered after repeated exposures, or whether the cells themselves become compromised in any way with repeated light-inhibition, particularly given the relatively long 2m duration of the trial.

      The Reviewer raises an interesting point, and we provide complete trial-by-trial data for each experiment below. As identified by the Reviewer, there is some evidence for extinction, although it remained modest. Importantly, the data suggest that light stimulation did not affect the physiology of the targeted cells. In eNpHR3.0 rats, performance across OFF trials remained stable (both for Same and Different) even though they were preceded by ON trials, indicating no carryover effects from optical stimulation.

      Author response image 2.

       

      The statistics for the choice test are not reported for eNpHR-D1-Cre rats, but do show a weakening of the instrumental devaluation effect "Group x Lever x LED: F1,18 = 10.04, p < 0.01, η<sup>2</sup> = 0.36". The post hoc comparisons showed that all groups showed devaluation, but it is evident that there is a weakening of this effect when the LED was on (η<sup>2</sup> = 0.41) vs off (η<sup>2</sup> = 0.78), so I think the authors should soften the claim that NAcS-D1s are not involved in value-based decision-making. (Also, there is a typo in the legend in Figure S1, where the caption for panel "F" is listed as "E".) I also think that this could be potentially interesting in light of the fact that with circuit manipulation, this same weakening of the instrumental devaluation effect was not observed. To me, this suggests that D1-NAcS that project to a different region (not VP) contribute to value-based decision making.

      This comment overlaps with one made in the Public Review, for which we have already provided a response. Given its importance, we have added a section addressing this point in the supplemental discussion of the Supplementary Material file, which aligns with the location of the relevant data. The caption labelling error has been corrected.

      Materials and Methods:

      Subjects:

      Were these heterozygous or homozygous rats? If hetero, what rats were used for crossbreeding (sex, strain, and vendor)? Was genotyping done by the lab or outsourced to commercial services? If genotyping was done within the lab, please provide a brief description of the protocol used. How was food restriction established and maintained (i.e., how many days to bring weights down, and was maintenance achieved by rationing or by limiting ad lib access to food for some period in the day)?

      The information requested by the Reviewer has been added to the subjects section (pages 15-16).

      Were rats pair/group housed after implantation of optic fibers?

      We have clarified that rats were group housed throughout (see subjects section; pages 15-16).

      Behavioral Procedures:

      How long did each 0.2ml sucrose infusion take? For pellets, for each US delivery, was it a single pellet or two in quick succession?

      We have modified the method section to indicate that the sucrose was delivered across 2 seconds and that a single pellet was provided (page 17). 

      The CS to ITI duration ratio is quite low. Is there a reason such a short ratio was used in training?

      These parameters are those used in all our previous experiments on outcome-specific PIT. There is no specific reason for using such a ratio, except that it shortens the length of the training session. 

      Relative to the end of training, when were the optical implantation surgeries conducted, and how much recovery time was given before initiating reminder training and testing?

      Fibre-optic implantation was conducted 3-4 days after training and another 3-4 days were given for recovery. This has been clarified in the Materials and methods section (pages 15-16).

      I think a diagram or schematic showing the timeline for surgeries, training, and testing would be helpful to the audience.

      We opted for a text-based experimental timeline rather than a diagram due to slight temporal variations across experiments (page 15).

      On trials, when the LED was on, was light delivered continuously or pulsed? Do these opto-receptors 'bleach' within such a long window?

      We apologize for the lack of clarity; the light was delivered continuously. We have modified the manuscript (pages 6 and 19) and figure legend accordingly. The postmortem analysis did not provide evidence for photobleaching (Supplemental Figures) and as noted above, the behavioural results do not indicate any negative physiological impact on cell function.  

      Immunofluorescence: The blocking solution used during IHC is described as "NHS"; is this normal horse serum?

      The Reviewer is correct; NHS stands for normal horse serum. This has been added (page 21). 

      Microscopy and imaging:

      For the description of rats excluded due to placement or viral spread problems, an n=X is listed for the NAc S D1 SPNs --> VP silencing group. Is this a typo, or was that meant to read as n=0? Also, was there a major sex difference in the attrition rate? If so, I think reporting the sex of the lost subjects might be beneficial to the scientific community, as it might reflect a need for better guidance on sex-specific coordinates for targeting small nuclei.

      We apologize for the error regarding the number of excluded animals. This error has been corrected (page 23). There were no major sex differences in the attrition rate. The manuscript has been updated to provide information about the sex of excluded animals (page 23).

      References

      Cao, J., Willett, J. A., Dorris, D. M., & Meitzen, J. (2018). Sex Differences in Medium Spiny Neuron Excitability and Glutamatergic Synaptic Input: Heterogeneity Across Striatal Regions and Evidence for Estradiol-Dependent Sexual Differentiation. Front Endocrinol (Lausanne), 9, 173. https://doi.org/10.3389/fendo.2018.00173

      Corbit, L. H., Muir, J. L., & Balleine, B. W. (2001). The role of the nucleus accumbens in instrumental conditioning: Evidence of a functional dissociation between accumbens core and shell. J Neurosci, 21(9), 3251-3260. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=11312310&retmode=ref&cmd=prlinks

      Corbit, L. H., & Balleine, B. W. (2011). The general and outcome-specific forms of Pavlovian-instrumental transfer are differentially mediated by the nucleus accumbens core and shell. J Neurosci, 31(33), 11786-11794. https://doi.org/10.1523/JNEUROSCI.2711-11.2011

      Laurent, V., Bertran-Gonzalez, J., Chieng, B. C., & Balleine, B. W. (2014). δ-Opioid and Dopaminergic Processes in Accumbens Shell Modulate the Cholinergic Control of Predictive Learning and Choice. J Neurosci, 34(4), 1358-1369. https://doi.org/10.1523/JNEUROSCI.4592-13.2014

      Laurent, V., Leung, B., Maidment, N., & Balleine, B. W. (2012). μ- and δ-opioid-related processes in the accumbens core and shell differentially mediate the influence of reward-guided and stimulus-guided decisions on choice. J Neurosci, 32(5), 1875-1883. https://doi.org/10.1523/JNEUROSCI.4688-11.2012

      Matamales, M., McGovern, A. E., Mi, J. D., Mazzone, S. B., Balleine, B. W., & Bertran-Gonzalez, J. (2020). Local D2- to D1-neuron transmodulation updates goal-directed learning in the striatum. Science, 367(6477), 549-555. https://doi.org/10.1126/science.aaz5751

      Parkes, S. L., Bradfield, L. A., & Balleine, B. W. (2015). Interaction of insular cortex and ventral striatum mediates the effect of incentive memory on choice between goal-directed actions. J Neurosci, 35(16), 6464-6471. https://doi.org/10.1523/JNEUROSCI.4153-14.2015

      Pettibone, J. R., Yu, J. Y., Derman, R. C., Faust, T. W., Hughes, E. D., Filipiak, W. E., Saunders, T. L., Ferrario, C. R., & Berke, J. D. (2019). Knock-In Rat Lines with Cre Recombinase at the Dopamine D1 and Adenosine 2a Receptor Loci. eNeuro, 6(5). https://doi.org/10.1523/ENEURO.0163-19.2019

      Willett, J. A., Will, T., Hauser, C. A., Dorris, D. M., Cao, J., & Meitzen, J. (2016). No Evidence for Sex Differences in the Electrophysiological Properties and Excitatory Synaptic Input onto Nucleus Accumbens Shell Medium Spiny Neurons. eNeuro, 3(1), ENEURO.0147-15.2016. https://doi.org/10.1523/ENEURO.0147-15.2016

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Reviews):

      Summary: 

      Here, Millet et al. consider whether the nematode C. elegans 'discounts' the value of reward due to effort in a manner similar to that shown in other species, including rodents and humans. They designed a T-maze effort choice paradigm inspired by previous literature, but manipulated how effortful the food is to consume. C. elegans worms were sensitive to this novel manipulation, exhibiting effort-discounting-like behaviour that could be shaped by varying the density of food at each alternative in order to calculate an indifference point. This discounting-like behaviour was related to worms' rates of patch leaving, which differed between the low and high effort patches in isolation. The authors also found a potential relationship to dopamine signalling, and also that this discounting behaviour was not specific to lab-based strains of C. elegans.

      Strengths: 

      The question is well-motivated, and the approach taken here is novel. The authors are careful in their approach to altering and testing the properties of the effortful, elongated bacteria. Similarly, they go to some effort to understand what exactly is driving behavioural choices in this context, both through the application of simple standard models of effort discounting and a kinetic analysis of patch leaving. The comparisons to various dopamine mutants further extend the translational potential of their findings. I also appreciate the comparison to natural isolate strains, as the question of whether this behaviour may be driven by some sort of strain-specific adaptation to the environment is not regularly addressed in mammalian counterparts. The manuscript is well-written, and the figures are clear and comprehensible. 

      Weaknesses: 

      Discounting is typically defined as the alteration of a subjective value by effort (or time, risk, etc.), which is then used to guide future decision-making. By adapting the standard t-maze task for C. elegans as a patch-leaving paradigm, the authors observe behaviour strongly consistent with discounting models, but that is likely driven by a different process, in particular by an online estimate of the type of food in the current patch, which then influences patch-leaving dynamics (Figure 3). This is fundamentally different from decision-making strategies relating to effort that have been described in the rodent and human literatures. 

      We agree that in our study worms are likely making an on-line estimate of food quality in the current patch, but we wish to point out that rodents and humans also use on-line estimates in some significant effort-discounting paradigms. With respect to rodents, we call attention to effort discounting studies involving the widely used progressive ratio task (references in Discussion). In this task, animals can either lever-press for a preferred food or consume a less preferred food that is freely available nearby. However, the number of lever presses required to obtain preferred food increases as a function of the cumulative number of lever presses until the effort-cost of obtaining preferred food becomes too high and the animal switches to a freely available food. In essence, the lever and the freely available food are patches and the animal decides whether or not to leave the “lever” patch. It seems inescapable that the progressive ratio task involves an on-line assessment of the cost/benefit relationship associated with lever pressing. With respect to humans, one highly cited study (reference in Discussion) presented participants with a series of virtual apple trees. They could see how many apples are in the current tree and how much effort (squeezing a handgrip) is required to gather them. Their task was to decide whether or not to gather apples from that tree based on the perceived cost and benefit. Thus, on-line estimation is a common strategy used by animals and humans as shown in the effort discounting literature. We now make this point in the Discussion section titled A model of effort-discounting like behavior.
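
      The on-line cost/benefit logic of the progressive ratio task described above can be illustrated with a toy sketch. The function name, the linear effort cost, and all parameter values are hypothetical and are not taken from the cited studies.

```python
def ratios_completed(reward_value, free_food_value, cost_per_press,
                     first_requirement=1, step=1, max_ratios=1000):
    """Toy progressive ratio schedule: count ratios completed before switching.

    The animal keeps lever-pressing while the net value of the preferred food
    (reward minus effort cost of the current requirement) exceeds the value of
    the freely available food. The requirement grows after every reward.
    """
    requirement = first_requirement
    completed = 0
    while completed < max_ratios:
        net_value = reward_value - cost_per_press * requirement
        if net_value <= free_food_value:
            break  # effort cost is now too high: leave the "lever" patch
        completed += 1
        requirement += step  # progressive ratio: the next reward costs more presses
    return completed


# Hypothetical values chosen only to show the switch point.
print(ratios_completed(reward_value=10.0, free_food_value=2.0, cost_per_press=1.0))
```

      In this sketch the decision is re-evaluated before every ratio, which is the sense in which the estimate is made on-line rather than from a stored value.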

      Similarly, the calculation of indifference points at the group instead of at the individual level also suggests a different underlying process and limits the translational potential of their findings. The authors do not discuss the implications of these differences or why they chose not to attempt a more analogous trial-based experiment.  

      It is not clear to us why changing the read-out from the individual level to the population level necessarily suggests that a different biological mechanism is at work. In our view, there is one mechanism and it can be seen from different perspectives (e.g., individual vs population). Furthermore, the analogous trial-based experiment, as we understand it, would be to record behavior one worm at a time in the T-maze. This design is not practical because it entails recording a large number of single worms in the T-maze for 60 min each.

      In the case of both the dopamine and natural isolate experiments, the data are very noisy despite large (relative to other C. elegans experiments) sample sizes. In the dopamine experiment, disruption of dop-1, dop-2, and cat-2 had no statistically significant effect. There do not appear to be any corrections for multiple comparisons, and the single significant comparison, for dop-3, had a small effect size.

      An ANOVA followed by a Dunnett test was used to test differences between groups in Fig. 4 and 5. The Dunnett test is a multiple comparison test comparing experimental groups to a single control group. It is used to minimize type I error while maintaining statistical power and does not require further correction for multiple comparisons. We have clarified the use of the Dunnett test in the statistical table. The effect size for dop-3 is 0.5 (Cohen’s d), which is typically interpreted as a medium, not small, effect size (e.g., Cohen, Psychological Bulletin, 1992, Vol. 112, No. 1, 155-159).
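
      As a rough illustration of the effect size mentioned above, the sketch below computes Cohen's d with a pooled standard deviation and labels it using Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large). The data and function names are hypothetical, and the pooled-variance formula is an assumption about how d was computed.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # variance() is the sample variance (n - 1 in the denominator).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def cohen_label(d):
    """Conventional benchmarks (Cohen, 1992): 0.2 small, 0.5 medium, 0.8 large."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Hypothetical preference indices, for illustration only (not data from the study).
control = [0.2, 0.4, 0.6, 0.8]
mutant = [0.1, 0.3, 0.5, 0.7]
d = cohens_d(control, mutant)
print(round(d, 2), cohen_label(d))
print(cohen_label(0.5))  # the value reported for dop-3
```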

      More detailed behavioural analyses on both these and the wild isolate strains, for example by applying their kinetic analysis, would likely give greater insight as to what is driving these inconsistent effects. 

      More detailed behavioral analysis could reveal why we observe a difference in effort discounting in some strains and not others. However, it is not obvious what type of behavioral analysis would be needed to differentiate between pleiotropic effects of the mutations/natural isolates and more specific effects on effort discounting. A simple kinetic analysis in particular may not be enough to reveal relevant differences between mutants/natural isolates. For this reason, we think that such experiments may be better suited for future follow up studies.

      Reviewer #2 (Public Reviews)

      Summary: 

      Millet et al. show that C. elegans systematically prefers easy-to-eat bacteria but will switch its choice when harder-to-eat bacteria are offered at higher densities, producing indifference points that fit standard economic discounting models. Detailed kinetic analysis reveals that this bias arises from unchanged patch-entry rates but significantly elevated exit rates on effortful food, and dop-3 mutants lose the preference altogether, implicating dopamine in effort sensitivity. These findings extend effort-discounting behavior to a simple nematode, pushing the phylogenetic boundary of economic cost-benefit decision-making.

      Strengths: 

      (1) Extends the well-characterized concept of effort discounting into C. elegans, setting a new phylogenetic boundary and opening invertebrate genetics to economic-behavior studies.

      (2) Elegant use of cephalexin-elongated bacteria to manipulate "effort" without altering nutritional or olfactory cues, yielding clear preference reversals and reproducible indifference points. 

      (3) Application of standard discounting models to predict novel indifference points is both rigorous and quantitatively satisfying, reinforcing the interpretation of worm behavior in economic terms. 

      (4) The three-state patch-model cleanly separates entry and exit dynamics, showing that increased leaving rates, rather than altered re-entry, drive choice biases.

      (5) Investigates the role of dopamine in this behavior to try to establish shared mechanisms with vertebrates. 

      (6) Demonstration of discounting in wild strain (solid evidence). 

      Weaknesses: 

      (1) The kinetic model omits rich trajectory details, such as turning angles or hazard functions, that could distinguish a bona fide roaming transition from other exit behaviors.

      The overarching goal of the present paper was to develop a simple model for effort discounting in a small, genetically tractable organism. Accordingly, we focused on quantitative assays that are easy to implement and analyze. The patch-leaving assay and its associated kinetic analysis are one such assay. To keep things simple in this assay, we counted the number of transitions between the three states shown in Fig. 3A. We chose not to analyze the data in terms of turning angles or hazard functions because the metrics we developed seemed sufficient. Finally, we note that there are new modeling data showing that the presumptive transitions into the roaming state can be explained in terms of a one-state stochastic model in which there is no discrete roaming state (eLife. 2025 Jul 30;14:RP104972. doi: 10.7554/eLife.104972. PMID: 40736321).
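
      The transition counting described in this response can be sketched as follows. The state labels, the example record, and the definition of exit rate (exits per unit time spent in the patch state) are assumptions made for illustration; the actual states are those defined in Fig. 3A of the manuscript.

```python
from collections import Counter


def count_transitions(states):
    """Count transitions between consecutive, distinct states in one worm's record."""
    counts = Counter()
    for previous, current in zip(states, states[1:]):
        if previous != current:  # frames where the state is unchanged are not transitions
            counts[(previous, current)] += 1
    return counts


def exit_rate(states, patch_state, frame_seconds):
    """Exits from patch_state per second spent in patch_state (assumed definition)."""
    time_in_patch = sum(1 for s in states[:-1] if s == patch_state) * frame_seconds
    exits = sum(1 for a, b in zip(states, states[1:]) if a == patch_state and b != patch_state)
    return exits / time_in_patch if time_in_patch else float("nan")


# Hypothetical state labels and record, one entry per video frame.
record = ["on_food", "on_food", "border", "on_food", "border", "off_food", "off_food"]
transitions = count_transitions(record)
print(transitions[("on_food", "border")])
print(exit_rate(record, "on_food", frame_seconds=1.0))
```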

      (2) Only dop-3 shows an effect, and the statistical validity of this result is questionable. It is not clear if the authors corrected for multiple comparisons, and the effect size is quite small and noisy, given the large number of worms tested. Other mutants do not show effects. Given these two concerns, the role of dopamine in C. elegans effort discounting was unconvincing. 

      An ANOVA followed by a Dunnett test was used to test statistical significance in figures 4 and 5 (see above for a discussion of these tests). We believe this approach is rigorous, and the use of these tests is statistically valid. We note that the effect size for this comparison was medium.

      (3) With only five wild isolates tested (and variable data quality), it's hard to conclude that effort discounting isn't a lab-strain artifact or how broadly it varies in natural populations. 

      The fact that four of the five natural isolates tested display levels of effort discounting similar to N2 (only one natural isolate does not display effort discounting) argues against effort discounting being a laboratory adaptation. We have nevertheless weakened the claim regarding natural isolates. We now say effort discounting-like behavior may not be an adaptation to the laboratory environment.

      (4) Detailed analysis of behavior beyond preference indices would strengthen the dopamine link and the claim of effort discounting in wild strains. 

      Going beyond preference in the behavioral analysis might or might not reveal new phenotypes that strengthen the link with dopamine. At present, however, we think such experiments are beyond the scope of the paper.

      (5) A few mechanistic statements (e.g., tying satiety exclusively to nutrient signals) would benefit from explicit citations or brief clarifications for non-worm specialists. 

      We are unable to identify a mechanistic statement tying satiety to nutrient signals in our manuscript.

      Reviewer #3 (Public Reviews)

      Summary: 

      The authors establish a behavioral task to explore effort discounting in C. elegans. By using bacterial food that takes longer to consume, the authors show that, for equivalent effort, as measured by pumping rate, they obtain less food, as measured by fat deposition. The authors formalize the task by applying a formal neuroeconomic decision-making model that includes value, effort, and discounting. They use this to estimate the discounting that C. elegans applies based on ingestion effort by using a population-level 2-choice T-maze. They then analyze the behavioral dynamics of individual animals transitioning between on-food and off-food states. Harder to ingest bacteria led to increased food patch leaving. Finally, they examined a set of mutants defective in different aspects of dopamine signaling, as dopamine plays a key role in discounting in vertebrates and regulates certain aspects of C. elegans foraging.

      Strengths: 

      The behavioral experiments and neuroeconomic analysis framework are compelling, interesting, and make a significant contribution to the field. While these foraging behaviors have been extensively studied, few include clearly articulated theoretical models to be tested. 

      Demonstrating that C. elegans effort discounting fits model predictions and has stable indifference points is important for establishing these tasks as a model for decision making. 

      Weaknesses: 

      The dopamine experiments are harder to interpret. The authors point out the perplexing lack of an effect of dat-1 and cat-2. dop-3 leads to general indifference. I am not sure this is the expected result if the argument is a parallel functional role to discounting in vertebrates. dop-3 causes a range of locomotor phenotypes and may affect feeding (reduced fat storage), and thus, there may be a general defect in the ability to perform the task rather than anything specific to discounting.

      That said, some of the other DA mutants also have locomotor defects and do not differ from N2. But there is no clear result here - my concern is that global mutants in such a critical pathway exhibit such pleiotropy that it's difficult to conclude there is a clear and specific role for DA in effort discounting. This would require more targeted or cell-specific approaches. 

      We agree with the reviewer that the results of the dopamine experiments are puzzling, and getting a better understanding of the role of dopamine in effort discounting will require more sensitive assays and different experimental approaches (e.g. cell-specific rescues). However, as mentioned by the reviewer, all the mutations tested have some pleiotropic effects, yet only dop-3 displays a defect in effort discounting. This, in our opinion, points to a specific role of dop-3 in effort discounting in C. elegans. This point is now made in the Discussion in the section titled Role of dopamine signaling in effort discounting-like behavior.

      Meanwhile, there are other pathways known to affect responses to food and patch leaving decisions: serotonin, pigment-dispersing factor, tyramine, etc. The paper would have benefited from a clarification about why these were not considered as promising candidates to test (in addition to or instead of dopamine). 

      We focused on DA because of its well-established effect on effort discounting in rodents.

      Testing other pathways is a goal for future research.

      Reviewer #1 (Recommendations for the authors):

      The current results are more a reframing of data gathered from a patch-leaving paradigm, but described in the form of economic choice modelling in which discounting is one possible explanation. One more parsimonious explanation is that worms estimate in real-time some rate of reward and leave the patch at some threshold, consistent with canonical foraging models, previous experiments in C. elegans, and the authors' own data (Figure 3). Therefore, I am wary about some of the claims made in this manuscript, such as 'decision-making strategies based on effort-cost trade-offs are evolutionarily conserved'. 

      These points are now addressed in the Discussion in a revised section titled A model of effort discounting-like behavior. (i) We now call attention to the fact that our T-maze assay is a patch-leaving foraging paradigm. (ii) We now propose a revised model in which “worms make an on-line assessment of food value in the current patch which in turn alters patch-leaving dynamics, increasing the exit rates from cephalexin-treated patches as shown in Figure 3.” (iii) We now provide evidence from the rodent and human literature that the strategy of on-line assessment of reward value may be evolutionarily conserved in the case of a class of effort discounting tasks whose solution requires on-line assessments. 

      If the reason the authors chose to do a patch-leaving style task rather than a traditional t-maze is because C. elegans is unable to retain the sort of information necessary to make such simultaneous decisions - e.g., if pre-training on the two options isn't possible - then this in itself suggests that mechanisms underlying these decisions in worms and mammals are unlikely to be the same. I mention this because I would like to suggest to the authors an alternative interpretation: that patch foraging is actually 'the' canonical computation that translates across species. This would, in fact, be nicely consistent with some other recent modelling work in humans, e.g., https://www.biorxiv.org/content/10.1101/2025.05.06.652482v1

      Please see the previous response.

      Reviewer #2 (Recommendations for the authors):

      Can you provide a picture of the regular and CEPH bacteria? 

      Done (see Figure 1––figure supplement 1).

      Reviewer #3 (Recommendations for the authors):

      I would recommend testing representative mutants in other pathways in the choice task. If possible, more targeted experiments with dop-3, including either cell-specific KOs or rescues, would very much strengthen this aspect of the paper. 

      While valuable, these experiments are out of scope for the present study.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Bansal et al. present a study on the fundamental blood and nectar feeding behaviors of the critical disease vector, Anopheles stephensi. The study encompasses not just the fundamental changes in blood feeding behaviors of the crucially understudied vector, but then uses a transcriptomic approach to identify candidate neuromodulation pathways which influence blood feeding behavior in this mosquito species. The authors then provide evidence through RNAi knockdown of candidate pathways that the neuromodulators sNPF and Rya modulate feeding either via their physiological activity in the brain alone or through joint physiological activity along the brain-gut axis (but critically not the gut alone). Overall, I found this study to be built on tractable, well-designed behavioral experiments.

      Their study begins with a well-structured experiment to assess how the feeding behaviors of A. stephensi change over the course of its life history and in response to its age, mating, and oviposition status. The authors are careful and validate their experimental paradigm in the more well-studied Ae. aegypti, and are able to recapitulate the results of prior studies, which show that mating is a prerequisite for blood feeding behaviors in Ae. aegypti. Here they find A. stephensi, like other Anopheline mosquitoes, has a more nuanced regulation of its blood and nectar feeding behaviors.

      The authors then go on to show in a Y-maze olfactometer that, to some degree, changes in blood feeding status depend on behavioral modulation to host cues, and this is not likely to be a simple change to the biting behaviors alone. I was especially struck by the swap in valence of the host cues for the blood-fed and mated individuals, which had not yet oviposited. This indicates that there is a change in behavior that is not simply desensitization to host cues while navigating in flight, but something much more exciting is happening.

      The authors then use a transcriptomic approach to identify candidate genes in the blood-feeding stages of the mosquito's life cycle to identify a list of 9 candidates that have a role in regulating the host-seeking status of A. stephensi. Then, through investigations of gene knockdown of candidates, they identify the dual action of RYa and sNPF as candidate neuromodulators of host-seeking in this species. Overall, I found the experiments to be well-designed. I found the molecular approach to be sound. While I do not think the molecular approach is necessarily an all-encompassing mechanism identification (owing mostly to the fact that genetic resources are not yet available in A. stephensi as they are in other dipteran models), I think it sets up a rich line of research questions for the neurobiology of mosquito behavioral plasticity and comparative evolution of neuromodulator action.

      We appreciate the reviewer’s detailed summary of our work. We thank them for their positive comments and agree with them on the shortcomings of our approach.

      Strengths:

      I am especially impressed by the authors' attention to small details in the course of this article. As I read and evaluated this article, I continued to think about how many crucial details could potentially have been missed if this had not been the approach. The attention to detail paid off in spades and allowed the authors to carefully tease apart molecular candidates of blood-seeking stages. The authors' top-down approach to identifying RYamide and sNPF starting from first principles behavioral experiments is especially comprehensive. The results from both the behavioral and molecular target studies will have broad implications for the vectorial capacity of this species and comparative evolution of neural circuit modulation.

      We really appreciate that the reviewer has recognised the attention to detail we have tried to bring to this work. Thank you!

      Weaknesses:

      There are a few elements of data visualizations and methodological reporting that I found confusing on a first few read-throughs. Figure 1F, for example, was initially confusing as it made it seem as though there were multiple 2-choice assays for each of the conditions. I would recommend removing the "X" marker from the x-axis to indicate the mosquitoes did not feed from either nectar, blood, or neither in order to make it clear that there was one assay in which mosquitoes had access to both food sources, and the data quantify if they took both meals, one meal, or no meals.

      We thank the reviewer for flagging the schematic in figure 1F. As suggested, we have removed the “X” markers from the x-axis and revised the axis label from “choice of food” to “choice made” to better reflect what food the mosquitoes chose in the assay. For clarity, we have now also plotted the same data as stacked graphs at the bottom of Fig. 1F, which clearly shows the proportion of mosquitoes fed on each particular choice. We avoid the stacked graph as the sole representation of this data, as it does not capture the variability in the data.

      I would also like to know more about how the authors achieved tissue-specific knockdown for RNAi experiments. I think this is an intriguing methodology, but I could not figure out from the methods why injections either had whole-body or abdomen-specific knockdown.

      The tissue-specific knockdown (abdomen only or abdomen+head) emerged from initial standardisations where we were unable to achieve knockdown in the head unless we used higher concentrations of dsRNA and did the injections in older females. We realised that this gave us the opportunity to isolate the neuronal contribution of these neuropeptides in the phenotype produced. Further optimisations revealed that injecting dsRNA into 0-10h old females produced abdomen-specific knockdowns without affecting head expression, whereas injections into 4-day-old females resulted in knockdowns in both tissues. Moreover, head knockdowns in older females required higher dsRNA concentrations, with knockdown efficiency correlating with the amount injected. In contrast, abdominal knockdowns in younger females could be achieved even with lower dsRNA amounts.

      We have mentioned the knockdown conditions (time of injection and the amount of dsRNA injected) for tissue-specific knockdowns in the methods but realise now that this does not explain the approach well enough. We have now edited it to state our methodology more clearly (see lines 932-948).

      I also found some interpretations of the transcriptomic data to be overly broad for what transcriptomes can actually tell us about the organism's state. For example, the authors mention, "Interestingly, we found that after a blood meal, glucose is neither spent nor stored, and that the female brain goes into a state of metabolic 'sugar rest', while actively processing proteins (Figure S2B, S3)".

      This would require a physiological measurement to actually know. It certainly suggests that there are changes in carbohydrate metabolism, but there are too many alternative interpretations to make this broad claim from transcriptomic data alone.

      We thank the reviewer for pointing this out and agree with them. We have now edited our statement to read:

      “Instead, our data suggests altered carbohydrate metabolism after a blood meal, with the female brain potentially entering a state of metabolic 'sugar rest' while actively processing proteins (Figure S2B, S3). However, physiological measurements of carbohydrate and protein metabolism will be required to confirm whether glucose is indeed neither spent nor stored during this period.” See lines 271-277.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Bansal et al examine and characterize feeding behaviour in Anopheles stephensi mosquitoes. While sharing some similarities to the well-studied Aedes aegypti mosquito, the authors demonstrate that mated females, but not unmated (virgin) females, exhibit suppression in their blood-feeding behaviour. Using brain transcriptomic analysis comparing sugar-fed, blood-fed, and starved mosquitoes, several candidate genes potentially responsible for influencing blood-feeding behaviour were identified, including two neuropeptides (short NPF and RYamide) that are known to modulate feeding behaviour in other mosquito species. Using molecular tools, including in situ hybridization, the authors map the distribution of cells producing these neuropeptides in the nervous system and in the gut. Further, by implementing systemic RNA interference (RNAi), the study suggests that both neuropeptides appear to promote blood-feeding (but do not impact sugar feeding), although the impact was observed only after both neuropeptide genes underwent knockdown.

      Strengths and/or weaknesses:

      Overall, the manuscript was well-written; however, the authors should review carefully, as some sections would benefit from restructuring to improve clarity. Some statements need to be rectified as they are factually inaccurate.

      Below are specific concerns and clarifications needed in the opinion of this reviewer:

      (1) What does "central brains" refer to in abstract and in other sections of the manuscript (including methods and results)? This term is ambiguous, and the authors should more clearly define what specific components of the central nervous system was/were used in their study.

      Central brain, or mid brain, is a commonly used term to refer to brain structures/neuropils without the optic lobes (For example: https://www.nature.com/articles/s41586-024-07686-5). In this study we have focused our analysis on the central brain circuits involved in modulating blood-feeding behaviour and have therefore excluded the optic lobes. As optic lobes account for nearly half of all the neurons in the mosquito brain (https://pmc.ncbi.nlm.nih.gov/articles/PMC8121336/), including them would have disproportionately skewed our transcriptomic data toward visual processing pathways.

      We have indicated this in figure 3A and in the methods (see lines 800-801, 812). We have now also clarified it in the results section for neuro-transcriptomics to avoid confusion (see lines 236-237).

      (2) The abstract states that two neuropeptides, sNPF and RYamide are working together, but no evidence is summarized for the latter in this section.

      We thank the reviewer for pointing this out. We have now added a statement “This occurs in the context of the action of RYa in the brain” to the end of the abstract, for a complete summary of our proposed model.

      (3) Figure 1

      Panel A: This should include mating events in the reproductive cycle to demonstrate differences in the feeding behavior of Ae. aegypti.

      Our data suggest that mating can occur at any time between eclosion and oviposition in An. stephensi and between eclosion and blood feeding in Ae. aegypti. Adding these into (already busy) 1A, would cloud the purpose of the schematic, which is to indicate the time points used in the behavioural assays and transcriptomics.

      Panel F: In treatments where insects were not provided either blood or sugar, how is it that some females and males had fed? Also, it is unclear why the y-axis label is % fed when the caption indicates this is a choice assay. Also, it is interesting that sugar-starved females did not increase sugar intake. Is there any explanation for this (was it expected)?

      We apologise for the confusion. The experiment is indeed a choice assay in which sugar-starved or sugar-sated females, co-housed with males, were provided simultaneous access to both blood and sugar, and were assessed for the choice made (indicated on the x-axis): both blood and sugar, blood only, sugar only, or neither. The x-axis indicates the choice made by the mosquitoes, not the choice provided in the assay, and the y-axis indicates the percentage of males or females that made each particular choice. We have now removed the “X” markers from the x-axis and revised the axis label from “choice of food” to “choice made” to better reflect what food the mosquitoes chose to take.

      In this assay, we scored females only for the presence or absence of each meal type (blood or sugar) and are therefore unable to comment on whether sugar-starved females consumed more sugar than sugar-sated females. However, when sugar-starved, a higher proportion of females consumed both blood and sugar, while fewer fed on blood alone.

      For clarity, we have now also plotted the same data as stacked graphs at the bottom of Fig. 1F, which clearly shows the proportion of mosquitoes fed on each particular choice. We avoid the stacked graph as the sole representation of this data as it does not capture the variability in the data.

      (4) Figure 3

      In the neurotranscriptome analysis of the (central) brain involving the two types of comparisons, can the authors clarify what "excluded in males" refers to? Does this imply that only genes not expressed in males were considered in the analysis? If so, what about co-expressed genes that have a specific function in female feeding behaviour?

      This is indeed correct. We reasoned that since blood feeding is exclusive to females, we should focus our analysis on genes that were specifically upregulated in them. As the reviewer points out, it is very likely that genes commonly upregulated in males and females may also promote blood feeding and we will miss out on any such candidates based on our selection criteria.

      (5) Figure 4

      The authors state that there is more efficient knockdown in the head of unfed females; however, this is not accurate since they only get knockdown in unfed animals, and no evidence of any knockdown in fed animals (panel D). This point should be revised in the results text as well.

      Perhaps we do not understand the reviewer’s point or there has been a misunderstanding. In figure 4D, we show that while there is more robust gene knockdown in unfed females, blood-fed females also showed modest but measurable knockdowns ranging from 5-40% for RYamide and 2-21% for sNPF.

      Relatedly, blood-feeding is decreased when both neuropeptide transcripts are targeted compared to uninjected (panel C) but not compared to dsGFP injected (panel E). Why is this the case if authors showed earlier in this figure (panel B) that dsGFP does not impact blood feeding?

      We realise this concern stems from our representation of the data. Since we had earlier determined that dsGFP-injected females fed similarly to uninjected females (fig 4B), we used these controls interchangeably in subsequent experiments. To avoid confusion, we have now only used the label ‘control’ in figure 4 (and supplementary figure S9) and specified which control was used for each experiment in the legend.

      In addition to this, we wanted to clarify that fig 4C and 4E are independent experiments. 4C is the behaviour corresponding to when the neuropeptides were knocked down in both heads and abdomens. 4E is the behaviour corresponding to when the neuropeptides were knocked down in only the abdomens. We have now added a schematic in the plots to make this clearer.

      In addition, do the uninjected and dsGFP-injected relative mRNA expression data reflect combined RYa and sNPF levels? Why is there no variation in these data,…

      In these qPCRs, we calculated relative mRNA expression using the delta-delta Ct method (see line 975). For each neuropeptide its respective control was used. For simplicity, we combined the RYa and sNPF control data into a single representation. The value of this control is invariant because this method sets the control baseline to a value of 1.
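      As a minimal sketch of why the control is invariant under this method, the standard delta-delta Ct calculation can be written as below. It assumes 100% amplification efficiency (a fold change of 2 per cycle); the function name and all Ct values are hypothetical illustrations, not data from the study.

```python
def relative_expression(ct_target_sample: float, ct_ref_sample: float,
                        ct_target_control: float, ct_ref_control: float) -> float:
    """Relative mRNA level of a sample versus its control (delta-delta Ct).

    Assumes perfect doubling per PCR cycle (efficiency = 2), which is an
    idealisation; real assays may need an efficiency correction.
    """
    # Normalise the target gene to the reference (housekeeping) gene
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    # Compare the normalised sample to the normalised control
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values for illustration only
control = relative_expression(24.0, 18.0, 24.0, 18.0)    # control against itself
knockdown = relative_expression(25.0, 18.0, 24.0, 18.0)  # target appears one cycle later
print(control, knockdown)
```

      Because the control is compared against itself, its delta-delta Ct is zero and its relative expression is 1 by construction, so it carries no variance in this representation. A sample whose target amplifies one cycle later than the control corresponds to half the transcript level under the stated efficiency assumption.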

      …and how do transcript levels of RYa and sNPF compare in the brain versus the abdomen (the presentation of data doesn't make this relationship clear).

      The reviewer is correct in pointing out that we have not clarified this relationship in our current presentation. While we have not performed absolute mRNA quantifications, we extracted relative mRNA levels from qPCR data of 96h old unmanipulated control females. We observed that both sNPF and RYa transcripts are expressed at much lower levels in the abdomens, as compared to those in the heads, as shown in the graphs inserted below.

      Author response image 1.

      (6) As an overall comment, the figure captions are far too long and include redundant text presented in the methods and results sections.

      We thank the reviewer for flagging this and have now edited the legends to remove redundancy.

      (7) Criteria used for identifying neuropeptides promoting blood-feeding: statement that reads "all neuropeptides, since these are known to regulate feeding behaviours". This is not accurate since not all neuropeptides govern feeding behaviors, while certainly a subset do play a role.

      We agree with the reviewer that not all neuropeptides regulate feeding behaviours. Our statement refers to the screening approach we used: in our shortlist of candidates, we chose to validate all neuropeptides.

      (8) In the section beginning with "Two neuropeptides - sNPF and RYa - showed about 25% and 40% reduced mRNA levels...", the authors state that there was no change in blood-feeding and later state the opposite. The wording should be clarified as it is unclear.

      Thank you for pointing this out. We were referring to an unchanged proportion of the blood fed females. We have now edited the text to the following:

      “Two neuropeptides - sNPF and RYa - showed about 25% and 40% reduced mRNA levels in the heads but the proportion of females that took blood meals remained unchanged”. See lines 338-340.

      (9) Just before the conclusions section, the statement that "neuropeptide receptors are often ligand promiscuous" is unjustified. Indeed, many studies have shown in heterologous systems that high concentrations of structurally related peptides, which are not physiologically relevant, might cross-react and activate a receptor belonging to a different peptide family; however, the natural ligand is often many times more potent (in most cases, orders of magnitude) than structurally related peptides. This is certainly the case for various RYamide and sNPF receptors characterized in various insect species.

      We agree with the reviewer and apologise for the mistake. We have now removed the statement.

      (10) Methods

      In the dsRNA-mediated gene knockdown section, the authors could more clearly describe how much dsRNA was injected per target. At the moment, the reader must carry out calculations based on the concentrations provided and the injected volume range provided later in this section.

      We have now edited the section to reflect the amount of dsRNA injected per target. Please see lines 921-931.

      It is also unclear how tissue-specific knockdown was achieved by performing injection on different days/times. The authors need to explain/support, and justify how temporal differences in injection lead to changes in tissue-specific expression. Does the blood-brain barrier limit knockdown in the brain instead, while leaving expression in the peripheral organs susceptible?

      To achieve tissue-specific knockdowns of sNPF and RYa, we optimised both the time of injection as well as the dsRNA concentration to be injected. Injecting dsRNA into 0-10h females produced abdomen-specific knockdowns without affecting head expression, whereas injections into 96h old females resulted in knockdowns in both tissues. Head knockdowns in older females required higher dsRNA concentrations, with knockdown efficiency correlating with the amount injected. In contrast, abdominal knockdowns in younger females could be achieved even with lower dsRNA amounts, reflecting the lower baseline expression of sNPF in abdomens compared to heads and the age-dependent increase in head expression (as confirmed by qPCR). It is possible that the blood-brain barrier also limits the dsRNA entering the brain, thereby requiring higher amounts to be injected for head knockdowns.

      We have now edited this section to state our methodology more clearly (see lines 932-948).

      For example, in Figure 4, the data support that knockdown in the head/brain is only effective in unfed animals compared to uninjected animals, while there is no evidence of knockdown in the brain relative to dsGFP-injected animals. Comparatively, the data appear to show stronger evidence of abdominal knockdown, mostly for the RYa transcript (>90%) but still significant for the sNPF transcript (>60%).

      As we explained earlier, this concern likely stems from our representation of the data. Since we had earlier determined that dsGFP-injected females fed similarly to uninjected females (fig 4B), we used these controls interchangeably in subsequent experiments. To avoid confusion, we have now only used the label ‘control’ in figure 4 (and supplementary figure S9) and specified which control was used for each experiment in the legend.

      In addition to this, we wanted to clarify that fig 4C and 4E are independent experiments. 4C is the behaviour corresponding to when the neuropeptides were knocked down in both heads and abdomens. 4E is the behaviour corresponding to when the neuropeptides were knocked down in only the abdomen. We have now added a schematic in the plots to make this clearer.

      Reviewer #3 (Public review):

      Summary:

      This manuscript investigates the regulation of host-seeking behavior in Anopheles stephensi females across different life stages and mating states. Through transcriptomic profiling, the authors identify differential gene expression between "blood-hungry" and "blood-sated" states. Two neuropeptides, sNPF and RYamide, are highlighted as potential mediators of host-seeking behavior. RNAi knockdown of these peptides alters host-seeking activity, and their expression is anatomically mapped in the mosquito brain (sNPF and RYamide) and midgut (sNPF only).

      Strengths:

      (1) The study addresses an important question in mosquito biology, with relevance to vector control and disease transmission.

      (2) Transcriptomic profiling is used to uncover gene expression changes linked to behavioral states.

      (3) The identification of sNPF and RYamide as candidate regulators provides a clear focus for downstream mechanistic work.

      (4) RNAi experiments demonstrate that these neuropeptides are necessary for normal host-seeking behavior.

      (5) Anatomical localization of neuropeptide expression adds depth to the functional findings.

      Weaknesses:

      (1) The title implies that the neuropeptides promote host-seeking, but sufficiency is not demonstrated (for example, with peptide injection or overexpression experiments).

      Demonstrating sufficiency would require injecting sNPF peptide or its agonist. To date, no small-molecule agonists (or antagonists) that selectively mimic sNPF or RYa neuropeptides have been identified in insects. An NPY analogue, TM30335, has been reported to activate the Aedes aegypti NPY-like receptor 7 (NPYLR7; Duvall et al., 2019), which is also activated by sNPF peptides at higher doses (Liesch et al., 2013). Unfortunately, the compound is no longer available because its manufacturer, 7TM Pharma, has ceased operations. Synthesising the peptides is a possibility that we will explore in the future.

      (2) The proposed model regarding central versus peripheral (gut) peptide action is inconsistently presented and lacks strong experimental support.

      The best way to address this would be to conduct tissue-specific manipulations, the tools for which are not available in this species. Our approach to achieve head+abdomen and abdomen only knockdown was the closest we could get to achieving tissue specificity and allowed us to confirm that knockdown in the head was necessary for the phenotype. However, as the reviewer points out, this did not allow us to rule out any involvement of the abdomen. This point has been addressed in lines 364-371.

      (3) Some conclusions appear premature based on the current data and would benefit from additional functional validation.

      The most definitive way of demonstrating necessity of sNPF and RYa in blood feeding would be to generate mutant lines. While we are pursuing this line of experiments, they lie beyond the scope of a revision. In its absence, we relied on the knockdown of the genes using dsRNA. We would like to posit that despite only partial knockdown, mosquitoes do display defects in blood-feeding behaviour, without affecting sugar-feeding. We think this reflects the importance of sNPF in promoting blood feeding.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Overall, I found this manuscript to be well-prepared, visually the figures are great and clearly were carefully thought out and curated, and the research is impactful. It was a wonderful read from start to finish. I have the following recommendations:

      Thank you very much, we are very pleased to hear that you enjoyed reading our manuscript!

      (1) For future manuscripts, it would make things significantly easier on the reviewer side to submit a format that uses line numbers.

      We sincerely apologise for the oversight. We have now incorporated line numbers in the revised manuscript.

      (2) There are a few statements in the text that I think may need clarification or might be outside the bounds of what was actually studied here. For example, in the introduction "However, mating is dispensable in Anophelines even under conditions of nutritional satiety". I am uncertain what is meant by this statement - please clarify.

      We apologise for the lack of clarity in the statement and have now deleted it since we felt it was not necessary.

      (3) Typo/Grammatical minutiae:

      a) A small idiosyncrasy of using hyphens in compound words should also be fixed throughout. Typically, you don't hyphenate if the words are being used as a noun, as in the case: e.g. "Age affects blood feeding.". However, you would hyphenate if the two words are used as a compound adjective "Age affects blood-feeding behavior". This may not be an all-inclusive list, but here are some examples where hyphens need to either be removed or added. Some examples:

      "Nutritional state also influences other internal state outputs on blood-feeding": blood-feeding -> blood feeding

      "... the modulation of blood-feeding": blood-feeding -> blood feeding

      "For example, whether virgin females take blood-meals...": blood-meals -> blood meals

      ".... how internal and external cues shape meal-choice"-> meal choice

      "blood-meal" is often used throughout the text, but is correctly "blood meal" in the figures.

      There are many more examples throughout.

      We apologise for these errors and appreciate the reviewer’s keen eye. We have now fixed them throughout the manuscript.

      b) Figure 1 Caption has a typo: "co-housed males were accessed for sugar-feeding" should be "co-housed males were assessed for sugar feeding"

      We apologise for the typo and thank the reviewer for spotting it. We have now corrected this.

      c) It would be helpful in some other figure captions to more clearly label which statement is relevant to which part of the text. For example, in Figure 4's caption.

      "C,D. Blood-feeding and sugar-feeding behaviour of females when both RYa and sNPF are knocked down in the head (C). Relative mRNA expressions of RYa and sNPF in the heads of dsRYa+dssNPF - injected blood-fed and unfed females, as compared to that in uninjected females, analysed via qPCR (D)."

      I found re-referencing C and D at the end of their statements makes it look as though C precedes the "Relative mRNA expression" and on a first read-through, I thought the figure captions were backwards. I'd recommend reformatting here and throughout consistently to only have the figure letter precede its relevant caption information, e.g.:

      "C. Blood-feeding and sugar-feeding behaviour of females when both RYa and sNPF are knocked down in the head. D. Relative mRNA expressions of RYa and sNPF in the heads of dsRYa+dssNPF - injected blood-fed and unfed females, as compared to that in uninjected females, analysed via qPCR."

      We have now edited the legends as suggested.

      Reviewer #2 (Recommendations for the authors):

      Separately from the clarifications and limitations listed above, the authors could strengthen their study and the conclusions drawn if they could rescue the behavioural phenotype observed following knockdown of sNPF and RYamide. This could be achieved by injection of either sNPF or RYa peptide independently or combined following knockdown to validate the role of these peptides in promoting blood-feeding in An. stephensi. Additionally, the apparent (but unclear) regionalized (or tissue-specific) knockdown of sNPF and RYamide transcripts could be visualized and verified by implementing HCR in situ hyb in knockdown animals (or immunohistochemistry using antibodies specific for these two neuropeptides).

      In a follow up of this work, we are generating mutants and peptides for these candidates and are planning to conduct exactly the experiments the reviewer suggests.

      Reviewer #3 (Recommendations for the authors):

      The loss-of-function data suggest necessity but not sufficiency. Synthetic peptide injection in non-host seeking (blood-fed mated or juvenile) mosquitoes would provide direct evidence for peptide-induced behavioral activation. The lack of these experiments weakens the central claim of the paper that these neuropeptides directly promote blood feeding.

      As noted above, we plan to synthesise the peptide to test rescue in a mutant background and sufficiency.

      Some of the claims about knockdown efficiency and interpretation are conflicting; the authors dismiss Hairy and Prp as candidates due to 30-35% knockdown, yet base major conclusions on sNPF and RYamide knockdowns with comparable efficiencies (25-40%). This inconsistency should be addressed, or the justification for different thresholds should be clearly stated.

      We have not defined any specific knockdown efficacy thresholds in the manuscript, as these can vary considerably between genes, and in some cases, even modest reductions can be sufficient to produce detectable phenotypes. For example, knockdown efficiencies of even as low as about 25% - 40% gave us observable phenotypes for sNPF and RYa RNAi (Figure S9B-G).

      No such phenotypes were observed for Hairy (30%) or Prp (35%) knockdowns. Either these genes are not involved in blood feeding, or the knockdown was not sufficient for these specific genes to induce phenotypes. We cannot distinguish between these scenarios.

      The observation that knockdown animals take smaller blood meals is interesting and could reflect a downstream effect of altered host-seeking or an independent physiological change. The relationship between meal size and host-seeking behavior should be clarified.

      We agree with the reviewer that the reduced meal size observed in sNPF and RYa knockdown animals could result either from their inability to seek a host or from an independent effect on blood meal intake. Unfortunately, we did not measure host-seeking in these animals. We plan to distinguish between these possibilities using mutants in future work.

      Several figures are difficult to interpret due to cluttered labeling and poorly distinguishable color schemes. Simplifying these and improving contrast (especially for co-housed vs. virgin conditions) would enhance readability.

      We regret that the reviewer found the figures difficult to follow. We have now revised our annotations throughout the manuscript for enhanced readability. For example, “D1<sup>B</sup>” is now “D1<sup>PBM</sup>” (post-bloodmeal) and “D1<sup>O</sup>” is now “D1<sup>PO</sup>” (post-oviposition). Wherever mated females were used, we have now appended “(m)” to the annotations and consistently depicted these females with striped abdomens in all the schematics. We believe these changes will improve clarity and readability.

      The manuscript does not clearly justify the use of whole-brain RNA sequencing to identify peptides involved in metabolic or peripheral processes. Given that anticipatory feeding signals are often peripheral, the logic for brain transcriptomics should be explained.

      The reviewer is correct in pointing out that feeding signals could also emerge from peripheral tissues. Signals from these tissues – in response to both changing nutritional and reproductive states – are then integrated by the central brain to modulate feeding choices. For example, in Drosophila, increased protein intake is mediated by central brain circuitry including those in the SEZ and central complex (Munch et al., 2022; Liu et al., 2017; Goldschmidt et al., 2023). In the context of mating, male-derived sex peptide further increases protein feeding by acting on a dedicated central brain circuitry (Walker et al., 2015). We therefore focused on the central brain for our studies.

      The proposed model suggests brain-derived peptides initiate feeding, while gut peptides provide feedback. However, gut-specific knockdowns had no effect, undermining this hypothesis. Conversely, the authors also suggest abdominal involvement based on RNAi results. These contradictions need to be resolved into a consistent model.

      We thank the reviewer for raising this point and recognise their concern. Our reasons for invoking an involvement of the gut were two-fold:

      (1) We find increased sNPF transcript expression in the entero-endocrine cells of the midgut in blood-hungry females, which returns to baseline after a blood-meal (Fig. 4L, M).

      (2) While the abdomen-only knockdowns did not affect blood feeding, every effective head knockdown that affected blood feeding also abolished abdominal transcript levels (Fig. S9C, F). (Achieving a head-only reduction proved impossible because (i) systemic dsRNA delivery inevitably reaches the abdomen and (ii) abdominal expression of both peptides is low, leaving little dynamic range for selective manipulation.) Consequently, we can only conclude the following: 1) that brain expression is required for the behaviour, 2) that we cannot exclude a contributory role for gut-derived sNPF. We have discussed this in lines 364-371.

      The identification of candidate receptors is promising, but the manuscript would be significantly strengthened by testing whether receptor knockdowns phenocopy peptide knockdowns. Without this, it is difficult to conclude that the identified receptors mediate the behavioral effects.

      We agree that functional validation of the receptors would strengthen the evidence for sNPF- and RYa-mediated control of blood feeding in An. stephensi. We selected these receptors based on sequence homology. A possibility remains that sNPF neuropeptides activate more than one receptor, each modulating a distinct circuit, as shown in the case of Drosophila Tachykinin (https://pmc.ncbi.nlm.nih.gov/articles/PMC10184743/). This will mean a systematic characterisation and knockdown of each of them to confirm their role. We are planning these experiments in the future.

      The authors compared the percentage changes in sugar-fed and blood-fed animals under sugar-sated or sugar-starved conditions. Figure 1F should reflect what was discussed in the results.

      Perhaps this concern stems from our representation of the data in figure 1F? We have now edited the x-axis and revised its label from “choice of food” to “choice made” to better reflect what food the mosquitoes chose to take.

      For clarity, we have now also plotted the same data as stacked graphs at the bottom of Fig. 1F, which clearly shows the proportion of mosquitoes fed on each particular choice. We avoid the stacked graph as the sole representation of this data because it does not capture the variability in the data.

      Minor issues:

      (1) The authors used mosquitoes with belly stripes to indicate mated females. To be consistent, the post-oviposition females should also have belly stripes.

      We thank the reviewer for pointing this out. We have now edited all the figures as suggested.

      (2) In the first paragraph on the right column of the second page, the authors state, "Since females took blood-meals regardless of their prior sugar-feeding status and only sugar-feeding was selectively suppressed by prior sugar access." Just because the well-fed animals ate less than the starved animals does not mean their feeding behavior was suppressed.

      Perhaps there has been a misunderstanding in the experimental setup of figure 1F, probably stemming from our data representation. The experiment is a choice assay in which sugar-starved or sugar-sated females, co-housed with males, were provided simultaneous access to both blood and sugar, and were assessed for the choice made (indicated on the x-axis): both blood and sugar, blood only, sugar only, or neither. We scored females only for the presence or absence of each meal type (blood or sugar) and did not quantify the amount consumed.

      (3) The figure legend for Figure 1A and the naming convention for different experimental groups are difficult to follow. A simplified or consistently abbreviated scheme would help readers navigate the figures and text.

      We regret that the reviewer found the figure difficult to follow. We have now revised our annotations throughout the manuscript for enhanced readability. For example, “D1<sup>B</sup>” is now “D1<sup>PBM</sup>” (post-bloodmeal) and “D1<sup>O</sup>” is now “D1<sup>PO</sup>” (post-oviposition).

      (4) In the last paragraph of the Y-maze olfactory assay for host-seeking behaviour in An. stephensi in Methods, the authors state, "When testing blood-fed females, aged-matched sugar-fed females (blood-hungry) were included as positive controls where ever possible, with satisfactory results." The authors should explicitly describe what the criteria are for "satisfactory results".

      We apologise for the lack of clarity. We have now edited the statement to read:

      “When testing blood-fed females, age-matched sugar-fed females (blood-hungry) were included wherever possible as positive controls. These females consistently showed attraction to host cues, as expected.” See lines 786-790.

      (5) In the first paragraph of the dsRNA-mediated gene knockdown section in Methods, dsRNA against GFP is used as a negative control for the injection itself, but not for the potential off-target effect.

      We agree with the reviewer that dsGFP injections act as controls only for injection-related behavioural changes, and not for off-target effects of RNAi. We have now corrected the statement. See lines 919-920.

      To control for off-target effects, we could have designed multiple dsRNAs targeting different parts of a given gene. We regret not including these controls for potential off-target effects of dsRNAs injected.

      (6) Reference numbers 48, 89, and 90 are not complete citations.

      We thank the reviewer for spotting these. We have now corrected these citations.

    1. Author Response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review):

      The scale bar for fly and ovary images should be included in Figures 9, 10, and 12.

      We agree with this comment and apologize for the oversight. We have now modified Figures 9, 10, and 12 to include the scale bars for the ovary images. The fly images were acquired using a stereo microscope where scale bar calculation was not possible. However, all images were acquired at the same magnification for consistency.

      Reviewer #2 (Public review):

      A weakness of this paper is the phylogenetic analysis to investigate if there is correspondence in the phylogenetic distribution of ITP-type and Gyc76C-type genes/proteins. Unfortunately, the evidence presented is rather limited in scope. Essentially, the authors report that they only found ITP-type and Gyc76C-type genes/proteins in protostomes, but not in deuterostomes. What is needed is a more fine-grained analysis at the species level within the protostomes. However, I recognise that such a detailed analysis may extend beyond the scope of this paper, which is already rich in data.

      We thank the reviewer for their comment and the suggestion to perform a fine-grained species level comparison of ITP and Gyc76C genes across protostomes. We are unsure of the utility of this analysis for the present study given that we have now shown that ITPa can activate Gyc76C using both an ex vivo and a heterologous assay, the latter being the gold standard in GPCR and guanylate cyclase discovery (see Huang et al 2025 https://doi.org/10.1073/pnas.2420966122; Beets et al 2023 https://doi.org/10.1016/j.celrep.2023.113058; Chang et al 2009 https://doi.org/10.1073/pnas.0812593106).

      Additionally, absence of a gene in a genome/proteome is hard to prove especially when many/most of the protostomian datasets are not as high-quality as those of model systems (e.g. Drosophila melanogaster and Caenorhabditis elegans). Secondly, based on previous findings in Bombyx mori (Nagai et al. 2014 https://doi.org/10.1074/jbc.m114.590646 and Nagai et al. 2016 https://doi.org/10.1371/journal.pone.0156501) and Drosophila (Xu et al. 2023 https://doi.org/10.1038/s41586-023-06833-8 and our study) it is evident that different products of the ITP gene (ITPa and ITPL) could signal via different receptor types depending on the species. Hence, we would need to explore the presence of several genes (ITP, tachykinin, pyrokinin, tachykinin receptor, pyrokinin receptor, CG30340 orphan receptor and Gyc76C) to fully understand which components of these diverse signaling systems are present in a given species to decipher the potential for cross-talk.

      While this species-level comparison will certainly be useful in the context of ITP-Gyc76C evolution, it will not alter the conclusions of the present study – ITPa acts via Gyc76C in Drosophila. We therefore agree with the reviewer that these analyses are beyond the scope of this paper.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):  

      Summary:  

      In Drosophila melanogaster, ITP functions in feeding, drinking, metabolism, excretion, and circadian rhythm. In the current study, the authors characterized and compared the expression of all three ITP isoforms (ITPa and ITPL1&2) in the CNS and peripheral tissues of Drosophila. An important finding is that they functionally characterized and identified Gyc76C as an ITPa receptor in Drosophila using both in vitro and in vivo approaches. In vitro, the authors nicely confirmed that the inhibitory function of recombinant Drosophila ITPa on MT secretion is Gyc76C-dependent (knockdown of Gyc76C specifically in two types of cells abolished the anti-diuretic action of Drosophila ITPa on renal tubules). They also used a combination of multiple approaches to investigate the roles of ITPa and Gyc76C in modulating osmotic and metabolic homeostasis in vivo. They revealed that ITPa signaling to renal tubules and fat body modulates osmotic and metabolic homeostasis via Gyc76C.

      Furthermore, they tried to identify the upstream and downstream of ITP neurons in the nervous system by using connectomics and single-cell transcriptomic analysis. I found this interesting manuscript to be well-written and described. The findings in this study are valuable to help understand how ITP signals work on systemic homeostasis regulation. Both anatomical and single-cell transcriptome analysis here should be useful to many in the field. 

      We thank this reviewer for the positive and thorough assessment of our manuscript.  

      Strengths:  

      The question that this study tries to address (what the receptors of ITPa are in Drosophila) is important. The authors ruled out the Bombyx ITPa receptor orthologs as potential candidates. They identified a novel ITP receptor by using phylogenetic and anatomical analyses, and both in vitro and in vivo approaches.

      The authors exhibited detailed anatomical data of both ITP isoforms and Gyc76C (in the main and supplementary figures), which helped audiences understand the expression of the neurons studied in the manuscript.  

      They also performed connectomic and single-cell transcriptomic analyses to study the synaptic and peptidergic connectivity of ITP-expressing neurons. This provided more information for better understanding and further study of systemic homeostasis modulation.

      Weaknesses:  

      In the discussion section, the authors raised the limitations of the current study, which I mostly agree with, such as the lack of verification of direct binding between ITPa and Gyc76C, even though they provided different data to support that ITPa-Gyc76C signaling pathway regulates systemic homeostasis in adult flies. 

      We now provide evidence of Gyc76C activation by ITPa in a heterologous system (new Figure 7 and Figure 7 Supplement 1).

      Reviewer #2 (Public Review):  

      Summary:  

      The physiology and behaviour of animals are regulated by a huge variety of neuropeptide signalling systems. In this paper, the authors focus on the neuropeptide ion transport peptide (ITP), which was first identified and named on account of its effects on the locust hindgut (Audsley et al. 1992). Using Drosophila as an experimental model, the authors have mapped the expression of three different isoforms of ITP (Figures 1, S1, and S2), all of which are encoded by the same gene.  

      The authors then investigated candidate receptors for isoforms of ITP. Firstly, Drosophila orthologs of G-protein coupled receptors (GPCRs) that have been reported to act as receptors for ITPa or ITPL in the insect Bombyx mori were investigated. Importantly, the authors report that ITPa does not act as a ligand for the GPCRs TkR99D and PK2-R1 (Figure S3). Therefore, the authors investigated other putative receptors for ITPs. Informed by a previously reported finding that ITP-type peptides cause an increase in cGMP levels in cells/tissues (Dircksen, 2009, Nagai et al., 2014), the authors investigated guanylyl cyclases as candidate receptors for ITPs. In particular, the authors suggest that Gyc76C may act as an ITP receptor in Drosophila.  

      Evidence that Gyc76C may be involved in mediating effects of ITP in Bombyx was first reported by Nagai et al. (2014) and here the authors present further evidence, based on a proposed concordance in the phylogenetic distribution of ITP-type neuropeptides and Gyc76C (Figure 2). Having performed detailed mapping of the expression of Gyc76C in Drosophila (Figures 3, S4, S5, S6), the authors then investigated if Gyc76C knockdown affects the bioactivity of ITPa in Drosophila. The inhibitory effect of ITPa on leucokinin- and diuretic hormone-31-stimulated fluid secretion from Malpighian tubules was found to be abolished when expression of Gyc76C was knocked down in stellate cells and principal cells, respectively (Figure 4). However, as discussed below, this does not provide proof that Gyc76C directly mediates the effect of ITPa by acting as its receptor. The effect of Gyc76C knockdown on the action of ITPa could be an indirect consequence of an alteration in cGMP signalling.

      Having investigated the proposed mechanism of ITPa in Drosophila, the authors then investigated its physiological roles at a systemic level. In Figure 5 the authors present evidence that ITPa is released during desiccation and accordingly, overexpression of ITPa increases survival when animals are subjected to desiccation. Furthermore, knockdown of Gyc76C in stellate or principal cells of Malpighian tubules decreases survival when animals are subject to desiccation. However, whilst this is correlative, it does not prove that Gyc76C mediates the effects of ITPa. The authors investigated the effects of knockdown of Gyc76C in stellate or principal cells of Malpighian tubules on i) survival when animals are subject to salt stress and ii) time taken to recover from chill coma. It is not clear, however, why animals overexpressing ITPa were also not tested for its effect on i) survival when animals are subject to salt stress and ii) time taken to recover from chill coma. In Figures 6 and S8, the authors show the effects of Gyc76C knockdown in the female fat body on metabolism, feeding-associated behaviours and locomotor activity, which are interesting. Furthermore, the relevance of the phenotypes observed to potential in vivo actions of ITPa is explored in Figure 7. The authors conclude that "increased ITPa signaling results in phenotypes that largely mirror those seen following Gyc76C knockdown in the fat body, providing further support that ITPa mediates its effects via Gyc76C." Use of the term "largely mirror" seems inappropriate here because there are opposing effects, e.g. decreased starvation resistance in Figure 6A versus increased starvation resistance in Figure 7A. Furthermore, as discussed above, the results of these experiments do not prove that the effects of ITPa are mediated by Gyc76C because the effects reported here could be correlative, rather than causative.

      We thank this reviewer for an extremely thorough and fair assessment of our manuscript. 

      We have now performed salt stress tolerance and chill coma recovery assays using flies over-expressing ITPa (new Figure 10 Supplement 1).

      We agree that the use of the term “largely mirrors” to describe the effects of ITPa overexpression and Gyc76C knockdown is not appropriate and have changed this sentence. We also agree that the experiments did not provide direct evidence that the effects of ITPa are mediated by Gyc76C. To address this, we now provide evidence of Gyc76C activation by ITPa in a heterologous system (new Figure 7 and Figure 7 Supplement 1).

      Lastly, in Figures 8, S9, and S10 the authors analyse publicly available connectomic data and single-cell transcriptomic data to identify putative inputs and outputs of ITPa-expressing neurons. These data are a valuable addition to our knowledge of ITPa-expressing neurons, but they do not address the core hypothesis of this paper, namely that Gyc76C acts as an ITPa receptor.

      The goal of our study was to comprehensively characterize an anti-diuretic system in Drosophila. Hence, in addition to identifying the receptor via which ITPa exerts its effects, we also wanted to understand how ITPa-producing neurons are regulated. Connectomic and single-cell transcriptomic analyses are highly appropriate for this purpose. We have now updated the connectomic analyses using an improved connectome dataset that was released during the revision of this manuscript. Our new analysis shows that l-NSC<sup>ITP</sup> are connected to other endocrine cells that produce other homeostatic hormones (new Figure 13F). We also identify a pathway through which other ITP-producing neurons (LNd<sup>ITP</sup>) receive hygrosensory inputs to regulate water-seeking behavior (new Figure 13E). Moreover, we now include results which show that ITPa-producing neurons (l-NSC<sup>ITP</sup>) are active (new Figure 8A and B) and release ITPa under desiccation. Together with other analyses, these data provide a comprehensive outlook on when, what and how ITPa regulates systemic homeostasis.

      Strengths:  

      (1) The main strengths of this paper are i) the detailed analysis of the expression and actions of ITP and the phenotypic consequences of overexpression of ITPa in Drosophila, and ii) the detailed analysis of the expression of Gyc76C and the phenotypic consequences of knockdown of Gyc76C expression in Drosophila.

      (2) Furthermore, the paper is generally well-written and the figures are of good quality. 

      We thank this reviewer for highlighting the strengths of this manuscript.

      Weaknesses:  

      (1) The main weakness of this paper is that the data obtained do not prove that Gyc76C acts as a receptor for ITPa. Therefore, the following statement in the abstract is premature: "Using a phylogenetic-driven approach and the ex vivo secretion assay, we identified and functionally characterized Gyc76C, a membrane guanylate cyclase, as an elusive Drosophila ITPa receptor." Further experimental studies are needed to determine if Gyc76C acts as a receptor for ITPa. In the section of the paper headed "Limitations of the study", the authors recognise this weakness. They state "While our phylogenetic analysis, anatomical mapping, and ex vivo and in vivo functional studies all indicate that Gyc76C functions as an ITPa receptor in Drosophila, we were unable to verify that ITPa directly binds to Gyc76C. This was largely due to the lack of a robust and sensitive reporter system to monitor mGC activation." It is not clear what the authors mean by "the lack of a robust and sensitive reporter system to monitor mGC activation". The discovery of mGCs as receptors for ANP in mammals was dependent on the use of assays that measure GC activity in cells (e.g. by measuring cGMP levels in cells). Furthermore, more recently cGMP reporters have been developed. The use of such assays is needed here to investigate directly whether Gyc76C acts as a receptor for ITPa. In summary, insufficient evidence has been obtained to conclude that Gyc76C acts as a receptor for ITPa. Therefore, I think there are two ways forward, either:  

      (a) The authors obtain additional biochemical evidence that ITPa is a ligand for Gyc76C.  

      or  

      (b) The authors substantially revise the conclusions of the paper (in the title, abstract, and throughout the paper) to state that Gyc76C MAY act as a receptor for ITPa, but that additional experiments are needed to prove this. 

      We thank the reviewer for this comment and agree with the two options they propose. We had previously tried a different cGMP reporter (Promega GloSensor cGMP assay) to monitor activation of Gyc76C by ITPa in a heterologous system. Unfortunately, we were not successful in monitoring Gyc76C activation by ITPa. We have now used another cGMP sensor, Green cGull, to show that ITPa can indeed activate Gyc76C heterologously expressed in HEK cells (new Figure 7 and Figure 7 Supplement 1). However, we still cannot rule out the possibility that ITPa can act on additional receptors in vivo. This is based on our ex vivo Malpighian tubule assays (new Figure 6E and F). ITPa inhibits DH31- and LK-stimulated secretion and we show that this effect is abolished when Gyc76C is knocked down specifically in principal and stellate cells, respectively. Interestingly, application of ITPa alone can stimulate secretion when Gyc76C is knocked down in principal cells (new Figure 6E). This could be explained by: 1) the presence of another receptor for ITPa which results in diuretic actions and/or 2) low Gyc76C signaling activity (RNAi-based knockdown lowers signaling but does not abolish it completely), which could alter other intracellular messenger pathways that promote secretion. We have added text to indicate the possibility of other ITPa receptors. Nonetheless, our conclusions are supported by the heterologous assay results which indicate that ITPa can activate Gyc76C. Therefore, we do not alter the title.

      (2) The authors state in the abstract that a phylogenetic-driven approach led to their identification of Gyc76C as a candidate receptor for ITPa. However, there are weaknesses in this claim. Firstly, the hypothesis that Gyc76C may be involved in mediating effects of ITPa was first proposed ten years ago by Nagai et al. 2014, so this surely was the primary basis for investigating this protein. Nevertheless, investigating if there is correspondence in the phylogenetic distribution of ITP-type and Gyc76C-type genes/proteins is a valuable approach to addressing this issue. Unfortunately, the evidence presented is rather limited in scope. Essentially, the authors report that they only found ITP-type and Gyc76C-type genes/proteins in protostomes, but not in deuterostomes. What is needed is a more fine-grained analysis at the species level within the protostomes. Thus, are there protostome species in which both ITP-type and Gyc76C-type genes/proteins have been lost? Furthermore, are there any protostome species in which an ITP-type gene is present but a Gyc76C-type gene is absent, or vice versa? If there are protostome species in which an ITP-type gene is present but a Gyc76C-type gene is absent or vice versa, this would argue against Gyc76C being a receptor for ITPa. In this regard, it is noteworthy that in Figure 2A there are two ITP-type precursors in C. elegans, but there are no Gyc76C-type proteins shown in the tree in Figure 2B. Thus, what is needed is a more detailed analysis of protostomes to investigate if there really is correspondence in the phylogenetic distribution of Gyc76C-type and ITP-type genes at the species level.

      We thank the reviewer for this comment. While the previous study by Nagai et al had implicated Gyc76C in the ITP signaling pathway, how they narrowed down Gyc76C as a candidate was not reported. Therefore, our unbiased phylogenetic approach was necessary to ensure that we identified all suitable candidate receptors. Indeed, our phylogenetic analysis also identified Gyc32E as another candidate ITP receptor. However, we did not pursue this receptor further as our expression data (new Figure 4 Supplement 2) indicated that Gyc32E is not expressed in osmoregulatory tissues and therefore likely does not mediate the osmotic effects of ITPa. 

      We also appreciate the suggestion to perform a more detailed phylogenetic analysis for the peptide and receptor. We did not include C. elegans receptors in the phylogenetic analysis because they tend to be highly evolved and routinely cause long-branch attraction (see: Guerra and Zandawala 2024: https://doi.org/10.1093/gbe/evad108). We (specifically the senior author) have previously excluded C. elegans receptors in the phylogenetic analysis of GnRH and Corazonin receptors for similar reasons (see: Tian and Zandawala et al. 2016: 10.1038/srep28788). 

      Unfortunately, absence of a gene in a genome is hard to prove, especially when the genome is not as high-quality as those of model systems (e.g. Drosophila and mice). Moreover, given the concern of this reviewer that our physiological and behavioral data on ITPa and Gyc76C only provide correlative evidence, we decided against performing additional phylogenetic analysis which also provides correlative evidence. Our only goal with this analysis was to identify a candidate ITPa receptor. Since we have now functionally characterized this receptor using a heterologous system, we feel that the current phylogenetic analysis was able to successfully serve its purpose.

      (3) The manuscript would benefit from a more comprehensive overview and discussion of published literature on Gyc76C in Drosophila, both as a basis for this study and for interpretation of the findings of this study.  

      We thank the reviewer for this comment. We have now included a broader discussion of Gyc76C based on published literature.  

      Reviewer #3 (Public Review):  

      Summary:  

      The goal of this paper is to characterize an anti-diuretic signaling system in insects using Drosophila melanogaster as a model. Specifically, the authors wished to characterize a role of ion transport peptide (ITP) and its isoforms in regulating diverse aspects of physiology and metabolism. The authors combined genetic and comparative genomic approaches with classical physiological techniques and biochemical assays to provide a comprehensive analysis of ITP and its role in regulating fluid balance and metabolic homeostasis in Drosophila. The authors further characterized a previously unrecognized role for Gyc76C as a receptor for ITPa, an amidated isoform of ITP, and in mediating the effects of ITPa on fluid balance and metabolism. The evidence presented in favor of this model is very strong as it combines multiple approaches and employs ideal controls. Taken together, these findings represent an important contribution to the field of insect neuropeptides and neurohormones and have strong relevance for other animals. 

      We thank this reviewer for the positive and thorough assessment of our manuscript.

      Strengths:  

      Many approaches are used to support their model. Experiments were well-controlled, used appropriate statistical analyses, and were interpreted properly and without exaggeration.  

      Weaknesses:  

      No major weaknesses were identified by this reviewer. More evidence to support their model would be gained by using a loss-of-function approach with ITPa, and by providing more direct evidence that Gyc76C is the receptor that mediates the effects of ITPa on fat metabolism. However, these weaknesses do not detract from the overall quality of the evidence presented in this manuscript, which is very strong.  

      We agree with this reviewer regarding the need to provide additional evidence using a loss-of-function approach with ITPa. We now characterize the phenotypes following knockdown of ITP in ITP-producing cells (new Figure 9). Our results are in agreement with phenotypes observed following Gyc76C knockdown, lending further support that ITPa mediates its effects via Gyc76C. Unfortunately, we are not able to provide evidence that ITPa acts on Gyc76C in the fat body using the assay suggested by this reviewer (explained in detail below). Instead, we now provide direct evidence of Gyc76C activation by ITPa in a heterologous system (new Figure 7 and Figure 7 Supplement 1).

      Reviewer #1 (Recommendations For The Authors):  

      Here, I have several extra concerns about the work as below:  

      (1) The authors confirmed the function of ITPa in regulating both osmotic and metabolic homeostasis by specifically overexpressing ITPa driven by ITP-RC-Gal4 in adult flies (Figures 5 and 7). Have the authors ever tried to knock down ITP in ITP-RC-Gal4 neurons? What was the phenotype? Especially regarding the impact on metabolic homeostasis, does knocking down ITP in ITP neurons mimic the phenotypes of Gyc76C fat body knockdown flies? 

      We thank the reviewer for this suggestion. We now characterize the phenotypes following knockdown of ITP using ITP-RC-Gal4 (new Figure 9). Our results are in agreement with phenotypes observed following Gyc76C knockdown, lending further support that ITPa mediates its effects via Gyc76C.

      The authors mentioned that the existing ITP RNAi lines target all three isoforms. It would be interesting if the authors could overexpress ITPa in ITP-RC-Gal4>ITP-RNAi flies and confirm whether any phenotypes induced by ITP knockdown could be rescued. This would further confirm the role of ITPa in homeostasis regulation.  

      We thank the reviewer for this suggestion. Unfortunately, this experiment is not straightforward because knockdown with ITP RNAi does not completely abolish ITP expression (see Figure 9A). Hence, the rescue experiment needs to be ideally performed in an ITP mutant background. However, ITP mutation leads to developmental lethality (unpublished observation) so we cannot generate all the flies necessary for this experiment. Therefore, we cannot perform the rescue experiments at this time. In future studies, we hope to perform knockdown of specific ITP isoforms using the transgenes generated here (Xu et al 2023: 10.1038/s41586-023-06833-8).   

      (2) In Figures 5A and B, the authors nicely show the increased release of ITPa under desiccation by quantifying the ITPa immunolabelling intensity in different neuronal populations. It may be induced by the increased neuronal activity of ITPa neurons under the desiccated condition. Have the authors confirmed whether the activity of ITPa-expressing neurons is impacted by desiccation?  

      The TRIC system may be able to detect the different activity of those neurons before and after desiccation. This may further explain the reduced ITPa peptide levels during desiccation.  

      We thank the reviewer for this suggestion. We have now monitored the activity of ITPa-expressing neurons using the CaLexA system (Masuyama et al 2012: 10.3109/01677063.2011.642910). Our results indicate that ITPa neurons are indeed active under desiccation (new Figure 8A and B). These results are also in agreement with ITPa immunolabelling showing increased peptide release during desiccation (new Figure 8C and D). Together, these results show that ITPa neurons are activated and release ITPa under desiccation.  

      (3) What about the intensity of ITPa immunolabelling in other ITPa-positive neurons (e.g., VNC) under desiccation? If there is no change in other ITPa neurons, it will be a good control. 

      We thank the reviewer for this suggestion. Unfortunately, ITPa immunostaining in VNC neurons is extremely weak, preventing accurate quantification of ITPa levels under different conditions. We did hypothesize that ITPa immunolabelling in clock neurons (5<sup>th</sup>-LN<sub>v</sub> and LN<sub>d</sub><sup>ITP</sup>) would not change depending on the osmotic state of the animal. However, our results (Figure 8C and D) indicate that ITPa from these neurons is also released under desiccation. Interestingly, LN<sub>d</sub><sup>ITP</sup>, which also co-express Neuropeptide F (NPF), have recently been implicated in water seeking during thirst (Ramirez et al, 2025: 10.1101/2025.07.03.662850). Our new connectomic-driven analysis shows that these neurons can receive thermo/hygrosensory inputs (new Figure 13E). Hence, it is conceivable that other ITPa-expressing neurons also release ITPa during thirst/desiccation.

      (4) Adult-specific overexpression of ITPa in ITP neurons does show significant phenotypes compared to controls in both osmotic and metabolic homeostasis-related assays. It would be helpful if the authors could show how much ITPa mRNA levels are increased in the fly heads with ITPa overexpression (with or without desiccation and starvation). 

      We thank the reviewer for this suggestion. We have now included immunohistochemical evidence showing an increase in ITPa peptide levels in flies with ITPa overexpression (new Figure 10A). We feel that this is a better indicator of ITPa signaling level than ITPa mRNA levels.   

      (5) Another question concerns the bloated abdomens of ITPa-overexpressing flies. Are the bloated abdomens of ITPa OE female flies (Figure 5E) due to increased ovary size (Figure 7G)? Have the authors also detected similar bloated abdomens in male flies with ITPa overexpression, given that both male and female flies show more release of ITPa during desiccation?  

      We thank the reviewer for this comment. The bloated abdomen phenotype seen in females can be attributed to increased water content since we see a similar phenotype in males (see Author response image 1 below).

      Author response image 1.

      Reviewer #2 (Recommendations For The Authors):  

      (1) Page 1 - change "Homeostasis is obtained by" to "Homeostasis is achieved by".  

      Changed

      (2) Page 1 - change "Physiological responses" to "Physiological processes". 

      Changed

      (3) Page 2 - Change "Recently, ITPL2 was also shown to mediate anti-diuretic effects via the tachykinin receptor" to "Recently, ITPL2 was also shown to exert anti-diuretic effects via the tachykinin receptor". 

      Changed

      (4) Page 9 - "(C) Adult-specific overexpression of ITPa using ITP-RC-GAL4TS (ITP-RC-T2A-GAL4 combined with temperature-sensitive tubulin-GAL80) increases desiccation" Unless I am misunderstanding Fig 5C, I think what is shown is that overexpression of ITPa prolongs survival during a period of desiccation. I am not sure what the authors mean by "increases desiccation". In the text (page 9) the authors state "ITPa overexpression improves desiccation tolerance", which is a much clearer statement than what is in the figure legend. 

      We thank the reviewer for identifying this oversight. We have now changed the caption to “increases desiccation tolerance”.  

      (5) Page 11 - The authors conclude that "increased ITPa signaling results in phenotypes that largely mirror those seen following Gyc76C knockdown in the fat body, providing further support that ITPa mediates its effects via Gyc76C." Use of the term "largely mirror" seems inappropriate here because there are opposing effects- e.g. decreased starvation resistance in Figure 6A versus increased starvation resistance in Figure 7A.  

      Perhaps there is a misunderstanding of what is meant by "mirroring" - it means the same, not the opposite. 

      We thank the reviewer for this comment. We agree that the use of the term “largely mirrors” to describe the effects of ITPa overexpression and Gyc76C knockdown is not appropriate and have changed this sentence as follows: “Taken together, the phenotypes seen following Gyc76C knockdown in the fat body largely mirror those seen following ITP knockdown in ITP-RC neurons, providing further support that ITPa mediates its effects via Gyc76C.”

      (6) Page 12 - There appear to be words missing between "neurons during desiccation, as well as their downstream" and "the recently completed FlyWire adult brain connectome" 

      We thank the reviewer for highlighting this mistake. We have changed the sentence as follows: “Having characterized the functions of ITP signaling to the renal tubules and the fat body, we wanted to identify the factors and mechanisms regulating the activity of ITP neurons during desiccation, as well as their downstream neuronal pathways. To address this, we took advantage of the recently completed FlyWire adult brain connectome (Dorkenwald et al., 2024, Schlegel et al., 2024) to identify pre- and post-synaptic partners of ITP neurons.”

      (7) Page 15 - "can release up to a staggering 8 neuropeptides" - I suggest that the word "staggering" is removed. The notion that individual neurons release many neuropeptides is now widely recognised (both in vertebrates and invertebrates) based on analysis of single-cell transcriptomic data. 

      Removed staggering.

      (8) Page 16 - "(Farwa and Jean-Paul, 2024)" - this citation needs to be added to the reference list and I think it needs to be changed to "Sajadi and Paluzzi, 2024". 

      We thank the reviewer for highlighting this oversight. The correct citation has now been added.

      (9) It is noteworthy that, based on a PubMed search, there are at least thirteen published papers that report on Gyc76C in Drosophila (PMIDs: 34988396, 32063902, 27642749, 26440503, 24284209, 23862019, 23213443,  21893139, 21350862, 16341244, 15485853, 15282266, 7706258). However, none of these papers are discussed/cited by the authors. This is surprising because the authors' hypothesis that Gyc76C acts as a receptor for ITPa surely needs to be evaluated and discussed with reference to all the published insights into the developmental/physiological roles of this protein. 

      We thank the reviewer for this comment. Some of the references mentioned above (21350862, 16341244, 15485853) mainly report on soluble guanylyl cyclases and not membrane guanylyl cyclase like Gyc76C. Based on other studies on Gyc76C and its role in immunity and development, we have now expanded the discussion on additional roles of ITPa.

      Reviewer #3 (Recommendations For The Authors):  

      I have only a few comments that will help the authors strengthen a couple of aspects of their model.  

      (1) The case for Gyc76C as a receptor for ITPa in regulating fluid homeostasis is clear, given the experiments the authors carried out where they applied ITPa to tubules and showed that the effects of ITPa on tubule secretion were blocked if Gyc76C was absent in tubules. This approach, or something similar, should be used to provide conclusive proof that ITPa's metabolic effects on the fat body go through Gyc76C.  

      At present (unless I missed it) the authors only show that gain of ITPa has the opposite phenotype to fat body-specific loss of Gyc76C. While this would be the expected result if ITPa/Gyc76C is a ligand-receptor pair, it is not quite sufficient to conclusively demonstrate that Gyc76C is definitely the fat body receptor. Ex vivo experiments such as soaking the adult fat body carcasses with and without Gyc76C in ITPa and monitoring fat content via Nile Red could be one way to address this lack of direct evidence. The authors could also make text changes to explicitly mention this lack of conclusive evidence and suggest it as a future direction.

      We thank the reviewer for this comment. We have now conclusively demonstrated that Gyc76C is activated by ITPa in a heterologous assay (new Figure 7 and Figure 7 Supplement 1). With this evidence, we can confidently claim that ITPa can mediate its actions via Gyc76C in various tissues including the Malpighian tubules and fat body. Nonetheless, we liked the suggestion by this reviewer to perform the ex vivo assay and test the effect of ITPa on the fat body. Unfortunately, it is challenging to do this because increased ITPa signaling (chronically using ITPa overexpression) results in increased lipid accumulation in the fat body in vivo. Therefore, we would likely not see the effect of ITPa addition in an ex vivo fat body preparation since lipogenesis will not occur in the absence of glucose. However, ITPa could counteract the effects of other lipolytic factors such as adipokinetic hormone (AKH). To test this hypothesis, we monitored fat content in the fat body incubated with and without AKH (see Author response image 2 below showing representative images from this experiment). Since we did not observe any differences in fat levels between these two conditions, we were unable to test the effects of ITPa on AKH-activity using this assay.

      Author response image 2.

      (2) I did not see any loss of function data for ITPa - is this possible? If so this would strengthen the case for a 1:1 relationship between loss of ligand and loss of receptor. Alternatively, the authors could suggest this as an important future direction. 

      We agree with this reviewer regarding the need to provide additional evidence using a loss-of-function approach with ITPa. We have now characterized the phenotypes following knockdown of ITP in ITP-producing cells (new Figure 9). Our results are in agreement with phenotypes observed following Gyc76C knockdown, lending further support that ITPa mediates its effects via Gyc76C.

      (3) For clarity, please include the sex of all animals in the figure legend. Even though the methods say 'females used unless otherwise indicated' it is still better for the reader to know within the figure legend what sex is displayed. 

      We thank the reviewer for this suggestion and have now included sex of the animals in the figure legends.  

      (4) Please state whether females are mated or not, as this is relevant for taste preferences and food intake. 

      We apologize for this oversight. We used mated females for all experiments. This has now been included in the methods.  

      (5) More discussion comparing the metabolic effects of ITP found in this study with those reported in past studies (Galikova 2018, 2022) would help readers appreciate any similarities and/or differences between this study and past work. 

      We thank the reviewer for this suggestion. Unfortunately, it is difficult to directly compare our phenotypes with the metabolic effects of ITP reported in Galikova and Klepsatel 2022 because the previous study used a ubiquitous driver (Da-GAL4) to manipulate ITP levels. Ectopically overexpressing ITPa in non-ITP producing cells can result in non-physiological phenotypes. This is evident in their metabolic measurements, where both global overexpression and knockdown of ITP result in reduced glycogen and fat levels, and starvation tolerance. Moreover, ITP-RC-GAL4 used in our study to overexpress and knock down ITPa is more specific than the Da-GAL4 used previously. Da-GAL4 would include other ITP cells (e.g. ITP-RD producing cells). Since ITP is broadly expressed across the animal, it is difficult to parse out the phenotypes of ITPa and other isoforms using manipulations performed with Da-GAL4. We have mentioned this limitation in the results for ITP knockdown as follows: “A previous study employing ubiquitous ITP knockdown and overexpression suggests that Drosophila ITP also regulates feeding and metabolic homeostasis (Galikova and Klepsatel, 2022) in addition to osmotic homeostasis (Galikova et al., 2018). However, given the nature of the genetic manipulations (ectopic ITPa overexpression and knockdown of ITP in all tissues) utilized in those studies, it is difficult to parse the effects of ITP signaling from ITPa-producing neurons.”

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      (1) Legionella effectors are often activated by binding to eukaryote-specific host factors, including actin. The authors should test the following: a) whether Lfat1 can fatty acylate small G-proteins in vitro; b) whether this activity is dependent on actin binding; and c) whether expression of the Y240A mutant in mammalian cells affects the fatty acylation of Rac3 (Figure 6B), or other small G-proteins.

      We were not able to express and purify the full-length recombinant Lfat1 to perform fatty acylation of small GTPases in vitro. However, the Y240A mutant overexpressed in cellulo retained the ability to fatty acylate Rac3 and another small GTPase, RheB (see Figure 6-figure supplement 2). We postulate that under infection conditions, actin binding might be required to fatty acylate certain GTPases due to the small amount of effector protein that is secreted into the host cell.

      (2) It should be demonstrated that lysine residues on small G-proteins are indeed targeted by Lfat1. Ideally, the functional consequences of these modifications should also be investigated. For example, does fatty acylation of G-proteins affect GTPase activity or binding to downstream effectors?

      We have mutated K178 on RheB and showed that this mutation abolished its fatty acylation by Lfat1 (see Author response image 1 below). We were not able to test whether fatty acylation by Lfat1 affects downstream effector binding.

      Author response image 1.

      (3) Line 138: Can the authors clarify whether the Lfat1 ABD induces bundling of F-actin filaments or promotes actin oligomerization? Does the Lfat1 ABD form multimers that bring multiple filaments together? If Lfat1 induces actin oligomerization, this effect should be experimentally tested and reported. Additionally, the impact of Lfat1 binding on actin filament stability should be assessed. This is particularly important given the proposed use of the ABD as an actin probe.

      The ABD domain does not form oligomers, as evidenced by the gel filtration profile of the ABD domain. However, we do see F-actin bundling in our in vitro F-actin polymerization experiment when both actin and ABD are at high concentration (data not shown). At low concentrations of ABD, there is no aggregation/bundling effect on F-actin.

      (4) Line 180: I think it's too premature to refer to the interaction as having "high specificity and affinity." We really don't know what else it's binding to.

      We have revised the text and reworded the sentence by removing "high specificity and affinity."

      (5) The authors should reconsider the color scheme used in the structural figures, particularly in Figures 2D and S4.

      We are not sure what the reviewer's specific concern is regarding the color scheme of the structural figures.

      (6) In Figure 3E, the WT curve fits the data poorly, possibly because the actin concentration exceeds the Kd of the interaction. It might fit better to a quadratic.

      We have performed quadratic fitting and replaced Figure 3E.
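
For readers unfamiliar with the point raised in comment (6), the sketch below contrasts a simple hyperbolic binding curve with the quadratic (ligand-depletion) form for 1:1 binding. It is an illustration only, not the fit used for Figure 3E: the function names, the choice of which partner is held at fixed concentration, and the example concentrations are all assumptions.

```python
import math


def fraction_bound_hyperbolic(titrant_total, kd):
    """Hyperbolic model: assumes free titrant is about equal to total titrant."""
    return titrant_total / (kd + titrant_total)


def fraction_bound_quadratic(titrant_total, fixed_total, kd):
    """Exact 1:1 binding: fraction of the fixed partner that is in complex.

    Accounts for depletion of the titrant by the fixed partner, which
    matters when fixed_total is comparable to or larger than kd.
    """
    s = fixed_total + titrant_total + kd
    root = math.sqrt(s * s - 4.0 * fixed_total * titrant_total)
    # Algebraically equal to (s - root) / (2 * fixed_total), but written in
    # a form that avoids subtracting two nearly equal numbers.
    return 2.0 * titrant_total / (s + root)


if __name__ == "__main__":
    kd = 1.0            # hypothetical dissociation constant (arbitrary units)
    fixed_total = 10.0  # hypothetical fixed partner concentration, well above kd
    for titrant in (0.5, 2.0, 10.0, 50.0):
        hyp = fraction_bound_hyperbolic(titrant, kd)
        quad = fraction_bound_quadratic(titrant, fixed_total, kd)
        print(f"titrant={titrant:5.1f}  hyperbolic={hyp:.3f}  quadratic={quad:.3f}")
```

When the fixed partner is far below the Kd the two models coincide; when it exceeds the Kd the hyperbola overestimates binding at every titrant concentration, which is the situation the reviewer describes.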

      (7) The authors propose that the individual helices of the Lfat1 ABD could be expressed on separate proteins and used to target multi-component biological complexes to F-actin by genetically fusing each component to a split alpha-helix. This is an intriguing idea, but it should be tested as a proof of concept to support its feasibility and potential utility.

      It is a good suggestion. We plan to thoroughly test the feasibility of this idea as one of our future directions.

      (8) The plot in Figure S2D appears cropped on the X-axis or was generated from a ~2× binned map rather than the deposited one (pixel size ~0.83 Å, plot suggests ~1.6 Å). The reported pixel size is inconsistent between the Methods and Table 1; please clarify whether 0.83 Å refers to super-resolution.

      Yes, 0.83 Å is the super-resolution pixel size. We have updated the cryo-EM table accordingly.

      Reviewer #2:

      Weaknesses:

      (1) The authors should use biochemical reactions to analyze the KFAT of Lfat1 on one or two small GTPases shown to be modified by this effector in cellulo. Such reactions may allow them to determine the role of actin binding in its biochemical activity. This notion is particularly relevant in light of recent studies showing that actin is a co-factor for the activity of LnaB and Ceg14 (PMID: 39009586; PMID: 38776962; PMID: 40394005). In addition, the study should be discussed in the context of these recent findings on the role of actin in the activity of L. pneumophila effectors.

      We have new data showing that actin binding does not affect Lfat1 enzymatic activity (see response to Reviewer #1). We have added these new data as Figure S7 to the paper. Accordingly, we also revised the discussion by adding the following paragraph.

      “The discovery of Lfat1 as an F-actin–binding lysine fatty acyl transferase raised the intriguing question of whether its enzymatic activity depends on F-actin binding. Recent studies have shown that other Legionella effectors, such as LnaB and Ceg14, use actin as a co-factor to regulate their activities. For instance, LnaB binds monomeric G-actin to enhance its phosphoryl-AMPylase activity toward phosphorylated residues, resulting in unique ADPylation modifications in host proteins  (Fu et al, 2024; Wang et al, 2024). Similarly, Ceg14 is activated by host actin to convert ATP and dATP into adenosine and deoxyadenosine monophosphate, thereby modulating ATP levels in L. pneumophila–infected cells (He et al, 2025). However, this does not appear to be the case for Lfat1. We found that Lfat1 mutants defective in F-actin binding retained the ability to modify host small GTPases when expressed in cells (Figure S7). These findings suggest that, rather than serving as a co-factor, F-actin may serve to localize Lfat1 via its actin-binding domain (ABD), thereby confining its activity to regions enriched in F-actin and enabling spatial specificity in the modification of host targets.”

      (2) The development of the ABD domain of Lfat1 as an F-actin probe is a nice extension of the biochemical and structural experiments. The authors need to compare the new probe to commonly used ones, such as Lifeact, in labeling the actin cytoskeleton.

      We fully agree with the reviewer’s insightful suggestion. However, a direct comparison of the Lfat1 ABD domain with commonly used actin probes such as Lifeact, as well as evaluation of the split α-helix probe (as suggested by Reviewer #1), would require extensive and technically demanding experiments. These are important directions that we plan to pursue in future studies.

      For all other minor points, we have made corrections/changes in our revised text and figures.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      What are the overarching principles by which prokaryotic genomes evolve? This fundamental question motivates the investigations in this excellent piece of work. While it is still very common in this field to simply assume that prokaryotic genome evolution can be described by a standard model from mathematical population genetics, and fit the genomic data to such a model, a smaller group of researchers rightly insists that we should not have such preconceived ideas and instead try to carefully look at what the genomic data tell us about how prokaryotic genomes evolve. This is the approach taken by the authors of this work. Lacking a tight theoretical framework, the challenge of such approaches is to devise analysis methods that are robust to all our uncertainties about what the underlying evolutionary dynamics might be.

      The authors here focus on a collection of ~300 single-cell genomes from a relatively well-isolated habitat with relatively simple species composition, i.e. cyanobacteria living in hotsprings in Yellowstone National Park, and convincingly demonstrate that the relative simplicity of this habitat increases our ability to interpret what the genomic data tells us about the evolutionary dynamics.

      Using a very thorough and multi-faceted analysis of these data, the authors convincingly show that there are three main species of Synechococcus cyanobacteria living in this habitat, and that apart from very frequent recombination within each species (which is in line with insights from other recent studies) there is also a remarkably frequent occurrence of hybridization events between the different species, and with as yet unidentified other genomes. Moreover, these hybridization events drive much of the diversity within each species. The authors also show convincing evidence that these hybridization events are not neutral but are driven by natural selection.

      Strengths:

      The great strength of this paper is that, by not making any preconceived assumptions about what the evolutionary dynamics is expected to look like, but instead devising careful analysis methods to tease apart what the data tells us about what has happened in the evolution in these genomes, highly novel and unexpected results are obtained, i.e. the major role of hybridization across the 3 main species living in this habitat.

      The analysis is very thorough and reading the detailed supplementary material it is clear that these authors took a lot of care in devising these methods and avoiding the pitfalls that unfortunately affect many other studies in this research area.

      The picture of the evolutionary dynamics of these three Synechococcus species that emerge from this analysis is highly novel and surprising. I think this study is a major stepping stone toward the development of more realistic quantitative theories of genome evolution in prokaryotes.

      The analysis methods that the authors employ are also partially novel and will no doubt be very valuable for analysis of many other datasets.

      We thank the reviewer for their appreciation of our work.

      Weaknesses:

      I feel the main weakness of this paper is that the presentation is structured such that it is extremely difficult to read. I feel readers have essentially no chance to understand the main text without first fully reading the 50-page supplement with methods and 31 supplementary materials. I think this will unfortunately strongly narrow the audience for this paper and below in the recommendations for the authors I make some suggestions as to how this might be improved.

      A very interesting observation is that a lot of hybridization events (i.e. about half) originate from species other than the alpha, beta, and gamma Synechococcus species from which the genomes that are analyzed here derive. For this to occur, these other species must presumably also be living in the same habitat and must be relatively abundant. But if they are, why are they not being captured by the sampling? I did not see a clear explanation for this very common occurrence of hybridization events from outside of these Synechococcus species. The authors raise the possibility that these other species used to live in these hot springs but are now extinct. I'm not sure how plausible this is and wonder if there would be some way to find support for this in the data (e.g., that one does not observe recent events of import from one of these unknown other species). This was one major finding that I believe went without a clear interpretation.

      We agree with the reviewer that the extent of hybridization with other species is surprising. While we do feel that our metagenome data provide convincing evidence that “X” species are not present in MS or OS, we cannot currently rule out the presence of X in other springs. In the revision we explicitly mention the alternative hypothesis (Lines 239-242).

      The core entities in the paper are groups of orthologous genes that show clear evidence of hybridization. It is thus very frustrating that exactly the methods for identifying and classifying these hybridization events were really difficult to understand (sections I and V of the supplement). Even after several readings, I was unsure of exactly how orthogroups were classified, i.e. what the difference between M and X clusters is, what a 'simple hybrid' corresponds to (as opposed to complex hybrids?), what precisely the definitions of singlet and non-singlet hybrids are, etcetera. It also seems that some numbers reported in the main text do not match what is shown in the supplement. For example, the main text talks about "around 80 genes with more than three clusters (SM, Sec. V; fig. S17).", but there is no group with around 80 genes shown in Fig S17! And similarly, it says "We found several dozen (100 in α and 84 in β) simple hybrid loci" and I also cannot match those numbers to what is shown in the supplement. I am convinced that what the authors did probably made sense. But as a reader, it is frustrating that when one tries to understand the results in detail, it is very difficult to understand what exactly is going on. I mention this example in detail because the hybrid classification is the core of this paper, but I had similar problems in other sections.

      We thank the reviewer for pointing out these issues with our original presentation. In the revision, we have redone most of the analysis to simplify the methods and check the consistency of the results. We did not find any qualitative differences in our results after reanalysis, but some of the numbers for different hybridization patterns have changed. The most notable difference is an increase in the number of alpha-gamma simple hybrids and a corresponding decrease in mixed-species clusters (now labeled mosaic hybrids). These transfers are difficult to assign because we only have access to a single gamma genome. We have added a short explanation of this point in Lines 219-222.

      To improve the presentation, we significantly expanded the “Results” section to better explain our analysis and the different steps we take. We included two additional figures (Figs. 3 and 4) that illustrate the different types of hybrids and the heterogeneity in the diversity of alpha which is discussed in the main text and is important for interpreting our results. We also included two additional figures (Figs. 2 and 6) that were previously in the Appendix but were mentioned in the main text. We believe these changes should address most of the issues raised by the reviewer and hopefully make the manuscript easier to read.

      Although I generally was quite convinced by the methods and it was clear that the authors were doing a very thorough job, there were some instances where I did not understand the analysis. For example, the way orthogroups were built is very much along the lines used by many in the field (i.e. orthoMCL on the graph of pairwise matchings, building phylogenies of connected components of the graph, splitting the phylogenies along long branches). But then to subdivide orthogroups into clusters of different species, the authors did not use the phylogenetic tree already built but instead used an ad hoc pairwise hierarchical average linkage clustering algorithm.

      The reviewer is correct that there is an unexplained discrepancy between the clustering methods we used at different steps in our pipeline. We followed previous work by using phylogenetic distances for the initial clustering of orthogroups. On these scales we expect hybridization to play a minor role and phylogenetic distances to correlate reasonably well with evolutionary divergence. However, because of the extensive hybridization we observed, the use of phylogenetic models for species clustering is more difficult to justify. We therefore chose to simply use pairwise nucleotide distances, which make fewer assumptions about the underlying evolutionary processes and should be more robust. We have briefly explained our reasoning and the details of our clustering method in the revision (Lines 182-190).
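
      As an illustration only (this is not the authors' code), average-linkage clustering on pairwise nucleotide distances can be sketched as below. The function names, the handling of gaps, and the distance cutoff are assumptions made for the example; the actual procedure is the one described in the revised manuscript.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing sites, skipping gaps ('-') and unknown bases ('N')."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-N" and y not in "-N"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def average_linkage(seqs, cutoff):
    """Merge the closest clusters until their average distance exceeds cutoff."""
    names = list(seqs)
    dist = {frozenset((i, j)): p_distance(seqs[i], seqs[j])
            for i, j in combinations(names, 2)}
    clusters = [[n] for n in names]

    def avg(c1, c2):
        # Average-linkage distance: mean over all cross-cluster pairs.
        return sum(dist[frozenset((i, j))] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), d = min(
            (((a, b), avg(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1],
        )
        if d > cutoff:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(sorted(c) for c in clusters)

# Toy alignment with hypothetical cell names; the 0.2 cutoff is arbitrary.
toy = {
    "a1": "ACGTACGTAC",
    "a2": "ACGTACGTAT",
    "b1": "TGCATGCATG",
    "b2": "TGCATGCATA",
}
toy_clusters = average_linkage(toy, cutoff=0.2)
```

      Because only observed pairwise differences enter the calculation, no substitution model or tree topology is assumed, which is the robustness argument made in the response above.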

      Reviewer #2 (Public Review):

      Summary:

      Birzu et al. describe two sympatric hotspring cyanobacterial species ("alpha" and "beta") and infer recombination across the genome, including inter-species recombination events (hybridization) based on single-cell genome sequencing. The evidence for hybridization is strong and the authors took care to control for artefacts such as contamination during sequencing library preparation. Despite hybridization, the species remain genetically distinct from each other. The authors also present evidence for selective sweeps of genes across both species - a phenomenon which is widely observed for antibiotic resistance genes in pathogens, but rarely documented in environmental bacteria.

      Strengths:

      This manuscript describes some of the most thorough and convincing evidence to date of recombination happening within and between cohabitating bacteria in nature. Their single-cell sequencing approach allows them to sample the genetic diversity from two dominant species. Although single-cell genome sequences are incomplete, they contain much more information about genetic linkage than typical short-read shotgun metagenomes, enabling a reliable analysis of recombination. The authors also go to great lengths to quality-filter the single-cell sequencing data and to exclude contamination and read mismapping as major drivers of the signal of recombination.

      We thank the reviewer for their appreciation of our work.

      Weaknesses:

      Despite the very thorough and extensive analyses, many of the methods are bespoke and rely on reasonable but often arbitrary cutoffs (e.g. for defining gene sequence clusters etc.). Much of this is warranted, given the unique challenges of working with single-cell genome sequences, which are often quite fragmented and incomplete (30-70% of the genome covered). I think the challenges of working with this single-cell data should be addressed up-front in the main text, which would help justify the choices made for the analysis.

      We have significantly expanded the “Results” section to better justify and explain the choices we made during our analysis. We hope these changes address the reviewer’s concerns and make the manuscript more accessible to readers.

      The conclusions could also be strengthened by an analysis restricted to only a subset of the highest quality (>70% complete) genomes. Even if this results in a much smaller sample size, it could enable more standard phylogenetic methods to be applied, which could give meaningful support to the conclusions even if applied to just ~10 genomes or so from each species. By building phylogenetic trees, recombination events could be supported using bootstraps, which would add confidence to the gene sequence clustering-based analyses which rely on arbitrary cutoffs without explicit measures of support.

      It seems to us that the reviewer’s suggestion presupposes that the recombination events we find can be described as discrete events on an asexual phylogeny, similar to how rare mutations are treated in standard phylogenetic inference. Popular tools, such as ClonalFrame and its offshoots, have attempted to identify individual recombination events starting from these assumptions. But the main conclusion of both our linkage and SNP block analysis is that the ClonalFrame assumptions do not hold for our data. Under a clonal frame, the SNP blocks we observe should be perfectly linked, similar to mutations on an asexual tree. But our results in Fig. 7D show the opposite. Part of the issue may have been that in our original presentation, we only briefly discuss the results of our linkage analysis and refer readers to the Appendix for more details. To fix this issue we have added an extra figure (Fig. 2), showing rapid linkage decrease in both species and that at long distances the linkage values are essentially identical to the unlinked case, similar to sexual populations. We hope that this change will help clarify this point.
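
      To make the linkage argument concrete, the sketch below computes a standard linkage statistic (r², the squared correlation between two biallelic sites) and averages it by genomic separation. It is a simplified illustration, not the authors' analysis: the choice of r², the 0/1 allele coding, and the assumption that every cell is covered at every site are all assumptions of the example.

```python
from collections import defaultdict
from itertools import combinations

def r_squared(x, y):
    """Squared correlation between two biallelic sites coded as 0/1 per cell."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a * b for a, b in zip(x, y)) / n
    denom = px * (1 - px) * py * (1 - py)
    if denom == 0:
        return None  # at least one site is monomorphic, r^2 is undefined
    return (pxy - px * py) ** 2 / denom

def linkage_by_distance(positions, alleles):
    """Average r^2 over all site pairs, grouped by genomic separation."""
    sums = defaultdict(lambda: [0.0, 0])
    for i, j in combinations(range(len(positions)), 2):
        r2 = r_squared(alleles[i], alleles[j])
        if r2 is not None:
            d = abs(positions[j] - positions[i])
            sums[d][0] += r2
            sums[d][1] += 1
    return {d: s / k for d, (s, k) in sorted(sums.items())}

# Toy data: two nearby sites in perfect linkage, one distant site unlinked.
toy_positions = [0, 10, 1000]
toy_alleles = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
toy_decay = linkage_by_distance(toy_positions, toy_alleles)
```

      Under a strict clonal frame, sites that arose on the same branch would keep r² close to 1 regardless of distance; a curve that falls to the unlinked expectation at long distances is the pattern the response describes.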

      The manuscript closes with a cartoon (Figure 4) which outlines the broad evolutionary scenario supported by the data and analysis. I agree with the overall picture, but I do think that some of the temporal ordering of events, especially the timing of recombination events could be better supported by data. In particular, is there evidence that inter-species recombination events are increasing or decreasing over time? Are they currently at steady-state? This would help clarify whether a newly arrived species into the caldera experiences an initial burst of accepting DNA from already-present species (perhaps involving locally adaptive alleles), or whether recombination events are relatively constant over time.

      The reviewer raises some very interesting questions about the dynamics of recombination in the population, which we hope to pursue in future work. We have added this as an open question in the Discussion (Lines 365-382).

      These questions could be answered by counting recombination events that occur deeper or more recently in a phylogenetic tree.

      The reviewer here seems to presuppose that recombination is rare enough that a phylogenetic tree can reliably be inferred, which is contrary to our linkage analysis (see our response to the earlier comment on phylogenetic methods). Perhaps the reviewer missed this point in our original manuscript since it was discussed primarily in the Appendix.

      The cartoon also shows a 'purple' species that is initially present, then donates some DNA to the 'blue' species before going extinct. In this model, 'purple' DNA should also be donated to the more recently arrived 'orange' species, in proportion to its frequency in the 'blue' genome. This is a relatively subtle detail, but it could be tested in the real data, and this may actually help discern the order of the inferred recombination events.

      We have included an extra figure in the main text (Fig. 6) that addresses the question of timing of events. A quantitative test of our cartoon model along the lines the reviewer suggested would certainly be worthwhile and we hope to do that in future work.  

      The abstract also makes a bold claim that is not well-supported by the data: "This widespread mixing is contrary to the prevailing view that ecological barriers can maintain cohesive bacterial species..." In fact, the two species are cohesive in the sense that they are identifiable based on clustering of genome-wide genetic diversity (as shown in Fig 1A). I agree that the mixing is 'widespread' in the sense that it occurs across the genome (as shown in Figure 2A) but it is clearly not sufficient to erode species boundaries. So I believe the data is consistent with a Biological Species Concept (sensu Bobay & Ochman, Genome Biology & Evolution 2017) that remains 'fuzzy' - such that there are still inter-species recombination events, just not sufficient to erode the cohesion of genomic clusters. Therefore, I think the data supports the emerging picture of most bacteria abiding by some version of a BSC, and is not particularly 'contrary' to the prevailing view.

      We have revised the phrase mentioned by the reviewer to “prevent genetic mixture between bacterial species,” which more accurately represents our conclusions. 

      The final Results paragraph begins by posing a question about epistatic interactions, but fails to provide a definitive answer to the extent of epistasis in these genomes. Quantifying epistatic effects in bacterial genomes is certainly of interest, but might be beyond the scope of this paper. This could be a Discussion point rather than an underdeveloped section of the Results.

      We agree with the reviewer that an exhaustive analysis of epistasis in the population is beyond the scope of the manuscript. Our original intention was to answer whether SNP blocks we discovered showed evidence of strong linkage, as might be expected if only a small number of strains are present in the population. In light of the previous comments by the reviewer regarding the consistency with the clonal frame hypothesis, we believe this is especially relevant for our results. Moreover, the results we found, especially for the beta population, were quite conclusive: SNP block linkages in beta are indistinguishable from an unlinked model. To avoid misdirecting the reader about the significance of our results, we have revised the relevant paragraph (Lines 316-319).

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      Although I am entirely convinced of the validity of the results, methodology, and interpretations presented in this work, I must say I found the paper very hard to read. And I think I am really quite familiar with these kinds of approaches. I fear that for people other than experts on these kinds of comparative genomic analyses, this paper will be almost impossible to read. With the aim of expanding the audience for this compelling work, I think the authors might want to consider ways to improve the presentation.

      At the end of a long project, the obtained results typically form a web of mutual interconnections and dependencies and one of the key challenges in presenting the results in a paper is having to untangle this web of connected results and analysis into a linear ordered narrative so that, at any point in the narrative, understanding the next point only depends on previous points in the narrative. I frankly feel that this paper fails at this.

      The paper reads to me as if one author put together the supplement by essentially writing a report of all the analyses that were done together with supplementary figures summarizing all those analyses, and that another author then wrote the main text by using the materials in the supplement almost in the way a cook uses ingredients for a dish. Almost every other sentence in the main text refers to results in the (31!) supplementary figures and can only be understood by reading the appropriate corresponding sections in the supplementary materials. I found it essentially impossible to read the main text without having first read the entire 50-page supplement.

      I think the paper could be hugely improved by trying to restructure the presentation so as to make it more linear. The main text can be expanded to include a summary of the crucial methods and analysis results from the supplement needed to understand the narrative in the main text. For example, as it currently stands it is really challenging to understand what is shown in figures 2 and 3 of the main text without having to first read a very substantial part of the supplement. Figure 3, even after having read the relevant sections in the supplement, took me quite a while to understand and almost felt like a puzzle to decipher. Rethinking which parts of the supplement are really necessary would also help. Finally, it would also help if the terminology was kept as simple, transparent, and consistent as possible.

      I understand that my suggestion to thoroughly reorganize the presentation may feel like a big hassle, but I am afraid that in its current form, these important results are essentially rendered inaccessible to all but a small group of experts in this area. This paper deserves a wider readership.

      We thank the reviewer for these valuable suggestions. In the revision, we have significantly expanded and restructured the “Results” section to make the presentation more linear, as the reviewer suggested (see our reply to the public comment by the reviewer for details). We hope these changes will make the manuscript easier to read.

      Reviewer #2 (Recommendations For The Authors):

      I found this paper challenging to follow since the main text was so condensed and the supplementary material so extensive. Given that eLife does not impose strong limits on the length of the main text, I suggest moving some key sections from the supplement into the main text to make it easier for the reader to follow rather than flipping back and forth. Adding to the confusion, supplementary figures were referenced out of order in the main text (e.g. S23 is referenced before S1). Please check the numbering and ensure figures are mentioned in the main text in the correct order.

      We thank the reviewer for their feedback on the presentation of the results. In response to similar comments from Reviewer #1, we have significantly expanded and restructured the “Results” section to make it easier to read (see also our responses to Reviewer #1).

      Page 2: The term 'coevolution' is typically reserved for two species that mutually impose selective pressures on one another (e.g. predator-prey interactions; see Janzen, Evolution 1980). In the context of these two cyanobacterial species, it's not clear that this is the case so I would simply refer to them as 'cohabitating' or being sympatric in the same environment.

      It is true that the term “coevolution” has become associated with predator-prey interactions, as the reviewer said. However, we feel that in our case “coevolution” fairly accurately describes the continual hybridization over long time scales we observe. We have therefore chosen to keep the term.

      Page 3: The authors mention that the gamma SAG is ~70% complete, which turns out to be quite high. It would be useful to mention early in the Results the mean/median completeness across SAGs, and how this leads to some challenges in analysing the data. Some of the material from the Supplement could be moved into the Results here.

      We have added a short note on the completeness in the Results (Lines 153-154). We have also added an extra figure in Appendix 1 with the completeness of all the SAGs for interested readers.

      I was left puzzled by the sentence: "Alternatively, high rates of recombination could generate different genotypes within each genome cluster that are adapted to different temperatures, with the relative frequencies of each cluster being only a correlated and not a causal driver of temperature adaptation." This is suggesting that individual genes or alleles, rather than entire genomes, could be adapted to temperature. But figure 1B seems to imply that the entire genome is adapted to different temperatures. Anyway, this does not seem to be a key point and could probably be removed (or clarified if the authors deem this an important point, which I failed to understand).

      We have revised this section to clarify the alternative hypothesis mentioned by the reviewer (Lines 100-103).

      Page 4. 'Several dozen' hybrid genes were found, but please also specify how many genes were tested. In general, it would be good to briefly outline the sample size (SAGs or genes) considered for each analysis.

      We have added the total numbers of genes we analyzed at each step of our analysis.

      'Mosaic hybrid loci' are mentioned alongside the issue of poor alignment. Presumably, the mosaic hybrid loci are first filtered to remove the poor alignments? This should be specified, and please mention how many loci are retained before/after this filter.

      We thank the reviewer for highlighting this important point. In the revision, we have implemented a more aggressive filtering of genes with poor alignments. We have added an extra paragraph to Appendix 1 (step 5 in the pipeline analysis) briefly explaining the issue.

      Page 5. "By contrast, the diversity of mosaic loci was typical of other loci within beta, suggesting most of the beta genome has undergone hybridization." Please point to the data (figure) to support this statement.

      We have restructured our discussion of the different hybrid loci so this comment is no longer relevant. In case the reviewer is interested, the synonymous diversity within beta was 0.047, while in mosaic hybrids it was 0.064.

      Page 6. "The largest diversity trough contained 28 genes." Since this trough is discussed in detail and seems to be of interest, it would be nice to illustrate it, perhaps as an inset in Figure 2 or as a separate figure. If I understood correctly, this trough includes genes (in a nitrogen-fixation pathway) that are present in all genomes, but are exchanged by homologous recombination. So I don't think it's correct to say that the "ancestors acquired the ability to fix nitrogen." Rather, the different alleles of these same genes were present in the ancestor. So perhaps there was a selective sweep involving alleles in this region that provided adaptation to local nitrogen sources or concentrations, but not a gain of new genes. Perhaps I misunderstood, in which case clarification would be appreciated.

      The reviewer raises an interesting possibility. We agree that it is in principle possible that the ancestor contained the nitrogen fixation genes and the selective sweep simply replaced the ancestral alleles. In this particular case, gene order provides additional evidence that the entire pathway was acquired at roughly the same time. The gene order between alpha and beta is almost entirely different, with only a few segments containing more than 2-3 genes in the same order, as shown by Bhaya et al. 2007 and confirmed by additional unpublished analysis of the SAGs. One of the few exceptions is the nitrogen fixation pathway, which has essentially the same gene order over more than 20 kbp. Thus, if the ancestor of both alpha and beta contained the nitrogen-fixation pathway, we would expect these genes to be scattered across the genome. We have revised the sentences in question to clarify this point (Lines 260-271).

      Page 6. Last paragraph on epistasis references Fig 3C, but I believe it should be Fig 3D.

      Fixed.

      Page 7. Figure 3 legend. "Note that alpha-2 is identical to gamma here." I believe it should be beta, not gamma.

      The reviewer is correct. We have fixed this error.

      Page 8. What is the evidence for "at least six independent colonizers"? I could not find the data supporting this claim.

      The statement mentioned by the reviewer was based on the maximum number of species clusters we identified in different core genes. However, during the revision, we found that only a handful of genes contained five or more clusters. We did find several tens of genes with four clusters. In addition, Rosen et al. (2018) also found additional 16S clusters at low frequency in the same springs. Based on these results we conservatively estimate that at least four independent strains colonized the caldera, but the number could be much greater. We have revised the text in question accordingly (Lines 336-339) and added Fig. 2 in Appendix 1 to support the conclusion.

      Page 9. Line 200: "acting to homogenize the population." It should be specified that the population is only homogenized at these introgressed loci, not genome-wide. Otherwise, the genome-wide species clusters seen in Fig 1 would not be maintained.

      It is true that the selective sweeps that lead to diversity troughs only homogenize the introgressed loci. But other hybrid segments could also rise to high frequency in the population during the sweep through hitchhiking. The fact that we observe SNP blocks generated through secondary recombination events of introgressed segments throughout the genome supports this view. While we do not fully understand the dynamics of this process currently, we do feel that the current evidence supports the statement that mixing is occurring throughout the genome and not just at a few loci, so we have kept the original statement.

      The final sentence (lines 221-222) is vague and uninformative. On the one hand, "investigating whether hybridization plays a major role" is what the current manuscript has already done - depending on what is meant by 'major' (how much of the genome? Or whether there are ecological implications?). It is also not clear what is meant by a predictive theory and 'possible evolutionary scenarios'. This should be elaborated upon, otherwise, it is not clear what the authors mean. Otherwise, this sentence could be cut.

      We thank the reviewer for their feedback. One possible source of confusion could be that in this sentence we were referring to detecting hybridization in other communities. We have changed “these communities” to “other communities” to make this clearer.

      Supplement.

      Broadly speaking, I appreciate the thorough and careful analysis of the single cell data. On the other hand, it is hard to evaluate whether these custom analyses are doing what is intended in many cases. Would it be possible to consider an analysis using more established methods, e.g. taking a subset of genomes with 'good' completeness and using Panaroo to find the core and accessory genome, then ClonalFrameML or Gubbins to infer a phylogeny and recombination events? Such analyses could probably be applied to a subset of the sample with relatively complete genomes. I don't want to suggest an overly time-consuming analysis, but the authors could consider what would be feasible.

      We have added a comparison between our analysis and that from two other methods, including ClonalFrameML mentioned by the reviewer. One important point that we feel might have been lost in the first version is that our linkage results imply that recombination is not rare enough to be mapped onto an asexual tree, as ClonalFrameML assumes. Note that this is not simply a technical limitation caused by incomplete coverage and is instead a consequence of the evolutionary dynamics of the population. Consistent with this, we found several inconsistencies in how recombination events were assigned by ClonalFrameML. We have summarized these conclusions in Appendix 7 of the revised manuscript.

      Page 8. Line 190. What is meant by 'minimal compositional bias'?

      We mean that the sample is not biased towards strains that grow in the lab. We have revised the sentence to clarify.

      Page 25. Figure S14 is not referenced in the text.

      We have added part of this figure to the main text since it illustrates one of our main results, namely that sites at long genomic distances are essentially unlinked.

      Page 26. The 'unlinked controls' (line 530) are very useful, but it would be even more informative to see if these controls also show the same decline in linkage with distance in the genome as observed in the real data. In particular, it would be good to know if the observed rapid decline in linkage with distance in the low-diversity regions is also observed in controls. Currently, it is unclear if this observation might be due to higher uncertainty in inferring linkage in low-diversity regions, which by definition have less polymorphism to include in the linkage calculation.

      We thank the reviewer for the suggestion. After further consideration, we have decided to remove the subsection on linkage decrease in the low-diversity regions. We feel such detailed quantitative analysis would be better suited for a more technical paper, which we hope to do at a later time.

      Page 26. There are some sections with missing identifiers (Sec ??).

      Fixed.

      Page 27. The information about the typical breadth of SAG coverage (~30%) would be better to include earlier in the Supplement, and also mentioned in the main text so the reader can more easily understand the nature of the dataset.

      We have added an extra figure with the SAG coverages to Appendix 1.

      Page 29. Any sensitivity analysis around the S = 0.9 value? Even if arbitrary, could the authors provide justification why they think this value is reasonable?

      We have significantly revised this section in response to earlier comments by one of the reviewers. We hope that this will clarify the details of our methods for interested readers. To answer the reviewer’s specific question, we chose this heuristic after examining the fraction of cells of each species in different species clusters. For the clusters assigned to alpha and beta, we found a sharp peak near one and that a cutoff of 0.9 captured most clusters while still being high enough to be inconsistent with a mixed cluster.
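
      The heuristic described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name, the use of the majority species as the candidate label, and the "mixed" label for clusters below the cutoff are all hypothetical.

```python
from collections import Counter

def label_cluster(cell_species, cutoff=0.9):
    """Label a sequence cluster by its majority species if that species
    accounts for at least `cutoff` of the cells, otherwise call it mixed."""
    if not cell_species:
        raise ValueError("cluster has no cells")
    species, n = Counter(cell_species).most_common(1)[0]
    fraction = n / len(cell_species)
    return (species if fraction >= cutoff else "mixed"), fraction

# Toy clusters (hypothetical compositions).
pure_label, pure_fraction = label_cluster(["alpha"] * 19 + ["beta"])
mixed_label, mixed_fraction = label_cluster(["alpha"] * 6 + ["beta"] * 4)

# A simple sensitivity check: how the label changes across nearby cutoffs.
sensitivity = {c: label_cluster(["alpha"] * 19 + ["beta"], cutoff=c)[0]
               for c in (0.85, 0.90, 0.95, 0.99)}
```

      Repeating the labelling over a small range of cutoffs, as in the last lines, is one way to address the reviewer's question about sensitivity around S = 0.9.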

      Page 30. I could not see where Fig. S17 was mentioned in the text. Also, how are 'simple hybrid genes' defined?

      We have removed this figure in the revision. The definitions of the different types of hybrid genes have been added to the main text in response to a comment from the other reviewer.

      Page 36. It is hard to see that divergence is 'high' relative to what reference. Would it be possible to include the expected value (from ref. 12) in the plot, or at least explicitly mentioned in the text?

      We have added the mean synonymous and non-synonymous divergences between alpha and beta to the figures for reference.

      Page 38. Line 770 "would be comparable to that of beta." This is not necessarily the case since beta could have a different time to its most recent common ancestor. It could have a different time to the last bottleneck or selective sweep, etc.

      We thank the reviewer for pointing out this misleading statement. Our point here was that in the first scenario the TMRCA of alpha and beta would be similar since the diversity in the high-diversity alpha genes is similar to beta. We have clarified this statement in the revision.

      Page 39. Line 793. The use of the term 'genomic backbone' implies the presence of a clonal frame, which is not what the data seems to support. Perhaps another term such as 'genetic diversity' would more appropriately capture the intended meaning here.

      We agree with the reviewer that the low-diversity regions may not be asexual. We used “genomic backbone” to distinguish from the “clonal frame,” which is usually used to mean that the backbone is asexual. We have added a note in the revision to clarify this point.

      Page 39. Lines 802-805. I found this explanation hard to follow. Could the logic be clarified?

      We simply meant that although the beta distribution is unimodal, it is not consistent with a simple Poisson distribution, unlike in alpha. We have added an extra sentence to clarify this.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #2 (Public review):

      In this valuable manuscript, Lin et al attempt to examine the role of long non coding RNAs (lncRNAs) in human evolution, through a set of population genetics and functional genomics analyses that leverage existing datasets and tools. Although the methods are incomplete and at times inadequate, the results nonetheless point towards a possible contribution of long non coding RNAs to shaping humans, and suggest clear directions for future, more rigorous study.

      Comments on revisions:

      I thank the authors for their revision and changes in response to previous rounds of comments. As it had been nearly two years since I last saw the manuscript, I reread the full text to familiarise myself again with the findings presented. While I appreciate the changes made and think they have strengthened the manuscript, I still find parts of it a bit too speculative or hyperbolic. In particular, I think claims of evolutionary acceleration and adaptation require more careful integration with existing human/chimpanzee genetics and functional genomics literature.

      We sincerely thank the reviewer for their great patience and valuable comments, which have helped us further improve the manuscript. Before responding to the comments point by point, we provide a summary here.

      (1) On parameters and cutoffs.

      Parameters and cutoffs influence data analysis. The large number of Supplementary Notes, Supplementary Figures, and Supplementary Tables indicates that we paid great attention to the influence of parameters and the robustness of the analyses. Specifically, here we explain the DBS sequence distance cutoff of 0.034, which determines the top 20% of genes that most differentiate humans from chimpanzees and influences the gene set enrichment analysis (Figure 2). As described in the revised manuscript, we estimated this cutoff based on Song et al., verified that it is reasonable based on Prufer et al. (Song et al. 2021; Prufer et al. 2017), and measured its influence by examining slightly different cutoff values (e.g., 0.035).
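
      A cutoff sensitivity check of the kind described can be sketched as below. The gene names and distance values are invented for illustration, and the use of "greater than or equal to" for the cutoff and of the Jaccard index to compare gene sets are assumptions of the example, not details taken from the manuscript.

```python
def genes_above(distances, cutoff):
    """Genes whose DBS sequence distance is at or above the cutoff."""
    return {gene for gene, d in distances.items() if d >= cutoff}

def jaccard(a, b):
    """Overlap between two gene sets (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical per-gene distances.
toy_distances = {"g1": 0.010, "g2": 0.0345, "g3": 0.040, "g4": 0.020, "g5": 0.036}

set_034 = genes_above(toy_distances, 0.034)
set_035 = genes_above(toy_distances, 0.035)
overlap = jaccard(set_034, set_035)
```

      If the selected gene sets, and the enrichment results computed from them, change little between neighbouring cutoffs, the downstream conclusions are unlikely to depend on the exact value chosen.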

      (2) Analyses of HS TFs and HS TF DBSs.

      It is desirable to compare the contribution of HS lncRNAs and HS TFs to human evolution. Identifying HS TFs faces the challenges that different institutions (e.g., NCBI and Ensembl) annotate orthologous genes using different criteria, and that multiple human TF lists have been published by different research groups. Recently, Kirilenko et al. identified orthologous genes in hundreds of placental mammals and birds and organized different types of genes into datasets of pairwise comparison (e.g., hg38-panTro6) using humans and mice as references (Kirilenko et al. Integrating gene annotation with orthology inference at scale. Science 2023). Based on (a) the many2zero and one2zero gene lists in the “hg38-panTro6” dataset and (b) three human TF lists reported by two studies (Bahram et al. 2015; Lambert et al. 2018) and used in the SCENIC package, we identified HS TFs. The numbers of HS TFs and HS lncRNAs (5 vs 66) alone provide strong evidence that HS lncRNAs have contributed more significantly to human evolution than HS TFs (note that 5 is the union of three intersections between <many2zero + one2zero> and the three <human TF list>).
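
      The set operation in the parenthetical note (the union of three intersections between the combined many2zero and one2zero lists and each TF list) can be written out as below. The gene identifiers are placeholders and the function name is hypothetical; this is a sketch of the stated logic, not the authors' script.

```python
def human_specific_tfs(many2zero, one2zero, tf_lists):
    """Union, over all TF lists, of the TFs that also appear among the
    candidate human-specific genes (many2zero + one2zero)."""
    candidates = set(many2zero) | set(one2zero)
    return set().union(*(candidates & set(tfs) for tfs in tf_lists))

# Placeholder identifiers, not real gene names.
toy_many2zero = {"GENE_1", "GENE_2"}
toy_one2zero = {"GENE_3"}
toy_tf_lists = [
    {"GENE_1", "TF_X"},   # TF list 1
    {"GENE_3", "TF_Y"},   # TF list 2
    {"TF_Z"},             # TF list 3
]
toy_hs_tfs = human_specific_tfs(toy_many2zero, toy_one2zero, toy_tf_lists)
```

      Taking the union rather than the intersection across TF lists gives the most inclusive count, so a gene needs to appear in only one of the published TF lists to be counted.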

      TF DBS (i.e., TFBS) prediction has also been challenging because they are very short (mostly about 10 bp) and TF-DNA binding involves many cofactors (Bianchi et al. Zincore, an atypical coregulator, binds zinc finger transcription factors to control gene expression. Science 2025). We used two TF DBS prediction programs to predict HS TF DBSs, including the well-established FIMO program (whose results have been incorporated into the JASPAR database) (Rauluseviciute et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. NAR 2023) and the recently reported CellOracle program (Kamimoto et al. Dissecting cell identity via network inference and in silico gene perturbation. Nature 2023). Then, we performed downstream analyses and obtained two major results. One is that on average (per base), fewer selection signals are detected in HS TF DBSs (although caution is needed because TF DBSs are very short); the other is that HS TFs and HS lncRNAs contribute to human evolution in quite different ways (Supplementary Figs. 25 and 26).

      (3) On whether genes with more transcripts may appear as spurious targets of HS lncRNAs.

      The results for HS TF DBSs now allow us to address the question of whether genes with more transcripts may appear as spurious targets of HS lncRNAs. We note that (a) we predicted HS lncRNA DBSs and HS TF DBSs in the same promoter regions of the same 179128 Ensembl-annotated transcripts (release 79), and (b) we used the same GTEx transcript expression matrices in the analyses of HS TF DBSs and HS lncRNA DBSs (the GTEx database includes gene expression matrices and transcript expression matrices, the latter of which include multiple transcripts of a gene). Thus, the analyses of HS TF DBSs provide an effective control for examining whether genes with more transcripts may appear as spurious targets of HS lncRNAs and, consequently, cause the high percentages of HS lncRNA-target transcript pairs that show correlated expression in the brain (Figure 3). We find that the percentages of HS TF-target transcript pairs that show correlated expression are also high in the brain, but the whole profile across GTEx tissues is significantly different from that of HS lncRNA DBSs (Figure 3A; Supplementary Figure 25). In addition, the difference between HS lncRNA DBSs and HS TF DBSs is more apparent in the distribution of significantly changed DBSs across GTEx tissues (Figure 3B; Supplementary Figure 26). Together, these results suggest that the brain-enriched distribution of co-expressed HS lncRNA-target transcript pairs arises from HS lncRNA-mediated transcriptional regulation rather than from the transcript number difference.

      (4) Additional notes on HS TFs and HS TF DBSs.

      First, the “many2zero” and “one2zero” gene lists in the “hg38-panTro6” dataset of Kirilenko et al. provide the most up-to-date, but not the most complete, data on human-specific genes because “hg38-panTro6” is a pairwise comparison. On the other hand, the Ensembl database also annotates orthologous genes, but lacks pairwise comparisons such as “hg38-panTro6”. Therefore, not all HS genes based on “hg38-panTro6” agree with orthologous genes in the Ensembl database. Second, if HS genes are identified based on both Ensembl and Kirilenko et al., HS TFs will be fewer.

      (5) On speculative or hyperbolic claims.

      First, the title “Human-specific lncRNAs contributed critically to human evolution by distinctly regulating gene expression” is now further supported by the analyses of HS TF DBSs. Second, we have carefully revised the entire manuscript, trying to make it more readable, accurate, logically reasonable, and biologically acceptable. Third, specifically, in the revision, we avoid speculative or hyperbolic claims in results, interpretations, and discussions as far as we can. This includes toning down statements and claims, for example, using “reshape” to replace “rewire” and using “suggest” to replace “indicate”. Since the revisions are pervasive, we do not mark all of them, except those that are directly relevant to the reviewer’s comments.

      (1) Line 155: "About 5% of genes have significant sequence differences in humans and chimpanzees," This statement needs a citation, and a definition of what is meant by 'significant', especially as multiple lines below instead mention how it's not clear how many differences matter, or which of them, etc.

      Different studies give different estimates, from 1.24% (Ebersberger et al. Genomewide Comparison of DNA Sequences between Humans and Chimpanzees. Am J Hum Genet. 2002) to 5% (Britten RJ. Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. PNAS 2002). The 5% for significant gene sequence differences arises when considering a broader range of genetic variations, particularly insertions and deletions of genetic material (indels). To provide more accurate information, we have replaced this simple statement with a more comprehensive one and cited the above two papers.

      (2) line 187: "Notably, 97.81% of the 105141 strong DBSs have counterparts in chimpanzees, suggesting that these DBSs are similar to HARs in evolution and have undergone human-specific evolution." I do not see any support for the inference here. Identifying HARs and acceleration relies on a far more thorough methodology than what's being presented here. Even generously, pairwise comparison between two taxa only cannot polarise the direction of differences; inferring human-specific change requires outgroups beyond chimpanzee.

      Here, we made an analogy rather than an inference; therefore, we used words such as “suggesting” and “similar” instead of more confirmatory words. We have revised the latter half of the sentence to read “raising the possibility that these sequences have evolved considerably during human evolution”.

      (3) line 210: "Based on a recent study that identified 5,984 genes differentially expressed between human-only and chimpanzee-only iPSC lines (Song et al., 2021), we estimated that the top 20% (4248) genes in chimpanzees may well characterize the human-chimpanzee differences". I do not agree with the rationale for this claim, and do not agree that it supports the cutoff of 0.034 used below. I also find that my previous concerns with the very disparate numbers of results across the three archaics have not been suitably addressed.

      (1) Indeed, “we estimated that the top 20% (4248) genes in chimpanzees may well characterize the human-chimpanzee differences” is an improper claim; we made this mistake due to the flawed use of English.

      (2) What we need is a gene number, which (a) indicates genes that effectively differentiate humans from chimpanzees, and (b) can be used to set a DBS sequence distance cutoff. Since this study is the first to systematically examine DBSs in humans and chimpanzees, we must estimate this gene number based on studies that identify differentially expressed genes in humans and chimpanzees. We chose Song et al. 2021 (Song et al. Genetic studies of human–chimpanzee divergence using stem cell fusions. PNAS 2021), which identified 5984 differentially expressed genes, including 4377 genes whose differential expression is due to trans-acting differences between humans and chimpanzees. To the best of our knowledge, this is the only published data on trans-acting differences between humans and chimpanzees, and most HS lncRNAs and their DBSs/targets have trans-acting relationships (see Supplementary Table 2). Based on these numbers, we chose a DBS sequence distance cutoff of 0.034, which corresponds to 4248 genes (the top 20%), slightly fewer than 4377.

      (3) If we chose DBS sequence distance cutoff=0.033 or 0.035, slightly more or fewer genes would be determined, raising the question of whether they would significantly influence the downstream gene set enrichment analysis (Figure 2). We found that 91 genes have a DBS sequence distance of 0.034. Thus, if cutoff=0.035, 4248-91=4157 genes were determined, and the influence on gene set enrichment analysis was very limited.

      (4) On the disparate numbers of results across the three archaics. Figure 1A is based on Figure 2 in Prufer et al. 2017. At first glance, our Figure 1A indicates that Altai Neanderthal is older than Denisovan (upon kya), making our result “identified 1256, 2514, and 134 genes in Altai Neanderthals, Denisovans, and Vindija Neanderthals” unreasonable. However, Prufer et al. (2017) reported that “It has been suggested that Denisovans received gene flow from a hominin lineage that diverged prior to the common ancestor of modern humans, Neandertals, and Denisovans……In agreement with these studies, we find that the Denisovan genome carries fewer derived alleles that are fixed in Africans, and thus tend to be older, than the Altai Neandertal genome”. This note by Prufer et al. provides an explanation for our result, which is that more genes with large DBS sequence distances were identified in Denisovans than in Altai Neanderthals. Of course, the 1256, 2514, and 134 depend on the cutoff of 0.034. If cutoff=0.035, these numbers change slightly, but their relationships remain (i.e., more genes in Denisovans). We examined multiple cutoff values and found that more genes in Denisovans have large DBS sequence distances than in Altai Neanderthals.
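
      The cutoff sensitivity check described in response (3) can be sketched as follows. This is a minimal illustration in Python, assuming that each gene is represented by a single DBS sequence distance rounded to three decimals and that a gene is selected when its distance is at or above the cutoff. The function name `genes_at_or_above` and the toy values are hypothetical and are not taken from the manuscript.

```python
def genes_at_or_above(distances, cutoff):
    """Return the genes whose DBS sequence distance is >= cutoff."""
    return {gene for gene, d in distances.items() if d >= cutoff}

# Hypothetical toy distances (rounded to three decimals), for illustration only.
toy_distances = {
    "geneA": 0.051,
    "geneB": 0.034,
    "geneC": 0.034,
    "geneD": 0.035,
    "geneE": 0.020,
}

selected_034 = genes_at_or_above(toy_distances, 0.034)
selected_035 = genes_at_or_above(toy_distances, 0.035)

# Genes lost when the cutoff is raised from 0.034 to 0.035.
dropped = selected_034 - selected_035
print(len(selected_034), len(selected_035), sorted(dropped))
```

      Under these assumptions, the genes dropped by raising the cutoff are exactly those whose distance equals the lower cutoff, which mirrors the 4248-91=4157 arithmetic given above.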

      (4) I also think that there is still too much of a tendency to assume that adaptive evolutionary change is the only driving force behind the observed results in the results. As I've stated before, I do not doubt that lncRNAs contribute in some way to evolutionary divergence between these species, as do other gene regulatory mechanisms; the manuscript leans down on it being the sole, or primary force, however, and that requires much stronger supporting evidence. Examples include, but are not limited to:

      (1) Indeed, the observed results are also caused by other genomic elements and mechanisms (but it is hardly feasible to identify and differentiate them in a single study), and we do not assume that adaptive evolutionary change is the only driving force. Careful revisions have been made to avoid leaving readers the impression that we have this tendency or hold the simple assumption.

      (2) Comparing HS lncRNAs to HS TFs is critical, and we have done this.

      (5) line 230: "These results reveal when and how HS lncRNA-mediated epigenetic regulation influences human evolution." This statement is too speculative.

      We have toned down the statement, just saying “These results provide valuable insights into when and how HS lncRNA-mediated epigenetic regulation impacts human evolution”.

      Line 268: "yet the overall results agree well with features of human evolution." What does this mean? This section is too short and unclear.

      (1) First, the sentence “Selection signals in YRI may be underestimated due to fewer samples and smaller sample sizes (than CEU and CHB), yet the overall results agree well with features of human evolution” has been deleted, because CEU, CHB, and YRI samples are comparable (100, 99, and 97, respectively).

      (2) Now the sentence has been changed to “These results agree well with findings reported in previous studies, including that fewer selection signals are detected in YRI (Sabeti et al., 2007; Voight et al., 2006)”.

      (3) On “This section is too short and unclear” - To make the manuscript more readable, we adopt short sections instead of long ones. This section expresses that (a) our finding that more selection signals were detected in CEU and CHB than in YRI agrees with well-established findings (Voight et al. A Map of Recent Positive Selection in the Human Genome. PLoS Biology 2006; Sabeti et al. Genome-wide detection and characterization of positive selection in human populations. Nature 2007), and (b) in a considerable number of DBSs, selection signals were detected by multiple tests.

      Line 325: "and form 198876 HS lncRNA-DBS pairs with target transcripts in all tissues." This has not been shown in this paper - sequence based analyses simply identify the “potential” to form pairs.

      This section describes transcriptomic analysis using the GTEx data. Indeed, target transcripts of HS lncRNAs are results of sequence-based analysis, and a predicted target is not necessarily regulated by the HS lncRNA in a tissue. Here, “pair” means an HS lncRNA-target transcript pair whose expression shows significant Pearson correlation in a GTEx tissue (we do not mean that correlation equals regulation; rather, we identified HS lncRNA-mediated transcriptional regulation based on both the DBS-targeting relationship and the correlation relationship).
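
      To make the notion of a correlated “pair” concrete, the following minimal Python sketch computes a Pearson correlation coefficient between the expression of one lncRNA and one predicted target transcript across samples of a tissue. The expression values and variable names are hypothetical, and the significance test applied in the study is not reproduced here; this only illustrates the correlation step.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression values across samples of one tissue.
lncrna_expr = [1.0, 2.0, 3.0, 4.0, 5.0]
target_expr = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_r(lncrna_expr, target_expr)
print(round(r, 3))
```

      A high coefficient alone does not establish regulation, which is why the response above combines correlation with the predicted DBS-targeting relationship.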

      Line 423: "Our analyses of these lncRNAs, DBSs, and target genes, including their evolution and interaction, indicate that HS lncRNAs have greatly promoted human evolution by distinctly rewiring gene expression." I do not agree that this conclusion is supported by the findings presented - this would require significant additional evidence in the form of orthogonal datasets.

      (1) As mentioned above, we have used “reshape” to replace “rewire” and used “suggest” to replace “indicate”. In addition, we have substantially revised the Discussion, in which this sentence is replaced by “our results suggest that HS lncRNAs have greatly reshaped (or even rewired) gene expression in humans”.

      (2) Multiple citations have been added, including Voight et al. 2006 (Voight et al. A Map of Recent Positive Selection in the Human Genome. PLoS Biology 2006) and Sabeti et al. 2007 (Sabeti et al. Genome-wide detection and characterization of positive selection in human populations. Nature 2007).

      (3) We have analyzed HS TF DBSs, and the obtained results also support the critical contribution of HS lncRNAs.

      I also return briefly to some of my comments before, in particular on the confounding effects of gene length and transcript/isoform number. In their rebuttal the authors argued that there was no need to control for this, but this does in fact matter. A gene with 10 transcripts that differ in the 5' end has 10 times as many chances of having a DBS than a gene with only 1 transcript, or a gene with 10 transcripts but a single annotated TSS. When the analyses are then performed at the gene level, without taking into account the number of transcripts, this could introduce a bias towards genes with more annotated isoforms. Similarly, line 246 focuses on genes with "SNP numbers in CEU, CHB, YRI are 5 times larger than the average." Is this controlled for length of the DBS? All else being equal a longer DBS will have more SNPs than a shorter one. It is therefore not surprising that the same genes that were highlighted above as having 'strong' DBS, where strength is impacted by length, show up here too.

      (1) In gene set enrichment analysis (Figure 2, which is a gene-level analysis), when determining genes differentiating humans from chimpanzees based on DBS sequence distance, if a gene has multiple transcripts/DBSs, we choose the DBS with the largest distance. That is, the input to g:Profiler is a non-redundant gene list.

      (2) In GTEx data analysis (Figure 3, which is a transcriptome-level analysis), the analyses of HS TF DBSs using the GTEx data provide evidence suggesting that different DBS/transcript numbers of genes are unlikely to cause confounding effects. As explained above, we predicted HS TF DBSs in the same promoter regions of 179128 Ensembl-annotated transcripts (release 79), but Supplementary Figures 25 and 26 are distinctly different from Figure 3AB.

      (3) In evolutionary analysis, a gene with 10 DBSs has a higher chance of having selection signals than a gene with 1 DBS. This is biologically plausible, because many conserved genes have novel transcripts whose expression is species-, tissue-, or developmental period-specific, and DBSs before these novel transcripts may differ from DBSs before conserved transcripts.

      (4) “line 246 focuses on genes with "SNP numbers in CEU, CHB, YRI are 5 times larger than the average." Is this controlled for the length of the DBS?” - This is a defect. We have now computed SNP numbers per base and used the new table to replace the old Supplementary Table 8. After examining the new table, we find that the major results of SNP analysis remain.

      (5) On “Is this controlled for length of the DBS? All else being equal a longer DBS will have more SNPs than a shorter one” - We do not think there are reasons to control for the length of DBSs; also, what “All else being equal” means matters. First, DBS sequences have specific features; thus, the feature of a long DBS is stronger than the feature of a short one, making a long DBS less likely to be generated by chance in the genome and less likely to be predicted wrongly than a short one. This means that longer DBSs are less likely to be false ones (note our explanation that the chance of a DBS of 147 bp, the mean length of DBSs, to be wrongly predicted is extremely low, p<8.2e-19 to 1.5e-48). Second, the difference in length suggests a difference in binding affinity, which in turn influences the regulation of the specific transcripts and influences the analysis of GTEx data. Third, it cannot be excluded that some SNPs may be selection signals (detecting selection signals is challenging, and many selection signals cannot be detected by statistical tests, see Grossman et al. A composite of multiple signals distinguishes causal variants in regions of positive selection. Science 2010).

      (6) On “It is therefore not surprising that the same genes that were highlighted above as having 'strong' DBS, where strength is impacted by length” - Indeed, strength is influenced by length, see the above response.
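
      Two of the normalizations described in responses (1) and (4) can be illustrated with a short Python sketch: collapsing multiple DBSs per gene to the one with the largest sequence distance, and expressing SNP counts per base so that DBS length does not dominate. The record layout, field names, and values below are hypothetical and chosen only for illustration; they are not the data or code used in the study.

```python
from collections import namedtuple

# Hypothetical record layout; field names are assumptions for illustration.
DBS = namedtuple("DBS", ["gene", "transcript", "length_bp", "distance", "snp_count"])

toy_dbss = [
    DBS("geneA", "tx1", 100, 0.020, 4),
    DBS("geneA", "tx2", 200, 0.040, 6),
    DBS("geneB", "tx3", 150, 0.010, 3),
]

def max_distance_per_gene(dbss):
    """Keep, for each gene, the DBS with the largest sequence distance."""
    best = {}
    for dbs in dbss:
        if dbs.gene not in best or dbs.distance > best[dbs.gene].distance:
            best[dbs.gene] = dbs
    return best

def snps_per_base(dbs):
    """Normalize the SNP count of a DBS by its length in base pairs."""
    return dbs.snp_count / dbs.length_bp

per_gene = max_distance_per_gene(toy_dbss)
gene_list = sorted(per_gene)  # non-redundant gene list
densities = {d.transcript: snps_per_base(d) for d in toy_dbss}
print(gene_list, per_gene["geneA"].transcript, densities)
```

      In this toy example, the longer DBS (tx2) carries more SNPs in absolute terms but a lower SNP density than the shorter one (tx1), which is the kind of difference a per-base table makes visible.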

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Finally, figure 1 panels D and F are not legible - the font is tiny! There's also a typo in panel A, where "Homo Sapien" should be "Homo sapiens".

      (1) “Homo sapien” is changed to “Homo sapiens”.

      (2) Even if we double the font size, they are still too small. Inserting a very large panel D into Figure 1 will make Figure 1 ugly, and converting Figure 1D into an independent figure is unnecessary. Actually, panels 1D and F are illustrative figures; the full Fig.1D is Supplementary Figure 6, and the full Fig.1F is Figure 3. We have revised Fig.1’s legend to explain these.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This unique study reports original and extensive behavioral data collected by the authors on 21 living mammal taxa in zoo conditions (primates, tree shrew, rodents, carnivorans, and marsupials) on how descent along a vertical substrate can be done effectively and securely using gait variables. Ten morphological variables reflecting head size and limb proportions are examined in relationship to vertical descent strategies and then applied to reconstruct modes of vertical descent in fossil mammals.

      Strengths:

      This is a broad and data-rich comparative study, which requires a good understanding of the mammal groups being compared and how they are interrelated, the kinematic variables that underlie the locomotion used by the animals during vertical descent, and the morphological variables that are associated with vertical descent styles. Thankfully, the study presents data in a cogent way with clear hypotheses at the beginning, followed by results and a discussion that addresses each of those hypotheses using the relevant behavioral and morphological variables, always keeping in mind the relationships of the mammal groups under investigation. As pointed out in the study, there is a clear phylogenetic signal associated with vertical descent style. Strepsirrhine primates much prefer descending tail first, platyrrhine primates descend sideways when given a choice, whereas all other mammals (with the exception of the raccoon) descend head first. Not surprisingly, all mammals descending a vertical substrate do so in a more deliberate way, by reducing speed, and by keeping the limbs in contact for a longer period (i.e., higher duty factors).

      Weaknesses:

      The different gait patterns used by mammals during vertical descent are a bit more difficult to interpret. It is somewhat paradoxical that asymmetrical gaits such as bounds, half bounds, and gallops are more common during descent since they are associated with higher speeds and lower duty factors. Also, the arguments about the limb support polygons provided by DSDC vs. LSDC gaits apply for horizontal substrates, but perhaps not as much for vertical substrates.

      We analyzed gait patterns using methods commonly found in the literature and discussed our results accordingly. However, the study of limb support polygons was indeed developed specifically for studying locomotion on horizontal supports, and may not be applicable to vertical locomotion, which is in fact a type of locomotion shared by all arboreal species. In the future, it would be interesting to consider new methods for analyzing vertical gaits.

      The importance of body mass cannot be overemphasized as it affects all aspects of an animal's biology. In this case, larger mammals with larger heads avoid descending head-first. Variation in trunk/tail and limb proportions also covaries with different vertical descent strategies. For example, a lower intermembral index is associated with tail-first descent. That said, the authors are quick to acknowledge that the five lemur species of their sample are driving this correlation. There is a wide range of intermembral indices among primates, and this simple measure of forelimb over hindlimb has vital functional implications for locomotion: primates with relatively long hindlimbs tend to emphasize leaping, primates with more even limb proportions are typically pronograde quadrupeds, and primates with relatively long forelimbs tend to emphasize suspensory locomotion and brachiation. Equally important is the fact that the intermembral index has been shown to increase with body mass in many primate families as a way to keep functional equivalence for (ascending) climbing behavior (see Jungers, 1985). Therefore, the manner in which a primate descends a vertical substrate may just be a by-product of limb proportions that evolved for different locomotor purposes. Clearly, more vertical descent data within a wider array of primate intermembral indices would clarify these relationships. Similarly, vertical descent data for other primate groups with longer tails, such as arboreal cercopithecoids, and particularly atelines with very long and prehensile tails, should provide more insights into the relationship between longer tail length and tail-first descent observed in the five lemurs. The relatively longer hallux of lemurs correlates with tail-first descent, whereas the more evenly grasping autopods of platyrrhines allow for all four limbs to be used for sideways descent. In that context, the pygmy loris offers a striking contrast. Here is a small primate equipped with four pincer-like, highly grasping autopods and a tail reduced to a short stub. Interestingly, this primate is unique within the sample in showing the strongest preference for head-first descent, just like other non-primate mammals. Again, a wider sample of primates should go a long way in clarifying the morphological and behavioral relationships reported in this study.

      We agree with this statement. In the future, we plan to study other species, particularly large-bodied ones with varied intermembral indices.

      Reconstruction of the ancient lifestyles, including preferred locomotor behaviors, is a formidable task that requires careful documentation of strong form-function relationships from extant species that can be used as analogs to infer behavior in extinct species. The fossil record offers challenges of its own, as complete and undistorted skulls and postcranial skeletons are rare occurrences. When more complete remains are available, the entire evidence should be considered to reconstruct the adaptive profile of a fossil species rather than a single ("magic") trait.

      We completely agree with this, and we would like to emphasize that our intention here was simply to conduct a modest inference test, the purpose of which is to provide food for thought for future studies, and whose results should be considered in light of a comprehensive evolutionary model.

      Reviewer #2 (Public review):

      Summary:

      This paper contains kinematic analyses of a large comparative sample of small to medium-sized arboreal mammals (n = 21 species) traveling on near-vertical arboreal supports of varying diameter. These data are paired with morphological measures from the extant sample to reconstruct potential behaviors in a selection of fossil euarchontoglires. This research is valuable to anyone working in mammal locomotion and primate evolution.

      Strengths:

      The experimental data collection methods align with best research practices in this field and are presented with enough detail to allow for reproducibility of the study as well as comparison with similar datasets. The four predictions in the introduction are well aligned with the design of the study to allow for hypothesis testing. Behaviors are well described and documented, and Figure 1 does an excellent job in conveying the variety of locomotor behaviors observed in this sample. I think the authors took an interesting and unique angle by considering the influence of encephalization quotient on descent and the experience of forward pitch in animals with very large heads.

      Weaknesses:

      The authors acknowledge the challenges that are inherent with working with captive animals in enclosures and how that might influence observed behaviors compared to these species' wild counterparts. The number of individuals per species in this sample is low; however, this is consistent with the majority of experimental papers in this area of research because of the difficulties in attaining larger sample sizes.

      Yes, that is indeed the main cost/benefit trade-off with this type of study. Working with captive animals allows for large comparative studies, but there is a risk that locomotor behavior differs from that of individuals in the natural environment, and the dataset includes few individuals per species. That is why we plan, and encourage colleagues, to conduct studies in the natural environment to compare with these results. However, this type of study is very time-consuming and requires focusing on a single species at a time, which limits the comparative aspect.

      Figure 2 is difficult to interpret because of the large amount of information it is trying to convey.

      We agree that this figure is dense. One possible solution would be to combine species by phylogenetic groups to reduce the amount of information, as we did with Fig. 3 on the dataset relating to gaits. However, we believe that this would be unfortunate in the case of speed and duty factor because we would have to provide the complete figure in SI anyway, as the species-level information is valuable. We therefore prefer to keep this comprehensive figure here and we will enlarge the data points to improve their visibility, and provide the figure with a sufficiently high resolution to allow zooming in on the details.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #2 had several remaining suggestions:

      In some instances, the authors face well-known limitations, for example, bath application of drugs. Blockers of Gly and GABA receptors are likely problematic when studying a network that includes a diverse set of inhibitory interneurons. Likewise, the application of AMPAR and KAR blockers should impact horizontal cell (HC) function, and presumably inner retina interneuron networks. In the Discussion the authors are encouraged to address more of these concerns (e.g., Discussion line 709).

      Rather than concluding that the bath application of drugs is without complications, they can conclude that under the experimental conditions, glutamate release from these On-bipolars continues to exhibit Transient and Sustained release. This is really the key point of their study.

      This is a good suggestion. We have added a discussion of the complications of the pharmacology starting on line 754.

      If indeed sustained release is a reflection of higher release rates, ribbon size is what the authors point to, but there are many other possibilities, such as SV recycling, recruitment of reserve pools of SVs, fusion machinery, or Cav channel behavior. The authors could cite more literature in the Discussion.

      We added a sentence to this effect in the discussion, starting on line 866.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      In the retina, parallel processing of cone photoreceptor output under bright light conditions dissects critical features of our visual environment and is fundamental to visual function. Cone photoreceptor signals are sampled by several types of bipolar cells and passed onto the ganglion cells. At the output of retinal processing, retinal ganglion cells send about 40 different codes of the visual scene to the brain for further processing. In this study, the authors focus on whether subtype-specific differences in the size of synaptic ribbon-associated vesicle pools of bipolar cells contribute to different retinal ganglion cell (RGC) responses. Specifically, inputs to ON alpha RGCs producing sustained versus transient kinetics (ON-S vs. ON-T, respectively) are compared. The authors first demonstrate that ON-S vs. ON-T RGCs are readily identifiable in a whole mount preparation and respond differently to both static and spatially uniform, randomly fluctuating (Gaussian noise) light stimuli. Linear-nonlinear (LN) models were used to estimate the transformation between visual input and excitatory synaptic input for each RGC; these models suggested the presence of transient versus sustained kinetics already in the excitatory inputs to ON-T and ON-S RGCs. Indeed, the authors show that (glutamatergic) excitatory inputs to ON-S vs. ON-T RGCs are of distinct kinetics. The subtypes of bipolar cells providing input to ON-S are known (i.e., type 6 and 7), but the source of excitatory bipolar inputs to ON-T RGCs needed to be determined. In a tedious process, it is elegantly shown here that ON-T RGCs receive most of their excitatory inputs from type 5 and 6 bipolars. Interestingly, the temporal properties of light-evoked responses of type 5, 6, and 7 bipolars recorded from the somas were indistinguishable and rather sustained, suggesting that the origin of transient kinetics of excitatory inputs to ON-T RGCs suggested by the LN model might be found in the processing of visual signals at the bipolar cell axon terminal. Blocking GABA- or glycinergic inhibitory inputs did not alter the light-evoked excitatory input kinetics to ON-T and ON-S RGCs. Two-photon glutamate sensor imaging revealed significantly faster kinetics of light-evoked glutamate signals at ON-T versus ON-S RGCs. Detailed EM analysis of bipolar cell ribbon synapses onto ON-T and ON-S RGCs revealed fewer ribbon-associated vesicles at ON-T synapses, which is consistent with stronger paired-flash depression of light-evoked excitatory currents in ON-T RGCs versus ON-S RGCs. This study suggests that bipolar subtype-specific differences in the size of synaptic ribbon-associated vesicle pools contribute to transient versus sustained kinetics in RGCs.

      Strengths: 

      The use of multiple, state-of-the-art tools and approaches to address the kinetics of bipolar to ganglion cell synapse in an identified circuit. 

      Weaknesses: 

      For the most part, the data in the paper support the conclusions, and the authors were careful to try to address questions in multiple ways. Two-photon glutamate sensor imaging experiment showing that blocking GABA- and glycinergic inhibition does not change the kinetics of light-evoked glutamate signals at ON-T RGCs would strengthen the conclusion that bipolar subtype-specific differences in the size of synaptic ribbon-associated vesicle pools contribute to transient versus sustained kinetics in RGCs. 

      Thank you for this suggestion. We have revised the text throughout to be careful not to imply that amacrine cells have no role in shaping EPSCs and spike output, but instead that the transience of the On-T responses persists without amacrine cells (see for example lines 91, 450-453, 514-518, 696-714). We have also added additional iGluSnFR experiments to the paper to further test this conclusion (new Figure 7). The new data show that the transience of glutamate release from the On-T cells is retained when 1) spiking amacrine cell activity is suppressed by blocking voltage-gated Na⁺ channels with TTX or 2) all amacrine cell activity is suppressed by blocking AMPA receptors with NBQX. This does provide nice additional evidence that amacrine cells are not necessary for the sustained/transient distinction.

      Reviewer #2 (Public Review): 

      Summary: 

      Goal of the study. The authors tried to pinpoint the origins of transient and sustained responses measured at retinal ganglion cells (rgcs), which form the output layer of the retina. Response characteristics of rgcs are used to group them into different types. The diversity of rgc types represents the ability of the retina to transform visual inputs into distinct output channels. They find that the physical dimensions of bipolar cells' synaptic ribbons (specialized release sites/active zones) vary across the different types of cone on-bpcs, in ways that they argue could facilitate transient or sustained release. This diversity of release output is what they argue underlies the differences in on-rgc response characteristics, and ultimately represents a mechanism for creating parallel cone-driven channels.

      Strengths: 

      The major strengths of the study are the anatomical approaches employed and the use of the "glutamate sniffer" to assay synaptic glutamate levels. The outline of the study is elegant and reflects the strengths of the authors. 

      Weaknesses: 

      The major weakness is that the ambitious outline is not matched with a complete set of results, and the set of physiological protocols is disjointed, not sufficient to bridge the systems-level question with the presynaptic release question. 

      Thank you for this comment as it provides an opportunity (here and in the paper) for us to clarify our main goal. We wanted to link the well-established distinction between transient and sustained retinal responses to anatomy. This required locating where this difference arises within the circuitry – which we show to be at least largely the bipolar output synapse – and then examining the structure of this synapse in detail. While we would certainly be interested in connecting our results to a biophysical description of the synapse, that was not the primary focus of our study and was not something we could add without substantial additional work.  

      Major comments on the results and suggestions. 

      The ribbon model of release has been explored for decades and needs to be further adapted to systems-level work. The study under consideration by Kuo et al. takes on this task. Unfortunately, the experimental design does not permit a level of control over presynaptic/bpc behavior that is comparable to earlier studies, nor do they manipulate release in ways that test the ribbon model (i.e., paired recordings or Ribeye-ko). Furthermore, the data needs additional evaluation, and the presentation and interpretations should draw on published biophysical and molecular studies. 

      As described above, our goal was to test several possible explanations for the difference between transient and sustained responses in OnT and OnS ganglion cells: (1) differences in the light responses of the bipolar cells that convey photoreceptor signals to the relevant ganglion cells; (2) shaping of bipolar transmitter release by presynaptic inhibition; (3) shaping of ganglion cell responses by postsynaptic inhibition or spike generation; (4) differences in feedforward bipolar synapses. We were surprised to find that the feedforward bipolar synapses play a central role in this difference, and your comment nicely prompts us to relate this to the large literature on biophysical studies of release from ribbon synapses. We have made substantial revisions in the text to do this. This includes anticipating the importance of feedforward synaptic properties in the abstract and introduction (lines 36-37 and 61-64), pointers in the results (lines 539-548), and several new paragraphs in the discussion (starting on lines 751, 773 and 787). By showing that the transient/sustained difference originates largely at feedforward bipolar synapses, we set the stage for future work that shows how biophysical properties of the synapse shape physiological signals that traverse it.

      To build a ribbon-centric context, consider recent literature that supports the assertion that ribbons play a role in forming AZ release sites and facilitating exocytosis. Reference Ribeye-ko studies. For example, ribbonless bpcs show an 80% reduction in release (Maxeiner et al EMBO J 2016), the ribbonless retina exhibits signaling deficits at the output layer (Okawa et al ...Rieke, ..Wong Nat Comm 2019), and ribbonless rods show an 80% reduction in the readily releasable pool (RRP) of SVs (Grabner Moser, eLife 2021). In addition, the authors could refer to whole-cell membrane capacitance studies on mammalian rods, cones, and bpcs, because the size of the RRP of SVs scales with the dimensions and numbers of ribbons (total ribbon footprint). For bipolars, see the review by Wan and Heidelberger 2011. For a comparison of mammalian rods and cones, see, rods: Grabner and Moser (2021 eLife), Mueller.. Regus Leidig et al. (2019; J Neurosci) and cones Grabner ...DeVries (Nat Comm 2023). A comparison of cell types shows that (1) the extent of release is proportional to the total size of the ribbon footprint, and (2) less release is witnessed when ribbons are deleted (also see photo ablation studies by Snellman.... And Mehta..Zenisek, Nat Neurosci and Neuron).

      Thank you for these pointers into the literature.  We have included much of this work in the revised Discussion (see three paragraphs starting on line 751). The revised text focuses on the evidence that larger and more numerous ribbons lead to increased release. The direct evidence from previous work for this relationship supports our (indirect) conclusions in the current paper about the role of ribbon size and associated vesicle pools in transient vs sustained responses.  

      Ribbon morphology may change in an activity-dependent manner. The rod ribbon AZ has been reported to lengthen in the dark (Dembla et al 2020), and deletion of the ribbon shortens the length of the AZ (defined by Cav1,4 or RIM2); in addition, the Ribeye-ko AZs fail to change in size with light and dark conditioning. Furthermore, EM studies on rod and cone AZs in light and dark argue that the number of SVs at the base of the ribbon increases in the dark, when PRs are depolarized (see Figure 10, Babai et al 2016 JNeurosci). Lastly, using goldfish Mb1 on-bipolars, Hull et al (2006, J Neurophysio) correlated an increase in release efficiency with an increase in ribbon numbers, which accompanied daylight. >> When release activity is high, ribbon AZ length increases (Dembla, rods), the number of docked SVs increases (Babai, rods cones), and the number of ribbons increases (Hull, diurnal Mb1s). 

      We have extensively revised the discussion section to include more discussion of ribbons, particularly emphasizing evidence supporting the general argument that larger ribbons support higher release rates. We focused on studies that provided direct links between release rates and ribbon size or number of ribbon-associated vesicles.  This includes studies that pair electrophysiology and anatomy and those that measure the consequences of ablating ribbons.

      The results under review, Kuo et al., were attained with SBF-SEM, which has the benefit of addressing large-volume questions as required here, yet it achieves lower spatial resolution than what is attained with TEM tomography and FIB-EM. Ideally, the EM description would include SV size, and the density of ribbon-tethered SVs that are docked at the plasma membrane, because this is where the SVs fuse (additional non-ribbon release sites may also exist? Mehta ... Singer 2014 J Neurosci). Studies by Graydon et al 2011 and 2014 (both in J Neurosci), and Jean ... Moser et al 2018 (eLife) are good examples of quantitative estimates of SV docking sites at ribbons. SBF-SEM does not allow for an assessment of SVs within 5 nm of the PM, but if the authors can identify the number of SVs that appear within the limit of resolution (10 to 15 nm) from the PM, then this data would be useful. Also, what dimension(s) of the large ribbons make them larger? Typically, ribbons are fixed in height (at least in the outer retina, 200 to 250 nm), but their length varies and the number of ribbons per terminal varies. Is the larger ribbon size observed in type 6 bpcs due to longer ribbons, or taller ribbons? A longer ribbon likely has more docked SVs. An additional possibility is that more SVs are about the ribbon-PM footprint, either more densely packed and/or expanding laterally (see definitions in Jean....Moser, eLife 2018). 

      We have included an additional analysis of ribbon surface area from our 3D SBFSEM reconstructions. As with the volume measurements included in the original submission, ribbon surface areas are distinct between type 5i and type 6 bipolar cells (Fig. S10A), ON-T RGCs on average receive input from ribbons with smaller surface area than ON-S RGCs (Fig. S10B), and ribbon surface area predicts the number of adjacent vesicles across bipolar cell types (Fig. S10C).  We agree that a higher resolution view of presynaptic structures would be very helpful, but the resolution of our SBF-SEM data is limited (e.g. each pixel is 40 nm on a side).  This resolution does not allow us to distinguish between vesicles at vs near the membrane. 

      In our observations, both the length and the height of the ribbons varied across individual bipolar cells, and ribbons in type 6 bipolar cells tended to be longer and/or taller than those in type 5 cells. We agree that a longer ribbon may accommodate more docked SVs. A more definitive analysis would benefit from higher-resolution, isotropic 3D reconstructions of ribbons, which would allow more precise shape analysis together with a detailed assessment of docked SVs at the ribbons.

      The ribbon literature given above makes the argument that ribbons increase exocytotic output, and morphological studies suggest that release activity enhances 1) ribbon length (Dembla) and 2) the density of SVs near the PM (Babai). These findings could lead one to propose that type 6 bpcs (inputs to On-sustained) are more active than type 5i (feed into On-transient). Here Kuo et al. show that the bpcs have similar Vm (measured from the soma) in response to light stimulation. Does Vm predict release? Not entirely as the authors acknowledge, because: Cav channel properties, SV availability, and negative feedback are all downstream of bpc Vm. The only experiment performed to test downstream factors focused on negative feedback from amacrines. The data presented in Figures 5C-F led me to conclude the opposite of what the authors concluded. My impression is that the T-ON rgc exhibits strong disinhibition when GABA-blockers are applied (the initial phase is greatly increased in amplitude and broadened with the drug), which contrasts with the S-On rgc responses that show a change in the amplitude of the initial phase but not its width (taus would be nice). Here and in many places the authors refer to changes in release kinetics, without implementing a useful description of kinetics. For instance, take the cumulative current (charge) in Figure 5C and fit the control and drug traces to arrive at taus, and their respective amplitudes, and use these values to describe kinetic phases. One final point, the summary in Figure 5D has a p: 0.06, very close to the cutoff for significance, which begs for more than an n = 5. Given that previous studies have shown that bpc output is shaped by immediate msec GABA feedback, in ways that influence kinetic phases of release (..Mb1 bipolars, see Vigh et al 2005 Neuron), more attention to this matter is needed before the authors rule out feedback inhibition in favor of ribbon size. 
      If by chance, type 5i bpcs are under uniquely strong feedback inhibition, then ribbon size may result from less activity, not less output resulting from smaller ribbons.

      The text surrounding Figure 5 led to some confusion, and we have revised that text and the figure for clarity.  First, the data in that figure is entirely from On-T cells (the upper and lower panels show block of GABA and glycine receptors separately).  Second, the observation that we make there is that block of inhibitory receptors increases the transience of the On-T excitatory input, rather than decreasing it as would be expected if the transience is created by presynaptic inhibition. We have added additional data and that increase in transience is now significant. Inhibitory block does substantially increase the amplitude of the postsynaptic response, and a likely origin of this change in response is inhibitory feedback to the bipolar synaptic terminal. We now indicate this in the text on page 13, lines 438-453. 

      The key result of this figure for our purposes here is that the transience of the excitatory input to the On-T cell remains with inhibitory input blocked. We have clarified throughout the text that our results indicate that inhibitory feedback is not necessary for the difference between transient release onto On-T and sustained release onto On-S. This does not mean that inhibitory feedback does not shape the responses in other ways or contribute to the transient/sustained difference - just that for the specific stimuli we use that difference is retained without presynaptic inhibition. We have also added citations to past work showing that activity of amacrine cells can modulate bipolar transmitter release. 

      Whether strong feedback inhibition limits activity and therefore limits ribbon size in an activity-dependent way is an intriguing possibility. Indeed, addressing why ribbons are larger in type 6 bipolar cells vs. other bipolar types will be an interesting avenue of further study. However, it would be surprising if ribbon sizes changed during the acute pharmacological block conditions (~10-15 minutes) we employed in our study. Our point here is that there is an interesting correlation between presynaptic ribbon size and the kinetics of glutamate release. We do not think that the two possibilities stated in the last sentence (“…ribbon size may result from less activity, not less output resulting from smaller ribbons”) are mutually exclusive.

      We have not further quantified the response kinetics in the experiments of Figure 5 as the large changes induced by the pharmacology (especially GABA receptor block) make it unclear how to interpret quantitative differences.  In other places we have quantified kinetics through the STA or specified that our focus was more qualitative (i.e. transient vs sustained kinetics). 
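
      For reference, the kind of kinetic description the reviewer suggests (integrate the current to obtain cumulative charge, then fit a time constant) can be sketched as follows. This is an illustration on a synthetic trace, not the analysis used in the paper. It is simplified to a single exponential, whereas the reviewer proposes fitting several phases with separate amplitudes. The function names, the 1 kHz sampling rate and the 0.1 s time constant are assumptions chosen for the example.

```python
import math

def cumulative_charge(current, dt):
    """Trapezoidal running integral of a current trace (same length as input)."""
    q = [0.0]
    for i0, i1 in zip(current, current[1:]):
        q.append(q[-1] + 0.5 * (i0 + i1) * dt)
    return q

def fit_tau(current, dt, floor=0.05):
    """Estimate one decay constant from the cumulative charge.

    For I(t) = A*exp(-t/tau), the charge still to come, Q_end - Q(t),
    also decays as exp(-t/tau), so a line fit to its log has slope -1/tau.
    Only points above `floor` of the total are used, to avoid log(0).
    """
    q = cumulative_charge(current, dt)
    total = q[-1]
    pts = [(i * dt, math.log((total - qi) / total))
           for i, qi in enumerate(q) if (total - qi) / total > floor]
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))
    return -1.0 / slope

# Synthetic inward current (pA), tau = 0.1 s, sampled at 1 kHz for 1 s.
dt = 0.001
trace = [-800.0 * math.exp(-i * dt / 0.1) for i in range(1000)]
tau_est = fit_tau(trace, dt)  # close to 0.1 s for this noiseless trace
```

      On real recordings the fit depends on where the response is taken to end and on how many components are assumed, so any such fit would need to be interpreted with the caveat noted above.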

      As mentioned above, the behavior of Cav channels is important here. This is difficult to address with voltage clamps from the soma, especially in the Vm range relevant to this study. Given that it has previously been modeled that the rod bpc to AII pathway adapts to prolonged depolarization of rbcs through downregulating Cav channel-mediated Ca<sup>2+</sup> influx (Grimes ....Rieke 2014 Neuron), it seems important for Kuo et al to test if there is a difference in Cav regulation between type 6 and 5i bpcs. Ca<sup>2+</sup>  imaging with a GCaMP strategy (Baden....Lagnado Current Biology, 2011) or filling the presynapse with Ca dyes (see inner hair cells: Ozcete and Moser, EMBO J 2020) would allow for the correlation of [Ca]intra with GluSnf signals (both local readouts).

      This is a good suggestion but is outside the scope of our current paper. Our focus was on the circuit origin of the difference in response of the OnT and OnS responses rather than the specific biophysical mechanism.  We are of course interested in the mechanism, but the additional experiments needed to pin that down would need to be a part of future experiments. The work here represents an important step in that direction as it greatly reduces the number of possible locations and mechanisms for the sustained/transient difference and hence serves to focus any future mechanistic investigations.

      Stimulation protocol and presentation of Glutamate Sniffer data in Figure 6. In all of your figures where you state steady state as a % of pk amplitude, please indicate in the figure where you estimate steady state. Alternatively, if you take the cumulative dF/F signal, then you can fit the different kinetic phases. From the appearance of the data, the Sustained Glu signals look like square waves (Figure 6B ROI1-4), without a transient at onset, which is not predicted in your ribbon model that assumes different kinetic phases (1. depletion of docked SVs, and 2. refilling and repriming). The Transient responses (Figure 6B ROI5-8) are transient and more compatible with a depressing ribbon scheme. If you take the cumulative for all of the On-S responses and compare it to that for all of the On-T responses, my guess is the cumulative dF/F will be 10- to 20-fold larger for the S-On. Would you conclude that bpc inputs to On-S (type 6) release 20-fold more SVs per 4 seconds on a per ribbon basis, and does the surface area of the type 6 bpcs account for this difference? From Figures 8B and D, the volume of the ribbon is ~2 fold greater for type 6 vs 5i, but the Surface Area (both faces of ribbon) is more relevant to your model that claims ribbon size is the pivotal factor. If making cumulative traces, and comparisons on an absolute scale is unfounded, then we need to know how to compare different observations. The classic ribbon models always have a conversion factor such as the capacitance of an SV or q size that is used to derive SV numbers from total dCm or Qcontent. See Kim ....et al von Gersdorff, 2023, Cell Reports. Why not use the Gaussian noise stimulus in Fig 6 as in Figure 1 and 2? 

      For iGluSnFR recordings, steady-state responses were measured from the mean fluorescence over the last 1 sec of the light step (2 sec duration) response. We have included this information in the figure caption and in the Methods. 
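
      The measurement described above can be sketched as follows. This is a minimal illustration, assuming a baseline-subtracted dF/F trace sampled at a fixed interval and a peak taken as the maximum during the step. The function name, the sampling interval and the synthetic traces are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean

def steady_state_fraction(trace, dt, step_onset, step_duration=2.0, window=1.0):
    """Steady-state response as a fraction of the peak response.

    trace: baseline-subtracted dF/F samples; dt: sample interval (s);
    step_onset: time of light-step onset (s). The steady state is the
    mean over the last `window` seconds of the step.
    """
    start = int(round(step_onset / dt))
    stop = int(round((step_onset + step_duration) / dt))
    step = trace[start:stop]
    peak = max(step)
    n_window = int(round(window / dt))
    return mean(step[-n_window:]) / peak

# Synthetic examples: 4 s traces at 100 Hz with a 2 s step starting at 1 s.
dt = 0.01
sustained = [1.0 if 100 <= i < 300 else 0.0 for i in range(400)]
transient = [math.exp(-(i - 100) * dt / 0.2) if 100 <= i < 300 else 0.0
             for i in range(400)]

frac_sustained = steady_state_fraction(sustained, dt, step_onset=1.0)
frac_transient = steady_state_fraction(transient, dt, step_onset=1.0)
```

      In this toy case the square-wave trace gives a fraction of 1, while the decaying trace gives a value near 0.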

      There is a good deal of variability in the iGluSnFR responses from one ROI to another, and the ROIs shown in the original submission had a less prominent transient component than many other ROIs. We have replaced this figure with another that is more representative of the average behavior across ROIs. The full range of behavior is captured in Figure 6C; it is clear across ROIs that glutamate release near ON-S dendrites shows both sustained and transient components. The new experiments in which we block amacrine cell activity also include a few more example ROIs from ON-S cells, and those also show both transient and sustained components.

      Your suggestion to integrate the iGluSnFR signals to compare to our structural analysis of ribbons is interesting. However, we are hesitant to make a quantitative comparison between the two without further experiments to validate how the iGluSnFR signals we measure relate to release of single vesicles. For example, a quantitative measure of release based on the iGluSnFR experiments would require accounting for possible differences in the expression of the indicator - which could differ both in overall level and/or location relative to release sites. 

      This comment and one above highlight the importance of measures of ribbon surface area, which we now provide (Figure S10).

      Figure 7. What is the recovery time for mammalian cones derived from ribbon-based models? There are estimates from membrane capacitance studies. Ground squirrel cones take 0.7 to 1 sec to recover the ultrafast, primed pool of SVs when probed with a paired-pulse protocol (Grabner ...DeVries 2016, Neuron). Their off-bpcs take anywhere from under 0.2 sec to a second to recover, which is a combination of many synaptic factors (Grabner ...DeVries Nat Comm 2023). Rod On bpcs take over a second (Singer Diamond 2006, reviewed Wan and Heidelberger 2011). In Figure 7B, the recovery time is ~150 ms for the responses measured at rgcs. This brief recovery time is incompatible with existing ribbon models of release. Whole-cell membrane capacitance measurements would be helpful here.

      Thanks for drawing our attention to this issue. Indeed, we see a relatively rapid recovery in the paired-flash experiments. We now discuss this recovery time in the context of past measurements of recovery of responses in cones and bipolar cells (paragraph starting on line 773). There are many factors that could contribute to the relatively rapid recovery we observe - including synaptic factors such as those highlighted by Grabner et al., (2016) either at the cone-to-bipolar synapses or the bipolar-to-RGC synapses. We are certainly interested in a more detailed understanding of this issue, but the additional experiments are outside the scope of this paper.  

      Experimental Suggestion: Add GABA blockers and see if type 5i bpc responds with more release (GluSniff) and prolonged [Ca2+] intra (GCaMP). Compare this to type 6 bpc behavior with GABA/gly blockers. This will rule in or out whether feedback inhibition is involved. 

      Figure 7 in the revised manuscript includes two new experiments examining glutamate release (without the simultaneous measurement of bipolar cell intracellular calcium) while blocking (1) all/most amacrine cell-mediated inhibition via inclusion of NBQX in the bath solution, and (2) blocking spiking amacrine cells via inclusion of TTX in the bath solution. The transient vs sustained difference in light-evoked glutamate release around ON-T and ON-S RGC dendrites remained with amacrine activity suppressed. These new results are consistent with the anatomical and pharmacological data that were included in the initial submission of the manuscript (Fig. 5) that indicate presynaptic inhibition does not have a major role in shaping release kinetics at these synapses. 

      Reviewer #3 (Public Review): 

      Summary: 

      Different types of retinal ganglion cell (RGC) have different temporal properties - most prominently a distinction between sustained vs. transient responses to contrast. This has been well established in multiple species, including mice. In general, RGCs with dendrites that stratify close to the ganglion cell layer (GCL) are sustained; whereas those that stratify near the middle of the inner plexiform layer (IPL) are transient. This difference in RGC spiking responses aligns with similar differences in excitatory synaptic currents as well as with differences in glutamate release in the respective layers - shown previously and here, with a glutamate sensor (iGluSnFR) expressed in the RGCs of interest. Differences in glutamate release were not explained by differences in the distinct presynaptic bipolar cells' voltage responses, which were quite similar to one another. Rather, the difference in transient vs. sustained responses seems to emerge at the bipolar cell axon terminals in the form of glutamate release. This difference in the temporal pattern of glutamate release was correlated with differences in the size of synaptic ribbons (larger in the bipolar cells with more sustained responses), which also correlated with a greater number of vesicles in the vicinity of the larger ribbons. 

      The main conclusion of the study relates to a correlation (because it is difficult to manipulate ribbon size or vesicle density experimentally): the bipolar cells with increased ribbon size/vesicle number would have a greater possibility of sustained release, which would be reflected in the postsynaptic RGC synaptic currents and RGC firing rates. This model proposes a mechanism for temporal channels that is independent of synaptic inhibition. Indeed, some experiments in the paper suggest that inhibition cannot explain the transient nature of glutamate release onto one of the RGC types. Still, it is surprising that such a diverse set of inhibitory interneurons in the retina would not play some role in diversifying the temporal properties of RGC responses. 

      Strengths: 

      (1) The study uses a systematic approach to evaluating temporal properties of retinal ganglion cell (RGC) spiking outputs, excitatory synaptic inputs, presynaptic voltage responses, and presynaptic glutamate release. The combination of these experiments demonstrates an important step in the conversion from voltage to glutamate release in shaping response dynamics in RGCs. 

      (2) The study uses a combination of electrophysiology, two-photon imaging, and scanning block-face EM to build a quantitative and coherent story about specific retinal circuits and their functional properties. 

      Weaknesses: 

      (1) There were some interesting aspects of the study that were not completely resolved, and resolving some of these issues may go beyond the current study. For example, it was interesting that different extracellular media (Ames medium vs. ACSF) generated different degrees of transient vs. sustained responses in RGCs, but it was unclear how these media might have impacted ion channels at different levels of the circuit that could explain the effects on temporal tuning.

      We do not have an explanation for the quantitative differences in response kinetics we observed in Ames’ medium vs. ACSF. There are modest differences in calcium and magnesium concentration and a larger difference in potassium (2.5 mM in ACSF vs 3.6 mM in Ames). It would be interesting to test which of these (or other) differences accounts for the difference in response kinetics.

      (2) It was surprising that inhibition played such a small role in generating temporal tuning. At the same time, there were some gaps in the investigation of inhibition (e.g., IPSCs were not measured in either of the RGC types; pharmacology was used to investigate responses only in the transient RGCs).

      We were also surprised at this result. We have included additional data on inhibition in the revised manuscript. Figure S3 shows light-evoked IPSC data from both RGC types, and Fig. 7 shows additional iGluSnFR measurements around both ON-T and ON-S RGC dendrites with inhibition blocked via bath application of NBQX and, separately, with inhibition from spiking amacrine cells blocked with TTX. These experiments provide additional evidence for the small role of inhibition. We attempted to measure the kinetics of excitatory input to ON-S cells with inhibition blocked, but we found that the excitatory input showed strong spontaneous oscillations under these conditions and the light responses were changed so drastically that we did not feel we could make a clear comparison with control conditions.

      (3) There could be additional discussion and references to the literature describing several topics, including: temporal dynamics of glutamate release at different levels of the IPL; previous evidence that release sites from a single presynaptic neuron can differ in their temporal properties depending on the postsynaptic target; previous investigations of the role of inhibition in temporal tuning within retinal circuitry. 

      Thanks, we have included more discussion and references to the relevant literature as you have suggested in the recommendations to authors.

      Reviewer #1 (Recommendations For The Authors): 

      The presented raw data of the pharmacological experiments show that SR95531 and TPMPA robustly increased both the amplitude and duration of the transient component of the light step-evoked excitatory currents, with slight, if any, enhancement of the sustained component in ON-T RGCs (Figure 5C). Statistical analysis of the population data (n=5) with Wilcoxon signed rank test yielded no significant difference (line 363). However, reanalyzing the data extracted from the graph (Figure 5D) revealed that the difference between the paired observations is normally distributed (Shapiro-Wilk normality test, P=0.48), allowing parametric statistics to be used, which provides higher statistical power. Accordingly, reanalyzing the presented data with a paired Student's t-test revealed significant differences (P=0.01) in the steady-state amplitude normalized to that of the peak, recorded in the presence of SR95531 and TPMPA. In other words, based on the (rough) analysis of the presented pharmacology data, GABAergic feedback inhibition significantly contributes to shaping the transient portion of the light-evoked excitatory currents in ON-T RGCs, by making it more transient. I believe a similar analysis based on the actual data is necessary, and the results should be communicated either way. However, if warranted, two-photon glutamate sensor imaging experiments showing that blocking GABA- and glycinergic inhibition does not change the kinetics of light-evoked glutamate signals at ON-T RGCs should also be performed, as these would be critical in drawing a conclusion regarding the effect of feedback inhibition on glutamate release from bipolar cells.

      Thanks for this feedback. We have added another cell to the data set in Fig. 5D. With this addition, SR95531/TPMPA application significantly increases the response transience of excitatory currents measured in ON-T RGCs compared to control. This enhanced transience in GABA<sub>A/C</sub> receptor blockers is due to an increase in the amplitude of the initial peak component of the response (control peak amplitude: -833.7±103.3 pA; SR95531+TPMPA peak amplitude: -2023±372.7 pA; p=0.03, Wilcoxon signed rank test), with no change to the later sustained component (control plateau amplitude: -200.7±14.71 pA; SR95531+TPMPA plateau amplitude: -290.9±43.69 pA; p=0.15, Wilcoxon signed rank test).

      We should clarify that this result indicates that GABAergic inhibition makes the excitatory inputs to ON-T RGCs less transient. Block of GABA receptors increased transience, thus intact GABAergic transmission appears to limit the initial peak of the response and therefore make excitatory currents more sustained. We unfortunately were not able to examine whether sustained excitatory currents in ON-S RGCs would become more transient using the same approach. In our hands, bath application of SR95531+TPMPA led to the generation of large-amplitude (>1nA) oscillatory bursts of excitatory input that developed within 5 minutes and persisted for the duration of the incubation (up to ~30 min) in drugs. Further, presentation of light steps tended to induce variable amplitude responses, likely dependent on the presence of spontaneous bursts; when large amplitude responses were evoked, these typically oscillated for several seconds after the step.

      To examine a potential role for presynaptic inhibition in transient vs. sustained bipolar cell output, we therefore chose to eliminate amacrine cell-mediated inhibition by bath application of the AMPA/kainate receptor antagonist NBQX in additional iGluSnFR measurements. This manipulation should leave ON bipolar cell responses intact while eliminating most amacrine cell-mediated responses (and OFF bipolar cell driven responses). In separate experiments, we also eliminated inhibition from spiking amacrine cells by bath application of TTX. As shown in new Fig. 7, sustained and transient responses persisted in distal versus proximal RGC dendrites, respectively. Compared to SR95531/TPMPA, bath application of NBQX was not associated with spontaneous bursts of glutamate release around ON-S dendrites. These results show that amacrine cell-mediated inhibition is not required for either sustained or transient glutamate release from bipolar cells that provide input to ON-S and ON-T RGCs.

      Small points: 

      (1) The legend of Figure 1 (D) refers to shaded areas to show ± SEM, but no shade is visible (at least in my printout).

      The SEM shading is there in Fig. 1D but is mostly obscured by the mean lines for the respective RGC types. We have added this to the figure caption.

      (2) I found the reported Vrest for the ON bipolar cells somewhat depolarized. Perhaps due to the uncompensated junction potentials? 

      These measurements are indeed not corrected for the liquid junction potential (which is approximately -10.8 mV between K-gluconate internal and Ames’ solution). We did not apply this correction since the appropriate value is not clear in perforated patch recordings as the intracellular chloride concentration is unknown (and can differ from that in the pipette solution). We have clarified this in the results text where we describe the Vrest values (lines 335-338).

      (3) It is Wilcoxon signed rank test, not Wilcoxan. 

      Thanks for catching this. This has been corrected in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors): 

      Some amacrines express vesicular Glut-3 transporter and are reported to release glutamate (Marshak, Vis Neurosci 2016). Are Amacrine vGlut3 signals postsynaptic (within ~0.5 um) to cone bpc ribbons?

      We did not characterize VGluT3-expressing amacrine cells in our SEM datasets. A recent study by Friedrichsen et al. (Nat. Comm. 2024; PMID 38580652) using 3D SEM reconstructions found that VGluT3-expressing amacrine cells are postsynaptic to both type 5i and type 6 bipolar cells, as well as other type 5/xbc bipolar cells (and receive >50% of their input from type 3a OFF bipolar cells).

      How far apart are the postsynaptic targets from the ribbon release sites? The ribbons at type 5i bpc/On-T input appear separated from the dendrites of On-T rgcs (Figure 8C). At least further away than the type 6 bpc ribbons are from On-S rgc dendrites (Figure 8C). Distance may create a thresholding phenomenon, whereby only multivesicular bouts at the onset of depolarization are able to elevate synaptic Glu to levels needed to activate On-T GluRs. See Grabner et al Nat Comm 2023 for such scenarios in the outer retina.

      This is an intriguing possibility, but we should point out that the presynaptic ribbons in Fig. 9C (former Fig. 8C) are similar distances (within the resolution of our reconstructions) from the ON-T and ON-S dendrites. We have increased the brightness of the dendrite segments for both RGC types in the resubmission figure; note that ON-T RGCs have spine-like protrusions that may not have been as apparent in the previously submitted version of our manuscript.

      In Figures 1 and 2, Sustained responses look like the derivative of Transient responses, minus the negative going inflection. In addition, the sustained responses appear to have a lower threshold of activation than the transient On rgcs, because there are more bouts of action potentials (and membrane depol in V-clamp) with earlier onset in sustained than transients traces. It would be great if the GLuSniff data captured these differences. Take cumulative dF/F and see what the onset time is, or an initial tau if possible.

      This is a good suggestion. However, we are reluctant to make detailed quantitative comparisons such as this without further validation of how the kinetics of the iGluSnFR signals relate to kinetics of glutamate release.  A specific concern is that differences in the location and amount of iGluSnFR expression could impact any such comparisons.

      A recent study by Kim et al von Gersdorff (Cell Reports, 2023) presents interesting phases of release in response to light flashes, measured from AIIs, and complementary results from pairs of rbcs-AIIs. The findings highlight the complexity of SV pools under well-controlled experiments. Could their results be explained as variations in rbc ribbon size through development, and possibly between rbcs or within an rbc? 

      This certainly seems possible and would be consistent with the dependence of release on ribbon size that our results support.  It would be interesting to see if there are clear anatomical correlates of that change in release properties.  

      Figure 5 is a pivotal point in the study, but my review has identified numerous weaknesses. The feedback inhibition onto bipolar cell terminals is likely to sculpt glutamate release, and the results do not convincingly rule out this possibility. The suggestions for improvements range from the data needing to be reanalyzed with regard to statistical tests, and/or adding a few more data points (n = 5) before concluding a p: 0.06 is insignificant. 

      We have added an additional recording to this data set. With n = 6 cells, there is now a statistically significant difference between ON-T RGC excitatory currents measured in control conditions versus during GABA<sub>A/C</sub> receptor blockade. Please note that all the recordings shown in Figure 5C-F are from ON-T RGCs (the two panels show separately block of GABAergic and glycinergic receptors). We did not make it sufficiently clear that the original trend (now statistically significant) is opposite to that expected if presynaptic GABAergic inhibition contributes to response transience in ON-T RGCs. What we see is that excitatory synaptic inputs to ON-T RGCs become more transient (rather than more sustained) during GABA<sub>A/C</sub> receptor blockade. We have revised the text in that section to make this point more clearly.

      We have also included new data from iGluSnFR measurements showing that bath application of NBQX does not affect light step-evoked glutamate release kinetics at proximal (sustained) or distal (transient) RGC dendrites (control: steady-state amp. as % of peak amp. 13 ± 10; mean ± S.D.; n = 189 ROIs/4 FOVs for ON-T dendrites vs 40 ± 12; mean ± S.D.; n = 287 ROIs/8 FOVs for ON-S dendrites; NBQX: 6 ± 3; mean ± S.D.; n = 112 ROIs/1 FOV for ON-T dendrites vs 23 ± 9; mean ± S.D.; n = 97 ROIs/2 FOVs for ON-S dendrites; *p<0.001). By blocking glutamate receptors on amacrine cells, NBQX (AMPA/KAR antagonist) eliminates all/most amacrine cell-mediated signaling in the retina and should therefore abolish presynaptic inhibitory input to bipolar cell terminals across the IPL. Taken together, our results indicate that presynaptic inhibition does not play a critical role in establishing transient versus sustained kinetics for the stimulus conditions we employed in our study.
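
      The "steady-state amp. as % of peak amp." metric quoted above can be illustrated with a minimal sketch. The function name, the baseline window, and the steady-state window are assumptions made for illustration; they are not taken from the manuscript's analysis code.

```python
def steady_state_percent_of_peak(trace, baseline_n, steady_n):
    """Steady-state amplitude as a percentage of peak amplitude.

    trace      : sequence of dF/F samples for one ROI (hypothetical input)
    baseline_n : number of pre-stimulus samples used as baseline (assumed)
    steady_n   : number of final samples averaged as steady state (assumed)
    """
    # Subtract the mean pre-stimulus level from the response period.
    baseline = sum(trace[:baseline_n]) / baseline_n
    response = [v - baseline for v in trace[baseline_n:]]
    # Peak is the largest baseline-subtracted value during the response.
    peak = max(response)
    # Steady state is the mean of the last steady_n response samples.
    steady = sum(response[-steady_n:]) / steady_n
    return 100.0 * steady / peak


# Synthetic example: a transient peak of 1.0 that relaxes to 0.25.
example_trace = [0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.25, 0.25, 0.25, 0.25]
example_percent = steady_state_percent_of_peak(example_trace, baseline_n=4, steady_n=4)
```

      On this scale, a strongly transient response gives a small percentage and a sustained response gives a larger one.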

      There is a need to cite more recent literature on bipolar cell ribbons (e.g. see Wakeham et al., Front. Cell. Neurosci., 2023), in order to support experimental design and interpretation of the results. The authors should discuss their Ribeye-KO data from Okawa et al 2019 Nat Comm, Figure 7, in the context of their new iGluSnFR results. 

      Thank you for prompting us on this issue. We have expanded the discussion regarding ribbons and included more citations to the ribbon literature. That is largely in the three paragraphs starting on line 727.

      One point deserves emphasis because it is central to the authors' ribbon model but not consistent with their data. The ribbon model as they put it, and as commonly stated, holds that a transient phase of release at the onset of depolarization indicates the depletion of the primed SVs, and the subsequent slower rate of release (steady state release in the authors' terms) reflects recruiting, priming, and release of new SVs. The On-transient dendrite GluSnf responses agree with this multiphasic process, but the sustained responses show only an elevation in glutamate without a pronounced initial peak, creating a square-wave-shaped response (Figure 6B). This does not agree with the simple ribbon-based release model. I would expect the signals from the T- and S-on dendrites to have a comparable initial phase, while the sustained phase should be greater in amplitude for the S-on dendrites. More discussion may clarify possible mechanisms.

      Thanks for pointing this out. The example iGluSnFR traces we originally included in the manuscript were not entirely representative in that they did not show much initial transient phase. Note there is a distribution of steady-state amplitudes for proximal dendrites in Fig. 6C; the examples are from ROIs from the upper end of the distribution. In the new Figure 7, we have included some additional examples that show both a clear transient and sustained component. The summary data in Figure 6C shows the distribution of sustained/transient ratios across ROIs.  

      Reviewer #3 (Recommendations For The Authors): 

      (1) It would be interesting to understand the differences in IPSCs in the two RGC types. Perhaps they are small in both types, which would explain their apparent lack of impact on temporal tuning. The authors may already have these data.

      We did make measurements of noise-evoked IPSCs (as well as EPSCs) in a subset of ON-T and ON-S recordings. We have now included these data as Figure S3. There are slight differences in the kinetics of inhibition between RGC types (Fig. S3C) and there is a trend towards stronger inhibition (relative to excitation) in ON-T RGCs compared to ON-S RGCs (Fig. S3E), although the difference is not statistically significant. In both cases excitatory synaptic currents are as large as or larger than inhibitory currents, and this does not include the difference in driving force near spike threshold, which will favor excitatory input by a factor of 2-3. Hence our data suggest that postsynaptic inhibition does not play a major role in generating the differential temporal spiking responses of ON-T and ON-S RGCs. However, additional experiments examining the relative contribution of excitation and inhibition to spiking output in these RGCs would be needed to reach a firm conclusion.
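
      The driving-force argument can be made concrete with a short worked example. The reversal potentials (0 mV for excitation, -70 mV for inhibition) and the membrane potential near spike threshold (-50 mV) are illustrative assumptions, not values reported in the manuscript.

```python
def driving_force_ratio(v_membrane_mv, e_exc_mv=0.0, e_inh_mv=-70.0):
    """Ratio of excitatory to inhibitory driving force at a membrane potential.

    All default values are assumed for illustration only.
    """
    # Driving force is the distance between membrane and reversal potential.
    excitatory = abs(v_membrane_mv - e_exc_mv)
    inhibitory = abs(v_membrane_mv - e_inh_mv)
    return excitatory / inhibitory


# Assumed membrane potential near spike threshold: -50 mV.
ratio_near_threshold = driving_force_ratio(-50.0)
```

      Under these assumed values the excitatory driving force is 50 mV and the inhibitory driving force is 20 mV, a ratio of 2.5, which falls within the factor of 2-3 mentioned above. The ratio is sensitive to the assumed threshold and chloride reversal potential.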

      The pharmacological experiments in which we blocked inhibition (Fig. 5C-F, new Fig. 7) were designed to test the effect of presynaptic inhibition on bipolar cell output (voltage-clamp isolation of excitatory currents in Fig. 5; iGluSnFR measurements of glutamate release in Fig. 7). We do not mean to suggest that postsynaptic inhibition does not have any role in shaping the spiking behavior of these RGC types, but that transient vs. sustained kinetics are already present in the bipolar cell output and that presynaptic inhibition of bipolar cell terminals does not appear to account for this difference.  We have revised the text throughout to be clearer on this point.

      (2) It could be convincing to show transient/sustained differences between RGC types in dim light, where the response would depend on the rod bipolar/AII circuit. In this case, any difference in temporal properties would presumably be explained by differences that localize to the cone bipolar cell axon terminals. Indeed, is that the result in Figure 1B? This seems to be a dim stimulus presented on darkness, which may be driven through the rod bipolar pathway. The authors could then discuss the interpretation of this data in terms of the rod bipolar circuit. 

      Yes, Figure 1B is a dim light step (~30 R*/rod/s) presented from darkness, and the distinction between cells remains clear at still lower light levels that more effectively isolate signaling through the rod bipolar pathway. Thank you for pointing out that distinct temporal responses under scotopic conditions, where signals are conveyed through the rod bipolar pathway, suggest that these differences must arise at and/or downstream of cone bipolar cell output. We have included additional text (lines 361-365) in the results describing bipolar cell responses that raises this point.

      (3) Glutamate release was already measured across the full IPL depth by Borghuis et al. (2013) and Franke et al. (2017). It would be appropriate to better motivate the current study based on these existing measurements.

      We have clarified that these studies provided important motivation for measuring excitatory synaptic input to ON-T vs. ON-S RGCs (lines 165-169).

      (4) Line 212/213. It would be appropriate to add to the list of papers showing the different stratification of transient vs. sustained responses: Borghuis et al. (2013) and Beaudoin et al. (2019).

      Thank you - these references have been added.  

      (5) Line 635-638. It would be useful to discuss papers by Pottackal et al. (2020, 2021), which suggested that a single presynaptic cell (starburst) can signal with different temporal properties depending on the postsynaptic target (other starburst vs. DSGCs). The mechanism was not completely resolved (i.e., it was not explained by differences in presynaptic Ca channels at the two synapse types), but it at least shows that neurotransmitter release can show different filtering depending on the postsynaptic target from the same presynaptic neuron. (This could also be at play for the type 6 bipolar cell inputs to ON-S vs. ON-T RGCs in the present study.)

      We have added a reference to Pottackal et al 2021 in this section.

      (6) Line 714. Should describe the procedure for embedding the tissue in agarose. 

      We have added more detail regarding agarose embedding for preparation of retinal slices in the methods.

      (7) Line 775. Need a better description of the virus (not the construct), what serotype? Provide the Addgene number if available. 

      This has been added to the methods.

      (8) Line 808. Was the SD for the gaussian really 50%? That would cut off a lot of the distribution, i.e., it would get clipped at 0. 

      Yes, the SD for Gaussian noise was 50%. This high contrast stimulus was used in part to achieve measurable signals from bipolar cells. You are correct that some of the distribution was clipped at 0 (it was also clipped at twice the mean to make sure that the distribution remained symmetrical). The clipping was accounted for during our LN analyses.
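
      The clipping described above can be sketched as follows. This is a minimal illustration that assumes intensity is expressed relative to a mean of 1.0; the function name, sample count, and seed are hypothetical and do not reflect the actual stimulus-generation code.

```python
import random


def clipped_gaussian_noise(n_samples, mean=1.0, sd_fraction=0.5, seed=0):
    """Draw Gaussian intensities with SD = sd_fraction * mean, clipped symmetrically.

    Values are clipped at 0 and at twice the mean so the distribution
    stays symmetrical about the mean.
    """
    rng = random.Random(seed)
    sd = sd_fraction * mean
    lower, upper = 0.0, 2.0 * mean
    return [min(max(rng.gauss(mean, sd), lower), upper) for _ in range(n_samples)]


samples = clipped_gaussian_noise(10_000)
# Fraction of samples that landed exactly on a clipping bound.
clipped_fraction = sum(s in (0.0, 2.0) for s in samples) / len(samples)
```

      With an SD of 50% of the mean, the bounds sit two standard deviations from the mean, so roughly 4.6% of samples from an ideal Gaussian would be clipped (about 2.3% at each bound).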

      (9) The paper should discuss Swygart et al. (2024) results showing different spatial surround properties of neighboring synapses from a type 6 bipolar cell. Based on this result, it would seem very likely that amacrine cells could play a role in shaping the temporal processing of bipolar cell glutamate release as well. Indeed, spatial and temporal processing will not be completely independent in a typical experiment. For example, with the spot stimulus used in the present study, bipolar cells within the center versus the edge of the spot will have different balances of center/surround activation, which could potentially influence their temporal processing.

      We have included discussion of results from Swygart et al 2024 in the section of the Discussion in which we point out differences in surround inhibition between ON-S and ON-T RGCs (lines 710-714). We agree that spatial and temporal processing are not completely independent. Our results with SR95531/TPMPA indicate ON-T RGCs receive stronger GABAergic surround inhibition than ON-S RGCs (Fig. S8). However, our results in Fig. 5C-D show GABAergic surround inhibition makes ON-T excitation more sustained rather than more transient. So even though bipolar cells presynaptic to ON-T RGCs receive stronger surround inhibition (Fig. S8), this inhibition does not establish the transient kinetics of glutamate release from these bipolar cells (in fact, it works to make release more sustained). Additional iGluSnFR experiments where we used NBQX to block all/most amacrine cell-mediated responses also suggest presynaptic inhibition does not have an important role in establishing differential glutamate release kinetics onto ON-S vs. ON-T RGC dendrites (Fig. 7).

      (10) Cui et al. 2016 described ON-S Alpha as having a divisive suppression mechanism that explained the temporal properties of white-noise response better than a standard LN model. Do the authors think the divisive suppression reflects a property of the excitatory synapses independent of inhibition?

      This is an interesting question, but one for which we do not yet have a good answer. As mentioned in some of the above responses and as we have tried to clarify in the manuscript, we do not mean to imply that there is no role for presynaptic inhibition in modulating bipolar cell output, including for the divisive suppression described by Cui et al. Rather, our point is that the distinction between transient and sustained excitatory input to ON-T and ON-S RGCs does not require presynaptic inhibition and is more likely an intrinsic property of the bipolar cell synapses.

      (11) Do the authors mean to imply that the pool size at bipolar cell ribbon synapses could depend on the use of Ames vs. ACSF? 

      For now, we do not have a good answer as to why there are quantitative differences in response kinetics between Ames and ACSF. We have not done any experiments to investigate whether ribbon sizes or ribbon pools are different in the different solutions.

      (12) More generally, different mean luminance levels could drive different levels of baseline glutamate release, which could alter the available pool of vesicles at bipolar cell ribbon synapses. Can we explain varying degrees of transient/sustained in the same cell at different levels of mean luminance based on this mechanism (e.g., Grimes et al., 2014)?

      Yes, the emergence of a transient component of excitatory input to ON-S RGCs at ~100 R*/rod/s versus at scotopic levels (0.5 R*/rod/s) in Grimes et al. (2014) could be due to differences in the number of releasable vesicles (due to different type 6 bipolar cell axon terminal membrane potentials and hence differences in spontaneous release rates) at the different light levels.

      We should note that although ON-T and ON-S RGCs exhibit some changes in transient/sustained kinetics across different light levels, the relative differences between these RGC types are preserved across light levels. We have included a statement about this in the text (lines 361-367).

      (13) Figure 1. Have the authors considered performing the LN analysis of the firing responses, to compare the degree of rectification between the two RGC types?

      This is a good suggestion. From an LN analysis of spiking responses, we do not observe a clear difference between the static nonlinearity component of the model for ON-T and ON-S RGCs. Both RGC types are strongly rectified under our experimental conditions.

      (14) Figure 5. Do the authors have the pharmacology data for the ON-S cells? There are examples of sustained EPSCs in amacrine cells that become more transient after blocking inhibition, which at least suggests that inhibition can play some role in the transient/sustained nature of glutamate release (Park et al., 2015, Figure 3). Perhaps ON-S cells likewise become more transient with inhibition blocked. 

      (The colored symbols in A were not visible in a printout. It would be useful to indicate the cell type (ON-T) in C and E). 

      As described above in the response to reviewer 1’s recommendation for authors, we were not able to use SR95531/TPMPA for recordings from ON-S RGCs. Bath application of these drugs led to oscillatory bursts of excitatory input to ON-S RGCs. However, the lack of effect of bath-applied NBQX on the kinetics of glutamate release around either ON-T or ON-S RGC dendrites (new Fig. 7) suggests that presynaptic inhibition does not contribute to generating sustained excitation to ON-S RGCs (or transient excitation to ON-T RGCs).  

      We have corrected Fig. 5A to include the referenced colored symbols and have also edited Fig 5C and E to clarify that measurements in Fig. 5C-F are from ON-T RGCs.

      (15) Figure 6 legend. Should be Kcng4-Cre, not KCNG-Cre. Also, it should make clear that this is cre-dependent expression of iGluSnFR. For C, were the statistics based on the number of FOVs? 

      Thanks for catching this; we have corrected the Figure 6 legend. The methods section includes a description of how we achieved iGluSnFR expression on alpha RGC dendrites via a cre-dependent viral strategy in Kcng4-Cre mice. We have also clarified that the statistics are based on ROIs in Figure 6C.

      (16) Figure 7, Flashes were apparently 400% contrast on a dim background. What was the background? Is there a rod component to the response in this case? 

      In Figure 7 (now Figure 8), the same background (~3300 R*/rod/s; 2000 P*/Scone/s) was used as in the Gaussian noise and step response experiments. At this light level, the response should be primarily mediated by cones.

      (17) Figure S1. The colors here differ from those in previous figures (Here, ON-T, magenta; ON-S, cyan). Is something mislabeled? 

      Thanks for catching this. We mistakenly swapped the labels in the legend for Fig. S1. The figure colors were correct, but we have corrected the legend in the revised manuscript.

      (18) Figure S2. For the LN model for RGC synaptic currents, the ON-S are more rectified than some previous recordings (Cui et al., 2016). Is this perhaps explained by different light levels?

      We aren’t sure why ON-S excitatory currents are more strongly rectified in our recordings compared to Cui et al., 2016. Cui et al. used an ~20-fold higher background light intensity (~40,000 P*/cone/s vs. ~2000 P*/cone/s in our study), so different light levels may be a factor, although we should point out that rectification increases in these RGCs from scotopic to low photopic light levels (see Grimes et al., 2014 and Kuo et al., 2016).

      (19) The study is apparently comparing PV1 and PV2 described in Farrow et al. (2013; see Supplementary information for stratification analysis), which should be cited.

      Thanks, we have corrected this oversight in the revised manuscript. We now cite Farrow et al and mention the connection to PV1 and PV2 in the first paragraph of Results (lines 104-108).

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      Reviewer #1

      Major comments:

      (comment #1)- It is interesting that TRF2 loss not only fails to increase γH2AX/53BP1 levels but may even slightly reduce them (e.g., Fig. S2c and the IF images). While the main hypothesis is that TRF2 loss does not trigger telomere dysfunction in NSCs, this observation raises the possibility that TRF2 itself contributes to DDR signaling (ATM-P, γH2AX, 53BP1) in these cells and that in its absence, cells are not able to form those foci. To exclude the possibility that telomere-specific DDR is being missed due to an overall dampened DDR response in the absence of TRF2, it would be informative to induce exogenous DSBs in TRF2-depleted cells and test DDR competence (e.g., IF for γH2AX/53BP1). In other words, are those NSC lacking TRF2 even able to form H2AX/53BP1 foci when damaged? In addition, it would be interesting to perform telomere fusion analysis in TRF2 silenced cells (and TRF1 silenced cells as a positive control).

      We acknowledge a slight reduction; however, this difference is not statistically significant (Fig S2c,e). We will quantify the levels of DDR markers upon TRF2 loss and exogenous DSBs and include these data in the subsequent revision.

      (comment #2)-A TRF2 ChIP-seq should be performed in NSC as this list of genes (named TAN genes in the text) was determined using a ChIP performed in another cell line (HT1080). For the ChIP-qPCR in the various conditions, primers for negative control regions should be included to show the specific binding of TRF2 to the promoter of the genes associated with neuronal differentiation. For example, an intergenic region and/or promoters of genes that are not associated with neuronal differentiation (or don't contain a potential G4). The same comment goes true for the gene expression analysis: a few genes that are not bound by TRF2 should be included as negative controls to exclude a potential global effect of TRF2 loss on gene expression (ideally a RNA-seq would be performed instead).

      We have performed NSC-specific TRF2 ChIP-seq for an upcoming manuscript, which confirms TRF2 occupancy at multiple promoters of differentiation-associated genes. These data are provided solely for confidential evaluation by the designated reviewers.

      Regarding the ChIP-qPCR control experiments: We thank the reviewer for pointing this out. We did include controls in our PCR assays: telomeric loci as positive controls and TRF2-nonbinding loci (GAPDH, RPS18, and ACTB, based on HT1080 TRF2 ChIP-seq data) as negative controls. These results were not included earlier for clarity, given that we were presenting several ChIP-PCR figures; in response to the comment, we have now included them in the revised version (Fig. S3d,e). Gene expression analyses show selective upregulation of the TAN genes upon TRF2 loss (data normalised to GAPDH), whereas negative control genes lacking TRF2 binding (RPS18, ACTB) remain unchanged, ruling out non-specific effects (Fig S3f,g,j,k).

      -(comment #3) A co-IP should be performed between the TRF2 PTM mutant K176R or WT TRF2 and REST and PRC2 components to directly show a defect of interaction between them when TRF2 is mutated (a co-IP with DNase/RNase treatment to exclude nucleic-acid bridging). The TRF2 PTM mutant T188N also seems to lead to an increased differentiation (Fig. S5a). Could the author repeat the measure of gene expression and co-IP with REST upon the overexpression of this mutant too?

      We confirm that DNase/RNase is routinely included in our pull-down experiments to exclude nucleic-acid bridging, with detailed methodology now elaborated in the Methods section. Omitting this from the manuscript Methods was an oversight on our part. Our data demonstrate that only REST directly interacts with TRF2, while TRF2 engages PRC2 indirectly via REST, as also previously shown by us and others (page 6; ref. [62]; Sharma et al., ref. [15]).

      We thank the reviewer for noting the apparent differentiation in Fig. S5a. However, this observation represents a rare spontaneous differentiation event and is not statistically significant (as shown in Fig S5b). Consistently, gene expression analysis of the TRF2-T188N mutant shows no significant change in TRF2-associated neuronal differentiation (TAN) genes. Therefore, co-IP of TRF2-T188N with REST was not performed.

      (comment #4) - The authors show that the G4 ligands SMH14.6 and Bis-indole carboxamide upregulate TAN genes and promote neuronal differentiation, but the underlying mechanism remains unclear. Bis-indole carboxamide is generally considered a G4 stabilizer, while SMH14.6 is less characterized and should be better introduced. The authors should clarify how G4 stabilization would interfere with TRF2 binding, it seems that it would likely be by blocking access. A more detailed discussion, and ideally TRF2 ChIP after ligand treatment and/or G4 helicase treatment, would strengthen the model.

      We clarify that Bis-indole carboxamide acts as a G4 stabilizer, while SMH14.6 is also a noted G4-binding ligand that stabilizes G4s (ref. [15]). The exclusion of TRF2 from G4 motifs in gene promoters by G4-binding ligands has also been documented previously (ref. [18]). In line with these findings, ChIP experiments performed following ligand treatment revealed a decreased occupancy of TRF2 at TAN gene promoters, supporting the proposed mechanism (added Fig. 6h).

      Minor comments:

      • Supp Figures related to the scRNA-seq are difficult to read (blurry).

      Corrected

      • Fig S1h: The red box mentioned in the legend is not visible

      Corrected

      • In the text, the Figures 1 f-g are misannotated as Fig 1m and l

      Corrected

      • The symbol γ of γH2AX is missing in the text

      Corrected

      • Fig.3d, please indicate in the legend that it is done in SH-SY5Y.

      Added SH-SY5Y in the legend of Fig. 3d.

      • Fig. S3b: Please consider replotting this panel with an increased y-axis scale. As currently presented, the TRF2 ChIP-seq peaks at several promoters appear truncated by the scaling.

      Corrected

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      1. For most of the data graphs in the manuscript, there is no indication of the number of independent biological replicates carried out (which should ideally be plotted as individual dots overlaying the column graphs), or what the error bars represent, or what statistical test was used.

      All the figure legends and methods have now been updated with the corresponding biological replicates per experiment, with error bars as SD/SEM and the corresponding statistical test along with p values.

      Figure S1.1a: needs a marker to show that the tissue is dentate gyrus.

      We acknowledge the reviewer's concern that high-magnification images alone make it difficult to verify whether the fields are taken from the correct anatomical location. The dentate gyrus (DG) of the hippocampus is a well-defined structure. In the revised figure (Fig S1.1a), we now include a low-magnification image showing the entire hippocampus, including the CA fields, along with two high-magnification fields specifically from the DG region. Consistent with our claim, the co-immunostaining demonstrates that Sox2-positive neural stem cells in the DG are also positive for TRF2.

      Figure 1c (and all other flow cytometry panels throughout the manuscript): it is not clear if the expression of any of these proteins, except maybe MAP2, are significantly different in the presence or absence of TRF2. These differences need to be presented more quantitatively, with the results compiled from multiple biological replicates and analysed statistically. I am not sure that flow cytometry is the best way to determine differences in protein expression levels for non-surface proteins, because many of the reported differences are not at all convincing.

      To detect intracellular/nuclear proteins by flow cytometry, cells were permeabilized using pre-chilled 0.2% Triton X-100 for 10 minutes, as described in the Methods section.

      We have revised the figures (Fig 1c,e) and now include statistical analysis from three independent biological replicates for these experiments (Fig S1.4h-j, S2e, S6d).

      Fig 1d: has TRF2 been effectively silenced in this experiment? There appears to be just as many TRF2+ nuclei in the "TRF2 silenced" panel vs the control, including in the cells with neurite outgrowths.

      Quantification showing a decrease in nuclear TRF2 levels has been included in Supplementary Fig S1g.

      Fig 2a-c: these experiments need a positive control, showing increased expression of these proteins in mNSC and SH-SY5Y cells in response to a DNA damaging agent. Again, flow cytometry may not be the best method for this; immunofluorescence combined with telomere FISH would be more convincing.

      We confirm that doxorubicin induces 53BP1 foci (IF-FISH, Sup Fig. S2b) and that TRF1 silencing elevates γH2AX (Sup Fig. S2c), validating DDR sensitivity. Unlike TRF2 loss (Fig. 2a-c), no TIFs appear with IF and telomere probes (Fig. 2d, Sup Fig. 2a), and without TIFs, there is no telomeric fusion. Flow cytometry was performed with Triton X-100 to target nuclear proteins. These findings adequately address the concern; therefore, further IF-FISH experiments were not included in the present study.

      To conclude that telomere damage is not occurring, an independent marker of such damage, such as telomere fusions, should also be measured.

      In response to uncapped telomeres, ATM kinase activates the DNA damage response (DDR), recruiting γH2AX and 53BP1 to telomeres, which precedes end-to-end fusions (Takai et al., 2003; Maciejowski & de Lange, 2015; d'Adda di Fagagna et al., 2003; Cesare & Reddel, 2010; Hayashi et al., 2012; Sarek et al., 2015). We observe no DDR activation or foci (Fig. 2; Sup. Fig. 2). This absence of a DDR response and TIFs indicates no telomere uncapping, negating the need for direct telomere fusion analysis.

      Figure S2b is lacking a no-doxorubicin control.

      An untreated control has been included in Fig. S2b.

      Figures 3a and 3b need a positive control (e.g. TRF2 binding to telomeric DNA) and a negative control (e.g. a promoter that did not show any TRF2 binding in the HT1080 ChiP-seq experiment in Fig S3).

      We have included positive (telomere) and negative (GAPDH) controls (based on HT1080 TRF2 ChIP-seq data) for the TRF2 ChIP assay in Supplementary Fig. S3d,e. Additionally, positive and negative controls for all ChIP experiments conducted in this study are presented in Supplementary Figs. S3d, S3e, S3h, S3i, S4c-h, and S5c-e.

      The data in Figure 3 would be more compelling if all experiments were also performed in fibroblasts to confirm the cell-type specificity of the effect.

      Our HT1080 fibrosarcoma ChIP-seq data (ref. [18]; Sup. Fig. 3a,b) show TRF2 binding to TAN gene promoters in a fibroblast-derived model, with enrichment in neurogenesis-related genes (refs. [19,20]). In fibroblasts, TRF2 depletion, as expected, induces telomere dysfunction and DDR (Fig. 2d; Sup. Fig. 2a), and eventually cell-cycle arrest and cell death, as also reported earlier (van Steensel et al., 1998; Smogorzewska & de Lange, 2002). Therefore, the suggested experiments, which would require sustained TRF2 depletion, are not possible to perform in fibroblasts. TRF2 occupancy on the promoters of the genes in question in cells other than NSCs was noted in HT1080 cells (ref. [18]; Sup. Fig. 3a,b).

      No references are provided for the TRF2 posttranslational modifications on R17, K176, K190 and T188. What is the evidence for these modifications, and is it known if they participate in the telomeric role of TRF2?

      These lines with references have been included in the manuscript (highlighted in blue).

      R17 methylation enhances telomere stability (66). K176/K190 acetylation stabilizes telomeres and is deacetylated by SIRT6 (67). T188 phosphorylation facilitates telomere repair after DSBs (68). These PTMs primarily support telomeric roles.

      The experiments in Fig 5 should also be performed with WT TRF2, to confirm that effects are not due to the overexpression of TRF2.

      WT TRF2 shows no differentiation phenotype and no change in TAN gene expression (Fig. 1f,g; 3h, Sup Fig. 5a), confirming that the effects are not due to TRF2 overexpression.

      Fig 5c has not been described in the text, and there are multiple technical problems with the TRF2 WT experiment: i) There appears to be significant background binding of REST to the IgG beads, though this blot has such high background it is hard to tell (the REST blot in Fig S4b is also of poor quality), ii) TRF2 is migrating at two different positions in the Input and IP lanes, and the TRF2 band in the K176R blot is at a different position to either, and iii) the relative loading of the Input and IP lanes is not indicated, so it's not clear why K176R appears to be so enriched in the IP.

      We acknowledge the oversight in not citing Fig 5c in the manuscript. This has been corrected and highlighted in blue in the revised manuscript.

      i) Multiple optimization attempts were made for the Co-IP experiments, and the presented figure reflects the best achievable result despite REST blot smearing, a pattern also reported previously (Ref. 65). The TRF2-REST interaction is well established, and a similar background was also observed in the cited study.

      ii) Variable migration patterns of TRF2 were also noted in the cited study (Ref. 65), consistent with our observations. Our primary emphasis, however, is on the TRF2 K176R mutant, which clearly disrupts its interaction with REST.

      iii) The input loading corresponds to 10% of the total lysate. As the experiments were conducted independently, variations in transfection and pull-down efficiencies may account for observed differences.

      To rule out indirect effects of the G4 ligands on the results in Fig 6g, the binding of BG4 and TRF2 at the promoters of these genes should be measured by ChIP.

      To confirm that G4 ligand effects on TAN gene promoters are direct, TRF2 occupancy was assessed using ChIP. Significantly decreased occupancy of TRF2 was noted at TAN gene promoters (added Fig. 6h). This implies that ligand-induced changes in TRF2 binding are directly linked to promoter-level G4 stabilization.

      Minor comments:

      1. The size of all the size markers in western blots should be added to the figures.

      Sizes have been included in all the western blots.

      2. There are several figure panels that are incorrectly referenced in the text, e.g. Fig S1.1 (e-f) should be Fig S1.1 (e-h); Fig. 1m should be Fig. 1f; Figs 5e and 5f have been swapped.

      Corrected.

      1. Fig S1.4 is not referred to in the text. It is not clear what the purpose of Fig S1.4a is.

      The following line has been included in the manuscript highlighted in blue.

      Neurospheres were characterized using PAX6, a NSC marker (Fig S1.4a).

      Are the experiments in Figs 3e, 4a, 4c and 4e using 4-OHT treatment, or siRNA? If the latter, I don't think a control for the effectiveness of the knockdown in this cell type has been included anywhere in the manuscript.

      These experiments use siRNA; a western blot showing the effectiveness of the knockdown is presented in Supplementary Figure S4c (now S4a).

      The lanes of the western blots in Fig S4c are not labelled.

      Corrected.

      1. Given that the experiments in Fig 5 were carried out on a background of endogenous WT TRF2 expression, presumably the K176R mutant is having a dominant negative effect. To understand the mechanism of this effect (e.g, is it simply due to replacement of endogenous WT TRF2 at its genomic binding sites by a large excess of exogenous K176R, or is dimerisation with WT TRF2 needed?) it would be helpful to know the relative expression levels of endogenous and K176R TRF2.

      To address the query, qRT-PCR with 3′ UTR-specific primers showed no change in endogenous TRF2 mRNA upon K176R expression in SH-SY5Y cells, while primers detecting total TRF2 revealed ~10-fold higher expression of K176R compared to control (Figure below). This indicates the absence of suppression of endogenous TRF2 mRNA. Given that the mutant's DNA binding is intact (Fig. 5f), the dominant-negative effect of K176R likely arises from overexpression of the exogenous mutant.

      For the sentence "...and critical for transcription factor binding including epigenetic functions that are G4 dependent" (bottom of page 3 of the PDF), the authors cite only their own prior papers, but there are examples from others that could be cited.

      We have incorporated citations from other research groups, now included as references 23-26.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      We thank the reviewers for their thoughtful and constructive feedback, which helped us strengthen the study on both the computational and biological sides. In response, we added substantial new analyses and results in a total of 26 new supplementary figures and a new supplementary note. Importantly, we demonstrated that our approach generalizes beyond tissue outcomes by predicting final-timepoint morphology clusters from early frames with good accuracy (new Figure 4C). Furthermore, we completely restructured and expanded the human expert panel: six experts now provided >30,000 annotations across evenly spaced time intervals, allowing us to benchmark human predictions against CNNs and classical models under comparable conditions. We verified that morphometric trajectories are robust: PCA-based reductions and nearest-neighbor checks confirmed that patterns seen in t-SNE/UMAP are genuine, not projection artifacts. To test whether z-stacks are required, we re-did all analyses with sum- and maximum-intensity projections across five slices; results were unchanged, showing that single-slice imaging is sufficient. From a bioinformatics perspective, we performed negative-label baselines, downsampling analyses to quantify dataset needs, and statistical tests confirming CNNs significantly outperform classical models. Biologically, we clarified that each well contains one organoid, introduced the Latent Determination Horizon concept tied to expert visibility thresholds, and discussed limits in cross-experiment transfer alongside strategies for domain adaptation and adaptive interventions. Finally, we clarified methods, corrected terminology and a scaler leak, and made all code and raw data publicly available.

      Together, these revisions in our opinion provide an even clearer, more reproducible, and stronger case for the utility of predictive modeling in retinal organoid development.


      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This study presents predictive modeling for developmental outcome in retinal organoids based on high-content imaging. Specifically, it compares the predictive performance of an ensemble of deep learning models with classical machine learning based on morphometric image features and predictions from human experts for four different tasks: prediction of RPE presence and lens presence (at the end of development) as well as the respective sizes. It finds that the DL model outperforms the other approaches and is predictive from early timepoints on, strongly indicating a time-frame for important decision steps in the developmental trajectory.

      Response: We thank the reviewer for the constructive and thoughtful feedback. In response to the review as found below, we have made substantial revisions and additions to the manuscript. Specifically, we clarified key aspects of the experimental setup, changed terminology regarding training/validation/test sets, and restructured our human expert baseline analysis by collecting and integrating a substantially larger dataset of expert annotations according to suggestion. We introduced the Latent Determination Horizon concept with clearer rationale and grounding. Most importantly, we significantly expanded our interpretability analyses across three CNN architectures and eight attribution methods, providing comprehensive quantitative evaluations and supplementary figures that extend beyond the initial DenseNet121 examples (new Supplementary Figures S29-S37). We also ensured full reproducibility by making both code and raw data publicly available with documentation. While certain advanced interpretability methods (e.g., Discover) could not be integrated despite considerable effort, we believe the revised manuscript presents a robust, well-documented, and carefully qualified analysis of CNN predictions in retinal organoid development.

      Major comments: I find the paper over-all well written and easy to understand. The findings are relevant (see significance statement for details) and well supported. However, I have some remarks on the description and details of the experimental set-up, the data availability and reproducibility / re-usability of the data.

      1. Some details about the experimental set-up are unclear to me. In particular, it seems like there is a single organoid per well, as the manuscript does not mention any need for instance segmentation or tracking to distinguish organoids in the images and associate them over time. Is that correct? If yes, it should be explicitly stated so. Are there any specific steps in the organoid preparation necessary to avoid multiple organoids per well? Having multiple organoids per well would require the aforementioned image analysis steps (instance segmentation and tracking) and potentially add significant complexity to the analysis procedure, so this information is important to estimate the effort for setting up a similar approach in other organoid cultures (for example cancer organoids, where multiple organoids per well are common / may not be preventable in certain experimental settings).

      Response: We thank the reviewer for this question. We agree that these preprocessing steps would add more complexity to our presented preprocessing steps and would definitely be required in some organoid systems. In our experimental setup, there is only one organoid per well which forms spontaneously after cell seeding from (almost) all seeded cells. There are no additional steps necessary in order to ensure this behaviour in our setup. We amended the Methods section to now explicitly state this accordingly (paragraph ‘Organoid timelapse imaging’).

      The terminology used with respect to the test and validation set is contrary to the field, and reporting the results on the test set (should be called validation set), should be avoided since it is used to select models. In more detail: the terms "test set" and "validation set" (introduced in 213-221) are used with the opposite meaning to their typical use in the deep learning literature. Typically, the validation set refers to a separate split that is used to monitor convergence / avoid overfitting during training, and the test set refers to an external set that is used to evaluate the performance of trained models. The study uses these terms in an opposite manner, which becomes apparent from line 624: "best performing model ... judged by the loss of the test set.". Please exchange this terminology, it is confusing to a machine learning domain expert. Furthermore, the performance on the test set (should be called validation set) is typically not reported in graphs, as this data was used for model selection, and thus does not provide an unbiased estimate of model performance. I would remove the respective curves from Figures 3 and 4.

      Response: We are thankful for the reviewer's comments on this matter. Indeed, we were using an opposite terminology compared to what is commonly used within the field. We have adjusted the Results, Discussion and Methods sections as well as the figures accordingly. Further, we added a corresponding disclaimer for the code base in the GitHub repository. However, we prefer not to remove the respective curves from the figures. We think that this information is crucial to interpret the variability in accuracy between organoids from the same experiments and organoids acquired from a different, independent experiment. The results suggest that the accuracy for organoids within the same experiments is still higher, indicating to users the potential accuracy drop resulting from independent experiments. As we think that this is crucial information for the interpretability of our results, we would like to still include it side-by-side with the test data in the figures.

      The experimental set-up for the human expert baseline is quite different to the evaluation of the machine learning models. The former is based on the annotation of 4,000 images by seven experts, the latter on cross-validation experiments on a larger dataset. First of all, the details on the human expert labeling procedure are very sparse; I could only find a very short description in the paragraph 136-144, but did not find any further details in the methods section. Please add a methods section paragraph that explains in more detail how the images were chosen, how they were assigned to annotators, and if there was any redundancy in annotation, and if yes how this was resolved / evaluated. Second, the fact that the set-up for human experts and ML models is quite different means that these values are not quite comparable in a statistical sense. Ideally, human estimators would follow the same set-up as in ML (as in, evaluate the same test sets). However, this would likely be prohibitive in the required effort, so I think it's enough to state this fact clearly, for example by adding a comment on this to the captions of Figure 3 and 4.

      Response: We thank the reviewer for this constructive suggestion. We agree that the curves for human evaluations in the original draft were calculated differently compared to the curves for the classification algorithms, mostly stemming from feasibility of data set annotation at the time. In order to still address this suggestion, we went on to repeat and substantially expand the number of images annotated and thus revised the full human expert annotation. Each one of 6 human experts was asked to predict/interpret 6 images of each organoid within the full dataset. In order to select the images, we divided the time course (0-72h) into 6 evenly spaced intervals of 12 hours. For each interval, one image per organoid and human expert was randomly selected and assigned. This resulted in a total of 31,626 classified images (up from 4000 in the original version of the manuscript), from which the assigned images were overlapping between experts for each source interval but not for the individual images. We then changed the calculation of the curves to be the same as for the classification analysis: F1 data were calculated for each experiment over 6 timeframes and all experts, and plotted within the respective figure. We have amended the Methods section accordingly and replaced the respective curves within Figures 3 and 4 and Supplementary Figures S1, S8 and S19.
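
      As an illustration only (this is not the code used in the study), the interval-based selection described above could be sketched as follows. The function name `assign_images`, the 30-minute imaging interval, and the seeding are hypothetical assumptions; only the six experts, the 0-72 h time course, and the six 12-hour intervals are taken from the response.

```python
import random

def assign_images(frame_times_h, n_experts=6, t_max_h=72.0, n_intervals=6, seed=0):
    """For one organoid, pick one random frame per interval for each expert.

    frame_times_h: acquisition times (in hours) of this organoid's frames.
    Returns {expert_index: [chosen frame time for interval 0, 1, ...]}.
    Assumes every interval contains at least one frame.
    """
    rng = random.Random(seed)
    width = t_max_h / n_intervals  # 12 h per interval with the defaults
    # Group frames by interval; a frame at exactly t_max_h goes into the last one.
    bins = [[] for _ in range(n_intervals)]
    for t in frame_times_h:
        idx = min(int(t // width), n_intervals - 1)
        bins[idx].append(t)
    return {
        expert: [rng.choice(frames) for frames in bins]
        for expert in range(n_experts)
    }

# Hypothetical organoid imaged every 30 minutes from 0 h to 72 h.
times = [0.5 * i for i in range(145)]
assignment = assign_images(times)
```

      In this sketch each expert draws independently, so two experts sample the same interval of an organoid but usually receive different frames from it.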

      It is unclear to me where the theoretical time window for the Latent Determination Horizon in Figure 5 (also mentioned in line 350) comes from? Please explain this in more detail and provide a citation for it.

      Response: We thank the reviewer for this important point. The Latent Determination Horizon (LDH) is a conceptual framework we introduced in this study to describe the theoretical period during which the eventual presence of a tissue outcome of interest (TOI) is being determined but not yet detectable. It is derived from two main observations in our dataset: (i) the inherent intra- and inter-experimental heterogeneity of organoid outcomes despite standardized protocols, and (ii) the progressive increase in predictive performance of our deep learning models over time, which suggests that informative morphological features only emerge gradually. We have now clarified this rationale in the manuscript (Discussion section) further and explicitly stated that the LDH is a concept we introduce here, rather than a previously described or cited term.

      The time window is defined by the TOI visibility, which is defined empirically as indicated by the results of our human expert panel (compare also Supplementary Figure S1).

      The interpretability analysis (Figure 4, 634-639) based on relevance backpropagation was performed based on DenseNet121 only. Why did you choose this model and not the ResNet / MobileNet? I think it is quite crucial to see if there are any differences between these models, as this would show how much weight can be put on the evidence from this analysis and I would suggest to add an additional experiment and supplementary figure on this.

      Response: We thank the reviewer for this important comment regarding the interpretability analysis and the choice of model. In the original submission, we restricted the attribution analyses shown in original Figure 4C to DenseNet121, which served as our main reference model throughout the study. This choice was made primarily for clarity and to avoid redundancy in the main figures, as all three convolutional neural network (CNN) architectures (DenseNet121, ResNet50, MobileNetV3_Large) achieved comparable classification performance on our tasks.

      In response to the reviewer’s concern, we have now extended the interpretability analyses to include all three CNN architectures and a total of eight attribution methods (new Supplementary Note 1). Specifically, we generated saliency maps for DenseNet121, ResNet50, and MobileNetV3_Large across multiple time points and evaluated them using a systematic set of metrics: pairwise method agreement within each model (new Supplementary Figure S29), cross-model consistency per method (new Supplementary Figure S34), entropy and diffusion of saliencies over time (new Supplementary Figure S35), regional voting overlap across methods (new Supplementary Figure S36), and spatial drift of saliency centers of mass (new Supplementary Figure S37).
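
      To make two of these metrics concrete, the following is a minimal sketch of how the entropy of a saliency map and the drift of its center of mass could be computed. The exact definitions used in Supplementary Note 1 are not given in this response, so the formulas below (Shannon entropy of the normalized absolute saliency, Euclidean distance between centers of mass) and all function names are assumptions.

```python
import math

def saliency_entropy(saliency):
    """Shannon entropy (bits) of a 2D saliency map normalized to sum to 1."""
    flat = [abs(v) for row in saliency for v in row]
    total = sum(flat)
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log2(v / total) for v in flat if v > 0)

def center_of_mass(saliency):
    """Saliency-weighted mean (row, column) position of a 2D map."""
    total = sum(abs(v) for row in saliency for v in row)
    if total == 0:
        raise ValueError("saliency map is all zeros")
    r = sum(i * abs(v) for i, row in enumerate(saliency) for v in row) / total
    c = sum(j * abs(v) for row in saliency for j, v in enumerate(row)) / total
    return r, c

def drift(map_a, map_b):
    """Euclidean distance between the centers of mass of two maps."""
    (r1, c1), (r2, c2) = center_of_mass(map_a), center_of_mass(map_b)
    return math.hypot(r2 - r1, c2 - c1)

# Toy maps: evenly spread relevance vs. relevance focused on one pixel.
diffuse = [[1.0, 1.0], [1.0, 1.0]]
focused = [[0.0, 0.0], [0.0, 4.0]]
```

      Under these definitions, the diffuse toy map has the maximum entropy for four pixels (2 bits), the focused map has zero entropy, and the drift between them is the distance from the map center to the bottom-right pixel.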

      These pooled analyses consistently showed that attribution methods differ markedly in the regions they prioritize, but that their relative behaviors were mostly stable across the three CNN architectures. For example, Grad-CAM and Guided Grad-CAM exhibited strong internal agreement and progressively focused relevance into smaller regions, while gradient-based methods such as DeepLiftSHAP and Integrated Gradients maintained broader and more diffuse relevance patterns but were the most consistent across models. Perturbation-based methods like Feature Ablation and Kernel SHAP often showed decreasing entropy and higher spatial drift, again similarly across architectures.

      To further address the reviewer’s point, we visualized the organoid depicted in original Figure 4C across all three CNNs and all eight attribution methods (new Supplementary Figures S30-S33). These comparisons confirm and extend analysis of the qualitative patterns described in original Figure 4C and show that they are not specific to DenseNet121, but are representative of the general behavior across architectures.

      In sum, we observed notable differences in how relevance was assigned and how consistently these assignments aligned. Highlighted organoid patterns were not consistent enough across attribution methods for us to be comfortable basing unequivocal biological interpretation on them. Nevertheless, we believe that the analyses in response to the reviewer’s suggestions (new Supplementary Note 1 and new Supplementary Figures S29-S37) add valuable context to what can be expected from machine learning models in an organoid research setting.

      As we did not base further unequivocal biological claims on the relevance backpropagation, we decided to move the analyses to the Supporting Information and now show a new model predicting organoid morphology by morphometrics clustering at the final imaging timepoint in new Figure 4C in line with suggestions by Reviewer #3.

      The code referenced in the code availability statement is not yet present. Please make it available and ensure a good documentation for reproducibility. Similarly, it is unclear to me what is meant by "The data that supports the findings will be made available on HeiDoc". Does this only refer to the intermediate results used for statistical analysis? I would also recommend to make the image data of this study available. This could for example be done through a dedicated data deposition service such as BioImageArchive or BioStudies, or with less effort via zenodo. This would ensure both reproducibility as well as potential re-use of the data. I think the latter point is quite interesting in this context; as the authors state themselves it is unclear if prediction of the TOIs isn't even possible at an earlier point that could be achieved through model advances, which could be studied by making this data available.

      Response: We thank the reviewer for this comment. We have now made the repository and raw data public on the suggested platform (Zenodo) and apologize for this oversight. The links are contained within the GitHub repository, which is stated in the manuscript under “Data availability”.

      Minor comments:

      Line 315: Please add a citation for relevance backpropagation here.

      Response: We have included citations for all relevance backpropagation methods used in the paper.

      Line 591: There seems to be a typo: "[...] classification of binary classification [...]"

      Response: Corrected as suggested.

      Line 608: "[...] where the images of individual organoids served as groups [...]" It is unclear to me what this means.

      Response: We wanted to express that organoid images belonging to one organoid were assigned in full to a training/validation set. We have now stated this more clearly in the Methods section.
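
      A minimal sketch of such a group-wise split is shown below, assuming each image carries an organoid identifier. The function `split_by_organoid`, the 20% validation fraction, and the identifiers are hypothetical; scikit-learn's `GroupKFold` offers the same guarantee for cross-validation.

```python
import random

def split_by_organoid(image_ids, organoid_of, val_fraction=0.2, seed=0):
    """Split images so that all images of one organoid land in the same set."""
    organoids = sorted({organoid_of[i] for i in image_ids})
    rng = random.Random(seed)
    rng.shuffle(organoids)
    # Hold out whole organoids, never individual frames.
    n_val = max(1, round(val_fraction * len(organoids)))
    val_organoids = set(organoids[:n_val])
    train = [i for i in image_ids if organoid_of[i] not in val_organoids]
    val = [i for i in image_ids if organoid_of[i] in val_organoids]
    return train, val

# Hypothetical data: 10 organoids with 5 frames each.
organoid_of = {f"org{o}_t{t}": f"org{o}" for o in range(10) for t in range(5)}
train_ids, val_ids = split_by_organoid(list(organoid_of), organoid_of)
```

      Splitting at the organoid level prevents near-identical consecutive frames of the same organoid from appearing on both sides of the split.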

      Reviewer #1 (Significance (Required)):

      General assessment: This study demonstrates that (retinal) organoid development can be predicted from early timepoints with deep learning, where these cannot be discerned by human experts or simpler machine learning models. This fact is very interesting in itself due to its implication for organoid development, and could provide a valuable tool for molecular analysis of different organoid populations, as outlined by the authors. The contribution could be strengthened by providing a more thorough investigation of what features in the image are predictive at early timepoints, using a more sophisticated approach than relevance backprop, e.g. Discover (https://www.nature.com/articles/s41467-024-51136-9). This could provide further biological insight into the underlying developmental processes and enhance the understanding of retinal organoid development.

      Response: We thank the reviewer for this assessment and suggestion. We agree that identifying image features predictive at early timepoints would add important biological context. We therefore attempted to apply Discover to our dataset. However, we were unable to get the system to run successfully. After considerable effort, we concluded that this approach could not be integrated into our current analysis. Instead, we report our substantially expanded results obtained with relevance backpropagation, which provided the most interpretable and reproducible insights for our study as described above (New Supplementary Note 1, new Supplementary Figures S29-S37).

      Advance: similar studies that predict developmental outcome based on image data, for example cell proliferation or developmental outcome, exist. However, to the best of my knowledge, this study is the first to apply such a methodology to organoids and convincingly shows its efficacy and argues its potential practical benefits. It thus constitutes a solid technical advance that could be especially impactful if it could be translated to other organoid systems in the future.

      Response: We thank the reviewer for this positive assessment of our work and for highlighting its novelty and potential impact. We are encouraged that the reviewer recognizes the value of applying predictive modeling to organoids and the opportunities this creates for translation to other organoid systems.

      Audience: This research is of interest to a technical audience. It will be of immediate interest to researchers working on retinal organoids, who could adapt and use the proposed system to support experiments by better distinguishing organoids during development. To enable this application, code and data availability should be ensured (see above comments on reproducibility). It is also of interest to researchers in other organoid systems, who may be able to adapt the methodology to different developmental outcome predictions. Finally, it may also be of interest to image analysis / deep learning researchers as a dataset to improve architectures for predictive time series modeling.

      My research background: I am an expert in computer vision and deep learning for biomedical imaging, especially in microscopy. I have some experience developing image analysis for (cancer) organoids. I don't have any experience on the wet lab side of this work.

      Response: We thank the reviewer for this encouraging feedback and for recognizing the broad relevance of our work across retinal organoid research, other organoid systems, and the image analysis community. We are pleased that the potential utility of our dataset and methodology is appreciated by experts in computer vision and biomedical imaging. We have now made the repository and raw data public and apologize for this oversight. The links are provided in the manuscript under “Data availability”.

      Constantin Pape


      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary: Afting et al. present a computational pipeline for analyzing timelapse brightfield images of retinal organoids derived from Medaka fish. Their pipeline processes images along two paths: 1) morphometrics (based on computer vision features from skimage) and 2) deep learning. They discovered, through extensive manual annotation of ground truth, that their deep learning method could predict retinal pigmented epithelium and lens tissue emergence in time points earlier than either morphometrics or expert predictions. Our review is formatted based on the review commons recommendation.

      Response: We thank the reviewer for the detailed and constructive feedback, which has greatly improved the clarity and rigor of our manuscript. In response, we have corrected a potential data leakage issue, re-ran the affected analyses, and confirmed that results remain unchanged. We clarified the use of data augmentation in CNN training, tempered some claims throughout the text, and provided stronger justification for our discretization approach together with new supplementary analyses (New Supplementary Figures S26, S27). We substantially expanded our interpretability analyses across three CNN architectures and eight attribution methods, quantified their consistency and differences (new Supplementary Figures S29, S34-S37, new Supplementary Note 1), and added comprehensive visualizations (New S30-S33). We also addressed technical artifact controls, provided downsampling analyses to support our statement on sample size sufficiency (new Supplementary Figure S28), and included negative-control baselines with shuffled labels in Figures 3 and 4. Furthermore, we improved the clarity of terminology, figures, and methodological descriptions, and we have now made both code and raw data publicly available with documentation. Together, we believe these changes further strengthen the robustness, reproducibility, and interpretability of our study while carefully qualifying the claims.

      Major comments:

      Are the key conclusions convincing?

      Yes, the key conclusion that deep learning outperforms morphometric approaches is convincing. However, several methodological details require clarification. For instance, were the data splitting procedures conducted in the same manner for both approaches? Additionally, the authors note in the methods: "The validation data were scaled to the same range as the training data using the fitted scalers obtained from the training data." This represents a classic case of data leakage, which could artificially inflate performance metrics in traditional machine learning models. It is unclear whether the deep learning model was subject to the same issue. Furthermore, the convolutional neural network was trained with random augmentations, effectively increasing the diversity of the training data. Would the performance advantage still hold if the sample size had not been artificially expanded through augmentation?

      Response: We thank the reviewer for raising these important methodological points. As Reviewer #1 correctly noted, our use of the terms validation and test may have contributed to confusion. To clarify: in the original analysis the scalers were fitted on the training and validation data and then applied to the test data. This indeed constitutes a form of data leakage. We have corrected the respective code, re-ran all analyses that were potentially affected, and did not observe any meaningful change in the reported results. The Methods section has been amended to clarify this important detail.

      For the neural networks, each image was normalized independently (per image), without using dataset-level statistics, thereby avoiding any risk of data leakage.
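
      The two normalization schemes can be illustrated with a short sketch. It assumes a min-max scaler for the morphometric features and a zero-mean, unit-variance normalization per image; the response does not specify either choice, and all function names are hypothetical.

```python
def fit_minmax(train_rows):
    """Learn per-feature minimum and maximum from the training rows only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_minmax(rows, mins, maxs):
    """Scale rows with previously fitted statistics, without refitting."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

def normalize_image(pixels):
    """Per-image normalization: uses only this image's own mean and std."""
    n = len(pixels)
    mean = sum(pixels) / n
    std = (sum((p - mean) ** 2 for p in pixels) / n) ** 0.5
    return [(p - mean) / std if std > 0 else 0.0 for p in pixels]

# Toy feature table: statistics come from the training rows only.
train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 40.0]]
mins, maxs = fit_minmax(train)
train_scaled = apply_minmax(train, mins, maxs)
test_scaled = apply_minmax(test, mins, maxs)
```

      Because the held-out rows never contribute to `mins` and `maxs`, scaled test values may fall outside the range 0 to 1, which is expected. The per-image function needs no fitted state at all, which is why it cannot leak information between data splits.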

      Regarding data augmentation, the convolutional neural network was indeed trained with augmentations. Early experiments without augmentation led to severe overfitting, confirming that the performance advantage would not hold without artificially increasing the effective sample size. We have added a clarifying statement in the Methods section to make this explicit.

      Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether? Their claims are currently preliminary, pending increased clarity and additional computational experiments described below.

      Response: We believe our additionally performed computational experiments qualify all the claims we make in the revised version of the manuscript.

      Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation.

      • The authors discretize continuous variables into four bins for classification. However, a regression framework may be more appropriate for preserving the full resolution of the data. At a minimum, the authors should provide a stronger justification for this binning strategy and include an analysis of bin performance. For example, do samples near bin boundaries perform comparably to those near the bin centers? This would help determine whether the discretization introduces artifacts or obscures signals.

      Response: We thank the reviewer for this thoughtful suggestion. We agree that regression frameworks can, in principle, preserve the full resolution of continuous outcome variables. However, in our setting we deliberately chose a discretization approach. First, the discretized outcome categories correspond to ranges of tissue sizes that are biologically meaningful and allow direct comparison to expert annotations. In practice, human experts also tend to judge tissue presence and size in categorical rather than strictly continuous terms, which was mirrored by our human expert annotation strategy. As we aimed to compare deep learning with classical machine learning models and with expert annotations across the same prediction tasks, a categorical outcome formulation provided the most consistent and fair framework. Secondly, the underlying outcome variables did not follow a normal distribution, but instead exhibited a skewed and heterogeneous spread. Regression models trained on such distributions often show biases toward the most frequent value ranges, which may obscure less common but biologically important outcomes. Discretization mitigated this issue by balancing the prediction task across defined size categories.

      In line with the reviewer’s request, we have now analyzed the performance in relation to the distance of each sample from the bin center. These results are provided as new Supplementary Figures S26 and S27. Interestingly, for the classical machine learning classifiers, F1 scores tended to be somewhat higher for samples close to bin edges. For the convolutional neural networks, however, F1 scores were more evenly distributed across distances from bin centers. While the reason for this difference remains unclear, the analysis demonstrates that the discretization did not obscure predictive signals in either framework. We have amended the results section accordingly.
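
      The boundary-versus-center analysis described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bin edges, the function names, and the use of per-stratum accuracy (instead of the F1 scores reported in the response) are assumptions made to keep the example short.

```python
from bisect import bisect_right


def normalized_distance_to_center(value, edges):
    """Return (bin index, distance to the bin center scaled to [0, 1])."""
    # Clip so values on or beyond the outer edges fall into the first/last bin.
    idx = min(max(bisect_right(edges, value) - 1, 0), len(edges) - 2)
    lo, hi = edges[idx], edges[idx + 1]
    center, half_width = (lo + hi) / 2, (hi - lo) / 2
    return idx, min(abs(value - center) / half_width, 1.0)


def accuracy_by_distance(values, predicted_bins, edges, n_strata=2):
    """Fraction of correctly classified samples per distance stratum.

    Stratum 0 holds samples closest to a bin center, the last stratum
    holds samples closest to a bin edge.
    """
    hits = [0] * n_strata
    counts = [0] * n_strata
    for value, pred in zip(values, predicted_bins):
        true_bin, dist = normalized_distance_to_center(value, edges)
        stratum = min(int(dist * n_strata), n_strata - 1)
        counts[stratum] += 1
        hits[stratum] += int(pred == true_bin)
    return [h / c if c else None for h, c in zip(hits, counts)]


# Hypothetical bin edges and predictions, for illustration only.
edges = [0, 10, 20, 30, 40]
print(accuracy_by_distance([5, 9.5, 15, 19], [0, 1, 1, 1], edges))
```

      Comparing the per-stratum scores shows whether samples near bin boundaries are classified worse than samples near bin centers, which is the artifact the reviewer asked about.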

      • The relevance backpropagation interpretation analysis is not convincing. The authors argue that the model's use of pixels across the entire image (rather than just the RPE region) indicates that the deep learning approach captures holistic information. However, only three example images are shown out of hundreds, with no explanation for their selection, limiting the generalizability of the interpretation. Additionally, it is unclear how this interpretability approach would work at all in earlier time points, particularly before the model begins making confident predictions around the 8-hour mark. It is also not specified whether the input used for GradSHAP matches the input used during CNN training. The authors should consider expanding this analysis by quantifying pixel importance inside versus outside annotated regions over time. Lastly, Figure 4C is missing a scale bar, which would aid in interpretability.

      Response: We thank the reviewer for raising these important concerns. In the initial version we showed examples of relevance backpropagation that suggested CNNs rely on visible RPE or lens tissue for their predictions (original Figure 4C). Following the reviewer’s comment, we expanded the analysis extensively across all models and attribution methods (compare new Supplementary Note 1), and quantified agreement, consistency, entropy, regional overlap, and drift (new Supplementary Figures S29 and S34-S37), as well as providing comprehensive visualizations across models and methods (new Supplementary Figures S30-S33).

      This extended analysis showed that attribution methods behave very differently from each other, but consistently so across the three CNN architectures. Each method displayed characteristic patterns, for example in entropy or center-of-mass drift, but the overlap between methods was generally low. While integrated gradients and DeepLiftSHAP tended to concentrate on tissue regions, other methods produced broader or shifting relevance patterns, and overall we could not establish robust or interpretable signals from a biological point of view that would support stronger conclusions.

      We have therefore revised the text to focus on descriptive results only, without making claims about early structural information or tissue-specific cues being used by the networks. We also added missing scale bars and clarified methodological details. Together, the revised section now reflects the extensive work performed while remaining cautious about what can and cannot be inferred from saliency methods in this setting.

      • The authors claim that they removed technical artifacts to the best of their ability, but it is unclear if the authors performed any adjustment beyond manual quality checks for contamination. Did the authors observe any illumination artifacts (either within a single image or over time)? Any other artifacts or procedures to adjust?

      Response: We thank the reviewer for this comment. We have not performed any adjustment beyond manual quality control post organoid seeding. The aforementioned removal of technical artifacts included, among others, seeding at the same time of day, seeding and cell processing by the same investigator according to a standardized protocol, usage of reproducible chemicals (same LOT, frozen only once, etc.) and temperature control during image acquisition. We adhered strictly to internal, previously published workflows that were aimed to reduce any variability due to technical variations during cell harvesting, organoid preparation and imaging. We have clarified this important point in the Methods section.

      • In lines 434-436, the authors state "In this work, we used 1,000 organoids in total, to achieve the reported prediction accuracies. Yet, we suspect that as little as ~500 organoids are sufficient to reliably recapitulate our findings." It is unclear what evidence supports this claim. The authors could perform a downsampling analysis to determine the tradeoff between performance and sample size.

      Response: We thank the reviewer for this important comment. To clarify, our statement regarding the sufficiency of ~500 organoids was based on a downsampling-style analysis we had already performed. In this analysis, we systematically reduced the number of experiments used for training and assessed predictive performance for both CNN- and classifier-based approaches (former Supplementary Figure S11, new Supplementary Figure S28). For CNNs, performance curves plateaued at approximately six experiments (corresponding to ~500 organoids), suggesting that increasing the sample size further only marginally improved prediction accuracy. In contrast, we did not observe a clear plateau for the machine learning classifiers, indicating that these models can achieve comparable performance with fewer training experiments. We have revised the manuscript text to clarify that this conclusion is derived from these analyses, and continue to include Supplementary Figure S11 as new Supplementary Figure S28 for transparency (compare Supplementary Note 1).

      Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments. Yes, we believe all experiments are realistic in terms of time and resources. We estimate all experiments could be completed in 3-6 months.

      Response: We confirm that the suggested experiments are realistic in terms of time and resources and have been able to complete them within 6 months.

      Are the data and the methods presented in such a way that they can be reproduced? No, the code is not currently available. We were not able to review the source code.

      Response: We have now made the repository public. We apologize for this initial oversight. The links are provided in the revised version of the manuscript under “Data availability”.

      Are the experiments adequately replicated and statistical analysis adequate?

      • The experiments are adequately replicated.

      • The statistical analysis (deep learning) is lacking a negative control baseline, which would be helpful to observe if performance is inflated.

      Response: We thank the reviewer for this comment. We have calculated the respective curves with neural networks and machine learning classifiers that were trained on data with shuffled labels and have included these results as a separate curve in the respective Figures 3 and 4. We have also amended the Methods section accordingly.
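
      The idea of a shuffled-label negative control can be illustrated with a short sketch. It assumes a deliberately simple one-dimensional nearest-centroid classifier as a stand-in for the CNNs and classifiers used in the study; all function names and the toy data are hypothetical.

```python
import random


def fit_centroids(features, labels):
    """Mean feature value per class (a deliberately simple 1-D classifier)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def accuracy(centroids, features, labels):
    correct = sum(predict(centroids, x) == y for x, y in zip(features, labels))
    return correct / len(labels)


def shuffled_label_control(train_x, train_y, test_x, test_y, seed=0):
    """Return (accuracy with real labels, accuracy with shuffled labels)."""
    rng = random.Random(seed)
    shuffled = list(train_y)
    rng.shuffle(shuffled)  # breaks the feature-label link, keeps class balance
    real = accuracy(fit_centroids(train_x, train_y), test_x, test_y)
    control = accuracy(fit_centroids(train_x, shuffled), test_x, test_y)
    return real, control


# Toy data: two well-separated classes.
train_x = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
real, control = shuffled_label_control(train_x, train_y, [0.05, 0.25, 1.05, 1.25], [0, 0, 1, 1])
print(real, control)
```

      If the model trained on real labels does not clearly exceed the shuffled-label control, the reported performance is likely inflated. In practice the shuffling would be repeated over several seeds.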

      Minor comments:

      Specific experimental issues that are easily addressable.

      Are prior studies referenced appropriately?

      Yes.

      Are the text and figures clear and accurate?

      The authors must improve clarity on terminology. For example, they should define a comprehensive dataset, significant, and provide clarity on their morphometrics feature space. They should elaborate on what they mean by "confounding factor of heterogeneity".

      Response: We thank the reviewer for highlighting the need to clarify terminology. We have revised the manuscript accordingly. Specifically, we now explicitly define comprehensive dataset as longitudinal brightfield imaging of ~1,000 organoids from 11 independent experiments, imaged every 30 minutes over several days, covering a wide range of developmental outcomes at high temporal resolution. Furthermore, we replaced the term significantly with wording that avoids implying statistical significance, where appropriate. We have clarified the morphometrics feature space in the Methods section in a more detailed fashion, describing the custom parameters that we used to enhance the regionprops_table function of skimage.

      Do you have suggestions that would help the authors improve the presentation of their data and conclusions? - Figure 2C describes a distance between what? The y axis is likely too simple. Same confusion over Figure 2D. Was distance computed based on tsne coordinates?

      Response: We thank the reviewer for pointing out this potential source of confusion. The distances shown in original Figures 2C and 2D were not calculated in tSNE space. Instead, morphometrics features were first Z-scaled, and then dimensionality reduction by PCA was applied, with the first 20 principal components retaining ~93% of the variance. Euclidean distances were subsequently computed in this 20-dimensional PC space. For inter-organoid distances (Figure 2C), we calculated mean pairwise Euclidean distances between all organoids at each imaging time point, capturing the global divergence of organoid morphologies over time in an experiment-specific manner. For intra-organoid distances (Figure 2D), we calculated Euclidean distances between consecutive time points (n vs. n+1) for each individual organoid, thereby quantifying the extent of morphological change within organoids over time. We have revised the Figure legend and Methods section to make these definitions clearer.
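
      The distance definitions above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, PCA is computed via SVD, and whether the original analysis fitted the PCA on all time points jointly is not specified in the response.

```python
import numpy as np


def pca_scores(features, n_components=20):
    """Z-scale features (rows = observations) and project onto the leading PCs."""
    std = features.std(axis=0)
    std[std == 0] = 1.0  # keep constant features at zero instead of dividing by 0
    z = (features - features.mean(axis=0)) / std
    # SVD of the centered, scaled matrix: rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    k = min(n_components, vt.shape[0])
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return z @ vt[:k].T, explained


def mean_pairwise_distance(points):
    """Inter-organoid measure: mean Euclidean distance over all unordered pairs."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)
    return dists[upper].mean()


def consecutive_distances(trajectory):
    """Intra-organoid measure: distance between time points n and n+1."""
    return np.linalg.norm(np.diff(trajectory, axis=0), axis=1)
```

      With PC scores arranged as an array of shape (organoids, time points, components), `mean_pairwise_distance(scores[:, t])` gives the inter-organoid value at time point `t`, and `consecutive_distances(scores[i])` gives the intra-organoid series for organoid `i`.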

      • The authors perform a Herculean analysis comparing dozens of different machine learning classifiers. They select two, but they should provide justification for this decision.

      Response: We thank the reviewer for this comment. In our initial machine learning analyses, we systematically benchmarked a broad set of classifiers on the morphometrics feature space, using cross-validation and hyperparameter tuning where appropriate. The classifiers that we ultimately focused on were those that consistently achieved the best performance in these comparisons. This process is described in the Methods and summarized in Supplementary Figures S4 and S15 (for sum- and maximum-intensity z-projections, see new Supplementary Figures S5/S6 and S16/S17), which show the results of the benchmarking. We have clarified the text to state that the selected classifiers were chosen on the basis of their superior performance in these evaluations.

      • It would be good to get a sense for how these retinal organoids grow - are they moving all over the place? They are in Matrigel so maybe not, but are they rotating?

      Can the authors' approach predict an entire non-emergence experiment? The authors tried to standardize the protocol, but if the system still produces this much heterogeneity, how well the approach will generalize to a different lab is a limitation.

      Response: We thank the reviewer for these thoughtful questions. The retinal organoids in our study were embedded in low concentrations of Matrigel and remained relatively stable in position throughout imaging. We did not observe substantial displacement or lateral movement of organoids, and no systematic rotation could be detected in our dataset. Small morphological rearrangements within organoids were observed, but the gross positioning of organoids within the wells remained consistent across time-lapse recordings.

      Regarding generalization across laboratories, we agree with the reviewer that this is an important limitation. While we minimized technical variability by adhering to a highly standardized, published protocol (see Methods), considerable heterogeneity remained at both intra- and inter-experimental levels. This variability likely reflects inherent properties of the system, similar to reports in the literature across organoid systems, rather than technical artifacts, and poses a potential challenge for applying our models to independently generated datasets. We therefore highlight the need for future work to test the robustness of our models across laboratories, which will be essential to determine the true generalizability of our approach. We have amended the Discussion accordingly.

      • The authors should dampen claims throughout. For example, in the abstract they state, "by combining expert annotations with advanced image analysis". The image analysis pipelines use common approaches.

      Response: We thank the reviewer for this comment. We agree that the individual image analysis steps we used, such as morphometric feature extraction, are based on well-established algorithms. By referring to “advanced image analysis,” we intended to highlight not the novelty of each single algorithm, but rather the way in which we systematically combined a large number of quantitative parameters and leveraged them through machine learning models to generate predictive insights into organoid development.

      • The authors state: "the presence of RPE and lenses were disagreed upon by the two independently annotating experts in a considerable fraction of organoids (3.9 % for RPE, 2.9% for lenses).", but it is unclear why there were two independently annotating experts. The supplements say images were split between nine experts for annotation.

      Response: We thank the reviewer for pointing out this ambiguity. To clarify, the ground truth definition at the final time point was established by two experts who annotated all organoids. These two annotators were part of the larger group of six experts who contributed to the earlier human expert annotation tasks. Thus, while six experts provided annotations for subsets of images during the expert prediction experiments, the final annotation for every single organoid at its last time frame was consistently performed by the same two experts to ensure a uniform ground truth. We have amended this in the revised manuscript to make this distinction clear.

      • Details on the image analysis pipeline would be helpful to clarify. For example, why did they choose to measure these 165 morphology features? Which descriptors were used to quantify blur? Did the authors apply blur metrics per FOV or per segmented organoid?

      Response: We thank the reviewer for this comment. To clarify, we extracted 165 morphometric features per segmented organoid, combining standard scikit-image region properties with custom implementations (e.g., blur quantified as the variance of the Laplace filter response within the organoid mask). All metrics, including blur, were calculated per segmented organoid rather than per full field of view. This broad feature space was deliberately chosen to capture size, shape, and intensity distributions in a comprehensive and unbiased manner. We now provide a more detailed description of the preprocessing steps, the full feature list, and the exact code implementations in the Methods section (“Large-scale time-lapse Image analysis”) of the revised version of the manuscript as well as in the source code GitHub repository.
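
      A blur metric of the kind mentioned above (variance of the Laplace filter response inside the organoid mask) could look like the following sketch. The 4-neighbour kernel, the edge padding, and the function name are assumptions; the authors' exact implementation is in their repository and may differ.

```python
import numpy as np


def laplacian_variance(image, mask):
    """Variance of the 4-neighbour Laplacian response inside a boolean mask.

    Sharp images have strong local intensity changes and therefore a high
    variance; blurred images give a low variance.
    """
    img = np.pad(image.astype(float), 1, mode="edge")
    # Discrete Laplacian: sum of the four neighbours minus four times the centre.
    response = (
        img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
        - 4.0 * img[1:-1, 1:-1]
    )
    return response[mask].var()
```

      Restricting the variance to `mask` is what makes the metric per organoid rather than per field of view: background pixels outside the segmentation do not contribute.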

      • The description of the number of images is confusing and distracts from the number of organoids. The number of organoids and number of timepoints used would provide a better description of the data with more value. For example, does this image count include all five z slices?

      Response: We thank the reviewer for this comment. The reported image count includes slice 3 only, which we based our models on. The five z-slices that we used to create the MAX- and SUM-intensity z-projections would increase this number 5-fold. While we agree that the number of organoids and time points are highly informative metrics and have provided these details in the manuscript, we also believe that reporting the image count is valuable, as it directly reflects the size of the dataset processed by our analysis pipelines. For this reason, we prefer to keep the current description.

      • The authors should consider applying a maximum projection across the five z slices (rather than the middle z) as this is a common procedure in image analysis. Why not analyze three-dimensional morphometrics or deep learning features? Might this improve performance further?

      Response: We thank the reviewer for this valuable suggestion. To address this point, we repeated all analyses using both sum- and maximum-intensity z-projections and have included the results as new Supplementary Figures S8-S10, S13/S14 for TOI emergence and new Supplementary Figures S19-S21, S24/S25 for TOI sizes (classifier benchmarking and hyperparameter tuning in new Supplementary Figures S5/S6 and S16/S17). These additional analyses did not reveal a noticeable improvement in performance, suggesting that projections incorporating all slices are not strictly necessary in our setting. An analysis that included all five z-slices separately for classification would indeed be of interest, but was not feasible within the scope of this study, as it would substantially increase the computational demands beyond the available resources and timeframe.
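
      For reference, the three input variants discussed here (middle slice, maximum-intensity projection, sum-intensity projection) can be derived from a z-stack as in the sketch below. The (z, y, x) axis order and the function name are assumptions.

```python
import numpy as np


def z_projections(stack):
    """Return middle slice, maximum- and sum-intensity projection of a (z, y, x) stack."""
    middle = stack[stack.shape[0] // 2]  # slice 3 of 5 when counting from 1
    max_proj = stack.max(axis=0)
    # Widen the dtype so that summing 8- or 16-bit slices cannot overflow.
    sum_proj = stack.sum(axis=0, dtype=np.float64)
    return middle, max_proj, sum_proj
```

      Each projection has the same shape as a single slice, so the downstream segmentation and classification steps can be reused unchanged.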

      • There is a lot of manual annotation performed in this work, the authors could speculate how this could be streamlined for future studies. How does the approach presented enable streamlining?

      Response: We thank the reviewer for raising this important point. The current study relied on expert visual review, which is time-intensive, but our findings suggest several ways to streamline future work. For instance, model-assisted prelabeling could be used to automatically accept high-confidence cases while routing only uncertain cases to experts. Active sampling strategies, focusing expert review on boundary cases or rare classes, as well as programmatic checks from morphometrics (e.g., blur or contrast to flag low-quality frames), could further reduce effort. Consensus annotation could be reserved only for cases where the model and expert disagree or confidence is low. Finally, new experiments could be bootstrapped with a small seed set of annotated organoids for fine-tuning before switching to such a model-assisted workflow. These possibilities are enabled by our approach, where organoids are imaged individually, morphometrics provide automated quality indicators, and the CNN achieves reliable performance at early developmental stages, making model-in-the-loop annotation a feasible and efficient strategy for future studies. We have added a clarifying paragraph to the Discussion accordingly.
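
      The model-assisted prelabeling idea can be sketched as a simple confidence-based routing step. The threshold value, the function name, and the data layout are hypothetical; a real workflow would calibrate the threshold on held-out annotated data.

```python
def route_for_annotation(probabilities, accept_threshold=0.95):
    """Split samples into auto-accepted labels and cases sent to an expert.

    `probabilities` maps a sample id to a list of predicted class probabilities.
    """
    auto_labels, needs_review = {}, []
    for sample_id, probs in probabilities.items():
        best_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_class] >= accept_threshold:
            auto_labels[sample_id] = best_class  # high confidence: accept
        else:
            needs_review.append(sample_id)  # uncertain: route to an expert
    return auto_labels, needs_review


auto, review = route_for_annotation(
    {"organoid_a": [0.98, 0.02], "organoid_b": [0.60, 0.40], "organoid_c": [0.01, 0.99]}
)
print(auto, review)
```

      Only the samples in `review` would require expert time, which is where the reduction in annotation effort comes from.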

      Reviewer #2 (Significance (Required)):

      Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field. The paper's advance is technical (providing new methods for organoid quality control) and conceptual (providing proof of concept that earlier time points contain information to predict specific future outcomes in retinal organoids)

      Place the work in the context of the existing literature (provide references, where appropriate).

      • The authors do a good job of placing their work in context in the introduction.
      • The work presents a simple image analysis pipeline (using only the middle z slice) to process timelapse organoid images. So not a 4D pipeline (time and space), just 3D (time). It is likely that more and more of these approaches will be developed over time, and this article is one of the early attempts.

      • The work uses standard convolutional neural networks.

      Response: We thank the reviewer for this assessment. We agree that our work represents one of the early attempts in this direction, applying a straightforward pipeline with standard convolutional neural networks, and we appreciate the reviewer’s acknowledgment of how the study has been placed in context within the Introduction.

      State what audience might be interested in and influenced by the reported findings. - Data scientists performing image-based profiling for time lapse imaging of organoids.

      • Retinal organoid biologists

      • Other organoid biologists who may have long growth times with indeterminate outcomes.

      Response: We thank the reviewer for outlining the relevant audiences. We agree that the reported findings will be of interest to data scientists working on image-based profiling, retinal organoid biologists, and more broadly to organoid researchers facing long culture times with uncertain developmental outcomes.

      Define your field of expertise with a few keywords to help the authors contextualize your point of view. Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate. - Image-based profiling/morphometrics

      • Organoid image analysis

      • Computational biology

      • Cell biology

      • Data science/machine learning

      • Software

      This is a signed review:

      Gregory P. Way, PhD

      Erik Serrano

      Jenna Tomkinson

      Michael J. Lippincott

      Cameron Mattson

      Department of Biomedical Informatics, University of Colorado


      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary:

      This manuscript by Afting et al. addresses the challenge of heterogeneity in retinal organoid development by using deep learning to predict eventual tissue outcomes from early-stage images. The central hypothesis is that deep learning can forecast which tissues an organoid will form (specifically retinal pigmented epithelium, RPE, and lens) well before those tissues become visibly apparent. To test this, the authors assembled a large-scale time-lapse imaging dataset of ~1,000 retinal organoids (~100,000 images) with expert annotations of tissue outcomes. They characterized the variability in organoid morphology and tissue formation over time, focusing on two tissues: RPE (which requires induction) and lens (which appears spontaneously). The core finding is that a deep learning model can accurately predict the emergence and size of RPE and lens in individual organoids at very early developmental stages. Notably, a convolutional neural network (CNN) ensemble achieved high predictive performance (F1-scores ~0.85-0.9) hours before the tissues were visible, significantly outperforming human experts and classical image-analysis-based classifiers. This approach effectively bypasses the issue of stochastic developmental heterogeneity and defines an early "determination window" for fate decisions. Overall, the study demonstrates a proof-of-concept that artificial intelligence can forecast organoid differentiation outcomes non-invasively, which could revolutionize how organoid experiments are analyzed and interpreted.

      Recommendation:

      While this manuscript addresses an important and timely scientific question using innovative deep learning methodologies, it currently cannot be recommended for acceptance in its present form. The authors must thoroughly address several critical limitations highlighted in this report. In particular, significant issues remain regarding the generalizability of the predictive models across different experimental conditions, the interpretability of deep learning predictions, and the use of Euclidean distance metrics in high-dimensional morphometric spaces, which potentially leads to distorted interpretations of organoid heterogeneity. These revisions are essential for validating the general applicability of their approach and enhancing biological interpretability. After thoroughly addressing these concerns, the manuscript may become suitable for future consideration.

      Response: We thank the reviewer for the thoughtful and constructive comments. In response, we expanded our analyses in several key ways. We clarified limitations regarding external datasets. Interpretability analyses were greatly extended across three CNN architectures and eight attribution methods (new Supplementary Figures S29-S37, new Supplementary Note 1), showing consistent but method-specific behaviors; as no reproducible biologically interpretable signals emerged, we now present these results descriptively and clearly state their limitations. We further demonstrated the flexibility of our framework by predicting morphometric clusters in addition to tissue outcomes (new Figure 4C), confirmed robustness of the morphometrics space using PCA and nearest-neighbor analyses (new Supplementary Figure S3), and added statistical tests confirming CNNs significantly outperform classical classifiers (Supplementary File 1). Finally, we made all code and raw data publicly available, clarified species context, and added forward-looking discussion on adaptive interventions. We believe these revisions now further improve the rigor and clarity of our work.

      Major Issues (with Suggestions):

      1. Generalization to Other Batches or Protocols: The drop in performance on independent validation experiments suggests the model may partially overfit to specific experimental conditions. A major concern is how well this approach would work on organoids from a different batch or produced by a slightly different differentiation protocol. Suggestion: The authors should clarify the extent of variability between their "independent experiment" and training data (e.g., were these done months apart, with different cell lines or minor protocol tweaks?). To strengthen confidence in the model's robustness, I recommend testing the trained model on one or more truly external datasets, if available (for instance, organoids generated in a separate lab or under a modified protocol). Even a modest analysis showing the model can be adapted (via transfer learning or re-training) to another dataset would be valuable. If new data cannot be added, the authors should explicitly discuss this limitation and perhaps propose strategies (like domain adaptation techniques or more robust training with diverse conditions) to handle batch effects in future applications.

      Response: We thank the reviewer for this important comment. We fully agree with the reviewer that this would be an amazing addition to the manuscript. Unfortunately, we are not able to obtain the requested external dataset. Although retinal organoid systems exist and are widely used across different species lines, to the best of our knowledge our laboratory is the only one currently raising retinal organoids from primary embryonic pluripotent stem cells of Oryzias latipes, and there is currently only one known (and published) differentiation protocol that allows the successful generation of these organoids. We note that our datasets were collected over the course of nine months, which already introduces variability across time and thus partially addresses concerns regarding batch effects. While we did not have access to truly external datasets (e.g., from other laboratories), we have clarified this limitation as suggested in the revised version of the manuscript and outlined strategies such as domain adaptation and training on more diverse conditions as promising future directions to improve robustness.

      Biological Interpretation of Early Predictive Features: The study currently concludes that the CNN picks up on complex, non-intuitive features that neither human experts nor conventional analysis could identify. However, from a biological perspective, it would be highly insightful to know what these features are (e.g., subtle texture, cell distribution patterns, etc.). Suggestion: I encourage the authors to delve deeper into interpretability. They might try complementary explainability techniques (for example, occlusion tests where parts of the image are masked to see if predictions change, or activation visualization to see what patterns neurons detect) beyond GradientSHAP. Additionally, analyzing false predictions might provide clues: if the model is confident but wrong for certain organoids, what visual traits did those have? If possible, correlating the model's prediction confidence with measured morphometrics or known markers (if any early marker data exist) could hint at what the network sees. Even if definitive features remain unidentified, providing the reader with any hypothesis (for instance, "the network may be sensing a subtle rim of pigmentation or differences in tissue opacity") would add value. This would connect the AI predictions back to biology more strongly.

      Response: We thank the reviewer for this thoughtful suggestion. We agree that linking CNN predictions to specific biological features would be highly valuable. In response, we expanded our interpretability analyses beyond GradientSHAP to a broad set of attribution methods and quantified their behavior across models and timepoints (new Supplementary Figures S29-S37, new Supplementary Note 1). While some methods (e.g., Integrated Gradients, DeepLiftSHAP) occasionally highlighted visible tissue regions, others produced diffuse or shifting relevance, and overall overlap was low. Therefore, our results did not yield reproducible, interpretable biological signals.

      Given these results, we have refrained from speculating about specific early image features and now present the interpretability analyses descriptively. We agree that future studies integrating imaging with molecular markers will be required to directly link early predictive cues to defined biological processes.

      Expansion to Other Outcomes or Multi-Outcome Prediction: The focus on RPE and lens is well-justified, but these are two outcomes within retinal organoids. A major question is whether the approach could be extended to predict other cell types or structures (e.g., presence of certain retinal neurons, or malformations) or even multiple outcomes at once. Suggestion: The authors should discuss the generality of their approach. Could the same pipeline be trained to predict, say, photoreceptor layer formation or other features if annotated? Are there limitations (like needing binary outcomes vs. multi-class)? Even if outside the scope of this study, a brief discussion would reassure readers that the method is not intrinsically limited to these two tissues. If data were available, it would be interesting to see a multi-label classification (predict both RPE and lens presence simultaneously) or an extension to other organoid systems in future. Including such commentary would highlight the broad applicability of this platform.

      Response: We thank the reviewer for this helpful and important suggestion. While our study focused on RPE and lens as the most readily accessible tissues of interest in retinal organoids, our new analyses demonstrate that the pipeline is not limited to these outcomes. In addition to tissue-specific predictions, we trained both a convolutional neural network (on image data) and a decision tree classifier (on morphometrics features) to predict more abstract morphological clusters defined at the final timepoint using the morphometrics features, showing that both approaches could successfully capture non-tissue features from early frames (new Figure 4C). This illustrates that the framework can be extended beyond binary tissue outcomes to multi-class problems, and predict relevant outcomes like the overall organoid morphology. Given appropriate annotations, the framework could in principle be trained to detect additional structures such as photoreceptor layers or malformations. Furthermore, the CNN architecture we employed and the morphometrics feature space are compatible with multi-label classification, meaning simultaneous prediction of several outcomes would also be feasible. We have clarified this point in the discussion to highlight the methodological flexibility and potential generality of our approach and are excited to share this very interesting, additional model with the readership.

      Curse of high dimensionality: Using Euclidean distance in a 165-dimensional morphometric space likely suffers from the curse of dimensionality, which diminishes the meaning of distances as dimensionality increases. In such high-dimensional settings, the range of pairwise distances tends to collapse, undermining the ability to discern meaningful intra- vs. inter-organoid differences. Suggestion: To address this, I would encourage the authors to apply principal component analysis (PCA) in place of (or prior to) tSNE. PCA would reduce the data to a few dominant axes of variation that capture most of the morphometric variance, directly revealing which features drive differences between organoids. These principal components are linear combinations of the original 165 parameters, so one can examine their loadings to identify which morphometric traits carry the most information - yielding interpretable axes of biological variation (e.g., organoid size, shape complexity). In addition, I would like to mention an important cautionary remark regarding tSNE embeddings. tSNE does not preserve global geometry of the data. Distances and cluster separations in a tSNE map are therefore not faithful to the original high-dimensional distances and should be interpreted with caution. See Chari T, Pachter L (2023), The specious art of single-cell genomics, PLoS Comput Biol 19(8): e1011288, for an enlightening discussion in the context of single cell genomics. Chari and Pachter show that extreme dimensionality reduction to 2D can introduce significant distortions in the data's structure, meaning the apparent proximity or separation of points in a tSNE plot may be an artifact of the algorithm rather than a true reflection of morphometric similarity. Implementing PCA would mitigate high-dimensional distance issues by focusing on the most informative dimensions, while also providing clear, quantitative axes that summarize organoid heterogeneity.
This change would strengthen the analysis by making the results more robust (avoiding distance artifacts) and biologically interpretable, as each principal component can be traced back to specific morphometric features of interest.

      Response: We thank the reviewer for this mention. Indeed, high dimensionality and dimensionality reductions can lead to false interpretations. We approached this issue as follows: First, we calculated the same TSNE projections and distances using the first 20 PCs and supplied these data as the new Figure 2 and new Supplementary Figure 2. While the scale of the data shifted slightly, there were no differences in the data distribution that would contradict our prior conclusions.

      In order to confirm the findings and further emphasize the validity of our dimensionality reduction, we calculated the intersection of the 30 nearest neighbors in raw data space (or PCA space) with the 30 nearest neighbors in reduced space (TSNE or UMAP; we included UMAP to show that the effect is not specific to TSNE projections and also holds for a dimensionality reduction that is better known to preserve global structure rather than local structure). As shown in the new Supplementary Figure S3 (A-D), the high Jaccard index confirmed that our projections accurately reflect the data’s structure obtained from raw distance measurements. Moreover, the Jaccard index generally increased over time, which is best explained by a stronger morphological similarity of organoids at timepoint 0, reflected by the dense point cloud in the TSNE projections at that timepoint. The described effects were independent of whether the data were derived from 20 PCs or from all 165 dimensions.
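
      The neighborhood-overlap check described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `knn_sets` and `mean_knn_jaccard` and the toy coordinates are assumptions made for this example, and a real analysis would use k = 30 on the full feature matrix, most likely with an optimized nearest-neighbor library.

```python
import math

def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance to every other point, sorted ascending; ties broken by index.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def mean_knn_jaccard(high_dim, low_dim, k):
    """Mean Jaccard index between k-NN sets in the original and the reduced space."""
    hi_sets = knn_sets(high_dim, k)
    lo_sets = knn_sets(low_dim, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(hi_sets, lo_sets)]
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated groups in 3D and two candidate embeddings.
high = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (11, 0, 0), (12, 0, 0)]
faithful = [(p[0],) for p in high]                   # keeps all neighborhoods
scrambled = [(0,), (10,), (1,), (11,), (2,), (12,)]  # mixes the two groups

print(mean_knn_jaccard(high, faithful, k=2))
print(mean_knn_jaccard(high, scrambled, k=2))
```

      A value near 1 means the reduced space keeps the same local neighborhoods as the original space; lower values indicate that the embedding distorts them.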

      We next wanted to confirm the conclusion that, at later timepoints, data points obtained from the same organoid were more closely related to each other than to data points from different organoids. We therefore identified the 30 nearest neighbor data points, showing that at later timepoints these 30 nearest neighbor data points were almost all attributable to the same organoid (new Supplementary Figure S3 E/F). The only exceptions were experiments that lacked intermediate timepoints (E007 and E002), which misaligned the organoids in the reduced space and confounded the nearest neighbor analysis.
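
      The same-organoid neighbor analysis can be illustrated with a short sketch. The function name `same_organoid_fraction`, the organoid labels, and the toy coordinates are hypothetical and chosen only for this example; the response itself uses k = 30 on the real embeddings.

```python
import math

def same_organoid_fraction(points, organoid_ids, k):
    """For each data point, the fraction of its k nearest neighbors from the same organoid."""
    fractions = []
    for i, p in enumerate(points):
        # Rank all other data points by Euclidean distance to point i.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = [j for _, j in dists[:k]]
        same = sum(organoid_ids[j] == organoid_ids[i] for j in nearest)
        fractions.append(same / k)
    return fractions

# Hypothetical embedding: three timepoints each from organoids "A" and "B".
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ids = ["A", "A", "A", "B", "B", "B"]
print(same_organoid_fraction(points, ids, k=2))
```

      Fractions close to 1 indicate that a data point's neighborhood is dominated by other timepoints of the same organoid.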

      We have included the respective new Figures and new Supplementary Figures and linked them in the main manuscript.

      Statistical Reporting and Significance: The manuscript focuses on F1-score as the metric to report accuracy over time, which is appropriate. However, it's not explicitly stated whether any statistical significance tests were performed on the differences between methods (e.g., CNN vs human, CNN vs classical ML). Suggestion: The authors could report statistical significance of the performance differences, perhaps using a permutation test or McNemar's test on predictions. For example, is the improvement of the CNN ensemble over the Random Forest/QDA classifier statistically significant across experiments? Given the n of organoids, this should be assessable. Demonstrating significance would add rigor to the analysis.

      Response: We thank the reviewer for this helpful suggestion. Following the recommendation, we quantified per-experiment differences in predictive performance by calculating the area under the F1-score curves (AUC) for each classifier and experiment. We then compared methods using paired Wilcoxon signed-rank tests across experiments, with Holm-Bonferroni correction for multiple comparisons. This analysis confirmed that the CNN consistently and significantly outperformed the baseline models and classical machine learning classifiers in validation and test organoids, while the CNNs performed notably, but not significantly, better than the machine learning classifiers in test organoids for RPE area and lens sizes. In summary, these tests add the requested statistical rigor to our analysis. The results are now provided in the Supplementary Material as Supplementary File 1.
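
      Two steps of this procedure, the area under an F1-versus-time curve and the Holm-Bonferroni adjustment, can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and all numbers are made up, and the raw p-values are assumed to come from a paired test computed elsewhere (for example `scipy.stats.wilcoxon` on the per-experiment AUC pairs).

```python
def f1_curve_auc(timepoints, f1_scores):
    """Trapezoidal area under an F1-score curve sampled at the given timepoints."""
    area = 0.0
    for i in range(1, len(timepoints)):
        width = timepoints[i] - timepoints[i - 1]
        area += width * (f1_scores[i] + f1_scores[i - 1]) / 2.0
    return area

def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease along the ranking.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical F1 curve for one classifier in one experiment.
print(f1_curve_auc([0, 1, 2], [0.5, 0.5, 1.0]))

# Hypothetical raw p-values from three pairwise method comparisons.
print(holm_bonferroni([0.01, 0.04, 0.03]))
```

      Summarizing each curve as a single AUC gives one paired value per experiment and classifier, which is the form a paired signed-rank test expects.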

      Minor Issues (with Suggestions):

      1. Data Availability: Given the resource-intensive nature of the work, the value to the community will be highest if the data is made publicly available. I understand that this is of course at the behest of the authors and they do mention that they will make the data available upon publication of the manuscript. For the time being, the authors can consider sharing at least a representative subset of the data or the trained model weights. This will allow others to build on their work and test the method in other contexts, amplifying the impact of the study.

      Response: We have now made the repository and raw data public and apologize for this oversight. The link to the GitHub repository is now provided in the manuscript under “Data availability”, while the links to the datasets are contained within the GitHub repository.

      Discussion - Future Directions: The Discussion does a good job of highlighting applications (like guiding molecular analysis). One minor addition could be speculation on using this approach to actively intervene: for example, could one imagine altering culture conditions mid-course for organoids predicted not to form RPE, to see if their fate can be changed? The authors touch on reducing variability by focusing on the window of determination; extending that thought to an experimental test (though not done here) would inspire readers. This is entirely optional, but a sentence or two envisioning how predictive models enable dynamic experimental designs (not just passive prediction) would be a forward-looking note to end on.

      Response: We thank the reviewer for this constructive suggestion. We have expanded the discussion to briefly address how predictive modeling could go beyond passive observation. Specifically, we now discuss that predictive models may enable dynamic interventions, such as altering culture conditions mid-course for organoids predicted not to form RPE, to test whether their developmental trajectory can be redirected. While outside the scope of the present work, this forward-looking perspective emphasizes how predictive modeling could inspire adaptive experimental strategies in future studies.

      I believe with the above clarifications and enhancements - especially regarding generalizability and interpretability - the paper will be suitable for broad readership. The work represents an exciting intersection of developmental biology and AI, and I commend the authors for this contribution.

      Response: We thank the reviewer for the positive assessment and their encouraging remarks regarding the contribution of our work to these fields.

      Novelty and Impact:

      This work fills an important gap in organoid biology and imaging. Previous studies have used deep learning to link imaging with molecular profiles or spatial patterns in organoids, but there remained a "notable gap" in predicting whether and to what extent specific tissues will form in organoids. The authors' approach is novel in applying deep learning to prospectively predict organoid tissue outcomes (RPE and lens) on a per-organoid basis, something not previously demonstrated in retinal organoids. Conceptually, this is a significant advance: it shows that fate decisions in a complex 3D culture model can be predicted well in advance, suggesting the existence of subtle early morphogenetic cues that only a sophisticated model can discern. The findings will be of broad interest to researchers in organoid technology, developmental biology, and biomedical AI.

      Response: We thank the reviewer for this thoughtful and encouraging assessment. We agree that our study addresses an important gap by prospectively predicting tissue outcomes at the single-organoid level, and we appreciate the recognition that this represents a conceptual advance with relevance not only for retinal organoids but also for broader applications in organoid biology, developmental biology, and biomedical AI.

      Methodological Rigor and Technical Quality:

      The study is methodologically solid and carefully executed. The authors gathered a uniquely large dataset under consistent conditions, which lends statistical power to their analyses. They employ rigorous controls: an expert panel provided human predictions as a baseline, and a classical machine learning pipeline using quantitative image-derived features was implemented for comparison. The deep learning approach is well-chosen and technically sound. They use an ensemble of CNN architectures (DenseNet121, ResNet50, and MobileNetV3) pre-trained on large image databases, fine-tuning them on organoid images. The use of image segmentation (DeepLabV3) to isolate the organoid from background is appropriate to ensure the models focus on the relevant morphology. Model training procedures (data augmentation, cross-entropy loss with class balancing, learning rate scheduling, and cross-validation) are thorough and follow best practices. The evaluation metrics (primarily F1-score) are suitable for the imbalanced outcomes and emphasize prediction accuracy in a biologically relevant way. Importantly, the authors separate training, test, and validation sets in a meaningful manner: images of each organoid are grouped to avoid information leakage, and an independent experiment serves as a validation to test generalization. The observation that performance is slightly lower on independent validation experiments underscores both the realism of their evaluation and the inherent heterogeneity between experimental batches. In addition, the study integrates interpretability (using GradientSHAP-based relevance backpropagation) to probe what image features the network uses. Although the relevance maps did not reveal obvious human-interpretable features, the attempt reflects a commendable thoroughness in analysis. Overall, the experimental design, data analysis, and reporting are of high quality, supporting the credibility of the conclusions.

      Response: We thank the reviewer for their very positive and detailed assessment. We appreciate the recognition of our efforts to ensure methodological rigor and reproducibility, and we agree that interpretability remains an important but challenging area for future work.

      Reviewer #3 (Significance (Required)):

      Scientific Significance and Conceptual Advances:

      Biologically, the ability to predict organoid outcomes early is quite significant. It means researchers can potentially identify when and which organoids will form a given tissue, allowing them to harvest samples at the right moment for molecular assays or to exclude organoids that will not form the desired structure. The manuscript's results indicate that RPE and lens fate decisions in retinal organoids are made much earlier than visible differentiation, with predictive signals detectable as early as ~11 hours for RPE and ~4-5 hours for lens. This suggests a surprising synchronization or early commitment in organoid development that was not previously appreciated. The authors' introduction of deep learning-derived determination windows refines the concept of a developmental "point of no return" for cell fate in organoids. Focusing on these windows could help in pinpointing the molecular triggers of these fate decisions. Another conceptual advance is demonstrating that non-invasive imaging data can serve a predictive role akin to (or better than) destructive molecular assays. The study highlights that classical morphology metrics and even expert eyes capture mainly recognition of emerging tissues, whereas the CNN detects subtler, non-intuitive features predictive of future development. This underlines the power of deep learning to uncover complex phenotypic patterns that elude human analysis, a concept that could be extended to other organoid systems and developmental biology contexts. In sum, the work not only provides a tool for prediction but also contributes conceptual insights into the timing of cell fate determination in organoids.

      Response: We thank the reviewer for this thoughtful and positive assessment. We agree that the determination windows provide a valuable framework to study early fate decisions in organoids, and we have emphasized this point in the discussion to highlight the biological significance of our findings.

      Strengths:

      The combination of high-resolution time-lapse imaging with advanced deep learning is innovative. The authors effectively leverage AI to solve a biological uncertainty problem, moving beyond qualitative observations to quantitative predictions. The study uses a remarkably large dataset (1,000 organoids, >100k images), which is a strength as it captures variability and provides robust training data. This scale lends confidence that the model isn't overfit to a small sample. By comparing deep learning with classical machine learning and human predictions, the authors provide context for the model's performance. The CNN ensemble consistently outperforms both the classical algorithms and human experts, highlighting the value added by the new method. The deep learning model achieves high accuracy (F1 > 0.85) at impressively early time points. The fact that it can predict lens formation just ~4.5 hours into development with confidence is striking. Performance remained strong and exceeded human capability at all assessed times. Key experimental and analytical steps (segmentation, cross-validation between experiments, model calibration, use of appropriate metrics) are executed carefully. The manuscript is transparent about training procedures and even provides source code references, enhancing reproducibility. The manuscript is generally well-written with a logical flow from the problem (organoid heterogeneity) to the solution (predictive modeling) and clear figures referenced.

      Response: We thank the reviewer for this very positive and encouraging assessment of our study, particularly regarding the scale of our dataset, the methodological rigor, and the reproducibility of our approach.

      Weaknesses and Limitations:

      Generalizability Across Batches/Conditions: One limitation is the variability in model performance on organoids from independent experiments. The CNN did slightly worse on a validation set from a separate experiment, indicating that differences in the experimental batch (e.g., slight protocol or environmental variations) can affect accuracy. This raises the question of how well the model would generalize to organoids generated under different protocols or by other labs. While the authors do employ an experiment-wise cross-validation, true external validation (on a totally independent dataset or a different organoid system) would further strengthen the claim of general applicability.

      Response: We thank the reviewer for this important point. We agree that generalizability across batches and experimental conditions is a key consideration. We have carefully revised the discussion to explicitly address this limitation and to highlight the variability observed between independent experiments.

      Interpretability of the Predictions: Despite using relevance backpropagation, the authors were unable to pinpoint clear human-interpretable image features that drive the predictions. In other words, the deep learning model remains somewhat of a "black box" in terms of what subtle cues it uses at early time points. This limits the biological insight that can be directly extracted regarding early morphological indicators of RPE or lens fate. It would be ideal if the study could highlight specific morphological differences (even if minor) correlated with fate outcomes, but currently those remain elusive.

      Response: We thank the reviewer for raising this important point. Indeed, while our models achieved robust predictive performance, the underlying morphological cues remained difficult to interpret using relevance backpropagation. We believe this limitation reflects both the subtlety of the early predictive signals and the complexity of the features captured by deep learning models, which may not correspond to human-intuitive descriptors. We have clarified this limitation in the Discussion and Supplementary Note 1 and emphasize that further methodological advances in interpretability, or integration with complementary molecular readouts, will be essential to uncover the precise morphological correlates of fate determination.

      Scope of Outcomes: The study focuses on two particular tissues (RPE and lens) as the outcomes of interest. These were well-chosen as examples (one induced, one spontaneous), but they do not encompass the full range of retinal organoid fates (e.g., neural retina layers). It's not a flaw per se, but it means the platform as presented is specialized. The method might need adaptation to predict more complex or multiple tissue outcomes simultaneously.

      Response: We agree with the reviewer that our study focuses on two specific tissues, RPE and lens, which served as proof-of-concept outcomes representing both induced and spontaneous differentiation events. While this scope is necessarily limited, we believe it demonstrates the general feasibility of our approach. We now provide additional experiments showing that even abstract morphological classes can be predicted well. We have also clarified in the Discussion that the same framework could, in principle, be extended to additional retinal fates such as neural retina layers, or even to multi-label prediction tasks, provided appropriate annotations are available. Such an extension will be an important next step to broaden the applicability of our platform.

      Requirement of Large Data and Annotations: Practically, the approach required a very large imaging dataset and extensive manual annotation: each organoid's RPE and lens outcome, plus manual masking for training the segmentation model. This is a substantial effort that may be challenging to reproduce widely. The authors suggest that perhaps ~500 organoids might suffice to achieve similar results, but the data requirement is still high. Smaller labs or studies with fewer organoids might not immediately reap the full benefits of this approach without access to such imaging throughput.

      Response: We thank the reviewer for highlighting this important point. We agree that the generation of a large imaging dataset and the associated annotations represent a substantial investment of time and resources. At the same time, we consider this effort highly relevant, as it reflects the intrinsic heterogeneity of organoid systems rather than technical artifacts, and therefore ensures robust model training. We have clarified this limitation in the discussion. While our full dataset included ~1,000 organoids, our downsampling analysis suggests that as few as ~500 organoids may already be sufficient to reproduce the key findings, which we believe makes the approach feasible for many organoid systems (compare new Supplementary Note 1). Moreover, as we outline in the Discussion, future refinements such as combining image- and tabular-based features or incorporating fluorescence data could further enhance predictive power and reduce annotation effort.

      Medaka Fish vs. Other Systems: The retinal organoids in this study appear to be from medaka fish, whereas much organoid research uses human iPSC-derived organoids. It's not fully clear in the manuscript as to how the findings translate to mammalian or human organoids. If there are species-specific differences, the applicability to human retinal organoids (which are important for disease modeling) might need discussion. This is a minor point if the biology is conserved, but worth noting as a potential limitation.

      Response: We thank the reviewer for pointing out this important consideration. We have now explicitly clarified in the Discussion that our proof-of-concept study was performed in medaka organoids, which offer high reproducibility and rapid development. While species-specific differences may exist, the predictive framework is not inherently restricted to medaka and should, in principle, be transferable to mammalian or human iPSC/ESC-derived organoids, provided sufficiently annotated datasets are available. We have amended the Discussion accordingly.

      Predicting Tissue Size is Harder: The model's accuracy in predicting how much tissue (relative area) an organoid will form, while good, is notably lower than for simply predicting presence/absence. Final F1 scores for size classes (~0.7) indicate moderate success. This implies that quantitatively predicting organoid phenotypic severity or extent is more challenging, perhaps due to more continuous variation in size. The authors do acknowledge the lower accuracy for size and treat it carefully.

      Response: We thank the reviewer for this observation and agree with their interpretation. We have already acknowledged in the manuscript that predicting tissue size is more challenging than predicting tissue presence/absence, and we believe we have treated these results with appropriate caution in the revised version of the manuscript.

      Latency vs. Determination: While the authors narrow down the time window of fate determination, it remains somewhat unclear whether the times at which the model reaches high confidence truly correspond to the biological "decision point" or are just the earliest detection of its consequences. The manuscript discusses this caveat, but it's an inherent limitation that the predictive time point might lag the actual internal commitment event. Further work might be needed to link these predictions to molecular events of commitment.

      Response: We agree with the reviewer. As noted in the Discussion, the time points identified by our models likely reflect the earliest detectable morphological consequences of fate determination, rather than the exact molecular commitment events themselves. Establishing a direct link between predictive signals and underlying molecular mechanisms will require future experimental work.

    1. And crawled head downward down a blackened wall / And upside down in air were towers / Tolling reminiscent bells, that kept the hours / And voices singing out of empty cisterns and exhausted wells

      Last year, Addie annotated this exact section and described how Eliot purposefully confuses the reader's sense of right-side-up and upside-down. In an especially insightful section of analysis she claims that if the reader were to orient herself with respect to Dracula (who "crawled head downward down a blackened wall"), the tower down which he crawls becomes inverted - and the corresponding Tarot Card, the Dark Tower, is similarly flipped. Nested in this idea is a broader understanding: that in the chaos and turbulence of the modern world, the only form of agency we truly have is our perspective. When Dracula is flipped upside down, the world appears to him inverted; and though in fact it remains exactly the same as it always was, in his mind's eye all has been reoriented. That's precisely Eliot's point. Though the world itself may be a wasteland, there exists a copy of this world - a world of shadows, of impressions, of perspectives and opinions - which is completely up to interpretation. I think he invokes Tarot as a way of imbuing this doppelganger realm with purpose and value: Tarot is all about perspective. Your interpretation of the card, and what it tells you about your life in this theoretical duplicate of reality, informs the way you act in the real physical world - and so perhaps our agency, though constrained to our own perspectives, is more powerful than we think. The following two lines are relevant insofar as they condense several central thematic discussions: the voices, time, familiarity and remembrance, and water. All of these strands weave together a picture of reality IN FACT: that is, a world in which people are consigned to make the same mistakes over and over, a world where several voices overlap but never really hear one another, a world analogous to a dry rock.
I think Eliot piles up all these images to drive home the fact that though our perspectives may change (though the Dark Tower may become inverted, or vice versa), objective reality is constant. In this way he DOES put a pessimistic constraint on the extent to which our conception of life can actually influence the events occurring around us; but nevertheless I do think there are some shards of positivity embedded in there.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Manuscript number: RC-2024-02830

      Corresponding author(s): Julien, Sage

      1. General Statements

      We thank the Reviewers for a fair review of our work and helpful suggestions. We have significantly revised the manuscript in response to these suggestions. We provide a point-by-point response to the Reviewers below but first want to highlight a recurring concern: the strong cell cycle arrest observed upon acute FAM53C knock-down differs from the limited phenotypes seen in other contexts, including the knockout mice and DepMap data.

      First, we now show that we can recapitulate the strong G1 arrest resulting from the FAM53C knock-down using two independent siRNAs in RPE-1 cells, supporting the specificity of the effects.

      Second, the G1 arrest that results from the FAM53C knock-down is also observed in cells with inactive p53, suggesting it is not due to a non-specific stress response due to “toxic” siRNAs. In addition, the arrest is dependent on RB, which fits with the genetic and biochemical data placing FAM53C upstream of RB, further supporting a specific phenotype.

      Third, we have performed experiments in other human cells, including cancer cell lines. As would be expected for cancer cells, the G1 arrest is less pronounced but is still significant, indicating that the G1 arrest is not unique to RPE-1 cells.

      Fourth, it is not unexpected that compensatory mechanisms would be activated upon loss of FAM53C during development or in cancer – which may explain the lack of phenotypes in vivo or upon long-term knockout. This has been true for many cell cycle regulators, either because of compensation by other family members that have overlapping functions, or by a larger scale rewiring of signaling pathways.

      2. Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary:

      Taylar Hammond and colleagues identified new regulators of the G1/S transition of the cell cycle. They did so by screening publicly available data from the Cancer Dependency Map, and identified FAM53C as a positive regulator of the G1/S transition. Using biochemical assays they then show that FAM53C interacts with the DYRK1A kinase to inhibit its function. DYRK1A in turn is known to induce degradation of cyclin D, leading the authors to propose a model in which DYRK1A-dependent cyclin D degradation is inhibited by FAM53C to permit S-phase entry. Finally the authors assess the effect of FAM53C deletion in a cortical organoid model, and in Fam53c knockout mice. Whereas proliferation of the organoids is indeed inhibited, mice show virtually no phenotype.

      Major comments:

      The authors show convincing evidence that FAM53C loss can reduce S-phase entry in cell cultures, and that it can bind to DYRK1A. However, FAM53C has multiple other binding partners and I am not entirely convinced that negative regulation of DYRK1A is the predominant mechanism to explain its effects on S-phase entry. Some of the claims that are made based on the biochemical assays, and on the physiological effects of FAM53C, are overstated. In addition, some choices made in methodology and data presentation need further attention.

      1. The authors do note that P21 levels increase upon FAM53C knockdown. They show convincing evidence that this is not a P53-dependent response. But the claim that "p21 upregulation alone cannot explain the G1 arrest in FAM53C-deficient cells" (lines 138-139) is misleading. A p53-independent p21 response could still be highly relevant. The authors could test if FAM53C knockdown inhibits proliferation after p21 knockdown or p21 deletion in RPE1 cells.

      The Reviewer raises a great point. Our initial statement needed to be clarified and also needed more experimental support. We have performed experiments where we knocked down FAM53C and p21 individually, as well as in combination, in RPE-1 cells. These experiments show that p21 knock-down is not sufficient to negate the cell cycle arrest resulting from the FAM53C knock-down in RPE-1 cells (Figure 4B,C and Figure S4C,D).

      We have now extended these experiments to conditions where we inhibited DYRK1A, and we also compared these data to experiments in p53-null RPE-1 cells. Altogether, these experiments point to activation of p53 downstream of DYRK1A activation upon FAM53C knock-down, and indicate that p21 is not the only critical p53 target in the cell cycle arrest observed in FAM53C knock-down cells (Figure 4 and Figure S4).

      The authors do not convincingly show that FAM53C acts as a DYRK1A inhibitor in cells. Figures 4B+C and S4B+C show extremely faint P-CycD1 bands, and tiny differences in ratios. The P values are hovering around 0.05, so n=3 is clearly underpowered here. Total CycD1 levels also correlate with FAM53C levels, which seems to affect the ratios more than the tiny pCycD1 bands. Why is there still a pCycD1 band visible in 4B in the GFP + BTZ + DYRK1Ai condition? And if I look at the data points I honestly don't understand how the authors can conclude from S4C that FAM53C knockdown causes (DYRK1A-dependent) increases in pCycD1 (relative to total CycD1). In figure 5C, no blot scans are even shown, and again the differences look tiny. So the authors should either find a way to make these assays more robust, or alter their claims appropriately.

      We appreciate these comments from the Reviewer and have significantly revised the manuscript to address them.

      The analysis of Cyclin D phosphorylation and stability is complicated by the upregulation of p21 upon FAM53C knock-down, in particular because p21 can be part of Cyclin D complexes, which may affect its protein levels in cells (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). Instead of focusing on Cyclin D levels and stability, we refocused the manuscript on RB and p53 downstream of FAM53C loss.

      We removed the previous panel 4B from the revised manuscript. For panels 4E and S4B (now panels S3J and S3K), we used a true “immunoassay” (as indicated in the legend – not an immunoblot), which is much more quantitative and avoids error-prone steps in standard immunoblots (“Western blots”). Briefly, this system was developed by ProteinSimple. It uses capillary transfer of proteins and ELISA-like quantification with up to 6 logs of dynamic range (see their web site https://www.proteinsimple.com/wes.html). The “bands” we show are just a representation of the luminescence signals in capillaries. We made sure to further clarify the figure legends in the revised manuscript.

      The representative Western blot images for 5C-D (now 5F-G) in the original submission are shown in Figure 5E; we apologize if this was not clear. The differences are small, which we acknowledge in the revised manuscript. Note that several factors can affect Cyclin D levels in cells, including the growth rate and the stage of the cell cycle. Our FACS analysis shows that normal organoids have ~63% of cells in G1 and ~13% in S phase; the overall lower proportion of S-phase cells in organoids may make the immunoblot difference appear smaller, with fewer cycling cells resulting in decreased Cyclin D phosphorylation.

      Nevertheless, the Reviewer brings up a good point, and comments from this Reviewer and the others made us re-think how to best interpret our results. As discussed above, we carefully re-read the Meyer paper and think that FAM53C’s role and DYRK1A activity in cells may be understood when considering levels of both CycD and p21 at the same time in a continuum. While our genetic and biochemical data support a role for FAM53C in DYRK1A inhibition, it is likely that the regulation of cell cycle progression by FAM53C is not exclusively due to this inhibition. As discussed above and below, we noted an upregulation of p21 upon FAM53C knock-down, and activation of p53 and its targets likely contributes significantly to the phenotypes observed. We added new experiments to support this more complex model (Figure 4 and Figure S4, with new model in S4L).

      The experiments to test if DYRK1A inhibition could rescue the G1 arrest observed upon FAM53C knockdown are not entirely convincing either. It would be much more convincing if they also perform cell counting experiments as they have done in Figures 1F and 1G, to complement the flow cytometry assays. I suggest that the authors do these cell counting experiments in RPE1 +/- P53 cells as well as HCT116 cells. In addition, did the authors test if P21 is induced by DYRK1Ai in HCT116 cells?

      We repeated the experiments with the DYRK1A inhibitor and counted the cells. In p53-null RPE-1 cells, we found that cell numbers do not increase in these conditions where we had observed a cell cycle re-entry (Fig. 4E), which was accompanied by apoptotic cell death (Fig. S4I). Thus, cells re-enter the cell cycle but die as they progress through S-phase and G2/M. We note that inhibition of DYRK1A has been shown to decrease expression of G2/M regulators (PMID: 38839871), which may contribute to the inability of cells treated with DYRK1Ai to divide. Because our data in RPE-1 cells showed that p21 knock-down was not sufficient to allow the FAM53C knock-down cells to re-enter the cell cycle, we did not further analyze p21 in HCT-116 cells.

      The data in Figure 5C and 5D are identical, although they are supposed to represent either pCycD1 ratios or p21 levels. This is a problem because at least one of the two cannot be true. Please provide the proper data and show (representative) images of both data types.

      We apologize for these duplicated panels in the original submission. We have now replaced the wrong panel with the correct data (Fig. 5F,G).

      Line 246: "Fam53c knockout mice display developmental and behavioral defects." I don't agree with this claim. The mutant mice are born at almost the expected Mendelian ratios, and the body weight development is not consistently altered. But more importantly, no differences in adult survival or microscopic pathology were seen. The authors put strong emphasis on the IMPC behavioral analysis, but they should be more cautious. The IMPC mouse cohorts are tested for many other phenotypes related to behavior and neurological symptoms and apparently none of these other traits were changed in the IMPC Fam53c-/- cohort. Thus, the decreased exploration in a new environment could very well be a chance finding. The authors need to take away claims about developmental and behavioral defects from the abstract, results and discussion sections; the data are just too weak to justify this.

      We agree with the Reviewer that, although we observed significant p-values, this original statement may not be appropriate in the biological sense. We made sure in the revised manuscript to carefully present these data.

      Minor comments:

      Can the authors provide a rationale for each of the proteins they chose to generate the list of the 38 proteins in the DepMap analysis? I looked at the list and it seems to me that they do not all have described functions in the G1/S transition. The analysis may thus be biased.

      To address this point, we updated Table S1 (2nd tab) to provide a better rationale for the 38 factors chosen. Our focus was on the canonical RB pathway and we included RB binding proteins whose function had suggested they may also be playing a role in the G1/S transition. We do agree that there is some bias in this selection (e.g., there are more RB binding factors described) but we hope the Reviewer will agree with us that this list and the subsequent analysis identified expected factors, including FAM53C. Future studies using this approach and others will certainly identify new regulators of cell cycle progression.

      Figure 1B is confusing to me. Are these just some (arbitrarily) chosen examples? Consider leaving this heatmap out altogether, or explain in more detail.

      We agree with the Reviewer that this panel was not necessarily useful and possibly in the wrong place, and we removed it from the manuscript. We replaced it with a cartoon of top hits in the screen.

      The y-axes in Figures 2C, 2D, 2E, and 4D are misleading because they do not start at 0. Please let the axis start at 0, or make axis breaks.

      We re-graphed these panels.

      Line 229: " Consequences ... brain development." This subheader is misleading, because the in vitro cortical organoid system is a rather simplistic model for brain development, and far away from physiological brain development. Please alter the header.

      We changed the header to “Consequences of FAM53C inactivation in human cortical organoids in culture”.

      Figure S5F: the gating strategy is not clear to me. In particular, how do the authors know the difference between subG1 and G1 DAPI signals? Do they interpret the subG1 as apoptotic cells? If yes, why are there so many? Are the culturing or harvesting conditions of these organoids suboptimal? Perhaps the authors could consider doing IF stainings on EdU or BrdU on paraffin sections of organoids to obtain cleaner data?

      Thank you for your feedback. The subG1 population in the original Figure S5F represents cells that died during the dissociation step of the organoids for FACS analysis. To address this point, we performed live/dead staining to exclude dead cells and provide clearer data. We refined the gating strategy for better clarity in the new S5F panel.

      Figure S6A; the labeling seems incorrect. I would think that red is heterozygous here, and grey mutant.

      We fixed this mistake, thank you.

      Reviewer #1 (Significance (Required)):

      The finding that the poorly studied gene FAM53C controls the G1/S transition in cell lines is novel and interesting for the cell cycle field. However, the lack of phenotypes in Fam53c-/- mice makes this finding less interesting for a broader audience. Furthermore, the mechanisms are incompletely dissected. The importance of a p53-independent induction of p21 is not ruled out. And while the direct inhibitory interaction between FAM53C and DYRK1A is convincing (and also reported by others; PMID: 37802655), the authors do not (yet) convincingly show that DYRK1A inhibition can rescue a cell proliferation defect in FAM53C-deficient cells.

      Altogether, this study can be of interest to basic researchers in the cell cycle field.

      I am a cell biologist studying cell cycle fate decisions, and adaptation of cancer cells & stem cells to (drug-induced) stress. My technical expertise aligns well with the work presented throughout this paper, although I am not familiar with biolayer interferometry.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary

      In this study Hammond et al. investigated the role of Dual-specificity Tyrosine Phosphorylation regulated Kinase 1A (DYRK1) in G1/S transition. By exploiting the Dependency Map portal, they identified a previously unexplored protein, FAM53C, as a potential regulator of G1/S transition. Using RNAi, they confirmed that depletion of FAM53C suppressed proliferation of human RPE1 cells and that this phenotype was dependent on the presence of the RB protein. In addition, they noted an increased level of CDKN1A transcript and p21 protein that could explain G1 arrest of FAM53C-depleted cells but, surprisingly, they did not observe activation of other p53 target genes. Proteomic analysis identified DYRK1 as one of the main interactors of FAM53C and the interaction was confirmed in vitro. Further, they showed that purified FAM53C blocked the ability of DYRK1 to phosphorylate cyclin D in vitro although the activity of DYRK1 was likely not inhibited (judging from the modification of FAM53C itself). Instead, it seems more likely that FAM53C competes with cyclin D in this assay. Authors claim that the G1 arrest caused by depletion of FAM53C was rescued by inhibition of DYRK1 but this was true only in cells lacking functional p53. This is quite confusing as DYRK1 inhibition reduced the fraction of G1 cells in p53 wild type cells as well as in p53 knock-outs, suggesting that FAM53C may not be required for regulation of DYRK1 function. Instead of focusing on the impact of FAM53C on cell cycle progression, authors moved towards investigating its potential (and perhaps more complex) roles in differentiation of IPSCs into cortical organoids and in mice. They observed a lower level of proliferating cells in the organoids but if that reflects an increased activity of DYRK1 or if it is just an off target effect of the genetic manipulation remains unclear. Even less clear is the phenotype in FAM53C knock-out mice. Authors did not observe any significant changes in survival nor in organ development but they noted some behavioral differences. Whether and how these are connected to the rate of cellular proliferation was not explored. In summary, the study identified a previously unknown role of FAM53C in proliferation but failed to explain the mechanism and its physiological relevance at the level of tissues and organism. Although some of the data might be of interest, in current form the data is too preliminary to justify publication.

      Major points

      1. Whole study is based on one siRNA to Fam53C and its specificity was not validated. Level of the knock down was shown only in the first figure and not in the other experiments. The observed phenotypes in the cell cycle progression may be affected by variable knock-down efficiency and/or potential off target effects.

      We thank the Reviewer for raising this important point. First, we need to clarify that our experiments were performed with a pool of siRNAs (not one siRNA). Second, commercial antibodies against FAM53C are not of the best quality and it has been challenging to detect FAM53C using these antibodies in our hands – the results are often variable. In addition, to better address the Reviewer’s point and control for the phenotypes we have observed, we performed two additional series of experiments: first, we have confirmed G1 arrest in RPE-1 cells with individual siRNAs, providing more confidence for the specificity of this arrest (Fig. S1B); second, we have new data indicating that other cell lines arrest in G1 upon FAM53C knock-down (Fig. S1E,F and Fig. 4F).

      Experiments focusing on the cell cycle progression were done in a single cell line RPE1 that showed a strong sensitivity to FAM53C depletion. In contrast, phenotypes in IPSCs and in mice were only mild suggesting that there might be large differences across various cell types in the expression and function of FAM53C. Therefore, it is important to reproduce the observations in other cell types.

      As mentioned above, we have new data indicating that other cell lines arrest in G1 upon FAM53C knock-down (three cancer cell lines) (Fig. S1E,F and Fig. 4F).

      Authors state that FAM53C is a direct inhibitor of DYRK1A kinase activity (Line 203), however this model is not supported by the data in Fig 4A. FAM53C seems to be a good substrate of DYRK1 even at high concentrations when phosphorylation of cyclin D is reduced. It rather suggests that DYRK1 is not inhibited by FAM53C but perhaps FAM53C competes with cyclin D. Further, authors should address if the phosphorylation of cyclin D is responsible for the observed cell cycle phenotype. Is this Cyclin D-Thr286 phosphorylation, or are there other sites involved?

      We revised the text of the manuscript to include the possibility that FAM53C could act as a competitive substrate and/or an inhibitor.

      We removed most of the Cyclin D phosphorylation/stability data from the revised manuscript. As the Reviewers pointed out, some of these data were statistically significant but the biological effects were small. As discussed above in our response to Reviewer #1, the analysis of Cyclin D phosphorylation and stability is complicated by the upregulation of p21 upon FAM53C knock-down, in particular because p21 can be part of Cyclin D complexes, which may affect its protein levels in cells (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). Instead of focusing on Cyclin D levels and stability, we refocused the manuscript on RB and p53 downstream of FAM53C loss.

      We note, however, that we used specific Thr286 phospho-antibodies, which have been used extensively in the field. Our data in Figure 1 with palbociclib place FAM53C upstream of Cyclin D/CDK4,6. We performed Cyclin D overexpression experiments but RPE-1 cells did not tolerate high expression of Cyclin D1 (T286A mutant) and we have not been able to conduct more ‘genetic’ studies.

      At many places, information on statistical tests is missing and SDs are not shown in the plots. For instance, what statistics was used in Fig 4C? Impact of FAM53C on cyclin D phosphorylation does not seem to be significant. In the same experiment, does DYRK1 inhibitor prevent modification of cyclin D?

      As discussed above, we removed some of these data and re-focused the manuscript on p53-p21 as a second pathway activated by loss of FAM53C.

      Validation of SM13797 compound in terms of specificity to DYRK1 was not performed.

      This is an important point. We had cited an abstract from the company (Biosplice) but we agree that providing data is critical. We have now revised the manuscript with a new analysis of the compound’s specificity using kinase assays. These data are shown in Fig. S3F-H.

      A fraction of cells in G1 is a very easy readout but it does not measure progression through the G1 phase. Extension of the S phase or G2 delay would indirectly also result in reduction of the G1 fraction. Instead, authors could measure the dynamics of entry to S phase in cells released from a G1 block or from mitotic shake off.

      The Reviewer made a good point. As discussed in our response to Reviewer #1, with p53-null RPE-1 cells, we found that cell numbers do not increase in these conditions where we had observed a cell cycle re-entry (Fig. 4E), which was accompanied by apoptotic cell death (Fig. S4I). Thus, cells re-enter the cell cycle but die as they progress through S-phase and G2/M. We note that inhibition of DYRK1A has been shown to decrease expression of G2/M regulators (PMID: 38839871), which may contribute to the inability of cells treated with DYRK1Ai to divide. Because our data in RPE-1 cells showed that p21 knock-down was not sufficient to allow the FAM53C knock-down cells to re-enter the cell cycle, we did not further analyze p21 in HCT-116 cells. These data indicate that G1 entry by flow cytometry will not always translate into proliferation.

      Other points:

      Fig. 2C, 2D, 2E graphs should begin with 0

      We remade these graphs.

      Fig. 5D shows that the difference in p21 levels is not significant in FAM53C-KO cells but difference is mentioned in the text.

      We replaced the panel with the correct one; we apologize for this error.

      Fig. 6D comparison of datasets of extremely different sizes does not seem to be appropriate

      We agree and revised the text. We hope that the Reviewer will agree with us that it is worth showing these data, which are clearly preliminary but provide evidence of a possible role for FAM53C in the brain.

      Could there be alternative splicing in mice generating a partially functional protein without exon 4? Did authors confirm that the animal model does not express FAM53C?

      We performed RNA sequencing of mouse embryonic fibroblasts derived from control and mutant mice. We clearly identified fewer reads in exon 4 in the knockout cells, and no other obvious change in the transcript (data not shown). However, immunoblot with mouse cells for FAM53C never worked well in our hands. We made sure to add this caveat to the revised manuscript.

      Reviewer #2 (Significance (Required)):

      Main problem of this study is that the advanced experimental models in IPSCs and mice did not confirm the observations in the cell lines and thus the whole manuscript does not hold together. Although I acknowledge the effort the authors invested in these experiments, the data do not contribute to the main conclusion of the paper that FAM53C/DYRK1 regulates G1/S transition.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      This paper identifies FAM53C as a novel regulator of cell cycle progression, particularly at the G1/S transition, by inhibiting DYRK1A. Using data from the Cancer Dependency Map, the authors suggest that FAM53C acts upstream of the Cyclin D-CDK4/6-RB axis by inhibiting DYRK1A.

      Specifically, their experiments suggest that FAM53C Knockdown induces G1 arrest in cells, reducing proliferation without triggering apoptosis. DYRK1A Inhibition rescues G1 arrest in P53KO cells, suggesting FAM53C normally suppresses DYRK1A activity. Mass Spectrometry and biochemical assays confirm that FAM53C directly interacts with and inhibits DYRK1A. FAM53C Knockout in Human Cortical Organoids and Mice leads to cell cycle defects, growth impairments, and behavioral changes, reinforcing its biological importance.

      Strength of the paper:

      The study introduces a novel cell cycle control signalling module upstream of CDK4/6 in G1/S regulation which could have significant impact. The identification of FAM53C using a depmap correlation analysis is a nice example of the power of this dataset. The experiments are carried out mostly in a convincing manner and support the conclusions of the manuscript.

      Critique:

      1) The experiments rely heavily on siRNA transfections without the appropriate controls. There are so many cases of off-target effects of siRNA in the literature, and specifically for a strong phenotype on S-phase as described here, I would expect to see solid results by additional experiments. This is especially important since the ko mice do not show any significant developmental cell cycle phenotypes. Moreover, FAM53C does not show a strong fitness effect in the depmap dataset, suggesting that it is largely non-essential in most cancer cell lines. For this paper to reach publication in a high-standard journal, I would expect that the authors show a rescue of the S-phase phenotype using an siRNA-resistant cDNA, and show similar S-phase defects using an acute knock out approach with lentiviral gRNA/Cas9 delivery.

      We thank the Reviewer for this comment. Please refer to the initial response to the three Reviewers, where we discuss our use of single siRNAs and our results in multiple cell lines. Briefly, we can recapitulate the G1 arrest upon FAM53C knock-down using two independent siRNAs in RPE-1 cells. We also observe the same G1 arrest in p53 knockout cells, suggesting it is not due to a non-specific stress response. In addition, the arrest is dependent on RB, which fits with the genetic and biochemical data placing FAM53C upstream of RB, further supporting a specific phenotype. Human cancer cell lines also arrest in G1 upon FAM53C knock-down, not just RPE-1 cells. Finally, we hope the Reviewer will agree with us that compensatory mechanisms are very common in the cell cycle – which may explain the lack of phenotypes in vivo or upon long-term knockout of FAM53C.

      2) The S-phase phenotype following FAM53C should be demonstrated in a larger variety of TP53WT and mutant cell lines. Given that this paper introduces a new G1/S control element, I think this is important for credibility. Ideally, this should be done with acute gRNA/Cas9 gene deletion using a lentiviral delivery system; but if the siRNA rescue experiments work and validate an on-target effect, siRNA would be an appropriate alternative.

      We now show data with three cancer cell lines (U2OS, A549, and HCT-116 – Fig. S1E,F and Fig. 4F), in addition to our results in RPE-1 cells and in human cortical organoids. We note that the knock-down experiments are complemented by overexpression data (Fig. 1G-I), by genetic data (our original DepMap screen), and our biochemical data (showing direct binding of FAM53C to DYRK1A).

      3) The western blot images shown in the MS appear heavily over-processed and saturated (See for example S4B, 4A, B, and E). Perhaps the authors should provide the original un-processed data of the entire gels?

      For several of our panels (e.g., 4E and S4B, now panels S3J and S3K), we used a true “immunoassay” (as indicated in the legend – not an immunoblot), which is much more quantitative and avoids error-prone steps in standard immunoblots (“Western blots”). Briefly, this system was developed by ProteinSimple. It uses capillary transfer of proteins and ELISA-like quantification with up to 6 logs of dynamic range (see their web site https://www.proteinsimple.com/wes.html). The “bands” we show are just a representation of the luminescence signals in capillaries. We made sure to further clarify the figure legends in the revised manuscript.

      Data in 4A are also not a western blot but a radiograph.

      For immunoblots, we will provide all the source data with uncropped blots with the final submission.

      4) A critical experiment for the proposed mechanism is the rescue of the FAM53C S-phase reduction using DYRK1A inhibition shown in Figure 4. The legend here states that the data were extracted from BrdU incorporation assays, but in Figure S4D only the PI histograms are shown, and the S-phase population is not quantified. The authors should show the BrdU scatterplot and quantify the phenotype using the S-phase population in these plots. G1 measurements from PI histograms are not precise enough to allow for conclusions. Also, why are the intensities of the PI peaks so variable in these plots? Compare, for example, the HCT116 upper and lower panels where the siRNA appears to have caused an increase in ploidy.

      We apologize for the confusion and have fixed these errors; for most of the analyses, we used PI to measure G1 and S-phase entry. We added relevant flow cytometry plots to supplemental figures (Fig. S1G, H, I, as well as Fig. S4E and S4K, and Fig. S5F).

      5) There's an apparent contradiction in how RB deletion rescues the G1 arrest (Figure 2) while p21 seems to maintain the arrest even when DYRK1A is inhibited. Is p21 not induced when FAM53C is depleted in RB ko cells? This should be measured and discussed.

      This comment and comments from the two other Reviewers made us reconsider our model. We carefully re-read the Meyer paper and think that DYRK1A activity may be understood when considering levels of both CycD and p21 at the same time in a continuum (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). While our genetic and biochemical data support a role for FAM53C in DYRK1A inhibition, it is obvious that the regulation of cell cycle progression by FAM53C is not exclusively due to this inhibition. As discussed above and below, we noted an upregulation of p21 upon FAM53C knock-down, and activation of p53 and its targets likely contributes significantly to the phenotypes observed. We added new experiments to support this more complex model (Figure 4 and Figure S4, with new model in S4L).

      Reviewer #3 (Significance (Required)):

      In conclusion, I believe that this MS could potentially be important for the cell cycle field and also provide a new target pathway that could be relevant for cancer therapy. However, the paper has quite a few gaps and inconsistencies that need to be addressed with further experiments. My main worry is that the acute depletion phenotypes appear so strong, while the gene is non-essential in mice and shows only a minor fitness effect in the depmap screens. More convincing controls are necessary to rule out experimental artefacts that misguide the interpretation of the results.

      We appreciate this comment and hope that the Reviewer will agree it is still important to share our data with the field, even if the phenotypes in mice are modest.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      We would like to thank all the reviewers for their valuable comments and criticisms. We have thoroughly revised the manuscript and the resource to address all the points raised by the reviewers. Below, we provide a point-by-point response for the sake of clarity.

      Reviewer #1

      Evidence, reproducibility and clarity

      Summary: This manuscript, "MAVISp: A Modular Structure-Based Framework for Protein Variant Effects," presents a significant new resource for the scientific community, particularly in the interpretation and characterization of genomic variants. The authors have developed a comprehensive and modular computational framework that integrates various structural and biophysical analyses, alongside existing pathogenicity predictors, to provide crucial mechanistic insights into how variants affect protein structure and function. Importantly, MAVISp is open-source and designed to be extensible, facilitating reuse and adaptation by the broader community.

      Major comments: - While the manuscript is formally well-structured (with clear Introduction, Results, Conclusions, and Methods sections), I found it challenging to follow in some parts. In particular, the Introduction is relatively short and lacks a deeper discussion of the state-of-the-art in protein variant effect prediction. Several methods are cited but not sufficiently described, as if prior knowledge were assumed. OPTIONAL: Extend the Introduction to better contextualize existing approaches (e.g., AlphaMissense, EVE, ESM-based predictors) and clarify what MAVISp adds compared to each.

      We have expanded the introduction on the state-of-the-art of protein variant effects predictors, explaining how MAVISp departs from them.

      - The workflow is summarized in Figure 1(b), which is visually informative. However, the narrative description of the pipeline is somewhat fragmented. It would be helpful to describe in more detail the available modules in MAVISp, and which of them are used in the examples provided. Since different use cases highlight different aspects of the pipeline, it would be useful to emphasize what is done step-by-step in each.

      We have added a concise, narrative description of the data flow for MAVISp, as well as improved the description of modules in the main text. We will integrate the results section with a more comprehensive description of the available modules, and then clarify in the case studies which modules were applied to achieve specific results.

      OPTIONAL: Consider adding a table or a supplementary figure mapping each use case to the corresponding pipeline steps and modules used.

      We have added a supplementary table (Table S2) to guide the reader on the modules and workflows applied for each case study.

      We also added Table S1 to map the toolkit used by MAVISp to collect the data that are imported and aggregated in the webserver for further guidance.

      - The text contains numerous acronyms, some of which are not defined upon first use or are only mentioned in passing. This affects readability. OPTIONAL: Define acronyms upon first appearance, and consider moving less critical technical details (e.g., database names or data formats) to the Methods or Supplementary Information. This would greatly enhance readability.

      We revised the usage of acronyms, following the reviewer’s suggestion to define them at first appearance.

      • The code and trained models are publicly available, which is excellent. The modular design and use of widely adopted frameworks (PyTorch and PyTorch Geometric) are also strong points. However, the Methods section could benefit from additional detail regarding feature extraction and preprocessing steps, especially the structural features derived from AlphaFold2 models. OPTIONAL: Include a schematic or a table summarizing all feature types, their dimensionality, and how they are computed.

      We thank the reviewer for noticing and praising the availability of the tools of MAVISp. Our MAVISp framework utilizes methods and scores that incorporate machine learning features (such as EVE or RaSP), but does not employ machine learning itself. Specifically, we do not use PyTorch and do not utilize features in a machine learning sense. We do extract some information from the AlphaFold2 models that we use (such as the pLDDT score and their secondary structure content, as calculated by DSSP), and those are available in the MAVISp aggregated csv files for each protein entry and detailed in the Documentation section of the MAVISp website.

      • The section on transcription factors is relatively underdeveloped compared to other use cases and lacks sufficient depth or demonstration of its practical utility. OPTIONAL: Consider either expanding this section with additional validation or removing/postponing it to a future manuscript, as it currently seems preliminary.

      We have removed this section and included a mention in the conclusions as part of the future directions.

      Minor comments:

      - Most relevant recent works are cited, including EVE, ESM-1v, and AlphaFold-based predictors. However, recent methods like AlphaMissense (Cheng et al., 2023) could be discussed more thoroughly in the comparison.

      We have revised the introduction to give appropriate space to this comparison.

      • Figures are generally clear, though some (e.g., performance barplots) are quite dense. Consider enlarging font sizes and annotating key results directly on the plots.

      We have revised Figure 2 and now present only one case study to improve its readability. We have also changed Figure 3, while retaining the other figures, since they seemed less problematic.

      • Minor typographic errors are present. A careful proofreading is highly recommended. Below are some of the issues I identified:
        - Page 3, line 46: "MAVISp perform" -> "MAVISp performs"
        - Page 3, line 56: "automatically as embedded" -> "automatically embedded"
        - Page 3, line 57: "along with to enhance" -> unclear; please revise
        - Page 4, line 96: "web app interfaces with the database and present" -> "presents"
        - Page 6, line 210: "to investigate wheatear" -> "whether"
        - Page 6, lines 215-216: "We have in queue for processing with MAVISp proteins from datasets relevant to the benchmark of the PTM module." -> unclear sentence; please clarify
        - Page 15, line 446: "Both the approaches" -> "Both approaches"
        - Page 20, line 704: "advantage of multi-core system" -> "multi-core systems"

      We have proofread the entire article, including the points above.

      Significance

      General assessment: the strongest aspects of the study are the modularity, open-source implementation, and the integration of structural information through graph neural networks. MAVISp appears to be one of the few publicly available frameworks that can easily incorporate AlphaFold2-based features in a flexible way, lowering the barrier for developing custom predictors. Its reproducibility and transparency make it a valuable resource. However, while the technical foundation is solid and the effort substantial, the scientific narrative and presentation could be significantly improved. The manuscript is dense and hard to follow in places, with a heavy use of acronyms and insufficient explanation of key design choices. Improving the descriptive clarity, especially in the early sections, would greatly enhance the impact of this work.

      Advance

      to the best of my knowledge, this is one of the first modular platforms for protein variant effect prediction that integrates structural data from AlphaFold2 with bioinformatic annotations and even clinical data in an extensible fashion. While similar efforts exist (e.g., ESMfold, AlphaMissense), MAVISp distinguishes itself through openness and design for reusability. The novelty is primarily technical and practical rather than conceptual.

      Audience

      this study will be of strong interest to researchers in computational biology, structural bioinformatics, and genomics, particularly those developing variant effect predictors or analyzing the impact of mutations in clinical or functional genomics contexts. The audience is primarily specialized, but the open-source nature of the tool may diffuse its use among more applied or translational users, including those working in precision medicine or protein engineering.

      Reviewer expertise: my expertise is in computational structural biology, molecular modeling, and (rather weak) machine learning applications in bioinformatics. I am familiar with graph-based representations of proteins, AlphaFold2, and variant effects based on Molecular Dynamics simulations. I do not have any direct expertise in clinical variant annotation pipelines.

      Reviewer #2

      Evidence, reproducibility and clarity

      Summary: The authors present a pipeline and platform, MAVISp, for aggregating, displaying and analysing variant effects, with a focus on reclassification of variants of uncertain clinical significance and uncovering the molecular mechanisms underlying the mutations.

      Major comments:

      - On testing the platform, I was unable to look up a specific variant in ADCK1 (rs200211943, R115Q). I found that despite stating that the mapped RefSeq ID was NP_001136017 in the HGVSp column, it was actually mapped to the canonical UniProt sequence (Q86TW2-1). NP_001136017 actually maps to Q86TW2-3, which is missing residues 74-148 compared to the -1 isoform. The UniProt canonical sequence has no exact RefSeq mapping, so the HGVSp column is incorrect in this instance. This mapping issue may also affect other proteins and result in incorrect HGVSp identifiers for variants.

      We would like to thank the reviewer for pointing out these inconsistencies. We have revised all the entries and corrected them. If needed, the history of the cases that have been corrected can be found in the closed issues of the GitHub repository that we use for communication between biocurators and data managers (https://github.com/ELELAB/mavisp_data_collection). We have also revised the protocol we follow in this regard and the MAVISp toolkit to include better support for isoform matching in our pipelines for future entries, as well as for the revision/monitoring of existing ones, as detailed in the Method Section. In particular, we introduced a tool, uniprot2refseq, which aids the biocurator in identifying the correct match in terms of sequence length and sequence identity between RefSeq and UniProt. More details are included in the Method Section of the paper. The two relevant scripts for this step are available at: https://github.com/ELELAB/mavisp_accessory_tools/

      - The paper lacks a section on how to properly interpret the results of the MAVISp platform (the case-studies are helpful, but don't lay down any global rules for interpreting the results). For example: How should a variant with conflicts between the variant impact predictors be interpreted? Are specific indicators considered more 'reliable' than others?

      We have added a section in Results to clarify how to interpret results from MAVISp in the most common use cases.

      • In the Methods section, GEMME is stated as being rank-normalised with 0.5 as a threshold for damaging variants. On checking the data downloaded from the site, GEMME was not rank-normalised but rather min-max normalised. Furthermore, Supplementary text S4 conflicts with the methods section over how GEMME scores are classified, S4 states that a raw-value threshold of -3 is used.

      We thank the reviewer for spotting this inconsistency. This part of the main text was left over from an earlier, preliminary version of the preprint, and we have revised it. Supplementary Text S4 includes the correct reference for the value, in light of the benchmarking reported there.
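To illustrate why the distinction raised by the reviewer matters, here is a minimal Python sketch contrasting min-max and rank normalisation. The score values are hypothetical (they are not taken from GEMME or MAVISp output) and the function names are illustrative only; the sketch simply shows that a fixed 0.5 cut-off can select different variants under the two schemes.

```python
def min_max_normalise(scores):
    """Rescale scores linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rank_normalise(scores):
    """Replace each score by its rank scaled to [0, 1]; ties share the average rank."""
    n = len(scores)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


if __name__ == "__main__":
    # Hypothetical raw scores for five variants (illustration only).
    raw = [-6.0, -3.0, -1.0, -0.5, 0.0]
    print("min-max:", min_max_normalise(raw))
    print("rank:   ", rank_normalise(raw))
```

With these hypothetical values, the second variant sits exactly at 0.5 after min-max normalisation but at 0.25 after rank normalisation, so the number of variants falling on either side of a 0.5 threshold differs between the two schemes.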

      • Note. This is a major comment as one of the claims is that the associated web-tool is user-friendly. While functional, the web app is very awkward to use for analysis on any more than a few variants at once. The fixed window size of the protein table necessitates excessive scrolling to reach your protein-of-interest. This will also get worse as more proteins are added. Suggestion: add a search/filter bar. The same applies to the dataset window.

      We have changed the structure of the webserver in such a way that now the whole website opens as its own separate window, instead of being confined within the size permitted by the website at DTU. This solves the fixed window size issue. Hopefully, this will improve the user experience.

      We have refactored the web app by adding filtering functionality, both for the main protein table (that can now be filtered by UniProt AC, gene name or RefSeq ID) and the mutations table. Doing this required a general overhaul of the table infrastructure (we changed the underlying engine that renders the tables).

      • You are unable to copy anything out of the tables.
      • Hyperlinks in the tables only seem to work if you open them in a new tab or window.

      The table overhaul fixed both of these issues.

      • All entries in the reference column point to the MAVISp preprint even when data from other sources is displayed (e.g. MAVE studies).

      We clarified the meaning of the reference column in the Documentation on the MAVISp website, as we realized it had confused the reviewer. The reference column is meant to cite the papers where the computationally generated MAVISp data are used, not external sources. Since the most recent release also includes the experimental data module, we have refactored the MAVISp website by adding a “Datasets and metadata” page, which details metadata for key modules. These include references to data from external sources that we include in MAVISp on a case-by-case basis (for example, the results of a MAVE experiment). Additionally, we have verified that the papers using MAVISp data are up to date in https://elelab.gitbook.io/mavisp/overview/publications-that-used-mavisp-data and in the csv files of the proteins concerned.

      The publications using MAVISp data that are currently included as references are listed below:

      | Protein(s) | Title | Journal | PMID | DOI |
      |---|---|---|---|---|
      | SMPD1 | ASM variants in the spotlight: A structure-based atlas for unraveling pathogenic mechanisms in lysosomal acid sphingomyelinase | Biochim Biophys Acta Mol Basis Dis | 38782304 | 10.1016/j.bbadis.2024.167260 |
      | TRAP1 | Point mutations of the mitochondrial chaperone TRAP1 affect its functions and pro-neoplastic activity | Cell Death & Disease | 40074754 | 10.1038/s41419-025-07467-6 |
      | BRCA2 | Saturation genome editing-based clinical classification of BRCA2 variants | Nature | 39779848 | 10.1038/s41586-024-08349-1 |
      | TP53, GRIN2A, CBFB, CALR, EGFR | TRAP1 S-nitrosylation as a model of population-shift mechanism to study the effects of nitric oxide on redox-sensitive oncoproteins | Cell Death & Disease | 37085483 | 10.1038/s41419-023-05780-6 |
      | KIF5A, CFAP410, PILRA, CYP2R1 | Computational analysis of five neurodegenerative diseases reveals shared and specific genetic loci | Computational and Structural Biotechnology Journal | 38022694 | 10.1016/j.csbj.2023.10.031 |
      | KRAS | Combining evolution and protein language models for an interpretable cancer driver mutation prediction with D2Deep | Brief Bioinform | 39708841 | 10.1093/bib/bbae664 |
      | OPTN | Decoding phospho-regulation and flanking regions in autophagy-associated short linear motifs | Communications Biology | 40835742 | 10.1038/s42003-025-08399-9 |
      | DLG4, GRB2, SMPD1 | Deciphering long-range effects of mutations: an integrated approach using elastic network models and protein structure networks | JMB | 40738203 | 10.1016/j.jmb.2025.169359 |

      • Entering multiple mutants in the "mutations to be displayed" window is time-consuming for more than a handful of mutants. Suggestion: Add a box where multiple mutants can be pasted in at once from an external document.

      During the table overhaul, we have revised the user interface to add a text box that allows free copy-pasting of mutation lists. While we understand having a single input box would have been ideal, the former selection interface (which is also still available) doesn’t allow copy-paste. This is a known limitation in Streamlit.

      Minor comments

      • Grammar. I appreciate that this manuscript may have been compiled by a non-native English speaker, but I would be remiss not to point out that there are numerous grammar errors throughout, usually sentence order issues or non-pluralisation. The meaning of the authors is mostly clear, but I recommend very thoroughly proof-reading the final version.

      We have proofread the final version of the manuscript.

      • There are numerous proteins that I know have high-quality MAVE datasets that are absent in the database e.g. BRCA1, HRAS and PPARG.

      Yes, we are aware of this. It is far from trivial to properly import datasets from multiplex assays; they often need to be treated on a case-by-case basis. We are carefully compiling all the MAVE data locally before releasing them in the public version of the database, which is why they are missing. We are giving priority to the datasets that can be correlated with our predictions of changes in structural stability; we will then cover the remaining datasets, handling them in batches. Having said this, we have checked the datasets for BRCA1, HRAS, and PPARG. We have imported those for PPARG and BRCA1 from ProteinGym, referring to the studies published in 10.1038/ng.3700 and 10.1038/s41586-018-0461-z, respectively. For HRAS, after checking both the available data and the literature in detail, we identified a suitable dataset (10.7554/eLife.27810) but struggled to determine a sensible cut-off for discriminating between pathogenic and non-pathogenic variants, so we have not included it in the MAVISp dataset for now. We will contact the authors to clarify which thresholds to apply before importing the data.

      • Checking one of the existing MAVE datasets (KRAS), I found that the variants were annotated as damaging, neutral or given a positive score (these appear to stand-in for gain-of-function variants). For better correspondence with the other columns, those with positive scores could be labelled as 'ambiguous' or 'uncertain'.

      In the KRAS case study presented in MAVISp, we utilized the protein abundance dataset reported in (http://dx.doi.org/10.1038/s41586-023-06954-0) and made available in the ProteinGym repository (specifically referenced at https://github.com/OATML-Markslab/ProteinGym/blob/main/reference_files/DMS_substitutions.csv#L153). We adopted the precalculated thresholds as provided by the ProteinGym authors. We are therefore not sure whether the reviewer is referring to this dataset or to another KRAS dataset.

      • Numerous thresholds are defined for stabilizing / destabilizing / neutral variants in both the STABILITY and the LOCAL_INTERACTION modules. How were these thresholds determined? I note that (PMC9795540) uses a ΔΔG threshold of 1/-1 for defining stabilizing and destabilizing variants, which is relatively standard (though they also say that 2-3 would likely be better for pinpointing pathogenic variants).

      We improved the description of our classification strategies for both modules in the Documentation page of our website. We also explained more clearly the possible sources of ‘uncertain’ annotations for the two modules in both the web app (Documentation page) and the main text. Briefly, in the STABILITY module, we consider FoldX and either Rosetta or RaSP to achieve a final classification. We first classify the result of each method independently, according to the following strategy (values in kcal/mol):

      If DDG ≥ 3, the mutation is Destabilizing. If DDG ≤ −3, the mutation is Stabilizing. If −2 …

      We then compare the classifications obtained by the two methods: if they agree, that is the final classification; if they disagree, the final classification is Uncertain. The thresholds were selected based on previous studies, in which variants with changes in stability below 3 kcal/mol did not show a markedly different abundance at the cellular level [10.1371/journal.pgen.1006739, 10.7554/eLife.49138].
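The consensus logic described above for the STABILITY module can be sketched in Python as follows. This is an illustrative sketch, not the MAVISp implementation. The Destabilizing and Stabilizing thresholds and the agree/disagree rule come from the text; the third rule is truncated in the text, so the Neutral band (−2 < DDG < 2) and the treatment of the remaining intervals as Uncertain are assumptions, guided by Reviewer #3's remark that values between 2 and 3 kcal/mol in magnitude should presumably be uncertain. All function names are hypothetical.

```python
def classify_stability(ddg):
    """Classify a single predicted change in folding free energy (kcal/mol)."""
    if ddg >= 3.0:
        return "Destabilizing"   # stated in the response
    if ddg <= -3.0:
        return "Stabilizing"     # stated in the response
    if -2.0 < ddg < 2.0:
        return "Neutral"         # ASSUMPTION: the rule is truncated in the source text
    return "Uncertain"           # ASSUMPTION: covers 2 <= |DDG| < 3


def stability_consensus(foldx_ddg, other_ddg):
    """Combine FoldX with a second method (Rosetta or RaSP).

    If the two per-method classes agree, that class is final;
    otherwise the final classification is Uncertain.
    """
    first = classify_stability(foldx_ddg)
    second = classify_stability(other_ddg)
    return first if first == second else "Uncertain"


if __name__ == "__main__":
    # Hypothetical DDG pairs (FoldX, second method), for illustration only.
    for pair in [(3.5, 4.0), (3.5, 0.5), (0.1, -1.0)]:
        print(pair, "->", stability_consensus(*pair))
```

Keeping the per-method classifier separate from the consensus step makes it straightforward to swap the second method (Rosetta or RaSP) without touching the agreement rule.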

      The LOCAL_INTERACTION module works similarly to the STABILITY module, in that Rosetta and FoldX are considered independently, and an implicit classification is performed for each, according to the following rules (values in kcal/mol):

      If DDG > 1, the mutation is Destabilizing. If DDG …

      Each mutation is therefore classified by both methods. If the methods agree (i.e., if they classify the mutation in the same way), their consensus is the final classification for the mutation; if they do not agree, the final classification is Uncertain.

      If a mutation does not have an associated free energy value, the relative solvent accessible area is used to classify it: if SAS > 20%, the mutation is classified as Uncertain, otherwise it is not classified.

      The thresholds here were selected according to the best practices followed by the tool authors and, more generally, in the literature, as the reviewer also noted.
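A comparable sketch for the LOCAL_INTERACTION rules above, including the solvent-accessibility fallback, is given below. Again, this is only an illustration under stated assumptions and not the MAVISp code. The DDG > 1 rule, the consensus rule and the 20% accessibility cut-off come from the text; the second DDG rule is truncated in the text, so the symmetric Stabilizing threshold (DDG < −1) and the Neutral band are assumptions based on the ±1 kcal/mol convention mentioned by the reviewer. Applying the fallback whenever either free energy value is missing is also an assumption. Function and parameter names are hypothetical.

```python
from typing import Optional


def classify_binding(ddg: float) -> str:
    """Classify a single predicted change in binding free energy (kcal/mol)."""
    if ddg > 1.0:
        return "Destabilizing"   # stated in the response
    if ddg < -1.0:
        return "Stabilizing"     # ASSUMPTION: the rule is truncated in the source text
    return "Neutral"             # ASSUMPTION: remaining interval


def local_interaction_class(
    foldx_ddg: Optional[float],
    rosetta_ddg: Optional[float],
    relative_sas_percent: Optional[float],
) -> Optional[str]:
    """Return the final class, or None when the mutation is not classified."""
    if foldx_ddg is None or rosetta_ddg is None:
        # ASSUMPTION: the fallback applies when either free energy value is missing.
        if relative_sas_percent is not None and relative_sas_percent > 20.0:
            return "Uncertain"
        return None
    first = classify_binding(foldx_ddg)
    second = classify_binding(rosetta_ddg)
    return first if first == second else "Uncertain"


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(local_interaction_class(1.5, 2.0, None))
    print(local_interaction_class(None, None, 35.0))
    print(local_interaction_class(None, None, 5.0))
```

Returning `None` for unclassified mutations keeps "not classified" distinct from the explicit Uncertain label, which the text treats as two different outcomes.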

      • "Overall, with the examples in this section, we illustrate different applications of the MAVISp results, spanning from benchmarking purposes, using the experimental data to link predicted functional effects with structural mechanisms or using experimental data to validate the predictions from the MAVISp modules."

      The last of these points is not an application of MAVISp, but rather a way in which external data can help validate MAVISp results. Furthermore, none of the examples given demonstrate an application in benchmarking (what is being benchmarked?).

      We have revised the statements to avoid confusing the reader.

      • Transcription factors section. This section describes an intended future expansion to MAVISp, not a current feature, and presents no results. As such, it should be moved to the conclusions/future directions section.

      We have removed this section and included a mention in the conclusions as part of the future directions.

      • Figures. The dot-plots generated by the web app, and in Figures 4, 5 and 6 have 2 legends. After looking at a few, it is clear that the lower legend refers to the colour of the variant on the X-axis - most likely referencing the ClinVar effect category. This is not, however, made clear either on the figures or in the app.

      The reviewer’s interpretation of the second legend is correct: it does refer to the ClinVar classification. Nonetheless, we understand that the positioning of the legend made it unclear what the legend refers to. We have revised the captions of the figures in the main text. On the web app, we have changed the location of the figure legend for the ClinVar effect category and added a label to make it clear what the classification refers to.

      • "We identified ten variants reported in ClinVar as VUS (E102K, H86D, T29I, V91I, P2R, L44P, L44F, D56G, R11L, and E25Q, Fig.5a)" E25Q is benign in ClinVar and has had that status since first submitted.

      We have corrected this in the text and the statements related to it.

      Significance

      Platforms that aggregate predictors of variant effect are not a new concept, for example dbNSFP is a database of SNV predictions from variant effect predictors and conservation predictors over the whole human proteome. Predictors such as CADD and PolyPhen-2 will often provide a summary of other predictions (their features) when using their platforms. MAVISp's unique angle on the problem is in the inclusion of diverse predictors from each of its different modules, giving a much wider perspective on variants and potentially allowing the user to identify the mechanistic cause of pathogenicity. The visualisation aspect of the web app is also a useful addition, although the user interface is somewhat awkward. Potentially the most valuable aspect of this study is the associated gitbook resource containing reports from biocurators for proteins that link relevant literature and analyse ClinVar variants. Unfortunately, these are only currently available for a small minority of the total proteins in the database with such reports. For improvement, I think that the paper should focus more on the precise utility of the web app / gitbook reports and how to interpret the results rather than going into detail about the underlying pipeline.

      We appreciate the interest in the gitbook resource, which we also see as very valuable and one of the strengths of our work. We have now implemented a new strategy based on a Python script introduced in the MAVISp toolkit to generate a template Markdown file of the report that can be further customized and imported into GitBook directly (https://github.com/ELELAB/mavisp_accessory_tools/). This should allow us to streamline the production of more reports. We are currently assigning proteins in batches to biocurators for reporting through the mavisp_data_collection GitHub repository to expand their coverage. We also revised the text and added a section on the interpretation of results from MAVISp, with a focus on the utility of the web app and reports.

      In terms of audience, the fast look-up and visualisation aspects of the web-platform are likely to be of interest to clinicians in the interpretation of variants of unknown clinical significance. The ability to download the fully processed dataset on a per-protein basis would be of more interest to researchers focusing on specific proteins or those taking a broader view over multiple proteins (although a facility to download the whole database would be more useful for this final group).

      While our website only displays the dataset per protein, the whole dataset, including all the MAVISp entries, is available at our OSF repository (https://osf.io/ufpzm/), which is cited in the paper and linked on the MAVISp website. We have further modified the MAVISp database to add a link to the repository in the modes page, so that it is more visible.

      My expertise. - I am a protein bioinformatician with a background in variant effect prediction and large-scale data analysis.

      Reviewer #3

      Evidence, reproducibility and clarity:

      Summary:

      The authors present MAVISp, a tool for viewing protein variants heavily based on protein structure information. The authors have done a very impressive amount of curation on various protein targets, and should be commended for their efforts. The tool includes a diverse array of experimental, clinical, and computational data sources that provides value to potential users interested in a given target.

      Major comments:

      Unfortunately I was not able to get the website to work correctly. When selecting a protein target in simple mode, I was greeted with a completely blank page in the app window. In ensemble mode, there was no transition away from the list of targets at all. I'm using Firefox 140.0.2 (64-bit) on Ubuntu 22.04. I would like to explore the data myself and provide feedback on the user experience and utility.

      We tried to reproduce the issue mentioned by the reviewer, using the exact same Ubuntu and Firefox versions, but were unable to do so. The website worked fine for us in that environment. The issue experienced by the reviewer may have been due either to a temporary problem with the web server or to a problem with the specific browser environment they were working in, which we are unable to reproduce. It would be useful to know the date on which this happened, to verify whether downtime on the DTU IT services side made the web server inaccessible.

      I have some serious concerns about the sustainability of the project and think that additional clarifications in the text could help. Currently is there a way to easily update a dataset to add, remove, or update a component (for example, if a new predictor is published, an error is found in a predictor dataset, or a predictor is updated)? If it requires a new round of manual curation for each protein to do this, I am worried that this will not scale and will leave the project with many out of date entries. The diversity of software tools (e.g., three different pipeline frameworks) also seems quite challenging to maintain.

      We appreciate the reviewer’s concerns about long-term sustainability. It is a fair point, which we consider within our steering group, which oversees and plans the activities and meets monthly. Adding entries to MAVISp is moving more and more towards automation as we grow, and we aim to minimize manual work where applicable. Still, expert intervention is needed in some of the steps, and we do not want to give it up. We intend to keep working on MAVISp to make the process of adding and updating entries as automated as possible, and to streamline the process when manual intervention is necessary. Biocurators have three core workflows to use for the default modules, which also automatically cover the source of annotations. We are currently working to streamline the procedures behind LOCAL_INTERACTION, which is the most challenging module. On the data managers' and maintainers' side, we have workflows and protocols that help us with automation, quality control, etc., and we keep working to improve them. Among these, we have workflows for updating old entries. As an example, the update of erroneously attributed RefSeq data (pointed out by reviewer 2) took us only one week overall (from assigning revisions to importing into the database), because we have a reduced version of the Snakemake workflow that can act on only the affected modules. We have also streamlined the generation of the templates for the gitbook reports (see also the answer to reviewer 2).

      The update of old entries is planned and carried out regularly. We also deposit the old datasets on OSF for transparency, in case someone needs to navigate and explore the changes. We have activities planned between May and August every year to update the old entries in relation to changes of protocols in the modules and updates in the core databases that we interact with (COSMIC, ClinVar, etc.). In case of major changes, the update activities continue in the fall. Other revisions can happen outside these time windows if an entry is needed for a specific research project and requires updating.

      Furthermore, the community of people contributing to MAVISp as biocurators or developers is growing and we have scientists contributing from other groups in relation to their research interest. We envision that for this resource to scale up, our team cannot be the only one producing data and depositing it to the database. To facilitate this we launched a pilot for a training event online (see Event page on the website) and we will repeat it once per year. We also organize regular meetings with all the active curators and developers to plan the activities in a sustainable manner and address the challenges we encounter.

      As stated in the manuscript, with the team, automation, and resources that we have currently gathered around this initiative, we can provide updates to the public database every three months, and we have regularly met this schedule. Additionally, we are capable of processing 20 to 40 proteins every month, depending also on the need for revision or expansion of analyses on existing proteins. We also depend on these data for our own research projects and are fully committed to the resource.

      Additionally, we are planning future activities in these directions to improve scale up and sustainability:

      • Streamlining manual steps so that they are as convenient and fast as possible for our curators, e.g. by providing custom pages on the MAVISp website
      • Streamlining and automating the generation of useful output, for instance the reports, by using a combination of simple automation and large language models
      • Implementing ways to share our software and scripts with third parties, for instance by providing ready-made (or close to ready-made) containers or virtual machines
      • For a future version 2, if the database grows in a direction that is not compatible with Streamlit, the web data science framework we are currently using, rewriting the website using a framework that would allow better flexibility and performance, for instance Django with a proper database backend.

      On the same theme, according to the GitHub repository, the program relies on Python 3.9, which reaches end of life in October 2025. It has been tested against Ubuntu 18.04, which left standard support in May 2023. The authors should update the software to more modern versions of Python to promote the long-term health and maintainability of the project.

      We thank the reviewer for this comment - we are aware of the upcoming EOL of Python 3.9. We tested MAVISp, both software package and web server, using Python 3.10 (which is the minimum supported version going forward) and Python 3.13 (which is the latest stable release at the time of writing) and updated the instructions in the README file on the MAVISp GitHub repository accordingly.

      We plan on keeping track of Python and library versions during our testing and updating them when necessary. In the future, we also plan to deploy Continuous Integration with automated testing for our repository, making this process easier and more standardized.

      I appreciate that the authors have made their code and data available. These artifacts should also be versioned and archived in a service like Zenodo, so that researchers who rely on or want to refer to specific versions can do so in their own future publications.

      Since 2024, we have been reporting all previous versions of the dataset on OSF, the repository linked to the MAVISp website, at https://osf.io/ufpzm/files/osfstorage (folder: previous_releases). We prefer to keep everything under OSF, as we also use it to deposit, for example, the MD trajectory data.

      Additionally, in this GitHub page that we use as a space to interact between biocurators, developers, and data managers within the MAVISp community, we also report all the changes in the NEWS space: https://github.com/ELELAB/mavisp_data_collection

      Finally, the individual tools are all available in our GitHub repository, where version control is in place (see Table S1, where we now mapped all the resources used in the framework)

      In the introduction of the paper, the authors conflate the clinical challenges of variant classification with evidence generation and it's quite muddled together. They should strongly consider splitting the first paragraph into two paragraphs - one about challenges in variant classification/clinical genetics/precision oncology and another about variant effect prediction and experimental methods. The authors should also note that there are many predictors other than AlphaMissense, and may want to cite the ClinGen recommendations (PMID: 36413997) in the intro instead.

      We revised the introduction in light of these suggestions. We have split the paragraph as recommended and added a longer second paragraph about VEPs and using structural data in the context of VEPs. We have also added the citation that the reviewer kindly recommended.

      Also in the introduction on lines 21-22 the authors assert that "a mechanistic understanding of variant effects is essential knowledge" for a variety of clinical outcomes. While this is nice, it is clearly not the case as we can classify variants according to the ACMG/AMP guidelines without any notion of specific mechanism (for example, by combining population frequency data, in silico predictor data, and functional assay data). The authors should revise the statement so that it's clear that mechanistic understanding is a worthy aspiration rather than a prerequisite.

      We revised the statement in light of this comment from the reviewer.

      In the structural analysis section (page 5, lines 154-155 and elsewhere), the authors define cutoffs with convenient round numbers. Is there a citation for these values or were these arbitrarily chosen by the authors? I would have liked to see some justification that these assignments are reasonable. Also there seems to be an error in the text where values between -2 and -3 kcal/mol are not assigned to a bin (I assume they should also be uncertain). There are other similar seemingly-arbitrary cutoffs later in the section that should also be explained.

      We have revised the text making the two intervals explicit, for better clarity.

      On page 9, lines 294-298 the authors talk about using the PTEN data from ProteinGym, rather than the actual cutoffs from the paper. They get to the latter later on, but I'm not sure why this isn't first? The ProteinGym cutoffs are somewhat arbitrarily based on the median rather than expert evaluation of the dataset, and I'm not sure why it's even worth mentioning them when proper classifications are available. Regarding PTEN, it would be quite interesting to see a comparison of the VAMP-seq PTEN data and the Mighell phosphatase assay, which is cited on page 9 line 288 but is not actually a VAMP-seq dataset. I think this section could be interesting but it requires some additional attention.

      We have included the data from Mighell’s phosphatase assay, as provided by MaveDB, in the MAVISp database within the experimental_data module for PTEN. We have also revised the case study to include these data and to better explain the decision to support both the ProteinGym and MaveDB classifications in MAVISp (when available). See the revised Figure 3, Table 1 and corresponding text.

      The authors mention "pathogenicity predictors" and otherwise use pathogenicity incorrectly throughout the manuscript. Pathogenicity is a classification for a variant after it has been curated according to a framework like the ACMG/AMP guidelines (Richards 2015 and amendments). A single tool cannot predict or assign pathogenicity - the AlphaMissense paper was wrong to use this nomenclature and these authors should not compound this mistake. These predictors should be referred to as "variant effect predictors" or similar, and they are able to produce evidence towards pathogenicity or benignity but not make pathogenicity calls themselves. For example, in Figure 4e, the terms "pathogenic" and "benign" should only be used here if these are the classifications the authors have derived from ClinVar or a similar source of clinically classified variants.

      The reviewer is correct. We have revised the terminology used in the manuscript and now refer to VEPs (Variant Effect Predictors).

      Minor comments:

      The target selection table on the website needs some kind of text filtering option. It's very tedious to have to find a protein by scrolling through the table rather than typing in the symbol. This will only get worse as more datasets are added.

      We have revised the website, adding a filtering option. In detail, we have refactored the web app by adding filtering functionality, both for the main protein table (that can now be filtered by UniProt AC, gene name, or RefSeq ID) and the mutations table. Doing this required a general overhaul of the table infrastructure (we changed the underlying engine that renders the tables).

      The data sources listed on the data usage section of the website are not concordant with what is in the paper. For example, MaveDB is not listed.

      We have revised and updated the data sources on the website, adding a metadata section with relevant information, including MaveDB references where applicable.

      Figure 2 is somewhat confusing, as it partially interleaves results from two different proteins. This would be nicer as two separate figures, one on each protein, or just of a single protein.

      As suggested by the reviewer, we have now revised the figure and corresponding legends and text, focusing only on one of the two proteins.

      Figure 3 panel b is distractingly large and I wonder if the authors could do a little bit more with this visualization.

      We have revised Figure 3 to address these issues and to integrate new data from the comparison with the phosphatase assay.

      Capitalization is inconsistent throughout the manuscript. For example, page 9 line 288 refers to VampSEQ instead of VAMP-seq (although this is correct elsewhere). MaveDB is referred to as MAVEdb or MAVEDB in various places. AlphaMissense is referred to as Alphamissense in the Figure 5 legend. The authors should make a careful pass through the manuscript to address these kinds of issues.

      We have carefully proofread the paper for these inconsistencies.

      MaveDB has a more recent paper (PMID: 39838450) that should be cited instead of/in addition to Esposito et al.

      We have added the reference that the reviewer recommended.

      On page 11, lines 338-339 the authors mention some interesting proteins including BCL2, which has base editor data available (PMID: 35288574). Are there plans to incorporate this type of functional assay data into MAVISp?

      The assay mentioned in the paper refers to an experimental setup designed to investigate mutations that may confer resistance to the drug venetoclax. We have taken the first steps to implement a MAVISp module aimed at evaluating the impact of mutations on drug binding using alchemical free energy perturbations (ensemble mode), but it is far from complete. We expect to import these data once the module is finalized, since they can be used to benchmark it, and BCL2 is one of the proteins that we are using to develop and test the new module.

      Reviewer #3 (Significance (Required)):

      Significance:

      General assessment:

      This is a nice resource and the authors have clearly put a lot of effort in. They should be celebrated for their achievements in curating the diverse datasets, and the GitBooks are a nice approach. However, I wasn't able to get the website to work and I have raised several issues with the paper itself that I think should be addressed.

      Advance:

      New ways to explore and integrate complex data like protein structures and variant effects are always interesting and welcome. I appreciate the effort towards manual curation of datasets. This work is very similar in theme to existing tools like Genomics 2 Proteins portal (PMID: 38260256) and ProtVar (PMID: 38769064). Unfortunately as I wasn't able to use the site I can't comment further on MAVISp's position in the landscape.

      We have expanded the conclusions section to add a comparison with, and citations of, previously published work, and linked to a review we published last year that frames MAVISp in the context of computational frameworks for the prediction of variant effects. In brief, the Genomics 2 Proteins portal (G2P) includes data from several sources, including some that overlap with MAVISp, such as Phosphosite or MaveDB, as well as features calculated on the protein structure. ProtVar also aggregates mutations from different sources and includes both variant effect predictors and predictions of changes in stability upon mutation, as well as predictions of complex structures. These approaches only partially overlap with MAVISp. G2P is primarily focused on structural and other annotations of the effect of a mutation; it doesn’t include features about changes of stability, binding, or long-range effects, and doesn’t attempt to classify the impact of a mutation according to its measurements. It also doesn’t include information on protein dynamics. Similarly, ProtVar does not include information on binding free energies, long-range effects, or dynamics.

      Audience:

      MAVISp could appeal to a diverse group of researchers who are interested in the biology or biochemistry of proteins that are included, or are interested in protein variants in general either from a computational/machine learning perspective or from a genetics/genomics perspective.

      My expertise:

      I am an expert in high-throughput functional genomics experiments and am an experienced computational biologist with software engineering experience.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #3

      Evidence, reproducibility and clarity

      Summary:

      The authors present MAVISp, a tool for viewing protein variants heavily based on protein structure information. The authors have done a very impressive amount of curation on various protein targets, and should be commended for their efforts. The tool includes a diverse array of experimental, clinical, and computational data sources that provides value to potential users interested in a given target.

      Major comments:

      Unfortunately I was not able to get the website to work properly. When selecting a protein target in simple mode, I was greeted with a completely blank page in the app window, and in ensemble mode, there was no transition away from the list of targets at all. I'm using Firefox 140.0.2 (64-bit) on Ubuntu 22.04. I would have liked to be able to explore the data myself and provide feedback on the user experience and utility.

      I have some serious concerns about the sustainability of the project and think that additional clarifications in the text could help. Currently is there a way to easily update a dataset to add, remove, or update a component (for example, if a new predictor is published, an error is found in a predictor dataset, or a predictor is updated)? If it requires a new round of manual curation for each protein to do this, I am worried that this will not scale and will leave the project with many out of date entries. The diversity of software tools (e.g., three different pipeline frameworks) also seems quite challenging to maintain.

      On the same theme, according to the GitHub repository, the program relies on Python 3.9, which reaches end of life in October 2025. It has been tested against Ubuntu 18.04, which left standard support in May 2023. The authors should update the software to more modern versions of Python to promote the long-term health and maintainability of the project.

      I appreciate that the authors have made their code and data available. These artifacts should also be versioned and archived in a service like Zenodo, so that researchers who rely on or want to refer to specific versions can do so in their own future publications.

      In the introduction of the paper, the authors conflate the clinical challenges of variant classification with evidence generation and it's quite muddled together. They should strongly consider splitting the first paragraph into two paragraphs - one about challenges in variant classification/clinical genetics/precision oncology and another about variant effect prediction and experimental methods. The authors should also note that there are many predictors other than AlphaMissense, and may want to cite the ClinGen recommendations (PMID: 36413997) in the intro instead.

      Also in the introduction on lines 21-22 the authors assert that "a mechanistic understanding of variant effects is essential knowledge" for a variety of clinical outcomes. While this is nice, it is clearly not the case as we are able to classify variants according to the ACMG/AMP guidelines without any notion of specific mechanism (for example, by combining population frequency data, in silico predictor data, and functional assay data). The authors should revise the statement so that it's clear that mechanistic understanding is a worthy aspiration rather than a prerequisite.

      In the structural analysis section (page 5, lines 154-155 and elsewhere), the authors define cutoffs with convenient round numbers. Is there a citation for these values or were these arbitrarily chosen by the authors? I would have liked to see some justification that these assignments are reasonable. Also there seems to be an error in the text where values between -2 and -3 kcal/mol are not assigned to a bin (I assume they should also be uncertain). There are other similar seemingly-arbitrary cutoffs later in the section that should also be explained.

      On page 9, lines 294-298 the authors talk about using the PTEN data from ProteinGym, rather than the actual cutoffs from the paper. They get to the latter later on, but I'm not sure why this isn't first? The ProteinGym cutoffs are somewhat arbitrarily based on the median rather than expert evaluation of the dataset and I'm not sure why it's even worth mentioning them when proper classifications are available. Regarding PTEN, it would be quite interesting to see a comparison of the VAMP-seq PTEN data and the Mighell phosphatase assay, which is cited on page 9 line 288 but is not actually a VAMP-seq dataset. I think this section could be interesting but it requires some additional attention.

      The authors mention "pathogenicity predictors" and otherwise use pathogenicity incorrectly throughout the manuscript. Pathogenicity is a classification for a variant after it has been curated according to a framework like the ACMG/AMP guidelines (Richards 2015 and amendments). A single tool cannot predict or assign pathogenicity - the AlphaMissense paper was wrong to use this nomenclature and these authors should not compound this mistake. These predictors should be referred to as "variant effect predictors" or similar, and they are able to produce evidence towards pathogenicity or benignity but not make pathogenicity calls themselves. For example, in Figure 4e, the terms "pathogenic" and "benign" should only be used here if these are the classifications the authors have derived from ClinVar or a similar source of clinically classified variants.

      Minor comments:

      The target selection table on the website needs some kind of text filtering option. It's very tedious to have to find a protein by scrolling through the table rather than typing in the symbol. This will only get worse as more datasets are added.

      The data sources listed on the data usage section of the website are not concordant with what is in the paper. For example, MaveDB is not listed.

      I found Figure 2 to be a bit confusing in that it partially interleaves results from two different proteins. I think this would be nicer as two separate figures, one on each protein, or just of a single protein.

      Figure 3 panel b is distractingly large and I wonder if the authors could do a little bit more with this visualization.

      Capitalization is inconsistent throughout the manuscript. For example, page 9 line 288 refers to VampSEQ instead of VAMP-seq (although this is correct elsewhere). MaveDB is referred to as MAVEdb or MAVEDB in various places. AlphaMissense is referred to as Alphamissense in the Figure 5 legend. The authors should make a careful pass through the manuscript to address these kinds of issues.

      MaveDB has a more recent paper (PMID: 39838450) that should be cited instead of/in addition to Esposito et al.

      On page 11, lines 338-339 the authors mention some interesting proteins including BCL2, which has base editor data available (PMID: 35288574). Are there plans to incorporate this type of functional assay data into MAVISp?

      Significance

      General assessment:

      This is a nice resource and the authors have clearly put a lot of effort in. They should be celebrated for their achievements in curating the diverse datasets, and the GitBooks are a nice approach. However, I wasn't able to get the website to work and I have raised several issues with the paper itself that I think should be addressed.

      Advance:

      New ways to explore and integrate complex data like protein structures and variant effects are always interesting and welcome. I appreciate the effort towards manual curation of datasets. This work is very similar in theme to existing tools like Genomics 2 Proteins portal (PMID: 38260256) and ProtVar (PMID: 38769064). Unfortunately as I wasn't able to use the site I can't comment further on MAVISp's position in the landscape.

      Audience:

      MAVISp could appeal to a diverse group of researchers who are interested in the biology or biochemistry of proteins that are included, or are interested in protein variants in general either from a computational/machine learning perspective or from a genetics/genomics perspective.

      My expertise:

      I am an expert in high-throughput functional genomics experiments and am an experienced computational biologist with software engineering experience.

    3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #2

      Evidence, reproducibility and clarity

      Summary:

      The authors present a pipeline and platform, MAVISp, for aggregating, displaying and analysing variant effects, with a focus on reclassification of variants of uncertain clinical significance and uncovering the molecular mechanisms underlying the mutations.

      Major comments:

      • On testing the platform, I was unable to look up a specific variant in ADCK1 (rs200211943, R115Q). I found that despite stating that the mapped RefSeq ID was NP_001136017 in the HGVSp column, it was actually mapped to the canonical UniProt sequence (Q86TW2-1). NP_001136017 actually maps to Q86TW2-3, which is missing residues 74-148 compared to the -1 isoform. The UniProt canonical sequence has no exact RefSeq mapping, so the HGVSp column is incorrect in this instance. This mapping issue may also affect other proteins and result in incorrect HGVSp identifiers for variants.
      • The paper lacks a section on how to properly interpret the results of the MAVISp platform (the case-studies are useful, but don't lay down any global rules for interpreting the results). For example: How should a variant with conflicts between the variant impact predictors be interpreted? Are certain indicators considered more 'reliable' than others?
      • In the Methods section, GEMME is stated as being rank-normalised with 0.5 as a threshold for damaging variants. On checking the data downloaded from the site, GEMME was not rank-normalised but rather min-max normalised. Furthermore, Supplementary text S4 conflicts with the methods section over how GEMME scores are classified, S4 states that a raw-value threshold of -3 is used.
      • Note. This is a major comment as one of the claims is that the associated web-tool is user-friendly. While functional, the web app is very awkward to use for analysis on any more than a few variants at once.
        • The fixed window size of the protein table necessitates excessive scrolling to reach your protein-of-interest. This will also get worse as more proteins are added. Suggestion: add a search/filter bar.
        • The same applies to the dataset window.
        • You are unable to copy anything out of the tables.
        • Hyperlinks in the tables only seem to work if you open them in a new tab or window.
        • All entries in the reference column point to the MAVISp preprint even when data from other sources is displayed (e.g. MAVE studies).
        • Entering multiple mutants in the "mutations to be displayed" window is time-consuming for more than a handful of mutants. Suggestion: Add a box where multiple mutants can be pasted in at once from an external document.
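
      The GEMME comment above turns on the difference between rank normalisation and min-max normalisation. The following is a minimal sketch of that difference only, not of the MAVISp pipeline: the function names and the example scores are hypothetical, and neither the direction of the scores nor the 0.5 cut-off is taken from the MAVISp code.

```python
def min_max_normalise(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: no spread to rescale.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rank_normalise(scores):
    """Replace each score by its rank scaled to [0, 1]; tied scores share their mean rank."""
    n = len(scores)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at sorted position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


# Hypothetical raw scores with one extreme value (illustrative only).
raw = [-8.0, -1.0, -0.5, 0.0]
mm = min_max_normalise(raw)
rk = rank_normalise(raw)
print("min-max:", mm)
print("rank:   ", rk)
# The same 0.5 cut-off separates the variants differently under the two schemes.
print("above 0.5 (min-max):", sum(v > 0.5 for v in mm))
print("above 0.5 (rank):   ", sum(v > 0.5 for v in rk))
```

      Because min-max scaling preserves the shape of the distribution while rank scaling flattens it, a single extreme score shifts where a fixed threshold falls, which is why the two descriptions are not interchangeable.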

      Minor comments

      • Grammar. I appreciate that this manuscript may have been compiled by a non-native English speaker, but I would be remiss not to point out that there are numerous grammar errors throughout, usually sentence order issues or non-pluralisation. The meaning of the authors is mostly clear, but I recommend very thoroughly proof-reading the final version.
      • There are numerous proteins that I know have high-quality MAVE datasets that are absent in the database e.g. BRCA1, HRAS and PPARG.
      • Checking one of the existing MAVE datasets (KRAS), I found that the variants were annotated as damaging, neutral or given a positive score (these appear to stand-in for gain-of-function variants). For better correspondence with the other columns, those with positive scores could be labelled as 'ambiguous' or 'uncertain'.
      • Numerous thresholds are defined for stabilizing / destabilizing / neutral variants in both the STABILITY and the LOCAL_INTERACTION modules. How were these thresholds determined? I note that (PMC9795540) uses a ΔΔG threshold of 1/-1 for defining stabilizing and destabilizing variants, which is relatively standard (though they also say that 2-3 would likely be better for pinpointing pathogenic variants).
      • "Overall, with the examples in this section, we illustrate different applications of the MAVISp results, spanning from benchmarking purposes, using the experimental data to link predicted functional effects with structural mechanisms or using experimental data to validate the predictions from the MAVISp modules."

      The last of these points is not an application of MAVISp, but rather a way in which external data can help validate MAVISp results. Furthermore, none of the examples given demonstrate an application in benchmarking (what is being benchmarked?).
      • Transcription factors section. This section describes an intended future expansion to MAVISp, not a current feature, and presents no results. As such, it should probably be moved to the conclusions/future directions section.
      • Figures. The dot-plots generated by the web app, and in Figures 4, 5 and 6 have 2 legends. After looking at a few, it is clear that the lower legend refers to the colour of the variant on the X-axis - most likely referencing the ClinVar effect category. This is not, however, made clear either on the figures or in the app.
      • "We identified ten variants reported in ClinVar as VUS (E102K, H86D, T29I, V91I, P2R, L44P, L44F, D56G, R11L, and E25Q, Fig.5a)"

      E25Q is benign in ClinVar and has had that status since first submitted.

      Significance

      Platforms that aggregate predictors of variant effect are not a new concept, for example dbNSFP is a database of SNV predictions from variant effect predictors and conservation predictors over the whole human proteome. Predictors such as CADD and PolyPhen-2 will often provide a summary of other predictions (their features) when using their platforms. MAVISp's unique angle on the problem is in the inclusion of diverse predictors from each of its different modules, giving a much wider perspective on variants and potentially allowing the user to identify the mechanistic cause of pathogenicity. The visualisation aspect of the web app is also a useful addition, although the user interface is somewhat awkward. Potentially the most valuable aspect of this study is the associated gitbook resource containing reports from biocurators for proteins that link relevant literature and analyse ClinVar variants. Unfortunately, these are only currently available for a small minority of the total proteins in the database with such reports.

      For improvement, I think that the paper should focus more on the precise utility of the web app / gitbook reports and how to interpret the results rather than going into detail about the underlying pipeline.

      In terms of audience, the fast look-up and visualisation aspects of the web-platform are likely to be of interest to clinicians in the interpretation of variants of unknown clinical significance. The ability to download the fully processed dataset on a per-protein basis would be of more interest to researchers focusing on specific proteins or those taking a broader view over multiple proteins (although a facility to download the whole database would be more useful for this final group).

      My expertise.

      • I am a protein bioinformatician with a background in variant effect prediction and large-scale data analysis.
  2. Oct 2025
    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This study explores chromatin organization around trans-splicing acceptor sites (TASs) in the trypanosomatid parasites Trypanosoma cruzi, T. brucei and Leishmania major. By systematically re-analyzing MNase-seq and MNase-ChIP-seq datasets, the authors conclude that TASs are protected by an MNase-sensitive complex that is, at least in part, histone-based, and that single-copy and multi-copy genes display differential chromatin accessibility. Altogether, the data suggest a common chromatin landscape at TASs and imply that chromatin may modulate transcript maturation, adding a new regulatory layer to an unusual gene-expression system.

      I value integrative studies of this kind and appreciate the careful, consistent data analysis the authors implemented to extract novel insights. That said, several aspects require clarification or revision before the conclusions can be robustly supported. My main concerns are listed below, organized by topic/result section.

      TAS prediction

      * Why were TAS predictions derived only from insect-stage RNA-seq data? Restricting TAS calls to one life stage risks biasing predictions toward transcripts that are highly expressed in that stage and may reduce annotation accuracy for lowly expressed or stage-specific genes. Please justify this choice and, if possible, evaluate TAS robustness using additional transcriptomes or explicitly state the limitation.

      TAS predictions were derived only from insect-stage RNA-seq data because a previous study showed that there are no significant differences in 5’ UTR processing between T. cruzi life stages (https://doi.org/10.3389/fgene.2020.00166). We are not testing an additional transcriptome here because the robustness of the software was already demonstrated in the original article where UTRme was described (Radio S, 2018, doi:10.3389/fgene.2018.00671).

      Results - "There is a distinctive average nucleosome arrangement at the TASs in TriTryps": * You state that "In the case of L. major the samples are less digested." However, Supplementary Fig. S1 suggests that replicate 1 of L. major is less digested than the T. brucei samples, while replicate 2 of L. major looks similarly digested. Please clarify which replicates you reference and correct the statement if needed.

      The reviewer has a good point. We based our statement on the value of the maximum peak of the sequenced DNA molecules, which in general is a good indicator of the extent of digestion achieved by the sample (Cole H, NAR, 2011).

      As the reviewer correctly points out, we should have also considered the length of the DNA molecules in each percentile. However, in this case both the T. brucei and the L. major samples were gel purified before sequencing, and it is hard to know exactly which fragments were left behind in each case. Therefore, it is better not to over-conclude in that regard.

      We have now commented on this in the main manuscript, and we have clarified in the figure legends which data set we used in each case.

      * It appears you plot one replicate in Fig. 1b and the other in Suppl. Fig. S2. Please indicate explicitly which replicate is in each plot. For T. brucei, the NDR upstream of the TAS is clearer in Suppl. Fig. S2 while the TAS protection is less prominent; based on your digestion argument, this should correspond to the more-digested replicate. Please confirm.

      The replicates used for the construction of each figure are explicitly indicated in Table S1. Although the table details the original publication, the project and the accession number for each data set, the reviewer is correct that it was still not completely clear which length distribution heatmap each sample was associated with. To avoid this confusion, we have now added the accession number for each data set to the figure legends and also clarified Table S1. Regarding the reviewer’s comment on the correspondence between the observed TAS protection and the extent of sample digestion, he/she is correct that for a more digested sample we would expect a clearer NDR. In this case, the difference in the extent of digestion between these two samples is minor, as the length of the main peak in the length distribution histogram for sequenced DNA molecules is the same. These two samples, GSM5363006 (represented in Fig. 1b) and GSM5363007 (represented in Suppl. Fig. S2), belong to the same original paper (Maree et al 2017), and both were gel purified before sequencing. Therefore, any difference between them could result not only from a minor difference in the digestion level achieved in each experiment but could also be biased by the fragments included or excluded during gel purification. We would therefore not over-conclude about TAS protection from this comparison. We have now included a brief comment on this in the figure discussion.

      * The protected region around the TAS appears centered on the TAS in T. brucei but upstream in L. major. This is an interesting difference. If it is technical (different digestion or TAS prediction offset), explain why; if likely biological, discuss possible mechanisms and implications.

      We appreciate the reviewer's suggestion. We cannot be sure whether this is due to technical or biological reasons, but there is evidence that the L. major genome has a different dinucleotide content, which might have an impact on nucleosome assembly. We have now added a comment about this observation in the final discussion of the manuscript.

      Results - "An MNase sensitive complex occupies the TASs in T. brucei": * The definition of "MNase activity" and the ordering of samples into Low/Intermediate/High digestion are unclear. Did you infer digestion levels from fragment distributions rather than from controlled experimental timepoints? In Suppl. Fig. S3a it is not obvious how "Low digestion" was defined; that sample's fragment distribution appears intermediate. Please provide objective metrics (e.g., median fragment length, fraction 120-180 bp) used to classify digestion levels.

      As the reviewer suggests, the ideal experiment would be to perform a time course of the MNase reaction with all the samples in parallel, or to work with a fixed time point while adding increasing amounts of MNase. However, even with controlled experimental timepoints, the length distribution histogram of sequenced DNA molecules must be checked to be sure which level of digestion has been achieved.

      In this particular case, we used publicly available data sets for this analysis. We made an arbitrary definition of low, intermediate and high levels of digestion, not as an absolute level of digestion, but as a comparative output among the tested samples. We based our definition on the comparison of the main peak in length distribution heatmaps because this parameter is the best metric to estimate the level of digestion of a given sample. It represents the percentage of the total DNA sequenced that contains the predominant length in the sample tested. Hence, we considered:

      low digestion: when the main peak is longer than the expected protection for a nucleosome (longer than 150 bp). We expect this sample to contain additional longer bands that correspond to less digested material.

      intermediate digestion: when the main peak is as expected for the nucleosome core protection (~146-150 bp).

      high digestion: when the main peak is shorter than that (shorter than 146 bp). This case is normally accompanied by a greater dispersion in fragment sizes.
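
      As a rough illustration of the three categories above, the sketch below summarises a list of fragment lengths and classifies a sample by its main peak. It is not the code used in the manuscript: the function names are hypothetical, the 146 and 150 bp bounds are taken from the definitions above, and the median and the 120-180 bp fraction are included only because the reviewer named them as possible objective metrics.

```python
from collections import Counter
from statistics import median


def digestion_metrics(fragment_lengths):
    """Summarise sequenced fragment lengths (in bp) for one sample."""
    counts = Counter(fragment_lengths)
    total = len(fragment_lengths)
    # Main peak: the most frequent length; ties go to the shorter length.
    main_peak = min(counts, key=lambda length: (-counts[length], length))
    in_window = sum(1 for x in fragment_lengths if 120 <= x <= 180)
    return {
        "main_peak_bp": main_peak,
        "main_peak_fraction": counts[main_peak] / total,
        "median_bp": median(fragment_lengths),
        "fraction_120_180": in_window / total,
    }


def classify_digestion(main_peak_bp, core_min=146, core_max=150):
    """Relative digestion level from the main peak, using the bounds stated above."""
    if main_peak_bp > core_max:
        return "low"
    if main_peak_bp < core_min:
        return "high"
    return "intermediate"


# Toy fragment lengths (illustrative only, not real data).
lengths = [147] * 5 + [160] * 2 + [100]
metrics = digestion_metrics(lengths)
print(metrics)
print(classify_digestion(metrics["main_peak_bp"]))
```

      Reporting the main-peak fraction alongside the label makes it easier to see how dominant the peak is, which matters when samples were gel purified and the tails of the distribution are incomplete.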

      To do this analysis, we chose samples that render different MNase protection of the TAS when plotting all the sequenced DNA molecules relative to this point, and we used this protection as a predictor of the extent of sample digestion (Figure 2). To corroborate our hypothesis that the degree of TAS protection was indeed related to the extent of MNase digestion of a given sample, we looked at the length distribution histogram of the sequenced DNA molecules in each case. It is the best measurement of the extent of digestion achieved, especially when sequencing the whole sample without any gel purification and representing all the reads in the analysis, as we did. The only caveat concerns the sample called “intermediate digestion 1”, which belongs to the original work of Mareé 2017, since only this data set was gel purified.

      The sample used in Figure 1 (from Maree 2017) is also from the same lab and is an MNase-seq data set. Strictly speaking, there is no methodological difference between MNase-seq and the input of a native MNase-ChIP-seq, since the input does not undergo the IP.

      * Several fragment distributions show a sharp cutoff at ~100-125 bp. Was this due to gel purification or bioinformatic filtering? State this clearly in Methods. If gel purification occurred, that can explain why some datasets preserve the MNase-sensitive region.

      The sharp cutoff is due neither to gel purification nor to bioinformatic filtering; it simply reflects the length of the paired-end reads used in each case. In earlier works it was most common to sequence only 50 bp; as technologies improved, read lengths went up to 75, 100 or 125 bp. We have now specified in Table S1 the length of the paired-end reads used in each case, when available.

      * Please reconcile cases where samples labeled as more-digested contain a larger proportion of >200 bp fragments than supposedly less-digested samples; this ordering affects the inference that digestion level determines the loss/preservation of TAS protection. Based on the distributions I see, "Intermediate digestion 1" appears most consistent with an expected MNase curve - please confirm and correct the manuscript accordingly.

      As explained above, it's a common observation in MNase digestion of chromatin that more extensive digestion can still result in a broad range of fragment sizes, including some longer fragments. This seemingly counter-intuitive result is primarily due to the non-uniform accessibility of chromatin and the sequence preference of the MNase enzyme, which has a preference for AT-rich sequences.

      The rationale is as follows: when you digest chromatin with MNase with the aim of mapping nucleosomes genome-wide, the ideal situation would be to recover all the material in the mononucleosome band. However, MNase digests protected DNA less efficiently, yet if the reaction proceeds further it always ends up destroying part of it, so the result is always far from perfect. The best we can achieve is a sample where ~80% of the material is contained in the mononucleosome band. __And here comes the main point:__ even in the best scenario, you always get some additional longer bands, such as those for di- or tri-nucleosomes. If you keep digesting, you will get less than 80% in the nucleosome band, and the remaining DNA fragments that used to contain di- and tri-nucleosomes start getting digested as well, producing a bigger dispersion in fragment sizes.

      How do we explain the persistence of long fragments? The longest fragments (di-, tri-nucleosomes) that persist in a highly digested sample are the ones that were originally most highly protected by proteins or higher-order structure, or that have a low AT content, making their linker DNA extremely resistant to initial cleavage. Once the majority of the genome is fragmented, these few resistant longer fragments become a more visible component of the remaining population, contributing to a broader size dispersion. Hence, you end up observing a bigger dispersion in length distributions in the final material. Bottom line: it is not good practice to work with under- or over-digested samples. Our main point is to emphasize that, especially when comparing samples, it is important to compare those with comparable levels of digestion. Otherwise, a different sampling of the genome will be represented in the remaining sequenced DNA.
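
      As a rough way to put numbers on the two quantities discussed here (the share of material in the mononucleosome band and the dispersion of fragment sizes), one could summarize each sample as in the sketch below. The 120-180 bp window for the mononucleosome band, the use of the standard deviation as the dispersion measure and the toy samples are all assumptions made for illustration; they are not taken from the authors' pipeline.

```python
from statistics import pstdev


def mono_fraction(lengths, lo=120, hi=180):
    """Fraction of fragments whose length falls in [lo, hi) bp.

    The 120-180 bp window is an assumed definition of the
    mononucleosome band.
    """
    return sum(lo <= x < hi for x in lengths) / len(lengths)


def summarize(lengths):
    """Summarize a sample by mononucleosome share and size dispersion."""
    return {
        "mono_fraction": mono_fraction(lengths),
        # Population standard deviation, used here as a simple
        # (assumed) measure of dispersion in fragment sizes.
        "dispersion_bp": pstdev(lengths),
    }


# Made-up samples, for illustration only.
well_digested = [147] * 80 + [310] * 15 + [470] * 5
over_digested = [147] * 50 + [110] * 15 + [80] * 15 + [300] * 10 + [450] * 10

for name, sample in [("well", well_digested), ("over", over_digested)]:
    print(name, summarize(sample))
```

      Comparing these two summaries between samples is one simple check that the samples being contrasted were digested to a comparable extent.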

      Results - "The MNase sensitive complexes protecting the TASs in T. brucei and T. cruzi are at least partly composed of histones": * The evidence that histones are part of the MNase-sensitive complex relies on H3 MNase-ChIP signal in subnucleosomal fragment bins. This seems to conflict with the observation (Fig. 1) that fragments protecting TASs are often nucleosome-sized. Please reconcile these points: are H3 signals confined to subnucleosomal fragments flanking the TAS while the TAS itself is depleted of H3? Provide plots that compare MNase-seq and H3 ChIP signals stratified by consistent fragment-size bins to clarify this.

      What we learned from other, deeply studied eukaryotic organisms, such as yeast, is that NDRs are normally generated at regulatory points in the genome. In this sense, yeast tRNA genes carry a complex formed by TFIIIC-TFIIIB with a bootprint smaller than a nucleosome (Nagarajavel, doi: 10.1093/nar/gkt611). On the other hand, many promoter regions have an MNase-sensitive complex with a nucleosome-size footprint that does not contain histones (Chereji et al. 2017, doi:10.1016/j.molcel.2016.12.009). The reviewer is right that in Figures 1 and S2 the footprint of whatever occupies the TAS region, especially in T. brucei, is nucleosome-size. However, this only shows the size of the complex; it does not prove the nature of its components. Those are only MNase-seq data sets: since they do not include a precipitation with specific antibodies, we cannot confirm that the protecting complex is made up of histones. In parallel, a complementary study by Wedel 2017, from Siegel’s lab, shows that, using a properly digested sample and further immunoprecipitating with an anti-H3 antibody, the TAS is not protected by nucleosomes, at least not when analyzing nucleosome-size DNA molecules. Besides, Briggs et al. 2018 (doi: 10.1093/nar/gky928) showed that, at least at intergenic regions, H3 occupancy goes down while R-loop accumulation increases. We have now added a supplemental figure associated with Figure 3 (new Supplemental 5) replotting R-loops and MNase-ChIP-seq for H3 relative to our predicted TAS, showing this anti-correlation and how it partly correlates with MNase protection as well. As a control, we show that the Rpb9 trend resembles that of H3, as Siegel’s lab has shown in Wedel 2018.

      * Please indicate which datasets are used for each panel in Suppl. Fig. S4 (e.g., Wedel et al., Maree et al.), and avoid calling data from different labs "replicates" unless they are true replicates.

      In most of our analyses we used true replicate experiments. Such is the case for the MNase-seq data used in Figure 1, with the corresponding replicate experiments used in Figure S2, and for the T. cruzi MNase-ChIP-seq data used in Figures 3b and 4a, with the respective replicates used in Figures S4 and S5 (now S6 in the revised manuscript). The only case in which we used experiments coming from two different laboratories is the MNase-ChIP-seq for H3 from T. brucei. Unfortunately, there are only two public data sets, each coming from a different laboratory: the samples used in Fig 3 come from Siegel’s lab, whereas the H3 IP represented in S4 and S5 (S6 in the updated version) comes from another lab (Patterton’s). To be more rigorous, we now call them data 1 and 2 when comparing these particular cases.

      The reviewer is right that in this particular case one is native chromatin (Patterton’s) while the other one is crosslinked (Siegel’s). We have now clarified in the main text that unfortunately we do not have a true replicate, but under both conditions the result remains the same. This is compatible with my own experience, where crosslinking does not affect global nucleosome patterns (compare nucleosome organization from crosslinked chromatin MNase-seq inputs in Chereji, Mol Cell, 2017, doi: 10.1016/j.molcel.2016.12.009, with native MNase-seq from Ocampo, NAR, 2016, doi: 10.1093/nar/gkw068).

      * Several datasets show a sharp lower bound on fragment size in the subnucleosomal range (e.g., ~80-100 bp). Is this a filtering artifact or a gel-size selection? Clarify in Methods and, if this is an artifact, consider replotting after removing the cutoff.

      We have only filtered out adapter dimers or overrepresented sequences when needed. In Figures 2 and S3 we represented all the sequenced reads. In other figures, when we sort fragment sizes in silico (such as nucleosome, dinucleosome or subnucleosome size ranges), we make a note in the figure legends. What the reviewer points out is related to the length of the sequenced DNA fragments in each experiment. As we explained above, the older data sets were generated with 50 bp paired-end reads, while the newer ones use 75, 100 or 125 bp. This information is now clarified in Table S1.
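
      For readers unfamiliar with in silico size sorting, the idea can be sketched as follows. The size ranges are the ones Reviewer #2 quotes from the manuscript (50-120 bp, 120-180 bp and 180-300 bp); the function names, the half-open treatment of the boundaries and the fragment tuple layout are assumptions for illustration, not the authors' implementation.

```python
# Size ranges in bp as quoted from the manuscript; treating each range
# as half-open [lo, hi) is an assumption so that bins do not overlap.
SIZE_BINS = {
    "subnucleosome": (50, 120),
    "nucleosome": (120, 180),
    "longer": (180, 300),
}


def size_class(length):
    """Return the name of the size bin for a fragment length, or None."""
    for name, (lo, hi) in SIZE_BINS.items():
        if lo <= length < hi:
            return name
    return None


def split_by_size(fragments):
    """Group fragments by size bin.

    fragments: iterable of (chrom, start, end) tuples with half-open
    coordinates, so the fragment length is end - start.
    """
    out = {name: [] for name in SIZE_BINS}
    for chrom, start, end in fragments:
        name = size_class(end - start)
        if name is not None:
            out[name].append((chrom, start, end))
    return out
```

      Each resulting subset can then be plotted separately relative to the TAS, which is what allows a signal confined to one size class to become visible.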

      __Results - "The TASs of single and multi-copy genes are differentially protected by nucleosomes":__

      * Please include T. brucei RNA-seq data in Suppl. Fig. S5b as you did for T. cruzi.

      We have shown chromatin organization for T. brucei in S5b to show that there is a similar trend. Unfortunately, we did not obtain a list of multi-copy genes for T. brucei as robust as the one we obtained for T. cruzi; therefore, we do not want to overinterpret by showing the RNA-seq for these subsets of genes. The limitation is related to the fact that UTRme restricts the search and is extremely strict when calling sites at repetitive regions.

      * Discuss how low or absent expression of multigene families affects TAS annotation (which relies on RNA-seq) and whether annotation inaccuracies could bias the observed chromatin differences.

      The mapping of occurrence and annotations that belong to repetitive regions has great complexity. UTRme is specially designed to avoid overcalling those sites. In other words, there is a chance that we could be underestimating the number of predicted TASs at multi-copy genes. Regarding the impact on chromatin analysis, we cannot rule out that it might have an impact, but the observation favors our conclusion, since even when some TASs at multi-copy genes can remain elusive, we observe more nucleosome density at those places.

      * The statement that multi-copy genes show an "oscillation" between AT and GC dinucleotides is not clearly supported: the multi-copy average appears noisier and is based on fewer loci. Please tone down this claim or provide statistical support that the pattern is periodic rather than noisy.

      We have fixed this now in the preliminary revised version

      * How were multi-copy genes defined in T. brucei? Include the classification method in Methods.

      This classification was done the same way it was explained for T. cruzi

      Genomes and annotations: * If transcriptomic data for the Y strain was used for T. cruzi, please explain why a Y strain genome was not used (e.g., Wang et al. 2021 GCA_015033655.1), or justify the choice. For T. brucei, consider the more recent Lister 427 assembly (Tb427_2018) from TriTrypDB. Use strain-matched genomes and transcriptomes when possible, or discuss limitations.

      The most appropriate way to analyze high-throughput data is to align it to the genome of the strain in which the experiments were conducted. This was clearly illustrated in a previous publication from our group, where we explained how data from the hybrid CL Brener strain should be analyzed. A common practice in the past was to use only the Esmeraldo-like genome for simplicity, but this resulted in output artifacts. Therefore, we aligned the data to the CL Brener genome and then focused the main analysis on the Esmeraldo haplotype (Beati, PLoS ONE, 2023). Ideally, we would have had transcriptomic data for the same strain (CL Brener or Esmeraldo). Since this was not the case at that moment, we used data from the Y strain, which belongs to the same DTU as Esmeraldo.

      In the case of T. brucei, when we started our analysis and the software code for UTRme was written, only the previous version of the genome was available. When the 2018 version came out, we checked chromatin parameters and observed that it did not change the main observations. Therefore, we continued working with our previous setup.

      Reproducibility and broader integration: * Please share the full analysis pipeline (ideally on GitHub/Zenodo) so the results are reproducible from raw reads to plots.

      We are preparing a full pipeline in GitHub. We will make it available before manuscript full revision

      * As an optional but helpful expansion, consider including additional datasets (other life stages, BSF MNase-seq, ATAC-seq, DRIP-seq) where available to strengthen comparative claims.

      We are now including a new supplemental figure with DRIP-seq and Rpb9 ChIP-seq (revised S5). Additionally, we added a new panel c to Figure 4, representing FAIRE-seq data for T. cruzi for single- and multi-copy genes.

      We are working on ATAC-seq analysis and BSF MNase-seq

      Optional analyses that would strengthen the study: * Stratify single-copy genes by expression (high / medium / low) and examine average nucleosome occupancy at TASs for each group; a correlation between expression and NDR depth would strengthen the functional link to maturation.

      We have now included a panel in Supplemental Figure 5 (now revised S6), showing the concordance of chromatin organization relative to the TAS for genes stratified by RNA-seq levels.
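
      The stratification the reviewer asks for could look like the sketch below. It is a minimal illustration only: splitting genes into equal-sized tertiles, the function names, and the use of a single occupancy value per gene at the TAS are all assumptions and may differ from how the revised figure was actually built.

```python
def stratify_by_expression(expr):
    """Label each gene 'low', 'medium' or 'high' by expression tertile.

    expr: dict mapping gene id -> expression value (e.g. from RNA-seq).
    Genes are ranked by expression (gene id breaks ties) and split into
    three groups of roughly equal size; this tertile rule is an assumption.
    """
    ranked = sorted(expr, key=lambda g: (expr[g], g))
    n = len(ranked)
    names = ("low", "medium", "high")
    return {gene: names[min(3 * i // n, 2)] for i, gene in enumerate(ranked)}


def mean_occupancy_by_group(labels, occupancy_at_tas):
    """Average a per-gene occupancy value at the TAS within each group."""
    sums, counts = {}, {}
    for gene, group in labels.items():
        if gene in occupancy_at_tas:
            sums[group] = sums.get(group, 0.0) + occupancy_at_tas[gene]
            counts[group] = counts.get(group, 0) + 1
    return {group: sums[group] / counts[group] for group in sums}
```

      A deeper NDR in the highly expressed group than in the lowly expressed one would be the kind of trend the reviewer describes as strengthening the functional link.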

      __Minor / editorial comments:__ * In the Introduction, the sentence "transcription is initiated from dispersed promoters and in general they coincide with divergent strand switch regions" should be qualified: such initiation sites also include single transcription start regions.

      We have clarified this in the preliminary revised version

      * Define the dotted line in length distribution plots (if it is not the median, please clarify) and consider placing it at 147 bp across plots to ease comparison.

      The dotted line is just to indicate where the maximum peak is located. It is now clarified in figure legends.

      * In Suppl. Fig. 4b "Replicate2" the x-axis ticks are misaligned with labels - please fix.

      We have now fixed the figure. Thanks for noticing this mistake.

      * Typo in the Introduction: "remodellingremodeling" → "remodeling"

      Thanks for noticing this mistake, it is fixed in the current version of the manuscript

      **Referee cross-commenting** Comment 1: I think Reviewer #2 and Reviewer #3 missed that the authors of this manuscript do cite and consider the results from Wedel et al. 2017. They even re-analysed their data (e.g. Figure 3a). I second Reviewer #2's comment indicating that the inclusion of a schematic figure to help readers visualize and better understand the findings would be an important addition.

      Comment 2: I agree with Reviewer #3 that the use of different MNase digestion procedures in the different datasets has to be considered. On the other hand, I don't think there is a problem with figure 1 showing an MNase-protected TAS for T. brucei, as it is based on MNase-seq data and reproduces the reported results (Maree et al. 2017). What the Siegel lab did in Wedel et al. 2017 was MNase-ChIP-seq of H3 showing nucleosome depletion at TAS, but the two results are not necessarily contradictory: there could still be something else (which does not contain H3) sitting on the TAS protecting it from MNase digestion.

      Reviewer #1 (Significance (Required)):

      This study provides a systematic comparative analysis of chromatin landscapes at trans-splicing acceptor sites (TASs) in trypanosomatids, an area that has been relatively underexplored. By re-analyzing and harmonizing existing MNase-seq and MNase-ChIP-seq datasets, the authors highlight conserved and divergent features of nucleosome occupancy around TASs and propose that chromatin contributes to the fidelity of transcript maturation. The significance lies in three aspects: 1. Conceptual advance: It broadens our understanding of gene regulation in organisms where transcription initiation is unusual and largely constitutive, suggesting that chromatin can still modulate post-transcriptional processes such as trans-splicing. 2. Integrative perspective: Bringing together data from T. cruzi, T. brucei and L. major provides a comparative framework that may inspire further mechanistic studies across kinetoplastids. 3. Hypothesis generation: The findings open testable avenues about the role of chromatin in coordinating transcript maturation, the contribution of DNA sequence composition, and potential interactions with R-loops or RNA-binding proteins. Researchers in parasitology, chromatin biology, and RNA processing will find it a useful resource and a stimulus for targeted experimental follow-up.

      My expertise is in gene regulation in eukaryotic parasites, with a focus on bioinformatic analysis of high-throughput sequencing data

      __Reviewer #2 (Evidence, reproducibility and clarity (Required)):__

      Siri et al. perform a comparative analysis using publicly available MNase-seq data from three trypanosomatids (T. brucei, T. cruzi, and Leishmania), showing that a similar chromatin profile is observed at TAS (trans-splicing acceptor site) regions. The original studies had already demonstrated that the nucleosome profile at TAS differs from the rest of the genome; however, this work fills an important gap in the literature by providing the most reliable cross-species comparison of nucleosome profiles among the tritryps. To achieve this, the authors applied the same computational analysis pipeline and carefully evaluated MNase digestion levels, which are known to influence nucleosome profiling outcomes.

      In my view, the main conclusion is that the profiles are indeed similar-even when comparing T. brucei and T. cruzi. This was not clear in previous studies (and even appeared contradictory, reporting nucleosome depletion versus enrichment) largely due to differences in chromatin digestion across these organisms. The manuscript could be improved with some clarifications and adjustments:

      1. The authors state from the beginning that available MNase data indicate altered nucleosome occupancy around the TAS. However, they could also emphasize that the conclusions across the different trypanosomatids are inconsistent and even contradictory: NDR in T. cruzi versus protection-in different locations-in T. brucei and Leishmania.

      We start our manuscript by referring to the first MNase-seq data sets publicly available for each TriTryp, and we point out that one of the main observations in each of them is a change in nucleosome density or occupancy at intergenic regions. For T. cruzi, in a previous publication from our group, we established that this intergenic drop in nucleosome density occurs near the trans-splicing acceptor site. In this work, we extend our study to the other members of the TriTryps: T. brucei and L. major.

      In T. brucei, the papers from Patterton’s lab and Siegel’s lab came out almost simultaneously in 2017. Hence, they do not comment on each other’s work. The first one claims the presence of a well-positioned nucleosome at the TAS by using MNase-seq, while the second one shows an NDR at the TAS by using MNase-ChIP-seq. However, we do not think they are contradictory or inconsistent. We brought them together throughout the manuscript because we think these works provide complementary information.

      On the one hand, we infer that the data from Patterton’s lab are slightly less digested than the sample from Siegel’s lab. Therefore, we argue that this moderate digestion must be the reason why they managed to detect a complex protecting the TAS from MNase (Figure 1). On the other hand, Siegel’s lab includes an additional step by performing MNase-ChIP-seq, showing that, when analyzing nucleosome-size fragments, histones are not detected at the TAS. Here, we take this analysis further in Figure 3, showing that only when looking at subnucleosome-size fragments are we able to detect histone H3. This is also true for T. cruzi.

      By integrating every analysis in this work and the previous ones, we propose that TASs are protected by an MNase-sensitive complex (probed in Figure 2). This complex is most likely only partly formed by histones, since we can detect histone H3 only when analyzing subnucleosome-size DNA molecules (Figure 3). To be absolutely sure that the complex is not entirely made up of histones, future studies should perform MNase-ChIP-seq with less digested samples. However, it was previously shown that R-loops are enriched at those intergenic NDRs (Briggs, 2018, doi: 10.1093/nar/gky928) and that R-loops have plenty of interacting proteins (Girasol, 2023, doi: 10.1093/nar/gkad836). Therefore, this MNase-sensitive complex most likely has a hybrid nature, made up of H3 and some other regulatory molecules, possibly involved in trans-splicing. We have now added a new Figure S5 showing R-loop co-localization with the NDR.

      Regarding the comparison between different organisms, after explaining the sensitivity to MNase of the TAS protecting complex, we discuss that when comparing equally digested samples T. cruzi and T. brucei display a similar chromatin landscape with a mild NDR at the TAS (See T. cruzi represented in Figure 1 compared to T. brucei represented in Intermediate digestion 2 in Figure 2, intermediate digestion in the revised manuscript). Unfortunately, we cannot make a good comparison with L. major, since we do not count on a similar level of digestion.

      Another point that requires clarification concerns what the authors mean in the introduction and discussion when they write that trypanosomes have "...poorly organized chromatin with nucleosomes that are not strikingly positioned or phased." On the other hand, they also cite evidence of organization: "...well-positioned nucleosome at the spliced-out region.. in Leishmania (ref 34)"; "...a well-positioned nucleosome at the TASs for internal genes (ref37)"; "...a nucleosome depletion was observed upstream of every gene (ref 35)." Aren't these examples of organized chromatin with at least a few phased nucleosomes? In addition, in ref 37, figure 4 shows at least two (possibly three to four) nucleosomes that appear phased. In my opinion, the authors should first define more precisely what they mean by "poorly organized chromatin" and clarify that this interpretation does not contradict the findings highlighted in the cited literature.

      For a better understanding of nucleosome positioning and phasing, I recommend the review by Clark 2010 (doi:10.1080/073911010010524945), Figure 4. Briefly, in a cell population there are different alternative positions that a given nucleosome can adopt, but some are more favorable than others. By favorable positions we mean the coordinates in the genome that are most likely covered by a nucleosome and are predominant in the cell population. Additionally, nucleosomes can be phased or not. Phasing refers not only to the position in the genome, but also to the distance relative to a given point. In yeast, or in highly transcribed genes of more complex eukaryotes, nucleosomes are regularly spaced and phased relative to the transcription start site (TSS) or to the +1 nucleosome (Ocampo, NAR, 2016, doi:10.1093/nar/gkw068). In trypanosomes, nucleosomes show some regular distribution on browser inspection but, given that they are not properly phased with respect to any point, it is almost impossible to estimate spacing from paired-end data. This is also consistent with a chromatin that is transcribed in an almost constitutive manner.

      As the reviewer mentions, we do cite evidence of organization. We think the original observations are correct, but we do not fully agree with some of the original statements. In this manuscript, our aim is to take the best of what we learned from those original works and to make a constructive contribution to the original discussions. In this regard, in trypanosomes there are some conserved patterns in the chromatin landscape, but their nucleosomes are far from being well-positioned or phased. For a better understanding, compare the variation along the y axis when representing average nucleosome occupancy in yeast with that observed in trypanosomes: the troughs and peaks are much more prominent in yeast than in any TriTryp member.

      Following the reviewer’s suggestion we have now clarified this in the main text

      The paper would also benefit from the inclusion of a schematic figure to help readers visualize and better understand the findings. What is the biological impact of having nucleosomes, di-nucleosomes, or sub-nucleosomes at TAS? This is not obvious to readers outside the chromatin field. For example, the following statement is not intuitive: "We observed that, when analyzing nucleosome-size (120-180 bp) DNA molecules or longer fragments (180-300 bp), the TASs of either T. cruzi or T. brucei are mostly nucleosome-depleted. However, when representing fragments smaller than a nucleosome-size (50-120 bp) some histone protection is unmasked (Fig. 3 and Fig. S4). This observation suggests that the MNase sensitive complex sitting at the TASs is at least partly composed of histones." Please clarify.

      We appreciate the reviewer’s suggestion to make a schematic figure. We are working on this, and it will be added to the manuscript upon final revision.

      Regarding the biological impact of having mono-, di- or subnucleosome fragments, it is important to determine the fragment size of the protected DNA in order to infer the nature of the protecting complex. In the case of tRNA genes in yeast, footprints smaller than a nucleosome were found at pol III promoters and turned out to be TFIIIB-TFIIIC (Nagarajavel, doi: 10.1093/nar/gkt611). Therefore, detecting something smaller than a nucleosome might suggest the binding of trans-acting factors other than histones, or involving histones in a mixed complex. Such mixed complexes have also been observed, as in the case of the centromeric nucleosome, which has a very peculiar composition (Ocampo and Clark, Cell Reports, 2015). On the other hand, if instead we detect bigger fragments, it could indicate the presence of bigger protecting molecules, or that those regions are part of higher-order chromatin organization still inaccessible to MNase linker digestion.

      Here we show in 2D plots that the complex or components protecting the TAS are nucleosome-size, but we cannot be sure they are entirely made up of histones, since we are able to detect histone H3 only when looking at subnucleosome-size fragments. We have now added part of this explanation to the discussion.

      By integrating every analysis in this work and the previous ones, we propose that the TAS is protected by an MNase-sensitive complex (Figure 2). This complex is most likely only partly formed by histones, since we can detect histone H3 only when analyzing subnucleosome-size DNA molecules (Figure 3). As explained above, to be absolutely sure that the complex is not entirely made up of histones, future studies should perform MNase-ChIP-seq with less digested samples. However, it was previously shown that R-loops are enriched at those intergenic NDRs (Briggs 2018) and that R-loops have plenty of interacting proteins (Girasol, 2023). Therefore, this MNase-sensitive complex most likely has a hybrid nature, made up of H3 and some other regulatory molecules. We have now added a new Figure S5 showing R-loop co-localization.

      Some references are missing or incorrect:

      We will make a thorough revision.

      "In trypanosomes, there are no canonical promoter regions." - please check Cordon-Obras et al. (Navarro's group).

      Thank you for the appropriate suggestion. We have now added this reference.

      Please, cite the study by Wedel et al. (Siegel's group), which also performed MNase-seq analysis in T. brucei.

      We understand that Reviewer #2 missed that we cited this reference and that we did use the raw data from Wedel et al. 2017, from Siegel’s group. We used the MNase-ChIP-seq data set of histone H3 in our analysis for Figures 3, S4b and S5b (S6c in the revised version), as also detailed in Table S1. To be even more explicit, we have now included the accession number of each data set in the figure legends.

      Figure-specific comments: Fig. S3: Why does the number of larger fragments increase with greater MNase digestion? Shouldn't the opposite be expected?

      This is a good observation. As we also explained to Reviewer #1:

      It's a common observation in MNase digestion of chromatin that more extensive digestion can still result in a broad range of fragment sizes, including some longer fragments. This seemingly counter-intuitive result is primarily due to the non-uniform accessibility of chromatin and the sequence preference of the MNase enzyme.

      The rationale is as follows: when you digest chromatin with MNase with the aim of mapping nucleosomes genome-wide, the ideal situation would be to recover all the material in the mononucleosome band. However, MNase digests protected DNA less efficiently, yet if the reaction proceeds further it always ends up destroying part of it, so the result is always far from perfect. The best we can achieve is a sample where ~80% of the material is contained in the mononucleosome band. __And here comes the main point:__ even in the best scenario, you always have some additional longer bands, such as those for di- or tri-nucleosomes. If you keep digesting, you will get less than 80% in the nucleosome band, and the remaining DNA fragments that used to contain di- and tri-nucleosomes start getting digested as well, producing a bigger dispersion in fragment sizes.

      How do we explain the persistence of long fragments? The longest fragments (di-, tri-nucleosomes) that persist in a highly digested sample are the ones that were originally most highly protected by proteins or higher-order structure, making their linker DNA extremely resistant to initial cleavage. Once the majority of the genome is fragmented, these few resistant longer fragments become a more visible component of the remaining population, contributing to a broader size dispersion. Hence, you end up with a bigger dispersion in length distributions in the final material. Bottom line: it is not good practice to work with under- or over-digested samples. Our main point is to emphasize that, especially when comparing samples, it is important to compare those with comparable levels of digestion. Otherwise, a different sampling of the genome will be represented in the remaining sequenced DNA.

      Fig. S5B: Why not use MNase conditions under which T. cruzi and T. brucei display comparable profiles at TAS? This would facilitate interpretation.

      The reviewer made a reasonable observation. The reason why we used MNase-ChIP-seq instead of just MNase-seq to test occupancy at the TAS in the subsets of genes is that we wanted to be more certain whether we were looking at histones or something else. By using IP for histone H3, we can see that at multi-copy genes this protein is present when looking at nucleosome-size fragments. Additionally, as shown in Figure S4b, length distribution histograms are also similar for the compared IPs.

      Minor points:

      There are several typos throughout the manuscript.

      Thanks for the observation. We will check carefully.

      Methods: "Dinucelotide frecuency calculation."

      We will add a code in GitHub
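
      Until that code is available, the general shape of a per-position dinucleotide frequency calculation can be sketched as follows. This is not the authors' code: the function name, the requirement that sequences be pre-aligned to the TAS, and the grouping of dinucleotides into an AT type (AA/TT/AT/TA) and a GC type (GG/CC/GC/CG) are assumptions made for illustration.

```python
# Assumed groupings of dinucleotides; the manuscript may define them differently.
AT_TYPE = ("AA", "TT", "AT", "TA")
GC_TYPE = ("GG", "CC", "GC", "CG")


def dinucleotide_profile(sequences, dinucleotides):
    """Fraction of sequences carrying one of `dinucleotides` at each offset.

    sequences: equal-length strings, aligned so that a given index is
    the same distance from the TAS in every sequence.
    Returns a list with one value per dinucleotide start position.
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must have equal length")
    targets = {d.upper() for d in dinucleotides}
    profile = []
    for i in range(length - 1):
        hits = sum(s[i:i + 2].upper() in targets for s in sequences)
        profile.append(hits / len(sequences))
    return profile
```

      In practice such profiles are usually smoothed before plotting, and a periodicity claim would need a separate statistical test on top of them.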

      Reviewer #2 (Significance (Required)):

      In my view, the main conclusion is that the profiles are indeed similar-even when comparing T. brucei and T. cruzi. This was not clear in previous studies (and even appeared contradictory, reporting nucleosome depletion versus enrichment) largely due to differences in chromatin digestion across these organisms. Audience: basic science and specialized readers.

      Expertise: epigenetics and gene expression in trypanosomatids.

      __Reviewer #3 (Evidence, reproducibility and clarity (Required)):__

      The authors analysed publicly accessible MNase-seq data in TriTryps parasites, focusing on the chromatin structure around trans-splicing acceptor sites (TASs), which are vital for processing gene transcripts. They describe a mild nucleosome depletion at the TAS of T. cruzi and L. major, whereas a histone-containing complex protects the TASs of T. brucei. In the subsequent analysis of T. brucei, they suggest that an MNase-sensitive complex is localised at the TASs. For single-copy versus multi-copy genes, the authors show different di-nucleotide patterns and chromatin structures. Accordingly, they propose this difference could be a novel mechanism to ensure the accuracy of trans-splicing in these parasites.

      Before providing an in-depth review of the manuscript, I note that some missing information would have helped in assessing the study more thoroughly; however, in the light of the available information, I provide the following comments for consideration.

      The numbering of the figures, including the figure legends, is missing in the PDF file. This is essential for assessing the provided information.

      We apologize for not including the figure numbers in the main text, although the figures are located in the right place when called in the text. The omission happened unintentionally when the figure legends were moved to the bottom of the main text. This is now fixed in the updated version of the manuscript.

      The publicly available MNase-seq data are manyfold, with multiple datasets available for T. cruzi, for example. It is unclear from the manuscript which dataset was used for which figure. This must be clarified.

      This was detailed in Table S1. We have now replaced the table with an improved version, and we have also included the accession number of each data set used in the figure legends.

      Why do the authors start in figure 1 with the description of an MNase-protected TAS for T. brucei, given that it has been clearly shown by the Siegel lab that there is a nucleosome depletion similar to other parasites?

      We did not want to ignore the paper from Patterton’s lab because it was the first to map nucleosomes genome-wide in T. brucei, and its main finding was the existence of a well-positioned nucleosome at intergenic regions, which we thought was a point worth discussing. Patterton’s work uses MNase-seq from gel-purified samples and provides replicated experiments sequenced at very good depth, whereas Siegel’s lab uses MNase-ChIP-seq of histone H3 but performed only one experiment and did not sequence its input. Each work therefore has its own caveats and provides different information, and together they contribute to a more comprehensive study. We think that bringing both data sets into the discussion, as we have done in Figures 1 and 3, helps us and the community working in the field to enrich the discussion.

      If the authors re-analyse the data, they should compare their pipeline to those used in the other studies, highlighting differences and potential improvements.

      We are working on this point. We will provide a more detailed description in the final revision.

      Since many figures resemble those in already published studies, there seems little reason to repeat and compare without a detailed comparison of the pipelines and their differences.

      Following the reviewer's advice, we are now working on highlighting the main differences that justify analyzing the data the way we did; these will be added to the final revised Methods section.

      At first glance, some of our figures might look similar to those in the original manuscripts. However, a careful and detailed reading of our manuscript shows that we have added several analyses that unveil information that was not disclosed before.

      First, we perform a systematic comparison, analyzing every data set the same way from beginning to end; the main difference from previous studies is the thorough and precise prediction of TASs for the three organisms. Second, we represent the average chromatin organization relative to those predicted TASs for TriTryps and discuss their global patterns. Third, by representing the average chromatin organization as heatmaps, we show for the very first time that those average nucleosome landscapes are not just an average: a similar organization is maintained across most of the genome. This was not done in any of the previous manuscripts except for our own (Beati, PLOS One 2023). Additionally, we introduce the discussion of how the extent of the MNase reaction can affect the output of these experiments, and we show 2D plots and length distribution heatmaps to discuss this point (a point completely ignored in all the chromatin literature for trypanosomes). Furthermore, we made a far-reaching analysis by considering the contributions of each published work, even when addressed by different techniques. Finally, we discuss our findings in the context of a topic of current interest in the field, namely TriTryps genome compartmentalization.

      Several previous MNase-seq analysis studies addressing chromatin accessibility emphasized the importance of using varying degrees of chromatin digestion, from low to high digestion (30496478, 38959309, 27151365).

      The reviewer is correct, and this point is exactly what we intended to illustrate in Figure 2. We appreciate the reviewer suggesting these references, which we now cite in the final discussion. Just to clarify, using varying degrees of chromatin digestion is useful for drawing conclusions about a given organism, but when comparing samples, strains, histone marks, etc., it is extremely important to select samples with similar levels of digestion.

      No information on the extent of DNA hydrolysis is provided in the original MNase-seq studies. This key information cannot be inferred from the length distribution of the sequenced reads.

      The reviewer is correct that “No information on the extent of DNA hydrolysis is provided in the original MNase-seq studies”, and this is another reason why it is so important for our analysis to be published and discussed by the scientific community working on trypanosomes. We disagree with the reviewer on the second statement, since the level of digestion of a sequenced sample is actually assessed by plotting the length distribution of the total DNA sequenced. It is true that before sequencing you can, and should, check the level of digestion of the purified samples in an agarose gel and/or on a Bioanalyzer. It can also be tested after library preparation but before sequencing, where the fragment sizes are expected to increase by the length of the library adapters. But the final test of success when working with MNase-digested samples is to analyze the length of the DNA molecules by plotting histograms of the length distribution of the sequenced DNA molecules. Remarkably, on occasion different samples might look very similar when run on a gel yet render different length distribution histograms, because the nucleosome core could be intact while the linker DNA associated with it has been trimmed to a different extent, or the core may even have been chewed internally (see Cole Hope 2011, section 5.2, doi: 10.1016/B978-0-12-391938-0.00006-9, for a detailed explanation).
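
      The length-distribution check described above can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the 120-180 bp window used as a proxy for mononucleosome-sized fragments, the function names, and the toy fragment lengths are all assumptions made for the example.

```python
from collections import Counter

def length_histogram(fragment_lengths):
    """Count how many sequenced fragments have each length in bp."""
    return Counter(fragment_lengths)

def fraction_in_window(fragment_lengths, low=120, high=180):
    """Fraction of fragments with low <= length <= high.

    The default window is an assumed proxy for mononucleosome-sized DNA.
    """
    if not fragment_lengths:
        return 0.0
    inside = sum(1 for n in fragment_lengths if low <= n <= high)
    return inside / len(fragment_lengths)

# Toy fragment lengths (bp), invented for illustration only.
moderate = [147, 150, 145, 160, 148, 310, 152, 149, 146, 155]
extensive = [120, 110, 100, 135, 147, 95, 128, 105, 140, 90]

print(fraction_in_window(moderate))   # share of fragments in the window
print(fraction_in_window(extensive))  # lower share after heavier trimming
```

      In this toy case the two samples give 0.9 and 0.5, so samples that might look alike on a gel can still be told apart by how their sequenced fragments are distributed.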

      As the input material are selected, in part gel-purified mono-nucleosomal DNA bands. Furthermore the datasets are not directly comparable, as some use native MNase, while others employ MNase after crosslinking; some involve short digestion times at 37 °C, while others involve longer digestion at lower temperatures. Combining these datasets to support the idea of an MNase-sensitive complex at the TAS of T. brucei therefore may not be appropriate, and additional experiments using consistent methodologies would strengthen the study's conclusions.

      In my opinion, describing an MNase-sensitive complex based solely on these data is not feasible. It requires specifically designed experiments using a consistent method and well-defined MNase digestion kinetics.

      As the reviewer suggests, the ideal experiment would be to perform a time course of the MNase reaction with all the samples in parallel, or to work with a fixed time point while adding increasing amounts of MNase. However, the information obtained from the detailed analysis of the length distribution histogram of the sequenced DNA molecules is the best test of the real outcome. In fact, those samples with different digestion levels were probably not generated on purpose.

      The only data sets that were gel purified are those from Mareé 2017 (Patterton’s lab), used in Figures 1, S1 and S2, and those from L. major shown in Fig. 1. Gel purification was common practice during those years; we have since learned that it is not necessary, since fragment sizes can be sorted later in silico when needed.
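
      As a rough sketch of what sorting fragment sizes in silico can look like, the snippet below filters paired-end fragments by their span. The `(chrom, start, end)` record layout, the function name, and the size bounds are hypothetical choices for illustration, not a description of the tools used in the study.

```python
def select_by_size(fragments, min_len=120, max_len=180):
    """Keep (chrom, start, end) fragments whose length (end - start) is within bounds.

    Coordinates are assumed to be 0-based, half-open intervals.
    """
    return [f for f in fragments if min_len <= f[2] - f[1] <= max_len]

# Invented example fragments: lengths 147, 310 and 90 bp.
fragments = [("chr1", 100, 247), ("chr1", 300, 610), ("chr2", 50, 140)]
mono_sized = select_by_size(fragments)
print(mono_sized)
```

      Only the 147 bp fragment is retained here, which plays the role that cutting the mononucleosome band from a gel would play at the bench.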

      As we explained to reviewer #1, to avoid this conflict we decided to remove these data from Figures 2 and S3. In summary, the three remaining samples come from the same lab and belong to the same publication (Mareé 2022). These samples are the inputs of native MNase-ChIP-seq, obtained the same way, and are fully comparable with each other.

      Reviewer #3 (Significance (Required)):

      Due to the lack of controlled MNase digestion, use of heterogeneous datasets, and absence of benchmarking against previous studies, the conclusions regarding MNase-sensitive complexes and their functional significance remain speculative. With standardized MNase digestion and clearly annotated datasets, this study could provide a valuable contribution to understanding chromatin regulation in TriTryps parasites.

      As we have explained in the previous point, our conclusions are valid, since we do not compare samples coming from different treatments in any figure. The only exception could be Figure 3, when discussing MNase-ChIP-seq. We have now added a clear and explicit comment in that section and in the discussion that, despite subtle differences in experimental procedures, we arrive at the same results. This is the case for the T. cruzi IP, run from crosslinked chromatin, compared to the T. brucei IP, run from native chromatin.

      Over the years, it has been observed in the chromatin field that nucleosomes are so tightly bound to DNA that crosslinking is not necessary. However, it is still common practice, especially when performing IPs. In our own hands, we did not observe any difference at the global level, either in T. cruzi or in my previous work with yeast.

      ...

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      In this manuscript, the authors describe a good-quality ancient maize genome from 15th-century Bolivia and try to link the genome characteristics to Inca influence. Overall, the manuscript is below the standard in the field. In particular, the geographic origin of the sample and its archaeological context is not well evidenced. While dating of the sample and the authentication of ancient DNA have been evidenced robustly, the downstream genetic analyses do not support the conclusion that genomic changes can be attributed to Inca influence. Furthermore, sections of the manuscript are written incoherently and with logical mistakes. In its current form, this paper is not robust and possibly of very narrow interest. 

      Strengths: 

      Technical data related to the maize sample are robust. Radiocarbon dating strongly evidenced the sample age, estimated to be around 1474 AD. Authentication of ancient DNA has been done robustly. Spontaneous C-to-T substitutions, which are present in all ancient DNA, are visible in the reported sample with the expected pattern. Despite a low fraction of C-to-T at the 1st base, this number could be consistent with the cool and dry climate in which the sample was preserved. The distribution of DNA fragment sizes is consistent with expectations for a sample of this age. 

      Weaknesses: 

      Thank you for all your thoughtful comments. See below for comments on each.

      (1) Archaeological context for the maize sample is weakly supported by speculation about the origin and has unreasonable claims weighing on it. Perhaps those findings would be more convincing if the authors were to present evidence that supports their conclusions: i) a map of all known tombs near La Paz, ii) evidence supporting the stone tomb origins of this assemblage, and iii) evidence supporting non-Inca provenance of the tomb. 

      We believe we are clear about what information we have about context. First, the intake records from the MSU Museum from 1890 are not as detailed as we would like, but we cannot enhance them. The mummified girl and her accoutrements, including the maize, came from a stone tower or chullpa south of La Paz, in what is now Bolivia. We do not know which stone chullpa, so a map would be of limited use. The mortuary group is identified as Inca, but as we note, the accoutrements do not appear to be of high status, so it is possible that she was not an elite. Mud tombs are normally attributed to the local population, and stone towers to Inca or elites. We have clarified at multiple places in the text that the maize is from the period of Inca incursion in this part of Bolivia, and have modified the text to reflect greater uncertainty about Inca or local origin while noting that selection for environmentally favorable characteristics had taken place. Regardless, there are three 15th-century CE AMS ages: on the maize, a cucurbita rind, and a camelid fiber. The maize is almost certainly mid to late 15th century CE.

      (2) Dismissal of the admixture in the reported samples is not evidenced correctly. Population f3 statistic with an outgroup is indeed one of the most robust metrics for sample relatedness; however, it should not be used as a test of admixture. For an admixture test, the population f3 statistic should be used in the form: i) target population, ii) one possible parental population, iii) another possible parental population. This is typically done iteratively with all combinations of possible parental populations. Even in such a form, the population f3 statistic is not very sensitive to admixture in cases of strong genetic drift, and instead population f4 statistic (with an outgroup) is a recommended test for admixture. 

      We have removed “Our admixture f3-statistics test results suggest aBM is not admixed” from our revised manuscript. Since our goal here is to identify which group(s) have the highest relatedness with aBM, the population f3 statistic with an outgroup is the most robust metric for this test and supports our conclusion.

      (3) The geographic placement of the sample based on genetic data is not robust. To make use of the method correctly, it would be necessary to validate that genetic samples in this region follow the assumption of the 'isolation-by-distance' with dense sampling, which has not been done. Additionally, the authors posit that "This suggests that aBM might not only be genetically related to the archaeological maize from ancient Peru, but also in the possible geographic location." The method used to infer the location is based on pure genetic estimation. The above conclusion is not supported by this method, and it directly contradicts the authors' suggestion that the sample comes from Bolivia.  

      We understand that it is necessary to validate the 'isolation-by-distance' assumption with dense sampling. However, we did not do so because: 1) the ancient maize samples range in age from ~5000 BP to ~100 BP and were found in very different countries at different times; 2) isolation-by-distance is a population genetic concept that is often used to test whether populations that are geographically farther apart are also more genetically different. Considering that we only have 17 ancient samples in total, our sample size is not sufficient for a large population test.

      For “It directly contradicts the authors' suggestion that the sample comes from Bolivia”: as we described in our manuscript, “Given the provenience of the aBM and its age, it is possible the samples were local or alternatively were introduced into western highland Bolivia from the Inca core area – modern Peru.” The sample recording file did show that the aBM sample was found in Bolivia, but we do not know where aBM originally came from before it was found in Bolivia. To answer this question, we used locator.py to predict the geographic location from which aBM may have originally come, and our results showed that the predicted location is inside modern Peru and is also very close to the archaeological Peruvian maize.

      Therefore, our conclusion that “This suggests that aBM might not only be genetically related to the archaeological maize from ancient Peru, but also in the possible geographic location” does not contradict the fact that the sample was found in Bolivia.
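
      For the isolation-by-distance assumption raised in point (3), one common check is a Mantel-style permutation test of the correlation between pairwise geographic and genetic distances. The sketch below is illustrative only: the site coordinates are hypothetical, the genetic distances are constructed to be proportional to geographic distance, and neither the reviewer nor the authors specify this particular test.

```python
import math
import random
from itertools import combinations

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mantel(geo, gen, permutations=999, seed=0):
    """One-sided Mantel test: returns (observed r, permutation p-value)."""
    n = len(geo)
    pairs = list(combinations(range(n), 2))
    x = [geo[i][j] for i, j in pairs]
    observed = pearson(x, [gen[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel samples in the genetic matrix
        y = [gen[order[i]][order[j]] for i, j in pairs]
        if pearson(x, y) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical sampling sites (lat, lon); not the samples from the study.
sites = [(-16.5, -68.1), (-13.5, -72.0), (-12.0, -77.0), (-8.1, -79.0), (18.5, -97.4)]
geo = [[haversine_km(p, q) for q in sites] for p in sites]
# Toy genetic distances built to follow isolation-by-distance exactly.
gen = [[0.0001 * d for d in row] for row in geo]

r, p_value = mantel(geo, gen)
print(r, p_value)
```

      With only a handful of samples the number of distinct permutations is small, so the p-value has little resolution; this is the practical side of the sample-size concern voiced on both sides of the exchange.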

      (4) The conclusion that Ancient Andean maize is genetically similar to European varieties and hence shares a similar evolutionary history is not well supported. The PCA plot in Figure 4 merely represents sample similarity based on two components (jointly responsible for about 20% of the variation explained), and European samples could be very distant based on other components. Indeed, the direct test using the outgroup f3 statistic does not support that European varieties are particularly closely related to ancient Andean maize. Perhaps these are more closely related to Brazil? We do not know, as this has not been measured. 

      Our conclusion is “We also found that a few types of maize from Europe have a much closer distance to the archaeological maize cluster compared to other modern maize, which indicates maize from Europe might expectedly share certain traits or evolutionary characteristics with ancient maize. It is also consistent with the historical fact that maize spread to Europe after Christopher Columbus's late 15th century voyages to the Americas. But as shown, maize also has diversity inside the European maize cluster. It is possible that European farmers and merchants may have favored different phenotypic traits, and the subsequent spread of specific varieties followed the new global geopolitical maps of the Colonial era”.

      We understand your concern that the two components explain only about 20% of the variation. However, as you can see from Figure 2b in the Grzybowski, M.W. et al., 2023 publication, it described that “the first principal component (PC1) of variation for genetic marker data roughly corresponded to the division between domesticated maize and maize wild relatives is only 1.3%”. This shows that low percentages are quite common in maize, especially when the datasets include landraces, hybrids, and wild relatives. Our dataset includes archaeological maize ranging from ~5,000 BP to ~100 BP as well as modern maize, which makes the genetic structure of our data more complicated. Therefore, we think our two components are the best explanation currently possible. We have also included a PCA plot based on components 1 and 3 in Fig4_PCA13.pdf; it does not show that the European samples are very distant.
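
      The arithmetic behind the "about 20%" figure is simply each component's eigenvalue divided by the sum of all eigenvalues. The sketch below uses invented eigenvalues chosen so that the first two components sum to 20%; they are not taken from the manuscript.

```python
def explained_variance(eigenvalues):
    """Return (per-component ratios, cumulative ratios) for PCA eigenvalues."""
    total = sum(eigenvalues)
    ratios = [v / total for v in eigenvalues]
    cumulative, running = [], 0.0
    for ratio in ratios:
        running += ratio
        cumulative.append(running)
    return ratios, cumulative

# Invented eigenvalues summing to 100, for illustration only.
eigenvalues = [12.0, 8.0, 6.0, 5.0, 4.0] + [5.0] * 13
ratios, cumulative = explained_variance(eigenvalues)
print(cumulative[1])  # share captured by PC1 + PC2
print(cumulative[2])  # share once PC3 is added
```

      Reporting the cumulative share for the first several components makes it easier to judge how much structure a two-dimensional plot leaves unshown.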

      For “Perhaps these are more closely related to Brazil?”, thank you for this very good question, but we apologize that we cannot answer this question from our current study because our study focuses on identifying the location where aBM originally came from, establishing and explaining patterns of genetic variability of maize, with a specific focus on maize strains that are related to our current aBM. Thus, we will not explore the story between maize from Brazil and European maize in our current study.

      (5) The conclusion that long branches in the phylogenetic tree are due to selection under local adaptation has no evidence. Long branches could be the result of missing data, nucleotide misincorporations, genetic drift, or simply due to the inability of phylogenetic trees to model complex population-level relationships such as admixture or incomplete lineage sorting. Additionally, captions to Figure S3, do not explain colour-coding.  

      We have removed “aBM tends to have long branches compare to tropicalis maize, which can be explained by adaption for specific local environment by time.” in our revised manuscript.

      We have added the color-coding information under Fig. S3 in our revised manuscript.

      (6) The conclusion that selection detected in aBM sample is due to Inca influence has no support. Firstly, selection signature can be due to environmental or other factors. To disentangle those, the authors would need to generate the data for a large number of samples from similar cultural contexts and from a wide-ranging environmental context, followed by a formal statistical test. Secondly, allele frequency increase can be attributed to selection or demographic processes, and alone is not sufficient evidence for selection. The presented XP-EHH method seems more suitable. Overall, methods used in this paper raise some concerns: i) how accurate are allele-frequency tests of selection when only single individual is used as a proxy for a whole population, ii) the significance threshold has been arbitrary fixed to an absolute number based on other studies, but the standard is to use, for example, top fifth percentile. Finally, linking selection to particular GO terms is not strong evidence, as correlation does not imply causation, and links are unclear anyway. 

      In sum, this manuscript presents new data that seems to be of high quality, but the analyses are frequently inappropriate and/or over-interpreted. 

      Regarding your suggestion that “from similar cultural contexts and from a wide-ranging environmental context, followed by a formal statistical test”, we apologize that this cannot be done in our current study because we could not find other archaeological maize samples/datasets that are from similar cultural contexts.

      For “Secondly, allele frequency increase can be attributed to selection or demographic processes, and alone is not sufficient evidence for selection.” Yes, we agree, and that’s why we said it “inferred” the conclusion instead of “indicated”. Furthermore, we revised the whole manuscript following all reviewers’ comments and reorganized and reduced the part on selection on aBM.

      For “The presented XP-EHH method seems more suitable”, we do not think XP-EHH is the best method to use here because we only have one aBM sample, whereas XP-EHH is more suitable for a population analysis.

      For “Finally, linking selection to particular GO terms is not strong evidence, as correlation does not imply causation, and links are unclear anyway.”, as we described in our manuscript, our results “inferred” instead of “indicated” the conclusion.
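
      The reviewer's suggestion in point (6) of an empirical cutoff, such as the top fifth percentile, in place of a fixed absolute threshold can be sketched as follows. The function name, the integer-percent parameter, and the toy scores are assumptions for illustration; no values from the manuscript are used.

```python
def empirical_threshold(scores, top_percent=5):
    """Return the score at the boundary of the top `top_percent` of values."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    k = max(1, len(ranked) * top_percent // 100)  # number of outliers to keep
    return ranked[k - 1]

# Toy per-SNP selection scores, invented for illustration.
scores = list(range(1, 101))
cutoff = empirical_threshold(scores)
outliers = [s for s in scores if s >= cutoff]
print(cutoff, len(outliers))
```

      Because the cutoff is derived from the observed distribution, it adapts to the scale of the statistic in each dataset instead of importing a number calibrated on other data.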

      Reviewer #2 (Public review): 

      Summary: 

      The manuscript presents valuable new datasets from two ancient maize seeds that contribute to our growing understanding of the maize evolution and biodiversity landscape in pre-colonial South America. Some of the analyses are robust, but the selection elements are not supported. 

      Strengths: 

      The data collection is robust, and the data appear to be of sufficiently high quality to carry out some interesting analytical procedures. The central finding that aBM maize is closely related to maize from the core Inca region is well supported, although the directionality of dispersal is not supported. 

      Weaknesses: 

      Thank you for your comments and suggestions. See below for responses and explanations.

      The selection results are not justified, see examples in the detailed comments below. 

      (1) The manuscript mentions cultural and natural selection (line 76), but then only gives a couple of examples of selecting for culinary/use traits. There are many examples of selection to tolerate diverse environments that could be relevant for this discussion, if desired. 

      We have added related examples with references supported in our revised manuscript.  

      (2) I would be extremely cautious about interpreting the observations of a Spanish colonizer (lines 95-99) without very significant caveats. Indigenous agriculture and food ways would have been far more nuanced than what could be captured in this context, and the genocidal activities of the Europeans would have impacted food production activities to a degree, and any contemporaneous accounts need to be understood through that lens.  

      We agree with the first part of this comment and have softened our use of this particular textual material such that it is far less central to interpretation. While of interest, we cannot evaluate the impact of colonial European activities or observational bias for purposes of this analysis.

      (3) The f3 stats presented in Figure 2 are not set up to test any specific admixture scenarios, so it is unsupported to conclude that the aBM maize is not admixed on this basis (lines 201-202). The original f3 publication (Patterson et al, 2012) describes some scenarios where f3 characteristics associate with admixture, but in general, there are many caveats to this approach, and it's not the ideal tool for admixture testing, compared with e.g., f4 and D (abba-baba) statistics.  

      You make an important point that f3 statistics are not the ideal tool for admixture testing. Since our goal here is to identify which group(s) have the highest relatedness with aBM, the population f3 statistic with an outgroup is the most robust metric for this test and supports our conclusion. We have removed “Our admixture f3-statistics test results suggest aBM is not admixed” from our revised manuscript.
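
      For comparison with f3, the D (ABBA-BABA) statistic mentioned in point (3) can be written directly from derived-allele frequencies in four populations. This is a minimal sketch with invented frequencies; it gives only the point estimate, whereas real analyses also need a block-jackknife standard error to judge significance.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Patterson's D from derived-allele frequencies for the tree ((P1, P2), P3, O).

    ABBA sites group P2 with P3; BABA sites group P1 with P3.
    D > 0 indicates excess allele sharing between P2 and P3.
    """
    abba = baba = 0.0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        abba += (1 - a) * b * c * (1 - o)
        baba += a * (1 - b) * c * (1 - o)
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)

# Toy derived-allele frequencies at four sites, invented for illustration.
pop1 = [0.0, 0.0, 1.0, 0.0]
pop2 = [1.0, 1.0, 0.0, 0.0]
pop3 = [1.0, 1.0, 1.0, 1.0]
out = [0.0, 0.0, 0.0, 0.0]

print(d_statistic(pop1, pop2, pop3, out))
```

      Here two ABBA sites against one BABA site give D = 1/3, the kind of asymmetry that would point to gene flow between P2 and P3 if it were statistically supported.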

      (4) I'm a little bit skeptical that the Locator method adds value here, given the small training sample size and the wide geographic spread and genetic diversity of the ancient samples that include Central America. The paper describing that method (Battey et al 2020 eLife) uses much larger datasets, and while the authors do not specifically advise on sample sizes, they caution about small sample size issues. We have already seen that the ancient Peruvian maize has the most shared drift with aBM maize on the basis of the f3 stats, and the Locator analysis seems to just be reiterating that. I would advise against putting any additional weight on the Locator results as far as geographic origins, and personally I would skip this analysis in this case.  

      As we described in our manuscript, we have 17 archaeological samples in total. Please find more detailed information from the “geographical location prediction” section.

      We cannot add more ancient samples because these are all that we could find in previous publications. We would still like to keep this analysis because the f3 statistics indicate genome similarity, whereas the purpose of the locator.py analysis is to predict the location of origin of a genetic sample by comparing it to a set of samples of known geographic origin.

      (5) The overlap in PCA should not be used to confirm that aBM is authentically ancient, because with proper data handling, PCA placement should be agnostic to modern/ancient status (see lines 224-226). It is somewhat unexpected that the ancient Tehuacan maize (with a major teosinte genomic component) falls near the ancient South American maize, but this could be an artifact of sampling throughout the PCA and the lack of teosinte samples that might attract that individual.  

      We have removed “which supports the authenticity of aBM as archaeological maize” in our revised manuscript. The PCA was only applied for all maize samples, so we did not include any teosinte samples in the analysis.

      (6) What has been established (lines 250-251) is genetic similarity to the Inca core area, not necessarily the directionality. Might aBM have been part of a cultural region supplying maize to the Inca core region, for example? Without a specific test of dispersal directionality, which I don't think is possible with the data at hand, this is somewhat speculative. 

      We added this and re-wrote this part in our revised manuscript.

      (7) Singleton SNPs are not a typical criterion for identifying selection; this method needs some citations supporting the exact approach and validation against neutral expectations (line 278). Without Datasets S2 and S3, which are not included with this submission, it is difficult to assess this result further. However, it is very unexpected that ~18,000 out of ~49,000 SNPs would be unique to the aBM lineage. This most likely reflects some data artifact (unaccounted damage, paralogs not treated for high coverage, which are extremely prevalent in maize, etc). I'm confused about unique SNPs in this context. How can they be unique to the aBM lineage if the SNPs used overlap the Grzybowski set? The GO results do not include any details of the exact method used or a statistical assessment of the results. It is not clear if the GO terms noted are statistically enriched.  

      We have added references 53 and 54 in our revised manuscript, and we also uploaded the Datasets S2 and S3.

      For “I'm confused about unique SNPs in this context. How can they be unique to the aBM lineage if the SNPs used overlap the Grzybowski set?”, as we described in our Materials and Methods section, “To achieve potential unique selection on aBM, we calculated the allele frequency for each SNPs between aBM and other archaeological maize, resulting in allele frequency data for 49,896 SNPs. Of these,18,668 SNPs were unique to aBM.” Thus, the unique SNPs for aBM came from the comparison between aBM and the other archaeological maize, and we did not use any modern maize data from the Grzybowski set.
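
      One way to make "unique to aBM" concrete is to flag sites where the target's allele is not observed in any comparison sample. The definition below is an assumption made for illustration (the response does not spell out the exact criterion), as are the function name and the toy pseudo-haploid calls, with `None` marking missing data.

```python
def private_sites(target, others):
    """Indices of sites where the target allele is absent from all other samples.

    Sites are skipped when the target is missing or when no other sample
    has a call, since uniqueness cannot be assessed there.
    """
    private = []
    for i, allele in enumerate(target):
        if allele is None:
            continue
        observed = {sample[i] for sample in others if sample[i] is not None}
        if observed and allele not in observed:
            private.append(i)
    return private

# Toy pseudo-haploid calls at four sites, invented for illustration.
target_calls = ["A", "C", "G", "T"]
other_calls = [
    ["A", "C", "T", "T"],
    ["A", "G", "T", None],
]
print(private_sites(target_calls, other_calls))
```

      Under this definition only the third site is private. With a small comparison panel, damage-induced substitutions and paralogous mappings can also appear private, which is the artifact the reviewer warns about.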

      For “The GO results do not include any details of the exact method used or a statistical assessment of the results. It is not clear if the GO terms noted are statistically enriched.” We did not perform GO term enrichment, so there is no statistical assessment of the results. Instead, we retained the GO term information for each gene by checking its biological process in MaizeGDB, and then summarized the results in Dataset S4.
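
      The statistical assessment the reviewer asks about is usually a one-sided hypergeometric (Fisher-type) test per GO term. The sketch below shows that calculation with invented counts; it is not something the authors report having run, and a real analysis would also correct for testing many terms.

```python
from math import comb

def enrichment_p_value(population, annotated, drawn, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, drawn).

    population: genes in the background set
    annotated:  background genes carrying the GO term
    drawn:      genes in the candidate list
    hits:       candidate genes carrying the GO term
    """
    upper = min(annotated, drawn)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(population, drawn)

# Invented counts: 20 background genes, 5 with the term, 5 candidates, all 5 hit.
p = enrichment_p_value(population=20, annotated=5, drawn=5, hits=5)
print(p)
```

      A term would be called enriched only if this probability stays small after multiple-testing correction, which is a stronger statement than listing the terms attached to candidate genes.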

      (8) The use of XP-EHH with pseudo haplotype variant calls is not viable (line 293). It is not clear what exact implementation of XP-EHH was used, but this method generally relies on phased or sometimes unphased diploid genotype calls to observe shared haplotypes, and some minimum population size to derive statistical power. No implementation of XP-EHH to my knowledge is appropriate for application to this kind of dataset. 

      We used the same XP-EHH as in this publication: “Sabeti, P.C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913-918 (2007).” Specifically, in our analysis the SNP information of modern maize was compared with that of ancient maize. The code is available at https://doi.org/10.5061/dryad.w6m905qtd.

      XP-EHH is a statistical method used in population genetics to detect recent positive selection in one population compared to another, and it has often been applied to large modern maize populations in previous research. In our study, we wanted to detect recent positive selection in modern maize compared to ancient maize, so we applied XP-EHH here. Although the population size of ancient maize is not large, it is the best method that we can apply to our dataset to detect recent selection on modern maize.

      Reviewer #3 (Public review): 

      Summary: 

      The authors seek to place archaeological maize samples (2 kernels) from Bolivia into genetic and geographical context and to assess signatures of selection. The kernels were dated to the end of the Incan empire, just prior to European colonization. Genetic data and analyses were used to characterize the distance from other ancient and modern maize samples and to predict the origin of the sample, which was discovered in a tomb near La Paz, Bolivia. Given the conquest of this region by the Incan empire, it is possible that the sample could be genetically similar to populations of maize in Peru, the center of the Incan empire. Signatures of selection in the sample could help reveal various environmental variables and cultural preferences that shaped maize genetic diversity in this region at that time. 

      Strengths: 

      The authors have generated substantial genetic data from these archaeological samples and have assembled a data set of published archaeological and modern maize samples that should help to place these samples in context. The samples are dated to an interesting time in the history of South America during a period of expansion of the Incan empire and just prior to European colonization. Much could be learned from even this small set of samples. 

      Weaknesses: 

      Many thanks for your comments and suggestions.  We have addressed these below and provided further explanation.

      (1) Sample preparation and sequencing: 

      Details of the quality of the samples, including the percentage of endogenous DNA are missing from the methods. The low percentage of mapped reads suggests endogenous DNA was low, and this would be useful to characterize more fully. Morphological assessment of the samples and comparison to morphological data from other maize varieties is also missing. It appears that the two kernels were ground separately and that DNA was isolated separately, but data were ultimately pooled across these genetically distinct individuals for analysis. Pooling would violate assumptions of downstream analysis, which included genetic comparison to single archaeological and modern individuals. 

      We did not perform a morphological assessment of the samples or compare them with morphological data from other maize varieties, because we have only 2 aBM kernels and no other archaeological samples that could be used for comparison.

      Regarding “It appears that the two kernels were ground separately and that DNA was isolated separately, but data were ultimately pooled across these genetically distinct individuals for analysis”: as stated in our Materials and Methods section (“Whole kernels were crushed in a mortar and pestle”), the two kernels were ground together before sequencing.

      While morphological assessment of the sample would be interesting, most morphological data reported for maize are from microremains (starch, phytoliths, pollen) and this is beyond the scope of our study. Most studies of macrobotanical remains do not appear to focus solely on individual kernels, but instead on (or in combination with) cob and ear shape, which were not available in the assemblage.

      (2) Genetic comparison to other samples: 

      The authors did not meaningfully address the varying ages of the other archaeological samples and modern maize when comparing the genetic distance of their samples. The archaeological samples were as old as >5000 BP to as young as 70 BP and therefore have experienced varying extents of genetic drift from ancestral allele frequencies. For this reason, age should explicitly be included in their analysis of genetic relatedness. 

      We have revised the related part of our manuscript.

      (3) Assessment of selection in their ancient Bolivian sample: 

      This analysis relied on the identification of alleles that were unique to the ancient sample and inferred selection based on a large number of unique SNPs in two genes related to internode length. This could be a technical artifact due to poor alignment of sequence data, evidence supporting pseudogenization, or within an expected range of genetic differentiation based on population structure and the age of the samples. More rigor is needed to indicate that these genetic patterns are consistent with selection. This analysis may also be affected by the pooling of the Bolivian archaeological samples.  

      We do not think this is due to poor alignment of the sequence data, since we used BWA v0.7.17 with seeding disabled (-l 1024) and zero-mismatch alignment. We therefore do not expect any SNPs to result from poor alignment. Please see our detailed methods description: “For the archaeological maize samples, adapters were removed and paired reads were merged using AdapterRemoval60 with parameters --minquality 20 --minlength 30. All 5′ thymine and 3′ adenine residues within 5nt of the two ends were hard-masked, where deamination was most concentrated. Reads were then mapped to soft-masked B73 v5 reference genome using BWA v0.7.17 with disabled seed (-l 1024 -o 0 -E 3) and a quality control threshold (-q 20) based on the recommended parameter61 to improve ancient DNA mapping”.

      Regarding “More rigor is needed to indicate that these genetic patterns are consistent with selection”: could you please be more specific about which method or approach we should use here? For example, are there methods from specific publications that could be referenced, or a specific tool that could be used?

      “This analysis may also be affected by the pooling of the Bolivian archaeological samples.” As we could not prove these two seeds came from two different individual plants, we do not think this analysis was affected by the pooling of the Bolivian archaeological samples.

      (4) Evidence of selection in modern vs. ancient maize: In this analysis, samples were pooled into modern and ancient samples and compared using the XP-EHH statistic. One gene related to ovule development was identified as being targeted by selection, likely during modern improvement. Once again, ancient samples span many millennia and both South, Central, and North America. These, and the modern samples included, do not represent meaningfully cohesive populations, likely explaining the extremely small number of loci differentiating the groups. This analysis is also complicated by the pooling of the Bolivian archaeological samples. 

      Yes, it is possible that ovule development might be a modern improvement. We re-wrote this part in our revised manuscript.

      Reviewer #1 (Recommendations for the authors): 

      My suggestion is to address the comments that outline why the methods used or results obtained are not sufficient to support your conclusions. Overall, I suggest limiting the narrative of Inca influence and framing it as speculation in the discussion section. Presenting conclusions of Inca influence in the title and abstract is not appropriate, given the very questionable evidence. 

      We agree and have changed the title to “Fifteenth century CE Bolivian maize reveals genetic affinities with ancient Peruvian maize”.

      Reviewer #2 (Recommendations for the authors): 

      (1) Line 74: Mexicana is another subspecies of teosinte; the distinction is between ssp. mexicana and ssp. parviglumis (Balsas teosinte), not mexicana and teosinte. 

      We have corrected this in our revised manuscript.

      (2) Line 100-102: This is a bit confusing, it cannot have been a symbol of empire "since its first introduction", since its introduction long predates the formation of imperial politics in the region. Reference 17 only treats the late precolonial Inca context, while ref 22 (which cites maize cultivation at 2450 BC, not 3000 BC) makes no reference to ritual/feasting contexts; it simply documents early phytolith evidence for maize cultivation. As such, this statement is not supported by the references offered.

      Lines 100-102: This point is well taken and was poor prose on our part. We have modified this discussion to address the confusing statement, and we have corrected our mistake in the age given for reference 22. The associated prose has been modified accordingly.

      We have corrected these passages to read “Indeed, in the Andes, previous research showed that under the Inca empire, maize was fulfilled multiple contextual roles. In some cases, it operated as a sacred crop” and “…since its first introduction to the region around 2500 BC”.

      (3) Line 161: IntCal is likely not the appropriate calibration curve for this region; dates should probably be calibrated using SHCal.  

      We greatly appreciate this important (and correct) observation. We have completely recalibrated the maize AMS result based on the southern hemisphere calibration curve, discussed the new calibrations, and have also invoked two other AMS dates on associated material, likewise calibrated with the southern hemisphere curve, for comparison. We are confident in a 15th century AD/CE age for the maize, most likely mid- to late 15th century.

      (4) Lines 167-169: The increase of G and A residues shown in Supplementary Figure S1a is just before the 5' end of the read within the reference genome context, and is related to fragmentation bias - a different process from postmortem deamination. Deamination leads to 5' C->T and 3' G->A, resulting in increased T at 5' ends and increased A at 3' ends, and the diagnostic damage curve. The reduction of C/T just before reads begin is not a result of deamination. 

      We have removed the “Both features are indicative of postmortem deamination patterns” in our revised manuscript.

      (5) Lines 187-196 This section presents a lot of important external information establishing hypotheses, and needs some references.  

      We have added the related references here.

      (6) Line 421: This makes it sound like damage masking was done BEFORE read mapping. However, this conflicts with the previous paragraph about mapDamage, and Supplementary Figure 1 still shows a slight but perceptible damage curve, which is impossible if all terminal Ts and As are hard-masked. This should be reconciled.  

      Supplementary Figure 1 shows the raw ancient maize DNA sample before damage masking. Specifically, Step 1: We used mapDamage to check whether damage was present and to estimate it, and we produced Supplementary Figure 1. Step 2: We then used our own code to hard-mask the damaged bases and performed read mapping.

      The purpose of Supplementary Figure 1 is to show the authenticity of aBM as archaeological maize. Therefore, it should show a slight but perceptible damage curve.
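
      The hard-masking rule quoted from the Methods (5′ thymine and 3′ adenine residues within 5 nt of the read ends) can be illustrated with a minimal sketch. This is not the authors' code, which is not shown here; the function name `hard_mask_deamination` and the choice of `N` as the masking character are assumptions made for illustration only.

```python
def hard_mask_deamination(read: str, window: int = 5) -> str:
    """Mask likely deamination sites at the ends of a single read.

    Hypothetical helper: replaces T within `window` bases of the 5' end
    and A within `window` bases of the 3' end with 'N'.
    """
    bases = list(read)
    n = len(bases)
    # 5' end: C->T deamination shows up as excess T.
    for i in range(min(window, n)):
        if bases[i] in "Tt":
            bases[i] = "N"
    # 3' end: G->A deamination shows up as excess A.
    for i in range(max(n - window, 0), n):
        if bases[i] in "Aa":
            bases[i] = "N"
    return "".join(bases)


masked = hard_mask_deamination("TTACGTACGTACGAAT")
print(masked)
```

      Because masking of this kind removes the terminal T and A signal, a damage profile computed after masking would no longer show the characteristic curve, which is consistent with Supplementary Figure 1 being produced from the unmasked reads.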

      (7) Line 460: PCA method is not given (just the LD pruning and the plotting).  

      The merged dataset of SNPs for archaeological and modern maize was used for PCA with “plink --pca”.

      (8) "tropicalis" maize is not common usage, it is not clear to me what this refers to. 

      We have changed all instances of “tropicalis maize” to “tropical maize” in our revised manuscript.

      (9) The Figure 4 color palette is not accessible for colorblind/color-deficient vision.  

      We have changed the color palette of Figure 4. Please find the new colors in our uploaded Figure 4.

      (10) Datasets S2 and S3 are not included with this submission. 

      Thank you for letting us know and your suggestion. We have included Datasets S2 and S3 here.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      We thank the Reviewers for their thorough attention to our paper and the interesting discussion about the findings. Before responding to more specific comments, here are some general points we would like to clarify:

      (1) Ecological niche models are indeed correlative models, and we used them to highlight environmental factors associated with HPAI outbreaks within two host groups. We will further revise the terminology that could still unintentionally suggest causal inference. The few remaining ambiguities were mainly in the Discussion section, where our intent was to interpret the results in light of the broader scientific literature. Particularly, we will change the following expressions:

      -  “Which factors can explain…” to  “Which factors are associated with…” (line 75);

      -  “the environmental and anthropogenic factors influencing” to “the environmental and anthropogenic factors that are correlated with” (line 273);

      -  “underscoring the influence” to “underscoring the strong association” (line 282).

      (2) We respectfully disagree with the suggestion that an ecological niche modelling (ENM) approach is not appropriate for this work and the research question addressed therein. Ecological niche models are specifically designed to estimate the spatial distribution of the environmental suitability of species and pathogens, making them well suited to our research questions. In our study, we have also explicitly detailed the known limitations of ecological niche models in the Discussion section, in line with prior literature, to ensure their appropriate interpretation in the context of HPAI.

      (3) The environmental layers used in our models were restricted to those available at a global scale, as listed in Supplementary Information Resources S1 (https://github.com/sdellicour/h5nx_risk_mapping/blob/master/Scripts_%26_data/SI_Resource_S1.xlsx). Naturally, not all potentially relevant environmental factors could be included, but the selected layers are explicitly documented and only these were assessed for their importance. Despite this limitation, the performance metrics indicate that the models performed well, suggesting that the chosen covariates capture meaningful associations with HPAI occurrence at a global scale.

      Reviewer #1 (Public review):

      The authors aim to predict ecological suitability for transmission of highly pathogenic avian influenza (HPAI) using ecological niche models. This class of models identify correlations between the locations of species or disease detections and the environment. These correlations are then used to predict habitat suitability (in this work, ecological suitability for disease transmission) in locations where surveillance of the species or disease has not been conducted. The authors fit separate models for HPAI detections in wild birds and farmed birds, for two strains of HPAI (H5N1 and H5Nx) and for two time periods, pre- and post-2020. The authors also validate models fitted to disease occurrence data from pre-2020 using post-2020 occurrence data. I thank the authors for taking the time to respond to my initial review and I provide some follow-up below.

      Detailed comments:

      In my review, I asked the authors to clarify the meaning of "spillover" within the HPAI transmission cycle. This term is still not entirely clear: at lines 409-410, the authors use the term with reference to transmission between wild birds and farmed birds, as distinct to transmission between farmed birds. It is implied but not explicitly stated that "spillover" is relevant to the transmission cycle in farmed birds only. The sentence, "we developed separate ecological niche models for wild and domestic bird HPAI occurrences ..." could have been supported by a clear sentence describing the transmission cycle, to prime the reader for why two separate models were necessary.

      We respectfully disagree that the term “spillover” is unclear in the manuscript. In both the Methods and Discussion sections (lines 387-391 and 409-414), we explicitly define “spillover” as the introduction of HPAI viruses from wild birds into domestic poultry, and we distinguish this from secondary farm-to-farm transmission. Our use of separate ecological niche models for wild and domestic outbreaks reflects not only the distinction between primary spillover and secondary transmission, but also the fundamentally different ecological processes, surveillance systems, and management implications that shape outbreaks in these two groups. We will clarify this choice in the revised manuscript when introducing the separate models. Furthermore, on line 83, we will add “as these two groups are influenced by different ecological processes, surveillance biases, and management contexts”.

      I also queried the importance of (dead-end) mammalian infections to a model of the HPAI transmission risk, to which the authors responded: "While spillover events of HPAI into mammals have been documented, these detections are generally considered dead-end infections and do not currently represent sustained transmission chains. As such, they fall outside the scope of our study, which focuses on avian hosts and models ecological suitability for outbreaks in wild and domestic birds." I would argue that any infections, whether they are in dead-end or competent hosts, represent the presence of environmental conditions to support transmission so are certainly relevant to a niche model and therefore within scope. It is certainly understandable if the authors have not been able to access data of mammalian infections, but it is an oversight to dismiss these infections as irrelevant.

      We understand the Reviewer’s point, but our study was designed to model HPAI occurrence in avian hosts only. We therefore restricted our analysis to wild birds and domestic poultry, which represent the primary hosts for HPAI circulation and the focus of surveillance and control measures. While mammalian detections have been reported, they are outside the scope of this work.

      Correlative ecological niche models, including BRTs, learn relationships between occurrence data and covariate data to make predictions, irrespective of correlations between covariates. I am not convinced that the authors can make any "interpretation" (line 298) that the covariates that are most informative to their models have any "influence" (line 282) on their response variable. Indeed, the observation that "land-use and climatic predictors do not play an important role in the niche ecological models" (line 286), while "intensive chicken population density emerges as a significant predictor" (line 282) begs the question: from an operational perspective, is the best (e.g., most interpretable and quickest to generate) model of HPAI risk a map of poultry farming intensity?

      We agree that poultry density may partly reflect reporting bias, but we also assumed it to be a meaningful predictor of HPAI risk. Its importance in our models is therefore expected. Importantly, our BRT framework does more than reproduce poultry distribution: it captures non-linear relationships and interactions with other covariates, allowing a more nuanced characterisation of risk than a simple poultry density map. Note also that our models distinguish between intensive chicken density, extensive chicken density, and duck density. Therefore, the output is not a “map of poultry farming intensity”. 

      At line 282, we used the word “influence” while fully recognising that correlative models cannot establish causality. Indeed, in our analyses, “relative influence” refers to the importance metric produced by the BRT algorithm (Ridgeway, 2020), which measures correlative associations between environmental factors and outbreak occurrences. These scores are interpreted in light of the broader scientific literature; our interpretations therefore build on both our results and existing evidence, rather than on our models alone. However, in the next version of the paper, we will revise the sentence as: “underscoring the strong association of poultry farming practices with HPAI spread (Dhingra et al., 2016)”. 

      I have more significant concerns about the authors' treatment of sampling bias: "We agree with the Reviewer's comment that poultry density could have potentially been considered to guide the sampling effort of the pseudo-absences to consider when training domestic bird models. We however prefer to keep using a human population density layer as a proxy for surveillance bias to define the relative probability to sample pseudo-absence points in the different pixels of the background area considered when training our ecological niche models. Indeed, given that poultry density is precisely one of the predictors that we aim to test, considering this environmental layer for defining the relative probability to sample pseudo-absences would introduce a certain level of circularity in our analytical procedure, e.g. by artificially increasing the influence of that particular variable in our models." The authors have elected to ignore a fundamental feature of distribution modelling with occurrence-only data: if we include a source of sampling bias as a covariate and do not include it when we sample background data, then that covariate would appear to be correlated with presence. They acknowledge this later in their response to my review: "...assuming a sampling bias correlated with poultry density would result in reducing its effect as a risk factor." In other words, the apparent predictive capacity of poultry density is a function of how the authors have constructed the sampling bias for their models. A reader of the manuscript can reasonably ask the question: to what degree is the model a model of HPAI transmission risk, and to what degree is the model a model of the observation process? 
      The sentence at lines 474-477 is a helpful addition; however, the preceding sentence, "Another approach to sampling pseudo-absences would have been to distribute them according to the density of domestic poultry," (line 474) is included without acknowledgement of the flow-on consequence to one of the key findings of the manuscript, that "...intensive chicken population density emerges as a significant predictor..." (line 282). The additional context on the EMPRES-i dataset at line 475-476 ("the locations of outbreaks ... are often georeferenced using place name nomenclatures") is in conflict with the description of the dataset at line 407 ("precise location coordinates"). Ultimately, the choices that the authors have made are entirely defensible through a clear, concise description of model features and assumptions, and precise language to guide the reader through interpretation of results. I am not satisfied that this is provided in the revised manuscript.

      We thank the Reviewer for this important point. To address it, we compared model predictive performance and covariate relative influences obtained when pseudo-absences were weighted by poultry density versus human population density (Author response table 1). The results show that differences between the two approaches are marginal, both in predictive performance (ΔAUC ranging from -0.013 to +0.002) and in the ranking of key predictors (see Author response images 1 and 2 below). For instance, intensive chicken density consistently emerged as an important predictor regardless of the bias layer used.

      Note: the comparison was conducted using a simplified BRT configuration for computational efficiency (fewer trees, fixed 5-fold random cross-validation, and standardised parameters). Therefore, absolute values of AUC and variable importance may differ slightly from those in the manuscript, but the relative ranking of predictors and the overall conclusions remain consistent.

      Given these small differences, we retained the approach using human population density. We agree that poultry density partly reflects surveillance bias as well as true epidemiological risk, and we will clarify this in the revised manuscript by noting that the predictive role of poultry density reflects both biological processes and surveillance systems. Furthermore, on line 289, we will add “We note, however, that intensive poultry density may reflect both surveillance intensity and epidemiological risk, and its predictive role in our models should be interpreted in light of both processes”.
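
      To make the role of the bias layer concrete, the following minimal sketch draws pseudo-absence cells with probability proportional to a weight layer (for example, human population density or poultry density per grid cell). It is an illustration under stated assumptions rather than the procedure used in the manuscript: the function name `sample_pseudo_absences`, the flat list of per-cell weights, and sampling with replacement are all hypothetical choices.

```python
import random


def sample_pseudo_absences(weights, n_points, seed=0):
    """Draw grid-cell indices with probability proportional to `weights`.

    `weights` is a hypothetical flat list holding one non-negative bias-layer
    value per background cell. Cells with weight 0 are never drawn.
    Sampling is with replacement.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if sum(weights) <= 0:
        raise ValueError("at least one weight must be positive")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.choices(range(len(weights)), weights=weights, k=n_points)


# Toy bias layer: cell 2 has three times the weight of cell 1.
bias_layer = [0.0, 1.0, 3.0, 0.0]
draws = sample_pseudo_absences(bias_layer, n_points=1000)
print(draws.count(1), draws.count(2))
```

      Swapping the weight layer changes where pseudo-absences concentrate, and therefore which covariates appear to separate presences from background, which is the mechanism at stake in the comparison reported in Author response table 1.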

      Author response table 1.

      Comparison of model predictive performance (AUC) when pseudo-absences were weighted by poultry density versus human population density, across host groups, virus types, and time periods. Differences in AUC values are shown as the value for poultry-weighted minus human-weighted pseudo-absences.

      Author response image 1.

      Comparison of variable relative influence (%) between models trained with pseudo-absences weighted by poultry density (red) and human population density (blue) for domestic bird outbreaks. Results are shown for four datasets: H5N1 (<2020), H5N1 (>2020), H5Nx (<2020), and H5Nx (>2020).

      Author response image 2.

      Comparison of variable relative influence (%) between models trained with pseudo-absences weighted by poultry density (red) and human population density (blue) for wild bird outbreaks. Results are shown for three datasets: H5N1 (>2020), H5Nx (<2020), and H5Nx (>2020).

      The authors have slightly misunderstood my comment on "extrapolation": I referred to "environmental extrapolation" in my review without being particularly explicit about my meaning. By "environmental extrapolation", I meant to ask whether the models were predicting to environments that are outside the extent of environments included in the occurrence data used in the manuscript. The authors appear to have understood this to be a comment on geographic extrapolation, or predicting to areas outside the geographic extent included in occurrence data, e.g.: "For H5Nx post-2020, areas of high predicted ecological suitability, such as Brazil, Bolivia, the Caribbean islands, and Jilin province in China, likely result from extrapolations, as these regions reported few or no outbreaks in the training data" (lines 195-197). Is the model extrapolating in environmental space in these regions? This is unclear. I do not suggest that the authors should carry out further analysis, but the multivariate environmental similarity surface (MESS; see Elith et al., 2010) is a useful tool to visualise environmental extrapolation and aid model interpretation.

      On the subject of "extrapolation", I am also concerned by the additions at lines 362-370: "...our models extrapolate environmental suitability for H5Nx in wild birds in areas where few or no outbreaks have been reported. This discrepancy may be explained by limited surveillance or underreporting in those regions." The "discrepancy" cited here is a feature of the input dataset, a function of the observation distribution that should be captured in pseudo-absence data. The authors state that Kazakhstan and Central Asia are areas of interest, and that the environments in this region are outside the extent of environments captured in the occurrence dataset, although it is unclear whether "extrapolation" is informed by a quantitative tool like a MESS or judged by some other qualitative test. The authors then cite Australia as an example of a region with some predicted suitability but no HPAI outbreaks to date, however this discussion point is not linked to the idea that the presence of environmental conditions to support transmission need not imply the occurrence of transmission (as in the addition, "...spatial isolation may imply a lower risk of actual occurrences..." at line 214). Ultimately, the authors have not added any clear comment on model uncertainty (e.g., variation between replicated BRTs) as I suggested might be helpful to support their description of model predictions.

      Many thanks for the clarification. Indeed, we interpreted your previous comments in terms of geographic extrapolations. We thank the Reviewer for these observations. We will adjust the wording to further clarify that predictions of ecological suitability in areas with few or no reported outbreaks (e.g., Central Asia, Australia) are not model errors but expected extrapolations, since ecological suitability does not imply confirmed transmission (for instance, on Line 362: “our models extrapolate environmental suitability” will be changed to “Interestingly, our models extrapolate geographical”). These predictions indicate potential environments favorable to circulation if the virus were introduced.

      In our study, model uncertainty is formally assessed when comparing the predictive performances of our models (Fig. S3, Table S1), the relative influence (Table S3) and response curves (Fig. 2) associated with each environmental factor (Table S2). All these results confirm good convergence between replicate models. Finally, we indeed did not use a quantitative tool such as a MESS to assess extrapolation but relied instead on qualitative interpretation of model outputs.
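
      For readers unfamiliar with the MESS mentioned above, the sketch below computes a simplified version of the score for a single prediction site, following the general recipe described by Elith et al. (2010): each covariate receives a similarity value relative to the reference (training) points, and the site's score is the minimum across covariates. This was not part of the analysis in the manuscript; the function name `mess_score` and the list-of-rows data layout are assumptions for illustration.

```python
def mess_score(reference, point):
    """Simplified multivariate environmental similarity for one site.

    reference: list of rows, one per reference location, each a list of
               covariate values (hypothetical layout).
    point:     covariate values at the prediction site, same order.
    Negative scores indicate that at least one covariate lies outside
    the range seen in the reference data (environmental extrapolation).
    """
    similarities = []
    for j, p in enumerate(point):
        values = [row[j] for row in reference]
        lo, hi = min(values), max(values)
        span = hi - lo
        if span == 0:
            raise ValueError("covariate %d is constant in the reference data" % j)
        # Percentage of reference values strictly below the site's value.
        f = 100.0 * sum(v < p for v in values) / len(values)
        if f == 0:
            s = 100.0 * (p - lo) / span   # at or below the reference minimum
        elif f <= 50:
            s = 2.0 * f
        elif f < 100:
            s = 2.0 * (100.0 - f)
        else:
            s = 100.0 * (hi - p) / span   # above the reference maximum
        similarities.append(s)
    return min(similarities)


reference = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [3.0, 40.0], [4.0, 50.0]]
inside = mess_score(reference, [2.0, 30.0])    # within both ranges
outside = mess_score(reference, [2.0, 60.0])   # second covariate out of range
print(inside, outside)
```

      Mapping such a score over the prediction area would show where high predicted suitability coincides with environmental conditions outside those represented in the outbreak data.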

      All of my criticisms are, of course, applied with the understanding that niche modelling is imperfect for a disease like HPAI, and that data may be biased/incomplete, etc.: these caveats are common across the niche modelling literature. However, if language around the transmission cycle, the niche, and the interpretation of any of the models is imprecise, which I find it to be in the revised manuscript, it undermines all of the science that is presented in this work.

      We respectfully disagree with this comment. The scope of our study and the methods employed are clearly defined in the manuscript, and the limitations of ecological niche modelling in this context are explicitly acknowledged in the Discussion section. While we appreciate the Reviewer’s concern, the comment does not provide specific examples of unclear or imprecise language regarding the transmission cycle, niche, or interpretation of the models. Without such examples, it is difficult to identify further revisions that would improve clarity.

      Reviewer #2 (Public review):

      The geographic range of highly pathogenic avian influenza cases changed substantially around the period 2020, and there is much interest in understanding why. Since 2020 the pathogen irrupted in the Americas and the distribution in Asia changed dramatically. This study aimed to determine which spatial factors (environmental, agronomic and socio-economic) explain the change in numbers and locations of cases reported since 2020 (2020--2023). That's a causal question which they address by applying a correlative environmental niche modelling (ENM) approach to the avian influenza case data before (2015--2020) and after 2020 (2020--2023) and separately for confirmed cases in wild and domestic birds. To address their questions they compare the outputs of the respective models, and those of the first global model of the HPAI niche published by Dhingra et al 2016.

      We do not agree with this comment. The manuscript makes clear that we are quantitatively assessing factors that are associated with occurrence data before and after 2020. We do not claim to determine causality. One sentence of the Introduction section (lines 75-76) could be confusing, so we intend to modify it in the final revision of our manuscript. 

      ENM is a correlative approach useful for extrapolating understandings based on sparse geographically referenced observational data over un- or under-sampled areas with similar environmental characteristics in the form of a continuous map. In this case, because the selected covariates about land cover, use, population and environment are broadly available over the entire world, modelled associations between the response and those covariates can be projected (predicted) back to space in the form of a continuous map of the HPAI niche for the entire world.

      We fully agree with this assessment of ENM approaches.

      Strengths:

      The authors are clear about expected bias in the detection of cases, such as geographic variation in surveillance effort (testing of symptomatic or dead wildlife, testing domestic flocks) and in general more detections near areas of higher human population density (because if a tree falls in a forest and there is no-one there, etc), and take steps to ameliorate those. The authors use boosted regression trees to implement the ENM, which typically feature among the best performing models for this application (also known as habitat suitability models). They ran replicate sets of the analysis for each of their model targets (wild/domestic x pathogen variant), which can help produce stable predictions. Their code and data are provided, though I did not verify that the work was reproducible.

      The paper can be read as a partial update to the first global model of H5Nx transmission by Dhingra and others published in 2016 and explicitly follows many methodological elements. Because they use the same covariate sets as used by Dhingra et al 2016 (including the comparisons of the performance of the sets in spatial cross-validation) and for both time periods of interest in the current work, comparison of model outputs is possible. The authors further facilitate those comparisons with clear graphics and supplementary analyses and presentation. The models can also be explored interactively at a weblink provided in text, though it would be good to see the model training data there too.

      The authors' comparison of ENM model outputs generated from the distinct HPAI case datasets is interesting and worthwhile, though for me, only as a response to differently framed research questions.

      Weaknesses:

      This well-presented and technically well-executed paper has one major weakness to my mind. I don't believe that ENM models were an appropriate tool to address their stated goal, which was to identify the factors that "explain" changing HPAI epidemiology.

      Here is how I understand and unpack that weakness:

      (1) Because of their fundamentally correlative nature, ENMs are not a strong candidate for exploring or inferring causal relationships.

      (2) Generating ENMs for a species whose distribution is undergoing broad scale range change is complicated and requires particular caution and nuance in interpretation (e.g., Elith et al., 2010). An important general assumption of environmental niche models is that the target species is at some kind of distributional equilibrium (at time scales relevant to the model application). In practice that means the species has had an opportunity to reach all suitable habitats and therefore its absence from some can be interpreted as either unfavourable environment or interactions with other species. Here data sets for the response (H5N1 or H5Nx case data in domestic or wild birds) were divided into two periods, 2015-2020 and 2020-2023, based on the rationale that the geographic locations and host-species profile of cases detected in the latter period were suggestive of changed epidemiology. In comparing outputs from multiple ENMs for the same target from distinct time periods the authors are expertly working in, or even dancing around, what is a known grey area, and they need to make the necessary assumptions and caveats obvious to readers.

      We thank the Reviewer for this observation. First, we constrained pseudo-absence sampling to countries and regions where outbreaks had been reported, reducing the risk of interpreting non-affected areas as environmentally unsuitable. Second, we deliberately split the outbreak data into two periods (2015-2020 and 2020-2023) because we do not assume a single stable equilibrium across the full study timeframe. This division reflects known epidemiological changes around 2020 and allows each period to be modeled independently. Within each period, ENM outputs are interpreted as associations between outbreaks and covariates, not as equilibrium distributions. Finally, by testing prediction across periods, we assessed both niche stability and potential niche shifts. These clarifications will be added to the manuscript to make our assumptions and limitations explicit.

      Line 66, we will add: “Ecological niche model outputs for range-shifting pathogens must therefore be interpreted with caution (Elith et al., 2010). Despite this limitation, correlative ecological niche models remain useful for identifying broad-scale associations and potential shifts in distribution. To account for this, we analysed two distinct time periods (2015-2020 and 2020-2023).”

      Line 123, we will revise “These findings underscore the ability of pre-2020 models in forecasting the recent geographic distribution of ecological suitability for H5Nx and H5N1 occurrences” to “These results suggest that pre-2020 models captured broad patterns of suitability for H5Nx and H5N1 outbreaks, while post-2020 models provided a closer fit to the more recent epidemiological situation”.
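      The cross-period test described above (training on one period and predicting the other) can be illustrated with a rank-based AUC. This is a minimal sketch only: the response does not state which metric was used, so AUC is an assumption here, and the scores below are made-up suitability values, not results from the study.

```python
def auc(scores_presence, scores_absence):
    """Probability that a random presence point outscores a random
    pseudo-absence point; ties count as half a win."""
    if not scores_presence or not scores_absence:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in scores_presence:
        for a in scores_absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(scores_presence) * len(scores_absence))


# Hypothetical suitability scores from a model trained on the earlier period,
# evaluated at later-period presence and pseudo-absence locations.
post2020_presence = [0.9, 0.8, 0.4]
post2020_absence = [0.7, 0.3, 0.2]
cross_period_auc = auc(post2020_presence, post2020_absence)
```

      A value near 0.5 would mean the earlier-period model ranks later-period outbreaks no better than chance; values closer to 1 indicate that the ranking transfers across periods.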

      (3) To generate global prediction maps via ENM, only variables that exist at appropriate resolution over the desired area can be supplied as covariates. What processes could influence changing epidemiology of a pathogen and are there covariates that represent them? Introduction to a new geographic area (continent) with naive population, immunity in previously exposed populations, control measures to limit spread such as vaccination or destruction of vulnerable populations or flocks? Might those control measures be more or less likely depending on the country as a function of its resources and governance? There aren't globally available datasets that speak to those factors, so the question is not why were they omitted but rather was the authors' decision to choose ENMs given their question justified? How valuable are insights based on patterns of correlation change when considering different temporal sets of HPAI cases in relation to a common and somewhat anachronistic set of covariates?

      We agree that the ecological niche models trained in our study are limited to environmental and host factors, as described in the Methods section with the selection of predictors. While such models cannot capture causality or represent processes such as immunity, control measures, or governance, they remain a useful tool for identifying broad associations between outbreak occurrence and environmental context. Our study cannot infer the full mechanisms driving changes in HPAI epidemiology, but it does provide a globally consistent framework to examine how associations with available covariates vary across time periods.

      (4) In general the study is somewhat incoherent with respect to time. Though the case data come from different time periods, each response dataset was modelled separately using exactly the same covariate dataset that predated both sets. That decision should be understood as a strong assumption on the part of the authors that conditions the interpretation: the world (as represented by the covariate set) is immutable, so the model has to return different correlative associations between the case data and the covariates to explain the new data. While the world represented by the selected covariates *may* be relatively stable (could be statistically confirmed), what about the world not represented by the covariates (see point 3)?

      We used the same covariate layers for both periods, which indeed assumes that these environmental and host factors are relatively stable at the global scale over the short timeframe considered. We believe this assumption is reasonable, as poultry density, land cover, and climate baselines do not change drastically between 2015 and 2023 at the resolution of our analysis. We agree, however, that unmeasured processes such as control measures, immunity, or governance may have changed during this time and are not captured by our covariates.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      - Line 400-401: "over the 2003-2016 periods" has an extra "s"; "two host species" (with reference to wild and domestic birds) would be more precise as "two host groups".

      - Remove comma line 404

      Many thanks for these comments, we have modified the text accordingly.

      Reviewer #2 (Recommendations for the authors):

      Most of my work this round is encapsulated in the public part of the review.

      The authors responded positively to the review efforts from the previous round, but I was underwhelmed with the changes to the text that resulted. Particularly in regard to limiting assumptions - the way that they augmented the text to refer to limitations raised in review downplayed the importance of the assumptions they've made. So they acknowledge the significance of the limitation in their rejoinder, but in the amended text merely note the limitation without giving any sense of what it means for their interpretation of the findings of this study.

      The abstract and findings are essentially unchanged from the previous draft.

      I still feel the near causal statements of interpretation about the covariates are concerning. These models really are not a good candidate for supporting the inference that they are making and there seem to be very strong arguments in favour of adding covariates that are not globally available.

      We never claimed causal interpretation, and we have consistently framed our analyses in terms of associations rather than mechanisms. We acknowledge that one phrasing in the research questions (“Which factors can explain…”) could be misinterpreted, and we are correcting this in the revised version to read “Which factors are associated with…”. Our approach follows standard ecological niche modelling practice, which identifies statistical associations between occurrence data and covariates. As noted in the Discussion section, these associations should not be interpreted as direct causal mechanisms. Finally, all interpretive points in the manuscript are supported by published literature, and we consider this framing both appropriate and consistent with best practice in ecological niche modelling (ENM) studies.

      We assessed predictor contributions using the “relative influence” metric, the terminology reported by the R package “gbm” (Ridgeway, 2020). This metric quantifies the contribution of each variable to model fit across all trees, rescaled to sum to 100%, and should be interpreted as an association rather than a causal effect.
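      The rescaling step mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the internals of the “gbm” package: the variable names and raw contribution totals are hypothetical stand-ins for whatever per-variable improvement a fitted model accumulates across its trees.

```python
def relative_influence(raw_contribution):
    """Rescale per-variable raw contributions so that they sum to 100."""
    total = sum(raw_contribution.values())
    if total <= 0:
        raise ValueError("total contribution must be positive")
    return {name: 100.0 * value / total
            for name, value in raw_contribution.items()}


# Hypothetical raw totals for three covariates (illustrative numbers only).
raw = {"poultry_density": 6.0, "human_population": 3.0, "cropland": 1.0}
influence = relative_influence(raw)
```

      Because the values are shares of a fixed total, a rise in one variable's relative influence necessarily lowers the others', which is one more reason to read them as descriptions of model fit rather than as effect sizes.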

      L65-66 The general difficulty of interpreting ENM output with range-shifting species should be cited here to alert readers that they should not blithely attempt what follows at home.

      I believe that their analysis is interesting and technically very well executed, so it has been a disappointment and hard work to write this assessment. My rough-cut last paragraph of a reframed intro would go something like - there are many reasons in the literature not to do what we are about to do, but here's why we think it can be instructive and informative, within certain guardrails.

      To acknowledge this comment and the previous one, we revised lines 65-66 to: “However, recent outbreaks raise questions about whether earlier ecological niche models still accurately predict the current distribution of areas ecologically suitable for the local circulation of HPAI H5 viruses. Ecological niche model outputs for range-shifting pathogens must therefore be interpreted with caution (Elith et al., 2010). Despite this limitation, correlative ecological niche models remain useful for identifying broad-scale associations and potential shifts in distribution.”

      We respectfully disagree with the Reviewer’s statement that “there are many reasons in the literature not to do what we are about to do”. All modeling approaches, including mechanistic ones, have limitations, and the literature is clear on both the strengths and constraints of ecological niche models. Our manuscript openly acknowledges these limits and frames our findings accordingly. We therefore believe that our use of an ENM approach is justified and contributes valuable insights within these well-defined boundaries.

      Reference: Ridgeway, G. (2007). Generalized Boosted Models: A guide to the gbm package. Update, 1(1), 2007.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      I am concerned by the authors' conceptualisation of "niche" within the manuscript. Is the "niche" we are modelling the niche of the pathogen itself? The niche of the (wild) bird host species as a group? The niche of HPAI transmission within (wild) bird host species (i.e., an intersection of pathogen and bird niches)? Or the niche of HPAI transmission in poultry? The precise niche being modelled should be clarified in the Introduction or early in the Methods of the manuscript. The first two definitions of niche listed above are relevant, but separate from the niche modelled in the manuscript - this should be acknowledged.

      We acknowledge that these concepts were probably not defined clearly enough in the previous version of our manuscript, and we have now included an explicit definition in the fourth paragraph of the Introduction section: “We developed separate ecological niche models for wild and domestic bird HPAI occurrences, these models thus predicting the ecological suitability for the risk of local viral circulation leading to the detection of HPAI occurrences within each host group (rather than the niche of the virus or the host species alone).”

      The authors should consider the precise transmission cycle involved in each HPAI case: "index cases" in farmed poultry, caused by "spillover" from wild birds, are relevant to the wildlife transmission cycle, while the ecological conditions coinciding with subsequent transmission in farmed poultry are likely to be fundamentally different. (For example, subsequent transmission is not conditional on the presence of wild birds.) Modelling these two separate, but linked, transmission cycles together may omit important nuances from the modelling framework.

      We thank the Reviewer for highlighting the distinction between primary (wild-to-domestic) and secondary (farm-to-farm) transmission cycles. Our modelling framework was designed to assess the ecological suitability of HPAI occurrences in wild and domestic birds separately. In the domestic poultry models, the response variable is the confirmed outbreak data, which do not distinguish between index cases resulting from primary introductions and cases resulting from secondary infections.

      One of the aims of the study is to evaluate the spatial distribution of areas ecologically suitable for local H5N1/x circulation either leading to domestic or wild bird cases, i.e. to identify environmental conditions where the virus may have persisted or spread, whether as a result of introduction by wild birds or farm-to-farm transmission. Introducing mechanistic distinctions in the response variable would not necessarily improve or affect the ecological suitability maps, since each type of transmission is likely to be associated with different covariates that are included in the models.

      Also, the EMPRES-i database does not indicate whether each record corresponds to an index case or a secondary transmission event, so in practice it would not be possible to produce two different models. However, we agree that distinguishing between types of transmission is an interesting perspective for future research. This could be explored, for example, by mapping interfaces between wild and domestic bird populations or by inferring outbreak transmission trees using genomic data when available.

      To avoid confusion, we now explicitly clarify this aspect in the Materials and Methods section: “It is important to note that the EMPRES-i database does not distinguish between index cases (e.g., primary spillover from wild birds) and secondary farm-to-farm transmissions. As such, our ecological niche models are trained on confirmed HPAI outbreaks in poultry that may result from different transmission dynamics — including both initial introduction events influenced by environmental factors and subsequent spread within poultry systems.”

      We now also address this limitation in the Discussion section: “Finally, our models for domestic poultry do not distinguish between primary introduction events (e.g., spillover from wild birds) and secondary transmission between farms due to limitations in the available surveillance data. While environmental factors likely influence the risk of initial spillover events, secondary spread is more often driven by anthropogenic factors such as biosecurity practices and poultry trade, which are not included in our current modelling framework.”

      The authors should clarify the meaning of "spillover" within the HPAI transmission cycle: if spillover transmission is from wild birds to farmed poultry, then subsequent transmission in poultry is separate from the wildlife transmission cycle. This is particularly relevant to the Discussion paragraph beginning at line 244: does "farm to farm transmission" have a distinct ecological niche to transmission between wild birds, and transmission between wild birds and farmed birds? And while there has been a spillover of HPAI to mammals, could the authors clarify that these detections are dead-end? And not represented in the dataset? Dhingra et al., 2016 comment on the contrast between models of "directly transmitted" pathogens, such as HPAI, and vector-borne diseases: for vector-borne diseases, "clear eco-climatic boundaries of vectors can be mapped", whereas "HPAI is probably not as strongly environmentally constrained". This is an important piece of nuance in their Discussion and a comment to a similar effect may be of use in this manuscript.

      Following the Reviewer’s previous comment, we have now added clarifications in the Methods and Discussion sections defining spillover as the transmission of HPAI viruses from wild birds to domestic poultry (index cases), and secondary transmission as onward spread between farms. As mentioned in our answer above, we now emphasise that our models do not distinguish these dynamics, which are likely to be influenced by different drivers — ecological in the case of spillover, and often anthropogenic (e.g., poultry trade movement, biosecurity) in the case of farm-to-farm transmission.

      The discussion regarding farm-to-farm transmission and spillovers is indeed an interpretation derived from the covariates analysis (see the second paragraph in the Discussion section). Specifically, we observed a stronger association between HPAI occurrences and domestic bird density after 2020, which may suggest that secondary infections (e.g., farm-to-farm transmission) became more prominent or more frequently reported. We however acknowledge that our data do not allow us to distinguish primary introductions from secondary transmission events, and we have added a sentence to explicitly clarify this: “However, this remains an interpretation, as the available data do not allow us to distinguish between index cases and secondary transmission events.”

      We thank the Reviewer for raising the point of mammalian infections. While spillover events of HPAI into mammals have been documented, these detections are generally considered dead-end infections and do not currently represent sustained transmission chains. As such, they fall outside the scope of our study, which focuses on avian hosts and models ecological suitability for outbreaks in wild and domestic birds. However, we agree that future work could explore the spatial overlap between mammalian outbreak detections and ecological suitability maps for wild birds to assess whether such spillovers may be linked to localised avian transmission dynamics.

      Finally, we have added a comment about the differences between pathogens strongly constrained by the environment and HPAI: “This suggests that HPAI H5Nx is not as strongly environmentally constrained as vector-borne pathogens, for which clear eco-climatic boundaries (e.g., vector borne diseases) can be mapped (Dhingra et al., 2016).” This aligns with the interpretation provided by Dhingra and colleagues (2016) and helps contextualise the predictive limitations of ecological niche models for directly transmitted pathogens like HPAI.

      There are several places where some simple clarification of language could answer my questions related to ecological niches. For example, on line 74, "the ecological niche" should be followed by "of the pathogen", or "of HPAI transmission in wild birds", or some other qualifier that is most appropriate to the Authors' conceptualisation of the niche modelled in the manuscript. Similarly, in the following sentence, "areas at risk" could be followed by "of transmission in wild birds", to make the transmission cycle that is the subject of modelling clear to the reader. On line 83, it is not clear who or what is the owner of "their ecological niches": is this "poultry and wild birds", or the pathogen?

      We agree with that suggestion and have now modified the related part of the text accordingly (e.g., “areas at risk for local HPAI circulation” and “of HPAI in wild or domestic birds”).

      I am concerned by the authors' treatment of sampling bias in their BRT modelling framework. If we are modelling the niche of HPAI transmission, we would expect places that are more likely to be subject to disease surveillance to be represented in the set of locations where the disease has been detected. I do not agree that pseudo-absence points are sampled "to account for the lack of virus detection in some areas" - this description is misleading and does not match the following sentence ("pseudo-absence points sampled ... to reflect the greater surveillance efforts ..."). The distribution of pseudo-absences should aim to capture the distribution of probable disease surveillance, as these data act as a stand-in for missing negative surveillance records. It is sensible that pseudo-absences for disease detection in wild birds are sampled proportionately to human population density, as the disease is detected in dead wild birds, which are more likely to be identified close to areas of human occupation (as stated on line 163). However, I do not agree that the same applies to poultry - the density of farmed poultry is likely to be a better proxy for surveillance intensity in farmed birds. Human population density and farmed poultry density may be somewhat correlated (i.e., both are low in remote areas), but poultry density is likely to be higher in rural areas, which are assumed to have relatively lower surveillance intensity under the current approach. The authors allude to this in the Discussion: "monitoring areas with high intensive chicken densities ... remains crucial for the early detection and management of HPAI outbreaks".

      We agree with the Reviewer's comment that poultry density could have been used to guide the sampling of pseudo-absences when training domestic bird models. However, we prefer to keep using a human population density layer as a proxy for surveillance bias to define the relative probability of sampling pseudo-absence points in the different pixels of the background area considered when training our ecological niche models. Indeed, given that poultry density is precisely one of the predictors that we aim to test, using this layer to define the relative probability of sampling pseudo-absences would introduce a certain level of circularity in our analytical procedure, e.g. by artificially increasing the influence of that particular variable in our models.

      Furthermore, it is also worth noting that, to better account for variations in surveillance intensity, we also adjusted the sampling effort by allocating pseudo-absences in proportion to the number of confirmed outbreaks per administrative unit (country or sub-national regions for Russia and China). This approach aimed to reduce bias caused by uneven reporting and surveillance efforts between regions. Additionally, we restricted model training to countries or regions with a minimum surveillance threshold (at least five confirmed outbreaks per administrative unit). Therefore, both presence and pseudo-absence points originated from areas with more consistent surveillance data.
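      The allocation logic described in the two paragraphs above can be sketched as follows. This is a simplified illustration, not the code used in the study: the unit names, pixel identifiers and weights are hypothetical, pixels are drawn with replacement, and the spatial distance filtering mentioned elsewhere in the response is omitted.

```python
import random


def allocate_pseudo_absences(outbreaks_per_unit, total_points, min_outbreaks=5):
    """Share pseudo-absence points among administrative units in proportion
    to their confirmed outbreak counts, keeping only units that reach the
    minimum outbreak threshold. Rounding means the shares may not sum
    exactly to total_points."""
    eligible = {unit: n for unit, n in outbreaks_per_unit.items()
                if n >= min_outbreaks}
    total = sum(eligible.values())
    if total == 0:
        return {}
    return {unit: round(total_points * n / total)
            for unit, n in eligible.items()}


def sample_pixels(pixel_weights, k, seed=0):
    """Draw k pixels (with replacement) with probability proportional to a
    surveillance proxy such as human population density."""
    rng = random.Random(seed)
    pixels = list(pixel_weights)
    weights = [pixel_weights[p] for p in pixels]
    return rng.choices(pixels, weights=weights, k=k)


# Hypothetical outbreak counts: unit "C" falls below the threshold of five.
outbreaks = {"A": 30, "B": 10, "C": 3}
allocation = allocate_pseudo_absences(outbreaks, total_points=400)

# Hypothetical population-density weights for three pixels inside unit "A".
draws = sample_pixels({"p1": 9.0, "p2": 1.0, "p3": 0.0}, k=20)
```

      Swapping the weights passed to `sample_pixels` from human population density to poultry density is the alternative raised by the Reviewer; the circularity concern arises because the same layer would then act both as sampling weight and as predictor.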

      We acknowledge in the Materials and Methods section that the approach proposed by the Reviewer could have been used: “Another approach to sampling pseudo-absences would have been to distribute them according to the density of domestic poultry.” Finally, our approach is also justified in our response to the next comment of the Reviewer.

      Having written my review, including the paragraph above, I briefly scanned Dhingra et al., and found that they provide justification for the use of human population density to sample pseudoabsences in farmed birds: "the Empres-i database compiles outbreak locations data from very heterogeneous sources and in the absence of explicit GPS location data, the geo-referencing of individual cases is often through the use of place name gazetteers that will tend to force the outbreak location populated place, rather in the exact location of the farm where the disease was found, which would introduce a bias correlated with human population density." This context is entirely missing from the manuscript under review, however, I maintain the comment in the paragraph above - have the Authors trialled sampling pseudo-absences from poultry density layers?

      We agree with the Reviewer’s comment and have now added this precision in the Materials and Methods section (in the third paragraph dedicated to ecological niche modelling): “However, as pointed out by Dhingra and colleagues (2016), the locations of outbreaks in the EMPRES-i database are often georeferenced using place name nomenclatures due to a lack of accurate GPS data, which could introduce a spatial bias towards populated areas.”

      The authors indirectly acknowledge the role of sampling bias in model predictions at line 163, however, this point could be clearer: there is sampling bias in the set of locations where HPAI has been observed and failure to adequately replicate this sampling bias in pseudo-absence data could lead covariates that are correlated with the observation distribution to appear to be correlated with the target distribution. This point is alluded to but should be clearly acknowledged to allow the reader to appropriately interpret your results. I understand the point being made on line 163 is that surveillance of HPAI in wild birds has become more structured and less opportunistic over time - if this is the case, a statement to this effect could replace "which could influence earlier data sets", which is a little ambiguous. The Authors acknowledge the role of sampling bias in lines 241-242 - this may be a good place to remind the reader that they have attempted to incorporate sampling bias through the selection of their pseudoabsence dataset, particularly for wild bird models.

      We thank the Reviewer for this comment. We have now clarified in the text that observed data on HPAI occurrence are inherently influenced by heterogeneous surveillance efforts and that failure to replicate this bias in pseudo-absence sampling could effectively lead to misleading correlations with covariates associated with surveillance effort rather than true ecological suitability. We have now rephrased the related sentence as follows: “This decline may indicate a reduced bias in observation data: typically, dead wild birds are more frequently found near human-populated areas due to opportunistic detections, whereas more recent surveillance efforts have become increasingly proactive (Giacinti et al., 2024).”

      Dhingra et al. aimed to account for the effect of mass vaccination of birds in China. This does not appear to be included in the updated models - is this a relevant covariate to consider in updated models? Are the models trained on pre-2020 data predicting to post-2020 given the same presence dataset as previous models? It may be helpful to provide a comment on this if we consider the pre-2020 models in this work to be representative of pre-2020 models as a cohort. Given the framing of the manuscript as an update to Dhingra et al., it may be useful for the authors to briefly summarise any differences between the existing models and updated models. Dhingra et al., also examine spatial extrapolation, which is not addressed here. Environmental extrapolation may be a useful metric to consider: are there areas where models are extrapolating that are predicted to be at high risk of HPAI transmission? Finally, they also provide some inset panels on global maps of model predictions - something similar here may also be useful.

      We thank the Reviewer for these comments. Vaccination coverage is indeed a relevant covariate for HPAI suitability in domestic birds. However, we did not include this variable in our updated models for two reasons. First, comprehensive vaccination data were only available for China, so it is not possible to include this variable in a global model. Second, available data were outdated and vaccination strategies can vary substantially over time.

      We however agree with the Reviewer that the Materials and Methods section did not clearly describe the differences with Dhingra et al. (2016), and we now detail these differences at the beginning of the Materials and Methods section: “Our approach is similar to the one implemented by Dhingra and colleagues (2016). While Dhingra et al. (2016) developed their models only for domestic birds over the 2003-2016 periods, our models were developed for two host species separately (wild and domestic birds) and for two time periods (2016-2020 and 2020-2023).”

      We also detail the main difference concerning the pseudo-absences sampling:  Dhingra and colleagues (2016) used human population density to sample pseudo-absences to reflect potential surveillance bias and also account for spatial filtering (min/max distances from presence). We adopted a similar strategy but also incorporated outbreak count per country or province (in the case of China and Russia) into the pseudo-absence sampling process to further account for within-country surveillance heterogeneity. We have now added these specifications in the Materials and Methods section: “To account for heterogeneity in AIV surveillance and minimise the risk of sampling pseudo-absences in poorly monitored regions, we restricted our analysis to countries (or administrative level 1 units in China and Russia) with at least five confirmed outbreaks. Unlike Dhingra et al. (2016), who sampled pseudoabsences across a broader global extent, our sampling was limited to regions with demonstrated surveillance activity. In addition, we adjusted the density of pseudo-absence points according to the number of reported outbreaks in each country or admin-1 unit, as a proxy for surveillance effort — an approach not implemented in this previous study.”

      We have now also provided a comparison between the different outputs, particularly in the Results section: “Our findings were overall consistent with those previously reported by Dhingra and colleagues (Dhingra et al., 2016), who used data from January 2004 to March 2015 for domestic poultry. However, some differences were noted: their maps identified higher ecological suitability for H5 occurrences before 2016 in North America, West Africa, eastern Europe, and Bangladesh, while our maps mainly highlight ecologically suitable regions in China, South-East Asia, and Europe (Fig. S5). In India, analyses consistently identified high ecologically suitable areas for the risk of local H5Nx and H5N1 circulation for the three time periods (pre-2016, 2016-2020, and post-2020). Similar to the results reported by Dhingra and colleagues, we observed an increase in the ecological suitability estimated for H5N1 occurrence in South America's domestic bird populations post-2020. Finally, Dhingra and colleagues identified high suitability areas for H5Nx occurrence in North America, which are predicted to be associated with a low ecological suitability in the 2016-2020 models.”

      We acknowledge that some regions predicted as highly suitable correspond to areas where extrapolation likely occurs due to limited or no recorded outbreaks. We have now added these specifications when discussing the resulting suitability maps obtained for domestic birds: “For H5Nx post-2020, areas of high predicted ecological suitability, such as Brazil, Bolivia, the Caribbean islands, and Jilin province in China, likely result from extrapolations, as these regions reported few or no outbreaks in the training data”, and, for wild birds: “Some of the areas with high predicted ecological suitability reflect the result of extrapolations. This is particularly the case in coastal regions of West and North Africa, the Nile Basin, Central Asia (Kyrgyzstan, Tajikistan, Uzbekistan), Brazil (including the Amazon and coastal areas), southern Australia, and the Caribbean, where ecological conditions are similar to those in areas where outbreaks are known to occur but where records of outbreaks are still rare.”

      For wild birds (H5Nx, post-2020), high ecological suitability was predicted along the West and North African coasts, the Nile basin, Central Asia (e.g., Kyrgyzstan, Tajikistan, Uzbekistan), the Brazilian coast and Amazon region, Caribbean islands, southern Australia, and parts of Southeast Asia. Ecological suitability estimated in these regions may directly result from extrapolations and should therefore be interpreted cautiously.

      We also added a discussion of the extrapolation for wild birds (in the Discussion section): “Interestingly, our models extrapolate environmental suitability for H5Nx in wild birds in areas where few or no outbreaks have been reported. This discrepancy may be explained by limited surveillance or underreporting in those regions. For instance, there is significant evidence that Kazakhstan and Central Asia play a role as a centre for the transmission of avian influenza viruses through migratory birds (Amirgazin et al., 2022; FAO, 2005; Sultankulova et al., 2024). However, very few wild bird cases are reported in EMPRES-i. In contrast, Australia appears environmentally suitable in our models, yet no incursion of HPAI H5N1 2.3.4.4b has occurred despite the arrival of millions of migratory shorebirds and seabirds from Asia and North America. Extensive surveillance in 2022 and 2023 found no active infections nor evidence of prior exposure to the 2.3.4.4b lineage (Wille et al., 2024; Wille and Klaassen, 2023).”

      We agree that inset panels can be helpful for visualising global patterns. However, all resulting maps are available on the MOOD platform (https://app.mood-h2020.eu/core), which provides an interactive interface allowing users to zoom in and out, identify specific locations using a background map, and explore the results in greater detail. This resource is referenced in the manuscript to guide readers to the platform.

      Related to my review of the manuscript's conceptualisation above, there are several inconsistencies in terminology in the manuscript - clearing these up may help to make the methods and their justification clearer to the reader. The "signal" that the models are estimating is variously described as "susceptibility" and "risk" (lines 179-180), "HPAI H5 ecological suitability" (line 78), "likelihood of HPAI occurrences" (line 139), "risk of HPAI circulation" (line 187), "distribution of occurrence data" (line 428). Each of these quantities has slightly different meanings and it is confusing to the reader that all of these descriptors are used for model output. "Likelihood of HPAI occurrences" is particularly misleading: ecological niche models predict high suitability for a species in areas that are similar to environments where it has previously been identified, without imposing constraints on species movement. It is intuitively far more likely that there will be HPAI occurrences in areas where the disease is already established than in areas where an introduction event is required, however, the niche models in this work do not include spatial relationships in their predictions.

      We agree with the Reviewer’s comments. We have now modified the text so that, in the Results section, we refer to ecological suitability when describing the outputs of the models. In the Discussion section, we then interpret this ecological suitability in terms of risk, with areas of high ecological suitability being more likely to support local HPAI outbreaks.

      I also caution the authors in their interpretation of the results of BRTs, which are correlative models, so therefore do not tell us what causes a response variable, but rather what is correlated with it. On Line 31, "correlated with" may be more appropriate than "influenced by". On Line 82, "correlated with" is more appropriate than "driving". This is particularly true given the authors' treatment of sampling bias.

      We agree with the Reviewer’s comment and have now rephrased these sentences as follows: “The spatial distribution of HPAI H5 occurrences in wild birds appears to be primarily correlated with urban areas and open water regions” and “Our results provide a better understanding of HPAI dynamics by identifying key environmental factors correlated with the increase in H5Nx and H5N1 cases in poultry and wild birds, investigating potential shifts in their ecological niches, and improving the prediction of at-risk areas.”

      The following sentences in line 201 are ambiguous: "For both H5Nx and H5N1, however, isolated areas on the risk map should be interpreted with caution. These isolated areas may result from sparse data, model limitations, or local environmental conditions that may not accurately reflect true ecological suitability." By "isolated", do the authors mean remote? Or ecologically dissimilar from the set of locations where HPAI has been detected? Or ecologically dissimilar from the set of locations in the joint set of HPAI detection locations and pseudo-absences? Or ecologically similar to the set of locations where HPAI has been detected but spatially isolated? These four descriptors are each slightly different and change the meaning of the sentences. "Model limitations" are also ambiguous - could the authors clarify which specific model limitations they are referring to here? Ultimately, the point being made is probably that a model may predict high ecological suitability for HPAI transmission in areas where the disease has not yet been identified, or where a model is extrapolating in environmental space, however, uncertainty in these predictions may be greater than uncertainty in predictions in areas that are represented in surveillance data. A clear comment on model uncertainty and how it is related to the surveillance dataset and the covariate dataset is currently missing from the manuscript and would be appropriate in this paragraph.

      We understand the Reviewer’s concerns regarding these potential ambiguities, and have now rephrased these sentences as follows: “For both H5Nx and H5N1, certain areas of predicted high ecological suitability appear spatially isolated, i.e. surrounded by regions of low predicted ecological suitability. These areas likely meet the environmental conditions associated with past HPAI occurrences, but their spatial isolation may imply a lower risk of actual occurrences, particularly in the absence of nearby outbreaks or relevant wild bird movements.”

      I am concerned by the wording of the following sentence: "The risk maps reveal that high-risk areas have expanded after 2020" (line 203). This statement could be supported by an acknowledgement of the assumptions the models make of the HPAI niche: are we saying that the niche is unchanged in environmental space and that there are now more geographic areas accessible to the pathogen, or that the niche has shifted or expanded, and that there are now more geographic areas accessible to the pathogen? The authors should review the sentence beginning on line 117: if models trained on data from the old timepoint predicting to the new timepoint are almost as good as models trained on data from the new timepoint predicting to the new timepoint, doesn't this indicate that the niche, as the models are able to capture it, has not changed too much?

      We thank the Reviewer for this comment. The statement that "high-risk areas have expanded after 2020" indeed refers to an increase in the geographic extent of areas predicted to have high ecological suitability in models trained on post-2020 data. This expansion likely reflects new outbreak data from regions that had not previously reported cases, which in turn influenced model training.

      However, models trained on pre-2020 data retain reasonable predictive performance when applied to post-2020 data (see the AUC results reported in Table S1). The models therefore point to an expansion in ecological suitability, but do not provide definitive evidence of a shift in the ecological niche. We have now added a statement at the end of this paragraph to clarify this point: “However, models trained on pre-2020 data maintained reasonable predictive performance when tested on post-2020 data, suggesting that the overall ecological niche of HPAI did not drastically shift over time.”

      The final two paragraphs of the Results might be more helpful to include at the beginning of the Results, as the data discussed there are inputs to the models. Is it possible that the "rise in Shannon index for sea birds" that "suggests a broadening of species diversity within this category from 2020 onwards" is caused by the increasingly structured surveillance of HPAI in wild birds alluded to earlier in the Results? Is the "prevalence" discussed in line 226 the frequency of the families Laridae and Sulidae being represented in HPAI detection data? Or the abundance of the bird species themselves? The language here is a little ambiguous. Discussion of particular values of Shannon/Simpson indices is slightly out of context as the meanings of the indices are in the Methods - perhaps a brief explanation of the uses of Shannon/Simpson indices may be helpful to the reader here. It may also be helpful to readers who are not acquainted with avian taxonomy to provide common names next to formal names (for example, in brackets) in the body of the text, as this manuscript is published in an interdisciplinary journal.

      We thank the Reviewer for these comments. First, we acknowledge that the paragraphs on species diversity and Shannon/Simpson indices describe important data, but we have chosen to present them after the main modelling results in order to maintain a logical narrative flow. Our manuscript first presents the ecological niche models and their predictive performance, followed by interpretations of the observed patterns, including changes in avian host diversity. Diversity indices were used primarily to support and contextualise the patterns observed in the modelling results.

      For clarity, we have revised the relevant paragraphs in the Results (i) to briefly remind readers of the interpretation of the Shannon and Simpson indices (“Note that these indices reflect the diversity of bird species detected in outbreak records, not necessarily their abundance in the wild”) and (ii) to clarify that “prevalence” refers to the frequency of HPAI detection in wild bird species of the Laridae (gulls) and Sulidae (boobies and gannets) families, and not their total abundance. A bird family includes several species, so the “common name” of a family can sometimes also refer to species from other families. We have now added the common names for each family in the manuscript (although we acknowledge that some of them, such as “penguins”, can be ambiguous).
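
      As an illustration only (this is not the authors' code), the two diversity indices can be computed from per-family detection counts as sketched below. The detection records are hypothetical, and the Gini-Simpson form (1 minus the sum of squared proportions) is an assumption, since the response does not state which form of the Simpson index was used.

```python
import math
from collections import Counter


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over groups with at least one record."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson form, 1 - sum(p_i^2): 0 for a single group, closer to 1 for high diversity."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical detection records: one entry per HPAI-positive wild bird record.
records = ["Laridae", "Laridae", "Sulidae", "Anatidae", "Anatidae", "Anatidae"]
counts = list(Counter(records).values())
print(round(shannon_index(counts), 3), round(simpson_index(counts), 3))
```

      Both functions depend only on the composition of the detection records, which is why they describe the diversity of detected hosts rather than host abundance in the wild.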

      In the Methods, it is stated: "To address the heterogeneity of AIV surveillance efforts and to avoid misclassifying low-surveillance areas as unsuitable for virus circulation, we trained the ecological niche models only considering countries in which five or more cases have been confirmed." However, it is not clear how this processing step prevents low-surveillance areas from being misclassified. If pseudo-absences are appropriately sampled, low-surveillance areas should be less represented in the pseudo-absence dataset, which should lead the models to be uncertain in their predictions of these areas. Perhaps "To address the heterogeneity of AIV surveillance efforts and to avoid sampling pseudo-absence data in realistically low-surveillance areas" is a more accurate introduction to the paragraph. I am not entirely convinced that it is appropriate to remove detection data where the national number of cases is low. This may introduce further sampling bias into the dataset.

      We take the opportunity of the Reviewer’s comment to further clarify this important step aiming to mitigate bias associated with countries with substantial uncertainty in reporting and/or potentially insufficient HPAI surveillance data. While we indeed acknowledge that this procedure may exclude countries that had effective surveillance but low virus detection, we argue that it constitutes a relevant conservative approach to minimising the risk of sampling a significant number of pseudo-absence points in areas associated with relatively high yet undetected local HPAI circulation due to insufficient surveillance. Furthermore, given that five cases over two decades is a relatively low threshold — particularly for a highly transmissible virus such as AIV — non-detection or non-reporting remains a more plausible explanation than true absence.

      To improve clarity, we have now revised the related sentence as follows: “To account for heterogeneity in AIV surveillance and minimise the risk of sampling pseudo-absences in poorly monitored regions, we restricted our analysis to countries (or administrative level 1 units in China and Russia) with at least five confirmed outbreaks.”
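
      The restriction described above amounts to a simple count-based filter. A minimal sketch follows, assuming each outbreak record carries the name of its country (or administrative level 1 unit); the unit labels and the helper name `units_to_keep` are hypothetical.

```python
from collections import Counter

MIN_OUTBREAKS = 5  # threshold stated in the revised sentence


def units_to_keep(outbreak_units, min_outbreaks=MIN_OUTBREAKS):
    """Return the set of units with at least `min_outbreaks` confirmed outbreaks."""
    counts = Counter(outbreak_units)
    return {unit for unit, n in counts.items() if n >= min_outbreaks}


# Hypothetical records: unit "B" has only four outbreaks and is therefore excluded.
outbreaks = ["A"] * 7 + ["B"] * 4 + ["C"] * 5
kept = units_to_keep(outbreaks)
filtered = [unit for unit in outbreaks if unit in kept]
print(sorted(kept), len(filtered))
```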

      The reporting of spatial and temporal resolution of data in the manuscript could be significantly clearer. Is there a reason why human population density is downscaled to 5 arcminutes (~10km at the equator) while environmental covariate data has a resolution of 1km? The projection used is not reported. The authors should clarify the time period/resolution of the covariate data assigned to the occurrence dataset, for example, does "day LST annual mean" represent a particular year pre- or post-2020? Or an average over a number of years? Given that disease detections are associated with observation and reporting dates, and that there may be seasonal patterns in HPAI occurrence, it would be helpful to the reader to include this information when the eco-climatic indices are described. It would also be helpful to the reader to summarise the source, spatial and temporal resolution of all covariates in a table, as in Dhingra et al. Could the Authors clarify whether the duck density layer is farmed ducks or wild ducks?

      The projection is WGS 84 (EPSG:4326) and the resolution of the output maps is around 0.0833 x 0.0833 decimal degrees (i.e. 5 arcmin, or approximately 10 km at the equator). We have now added these specifications in the text: “All maps are in a WGS84 projection with a spatial resolution of 0.0833 decimal degrees (i.e. 5 arcmin, or approximately 10 km at the equator).” In addition, we have now specified in the text that duck refers to domestic duck for clarity. 
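
      The stated equivalence between 5 arcmin, 0.0833 decimal degrees, and roughly 10 km at the equator can be checked with a short calculation. The sketch below uses a spherical approximation based on the WGS84 equatorial circumference; the function names are illustrative.

```python
import math

EQUATORIAL_CIRCUMFERENCE_KM = 40075.017  # 2 * pi * WGS84 equatorial radius (6378.137 km)


def arcmin_to_degrees(arcmin):
    """Convert arcminutes to decimal degrees (60 arcmin per degree)."""
    return arcmin / 60.0


def cell_width_km(arcmin, latitude_deg=0.0):
    """Approximate east-west width of a grid cell at a given latitude."""
    km_per_degree = EQUATORIAL_CIRCUMFERENCE_KM / 360.0
    return arcmin_to_degrees(arcmin) * km_per_degree * math.cos(math.radians(latitude_deg))


print(round(arcmin_to_degrees(5), 4))   # cell size in decimal degrees
print(round(cell_width_km(5), 2))       # width at the equator, in km
print(round(cell_width_km(5, 60), 2))   # width at 60 degrees latitude, in km
```

      Because the grid is unprojected, the east-west width of a cell shrinks with the cosine of latitude, so "approximately 10 km" holds only near the equator.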

      Environmental variables retrieved for our analyses were available as values averaged over distinct periods of time (for further detail, see Supplementary Information Resources S1, which gives the description and source of each environmental variable included in the original sets of variables and is available at https://github.com/sdellicour/h5nx_risk_mapping). In future work, it would indeed be interesting to associate each occurrence with a specific season and with seasonally matched variables, especially for viruses such as HPAI, whose occurrence has been found to correlate with season. However, we did not conduct this type of analysis in the present study: occurrences were associated with averaged values of environmental data only.

      In line 407, the authors state a number of pseudo-absence points used in modelling, relative to the number of presence points, without clear justification. Note that relative weights can be assigned to occurrence data in most ECN software (e.g., R package gbm), to allow many pseudo-absence points to be sampled to represent the full extent of probable surveillance effort and subsequently down-weighted.

      We thank the Reviewer for this suggestion. We acknowledge that alternative approaches, such as down-weighting pseudo-absence points, could offer a certain degree of flexibility in representing surveillance effort. However, we opted for a fixed 1:3 ratio of pseudo-absences to presence points within each administrative unit to ensure a consistent and conservative sampling distribution. This approach aimed to limit the overrepresentation of pseudo-absences in areas with sparse presence data, while still reflecting areas of likely surveillance.

      There are a number of typographical errors and phrasing issues in the manuscript. A nonexhaustive list is provided below.

      - Line 21: "its" should be "their" - Line 25: "HPAI cases"

      These modifications have been made.

      - Line 63: sentence beginning "However" is somewhat out of context - what is it (briefly) about recent outbreaks that challenge existing models?

      We have now edited that sentence as follows: “However, recent outbreaks raise questions about whether earlier ecological niche models still accurately predict the current distribution of areas ecologically suitable for the local circulation of HPAI H5 viruses.”

      - Lines 71 and 390: "AIV" is not defined in the text - Line 73: "do" ("are" and "what" are not capitalised)

      These modifications have been made.

      - Line 115: "predictability" should be "predictive capacity"

      We have now replaced “predictability” with “predictive performance”.

      - Line 180: omit "pinpointing"

      - Line 192 sentence beginning "In India," should be re-worded: is the point that there are detections of HPAI here and the model predicts high ecological suitability?

      - Line 195 sentence beginning "Finally," phrasing could be clearer: Dhingra et al. find high suitability areas for H5Nx in North America which are predicted to be low suitability in the new model.

      - Line 237: omit "the" in "with the those"

      - Line 374: missing "."

      - Line 375: "and" should be "to" (the same goes for line 421)

      - Line 448: Rephrase "Simpson index goes" to "The Simpson index ranges"

      These modifications have been made.

      Reviewer #2 (Public Review):

      What is the justification for separating the dataset at 2020? Is it just the gap in-between the avian influenza outbreaks?

      We chose 2020 as a cut-off based on a well-documented shift in HPAI epidemiology, notably the emergence and global spread of clade 2.3.4.4b, which may affect host dynamics and geographic patterns. We have now added this clarification in the Materials and Methods section: “We selected 2020 as a cut-off point to reflect a well-documented shift in HPAI epidemiology, notably the emergence and global spread of clade 2.3.4.4b. This event marked a turning point in viral dynamics, influencing both the range of susceptible hosts and the geographical distribution of outbreaks.”

      If the analysis aims to look at changing case numbers and distribution over time, surely the covariate datasets should be contemporaneous with the response?

      Thank you for raising this important point. While we acknowledge that, ideally, covariates should match the response temporally, such high-resolution spatiotemporal environmental data were not available for most environmental factors considered in our ecological niche modelling analyses. We used predictors (e.g., land-use variables, poultry density) that reflect long-term ecological suitability, but we acknowledge that considering short-term seasonal variation instead could be an interesting direction for future work, which is now explicitly stated in the Discussion section: “In addition, aligning outbreak occurrences with seasonally matched environmental variables could further refine predictions of HPAI risk linked to migratory dynamics.”

      I would expect quite different immunity dynamics between domestic and wild birds as a function of lifespan and birth rates - though no obvious sign of that in the raw data. A statement on assumptions in that respect would be good.

      Thank you for the comment. We agree that domestic and wild birds likely exhibit different immunity dynamics due to differences in lifespan, turnover rates, and exposure. However, our analyses did not explicitly model immunity processes, and the data did not show a clear signal of these differences.

      Decisions and analytical tactics from Dhingra et al are adopted here in a way that doesn't quite convey the rationale, or justify its use here.

      We thank the Reviewer for this observation. However, we do not agree with the notion that the rationale for using Dhingra et al.’s analytical framework is insufficiently conveyed. We adapted key components of their ecological niche modelling approach (such as the boosted regression tree methodology and the pseudo-absence sampling procedure) to ensure comparability with their previous findings, while also extending the analysis to additional time periods and host categories (wild vs. domestic birds). This framework aligns with the main objective of our study, which is to assess shifts in ecological suitability for HPAI over time and across host species, in light of changing viral dynamics.

      Please go over the manuscript and harmonise the language about the model target - it is usually referred to as cases, but sometimes the pathogen, and others the wild and domestic birds where the cases were discovered.

      We agree and we have now modified the text to only use the “cases” or “occurrences” terminology when referring to the model inputs.

      Is the reporting of your BRT implementation correct? The text suggests that only 10 trees were run per replicate (of which there were 10 per response (domestic/wild x H5N1 / H5Nx) x distinct covariate set), but this would suggest that the authors were scarcely benefiting from the 'boosting' part of the BRTs that allow them to accurately estimate curvilinear functions. As additional trees are added, they should still be improving the loss function, and dramatically so in the early stages. The authors seem heavily guided by Elith et al's excellent paper[1] explaining BRTs and the companion tutorial piece, but in that work, the recommended approach is to run an initial model with a relatively quick learning rate that achieves the best fit to the held-out data at somewhere over 1000 trees, and then to refine the model to that number of trees with a slower learning rate. If the authors did indeed run only 10 trees I think that should be explained.

      For each model, we used the “gbm.step” function to fit boosted regression trees, initiating the process with 10 trees and allowing up to 10,000 trees in steps of 5. The optimal number of trees was automatically determined by minimising the cross-validated deviance, following the recommended approach of Elith and colleagues (2008, J. Anim. Ecol.). This setup allows the boosting algorithm to iteratively improve model performance while avoiding overfitting. These aspects are now further clarified in the Materials and Methods section: “All BRT analyses were run and averaged over 10 cross-validated replicates, with a tree complexity of 4, a learning rate of 0.01, a tolerance parameter of 0.001, and while considering 5 spatial folds. Each model was initiated with 10 trees, and additional trees were incrementally added (in steps of 5) up to a maximum of 10,000, with the optimal number selected based on cross-validation tests.”
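
      The tree-number selection described above (start at 10 trees, add trees in steps of 5 up to a maximum of 10,000, keep the number that minimises cross-validated deviance) can be sketched as a simple search. This is not the `gbm.step` implementation: the deviance curve below is a hypothetical stand-in for the cross-validation estimate, the search is exhaustive rather than stopping early, and the names `select_n_trees` and `toy_cv_deviance` are illustrative.

```python
import math


def select_n_trees(cv_deviance, n_start=10, step=5, max_trees=10_000):
    """Evaluate cv_deviance(n) on the grid n_start, n_start + step, ... and return the best (n, deviance)."""
    best_n, best_dev = None, float("inf")
    for n in range(n_start, max_trees + 1, step):
        dev = cv_deviance(n)
        if dev < best_dev:
            best_n, best_dev = n, dev
    return best_n, best_dev


def toy_cv_deviance(n_trees, learning_rate=0.01):
    """Hypothetical held-out deviance: fit improves with more trees, then overfitting slowly raises it."""
    fit_term = math.exp(-learning_rate * n_trees)
    overfit_term = 1e-5 * n_trees
    return 0.5 + fit_term + overfit_term


best_n, best_dev = select_n_trees(toy_cv_deviance)
print(best_n, round(best_dev, 4))
```

      With a slow learning rate such as 0.01, the selected number of trees in this toy curve lies far above the initial 10, which is the point the response makes: 10 is only the starting size of the ensemble, not its final size.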

      I'm uncomfortable with the strong interpretation of changes in indices such as those for diversity in the case of bird species with detected cases of avian influenza, and the relative influence of covariates in the environmental niche models. In the former case, if surveillance effort is increasing it might be expected that more species will be found to be infected. In the latter, I'm just not convinced that these fundamentally correlative models can support the interpretation of changing epidemiology as asserted by authors. This strikes me as particularly problematic in light of static and in some cases anachronistic predictor sets.

      We thank the Reviewer for drawing attention to how changes in surveillance intensity might influence our diversity estimates. We have now integrated a new analysis to evaluate the increase in the number of wild birds tested and discussed the potential impact of this increase on the comparison of the bird species diversity metrics presented in our study, which is now interpreted with more caution: “To evaluate whether the post-2020 increase in species diversity estimated for infected wild birds could result from an increase in the number of tests performed on wild birds, we compared European annual surveillance test counts (EFSA et al., 2025, 2019) before and after 2020 using a Wilcoxon rank-sum test. We relied on European data because it was readily accessible and offered standardised and systematically collected metrics across multiple years, making it suitable for a comparative analysis. Although borderline significant (p-value = 0.063), the Wilcoxon rank-sum test indeed highlighted a recent increase in the number of wild bird tests (on average >11,000/year pre-2020 and >22,000 post-2020), which indicates that the comparison of bird species diversity metrics should be interpreted with caution. However, such an increase in the number of tests conducted in the context of a passive surveillance framework would thus also be in line with an increase in the number of wild birds found dead and thus tested. Therefore, while the increase in the number of tests could indeed impact species diversity metrics such as the Shannon index, it can also reflect an absolute higher wild bird mortality in line with a broadened range of infected bird species.”
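
      For readers unfamiliar with the test, a minimal pure-Python sketch of a two-sided exact Wilcoxon rank-sum test is given below. The annual counts are hypothetical placeholders, not the EFSA figures, so the output is not expected to reproduce the reported p-value of 0.063. The exact permutation approach used here is an assumption that is only practical for small samples.

```python
from itertools import combinations


def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (W, p): the rank sum of x and the exact two-sided permutation p-value."""
    ranks = midranks(list(x) + list(y))
    n = len(x)
    w_observed = sum(ranks[:n])
    expected = n * (len(ranks) + 1) / 2
    observed_deviation = abs(w_observed - expected)
    # Enumerate every way of assigning n of the pooled ranks to the first group.
    sums = [sum(c) for c in combinations(ranks, n)]
    extreme = sum(1 for s in sums if abs(s - expected) >= observed_deviation - 1e-9)
    return w_observed, extreme / len(sums)


# Hypothetical annual numbers of wild birds tested (not the EFSA data).
pre_2020 = [9000, 11000, 12000, 13000]
post_2020 = [15000, 21000, 24000, 30000]
w, p = rank_sum_test(pre_2020, post_2020)
print(w, round(p, 4))
```

      With only a handful of years on each side of the cut-off, the exact test has few possible p-values, which is consistent with a borderline result being reported even when the yearly averages differ markedly.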

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The authors devote significant effort to characterizing the physical interaction between Bicc1 and Pkd2. However, the study does not examine or discuss how this interaction relates to Bicc1's well-established role in posttranscriptional regulation of Pkd2 mRNA stability and translation efficiency.

      The reviewer is correct that the present study has not addressed the downstream consequences of this interaction, considering that Bicc1 is a posttranscriptional regulator of Pkd2 (and potentially Pkd1). We think that the Bicc1/Pkd1/Pkd2 complex retains Bicc1 in the cytoplasm and thus restricts its participation in posttranscriptional regulation (see Author response image 1). However, we do not yet have data to support this and thus have not included this model in the manuscript. We have nevertheless updated the discussion of the manuscript to further elaborate on the potential mechanism of the Bicc1/Pkd1/Pkd2 complex.

      We have updated the discussion to include a discussion on the potential consequences on posttranscriptional regulation by Bicc1.

      Author response image 1.

      Model of BICC1, PC1 and PC2 self-regulation. In this model Bicc1 acts as a positive regulator of PKD gene expression. In the presence of ‘sufficient’ amounts of PC1/PC2 complex, it is tethered to the complex and remains biologically inactive (Fig. 1A). However, once the levels of the PC1/PC2 complex are reduced, Bicc1 is now present in the cytoplasm to promote expression of the PKD proteins, thereby raising their levels (Fig. 4B), which then in turn will ‘shutdown’ Bicc1 activity by again tethering it to the plasma membrane.

      (2) Bicc1 inactivation appears to downregulate Pkd1 expression, yet it remains unclear whether Bicc1 regulates Pkd1 through direct interaction or by antagonizing miR-17, as observed in Pkd2 regulation. This should be further examined or discussed.

      This is a very interesting comment. Vishal Patel published that PKD1 is regulated by a miR-17 binding site in its 3’UTR (PMID: 35965273). We, however, have not evaluated whether BICC1 participates in this regulation. A definitive answer would require the use of the mice described in the above reference, which is beyond the scope of this manuscript. We have, however, revised the discussion to elaborate on this potential mechanism.

      We have updated the discussion to include a statement on the potential direct regulation of Pkd1 mRNA by Bicc1.

      (3) The evidence supporting Bicc1 and ADPKD gene cooperativity, particularly with Pkd1, in mouse models is not entirely convincing, likely due to substantial variability and the aggressive nature of Bpk/Bpk mice. Increasing the number of animals or using a milder Bicc1 strain, such as jcpk heterozygotes, could help substantiate the genetic interaction.

      We initially performed the analysis using the Bicc1 complete knockout that we previously reported (PMID: 20215348), focusing on compound heterozygotes. Yet, similar to the Pkd1/Pkd2 compound heterozygotes (PMID: 12140187), no cyst development was observed in mice sacrificed as late as P21. Our strain is similar to the above-mentioned jcpk, which is characterized by a short, abnormal transcript thought to result in a null allele (PMID: 12682776). We thank the reviewer for pointing us to the reference showing that heterozygous mice exhibit glomerular cysts as adults (PMID: 7723240). This is an interesting idea that we will investigate. In general, we agree with the reviewer that a better understanding of the contribution of Bicc1 to the adult PKD phenotype will be critical. To this end, we are currently generating a floxed allele of Bicc1 that will allow us to address the cooperativity in the adult kidney, e.g. when crossed to the Pkd1<sup>RC/RC</sup> mice. Yet, these experiments are beyond the timeframe for this revision.

      No changes were made in the revised manuscript. 

      Reviewer #2 (Public review):

      (1) These results are potentially interesting, despite the limitation, also recognized by the authors, that BICC1 mutations seem exceedingly rare in PKD patients and may not "significantly contribute to the mutational load in ADPKD or ARPKD". The manuscript has several intrinsic limitations that must be addressed. 

      As mentioned above, the study was designed to explore whether there is an interaction between BICC1 and PKD1/PKD2 and whether this interaction is functionally important. How this translates into clinical relevance will require additional studies (and we have addressed this in the discussion of the manuscript).

      (2) The manuscript contains factual errors, imprecisions, and language ambiguities. This has the effect of making this reviewer wonder how thorough the research reported and analyses have been. 

      We respectfully disagree with the reviewer on the latter interpretation. The study was performed with rigor. We have carefully assessed the critiques raised by the reviewer. As presented below, most of the criticisms raised by the reviewer have been easily addressed in the revised version of the manuscript. Yet, none of the critiques seems to directly impact the overall interpretation of the data. 

      Reviewer #1 (Recommendations for the authors):

      (1) The manuscript requires further editing. For example, figure panels and legends are mismatched in Figure 1

      We have corrected the labeling of Figure 1. 

      (2) Y-axis units and values are inconsistent in Figures 4b-4g, Supplementary Figures S2e and S2f are not referenced in the text, genotypes are missing in Supplementary Figure S3f, and numerous typographical errors are present.

      With respect to the y-axis in Figure 4b-g, the scale is different for each panel. This is intentional, as the differences would be lost if all panels were scaled identically. We have now mentioned this in the figure legend to make the reader aware of it. With respect to Supplemental Figure S2e,f, we included the panels in the description of the mutant BICC1 lines, but unfortunately forgot to reference them. This has now been done.

      We have updated the labeling of the Y-axis for the cystic indices adding “[%]” as the unit and updated the figure legend of Figure 4. We have included the genotypes in Supplementary Figure S3f. The Supplementary Figure S2e,f is now mentioned in the supplemental material (page 9, 2<sup>nd</sup> paragraph). 

      Reviewer #2 (Recommendations for the authors):

      (1) "Previous data from mouse, Xenopus, and zebrafish suggest a crucial role for the RNA-binding protein Bicc1 in the pathogenesis of PKD, although BICC1 mutations in human PKD have not been previously reported." The cited sources (and others that were not cited) link Bicc1 mutations to renal cysts, similar to a report by Kraus (PMID: 21922595) that the authors cite later. However, a more direct link to PKD was reported by Lian and colleagues using whole Pkd1 mice (PMID: 20219263) and by Gamberi and colleagues using Pkd1 kidneys and human microarrays (PMID: 28406902). Although relevant, neither is cited here, and only the former is cited later in the manuscript.

      Thanks for pointing this out. We have added these three citations.

      We have added these three citations (PMID: 21922595, PMID: 20219263 and PMID: 28406902) in the indicated sentence.

      (2) In Figure 1B, the lanes do not seem to correspond among panels, particularly evident in the panel with myc-mBicc1. Hence, it is difficult to agree with the presented conclusions.

      We have corrected the labeling of the lanes in Figure 1b.

      (3) In the Figure 1 legend: "(g) Western blot analysis following co-IP experiments, using an anti-mouse Bicc1 or anti-goat PC2 antibody as bait, identified protein interactions between endogenous PC2 and BICC1 in UCL93 cells. Non-immune goat and mouse IgG were included as a negative control." There is no mention of panel H, although this reviewer can imagine what the authors meant. The capitalization differs in the figure and legend. More troublingly, in panel G, a non-defined star indicates a strong band present in both immune and non-immune control.

      We have corrected the figure legend of Figure 1 and clarified the non-specific band in the figure legend.

      (4) In Figure 4, the authors do not show the matched control for the Bicc1 Pkd1 interaction in panel d, nor do they show a scale bar in either a) or d). Thus, the phenotypic severity cannot be properly assessed.

      Thanks for pointing out the missing scale bars, which have now been added. The two kidneys shown in Figure 4d are from littermates and illustrate the kidney size in agreement with the cumulative data shown in Figure 4e. Unfortunately, this litter did not have a wildtype control. As the data analysis in Figure 4e is based on littermates, mixing and matching kidneys of different litters does not seem appropriate. Thus, we have omitted showing a wildtype control in this panel. However, the size of the wildtype kidney can be seen in Figure 4a.

      We have added the scale bar to both panels and have updated the figure legend to emphasize that the kidneys shown are from littermates and that no wildtype littermate was present in this litter.

      (5) "Surprisingly, an 8-fold stronger interaction was observed between full-length PC1 and myc-mBicc1-ΔKH compared to myc-mBicc1 or myc-mBicc1-ΔSAM." Assuming all the controls for protein folding and expression levels have been carried out and not shown/mentioned, this sentence seems to contradict the previous statement that Bicc1deltaSAM reduced the interaction with PC1 by 55%. Because the full length and SAM deletion have different interaction strengths, the latter sentence makes no sense.

      The reduction in PC1 binding of myc-mBicc1-ΔSAM compared to wildtype myc-mBicc1 was not significant. We have clarified this in the text.

      We have corrected the sentence and modified the Figure accordingly. 

      (6) Imprecise statements make a reader wonder how to interpret the data: "More than three independent experiments were analyzed." Stating the sample size or including it in the figure would save space and improve confidence in the data presented.

      We have stated the exact number of animals per condition above each of the bars.

      (7) "Next, we performed a similar mouse study for Pkd1 by reducing the gene dose of Pkd1 postnatally in the collecting ducts using a Pkhd1-Cre as previously described40" What did the authors mean?

      The reference was included to cite the mouse strain, but we realized that it could be misinterpreted as indicating that the exact experiment had been performed previously. We have clarified this in the text.

      We have reworded the sentence to avoid misinterpretation. 

      (8) The authors examined the additive effects of knocking down Bicc1, Pkd1, and Pkd2 with morpholinos in Xenopus and, genetically, in mice. While the Bicc1[+/-] Pkd1 or 2[+/-] double heterozygote mice did not show phenotypes, the authors report that the Bicc1[-/-] Pkd1 or 2 [+/-] did instead show enlarged kidneys. What is the phenotype of a Bicc1[+/-] Pkd1 or 2 [-/-]? What we learn from the author's findings among the PKD population suggests that the latter situation would be potentially translationally relevant.

      The mouse experiments were designed to address a cooperativity between Bicc1 and either Pkd1 or Pkd2 and whether removal of one copy of Pkd1 or Pkd2 would further worsen the Bicc1 cystic kidney phenotype. Thus, the parental crosses were chosen to maximize the number of animals obtained for these genotypes. Unfortunately, these crosses did not yield the genotypes requested by the reviewer. To address the contribution of Bicc1 towards the PKD population, we will need to perform a different cross, where we eliminate Pkd1 or Pkd2 in a floxed background of Bicc1 postnatally in adult mice. While we are gearing up to perform such an experiment, this is timewise beyond the scope of the manuscript. In addition, please note that we have addressed the question about the translation towards the PKD population already in the discussion of the original submission (page 13/14, last/first paragraph).

      No changes have been made to the revised version of the manuscript.

      (9) How do the authors interpret the milder effects of the Bicc1[-/-] Pkd1[+/-] compared to Bicc1[-/-] Pkd2[+/-] relative to the respective protein-protein interactions?

      The milder effects are due to the nature of the crosses. While the Pkd2 mutant is a germline mutation, the Pkd1 mutant is a conditional allele eliminating Pkd1 only in the collecting ducts of the kidney. As such, we spare other nephron segments such as the proximal tubules, which also significantly contribute to the cyst load. As such these mouse data support the interaction between Pkd1 and Pkd2 with Bicc1, but do not allow us to directly compare the outcomes. While this was mentioned in the previous version of the manuscript, we have expanded on this in the revised version of the manuscript.

      We have expanded the results section in the revised version of the manuscript highlighting that the two different approaches cannot be directly compared.

      (10) How do the authors interpret that the strong Bicc1[Bpk] Pkd1 or Pkd2 double heterozygote mice did not have defects and "kidneys from Bicc1+/-:Pkd2+/- did not exhibit cysts (data not shown)", when the VEO PKD patients and - although not a genetic reduction - also the morpholino-treated Xenopus did?

      VEO PKD patients are characterized by a loss of function of PKD1 or PKD2, and, as we propose in this manuscript, BICC1 further aggravates the phenotype. Yet, we do not address in either the mouse or the Xenopus experiments whether BICC1 is a genetic modifier. We are simply addressing whether the two genes show a genetic interaction. In the mouse studies, we eliminate one copy of Pkd1 or Pkd2 in the background of a hypomorphic allele of Bicc1. Similarly, in the Xenopus experiments, we employ suboptimal doses of the morpholino oligomers, i.e., concentrations that did not yield a phenotypic change, and then ask whether removing both together shows cooperativity. It is important to state that this is based on a biological readout and not defined based on the amount of protein. While we have described this already in the original manuscript (page 7, first paragraph), we have amended our description of the Xenopus experiment to make this even clearer.

      Finally, we agree with the reviewer that if we were to address whether Bicc1 is a modifier of the PKD phenotype in mouse, we would need to reduce Bicc1 function in Pkd1 or Pkd2 mutants. Yet, we have recognized this already in the initial version of the manuscript in the discussion (page 14, first paragraph).

      We have expanded the results section when discussing the suboptimal amounts of the morpholino oligos (Page 6, 1<sup>st</sup> paragraph).

      (11) Unclear: "While variants in BICC1 are very rare, we could identify two patients with BICC1 variants harboring an additional PKD2 or PKD1 variant in trans, respectively." Shortly after, the authors state in apparent contradiction that "the patients had no other variants in any of other PKD genes or genes which phenocopy PKD including PKD1, PKD2, PKHD1, HNF1s, GANAB, IFT140, DZIP1L, CYS1, DNAJB11, ALG5, ALG8, ALG9, LRP5, NEK8, OFD1, or PMM2."

      The reviewer is correct. This should have been phrased differently. We have now added “Besides the variants reported below” to clarify this more adequately.

      The sentence was changed to start with “Besides the variants reported below, […].”

      (12) "The demonstrated interaction of BICC1, PC1, and PC2 now provides a molecular mechanism that can explain some of the phenotypic variability in these families." How do the authors reconcile this statement with their reported ultra-rare occurrence of the BICC1 mutations?

      As mentioned in the manuscript and also in response to the other two reviewers, Bicc1 has been shown to regulate Pkd2 gene expression in mice and frogs via an interaction with the miR-17 family of microRNAs. Moreover, the miR-17 family has been demonstrated to be critical in PKD (PMID: 30760828, PMID: 35965273, PMID: 31515477). In fact, both other reviewers have pointed out that we should stress this more since Bicc1 is part of this regulatory pathway. Future experiments are needed to address whether Bicc1 contributes to the variability in ADPKD onset/severity. Yet, this is beyond the scope of this study.

      Based on the comments of the two other reviewers we have further addressed the Bicc1/miR-17 interaction.

      (13) The manuscript should use correct genetic conventions of italicization and capitalization. This is an issue affecting the entire manuscript. Some exemplary instances are listed below.

      (a) "We also demonstrate that Pkd1 and Pkd2 modifies the cystic phenotype in Bicc1 mice in a dose-dependent manner and that Bicc1 functionally interacts with Pkd1, Pkd2 and Pkhd1 in the pronephros of Xenopus embryos." Genes? Proteins?

      The data presented in this section show that a hypomorphic allele of Bicc1 in mouse and a knockdown in Xenopus yields this. As both affect the proteins, the spelling should reflect the proteins.

      No changes have been made in the revised manuscript.

      (b) The sentence seems to use both the human and mouse genetic capitalization, although it refers to experiments in the mouse system “to define the Bicc1 interacting domains for PC2 (Fig. 2d,e). Full-length PC2 (PC2-HA) interacted with full-length myc-mBICC1.”

      We agree with the reviewer that stating the species of the molecules used is critical. We have adopted a spelling of Bicc1 in which BICC1 is the human homologue, mBicc1 is the mouse homologue, and xBicc1 is the Xenopus one.

      We have highlighted the species spelling in the methods section and labeled the species accordingly throughout the manuscript and figures. 

      (14) “Together these data supported our biochemical interaction data and demonstrated that BICC1 cooperated with PKD1 and PKD2.” Are the authors implying that these results in mice will translate to the human protein?

      We agree that we have not formally shown that the same applies to the human proteins. Thus, we have changed the spelling accordingly.

      We have revised the capitalization of the proteins. 

      (15) The text is often unclear, terse, or inconsistent.

      (a) “These results suggested that the interaction between PC1 and Bicc1 involves the SAM but not the KH/KHL domains (or the first 132 amino acids of Bicc1). It also suggests that the N-terminus could have an inhibitory effect on PC1-BICC1 association.” How do the authors define the N-terminus? The first 132 aa? KH/KHL domains?

      This was illustrated in the original Figure 2A. The ΔKH constructs lack the first 351 amino acids.

      To make this more evident, we have specified this in the text as well.

      (b) Similarly, the authors state below, "Unlike PC1, PC2 interacted with myc-mBICC1-ΔSAM, but not myc-mBICC1-ΔKH suggesting that PC2 binding is dependent on the N-terminal domains but not the SAM domain." It is unclear if the authors refer to the KH/KHL domains or others. Whatever the reference to the N-terminal region, it should also be consistent with the section above.

      This is now specified in the text.

      (c) Unclear: "We have previously demonstrated that Pkd2 levels are reduced in a complete Bicc1 null mice,22 performing qRT-PCR of P4 kidneys (i.e. before the onset of a strong cystic phenotype), revealed that Bicc1, Pkd1 and Pkd2 were statistically significantly down-regulated (Fig. 4h-j)".

      We have changed the text to clarify this. 

      (d) “Utilizing recombinant GST domains of PC1 and PC2, we demonstrated that BICC1 binds to both proteins in GST-pulldown assays (Fig. 1a, b)." GST-tagged domains? Fusions?

      We have changed the text to clarify this. 

      (e) "To study the interaction between BICC1, PKD1 and PKD2 we combined biochemical approaches, knockout studies in mice and Xenopus, genetic engineered human kidney cells" > genetically engineered.

      We have changed the text to clarify this.

      (f) Capitalization (e.g., see Figure S3, ref. the Bpk allele) and annotation (e.g., Gly821Glu and G821E) are inconsistent.

      We have homogenized the labeling of the capitalization and annotations throughout the manuscript. 

      (g) What do the authors mean by "homozygous evolutionarily well-conserved missense variant"?

      We have changed this in the revised version of the manuscript.

      Reviewer #3 (Public review/Recommendations to the authors):

      (1) A further study in HUREC cells investigating the critical regulatory role of BICC1 and potential interaction with mir-17 may yet lead to a modifiable therapeutic target.

      (2) This study should ideally include experiments in HUREC material obtained from patients/families with BICC1 mutations and studying its effects on the PKD1/2 complex in primary cell lines.

      This is an excellent suggestion. We agree with the reviewer that it would have been interesting to analyze HUREC material from the affected patients. Unfortunately, besides DNA and the phenotypic analysis described in the manuscript, neither human tissue nor primary patient-derived cells were collected once the two patients with the BICC1 p.Ser240Pro variant passed away.

      No changes to the revised manuscript have been made to address this point.

      (3) Please remove repeated words in the following sentence in paragraph 2 of the introduction: "BICC1 encodes an evolutionarily conserved protein that is characterized by 3 K-homology (KH) and 2 KH-like (KHL) RNA-binding domains at the N-terminus and a SAM domain at the C-terminus, which are separated by a by a disordered intervening sequence (IVS).23-28".

      This has been changed.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study by Li and coworkers addresses the important and fundamental question of replication initiation in Escherichia coli, which remains open, despite many classic and recent works. It leverages single-cell mRNA-FISH experiments in strains with titratable DnaA and novel DnaA activity reporters to monitor DnaA activity peaks versus size. The authors find oscillations in DnaA activity and show that their peaks correlate well with the estimated population-average replication initiation volume across conditions and imposed dnaA transcription levels. The study also proposes a novel extrusion model where DNA-binding proteins regulate free DnaA availability in response to biomass-DNA imbalance. Experimental perturbations of H-NS support the model validity, addressing key gaps in current replication control frameworks.

      Strengths:

      I find the study interesting and well conducted, and I think its main strong points are:

      (1) the novel reporters obtained with systematic synthetic biology methods, and combined with a titratable dnaA strain.

      (2) the interesting perturbations (titration, production arrest, and H-NS).

      (3) the use of single-cell mRNA FISH to monitor transcripts directly.

      The proposed extrusion model is also interesting, though not fully validated, and I think it will contribute positively to the future debate.

      We thank the reviewer for acknowledging the strengths of our study.

      Weaknesses and Limitations:

      (1) A relevant limitation in novelty is that DnaA activity and concentration oscillations have been reported by the cited Iuliani and coworkers previously by dynamic microscopy, and to a smaller extent by the other cited study by Pountain and coworkers using mRNA FISH.

      (2) An important limitation is that the study is not dynamic. While monitoring mRNA is interesting and relevant, the current study is based on concentrations and not time variations (or nascent mRNA). Conversely, the study by Iuliani and coworkers, while having the drawback of monitoring proteins, can directly assess production rates. It would be interesting for future studies or revisions to monitor the strains and reporters dynamically, as well as using (as a control) the technique of this study on the chromosomal reporters used by Iuliani et al.

      We acknowledge the value of dynamic measurements and clarify our methodological rationale.

      While Iuliani et al. provided valuable temporal resolution through protein dynamics, our mRNA FISH approach achieves direct decoupling of transcriptional vs. post-translational regulation (Fig 4F-H), and condition flexibility across 7 growth rates (30-66 min doubling times). This trade-off sacrifices temporal resolution for enhanced population-scale resolution and perturbation flexibility. To directly address temporal coupling, future work will implement dual-color live imaging of DnaA activity concurrent with replication initiation events.

      (3) Regarding the mathematical models, a lot of details are missing regarding the definitions and the use of such models, which are only presented briefly in the Methods section. The reader is not given any tools to understand the predictions of different models, and no analytical estimates are used. The falsification procedures are not clear. More transparency and depth in the analysis are needed, unless the models are just used as a heuristic tool for qualitative arguments (but this would weaken the claims). The Berger model, for example, has many parameters and many regimes and behaviors. When models are compared to data (e.g., in Figure 2G), it is not clear which parameters were used, how they were fixed, and whether and how the model prediction depends on parameters.

      We agree that model transparency is essential for quantitative validation. To address this, all model parameters (DnaA synthesis rate, activation/deactivation rates, etc.) are explicitly tabulated in Supplementary Information Table S6. For the titration (Hansen et al. 1991) and extrusion models, we derive analytical expressions for initiation mass (IM) sensitivity to DnaA expression in Supplementary Note 1. For Figure 2G/S6, we used published parameters (Berger & Wolde 2022 SI Table 2) with the experimental growth conditions (μ = 1.54 h<sup>-1</sup>).

      The extrusion model's validation relies primarily on its ability to resolve paradoxical initiation events under dnaA shutdown (Fig 6C), a test where other models fail categorically. While the Berger titration-switch hybrid can fit steady-state IM trends (Fig S6A), it cannot reproduce post-shutdown dynamics without ad hoc modifications (Fig S6B). We acknowledge that comprehensive analysis of all model regimes exceeds this study's scope but provide full simulation code for independent verification: https://github.com/BaiYangBqdq/dynamics_of_biomass_DNA_coordination

      (4) Importantly, the main statement about tight correlations of peak volumes and average estimated initiation volume does not establish coincidence, and some of the claims by the authors are unclear in these respects (e.g., when they say "we resolve a 1:1 coupling between DnaA activity thresholds and replication initiation", the statement could be correct but is ambiguous). Crucially, the data rely on average initiation volumes (on which there seems to be an eternally open debate, also involving the authors), and the estimate procedure relies on assumptions that could lead to biases and uncertainties added to the population variability (in any case, error bars are not provided).

      We acknowledge the limitations of population-level inference and have refined our claims: "Replication initiation volume scales proportionally with peak DnaA activity volume with a slope of 1.0 (R<sup>2</sup>=0.98, Fig 7G), indicating predictive correspondence rather than absolute coincidence. While population-level 𝑉<sub>𝑖</sub> estimation cannot resolve single-cell stochasticity, the consistent 𝑉*: 𝑉<sub>𝑖</sub> relationship across 20 conditions suggests DnaA activity thresholds predict initiation timing within physiological error margins". Future work will simultaneously monitor DnaA activity and replication forks using microfluidic single-cell tracking.

      (5) The delays observed by the authors (in both directions) between the peaks of DnaA activity conditional averages with respect to volume and the average estimated initiation volumes are not incompatible with those observed dynamically by Iuliani and coworkers. The direct experiment to prove the authors' point would be to use a direct proxy of replication initiation, such as SeqA or DnaN, and monitor initiations and quantify DnaA activity peaks jointly, with dynamic measurements.

      We acknowledge the observed temporal deviations between DnaA activity peaks (𝑉*) and population-derived volumes at initiation (𝑉<sub>𝑖</sub>) in certain conditions, in line with the findings of Iuliani et al. These deviations might be mechanistically consistent with the time required for orisome assembly or oriC sequestration. They do not contradict our core finding that initiation occurs at a defined DnaA activity threshold (slope=1.0, R<sup>2</sup>=0.98 in 𝑉*: 𝑉<sub>𝑖</sub> correlation).

      (6) While not being an expert, I had some doubt that the fact that the reporters are on plasmid (despite a normalization control that seems very sensible) might affect the measurements. Also, I did not understand how the authors validated the assumptions that the reporters are sensitive to DnaA-ATP specifically. It seems this assumption is validated by previous studies only.

      We employed a plasmid-based reporter system to circumvent the significant confounding effects of chromosomal position on promoter activity, as extensively documented by Pountain et al., where local genomic context (e.g., nucleoid occlusion, supercoiling gradients, and neighboring operons) introduces uncontrolled variability. By housing the P<sub>syn66</sub> test promoter and P<sub>con</sub> normalization control in identical low-copy pSC101 vectors (<8 copies/cell, Peterson & Phillips, Plasmid 2008), we ensured they experience equivalent physical and biochemical environments. This ratiometric design, in which DnaA activity is calculated from the P<sub>syn66</sub> signal normalized to the P<sub>con</sub> signal, actively corrects for global fluctuations in RNA polymerase availability, nucleotide pools, and plasmid copy number. Critically, P<sub>syn66</sub>’s architecture emulates natural DnaA-responsive elements: its strong DnaA boxes report free DnaA concentration, while its weak box is preferentially bound by DnaA-ATP (Speck et al., EMBO journal 1999), mirroring the nucleotide-state sensitivity of oriC and the native dnaA promoter. This system was indispensable for our central finding, as it uniquely enabled the decoupling of DnaA activity oscillations from transcriptional feedback (Fig. 4F-H), an experiment fundamentally impossible with chromosomally integrated reporters due to autoregulatory interference.
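
      For illustration only, the ratiometric normalization described above can be sketched in a few lines of Python. The function name and the per-cell counts are hypothetical and are not taken from the authors' analysis code; the sketch only shows why a factor that scales both reporters equally (for example, plasmid copy number) cancels in the ratio.

```python
def ratiometric_activity(test_counts, control_counts):
    """Per-cell ratio of test-promoter to control-promoter mRNA counts.

    Cells with no control signal are returned as NaN rather than dividing by zero.
    """
    return [
        t / c if c > 0 else float("nan")
        for t, c in zip(test_counts, control_counts)
    ]


# Hypothetical per-cell mRNA counts for the test and control reporters.
base_test = [10.0, 20.0, 30.0]
base_control = [5.0, 5.0, 10.0]

# A global factor (e.g. doubled plasmid copy number) scales both reporters alike.
global_factor = 2.0
scaled_test = [global_factor * t for t in base_test]
scaled_control = [global_factor * c for c in base_control]

unscaled_ratio = ratiometric_activity(base_test, base_control)
scaled_ratio = ratiometric_activity(scaled_test, scaled_control)
print(unscaled_ratio, scaled_ratio)
```

      The two printed lists are identical, which is the property the normalization control is meant to provide. A factor that affects only one of the two promoters would not cancel in this way.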

      Overall Appraisal:

      In summary, this appears as a very interesting study, providing valuable data and a novel hypothesis, the extrusion model, open to future explorations. However, given several limitations, some of the claims appear overstated. Finally, the text contains some self-evaluations, such as "our findings redefine the paradigm for replication control", etc., that appear exaggerated.

      We thank the reviewer for highlighting the need for precise language in framing our conclusions. We have implemented the following substantive revisions throughout the manuscript to ensure claims align strictly with empirical evidence:

      (1) Changed "redefine the paradigm for replication control" into "advance the paradigm for replication control" (Introduction)

      (2) Changed "redefine bacterial cell cycle control" into "refine bacterial cell cycle control as a dynamic interplay..." (Discussion)

      (3) Removed the term "spatial" from the Discussion's description of DnaA-chromosome interactions (Discussion, first paragraph).

      (4) Changed "provides a blueprint" into "provides a valuable tool for dissecting spatial regulation..." (Discussion, final paragraph)

      (5) Scrutinized all superlatives (e.g., "critical feat" into "important capability"; "fundamental principle of cellular organization" into "potential organizational strategy")

      (6) Replaced the instances of "robust" with evidence-backed descriptors (e.g., "sensitive," "consistent")

      (7) We agree that the extrusion model requires further validation and have emphasized this in Discussion: "While H-NS perturbation supports extrusion mechanism, future work should identify the full extruder interactome and elucidate how metabolic signals modulate their activity" (final paragraph)

      This calibrated language more accurately represents our study as a conceptual advance with testable mechanisms, not a complete paradigm shift.

      Reviewer #2 (Public review):

      Summary:

      The authors show that in E. coli, the initiator protein DnaA oscillates post-translationally: its activity rises and peaks exactly when DNA replication begins, even if dnaA transcription is held constant. To explain this, they propose an "extrusion" mechanism in which nucleoid-associated proteins such as H-NS, whose amount grows with cell volume, dislodge DnaA from chromosomal binding sites; modelling and H-NS perturbations reproduce the observed drop in initiation mass and extra initiations seen after dnaA shut-down. Together, the data and model link biomass growth to replication timing through chromosome-driven, post-translational control of DnaA, filling gaps left by classic titration and ATP/ADP-switch models.

      Strengths:

      (1) Introduces an "extrusion" model that adds a new post-translational layer to replication control and explains data unexplained by classic titration or ATP/ADP-switch frameworks.

      (2) A major asset of the study is that it bridges the longstanding gap between DnaA oscillations and DNA-replication initiation, providing direct single-cell evidence that pulses of DnaA activity peak exactly at the moment of initiation across multiple growth conditions and genetic perturbations.

      (3) A tunable dnaA strain and targeted H-NS manipulations shift initiation mass exactly as the model predicts, giving model-driven validation across growth conditions.

      (4) A purpose-built Psyn66 reporter combined with mRNA-FISH captures DnaA-activity pulses with cell-cycle resolution, providing direct, compelling data.

      We thank the reviewer for acknowledging the strengths of our study.

      Weaknesses:

      (1) What happens to the (C+D) period and initiation time as the dnaA mRNA level changes? This is not discussed in the text or figure and should be addressed.

      We thank the reviewer for this important observation. Our data demonstrate that increased dnaA mRNA levels induce two compensatory changes in cell cycle progression:

      (1) Earlier replication initiation, manifested as a reduced initiation mass: the initiation mass decreased from 5.6 to 2.6 (OD<sub>600</sub>·ml per 10<sup>10</sup> cells) as the relative dnaA mRNA level increased from 0.2 to 7.2 (normalized to the wild-type level) (Fig. 2F, red).

      (2) Prolonged C+D period: Increased by approximately 60% (from 1.05 to 1.66 hours, Fig. 2F blue).

      The complete quantitative relationship is now explicitly described in the Results section: “Concurrently, the initiation mass was reduced by 50%, and the period from initiation to division (C+D) was increased by ~60% (Fig. 2F)”
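
      As a quick arithmetic check of how the ranges quoted in points (1) and (2) map onto the percentages in the revised sentence, the relative changes can be computed directly. This is a minimal sketch; the helper name is hypothetical, and the only inputs are the four endpoint values stated above.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100.0


# Endpoint values quoted in the response (Fig. 2F).
initiation_mass_change = percent_change(5.6, 2.6)   # OD600·ml per 10^10 cells
c_plus_d_change = percent_change(1.05, 1.66)        # hours

print(f"initiation mass: {initiation_mass_change:.1f}%")
print(f"C+D period: {c_plus_d_change:.1f}%")
```

      With these endpoints the initiation mass falls by roughly 54% and the C+D period rises by roughly 58%, so the "50%" and "~60%" in the quoted sentence are rounded figures.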

      (2) It is unclear what is meant by "relative dnaA mRNA level." Relative to what? Wild-type expression? Maximum expression? This should be explicitly defined.

      The relative dnaA mRNA level was obtained by normalizing to that in wild-type MG1655 cells grown in the same medium. To clarify this point, we have now marked the wild-type level in Fig. 1B, and a clear description of this has also been included in the figure caption.

      (3) It would be helpful to provide some intuition for why an increase in dnaA mRNA level leads to a decrease in initiation mass per ori and an increase in oriC copy number.

      Thank you for your valuable suggestion. Increased dnaA mRNA accelerates DnaA accumulation, causing cells to reach the initiation threshold at a smaller cell size (reducing initiation mass, Fig. 2F red). This earlier initiation increases oriC copies per cell at the population level (Fig. 2E). This mechanistic interpretation now appears in the Results: “As the DnaA expression level increases, DnaA activity reaches the initiation threshold earlier. Given that cell mass remained nearly unchanged, this earlier initiation led to an increase in population-averaged cellular oriC numbers (Fig. 2E).”

      (4) The titration and switch models do not explicitly include dnaA mRNA in the dynamics of DnaA protein. Yet, in Figure 2G, initiation mass is shown to decrease linearly with dnaA mRNA level in these models. How was dnaA mRNA level represented or approximated in these simulations?

      All models presented in this article omit explicit modeling of dnaA mRNA dynamics for simplicity. However, at steady state, the relative level of dnaA mRNA can be approximated by the relative expression rate of DnaA protein, as both reflect the expression level of DnaA. This detail is now clarified in the caption of Figure 2G.

      (5) Is Schaechter's law (i.e., exponential scaling of average cell size with growth rate) still valid under the different dnaA mRNA expression conditions tested?

      Schaechter's law describes the exponential scaling of average cell size with growth rate in bacteria. In our prior work (Zheng et al., Nature Microbiology 2020), we demonstrated that Schaechter's law fails in slow-growth regimes. However, in the current study, growth rate remained constant across different dnaA expression levels (Fig. 2C), and cell mass showed no significant change (Fig. 2D). Since Schaechter's law specifically addresses how cell size scales with growth rate, it does not apply here: growth rate was invariant in our perturbations, which selectively alter replication initiation dynamics, not growth rate or size scaling.

      (6) The manuscript should explain more explicitly how the extrusion model implements posttranslational control of DnaA and, in particular, how this yields the nonlinear drop in relative initiation mass versus dnaA mRNA seen in Figure 6E. Please provide the governing equation that links total DnaA, the volume-dependent "extruder" pool, and the threshold of free DnaA at initiation, and show - briefly but quantitatively - how this equation produces the observed concave curve.

      The governing equations linking initiation mass and DnaA expression level are now provided in Supplementary Note S1 for both the titration and the extrusion model. In general, the dependence of initiation mass (𝑉<sub>𝐼</sub>) on dnaA expression level (𝛼<sub>𝐴</sub>) takes an inverse proportionality form: 𝑉<sub>𝐼</sub> ∝ 1/𝛼<sub>𝐴</sub>. In the extrusion model, the incorporated extruder protein is assumed to have synthesis dynamics similar to those of DnaA and can release DnaA from DnaA-boxes. Denoting the synthesis rate of the extruder as 𝛼<sub>𝐻</sub>, the combined effect of DnaA and the extruder on replication initiation includes an additive contribution of 𝛼<sub>𝐻</sub> (see Supplementary Note S1). This additive contribution dampens the sensitivity of initiation mass to changes in 𝛼<sub>𝐴</sub>, resulting in a significantly flattened curve. As a result, the predicted 𝑉<sub>𝐼</sub> − 𝛼<sub>𝐴</sub> relationship has a concave shape in the semi-log plots.
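The damping argument can be illustrated numerically. The sketch below is not the model from Supplementary Note S1: it assumes, purely for illustration, that initiation mass is k/𝛼<sub>𝐴</sub> without the extruder and k/(𝛼<sub>𝐴</sub> + 𝛼<sub>𝐻</sub>) with it. The function names, the constant k, and the value chosen for 𝛼<sub>𝐻</sub> are all hypothetical.

```python
# Illustrative sketch only: the functional forms and all parameter values below
# are assumptions, not the equations given in Supplementary Note S1.

def initiation_mass_simple(alpha_a, k=1.0):
    """Assumed inverse proportionality: V_I = k / alpha_A."""
    return k / alpha_a


def initiation_mass_with_extruder(alpha_a, alpha_h, k=1.0):
    """Assumed additive extruder term: V_I = k / (alpha_A + alpha_H)."""
    return k / (alpha_a + alpha_h)


# Relative dnaA expression levels (1.0 = reference), log-spaced as on a semi-log plot.
levels = [1.0, 2.0, 4.0, 8.0]
alpha_h = 4.0  # hypothetical extruder synthesis rate, in the same units as alpha_A

# Initiation mass relative to its value at the reference expression level.
simple = [initiation_mass_simple(a) / initiation_mass_simple(1.0) for a in levels]
damped = [
    initiation_mass_with_extruder(a, alpha_h) / initiation_mass_with_extruder(1.0, alpha_h)
    for a in levels
]

for a, s, d in zip(levels, simple, damped):
    print(f"alpha_A = {a:4.1f}   simple: {s:.3f}   with extruder: {d:.3f}")
```

Under these assumed forms, an eightfold increase in 𝛼<sub>𝐴</sub> lowers the relative initiation mass to 1/8 without the extruder term but only to 5/12 with it. On a logarithmic 𝛼<sub>𝐴</sub> axis the damped curve is concave as long as 𝛼<sub>𝐴</sub> stays below 𝛼<sub>𝐻</sub>.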

      (7) Does this extrusion model give the well-known adder per origin, i.e., is initiation-to-initiation an adder?

      Yes, the extrusion model reproduces the initiation-to-initiation adder phenomenon; this is shown in Fig. S3C.

      (8) DnaA protein or activity is never measured; mRNA is treated as a linear proxy. Yet the authors' own narrative stresses post-translational (not transcriptional) control of DnaA. Without parallel immunoblots or activity readouts, it is impossible to know whether a sixfold mRNA increase truly yields a proportional rise in active DnaA.

      We acknowledge the reviewer's valid concern regarding the indirect nature of our DnaA activity measurements. While mRNA levels alone cannot resolve active DnaA dynamics, our approach integrates functional replication outcomes with a validated synthetic reporter to infer activity. Crucially, elevated dnaA mRNA causes demonstrable biological effects: earlier replication initiation (Fig. 2F) and increased oriC copies (Fig. 2E), directly confirming enhanced functional DnaA activity at the oriC locus. The P<sub>syn66</sub> reporter, engineered with DnaA-boxes mirroring oriC's architecture, provides orthogonal validation, showing progressive repression in response to dnaA induction (Fig. 3C). Our operational metric, based on P<sub>syn66</sub>, responds sensitively to DnaA-chromosome interactions within its characterized 8-fold dynamic range (Fig. 3C). Immunoblots would be inadequate here, as they cannot distinguish functionally critical pools: free versus chromosome-bound DnaA, or DnaA-ATP versus DnaA-ADP, precisely the post-translational states our study implicates in regulation. We therefore prioritize functional readouts (initiation timing) and the P<sub>syn66</sub> reporter, which probes the biologically active fraction relevant to replication control.

      (9) Figure 2 infers both initiation mass and oriC copy number from bulk measurements (OD<sub>600</sub> per cell and rifampicin-cephalexin run-out) instead of measuring them directly in single cells. Any DnaA-dependent changes in cell size, shape, or antibiotic permeability could skew these bulk proxies, so the plotted relationships may not accurately reflect true initiation events.

      We acknowledge the reviewer's valid methodological concern and clarify that while bulk measurements carry inherent limitations, our approach is grounded in established techniques with demonstrated reliability. Cell mass was inferred from OD600/cell, which correlates strongly with direct dry weight measurements and microscopic cell volumes across diverse growth conditions, as validated in our prior work (Zheng et al., Nature Microbiology 2020). Crucially, cell mass remained invariant across dnaA expression levels (Fig. 2D).

      Regarding oriC quantification, the rifampicin-cephalexin run-out assay is widely applied in replication initiation studies. Our data show the expected 2<sup>n</sup> oriC distributions without abnormal ploidy (as shown below). While single-cell methods offer superior resolution, our bulk approach provides accurate population-level trends.

      Author response image 1.

      Recommendations for the authors:

      Reviewing Editor Comments:

      The reviewers felt that the mathematical modeling was not adequately explained in the paper, and that this affected the readability of the manuscript. The authors are encouraged to elaborate on this aspect of the paper (in addition to strengthening other claims, if possible, per the reviewers' comments).

      We thank the editor and reviewers for their constructive feedback. We have comprehensively strengthened the mathematical modeling framework to enhance clarity and rigor.

      Reviewer #1 (Recommendations for the authors):

      The only revision I would do is a recalibration of the claims and a major effort to clarify the modeling part (including a detailed SI appendix), without necessarily performing additional work.

      To make the mathematical modeling more transparent, we have completed the model description in the Methods section and added a parameter table with literature-sourced values (Supplementary Information Table S6). In addition, analytical derivations of the initiation mass dependencies are presented in Supplementary Information Note S1.

      Of course, there are extra experiments (mentioned in the public review) that would help support some of the big claims, but that can be considered a different project.

      Thank you for your suggestion. This will be addressed in our future work.

      Minor suggestion: please put signposts or plot jointly to compare the maxima/minima in Figures 4D, E, G, and H.

      We added dashed lines in Figures 4D and E to align the DnaA activity peaks and transcriptional minima across panels, facilitating direct biological comparisons.

      Reviewer #2 (Recommendations for the authors):

      (1) Should define what DnaA activity is.

      We have explicitly defined DnaA activity in the Introduction as “the capacity to initiate replication…” and noted that it is “governed by free DnaA concentration, DnaA-ATP/-ADP ratio, and orisome assembly competence”.

      (2) Word repetition - “...grown in in Luria-Bertani (LB) medium...”.

      Corrected.

      (3) Typographical error - “FISH ... was preformed" should be "performed”.

      Corrected.

      (4) The manuscript alternates between “ng ml<sup>-1</sup>” and “ng·ml<sup>-1</sup>”; choose one style and apply it uniformly.

      Standardized the units to ng·ml<sup>-1</sup> throughout.

      (5) Reference duplicates - Some citations appear twice in the bibliography (e.g., "Bintu et al., 2005a/b" and "Bintu et al., 2005b" listed again later).

      The studies by Bintu et al. (2005a, 2005b) represent separate works: 2005a details applications, and 2005b develops models.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      We thank the reviewers for their positive assessments overall and for many helpful suggestions for clarification to make the manuscript more accessible to a broader audience. We made minor text changes and added more labels to the figures to address these comments.

      Referee #1

      Summary: In this study, the authors show a genetic interaction of the lipid receptors Lpr-1, Lpr-3 and Scav-2 in C. elegans. They show that Lpr-1 loss-of-function specifically affects aECM localization of Lpr-3 and attribute the lethality of Lpr-1 mutants to this phenotype. The authors performed a mutagenesis screen and identified a third lipid receptor, Scav-2, as a modulating factor: loss of scav-2 partially rescues the Lpr-1 phenotype. The authors created a variety of tools for this study, notably Crispr-Cas9-mediated knock-ins for endogenous tagging of the receptors.

      Major comments:

      1. while the authors provide a nice diagram showing the potential roles and interplay of lpr-1, lpr-3 and scav-2, it remains unclear what their respective cargo is. The nature of interaction between the proteins remains unclear from the data.

      Response

      • We agree that identifying the relevant cargo(s) will be key to understanding the detailed mechanisms involved and that the lack of such information is a limitation of our study. However, the impact of our study is to show that these lipid transporters functionally interact to affect aECM organization, a role that could be relevant to many systems, including humans.

      As an optional (since time-consuming) experiment I would suggest trying more tissue-specific lipidomics.

      Response

      • This would be an interesting future experiment but is outside our current technical capabilities.

      The lipidomics data should be presented in the figures, even if there were no significant changes. Importantly, show the lipid abundance at least of total lipids, better of individual classes, normalized to the material input (e.g. number of embryos, protein).

      Response

      • The reviewer is right to point out that lipid variations could occur at different levels, and that we should exercise caution. However, the unsupervised lipidomics analysis would have detected not only individual lipid variations, but also variations in the total or subgroup lipid content. Indeed, the eggs were weighed prior to extraction and each sample was extracted with the same precise volume of solvent before analysis. Furthermore, the LC-MS/MS injection sequence included blanks and quality control (QC) samples. The blanks were the extraction solvent, which allowed us to control for features unrelated to the biological samples. The QC sample was a mixture of all the samples included in the injection sequence, reflecting the central values of the model. If a subclass of samples, such as the lpr-1 mutant, had been characterized by a decrease in one lipid, a subgroup of lipids, or all lipids, it would have clustered separately. Instead, our PCA showed that the variation between samples of the same genotype (wild type, lpr-1 mutant, or lpr-1; scav-2) was similar to the variation between samples from two different genotypes. This means that we did not detect modifications to lipid quantity specifically or in total. A figure illustrating the lipid contents would show no difference between groups.

      Figure 1g: I do not understand what the lpr3:gfp signal is: the puncta in the overview image? And where are they in the zoom image showing annuli and alae? Also, how were the annuli and alae structures labeled? Please provide more information.

      Response

      • All of the fluorescent signal shown in this figure panel corresponds to the indicated LPR fusion - no other labelling method was used. SfGFP::LPR-3 labels the matrix structures (alae and annuli) as well as some puncta – the ratio of matrix to puncta changes over developmental stages. We edited the figure legend to make this more clear.

      One point that is not sufficiently addressed is that the authors deduce from the inability of the scav-2 gfp knock in to suppress lpr1 lethality that scav2 function is not impaired. This is quite indirect. Can the authors provide more convincing evidence that scav-2 ki has normal function?

      Response

      • Suppression of lpr-1 (or other aECM mutant) lethality is the only known phenotype caused by loss of scav-2. Therefore, this is the only phenotype for which we can do a rescue experiment to test functionality of the knock-in. The data presented do indicate that the knock-in fusion retains significant function.

      In general, the data is clearly presented and the statistical analyses look sound.

      Response

      • Thank you

      Minor comments:

      Please provide page and line numbers!

      Response:

      • done

      Avoid contractions like "don't" in both text and figure legends

      Response:

      • changed one instance of “don’t” to “do not”

      Page 12: I do not understand the meaning of the sentence "This transgene also caused more modest lethality in a wild-type background"

      Response:

      • Wording changed to “This transgene caused very little lethality in a wild-type background (Fig. 6C), indicating it is not generally toxic.”

      Figure 7: what is meant with "Dodt"?

      Response:

      • Dodt gradient contrast imaging is a method for transmitted light imaging similar to DIC and is used on some confocal microscopes. It is now explained in the Methods section. We removed the Dodt label from Figure 7 since it seems to be confusing and it is not really important whether the brightfield image is DIC or Dodt.

        Reviewer #1 (Significance (Required)):

        The study is experimentally sound and uses numerous novel tools, such as endogenously tagged lipid receptors. It is an interesting study for researchers in basic research studying lipid receptors and ECM biology. It provides insights on the genetic interaction of lipid receptors. My expertise is in lipid biochemistry, inter-organ lipid trafficking and imaging. I am not very familiar with C. elegans genetics.

      Referee #2

      1. The manuscript is very well written; the documentation is fine, but some more details are needed for better following the subject for readers not familiar with nematode anatomy.

      For instance, while alae are somehow explained, annuli are not - structures that look abnormal in lpr1 and lpr1-scav2 mutants (Fig. 5B).

      Response

      • Apologies for this oversight. We added annuli labels to Figure 1 and Figure 5 panels and added descriptions of annuli to the Figure 1 legend and the Results text.

      Moreover, the authors show in Fig. 1 the puncta etc. in the epidermis, whereas in Fig. 2 they show Lpr3 accumulation or not in the duct and the pore (lpr1). How do they localize in the cells of these structures at high magnification? It is also important to see the Lpr3 localisation in lpr1 mutants shown in Fig. 2A with the quality of the images shown in Fig. 1F. This applies also to Figs. 4 and 5.

      Responses:

      • The embryonic duct and pore cells are very small and we have not reliably seen puncta within them. In Figs 2 and 5, we supplemented the duct and pore images with those from the epidermis, which is a much larger tissue, allowing us to resolve puncta and matrix structures with better resolution.
      • The laser settings in Figs 2,4,5 (as opposed to Fig. 1) were chosen to avoid saturation of the matrix signal so that we could do accurate quantifications as shown. The images are unmodified with respect to brightness and therefore appear relatively dim – but we think they convey the observations very accurately.

      I would like to see punctae in lpr1-scav2 doubles.

      Response:

      • Puncta in this genotype are shown for the epidermis in Figure 5. It has not been possible to see puncta specifically within the embryonic duct and pore.

      Regarding the central mechanism, one possibility is - what the authors describe - that Lpr1 is needed for Lpr3 accumulation in ducts and tubes. Alternatively, Lpr1 is needed for duct and tube expansion, in lack of which Lpr3 is unable to reach its destination that is the lumina. Scav2, in this scenario, might be antagonist of tube and duct expansion, and thereby rescue the Lpr1 mutant phenotype independently. Admittedly, the non-accumulation of Lpr3 in scav2 mutants argues against a lpr1-independent function of scav2.

      Responses:

      • LPR-1 is indeed needed to maintain duct and pore tube integrity as the tubes grow, but in mutants the tubes appear to collapse at a later stage than we imaged here (Stone et al 2009). The ~normal accumulation of LET-4 and LET-653 further argues that the duct and pore tubes are still intact at the 1.5-to-2-fold stages. Therefore, we conclude that the defect in LPR-3 accumulation precedes duct and pore collapse.
      • The changes we document in the epidermis also show that the lpr-1 mutant affects LPR-3 accumulation in another (non-tube) tissue.

      In any case, to underline the aspect of Lpr1-Scav2 dosage relationship, the authors may also have a look at Lpr3 distribution in lpr1 heterozygous, and lpr1-scav2 double heterozygous worms. In this spirit, it would be interesting to see the semi-dominant effects of scav2 on Lpr3 localisation in lpr1 mutants by microscopy.

      Response:

      • Because of the hermaphroditism of C. elegans, it would be technically challenging to confidently identify heterozygous (vs. homozygous) embryos for confocal imaging. We do not think that the results would be informative enough to warrant the effort, given that we’ve already shown that scav-2 heterozygosity can partly suppress lpr-1. The expectation is that LPR-3 levels would be partially restored in the scav-2 het, but it might take a very large sample size to confidently assess that partial effect.

      One word to the overexpression studies: it is surprising that the amounts of Scav2 delivered by the expression through the grl-2 promoter in the lpr1, scav2 background are almost matching those by the opposite effect of scav2 mutations on lpr1 dysfunction.

      Response:

      • The reviewer refers to the transgenic rescue experiment with the grl-2pro::SCAV-2 transgene. Because the scav-2 mutant phenotype being tested is suppression of lpr-1 lethality, the expected result from scav-2 rescue is to restore the lpr-1 lethal phenotype to the strain. This is exactly the result we see. We have revised the text to more clearly explain the logic.

      One issue concerns the localization of scav2-gfp "rarely" in vesicles: what are these vesicles?

      Response

      • Only a handful of vesicles were seen across all the images we collected, and we have not yet identified them. They could be associated with either SCAV-2 delivery or removal from the plasma membrane, as now stated in the text. SCAV-2 trafficking would be an interesting area for further study but is beyond the scope of this paper.

      One comment to the Let653 transgenes/knock-ins: the localization of transgenic Let653-gfp may be normal in lpr1 mutants because there are wild-type copies in the background.

      Response

      • There are wild type copies of LET-653 in the background, but no wild type copies of LPR-1. Even if the untagged LET-653 would be recruiting the tagged LET-653 as the reviewer suggests, we can still conclude that lpr-1 loss does not prevent the untagged LET-653 (and thus also the tagged LET-653) from accumulating in the duct lumen matrix.

      One thought to the model: if Scav2 has a function in a lpr1 background, this means that yet another transporter X delivers the substrate for Scav2, isn't it?

      Response

      • Yes, we completely agree with this interpretation and have revised the discussion and Figure 8 legend to more explicitly make this point.

      A word to the term haploinsufficient that is used in this study: scav2 mutants would be haploinsufficient if the heterozygous worms died in an otherwise wild-type background.

      Response

      • We disagree with this comment. The term “haploinsufficient” simply means that heterozygosity for a deletion or other loss of function allele can cause a mutant phenotype – the term is not restricted to lethal phenotypes.

        Reviewer #2 (Significance (Required)):

        Alexandra C. Belfi and colleagues wrote the manuscript entitled "Opposing roles for lipocalins and a CD36 family scavenger receptor in apical extracellular matrix-dependent protection of narrow tube integrity" in which they report on their findings on the genetic and cell-biological interaction between the lipid transporters Lpr1 and scav2 in the nematode C. elegans. In principle, these two proteins are involved in shaping the apical extracellular matrix (aECM) of ducts by regulating the amounts of Lpr3 in the extracellular space. While Scav2 seems to act cell autonomously, Lpr1 has a non-cell autonomous effect on Lpr3.


      Referee #3

      Summary: Using a powerful combination of genetic and quantitative imaging approaches, Belfi et al. describe novel findings on the roles of several lipocalins (secreted lipid carrier proteins) in the production and organization of the apical extracellular matrix (aECM) required for small diameter tube formation in C. elegans. The work comprises a substantial extension of previous studies carried out by the Sundaram lab, which has pioneered studies into the roles of aECM and accessory proteins in creating the duct-pore excretion tube and which also plays a role in patterning of the epidermal cuticle. One core finding is that the lipocalin LPR-1 does not stably associate with the aECM but is instead required for the incorporation of another lipocalin, LPR-3. A second major finding is that reduction of function in SCAV-2, a SCARB family membrane lipid transporter, suppresses lpr-1 mutant lethality along with associated duct-pore defects and mislocalization of LPR-3. Likewise, loss of scav-2 partially suppresses defects in two other aECM proteins and restores defects in LPR-3 localization in one of them (let-653). Additional genetic and protein localization studies lead to the model that LPR-1 and SCAV-2 may antagonistically regulate one or more lipid or lipoprotein factors necessary for LPR-3 localization and duct-pore formation. A role for LPR-1 and LPR-3 at lysosomes is clearly implicated based on co-localization studies, although a specific role for lysosomes (or related organelles) is not defined. Finally, MS data suggests that neither LPR-1 nor SCAV-2 grossly affects lipid composition in embryos, consistent with dietary interventions failing to affect mutant phenotypes. Ultimately, a plausible schematic model is presented to account for much of the data.

      Major comments:

      1. The studies are very thorough, convincing, and generally well described. Conclusions are logical and well grounded. Additional experiments are not required to support the authors major conclusions, and the data and methods are described in a sufficient detail to allow replication. As such my comments are minor and should be addressable at the author's discretion in writing.

      Response

      • Thank you for these positive comments

      Minor comments:

        2) In the abstract, "tissue-specific suppression" made me think that there was going to be a tissue-specific knockdown experiment, which was not the case. Rather scav-2 suppression is specific to the duct-pore, which corresponds to where scav-2 is expressed. Consider rewording this.

      Response

      • Wording was changed to “duct/pore-specific suppression”

        3) Page 5. Suggest wording change to, "Whereas LPR-3 incorporates stably into the precuticle, suggesting a structural role in matrix organization, LPR-1..."

      Response

      • Done

        4) LIMP-2 versus LIMP2. Both are used. Uniprot lists LIMP2, but some papers use LIMP-2. Choose one and be consistent.

      Response

      • Everything changed to LIMP2.

        5) Some of the data for S6 Fig wasn't referred to directly in the text. Namely results regarding pcyt-1 and pld-1. I'd suggest incorporating this into the results section possibly using, "As a control for our lipid supplementation experiments..."

      Response

      • These experiments are now described on page 11.

        6) Page 12 bottom. I understand the use of "oppose", but another way to put it is that SCAV-2 and LPR-1 (antagonistically or collectively) modulate aECM composition. Other terms that might confuse some readers are upstream and downstream, although I am OK with their use in the context of this work.

      Response

      • The genetics indicate that lpr-1 and scav-2 have opposite effects on tube shaping and LPR-3 localization, so they do function antagonistically rather than collectively/cooperatively; we decided to keep this terminology.

        7) Page 16. I understand the logic that SCAV-2 is unlikely to directly modulate LPR-3 given its presumed molecular function. But is it possible that LPR-3 levels are already maxed out in the aECM so that loss of SCAV-2 doesn't lead to any increase? Conversely, one could argue that even if acting indirectly, SCAV-2 could have led to increased LPR-3 levels, unless they were already maxed.

      Response

      • This is a good point and the possibility is now mentioned in the Results page 9. We also changed our wording in the Abstract and Discussion to acknowledge the possibility that LPR-3 could be the SCAV-2 cargo, though we still don’t favor this model.

        8) Figure legend 1. I did not see an asterisk in figure 1B.

      Response

      • thanks for catching this error, text removed

        9) Figure 1C. Might want to define the "degree" term in the legend for people outside the field.

      Response

      • We added an explanation to the figure legend.

        10) Fig 1 G. I was just wondering if cuticle autofluorescence was an issue for taking these images.

      Response

      • Cuticle auto fluorescence is generally quite dim in L4s with our settings, and it was not an issue at this mid/late L4 stage, which corresponds to when both LPR fusions are at their brightest. Note that both large panels are MAX projections and yet you can’t see any cuticle auto-fluorescence in the LPR-1 panel.

        11) Fig 2 and others. Please define error bars.

      Response

      • These correspond to the standard deviation; this information is now added to the Methods.

        12) Fig 5. From the images, it looks like lpr-1; scav-2 doubles might have a worse (pre)cuticle defect in LPR-3 localization than lpr-1 singles. If so that would be interesting and would suggest that their relationship with respect to the modulation of LPR-3 is context dependent. Admittedly, the lack of obvious scav-2 expression in the epidermis would not be consistent with an effect (positive or negative).

      Response

      • The lpr-1 scav-2 strain is certainly not improved over lpr-1 but we have not noted any consistent worsening of the phenotype either.

        13) Consider defining Dodt in the first figure legend where it appears.

      Response

      • Dodt gradient contrast imaging is a method of transmitted light imaging similar to DIC and is used on some confocal microscopes. It is now explained in the Methods section. We removed the term from Figure 7 since it seems to be confusing.

        14) For Mander's, is there a reason to report just one of the two findings (M1 or M2) versus both?

      Response

      • We now include the 2nd Manders value in the figure legend and note that value is much lower (0.25) because much of the red signal is lysosomes (where green would be quenched by acidity).
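For readers less familiar with Manders' coefficients, the following is a minimal sketch of how the two values can differ for the same image pair. It is not the analysis pipeline used for the figure: the function name, the zero thresholds, and the toy pixel intensities are all hypothetical, and which channel is called M1 or M2 is a convention assumed here.

```python
def manders_coefficients(red, green, red_threshold=0.0, green_threshold=0.0):
    """Return (M1, M2) for two equally sized lists of pixel intensities.

    M1: fraction of total red intensity found in pixels where green is above threshold.
    M2: fraction of total green intensity found in pixels where red is above threshold.
    """
    if len(red) != len(green):
        raise ValueError("channels must contain the same number of pixels")
    total_red = sum(red)
    total_green = sum(green)
    if total_red == 0 or total_green == 0:
        raise ValueError("each channel needs some signal")
    m1 = sum(r for r, g in zip(red, green) if g > green_threshold) / total_red
    m2 = sum(g for r, g in zip(red, green) if r > red_threshold) / total_green
    return m1, m2


# Toy data (hypothetical): red is present in every pixel, green in only two of them,
# as would happen if green were quenched in part of the red compartment.
red = [8.0, 6.0, 4.0, 2.0]
green = [3.0, 0.0, 0.0, 1.0]
m1, m2 = manders_coefficients(red, green)
print(f"M1 (red overlapping green) = {m1:.2f}, M2 (green overlapping red) = {m2:.2f}")
```

In this toy case all of the green signal sits on red pixels, but only half of the red signal sits on green pixels, so the two coefficients are asymmetric.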

        15) Consider referring to specific panels (A, B...) within references to the supplemental files.

      Response

      • done

        16) Fig S6E. Neither "increasing nor increasing" to "increasing nor decreasing".

      Response

      • fixed

        **Referees cross-commenting**

        I thought that Reviewers 1 and 2 brought up some good points. My sense is that Belfi and colleagues can address most of these in writing, but are of course welcome to add new data as they see fit. I get that it's not a "perfect" paper where everything is explained fully or comes together, but I don't see that as a flaw that needs to be fixed. I think that the manuscript represents a good deal of work (as it is) and provides a sufficient advance while also suggesting an interesting link to disease. It will be up to individual journals to decide if the findings meet their criteria.

        Reviewer #3 (Significance (Required)):

        Significance: The work carried out in this paper, and more generally by the Sundaram lab, always has a ground-breaking element because very few labs in the field have studied in detail the developmental roles and regulation of the aECM, in large part because it can be challenging to dissect. The core findings in this study are rather novel and unexpected, namely the opposing roles of the paralogous LPR-1 and LPR-3 lipocalins and their functional interactions with SCAV-2. The study does stop short of finding specific molecules (lipid or lipoprotein) that would mediate the effects they report, and it wasn't yet clear how the lysosomal co-loc plays a role, but this is not a criticism of the work presented or the forward progress. I was particularly intrigued by the idea, presented in the discussion, that disruption of vascular aECM could potentially account for some of the (complex) observations regarding the role of lipocalins and SCARB proteins in human disease. This would represent a new avenue for researchers to consider and underscores the power of using non-biased approaches in model systems.

        As for all my reviews, this is signed by David Fay.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer 1:

      Strengths:

      The innovation on the task alone is likely to be impactful for the field, extending recent continuous report (CPR) tasks to examine other aspects of perceptual decision-making and allowing more naturalistic readouts. One interesting and novel finding is the observation of dyadic convergence of confidence estimates even when the partner is incidental to the task performance, and that dyads tend to be more risk-seeking (indicating greater confidence) than when playing solo. The paper is well-written and clear.

      We thank reviewer 1 for this encouraging evaluation. Below we address the identified weaknesses and recommendations.

      (1) Do we measure metacognitive confidence?

      One concern with the novel task is whether confidence is disambiguated from a tracking of stimulus strength or coherence. […] But in the context of an RDK task, one simple strategy here is to map eccentricity directly to (subjective) motion coherence - such that the joystick position at any moment in time is a vector with motion direction and strength. This would still be an interesting task - but could be solved without invoking metacognition or the need to estimate confidence in one's motion direction decision. […] what the subjects might be doing is tracking two features of the world - motion strength and direction. This possibility needs to be ruled out if the authors want to claim a mapping between eccentricity and decision confidence […].

      We thank reviewer 1 for pointing out that the joystick tilt responses of our subjects could potentially be driven by stimulus coherence instead of metacognitive decision confidence. Below, we present four arguments to address this point of concern:

      (1.1) Similar physical coherence between high and low confidence states

      Nominal motion coherence is a discrete value, but the random noisiness in the stimulus causes the actual frame-by-frame coherence to be distributed around this nominal value. Because of this, subjects might scale their joystick tilt report according to the coherence fluctuations around the nominal value. To check if this was the case, we use a median split to separate stimulus states into states with large versus small joystick tilt, individually for each nominal coherence. For each stimulus state, we extracted the actual instantaneous (frame-to-frame) motion coherence, which is based on the individual movements of dots in the stimulus patch between two frames, recorded in our data files.

      First, we compared the motion coherence between stimulus states with large versus small joystick tilt. For each stimulus state, we calculated the average instantaneous motion coherence, and analyzed the difference of the medians for the large versus small tilt distributions for each subject and each coherence level. The resulting histograms show the distribution of differences across all 38 subjects for each nominal coherence, and are, except for the coherence of 22%, not significantly different from zero across subjects (Author response image 1). For the 22% coherence condition, the difference amounts to 0.19%, a very small, imperceptible difference. Thus, we do not find systematic differences between the average motion coherence in states with high versus low joystick tilt.

      Author response image 1.

      Histograms of within-subject difference between medians of average coherence distributions with large and small joystick tilt for all subjects. Coherence is color-coded (cyan – 0%, magenta – 98%). On top, the title of each panel illustrates the number of significant differences (Ranksum test in each subject) without correction for multiple comparisons (see Author response table 1 below). In the second row of the title, we show the result of the population t-test against zero. Only 22% coherence shows a significant bias. Positive values indicate higher average coherence for large joystick tilt.  

      Author response table 1.

      List of all individual significantly different coherence distributions between high and low tilt states, without correction for multiple comparisons. Median differences do not show a consistent bias (i.e. positive values) that would indicate higher average coherence for the large tilts.
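
      The median-split comparison in 1.1 can be sketched as follows. This is a minimal illustration, not the analysis code used for the paper: the function name and the example values are hypothetical, and the sketch assumes that each stimulus state has already been reduced to one average tilt and one average instantaneous coherence.

```python
from statistics import median


def median_split_difference(tilts, coherences):
    """Difference of median coherence between large-tilt and small-tilt states.

    tilts[i] and coherences[i] describe the same stimulus state. States whose
    tilt exceeds the median tilt are "large", all others are "small".
    A positive result means higher coherence in the large-tilt states.
    """
    cutoff = median(tilts)
    large = [c for t, c in zip(tilts, coherences) if t > cutoff]
    small = [c for t, c in zip(tilts, coherences) if t <= cutoff]
    return median(large) - median(small)


# Invented values for one subject at one nominal coherence (in percent).
tilts = [0.2, 0.8, 0.5, 0.9, 0.3, 0.7]
coherences = [22.1, 21.9, 22.0, 22.2, 21.8, 22.3]
diff = median_split_difference(tilts, coherences)
```

      Computing one such difference per subject and per nominal coherence would yield the kind of distribution that is then tested against zero across subjects.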

      (1.2) Short-term stimulus fluctuations have no effect

      […] But to fully characterise the task behaviour it also seems important to ask how and whether fluctuations in motion energy (assuming that the RDK frames were recorded) during a steady state phase are affecting continuous reporting of direction and eccentricity, prior to asking how social information is incorporated into subjects' behaviour.

      In addition to the analysis of stimulus coherence and tilt averaged across each stimulus state (1.1), we analyzed the moment-to-moment relationship between instantaneous coherence and ongoing reports of accuracy and tilt. Below, we provide evidence that short-term fluctuations in the instantaneous coherence (i.e. the motion energy of the stimulus) do not result in correlated changes in joystick responses, neither for tilt nor accuracy. For each continuous stimulus state, we calculated cross-correlation functions between the instantaneous coherence, tilt and accuracy, and then averaged the cross-correlation across all states of the same nominal coherence, and then across subjects. The resulting average cross-correlation functions are essentially flat. This further supports our interpretation that the joystick reports do not reflect short-term fluctuations of motion energy.

      Author response image 2.

      Cross-correlation between the length of the resultant vector with joystick accuracy (left) and tilt (right). Coherence is color-coded. Shaded background illustrates 95% confidence intervals.

      (1.3) Joystick tilt changes over time despite stable average stimulus coherence

      If perceptual confidence is derived from evidence integration, we should see changes over time even when the stimulus is stable. Here, we have analyzed the average slope of the joystick tilt as a function of time within each stimulus state for each subject and each coherence, to verify whether our participants tilted their joystick more with additional evidence. This is illustrated with a violin plot below (Author response image 3). The linear slopes of the joystick tilt progression over the course of stimulus states differ between coherence levels. High coherence causes more tilt over time, resulting in positive slopes for most subjects. In contrast, low or zero coherence results mostly in flat or negative slopes. This tilt progression over time indicates that low coherence results in lower confidence, as subjects do not wager more with weak evidence. In contrast, high coherence causes subjects to exhibit more confidence, indicated by the positive slope of the joystick tilt.

      Author response image 3.

      Violin plots showing the fitted slopes of the joystick tilt time course in the last 200 samples (1667 ms) leading up to a next stimulus direction (cf. Figure 2D). Positive values signify an increase in joystick tilt over time. Each dot shows the average slope for one subject. Coherence is color-coded. The dashed line at zero indicates unchanged joystick tilt over the analyzed time window.
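
      The slope analysis in 1.3 can be sketched as an ordinary least-squares fit of tilt against time over the final 200 samples of a stimulus state. This is only an illustration under stated assumptions: the function names are hypothetical, the 120 Hz sampling rate is inferred from 200 samples spanning 1667 ms, and the traces are invented.

```python
def fitted_slope(samples, sample_rate_hz=120.0):
    """Least-squares slope of `samples` against time, in units per second."""
    n = len(samples)
    times = [i / sample_rate_hz for i in range(n)]
    mean_t = sum(times) / n
    mean_y = sum(samples) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, samples))
    var = sum((t - mean_t) ** 2 for t in times)
    return cov / var


def final_window_slope(tilt_trace, window=200, sample_rate_hz=120.0):
    """Slope of the tilt over the last `window` samples of one stimulus state."""
    return fitted_slope(tilt_trace[-window:], sample_rate_hz)


# Invented traces: tilt ramping up by 0.001 per sample, and a constant tilt.
rising_trace = [0.5 + 0.001 * i for i in range(300)]
flat_trace = [0.4] * 300
rising_slope = final_window_slope(rising_trace)  # positive: tilt increases
flat_slope = final_window_slope(flat_trace)      # approximately zero
```

      Averaging such slopes over all states of one coherence level would give one value per subject and coherence, corresponding to one dot in the violin plot.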

      (1.4) Cross-correlation between response accuracy and joystick tilt

      Similar to 1.2 above, we have cross-correlated the frame-by-frame changes of joystick accuracy and tilt for each individual stimulus state and each subject. Across subjects, changes in tilt occur later than changes in accuracy, indicating that changes in the quality of the report are followed by changes in the size of the wager. Given that this process is not driven by short-term changes in the motion energy of the stimulus (see 1.2 above), we interpret this as additional evidence for a metacognitive assessment of the quality of the behavioral report (i.e. accuracy) reflected in the size of the wager (our measure for confidence). (See Figure 2E).
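
      The lag relationship described in 1.4 can be illustrated with a lagged correlation of frame-by-frame changes. This is a minimal sketch, not the published analysis: the function names, the lag range, and the synthetic signals are assumptions, and the sign convention (a positive lag means tilt changes follow accuracy changes) is chosen here for illustration.

```python
def frame_changes(x):
    """Frame-by-frame changes (first differences) of a signal."""
    return [b - a for a, b in zip(x, x[1:])]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def cross_correlation(x, y, max_lag):
    """Map each lag to the correlation of x[t] with y[t + lag].

    x and y must have equal length. A peak at a positive lag means that
    changes in y follow changes in x.
    """
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        result[lag] = pearson(a, b)
    return result


# Synthetic example: tilt is a copy of accuracy delayed by 3 samples.
accuracy = [((i * 7) % 11) / 10 for i in range(60)]
tilt = [accuracy[0]] * 3 + accuracy[:-3]
cc = cross_correlation(frame_changes(accuracy), frame_changes(tilt), max_lag=5)
peak_lag = max(cc, key=cc.get)  # lag with the highest correlation
```

      In this synthetic case the peak falls at the positive lag that was built into the data. Applied to real reports, the function would be evaluated per stimulus state and the resulting curves averaged across states and subjects.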

      (2) Peri-decision wagering is different to post-decision wagering

      […] One route to doing this would be to ask whether the eccentricity reports show statistical signatures of confidence that have been established for more classical punctate tasks. Here a key move has been to identify qualitative patterns in the frame of reference of choice accuracy - with confidence scaling positively with stimulus strength for correct decisions, and negatively with stimulus strength for incorrect decisions (the so-called X-pattern, for instance Sanders et al. 2016 Neuron […].

      We thank reviewer 1 for the constructive feedback. Our behavioral data do not show similar signatures to the previously reported post-decision confidence expression (Desender et al., 2021; Sanders et al., 2016). The previously described patterns show, first of all, that confidence for incorrect type 1 decisions diverges from that for correct type 1 decisions, declining with stimulus strength (e.g. coherence), as compared to an increase for correct decisions. In our task, there is a graded accuracy and (putative) confidence expression, but there are no correct or incorrect decisions; instead, there are hits and misses of the reward targets presented at nominal directions. Instead of a decline for misses, we observe an equally positive scaling of confidence with coherence, both for hits and misses (Author response image 4A). This is because in our peri-decision wagering task, the expression of confidence causally determines the binary hit or miss outcome. The outcome in our task is a function of the two-dimensional joystick response: higher tilt (confidence) requires a more accurate response to successfully hit a target. Thus, a subject can display a high (but not high enough) level of accuracy and confidence but still remain unsuccessful. If we instead median-split the confidence reports by high and low accuracy (Author response image 4C), we observe a slight separation, especially for higher coherences, but still no clear difference in slopes.

      We do observe the other two dynamic signatures of confidence (Desender et al., 2021): signature 2 – monotonically increasing accuracy as a function of confidence (Author response image 4), and signature 3 – steeper type 1 psychometric performance (accuracy) for high versus low confidence (Author response image 4D).

      Author response image 4.

      Confidence (i.e., joystick tilt, left column) and accuracy reports (right column) for different stimulus coherence, sorted by discrete outcome (hit versus miss, upper row) and the complementary joystick dimension (lower row, based on median split).

      Author response image 5.

      Accuracy reports correlate positively with confidence reports. For each stimulus state, we averaged the joystick response in the time window between 500 ms (60 samples) after a direction change until the first reward target appearance. If there was no target, we took all samples until the next RDP direction change into account. This corresponds to data snippets averaged in Figure 2D. Thus, for each stimulus state, we extracted a single value for joystick accuracy and for tilt (confidence). Subsequently, we fitted a linear regression to the accuracy-confidence scatter within each subject and within each coherence level. The plot above shows the average linear regression between accuracy and confidence across all subjects (i.e., the slopes and intercepts were averaged across n=38 subjects). Coherence is color-coded.

      (3)  Additional analyses regarding the continuous nature of our data

      I was surprised not to see more analysis of the continuous report data as a function of (lagged) task variables. […]

      Reviewer 1 requested more analyses regarding the continuous nature of our data. We agree that this is a useful addition to our paper, and thank reviewer 1 for this suggestion. To address this point, we revised main Figure 2 and provided additional panels. Panel D illustrates the continuous ramp-up of both accuracy and tilt (confidence) for high coherence levels, suggesting ongoing evidence integration and meta-cognitive assessment. Panel E shows the cross-correlation between frame-by-frame changes in accuracy and tilt (see 1.4 above). Here, we demonstrate that changes in the accuracy precede changes in joystick tilt, characterizing the continuous nature of the perceptual decision-making process.

      (4) Explicit motivation regarding continuous social experiments

      This paper is innovating on a lot of fronts at once - developing a new CPR task for metacognition, and asking exploratory questions about how a social setting influences performance on this novel task. However, the rationale for this combination was not made explicit. Is the social manipulation there to help validate the new task as a measure of confidence as dissociated from other perceptual variables? (see query 1 below). Or is the claim that the social influence can only be properly measured in the naturalistic CPR task, and not in a more established metacognition task?

      Our rationale for the combination of real-time decision making and social settings was twofold:

      i. Primates, including humans, are social species. Naturally, most behavior is centered around a social context and continuously unfolds in real-time. We wanted to showcase a paradigm in which distinct aspects of continuous perceptual decision-making could be assessed over time in individual and social environments.

      ii. Human behavior is susceptible to what others think and do. We wanted to demonstrate that the sheer presence of a co-acting social partner affects continuous decision-making, and quantify the extent and direction of social modulation.

      We agree that the motivation for combining the new task and this specific type of social co-action should be clearer. We have clarified this aspect in the Introduction, lines 92-109. In brief, the continuous, free-flowing nature of the CPR task and real-time availability of social information made this design a very suitable paradigm for assessing unconstrained social influences. We see this study as the first step toward disentangling the neural basis of social modulation in primates. See also the response to reviewer 2, point 2, below.

      (5) Response to minor points

      (5.1)  Clarification on behavioral modulation patterns

      Lines 295-298, isn't it guaranteed to observe these three behavioral patterns (both participants improving, both getting worse, only one improving while the other gets worse) even in random data?

      The reviewer is correct. We now simply illustrate these possibilities in Figure 4B and how these patterns could lead to divergence or convergence between the participants (see also line 282). Unlike random data, our results predominantly demonstrate convergence.

      (5.2) Clarification on AUC distributions

      Lines 703-707, it wasn't clear what the AUC values referred to here (also in Figure 3) - what are the distributions that are being compared? I think part of the confusion here comes from AUC being mentioned earlier in the paper as a measure of metacognitive sensitivity (correct vs. incorrect trial distributions), whereas my impression here is that here AUC is being used to investigate differences in variables (e.g., confidence) between experimental conditions.

      We apologize for the confusion. Indeed, the AUC analysis was used for two purposes:

      (i) To assess the metacognitive sensitivity (line 175, Supplementary Figure 2).

      (ii) To assess the social modulation of accuracy and confidence (starting at line 232, Figures 3-6). 

      We now introduce the second AUC approach for assessing social modulation, and the underlying distributions of accuracy and confidence derived from each stimulus state, separately in each subject, in line 232.
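
      The second use of AUC, comparing a variable between two experimental conditions, can be sketched as follows. The response does not spell out the computation, so this sketch relies on the standard equivalence between the area under the ROC curve and the probability that a value drawn from one distribution exceeds a value drawn from the other. The function name, the argument order, and the example values are assumptions.

```python
def auc(baseline, comparison):
    """Probability that a comparison value exceeds a baseline value.

    Ties count as one half. 0.5 means the two distributions are not
    separable; values above 0.5 mean `comparison` tends to be larger.
    """
    wins = 0.0
    for c in comparison:
        for b in baseline:
            if c > b:
                wins += 1.0
            elif c == b:
                wins += 0.5
    return wins / (len(baseline) * len(comparison))


# Invented per-state confidence values for one subject.
solo = [0.2, 0.4, 0.6]
dyadic = [0.5, 0.7, 0.9]
social_modulation = auc(solo, dyadic)  # above 0.5: higher in the dyadic condition
no_modulation = auc(solo, solo)        # identical distributions
```

      Under this convention, a per-subject value above 0.5 would indicate a positive social modulation and a value below 0.5 a negative one.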

      (5.3) Clarification of potential ceiling effects

      Could the findings of the worse solo player benefitting more than the better solo player (Figure 4c) be partly due to a compressive ceiling effect - e.g., there is less room to move up the psychometric function for the higher-scoring player?

      We thank the reviewer for this insight. First, even better performing participants were not at ceiling most of the time, even at the highest coherence (cf. Figure 2 and Supplementary Figure 3C). To test for a potential ceiling effect in the better solo players, we correlated their social modulation (expressed as AUC, as in Figure 4) with their solo performance. There was no significant negative correlation for accuracy (p > 0.063), but there was a negative correlation for confidence (r = -0.39, p = 0.0058), indicating that low performing “better players in a dyad” indeed showed more positive social modulation. We note, however, that this correlation was driven mainly by a few such initially low performing “better” players, who mostly belonged to the dyads where both participants improved in confidence (green dots, Figure 4B), and that even the highest solo average confidence was not at ceiling (<0.95). To conclude, the asymmetric social modulation effect we observe is mainly due to the better players declining (orange and red dots, Figure 4B), rather than due to both players improving but the better player improving less (green dots, Figure 4B).

      Reviewer 2:

      Strengths:

      There are many things to like about this paper. The visual psychophysics has been undertaken with much expertise and care to detail. The reporting is meticulous and the coverage of the recent previous literature is reasonable. The research question is novel.

      We thank reviewer 2 for this positive evaluation. Below we address the identified weaknesses and recommendations.

      (1) Streamlining the text to make the paper easier to read

      The paper is difficult to read. It is very densely written, with little to distinguish between what is a key message and what is an auxiliary side note. The Figures are often packed with sometimes over 10 panels and very long captions that stick to the descriptive details but avoid clarity. There is much that could be shifted to supplementary material for the reader to get to the main points.

      We thank reviewer 2 for the honest assessment that our article was difficult to read and understand, and for providing specific examples of confusion. We substantially improved the clarity:

      We added a Glossary that defines key terms, including Accuracy and Hit rate. 

      We replaced the confusing term “eccentricity” with joystick “tilt”.

      We simplified Figures 3 and 5, moving some panels into supplementary figures.

      We substantially redesigned and simplified our main Figure 4, displaying the data in a more straightforward, less convoluted way, and removing several panels. This change was accompanied by corresponding changes in the text (section starting at line 277).

      More generally, we shortened the Introduction, substantially revised the Results and the figure legends, and streamlined the Discussion.

      (2) Dyadic co-action vs joint dyadic decision making

      A third and very important one is what the word "dyadic" refers to in the paper. The subjects do not make any joint decisions. However, the authors calculate some "dyadic score" to measure if the group has been able to do better than individuals. So the word dyadic sometimes refers to some "nominal" group. In other places, dyadic refers to the social experimental condition. For example, we see in Figure 3c that AUC is compared for solo vs dyadic conditions. This is confusing.

      […] my key criticism is that the paper makes strong points about collective decision-making and compares its own findings with many papers in that field when, in fact, the experiments do not involve any collective decision-making. The subjects are not incentivized to do better as a group either. […]

      The reviewer is correct to highlight these important aspects. We did not, in fact, investigate a situation where two players had to reach a joint decision with interdependent payoff, and there was no incentive to collaborate or even incorporate the information provided by the other player. To make the meaning of “dyadic” in our context more explicit, we have clarified the nature of the co-action and independent payoff (e.g. lines 107, 211, 482, 755 - Glossary), and used the term “nominal combined score” (line 224) and “nominal “average accuracy” within a dyad” (line 439).

      Concerning the key point about embedding our findings into the literature on collective decision-making, we would like to clarify our motivation. Apart from the recent study by Pescetelli and Yeung (2022), we are not aware of any perceptual decision-making studies that investigated co-action without any explicit joint task. So naturally, we were stimulated by the literature on collective decisions, and felt it was appropriate to compare our findings to the principles derived from this exciting field. Besides the development of a peri-decision wagering CPR game that is continuous in time and in “space” (direction), the social co-action context is the main novel contribution of our work. Although it is possible to formulate cooperative or competitive contexts for the CPR, we leveraged the free-flowing continuous nature of the task, which makes it most readily amenable to studying spontaneously emerging social information integration.

      We now more explicitly emphasize, in the Introduction and Discussion, that most prior work has used joint decision tasks, in contrast to the co-action we study here.

      (3) Addition of relevant literature to Discussion

      […] To see why this matters, look at Lorenz et al PNAS (https://www.pnas.org/doi/10.1073/pnas.1008636108) and the subsequent commentary that followed it from Farrell (https://www.pnas.org/doi/full/10.1073/pnas.1109947108). The original paper argued that social influence caused herding which impaired the wisdom of crowds. Farrell's reanalysis of the paper's own data showed that social influence and herding benefited the individuals at the expense of the crowd demonstrating a form of tradeoff between individual and joint payoff. It is naive to think that by exposing the subjects to social information, we should, naturally, expect them to strive to achieve better performance as a group.

      Another paper that is relevant to the relationship between the better and worse performing members of the dyad is Mahmoodi et al PNAS 2015 (https://www.pnas.org/doi/10.1073/pnas.1421692112). Here too the authors demonstrate that two people interacting with one another do not "bother" figuring out each others' competence and operate under "equality assumption". Thus, the lesser competent member turns out to be overconfident, and the more competent one is underconfident. The relevance of this paper is that it manages to explain patterns very similar to Schneider et al by making a much simpler "equality bias" assumption.

      We thank reviewer 2 for pointing out these highly relevant references, which we have now integrated in the Discussion (lines 430 and 467). Regarding the debate between Lorenz et al. and Farrell, although it concerns a very different type of task (single-shot factual knowledge estimation), it is very illuminating for understanding the differing perspectives on individual vs group benefit. We fully agree that it is naïve to assume that during independent co-action in our highly demanding task participants would strive to achieve better performance as a group; if anything, we expected less normative and more informational, reliability-driven effects as a way to cope with task demands.

      Mahmoodi et al. is a particularly pertinent and elegant study, and the equality bias they demonstrate may indeed underlie the effects we see. We admit that we did not know this paper at the time of our initial writing, but it is encouraging to see the convergence [pun intended] despite task and analysis differences. As highlighted above (2), our novel contributions remain that we observe mutual alignment, or convergence, in real time without an explicitly formulated collective decision task and the associated social pressure, and that we separate asymmetric social effects on accuracy and confidence.

      Other reviewer-independent changes:

      Additional information: Angular error in Figure 2

      In panel A of the main Figure 2, we have added the angular error of the solo reports (blue dashed line) to give readers an impression about the average deviation of subjects’ joystick direction from the nominal stimulus direction. We have pointed out that angular error is the basis for accuracy calculation.

      Data alignment

      In the previous version of the manuscript, we presented data with different alignments: accuracy values were aligned to the appearance of the first target in a stimulus state (target-alignment) to avoid the predictive influence of target location within the remaining stimulus state, while the joystick tilt was extracted at the end of each stimulus state (state-alignment) to allow subjects more time to make a deliberate, confidence-guided report (Methods). We realized that this is confusing, as it compares the social modulation of the two response dimensions at different points in time. In the revision, we use state-aligned data in most figures and analyses and clearly indicate which alignment type has been used. We kept the target-alignment for the illustration of the angular error in the solo behavior (Figure 2). This change affected only the reporting of accuracy statistics. None of the results have changed fundamentally, but the social modulation of accuracy became even stronger in state-aligned data.

      In summary, we hope that these revisions have resulted in an easier-to-understand and convincing article, with clear terminology and concise and important takeaway messages.

      We thank both reviewers and the editors again for their time and effort, and look forward to the reevaluation of our work.

      References

      Desender K, Donner TH, Verguts T. 2021. Dynamic expressions of confidence within an evidence accumulation framework. Cognition 207:104522. doi:10.1016/j.cognition.2020.104522

      Pescetelli N, Yeung N. 2022. Benefits of spontaneous confidence alignment between dyad members. Collective Intelligence 1. doi:10.1177/26339137221126915

      Sanders JI, Hangya B, Kepecs A. 2016. Signatures of a Statistical Computation in the Human Sense of Confidence. Neuron 90:499–506. doi:10.1016/j.neuron.2016.03.025

    1. Overall thoughts: This is an interesting history piece regarding peer review and the development of review over time. Given the author’s conflict of interest and association with the Centre developing MetaROR, I think that this paper might be a better fit for an information page or introduction to the journal and rationale for the creation of MetaROR, rather than being billed as an independent article. Alternatively, more thorough information about advantages to pre-publication review or more downsides/challenges to post-publication review might make the article seem less affiliated. I appreciate seeing the history and current efforts to change peer review, though I am not comfortable broadly encouraging use of these new approaches based on this article alone.

      Page 3: It’s hard to get a feel for the timeline given the dates that are described. We have peer review becoming standard after WWII (after 1945), definitively established by the second half of the century, an example of obligatory peer review starting in 1976, and in crisis by the end of the 20th century. I would consider adding examples that better support this timeline – did it become more common in specific journals before 1976? Was the crisis by the end of the 20th century something that happened over time or something that was already intrinsic to the institution? It doesn’t seem like enough time to get established and then enter crisis, but more details/examples could help make the timeline clear. 

      Consider discussing the benefits of the traditional model of peer review.

      Table 1 – Most of these are self-explanatory to me as a reader, but not all. I don’t know what a registered report refers to, and it stands to reason that not all of these innovations are familiar to all readers. You do go through each of these sections, but that’s not clear when I initially look at the table. Consider having a more informative caption. Additionally, the left column is “Course of changes” here but “Directions” in text. I’d pick one and go with it for consistency.

      3.2: Consider mentioning your conflict of interest here where MetaROR is mentioned.

      With some of these methods, there’s the ability to also submit to a regular journal. Going to a regular journal presumably would instigate a whole new round of review, which may or may not contradict the previous round of post-publication review and would increase the length of time to publication by going through both types. If someone has a goal to publish in a journal, what benefit would they get by going through the post-publication review first, given this extra time?

      There’s a section talking about institutional change (page 14). It mentions that openness requires three conditions – people taking responsibility for scientific communication, authors and reviewers, and infrastructure. I would consider adding some discussion of readers and evaluators. Readers have to be willing to accept these papers as reliable, trustworthy, and respectable to read and use the information in them. Evaluators such as tenure committees and potential employers would need to consider papers submitted through these approaches as evidence of scientific scholarship for the effort to be worthwhile for scientists.

      Based on this overview, which seems somewhat skewed towards the merits of these methods (conflict of interest, limited perspective on downsides to new methods/upsides to old methods), I am not quite ready to accept this effort as equivalent of a regular journal and pre-publication peer review process. I look forward to learning more about the approach and seeing this review method in action and as it develops.

    2. Response to the Editors and the Reviewers

      I am sincerely grateful to the editors and peer reviewers at MetaROR for their detailed feedback and valuable comments and suggestions. I have addressed each point below.

      Handling editor

      1. “However, the article’s progression and arguments, along with what it seeks to contribute to the literature need refinement and clarification. The argument for PRC is under-developed due to a lack of clarity about what the article means by scientific communication. Clarity here might make the endorsement of PRC seem like less of a foregone conclusion.”

      The structure of the paper (and discussion) has changed significantly to address the feedback.

      2. “I strongly endorse the main theme of most of the reviews, which is that the progression and underlying justifications for this article’s arguments needs a great deal of work. In my view, this article’s main contribution seems to be the evaluation of the three peer review models against the functions of scientific communication. I say ‘seems to be’ because the article is not very clear on that and I hope you will consider clarifying what your manuscript seeks to add to the existing work in this field. In any case, if that assessment of the three models is your main contribution, that part is somewhat underdeveloped. Moreover, I never got the sense that there is clear agreement in the literature about what the tenets of scientific communication are. Note that scientific communication is a field in its own right.”

      I have implemented a more rigorous approach to argumentation in response. “Scientific communication” was replaced by “scholarly communication.”

      3. “I also agree that paper is too strongly worded at times, with limitations and assumptions in the analysis minimised or not stated. For example, all of the typologies and categories drawn could easily be reorganised and there is a high degree of subjectivity in this entire exercise. Subjective choices should be highlighted and made salient for the reader. Note that greater clarity, rigour, and humility may also help with any alleged or actual bias.”

      I have incorporated a conceptual framework and a description of the research methodology. However, the Discussion section reflects my personal perspective at some points, which I have explicitly highlighted to ensure clarity.

      4. “I agree with Reviewer 3 that the ‘we’ perspective is distracting.”

      This has been fixed.

      5. “The paragraph starting with ‘Nevertheless’ on page 2 is very long.”

      The text was restructured.

      6. “There are many points where language could be shortened for readability, for example:

      Page 3: ‘decision on publication’ could be ‘publication decision’.

      Page 5: ‘efficiency of its utilization’ could be ‘its efficiency’.

      Page 7: ‘It should be noted…’ could be ‘Note that…’.”

      I have proofread the text.

      7. “Page 7: ‘It should be noted that..’ – this needs a reference.”

      This statement has been moved to the Discussion section, paraphrased, and a reference has been added:

      “It should be also noted that peer review innovations pull in opposing directions, with some aiming to increase efficiency and reduce costs, while others aim to promote rigor and increase costs (Kaltenbrunner et al., 2022).”

      8. “I’m not sure that registered reports reflect a hypothetico-deductive approach (page 6). For instance, systematic reviews (even non-quantitative ones) are often published as registered reports and Cochrane has required this even before the move towards registered reports in quantitative psychology.”

      I have added this clarification.

      9. “I agree that modular publishing sits uneasily as its own chapter.”

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.

      10. “Page 14: ‘The "Publish-Review-Curate" model is universal that we expect to be the future of scientific publishing. The transition will not happen today or tomorrow, but in the next 5-10 years, the number of projects such as eLife, F1000Research, Peer Community in, or MetaROR will rapidly increase’. This seems overly strong (an example of my larger critique and that of the reviewers).”

      This part of the text has been rewritten.

      Reviewer 1

      11. “For example, although Model 3 is less chance to insert bias to the readers, it also weakens the filtering function of the review system. Let’s just think about the dangers of machine-generated articles, paper-mills, p-hacked research reports and so on. Although the editors do some pre-screening for the submissions, in a world with only Model 3 peer review the literature could easily get loaded with even more ‘garbage’ than in a model where additional peers help the screening.”

      I think that generated text is better detected by software tools. At the same time, I have tried to describe the pros and cons of the different models in a more balanced way in the concluding section.

      12. “Compared to registered reports other aspects can come to focus that Model 3 cannot cover. It’s the efficiency of researchers’ work. In the case of registered reports, Stage 1 review can still help researchers to modify or improve their research design or data collection method. Empirical work can be costly and time-consuming and post-publication review can only say that ‘you should have done it differently then it would make sense’.”

      Thank you very much for this valuable contribution; I have added this statement on p. 11.

      13. “Finally, the author puts openness as a strength of Model 3. In my eyes, openness is a separate question. All models can work very openly and transparently in the right circumstances. This dimension is not an inherent part of the models.”

      I think that the model, providing peer reviews to all the submissions, ensures maximum transparency. However, I have made an effort to make the wording more balanced and to distinguish my personal perspective from the literature.

      14. “In conclusion, I would not make verdict over the models, instead emphasize the different functions they can play in scientific communication.”

      This idea has been reflected now in the concluding section.

      15. “A minor comment: I found that a number of statements lack references in the Introduction. I would have found them useful for statements such as ‘There is a point of view that peer review is included in the implicit contract of the researcher.’”

      Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.

      Reviewer 2

      16. “The primary weakness of this article is that it presents itself as an 'analysis' from which they 'conclude' certain results such as their typology, when this appears clearly to be an opinion piece. In my view, this results in a false claim of objectivity which detracts from what would otherwise be an interesting and informative, albeit subjective, discussion, and thus fails to discuss the limitations of this approach.”

      I have incorporated the conceptual framework and description of the research methodology. However, the Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.

      17. “A secondary weakness is that the discussion is not well structured and there are some imprecisions of expression that have the potential to confuse, at least at first.”

      The structure of the paper (and discussion) has changed significantly.

      18. “The evidence and reasoning for claims made is patchy or absent. One instance of the former is the discussion of bias in peer review. There are a multitude of studies of such bias and indeed quite a few meta-analyses of these studies. A systematic search could have been done here but there is no attempt to discuss the totality of this literature. Instead, only a few specific studies are cited. Why are these ones chosen? We have no idea. To this extent I am not convinced that the references used here are the most appropriate.”

      I have reviewed the existing references and incorporated additional sources. However, the study does not claim to conduct a systematic literature review; rather, it adopts an interpretative approach to literature analysis.

      19. “Instances of the latter are the claim that ‘The most well-known initiatives at the moment are ResearchEquals and Octopus’ for which no evidence is provided, the claim that ‘we believe that journal-independent peer review is a special case of Model 3’ for which no further argument is provided, and the claim that ‘the function of being the "supreme judge" in deciding what is "good" and "bad" science is taken on by peer review’ for which neither is provided.”

      Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.

      20. “A particular example of this weakness, which is perhaps of marginal importance to the overall paper but of strong interest to this reviewer is the rather odd engagement with history within the paper. It is titled "Evolution of Peer Review" but is really focussed on the contemporary state-of-play. Section 2 starts with a short history of peer review in scientific publishing, but that seems intended only to establish what is described as the 'traditional' model of peer review. Given that that short history had just shown how peer review had been continually changing in character over centuries - and indeed Kochetkov goes on to describe further changes - it is a little difficult to work out what 'traditional' might mean here; what was 'traditional' in 2010 was not the same as what was 'traditional' in 1970. It is not clear how seriously this history is being taken. Kochetkov has earlier written that "as early as the beginning of the 21st century, it was argued that the system of peer review is 'broken'" but of course criticisms - including fundamental criticisms - of peer review are much older than this. Overall, this use of history seems designed to privilege the experience of a particular moment in time, that coincides with the start of the metascience reform movement.”

      While the paper addresses some aspects of peer review history, it does not provide a comprehensive examination of this topic. A clarifying statement to this effect has been included in the methodology section.

      “… this section incorporates elements of historical analysis, it does not fully qualify as such because primary sources were not directly utilized. Instead, it functions as an interpretative literature review, and one that is intentionally concise, as a comprehensive history of peer review falls outside the scope of this research”.

      21. “Section 2 also demonstrates some of the second weakness described, a rather loose structure. Having moved from a discussion of the history of peer review to detail the first model, 'traditional' peer review, it then also goes on to describe the problems of this model. This part of the paper is one of the best - and best - evidenced. Given the importance of it to the main thrust of the discussion it should probably have been given more space as a Section all on its own.”

      This section (now Section 4) has been extended, see also previous comment.

      22. “Another example is Section 4 on Modular Publishing, in which Kochetkov notes "Strictly speaking, modular publishing is primarily an innovative approach for the publishing workflow in general rather than specifically for peer review." Kochetkov says "This is why we have placed this innovation in a separate category" but if it is not an innovation in peer review, the bigger question is 'Why was it included in this article at all?'.”

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.

      23. “One example of the imprecisions of language is as follows. The author also shifts between the terms 'scientific communication' and 'science communication' but, at least in many contexts familiar to this reviewer, these are not the same things, the former denoting science-internal dissemination of results through publication (which the author considers), conferences and the like (which the author specifically excludes) while the latter denotes the science-external public dissemination of scientific findings to non-technical audiences, which is entirely out of scope for this article.”

      Thank you for your remark. As a non-native speaker, I initially did not grasp the distinction between the terms. However, I believe the phrase ‘scholarly communication’ is the most universally applicable term. This adjustment has now been incorporated into the text.

      24. “A final note is that Section 3, while an interesting discussion, seems largely derivative from a typology of Waltman, with the addition of a consideration of whether a reform is 'radical' or 'incremental', based on how 'disruptive' the reform is. Given that this is inherently a subjective decision, I wonder if it might not have been more informative to consider 'disruptiveness' on a scale and plot it accordingly. This would allow for some range to be imagined for each reform as well; surely reforms might be more or less disruptive depending on how they are implemented. Given that each reform is considered against each model, it is somewhat surprising that this is not presented in a tabular or graphical form.”

      Ultimately, I excluded this metric due to its current reliance on purely subjective judgment. Measuring 'disruptiveness', e.g., through surveys or interviews remains a task for future research.

      25. “Reconceptualize this as an opinion piece. Where systematic evidence can be drawn upon to make points, use that, but don't be afraid to just present a discussion from what is clearly a well-informed author.”

      I cannot definitively classify this work as an opinion piece. In fact, this manuscript synthesizes elements of a literature review, research article, and opinion essay. My idea was to integrate the strengths of all three genres.

      26. “Reconsider the focus on history and 'evolution' if the point is about the current state of play and evaluation of reforms (much as I would always want to see more studies on the history and evolution of peer review).”

      I have revised the title to better reflect the study’s scope and explicitly emphasize its focus on contemporary developments in the field.

      “Peer Review at the Crossroads”

      27. “Consider ways in which the typology might be expanded, even if at subordinate level.”

      I have updated the typology and introduced the third tier, where it is applicable (see Fig.2).

      Reviewer 3

      28. “In my view, the biggest issue with the current peer review system is the low quality of reviews, but the manuscript only mentions this fleetingly. The current system facilitates publication bias, confirmation bias, and is generally very inconsistent. I think this is partly due to reviewers’ lack of accountability in such a closed peer review system, but I would be curious to hear the author’s ideas about this, more elaborately than they provide them as part of issue 2.”

      I have elaborated on this issue in the footnote.

      29. “I’m missing a section in the introduction on what the goals of peer review are or should be. You mention issues with peer review, and these are mostly fair, but their importance is only made salient if you link them to the goals of peer review. The author does mention some functions of peer review later in the paper, but I think it would be good to expand that discussion and move it to a place earlier in the manuscript.”

      The functions of peer review are summarized in the first paragraph of Introduction.

      30. “Table 1 is intuitive but some background on how the author arrived at these categorizations would be welcome. When is something incremental and when is something radical? Why are some innovations included but not others (e.g., collaborative peer review, see https://content.prereview.org/how-collaborative-peer-review-can-transform-scientific-research/)?”

      Collaborative peer review, namely, Prereview was mentioned in the context of Model 3 (Publish-Review-Curate). However, I have extended this part of the paper.

      31. “‘Training of reviewers through seminars and online courses is part of the strategies of many publishers. At the same time, we have not been able to find statistical data or research to assess the effectiveness of such training.’ (p. 5) There is some literature on this, although not recent. See work by Sara Schroter, for example (Schroter et al., 2004; Schroter et al., 2008).”

      Thank you very much, I have added these studies and a few more recent ones.

      32. “‘It should be noted that most initiatives aimed at improving the quality of peer review simultaneously increase the costs.’ (p. 7) This claim needs some support. Please explicate why this typically is the case and how it should impact our evaluations of these initiatives.”

      I have moved this part to the Discussion section.

      33. “I would rephrase “Idea of the study” in Figure 2 since the other models start with a tangible output (the manuscript). This is the same for registered reports where they submit a tangible report including hypotheses, study design, and analysis plan. In the same vein, I think study design in the rest of the figure might also not be the best phrasing. Maybe the author could use the terminology used by COS (Stage 1 manuscript, and Stage 2 manuscript, see Details & Workflow tab of https://www.cos.io/initiatives/registered-reports). Relatedly, “Author submits the first version of the manuscript” in the first box after the ‘Manuscript (report)’ node maybe a confusing phrase because I think many researchers see the first version of the manuscript as the stage 1 report sent out for stage 1 review.”

      Thank you very much. Stage 1 and Stage 2 manuscripts look like a suitable labelling solution.

      34. “One pathway that is not included in Figure 2 is that authors can decide to not conduct the study when improvements are required. Relatedly, in the publish-review-curate model, is revising the manuscripts based on the reviews not optional as well? Especially in the case of 3a, authors can hardly be forced to make changes even though the reviews are posted on the platform.”

      All four models imply a certain level of generalization; thus, I tried to avoid redundant details. However, I have added this choice to the PRC model (now Model 4).

      35. “I think the author should discuss the importance of ‘open identities’ more. This factor is now not explicitly included in any of the models, while it has been found to be one of the main characteristics of peer review systems (Ross-Hellauer, 2017).”

      This part has been extended.

      36. “More generally, I was wondering why the author chose these three models and not others. What were the inclusion criteria for inclusion in the manuscript? Some information on the underlying process would be welcome, especially when claims like ‘However, we believe that journal-independent peer review is a special case of Model 3 (‘Publish-Review-Curate’).’ are made without substantiation.”

      The study included four generalized models of peer review that involved some level of abstraction.

      37. “Maybe it helps to outline the goals of the paper a bit more clearly in the introduction. This helps the reader to know what to expect.”

      The Introduction has been revised including the goal and objectives.

      38. “The Modular Publishing section is not inherently related to peer review models, as you mention in the first sentence of that paragraph. As such, I think it would be best to omit this section entirely to maintain the flow of the paper. Alternatively, you could shortly discuss it in the discussion section but a separate paragraph seems too much from my point of view.”

      Modular publishing has been combined with registered reports into the fragmented publishing group of models, now in Section 5.

      39. “Labeling model 3 as post-publication review might be confusing to some readers. I believe many researchers see post-publication review as researchers making comments on preprints, or submitting commentaries to journals. Those activities are substantially different from the publish-review-curate model so I think it is important to distinguish between these types.”

      The label was changed to the Publish-Review-Curate model.

      40. “I do not think the conclusions drawn below Table 3 logically follow from the earlier text. For example, why are “all functions of scientific communication implemented most quickly and transparently in Model 3”? It could be that the entire process takes longer in Model 3 (e.g. because reviewers need more time), so that Model 1 and Model 2 lead to outputs quicker. The same holds for the following claim: ‘The additional costs arising from the independent assessment of information based on open reviews are more than compensated by the emerging opportunities for scientific pluralism.’ What is the empirical evidence for this? While I personally do think that Model 3 improves on Model 1, emphatic statements like this require empirical evidence. Maybe the author could provide some suggestions on how we can attain this evidence. Model 2 does have some empirical evidence underpinning its validity (see Scheel, Schijen, Lakens, 2021; Soderberg et al., 2021; Sarafoglou et al. 2022) but more meta-research inquiries into the effectiveness and cost-benefits ratio of registered reports would still be welcome in general.”

      The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap. I am grateful for the suggested literature on RRs, which I have now integrated into the relevant subsection.

      41. “What is the underlying source for the claim that openness requires three conditions?”

      I have made an effort to clarify within the text that this reflects my personal stance.

      42. “‘If we do not change our approach, science will either stagnate or transition into other forms of communication.’ (p. 2) I don’t think this claim is supported sufficiently strongly. While I agree there are important problems in peer review, I think would need to be a more in-depth and evidence-based analysis before claims like this can be made.”

      The sentence has been rephrased.

      43. “On some occasions, the author uses ‘we’ while the study is single authored.”

      This has been fixed.

      44. “Figure 1: The top-left arrow from revision to (re-)submission is hidden”

      I have updated Figure 1.

      45. “‘The low level of peer review also contributes to the crisis of reproducibility in scientific research (Stoddart, 2016).’ (p. 4) I assume the author means the low quality of peer review.”

      This has been fixed.

      46. “‘Although this crisis is due to a multitude of factors, the peer review system bears a significant responsibility for it.’ (p. 4) This is also a big claim that is not substantiated”

      I have paraphrased this sentence as “While multiple factors drive this crisis, deficiencies in the peer review process remain a significant contributor.” and added a footnote.

      47. “‘Software for automatic evaluation of scientific papers based on artificial intelligence (AI) has emerged relatively recently’ (p. 5) The author could add RegCheck (https://regcheck.app/) here, even though it is still in development. This tool is especially salient in light of the finding that preregistration-paper checks are rarely done as part of reviews (see Syed, 2023).”

      Thank you very much, I have added this information.

      48. “There is a typo in last box of Figure 1 (‘decicion’ instead of ‘decision’). I also found typos in the second box of Figure 2, where ‘screns’ should be ‘screens’, and the author decision box where ‘desicion’ should be ‘decision’”

      This has been fixed.

      49. “Maybe it would be good to mention results blinded review in the first paragraph of 3.2. This is a form of peer review where the study is already carried out but reviewers are blinded to the results. See work by Locascio (2017), Grand et al. (2018), and Woznyj et al. (2018).”

      Thanks, I have added this (now Section 5.2).

      50. “Is ‘Not considered for peer review’ in figure 3b not the same as rejected? I feel that it is rejected in the sense that neither the manuscript nor the reviews will be posted on the platform.”

      Changed into “Rejected”

      51. “‘In addition to the projects mentioned, there are other platforms, for example, PREreview12, which departs even more radically from the traditional review format due to the decentralized structure of work.’ (p. 11) For completeness, I think it would be helpful to add some more information here, for example why exactly decentralization is a radical departure from the traditional model.”

      I have extended this passage.

      52. “‘However, anonymity is very conditional - there are still many “keys” left in the manuscript, by which one can determine, if not the identity of the author, then his country, research group, or affiliated organization.’ (p.11) I would opt for the neutral ‘their’ here instead of ‘his’, especially given that this is a paragraph about equity and inclusion.”

      This has been fixed.

      53. “‘Thus, “closeness” is not a good way to address biases.’ (p. 11) This might be a straw man argument because I don’t believe researchers have argued that it is a good method to combat biases. If they did, it would be good to cite them here. Alternatively, the sentence could be omitted entirely.”

      I have omitted the sentence.

      54. “I would start the Modular Publishing section with the definition as that allows readers to interpret the other statements better.”

      Modular publishing has been combined with registered reports into the deconstructed publication group of models, now in Section 5, general definition added.

      55. “It would be helpful if the Models were labeled (instead of using Model 1, Model 2, and Model 3) so that readers don’t have to think back what each model involved.”

      All the models represent a kind of generalization, which is why non-detailed labels are used. The text labels may vary depending on the context.

      56. “Table 2: ‘Decision making’ for the editor’s role is quite broad, I recommend to specify and include what kind of decisions need to be made.”

      Changed into “Making accept/reject decisions”

      57. “Table 2: ‘Aim of review’ – I believe the aim of peer review differs also within these models (see the ‘schools of thought’ the author mentions earlier), so maybe a statement on what the review entails would be a better way to phrase this.”

      Changed into “What does peer review entail?”

      58. “Table 2: One could argue that the ‘object of the review’ in Registered Reports is also the manuscript as a whole, just in different stages. As such, I would phrase this differently.”

      Current wording fits your remark: “Manuscript in terms of study design and execution”

      Reviewer 4

      59. “Page 3: It’s hard to get a feel for the timeline given the dates that are described. We have peer review becoming standard after WWII (after 1945), definitively established by the second half of the century, an example of obligatory peer review starting in 1976, and in crisis by the end of the 20th century. I would consider adding examples that better support this timeline – did it become more common in specific journals before 1976? Was the crisis by the end of the 20th century something that happened over time or something that was already intrinsic to the institution? It doesn’t seem like enough time to get established and then enter crisis, but more details/examples could help make the timeline clear. Consider discussing the benefits of the traditional model of peer review.”

      This section has been extended.

      60. “Table 1 – Most of these are self-explanatory to me as a reader, but not all. I don’t know what a registered report refers to, and it stands to reason that not all of these innovations are familiar to all readers. You do go through each of these sections, but that’s not clear when I initially look at the table. Consider having a more informative caption. Additionally, the left column is “Course of changes” here but “Directions” in text. I’d pick one and go with it for consistency.”

      Table 1 has been replaced by Figure 2. I have also extended text descriptions, added definitions.

      61. “With some of these methods, there’s the ability to also submit to a regular journal. Going to a regular journal presumably would instigate a whole new round of review, which may or may not contradict the previous round of post-publication review and would increase the length of time to publication by going through both types. If someone has a goal to publish in a journal, what benefit would they get by going through the post-publication review first, given this extra time?”

      Some of these platforms, e.g., F1000 and Lifecycle Journal, replace conventional journal publishing. Modular publishing allows for step-by-step feedback from peers. An important advantage of RRs over other peer review models lies in their capacity to enhance research efficiency. By conducting peer review at Stage 1, researchers gain the opportunity to refine their study design or data collection protocols before empirical work begins. Other models of review can offer critiques such as "the study should have been conducted differently" without an actionable opportunity for improvement. The key motivation for having my paper reviewed in MetaROR is the quality of peer review – I have never received so many comments, frankly! Moreover, platforms such as MetaROR usually have partnering journals.

      62. “There’s a section talking about institutional change (page 14). It mentions that openness requires three conditions – people taking responsibility for scientific communication, authors and reviewers, and infrastructure. I would consider adding some discussion of readers and evaluators. Readers have to be willing to accept these papers as reliable, trustworthy, and respectable to read and use the information in them. Evaluators such as tenure committees and potential employers would need to consider papers submitted through these approaches as evidence of scientific scholarship for the effort to be worthwhile for scientists.”

      I have omitted these conditions and employed Moore’s Technology Adoption Life Cycle instead. Thank you very much for your comment!

      63. “Based on this overview, which seems somewhat skewed towards the merits of these methods (conflict of interest, limited perspective on downsides to new methods/upsides to old methods), I am not quite ready to accept this effort as equivalent of a regular journal and pre-publication peer review process. I look forward to learning more about the approach and seeing this review method in action and as it develops.”

      The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Zhu and colleagues used high-density Neuropixel probes to perform laminar recordings in V1 while presenting either small stimuli that stimulated the classical receptive field (CRF) or large stimuli whose border straddled the RF to provide nonclassical RF (nCRF) stimulation. Their main question was to understand the relative contribution of feedforward (FF), feedback (FB), and horizontal circuits to border ownership (Bown), which they addressed by measuring cross-correlation across layers. They found differences in cross-correlation between feedback/horizontal (FH) and input layers during CRF and nCRF stimulation.

      Although the data looks high quality and analyses look mostly fine, I had a lot of difficulty understanding the logic in many places. Examples of my concerns are written below. 

      (1) What is the main question? The authors refer to nCRF stimulation emerging from either feedback from higher areas or horizontal connections from within the same area (e.g. lines 136 to 138 and again lines 223-232). I initially thought that the study would aim to distinguish between the two. However, the way the authors have clubbed the layers in 3D, the main question seems to be whether Bown is FF or FH (i.e., feedback and horizontal are clubbed). Is this correct? If so, I don't see the logic, since I can't imagine Bown to be purely FF. Thus, just showing differences between CRF stimulation (which is mainly expected to be FF) and nCRF stimulation is not surprising to me. 

      We thank the reviewer for their thoughtful comments. As explained in the discussion, we grouped cortical layers to reduce uncertainty in precisely assigning laminar boundaries and to increase statistical power. Consequently, this limits our ability to distinguish the relative contributions of feedback inputs, primarily targeting layers 1 and 6, and horizontal connections, mainly within layers 2/3 and 5. Nevertheless, previous findings, especially regarding the rapid emergence of B<sub>own</sub> signals, suggest that feedback is more biologically plausible than horizontal-based mechanisms.

      Importantly, the emergence of B<sub>own</sub> signals in the primate brain should not be taken for granted. Direct physiological evidence that distinguishes feedforward from feedback/horizontal mechanisms has been lacking. While we agree it is unlikely that B<sub>own</sub> is mediated solely by feedforward processing, we felt it was necessary to test this empirically, particularly using high-resolution laminar recordings.

      As discussed, feedforward models of B<sub>own</sub> have been proposed (e.g., Super, Romeo, and Keil, 2010; Sakai and Nishimura, 2006). These could, in theory, be supported by more general nCRF modulations arising through early feedforward inhibitions, such as those observed in the retinogeniculate pathway (e.g., Webb, Tinsley, Vincent and Derrington, 2005; Blitz and Regehr, 2005; Alitto and Usrey, 2008). However, most B<sub>own</sub> models rely heavily on response latency, yet very few studies have recorded across layers or areas simultaneously to address this directly. Notably, recent findings in area V4 show that B<sub>own</sub> signals emerge earlier in deep layers than in granular (input) layers, suggesting a non-feedforward origin (Franken and Reynolds, 2021).

      Furthermore, although previous studies have shown that the nCRF can modulate firing rates and the timing of neuronal firing across layers, our findings go beyond these effects. We provide clear evidence that nCRF modulation also alters precise spike timing relationships and interlaminar coordination, and that the magnitude of nCRF modulation depends on these interlaminar interactions. This supports the idea that B<sub>own</sub>, or more general nCRF modulation, involves more than local rate changes, reflecting layer-specific network dynamics consistent with feedback or lateral integration.

      (2) Choice of layers for cross-correlation analysis: In the Introduction, and also in Figure 3C, it is mentioned that FF inputs arrive in 4C and 6, while FB/Horizontal inputs arrive at "superficial" and "deep", which I take as layer 2/3 and 5. So it is not clear to me why (i) layer 4A/B is chosen for analysis for Figure 3D (I would have thought layer 6 should have been chosen instead) and (ii) why Layers 5 and 6 are clubbed. 

      We thank the reviewer for raising this important point. The confusion likely stems from our use of the terms “superficial” and “deep” layers when describing the targets of feedback/horizontal inputs. To clarify, by “superficial” and “deep,” we specifically refer to layers 1–3 and layers 5–6, respectively, as illustrated in Figure 3C. Feedback and horizontal inputs largely avoid the whole of layer 4, including both 4C and 4A/B.

      We also emphasize that the classification of layers as feedforward or feedback/horizontal recipients is relative rather than absolute. For example, although layer 6 receives both feedforward and feedback/horizontal inputs, it contains a higher proportion of feedback/horizontal inputs compared to layers 4C and 4A/B. 

      We had addressed this rationale in the Discussion, but recognize it may not have been sufficiently emphasized. We have revised the main text accordingly to clarify this point for readers in the final manuscript version.

      (3) Addressing the main question using cross-correlation analysis: I think the nice peaks observed in Figure 3B for some pairs show how spiking in one neuron affects the spiking in another one, with the delay in cross-correlation function arising from the conduction delay. This is shown nicely during CRF stimulation in Figure 3D between 4C -> 2/3, for example. However, the delay (positive or negative) is constrained by anatomical connectivity. For example, unless there are projections from 2/3 back to 4C which causes firing in a 2/3 layer neuron to cause a spike in a layer 4 neuron, we cannot expect to get a negative delay no matter what kind of stimulation (CRF versus nCRF) is used. 

      We thank the reviewer for the insightful comment. The observation that neurons within FH<sub>i</sub> laminar compartments (layers 2/3, 5/6) can lead those in layer 4 (4C, 4A/B) during nCRF stimulation may indeed seem unexpected. However, several anatomical pathways could mediate the propagation of B<sub>own</sub> signals from FH<sub>i</sub> compartments to layer 4. We have revised the Discussion section in the final version of the manuscript to address this point explicitly.

      In macaque V1, projections from layers 2/3 to 4A/B have been documented (Blasdel et al., 1985; Callaway and Wiser, 1996), and neurons in 4A/B often extend apical dendrites into layers 2/3 (Lund, 1988; Yoshioka et al., 1994). Although direct projections from layers 2/3 to 4C are generally sparse (Callaway, 1998), a subset of neurons in the lower part of layer 3 can give off collateral axons to 4C (Lund and Yoshioka, 1991). Additionally, some 4C neurons extend dendrites into 4B, enabling potential dendritic integration of inputs from more superficial layers (Somogyi and Cowey, 1981; Mates and Lund, 1983; Yabuta and Callaway, 1998). Sparse connections from 2/3 to layer 4 have also been reported in cat V1 (Binzegger, Douglas and Martin, 2004). Moreover, layers 2/3 may influence 4C neurons disynaptically, without requiring dense monosynaptic connections. 

      Importantly, while CCGs can suggest possible circuit arrangements, functional connectivity may arise through mechanisms not fully captured by traditional anatomical tracing. Indeed, the apparent discrepancy between anatomical and functional data is not uncommon. For example, although 4B is known to receive anatomical input primarily from 4Cα, but not 4Cβ, photostimulation experiments have shown that 4B neurons can also be functionally driven by 4Cβ (Sawatari and Callaway, 1996). Our observation of functional inputs from layers 2/3 to layer 4 is also consistent with prior findings in rodent V1, where CCG analysis (e.g., Figure 7 in Senzai, Fernandez-Ruiz and Buzsaki, 2019) or photostimulation (Xu et al., 2016) revealed similar pathways. 

      Layers 5/6 provide dense projections to layers 4A/B (Lund, 1988; Callaway, 1998). In particular, layer 6 pyramidal neurons, especially the subset classified as Type 1 cells, project substantially to layer 4C (Wiser and Callaway, 1996; Fitzpatrick et al., 1985). 

      Reviewer #2 (Public review): 

      Summary: 

      The authors present a study of how modulatory activity from outside the classical receptive field (cRF) differs from cRF stimulation. They study neural activity across the different layers of V1 in two anesthetized monkeys using Neuropixels probes. The monkeys are presented with drifting gratings and border-ownership tuning stimuli. They find that border-ownership tuning is organized into columns within V1, which is unexpected and exciting, and that the flow of activity from cell to cell (as judged by cross-correlograms between single units) is influenced by the type of visual stimulus: border-ownership tuning stimuli vs. drifting-grating stimuli. 

      Strengths: 

      The questions addressed by the study are of high interest, and the use of Neuropixels probes yields extremely high numbers of single-units and cross-correlation histograms (CCHs) which makes the results robust. The study is well-described. 

      Weaknesses: 

      The weaknesses of the study are (a) the use of anesthetized animals, which raises questions about the nature of the modulatory signal being measured and the underlying logic of why a change in visual stimulus would produce a reversal in information flow through the cortical microcircuit and (b) the choice of visual stimuli, which do not uniquely isolate feedforward from feedback influences. 

      (1) The modulation latency seems quite short in Figure 2C. Have the authors measured the latency of the effect in the manuscript and how it compares to the onset of the visually driven response? It would be surprising if the latency was much shorter than 70ms given previous measurements of BO and figure-ground modulation latency in V2 and V1. On the same note, it might be revealing to make laminar profiles of the modulation (i.e. preferred - non-preferred border orientation) as it develops over time. Does the modulation start in feedback recipient layers? 

      (2) Can the authors show the average time course of the response elicited by preferred and non-preferred border ownership stimuli across all significant neurons? 

      We thank the reviewer for the insightful comment—this is indeed an important and often overlooked point. As noted in the Discussion, B<sub>own</sub> modulation differs from other forms of figure-ground modulation (e.g., Lamme et al., 1998) in that it can emerge very rapidly in early visual cortex—within ~10–35 ms after response onset (Zhou et al., 2000; Sugihara et al., 2011). This rapid emergence has been interpreted as evidence for the involvement of fast feedback inputs, which can propagate up to ten times faster than horizontal connections (Girard et al., 2001). Moreover, interlaminar interactions via monosynaptic or disynaptic connections can occur on very short timescales (a few milliseconds), further complicating efforts to disentangle feedback influences based solely on latency.

      Thus, while the early onset of modulation in our data may appear surprising, it is consistent with prior B<sub>own</sub> findings, and likely reflects a combination of fast feedback and rapid interlaminar processing. This makes it challenging to use conventional latency measurements to resolve laminar differences in B<sub>own</sub> modulation. Latency comparisons are well known to be susceptible to confounds such as variability in response onset, luminance, contrast, stimulus size, and other sensory parameters. 

      Although we did not explicitly quantify the latency of B<sub>own</sub> modulation in this manuscript, our cross-correlation analysis provides a more sensitive and temporally resolved measure of interlaminar information flow. We therefore focused on this approach rather than laminar modulation profiles, as it more directly addresses our primary research question.

      (3) The logic of assuming that cRF stimulation should produce the opposite signal flow to border-ownership tuning stimuli is worth discussing. I suspect the key difference between stimuli is that they used drifting gratings as the cRF stimulus, the movement of the stimulus continually refreshes the retinal image, leading to continuous feedforward dominance of the signals in V1. Had they used a static grating, the spiking during the sustained portion of the response might also show more influence of feedback/horizontal connections. Do the initial spikes fired in response to the border-ownership tuning stimuli show the feedforward pattern of responses? The authors state that they did not look at cross-correlations during the initial response, but if they do, do they see the feedforward-dominated pattern? The jitter CCH analysis might suffice in correcting for the response transient. 

      We thank the reviewer for the insightful comment. As noted in the final Results section, our CRF and nCRF stimulation paradigms differ in respects beyond the presence or absence of nonclassical modulation, including stimulus properties within the CRF.

      We agree with the reviewer’s speculation that drifting gratings may continually refresh the retinal image, promoting sustained feedforward dominance in V1, whereas static gratings might allow greater influence from feedback/horizontal inputs during the sustained response. Likewise, the initial response to the B<sub>own</sub> stimulus could be dominated by feedforward activity before feedback/horizontal influences arrive. 

      This contrast was a central motivation for our experimental design: we deliberately used two stimulus conditions — drifting gratings to emphasize feedforward processing, and B<sub>own</sub> stimuli, which are known to engage feedback modulation — to test whether these two conditions yield different patterns of interlaminar information flow. Our results confirm that they do. While we did not separately analyze the very initial spike period, our focus is on interlaminar information flow during the sustained response, which serves as the primary measure of feedback/horizontal engagement in this study.

      Finally, beyond this direct comparison, we show in Figure 5 that under nCRF stimulation alone, the direction and strength of interlaminar information flow correlate with the magnitude of B<sub>own</sub> modulation, further supporting the idea that our cross-correlation approach reveals functionally meaningful differences in cortical processing.

      (4) The term "nCRF stimulation" is not appropriate because the CRF is stimulated by the light/dark edge. 

      We thank the reviewer for the comment. As noted in the Introduction, nCRF effects described in the literature invariably involve stimulation both inside and outside the CRF. Our use of the term “nCRF stimulation” refers to this experimental paradigm, rather than suggesting that the CRF itself is unstimulated. We hope this clarifies our use of the term.

      Reviewer #3 (Public review): 

      Summary: 

      The paper by Zhu et al is on an important topic in visual neuroscience, the emergence in the visual cortex of signals about figures and ground. This topic also goes by the name border ownership. The paper utilizes modern recording techniques very skillfully to extend what is known about border ownership. It offers new evidence about the prevalence of border ownership signals across different cortical layers in V1 cortex. Also, it uses pairwise cross-correlation to study signal flow under different conditions of visual stimulation that include the border ownership paradigm. 

      Strengths: 

      The paper's strengths are its use of multi-electrode probes to study border ownership in many neurons simultaneously across the cortical layers in V1, and its innovation of using cross-correlation between cortical neurons -- when they are viewing border-ownership patterns or instead are viewing grating patterns restricted to the classical receptive field (CRF). 

      Weaknesses: 

      The paper's weaknesses are its largely incremental approach to the study of border ownership and the lack of a critical analysis of the cross-correlation data. The paper as it is now does not advance our understanding of border ownership; it mainly confirms prior work, and it does not challenge or revise consensus beliefs about mechanisms. However, it is possible that, in the rich dataset the authors have obtained, they do possess data that could be added to the paper to make it much stronger. 

      Critique: 

      The border ownership data on V1 offered in the paper replicates experimental results obtained by Zhou and von der Heydt (2000) and confirms the earlier results using the same analysis methods as Zhou. The incremental addition is that the authors found border ownership in all cortical layers extending Zhou's results that were only about layer 2/3. 

      The cross-correlation results show that the pattern of the cross-correlogram (CCG) is influenced by the visual pattern being presented. However, the results are not analyzed mechanistically, and the interpretation is unclear. For instance, the authors show in Figure 3 (and in Figure S2) that the peak of the CCG can indicate layer 2/3 excites layer 4C when the visual stimulus is the border ownership test pattern, a large square 8 deg on a side. But how can layer 2/3 excite layer 4C? The authors do not raise or offer an answer to this question. Similar questions arise when considering the CCG of layer 4A/B with layer 2/3. What is the proposed pathway for layer 2/3 to excite 4A/B? Other similar questions arise for all the interlaminar CCG data that are presented. What known functional connections would account for the measured CCGs? 

      We thank the reviewer for raising this important point. As noted in our response to a previous comment, several anatomical pathways could mediate apparent functional inputs from layers 2/3 to 4C and 4A/B. In macaque V1, projections from layers 2/3 to 4A/B have been documented (Blasdel et al., 1985; Callaway and Wiser, 1996), and neurons in 4A/B often extend apical dendrites into layers 2/3 (Lund, 1988; Yoshioka et al., 1994). Although direct projections from layers 2/3 to 4C are generally sparse (Callaway, 1998), a subset of lower layer 3 neurons can give off collateral axons to 4C (Lund and Yoshioka, 1991). Some 4C neurons also extend dendrites into 4B, potentially allowing dendritic integration of inputs from more superficial layers (Somogyi and Cowey, 1981; Mates and Lund, 1983; Yabuta and Callaway, 1998). Sparse connections from 2/3 to layer 4 have also been reported in cat V1 (Binzegger et al., 2004).

      Moreover, layers 2/3 may influence 4C neurons disynaptically, without requiring dense monosynaptic connections. While CCGs suggest possible circuit arrangements, functional connectivity may arise through mechanisms not fully captured by anatomical tracing, and apparent discrepancies between anatomical and functional data are not uncommon. For example, although 4B is known to receive anatomical input primarily from 4Cα, 4B neurons can also be functionally driven by 4Cβ using photostimulation (Sawatari and Callaway, 1996). Our observation of functional inputs from layers 2/3 to layer 4 is also consistent with prior findings in rodent V1, where CCG analysis (e.g., Figure 7 in Senzai, Fernandez-Ruiz and Buzsaki, 2019) or photostimulation (Xu et al., 2016) revealed similar pathways. 

      Layers 5/6 also provide dense projections to layers 4A/B (Lund, 1988; Callaway, 1998). In particular, layer 6 pyramidal neurons, especially the subset classified as Type 1 cells, project substantially to layer 4C (Wiser and Callaway, 1996; Fitzpatrick et al., 1985). 

      We have revised the Discussion section to explicitly address these points and clarify the potential anatomical and functional pathways underlying the measured interlaminar CCGs, highlighting how inputs from layers 2/3 and 5/6 to layer 4 can be mediated via both direct and indirect connections.

      The problems in understanding the CCG data are indirectly caused by the lack of a critical analysis of what is happening in the responses that reveal the border ownership signals, as in Figure 2. Let's put it bluntly - are border ownership signals excitatory or inhibitory? The reason I raise this question is that the present authors insightfully place border ownership as examples of the action of the non-classical receptive field (nCRF) of cortical cells. Most previous work on the nCRF (many papers cited by the authors) reveal the nCRF to be inhibitory or suppressive. In order to know whether nCRF signals are excitatory or inhibitory, one needs a baseline response from the CRF, so that when you introduce nCRF signals you can tell whether the change with respect to the CRF is up or down. As far as I know, prior work on border ownership has not addressed this question, and the present paper doesn't either. This is where the rich dataset that the present authors possess might be used to establish a fundamental property of border ownership. 

      Then we must consider what knowing the sign of the border ownership signal would mean for interpreting the CCG data. If the border ownership signals from extrastriate feedback or, alternatively, from horizontal intrinsic connections, are excitatory, they might provide a shared excitatory input to pairs of cells that would show up in the CCG as a peak at 0 delay. However, if the border ownership signals are inhibitory, they might work by exciting only inhibitory neurons in V1. This could have complicated consequences for the CCG. The interpretation of the CCG data in the present version of the manuscript is unclear (see above). Perhaps a clearer interpretation could be developed once the authors know better what the border ownership signals are. 

      We thank the reviewer for raising this fundamental and thought-provoking question. As noted, B<sub>own</sub> signals arise from nCRF, which has often been associated with suppressive effects. However, Zhang and von der Heydt (2010) provided important insight into this issue by systematically varying the placement of figure fragments outside the CRF while keeping an edge centered within the CRF. They found that contextual fragments on the preferred side of B<sub>own</sub> produce facilitation, while those on the non-preferred side produce suppression. Thus, the nCRF contribution to B<sub>own</sub> reflects both excitatory and inhibitory modulation, depending on the spatial configuration of the figure.

      These effects were well explained by their model in which feedback from grouping cells in higher areas selectively enhances or suppresses V1/V2 neuron responses, depending on their B<sub>own</sub> preference. In this framework, the B<sub>own</sub> signal itself is not inherently excitatory or inhibitory; rather, it results from the net effect of feedback, which can be either facilitative or suppressive. Importantly, it is the input that is modulated — not that the receiving neurons are necessarily inhibitory themselves.

      In the current study, our analysis focused on CCGs showing excessive coincident spiking, i.e., positive peaks, which are typically interpreted as evidence for shared excitatory input or excitatory connections. Due to the limited number of connections, we did not analyze inhibitory interactions, such as anti-correlations or delayed suppression in the CCGs, which would be expected if the reference neuron were inhibitory. Therefore, the CCGs we report here likely reflect the excitatory component of the B<sub>own</sub> signal, and possibly its upstream drive via feedback. While a full separation of excitatory and inhibitory components remains an important goal for future work, our data suggest that B<sub>own</sub> modulation is at least partially mediated through excitatory feedback input.

      My critique of the CCG analysis applies to Figure 5 also. I cannot comprehend the point of showing a very weak correlation of CCG asymmetry with Border Ownership Index, especially when what CCG asymmetry means is unclear mechanistically. Figure 5 does not make the paper stronger in my opinion. 

      We thank the reviewer for this comment. As described in the Results section for Figure 5, the observation that interlaminar information flow correlates with B<sub>own</sub> modulation is important because it demonstrates that these flow patterns are specifically related to the magnitude of B<sub>own</sub> signals, independent of the comparisons between CRF and nCRF stimulation. 

      In Figure 3, the authors show two CCGs that involve 4C--4C pairs. It would be nice to know more about such pairs. If there are any 6--6 pairs, what they look like also would be interesting. The authors also in Figure 3 show CCGs of two 4C--4A/B pairs and it would be quite interesting to know how such CCGs behave when CRF and nCRF stimuli are compared. In other words, the authors have shown us they have many data but have chosen not to analyze them further or to explain why they chose not to analyze them. It might help the paper if the authors would present all the CCG types they have. This suggestion would be helpful when the authors know more about the sign of border ownership signals, as discussed at length above. 

      We thank the reviewer for the insightful comment. The rationale for selecting specific laminar pairs is described in the Results section after Figure 3C and further discussed in the Discussion. In brief, we focused on CCGs computed from pairs in which one neuron resided in laminar compartments receiving feedback/horizontal inputs (layers 2/3 and 5/6) and the other within compartments relatively devoid of these inputs (layers 4C and 4A/B).

      To mitigate uncertainty in defining exact laminar boundaries and to maximize statistical power, we combined some anatomical layers into distinct laminar compartments. This approach allowed us to compare the relative spike timing between neuronal pairs during CRF and nCRF stimulation. If feedback/horizontal inputs contribute more during nCRF than CRF stimulation, we expect this to be reflected in the lead-lag relationships of the CCGs. While other pairs (e.g., 5/6–5/6 or 4C–4A/B) could in principle be analyzed, the hypothesized patterns for these pairs are less clear, and thus they were not the focus of our study. Nonetheless, these additional pairs represent interesting directions for future work.
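
      For readers unfamiliar with the lead-lag logic referred to above, the following is a minimal sketch of a raw cross-correlogram and a simple asymmetry measure. It is an illustration only and not the analysis code used in the study: the function names, the ±10 ms window, the 1 ms bins, the simulated spike trains, and the particular asymmetry definition are all assumptions, and no jitter correction is applied.

```python
import numpy as np

def cross_correlogram(ref_spikes, target_spikes, max_lag=0.010, bin_width=0.001):
    """Histogram of (target - reference) spike-time differences within +/- max_lag seconds."""
    ref = np.asarray(ref_spikes, dtype=float)
    tgt = np.sort(np.asarray(target_spikes, dtype=float))
    n_bins = int(round(2 * max_lag / bin_width))
    edges = np.linspace(-max_lag, max_lag, n_bins + 1)
    counts = np.zeros(n_bins, dtype=np.int64)
    for t in ref:
        # Restrict to target spikes inside the lag window around this reference spike.
        lo = np.searchsorted(tgt, t - max_lag, side="left")
        hi = np.searchsorted(tgt, t + max_lag, side="right")
        counts += np.histogram(tgt[lo:hi] - t, bins=edges)[0]
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts

def lead_lag_asymmetry(centers, counts):
    """Hypothetical index in [-1, 1]; positive means the reference neuron tends to fire first."""
    right = counts[centers > 0].sum()
    left = counts[centers < 0].sum()
    total = right + left
    return 0.0 if total == 0 else float(right - left) / float(total)

# Simulated data (assumption): the target neuron fires 2.5 ms after each reference spike.
rng = np.random.default_rng(0)
ref = np.sort(rng.uniform(0.0, 100.0, size=2000))
tgt = ref + 0.0025

centers, counts = cross_correlogram(ref, tgt)
peak_lag = centers[np.argmax(counts)]
asym = lead_lag_asymmetry(centers, counts)
print(f"peak lag: {peak_lag * 1e3:.1f} ms, asymmetry: {asym:.2f}")
```

      With this sign convention, a peak at positive lag means the reference neuron leads. Comparing such a measure between CRF and nCRF conditions for the same pair is the kind of contrast described above, although the study's actual correction and significance procedures are not reproduced here.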

    1. Author response:

      The following is the authors’ response to the original reviews

      We thank all the reviewers for their constructive comments. We have carefully considered your feedback and revised the manuscript accordingly. The major concern raised was the applicability of SegPore to the RNA004 dataset. To address this, we compared SegPore with f5c and Uncalled4 on RNA004, and found that SegPore demonstrated improved performance, as shown in Table 2 of the revised manuscript.

      Following the reviewers’ recommendations, we updated Figures 3 and 4. Additionally, we added one table and three supplementary figures to the revised manuscript:

      · Table 2: Segmentation benchmark on RNA004 data

      · Supplementary Figure S4: RNA translocation hypothesis illustrated on RNA004 data

      · Supplementary Figure S5: Illustration of Nanopolish raw signal segmentation with eventalign results

      · Supplementary Figure S6: Running time of SegPore on datasets of varying sizes

      Below, we provide a point-by-point response to your comments.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors describe a new computational method (SegPore), which segments the raw signal from nanopore-direct RNA-Seq data to improve the identification of RNA modifications. In addition to signal segmentation, SegPore includes a Gaussian Mixture Model approach to differentiate modified and unmodified bases. SegPore uses Nanopolish to define a first segmentation, which is then refined into base and transition blocks. SegPore also includes a modification prediction model that is included in the output. The authors evaluate the segmentation in comparison to Nanopolish and Tombo, and they evaluate the impact on m6A RNA modification detection using data with known m6A sites. In comparison to existing methods, SegPore appears to improve the ability to detect m6A, suggesting that this approach could be used to improve the analysis of direct RNA-Seq data.

      Strengths:

      SegPore addresses an important problem (signal data segmentation). By refining the signal into transition and base blocks, noise appears to be reduced, leading to improved m6A identification at the site level as well as for single-read predictions. The authors provide a fully documented implementation, including a GPU version that reduces run time. The authors provide a detailed methods description, and the approach to refine segments appears to be new.

      Weaknesses:

      In addition to Nanopolish and Tombo, f5c and Uncalled4 can also be used for segmentation; however, a comparison to these methods is not shown.

      The method was only applied to data from the RNA002 direct RNA-Sequencing version, which is no longer available; it currently remains unclear whether the method still works on RNA004.

      Thank you for your comments.

      To clarify the background, there are two kits for Nanopore direct RNA sequencing: RNA002 (the older version) and RNA004 (the newer version). Oxford Nanopore Technologies (ONT) introduced the RNA004 kit in early 2024 and has since discontinued RNA002. Consequently, most public datasets are based on RNA002, with relatively few available for RNA004 (as of 30 June 2025).

      Nanopolish and Tombo were developed for raw signal segmentation and alignment using RNA002 data, whereas f5c and Uncalled4 are the only two tools that support RNA004 data. Since the development of SegPore began in January 2022, we initially focused on RNA002 due to its data availability. Accordingly, our original comparisons were made against Nanopolish and Tombo using RNA002 data.

      We have now updated SegPore to support RNA004 and compared its performance against f5c and Uncalled4 on three public RNA004 datasets.

      As shown in Table 2 of the revised manuscript, SegPore outperforms both f5c and Uncalled4 in raw signal segmentation. Moreover, the jiggling translocation hypothesis underlying SegPore is further supported, as shown in Supplementary Figure S4.

      The overall improvement in accuracy appears to be relatively small.

      Thank you for the comment.

      We understand that the improvements shown in Tables 1 and 2 may appear modest at first glance due to the small differences in the reported standard deviation (std) values. However, even small absolute changes in std can correspond to substantial relative reductions in noise, especially when the total variance is low.

      To better quantify the improvement, we assume that approximately 20% of the std for Nanopolish, Tombo, f5c, and Uncalled4 arises from noise. Using this assumption, we calculate the relative noise reduction rate of SegPore as follows:

      Noise reduction rate = (baseline std − SegPore std) / (0.2 × baseline std)

      Based on this formula, the average noise reduction rates across all datasets are:

      - SegPore vs Nanopolish: 49.52%

      - SegPore vs Tombo: 167.80%

      - SegPore vs f5c: 9.44%

      - SegPore vs Uncalled4: 136.70%

      These results demonstrate that SegPore can reduce the noise level by at least 9% given a noise level of 20%, which we consider a meaningful improvement for downstream tasks, such as base modification detection and signal interpretation. The high noise reduction rates observed in Tombo and Uncalled4 (over 100%) suggest that their actual noise proportion may be higher than our 20% assumption.

      We acknowledge that this 20% noise level assumption is an approximation. Our intention is to illustrate that SegPore provides measurable improvements in relative terms, even when absolute differences appear small.
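
      To make the calculation above concrete, here is a minimal sketch of the noise reduction rate as defined by the formula. The function name and the example std values are hypothetical and are not taken from Tables 1 or 2; the 20% noise fraction is the stated assumption and is exposed as a parameter.

```python
def noise_reduction_rate(baseline_std, segpore_std, noise_fraction=0.2):
    """(baseline std - SegPore std) / (noise_fraction * baseline std).

    noise_fraction is the assumed share of the baseline std that is noise.
    """
    if baseline_std <= 0:
        raise ValueError("baseline_std must be positive")
    if not 0 < noise_fraction <= 1:
        raise ValueError("noise_fraction must be in (0, 1]")
    return (baseline_std - segpore_std) / (noise_fraction * baseline_std)

# Hypothetical std values chosen only to illustrate the arithmetic.
modest = noise_reduction_rate(baseline_std=2.0, segpore_std=1.8)
large = noise_reduction_rate(baseline_std=2.0, segpore_std=1.5)
print(f"modest: {modest:.0%}, large: {large:.0%}")
```

      The second case shows how the rate can exceed 100%: whenever the absolute reduction in std is larger than the assumed noise share of the baseline, the ratio is greater than one, which is the situation noted above for Tombo and Uncalled4.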

      The run time and resources required to run SegPore are not shown; however, it appears that the GPU version is essential, which could limit the application of this method in practice.

      Thank you for your comment.

      Detailed instructions for running SegPore are provided on GitHub (https://github.com/guangzhaocs/SegPore). Regarding computational resources, SegPore currently requires one CPU core and one NVIDIA GPU to perform the segmentation task efficiently.

      We present SegPore’s runtime for typical datasets in Supplementary Figure S6 in the revised manuscript.  For a typical 1 GB fast5 file, the segmentation takes approximately 9.4 hours using a single NVIDIA DGX‑1 V100 GPU and one CPU core.

      Currently, GPU acceleration is essential to achieve practical runtimes with SegPore. We acknowledge that this requirement may limit accessibility in some environments. To address this, we are actively working on a full C++ implementation of SegPore that will support CPU-only execution. While development is ongoing, we aim to release this version in a future update.

      Reviewer #2 (Public review):

      Summary:

      The work seeks to improve the detection of RNA m6A modifications using Nanopore sequencing through improvements in raw data analysis. These improvements are said to be in the segmentation of the raw data, although the work appears to position the alignment of raw data to the reference sequence and some further processing as part of the segmentation, and result statistics are mostly shown on the 'data-assigned-to-kmer' level.

      As such, the title, abstract, and introduction stating the improvement of just the 'segmentation' does not seem to match the work the manuscript actually presents, as the wording seems a bit too limited for the work involved.

      The work itself shows minor improvements in m6Anet when replacing Nanopolish eventalign with this new approach, but clear improvements in the distributions of data assigned per kmer. However, these assignments were improved well enough to enable m6A calling from them directly, both at site-level and at read-level.

      Strengths:

      A large part of the improvements shown appear to stem from the addition of extra, non-base/kmer specific, states in the segmentation/assignment of the raw data, removing a significant portion of what can be considered technical noise for further analysis. Previous methods enforced the assignment of all raw data, forcing a technically optimal alignment that may lead to suboptimal results in downstream processing as data points could be assigned to neighbouring kmers instead, while random noise that is assigned to the correct kmer may also lead to errors in modification detection.

      For an optimal alignment between the raw signal and the reference sequence, this approach may yield improvements for downstream processing using other tools. Additionally, the GMM used for calling the m6A modifications provides a useful, simple, and understandable logic to explain the reason a modification was called, as opposed to the black-box models that are nowadays often employed for these types of tasks.

      Weaknesses:

      The work seems limited in applicability largely due to the focus on the R9's 5mer models. The R9 flow cells are phased out and not available to buy anymore. Instead, the R10 flow cells with larger kmer models are the new standard, and the applicability of this tool on such data is not shown. We may expect similar behaviour from the raw sequencing data where the noise and transition states are still helpful, but the increased kmer size introduces a large amount of extra computing required to process data and without knowledge of how SegPore scales, it is difficult to tell how useful it will really be. The discussion suggests possible accuracy improvements moving to 7mers or 9mers, but no reason why this was not attempted.

      Thank you for pointing out this important limitation. Please refer to our response to Point 1 of Reviewer 1 for SegPore’s performance on RNA004 data. Notably, the jiggling behavior is also observed in RNA004 data, and SegPore achieves better performance than both f5c and Uncalled4.

      The increased k-mer size in RNA004 affects only the training phase of SegPore (refer to Supplementary Note 1, Figure 5 for details on the training and testing phases). Once the baseline means and standard deviations for each k-mer are established, applying SegPore to RNA004 data proceeds similarly to RNA002. This is because each k-mer in the reference sequence has, at most, two states (modified and unmodified). While the larger k-mer size increases the size of the parameter table, it does not increase the computational complexity during segmentation. Although estimating the initial k-mer parameter table requires significant time and effort on our part, it does not affect the runtime for end users applying SegPore to RNA004 data.
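
      The scaling argument above can be made concrete with a little arithmetic. The following is only an illustrative sketch, not SegPore code: it assumes at most two states per k-mer (as stated in the response) and uses the k-mer sizes mentioned in the review (5, 7, 9); the function names and the 1,000-nt reference length are hypothetical.

```python
# Illustrative arithmetic only: parameter-table size grows with k,
# while the number of k-mers to align per read depends on reference length.

def table_size(k: int, states_per_kmer: int = 2) -> int:
    """Entries in a k-mer parameter table if each k-mer has up to
    `states_per_kmer` states (modified / unmodified)."""
    return (4 ** k) * states_per_kmer


def kmers_in_reference(ref_len: int, k: int) -> int:
    """Number of overlapping k-mers in a reference segment of length `ref_len`."""
    return max(ref_len - k + 1, 0)


if __name__ == "__main__":
    for k in (5, 7, 9):
        print(f"k={k}: table entries={table_size(k)}, "
              f"k-mers in a 1000-nt reference={kmers_in_reference(1000, k)}")
```

      Under these assumptions the table grows exponentially in k, whereas the number of k-mers a read must be aligned against stays roughly equal to the reference length.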

      Extending SegPore from 5-mers to 7-mers or 9-mers for RNA002 data would require substantial effort to retrain the model and generate sufficient training data. Additionally, such an extension would make SegPore’s output incompatible with widely used upstream and downstream tools such as Nanopolish and m6Anet, complicating integration and comparison. For these reasons, we leave this extension for future work.

      The manuscript suggests the eventalign results are improved compared to Nanopolish. While this is believably shown to be true (Table 1), the effect on the use case presented, downstream differentiation between modified and unmodified status on a base/kmer, is likely limited as during actual modification calling the noisy distributions are usually 'good enough', and not skewed significantly in one direction to really affect the results too terribly.

      Thank you for your comment. While current state-of-the-art (SOTA) methods perform well on benchmark datasets, there remains significant room for improvement. Most SOTA evaluations are based on limited datasets, primarily covering DRACH motifs in human and mouse transcriptomes. However, m6A modifications can also occur in non-DRACH motifs, where current models may underperform. Additionally, other RNA modifications—such as pseudouridine, inosine, and m5C—are less studied, and their detection may benefit from improved signal modeling.

      We would also like to emphasize that raw signal segmentation and RNA modification detection are distinct tasks. SegPore focuses on the former, providing a cleaner, more interpretable signal that can serve as a foundation for downstream tasks. Improved segmentation may facilitate the development of more accurate RNA modification detection algorithms by the community.

      Scientific progress often builds incrementally through targeted improvements to foundational components. We believe that enhancing signal segmentation, as SegPore does, contributes meaningfully to the broader field—the full impact will become clearer as the tool is adopted into more complex workflows.

      Furthermore, looking at alternative approaches where this kind of segmentation could be applied, Nanopolish uses the main segmentation+alignment for a first alignment and follows up with a form of targeted local realignment/HMM test for modification calling (and for training too), decreasing the need for the near-perfect segmentation+alignment this work attempts to provide. Any tool applying a similar strategy probably largely negates the problems this manuscript aims to improve upon.

      We thank the reviewer for this insightful comment.

      To clarify, Nanopolish provides three independent commands: polya, eventalign, and call-methylation.

      - The polya command identifies the adapter, poly(A) tail, and transcript region in the raw signal.

      - The eventalign command aligns the raw signal to a reference sequence, assigning a signal segment to individual k-mers in the reference.

      - The call-methylation command detects methylated bases from DNA sequencing data.

      The eventalign command corresponds to “the main segmentation+alignment for a first alignment,” while call-methylation corresponds to “a form of targeted local realignment/HMM test for modification calling,” as mentioned in the reviewer’s comment. SegPore’s segmentation is similar in purpose to Nanopolish’s eventalign, while its RNA modification estimation component is similar in concept to Nanopolish’s call-methylation.

      We agree the general idea may appear similar, but the implementations are entirely different. Importantly, Nanopolish’s call-methylation is designed for DNA sequencing data, and its models are not trained to recognize RNA modifications. This means they address distinct research questions and cannot be directly compared on the same RNA modification estimation task. However, it is valid to compare them on the segmentation task, where SegPore exhibits better performance (Table 1).

      We infer the reviewer may suggest that because m6Anet is a deep neural network capable of learning from noisy input, the benefit of more accurate segmentation (such as that provided by SegPore) might be limited. This concern may arise from the limited improvement of SegPore+m6Anet over Nanopolish+m6Anet in bulk analysis (Figure 3). Several factors may contribute to this observation:

      (i) For reads aligned to the same gene in the in vivo data, alignment may be inaccurate due to pseudogenes or transcript isoforms.

      (ii) The in vivo benchmark data are inherently more complex than in vitro datasets and may contain additional modifications (e.g., m5C, m7G), which can confound m6A calling by altering the signal baselines of k-mers.

      (iii) m6Anet is trained on events produced by Nanopolish and may not be optimal for SegPore-derived events.

      (iv) The benchmark dataset lacks a modification-free (IVT) control sample, making it difficult to establish a true baseline for each k-mer.

      In the IVT data (Figure 4), SegPore shows a clear improvement in single-molecule m6A identification, with a 3~4% gain in both ROC-AUC and PR-AUC. This demonstrates SegPore’s practical benefit for applications requiring higher sensitivity at the molecule level.

      As noted earlier, SegPore’s contribution lies in denoising and improving the accuracy of raw signal segmentation, which is a foundational step in many downstream analyses. While it may not yet lead to a dramatic improvement in all applications, it already provides valuable insights into the sequencing process (e.g., cleaner signal profiles in Figure 4) and enables measurable gains in modification detection at the single-read level. We believe SegPore lays the groundwork for developing more accurate and generalizable RNA modification detection tools beyond m6A.

      We have also added the following sentence in the discussion to highlight SegPore’s limited performance in bulk analysis:

      “The limited improvement of SegPore combined with m6Anet over Nanopolish+m6Anet in bulk in vivo analysis (Figure 3) may be explained by several factors: potential alignment inaccuracies due to pseudogenes or transcript isoforms, the complexity of in vivo datasets containing additional RNA modifications (e.g., m5C, m7G) affecting signal baselines, and the fact that m6Anet is specifically trained on events produced by Nanopolish rather than SegPore. Additionally, the lack of a modification-free control (in vitro transcribed) sample in the benchmark dataset makes it difficult to establish true baselines for each k-mer. Despite these limitations, SegPore demonstrates clear improvement in single-molecule m6A identification in IVT data (Figure 4), suggesting it is particularly well suited for in vitro transcription data analysis.”

      Finally, in the segmentation/alignment comparison to Nanopolish, the latter was not fitted(/trained) on the same data but appears to use the pre-trained model it comes with. For the sake of comparing segmentation/alignment quality directly, fitting Nanopolish on the same data used for SegPore could remove the influences of using different training datasets and focus on differences stemming from the algorithm itself.

      In the segmentation benchmark (Table 1), SegPore uses the fixed 5-mer parameter table provided by ONT. The hyperparameters of the HHMM are also fixed and not estimated from the raw signal data being segmented. Only in the m6A modification task does SegPore re-estimate the baselines for the modified and unmodified states of k-mers. Therefore, the comparison with Nanopolish is fair, as both tools rely on pre-defined models during segmentation.

      Appraisal:

      The authors have shown their method's ability to identify noise in the raw signal and remove it from the segmentation and alignment, reducing its influence on further analyses. Figures directly comparing the values per kmer do show a visibly improved assignment of raw data per kmer. As a replacement for Nanopolish eventalign it seems to have a rather limited, but positive, effect on m6Anet results. For modification calling at the single-read level, this work does appear to improve upon CHEUI.

      Impact:

      With the current developments for Nanopore-based modification largely focusing on Artificial Intelligence, Neural Networks, and the like, improvements made in interpretable approaches provide an important alternative that enables a deeper understanding of the data rather than providing a tool that plainly answers the question of whether a base is modified or not, without further explanation. The work presented is best viewed in the context of a workflow where one aims to get an optimal alignment between raw signal data and the reference base sequence for further processing. For example, as presented, as a possible replacement for Nanopolish eventalign. Here it might enable data exploration and downstream modification calling without the need for local realignments or other approaches that re-consider the distribution of raw data around the target motif, such as a 'local' Hidden Markov Model or Neural Networks. These possibilities are useful for a deeper understanding of the data and further tool development for modification detection works beyond m6A calling.

      Reviewer #3 (Public review):

      Summary:

      Nucleotide modifications are important regulators of biological function, however, until recently, their study has been limited by the availability of appropriate analytical methods. Oxford Nanopore direct RNA sequencing preserves nucleotide modifications, permitting their study, however, many different nucleotide modifications lack an available base-caller to accurately identify them. Furthermore, existing tools are computationally intensive, and their results can be difficult to interpret.

      Cheng et al. present SegPore, a method designed to improve the segmentation of direct RNA sequencing data and boost the accuracy of modified base detection.

      Strengths:

      This method is well-described and has been benchmarked against a range of publicly available base callers that have been designed to detect modified nucleotides.

      Weaknesses:

      However, the manuscript has a significant drawback in its current version. The most recent nanopore RNA base callers can distinguish between different ribonucleotide modifications, however, SegPore has not been benchmarked against these models.

      I recommend re-submission of the manuscript with benchmarking against the rna004_130bps_hac@v5.1.0 and rna004_130bps_sup@v5.1.0 dorado models, which are reported to detect m5C, m6A_DRACH, inosine_m6A and PseU.

      A clear demonstration that SegPore also outperforms the newer RNA base caller models will confirm the utility of this method.

      Thank you for highlighting this important limitation. While Dorado, the new ONT basecaller, is publicly available and supports modification-aware basecalling, suitable public datasets for benchmarking m5C, inosine, m6A, and PseU detection on RNA004 are currently lacking. Dorado’s modification-aware models are trained on ONT’s internal data, which is not publicly released. Therefore, it is not currently feasible to evaluate or directly compare SegPore’s performance against Dorado for m5C, inosine, m6A, and PseU detection.

      We would also like to emphasize that SegPore’s main contribution lies in raw signal segmentation, which is an upstream task in the RNA modification detection pipeline. To assess its performance in this context, we benchmarked SegPore against f5c and Uncalled4 on public RNA004 datasets for segmentation quality. Please refer to our response to Point 1 of Reviewer 1 for details.

      Our results show that the characteristic “jiggling” behavior is also observed in RNA004 data (Supplementary Figure S4), and SegPore achieves better segmentation performance than both f5c and Uncalled4 (Table 2).

      Recommendations for the authors:

      Reviewing Editor:

      Please note that we also received the following comments on the submission, which we encourage you to take into account:

      took a look at the work and from what I saw it only mentions/uses RNA002 chemistry, which is deprecated, effectively making this software unusable by anyone any more, as RNA002 is not commercially available. While the results seem promising, the authors need to show that it would work for RNA004. Notably, there is an alternative software for resquiggling for RNA004 (not Tombo or Nanopolish, but f5c, the GPU-accelerated version of Nanopolish), which does support RNA004. Therefore, they need to show that SegPore works for RNA004, because otherwise it is pointless to see that this method works better than others if it does not support current sequencing chemistries and only works for deprecated chemistries, and people will keep using f5c because it's the only one that currently works for RNA004. Alternatively, if there were biological insights won from the method, one could justify not implementing it in RNA004, but in this case, RNA002 has been deprecated since March 2024, and the paper is purely methodological.

      Thank you for the comment. We agree that support for current sequencing chemistries is essential for practical utility. While SegPore was initially developed and benchmarked on RNA002 due to the availability of public data, we have now extended SegPore to support RNA004 chemistry.

      To address this concern, we performed a benchmark comparison using public RNA004 datasets against tools specifically designed for RNA004, including f5c and Uncalled4. Please refer to our response to Point 1 of Reviewer 1 for details. The results show that SegPore consistently outperforms f5c and Uncalled4 in segmentation accuracy on RNA004 data.

      Reviewer #2 (Recommendations for the authors):

      Various statements are made throughout the text that require further explanation, which might actually be defined in more detail elsewhere sometimes but are simply hard to find in the current form.

      (1) Page 2, “In this technique, five nucleotides (5mers) reside in the nanopore at a time, and each 5mer generates a characteristic current signal based on its unique sequence and chemical properties (16).”

      5mer? Still on R9 or just ignoring longer range influences, relevant? It is indeed a R9.4 model from ONT.

      Thank you for the observation. We apologize for the confusion and have clarified the relevant paragraph to indicate that the method is developed for RNA002 data by default. Specifically, we have added the following sentence:

      “Two versions of the direct RNA sequencing (DRS) kits are available: RNA002 and RNA004. Unless otherwise specified, this study focuses on RNA002 data.”

      (2) Page 3, “Employ models like Hidden Markov Models (HMM) to segment the signal, but they are prone to noise and inaccuracies.”

      That's the alignment/calling part, not the segmentation?

      Thank you for the comment. We apologize for the confusion. To clarify the distinction between segmentation and alignment, we added a new paragraph before the one in question to explain the general workflow of Nanopore DRS data analysis and to clearly define the task of segmentation. The added text reads:

      “The general workflow of Nanopore direct RNA sequencing (DRS) data analysis is as follows. First, the raw electrical signal from a read is basecalled using tools such as Guppy or Dorado, which produce the nucleotide sequence of the RNA molecule. However, these basecalled sequences do not include the precise start and end positions of each ribonucleotide (or k-mer) in the signal. Because basecalling errors are common, the sequences are typically mapped to a reference genome or transcriptome using minimap2 to recover the correct reference sequence. Next, tools such as Nanopolish and Tombo align the raw signal to the reference sequence to determine which portion of the signal corresponds to each k-mer. We define this process as the segmentation task, referred to as "eventalign" in Nanopolish. Based on this alignment, Nanopolish extracts various features—such as the start and end positions, mean, and standard deviation of the signal segment corresponding to a k-mer. This signal segment or its derived features is referred to as an "event" in Nanopolish.”

      We also revised the following paragraph describing SegPore to more clearly contrast its approach:

      “In SegPore, we first segment the raw signal into small fragments using a Hierarchical Hidden Markov Model (HHMM), where each fragment corresponds to a sub-state of a k-mer. Unlike Nanopolish and Tombo, which directly align the raw signal to the reference sequence, SegPore aligns the mean values of these small fragments to the reference. After alignment, we concatenate all fragments that map to the same k-mer into a larger segment, analogous to the "eventalign" output in Nanopolish. For RNA modification estimation, we use only the mean signal value of each reconstructed event.”

      We hope this revision clarifies the difference between segmentation and alignment in the context of our method and resolves the reviewer’s concern.

      (3) Page 4, Figure 1, “These segments are then aligned with the 5mer list of the reference sequence fragment using a full/partial alignment algorithm, based on a 5mer parameter table. For example, 𝐴𝑗 denotes the base "A" at the j-th position on the reference.”

      I think I do understand the meaning, but I do not understand the relevance of the Aj bit in the last sentence. What is it used for?

      When aligning the segments (output from Step 2) to the reference sequence in Step 3, it is possible for multiple segments to align to the same k-mer. This can occur particularly when the reference contains consecutive identical bases, such as multiple adenines (A). For example, as shown in Fig. 1A, Step 3, the first two segments (μ₁ and μ₂) are aligned to the first 'A' in the reference sequence, while the third segment is aligned to the second 'A'. In this case, the reference sequence is AACTGGTTTC...GTC, which contains exactly two consecutive 'A's at the start. The position subscript j in 𝐴𝑗 therefore disambiguates segment alignment in regions with repeated bases.
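
      To illustrate why the position index matters, the short sketch below groups segment means by reference position rather than by base identity. It is an illustration only: the function name, the current values, and the alignment itself are made up, and only the first bases of the example reference are used.

```python
from collections import defaultdict

def group_by_position(segment_means, aligned_positions):
    """Group segment means by the (0-based) reference position they align to."""
    grouped = defaultdict(list)
    for mu, j in zip(segment_means, aligned_positions):
        grouped[j].append(mu)
    return dict(grouped)


reference = "AACTGG"                      # first bases of the example reference
segment_means = [108.2, 107.9, 109.1, 95.4]   # hypothetical values for mu_1..mu_4
aligned_positions = [0, 0, 1, 2]          # mu_1, mu_2 -> first A; mu_3 -> second A

grouped = group_by_position(segment_means, aligned_positions)

if __name__ == "__main__":
    for j in sorted(grouped):
        # Label each group as base + 1-based position, e.g. A_1, A_2, C_3
        print(f"{reference[j]}_{j + 1}: {grouped[j]}")
```

      Grouping by base alone would merge the segments of the two adjacent adenines; keying on the position keeps them apart.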

      Additionally, this figure and its subscript include mapping with Guppy and Minimap2 but do not mention Nanopolish at all, while that seems an equally important step in the preprocessing (pg5). As such it is difficult to understand the role Nanopolish exactly plays. It's also not mentioned explicitly in the SegPore Workflow on pg15, perhaps it's part of step 1 there?

      We thank the reviewer for pointing this out. We apologize for the confusion. As mentioned in the public response to point 3 of Reviewer 2, SegPore uses Nanopolish to identify the poly(A) tail and transcript regions from the raw signal. SegPore then performs segmentation and alignment on the transcript portion only. This step is indeed part of Step 1 in the preprocessing workflow, as described in Supplementary Note 1, Section 3.

      To clarify this in the main text, we have updated the preprocessing paragraph on page 6 to explicitly describe the role of Nanopolish:

      “We begin by performing basecalling on the input fast5 file using Guppy, which converts the raw signal data into ribonucleotide sequences. Next, we align the basecalled sequences to the reference genome using Minimap2, generating a mapping between the reads and the reference sequences. Nanopolish provides two independent commands: "polya" and "eventalign". The "polya" command identifies the adapter, poly(A) tail, and transcript region in the raw signal, which we refer to as the poly(A) detection results. The raw signal segment corresponding to the poly(A) tail is used to standardize the raw signal for each read. The "eventalign" command aligns the raw signal to a reference sequence, assigning a signal segment to individual k-mers in the reference. It also computes summary statistics (e.g., mean, standard deviation) from the signal segment for each k-mer. Each k-mer together with its corresponding signal features is termed an event. These event features are then passed into downstream tools such as m6Anet and CHEUI for RNA modification detection. For full transcriptome analysis (Figure 3), we extract the aligned raw signal segment and reference sequence segment from Nanopolish's events for each read by using the first and last events as start and end points. For in vitro transcription (IVT) data with a known reference sequence (Figure 4), we extract the raw signal segment corresponding to the transcript region for each input read based on Nanopolish’s poly(A) detection results.”

      Additionally, we revised the legend of Figure 1A to explicitly include Nanopolish in step 1 as follows:

      “The raw current signal fragments are paired with the corresponding reference RNA sequence fragments using Nanopolish.”

      (4) Page 5, “The output of Step 3 is the "eventalign," which is analogous to the output generated by the Nanopolish "eventalign" command.”

      Naming the function of Nanopolish, the output file, and later on (pg9) the alignment of the newly introduced methods the exact same "eventalign" is very confusing.

      Thank you for the helpful comment. We acknowledge the potential confusion caused by using the term “eventalign” in multiple contexts. To improve clarity, we now consistently use the term “events” to refer to the output of both Nanopolish and SegPore, rather than using "eventalign" as a noun. We also added the following sentence to Step 3 (page 6) to clearly define what an “event” refers to in our manuscript:

      “An "event" refers to a segment of the raw signal that is aligned to a specific k-mer on a read, along with its associated features such as start and end positions, mean current, standard deviation, and other relevant statistics.”

      We have revised the text throughout the manuscript accordingly to reduce ambiguity and ensure consistent terminology.

      (5) Page 5, “Once aligned, we use Nanopolish's eventalign to obtain paired raw current signal segments and the corresponding fragments of the reference sequence, providing a precise association between the raw signals and the nucleotide sequence.”

      I thought the new method's HHMM was supposed to output an 'eventalign' formatted file. As this is not clearly mentioned elsewhere, is this a mistake in writing? Is this workflow dependent on Nanopolish 'eventalign' function and output or not?

      We apologize for the confusion. To clarify, SegPore is not dependent on Nanopolish’s eventalign function for generating the final segmentation results. As described in our response to your comment point 2 and elaborated in the revised text on page 4, SegPore uses its own HHMM-based segmentation model to divide the raw signal into small fragments, each corresponding to a sub-state of a k-mer. These fragments are then aligned to the reference sequence based on their mean current values.

      As explained in the revised manuscript:

      “In SegPore, we first segment the raw signal into small fragments using a Hierarchical Hidden Markov Model (HHMM), where each fragment corresponds to a sub-state of a k-mer. Unlike Nanopolish and Tombo, which directly align the raw signal to the reference sequence, SegPore aligns the mean values of these small fragments to the reference. After alignment, we concatenate all fragments that map to the same k-mer into a larger segment, analogous to the "eventalign" output in Nanopolish. For RNA modification estimation, we use only the mean signal value of each reconstructed event.”

      To avoid ambiguity, we have also revised the sentence on page 5 to more clearly distinguish the roles of Nanopolish and SegPore in the workflow. The updated sentence now reads:

      “Nanopolish provides two independent commands: "polya" and "eventalign". The "polya" command identifies the adapter, poly(A) tail, and transcript region in the raw signal, which we refer to as the poly(A) detection results. The raw signal segment corresponding to the poly(A) tail is used to standardize the raw signal for each read. The "eventalign" command aligns the raw signal to a reference sequence, assigning a signal segment to individual k-mers in the reference. It also computes summary statistics (e.g., mean, standard deviation) from the signal segment for each k-mer. Each k-mer together with its corresponding signal features is termed an event. These event features are then passed into downstream tools such as m6Anet and CHEUI for RNA modification detection. For full transcriptome analysis (Figure 3), we extract the aligned raw signal segment and reference sequence segment from Nanopolish's events for each read by using the first and last events as start and end points. For in vitro transcription (IVT) data with a known reference sequence (Figure 4), we extract the raw signal segment corresponding to the transcript region for each input read based on Nanopolish’s poly(A) detection results.”

      (6) Page 5, “Since the polyA tail provides a stable reference, we normalize the raw current signals across reads, ensuring that the mean and standard deviation of the polyA tail are consistent across all reads.”

      Perhaps I misread this statement: I interpret it as using the PolyA tail to do the normalization, rather than using the rest of the signal to do the normalization, and that results in consistent PolyA tails across all reads.

      If it's the latter, this should be clarified, and a little detail on how the normalization is done should be added, but if my first interpretation is correct:

      I'm not sure if its standard deviation is consistent across reads. The (true) value spread in this section of a read should be fairly limited compared to the rest of the signal in the read, so the noise would influence the scale quite quickly, and such noise might be introduced by pores wearing down and other technical influences. Is this really better than using the non-PolyA tail part of the read's signal, using Median Absolute Deviation to scale for a first alignment round, then re-fitting the signal scaling using Theil-Sen on the resulting alignments (assigned read signal vs reference expected signal), as Tombo/Nanopolish (can) do?

      Additionally, this kind of normalization should have been part of the Nanopolish eventalign already, can this not be re-used? If it's done differently it may result in different distributions than the ONT kmer table obtained for the next step.
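
      For readers unfamiliar with the alternative raised in this comment, the sketch below shows the general idea of median/MAD scaling followed by a Theil-Sen refit of observed against expected levels. It is a generic NumPy illustration and not the Tombo or Nanopolish implementation; the function names, the 1.4826 consistency constant, and all numeric values are assumptions chosen for the example.

```python
import numpy as np

def mad_normalize(signal):
    """First-pass scaling: centre on the median, scale by the MAD."""
    signal = np.asarray(signal, dtype=float)
    med = np.median(signal)
    mad = np.median(np.abs(signal - med))
    if mad == 0:
        raise ValueError("MAD is zero; signal cannot be scaled")
    # 1.4826 makes the MAD comparable to a standard deviation for Gaussian data
    return (signal - med) / (1.4826 * mad)


def theil_sen(x, y):
    """Robust line fit y ~ slope * x + intercept (median of pairwise slopes)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(len(x), k=1)      # all pairs with i < j
    dx = x[j] - x[i]
    keep = dx != 0                           # skip pairs with identical x
    slope = np.median((y[j] - y[i])[keep] / dx[keep])
    intercept = np.median(y - slope * x)
    return slope, intercept


# Hypothetical example: expected model levels vs. observed event means,
# with one outlying event (e.g. a mis-assigned or modified k-mer).
expected = np.array([80.0, 95.0, 110.0, 125.0, 140.0])
observed = 1.1 * expected + 3.0
observed[2] += 20.0

slope, intercept = theil_sen(expected, observed)
rescaled = (observed - intercept) / slope    # map observed onto the model scale
```

      Because the fit uses medians, the single outlying event does not pull the estimated scale and shift away from the values used to generate the data.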

      Thank you for this detailed and thoughtful comment. We apologize for the confusion. The poly(A) tail–based normalization is indeed explained in Supplementary Note 1, Section 3, but we agree that the motivation needed to be clarified in the main text.

      We have now added the following sentence in the revised manuscript (before the original statement on page 5) to provide clearer context:

      “Due to inherent variability between nanopores in the sequencing device, the baseline levels and standard deviations of k-mer signals can differ across reads, even for the same transcript. To standardize the signal for downstream analyses, we extract the raw current signal segments corresponding to the poly(A) tail of each read. Since the poly(A) tail provides a stable reference, we normalize the raw current signals across reads, ensuring that the mean and standard deviation of the poly(A) tail are consistent across all reads. This step is crucial for reducing…..”

      We chose to use the poly(A) tail for normalization because it is sequence-invariant—i.e., all poly(A) tails consist of identical k-mers, unlike transcript sequences which vary in composition. In contrast, using the transcript region for normalization can introduce biases: for instance, reads with more diverse k-mers (having inherently broader signal distributions) would be forced to match the variance of reads with more uniform k-mers, potentially distorting the baseline across k-mers.
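
      As a minimal sketch of this kind of standardization, the code below rescales each read so that its poly(A) region has a fixed mean and standard deviation. It is not the procedure from Supplementary Note 1: the function name, the target values (100.0 and 2.0), and the poly(A) coordinates are placeholders chosen for illustration.

```python
import numpy as np

def polya_standardize(signal, polya_start, polya_end,
                      target_mean=100.0, target_std=2.0):
    """Affine-rescale a read so its poly(A) samples have the target mean/std.

    `target_mean` and `target_std` are arbitrary placeholders, not values
    taken from the manuscript.
    """
    signal = np.asarray(signal, dtype=float)
    polya = signal[polya_start:polya_end]
    mu, sd = polya.mean(), polya.std()
    if sd == 0:
        raise ValueError("poly(A) region has zero variance")
    return (signal - mu) / sd * target_std + target_mean


# Two hypothetical reads of the same molecule seen through pores with
# different gain and offset; samples 0..49 play the role of the poly(A) tail.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
read_a = 1.3 * base + 20.0
read_b = 0.8 * base - 5.0

norm_a = polya_standardize(read_a, 0, 50)
norm_b = polya_standardize(read_b, 0, 50)
```

      In this toy setting the two reads differ only by a pore-specific gain and offset, so they coincide after standardization; real reads would also differ by noise.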

      In our newly added RNA004 benchmark experiment, we used the default normalization provided by f5c, which does not include poly(A) tail normalization. Despite this, SegPore was still able to mask out noise and outperform both f5c and Uncalled4, demonstrating that our segmentation method is robust to different normalization strategies.

      (7) Page 7, “The initialization of the 5mer parameter table is a critical step in SegPore's workflow. By leveraging ONT's established kmer models, we ensure that the initial estimates for unmodified 5mers are grounded in empirical data.”

      It looks like the method uses Nanopolish for a first alignment, then improves the segmentation matching the reference sequence/expected 5mer values. I thought the Nanopolish model/tables are based on the same data, or similarly obtained. If they are different, then why the switch of kmer model? Now the original alignment may have been based on other values, and thus the alignment may seem off with the expected kmer values of this table.

      Thank you for this insightful question. To clarify, SegPore uses Nanopolish only to identify the poly(A) tail and transcript regions from the raw signal. In the bulk in vivo data analysis, we use Nanopolish’s first event as the start and the last event as the end to extract the aligned raw signal chunk and its corresponding reference sequence. Since SegPore relies on Nanopolish solely to delineate the transcript region for each read, it independently aligns the raw signals to the reference sequence without refining or adjusting Nanopolish’s segmentation results.

      While SegPore's 5-mer parameter table is initially seeded using ONT’s published unmodified k-mer models, we acknowledge that empirical signal values may deviate from these reference models due to run-specific technical variation and the presence of RNA modifications. For this reason, SegPore includes a parameter re-estimation step to refine the mean and standard deviation values of each k-mer based on the current dataset.

      The re-estimation process consists of two layers. In the outer layer, we select a set of 5mers that exhibit both modified and unmodified states based on the GMM results (Section 6 of Supplementary Note 1), while the remaining 5mers are assumed to have only unmodified states. In the inner layer, we align the raw signals to the reference sequences using the 5mer parameter table estimated in the outer layer (Section 5 of Supplementary Note 1). Based on the alignment results, we update the 5mer parameter table in the outer layer. This two-layer process is generally repeated for 3 to 5 iterations until the 5mer parameter table converges. This re-estimation ensures that:

      (1) The adjusted 5mer signal baselines remain close to the ONT reference (for consistency);

      (2) The alignment score between the observed signal and the reference sequence is optimized (as detailed in Equation 11, Section 5 of Supplementary Note 1);

      (3) Only 5mers that show a clear difference between the modified and unmodified components in the GMM are considered subject to modification.

      By doing so, SegPore achieves more accurate signal alignment independent of Nanopolish’s models, and the alignment is directly tuned to the data under analysis.

      (8) Page 9, “The output of the alignment algorithm is an eventalign, which pairs the base blocks with the 5mers from the reference sequence for each read (Fig. 1C).”

      “Modification prediction

      After obtaining the eventalign results, we estimate the modification state of each motif using the 5mer parameter table.”

      This wording seems to have been introduced on page 5 but (also there) reads a bit confusingly as the name of the output format, file, and function are now named the exact same "eventalign". I assume the obtained eventalign results now refer to the output of your HHMM, and not the original Nanopolish eventalign results, based on context only, but I'd rather have a clear naming that enables more differentiation.

      We apologize for the confusion. We have revised the sentence as follows for clarity:

      “A detailed description of both alignment algorithms is provided in Supplementary Note 1. The output of the alignment algorithm is an alignment that pairs the base blocks with the 5mers from the reference sequence for each read (Fig. 1C). Base blocks aligned to the same 5-mer are concatenated into a single raw signal segment (referred to as an “event”), from which various features—such as start and end positions, mean current, and standard deviation—are extracted. Detailed derivation of the mean and standard deviation is provided in Section 5.3 in Supplementary Note 1. In the remainder of this paper, we refer to these resulting events as the output of eventalign analysis or the segmentation task. ”

      (9) Page 9, “Since a single 5mer can be aligned with multiple base blocks, we merge all aligned base blocks by calculating a weighted mean. This weighted mean represents the single base block mean aligned with the given 5mer, allowing us to estimate the modification state for each site of a read.”

      I assume the weights depend on the length of the segment but I don't think it is explicitly stated while it should be.

      Thank you for the helpful observation. To improve clarity, we have moved this explanation to the last paragraph of the previous section (see response to point 8), where we describe the segmentation process in more detail.

      Additionally, a complete explanation of how the weighted mean is computed is provided in Section 5.3 of Supplementary Note 1. It is derived from signal points that are assigned to a given 5mer.
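The merging step described above can be illustrated with a short sketch. It is not SegPore's implementation: it assumes, as the reviewer does, that each base block is weighted by its number of signal points, which is equivalent to computing the statistics over all signal points assigned to the 5mer. The function name `merge_base_blocks` and the tuple layout are hypothetical.

```python
import math


def merge_base_blocks(blocks):
    """Merge base blocks aligned to one 5mer into a single event.

    Each block is a (start, end, mean, std) tuple covering the half-open
    signal range [start, end). Weighting by block length is an assumption
    made for this sketch.
    """
    n_total = sum(end - start for start, end, _, _ in blocks)
    # Length-weighted mean equals the mean over all assigned signal points.
    mean = sum((end - start) * m for start, end, m, _ in blocks) / n_total
    # Pooled second moment: within-block variance plus squared block mean.
    second = sum((end - start) * (s * s + m * m) for start, end, m, s in blocks) / n_total
    std = math.sqrt(max(second - mean * mean, 0.0))
    return {
        "start": min(b[0] for b in blocks),
        "end": max(b[1] for b in blocks),
        "mean": mean,
        "std": std,
    }


# Two blocks of 10 and 30 points separated by a short transition block.
event = merge_base_blocks([(0, 10, 100.0, 1.0), (14, 44, 104.0, 1.0)])
print(event)
```

In this toy input the longer block dominates, so the merged mean (103.0 pA) sits closer to 104 than to 100. The exact derivation used by SegPore is the one in Section 5.3 of Supplementary Note 1.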

      (10) Page 10, “Afterward, we manually adjust the 5mer parameter table using heuristics to ensure that the modified 5mer distribution is significantly distinct from the unmodified distribution.”

      Using what heuristics? If this is explained in the supplementary notes then please refer to the exact section.

      Thank you for pointing this out. The heuristics used to manually adjust the 5mer parameter table are indeed explained in detail in Section 7 of Supplementary Note 1.

      To clarify this in the manuscript, we have revised the sentence as follows:

      “Afterward, we manually adjust the 5mer parameter table using heuristics to ensure that the modified 5mer distribution is significantly distinct from the unmodified distribution (see details in Section 7 of Supplementary Note 1).”

      (11) Page 10, “Once the table is fixed, it is used for RNA modification estimation in the test data without further updates.”

      By what tool/algorithm? Perhaps it is your own implementation, but with the next section going into segmentation benchmarking and using Nanopolish before this seems undefined.

      Thank you for pointing this out. We use our own implementation. See Algorithm 3 in Section 6 of Supplementary Note 1.

      We have revised the sentence for clarity:

      “Once a stabilized 5mer parameter table is estimated from the training data, it is used for RNA modification estimation in the test data without further updates. A more detailed description of the GMM re-estimation process is provided in Section 6 of Supplementary Note 1.”

      (12) Page 11, “A 5mer was considered significantly modified if its read coverage exceeded 1,500 and the distance between the means of the two Gaussian components in the GMM was greater than 5.”

      Considering the scaling done before also not being very detailed in what range to expect, this cutoff doesn't provide any useful information. Is this a pA value?

      Thank you for the observation. Yes, the value refers to the current difference measured in picoamperes (pA). To clarify this, we have revised the sentence in the manuscript to include the unit explicitly:

      “A 5mer was considered significantly modified if its read coverage exceeded 1,500 and the distance between the means of the two Gaussian components in the GMM was greater than 5 picoamperes (pA).”
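For readers who prefer the criterion as code, the rule quoted above reduces to two comparisons. This is a minimal sketch; the function and constant names are hypothetical, and only the two thresholds come from the quoted sentence.

```python
MIN_COVERAGE = 1500      # reads; the quoted rule requires coverage to exceed this
MIN_MEAN_GAP_PA = 5.0    # picoamperes between the two GMM component means


def is_significantly_modified(coverage, mean_unmod_pa, mean_mod_pa):
    """Apply the coverage and mean-distance cutoffs from the quoted rule."""
    gap = abs(mean_unmod_pa - mean_mod_pa)
    return coverage > MIN_COVERAGE and gap > MIN_MEAN_GAP_PA


# Component means of 125 pA and 118 pA give a 7 pA gap (coverage is made up).
print(is_significantly_modified(2000, 125.0, 118.0))
print(is_significantly_modified(1500, 125.0, 118.0))
```

Both comparisons are strict, so a 5mer with exactly 1,500 reads does not qualify.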

      (13) Page 13, “The raw current signals, as shown in Figure 1B.”

      Wrong figure? Figure 2B seems logical.

      Thank you for catching this. You are correct—the reference should be to Figure 2B, not Figure 1B. We have corrected this in the revised manuscript.

      (14) Page 14, Figure 2A, these figures supposedly support the jiggle hypothesis but the examples seem to match only half the explanation. Any of these jiggles seem to be followed shortly by another in the opposite direction, and the amplitude seems to match better within each such pair than the next or previous segments. Perhaps there is a better explanation still, and this behaviour can be modelled as such instead.

      Thank you for your comment. We acknowledge that the observed signal patterns may appear ambiguous and could potentially suggest alternative explanations. However, as shown in Figure 2A, the red dots tend to align closely with the baseline of the previous state, while the blue dots align more closely with the baseline of the next state. We interpret this as evidence for the "jiggling" hypothesis, in which a k-mer temporarily oscillates between adjacent states during translocation.

      That said, we agree that more sophisticated models could be explored to better capture this behavior, and we welcome suggestions or references to alternative models. We will consider this direction in future work.

      (15) Page 15, “This occurs because subtle transitions within a base block may be mistaken for transitions between blocks, leading to inflated transition counts.”

      Is it really a "subtle transition" if it happens within a base block? It seems this is not a transition and thus shouldn't be named as such.

      Thank you for pointing this out. We agree that the term “subtle transition” may be misleading in this context. We revised the sentence to clarify the potential underlying cause of the inflated transition counts:

      “This may be due to a base block actually corresponding to a sub-state of a single 5mer, rather than each base block corresponding to a full 5mer, leading to inflated transition counts. To address this issue, SegPore’s alignment algorithm was refined to merge multiple base blocks (which may represent sub-states of the same 5mer) into a single 5mer, thereby facilitating further analysis.”

      (16) Page 15, “The SegPore "eventalign" output is similar to Nanopolish's "eventalign" command.”

      To the output of that command, I presume, not to the command itself.

      Thank you for pointing out the ambiguity. We have revised the sentence for clarity:

      “The final outputs of SegPore are the events and modification state predictions. SegPore’s events are similar to the outputs of Nanopolish’s "eventalign" command, in that they pair raw current signal segments with the corresponding RNA reference 5-mers. Each 5-mer is associated with various features — such as start and end positions, mean current, and standard deviation — derived from the paired signal segment.”

      (17) Page 15, “For selected 5mers, SegPore also provides the modification rate for each site and the modification state of that site on individual reads.”

      What selection? Just all kmers with a possible modified base or a more specific subset?

      We revised the sentence to clarify the selection criteria:

      “For selected 5mers that exhibit both a clearly unmodified and a clearly modified signal component, SegPore reports the modification rate at each site, as well as the modification state of that site on individual reads.”

      (18) Page 16, “A key component of SegPore is the 5mer parameter table, which specifies the mean and standard deviation for each 5mer in both modified and unmodified states (Figure 2A).”

      Wrong figure?

      Thank you for pointing this out. You are correct—it should be Figure 1A, not Figure 2A. We intended to visually illustrate the structure of the 5mer parameter table in Figure 1A, and we have corrected this reference in the revised manuscript.

      (19) Page 16, Table 1, I can't quite tell but I assume this is based on all kmers in the table, not just a m6A modified subset. A short added statement to make this clearer would help.

      Yes, you are right: it is averaged over all 5mers. We have revised the sentence for clarity as follows:

      “As shown in Table 1, SegPore consistently achieved the best performance averaged on all 5mers across all datasets…”

      (20) Page 16, “Since the peaks (representing modified and unmodified states) are separable for only a subset of 5mers, SegPore can provide modification parameters for these specific 5mers. For other 5mers, modification state predictions are unavailable.”

      Can this be improved using some heuristics rather than the 'distance of 5' cutoff as described before? How small or big is this subset, compared to how many there should be to cover all cases?

      We agree that more sophisticated strategies could potentially improve performance. In this study, we adopted a relatively conservative approach to minimize false positives by using a heuristic cutoff of 5 picoamperes. This value was selected empirically and we did not explore alternative cutoffs. Future work could investigate more refined or data-driven thresholding strategies.

      (21) Page 16, “Tombo used the "resquiggle" method to segment the raw signals, and we standardized the segments using the polyA tail to ensure a fair comparison.”

      I don't know what or how something is "standardized" here.

      ‘Standardized’ refers to the poly(A) tail–based signal normalization described in our response to point 6. We applied this normalization to Tombo’s output to ensure a fair comparison across methods. Without this standardization, Tombo’s performance was notably worse. We revised the sentence as follows:

      “Tombo used the "resquiggle" method to segment the raw signals, and we standardized the segments using the poly(A) tail to ensure a fair comparison (See preprocessing section in Materials and Methods).”

      (22) Page 16, “To benchmark segmentation performance, we used two key metrics: (1) the log-likelihood of the segment mean, which measures how closely the segment matches ONT's 5mer parameter table (used as ground truth), and (2) the standard deviation (std) of the segment, where a lower std indicates reduced noise and better segmentation quality. If the raw signal segment aligns correctly with the corresponding 5mer, its mean should closely match ONT's reference, yielding a high log-likelihood. A lower std of the segment reflects less noise and better performance overall.”

      Here the segmentation part becomes a bit odd:

      A: Low std can be/is achieved by dropping any noisy bits, making segments really small (partly what happens here with the transition segments). This may be 'true' here, in the sense that the transition is not really part of the segment, but the comparison table is a bit meaningless as the other tools forcibly assign all data to kmers, instead of ignoring parts as transition states. In other words, it is a benchmark that is easy to cheat by assigning more data to noise/transition states.

      B: The values shown are influenced by the alignment made between the read and expected reference signal. Especially Tombo tends to forcibly assign data to whatever looks the most similar nearby rather than providing the correct alignment. So the "benchmark of the segmentation performance" is more of an "overall benchmark of the raw signal alignment". Which is still a good, useful thing, but the text seems to suggest something else.

      Thank you for raising these important concerns regarding the segmentation benchmarking.

      Regarding point A, the base blocks aligned to the same 5mer are concatenated into a single segment, including the short transition blocks between them. These transition blocks are typically very short (4~10 signal points, average 6 points), while a typical 5mer segment contains around 20~60 signal points. To assess whether SegPore’s performance is inflated by excluding transition segments, we conducted an additional comparison: we removed 6 boundary signal points (3 from the start and 3 from the end) from each 5mer segment in Nanopolish and Tombo’s results to reduce potential noise. The new comparison table is shown in the following:

      SegPore consistently demonstrates superior performance. Its key contribution lies in its ability to recognize structured noise in the raw signal and to derive more accurate mean and standard deviation values that more faithfully represent the true state of the k-mer in the pore. The improved mean estimates are evidenced by the clearly separated peaks of modified and unmodified 5mers in Figures 3A and 4B, while the improved standard deviation is reflected in the segmentation benchmark experiments.
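The two benchmark metrics and the boundary-trimming control described in this response can be sketched as follows. This is an illustration under stated assumptions, not the benchmark code: it assumes the log-likelihood is that of the segment mean under a Gaussian with the reference 5mer's mean and standard deviation, and the reference standard deviation of 2.0 pA and the toy signal are made up.

```python
import math
import statistics


def gaussian_log_likelihood(x, mu, sigma):
    """Log density of x under a normal distribution N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def segment_metrics(signal, ref_mean, ref_std, trim=0):
    """Return (log-likelihood of the segment mean, segment std).

    `trim` signal points are dropped from each end before the statistics
    are computed, mimicking the removal of boundary points.
    """
    if len(signal) <= 2 * trim:
        raise ValueError("segment too short for the requested trim")
    core = signal[trim:len(signal) - trim]
    seg_mean = statistics.fmean(core)
    seg_std = statistics.pstdev(core)
    return gaussian_log_likelihood(seg_mean, ref_mean, ref_std), seg_std


# Toy segment: noisy boundary points around a stable plateau.
segment = [110.0, 118.0, 121.0] + [123.0, 124.0] * 10 + [126.0, 130.0, 135.0]
ll_raw, std_raw = segment_metrics(segment, 123.83, 2.0)
ll_trim, std_trim = segment_metrics(segment, 123.83, 2.0, trim=3)
print(std_raw, std_trim)
```

Trimming three points from each end lowers the standard deviation of this toy segment, which is the effect the additional comparison was designed to control for.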

      Regarding point B, we apologize for the confusion. We have added a new paragraph to the introduction to clarify that the segmentation task indeed includes the alignment step.

      “The general workflow of Nanopore direct RNA sequencing (DRS) data analysis is as follows. First, the raw electrical signal from a read is basecalled using tools such as Guppy or Dorado, which produce the nucleotide sequence of the RNA molecule. However, these basecalled sequences do not include the precise start and end positions of each ribonucleotide (or k-mer) in the signal. Because basecalling errors are common, the sequences are typically mapped to a reference genome or transcriptome using minimap2 to recover the correct reference sequence. Next, tools such as Nanopolish and Tombo align the raw signal to the reference sequence to determine which portion of the signal corresponds to each k-mer. We define this process as the segmentation task, referred to as "eventalign" in Nanopolish. Based on this alignment, Nanopolish extracts various features—such as the start and end positions, mean, and standard deviation of the signal segment corresponding to a k-mer. This signal segment or its derived features is referred to as an "event" in Nanopolish. The resulting events serve as input for downstream RNA modification detection tools such as m6Anet and CHEUI.”

      (23) Page 17 “Given the comparable methods and input data requirements, we benchmarked SegPore against several baseline tools, including Tombo, MINES (26), Nanom6A (27), m6Anet, Epinano (28), and CHEUI (29).”

      It seems m6Anet is actually Nanopolish+m6Anet in Figure 3C, this needs a minor clarification here.

      m6Anet uses Nanopolish’s estimated events as input by default.

      (24) Page 18, Figure 3, A and B are figures without any indication of what is on the axis and from the text I believe the position next to each other on the x-axis rather than overlapping is meaningless, while their spread is relevant, as we're looking at the distribution of raw values for this 5mer. The figure as is is rather confusing.

      Thanks for pointing out the confusion. We have added concrete values to the axes in Figures 3A and 3B and revised the figure legend as follows in the manuscript:

      “(A) Histogram of the estimated mean from current signals mapped to an example m6A-modified genomic location (chr10:128548315, GGACT) across all reads in the training data, comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA).

      (B) Histogram of the estimated mean from current signals mapped to the GGACT motif at all annotated m6A-modified genomic locations in the training data, again comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA).”

      (25) Page 18 “SegPore's results show a more pronounced bimodal distribution in the raw signal segment mean, indicating clearer separation of modified and unmodified signals.”

      Without knowing the correct values around the target kmer (like Figure 4B), just the more defined bimodal distribution could also indicate the (wrongful) assignment of neighbouring kmer values to this kmer instead, hence this statement lacks some needed support, this is just one interpretation of the possible reasons.

      Thank you for the comment. We have added concrete values to Figures 3A and 3B to support this point. Both peaks fall within a reasonable range: the unmodified peak (125 pA) is approximately 1.17 pA away from its reference value of 123.83 pA, and the modified peak (118 pA) is around 7 pA away from the unmodified peak. This shift is consistent with expected signal changes due to RNA modifications (usually less than 10 pA), and the magnitude of the difference suggests that the observed bimodality is more likely caused by true modification events rather than misalignment.

      (26) Page 18 “Furthermore, when pooling all reads mapped to m6A-modified locations at the GGACT motif, SegPore showed prominent peaks (Fig. 3B), suggesting reduced noise and improved modification detection.”

      I don't think the prominent peaks directly suggest improved detection, this statement is a tad overreaching.

      We revised the sentence to the following:

      “SegPore exhibited more distinct peaks (Fig. 3B), indicating reduced noise and potentially enabling more reliable modification detection”.

      (27) Page18 “(2) direct m6A predictions from SegPore's Gaussian Mixture Model (GMM), which is limited to the six selected 5mers.”

      The 'six selected' refers to what exactly? Also, 'why' this is limited to them is also unclear as it is, and it probably would become clearer if it is clearly defined what this refers to.

      This is explained on page 16 of the original manuscript, in the description of SegPore’s workflow, as follows:

      “A key component of SegPore is the 5mer parameter table, which specifies the mean and standard deviation for each 5mer in both modified and unmodified states (Fig. 1A). Since the peaks (representing modified and unmodified states) are separable for only a subset of 5mers, SegPore can provide modification parameters for these specific 5mers. For other 5mers, modification state predictions are unavailable.”

      We select a small set of 5mers that show clear peaks (modified and unmodified 5mers) in the GMM in the m6A site-level data analysis. These 5mers are provided in Supplementary Fig. S2C, as explained in the section “m6A site level benchmark” in the Materials and Methods (page 12 in the original manuscript).

      “…transcript locations into genomic coordinates. It is important to note that the 5mer parameter table was not re-estimated for the test data. Instead, modification states for each read were directly estimated using the fixed 5mer parameter table. Due to the differences between human (Supplementary Fig. S2A) and mouse (Supplementary Fig. S2B), only six 5mers were found to have m6A annotations in the test data’s ground truth (Supplementary Fig. S2C). For a genomic location to be identified as a true m6A modification site, it had to correspond to one of these six common 5mers and have a read coverage of greater than 20. SegPore derived the ROC and PR curves for benchmarking based on the modification rate at each genomic location….”

      We have updated the sentence as follows to increase clarity:

      “which is limited to the six selected 5mers that exhibit clearly separable modified and unmodified components in the GMM (see Materials and Methods for details).”
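To make the site-level procedure concrete, the sketch below estimates a per-read modification state from a fixed two-component parameter entry and then derives the modification rate at a site. It is an assumption-laden illustration rather than SegPore's algorithm: the equal component prior, the 0.5 posterior cutoff, the 2.0 pA standard deviations, and all function names are hypothetical. Only the coverage requirement (greater than 20 reads) and the two peak positions (125 pA and 118 pA) are taken from this response.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of x under a normal distribution N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def modified_posterior(event_mean, unmod, mod, prior_mod=0.5):
    """Posterior probability of the modified component for one read.

    `unmod` and `mod` are (mean, std) pairs from a fixed parameter table.
    """
    p_mod = prior_mod * gaussian_pdf(event_mean, *mod)
    p_unmod = (1.0 - prior_mod) * gaussian_pdf(event_mean, *unmod)
    total = p_mod + p_unmod
    return p_mod / total if total > 0 else 0.5


def site_modification_rate(event_means, unmod, mod, min_coverage=20):
    """Fraction of reads called modified, or None if coverage is too low."""
    if len(event_means) <= min_coverage:
        return None
    calls = [modified_posterior(m, unmod, mod) > 0.5 for m in event_means]
    return sum(calls) / len(calls)


# 24 reads at one site: 6 near the modified peak, 18 near the unmodified peak.
means = [118.0] * 6 + [125.0] * 18
rate = site_modification_rate(means, unmod=(125.0, 2.0), mod=(118.0, 2.0))
print(rate)
```

Because the parameter entry is held fixed, nothing is re-estimated from the test reads, which mirrors the statement that the table was not updated on the test data.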

      (28) Page 19, Figure 4C, the blue 'Unmapped' needs further explanation. If this means the segmentation+alignment resulted in simply not assigning any segment to a kmer, this would indicate issues in the resulting mapping between raw data and kmers as the data that probably belonged to this kmer is likely mapped to a neighbouring kmer, possibly introducing a bimodal distribution there.

      This is due to a deletion event in the full alignment algorithm. See page 8 of Supplementary Note 1:

      During the traceback step of the dynamic programming matrix, not every 5mer in the reference sequence is assigned a corresponding raw signal fragment—particularly when the signal’s mean deviates substantially from the expected mean of that 5mer. In such cases, the algorithm considers the segment to be generated by an unknown 5mer, and the corresponding reference 5mer is marked as unmapped.

      (29) Page 19, “For six selected m6A motifs, SegPore achieved an ROC AUC of 82.7% and a PR AUC of 38.7%, earning the third-best performance compared with deep learning methods m6Anet and CHEUI (Fig. 3D).”

      How was this selection of motifs made, are these related to the six 5mers in the middle of Supplementary Figure S2? Are these the same six as on page 18? This is not clear to me.

      It is the same, see the response to point 27.

      (30) Page 21 “Biclustering reveals that modifications at the 6th, 7th, and 8th genomic locations are specific to certain clusters of reads (clusters 4, 5, and 6), while the first five genomic locations show similar modification patterns across all reads.”

      This reads rather confusingly. Both the '6th, 7th, and 8th genomic locations' and 'clusters 4,5,6' should be referred to in clearer terms. Either mark them in the figure as such or name them in the text by something that directly matches the text in the figure.

      We have added labels to the clusters and genomic locations in Figure 4C, and revised the sentence as follows:

      “Biclustering reveals that modifications at g6 are specific to cluster C4, g7 to cluster C5, and g8 to cluster C6, while the first five genomic locations (g1 to g5) show similar modification patterns across all reads.”

      (31) Page 21, “We developed a segmentation algorithm that leverages the jiggling property in the physical process of DRS, resulting in cleaner current signals for m6A identification at both the site and single-molecule levels.”

      Leverages, or just 'takes into account'?

      We designed our HHMM specifically based on the jiggling hypothesis, so we believe that using the term “leverage” is appropriate.

      (32) Page 21, “Our results show that m6Anet achieves superior performance, driven by SegPore's enhanced segmentation.”

      Superior in what way? It barely improves over Nanopolish in Figure 3C and is outperformed by other methods in Figure 3D. The segmentation may have improved but this statement says something is 'superior' driven by that 'enhanced segmentation', so that cannot refer to the segmentation itself.

      We have revised the sentence as follows in the revised manuscript:

      “Our results demonstrate that SegPore’s segmentation enables clear differentiation between m6A-modified and unmodified adenosines.”

      (33) Page 21, “In SegPore, we assume a drastic change between two consecutive 5mers, which may hold for 5mers with large difference in their current baselines but may not hold for those with small difference.”

      The implications of this assumption don't seem highlighted enough in the work itself and may be cause for falsely discovering bi-modal distributions. What happens if such a 5mer isn't properly split, is there no recovery algorithm later on to resolve these cases?

      We agree that there is a risk of misalignment, which can result in a falsely observed bimodal distribution. This is a known and largely unavoidable issue across all methods, including deep neural network–based methods. For example, many of these models rely on a CTC (Connectionist Temporal Classification) layer, which implicitly performs alignment and may also suffer from similar issues.

      Misalignment is more likely when the current baselines of neighboring k-mers are close. In such cases, the model may struggle to confidently distinguish between adjacent k-mers, increasing the chance that signals from neighboring k-mers are incorrectly assigned. Accurate baseline estimation for each k-mer is therefore critical—when baselines are accurate, the correct alignment typically corresponds to the maximum likelihood.

      We have added the following sentence to the discussion to acknowledge this limitation:

      “As with other RNA modification estimation methods, SegPore can be affected by misalignment errors, particularly when the baseline signals of adjacent k-mers are similar. These cases may lead to spurious bimodal signal distributions and require careful interpretation.”

      (34) Page 21, “Currently, SegPore models only the modification state of the central nucleotide within the 5mer. However, modifications at other positions may also affect the signal, as shown in Figure 4B. Therefore, introducing multiple states to the 5mer could help to improve the performance of the model.”

      The meaning of this statement is unclear to me. Is SegPore unable to combine the information of overlapping kmers around a possibly modified base (central nucleotide), or is this referring to having multiple possible modifications in a single kmer (multiple states)?

      We mean there can be modifications at multiple positions of a single 5mer, e.g. C m5C m6A m7G T. We have revised the sentence to:

      “Therefore, introducing multiple states for a 5mer to account for modifications at multiple positions within the same 5mer could help to improve the performance of the model.”

      (35) Page 22, “This causes a problem when apply DNN-based methods to new dataset without short read sequencing-based ground truth. Human could not confidently judge if a predicted m6A modification is a real m6A modification.”

      Grammatical errors in both these sentences. For the 'Human could not' part, is this referring to a single person's attempt or more extensively tested?

      Thanks for the comment. We have revised the sentence as follows:

      “This poses a challenge when applying DNN-based methods to new datasets without short-read sequencing-based ground truth. In such cases, it is difficult for researchers to confidently determine whether a predicted m6A modification is genuine (see Supplementary Figure S5).”

      (36) Page 22, “…which is easier for human to interpret if a predicted m6A site is real.”

      "a" human, but also this probably meant to say 'whether' instead of 'if', or 'makes it easier'.

      Thanks for the advice. We have revised the sentence as follows:

      “One can generally observe a clear difference in the intensity levels between 5mers with an m6A and those with a normal adenosine, which makes it easier for a researcher to interpret whether a predicted m6A site is genuine.”

      (37) Page 22, “…and noise reduction through its GMM-based approach…”

      Is the GMM providing noise reduction or segmentation?

      Yes, we agree that it is not relevant. We have removed the following sentence from the revised manuscript:

      “Although SegPore provides clear interpretability and noise reduction through its GMM-based approach, there is potential to explore DNN-based models that can directly leverage SegPore's segmentation results.”

      (38) Page 23, “SegPore effectively reduces noise in the raw signal, leading to improved m6A identification at both site and single-molecule levels…”

      Without further explanation in what sense this is meant, 'reduces noise' seems to overreach the abilities, and looks more like 'masking out'.

      Following the reviewer’s suggestion, we have changed it to ‘masks out’ in the revised manuscript:

      “SegPore effectively masks out noise in the raw signal, leading to improved m6A identification at both site and single-molecule levels.”

      Reviewer #3 (Recommendations for the authors):

      I recommend the publication of this manuscript, provided that the following comments (and the comments above) are addressed.

      In general, the authors state that SegPore represents an improvement on existing software. These statements are largely unquantified, which erodes their credibility. I have specified several of these in the Minor comments section.

      Page 5, Preprocessing: The authors comment that the poly(A) tail provides a stable reference that is crucial for the normalisation of all reads. How would this step handle reads that have variable poly(A) tail lengths? Or have interrupted poly(A) tails (e.g. in the case of mRNA vaccines that employ a linker sequence)?

      We apologize for the confusion. The poly(A) tail–based normalization is explained in Supplementary Note 1, Section 3.

      As shown in Author response image 1 below, the poly(A) tail produces a characteristic signal pattern—a relatively flat, squiggly horizontal line. Due to variability between nanopores, raw current signals often exhibit baseline shifts and scaling of standard deviations. This means that the signal may be shifted up or down along the y-axis and stretched or compressed in scale.

      Author response image 1.

      The normalization remains robust with variable poly(A) tail lengths, as long as the poly(A) region is sufficiently long. The linker sequence will be assigned to the adapter part rather than the poly(A) part.

      To improve clarity in the revised manuscript, we have added the following explanation:

      “Due to inherent variability between nanopores in the sequencing device, the baseline levels and standard deviations of k-mer signals can differ across reads, even for the same transcript. To standardize the signal for downstream analyses, we extract the raw current signal segments corresponding to the poly(A) tail of each read. Since the poly(A) tail provides a stable reference, we normalize the raw current signals across reads, ensuring that the mean and standard deviation of the poly(A) tail are consistent across all reads. This step is crucial for reducing…..”

      We chose to use the poly(A) tail for normalization because it is sequence-invariant—i.e., all poly(A) tails consist of identical k-mers, unlike transcript sequences which vary in composition. In contrast, using the transcript region for normalization can introduce biases: for instance, reads with more diverse k-mers (having inherently broader signal distributions) would be forced to match the variance of reads with more uniform k-mers, potentially distorting the baseline across k-mers.
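The shift-and-scale idea behind this normalization can be sketched in a few lines. This is a minimal illustration, assuming a linear transform that maps the poly(A) segment of each read onto a common target mean and standard deviation; the target values, the index-based poly(A) boundaries, and the function name are hypothetical, and the actual procedure is the one in Supplementary Note 1, Section 3.

```python
import statistics


def normalize_by_polya(signal, polya_start, polya_end, target_mean, target_std):
    """Shift and scale a read so its poly(A) segment matches the targets.

    `signal[polya_start:polya_end]` is taken to be the poly(A) region,
    for example as delineated by an external tool.
    """
    polya = signal[polya_start:polya_end]
    mu = statistics.fmean(polya)
    sd = statistics.pstdev(polya)
    if sd == 0:
        raise ValueError("poly(A) segment has zero variance")
    scale = target_std / sd
    # The same transform is applied to the whole read, not only the tail.
    return [(x - mu) * scale + target_mean for x in signal]


# Toy read: four poly(A) points followed by two transcript points.
read = [90.0, 92.0, 90.0, 92.0, 60.0, 80.0]
normalized = normalize_by_polya(read, 0, 4, target_mean=100.0, target_std=2.0)
print(normalized)
```

Because every poly(A) tail consists of the same k-mer, the estimated shift and scale do not depend on transcript composition, which is the argument made above for preferring the tail over the transcript region.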

      Page 7, 5mer parameter table: r9.4_180mv_70bps_5mer_RNA is an older kmer model (>2 years). How does your method perform with the newer RNA kmer models that do permit the detection of multiple ribonucleotide modifications? Addressing this comment is crucial because it is feasible that SegPore will underperform in comparison to the newer RNA base caller models (requiring the use of RNA004 datasets).

      Thank you for highlighting this important point. For RNA004, we have updated SegPore to ensure compatibility with the latest kit. In our revised manuscript, we demonstrate that the translocation-based segmentation hypothesis remains valid for RNA004, as supported by new analyses presented in the supplementary Figure S4.

      Additionally, we performed a new benchmark against f5c and Uncalled4 on RNA004 data in the revised manuscript (Table 2), where SegPore performs better than both tools.

      We agree that benchmarking against the latest Dorado models—specifically rna004_130bps_hac@v5.1.0 and rna004_130bps_sup@v5.1.0, which include built-in modification detection capabilities—would provide valuable context for evaluating the utility of SegPore. However, generating a comprehensive k-mer parameter table for RNA004 requires a large, well-characterized dataset. At present, such data are limited in the public domain. Additionally, Dorado is developed by ONT and its internal training data have not been released, making direct comparisons difficult.

      Our current focus is on improving raw signal segmentation quality, an upstream task critical to many downstream analyses, including RNA modification detection. Future work may include benchmarking SegPore against models like Dorado once appropriate data become available.

      The Methods and Results sections contain redundant information - please streamline the information in these sections and reduce the redundancy. For example, the benchmarking section may be better situated in the Results section.

      Following your advice, we have removed the redundant text about the segmentation benchmark from Materials and Methods in the revised manuscript.

      Minor comments

      (1) Introduction

      Page 3: "By incorporating these dynamics into its segmentation algorithm...". Please provide an example of how motor protein dynamics can impact RNA translocation. In particular, please elaborate on why motor protein dynamics would impact the translocation of modified ribonucleotides differently to canonical ribonucleotides. This is provided in the results, but please also include details in the Introduction.

      Following your advice, we added one sentence to the revised manuscript explaining how the motor protein affects the translocation of the DNA/RNA molecule.

      “This observation is also supported by previous reports, in which the helicase (the motor protein) translocates the DNA strand through the nanopore in a back-and-forth manner. Depending on ATP or ADP binding, the motor protein may translocate the DNA/RNA forward or backward by 0.5-1 nucleotides.”

      As far as we understand, this translocation mechanism is not specific to modified or unmodified nucleotides. For further details, we refer the reviewer to the original studies cited.

      Page 3: "This lack of interpretability can be problematic when applying these methods to new datasets, as researchers may struggle to trust the predictions without a clear understanding of how the results were generated." Please provide details and citations as to why researchers would struggle to trust the predictions of m6Anet. Is it due to a lack of understanding of how the method works, or an empirically demonstrated lack of reliability?

      Thank you for pointing this out. The lack of interpretability in deep learning models such as m6Anet stems primarily from their “black-box” nature—they provide binary predictions (modified or unmodified) without offering clear reasoning or evidence for each call.

      When we examined the corresponding raw signals, we found it difficult to visually distinguish whether a signal segment originated from a modified or unmodified ribonucleotide. The difference is often too subtle to be judged reliably by a human observer. This is illustrated in the newly added Supplementary Figure S5, which shows Nanopolish-aligned raw signals for the central 5mer GGACT in Figure 4B, displayed both uncolored and colored by modification state (according to the ground truth).

      Although deep neural networks can learn subtle, high-dimensional patterns in the signal that may not be readily interpretable, this opacity makes it difficult for researchers to trust the predictions—especially in new datasets where no ground truth is available. The issue is not necessarily an empirically demonstrated lack of reliability, but rather a lack of transparency and interpretability.

      We have updated the manuscript accordingly and included Supplementary Figure S5 to illustrate the difficulty in interpreting signal differences between modified and unmodified states.

      Page 3: "Instead of relying on complex, opaque features...". Please provide evidence that the research community finds the figures generated by m6Anet to be difficult to interpret, or delete the sections relating to its perceived lack of usability.

      See the figure provided in the response to the previous point. We added a reference to this figure in the revised manuscript.

      “Instead of relying on complex, opaque features (see Supplementary Figure S5), SegPore leverages baseline current levels to distinguish between…..”

      (2) Materials and Methods

      Page 5, Preprocessing: "We begin by performing basecalling on the input fast5 file using Guppy, which converts the raw signal data into base sequences.". Please change "base" to ribonucleotide.

      Revised as requested.

      Page 5 and throughout, please refer to poly(A) tail, rather than polyA tail throughout.

      Revised as requested.

      Page 5, Signal segmentation via hierarchical Hidden Markov model: "...providing more precise estimates of the mean and variance for each base block, which are crucial for downstream analyses such as RNA modification prediction." Please specify which method your HHMM method improves upon.

      Thank you for the suggestion. Since this section does not include a direct comparison, we revised the sentence to avoid unsupported claims. The updated sentence now reads:

      "...providing more precise estimates of the mean and variance for each base block, which are crucial for downstream analyses such as RNA modification prediction."

      Page 10, GMM for 5mer parameter table re-estimation: "Typically, the process is repeated three to five times until the 5mer parameter table stabilizes." How is the stabilisation of the 5mer parameter table quantified? What is a reasonable cut-off that would demonstrate adequate stabilisation of the 5mer parameter table?

      Thank you for the comment. We assess the stabilization of the 5mer parameter table by monitoring the change in baseline values across iterations. If the absolute change in baseline values for all 5mers is less than 1e-5 between two consecutive iterations, we consider the estimation to have stabilized.
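      The stopping rule described above can be sketched as follows. Only the 1e-5 threshold on the absolute change in baseline values comes from the response; the function names, the dictionary layout of the 5mer table, the iteration cap, and the toy re-estimation step are hypothetical stand-ins for the real GMM update.

```python
def has_stabilized(prev_table, curr_table, tol=1e-5):
    """True if every 5mer baseline changed by less than tol between iterations."""
    if prev_table.keys() != curr_table.keys():
        raise ValueError("tables must cover the same 5mers")
    return all(abs(curr_table[k] - prev_table[k]) < tol for k in curr_table)

def iterate_until_stable(table, reestimate, tol=1e-5, max_iter=20):
    """Repeat re-estimation until the table stabilizes.

    Returns the final table and the number of iterations used.
    max_iter is an assumed safety cap, not a value from the manuscript.
    """
    for n in range(1, max_iter + 1):
        new_table = reestimate(table)
        if has_stabilized(table, new_table, tol):
            return new_table, n
        table = new_table
    raise RuntimeError("5mer table did not stabilize within max_iter iterations")

# Toy stand-in for the re-estimation step: each baseline moves 90% of the way
# toward a fixed target. The numeric values are made up for illustration.
target = {"GGACT": 123.5, "AAACA": 108.2}

def toy_reestimate(table):
    return {k: v + 0.9 * (target[k] - v) for k, v in table.items()}

final_table, n_iter = iterate_until_stable(
    {"GGACT": 120.0, "AAACA": 110.0}, toy_reestimate
)
```

      Requiring the condition to hold for all 5mers, rather than on average, means that a single slowly converging 5mer is enough to trigger another iteration.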

      Page 11, M6A site level benchmark: why were these datasets selected? Specifically, why compare human and mouse ribonuclotide modification profiles? Please provide a justification and a brief description of the experiments that these data were derived from, and why they are appropriate for benchmarking SegPore.

      Thank you for the comment. The datasets used were taken from a previous benchmark study on m6A estimation using RNA002 data (https://doi.org/10.1038/s41467-023-37596-5). These datasets include human and mouse transcriptomes and have been widely used to evaluate the performance of RNA modification detection tools. We selected them because (i) they are based on RNA002 chemistry, which matches the primary focus of our study, and (ii) they provide a well-characterized and consistent benchmark for assessing m6A detection performance. Therefore, we believe they are appropriate for validating SegPore.

      (3) Results

      Page 13, RNA translocation hypothesis: "The raw current signals, as shown in Fig. 1B...". Please check/correct figure reference - Figure 1B does not show raw current signals.

      Thank you for pointing this out. The correct reference should be Figure 2B. We have updated the figure citation accordingly in the revised manuscript.

      Page 19, m6A identification at the site level: "For six selected m6A motifs, SegPore achieved an ROC AUC of 82.7% and a PR AUC of 38.7%, earning the third best performance compared with deep leaning methods m6Anet and CHEUI (Fig. 3D)." SegPore performs third best of all deep learning methods. Do the authors recommend its use in conjunction with m6Anet for m6A detection? Please clarify in the text.

      This sentence aims to convey that SegPore alone can already achieve good performance. If interpretability is the primary goal, we recommend using SegPore on its own. However, if the objective is to identify more potential m6A sites, we suggest using the combined approach of SegPore and m6Anet. That said, we have chosen not to make explicit recommendations in the main text to avoid oversimplifying the decision or potentially misleading readers.

      Page 19, m6A identification at the single molecule level: "one transcribed with m6A and the other with normal adenosine". I assume that this should be adenine? Please replace adenosine with adenine throughout.

      Thank you for pointing this out. We have revised the sentence to use "adenine" where appropriate. In other instances, we retain "adenosine" when referring specifically to adenine bound to a ribose sugar, which we believe is suitable in those contexts.

      Page 19, m6A identification at the single molecule level: "We used 60% of the data for training and 40% for testing". How many reads were used for training and how many for testing? Please comment on why these are appropriate sizes for training and testing datasets.

      In total, there are 1.9 million reads, with 1.14 million used for training and 0.76 million for testing (60% and 40%, respectively). We chose this split to ensure that the training set is sufficiently large to reliably estimate model parameters, while the test set remains substantial enough to robustly evaluate model performance. Although the ratio was selected somewhat arbitrarily, it balances the need for effective training with rigorous validation.
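      A 60/40 split of this kind could be produced along the following lines. The response does not say how reads were assigned to the two sets, so the seeded random shuffle, the function name, and the toy read identifiers are assumptions of this sketch.

```python
import random

def split_reads(read_ids, train_fraction=0.6, seed=0):
    """Shuffle read IDs reproducibly and split them into train and test sets.

    A seeded random shuffle is an assumed strategy, not the authors' stated one.
    """
    ids = list(read_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# Toy example with 1,000 read IDs instead of the 1.9 million in the study.
train_ids, test_ids = split_reads(range(1000))
```

      Fixing the seed keeps the split reproducible, so the same reads end up in the test set each time the evaluation is rerun.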

      (4) Discussion

      Page 21: "We believe that the de-noised current signals will be beneficial for other downstream tasks." Which tasks? Please list an example.

      We have revised the text for clarity as follows:

      “We believe that the de-noised current signals will be beneficial for other downstream tasks, such as the estimation of m5C, pseudouridine, and other RNA modifications.”

      Page 22: "One can generally observe a clear difference in the intensity levels between 5mers with a m6A and normal adenosine, which is easier for human to interpret if a predicted m6A site is real." This statement is vague and requires qualification. Please reference a study that demonstrates the human ability to interpret two similar graphs, and demonstrate how it relates to the differences observed in your data.

      We apologize for the confusion. We have revised the sentence as follows:

      “One can generally observe a clear difference in the intensity levels between 5mers with an m6A and those with a normal adenosine, which makes it easier for a researcher to interpret whether a predicted m6A site is genuine.”

      We believe that Figures 3A, 3B, and 4B effectively illustrate this concept.

      Page 23: How long does SegPore take for its analyses compared to other similar tools? How long would it take to analyse a typical dataset?

      We have added run-time statistics for datasets of varying sizes in the revised manuscript (see Supplementary Figure S6). This figure illustrates SegPore’s performance across different data volumes to help estimate typical processing times.

      (5) Figures

      Figure 4C. Please number the hierarchical clusters and genomic locations in this figure. They are referenced in the text.

      Following your suggestion, we have labeled the hierarchical clusters and genomic locations in Figure 4C in the revised manuscript.

      In addition, we revised the corresponding sentence in the main text as follows: “Biclustering reveals that modifications at g6 are specific to cluster C4, g7 to cluster C5, and g8 to cluster C6, while the first five genomic locations (g1 to g5) show similar modification patterns across all reads.”

    1. Author response:

      The following is the authors’ response to the original reviews

      Recommendations for the Authors:

      Reviewer #1:

      We think that this manuscript brings an important contribution that will be of interest in the areas of statistical physicists, (microbiota) ecology, and (biological) data science. The evidence of their results is solid and the work improves the state-of-the-art in terms of methods. We have a few concerns that, in our opinion, the authors should address.

      Major concerns:

      (1) While the paper could be of interest for the broad audience of e-Life, the way it is written is accessible mainly to physicists. We encourage the authors to take the broad audience into account by i) explaining better the essence of what is being done at each step, ii) highlighting the relevance of the method compared to other methods, iii) discussing the ecological implications of the results.

      Examples on how to approach i) include: Modify or expand Figure 1 so that non-familiar readers can understand the summary of the work (e.g. with cartoons representing communities, diseased states and bacterial interactions and their relationship with the inference method); in each section, summarize at the beginning the purpose of what is going to be addressed in this section, and summarize at the end what the section has achieved; in Figure 2, replace symbols by their meaning as much as possible-the same for Figure 1, at the very least in the figure caption.

      Example on how to approach ii): Since the authors aim to establish a bridge between disordered systems and microbiome ecology, it could be useful to expand a bit the introduction on disordered systems for biologists/biophysicists. This could be done with an additional text box, which could also highlight the advantages of this approach in comparison to other techniques (e.g. model-free approaches can also classify healthy and diseased states).

      Example on how to approach iii): The authors could discuss with more depth the ecological implications of their results. For example, do they have a hypothesis on why demographic and neutral effects could dominate in healthy patients?

      We thank the reviewer for the observations. Following the suggestion, in the revised version each section outlines at the beginning what will be addressed and summarizes at the end what has been achieved. We also updated Figure 1 and Figure 2.

      (i) For Figure 1, we expanded and hopefully clarified how we conceptualize the problem, use the data, and establish our method. In Figure 2, we enriched the y labels of each panel with the name associated with the order parameter.

      (ii) We thank the reviewer for helping us improve the readability of the introductory part, thus providing more insights into disordered systems techniques for a broader audience. We have added a few explanations at the end of page 2 to explain the advantages of such methodology compared to other strategies and models.

      (iii) We thank the reviewer for raising the need for a more in-depth ecological discussion of our results. A simple way to understand why neutral effects may dominate in healthy patients is the following. Neutrality implies that species differences are mainly shaped by stochastic processes such as demographic noise, with species treated as different realizations of the same underlying stochastic ecological dynamics. In our analysis, we observe that healthy individuals tend to exhibit highly similar microbial communities, suggesting that the compositional variability among their microbiomes is compatible, at least in part, with the fluctuations expected from demographic stochasticity alone. In contrast, patients with the disease display significantly more heterogeneous microbial compositions. The diversity and structure of their gut communities cannot be satisfactorily explained by neutral demographic fluctuations alone.

      This discrepancy implies that additional deterministic forces—such as altered ecological interactions—are driving the divergence observed in dysbiotic states. In diseased individuals, the breakdown of such interactions leads to a structurally distinct regime that may correspond to a phase of marginal stability, as indicated by our theoretical modeling. This shift marks a transition from a community governed by neutrality and demographic noise to one dominated by non-neutral ecological forces (as depicted in Figure 4). We added these comments in the discussion section of the revised manuscript.

      (2) Taking into account the broader audience, we invite the authors to edit the abstract, as it seems to jump from one ecological concept to another without explicitly communicating what is the link between these concepts. From the first two sentences, the motivation seems to be species diversity, but no mention of diversity comes after the second sentence. There is no proper introduction/definition of what macroecological states are. After that, the authors switch to healthy and unhealthy states, without previously introducing any link between gut microbiota states and the host’s health (which perhaps could be good in the first or second sentence, although other framings can be as valid). After that, interactions appear in the text and are related to instability, but the reader might not know whether this is surprising or if healthy/unhealthy states are generally related to stability.

      We pointed out a few examples, but the authors could extend their revision on i), ii) and iii) beyond such specific comments. In our opinion, this would really benefit the paper.

      In response to the reviewer’s concern about conceptual clarity and structure, we substantially revised the abstract to improve its accessibility and logical flow. In the revised abstract, we now clearly link species diversity to microbiome structure and function from the outset, addressing initial confusion. We provide a concise definition of “macroecological states,” framing them as reproducible statistical patterns reflecting community-level properties. Additionally, the revised version explicitly connects gut microbiome states to host health earlier, resolving the previous abrupt shift in focus. Finally, we conclude by highlighting how disordered systems theory advances our understanding of microbiome stability and functioning, reinforcing the novelty and broader significance of our approach. Overall, the revised abstract better serves a broad interdisciplinary audience, including readers unfamiliar with the technicalities of disordered systems or microbial ecology, while preserving the scientific depth and accuracy of our work.

      (3) The connection with consumer-resource (CR) models is quite unusual. In Equation (12), why do the authors assume that the consumption term does not depend on R? This should be addressed, since this term is usually dependent on R in microbial ecology models.

      In case this is helpful, it is known that the symmetric Lotka-Volterra model emerges from time-scale separation in the MacArthur model, where resources reproduce logistically and are consumed by other species (e.g., plants eaten by herbivores). Consumer-resource models form a broad category, while the MacArthur model is a specific case featuring logistic resource growth. For microbes, a more meaningful justification of the generalized Lotka-Volterra (GLV) model from a consumer-resource perspective involves the consumer-resource dynamics in a chemostat, where time-scale separation is assumed and higher-order interactions are neglected. See, for example: a) The classic paper by MacArthur: R. MacArthur. Species packing and competitive equilibrium for many species. Theoretical Population Biology, 1(1):1-11, 1970. b) Recent works on time-scale separation in chemostat consumer-resource models: Anna Posfai et al., PRL, 2017 Sireci et al., PNAS, 2023 Akshit Goyal et al., PRX-Life, 2025

      We thank the reviewer for the observation. We apologize for the typo that appeared in the main text, which we promptly corrected. The consumer-resource model we had in mind is the classical case proposed by MacArthur, where resources are self-regulated according to a logistic growth mechanism, which leads to the generalized Lotka-Volterra model we employ in our work.
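      For readers unfamiliar with this reduction, the standard time-scale-separation argument can be sketched as follows. The notation is an assumption of this sketch and need not match the manuscript's Eq. (12): N_i are consumer abundances, R_a resource abundances, c_{ia} consumption rates, w_a resource values, m_i consumer mortality rates, and r_a, K_a the resource growth rates and carrying capacities.

```latex
\begin{align}
% MacArthur consumer-resource dynamics with logistic resource growth
\frac{dN_i}{dt} &= N_i\Big(\sum_a w_a c_{ia} R_a - m_i\Big), \\
\frac{dR_a}{dt} &= r_a R_a\Big(1 - \frac{R_a}{K_a}\Big) - R_a \sum_j c_{ja} N_j . \\
% Fast resources: set dR_a/dt = 0 and keep the non-zero solution
R_a^{*} &= K_a\Big(1 - \frac{1}{r_a}\sum_j c_{ja} N_j\Big), \\
% Substitute R_a^* into the consumer equation
\frac{dN_i}{dt} &= N_i\Big(g_i - \sum_j A_{ij} N_j\Big), \qquad
g_i = \sum_a w_a c_{ia} K_a - m_i, \qquad
A_{ij} = \sum_a \frac{w_a K_a}{r_a}\, c_{ia} c_{ja} = A_{ji}.
\end{align}
```

      The resulting interaction matrix is symmetric by construction, which is why the MacArthur model reduces to a symmetric Lotka-Volterra model, as the reviewer notes. The substitution is valid only while every R_a^{*} stays positive.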

      Minor concerns:

      (1) The title has a nice pun for statistical physicists, but we wonder if it can be a bit confusing for the broader audience of e-Life. Although we leave this to the author’s decision, we’d recommend considering changing the title, making it more explicit in communicating the main contribution/result of the work.

      Following the reviewer’s suggestion, we have introduced an explanatory subtitle: “Linking Species Interactions to Dysbiosis through a Disordered Lotka-Volterra Framework”.

      (2) Review the references - some preprints might have already been published: Pasqualini J. 2023, Sireci 2022, Wu 2021.

      We thank the reviewer for pointing our attention to this inaccuracy. We updated the references to Pasqualini and Sireci papers. To our knowledge, Wu’s paper has appeared as an arXiv preprint only.

      (3) Species do not generally exhibit identical carrying capacities (see Grilli, Nat. Commun., 2020); some taxa are generally more abundant than others. The authors could discuss whether the model, with the inferred parameters, can accurately reproduce the distribution of species’ mean abundances.

      We thank the reviewer for this insightful comment. As discussed in the revised manuscript (lines 294–299), our current model does not accurately reproduce the empirical species abundance distribution (SAD). This limitation stems from the assumption of constant carrying capacities across species. Empirical observations (e.g., Grilli et al., Nat. Commun., 2020 [1]) show heterogeneous mean abundances, often following power-law or log-normal distributions, whereas our model assumes a constant carrying capacity and therefore produces SADs devoid of fat tails, which diverge from empirical data.

      This simplification is implemented to maintain the analytical tractability of the disordered generalized Lotka-Volterra (dGLV) framework, a common approach also found in prior works such as Bunin (2017) and Barbier et al. (2018) [2, 3]. Introducing heterogeneity in carrying capacities, such as drawing them from a log-normal distribution, or switching to multiplicative (rather than demographic) noise, could indeed produce SADs that better align with empirical data. Nevertheless, implementing these changes would significantly complicate the analytical treatment.

      We acknowledge these directions as promising avenues for future research. They could help enhance the empirical realism of the model and its capacity to capture observed macroecological patterns while posing new theoretical challenges for disordered systems analysis.

      (4) A substantial number of cited works (Grilli, Nat. Commun., 2020; Zaoli & Grilli, Science Advances, 2021; Sireci et al., PNAS, 2023; Po-Yi Ho et al., eLife, 2022) suggest that environmental fluctuations play a crucial role in shaping microbiome composition and dynamics. Is the authors’ analysis consistent with this perspective? Do they expect their conclusions to remain robust if environmental fluctuations are introduced?

      We thank the reviewer for stressing this point. The introduction of environmental fluctuations in the model formally violates detailed balance, thereby preventing the definition of an energy function. To date, no study has integrated random interactions together with both demographic and environmental noise within a unified analytical framework. This is certainly a highly promising direction that some of the authors are already exploring. However, given the inherently out-of-equilibrium nature of the system and the absence of a free energy, we would need to adopt a Dynamical Mean-Field Theory formalism and eventually analyze the corresponding stationary equations to be solved self-consistently. We added, however, a brief note in the Discussion section.

      (5) The term “order parameters“ may not be intuitive for a biological audience. In any case, the authors should explicitly define each order parameter when first introduced.

      We thank the reviewer for the comment. We now name each order parameter when it is first introduced, along with a brief explanation of its meaning that should be accessible to an audience with a biological background.

      (6) Line 242: Should ψU be ψD?

      We thank the reviewer for the observation. We corrected the typo.

      (7) Given that the authors are discussing healthy and diseased states and to avoid confusion, the authors could perhaps use another word for ’pathological’ when they refer to dynamical regimes (e.g., in Appendix 2: ’letting the system enter the pathological regime of unbounded growth’).

      We thank the reviewer for the helpful comment. As suggested, we used the term “unphysical” instead of “pathological” where needed.

      Reviewer #2:

      (1) A technical point that I could not understand is how the authors deal with compositional data. One reason for my confusion is that the order parameters h and q0 are fixed in the data to 1/S and 1/S², and thus I do not see how they can be informative. Same for carrying capacity, why is it not 1 if considering relative abundance?

      We thank the reviewer for raising this point. We acknowledge that the treatment of compositional data and the interpretation of order parameters h and q0 were not sufficiently clarified in the manuscript. Additionally, there was an imprecision in the text regarding the interpretation of these parameters.

      As defined in revised Eq. (4) of the manuscript, h and q0 are to be averaged over the entire dataset, summing across samples α. Specifically, h = ⟨1/S<sub>α</sub>⟩<sub>α</sub> and q0 = ⟨1/S<sub>α</sub><sup>2</sup>⟩<sub>α</sub>, where S<sub>α</sub> is the number of species present in sample α and ⟨·⟩<sub>α</sub> is the average over samples. These parameters are therefore informative, as they encapsulate sample-level ecological diversity, and their variation reflects biological differences between healthy and diseased states. For instance, Pasqualini et al., 2024 [4] reported significant differences in these metrics between health conditions, thereby supporting their ecological relevance.

      Regarding carrying capacities, we clarify that although we work with relative abundance data (i.e., compositional data), we do not fix the carrying capacity K to 1. Instead, we set K to the maximum value of xi (relative abundance) within each sample, to preserve compatibility with empirical data and allow for coexistence. While this remains a modeling assumption, it ensures better ecological realism within the constraints of the disordered GLV framework.
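      The two points above can be made concrete with a small sketch. It assumes, following the reviewer's comment, that for compositional data the per-sample values reduce to 1/S and 1/S², so that the dataset-level order parameters are averages of these quantities over samples. It also implements the stated choice of K as the maximum relative abundance within each sample. The function names and the toy abundance tables are hypothetical.

```python
from statistics import fmean

def order_parameters(samples):
    """Dataset-level h and q0 from compositional samples.

    samples: list of dicts mapping species name to relative abundance.
    Assumes the per-sample values are 1/S and 1/S**2, where S is the
    number of species present (abundance > 0) in that sample.
    """
    richness = [sum(1 for x in s.values() if x > 0) for s in samples]
    h = fmean([1.0 / S for S in richness])
    q0 = fmean([1.0 / S**2 for S in richness])
    return h, q0

def carrying_capacity(sample):
    """K for one sample: the maximum relative abundance observed in it."""
    return max(sample.values())

# Toy data: two samples with different species richness.
samples = [
    {"sp1": 0.5, "sp2": 0.5},
    {"sp1": 0.4, "sp2": 0.3, "sp3": 0.2, "sp4": 0.1},
]
h, q0 = order_parameters(samples)
K = [carrying_capacity(s) for s in samples]
```

      Under this assumption h and q0 are fixed within a single sample, yet they still vary across datasets because species richness differs from sample to sample, which is what makes them informative when comparing healthy and diseased cohorts.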

      (2) Obviously I’m missing something, so it would be nice to clarify in simple terms the logic of the argument. I understand that Lagrange multipliers are going to be used in the model analysis, and there are a lot of technical arguments presented in the paper, but I would like a much more intuitive explanation about the way the data can be used to infer order parameters if those are fixed by definition in compositional data.

      We thank the reviewer for the observation. The order parameters can be measured directly from the data, even in the presence of compositionality, as explained above. We can connect those parameters with the theory even for compositional data, because the only effect of adding the compositionality constraint is to shift the linear coefficient in the Hamiltonian, which corresponds to shifting the average interaction µ. However, the resulting phase diagram is mostly affected by the variance of the interactions σ² (as µ is such that we are in the bounded phase).

      (3) Another point that I did not understand comes from the fact that the authors claim that interaction variance is smaller in unhealthy microbiomes. Yet they also find that those are closer to instability, and are more driven by niche processes. I would have expected the opposite to be true, more variance in the interactions leading to instability (as in May’s original paper for instance). Is this apparent paradox explained by covariations in demographic stochasticity (T) and immigration rate (lambda)? If so, I think it would be very useful to comment on that.

      As Altieri and coworkers showed in their PRL (2021) [5], the phase diagram of our model differs fundamentally from that of Biroli et al. (2018) [6]. In the latter, the intuitive rule – greater interaction variance yields greater instability – indeed holds. For the sake of clarity, we have attached below the resulting phase diagram obtained by Altieri et al.

      The apparent paradox arises because the two phase diagrams are tuned by different parameters. Consequently, even at low temperature and with weak interaction variance, our system may sit nearer to the replica-symmetry-breaking (RSB) line.

      Fig. 3 in the main text is not a (σ,T) phase diagram in which all other parameters are kept constant. Rather, it is a plot of the σ and T parameters inferred from the data (without showing the corresponding µ).

      To capture the full, non-trivial influence of all parameters on stability, we studied the so-called “replicon eigenvalue” in the RS (i.e. single equilibrium) approximation. This leading eigenvalue measures how close a given set of inferred parameters – and hence a microbiome – is to the RSB threshold. For a visual representation of these findings, refer to Figure 4.

      Author response image 1.

      (4) What do the empirical SAD look like? It would be nice to see the actual data and how the theoretical SADs compare.

      The empirical species abundance distributions (SADs) analyzed in our study are presented and discussed in detail in Pasqualini et al., 2024 [4]. Given the overlap in content, we chose not to reproduce these figures in the current manuscript to avoid redundancy.

      As we also clarify in the revised text, the theoretical SADs derived from the disordered generalized Lotka-Volterra (dGLV) model in the unique fixed point phase typically exhibit exponential tails. These distributions do not match the heavier-tailed patterns (e.g., log-normal or power-law-like) observed in empirical microbiome data. This discrepancy stems from the simplifying assumptions of the dGLV framework, including the use of constant carrying capacities and demographic noise.

      In the revised manuscript, we have added a brief discussion to explicitly acknowledge this limitation and emphasize it as a direction for future refinement of the model, such as incorporating heterogeneous carrying capacities or exploring alternative noise structures.

      (5) Some typos: often “niche” is written “nice”.

      We thank the reviewer for this suggestion. After inspecting the text, we corrected the reported typos.

      Reviewer #3:

      Major comments:

      (1) In the S3 text, the authors say that filtered metagenomic reads were processed using the software Kaiju. The description of the pipeline does not mention how core genes were selected, which is often a crucial step in determining the abundance of a species in a metagenomic sample. In addition, the senior author of this manuscript has published a version of Kaiju that leverages marker genes classification methods (deemed Core-Kaiju), but it was not used for either this manuscript or Pasqualini et al. (2014; Tovo et al., 2020). I am not suggesting that the data necessarily needs to be reprocessed, but it would be useful to know how core genes were chosen in Pasqualini et al. and why Core-Kaiju was not used (2014).

      Prior to the current manuscript and the PLOS Computational Biology paper by Pasqualini et al. [4], we applied the Core-Kaiju protocol to the same dataset used in both studies. However, this tool was originally developed and validated using general catalogs of culturable organisms and was not specifically tuned for gut microbiomes. As a result, we found that in many samples Core-Kaiju retained only very few species (in some samples, as few as 5–10), undermining the reliability of the analysis. Because of these limitations, we opted to use the standard Kaiju version in our work. We are actively developing an improved version of the Core-Kaiju protocol to overcome these limitations; preliminary results (not shown here) indicate that the observed patterns are robust in this case as well.

      (2) My understanding of Pasqualini et al. was that diseased patients experienced larger fluctuations in abundance, while in this study, they had smaller fluctuations (Figure 3a; 2024). Is this a discrepancy between the two models or is there a more nuanced interpretation?

      We thank the reviewer for the observation. This is only an apparent discrepancy, as the term fluctuation has different meanings in the two contexts. The fluctuations referred to by the reviewer correspond to a parameter of our theory, namely noise in the interactions. Conversely, in Pasqualini et al. σ indicates environmental fluctuations. Nevertheless, there is no conceptual discrepancy in our results: in both studies, unhealthy microbiomes were found to be less stable. In fact, the present study (notably Fig. 4) also shows that unhealthy microbiomes lie closer to the RSB line, a phenomenon that is also associated with enhanced fluctuations.

      (3) Line 38-41: It would be helpful to explicitly state what “interaction patterns” are being referenced here. The final sentence could also be clarified. Do microbiomes “host“ interactions or are they better described as a property (“have”, “harbor”). The word “host” may confuse some readers since it is often used to refer to the human host. I am also not sure what point is being made by “expected to govern natural ones”. There are interactions between members of a microbiome; experimental studies have characterized some of these interactions, which we expect to relate in some way to interactions in nature. Is this what the authors are saying?

      Thanks. We agree that this sentence was not clear. Indeed, we are referring to pairwise species interactions and not to host-microbiome interactions. We have rewritten this part in the following way: In fact, recent work shows that the network-level properties of species-species interactions —for example, the sign balance, average strength, and connectivity of the inferred interaction matrix— shift systematically between healthy and dysbiotic gut communities (see for instance, [7, 8]). Pairwise species interactions have been quantified in simplified in-vitro consortia [9, 10]; we assume that the same classes of interactions also operate—albeit in a more complex form—in the native gut microbiome.

      (4) Line 43: I appreciate that the authors separated neutral vs. logistic models here.

      (5) Lines 51-75: The framing here is well-written and convincing. Network inference is an ongoing, active subject in ecology, and there is an unfortunate focus on inferring every individual interaction because ecologists with biology backgrounds are not trained to think about the problem in the language of statistical physics.

      We thank the reviewer for these positive comments.

      (6) Line 87: Perhaps I’m missing something obvious, but I don’t see how ρi sets the intrinsic timescale of the dynamics when its units are 1/(time*individuals), assuming the dimensions of ri are inverse time.

      We thank the reviewer for the observation. We corrected this phrase in the main text.

      (7) Lines 189-190: “as close as possible to the data” it would aid the reader if you specified the criteria meant by this statement.

      We thank the reviewer for the observation. We removed the sentence, as it introduced some redundancy in our argument. The proposed method is described in detail in the subsequent text.

      (8) Line 198: It would aid the reader if you provided some context for what the T - σ plane represents.

      We thank the referee for the helpful indication. We have clarified the respective roles of the demographic noise amplitude and the strength of the random interaction matrix, as theoretically predicted in the PRL (2021) by Altieri and coworkers [5]. Please find an additional paragraph on page 6 of the resubmitted version.

      (9) Line 217: Specifying what is meant by “internal modes“ would aid the typical life science reader.

      We thank the reviewer for the suggestion. Recognizing that referring to “internal modes” to describe the SAD shape in that context might cause confusion, we replaced “internal modes“ with “peaks”.

      (10) Line 219: Some additional justification and clarification are needed here, as some may think of “m“ as being biomass.

      We added a sentence to better explain this concept. “In classical and quantum field theory, the particle-particle interaction embedded in the quadratic term is typically referred to as a mass source. In the context of this study, m captures quadratic fluctuations of species abundances, as also appearing in the expression of the leading eigenvalue of the stability matrix.”

      Minor comments:

      (1) I commend the authors for removing metagenomic reads that mapped to the human genome in the preprocessing stage of their pipeline. This may seem like an obvious pre-processing step, but it is unfortunately not always implemented.

      We thank the referee for pointing out this potential issue. The data used in this work, as well as the bioinformatic workflow used to generate them, have been described in detail in Pasqualini et al., 2024 [4]. As one of the main preprocessing steps, we remove reads mapping to the human genome.

      (2) Line 13: “Bacterial“ excludes archaea, and while you may not have many high-abundance archaea in your human gut data, this sentence does not specify the human gut. Usually, this exclusion is averted via the term “microbial“, though sometimes researchers raise objections to the term when the data does not include fungal members (e.g., all 16S studies).

      We thank the reviewer for this suggestion. To include archaeal organisms, we have adopted the term “microbial” instead of “bacterial”.

      (3) Line 18: This manuscript is being submitted under the “Physics of Living Systems“ tract, but it may be useful to explicitly state in the Abstract that disordered systems are a useful approach for understanding large, complex communities for the benefit of life science researchers coming from a biology background.

      Thank you. We have modified the abstract following this suggestion.

      (4) Line 68: Consider using “adapted“ or something similar instead of “mutated“ if there is no specific reason for that word choice.

      We thank the reviewer for this suggestion, which was implemented in the text.

      (5) Line 111: It would be useful to define annealed and quenched for a general life science audience.

      We thank the reviewer for this suggestion. In the “Results” section, we have opted for “time-dependent disordered interactions” to reach a broader audience and avoid any jargon. Moreover, in the Discussion we added a detailed footnote: “In contrast to the quenched approximation, the annealed version assumes that the random couplings are not fixed but instead fluctuate over time, with their covariance governed by independent Ornstein–Uhlenbeck processes.”
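The distinction between quenched and annealed couplings can be illustrated with a short sketch of a single Ornstein–Uhlenbeck (OU) process. It uses the exact discrete-time OU update; the function name `ou_coupling` and the parameters `tau` (correlation time), `sigma` (stationary standard deviation) and `mean` are hypothetical names chosen for illustration and are not taken from the manuscript.

```python
import numpy as np


def ou_coupling(n_steps, dt=0.01, tau=1.0, sigma=0.5, mean=0.0, seed=0):
    """Trajectory of one fluctuating coupling modelled as an OU process.

    Uses the exact one-step update, so the stationary standard deviation
    is sigma for any dt. tau -> infinity freezes the coupling (quenched
    limit); finite tau lets it fluctuate in time (annealed case).
    """
    rng = np.random.default_rng(seed)
    decay = np.exp(-dt / tau)               # memory kept over one step
    kick = sigma * np.sqrt(1.0 - decay**2)  # noise injected per step
    x = np.empty(n_steps)
    x[0] = mean + sigma * rng.standard_normal()  # draw from stationary law
    for t in range(1, n_steps):
        x[t] = mean + decay * (x[t - 1] - mean) + kick * rng.standard_normal()
    return x


annealed = ou_coupling(1000, tau=1.0)     # coupling fluctuates over time
quenched = ou_coupling(1000, tau=np.inf)  # coupling stays at its initial draw
```

In a full annealed model each element of the interaction matrix would follow its own independent process of this kind; the sketch shows only one element.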

      (6) Line 124: Likewise for the replicon sector.

      We thank the reviewer for the suggestion. We added a footnote on page 4, after the formula, to highlight the physical intuition behind the introduction of the replicon mode.

      “The replicon eigenvalue refers to a particular type of fluctuation around the saddle-point (mean-field) solution within the replica framework. When the Hessian matrix of the replicated free energy is diagonalized, fluctuations are divided into three sectors: longitudinal, anomalous, and replicon. The replicon mode is the most sensitive to criticality signaling – by its vanishing trend – the emergence of many nearly-degenerate states. It essentially describes how ‘soft’ the system is to microscopic rearrangements in configuration space.”

      (7) Figure 2: It would be helpful to include y-axis labels for each order parameter alongside the mathematical notation.

      We thank the reviewer for this suggestion. The y-axis of Figure 2 now includes, alongside the mathematical symbol, the label of the represented quantity.

      (8) Line 242: Subscript “U” is used to denote “Unhealthy” microbiomes, but “D” is used to denote “Diseased” in Figs. 2 and 3 (perhaps elsewhere as well).

      We thank the reviewer for this observation. After checking the various subscripts in the text, we harmonized our notation with Figures 2 and 3, adopting the subscript “D” for symbols related to the diseased/unhealthy condition.

      (9) Line 283: “not to“ should be “not due to“

      We thank the reviewer for this suggestion. After inspecting the text, we corrected the reported error.

      (10) Equations 23, 34: Extra “=“ on the RHS of the first line.

      We consistently follow the same formatting across all the line breaks in the equations throughout the text.

      We are thus resubmitting our paper, hoping to have satisfactorily addressed all referees’ concerns.

      References

      (1) Jacopo Grilli. Macroecological laws describe variation and diversity in microbial communities. Nature communications, 11(1):4743, 2020.

      (2) Guy Bunin. Ecological communities with lotka-volterra dynamics. Physical Review E, 95(4):042414, 2017.

      (3) Matthieu Barbier, Jean-François Arnoldi, Guy Bunin, and Michel Loreau. Generic assembly patterns in complex ecological communities. Proceedings of the National Academy of Sciences, 115(9):2156–2161, 2018.

      (4) Jacopo Pasqualini, Sonia Facchin, Andrea Rinaldo, Amos Maritan, Edoardo Savarino, and Samir Suweis. Emergent ecological patterns and modelling of gut microbiomes in health and in disease. PLOS Computational Biology, 20(9):e1012482, 2024.

      (5) Ada Altieri, Felix Roy, Chiara Cammarota, and Giulio Biroli. Properties of equilibria and glassy phases of the random lotka-volterra model with demographic noise. Physical Review Letters, 126(25):258301, 2021.

      (6) Giulio Biroli, Guy Bunin, and Chiara Cammarota. Marginally stable equilibria in critical ecosystems. New Journal of Physics, 20(8):083051, 2018.

      (7) Amir Bashan, Travis E Gibson, Jonathan Friedman, Vincent J Carey, Scott T Weiss, Elizabeth L Hohmann, and Yang-Yu Liu. Universality of human microbial dynamics. Nature, 534(7606):259–262, 2016.

      (8) Marcello Seppi, Jacopo Pasqualini, Sonia Facchin, Edoardo Vincenzo Savarino, and Samir Suweis. Emergent functional organization of gut microbiomes in health and diseases. Biomolecules, 14(1):5, 2023.

      (9) Jared Kehe, Anthony Ortiz, Anthony Kulesa, Jeff Gore, Paul C Blainey, and Jonathan Friedman. Positive interactions are common among culturable bacteria. Science advances, 7(45):eabi7159, 2021.

      (10) Ophelia S Venturelli, Alex V Carr, Garth Fisher, Ryan H Hsu, Rebecca Lau, Benjamin P Bowen, Susan Hromada, Trent Northen, and Adam P Arkin. Deciphering microbial interactions in synthetic human gut microbiome communities. Molecular systems biology, 14(6):e8157, 2018.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors develop a novel method to infer ecologically-informative parameters across healthy and diseased states of the gut microbiota, although the method is generalizable to other datasets for species abundances. The authors leverage techniques from theoretical physics of disordered systems to infer different parameters (the mean and standard deviation of the strength of bacterial interspecies interactions, a bacterial immigration rate, and the strength of demographic noise) that describe the statistics of microbiota samples from two groups: one for healthy subjects and another for subjects with chronic inflammation syndromes. To do this, the authors simulate communities with a modified version of the Generalized Lotka-Volterra model and randomly-generated interactions, and then use a moment-matching algorithm to find sets of parameters that better reproduce the data for species abundances. They find that these parameters are different for the healthy and diseased microbiota groups. The results suggest, for example, that bacterial interaction strengths, relative to noise and immigration, play a more dominant role in microbiota dynamics in diseased states than in healthy states.

      We think that this manuscript brings an important contribution that will be of interest in the areas of statistical physics, (microbiota) ecology and (biological) data science. The evidence of their results is solid and the work improves the state-of-the-art in terms of methods.

      Strengths:

      • Using a fairly generic ecological model, the method can identify the change in the relative importance of different ecological forces (distribution of interspecies interactions, demographic noise and immigration) in different sample groups. The authors focus on the case of the human gut microbiota, showing that the data is consistent with a higher influence of species interactions (relative to demographic noise and immigration) in a disease microbiota state than in healthy ones.

      • The method is novel, original and it improves the state-of-the-art methodology for the inference of ecologically-relevant parameters. The analysis provides solid evidence on the conclusions.

      Weaknesses:

      • As a proof of concept for a new inference method, this text maintains a technical focus, which may require some familiarity with statistical physics. Nevertheless, the authors' clear introduction of key mathematical terms and their interpretations, along with a clear discussion of the ecological implications, make the results accessible and easy to follow.
    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The study explored the biomechanics of kangaroo hopping across both speed and animal size to try and explain the unique and remarkable energetics of kangaroo locomotion.

      Strengths:

      The study brings kangaroo locomotion biomechanics into the 21st century. It is a remarkably difficult project to accomplish. There is excellent attention to detail, supported by clear writing and figures.

      Weaknesses:

      The authors oversell their findings, but the mystery still persists. 

      The manuscript lacks a big-picture summary with pointers to how one might resolve the big question.

      General Comments

      This is a very impressive tour de force by an all-star collaborative team of researchers. The study represents a tremendous leap forward (pun intended) in terms of our understanding of kangaroo locomotion. Some might wonder why such an unusual species is of much interest. But, in my opinion, the classic study by Dawson and Taylor in 1973 of kangaroos launched the modern era of running biomechanics/energetics and applies to varying degrees to all animals that use bouncing gaits (running, trotting, galloping and of course hopping). The puzzling metabolic energetics findings of Dawson & Taylor (little if any increase in metabolic power despite increasing forward speed) remain a giant unsolved problem in comparative locomotor biomechanics and energetics. It is our "dark matter problem".

      Thank you for the kind words.

      This study is certainly a hop towards solving the problem. But, the title of the paper overpromises and the authors present little attempt to provide an overview of the remaining big issues. 

      We have modified the title to reflect this comment: “Postural adaptations may contribute to the unique locomotor energetics seen in hopping kangaroos”

      The study clearly shows that the ankle and to a lesser extent the mtp joint are where the action is. They clearly show in great detail by how much and by what means the ankle joint tendons experience increased stress at faster forward speeds.

      Since these were zoo animals, direct measures were not feasible, but the conclusion that the tendons are storing and returning more elastic energy per hop at faster speeds is solid. The conclusion that net muscle work per hop changes little from slow to fast forward speeds is also solid. 

      Doing less muscle work can only be good if one is trying to minimize metabolic energy consumption. However, to achieve greater tendon stresses, there must be greater muscle forces. Unless one is willing to reject the premise of the cost of generating force hypothesis, that is an important issue to confront. Further, the present data support the Kram & Dawson finding of decreased contact times at faster forward speeds. Kram & Taylor and subsequent applications of (and challenges to) their approach supports the idea that shorter contact times (tc) require recruiting more expensive muscle fibers and hence greater metabolic costs. Therefore, I think that it is incumbent on the present authors to clarify that this study has still not tied up the metabolic energetics across speed problems and placed a bow atop the package. 

      Fortunately, I am confident that the impressive collective brain power that comprises this author list can craft a paragraph or two that summarizes these ideas and points out how the group is now uniquely and enviably poised to explore the problem more using a dynamic SIMM model that incorporates muscle energetics (perhaps à la Umberger et al.). Or perhaps they have other ideas about how they can really solve the problem.

      You have raised important points, thank you for this feedback. We have added a limitations and considerations section to the discussion which highlights that there are still unanswered questions. Line 311-328

      Considerations and limitations

      “First, we believe it is more likely that the changes in moment arms and EMA can be attributed to speed rather than body mass, given the marked changes in joint angles and ankle height observed at faster hopping speeds. However, our sample included a relatively narrow range of body masses (13.7 to 26.6 kg) compared to the potential range (up to 80 kg), limiting our ability to entirely isolate the effects of speed from those of mass. Future work should examine a broader range of body sizes. Second, kangaroos studied here only hopped at relatively slow speeds, which bounds our estimates of EMA and tendon stress to a less critical region. As such, we were unable to assess tendon stress at fast speeds, where increased forces would reduce tendon safety factors closer to failure. A different experimental or modelling approach may be needed, as kangaroos in enclosures seem unwilling to hop faster over force plates. Finally, we did not determine whether the EMA of proximal hindlimb joints (which are more difficult to track via surface motion capture markers) remained constant with speed. Although the hip and knee contribute substantially less work than the ankle joint (Fig. 4), the majority of kangaroo skeletal muscle is located around these proximal joints. A change in EMA at the hip or knee could influence a larger muscle mass than at the ankle, potentially counteracting or enhancing energy savings in the ankle extensor muscle-tendon units. Further research is needed to understand how posture and muscles throughout the whole body contribute to kangaroo energetics.”

      Additionally, we added a line “Peak GRF also naturally increased with speed together with shorter ground contact durations (Fig. 2b, Suppl. Fig 1b)” (line 238) to highlight that we are not proposing that changes in EMA alone explain the full increase in tendon stress. Both GRF and EMA contribute substantially (almost equally) to stress, and we now give more equal discussion to both. For instance, we now also evaluate how much each contributes: “If peak GRF were constant but EMA changed from the average value of a slow hop to a fast hop, then stress would increase 18%, whereas if EMA remained constant and GRF varied by the same principles, then stress would only increase by 12%. Thus, changing posture and decreasing ground contact duration both appear to influence tendon stress for kangaroos, at least for the range of speeds we examined” (Line 245-249)
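The two percentages quoted above can be unpacked with a short worked sketch. It assumes the standard lever relation in which tendon force equals GRF divided by EMA, so that stress scales as GRF / (EMA × tendon cross-sectional area). It also assumes, purely for illustration, that the two effects combine multiplicatively with no other contributors. The function names and the `area` parameter are hypothetical.

```python
def tendon_stress(grf, ema, area=1.0):
    """Tendon stress from the lever relation: tendon force = GRF / EMA."""
    return grf / (ema * area)


def implied_factors(stress_gain_ema=0.18, stress_gain_grf=0.12):
    """Turn the quoted stress increases into EMA and GRF ratios (fast/slow)."""
    # Stress is inversely proportional to EMA, so an 18% rise in stress
    # at constant GRF implies EMA_fast / EMA_slow = 1 / 1.18.
    ema_ratio = 1.0 / (1.0 + stress_gain_ema)
    # Stress is proportional to GRF, so a 12% rise in stress at constant EMA
    # implies GRF_fast / GRF_slow = 1.12.
    grf_ratio = 1.0 + stress_gain_grf
    # Combined change if both act together (illustrative assumption only).
    combined = tendon_stress(grf_ratio, ema_ratio) / tendon_stress(1.0, 1.0) - 1.0
    return ema_ratio, grf_ratio, combined


ema_ratio, grf_ratio, combined = implied_factors()
```

Under these assumptions the 18% stress increase corresponds to a drop in EMA of roughly 15%, and the two contributions together would give an increase of about 32%; the manuscript itself reports only the separate 18% and 12% figures.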

      We have added a paragraph in the discussion acknowledging that the cost of generating force problem is not resolved by our work, concluding that “This mechanism may help explain why hopping macropods do not follow the energetic trends observed in other species (Dawson and Taylor 1973, Baudinette et al. 1992, Kram and Dawson 1998), but it does not fully resolve the cost of generating force conundrum” Line 274-276.

      I have a few issues with the other half of this study (i.e. animal size effects). I would enjoy reading a new paragraph by these authors in the Discussion that considers the evolutionary origins and implications of such small safety factors. Surely, it would need to be speculative, but that's OK.

      We appreciate this comment from the reviewer; however, we could not extend the study to discuss animal size effects because, as we now note in the results: “The range of body masses may not be sufficient to detect an effect of mass on ankle moment in addition to the effect of speed.” Line 193

      Reviewer #2 (Public Review):

      Summary

      This is a fascinating topic that has intrigued scientists for decades. I applaud the authors for trying to tackle this enigma. In this manuscript, the authors primarily measured hopping biomechanics data from kangaroos and performed inverse dynamics. 

      While these biomechanical analyses were thorough and impressively incorporated collected anatomical data and an Opensim model, I'm afraid that they did not satisfactorily address how kangaroos can hop faster and not consume more metabolic energy, unique from other animals.  Noticeably, the authors did not collect metabolic data nor did they model metabolic rates using their modelling framework. Instead, they performed a somewhat traditional inverse dynamics analysis from multiple animals hopping at a self-selected speed.

      In the current study, we aimed to provide a joint-level explanation for the increases of tendon stress that are likely linked to metabolic energy consumption.

      We have now included a limitations section in the manuscript (See response to Rev 1). We plan to expand upon muscle level energetics in the future with a more detailed musculoskeletal model.

      Within these analyses, the authors largely focused on ankle EMA, discussing its potential importance (because it affects tendon stress, which affects tendon strain energy, which affects muscle mechanics) on the metabolic cost of hopping. However, EMA was roughly estimated (CoP was fixed to the foot, not measured) and did not detectably associate with hopping speed (see results). Yet, the authors interpret their EMA findings as though it systematically related with speed to explain their theory on how metabolic cost is unique in kangaroos vs. other animals.

      As noted in our methods, EMA was not calculated from a fixed centre of pressure (CoP). We did fix the medial-lateral position, owing to the fact that both feet contacted the force plate together, but the anterior-posterior movement of the CoP was recorded by the force plate and thus allowed to move. We report the movement (or lack of movement) in our results. The anterior-posterior axis is the most relevant to lengthening or shortening the distance of the ‘out-lever’ R, and thereby EMA. It is necessary to assume a fixed medial-lateral position because a single force trace and CoP is recorded when two feet land on the force plate. The medial-lateral forces on each foot cancel out, so there is no overall medial-lateral movement if the forces are symmetrical (e.g. if the kangaroo is hopping in a straight path and one foot is not in front of the other). We only used symmetrical trials so that the anterior-posterior movement of the CoP would be reliable. We have now added additional details into the text to clarify this.

      Indeed, the relationship between R and speed (and therefore EMA and speed) was not significant. However, the significant change in ankle height with speed, combined with no systematic change in COP at midstance, demonstrates that R would be greater at faster speeds. If we consider the nonsignificant relationship between R and speed to indicate that there is no change in R, then these two results conflict. We could not find a flaw in our methods, so instead concluded that the nonsignificant relationship between R and speed may be due to a small change in R being undetectable in our data. Taking both results into account, we believe it is more likely that there is a non-detectable change in R, rather than no change in R with speed, but we presented both results for transparency. We have added an additional section into the results to make this clearer (Line 177-185) “If we consider the nonsignificant relationship between R (and EMA) and speed to indicate that there is no change in R, then it conflicts with the ankle height and CoP result. Taking both into account, we think it is more likely that there is a small, but important, change in R, rather than no change in R with speed. It may be undetectable because we expect small effect sizes compared to the measurement range and measurement error (Suppl. Fig. 3h), or be obscured by a similar change in R with body mass. R is highly dependent on the length of the metatarsal segment, which is longer in larger kangaroos (1 kg BM corresponded to ~1% longer segment, P<0.001, R<sup>2</sup>=0.449). If R does indeed increase with speed, both R and r will tend to decrease EMA at faster speeds.”

      These speed vs. biomechanics relationships were limited by comparisons across different animals hopping at different speeds and could have been strengthened using a repeated measures design.

      There is significant variation in speed within individuals, not just between individuals. The preferred speed of kangaroos is 2-4.5 m/s, but most individuals showed a wide speed range within this. Eight of our 16 kangaroos had a maximum speed that was 1-2 m/s faster than their slowest trial. Repeated measures of these eight individuals comprise 78 out of the 100 trials. It would be ideal to collect data across the full range of speeds for all individuals, but it is not feasible in this type of experimental setting. Interference with animals, such as chasing, is dangerous to kangaroos as they are prone to adverse reactions to stress. We have now added additional information about the chosen hopping speeds into the results and methods sections to clarify this: “The kangaroos elected to hop between 1.99 and 4.48 m s<sup>-1</sup>, with a range of speeds and number of trials for each individual (Suppl. Fig. 9).” (Line 381-382)

      There are also multiple inconsistencies between the authors' theory on how mechanics affect energetics and the cited literature, which leaves me somewhat confused and wanting more clarification and information on how mechanics and energetics relate.

      We thank the reviewer for this comment. Upon rereading, we now understand the reviewer's position and have made substantial revisions to the introduction and discussion (see comments below).

      My apologies for the less-than-favorable review, I think that this is a neat biomechanics study - but am unsure if it adds much to the literature on the topic of kangaroo hopping energetics in its current form.

      Again we thank the reviewer for their time and appreciate their efforts to strengthen our manuscript.

      Reviewer #3 (Public Review):

      Summary:

      The goal of this study is to understand how, unlike other mammals, kangaroos are able to increase hopping speed without a concomitant increase in metabolic cost. They use a biomechanical analysis of kangaroo hopping data across a range of speeds to investigate how posture, effective mechanical advantage, and tendon stress vary with speed and mass. The main finding is that a change in posture leads to increasing effective mechanical advantage with speed, which ultimately increases tendon elastic energy storage and returns via greater tendon strain. Thus kangaroos may be able to conserve energy with increasing speed by flexing more, which increases tendon strain.

      Strengths:

      The approach and effort invested into collecting this valuable dataset of kangaroo locomotion is impressive. The dataset alone is a valuable contribution.

      Thank you!

      Weaknesses:

      Despite these strengths, I have concerns regarding the strength of the results and the overall clarity of the paper and methods used (which likely influences how convincingly the main results come across).

      (1) The paper seems to hinge on the finding that EMA decreases with increasing speed and that this contributes significantly to greater tendon strain estimated with increasing speed. It is very difficult to be convinced by this result for a number of reasons:

      It appears that kangaroos hopped at their preferred speed. Thus the variability observed is across individuals not within. Is this large enough of a range (either within or across subjects) to make conclusions about the effect of speed, without results being susceptible to differences between subjects? 

      Apologies, this was not clear in the manuscript. Kangaroos hopping at their preferred speed means we did not chase or startle them into high speeds, in order to comply with ethics and enclosure limitations. Thus we did not record a wide range of speeds within the bounds of what kangaroos are capable of in the wild (up to 12 m/s), but for the range we did measure (~2-4.5 m/s), there is a large amount of variation in hopping speed within each individual kangaroo. Out of 16 individuals, eight individuals had a difference of 1-2 m/s between their slowest and fastest trials, and these kangaroos accounted for 78 out of 100 trials. Of the remainder, six individuals had three or fewer trials each, and two individuals had highly repeatable speeds (3 out of 4, and 6 out of 7 trials were within 0.5 m/s). We have now removed the terminology “preferred speed”, e.g. line 115. We have added additional information about the chosen hopping speeds into the results and methods, including an appendix figure: “The kangaroos elected to hop between 1.99 and 4.48 m s<sup>-1</sup>, with a range of speeds and number of trials for each individual (Suppl. Fig. 9).” (Line 381-382)

      In the literature cited, what was the range of speeds measured, and was it within or between subjects?

      For other literature, to our knowledge the highest speed measured is ~9.5m/s (see supplementary Fig1b) and there were multiple measures for several individuals (see methods Kram & Dawson 1998). 

      Assuming that there is a compelling relationship between EMA and velocity, how reasonable is it to extrapolate to the conclusion that this increases tendon strain and ultimately saves metabolic cost?  They correlate EMA with tendon strain, but this would still not suggest a causal relationship (incidentally the p-value for the correlation is not reported). 

      The functions that underpin these results (e.g. moment = GRF*R) come from physical mechanics and geometry, rather than statistical correlations. Additionally, a p-value is not appropriate in the relationship between EMA and stress (rather than strain) because the relationship does not appear to be linear. We have made it clearer in the discussion that we are not proposing that the entire change in stress is caused by changes in EMA, but that the increase in GRF that naturally occurs with speed will also explain some of the increase in stress, along with other potential mechanisms. The discussion has been extensively revised to reflect this.

      Tendon strain could be increasing with ground reaction force, independent of EMA. Even if there is a correlation between strain and EMA, is it not a mathematical necessity in their model that all else being equal, tendon stress will increase as ema decreases? I may be missing something, but nonetheless, it would be helpful for the authors to clarify the strength of the evidence supporting their conclusions.

      Yes, GRF also contributes to the increase in tendon stress in the mechanism we propose (Suppl. Fig. 8), see the formulas in Fig 6, and we have made this clearer in the revised discussion (see above comment).  You are correct that mathematically stress is inversely proportional to EMA, which can be observed in Fig. 7a, and we did find that EMA decreases. 
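
      The inverse relationship referred to here can be written out as a short sketch. It assumes a quasi-static moment balance at the ankle; the symbols $F_t$ (tendon force), $A$ (tendon cross-sectional area) and $\sigma$ (tendon stress) are labels chosen for illustration and may differ from the notation in Fig. 6, while $r$ and $R$ are the muscle and GRF moment arms as used elsewhere in this response.

```latex
\begin{align}
\mathrm{EMA} &= \frac{r}{R} \\
F_t \, r &= \mathrm{GRF} \cdot R \\
F_t &= \frac{\mathrm{GRF} \cdot R}{r} = \frac{\mathrm{GRF}}{\mathrm{EMA}} \\
\sigma &= \frac{F_t}{A} = \frac{\mathrm{GRF}}{A \cdot \mathrm{EMA}}
\end{align}
```

      Under these assumptions, stress rises with GRF at a fixed EMA, and rises as EMA falls at a fixed GRF, so the two contributions multiply rather than compete.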

      The statistical approach is not well-described. It is not clear what the form of the statistical model used was and whether the analysis treated each trial individually or grouped trials by the kangaroo. There is also no mention of how many trials per kangaroo, or the range of speeds (or masses) tested. 

      The methods include the statistical model with the variables that we used, as well as the kangaroo masses (13.7 to 26.6 kg, mean: 20.9 ± 3.4 kg). We did not have a sufficient within-individual sample size to use a linear mixed effect model including subject as a random factor, thus all trials were treated individually. We have included this information in the results section.

      We have now moved the range of speeds from the supplementary material to the results and figure captions. We have added information on the number of trials per kangaroo to the methods, and added Suppl. Fig. 9 showing the distribution of speeds per kangaroo.

      We did not group the data e.g. by using an average speed per individual for all their trials, or by comparing fast to slow groups for statistical analysis (the latter was only for display purposes in our figures, which we have now made clearer in the methods statistics section). 

      Related to this, there is no mention of how different speeds were obtained. It seems that kangaroos hopped at a self-selected pace, thus it appears that not much variation was observed. I appreciate the difficulty of conducting these experiments in a controlled manner, but this doesn’t exempt the authors from providing the details of their approach.

      Apologies, this was not clear in the manuscript. Kangaroos hopping at their preferred speed means we did not chase or startle them into high speeds to comply with ethics and enclosure limitations. Thus we did not record a wide range of speeds within the bounds of what kangaroos are capable of in the wild (up to 12 m/s). We have now removed the terminology “preferred speed” e.g. line 115. We have added additional information about the chosen hopping speeds into the results and methods, including an appendix figure (see above comment). (Line 381-382)

      Some figures (Figure 2 for example) present means for one of three speeds, yet the speeds are not reported (except in the legend) nor how these bins were determined, nor how many trials or kangaroos fit in each bin. A similar comment applies to the mass categories. It would be more convincing if the authors plotted the main metrics vs. speed to illustrate the significant trends they are reporting.

      Thank you for this comment. The bins are used only for display purposes and not within the statistical analysis. We have clarified this in the revised manuscript: “The data was grouped into body mass (small 17.6±2.96 kg, medium 21.5±0.74 kg, large 24.0±1.46 kg) and speed (slow 2.52±0.25 m s<sup>-1</sup>, medium 3.11±0.16 m s<sup>-1</sup>, fast 3.79±0.27 m s<sup>-1</sup>) subsets for display purposes only”. (Line 495-497)

      (2) The significance of the effects of mass is not clear. The introduction and abstract suggest that the paper is focused on the effect of speed, yet the effects of mass are reported throughout as well, without a clear understanding of the significance. This weakness is further exaggerated by the fact that the details of the subject masses are not reported.

      Indeed, the primary aim of our study was to explore the influence of speed, given the uncoupling of energy from hopping speed in kangaroos. We included mass to ensure that the effects of speed were not driven by body mass (i.e.: that larger kangaroos hopped faster). Subject masses were reported in the first paragraph of the methods, albeit some were estimated as outlined in the same paragraph.

      (3) The paper needs to be significantly re-written to better incorporate the methods into the results section. Since the results come before the methods, some of the methods must necessarily be described such that the study can be understood at some level without turning to the dedicated methods section. As written, it is very difficult to understand the basis of the approach, analysis, and metrics without turning to the methods.

      Placing the methods after the discussion is a requirement of the journal. We have incorporated some methods in the results where necessary, without being too repetitive or disruptive, e.g. in the Fig. 1 caption, and by specifying that we are only analysing EMA for the ankle joint.

      Reviewing Editor (Recommendations For The Authors):

      Below is a list of specific recommendations that the authors could address to improve the eLife assessment:

      (1) Based on the data presented and the fact that metabolic energy was not measured, the authors should temper their conclusions and statements throughout the manuscript regarding the link between speed and metabolic energy savings. We recommend adding text to the discussion summarizing the strengths and limitations of the evidence provided and suggesting future steps to more conclusively answer this mystery.

      There is a significant body of work linking metabolic energy savings to measured increases in tendon stress in macropods. However, the purpose of this paper was to address the unanswered questions about why tendon stress increases. We found that stress did not only increase due to GRF increasing with speed as expected, but also due to novel postural changes which decreased EMA. In the revised manuscript, we have tempered our conclusions to make it clearer that it is not just EMA affecting stress, and added limitations throughout the manuscript (see response to Rev 1). 

      (2) To provide stronger evidence of a link between speed, mechanics, and metabolic savings the authors can consider estimating metabolic energy expenditure from their OpenSIM model. This is one suggestion, but the authors likely have other, possibly better ideas. Such a model should also be able to explain why the metabolic rate increases with speed during uphill hopping.

      Extending the model to provide direct metabolic cost estimates will be the goal of a future paper; however, the model does not have the detailed muscle characteristics needed to do this in the formulation presented here. It would be a very large undertaking which is beyond the scope of the current manuscript. As per the comment above, the results of this paper are not reliant on metabolic performance.

      (3) The authors attempt to relate the newly quantified hopping biomechanics to previously published metabolic data. However, all reviewers agree that the logic in many instances is not clear or contradictory. Could one potential explanation be that at slow speeds, forces and tendon strain are small, and thus muscle fascicle work is high? Then, with faster speeds, even though the cost of generating isometric force increases, this is offset by the reduction in the metabolic cost of muscular work. The paper could provide stronger support for their hypotheses with a much clearer explanation of how the kinematics relate to the mechanics and ultimately energy savings.

      In response to the reviewers' comments, we have substantially modified the discussion to provide a clearer rationale.

      (4) The methods and the effort expended to collect these data are impressive, but there are a number of underlying assumptions made that undermine the conclusions. This is due partly to the methods used, but also the paper's incomplete description of their methods. We provide a few examples below:

      It would be helpful if the authors could speak to the effect of the limited speeds tested and between-animal comparisons on the ability to draw strong conclusions from the present dataset.

      Throughout the discussion, the authors highlight the relationship between EMA and speed. However, this is misleading since there was no significant effect of speed on EMA. Speed only affected the muscle moment arm, r. At minimum, this should be clarified and the effect on EMA not be overstated. Additionally, the resulting implications on their ability to confidently say something about the effect of speed on muscle stress should be discussed. 

      We have now provided additional details (see responses above) addressing these concerns. For instance, we added a supplementary figure showing the speed distribution per individual. The primary reviewer concern (that each kangaroo travelled at a single speed) was due to a miscommunication around the terminology “preferred”, which has now been corrected.

      We now elaborate in the results why we are not very concerned that EMA is insignificant. The statistical insignificance of EMA is ultimately due to the insignificance of the direct measurement of R, however, we now better explain in the results why we believe that this statistical insignificance is due to error/noise of the measurement which is relatively large compared to the effect size. Indirect indications of how R may increase with speed (via ankle height from the ground) are statistically significant. Lines 177-185. 

      We consider this worth reporting because, for instance, an 18% change in EMA will be undetectable by measurement, but corresponds to an 18% change in tendon stress which is measurable and physiologically significant (safety factor would decrease from 2 to 1.67).  We presented both significant and insignificant results for transparency. 

      We have also discussed this within a revised limitations section of the manuscript (Line 311-328).

      Reviewer #1 (Recommendations For The Authors):

      Title: I would cut the first half of the title. At least hedge it a bit. "Clues" instead of "Unlocking the secrets".

      We have revised the title to: “Postural adaptations may contribute to the unique locomotor energetics seen in hopping kangaroos”

      In my comments, ... typically indicates a stylistic change suggested to the text.

      Overall, the paper covers speed and size. Unfortunately, the authors were not 100% consistent in the order of presenting size then speed, or speed then size. Just choose one and stick with it.

      We have attempted to keep the order of presenting size and speed consistent, however there are several cases where this would reduce the readability of the manuscript and so in some cases this may vary. 

      One must admit that there is a lot of vertical scatter in almost all of the plots. I understand that these animals were not in a lab on a treadmill at a controlled speed and the animals wear fur coats so marker placements vary/move etc. But the spread is quite striking, e.g. Figure 5a the span at one speed is almost 10x. Can the authors address this somewhere? Limitations section?

      The variation seen likely results from attempting to display data in a 2D format, when it is in fact the result of multiple variables, including speed, mass, stride frequency and subject-specific lengths. Slight variations in these would be expected to produce some noise around the mean, and we think it is important to consider this while showing the more dominant effects.

      In many locations in the manuscript, the term "work" is used, but rarely if ever specified that this is the work "per hop". The big question revolves around the rate of metabolic energy consumption (i.e. energy per time or average metabolic power), one must not forget that hop frequency changes somewhat across speed, so work per hop is not the final calculation.

      Thank you for this comment. We have now explicitly stated work per hop in figure captions and in the results (line 208). The change in stride frequency at this range of speeds is very small, particularly compared to the variance in stride frequency (Suppl. Fig. 1d), which is consistent with other researchers who found that stride frequency was constant or near constant in macropods at analogous speeds (e.g. Dawson and Taylor 1973, Baudinette et al. 1987). 

      Line 61 ....is likely related.

      Added “likely” (line 59)

      Line 86 I think the Allen reference is incomplete. Wasn't it in J Exp Biology?

      Thank you. Changed. 

      Line 122 ... at faster speeds and in larger individuals.

      Changed: “We hypothesised that (i) the hindlimb would be more crouched at faster speeds, primarily due to the distal hindlimb joints (ankle and metatarsophalangeal), independent of changes with body mass” (Line 121-122).

      Line 124 I found this confusing. Try to re-word so that you explain you mean more work done by the tendons and less by the ankle musculature.

      Amended: “changes in moment arms resulting from the change in posture would contribute to the increase in tendon stress with speed, and may thereby contribute to energetic savings by increasing the amount of positive and negative work done by the ankle without requiring additional muscle work” (Line 123)

      Line 129 hopefully "braking" not "breaking"!

      Thank you. Fixed. (Line 130)

      Line 129 specify fore-aft horizontal force.

      Added "fore-aft" to "negative fore-aft horizontal component" (Line 130-131)

      Line 130 add something like "of course" or "naturally" since if there is zero fore-aft force, the GRF vector of course must be vertical. 

      Added "naturally" (Line 132)

      Line 138 clarify that this section is all stance phase. I don't recall reading any swing phase data.

      Changed to: "Kangaroo hindlimb stance phase kinematics varied…" (Line 141)

      Line 143 and elsewhere. I found the use of dorsiflexion and plantarflexion confusing. In Figure 3, I see the ankle never flexing more than 90 degrees. So, the ankle joint is always in something of a flexed position, though of course it flexes and extends during contact. I urge the authors to simplify to flexion/extension and drop the plantar/dorsi.

      We have edited this section to describe both movements as greater extension (plantarflexion). (Line 147). We have further clarified this in the figure caption for figure 3.  

      Line 147 ...changes were…

      Fixed, line 150

      Line 155 I'm a bit confused here. Are the authors calculating some sort of overall EMA or are they saying all of the individual joint EMAs all decreased?

      Thank you, we clarified that it is at the ankle. Line 158

      Line 158 since kangaroos hop and are thus positioned high and low throughout the stance phase, try to avoid using "high" and "low" for describing variables, e.g. GRF or other variables. Just use "greater/greatest" etc.

      Thanks for this suggestion. We have changed "higher" into "greater" where appropriate throughout the manuscript e.g. line 161

      Lines 162 and 168 same comment here about "r" and "R". Do you mean ankle or all joints?

      Clarified that it is the gastrocnemius and plantaris r, and the R to the ankle. (Lines 164-165)

      Line 173 really, ankle height?

      Added: ankle height is "vertical distance from the ground". Line 177

      Line 177 is this just the ankle r?

      Added "of the ankle" line 158 and “Achilles” line 187 

      Line 183 same idea, which tendon/tendons are you talking about here?

      Added "Achilles" to be more clear (Line 187)

      Line 195 substitute "converted" for "transferred".

      Done (Line 210)

      Line 223 why so vague? i.e. why use "may"? Believe in your data. ...stress was also modulated by changes....

      Changed "may" to "is"

      Line 229 smaller ankle EMA (especially since you earlier talked about ankle "height").

      Changed “lower” to “smaller” Line 254

      Line 2236 ...and return elastic energy…

      Added "elastic" line 262

      Line 244 IMPORTANT: Need to explain this better! I think you are saying that the net work at the ankle is staying the same across speed, BUT it is the tendons that are storing and returning that work, it's not that the muscles are doing a lot of negative/positive work.

      Changed: “The consistent net work observed among all speeds suggests the ankle extensor muscle-tendon units are performing similar amounts of ankle work independent of speed, which would predominantly be done by the tendon.” (Line 270-272)

      Line 258-261 I think here is where you are over-selling the data/story. Although you do say "a" mechanism (and not "the" mechanism, you still need to deal with the cost of generating more force and generating that force faster.

      We removed this sentence and replaced it with a discussion of the cost of generating force hypothesis, and alternative scenarios for how force and metabolics could be uncoupled.

      Line 278 "the" tendon? Which tendon?

      Added "Achilles"

      Line 289. I don't think one can project into the past.

      Changed “projected” to "estimated"

      Line 303 no problem, but I've never seen a paper in biology where the authors admit they don't know what species they were studying!

      Can’t be helped unfortunately. It is an old dataset and there aren’t photos of every kangaroo. Fortunately, from the grey and red kangaroos we can distinguish between, we know there are no discernible species effects on the data. 

      Lines 304-306 I'm not clear here. Did you use vertical impulse (and aerial time) to calculate body weight? Or did you somehow use the braking/propulsive impulse to calculate mass? I would have just put some apples on the force plate and waited for them to stop for a snack.

      Stationary weights were recorded for some kangaroos which did stand on the force plate long enough, but unfortunately not all of them were willing to do so. In those cases, yes, we used impulse from steady-speed trials to estimate mass. We cross-checked by estimating mass from segment lengths (as size and mass are correlated). This is outlined in the first paragraph of the methods.
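
      As a rough illustration of how mass can be recovered from impulse, the sketch below uses the standard result that, at steady speed, the mean vertical ground reaction force over one complete hop cycle equals body weight. It is not taken from the methods: the function name, the sampling rate and the synthetic half-sine force trace are all assumptions made for the example, and it presumes both feet are on the plate.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def estimate_mass_from_impulse(times_s, vertical_grf_n):
    """Estimate body mass (kg) from vertical GRF over one complete hop cycle.

    Assumes steady-speed hopping, so net vertical momentum change over the
    cycle is zero and the vertical impulse equals m * g * cycle duration.
    The samples must span stance plus the following aerial phase.
    """
    impulse = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Trapezoidal integration of force over time.
        impulse += 0.5 * (vertical_grf_n[i] + vertical_grf_n[i - 1]) * dt
    cycle_duration = times_s[-1] - times_s[0]
    return impulse / (G * cycle_duration)

# Synthetic example (hypothetical values): a 20 kg animal, 0.4 s hop cycle,
# 0.1 s contact time, half-sine vertical force during stance.
true_mass = 20.0
cycle_s, contact_s = 0.4, 0.1
peak_n = true_mass * G * cycle_s * math.pi / (2.0 * contact_s)
times = [i / 1000.0 for i in range(401)]  # 1000 Hz sampling
forces = [peak_n * math.sin(math.pi * t / contact_s) if t <= contact_s else 0.0
          for t in times]

estimated_mass = estimate_mass_from_impulse(times, forces)
print(f"estimated mass: {estimated_mass:.2f} kg")
```

      The estimate is only as good as the steady-speed assumption: any net acceleration or deceleration during the trial biases the result, which is one reason a cross-check against segment lengths is useful.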

      Lines 367 & 401 When you use the word "scaled" do you mean you assumed geometric similarity?

      No, rather than geometric scaling, we allowed scaling to individual dimensions by using the markers at midstance for measurements. We have amended the paragraph to clarify that the shape of the kangaroo changes and that mass distribution was preserved during the shape change (line 441-446) 

      Lines 381-82 specify "joint work"

      Added "joint work"  (Line 457)

      Figure 1 is gorgeous. Why not add the CF equation to the left panel of the caption?

      We decided to keep the information in the figure caption. “Total leg length was calculated as the sum of the segment lengths (solid black lines) in the hindlimb and compared to the pelvis-to-toe distance (dashed line) to calculate the crouch factor”

      Figure 2 specify Horizontal fore-aft.

      Done

      Figure 3g I'd prefer the same Min. Max Flexion vertical axis labels as you use for hip & knee.

      While we appreciate the reviewer trying to increase the clarity of this figure, we have left it as plantar/dorsi flexion since these are recognised biomechanical terms. To avoid confusion, we have further defined these in the figure caption “For (f-g), increased plantarflexion represents a decrease in joint flexion, while increased dorsiflexion represents increased flexion of the joint.”

      Figure 4. I like it and I think that you scaled all panels the same, i.e. 400 W is represented by the same vertical distance in all panels. But if that's true, please state so in the Caption. It's remarkable how little work occurs at the hip and knee despite the relatively huge muscles there.

      It is true that the y axes are all at the same scale. We have added this to the caption.

      Figure 5 Caption should specify "work per hop".

      Added

      Figure 7 is another beauty.

      Thank you!

      Supplementary Figure 3 is this all ANKLE? Please specify.

      Clarified that it is the gastrocnemius and plantaris r, and the R to the ankle.

      Reviewer #2 (Recommendations For The Authors):

      To 'unlock the secrets of kangaroo locomotor energetics' I expected the authors to measure the secretive outcome variable, metabolic rate using laboratory measures. Rather, the authors relied on reviewing historic metabolic data and collecting biomechanics data across different animals, which limits the conclusions of this manuscript.

      We have revised the title to make it clearer that we are investigating a subset of the energetics problem, specifically posture. “Postural adaptations may contribute to the unique locomotor energetics seen in hopping kangaroos.” We have also substantially modified the discussion to temper the conclusions from the paper.

      After reading the hypothesis, why do the authors hypothesize about joint flexion and not EMA? Because the following hypothesis discusses the implications of moment arms on tendon stress, EMA predictions are more relevant (and much more discussed throughout the manuscript).

      Ankle and MTP angles are the primary drivers of changes in r, R & thus, EMA. We used a two-part hypothesis to capture this. We have rephrased the hypotheses: “We hypothesised that (i) the hindlimb would be more crouched at faster speeds, primarily due to the distal hindlimb joints (ankle and metatarsophalangeal), independent of changes with body mass, and (ii) changes in moment arms resulting from the change in posture would contribute to the increase in tendon stress with speed, and may thereby contribute to energetic savings by increasing the amount of positive and negative work done by the ankle without requiring additional muscle work.”

      If there were no detectable effects of speed on EMA, are kangaroos mechanically like other animals (Biewener Science 89 & JAP 04) who don't vary EMA across speeds? Despite no detectable effects, the authors state [lines 228-229] "we found larger and faster kangaroos were more crouched, leading to lower ankle EMA". Can the authors explain this inconsistency? Lines 236 "Kangaroos appear to use changes in posture and EMA". I interpret the paper as EMA does not change across speed.

      Apologies, we did not sufficiently explain this originally. We now explain in the results our reasoning behind our belief that EMA and R may change with speed. “If we consider the nonsignificant relationship between R (and EMA) and speed to indicate that there is no change in R, then it conflicts with the ankle height and CoP result. Taking both into account, we think it is more likely that there is a small, but important, change in R, rather than no change in R with speed. It may be undetectable because we expect small effect sizes compared to the measurement range and measurement error (Suppl. Fig. 3h), or be obscured by a similar change in R with body mass. R is highly dependent on the length of the metatarsal segment, which is longer in larger kangaroos (1 kg BM corresponded to ~1% longer segment, P<0.001, R<sup>2</sup>=0.449). If R does indeed increase with speed, both R and r will tend to decrease EMA at faster speeds.” (Line 177-185)

      Lines 335-339: "We assumed the force was applied along phalanx IV and that there was no medial or lateral movement of the centre of pressure (CoP)". I'm confused, did the authors not measure CoP location with respect to the kangaroo limb? If not, this simple estimation undermines primary results (EMA analyses).

      We have changed "The anterior or posterior movement of the CoP was recorded by the force plate" to read: "The fore-aft movement of the CoP was recorded by the force plate within the motion capture coordinate system" (Line 406-407) and added more justification for fixing the CoP movement in the other axis: “It was necessary to assume the CoP was fixed in the medial-lateral axis because when two feet land on the force plate, the lateral forces on each foot are not recorded, and indeed cancel if the forces are symmetrical (i.e. if the kangaroo is hopping in a straight path and one foot is not in front of the other). We only used symmetrical trials to ensure reliable measures of the anterior-posterior movement of the CoP.” (Line 408-413)

      The introduction makes many assertions about the generalities of locomotion and the relationship between mechanics and energetics. I'm afraid that the authors are selectively choosing references without thoroughly evaluating alternative theories. For example, Taylor, Kram, & others have multiple papers suggesting that decreasing EMA and increasing muscle force (and active muscle volume) increase metabolic costs during terrestrial locomotion. Rather, the authors suggest that decreasing EMA and increasingly high muscle force at faster speeds don't affect energetics unless muscle work increases substantially (paragraph 2)? If I am following correctly, does this theory conflict with active muscle volume ideas that are peppered throughout this manuscript?

      Yes, as you point out, the same mechanism does lead to different results in kangaroos vs humans, for instance, but this is not a contradiction. In all species, decreasing EMA will result in an increase in muscle force due to less efficient leverage (i.e. lower EMA) of the muscles, and the muscle-tendon unit will be required to produce more force to balance the joint moment. As a consequence, human muscles activate a greater volume in order for the muscle-tendon unit to increase muscle work and produce enough force. We are proposing that in kangaroos, the increase in work is done by the Achilles tendon rather than the muscles. Previous research suggests that macropod ankle muscles contract isometrically or that the fibres do not shorten more at faster speeds, i.e. muscle work does not increase with speed. Instead, the additional force seems to come from the tendon storing and subsequently returning more strain energy (indicated by higher stress). We found that the increase in tendon stress comes from higher ground force at faster speeds, and from the kangaroo adopting a more crouched posture, which increases the tendon’s stress compared to an upright posture for a given speed (think of this as increasing the tendon’s stress capacity). We have substantially revised the discussion to highlight this.

      Similarly, does increased gross or net tendon mechanical energy storage & return improve hopping energetics? Would more tendon stress and strain energy storage with a given hysteresis value also dissipate more mechanical energy, requiring leg muscles to produce more net work? Does net or gross muscle work drive metabolic energy consumption?

      Based on the cost of generating force hypothesis, we think that gross muscle work would be linked to driving metabolic energy consumption. Our idea here is that the total body work is a product of the work done by the tendon and the muscle combined. If the tendon has the potential to do more work, then the total work can increase without muscle work needing to increase.

      The results interpret speed effects on biomechanics, but each kangaroo was only collected at 1 speed. Are inter-animal comparisons enough to satisfy this investigation?

      We have added a figure (Suppl Fig 9) to demonstrate the distribution of speed and number of trials per kangaroo. We have also removed "preferred" from the manuscript as this seems to cause confusion. Most kangaroos travelled at a range of “casual” speeds.

      Abstract: Can the authors more fully connect the concept of tendon stress and low metabolic rates during hopping across speeds? Surely, tendon mechanics don't directly drive the metabolic cost of hopping, but they affect muscle mechanics to affect energetics.

      Amended to: "This phenomenon may be related to greater elastic energy savings due to increasing tendon stress; however, the mechanisms which enable the rise in stress, without additional muscle work remain poorly understood." (Lines 25-27).

      The topic sentence in lines 61-63 may be misleading. The ensuing paragraph does not substantiate the topic sentence stating that ankle MTUs decouple speeds and energetics.

      We added "likely" to soften the statement. (Line 59)

      Lines 84-86: In humans, does more limb flexion and worse EMA necessitate greater active muscle volume? What about muscle contractile dynamics - See recent papers by Sawicki & colleagues that include Hill-type muscle mechanics in active muscle volume estimates.

      Added: “Smaller EMA requires greater muscle force to produce a given force on the ground, thereby demanding a greater volume of active muscle, and presumably greater metabolic rates than larger EMA for the same physiology”. (Line 80-82)

      Lines 106: can you give the context of what normal tendon safety factors are?

      Good idea. Added: "far lower than the typical safety factor of four to eight for mammalian tendons (Ker et al. 1988)." Line 106-107

      I thought EMA was relatively stable across speeds as per Biewener [Science & JAP '04]. However the authors gave an example of an elephant to suggest that it is typically inversely related to speed. Can the authors please explain the disconnect and the most appropriate explanation in this paragraph?

      Knee EMA in particular changed with speed in Biewener 2004. What is “typical” probably depends on the group of animals studied; e.g., cursorial quadrupedal mammals generally seem to maintain constant EMA, but other groups do not.

      These cases are presented to show a range of consequences for changing EMA (usually with mass, but sometimes with speed). We have made several adjustments to the paragraph to make this clearer. Lines 85-93.

      The results depend on the modeled internal moment arm (r). How confident are the authors in their little r prediction? Considering complications of joint mechanics in vivo including muscle bulging. Holzer et al. '20 Sci Rep demonstrated that different models of the human Achilles tendon moment arm predict vastly different relationships between the moment arm and joint angle.

      Our values for r and EMA closely align with previous papers which measured/calculated these values in kangaroos, such as Kram 1998, and thus we are confident in our interpretation.

      This is a misleading results sentence: Small decreases in EMA correspond to a nontrivial increase in tendon stress, for instance, reducing EMA from 0.242 (mean minimum EMA of the slow group) to 0.206 (mean minimum EMA of the fast group) was associated with an ~18% increase in tendon stress. The authors could alternatively say that a ~15% decrease in EMA was associated with an ~18% increase in tendon stress, which seems pretty comparable.

      Thank you for pointing this out, it is important that it is made clearer. Although the change in relative magnitude is approximately the same (as it should be), this does not detract from the importance. The "small decrease in EMA" is referring to the absolute values, particularly in respect to the measurement error/noise. The difference is small enough to have been undetectable with other methods used in previous studies. We have amended the sentence to clarify this.

      It now reads: “Subtle decreases in EMA which may have been undetected in previous studies correspond to discernible increases in tendon stress. For instance, reducing EMA from 0.242 (mean minimum EMA of the slow group) to 0.206 (mean minimum EMA of the fast group) was associated with an increase in tendon stress from ~50 MPa to ~60 MPa, decreasing safety factor from 2 to 1.67 (where 1 indicates failure), which is both measurable and physiologically significant.” (Line 195-200)
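
      The arithmetic behind these figures can be checked with a few lines. This is an illustrative sketch only: it assumes stress scales as 1/EMA at fixed GRF and tendon area, and the failure stress of 100 MPa is not stated in the response but is implied by a safety factor of 2 at ~50 MPa. The function and variable names are hypothetical.

```python
# Values quoted in the response (mean minimum EMA per speed group).
ema_slow = 0.242
ema_fast = 0.206

def relative_stress_increase(ema_before, ema_after):
    """Fractional rise in tendon stress when EMA falls.

    Assumes stress is proportional to 1/EMA, i.e. GRF and tendon
    cross-sectional area are held constant.
    """
    return ema_before / ema_after - 1.0

def safety_factor(failure_stress_mpa, stress_mpa):
    """Ratio of failure stress to operating stress (1 indicates failure)."""
    return failure_stress_mpa / stress_mpa

ema_drop = 1.0 - ema_fast / ema_slow
stress_rise = relative_stress_increase(ema_slow, ema_fast)

# Assumption: failure stress implied by a safety factor of 2 at ~50 MPa.
failure_stress = 2.0 * 50.0
sf_slow = safety_factor(failure_stress, 50.0)
sf_fast = safety_factor(failure_stress, 60.0)

print(f"EMA decrease:    {ema_drop:.1%}")
print(f"Stress increase: {stress_rise:.1%}")
print(f"Safety factor:   {sf_slow:.2f} -> {sf_fast:.2f}")
```

      The ~15% fall in EMA and the ~17-18% rise in stress are two views of the same ratio (0.242/0.206), which is why the relative changes look comparable; the rounded ~50 and ~60 MPa values give a slightly larger ratio because they are approximate.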

      Lines 243-245: "The consistent net work observed among all speeds suggests the ankle extensors are performing similar amounts of ankle work independent of speed." If this is true, and presumably there is greater limb work performed on the center of mass at faster speeds (Donelan, Kram, Kuo), do more proximal leg joints increase work and energy consumption at faster speeds?

      The skin over the proximal leg joints (knee and hip) moves too much to get reliable measures of EMA from the ratio of moment arms. This will be pursued in future work when all muscles are incorporated in the model so knee and hip EMA can be determined from muscle force.

      We have added a limitations and considerations paragraph to the manuscript: “Finally, we did not determine whether the EMA of proximal hindlimb joints (which are more difficult to track via surface motion capture markers) remained constant with speed. Although the hip and knee contribute substantially less work than the ankle joint (Fig. 4), the majority of kangaroo skeletal muscle is located around these proximal joints. A change in EMA at the hip or knee could influence a larger muscle mass than at the ankle, potentially counteracting or enhancing energy savings in the ankle extensor muscle-tendon units. Further research is needed to understand how posture and muscles throughout the whole body contribute to kangaroo energetics.” (Line 321-328)

      Lines 245-246: "Previous studies using sonomicrometry have shown that the muscles of tammar wallabies do not shorten considerably during hops, but rather act near-isometrically as a strut" Which muscles? All muscles? Extensors at a single joint?

      Added "gastrocnemius and plantaris" Line 164-165

      Lines 249-254: "The cost of generating force hypothesis suggests that faster movement speeds require greater rates of muscle force development, and in turn greater cross-bridge cycling rates, driving up metabolic costs (Taylor et al. 1980, Kram and Taylor 1990). The ability for the ankle extensor muscle fibres to remain isometric and produce similar amounts of work at all speeds may help explain why hopping macropods do not follow the energetic trends observed in quadrupedal species." These sentences confuse me. Kram & Taylor's cost of force-generating hypothesis assumes that producing the same average force over shorter contact times increases metabolic rate. How does 'similar muscle work' across all speeds explain the ability of macropods to use unique energetic trends in the cost of force-generating hypothesis context?

      Thank you for highlighting this confusion. We have substantially revised the discussion to clarify where the mechanisms presented deviate from the cost of generating force hypothesis. Lines 270-309

      Reviewer #3 (Recommendations For The Authors):

      In addition to the points described in the public review, I have additional, related, specific comments:

      (1) Results: Please refer to the hypotheses in the results, and relate the findings back to the hypotheses.

      We now relate the findings back to the hypotheses:

      Line 142 “In partial support of hypothesis (i), greater masses and faster speeds were associated with more crouched hindlimb postures (Fig. 3a,c).”.

      Lines 205-206: “The increase in tendon stress with speed, facilitated in part by the change in moment arms by the shift in posture, may explain changes in ankle work (c.f. Hypothesis (ii)).” 

      (2) Results: please provide the main statistical results either in-line or in a table in the main text.

      We (the co-authors) have discussed this at length, and have agreed that the manuscript is far more readable in the format whereby most statistics lie within the supplementary tables, otherwise a reader is met with a wall of statistics. We only include values in the main text when the magnitude is relevant to the arguments presented in the results and discussion.

      (3) Line 140: Describe how 'crouched' was defined.

      We have now added a brief definition of ‘crouch factor’ after the figure caption (Line 143): (Fig. 3a,c; where crouch factor is the ratio of total limb length to pelvis-to-toe distance).
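
      As an illustration of this definition, the sketch below computes a crouch factor from joint positions. It assumes 2D joint-centre coordinates ordered from pelvis to toe and takes total limb length as the sum of segment lengths; the manuscript's exact segment definitions are not given here, so the function name and inputs are hypothetical.

```python
import math

def crouch_factor(joints):
    """Ratio of total limb length to pelvis-to-toe distance.

    joints: list of (x, y) joint centres ordered from pelvis to toe
    (hypothetical input format, not taken from the manuscript).
    """
    # Total limb length: sum of the straight-line segment lengths.
    limb_length = sum(math.dist(a, b) for a, b in zip(joints, joints[1:]))
    # Straight-line distance between the first (pelvis) and last (toe) point.
    pelvis_to_toe = math.dist(joints[0], joints[-1])
    return limb_length / pelvis_to_toe

# A fully extended limb gives 1.0; a flexed limb gives a larger value.
straight = [(0.0, 3.0), (0.0, 2.0), (0.0, 1.0), (0.0, 0.0)]
flexed = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
print(crouch_factor(straight), crouch_factor(flexed))
```

      With this convention, larger values correspond to a more crouched posture.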

      (4) Line 162: This seems to be a main finding and should be a figure in the main text not supplemental. Additionally, Supplementary Figures 3a and b do not show this finding convincingly. There should be a figure plotting r vs speed and r vs mass.

      The combination of r and R is represented in the EMA plot in the main text. The r and R plots are relegated to the supplementary material because the main text is already very crowded. Thank you for suggesting a figure plotting r and R versus speed; this is now included as Suppl. Fig. 3h.

      (5) Line 166: Supplementary Figure 3g does not show the range of dorsiflexion angles as a function of speed. It shows r vs dorsiflexion angle. Please correct.

      Thanks for noticing this, it was supposed to reference Fig 3g rather than Suppl Fig 3g in the sentence regarding speed. We have fixed this, Line 170. 

      We have added a reference to Suppl Fig 3 on Line 169 as this shows where the peak in r with ankle angle occurs (114.4 degrees).

      (6) Line 184: Where are the statistical results for this statement?

      The relationship between stress and EMA does not appear to be linear, thus we only present R^2 for the power relationship rather than a p-value.

      (7) Line 192: The authors should explain how joint work and power relate/support the overall hypotheses. This section also refers to Figures 4 and 5 even though Figures 6 and 7 have already been described. Please reorganize.

      We have added a sentence at the end of the work and power section to mention hypothesis (ii) and lead into the discussion where it is elaborated upon. 

      “The increase in positive and negative ankle work may be due to the increase in tendon stress rather than additional muscle work.” (Line 219-220). We have also rearranged the figure order.

      (8) The statistics are not reported in the main text, but in the supplementary tables. If a result is reported in the main text, please report either in-line or with a table in the main text.

      We leave most statistics in the supplementary tables to preserve the readability of the manuscript. We only include values in the main text when the magnitude is relevant to the arguments raised in the results and discussion.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This paper presents results from four independent experiments, each of which tests for rhythmicity in auditory perception. The authors report rhythmic fluctuations in discrimination performance at frequencies between 2 and 6 Hz. The exact frequency depends on the ear and experimental paradigm, although some frequencies seem to be more common than others.

      Strengths:

      The first sentence in the abstract describes the state of the art perfectly: "Numerous studies advocate for a rhythmic mode of perception; however, the evidence in the context of auditory perception remains inconsistent". This is precisely why the data from the present study is so valuable. This is probably the study with the highest sample size (total of > 100 in 4 experiments) in the field. The analysis is very thorough and transparent, due to the comparison of several statistical approaches and simulations of their sensitivity. Each of the experiments differs from the others in a clearly defined experimental parameter, and the authors test how this impacts auditory rhythmicity, measured in pitch discrimination performance (accuracy, sensitivity, bias) of a target presented at various delays after noise onset.

      Weaknesses:

      (1) The authors find that the frequency of auditory perception changes between experiments. I think they could exploit differences between experiments better to interpret and understand the obtained results. These differences are very well described in the Introduction, but don't seem to be used for the interpretation of results. For instance, what does it mean if perceptual frequency changes from between- to within-trial pitch discrimination? Why did the authors choose this experimental manipulation? Based on differences between experiments, is there any systematic pattern in the results that allows conclusions about the roles of different frequencies? I think the Discussion would benefit from an extension to cover this aspect.

      We believe that interpreting these differences remains difficult and a precise, detailed (and possibly mechanistic) interpretation is beyond the goal of the present study. The main goal of this study was to explore the consistency and variability of effects across variations of the experimental design and samples of participants. Interpreting specific effects, e.g. at particular frequencies, would make sense mostly if differences between experiments have been confirmed in a separate reproduction. Still, we do provide specific arguments for why differences in the outcome between different experiments, e.g. with and without explicit trial initialization by the participants, could be expected. See lines 91ff in the introduction and 786ff in the discussion.

      (2) The Results give the impression of clear-cut differences in relevant frequencies between experiments (e.g., 2 Hz in Experiment 1, 6 Hz in Exp 2, etc), but they might not be so different. For instance, a 6 Hz effect is also visible in Experiment 1, but it just does not reach conventional significance. The average across the three experiments is therefore very useful, and also seems to suggest that differences between experiments are not very pronounced (otherwise the average would not produce clear peaks in the spectrum). I suggest making this point clearer in the text.

      We have revised the conclusions to note that the present data do not support clear-cut differences between experiments. For this reason we also refrain from detailed interpretations of specific effects, as suggested by this reviewer in point 1 above.

      (3) I struggle to understand the hypothesis that rhythmic sampling differs between ears. In most everyday scenarios, the same sounds arrive at both ears, and the time difference between the two is too small to play a role for the frequencies tested. If both ears operate at different frequencies, the effects of the rhythm on overall perception would then often cancel out. But if this is the case, why would the two ears have different rhythms to begin with? This could be described in more detail.

      This hypothesis was not invented by us, but in essence put forward in previous work. The study by Ho et al. CurrBiol 2017 reported rhythmic effects at different frequencies in the left and right ears, and we here tried to reproduce these effects. One could speculate about an ear difference based on studies reporting a right-ear advantage in specific listening tasks, and the idea that different time scales of rhythmic brain activity may specifically prevail in the left and right cortical hemispheres; hence it does not seem improbable that there could be rhythmic effects in both ears at different frequencies. We note this in the introduction, l. 65ff.

      Reviewer #2 (Public review):

      Summary:

      The current study aims to shed light on why previous work on perceptual rhythmicity has led to inconsistent results. They propose that the differences may stem from conceptual and methodological issues. In a series of experiments, the current study reports perceptual rhythmicity in different frequency bands that differ between different ear stimulations and behavioral measures.

      The study suggests challenges regarding the idea of universal perceptual rhythmicity in hearing.

      Strengths:

      The study aims to address differences observed in previous studies about perceptual rhythmicity. This is important and timely because the existing literature provides quite inconsistent findings. Several experiments were conducted to assess perceptual rhythmicity in hearing from different angles. The authors use sophisticated approaches to address the research questions.

      Weaknesses:

      (1) Conceptional concerns:

      The authors place their research in the context of a rhythmic mode of perception. They also discuss continuous vs rhythmic mode processing. Their study further follows a design that seems to be based on paradigms that assume a recent phase in neural oscillations that subsequently influence perception (e.g., Fiebelkorn et al.; Landau & Fries). In my view, these are different facets in the neural oscillation research space that require a bit more nuanced separation. Continuous mode processing is associated with vigilance tasks (work by Schroeder and Lakatos; reduction of low frequency oscillations and sustained gamma activity), whereas the authors of this study seem to link it to hearing tasks specifically (e.g., line 694). Rhythmic mode processing is associated with rhythmic stimulation by which neural oscillations entrain and influence perception (also, Schroeder and Lakatos; greater low-frequency fluctuations and more rhythmic gamma activity). The current study mirrors the continuous rather than the rhythmic mode (i.e., there was no rhythmic stimulation), but even the former seems not fully fitting, because trials are 1.8 s short and do not really reflect a vigilance task. Finally, previous paradigms on phase-resetting reflect more closely the design of the current study (i.e., different times of a target stimulus relative to the reset of an oscillation). This is the work by Fiebelkorn et al., Landau & Fries, and others, which do not seem to be cited here, which I find surprising. Moreover, the authors would want to discuss the role of the background noise in resetting the phase of an oscillation, and the role of the fixation cross also possibly resetting the phase of an oscillation. Regardless, the conceptional mixture of all these facets makes interpretations really challenging. The phase-reset nature of the paradigm is not (or not well) explained, and the discussion mixes the different concepts and approaches. 
      I recommend that the authors frame their work more clearly in the context of these different concepts (affecting large portions of the manuscript).

      Indeed, the paradigms used here and in many similar previous studies incorporate an aspect of phase-resetting, as the presentation of a background noise may effectively reset ongoing auditory cortical processes. Studies trying to probe for rhythmicity in auditory perception in the absence of any background noise have not shown any effect (Zoefel and Heil, 2013), perhaps because the necessary rhythmic processes along auditory pathways are only engaged when some sound is present. We now discuss these points, and also acknowledge the mentioned studies in the visual system; l. 57.

      (2) Methodological concerns:

      The authors use a relatively unorthodox approach to statistical testing. I understand that they try to capture and characterize the sensitivity of the different analysis approaches to rhythmic behavioral effects. However, it is a bit unclear what meaningful effects are in the study. For example, the bootstrapping approach that identifies the percentage of significant variations of sample selections is rather descriptive (Figures 5-7). The authors seem to suggest that 50% of the samples are meaningful (given the dashed line in the figure), even though this is rarely reached in any of the analyses. Perhaps >80% of samples should show a significant effect to be meaningful (at least to my subjective mind). To me, the low percentage rather suggests that there is not too much meaningful rhythmicity present. 

      We note that there is no clear consensus on what fraction of experiments should be expected or how this way of quantifying effects should be precisely valued (l. 441ff). However, we now also clearly acknowledge in the discussion that the effective prevalence is not very high (l. 663).

      I suggest that the authors also present more traditional, perhaps multi-level, analyses: Calculation of spectra, binning, or single-trial analysis for each participant and condition, and the respective calculation of the surrogate data analysis, and then comparison of the surrogate data to the original data on the second (participant) level using t-tests. I also thought the statistical approach undertaken here could have been a bit more clearly/didactically described as well.

      We here realize that our description of the methods was possibly not fully clear. We do follow the strategy as suggested by this reviewer, but rather than comparing actual and surrogate data based on a parametric t-test, we compare these based on a non-parametric percentile-based approach. This has the advantage of not making specific (and possibly not-warranted) assumptions about the distribution of the data. We have revised the methods to clarify this, l. 332ff. 

      The authors used an adaptive procedure during the experimental blocks such that the stimulus intensity was adjusted throughout. In practice, this can be a disadvantage relative to keeping the intensity constant throughout, because, on average, correct trials will be associated with a higher intensity than incorrect trials, potentially making observations of perceptual rhythmicity more challenging. The authors would want to discuss this potential issue. Intensity adjustments could perhaps contribute to the observed rhythmicity effects. Perhaps the rhythmicity of the stimulus intensity could be analyzed as well. In any case, the adaptive procedure may add variance to the data.

      We have added an analysis of task difficulty to the results (new section “Effects of adaptive task difficulty“) to address this. Overall we do not find systematic changes in task difficulty across participants for most of the experiments, but one certainly cannot rule out that this aspect of the design also affects the outcomes. Importantly, we relied on an adaptive task difficulty to actually (or hopefully) reduce variance in the data, by keeping the task difficulty around a certain level. Given the large number of trials collected, not using such an adaptive procedure may result in performance levels around chance or near ceiling, which would make it impossible to detect rhythmic variations in behavior.

      Additional methodological concerns relate to Figure 8. Figures 8A and C seem to indicate that a baseline correction for a very short time window was calculated (I could not find anything about this in the methods section). The data seem very variable and artificially constrained in the baseline time window. It was unclear what the reader might take from Figure 8.

      This figure was intended mostly for illustration of the eye tracking data, but we agree that there is no specific key insight to be taken from this. We removed this. 

      Motivation and discussion of eye-movement/pupillometry and motor activity: The dual task paradigm of Experiment 4 and the reasons for assessing eye metrics in the current study could have been better motivated. The experiment somehow does not fit in very well. There is recent evidence that eye movements decrease during effortful tasks (e.g., Contadini-Wright et al. 2023 J Neurosci; Herrmann & Ryan 2024 J Cog Neurosci), which appears to contradict the results presented in the current study. Moreover, by appealing to active sensing frameworks, the authors suggest that active movements can facilitate listening outcomes (line 677; they should provide a reference for this claim), but it is unclear how this would relate to eye movements. Certainly, a person may move their head closer to a sound source in the presence of competing sound to increase the signal-to-noise ratio, but this is not really the active movements that are measured here. A more detailed discussion may be important. The authors further frame the difference between Experiments 1 and 2 as being related to participants' motor activity. However, there are other factors that could explain differences between experiments. Self-paced trials give participants the opportunity to rest more (inter-trial durations were likely longer in Experiment 2), perhaps affecting attentional engagement. I think a more nuanced discussion may be warranted.

      We expanded the motivation of why self-pacing trials may effectively alter how rhythmic processes affect perception, and now also allude to attention and expectation related effects (l. 786ff). Regarding eye movements we now discuss the results in the light of the previously mentioned studies, but again refrain from a very detailed and mechanistic interpretation (l. 782).

      Discussion:

      The main data in Figure 3 showed little rhythmicity. The authors seem to glance over this fact by simply stating that the same phase is not necessary for their statistical analysis. Previous work, however, showed rhythmicity in the across-participant average (e.g., Fiebelkorn's and similar work). Moreover, one would expect that some of the effects in the low-frequency band (e.g., 2-4 Hz) are somewhat similar across participants. Conduction delays in the auditory system are much smaller than the 0.25-0.5 s associated with 2-4 Hz. The authors would want to discuss why different participants would express so vastly different phases that the across-participant average does not show any rhythmicity, and what this would mean neurophysiologically.

      We now discuss the assumptions and implications of similar or distinct phases of rhythmic processes within and between participants (l. 695ff). In particular we note that different origins of the underlying neurophysiological processes eventually may suggest that such assumptions are or are not warranted.

      An additional point that may require more nuanced discussion is related to the rhythmicity of response bias versus sensitivity. The authors could discuss what the rhythmicity of these different measures in different frequency bands means, with respect to underlying neural oscillations.

      We expanded the discussion to interpret what rhythmic changes in each of the behavioral metrics could imply (l. 706ff).

      Figures:

      Much of the text in the figures seems really small. Perhaps the authors would want to ensure it is readable even for those with low vision abilities. Moreover, Figure 1A is not as intuitive as it could be and may perhaps be made clearer. I also suggest the authors discuss a bit more the potential monoaural vs binaural issues, because the perceptual rhythmicity is much slower than any conduction delays in the auditory system that could lead to interference.

      We tried to improve the font sizes where possible, and discuss the potential monaural origins as suggested by other reviewers. 

      Reviewer #3 (Public review):

      Summary:

      The finding of rhythmic activity in the brain has, for a long time, engendered the theory of rhythmic modes of perception, that humans might oscillate between improved and worse perception depending on states of our internal systems. However, experiments looking for such modes have resulted in conflicting findings, particularly in those where the stimulus itself is not rhythmic. This paper seeks to take a comprehensive look at the effect and various experimental parameters which might generate these competing findings: in particular, the presentation of the stimulus to one ear or the other, the relevance of motor involvement, attentional demands, and memory: each of which is revealed to affect the consistency of this rhythmicity.

      The need the paper attempts to resolve is a critical one for the field. However, as presented, I remain unconvinced that the data would not be better interpreted as showing no consistent rhythmic mode effect. It lacks a conceptual framework to understand why effects might be consistent in each ear but at different frequencies and only for some tasks with slight variants, some affecting sensitivity and some affecting bias.

      Strengths:

      The paper is strong in its experimental protocol and its comprehensive analysis, which seeks to compare effects across several analysis types and slight experiment changes to investigate which parameters could affect the presence or absence of an effect of rhythmicity. The prescribed nature of its hypotheses and its manner of setting out to test them is very clear, which allows for a straightforward assessment of its results.

      Weaknesses:

      There is a weakness throughout the paper in terms of establishing a conceptual framework both for the source of "rhythmic modes" and for the interpretation of the results. Before understanding the data on this matter, it would be useful to discuss why one would posit such a theory to begin with. From a perceptual side, rhythmic modes of processing in the absence of rhythmic stimuli would not appear to provide any benefit to processing. From a biological or homeostatic argument, it's unclear why we would expect such fluctuations to occur in such a narrow-band way when neither the stimulus nor the neurobiological circuits require it.

      We believe that the framework for why there may be rhythmic activity along auditory pathways that shapes behavioral outcomes has been laid out in many previous studies, prominently here (Schroeder et al., 2008; Schroeder and Lakatos, 2009; Obleser and Kayser, 2019). Many of the relevant studies are cited in the introduction, which is already rather long given the many points covered in this study. 

      Secondly, for the analysis to detect a "rhythmic mode", it must assume that the phase of fluctuations across an experiment (i.e., whether fluctuations are in an up-state or down-state at onset) is constant at stimulus onset, whereas most oscillations do not have such a total phase-reset as a result of input. Therefore, some theoretical positing of what kind of mechanism could generate this fluctuation is critical toward understanding whether the analysis is well-suited to the studied mechanism.

      In line with this and previous comments (by reviewer 2) we have expanded the discussion to consider the issue of phase alignment (l. 695ff). 

      Thirdly, an interpretation of why we should expect left and right ears to have distinct frequency ranges of fluctuations is required. There are a large number of statistical tests in this paper, and it's not clear how multiple comparisons are controlled for, apart from experiment 4 (which specifies B&H false discovery rate). As such, one critical method to identify whether the results are not the result of noise or sample-specific biases is the plausibility of the finding. On its face, maintaining distinct frequencies of perception in each ear does not fit an obvious conceptual framework.

      Again this point was also noted by another reviewer and we expanded the introduction and discussion in this regard (l. 65ff).

      Reviewer #1 (Recommendations for the authors):

      (1) An update of the AR-surrogate method has recently been published (https://doi.org/10.1101/2024.08.22.609278). I appreciate that this is a lot of work, and it is of course up to the authors, but given the higher sensitivity of this method, it might be worth applying it to the four datasets described here.

      Reading this article, we note that our implementation of the AR-surrogate method was essentially as suggested here, and not as implemented by Brookshire. In fact, we had not realized that Brookshire had apparently computed the spectrum based on the group-average data. As explained in the Methods section, and as we have now clarified further, we compute for each participant the actual spectrum of this participant’s data and a set of surrogate spectra. We then group-average both and compute the p-value of the actual group-average from the percentile of the distribution of surrogate averages. This second step differs from Harris & Beale, who used a one-sided t-test. The latter is most likely not appropriate in a strict statistical sense, but possibly more powerful for detecting true results compared to the percentile-based approach that we used (see l. 332ff).
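
      The group-level comparison described in this response can be sketched as follows. This is a minimal illustration, not the authors' code: it handles a single frequency, assumes the per-participant actual and surrogate values have already been computed, pairs surrogates across participants by index, and uses a +1 correction in the p-value, which is an assumption not stated in the response.

```python
from statistics import fmean

def group_percentile_p(actual, surrogates):
    """One-sided p-value of the group-average actual value.

    actual: one value per participant (e.g. spectral power at one frequency).
    surrogates: one list per participant, each holding the same number of
        surrogate values for that participant (hypothetical input format).
    """
    n_sur = len(surrogates[0])
    # Group average of the actual data.
    group_actual = fmean(actual)
    # Group average of the k-th surrogate across participants, for every k.
    group_sur = [fmean([s[k] for s in surrogates]) for k in range(n_sur)]
    # Count surrogate averages at least as large as the actual average.
    n_at_least = sum(v >= group_actual for v in group_sur)
    # Assumption: +1 correction so that p is never exactly zero.
    return (n_at_least + 1) / (n_sur + 1)

# Toy example with two participants and three surrogates each.
print(group_percentile_p([10.0, 10.0], [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]))
```

      Unlike a second-level t-test, this comparison makes no assumption about the distribution of the per-participant values; its resolution is limited by the number of surrogates.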

      (2) When results for the four experiments are reported, a reminder for the reader of how these experiments differ from each other would be useful.

      We have added this in the Results section.

      "considerable prevalence of differences around 4Hz, with dual‐task requirements leading to stronger rhythmicity in perceptual sensitivity". There is a striking similarity to recently published data (https://doi.org/10.1101/2024.08.10.607439 ) demonstrating a 4-Hz rhythm in auditory divided attention (rather than between modalities as in the present case). This could be a useful addition to the paragraph.

      We have added a reference to this preprint, and additional previous work pointing in the same direction mentioned in there.  

      (3) There are two typos in the Introduction: "related by different from the question", and below, there is one "presented" too much.

      These have been fixed.

      Reviewer #3 (Recommendations for the authors):

      My major suggestion is that these results must be replicated in a new sample. I understand this is not simple to do and not always possible, but at this point, no effect is replicated from one experiment to the next, despite very small changes in protocol (especially experiment 1 vs 2). It's therefore very difficult to justify explaining the different effects as real as opposed to random effects of this particular sample. While the bootstrapping effects show the level of consistency of the effect within the sample studied, it can not be a substitute for a true replication of the results in a new sample.

      We agree that only an independent replication can demonstrate the robustness of the results. We do consider experiment 1 a replication test of Ho et al. CurrBiol 2017, which yields different results than reported there. But more importantly, we consider the analysis of ‘reproducibility’ by simulating participant samples a key novelty of the present work, and want to emphasize this over the within-study replication of the same experiment. In fact, in light of the present interpretation of the data, even a within-study replication would most likely not offer a clear-cut answer.

      As I said in the public review, the interpretation of the results, and of why perceptual cycles in arhythmic stimuli could be a plausible theory to begin with, is lacking. A conceptual framework would vastly improve the impact and understanding of the results.

      We tried to strengthen the conceptual framework in the introduction. We believe that this is in large part provided by previous work, and the aim of the present study was to explore the robustness of effects and not to suggest and discover novel effects.

      Minor comments:

      (1) The authors adapt the difficulty as a function of performance, which seems to me a strange choice for an experiment that is analyzing the differences in performance across the experiment. Could you add a sentence to discuss the motivation for this choice?

      We now mention the rationale in the Methods section and in a new section of the Results. There we also provide additional analyses on this parameter.

      (2) The choice to plot the p-values as opposed to the values of the actual analysis feels ill-advised to me. It invites comparison across analyses that isn't necessarily fair. It would be more informative to plot the respective analysis outputs (spectral power, regression, or delta R2) and highlight the windows of significance and their overlap across analyses. In my opinion, this would be more fair and accurate depiction of the analyses as they are meant to be used.

      We do disagree. As explained in the Methods (l. 374ff): “(Showing p-values) … allows presenting the results on a scale that can be directly compared between analysis approaches, metrics, frequencies and analyses focusing on individual ears or the combined data. Each approach has a different statistical sensitivity, and the underlying effect sizes (e.g. spectral power) vary with frequency for both the actual data and null distribution. As a result, the effect size reaching statistical significance varies with frequency, metrics and analyses.” 

      The fact that the level of power (or R2 or whatever metric we consider) required to reach significance differs between analyses (one ear, both ears), metrics (d-prime, bias, RT) and analysis approaches makes showing the results difficult, as we would need a separate panel for each of those. This would multiply the number of panels required, e.g., for Figure 4 by 3, making it a figure with 81 axes. Also, neither the original quantities of each analysis (e.g. spectral power) nor the p-values that we show constitute a proper measure of effect size in a statistical sense. In that sense, neither of these is truly ideal for comparing between analyses, metrics etc.

      We do agree, though, that many readers may want to see the original quantification and thresholds for statistical significance. We now show these in an exemplary manner for the Binned analysis of Experiment 1, which provides a positive result and also is an attempt to replicate the findings by Ho et al 2017. This is shown in new Figure 5.
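
      The argument that the effect size needed for significance depends on the null distribution can be illustrated with a minimal sketch. It assumes an empirical p-value computed against surrogate (null) values; the function name and all numbers are hypothetical and are not taken from the study's actual analysis pipeline.

```python
def empirical_p_value(observed, null_values):
    """One-sided empirical p-value: share of null values at least as
    large as the observed value, with the usual +1 correction."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)

# Hypothetical surrogate spectral power at two frequencies (arbitrary units).
# The null distribution is assumed to be broader at the low frequency.
null_low_freq = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
null_high_freq = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

observed_power = 2.0  # identical raw effect at both frequencies

p_low = empirical_p_value(observed_power, null_low_freq)
p_high = empirical_p_value(observed_power, null_high_freq)
print(f"low frequency:  p = {p_low:.2f}")
print(f"high frequency: p = {p_high:.2f}")
```

      In this toy case the same raw power corresponds to very different p-values, which is why a shared p-value axis is easier to compare across frequencies than the raw quantities.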

      (3) Typo in line 555 (+ should be plus minus).

      (4) Typo in line 572: "Comparison of 572 blocks with minus dual task those without"

      (5) Typo in line 616: remove "one".

      (6) Line 666 refers to effects in alpha band activity, but it's unclear what the relationship is to the authors' findings, which peak around 6 Hz, lower than alpha (~10 Hz).

      (7) Line 688 typo, remove "amount of".

      These points have been addressed.  

      (8) Oculomotor effect that drives greater rhythmicity at 3-4 Hz. Did the authors analyze the eye movements to see if saccades were also occurring at this rate? It would be useful to know if the 3-4 Hz effect is driven by "internal circuitry" in the auditory system or by the typical rate of eye movement.

      A preliminary analysis of eye movement data was in previous Figure 8, which was removed on the recommendation of another reviewer. This showed that the average saccade rate is about 0.01 saccades per trial per time bin, amounting to on average less than one detected saccade per trial. Hence rhythmicity in saccades is unlikely to explain rhythmicity in behavioral data at the scale of 3-4 Hz. We now note this in the Results.

      Obleser J, Kayser C (2019) Neural Entrainment and Attentional Selection in the Listening Brain. Trends Cogn Sci 23:913-926.

      Schroeder CE, Lakatos P (2009) Low-frequency neuronal oscillations as instruments of sensory selection. Trends Neurosci 32:9-18.

      Schroeder CE, Lakatos P, Kajikawa Y, Partan S, Puce A (2008) Neuronal oscillations and visual amplification of speech. Trends Cogn Sci 12:106-113.

      Zoefel B, Heil P (2013) Detection of Near-Threshold Sounds is Independent of EEG Phase in Common Frequency Bands. Front Psychol 4:262.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This is an interesting study characterizing and engineering so-called bathy phytochromes, i.e., those that respond to near infrared (NIR) light in the ground state, for optogenetic control of bacterial gene expression. Previously, the authors have developed a structure-guided approach to functionally link several light-responsive protein domains to the signaling domain of the histidine kinase FixL, which ultimately controls gene expression. Here, the authors use the same strategy to link bathy phytochrome light-responsive domains to FixL, resulting in sensors of NIR light. Interestingly, they also link these bathy phytochrome light-sensing domains to signaling domains from the tetrathionate-sensing SHK TtrS and the toluene-sensing SHK TodS, demonstrating the generality of their protein engineering approach more broadly across bacterial two-component systems.

      This is an exciting result that should inspire future bacterial sensor design. They go on to leverage this result to develop what is, to my knowledge, the first system for orthogonally controlling the expression of two separate genes in the same cell with NIR and Red light, a valuable contribution to the field.

      Finally, the authors reveal new details of the pH-dependent photocycle of bathy phytochromes and demonstrate that their sensors work in the gut- and plant-relevant strains E. coli Nissle 1917 and A. tumefaciens.

      Strengths:

      (1) The experiments are well-founded, well-executed, and rigorous.

      (2) The manuscript is clearly written.

      (3) The sensors developed exhibit large responses to light, making them valuable tools for optogenetic applications.

      (4) This study is a valuable contribution to photobiology and optogenetics.

      We thank the reviewer for the positive verdict on our manuscript.

      Weaknesses:

      (1) As the authors note, the sensors are relatively insensitive to NIR light due to the rapid dark reversion process in bathy phytochromes. Though NIR light is generally non-phototoxic, one would expect this characteristic to be a limitation in some downstream applications where light intensities are not high (e.g., in vivo).

      We principally concur with this reviewer’s assessment that delivery of light (of any color) into living tissue can be severely limited by absorption, reflection, and scattering. That notwithstanding, at least two considerations suggest that in-vivo deployment of the pNIRusk setups we presently advance may be feasible.

      First, while the pNIRusk setups are indeed less light-sensitive compared to, e.g., our earlier red-light-responsive pREDusk and pDERusk setups (see Meier et al. Nat Commun 2024), we note that the overall light fluences required for triggering them are in the range of tens of µW per cm². By contrast, optogenetic experiments in vivo, in particular in the neurosciences, often employ light area intensities on the order of mW per cm² and above. Put another way, compared to the optogenetic tools used in these experiments, the pNIRusk setups are actually quite sensitive to light.

      Second, sensitivity to NIR light brings the advantage of superior tissue penetration; see the data reported by Weissleder Nat Biotech 2001 and Ash et al. Lasers Med Sci 2017 (both papers are cited in our manuscript). Based on these data, the intensity of blue light (450 nm) falls off 5-10 times more strongly with penetration depth than that of NIR light (800 nm).

      We have added a brief treatment of these aspects in the Discussion section.
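
      To give a feel for what a 5-fold stronger fall-off means in practice, here is a minimal sketch. It assumes simple exponential (Beer-Lambert-type) attenuation, which is a simplification for scattering tissue, and the attenuation coefficients are hypothetical placeholders rather than values from Weissleder or Ash et al.

```python
import math

def transmitted_fraction(mu_per_mm, depth_mm):
    """Fraction of incident intensity remaining at a given depth,
    assuming exponential attenuation I(d) = I0 * exp(-mu * d)."""
    return math.exp(-mu_per_mm * depth_mm)

# Hypothetical effective attenuation coefficients (per mm), illustration only.
MU_NIR = 0.2           # around 800 nm
MU_BLUE = 5 * MU_NIR   # around 450 nm, assumed 5 times stronger attenuation

for depth in (1.0, 2.0, 5.0):
    nir = transmitted_fraction(MU_NIR, depth)
    blue = transmitted_fraction(MU_BLUE, depth)
    print(f"{depth:.0f} mm: NIR {nir:.3f}, blue {blue:.3f}")
```

      Under these assumptions the gap between the two wavelengths widens rapidly with depth, because the factor of 5 sits in the exponent.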

      (2) Though they can be multiplexed with Red light sensors, these bathy phytochrome NIR sensors are more difficult to multiplex with other commonly used light sensors (e.g., blue) due to the broad light responsivity of the Pfr state. This challenge may be overcome by careful dosing of blue light, as the authors discuss, but other bacterial NIR sensing systems with less cross-talk may be preferred in some applications.

      The reviewer is correct in noting that, at least to a certain extent, the pNIRusk systems also respond to blue light owing to their Soret absorbance bands (see Fig. 1). That said, we note two points:

      First, a given photoreceptor that preferentially responds to certain wavelengths, e.g., 700 nm in the case of conventional bacterial phytochromes (BphP), generally absorbs shorter wavelengths to some degree as well. Absorption of these shorter wavelengths suffices for driving electronic and/or vibronic transitions of the chromophore to higher energy levels which often give rise to productive photochemistry and downstream signal transduction. Put another way, a certain response of sensory photoreceptors to shorter wavelengths is hence fully expected and indeed experimentally borne out, as for instance shown by Ochoa-Fernandez et al. in the so-called PULSE setup (Nat Meth 2020, doi: 10.1038/s41592-020-0868-y).

      Second, known BphPs share similar Pr and Pfr absorbance spectra. We therefore expect other BphP-based optogenetic setups to also respond to blue light to some degree. Currently, there are insufficient data to gauge whether individual BphPs systematically differ in their relative sensitivity to blue compared to red or NIR light. Arguably, pertinent experiments may be an interesting subject for future study.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Meier et al. engineer a new class of light-regulated two-component systems. These systems are built using bathy-bacteriophytochromes that respond to near-infrared (NIR) light. Through a combination of genetic engineering and systematic linker optimization, the authors generate bacterial strains capable of selective and tunable gene expression in response to NIR stimulation. Overall, these results are an interesting expansion of the optogenetic toolkit into the NIR range. The cross-species functionality of the system, modularity, and orthogonality have the potential to make these tools useful for a range of applications.

      Strengths:

      (1) The authors introduce a novel class of near-infrared light-responsive two-component systems in bacteria, expanding the optogenetic toolbox into this spectral range.

      (2) Through engineering and linker optimization, the authors achieve specific and tunable gene expression, with minimal cross-activation from red light in some cases.

      (3) The authors show that the engineered systems function robustly in multiple bacterial strains, including laboratory E. coli, the probiotic E. coli Nissle 1917, and Agrobacterium tumefaciens.

      (4) The combination of orthogonal two-component systems can allow for simultaneous and independent control of multiple gene expression pathways using different wavelengths of light.

      (5) The authors explore the photophysical properties of the photosensors, investigating how environmental factors such as pH influence light sensitivity.

      Weaknesses:

      (1) The expression of multi-gene operons and fluorescent reporters could impose a metabolic burden. The authors should present data comparing optical density for growth curves of engineered strains versus the corresponding empty-vector control to provide insight into the burden and overall impact of the system on host viability and growth.

      In response to this comment, we have recorded growth kinetics of bacteria harboring the pNIRusk-DsRed plasmids or empty vectors under both inducing (i.e., under NIR light) and non-inducing conditions (i.e., darkness). We did not observe systematic differences in the growth kinetics between the different cultures, thus suggesting that under the conditions tested there is no adverse effect on cell viability.

      We include the new data in Suppl. Fig. 5c-d and refer to them in the main text.

      (2) The manuscript consistently presents normalized fluorescence values, but the method of normalization is not clear (Figure 2 caption describes normalizing to the maximal fluorescence, but the maximum fluorescence of what?). The authors should provide a more detailed explanation of how the raw fluorescence data were processed. In addition, or potentially in exchange for the current presentation, the authors should include the raw fluorescence values in supplementary materials to help readers assess the actual magnitude of the reported responses.

      We appreciate this valid comment and have altered the representation of the fluorescence data. All values for a given fluorescent protein (i.e., either DsRed or YPet) across all systems are now normalized to a single reference value, thus enabling direct comparison between experiments.
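
      The normalization scheme described here can be sketched as follows. This is a minimal illustration only: the system names, raw readings, and the choice of reference value are hypothetical and do not reproduce the manuscript's data or processing script.

```python
def normalize_to_reference(raw, reference):
    """Divide every raw reading by one shared reference value."""
    if reference <= 0:
        raise ValueError("reference value must be positive")
    return {name: value / reference for name, value in raw.items()}

# Hypothetical raw DsRed readings (arbitrary units) from two systems.
raw_dsred = {
    "system A, dark": 200.0,
    "system A, NIR": 1800.0,
    "system B, dark": 50.0,
    "system B, NIR": 400.0,
}
reference_dsred = 2000.0  # hypothetical single reference for all DsRed data

normalized = normalize_to_reference(raw_dsred, reference_dsred)
for name, value in normalized.items():
    print(f"{name}: {value:.3f}")
```

      Because every reading for a given fluorescent protein shares one denominator, ratios between systems are preserved, which per-panel normalization to each panel's own maximum would not guarantee.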

      (3) Related to the prior point, it would be useful to have a positive control for fluorescence that could be used to compare results across different figure panels.

      As all data are now normalized to the same reference value, direct comparison across all figures is enabled.

      (4) Real-time gene expression data are not presented in the current manuscript, but it would be helpful to include a time-course for some of the key designs to help readers assess the speed of response to NIR light.

      In response to this comment, we include in the revised manuscript induction kinetics of bacterial cultures bearing pNIRusk upon transfer to inducing NIR-light conditions. To this end, aliquots were taken at discrete timepoints, transcriptionally and translationally arrested, and analyzed for optical density and DsRed reporter fluorescence after allowing for chromophore maturation.

      We include the new data in Suppl. Fig. 5e and refer to them in the manuscript.

      Moreover, we note that the experiments in Agrobacterium tumefaciens used a luciferase reporter thus enabling the continuous monitoring of the light-induced expression kinetics. These data (unchanged in revision) are to be found in Suppl. Fig. 9.

      Reviewer #3 (Public review):

      Summary:

      This paper by Meier et al introduces a new optogenetic module for the regulation of bacterial gene expression based on "bathy-BphP" proteins. Their paper begins with a careful characterization of kinetics and pH dependence of a few family members, followed by extensive engineering to produce infrared-regulated transcriptional systems based on the authors' previous design of the pDusk and pDERusk systems, and closing with characterization of the systems in bacterial species relevant for biotechnology.

      Strengths:

      The paper is important from the perspective of fundamental protein characterization, since bathy-BphPs are relatively poorly characterized compared to their phytochrome and cyanobacteriochrome cousins. It is also important from a technology development perspective: the optogenetic toolbox currently lacks infrared-stimulated transcriptional systems. Infrared light offers two major advantages: it can be multiplexed with additional tools, and it can penetrate into deep tissues with ease relative to the more widely used blue light-activated systems. The experiments are performed carefully, and the manuscript is well written.

      Weaknesses:

      My major criticism is that some information is difficult to obtain, and some data is presented with limited interpretation, making it difficult to obtain intuition for why certain responses are observed. For example, the changes in red/infrared responses across different figures and cellular contexts are reported but not rationalized. Extensive experiments with variable linker sequences were performed, but the rationale for linker choices was not clearly explained. These are minor weaknesses in an overall very strong paper.

      We are grateful for the positive take on our manuscript.

      Reviewer #1 (Recommendations for the authors):

      (1) As eLife is a broad audience journal, please define the Soret and Q-bands (line 125).

      We concur and have added labels in fig. 1a that designate the Soret and Q bands.

      (2) The initial (0) Ac design in Figure 2b is activated by NIR and Red light, albeit modestly. The authors state that this construct shows "constant reporter fluorescence, largely independent of illumination" (line 167). This language should be changed to reflect the fact that this Ac construct responds to both of these wavelengths.

      Agreed. We have amended the text accordingly.

      (3) pNIRusk Ac 0 appears to show a greater light response than pNIRusk Av -5. However, the authors claim that the former is not light-responsive and the latter is. This conclusion should be explained or changed.

      The assignment of pNIRusk Av-5 as light-responsive is based on the relative difference in reporter fluorescence between darkness and illumination with either red or NIR light. Although the overall fluorescence is much lower in Av-5 than for Av-0, the relative change upon illumination is much more pronounced. We add a statement to this effect to the text.
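
      The distinction between absolute and relative change can be made concrete with a short sketch. The two variants and their readings below are hypothetical and are not values from Figure 2.

```python
def fold_change(dark, lit):
    """Relative change in reporter fluorescence upon illumination."""
    return lit / dark

# Hypothetical normalized reporter fluorescence for two variants.
high_baseline = {"dark": 0.60, "NIR": 0.90}  # bright, weakly regulated
low_baseline = {"dark": 0.02, "NIR": 0.10}   # dim, strongly regulated

for label, v in (("high baseline", high_baseline), ("low baseline", low_baseline)):
    absolute = v["NIR"] - v["dark"]
    relative = fold_change(v["dark"], v["NIR"])
    print(f"{label}: absolute change {absolute:.2f}, fold change {relative:.1f}")
```

      In this toy example the dim variant shows the smaller absolute change but the larger fold change, which is the criterion used here for calling a variant light-responsive.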

      (4) The authors state that "when combining DmDERusk-Str-YPet with AvTod+21-DsRed expression rose under red and NIR light, respectively, whereas the joint application of both light colors induced both reporter genes" (lines 258-261). In contrast, Figure 3c shows that application of both wavelengths of light results in exclusive activation of YPet expression. It appears the description of the data is wrong and must be corrected. That said, this error does not impact their conclusion that two separate target genes can be independently activated by NIR and red light.

      We thank the reviewer for catching this error which we have corrected in the revised manuscript.

      (5) Line 278: I don't agree with the authors' blanket statement that the use of upconversion nanoparticles is a "grave" limitation for NIR-light mediated activation of bacterial gene expression in vivo. The authors should either expound on the severity of the limitation or use more moderate language.

      We have replaced the word ‘grave’ by ‘potential’ and thereby toned down our wording.

      Reviewer #2 (Recommendations for the authors):

      (1) Please include a discussion on the expected depth penetration of different light wavelengths. This is most relevant in the context of the discussion about how these NIR systems could be used with living therapeutics.

      Given the heterogeneity of biological tissue, it is challenging to state precise penetration depths for different wavelengths of light. That said, blue light for instance is typically attenuated by biological tissue around 5 to 10 times as strongly as near-infrared light is.

      We have expanded the Discussion chapter to cover these aspects.

      (2) It would be helpful for Figure 2C (or supplementary) to also include the response to blue light stimulation.

      We agree and have acquired pertinent data for the blue-light response. The new data are included in an updated Fig. 2c. Data acquired at varying NIR-light intensities, originally included in Fig. 2c, have been moved to Suppl. Fig. 5a-b.

      (3) In Figure 4A, data on the response of E. coli Nissle to blue and red light are missing. Including this would help identify whether the reduced sensitivity to non-NIR wavelengths observed in the E. coli lab strain is preserved in the probiotic background.

      In response to this comment, we have acquired pertinent data on E. coli Nissle. While the results were overall similar to those in the laboratory strain, the response to blue and NIR light was yet lower in the Nissle bacteria which stands to benefit optogenetic applications.

      We have updated Fig. 4a accordingly. For clarity, we only show the data for AvNIRusk in the main paper but have relegated the data on AcNIRusk to Suppl. Fig. 8. (Note that this has necessitated a renumbering of the subsequent Suppl. Figs.)

      (4) On many of the figures, there are thin gray lines that appear between the panels that it would be nice to eliminate because, in some cases, they cut through words and numbers.

      The grey lines likely arose from embedding the figures into the text document. In the typeset manuscript, which has become available on the eLife webpage in the meantime, there are no such lines. That said, we will carefully check throughout the submission/publishing/proofing process lest these lines reappear.

      (5) Page 7, line 155: "As not least seen" typo or awkward phrasing.

      We have restructured the sentence and thereby hopefully clarified the unclear phrasing.

      (6) Page 7, line 167: It does not appear to be the case that the initial pNIRusk designs show constant fluorescence that is largely independent of illumination. AcNIRusk shows an almost twofold change from dark to NIR. Reword this to avoid confusion.

      We concur with this comment, similar to reviewer #1’s remark, and have adjusted the text accordingly.

      (7) Page 8, line 174: Related to the previous point, AvNIRusk has one design that is very minimally light switchable (-5), so stating that six light switchable designs have been identified is also confusing.

      As stated in our response to reviewer #1 above, the assignment of AvNIRusk-5 as light-switchable is based on the relative fluorescence change upon illumination. We have added an explanation to the text.

      (8) Page 10, line 228-229: I was not able to find the data showing that expression levels were higher for the DmTtr systems than the pREDusk and pNIRusk setups. This may be an issue related to the normalization point. It was not clear to me how to compare these values.

      We apologize for the initially unclear representation of the data. In response to this reviewer’s general comments above, we have now normalized all fluorescence values to a single reference value, thus allowing their direct comparison.

      (9) Page 12, line 264: "finer-grained expression control can be exerted..." Either show data or adjust the language so that it is clear this is a prediction.

      True, we have replaced ‘can’ by ‘could’.

      (10) Page 25, line 590: CmpX13 cells have a reference that is given later, but it should be added where it first appears.

      Agreed, we have added the reference in the indicated place.

      (11) Page 25, line 592: define LB/Kan.

      We had already defined this abbreviation further up but, for clarity, we have added it again in the indicated position.

      (12) Page 40, line 946: "normalized by" rather than "to".

      We have implemented the requested change in the indicated and several other positions of the manuscript.

      (13) Figures 2C, 3C, and similar plots in the supplementary material would benefit from having a legend for the colors.

      We agree and have added pertinent legends to the corresponding main and supplementary figures.

      (14) As a reader, I had some trouble following all the acronyms. This is at the author's discretion, but I would eliminate ones that are not strictly essential (e.g. MTP for microtiter plate; I was unable to identify what "MCS" meant; look for other opportunities to remove acronyms).

      In the revised manuscript, we have defined the abbreviation ‘MCS’ (for ‘multiple-cloning site’) upon first occurrence. We have decided to retain the abbreviation ‘MTP’ in the text.

      (15) Could the authors briefly speculate on why A. tumefaciens activation with red light might occur?

      While we can but speculate as to the underlying reasons for the divergent red-light response in A. tumefaciens, we discuss possible scenarios below.

      Commonly, two-component systems (TCS) exhibit highly cooperative and steep responses to signal. As a consequence, even small differences in the intracellular amounts of phosphorylated and unphosphorylated response regulator (RR) can give rise to significantly changed gene-expression output. Put another way, the gene-expression output need not scale linearly with the extent of RR phosphorylation but, rather, is expected to show nonlinear dependence with pronounced thresholding effects.

      Differences in the pertinent RR levels can for instance arise from variations in the expression levels of the pNIRusk system components between E. coli and A. tumefaciens. Moreover, the two bacteria greatly differ in their TCS repertoire. Although TCSs are commonly well insulated from each other, cross-talk with endogenous TCSs, even if limited, may cause changes in the levels of phosphorylated RR and hence gene-expression output. In a similar vein, the RR can also be phosphorylated and dephosphorylated non-enzymatically, e.g., by reaction with high-energy anhydrides (such as acetyl phosphate) and hydrolysis, respectively. Other potential origins for the divergent red-light response include differences in the strength of the promoters driving expression of the pNIRusk system components and the fluorescent/luminescent reporters, respectively.
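
      The thresholding argument above can be illustrated with a minimal sketch that assumes a Hill-type dependence of gene-expression output on the fraction of phosphorylated RR. The Hill function, the half-maximal point, and the Hill coefficient are assumptions chosen for illustration, not parameters measured for the pNIRusk systems.

```python
def hill_output(phospho_fraction, k_half=0.5, n=4.0):
    """Relative gene-expression output as a Hill function of the
    fraction of phosphorylated response regulator (assumed model)."""
    x = phospho_fraction ** n
    return x / (k_half ** n + x)

# Two hypothetical hosts differing modestly in phosphorylated RR fraction.
low_rr, high_rr = 0.4, 0.6

out_low = hill_output(low_rr)
out_high = hill_output(high_rr)
print(f"input ratio:  {high_rr / low_rr:.2f}")
print(f"output ratio: {out_high / out_low:.2f}")
```

      With a Hill coefficient above 1, the output ratio exceeds the input ratio near the threshold, so modest host-dependent shifts in RR phosphorylation could plausibly translate into clearly different reporter levels.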

      (16) It would be helpful for the authors to briefly explain why they needed to switch to luminescence from fluorescence for the A. tumefaciens studies.

      While there was no strict necessity to switch from the fluorescence-based system used in E. coli to a luminescence-based system in A. tumefaciens, we opted for luminescence based on prior experience with other Alphaproteobacteria (e.g., 10.1128/mSystems.00893-21), where luminescence offered significant advantages. Specifically, it provides essentially background-free signal detection and greater sensitivity for monitoring gene expression. In addition, as demonstrated in Suppl. Fig. 9c and d, the luminescence system enables real-time tracking of gene expression dynamics, which further supported its use in our experimental setup (see our response to reviewer #2’s general comments).

      (17) This is a very minor comment that the authors can take or leave, but I got hung up on the word "implement" when it appeared a few times in the manuscript because I tended to read it as "put a plan into place" rather than its other meaning.

      In the abstract, we have replaced one instance of the word ‘implement’ by ‘instrument’.

      (18) The authors should include the relevant constructs on AddGene or another public strain-sharing service.

      We whole-heartedly subscribe to the idea of freely sharing research materials with fellow scientists. Therefore, we had already deposited the most relevant AvNIRusk in Addgene, even prior to the initial submission of the manuscript (accession number 235084). In the meantime, we have released the deposition, and the plasmid can be obtained from Addgene since May 15th of this year.

      Reviewer #3 (Recommendations for the authors):

      Suggestion for improvement:

      This paper relies heavily on variations in linker sequences to shift responses. I am familiar with prior work from the Moglich lab in which helical linkers were employed to shift responses in synthetic two-component systems, with interesting periodicity in responses with every 7 residues (as expected for an alpha helix) and inversion of responses at smaller linker shifts. There is no mention in this paper whether their current engineering follows a similar rationale, what types of linkers are employed (e.g. flexible vs helical), and whether there is an interpretation for how linker lengths alter responses. Can you explain what classes of linker sequences are used throughout Figures 2 and 3, and whether length or periodicity affects the outcome? This would be very helpful for readers who are new to this approach, or if the rationale here differs from the authors' prior work.

      The PATCHY approach employed at present followed a rationale closely similar to that in our previous studies. That is, linkers were extended/shortened and varied in their sequence by recombining different fragments of the natural linkers of the parental receptors, i.e., the bacteriophytochrome and the FixL sensor histidine kinase, respectively. We have added a statement to this effect in the text and a reference to Suppl. Fig. 3 which illustrates the principal approach.

      Compared to our earlier studies, we isolated fewer receptor variants supporting light-regulated responses, despite covering a larger sequence space. Owing to the sparsity of the light-regulated variants, an interpretation of the linker properties and their correlation with light-regulated activity is challenging. Although doubtless unsatisfying from a mechanistic viewpoint, we therefore refrain from a pertinent discussion which would be premature and speculative at this point. As the reviewer raises a valid and important point, we have expanded the text by referring to our earlier studies and the observed dependence of functional properties on linker composition.

      It is sometimes difficult to intuit or rationalize the differences in red/IR sensitivity across closely related variants. An important example appears in Figure 3C vs 3B. I think the AvTod+21 in 3B should be the equivalent to the DsRed response in the second column of 3C (AvTod+21 + DmDERusk), except, of course, that the bacteria in 3C carry an additional plasmid for the DERusk system. However, in 3B, the response to red light is substantial - ~50% as strong as that for IR, whereas in 3C, red light elicits no response at all. What is the difference? The reason this is important is that the AvTod+21 and DmDERusk represent the best "orthogonal" red and infrared light responses, but this is not at all obvious from 3B, where AvTod+21 still causes a substantial (and for orthogonality, undesirable) response under red light. Perhaps subtle differences in expression level due to plasmid changes cause these differences in light responses? Could the authors test how the expression level affects these responses? The paper would be greatly improved if observations of the diverse red/IR responses could be rationalized by some design criteria.

      As noted above in our response to reviewer #2, we have now normalized all fluorescence readings to joint reference values, thus allowing a better comparison across experiments.

      The reviewer is correct in noting that upon multiplexing, the individual plasmid systems support lower fluorescence levels than when used in isolation. We speculate that the combination of two plasmids may affect their copy numbers (despite the use of different resistance markers and origins of replication) and hence their performance. Likewise, the cellular metabolism may be affected when multiple plasmids are combined. These aspects may well account for the absent red-light response in AvTod+21 in the multiplexing experiments which is – indeed – unexpected. As, at present, we cannot provide a clear rationalization for this effect, we recommend verifying the performance of the plasmid setups when multiplexing.

      The paper uses "red" and "infrared" to refer to ~624 nm and ~800 nm light, respectively. I wonder whether it might be possible to shift these peak wavelengths to obtain even better separation for the multiplexing experiments. Perhaps shifting the specific red wavelength could result in better separation between DERusk and AvTod systems, for example? Could the authors comment on this (maybe based on action spectra of their previously developed tools) or perhaps test a few additional stimulation wavelengths?

      The choice of illumination wavelengths used in these experiments is dictated by the LED setups available for illumination of microtiter plates. On the one hand, we are using an SMD (surface-mount device) three-color LED with a fixed wavelength of the red channel around 624 nm (see Hennemann et al., 2018). On the other hand, we are deploying a custom-built device with LEDs emitting at around 800 nm (see Stüven et al., 2019 and this work). Adjusting these wavelengths is therefore challenging, although without doubt potentially interesting.

      To address this reviewer comment, we have added a statement to the text that the excitation wavelengths may be varied to improve multiplexed applications.

      Additional minor comments:

      (1) Figure 2C: It would be very helpful to place a legend on the figure panel for what the colors indicate, since they are unique to this panel and non-intuitive.

      This comment coincides with one by reviewer #2, and we have added pertinent legends to this and related supplementary figures.

      (2) Figure 3C: it is not obvious which system uses DsRed and which uses YPet in each combination, since the text indicates that all combinations were cloned, and this is not clearly described in the legend. Is it always the first construct in the figure legend listed for DsRed and the second for YPet?

      For clarification, we have revised the x-axis labels in Fig. 3C. (And yes, it is as this reviewer surmises: the first of the two constructs harbored DsRed and the second one YPet.)

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The Major Histocompatibility Complex (MHC) region is a collection of numerous genes involved in both innate and adaptive immunity. MHC genes are famed for their role in rapid evolution and extensive polymorphism in a variety of vertebrates. This paper presents a summary of gene-level gain and loss of orthologs and paralogs within MHC across the diversity of primates, using publicly available data.

      Strengths:

      This paper provides a strong case that MHC genes are rapidly gained (by paralog duplication) and lost over millions of years of macroevolution. The authors are able to identify MHC loci by homology across species, and from this infer gene duplications and losses using phylogenetic analyses. There is a remarkable amount of genic turnover, summarized in Figure 6 and Figure 7, either of which might be a future textbook figure of immune gene family evolution. The authors draw on state-of-the-art phylogenetic methods, and their inferences are robust insofar as the data might be complete enough to draw such conclusions.

      Weaknesses:

      One concern about the present work is that it relies on public databases to draw inferences about gene loss, which is potentially risky if the publicly available sequence data are incomplete. To say, for example, that a particular MHC gene copy is absent in a taxon (e.g., Class I locus F absent in Guenons according to Figure 1), we need to trust that its absence from the available databases is an accurate reflection of its absence in the genome of the actual organisms. This may be a safe assumption, but it rests on the completeness of genome assembly (and gene annotations?) or people uploading relevant data. This reviewer would have been far more comfortable had the authors engaged in some active spot-checking, doing the lab work to try to confirm absences at least for some loci and some species. Without this, a reader is left to wonder whether gene loss is simply reflecting imperfect databases, which then undercuts confidence in estimates of rates of gene loss.

      Indeed, just because a locus has not been confirmed in a species does not necessarily mean that it is absent. As we explain in the Figure 1 caption, only a few species have had their genomes extensively studied (gray background), and only for these species does the absence of a point in this figure mean that a locus is absent. The white background rows represent species that are not extensively studied, and we point out that for these the absence of a point does not mean that a locus is absent from the species, only that it has not yet been discovered. We have also added a parenthetical to the text to explain this (line 156): “Only species with rows highlighted in gray have had their MHC regions extensively studied (and thus only for these rows is the absence of a gene symbol meaningful).”

      While we agree that spot-checking may be a helpful next step, one of the goals of this manuscript is to collect and synthesize the enormous volume of MHC evolution research in the primates, which will serve as a jumping-off point for other researchers to perform important wet lab work.

      Some context would be useful for comparing rates of gene turnover in the MHC to those at other loci. Changes in gene copy number, duplications, and losses of duplicates seem to be common across many loci and many organisms; is MHC exceptional in this regard, or merely behaving like any moderately large gene family? I would very much have liked to see comparable analyses done for other gene families (immune, like TLRs, or non-immune), and quantitative comparisons of evolutionary rates between MHC versus other genes. Does MHC gene composition evolve any faster than a random gene family? At present readers may be tempted to infer this, but evidence is not provided.

      Our companion paper (Fortier and Pritchard, 2025) demonstrates that the MHC is a unique locus in many regards, such as its evidence for deep balancing selection and its excess of disease associations. Thus, we expect that it is evolving faster than any random gene family. It would be interesting to repeat this analysis for other gene families, but that is outside of the scope of this project. Additionally, allele databases for other gene families are not nearly as developed, but as more alleles become available for other polymorphic families, a comparable analysis could become possible.

      We have added a paragraph to the discussion (lines 530-546) to clarify that we do not know for certain whether the MHC gene family is evolving rapidly compared to other gene families.

      While on the topic of making comparisons, the authors make a few statements about relative rates. For instance, lines 447-8 compare gene topology of classical versus non-classical genes; and line 450 states that classical genes experience more turnover. But there are no quantitative values given to these rates to provide numerical comparisons, nor confidence intervals provided (these are needed, given that they are estimates), nor formal statistical comparisons to confirm our confidence that rates differ between types of genes.

      More broadly, the paper uses sophisticated phylogenetic methods, but without taking advantage of macroevolutionary comparative methods that allow model-based estimation of macroevolutionary rates. I found the lack of quantitative measurements of rates of gene gain/loss to be a weakness of the present version of the paper, and something that should be readily remedied. When claiming that MHC Class I genes "turn over rapidly" (line 476) - what does rapidly mean? How rapidly? How does that compare to rates of genetic turnover at other families? Quantitative statements should be supported by quantitative estimates (and their confidence intervals).

      These statements refer to qualitative observations, so we cannot provide numerical values. We simply conclude that certain gene groups evolve faster or slower based on the species and genes present in each clade. It is difficult to provide estimates because of the incomplete sampling of genes that survived to the present day. In addition, the presence or absence of various orthologs in different species still needs to be confirmed, at which point it might be useful to be more quantitative. We have also added a paragraph to the discussion to address this concern and advocate for similar analyses of other gene families in the future when more data is available (lines 530-546).

      The authors refer to 'shared function of the MHC across species' (e.g. line 22); while this is likely true, they are not here presenting any functional data to confirm this, nor can they rule out neofunctionalization or subfunctionalization of gene duplicates. There is evidence in other vertebrates (e.g., cod) of MHC evolving appreciably altered functions, so one may not safely assume the function of a locus is static over long macroevolutionary periods, although that would be a plausible assumption at first glance.

      Indeed, we cannot assume that the function of a locus is static across time, especially for the MHC region. In our research, we read hundreds of papers that each focused on a small number of species or genes and gathered some information about them, sometimes based on functional experiments and sometimes on measures such as dN/dS. These provide some indication of a gene’s broad classification in a species or clade, even if the evidence is preliminary. Where possible, we used this preliminary evidence to give genes descriptors “classical,” “non-classical,” “dual characteristics,” “pseudogene,” “fixed”, or “unfixed.” Sometimes multiple individuals and haplotypes were analyzed, so we could even assign a minimum number of gene copies present in a species. We have aggregated all of these references into Supplementary Table 1 (for Class I/Figure 1) and Supplementary Table 2 (for Class II/Figure 2) along with specific details about which data points in these figures that each reference supports. We realize that many of these classifications are based on a small number of individuals or indirect measures, so they may change in the future as more functional data is generated.

      Reviewer #2 (Public review):

      Summary:

      The authors aim to provide a comprehensive understanding of the evolutionary history of the Major Histocompatibility Complex (MHC) gene family across primate species. Specifically, they sought to:

      (1) Analyze the evolutionary patterns of MHC genes and pseudogenes across the entire primate order, spanning 60 million years of evolution.

      (2) Build gene and allele trees to compare the evolutionary rates of MHC Class I and Class II genes, with a focus on identifying which genes have evolved rapidly and which have remained stable.

      (3) Investigate the role of often-overlooked pseudogenes in reconstructing evolutionary events, especially within the Class I region.

      (4) Highlight how different primate species use varied MHC genes, haplotypes, and genetic variation to mount successful immune responses, despite the shared function of the MHC across species.

      (5) Fill gaps in the current understanding of MHC evolution by taking a broader, multi-species perspective using (a) phylogenomic analytical computing methods such as Beast2, Geneconv, BLAST, and the much larger computing capacities that have been developed and made available to researchers over the past few decades, (b) literature review for gene content and arrangement, and genomic rearrangements via haplotype comparisons.

      (6) The authors' overall conclusions based on their analyses and results are that 'different species employ different genes, haplotypes, and patterns of variation to achieve a successful immune response'.

      Strengths:

      Essentially, much of the information presented in this paper is already well-known in the MHC field of genomic and genetic research, with few new conclusions and with insufficient respect to past studies. Nevertheless, while MHC evolution is a well-studied area, this paper potentially adds some originality through its comprehensive, cross-species evolutionary analysis of primates, focus on pseudogenes and the modern, large-scale methods employed. Its originality lies in its broad evolutionary scope of the primate order among mammals with solid methodological and phylogenetic analyses.

      The main strengths of this study are the use of large publicly available databases for primate MHC sequences, the intensive computing involved, the phylogenetic tool Beast2 to create multigene Bayesian phylogenetic trees using sequences from all genes and species, separated into Class I and Class II groups to provide a backbone of broad relationships to investigate subtrees, and the presentation of various subtrees as species and gene trees in an attempt to elucidate the unique gene duplications within the different species. The study provides some additional insights with summaries of MHC reference genomes and haplotypes in the context of a literature review to identify the gene content and haplotypes known to be present in different primate species. The phylogenetic overlays or ideograms (Figures 6 and 7) in part show the complexity of the evolution and organisation of the primate MHC genes via the orthologous and paralogous gene and species pathways progressively from the poorly-studied NWM, across a few moderately studied ape species, to the better-studied human MHC genes and haplotypes.

      Weaknesses:

      The title 'The Primate Major Histocompatibility Complex: An Illustrative Example of Gene Family Evolution' suggests that the paper will explore how the Major Histocompatibility Complex (MHC) in primates serves as a model for understanding gene family evolution. The term 'Illustrative Example' in the title would be appropriate if the paper aimed to use the primate Major Histocompatibility Complex (MHC) as a clear and representative case to demonstrate broader principles of gene family evolution. That is, the MHC gene family is not just one instance of gene family evolution but serves as a well-studied, insightful example that can highlight key mechanisms and concepts applicable to other gene families. However, this is not the case: this paper only covers specific details of primate MHC evolution without drawing broader lessons for any other gene families. So, the term 'Illustrative Example' is too broad or generalizing. In this case, a term like 'Case Study' or simply 'Example' would be more suitable. Perhaps, 'An Example of Gene Family Diversity' would be more precise. Also, an explanation or 'reminder' is suggested that this study is not about the origins of the MHC genes from the earliest jawed vertebrates per se (~600 mya), but it is an extension within a subspecies set that has emerged relatively late (~60 mya) in the evolutionary divergent pathways of the MHC genes, systems, and various vertebrate species.

      Thank you for your input on the title; we have changed it to “A case study of gene family evolution” instead.

      Thank you also for pointing out the potential confusion about the time span of our study. We have added “Having originated in the jawed vertebrates,” to a sentence in the introduction (lines 38-39). We have also added the sentence “Here, we focus on the primates, spanning approximately 60 million years within the over 500-million-year evolution of the family \citep{Flajnik2010}.” to be more explicit about the context for our work (lines 59-61).

      Phylogenomics. Particular weaknesses in this study are the limitations and problems associated with providing phylogenetic gene and species trees to try and solve the complex issue of the molecular mechanisms involved with imperfect gene duplications, losses, and rearrangements in a complex genomic region such as the MHC that is involved in various effects on the response and regulation of the immune system. A particular deficiency is drawing conclusions based on a single exon of the genes. Different exons present different trees. Which are the more reliable? Why were introns not included in the analyses? The authors attempt to overcome these limitations by including genomic haplotype analysis, duplication models, and the supporting or contradictory information available in previous publications. They succeed in part with this multidiscipline approach, but much is missed because of biased literature selection. The authors should include a paragraph about the benefits and limitations of the software that they have chosen for their analysis, and perhaps suggest some alternative tools that they might have tried comparatively. How were problems with Bayesian phylogeny such as computational intensity, choosing probabilities, choosing particular exons for analysis, assumptions of evolutionary models, rates of evolution, systemic bias, and absence of structural and functional information addressed and controlled for in this study?

      We agree that different exons have different trees, which is exactly why we repeated our analysis for each exon in order to compare and contrast them. In particular, the exons encoding the binding site of the resulting protein (exons 2 and 3 for Class I and exon 2 for Class II) show evidence for trans-species polymorphism and gene conversion. These phenomena lead to trees that do not follow the species tree and are fascinating in and of themselves, which we explore in detail in our companion paper (Fortier and Pritchard, 2025). Meanwhile, the non-peptide-binding extracellular-domain-encoding exon (exon 4 for Class I and exon 3 for Class II) is comparably sized to the binding-site-encoding exons and provides an interesting functional contrast. As this exon is likely less affected by trans-species polymorphism, gene conversion, and convergent evolution, we present results from it most often in the main text, though we occasionally touch on differences between the exons. See lines 191-196, 223-226, and 407-414 for some examples of how we discuss the exons in the text. Additionally, all trees from all of these exons can be found in the supplement. 

      We agree that introns would be valuable to study in this context. Even though the non-binding-site-encoding exons are probably *less* affected by trans-species polymorphism, gene conversion, and convergent evolution, they are still functional. The introns, however, experience much more relaxed selection, if any, and comparing their trees to those for the exons would be valuable and illuminating. We did not generate intron trees for two reasons. Most importantly, there is a dearth of data available for the introns; in the databases we used, there was often intron data available only for human, chimpanzee, and sometimes macaque, and only for a small subset of the genes. This limitation is at odds with the comprehensive, many-gene-many-species approach which we feel is the main novelty of this work. Secondly, the introns that *are* available are difficult to align. Even aligning the exons across such a highly diverged set of genes and pseudogenes was difficult and required manual effort. The introns proved even more difficult to align across genes. In the future, when more intron data is available and sufficient effort is put into aligning them, it will be possible and desirable to do a comparable analysis. We also added a sentence to the “Data” section to briefly explain why we did not include introns (lines 134-135).

      We explain our Bayesian phylogenetics approach in detail in the Methods (lines 650-725), including our assumptions and our solutions to challenges specific to this application. For further explanation of the method itself, we suggest reading the original BEAST and BEAST2 papers (Drummond & Rambaut (2007), Drummond et al. (2012), Bouckaert et al. (2014), and Bouckaert et al. (2019)). Known structural and functional information helped us validate the alignments we used in this study, but the fact that such information is not fully known for every gene and species should not affect the method itself.

      Gene families as haplotypes. In the Introduction, the MHC is referred to as a 'gene family', and in paragraph 2, it is described as being united by the 'MHC fold', despite exhibiting 'very diverse functions'. However, the MHC region is more accurately described as a multigene region containing diverse, haplotype-specific Conserved Polymorphic Sequences, many of which are likely to be regulatory rather than protein-coding. These regulatory elements are essential for controlling the expression of multiple MHC-related products, such as TNF and complement proteins, a relationship demonstrated over 30 years ago. Non-MHC fold loci such as TNF, complement, POU5F1, lncRNA, TRIM genes, LTA, LTB, NFkBIL1, etc, are present across all MHC haplotypes and play significant roles in regulation. Evolutionary selection must act on genotypes, considering both paternal and maternal haplotypes, rather than on individual genes alone. While it is valuable to compile databases for public use, their utility is diminished if they perpetuate outdated theories like the 'birth-and-death model'. The inclusion of prior information or assumptions used in a statistical or computational model, typically in Bayesian analysis, is commendable, but they should be based on genotypic data rather than older models. A more robust approach would consider the imperfect duplication of segments, the history of their conservation, and the functional differences in inheritance patterns. Additionally, the MHC should be examined as a genomic region, with ancestral haplotypes and sequence changes or rearrangements serving as key indicators of human evolution after the 'Out of Africa' migration, and with disease susceptibility providing a measurable outcome. There are more than 7000 different HLA-B and -C alleles at each locus, which suggests that there are many thousands of human HLA haplotypes to study. In this regard, the studies by Dawkins et al (1999 Immunol Rev 167,275), Shiina et al. (2006 Genetics 173,1555) on human MHC gene diversity and disease hitchhiking (haplotypes), and Sznarkowska et al. (2020 Cancers 12,1155) on the complex regulatory networks governing MHC expression, both in terms of immune transcription factor binding sites and regulatory non-coding RNAs, should be examined in greater detail, particularly in the context of MHC gene allelic diversity and locus organization in humans and other primates.

      Thank you for these comments. To clarify that the MHC “region” is different from (and contains) the MHC “gene family” as we describe it, we changed a sentence in the abstract (lines 8-10) from “One large gene family that has experienced rapid evolution is the Major Histocompatibility Complex (MHC), whose proteins serve critical roles in innate and adaptive immunity.” to “One large gene family that has experienced rapid evolution lies within the Major Histocompatibility Complex (MHC), whose proteins serve critical roles in innate and adaptive immunity.” We know that the region is complex and contains many other genes and regulatory sequences; Figure 1 of our companion paper (Fortier and Pritchard, 2025) depicts these in order to show the reader that the MHC genes we focus on are just one part of the entire region.

      We love the suggestion to look at the many thousands of alleles present at each of the classical loci. This is the focus of our complementary paper (Fortier and Pritchard, 2025) which explores variation at the allele level. In the current paper, we look mainly at the differences between genes and the use of different genes in different species.

      Diversifying and/or concerted evolution. Both this and past studies highlight that diversifying or balancing selection is the dominant force in MHC evolution. This is primarily because the extreme polymorphism observed in MHC genes is advantageous for populations in terms of pathogen defence. Diversification increases the range of peptides that can be presented to T cells, enhancing the immune response. The peptide-binding regions of MHC genes are highly variable, and this variability is maintained through selection for immune function, especially in the face of rapidly evolving pathogens. In contrast, concerted evolution, which typically involves the homogenization of gene duplicates through processes like gene conversion or unequal crossing-over, seems to play a minimal role in MHC evolution. Although gene duplication events have occurred in the MHC region leading to the expansion of gene families, the resulting paralogs often undergo divergent evolution rather than being kept similar or homozygous by concerted evolution. Therefore, unlike gene families such as ribosomal RNA genes or histone genes, where concerted evolution leads to highly similar copies, MHC genes display much higher levels of allelic and functional diversification. Each MHC gene copy tends to evolve independently after duplication, acquiring unique polymorphisms that enhance the repertoire of antigen presentation, rather than undergoing homogenization through gene conversion. Also, in some populations with high polymorphism or genetic drift, allele frequencies may become similar over time without the influence of gene conversion. This similarity can be mistaken for gene conversion when it is simply due to neutral evolution or drift, particularly in small populations or bottlenecked species. Moreover, gene conversion might contribute to greater diversity by creating hybrids or mosaics between different MHC genes. In this regard, can the authors indicate what percentage of the genes in their study have been homogenised by gene conversion compared to those that have been diversified by gene conversion?

      We appreciate the summary, and we feel we have appropriately discussed both gene conversion and diversifying selection in the context of the MHC genes. Because we cannot know for sure when and where gene conversion has occurred, we cannot quantify percentages of genes that have been homogenized or diversified.  

      Duplication models. The phylogenetic overlays or ideograms (Figures 6 and 7) show considerable imperfect multigene duplications, losses, and rearrangements, but the paper's Discussion provides no in-depth consideration of the various multigenic models or mechanisms that can be used to explain the occurrence of such events. How do their duplication models compare to those proposed by others? For example, their text simply says on line 292, 'the proposed series of events is not always consistent with phylogenetic data'. How, why, when? Duplication models for the generation and extension of the human MHC class I genes as duplicons (extended gene or segmental genomic structures) by parsimonious imperfect tandem duplications with deletions and rearrangements in the alpha, beta, and kappa blocks were already formulated in the late 1990s and extended to the rhesus macaque in 2004 based on genomic haplotypic sequences. These studies were based on genomic sequences (genes, pseudogenes, retroelements), dot plot matrix comparisons, and phylogenetic analyses of gene and retroelement sequences using computer programs. It already was noted or proposed in these earlier 1999 studies that (1) the ancestor of HLA-P(90)/-T(16)/W(80) represented an old lineage separate from the other HLA class I genes in the alpha block, (2) HLA-U(21) is a duplicated fragment of HLA-A, (3) HLA-F and HLA-V(75) are among the earliest (progenitor) genes or outgroups within the alpha block, (4) distinct Alu and L1 retroelement sequences adjoining HLA-L(30), and HLA-N genomic segments (duplicons) in the kappa block are closely related to those in the HLA-B and HLA-C in the beta block; suggesting an inverted duplication and transposition of the HLA genes and retroelements between the beta and kappa regions. None of these prior human studies were referenced by Fortier and Pritchard in their paper. How does their human MHC class I gene duplication model (Fig. 6) such as gene duplication numbers and turnovers differ from those previously proposed and described by Kulski et al (1997 JME 45,599), (1999 JME 49,84), (2000 JME 50,510), Dawkins et al (1999 Immunol Rev 167,275), and Gaudieri et al (1999 GR 9,541)? Is this a case of reinventing the wheel?

      Figures 6 and 7 are intended to synthesize and reconcile past findings and our own trees, so they do not strictly adhere to the findings of any particular study and cannot fully match all studies. In the supplement, Figure 6 - figure supplement 1 and Figure 7 - figure supplement 1 duly credit all of the past work that went into making these trees. Most previous papers focus on just one aspect of these trees, such as haplotypes within a species, a specific gene or allelic lineage relationship, or the branching pattern of particular gene groups. We believe it was necessary to bring all of these pieces of evidence together. Even among papers with the same focus (to understand the block duplications that generated the current physical layout of the MHC), results differ. For example, Geraghty (1992), Hughes (1995), Kulski (2004)/Kulski (2005), and Shiina (1999) all disagree on the exact branching order of the genes MHC-W, -P, and -T, and of MHC-G, -J, and -K. While the Kulski studies you pointed out were very thorough for their era, they still only relied on data from three species and one haplotype per species. Our work is not intended to replace or discredit these past works, but simply to build upon them with a larger set of species and sequences. We hope the hypotheses we propose in Figures 6 and 7 can help unify existing research and provide a more easily accessible jumping-off point for future work.

      Results. The results are presented as new findings, whereas most if not all of the results' significance and importance already have been discussed in various other publications. Therefore, the authors might do better to combine the results and discussion into a single section with appropriate citations to previously published findings presented among their results for comparison. Do the trees and subsets differ from previous publications, albeit that they might have fewer comparative examples and samples than the present preprint? Alternatively, the results and discussion could be combined and presented as a review of the field, which would make more sense and be more honest than the current format of essentially rehashing old data.

      In starting this project, we found that a large barrier to entry to this field of study is the immense amount of literature published over 30+ years. It is both time-consuming and confusing to read up on the many nuances of the MHC genes, their changing names, and their evolution, making it difficult to start new, innovative projects. We acknowledge that while our results are not entirely novel, the main advantage of our work is that it provides a thorough, comprehensive starting point for others to learn about the MHC quickly and dive into new research. We feel that we have appropriately cited past literature in the main text, appendices, and supplement, so that readers may dive into a particular area with ease.

      Minor corrections:

      (1) Abstract, line 19: 'modern methods'. Too general. What modern methods?

      To keep the abstract brief, the methods are introduced in the main text when each becomes relevant as well as in the methods section.

      (2) Abstract, line 25: 'look into [primate] MHC evolution.' The analysis is on the primate MHC genes, not on the entire vertebrate MHC evolution with a gene collection from sharks to humans. The non-primate MHC genes are often differently organised and structurally evolved in comparison to primate MHC.

      Thank you! We have added the word “primate” to the abstract (line 25).

      (3) Introduction, line 113. 'In a companion paper (Fortier and Pritchard, 2024)' This paper appears to be unpublished. If it's unpublished, it should not be referenced.

      This paper is undergoing the eLife editorial process at the same time; it will have a proper citation in the final version.

      (4) Figures 1 and 2. Use the term 'gene symbols' (circle, square, triangle, inverted triangle, diamond) or 'gene markers' instead of 'points'. 'Asterisks "within symbols" indicate new information.'

      Thank you, the word “symbol” is much clearer! We have changed “points” to “symbols” in the captions for Figure 1, Figure 1 - figure supplement 1, Figure 2, and Figure 2 - figure supplement 1. We also changed this in the text (lines 157-158 and 170).

      (5) Figures. A variety of colours have been applied for visualisation. However, some coloured texts are so light in colour that they are difficult to read against a white background. Could darker colours or black be used for all or most texts?

      With such a large number of genes and species to handle in this work, it was nearly impossible to choose a set of colors that were distinct enough from each other. We decided to prioritize consistency (across this paper, its supplement, and our companion paper) as well as at-a-glance grouping of similar sequences. Unfortunately, this means we had to sacrifice readability on a white background, but readers may turn to the supplement if they need to access specific sequence names.

      (6) Results, line 135. '(Fortier and Pritchard, 2024)' This paper appears to be unpublished. If it's unpublished, it should not be referenced.

      Repeat of (3). This paper is undergoing the eLife editorial process at the same time; it will have a proper citation in the final version.

      (7) Results, lines 152 to 153, 164, 165, etc. 'Points with an asterisk'. Use the term 'gene symbols' (circle, square, triangle, inverted triangle, diamond) or 'gene markers' instead of 'points'. A point is a small dot such as those used in data points for plotting graphs .... The figures are so small that the asterisks in the circles, squares, triangles, etc, look like points (dots) and the points/asterisks terminology that is used is very confusing visually.

      Repeat of (4). Thank you, the word “symbol” is much clearer! We have changed “points” to “symbols” in the captions for Figure 1, Figure 1 - figure supplement 1, Figure 2, and Figure 2 - figure supplement 1. We also changed this in the text (lines 157-158 and 170).

      (8) Line 178 (BEA, 2024) is not listed alphabetically in the References.

      Thank you for catching this! This reference maps to the first bibliography entry, “SUMMARIZING POSTERIOR TREES.” We are unsure how to cite a webpage that has no explicit author within the eLife Overleaf template, so we will consult with the editor.

      (9) Lines 188-190. 'NWM MHC-G does not group with ape/OWM MHC-G, instead falling outside of the clade containing ape/OWM MHC-A, -G, -J and -K.' This is not surprising given that MHC-A, -G, -J, and -K are paralogs of each other and that some of them, especially in NWM have diverged over time from the paralogs and/or orthologs and might be closer to one paralog than another and not be an actual ortholog of OWM, apes or humans.

      We included this sentence to clarify the relationships between genes and to help describe what is happening in Figure 6. Figure 6 - figure supplement 1 includes all of the references that go into such a statement and Appendix 3 details our reasoning for this and other statements.

      (10) Line 249. Gene conversion: This is recombination between two different genes where a portion of the genes are exchanged with one another so that different portions of the gene can group within one or other of the two gene clades. Alternatively, the gene has been annotated incorrectly if the gene does not group within either of the two alternative clades. Another possibility is that one or two nucleotide mutations have occurred without a recombination resulting in a mistaken interpretation or conclusion of a recombination event. What measures are taken to avoid false-positive conclusions? How many MHC gene conversion (recombination) events have occurred according to the authors' estimates?

      All of these possibilities are certainly valid. We used the program GENECONV to infer gene conversion events, but there is considerable uncertainty owing to the ages of the genes and the inevitable point mutations that have occurred post-event. Gene conversion was not the focus of our paper, so we did our best to acknowledge it (and the resulting differences between trees from different exons) without spending too much time diving into it. A list of inferred gene conversion events can be found in Figure 3 - source data 1 and Figure 4 - source data 1.

      (11) Lines 284-286. 'The Class I MHC region is further divided into three polymorphic blocks-alpha, beta, and kappa blocks-that each contains MHC genes but are separated by well-conserved non-MHC genes.' The MHC class I region was first designated into conserved polymorphic duplication blocks, alpha and beta by Dawkins et al (1999 Immunol Rev 167,275), and kappa by Kulski et al (2002 Immunol Rev 190,95), and should be acknowledged (cited) accordingly.

      Thank you for catching this! We have added these citations (lines 302-303)!

      (12) Lines 285-286. 'The majority of the Class I genes are located in the alpha-block, which in humans includes 12 MHC genes and pseudogenes.' This is not strictly correct for many other species, because the majority of class I genes might be in the beta block of new and old-world monkeys, and the authors haven't provided respective counts of duplication numbers to show otherwise. The alpha block in some non-primate mammalian species such as pigs, rats, and mice has no MHC class I genes or only a few. Most MHC class I genes in non-primate mammalian species are found in other regions. For example, see Ando et al (2005 Immunogenetics 57,864) for the pig alpha, beta, and kappa regions in the MHC class I region. There are no pig MHC genes in the alpha block.

      Yes, which is exactly why we use the phrase “in humans” in that particular sentence. The arrangement of the MHC in several other primate reference genomes is shown in Figure 1 - figure supplement 2.

      (13) Line 297 to 299. 'The alpha-block also contains a large number of repetitive elements and gene fragments belonging to other gene families, and their specific repeating pattern in humans led to the conclusion that the region was formed by successive block duplications (Shiina et al., 1999).' There are different models for successive block duplications in the alpha block and some are more parsimonious based on imperfect multigenic segmental duplications (Kulski et al 1999, 2000) than others (Shiina et al., 1999). In this regard, Kulski et al (1999, 2000) also used duplicated repetitive elements neighbouring MHC genes to support their phylogenetic analyses and multigenic segmental duplication models. For comparison, can the authors indicate how many duplications and deletions they have in their models for each species?

      We have added citations to this sentence to show that there are different published models to describe the successive block duplications (line 307). Our models in Figure 6 and Figure 7 are meant to aggregate past work and integrate our own, and thus they were not built strictly by parsimony. References can be found in Figure 6 - figure supplement 1 and Figure 7 - figure supplement 1.

      (14) Lines 315-315. 'Ours is the first work to show that MHC-U is actually an MHC-A-related gene fragment.' This sentence should be deleted. Other researchers had already inferred that MHC-U is actually an MHC-A-related gene fragment more than 25 years ago (Kulski et al 1999, 2000) when the MHC-U was originally named MHC-21.

      While these works certainly describe MHC-U/MHC-21 as a fragment in the 𝛼-block, any relation to MHC-A was by association only and very few species/haplotypes were examined. So although the idea is not wholly novel, we provide convincing evidence that not only is MHC-U related to MHC-A by sequence, but also that it is a very recent partial duplicate of MHC-A. We show this with Bayesian phylogenetic trees as well as an analysis of haplotypes across many more species than were included in those papers.  

      (15) Lines 361-362. 'Notably, our work has revealed that MHC-V is an old fragment.' This is not a new finding or hypothesis. Previous phylogenetic analysis and gene duplication modelling had already inferred HLA-V (formerly HLA-75) to be an old fragment (Kulski et al 1999, 2000).

      By “old,” we mean older than previous hypotheses suggest. Previous work has proposed that MHC-V and -P were duplicated together, with MHC-V deriving from an MHC-A/H/V ancestral gene and MHC-P deriving from an MHC-W/T/P ancestral gene (Kulski (2005), Shiina (1999)). However, our analysis (Figure 5A) shows that MHC-V sequences form a monophyletic clade outside of the MHC-W/P/T group of genes as well as outside of the MHC-A/B/C/E/F/G/J/K/L group of genes, which is not consistent with MHC-A and -V being closely related. Thus, we conclude that MHC-V split off earlier than the differentiation of these other gene groups and is thus older than previously thought. We explain this in the text as well (lines 317-327) and in Appendix 3.  

      (16) Line 431-433. 'the Class II genes have been largely stable across the mammals, although we do see some lineage-specific expansions and contractions (Figure 2 and Figure 2-gure Supplement 2).' Please provide one or two references to support this statement. Is 'gure' a typo?

      We corrected this typo, thank you! This conclusion is simply drawn from the data presented in Figure 2 and Figure 2 - figure supplement 2. The data itself comes from a variety of sources, which are already included in the supplement as Figure 2 - source data 1.

      (17) Line 437. 'We discovered far more "specific" events in Class I, while "broad-scale" events were predominant in Class II.' Please define the difference between 'specific' and 'broad-scale'.

      These terms are defined in the previous sentence (lines 466-469).

      (18) Lines 450-451. 'This shows that classical genes experience more turnover and are more often affected by long-term balancing selection or convergent evolution.' Is balancing selection a form of divergent evolution that is different from convergent evolution? Please explain in more detail how and why balancing selection or convergent evolution affects classical and nonclassical genes differently.

      Balancing selection acts to keep alleles at moderate frequencies, preventing any from fixing in the population. In contrast, convergent evolution describes sequences or traits becoming similar over time even though they are not similar by descent. While we cannot know exactly what selective forces have occurred in the past, we observe different patterns in the trees for each type of gene. In Figures 1 and 2, viewers can see at first glance that the nonclassical genes (which are named throughout the text and thoroughly described in Appendix 3) appear to be longer-lived than the classical genes. In addition, lines 204-222 and 475-488 describe topological differences in the BEAST2 trees of these two types of genes. However, we acknowledge that it could be helpful to have additional, complementary information about the classical vs. non-classical genes. Thus, we have added a sentence and reference to our companion paper (Fortier and Pritchard, 2025), which focuses on long-term balancing selection and draws further contrast between classical and non-classical genes. In lines 481-484, we added “We further explore the differences between classical and non-classical genes in our companion paper, finding ancient trans-species polymorphism at the classical genes but not at the non-classical genes \citep{Fortier2025b}.”

      References

      Some references in the supplementary materials such as Alvarez (1997), Daza-Vamenta (2004), Rojo (2005), Aarnink (2014), Kulski (2022), and others are missing from the Reference list. Please check that all the references in the text and the supplementary materials are listed correctly and alphabetically.

      We will make sure that these all show up properly in the proof.

      Reviewer #3 (Public review):

      Summary:

      The article provides the most comprehensive overview of primate MHC class I and class II genes to date, combining published data with an exploration of the available genome assemblies in a coherent phylogenetic framework and formulating new hypotheses about the evolution of the primate MHC genomic region.

      Strengths:

      I think this is a solid piece of work that will be the reference for years to come, at least until population-scale haplotype-resolved whole-genome resequencing of any mammalian species becomes standard. The work is timely because there is an obvious need to move beyond short amplicon-based polymorphism surveys and classical comparative genomic studies. The paper is data-rich and the approach taken by the authors, i.e. an integrative phylogeny of all MHC genes within a given class across species and the inclusion of often ignored pseudogenes, makes a lot of sense. The focus on primates is a good idea because of the wealth of genomic and, in some cases, functional data, and the relatively densely populated phylogenetic tree facilitates the reconstruction of rapid evolutionary events, providing insights into the mechanisms of MHC evolution. Appendices 1-2 may seem unusual at first glance, but I found them helpful in distilling the information that the authors consider essential, thus reducing the need for the reader to wade through a vast amount of literature. Appendix 3 is an extremely valuable companion in navigating the maze of primate MHC genes and associated terminology.

      Weaknesses:

      I have not identified major weaknesses and my comments are mostly requests for clarification and justification of some methodological choices.

      Thank you so much for your kind and supportive review!

      Reviewer #1 (Recommendations for the authors):

      (1) Line 151: How is 'extensively studied' defined?

      Extensively studied is not a strict definition, but a few organisms clearly stand apart from the rest in terms of how thoroughly their MHC regions have been studied. For example, the macaque is a model organism, and individuals from many different species and populations have had their MHC regions fully sequenced. This is in contrast to the gibbon, for example, in which there is some experimental evidence for the presence of certain genes, but no MHC region has been fully sequenced from these animals.

      (2) Can you clarify how 'classical' and 'non-classical' MHC genes are being determined in your analysis?

      Classical genes are those whose protein products perform antigen presentation to T cells and are directly involved in adaptive immunity, while non-classical genes are those whose protein products do not do this. For example, these non-classical genes might code for proteins that interact with receptors on Natural Killer cells and influence innate immunity. The roles of these proteins are not necessarily conserved between closely related species, and experimental evidence is needed to evaluate this. However, in the absence of such evidence, wherever possible we have provided our best guess as to the roles of the orthologous genes in other species, presented in Figure 1 - source data 1 and Figure 2 - source data 1. This is based on whatever evidence is available at the moment, sometimes experimental but typically based on dN/dS ratios and other indirect measures.

      (3) I find the overall tone of the paper to be very descriptive, and at times meandering and repetitive, with a lot of similar kinds of statements being repeated about gene gain/loss. This is perhaps inevitable because a single question is being asked of each of many subsets of MHC gene types, and even exons within gene types, so there is a lot of repetition in content with a slightly different focus each time. This does not help the reader stay focused or keep track. I found myself wishing for a clearly defined question or hypothesis, or some rate parameter in need of estimation. I would encourage the authors to tighten up their phrasing, or consider streamlining the results with some better signposting to organize ideas within the results.

      We totally understand your critique, as we talk about a wide range of specific genes and gene groups in this paper. To improve readability, we have added many more signposting phrases and sentences:

      “Aside from MHC-DRB, …” (line 173)

      “Now that we had a better picture of the landscape of MHC genes present in different primates, we wanted to understand the genes’ relationships. Treating Class I, Class IIA, and Class IIB separately, ...” (lines 179-180)

      “We focus first on the Class I genes.” (line 191)

      “... for visualization purposes…” (line 195)

      “We find that sequences do not always assort by locus, as would be expected for a typical gene.” (lines 196-197)

      “... rather than being directly orthologous to the ape/OWM MHC-G genes.” (lines 201-202)

      “Appendix 3 explains each of these genes in detail, including previous work and findings from this study.” (lines 202-203)

      “... (but not with NWM) …” (line 208)

      “While genes such as MHC-F have trees which closely match the overall species tree, other genes show markedly different patterns, …” (lines 212-213)

      “Thus, while some MHC-G duplications appear to have occurred prior to speciation events within the NWM, others are species-specific.” (lines 218-219)

      “... indicating rapid evolution of many of the Class I genes” (lines 220-221)

      “Now turning to the Class II genes, …” (line 223)

      “(see Appendix 2 for details on allele nomenclature)” (line 238)

      “(e.g. MHC-DRB1 or -DRB2)” (line 254)

      “...  meaning their names reflect previously-observed functional similarity more than evolutionary relatedness.” (lines 257-258)

      “(see Appendix 3 for more detail)” (line 311)

      “(a 5'-end fragment)” (line 324)

      “Therefore, we support past work that has deemed MHC-V an old fragment.” (lines 326-327)

      “We next focus on MHC-U, a previously-uncharacterized fragment pseudogene containing only exon 3.” (lines 328-329)

      “However, it is present on both chimpanzee haplotypes and nearly all human haplotypes, and we know that these haplotypes diverged earlier---in the ancestor of human and gorilla. Therefore, ...” (lines 331-333)

      “Ours is the first work to show that MHC-U is actually an MHC-A-related gene fragment and that it likely originated in the human-gorilla ancestor.” (lines 334-336)  

      “These pieces of evidence suggest that MHC-K and -KL duplicated in the ancestor of the apes.” (lines 341-342)

      “Another large group of related pseudogenes in the Class I $\alpha$-block includes MHC-W, -P, and -T (see Appendix 3 for more detail).” (lines 349-350)

      “...to form the current physical arrangement” (line 354)

      “Thus, we next focus on the behavior of this subgroup in the trees.” (line 358)

      “(see Appendix 3 for further explanation).” (line 369)

      “Thus, for the first time we show that there must have been three distinct MHC-W-like genes in the ape/OWM ancestor.” (lines 369-371)

      “... and thus not included in the previous analysis. ” (lines 376-377)

      “MHC-Y has also been identified in gorillas (Gogo-Y) (Hans et al., 2017), so we anticipate that Gogo-OLI will soon be confirmed. This evidence suggests that the MHC-Y and -OLI-containing haplotype is at least as old as the human-gorilla split. Our study is the first to place MHC-OLI in the overall story of MHC haplotype evolution” (lines 381-384)

      “Appendix 3 explains the pieces of evidence leading to all of these conclusions (and more!) in more detail.” (lines 395-396)

      “However, looking at this exon alone does not give us a complete picture.” (lines 410-411)

      “...instead of with other ape/OWM sequences, …” (lines 413-414)

      “Figure 7 shows plausible steps that might have generated the current haplotypes and patterns of variation that we see in present-day primates. However, some species are poorly represented in the data, so the relationships between their genes and haplotypes are somewhat unclear.” (lines 427-429)

      “(and more-diverged)” (line 473)

      “(of both classes)” (line 476)

      “..., although the classes differ in their rate of evolution.” (lines 487-488)

      “Including these pseudogenes in our trees helped us construct a new model of $\alpha$-block haplotype evolution.” (lines 517-518)

      (4) Line 480-82: "Notably...." why is this notable? Don't merely state that something is notable, explain what makes it especially worth drawing the reader's attention to: in what way is it particularly significant or surprising?

      We have changed the text from “Notably” to “In particular” (line 390) so that readers are expecting us to list some specific findings. Similarly, we changed “Notably” to “Specifically” (line 515).

      (5) The end of the discussion is weak: "provide context" is too vague and not a strong statement of something that we learned that we didn't know before, or its importance. This is followed by "This work will provide a jumping-off point for further exploration..." such as? What questions does this paper raise that merit further work?

      We have made this paragraph more specific and added some possible future research directions. It now reads “By treating the MHC genes as a gene family and including more data than ever before, this work enhances our understanding of the evolutionary history of this remarkable region. Our extensive set of trees incorporating classical genes, non-classical genes, pseudogenes, gene fragments, and alleles of medical interest across a wide range of species will provide context for future evolutionary, genomic, disease, and immunologic studies. For example, this work provides a jumping-off-point for further exploration of the evolutionary processes affecting different subsets of the gene family and the nuances of immune system function in different species. This study also provides a necessary framework for understanding the evolution of particular allelic lineages within specific MHC genes, which we explore further in our companion paper \citep{Fortier2025b}. Both studies shed light on MHC gene family evolutionary dynamics and bring us closer to understanding the evolutionary tradeoffs involved in MHC disease associations.” (lines 576-586)

      Reviewer #3 (Recommendations for the authors):

      (1) Figure 1 et seq. Classifying genes as having 'classical', 'non-classical' and 'dual' properties is notoriously difficult in non-model organisms due to the lack of relevant information. As you have characterised a number of genes for the first time in this paper and could not rely entirely on published classifications, please indicate the criteria you used for classification.

      The roles of these proteins are not necessarily conserved between closely related species, and experimental evidence is needed to evaluate this. However, in the absence of such evidence, wherever possible we have provided our best guess as to the roles of the orthologous genes in other species, presented in Figure 1 - source data 1 and Figure 2 - source data 1. This is based on whatever evidence is available at the moment, sometimes experimental but typically based on dN/dS ratios and other indirect measures.

      (2) Line 61 It's important to mention that classical MHC molecules present antigenic peptides to T cells with variable alphabeta T cell receptors, as non-classical MHC molecules may interact with other T cell subsets/types.

      Thank you for pointing this out; we have updated the text to make this clearer (lines 63-65). We changed “‘Classical’ MHC molecules perform antigen presentation to T cells---a key part of adaptive immunity---while ‘non-classical’ molecules have niche immune roles.” to “‘Classical’ MHC molecules perform antigen presentation to T cells with variable alphabeta TCRs---a key part of adaptive immunity---while ‘non-classical’ molecules have niche immune roles.”

      (3) Perhaps it's worth mentioning in the introduction that you are deliberately excluding highly divergent non-classical MHC molecules such as CD1.

      Thank you, it’s worth clarifying exactly what molecules we are discussing. We have added a sentence to the introduction (lines 38-43): “Having originated in the jawed vertebrates, this group of genes is now involved in diverse functions including lipid metabolism, iron uptake regulation, and immune system function (proteins such as zinc-𝛼2-glycoprotein (ZAG), human hemochromatosis protein (HFE), MHC class I chain–related proteins (MICA, MICB), and the CD1 family) \citep{Hansen2007,Kupfermann1999,Kaufman2022,Adams2013}. However, here we focus on…”

      (4) Line 94-105 This material presents results, it could be moved to the results section as it now somewhat disrupts the flow.

      We feel it is important to include a “teaser” of the results in the introduction, which can be slightly more detailed than that in the abstract.

      (5) Line 118-131 This opening section of the results sets the stage for the whole presentation and contains important information that I feel needs to be expanded to include an overview and justification of your methodological choices. As the M&M section is at the end of the MS (and contains limited justification), some information on two aspects is needed here for the benefit of the reader. First, as far as I understand, all phylogenetic inferences were based entirely on DNA sequences of individual (in some cases concatenated) exons. It would be useful for the reader to explain why you've chosen to rely on DNA rather than protein sequences, even though some of the genes you include in the phylogenetic analysis are highly divergent. Second, a reader might wonder how the "maximum clade credibility tree" from the Bayesian analysis compares to commonly seen trees with bootstrap support or posterior probability values assigned to particular clades. Personally, I think that the authors' approach to identifying and presenting representative trees is reasonable (although one might wonder why "Maximum clade credibility tree" and not "Maximum credibility tree" https://www.beast2.org/summarizing-posterior-trees/), since they are working with a large number of short, sometimes divergent and sometimes rather similar sequences - in such cases, a requirement for strict clade support could result in trees composed largely of polytomies. However, I feel it's necessary to be explicit about this and to acknowledge that the relationships represented by fully resolved bifurcating representative trees and interpreted in the study may not actually be highly supported in the sense that many readers might expect. In other words, the reader should be aware from the outset of what the phylogenies that are so central to the paper represent.

      We chose to rely on DNA rather than protein sequences because convergent evolution is likely to happen in regions that code for extremely important functions such as adaptive and innate immunity. Convergent evolution acts upon proteins while trans-species polymorphism retains ancient nucleotide variation, so studying the DNA sequence can help tease apart convergent evolution from trans-species polymorphism.
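
      The reasoning above can be illustrated with a toy example. The sketch below is not part of the analysis pipeline; the two sequences are invented, the codon table covers only the codons used here, and the helper names (`translate`, `hamming`) are hypothetical. It shows two coding sequences that translate to the same peptide while differing at every codon, so a protein-level comparison sees no difference where a nucleotide-level comparison does.

```python
# Toy illustration: synonymous nucleotide differences are invisible
# at the protein level. Sequences are invented, not real MHC data.

# Partial standard codon table, limited to the codons used below.
CODON_TABLE = {
    "GCT": "A", "GCC": "A",   # alanine
    "CTG": "L", "CTC": "L",   # leucine
    "AAA": "K", "AAG": "K",   # lysine
    "GAT": "D", "GAC": "D",   # aspartate
}

def translate(dna):
    """Translate an in-frame DNA string codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

seq_x = "GCTCTGAAAGAT"
seq_y = "GCCCTCAAGGAC"

protein_diffs = hamming(translate(seq_x), translate(seq_y))
dna_diffs = hamming(seq_x, seq_y)
print(f"protein differences: {protein_diffs}, nucleotide differences: {dna_diffs}")
```

      In this toy case the peptides are identical, yet the nucleotide sequences differ at the third position of each codon. Differences of this kind are only available to a tree built from DNA.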

      As for the “maximum clade credibility tree”, this is a matter of confusing nomenclature. In the online reference guide (https://www.beast2.org/summarizing-posterior-trees/), the tree with the maximum product of the posterior clade probabilities is called the “maximum clade credibility tree” while the tree that has the maximum sum of posterior clade probabilities is called the “Maximum credibility tree”. The “Maximum credibility tree” (referring to the sum) appears to have only been named in this way in the first version of TreeAnnotator. However, the version of TreeAnnotator that I used lists the options “maximum clade credibility tree” and “maximum sum of clade probabilities”. So the context suggests that the “maximum clade credibility tree” option is actually maximizing the product. This “maximum clade credibility tree” is the setting I used for this project (in TreeAnnotator version 2.6.3).
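
      To make the product-versus-sum distinction concrete, here is a minimal sketch of the two scoring rules applied to a toy posterior sample. It is not TreeAnnotator's implementation: the nested-tuple tree representation, the function names (`clades`, `summarize`), and the sample itself are assumptions made purely for illustration.

```python
import math
from collections import Counter

def clades(tree):
    """Return the clades (frozensets of tip names) of a nested-tuple tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):          # a leaf: just its own name
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)                 # record every internal node
        return members

    tips(tree)
    return found

def summarize(sample):
    """Pick the sampled trees maximizing the product and the sum of clade posteriors."""
    n = len(sample)
    counts = Counter(c for t in sample for c in clades(t))
    posterior = {c: k / n for c, k in counts.items()}

    def log_product(t):                    # product, computed in log space
        return sum(math.log(posterior[c]) for c in clades(t))

    def total(t):                          # the alternative "sum" criterion
        return sum(posterior[c] for c in clades(t))

    return max(sample, key=log_product), max(sample, key=total)

# Toy posterior sample of seven trees on four tips.
t1 = (("A", "B"), ("C", "D"))
t2 = ((("A", "B"), "C"), "D")
t3 = ((("A", "C"), "B"), "D")
sample = [t1] * 4 + [t2] * 2 + [t3]

best_by_product, best_by_sum = summarize(sample)
print(best_by_product, best_by_sum)
```

      In this small sample both criteria select the same tree, but they need not agree in general: the product penalizes a tree heavily for containing even one poorly supported clade, whereas the sum does not.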

      We agree that readers may not fully grasp what the collapsed trees represent upon first read. We have added a sentence to the beginning of the results (lines 188-190) to make this more explicit.

      (6) Line 224, you're referring to the DPB1*09 lineage, not the DRB1*09 lineage.

      Indeed! We have changed these typos.

      (7) Line 409, why "Differences between MHC subfamilies" and not "Differences between MHC classes"?

      We chose the word “subfamilies” because we discuss the difference between classical and non-classical genes in addition to differences between Class I and Class II genes.

      (8) Line 529-544 This might work better as a table.

      We agree! This information is now presented as Table 1.

      (9) Line 547 MHC-DRB9 appears out of the blue here - please say why you are singling it out.

      Great point! We added a paragraph (lines 614-623) to explain why this was necessary.

      (10) Line 550-551 Even though you've screened the hits manually, it would be helpful to outline your criteria for this search.

      Thank you! We’ve added a couple of sentences to explain how we did this (lines 607-610).

      (11) Line 556-580 please provide nucleotide alignments as supplementary data so that the reader can get an idea of the actual divergence of the sequences that have been aligned together.

      Thank you! We’ve added nucleotide alignments as supplementary files.

      (12) Line 651-652 Why "Maximum clade credibility tree" and not "Maximum credibility tree"? 

      Repeat of (5). This is a matter of confusing nomenclature. In the online reference guide (https://www.beast2.org/summarizing-posterior-trees/), the tree with the maximum product of the posterior clade probabilities is called the “maximum clade credibility tree” while the tree that has the maximum sum of posterior clade probabilities is called the “Maximum credibility tree”. The “Maximum credibility tree” (referring to the sum) appears to have only been named in this way in the first version of TreeAnnotator. However, the version of TreeAnnotator that I used lists the options “maximum clade credibility tree” and “maximum sum of clade probabilities”. So the context suggests that the “maximum clade credibility tree” option is actually maximizing the product. This “maximum clade credibility tree” is the setting I used for this project (in TreeAnnotator version 2.6.3).

      (13) In the appendices, links to references do not work as expected.

      We will make sure these work properly when we receive the proofs.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This is an interesting study of the nature of representations across the visual field. The question of how peripheral vision differs from foveal vision is a fascinating and important one. The majority of our visual field is extra-foveal yet our sensory and perceptual capabilities decline in pronounced and well-documented ways away from the fovea. Part of the decline is thought to be due to spatial averaging ('pooling') of features. Here, the authors contrast two models of such feature pooling with human judgments of image content. They use much larger visual stimuli than in most previous studies, and some sophisticated image synthesis methods to tease apart the prediction of the distinct models.

      More importantly, in so doing, the researchers thoroughly explore the general approach of probing visual representations through metamers (stimuli that are physically distinct but perceptually indistinguishable). The work is embedded within a rigorous and general mathematical framework for expressing equivalence classes of images and how visual representations influence these. They describe how image-computable models can be used to make predictions about metamers, which can then be compared to make inferences about the underlying sensory representations. The main merit of the work lies in providing a formal framework for reasoning about metamers and their implications, for comparing models of sensory processing in terms of the metamers that they predict, and for mapping such models onto physiology. Importantly, they also consider the limits of what can be inferred about sensory processing from metamers derived from different models.

      Overall, the work is of a very high standard and represents a significant advance over our current understanding of perceptual representations of image structure at different locations across the visual field. The authors do a good job of capturing the limits of their approach and I particularly appreciated the detailed and thoughtful Discussion section and the suggestion to extend the metamer-based approach described in the MS with observer models. The work will have an impact on researchers studying many different aspects of visual function including texture perception, crowding, natural image statistics, and the physiology of low- and mid-level vision.

      The main weaknesses of the original submission relate to the writing. A clearer motivation could have been provided for the specific models that they consider, and the text could have been written in a more didactic and easy-to-follow manner. The authors could also have been more explicit about the assumptions that they make.

      Thank you for the summary. We appreciate the positives noted above. We address the weaknesses point by point below.

      Reviewer #2 (Public Review):

      Summary

      This paper expands on the literature on spatial metamers, evaluating different aspects of spatial metamers including the effect of different models and initialization conditions, as well as the relationship between metamers of the human visual system and metamers for a model. The authors conduct psychophysics experiments testing variations of metamer synthesis parameters including type of target image, scaling factor, and initialization parameters, and also compare two different metamer models (luminance vs energy). An additional contribution is doing this for a field of view larger than has been explored previously.

      General Comments

      Overall, this paper addresses some important outstanding questions regarding comparing original to synthesized images in metamer experiments and begins to explore the effect of noise vs image seed on the resulting syntheses. While the paper tests some model classes that could be better motivated, and the results are not particularly groundbreaking, the contributions are convincing and undoubtedly important to the field. The paper includes an interesting Voronoi-like schematic of how to think about perceptual metamers, which I found helpful, but for which I do have some questions and suggestions. I also have some major concerns regarding incomplete psychophysical methodology including lack of eye-tracking, results inferred from a single subject, and a huge number of trials. I have only minor typographical criticisms and suggestions to improve clarity. The authors also use very good data reproducibility practices.

      Thank you for the summary. We appreciate the positives noted above. We address the weaknesses point by point below.

      Specific Comments

      Experimental Setup

      Firstly, the experiments do not appear to utilize an eye tracker to monitor fixation. Without eye tracking or another manipulation to ensure fixation, we cannot ensure the subjects were fixating the center of the image, and viewing the metamer as intended. While the short stimulus time (200ms) can help minimize eye movements, this does not guarantee that subjects began the trial with correct fixation, especially in such a long experiment. While Covid-19 did at one point limit in-person eye-tracked experiments, the paper reports no such restrictions that would have made the addition of eye-tracking impossible. While such a large-scale experiment may be difficult to repeat with the addition of eye tracking, the paper would be greatly improved with, at a minimum, an explanation as to why eye tracking was not included.

      Addressed on pg. 25, starting on line 658.

      Secondly, many of the comparisons later in the paper (Figures 9,10) are made from a single subject. N=1 is not typically accepted as sufficient to draw conclusions in such a psychophysics experiment. Again, if there were restrictions limiting this, they should be discussed. Also (P11), is subject sub-00 an author? Another expert? A naive subject? The subject’s expertise in viewing metamers will likely affect their performance.

      Addressed on pg. 14, starting on line 308.

      Finally, the number of trials per subject is quite large. 13,000 over 9 sessions is much larger than most human experiments in this area. The reason for this should be justified.

      In general, we needed a large number of trials to fit full psychometric functions for stimuli derived for both models, with both types of comparison, both initializations, and over many target images. We could have eliminated some of these, but feel that having a consistent dataset across all these conditions is a strength of the paper.

      In addition to the sentence on pg. 14, line 318, a full enumeration of trials is now described on pg. 23, starting on line 580.

      Model

      For the main experiment, the authors compare the results of two models: a ’luminance model’ that spatially pools mean luminance values, and an ’energy model’ that spatially pools energy calculated from a multi-scale pyramid decomposition. They show that these models create metamers that result in different thresholds for human performance, and therefore different critical scaling parameters, with the basic luminance pooling model producing a scaling factor 1/4 that of the energy model. While this is certain to be true, due to the luminance model being so much simpler, the motivation for the simple luminance-based model as a comparison is unclear.

      The use of simple models is now addressed on pg. 3, starting on line 98, as well as the sentence starting on pg. 4 line 148: the luminance model is intended as the simplest possible pooling model.

      The authors claim that this luminance model captures the response of retinal ganglion cells, often modeled as a center-surround operation (Rodieck, 1964). I am unclear in what aspect(s) the authors claim these center-surround neurons mimic a simple mean luminance, especially in the context of evidence supporting a much more complex role of RGCs in vision (Atick & Redlich, 1992). Why do the authors not compare the energy model to a model that captures center-surround responses instead? Do the authors mean to claim that the luminance model captures only the pooling aspects of an RGC model? This is particularly confusing as Figures 6 and 9 show the luminance and energy models for original vs synth aligning with the scaling of Midget and Parasol RGCs, respectively. These claims should be more clearly stated, and citations included to motivate this. Similarly, with the energy model, the physiological evidence is very loosely connected to the model discussed.

      We have removed the bars showing potential scaling values measured by electrophysiology in the primate visual system and attempted to clarify our language around the relationship between these models and physiology. Our metamer models are only loosely connected to the physiology, and we’ve decided in revision not to imply any direct connection between the model parameters and physiological measurements. The models should instead be understood as loosely inspired by physiology, but not as a tool to localize the representation (as was done in the Freeman paper).

      The physiological scaling values are still used as the mean of the priors on the critical scaling value for model fitting, as described on pg. 27, starting on line 698.

      Prior Work:

      While the explorations in this paper clearly have value, the paper does not present any particularly groundbreaking results, and those reported are consistent with previous literature. The explorations around critical eccentricity measurement have been done for texture models (Figure 11) in multiple papers (Freeman 2011, Wallis 2019, Balas 2009). In particular, Freeman 2011 demonstrated that simpler models, representing measurements presumed to occur earlier in visual processing, need smaller pooling regions to achieve metamerism. This work’s measurements for the simpler models tested here are consistent with those results, though the model details are different. In addition, Brown 2023 (which is miscited) also used an extended field of view (though not as large as in this work). Both Brown 2023 and Wallis 2019 performed an exploration of the effect of the target image. Also, much of the more recent previous work uses color images, while the authors’ exploration is only done for greyscale.

      We were pleased to find consistency of our results with previous studies, given the (many) differences in stimuli and experimental conditions (especially viewing angle), while also extending to new results with the luminance model, and the effects of initialization. Note that only one of the previous studies (Freeman and Simoncelli, 2011) used a pooled spectral energy model. Moreover, of the previous studies, only one (Brown et al., 2023) used color images (we have corrected that citation - thanks for catching the error).

      Discussion of Prior Work:

      The prior work on testing metamerism between original vs. synthesized and synthesized vs. synthesized images is presented in a misleading way. Wallis et al.’s prior work on this should not be a minor remark in the post-experiment discussion. Rather, it was surely a motivation for the experiment. The text should make this clear; a discussion of Wallis et al. should appear at the start of that section. The authors similarly cite much of the most relevant literature in this area as a minor remark at the end of the introduction (P3L72).

      The large differences we observed between comparison types (original vs synthesized, compared to synthesized vs synthesized) surprised us. Understanding this difference was not a primary motivation for the work, but it is certainly an important component of our results. In the introduction, we thought it best to lay out the basic logic of the metamer paradigm for foveated vision before mentioning the complications that are introduced in both the Wallis and Brown papers (paragraph beginning p. 3, line 109). Our results confirm and bolster the results of both of those earlier works, which are now discussed more fully in the Introduction (lines 109 and following).

      White Noise: The authors make an analogy to the inability of humans to distinguish samples of white noise. It is unclear, however, that human difficulty distinguishing samples of white noise is a perceptual issue; it could instead be due to cognitive/memory limitations. If one concentrates on an individual patch one can usually tell apart two samples. The authors should either provide support for these difficulties emerging from perceptual limitations, discuss the possibility that the limitations are more cognitive, or employ a different analogy.

      We now note the possibility of cognitive limits on pg. 8, starting on line 243, as well as pg. 22, line 571. The ability of observers to distinguish samples of white noise is highly dependent on display conditions. A small patch of noise (i.e., large pixels, not too many) can be distinguished, but a larger patch cannot, especially when presented in the periphery. This is more generally true for textures (as shown in Ziemba and Simoncelli (2021)). Samples of white noise at the resolution used in our study are indistinguishable.

      Relatedly, in Figure 14, the authors do not explain why the white noise seeds would be more likely to produce syntheses that end up in different human equivalence classes.

      In figure 14, we claim that white noise seeds are more likely to end up in the same human equivalence classes than natural image seeds. The explanation as to why we think this may be the case is now addressed on pg. 19, starting on line 423.

      It would be nice to see the effect of pink noise seeds, which mirror the power spectrum of natural images, but do not contain the same structure as natural images - this may address the artifacts noted in Figure 9b.

      The lack of pink noise seeds is now addressed on pg. 19, starting on line 429.

      Finally, the authors note high-frequency artifacts in Figure 4 & P5L135, that remain after syntheses from the luminance model. They hypothesize that this is due to a lack of constraints on frequencies above that defined by the pooling region size. Could these be addressed with a white noise image seed that is pre-blurred with a low pass filter removing the frequencies above the spatial frequency constrained at the given eccentricity?

      The explanation for this is similar to the lack of pink noise seeds in the previous point: the goal of metamer synthesis is model testing, and so for a given model, we want to find model metamers that result in the smallest possible critical scaling value. Taking white noise seed images and blurring them will almost certainly remove the high frequencies visible in luminance metamers in figure 4 and thus result in a larger critical scaling value, as the reviewer points out. However, the logic of the experiments requires finding the smallest critical scaling value, and so these model metamers would be uninformative. In an early stage of the project, we did indeed synthesize model metamers using pink noise seeds, and observed that the high frequency artifacts were less prominent.

      Schematic of metamerism: Figures 1,2,12, and 13 show a visual schematic of the state space of images, and their relationship to both model and human metamers. This is depicted as a Voronoi diagram, with individual images near the center of each shape, and other images that fall at different locations within the same cell producing the same human visual system response. I felt this conceptualization was helpful. However, implicitly it seems to make a distinction between metamerism and JND (just noticeable difference). I felt this would be better made explicit. In the case of JND, neighboring points, despite having different visual system responses, might not be distinguishable to a human observer.

      Thanks for noting this – in general, metamers are subthreshold, and for the purpose of the diagram, we had to discretize the space showing metameric regions (Voronoi regions) around a set of stimuli. We’ve rewritten the captions to explain this better. We address the binary subthreshold nature of the metamer paradigm in the discussion section (pg. 19, line 438).

      In these diagrams and throughout the paper, the phrase ’visual stimulus’ rather than ’image’ would improve clarity, because the location of the stimulus in relation to the fovea matters whereas the image can be interpreted as the pixels displayed on the computer.

      We agree and have tried to make this change, describing this choice on pg. 3 line 73.

      Other

      The authors show good reproducibility practices with links to relevant code, datasets, and figures.

      Reviewer #1 (Recommendations For The Authors):

      In its current form, I found the introduction to be too cursory. I felt that the article would benefit from a clearer motivation for the two models that are considered, as the reader is left unclear why these particular models are of special scientific significance. The luminance model is intended to capture some aspects of retinal ganglion cell response characteristics and the spectral energy model is intended to capture some aspects of the primary visual cortex. However, one can easily imagine models that include the pooling of other kinds of features, and it would be helpful to get an idea of why these are not considered. Which aspects of processing in the retina and V1 are being considered and which are being left out, and why? Why not consider representations that capture even higher-order statistical structure than those covered by the spectral energy model (or even semantics)? I think a bit of rewriting with this in mind could improve the introduction.

      Along similar lines, I would have appreciated having the logic of the study explained more explicitly and didactically: which overarching research question is being asked, how it is operationalised in the models and experiments, and what are the predictions of the different models. Figures 2 and 3 are certainly helpful, but I felt further explanations would have made it easier for the reader to follow. Throughout, the writing could be improved by a careful re-reading with a view to making it easier to understand. For example, where results are presented, a sentence or two expanding on the implications would be helpful.

      I think the authors could also be more explicit about the assumptions they make. While these are obviously (tacitly) included in the description of the models themselves, it would be helpful to state them more openly. To give one example, when introducing the notion of critical scaling, on p.6 the authors state as if it is a self-evident fact that "metamers can be achieved with windows whose size is matched to that of the underlying visual neurons". This presumably is true only under particular conditions, or when specific assumptions about readout from populations of neurons are invoked. It would be good to identify and state such assumptions more directly (this is partly covered in the Discussion section ’The linking proposition underlying the metamer paradigm’, but this should be anticipated or moved earlier in the text).

      We agree that our introduction was too cursory and have reworked it. We have also backed off of the direct comparison to physiology and clarified that we chose these two as the simplest possible pooling models. We have also added sentences at the end of each result section attempting to summarize the implication (before discussing them fully in the discussion). Hopefully the logic and assumptions are now clearer.

      There are also some findings that warrant a more extensive discussion. For example, what is the broader implication of the finding that original vs. synthesised and synthesised vs. synthesised comparisons exhibit very different scaling values? Does this tell us something about internal visual representations, or is it simply capturing something about the stimuli?

      We believe this difference is a result of the stimuli that are used in the experiment and thus the synthesis procedure itself, which interacts with the model’s pooled image feature. We have attempted to update the relevant figures and discussions to clarify this, in the sections starting on pg 17 line 396 and pg. 19 line 417.

      At some points in the paper, a third model (’texture model’) creeps into the discussion, without much explanation. I assume that this refers to models that consider joint (rather than marginal) statistics of wavelet responses, as in the famous Portilla & Simoncelli texture model. However, it would be helpful to the reader if the authors could explain this.

      Addressed on pg. 3, starting on line 94.

      Minor corrections.

      Caption of Figure 3: ’top’ and ’bottom’ should be ’left’ and ’right’

      Line 177: ’smallest tested scaling values tested’. Remove one instance of ’tested’

      Line 212: ’the images-specific psychometric functions’ -> ’image-specific’

      Line 215: ’cloud-like pink noise’. It’s not literally pink noise, so I would drop this.

      Line 236: ’Importantly, these results cannot be predicted from the model, which gives no specific insight as to why some pairs are more discriminable than others’. The authors should specify what we do learn from the model if it fails to provide insight into why some image pairs are more discriminable than others.

      Figure 9: it might be helpful to include small insets with the ’highway’ and ’tiles’ source images to aid the reader in understanding how the images in 9B were generated.

      Table 1 placement should be after it is first referred to on line 258.

      In the Discussion section "Why does critical scaling depend on the comparison being performed", it would be helpful to consider the case where the two model metamers *are* distinguishable from each other even though each is indistinguishable from the target image. I would assume that this is possible (e.g., if the target image is at the midpoint between the two model images in image space and each of the stimuli is just below 1 JND away from the target). Or is this not possible for some reason?

      Regarding line 236: this specific line has been removed, and the discussion about this issue has all been consolidated in the final section of the discussion, starting on pg. 19 line 438.

      Regarding the final comment: this is addressed in the paragraph starting on pg. 16 line 386. To expand upon that: the situation laid out by the reviewer is not possible in our conceptualization, in which metamerism is transitive and image discriminability is binary. In order to investigate situations like the one laid out by the reviewer, one needs models whose representations have metric properties, i.e., which allow you to measure and reason about perceptual distance, which we refer to in the paragraph starting on pg. 20 line 460. We also note that this situation has not been observed in this or any other pooling model metamer study that we are aware of. All other minor changes have been addressed.
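
      The reviewer's midpoint scenario can be made concrete with a small numerical sketch. It assumes a hypothetical one-dimensional perceptual axis measured in JND units and a fixed threshold of 1 JND; neither is part of the pooling models in the paper, which treat discriminability as binary and metamerism as transitive. Under the metric assumption, each model metamer is indistinguishable from the target, yet the two metamers are distinguishable from each other.

```python
from itertools import combinations

JND = 1.0  # hypothetical discrimination threshold, in perceptual units

# Hypothetical perceptual coordinates: the target sits at the midpoint
# between two model metamers, each just under 1 JND away from it.
positions = {"metamer_a": -0.9, "target": 0.0, "metamer_b": 0.9}


def indistinguishable(x, y, threshold=JND):
    """True when two stimuli are closer than the threshold on the perceptual axis."""
    return abs(x - y) < threshold


# Evaluate every pair of stimuli.
pairs = {
    (a, b): indistinguishable(positions[a], positions[b])
    for a, b in combinations(positions, 2)
}

for (a, b), same in pairs.items():
    print(f"{a} vs {b}: {'indistinguishable' if same else 'distinguishable'}")
```

      In this sketch the "indistinguishable" relation is not transitive, which is why such a case cannot be expressed in a framework built on equivalence classes and instead calls for a model with metric properties.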

      Reviewer #2 (Recommendations For The Authors):

      Original image T should be marked in the Voronoi diagrams.

      Brown et al. is miscited as 2021; it should be ACM Transactions on Applied Perception 2023.

      Figure 3 caption: models are left and right, not top and bottom.

      Thanks, all of the above have been addressed.

      References

      Brown R, Dutell V, Walter B, Rosenholtz R, Shirley P, McGuire M, Luebke D. Efficient Dataflow Modeling of Peripheral Encoding in the Human Visual System. ACM Transactions on Applied Perception. 2023 Jan; 20(1):1–22. http://dx.doi.org/10.1145/3564605, doi: 10.1145/3564605.

      Freeman J, Simoncelli EP. Metamers of the ventral stream. Nature Neuroscience. 2011 Aug; 14(9):1195–1201. doi: 10.1038/nn.2889.

      Ziemba CM, Simoncelli EP. Opposing Effects of Selectivity and Invariance in Peripheral Vision. Nature Communications. 2021 Jul; 12(1). https://doi.org/10.1038/s41467-021-24880-5, doi: 10.1038/s41467-021-24880-5.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      (1) The authors make fairly strong claims that "arousal-related fluctuations are isolated from neurons in the deep layers of the SC" (emphasis added). This conclusion is based on comparisons between a "slow drift axis", a low-dimensional representation of neuronal drift, and other measures of arousal (Figures 2C, 3) and motor output sensitivity (Figures 2B, 3B). However, the metrics used to compare the slow-drift axis and motor activity were computed during separate task epochs: the delay period (600-1100 ms) and a perisaccade epoch (25 ms before and after saccade initiation), respectively. As the authors reference, deep-layer SC neurons are typically active only around the time of a saccade. Therefore, it is not clear if the lack of arousal-related modulations reported for deep-layer SC neurons is because those neurons are truly insensitive to those modulations, or if the modulations were not apparent because they were assessed in an epoch in which the neurons were not active. A potentially more valuable comparison would be to calculate a slow-drift axis aligned to saccade onset. 

      The reviewer makes an important point that the calculation of an axis can depend critically on the time window of neuronal response. On considering this, we find that the slow drift axis is less sensitive to this issue because it is calculated on time-averaged activity over multiple trials. In previous work we found that slow drift calculated on the stimulus-evoked response in V4 was very well aligned to slow drift calculated on pre-stimulus spontaneous activity (Cowley et al, Neuron, 2020, Supplemental Figure 3A and 3B). To address this issue in the present data, we compared the axis computed for an example session for neural activity during the delay period and neural activity aligned to saccade onset. As shown in the new Figure 2 – figure supplement 1 in the revised manuscript, we found a similar lack of arousal-related modulations for deep-layer SC neurons when slow drift was computed using the saccade epoch (25ms before to 25ms after the onset of the saccade). Figure 2 – figure supplement 1A shows loadings for the SC slow drift axis when it was computed using spiking responses during the delay period (as in the main manuscript analysis). In contrast, Figure 2 – figure supplement 1B shows loadings from the same session when the SC slow drift axis was computed using spiking responses during the saccade epoch. The plots are highly similar and in both cases the loadings were weaker for neurons recorded from channels at the bottom of the probe, which have a higher motor index. Finally, we found that projections onto the SC slow drift axis for this session were strongly correlated when the slow drift axis was computed using spiking responses during the delay period and the saccade epoch (r = 0.66, p < 0.001, Figure 1C). Taken together, these results suggest that arousal-related modulations are less evident in deep-layer SC neurons irrespective of whether slow drift was computed during the delay or saccade epoch (see also Public Reviews, Reviewer 1, Point 2).

      (2) More generally, arousal-related signals may persist throughout multiple different epochs of the task. It would be worthwhile to determine whether similar "slow-drift" dynamics are observed for baseline, sensory-evoked, and saccade-related activity. Although it may not be possible to examine pupil responses during a saccade, there may be systematic relationships between baseline and evoked responses. 

      Similar to the point above, slow drift dynamics tend to be similar across different response epochs because they are averaged across many trials and seem to tap into responsivity trends that are robust across epochs. As shown in Author response image 1 below, and in Figure 2 – figure supplement 1 in the revised manuscript, similar dynamics were observed when the SC slow drift axis was computed using spiking responses during the baseline, delay, visual and saccade epochs. We did not investigate differences between baseline and evoked pupil responses in the current paper. However, these effects were characterized in one of our previous papers that focused exclusively on the relationship between slow drift and eye-related metrics (Johnston et al., 2022, Cereb. Cortex, Figure 6). In this previous work, we found a negative correlation between baseline and evoked pupil size. Both variables were significantly correlated with slow drift, the only difference being the sign of the correlation.

      Author response image 1.

      (A-C) Dynamics of slow drift for three example sessions when the SC slow drift axis was computed using spiking responses during the baseline, delay, visual and saccade epochs. Baseline = 100ms before the onset of the target stimulus; Delay = 600 to 1100ms after the offset of the target stimulus; Stim = 25ms to 125ms after the onset of the target stimulus; Sac = 25ms before to 25ms after the onset of the saccade.

      Johnston R, Snyder AC, Khanna SB, Issar D, Smith MA (2022) The eyes reflect an internal cognitive state hidden in the population activity of cortical neurons. Cereb Cortex 32:3331–3346.

      (3) The relationships between changes in SC activity and pupil size are quite small (Figures 2C & 5C). Although the distribution across sessions (Figure 2C) is greater than chance, the values are nearly 1/4 of the size of those in the PFC-SC axis comparisons. Likewise, the distribution of r<sup>2</sup> values relating pupil size and spiking activity directly (Figure 5) is quite low. We remain skeptical that these drifts are truly due to arousal and cannot be accounted for by other factors. For example, does the relationship persist if accounting for a very simple, monotonic (e.g., linear) drift in pupil size and overall firing rate over the course of an individual session?

      Firstly, it is important to note that the strength of the relationship between projections onto the SC slow drift axis and pupil size (r<sup>2</sup> = 0.06) is within the range reported by Joshi et al. (2016, Neuron, Figure 3). They investigated the median variance explained between the spiking responses of individual SC neurons and pupil size and found it to be approximately 0.02 across sessions. Secondly, our statistical approach of testing the actual distribution of r<sup>2</sup> values against a shuffled distribution was specifically designed to rule out the possibility that the relationship between SC spiking responses and pupil size occurred due to linear drifts. The shuffled distribution in Figure 2C of the main manuscript represents the variance that can be explained by one session’s slow drift correlated with another session’s pupil, which would contain effects that occurred due to linear drifts alone. That the actual proportion of variance explained was significantly greater than this distribution suggests that the relationship between projections onto the SC slow drift axis and pupil size reflects changes in arousal rather than other factors related to linear drifts.

      Joshi S, Li Y, Kalwani RM, Gold JI (2016) Relationships between Pupil Diameter and Neuronal Activity in the Locus Coeruleus, Colliculi, and Cingulate Cortex. Neuron 89:221–234.
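
      The logic of the across-session shuffle control described above can be sketched in a few lines. This is an illustration on synthetic data, not the analysis code used for the manuscript: the function names, the truncation of sessions to a common length, the session count, and the simulated signals are all assumptions made for the example. Every simulated session contains the same linear trend, so cross-session pairings retain any r<sup>2</sup> that linear drift alone can produce, while only matched pairings share the session-specific slow signal.

```python
import numpy as np


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)


def matched_and_shuffled_r2(drift_by_session, pupil_by_session):
    """Return r^2 for same-session pairs and for all cross-session pairs."""
    matched, shuffled = [], []
    for i, drift in enumerate(drift_by_session):
        for j, pupil in enumerate(pupil_by_session):
            n = min(len(drift), len(pupil))  # truncate to the shorter session
            value = r_squared(drift[:n], pupil[:n])
            (matched if i == j else shuffled).append(value)
    return np.array(matched), np.array(shuffled)


# Synthetic sessions (hypothetical values, not data from the paper).
rng = np.random.default_rng(0)
n_sessions, n_bins = 10, 200
time = np.linspace(0.0, 1.0, n_bins)
drift_by_session, pupil_by_session = [], []
for _ in range(n_sessions):
    linear_trend = 0.5 * time  # identical monotonic drift in every session
    # Session-specific slow signal shared by neural drift and pupil size.
    arousal = np.cumsum(rng.normal(size=n_bins)) / np.sqrt(n_bins)
    drift_by_session.append(linear_trend + arousal + 0.1 * rng.normal(size=n_bins))
    pupil_by_session.append(linear_trend + arousal + 0.1 * rng.normal(size=n_bins))

matched, shuffled = matched_and_shuffled_r2(drift_by_session, pupil_by_session)
print(f"matched mean r^2:  {matched.mean():.2f}")
print(f"shuffled mean r^2: {shuffled.mean():.2f}")
```

      If linear drift were the only source of the relationship, the matched and shuffled distributions would look alike; an excess in the matched distribution points to a session-specific shared signal.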

      (4) It is not clear how the final analysis (Figure 6) contributes to the authors' conclusions. The authors perform PCA on: (i) residual spiking responses during the delay period binned according to pupil size, and (ii) spiking responses in the saccade epoch binned according to target location (i.e., the saccade tuning curve). The corresponding PCs are the spike-pupil axis and the saccade tuning axis, respectively. Unsurprisingly, the spike-pupil axis that captures variance associated with arousal (and removes variance associated with saccade direction) was not correlated with a saccade-tuning axis that captures variance associated with saccade direction and omits arousal. Had these measures been related, it would imply a unique association between a neuron's preferred saccade direction and pupil control, which seems unlikely. The separation of these axes thus seems trivial and does not provide evidence of a "mechanism...in the SC to prevent arousal-related signals interfering with the motor output." It remains unknown whether, for example, arousal-related signals may impact trial-by-trial changes in neuronal gain near the time of a saccade, or alter saccade dynamics such as acceleration, precision, and reaction time.

      The reviewer makes a good point, and we agree that more evidence is needed to determine if the separation of the pupil size axis and saccade tuning axis is the mechanism through which cognitive and arousal-related signals can be intermixed in the SC. In the revised manuscript (lines 679-682), we have raised this as a possible explanation that necessitates further study rather than stating definitively that it is the exact mechanism through which these signals are kept separate. Our analysis here is similar to the one from Smoulder et al (2024, Neuron, Fig. 2F), in which the interactions between reward signals and target tuning in M1 were examined (and found to be orthogonal). While we agree with the reviewer that it may seem “trivial” for these axes to be orthogonal, it does not have to be so. If, for example, neural tuning curves shifted with changes in pupil size through gain changes that revealed tuning or affected tuning curve shape, there could be projections of the pupil axis onto the target tuning axis. Thus, while we agree with the reviewer that it appears sensible for these two axes to be orthogonal, our result is nonetheless a novel finding. We have edited the text in our revised manuscript, however, to make sure the nuance of this point is conveyed to the reader.

      Smoulder AL, Marino PJ, Oby ER, Snyder SE, Miyata H, Pavlovsky NP, Bishop WE, Yu BM, Chase SM, Batista AP. A neural basis of choking under pressure. Neuron. 2024 Oct 23;112(20):3424-33.
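
      One way to see why orthogonality of the two axes is not guaranteed is to compute the alignment between the first principal axes of two condition-averaged response matrices. The sketch below uses noiseless synthetic populations; the neuron count, bin values, response patterns, and function names are hypothetical and are not taken from the manuscript's analysis. In one case the pupil-related and tuning-related patterns occupy different neurons, and in the other the tuning pattern also loads on the pupil-modulated neurons.

```python
import numpy as np


def first_pc(responses):
    """First principal axis (unit vector over neurons) of a conditions-by-neurons matrix."""
    centered = responses - responses.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def axis_alignment(axis_a, axis_b):
    """Absolute cosine between two axes: 0 is orthogonal, 1 is identical up to sign."""
    a = axis_a / np.linalg.norm(axis_a)
    b = axis_b / np.linalg.norm(axis_b)
    return abs(float(a @ b))


n_neurons = 40
pupil_levels = np.linspace(-1.0, 1.0, 5)        # hypothetical pupil-size bins
directions = np.deg2rad(np.arange(0, 360, 45))  # hypothetical target directions

# Hypothetical population patterns over neurons.
pupil_pattern = np.zeros(n_neurons)
pupil_pattern[:20] = 1.0    # neurons modulated by pupil size
tuning_pattern = np.zeros(n_neurons)
tuning_pattern[20:] = 1.0   # neurons modulated by saccade direction

pupil_responses = np.outer(pupil_levels, pupil_pattern)
tuning_separate = np.outer(np.cos(directions), tuning_pattern)
tuning_shared = np.outer(np.cos(directions), pupil_pattern + tuning_pattern)

separate = axis_alignment(first_pc(pupil_responses), first_pc(tuning_separate))
shared = axis_alignment(first_pc(pupil_responses), first_pc(tuning_shared))
print(f"alignment, separate populations: {separate:.2f}")
print(f"alignment, shared populations:   {shared:.2f}")
```

      The second case shows that a non-zero projection of the pupil axis onto the tuning axis is possible in principle, so a measured alignment near zero is an empirical result rather than a necessity.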

      Reviewer #2 (Public Review):

      (1) The greatest weakness in the present research is the fact that arousal is a functionally less important non-motoric variable. The authors themselves introduce the problem with a discussion of attention, which is without any doubt the most important cognitive process that needs to be functionally isolated from oculomotor processes. Given this introduction, one cannot help but wonder, why the authors did not design an experiment, in which spatial attention and oculomotor control are differentiated. Absent such an experiment, the authors should spend more time explaining the importance of arousal and how it could interfere with oculomotor behavior. 

      Although attention does represent an important cognitive process, we did not design an experiment in which attention and oculomotor control are differentiated because attention does not appear to be related to slow drift. In our first paper that reported on this phenomenon, we investigated the effects of spatial attention on slow fluctuations in neural activity by cueing the monkeys to attend to a stimulus in the left or right visual field in a block-wise manner. Each block lasted ~20 minutes and we found that slow drift did not covary with the timing of cued blocks (see Figure 4A, Cowley et al., 2020, Neuron). Furthermore, there is a large body of work showing that arousal also impacts motor behavior leading to changes in a range of eye-related metrics (e.g., pupil size, microsaccade rate and saccadic reaction time - for review, see Di Stasi et al. 2013, Neurosci. Biobehav. Rev.). We also note that the terms attention and arousal are often used in nonspecific and overlapping ways in the literature, adding to some potential confusion here. Nonetheless, pupil-linked arousal is an important variable that impacts motor performance. This has now been stated clearly in the Introduction of the revised manuscript (lines 108-114) to address the reviewer’s concerns and highlight the importance of studying how precise fixation and eye movements are maintained even in the presence of signals related to ongoing changes in brain state. 

      Cowley BR, Snyder AC, Acar K, Williamson RC, Yu BM, Smith MA (2020) Slow Drift of Neural Activity as a Signature of Impulsivity in Macaque Visual and Prefrontal Cortex. Neuron 108:551-567.e8.

      (2) In this context, it is particularly puzzling that one actually would expect effects of arousal on oculomotor behavior. Specifically, saccade reaction time, accuracy, and speed could be influenced by arousal. The authors should include an analysis of such effects. They should also discuss the absence or presence of such effects and how they affect their other results. 

      As described above, several studies across species have demonstrated that arousal impacts motor behavior e.g., saccade reaction time, saccade velocity and microsaccade rate (for review, see Di Stasi et al. 2013, Neurosci. Biobehav. Rev.). This has been clarified in the Introduction of the revised manuscript to address the reviewer's concerns (lines 108-114). Our prior work (Johnston et al, Cerebral Cortex, 2022) shows that slow drift impacts several types of oculomotor behavior. Overall, these studies highlight the impact of arousal on eye movements as a robust effect, and support the present investigation into arousal and oculomotor control signals. While we agree reaction time, accuracy, and speed all can be influenced by arousal depending on task demands, the present study is focused on the connection between slow fluctuations in neural activity, linked to arousal, and different subpopulations of SC neurons. 

      Di Stasi LL, Catena A, Cañas JJ, Macknik SL, Martinez-Conde S (2013) Saccadic velocity as an arousal index in naturalistic tasks. Neurosci Biobehav Rev 37:968–975.

      Johnston R, Snyder AC, Khanna SB, Issar D, Smith MA (2022) The eyes reflect an internal cognitive state hidden in the population activity of cortical neurons. Cereb Cortex 32:3331–3346.

      (3) The authors use the analysis shown in Figure 6D to argue that across recording sessions the activity components capturing variance in pupil size and saccade tuning are uncorrelated. However, the distribution (green) seems to be non-uniform with a peak at very low and very high correlation specifically. The authors should test if such an interpretation is correct. If yes, where are the low and high correlations respectively? Are there potentially two functional areas in SC? 

      We agree with the reviewer that our actual data distribution was non-uniform. We examined individual sessions with high and low variance explained and did not find notable differences. One source of this variation has to do with session length. Longer sessions in principle should have a chance distribution of variance explained closer to zero because they contained more time bins. Given that we had no specific hypothesis for a non-uniform distribution, we have simply displayed the full distribution of values in our figure and the statistical result of a comparison to a shuffled distribution.
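
The point about session length can be illustrated with a short simulation. This is a sketch under stated assumptions, not the shuffle procedure used in the manuscript: the two series are independent Gaussian noise, the shuffle is a simple permutation of time bins, and `variance_explained` and `shuffled_null` are hypothetical helper names. It shows that the chance level of variance explained shrinks as the number of time bins grows.

```python
import numpy as np

def variance_explained(x, y):
    """Squared Pearson correlation between two 1-D series."""
    return np.corrcoef(x, y)[0, 1] ** 2

def shuffled_null(x, y, n_shuffles, rng):
    """Null distribution of variance explained after permuting the time bins of y."""
    return np.array([variance_explained(x, rng.permutation(y)) for _ in range(n_shuffles)])

rng = np.random.default_rng(1)

# A short session (20 bins) and a long session (200 bins); bin counts are placeholders.
null_short = shuffled_null(rng.standard_normal(20), rng.standard_normal(20), 500, rng)
null_long = shuffled_null(rng.standard_normal(200), rng.standard_normal(200), 500, rng)

mean_short = null_short.mean()
mean_long = null_long.mean()
```

Because real slow-drift signals are autocorrelated, a simple permutation of bins may understate the chance level; the sketch is meant only to show the dependence on the number of bins.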

      Reviewer #3 (Public Review):

      (1) However, I am concerned about two main points: First, the authors repeatedly say that the "output" layers of the SC are the ones with the highest motor indices. This might not necessarily be accurate. For example, current thresholds for evoking saccades are lowest in the intermediate layers, and Mohler & Wurtz 1972 suggested that the output of the SC might be in the intermediate layers. Also, even if it were true that the high motor index neurons are the output, they are very few in the authors' data (this is also true in a lot of other labs, where it is less likely to see purely motor neurons in the SC). So, this makes one wonder if the electrode channels were simply too deep and already out of the SC? In other words, it seems important to show distributions of encountered neurons (regardless of the motor index) across depth, in order to better know how to interpret the tails of the distributions in the motor index histogram and in the other panels of Figure Supplement 1. I elaborate more on these points in the detailed comments below. 

      The reviewer makes a good point about the efferent signals from SC. It is true that electrical thresholds are often lowest in intermediate layers, though deep layers do project to the oculomotor nuclei (Sparks, 1986; Sparks & Hartwich-Young, 1989) and often intermediate and deep layers are considered to function together to control eye movements (Wurtz & Albano, 1980). As suggested by the reviewer, we have edited the text throughout the manuscript to say that slow drift was less evident in SC neurons with a higher motor index, as well as included the above references and points about the intermediate and deep layers (Lines 73-81). Aside from the question of which layers of the SC function as the “motor output”, the reviewer raises a separate and important question – are our deep recordings still in SC. Here, we can say definitively that they are. We removed neurons if they did not exhibit elevated (above baseline) firing rates during the visual or saccade epochs of the MGS task (see Methods section on “Exclusion criteria”). All included neurons possessed a visual, visuomotor or motor response, consistent with the response properties of neurons in the SC. In addition, we found a number of neurons well above the bottom of the probe with strong motor responses and minimal loadings onto the slow drift axis (see Figure 2 – figure supplement 1A), consistent with the reviewer’s comment that intermediate layer neurons are tuned for movement and play a role in saccade production.

      Mohler CW, Wurtz RH. Organization of monkey superior colliculus: intermediate layer cells discharging before eye movements. Journal of neurophysiology. 1976 Jul 1;39(4):722-44.

      Sparks DL. Translation of sensory signals into commands for control of saccadic eye movements: role of primate superior colliculus. Physiol Rev. 1986 Jan;66(1):118-71. doi: 10.1152/physrev.1986.66.1.118. PMID: 3511480.

      Sparks DL, Hartwich-Young R. The deep layers of the superior colliculus. Reviews of oculomotor research. 1989 Jan 1;3:213-55.

      Wurtz RH, Albano JE. Visual-motor function of the primate superior colliculus. Annu Rev Neurosci. 1980;3:189-226. doi: 10.1146/annurev.ne.03.030180.001201. PMID: 6774653.

      (2) Second, the authors find that the SC cells with a low motor index are modulated by pupil diameter. However, this could be completely independent of an "arousal signal". These cells have substantial visual responses. If the pupil diameter changes, then their activity should be influenced since the monkey is watching a luminous display. So, in this regard, the fact that they do not see "an arousal signal" in most motor neurons (through the pupil diameter analyses) is not evidence that the arousal signal is filtered out from the motor neurons. It could simply be that these neurons simply do not get affected by the pupil diameter because they do not have visual sensitivity. So, even with the pupil data, it is still a bit tricky for me to interpret that arousal signals are excluded from the "output layers" of the SC. 

      The reviewer makes an important point about the SC’s visual responses. Neurons with a low motor index are, conversely, likely to have a stronger visual response index. However, we do not believe that changes in luminance can explain why the correlation between SC spiking response and pupil size is weaker for neurons with a higher motor index. Firstly, the changes in pupil size observed in the current paper and our previous work are slow and occur on a timescale of minutes (Cowley et al., 2020, Neuron) and are correlated with eye movement measures such as reaction time and microsaccade rate (Johnston et al., 2022, Cerebral Cortex). This is in stark contrast to luminance-evoked changes in pupil size that occur on a timescale of less than a second. Secondly, as shown in the new Figure 5 – figure supplement 1 in the revised manuscript, very similar results were found when SC spiking responses were correlated with pupil size during the baseline period, when only the fixation point was on the screen. Although the luminance of the small peripheral target stimulus can result in small luminance-evoked changes in pupil size, no changes in luminance occurred during the baseline period, which was defined as 100ms before the onset of the target stimulus. In Figure 2 – figure supplement 1 and Author response image 1 above, we show that slow drift is the same whether calculated on the baseline response, delay period, or peri-saccadic epoch. Thus, the measurement of slow drift is insensitive to the precise timing of the selection of both the window for the spiking response and the window for the pupil measurement. If luminance were the explanation for the slow changes in firing observed in visually responsive SC neurons, it would require those neurons to exhibit robust, sustained tuned responses to the small changes in retinal illuminance induced by the relatively small fluctuations in pupil size we observed from minute to minute. We are aware of no reports of such behavior in visually-responsive neurons in SC. We have included these analyses and this reasoning in the revised manuscript on lines 478-495.

      Reviewer#1 (Recommendations for the author):

      (1) It would be useful to provide line numbers in subsequent manuscripts for reviewers.

      Line numbers have been added in the revised version of the manuscript.

      (2) Page #6; last sentence: "...even impact processing at the early to mid stages of the visuomotor transformation, without leading to unwanted changes in motor output." I do not believe the authors have provided evidence that arousal levels were not associated with changes in motor output.

      As suggested by Reviewer 3 (see Public Reviews, Reviewer 3, Point 2), we have edited the text throughout the manuscript to say that slow drift was less evident in SC neurons with a higher motor index. This sentence in the revised manuscript now reads:

      “This provides a potential mechanism through which signals related to cognition and arousal can exist in the SC, and even impact processing at the early to mid stages of the visuomotor transformation, without leading to unwanted changes in SC neurons that are linked to saccade execution.”

      (3) Page #8; last paragraph: Although deep-layer SC neurons may not have been obtained during every recording session, a summary of the motor index scores observed along the probe across sessions would be useful to confirm their assumptions. 

      See Author response image 2 below, which shows the motor index of each recorded SC neuron on the x-axis and session number on the y-axis. The points are colored according to the squared factor loading, which represents the variance explained between the response of a neuron and the slow drift axis (see Figure 3B of the main manuscript). You can see from this plot that neurons with a stronger component loading (shown in teal to yellow) typically have a lower motor index, whereas the opposite is true for neurons with a weaker component loading (shown in dark blue).

      Author response image 2.

      Scatter plot showing the motor index of each recorded neuron along with the session number in which it was recorded. The points are colored according to the squared factor loading for each neuron along the slow drift axis. Note that loadings above 0.5 (33 data points in total) have been thresholded at 0.5 so that we could effectively use the color range to show all of the slow drift axis loadings.
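
For readers unfamiliar with the quantities plotted in this image, the following sketch shows one way a per-neuron squared loading and the 0.5 display cap could be computed. It rests on assumptions that are not taken from the manuscript: the data are simulated, the slow drift axis is approximated by the first principal component of mean-centered residual activity, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_bins, n_neurons = 300, 12

# Simulated residual activity: a shared slow fluctuation weighted differently
# across neurons, plus independent noise (time bins x neurons).
drift = np.sin(np.linspace(0, 3 * np.pi, n_bins))
weights = np.linspace(0.0, 2.0, n_neurons)
residuals = np.outer(drift, weights) + rng.standard_normal((n_bins, n_neurons))

# Assumption: the first principal component stands in for the slow drift axis.
centered = residuals - residuals.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projection = centered @ vt[0]

# Squared correlation between each neuron's activity and the projection,
# i.e. the fraction of that neuron's variance shared with the axis.
squared_loading = np.array(
    [np.corrcoef(centered[:, i], projection)[0, 1] ** 2 for i in range(n_neurons)]
)

# Cap values at 0.5 for display so a few large values do not dominate the color range.
color_value = np.minimum(squared_loading, 0.5)
```

Capping changes only the colors, not the underlying loadings, so neurons above the cap remain distinguishable from the rest but not from each other.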

      (4) Page #10; first paragraph: The authors should state the time window of the delay period used, since it may be distinct from the pupil analysis (first 200ms of delay). 

      This has been stated in the revised version of the manuscript. The sentence now reads:

      “We first asked if arousal-related fluctuations are present in the SC. As in previous studies that recorded from neurons in the cortex (Cowley et al., 2020), we found that the mean spiking responses of individual SC neurons during the delay period (chosen at random on each trial from a uniform distribution spanning 600-1100ms, see Methods) fluctuated over the course of a session while the monkeys performed the MGS task (Figure 2A, left).”

      (5) Page #10; second paragraph: Extra period at the end of a sentence: " most variance in the data..". 

      Fixed in the revised version of the manuscript.

      (6) Page #12: "between projections onto the SC slow drift axis and mean pupil size during the first 200ms of the delay period when a task-related pupil response could be observed." What criteria was used to determine whether a task-related pupil response was observed? 

      This was chosen based on the results of a previous study in our lab that used the same memory-guided saccade task to investigate the relationship between slow drift and changes in baseline and evoked pupil size (see Johnston et al., 2022, Cereb. Cortex, Figure 6B). The period was chosen based on plotting the average pupil size aligned on different trial epochs. As we show in Figure 5-figure supplement 3 above, the pupil interactions with slow drift did not depend on the particular time window of the pupil we chose.

      (7) Page #14; Figure 2A: The axes for the individual channels are strangely floating and quite different from all other figures. Please label the channel in the figure legend that was used as an example of the projected values onto the slow drift axis.

      The figure has been changed in the revised version of the manuscript so that the tick mark denoting zero residual spikes per second is on the top layer of each plot. A scale bar was chosen instead of individual axes to reduce clutter in the figure as it was used to demonstrate how slow drift was computed. Residual spiking responses from all neurons were projected on the slow drift axis to generate the scatter plot in the bottom right-hand corner of Figure 2A. There is no single neuron to label.

      (8) Page #16: "These results demonstrate that even though arousal-related fluctuations are present in the SC, they are isolated from deep-layer neurons that elicit a strong saccadic response and presumably reside closer to the motor output." In line with our major comments, lack of arousal-related activity during the delay period is meaningless for deep-layer SC neurons that are generally inactive during this time. It does not imply that there is no arousal signal! 

      Addressed in Public Reviews, Reviewer 1, Point 1 & 2. We found a similar lack of arousal-related modulations reported for deep-layer SC neurons when slow drift was computed using the saccade epoch (Figure 1 above). In addition, similar dynamics were observed when the SC slow drift axis was computed using spiking responses during the baseline, delay, visual and saccade period (Figure 2).

      (9) Page #18: "These findings provide additional support for the hypothesis that arousal-related fluctuations are isolated from neurons in the deep layers of the SC." The same criticism from above applies.

      Addressed in Public Reviews, Reviewer 1, Point 1 & 2.

      (10) Page #20; paragraph 3: "Taken together, the findings outlined above..." Would be useful to be more specific when referring to "activity" ; e.g., "...these neurons did not exhibit large fluctuations in delay-period activity over time".

      This sentence has been changed in the revised manuscript in light of the reviewer’s comments. It now reads:

      “In addition to being more weakly correlated with pupil size, the spiking responses of these neurons did not exhibit large fluctuations over time (Figure 2), and when considering the neuronal population as a whole, explained less variance in the slow drift axis when it was computed using population activity in the SC (Figure 3) and PFC (Figure 4).”

      Reviewer #3 (Recommendations for the author):

      The paper is clear and well-written. However, I am concerned about two main points: 

      (1) First, the authors repeatedly say that the "output" layers of the SC are the ones with the highest motor indices. This might not necessarily be accurate. For example, current thresholds for evoking saccades are lowest in the intermediate layers, and Mohler & Wurtz 1972 suggested that the output of the SC might be in the intermediate layers. Also, even if it were true that the high motor index neurons are the output, they are very few in the authors' data (this is also true in a lot of other labs, where it is less likely to see purely motor neurons in the SC). So, this makes one wonder if the electrode channels were simply too deep and already out of the SC. In other words, it seems important to show distributions of encountered neurons (regardless of motor index) across depth, in order to better know how to interpret the tails of the distributions in the motor index histogram and in the other panels of the figure supplement 1. I elaborate more on these points in the detailed comments below. 

      Addressed in Public Reviews, Reviewer 3, Point 1.

      (2) Second, the authors find that the SC cells with a low motor index are modulated by pupil diameter. However, this could be completely independent of an "arousal signal". These cells have substantial visual responses. If the pupil diameter changes, then their activity should be influenced since the monkey is watching a luminous display. So, in this regard, the fact that they do not see "an arousal signal" in most motor neurons (through the pupil diameter analyses) is not evidence that the arousal signal is filtered out from the motor neurons. It could simply be that these neurons simply do not get affected by the pupil diameter because they do not have visual sensitivity. So, even with the pupil data, it is still a bit tricky for me to interpret that arousal signals are excluded from the "output layers" of the SC. 

      Addressed in Public Reviews, Reviewer 3, Point 2.

      (3) I think that a remedy to the first point above is to change the text to make it a bit more descriptive and less interpretive. For example, just say that the slow drifts were less evident among the neurons with high motor index. 

      We thank the reviewer for this suggestion (see Public Reviews, Reviewer 3, Point 1).

      (4) For the second point, I think that it is important to consider the alternative caveat of different amounts of light entering the system. Changes in light level caused by pupil diameter variations can be quite large. 

      We thank the reviewer for this suggestion (see Public Reviews, Reviewer 3, Point 2).

      (5) Line 31: I'm a bit underwhelmed by this kind of statement. i.e. we already know that cognitive processes and brain states do alter eye movements, so why is it "critical" that high precision fixation and eye movements are maintained? And, isn't the next sentence already nulling this idea of criticality because it does show that the brain state alters the SC neurons? In fact, cognitive processes are already known to be most prevalent in the intermediate and deep layers of the SC. 

      It seems clear that while cognitive state does affect eye movements, it is desirable to have some separation between cognitive state and eye movement control. Covert attention, for instance, is precisely a situation where eye movement control is maintained to avoid overt saccades to the attended stimulus, and yet there are clear indications of attention’s impact on microsaccades and fixation. We stand by our statement that an important goal of vision is to have precise fixation and movements of the eye, and yet at the same time the eyes are subject to numerous influences by cognitive state.

      (6) Line 65: it is better to clarify that these are "functional layers" because there are actually more anatomical layers. 

      We have edited this sentence in the revised version of the manuscript so that it now reads:

      “The role of these projections in the visuomotor transformation depends on the functional layer of the SC in which they terminate”.

      (7) Line 73: this makes it sound like only the deepest layers are topographically organized, which is not true. Also, as early as Mohler & Wurtz, 1972, it was suggested that the intermediate layers have the biggest impacts downstream of the SC. This is also consistent with electrical microstimulation current thresholds for evoking saccades from the SC. 

      We have addressed the reviewers’ comments about the intermediate layers having the biggest impact downstream of the SC in Public Reviews, Reviewer 3, Point 1. Furthermore, line 73 has been changed in the revised manuscript so that it now reads:

      “As is the case for neurons in the superficial and intermediate layers, they [SC motor neurons] form a topographically organized map of visual space (White et al. 2017; Robinson 1972; Katnani and Gandhi 2011)”.  

      (8) Line 100: there is an analogous literature regarding the question of why unwanted muscle contractions do not happen. Specifically, in the context of why SC visual bursts do not automatically cause saccades (which is a similar problem to the ones you mention about cognitive signals interfering by generating unwanted eye movements), both Jagadisan & Gandhi, Curr Bio, 2022 and Baumann et al, PNAS, 2023 also showed that SC population activity not only has different temporal structure (Jagadisan & Gandhi) but also occupy different subspaces (Baumann et al) under these two different conditions (visual burst versus saccade burst). This is conceptually similar to the idea that you are mentioning here with respect to arousal. So, it is worth it to mention these studies here and again in the discussion. 

      We are grateful to the reviewer for these suggestions and have included text in the Introduction (Lines 125-128) and Discussion (Lines 678-682) of the revised manuscript along with the references cited above.

      (9) Line 147: as mentioned above, it is now generally accepted that there are quite a few "pure" motor neurons in the SC. This is consistent with what you find. E.g. Baumann et al., 2023. And, again see Mohler and Wurtz in the 1970's. So, I wonder how useful it is to go too much into this idea of the deeper motor neurons (e.g. the correlations in the other panels of the Figure 1 supplement). 

      This is related to the reviewer’s comment that the output of the SC might be in the intermediate layers. This concern has been addressed in Public Reviews, Reviewer 3, Point 1.

      (10) Figure 1 should say where the RF was for the shown spike rasters. i.e. were these the same saccade target across trials? And where was that location relative to the RF? It would help also in the text to say whether the saccade was always to the RF center or whether you were randomizing the target location. 

      We centered the array of saccade targets using the microstimulation-evoked eye movement for SC (see Methods section “Memory-guided saccade task”) to find the evoked eccentricity, and then used saccade targets with equal spacing of 45 degrees starting at zero (rightward saccade target). We did not do extensive RF mapping beyond this microstimulation centering. In Figure 1, the spike rasters are shown for a target that was visually identified to be within the neuron’s RF based on assessing responses to all 8 target angles. We have added information about this to the figure caption.
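
The target layout described above (eight targets at a common eccentricity, spaced 45 degrees apart and starting at zero for the rightward target) can be sketched as follows. The function name `target_positions` and the 10-degree eccentricity are placeholders, not values from the study.

```python
import math

def target_positions(eccentricity_deg, n_targets=8):
    """Return (angle_deg, x, y) for targets equally spaced on a circle, starting at 0 deg (rightward)."""
    step = 360.0 / n_targets
    positions = []
    for k in range(n_targets):
        angle = k * step
        x = eccentricity_deg * math.cos(math.radians(angle))
        y = eccentricity_deg * math.sin(math.radians(angle))
        positions.append((angle, x, y))
    return positions

# Placeholder eccentricity; in the task it would come from the microstimulation-evoked saccade.
targets = target_positions(eccentricity_deg=10.0)
```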

      (11) Line 218: but were there changes in the eye movement statistics? For example, the slow drift eye movements during fixation? Or even the microsaccades? 

      Addressed in Public Reviews, Reviewer 2, Point 2.  

      (12) Line 248: shuffling what exactly? I think that more explanation would be needed here. 

      Addressed in Public Reviews, Reviewer 1, Point 3.  

      (13) Line 263: but isn't this reflecting a sensory transient in the pupil diameter, since the target just disappeared? 

      Addressed in Public Reviews, Reviewer 3, Point 2.  

      (14) Line 271: I suspect that slow drift eye movements (in between microsaccades) would show higher correlations. Not sure how well you can analyze those with a video-based eye tracker. 

      We agree that fixational drift would be a worthwhile metric, but it is not one we have focused on here and to our knowledge does require higher precision tracking. 

      (15) Line 286: again, see above about similar demonstrations with respect to the visual and motor burst intervals, which clearly cause the same problem (even stronger) as the one studied here. 

      See reply, including Figure 2.

      (16) Line 330: again, I'm not sure deeper necessarily automatically means closer to the output. For example, current thresholds for evoked saccades grow higher as you go deeper. Maybe the authors can ask their colleague Neeraj Gandhi about this point specifically, just to be safe. Maybe the safest would be to remain descriptive about the data, and just say something like: arousal-related fluctuations were absent in our deepest recorded sites. 

      Addressed in Public Reviews, Reviewer 3, Point 1.

      (17) Line 332: likewise, statements like this one here would be qualified if the output was the intermediate layers......anyway if I understand what I read so far in the paper, the signal will be anyway orthogonal to the motor burst population subspace. So, maybe there's no need to emphasize that it goes away in the very deepest layers. 

      See reply above, Public Reviews, Reviewer 1, Point 4.

      (18) Figure 3A: related to the above, I think one issue could be that the deeper contacts might already be out of the SC. Maybe some cell count distribution from each channel should help in this regard. i.e. were you finding way fewer saccade-related neurons in the deepest channels (even though the few that you found were with high motor index)? If so, then wouldn't this just mean that the channel was too deep? I think there needs to be an analysis like this, to convince readers that the channels were still in the SC. Ideally, electrical stimulation current thresholds for evoking saccades at different depths would be tested, but I understand that this can be difficult at this stage. 

      Addressed in Public Reviews, Reviewer 3, Point 1.

      (19) I keep repeating this because in general, cognitive effects are stronger in the intermediate/deeper layers than in the superficial layers. If these interfere with eye movements like arousal, then why should arousal be different?

      Few studies have investigated the effects of attention on “pure” movement SC neurons that only discharge during a saccade. One study, which we cited in the Introduction (Ignashchenkova et al., 2004, Nat. Neurosci.), found significant differences in spiking responses between trials with and without attentional cueing for visual and visuomotor neurons. No significant difference was found for motor neurons, consistent with our hypothesis that signals related to cognition and arousal are kept separate from saccade-related signals in the SC.

      (20) The problem with Figure 5 and its related text is that the neurons with low motor index are additionally visual. So, of course, they can be modulated if the pupil diameter changes!

      Addressed in Public Reviews, Reviewer 3, Point 2.  

      (21) I had a hard time understanding Figure 6. 

      See reply above, Public Reviews, Reviewer 1, Point 4.

      (22) Line 586: these cells have more visual responses and will be affected by the amount of light entering the eye. 

      Addressed in Public Reviews, Reviewer 3, Point 2.

    1. Establishing the boundaries for your research may come from your instructor’s assignment guidelines.

      I completely agree with this sentence. Establishing boundaries for your research is especially important, and starting with what your teacher has provided is a good first step. For academic papers written as a student, your audience is a bit ambiguous; generally speaking, the only person who will read your paper is your professor, so understanding the guidelines and what the professor needs from the paper is important. The purpose of the paper is to demonstrate not only that you can do research but also that you are actively learning, engaging with, and articulating the information you are researching. It's important not only that the instructor's guidelines are clear and concise, but also that, when they aren't, we ask questions and refine our understanding to ensure that the essay is acceptable for the assignment.

    1. I use the end-papers at the back of the book to make a personal index of the author's points in the order of their appearance

      The making of a personal index is a first step in building a mesh of knowledge. In just a few years, Vannevar Bush will speak of "associative trails", a phrase he uses twice in "As We May Think" (The Atlantic, July 1945), but of potentially more import is his phrase "associative indexing", which paves the way for either juxtaposing or linking two ideas (either similar or disjoint) together. It bears asking whether it's more valuable to index and juxtapose similar ideas or disjoint ideas, which may more frequently lead to better, more useful, and more relevant and rich future ideas.

      It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing. Bush, Vannevar. 1945. “As We May Think.” The Atlantic 176: 101–8. https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/ (October 22, 2022).

    1. It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.

      See also the precursor of personal indexing which Mortimer J. Adler mentions in 1940: https://hypothes.is/a/cPcoAqhVEfC0rJOZ0Pm-8Q

    1. Reviewer #3 (Public review):

      Summary

      The authors set out to explore the potential relationship between adult neurogenesis of inhibitory granule cells in the olfactory bulb and cumulative changes over days in odor-evoked spiking activity (representational drift) in the olfactory stream. They developed a richly detailed spiking neuronal network model based on Izhikevich (2003), allowing them to capture the diversity of spiking behaviors of multiple neuron types within the olfactory system. This model recapitulates the circuit organization of both the main olfactory bulb (MOB) and the piriform cortex (PCx), including connections between the two (both feedforward and corticofugal). Adult neurogenesis was captured by shuffling the weights of the model's granule cells, preserving the distribution of synaptic weights. Shuffling of granule cell connectivity resulted in cumulative changes in stimulus-evoked spiking of the model's M/T cells. Individual M/T cell tuning changed with time, and ensemble correlations dropped sharply over the temporal interval examined (long enough that almost all granule cells in the model had shuffled their weights). Interestingly, these changes in responsiveness did not disrupt low-dimensional stability of olfactory representations: when projected into a low-dimensional subspace, population vector correlations in this subspace remained elevated across the temporal interval examined. Importantly, in the model's downstream piriform layer, this was not the case. There, shuffled GC connectivity in the bulb resulted in a complete shift in piriform odor coding, including for low-dimensional projections. This is in contrast to what the model exhibited in the M/T input layer. Interestingly, these changes in PCx extended to the geometrical structure of the odor representations themselves. Finally, the authors examined the effect of experience on representational drift. Using an STDP rule, they allowed the inputs to and outputs from adult-born granule cells to change during repeated presentations of the same odor. This stabilized stimulus-evoked activity in the model's piriform layer.
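
      As a rough illustration of the manipulation summarized above (weights shuffled while their distribution is preserved), the following Python sketch permutes a random subset of a weight list. It is not the authors' implementation: the function name `shuffle_weights`, the `fraction` parameter, and the flat list of weights are all assumptions made for illustration.

```python
import random


def shuffle_weights(weights, fraction, rng):
    """Permute a random subset of weights among themselves.

    Hypothetical helper: `fraction` is the share of entries that take part
    in the shuffle. The multiset of weight values is unchanged, so the
    weight distribution is preserved exactly.
    """
    n = len(weights)
    k = round(fraction * n)            # number of entries to shuffle
    idx = rng.sample(range(n), k)      # positions chosen for shuffling
    values = [weights[i] for i in idx]
    rng.shuffle(values)                # permute only the chosen values
    out = list(weights)                # copy; the input is not modified
    for i, v in zip(idx, values):
        out[i] = v
    return out


rng = random.Random(0)
before = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
after = shuffle_weights(before, 0.5, rng)
print(sorted(after) == sorted(before))  # the same values, possibly reassigned
```

      Because the values are only permuted, any summary of the weight distribution (mean, variance, histogram) is identical before and after the shuffle; only which connection carries which weight can change.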

      Strengths

      This paper suggests a link between adult neurogenesis in the olfactory bulb and representational drift in the piriform cortex. Using an elegant spiking network that faithfully recapitulates the basic physiological properties of the olfactory stream, the authors tackle a question of longstanding interest in a creative and interesting manner. As a purely theoretical study of drift, this paper presents important insights: synaptic turnover of recurrent inhibitory input can destabilize stimulus-evoked activity, but only to a degree, as representations in the bulb (the model's recurrent input layer) retain their basic geometrical form. However, this destabilized input results in profound drift in the model's second (piriform) layer, where both the tuning of individual neurons and the layer's overall functional geometry are restructured. This is a useful and important idea in the drift field, and to my knowledge, it is novel. The bulb is not the only setting where inhibitory synapses exhibit turnover (whether through neurogenesis or synaptic dynamics), and so this exploration of the consequences of such plasticity on drift is valuable. The authors also elegantly explore a potential mechanism to stabilize representations through experience, using an STDP rule specific to the inhibitory neurons in the input layer. This has an interesting parallel with other recent theoretical work on drift in the piriform (Morales et al., 2025 PNAS), in which STDP in the piriform layer was also shown to stabilize stimulus representations there. It is fascinating to see that this same rule also stabilizes piriform representations when implemented in the bulb's granule cells.

      The authors also provide a thoughtful discussion regarding the differential roles of mitral and tufted cells in drift in piriform and AON and the potential roles of neurogenesis in archicortex.

      In general, this paper puts an important and much-needed spotlight on the role of neurogenesis and inhibitory plasticity in drift. In this light, it is a valuable and exciting contribution to the drift conversation.

      Weaknesses

      I have one major, general concern that I think must be addressed to permit proper interpretation of the results.

      I worry that the authors' model may confuse thinking on drift in the olfactory system, because of differences in the behavior of their model from known features of the olfactory bulb. In their model, the tuning of individual bulbar neurons drifts over time. This is inconsistent with the experimental literature on the stability of odor-evoked activity in the olfactory bulb.

      In a foundational paper, Bhalla & Bower (1997) recorded from mitral and tufted cells in the olfactory bulb of freely moving rats and measured the odor tuning of well-isolated single units across a five-day interval. They found that the tuning of a single cell was quite variable within a day, across trials, but that this variability did not increase with time. Indeed, their measure of response similarity was equivalent within and across days. In what now reads as a prescient anticipation of the drift phenomenon, Bhalla and Bower concluded: "it is clear, at least over five days, that the cell is bounded in how it can respond. If this were not the case, we would expect a continual increase in relative response variability over multiple days (the equivalent of response drift). Instead, the degree of variability in the responses of single cells is stable over the length of time we have recorded." Thus, even at the level of single cells, this early paper argues that the bulb is stable.

      This basic result has since been replicated by several groups. Kato et al. (2012) used chronic two-photon calcium imaging of mitral cells in awake, head-fixed mice and likewise found that, while odor responses could be modulated by recent experience (odor exposure leading to transient adaptation), the underlying tuning of individual cells remained stable. While experience altered mitral cell odor responses, those responses recovered to their original form at the level of the single neuron, maintaining tuning over extended periods (two months). More recently, the Mizrahi lab (Shani-Narkiss et al., 2023) extended chronic imaging to six months, reporting that single-cell odor tuning curves remained highly similar over this period. These studies reinforce Bhalla and Bower's original conclusion: despite trial-to-trial variability, olfactory bulb neurons maintain stable odor tuning across extended timescales, with plasticity emerging primarily in response to experience. (The Yamada et al., 2017 paper, which the authors here cite, is not an appropriate comparison. In Yamada, mice were exposed daily to odor. Therefore, the changes observed in Yamada are a function of odor experience, not of time alone. Yamada does not include data in which the tuning of bulb neurons is measured in the absence of intervening experience.)

      Therefore, a model that relies on instability in the tuning of bulbar neurons risks giving the incorrect impression that the bulb drifts over time. This difference should be explicitly addressed by the authors to avoid any potential confusion. Perhaps the best course of action would be to fit their model to Mizrahi's data, should this data be available, and see if, when constrained by empirical observation, the model still produces drift in piriform. If so, this would dramatically strengthen the paper. If this is not feasible, then I suggest being very explicit about this difference between the behavior of the model and what has been shown empirically. I appreciate that in the data there is modest drift (e.g., Shani-Narkiss' Figure 8C), but the changes reported there really are modest compared to what is exhibited by the model. A compromise would be to simply apply these metrics to the model and match the model's similarity to the Shani-Narkiss data. Then the authors could ask what effect this has on drift in piriform.

      The risk here is that people will conclude from this paper that drift in piriform may simply be inherited from instability in the bulb. This view is inconsistent with what has been documented empirically, and so great care is warranted to avoid conveying that impression to the community.

      Major comments (all related to the above point)

      (1) Lines 146-168: The authors find in their model that "individual M/T cells changed their responses to the same odor across days due to adult-neurogenesis, with some cells decreasing the firing rate responses (Fig.2A1 top) while other cells increased the magnitude of their responses (Fig. 2A2 bottom, Fig. S2)". They also report a significant decrease in the "full ensemble correlation" in their model over time. They claim that these changes in individual cell tuning are "similar to what has been observed by others using calcium imaging of M/T cell activity (Kato et al., 2012 and Yamada et al., 2017)" and that the decrease in full ensemble correlation is "consistent with experimental observations (Yamada et al., 2017)." However, the conditions of the Kato and Yamada experiments that demonstrate response change are not comparable here, as odors were presented daily to the animals in these experiments. Therefore, the changes in odor tuning found in the Kato and Yamada papers (Kato Figure 4D; Yamada Figure 3E) are a function of accumulated experience with odor. This distinction is crucial because experience-induced changes reflect an underlying learning process, whereas changes that simply accumulate over time are more consistent with drift. The conditions of their model are more similar to those employed in other experiments described in Kato et al. 2012 (Figure 6C) as well as Shani-Narkiss et al. (2023), in which bulb tuning is measured not as a function of intervening experience, but rather as a function of time (Kato's "recovery" experiment). What is found in Kato is that even across two months, the tuning of individual mitral cells is stable. What alters tuning is experience with odor, the core finding of both the Kato et al., 2012 paper and also Yamada et al., 2017. It is crucial that this is clarified in the text.

      (2) The authors show that in a reduced-space correlation metric, the correlation of low-dimensional trajectories "remained high across all days"..."consistent with a recent experimental study" (Shani-Narkiss et al., 2023). It is true that in the Shani-Narkiss paper, a consistent low-dimensional response is found across days (t-SNE analysis in Shani-Narkiss Figure 7B). However, the key difference between the Shani-Narkiss data and the results reported here is that Shani-Narkiss also observed relative stability in the native space (Shani-Narkiss Figure 8). They conclude that they "find a relatively stable response of single neurons to odors in either awake or anesthetized states and a relatively stable representation of odors by the MC population as a whole (Figures 6-8; Bhalla and Bower, 1997)." This should be better clarified in the text.

      (3) In the discussion, the authors state that "In the MOB, individual M/T cells exhibited variable odor responses akin to gain control, altering their firing rate magnitudes over time. This is consistent with earlier experimental studies using calcium-imaging." (L314-6). Again, I disagree that these data are consistent with what has been published thus far. Changes in gain would have resulted in increased variability across days in the Bhalla data. Moreover, changes in gain would be captured by Kato's change index ("To quantify the changes in mitral cell responses, we calculated the change index (CI) for each responsive mitral cell-odor pair on each trial (trial X) of a given day as (response on trial X - the initial response on day 1)/(response on trial X + the initial response on day 1). Thus, CI ranges from −1 to 1, where a value of −1 represents a complete loss of response, 1 represents the emergence of a new response, and 0 represents no change." Kato et al.). This index will capture changes in gain. However, as shown in Figure 4D (red traces), Figure 6C (Recovery and Odor set B during odor set A experience and vice versa), the change index is either zero or near zero. If the authors wish to claim that their model is consistent with these data, they should also compute Kato's change index for M/T odor-cell pairs in their model and show that it also remains at 0 over time, absent experience.
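
      To make the requested analysis concrete, here is a minimal Python sketch of the change index as defined in the quoted passage from Kato et al. The function name `change_index` is hypothetical, responses are assumed to be non-negative magnitudes, and returning NaN when both responses are zero is an assumption (the quoted definition does not cover that case).

```python
def change_index(response_trial_x, response_day1):
    """Change index: (R_x - R_1) / (R_x + R_1).

    R_x is the response on trial X and R_1 is the initial response on day 1,
    following the definition quoted above. Assumes non-negative responses.
    """
    denom = response_trial_x + response_day1
    if denom == 0:
        # Assumption: the index is undefined when both responses are zero.
        return float("nan")
    return (response_trial_x - response_day1) / denom


print(change_index(0.0, 2.0))  # complete loss of response -> -1.0
print(change_index(2.0, 0.0))  # emergence of a new response -> 1.0
print(change_index(1.5, 1.5))  # no change -> 0.0
print(change_index(2.0, 1.0))  # a doubling of gain gives a positive index (1/3)
```

      The last line illustrates the point made above: a pure change in gain moves the index away from zero, so an index that stays near zero over time argues against gain drift.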

    1. Reviewer #2 (Public review):

      Summary:

      This paper addresses an interesting issue: how is the search for a visual target affected by its orientation (and the viewer's) relative to other items in the scene and gravity? The paper describes a series of visual search tasks, using recognizable targets (e.g., a cat) positioned within a natural scene. Reaction times and accuracy at determining whether the target was present or absent, trial-to-trial, were measured as the target's orientation, that of the context, and of the viewer themselves (via rotation in a flight simulator) were manipulated. The paper concludes that search is substantially affected by these manipulations, primarily by the reference frame of gravity, then visual context, followed by the egocentric reference frame.

      Strengths:

      This work is on an interesting topic, and benefits from using natural stimuli in VR / flight simulator to change participants' POV and body position.

      Weaknesses:

      There are several areas of weakness that I feel should be addressed.

      (1) The literature review/introduction seems to be lacking in some areas. The authors, when contemplating the behavioral consequences of searching for a 'rotated' target, immediately frame the problem as one of rotation, per se (i.e., contrasting only rotation-based explanations; "what rotates and in which 'reference frame[s]' in order to allow for successful search?"). For a reader not already committed to this framing, many natural questions arise that are worth addressing.

      1a) Why do we need to appeal to rotation at all as opposed to, say, familiarity? A rotated cat is less familiar than a typically oriented one. This is a long-standing literature (e.g., Wang, Cavanagh, and Green (1994)), of course, with a lot to unpack.

      1b) What are the triggers for the 'corrective' rotation that presumably brings reference frames back into alignment? What if the rotation had not been so obvious (i.e., for a target that may not have a typical orientation, like a hand, or a ball, or a learned, nonsense object)? Or what if the background had not had such a clear orientation (like a cluttered non-naturalistic background, or a naturalistic backdrop viewed from an unfamiliar POV (e.g., from above), or a naturalistic background in which not all of the elements were rotated)? What, ultimately, is rotated? The entire visual field? Does that mean that searching for multiple targets at different angles of rotation would interfere with one another?

      1c) Relatedly, what is the process by which the visual system comes to know the 'correct' rotation? (Or, alternatively, is 'triggered to realize' that there is a rotation in play?) Is this something that needs to be learned? Is it only learned developmentally, through exposure to gravity? Could it be learned in the context of an experiment that starts with unfamiliar stimuli?

      1d) Why the appeal to natural images? I appreciate any time a study can be moved from potentially too stripped-down laboratory conditions to more naturalistic ones, but is this necessary in the present case? Would the pattern of results have been different if these were typical laboratory 'visual search' displays of disconnected object arrays?

      1e) How should we reconcile rotation-based theories of 'rotated-object' search with visual search results from zero gravity environments (e.g., for a review, see Leone (1998))?

      1f) How should we reconcile the current manipulations with other viewpoint-perspective manipulations (e.g., Zhang & Pan (2022))?

      (2) The presentation/interpretation of results would benefit from more elaboration and justification.

      2a) All of the current interpretations rely on just the RT data. First, the RT results should also be presented in natural units (i.e., seconds/ms), not normalized. As well, results should be shown as violin plots or something similar that captures distribution - a lot of important information is lost when just presenting one 'average' dot across participants. More fundamentally, I think we need to have a better accounting for performance (percent correct or d') to help contextualize the RT results. We should at least be offered some visualization (Heitz, 2014) of the speed accuracy trade-off for each of the conditions. Following this, the authors should more critically evaluate how any substantial SAT trends could affect the interpretation of results.
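
      For reference, the sensitivity measure d' mentioned above can be computed from present/absent response counts as sketched below in Python. This is an illustrative sketch, not the authors' analysis: the function name `d_prime` and the log-linear correction (adding 0.5 to each cell so that perfect hit or false-alarm rates do not give infinite z-scores) are assumptions, and other corrections are in common use.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate) for a present/absent task.

    Assumption: a log-linear correction (+0.5 per cell) keeps both rates
    strictly between 0 and 1.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)


# Chance performance: hit rate equals false-alarm rate, so d' is 0.
print(d_prime(25, 25, 25, 25))
# Better-than-chance performance yields a positive d'.
print(d_prime(45, 5, 5, 45) > 0)
```

      Reporting d' (or percent correct) per condition alongside RT in seconds would let readers judge whether faster conditions are also less accurate.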

      2b) Unless I am missing something, the interpretation of the pattern of results (both qualitatively and quantitatively in their 'relative weight' analysis) relies on how they draw their contrasts. For instance, the authors contrast the two 'gravitational' conditions (target 0 deg versus target 90 deg) as if this were a change in a single variable/factor. But there are other ways to understand these manipulations that would affect contrasts. For instance, if one considers whether the target was 'consistent' (i.e., typically oriented) with respect to the context, egocentric, and gravitational frames, then the 'gravitational 0 deg' condition is consistent with context, egocentric view, but inconsistent with gravity. And, the 'gravitational 90 deg' condition, then, is inconsistent with context, egocentric view, but consistent with gravity. Seen this way, this is not a change in one variable, but three. The same is true of the baseline 0 deg versus baseline 90 deg condition, where again we have a change in all three target-consistency variables. The 'one variable' manipulations then would be: 1) baseline 0 versus visual context 0 (i.e., a change only in the context variable); 2) baseline 0 versus egocentric 0 (a change only in the egocentric variable); and 3) baseline 0 versus gravitational 0 (a change only in the gravitational variable). Other contrasts (e.g., gravitational 90 versus context 90) would showcase a change in two variables (in this case, a change in both context and gravity). My larger point is, again, unless I am really missing something, that the choice of how to contrast the manipulations will affect the 'pattern' of results and thereby the interpretation. If the authors agree, this needs to be acknowledged, plausible alternative schemes discussed, and the ultimate choice of scheme defended as the most valid.
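
      The bookkeeping behind this point can be sketched in a few lines of Python. Each condition is coded by whether the target is typically oriented ('consistent') in the context, egocentric, and gravitational frames. Only the two gravitational conditions are coded directly from the description above; the remaining entries are inferred from the contrasts listed there and should be treated as assumptions, as should the names `CONDITIONS` and `changed_frames`.

```python
FRAMES = ("context", "egocentric", "gravity")

# True means the target is typically oriented with respect to that frame.
CONDITIONS = {
    "gravitational 0": (True, True, False),   # stated in the comment above
    "gravitational 90": (False, False, True),  # stated in the comment above
    # The entries below are inferred from the listed contrasts (assumptions).
    "baseline 0": (True, True, True),
    "baseline 90": (False, False, False),
    "context 0": (False, True, True),
    "egocentric 0": (True, False, True),
    "context 90": (True, False, False),
}


def changed_frames(cond_a, cond_b):
    """Return the frames whose consistency differs between two conditions."""
    pairs = zip(FRAMES, CONDITIONS[cond_a], CONDITIONS[cond_b])
    return [frame for frame, a, b in pairs if a != b]


print(changed_frames("gravitational 0", "gravitational 90"))
print(changed_frames("baseline 0", "gravitational 0"))
print(changed_frames("gravitational 90", "context 90"))
```

      Under this coding, the 0 deg versus 90 deg contrast within the gravitational condition changes all three variables, whereas baseline 0 versus gravitational 0 changes only one.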

      2c) Even with this 'relative weight' interpretation, there are still some patterns of results that seem hard to account for. Primarily, the egocentric condition seems hard to account for under any scheme, and the authors need to spend more time discussing/reconciling those results.

      2d) Some results are just deeply counterintuitive, and so the reader will crave further discussion. Most saliently for me, based on the results of Experiment 2 (specifically, the fact that gravitational 90 had better performance than gravitational 0), designers of cockpits should have all gauges/displays rotate counter to the airplane so that they are always consistent with gravity, not the pilot. Is this indeed a fair implication of the results?

      2e) I really craved some 'control conditions' here to help frame the current results. In keeping with the rhetorical questions posed above in 1a/b/c/d, if/when the authors engage with revisions to this paper, I would encourage the inclusion of at least some new empirical results. For me, the most critical would be to repeat some core conditions, but with a symmetric target (e.g., a ball), since that would seem to be the only way (given the current design) to tease out nuisance confounding factors such as, say, the general effect of performing search while sideways. Put another way, the authors would currently have to assume that search (non-normalized RTs and search performance) for a ball target in the baseline condition would be identical to that in the gravitational condition.

  3. social-media-ethics-automation.github.io
    1. When someone presents themselves as open and as sharing their vulnerabilities with us, it makes the connection feel authentic. We feel like they have entangled their wellbeing with ours by sharing their vulnerabilities with us. Think about how this works with celebrity personalities. Jennifer Lawrence became a favorite of many when she tripped at the Oscars [f2], and turned the moment into her persona as someone with a cool-girl, unpolished, unfiltered way about her. She came across as relatable and as sharing her vulnerabilities with us, which let many people feel that they had a closer, more authentic connection with her. Over time, that persona has come to be read differently, with some suggesting that this open-styled persona is in itself also a performance. Does this mean that her performance of vulnerability was inauthentic?

      This chapter about authenticity really makes me reflect on the current "performative" male trend. As you may know, the stereotype for these performative males goes along the lines of things like drinking matcha, wearing tote bags, listening to indie music like Clairo, etc. In hindsight, you can chalk this up to just one's interests, regardless of gender. But the reason it's such a big trend is that people can sense when a guy is doing it purely for validation. More specifically, female validation, since these interests are more stereotypically women's interests. So, as the text reads, "humans do not like to be duped", and when people can tell something is inauthentic, they're not going to take it seriously.

    1. Although it is increasingly recognised that the tools we use to examine our objects of study change our relationship to them, this is not an area that has been studied in any great detail within Digital Archaeology beyond perhaps discussions of the effects of different categories of software (the impact of GIS or database applications, for instance, or the effect of enlarged access to open data sources) on how we organise and understand the past. I have suggested elsewhere that through understanding how these technologies operate on us as well as for us, we can seek to ensure that they serve us better in what as archaeologists we already do, and help us initiate new and innovative ways of thinking about the past (Huggett 2004; 2012a). This entails going beyond the relatively commonplace reflections on specific software applications and their context of use: the tools we create, adopt, refine and employ have the effect of augmenting and scaffolding our thought and analysis, and consequently I have argued that they need to be approached in a considered, aware, and knowledgeable manner.

      This passage highlights how the digital tools we use do more than organize data: they actively shape how we think about and interpret the past. Huggett suggests that technologies “operate on us as well as for us,” meaning they influence not only the results of our research but also the cognitive processes that produce those results. This idea connects directly to my project on Tang poetry and emotion. When I use computational methods such as Voyant Tools and SnowNLP to analyze the emotional vocabulary of poems from the Tang dynasty, these tools shape the patterns I see and the questions I ask. For example, frequency counts or sentiment scores may emphasize some emotions while downplaying others that are culturally embedded in Chinese language and history. Therefore, as Huggett proposes, I must approach these technologies consciously and critically. They can scaffold my thought by helping me visualize large poetic patterns, but they can also reshape my understanding of the texts I study. This awareness encourages me to balance quantitative data with close reading and historical sensitivity, ensuring that the digital analysis deepens rather than distorts my interpretation of Tang emotional expression.

    1. Author response:

      The following is the authors’ response to the original reviews

      General Statements:

      In our manuscript, we demonstrate for the first time that RNA Polymerase I (Pol I) can prematurely release nascent transcripts at the 5' end of ribosomal DNA transcription units in vivo. This achievement was made possible by comparing wild-type Pol I with a mutant form of Pol I, hereafter called SuperPol, previously isolated in our lab (Darrière et al., 2019). By combining in vivo analysis of rRNA synthesis (using pulse-labelling of nascent transcripts and cross-linking of nascent transcripts - CRAC) with in vitro analysis, we could show that SuperPol reduced premature transcript release due to altered elongation dynamics and reduced RNA cleavage activity. Such premature release could reflect regulatory mechanisms controlling rRNA synthesis. Importantly, this increased processivity of SuperPol is correlated with resistance to BMH-21, a novel anticancer drug inhibiting Pol I, showing the relevance of targeting Pol I during transcriptional pauses to kill cancer cells. This work offers critical insights into Pol I dynamics, rRNA transcription regulation, and implications for cancer therapeutics.

      We sincerely thank the three reviewers for their insightful comments and recognition of the strengths and weaknesses of our study. Their acknowledgment of our rigorous methodology, the relevance of our findings on rRNA transcription regulation, and the significant enzymatic properties of the SuperPol mutant is highly appreciated. We are particularly grateful for their appreciation of the potential scientific impact of this work. Additionally, we value the reviewer’s suggestion that this article could address a broad scientific community, including in transcription biology and cancer therapy research. These encouraging remarks motivate us to refine and expand upon our findings further.

      All three reviewers acknowledged the increased processivity of SuperPol compared to its wild-type counterpart. However, two out of three question our claim that premature termination of transcription can regulate ribosomal RNA transcription. This conclusion is based on the SuperPol mutant increasing rRNA production. Proving that modulation of early transcription termination is used to regulate rRNA production under physiological conditions is beyond the scope of this study. Therefore, we propose to change the title of this manuscript to focus on what we have unambiguously demonstrated:

      “Ribosomal RNA synthesis by RNA polymerase I is subjected to premature termination of transcription”.

      Reviewer 1's main criticism centers on the use of the CRAC technique in our study. While we address this point in detail below, we would like to emphasize that, although we agree with the reviewer’s comments regarding its application to Pol II studies, CRAC, by limiting contamination with mature rRNA, remains the only suitable method for studying Pol I elongation over entire transcription units. All other methods are massively contaminated with fragments of mature RNA, which prevents any quantitative analysis of read distribution within rDNA. This perspective is widely accepted within the Pol I research community, as CRAC provides a robust approach to capturing transcriptional dynamics specific to Pol I activity.

      We hope that these findings will resonate with the readership of your journal and contribute significantly to advancing discussions in transcription biology and related fields.

      Description of the planned revisions:

      Beyond the numerous text modifications (see below), we agree that one major point of discussion is the consequence of increased processivity in the SuperPol mutant on the “quality” of the rRNA produced. Reviewer 3 suggested comparisons with other processive alleles, such as the rpb1-E1103G mutant of the RNAPII subunit (Malagon et al., 2006). This comparison has already been addressed by the Schneider lab (Viktorovskaya OV, Cell Rep., 2013 - PMID: 23994471), which explored Pol II (rpb1-E1103G) and Pol I (rpa190-E1224G). The rpa190-E1224G mutant revealed enhanced pausing in vitro, highlighting key differences between Pol I and Pol II catalytic rate-limiting steps (see David Schneider's review on this topic for further details).

      Reviewers 2 and 3 suggested that a decreased efficiency of cleavage upon backtracking might imply an increased error rate in SuperPol compared to the wild-type enzyme. Pol I mutants with decreased rRNA cleavage have been characterized previously and showed an increased error rate. We have already started to address this point. Preliminary results from in vitro experiments suggest that SuperPol mutants exhibit an elevated error rate during transcription. However, these findings remain preliminary and require further experimental validation to confirm their reproducibility and robustness. We propose to consolidate these data and incorporate them into the manuscript to address this question comprehensively. This could provide valuable insights into the mechanistic differences between SuperPol and the wild-type enzyme. SuperPol is the first Pol I mutant described with an increased processivity in vitro and in vivo, and we agree that this might be at the cost of a decreased fidelity.

      Regulatory aspect of the process:

      To address the reviewer’s remarks, we propose to test our model by performing experiments that would evaluate PTT levels in Pol I mutants or under different growth conditions. These experiments would provide crucial data to support our model, which suggests that PTT is a regulatory element of Pol I transcription. By demonstrating how PTT varies with environmental factors, we aim to strengthen the hypothesis that premature termination plays an important role in regulating Pol I activity.

      We propose revising the title and conclusions of the manuscript. The updated version will better reflect the study's focus and temper claims regarding the regulatory aspects of termination events, while maintaining the value of our proposed model.

      Description of the revisions that have already been incorporated in the transferred manuscript:

      Some very important modifications have now been incorporated:

      Statistical Analyses and CRAC Replicates:

      Unlike reviewers 2 and 3, reviewer 1 suggests that we did not analyze the results statistically. In fact, the CRAC analyses were conducted in biological triplicate, ensuring robustness and reproducibility. The statistical analyses are presented in Figure 2C, which highlights significant findings supporting the fact that WT Pol I and SuperPol distribution profiles are different. The CRAC replicates exhibit a high correlation, and we found a significant effect in each region of interest (5’ ETS, 18S.2, 25S.1 and 3’ ETS, Figure 1), confirming consistency across experiments. Finally, we took care not to overinterpret the results, maintaining a rigorous and cautious approach in our analysis to ensure accurate conclusions.

      CRAC vs. Net-seq:

      Reviewer 1 asks us to comment on the differences between CRAC and Net-seq. The two methods complement each other but serve different purposes depending on the biological question and the context of the transcription analysis. Net-seq was originally designed for Pol II analysis. It captures nascent RNAs but does not eliminate mature ribosomal RNAs (rRNAs), leading to high levels of contamination. While this is manageable for Pol II analysis (in silico elimination of reads corresponding to rRNAs), it poses a significant problem for Pol I due to the dominance of rRNAs (60% of total RNAs in yeast), which share sequences with nascent Pol I transcripts. As a result, large Net-seq peaks are observed at mature rRNA extremities (Clarke 2018, Jacobs 2022). This limits the interpretation of the results to the short-lived pre-rRNA species. In contrast, CRAC has been specifically adapted by the laboratory of David Tollervey to map Pol I distribution while minimizing contamination from mature rRNAs (the CRAC protocol used exclusively recovers RNAs with 3′ hydroxyl groups that represent endogenous 3′ ends of nascent transcripts, thus removing RNAs with a 3′ phosphate, found in mature rRNAs). This makes CRAC more suitable for studying Pol I transcription, including polymerase pausing and distribution along rDNA, providing a quantitative dataset for the entire rDNA gene.

      CRAC vs. Other Methods:

      Reviewer 1 suggests using GRO-seq or TT-seq, but the experiments in Figure 2 aim to assess the distribution profile of Pol I along the rDNA, which requires a method optimized for this specific purpose. While GRO-seq and TT-seq are excellent for measuring RNA synthesis and co-transcriptional processing, they rely on Sarkosyl treatment to permeabilize cellular and nuclear membranes. Sarkosyl is known to artificially induce polymerase pausing and to inhibit RNase activities that are involved in the process. To avoid these artifacts, we used CRAC, a direct and fully in vivo approach. In a CRAC experiment, cells are grown exponentially in rich media and arrested via rapid cross-linking, providing precise and artifact-free data on Pol I activity and pausing.

      Pol I ChIP Signal Comparison:

      The ChIP experiments previously published in Darrière et al. lack the statistical depth and resolution offered by our CRAC analyses. The detailed results obtained through CRAC would have been impossible to detect using classical ChIP. The current study provides a more refined and precise understanding of Pol I distribution and dynamics, highlighting the advantages of CRAC over traditional methods in addressing these complex transcriptional processes.

      BMH-21 Effects:

      As highlighted by Reviewer 1, the effects of BMH-21 observed in our study differ slightly from those reported in earlier work (Ref Schneider 2022), likely due to variations in experimental conditions, such as methodologies (CRAC vs. Net-seq), as discussed earlier. We also identified variations in the response to BMH-21 treatment associated with differences in cell growth phases and/or cell density. These factors likely contribute to the observed discrepancies, offering a potential explanation for the variations between our findings and those reported in previous studies. In our approach, we prioritized reproducibility by carefully controlling BMH-21 experimental conditions to mitigate these factors. These variables can significantly influence results, potentially leading to subtle discrepancies. Nevertheless, the overall conclusions regarding BMH-21's effects on WT Pol I are largely consistent across studies, with differences primarily observed at the nucleotide resolution. This is a strength of our CRAC-based analysis, which provides precise insights into Pol I activity.

      We will address these nuances in the revised manuscript to clarify how such differences may impact results and provide context for interpreting our findings in light of previous studies.

      Minor points:

      Reviewer #1:

      In general, the writing style is not clear, and there are some word mistakes or poor descriptions of the results, for example: 

      On page 14: "SuperPol accumulation is decreased (compared to Pol I)". 

      On page 16: "Compared to WT Pol I, the cumulative distribution of SuperPol is indeed shifted on the right of the graph." 

      We have clarified and improved the overall writing style in response to the reviewer's comment.

      There are also issues with the literature, for example: Turowski et al, 2020a and Turowski et al, 2020b are the same article (preprint and peer-reviewed). Is there any reason to include both references? Please, double-check the references.  

      This was corrected in this version of the manuscript.

      In the manuscript, 5S rRNA is mentioned as an internal control for TMA normalisation. Why are Figure 1C data normalised to 18S rRNA instead of 5S rRNA? 

      Data are effectively normalized relative to the 5S rRNA, but the value for the 18S rRNA is arbitrarily set to 100%.
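      One possible reading of this two-step normalization can be sketched as follows: each probe signal is first divided by the 5S rRNA signal of the same sample, and the resulting ratios are then rescaled so that the 18S rRNA value equals 100%. The function name and the raw signal values below are hypothetical and serve only to illustrate the arithmetic.

```python
def normalize_tma(signals, control="5S", reference="18S"):
    """Normalize probe signals to an internal control, then express them
    as a percentage of a reference probe (reference = 100%).

    `signals` maps probe names to raw signal intensities (hypothetical units).
    """
    # Step 1: ratio of every non-control probe to the internal control (5S).
    ratios = {
        probe: value / signals[control]
        for probe, value in signals.items()
        if probe != control
    }
    # Step 2: rescale so that the reference probe (18S) is set to 100%.
    scale = 100.0 / ratios[reference]
    return {probe: ratio * scale for probe, ratio in ratios.items()}


# Hypothetical raw signals, for illustration only.
raw = {"5S": 200.0, "18S": 400.0, "25S": 300.0}
percent = normalize_tma(raw)
print(percent)
```

      Under this reading, the 5S control corrects for loading differences, while the choice of 18S as the 100% reference only fixes the display scale.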

      Figure 4 should be a supplementary figure, and Figure 7D doesn't have a y-axis labelling. 

      The presence of all Pol I specific subunits (Rpa12, Rpa34 and Rpa49) is crucial for the enzymatic activity we performed. In the absence of these subunits (which can vary depending on the purification batch), Pol I pausing, cleavage and elongation are known to be affected. To strengthen our conclusion, we really wanted to show the subunit composition of the purified enzyme. This important control should be shown, but can indeed be shown in a supplementary figure if desired.

      The y-axis in Figure 7D is now correctly labelled.

      In Figure 7C, BMH-21 treatment causes the accumulation of ~140bp rRNA transcripts only in SuperPol-expressing cells that are Rrp6-sensitive (line 6 vs line 8), suggesting that BHM-21 treatment does affect SuperPol. Could the author comment on the interpretation of this result? 

      The 140 nt product is a degradation fragment resulting from trimming, which explains its lower accumulation in the absence of Rrp6. BMH-21 significantly affects WT Pol I transcription but also has a mild effect on SuperPol transcription. As a result, the 140 nt product accumulates under these conditions.

      Reviewer #2:

      pp. 14-15: The authors note local differences in peak detection in the 5'-ETS among replicates, preventing a nucleotide-resolution analysis of pausing sites. Still, they report consistent global differences between wild-type and SuperPol CRAC signals in the 5'ETS (and other regions of the rDNA). These global differences are clear in the quantification shown in Figures 2B-C. A simpler statement might be less confusing, avoiding references to a "first and second set of replicates" 

      As suggested by the reviewer, the statement has been simplified in this version of the manuscript.

      Figures 2A and 2C: Based on these data and quantification, it appears that SuperPol signals in the body and 3' end of the rDNA unit are higher than those in the wild type. This finding supports the conclusion that reduced pausing (and termination) in the 5'ETS leads to an increased Pol I signal downstream. Since the average increase in the SuperPol signal is distributed over a larger region, this might also explain why even a relatively modest decrease in 5'ETS pausing results in higher rRNA production. This point merits discussion by the authors. 

      We agree that this is a very important point for the discussion of our results. Transcription is a very dynamic process in which paused polymerases are easily detected using the CRAC assay. Elongating polymerases are distributed over a much larger gene body, and even a small amount of polymerase detected in the gene body can represent a large amount of rRNA synthesis. This point is of paramount importance and, as suggested by the reviewer, is now discussed in detail.

      A decreased efficiency of cleavage upon backtracking might imply an increased error rate in SuperPol compared to the wild-type enzyme. Have the authors observed any evidence supporting this possibility? 

      Reviewer #2 suggested that a decreased efficiency of cleavage upon backtracking might imply an increased error rate in SuperPol compared to the wild-type enzyme. We thank Reviewer #2 for raising this, as in our opinion it is an important point that should be added to the manuscript. We have now included new data (panels 5G, 5H and 5I) in the manuscript showing that SuperPol exhibits an increased error rate in vitro compared to the WT enzyme. From these in vitro results, we conclude that SuperPol shows reduced nascent transcript cleavage, associated with more efficient transcript elongation, but to the detriment of transcriptional fidelity.

      pp. 15 and 22: Premature transcription termination as a regulator of gene expression is well documented in yeast, with significant contributions from the Corden, Brow, Libri, and Tollervey labs. These studies should be referenced along with relevant bacterial and mammalian research. 

      Following the reviewer's suggestion, we now reference these studies.

      p. 23: "SuperPol and Rpa190-KR have a synergistic effect on BMH-21 resistance." A citation should be added for this statement. 

      This statement is based on unpublished data from our lab. KR and SuperPol are the only two known mutants resistant to BMH-21. We observed that the resistance conferred by the two alleles is synergistic, with a much higher resistance to BMH-21 in the double mutant than in either single mutant (data not shown). Comparing their resistance mechanisms is an important point, and we could provide these data upon request. This was added to the statement.

      p. 23: "The released of the premature transcript" - this phrase contains a typo 

      This is now corrected.

      Reviewer #3:

      Figure 1B: it would be opportune to separate the technique's schematic representation from the actual data. Concerning the data, would the authors consider adding an experiment with rrp6D cells? Some RNAs could be degraded even in such short period of time, as even stated by the authors, so maybe an exosome depleted background could provide a more complete picture. Could also the authors explain why the increase is only observed at the level of 18S and 25S? To further prove the robustness of the Pol I TMA method could be good to add already characterized mutations or other drugs to show that the technique can readily detect also well-known and expected changes. 

      The precise objective of this experiment is to avoid the use of the Rrp6 mutant. Under these conditions, we prevent the accumulation of transcripts that would result from a maturation defect. While it is possible to conduct the experiment with the Rrp6 mutant, it would be impossible to draw reliable conclusions due to this artificial accumulation of transcripts.

      Figure 1C: the NTS1 probe signal is missing (it is referenced in Figure 1A but not listed in the Methods section or the oligo table). If this probe was unused, please correct Figure 1A accordingly. 

      We corrected Figure 1A.  

      Figure 2A: the RNAPI occupancy map by CRAC is hard to interpret. The red color (SuperPol) is stacked on top of the blue line, and we are not able to observe the signal of the WT for most of the position along the rDNA unit. It would be preferable to use some kind of opacity that allows to visualize both curves. Moreover, the analysis of the behavior of the polymerase is always restricted to the 5'ETS region in the rest of the manuscript. We are thus not able to observe whether termination events also occur in other regions of the rDNA unit. A Northern blot analysis displaying higher sizes would provide a more complete picture. 

      We addressed this point to make the figure more visually informative. In Northern Blot analysis, we use a TSS (Transcription Start Site) probe, which detects only transcripts containing the 5' extremity. Due to co-transcriptional processing, most of the rRNA undergoing transcription lacks its 5' extremity and is not detectable using this technique. We have the data, but it does not show any difference between Pol I and SuperPol. This information could be included in the supplementary data if asked.

      "Importantly, despite some local variations, we could reproducibly observe an increased occupancy of WT Pol I in 5'-ETS compared to SuperPol (Figure 1C)." should be Figure 2C. 

      Thanks for pointing out this mistake. It has been corrected.

      Figure 3D: most of the difference in the cumulative proportion of CRAC reads is observed in the region ~750 to 3000. In line with my previous point, I think it would be worth exploring also termination events beyond the 5'-ETS region. 

      We agree that such an analysis would have been interesting. However, with the exception of the pre-rRNA starting at the transcription start site (TSS) studied here, any cleaved rRNA at its 5' end could result from premature termination and/or abnormal processing events. Exploring the production of other abnormal rRNAs produced by premature termination is a project in itself, beyond this initial work aimed at demonstrating the existence of premature termination events in ribosomal RNA production.

      Figure 4: should probably be provided as supplementary material. 

      As mentioned earlier (see comments above), the presence of all Pol I specific subunits (Rpa12, Rpa34 and Rpa49) is crucial for the enzymatic activity we performed. This important control should be shown, but can indeed be shown in a supplementary figure if desired.

      "While the growth of cells expressing SuperPol appeared unaffected, the fitness of WT cells was severely reduced under the same conditions." I think the growth of cells expressing SuperPol is slightly affected. 

      We agree with this comment and we modified the text accordingly.

      Figure 7D: the legend of the y-axis is missing as well as the title of the plot. 

      Legend of the y-axis and title of the plot are now present.

      The statements concerning BMH-21, SuperPol and Rpa190-KR in the Discussion section should be removed, or data should be provided.

      This was discussed previously. See comment above.

      Some references are missing from the Bibliography, for example Merkl et al., 2020; Pilsl et al., 2016a, 2016b. 

      The bibliography is now fixed.

      Description of analyses that authors prefer not to carry out:

      Does the SuperPol mutant produce more functional rRNAs?

      As Reviewer 1 requested, we agree that this point requires clarification. In cells expressing SuperPol, a higher steady-state level of (pre-)rRNAs is only observed in the absence of the degradation machinery, suggesting that overproduced rRNAs are rapidly eliminated. We know that (pre-)rRNAs are unable to accumulate in the absence of ribosomal proteins and/or assembly factors (AFs). Consequently, overproducing rRNAs would not be sufficient to increase ribosome content. This specific point is being further addressed in our lab but is beyond the scope of this article.

      Is premature termination coupled with rRNA processing?

      We appreciate the reviewer’s insightful comments. The suggested experiments regarding the UTP-A complex's regulatory potential are valuable and ongoing in our lab, but they extend beyond the scope of this study and are not suitable for inclusion in the current manuscript.

    1. But dry sterile thunder without rain

      This line stood out to me due to its connection with the title "What the Thunder Said," and its similar connotation to the Gospel of John. This line appears after a somewhat odd repetition of a lack of water within the land. Rather, the speaker is left in a desolate landscape of "only rock." One may think that this barren image would also prompt a stillness or silence in nature. However, Eliot is quick to point out the presence of loud booms of thunder in my highlighted line. In particular, the thunder is "dry and sterile," therefore connecting to the state of the land; the rocky terrain is indeed also dry due to the emphasized absence of water and also sterile as a result. In the Gospel of John (verse 29), thunder holds a contrasting purpose: "[29] The people therefore, that stood by, and heard it, said that it thundered: others said, An angel spake to him." Therefore, the voice of God in John is expressed through thunder, showing the great force of divinity over the world. However, Eliot's vivid descriptions of the thunder in his wasteland could not be more different. The thunder is "dry and sterile" and, in my opinion, lacks the religious importance evident in John. In connection with the title, my reading of this line suggests that Eliot does not believe the thunder is saying anything (What the Thunder Said). Instead, we are trapped in a dry and sterile land mass with no divine connection to guide us out.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Liu et al., present glmSMA, a network-regularized linear model that integrates single-cell RNA-seq data with spatial transcriptomics, enabling high-resolution mapping of cellular locations across diverse datasets. Its dual regularization framework (L1 for sparsity and generalized L2 via a graph Laplacian for spatial smoothness) demonstrates robust performance of their model and offers novel tools for spatial biology, despite some gaps in fully addressing spatial communication.

      Overall, the manuscript is commendable for its comprehensive benchmarking across different spatial omics platforms and its novel application of regularized linear models for cell mapping. I think this manuscript can be improved by addressing method assumptions, expanding the discussion on feature dependence and cell type-specific biases, and clarifying the mechanism of spatial communication.

      The conclusions of this paper are mostly well supported by data, but some aspects of model development and performance evaluation need to be clarified and extended.

      We are thankful for the positive comments and have made changes following the reviewer's advice, as detailed below.

      (1) What were the assumptions made behind the model? One of them could be the linear relationship between cellular gene expression and spatial location. In complex biological tissues, non-linear relationships could be present, and this would also vary across organ systems and species. Similarly, with regularization parameters, they can be tuned to balance sparsity and smoothness adequately but may not hold uniformly across different tissue types or data quality levels. The model also seems to assume independent errors with normal distribution and linear additive effects - a simplification that may overlook overdispersion or heteroscedasticity commonly observed in RNA-seq data.

      Thank you for this comment. We acknowledge that the non-linear relationships can be present in complex tissues and may not be fully captured by a linear model. 

      Our choice of a linear model was guided by an investigation of the relationship in the current datasets, which include intestinal villus, mouse brain, and fly embryo. There is a linear correlation between expression distance and physical distance [Nitzan et al]. Within a given anatomical structure, cells in closer proximity exhibit more similar expression patterns (Fig. 3c). In tissues where non-linear relationships are more prevalent, such as the human PDAC sample, our mapping results remain robust. We acknowledge that we have not yet tested our algorithm in highly heterogeneous regions like the liver, and we plan to include such analyses in future work if necessary.

      Regarding the regularization parameters, we agree that the balance between sparsity and smoothness is sensitive to tissue-specific variation and data quality. In our current implementation, we explored a range of values to find robust defaults. Supplementary Figure 7 illustrates the regularization path for cell assignment in the fly embryo.  

      The choice of L1 and L2 regularization parameters is crucial for balancing sparsity and smoothness in spatial mapping. 

      For Structured Tissues (brain):

      Moderate L1 to ensure cells are localized.

      Small to moderate L2 to maintain local smoothness without blurring distinct regions.

      For Less Structured (PDAC):

      Slightly lower L1 to allow cells to be associated with multiple regions if boundaries are ambiguous.

      Higher L2 to stabilize mappings in noisy or mixed regions.
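      To make the roles of the two penalties concrete, the following minimal sketch fits a generic network-regularized linear model, minimizing 0.5*||y - Xw||^2 + lam1*||w||_1 + lam2*w'Lw by proximal gradient descent, where L is a graph Laplacian over spatial spots. It is not the glmSMA implementation: the function names, the toy chain graph, the simulated data, and the penalty values are all assumptions chosen for illustration.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A of an undirected graph."""
    return np.diag(adjacency.sum(axis=1)) - adjacency


def objective(w, X, y, L, lam1, lam2):
    """Squared error + L1 (sparsity) + Laplacian-based L2 (smoothness)."""
    residual = y - X @ w
    return 0.5 * residual @ residual + lam1 * np.abs(w).sum() + lam2 * w @ L @ w


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fit_ista(X, y, L, lam1, lam2, n_iter=500):
    """Proximal gradient descent (ISTA) for the objective above."""
    # Step size = 1 / largest eigenvalue of the Hessian of the smooth part.
    hessian = X.T @ X + 2.0 * lam2 * L
    step = 1.0 / np.linalg.eigvalsh(hessian).max()
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) + 2.0 * lam2 * (L @ w)
        w = soft_threshold(w - step * grad, step * lam1)
    return w


# Hypothetical toy problem: 6 spots on a chain, 40 marker genes.
rng = np.random.default_rng(0)
n_spots = 6
A = np.zeros((n_spots, n_spots))
for i in range(n_spots - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = graph_laplacian(A)
X = rng.normal(size=(40, n_spots))            # genes x spots (simulated)
w_true = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
y = X @ w_true                                # one cell's expression (simulated)

w_l1 = fit_ista(X, y, L, lam1=5.0, lam2=0.1)  # stronger sparsity
w_no_l1 = fit_ista(X, y, L, lam1=0.0, lam2=0.1)
print(np.round(w_l1, 3), np.round(w_no_l1, 3))
```

      In this toy setting, raising `lam1` shrinks the total weight assigned to spots, while raising `lam2` pulls the weights of neighbouring spots toward each other, which mirrors the tissue-specific guidance above.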

      (2) The performance of glmSMA is likely sensitive to the number and quality of features used. With too few features, the model may struggle to anchor cells correctly due to insufficient discriminatory power, whereas too many features could lead to overfitting unless appropriately regularized. The manuscript briefly acknowledges this issue, but further systematic evaluation of how varying feature numbers affect mapping accuracy would strengthen the claims, particularly in settings where marker gene availability is limited. A simple way to show some of this would be testing on multiple spatial omics (imaging-based) platforms with varying panel sizes and organ systems. Related to this, based on the figures, it also seems like the performance varies by cell type. What are the factors that contribute to this? Variability in expression levels, RNA quantity/quality? Biases in the panel? Personally, I am also curious how this model can be used similarly/differently if we have a FISH-based, high-plex reference atlas. Additional explanation around these points would be helpful for the readers.

      Thank you for this thoughtful comment. The performance of our method is indeed sensitive to the number and quality of selected features. To optimize feature selection, we employed multiple strategies, including Moran’s I statistic, identification of highly variable genes, and the Seurat pipeline to detect anchor genes linking the spatial transcriptomics data with the reference atlas. The number of selected markers depends on the quality of the data. For high-quality datasets, fewer than 100 markers are typically sufficient for prediction. To select marker genes, we applied the following optional strategies:

      (1) Identifying highly variable genes (HVGs).

      (2) Calculating Moran’s I scores for all genes to assess spatial autocorrelation.

      (3) Generating anchor genes based on the integration of the reference atlas and scRNA-seq data using Seurat.
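      As an illustration of strategy (2), the sketch below computes the standard global Moran's I for one gene, I = (N / W) * sum_ij w_ij z_i z_j / sum_i z_i^2, where z is the mean-centred expression and W is the sum of all spatial weights. The binary chain adjacency and the two toy expression vectors are assumptions made for this example; real analyses would use weights derived from the actual spot coordinates.

```python
import numpy as np


def morans_i(values, weights):
    """Global Moran's I for one gene.

    values  : expression of the gene at each of N spots
    weights : N x N spatial weight matrix (zero diagonal)
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()                      # mean-centred expression
    return (x.size / w.sum()) * (z @ w @ z) / (z @ z)


# Hypothetical layout: 6 spots on a chain, neighbours share an edge.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

smooth_gene = [1, 1, 1, 5, 5, 5]          # spatially coherent pattern
alternating_gene = [1, 5, 1, 5, 1, 5]     # checkerboard-like pattern

i_smooth = morans_i(smooth_gene, W)
i_alternating = morans_i(alternating_gene, W)
print(i_smooth, i_alternating)
```

      Genes with a high positive score vary smoothly across space and are therefore plausible candidates for spatially informative markers, whereas scores near zero or below suggest little usable spatial structure.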

      We evaluated our method across diverse tissue types and platforms—including Slide-seq, 10x Visium, and Virtual-FISH—which represent both sequencing-based and imaging-based spatial transcriptomics technologies. Our model consistently achieved strong performance across these settings. It's worth noting that the performance of other methods, such as CellTrek [Wei et al] and novoSpaRc [Nitzan et al], also depends heavily on feature selection. In particular, performance degrades substantially when fewer features are used. For fair comparison across different methods, the same set of marker genes was used. Under this condition, our method outperformed the others based on KL divergence (Fig. 2b, Fig. 5g). 

      To assess the effect of marker gene quantity, we randomly selected subsets of 2,000, 1,500, 1,000, 700, 500, and 200 markers from the original set. As the number of markers decreases, mapping performance declines, which is expected due to the reduction in available spatial information. This result underscores the general dependence of spatial mapping accuracy on both the number and quality of informative marker genes (Supplementary Fig. 10).

      We do not believe that the observed performance is directly influenced by cell type composition. Major cell types are typically well-defined, and rare cell types comprise only a small fraction of the dataset. For these rare populations, a single misclassification can disproportionately impact metrics like KL divergence due to small sample size. However, this does not necessarily indicate a systematic cell type–specific bias in the mapping. We incorporated a high-resolution Slide-seq dataset from the mouse hippocampus to evaluate the influence of cell type composition on the algorithm’s performance [Stickels et al., 2020]. Most cell types within the CA1, CA2, CA3, and DG regions were accurately mapped to their original anatomical locations (Fig. 5e, f, g).

      (3) Application 3 (spatial communication) in the graphical abstract appears relatively underdeveloped. While it is clear that the model infers spatial proximities, further explanation of how these mappings translate into insights into cell-cell communication networks would enhance the biological relevance of the findings.

      Thank you for this valuable feedback. We agree that further elaboration on the connection between spatial proximity and cell–cell communication would enhance the biological interpretation of our results. While our current model focuses on inferring spatial relationships, we may add cell–cell communication analyses in the future.

      (4) What is the final resolution of the model outputs? I am assuming this is dictated by the granularity of the reference atlas and the imposed sparsity via the L1 norm, but if there are clear examples that would be good. In figures (or maybe in practice too), cells seem to be assigned to small, contiguous patches rather than pinpoint single-cell locations, which is a pragmatic compromise given the inherent limitations of current spatial transcriptomics technologies. Clarification on the precise spatial scale (e.g., pixel or micrometer resolution) and any post-mapping refinement steps would be beneficial for the users to make informed decisions on the right bioinformatic tools to use.

      Thank you for the comment. For each cell, our algorithm generates a probability vector that indicates its likely spatial assignment along with coordinate information. In our framework, each cell is mapped to one or more spatial spots with associated probabilities. Depending on the amount of regularization through L1 and L2 norms, a cell may be localized to a small patch or distributed over a broader domain (Supplementary Fig. 5 & 7). For the 10x Visium data, we applied a repelling algorithm to enhance visualization [Wei et al]. If a cell’s original location is already occupied, it is reassigned to a nearby neighborhood to avoid overlap. The users can also see the entire regularization path by varying the penalty terms. 
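      A simple way to picture the per-cell output is sketched below: non-negative spot weights are normalized into a probability vector, and the spots above a chosen probability cut-off are reported with their coordinates. This is only an assumed post-processing step for illustration; the function names, the clipping of negative weights, the cut-off of 0.1, and the toy coordinates are not taken from the glmSMA code.

```python
import numpy as np


def assignment_probabilities(weights):
    """Turn fitted spot weights for one cell into a probability vector."""
    w = np.clip(np.asarray(weights, dtype=float), 0.0, None)  # drop negatives
    total = w.sum()
    if total == 0.0:
        # No informative weight: fall back to a uniform distribution.
        return np.full(w.size, 1.0 / w.size)
    return w / total


def top_spots(probabilities, coords, min_prob=0.1):
    """List (coordinate, probability) pairs above a cut-off, best first."""
    order = np.argsort(probabilities)[::-1]
    return [
        (tuple(float(c) for c in coords[i]), float(probabilities[i]))
        for i in order
        if probabilities[i] >= min_prob
    ]


# Hypothetical weights for one cell over five spots on a line.
weights = [0.0, 0.6, 0.2, 0.0, -0.1]
coords = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0]])

probs = assignment_probabilities(weights)
candidates = top_spots(probs, coords)
print(probs, candidates)
```

      With stronger sparsity the vector concentrates on one or two spots, whereas weaker sparsity or stronger smoothing spreads the probability over a broader neighbourhood.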

      Nitzan M, Karaiskos N, Friedman N, Rajewsky N. Gene expression cartography. Nature. 2019;576(7785):132-137. doi:10.1038/s41586-019-1773-3

      Wei, R. et al. (2022) ‘Spatial charting of single-cell transcriptomes in tissues’, Nature Biotechnology, 40(8), pp. 1190–1199. doi:10.1038/s41587-022-01233-1.

      Stickels, R.R. et al. (2020) ‘Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-SEQV2’, Nature Biotechnology, 39(3), pp. 313–319. doi:10.1038/s41587-020-0739-1. 

      Reviewer #2 (Public review):

      Summary:

      The author proposes a novel method for mapping single-cell data to specific locations with higher resolution than several existing tools.

      Strengths:

      The spatial mapping tests were conducted on various tissues, including the mouse cortex, human PDAC, and intestinal villus.

      Weakness:

      (1) Although the researchers claim that glmSMA seamlessly accommodates both sequencing-based and image-based spatial transcriptomics (ST) data, their testing primarily focused on sequencing-based ST data, such as Visium and Slide-seq. To demonstrate its versatility for spatial analysis, the authors should extend their evaluation to imaging-based spatial data.

      Thank you for the comment. We have tested our algorithm on the virtual FISH dataset from the fly embryo, which serves as an example of image-based spatial omics data (Fig. 4c). However, such datasets often contain a limited number of available genes. To address this, we will conduct additional testing on image-based data if needed. The Allen Brain Atlas provides high-quality ISH data, and we can select specific brain regions from this resource to further evaluate our algorithm if necessary [Lein et al]. Currently, we plan to focus more on the 10x Visium platform, as it supports whole-transcriptome profiling and offers a wide range of tissue samples for analysis.

      (2) The definition of "ground truth" for spatial distribution is unclear. A more detailed explanation is needed on how the "ground truth" was established for each spatial dataset and how it was utilized for comparison with the predicted distribution generated by various spatial mapping tools.

      Thank you for the comment. To clarify how ground truth is defined across different tissues, we provided the following details. Direct ground truth for cell locations is often unavailable in scRNA-seq data due to experimental constraints. To address this, we adopted alternative strategies for estimating ground truth in each dataset:

      10x Visium Data: We used the cell type distribution derived from spatial transcriptomics (ST) data as a proxy for ground truth. We then computed the KL divergence between this distribution and our model's predictions for performance assessment.

      Slide-seq Data: We validated predictions by comparing the expression of marker genes between the reconstructed and original spatial data.

      Fly Embryo Data: We used predicted cell locations from novoSpaRc as a reference for evaluating our algorithm.

      These strategies allowed us to evaluate model performance even in the absence of direct cell location data. In addition, we can apply multiple evaluation strategies within a single dataset.
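      The KL-divergence comparison used for the 10x Visium data can be sketched as follows. The example compares a reference cell type distribution with two predicted distributions; the counts, the pseudocount used to avoid division by zero, and the function name are assumptions for illustration and are not taken from the manuscript.

```python
import numpy as np


def kl_divergence(reference_counts, predicted_counts, pseudocount=1e-9):
    """KL(P || Q) between two cell type distributions given as counts.

    P is the reference (proxy ground truth), Q the model prediction.
    A small pseudocount keeps the ratio finite when a count is zero.
    """
    p = np.asarray(reference_counts, dtype=float) + pseudocount
    q = np.asarray(predicted_counts, dtype=float) + pseudocount
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


# Hypothetical cell type counts within one anatomical region.
reference = [50, 30, 20]       # proxy ground truth from ST data
close_mapping = [48, 32, 20]   # prediction close to the reference
poor_mapping = [20, 30, 50]    # prediction with swapped proportions

kl_close = kl_divergence(reference, close_mapping)
kl_poor = kl_divergence(reference, poor_mapping)
print(kl_close, kl_poor)
```

      Lower values indicate that the predicted distribution is closer to the proxy ground truth. Because the measure is computed on proportions, a few misassigned cells in a rare population can shift it noticeably.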

      (3) In the analysis of spatial mapping results using intestinal villus tissue, only Figure 3d supports their findings. The researchers should consider adding supplemental figures illustrating the spatial distribution of single cells in comparison to the ground truth distribution to enhance the clarity and robustness of their investigation.

      Thank you for the comment. In the intestinal dataset, only six large domains were defined. As a result, the task for this dataset is relatively simple—each cell only needs to be assigned to one of the six domains. As the intestinal villus is a relatively simple tissue, most existing algorithms performed well on it. For this reason, we did not initially provide extensive details in the main text.

      (4) The spatial mapping tests were conducted on various tissues, including the mouse cortex, human PDAC, and intestinal villus. However, the original anatomical regions are not displayed, making it difficult to directly compare them with the predicted mapping results. Providing ground truth distributions for each tested tissue would enhance clarity and facilitate interpretation. For instance, in Figure 2a and  Supplementary Figures 1 and 2, only the predicted mapping results are shown without the corresponding original spatial distribution of regions in the mouse cortex. Additionally, in Figure 3c, four anatomical regions are displayed, but it is unclear whether the figure represents the original spatial regions or those predicted by glmSMA. The authors are encouraged to clarify this by incorporating ground truth distributions for each tissue.

      Thank you for the comment. To improve visualization, we have included anatomical structures alongside the mapping results in the revised version, wherever such structures are available (e.g., mouse brain cortex, human PDAC sample, etc.). Major cell type assignments for the PDAC samples, along with anatomical structures, are shown in Supplementary Figure 9. Most of these cell types were correctly mapped to their corresponding anatomical regions.

      (5) The cell assignment results from the mouse hippocampus (Supplementary Figure 6) lack a corresponding ground truth distribution for comparison. DG and CA cells were evaluated solely based on the gene expression of specific marker genes. Additional analyses are needed to further validate the robustness of glmSMA's mapping performance on Slide-seq data from the mouse hippocampus.

      Thank you for the comment. The ground truth for DG and CA cells was not available. To better evaluate the model's performance, we computed the KL divergence between the original and predicted cell type distributions, following the same approach used for the 10x Visium dataset. We identified a higher-quality dataset for the mouse hippocampus and used it to evaluate our algorithm. Additionally, we employed KL divergence as an alternative strategy to validate and benchmark our results (Fig. 5e, f, g). Most CA cells, including CA1, CA2, and CA3 principal cells, were correctly assigned back to the CA region. Dentate principal cells were accurately mapped to the DG region (Fig. 5e, f).

      (6) The tested spatial datasets primarily consist of highly structured tissues with well-defined anatomical regions, such as the brain and intestinal villus. In other tissues, such as liver, anatomical regions are not distinctly separated. Further evaluation of such tissues would help determine the method's broader applicability.

      Thank you for the insightful comment. We agree that many spatial datasets used in our study are from tissues with well-defined anatomical regions. To address the applicability of glmSMA in tissues without clearly separated anatomical structures, we applied glmSMA to the Drosophila embryo, which represents a tissue with relatively continuous spatial patterns and lacks well-demarcated anatomical boundaries compared to organs like the brain or intestinal villus.

      Despite this less structured spatial organization, glmSMA demonstrated robust performance in the fly embryo, accurately mapping cells to their correct spatial spots based on gene expression profiles. This result indicates that glmSMA is not strictly limited to highly structured tissues and can generalize to tissues with more continuous or gradient-like spatial architectures. These results suggest that glmSMA has broader applicability beyond highly compartmentalized tissues.

      Lein, E., Hawrylycz, M., Ao, N. et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168–176 (2007). https://doi.org/10.1038/nature05453

      Reviewer #3 (Public review):

      The authors aim to develop glmSMA, a network-regularized linear model that accurately infers spatial gene expression patterns by integrating single-cell RNA sequencing data with spatial transcriptomics reference atlases. Their goal is to reconstruct the spatial organization of individual cells within tissues, overcoming the limitations of existing methods that either lack spatial resolution or sensitivity.

      Strengths:

      (1) Comprehensive Benchmarking:

      Compared against CellTrek and Novosparc, glmSMA consistently achieved lower Kullback-Leibler divergence (KL divergence) scores, indicating better cell assignment accuracy.

      Outperformed CellTrek in mouse cortex mapping (90% accuracy vs. CellTrek's 60%) and provided more spatially coherent distributions.

      (2) Experimental Validation with Multiple Real-World Datasets:

      The study used multiple biological systems (mouse brain, Drosophila embryo, human PDAC, intestinal villus) to demonstrate generalizability.

      Validation through correlation analyses, Pearson's coefficient, and KL divergence supports the accuracy of glmSMA's predictions.

      We thank reviewer #3 for their positive feedback and thoughtful recommendations.

      Weaknesses:

      (1) The accuracy of glmSMA depends on the selection of marker genes, which might be limited by current FISH-based reference atlases.

      We agree that the accuracy of glmSMA is influenced by the selection of marker genes, and that current FISH-based reference atlases may offer a limited gene set. To address this, we incorporate multiple feature selection strategies, including highly variable genes and spatially informative genes (e.g., via Moran’s I), to optimize performance within the available gene space. As more comprehensive reference atlases become available, we expect the model’s accuracy to improve further.
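
      For readers unfamiliar with Moran's I, the sketch below shows one common form of the statistic for a single gene across spatial spots. The function name, the chain-shaped neighbourhood, and the toy expression values are assumptions for illustration only; the response does not specify how the weights were built.

```python
def morans_i(values, weights):
    """Global Moran's I for one variable.

    values  : expression of one gene at n spots
    weights : n x n spatial weight matrix (list of lists), w[i][j] > 0
              when spots i and j are treated as neighbours
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# Four spots in a line; each spot neighbours the adjacent ones (hypothetical).
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(morans_i([1, 1, 0, 0], chain))  # spatially clustered expression
print(morans_i([1, 0, 1, 0], chain))  # alternating expression
```

      Positive values indicate that neighbouring spots have similar expression, which is the property that makes a gene spatially informative; negative values indicate alternating patterns.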

      (2) glmSMA operates under the assumption that cells with similar gene expression profiles are likely to be physically close to each other in space, which may not be true under various heterogeneous environments.

      Thank you for raising this important point. We agree that glmSMA operates under the assumption that cells with similar gene expression profiles tend to be spatially proximal, and this assumption may not strictly hold in highly heterogeneous tissues where spatial organization is less coupled to transcriptional similarity.

      To address this concern, we specifically tested glmSMA on human PDAC samples, which represent moderately heterogeneous environments characterized by complex tumor microenvironments, including a mixture of ductal cells, cancer cells, stromal cells, and other components. Despite this heterogeneity, glmSMA successfully mapped major cell types to their expected anatomical regions, demonstrating that the method is robust even in the presence of substantial cellular diversity and spatial complexity.

      This result suggests that while glmSMA relies on the assumption of spatial-transcriptomic correlation, the method can tolerate a reasonable degree of spatial heterogeneity without a significant loss of performance. Nevertheless, we acknowledge that in extremely disorganized or highly mixed tissues where transcriptional similarity is decoupled from spatial proximity, the performance may be affected.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We have significant concerns about the eLife assessment and the reviews. The reviewers acknowledged substantial strengths in our work:

      • Reviewer 3 noted that “the single-unit analyses of tuning direction are robustly characterized”, “the differences in neural correlations across behaviors, regions and perturbations are robust”, and “The evidence for these claims is solid.”

      • Reviewer 2 stated that “the manuscript has been improved” with “new analyses [that] provide improved rigor”.

      Despite these, the final eLife assessment inexplicably downplayed the significance of the findings and strength of evidence.

      Broader Impact and Significance. The findings, not only the data, have theoretical and/or practical implications extending well beyond a single subfield relevant to:

      1. behavioral neuroscientists studying sensorimotor integration

      2. systems and theoretical neuroscientists

      3. neural and biomechanical engineers working on brain-computer interfaces for speech or oral or limb prosthetics

      4. soft robotics researchers

      5. comparative motor control researchers

      6. clinicians involved in the evaluation and rehabilitation of orolingual function (e.g., after stroke or glossectomy, dysphagia)

      Given this broad relevance, we question why the significance was characterized as merely "useful" rather than "important."

      Dismissive Tone Toward Descriptive Research. Some reviews displayed a dismissive or skeptical tone toward the findings and their significance, even when methods were solid and support for the claims was strong. They critiqued the “descriptive nature” of our study, faulting the lack of mechanistic explanation. However, in poorly understood fields such as orofacial sensorimotor control, descriptive studies provide the empirical foundation for mechanistic studies. Rich descriptive data generate testable hypotheses that drive mechanistic discoveries forward, while mechanistic studies conducted without this groundwork often pursue precise answers to poorly formulated questions.

      Specific Issues with Reviews:

      1. Significant omission in study description:

      The eLife Assessment’s second sentence states: “The data, which include both electrophysiology and nerve block manipulations, will be of value to neuroscientists and neural engineers interested in tongue use.”

      This description omits our simultaneously recorded high-resolution 3D kinematics data—a significant oversight given that combining high-density electrophysiological recording from multiple cortical regions with high-resolution 3D tongue kinematics during naturalistic behaviors in non-human primates represents one of our study's key strengths. Currently, only two research labs in the US possess this capability.

      2. Overemphasis on the “smaller” and “inconsistent” findings

      While we acknowledge some inconsistent findings between animals, the reviews overemphasized these inconsistencies in ways that cast unwarranted doubt on our more significant and consistent results.

      a. Reviewer 1: “[...] the discrepancies in tuning changes across the two NHPs, coupled with the overall exploratory nature of the study, render the interpretation of these subtle differences somewhat speculative.” “[...] in some recording sessions, they blocked sensory feedback using bilateral nerve block injections, which seemed to result in fewer directionally tuned units and changes in the overall distribution of the preferred direction of the units.”

      The skeptical tone of this critique is at odds with Reviewer 3’s statement that “the evidence for these claims is solid”. Reviewer 1 characterized our findings as “somewhat speculative”, seemingly overlooking the robust and consistent changes we documented:

      • “Following nerve block, MIo and SIo showed significant decreases in the proportion of directionally modulated neurons across both tasks (Fig. 10A; Chi-square, MIo: p <0.001, SIo: p < 0.05).”

      • “Nerve block significantly altered PD distributions during both tasks. During feeding, MIo neurons in both subjects exhibited a significant clockwise shift in mean PD toward the center (0°), resulting in more uniform distributions (Fig. 11A; circular k-test, p < 0.01).”

      These results were obtained through careful subsampling of trials with similar kinematics for both feeding and drinking tasks, ensuring that the tuning changes in the nerve block experiments could not be attributed to differing kinematics.

      b. Reviewer 2: “One weakness of the current study is that there is substantial variability in results between monkeys.”

      This vague critique, without specifying which results showed “substantial variability”, reads as though most findings were inconsistent, unfairly casting doubt on our study’s validity.

      3. Inaccurate statements in the Reviewers’ summaries

      Several reviewer statements contain factual inaccuracies:

      a. Reviewer 2: “A majority of neurons in MIo and a (somewhat smaller) percentage of SIo modulated their firing rates during tongue movements, with different modulation depending on the direction of movement (i.e., exhibited directional tuning).”

      Reviewer 2's characterization of directional tuning misrepresents our findings. We reported substantial differences in the proportion of directionally tuned neurons between MIo and SIo during the feeding task but a smaller difference in the drinking task:

      • “The proportion of directionally tuned neurons [...] differed significantly between MIo and SIo during the feeding task in both subjects (Chi-square, p < 0.001). In rostral and caudal MIo, 80% of neurons were modulated to 3D direction (bootstrap, p < 0.05, Fig. 3B, left), compared to 52% in areas 1/2 and 3a/3b.”

      • “During drinking, the proportion of directionally modulated neurons was more similar between regions (69% in MIo vs. 60% in SIo: Chi-square, p > 0.05, Fig. 3B right).”

      b. Reviewer 2: “There were differences observed in the proportion and extent of directional tuning between the feeding and licking behaviors, with stronger tuning overall during licking.”

      Reviewer 2's claim about task differences directly contradicts our findings. We consistently reported stronger tuning in feeding compared to drinking across multiple measures:

      • “The proportion of directionally tuned neurons was higher in the feeding vs. drinking task (Chi-square, p < 0.05, feeding: 72%, drinking: 66%)”;

      • “Cumulative explained variance for the first three factors was higher in feeding (MIo: 82%, SIo: 81%) than in drinking (MIo: 74%, SIo: 63%)”;

      • “Decoding using LSTM showed consistently higher accuracies in feeding compared to drinking regardless of the length of intervals used ..., behavioral window .., and directional angles ...”

      These results were also summarized in the Discussion.

      c. Reviewer 1: In Figure 12, factor 2 and 3 are plotted against each other? and factor 1 is left out?

      Reviewer 1’s observation about Figure 12 is incorrect. Factor 1 was included: Top subplots (feeding) show Factor 1 vs 3 (MIo) and Factor 1 vs 2 (SIo) while the bottom subplots (drinking) show Factor 2 vs 3 (MIo) and Factor 1 vs 2 (SIo). We plotted the two latent factors with highest explained variance for clarity, though all 20 factors were included in inter-trajectory distance calculations.

      4. Framing and interpretive over-scrutiny

      Several critiques targeted framing rather than methodological rigor and emphasized that interpretations were speculative even when appropriately hedged:

      a. Reviewer 2: “A revised version of the manuscript incorporates more population-level analyses, but with inconsistent use of quantifications/statistics and without sufficient contextualization of what the reader is to make of these results.”

      Reviewer 2 mentioned "inconsistent use of quantifications/statistics" without specifying which analyses were problematic or updating their summary to include our additional population-level findings.

      b. Reviewer 2: “The described changes in tuning after nerve block could also be explained by changes in kinematics between these conditions, which temper the interpretation of these interesting results”

      Despite our addressing kinematic concerns through subsampled data analysis, Reviewer 2 remained unsatisfied, contrasting sharply with Reviewer 3's assessment that our arguments were "convincing" with "solid" evidence.

      c. Reviewer 2: “I am not convinced of the claim that tongue directional encoding fundamentally changes between drinking and feeding given the dramatically different kinematics and the involvement of other body parts like the jaw”

      Reviewer 2 expressed skepticism about fundamental encoding differences between tasks, despite our comprehensive controls including subsampled data with similar kinematics and multiple verification analyses (equal neuron numbers, stable neurons, various interval lengths, behavioral windows, and directional angles).

      Without describing why these analyses were insufficient, this criticism goes beyond methods or statistics. It casts doubt and challenges whether the conclusions are even worth drawing despite careful experimental controls.

      d. Reviewer 2: “The manuscript states that "An alternative explanation be more statistical/technical in nature: that during feeding, there will be more variability in exactly what somatosensation afferent signals are being received from trial to trial (because slight differences in kinematics can have large differences in exactly where the tongue is and the where/when/how of what parts of it are touching other parts of the oral cavity)? This variability could "smear out" the apparent tuning using these types of trial-averaged analyses. Given how important proprioception and somatosensation are for not biting the tongue or choking, the speculation that somatosensory cortical activity is suppressed during feedback is very counter-intuitive to this reviewer".

      By not updating this section, Reviewer 2 failed to acknowledge our responsive revisions, including Fano factor analysis showing higher variability in SIo during feeding versus drinking, and our updated discussion addressing their concerns about trial-to-trial variability: “Varying tongue shape, tongue’s contact with varying bolus properties (size and texture) and other oral structures (palate, teeth) may weaken the directional signal contained in SIo activity. Thus, small differences in tongue kinematics might create large differences in sensory signals across trials. When looking at trial-averaged signals, this natural variability could make the neural response patterns appear less precise or specific than they are. These are consistent with our findings that for both tasks, spiking variability was higher in SIo.”
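
      The Fano factor mentioned above is the variance of spike counts across trials divided by their mean. A minimal sketch follows, assuming spike counts from repeated trials in a fixed time window; the function name and the toy counts are hypothetical and do not come from the study.

```python
import statistics

def fano_factor(spike_counts):
    """Fano factor: sample variance / mean of spike counts across trials."""
    mean = statistics.fmean(spike_counts)
    if mean == 0:
        raise ValueError("Fano factor is undefined when the mean count is zero")
    return statistics.variance(spike_counts) / mean

# Hypothetical spike counts of one neuron over repeated trials.
steady_unit = [4, 4, 4, 4]      # no trial-to-trial variability
variable_unit = [2, 4, 6]       # same mean, more variability
print(fano_factor(steady_unit))
print(fano_factor(variable_unit))
```

      A higher Fano factor means more trial-to-trial variability relative to the mean rate, which is the sense in which variability could "smear out" trial-averaged tuning.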

      Authors’ Response to Recommendations for the authors:

      We thank the editors and the reviewers for their helpful comments. We have provided a response to reviewers’ recommendations and made some revisions on the manuscript. 

      Reviewer #1 (Recommendations for the authors): 

      In the newly added population factor analysis, several methodological decisions remain unclear to me:

      In Figure 7, why do the authors compare the mean distance between conditions in the latent spaces of MIo and SIo? Since these latent spaces are derived separately, they exist on different scales (with MIo appearing roughly four times larger than SIo), and this discrepancy is reflected in the reported mean distances (Figure 7, inset plots). Wouldn't this undermine a direct comparison?

      Thank you for this helpful feedback. The reviewer is correct that the latent spaces are derived separately for MIo and SIo, thus they exist on different scales as we have noted in the caption of Figure 7: “Axes for SIo are 1/4 scale of MIo.” 

      To allow for a direct comparison between MIo and SIo, we corrected the analysis by comparing their normalized mean inter-trajectory distances obtained by first calculating the geometric index (GI) of the inter-trajectory distances, d, between each pair of population trajectories per region as: $\mathrm{GI} = (d_1 - d_2)/(d_1 + d_2)$. We then performed the statistics on the GIs and found a significant difference between mean inter-trajectory distances in MIo vs. SIo. We performed the same analysis comparing the distance travelled between MIo and SIo trajectories by getting the normalized difference in distances travelled and still found a significant difference in both tasks. We have updated the results and figure inset to reflect these changes.
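
      The normalization above can be sketched in a few lines. The point of the example is that the index is unchanged when both distances are multiplied by the same factor, which is why it permits comparison across latent spaces with different scales. The function name and the numeric distances are assumptions for illustration.

```python
def geometric_index(d1, d2):
    """Normalized difference between two inter-trajectory distances.

    Returns (d1 - d2) / (d1 + d2), which lies in [-1, 1] for
    non-negative distances and does not depend on a common scale.
    """
    total = d1 + d2
    if total == 0:
        raise ValueError("both distances are zero; the index is undefined")
    return (d1 - d2) / total

# Hypothetical distances: the second pair is the first pair at 1/4 scale.
gi_large_scale = geometric_index(4.0, 2.0)
gi_small_scale = geometric_index(1.0, 0.5)
print(gi_large_scale, gi_small_scale)
```

      Both calls give the same index even though the raw distances differ fourfold, mirroring the scale difference between the MIo and SIo latent spaces.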

      In Figure 12, unlike Figure 7 which shows three latent dimensions, only two factors are plotted. While the methods section describes a procedure for selecting the optimal number of latent factors, Figure 7 - figure supplement 3 shows that variance explained continues to increase up to about five latent dimensions across all areas. Why, then, are fewer dimensions shown?

      Thank you for the opportunity to clarify the figure. The m obtained from the 3-fold cross-validation varied for the full sample and was 20 factors for the subsample. We clarify that all statistical analyses were done using 20 latent factors. Using the full sample of neurons, the first 3 factors explained 81% of variance in feeding data compared to 71% in drinking data. When extended to 5 factors, feeding maintained its advantage with 91% variance explained versus 82% for drinking. Because feeding showed higher variance explained than drinking across 3 or 5 factors, only three factors were shown in Figure 7 for better visualization. We added this clarification to the Methods and Results.

      Figure 12 shows the differences in the neural trajectories between the control and nerve block conditions. The control vs. nerve block comparison complicated the visualization of the results. Thus, we plotted only the two latent factors with the highest separation between population trajectories. This was clarified in the Methods and caption of Figure 12.

      In Figure 12, factor 2 and 3 are plotted against each other? and factor 1 is left out?

      This observation is incorrect; Factor 1 was included: Top subplots (feeding) show Factor 1 vs 3 (MIo) and Factor 1 vs 2 (SIo) while the bottom subplots (drinking) show Factor 2 vs 3 (MIo) and Factor 1 vs 2 (SIo).  We have clarified this in the Methods and caption of Figure 12.

      Finally, why are factor analysis results shown only for monkey R? 

      Factor analysis was performed on both animals, but the results are shown only for monkey R to reduce the number of figures in the manuscript. Figure 7 - figure supplement 1 shows the data for both monkeys. Here are the equivalent Figure 7 plots for monkey Y.

      Author response image 1.

      Reviewer #2 (Recommendations for the authors): 

      Overall, the manuscript has been improved. 

      New analyses provide improved rigor (as just one example, organizing the feeding data into three-category split to better match the three-direction drinking data decoding analysis and also matching the neuron counts).

      The updated nerve block change method (using an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory) somewhat reduces my concern that kinematic differences could account for the neural changes, but on the other hand the neural analyses use 250 ms (meaning that the neural differences could be related to behavioral differences earlier in the trial). Why not subselect to trials with similar trajectories throughout the whole movement (or at least show that as an additional analysis, albeit one with lower trial counts)?

      As the reviewer pointed out, selecting similar trajectories throughout the whole movement would result in lower trial counts that lead to poor statistical power. We think that the 100 ms prior to maximum tongue protrusion is a more important movement segment to control for similar kinematics between the control and nerve block conditions since this represents the subject’s intended movement endpoint. 

      A lot of the Results seemed like a list of measurements without sufficient hand-holding or guide-posting to explain what the take-away for the reader should be. Just one example to make concrete this broadly-applicable feedback: "Cumulative explained variance for the first three factors was higher in feeding (MIo: 82%, SIo: 81%) than in drinking (MIo: 74%, SIo: 63%) when all neurons were used for the factor analysis (Fig. 7)": why should we care about 3 factors specifically? Does this mean that in feeding, the neural dimensionality is lower (since 3 factors explain more of it)? Does that mean feeding is a "simpler" behavior (which is counter-intuitive and does not conform to the authors' comments about the higher complexity of feeding)? And from later in that paragraph: what are we to make of the differences in neural trajectory distances (aside from quantifying using a different metric the same larger changes in firing rates that could just as well be quantified as statistics across single-neuron PETHs)?

      Thank you for the feedback on the writing style. We have made some revisions to describe the takeaway for the reader. That fewer latent factors explain 80% of the variance in the feeding data means that the underlying network activity is relatively simple despite apparent complexity. When neural population trajectories are farther away from each other in state space, it means that the patterns of activity across tongue directions are more distinct and separable, thus, less likely to be confused with each other. This signifies that neural representations of 3D tongue directions are more robust. When there is better neural discrimination and more reliable information processing, it is easier for downstream brain regions to distinguish between different tongue directions.  
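
      To make the notion of trajectory separation concrete, the sketch below computes a mean Euclidean distance between two population trajectories across time, using all m latent factors at each time point. This is one plausible reading of the distance measure, not a reproduction of the authors' procedure; the function name and the toy trajectories are hypothetical.

```python
import math

def mean_inter_trajectory_distance(traj_a, traj_b):
    """Mean Euclidean distance between two trajectories.

    Each trajectory is a list of time points, and each time point is a
    list of m latent-factor values. Both trajectories must cover the
    same number of time points.
    """
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must have the same number of time points")
    distances = [math.dist(a, b) for a, b in zip(traj_a, traj_b)]
    return sum(distances) / len(distances)

# Two hypothetical trajectories in a 3-factor latent space (two time points).
direction_left = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
direction_right = [[0.0, 3.0, 4.0], [1.0, 3.0, 4.0]]
print(mean_inter_trajectory_distance(direction_left, direction_right))
```

      Larger values correspond to trajectories that sit farther apart in state space, that is, to direction-specific activity patterns that are easier to tell apart.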

      The addition of more population-level analyses is nice as it provides a more efficient summary of the neural measurements. However, it's a surface-level dive into these methods; ultimately the goal of ensemble "computation through dynamics" analyses is to discover simpler structure / organizational principles at the ensemble level (i.e., show things not evident from single neurons), rather than just using them as a way to summarize data. For instance, here neural rotations are remarked upon in the Results, without referencing influential prior work describing such rotations and why neural circuits may use this computational motif to separate out conditions and shape muscle activity-generating readouts (Churchland et al. Nature 2012 and subsequent theoretical iterations including Russo et al.). That said, the Russo et al. tangling study was well-referenced and the present tangling results were effectively contextualized with respect to that paper in terms of the interpretation. I wish more of the results were interpreted with comparable depth.

      Speaking of Russo et al: the authors note qualitative differences in tangling between brain areas, but do not actually quantify tangling in either. These observations would be stronger if quantified and accompanied with statistics.

      Contrary to the reviewer’s critique, we did frame these results in the context of structure/organizational principles at the ensemble level. We had already cited prior work of Churchland et al., 2012; Michaels et al., 2016; and Russo et al., 2018. In the Discussion, Differences across behaviors, we wrote: “In contrast, MIo trajectories in drinking exhibited a consistent rotational direction regardless of spout location (Fig. 7). This may reflect a predominant non-directional information such as condition-independent time-varying spiking activity during drinking (Kaufman et al., 2016; Kobak et al., 2016; Arce-McShane et al., 2023).”

      Minor suggestions: 

      Some typos, e.g. 

      • no opening parenthesis in "We quantified directional differences in population activity by calculating the Euclidean distance over m latent factors)"

      • missing space in "independent neurons(Santhanam et al., 2009;..."); 

      • missing closing parentheses in "followed by the Posterior Inferior (Figure 3 - figure supplement 1."

      There is a one-page long paragraph in the Discussion. Please consider breaking up the text into more paragraphs each organized around one key idea to aid readability.

      Thank you, we have corrected these typos.

      Could it be that the Kaufman et al 2013 reference was intended to be Kaufman et al 2015 eNeuro (the condition-invariant signal paper)?

      Thank you, we have corrected this reference.

      At the end of the Clinical Implications subsection of the Discussion, the authors note the growing field of brain-computer interfaces with references for motor read-out or sensory write-in of hand motor/sensory cortices, respectively. Given that this study looks at orofacial cortices, an even more clinically relevant development is the more recent progress in speech BCIs (two recent reviews: https://www.nature.com/articles/s41583-024-00819-9, https://www.annualreviews.org/content/journals/10.1146/annurev-bioeng-110122012818), many of which record from human ventral motor cortex, and aspirations towards FES-like approaches for orofacial movements (e.g., https://link.springer.com/article/10.1186/s12984-023-01272-y).

      Thank you, we have included these references.

      Reviewer #3 (Recommendations for the authors): 

      Major Suggestions 

      (1) For the factor analysis of feeding vs licking, it appears that the factors were calculated separately for the two behaviors. It could be informative to calculate the factors under both conditions and project the neural data for the two behaviors into that space. The overlap/separations of the subspace could be informative. 

      We clarify that we performed a factor analysis that included both feeding and licking for MIo, as stated in the Results: “To control for factors such as different neurons and kinematics that might influence the results, we performed factor analysis on stable neurons across both tasks using all trials (Fig. 7- figure supplement 2A) and using trials with similar kinematics (Fig. 7- figure supplement 2B).” We have revised the manuscript to reflect this more clearly.

      (2) For the LSTM, the factor analyses, and the decoding, it is unclear whether the firing rates are mean-subtracted and normalized (the Methods section was a little unclear). Typically, papers in the field either z-score the data or do a softmax.

      The firing rates were z-scored for the LSTM and KNN. For the factor analysis, the spike counts were not z-scored, but the results were normalized. We clarified this in the Methods section.
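
      For reference, z-scoring a neuron's firing rates means subtracting the mean and dividing by the standard deviation. The sketch below uses the population standard deviation and returns zeros for a constant series; both choices, along with the function name and sample rates, are assumptions for this example rather than details taken from the Methods.

```python
import statistics

def zscore(rates):
    """Z-score a sequence of firing rates (mean 0, unit standard deviation)."""
    mean = statistics.fmean(rates)
    sd = statistics.pstdev(rates)
    if sd == 0:
        # A constant series carries no variation; return zeros (assumption).
        return [0.0 for _ in rates]
    return [(r - mean) / sd for r in rates]

# Hypothetical binned firing rates (spikes/s) for one neuron.
rates = [2, 4, 4, 4, 5, 5, 7, 9]
print(zscore(rates))
```

      Applied per neuron, this puts units with very different baseline rates on a common scale before decoding.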

      Minor: 

      Page 1: Abstract- '... how OSMCx contributes to...' 

      Since there are no direct causal manipulations of OSMCx in this manuscript, this study doesn't directly study the OSMCx's contribution to movement - I would recommend rewording this sentence.

      Similarly, Page 2: 'OSMCx plays an important role in coordination...' the citations in this paragraph are correlative, and do not demonstrate a causal role.

      There are similar usages of 'OSMCx coordinates...' in other places e.g. Page 8. 

      Thank you, we revised these sentences.

      Page 7: the LSTM here has 400 units, which is a very large network and contains >12000 parameters. Networks of this size are prone to memorization; it would be wise to test the R-squared of the validation set against a shuffled dataset to see if the network is actually working as intended.

      Thank you for bringing up this important point of verifying that the network is learning meaningful patterns versus memorizing. Considering the size of our training samples, the ratio of samples to parameters is appropriate and thus the risk of memorization is low. Indeed, validation tests and cross-validation performed indicated expected network behavior and the R squared values obtained here were similar to those reported in our previous paper (Laurence-Chasen et al., 2023).
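
      The shuffle control the reviewer suggests can be sketched generically. In the example below a simple least-squares line stands in for the LSTM, and the data are synthetic; all names and numbers are assumptions. The idea is only to show the comparison: a model trained on correctly paired data should reach a much higher validation R-squared than the same model trained on shuffled targets.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(ys, preds):
    """Coefficient of determination on held-out data."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def shuffle_control(xs_train, ys_train, xs_val, ys_val, rng):
    """Validation R^2 for the true pairing and for shuffled training targets."""
    slope, icpt = fit_line(xs_train, ys_train)
    r2_true = r_squared(ys_val, [slope * x + icpt for x in xs_val])
    ys_shuffled = list(ys_train)
    rng.shuffle(ys_shuffled)  # break the input-output relationship
    s_slope, s_icpt = fit_line(xs_train, ys_shuffled)
    r2_shuf = r_squared(ys_val, [s_slope * x + s_icpt for x in xs_val])
    return r2_true, r2_shuf

# Synthetic data: a noisy linear relationship (stand-in for neural decoding).
rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
r2_true, r2_shuffled = shuffle_control(xs[:150], ys[:150], xs[150:], ys[150:], rng)
print(r2_true, r2_shuffled)
```

      If the shuffled-data R-squared approached the true one, that would suggest the model is fitting something other than the intended relationship.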


      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In their paper, Hosack and Arce-McShane investigate how the 3D movement direction of the tongue is represented in the orofacial part of the sensory-motor cortex and how this representation changes with the loss of oral sensation. They examine the firing patterns of neurons in the orofacial parts of the primary motor cortex (MIo) and somatosensory cortex (SIo) in non-human primates (NHPs) during drinking and feeding tasks. While recording neural activity, they also tracked the kinematics of tongue movement using biplanar videoradiography of markers implanted in the tongue. Their findings indicate that most units in both MIo and SIo are directionally tuned during the drinking task. However, during the feeding task, directional tuning was more frequent in MIo units and less prominent in SIo units. Additionally, in some recording sessions, they blocked sensory feedback using bilateral nerve block injections, which resulted in fewer directionally tuned units and changes in the overall distribution of the preferred direction of the units.

      Strengths:

      The most significant strength of this paper lies in its unique combination of experimental tools. The author utilized a video-radiography method to capture 3D kinematics of the tongue movement during two behavioral tasks while simultaneously recording activity from two brain areas. Moreover, they employed a nerve-blocking procedure to halt sensory feedback. This specific dataset and experimental setup hold great potential for future research on the understudied orofacial segment of the sensory-motor area.

      Weaknesses:

      Aside from the last part of the result section, the majority of the analyses in this paper are focused on single units. I understand the need to characterize the number of single units that directly code for external variables like movement direction, especially for less-studied areas like the orofacial part of the sensory-motor cortex. However, as a field, our decade-long experience in the arm region of sensory-motor cortices suggests that many of the idiosyncratic behaviors of single units can be better understood when the neural activity is studied at the level of the state space of the population. By doing so, for the arm region, we were able to explain why units have "mixed selectivity" for external variables, why the tuning of units changes in the planning and execution phase of the movement, why activity in the planning phase does not lead to undesired muscle activity, etc. See (Gallego et al. 2017; Vyas et al. 2020; Churchland and Shenoy 2024) for a review. Therefore, I believe investigating the dynamics of the population activity in orofacial regions can similarly help the reader go beyond the peculiarities of single units and in a broader view, inform us if the same principles found in the arm region can be generalized to other segments of sensory-motor cortex.

      We thank and agree with the reviewer on the value of information gained from studying population activity. We also appreciate that population analyses have led to the understanding that individual neurons have “mixed selectivity”. We have shown previously that OSMCx neurons exhibit mixed selectivity in their population activity and clear separation between latent factors associated with gape and bite force levels (Arce-McShane FI, Sessle BJ, Ram Y, Ross CF, Hatsopoulos NG (2023) Multiple regions of primate orofacial sensorimotor cortex encode bite force and gape. Front Systems Neurosci. doi: 10.3389/fnsys.2023.1213279. PMID: 37808467 PMCID: 10556252), and chew-side and food types (Li Z & Arce-McShane FI (2023). Cortical representation of mastication in the primate orofacial sensorimotor cortex. Program No. NANO06.05. 2023 Neuroscience Meeting Planner. Washington, D.C.: Society for Neuroscience, 2023. Online.). 

      The primary goal of this paper was to characterize single units in the orofacial region and to do a follow-up paper on population activity. In the revised manuscript, we have now incorporated the results of population-level analyses. The combined results of the single unit and population analyses provide a deeper understanding of the cortical representation of 3D direction of tongue movements during natural feeding and drinking behaviors. 

      Further, for the nerve-blocking experiments, the authors demonstrate that the lack of sensory feedback severely alters how the movement is executed at the level of behavior and neural activity. However, I had a hard time interpreting these results since any change in neural activity after blocking the orofacial nerves could be due to either the lack of the sensory signal or, as the authors suggest, due to the NHPs executing a different movement to compensate for the lack of sensory information or the combination of both of these factors. Hence, it would be helpful to know if the authors have any hint in the data that can tease apart these factors. For example, analyzing a subset of nerve-blocked trials that have similar kinematics to the control.

      Thank you for raising this important point. We agree with the reviewer that any change in neural activity may be attributed to the lack of sensory signal, to compensatory changes, or to a combination of these factors. To tease apart these factors, we sampled an equal number of trials with similar kinematics for both control and nerve block feeding sessions. We added a clarifying description of this approach in the Results section of the revised manuscript: “To confirm this effect was not merely due to altered kinematics, we conducted parallel analyses using carefully subsampled trials with matched kinematic profiles from both control and nerve-blocked conditions.”

      Furthermore, we ran an additional analysis for the drinking datasets by subsampling a similar distribution of drinking movements from each condition. We compared the neural data and directional tuning across an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory, nearest the spout. These analyses, which control for similar kinematics, showed that there was still a decrease in the proportion of directionally modulated neurons with nerve block compared to the control. This confirms that the results may be attributed to the lack of tactile information. These analyses are now integrated in the revised paper under the Methods section: Directional tuning of single neurons, as well as the Results section: Effects of nerve block: Decreased directional tuning of MIo and SIo neurons and Figure 10 – figure supplement 1.
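
      The kinematic matching described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `match_trials_by_angle`, the 10-degree bin width, and the histogram-matching strategy are all assumptions introduced here. It draws the same number of trials per angle bin from each condition, so both subsamples have the same size and a similar angle distribution.

```python
import random

def match_trials_by_angle(control_angles, block_angles, bin_width=10.0, seed=0):
    """Return indices of control and nerve-block trials with matched angle histograms.

    Trials are grouped into bins of `bin_width` degrees (a hypothetical choice).
    Within each bin shared by both conditions, the same number of trials is
    drawn at random from each condition.
    """
    rng = random.Random(seed)

    def bin_indices(angles):
        bins = {}
        for i, angle in enumerate(angles):
            bins.setdefault(int(angle // bin_width), []).append(i)
        return bins

    control_bins = bin_indices(control_angles)
    block_bins = bin_indices(block_angles)

    control_keep, block_keep = [], []
    for b in sorted(set(control_bins) & set(block_bins)):
        # Keep as many trials as the smaller condition offers in this bin.
        n = min(len(control_bins[b]), len(block_bins[b]))
        control_keep += rng.sample(control_bins[b], n)
        block_keep += rng.sample(block_bins[b], n)
    return sorted(control_keep), sorted(block_keep)


# Hypothetical left-right angles (degrees) in the last 100 ms of each lick.
control = [-25, -5, 3, 4, 12, 40]
block = [-3, 1, 2, 15, 16]
control_idx, block_idx = match_trials_by_angle(control, block)
```

      Any tuning analysis would then be run on the retained indices only. Trials in bins that appear in just one condition are discarded.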

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Hosack and Arce-McShane examines the directional tuning of neurons in macaque primary motor (MIo) and somatosensory (SIo) cortex. The neural basis of tongue control is far less studied than, for example, forelimb movements, partly because the tongue's kinematics and kinetics are difficult to measure. A major technical advantage of this study is using biplanar video-radiography, processed with modern motion tracking analysis software, to track the movement of the tongue inside the oral cavity. Compared to prior work, the behaviors are more naturalistic (feeding and licking water from one of three spouts), although the animals were still head-fixed.

      The study's main findings are that:

      • A majority of neurons in MIo and a (somewhat smaller) percentage of SIo modulated their firing rates during tongue movements, with different modulations depending on the direction of movement (i.e., exhibited directional tuning). Examining the statistics of tuning across neurons, there was anisotropy (e.g., more neurons preferring anterior movement) and a lateral bias in which tongue direction neurons preferred that was consistent with the innervation patterns of tongue control muscles (although with some inconsistency between monkeys).

      • Consistent with this encoding, tongue position could be decoded with moderate accuracy even from small ensembles of ~28 neurons.

      • There were differences observed in the proportion and extent of directional tuning between the feeding and licking behaviors, with stronger tuning overall during licking. This potentially suggests behavioral context-dependent encoding.

      • The authors then went one step further and used a bilateral nerve block to the sensory inputs (trigeminal nerve) from the tongue. This impaired the precision of tongue movements and resulted in an apparent reduction and change in neural tuning in MIo and SIo.

      Strengths:

      The data are difficult to obtain and appear to have been rigorously measured, and provide a valuable contribution to this under-explored subfield of sensorimotor neuroscience. The analyses adopt well-established methods, especially from the arm motor control literature, and represent a natural starting point for characterizing tongue 3D direction tuning.

      Weaknesses:

      There are alternative explanations for some of the interpretations, but those interpretations are described in a way that clearly distinguishes results from interpretations, and readers can make their own assessments. Some of these limitations are described in more detail below.

      One weakness of the current study is that there is substantial variability in results between monkeys, and that only one session of data per monkey/condition is analyzed (8 sessions total). This raises the concern that the results could be idiosyncratic. The Methods mention that other datasets were collected, but not analyzed because the imaging pre-processing is very labor-intensive. While I recognize that time is precious, I do think in this case the manuscript would be substantially strengthened by showing that the results are similar on other sessions.

      We acknowledge the reviewer’s concern about inter-subject variability. Animal feeding and drinking behaviors are quite stable across sessions; thus, we do not think that additional sessions will address the concern that the results could be idiosyncratic. Each of the eight datasets analyzed here has sufficient neural and kinematic data to capture neural and behavioral patterns. Nevertheless, we performed some of the analyses on a second feeding dataset from Monkey R. The results from analyses on a subset of this data were consistent across datasets; for example, (1) similar proportions of directionally tuned neurons, (2) similar distances between population trajectories (t-test p > 0.9), and (3) a consistently smaller distance between Anterior-Posterior pairs than others in MIo (t-test p < 0.05) but not SIo (p > 0.1).

      This study focuses on describing directional tuning using the preferred direction (PD) / cosine tuning model popularized by Georgopoulos and colleagues for understanding neural control of arm reaching in the 1980s. This is a reasonable starting point and a decent first-order description of neural tuning. However, the arm motor control field has moved far past that viewpoint, and in some ways, an over-fixation on static representational encoding models and PDs held that field back for many years. The manuscript would benefit from drawing the readers' attention (perhaps in the Discussion) to the fact that PDs are a very simple starting point for characterizing how cortical activity relates to kinematics, but that there is likely much richer population-level dynamical structure and that a more mechanistic, control-focused analytical framework may be fruitful. A good review of this evolution in the arm field can be found in Vyas S, Golub MD, Sussillo D, Shenoy K. 2020. Computation Through Neural Population Dynamics. Annual Review of Neuroscience. 43(1):249-75

      Thank you for highlighting this important point. Research on orofacial movements hasn't progressed at the same pace as limb movement studies. Our manuscript focused specifically on characterizing the 3D directional tuning properties of individual neurons in the orofacial area—an analysis that has not been conducted previously for orofacial sensorimotor control. While we initially prioritized this individual neuron analysis, we recognize the value of broader population-level insights.

      Based on your helpful feedback, we have incorporated additional population analyses to provide a more comprehensive picture of orofacial sensorimotor control and expanded our discussion section. We appreciate your expertise in pushing our work to be more thorough and aligned with current neuroscience approaches.

      Can the authors explain (or at least speculate) why there was such a large difference in behavioral effect due to nerve block between the two monkeys (Figure 7)?

      We acknowledge this as a variable inherent to this type of experimentation. Previous studies have found large kinematic variation in the effect of oral nerve block as well as in the following compensatory strategies between subjects. Each animal’s biology and response to perturbation vary naturally. Indeed, our subjects exhibited different feeding behavior even in the absence of nerve block perturbation (see Figure 2 in Laurence-Chasen et al., 2022). This is why each individual serves as its own control.

      Do the analyses showing a decrease in tuning after nerve block take into account the changes (and sometimes reduction in variability) of the kinematics between these conditions? In other words, if you subsampled trials to have similar distributions of kinematics between Control and Block conditions, does the effect hold true? The extreme scenario to illustrate my concern is that if Block conditions resulted in all identical movements (which of course they don't), the tuning analysis would find no tuned neurons. The lack of change in decoding accuracy is another yellow flag that there may be a methodological explanation for the decreased tuning result.

      Thank you for bringing up this point. We accounted for the changes in the variability of the kinematics between the control and nerve block conditions in the feeding dataset, where we sampled an equal number of trials with similar kinematics for both control and nerve block. However, we did not control for similar kinematics in the drinking task. In the revised manuscript, we have clarified this and performed a similar analysis for the drinking task. We sampled a similar distribution of drinking movements from each condition. We compared the neural data from an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory, nearest the spout. There was a decrease in the percentage of neurons that were directionally modulated (between 30 and 80%) with nerve block compared to the control. These results have been included in the revised paper under the Methods section: Directional tuning of single neurons, as well as the Results section: Effects of nerve block: Decreased directionality of MIo and SIo neurons.

      While the results from decoding using KNN did not show significant differences between decoding accuracies in control vs. nerve block conditions, the results from the additional factor analysis and decoding using LSTM were consistent with the decrease in directional tuning at the level of individual neurons.  

      The manuscript states that "Our results suggest that the somatosensory cortex may be less involved than the motor areas during feeding, possibly because it is a more ingrained and stereotyped behavior as opposed to tongue protrusion or drinking tasks". Could an alternative explanation be more statistical/technical in nature: that during feeding, there will be more variability in exactly what somatosensory afferent signals are being received from trial to trial (because slight differences in kinematics can have large differences in exactly where the tongue is and the where/when/how of what parts of it are touching other parts of the oral cavity)? This variability could "smear out" the apparent tuning using these types of trial-averaged analyses. Given how important proprioception and somatosensation are for not biting the tongue or choking, the speculation that somatosensory cortical activity is suppressed during feedback is very counter-intuitive to this reviewer.

      Thank you for bringing up this point. We have now incorporated this in our revised Discussion (see Comparison between MIo and SIo). We agree with the reviewer that trial-by-trial variability in the afferent signals may account for the lower directional signal in SIo during feeding than in drinking. Indeed, SIo’s mean-matched Fano factor in feeding was significantly higher than that in drinking (Author response image 1). Moreover, the results of the additional population and decoding analyses also support this.

      Author response image 1.

      Comparison of mean-matched Fano Factor between SIo neurons during feeding and drinking control tasks across both subjects (Wilcoxon rank sum test, p < 0.001).

      Reviewer #3 (Public review):

      Summary:

      In this study, the authors aim to uncover how 3D tongue direction is represented in the Motor (M1o) and Somatosensory (S1o) cortex. In non-human primates implanted with chronic electrode arrays, they use X-ray-based imaging to track the kinematics of the tongue and jaw as the animal is either chewing food or licking from a spout. They then correlate the tongue kinematics with the recorded neural activity. Using linear regressions, they characterize the tuning properties and distributions of the recorded population during feeding and licking. Then, they recharacterize the tuning properties after bilateral lidocaine injections in the two sensory branches of the trigeminal nerve. They report that their nerve block causes a reorganization of the tuning properties. Overall, this paper concludes that M1o and S1o both contain representations of the tongue direction, but their numbers, their tuning properties, and susceptibility to perturbed sensory input are different.

      Strengths:

      The major strengths of this paper are in the state-of-the-art experimental methods employed to collect the electrophysiological and kinematic data.

      Weaknesses:

      However, this paper has a number of weaknesses in the analysis of this data.

      It is unclear how reliable the neural responses are to the stimuli. The trial-by-trial variability of the neural firing rates is not reported. Thus, it is unclear if the methods used for establishing that a neuron is modulated and tuned to a direction are susceptible to spurious correlations. The authors do not use shuffling or bootstrapping tests to determine the robustness of their fits or determining the 'preferred direction' of the neurons. This weakness colors the rest of the paper.

      Thank you for raising these points. We have performed the following additional analyses: (1) We have added analyses to ensure that the results could not be explained by neural variability. To show the trial-by-trial variability of the neural firing rates, we have calculated the Fano factor (mean overall = 1.34747; control = 1.46471; nerve block = 1.23023). The distribution was similar across directions, suggesting that responses of MIo and SIo neurons to varying 3D directions were reliable. (2) We have used a bootstrap procedure to ensure that directional tuning cannot be explained by mere chance. (3) To test the robustness of our PDs we also performed a bootstrap test, which yielded the same results for >90% of neurons, and a multiple linear regression test for fit to a cosine-tuning function. In the revised manuscript, the Methods and Results sections have been updated to include these analyses.  
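
      To illustrate how a resampling test can rule out chance-level tuning, the following is a minimal sketch of a label-shuffling (permutation) test. It is a swapped-in technique, not the authors' bootstrap procedure, whose details are not given in this response. The statistic (`tuning_depth`, the range of mean firing rates across direction groups), the function names, and the number of shuffles are assumptions made for illustration.

```python
import random
from statistics import mean

def tuning_depth(rates, directions):
    """Difference between the highest and lowest mean rate across direction groups."""
    groups = {}
    for rate, direction in zip(rates, directions):
        groups.setdefault(direction, []).append(rate)
    group_means = [mean(values) for values in groups.values()]
    return max(group_means) - min(group_means)

def shuffle_p_value(rates, directions, n_shuffles=1000, seed=0):
    """Fraction of label shuffles whose tuning depth reaches the observed value."""
    rng = random.Random(seed)
    observed = tuning_depth(rates, directions)
    labels = list(directions)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # break any link between rate and direction
        if tuning_depth(rates, labels) >= observed:
            count += 1
    # Add-one correction so the estimate is never exactly zero.
    return (count + 1) / (n_shuffles + 1)


# Hypothetical firing rates (spikes/s) for anterior ("A") and posterior ("P") trials.
rates = [10.0] * 10 + [1.0] * 10
directions = ["A"] * 10 + ["P"] * 10
p_value = shuffle_p_value(rates, directions)
```

      A small p-value indicates that the observed difference across directions is unlikely under random assignment of direction labels to trials.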

      Author response image 2.

      Comparison of Fano Factor across directions for MIo and SIo Feeding Control (Kruskal-Wallis, p > 0.7).
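
      For reference, the Fano factor is the variance of the spike count across trials divided by its mean. The sketch below shows the plain (not mean-matched) version computed per direction; the function names and the example counts are hypothetical, and the sample variance is an assumed choice.

```python
from statistics import mean, variance

def fano_factor(spike_counts):
    """Sample variance of spike counts across trials divided by their mean.

    Requires at least two trials; returns NaN when the mean count is zero.
    """
    m = mean(spike_counts)
    if m == 0:
        return float("nan")
    return variance(spike_counts) / m

def fano_by_direction(counts_by_direction):
    """Apply fano_factor to each direction's list of per-trial spike counts."""
    return {d: fano_factor(c) for d, c in counts_by_direction.items()}


# Hypothetical per-trial spike counts for one neuron.
example = {"anterior": [2, 4, 4, 6], "posterior": [3, 3, 3, 3]}
fano = fano_by_direction(example)
```

      A Poisson process gives a Fano factor of 1, so values near 1 across directions would indicate comparable trial-to-trial variability in each direction.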

      The authors compare the tuning properties during feeding to those during licking but only focus on the tongue-tip. However, the two behaviors are different also in their engagement of the jaw muscles. Thus many of the differences observed between the two 'tasks' might have very little to do with an alteration in the properties of the neural code - and more to do with the differences in the movements involved.

      Using the tongue tip for the kinematic analysis of tongue directional movements was a deliberate choice as the anterior region of the tongue is highly mobile and sensitive due to a higher density of mechanoreceptors. The tongue tip is the first region that touches the spout in the drinking task and moves the food into the oral cavity for chewing and subsequent swallowing. 

      We agree with the reviewer that the jaw muscles are engaged differently in feeding vs. drinking (see Fig. 2). For example, a wider variety of jaw movements along the three axes are observed in feeding compared to the smaller amplitude and mostly vertical jaw movements in drinking. Also, the tongue movements are very different between the two behaviors. In feeding, the tongue moves in varied directions to position the food between left-right tooth rows during chewing, whereas in the drinking task, the tongue moves to discrete locations to receive the juice reward. Moreover, the tongue-jaw coordination differs between tasks; maximum tongue protrusion coincides with maximum gape in drinking but with minimum gape in the feeding behavior. Thus, the different tongue and jaw movements required in each behavior may account for some of the differences observed in the directional tuning properties of individual neurons and population activity. These points have been included in the revised Discussion.

      Author response image 3.

      Tongue tip position (mm) and jaw pitch (degrees) during feeding (left) and drinking (right) behaviors. Most protruded tongue position coincides with minimum gape (jaw pitch at 0°) during feeding but with maximum gape during drinking.

      Many of the neurons are likely correlated with both Jaw movements and tongue movements - this complicates the interpretations and raises the possibility that the differences in tuning properties across tasks are trivial.

      We thank the reviewer for raising this important point. In fact, we verified in a previous study whether the correlation between the tongue and jaw kinematics might explain differences in the encoding of tongue kinematics and shape in MIo (see Supplementary Fig. 4 in Laurence-Chasen et al., 2023): “Through iterative sampling of sub-regions of the test trials, we found that correlation of tongue kinematic variables with mandibular motion does not account for decoding accuracy. Even at times where tongue motion was completely un-correlated with the jaw, decoding accuracy could be quite high.” 

      The results obtained from population analyses showing distinct properties of population trajectories in feeding vs. drinking behaviors provide strong support to the interpretation that directional information varies between these behaviors.

      The population analyses for decoding are rudimentary and provide very coarse estimates (left, center, or right); it is also unclear what the major takeaways from the population decoding analyses are. The reduced classification accuracy could very well be a consequence of linear models being unable to account for the complexity of feeding movements, while the licking movements are 'simpler' and thus are better accounted for.

      We thank the reviewer for raising this point. The population decoding analyses provide additional insight on the directional information in population activity, as well as a point of comparison with the results of numerous decoding studies on the arm region of the sensorimotor cortex. In the revised version, we have included the results from decoding tongue direction using a long short-term memory (LSTM) network for sequence-to-sequence decoding. These results differed from the KNN results, indicating that a linear model such as KNN was better for drinking and that a non-linear and continuous decoder was better suited for feeding. These results have been included in the revised manuscript.

      The nature of the nerve block and what sensory pathways are being affected is unclear - the trigeminal nerve contains many different sensory afferents - is there a characterization of how effectively the nerve impulses are being blocked? Have the authors confirmed or characterized the strength of their inactivation or block? I was unable to find any electrophysiological evidence characterizing the perturbation.

      The strength of the nerve block is characterized by a decrease in the baseline firing rate of SIo neurons, as shown in Supplementary Figure 6 of “Loss of oral sensation impairs feeding performance and consistency of tongue–jaw coordination” (Laurence-Chasen et al., 2022).

      Overall, while this paper provides a descriptive account of the observed neural correlations and their alteration by perturbation, a synthesis of the observed changes and some insight into neural processing of tongue kinematics would strengthen this paper.

      We thank the reviewer for this suggestion. We have revised the Discussion to provide a synthesis of the results and insights into the neural processing of tongue kinematics.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The procedure for anesthesia explained in the method section was not clear to me. The following information was missing: what drug/dose was used? How long the animal was under anesthesia? How long after the recovery the experiments were done?

      The animals were fully sedated with ketamine (100 mg/ml, 10 mg/kg) for less than 30 minutes, and all of the data was collected within 90 minutes after the nerve block was administered.

      (2) In Figure 10, panels A and B are very close together, it was not at first clear whether the text "Monkey R, Monkey Y" belongs to panel A or B.

      We have separated the two panels further in the revised figure.

      (3) I found Figure 11 very busy and hard to interpret. Separating monkeys, fitting the line for each condition, or using a bar plot can help with the readability of the figure.

      Thank you for the suggestion. We agree with you and have reworked this figure. To simplify it we have shown the mean accuracy across iterations.

      (4) I found the laterality discussions like "This signifies that there are more neurons in the left hemisphere contributes toward one direction of tongue movement, suggesting that there is some laterality in the PDs of OSMCx neurons that varies between individuals" a bit of an over-interpretation of the data, given the low n value and the dissimilarity in how strongly the nerve blocking altered the monkeys' behavior.

      Thank you for sharing this viewpoint. We do think that laterality is a good point of comparison with studies on M1 neurons in the arm/hand region. In our study, we found that the peak of the PD distribution coincides with leftward tongue movements in feeding. The distribution of PDs provides insight into how tongue muscles are coordinated during movement. Intrinsic and extrinsic tongue muscles are involved in shaping the tongue (e.g., elongation, broadening) and positioning the tongue (e.g., protrusion/retraction, elevation/depression), respectively. These muscles receive bilateral motor innervation except for genioglossus. Straight tongue protrusion requires the balanced action of the right and left genioglossi while the lateral protrusion involves primarily the contralateral genioglossus. Given this unilateral innervation pattern, we hypothesized that left MIo/SIo neurons would preferentially respond to leftward tongue movements, corresponding to right genioglossus activation. 

      Reviewer #2 (Recommendations for the authors):

      Is the observation that tuning peaks occur most frequently toward the anterior and superior directions consistent with the statistics of the movements the tongue typically makes? This could be analogous to anisotropies previously reported in the arm literature, e.g., Lillicrap TP, Scott SH. 2013. Preference Distributions of Primary Motor Cortex Neurons Reflect Control Solutions Optimized for Limb Biomechanics. Neuron. 77(1):168-79

      Thank you for bringing our attention to analogous findings by Lillicrap & Scott, 2013. Indeed, we do observe the highest number of movements in the Anterior Superior directions, followed by the Posterior Inferior. This does align with the distribution of tuning peaks that we observed. Author response image 4 shows the proportions of observed movements in each group of directions across all feeding datasets. We have incorporated this data in the Results section: Neuronal modulation patterns differ between MIo and SIo, as well as added this point in the Discussion.

      Author response image 4.

      Proportion of feeding trials in each group of directions. Error bars represent ±1 standard deviation across datasets (n = 4).

      "The Euclidean distance was used to identify nearest neighbors, and the number of nearest neighbors used was K = 7. This K value was determined after testing different Ks which yielded comparable results." In general, it's a decoding best practice to tune hyperparameters (like K) on fully held-out data from the data used for evaluation. Otherwise, this tends to slightly inflate performance because one picks the hyperparameter that happened to give the best result. It sounds like that held-out validation set wasn't used here. I don't think that's going to change the results much at all (especially given the "comparable results" comment), but providing this suggestion for the future. If the authors replicate results on other datasets, I suggest they keep K = 7 to lock in the method.

      K = 7 was chosen based on the size of our smallest training dataset (n = 55). The purpose of testing different K values was not to select which value gave the best result, but to demonstrate that similar K values did not affect the results significantly. We tested the different K values on a subset of the feeding data, but that data was not fully held-out from the training set. We will keep your suggestion in mind for future analysis.
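
      The reviewer's suggestion about held-out hyperparameter selection can be sketched as follows. This is an illustrative pure-Python K-nearest-neighbor classifier with Euclidean distance, not the authors' decoder: the function names, the candidate K values, and the toy two-dimensional features are assumptions. The key point is that `choose_k` scores candidates on a validation split that is kept separate from the data later used to report accuracy.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=7):
    """Majority vote among the k training points nearest to `query` (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, eval_x, eval_y, k):
    """Fraction of evaluation points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(eval_x, eval_y))
    return hits / len(eval_y)

def choose_k(train_x, train_y, val_x, val_y, candidates=(3, 5, 7, 9)):
    """Pick K on a validation split that is never reused for the final evaluation."""
    return max(candidates, key=lambda k: accuracy(train_x, train_y, val_x, val_y, k))


# Toy data: two well-separated clusters standing in for "left" and "right" licks.
train_x = [(i * 0.1, 0.0) for i in range(10)] + [(10 + i * 0.1, 10.0) for i in range(10)]
train_y = ["left"] * 10 + ["right"] * 10
val_x, val_y = [(0.5, 0.2), (10.2, 9.9)], ["left", "right"]
test_x, test_y = [(0.3, -0.1), (10.6, 10.3)], ["left", "right"]

best_k = choose_k(train_x, train_y, val_x, val_y)
test_accuracy = accuracy(train_x, train_y, test_x, test_y, best_k)
```

      Fixing K in advance, as the reviewer recommends for future datasets, corresponds to skipping `choose_k` and passing `k=7` directly.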

      The smoothing applied to Figure 2 PSTHs appears perhaps excessive (i.e., it may be obscuring interesting finer-grained details of these fast movements). Can the authors reduce the 50 ms Gaussian smoothing (I assume this is the s.d.)? A value of ~25 ms is often used in studying arm kinematics. It also looks like the movement-related modulation may not be finished in these 200 ms / 500 ms windows. I suggest extending the shown time window. It would also be helpful to show some trial-averaged behavior (e.g. speed or % displacement from start) under or behind the PSTHs, to give a sense of what phase of the movement the neural activity corresponds to.

      Thank you for the suggestion. We have taken your suggestions into consideration and modified Figure 2 accordingly. We decreased the Gaussian kernel to 25 ms and extended the time window shown. The trial-averaged anterior/posterior displacement was also added to the drinking PSTHs.
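
      The effect of the kernel width on a PSTH can be illustrated with a short sketch. It assumes that the quoted 25 ms and 50 ms values are the standard deviation of the Gaussian kernel (the response does not state this), uses 1 ms bins, and truncates the kernel at three standard deviations; the function names are hypothetical.

```python
import math

def gaussian_kernel(sd_ms, bin_ms=1.0, n_sd=3.0):
    """Normalized Gaussian weights truncated at n_sd standard deviations."""
    half = int(round(n_sd * sd_ms / bin_ms))
    weights = [math.exp(-0.5 * (i * bin_ms / sd_ms) ** 2) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_psth(rates, sd_ms=25.0, bin_ms=1.0):
    """Convolve binned firing rates with a Gaussian, renormalizing at the edges."""
    kernel = gaussian_kernel(sd_ms, bin_ms)
    half = len(kernel) // 2
    smoothed = []
    for t in range(len(rates)):
        acc, norm = 0.0, 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(rates):
                acc += w * rates[idx]
                norm += w
        smoothed.append(acc / norm)
    return smoothed


# A single brief burst: the narrower kernel preserves more of its peak.
impulse = [0.0] * 200
impulse[100] = 1.0
peak_25 = smooth_psth(impulse, sd_ms=25.0)[100]
peak_50 = smooth_psth(impulse, sd_ms=50.0)[100]
```

      Here `peak_25` exceeds `peak_50`, which is the sense in which heavier smoothing can obscure fast modulations.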

      Reviewer #3 (Recommendations for the authors):

      The major consideration here is that the data reported for feeding appears to be very similar to that reported in a previous study:

      "Robust cortical encoding of 3D tongue shape during feeding in macaques"

      Are the neurons reported here the same as the ones used in this previous paper? It is deeply concerning that this is not reported anywhere in the methods section.

      These are the same neurons as in our previous paper, though here we include several additional datasets of the nerve block and drinking sessions. We have now included this in the methods section.

      Second, I strongly recommend that the authors consider a thorough rewrite of this manuscript and improve the presentation of the figures. As written, it was not easy to follow the paper, the logic of the experiments, or the specific data being presented in the figures.

      Thank you for this suggestion. We have done an extensive rewrite of the manuscript and revision of the figures.

      A few recommendations:

      (1) Please structure your results sections and use descriptive topic sentences to focus the reader. In the current version, it is unclear what the major point being conveyed for each analysis is.

      Thank you for this suggestion. We have added topic sentences to begin each section of the results.

      (2) Please show raster plots for at least a few example neurons so that the readers have a sense of what the neural responses look like across trials. Is all of Figure 2 one example neuron or are they different neurons? Error bars for PETH would be useful to show the reliability and robustness of the tuning.

      Figure 2 shows different neurons, one from MIo and one from SIo for each task. There is shading showing ±1 standard error around the line for each direction; however, this was a bit difficult to see. In addition to the other changes we have made to these figures, we made the lines thinner and darkened the error bar shading to accentuate this. We also added raster plots corresponding to the same neurons represented in Figure 2 as a supplement.

      (3) Since there are only two data points, I am not sure I understand why the authors have bar graphs and error bars for graphs such as Figure 3B, Figure 5B, etc. How can one have an error bar and means with just 2 data points?

      Those bars represent the standard error of the proportion. We have changed the y-axis label on these figures to make this clearer.
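
      For readers unfamiliar with this quantity, the standard error of a proportion p estimated from n units is sqrt(p(1 − p)/n), so it can be computed for a single dataset. A minimal sketch, with a hypothetical function name and made-up counts:

```python
import math

def proportion_se(n_tuned, n_total):
    """Return the proportion of tuned neurons and its binomial standard error."""
    p = n_tuned / n_total
    return p, math.sqrt(p * (1 - p) / n_total)


# Hypothetical example: 50 directionally tuned neurons out of 100 recorded.
p, se = proportion_se(50, 100)
```

      The error bar therefore reflects the number of neurons in each dataset rather than variability across the two animals.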

      (4) Results in Figure 6 could be due to differential placement of the electrodes across the animals. How is this being accounted for?

      Yes, this is a possibility which we have mentioned in the discussion. Even with careful placement there is no guarantee to capture a set of neurons with the exact same function in two subjects, as every individual is different. Rather we focus on analyses of data within the same animal. The purpose of Figure 6 is to show the difference between MIo and SIo, and between the two tasks, within the same subject. The more salient result from calculating the preferred direction is that there is a change in the distribution between control and nerve block within the same exact population. Discussions relating to the comparison between individuals are speculative and cannot be confirmed without the inclusion of many more subjects.

      (5) For Figure 7, I would recommend showing the results of the Sham injection in the same figure instead of a supplement.

      Thank you for the suggestion. We have added these results to the figure.

      (6) I think the effects of the sensory block on the tongue kinematics are underexplored in Figure 7 and Figure 8. The authors could explore the deficits in tongue shape, and the temporal components of the trajectory.

      Some of these effects on feeding have been explored in a previous paper, Laurence-Chasen et al., 2022. We performed some additional analyses on changes to kinematics during drinking, including the number of licks per 10-second trial and the length of individual licks. The results of these are included below. We also calculated the difference in the speed of tongue movement during drinking, which generally decreased and exhibited an increase in variance with nerve block (f-test, p < 0.001). However, we have not included these figures in the main paper as they do not inform us about directionality.

      Author response image 5.

      Left halves of hemi-violins (black) are control and right halves (red) are nerve block for an individual. Horizontal black lines represent the mean and horizontal red lines the median. Results of two-tailed t-test and f-test are indicated by asterisks and crosses, respectively: *,† p < 0.05; **,†† p < 0.01; ***,††† p < 0.001.

      (9) In Figures 9 and 10. Are the same neurons being recorded before and after the nerve block? It is unclear if the overall "population" properties are different, or if the properties of individual neurons are changing due to the nerve block.

      Yes, the same neurons are being recorded before and after nerve block. Specifically, Figure 9B shows that the properties of many individual neurons do change due to the nerve block. Differences in the overall population response may be attributed to some of the units having reduced/no activity during the nerve block session.

      Additionally, I recommend that the authors improve their introduction and provide more context to their discussion. Please elaborate on what you think are the main conceptual advances in your study, and place them in the context of the existing literature. By my count, there are 26 citations in this paper, 4 of which are self-citations - clearly, this can be improved upon.

      Thank you for this suggestion. We have extensively rewritten the Introduction and Discussion. We now discuss the main conceptual advances of our study and place them in the context of the existing literature.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      (1) Their first major claim is that fluid flows alone must be quite strong in order to fragment the cyanobacterial aggregates they have studied. With their rheological chamber, they explicitly show that energy dissipation rates must exceed "natural" conditions by multiple orders of magnitude in order to fragment lab strain colonies, and even higher to disrupt natural strains sampled from a nearby freshwater lake. This claim is well-supported by their experiments and data.

      We thank the reviewer for this positive comment. We fully agree, as our fragmentation experiments on division-formed colonies clearly demonstrate their strong mechanical resistance in naturally occurring flows.

      (2) The authors then claim that the fragmentation of aggregates due to fluid flows occurs through erosion of small pieces. Because their experimental setup does not allow them to explicitly observe this process (for example, by watching one aggregate break into pieces), they implement an idealized model to show that the nature of the changes to the size histogram agrees with an erosion process. However, in Figure 2C there is a noticeable gap between their experiment and the prediction of their model. Additionally, in a similar experiment shown in Figure S6, the experiment cannot distinguish between an idealized erosion model and an alternative, an idealized binary fission model where aggregates split into equal halves. For these reasons, this claim is weakened.

      The two idealized models of colony fragmentation, namely erosion of single cells and fragmentation into equal sizes (or binary fission), lead to distinguishable final size distributions. We believe that our experiments for division-formed colonies support the hypothesis of the erosion mechanism. Specifically, Figure 2E shows that colony fragmentation resulted in a decrease of large colonies and a strong increase of single cells and dimers (two cells). In our view, the strong increase of single cells and dimers provides quite convincing (but indirect) evidence supporting the erosion mechanism. This is described on lines 112-121. To further address the reviewer’s concern, we have included in the revised version of Figure 2 (panels B and D) a direct comparison between these two fragmentation models for large division-formed colonies fragmented at a high dissipation rate of ε = 5.8 m<sup>2</sup>/s<sup>3</sup>. Furthermore, we have included the new Supplementary Figure S9, which details the model predictions for the colony size distribution at various time points.

      The ideal equal fragments model (i.e., where every fracture event produces two identical fragments with half the original biovolume) does not capture the biovolume transfer from large colonies to single cells, as observed for the experimental results in panel D of Figure 2 and panel E of Figure S9. In contrast, the erosion model, in panel D of Figure 2 and panel D of Figure S9, provides a good prediction of the experimental results within the experimental uncertainty. The different fragmentation models are discussed in lines 226-228 of the revised manuscript and lines 865-873 of the SI.
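      The qualitative difference between the two idealized mechanisms can be illustrated with a toy calculation. The sketch below is not the model used in the manuscript: it ignores rates, size-dependent strength, and aggregation, and it assumes that biovolume is proportional to cell number. The function names (`erode`, `split_equal`, `single_cell_fraction`) are hypothetical and chosen only for illustration. It shows that erosion moves biovolume directly from large colonies to single cells, whereas equal splitting produces intermediate sizes first.

```python
from collections import Counter


def erode(sizes, n_events):
    """Idealized erosion: each event detaches one cell from the largest colony."""
    counts = Counter(sizes)  # maps colony size (in cells) -> number of colonies
    for _ in range(n_events):
        largest = max(counts)
        if largest <= 1:  # only single cells left, nothing to fragment
            break
        counts[largest] -= 1
        if counts[largest] == 0:
            del counts[largest]
        counts[largest - 1] += 1  # the parent colony loses one cell
        counts[1] += 1            # the detached single cell
    return counts


def split_equal(sizes, n_events):
    """Idealized equal fragments: each event splits the largest colony in half."""
    counts = Counter(sizes)
    for _ in range(n_events):
        largest = max(counts)
        if largest <= 1:
            break
        counts[largest] -= 1
        if counts[largest] == 0:
            del counts[largest]
        half = largest // 2
        counts[half] += 1
        counts[largest - half] += 1  # handles odd sizes without losing cells
    return counts


def single_cell_fraction(counts):
    """Fraction of total biovolume (assumed proportional to cell number) in single cells."""
    total = sum(size * n for size, n in counts.items())
    return counts[1] / total


if __name__ == "__main__":
    start = [64]  # one colony of 64 cells (illustrative value)
    print("erosion:", single_cell_fraction(erode(start, 10)))
    print("equal fragments:", single_cell_fraction(split_equal(start, 10)))
```

      After the same number of fracture events, the erosion rule has already transferred biovolume to single cells, while the equal-fragments rule has only populated intermediate size classes. This is the signature the authors refer to when comparing the two models against the measured size distributions.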

      (3) Their third major claim is that fluid flows only weakly cause cells to collide and adhere in a "coming together" process of aggregate formation. They test this claim in Figure 3, where they suspend single cells in their test chamber and stir them at moderate intensity, monitoring their size histogram. They show that the size histogram changes only slightly, indicating that aggregation is, by and large, not occurring at a high rate. Therefore, they lend support to the idea that cell aggregation likely does not initiate group formation in toxic cyanobacterial blooms. Additionally, they show that the median size of large colonies also does not change at moderate turbulent intensities. These results agree with previous studies (their own citation 25) indicating that aggregates in toxic blooms are clonal in nature. This is an important result and well-supported by their data, but only for this specific particle concentration and stirring intensity. Later, in Figure 5 they show a much broader range of particle concentrations and energy dissipation rates that they leave untested.

      We thank the reviewer for this positive comment. We agree that our experimental results show clear evidence that aggregated colonies have a weaker structure in comparison to division-formed colonies, thus supporting the hypothesis that clonal expansion is the main mechanism for colony formation under most natural settings. The range of energy dissipation rates of our experimental setup covers almost entirely the region for which aggregated and division-formed colonies differ in their fragmentation behavior (Zone III of Figure 5). Within this zone, aggregated colonies are fragmented and only the division-formed colonies are able to withstand the hydrodynamic stresses. Furthermore, we show that this fragmentation behavior has a low sensitivity to the total biovolume fraction, as displayed in the Supplementary Figures S2 and S4 and discussed in lines 151-154 and 160-163. We agree that our cone-and-plate setup covers a limited parameter range, and we have added a detailed discussion of these limitations in the revised manuscript, under section Materials and Methods in lines 462-473.

      (4) The fourth major result of the manuscript is displayed in Equation 8 and Figure 5, where the authors derive an expression for the ratio between the rate of increase of a colony due to aggregation vs. the rate due to cell division. They then plot this line on a phase map, altering two physical parameters (concentration and fluid turbulence) to show under what conditions aggregation vs. cell division are more important for group formation. Because these results are derived from relatively simple biophysical considerations, they have the potential to be quite powerful and useful and represent a significant conceptual advance. However, there is a region of this phase map that the authors have left untested experimentally. The lowest energy dissipation rate that the authors tested in their experiment seemed to be \dot{epsilon}~1e-2 [m^2/s^3], and the highest particle concentration they tested was 5e-4, which means that the authors never tested Zone II of their phase map. Since this seems to be an important zone for toxic blooms (i.e. the "scum formation" zone), it seems the authors have missed an important opportunity to investigate this regime of high particle concentrations and relatively weak turbulent mixing.

      We agree with the reviewer that Zone (II) of Figure 5 is of great importance to dense bloom formation under wind mixing and that this parameter range was not covered by our experiments using a cone-and-plate shear flow. The measuring range of our device was motivated by engineering applications such as artificial mixing of eutrophic lakes using bubble plumes, as well as preliminary experiments which demonstrated that high levels of dissipation rate were required to achieve fragmentation. The range of dissipation rates that can be achieved by the cone-and-plate setup is limited at the lower end by the accumulation of colonies near the stagnation point at the conical tip and at the upper end by the spillage of fluid out of the chamber. We now discuss this measuring range in lines 462-473 of the revised manuscript.

      Although our setup does not cover Zone (II), we now refer to recent results in the literature for evidence of aggregation dominance in Zone (II). The experimental study of Wu et al. (2024) (reference number 64 of the revised manuscript) investigated the formation of Microcystis surface scum layers in wind-mixed mesocosms. Their study identified aggregation of colonies in the scum layer, resulting in increases of colony size at rates faster than cell division. These results agree with our model, and the parameter range investigated falls within Zone II. We have included in the revised version, lines 328-337, a detailed discussion elucidating the parameter range covered in our experiments and the findings of Wu et al. (2024).

      Other items that could use more clarity:

      (5) The authors rely heavily on size distributions to make the claims of their paper. Yet, how they generated those size distributions is not clearly shown in the text. Of primary concern, the authors used a correction function (Equation S1) to estimate the counts of different size classes in their image analysis pipeline. Yet, it is unclear how well this correction function actually performs, what kinds of errors it might produce, and how well it mapped to the calibration dataset the authors used to find the fit parameters.

      We agree with the reviewer that more details of the correction function should be included. We have included in the revised version of the Supporting Information, in lines 785-796, a more detailed explanation of the correction function. Furthermore, a direct comparison of raw and corrected histograms of the size distribution and its associated uncertainty is presented in the new Supplementary Figure S8.

      (6) Second, in their models they use a fractal dimension to estimate the number of cells in the group from the group radius, but the agreement between this fractal dimension fit and the data is not shown, so it is not clear how good an approximation this fractal dimension provides. This is especially important for their later derivation of the "aggregation-to-cell division" ratio (Equation 8)

      We agree with the reviewer that more details on the estimation of fractal dimension are needed. The revised version, under Materials and Methods in lines 508-515, now includes the detailed estimation procedure, the number of colonies analysed, and the associated uncertainty.
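      For readers unfamiliar with the approach, the generic fractal scaling between colony size and cell number can be written as below. This is the standard textbook form, given here as an assumption; the exact expression, prefactor, and fitted values used in the manuscript are not reproduced in this response. The symbols are illustrative: $N$ is the number of cells in a colony, $R$ the colony radius, $a$ the single-cell radius, $k$ an order-one prefactor, and $D_f$ the fractal dimension.

```latex
N \approx k \left( \frac{R}{a} \right)^{D_f},
\qquad
D_f = \frac{\mathrm{d} \ln N}{\mathrm{d} \ln R}
```

      Under this form, $D_f$ is the slope of a straight-line fit of $\ln N$ against $\ln (R/a)$, so the scatter of the data around that line is one way to judge how well the approximation holds.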

      Reviewer #1 (Recommendations For The Authors):

      In light of the weak evidence for claim #2 outlined above, I believe the paper would benefit from a more explicit comparison in Figure 2C of the two models - idealized erosion, and idealized binary fission. With such a comparison, the authors would have stronger footing to claim that one process is more important than the other.

      As mentioned in our answer above to comment #2 of the public review, we have included in the revised version of Figure 2 (panels B and D) a direct comparison between the erosion and equal fragments (binary fission) models for large division-formed colonies fragmented under ε = 5.8 m<sup>2</sup>/s<sup>3</sup>. The comparison is further detailed in the new Supplementary Figure S9 for representative time points. Only the erosion model can recover the biovolume transfer from large colonies to single cells, as observed for the experimental results in Figure 2D and further detailed in Figure S9D. We believe that the revised version of Figure 2 and the new Supplementary Figure S9 provide strong evidence in support of the erosion fragmentation model.

      Would the authors comment on their chosen range of experimental dissipation rates? For instance, was their goal more to investigate industrial/engineering applications where the goal is to disrupt the cyanobacteria, but not really typical natural conditions under which the groups might form?

      The choice of experimental dissipation rates in our experiment was such that it covers engineering applications such as artificial mixing of eutrophic lakes using bubble plumes. We have now clarified in the Introduction, on lines 37-39, that artificial mixing has been successfully applied in several lakes to suppress cyanobacterial blooms. Furthermore, we have now clarified in the caption of Figure 5 that the bars on the right side indicate typical values of dissipation rates induced by natural wind-mixing, bubble plumes in artificially mixed lakes, and laboratory-scale experiments such as cone-and-plate systems and stirred tanks. The dissipation rates induced by the bubble plumes in artificially mixed lakes could potentially fragment aggregated cyanobacterial colonies and thus disrupt bloom formation. However, our preliminary experiments demonstrated that high levels of dissipation rate were required to achieve fragmentation, therefore we’ve focused on the upper range of values (0.01 to 10 m<sup>2</sup>/s<sup>3</sup>).

      The dissipation rates generated by the cone-and-plate approach are indeed higher than the dissipation rates under typical natural conditions in lakes. We have now added a detailed discussion of the range of dissipation rates generated by the cone-and-plate approach in the revised manuscript, under section Materials and Methods in lines 462-473, where we also explain that these values are higher than the natural dissipation rates generated by wind action in lakes. However, the more generic insights obtained by our study, shown in Figure 5, are relevant for dissipation rates of natural lakes (e.g., Zone II). Therefore, in our discussion of Figure 5 we have now included the recent findings of Wu et al. (2024) (reference number [64] of the revised manuscript), who studied bloom formation of Microcystis in mesocosm experiments at dissipation rates representative of natural conditions; see also our reply to the next comment.

      The authors should consider testing the space of Zone II on their phase map, for instance at very high particle concentrations and even lower rotational speeds, in order to show that their derivations match experiments.

      Good point. As mentioned in our answer above to comment #4 of the public review, Zone II lies beyond the measuring range of our experimental setup. Instead, we refer to the recent study of Wu et al. (2024) (reference number [64] of the revised manuscript) which demonstrated that dense scum layers of Microcystis colonies are aggregation-dominated. These mesocosm experiments agree with our model predictions and their parameter range falls within Zone II. We have included in the revised version, lines 328-337, a detailed discussion where we elucidate the parameter range covered in our experiments and compare our predictions for Zone II with the recent findings of Wu et al. (2024).

      The authors should show their calibration data and fit for the correction function of equation S1. Additionally, you may consider showing "raw" and "corrected" histograms of the size distribution, to demonstrate exactly what corrections are made.

      As mentioned in our answer above to comment #5 of the public review, we have included in the revised version of the Supporting Information the new Supplementary Figure S8, which shows the raw and adjusted histograms of the size distribution, including the associated uncertainties. Furthermore, the correction function is now explained in detail in the new Supporting Information Text in lines 785-796.

      The authors might consider commenting on Figure S3 a bit more in the main text. Even at very high dissipation rates, the cyanobacterial groups don't plummet to size 1, but stay in an equilibrium around 10-20x the diameter of a single cell. What might this mean for industrial applications trying to break up the groups?

      We agree with the reviewer that further discussion of Figure S3, panels E and F, is warranted. In the revised version of the manuscript, under section Fragmentation of Microcystis colonies occurs through erosion in lines 133-137, we have now included a discussion of this figure. Figure S3F shows that more than 90% of the total biovolume ends up in the category “small colonies” (mostly single cells and dimers); hence, most of the initially large colonies do fragment to single cells or dimers. Only about 5-10% of the biovolume remains as “large colonies” of 10-20 cells. Although it is challenging to draw definitive conclusions about the behavior of these remaining large colonies, as they account for only a minor fraction of the suspension, one hypothesis is that variability in mechanical properties between colonies results in a subset of colonies exhibiting exceptional resistance even to very high dissipation rates (see lines 133-137).

      Minor comments:

      Typo Caption of Figure 2: Should read [m^2/s^3] for units

      Thanks for catching this typo. The units in the caption of Figure 2 have been corrected to [m^2/s^3].

      There is no Equation 10 in Materials and Methods as indicated in the rheology section.

      We thank the reviewer for pointing out the lack of clarity in this algebraic manipulation. In fact, the yield stress has to be substituted in the current Equation 11 (previously Eq.10), from which the critical dissipation rate must be substituted in Equation 3. The result is the critical colony size (l* = 2.8) mentioned in line 243 of the revised manuscript. The correct equation numbers and algebraic substitutions are now indicated in lines 241-243 of the revised version of the manuscript.

      Reviewer #2 (Public review):

      Especially the introduction seems to imply that shear force is a very important parameter controlling colony formation. However, if one looks at the results this effect is overall rather modest, especially considering the shear forces that these bacterial colonies may experience in lakes. The main conclusion seems that not shear but bacterial adhesion is the most important factor in determining colony size. As the importance of adhesion had been described elsewhere, it is not clear what this study reveals about cyanobacterial colonies that was not known before.

      We would like to emphasize several key findings that our study reveals about the impacts of fluid flow on cyanobacterial colonies:

      (I) Quantification of mechanical strength in cyanobacterial colonies: Our results demonstrate the high mechanical strength of cyanobacterial colonies, as evidenced by the requirement of high shear rates to achieve fragmentation. This had not previously been shown for cyanobacterial colonies. To this end, our study highlights the resilience of these colonies against naturally occurring flows and bridges the gap between theoretical assumptions about colony strength and experimentally measured mechanical properties.

      (II) The discovery that the mechanical strength of colonies differs between colonies formed by cell division and colonies formed by aggregation. This, too, had not previously been shown for cyanobacterial colonies.

      (III) Validation of a hypothesis regarding colony formation: Using a fluid-mechanical approach, we confirm the findings of recent genetic studies (references 25 and 67 of the revised version of the manuscript) which indicated that colony formation occurs predominantly via cell division rather than cell aggregation under natural conditions (except in very dense blooms).

      (IV) Practical guidelines for cyanobacterial bloom control: Our findings provide valuable insights into the design of artificial mixing systems applied in several lakes. Artificial mixing of lakes is based on fundamentals of fluid flow, aiming at preventing aggregation of buoyant cyanobacteria in scum layers at the water surface. Our results show that the dissipation rates generated by bubble plumes in artificially mixed lakes can fragment cyanobacterial colonies formed by aggregation, but are not intense enough to cause fragmentation of division-formed colonies (see Figure 5 and lines 348-360).

      The agreement between model and experiments is impressive, but the role of the fit parameters in achieving this agreement needs to be further clarified.

      The influence of the fit parameters (namely the stickiness α1 and the pairs of colony strength parameters S1,q1,S2,q2) is discussed in the sections Dynamical changes in colony size modelled by a two-category distribution in lines 247-253 and Materials and Methods in lines 559-565. We kept the discussion concise to maintain readability. However, we agree with the reviewer that additional details about the importance of the fit parameters and the sensitivity of the results to these parameters could be beneficial. In the revised version of the section Materials and Methods in lines 560-563, we have included a detailed discussion of the fit parameters.

      The article may not be very accessible for readers with a biology background. Overall, the presentation of the material can be improved by better describing their new method.

      We apologize for the limited readability of the description of the experimental setup and model used. In the revised version of the manuscript and the SI, we have detailed further the new methods presented here. The modifications include a detailed description of the operating range of the cone-and-plate shear setup (subsection Cone-and-plate shear of the section Materials and Methods, in lines 462-473). Furthermore, we think that incorporation of the recent experimental results of Wu et al. (2024), on lines 331-337 of the manuscript, will appeal to readers with a biology background. Their mesocosm experiments support our model prediction that aggregation is the dominant mechanism for colony formation in region (II) of Figure 5.

      Reviewer #2 (Recommendations For The Authors):

      (1) The authors seem too modest in claiming technological advance. They should describe the technological advance of combining microscopy with rheometry, in such a way that this invites others to apply this or similar approaches on biological samples. Even though I feel that the advancement of knowledge of this system by their method is relatively modest, there may be more advances in other systems.

      We appreciate the positive view of the reviewer towards the importance of this technology and we agree that its advantages should be advertised to researchers investigating similar systems. We have now given more attention to the technological advance of combining microscopic imaging with rheometry in the final paragraph of the Conclusions (lines 386-400), where we now also briefly discuss an interesting recent study of marine snow (Song et al. 2023, Song and Rau 2022, reference numbers 70 and 71 of the revised manuscript), which used a similar combination of microscopy and rheometry as in our study. Furthermore, in the Methods section, we now briefly explain how the rheometry can be adjusted to investigate other systems (lines 474-480).

      (2) It seems reasonable -also based on what we already know about these aggregates - to assume that the main difference in shear sensitivity between field samples and cultures lies in the production of extracellular polysaccharide substance (EPS). To go beyond what is already known, the study could try to provide more direct and quantitative evidence for EPS involvement. For example, using a chemical quantification of EPS levels, or perturbing EPS levels using digestive enzymes.

      We agree with the reviewer that further characterization of the EPS is highly relevant to understand the mechanical strength of colonies. However, we believe that chemical quantification and/or degradation of EPS lies beyond the scope of our article and should be addressed by future studies.

      (3) Assuming EPS is indeed the reason for the differences in shear resistance: the authors speculate the reason why the field samples have more EPS lies in chemical composition (Calcium/nitrogen levels). In addition, there could be grazing that is known to promote aggregation (possibly increasing EPS), or just inherent genetic differences between strains. I am not necessarily expecting the authors to explore this direction experimentally, but it seems certainly feasible and would make the final result less speculative.

      We agree with the reviewer that there are more biotic and abiotic factors that can influence EPS amount and composition. The influence of grazing and other relevant factors on cell adhesion is discussed in references [26-29], cited in our introduction in lines 50-53. As discussed in our answer to recommendation #2, we believe that a quantitative investigation of these various factors is beyond the scope of this work and should be addressed in future studies.

      (4) A cool finding seems to be the critical relative diameter (Fig 2E), a colony size that seems invariant under shear. I was slightly surprised that the authors seem to take little effort to understand this critical diameter mechanistically (for example by predicting it, or experimentally perturbing it). Again, not a necessary requirement, but this is where the study could harness its technological advantage to provide a more quantitative understanding of something that goes beyond the existing knowledge of the system.

      We apologize to the reviewer if our descriptions and discussions of Figure 2 were unclear. One of the key conclusions from our experiments is that the critical relative diameter depends on the dissipation rate, as shown in Figure 2F. This dependence is also incorporated into the model through the constitutive equation (2). Furthermore, we expect the mechanical resistance of colonies, quantified by the critical relative diameter, to be affected by other biotic and abiotic factors that influence EPS amount and composition.

      (5) The jump from 0.019 to 1.1 m²/s³ seems large. What was the reason for not exploring intermediate values? The authors should also define low, modest and intense dissipation rates more clearly. Currently, they seem somewhat arbitrarily defined, i.e. 0.019 m²/s³ is described as low (methods) and moderate (results). In Fig 2, the authors further talk about low dissipation rates without a quantitative description.

      We thank the reviewer for pointing out the lack of clarity in the choice of parameter range and the nomenclature. Regarding the former, the suspension of division-formed colonies of Microcystis strain V163 displayed negligible fragmentation for dissipation rates between 0.019 and 1.1 m<sup>2</sup>/s<sup>3</sup>, as seen in Figures S2A and S3A. Due to the low sensitivity of the fragmentation results in this region, we do not expect a change in behavior for intermediate values. Regarding the nomenclature, we have corrected the inconsistencies throughout the text. We have chosen to name the dissipation rate values as: low for values typical of wind-mixing, moderate for values typical of the core of bubble plumes, and intense for values typical of propellers. Whenever mentioned in the text, the numerical value of dissipation rate is also included to avoid doubt.

      (6) The structure and narrative of the paper can be improved. The article first describes all lab culture experiments and then the model, while the first figure already shows model fits. Perhaps it would be better to first describe the aggregation experiments, to constrain the appropriate terms of the model, and then move to fragmentation.

      We appreciate the recommendation of the reviewer regarding the structure. We have chosen to describe first the fragmentation experiments (Fig. 2), as these can be understood without introducing the aggregation effects. In contrast, the steady state results in the aggregation experiments (Fig. 3) come from the balance between aggregation and fragmentation. Therefore, we judged the current order to be more appropriate. The model fits are combined with the experimental results in Figures 2 and 3 to have a concise display. We have ensured that all the concepts required to understand each figure panel are explained prior to their discussion.

      (7) The number of data points that go into the histogram needs to be indicated. The main reason is that the authors report the distribution in terms of the biovolume fraction, suggesting the numerical counts are converted into volume. This to me seems like the most sensible parameter, but I could not find how this conversion is calculated (my apologies if I missed it). This seems especially relevant because a single large colony can impact this histogram quite considerably.

      We apologize for the lack of clarity in the calibration and conversion steps of the size distribution. As discussed above in the answer to comment #5 of the reviewer #1, more details of the calibration process have been added to the revised version of the Supporting Information Text in lines 785-796. Furthermore, the new Supplementary Figure S8 presents examples of the raw and adjusted size distribution, including the total number of counted colonies per histogram and the associated uncertainties in the concentration and biovolume distributions.

      (8) Over the timescales measured here, colonies could start sinking (or floating), possibly in a size-dependent manner, that could lead to a bias due to boundary effects. Did the authors consider this potential artifact?

      The sinking or floating of colonies is a relevant process which was taken into account in the choice of our parameter range for the dissipation rate. The minimum dissipation rate used in our experiments ensures that the upward inertial velocity near stagnation is sufficient to counteract the sedimentation of colonies. A detailed discussion of the choice of the parameter range is now included in the revised version of the Materials and Methods in lines 462-473.

      (9) "On the one hand, sequencing of the genetic diversity within Microcystis colonies supports the hypothesis that colony formation under natural conditions is primarily driven by cell division [25]. On the other hand, cell aggregation can occur on a shorter time scale and may offer improved protection against high grazing pressure [26]." This appears somewhat constructed, as what is described as "on the other hand" is not evidence against the genetic diversity.

      We agree that the suggested dichotomy in this text appeared somewhat constructed, and we have now removed the wording “on the one hand” and “on the other hand”. The studies from reference [25] demonstrated that the genetic diversity between independent Microcystis colonies is much greater than the diversity within colonies. If cell aggregation were the dominant mechanism, a similar genetic diversity would be observed between and within colonies, which contrasts with the findings from reference [25]. We have adjusted the text in the revised manuscript, in lines 46-54, to clarify this point.

      (10) The phase diagram seems largely based on extrapolations that are made outside of the measurement regime (e.g. dark red bars indicating the dissipation rate, Fig 5 - by the way 1 this color scheme could use some better contrast, by the way 2 Fig S7 suggests a wider dissipation rate range than indicated in Fig 5, why?). Hence there seems to be the need to more clearly delineate experimental results, simulations, and extrapolations in the phase diagram.

      We agree with the reviewer that further clarifications should be given about the parameter range covered in our experiments and apologize for the lack of readability in the color scheme of Fig 5. In lines 329-337, 346-347, 353-355, we have highlighted the parameter range covered by our experiments as well as the range covered by previous studies of wind-mixed mesocosms (namely reference [64] of the revised manuscript). Regarding the color scheme of Figure 5, we have modified the legend of the figure to improve readability. The color contrast was increased and leader lines were added to connect the colored bars with the respective label.

      (11) Unfortunately, the manuscript did not contain line numbers.

      We apologize to the reviewer for the lack of line numbers in our initial version. The revised version of the manuscript now contains line numbers, both in the main text and the supporting information.

      (12) Fig 2D. Caption is too minimal. Y-axis could better be named "Fraction of colonies" as both small and large colonies are plotted.

      The caption for Figure 2D was extended to better describe the plot. We have kept the y-axis label as “Fraction of small colonies”, since this is the quantity displayed by the three curves in the plot.

      (13) An inset should have axis labels.

      All the insets in our plots display the same variables as their respective plots. In order to keep the plots light and preserve readability, we therefore prefer to present the axis labels only along the x-axis and y-axis of the main plots, which implies by convention that the same axis labels also apply to the insets. To the best of our knowledge, this is a common approach.

      (14) Page 5, first words. Likely Fig 3A, not 2A was meant.

      We thank the reviewer for pointing out this readability issue. We intend to compare both Figures 2A and 3A. The text of the revised manuscript, in lines 146-148, has been adjusted with the correct figure numbers.

      (15) Introduction, second last paragraph, third last line. "suspension leaded to a broad distribution" I assume you meant "... led to a ..."

      We thank the reviewer for pointing out this typo. It has been corrected (line 122).

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      In this study, the authors offer a theoretical explanation for the emergence of nematic bundles in the actin cortex, carrying implications for the assembly of actomyosin stress fibers. As such, the study is a valuable contribution to the field of actomyosin organization in the actin cortex. While the theoretical work is solid, experimental evidence in support of the model assumptions remains incomplete. The presentation could be improved to enhance accessibility for readers without a strong background in hydrodynamic and nematic theories.

      To address the weaknesses identified in this assessment, we have expanded the motivation and description of the theoretical model, specifically emphasizing the experimental evidence supporting its rationale and assumptions. These changes in the revised manuscript are implemented in the first two paragraphs of Section “Theoretical model” and in a more detailed description and justification of the different mathematical terms that appear in that section. We have made an effort to map in our narrative different terms to mechanistic processes in the actomyosin network. Even if the nature of the manuscript is inevitably theoretical, we think that the revised manuscript will be more accessible to a broader spectrum of readers.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this article, Mirza et al developed a continuum active gel model of the actomyosin cytoskeleton that accounts for nematic order and density variations in actomyosin. Using this model, they identify the requirements for the formation of dense nematic structures. In particular, they show that self-organization into nematic bundles requires both flow-induced alignment and active tension anisotropy in the system. By varying model parameters that control active tension and nematic alignment, the authors show that their model reproduces a rich variety of actomyosin structures, including tactoids, fibres, asters as well as crystalline networks. Additionally, discrete simulations are employed to calculate the activity parameters in the continuum model, providing a microscopic perspective on the conditions driving the formation of fibrillar patterns.

      Strengths:

      The strength of the work lies in its delineation of the parameter ranges that generate distinct types of nematic organization within actomyosin networks. The authors pinpoint the physical mechanisms behind the formation of fibrillar patterns, which may offer valuable insights into stress fiber assembly. Another strength of the work is connecting activity parameters in the continuum theory with microscopic simulations.

      We thank the referee for these comments.

      Weaknesses:

      (A) This paper is a very difficult read for nonspecialists, especially if you are not well-versed in continuum hydrodynamic theories. Efforts should be made to connect various elements of theory with biological mechanisms, which is mostly lacking in this paper. The comparison with experiments is predominantly qualitative.

      We understand the point of the referee. While it is unavoidable to present the continuum hydrodynamic theory behind our results, we have made an effort in the revised manuscript to (1) motivate the essential features required from a theoretical model of the actomyosin cytoskeleton capable of describing its nematic self-organization (first two paragraphs of Section “Theoretical model”), and to (2) explicitly explain the physical meaning of each of the mathematical terms in the theory, and when appropriate, relate them to molecular mechanisms in the cytoskeleton. We hope that the revised manuscript addresses the concern of the referee.

      Regarding the comparison with experiments, they are indeed qualitative because the main point of the paper is to establish a physical basis for the self-organization of dense nematic structures in actomyosin gels. Somewhat surprisingly, we argue that a compelling mechanism explaining the tendency of actomyosin gels to form patterns of dense nematic bundles has been lacking. As we review in the introduction, these patterns are qualitatively diverse across cell types and organisms in terms of geometry and dynamics, and for this reason, our goal is to show that the same material in different parameter regimes can exhibit such qualitative diversity. A quantitative comparison is difficult for several reasons. First, many of the parameters in our theory have not been measured and are expected to vary widely between cell types. In fact, estimates in the literature often rely on comparison with hydrodynamic models such as ours. For this reason, we chose to delineate regimes leading to qualitatively different emerging architectures and dynamics. Second, the patterns of nematic bundles found across cell types depend on the interaction between (1) the intrinsic tendency of actomyosin gels to form such structures studied here and (2) other elements of the cellular context. For instance, polymerization and retrograde flow from the lamellipodium, the physical barrier of the nucleus, and the interaction with the focal adhesion machinery are essential to understand the emergence of stress fibers in adherent cells. Cell shape and curvature anisotropy control the orientation of actin bundles in parallel patterns in the wings and trachea of insects. Nuclear positions guide the actin bundles organizing the cellularization of Sphaeroforma arctica [11]. Here, we focus on establishing that actomyosin gels have an intrinsic ability to self-organize into dense nematic bundles, and leave how this property enables the morphogenesis of specific structures for future work. We have emphasized this point in the revised conclusions section.

      (B) It is unclear if the theory is suited for in vitro or in vivo actomyosin systems. The justification for various model assumptions, especially concerning their applicability to actomyosin networks, requires a more thorough examination.

      We thank the referee for this comment. Our theory is applicable to actomyosin gels originating from living cells. To our knowledge, the ability of reconstituted actomyosin gels from purified proteins to sustain the kind of contractile dynamical steady-states observed in living cells is very limited. In the revised manuscript, we cite a very recent preprint presenting very exciting but partial results in this direction [49]. Instead, reconstituted in vitro systems encapsulating actomyosin cell extracts robustly recapitulate contractile steady-states. This point has been clarified in the first paragraph of Section “Theoretical model”.

      (C) The classification of different structures demands further justification. For example, the rationale behind categorizing structures as sarcomeric remains unclear when nematic order is perpendicular to the axis of the bands. Sarcomeres traditionally exhibit a specific ordering of actin filaments with alternating polarity patterns.

      We agree with the referee and in the revised manuscript we have avoided the term “sarcomeric” because it refers to very specific organizations in cells. What we previously called “sarcomeric patterns”, where bands of high density exhibit nematic order perpendicular to the axis of the bands, is not a structure observed to our knowledge in cells. It is introduced to delimit the relevant region in parameter space. In the revised manuscript, we refer to this pattern as “banded pattern with perpendicular nematic organization” or “banded pattern” in short.

      (D) Similarly, the criteria for distinguishing between contractile and extensile structures need clarification, as one would expect extensile structures to be under tension contrary to the authors' claim.

      We thank the referee for raising this point, which was not sufficiently clarified in the original manuscript. We first note that in incompressible active nematic models, active tension is deviatoric (traceless and anisotropic) because an isotropic component would simply get absorbed by the pressure field enforcing incompressibility. Being compressible, our model admits an active tension tensor with deviatoric and isotropic components. We always consider a contractile (positive) isotropic component of active tension, but the deviatoric component can be either contractile (𝜅 > 0) or extensile (𝜅 < 0), where we follow the common terminology according to which in contractile/extensile active nematics the active stress is proportional to q with a positive/negative proportionality constant [see e.g. https://doi.org/10.1038/s41467-018-05666-8]. Furthermore, as clarified in the revised manuscript, total active stresses accounting for the deviatoric and isotropic components are always contractile (positive) in all directions, as enforced by the condition |𝜅| < 1.

      For fibrillar patterns, we need 𝜅 < 0, and therefore active stresses are larger perpendicular to the nematic direction. This means that the anisotropic component of the active tension is extensile, although, accounting for the isotropic component, total active tension is contractile (see Fig. 1c). This is now clarified in the text following Eq. 7 and in Fig. 1.
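
      The sign argument above can be made concrete with a short worked calculation. This is only a sketch under stated assumptions: the active tension is taken to have the form below (the exact form of Eq. 7 in the manuscript may differ), with activity λ > 0 and density ρ > 0, and the normalization of q is an assumption, chosen so that its eigenvalues are ±s along the director n and the perpendicular unit vector m, with 0 ≤ s ≤ 1.

```latex
\begin{aligned}
\boldsymbol{\sigma}^{\mathrm{act}} &= \lambda\,\rho\,\left(\mathbf{I} + \kappa\,\mathbf{q}\right),
\qquad \mathbf{q}\,\mathbf{n} = s\,\mathbf{n}, \quad \mathbf{q}\,\mathbf{m} = -s\,\mathbf{m}, \quad 0 \le s \le 1, \\
\sigma^{\mathrm{act}}_{\parallel} &= \mathbf{n}\cdot\boldsymbol{\sigma}^{\mathrm{act}}\,\mathbf{n} = \lambda\,\rho\,(1 + \kappa s),
\qquad
\sigma^{\mathrm{act}}_{\perp} = \mathbf{m}\cdot\boldsymbol{\sigma}^{\mathrm{act}}\,\mathbf{m} = \lambda\,\rho\,(1 - \kappa s), \\
\sigma^{\mathrm{act}}_{\parallel} - \sigma^{\mathrm{act}}_{\perp} &= 2\,\lambda\,\rho\,\kappa\,s .
\end{aligned}
```

      Under these assumptions, both principal active tensions are positive whenever |𝜅| s < 1, which |𝜅| < 1 guarantees, while for 𝜅 < 0 the perpendicular component is the larger one, in line with the statement that the anisotropic part is extensile even though the total active tension is contractile.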

      However, following fibrillar pattern formation and as a result of the interplay between active and viscous stresses, the total stress can be larger along the emergent dense nematic structures (“contractile structures”) or perpendicular to them (“extensile structures”). To clarify this point, in the revised Fig. 4 and the text referring to it, we have expanded our explanation and plotted the difference between the total stress component parallel to the nematic direction (𝜎∥) and the component perpendicular to the nematic direction (𝜎⊥), with contractile structures satisfying 𝜎∥ − 𝜎⊥ > 0 and extensile structures satisfying 𝜎∥ − 𝜎⊥ < 0. See lines 280 to 303. This is consistent with the common notion of contractile/extensile systems in incompressible nematic systems [see e.g. https://doi.org/10.1038/s41467-018-05666-8].
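
      The criterion based on the sign of 𝜎∥ − 𝜎⊥ can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `stress_anisotropy` and `classify_bundle`, the tolerance, and the sample stress tensor are hypothetical and serve only to show how a 2D symmetric stress tensor is projected onto the nematic direction and its perpendicular.

```python
import numpy as np


def stress_anisotropy(sigma, n):
    """Return sigma_parallel - sigma_perp for a 2D stress tensor and director n.

    sigma: 2x2 symmetric total stress tensor (hypothetical input).
    n: nematic direction; it is normalized here, so any nonzero length works.
    """
    sigma = np.asarray(sigma, dtype=float)
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    m = np.array([-n[1], n[0]])      # unit vector perpendicular to n
    sigma_par = n @ sigma @ n        # stress component along the director
    sigma_perp = m @ sigma @ m       # stress component across the director
    return sigma_par - sigma_perp


def classify_bundle(sigma, n, tol=1e-12):
    """Label a structure by the sign of sigma_parallel - sigma_perp."""
    diff = stress_anisotropy(sigma, n)
    if diff > tol:
        return "contractile"
    if diff < -tol:
        return "extensile"
    return "isotropic"


if __name__ == "__main__":
    # Illustrative tensor: larger tension along x than along y.
    sigma = [[2.0, 0.0], [0.0, 1.0]]
    print(classify_bundle(sigma, [1.0, 0.0]))  # director along x
    print(classify_bundle(sigma, [0.0, 1.0]))  # director along y
```

      With the same stress tensor, the label depends only on how the director is oriented relative to the principal stress axes, which is why the classification has to be made after the pattern has formed rather than from the sign of 𝜅 alone.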

      (E) Additionally, it is unclear if the model's predictions for fiber dynamics align with observations in cells, as stress fibers exhibit a high degree of dynamism and tend to coalesce with neighboring fibers during their assembly phase.

      In the present work, we focus on the self-organization of a periodic patch of actomyosin gel. However, in adherent cells boundary conditions play an essential role, as discussed in our response to comment (A) by this referee. In ongoing work, we are using the present model to study the dynamics of assembly and reconfiguration of dense nematic structures in domains with boundary conditions mimicking those in adherent cells, possibly interacting with the adhesion machinery, and we find dynamical interactions such as those suggested by the referee. As an example, we show a video of a simulation where at the edge of the circular domain, there is an actin influx modeling the lamellipodium, and in four small regions friction is higher, simulating focal adhesions. Under these boundary conditions, the model presented in the paper exhibits the kind of dynamical reorganizations alluded to by the referee.

      Author response video 1.

      We would like to note, however, that the prominent stress fibers in cells adhered to stiff substrates, so abundantly reported in the literature, are not the only instance of dense nematic actin bundles. In the present manuscript, we emphasize the relation of the predicted organizations with those found in different in vivo contexts not related to stress fibers, such as the aligned patterns of bundles in insects (trachea, scales in butterfly wings), in hydra, or in reproductive organs of C. elegans; the highly dynamical network of bundles observed in C. elegans early embryos; or the labyrinth patterns of micro-ridges in the apical surface of epidermal cells in fish.

      (F) Finally, it seems that the microscopic model is unable to recapitulate the density patterns predicted by the continuum theory, raising questions about the suitability of the simulation model.

      We thank the referee for raising this question, which needs further clarification. The goal of the microscopic model is not to reproduce the self-organized patterns predicted by the active gel theory. The microscopic model lacks essential ingredients, notably a realistic description of hydrodynamics and turnover. Our goal with the agent-based simulations is to extract the relation between nematic order and active stresses for a small homogeneous sample of the network. This small domain is meant to represent the homogeneous active gel prior to pattern formation, and it allows us to substantiate key assumptions of the continuum model leading to pattern formation, notably the dependence of isotropic and deviatoric components of the active stress on density and nematic order (Eq. 7) and the active generalized stress promoting ordering.

      We should mention that reproducing the range of out-of-equilibrium mesoscale architectures predicted by our active gel model with agent-based simulations seems at present not possible, or at least significantly beyond the state-of-the-art. To our knowledge, these models have not been able to reproduce the heterogeneous nonequilibrium contractile states involving sustained self-reinforcing flows underlying the pattern formation mechanism studied in our work. The scope of the discrete network simulations has been clarified in lines 340 to 349 in the revised manuscript.

      While agent-based cytoskeletal simulations are very attractive because they directly connect with molecular mechanisms, active gel continuum models are better suited to describe out-of-equilibrium emergent hydrodynamics at a mesoscale. We believe that these two complementary modeling frameworks are rather disconnected in the literature, and for this reason, we have attempted to substantiate some aspects of our continuum modeling with discrete simulations. We have emphasized the complementarity of the two approaches in the conclusions.

      Reviewer #1 (Recommendations For The Authors):

      Questions on the theory:

      Does rho describe the density of actin or myosin? The authors say that they are modeling actomyosin material as a whole, but the actin and myosin should be modeled separately. Along similar lines, does Q define the ordering of actin or myosin?

      Active gel models of the actomyosin cytoskeleton have been formulated with independent densities for actin and for myosin or using a single density field, implicitly assuming a fixed stoichiometry. Super-resolution imaging of the actomyosin cytoskeleton also suggests that in principle it makes sense to consider different nematic fields for actin and for myosin filaments. In the revised manuscript, we now explicitly mention that our density and nematic field are effective descriptions of the entire actomyosin gel (lines 82-84).

      A more detailed model would entail additional material parameters, not available experimentally, which may help reproduce specific experiments but that would make the systematic study of the different behaviors much more difficult. Our approach has been to keep the model minimal meeting the fundamental requirements outlined in the first paragraphs of Section “Theoretical model”.

      Should the active stress depend on material density? It seems strange (from Eq. 3) that active stress could be non-zero even where density is zero, since sigma_act does not depend on rho.

      Yes, active stress is assumed to be proportional to density. Eq. 3 in the original manuscript was misleading (it was multiplied by rho in Eq. 2). In the revised manuscript, we have explained with a bit more detail the theoretical model, clarifying this point.

      The authors should clearly explain their rationale for retaining certain types of nonlinear terms while ignoring others in theory. For instance, the nonlinearities in the equations of motion are sometimes quadratic in the fields, while there are also some cubic terms. Please remark up to what order in the fields the various interactions are modeled.

      We thank the referee for raising this point. The nonlinearities in the theory are easily explained on the basis of a small number of choices. We have added a new paragraph towards the end of Section “Theoretical model” (lines 145 to 152) providing a rationale for the origin and underlying assumptions leading to different nonlinearities.

      To connect with experiments and the biological context, please explain the biological origin of various terms in the model: (1) L-dependent terms in Eq. 2 and 4, (2) Flow-alignment of nematic order and experimental evidence in support of it, (3) density-dependent susceptibility terms in Eq. 4

      (1) Unfortunately, the L-dependent terms are very bulky, but are very standard in nematic theories. The best way to understand their physical significance is through the expression of the nematic free-energy, which is now given and explained in the revised manuscript (Eq. 3). The resulting complicated expressions for the molecular field and the nematic stress (Eqs. 4 and 5) are mathematical consequences of the choice of nematic free energy. In the revised manuscript, we also attempt to provide a basis for these terms in the context of the actin cytoskeleton. (2) To our knowledge, the best reference supporting this term from experiments is Reymann et al, eLife (2016). In the revised manuscript, we have provided a physical interpretation. (3) We have expanded the motivation and plausible microscopic justification of this term.

      There are different 'activity' terms in the model. Their biophysical origin is not made clear. For example, the authors should make clear if these activities arise from filament or motor activity. Relatedly, the authors should provide a comprehensive discussion of the signs of the different active parameters and their physical interpretations.

      In an active gel model, activity parameters are phenomenological and how they map to molecular mechanisms is not precisely known, although conventionally contractile active tension is ascribed to the mechanical transduction of chemical power by myosin motors. The fact is that, besides myosin activity, there are many nonequilibrium processes in the actomyosin cytoskeleton that may lead to active stresses including (de)polymerization of filaments or (un)binding of crosslinkers. In the revised manuscript, we have added sentences illustrating how different terms may result from microscopic mechanisms, but providing a precise mapping between our model and nonequilibrium dynamics of proteins is beyond the scope of our work, although our discrete network simulations address this issue to a certain degree.

      Following the suggestion of the referee, our description of the theory now discusses much more extensively the signs of activity parameters and their physical interpretations, e.g. the text following Eq. 7.

      Throughout the paper, various activity terms are varied independently of each other. Is that a reasonable assumption given that activities should depend on ATP and are thus not independent of one another?

      We agree that, ultimately, all active processes depend on the conversion of chemical energy into mechanical energy. However, recent work has highlighted how active tension also depends on the microscopic architecture of the network controlled by multiple regulators of the actomyosin cytoskeleton (e.g. Chugh et al, Nat Cell Biol, 2017). It is reasonable to expect that, for a given rate of ATP consumption, chemical power will be converted into mechanical power in different ways depending on the micro-architecture of the cytoskeleton, e.g. the stoichiometry of filaments, crosslinkers, myosins, or the length distribution of filaments (very long filaments crosslinked by myosins may be difficult to reorient but may contract efficiently).

      We have added a paragraph in Section “Theoretical model” with a discussion, lines 153 to 156.

      Sarcomeres are muscle fibers that exhibit alternating polarity pattern. Such patterning is not evident in what the authors call 'sarcomeres' in Fig. 2. I believe the authors should revise their terminology and not loosely interpret existing classifications in the field.

      We thank the referee for raising this point. We have changed the terminology.

      Fig 2a: Is the cartoon for filament alignment incorrect for kappa>0?

      The cartoon is correct. In the revised manuscript we have explained more clearly the physical meaning of kappa in the text following Eq. 7. In the caption of Fig. 1 and of Fig. 2a, we have also clarified that when the absolute value of kappa is <1, then active tension is positive in all directions.

      Within the section "Requirements for fibrillar and banded patterns", it will be useful to show the figures for varying the different active parameters in the main figures.

      We have followed the referee’s suggestion and moved Supp. Fig. 1 of the original manuscript to the main figures.

      How do the authors decide if bundles are contractile or extensile? Why are contractile bundles under tension while extensile bundles are under compression? I would expect the opposite.

      We agree that this point deserves a more detailed explanation. In the revised manuscript and in the new Figure 4, we further develop this point. The fibrillar pattern forms when kappa<0. We further assume that -1<kappa<0, so that active tension is positive in all directions. In this regime, the deviatoric (anisotropic) part of active tension is extensile. However, following pattern formation and because of the interplay between active and viscous stresses, the total stress in the emerging bundles may become extensile or contractile, depending on whether the largest component of stress is perpendicular or along the bundle axis. This is now presented in the updated figure, with new panels presenting maps of the total tension. The text discussing this point has been rewritten and we hope that the new version is much clearer (lines 280 to 303).

      A contractile bundle tends to shorten, but it cannot do so because of boundary conditions or the interaction with other bundles. As a result, it is under tension. Conversely, an extensile bundle tries to elongate, but being constrained, it becomes compressed. As an analogy, consider the cortex of a suspended cell. The cortex is contractile, but it cannot contract because of volume regulation in the cell, which is typically pressurized. As a result, tension in the cortex is positive, as shown by Laplace’s law [10.1016/j.tcb.2020.03.005]. We have tried to clarify this point in the revised manuscript.
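
      The Laplace's law analogy can be written out explicitly. This is a minimal sketch assuming an idealized spherical cell of radius R with uniform cortical tension T and an excess internal pressure ΔP; the symbols and the numerical values are illustrative assumptions and are not taken from the manuscript.

```latex
\Delta P = \frac{2\,T}{R}
\quad\Longrightarrow\quad
T = \frac{\Delta P\,R}{2} > 0 \quad \text{whenever } \Delta P > 0,
\qquad \text{e.g. } \Delta P = 100\ \mathrm{Pa},\ R = 10\ \mu\mathrm{m}
\ \Rightarrow\ T = 5\times 10^{-4}\ \mathrm{N/m}.
```

      A pressurized cell (ΔP > 0) therefore carries a positive cortical tension: the contractile cortex is held in tension by the constraint on volume, which is the same logic applied to the constrained bundles above.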

      Can the authors reproduce alternating density patterns using the cytosim simulations? This is an important step in establishing the correspondence between the continuum theory and the agent-based model.

      We have addressed this point in our response to public comment (F) of this referee.

      The authors do not provide code or data.

      The finite element code, with an input file required to run a representative simulation in the paper, is now made available, see Ref. [74].

      The customizations of Cytosim needed to account for nematic order in our discrete network simulations are available, see Ref. [98].

      Reviewer #2 (Public Review):

      Summary:

      The article by Waleed et al discusses the self organization of actin cytoskeleton using the theory of active nematics. Linear stability analysis of the governing equations and computer simulations show that the system is unstable to density fluctuations and self organized structures can emerge. While the context is interesting, I am not sure whether the physics is new. Hence I have reservations about recommending this article.

      We thank the referee for these comments. In the revised manuscript, we have highlighted the novelty, particularly in the last paragraph of the introduction, the first two paragraphs of Section “Theoretical model”, and in the conclusions. Despite a very large literature on theoretical models of stress fibers, actin rings, and active nematics, we argue that the active self-organization of dense nematic structures from an isotropic and low-density gel has not been compellingly explained so far. Many models assume from the outset the presence of actin bundles, or explain their formation using localized activity gradients. The literature of active nematics has extensively studied symmetry breaking and self-organization. However, most of the works assume initial orientational order. Only a few works study the emergence of nematic order from a uniform isotropic state, but consider dry systems lacking hydrodynamic interactions or incompressible and density-independent systems [37,38]. Yet, pattern formation in actomyosin gels is characterized by large density variations, and by highly compressible flows, which coordinate in a mechanism relying on an advective instability and self-reinforcing flows.

      Our theoretical model is not particularly novel, and as we mention in the manuscript, it can be particularized to different models used in the literature. However, we argue that it has the right minimal features to capture nematic self-organization in actomyosin gels. To our knowledge, no previous study explains the emergence of dense and nematic structures from a low-density isotropic gel as a result of activity and involving the advective instability typical of symmetry-breaking and patterning in the actomyosin cytoskeleton. These are important qualitative features of our results that resonate with a large experimental record, and as such, we believe that our work provides a new and compelling mechanism relying on self-organization to explain the prominence and diversity of patterns involving dense nematic bundles in the actomyosin cytoskeleton across species.

      Strengths:

      (i) Analytical calculations complemented with simulations (ii) Theory for cytoskeletal network

      Weaknesses:

      Not placed in the context or literature on active nematics.

      We agree with the referee that this was a weakness of the original manuscript. In the revised manuscript, within reasonable space constraints given the size and dynamism of the field of active nematics, we have placed our work in the context of this field (end of introduction and first two paragraphs of Section “Theoretical model”). The published version of our companion manuscript [45] also contributes to providing a clear context to our theoretical model within the field.

      Reviewer #2 (Recommendations For The Authors):

      The article by Waleed et al discusses the self organization of actin cytoskeleton using the theory of active nematics. Linear stability analysis of the governing equations and computer simulations show that the system is unstable to density fluctuations and self organized structures can emerge. While the context is interesting, I am not sure whether the physics is new. Hence I have reservations about recommending this article. I explain my questions and comments below.

      We have responded to this comment above.

      (i) Active nematics including density variations have been dealt with quite extensively in the literature. For example, the works of Sriram Ramaswamy have dealt with this system including linear stability analysis, simulations etc. In what way is the present work different from the system that they have considered?

      (ii) Active flows leading to self organization have been a topic of discussion in many works. For example: (a) Annual Review of Fluid Mechanics, Vol. 43:637-659, 2010, https://doi.org/10.1146/annurev-fluid-121108-145434 (b) S Santhosh, MR Nejad, A Doostmohammadi, JM Yeomans, SP Thampi, Journal of Statistical Physics 180, 699-709 (c) M. G. Giordano, F. Bonelli, L. N. Carenza, G. Gonnella and G. Negro, Europhysics Letters, Volume 133, Number 5. In what way is this work different from any of these?

      (iii) I am confused about the models used in the paper. There is significant literature from Prof. Mike Cates group, Prof. Julia Yeomans group, Prof. Marchetti's group who all use similar governing equations. In the present paper, I find it hard to understand whether the model used is similar to the existing ones in literature or are there significant differences. It should be clarified.

      Response to (i), (ii) and (iii).

      We completely agree with this referee (and also the previous referee) that the contextualization of our work in the field of active nematics was very insufficient. In the revised manuscript, the last paragraph of the introduction and the first two paragraphs of Section “Theoretical model” now address this point. In short, previous active nematic models predicting patterns with density variations have been either for dry active matter (disregarding hydrodynamic interactions), or for suspensions of active particles moving in an incompressible flow. None of these previous works predict nematic pattern formation as a result of activity relying on the advective instability and self-reinforcing compressible flows, leading to high density and high order bundles surrounded by an isotropic low density phase. Yet, these are fundamental features observed in actomyosin gels. Many works deal with symmetry-breaking of a system with pre-existing order, but very few address how order emerges actively from an isotropic state. We thank the referee for pointing to the paper by Santhosh et al, which nicely makes this argument and is now cited. Our mechanism is fundamentally different from that in Santhosh et al, whose model is incompressible and ignores density variations.

      We hope that the revised manuscript addresses this important concern.

      (iv) Below Eqn 6, it starts by saying that the “...origin..is clear...” It's not. I don't understand the physical origin of the instability, and this should be clarified, maybe with some illustrations.

      We apologize for this unfortunate sentence, which we have rewritten in the revised manuscript (lines 181 to 185).

      Reviewer #3 (Public Review):

      The manuscript "Theory of active self-organization of dense nematic structures in the actin cytoskeleton" analyzes self-organized pattern formation within a two-dimensional nematic liquid crystal theory and uses microscopic simulations to test the plausibility of some of the conclusions drawn from that analysis. After performing an analytic linear stability analysis that indicates the possibility of patterning instabilities, the authors perform fully non-linear numerical simulations and identify the emergence of stripe-like patterning when anisotropic active stresses are present. Following a range of qualitative numerical observations on how parameter changes affect these patterns, the authors identify, besides isotropic and nematic stress, also active self-alignment as an important ingredient to form the observed patterns. Finally, microscopic simulations are used to test the plausibility of some of the conclusions drawn from continuum simulations.

      The paper is well written, figures are mostly clear and the theoretical analysis presented in both main text and supplement is rigorous. Mechano-chemical coupling has emerged in recent years as a crucial element of cell cortex and tissue organization and it is plausible to think that both isotropic and anisotropic active stresses are present within such effectively compressible structures. Even though not yet stated this way by the authors, I would argue that combining these two is one of the key ingredients that distinguishes this theoretical paper from similar ones. The diversity of patterning processes experimentally observed is nicely elaborated on in the introduction of the paper, though other closely related previous work could also have been included in these references (see below for examples).

      We thank the referee for these comments and for the suggestion to emphasize the interplay of isotropic and anisotropic active tension, which is possible only in a compressible gel. We have emphasized this point in several places in the revised manuscript. We also thank the referee for the suggestions to better connect with the existing literature.

      To introduce the continuum model, the authors exclusively cite their own, unpublished pre-print, even though the final equations take the same form as previously derived and used by other groups working in the field of active hydrodynamics (a certainly incomplete list: Marenduzzo et al (PRL, 2007), Salbreux et al (PRL, 2009, cited elsewhere in the paper), Jülicher et al (Rep Prog Phys, 2018), Giomi (PRX, 2015),...). To make better contact with the broad active liquid crystal community and to delineate the present work more compellingly from existing results, it would be helpful to include a more comprehensive discussion of the background of the existing theoretical understanding on active nematics. In fact, I found it often agrees nicely with the observations made in the present work, an opportunity to consolidate the results that is sometimes currently missed out on. For example, it is known that self-organised active isotropic fluids form in 2D hexagonal and pulsatory patterns (Kumar et al, PRL, 2014), as well as contractile patches (Mietke et al, PRL 2019), just as shown and discussed in Fig. 2. It is also known that extensile nematics, \kappa<0 here, draw in material laterally of the nematic axis and expel it along the nematic axis (the other way around for \kappa>0, see e.g. Doostmohammadi et al, Nat Comm, 2018 "Active Nematics" for a review that makes this point), consistent with all relative nematic director/flow orientations shown in Figs. 2 and 3 of the present work.

      We thank the referee for these suggestions. Indeed, in the original submission we had outsourced much of the justification of the model and the relevant literature to a related pre-print, but this is not reasonable. The companion publication has now been accepted in the New Journal of Physics, with significant changes to better connect the work to the field of active nematics. A preprint reflecting those changes is available in Ref. [64], but we hope to reference the published paper that will come out soon.

      In the revised manuscript, we have significantly rewritten the Section “Theoretical model” to frame the continuum model in the context of the field of active nematics. While our model and results have commonalities with previous work, there are also important differences. We have highlighted the novelty of the present work along with the relation with previous studies and theoretical models in the last paragraph of the introduction and the first two paragraphs of Section “Theoretical model”. Furthermore, as suggested by the referee, we have made an effort to connect our results with previous work by Kumar, Mietke, Doostmohammadi and others.

      Regarding the last point alluded to by the referee (“extensile nematics, \kappa<0 here, draw in material laterally of the nematic axis and expel it along the nematic axis”), the picture raised by the referee needs to be nuanced for our compressible system as compared to the incompressible systems discussed in that reference. As we have elaborated in our response to point (D) of Referee #1, our systems are overall contractile (with positive active tension in all directions), but the deviatoric component of the active tension can be either extensile or contractile. In our “extensile” models (left in Fig. 2c), material is drawn in laterally to the nematic axis but it is not expelled along this axis. Instead, it is “expelled” by turnover. In the revised manuscript, we have added a comment about this.

      The results of numerical simulations are well-presented. Large parts of the discussion of numerical observations - specifically around Fig. 3 - are qualitative and it is not clear why the analysis is restricted to \kappa<0. Some of the observations resonate with recent discussions in the field, for example the observation of effectively extensile dynamics in a contractile system is interesting and reminiscent of ambiguities about extensile/contractile properties discussed in recent preprints (https://arxiv.org/abs/2309.04224). It is convincingly concluded that, besides nematic stress on top of isotropic one, active self-alignment is a key ingredient to produce the observed patterns.

      We thank the referee for these comments. We are reluctant to extend the detailed analysis of emergent architectures and dynamics to the case \kappa > 0 as it leads to architectures not observed, to our knowledge, in actin networks. In the revised manuscript, we have expanded and clarified the characterization of emergent contractile/extensile networks by reporting the relative magnitude of stress along and perpendicular to the nematic direction. Our revised manuscript clearly shows that even though all of our simulations describe locally contractile systems with extensile anisotropic active tension, the emergent meso-structures can be either extensile or contractile, with the extensile ones exhibiting the usual bend-type instability (a secondary instability in our system) described classically for extensile active nematic systems. We have rewritten the text discussing this (lines 280 to 303), where we have placed these results in the context of recent work reporting the nontrivial relation between the contractility/extensibility of the local units vs the nematic pattern.

      I compliment the authors for trying to gain further mechanistic insights into this conclusion with microscopic filament simulations that are diligently performed. It is rightfully stated that these simulations only provide plausibility tests and, within this scope, I would say the authors are successful. At the same time, it leaves open questions that could have been discussed more carefully. For example, I wonder what can be said microscopically about the regime \kappa>0 (which is dropped ad hoc from Fig. 3 onward), in which the continuum theory does also predict the formation of stripe patterns, besides the short comment at the very end? How does the spatially inhomogeneous organization the continuum theory predicts fit in the presented microscopic picture, and vice versa?

      We thank the referee for this compliment. We think that the point raised by the referee is very interesting. It is reasonable to expect that the sign of \kappa may not be a constant but rather depend on S and \rho. Indeed, for a sparse network with low order, the progressive bundling by crosslinkers acting on nearby filaments is likely to produce a large active stress perpendicular to the nematic direction, whereas in a dense and highly ordered region, myosin motors are more likely to effectively contract along the nematic direction whereas there is little room for additional lateral contraction by additional bundling. As discussed in our response to referee #1, we believe that studying the formation of patterns using the discrete network simulations is far beyond the scope of our work. We discuss in lines 332 to 341, as well as in the last paragraph of the conclusions, the scope and limitations of our discrete network simulations.

      Overall, the paper represents a valuable contribution to the field of active matter and, if strengthened further, might provide a fruitful basis to develop new hypotheses about the dynamic self-organisation of dense filamentous bundles in biological systems.

      Reviewer #3 (Recommendations For The Authors):

      • The statement "the porous actin cytoskeleton is not a nematic liquid-crystal because it can adopt extended isotropic/low-order phases" is difficult to understand and should be clarified, as the next paragraph starts formulating a nematic active liquid crystal theory. Do the authors mean a crystal that "Tends to be in a disordered phase?", according to its equilibrium properties? It would still be a "nematic liquid crystal", only its ground state is not a nematic phase.

      We agree with the referee, and we hope that changes in the introduction and in Section “Theoretical model” address this comment.

      • I could not find what Frank energy is precisely used, that would be helpful information.

      In the revised manuscript, we have provided the expression for the nematic free energy in Eq. 3.

      • The significance of the green/purple arrows in the Fig. 2a sketch is unclear. Green arrows also appear in b and c: do they represent the same quantity? From the simulation images it is overall very difficult to see how the flows are oriented near the high-density regions (i.e. if they are towards / away from the stripe).

      We thank the referee for bringing this up. The color coding of the sketches was confusing. The modified figures (Fig. 1(c) and Fig. 2(a)) now present a clearer and unified representation of anisotropic tension. The green arrows in Fig. 2(c) represent the out-of-equilibrium flows in the steady state. We agree that the zoom is insufficient to resolve the flow structure. For this reason, in the revised Fig. 2, we have added panels showing the flow with higher resolution.

      • It is currently unclear how the linear stability results - beyond identification of the parameter \delta - inform any of the remaining manuscript. Quantitative comparisons of the various length scales seen in simulated patterns (e.g. Fig. 2b, 3c etc) with linear predictions and known characteristic length scales would be instructive mechanistically, would make the overall presentation more compelling and probes limitations of linear results.

      In the revised manuscript, we have provided further information so that the readers can appreciate the predictions and limitations of the linear stability results. We have added a sentence and a figure to show that, in addition to the critical activity, the linear theory provides a good prediction of the wavelength of the pattern. See lines 199 to 201.

      • It is not clear what is meant by "[bundle-formation] requires that active tension perpendicular to nematic orientation is larger than along this direction", and therefore also not why that would be "counter-intuitive". If interpreted naively, I would say that a large tension brings in more filaments into the bundle, so that may well be an obviously helpful feature for bundle formation and maintenance. In any case, it would be helpful if clarity is improved throughout when arguments about "directions of tensions" are made.

      We have significantly rewritten the first paragraphs of section “Microscopic origin…” to clarify this point (lines 330 to 339). This rewrite, along with other changes in the manuscript such as the explanation of Eq. 7 and the discussion of the stress anisotropy in the new version of Fig. 4 (see lines 280 to 303), provides a better explanation of this important point.

      • All density color bars: Shouldn't they rather be labelled \rho/\rho_0?

      Yes! We have corrected this typo.

      • Scalar product missing in caption definition of order parameter Fig. 2

      We have corrected this typo.

      • Fig. 3a: I suggest to put the expression for q0 in the caption

      We have replaced q_0 with S_0 and clarified its meaning in the caption of what is now Fig. 4.

      • Paragraph on bottom right of page 6 should several times probably refer to Fig. 3c(...), instead of Fig. 3b

      We have corrected this typo.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The authors report the structure of the human CTF18-RFC complex bound to PCNA. Similar structures (and more) have been reported by the O'Donnell and Li labs. This study should add to our understanding of CTF18-RFC in DNA replication and clamp loaders in general. However, there are numerous major issues that I recommend the authors fix. 

      Strengths: 

      The structures reported are strong and useful for comparison with other clamp loader structures that have been reported lately. 

      Weaknesses: 

      The structures don't show how CTF18-RFC opens or loads PCNA. There are recent structures from other groups that do examine these steps in more detail, although this does not really dampen this reviewer's enthusiasm. It does mean that the authors should spend their time investigating aspects of CTF18-RFC function that were overlooked or not explored in detail in the competing papers. The paper poorly describes the interactions of CTF18-RFC with PCNA and the ATPase active sites, which are the main interest points. The nomenclature choices made by the authors make the manuscript very difficult to read. 

      Reviewer #2 (Public review): 

      Summary 

      Briola and co-authors have performed a structural analysis of the human CTF18 clamp loader bound to PCNA. The authors purified the complexes and formed a complex in solution. They used cryo-EM to determine the structure to high resolution. The complex assumed an auto-inhibited conformation, where DNA binding is blocked, which is of regulatory importance and suggests that additional factors could be required to support PCNA loading on DNA. The authors carefully analysed the structure and compared it to RFC and related structures. 

      Strength & Weakness 

      Their overall analysis is of high quality, and they identified, among other things, a human-specific beta-hairpin in Ctf18 that flexibly tethers Ctf18 to Rfc2-5. Indeed, deletion of the beta-hairpin resulted in reduced complex stability and a reduction in a primer extension assay with Pol ε. This is potentially very interesting, although some more work is needed on the quantification. Moreover, the authors argue that the Ctf18 ATP-binding domain assumes a more flexible organisation, but their visual representation could be improved. 

      The data are discussed accurately and relevantly, which provides an important framework for rationalising the results. 

      All in all, this is a high-quality manuscript that identifies a key intermediate in CTF18-dependent clamp loading. 

      Reviewer #3 (Public review): 

      Summary: 

      CTF18-RFC is an alternative eukaryotic PCNA sliding clamp loader that is thought to specialize in loading PCNA on the leading strand. Eukaryotic clamp loaders (RFC complexes) have an interchangeable large subunit that is responsible for their specialized functions. The authors show that the CTF18 large subunit has several features responsible for its weaker PCNA loading activity and that the resulting weakened stability of the complex is compensated by a novel beta hairpin backside hook. The authors show this hook is required for the optimal stability and activity of the complex. 

      Relevance: 

      The structural findings are important for understanding RFC enzymology and novel ways that the widespread class of AAA ATPases can be adapted to specialized functions. A better understanding of CTF18-RFC function will also provide clarity into aspects of DNA replication, cohesion establishment, and the DNA damage response. 

      Strengths: 

      The cryo-EM structures are of high quality enabling accurate modelling of the complex and providing a strong basis for analyzing differences and similarities with other RFC complexes. 

      Weaknesses: 

      The manuscript would have benefitted from more detailed biochemical analysis to tease apart the differences with the canonical RFC complex. 

      I'm not aware of using Mg depletion to trap active states of AAA ATPases. Perhaps the authors could provide a reference to successful examples of this and explain why they chose not to use the more standard practice in the field of using ATP analogues to increase the lifespan of reaction intermediates. 

      Overall appraisal: 

      Overall the work presented here is solid and important. The data is sufficient to support the stated conclusions and so I do not suggest any additional experiments. 

      Reviewer #1 (Recommendations for the authors): 

      We thank the reviewer for their positive comments and for their thorough review. All raised points have been addressed below.

      Major points 

      (1) The nomenclature used in the paper is very confusing and sometimes incorrect. The authors refer to CTF18 protein as "Ctf18", and the entire CTF18-RFC complex as "CTF18". This results in massive confusion because it is hard to ascertain whether the authors are discussing the individual subunits or the entire complex. Because these are human proteins, each protein name should be fully capitalized (i.e. CTF18, RFC4 etc). The full complex should be referred to more clearly with the designation CTF18-RFC or CTF18-RLC (RFC-like complex). Also, because the yeast and human clamp loader complexes use the same nomenclature for different subunits, it would be best for the authors to use the "A, B, C, D, E subunit" nomenclature that has been standard in the field for the past 20 years. Finally, the authors try to distinguish PCNA subunits by labeling them "PCNA2" or "PCNA1" (see Page 8 lines 180,181 for an example). This is confusing because the names of the RFC subunits have similar formats (RFC2, RFC3, RFC4, etc). In the case of RFC this denotes unique genes, whereas PCNA is a homotrimer. Could the authors think of another way to denote the different subunits, such as super/subscript? PCNA-I, PCNA-II, PCNA-III? 

      We thank the reviewer for pointing out the confusing nomenclature. Following the referee's suggestion, we now refer to the full CTF18 complex as “CTF18-RFC”. We prefer to keep the nomenclature RFC2, RFC3, etc. for the CTF18-RFC subunits, as recently used in Yuan et al, Science, 2024. However, we followed the referee’s suggestion for PCNA subunits, which are now referred to as PCNA-I, PCNA-II and PCNA-III.

      (2) I believe that the authors are over-interpreting their data in Figure 1. The claim that "less sharp definition" of the map corresponding to the AAA+ domain of Ctf18 supports a relatively high mobility of this subunit is largely unsubstantiated. There are several reasons why one could get varying resolution in a cryo-EM reconstruction, such as compositional heterogeneity, preferred orientation artifacts, or how the complex interacts with the air-water interface. If other data were presented that showed this subunit is flexible, this evidence would support that data, but it cannot stand alone as justification for subunit mobility. Along these lines, how was the buried surface area (2300 vs 1400 Ų) calculated? Is this the total surface area or only the buried surface area involving the AAA+ domains? It is surprising that these numbers are so different considering that the subunits and complexes look so similar (Figures 1c and 2b). 

      We respectfully disagree with the suggestion that our interpretation of local flexibility in the AAA+ domain of Ctf18 is overreaching. Several lines of evidence support this interpretation. First, compositional heterogeneity is unlikely, as the A′ domain of Ctf18 is well-resolved and forms stable interactions with RFC3, indicating that Ctf18 is consistently incorporated into the complex. Second, preferred orientation artifacts are excluded, as the particle distribution shows excellent angular coverage (Fig. S9a). Third, we now include a 3D variability analysis (3DVA; Supplementary Video 1), which reveals local conformational heterogeneity centered around the AAA+ domain of Ctf18, consistent with intrinsic flexibility.

      Regarding the buried surface area values, the reported numbers refer specifically to the interfaces between the AAA+ domain of Ctf18 and RFC2, and are derived from buried surface area calculations performed with PISA. The smaller interface (~1400 Ų) compared to RFC1–RFC2 (~2300 Ų) reflects low sequence identity (~26%) and divergent structural features, including the absence of conserved elements such as the canonical PIP-box in Ctf18. We have clarified and expanded this explanation in the revised manuscript (Page 7).

      (3) The authors very briefly discuss interactions with PCNA and how the CTF18-RFC complex differs from the RFC complex. This is amongst the most interesting results from their work, but also not well-developed. Moreover, Figure 3D describing these interactions is extremely unclear. I feel like this observation had potential to be interesting, but is largely ignored by the authors. 

      We thank the referee for pointing this out. We have expanded the section describing the interactions of CTF18-RFC and PCNA (Page 9 in the new manuscript), and made a new panel figure with further details (Fig. 3D).  

      (4) The authors make the observation that key ATP-binding residues in RFC4 are displaced and incompatible with nucleotide binding in their CTF18-RFC structure compared to the hRFC structure. This should be a main-text figure showing these displacements and how it is incompatible with ATP binding. Again, this is likely an interesting finding that is largely glossed over by the authors. 

      We now discuss this feature in detail (Page 11 in the new manuscript), and have added two figure insets (Fig. 4c) describing the incompatibility of RFC4 with nucleotide binding.

      (5) The authors claim that the work of another group (citation 50) "validate(s) our predictions regarding the significant similarities between CTF18-RFC and canonical RFC in loading PCNA onto a ss/dsDNA junction." However, as far as this reviewer can tell the work in citation 50 was posted online before the first draft of this manuscript appeared on biorxiv, so it is dubious to claim that these were "predictions." 

      We agree with the referee about this claim. We have now revised the text as follows:

      “While our work was being finalized, several cryo-EM structures of human CTF18-RFC bound to PCNA and primer/template DNA were reported by another group (He et al, PNAS, 2024). These findings are consistent with the distinct features of CTF18-RFC observed in our structures and independently support the notion of significant mechanistic similarity between CTF18-RFC and canonical RFC in loading PCNA onto a ss/dsDNA junction”.

      (6) The authors use a primer extension assay to test the effects of truncating the N-terminal beta hairpin of CTF18. However, this assay is only a proxy for loading efficiency and the observed effects of the mutation are rather subtle. The authors could test their hypothesis more clearly if they performed an ATPase assay or even better a clamp loading assay. 

      We thank the referee for this valuable suggestion. In response, we have performed clamp loading assays comparing the activities of human RFC, wild-type CTF18-RFC, and the β-hairpin–truncated CTF18-RFC mutant. The results, now presented in Fig. 6 and Table 1 of the revised manuscript, clearly show that truncation of the N-terminal β-hairpin results in a slower rate of PCNA loading. We propose that this reduced loading rate likely contributes to the diminished Pol ε–mediated DNA synthesis observed in the primer extension assays.

      Minor points 

      (1) Page 3 line 53 the introduction suggests that ATP hydrolysis prompts clamp closure. While this may be the case, to my knowledge all recent structural work shows that closure can occur without ATP hydrolysis. It may be better to rephrase it to highlight that under normal loading conditions, ATP hydrolysis occurs before clamp closure. 

      The text now reads (Page 3): 

      “DNA binding prompts the closure of the clamp and hydrolysis of ATP induces the concurrent disassembly of the closed clamp loader from the sliding clamp-DNA complex, completing the cycle necessary for the engagement of the replicative polymerases to start DNA synthesis.”

      (2) Page 3 line 60, I do not see how the employment of alternative loaders highlights the specificity of the loading mechanism - would it not be possible for multiple loaders to have promiscuous clamp loading? 

      We thank the referee for this comment. The text now reads (Page 3):

      “However, eukaryotes also employ alternative loaders (20), including CTF18-RFC (6, 21-24), which likely use a conserved loading mechanism but are functionally specialized through specific protein interactions and context-dependent roles in DNA replication.”

      (3) Page 4 line 75 could you please cite a study that shows Ctf8 and Dcc1 bind to the Ctf18 C-terminus and that a long linker is predicted to be flexible? 

      Two references have been added (Stokes et al, NAR, 2020 and Grabarczyk et al, Structure, 2018)

      (4) Figure 2A has the N-terminal region of Ctf18 as bound to RFC3 but should likely be labeled as bound to RFC5. This caused significant confusion while trying to parse this figure. Further, the inclusion of "X" as a sequence - does this refer to a sequence that was not buildable in the cryo-EM map? I would be surprised that density immediately after the conserved DEXX box motif is unbuildable. If this is the case, it should be clearly stated in the figure legend that "X" denotes an unbuildable sequence. For the conserved beta-hairpin in the sequence, could the authors superimpose the AlphaFold prediction onto their structure? It would be more informative than just looking at the sequence. 

      We apologize for this confusion. The error in Figure 2A has been corrected. The figure caption now explicitly says that “X” refers to amino acid residues in the sequence which were not modelled. A superposition of the cryo-EM model of the N-terminal β-hairpin in human Ctf18 and AlphaFold predictions for this feature in Drosophila and yeast Ctf18 is now presented in Figure 2A.

      (5) Page 8 line 168, the use of the term "RFC5" here feels improper, since the "C" subunit is not RFC5 in all lower eukaryotes (see comment above about nomenclature). For instance, in S cerevisiae, the C subunit is RFC3. I would expect this interaction to be maintained in all C subunits, not all RFC5 subunits. 

      The text now reads (Page 8):

      “Therefore, lower eukaryotes may use a similar β-hairpin motif to bind the corresponding subunit of the RFC-module complex (RFC5 in human, Rfc3 in S. cerevisiae), emphasizing its importance.”  

      (6) Page 10 line 228, the authors claim that hydrolysis is dispensable at the Ctf18/RFC2 interface based on evidence from RFC1/RFC2 interface, by analogy that this is the "A/B" interface in both loaders. However, the wording makes it sound as if the cited data were collected while studying Ctf18 loaders. The authors should clarify this point. 

      The text has been modified as follows (Page 11): 

      “Prior research has indicated that hydrolysis at the large subunit/RFC2 interface is not essential for clamp loading by various loaders (48-51), while the others are critical for the clamp-loading activity of eukaryotic RFCs.”

      (7) Page 11 line 243/244 the authors introduce the separation pin. Could they clarify whether Ctf18 contains any aromatic residues in this structural motif that would suggest it serves the same functional purpose? Also, the authors highlight this is similar to yeast RFC, which makes it sound like this is not conserved in human RFC, but the structural motif is also conserved in human RFC. 

      We thank the reviewer for this helpful comment. We have clarified in the revised text (Page 12) that the separation pin is conserved not only in yeast RFC but also in human RFC, and now note that human Ctf18 also harbors aromatic residues at the corresponding positions. This observation is supported by the new panel in Figure 4e.

      Minutia 

      (1) Page 2 line 37 please remove the word "and" before PCNA. 

      This has been corrected.

      (2) Please define AAA+ and update the language to clarify that not all pentameric AAA+ ATPases are clamp loaders. 

      AAA+ has now been defined (Page 3).

      (3) Page 4 line 86 Given the relatively weak interaction of Pol ε. 

      This has been corrected.

      (4) Page 8 line 204 the authors likely mean "leucine" and not "lysine". 

      We thank the reviewer for catching this. The error has been corrected.

      (5) Page 14 line 300, the authors claim that CTF18 utilizes three subunits but then list four. 

      We have corrected this.

      Reviewer #2 (Recommendations for the authors): 

      We thank the reviewer for their positive comments and valuable suggestions. The points raised by the referee have been addressed below.

      Major point: 

      (1) Please quantify Figure 6 and S9 from 3 independent repeats and determine the standard deviation to show the variability of the Ctf18 beta hairpin deletion.  The authors suggest that a suboptimal Ctf18 complex interaction with PCNA impacts the stability of the complex, but do not test this hypothesis. Could the suboptimal PIP motif in Ctf18 be changed to an improved motif and the impact tested in the primer extension assay? Although not essential, it would be a nice way to explore the mechanism. 

      We thank the reviewer for the suggestion. However, we note that Figure 6b (now 7b) already presents the quantification of the primer extension assay from three independent replicates, with error bars showing standard deviations, and includes the calculated rate of product accumulation. These data clearly indicate a 42% reduction in primer synthesis rate upon deletion of the Ctf18 β-hairpin.

      We agree that we do not provide direct evidence of impaired complex stability upon deletion of the Ctf18 β-hairpin. However, the 2D classification of the cryo-EM dataset (Figure S9) shows a marked reduction in the number of particles corresponding to intact CTF18-RFC–PCNA complexes in the β-hairpin deletion sample, with the majority of particles corresponding to free PCNA. This contrasts with the wild-type dataset, where complex particles are predominant. These findings indirectly suggest that deletion of the β-hairpin compromises the stability or assembly of the clamp-loader–clamp complex.

      We thank the reviewer for the valuable suggestion to mutate the weak PIP-box of Ctf18. While an interesting direction, we instead sought to directly test the mechanism by performing quantitative clamp loading assays. These assays revealed a significant reduction in the rate of PCNA loading by the CTF18<sup>Δ165–194</sup>-RFC mutant (Figure 6), supporting the conclusion that the β-hairpin contributes to productive PCNA loading. This loading delay likely underlies the reduced rate of primer extension observed in the Pol ε assay (Figure 7), consistent with impaired formation of processive polymerase–clamp complexes.
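For readers who want to see how a percentage reduction in rate of this kind can be obtained from replicate time courses, the following is a minimal sketch. It assumes the rate is taken as the least-squares slope of product versus time, which the response does not state explicitly. The function names and all numbers are hypothetical placeholders, not data from the manuscript.

```python
from statistics import mean, stdev

def slope(times, values):
    """Least-squares slope of values against times (assumed rate estimate)."""
    t_bar, v_bar = mean(times), mean(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den

def percent_reduction(rate_wt, rate_mut):
    """Reduction of the mutant rate relative to wild type, in percent."""
    return 100.0 * (rate_wt - rate_mut) / rate_wt

# Hypothetical time points (min) and product signal for three replicates each.
times = [0, 5, 10, 15]
wt_replicates = [[0, 10, 21, 30], [0, 11, 19, 31], [0, 9, 20, 29]]
mut_replicates = [[0, 6, 12, 17], [0, 5, 11, 18], [0, 6, 12, 18]]

wt_rates = [slope(times, rep) for rep in wt_replicates]
mut_rates = [slope(times, rep) for rep in mut_replicates]

# Mean and standard deviation across the three independent replicates.
print(f"WT rate:     {mean(wt_rates):.2f} +/- {stdev(wt_rates):.2f}")
print(f"Mutant rate: {mean(mut_rates):.2f} +/- {stdev(mut_rates):.2f}")
print(f"Reduction:   {percent_reduction(mean(wt_rates), mean(mut_rates)):.0f}%")
```

Fitting each replicate separately and then averaging the slopes keeps the standard deviation a measure of replicate-to-replicate variability, which is what the reviewer asked to see.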

      (2) I did not see the method describing how the 2D classes were quantified to evaluate the impact of the Ctf18 beta hairpin deletion on complex formation. Please add the relevant information. 

      The relevant information has been added to the Methods section:

      “For quantification of complex stability, the number of particles contributing to each 2D class was extracted from the classification metadata (Datasets 1 and 3). All classes showing isolated PCNA rings were summed and compared to the total number of particles in classes representing intact CTF18-RFC–PCNA complexes. This analysis was performed for both wild-type and β-hairpin deletion mutant datasets. Notably, no 2D classes corresponding to free PCNA were observed in the wild-type dataset, whereas in the mutant dataset, a substantial fraction of particles corresponded to isolated PCNA, suggesting reduced stability of the mutant complex.”
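      The particle-counting comparison described in this Methods text can be sketched as follows. This is a minimal illustration only: the class labels, the particle counts, and the function name `summarize_2d_classes` are hypothetical and are not taken from the authors' classification metadata.

```python
from collections import defaultdict


def summarize_2d_classes(classes):
    """Sum particles per category and return each category's fraction.

    `classes` is a list of (label, n_particles) pairs, one per 2D class.
    Labels are assigned by visual inspection (an assumption of this sketch).
    """
    totals = defaultdict(int)
    for label, n_particles in classes:
        totals[label] += n_particles
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no particles in any class")
    return {label: n / grand_total for label, n in totals.items()}


# Hypothetical counts for a mutant dataset (NOT the published numbers).
mutant_classes = [
    ("complex", 12000),    # intact CTF18-RFC-PCNA classes
    ("free_pcna", 30000),  # isolated PCNA ring, class 1
    ("free_pcna", 18000),  # isolated PCNA ring, class 2
]
fractions = summarize_2d_classes(mutant_classes)
print(fractions)
```

      Reporting fractions rather than raw counts makes the wild-type and mutant datasets comparable even when the total number of picked particles differs between them.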

      Minor point: 

      (1) Page 2, line 25. Detail what type of mobility is referred to. Do you mean flexibility in the EM-map? 

      We have clarified this. The text now reads:

      “The unique RFC1 (Ctf18) large subunit of CTF18-RFC, which based on the cryo-EM map shows high relative flexibility, is anchored to PCNA through an atypical low-affinity PIP box”

      (2) Page 4, line 82. Please introduce CMGE, or at least state what the abbreviation stands for. 

      This has been addressed.

      (3) Page 4, line 89. Specify that the architecture of the HUMAN CTF18-RFC module is not known, as the yeast one has been published. 

      At the time our study was initiated, the architecture of the human CTF18-RFC module was unknown. A structure of the human complex was published by another group during the final stages of our work and is now properly acknowledged in the Discussion.

      (4) Page 6. Is it possible to illustrate why the autoinhibited state cannot bind to DNA? A visual representation would be nice. 

      We thank the reviewer for this suggestion. Figure 4b in the original manuscript already illustrates why the autoinhibited, overtwisted conformation of the CTF18-RFC pentamer cannot accommodate DNA. In this state, the inner chamber of the loader is sterically occluded, precluding the binding of duplex DNA.

      Reviewer #3 (Recommendations for the authors): 

      We thank Reviewer #3 for their constructive feedback and positive overall assessment of our work.

      We also thank the reviewer for their remarks on the use of Mg depletion to halt hydrolysis. Magnesium is an essential cofactor for ATP hydrolysis, and its depletion is expected to effectively prevent catalysis by destabilizing the transition state, possibly more completely than the use of slowly hydrolysable analogues such as ATPγS. We have recently employed Mg<sup>2+</sup> depletion to successfully trap a pre-hydrolytic intermediate in a replicative AAA+ helicase engaged in DNA unwinding (Shahid et al., Nature, 2025). This precedent supports the rationale for our choice, and the reference has now been included in the revised manuscript.

      I think the authors deposited the FSC curve for the +Mg structure in the -Mg structure PDB/EMDB entry according to the validation report. 

      We thank the reviewer for their careful inspection of the deposition materials. The discrepancy in the deposited FSC curve has now been corrected, and the appropriate FSC curves have been assigned to the correct PDB/EMDB entries.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Strengths: 

      Overall, this manuscript is well-written and contains a large amount of high-quality data and analyses. At its core, it helps to shed light on the overlapping roles of Edc3 and Scd6 in sculpting the yeast transcriptome. 

      Weaknesses: 

      (1) While the data presented makes conclusions about mRNA stability based on corresponding ChIP-Seq analyses and analyzing other mutants (e.g. Dcp2 knockout), at no point is mRNA stability actually ever directly assessed. This direct assessment, even for select transcripts, would further strengthen their conclusions. 

      We appreciate the reviewer’s concern but wish to emphasize that we conducted ChIP-Seq analysis of RNA Polymerase II occupancies in the CDSs of all genes, known to be a reliable indicator of transcription rate, and found only small increases in Pol II occupancies that cannot account for the increased transcript levels of the cohort of mRNAs up-regulated in the scd6∆edc3∆ double mutant (Fig. 3E). This provides strong evidence that increased transcription is not the main driver of increased mRNA abundance in this mutant. Bolstering this conclusion, we showed that the Hap2/Hap3/Hap4/Hap5 complex of transcription factors responsible for induction of Ox. Phos. genes was not activated in scd6Δedc3Δ cells in glucose medium (Fig. 6F(ii)); nor was the Adr1 activator of CCR genes activated (Fig. S9C(i)), ruling out transcriptional induction of their target genes in glucose-replete scd6Δedc3Δ cells and instead favoring reduced degradation as the mechanism underlying derepression of Ox. Phos. and CCR gene transcripts in this mutant. In Fig. 3B, we further showed that the majority of mRNAs up-regulated in the scd6Δedc3Δ double mutant are also derepressed by dcp2Δ, and in Fig. 3D that the mRNAs up-regulated in scd6∆edc3∆ cells exhibit a higher than average codon protection index (CPI), indicating a heightened involvement of decapping and co-translational degradation by Xrn1 in their decay.

      To provide additional support for our conclusion, we have conducted new experiments to measure the abundance of capped mRNAs genome-wide by CAGE sequencing of total mRNA in both WT and scd6∆edc3∆ cells. As established previously, normalizing CAGE TPMs to total mRNA TPMs determined by RNA-Seq, dubbed the C/T ratio, provides a reliable measure of the capped proportion of each transcript. The new data presented in Fig. 3C indicate that the mRNAs up-regulated in the scd6∆edc3∆ mutant have significantly lower than average C/T ratios in WT cells, whereas the C/T ratios for the down-regulated transcripts are higher than average, and that these differences between the two groups and all expressed mRNAs are diminished in the scd6∆edc3∆ double mutant. These are the results expected if the up-regulated mRNAs are selectively targeted for decapping in WT cells dependent on Edc3/Scd6, whereas the down-regulated mRNAs are targeted by Edc3/Scd6 less than the average transcript. In the original version of the paper, we came to the same conclusion by analyzing our previous CAGE data for the dhh1∆ mutant for the same transcripts dysregulated in scd6∆edc3∆ cells, now presented as supportive data in Fig. S3F. Finally, we added the fact that among the four Dhh1 target mRNAs examined in the previous study of He et al. (2022) and found here to be up-regulated selectively in the scd6∆edc3∆ double mutant (Fig. S10), two of them (SDS23 and HXT6) were shown directly to have longer half-lives in dhh1∆ vs. WT cells by He et al. (2018).

      Hence, the combined evidence is compelling that selective up-regulation of particular mRNAs in the scd6∆edc3∆ mutant results from diminished decapping/decay rather than enhanced transcription; and we feel that the additional supporting evidence that would be provided by measuring half-lives of a small group of up-regulated transcripts would not justify the considerable effort required to do so. Moreover, the standard approach for such experiments of impairing transcription with an inhibitor of Pol II or a Pol II Ts<sup>-</sup> mutation has been criticized because of the known buffering (suppression) of mRNA decay rates in response to impaired transcription.
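      The C/T ratio referred to in this response is a per-transcript quotient of CAGE TPM over RNA-Seq TPM. A minimal sketch of that calculation is given below; the gene names, TPM values, the `min_total_tpm` expression cutoff, and the function name are assumptions made for illustration and do not reproduce the authors' pipeline.

```python
def ct_ratios(cage_tpm, rnaseq_tpm, min_total_tpm=1.0):
    """Return CAGE TPM / RNA-Seq TPM for each sufficiently expressed gene.

    Genes below `min_total_tpm` in RNA-Seq, or missing from the CAGE data,
    are skipped to avoid dividing by very small numbers (assumed cutoff).
    """
    ratios = {}
    for gene, total in rnaseq_tpm.items():
        if total < min_total_tpm or gene not in cage_tpm:
            continue
        ratios[gene] = cage_tpm[gene] / total
    return ratios


# Hypothetical TPM values for three placeholder genes.
cage = {"geneA": 20.0, "geneB": 90.0, "geneC": 0.1}
rnaseq = {"geneA": 100.0, "geneB": 100.0, "geneC": 0.5}

ratios = ct_ratios(cage, rnaseq)
print(ratios)
```

      Under the interpretation given in the response, a transcript with a low ratio in WT cells (here the placeholder `geneA`) would be one with a relatively small capped fraction, as expected for a preferred decapping target.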

      (2) Scd6 and Edc3 show a high level of functional redundancy, as demonstrated by the double mutant. As these proteins form complexes with other decapping factors/activators, I'm curious if depleting both proteins in the double mutant destabilizes any of these other factors. Have the authors ever assessed the levels of other key decapping factors in the double mutants (i.e. Dhh1, Pat1, Dcp2...etc)? I wonder if depleting both proteins leads to a general destabilization of key complexes. It would also be interesting to see if depleting Edc3 or Scd6 leads to a concomitant increase in the other protein as a compensatory mechanism. 

      We thank the reviewer for this insight. Examining our Ribo-Seq and TMT-MS data revealed that Dhh1 expression and steady-state abundance are increased ~2-fold in the scd6∆edc3∆ strain, indicating that the up-regulation of many of the same mRNAs by scd6∆edc3∆ and dhh1∆ does not result indirectly from reduced levels of Dhh1 in the scd6∆edc3∆ mutant. The predicted increase in Dhh1 expression might signify a compensatory response to the absence of Scd6/Edc3. We also observed an ~40% reduction in Dcp2 translation (RPFs) and mRNA abundance in the scd6∆edc3∆ strain, which might contribute to the up-regulation of mRNAs dysregulated in this mutant. However, our new immunoblot analyses revealed no significant reduction in steady-state Dcp2 levels in scd6∆edc3∆ cells (Input lanes in Figs. 3F and S4C(i)-(ii)). Moreover, our previous finding that the majority of mRNAs subject to NMD, up-regulated by both upf1∆ and dcp2∆, are not up-regulated by scd6∆edc3∆ implies that Dcp2 abundance in scd6∆edc3∆ cells is adequate for normal levels of NMD and favors a direct role for Scd6/Edc3 in accelerating degradation of most transcripts up-regulated in this mutant. We have added these points to the DISCUSSION.

      (3) While not essential, it would be interesting if the authors carried out add-back experiments to determine which domain within Scd6/Edc3 plays a critical role in enforcing the regulation that they see. Their double mutant now puts them in a perfect position to carry out such experiments. 

      We agree with the reviewer that our scd6∆edc3∆ strain provides an opportunity to dissect the Scd6 and Edc3 proteins to determine which domains and motifs of each protein are most critically required for their functions in activating mRNA decay. However, if conducted thoroughly, this would entail an extensive analysis requiring a combination of genetics, biochemistry and genomics.  Considering the large amount of data already presented in 43 and 34 panels of main and supplementary figures, respectively, we feel that these additional experiments would be conducted more appropriately as a stand-alone follow-up study.

      Reviewer #2 (Public review): 

      Weaknesses: 

      The authors show very nicely in Figure S1A that growth phenotypes from scd6Δedc3∆ can be rescued by transformation of EDC3 (pLfz614-7) or SCD6 (pLfz615-5). The manuscript might benefit from using these rescue strategies in the analysis performed (e.g. RNA-seq, ribosome occupancies, and translational efficiencies). Also, these rescue assays could provide a good platform to further characterise the protein-protein interactions between Edc3, Scd6, and Dhh1. 

      We responded to this point immediately above in responding to Rev. #1.

      Reviewer #3 (Public review): 

      Weaknesses: 

      The limitations of the study include the use of indirect evidence to support claims that Edc3 and Scd6 recruit Dhh1 to the Dcp2 complex, which is inferred from correlations in mRNA abundance and ribosome profiling data rather than direct biochemical evidence. 

      While the reviewer makes a valid point, it is important to note that the greater correlations between effects of scd6∆edc3∆ with those conferred by dhh1∆ vs. pat1∆ also extended to changes in metabolites (Fig. 7A-C). To provide more direct evidence that Edc3 and Scd6 recruit Dhh1 to the Dcp2 complex, we have now conducted co-immunoprecipitation experiments (presented in new Figs. 3F and S5) demonstrating that association of Dhh1 with Dcp2 is diminished in the scd6∆edc3∆ double mutant but not in either scd6∆ or edc3∆ single mutant, thus providing biochemical support for our proposal.

      Also, there is limited exploration of other signals as the study is focused on glucose availability, and it is unclear whether the findings would apply broadly across different environmental stresses or metabolic pathways. Nonetheless, the study provides new insights into how mRNA decapping and degradation are tightly linked to metabolic regulation and nutrient responses in yeast. The RNA-seq and ribosome profiling datasets are valuable resources for the scientific community, providing quantitative information on the role of decapping activators in mRNA stability and translation control. 

      While not disputing the facts of this comment, we think it is unjustified to label as a weakness that our study focused on glucose-grown cells considering the large amount of new data and insights made possible by our multi-omics approach, presented in >70 separate figure panels and nine supplementary datafiles, which the reviewer has characterized as being valuable to the scientific community.  Parallel studies in non-preferred carbon or nitrogen sources are underway and represent large-scale investigations in their own right, for which the current dataset in glucose-replete cells provides the critical reference condition.

      Reviewer #1 (Recommendations for the authors): 

      The authors made a note that a set of 37 mRNAs is repressed exclusively by Edc3 with little contribution by Scd6, a list that includes the RPS28B mRNA. Edc3 has been previously reported to promote the decay of this mRNA in a deadenylation-independent fashion by binding to an element in its 3'UTR (PMIDs 15225544, 24492965). Can the authors comment on whether Edc3 may be binding to similar elements in the 3'UTRs of these transcripts in their shortlist? This could be an interesting topic matter for discussion as well. 

      While an interesting idea, this seems unlikely because the 3’UTR sequence in RPS28B mRNA was shown to bind Rps28 protein itself to confer heightened decapping and decay dependent on Edc3 in a negative autoregulatory loop that exerts tight control over Rps28 protein levels. It would be surprising if Edc3-mediated repression of the other 36 mRNAs would involve Rps28, as none of them encode cytoplasmic ribosomal proteins. Nevertheless, we searched for a conserved motif among the 3’UTRs of the 37 mRNAs using the MEME suite and found enrichment for motifs identified for RNA binding proteins Hrp1 and Nab2 and two novel motifs, but none of these motifs could be recognized within the Rps28 autoregulatory loop. We have chosen not to comment on these findings in the revised manuscript to avoid lengthening it unnecessarily with inconclusive observations.

      Reviewer #2 (Recommendations for the authors): 

      The authors show very nicely in Figure S1A that growth phenotypes from scd6Δedc3∆ can be rescued by the transformation of EDC3 (pLfz614-7) or SCD6 (pLfz615-5). The manuscript might benefit from using these rescue strategies on the analysis performed (e.g. RNA-seq, ribosome occupancies, and translational efficiencies); or expressing truncated mutants of EDC3 (pLfz614-7) or SCD6 (pLfz615-5), to show that they can act as dominant negative competitors, either on the binding to Dhh1 and Dcp2. 

      We addressed this comment above in our response to this Reviewer.

      Reviewer #3 (Recommendations for the authors): 

      (1) Labels such as "mRNA_up_s6,e3" are not defined in figures or the text. I suggest clearer sample labeling throughout. 

      The labels had been defined at first mention in the RESULTS but are now indicated there more explicitly, as well as in the legend to Fig. 1.

      (2) In Figure 1D it is surprising that the mRNA profile has a peak in the 5' UTR. I would expect to see such a peak in ribosome footprinting data. Is it possible these are incorrectly labeled?

      The figure is correctly labeled. Generally, one does not expect to see RPFs in the 5’UTR region unless there is an efficiently translated uORF, which appears not to be the case for MDH2.

      In general, the information in this panel and C is inadequate. None of the numbers are clearly explained in the figure legend or in the figure. 

      We had cited the legend to Fig. S3C for details of all such gene browser images but have now inserted this information into the Fig. 1D legend, at the first occurrence of such data in the regular figures. 

      (3) Figures 1C and 1D are in the wrong order.

      Corrected.

      (4) Figure 2D is a very complicated Venn Diagram. I suggest using UpSet plots as an alternative to Venn diagrams to more clearly convey overlaps between sets.  

      We provided additional explanatory text in the Fig. 2D legend to facilitate understanding.

      (5) The use of the same color scheme to represent different sets in panels of the same figure is a source of confusion. E.g. the cyan in Figures 2A, 2D, and 2E indicates unrelated categories, but one would think they are related.

      The use of the same cyan color in these three figure panels actually does designate results for the same set of 591 mRNAs up-regulated in the three mutants.  The application of the color schemes is now mentioned explicitly in Figs. 1, 2, and S3.

      (6) Reporting of p-values = 0 in figures is not useful.

      Corrected.

      (7) The whole manuscript is extremely long which reduces the overall impact. For example, the introduction is six pages long. I suggest reducing redundant text and being more concise to enhance readability. 

      We tried to streamline the text wherever possible, in particular shortening the Introduction by two pages.

      (8) Many abbreviations are used throughout the text that are not introduced the first time they are used. 

      Corrected throughout.

      (9) The ERCC normalization is unclear. Were the spike-ins added before cell lysis to allow estimation of per-cell RNA counts or to the extracted RNA? If added to extracted RNA rather than cells it is not clear to me how the claim can be made regarding increased mRNA abundance in the mutants. 

      We thank the reviewer for this comment. As we explained in the Methods, 2.4 µl of 1:100 diluted ERCC RNA Spike-In Control Mix 1 was added to 1.2 µg of each total RNA sample prior to cDNA library preparation. Because the majority of total RNA is composed of rRNA, this normalization yields the abundance of each mRNA relative to rRNA. Owing to repression of rESR mRNAs encoding ribosomal proteins and biogenesis factors in the scd6∆edc3∆ strain (Fig. S3D), the ribosome content per cell is expected to be reduced in this mutant vs. WT. We showed previously that the isogenic dcp2∆ mutant, which elicits an ESR response of similar magnitude, showed a 30% reduction in bulk ribosomal subunits per cell compared to the same WT strain examined here (Vijjamarri et al., 2023). Assuming a similar reduction in ribosome abundance in the scd6∆edc3∆ mutant, the changes in mRNA per cell conferred by the scd6∆edc3∆ mutation are expected to be 0.7-fold of the ERCC-normalized values given in Fig. 3E, yielding fold-changes of 2.00 and 0.62 for the mRNA_up and mRNA_dn groups, respectively, which still differ substantially from the corresponding changes in normalized Rpb1 occupancies of 1.2 and 0.93, respectively. We have added this new analysis to the text of RESULTS.
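      The per-cell correction described in this response is a single multiplication by the assumed ribosome-content factor (1 − 0.30 = 0.7). The sketch below applies that factor and, in reverse, back-calculates the ERCC-normalized values implied by the corrected numbers quoted above. The function names are hypothetical, and the back-calculated values are derived here for illustration rather than read from Fig. 3E.

```python
# Assumed 30% reduction in ribosome content per cell, as stated in the response.
RIBOSOME_FACTOR = 1.0 - 0.30


def per_cell_fold_change(ercc_fold_change, factor=RIBOSOME_FACTOR):
    """Scale an rRNA-relative (ERCC-normalized) fold-change to a per-cell value."""
    return ercc_fold_change * factor


def implied_ercc_fold_change(per_cell_fc, factor=RIBOSOME_FACTOR):
    """Invert the correction: the ERCC-normalized value implied by a per-cell value."""
    return per_cell_fc / factor


# Corrected per-cell fold-changes and Rpb1 occupancy changes quoted in the response.
corrected = {"mRNA_up": 2.00, "mRNA_dn": 0.62}
rpb1_occupancy = {"mRNA_up": 1.2, "mRNA_dn": 0.93}

for group, fc in corrected.items():
    implied = implied_ercc_fold_change(fc)
    print(f"{group}: implied ERCC-normalized {implied:.2f}, "
          f"per-cell {fc:.2f}, Rpb1 occupancy {rpb1_occupancy[group]:.2f}")
```

      The comparison that matters for the argument is the last pair of columns: even after the 0.7-fold correction, the per-cell mRNA changes remain further from 1 than the corresponding Rpb1 occupancy changes.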

      (10) The use of the terms "up-regulated" and "derepressed" throughout is confusing. Both refer to observed increased abundance of mRNAs, but they imply different causes which are never clearly defined. 

      We changed all occurrences of “derepressed” to “up-regulated”.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #2 (Public review):

      (1)  The sharpening model of expectation can predict surround suppression. The authors could further clarify how the cancellation model predicts a monotonic profile of expectation (Figure 1C) with the highest response at the expected orientation, while the cancellation model suggests a suppression of neurons tuned toward the expected stimulus.

      We thank the reviewer for the comment. We would like to emphasize that as the expected signal is suppressed, the relative weight or salience of unexpected inputs increases. We have clarified this interpretation in the manuscript as follows:

      “Here, given these two mechanisms making opposite predictions about how expectation changes the neural responses of unexpected stimuli, thereby displaying different profiles of expectation, we speculated that if expectation operates by the sharpening model with suppressing unexpected information, we should observe an inhibitory zone surrounding the focus of expectation, and its profile then should display as a center-surround inhibition (Fig. 1c, left). If, however, expectation operates as suggested by the cancelation model with highlighting unexpected information, the inhibitory zone surrounding the focus of expectation should be eliminated, and the profile should instead display a monotonic gradient (Fig. 1c, right).”

      (2) I'm a bit concerned about whether the profile solely arises from modulation of expectation. The two auditory cues are each associated with a fixed orientation, which may be confounded by other cognitive processes like visual working memory or attention (which I think the authors also discussed). Although the authors tried to use SFD task to render orientation task-irrelevant, luminance edges (i.e., orientation) and spatial frequency in gratings are highly intertwined and orientation of the gratings may help recall the first grating's SF (fixed at 0.9 c/°), especially given the first and second grating's orientations are not very different (4.8°).

      We agree that dissociating expectation from attention and other top-down processes remains a key challenge in visual expectation research (see Summerfield & Egner, 2009; Summerfield & de Lange, 2014; de Lange et al., 2018). As is generally acknowledged, expectation reflects the probability of a sensory event, while selective attention relates to its behavioral relevance. To minimize attentional influences, our task design ensured that grating orientation was not task-relevant: on each trial, participants discriminated either orientation or spatial frequency difference, such that orientation itself did not require attentional allocation, a point already discussed in the manuscript.

      Regarding visual working memory, we argue that even if participants recalled the first grating’s spatial frequency in the SFD task, they were not required to retain its precise spatial frequency (or orientation), as their task was simply to judge whether the second grating appeared denser or sparser. In other words, orientation (or spatial frequency) itself was not task-relevant. Moreover, although not included in the manuscript, we conducted a post-experiment debriefing in which participants were asked whether they noticed any association between the auditory tone and the grating orientation. None of the participants reported this relationship correctly, suggesting that the tone-orientation mapping remained implicit and was unlikely to be driven by strategic attention or memory.

      However, we acknowledge that certain confounding processes such as statistical learning or implicit mapping acquisition cannot be fully ruled out given the current paradigm. Future studies using methods with higher temporal resolution (e.g., EEG/MEG) may help to dissociate these mechanisms more precisely.

      (3) For each of the expected orientations (20° or 70°), the unexpected ones are linearly separable (i.e., all unexpected ones lie on one side of the expected angle). This might further encourage people to shift their attended or expected orientation, according to the optimal tuning hypothesis. Would this provide an alternative explanation to the tuning shift that the authors found?

      We thank the reviewer for pointing out the relevance of the optimal tuning hypothesis. We acknowledge that the optimal tuning theory (Navalpakkam & Itti, 2007) is an important framework, particularly in visual search paradigms, where attentional templates may shift away from non-target features to enhance discriminability.

      In our task, this hypothesis would predict a shift of expectation toward <20° in E20° trials and >70° in E70° trials, given that all unexpected orientations lie on one side of the expected angle. Importantly, the optimal tuning hypothesis predicts such shifts not only in Δ20°, Δ25°, and Δ30° trials but also in the Δ0° trials. In this regard, the observed shift in Δ20° and Δ30° (Experiment 2) and Δ25° (Experiment 3) trials is broadly consistent with the predictions of the optimal tuning account. However, we did not observe a corresponding shift away from nontarget features in the Δ0° condition, suggesting limited behavioral evidence for optimal tuning effects under our current task settings.

      It is important to note that most previous studies supporting optimal tuning (e.g., Navalpakkam & Itti, 2007; Scolari & Serences, 2009; Geng, DiQuattro, & Helm, 2017; Yu & Geng, 2019) have used visual search paradigms that differ from our design in several critical ways, including the number of stimuli presented, their spatial arrangement (eccentricity), task demands, and so on. Therefore, it is difficult to determine whether the optimal tuning hypothesis could serve as an alternative explanation within the context of our current study. We agree that future studies could further examine how such task parameters influence the presence or absence of optimal tuning.

      (4) It is great that the authors conducted computational modeling to elucidate the potential neuronal mechanisms of expectation. But I think the sharpening hypothesis (e.g., reviewed in de Lange, Heilbron & Kok, 2018) focuses on the neural population level, i.e., narrowing of population tuning profile, while the authors conducted the sharpening at the neuronal tuning level. However, the sharpening of population does not necessarily rely on the sharpening of individual neuronal tuning. For example, neuronal gain modulation can also account for such population sharpening. I think similar logic applies to the orientation adjustment experiment. The behavioral level shift does not necessarily suggest a similar shift at the neuronal level. I would recommend that the authors comment on this.

      We thank the reviewer for this to-the-point comment. As de Lange et al. (2018) noted, “there is not always a direct correspondence between neural-level and voxel-level selectivity patterns.” That is, neuronal tuning, population-level tuning, voxel-level selectivity, and behavioral adaptive outcomes may reflect different underlying mechanisms and do not necessarily align in a one-to-one fashion. We fully acknowledge that population-level tuning effects may also result from various neuronal mechanisms such as gain modulation (for review, see Salinas & Thier, 2000), shifts in preferred orientation (Ringach et al., 1997; Jeyabalaratnam et al., 2013), asymmetric broadening of tuning curves (Schumacher et al., 2022), or tuning curve sharpening (Ringach et al., 1997; Schoups et al., 2001).

      In our modeling, we implemented sharpening and shifts of neuronal tuning curves as a conceptual model simplification, intended to explore potential mechanisms underlying expectation-related center-surround suppression effects. While sharpening-based accounts (e.g., Kok et al. 2012) have often been emphasized, we stress that other mechanisms, such as gain modulation or tuning shifts, may also contribute. Our goal is not to provide a definitive account, but to highlight such plausible mechanisms and encourage future investigation. We have revised the Discussion to emphasize that multiple mechanisms may underlie the observed effects.

      “We note that our implementation of sharpening and shifts at the neuronal level serves as a conceptual model simplification, as population-level tuning, voxel-level selectivity, and behavioral adaptive outcomes may reflect different underlying neuronal mechanisms and do not necessarily align in a one-to-one fashion. Here, we stress that other potential mechanisms beyond sharpening, such as tuning shifts, may also contribute to visual expectation.” 

      (5) If the orientation adjustment experiment suggests that both sharpening and shifting are present at the same time, have the authors tried combining both in their computational model?

      We agree with the reviewer that it is necessary to consider the combined model. Accordingly, we implemented a computational model incorporating sharpening of the expected orientation channel together with shifting of the unexpected orientation channels. This model successfully captured the sharpening of the expected-orientation channel and the shift of the unexpected-orientation channels (Supplementary Fig. 3). For the expected orientation (Δ0°), results showed that the amplitude change was significantly higher than zero on both OD (t(23) = 2.582, p = 0.017, Cohen’s d = 0.527) and SFD (t(23) = 2.078, p = 0.049, Cohen’s d = 0.424) tasks (Supplementary Fig. 3e, vertical stripes); the width change was significantly lower than zero on both OD (t(23) = -2.438, p = 0.023, Cohen’s d = 0.498) and SFD (t(23) = -2.578, p = 0.017, Cohen’s d = 0.526) tasks (Supplementary Fig. 3e, diagonal stripes). For unexpected orientations (Δ10°-Δ40°), however, the amplitude and width changes were not significantly different from zero on either OD (amplitude change: t(23) = 0.443, p = 0.662, Cohen’s d = 0.091; width change: t(23) = -1.819, p = 0.082, Cohen’s d = 0.371) or SFD (amplitude change: t(23) = 1.130, p = 0.270, Cohen’s d = 0.231; width change: t(23) = -1.710, p = 0.101, Cohen’s d = 0.349) tasks (Supplementary Fig. 3f). In the meantime, the location shift was significantly different from zero for unexpected orientations (Δ10°-Δ40°, OD task: t(23) = 3.611, p = 0.001, Cohen’s d = 0.737; SFD task: t(23) = 2.418, p = 0.024, Cohen’s d = 0.493; Supplementary Fig. 3g). These results provided further evidence that tuning sharpening and tuning shift jointly contribute to center–surround inhibition in expectation.
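      For a one-sample t-test against zero, Cohen's d equals |t| divided by the square root of the sample size, so the reported effect sizes can be cross-checked from the t values alone. The sketch below does this for a few of the statistics quoted above, assuming n = 24 participants (inferred from the 23 degrees of freedom); the function name is hypothetical.

```python
import math


def cohens_d_one_sample(t_value, n):
    """Magnitude of Cohen's d for a one-sample t-test: |t| / sqrt(n)."""
    return abs(t_value) / math.sqrt(n)


N = 24  # df = 23 implies 24 participants (assumption of this check)

# (t value, reported Cohen's d) pairs copied from the response text.
reported = [
    (2.582, 0.527),   # expected orientation, amplitude change, OD task
    (2.078, 0.424),   # expected orientation, amplitude change, SFD task
    (-2.438, 0.498),  # expected orientation, width change, OD task
    (3.611, 0.737),   # unexpected orientations, location shift, OD task
]

for t_value, d_reported in reported:
    d_computed = cohens_d_one_sample(t_value, N)
    print(f"t = {t_value:+.3f}: computed d = {d_computed:.3f}, reported d = {d_reported:.3f}")
```

      The computed and reported values agree to rounding, which is consistent with the effect sizes having been reported as magnitudes even where the t statistic is negative.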

      Reviewer #1 (Recommendations for the authors):

      (1) A direct comparison between tasks (baseline vs. expectation conditions) would have strengthened the findings. Specifically, contrasting performance in the orientation discrimination task with the spatial frequency discrimination task could have provided clearer evidence that participants actually used the auditory cues to attend to the expected orientation. This comparison would be particularly important for validating cue manipulation in the orientation discrimination task.

      We agree that a direct comparison between the orientation discrimination (OD) and spatial frequency discrimination (SFD) tasks could further clarify how expectation (auditory cues) differentially modulates orientation relevance. However, the primary goal of the current study was to examine expectation effects within each task separately and to demonstrate that such effects are independent of attentional modulation driven by the task-relevance of orientation.

      In addition, the OD and SFD tasks differ not only in the relevant task features (orientation vs. spatial frequency discrimination), but also in stimulus properties and difficulty (for example, the arbitrary use of 20–70° as the orientation range and ~0.9 cycles/° as the spatial frequency setting), so a direct comparison could introduce confounding factors unrelated to expectation.

      Importantly, previous studies (e.g., Kok et al., 2012, 2017; Aitken et al., 2020) and our current results show that participants performed significantly better when the auditory cue matched the expected orientation, supporting the validity of our expectation manipulation.

      (2) An interesting consideration is why the center-surround inhibition profile of expectation was independent of the task-relevance of orientation. Previous studies (e.g., Kok et al., 2012) have found that orientation discrimination patterns differ depending on whether orientation is task-relevant or irrelevant. This could be useful to discuss the possible discrepancies.

      We thank the reviewer for this inspiring comment. Kok et al. (2012) showed that both orientation and contrast tasks elicited similar fMRI decoding results, regardless of task relevance, suggesting neural mechanisms of expectation operate independently of whether orientation is task relevant. Behaviorally, they reported better performance for expected versus unexpected trials in the orientation task (3.4° vs. 3.8°, t(17) = 2.8, p = 0.013), and a marginal trend (although not significant) in the contrast task (4.3% vs. 5.0%, t(17) = 1.9, p = 0.075). If any differences between the two tasks exist, they may lie in the correlation between behavioral and fMRI effects, a question that goes beyond the scope of the current study. Therefore, it is hard to strongly conclude that orientation discrimination patterns differ depending on whether orientation is taskrelevant or irrelevant in their paper.

      Our study differs from theirs in at least two important ways, which may account for the clearer expectation facilitatory effect we observed in the expectation (Δ0°) condition. First, in our study, the orientation-irrelevant task involved spatial frequency discrimination (SFD) rather than contrast discrimination. Compared to contrast, spatial frequency has been shown to exhibit a clear cueing effect, as reported in Fang & Liu (2019). Second, our design included a baseline condition, which was absent in their study. We computed discrimination sensitivity (DS) to quantify how much the discrimination threshold (DT) changed relative to baseline. By using this baseline-referenced approach, we observed a significant facilitatory expectation effect in the Δ0° condition, an effect that shifted from marginal significance in their orientation-irrelevant task to clear significance in our study.

      (3) The authors might consider briefly explaining how the orientation adjustment paradigm used in this study is particularly effective for examining the potential co-existence of tuning sharpening and tuning shift computations, and how this approach complements traditional orientation discrimination tasks in characterizing expectation-related mechanisms.

      We thank the reviewer for this valuable suggestion. We agree that further clarification is needed to better connect the two experiments. To explain this, we have elaborated further in the manuscript.

      “To further explore the co-existence of both Tuning sharpening and Tuning shift computations in center-surround inhibition profile of expectation, participants were asked to perform a classic orientation adjustment experiment. Unlike profile experiment (discrimination tasks), the adjustment experiment provides a direct, trial-by-trial measure of participants’ perceived orientation, capturing the full distribution of responses. This enables the construction of orientation-specific tuning curves, allowing us to detect both tuning sharpening and tuning shifts, thereby offering a more nuanced understanding of the computational mechanisms underlying expectation.”

      (4) These interesting findings raise important questions about their relationship to existing hybrid models of attentional modulation. Could the authors discuss how their results might align with or extend previous work demonstrating combined feature-similarity gain and surround suppression effects for orientation (e.g., Fang & Liu, 2019)? Could a hybrid model potentially provide a better account of these data than the pure surround suppression model?

      We thank the reviewer for this valuable comment. We agree that the hybrid model should be mentioned in the manuscript, and we have elaborated further in the Discussion.

      “For example, within the orientation space, the inhibitory zone was about 20°, 45°, and 54° for expectation evident here, feature-based attention[21], and visual perceptual learning[35], respectively; within the feature-based attention, it was about 30° and 45° in color [77] and motion direction [53] spaces, respectively. These variations hint at the exciting possibility that the width of the inhibitory surround may flexibly adapt to stimulus context and task demands, ultimately facilitating our perception and behavior in a changing environment. This principle is consistent with the hybrid model of feature-based attention [53,54,75], where attention is deployed adaptively to prioritize task-relevant information through feature-similarity gain which filters out the most distinctive distractors, and surround suppression which inhibits similar and confusable ones, thereby jointly shaping the attentional tuning profile.”

      (5) On page 19, there appears to be a missing symbol in the description of the Tuning Sharpening model. The text states: 'the tuning width of each channel's tuning function is parameterized by ??', where the question marks seem to indicate a missing parameter symbol.

      We appreciate the reviewer’s careful attention. Yes, the "σ" is missing, which was likely caused by a formatting issue. We have corrected it.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      This work investigated how the sense of control influences perceptions of stress. In a novel "Wheel Stopping" task, the authors used task variations in difficulty and controllability to measure and manipulate perceived control in two large cohorts of online participants. The authors first show that their behavioral task has good internal consistency and external validity, showing that perceived control during the task was linked to relevant measures of anxiety, depression, and locus of control. Most importantly, manipulating controllability in the task led to reduced subjective stress, showing a direct impact of control on stress perception. However, this work has minor limitations due to the design of the stressor manipulations/measurements and the necessary logistics associated with online versus in-person stress studies.

      Nevertheless, this research adds to our understanding of when and how control can influence the effects of stress and is particularly relevant to mental health interventions.

      We thank the reviewer for their clear and accurate summary of the findings. 

      Strengths:

      The primary strength of this research is the development of a unique and clever task design that can reliably and validly elicit variations in beliefs about control. Impressively, higher subjective control in the task was associated with decreased psychopathology measures such as anxiety and depression in a non-clinical sample of participants. In addition, the authors found that lower control and higher difficulty in the task led to higher perceived stress, suggesting that the task can reliably manipulate perceptions of stress. Prior tasks have not included both controllability and difficulty in this manner and have not directly tested the influence of these factors on incidental stress, making this work both novel and important for the field.

      We thank the reviewer for their positive comments.

      Weaknesses:

      One minor weakness of this research is the validity of the online stress measurements and manipulations. In this study, the authors measure subjective stress via self-report both during the task and also after either a Trier Social Stress Test (high-stress condition) or a memory test (low-stress condition). One concern is that these stress manipulations were really "threats" of stress, where participants never had to complete the stress tasks (i.e., recording a speech for judgment). While this is not unusual for an in-lab study and can reliably elicit substantial stress/anxiety, in an online study, there is a possibility for communication between participants (via online forums dedicated to such communication), which could weaken the stress effects. That said, the authors did find sensible increases and decreases of perceived stress between relevant time points, but future work could improve upon this design by including more complete stress manipulations and measuring implicit physiological signs of stress.

      We thank the reviewer for urging us to expand on this point. The reviewer is right that stress was merely anticipatory and is in that sense different to the canonical TSST. However, there are ample demonstrations that such anticipatory stress inductions are effective at reliably eliciting physiological and psychological stress responses (e.g. Nasso et al., 2019; Schlatter et al., 2021; Steinbeis et al., 2015). Further, there is evidence that online versions of the TSST are also effective (DuPont et al., 2022; Meier et al., 2022), including evidence that the speech preparation phase conducted online was related to increases in heart rate and blood pressure (DuPont et al., 2022). Importantly, and as the reviewer notes in relation to our study specifically, the anticipatory TSST had a significant impact on subjective stress in the expected direction demonstrating that it was effective at eliciting subjective stress. We have elaborated further on this in our manuscript (pages 8 and 9) as follows: 

      “Prior research has found TSST anticipation to elicit both psychological and physiological stress responses [37-39], suggesting that the task anticipation would be a valid stress induction despite participants not performing the speech task. Moreover, prior research has validated the use of remote TSST in online settings [40, 41], including evidence that the speech preparation phase (online) was related to increased heart rate and blood pressure compared to controls [40].”

      Reviewer #2 (Public review):

      Summary:

      The authors have developed a behavioral paradigm to experimentally manipulate the sense of control experienced by the participants by changing the level of difficulty of a wheel-stopping task. In the first study, this manipulation is tested by administering the task in a factorial design with two levels of controllability and two levels of stressor intensity to a large number of participants online while simultaneously recording subjective ratings on perceived control, anxiety, and stress. In the second study, the authors used the wheel-stopping task to induce a high sense of controllability and test whether this manipulation buffers the response to a subsequent stress induction when compared to a neutral task, like looking at pleasant videos.

      We thank the reviewer for their accurate summary.

      Strengths:

      (1) The authors validate a method to manipulate stress.

      (2) The authors use an experimental manipulation to induce an enhanced sense of controllability to test its impact on the response to stress induction.

      (3) The studies involved big sample sizes.

      We thank the reviewer for noting these positive aspects of our study. 

      Weaknesses:

      (1) The study was not preregistered.

      This is correct.

      (2) The control manipulation is conflated with task difficulty, and, therefore the reward rate. Although the authors acknowledge this limitation at the end of the discussion, it is a very important limitation, and its implications are not properly discussed. The discussion states that this is a common limitation with previous studies of control but omits that many studies have controlled for it using yoking.

      We agree that these are very important issues to consider in the interpretation of our findings. It is important to note that, while our task design does not separate these constructs, we are able to do so in our statistical analyses. For example, our measure of perceived difficulty was included in the analyses assessing fluctuations in stress and control. In these analyses, subjective control still had a unique effect on the experience of stress over and above perceived difficulty, suggesting that subjective control explains variance in stress beyond what is accounted for by perceived difficulty. Similarly, we have included additional analyses with the win rate (i.e. percentage of trials won) as a covariate when assessing the relationship between subjective control, perceived difficulty and subjective stress. Subjective control and perceived difficulty still uniquely predict subjective stress when controlling for win rate. This suggests that there is unique variance in subjective control, separate from perceived task difficulty and win rate, that is relevant to stress. We have included these analyses (page 16 of manuscript) as follows:

      “To further isolate the relationship between subjective control and stress separate from perceived task difficulty or objective task performance, we also included the overall win rate (percentage of trials won during the WS task) in the models. In Study 1, lower feelings of control were related to higher levels of subjective stress (β= -0.12, p<.001) even when controlling for both win rate (β= -0.06, p=.220) and perceived task difficulty (β= 0.37, p<.001, Table S10). This also replicated in Study 2, where lower subjective control was associated with higher feelings of stress (β= -0.32, p<.001) when controlling for perceived task difficulty (β= 0.31, p<.001) and win rate (β= -0.11, p=.428, Table S11). This suggests that there is unique variance in subjective feelings of control, separate from task performance, relevant to subjective stress.”

      As well as expanding on this in the Discussion (pages 27 and 28) as follows:

      “While our task design does not separate control from obtained reward, we are able to do so in the statistical analyses. Like with perceived difficulty, we statistically accounted for reward rate and showed that the relationship between subjective control and stress was not accounted for by reward rate, for example. Similarly, participants received feedback after every trial, and thus feedback valence may contribute to stress perception. However, given that overall win rate (which captures the feedback received during the task) did not predict stress over and above perceived difficulty or subjective control, it suggests that feedback is unlikely to relate to stress over and above difficulty. Future work will need to disentangle this further to rule out such potential confounds.”

      Further, in terms of the wider literature on these issues, we have added more to this point in our discussion, especially in relation to previous literature that also varies control by reward rate (e.g. Dorfman & Gershman, 2019, who use a reward rate of 80% in high control conditions and 50% in low control conditions). This can be found in the manuscript on page 27 as follows: 

      “Previous research typically accounts for different outcomes (e.g. punishment) by yoking controllable and uncontrollable conditions [3], though other work has manipulated the controllability of rewards by changing the reward rate (for example [30], where a decoy stimulus is rewarded 50% of the time in the low control condition but 80% in the high control condition).”

      (3) The methods are not always clear enough, and it is difficult to know whether all the manipulations are done within-subjects or some key manipulations are done between subjects.

      We have added more information in the methods section (page 8) clarifying within-subject manipulations (WS task parameters) and between-subject manipulations (stressor intensity task, WS task version in Study 1, and WS task/video task in Study 2). Additionally, as recommended by Reviewer 1, we have provided more information in the methods section and Table S3 regarding the details of on-screen written feedback provided to participants after each trial of the WS Task.

      (4) The analysis of internal consistency is based on splitting the data into odd/even sliders. This choice of data parcellation may cause missed drifts in task performance due to learning, practice effects, or tiredness, thus potentially inflating internal consistency.

      We agree that this can indeed be an issue, though drift is likely to be present in any task, even in mood during resting state (Jangraw et al., 2023). To respond to this specific point, we parcellated the timepoints into a 1st/2nd half split and report the ICC in the supplementary information. While values are lower, indeed likely due to systematic drifts in task performance as participants learn to perform the task (especially for Study 2, since the order of parameters was designed to get easier throughout the experiment), the ICC values are still high. Control sliders: Study 1 = 0.82, Study 2 = 0.68; Difficulty sliders: Study 1 = 0.84, Study 2 = 0.57; Stress sliders: Study 1 = 0.45, Study 2 = 0.71. As seen, the lowest ICC is for stress sliders in Study 1. This may be because the first 3 sliders (included in the 1st half split) were all related to the stress task (initial, post-stress task, post-debrief) and the final 4 sliders (in the 2nd half split) were the three sliders during the WS task and shortly afterwards.
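
      The difference between the two parcellation schemes discussed here can be illustrated with a minimal Python sketch. It is not the authors' analysis: the toy ratings are hypothetical, the function names are invented for illustration, and a Spearman-Brown-corrected Pearson correlation is used as a simple stand-in for the ICC reported in the response. The sketch only shows why a first/second-half split is sensitive to within-session drift while an odd/even split largely averages it out.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_reliability(ratings, scheme):
    """Spearman-Brown-corrected split-half reliability.

    ratings: one list of slider values per participant, in time order.
    scheme:  "odd_even" interleaves time points,
             "first_second" splits each series at its midpoint.
    """
    half_a, half_b = [], []
    for series in ratings:
        if scheme == "odd_even":
            a, b = series[0::2], series[1::2]
        elif scheme == "first_second":
            mid = len(series) // 2
            a, b = series[:mid], series[mid:]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        half_a.append(sum(a) / len(a))
        half_b.append(sum(b) / len(b))
    r = pearson(half_a, half_b)
    return 2 * r / (1 + r)  # Spearman-Brown step-up for full test length

# Hypothetical ratings: participants differ in level AND in drift over time.
toy_ratings = [
    [50, 50, 50, 50, 50, 50],  # no drift
    [40, 44, 48, 52, 56, 60],  # upward drift
    [60, 56, 52, 48, 44, 40],  # downward drift
    [30, 30, 30, 30, 30, 30],  # no drift
    [70, 72, 74, 76, 78, 80],  # mild upward drift
]

rel_odd_even = split_half_reliability(toy_ratings, "odd_even")
rel_first_second = split_half_reliability(toy_ratings, "first_second")
print(f"odd/even: {rel_odd_even:.3f}, first/second: {rel_first_second:.3f}")
```

      With drifting series like these, the first/second-half estimate comes out lower than the odd/even estimate, which is the pattern the reviewer anticipated.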

      (5) Study 2 manipulates the effect of domain (win versus loss WS task), but the interaction of this factor with stressor intensity is not included in the analysis.

      We agree that this would be a valuable analysis to include. We have run additional analyses (section Sensitivity and Exploratory Analyses, pages 24 and 25), testing the interaction of Domain (win or loss) with stressor intensity (and time) when predicting the stress buffering and stress relief effects. This revealed no significant main effects of domain or interactions including domain, suggesting that domain did not impact the stress induction or relief differently depending on whether it was followed by the high or low stressor intensity condition. While the control by time interaction (our main effect of interest) still held for stress induction in this more complex model, the control by time interaction did not hold for the stress relief. However, this more complex model did not provide a better fit for the data, motivating us to continue to draw conclusions from the original model specification with domain as a covariate (rather than an interaction).

      We outline these analyses on page 24 of the manuscript, as follows:

      “Third, we included the interaction of domain with stressor intensity and with time, to test whether the win or loss domain in the WS task significantly impacted stress induction or stress relief differently depending on stressor intensity. There were no significant effects or interactions of domain (Table S14) for stress induction or stress relief, and the main effect of interest (the interaction between time and control) still held for the stress induction (β= 10.20, SE=4.99, p=.041, Table S14), though was no longer significant for the stress relief (β= 6.72, SE=4.28, p=.117, Table S14). This more complex model did not significantly improve model fit (χ²(3)= 1.46, p=.691) compared to our original specification (with domain as a covariate rather than an interaction) and had slightly worse fit (higher AIC and BIC) than the original model (AIC = 5477.2 versus 5472.7, BIC = 5538.5 versus 5520.8).”
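
      The reported model comparison can be checked for internal consistency with a short Python sketch. It assumes that the χ² value is a likelihood-ratio statistic between nested models fit by maximum likelihood, and that AIC is defined in the usual way as 2k − 2 ln L; neither assumption is stated explicitly in the response. The helper name `chi2_sf_df3` is introduced here for illustration and uses the closed-form chi-square distribution for 3 degrees of freedom.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    cdf = math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    return 1 - cdf

# Values quoted in the response
lrt_stat = 1.46          # chi-square statistic
extra_params = 3         # degrees of freedom of the test
aic_complex, aic_original = 5477.2, 5472.7

# p-value of the likelihood-ratio test
p_value = chi2_sf_df3(lrt_stat)

# Under AIC = 2k - 2 ln L, adding 3 parameters that improve
# -2 ln L by only 1.46 should raise AIC by 2*3 - 1.46.
delta_aic_implied = 2 * extra_params - lrt_stat
delta_aic_reported = aic_complex - aic_original

print(f"p = {p_value:.3f}")
print(f"implied dAIC = {delta_aic_implied:.2f}, reported dAIC = {delta_aic_reported:.2f}")
```

      Under these assumptions the p-value and the AIC difference agree with the reported figures to within rounding, which supports retaining the simpler specification.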

      This study will be of interest to psychologists and cognitive scientists interested in understanding how controllability and its subjective perception impact how people respond to stress exposure. Demonstrating that an increased sense of control buffers/protects against subsequent stress is important and may trigger further studies to characterize this phenomenon better. However, beyond the highlighted weaknesses, the current study only studied the effect of stress induction consecutive to the performance of the WS task on the same day and its generalizability is not warranted.

      We thank the reviewer for this assessment and agree that we cannot assume these findings would generalise to more prolonged effects on stress responses.

      Reviewer #3 (Public review):

      Summary:

      This is an interesting investigation of the benefits of perceiving control and its impact on the subjective experience of stress. To assess a subjective sense of control, the authors introduce a novel wheel-stopping (WS) task where control is manipulated via size and speed to induce low and high control conditions. The authors demonstrate that the subjective sense of control is associated with experienced subjective stress and individual differences related to mental health measures. In a second experiment, they further show that an increased sense of control buffers subjective stress induced by a trier social stress manipulation, more so than a more typical stress buffering mechanism of watching neutral/calming videos.

      We agree with this accurate summary of our study. 

      Strengths:

      There are several strengths to the manuscript that can be highlighted. For instance, the paper introduces a new paradigm and a clever manipulation to test an important and significant question. Additionally, it is a well-powered investigation that allows for confidence in replicability and the ability to show both high internal consistency and high external validity with an interesting set of individual difference analyses. Finally, the results are quite interesting and support prior literature while also providing a significant contribution to the field with respect to understanding the benefits of perceiving control.

      We thank the reviewer for this positive assessment. 

      Weaknesses:

      There are also some questions that, if addressed, could help our readership.

      (1) A key manipulation was the high-intensity stressor (Anticipatory TSST signal), which was measured via subjective ratings recorded on a sliding scale at different intervals during testing. Typically, the TSST conducted in the lab is associated with increases in cortisol assessments and physiological responses (e.g., skin conductance and heart rate). The current study is limited to subjective measures of stress, given the online nature of the study. Since TSST online may also yield psychologically different results than in the lab (i.e., presumably in a comfortable environment, not facing a panel of judges), it would be helpful for the authors to briefly discuss how the subjective results compare with other examples from the literature (either online or in the lab). The question is whether the experienced stress was sufficiently stressful given that it was online and measured via subjective reports. The control condition (low intensity via reading recipes) is helpful, but the low-intensity stress does not seem to differ from baseline readings at the beginning of the experiment.

      We agree that it would be helpful to expand on this further. Similar to the comment made by Reviewer 1, we wish to point out that there are ample demonstrations that such anticipatory stress inductions are effective at reliably eliciting physiological and psychological stress responses (e.g. Nasso et al., 2019; Schlatter et al., 2021; Steinbeis et al., 2015). Further, there is evidence that online versions of the TSST are also effective (DuPont et al., 2022; Meier et al., 2022), including evidence that the speech preparation phase conducted online was related to increases in heart rate and blood pressure (DuPont et al., 2022). We have elaborated further on this in our manuscript on pages 8 and 9 as follows:

      “Prior research has found TSST anticipation to elicit both psychological and physiological stress responses [37-39], suggesting that the task anticipation would be a valid stress induction despite participants not performing the speech task. Moreover, prior research has validated the use of remote TSST in online settings [40, 41], including evidence that the speech preparation phase (online) was related to increased heart rate and blood pressure compared to controls [40].”

      (2) The neutral videos represent an important condition to contrast with WS, but it raises two questions. First, the conditions are quite different in terms of experience, and it is interesting to consider what another more active (but not controlled per se) condition would be in comparison to the WS performance. That is, there is no instrumental action during the neutral video viewing (even passive ratings about the video), and the active demands could be an important component of the ability to mitigate stress. Second, the subjective ratings of the stress of the neutral video appear equivalent to the win condition. Would it have been useful to have a high arousal video (akin to the loss condition) to test the idea that experience of control will buffer against stress? That way, the subjective stress experience of stress would start at equivalent points after WS3.

      We agree with the reviewer that this is an important issue to clarify. In our deliberations when designing this study, we considered that any task with action-outcome contingencies would have a degree of controllability. To better distinguish experiences of control (WS task) from an experience of no/neutral control (i.e., neither high nor low controllability), we decided to use a task in which no actions were required during the task itself. Importantly, however, there was an active demand, and concentration was still required in order to perform the attention checks regarding the content of the videos and the ratings of the videos.

      Thank you for the suggestion of having a high arousal video condition. This would indeed be interesting to test how experiencing ‘neutral’ control and high(er) stress levels preceding the stressor task influences stress buffering and stress relief, and we have included this suggestion for future research in the discussion section (page 28) as below:

      “Another avenue for future research would be to test how control buffers against stress when compared to a neutral control scenario of higher stress levels, akin to the loss domain in the WS Task, given that participants found the video condition generally relaxing. However, given that we found no differences dependent on domain for the stress induction in the WS Task conditions, it is possible that different versions of a neutral control condition would not impact the stress induction.”

      (3) For the stress relief analysis, the authors included time points 2 and 3 (after the stressor and debrief) but not a baseline reading before stress. Given the potential baseline differences across conditions, can this decision be justified in the manuscript?

      We thank the reviewer for raising this. Regarding the stress relief analyses (timepoints 2 and 3) and not including timepoint 1 (after the WS/video task) stress in the model, we have added to the manuscript that there was no significant difference in stress ratings between the high control and neutral control (collapsed across stress and domain) at timepoint 1 (hence why we do not think it’s necessary to include in the stress relief model). Nevertheless, we have now included a sensitivity analysis to test the Timepoint*Control interaction of stress relief when including timepoint 1 stress as a covariate. The timepoint by control interaction still holds, suggesting that the initial stress level prior to the stress induction does not impact our results of interest. The details of this analysis are included in the Sensitivity and Exploratory Analyses section on page 24:

      “Although there were no significant differences between control groups in subjective stress immediately after the WS/video task (t(175.6)=1.17, p=.244), we included participants’ stress level after the WS/video task as a covariate in the stress relief analyses (Table S12). The results revealed a main effect of initial stress (β= 0.643, SE=0.040, p<.001, Table S12) on the stress relief after the stressor debrief. Compared to excluding initial stress as in the original analyses (Table 4), there was now no longer a main effect of domain (β= 0.236, SE=2.60, p=.093, Table S12), but the inference of all other effects remained the same. Importantly, there was still a significant time by control interaction (β= 9.65, SE=3.74, p=.010, Table S12) showing that the decrease in stress after the debrief was greater in the highly controllable WS condition than the neutral control video condition, even when accounting for the initial stress level.”
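
      The fractional degrees of freedom in t(175.6) suggest an unequal-variance (Welch) t-test, although the response does not name the test. As a sketch under that assumption, the Python snippet below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom; the two samples are hypothetical and are not the study data.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical stress ratings for two groups of unequal size and spread
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10, 12]

t_stat, df = welch_t(group_a, group_b)
print(f"t({df:.1f}) = {t_stat:.2f}")
```

      The resulting degrees of freedom always lie between the smaller group's n − 1 and the pooled n₁ + n₂ − 2, which is why a non-integer value such as 175.6 can appear.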

      (4) Is the increased control experience during the losses condition more valuable in mitigating experienced stress than the win condition?

      We agree that this would be helpful to clarify. To test whether the loss domain was more valuable at mitigating experiences of stress than the win condition, we ran additional analyses with just the high control condition (WS task) to test for a Domain*Time interaction. This revealed no significant Domain*Time interaction, suggesting that the stress buffering or stress relief effect was not dependent on domain in the high control conditions. These analyses are outlined in the Sensitivity and Exploratory Analyses section on page 25:

      “Finally, to test whether the loss domain was more valuable at mitigating experiences of stress than the win condition, we ran additional analyses with just the high control condition (WS task) for the stress induction and stress relief to test for an interaction of domain and time. For the stress induction, there was no significant two-way interaction of domain and time (β= -1.45, SE=4.80, p=.763), nor a significant three-way interaction of domain by time by stressor intensity (β= -3.96, SE=6.74, p=.557, Table S15), suggesting that there were no differences in the stress induction dependent on domain. Similarly for the stress relief, there was no significant two-way interaction of domain and time (β= -5.92, SE=4.42, p=.182), nor a significant three-way interaction of domain by time by stressor intensity (β= 8.86, SE=6.21, p=.154, Table S15), suggesting that there were no differences in the stress relief dependent on the WS Task domain.”

      (5) The subjective measure of control ("how in control do you feel right now") tends to follow a successful or failed attempt at the WS task. How much is the experience of control mediated by the degree of experienced success/schedule of reinforcement? Is it an assessment of control or, an evaluation of how well they are doing and/or resolution of uncertainty? An interesting paper by Cockburn et al. 2014 highlights the potential for positive prediction errors to enhance the desire for control.

      We thank the reviewer for this comment. Similar to comments regarding reward rate, our task does not allow us to fully separate control from success/reinforcement because of the manipulation of difficulty. However, we did undertake sensitivity analyses and the inclusion of overall win rate accounted for limited variance when predicting stress over and above subjective control and difficulty (page 16). 

      “To further isolate the relationship between subjective control and stress separate from perceived task difficulty or objective task performance, we also included the overall win rate (percentage of trials won during the WS task) in the models. In Study 1, lower feelings of control were related to higher levels of subjective stress (β= -0.12, p<.001) even when controlling for both win rate (β= -0.06, p=.220) and perceived task difficulty (β= 0.37, p<.001, Table S10). This also replicated in Study 2, where lower subjective control was associated with higher feelings of stress (β= -0.32, p<.001) when controlling for perceived task difficulty (β= 0.31, p<.001) and win rate (β= -0.11, p=.428, Table S11). This suggests that there is unique variance in subjective feelings of control, separate from task performance, relevant to subjective stress.”

      (6) While the authors do a very good job in their inclusion and synthesis of the relevant literature, they could also amplify some discussion in specific areas. For example, operationalizing task controllability via task difficulty is an interesting approach. It would be useful to discuss their approach (along with any others in the literature that have used it) and compare it to other typically used paradigms measuring control via presence or absence of choice, as mentioned by the authors briefly in the introduction.

      We are delighted to expand on this particular point and have done so in the Discussion on page 27:

      “Previous research typically accounts for different outcomes (e.g. punishment) by yoking controllable and uncontrollable conditions [3], though other work has manipulated the controllability of rewards by changing the reward rate (for example [30], where a decoy stimulus is rewarded 50% of the time in the low control condition but 80% in the high control condition). While our task design does not separate control from obtained reward, we are able to do so in the statistical analyses.”

      (7) The paper is well-written. However, it would be useful to expand on Figure 1 to include a) separate figures for study 1 (currently not included) and 2, and b) a timeline that includes the measurements of subjective stress (incorporated in Figure 1). It would also be helpful to include Figure S4 in the manuscript.

      We have expanded Figure 1 to include both Studies 1 and 2 and a timeline of when subjective stress was assessed throughout the experiment as well as adding Figure S4 to the main manuscript (now top panel within Figure 4). 

      Reviewer #1 (Recommendations for the authors):

      (1) Study 2 shows a greater decrease in subjective stress after the high-control task manipulation than after the pleasant video. One possible confound is whether the amount of time to complete the WS task and the video differ. It could be helpful to look at the average completion time for the WS task and compare that to the length of the videos. Alternatively, in future studies, control for this by dynamically adjusting the video play length to each participant based on how long they took to complete the WS task.

      This is an interesting suggestion. As a result, we have included the time taken as a covariate in the stress induction and stress relief analyses to ensure that any differences in time between the WS task and video task were not accounting for any of the stress induction or relief analyses. Controlling for the total time taken did not impact the stress induction or relief results. This is included in the Sensitivity and Exploratory Analyses section on page 24:

      “Our second sensitivity analysis was conducted because the experiment took longer to complete for the video condition (mean = 54.3 minutes, SD = 12.4 minutes) than the WS task condition (mean = 39.7 minutes, SD = 12.8 minutes, t(186.19)=-9.32, p<.001). We therefore included the total time (in ms) as a covariate in the stress induction and stress relief analyses for Study 2. This showed that accounting for total time did not change the results of interest (Table S13), further highlighting that the time by control interactions were robust.”
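
      The fractional degrees of freedom in the t-test quoted above, t(186.19), are what an unequal-variance (Welch) test produces; that reading is an assumption, since the response does not name the test. The sketch below shows how such a statistic is computed from summary values. The means and SDs are the ones quoted; the group sizes are hypothetical placeholders because they are not given here, so the output is illustrative and will not reproduce the reported value.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of the group 1 mean
    v2 = sd2 ** 2 / n2  # squared standard error of the group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Means and SDs (minutes) are quoted from the response: WS task, then video.
# The group sizes 100 and 90 are HYPOTHETICAL placeholders, not the study's values.
t, df = welch_t(39.7, 12.8, 100, 54.3, 12.4, 90)
print(f"t({df:.2f}) = {t:.2f}")
```

      With equal group sizes and equal SDs the degrees of freedom reduce to the familiar n1 + n2 - 2, which is why a non-integer value points to the unequal-variance form.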

      (2) Because participants received feedback about their success/failure in the WS task, a confounding factor could be that they received positive feedback on highly controllable trials and negative feedback on low control trials (and/or highly difficult trials). This would suggest that it is not controllability per se that contributes to stress perception but rather feedback valence. The authors show that this is a likely factor in their results in Study 2, which shows significant effects of the loss domain on perceived control and stress. Was a similar analysis done in Study 1? Do participants receive feedback in Study 1? It would be helpful to include this information somewhere in the manuscript. I would be curious to know whether *any* feedback at all influences controllability/stress perceptions.

      We thank the reviewer for this suggestion. Whether feedback valence is related to stress in Study 1 is an interesting question, and we have added this point to the Discussion on pages 27 and 28. To speak to this point, when we include the overall win rate (which captures the subsequent feedback received) when predicting subjective stress, win rate is not a significant predictor of stress over and above perceived difficulty and subjective control, suggesting that overall feedback valence may not be related to stress in Study 1. We take this as evidence that feedback may not be as important in terms of accounting for the relationship between stress and control. However, we unfortunately do not have any data in which no feedback was provided, so we cannot speak to this conclusively. This would be an interesting future study. The excerpt below is added to pages 27 and 28 of the discussion section:

      “Like with perceived difficulty, we statistically accounted for reward rate and showed that the relationship between subjective control and stress was not accounted for by reward rate, for example. Similarly, participants received feedback after every trial, and thus feedback valence may contribute to stress perception. However, given that overall win rate (which captures the feedback received during the task) did not predict stress over and above perceived difficulty or subjective control, it suggests that feedback is unlikely to relate to stress over and above difficulty. Future work will need to disentangle this further to rule out such potential confounds.”

      To respond specifically to the reviewer’s question about the feedback given to participants, written feedback was provided on screen to participants on a trial-by-trial basis in Study 1 as well (i.e. in both studies), and we have provided more clarity about this in the manuscript on page 8 as well as providing additional details in Table S3:

      “After each trial, participants were shown written feedback on screen as to whether the segment had successfully stopped on the red zone (or not), and the associated reward (or lack of). See Table S3 for details.”

      (3) I'm not sure how to interpret the fact that in Figure S1, the BICs are all essentially the same. Does this mean that you don't really need all of these varying aspects of the task to achieve the same effects? Could the task be made simpler?

      The similarity of BIC values suggests that a simpler WS task would have produced a worse account of the data approximately in keeping with the extent to which it is a simpler model. Here, the BIC scores for the models are similar, suggesting that adding these parameters adds explanatory power in keeping with what would have been expected from adding a parameter, but not more. We do note that the BIC is a relatively strict and conservative comparison. The fact that the most complex model overall narrowly improves parsimony, combined with the interpretable parameter values and the prior expectations given the task setup, led us to focus on this most complex model.
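
      The BIC comparison can be made concrete with the standard definition. The block below is a sketch that assumes the conventional form of the criterion; the response does not state the exact formula or sample size used, so $k$ (number of free parameters), $n$ (number of observations), and $\hat{L}$ (maximized likelihood) are generic symbols rather than the authors' values.

```latex
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}
\qquad\Longrightarrow\qquad
\Delta\mathrm{BIC} = \Delta k \,\ln n - 2\,\Delta \ln \hat{L}
```

      Under this definition, near-identical BIC values ($\Delta\mathrm{BIC} \approx 0$) mean that each added parameter improved the log-likelihood by roughly $\tfrac{1}{2}\ln n$, just enough to offset its complexity penalty.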

      (4) A minor point, but the authors refer to their sample as "neurotypical." Were they assessed for prior/current psychopathology/medications? If not, I might use a different term here (perhaps "non-clinical sample"), since some prior work has shown that online samples actually have higher instances of psychopathology compared to community samples.

      We have changed the phrasing of ‘neurotypical’ to a ‘non-clinical sample’ as recommended.

      Reviewer #2 (Recommendations for the authors):

      Figure S4 is very informative and could be presented in the main text.

      We have expanded Figure 1 to include both Studies 1 and 2 and a timeline of when subjective stress was assessed throughout the experiment as well as adding Figure S4 to the main manuscript (top panel of Figure 4). 

      References:

      Dorfman, H. M., & Gershman, S. J. (2019). Controllability governs the balance between Pavlovian and instrumental action selection. Nature Communications, 10(1), 5826. https://doi.org/10.1038/s41467-019-13737-7

      DuPont, C. M., Pressman, S. D., Reed, R. G., Manuck, S. B., Marsland, A. L., & Gianaros, P. J. (2022). An online Trier social stress paradigm to evoke affective and cardiovascular responses. Psychophysiology, 59(10), e14067. https://doi.org/10.1111/psyp.14067

      Jangraw, D. C., Keren, H., Sun, H., Bedder, R. L., Rutledge, R. B., Pereira, F., Thomas, A. G., Pine, D. S., Zheng, C., Nielson, D. M., & Stringaris, A. (2023). A highly replicable decline in mood during rest and simple tasks. Nature Human Behaviour, 7(4), 596–610. https://doi.org/10.1038/s41562-023-01519-7

      Meier, M., Haub, K., Schramm, M.-L., Hamma, M., Bentele, U. U., Dimitroff, S. J., Gärtner, R., Denk, B. F., Benz, A. B. E., Unternaehrer, E., & Pruessner, J. C. (2022). Validation of an online version of the trier social stress test in adult men and women. Psychoneuroendocrinology, 142, 105818. https://doi.org/10.1016/j.psyneuen.2022.105818

      Nasso, S., Vanderhasselt, M.-A., Demeyer, I., & De Raedt, R. (2019). Autonomic regulation in response to stress: The influence of anticipatory emotion regulation strategies and trait rumination. Emotion, 19(3), 443–454. https://doi.org/10.1037/emo0000448

      Schlatter, S., Schmidt, L., Lilot, M., Guillot, A., & Debarnot, U. (2021). Implementing biofeedback as a proactive coping strategy: Psychological and physiological effects on anticipatory stress. Behaviour Research and Therapy, 140, 103834. https://doi.org/10.1016/j.brat.2021.103834

      Steinbeis, N., Engert, V., Linz, R., & Singer, T. (2015). The effects of stress and affiliation on social decision-making: Investigating the tend-and-befriend pattern. Psychoneuroendocrinology, 62, 138–148. https://doi.org/10.1016/j.psyneuen.2015.08.003

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The manuscript reports a series of experiments designed to test whether optogenetic activation of infralimbic (IL) neurons facilitates extinction retrieval and whether this depends on animals' prior experience. In Experiment 1, rats underwent fear conditioning followed by either one or two extinction sessions, with IL stimulation given during the second extinction; stimulation facilitated extinction retrieval only in rats with prior extinction experience. Experiments 2 and 3 examined whether backward conditioning (CS presented after the US) could establish inhibitory properties that allowed IL stimulation to enhance extinction, and whether this effect was specific to the same stimulus or generalized to different stimuli. Experiments 5 - 7 extended this approach to appetitive learning: rats received backward or forward appetitive conditioning followed by extinction, and then fear conditioning, to determine whether IL stimulation could enhance extinction in contexts beyond aversive learning and across conditioning sequences. Across studies, the key claim is that IL activation facilitates extinction retrieval only when animals possess a prior inhibitory memory, and that this effect generalizes across aversive and appetitive paradigms.

      Strengths:

      (1) The design attempts to dissect the role of IL activity as a function of prior learning, which is conceptually valuable.

      We thank the Reviewer for their positive assessment.

      (2) The experimental design of probing different inhibitory learning approaches to probe how IL activation facilitates extinction learning was creative and innovative.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      (1) Non-specific manipulation.

      ChR2 was expressed in IL without distinction between glutamatergic and GABAergic populations. Without knowing the relative contribution of these cell types or the percentage of neurons affected, the circuit-level interpretation of the results is unclear.

      ChR2 was intentionally expressed in the infralimbic cortex (IL) without distinction between local neuronal populations for two reasons. First, this manuscript aimed to uncover some of the features characterizing the encoding of inhibitory memories in the IL, and this encoding likely engages interactions among various neuronal populations within the IL. Second, the hypotheses tested in the manuscript derived from findings that indiscriminately stimulated the IL using the GABA-A receptor antagonist picrotoxin, which is best mimicked by the approach taken. We agree that it is also important to determine the respective contributions of distinct IL neuronal populations to inhibitory encoding; however, the global approach implemented in the present experiments represents a necessary initial step. This rationale will be incorporated into the revised manuscript, which will also make reference to the need to identify the relative contributions of the various neuronal populations within the IL.

      (2) Extinction retrieval test conflates processes

      The retrieval test included 8 tones. Averaging across this many tone presentations conflates extinction retrieval/expression (early tones) with further extinction learning (later tones). A more appropriate analysis would focus on the first 2-4 tones to capture retrieval only. As currently presented, the data do not isolate extinction retrieval.

      It is unclear when retrieval of what has been learned across extinction ceases and additional extinction learning occurs. In fact, it is only the first stimulus presentation that unequivocally permits a distinction between retrieval and additional extinction learning, as the conditions for this additional learning have not been fulfilled at that presentation. However, confining evidence for retrieval to the first stimulus presentation introduces concerns that other factors could influence performance. For instance, processing of the stimulus present at the start of the session may differ from that present at the end of the previous session, thereby affecting what is retrieved. Such differences between the stimuli present at the start and end of an extinction session have been long recognized as a potential explanation for spontaneous recovery (Estes, 1955). More importantly, whether the test data presented confound retrieval and additional extinction learning or not, the interpretation remains the same with respect to the effects of a prior history of inhibitory learning on enabling the facilitative effects of IL stimulation. Finally, it is unclear how these facilitative effects could occur in the absence of the subjects retrieving the extinction memory formed under the stimulation. Nevertheless, the revised manuscript will provide the trial-by-trial performance during the post-extinction retrieval tests and discuss this issue.

      (3) Under-sampling and poor group matching.

      Sample sizes appear small, which may explain why groups are not well matched in several figures (e.g., 2b, 3b, 6b, 6c) and why there are several instances of unexpected interactions (protocol, virus, and period). This baseline mismatch raises concerns about the reliability of group differences.

      Efforts were made to match group performance upon completion of each training stage and before IL stimulation. Unfortunately, these efforts were not completely successful due to exclusions following post-mortem analyses. However, we acknowledge that the unexpected interactions deserve further discussion, and this will be incorporated into the revised manuscript (see also comment from Reviewer 2). Although we cannot exclude that sample sizes may have contributed to some of these interactions, we remain confident about the reliability of the main findings reported, especially given their replication across the various protocols. Overall, the manuscript provides evidence that IL stimulation does not facilitate brief extinction in the absence of prior inhibitory experience in five different experiments, replicating previous findings (Lingawi et al., 2018; Lingawi et al., 2017). It also replicates these previous findings by showing that prior experience with either fear or appetitive extinction enables IL stimulation to facilitate subsequent fear extinction. Furthermore, the facilitative effects of such stimulation following fear or appetitive backward conditioning are replicated in the present manuscript.  

      (4) Incomplete presentation of conditioning data.

      Figure 3 only shows a single conditioning session despite five days of training. Without the full dataset, it is difficult to evaluate learning dynamics or whether groups were equivalent before testing.

      We apologize, as we incorrectly labeled the X axis for the backward conditioning data set in Figures 3B, 4B, 4D and 5B. It should have indicated “Days” instead of “Trials”. This error will be corrected in the revised manuscript.

      (5) Interpretation stronger than evidence.

      The authors conclude that IL activation facilitates extinction retrieval only when an inhibitory memory has been formed. However, given the caveats above, the data are insufficient to support such a strong mechanistic claim. The results could reflect non-specific facilitation or disruption of behavior by broad prefrontal activation. Moreover, there is compelling evidence that optogenetic activation of IL during fear extinction does facilitate subsequent extinction retrieval without prior extinction training (Do-Monte et al 2015, Chen et al 2021), which the authors do not directly test in this study.

      As noted above, the revised manuscript will show that the interpretations of the main findings stand whether or not the test data confound retrieval with additional extinction learning. The revised manuscript will also clarify the plotting of the data for the backward conditioning stages. We do agree that further discussion of the unexpected interactions is necessary, and this will also be incorporated into the revised manuscript. However, the various replications of the core findings provide strong evidence for their reliability and the interpretations advanced in the original manuscript. The proposal that the results reflect non-specific facilitation or disruption of behavior seems highly unlikely. Indeed, the present experiments and previous findings (Lingawi et al., 2018; Lingawi et al., 2017) provide multiple demonstrations that IL stimulation fails to produce any facilitation in the absence of prior inhibitory experience with the target stimulus. Although these demonstrations appear inconsistent with previous studies (Do-Monte et al., 2015; Chen et al., 2021), this inconsistency is likely explained by the fact that these studies manipulated activity in specific IL neuronal populations. Previous work has already revealed differences between manipulations targeting discrete IL neuronal populations as opposed to general IL activity (Kim et al., 2016). Importantly, as previously noted, the present manuscript aimed to generally explore inhibitory encoding in the IL that, as we will acknowledge, is likely to engage several neuronal populations within the IL. Adequate statements on these matters will be included in the revised manuscript.

      Impact:

      The role of IL in extinction retrieval remains a central question in the fear learning literature. However, because the test used conflates extinction retrieval with new learning and the manipulations lack cell-type specificity, the evidence presented here does not convincingly support the main claims. The study highlights the need for more precise manipulations and more rigorous behavioral testing to resolve this issue.

      As noted in our responses, the interpretations of the data presented remain identical whether the test data conflate extinction retrieval with additional extinction learning or not. Although we agree that it is important to establish the role of specific IL neuronal populations in extinction learning, this was beyond the scope of the manuscript and the findings reported remain valuable to our understanding of inhibitory encoding within the IL.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors examine the mechanisms by which stimulation of the infralimbic cortex (IL) facilitates the retention and retrieval of inhibitory memories. Previous work has shown that optogenetic stimulation of the IL suppresses freezing during extinction but does not improve extinction recall when extinction memory is probed one day later. When stimulation occurs during a second extinction session (following a prior stimulation-free extinction session), freezing is suppressed during the second extinction as well as during the tone test the following day. The current study was designed to further explore the facilitatory role of the IL in inhibitory learning and memory recall. The authors conducted a series of experiments to determine whether recruitment of IL extends to other forms of inhibitory learning (e.g., backward conditioning) and to inhibitory learning involving appetitive conditioning. Further, they assessed whether their effects could be explained by stimulus familiarity. The results of their experiments show that backward conditioning, another form of inhibitory learning, also enabled IL stimulation to enhance fear extinction. This phenomenon was not specific to aversive learning, as backward appetitive conditioning similarly allowed IL stimulation to facilitate extinction of aversive memories. Finally, the authors ruled out the possibility that IL facilitated extinction merely because of prior experience with the stimulus (e.g., reducing the novelty of the stimulus). These findings significantly advance our understanding of the contribution of IL to inhibitory learning. Namely, they show that the IL is recruited during various forms of inhibitory learning, and its involvement is independent of the motivational value associated with the unconditioned stimulus.

      Strengths:

      (1) Transparency about the inclusion of both sexes and the representation of data from both sexes in figures.

      We thank the Reviewer for their positive assessment.

      (2) Very clear representation of groups and experimental design for each figure.

      We thank the Reviewer for their positive assessment.

      (3) The authors were very rigorous in determining the neurobehavioral basis for the effects of IL stimulation on extinction. They considered multiple interpretations and designed experiments to address these possible accounts of their data.

      We thank the Reviewer for their positive assessment.

      (4) The rationale for and the design of the experiments in this manuscript are clearly based on a wealth of knowledge about learning theory. The authors leveraged this expertise to narrow down how the IL encodes and retrieves inhibitory memories.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      (1) In Experiment 1, although not statistically significant, it does appear as though the stimulation groups (OFF and ON) differ during Extinction 1. It seems like this may be due to a difference between these groups after the first forward conditioning. Could the authors have prevented this potential group difference in Extinction 1 by re-balancing group assignment after the first forward conditioning session to minimize the differences in fear acquisition (the authors do report a marginally significant effect between the groups that would undergo one vs. two extinction sessions in their freezing during the first conditioning session)?

      As noted (see response to Reviewer 1), efforts were made daily to match group performance across the training stages, but these efforts were ultimately hampered by the necessary exclusions following post-mortem analyses. This will be made explicit in the revised manuscript. Regarding freezing during Extinction 1, as noted by the Reviewer, the difference, which was not statistically significant, was absent across trials during the subsequent forward fear conditioning stage. Likewise, the protocol difference observed during the initial forward fear conditioning was absent in subsequent stages. We are therefore confident that these initial differences (significant or not) did not impact the main findings at test. Importantly, these findings replicate previous work using identical protocols in which no differences were present during the training stages. These considerations will be addressed in the revised manuscript.

      (2) Across all experiments (except for Experiment 1), the authors state that freezing during the initial conditioning increased across "days". The figures that correspond to this text, however, show that freezing changes across trials. In the methods, the authors report that backward conditioning occurred over 5 days. It would be helpful to understand how these data were analyzed and collated to create the final figures. Was the freezing averaged across the five days for each trial for analyses and figures?

      We apologize; as noted above, we incorrectly labeled the X axis for the backward conditioning data sets in Figures 3B, 4B, 4D and 5B. It should have indicated “Days” instead of “Trials”. The data shown in these Figures use the average of all trials on a given day. This will be clarified in the methods section of the revised manuscript. The labeling errors on the Figures will be corrected.
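
      The collation described above (one value per day, obtained by averaging all trials on that day) can be sketched as follows. This is an illustration only: the freezing values and the `daily_means` helper are hypothetical and are not taken from the manuscript.

```python
from statistics import mean

# HYPOTHETICAL percent-freezing scores for one animal: day -> per-trial values.
freezing_by_day = {
    1: [10.0, 20.0, 30.0],
    2: [20.0, 30.0, 40.0],
}


def daily_means(trials_by_day):
    """Collapse per-trial scores into a single mean per day, in day order."""
    return {day: mean(scores) for day, scores in sorted(trials_by_day.items())}


print(daily_means(freezing_by_day))
```

      Plotting these per-day means is what makes “Days”, rather than “Trials”, the appropriate X-axis label.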

      (3) In Experiment 3, the authors report a significant Protocol X Virus interaction. It would be useful if the authors could conduct post-hoc analyses to determine the source of this interaction. Inspection of Figure 4B suggests that freezing during the two different variants of backward conditioning differs between the virus groups. Did the authors expect to see a difference in backward conditioning depending on the stimulus used in the conditioning procedure (light vs. tone)? The authors don't really address this confounding interaction, but I do think a discussion is warranted.

      We agree with the Reviewer that further discussion of the Protocol x Virus interaction that emerged during the backward conditioning and forward conditioning stages of Experiment 3 is warranted. This will be provided in the revised manuscript. Briefly, during both stages, follow-up analyses did not reveal any differences (main effects or interactions) between the two groups trained with the light stimulus (Diff-EYFP and Diff-ChR2). By contrast, the ChR2 group trained with the tone (Back-ChR2) froze more overall than the EYFP group (Back-EYFP), but there were no other significant differences between the two groups. Based on these analyses, the Protocol x Virus interaction appears to be driven by greater freezing in the ChR2 group trained with the tone rather than a difference in the backward conditioning performance based on stimulus identity. Consistent with this, the statistical analyses did not reveal a main effect of Protocol during either the backward conditioning stage or the stimulus trials during the forward conditioning stage. Nevertheless, during this latter stage, a main effect of Protocol emerged during baseline performance, but once again, this seems to be driven by the Back-ChR2 group. Critically, it is unclear how greater stimulus freezing in the Back-ChR2 group during forward conditioning would lead to lower freezing during the post-extinction retrieval test.  

      (4) In this same experiment, the authors state that freezing decreased during extinction; however, freezing in the Diff-EYFP group at the start of extinction (first bin of trials) doesn't look appreciably different than their freezing at the end of the session. Did this group actually extinguish their fear? Freezing on the tone test day also does not look too different from freezing during the last block of extinction trials.

      We confirm that overall, there was a significant decline in freezing across the extinction session shown in Figure 4B. The Reviewer is correct to point out that this decline was modest (if not negligible) in the Diff-EYFP group, which was receiving its first inhibitory training with the target tone stimulus. It is worth noting that across all experiments, most groups that did not receive infralimbic stimulation displayed only a modest decline in freezing during the extinction session because it was relatively brief, involving only 6 or 8 tone-alone presentations. This was intentional, as we aimed for the brief extinction session to generate minimal inhibitory learning and thereby to detect any facilitatory effect of infralimbic stimulation. This issue will be clarified and explained in the revised version of the manuscript.

      (5) The Discussion explored the outcomes of the experiments in detail, but it would be useful for the authors to discuss the implications of their findings for our understanding of circuits in which the IL is embedded that are involved in inhibitory learning and memory. It would also be useful for the authors to acknowledge in the Discussion that although they did not have the statistical power to detect sex differences, future work is needed to explore whether IL functions similarly in both sexes.

      In line with the Reviewer’s suggestion (see also Reviewer 3), the revised manuscript will include a discussion of the broader implications of the findings regarding inhibitory brain circuitry and will acknowledge the need to further explore sex differences and IL functions.

      Reviewer #3 (Public review):

      Summary:

      This is a really nice manuscript with different lines of evidence to show that the IL encodes inhibitory memories that can then be manipulated by optogenetic stimulation of these neurons during extinction. The behavioral designs are excellent, with converging evidence using extinction/re-extinction, backwards/forwards aversive conditioning, and backwards appetitive/forwards aversive conditioning. Additional factors, such as nonassociative effects of the CS or US, are also considered, and the authors evaluate the inhibitory properties of the CS with tests of conditioned inhibition.

      Strengths:

      The experimental designs are very rigorous with an unusual level of behavioral sophistication.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      (1) More justification for parametric choices (number of days of backwards vs forwards conditioning) could be provided.

      All experimental parameters were based on previously published experiments showing the capacity of the backward conditioning protocols to generate inhibitory learning and the forward conditioning protocols to produce excitatory learning. Although this was mentioned in the methods section, we acknowledge that further explanation is required to justify the need for multiple days of backward training. This will be provided in the revised manuscript.

      (2) The current discussion could be condensed and could focus on broader implications for the literature.

      The revised manuscript will condense the discussion and focus on broader implications for the literature.

      References

      Chen, Y.-H., Wu, J.-L., Hu, N.-Y., Zhuang, J.-P., Li, W.-P., Zhang, S.-R., Li, X.-W., Yang, J.-M., & Gao, T.-M. (2021). Distinct projections from the infralimbic cortex exert opposing effects in modulating anxiety and fear. J Clin Invest, 131(14), e145692. https://doi.org/10.1172/JCI145692

      Do-Monte, F. H., Manzano-Nieves, G., Quiñones-Laracuente, K., Ramos-Medina, L., & Quirk, G. J. (2015). Revisiting the role of infralimbic cortex in fear extinction with optogenetics. J Neurosci, 35(8), 3607-3615. https://doi.org/10.1523/JNEUROSCI.3137-14.2015

      Estes, W. K. (1955). Statistical theory of spontaneous recovery and regression. Psychol Rev, 62(3), 145-154. https://doi.org/10.1037/h0048509

      Kim, H.-S., Cho, H.-Y., Augustine, G. J., & Han, J.-H. (2016). Selective Control of Fear Expression by Optogenetic Manipulation of Infralimbic Cortex after Extinction. Neuropsychopharmacology, 41(5), 1261-1273. https://doi.org/10.1038/npp.2015.276

      Lingawi, N. W., Holmes, N. M., Westbrook, R. F., & Laurent, V. (2018). The infralimbic cortex encodes inhibition irrespective of motivational significance. Neurobiol Learn Mem, 150, 64-74. https://doi.org/10.1016/j.nlm.2018.03.001

      Lingawi, N. W., Westbrook, R. F., & Laurent, V. (2017). Extinction and Latent Inhibition Involve a Similar Form of Inhibitory Learning that is Stored in and Retrieved from the Infralimbic Cortex. Cereb Cortex, 27(12), 5547-5556. https://doi.org/10.1093/cercor/bhw322

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this study, Ana Lapao et al. investigated the roles of Rab27 effector SYTL5 in cellular membrane trafficking pathways. The authors found that SYTL5 localizes to mitochondria in a Rab27A-dependent manner. They demonstrated that SYTL5-Rab27A positive vesicles containing mitochondrial material are formed under hypoxic conditions, thus they speculate that SYTL5 and Rab27A play roles in mitophagy. They also found that both SYTL5 and Rab27A are important for normal mitochondrial respiration. Cells lacking SYTL5 undergo a shift from mitochondrial oxygen consumption to glycolysis which is a common process known as the Warburg effect in cancer cells. Based on the cancer patient database, the author noticed that low SYTL5 expression is related to reduced survival for adrenocortical carcinoma patients, indicating SYTL5 could be a negative regulator of the Warburg effect and potentially tumorigenesis.

      Strengths:

      The authors take advantage of multiple techniques and novel methods to perform the experiments.

      (1) Live-cell imaging revealed that stably inducible expression of SYTL5 co-localized with filamentous structures positive for mitochondria. This result was further confirmed by using correlative light and EM (CLEM) analysis and western blotting from purified mitochondrial fraction.

      (2) In order to investigate whether SYTL5 and Rab27A are required for mitophagy in hypoxic conditions, two established mitophagy reporter U2OS cell lines were used to analyze the autophagic flux.

      Weaknesses:

      This study revealed a potential function of SYTL5 in mitophagy and mitochondrial metabolism. However, the mechanistic evidence that establishes the relationship between SYTL5/Rab27A and mitophagy is insufficient. The involvement of SYTL5 in ACC needs more investigation. Furthermore, images and results supporting the major conclusions need to be improved.

      We thank the reviewer for their constructive comments. We agree that a complete understanding of the mechanism by which SYTL5 and Rab27A are recruited to the mitochondria and subsequently involved in mitophagy requires further investigation. Here, we have shown that SYTL5 recruitment to the mitochondria requires both its lipid-binding C2 domains and the Rab27A-binding SHD domain (Figure 1G-H). This implies a coincidence detection mechanism for mitochondrial localisation of SYTL5.  Additionally, we find that mitochondrial recruitment of SYTL5 is dependent on the GTPase activity and mitochondrial localisation of Rab27A (Figure 2D-E). We also identified proteins linked to the cellular response to oxidative stress, reactive oxygen species metabolic process, regulation of mitochondrion organisation and protein insertion into mitochondrial membrane to be enriched in the SYTL5 interactome (Figure 3A and C).

      However, the mitochondrial localisation of Rab27A is less well understood. To investigate this, we have now performed a mass spectrometry analysis to identify the interactome of Rab27A (see Author response table 1 below). U2OS cells with stable expression of mScarlet-Rab27A or mScarlet only were subjected to immunoprecipitation, followed by MS analysis. Of the 32 significant Rab27A-interacting hits (compared to control), two are located in the inner mitochondrial membrane (IMM): ATP synthase F(1) complex subunit alpha (P25705) and mitochondrial very long-chain specific acyl-CoA dehydrogenase (VLCAD) (P49748). However, as these IMM proteins are not likely to be involved in the mitochondrial recruitment of Rab27A observed under basal conditions, we chose not to include these data in the manuscript.

      It is known that other RAB proteins are recruited to the mitochondria. During parkin-mediated mitophagy, RABGEF1 (a guanine nucleotide exchange factor) is recruited through its ubiquitin-binding domain and directs mitochondrial localisation of RAB5, which subsequently leads to recruitment of RAB7 by the MON1/CCZ1 complex[1]. As already mentioned in the discussion (p. 12), ubiquitination of the Rab27A GTPase activating protein alpha (TBC1D10A) is reduced in the brain of Parkin KO mice compared to controls[35], suggesting a possible connection of Rab27A with regulatory mechanisms that are linked with mitochondrial damage and dysfunction. While this is an interesting avenue to explore, in this paper we will not follow up further on the mechanism of mitochondrial recruitment of Rab27A.

      Author response table 1.

      Rab27A interactome. Proteins co-immunoprecipitated with mScarlet-Rab27A vs the mScarlet-expressing control. The data show the average of three replicates.

      To investigate the role of SYTL5 in the context of ACC, we acquired the NCI-H295R cell line isolated from the adrenal gland of an adrenal cancer patient. The cells were cultured as recommended by ATCC using DMEM/F-12 supplemented with NuSerum and ITS+ Premix. It is important to note that the H295R cells were adapted to grow as an adherent monolayer from the H295 cell line, which grows in suspension. However, there can still be many viable H295R cells in the media.

      We attempted to conduct OCR and ECAR measurements using the Seahorse XF upon knockdown of SYTL5 and/or Rab27A in H295R cells. For these assays, it is essential that the cells be seeded in a monolayer at 70-90% confluency with no cell clusters[4]. Poor adhesion of the cells can cause inaccurate measurements by the analyser. Unfortunately, the results between the five replicates we carried out were highly inconsistent; the same knockdown produced trends in opposite directions in different replicates. This is likely due to problems with seeding the cells. Despite our best efforts to optimise the seeding number and pre-coat the plate with poly-D-lysine[5], we observed poor attachment of cells and an inability to form a monolayer.

      To study the localisation of SYTL5 and Rab27A in an ACC model, we transduced the H295R cells with lentiviral particles to overexpress pLVX-SV40-mScarlet-I-Rab27A and pLVX-CMV-SYTL5-EGFP-3xFLAG. Again, this proved unsuccessful after numerous attempts at optimising transduction. 

      These issues limited our investigation into the role of SYTL5 in ACC to the cortisol assay (Supplementary Figure 6). For this the H295R cells were an appropriate model as they are able to produce an array of adrenal cortex steroids[6] including cortisol[7]. In this assay, measurements are taken from cell culture supernatants, so the confluency of the cells does not prevent consistent results as the cortisol concentration was normalised to total protein per sample. With this assay we were able to rule out a role for SYTL5 and Rab27A in the secretion of cortisol.  
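
      The normalisation mentioned above can be illustrated with a minimal sketch. It is not the authors' analysis code: the function name, the units and all numbers are hypothetical, and the only assumption taken from the text is that the cortisol concentration in each supernatant is divided by the total protein of the same sample.

```python
def normalise_cortisol(cortisol_ng_per_ml: float, total_protein_mg: float) -> float:
    """Return cortisol per mg of total protein for one sample (hypothetical units)."""
    if total_protein_mg <= 0:
        raise ValueError("total protein must be positive")
    return cortisol_ng_per_ml / total_protein_mg


# Hypothetical wells: the second well holds twice as many cells, so it
# contains twice the protein and releases twice the cortisol.
sparse_well = normalise_cortisol(cortisol_ng_per_ml=10.0, total_protein_mg=2.0)
dense_well = normalise_cortisol(cortisol_ng_per_ml=20.0, total_protein_mg=4.0)

# After normalisation the two wells are directly comparable.
print(sparse_well, dense_well)
```

      Under these assumptions, differences in cell number between wells cancel out, which is why uneven confluency matters less for this assay than for the Seahorse measurements.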

      Another consideration when investigating the involvement of SYTL5 in ACC, is that in general ACC cells should have a low expression of SYTL5 as is seen from the patient expression data (Figure 6B).

      The reviewer also writes “Furthermore, images and results supporting the major conclusions need to be improved.” We have tried several times, without success, to generate U2OS cells with CRISPR/Cas9-mediated C-terminal tagging of endogenous SYTL5 with mNeonGreen, using an approach that has been successfully implemented in the lab for other genes. This failure is likely due to a lack of suitable sgRNAs targeting the C-terminal region of SYTL5: the available sgRNAs have a low predicted efficiency score and a large number of predicted off-target sites in the human genome, including several other gene exons and introns (see Author response image 2).

      We have also included new data (Supplementary Figure 4B) showing that some of the hypoxia-induced SYTL5-Rab27A-positive vesicles stain positive for the autophagy markers p62 and LC3B when inhibiting lysosomal degradation, further strengthening our data that SYTL5 and Rab27A function as positive regulators of mitophagy.  

      Reviewer #2 (Public review): 

      Summary:

      The authors provide convincing evidence that Rab27 and SYTL5 work together to regulate mitochondrial activity and homeostasis.

      Strengths:

      The development of models that allow the function to be dissected, and the rigorous approach and testing of mitochondrial activity.

      Weaknesses:

      There may be unknown redundancies in both pathways in which Rab27 and SYTL5 are working which could confound the interpretation of the results.

      Suggestions for revision:

      Given that Rab27A and SYTL5 are members of protein families it would be important to exclude any possible functional redundancies coming from Rab27B expression or one of the other SYTL family members. For Rab27 this would be straightforward to test in the assays shown in Figure 4 and Supplementary Figure 5. For SYTL5 it might be sufficient to include some discussion about this possibility.

      We thank the reviewer for pointing out the potential redundancy issue for Rab27A and SYTL5. There are multiple studies demonstrating redundancy between Rab27A and Rab27B. For example, in a study of Griscelli syndrome, a disease caused by Rab27A loss of function, expression of either Rab27A or Rab27B rescues the healthy phenotype, indicating redundancy[8]. This redundancy, however, applies only to certain functions and cell types. In fact, in a study regarding hair growth, knockdown of Rab27B had the opposite effect to knockdown of Rab27A[9].

      In this paper, we conducted all assays in U2OS cells, in which the expression of Rab27B is very low. Human Protein Atlas reports expression of 0.5 nTPM for Rab27B, compared to 18.4 nTPM for Rab27A. We also observed this low level of expression of Rab27B compared to Rab27A by qPCR in U2OS cells. Therefore, there would be very little endogenous Rab27B expression in cells depleted of Rab27A (with siRNA or KO). In line with this, Rab27B peptides were not detected in our SYTL5 interactome MS data (Table 1 in paper). Moreover, as Rab27A depletion inhibits mitochondrial recruitment of SYTL5 and mitophagy, it is not likely that Rab27B provides functional redundancy. It is possible that Rab27B overexpression could rescue mitochondrial localisation of SYTL5 in Rab27A KO cells, but this was not tested as we do not have any evidence for a role of Rab27B in these cells. Taken together, we believe our data imply that Rab27B is very unlikely to provide any functional redundancy to Rab27A in our experiments.

      For the SYTL family, all five members are Rab27 effectors, binding to Rab27 through their SHD domain. Together with Rab27, all SYTLs have been implicated in exocytosis in different cell types: for example, SYTL1 in exocytosis of azurophilic granules from neutrophils[10], SYTL2 in secretion of glucagon granules from pancreatic α cells[11], SYTL3 in secretion of lytic granules from cytotoxic T lymphocytes[12], SYTL4 in exocytosis of dense hormone-containing granules from endocrine cells[13] and SYTL5 in secretion of the RANKL cytokine from osteoblasts[14]. This indicates a potential for redundancy through their binding to Rab27 and function in vesicle secretion/trafficking. However, one study found that different Rab27 effectors have distinct functions at different stages of exocytosis[15].

      Very little is known about redundancy or hierarchy between these proteins. Differences in function may be due to the variation in gene expression profile across tissues for the different SYTLs (see Author response image 1 below). SYTL5 is enriched in the brain unlike the others, suggesting possible tissue-specific functions. There are also differences in the binding affinities and calcium sensitivities of the C2A and C2B domains between the SYTL proteins[16].

      Author response image 1.

      GTEx Multi Gene Query for SYTL1-5

      All five SYTLs are expressed in the U2OS cell line, with nTPMs according to Human Protein Atlas of SYTL1: 7.5, SYTL2: 13.4, SYTL3: 14.2, SYTL4: 8.7, SYTL5: 4.8. In line with this, in the Rab27A interactome, when comparing cells overexpressing mScarlet-Rab27A with control cells, we detected all five SYTLs as specific Rab27A-interacting proteins (see Author response table 1 above). In contrast, in the SYTL5 interactome we did not detect any other SYTL protein (Table 1 in paper), confirming that they do not form a complex with SYTL5.

      We have included the following text in the discussion (p. 12): “SYTL5 and Rab27A are both members of protein families, suggesting possible functional redundancies from Rab27B or one of the other SYTL isoforms. While Rab27B has a very low expression in U2OS cells, all five SYTL’s are expressed. However, when knocking out or knocking down SYTL5 and Rab27A we observe significant effects that we presume would be negated if their isoforms were providing functional redundancies. Moreover, we did not detect any other SYTL protein or Rab27B in the SYTL5 interactome, confirming that they do not form a complex with SYTL5.”

      Suggestions for Discussion: 

      Both Rab27A and SYTL5 localize to other membranes, including the endolysosomal compartments. How do the authors envisage the mechanism or cellular modifications that allow these proteins, either individually or in complex, to function also to regulate mitochondrial function? It would be interesting to have some views.

      We agree that it would be interesting to better understand the mechanism involved in modulation of the localisation and function of SYTL5 and Rab27A at different cellular compartments, including the mitochondria. Here, we have shown that SYTL5 recruitment to the mitochondria involves coincidence detection, as both its lipid-binding C2 domains and the Rab27A-binding SHD domain are required (Figure 1G-H). Both these domains also seem to be required for localisation of SYTL5 to vesicles, and we can only speculate that binding to different lipids (Figure 1F) may regulate SYTL5 localisation. Additionally, we find that mitochondrial recruitment of SYTL5 is dependent on the GTPase activity and mitochondrial localisation of Rab27A (Figure 2D-E). However, this also seems to be the case for vesicular recruitment of SYTL5, although a few SYTL5-Rab27A (T23N) positive vesicles were seen (Figure 2E).

      To characterise the mechanisms involved in mitochondrial localisation of Rab27A, we have performed mass spectrometry analysis to identify the interactome of Rab27A (see Author response table 1 above). U2OS cells with stable expression of mScarlet-Rab27A or mScarlet only were subjected to immunoprecipitation, followed by MS analysis.  Of the 32 significant Rab27A-interacting hits (compared to control), two of the hits localise in the inner mitochondrial membrane (IMM); ATP synthase F(1) complex subunit alpha (P25705), and mitochondrial very long-chain specific acyl-CoA dehydrogenase (VLCAD)(P49748). However, as these IMM proteins are not likely involved in mitochondrial recruitment of Rab27A, observed under basal conditions, we chose not to include these data in the manuscript. 

      It is known that other RAB proteins are recruited to the mitochondria by regulation of their GTPase activity. During parkin-mediated mitophagy, RABGEF1 (a guanine nucleotide exchange factor) is recruited through its ubiquitin-binding domain and directs mitochondrial localisation of RAB5, which subsequently leads to recruitment of RAB7 by the MON1/CCZ1 GEF complex[1]. As already mentioned in the discussion (p. 12), ubiquitination of the Rab27A GTPase activating protein alpha (TBC1D10A) is reduced in the brain of Parkin KO mice compared to controls[35], suggesting a possible connection of Rab27A with regulatory mechanisms that are linked with mitochondrial damage and dysfunction. While this is an interesting avenue to explore, it is beyond the scope of this paper.

      Our data suggest that SYTL5 functions as a negative regulator of the Warburg effect, the switch from OXPHOS to glycolysis. While both SYTL5 and Rab27A seem to be required for mitophagy of selective mitochondrial components, and their depletion leads to reduced mitochondrial respiration and ATP production, only depletion of SYTL5 caused a switch to glycolysis. The mechanisms involved are unclear, but we found several proteins linked to the cellular response to oxidative stress, reactive oxygen species metabolic process, regulation of mitochondrion organisation and protein insertion into mitochondrial membrane to be enriched in the SYTL5 interactome (Figure 3A and C).

      We have addressed this comment in the discussion on p. 12.

      Reviewer #3 (Public review):

      Summary:

      In the manuscript by Lapao et al., the authors uncover a role for the Rab27A effector protein SYTL5 in regulating mitochondrial function and turnover. The authors find that SYTL5 localizes to mitochondria in a Rab27A-dependent way and that loss of SYTL5 (or Rab27A) impairs lysosomal turnover of an inner mitochondrial membrane mitophagy reporter but not a matrix-based one. As the authors see no co-localization of GFP/mScarlet tagged versions of SYTL5 or Rab27A with LC3 or p62, they propose that lysosomal turnover is independent of the conventional autophagy machinery. Finally, the authors go on to show that loss of SYTL5 impacts mitochondrial respiration and ECAR and as such may influence the Warburg effect and tumorigenesis. Of relevance here, the authors go on to show that SYTL5 expression is reduced in adrenocortical carcinomas and this correlates with reduced survival rates.

      Strengths:

      There are clearly interesting and new findings here that will be relevant to those following mitochondrial function, the endocytic pathway, and cancer metabolism.

      Weaknesses:

      The data feel somewhat preliminary in that the conclusions rely on exogenously expressed proteins and reporters, which do not always align.

      As the authors note there are no commercially available antibodies that recognize endogenous SYTL5, hence they have had to stably express GFP-tagged versions. However, it appears that the level of expression dictates co-localization from the examples the authors give (though it is hard to tell as there is a lack of any kind of quantitation for all the fluorescent figures). Therefore, the authors may wish to generate an antibody themselves or tag the endogenous protein using CRISPR.

      We agree that the level of SYTL5 expression is likely to affect its localisation. As suggested by the reviewer, we have tried hard, without success, to generate U2OS cells with CRISPR knock-in of an mNeonGreen tag at the C-terminus of endogenous SYTL5, using an approach that has been successfully implemented in the lab for other genes. This failure is likely due to a lack of suitable sgRNAs targeting the C-terminal region of SYTL5: the available sgRNAs have a low predicted efficiency score and a large number of predicted off-target sites in the human genome, including several other gene exons and introns (see Author response image 2).

      Author response image 2.

      Overview of sgRNAs targeting the C-terminal region of SYTL5 

      Although the SYTL5 expression level might affect its cellular localisation, we also found the mitochondrial localisation of SYTL5-EGFP to be strongly increased in cells co-expressing mScarlet-Rab27A, supporting our findings of Rab27A-mediated mitochondrial recruitment of SYTL5. We have also included new data (Supplementary Figure 4B) showing that some of the hypoxia-induced SYTL5-Rab27A-positive vesicles stain positive for the autophagy markers p62 and LC3B when inhibiting lysosomal degradation, further strengthening our data that SYTL5 and Rab27A function as positive regulators of mitophagy.

      In relation to quantitation, the authors found that SYTL5 localizes to multiple compartments or potentially a few compartments that are positive for multiple markers. Some quantitation here would be very useful as it might inform on function. 

      We find that SYTL5-EGFP localizes to mitochondria, lysosomes and the plasma membrane in U2OS cells with stable expression of SYTL5-EGFP and in SYTL5/Rab27A double knock-out cells rescued with SYTL5-EGFP and mScarlet-Rab27A. We also see colocalization of SYTL5-EGFP with endogenous p62, LC3 and LAMP1 upon induction of mitophagy. However, as these cell lines comprise a heterogeneous pool with high variability, we do not believe that quantification of the overexpressing cell lines would provide useful information in this scenario. As described above, we have tried several times to generate SYTL5 knock-in cells without success.

      The authors find that upon hypoxia/hypoxia-like conditions that punctate structures of SYTL5 and Rab27A form that are positive for Mitotracker, and that a very specific mitophagy assay based on pSu9-Halo system is impaired by siRNA of SYTL5/Rab27A, but another, distinct mitophagy assay (Matrix EGFP-mCherry) shows no change. I think this work would strongly benefit from some measurements with endogenous mitochondrial proteins, both via immunofluorescence and western blot-based flux assays. 

      In addition to the western blotting for different endogenous ETC proteins showing significantly increased levels of MTCO1 in cells depleted of SYTL5 and/or Rab27A (Figure 5E-F), we have now blotted for the endogenous mitochondrial proteins, COXIV and BNIP3L, in DFP and DMOG conditions upon knockdown of SYTL5 and/or Rab27A (Figure 5G and Supplementary Figure 5A). Although there was a trend towards increased levels, we did not see any significant changes in total COXIV or BNIP3L levels when SYTL5, Rab27A or both are knocked down compared to siControl. Blotting for endogenous mitochondrial proteins is however not the optimum readout for mitophagy. A change in mitochondrial protein level does not necessarily result from mitophagy, as other factors such as mitochondrial biogenesis and changes in translation can also have an effect. Mitophagy is a dynamic process, which is why we utilise assays such as the HaloTag and mCherry-EGFP double tag as these indicate flux in the pathway. Additionally, as mitochondrial proteins have different half-lives, with many long-lived mitochondrial proteins[17], differences in turnover rates of endogenous proteins make the results more difficult to interpret. 

      A really interesting aspect is the apparent independence of this mitophagy pathway on the conventional autophagy machinery. However, this is only based on a lack of co-localization between p62 or LC3 with LAMP1 and GFP/mScarlet tagged SYTL5/Rab27A. However, I would not expect them to greatly colocalize in lysosomes as both the p62 and LC3 will become rapidly degraded, while the eGFP and mScarlet tags are relatively resistant to lysosomal hydrolysis. -/+ a lysosome inhibitor might help here and ideally, the functional mitophagy assays should be repeated in autophagy KOs.

      We thank the reviewer for this suggestion. We have now repeated the colocalisation studies in cells treated with DFP with the addition of bafilomycin A1 (BafA1) to inhibit the lysosomal V-ATPase. Indeed, we find that a few of the SYTL5/Rab27A/MitoTracker positive structures also stain positive for p62 and LC3 (Supplementary Figure 4B). As expected, the occurrence of these structures was rare, as BafA1 was only added for the last 4 hrs of the 24 hr DFP treatment. However, we cannot exclude the possibility that there are two different populations of these vesicles.

      The link to tumorigenesis and cancer survival is very interesting but it is not clear if this is due to the mitochondrially-related aspects of SYTL5 and Rab27A. For example, increased ECAR is seen in the SYTL5 KO cells but not in the Rab27A KO cells (Fig.5D), implying that mitochondrial localization of SYTL5 is not required for the ECAR effect. More work to strengthen the link between the two sections in the paper would help with future directions and impact with respect to future cancer treatment avenues to explore.

      We agree that the role of SYTL5 in ACC requires future investigation. While we observe reduced OXPHOS levels in both SYTL5 and Rab27A KO cells (Figure 5B), glycolysis was only increased in SYTL5 KO cells (Figure 5D). ECAR was unchanged in both the Rab27A KO and Rab27A/SYTL5 dKO cells, which suggests that Rab27A is required for the increase in ECAR when SYTL5 is depleted. We therefore believe that SYTL5 negatively regulates Rab27A. The mechanism involved is unclear, but we found several proteins linked to the cellular response to oxidative stress, reactive oxygen species metabolic process, regulation of mitochondrion organisation and protein insertion into mitochondrial membrane to be enriched in the SYTL5 interactome (Figure 3A and C).

      To investigate the link to cancer further, we tested the effect of knockdown of SYTL5 and/or Rab27A on the levels of mitochondrial ROS. ROS levels were measured by flow cytometry using the MitoSOX Red dye, together with the MitoTracker Green dye to normalise ROS levels to the total mitochondria. Cells were treated with the antioxidant N-acetylcysteine (NAC)[18] as a negative control and menadione as a positive control, as menadione induces ROS production via redox cycling[19]. Note that cellular autofluorescence makes it impossible to obtain a level of ‘zero ROS’ in this experiment. We did not see a change in ROS with knockdown of SYTL5 and/or Rab27A compared to the NAC-treated or siControl samples (see Author response image 3 below). The menadione samples confirm the success of the experiment, as ROS accumulated in these cells. Thus, based on this, we do not believe that low SYTL5 expression would affect ROS levels in ACC tumours.
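
      The normalisation of ROS to total mitochondria described above can be sketched as a simple ratio. This is an illustration only, not the authors' analysis pipeline: the function names, the use of mean fluorescence intensity (MFI) per replicate and all numbers are assumptions.

```python
from statistics import mean


def normalised_ros(mitosox_mfi: float, mitotracker_mfi: float) -> float:
    """MitoSOX Red signal per unit of MitoTracker Green signal (hypothetical MFIs)."""
    if mitotracker_mfi <= 0:
        raise ValueError("MitoTracker signal must be positive")
    return mitosox_mfi / mitotracker_mfi


def fold_change_vs_control(samples: dict, control: str = "siControl") -> dict:
    """Average the normalised ROS per condition and express it relative to the control.

    `samples` maps a condition name to a list of
    (MitoSOX MFI, MitoTracker MFI) pairs, one pair per replicate.
    """
    ratios = {
        condition: mean(normalised_ros(sox, tracker) for sox, tracker in replicates)
        for condition, replicates in samples.items()
    }
    reference = ratios[control]
    return {condition: ratio / reference for condition, ratio in ratios.items()}


# Hypothetical values, chosen only to show the arithmetic.
example = {
    "siControl": [(100.0, 50.0), (120.0, 60.0)],
    "siSYTL5": [(110.0, 55.0)],
    "menadione": [(400.0, 50.0)],
}
print(fold_change_vs_control(example))
```

      Because autofluorescence contributes to both channels, the ratio never reaches zero; the sketch therefore reports values relative to siControl rather than absolute ROS levels.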

      Author response image 3.

      Mitochondrial ROS production normalised to total mitochondria

      As discussed in our response to Reviewer #1, we tried hard to characterise the role of SYTL5 in the context of ACC using the NCI-H295R cell line isolated from the adrenal gland of an adrenal cancer patient. We attempted to conduct OCR and ECAR measurements using the Seahorse XF upon knockdown of SYTL5 and/or Rab27A in H295R cells without success, due to poor attachment of the cells and an inability to form a monolayer. We also transduced the H295R cells with lentiviral particles to overexpress pLVX-SV40-mScarlet-I-Rab27A and pLVX-CMV-SYTL5-EGFP-3xFLAG to study the localisation of SYTL5 and Rab27A in an ACC model. Again, this proved unsuccessful after numerous attempts at optimising the transduction. These issues limited our investigation into the role of SYTL5 in ACC to the cortisol assay (Supplementary Figure 6). For this, the H295R cells were an appropriate model as they are able to produce an array of adrenal cortex steroids[6] including cortisol[7]. In this assay, measurements are taken from cell culture supernatants, so the confluency of the cells does not prevent consistent results as the cortisol concentration was normalised to total protein per sample. With this assay we were able to rule out a role for SYTL5 and Rab27A in the secretion of cortisol.

      Another consideration when investigating the involvement of SYTL5 in ACC, is that in general ACC cells should have a low expression of SYTL5 as is seen from the patient expression data (Figure 6B).

      Further studies into the link between SYTL5/Rab27A and cancer are beyond the scope of this paper as we are limited to the tools and expertise available in the lab.

      References

      (1) Yamano, K. et al. Endosomal Rab cycles regulate Parkin-mediated mitophagy. eLife 7 (2018). https://doi.org/10.7554/eLife.31326

      (2) Carré, M. et al. Tubulin is an inherent component of mitochondrial membranes that interacts with the voltage-dependent anion channel. The Journal of biological chemistry 277, 33664-33669 (2002). https://doi.org/10.1074/jbc.M203834200

      (3) Hoogerheide, D. P. et al. Structural features and lipid binding domain of tubulin on biomimetic mitochondrial membranes. Proceedings of the National Academy of Sciences 114, E3622-E3631 (2017). https://doi.org/10.1073/pnas.1619806114

      (4) Plitzko, B. & Loesgen, S. Measurement of Oxygen Consumption Rate (OCR) and Extracellular Acidification Rate (ECAR) in Culture Cells for Assessment of the Energy Metabolism. Bio Protoc 8, e2850 (2018). https://doi.org/10.21769/BioProtoc2850

      (5) Yavin, E. & Yavin, Z. Attachment and culture of dissociated cells from rat embryo cerebral hemispheres on polylysine-coated surface. The Journal of cell biology 62, 540-546 (1974). https://doi.org/10.1083/jcb.62.2.540

      (6) Wang, T. & Rainey, W. E. Human adrenocortical carcinoma cell lines. Mol Cell Endocrinol 351, 58-65 (2012). https://doi.org/10.1016/j.mce.2011.08.041

      (7) Rainey, W. E. et al. Regulation of human adrenal carcinoma cell (NCI-H295) production of C19 steroids. J Clin Endocrinol Metab 77, 731-737 (1993). https://doi.org/10.1210/jcem.77.3.8396576

      (8) Barral, D. C. et al. Functional redundancy of Rab27 proteins and the pathogenesis of Griscelli syndrome. J. Clin. Invest. 110, 247-257 (2002). https://doi.org/10.1172/jci15058

      (9) Ku, K. E., Choi, N. & Sung, J. H. Inhibition of Rab27a and Rab27b Has Opposite Effects on the Regulation of Hair Cycle and Hair Growth. Int. J. Mol. Sci. 21 (2020). https://doi.org/10.3390/ijms21165672

      (10) Johnson, J. L., Monfregola, J., Napolitano, G., Kiosses, W. B. & Catz, S. D. Vesicular trafficking through cortical actin during exocytosis is regulated by the Rab27a effector JFC1/Slp1 and the RhoA-GTPase–activating protein Gem-interacting protein. Mol. Biol. Cell 23, 1902-1916 (2012). https://doi.org/10.1091/mbc.e11-12-1001

      (11) Yu, M. et al. Exophilin4/Slp2-a targets glucagon granules to the plasma membrane through unique Ca2+-inhibitory phospholipid-binding activity of the C2A domain. Mol. Biol. Cell 18, 688-696 (2007). https://doi.org/10.1091/mbc.e06-10-0914

      (12) Kurowska, M. et al. Terminal transport of lytic granules to the immune synapse is mediated by the kinesin-1/Slp3/Rab27a complex. Blood 119, 3879-3889 (2012). https://doi.org/10.1182/blood-2011-09-382556

      (13) Zhao, S., Torii, S., Yokota-Hashimoto, H., Takeuchi, T. & Izumi, T. Involvement of Rab27b in the regulated secretion of pituitary hormones. Endocrinology 143, 1817-1824 (2002). https://doi.org/10.1210/endo.143.5.8823

      (14) Kariya, Y. et al. Rab27a and Rab27b are involved in stimulation-dependent RANKL release from secretory lysosomes in osteoblastic cells. J Bone Miner Res 26, 689-703 (2011). https://doi.org/10.1002/jbmr.268

      (15) Zhao, K. et al. Functional hierarchy among different Rab27 effectors involved in secretory granule exocytosis. eLife 12 (2023). https://doi.org/10.7554/eLife.82821

      (16) Izumi, T. Physiological roles of Rab27 effectors in regulated exocytosis. Endocr J 54, 649-657 (2007). https://doi.org/10.1507/endocrj.kr-78

      (17) Bomba-Warczak, E. & Savas, J. N. Long-lived mitochondrial proteins and why they exist. Trends in cell biology 32, 646-654 (2022). https://doi.org/10.1016/j.tcb.2022.02.001

      (18) Curtin, J. F., Donovan, M. & Cotter, T. G. Regulation and measurement of oxidative stress in apoptosis. Journal of Immunological Methods 265, 49-72 (2002). https://doi.org/10.1016/S0022-1759(02)00070-4

      (19) Criddle, D. N. et al. Menadione-induced Reactive Oxygen Species Generation via Redox Cycling Promotes Apoptosis of Murine Pancreatic Acinar Cells. Journal of Biological Chemistry 281, 40485-40492 (2006). https://doi.org/10.1074/jbc.M607704200

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The innate immune system serves as the first line of defense against invading pathogens. Four major immune-specific modules - the Toll pathway, the Imd pathway, melanization, and phagocytosis- play critical roles in orchestrating the immune response. Traditionally, most studies have focused on the function of individual modules in isolation. However, in recent years, it has become increasingly evident that effective immune defense requires intricate interactions among these pathways. 

      Despite this growing recognition, the precise roles, timing, and interconnections of these immune modules remain poorly understood. Moreover, addressing these questions represents a major scientific undertaking. 

      Strengths: 

      In this manuscript, Ryckebusch et al. systematically evaluate both the individual and combined contributions of these four immune modules to host defense against a range of pathogens. Their findings significantly enhance our understanding of the layered architecture of innate immunity. 

      We thank the reviewer for their kind assessment.

      Weaknesses: 

      While I have no critical concerns regarding the study, I do have several suggestions to offer that may help further strengthen the manuscript. These include: 

      (1) Have the authors validated the efficiency of the mutants used in this study? It would be helpful to include supporting data or references confirming that the mutations effectively disrupted the intended immune pathways. 

      We have done so in Figure 1.

      (2) Given the extensive use of double, triple, and quadruple mutants, a more detailed description of the mutant construction process is warranted. 

      We now provide a supplement (File S1) that details the successive genetic crosses and recombinations that were required to generate these compound fly stocks carrying multiple mutations. We also provide some information regarding rapid screening of stocks for phenotypes. Of note, some of these fly stocks have been deposited at VDRC, as they will be useful to the fly community to assess immune modules in a controlled background, and complete stock information will be tied to these stocks there.

      Reviewer #2 (Public review): 

      Summary: 

      In this work, the authors take a holistic view of Drosophila immunity by selecting four major components of fly immunity often studied separately (Toll signaling, Imd signaling, phagocytosis, and melanization), and studying their combinatory effects on the efficiency of the immune response. They achieve this by using fly lines mutant for one of these components, or modules, as well as for a combination of them, and testing the survival of these flies upon infection with a plethora of pathogens (bacterial, viral, and fungal). 

      Strengths: 

      It is clear that this manuscript has required a large amount of hands-on work, considering the number of pathogens, mutations, and timepoints tested. In my opinion, this work is a very welcome addition to the literature on fly immune responses, which obviously do not occur in one type of response at a time, but in parallel, subsequently, and/or are interconnected. I find that the major strength of this work is the overall concept, which is made possible by the mutations designed to target the specific immune function of each module (at least seemingly) without major effects on other functions. I believe that the combinatory mutants will be of use for the fly community and enable further studies of the interplay of these components of immune response in various settings. 

      To control for the effects arising from the genetic variation other than the intended mutations, the mutants have been backcrossed into a widely used, isogenized Drosophila strain called w1118. Therefore, the differences accounted for by the genotype are controlled. 

      I also appreciate that the authors have investigated the two possible ways of dealing with an infection: tolerance and resistance, and how the modules play into those. 

      We thank the reviewer for their kind assessment. 

      Weaknesses: 

      While controlling for the background effects is vital, the w1118 background is problematic (an issue not limited to this manuscript) because of the wide effects of the white mutation on several phenotypes (also other than eye color/eyesight). It is a possibility that the mutation influences the functionality of the immune response components, for example, via effects of the faulty tryptophan handling on the metabolism of the animal. 

      I acknowledge that it is not reasonable to ask for data in different backgrounds better representing a "wild type" fly (however, that is defined is another question), but I think this matter should be brought up and discussed. 

      We agree with the reviewer and have included caveats on the different genetic effects brought about by the combinatory mutant approach, including differences in white gene status, insertion of GFP or DsRed markers, and nature of genetic mutations (Line 142-on).

      “Of note, the strains used in this study differ in their presence/absence of the white<sup>+</sup> gene, present in the PPO1<sup>∆</sup>, NimC1<sup>1</sup> and eater<sup>1</sup> mutations.  In addition to its well established function in eye pigmentation, the white gene can also impact host neurology and intestinal stem cell proliferation (Ferreiro et al., 2017; Sasaki et al., 2021). We did not observe any obvious correlations between white<sup>+</sup> gene status and susceptibilities in this study. Moreover,  in a previous study looking at the cumulative effects of AMP mutations on lifespan, white gene status and fluorescent markers did not readily explain differences in longevity (Hanson and Lemaitre, 2023). We therefore believe that the extreme immune susceptibility we have created through deficiencies for pathways regulating hundreds of genes, or major immune modules, overwhelms the potential effects of white<sup>+</sup> and other transgenic markers. For additional information on which stocks bear which markers, see discussion in Supplementary file 1.”

      Of interest, we were highly conscious of this concern in working with combinatory AMP mutants which differed in white, GFP, and DsRed copies. However, even over the many weeks of snowballing effects on microbiota community composition and structure, we found no trends tied strictly to white+ or to other genetic insertions on lifespan (Hanson and Lemaitre, 2023; DMM).

      The whole study has been conducted on male flies. Immune responses show quite extensive sex-specific variation across a variety of species studied, also in the fly. But the reasons for this variation are not fully understood. Therefore, I suggest that the authors conduct a subset of experiments on female flies to see if the findings apply to both sexes, especially the infection-specificity of the module combinations.  

      We thank the reviewer for this suggestion. We have performed the requested experiments, and include female survival trends in Figure 4supp1. We have added the following text to the main manuscript (Line 554):

      “All survival experiments to this point were done with males. We therefore assessed key survival trends for these infections in females to learn whether the dynamics we observed were consistent across sexes (Figure 4supp1). For all three pathogens (Pr. rettgeri, Sa. aureus, C. albicans) the rank order of susceptibility was broadly similar between males and females, with higher rates of mortality in females overall. Thus, we found no marked sex-by-genotype interaction. Interestingly, the greater susceptibility of females in our hands is true even for ∆ITPM flies, although there are only a few surviving flies on which we can base these conclusions. However, these data may suggest the sexual dimorphism in defense against infection that we see against these pathogens is due to factors independent of the immune modules we disrupted.”

      It is worth noting that male-female sex dichotomies in infection are inconsistent across the literature, with strong lab-specific effects (Belmonte et al., 2020 and personal observation). In our lab setting, we consistently see female mortality higher than males when compared, independent of pathogen and mutant background. We have not seen notable interaction terms of sex and genotype for most immune deficient mutants. It is quite interesting to have done these experiments with ITPM, however, which reveals that there is at least a trend suggesting this dichotomy is independent of the four immune modules we deleted. Still, our infection conditions kill most males, and so it would be good to replicate this sex-specific ∆ITPM result in a dedicated study with doses chosen to improve the resolution of male-female differences. For now, we prefer to use conservative language and avoid overinterpreting this trend, but do feel it merits mentioning.  

      Recommendations for the authors:

      Comment on statistical requests

      Both reviewers requested further clarity on the statistical analyses supplemental to Figure 3. We have addressed these comments as follows.

      First, we now provide an additional supplementary .zip file containing summary statistics for all survival data in Figure 3 (Supplementary File 3). We have additionally added this text to line 226 to make this data treatment more clear:

      “…we chose to focus on major differences apparent in summary statistics, highlighting…”

      And we highlight that all survival data are also provided as Kaplan-Meier survival curves in the main or supplementary figures in Line 233:

      “Kaplan-Meier survival curves for all experiments are provided in the main text or supplementary information”.

      Second, as outlined in the main text, we were unable to sample across all pathogen-by-genotype interactions systematically, and this unfortunately obfuscates robust statistical modelling. We addressed the challenge of finding meaningful statistical differences by focusing on trends only if they were i) consistent across experimental replicates, ii) of a consistent logic across comparable genotypes, ensuring random inter-experimental noise was not unduly shaping interpretations, and iii) of a mean lifespan difference ≥1.0 days compared to wild-type, and compared to relevant unchallenged or clean-injury controls. This last choice was especially important because not all experimental replicates included all genotypes due to challenges of animal husbandry and coordination among multiple researchers over five years of data collection. As a result, our initial analyses found a Cox mixed-effects model to be rather useless, being insensitive to important experiment batch effects visible to the eye because statistically-affected genotypes were not present in all experiments.

      We therefore ensured that behaviour relative to controls within experiments was consistent, rather than relying on the comparison of genotypes to controls across the sum of experiments with a post-hoc treatment attempting to apportion variance to experiment batch (but unable to do so for some genotypes and some batches). Due to differences in baseline health and the dynamics explained by studies like Duneau et al. (2017; eLife), there is an expected unequal variance of genotype*pathogen interactions across experiment batches. Unfortunately, this unequal variance, coupled with incomplete sampling across experiment batches, means “highly significant” differences can emerge that don’t hold up to scrutiny of comparisons to controls taken only from within an experiment batch. Thus, we chose to forego a Cox mixed-effects model approach entirely. Instead, our highly conservative approach, focusing on only very large effects with a mean lifespan difference ≥1.0 days, mitigates these issues. We have taken great care to ensure that any results we highlight stand up to inter-experiment batch effects. We would further draw the reviewers’ attention to our response to Reviewer 2 relating to Figure 3, which emphasizes the level of conservativism that we are applying.
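      The within-batch logic described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the data layout (one dictionary of per-fly lifespans per experiment batch), the genotype labels `mutA` and `mutB`, and the function names are all hypothetical, and only rules (i) and (iii) are modelled (consistency of direction across batches and a mean difference of at least 1.0 days relative to the control from the same batch).

```python
from statistics import mean

def within_batch_differences(batches, genotype, control="WT"):
    """Mean-lifespan difference (control - genotype) for each batch containing both."""
    diffs = []
    for batch in batches:
        # Skip batches where either group is missing (incomplete sampling).
        if genotype in batch and control in batch:
            diffs.append(mean(batch[control]) - mean(batch[genotype]))
    return diffs

def is_highlighted(batches, genotype, control="WT", threshold=1.0):
    """True if the effect has one direction in every batch and averages >= threshold days."""
    diffs = within_batch_differences(batches, genotype, control)
    if not diffs:
        return False
    same_direction = all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
    return same_direction and abs(mean(diffs)) >= threshold

# Hypothetical per-fly lifespans (days, capped at a 7-day window).
batches = [
    {"WT": [7, 7, 7, 6], "mutA": [3, 4, 5, 4]},
    {"WT": [7, 7, 6, 7], "mutA": [4, 5, 5, 4], "mutB": [7, 6, 7, 6]},
    {"WT": [7, 6, 7, 7]},  # mutA and mutB were not tested in this batch
]

print(within_batch_differences(batches, "mutA"))
print(is_highlighted(batches, "mutA"), is_highlighted(batches, "mutB"))
```

      Because each genotype is only ever compared with the control from its own batch, a genotype that is absent from some batches does not distort the comparison; it simply contributes fewer differences.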

      At the end of the Discussion, we have added the following sentence to emphasize these limitations:

      “…a combinatorial mutation approach to deciphering immune function can be extended even to the broad level of whole immune modules. Of note, we were unable to systematically sample all genotype-by-pathogen interactions equally. We have therefore been highly conservative in our reporting of major effects. There are likely many important interactions not discussed in our study. Future investigations may highlight important biology that is apparent in our data, but which we may not have mentioned here. To this end, we have deposited our isogenic immunity fly stocks in the Vienna Drosophila Resource Centre to facilitate their use. Beyond immunity, our tools can also be of use to study various questions at the cutting edge of aging, memory, neurodegeneration, cancer, and more, where immune genes are repeatedly implicated. We hope that this set of lines will be useful to the community to better characterize the Drosophila host defense.”

      We recognise this response may not fully satisfy the reviewers’ requests. While use of summary statistics is simple, our rules for highlighting interactions of importance are defined, readily understood and interpreted, and draw attention to key trends that are backed by a solid understanding of the data and its limitations. We have taken this approach out of a responsibility to avoid making spurious assertions that stem from underpowered statistical models rather than from the biology itself.

      Reviewer #1 (Recommendations for the authors): 

      (1) Lines 1092-1093 - Please double-check the labeling of the panels in Figure 2. It appears that panels A and C correspond to single-module mutants, whereas panels B and D refer to compound-module mutants. 

      We have modified Figure 2 and Figure 2supp1 labelling. We also realise there was an error in the column titling that contributed to the confusion. We hope the new layout is clear, and thank the reviewers for noting this issue.

      (2) Lines 347-377 - Figure 2D is not cited in the text. 

      We now cite Fig2D in Line 356.

      (3) P values should be indicated in Figure 2 and Figure 3 for all relevant comparisons. Additionally, "ns" (not significant) should be added in Figure 5A-B. 

      We make the effort to show key uninfected survival trends in Figure 2, and list the total flies (n_flies) in Fig3 to provide the reader with the underlying confidence in the trends observed. We focus on differences of mean lifespan of at least 1 day, and which are consistent in direction across combinatory mutations. We have avoided the multiple comparisons of Cox proportional hazard survival analyses throughout this study because they are overly sensitive for our purposes, as we have done previously when systematically comparing many genotypes to each other (see Hanson and Lemaitre, 2023; DMM).

      (4) Minor points: Hml-Gal4, UAS-GFP should be italic; Line 192-- "uL" and "uM"; Line 596: P>.05.

      We have made these changes. We’re unsure what the comment regarding P>.05 referred to, but have removed spaces and made it non-italics. 

      Reviewer #2 (Recommendations for the authors): 

      Statistical analyses and their outcomes are clearly indicated only for the data in Figure 1 and Figure 5 and in the supplement for Figure 1, while they are not reported/not easily accessible for other data. For the main figures, statistics should be indicated in the figure for an easier assessment of the data. In case of multiple comparisons potentially crowding the plots too much, statistics may be in a supplementary file/table. 

      See response above.

      In case of the hemocytes, besides phagocytosis, I would think that ROS generation via the DUOX/NOX system is also an integral part of the immune response against pathogens, and that has not been included here. That might be an interesting addition for future experiments. As the NimC1, eater double mutant flies are said to have fewer hemocytes, it is possible that this function of the hemocytes is affected as well. This could be commented on in the text. 

      The reviewer raises a good point. The role of DUOX and NOX in ROS responses is not assessed in our study. To our knowledge, DUOX and NOX participate primarily in the wound repair response, or in epithelial renewal at damage sites or in the gut. In our study on systemic immunity, we did not assess the role of clotting, the precise function of ROS, and we have missed other host defense or stress response mechanisms as well (e.g. constitutively-expressed AMP-like genes, TEPs, JAK-STAT) that likely play a role in the systemic immune defense. Considering the lethality caused by Nox and Duox mutation, there would be inherent genetic difficulties to recombine these as multiple mutations. Unfortunately, this makes it difficult to include these processes in our analysis in a systematic manner. We are already happy to have generated fly lines lacking four immune modules simultaneously, even if they are not fully immune deficient. We have mentioned this point in the discussion (Line 613-on).

      Of note, the NimC1, eater double mutants actually have decreased hemocyte counts at the adult stage (Melcarne et al., 2019). Thus NimC1, eater double mutants are impaired not only in phagocytosis, but in the overall cellular response. We make a point to outline this in Line 225-257, and 607.

      I think it could be mentioned that the melanization response at larval stage (against parasitoids) functions differently from the melanization described here (requiring hemocyte differentiation and PPO3).

      A good point. We have added this mention in Line 97:

      “In addition, a third PPO gene (PPO3) is specifically expressed by lamellocytes, specialized hemocytes that differentiate in larvae responding to and enveloping invading parasites (Dudzic et al., 2015)”.

      Overall, the clarity of the figures and figure legends could be worked on to make them a bit easier to follow. Below are some of my suggestions: 

      (1) In Figure 2, adding headings to parts C & D (similarly to A & B) would make it easier to follow what is happening in the figure at a glance. Also, it is rather difficult to visually follow which strain is which in the plots. I'd suggest adding the key/legend for single mutants below 2A & B, and the key for the double mutants below C & D. If a mutant is present in A & B and in C & D, it could be included in both keys. I also think that it would be intuitive to present the single mutants by dashed lines and double mutants by continuous lines (or vice versa), so that one would easily distinguish between them. Of note, the figure legend says that A & B are single mutants, but for example in B there are also some double mutants (?). 

      We have modified Figure 2 and Figure 2supp1 labelling. We also realise there was an error in the column titling that contributed to the confusion. We hope the new layout is clear, and thank the reviewers for noting this issue.

      (2) In Figure 3, it looks like ΔMel is almost identical to controls in the clean injury survival, but in Figure 2C, it is clearly doing worse. I might be missing something here, but would like the authors to clarify the matter. Also, the meaning of the numbers in the heat map could be explained in the figure legend and/or added to the figure (color key). 

      The reviewer is correct. We thank the reviewer for this astute observation. Inadvertently, we used an old version of the Figure 2 preparation where only a subset of experiments was entered in the Prism data file rather than the total data used to inform Figure 3. This issue affected all genotypes.

      We have reviewed the data in Figure 2, Figure 2supp1, and Figure 3, and updated these figures accordingly to ensure they represent the full survival data. We have also incorporated new experiments into the sum data related to male-female differences and to fill gaps in the data from the 1<sup>st</sup> submission. We also note that, due to rounding to one decimal place, the difference between WT and ΔMel appears slightly underrepresented: the true difference (over the 7-day lifespan) is 0.37. We’ve provided a version of this figure rounded to 2 decimal places below, but prefer the simpler 1 decimal place in the main text for readability. The updated Figure 2 shows the full data in Figure 3 accurately.

      We will also take this opportunity to highlight how conservative our ≥1.0 days difference approach is. Breaking down survival curve patterns in Figure 2 relative to mean differences in Figure 3, for clean injury, approximately 75% of ΔMel flies survive to day 7 with mortality mostly taking place between days 3-7. The result is a mean lifespan of 6.37 days. On a survival curve, this difference appears quite strong, but in our mean lifespan table the difference is rather muted (WT vs. ΔMel difference = 0.37 days). Thus, differences of ≥1.0 days reflect very strong trends in survival data that are near-guaranteed to be independent of experimental noise. While we note issues that prevented us from a fully systematic sampling for all experiments, we are confident that the ≥1.0 day differences we highlight, using the rules explained in the main text, are robust. While this approach could be seen as overly conservative, it is our preference in this initial study, containing combinations of 25 treatments and 14 genotypes, to be highly conservative. Future studies may investigate other strong differences we have not highlighted, and the data we provide here can help generate expectations and guide those studies.
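      To illustrate why a visibly lower survival curve translates into a small mean-lifespan difference, the following Python sketch computes a mean lifespan restricted to a 7-day window. The calculation method (survivors at the end of the window are counted as living the full window) and the cohort below are assumptions chosen for illustration; they are not the authors' data and do not reproduce the 6.37-day value.

```python
def restricted_mean_lifespan(death_days, n_flies, window=7.0):
    """Mean lifespan over a fixed window; flies still alive count as `window` days."""
    survivors = n_flies - len(death_days)
    if survivors < 0:
        raise ValueError("more deaths recorded than flies in the cohort")
    # Deaths after the window (if any) are capped at the window length.
    total_days = sum(min(day, window) for day in death_days) + survivors * window
    return total_days / n_flies

# Hypothetical cohort of 100 flies: 75 survive the window, 25 die on days 3 to 7.
deaths = [3] * 5 + [4] * 5 + [5] * 5 + [6] * 5 + [7] * 5
mutant_mean = restricted_mean_lifespan(deaths, n_flies=100)
control_mean = restricted_mean_lifespan([], n_flies=100)  # no deaths at all

print(mutant_mean, control_mean - mutant_mean)
```

      In this hypothetical cohort a quarter of the flies die, yet the mean lifespan falls by only half a day relative to a control with no mortality, which is why a threshold of ≥1.0 days selects only strong effects.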

      Author response image 1.

      Figure 3 with 2 decimal places of rounding for mean lifespans. The 7-day clean injury mean lifespan of WT is 6.74 days, and of ΔMel is 6.37 days. Due to rounding, in the 1 decimal Figure 3 this difference appears as if it is only 0.3 days, but it is closer to 0.4 days. Regardless, this level of difference, which appears rather clearly in a survival curve, is well below the level of difference we have chosen to highlight in our study.
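      The rounding effect described in this caption can be checked with a short Python sketch. It uses the two mean lifespans quoted above (6.74 and 6.37 days); the helper name and the half-up rounding mode are assumptions, and `Decimal` is used only to avoid binary floating-point noise in the subtraction.

```python
from decimal import Decimal, ROUND_HALF_UP

def rounded_difference(a, b, places):
    """Round both values to `places` decimals first, then subtract."""
    step = Decimal(10) ** -places  # 0.1 for one decimal place, 0.01 for two
    rounded_a = Decimal(a).quantize(step, rounding=ROUND_HALF_UP)
    rounded_b = Decimal(b).quantize(step, rounding=ROUND_HALF_UP)
    return rounded_a - rounded_b

one_decimal = rounded_difference("6.74", "6.37", places=1)   # 6.7 - 6.4
two_decimals = rounded_difference("6.74", "6.37", places=2)  # 6.74 - 6.37

print(one_decimal, two_decimals)
```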

      (1) Figure 4: I find it very tedious to compare CFUs among different mutants from the plots. As the idea is to compare bacterial loads among the mutants at different timepoints, it would be easier to compare them if the data were shown within a timepoint (CFUs of each mutant at 2h, at 6h, and so on). This is also how the results are written in the text (within a time point). Would it also be clearer if the CFU plots were named, for example: " A', B', and C'"? 

      We appreciate this note. We feel both representations have merits and pitfalls, but prefer our original design showing the progression of bacterial growth within genotype first. However, we have added dotted lines representing the wild-type bacterial loads at 2hpi, 12hpi, and 24hpi to assist the reader in making across-genotype comparisons at key time points. Like this, the reader can see if the error bars (StDev) overlap the mean of the wild-type, and so make more intuitive judgements about whether these differences are meaningful.
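      The visual check described here (does a genotype's mean ± StDev interval include the wild-type mean at a given time point?) can be written as a small Python sketch. The log10 CFU values and genotype labels below are invented for illustration, and the use of a log scale is an assumption; the check is a reading aid, not a statistical test.

```python
from statistics import mean, stdev

def sd_overlaps_reference(values, reference_mean):
    """True if mean(values) +/- one sample standard deviation includes reference_mean."""
    centre = mean(values)
    spread = stdev(values)
    return (centre - spread) <= reference_mean <= (centre + spread)

# Hypothetical log10 CFU per fly at one time point.
wild_type = [4.0, 4.2, 3.8]
mutant_high_load = [5.5, 5.9, 5.7]
mutant_similar_load = [4.1, 4.5, 3.9]

wt_mean = mean(wild_type)  # the value drawn as the dotted reference line
print(sd_overlaps_reference(mutant_high_load, wt_mean))
print(sd_overlaps_reference(mutant_similar_load, wt_mean))
```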

      (2) Figure 2D is not referred to in the text. 

      We now cite Fig2D in Line 356.

    1. Author response:

      The issue of a control without blue light illumination was raised. Clearly without the light we will not obtain any signal in the fluorescence microscopy experiments, which would not be very informative. Instead, we changed the level of blue light illumination in the fluorescence microscopy experiments (figure 4A) and the response of the bacteria scales with dosage. It is very hard to find an alternative explanation, beyond that the blue light is stressing the bacteria and modulating their membrane potentials.

      One of the referees refuses to see wavefronts in our microscopy data. We struggle to understand whether it is an issue with definitions (Waigh has published a tutorial on the subject in Chapter 5 of his book ‘The physics of bacteria: from cells to biofilms’, T.A.Waigh, CUP, 2024 – figure 5.1 shows a sketch) or something subtler on diffusion in excitable systems. We stand by our claim that we observe wavefronts, similar to those observed by Prindle et al<sup>1</sup> and Blee et al<sup>2</sup> for B. subtilis biofilms.

      The referee is questioning our use of ThT to probe the membrane potential. We believe the Pilizota and Strahl groups are treating the E. coli as unexcitable cells, leading to their problems. Instead, we believe E. coli cells are excitable (containing the voltage-gated ion channel Kch) and we now clearly state this in the manuscript. Furthermore, we include a section here discussing some of the issues with ThT.


      Use of ThT as a voltage sensor in cells

      ThT is now used reasonably widely in the microbiology community as a voltage sensor in both bacterial (Prindle et al<sup>1</sup>) and fungal cells (Pena et al<sup>12</sup>). ThT is a small cationic fluorophore that loads into the cells in proportion to their membrane potential, thus allowing the membrane potential to be measured from fluorescence microscopy measurements.

      Previously ThT was widely used to quantify the growth of amyloids in molecular biology experiments (standardized protocols exist and dedicated software has been created)<sup>13</sup> and there is a long history of its use<sup>14</sup>. ThT fluorescence is bright, stable and slow to photobleach.

      Author response image 1 shows a schematic diagram of the ThT loading in E. coli in our experiments in response to illumination with blue light. Similar results were previously presented by Mancini et al<sup>15</sup>, but regimes 2 and 3 were mistakenly labelled as artefacts.

      Author response image 1.

      Schematic diagram of ThT loading during an experiment with E. coli cells under blue light illumination i.e. ThT fluorescence as a function of time. Three empirical regimes for the fluorescence are shown (1, 2 and 3).

      The classic study of Prindle et al on bacterial biofilm electrophysiology established the use of ThT in B. subtilis biofilms by showing that similar results occurred with DiSc3, which is widely used as a Nernstian voltage sensor in cellular biology<sup>1</sup>, e.g. with mitochondrial membrane potentials in eukaryotic organisms, where there is a large literature. We repeated such a comparative calibration of ThT with DiSc3 in a previous publication with both B. subtilis and P. aeruginosa cells<sup>2</sup>. ThT thus functioned well in our previous publications with Gram positive and Gram negative cells.

      However, to our knowledge, there are now two groups questioning the use of ThT and DiSc3 as voltage sensors with E. coli cells<sup>15-16</sup>. The first, by the Pilizota group, claims ThT only works as a voltage sensor in regime 1 of Author response image 1, using a method based on the rate of rotation of flagellar motors. Another slightly contradictory study by the Strahl group<sup>16</sup> claims DiSc3 only acts as a voltage sensor with the addition of an ionophore for potassium, which allows free movement of potassium through the E. coli membranes.

      Our resolution to this contradiction is that ThT does indeed work reasonably well with E. coli. The Pilizota group’s model for rotating flagellar motors assumes the membrane voltage is not varying due to excitability of the membrane voltage (otherwise a non-linear Hodgkin-Huxley type model would be needed to quantify their results) i.e. E. coli cells are unexcitable. We show clearly in our study that ThT loading in E. coli is a function of irradiation with blue light and is a stress response of the excitable cells. This is in contradiction to the Pilizota group’s model. The Pilizota group’s model also requires the awkward fiction of why cells decide to unload and then reload ThT in regimes 2 and 3 of Author response image 1 due to variable membrane partitioning of the ThT. Our simple explanation is that it is just due to the membrane voltage changing and no membrane permeability switch needs to be invoked. The Strahl group’s<sup>16</sup> results with DiSc3 are also explained by a neglect of the excitable nature of E. coli cells that are reacting to blue light irradiation. Adding ionophores to the E. coli membranes makes the cells unexcitable, reduces their response to blue light and thus leads to simple loading of DiSc3 (the physiological control of K+ in the cells by voltage-gated ion channels has been short circuited by the addition of the ionophore).

      Further evidence for our model that ThT functions as a voltage sensor with E. coli includes:

      1) The 3 regimes in Author response image 1 from ThT correlate well with measurements of extracellular potassium ion concentration using TMRM i.e. all 3 regimes in Author response image 1 are visible with this separate dye (figure 1d).

      2) We are able to switch regime 3 in Author response image 1, off and then on again by using knock downs of the potassium ion channel Kch in the membranes of the E. coli and then reinserting the gene back into the knock downs. This cannot be explained by the Pilizota model.

      We conclude that ThT works reasonably well as a sensor of membrane voltage in E. coli and the previous contradictory studies<sup>15-16</sup> are because they neglect the excitable nature of the membrane voltage of E. coli cells in response to the light used to make the ThT fluoresce.

      Three further criticisms of the Mancini et al method<sup>15</sup> for calibrating membrane voltages include:

      1) E. coli cells have clutches that are not included in their models. Otherwise the rotation of the flagella would be entirely enslaved to the membrane voltage allowing the bacteria no freedom to modulate their speed of motility.

      2) Ripping off the flagella may perturb the integrity of the cell membrane and lead to different loading of the ThT in the E. coli cells.

      3) Most seriously, the method ignores the activity of many other ion channels (beyond H+) on the membrane voltage that are known to exist with E. coli cells e.g. Kch for K+ ions. The Pilizota group uses a simple Nernstian battery model developed for mitochondria in the 1960s. It is not adequate to explain our results.

      An additional criticism of the Winkel et al study<sup>17</sup> from the Strahl group is that it indiscriminately switches between discussion of mitochondria and bacteria e.g. on page 8 ‘As a consequence the membrane potential is dominated by H+’. Mitochondria are slightly alkaline intracellular organelles with external ion concentrations in the cytoplasm that are carefully controlled by the eukaryotic cells. E. coli are not i.e. they have neutral internal pHs, with widely varying extracellular ionic concentrations and have reinforced outer membranes to resist osmotic shocks (in contrast mitochondria can easily swell in response to moderate changes in osmotic pressure).

      A quick calculation of the equilibrium membrane voltage of E. coli can be easily done using the Nernst equation dependent on the extracellular ion concentrations defined by the growth media (the intracellular ion concentrations in E. coli are 0.2 M K+ and 10<sup>-7</sup> M H+ i.e. there is a factor of a million fewer H+ ions). Thus in contradiction to the claims of the groups of Pilizota<sup>15</sup> and Strahl<sup>17</sup>, H+ is a minority determinant to the membrane voltage of E. coli. The main determinant is K+. For a textbook version of this point the authors can refer to Chapter 4 of D. White, et al’s ‘The physiology and biochemistry of prokaryotes’, OUP, 2012, 4th edition.
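      As a minimal sketch of the kind of calculation referred to above, the Nernst equilibrium potential of a single ion species, E = (RT/zF) ln([ion]out/[ion]in), can be evaluated in Python. The intracellular concentrations (0.2 M K+ and 10<sup>-7</sup> M H+) are those quoted in the text; the extracellular values (5 mM K+ and pH 7) and the temperature of 37 °C are assumptions for illustration only and depend on the growth medium actually used.

```python
import math

GAS_CONSTANT = 8.314462618      # J / (mol K)
FARADAY_CONSTANT = 96485.33212  # C / mol

def nernst_potential_mV(c_out, c_in, z=1, temperature_K=310.15):
    """Equilibrium (Nernst) potential in mV for one ion species of valence z."""
    thermal_voltage = GAS_CONSTANT * temperature_K / (z * FARADAY_CONSTANT)  # volts
    return 1000.0 * thermal_voltage * math.log(c_out / c_in)

# Intracellular values quoted in the text; extracellular values are hypothetical.
e_potassium = nernst_potential_mV(c_out=0.005, c_in=0.2)  # 5 mM outside, 0.2 M inside
e_proton = nernst_potential_mV(c_out=1e-7, c_in=1e-7)     # external pH 7, internal pH 7

print(round(e_potassium, 1), e_proton)
```

      With these assumed external concentrations the potassium equilibrium potential is strongly negative (close to -99 mV), while the proton equilibrium potential is zero because the proton concentrations are equal on both sides. A full membrane voltage would also require the relative permeabilities of each ion, which this sketch does not model.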

      Even in mitochondria the assumption that H+ dominates the membrane potential and the cells are unexcitable can be questioned e.g. people have observed pulsatile depolarization phenomena with mitochondria<sup>18-19</sup>. A large number of K+ channels are now known to occur in mitochondrial membranes (not to mention Ca2+ channels; mitochondria have extensive stores of Ca2+) and they are implicated in mitochondrial membrane potentials. In this respect the seminal Nobel prize winning research of Peter Mitchell (1961) on mitochondria needs to be amended<sup>20</sup>. Furthermore, the mitochondrial work is clearly inapplicable to bacteria (the proton motive force, PMF, will instead subtly depend on non-linear Hodgkin-Huxley equations for the excitable membrane potential, similar to those presented in the current article). A much more sophisticated framework has been developed by the mathematical biology community to describe the activity of electrically excitable cells (e.g. with neurons, sensory cells and cardiac cells), beyond Mitchell’s use of the simple stationary equilibrium thermodynamics to define the Proton Motive Force via the electrochemical potential of a proton (the use of the word ‘force’ is unfortunate, since it is a potential). The tools developed in the field of mathematical electrophysiology<sup>8</sup> should be more extensively applied to bacteria, fungi, mitochondria and chloroplasts if real progress is to be made.


      Related to the previous point, we now cite articles from the Pilizota and Strahl groups in the main text (one from each group). Unfortunately, the space constraints of eLife mean we cannot include a more detailed discussion in the main article.

      In terms of modelling the ion channels, the Hodgkin-Huxley type model proposes that the Kch ion channel can be modelled as a typical voltage-gated potassium ion channel, i.e. with an 𝑛<sup>4</sup> term in its conductivity. The literature agrees that Kch is a voltage-gated potassium ion channel based on its primary sequence<sup>3</sup>. The protein has the typical 6 transmembrane helix motif of a voltage-gated ion channel. The agent-based model assumes little about the structure of ion channels in E. coli, other than that they release potassium in response to a threshold potassium concentration in their environment. The agent-based model is thus robust to the exact molecular details chosen and predicts the anomalous transport of the potassium wavefronts reasonably well (the modelling was extended in a recent Physical Review E article<sup>4</sup>). Such a description of reaction-anomalous diffusion phenomena has not, to our knowledge, been previously achieved in the literature<sup>5</sup> and in general could be used to describe other signaling molecules.
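
      The threshold-release rule described here can be illustrated with a toy calculation. The sketch below is a one-dimensional fire-diffuse-fire lattice with ordinary (not anomalous) diffusion, written only to show the mechanism; it is not the three-dimensional agent-based model of the manuscript, and the function name, lattice size and all parameter values are hypothetical choices.

```python
def fire_diffuse_fire_1d(n_sites=30, alpha=0.2, threshold=0.3,
                         release=1.0, max_steps=500):
    """Toy 1D fire-diffuse-fire wave; returns the step at which each site fired.

    alpha = D*dt/dx**2 must stay <= 0.5 for the explicit scheme to be stable.
    All parameter values are illustrative assumptions.
    """
    conc = [0.0] * n_sites          # extracellular K+ level at each site
    fire_time = [None] * n_sites    # step at which each cell released K+

    # Seed the wave: the first cell releases its potassium at step 0.
    conc[0] += release
    fire_time[0] = 0

    for step in range(1, max_steps + 1):
        # Explicit finite-difference diffusion with no-flux boundaries.
        new = conc[:]
        for i in range(n_sites):
            left = conc[i - 1] if i > 0 else conc[i]
            right = conc[i + 1] if i < n_sites - 1 else conc[i]
            new[i] = conc[i] + alpha * (left - 2.0 * conc[i] + right)
        conc = new

        # Each unfired cell releases once when its local level reaches threshold.
        for i in range(n_sites):
            if fire_time[i] is None and conc[i] >= threshold:
                conc[i] += release
                fire_time[i] = step

        if all(t is not None for t in fire_time):
            break
    return fire_time


times = fire_diffuse_fire_1d()
print(times)
```

      In this toy version each cell fires only once, so the release front travels outwards from the seeded cell; capturing the anomalous transport discussed here would require a different diffusion rule.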

      1. Prindle, A.; Liu, J.; Asally, M.; Ly, S.; Garcia-Ojalvo, J.; Süel, G. M., Ion channels enable electrical communication in bacterial communities. Nature 2015, 527, 59.

      2. Blee, J. A.; Roberts, I. S.; Waigh, T. A., Membrane potentials, oxidative stress and the dispersal response of bacterial biofilms to 405 nm light. Physical Biology 2020, 17, 036001.

      3. Milkman, R., An E. coli homologue of eukaryotic potassium channel proteins. PNAS 1994, 91, 3510-3514.

      4. Martorelli, V.; Akabuogu, E. U.; Krasovec, R.; Roberts, I. S.; Waigh, T. A., Electrical signaling in three-dimensional bacterial biofilms using an agent-based fire-diffuse-fire model. Physical Review E 2024, 109, 054402.

      5. Waigh, T. A.; Korabel, N., Heterogeneous anomalous transport in cellular and molecular biology. Reports on Progress in Physics 2023, 86, 126601.

      6. Hodgkin, A. L.; Huxley, A. F., A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology 1952, 117, 500.

      7. Dawson, S. P.; Keizer, J.; Pearson, J. E., Fire-diffuse-fire model of dynamics of intracellular calcium waves. PNAS 1999, 96, 606.

      8. Keener, J.; Sneyd, J., Mathematical Physiology. Springer: 2009.

      9. Coombes, S., The effect of ion pumps on the speed of travelling waves in the fire-diffuse-fire model of Ca2+ release. Bulletin of Mathematical Biology 2001, 63, 1.

      10. Blee, J. A.; Roberts, I. S.; Waigh, T. A., Spatial propagation of electrical signals in circular biofilms. Physical Review E 2019, 100, 052401.

      11. Gorochowski, T. E.; Matyjaszkiewicz, A.; Todd, T.; Oak, N.; Kowalska, K., BSim: an agent-based tool for modelling bacterial populations in systems and synthetic biology. PloS One 2012, 7, 1.

      12. Pena, A.; Sanchez, N. S.; Padilla-Garfias, F.; Ramiro-Cortes, Y.; Araiza-Villanueva, M.; Calahorra, M., The use of thioflavin T for the estimation and measurement of the plasma membrane electric potential difference in different yeast strains. Journal of Fungi 2023, 9 (9), 948.

      13. Xue, C.; Lin, T. Y.; Chang, D.; Guo, Z., Thioflavin T as an amyloid dye: fibril quantification, optimal concentration and effect on aggregation. Royal Society Open Science 2017, 4, 160696.

      14. Meisl, G.; Kirkegaard, J. B.; Arosio, P.; Michaels, T. C. T.; Vendruscolo, M.; Dobson, C. M.; Linse, S.; Knowles, T. P. J., Molecular mechanisms of protein aggregation from global fitting of kinetic models. Nature Protocols 2016, 11 (2), 252-272.

      15. Mancini, L.; Tian, T.; Guillaume, T.; Pu, Y.; Li, Y.; Lo, C. J.; Bai, F.; Pilizota, T., A general workflow for characterization of Nernstian dyes and their effects on bacterial physiology. Biophysical Journal 2020, 118 (1), 4-14.

      16. Buttress, J. A.; Halte, M.; Winkel, J. D. t.; Erhardt, M.; Popp, P. F.; Strahl, H., A guide for membrane potential measurements in Gram-negative bacteria using voltage-sensitive dyes. Microbiology 2022, 168, 001227.

      17. Derk te Winkel, J.; Gray, D. A.; Seistrup, K. H.; Hamoen, L. W.; Strahl, H., Analysis of antimicrobial-triggered membrane depolarization using voltage sensitive dyes. Frontiers in Cell and Developmental Biology 2016, 4, 29.

      18. Schwarzländer, M.; Logan, D. C.; Johnston, I. G.; Jones, N. S.; Meyer, A. J.; Fricker, M. D.; Sweetlove, L. J., Pulsing of membrane potential in individual mitochondria. The Plant Cell 2012, 24, 1188-1201.

      19. Huser, J.; Blatter, L. A., Fluctuations in mitochondrial membrane potential caused by repetitive gating of the permeability transition pore. Biochemical Journal 1999, 343, 311-317.

      20. Mitchell, P., Coupling of phosphorylation to electron and hydrogen transfer by a chemi-osmotic type of mechanism. Nature 1961, 191 (4784), 144-148.

      21. Baba, T.; Ara, M.; Hasegawa, Y.; Takai, Y.; Okumura, Y.; Baba, M.; Datsenko, K. A.; Tomita, M.; Wanner, B. L.; Mori, H., Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Molecular Systems Biology 2006, 2, 1.

      22. Schindelin, J.; et al., Fiji: an open-source platform for biological-image analysis. Nature Methods 2012, 9, 676.

      23. Hartmann, R.; et al., Quantitative image analysis of microbial communities with BiofilmQ. Nature Microbiology 2021, 6 (2), 151.


      The following is the authors’ response to the original reviews.

      Critical synopsis of the articles cited by referee 2:

      (1) ‘Generalized workflow for characterization of Nernstian dyes and their effects on bacterial physiology’, L.Mancini et al, Biophysical Journal, 2020, 118, 1, 4-14.

      This is the central article used by referee 2 to argue that there are issues with the calibration of ThT for the measurement of membrane potentials. The authors use a simple Nernstian battery (SNB) model and unfortunately it is wrong when voltage-gated ion channels occur. Huge oscillations occur in the membrane potentials of E. coli that cannot be described by the SNB model. Instead a Hodgkin-Huxley model is needed, as shown in our eLife manuscript and multiple other studies (see above). Arrhenius kinetics are assumed in the SNB model for pumping with no real evidence and the generalized workflow involves ripping the flagella off the bacteria! The authors construct an elaborate ‘work flow’ to ensure their ThT results can be interpreted using their erroneous SNB model over a limited range of parameters.

      (2) ‘Non-equivalence of membrane voltage and ion-gradient as driving forces for the bacterial flagellar motor at low load’, C.J.Lo, et al, Biophysical Journal, 2007, 93, 1, 294.

      An odd de novo chimeric species is developed using an E. coli chassis which uses Na+ instead of H+ for the motility of its flagellar motor. Its relevance to wild-type E. coli is not clear, due to the massive physiological perturbations involved. An SNB model is used to fit the data over a very limited parameter range with all the concomitant errors.

      (3) ‘Single-cell bacterial electrophysiology reveals mechanisms of stress-induced damage’, E.Krasnopeeva, et al, Biophysical Journal, 2019, 116, 2390.

      The abstract says ‘PMF defines the physiological state of the cell’. This statement is hyperbolic. An extremely wide range of molecules contribute to the physiological state of a cell. PMF does not even define the electrophysiology of the cell e.g. via the membrane potential. There are 0.2 M of K+ compared with 0.0000001 M of H+ in E. coli, so K+ is arguably a million times more important for the membrane potential than H+ and thus the electrophysiology!

      Equation (1) in the manuscript assumes no other ions are exchanged during the experiments other than H+. This is a very bad approximation when voltage-gated potassium ion channels move the majority ion (K+) around!

      In our model Figure 4A is better explained by depolarisation due to K+ channels closing than by direct irreversible photodamage. Why does the ThT fluorescence increase again for the second hyperpolarization event if the ThT is supposed to be damaged? It does not make sense.

      (4) ‘The proton motive force determines E. coli robustness to extracellular pH’, G.Terradot et al, 2024, preprint.

      This article expounds the SNB model once more. It still ignores the voltage-gated ion channels. Furthermore, it ignores the effect of the dominant ion in E. coli, K+. The manuscript is incorrect as a result and I would not recommend publication.

      In general, an important problem is being researched i.e. how the membrane potential of E. coli is related to motility, but there are serious flaws in the SNB approach and the experimental methodology appears tenuous.

      Answers to specific questions raised by the referees

      Reviewer #1 (Public Review):

      Summary:

      Cell-to-cell communication is essential for higher functions in bacterial biofilms. Electrical signals have proven effective in transmitting signals across biofilms. These signals are then used to coordinate cellular metabolisms or to increase antibiotic tolerance. Here, the authors have reported for the first time coordinated oscillation of membrane potential in E. coli biofilms that may have a functional role in photoprotection.

      Strengths:

      - The authors report original data.

      - For the first time, they showed that coordinated oscillations in membrane potential occur in E. coli biofilms.

      - The authors revealed a complex two-phase dynamic involving distinct molecular response mechanisms.

      - The authors developed two rigorous models inspired by 1) Hodgkin-Huxley model for the temporal dynamics of membrane potential and 2) Fire-Diffuse-Fire model for the propagation of the electric signal.

      - Since its discovery by comparative genomics, the Kch ion channel has not been associated with any specific phenotype in E. coli. Here, the authors proposed a functional role for the putative K+ Kch channel: enhancing survival under photo-toxic conditions.

      We thank the referee for their positive evaluations and agree with these statements.

      Weaknesses:

      - Since the flow of fresh medium is stopped at the beginning of the acquisition, environmental parameters such as pH and RedOx potential are likely to vary significantly during the experiment. It is therefore important to exclude the contributions of these variations to ensure that the electrical response is only induced by light stimulation. Unfortunately, no control experiments were carried out to address this issue.

      The electrical responses occur almost instantaneously when the stimulation with blue light begins, i.e. they are too fast to be caused by a build-up of pH changes. We are not sure what the referee means by redox potential, since it is an attribute of all chemicals that are able to donate or receive electrons. The electrical response to stress appears to be caused by ROS, since when ROS scavengers are added the electrical response is removed, i.e. pH plays a very small minority role, if any.

      - Furthermore, the control parameter of the experiment (light stimulation) is the same as that used to measure the electrical response, i.e. through fluorescence excitation. The use of the PROPS system could solve this problem.

      We were enthusiastic at the start of the project to use the PROPS system in E. coli as presented by J.M.Kralj et al, ‘Electrical spiking in E. coli probed with a fluorescent voltage-indicating protein’, Science, 2011, 333, 6040, 345. However, the people we contacted in the microbiology community said that it had some technical issues and there have been no subsequent studies using PROPS in bacteria after the initial promising study. The fluorescent protein system recently presented in PNAS seems more promising, ‘Sensitive bacterial Vm sensors revealed the excitability of bacterial Vm and its role in antibiotic tolerance’, X.Jin et al, PNAS, 120, 3, e2208348120.

      - Electrical signal propagation is an important aspect of the manuscript. However, a detailed quantitative analysis of the spatial dynamics within the biofilm is lacking. In addition, it is unclear if the electrical signal propagates within the biofilm during the second peak regime, which is mediated by the Kch channel. This is an important question, given that the fire-diffuse-fire model is presented with emphasis on the role of K+ ions.

      We have presented a more detailed account of the electrical wavefront modelling work and it is currently under review in a physical journal, ‘Electrical signalling in three dimensional bacterial biofilms using an agent based fire-diffuse-fire model’, V.Martorelli, et al, 2024 https://www.biorxiv.org/content/10.1101/2023.11.17.567515v1

      - Since deletion of the kch gene inhibits the long-term electrical response to light stimulation (regime II), the authors concluded that K+ ions play a role in the habituation response. However, Kch is a putative K+ ion channel. The use of specific drugs could help to clarify the role of K+ ions.

      Our recent electrical impedance spectroscopy publication provides further evidence that Kch is associated with large changes in conductivity, as expected for a voltage-gated ion channel (https://pubs.acs.org/doi/10.1021/acs.nanolett.3c04446, 'Electrical impedance spectroscopy with bacterial biofilms: neuronal-like behavior', E.Akabuogu et al, ACS Nanoletters, 2024, in print).

      - The manuscript as such does not allow us to properly conclude on the photo-protective role of the Kch ion channel.

      That Kch has a photoprotective role is our current working hypothesis. The hypothesis fits with the data, but we are not saying we have proven it beyond all possible doubt.

      - The link between membrane potential dynamics and mechanosensitivity is not captured in the equation for the Q-channel opening dynamics in the Hodgkin-Huxley model (Supp Eq 2).

      Our model is agnostic with respect to the mechanosensitivity of the ion channels, although we deduce that mechanosensitive ion channels contribute to ion channel Q.

      - Given the large number of parameters used in the models, it is hard to distinguish between prediction and fitting.

      This is always an issue with electrophysiological modelling (compared with most heart and brain modelling studies we are very conservative in the choice of parameters for the bacteria). In terms of predicting the different phenomena observed, we believe the model is very successful.

      Reviewer #2 (Public Review):

      Summary of what the authors were trying to achieve:

      The authors thought they studied membrane potential dynamics in E.coli biofilms. They thought so because they were unaware that the dye they used to report that membrane potential in E.coli, has been previously shown not to report it. Because of this, the interpretation of the authors' results is not accurate.

      We believe the Pilizota work is scientifically flawed.

      Major strengths and weaknesses of the methods and results:

      The strength of this work is that all the data is presented clearly, and accurately, as far as I can tell.

      The major critical weakness of this paper is the use of ThT dye as a membrane potential dye in E.coli. The work is unaware of a publication from 2020 https://www.sciencedirect.com/science/article/pii/S0006349519308793 that demonstrates that ThT is not a membrane potential dye in E. coli. Therefore I think the results of this paper are misinterpreted. The same publication I reference above presents a protocol on how to carefully calibrate any candidate membrane potential dye in any given condition.

      We are aware of this study, but believe it to be scientifically flawed. We do not cite the article because we do not think it is a particularly useful contribution to the literature.

      I now go over each results section in the manuscript.

      Result section 1: Blue light triggers electrical spiking in single E. coli cells

      I do not think the title of the result section is correct for the following reasons. The above-referenced work demonstrates the loading profile one should expect from a Nernstian dye (Figure 1). It also demonstrates that ThT does not show that profile and explains why this is so. ThT only permeates the membrane under light exposure (Figure 5). This finding is consistent with blue light peroxidising the membrane (see also following work Figure 4 https://www.sciencedirect.com/science/article/pii/S0006349519303923 on light-induced damage to the electrochemical gradient of protons-I am sure there are more references for this).

      The Pilizota group invokes some elaborate artefacts to explain the lack of agreement with a simple Nernstian battery model. The model is incorrect not the fluorophore.

      Please note that the loading profile (only observed under light) in the current manuscript in Figure 1B as well as in the video S1 is identical to that in Figure 3 from the above-referenced paper (i.e. https://www.sciencedirect.com/science/article/pii/S0006349519308793), and corresponding videos S3 and S4. This kind of profile is exactly what one would expect theoretically if the light is simultaneously lowering the membrane potential as the ThT is equilibrating, see Figure S12 of that previous work. There, it is also demonstrated by the means of monitoring the speed of bacterial flagellar motor that the electrochemical gradient of protons is being lowered by the light. The authors state that applying the blue light for different time periods and over different time scales did not change the peak profile. This is expected if the light is lowering the electrochemical gradient of protons. But, in Figure S1, it is clear that it affected the timing of the peak, which is again expected, because the light affects the timing of the decay, and thus of the decay profile of the electrochemical gradient of protons (Figure 4 https://www.sciencedirect.com/science/article/pii/S0006349519303923).

      We think the proton effect is a million times weaker than that due to potassium, i.e. 0.2 M K+ versus 10<sup>-7</sup> M H+. We can comfortably neglect the influx of H+ in our experiments.

      I find Figure S1D interesting. There the authors load TMRM, which is a membrane voltage dye that has been used extensively (as far as I am aware this is the first reference for that and it has not been cited https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1914430/). As visible from the last TMRM reference I give, TMRM will only load the cells in Potassium Phosphate buffer with NaCl (and often we used EDTA to permeabilise the membrane). It is not fully clear (to me) whether here TMRM was prepared in rich media (it explicitly says so for ThT in Methods but not for TMRM), but it seems so. If this is the case, it likely also loads because of the damage to the membrane done with light, and therefore I am not surprised that the profiles are similar.

      The vast majority of cells continue to be viable. We do not think membrane damage is dominating.

      The authors then use CCCP. First, a small correction, as the authors state that it quenches membrane potential. CCCP is a protonophore (https://pubmed.ncbi.nlm.nih.gov/4962086/), so it collapses electrochemical gradient of protons. This means that it is possible, and this will depend on the type of pumps present in the cell, that CCCP collapses electrochemical gradient of protons, but the membrane potential is equal and opposite in sign to the DeltapH. So using CCCP does not automatically mean membrane potential will collapse (e.g. in some mammalian cells it does not need to be the case, but in E.coli it is https://www.biorxiv.org/content/10.1101/2021.11.19.469321v2). CCCP has also been recently found to be a substrate for TolC (https://journals.asm.org/doi/10.1128/mbio.00676-21), but at the concentrations the authors are using CCCP (100uM) that should not affect the results. However, the authors then state because they observed, in Figure S1E, a fast efflux of ions in all cells and no spiking dynamics this confirms that observed dynamics are membrane potential related. I do not agree that it does. First, Figure S1E, does not appear to show transients, instead, it is visible that after 50min treatment with 100uM CCCP, ThT dye shows no dynamics. The action of a Nernstian dye is defined. It is not sufficient that a charged molecule is affected in some way by electrical potential, this needs to be in a very specific way to be a Nernstian dye. Part of the profile of ThT loading observed in https://www.sciencedirect.com/science/article/pii/S0006349519308793 is membrane potential related, but not in a way that is characteristic of Nernstian dye.

      Our understanding of the literature is that CCCP poisons the whole metabolism of the bacterial cells. The ATP-driven K+ channels will stop functioning and this is the dominant contributor to membrane potential.

      Result section 2: Membrane potential dynamics depend on the intercellular distance

      In this chapter, the authors report that the time to reach the first intensity peak during ThT loading is different when cells are in microclusters. They interpret this as electrical signalling in clusters because the peak is reached faster in microclusters (as opposed to slower because intuitively in these clusters cells could be shielded from light). However, shielding is one possibility. The other is that the membrane has changed in composition and/or the effective light power the cells can tolerate (with mechanisms to handle light-induced damage, some of which authors mention later in the paper) is lower. Given that these cells were left in a microfluidic chamber for 2 hours to attach in growth media according to Methods, there is sufficient time for that to happen. In Figure S12 C and D of that same paper from my group (https://ars.els-cdn.com/content/image/1-s2.0-S0006349519308793-mmc6.pdf) one can see the effects of peak intensity and timing of the peak on the permeability of the membrane. Therefore I do not think the distance is the explanation for what authors observe.

      Shielding would provide the reverse effect, since hyperpolarization begins in the dense centres of the biofilms. For the initial 2 hours the cells receive negligible blue light. Neither of the referee’s comments thus seem tenable.

      Result section 3: Emergence of synchronized global wavefronts in E. coli biofilms

      In this section, the authors exposed a mature biofilm to blue light. They observe that the intensity peak is reached faster in the cells in the middle. They interpret this as the ion-channel-mediated wavefronts moved from the center of the biofilm. As above, cells in the middle can have different membrane permeability to those at the periphery, and probably even more importantly, there is no light profile shown anywhere in SI/Methods. I could be wrong, but the SI3 A profile is consistent with a potential Gaussian beam profile visible in the field of view. In Methods, I find the light source for the blue light and the type of microscope but no comments on how 'flat' the illumination is across their field of view. This is critical to assess what they are observing in this result section. I do find it interesting that the ThT intensity collapsed from the edges of the biofilms. In the publication I mentioned https://www.sciencedirect.com/science/article/pii/S0006349519308793#app2, the collapse of fluorescence was not understood (other than it is not membrane potential related). It was observed in Figure 5A, C, and F, that at the point of peak, electrochemical gradient of protons is already collapsed, and that at the point of peak cell expands and cytoplasmic content leaks out. This means that this part of the ThT curve is not membrane potential related. The authors see that after the first peak collapsed there is a period of time where ThT does not stain the cells and then it starts again. If after the first peak the cellular content leaks, as we have observed, then staining that occurs much later could be simply staining of cytoplasmic positively charged content, and the timing of that depends on the dynamics of cytoplasmic content leakage (we observed this to be happening over 2h in individual cells). ThT is also a non-specific amyloid dye, and in starving E. coli cells formation of protein clusters has been observed (https://pubmed.ncbi.nlm.nih.gov/30472191/), so such cytoplasmic staining seems possible.

      It is very easy to see if the illumination is flat (Köhler illumination) by comparing the intensity of background pixels on the detector. It was flat in our case. Protons have little to do with our work for reasons highlighted before. Differential membrane permeability is a speculative phenomenon not well supported by any evidence and with no clear molecular mechanism.

      Finally, I note that authors observe biofilms of different shapes and sizes and state that they observe similar intensity profiles, which could mean that my comment on 'flatness' of the field of view above is not a concern. However, the scale bar in Figure 2A is not legible, so I can't compare it to the variation of sizes of the biofilms in Figure 2C (67 to 280um). Based on this, I think that the illumination profile is still a concern.

      The referee now contradicts themselves and wants a scale bar to be more visible. We have changed the scale bar.

      Result section 4: Voltage-gated Kch potassium channels mediate ion-channel electrical oscillations in E. coli

      First I note at this point, given that I disagree that the data presented thus 'suggest that E. coli biofilms use electrical signaling to coordinate long-range responses to light stress' as the authors state, it gets harder to comment on the rest of the results.

      In this result section the authors look at the effect of Kch, a putative voltage-gated potassium channel, on ThT profile in E. coli cells. And they see a difference. It is worth noting that in the publication https://www.sciencedirect.com/science/article/pii/S0006349519308793 it is found that ThT is also likely a substrate for TolC (Figure 4), but that scenario could not be distinguished from the one where TolC mutant has a different membrane permeability (and there is a publication that suggests the latter is happening https://onlinelibrary.wiley.com/doi/10.1111/j.1365-2958.2010.07245.x). Given this, it is also possible that Kch deletion affects the membrane permeability. I do note that in video S4 I seem to see more of, what appear to be, plasmolysed cells. The authors do not see the ThT intensity with this mutant that appears long after the initial peak has disappeared, as they see in WT. It is not clear how long they waited for this, as from Figure S3C it could simply be that the dynamics of this is a lot slower, e.g. Kch deletion changes membrane permeability.

      The work showing that TolC provides a possible passive pathway for ThT to leave cells seems slightly niche. It just demonstrates another mechanism for the cells to equilibrate the concentrations of ThT in a Nernstian manner, i.e. driven by the membrane voltage.

      The authors themselves state that the evidence for Kch being a voltage-gated channel is indirect (line 54). I do not think there is a need to claim function from a ThT profile of E. coli mutants (nor do I believe it's good practice), given how accurate single-channel recordings are currently. To know the exact dependency on the membrane potential, ion channel recordings on this protein are needed first.

      We have good evidence from electrical impedance spectroscopy experiments that Kch increases the conductivity of biofilms (https://pubs.acs.org/doi/10.1021/acs.nanolett.3c04446, 'Electrical impedance spectroscopy with bacterial biofilms: neuronal-like behavior', E.Akabuogu et al, ACS Nanoletters, 2024, in print).

      Result section 5: Blue light influences ion-channel mediated membrane potential events in E. coli

      In this chapter the authors vary the light intensity and stain the cells with PI (this dye gets into the cells when the membrane becomes very permeable), and the extracellular environment with K+ dye (I have not yet worked carefully with this dye). They find that different amounts of light influence ThT dynamics. This is in line with previous literature (both papers I have been mentioning: Figure 4 https://www.sciencedirect.com/science/article/pii/S0006349519303923 and https://ars.els-cdn.com/content/image/1-s2.0-S0006349519308793-mmc6.pdf especially SI12), but does not add anything new. I think the results presented here can be explained with previously published theory and do not indicate that the ion-channel mediated membrane potential dynamics is a light stress relief process.

      The simple Nernstian battery model proposed by Pilizota et al is erroneous in our opinion for reasons outlined above. We believe it will prove to be a dead end for bacterial electrophysiology studies.

      Result section 6: Development of a Hodgkin-Huxley model for the observed membrane potential dynamics

      This results section starts with the authors stating: 'our data provide evidence that E. coli manages light stress through well-controlled modulation of its membrane potential dynamics'. As stated above, I think they are instead observing the process of ThT loading while the light is damaging the membrane and thus simultaneously collapsing the electrochemical gradient of protons. As stated above, this has been modelled before. And then, they observe a ThT staining that is independent from membrane potential.

      This is an erroneous niche opinion. Protons have little say in the membrane potential since there are so few of them. The membrane potential is mostly determined by K+.

      I will briefly comment on the Hodgkin-Huxley (HH) based model. First, I think there is no evidence for two channels with different activation profiles as authors propose. But also, the HH model has been developed for neurons. There, the leakage and the pumping fluxes are both described by a constant representing conductivity, times the difference between the membrane potential and Nernst potential for the given ion. The conductivity in the model is given as gK*n^4 for potassium, gNa*m^3*h for sodium, and gL for leakage, where gK, gNa and gL were measured experimentally for neurons. And, n, m, and h are variables that describe the experimentally observed voltage-gated mechanism of neuronal sodium and potassium channels. (Please see Hodgkin AL, Huxley AF. 1952. Currents carried by sodium and potassium ions through the membrane of the giant axon of Loligo. J. Physiol. 116:449-72 and Hodgkin AL, Huxley AF. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117:500-44).
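
      For readers less familiar with the formalism, the conductance terms described in this comment can be written out as a short sketch. The function name is hypothetical, and the default conductances and reversal potentials are commonly quoted textbook values for the squid giant axon (modern sign convention); they are assumptions used for illustration and are not measurements for E. coli or parameters of the manuscript's model.

```python
def hh_ionic_currents(v, n, m, h,
                      g_k=36.0, g_na=120.0, g_l=0.3,
                      e_k=-77.0, e_na=50.0, e_l=-54.4):
    """Return (I_K, I_Na, I_L) in uA/cm^2 for a membrane voltage v in mV.

    Each current is a conductance (mS/cm^2) times the driving force,
    i.e. the difference between v and the ion's Nernst potential.
    Default parameters are ASSUMED squid-axon textbook values.
    """
    i_k = g_k * n**4 * (v - e_k)          # potassium: gK * n^4
    i_na = g_na * m**3 * h * (v - e_na)   # sodium: gNa * m^3 * h
    i_l = g_l * (v - e_l)                 # leak: constant conductance gL
    return i_k, i_na, i_l


# Example: currents near a typical resting state (gating values are illustrative).
i_k, i_na, i_l = hh_ionic_currents(v=-65.0, n=0.32, m=0.05, h=0.6)
print(i_k, i_na, i_l)
```

      Each current vanishes when the membrane voltage equals the corresponding Nernst potential, which is the linear driving-force form referred to in the comment.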

      In the 70 years since Hodgkin and Huxley first presented their model, a huge number of similar models have been proposed to describe cellular electrophysiology. We are not being hyperbolic when we state that the HH models for excitable cells are like the Schrödinger equation for molecules. We carefully adapted our HH model to reflect the currently understood electrophysiology of E. coli.

      Thus, in applying the model to describe bacterial electrophysiology one should ensure that the near-equilibrium requirement holds (so that the (V-VQ) etc. terms in the authors' equation in Figure 5B hold), and that potassium and other channels in a given bacterium have gating properties similar to those found in neurons. I am not aware of such measurements in any bacteria, and therefore think the pump-leak model of the electrophysiology of bacteria needs to start with fluxes that are more general (for example Keener JP, Sneyd J. 2009. Mathematical physiology: I: Cellular physiology. New York: Springer or https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0000144).
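One widely used flux expression that is more general than the linear (V - E) form is the Goldman-Hodgkin-Katz (GHK) current equation. Whether this is the form the reviewer has in mind is an assumption. The Python sketch below is illustrative only: the function names, the permeability and the concentrations are hypothetical. It shows that the GHK current vanishes at the Nernst potential, so a linear term is a local approximation around that point.

```python
import math

R = 8.314      # gas constant, J/(mol K)
F = 96485.0    # Faraday constant, C/mol


def nernst_potential(c_in, c_out, z=1, temp=310.0):
    """Nernst potential in volts for an ion of valence z."""
    return (R * temp) / (z * F) * math.log(c_out / c_in)


def ghk_current(v, c_in, c_out, perm, z=1, temp=310.0):
    """GHK current density at membrane potential v (volts).

    perm is a hypothetical permeability; c_in and c_out share one unit.
    Positive values denote outward current for a cation.
    """
    u = z * F * v / (R * temp)
    if abs(u) < 1e-9:
        # Limit u -> 0: the flux is purely diffusive.
        return perm * z * F * (c_in - c_out)
    return perm * z * F * u * (c_in - c_out * math.exp(-u)) / (1.0 - math.exp(-u))
```

For example, with the illustrative values `c_in=150.0` and `c_out=5.0`, `ghk_current(nernst_potential(150.0, 5.0), 150.0, 5.0, 1e-6)` is zero to within rounding, while far from that potential the current-voltage relation is no longer a straight line.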

      The reference is to a slightly more modern version of a simple Nernstian battery model. The model will not oscillate and thus will not help modelling membrane potentials in bacteria. We are unsure where the equilibrium requirement comes from (inadequate modelling of the dynamics?)

      Result section 7: Mechanosensitive ion channels (MS) are vital for the first hyperpolarization event in E. coli.

      The results that Msc channels affect the profile of ThT dye are interesting. It is again possible that the membrane permeability of these mutants has changed and therefore the dynamics have changed, so this needs to be checked first. I also note that our results show that the peak of ThT coincides with cell expansion. For this to be understood, a model is needed that also takes into account the link between maintenance of electrochemical gradients of ions in the cell and osmotic pressure.

      The evidence for permeability changes in the membranes seems to be tenuous.

      A side note is that the authors state that the Msc responds to stress-related voltage changes. I think this is an overstatement. Mscs respond predominantly to membrane tension and are mostly nonspecific (see how their action recovers cellular volume in this publication: https://www.pnas.org/doi/full/10.1073/pnas.1522185113). The authors cite references 35-39 to support this statement. These publications still state that these channels are predominantly membrane tension-gated. Some of the references state that the presence of external ions is important for tension-related gating but sometimes they gate spontaneously in the presence of certain ions. Other publications cited don't really look at gating with respect to ions (39 is on clustering). This is why I think the statement is somewhat misleading.

      We have reworded the discussion of Mscs since the literature appears to be ambiguous. We will try to run some electrical impedance spectroscopy experiments on the Msc mutants in the future to attempt to remove the ambiguity.

      Result section 8: Anomalous ion-channel-mediated wavefronts propagate light stress signals in 3D E. coli biofilms.

      I am not commenting on this result section, as it would only be applicable if ThT was membrane potential dye in E. coli.

      Ok, but we disagree on the use of ThT.

      Aims achieved/results support their conclusions:

      The authors clearly present their data. I am convinced that they have accurately presented everything they observed. However, I think their interpretation of the data and conclusions is inaccurate in line with the discussion I provided above.

      Likely impact of the work on the field, and the utility of the methods and data to the community:

      I do not think this publication should be published in its current format. It should be revised in light of the previous literature as discussed in detail above. I believe presenting it in its current form on eLife pages would create unnecessary confusion.

      We believe many of the Pilizota group articles are scientifically flawed and are causing the confusion in the literature.

      Any other comments:

      I note that while this work studies E. coli, it references papers in other bacteria using ThT. For example, in lines 35-36 the authors state that bacteria (Bacillus subtilis in this case) in biofilms have been recently found to modulate membrane potential, citing the relevant literature from 2015. It is worth noting that the most recent paper https://journals.asm.org/doi/10.1128/mbio.02220-23 found that ThT binds to one or more proteins in the spore coat, suggesting that it does not act as a membrane potential dye in Bacillus spores. It is possible that it still reports membrane potential in Bacillus cells and the recent results are strictly spore-specific, but these should be kept in mind when using ThT with Bacillus.

      ThT was used successfully in previous studies of normal B. subtilis cells (by our own group and A.Prindle, ‘Spatial propagation of electrical signal in circular biofilms’, J.A.Blee et al, Physical Review E, 2019, 100, 052401, J.A.Blee et al, ‘Membrane potentials, oxidative stress and the dispersal response of bacterial biofilms to 405 nm light’, Physical Biology, 2020, 17, 2, 036001, A.Prindle et al, ‘Ion channels enable electrical communication in bacterial communities’, Nature, 2015, 527, 59-63). The connection to low-metabolism spore research seems speculative.

      Reviewer #3 (Public Review):

      It has recently been demonstrated that bacteria in biofilms show changes in membrane potential in response to changes in their environment, and that these can propagate signals through the biofilm to coordinate bacterial behavior. Akabuogu et al. contribute to this exciting research area with a study of blue light-induced membrane potential dynamics in E. coli biofilms. They demonstrate that Thioflavin-T (ThT) intensity (a proxy for membrane potential) displays multiphasic dynamics in response to blue light treatment. They additionally use genetic manipulations to implicate the potassium channel Kch in the latter part of these dynamics. Mechanosensitive ion channels may also be involved, although these channels seem to have blue light-independent effects on membrane potential as well. In addition, there are challenges to the quantitative interpretation of ThT microscopy data which require consideration. The authors then explore whether these dynamics are involved in signaling at the community level. The authors suggest that cell firing is both more coordinated when cells are clustered and happens in waves in larger, 3D biofilms; however, in both cases evidence for these claims is incomplete. The authors present two simulations to describe the ThT data. The first of these simulations, a Hodgkin-Huxley model, indicates that the data are consistent with the activity of two ion channels with different kinetics; the Kch channel mutant, which ablates a specific portion of the response curve, is consistent with this. The second model is a fire-diffuse-fire model to describe wavefront propagation of membrane potential changes in a 3D biofilm; because the wavefront data are not presented clearly, the results of this model are difficult to interpret. 
Finally, the authors discuss whether these membrane potential changes could be involved in generating a protective response to blue light exposure; increased death in a Kch ion channel mutant upon blue light exposure suggests that this may be the case, but a no-light control is needed to clarify this.

      In a few instances, the paper is missing key control experiments that are important to the interpretation of the data. This makes it difficult to judge the meaning of some of the presented experiments.

      (1) An additional control for the effects of autofluorescence is very important. The authors conduct an experiment where they treat cells with CCCP and see that Thioflavin-T (ThT) dynamics do not change over the course of the experiment. They suggest that this demonstrates that autofluorescence does not impact their measurements. However, cellular autofluorescence depends on the physiological state of the cell, which is impacted by CCCP treatment. A much simpler and more direct experiment would be to repeat the measurement in the absence of ThT or any other stain. This experiment should be performed both in the wild-type strain and in the ∆kch mutant.

      ThT is a very bright fluorophore (much brighter than a GFP). It is clear from the images of non-stained samples that autofluorescence provides a negligible contribution to the fluorescence intensity in an image.

      (2) The effects of photobleaching should be considered. Of course, the intensity varies a lot over the course of the experiment in a way that photobleaching alone cannot explain. However, photobleaching can still contribute to the kinetics observed. Photobleaching can be assessed by changing the intensity, duration, or frequency of exposure to excitation light during the experiment. Considerations about photobleaching become particularly important when considering the effect of catalase on ThT intensity. The authors find that the decrease in ThT signal after the initial "spike" is attenuated by the addition of catalase; this is what would be predicted by catalase protecting ThT from photobleaching (indeed, catalase can be used to reduce photobleaching in time lapse imaging).

      Photobleaching was negligible over the course of the experiments. We employed techniques such as reducing sample exposure time and using the appropriate light intensity to minimize photobleaching.

      (3) It would be helpful to have a baseline of membrane potential fluctuations in the absence of the proposed stimulus (in this case, blue light). Including traces of membrane potential recorded without light present would help support the claim that these changes in membrane potential represent a blue light-specific stress response, as the authors suggest. Of course, ThT is blue, so if the excitation light for ThT is problematic for this experiment the alternative dye tetramethylrhodamine methyl ester perchlorate (TMRM) can be used instead.

      Unfortunately the fluorescent baseline is too weak to measure cleanly in this experiment. It appears that the collective response of all the bacteria hyperpolarizing at the same time dominates the signal (measurements in the eLife article and new potentiometry measurements).

      (4) The effects of ThT in combination with blue light should be more carefully considered. In mitochondria, a combination of high concentrations of blue light and ThT leads to disruption of the PMF (Skates et al. 2021 BioRXiv), and similarly, ThT treatment enhances the photodynamic effects of blue light in E. coli (Bondia et al. 2021 Chemical Communications). If present in this experiment, this effect could confound the interpretation of the PMF dynamics reported in the paper.

      We think the PMF plays a minority role in determining the membrane potential in E. coli, for the reasons outlined before (H+ is a minority ion in E. coli compared with K+).

      (5) Figures 4D - E indicate that a ∆kch mutant has increased propidium iodide (PI) staining in the presence of blue light; this is interpreted to mean that Kch-mediated membrane potential dynamics help protect cells from blue light. However, Live/Dead staining results in these strains in the absence of blue light are not reported. This means that the possibility that the ∆kch mutant has a general decrease in survival (independent of any effects of blue light) cannot be ruled out.

      Both bacterial strains had similar growth curves and also engaged in membrane potential dynamics for the duration of the experiment. We were interested in bacterial cells that exhibited membrane potential dynamics in the presence of the stress. Bacterial cells need to be alive to engage in membrane potential dynamics (hyperpolarize) under stress conditions. Cells that engaged in membrane potential dynamics and later stained red were only counted after the entire duration. We believe that the wildtype handles the light stress better than the ∆kch mutant, as measured with PI.

      (6) Additionally in Figures 4D - E, the interpretation of this experiment can be confounded by the fact that PI uptake can sometimes be seen in bacterial cells with high membrane potential (Kirchhoff & Cypionka 2017 J Microbial Methods); the interpretation is that high membrane potential can lead to increased PI permeability. Because the membrane potential is largely higher throughout blue light treatment in the ∆kch mutant (Fig. 3AB), this complicates the interpretation of this experiment.

      Kirchhoff & Cypionka 2017 J Microbial Methods, using fluorescence microscopy, suggested that changes in membrane potential dynamics can introduce experimental bias when propidium iodide is used to confirm the viability of the bacterial strains B. subtilis (DSM-10) and Dinoroseobacter shibae that are starved of oxygen (via N2 gassing) for 2 hours. They attempted to support their findings by using CCCP to stop the membrane potential dynamics (but never showed any pictorial or plotted data for this confirmatory experiment). In our experimental methodology, cell death was not forced on the cells by introducing an extra burden or via anoxia. We believe that the accumulation of PI in the ∆kch mutant is not due to high membrane potential dynamics but is attributable to PI showing damaged/dead cells without bias. We think that propidium iodide is good for this experiment. Propidium iodide is a dye that is extensively used in life sciences. PI has also been used in the study of bacterial electrophysiology (https://pubmed.ncbi.nlm.nih.gov/32343961/) and no membrane-potential-related bias was reported.

      Throughout the paper, many ThT intensity traces are compared, and described as "similar" or "dissimilar", without detailed discussion or a clear standard for comparison. For example, the two membrane potential curves in Fig. S1C are described as "similar" although they have very different shapes, whereas the curves in Fig. 1B and 1D are discussed in terms of their differences although they are evidently much more similar to one another. Without metrics or statistics to compare these curves, it is hard to interpret these claims. These comparative interpretations are additionally challenging because many of the figures in which average trace data are presented do not indicate standard deviation.

      Comparison of small changes in the absolute intensities is problematic in such fluorescence experiments. We mean the shape of the traces is similar and they can be modelled using a HH model with similar parameters.
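The exchange above turns on what "similar shape" means. One way to make the comparison explicit is sketched below in Python. The two metrics (Pearson correlation and RMSE after min-max normalisation) and all function names are assumptions chosen for illustration, not the method used in the paper. Both metrics are insensitive to absolute intensity, which the authors identify as unreliable in these experiments.

```python
import math


def min_max_normalise(trace):
    """Rescale a trace to the range [0, 1] so that only its shape is compared."""
    lo, hi = min(trace), max(trace)
    return [(x - lo) / (hi - lo) for x in trace]


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally sampled traces."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def shape_rmse(a, b):
    """Root-mean-square difference between the min-max normalised traces."""
    na, nb = min_max_normalise(a), min_max_normalise(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)) / len(na))
```

Two traces that differ only by a scale factor and an offset give a correlation of 1 and a shape RMSE of 0. Reporting such numbers alongside each "similar" or "dissimilar" statement would give readers a common standard.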

      The differences between the TMRM and ThT curves that the authors show in Fig. S1C warrant further consideration. Some of the key features of the response in the ThT curve (on which much of the modeling work in the paper relies) are not very apparent in the TMRM data. It is not obvious to me which of these traces will be more representative of the actual underlying membrane potential dynamics.

      In our experiment, TMRM was used to confirm the dynamics observed using ThT. However, ThT appears to be more photostable than TMRM (especially towards the 2nd peak). The most interesting observation is that with both dyes, all phases of the membrane potential dynamics were conspicuous (the first peak, the quiescent period and the second peak). The time periods for these three episodes were also similar.

      A key claim in this paper (that dynamics of firing differ depending on whether cells are alone or in a colony) is underpinned by "time-to-first peak" analysis, but there are some challenges in interpreting these results. The authors report an average time-to-first peak of 7.34 min for the data in Figure 1B, but the average curve in Figure 1B peaks earlier than this. In Figure 1E, it appears that there are a handful of outliers in the "sparse cell" condition that likely explain this discrepancy. Either an outlier analysis should be done and the mean recomputed accordingly, or a more outlier-robust method like the median should be used instead. Then, a statistical comparison of these results will indicate whether there is a significant difference between them.

      The key point is the comparison of standard errors on the standard deviation.
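The reviewer's suggestion of an outlier analysis or an outlier-robust summary can be illustrated with a short Python sketch. The time-to-first-peak values below are hypothetical and are not data from the paper. Tukey's 1.5 x IQR rule is one common choice of outlier criterion and is an assumption here.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical time-to-first-peak values in minutes (not data from the paper).
times = [5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 8.0, 30.0]

mean_time = statistics.mean(times)       # pulled upwards by the single late cell
median_time = statistics.median(times)   # unaffected by the single late cell
late_cells = iqr_outliers(times)         # cells flagged by the 1.5 * IQR rule
```

In this toy example one late-firing cell raises the mean well above the median, which is the kind of discrepancy the reviewer describes between the reported mean and the peak of the average curve. A rank-based test could then be applied to compare the two conditions.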

      In two different 3D biofilm experiments, the authors report the propagation of wavefronts of membrane potential; I am unable to discern these wavefronts in the imaging data, and they are not clearly demonstrated by analysis.

      The first data set is presented in Figures 2A, 2B, and Video S3. The images and video are very difficult to interpret because of how the images have been scaled: the center of the biofilm is highly saturated, and the zero value has also been set too high to consistently observe the single cells surrounding the biofilm. With the images scaled this way, it is very difficult to assess dynamics. The time stamps in Video S3 and on the panels in Figure 2A also do not correspond to one another although the same biofilm is shown (and the time course in 2B is also different from what is indicated in 2B). In either case, it appears that the center of the biofilm is consistently brighter than the edges, and the intensity of all cells in the biofilm increases in tandem; by eye, propagating wavefronts (either directed toward the edge or the center) are not evident to me. Increased brightness at the center of the biofilm could be explained by increased cell thickness there (as is typical in this type of biofilm). From the image legend, it is not clear whether the image presented is a single confocal slice or a projection. Even if this is a single confocal slice, in both Video S3 and Figure 2A there are regions of "haze" from out-of-focus light evident, suggesting that light from other focal planes is nonetheless present. This seems to me to be a simpler explanation for the fluorescence dynamics observed in this experiment: cells are all following the same trajectory that corresponds to that seen for single cells, and the center is brighter because of increased biofilm thickness.

      We thank the reviewer for this important observation. We have made changes to the figures to address this confusion. The cell cover has no influence on the observed membrane potential dynamics. The entire biofilm was exposed to the same blue light at each time. Therefore all parts of the biofilm received equal amounts of the blue light intensity. The membrane potential dynamics were not influenced by cell density (see Fig 2C).

      The second data set is presented in Video S6B; I am similarly unable to see any wave propagation in this video. I observe only a consistent decrease in fluorescence intensity throughout the experiment that is spatially uniform (except for the bright, dynamic cells near the top; these presumably represent cells that are floating in the microfluidic and have newly arrived to the imaging region).

      A visual inspection of Video S6B shows a fast rise, a decrease in fluorescence and a second rise (supplementary figure 4B). The data for the fluorescence were carefully obtained using the Imaris software. We created a curved geometry on each slice of the confocal stack. We analyzed the surfaces of this curved plane along the z-axis. This was carried out in Imaris.

      3D imaging data can be difficult to interpret by eye, so it would perhaps be more helpful to demonstrate these propagating wavefronts by analysis; however, such analysis is not presented in a clear way. The legend in Figure 2B mentions a "wavefront trace", but there is no position information included - this trace instead seems to represent the average intensity trace of all cells. To demonstrate the propagation of a wavefront, this analysis should be shown for different subpopulations of cells at different positions from the center of the biofilm. Data is shown in Figure 8 that reflects the velocity of the wavefront as a function of biofilm position; however, because the wavefronts themselves are not evident in the data, it is difficult to interpret this analysis. The methods section additionally does not contain sufficient information about what these velocities represent and how they are calculated. Because of this, it is difficult for me to evaluate the section of the paper pertaining to wave propagation and the predicted biofilm critical size.

      The analysis is considered in more detail in a more expansive modelling article, currently under peer review in a physics journal, ‘Electrical signalling in three dimensional bacterial biofilms using an agent based fire-diffuse-fire model’, V.Martorelli, et al, 2024 https://www.biorxiv.org/content/10.1101/2023.11.17.567515v1
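The position-resolved analysis requested by the reviewer can be sketched as follows, assuming that intensity traces have already been averaged within regions at known distances from the biofilm centre. The function names and the example numbers are hypothetical, and this is not the authors' Imaris workflow. The idea is that a propagating wavefront shows up as a peak time that shifts systematically with position.

```python
def time_of_peak(times, trace):
    """Return the time at which a trace reaches its maximum."""
    peak_index = max(range(len(trace)), key=trace.__getitem__)
    return times[peak_index]


def wavefront_speed(radii, peak_times):
    """Least-squares slope of radius versus peak time (distance per unit time)."""
    n = len(radii)
    mean_t = sum(peak_times) / n
    mean_r = sum(radii) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in zip(peak_times, radii))
    den = sum((t - mean_t) ** 2 for t in peak_times)
    if den == 0:
        raise ValueError("all regions peak simultaneously: no propagating front")
    return num / den


# Hypothetical example: regions at 0, 10 and 20 um peaking at 1, 2 and 3 min.
speed = wavefront_speed([0.0, 10.0, 20.0], [1.0, 2.0, 3.0])
```

A positive slope corresponds to a front moving outwards from the centre and a negative slope to one moving inwards. If every region peaks at the same time, the function raises an error, which corresponds to the synchronous response the reviewer suspects.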

      There are some instances in the paper where claims are made that do not have data shown or are not evident in the cited data:

      (1) In the first results section, "When CCCP was added, we observed a fast efflux of ions in all cells"- the data figure pertaining to this experiment is in Fig. S1E, which does not show any ion efflux. The methods section does not mention how ion efflux was measured during CCCP treatment.

      We have worded this differently to properly convey our results.

      (2) In the discussion of voltage-gated calcium channels, the authors refer to "spiking events", but these are not obvious in Figure S3E. Although the fluorescence intensity changes over time, it's hard to distinguish these fluctuations from measurement noise; a no-light control could help clarify this.

      The calcium transients observed were not due to noise or artefacts.

      (3) The authors state that the membrane potential dynamics simulated in Figure 7B are similar to those observed in 3D biofilms in Fig. S4B; however, the second peak is not clearly evident in Fig. S4B and it looks very different for the mature biofilm data reported in Fig. 2. I have some additional confusion about this data specifically: in the intensity trace shown in Fig. S4B, the intensity in the second frame is much higher than the first; this is not evident in Video S6B, in which the highest intensity is in the first frame at time 0. Similarly, the graph indicates that the intensity at 60 minutes is higher than the intensity at 4 minutes, but this is not the case in Fig. S4A or Video S6B.

      The confusion stated here has now been addressed. Also, it should be noted that while Fig 2.1 was obtained with an LED light source, Fig S4A was obtained using a laser light source. While obtaining the confocal images (for Fig S4A), the light intensity was controlled to further minimize photobleaching. Most importantly, there is evidence of a slow rise to the 2nd peak in Fig S4B. The first peak, quiescence and slow rise to the second peak are evident.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Scientific recommendations:

      - Although Fig 4A clearly shows that light stimulation has an influence on the dynamics of cell membrane potential in the biofilm, it is important to rule out the contribution of variations in environmental parameters. I understand that for technical reasons, the flow of fresh medium must be stopped during image acquisition. Therefore, I suggest performing control experiments, where the flow is stopped before image acquisition (15min, 30min, 45min, and 1h before). If there is no significant contribution from environmental variations (pH, RedOx), the dynamics of the electrical response should be superimposed whatever the delay between stopping the flow and switching on the light.

      In the current study, we focused on how E. coli cells and biofilms react to blue light stress via their membrane potential dynamics. This involved growing the cells and biofilms, stopping the media flow and obtaining data immediately. We believe that stopping the flow not only helped us to manage data acquisition but also helped us reduce the effect of environmental factors. In a future study we will expand the work to include how the membrane potential dynamics evolve in the presence of changing environmental factors, for example those induced by stopping the flow at varied times.

      - Since TMRM signal exhibits a linear increase after the first response peak (Supplementary Figure 1D), I recommend mitigating the statement at line 78.

      - To improve the spatial analysis of the electrical response, I suggest plotting kymographs of the intensity profiles across the biofilm. I have plotted this kymograph for Video S3 and it appears that there is no electrical propagation for the second peak. In addition, the authors should provide technical details of how R^2(t) is measured in the first regime (Figure 7E).

      See the dedicated simulation article for more details. https://www.biorxiv.org/content/10.1101/2023.11.17.567515v1
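The kymograph suggested by the reviewer can be sketched in a few lines of Python. The frame layout (a sequence of 2D images stored as lists of rows), the choice of a single row across the biofilm and the function names are all assumptions for illustration.

```python
def kymograph(frames, row):
    """Stack the intensity profile along one image row for every frame.

    frames: sequence of 2D images (lists of rows); row: index of the row that
    crosses the biofilm. Returns one profile per time point, so that time
    runs along the first axis and position along the second.
    """
    return [list(frame[row]) for frame in frames]


def front_positions(kymo):
    """Position of the brightest pixel at each time point."""
    return [max(range(len(profile)), key=profile.__getitem__) for profile in kymo]


# Synthetic example: a bright band that moves one pixel per frame.
frames = [
    [[1.0 if col == t else 0.0 for col in range(5)] for _ in range(3)]
    for t in range(4)
]
positions = front_positions(kymograph(frames, row=1))
```

In a kymograph a propagating front appears as a diagonal band, as in this synthetic example, whereas a synchronous response appears as a band at a single time point across all positions.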

      - Line 152: To assess the variability of the latency, the authors should consider measuring the variance divided by the mean instead of SD, which may depend on the average value.

      We are happy with our current use of standard error on the standard deviation. It shows what we claim to be true.
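The alternative measure raised in this exchange, variance divided by the mean, can be computed alongside the standard deviation as in the Python sketch below. The function name and the example latencies are hypothetical. The ratio SD/mean is included as a related, dimensionless measure and is an addition for comparison, not something either party proposed.

```python
import statistics


def dispersion_measures(latencies):
    """Return (SD, variance / mean, SD / mean) for a list of latencies."""
    mean = statistics.mean(latencies)
    variance = statistics.variance(latencies)  # sample variance (n - 1)
    sd = variance ** 0.5
    return sd, variance / mean, sd / mean


# Hypothetical latencies in minutes (not data from the paper).
sd, var_over_mean, sd_over_mean = dispersion_measures([4.0, 6.0, 8.0])
```

Two conditions with the same SD but different mean latencies give different values of the ratio measures, which is why the reviewer suggests them when the average time-to-first-peak differs between sparse cells and biofilms.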

      - Line 154-155: To truly determine whether the amplitude of the "action potential" is independent of biofilm size, the authors should not normalise the signals.

      Good point. We qualitatively compared both normalized and unnormalized data. Recent electrical impedance spectroscopy measurements (unpublished) indicate that the electrical activity is an extensive quantity i.e. it scales with the size of the biofilms.

      - To clarify the role of K+ in the habituation response, I suggest using valinomycin at sub-inhibitory concentrations (10µM). Besides, the high concentration of CCCP used in this study completely inhibits cell activity. Not surprisingly, no electrical response to light stimulation was observed in the presence of CCCP. Finally, the Kch complementation experiment exhibits a "drop after the first peak" on a single point. It would be more convincing to increase the temporal resolution (1min->10s) to show that there is indeed a first and a second peak.

      An interesting experiment for the future.

      - Line 237-238: There are only two points suggesting that the dynamics of hyperpolarization are faster at higher irradiance (Fig 4A). The authors should consider adding a third intermediate point at 17µW/mm^2 to confirm the statement made in this sentence.

      Multiple repeats were performed. We are confident of the robustness of our data.

      - Line 249 + Fig 4E: It seems that the data reported on Fig 4E are extracted from Fig 4D. If this is indeed the case, the data should be normalised by the total population size to compare survival probabilities under the two conditions. It would also be great to measure these probabilities (for WT and ∆kch) in the presence of ROS scavengers.

      - To distinguish between model fitting and model predictions, the authors should clearly state which parameters are taken from the literature and which parameters are adjusted to fit the experimental data.

      - Supplementary Figure 4A: why can't we see any wavefront in this series of images?

      For the experimental data, the wavefront was analyzed using the Imaris software. We systematically created a ROI with a curved geometry within the confocal stack (the biofilm). The fluorescence of ThT traced along the surface of the curved geometry was analyzed along the z-axis.

      - Fig 7B: Could the authors explain why the plateau is higher in the simulations than in the biofilm experiments? Could they add noise on the firing activities?

      See the dedicated Martorelli modelling article. In general we would need to use stochastic Hodgkin-Huxley modelling, and the fluorescence data (and electrical impedance spectroscopy data) presented do not have extensive noise (due to collective averaging over many bacterial cells).

      - Supplementary Figure 4B: Why can't we see the second peak in confocal images?

      The second peak is present although not as robust as in Fig 2B. The confocal images were obtained with a laser source. Therefore we tried to create a balance between applying sufficient light stress on the bacterial cells and mitigating photobleaching.

      Editing recommendations:

      The editing recommendations below have been applied where appropriate.

      - Many important technical details are missing (e.g. R^2, curvature, and 445nm irradiance measurements). Error bars are missing from most graphs. The captions should clearly indicate if these are single-cell or biofilm experiments, strain name, illumination conditions, number of experiments, SD, or SE. Please indicate on all panels of all figures in the main text and in the supplements, which are the conditions: single cell vs. biofilm, strains, medium, centrifugal vs centripetal etc..., where relevant. Please also draw error bars everywhere.

      We have now made appropriate changes. We specifically use "cells" when dealing with single cells and "biofilms" when dealing with biofilms. We decided to give the strain name either on the panel or in the image description.

      - Line 47-51: The way the paragraph is written suggests that no coordinated electrical oscillations have been observed in Gram-negative biofilms. However, Hennes et al (referenced as 57 in this manuscript) have shown that a wave of hyperpolarized cells propagates in a colony of Neisseria gonorrhoeae, which is a Gram-negative bacterium.

      We are now aware of this work. It was not published when we first submitted our work and the authors claim the waves of activity are due to ROS diffusion NOT propagating waves of ions (coordinated electrical wavefronts).

      - Line 59: "stressor" -> "stress" or "perturbation".

      The correction has been made.

      - Line 153: Please indicate in the Material&Methods how the size of the biofilm is measured.

      The biofilm size was obtained using BiofilmQ, and the step-by-step guide for using BiofilmQ has been stated.

      - Figure 2A: Please provide associated brightfield images to locate bacteria.

      - Line 186: Please remove "wavefront" from the caption. Fig2B only shows the average signal as a function of time.

      This correction has been implemented.

      - Fig 3B,C: Please indicate single cell and biofilm on the panels and also WT and ∆kch.

      - Line 289: I suggest adding "in single cell experiments" to the title of this section.

      - Fig 5A: blue light is always present at regular time intervals during regime I and II. The presence of blue light only in regime I could be misleading.

      - Fig 5C: The curve in Fig 5D seems to correspond to the biofilm case. The curve given by the model, should be compared with the average curve presented in Fig 1D.

      - Fig 6A, B, and C: These figures could be moved to supplements.

      - Line 392: Replace "turgidity" with "turgor pressure".

      - Fig 7C,E: Please use a log-log scale to represent these data and indicate the line of slope 1.

      - Fig 7E: The x-axis has been cropped.

      - Please provide a supplementary movie for the data presented in Fig 7E.

      - Line 455: E. coli biofilms do not express ThT.

      - Line 466: "\gamma is the anomalous exponent". Please remove anomalous (\gamma can equal 1 at this stage).

      - Line 475: Please replace "section" with "projection".

      - Line 476: Please replace "spatiotemporal" with "temporal". There is no spatial dependency in either figure.

      - Line 500: Please define Eikonal approximation.

      - Fig 8 could be moved to supplements.

      - Line 553: "predicted" -> "predict".

      - Line 593: Could the authors explain why their model offers much better quantitative agreement?

      - Line 669: What does "universal" mean in that context?

      - Line 671: A volume can be pipetted but not a concentration.

      - Line 676: Are triplicates technical or biological replicates?

      - Sup Fig1: Please use minutes instead of seconds in panel A.

      - Model for membrane dynamics: "The fraction of time the Q+ channel is open" -> "The dynamics of Q+ channel activity can be written". Ditto for K+ channel...

      - Model for membrane dynamics: "the term ... is a threshold-linear". This function is not linear at all. Why is it called linear? Also, please describe what \sigma is.

      - ABFDF model: "releasing a given concentration" -> "releasing a local concentration" or "a given number" but it's not \sigma anymore. Besides, this \sigma is unlikely related to the previous \sigma used in the model of membrane potential dynamics in single cells. Please consider renaming one or the other. Also, ions are referred to as C+ in the text and C in equation 8. Am I missing something?

      Reviewer #2 (Recommendations For The Authors):

      I have included all my comments as one review. I have done so, despite the fact that some minor comments could have gone into this section, because I decided to review each Result section. I thus felt that not writing it as one review might be harder to follow. I have however highlighted which comments are minor suggestions or where I felt corrections were needed.

      However, while I am happy with all my comments being public, given their nature I think they should be shown to authors first. Perhaps the authors want to go over them and think about it before deciding if they are happy for their manuscript to be published along with these comments, or not. I will highlight this in an email to the editor. I question whether in this case, given that I am raising major issues, publishing both the manuscript and the comments is the way to go as I think it might just generate confusion among the audience.

      Reviewer #3 (Recommendations For The Authors):

      I was unable to find any legends for any of the supplemental videos in my review materials, and I could not open supplemental video 5.

      I made some comments in the public review about the analysis and interpretation of the time-to-fire data. One of the other challenges in this data set is that the time resolution is limited- it seems that a large proportion of cells have already fired after a single acquisition frame. It would be ideal to increase the time resolution on this measurement to improve precision. This could be done by imaging more quickly, but that would perhaps necessitate more blue light exposure; an alternative is to do this experiment under lower blue light irradiance where the first spike time is increased (Figure 4A).

      In the public review, I mentioned the possible impact of high membrane potential on PI permeability. To address this, the experiment could be repeated with other stains, or the viability of blue light-treated cells could be addressed more directly by outgrowth or colony-forming unit assays.

      In the public review, I mentioned the possible combined toxicity of ThT and blue light. Live/dead experiments after blue light exposure with and without ThT could be used to test for such effects, and/or the growth curve experiment in Figure 1F could be repeated with blue light exposure at a comparable irradiance used in the experiment.

      Throughout the paper and figure legends, it would help to have more methodological details in the main text, especially those that are critical for the interpretation of the experiment. The experimental details in the methods section are nicely described, but the data analysis section should be expanded significantly.

      At the end of the results section, the authors suggest a critical biofilm size of only 4 µm for wavefront propagation (not much larger than a single cell!). The authors show responses for various biofilm sizes in Fig. 2C, but these are all substantially larger. Are there data for cell clusters above and below this size that could support this claim more directly?

      The authors mention image registration as part of their analysis pipeline, but the 3D data sets in Video S6B and Fig. S4A do not appear to be registered- were these registered prior to the velocity analysis reported in Fig. 8?

      One of the most challenging claims to demonstrate in this paper is that these membrane potential wavefronts are involved in coordinating a large, biofilm-scale response to blue light. One possible way to test this might be to repeat the Live/Dead experiment in planktonic culture or the single-cell condition. If the protection from blue light specifically emerges due to coordinated activity of the biofilm, the Kch mutant would not be expected to show a change in Live/Dead staining in non-biofilm conditions.

      Line 140: How is "mature biofilm" defined? Also on this same line, what does "spontaneous" mean here?

      Line 151: "much smaller": Given that the reported time for 3D biofilms is 2.73 ± 0.85 min and in microclusters is 3.27 ± 1.77 min, this seems overly strong.
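
      To make the point about effect size concrete, the following is a minimal sketch of a standardized mean difference computed from the two reported values. It assumes that the reported spreads are standard deviations and that the two groups are of equal size; neither is stated in the text, and the function name is hypothetical.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Standardized mean difference (Cohen's d).

    Assumes equal group sizes, so the pooled SD is the
    root mean square of the two standard deviations.
    """
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_b - mean_a) / pooled_sd


# Reported times (min): 3D biofilms vs. microclusters.
# Treating the reported spreads as SDs is an assumption.
d = cohens_d(2.73, 0.85, 3.27, 1.77)
print(round(d, 2))  # 0.39
```

      Under these assumptions the difference is well under one pooled standard deviation, which is why "much smaller" reads as too strong.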

      Line 155: How is "biofilm density" characterized? Additionally, the data in Figure 2C are presented in distance units (µm), but the text refers to "areal coverage"- please define the meaning of these distance units in the legend and/or here in the text (is this the average radius?).

      Lines 161-162: These claims seem strong given the data presented before, and the logic is not very explicit. For example, in the second sentence, the idea that this signaling is used to "coordinate long-range responses to light stress" does not seem strongly evidenced at this point in the paper. What is meant by a long-range response to light stress- are there processes to respond to light that occur at long-length scales (rather than on the single-cell scale)? If so, is there evidence that these membrane potential changes could induce these responses? Please clarify the logic behind these conclusions.

      Lines 235-236: In the lower irradiance conditions, the responses are slower overall, and it looks like the ThT intensity is beginning to rise at the end of the measurement. Could a more prominent second peak be observed in these cases if the measurement time was extended?

      Line 242-243: The overall trajectories of extracellular potassium are indeed similar, but the kinetics of the second peak of potassium are different than those observed by ThT (it rises some minutes earlier)- is this consistent with the idea that Kch is responsible for that peak? Additionally, the potassium dynamics also reflect the first peak- is this surprising given that the Kch channel has no effect on this peak?

      Line 255-256: Again, this seems like a very strong claim. There are several possible interpretations of the catalase experiment (which should be discussed); this experiment perhaps suggests that ROS impacts membrane potential, but does not obviously indicate that these membrane potential fluctuations mitigate ROS levels or help the cells respond to ROS stress. The loss of viability in the ∆kch mutant might indicate a link between these membrane potential experiments and viability, but it is hard to interpret without the no-light control I mention in the public review.

      Lines 313-315: "The model predicts... the external light stress". Please clarify this section. Where does this prediction arise from in the modeling work? Second, I am not sure what is meant by "modulates the light stress" or "keeps the cell dynamics robust to the intensity of external light stress" (especially since the dynamics clearly vary with irradiance, as seen in Figure 4A).

      Line 322: I am not sure what "handles the ROS by adjusting the profile of the membrane potential dynamics" means. What is meant by "handling" ROS? Is the hypothesis that membrane potential dynamics themselves are protective against ROS, or that they induce a ROS-protective response downstream, or something else? Later in lines 327-8 the authors write that changes in the response to ROS in the model agree with the hypothesis, but just showing that ROS impacts the membrane potential does not seem to demonstrate that this has a protective effect against ROS.

      Line 365-366: This section title seems confusing- mechanosensitive ion channels totally ablate membrane potential dynamics, they don't have a specific effect on the first hyperpolarization event. The claim that mechanosensitive ion channels are specifically involved in the first event also appears in the abstract.

      Also, the apparent membrane potential is much lower even at the start of the experiment in these mutants- is this expected? This seems to imply that these ion channels also have a blue light independent effect.

      Lines 368, 371: Should be VGCCs rather than VGGCs.

      Line 477: I believe the figure reference here should be to Figure 7B, not 6B.

      Line 567-568: "The initial spike is key to registering the presence of the light stress." What is the evidence for this claim?

      Line 592-594: "We have presented much better quantitative agreement..." This is a strong claim; it is not immediately evident to me that the agreement between model and prediction is "much better" in this work than in the cited work. The model in Figure 4 of reference 57 seems to capture the key features of their data. Clarification is needed about this claim.

      Line 613: "...strains did not have any additional mutations." This seems to imply that whole genome sequencing was performed- is this the case?

      Line 627: I believe this should refer to Figure S2A-B rather than S1.

      Line 719: What percentage of cells did not hyperpolarize in these experiments?

      Lines 751-754: As I mentioned above, significant detail is missing here about how these measurements were made. How is "radius" defined in 3D biofilms like the one shown in Video S6B, which looks very flat? What is meant by the distance from the substrate to the core, since usually in this biofilm geometry, the core is directly on the substrate? Most importantly, this only describes the process of sectioning the data- how were these sections used to compute the velocity of ThT signal propagation?

      I also have some comments specifically on the figure presentation:

      Normalization from 0 to 1 has been done in some of the ThT traces in the paper, but not all. The claims in the paper would be easiest to evaluate if the non-normalized data were shown- this is important for the interpretation of some of the claims.

      Some indication of standard deviation (error bars or shading) should be added to all figures where mean traces are plotted.

      Throughout the paper, I am a bit confused by the time axis; the data consistently starts at 1 minute. This is not intuitive to me, because it seems that the blue light being applied to the cells is also the excitation laser for ThT- in that case, shouldn't the first imaging frame be at time 0 (when the blue light is first applied)? Or is there an additional exposure of blue light 1 minute before imaging starts? This is consequential because it impacts the measured time to the first spike. (Additionally, all of the video time stamps start at 0).

      Please increase the size of the scale bars and bar labels throughout, especially in Figure 2A and S4A.

      In Figure 1B and D, it would help to decrease the opacity on the individual traces so that more of them can be discerned. It would also improve clarity to have data from the different experiments shown with different colored lines, so that variability between experiments can be clearly visualized.

      Results in Figure 1E would be easier to interpret if the frequency were normalized to total N. It is hard to tell from this graph whether the edges and bin widths are the same between the data sets, but if not, they should be. Also, it would help to reduce the opacity of the sparse cell data set so that the full microcluster data set can be seen as well.
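
      As an illustration of this request, the following is a minimal sketch of how two data sets could be binned on identical edges and normalized to their own total N. The function name and the example first-spike times are hypothetical and are not taken from the manuscript.

```python
def shared_bin_fractions(datasets, n_bins=10):
    """Bin several data sets on the same edges; return per-bin fractions.

    Each data set is normalized to its own total N, so the
    fractions for every data set sum to 1.
    """
    lo = min(min(d) for d in datasets)
    hi = max(max(d) for d in datasets)
    if hi == lo:
        raise ValueError("all values are identical; cannot build bins")
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    fractions = []
    for data in datasets:
        counts = [0] * n_bins
        for value in data:
            # The maximum value is placed in the last bin.
            idx = min(int((value - lo) / width), n_bins - 1)
            counts[idx] += 1
        fractions.append([c / len(data) for c in counts])
    return edges, fractions


# Hypothetical first-spike times (min) for two conditions with different N.
sparse_cells = [1.0, 1.0, 2.0, 3.0, 5.0]
microclusters = [1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0]
edges, (sparse_frac, cluster_frac) = shared_bin_fractions(
    [sparse_cells, microclusters], n_bins=4
)
```

      Because both data sets share the same edges and are expressed as fractions, the two distributions can be compared directly even when N differs.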

      Biofilm images are shown in Figures 2A, S3A, and Video S3- these are all of the same biofilm. Why not take the opportunity to show different experimental replicates in these different figures? The same goes for Figure S4A and Video S6B, which again are of the same biofilm.

      Figure 2C would be much easier to read if the curves were colored in order of their size; the same is true for Figure 4A and irradiance.

      The complementation data in Figure S3D should be moved to the main text figure 3 alongside the data about the corresponding knockout to make it easier to compare the curves.

      Figure S3E: Is the Y-axis in this graph mislabeled? It is labeled as ThT fluorescence, but it seems that it is reporting fluorescence from the calcium indicator?

      Video S6B is very confusing - why does the video play first forwards and then backwards? Unless I am looking very carefully at the time stamps it is easy to misinterpret this as a rise in the intensity at the end of the experiment. Without a video legend, it's hard to understand this, but I think it would be much more straightforward to interpret if it only played forward. (Also, why is this video labeled 6B when there is no video 6A?)

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Qin et al. set out to investigate the role of mechanosensory feedback during swallowing and identify neural circuits that generate ingestion rhythms. They use Drosophila melanogaster swallowing as a model system, focusing their study on the neural mechanisms that control cibarium filling and emptying in vivo. They find that pump frequency is decreased in mutants of three mechanotransduction genes (nompC, piezo, and Tmc), and conclude that mechanosensation mainly contributes to the emptying phase of swallowing. Furthermore, they find that double mutants of nompC and Tmc have more pronounced cibarium pumping defects than either single mutants or Tmc/piezo double mutants. They discover that the expression patterns of nompC and Tmc overlap in two classes of neurons, md-C and md-L neurons. The dendrites of md-C neurons wrap the cibarium and project their axons to the subesophageal zone of the brain. Silencing neurons that express both nompC and Tmc leads to severe ingestion defects, with decreased cibarium emptying. Optogenetic activation of the same population of neurons inhibited filling of the cibarium and accelerated cibarium emptying. In the brain, the axons of nompC∩Tmc cell types respond during ingestion of sugar but do not respond when the entire fly head is passively exposed to sucrose. Finally, the authors show that nompC∩Tmc cell types arborize close to the dendrites of motor neurons that are required for swallowing, and that swallowing motor neurons respond to the activation of the entire Tmc-GAL4 pattern.

      Strengths:

      • The authors rigorously quantify ingestion behavior to convincingly demonstrate the importance of mechanosensory genes in the control of swallowing rhythms and cibarium filling and emptying

      • The authors demonstrate that a small population of neurons that express both nompC and Tmc oppositely regulate cibarium emptying and filling when inhibited or activated, respectively

      • They provide evidence that the action of multiple mechanotransduction genes may converge in common cell types

      Thank you for your insightful and detailed assessment of our work. Your constructive feedback will help to improve our manuscript.

      Weaknesses:

      • A major weakness of the paper is that the authors use reagents that are expressed in both md-C and md-L but describe the results as though only md-C is manipulated. Severing the labellum will not prevent optogenetic activation of md-L from triggering neural responses downstream of md-L. Optogenetic activation is strong enough to trigger action potentials in the remaining axons. Therefore, Qin et al. do not present convincing evidence that the defects they see in pumping can be specifically attributed to md-C.

      Thank you for your comments. This is an important point that we did not adequately address in the original preprint. We have obtained imaging and behavioral results that strongly suggest md-C, rather than md-L, is essential for swallowing behavior.

      36 hours after the ablation of the labellum, the signals of md-L were hardly observable when GFP expression was driven by the intersection between Tmc-GAL4 & nompC-QF (see Figure 3—figure supplement 1A). This observation indicates that the axons of md-L likely degenerated after 36 hours, and were unlikely to influence swallowing. Moreover, the projection pattern of Tmc-GAL4 & nompC-QF>>GFP exhibited no significant changes in the brain post labellum ablation.

      Furthermore, even 36 hours after labellum ablation, flies exhibited responses to light stimulation (see Figure 3—figure supplement 1B-C, Video 5) when ReaChR was expressed in md-C. We thus reasoned that md-C, but not md-L, plays a crucial role in the swallowing process.

      • GRASP is known to be non-specific and prone to false positives when neurons are in close proximity but not synaptically connected. A positive GRASP signal supports but does not confirm direct synaptic connectivity between md-C/md-L axons and MN11/MN12.

      In this study, we employed the nSyb-GRASP, wherein the GRASP is expressed at the presynaptic terminals by fusion with the synaptic marker nSyb. This method demonstrates an enhanced specificity compared to the original GRASP approach.

      Additionally, we utilized +/ UAS-nSyb-spGFP1-10, lexAop-CD4-spGFP11 ; + / MN-LexA fruit flies as a negative control to mitigate potential false signals originating from the tool itself (Author response image 1, scale bar = 50μm). Besides the genotype Tmc-Gal4, Tub(FRT. Gal80) / UAS-nSyb-spGFP1-10, lexAop-CD4-spGFP11 ; nompC-QF, QUAS-FLP / MN-LexA fruit flies discussed in this manuscript, we also incorporated genotype Tmc-Gal4, Tub(FRT. Gal80) / lexAop-nSyb-spGFP1-10, UAS-CD4-spGFP11 ; nompC-QF, QUAS-FLP / MN-LexA fruit flies as a reverse control (Author response image 2). Unexpectedly, similar positive signals were observed, indicating that positive signals may emerge due to close proximity between neurons even with nSyb-GRASP.

      Author response image 1.

      It should be noted that the existence of synaptic projections from motor neurons (MN) to md-C cannot be definitively confirmed at this juncture. At present, we can only posit the potential for synaptic connections between md-C and motor neurons. A firmer conclusion may be attainable with comprehensive whole-brain connectome data in future studies.

      Author response image 2.

      • As seen in Figure 2—figure supplement 1, the expression pattern of Tmc-GAL4 is broader than md-C alone. Therefore, the functional connectivity the authors observe between Tmc expressing neurons and MN11 and 12 cannot be traced to md-C alone

      It is true that the expression pattern of Tmc-GAL4 is broader than that of md-C alone. Our experiments, including those flies expressing TNT in Tmc+ neurons, demonstrated difficulties in emptying (Figure 2A, 2D). Notably, we encountered challenges in finding fly stocks bearing UAS>FRT-STOP-P2X2. Consequently, we opted to utilize Tmc-GAL4 to drive UAS-P2X2 instead. We believe that the results further support our hypothesis on the role of md-C in the observed behavioral change in emptying.

      Overall, this work convincingly shows that swallowing and swallowing rhythms are dependent on several mechanosensory genes. Qin et al. also characterize a candidate neuron, md-C, that is likely to provide mechanosensory feedback to pumping motor neurons, but the results they present here are not sufficient to assign this function to md-C alone. This work will have a positive impact on the field by demonstrating the importance of mechanosensory feedback to swallowing rhythms and providing a potential entry point for future investigation of the identity and mechanisms of swallowing central pattern generators.

      Reviewer #2 (Public Review):

      In this manuscript, the authors describe the role of cibarial mechanosensory neurons in fly ingestion. They demonstrate that pumping of the cibarium is subtly disrupted in mutants for piezo, TMC, and nomp-C. Evidence is presented that these three genes are co-expressed in a set of cibarial mechanosensory neurons named md-C. Silencing of md-C neurons results in disrupted cibarial emptying, while activation promotes faster pumping and/or difficulty filling. GRASP and chemogenetic activation of the md-C neurons is used to argue that they may be directly connected to motor neurons that control cibarial emptying.

      The manuscript makes several convincing and useful contributions. First, identifying the md-C neurons and demonstrating their essential role for cibarium emptying provides reagents for further studying this circuit and also demonstrates the importance of mechanosensation in driving pumping rhythms in the pharynx. Second, the suggestion that these mechanosensory neurons are directly connected to motor neurons controlling pumping stands in contrast to other sensory circuits identified in fly feeding and is an interesting idea that can be more rigorously tested in the future.

      At the same time, there are several shortcomings that limit the scope of the paper and the confidence in some claims. These include:

      a) the MN-LexA lines used for GRASP experiments are not characterized in any other way to demonstrate specificity. These were generated for this study using Phack methods, and their expression should be shown to be specific for MN11 and MN12 in order to interpret the GRASP experiments.

      Thanks for the suggestion. We have checked the expression pattern of MN-LexA, which is similar to MN-GAL4 used in previous work (Manzo et al., PNAS., 2012, PMID:22474379). Here is the expression pattern:

      Author response image 3.

      b) There is also insufficient detail for the P2X2 experiment to evaluate its results. Is this an in vivo or ex vivo prep? Is ATP added to the brain, or ingested? If it is ingested, how is ATP coming into contact with md-C neuron if it is not a chemosensory neuron and therefore not exposed to the contents of the cibarium?

      The P2X2 experimental preparation was done ex vivo. We immersed the fly in the imaging buffer, as described in the Methods section under Functional Imaging. Following dissection and identification of the subesophageal zone (SEZ) area under fluorescent microscopy, we introduced ATP slowly into the buffer, positioned at a distance from the brain.

      c) In Figure 3C, the authors claim that ablating the labellum will remove the optogenetic stimulation of the md-L neuron (mechanosensory neuron of the labellum), but this manipulation would presumably leave an intact md-L axon that would still be capable of being optogenetically activated by Chrimson.

      Please refer to the corresponding answers for reviewer 1 and Figure 3—figure supplement 1.

      d) Average GCaMP traces are not shown for md-C during ingestion, and therefore it is impossible to gauge the dynamics of md-C neuron activation during swallowing. Seeing activation with a similar frequency to pumping would support the suggested role for these neurons, although GCaMP6s may be too slow for these purposes.

      Profiling the dynamics of md-C neuron activation during swallowing is crucial for unraveling the operational model of md-C and validating our proposed hypothesis. Unfortunately, our assay faces challenges in detecting probable 6Hz fluorescent changes with GCaMP6s.

      In general, we observed an increase in fluorescent signals during swallowing, but movement of live flies during swallowing disturbed the imaging recording, so we could not obtain a reliable calcium imaging trace for md-C neurons. To enhance the robustness of our findings, patching the md-C neurons would be a more convincing approach. As illustrated in Figure 2, the somata of md-C neurons are situated in the cibarium rather than the brain. Patching the md-C neuron somata in flies during ingestion is difficult.

      e) The negative result in Figure 4K that is meant to rule out taste stimulation of md-C is not useful without a positive control for pharyngeal taste neuron activation in this same preparation.

      We followed methods used in the previous work (Chen et al., Cell Rep., 2019, PMID:31644916), which we believe could confirm that md-C do not respond to sugars.

      In addition to the experimental limitations described above, the manuscript could be organized in a way that is easier to read (for example, not jumping back and forth in figure order).

      Thanks for your suggestion; the manuscript has been reorganized.

      Reviewer #3 (Public Review):

      Swallowing is an essential daily activity for survival, and pharyngo-laryngeal sensory function is critical for safe swallowing. In Drosophila, it has been reported that the mechanical properties of food (e.g., viscosity) can modulate swallowing. However, how mechanical expansion of the pharynx or fluid content is sensed and used to control swallowing has been elusive. Qin et al. showed that a group of pharyngeal mechanosensory neurons, as well as mechanosensory channels (nompC, Tmc, and Piezo), respond to these mechanical forces for regulation of swallowing in Drosophila melanogaster.

      Strengths:

      There are many reports on the effect of chemical properties of foods on feeding in fruit flies, but only limited studies have reported how physical properties of food affect feeding, especially via pharyngeal mechanosensory neurons. First, they found that mechanosensory mutants, including nompC, Tmc, and Piezo, showed impaired swallowing, mainly in the emptying process. Next, they identified that cibarium multidendritic mechanosensory neurons (md-C) are responsible for controlling swallowing by regulating motor neurons (MN) 12 and 11, which control filling and emptying, respectively.

      Weaknesses:

      While the involvement of md-C and mechanosensory channels in controlling swallowing is convincing, it is not yet clear which stimuli activate md-C. Can it be an expansion of cibarium or food viscosity, or both? In addition, if rhythmic and coordinated contraction of muscles 11 and 12 is essential for swallowing, how can simultaneous activation of MN 11 and 12 by md-C achieve this? Finally, previous reports showed that food viscosity mainly affects the filling rather than the emptying process, which seems different from their finding.

      We have confirmed that swallowing sucrose water solution activated md-C neurons, while sucrose water solution alone could not (Figure 4J-K). We hypothesized that the viscosity of the food might influence this expansion process.

      While we were unable to delineate the activation dynamics of md-C neurons, our proposal posits that these neurons could be activated in a single pump cycle, sequentially stimulating MN12 and MN11. Another possibility is that the activation of md-C neurons acts as a switch, altering the oscillation pattern of the swallowing central pattern generator (CPG) from a resting state to a working state.

      In the experiments with w1118 flies fed with MC (methylcellulose) water, we observed that viscosity predominantly affects the filling process rather than the emptying process, consistent with previous findings. This raises an intriguing question. Our investigation into the mutation of mechanosensitive ion channels revealed a significant impact on the emptying process. We believe this is due to the loss of mechanosensation affecting the vibration of swallowing circuits, thereby influencing both the emptying and filling processes. In contrast, viscosity appears to make it more challenging for the fly to fill the cibarium with food, primarily attributable to the inherent properties of the food itself.

      Reviewer #4 (Public Review):

      A combination of optogenetic behavioral experiments and functional imaging is employed to identify the role of mechanosensory neurons in food swallowing in adult Drosophila. While some of the findings are intriguing and the overall goal of mapping a sensory to motor circuit for this rhythmic movement is admirable, the data presented could be improved.

      The circuit proposed (and supported by GRASP contact data) shows these multi-dendritic neurons connecting to pharyngeal motor neurons. This is pretty direct - there is no evidence that they affect the hypothetical central pattern generator - just the execution of its rhythm. The optogenetic activation and inhibition experiments are constitutive, not patterned light, and they seem to disrupt the timing of pumping, not impose a new one. A slight slowing of the rhythm is not consistent with the proposed function.

      Motor neurons implicated in patterned motions can be considered effectors of Central Pattern Generators (CPGs)(Marder et al., Curr Biol., 2001, PMID: 11728329; Hurkey et al., Nature., 2023, PMID:37225999). Given our observation of the connection between md-C neurons and motor neurons, it is reasonable to speculate that md-C neurons influence CPGs. Compared to the patterned light (0.1s light on and 0.1s light off) used in our optogenetic experiments, we noted no significant changes in their responses to continuous light stimulation. We think that optogenetic methods may lead to overstimulation of md-C neurons, failing to accurately mimic the expansion of the cibarium during feeding.

      Dysfunction in mechanosensitive ion channels or mechanosensory neurons not only disrupts the timing of pumping but also results in decreased intake efficiency (Figure 1E). The water-swallowing rhythm is generally stable in flies, and swallowing is a vital process that may involve redundant ion channels to ensure its stability.

      The mechanosensory channel mutants nompC, piezo, and TMC have a range of defects. The role of these channels in swallowing may not be sufficiently specific to support the interpretation presented. Their other defects are not described here and their overall locomotor function is not measured. If the flies have trouble consuming sufficient food throughout their development, how healthy are they at the time of assay? The level of starvation or water deprivation can affect different properties of feeding - meal size and frequency. There is no description of how starvation state was standardized or measured in these experiments.

      Defects in mechanosensory channel mutants nompC, piezo, and TMC, have been extensively investigated (Hehlert et al., Trends Neurosci., 2021, PMID:332570000). Mutations in these channels exhibit multifaceted effects, as illustrated in our RNAi experiments (see Figure 2E). Deprivation of water and food was performed in empty fly vials. It's important to note that the duration of starvation determines the fly's willingness to feed but not the pump frequency (Manzo et al., PNAS., 2012, PMID:22474379).

      In most cases, female flies were deprived of water and food in empty vials for 24 hours because after that most flies would be willing to drink water. The deprivation time was 12 hours for flies with nompC and Tmc mutated or flies with Kir2.1 expressed in md-C neurons, as some of these flies cannot survive 24 hours of deprivation.

      The brain is likely to move considerably during swallow, so the GCaMP signal change may be a motion artifact. Sometimes this can be calculated by comparing GCaMP signal to that of a co-expressed fluorescent protein, but there is no mention that this is done here. Therefore, the GCaMP data cannot be interpreted.

      We did not co-express a fluorescent protein with GCaMP for md-C. The head of the fly was mounted onto a glass slide, and we did not observe significant signal changes before feeding.
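
      For reference, the control the reviewer describes can be sketched as follows. This is a minimal illustration only: it assumes a calcium-insensitive reference fluorophore recorded alongside GCaMP, which was not part of our experiment, and the function name and example values are hypothetical.

```python
def ratiometric_dff(gcamp, reference, n_baseline):
    """Motion-corrected dF/F from a GCaMP trace and a reference trace.

    Dividing by a calcium-insensitive reference channel cancels
    intensity changes that affect both channels equally (e.g. motion).
    The first n_baseline frames define the baseline ratio.
    """
    ratio = [g / r for g, r in zip(gcamp, reference)]
    baseline = sum(ratio[:n_baseline]) / n_baseline
    return [(x - baseline) / baseline for x in ratio]


# Hypothetical traces: frame 2 dims in both channels (motion),
# frame 3 brightens only in the GCaMP channel (calcium response).
gcamp = [100.0, 100.0, 80.0, 150.0, 100.0]
reference = [50.0, 50.0, 40.0, 50.0, 50.0]
dff = ratiometric_dff(gcamp, reference, n_baseline=2)
print(dff)  # [0.0, 0.0, 0.0, 0.5, 0.0]
```

      In this toy example the shared dimming in frame 2 is removed by the ratio, while the GCaMP-specific increase in frame 3 is retained.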

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Abstract: I disagree that swallow is the first step of ingestion. The first paragraph also mentions the final checkpoint before food ingestion. Perhaps sufficient to say that swallow is a critical step of ingestion.

      Indeed, it is not rigorous enough to say “first step”. This has been replaced by “early step”.

      Introduction:

      Line 59: "Silence" should be "Silencing"

      This has been replaced.

      Results:

      Lines 91-92: I am not clear about what this means. 20% of nompC and 20% of wild-type flies exhibit incomplete filling? So nompC is not different from wild-type?

      Sorry for the mistake. Viscous foods led to incomplete emptying (not incomplete filling), as displayed in Video 4. The swallowing behavior differs between nompC mutants and wild-type flies, as illustrated in Figure 1C, Figure 1—figure supplement 1A-C and video 1&5.

      We found that when fed with 1% MC water solution (Figure 1—figure supplement 1E-H), Tmc or piezo mutants displayed incomplete emptying, which could account for a large proportion of the swallowing time, whereas only 20% of nompC flies and 20% of wild-type flies sporadically exhibited incomplete emptying, which is significantly different. Although the percentage of flies displaying incomplete pumps is similar between nompC mutant and wild-type flies, their swallowing behavior differs markedly, as seen in Videos 1 and 5.

      Line 94: Should read: “while for foods with certain viscosity, the pump of Tmc or piezo mutants might"

      What evidence is there for weakened muscle motion? The phenotypes of all three mutants are quite similar, so concluding that they have roles in initiation versus swallowing strength is not well supported. This would be better moved to the discussion since it is speculative.

      Muscles are responsible for pumping the bolus from the mouth to the crop. In the case of Tmc or piezo mutants, as evidenced by incomplete filling for viscous foods (see Video 4), we speculate that the loss of sensory stimuli leads to inadequate muscle contraction. The phenotypes observed in Tmc and piezo mutants are similar yet distinct from those of the wild-type or nompC mutant, as shown in Video 1 and 4. The phrase "due to weakened muscle motion" has been removed for clarity.

      Line 146: If md-L neurons are also labeled by this intersection, then you are not able to know whether the axons seen in the brain are from md-L or md-C neurons. Line 148: cutting the labellum is not sufficient to ablate md-L neurons. The projections will still enter the brain and can be activated with optogenetics, even after severing the processes that reside in the labellum.

      Please refer to the response to Reviewer #1 (Public Review), “A major weakness of the paper…”, and to Figure 4.

      Line 162: If the fly head alone is in saline, do you know that the sucrose enters the esophagus? The more relevant question here is whether the md-C neurons respond to mechanical force. If you could artificially inflate the cibarium with air and see the md-C neurons respond that would be a more convincing result. So far you only know that these are activated during ingestion, but have not shown that they are activated specifically by filling or emptying. In addition, you are not only imaging md-C (md-L is also labeled). This caveat should be mentioned.

      We followed the methods outlined in the previous work (Chen et al., Cell Rep., 2019, PMID:31644916), which suggested that md-C neurons do not respond to sugars. While we aimed to mechanically stimulate md-C neurons, detecting signal changes during different steps of swallowing is challenging. This aspect could be further investigated in subsequent research with the application of adequate patch recording or two-photon microscopy (TPM).

      Figure 3: It is not clear what the pie charts in Figure 3 A refer to. What are the three different rows, and what does blue versus red indicate?

      Figure 3A illustrates three distinct states driven by CsChrimson light stimulation of md-C neurons, with the proportions of flies exhibiting each state. During light activation, flies may display difficulty in filling, incomplete filling, or a normal range of pumping. The blue and red bars represent the proportions of flies showing the corresponding state, as indicated by the black line.

      Figure 4: Where are the example traces for J? The comparison in K should be average dF/F before ingestion compared with average dF/F during ingestion. Comparing the in vitro response to sucrose to the in vivo response during ingestion is not a useful comparison.

      Please refer to the answers for reviewer #2 question d).
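
The comparison requested for Figure 4K (average dF/F before ingestion versus average dF/F during ingestion) can be sketched as follows. This is only an illustration, not the authors' analysis code: the function names, the choice of the pre-ingestion frames as baseline F0, and the fluorescence values are all assumptions made for the example.

```python
from statistics import mean

def delta_f_over_f(trace, baseline_frames):
    """Return dF/F per frame, taking the mean of the first `baseline_frames` frames as F0."""
    f0 = mean(trace[:baseline_frames])
    return [(f - f0) / f0 for f in trace]

def mean_response(trace, baseline_frames):
    """Return (mean dF/F before ingestion, mean dF/F during ingestion)."""
    dff = delta_f_over_f(trace, baseline_frames)
    before = mean(dff[:baseline_frames])
    during = mean(dff[baseline_frames:])
    return before, during

# Hypothetical raw fluorescence values: 4 frames before ingestion, 4 frames during.
trace = [100.0, 100.0, 100.0, 100.0, 150.0, 150.0, 120.0, 120.0]
before, during = mean_response(trace, baseline_frames=4)
print(f"mean dF/F before: {before:.2f}, during: {during:.2f}")
```

Computing both averages from the same in vivo recording keeps the comparison within one preparation, which is the point of the reviewer's request.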

      Reviewer #2 (Recommendations For The Authors):

      Suggested experiments that would address some of my concerns listed in the public review include:

      a) high resolution SEZ images of MN-LexA lines crossed to LexAop-GFP to demonstrate their specificity

      b) more detail on the P2X2 experiment. It is hard to make suggestions beyond that without first seeing the details.

      c) presenting average GCaMP traces for all calcium imaging results

      d) to rule out taste stimulation of md-C (Figure 4K) I would suggest performing more extensive calcium imaging experiments with different stimuli. For example, sugar, water, and increasing concentrations of a neutral osmolyte (e.g. PEG) to suppress the water response. I think that this is more feasible than trying to get an in vitro taste prep to be convincing.

      Please refer to the responses for public review of reviewer #2.

      Reviewer #3 (Recommendations For The Authors):

      Below I list my suggestions as well as criticisms.

      (1) It would be excellent if the authors could demonstrate whether varying levels of food viscosity affect md-C activation.

      That is a good point, and could be studied in future work.

      (2) It is not clear whether an intersectional approach using TMC-GAL4 and nompC-QF abolishes labelling of the labellar multidendritic neurons. If this is the case, please show labellar multidendritic neurons in TMC-GAL4 only flies and flies using the intersectional approach. Along with this question, I am concerned that labellum-removed flies could be used for feeding assay.

      Intersectional labelling using TMC-GAL4 and nompC-QF did not abolish labelling of the labellar multidendritic neurons (Author response image 4). Labellum-removed flies can be used for the feeding assay (Figure 3—figure supplement 1B-C, video 5), but if the LSO or cibarium is damaged, swallowing behavior is affected. The labellum must therefore be removed very carefully.

      Author response image 4.

      (3) Please provide the detailed methods for GRASP and include proper control.

      Please refer to the responses for public review of reviewer #1.

      (4) The authors hypothesized that md-C sequentially activates MN11 and 12. Is the time gap between applying ATP on md-C and activation of MN11 or MN12 different?

      Please refer to the responses for public review of reviewer #3. The time gap between applying ATP on md-C and activation of MN11 or MN12 didn’t show significant differences, and we think the reason is that the ex vivo conditions could not completely mimic the in vivo process.

      I found the manuscript includes many errors, which need to be corrected.

      (1) The reference formatting needs to be rechecked, for example, lines 37, 42, and 43.

      (2) Line 44-46: There is some misunderstanding. The role of pharyngeal mechanosensory neurons is not known compared with chemosensory neurons.

      (3) Line 49: Please specify which type of quality of food. Chemical or physical?

      (4) Line 80 and Figure 1B-D Authors need to put filling and emptying time data in the main figure rather than in the supplementary figure. Otherwise, please cite the relevant figures in the text(S1A-C).

      (5) Line 84-85; Is "the mutant animals" indicating only nompC? Please specify it.

      (6) Figure 1a: It is hard to determine the difference between the series of images. And also label filling and emptying under the time.

      (7) S1E-H: It is unclear what "Time proportion of incomplete pump" means. Please define it.

      (8) Please reorganize the figures to follow the order of the text, for example, figures 2 and 4

      (9) Figure 4A. There is mislabelling in Figure 4A. It is supposed to be phalloidin not nc82.

      (10) Figure 4K: It does not match the figure legend and main text.

      (11) Figure 4D and G: Please indicate ATP application time point.

      Thank you for your corrections; all the points mentioned have been revised.

      Reviewer #4 (Recommendations For The Authors):

      The figures need improvement. 1A has tiny circles showing pharynx and any differences are unclear.

      The expression pattern of some of these drivers (Supplement) seems quite broad. The tmc nompC intersection image in Figure 1F is nice but the cibarium images are hard to interpret: does this one show muscle expression? What are "brain" motor neurons? Where are the labellar multi-dendritic neurons?

      The Tmc nompC intersection image shows no expression in muscles. The somata of motor neurons 12 and 11 are situated in the SEZ area of the brain, while the somata of md-C neurons are in the cibarium. An image of md-L neurons is provided in the response to Reviewer #3 (Recommendations For The Authors).

      Why do the assays alternate between swallowing food and swallowing water?

      Thank you for your suggestion; Figure 1A has been enlarged. The Tmc nompC intersection image in Figure 2F displays the position of md-C neurons from a ventral perspective, and muscles were not labelled. We stained muscles in the cibarium with phalloidin (Figure 4A) and did not find overlap between md-C neurons and muscles. An image of md-L neurons is provided as Author response image 4.

      In the majority of our experiments, we employed water to test swallowing behavior, while we used methylcellulose water solution to test swallowing behavior of mechanoreceptor mutants, and sucrose solution for flies with md-C neurons expressing GCaMP since they hardly drank water when their head capsules were open.

      How starved or water-deprived were the flies?

      One day prior to the behavioral assays, flies were transferred to empty vials (without water or food) for 24 hours for water deprivation. Flies that could not survive 24 hours of deprivation were deprived for 12 hours.

      How exactly was the pumping frequency (shown in Fig 1B) measured? There is no description in the methods at all. If the pump frequency is scored by changes in blue food intensity (arbitrary units?), this seems very subjective and maybe image angle dependent. What was camera frame rate? Can it capture this pumping speed adequately? Given the wealth of more quantitative methods for measuring food intake (eg. CAFE, flyPAD), it seems that better data could be obtained.

      How was the total volume of the cibarium measured? What do the pie charts in Figure 3A represent?

      The pump frequency was computed as the number of pumps divided by the time scale, following the methodology outlined in Manzo et al., 2012. Swallowing curves were plotted using the inverse of the blue food intensity in the cibarium. In this representation, ascending lines signify filling, while descending lines indicate emptying (see Figure 2D, 3B). We maintain objectivity in our approach since, during the recording of swallowing behavior, the fly was fixed, and we exclusively used data for analysis when the Region of Interest (ROI) was in the cibarium. This ensures that the intensity values accurately reflect the filling and emptying processes. Furthermore, we conducted manual frame-by-frame checks of pump frequency, and the results align with those generated by the time series analyzer V3 of ImageJ.
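
The calculation described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline (they report using the Time Series Analyzer V3 of ImageJ plus manual frame-by-frame checks). The inversion as "maximum minus value", the threshold-crossing pump counter, the frame rate, and the intensity values are all assumptions chosen for the example.

```python
def swallowing_curve(intensity):
    """Invert dye intensity in the ROI so that a rising curve means the cibarium is filling."""
    peak = max(intensity)
    return [peak - value for value in intensity]

def count_pumps(curve, threshold):
    """Count upward crossings of `threshold`; each crossing is treated as one filling event."""
    pumps = 0
    for previous, current in zip(curve, curve[1:]):
        if previous < threshold <= current:
            pumps += 1
    return pumps

def pump_frequency(intensity, frame_rate_hz, threshold):
    """Pump frequency in Hz: number of pumps divided by the recording duration."""
    curve = swallowing_curve(intensity)
    duration_s = len(intensity) / frame_rate_hz
    return count_pumps(curve, threshold) / duration_s

# Hypothetical ROI intensities (arbitrary units) sampled at an assumed 8 frames per second.
roi_intensity = [200, 120, 200, 120, 200, 120, 200, 120]
frequency_hz = pump_frequency(roi_intensity, frame_rate_hz=8.0, threshold=40)
print(f"pump frequency: {frequency_hz:.1f} Hz")
```

A counter of this kind only resolves pumps reliably when the camera samples each cycle several times, which is why the reviewer's question about frame rate matters.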

      To assess the total volume of ingestion, we followed the CAFE method, using a graduated glass capillary. We then calculated the ingestion rate (nL/s) by dividing the total volume of ingestion by the feeding time.
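
As a worked example of this rate calculation, here is a short Python sketch. The capillary calibration (nL per mm of meniscus displacement), the function names, and the numbers are hypothetical and serve only to show the arithmetic; they are not values from the study.

```python
def ingested_volume_nl(meniscus_displacement_mm, nl_per_mm):
    """Convert the meniscus displacement in the capillary into an ingested volume (nL)."""
    return meniscus_displacement_mm * nl_per_mm

def ingestion_rate_nl_per_s(volume_nl, feeding_time_s):
    """Ingestion rate (nL/s): total ingested volume divided by feeding time."""
    if feeding_time_s <= 0:
        raise ValueError("feeding time must be positive")
    return volume_nl / feeding_time_s

# Hypothetical values: 2.0 mm displacement, assumed calibration of 70 nL/mm, 20 s of feeding.
volume = ingested_volume_nl(2.0, 70.0)
rate = ingestion_rate_nl_per_s(volume, 20.0)
print(f"ingested {volume:.0f} nL at {rate:.1f} nL/s")
```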

      The changes seem small, in spite of the claim of statistical significance.

      The observed stability in pump frequency within a given genotype underscores the significance of even seemingly small changes, which are statistically significant. We speculate that the stability in swallowing frequency suggests the existence of a redundant mechanism to ensure the robustness of the process. Disruption of one channel might be partially compensated for by others, highlighting the vital nature of the swallowing mechanism.

      How is this change in pump frequency consistent with defects in one aspect of the cycle - either ingestion (activation) or expulsion (inhibition)?

      Please refer to Figures 2 and 3. Both the filling and emptying processes were affected, while inhibition mainly influences emptying time (Figure 1—figure supplement 1).

      for the authors:

      Line 48: extensively

      Line 62 - undiscovered.

      Line 107, 463: multi

      Line 124: What is "dysphagia?" This is an unusual word and should be defined.

      Line 446: severe

      Line 466: in the cibarium or not?

      Thank you for your corrections; all the places mentioned have been revised.

    1. Author Response:

      Assessment note: “Whereas the results and interpretations are generally solid, the mechanistic aspect of the work and conclusions put forth rely heavily on in vitro studies performed in cultured L6 myocytes, which are highly glycolytic and generally not viewed as a good model for studying muscle metabolism and insulin action.”

      While we acknowledge that in vitro models may not fully recapitulate the complexity of in vivo systems, we believe that our use of L6 myotubes is appropriate for studying the mechanisms underlying muscle metabolism and insulin action. As mentioned below (reviewer 2, point 1), L6 myotubes possess many important characteristics relevant to our research, including high insulin sensitivity and a similar mitochondrial respiration sensitivity to primary muscle fibres. Furthermore, several studies have demonstrated the utility of L6 myotubes as a model for studying insulin sensitivity and metabolism, including our own previous work (PMID: 19805130, 31693893, 19915010).

      In addition, we have provided evidence of the similarities between L6 cells overexpressing SMPD5 and human muscle biopsies at the protein level, and of the reproducibility in vivo of the negative correlation between ceramide and Coenzyme Q observed in L6 cells, specifically in the skeletal muscle of mice on a chow diet. These findings support the relevance of our in vitro results to in vivo muscle metabolism.

      Finally, we will supplement our findings by demonstrating a comparable relationship between ceramide and Coenzyme Q in mice exposed to a high-fat diet, to be shown in Supplementary Figure 4 H-I. Further animal experiments will be performed to validate our cell-line based conclusions. We hope that these additional results address the concerns raised by the reviewer and further support the relevance of our in vitro findings to in vivo muscle metabolism and insulin action.

      Points from reviewer 1:

      1. Although the authors' results suggest that higher mitochondrial ceramide levels suppress cellular insulin sensitivity, they rely solely on a partial inhibition (i.e., 30%) of insulin-stimulated GLUT4-HA translocation in L6 myocytes. It would be critical to examine how much the increased mitochondrial ceramide would inhibit insulin-induced glucose uptake in myocytes using radiolabel deoxy-glucose.

      Response: The primary impact of insulin is to facilitate the translocation of glucose transporter type 4 (GLUT4) to the cell surface, which effectively enhances the maximum rate of glucose uptake into cells. Therefore, assessing the quantity of GLUT4 present at the cell surface in non-permeabilized cells is widely regarded as the most reliable measure of insulin sensitivity (PMID: 36283703, 35594055, 34285405). Additionally, plasma membrane GLUT4 and glucose uptake are highly correlated. Whilst we have routinely measured glucose uptake with radiolabelled glucose in the past, we do not believe that evaluating glucose uptake provides a better assessment of insulin sensitivity than GLUT4.

      We will clarify the use of GLUT4 translocation in the Results section:

      “...For this reason, several in vitro models have been employed involving incubation of insulin sensitive cell types with lipids such as palmitate to mimic lipotoxicity in vivo. In this study we will use cell surface GLUT4-HA abundance as the main readout of insulin response...”

      2. Another important question to be addressed is whether glycogen synthesis is affected in myocytes under these experimental conditions. Results demonstrating reductions in insulin-stimulated glucose transport and glycogen synthesis in myocytes with dysfunctional mitochondria due to ceramide accumulation would further support the authors' claim.

      Response: We have carried out supplementary experiments to investigate glycogen synthesis in our insulin-resistant models. Our approach involved L6-myotubes overexpressing the mitochondrial-targeted construct ASAH1 (as described in Fig. 3). We then challenged them with palmitate and measured glycogen synthesis using 14C radiolabeled glucose. Our observations indicated that palmitate suppressed insulin-induced glycogen synthesis, which was effectively prevented by the overexpression of ASAH1 (N = 5, * p<0.05). These results provide additional evidence highlighting the role of dysfunctional mitochondria in muscle cell glucose metabolism.

      These data will be added to Supplementary Figure 4K and the results modified as follows:

      “Notably, mtASAH1 overexpression protected cells from palmitate-induced insulin resistance without affecting basal insulin sensitivity (Fig. 3E). Similar results were observed using insulin-induced glycogen synthesis as an orthogonal technique to Glut4 translocation. These results provide additional evidence highlighting the role of dysfunctional mitochondria in muscle cell glucose metabolism (Sup. Fig. 5K). Importantly, mtASAH1 overexpression did not rescue insulin sensitivity in cells depleted…”

      We will add to the method section:

      “L6 myotubes overexpressing ASAH were grown and differentiated in 12-well plates, as described in the Cell lines section, and stimulated for 16 h with palmitate-BSA or EtOH-BSA, as detailed in the Induction of insulin resistance section.

      On day seven of differentiation, myotubes were serum starved in plain DMEM for 3.5 hours. After incubation for 1 hour at 37 °C with 2 µCi/ml D-[U-14C]-glucose in the presence or absence of 100 nM insulin, glycogen synthesis assay was performed, as previously described (Zarini S. et al., J Lipid Res, 63(10): 100270, 2022).”

      3. In addition, it would be critical to assess whether the increased mitochondrial ceramide and consequent lowering of energy levels affect all exocytic pathways in L6 myoblasts or just the GLUT4 trafficking. Is the secretory pathway also disrupted under these conditions?

      Response: As the secretory pathway primarily involves the synthesis and transportation of soluble proteins that are secreted into the extracellular space, and given that the majority of cellular transmembrane proteins (excluding those of the mitochondria) use this pathway to arrive at their ultimate destination, we believe that the question posed by the reviewer is highly challenging and beyond the scope of our research. We will add this to the discussion:

      “...the abundance of mPTP associated proteins suggesting a role of this pore in ceramide induced insulin resistance (Sup. Fig. 6E). In addition, it is yet to be determined whether the trafficking defect is specific to Glut4 or if it affects the exocytic-secretory pathway more broadly…”

      Points from reviewer 2:

      1. The mechanistic aspect of the work and conclusions put forth rely heavily on studies performed in cultured myocytes, which are highly glycolytic and generally viewed as a poor model for studying muscle metabolism and insulin action. Nonetheless, the findings provide a strong rationale for moving this line of investigation into mouse gain/loss of function models.

      Response: The relative contributions of anaerobic (glycolysis) and aerobic (mitochondrial) metabolism to muscle metabolism can change in L6 cells depending on differentiation stage. For instance, Serrage et al (PMID: 30701682) demonstrated that L6-myotubes have a higher mitochondrial abundance and aerobic metabolism than L6-myoblasts. Others have used elegant transcriptomic analysis and metabolic characterisation comparing different skeletal muscle models for studying insulin sensitivity. For instance, Abdelmoez et al in 2020 (PMID: 31825657) reported that L6 myotubes exhibit greater insulin-stimulated glucose uptake and oxidative capacity compared with C2C12 and Human Mesenchymal Stem Cells (HMSC). Overall, L6 cells exhibit higher metabolic rates and primarily rely on aerobic metabolism, while C2C12 and HSMC cells rely on anaerobic glycolysis. It is worth noting that L6 myotubes are the cell line most closely related to adult human muscle when compared with other muscle cell lines (PMID: 31825657). Our presented results in Figure 6 H and I provide evidence for the similarities between L6 cells overexpressing SMPD5 and human muscle biopsies. Additionally, in Figure 3J-K, we demonstrate that the negative correlation between ceramide and Coenzyme Q observed in L6 cells is reproduced in vivo, specifically in the skeletal muscle of mice on a chow diet. Furthermore, we have supplemented these findings by demonstrating a comparable relationship in mice exposed to a high-fat diet, as shown in Supplementary Figure 4 H-I (refer to point 4). We will clarify these points in the Discussion:

      “In this study, we mainly utilised L6-myotubes, which share many important characteristics with primary muscle fibres relevant to our research. Both types of cells exhibit high sensitivity to insulin and respond similarly to maximal doses of insulin, with Glut4 translocation stimulated between 2 and 4 times over basal levels in response to 100 nM insulin (as shown in Fig. 1-4 and (46,47)). Additionally, mitochondrial respiration in L6-myotubes has a similar sensitivity to mitochondrial poisons, as observed in primary muscle fibres (as shown in Fig. 5 (48)). Finally, inhibiting ceramide production increases CoQ levels in both L6-myotubes and adult muscle tissue (as shown in Fig. 2-3). Therefore, L6-myotubes possess the necessary metabolic features to investigate the role of mitochondria in insulin resistance, and this relationship is likely applicable to primary muscle fibres”.

      We will also add additional data - in point 2 - from differentiated human myocytes that are consistent with our observations from the L6 models. Additional experiments are in progress to further extend these findings.

      2. One caveat of the approach taken is that exposure of cells to palmitate alone is not reflective of in vivo physiology. It would be interesting to know if similar effects on CoQ are observed when cells are exposed to a more physiological mixture of fatty acids that includes a high ratio of palmitate, but better mimics in vivo nutrition.

      Response: Palmitate is widely recognized as a trigger for insulin resistance and ceramide accumulation, which mimics the insulin resistance induced by a diet in rodents and humans. Previous studies have compared the effects of a lipid mixture versus palmitate on inducing insulin resistance in skeletal muscle, and have found that the strong disruption in insulin sensitivity caused by palmitate exposure was lessened with physiologic mixtures of fatty acids, even with a high proportion of saturated fatty acids. This was associated, in part, with the selective partitioning of fatty acids into neutral lipids (such as TAG) when muscle cells are exposed to physiologic lipid mixtures (Newsom et al PMID25793412). Hence, we think that using palmitate is a better strategy to study lipid-induced insulin resistance in vitro. We will add to results:

      “In vitro, palmitate conjugated with BSA is the preferred strategy for inducing insulin resistance, as lipid mixtures tend to partition into triacylglycerides (33)”.

      We are also performing additional in vivo experiments to add to the physiological relevance of the findings.

      3. While the utility of targeting SMPD5 to the mitochondria is appreciated, the results in Figure 5 suggest that this manoeuvre caused a rather severe form of mitochondrial dysfunction. This could be more representative of toxicity rather than pathophysiology. It would be helpful to know if these same effects are observed with other manipulations that lower CoQ to a similar degree. If not, the discrepancies should be discussed.

      Response: We conducted a staining procedure using the mitochondrial marker mitoDsRED to observe the effect of SMPD5 overexpression on cell toxicity. The resulting images, displayed in the figure below (Author response image 1), demonstrate that the overexpression of SMPD5 did not result in any significant changes in cell morphology or impact the differentiation potential of our myoblasts into myotubes.

      Author response image 1.

      In addition, we evaluated cell viability in HeLa cells following exposure to SACLAC (2 µM) to induce CoQ depletion (left panel). Specifically, we measured cell death by monitoring the uptake of propidium iodide (PI), as shown in the right panel. Our results demonstrated that SACLAC-induced CoQ depletion did not lead to cell death at the doses used for CoQ depletion (Author response image 2).

      Author response image 2.

      Therefore, we consider it unlikely that the observed effect is caused by cellular toxicity; rather, it appears to represent a pathological condition induced by elevated levels of ceramides. We will add to discussion:

      “...downregulation of the respirasome induced by ceramides may lead to CoQ depletion. Despite the significant impact of ceramide on mitochondrial respiration, we did not observe any indications of cell damage in any of the treatments, suggesting that our models are not explained by toxic/cell death events.”

      4. The conclusions could be strengthened by more extensive studies in mice to assess the interplay between mitochondrial ceramides, CoQ depletion and ETC/mitochondrial dysfunction in the context of a standard diet versus HF diet-induced insulin resistance. Does P053 affect mitochondrial ceramide, ETC protein abundance, mitochondrial function, and muscle insulin sensitivity in the predicted directions?

      Response: We would like to note that the metabolic characterization and assessment of ETC/mitochondrial function in these mice (both fed a high-fat (HF) and chow diet, with or without P053) were previously published (Turner N, PMID30131496). In addition to this, we have conducted targeted metabolomic and lipidomic analyses to investigate the impact of P053 on ceramide and CoQ levels in HF-fed mice. As illustrated in the figures below (Author response image 3), the administration of P053 led to a reduction in ceramide levels (left panel) and an increase in CoQ levels (right panel) in HF-fed mice, which is consistent with our in vitro findings.

      Author response image 3.

      We will add to results:

      “…similar effect was observed in mice exposed to a high fat diet for 5 wks (Supp. Fig. 4H-I; further phenotypic and metabolic characterization of these animals can be found in (41))”

      We will perform further in vivo studies to corroborate these findings.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Hearing and balance rely on specialized ribbon synapses that transmit sensory stimuli between hair cells and afferent neurons. Synaptic adhesion molecules that form and regulate transsynaptic interactions between inner hair cells (IHCs) and spiral ganglion neurons (SGNs) are crucial for maintaining auditory synaptic integrity and, consequently, for auditory signaling. Synaptic adhesion molecules such as neurexin-3 and neuroligin-1 and -3 have recently been shown to play vital roles in establishing and maintaining these synaptic connections (doi: 10.1242/dev.202723 and doi: 10.1016/j.isci.2022.104803). However, the full set of molecules required for synapse assembly remains unclear.

      Karagulyan et al. highlight the critical role of the synaptic adhesion molecule RTN4RL2 in the development and function of auditory afferent synapses between IHCs and SGNs, particularly regarding how RTN4RL2 may influence synaptic integrity and receptor localization. Their study shows that deletion of RTN4RL2 in mice leads to enlarged presynaptic ribbons and smaller postsynaptic densities (PSDs) in SGNs, indicating that RTN4RL2 is vital for synaptic structure. Additionally, the presence of "orphan" PSDs (those not directly associated with IHCs) in RTN4RL2 knockout mice suggests a developmental defect in which some SGN neurites fail to form appropriate synaptic contacts, highlighting potential issues in synaptic pruning or guidance. The study also observed a depolarized shift in the activation of CaV1.3 calcium channels in IHCs, indicating altered presynaptic functionality that may lead to impaired neurotransmitter release. Furthermore, postsynaptic SGNs exhibited a deficiency in GluA2/3 AMPA receptor subunits, despite normal Gria2 mRNA levels, pointing to a disruption in receptor localization that could compromise synaptic transmission. Auditory brainstem responses showed increased sound thresholds in RTN4RL2 knockout mice, indicating impaired hearing related to these synaptic dysfunctions.

      The findings reported here significantly enhance our understanding of synaptic organization in the auditory system, particularly concerning the molecular mechanisms underlying IHC-SGN connectivity. The implications are far-reaching, as they not only inform auditory neuroscience but also provide insights into potential therapeutic targets for hearing loss related to synaptic dysfunction.

      We would like to thank the reviewer for appreciating the work and for the advice that helped us to further improve the manuscript. We have carefully addressed all concerns; please see our point-by-point response below and the revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      Karagulyan et al. investigate the function of the transsynaptic adhesion molecule RTN4RL2 in the formation and function of ribbon synapses between type I spiral ganglion neurons (SGNs) and inner hair cells. For this purpose, they study constitutive RTN4RL2 knock-out mice. Using immunohistochemistry, they reveal defects in the recruitment of protein to ribbon synapses in the knockouts. Serial block-face EM reveals defects in SGN projections in mutants. Electrophysiological recordings suggest a small but statistically significant depolarized shift in the activation of Cav1.3 Ca²⁺ channels. Auditory thresholds are also elevated in the mutant mice. The authors conclude that RTN4RL2 contributes to the formation and function of auditory afferent synapses to regulate auditory function.

      We would like to thank the reviewer for appreciating the work and for the advice that helped us to further improve the manuscript. We have carefully addressed all concerns; please see our point-by-point response below and the revised manuscript.

      Strengths:

      The authors have excellent tools to analyze ribbon synapses.

      Weaknesses:

      However, there are several concerns that substantially reduce my enthusiasm for the study.

      (1) The analysis of the expression pattern of RTN4RL2 in Figure 1 is incomplete. The authors should show a developmental time course of expression up into maturity to correlate gene expression with major developmental milestones such as axon outgrowth, innervation, and refinement. This would allow the development of models supporting roles in axon outgrowth versus innervation or both.

      We agree that it would be valuable to show the developmental time course of RTN4RL2 expression. In response to the reviewer’s comment, we are providing RNAscope data from developmental ages E11.5, E12.5 and E16 in Figure 1. RTN4RL2 shows expression at E11.5/E12.5 both in the spiral ganglion and hair cell region, with first onset in the hair cells. We conclude that RTN4RL2 is expressed highest during fiber growth at embryonic stages and is downregulated during postnatal development maintaining low levels of expression during adulthood.

      (2) It would be important to improve the RNAscope data. Controls should be provided for Figure 1B to show that no signal is observed in hair cells from knockouts. The authors apparently already have the sections because they analyzed gene expression in SGNs of the knock-outs (Figure 1C).

      In Figure 1C, gene expression in SGNs was assessed at p40, while the expression in hair cells is provided for p1 animals. Unfortunately, we do not have KO controls for p1 animals. However, as indicated in our manuscript, previously published RNA expression datasets do find RTN4RL2 expression in hair cells. Therefore, we think it is unlikely that our results are nonspecific.

      (3) It is unclear from the immunolocalization data in Figure 1D if all type I SGNs express RTN4RL2. Quantification would be important to properly document the presence of RTN4RL2 in all or a subset of type I SGNs. If only a subset of SGNs express RTN4RL2, it could significantly affect the interpretation of the data. For example, SGNs selectively projecting to the pillar or modiolar side of hair cells could be affected. These synapses significantly differ in their properties.

      According to the previously published single-cell RNAseq dataset of Shrestha et al., 2018, RTN4RL2 expression does not seem to show a clear type I SGN subtype specificity (Author response image 1). In response to the reviewer’s comment, we have further performed anti-Parvalbumin (PV) and anti-calretinin (CR) immunostainings in mid-modiolar cryosections of RTN4RL2<sup>+/+</sup> and RTN4RL2<sup>-/-</sup> cochleae. Parvalbumin was chosen to label all SGNs and calretinin (CALB2) was chosen primarily as a type Ia SGN marker (Sun et al., 2018). We present the data from all analyzed samples below (Author response image 2). Cell segmentation masks of PV-positive cells were obtained using Cellpose 2.0 and the average CR intensity was calculated in those masks. While the distributions of CR intensity and of the ratio of CR and PV intensities are slightly shifted in RTN4RL2<sup>-/-</sup> cochleae, we take the data to suggest that the composition of the spiral ganglion by molecular type I SGN subtypes is largely unchanged in RTN4RL2<sup>-/-</sup> mice.
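
      The per-cell measurement described above (average CR intensity inside each PV-derived mask, and the CR/PV ratio) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name, the tiny arrays and all intensity values are hypothetical, and the segmentation step itself is omitted. The sketch only assumes an integer label image in which 0 marks background and each positive integer marks one segmented cell.

```python
from collections import defaultdict


def mean_intensity_per_mask(labels, intensity):
    """Return {label: mean intensity} for every non-zero label.

    `labels` and `intensity` are equally sized 2D lists; label 0 is
    treated as background and ignored.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label_row, intensity_row in zip(labels, intensity):
        for label, value in zip(label_row, intensity_row):
            if label != 0:
                sums[label] += value
                counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Hypothetical 3x4 label image with two PV-positive cells (labels 1 and 2).
pv_masks = [
    [1, 1, 0, 2],
    [1, 0, 0, 2],
    [0, 0, 2, 2],
]
# Hypothetical CR and PV intensity images of the same shape.
cr_image = [
    [10, 20, 5, 40],
    [30, 5, 5, 60],
    [5, 5, 50, 50],
]
pv_image = [
    [100, 100, 0, 100],
    [100, 0, 0, 100],
    [0, 0, 100, 100],
]

cr_means = mean_intensity_per_mask(pv_masks, cr_image)
pv_means = mean_intensity_per_mask(pv_masks, pv_image)
# Per-cell CR/PV ratio, computed only for cells present in both results.
cr_pv_ratio = {label: cr_means[label] / pv_means[label] for label in cr_means}
```

      Computing the ratio per cell, rather than dividing the two image-wide means, keeps each cell's CR signal normalized to that cell's own PV signal.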

      Author response image 1.

      Author response image 1 cites single cell RNAseq data of Brikha R Shrestha, Chester Chia, Lorna Wu, Sharon G Kujawa, M Charles Liberman, Lisa V Goodrich. Sensory neuron diversity in the inner ear is shaped by activity. Cell. 2018 Aug 23; 174(5):1229-1246.e17. doi: 10.1016/j.cell.2018.07.007

      Author response image 2.

      Calretinin intensity distribution in spiral ganglion of RTN4RL2<sup>+/+</sup> and RTN4RL2<sup>-/-</sup> mice. (A) Mid-modiolar cochlear cryosections from RTN4RL2<sup>+/+</sup> (top) and RTN4RL2<sup>-/-</sup> (bottom) mice immunolabeled against Parvalbumin (PV) and Calretinin (CR). Scale bar = 20 µm. (B) Distribution of CR intensity in PV positive cells (N = 3 for each genotype). (C) Distribution of the ratio of CR and PV intensities (N = 3 for each genotype).

      (4) It is important to show proper controls for the RTN4RL2 immunolocalization data to show that no staining is observed in knockouts.

      Unfortunately, our recent attempts to perform RTN4RL2 immunostainings on cryosections failed and, therefore, we decided to remove the RTN4RL2 immunostainings from Figure 1. We have adjusted the results section accordingly.

      (5) The authors state in the discussion that no staining for RTN4RL2 was observed at synaptic sites. This is surprising. Did the authors stain multiple ages? Was there perhaps transient expression during development? Or in axons indicative of a role in outgrowth, not synapse formation?

      We thank the reviewer for the comment. We have now tried RTN4RL2 immunostainings on cryosections at several developmental stages, but unfortunately did not succeed this time in obtaining reproducible and reliable results. Therefore, we decided to also remove the previous immunostainings from Figure 1. We have adjusted the results section and removed from the discussion our statement that RTN4RL2 was not detected near the synaptic regions.

      (6) In Figure 2 it seems that images in mutants are brighter compared to wildtypes. Are exposure times equivalent? Is this a consistent result?

      Yes, the samples were prepared in parallel, imaged and analyzed in the same manner.

      No, we did not observe consistent differences in brightness, nor do we find such a difference in the example images of Figure 2.

      (7) The number of synaptic ribbons for wildtype in Figure 2 is at 10/IHCs, and in Figure 2 Supplementary Figure 2 at 20/IHCs (20 is more like what is normally reported in the literature). The value for mutant similarly drastically varies between the two figures. This is a significant concern, especially because most differences that are reported in synaptic parameters between wild-type and mutants are far below a 2-fold difference.

      The key message is that there is no difference in the numbers of ribbons and synapses between the genotypes for the cochlear apex (~10 ribbons/IHCs, Figure 2 and Figure 2-figure supplement 2) and the middle and base of the cochlea (more ribbons/IHCs, Figure 2-figure supplement 2). Figure 2-figure supplement 3 (now Figure 3) shows that there is a massive reduction of postsynaptic GluA2, while both Figure 2 and Figure 2-figure supplement 2 indicate that the number of synapses is normal. These are two different data sets and, while we closely collaborated and also shared the Moser lab protocols and analysis routines, we agree that there is a difference in the absolute synapse count, which most likely reflects an observer difference and a different choice of tonotopic positions of analysis. In Figure 2 only the apical hair cells have been analyzed. The Moser lab, since establishing the immunofluorescence-based quantification of synapse number (Khimich et al., 2005), has reported tonotopic differences in synapse counts (focus of Meyer et al., 2009 and reported by others, e.g. Kujawa and Liberman, 2009): apical and basal IHCs have lower synapse numbers than mid-cochlear IHCs.

      (8) The authors report differences in ribbon volume between wild-type and mutant. Was there a difference between the modiolar/pillar region of hair cells? It is known that synaptic size varies across the modiolar-pillar axis. Maybe smaller synapses are preferentially lost?

      We thank the reviewer for the comment. Unfortunately, our already acquired datasets from 3-week-old mice did not allow us to check whether the previously described modiolar-pillar gradient of the ribbon size was collapsed in RTN4RL2<sup>-/-</sup> mice, because the morphology of the inner hair cells was not well preserved in our preparations. However, since the number of ribbons is not changed in the RTN4RL2 KO mice, we do not think that the increase in the ribbon size is due to the loss of small ribbons. In response to the reviewer’s comment, we have analyzed the modiolar-pillar gradient of the ribbon size in IHCs of the middle turn of the cochlea from a newly acquired dataset of 14-week-old mice. We took the fluorescence intensity of Ctbp2-positive puncta as a proxy for the ribbon size. In these older mice we found a preserved modiolar-pillar gradient of the ribbon size (larger ribbons at the modiolar side). We summarize the results in Author response image 3 below.

      Author response image 3.

      The modiolar-pillar gradient of ribbon size is preserved in RTN4RL2<sup>-/-</sup> IHCs. (A) Maximum intensity projections of approximately 2 IHCs stained against Vglut3 and Ctbp2 from 14-week-old RTN4RL2<sup>+/+</sup> (left) and RTN4RL2<sup>-/-</sup> (right) mice. Scale bar = 5 µm. (B) Synaptic ribbons on the modiolar side show higher fluorescence intensity than the ones on the pillar side of mid-cochlear IHCs in both RTN4RL2<sup>+/+</sup> (left, N=2) and RTN4RL2<sup>-/-</sup> (right, N=2) mice. (C) Average fluorescence intensity of modiolar ribbons per IHC is higher than the average fluorescence intensity of pillar ribbons (paired t-test, p < 0.001).

      (9) The authors show in Figure 2 - Supplement 3 that GluA2/3 staining is absent in the mutants. Are GluA4 receptors upregulated? Otherwise, synaptic transmission should be abolished, which would be a dramatic phenotype. Antibodies are available to analyze GluA4 expression, the experiment is thus feasible. Did the authors carry out recordings from SGNs?

      In response to the reviewer’s comment, we have performed GluA4 stainings in RTN4RL2<sup>-/-</sup> mice and did not detect any GluA4-positive signal in the mutants (new Figure 3-figure supplement 1). Unfortunately, our animal breeding license had expired at the time we received the reviews, which is why our results are from 14-week-old animals. To verify that the absence of GluA4 signal is not due to potential PSD loss in 14-week-old RTN4RL2<sup>-/-</sup> mice, we have additionally performed anti-Ctbp2, anti-Homer1 and anti-Vglut3 stainings in 14-week-old animals. Despite the reduced number, we still observed juxtaposing pre- and postsynaptic puncta. We assume that the reviewer asks for patch-clamp recordings from SGNs, which are, as we are confident the reviewer is aware, technically very challenging and beyond the scope of the present study but an important objective for future studies. In response to the reviewer’s comment, we have added a statement to the discussion pointing to these patch-clamp recordings from SGNs as an important objective for future studies.

      (10) The authors use SBEM to analyze SGN projections and synapses. The data suggest that a significant number of SGNs are not connected to IHCs. A reconstruction in Figure 3 shows hair cells and axons. It is not clear how the outline of hair cells was derived, but this should be indicated. Also, is this a defect in the formation of synapses and subsequent retraction of SGN projections? Or could RTN4RL2 mutants have a defect in axonal outgrowth and guidance that secondarily affects synapses? To address this question, it would be useful to sparsely label SGNs in mutants, for example with AAV vectors expressing GFP, and to trace the axons during development. This would allow us to distinguish between models of RTN4RL2 function. As it stands, it is not clear that RTN4RL2 acts directly at synapses.

      We agree with the reviewer on the value of a developmental study of afferent connectivity but consider this beyond the scope of the present study. In response to the reviewer's comment, we have replaced the IHC outlines with volume-reconstructed IHCs in Figure 3B (now Figure 4B). Moreover, as shown in Figure 3F (now Figure 4F), most if not all type-I SGNs (both with and without ribbon) were unbranched in the mutants just like in wildtype (also shown for a larger sample in Hua et al., 2021), arguing against morphological abnormality during development.

      (11) The authors observe a tiny shift in the operation range of Ca<sup>2+</sup> channels that has no effect on synaptic vesicle exocytosis. It seems very unlikely that this difference can explain the auditory phenotype of the mutant mice.

      We assume that the statement refers to the normal exocytosis of mutant IHCs at the potential of maximal Ca<sup>2+</sup> influx (Figure 3G and H, now Figure 4G and H). We would like to note that this experiment was performed to probe for a deficit of synapse function beyond that of the Ca<sup>2+</sup> channel activation, but did not address the impact of the altered voltage dependence of Ca<sup>2+</sup> channel activation. In response to the reviewer’s comment, we have now added further discussion to more clearly communicate that for the range of receptor potentials achieved near sound threshold we expect impaired IHC exocytosis as the Ca<sup>2+</sup> channels require slightly more depolarization for activation in the mutant IHCs.

      (12) ABR recordings were conducted in whole-body knockouts. Effects on auditory thresholds could be a secondary consequence of perturbation along the auditory pathway. Conditional knockouts or precisely designed rescue experiments would go a long way to support the authors' hypothesis. I realize that this is a big ask and floxed mice might not be available to conduct the study.

      Thanks for this helpful comment; indeed, unfortunately, we do not have conditional KO mice at our disposal. We totally agree that this will be important also for clarifying the role of IHC vs. SGN expression of RTN4RL2. In response to the reviewer’s comment, we now discuss the shortcoming of using constitutive RTN4RL2<sup>-/-</sup> mice and have added this important experiment on IHC- and SGN-specific deletion of RTN4RL2 as an objective of future studies.

      Reviewer #3 (Public review):

      In this study, the authors used RNAscope and immunostaining to confirm the expression of RTN4RL2 RNA and protein in hair cells and spiral ganglia. Through RTN4RL2 gene knockout mice, they demonstrated that the absence of RTN4RL2 leads to an increase in the size of presynaptic ribbons and a depolarized shift in the activation of calcium channels in inner hair cells. Additionally, they observed a reduction in GluA2/3 AMPA receptors in postsynaptic neurons and identified additional "orphan PSDs" not paired with presynaptic ribbons. These synaptic alterations ultimately resulted in an increased hearing threshold in mice, confirming that the RTN4RL2 gene is essential for normal hearing. These data are intriguing as they suggest that RTN4RL2 contributes to the proper formation and function of auditory afferent synapses and is critical for normal hearing. However, a thorough understanding of the known or postulated roles of RTN4RL2 is lacking.

      We would like to thank the reviewer for appreciating the work and for the advice that helped us to further improve the manuscript. We have carefully addressed all concerns; please see our point-by-point response below and the revised manuscript.

      While the conclusions of this paper are generally well supported by the data, several aspects of the data analysis warrant further clarification and expansion.

      (1) A quantitative assessment is necessary in Figure 1 when discussing RNA and protein expression. It would be beneficial to show that expression levels are quantitatively reduced in KO mice compared to wild-type mice. This suggestion also applies to Figure 2-supplement 3.D, which examines expression levels.

      The processing of our control and KO samples for RNAscope was not strictly done in parallel and therefore we would like to refrain from quantitative comparison.

      (2) In Figure 2, the authors present a morphological analysis of synapses and discuss the presence of "orphan PSDs." I agree that Homer1 not juxtaposed with Ctbp2 is increased in KO mice compared to the control group. However, in quantifying this, they opted to measure the number of Homer1 juxtaposed with Ctbp2 rather than directly quantifying the number of Homer1 not juxtaposed with Ctbp2. Quantifying the number of Homer1 not juxtaposed with Ctbp2 would more clearly represent "orphan PSDs" and provide stronger support for the discussion surrounding their presence.

      We appreciate the reviewer’s comment. We did not perform this analysis primarily because “orphan” Homer1 puncta, as seen in our immunostainings, are distributed away from hair cells in diverse morphologies and sizes. This makes distinguishing them from unspecific immunofluorescent spots—also present in wild-type samples—challenging. In response to the reviewer’s request, we analyzed the number of “orphan” Homer1 puncta in our previously acquired RTN4RL2<sup>+/+</sup> and RTN4RL2<sup>-/-</sup> samples. Using the surface algorithm in Imaris software, we applied identical parameters across all samples to create surfaces for Homer1-positive puncta (total Homer1 puncta). We quantified “orphan” Homer1 puncta as the difference between total and ribbon-juxtaposing Homer1 puncta and normalized this number to the IHC count. Our results showed 4.3 vs. 26.8 “orphan” Homer1 puncta per IHC in RTN4RL2<sup>+/+</sup> and RTN4RL2<sup>-/-</sup> samples, respectively. We note that variations in acquired volumes between samples may introduce confounding effects.
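
      The normalization described above (orphan puncta taken as the difference between total and ribbon-juxtaposing Homer1 puncta, divided by the IHC count) amounts to the following arithmetic. This is only an illustrative sketch: the function name is hypothetical and the counts in the example are made up, not values from the study.

```python
def orphan_puncta_per_ihc(total_homer1, juxtaposed_homer1, ihc_count):
    """Orphan Homer1 puncta per IHC: (total - ribbon-juxtaposed) / IHC count."""
    if ihc_count <= 0:
        raise ValueError("ihc_count must be positive")
    if juxtaposed_homer1 > total_homer1:
        raise ValueError("juxtaposed puncta cannot exceed total puncta")
    return (total_homer1 - juxtaposed_homer1) / ihc_count


# Hypothetical counts for a single image stack (not data from the study).
example = orphan_puncta_per_ihc(total_homer1=150, juxtaposed_homer1=100, ihc_count=10)
```

      Because the result scales with how many puncta fall inside the imaged volume, stacks of different depth are not directly comparable, which is the confound the response itself points out.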

      (3) In Figure 2, Supplementary 3, the authors discuss GluA2/3 puncta reduction and note that Gria2 RNA expression remains unchanged. However, there is an issue with the lack of quantification for Gria2 RNA expression. Additionally, it is noted that RNA expression was measured at P4. While the timing for GluA2/3 puncta assessment is not specified, if it was assessed at 3 weeks old as in Figure 2's synaptic puncta analysis, it would be inappropriate to link Gria2 RNA expression with GluA2/3 protein expression at P4. If RNA and protein expression were assessed at P4, please indicate this timing for clarity.

      GluA2/3 immunostainings were performed in 1 to 1.5-month-old animals. We apologize for not indicating this before and have now included it in Figure 3 legend. The processing of our control and KO samples for RNAscope was not strictly done in parallel and therefore we would like to refrain from quantitative comparison.

      (4) In Figure 3, the authors indicate that RTN4RL2 deficiency reduces the number of type 1 SGNs connected to ribbons. Given that the number of ribbons remains unchanged (Figure 2), it is important to clearly explain the implications of this finding. It is already known that each type I SGN forms a single synaptic contact with a single IHC. The fact that the number of ribbons remains constant while additional "orphan PSDs" are present suggests that the overall number of SGNs might need to increase to account for these findings. An explanation addressing this would be helpful.

      In Figure 3 (now Figure 4), we found additional type-1 SGNs that are unconnected to IHCs, in good agreement with the “orphan PSDs” observed under the light microscope. Indeed, we also confirmed monosynaptic, unbranched fiber morphology (Figure 3F, now Figure 4F). Together, these results imply an increase of about 20% in the overall number of SGNs, which, however, we did not observe in SGN soma counting.

      (5) In Figure 4F and 5Cii, could you clarify how voltage sensitivity (k) was calculated? Additionally, please provide an explanation for the values presented in millivolts (mV).

      Voltage sensitivity (k) was calculated as the slope of the Boltzmann fit to the fractional activation curves: G/G<sub>max</sub> = 1/(1 + exp((V<sub>half</sub> - V<sub>m</sub>)/k)), where G is conductance, G<sub>max</sub> is the maximum conductance, V<sub>m</sub> is the membrane potential, V<sub>half</sub> is the voltage corresponding to the half-maximal activation of Ca<sup>2+</sup> channels and k (slope of the curve) is the voltage sensitivity of Ca<sup>2+</sup> channel activation. We have now added this to our Materials and Methods section.
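
      To make the role and units of k concrete, the sketch below assumes the standard Boltzmann form, G/G_max = 1/(1 + exp((V_half - V_m)/k)). Because the argument of the exponential must be dimensionless, k carries the same unit as the voltages (mV); a smaller k means a steeper activation curve. The code is not the authors' fitting routine. It recovers V_half and k from noise-free synthetic data by linearizing the relation, and all function names and parameter values are hypothetical.

```python
import math


def boltzmann(v_m, v_half, k):
    """Fractional activation G/G_max at membrane potential v_m (all in mV)."""
    return 1.0 / (1.0 + math.exp((v_half - v_m) / k))


def fit_boltzmann(voltages, fractions):
    """Estimate (v_half, k) by linearizing the Boltzmann function.

    ln(1/f - 1) = (v_half - v_m) / k is a straight line in v_m with
    slope -1/k and intercept v_half/k, so ordinary least squares suffices.
    Points with f <= 0 or f >= 1 are skipped because the logarithm is
    undefined there.
    """
    pts = [
        (v, math.log(1.0 / f - 1.0))
        for v, f in zip(voltages, fractions)
        if 0.0 < f < 1.0
    ]
    n = len(pts)
    mean_v = sum(v for v, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((v - mean_v) ** 2 for v, _ in pts)
    sxy = sum((v - mean_v) * (y - mean_y) for v, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_v
    k = -1.0 / slope          # slope factor, in mV
    v_half = intercept * k    # half-activation voltage, in mV
    return v_half, k


# Synthetic, noise-free activation curve with hypothetical parameters.
voltages = list(range(-60, 1, 5))
fractions = [boltzmann(v, v_half=-30.0, k=7.0) for v in voltages]
v_half_est, k_est = fit_boltzmann(voltages, fractions)
```

      With real, noisy recordings a nonlinear least-squares fit of the untransformed curve is the more common choice, since the logarithmic transform amplifies noise near 0 and 1; the linearized version is used here only to keep the example dependency-free.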

      (6) In Figure 6, the author measured the threshold of ABR at 2-4 months old. Since previous figures confirming synaptic morphology and function were all conducted on 3-week-old mice, it would be better to measure ABR at 3 weeks of age if possible.

      ABR measurements for comparisons in a cohort of age-matched mice require fully developed individuals. Three weeks is the minimum age at which the ear is regarded as mature. However, developmental variation within a litter is frequent and affects normal hearing thresholds. From our own experience, we do not regard the ear as fully functional before 6 weeks of age, when hearing thresholds are lowest, indicating full functionality. Since the C57BL/6 background strain has a genetic defect in the Cadherin 23-coding gene (Cdh23) at the ahl locus of mouse chromosome 10, these mice exhibit early onset and progression of age-related hearing loss starting at 5–8 months (Hunter & Willott, 1987). Therefore, we chose a “safe” time window of 2-4 months for stable and unaffected ABR recordings to provide the most representative data.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Please include information on the validation of all the antibodies used in this study, or reference the relevant work where the antibodies were previously validated.

      In response to the reviewer’s comment, we have now included a table listing all primary antibodies used in this study. Where possible, we provide references for knockout (KO) validation. Otherwise, we refer to the manufacturer’s information, as provided in the respective datasheets.

      (2) Figure 2 illustrates the pre- and postsynaptic changes observed in RTN4RL2 knockout (KO) mice. Please specify the age of the mice and the cochlear region depicted and analyzed in Figure 2.

      We thank the reviewer for the comment. The IHCs of apical cochlear region were analyzed in mice at 3 weeks of age. We have now added this to the figure legend.

      (3) The discovery of orphan SGN neurites in RTN4RL2 KO mice is particularly intriguing. I wonder whether the additional Homer1-positive puncta illustrated in Figure 2 are present in these orphan SGN neurites, which would suggest that they may be functional. Conducting immunohistochemistry (IHC) labeling for type I SGN neurites using an anti-Tuj1 antibody, along with Homer1, would help localize the additional Homer1 puncta shown in Figure 2. Additionally, the "extra" Homer1 puncta appears less striking in the data presented in Figure 2-Supplement 2. Quantifying the number of Homer1 puncta in wild-type versus KO mice across different cochlear regions will help visualize the Figure 2-Supplement 2 data and relate the presence of extra neurites to the increased auditory brainstem response (ABR) thresholds observed at all frequencies.

      We thank the reviewer for the comment and we agree that localizing orphan PSDs on the SGN neurites would be very useful. Unfortunately, the animal breeding license in the Göttingen lab had expired. At the time we received the reviews we only had access to 14-week-old animals and could not perform the stainings in animals of an age range comparable to the rest of the study (3-4 weeks). The phenotype of extra Homer1 puncta was not as drastic in 14-week-old animals as it was in previously stained 3-week-old animals. Nevertheless, we still tried NF200, Homer1 and Vglut3 immunostainings in 14-week-old animals. We present representative single imaging planes of NF200, Homer1 and Vglut3 stainings in Author response image 4. Additionally, we provide exemplary images from 7-week-old RTN4RL2<sup>-/-</sup> mice, in which the orphan Homer1 puncta appear to be located on calretinin-positive neurites.

      Author response image 4.

      Attempts to localize “orphan” Homer1 patches on type I SGN neurites. (A) Single exemplary imaging planes of apical IHC region from RTN4RL2<sup>+/+</sup> (left) and RTN4RL2<sup>-/-</sup> (right) mice immunolabeled against NF200, Vglut3 and Homer1. White arrows show putative “orphan” Homer1 puncta on NF200 positive neurites. Scale bar = 5 µm. (B) Maximum intensity projections of representative confocal stacks of IHCs from RTN4RL2<sup>-/-</sup> mice immunolabeled against Calretinin and Homer1. Scale bars = 5 µm. White arrows show possible “orphan” Homer1 puncta on Calretinin positive boutons.

      (4) The authors noted a reduction in the number of GluA2/3-positive puncta in RTN4RL2 KOs, as shown in Figure 2-Supplement 3. However, in the Results section (page 5, line 124), it is unclear whether the authors refer to a reduction in fluorescence intensity or the number of puncta. Please clarify this.

      We thank the reviewer for the comment. We refer to the number and have now added this to the manuscript.

      (5) I find it particularly interesting that, despite the presence of smaller but synaptically engaged Homer1-positive SGN neurites, these appear to lack or present a reduction in the number of GluA2/3 puncta, and that GluA2/3 puncta are observed in non-ribbon juxtaposed neurites. Therefore, I suggest including GluA2/3 (Figure 2-Supplement 3) data in the main figure. It would be valuable to determine whether the orphan neurites express both Homer1 and GluA2/3, which could indicate that the defect is not solely due to reduced GluA2/3 expression at the formed synapses, but also to the presence of additional orphan synapses. I would also mention in the discussion how the phenotype of the RTN4RL2 KO compares to the GluA2/3 KO and if the lack of GluA2/3 at the AZ could explain the increase in ABR threshold. Quantification of GluA2/3 puncta at the apical, middle, and basal region would also help understand the auditory phenotype of the KO mice.

      We have changed Figure 2-figure supplement 3 to become a main figure (Figure 3) based on the recommendation of the reviewer. We agree that it would be valuable to perform immunohistochemistry combining anti-GluA2/3, anti-Homer1 and anti-Ctbp2 antibodies to see if the “orphan” Homer1 patches house GluA2/3 not juxtaposing synaptic ribbons. Unfortunately, as mentioned above, due to the expiration of our animal breeding and experimentation licenses we did not manage to do those experiments. We have, however, performed stainings with anti-GluA4 antibodies and could not detect GluA4 signal in RTN4RL2<sup>-/-</sup> mice (Figure 3-figure supplement 1). This potentially could explain the more drastic ABR threshold elevation in RTN4RL2<sup>-/-</sup> mice compared to e.g. GluA3 KO mice. We have now made this clearer in our discussion.

      (6) I suggest considering the use of color-blind friendly palettes for figures and graphs in this manuscript to enhance clarity and ensure that the findings are accessible to a wider audience and improve the overall effectiveness of the presentation. Please use color-blind-friendly schemes in Figure 1 and Figure 2 Supplement 3.

      Done.

      (7) Could you please explain what "XX ± Y, SD = W" means in the figure legends?

      Mean ± SEM (standard error of the mean) and SD (standard deviation) are indicated in the legends. In response to the reviewer’s comment, we have now added an explanation in the Data analysis and statistics section of the Materials and Methods.

      (8) Please include information about the ear tested (left or right or both).

      Both ears were tested. Since there was no significant difference between the right and left ear, we did not consider this factor further. We will state this more precisely in the Materials and Methods section.

      Reviewer #3 (Recommendations for the authors):

      (1) Line 90: Why not show this control, it is a nice control.

      Unfortunately, our recent attempts to perform RTN4RL2 immunostaining on cryosections were unsuccessful. Therefore, we decided to remove RTN4RL2 immunostaining from Figure 1 and have adjusted the results section accordingly.

      (2) Line 94: Please provide a reference for these interactions.

      Done.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      The manuscript discusses the role of phosphorylated ubiquitin (pUb) by PINK1 kinase in neurodegenerative diseases. It reveals that elevated levels of pUb are observed in aged human brains and those affected by Parkinson's disease (PD), as well as in Alzheimer's disease (AD), aging, and ischemic injury. The study shows that increased pUb impairs proteasomal degradation, leading to protein aggregation and neurodegeneration. The authors also demonstrate that PINK1 knockout can mitigate protein aggregation in aging and ischemic mouse brains, as well as in cells treated with a proteasome inhibitor. While this study provided some interesting data, several important points should be addressed before being further considered.

      Strengths:

      (1) Reveals a novel pathological mechanism of neurodegeneration mediated by pUb, providing a new perspective on understanding neurodegenerative diseases.

      (2) The study covers not only a single disease model but also various neurodegenerative diseases such as Alzheimer's disease, aging, and ischemic injury, enhancing the breadth and applicability of the research findings.

      Weaknesses:

      (1) PINK1 has been reported as a kinase capable of phosphorylating Ubiquitin, hence the expected outcome of increased p-Ub levels upon PINK1 overexpression. Figures 5E-F do not demonstrate a significant increase in Ub levels upon overexpression of PINK1 alone, whereas the evident increase in Ub expression upon overexpression of S65A is apparent. Therefore, the notion that increased Ub phosphorylation leads to protein aggregation in mouse hippocampal neurons is not yet convincingly supported.

      Indeed, overexpression of sPINK1 alone resulted in minimal changes in Ub levels in the soluble fraction (Figure 5E), which is expected given that the soluble Ub pool remains relatively stable and buffered. However, sPINK1* overexpression led to a marked increase in Ub levels in the insoluble fraction, indicative of increased protein aggregation (Figure 5F). The molecular weight distribution of Ub in the insoluble fraction was predominantly below 70 kDa, suggesting that phosphorylation inhibits Ub chain elongation.

      To further validate this mechanism, we utilized the Ub/S65A mutant to antagonize Ub phosphorylation and observed a significant reduction in the intensity of aggregated bands at low molecular weights, indicating restored proteasomal activity. The observed increase in Ub levels in the soluble fraction upon Ub/S65A overexpression is likely due to enhanced ubiquitination driven by elevated Ub-S65A, and notably, Ub/S65A was also detectable using an antibody against wild-type Ub.

      Consistent with these findings, overexpression of Ub/S65E resulted in a further increase in Ub levels in the insoluble fraction, with intensified low molecular weight bands. The effect was even more pronounced than that observed with sPINK1 transfection, likely resulting from the complete phosphorylation mimicry achieved by Ub/S65E, compared to the relatively low levels of phosphorylation by PINK1.

      These findings collectively support the conclusion that sPINK1 promotes protein aggregation via Ub phosphorylation. We have updated the Results and Discussion sections to more clearly present the data and explain the various controls.

      (2) The specificity of PINK1 and p-Ub antibodies requires further validation, as a series of literature indicate that the expression of the PINK1 protein is relatively low and difficult to detect under physiological conditions.

      We acknowledge the challenges in achieving high specificity with commercially available and custom-generated antibodies targeting PINK1 and pUb, particularly given their low endogenous expression under physiological conditions. However, in our study, we observed robust immunofluorescent staining for PINK1 (Figures 1A, 1C, and 1G) and pUb (Figures 1B, 1D, and 1G) in human brain samples from Alzheimer's disease (AD) patients, as well as in mouse models of AD and cerebral ischemia. The clear visualization can be partly attributed to the pathological upregulation of PINK1 and pUb under disease conditions. Importantly, the images from pink1<sup>-/-</sup> mice exhibit much weaker staining.

      Additionally, we detected a significant elevation in the pUb levels in aged mouse brains compared to younger ones (Figures 1E and 1F). In contrast, pink1<sup>-/-</sup> mice showed no change in pUb levels with aging, despite some background signals, demonstrating that pUb accumulation during aging is PINK1-dependent. Collectively, these results support the specificity of the antibodies used in detecting pathophysiological changes in PINK1 and pUb levels.

      For cultured cells, pink1<sup>-/-</sup> cells served as a negative control for both PINK1 (Figures 2B and 2C) and pUb (Figures 2D and 2E). While the pUb Western blot exhibited some nonspecific background, pUb levels in pink1<sup>-/-</sup> cells remained unchanged across all MG132 treatment conditions (Figures 2D and 2E), further attesting to the usability of the antibodies in conjunction with appropriate controls.

      We have updated the manuscript with higher-resolution images; individual image files have been uploaded separately.

      (3) In Figure 6, relying solely on Western blot staining and Golgi staining under high magnification is insufficient to prove the impact of PINK1 overexpression on neuronal integrity and cognitive function. The authors should supplement their findings with immunostaining results for MAP2 or NeuN to demonstrate whether neuronal cells are affected.

      We included NeuN immunofluorescent staining at 10, 30, and 70 days post-transfection in Figure 5—figure supplement 2. The results clearly demonstrate a significant loss of NeuN-positive cells in the hippocampus following Ub/S65E overexpression, while no apparent reduction was observed with sPINK1 transfection alone.

      We have also quantified MAP2 protein levels via Western blotting and examined the morphology of neuronal dendrites and synaptic structures using Golgi staining. These analyses revealed a significant reduction in MAP2 levels and synaptic damage upon sPINK1 or Ub/S65E overexpression (Figures 6F and 6H), consistent with the proteomics analysis (Figure 5—figure supplement 5). Notably, these detrimental effects could be rescued by co-expression of Ub/S65A, reinforcing the role of pUb in mediating these structural changes.

      Together, our findings from NeuN immunostaining, MAP2 protein analysis, proteomics analysis, and Golgi staining provide strong evidence for the impact of PINK1 overexpression and pUb elevation on neuronal integrity and synaptic structure.

      (4) The authors should provide more detailed figure captions to facilitate the understanding of the results depicted in the figures.

      Figure captions have been updated with more details incorporated in the revised manuscript.

      (5) While the study proposes that pUb promotes neurodegeneration by affecting proteasomal function, the specific molecular mechanisms and signaling pathways remain to be elucidated.

      The molecular mechanisms and signaling pathways through which pUb promotes neurodegeneration are likely multifaceted and interconnected. Our findings suggest that mitochondrial dysfunction plays a central role following sPINK1* overexpression. This is supported by (1) an observed increase in full-length PINK1, indicative of impaired mitochondrial quality control, and (2) proteomic data showing enhanced mitophagy at 30 days post-transfection, followed by substantial mitochondrial injuries at 70 days post-transfection (Figure 5—figure supplement 5 and Supplementary Data). The progressive mitochondrial damage caused by protein aggregates would exacerbate neuronal injury and degeneration.

      Additionally, reduced proteasomal activity may lead to the accumulation of inhibitory proteins that are normally degraded by the ubiquitin-proteasome system. Our proteomics analysis identified a >50-fold increase in CamK2n1 (UniProt ID: Q6QWF9), an endogenous inhibitor of CaMKII activation, following sPINK1* overexpression. The accumulation of CamK2n1 suppresses CaMKII activation, thereby inhibiting the CREB signaling pathway (Figure 7), which is essential for synaptic plasticity and neuronal survival. This disruption can further contribute to neurodegenerative processes.

      Thus, our findings underscore the complexity of pUb-mediated neurodegeneration and call for further investigation into downstream consequences.

      Reviewer #1 (Recommendations for the authors):

      Suggestions for improved or additional experiments, data or analyses.

      We have performed additional experiments to investigate how the impairment of ubiquitin-proteasomal activity contributes to neurodegeneration. Specifically, we investigated CamK2n1, an endogenous inhibitor of CaMKII, which is normally degraded by the proteasome to allow CaMKII activation. Our proteomics analysis revealed a significant (>50-fold) elevation of CamK2n1 following sPINK1 overexpression (Figure 5—figure supplement 5 and Supplementary Data).

      To validate this mechanism, we conducted immunofluorescence and Western blot analyses, demonstrating reduced levels of phosphorylated CaMKII (pCaMKII) and phosphorylated CREB (pCREB), as well as reduced levels of downstream proteins such as BDNF and ERK. These results have been incorporated into the revised manuscript (Figure 7).

      As the proteasome is crucial in maintaining proteostasis, its dysregulation would trigger neurodegeneration through multiple pathways, contributing to a broad cascade of pathological events.

      Reviewer #2 (Public review):

      Summary:

      The manuscript makes the claim that pUb is elevated in a number of degenerative conditions including Alzheimer's Disease and cerebral ischemia. Some of this is based on antibody staining which is poorly controlled and difficult to accept at this point. They confirm previous results that a cytosolic form of PINK1 accumulates following proteasome inhibition and that this can be active. Accumulation of pUb is proposed to interfere with proteostasis through inhibition of the proteasome. Much of the data relies on over-expression and there is little support for this reflecting physiological mechanisms.

      Weaknesses:

      The manuscript is poorly written. I appreciate this may be difficult in a non-native tongue, but felt that many of the problems are organizational. Less data of higher quality, better controls and incision would be preferable. Overall the referencing of past work is lamentable. Methods are also very poor and difficult to follow.

      Until technical issues are addressed I think this would represent an unreliable contribution to the field.

      (1) Antibody specificity and detection under pathological conditions

      We recognize the limitations of commercially available antibodies for detecting PINK1 and pUb. Nevertheless, our findings reveal a significant elevation in PINK1 and pUb levels under pathological conditions, such as Alzheimer's disease (AD) and ischemia. Additionally, we observed an increase in pUb levels during brain aging, further supporting its relevance to, and potentially causative role in, this pathological condition. Similarly, elevated pUb levels were observed in cultured cells following pharmacological treatment or oxygen-glucose deprivation (OGD).

      In contrast, in pink1<sup>-/-</sup> mice and HEK293 cells used as negative controls, PINK1 and pUb levels remained consistently low. Therefore, the observed elevation of PINK1 and pUb is associated with specific pathological conditions, rather than an antibody-detection anomaly.

      (2) Overexpression as a model for pathological conditions

      To investigate whether the inhibitory effects of sPINK1 on the ubiquitin-proteasome system (UPS) depend on its kinase activity, we employed a kinase-dead version of sPINK1* as a negative control. Given that PINK1 targets multiple substrates, we also investigated whether its effects on UPS inhibition were specifically mediated by ubiquitin phosphorylation. To this end, we used Ub/S65A (a phospho-null mutant) to block Ub phosphorylation by sPINK1, and Ub/S65E (a phospho-mimetic mutant) to mimic phosphorylated Ub. These well-defined controls ensured the robustness of our conclusions.

      Although overexpression does not perfectly replicate physiological conditions, it provides a valuable model for studying pathological scenarios such as neurodegeneration and brain aging, where pUb levels are elevated. For example, we observed a 30.4% increase in pUb levels in aged mouse brains compared to young brains (Figure 1F). Similarly, in our sPINK1 overexpression model, pUb levels increased by 43.8% and 59.9% at 30 and 70 days post-transfection, respectively, compared to controls (Figures 5A and 5C). Notably, co-expression of sPINK1* with Ub/S65A almost entirely prevented sPINK1* accumulation (Figure 5B), indicating that an active UPS can efficiently degrade this otherwise stable variant of sPINK1.

      Together, our findings demonstrate that sPINK1 accumulation inhibits UPS activity, an effect that can be reversed by the phospho-null Ub mutant. The overexpression model mimics pathological conditions and provides valuable insights into pUb-mediated proteasomal dysfunction.

      (3) Organization of the manuscript

      Following your suggestion, we have restructured the manuscript to present the key findings in a more logical and cohesive sequence:

      (a) Evidence for elevated PINK1 and pUb levels across a broad spectrum of pathological and neurodegenerative conditions;

      (b) The effects of pUb elevation in cultured cells, focusing on the proteasome;

      (c) Mechanistic insights into how pUb elevation inhibits proteasomal activity;

      (d) The absence of PINK1 and pUb alleviates protein aggregation;

      (e) Evidence for the causative relationship between elevated pUb levels and proteasomal inhibition;

      (f) Demonstration that pUb elevation directly contributes to neuronal degeneration;

      (g) Additional evidence for the mechanism of neuronal degeneration following sPINK1* overexpression: the downstream effects of elevated CamK2n1, an inhibitor of CaMKII, resulting from proteasomal inhibition.

      This reorganization should ensure a clear and progressive narrative, and enhance the overall coherence and impact of the revised manuscript.

      (4) Revisions to writing, referencing, and methodology

      We have made a great effort to enhance the clarity and flow of the manuscript, including the addition of references to appropriately acknowledge prior work. We have also expanded the Methods section with additional details to improve readability and ensure reproducibility. We believe these revisions effectively address the concerns raised and strengthen the overall quality of the manuscript.

      Reviewer #2 (Recommendations for the authors):

      Figure 1: PINK1 is a poorly expressed protein and difficult to detect by Western blot let alone by immunofluorescence. I have direct experience of the antibody used in this study and do not consider it reliable. There are much cleaner reagents out there, although they still have many challenges. The minimal requirement here is for the PINK1 antibody staining to be compared in wild-type and knockout mice. One would also expect to see a mitochondrial staining which would require higher magnification to be definitive, but it does not look like it to me. This is a key foundational figure and is unreliable. The pUb antibody also has a high background, see for example figure 2E.

      Under physiological conditions, PINK1 and pUb levels are indeed low, making their detection challenging. However, under pathological conditions, their expression is significantly elevated, correlating with disease severity. Given the limitations of available reagents, using appropriate controls is a standard approach in biological research.

      Nevertheless, we observed robust immunofluorescent staining for PINK1 (Figures 1A, 1C, and 1G) and pUb (Figures 1B, 1D, and 1G) in human brain samples from Alzheimer’s disease (AD) patients and mouse models of AD and cerebral ischemia. Compared to healthy controls, the significant elevation of PINK1 and pUb under these pathological conditions accounts for their clear visualization. To validate antibody specificity, we have included images from pink1<sup>-/-</sup> mice as negative controls (Figures 1C and 1D, third panel).

      Furthermore, we analyzed pUb levels in both young and aged mice, using pink1<sup>-/-</sup> mice as controls.

      Our results revealed a significant increase in pUb levels in aged wild-type mice (Figures 1E and 1F). In contrast, pink1<sup>-/-</sup> mice exhibited relatively low pUb levels, with no notable change between young and aged groups. These findings reinforce the conclusion that pUb accumulation during aging is dependent on PINK1.

      For HEK293 cells, pink1<sup>-/-</sup> cells were used as a negative control for assessing PINK1 (Figures 2B and 2C) and pUb levels (Figures 2D and 2E). While the pUb Western blot did show some nonspecific background, as you have noted, pUb levels significantly increased following MG132 treatment of the wild-type cells. In contrast, no such increase was observed in pink1<sup>-/-</sup> cells (Figures 2D and 2E). These results further validate the reliability of our findings.

      Regarding mitochondrial staining, we recognize that PINK1 localization can vary depending on the pathological context. For example, in Alzheimer’s disease, PINK1 exhibits relatively high nuclear staining, while in cerebral ischemia and brain aging, it is predominantly cytoplasmic and punctate. In contrast, in young, healthy mouse brains, PINK1 is more uniformly distributed. The observed elevation in pUb levels could arise from mitochondrial PINK1 or soluble sPINK1 in the cytoplasm, and it remains unclear whether nuclear PINK1 contributes to pUb accumulation. Investigating the role of PINK1 in different forms and subcellular localizations will be an important avenue for future research.

      To enhance clarity, we have updated our images and replaced them with higher-resolution versions in the revised manuscript.

      Please also confirm that the GAPDH loading controls represent the same gels, to my eye they do not match.

      We have reviewed all the bands, and confirmed that the GAPDH loading controls correspond to the same gels. For different gels, we use separate GAPDH loading controls. There are two experimental scenarios to consider:

      (1) When there is a large difference in molecular weight between target proteins, we cut the gel into sections and incubate each section with different antibodies separately.

      (2) When the molecular weight difference is small and cutting is not feasible, we first probe the membrane with one antibody, strip it, and then re-incubate the membrane with a second antibody.

      These approaches ensure accurate and reliable detection of target proteins with various molecular weights relative to GAPDH.
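
      As an illustration only (it is not part of the reported analysis), the normalization that such loading controls make possible can be sketched as below. The band intensities, group sizes, and function names are hypothetical; the only point is that each target band is divided by the GAPDH band from the same lane of the same membrane before groups are compared.

```python
from statistics import mean


def normalize_to_loading_control(target, loading):
    """Divide each target band intensity by the loading-control band of the same lane."""
    if len(target) != len(loading):
        raise ValueError("target and loading-control lanes must pair one-to-one")
    return [t / l for t, l in zip(target, loading)]


def percent_change(control, treated):
    """Percent change of the treated group mean relative to the control group mean."""
    base = mean(control)
    return (mean(treated) - base) / base * 100.0


# Hypothetical densitometry readings in arbitrary units (not data from the study).
control_pub, control_gapdh = [100.0, 110.0, 90.0], [200.0, 220.0, 180.0]
treated_pub, treated_gapdh = [150.0, 130.0, 140.0], [200.0, 200.0, 200.0]

control_norm = normalize_to_loading_control(control_pub, control_gapdh)
treated_norm = normalize_to_loading_control(treated_pub, treated_gapdh)
change = percent_change(control_norm, treated_norm)
print(f"pUb change vs control: {change:.1f}%")
```

      Pairing by lane is the reason the GAPDH bands must come from the same membrane as the target bands: a control taken from a different gel would not correct for lane-to-lane loading differences.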

      1H. Ponceau.

      We have corrected the spelling.

      Figure 2 many elements are confirmation of work already reported and this must be made clearer in the text. 

      Indeed, the elevation of sPINK1 and pUb upon proteasomal inhibition has been previously reported, and these studies have been acknowledged (Gao, et al, 2016; Dantuma, et al, 2000). In the present study, we expand on these findings by conducting a detailed analysis of the time- and concentration-dependent effects of MG132 on sPINK1 and pUb levels, establishing a causative relationship between pUb accumulation and proteasomal inhibition. Furthermore, we demonstrate that sPINK1 overexpression and MG132-induced proteasomal inhibition exhibit no additive effect, indicating that both converge on the same pathway, resulting in the impairment of proteasomal activity.

      It has been established that ubiquitin phosphorylation inhibits Ub chain elongation (Wauer, et al, 2015). However, our study provides novel insights by identifying an additional mechanism: phosphorylated Ub also interferes with the noncovalent interactions between Ub chain and Ub receptors in the proteasome, which further contributes to the impairment of UPS function.

      The PINK1 kinase-dead mutant construction (Figure 2F) and the use of Ub-GFP as a proteasomal substrate were based on established methodologies, which have been appropriately cited in the manuscript (Beilina, et al, 2005 for KD sPINK1; Yamano, et al for endogenous PINK1; Samant, et al, 2018 and Dantuma, et al, 2000 for Ub-GFP probe). Similarly, our use of puromycin and BALA treatments follows previously reported protocols (Gao, et al, 2016), which allowed us to dissect the relative contributions of sPINK1* overexpression to proteasomal vs. autophagic dysfunction.

      As you have noted, our study has built upon prior findings while introducing new mechanistic insights into sPINK1 and pUb-mediated proteasomal dysfunction.

      2C 24h MG132 not recommended, most cells are dead by then.

      We used MG132 treatment for 24 hours to evaluate the time-course effects of proteasomal inhibition on PINK1 and pUb levels in HEK293 cells (Figures 2C and 2E). We did observe some decrease in both PINK1 and pUb levels at 24 hours compared to 12 hours, which may result from some extent of cell death at the longer treatment duration.

      In SH-SY5Y cells, we collected cells at 24 hours after MG132 administration (Figure 5—figure supplement 1). Though protein aggregation was evident in these cells, we did not observe pronounced cell death under these conditions, justifying our treatment.

      Our findings are consistent with previous studies demonstrating that MG132 at 5 µM for 24 hours effectively induces proteasomal inhibition without substantial cytotoxicity. For example, studies using human esophageal squamous cancer cells have reported that this treatment condition inhibits cell proliferation while maintaining cell viability, with cell viability >70% after 24-hour treatment with 5 µM MG132 (Int J Mol Med 33: 1083-1088, 2014). 

      MG132 has been commonly used at concentrations ranging from 5 to 50 µM for durations of 1 to 24 hours, as stated on the vendor’s website (https://www.cellsignal.com/products/activators-inhibitors/mg-132/2194).

      2I what is BALA do they mean bafilomycin. This is a v-ATPase inhibitor, not just an autophagy inhibitor.

      We appreciate the reviewer’s comment regarding the use of BALA in Figure 2I. To clarify, BALA refers to bafilomycin A1, a well-established v-ATPase inhibitor that blocks lysosomal acidification. While bafilomycin A1 is commonly used as an autophagy inhibitor, its primary mechanism involves inhibiting lysosomal function, which is critical for autophagosome-lysosome fusion and subsequent degradation of autophagic cargo.

      In our study, we used bafilomycin A1 in conjunction with puromycin to dissect the relative contributions of sPINK1 overexpression to proteasomal and autophagic activities. Puromycin induces protein misfolding and aggregation, causing stress on both degradation pathways. By inhibiting lysosomal function with bafilomycin A1 and blocking the protein degradation load at various stages, we can distinguish the relative contributions of the autophagy and UPS pathways.

      We acknowledge that bafilomycin A1’s effects extend beyond autophagy, as it also inhibits v-ATPase activity. However, its inhibition of lysosomal degradation is integral to distinguishing autophagy’s contribution under the experimental conditions, and BALA treatment has been used extensively in previous studies (Mauvezin and Neufeld, 2015).

      We have further clarified this treatment in the revised manuscript.

      Figure 3. Legend or text needs to be more explicit about how chains have been produced. From what I can gather from methods only a single E2 has been trialed. Authors should use at least one of the criteria used by Wauer et al. (2014) to confirm the stoichiometry of phosphorylation. The concept that pUb can interfere with E2 discharging is not new, but not universal across E2s.

      We have cited in the manuscript that PINK1-mediated ubiquitin phosphorylation can interfere with ubiquitin chain elongation for certain E2 enzymes (Wauer et al., 2015). 

      To clarify, the focus of our current work is on how elevation of Ub phosphorylation impacts UPS activity, rather than exploring the broader effects of Ub phosphorylation on Ub chain elongation. For this reason, we have used the standard E2 that is well-established for generating K48-linked polyUb chains (Pickart CM, 2005). Moreover, our findings go further by demonstrating that phosphorylated K48-linked polyubiquitin exhibits weaker non-covalent interactions with proteasomal ubiquitin receptors. This dual effect, on both covalent chain elongation and non-covalent interactions, contributes to the observed reduction in ubiquitin-proteasome activity, a novel aspect of our study.

      To address the reviewer’s concerns, we have added details in the Methods section and figure legends regarding the generation of ubiquitin chains. Specifically, we used ubiquitin-activating enzyme E1 (UniProt ID: P22314) and ubiquitin-conjugating enzyme E2-25K (UniProt ID: P61086) to generate K48-linked ubiquitin chains. 

      Our ESI-MS analysis showed that only 1–2 phosphoryl groups were incorporated into the K48-linked tetra-ubiquitin chains (Figure 3—figure supplement 2). This is consistent with our in vivo findings, where pUb levels increased by 30.4% in aged mouse brains compared to young brains (Figure 1F). Notably, even sub-stoichiometric phosphorylation onto the K48-linked ubiquitin chain significantly weakens the non-covalent interactions with the proteasome (Figures 3E and 3H).

      Figure 4. I could find no definition of the insoluble fraction, nor details on how it is prepared.

      The insoluble fraction primarily contains proteins that are aggregated or associated with hydrophobic interactions and cannot be solubilized by RIPA buffer. We have provided more details in the Methods of the revised manuscript about how the insoluble fraction was prepared. Our approach was based on established protocols for fractionating soluble and insoluble proteins from brain tissues (Wirths, 2017). Here is an outline of the procedure, which enables the separation and subsequent analysis of distinct protein populations:

      • Lysis and preparation of soluble fraction: Cells and brain tissues were lysed using RIPA buffer (Beyotime Biotechnology, cat# P0013B) containing protease (P1005) and phosphatase inhibitors (P1081) on ice for 30 minutes, with gentle vortexing every 10 minutes. Brain samples were homogenized using a precooled TissuePrep instrument (TP-24, Gering Instrument Company). Lysates were centrifuged at 12,000 rpm for 30 minutes at 4°C. The supernatant was collected as the soluble protein fraction.

      • Preparation of insoluble fraction: The pellet was resuspended in 20 µl of SDS buffer (2% SDS, 50 mM Tris-HCl, pH 7.5) and sonicated at 4°C for 8 cycles (10 seconds of ultrasound followed by a 30-second interval). The samples were then centrifuged at 12,000 rpm for 30 minutes at 4°C. The supernatant obtained after this step was designated as the insoluble protein fraction.

      • Protein quantification: Protein concentrations for both soluble and insoluble fractions were determined using the BCA Protein Assay Kit (Beyotime Biotechnology, cat# P0009).
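
      The centrifugation steps above are stated in rpm, which corresponds to different forces on different rotors. A minimal sketch of the conversion to relative centrifugal force is given below for readers reproducing the protocol. It uses the standard relation RCF = 1.118 × 10^-5 × r × rpm^2 (r in cm); the rotor radii are hypothetical, since the rotor is not specified in this response, and the function name is ours.

```python
def rpm_to_rcf(rpm: float, rotor_radius_cm: float) -> float:
    """Relative centrifugal force (x g) from rotor speed (rpm) and rotor radius (cm).

    Standard relation: RCF = 1.118e-5 * r_cm * rpm**2.
    """
    if rpm < 0 or rotor_radius_cm <= 0:
        raise ValueError("rpm must be non-negative and radius must be positive")
    return 1.118e-5 * rotor_radius_cm * rpm ** 2


# 12,000 rpm is taken from the protocol; both radii are hypothetical examples.
for radius_cm in (6.0, 8.0):
    rcf = rpm_to_rcf(12_000, radius_cm)
    print(f"{radius_cm:.0f} cm rotor at 12,000 rpm: about {rcf:,.0f} x g")
```

      Because the force scales linearly with radius, reporting the rotor model or the force in x g alongside the rpm value makes the fractionation step reproducible on other instruments.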

      Figure 5. What is the transfection efficiency? How many folds is sPINK1 over-expressed? Typically, a neuron will have only a few hundred copies of PINK1 at the basal state. How much mutant ubiquitin is expressed relative to wild type, seeing the free ubiquitin signals on the gels might be helpful here, but they seem to have been cut off. 

      We appreciate the reviewer's insightful comments regarding transfection efficiency, the extent of sPINK1 overexpression, and the expression levels of mutant ubiquitin relative to wild-type ubiquitin. Below, we provide detailed responses to each point:

      Transfection Efficiency: Our immunofluorescent staining for NeuN, a neuronal marker, demonstrated that over 90% of NeuN-positive cells were co-localized with GFP (Figure 5—figure supplement 2), indicating a high transfection efficiency in our neuronal cultures.

      Extent of sPINK1 Overexpression: Quantifying the exact fold increase of sPINK1 upon overexpression is difficult because its basal expression under physiological conditions is very low, so the ratio has a small and poorly determined denominator. However, our Western blot analysis shows that ischemic events can cause a substantial elevation of PINK1 levels, including both full-length and cleaved forms (Figure 1H). This suggests that our overexpression model recapitulates the pathological increase in PINK1, making it a relevant system for studying disease mechanisms.
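
      The small-denominator problem can be illustrated numerically. The sketch below is not derived from our data: all intensities and the measurement error are hypothetical, and the helper name is ours. It shows how the same absolute uncertainty in the baseline gives a wide fold-change range when the baseline is near the detection limit and a narrow one when the baseline is well above it.

```python
def fold_change_range(treated: float, baseline: float, baseline_error: float):
    """Fold-change bounds when the baseline is only known to +/- baseline_error."""
    low = baseline - baseline_error
    high = baseline + baseline_error
    if low <= 0:
        # Baseline cannot be distinguished from zero: no finite upper bound exists.
        return (treated / high, float("inf"))
    return (treated / high, treated / low)


# Hypothetical signals (arbitrary units) with the same absolute error of 0.8.
weak = fold_change_range(treated=50.0, baseline=1.0, baseline_error=0.8)
strong = fold_change_range(treated=50.0, baseline=25.0, baseline_error=0.8)

print(f"weak baseline:   {weak[0]:.1f}- to {weak[1]:.1f}-fold")
print(f"strong baseline: {strong[0]:.2f}- to {strong[1]:.2f}-fold")
```

      With a weak baseline, the plausible fold change spans roughly an order of magnitude, which is why we compare against pathological PINK1 levels rather than report a single fold value.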

      From Figure 5B, it is evident that sPINK1 levels differ significantly between neurons overexpressing sPINK1 alone and those co-expressing sPINK1 + Ub/S65A (70 days post-transfection). Overexpression of sPINK1 alone results in multiple PINK1 bands, consistent with sPINK1, endogenous PINK1 (induced by mitochondrial damage), and ubiquitinated sPINK1. In comparison, co-expressing Ub/S65A leads to faint PINK1 bands, suggesting that in the presence of a functionally restored proteasome, overexpressed sPINK1 is rapidly degraded. Therefore, actual accumulation of sPINK1 depends on proteasomal activity, and the “over-expressed” PINK1 level can be comparable to levels observed under native, pathological conditions.

      Expression Levels of Mutant Ubiquitin Relative to Wild-Type: Assessing the expression levels of mutant versus wild-type ubiquitin is indeed valuable. In Figure 5E, we observed a 38.9% increase in high-molecular-weight ubiquitin conjugates in the soluble fraction when comparing the sPINK1+Ub/S65A group to the control. This increase suggests that mutant ubiquitin is actively incorporated into polyubiquitin chains.

      Regarding free monomeric ubiquitin, its low abundance and rapid incorporation into polyubiquitin chains make it difficult to visualize in Western blots. Additionally, its low molecular weight and lower antibody binding valency further reduce its visibility.

      General: a number of effects are shown following over-expression but no case is made that these levels of pUb are ever attained physiologically. I am very unconvinced by these findings and think the manuscript needs to be improved at multiple levels before being added to the record.

      We understand the reviewer’s concerns regarding the relevance of pUb levels observed in our overexpression model. To clarify, our study is not focused on physiological levels of pUb, but rather on pathologically elevated levels, which have been documented in various neurodegenerative conditions. While overexpression is not a perfect replication of pathological states, it provides a valuable tool to investigate mechanisms that become relevant under disease conditions. Moreover, we have taken steps to ensure the validity of our findings and to address potential limitations associated with overexpression models:

      Pathological Relevance: In addition to previously published reports, we observed significant increases in PINK1 and pUb levels in human brain samples from Alzheimer's disease (AD) patients, as well as in mouse models of AD, cerebral ischemia (including the mouse middle cerebral artery occlusion ischemic model and the oxygen-glucose deprivation cell model), and aging (e.g., Figures 1E, 1F, and 1H). All these data show that pUb levels are elevated under pathological conditions. Our overexpression model mimics these pathological scenarios by recreating the high levels of pUb, which lead to the impairment of proteasomal activity and subsequent disruption of proteostasis.

      Use of Robust Controls: To ensure the reliability of our results and interpretations, we employed multiple controls for our experiments. We have used pink1<sup>-/-</sup> mice and cells to confirm that pUb accumulation is PINK1-dependent (Figures 1C and 2C). We have also included a kinase-dead sPINK1 mutant and the Ub/S65A phospho-null mutant to negate or counteract the specific roles of PINK1 activity and pUb in proteasomal dysfunction. In addition, we used Ub/S65E as a phosphomimetic mutant, corresponding to 100% Ub phosphorylation.

      Importantly, we have compared sPINK1 overexpression with both baseline and disease-mimicking conditions to ensure that the observed effects are consistent with pathological changes. Furthermore, our findings are supported by complementary evidence from human brain samples, animal models, cell cultures, and molecular assays. By integrating these controls and approaches, we have provided mechanistic insights into how elevated pUb levels cause proteasomal impairment and contribute to neurodegeneration.

      Our findings elucidate how elevated pUb levels contribute to the disruption of proteostasis in neurodegenerative conditions. While overexpression may have limitations, it remains a powerful tool for dissecting pathological mechanisms and testing hypotheses. Our results align with and expand upon previous studies suggesting pUb as a biomarker of neurodegeneration (Hou, et al, 2018; Fiesel, et al, 2015), and provide mechanistic insights into how elevated pUb and sPINK1 drive a vicious feedforward cycle, ultimately leading to proteasomal dysfunction and neurodegeneration.

      We hope these clarifications highlight the relevance and rigor of our study, and welcome additional suggestions to improve the manuscript.

      Reviewer #3 (Public review):

      Summary:

      This study aims to explore the role of phosphorylated ubiquitin (pUb) in proteostasis and its impact on neurodegeneration. By employing a combination of molecular, cellular, and in vivo approaches, the authors demonstrate that elevated pUb levels contribute to both protective and neurotoxic effects, depending on the context. The research integrates proteasomal inhibition, mitochondrial dysfunction, and protein aggregation, providing new insights into the pathology of neurodegenerative diseases.

      Strengths:

      - The integration of proteomics, molecular biology, and animal models provides comprehensive insights.

      - The use of phospho-null and phospho-mimetic ubiquitin mutants elegantly demonstrates the dual effects of pUb.

      - Data on behavioral changes and cognitive impairments establish a clear link between cellular mechanisms and functional outcomes.

      Weaknesses:

      - While the study discusses the reciprocal relationship between proteasomal inhibition and pUb elevation, causality remains partially inferred.

      It has been well-established that protein aggregates, particularly neurodegenerative fibrils, can impair proteasomal activity (McDade, et al., 2024; Kinger, et al., 2024; Tseng, et al., 2008). Other contributing factors, including ATP depletion, reduced proteasome component expression, and covalent modifications of proteasomal subunits, can also lead to declined proteasomal function. Additionally, mitochondrial injury serves as an important source of elevated PINK1 and pUb levels. Recent studies have demonstrated that efficient mitophagy is essential to prevent pUb accumulation, whereas partial mitophagy failure results in elevated PINK1 levels (Chin, et al, 2023; Pollock, et al. 2024).

      While pathological conditions can impair proteasomal function and slow sPINK1 degradation, leading to its accumulation, our results demonstrate that overexpression of sPINK1 or PINK1 can initiate this cycle as well. Once initiated, the cycle becomes self-perpetuating, as sPINK1 and pUb accumulation progressively impair proteasomal function, leading to more protein aggregates and mitochondrial damage.

      Importantly, we show that co-expression of Ub/S65A effectively rescues cells from this cycle, which further illustrates the pivotal role of pUb in driving proteasomal inhibition and the causality between pUb elevation and proteasomal inhibition. At the animal level, pink1 knockout prevents protein aggregation under aging and cerebral ischemia conditions (Figures 1E and 1G). 

      Together, with controls at the protein, cell, and animal levels, our findings support this self-reinforcing and self-amplifying cycle of pUb elevation, proteasomal inhibition, protein aggregation, mitochondrial damage, and ultimately, neurodegeneration.

      - The role of alternative pathways, such as autophagy, in compensating for proteasomal dysfunction is underexplored.

      Indeed, previous studies have shown that elevated sPINK1 can enhance autophagy (Gao, et al., 2016), potentially compensating for impaired UPS function. One mechanism involves PINK1-mediated phosphorylation of p62, which enhances autophagic activity.

      In our study, we observed increased autophagic activity upon sPINK1 overexpression, as shown in Figure 2I (middle panel, without BALA). This increase in autophagy may facilitate the degradation of ubiquitinated proteins induced by puromycin, partially mitigating proteasomal dysfunction. This compensation might also explain why protein aggregation, though statistically significant, increased only slightly at 70 days post-sPINK1 transfection (Figure 5F). Additionally, we detected a mild but statistically insignificant increase in LC3II levels in the hippocampus of mouse brains at 70 days post-sPINK1 transfection (Figure 5—figure supplement 6), further supporting the notion of autophagy activation.

      However, while autophagy may provide some compensation, its effect is likely limited. The UPS and autophagy serve distinct roles in protein degradation:

      • Autophagy is a bulk degradation pathway, primarily targeting damaged organelles, intracellular pathogens, and protein aggregates, often in a non-selective manner.

      • The UPS, in contrast, is highly selective, degrading short-lived regulatory proteins, misfolded proteins, and proteins tagged for degradation via ubiquitination.

      Thus, while sPINK1 overexpression enhances autophagy-mediated degradation, it simultaneously impairs UPS-mediated degradation. This suggests that autophagy partially compensates for proteasomal dysfunction but is insufficient to counterbalance the UPS's selective degradation function. We have incorporated additional discussion in the revised manuscript.

      - The immunofluorescence images in Figure 1A-D lack clarity and transparency. It is not clear whether the images represent human brain tissue, mouse brain tissue, or cultured cells. Additionally, the DAPI staining is not well-defined, making it difficult to discern cell nuclei or staging. To address these issues, lower-magnification images that clearly show the brain region should be provided, along with improved DAPI staining for better visualization. Furthermore, the Results section and Figure legends should explicitly indicate which brain region is being presented. These concerns raise questions about the reliability of the reported pUb levels in AD, which is a critical aspect of the study's findings.

      We have taken steps to address the concerns regarding clarity and transparency in Figure 1A-D. The source of each tissue is already indicated to the left of each image. For example, we have written “human brain with AD” at the left side of Figure 1A, and “mouse brains with AD” at the left side of Figure 1C.

      Briefly, the human brain samples in Figure 1 originate from the cingulate gyrus of Alzheimer’s disease (AD) patients. Our analysis revealed that PINK1 is primarily localized within cell bodies, whereas pUb is more abundant around Aβ plaques, likely in nerve terminals. For the mouse brain samples, we have now explicitly indicated in the figure legends and Results section that the images represent the neocortex of APP/PS1 mice, a mouse model relevant to AD pathology, as well as the corresponding regions in wild-type and pink1<sup>-/-</sup> mice. We have ensured that the brain regions and sources are clearly stated throughout the manuscript.

      Regarding image clarity, we have uploaded higher-resolution versions of the images in the revised manuscript to improve visualization of key features, including DAPI staining. We believe these revisions enhance the reliability and interpretability of our findings, particularly in relation to the reported pUb levels in AD. 

      - Figure 4B should also indicate which brain region is being presented.

      The images were taken for layer III-IV in the neocortex of mouse brains. We have included this information in the figure legend of the revised manuscript.

      Reviewer #3 (Recommendations for the authors):

      - Expand on the potential compensatory role of autophagy in response to proteasomal dysfunction.

      Upon proteasomal inhibition, cells may activate autophagy as an alternative pathway of degradation to help clear damaged or misfolded proteins. Autophagy is a bulk degradation process that targets long-lived proteins, damaged organelles, and aggregated proteins for lysosomal degradation. While this pathway can provide some compensation, it is distinct from the ubiquitin-proteasome system (UPS), which specializes in the selective degradation of short-lived regulatory proteins and misfolded proteins.

      In our study, we observed increased autophagic activity following sPINK1 overexpression (Figure 2J, middle panel, without BALA) and a slight, though statistically insignificant, increase in LC3II levels in the hippocampus of mouse brains at 70 days post-sPINK1 transfection (Figure 5—figure supplement 6). These findings suggest that autophagy is indeed upregulated as a compensatory response to proteasomal dysfunction, potentially facilitating the degradation of aggregated ubiquitinated proteins. Additionally, gene set enrichment analysis (GSEA) revealed similar enrichment of autophagy pathways at 30 and 70 days post-sPINK1 overexpression (Figure 5—figure supplement 5).

      However, the compensatory capacity of autophagy is likely limited. While autophagy can reduce protein aggregation, it is an inherently non-selective process and cannot fully replace the targeted functions of the UPS. Moreover, as we illustrate in Figure 7 of the revised manuscript, UPS is essential for degrading specific regulatory and inhibitory proteins and plays a critical role in cellular proteostasis, particularly in signaling regulation, cell cycle control, and stress responses.

      Together, while autophagy activation provides some degree of compensation, it cannot fully restore cellular proteostasis. The interplay between these two degradation pathways is an important area for future investigation. For the present study, our focus is on how pUb elevation impacts proteasomal activity and elicits downstream effects.

      We have incorporated this additional discussion in the revised manuscript.

      - Simplify the discussion of complex mechanisms to improve accessibility for readers.

      We have revised the Discussion to present the mechanisms in a more coherent and accessible manner, ensuring clarity for a broader readership. These revisions should make the discussion more intuitive while preserving the depth of our findings.

      - Statistical analyses could benefit from clarifying how technical replicates and biological replicates were accounted for across experiments.

      We have clarified our statistical analysis in the Methods section and figure legends, explicitly detailing how many biological replicates were accounted for across experiments. These revisions should enhance transparency and clarity, ensuring that our findings are robust and reproducible.

      - The image in Figure 3D is too small to distinguish any signals. A larger and clearer image should be presented.

      We have enlarged the images in Figure 3D. Additionally, we have replaced figures with higher-resolution versions throughout the manuscript.

      - NeuN expression in Figure 4B differs between wildtype and pink-/- mice. Additional validation is needed to determine whether pink-/- enhances NeuN expression.

      The difference in NeuN immunofluorescence intensity between wild-type and pink1<sup>-/-</sup> mice in Figure 4B may simply result from variations in image acquisition rather than an actual difference in NeuN expression.

      Our single nuclei RNA-seq analyses of wild-type and pink1<sup>-/-</sup> mice at 3 and 18 months of age reveal no significant differences in NeuN expression at the transcript level (data provided below). This confirms that the observed variation in fluorescence intensity is unlikely to reflect an authentic upregulation of NeuN expression. Thus, factors like the concentration of antibody, image exposure and processing may contribute to differences in staining intensity.

      Author response image 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The authors have examined gene expression between life cycle stages in a range of brown macroalgae to examine whether there are conserved aspects of biological features. 

      Strengths: 

      The manuscript incorporates large gene expression datasets from 10 different species and therefore enables a comprehensive assessment of the degree of conservation of different aspects of gene expression and underlying biology. 

      The findings represent an important step forward in our understanding of the core aspects of cell biology that differ between life cycle phases and provide a substantial resource for further detailed studies in this area. Convincing evidence is provided for the conservation of lifecycle-specific gene expression between species, particularly in core housekeeping gene modules. 

      Weaknesses: 

      I found a few weaknesses in the methodology and experimental design. I think the manuscript could have been clearer when linking the findings to the biology of the brown algae. 

      Reviewer #2 (Public review): 

      Summary: 

      The manuscript by Ratchinski et al presents a comprehensive analysis of developmental and life history gene expression patterns in brown algal species. The manuscript shows that the degree of generation bias or generation-specific gene expression correlates with the degree of dimorphism. It also reports conservation of life cycle features within generations and marked changes in gene expression patterns in Ectocarpus in the transition between gamete and early sporophyte. The manuscript also reports considerable conservation of gene expression modules between two representative species, particularly in genes associated with conserved functional characteristics. 

      Strengths: 

      The manuscript represents a considerable "tour de force" dataset and analytical effort. While the data presented is largely descriptive, it is likely to provide a very useful resource for studies of brown algal development and for comparative studies with other developmental and life cycle systems. 

      Weaknesses: 

      Notwithstanding the well-known issues associated with inferring function from transcriptomics-only studies, no major weaknesses were identified by this reviewer. 

      Reviewing Editor Comments:

      The overall assessment of the reviewers does not contain major aspects of concern. We nevertheless recommend that the authors carefully consider the constructive comments, as this will further improve their manuscript. 

      Reviewer #1 (Recommendations for the authors): 

      (1) Line 32: The abstract states 'considerable conservation of co-expressed gene modules', but the degree of conservation between Ectocarpus and D. dichotoma appeared limited to specific subsets of genes with highly conserved housekeeping functions, e.g., translation. I think the wording of the abstract should be rephrased to better reflect this. 

      We agree that genes with housekeeping functions figure strongly in the gene modules that showed strong conservation between Ectocarpus species 7 and D. dichotoma (and we actually highlight this point in the manuscript) but we do not believe that this invalidates the conservation. In the analysis shown in Figure 6A, for example, high scores were obtained for both connectivity and density for about a third of the gene modules and these modules cover a broad range of cellular functions. This is a significant result given the large phylogenetic distance and we feel that "considerable conservation" is appropriate as a description of the level of correlation. 

      (2) Introduction - The Introduction needs a better explanation of the biology of the life cycle phases. Some of this information is present in the 1st paragraph of Materials and Methods, although it would be preferable to include this information within the main text, ideally within the Introduction before the Results are described. For example, when are flagella present? The presence of flagella could be indicated in Figure 3. The ecology of the life cycle is also not described. Are life cycles present in the same ecological niche? Do they co-exist or occupy distinct environments? It would be useful to understand how the observed genotypes could relate to this wider aspect of the brown algal biology. 

      We have added a sentence to explain that zoids (gametes and spores) are the only flagellated stages of the life cycle (line 678). In addition, in the legend for Figure 3, we have indicated which of the life cycle stages analysed in panel 3A consisted entirely or partially of flagellated cells. We have also added information about phenology to the Introduction. 

      (3) Line 127. 'The proportion of generation specific genes was positively correlated with the level of dimorphism'. The level of dimorphism between species was not clear to me. This needs to be clearly displayed in Figure 1B. 

      We had attempted to illustrate the level of dimorphism, using the size of each generation as a measurable proxy, in Figure S1 but we agree that the information was not very clearly presented. To improve clarity, we now provide independent size scales for each generation of the life cycle in this figure and state in the legend that "Size bars indicate the approximate sizes of each generation of each life cycle, providing an indication of the degree of dimorphism between the two generations.". In the text, Figure S1 is cited earlier in the paragraph but we now repeat the citation of the figure at the end of the sentence "The proportion of generation-specific genes (...) was positively correlated with the level of dimorphism" so that the reader can specifically consult the supplementary figure for this phenotypic parameter. 

      (4) Line 267. Are there known differences in cell wall composition between life cycle phases or within each generation as individual life cycle phases mature (e.g., differences between unicellular and multicellular stages)? 

      Detailed comparative analyses of cell wall composition at different stages of the life cycle have not been carried out for brown algae. However, Congo red stains Ectocarpus gametophytes but not sporophytes (Coelho et al., 2011), indicating a difference in cell wall composition between the two generations. Zoids (spores and gametes) do not have a cell wall and calcofluor white staining of meio-spores has indicated that a cell wall only starts to be deposited 24-48 hours post-release (Arun et al., 2013).

      (5) Line 388. The authors should comment on the accuracy of OrthoFinder for different gene types across this degree of divergence (250 MYA). The best conservation was found in genes with housekeeping characteristics (line 401). It may be that these gene modules show the highest degree of conservation in expression patterns, but I also wonder whether this pattern may also emerge because finding true orthologues is easier for highly conserved gene families. 

      We do not believe that this is the case because, as mentioned above, the "housekeeping" modules cover quite a broad range of cellular functions. Note also that the modules were given functional labels based on their being clearly enriched in genes corresponding to a particular class of function but not all the genes in a module have a predicted function that corresponds to the functional classification. 

      However, we have carried out an analysis to look for evidence of the bias proposed by the reviewer. For this, we used BLASTp identity scores as an approximate proxy for pairwise identity between Ectocarpus species 7 and D. dichotoma one-to-one orthologues in each module and plotted the mean identity score for each module against the Fisher's exact test p-value of the contingency table in Figure 6C (Author response image 1).

      Author response image 1.

      Plot of estimations of the mean percent shared identity between the orthologues within each module (based on mean BLASTp identity scores) against log10(p-value) values obtained with the Fisher's exact test applied in Figure 6C to determine whether pairs of modules shared a greater number of one-to-one orthologues than expected from a random distribution. Error bars indicate the standard deviation. 

      This analysis did not detect any correlation between the degree of sequence conservation of orthologues in a module and the degree of conservation of the module between Ectocarpus species 7 and D. dichotoma.
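
      For readers who wish to run a comparable check on their own data, the following is a minimal sketch of the two steps described above, not the code used for the analysis. It assumes a one-sided Fisher's exact test on a 2x2 table of shared one-to-one orthologues and a Pearson correlation between per-module mean identity and -log10(p-value); the function names and all numbers are hypothetical placeholders.

```python
from math import comb, log10


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: orthologues in both modules, b: only in the first module,
    c: only in the second module, d: in neither.
    Returns P(X >= a) under the hypergeometric null hypothesis.
    """
    row1 = a + b            # orthologues in the first module
    col1 = a + c            # orthologues in the second module
    n = a + b + c + d       # all one-to-one orthologues considered
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical per-module data: (mean BLASTp identity in %, 2x2 table).
modules = [
    (62.0, (30, 70, 40, 860)),
    (55.5, (12, 88, 60, 840)),
    (71.3, (5, 45, 95, 855)),
    (48.9, (20, 30, 80, 870)),
]

identity = [m[0] for m in modules]
neg_log_p = [-log10(fisher_exact_greater(*m[1])) for m in modules]
r = pearson_r(identity, neg_log_p)
print(f"Pearson r between mean identity and -log10(p): {r:.3f}")
```

      A correlation coefficient close to zero in such a check would be in line with the absence of the bias proposed by the reviewer, whereas a strongly positive value would suggest that better-conserved sequences drive the apparent module conservation.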

      Minor comments 

      (1) Line 650 loose should be lose.

      The error has been corrected.

      (2) Line 695 filtered through a 1 μm filter to remove multicellular gametophyte fractions. Is this correct? It seems too small to allow gametes to pass through. 

      Yes, the text is correct, a 1 μm filter was used. The gametes do pass through this filter, presumably because they do not have a rigid cell wall, allowing them to squeeze through the filter when a light pressure is applied. 

      (3) Line 709 - DDT should be DTT 

      The error has been corrected.

      Reviewer #2 (Recommendations for the authors): 

      (1) It is not clear why the chosen species for analysis do not include fucoid algae, which display a high degree of dimorphism between generations and which are relatively well studied with respect to gene expression patterns during early development. Indeed, it was recently shown that gene expression patterns in developing embryos of Fucus spp. obey the "hourglass" pattern whereby gene expression shows a minima of transcription age index (i.e., higher expression of evolutionarily older genes) associated with differentiation at the phylotypic stage. I am somewhat surprised that the manuscript does not consider this feature in the analysis or discussion. 

      Brown algae of the order Fucales have diploid life cycles and therefore do not alternate between a sporophyte and gametophyte generation. It is for this reason that we thought that it was more interesting to compare Ectocarpus species 7 with D. dichotoma, which has a haploid-diploid life cycle.

      (2) In Discussion, the comparison of maternal to zygote transition in animals and land plants, which show a high degree of dimorphism, with Ectocarpus would be strengthened by data/discussion from other brown algae that show a high degree of dimorphism. 

      Animals have diploid life cycles and dimorphism in that lineage generally refers to sexual rather than generational dimorphism. Land plants do have highly dimorphic haploid-diploid life cycles but it is unclear how this characteristic relates to events that occur during the maternal to zygote transition. In Ectocarpus, the transition from gamete to the first stages of sporophyte development involved more marked changes in gene expression than we observed when comparing the mature sporophyte and gametophyte generations (Figure 3C). At present, there is no evidence that events during these two transitions are correlated. The relationship between changes in gene expression during very early sporophyte development and during alternation of life cycle generations could be investigated further using a highly dimorphic kelp model system such as Saccharina latissima but we are not aware of any studies that have specifically addressed this point. 

      (3) Since marked changes were observed during the transition from gamete to early sporophyte in Ectocarpus, it would be interesting to know how gene expression patterns change during the transition from gamete to partheno-sporophyte. Would the same patterns of downregulation and upregulation be expected? 

      The sporophyte individuals derived from gamete parthenogenesis (parthenosporophytes) are indistinguishable morphologically and functionally from diploid sporophytes derived from gamete fusions (see line 76). They also express generation marker genes in a comparable manner (Peters et al., 2008). Based on these observations, we have treated partheno-sporophytes and diploid sporophytes as equivalent in our experiments. For clarity, we have now distinguished partheno-sporophyte from diploid sporophyte samples in Table S1. 

      (4) The authors show a correlation between the degree of dimorphism and generation-biased or generation-specific expression. How was the degree of dimorphism quantified? 

      The degree of dimorphism is illustrated in Figure S1 using the relative size of the two generations as a proxy. Size estimations are approximate because the size of an individual of a particular species is quite variable but the ten species nonetheless represent a very clear gradient of dimorphism due to the extreme differences in size between generations of species at each end of the scale, with the sporophyte generation being several orders of magnitude larger than the gametophyte generation or vice versa. 

      References

      Arun A, Peters NT, Scornet D, Peters AF, Cock JM, Coelho SM. 2013. Non-cell autonomous regulation of life cycle transitions in the model brown alga Ectocarpus. New Phytol 197:503–510. doi:10.1111/nph.12007

      Coelho SM, Godfroy O, Arun A, Le Corguillé G, Peters AF, Cock JM. 2011. OUROBOROS is a master regulator of the gametophyte to sporophyte life cycle transition in the brown alga Ectocarpus. Proc Natl Acad Sci USA 108:11518–11523. doi:10.1073/pnas.1102274108

      Peters AF, Scornet D, Ratin M, Charrier B, Monnier A, Merrien Y, Corre E, Coelho SM, Cock JM. 2008. Life-cycle-generation-specific developmental processes are modified in the immediate upright mutant of the brown alga Ectocarpus siliculosus. Development 135:1503–1512. doi:10.1242/dev.016303

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This manuscript presents a study on expectation manipulation to induce placebo and nocebo effects in healthy participants. The study follows standard placebo experiment conventions with the use of TENS stimulation as the placebo manipulation. The authors were able to achieve their aims. A key finding is that placebo and nocebo effects were predicted by recent experience, which is a novel contribution to the literature. The findings provide insights into the differences between placebo and nocebo effects and the potential moderators of these effects.

      Specifically, the study aimed to:

      (1) assess the magnitude of placebo and nocebo effects immediately after induction through verbal instructions and conditioning

      (2) examine the persistence of these effects one week later, and

      (3) identify predictors of sustained placebo and nocebo responses over time.

      Strengths:

      An innovation was to use sham TENS stimulation as the expectation manipulation. This expectation manipulation was reinforced not only by the change in pain stimulus intensity, but also by delivery of non-painful electrical stimulation, labelled as TENS stimulation.

      Questionnaire-based treatment expectation ratings were collected before conditioning and after conditioning, and after the test session, which provided an explicit measure of participants' expectations about the manipulation.

      The finding that placebo and nocebo effects are influenced by recent experience provides a novel insight into a potential moderator of individual placebo effects.

      We thank the reviewer for their thorough evaluation of our manuscript and for highlighting the novelty and originality of our study.

      Weaknesses:

      There are a limited number of trials per test condition (10), which means that the trajectory of responses to the manipulation may not be adequately explored.

      We appreciate the reviewer’s comment regarding the number of trials in the test phase. The trial number was chosen to ensure comparability with previous studies addressing similar research questions with similar designs (e.g. Colloca et al., 2010). Our primary objective was to directly compare placebo and nocebo effects within a within-subject design and to examine their persistence one week after the first test session. While we did not specifically aim to investigate the trajectory of responses within a single testing session, we fully agree that a comprehensive analysis of the trajectories of expectation effects on pain would be a valuable extension of our work. We have now acknowledged this limitation and future direction in the revised manuscript.

      The paragraph reads as follows: “It is important to note that our study was designed in alignment with previous studies addressing similar questions (e.g., Colloca et al., 2010). Our primary aim was to directly compare placebo and nocebo effects in a within-subject design and assess the persistence of these effects one week following the first test session. One limitation of our approach is the relatively short duration of each session, which may have limited our ability to examine the trajectory of responses within a single session. Future studies could address this limitation by increasing the number of trials for a more comprehensive analysis.”

      On day 8, one stimulus per stimulation intensity (i.e., VAS 40, 60, and 80) was applied before the start of the test session to re-familiarise participants with the thermal stimulation. There is a potential risk of revealing the manipulation to participants during the re-familiarization process, as they were not previously briefed to expect the painful stimulus intensity to vary without the application of sham TENS stimulation.

      We thank the reviewer for the opportunity to clarify this point. Participants were informed at the beginning of the experiment that we would use different stimulation intensities to re-familiarize them with the stimuli before the second test session. We are therefore confident that participants perceived this step as part of a recalibration rather than associating it with the experimental manipulation. We have added this information to the revised version of the manuscript.

      The paragraph now reads as follows: “On day 8, one stimulus per stimulation intensity (i.e., VAS 40, 60 and 80) was applied before the start of the test session to re-familiarise participants with the thermal stimulation. Note that participants were informed that these pre-test stimuli were part of the recalibration and refamiliarization procedure conducted prior to the second test session.”

      The differences between the nocebo and control conditions in pain ratings during conditioning could be explained by the differing physiological effects of the different stimulus intensities, so it is difficult to make any claims about expectation effects here.

      We appreciate the reviewer’s comment and agree that, despite the careful calibration of the three pain stimuli, we cannot entirely rule out the possibility that temporal dynamics during the conditioning session were influenced by differential physiological effects of the varying stimulus intensities (e.g., intensity-dependent habituation or sensitization). We have addressed this in the revision of the manuscript, but we would like to emphasize that the stronger nocebo effects during the test phase are statistically controlled for any differences in the conditioning session.

      The paragraph now reads: “This asymmetry is noteworthy in and of itself because it occurred despite the equidistant stimulus calibration relative to the control condition prior to conditioning. It may be the result of different physiological effects of the stimuli over time or amplified learning in the nocebo condition, consistent with its heightened biological relevance, but it could also be a stronger effect of the verbal instructions in this condition.”

      A randomisation error meant that 25 participants received an unbalanced number of trials per condition (i.e., 10 x VAS 40, 14 x VAS 60, 12 x VAS 80).

      We agree that this is indeed unfortunate. However, we would like to point out that all analyses reported in the manuscript have been controlled for the VAS ratings in the conditioning session, i.e., potential effects of the conditioned placebo and nocebo stimuli. Moreover, we have now conducted additional analyses, presented here in our response to the reviewers, to demonstrate that this imbalance did not systematically bias the results. Importantly, the key findings observed during the test phase remain robust despite this issue.

      Specifically, when excluding these 25 participants from the analyses, the reported stronger nocebo compared to placebo effects in the test session on day 1 remain unchanged. Likewise, the comparison of placebo and nocebo effects between days 1 and 8 shows the same pattern when excluding the participants in question. The only exception is the interaction between effect (placebo vs nocebo) x session (day 1 vs day 8), which changed from a borderline significant result (p = .049) to a non-significant one (p = .24). However, post hoc tests continued to show the same pattern as originally reported: a significant reduction in the nocebo effect from day 1 to day 8 and no significant change in the placebo effect.

      Reviewer #2 (Public review):

      Summary:

      Kunkel et al aim to answer a fundamental question: Do placebo and nocebo effects differ in magnitude or longevity? To address this question, they used a powerful within-participants design, with a very large sample size (n=104), in which they compared placebo and nocebo effects - within the same individuals - across verbal expectations, conditioning, testing phase, and a 1-week follow-up. With elegant analyses, they establish that different mechanisms underlie the learning of placebo vs nocebo effects, with the latter being acquired faster and extinguished slower. This is an important finding for both the basic understanding of learning mechanisms in humans and for potential clinical applications to improve human health.

      Strengths:

      Beyond the above - the paper is well-written and very clear. It lays out nicely the need for the current investigation and what implications it holds. The design is elegant, and the analyses are rich, thoughtful, and interesting. The sample size is large which is highly appreciated, considering the longitudinal, in-lab study design. The question is super important and well-investigated, and the entire manuscript is very thoughtful with analyses closely examining the underlying mechanisms of placebo versus nocebo effects.

      We thank the reviewer for their positive evaluation of our manuscript and for acknowledging the methodological rigor and the significant implications for clinical applications and the broader research field.

      Weaknesses:

      There were two highly addressable weaknesses in my opinion:

      (1) I could not find the preregistration - this is crucial to verify what analyses the authors have committed to prior to writing the manuscript. Please provide a link leading directly to the preregistration - searching for the specified number in the suggested website yielded no results.

      We thank the reviewer for pointing this out. We included a link to the preregistration in the revised manuscript. This study was pre-registered with the German Clinical Trial Register (registration number: DRKS00029228; https://drks.de/search/de/trial/DRKS00029228).

      (2) There is a recurring issue which is easy to address: because the Methods are located after the Results, many of the constructs used, analyses conducted, and even the main placebo and nocebo inductions are unclear, making it hard to appreciate the results in full. I recommend finding a way to detail at the beginning of the results section how placebo and nocebo effects have been induced. While my background means I am familiar with these methods, other readers will lack that knowledge. Even a short paragraph or a figure (like Figure 4) could help clarify the results substantially. For example, a significant portion of the results is devoted to the conditioning part of the experiment, while it is unknown which part was involved (e.g., were temperatures lowered/increased in all trials or only in the beginning).

      We thank the reviewer for their helpful comment and agree that the Results section requires additional information that would typically be provided by the Methods section if it directly followed the Introduction. In response, we have moved the former Figure 4 from the Methods section to the beginning of the Results section as a new Figure 1, to improve clarity. Further, we have revised the Methods section to explicitly state that all trials during the conditioning phase were manipulated in the same way.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Given that the authors are claiming (correctly) that there is only limited work comparing placebo/nocebo effects, there are some papers missing from their citations:

      Nocebo responses are stronger than placebo responses after subliminal pain conditioning: Jensen, K., Kirsch, I., Odmalm, S., Kaptchuk, T. J. & Ingvar, M. Classical conditioning of analgesic and hyperalgesic pain responses without conscious awareness. Proc. Natl. Acad. Sci. USA 112, 7863-7 (2015)

      We thank the reviewer and have now included this relevant publication into the introduction of the revised manuscript.

      Hird, E.J., Charalambous, C., El-Deredy, W. et al. Boundary effects of expectation in human pain perception. Sci Rep 9, 9443 (2019). https://doi.org/10.1038/s41598-019-45811-x

      We thank the reviewer for suggesting this relevant publication. We have now included it into the discussion of the revised manuscript by adding the following paragraph:

      “Recent work using a predictive coding framework further suggests that nocebo effects may be less susceptible to prediction error than placebo effects (Hird et al., 2019), which could contribute to their greater persistence and strength in our study.”

      (2) The trial-by-trial pain ratings could have been usefully modelled with a computational model, such as a Bayesian model (this is especially pertinent given the reference to Bayesian processing in the discussion). A multilevel model could also be used to increase the power of the analysis. This is a tentative suggestion, as I appreciate it would require a significant investment of time and work - alternatively, the authors could acknowledge it in the Discussion as a useful future avenue for investigation, if this is preferred.

      We thank the reviewer for this thoughtful suggestion. While we agree that computational modelling approaches could provide valuable insights into individual learning, our study was not designed with this in mind and the relatively small number of trials per condition and the absence of trial-by-trial expectancy ratings limit the applicability of such models. We have therefore chosen not to pursue such analysis but highlight it in the discussion as a promising direction for future research.

      “Notably, the most recent experience was the most predictive in all three analyses; for instance, the placebo effect on day 8 was predicted by the placebo effect on day 1, not by the initial conditioning. This finding supports the Bayesian inference framework, where recent experiences are weighted more heavily in the process of model updating because they are more likely to reflect the current state of the environment, providing the most relevant and immediate information needed to guide future actions and predictions24. Interestingly, while a change in pain predicted subsequent nocebo effects, it seemed less influential than for placebo effects. This aligns with findings that longer conditioning enhanced placebo effects, while it did not affect nocebo responses10 and the conclusion that nocebo instruction may be sufficient to trigger nocebo responses. Using Bayesian modeling, future studies could identify individual differences in the development of placebo and nocebo effects by integrating prior experiences and sensory inputs, providing a probabilistic framework for understanding the underlying mechanisms.”

      (3) The paper is missing any justification of sample size, i.e. power analysis - please include this.

      We apologize for the missing information on our a priori power analysis. As there is a lack of prior studies investigating within-subjects comparisons of placebo and nocebo effects that could inform precise effect size estimates for our research question, we based our calculation on the ability to detect small effects. Specifically, the study was powered to detect effect sizes in the range of d = 0.2 - 0.25 with α = .05 and power = .9, yielding a required sample size of N = 83-129. We have now added this information to the methods section of the revised manuscript.

      (4) "On day 8, one stimulus per stimulation intensity (i.e., VAS 40, 60 and 80) was applied before the start of the test session to re-familiarise participants with the thermal stimulation."

      What were the instructions about this? Was it before the electrode was applied? This runs the risk of unblinding participants, as they only expect to feel changes in stimulus intensity due to the TENS stimulation.

      We thank the reviewer for pointing out the potential risk of unblinding participants due to the re-familiarization process prior to the second test session. We would like to clarify that we followed specific procedures to prevent participants from associating this process with the experimental manipulation. The re-familiarisation with the thermal stimuli was conducted after the electrode had been applied and re-tested to ensure that both stimulus modalities were re-introduced in a consistent and neutral context. Participants were explicitly informed that both procedures were standard checks prior to the actual test session (“We will check both once again before we begin the actual measurement.”). For the thermal stimuli, we informed participants that they would experience three different intensities to allow the skin to acclimate (e.g., “...we will test the heat stimuli in 3 trials with different temperatures, allowing your skin to acclimate to the stimuli. …”), without implying any connection to the experimental conditions.

      Importantly, this re-familiarization procedure mirrored what participants had already experienced during the initial calibration session on day 1. We therefore assume that participants interpreted it as a routine technical step rather than part of the experimental manipulation. We have now clarified this procedure in the methods section of the revised manuscript.

      (5) "For a comparison of pain intensity ratings between time-points, an ANOVA with the within-subject factors Condition (placebo, nocebo, control) and Session (day 1, day 8) was carried out. For the comparison of placebo and nocebo effects between the two test days, an ANOVA with the with-subject factors Effect (placebo effect, nocebo effect) and Session (day 1, day 8) was used."

      It seems that one ANOVA is looking at raw pain scores and one is looking at difference scores, but this is a bit confusing - please rephrase/clarify this, and explain why it is useful to include both.

      We thank the reviewer for highlighting this point. Our primary analyses focus on placebo and nocebo effects, which we define as the difference in pain intensity ratings between the control and the placebo condition (placebo effect) and the nocebo and the control condition (nocebo effect), respectively.

      To examine whether condition effects were present at each time-point, we first conducted two separate repeated measures ANOVAs - one for day 1 and one for day 8 - with the within-subject factor CONDITION (placebo, nocebo, control).

      To compare the magnitude and persistence of placebo and nocebo effects over time, we then calculated the above-mentioned difference scores and submitted these to a second ANOVA with within-subject factors EFFECT (placebo vs. nocebo effect) and SESSION (day 1 vs. day 8). We have now clarified this approach on page 19 of the revised manuscript. To avoid confusion, the Condition x Session ANOVA has been removed from the manuscript.
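
The difference scores and the EFFECT x SESSION comparison described above can be illustrated with a minimal sketch. This is not the authors' analysis code: the ratings below are invented, and the function names are hypothetical. It relies on the standard result that, in a 2 x 2 fully within-subject design, the interaction F(1, n - 1) equals the squared paired t statistic computed on each participant's difference of differences.

```python
from math import sqrt
from statistics import mean, stdev


def effects(control, placebo, nocebo):
    """Return (placebo_effect, nocebo_effect) as defined in the response:
    control - placebo and nocebo - control, so both are positive when
    the manipulation worked in the expected direction."""
    return control - placebo, nocebo - control


def interaction_contrasts(day1, day8):
    """Per-participant difference of differences:
    (nocebo - placebo effect on day 1) - (nocebo - placebo effect on day 8)."""
    out = []
    for d1, d8 in zip(day1, day8):
        p1, n1 = effects(*d1)
        p8, n8 = effects(*d8)
        out.append((n1 - p1) - (n8 - p8))
    return out


def interaction_f(contrasts):
    """F(1, n - 1) for the 2 x 2 within-subject interaction, via t squared."""
    n = len(contrasts)
    t = mean(contrasts) / (stdev(contrasts) / sqrt(n))
    return t ** 2


# Hypothetical mean VAS ratings (0-100) per participant: (control, placebo, nocebo)
day1 = [(55, 45, 75), (60, 52, 78), (50, 44, 66), (58, 50, 80)]
day8 = [(54, 46, 68), (59, 53, 72), (51, 44, 62), (57, 51, 70)]

contrasts = interaction_contrasts(day1, day8)
f_value = interaction_f(contrasts)
print(contrasts, round(f_value, 2))
```

Working with one contrast per participant keeps the example dependency-free; a full repeated-measures ANOVA in a statistics package would additionally report the main effects of EFFECT and SESSION.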

      (6) Please can the authors provide a figure illustrating trial-by-trial ratings during test trials as well as during conditioning trials?

      In response to the reviewer’s point, we now provide the trial-by-trial ratings of the test phases on days 1 and 8 as an additional figure in the Supplement (Figure S1) and would like to clarify that trial-by-trial pain intensity ratings of the conditioning phase are displayed in Figure 2C of the manuscript.

      (7) "Separate multiple linear regression analyses were performed to examine the influence of expectations (GEEE ratings) and experienced effects (VAS ratings) on subsequent placebo and nocebo effects. For day 1, the placebo effect was entered as the dependent variable and the following variables as potential predictors: (i) expected improvement with placebo before conditioning, (ii) placebo effect during conditioning and (iii) the expected improvement with placebo before the test session at day 1"

      The term "placebo effect during conditioning" is a bit confusing - I believe this is just the effect of varying stimulus intensities - please could the authors be more explicit on the terminology they use to describe this? NB changes in pain rating during the conditioning trials do not count as a placebo/nocebo effect, as most of the change in rating will reflect differences in stimulation intensity.

      We agree with the reviewer that the cited paragraph refers to the actual application of lower or higher pain stimuli during the conditioning session, rather than genuinely induced placebo or nocebo effects. We thank the reviewer for this helpful observation and have revised the terminology accordingly, now referring to these as “pain relief during conditioning” and “pain worsening during conditioning”.

      (8) Supplementary materials: "The three temperature levels were perceived as significantly different (VAS ratings; placebo condition: M= 32.90, SD= 16.17; nocebo condition: M= 56.62, SD= 17.09; control condition: M= 80.84, SD= 12.18"

      This suggests that the VAS rating for the control condition was higher than for the nocebo condition. Please could the authors clarify/correct this?

      We thank the reviewer for spotting this error. The values for the control and the nocebo condition had accidentally been swapped. This has now been corrected in the manuscript: control condition: M= 56.62, SD= 17.09; nocebo condition: M= 80.84, SD= 12.18.

      (9) "To predict placebo responses a week later (VAScontrol - VASplacebo at day 8), the same independent variables were entered as for day 1 but with the following additional variables (i) the placebo effect at day 1 and (ii) the expected improvement with placebo before the test session at day 8."

      Here it would be much clearer to say "pain ratings during test trials at day 1".

      We agree with the reviewer and have revised the manuscript as suggested.

      (10) For completeness, please present the pain intensity ratings during conditioning as well as calibration/test trials in the figure.

      Please see our answer to comment (6).

      (11) In Figure 1a, it looks like some participants had rated the control condition as zero by day 8. If so, it's inappropriate to include these participants in the analysis if they are not responding to the stimulus. Were these the participants who were excluded due to pain insensitivity?

      On day 8, the lowest pain intensity ratings observed were VAS 3 in the placebo condition and VAS 2 in the control condition, both from the same participant. All other participants reported minimum values of VAS 11 or higher (all on a scale from 0-100). Thus, no participant provided a pain rating of VAS 0, and all ratings indicated some level of pain perception in response to the stimulus. We did not define an exclusion criterion based on day 8 pain ratings in our preregistration, and we did not observe any technical issues with the stimulation procedure. To avoid post-hoc exclusions and maintain consistency with our preregistered analysis plan, we therefore decided to include all participants in the analysis.

      (12) "Comparison of day 1 and day 8. A direct comparison of placebo and nocebo effects on day 1 and day 8 pain intensity ratings showed a main effect of Effect with a stronger nocebo effect (F(1,97)= 53.93, p< .001, η2= .36) but no main effect of Day (F(1,97)= 2.94, p= .089, η2 = .029). The significant Effect x Session interaction indicated that the placebo effect and the nocebo effect developed differently over time (F(1,97)= 3.98, p= .049, η2 = .039)"

      This is confusing as it talks about a main effect of "day" and then interaction with "session" - are they two different models? The authors need to clarify.

      We thank the reviewer for pointing this out. In our analysis, “Session” is the correct term for the experimental factor, which has two factor levels, “day 1” and “day 8”. This has now been corrected in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      (1) More information on how "size of the effect" in Figures 1b and 2b was calculated is needed; this can be in the legend. If these are differences between control and each condition, then they were reversed for one condition (nocebo?), which is ok - but this should be clearly explained.

      We agree with the reviewer and have now revised the figure legends to improve clarity. The legends now read:

      1b: “Figure 1. Pain intensity ratings and placebo and nocebo effects during calibration and test sessions. (A) Mean pain intensity ratings in the placebo, nocebo and control condition during calibration, and during the test sessions at day 1 and day 8. (B) Placebo effect (control condition - placebo condition, i.e., positive value of difference) and nocebo effect (nocebo condition - control condition, i.e., positive value of difference) on day 1 and day 8. Error bars indicate the standard error of the mean, circles indicate mean ratings of individual participants. ***: p < .001, **: p < .01, n.s.: non-significant.”

      2b: “Figure 2. Mean and trial-by-trial pain intensity ratings, placebo and nocebo effects during conditioning. (A) Mean pain intensity ratings of the placebo, nocebo and control condition during conditioning. (B) Placebo effect (control condition - placebo condition, i.e., positive value of difference) and nocebo effect (nocebo condition - control condition, i.e., positive value of difference) during conditioning. (C) Trial-by-trial pain intensity ratings (with confidence intervals) during conditioning. Error bars indicate the standard error of the mean, circles indicate mean ratings of individual participants. ***: p < .001.”

      (2) In the methods, I was missing a clear understanding of how many trials there were in the conditioning phase, and then how many in the other testing phases. Also, how long did the experiment last in total?

      We apologize that the exact number of trials in the testing phases was not clear in the original manuscript. We now indicate on page 18 of the revised manuscript that we used 10 trials per condition in the test sessions. We have also added information on the duration of each test day (i.e., three hours on day 1 and one hour on day 8) on page 15.

      (3) In expectancy ratings, line 186 - are improvement and worsening expectations different from expected pain relief? It is implied that these are two different constructs - it would be helpful to clarify that.

      We agree that this is indeed confusing and would like to clarify that both refer to the same construct. We used the Generic rating scale for previous treatment experiences, treatment expectations, and treatment effects (GEEE questionnaire, Rief et al. 2021) that discriminates between expected symptom improvement, expected symptom worsening, and expected side effects due to a treatment. We now use the terms “expected pain relief” and “expected pain worsening” throughout the whole manuscript.

      (4) In the last section of the Results, somatosensory amplification comes out of nowhere - and could be better introduced (see point 2 above).

      We agree with the reviewer that introducing the concept of somatosensory amplification and its potential link to placebo/nocebo effects only in the Methods is unhelpful, given that this section appears at the end of the manuscript. We therefore now introduce the relevant publication (Doering et al., 2015) before reporting our findings on this concept.

      (5) In line 169, if the authors want to specify what portion of the variance was explained by expectancy, they could conduct a hierarchical regression, where they first look at R2 without the expectancy entered, and only then enter it to obtain the R2 change.

      We fully agree that hierarchical regression can be a useful approach for isolating the contribution of variables. However, in our case, expectancy was assessed at different time points (e.g., before conditioning and before the test session on day 1), and there was no principled rationale for determining the order in which these different expectancy-related variables should be entered into a hierarchical model.

      That said, in response to the reviewer’s suggestion, we have now conducted hierarchical regression analyses in which all expectancy-related variables were entered together as a single block. These analyses largely confirmed the findings reported so far and are provided below in this response. Given the exploratory nature of this grouping and the lack of an a priori hierarchy, we feel that the standard multiple regression models remain the most appropriate for addressing our research question because they allow us to evaluate the total contribution of expectancy-related predictors while also examining the individual contribution of each variable within the block. We would therefore prefer to retain these as the primary analyses in the manuscript.

      Results of the hierarchical regression analyses:

      Day 1 - Placebo response: In step 1, we entered the difference in pain intensity ratings between the control and the placebo condition during conditioning as a predictor. In step 2, we added the two variables reflecting expectations (i.e., expected improvement with placebo (i) before conditioning and (ii) before the test session on day 1). This allowed us to assess whether expectation-related variables explained additional variance beyond the effect of conditioning.

      The overall regression model at step 1 was significant, F(1, 102) = 13.42, p < .001, explaining 11.6% of the variance in the dependent variable (R² = .116). Adding the expectancy-related predictors in step 2 did not lead to a significant increase in explained variance, ΔR² = .007, F(2, 100) = 0.384, p = .682. Thus, the conditioning response significantly predicted placebo-related pain reduction on day 1, but additional information on expectations did not account for further variance.

      Day 1 - Nocebo response: The equivalent analysis was run for the nocebo response on day 1. In step 1, the pain intensity difference between the nocebo and the control condition was entered as a predictor before adding the two expectancy ratings (i.e., expected worsening with nocebo (i) before conditioning and (ii) before the test session on day 1).

      In step 1, the regression model was not statistically significant, F(1, 102) = 2.63, p = .108, and explained only 2.5% of the variance in nocebo response (R² = .025). Adding the expectation-related predictors in step 2 slightly increased the explained variance by ΔR² = .027, but this change was also non-significant, F(2, 100) = 1.41, p = .250. The overall variance explained by the full model remained low (R² = .052). These results suggest that neither conditioning nor expectation-related variables reliably predicted nocebo-related pain increases on day 1.

      Day 8 - Placebo response: For the prediction of the placebo effect on day 8, the following variables reflecting perceived effects were entered as predictors in step 1: the difference in pain intensity ratings between the control and the placebo condition (i) during conditioning and (ii) on day 1. In step 2, the variables reflecting expectations were added: the expected improvement with placebo (i) before conditioning, (ii) before the test session on day 1 and (iii) before the test session on day 8.

      In step 1, the model was statistically significant, F(3, 95) = 14.86, p < .001, explaining 23.8% of the variance in the placebo response (R² = .238, Adjusted R² = .222). In step 2, the addition of the expectation-related predictors resulted in a non-significant improvement in model fit, ΔR² = .051, F(3, 92) = 2.21, p = .092. The overall variance explained by the full model increased modestly to 29.0%.

      Day 8 - Nocebo response: For the equivalent analyses of nocebo responses on day 8, the following variables were included in step 1: the difference in pain intensity ratings between the nocebo and the control condition (i) during conditioning and (ii) on day 1. In step 2, we entered the variables reflecting nocebo expectations including expected worsening with nocebo (i) before conditioning, (ii) before the test session on day 1 and (iii) before the test session on day 8. In step 1, the model significantly predicted the day 8 nocebo response, F(3, 95) = 6.04, p = .003, accounting for 11.3% of the variance (R² = .113, Adjusted R² = .094). However, the addition of expectation-related predictors in step 2 resulted in only a negligible and non-significant improvement, ΔR² = .006, F(3, 92) = 0.215, p = .886. The full model explained just 11.9% of the variance (R² = .119).
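
For readers unfamiliar with the step 1 versus step 2 comparison, the following sketch shows the standard F-change test for nested regression models. It is an illustration under stated assumptions rather than the authors' code: the function name `f_change` is hypothetical, and the inputs are the rounded R² values quoted above, so the result only approximates the reported statistic.

```python
def f_change(r2_reduced, r2_full, n_added, df_residual):
    """F statistic for the increase in R-squared when n_added predictors
    are added to a nested linear model.

    df_residual is the residual degrees of freedom of the full model
    (n - number of predictors in the full model - 1).
    """
    delta_r2 = r2_full - r2_reduced
    return (delta_r2 / n_added) / ((1.0 - r2_full) / df_residual)


# Day 1 nocebo response, using the rounded values from the response:
# step 1 R-squared = .025, full model R-squared = .052, 2 predictors added, df = 100
f_nocebo_day1 = f_change(0.025, 0.052, n_added=2, df_residual=100)
print(round(f_nocebo_day1, 2))
```

With these rounded inputs the statistic comes out at roughly 1.4, in line with the reported F(2, 100) = 1.41; the small discrepancy is expected because the published R² values are rounded to three decimals.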

      Typos:

      (6) Abstract - 104 heathy xxx (word missing).

      (7) Line 61 - reduce or decrease - I think you meant increase.

      Thank you, we have now corrected both sentences.

      References

      Colloca L, Petrovic P, Wager TD, Ingvar M, Benedetti F. How the number of learning trials affects placebo and nocebo responses. Pain. 2010

      Doering BK, Nestoriuc Y, Barsky AJ, Glaesmer H, Brähler E, Rief W. Is somatosensory amplification a risk factor for an increased report of side effects? Reference data from the German general population. J Psychosom Res. 2015

    1. Kasirzadeh’s account of accumulative risk still relies on threat actors such as cyberattackers to a large extent, whereas our concern is simply about the current path of capitalism. And we think that such risks are unlikely to be existential, but are still extremely serious

      so not so much about a single Superintelligent AI, as society gradually drowning in AI enshittification. it may not be existential to society but it still really sucks

    1. Some design scholars have questioned whether focusing on people and activities is enough to account for what really matters, encouraging designers to consider human values (Friedman, B., & Hendry, D. G. (2019). Value sensitive design: Shaping technology with moral imagination. MIT Press). For example, instead of viewing a pizza delivery app as a way to get pizza faster and more easily, we might view it as a way of supporting the independence of elderly people who do not have the mobility to pick up a pizza on their own. Or, perhaps more darkly, instead of viewing TSA screening at an airport as a way of identifying potential terrorists, we consider it through the value of power, as the screening process had more to do with maintaining political power in times of fear than it did with actually preventing terrorism. This shift in framing can enable designers to better consider the values of design stakeholders through their design process, and identify people they may not have designed for otherwise (e.g., people who are housebound because of injury, or politicians).

      This section got me reflecting on the degree to which human values should be weighed against people and activities. As I see it, people and activities (and systems) should be the main focus whenever one is designing. Shifting the focus to something as subjective as "human values" risks sacrificing resources that could otherwise go toward a people- and activity-focused design. Overall, I think that encouraging the consideration of subject matters like these may end up wasting resources.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Reviewer):

      It is not clear from the analysis presented in the paper how persistent those environmentally induced changes are: do they remain with the bats until the end of their lives?

      Currently, the long-term effects of enrichment on the bats remain uncertain. Preliminary results suggest that these differences may persist throughout the bats’ lifetimes; however, further data analysis is ongoing to determine the extent of these effects. We now also address this in the Discussion of the manuscript.

      Reviewer #2 (Public Reviewer):

      (1) Assessing personality metrics and the indoor paradigm: While I applaud this effort and think the metrics used are justified, I see a few issues in the results as they are currently presented:

      (a) [Major] I am somewhat concerned that here, the foraging box paradigm is being used for two somewhat conflicting purposes: (1) assessing innate personality and (2) measuring changes in personality as a result of experience. If the indoor foraging task is indeed meant to measure and reflect both at the same time, then perhaps this can be made more explicit throughout the manuscript. In this circumstance, I think the authors could place more emphasis on the fact that the task, at later trials/measurements, begins to take on the character of a "composite" measure of personality and experience.

      Personality traits should generally be stable over time, but personality can also somewhat change with experience. We used the foraging box to assess individual personality, but we also examined the assumption that what we are measuring is a proxy of personality and hence is stable over time. We now clarify this in the manuscript. 

      (b) [Major] Although you only refer to results obtained in trials 1 and 2 when trying to estimate "innate personality" effects, I am a little worried that the paradigm used to measure personality, i.e. the stable components of behavior, is itself affected by other factors such as age (in the case of activity, Fig. 1C3, S1C1-2), the environment (see data re trial 3), and experience outdoors (see data re trials 4/5).

      We found that boldness was the most consistent trait, persisting from trial 1 to trial 5, i.e., 144 days apart on average. We thus also used boldness as the primary parameter for assessing the effects of personality on outdoor behavior. While we evaluated other traits for completeness, boldness was the only one that consistently met the criteria for personality, which is why we focused on it in our analyses. The other traits, which were not stable over time, could be used to assess the effects of experience on behavior.

      Ideally, a study that aims to disentangle the role of predisposition from early-life experience would have a metric for predisposition that is relatively unchanging for individuals, which can stand as a baseline against a separate metric that reflects behavioral differences accumulated as a result of experience.

      I would find it more convincing that the foraging box paradigm can be used to measure personality if it could be shown that young bats' behavior was consistent across retests in the box paradigm prior to any environmental exposure across many baseline trials (i.e. more than 2), and that these "initial settings" were constant for individuals. I think it would be important to show that personality is consistent across baseline trials 1 and 2. This could be done, for example, by reproducing the plots in Fig. 1C1-3 while plotting trial 1 against trial 2. (I would note here that if a significant, positive correlation were to be found (as I would expect) between the measures across trial 1 and 2, it is likely that we would see the "habituation effect" the authors refer to expressed as a steep positive slope on the correlation line (indicating that bold individuals on trial 1 are much bolder on trial 2).)

      We agree and thus used boldness, which was found to be stable over five trials (three of which were without external experience). We note that even though boldness, as we measured it, increased over time, the differences between individuals remained similar, which is what is expected from personality traits measured several times in the same paradigm (after the animal acquires experience).

      (c) Related to the previous point, it was not clear to me why the data from trial 2 (the second baseline trial) was not presented in the main body of the paper, and only data from trial 1 was used as a baseline.

      We added a main figure showing the correlation between the two baseline trials.

      In the supplementary figure and table, you show that the bats tended to exhibit more boldness and exploratory behavior, but fewer actions, in trial 2 as compared with trial 1. You explain that this may be due to habituation to the experimental setup, however, the precise motivation for excluding data from trial 2 from the primary analyses is not stated. I would strongly encourage the authors to include a comparison of the data between the baseline trials in their primary analysis (see above), combine the information from these trials to form a composite baseline against which further analyses are performed, or further justify the exclusion of data as a baseline.

      We had no intention of excluding data from baseline 2. As we have shown several times before (e.g., Harten, 2021), bats’ boldness, as we measure it in the box experiment, increases over sessions performed close together in time. This means that boldness in trial 2 was higher than in trials 1 and 3, which made the data less suitable for a linear model. Moreover, our measure of boldness is capped (with a maximum of 1), again making it less suitable for a linear model. However, following the reviewer’s question, we have now run all analyses with the trial 2 data included: not only did the results remain the same, but some of the models also fit better (based on the AIC). We have added this information to the revised manuscript.

      (2) Comparison of indoor behavioral measures and outdoor behavioral measures

      Regarding the final point in the results, correlation between indoor personality on Trial 4 and outdoor foraging behavior: It is not entirely clear to me what is being tested (neither the details of the tests nor the data or a figure are plotted). Given some of the strong trends in the data - namely, (1) how strongly early environment seems to affect outdoor behavior, (2) how strongly outdoor experience affects boldness, measured on indoor behavior (Fig. 1D) - I am not convinced that there is no relationship, as is stated here, between indoor and outdoor behavior. If this conclusion is made purely on the basis of a p-value, I would suggest revisiting this analysis.

      We agree that the relationship between indoor personality measures and outdoor foraging behavior is of great interest and had expected to find some correspondence between the two. To test this, we conducted multiple GLM analyses using the different indoor behavioral traits as predictors of outdoor behaviors. These analyses did not reveal any significant correlations. We also performed a separate analysis using PC1 (derived from the indoor behavioral variables) as a predictor, and again found no significant associations with outdoor behavior.

      We were indeed surprised by this outcome. It is possible that the behavioral traits we assessed indoors (boldness, exploration, and activity) do not fully capture the dimensions of behavior that are most relevant to foraging in the wild. For example, traits such as neophobia or decision-making under risk, which we did not assess directly, may have had stronger predictive value for outdoor behavior. We now highlight this point more clearly in the Discussion and acknowledge the possibility that alternative or additional personality traits might have revealed meaningful relationships.

      (3) Use of statistics/points regarding the generalized linear models

      While I think the implementation of the GLMM models is correct, I am not certain that the interpretation of the GLMM results is entirely correct for cases where multivariate regression has been performed (Tables 4s and S1, and possibly Table 3). (You do not present the exact equation you used for each model (this would be a helpful addition to the methods), therefore it is somewhat difficult to evaluate if the following critique properly applies, however...)

      The "estimate" for a fixed effect in a regression table gives the difference in the outcome variable for a 1 unit increase in the predictor variable (in the case of numeric predictors) or for each successive "level" or treatment (in the case of categorical variables), compared to the baseline, the intercept, which reflects the value of the outcome variable given by the combination of the first value/level of all predictors. Therefore, for example, in Table 4a - Time spend outside: the estimate for Bat sex: male indicates (I believe) the difference in time spent outside for an enriched male vs. an enriched female, not, as the authors seem to aim to explain, the effect of sex overall. Note that the interpretation of the first entry, Environmental condition: impoverished, is correct. I refer the authors to the section "Multiple treatments and interactions" on p. 11 of this guide to evaluating contrasts in G/LMMS: https://bbolker.github.io/mixedmodelsmisc/notes/contrasts.pdf

      We are not certain we fully understand the comment; however, if our understanding is correct, we respectfully disagree. A GLM analysis without interaction terms, as conducted in our study, functions as a multiple linear regression, wherein each factor's estimate reflects its individual effect on the dependent variable. For example, in the case of sex, it examines the effect of sex on the time spent outside independently of enrichment. An interaction term would be needed to test sex*enrichment. We have added the models’ formulas, and we hope this clarifies our approach.
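
The point under discussion can be illustrated with a small sketch. Under treatment (dummy) coding, an additive model gives the same male-minus-female difference at every level of the other factor (on the scale of the linear predictor), whereas an interaction term makes that difference depend on the environmental condition. All coefficient values and function names below are hypothetical placeholders chosen for illustration; they are not estimates from the study.

```python
# Hypothetical coefficients for an additive model (no interaction term):
#   outcome = B0 + B_ENV * impoverished + B_SEX * male
B0, B_ENV, B_SEX = 5.0, -2.0, 0.5  # placeholder values, not study estimates


def predict(impoverished: int, male: int, b_interaction: float = 0.0) -> float:
    """Linear predictor under treatment (dummy) coding of both factors."""
    return (B0 + B_ENV * impoverished + B_SEX * male
            + b_interaction * impoverished * male)


def sex_effect(impoverished: int, b_interaction: float = 0.0) -> float:
    """Male minus female prediction within one environmental condition."""
    return (predict(impoverished, 1, b_interaction)
            - predict(impoverished, 0, b_interaction))


# Additive model: the difference equals B_SEX in both conditions.
additive = [sex_effect(0), sex_effect(1)]
# With an interaction term, the difference changes with the condition.
interacting = [sex_effect(0, b_interaction=1.0),
               sex_effect(1, b_interaction=1.0)]
print(additive, interacting)
```

In the additive case the sex estimate can be read either as "enriched male vs. enriched female" or as "the effect of sex irrespective of enrichment", because the two coincide by construction; they diverge only once an interaction is fitted.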

      Reviewer #1 (Recommendations for the authors):

      I would recommend the following:

      (1) As video tracking and behavioral analysis software is widespread, it would be great to see it applied to the indoor bat behavior to answer questions such as: how do bat velocity, heading, or acceleration correlate with the behavioral measures of boldness, activity, or exploration? In the same vein, can one infer boldness, activity, or exploration from measured bat velocity or other parameters? I think this would make the indoor behavior more quantitative.

      In a tent of the size used in our study, bats’ flight behavior tends to be highly stereotypical: they typically perch on the wall, take off, circle the tent (sometimes multiple times), and then either land or not, and enter or not. Flight velocity is largely determined by individual maneuverability and the physical constraints of the space; thus, precise tracking is unlikely to provide further insight into boldness. In contrast, decision-making behaviors, such as whether to land or enter, more accurately reflect personality traits, as we have shown previously (Harten et al., 2018). Moreover, accurate 3D tracking in such an environment is possible but far from easy, owing to the many blind spots that result from the cameras being inside the 3D volume. Nonetheless, we quantified flight activity and assessed its correlation with the other behavioral axes. As it was highly correlated with general activity, we did not include it as an independent parameter in the main analysis. However, in response to the reviewer’s suggestion, we now present this analysis in the Supplementary Materials.

      (2) It is not clear whether the bats come from the same genetic background. They might, but it is not mentioned in the Methods under the experimental subjects.

      We have shown in the past that there are no familial relations in a randomly caught sample of bats in the colony where we usually work (Harten et al., 2018). The bats were caught in three unrelated wild colonies. The text referring to the table was clarified in the revised manuscript.

      (3) It will be great to include the author's thoughts about mechanisms underlying those environmentally induced changes in behavior in the discussion section along with how this will affect the bats' social foraging abilities. Another question that comes to mind is whether growing up with a large number of bats constitute an enriched environment in itself.

      We agree that this could count as an enrichment, and we thus ensured similar group sizes in both groups for this reason. We clarify this in the revised manuscript. 

      We have elaborated on the underlying mechanisms in the discussion, focusing on how they contribute to behavioral changes.

      Reviewer #2 (Recommendations for the authors):

      (1) Outdoor foraging behavior

      If I understand correctly, the data you display in Fig. 3A is only from the 2nd to 3rd weeks of exploration, i.e. just before the first post-exploration trial.

      What does the data look like for the second outdoor exploration data, i.e. before the final trial?

      Is there a specific reason why these measures were only computed on the GPS data from the 3rd week outside? If so, can this sampling of the data be motivated or briefly addressed (in the methods and wherever else necessary)?

      To allow a comparison between individuals, we had to restrict ourselves to a period for which we had data from many individuals (some disappeared later on).

      Following the reviewer’s suggestion, we added a supplementary figure including days 21-26.

      I would find it important and of great interest to see movement maps for more animals, as these give very rich information that is not entirely captured by the three proxies of outdoor activity.

      Are these four exemplary animals sampled from both seasons?

      Did you check to see if there were any overall differences in outdoor foraging behavior as a function of the season in which the bats were captured?

      Yes, the samples represent individuals from both tested years. This was clarified, and additional examples were included in a supplementary figure.

      Variable of time spent outdoors: You mention that you did not include the nights that the bat spent in the colony in these calculations. Did you also look to see if 'the number of nights when the bats left the colony' predicted the bat's earlier enrichment treatment? This could also be interesting to consider.

      In response to the reviewer’s comment, we conducted an additional analysis to test whether the proportion of nights each bat spent foraging outside the roost was predicted by its earlier environmental condition (enriched vs. impoverished). We also examined whether sex or age influenced this variable. This analysis showed no significant effect of environmental condition, sex, or age on the proportion of nights spent foraging outside the roost.

      [Following on point 3 in public review...]

      When wishing to discuss the effect/significance of predictors overall, it is common to present the modelling results as an analysis of variance table. See, for example, the two-way anova section (p. 182) in the book Practical Regression and ANOVA using R: https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf

      I think the output of passing the model object to an "anova" yields the table that you may be looking for, where the variance accounted for by a predictor is given overall, and not just relative to the first level of all predictors. Naturally, this information can be used in combination with the information provided by the raw model output presented in the paper.

      I assume you have done this analysis in R, but am not sure, as the statistical software used is not mentioned. There are several packages in R that allow users to quickly plot the graphical interaction of the parameters they use in models, which aids in interpreting results. It would be good to check results of model fitting in this manner.

      Relatedly, I was unable to locate the data and code for this paper using the DOI provided. Neither searching the internet using the doi nor entering the doi on the Mendeley Data website returned the right results. I tried searching Mendeley Data using the senior author's last name, but the most recent entry does not appear to be from this paper. https://data.mendeley.com/datasets/fr48bmnhxj/1

      We thank the reviewer for the helpful comment. The analysis was indeed conducted in MATLAB, and this has now been clarified in the manuscript. We have also revised the result tables to improve clarity and included the exact formulas used for each model. Regarding the data availability, the reviewer is correct — the dataset had not yet been published at the time of submission. It is now available at the provided DOI link.

      ### Suggestions and questions for the present paper, grouped thematically:

      [Major] Expansion and development of results: I thought there were many interesting and suggestive points in this data that could be expanded upon. I mention some of these here. While the authors of course do not need to implement all of these suggestions, I think the paper would benefit from a more substantial presentation of this rich data set:

      (a) Individual differences as such are not emphasized in the paper so much, as the analyses, particularly those expressed as boxplots, are grouped. The scatter plots in Figure 1 give the richest insight into how individual behavior changes throughout the course of the experiment. I would advocate for the authors to show additional comparisons using such scatter plots (perhaps in the supplementary, if needed).

      We thank the reviewer and have added scatter plots to Figure 2.

      (b) In the second paragraph of the results, the authors introduce the concept of a pareto front and that of personality archetypes (lines 101-107). I found this very interesting, but these concepts were never reiterated upon later in the results or in the discussion. In fact, at many points, I found myself curious as to how the three indoor measures of personality might be combined to form a composite measure of personality (and likewise for outdoor measures). Have you tried to combine measures into a composite and tried to measure whether this composite metric provides any additional insight into these phenomena? For example, what if you mapped the starting position of each bat as a point in a three-dimensional space, given by the three personality measures, and then evaluated their trajectory through this space with measurements taken at later trials. Could innate personality be interpreted as the starting vector in this space (measured across the two baseline trials)? 

      Following the reviewer’s (justified) curiosity we ran a PCA analysis on the behavioral data from trials 1 and 5 and found that there is a significant correlation between the individual scores on PC1. This can be thought of as a measurement that takes both boldness and exploration into account (the weight of activity was very low). We added this information to the revised manuscript and also use this new behavioral parameter as a predisposition in the models (instead of exploration and activity). 
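
As a rough illustration of how a PC1 score can serve as a composite behavioral measure, the sketch below standardizes three traits, finds the first principal component by power iteration, and projects each individual onto it. The data are invented and deliberately constructed so that activity is uncorrelated with the other two traits; the function name and the use of power iteration are assumptions for the example and do not describe the authors' implementation.

```python
import math


def first_pc(rows):
    """Return (loadings, scores) of PC1 for z-scored columns of `rows`."""
    n, p = len(rows), len(rows[0])
    # Z-score each trait so that traits on different scales are comparable.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    # Correlation matrix of the traits.
    corr = [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * p
    for _ in range(500):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each individual's PC1 score is the projection of its z-scores onto v.
    scores = [sum(zr[j] * v[j] for j in range(p)) for zr in z]
    return v, scores


# Invented data, one row per bat: (boldness, exploration, activity).
bats = [
    (0.1, 0.2, 3.0), (0.3, 0.3, 1.0), (0.5, 0.4, 2.0),
    (0.6, 0.7, 1.0), (0.8, 0.8, 3.0), (1.0, 0.9, 2.0),
]
loadings, scores = first_pc(bats)
print([round(x, 3) for x in loadings])
```

With these invented numbers, PC1 loads almost equally on boldness and exploration and barely on activity, so each bat's score behaves like a combined boldness-exploration index that can be compared across trials.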

      Could environmental exposure be quantified as a warping of the trajectory through this space? Finally, could outdoor experience also be incorporated to evaluate how an individual arrives at its final measurement of personality combined with experience (trial 5)?

      The paper currently tries to explain outdoors behavior given personality and not vice versa. While this is a very interesting suggestion, we feel that adding this analysis would make the premise of the paper less clear and since the paper is already somewhat complex, we prefer to leave this analysis for a future study. 

      Examining the 3D trajectories of the individuals through the personality space did not reveal any immediately clear pattern (Author response image 1; triangles mark the first trial and colours depict the environmental treatment).

      Author response image 1.

      Related to this point: I think the strongest part of the paper is the result showing that bats exposed to enriched environments explore farther, more often, and over larger distances than bats that were raised in an impoverished environment.

      We completely agree and have tried to emphasize this further.

      (c) While these results of the outdoor GPS tracking are very clear, I wish that more information were extracted from the tracking data, which is incredibly rich and certainly can be used to derive many interesting parameters beyond those that the authors have shown here. Examples might include: distance travelled (as opposed to estimated km2 or farthest point), a metric of navigational ability (how much "dead reckoning" the animal engages in). I even wonder if the areas or landmarks visited by the enriched bats might be found to be more complex, challenging, or richer by some measure.

      This study was a first step, aiming to establish a connection between early exposure and outdoor foraging.

      We agree that many more analyses can be done and, indeed, that analyses related to navigation capabilities are missing. We are still collecting data on these bats and hope to present a more advanced analysis spanning several years.

      (d) Related to the above point: I find it very interesting that in 3 of the 4 bats for which you show exemplary movement data (Fig. 3, panels B and C), they appear to travel to the farthest distances and cover the most ground early on, and become more "conservative" in their flight paths on later evenings. This point is not explored in the discussion, nor related to earlier measurements.

      During the first months of exploration, bats will occasionally perform long exploratory flights in between bouts of shorter flights in which they return to nearby familiar trees. This behavior can be seen in more detail in Harten et al. (Science, 2020). We are currently quantifying this more carefully for another study.

      (e) Finally, my points about the possible strength of a composite measure of the three personality metrics is related to my concern about one of the conclusions, which is that innate personality does not have an effect on outdoor foraging behavior. I think the manner in which this was tested statistically is likely to bias the results against finding such a result given that personality metrics are used to predict outdoor behaviors in an individual manner (6 models in total, each examining a single comparison of predisposition to outdoor behavior), while both indoor personality metrics (Fig 1B) and outdoor behaviors appear to be correlated with each other (Table 5).

      Are there other analyses you have performed that are not presented in the paper and that have led you to conclude that there is no relationship here?

      We agree with the reviewer that our findings do not exclude an effect of innate personality on foraging but only suggest no such effect for the parameters we measured. That said, we did expect to find an effect of boldness, because this parameter has been shown to differ substantially between groups (Harten et al., 2018) and to correlate with other parameters of behavior. We were therefore surprised to find no significant effects.

      Following the reviewer’s previous comment we now also tested another predisposition parameter – the PC1 score and also found that it did not explain foraging. 

      (f) Personality measured before and after early environmental exposure (related to point (a) above): I find it interesting that the positive correlation in boldness between baseline and post-enrichment or baseline and post-release suggests that the individuals that were the most bold remained bold (and likewise for less adventurous individuals). The correlation for activity, too, still suggests that more active individuals early in life are likely to remain very active after enrichment, even accounting for the fact that activity is confounded with age.

      Perhaps you could place some emphasis on the fact that the initial variation between individuals also appears to be relatively stable over repeated trials. You might also consider measuring this directly (population variance over successive trials; relationship of population variance on indoor measures vs. outdoor measures...)

      Yes, this is a main point of interest. We emphasize it further in the revised manuscript.

      (g) Effect of indoor behavior following early experience on outdoor behavior: You evaluate the effect of predisposition (measured on baseline trial 1) and environmental condition on measures of outdoor activity (Table 4). I wonder if you also tried using indoor behavioral measures measured on the post-enrichment trial 3 to predict outdoor foraging behavior.

      Assuming that these measures are in fact reflecting a combination of predisposition and accumulated experience, then measurements at this closer time point may tell you how the combination of innate traits and early acquired experience affect behavior in the wild.

      We appreciate the reviewer’s insightful suggestion to test whether indoor behavior from post-enrichment Trial 3, reflecting both innate traits and experience, predicts outdoor foraging behavior. We conducted this analysis, but found that the boldness in Trial 3 did not significantly predict any of the outdoor activity measures.

      (2) [Minor] Age/development: While the authors discuss the effect of their manipulations on behavioral measures, they do not much discuss the effect of age.

      I think it would be important to include at some point a mention of the developmental stages of Rousettus, giving labels to certain age ranges, e.g. pup, juvenile, adult, and to provide more context about the stages at which bats were tested in the discussion. Presently, age is only really mentioned as an explanation for declining activity levels, but I wonder if it might also have an influence on boldness.

      It would also be very elegant, for figures where age is given in days, to additionally label them with these stages.

      All bats were juveniles during the trials (approximately 4 to 8 months old), so they could not be divided into distinct age groups. To assess the effect of age, it was included as a predictor (in days) in the GLM analysis.

      (3) [Major] Effect of early experience and outdoor experience on the indoor task: In the paragraph on lines 278-285, you argue that the effect of seeing early-enriched bats exhibit more boldness in trial 5 was likely due to post-sampling bias...

      I tend to disagree with this conclusion. I actually find this result both interesting and intuitive - that bats that were exposed to an enriched environment and have had experience in the wild, show much bolder activity on a familiar indoor foraging test (i.e. outside experience has made the animals bolder than before) (Fig 1, lines 159-161, Fig. S1). I did not notice this possibility mentioned in the discussion of the results.

      I also do not fully understand this argument. Could you please explain further?

      We accept the reviewer's comment and have updated the manuscript (lines 336-346), explaining the two hypotheses more clearly and arguing that it is difficult to tell them apart with the current data.

      [Minor] You also say that "this difference... can be seen in Figure 2 when examining only the bats that had remained until the last trial (Figure 2A2)." Do you mean supplementary Figure S1 A2? In fact, I am entirely unclear on what data is plotted in the supplementary Figure S1 and what differentiates the two columns of figures and the two models presented in the supplementary table. Did you plot data similar to that in Figure 2, with only bats that were present for all trials, but not show this data?

      There was a mistake: what was previously referred to as 2A2 is actually S2 A2.

      On the right side—only among the individuals with GPS data—the change is already evident at Baseline 2, where only the bolder individuals remain. If you have suggestions for a better analysis approach, we would be happy to hear them.

      ### Minor points

      General points regarding figures:

      For Figures 2 and 3A1-3 (as well as Fig. S1): Authors must show the raw data points over the box plots. It is very difficult to interpret the data and conclusions without being able to see the true distribution.

      Done

      For all figures showing grouped individual data, please annotate all panels or sets of boxplots with the number of bats whose data entered into each, as it is a little difficult to keep track of the changing sample sizes across experimental stages.

      To enhance transparency, we have added individual data points to all boxplots, allowing visual estimation of sample sizes across experimental stages. While numerical annotations are not included on the figures, the exact number of bats contributing to each group is provided in the Methods section (Table 8), ensuring this information is readily accessible to readers.

      Unless I've missed the reason behind differences in axis labelling across the figures, it seems that trials are not always referred to consistently. E.g. Fig. 1 labels say "Trial 1 (baseline)" and fig. 2 labels say "Baseline 1 0 days." I'm not entirely sure if these correspond to exactly the same data. If so, perhaps the labels can be made uniform. I think the descriptive ones (Baseline 1, Post-enrichment...) may be more helpful to the reader than providing the trial number (Trial 1, etc....).

      Done

      Figure 1:

      Very good Fig. 1A and 1B.

      For panels C1-3 & D, I think it would make it easier for the reader if the personality measure labels were placed at the top of each panel, e.g. "Boldness (entrance proportion)". The double axis labels are not only harder to read, they are also redundant, as the personality measure label repeats on both axes.

      Done

      Panel C1: For the first panel in this sequence, I think it would be elegant to include an annotation in the figure that indicates what the datapoints lying on either side of the dashed line means, i.e. "bolder after enrichment treatment" in the upper left corner, and "bolder before enrichment treatment" in the bottom right corner.

      Panel C2: It appears as though many of the data points in this panel overlap, and it appears to me that the blue data points in particular are overlaid by the orange ones. I am guessing this happens because proportion values based on entrances to only 6 boxes end up giving a more "discrete" looking distribution. I wonder if you can find a way to allow all the data to be visible by, e.g., jittering the data slightly; if there is rounding being done to the proportions, perhaps don't round them so that minute differences will allow them to escape the overlap; or possibly split the panel by enrichment treatment.

      Caption for C1-3: it may be helpful to mention the correlation line color scheme: "enriched (blue lines), the impoverished (orange lines)". The caption also says positive correlations were found for "both environments together," but this correlation line is not shown. Perhaps mention "(not shown)" or show line. Please rephrase the sentence "Dashed line represents the Y=X line." for more transparency and clarity. I understand you mean an "equality" or "unity" line, but perhaps you can explicitly state the information that this line provides, something like e.g. "Dashed line indicates equal values measured on both trials."

      We added the line for reference, and the caption was corrected.

      Figure 3:

      Panels B1-C2: I would suggest giving these panels supertitles that indicate that B panels are enriched, C panels are impoverished, and that each panel is data from a different individual.

      The legend was corrected to describe the figure more clearly.

      General points regarding tables:

      Please revisit tables for formatting and typos, particularly in Table 4. Please also revise table captions for clarity, e.g. "first exploration as predisposition" to "Exploration (Baseline 1)" or similar.

      Done

      Supplementary Tables and Figure: these are missing captions and explanations.

      The missing parts were added and corrected.

      Points of clarification/style:

      It would seem to me more logical to present the results shown in Table 3 before those in Table 2, given that the primary in-lab manipulation is discussed with relation to Table 3, and the analysis in Table 2 is discussed rather as a limitation (though I believe this result can be expanded upon further, see above).

      For the activity metric, I would suggest showing this data as actions/hour instead of actions/minute. I think it is much more intuitive to consider, for example, that a bat makes 2 actions every hour, than that it makes 0.002 actions per minute.

      Done

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this study, Gu et al. employed novel viral strategies, combined with in vivo two-photon imaging, to map the tone response properties of two groups of cortical neurons in A1. The thalamocortical recipient (TR neurons) and the corticothalamic (CT neurons). They observed a clear tonotopic gradient among TR neurons but not in CT neurons. Moreover, CT neurons exhibited high heterogeneity of their frequency tuning and broader bandwidth, suggesting increased synaptic integration in these neurons. By parsing out different projecting-specific neurons within A1, this study provides insight into how neurons with different connectivity can exhibit different frequency response-related topographic organization.

      Strengths:

      This study reveals the importance of studying neurons with projection specificity rather than layer specificity since neurons within the same layer have very diverse molecular, morphological, physiological, and connectional features. By utilizing a newly developed rabies virus CSN-N2c GCaMP-expressing vector, the authors can label and image specifically the neurons (CT neurons) in A1 that project to the MGB. To compare, they used an anterograde trans-synaptic tracing strategy to label and image neurons in A1 that receive input from MGB (TR neurons).

      Weaknesses:

      Perhaps as cited in the introduction, it is well known that tonotopic gradient is well preserved across all layers within A1, but I feel if the authors want to highlight the specificity of their virus tracing strategy and the populations that they imaged in L2/3 (TR neurons) and L6 (CT neurons), they should perform control groups where they image general excitatory neurons in the two depths and compare to TR and CT neurons, respectively. This will show that it's not their imaging/analysis or behavioral paradigms that are different from other labs. 

      We thank the reviewer for these constructive suggestions. As recommended, we have performed control experiments that imaged the general excitatory neurons in superficial layers (shown below), and the results showed a clear tonotopic gradient, which was consistent with previous findings (Bandyopadhyay et al., 2010; Romero et al., 2020; Rothschild et al., 2010; Tischbirek et al., 2019), thereby validating the reliability of our imaging/analysis approach. The results are presented in a new supplemental figure (Figure 2-figure supplement 3).

      Related publications:

      (1) Gu M, Li X, Liang S, Zhu J, Sun P, He Y, Yu H, Li R, Zhou Z, Lyu J, Li SC, Budinger E, Zhou Y, Jia H, Zhang J, Chen X. 2023. Rabies virus-based labeling of layer 6 corticothalamic neurons for two-photon imaging in vivo. iScience 26: 106625. DOI: https://doi.org/10.1016/j.isci.2023.106625, PMID: 37250327

      (2) Bandyopadhyay S, Shamma SA, Kanold PO. 2010. Dichotomy of functional organization in the mouse auditory cortex. Nat Neurosci 13: 361-8. DOI: https://doi.org/10.1038/nn.2490, PMID: 20118924

      (3) Romero S, Hight AE, Clayton KK, Resnik J, Williamson RS, Hancock KE, Polley DB. 2020. Cellular and Widefield Imaging of Sound Frequency Organization in Primary and Higher Order Fields of the Mouse Auditory Cortex. Cerebral Cortex 30: 1603-1622. DOI: https://doi.org/10.1093/cercor/bhz190, PMID: 31667491

      (4) Rothschild G, Nelken I, Mizrahi A. 2010. Functional organization and population dynamics in the mouse primary auditory cortex. Nat Neurosci 13: 353-60. DOI: https://doi.org/10.1038/nn.2484, PMID: 20118927

      (5) Tischbirek CH, Noda T, Tohmi M, Birkner A, Nelken I, Konnerth A. 2019. In Vivo Functional Mapping of a Cortical Column at Single-Neuron Resolution. Cell Rep 27: 1319-1326 e5. DOI: https://doi.org/10.1016/j.celrep.2019.04.007, PMID: 31042460

      Figures 1D and G, the y-axis is Distance from pia (%). I'm not exactly sure what this means. How does % translate to real cortical thickness?

      We thank the reviewer for this question. The distance of labeled cells from the pia was normalized to the entire distance from the pia to the L6/WM border for each mouse, following a previous study (Chang and Kawai, 2018). For all mice tested, the entire distance from the pia to the L6/WM border was 826.5 ± 23.4 µm (range: 752.9 to 886.1 µm).
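
The normalization described above amounts to a single division per cell. A minimal sketch, assuming the cell depth and the pia-to-L6/WM distance are expressed in the same length unit; the function names are hypothetical, and only the value 826.5 (the reported mean distance) is taken from the response:

```python
def depth_percent(depth_from_pia: float, pia_to_wm: float) -> float:
    """Express a cell's depth as a percentage of the pia-to-L6/WM distance."""
    if pia_to_wm <= 0:
        raise ValueError("pia_to_wm must be positive")
    return 100.0 * depth_from_pia / pia_to_wm


def percent_to_depth(percent: float, pia_to_wm: float) -> float:
    """Convert a normalized depth back to an absolute distance."""
    return percent / 100.0 * pia_to_wm


MEAN_PIA_TO_WM = 826.5  # reported mean; each mouse would use its own value

print(depth_percent(413.25, MEAN_PIA_TO_WM))   # 50.0
print(percent_to_depth(50.0, MEAN_PIA_TO_WM))  # 413.25
```

Because the denominator is measured per mouse, the same percentage corresponds to slightly different absolute depths in different animals.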

      Related publications:

      Chang M, Kawai HD. 2018. A characterization of laminar architecture in mouse primary auditory cortex. Brain Structure and Function 223: 4187-4209. DOI: https://doi.org/10.1007/s00429-018-1744-8, PMID: 30187193

      For Figure 2G and H, is each circle a neuron or an animal? Why are they staggered on top of each other on the x-axis? If the x-axis is the distance from caudal to rostral, each neuron should have a different distance? Also, it seems like it's because Figure 2H has more circles, which is why it has more variation, thus not significant (for example, at 600 or 900um, 2G seems to have fewer circles than 2H). 

      We sincerely appreciate the reviewer’s careful attention to the details of our figures. Each circle in Figures 2G and H represents an individual imaging focal plane from different animals. The median BFs of some focal planes are similar, so their circles partially overlap; where circles overlap, their brightness is additive.

      Since fewer CT neurons, compared to TR neurons, responded to pure tones within each focal plane, as shown in Figure 2- figure supplementary 2, a larger number of focal planes were imaged to ensure a consistent and robust analysis of the pure tone response characteristics. The higher variance and lack of correlation in CT neurons is a key biological finding, not an artifact of sample size. The data clearly show a wide spread of median BFs at any given location for CT neurons, a feature absent in the TR population.

      Similarly, in Figures 2J and L, why are the circles staggered on the y-axis now? And is each circle now a neuron or a trial? It seems they have many more circles than Figure 2G and 2H. Also, I don't think doing a correlation is the proper stats for this type of plot (this point applies to Figures 3H and 3J).

      We regret any confusion this may have caused. Figure 2 illustrates the tonotopic gradient of CT and TR neurons at different scales. Specifically, Figures 2E-H present the imaging from the focal plane perspective (23 focal planes in Figure 2G, 40 focal planes in Figure 2H), whereas Figures 2I-L provide a more detailed view at the single-cell level (481 neurons in Figure 2J, 491 neurons in Figure 2L). Figures 2J and L therefore do have more circles than Figures 2G and H. The analysis at these varying scales consistently reveals the presence of a tonotopic gradient in TR neurons, whereas such a gradient is absent in CT neurons.

      We used Pearson correlation as a standard and direct method to quantify the linear relationship between a neuron's anatomical position and its frequency preference, which is widely used in the field to provide a quantitative measure (R-value) and a significance level (p-value) for the strength of a tonotopic gradient. The same statistical logic applies to testing for spatial gradients in local heterogeneity in Figure 3. We are confident that this is an appropriate and informative statistical approach for these data.
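      As an illustration of the statistic discussed above, the sketch below computes a Pearson R between position along an axis and BF. It is not the authors' code: the positions and BFs are made-up values, and expressing BF in octaves relative to 2 kHz before correlating is an assumption. The p-value is omitted because it needs a t-distribution; a library routine such as `scipy.stats.pearsonr` returns both values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical positions along the caudal-to-rostral axis (same unit throughout).
position = [0, 200, 400, 600, 800]

# Hypothetical BFs in kHz: one ordered map and one scrambled map.
ordered_bf_khz = [2, 4, 8, 16, 32]
scrambled_bf_khz = [8, 2, 32, 4, 16]


def to_octaves(bfs_khz, ref_khz=2.0):
    """Convert BFs to octaves above a reference frequency (assumed 2 kHz)."""
    return [math.log2(f / ref_khz) for f in bfs_khz]


r_ordered = pearson_r(position, to_octaves(ordered_bf_khz))
r_scrambled = pearson_r(position, to_octaves(scrambled_bf_khz))
print(r_ordered, r_scrambled)
```

      In this toy data the ordered map yields a much larger R than the scrambled one, which is the kind of contrast the correlation is meant to quantify.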

      What does the inter-quartile range of BF (IQRBF, in octaves) imply? What's the interpretation of this analysis? I am confused as to why TR neurons show high IQR in HF areas compared to LF areas, which means homogeneity among TR neurons (lines 213 - 216). On the same note, how is this different from the BF variability?  Isn't higher IQR equal to higher variability?

      We thank the reviewer for raising this important point. IQRBF is a measure of local tuning heterogeneity: it quantifies the diversity of BFs among neighboring neurons. A small IQRBF means neighbors are similarly tuned (an orderly, homogeneous map), while a large IQRBF means neighbors have very different BFs (a disordered, heterogeneous map) (Winkowski and Kanold, 2013; Zeng et al., 2019).

      From the BF position reconstruction of all TR neurons (Figure 2I), most TR neurons in the high-frequency (HF) region respond to high-frequency sounds, but some respond to low frequencies such as 2 kHz, which contributes to the high IQR in HF areas. This does not contradict our main conclusion that TR neurons are significantly more homogeneous than CT neurons. BF variability represents the stability of a neuron's BF over time, while IQR represents the variability of BF among different neurons within a certain range (Chambers et al., 2023).
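      The IQRBF measure can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: how "neighboring" neurons are selected is not specified here, so the function simply takes a list of BFs for one hypothetical neighborhood; the 2 kHz reference and the inclusive quantile method are also assumptions.

```python
import math
from statistics import quantiles


def iqr_bf_octaves(bfs_khz, ref_khz=2.0):
    """Inter-quartile range of best frequencies, in octaves.

    bfs_khz: BFs (kHz) of the neurons in one local neighborhood.
    ref_khz: reference frequency for the octave scale (assumed; the IQR
             itself does not depend on this choice).
    """
    octaves = [math.log2(f / ref_khz) for f in bfs_khz]
    q1, _median, q3 = quantiles(octaves, n=4, method="inclusive")
    return q3 - q1


# Hypothetical neighborhoods: similarly tuned vs. widely scattered BFs.
homogeneous_iqr = iqr_bf_octaves([8, 8, 8, 8, 8])
heterogeneous_iqr = iqr_bf_octaves([2, 4, 8, 16, 32])
print(homogeneous_iqr, heterogeneous_iqr)
```

      A neighborhood of identically tuned cells gives an IQR of zero, while BFs spread over several octaves give a large IQR. This is a spread across neurons at one time point, which is distinct from the over-time BF variability of a single neuron.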

      Related publications:

      (1) Chambers AR, Aschauer DF, Eppler JB, Kaschube M, Rumpel S. 2023. A stable sensory map emerges from a dynamic equilibrium of neurons with unstable tuning properties. Cerebral Cortex 33: 5597-5612. DOI: https://doi.org/10.1093/cercor/bhac445, PMID: 36418925

      (2) Winkowski DE, Kanold PO. 2013. Laminar transformation of frequency organization in auditory cortex. Journal of Neuroscience 33: 1498-508. DOI: https://doi.org/10.1523/JNEUROSCI.3101-12.2013, PMID: 23345224

      (3) Zeng HH, Huang JF, Chen M, Wen YQ, Shen ZM, Poo MM. 2019. Local homogeneity of tonotopic organization in the primary auditory cortex of marmosets. Proceedings of the National Academy of Sciences of the United States of America 116: 3239-3244. DOI: https://doi.org/10.1073/pnas.1816653116, PMID: 30718428

      Figure 4A-B, there are no clear criteria on how the authors categorize V, I, and O shapes. The descriptions in the Methods (lines 721 - 725) are also very vague.

      We apologize for the initial vagueness and have replaced the descriptions in the Methods section. “V-shaped”: Neurons whose FRAs show decreasing frequency selectivity with increasing intensity. “I-shaped”: Neurons whose FRAs show constant frequency selectivity with increasing intensity. “O-shaped”: Neurons responsive to a small range of intensities and frequencies, with the peak response not occurring at the highest intensity level.

      To provide better visual intuition, we show multiple representative examples of each FRA type for both TR and CT neurons below. We are confident that these provide the necessary clarity and reproducibility for our analysis of receptive field properties.

      Author response image 1.

      Different FRA types within the dataset of TR and CT neurons. Each row shows 6 representative FRAs from a specific type. Types are V-shaped (‘V’), I-shaped (‘I’), and O-shaped (‘O’). The X-axis represents 11 pure tone frequencies, and the Y-axis represents 6 sound intensities.

      Reviewer #2 (Public Review):

      Summary:

      Gu and Liang et al. investigated how auditory information is mapped and transformed as it enters and exits the auditory cortex. They use anterograde transsynaptic tracers to label and perform calcium imaging of thalamorecipient neurons in A1 and retrograde tracers to label and perform calcium imaging of corticothalamic output neurons. They demonstrate a degradation of tonotopic organization from the input to output neurons.

      Strengths:

      The experiments appear well executed, well described, and analyzed.

      Weaknesses:

      (1) Given that the CT and TR neurons were imaged at different depths, the question as to whether or not these differences could otherwise be explained by layer-specific differences is still not 100% resolved. Control measurements would be needed either by recording (1) CT neurons in upper layers, (2) TR in deeper layers, (3) non-CT in deeper layers and/or (4) non-TR in upper layers.

      We appreciate these constructive suggestions. To address this, we performed new experiments and analyses.

      Comparison of TR neurons across superficial layers: we analyzed our existing TR neuron dataset to see if response properties varied by depth within the superficial layers. We found no significant differences in the fraction of tuned neurons, field IQR, or maximum bandwidth (BWmax) between TR neurons in L2/3 and L4. This suggests a degree of functional homogeneity within the thalamorecipient population across these layers. The results are presented in new supplemental figures (Figure 2- figure supplementary 4).

      Necessary control experiments.

      (1) CT neurons in upper layers. CT neurons are thalamic projection neurons that only exist in the deeper cortex, so CT neurons do not exist in upper layers (Antunes and Malmierca, 2021).

      (2) TR neurons in deeper layers. As we mentioned in the manuscript, high-titer AAV1-Cre can label neurons both anterogradely and retrogradely, so it is challenging to identify TR neurons unambiguously in deeper layers.

      (3) non-CT in deeper layers and/or (4) non-TR in upper layers.

      To directly test if projection identity confers distinct functional properties within the same cortical layers, we performed the crucial control of comparing TR neurons to their neighboring non-TR neurons. We injected AAV1-Cre in MGB and a Cre-dependent mCherry into A1 to label TR neurons red. We then co-injected AAV-CaMKII-GCaMP6s to label the general excitatory population green.  In merged images, this allowed us to functionally image and directly compare TR neurons (yellow) and adjacent non-TR neurons (green). We separately recorded the responses of these neurons to pure tones using two-photon imaging. The results show that TR neurons are significantly more likely to be tuned to pure tones than their neighboring non-TR excitatory neurons. This finding provides direct evidence that a neuron's long-range connectivity, and not just its laminar location, is a key determinant of its response properties. The results are presented in new supplemental figures (Figure 2- figure supplementary 5).

      Related publications:

      Antunes FM, Malmierca MS. 2021. Corticothalamic Pathways in Auditory Processing: Recent Advances and Insights From Other Sensory Systems. Front Neural Circuits 15: 721186. DOI: https://doi.org/10.3389/fncir.2021.721186, PMID: 34489648

      (2) What percent of the neurons at the depths are CT neurons? Similar questions for TR neurons?

      We thank the reviewer for the comments. We performed histological analysis on brain slices from our experimental animals to quantify the density of these projection-specific populations. Our analysis reveals that CT neurons constitute approximately 25.47% (22.99%–36.50%) of all neurons in layer 6 of A1. In the superficial layers (L2/3 and L4), TR neurons comprise approximately 10.66% (10.53%–11.37%) of the total neuronal population.

      Author response image 2.

      The fraction of CT and TR neurons. (A) Boxplots showing the fraction of CT neurons. N = 11 slices from 4 mice. (B) Boxplots showing the fraction of TR neurons. N = 11 slices from 4 mice.

      (3) V-shaped, I-shaped, or O-shaped is not an intuitively understood nomenclature, consider changing. Further, the x/y axis for Figure 4a is not labeled, so it's not clear what the heat maps are supposed to represent.

      The terms "V-shaped," "I-shaped," and "O-shaped" are an established nomenclature in the auditory neuroscience literature for describing frequency response areas (FRAs), and we use them for consistency with prior work (Rothschild et al., 2010). V-shaped: Neurons whose FRAs show decreasing frequency selectivity with increasing intensity. I-shaped: Neurons whose FRAs show constant frequency selectivity with increasing intensity. O-shaped: Neurons responsive to a small range of intensities and frequencies, with the peak response not occurring at the highest intensity level. We have included a more detailed description in the Methods.

      The X-axis represents 11 pure tone frequencies, and the Y-axis represents 6 sound intensities. So, the heat map represents the FRA of neurons in A1, reflecting the responses for different frequencies and intensities of sound stimuli. In the revised manuscript, we have provided clarifications in the figure legend.

      (4) Many references about projection neurons and cortical circuits are based on studies from visual or somatosensory cortex. Auditory cortex organization is not necessarily the same as other sensory areas. Auditory cortex references should be used specifically, and not sources reporting on S1, and V1.

      We thank the reviewers for their valuable comments. We have made a concerted effort to ensure that claims about cortical circuit organization are supported by findings specifically from the auditory cortex wherever possible, strengthening the focus and specificity of our discussion.

      Reviewer #3 (Public Review):

      Summary:

      The authors performed wide-field and 2-photon imaging in vivo in awake head-fixed mice, to compare receptive fields and tonotopic organization in thalamocortical recipient (TR) neurons vs corticothalamic (CT) neurons of mouse auditory cortex. TR neurons were found in all cortical layers while CT neurons were restricted to layer 6. The TR neurons at nominal depths of 200-400 microns have a remarkable degree of tonotopy (as good if not better than tonotopic maps reported by multiunit recordings). In contrast, CT neurons were very heterogenous in terms of their best frequency (BF), even when focusing on the low vs high-frequency regions of the primary auditory cortex. CT neurons also had wider tuning.

      Strengths:

      This is a thorough examination using modern methods, helping to resolve a question in the field with projection-specific mapping.

      Weaknesses:

      There are some limitations due to the methods, and it's unclear what the importance of these responses are outside of behavioral context or measured at single timepoints given the plasticity, context-dependence, and receptive field 'drift' that can occur in the cortex.

      (1) Probably the biggest conceptual difficulty I have with the paper is comparing these results to past studies mapping auditory cortex topography, mainly due to differences in methods. Conventionally, the tonotopic organization is observed for characteristic frequency maps (not best frequency maps), as tuning precision degrades and the best frequency can shift as sound intensity increases. The authors used six attenuation levels (30-80 dB SPL) and reported that the background noise of the 2-photon scope is <30 dB SPL, which seems very quiet. The authors should at least describe the sound-proofing they used to get the noise level that low, and some sense of noise across the 2-40 kHz frequency range would be nice as a supplementary figure. It also remains unclear just what the 2-photon dF/F response represents in terms of spikes. Classic mapping using single-unit or multi-unit electrodes might be sensitive to single spikes (as might be emitted at characteristic frequency), but this might not be as obvious for Ca2+ imaging. This isn't a concern for the internal comparison here between TR and CT cells as conditions are similar, but is a concern for relating the tonotopy or lack thereof reported here to other studies.

      We sincerely thank the reviewer for the thoughtful evaluation of our manuscript and for your positive assessment of our work.

      (1)  Concern regarding Best Frequency (BF) vs. Characteristic Frequency (CF)

      Our use of BF, defined as the frequency eliciting the highest response averaged across all sound levels, is a standard and practical approach in 2-photon Ca²⁺ imaging studies (Issa et al., 2014; Rothschild et al., 2010; Schmitt et al., 2023; Tischbirek et al., 2019). This method is well-suited for functionally characterizing large numbers of neurons simultaneously, where determining a precise firing threshold for each individual cell can be challenging.

      (2) Concern regarding background noise of the 2-photon setup

      We have expanded the Methods section ("Auditory stimulation") to include a detailed description of the sound-attenuation strategies used during the experiments. The use of a custom-built, double-walled sound-proof enclosure lined with wedge-shaped acoustic foam was implemented to significantly reduce external noise interference. These strategies ensured that auditory stimuli were delivered under highly controlled, low-noise conditions, thereby enhancing the reliability and accuracy of the neural response measurements obtained throughout the study.

      (3) Concern regarding the relationship between dF/F and spikes

      While Ca²⁺ signals are an indirect and filtered representation of spiking activity, they are a powerful tool for assessing the functional properties of genetically-defined cell populations. As you note, the properties and limitations of Ca²⁺ imaging apply equally to both the TR and CT neuron groups we recorded. Therefore, the profound difference we observed—a clear tonotopic gradient in one population and a lack thereof in the other—is a robust biological finding and not a methodological artifact.

      Related publications:

      (1) Issa JB, Haeffele BD, Agarwal A, Bergles DE, Young ED, Yue DT. 2014. Multiscale optical Ca2+ imaging of tonal organization in mouse auditory cortex. Neuron 83: 944-59. DOI: https://doi.org/10.1016/j.neuron.2014.07.009, PMID: 25088366

      (2) Rothschild G, Nelken I, Mizrahi A. 2010. Functional organization and population dynamics in the mouse primary auditory cortex. Nat Neurosci 13: 353-60. DOI: https://doi.org/10.1038/nn.2484, PMID: 20118927

      (3) Schmitt TTX, Andrea KMA, Wadle SL, Hirtz JJ. 2023. Distinct topographic organization and network activity patterns of corticocollicular neurons within layer 5 auditory cortex. Front Neural Circuits 17: 1210057. DOI: https://doi.org/10.3389/fncir.2023.1210057, PMID: 37521334

      (4) Tischbirek CH, Noda T, Tohmi M, Birkner A, Nelken I, Konnerth A. 2019. In Vivo Functional Mapping of a Cortical Column at Single-Neuron Resolution. Cell Rep 27: 1319-1326 e5. DOI: https://doi.org/10.1016/j.celrep.2019.04.007, PMID: 31042460

      (2) It seems a bit peculiar that while 2721 CT neurons (N=10 mice) were imaged, less than half as many TR cells were imaged (n=1041 cells from N=5 mice). I would have expected there to be many more TR neurons even mouse for mouse (normalizing by number of neurons per mouse), but perhaps the authors were just interested in a comparison data set and not being as thorough or complete with the TR imaging?

      As shown in Figure 2- figure supplementary 2, a much higher fraction of TR neurons was "tuned" to pure tones (46% of 1041 neurons) compared to CT neurons (only 18% of 2721 neurons). To obtain a statistically robust and comparable number of tuned neurons for our core analysis (481 tuned TR neurons vs. 491 tuned CT neurons), it was necessary to sample a larger total population of CT neurons, which required imaging from more animals.
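      The sampling argument above can be checked with simple arithmetic. The sketch below uses only the counts quoted in the response; the variable names are illustrative, and the percentages are recomputed from those counts rather than taken from the figure.

```python
# Counts quoted in the response: (tuned neurons, all imaged neurons).
counts = {"TR": (481, 1041), "CT": (491, 2721)}

# Fraction of imaged neurons that were tuned, per population.
tuned_fraction = {name: tuned / total for name, (tuned, total) in counts.items()}

# How many times more CT neurons must be imaged to reach a similar number
# of tuned cells, given the lower tuned fraction in CT neurons.
required_sampling_ratio = tuned_fraction["TR"] / tuned_fraction["CT"]

# How many times more CT neurons were actually imaged.
actual_sampling_ratio = counts["CT"][1] / counts["TR"][1]

for name, frac in tuned_fraction.items():
    print(f"{name}: {100 * frac:.1f}% tuned")
print(f"required ratio: {required_sampling_ratio:.2f}, actual ratio: {actual_sampling_ratio:.2f}")
```

      The recomputed fractions round to the 46% and 18% quoted above, and the roughly 2.6-fold larger CT sample is in line with what the lower tuned fraction requires.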

      (3) The authors' definitions of neuronal response type in the methods need more quantitative detail. The authors state: "Irregular" neurons exhibited spontaneous activity with highly variable responses to sound stimulation. "Tuned" neurons were responsive neurons that demonstrated significant selectivity for certain stimuli. "Silent" neurons were defined as those that remained completely inactive during our recording period (> 30 min). For tuned neurons, the best frequency (BF) was defined as the sound frequency associated with the highest response averaged across all sound levels.". The authors need to define what their thresholds are for 'highly variable', 'significant', and 'completely inactive'. Is best frequency the most significant response, the global max (even if another stimulus evokes a very close amplitude response), etc.

      We appreciate the reviewer's suggestions. We have added a more detailed description in the Methods.

      Tuned neurons: A responsive neuron was further classified as "Tuned" if its responses showed significant frequency selectivity. We determined this using a one-way ANOVA on the neuron's response amplitudes across all tested frequencies (at the sound level that elicited the maximal response). If the ANOVA yielded a p-value < 0.05, the neuron was considered "Tuned".

      Irregular neurons: Responsive neurons that did not meet the statistical criterion for being "Tuned" (i.e., ANOVA p-value ≥ 0.05) were classified as "Irregular". This provides a clear, mutually exclusive category for sound-responsive but broadly-tuned or non-selective cells.

      Silent neurons: Neurons that were not responsive were classified as "Silent". This quantitatively defines them as cells that showed no significant stimulus-evoked activity during the entire recording session.

      Best frequency (BF): The frequency that elicited the maximal mean response, averaged across all sound levels.
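      The decision rule and the BF definition above can be sketched as follows. This is an illustrative reading, not the authors' code: the responsiveness test is not specified here, so it enters as a boolean input, and the ANOVA p-value is assumed to be computed elsewhere (for example with `scipy.stats.f_oneway`). The function names and the example FRA are hypothetical.

```python
from typing import Optional, Sequence


def classify_neuron(responsive: bool, anova_p: Optional[float] = None,
                    alpha: float = 0.05) -> str:
    """Assign one of the three mutually exclusive response categories."""
    if not responsive:
        return "Silent"
    if anova_p is None:
        raise ValueError("anova_p is required for responsive neurons")
    # Responsive neurons are "Tuned" only if frequency selectivity is significant.
    return "Tuned" if anova_p < alpha else "Irregular"


def best_frequency(fra: Sequence[Sequence[float]],
                   freqs_khz: Sequence[float]) -> float:
    """Frequency with the largest response averaged across sound levels.

    fra[i][j] is the response at sound level i and frequency j.
    If two frequencies tie, the lower-index one is returned.
    """
    n_levels = len(fra)
    mean_by_freq = [sum(row[j] for row in fra) / n_levels
                    for j in range(len(freqs_khz))]
    best_index = max(range(len(freqs_khz)), key=mean_by_freq.__getitem__)
    return freqs_khz[best_index]


# Hypothetical FRA: 2 sound levels (rows) x 3 frequencies (columns).
example_freqs_khz = [2.0, 4.0, 8.0]
example_fra = [
    [0.1, 0.5, 0.2],  # lower sound level
    [0.2, 0.9, 0.8],  # higher sound level
]
example_bf = best_frequency(example_fra, example_freqs_khz)
print(classify_neuron(True, anova_p=0.01), example_bf)
```

      Under this reading, BF is a global maximum of the level-averaged response, so a second frequency with a nearly equal mean response does not change the assignment.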

      To provide greater clarity, we showed examples in the following figures.

      Author response image 3.

      Reviewer #1 (Recommendations For The Authors):

      (1) A1 and AuC were used exchangeably in the text.

      Thank you for pointing out this issue. Our terminological strategy was to remain faithful to the original terms used in the literature we cite, where "AuC" is often used more broadly. In the revised manuscript, we have performed a careful edit to ensure that we use the specific term "A1" (primary auditory cortex) when describing our own results and recording locations, which were functionally and anatomically confirmed.

      (2) Grammar mistakes throughout.

      We are grateful for the reviewer’s suggested improvement to our wording. The entire manuscript has undergone a thorough professional copyediting process to correct all grammatical errors and improve overall readability.

      (3) The discussion should talk more about how/why L6 CT neurons don't possess the tonotopic organization and what are the implications. Currently, it only says 'indicative of an increase in synaptic integration during cortical processing'...

      Thanks for this suggestion. We have substantially revised and expanded the Discussion section to explore the potential mechanisms and functional implications of the lack of tonotopy in L6 CT neurons.

      Broad pooling of inputs: We propose that the lack of tonotopy is an active computation, not a passive degradation. CT neurons likely pool inputs from a wide range of upstream neurons with diverse frequency preferences. This broad synaptic integration, reflected in their wider tuning bandwidth, would actively erase the fine-grained frequency map in favor of creating a different kind of representation.

      A shift from topography to abstract representation: This transformation away from a classic sensory map may be critical for the function of corticothalamic feedback. Instead of relaying "what" frequency was heard, the descending signal from CT neurons may convey more abstract, higher-order information, such as the behavioral relevance of a sound, predictions about upcoming sounds, or motor-related efference copy signals that are not inherently frequency-specific.

      Modulatory role of the descending pathway: The descending A1-to-MGB pathway is often considered to be modulatory, shaping thalamic responses rather than driving them directly. A modulatory signal designed to globally adjust thalamic gain or selectivity may not require, and may even be hindered by, a fine-grained topographical organization.

      Reviewer #2 (Recommendations For The Authors):

      (1) Given that the CT and TR neurons were imaged at different depths, the question as to whether or not these differences could otherwise be explained by layer-specific differences is still not 100% resolved. Control measurements would be needed either by recording (1) CT neurons in upper layers (2) TR in deeper layers (3) non-CT in deeper layers and/or (4) non-TR in upper layers.

      We appreciate these constructive suggestions. To address this, we performed new experiments and analyses.

      Comparison of TR neurons across superficial layers: we analyzed our existing TR neuron dataset to see if response properties varied by depth within the superficial layers. We found no significant differences in the fraction of tuned neurons, field IQR, or maximum bandwidth (BWmax) between TR neurons in L2/3 and L4. This suggests a degree of functional homogeneity within the thalamorecipient population across these layers.

      Necessary control experiments.

      (1) CT neurons in upper layers. CT neurons are thalamic projection neurons that only exist in the deeper cortex, so CT neurons do not exist in upper layers (Antunes and Malmierca, 2021).

      (2) TR neurons in deeper layers. As we mentioned in the manuscript, high-titer AAV1-Cre can label neurons both anterogradely and retrogradely, so it is challenging to identify TR neurons unambiguously in deeper layers.

      (3) non-CT in deeper layers and/or (4) non-TR in upper layers.

      To directly test if projection identity confers distinct functional properties within the same cortical layers, we performed the crucial control of comparing TR neurons to their neighboring non-TR neurons. We injected AAV1-Cre in MGB and a Cre-dependent mCherry into A1 to label TR neurons red. We then co-injected AAV-CaMKII-GCaMP6s to label the general excitatory population green.  In merged images, this allowed us to functionally image and directly compare TR neurons (yellow) and adjacent non-TR neurons (green). We separately recorded the responses of these neurons to pure tones using two-photon imaging. The results show that TR neurons are significantly more likely to be tuned to pure tones than their neighboring non-TR excitatory neurons. This finding provides direct evidence that a neuron's long-range connectivity, and not just its laminar location, is a key determinant of its response properties.

      Related publications:

      Antunes FM, Malmierca MS. 2021. Corticothalamic Pathways in Auditory Processing: Recent Advances and Insights From Other Sensory Systems. Front Neural Circuits 15: 721186. DOI: https://doi.org/10.3389/fncir.2021.721186, PMID: 34489648

      (3) V-shaped, I-shaped, or O-shaped is not an intuitively understood nomenclature, consider changing. Further, the x/y axis for Figure 4a is not labeled, so it's not clear what the heat maps are supposed to represent.

      The terms "V-shaped," "I-shaped," and "O-shaped" are an established nomenclature in the auditory neuroscience literature for describing frequency response areas (FRAs), and we use them for consistency with prior work (Rothschild et al., 2010). V-shaped: Neurons whose FRAs show decreasing frequency selectivity with increasing intensity. I-shaped: Neurons whose FRAs show constant frequency selectivity with increasing intensity. O-shaped: Neurons responsive to a small range of intensities and frequencies, with the peak response not occurring at the highest intensity level. We have included a more detailed description in the Methods.

      The X-axis represents 11 pure tone frequencies, and the Y-axis represents 6 sound intensities. So, the heat map represents the FRA of neurons in A1, reflecting the responses for different frequencies and intensities of sound stimuli. In the revised manuscript, we have provided clarifications in the figure legend.

      (4) Many references about projection neurons and cortical circuits are based on studies from visual or somatosensory cortex. Auditory cortex organization is not necessarily the same as other sensory areas. Auditory cortex references should be used specifically, and not sources reporting on S1, V1.

      We thank the reviewers for their valuable comments. We have made a concerted effort to ensure that claims about cortical circuit organization are supported by findings specifically from the auditory cortex wherever possible, strengthening the focus and specificity of our discussion.

      Reviewer #3 (Recommendations For The Authors):

      I suggest showing some more examples of how different neurons and receptive field properties were quantified and statistically analyzed. Especially in Figure 4, but really throughout.

      We thank the reviewer for this valuable suggestion. To provide greater clarity, we have added more examples in the following figure.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this study, Ledamoisel et al. examined the evolution of visual and chemical signals in closely related Morpho butterfly species to understand their role in species coexistence. Using an integrative, state-of-the-art approach combining spectrophotometry, visual modeling, and behavioral mate choice experiments, they quantified differences in wing iridescence and assessed its influence on mate preference in allopatry and sympatry. They also performed chemical analyses to determine whether sympatric species exhibit divergent chemical cues that may facilitate species recognition and mate discrimination. The authors found iridescent coloration to be similar in sympatric Morpho species. Furthermore, male mate choice experiments revealed that in sympatry, males fail to discriminate conspecific females based on coloration, reinforcing the idea that visual signal convergence is primarily driven by predation pressure. In contrast, the divergence of chemical signals among sympatric species suggests their potential role in facilitating species recognition and mate discrimination. The authors conclude that interactions between ecological pressures and signal evolution may shape species coexistence.

      Strengths:

      The study is well-designed and integrates multiple methodological approaches to provide a thorough assessment of signal evolution in the studied species. I appreciate the authors' careful consideration of multiple selective pressures and their combined influence on signal divergence and convergence. Additionally, the inclusion of both visual and chemical signals adds an interesting and valuable dimension to the study, enhancing its importance. Beyond butterflies, this research broadens our understanding of multimodal communication and signal evolution in the context of species coexistence.

      Weaknesses:

      (1) The broader significance of the findings needs to be better articulated. While the authors emphasize that comparing adaptive traits in sympatry and allopatry provides insights into selective processes shaping reproductive isolation and coexistence, it is unclear what key conceptual or theoretical questions are being addressed. Are these patterns expected under certain evolutionary scenarios? Have they been empirically demonstrated in other systems? The authors should explicitly state the overarching research question, incorporate some predictions, and better contextualize their findings within the existing literature. If the results challenge or support previous work, that should be highlighted to strengthen the study's importance in a broader context.

      We thank the reviewer for their valuable feedback. We understand that the framing of the results and the discussion may have failed to convey the broader significance of our findings. In the first version of the manuscript, we framed our study around the processes shaping reproductive isolation and co-existence in sympatry, but we now realize that this question was too broad with regard to our results. We have therefore focused strictly on the importance of ecological interactions in the evolution of traits in sympatric species. In the revised version of the manuscript, we rewrote the first paragraph of the introduction to provide context on the effect of ecological interactions on trait evolution (lines 43-60). We then explicitly introduce the theoretical question investigated in our paper (i.e. “we investigate how ecological interactions in sympatry can constrain natural and sexual selection shaping trait evolution”, lines 62-63) and our predictions regarding the evolution of traits in sympatry vs. allopatry (lines 74-80). We also added predictions regarding our experiments on Morpho at the end of the introduction (lines 146-157). As a result, the discussion is now better aligned with the introduction, as it discusses the putative effect of predation and mate choice on the evolution of wing iridescence in Morpho.

      (2) The motivation for studying visual signals and mate choice in allopatric populations (i.e., at the intraspecific level) is not well articulated, leaving their role in the broader narrative unclear. In particular, the rationale behind experiments 1, 2, and 3 is not well defined, as the authors have not made a strong case for the need for these intraspecific comparisons in the introduction. This issue is further compounded by the authors' primary focus on signal evolution in sympatry throughout both the results and the discussion. For instance, the divergence of iridescence in allopatry is a potentially interesting result. But the authors have not discussed its implications.

      We now clearly state in the introduction our motivation for studying visual signals and mate choice in allopatric populations (lines 74-80, lines 146-157). We argue that intraspecific comparisons help identify whether visual cues can be used in mate recognition between phylogenetically close subspecies, between which visual resemblance is expected to be higher than between closely related species (tetrad experiment and experiment 1). As M. h. bristowi and M. h. theodorus have different wing patterns, we also used this comparison to identify the traits involved in male mate preference within a species, testing the importance of iridescent color (experiment 2) or iridescent patterning (experiment 3). The results of those experiments can then be used to assess whether these traits are used in species recognition between sympatric species. See also our answers to recommendations 11 and 15 from reviewer #1.

      Overall, given that the primary conclusions are based on results and analyses in sympatry, the role of allopatric populations in shaping these conclusions needs to be better integrated and justified. Without a stronger link between the comparative framework and the study's key takeaways, the use of allopatric populations feels somewhat peripheral rather than central to the study's aim. Since the primary conclusions remain valid even without the allopatric comparisons, their inclusion requires a clearer rationale.

      To make a stronger case for the use of the allopatric populations in our manuscript, we strengthened the justification for studying intraspecific allopatric populations vs. interspecific sympatric populations: the iridescence measurements and the mate choice experiments in allopatric populations serve as a baseline which, when compared to sympatric populations, shows how species interactions can shape the evolution of traits and mate recognition. Following your major comment #1, we rewrote the introduction to justify the need for studying allopatric vs. sympatric populations (lines 74-80), and in the discussion we further highlighted the need to study iridescence in sympatric species to fully understand their trait evolution (lines 339-343).

      (3) While the authors demonstrate that iridescence is indistinguishable to predators in sympatry, they overstate the role of predation in driving convergence. The present study does not experimentally demonstrate that iridescence in this species has a confusion effect or contributes to evasive mimicry. Alternatively, convergence could result from other selective forces, such as signal efficacy due to environmental conditions, rather than being solely driven by predation.

      We acknowledge that our study does not directly demonstrate that iridescence contributes to evasive mimicry. We toned down the interpretation of the results in the discussion and now state that predation is not the only selective pressure that could have promoted convergent evolution of iridescence in sympatric species, as iridescence could also be involved in thermoregulation (lines 346-353) and camouflage (lines 363-369), for example. We made sure to mention that convergence in iridescent signals in sympatry provides only indirect support for the evasive mimicry hypothesis, and that further research, including direct predation experiments, is still needed to show that this convergence is indeed driven by predation (lines 391-396).

      Reviewer #2 (Public review):

      This study presents an investigation of the visual and chemical properties and mating behaviour in Morpho butterflies, aimed at addressing the nature of divergence between closely related species in sympatry. The study species consists of three subspecies of Morpho helenor (bristowi, theodorus, and helenor), and the conspecific Morpho achilles achilles. The authors postulate that whereas the iridescent blue signals of all (sub)species should function as a predator reduction signal (similar to aposematism) and therefore exhibit convergence, the same signals should indicate divergence if used as a mating signal, particularly in sympatric populations. They also assess chemical profiles among the species to assess the potential utility of scent in mediating species/sex discrimination.

      The authors first used reflectance spectrometry to calculate hue, brightness, and chroma, plus two measures of "iridescence" (perhaps better phrased as angular dependence) in each (sub)species. This indicated the ubiquitous presence of sexual dimorphism in brightness (males brighter), which also appears to be the case for iridescence (Figure 3A-B). Analysis of these data also indicated that whereas there is evidence for divergence among subspecies in allopatry, the same evidence is lacking for species in sympatry (P = 0.084). This was supported further by visual modelling, which showed that both conspecifics and birds should be (theoretically) capable of perceiving the colour difference among allopatric populations of M. helenor, whereas the same is not true for the sympatric species.

      The authors then conducted mate choice trials, first using live individuals and second using female dummies. The live experiments indicated the presence of assortative mating among the two subspecies of M. helenor (bristowi and theodorus). The dummy presentations indicated (a) bristowi males prefer conspecific wings, whereas theodorus have no preference, (b) bristowi males prefer the con(sub)specific colour pattern, (c) theodorus prefer the con(sub)specific iridescence when the pattern is manipulated to be similar among female dummies. A fourth experiment, using sympatric M. achilles and M. helenor, indicated no preference for conspecific female dummies. Finally, chemical analysis indicated substantial differences between these two species in putative pheromone compounds, and especially so in the males.

      The authors conclude that the similarity of iridescence among species in sympatry is suggestive of convergence upon a common anti-predation signal. Despite some behavioural evidence in favour of colour (iridescence)-based mate discrimination, chemical differences between M. achilles and M. helenor are posed as more likely to function for species isolation than visual differences.

      Overall, I enjoyed reading this manuscript, which presents a valiant attempt at studying visual, chemical and behavioural divergence in this iconic group of butterflies.

      Major comments

      My only major comment concerns the authors' favoured explanation for aposematism (or evasive mimicry) for convergence among species, which is based upon the you-can't-catch-me hypothesis first presented by Young 1971. Although there is supporting work showing that iridescent-like stimuli are more difficult to precisely localize by a range of viewers, most of the evidence as applied to the Morpho system is circumstantial, and I'm not certain that there is widespread acceptance of this hypothesis. Given that the present study deals with closely-related (sub)species, one alternative explanation - a "null" hypothesis of sorts - is for a lack of divergence (from a common starting point) as opposed to evolutionary convergence per se. In other words, two subspecies are likely to retain ancestral character states unless there is selection that causes them to diverge. I feel that the manuscript would benefit from a discussion of this alternative, if not others. Signalling to predators could very well be involved in constraining the extent of convergence, but this seems a little premature to state as an up-front conclusion of this work. There is also the result of a *dorsal* wing manipulation by Vieira-Silva et al. 2024 which seems difficult to reconcile in light of this explanation. Whereas this paper is cited by the authors, a more nuanced discussion of their experimental results would seem appropriate here.

      We thank the reviewer for their constructive comments on our manuscript. We appreciate the reviewer’s concern regarding the way iridescence convergence between sympatric species is discussed in our manuscript, which aligns with similar concerns raised by Reviewer 1. Indeed, the you-can't-catch-me hypothesis has not yet been empirically tested in Morpho; it is currently a working hypothesis supported only by indirect lines of evidence.

      Among the 30 known Morpho species, iridescence is most likely the ancestral character, notably because iridescence is a trait shared by a majority of Morpho (we now mention this in the introduction, lines 108-110). In this paper, we thus did not aim to identify the evolutionary forces involved in the appearance of iridescence in this group, but rather wanted to understand to what extent ecological interactions can impact the diversification (or not) of this trait. As such, the dorsal manipulations performed in Vieira-Silva et al 2024, showing that iridescence in Morpho may have an effect similar to that of crypsis, do not affect our working hypothesis. Instead, we use Vieira-Silva et al 2024 to discuss the potential anti-predator effect of iridescence, which could promote convergent evolution of iridescent patterns.

      In the main text, we now clearly mention our null hypothesis: under a scenario of neutral evolution of iridescence, we would expect that the divergence in wing coloration between two M. helenor subspecies would be lower than between two different Morpho species (M. helenor and M. achilles) and showed that our results sharply differ from this null expectation.

      We then improved the discussion by adding alternative hypotheses potentially explaining the convergent iridescent signal detected in sympatric species: we discussed the expected effect under neutral evolution (lines 339-343), but also added alternative hypotheses regarding the diversification of iridescence due to camouflage (lines 363-369), predator evasion (lines 373-377) and thermoregulation (lines 346-353).

      Reviewer #3 (Public review):

      The authors investigated differences in iridescence wing colouration of allopatric (geographically separated) and sympatric (coexisting) Morpho butterfly (sub)species. Their aim was to assess if iridescence wing colouration of Morpho (sub)species converged or diverged depending on coexistence and if iridescence wing colouration was involved in mating behaviour and reproductive isolation. The authors hypothesize that iridescence wing colouration of different (sub)species should converge in sympatry and diverge in allopatry. In sympatry, iridescence wing colouration can act as an effective antipredator defence with shared benefits if multiple (sub)species share the same colouration. However, shared wing colouration can have potential costs in terms of reproductive interference since wing colouration is often involved in mate recognition. If the benefits of a shared antipredator defence outweigh the costs of reproductive interference, iridescence wing colouration will show convergence and alternative mate recognition strategies might evolve, such as chemical mate recognition. In allopatry, iridescence wing colouration is expected to diverge due to adaptation to different local conditions and no alternative mate recognition is expected.

      Strengths:

      (1) Using allopatric and sympatric (sub)species that are closely related is a powerful way to test evolutionary hypotheses

      (2) By clearly defining iridescence and measuring colour spectra from a variety of angles, applying different methods, a very comprehensive dataset of iridescence wing colouration is achieved.

      (3) By experimentally manipulating wing coloration patterns, the authors show visual mate recognition for M. h. bristowi and could, in theory, separate different visual aspects of colouration (patterns VS iridescence strength).

      (4) Measurements of chemical profiles to investigate alternative mate recognition strategies in case of convergence of visual signals.

      Weaknesses:

      In my opinion, studies should be judged on the methods and data included, and not on additional measurements that could have been taken or additional treatments/species that should be included, since in most ecological and evolutionary studies, more measurements or treatments/species can always be included. However, studies do need to ensure appropriate replication and appropriate measurements to test their hypothesis AND support their conclusions. The current study failed to ensure appropriate replication, and in various cases, the results do not support the conclusions.

      First, when using allopatric and sympatric (sub)species pairs to test evolutionary hypotheses, replication is important. Ideally, multiple allopatric and sympatric (sub)species pairs are compared to avoid outlier (sub)species or pairs that lead to biased conclusions. Unfortunately, the current study compares 1 allopatric and 1 sympatric (sub)species pair, hence having poor (no) replication on the level of allopatric and sympatric (sub)species pairs.

      We would like to thank the reviewer for their constructive feedback. We agree that replication is important to test evolutionary hypotheses and that our study lacks replication for allopatric and sympatric Morpho populations. Ideally, one would require several allopatric and sympatric replicates to draw conclusions about the effect of species interactions on trait evolution. Our study is a preliminary attempt at answering this question, covering a few Morpho populations but proposing a broad assessment of iridescence and mate preference for those populations. We clearly mention in the discussion that investigating multiple populations is needed to test whether the trend we observed in this paper can be generalized (lines 388-392).

      Second, chemical profiles were only measured for sympatric species and not for allopatric (sub)species, which limits the interpretation of this data. The allopatric (sub)species could have been measured as non-coexistence "control". If coexistence and convergence in wing colouration drives the evolution of alternative mate recognition signals, such alternative signals should not evolve/diverge for allopatric (sub)species where wing colouration is still a reliable mate recognition cue. More importantly, no details are provided on the quantification of butterfly chemical profiles, which is essential to understand such data. It is unclear how the chemical profiles were quantified and what data (concentrations, ratios, proportions) were used to perform NMDS and generate Figure 5 and the associated statistical tests.

      We recognize that having the chemical profiles of the genitalia of the Morpho from the allopatric populations would have made a stronger case in favor of reinforcement acting on the divergence of the chemical compounds found on the genitalia of the sympatric Morpho species. Due to limited access to the biological material needed at the time of the chromatography, we could not test for lower divergence in the chemical profiles of allopatric Morpho butterflies. We made sure to mention this limitation in the discussion (lines 457-461). 

      We already stated in the methods that we compiled the area under the peak of each component found in the chromatograms of our samples and that we performed all the statistical analyses on this dataset. To make this clearer, we mention in the new version of the manuscript that the area under the peak of each component provides a measure of the concentration of that component (in the methods, lines 720, 723, 733). We also added details to the legend of Figure 5.

      Third, throughout the discussion, the authors mention that their results support natural selection by predators on iridescent wing colouration, without measuring natural selection by predators or any other measure related to predation. It is unclear by what predators any of the butterfly species are predated on at this point.

      We made sure to mention in the introduction (lines 132-136) and in the discussion (lines 373-377) that previous predation experiments performed on Morpho and other butterflies showed evidence that birds are likely predators of these species. These observations led us to test for the putative effect of predation on the evolution of their color pattern, without directly testing predation rates. We made sure this information is transparent in the revised manuscript, and now specify that assessing wing convergence is only an indirect way of testing the escape mimicry hypothesis (lines 393-396).

      To continue on the interpretation of the data related to selection on specific traits by specific selection agents: This study did not measure any form of selection or any selection agent. Hence, it is not known if iridescent wing colouration is actually under selection by predators and/or mates, if maybe other selection agents are involved or if these traits converge due to genetic correlations with other traits under selection. For example, iridescent colouration in ground beetles functions as antipredator defence but also in thermo- and water regulation. None of these issues are recognized or discussed.

      The lack of discussion of alternative selective pressures involved in the evolution of iridescence was pointed out by all reviewers. We thus modified the text to account for this comment, and no longer limit our discussion to the putative effects of predation. We now specifically discuss alternative hypotheses, including crypsis (lines 362-369) and thermoregulation (lines 346-353).

      Finally, some of the results are weakly supported by statistics or questionable methodology.

      Most notably, the perception of the iridescence coloration of allopatric subspecies by bird visual systems. Although for females, means and errors (not indicated what exactly, SD, SE or CI) are clearly above the 1 JND line, for males, means are only slightly above this line and errors or CIs clearly overlap with the 1 JND line. Since there is no additional statistical support, higher means but overlap of SD, SE or CI with the baseline provides weak statistical support for differences.

      We thank the reviewer for raising the interpretation issues concerning the chromatic distances of allopatric Morpho species measured with a bird vision model. We made sure to be nuanced in the description of this graph in the results section (lines 208-212). Note that this addition does not change our main conclusion that Morpho and predator visual models discriminate iridescence differences better between allopatric subspecies than between sympatric species.

      We now also state clearly in the figure’s legend that the error bars represent the confidence intervals obtained by bootstrap analysis; the nature of the error bars was already stated in the methods (line 580).
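      To make the reviewer's point about error bars and the 1 JND line concrete, the following minimal Python sketch computes a percentile bootstrap confidence interval for a mean chromatic distance and checks whether its lower bound clears a threshold. It is an illustration only: the function names `bootstrap_ci_mean` and `exceeds_jnd`, the resampling of individual distances, and the default of 2000 resamples are assumptions made for this sketch, not the procedure used in the manuscript.

```python
import random
from statistics import mean


def bootstrap_ci_mean(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Hypothetical helper: resamples the observations with replacement
    `n_boot` times and returns the (alpha/2, 1 - alpha/2) percentiles
    of the resampled means.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    boot_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def exceeds_jnd(values, threshold=1.0, **kwargs):
    """True only if the whole interval lies above `threshold` (e.g. 1 JND)."""
    lower, _ = bootstrap_ci_mean(values, **kwargs)
    return lower > threshold
```

      Under this reading, a mean above 1 JND whose interval overlaps the threshold would return `False`, which corresponds to the weaker support the reviewer describes for males.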

      Regarding the assortative mating experiment, the results are clearly driven by M. bristowi. For M. theodorus, females mate equally often with conspecifics (6 times) as with M. bristowi (5 times). For males, the ratio is slightly better (6 vs 3), but with such low numbers, I doubt this is statistically testable. Overall low mating for M. bristowi could indicate suboptimal experimental conditions, and hence results should be interpreted with care.

      We recognize that the tetrad experiment results are mainly driven by M. bristowi’s behavior as already mentioned in the results (line 231-232) but we now also mention it in the discussion (lines 401-402). This experiment would have benefited from more replicates, but the limited access to live males and virgin females for both subspecies was a limiting factor. Fisher’s exact test used to assess assortative mating is specifically appropriate to small sample sizes. We recognize that the sampling size is not ideal, however it is still statistically testable.
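      As an illustration of why Fisher's exact test remains computable with small counts, here is a minimal Python sketch of a two-sided test on a 2x2 mating table, using only the standard library. The function name `fisher_exact_two_sided` and the example counts are hypothetical and are not the tetrad data; this is not the analysis code used for the manuscript.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table [[a, b], [c, d]].

    Rows could stand for male subspecies and columns for female subspecies
    (an assumption made for this sketch). With the margins fixed, the top-left
    count follows a hypergeometric distribution; the p-value sums the
    probabilities of all tables at most as likely as the observed one.
    """
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Illustrative counts only (not the experimental data):
p_value = fisher_exact_two_sided([[3, 1], [1, 3]])  # 34/70, about 0.486
```

      The test is exact at any sample size, but with counts this small its power is low, so a non-significant result says little about the absence of assortative mating.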

      Regarding the wing manipulation experiment, M. theodorus does not show a preference when dummies with non-modified wings are presented and prefers non-modified dummies over modified dummies. This is acknowledged by the authors but not further discussed. Certainly, some control treatment for wing modification could have been added.

      The use of controls to account for the effect of wing modification and the odor of the permanent marker was already mentioned in the methods (lines 636-639). Following your recommendation and comments from the other reviewers, we now mention the use of this control in the results (lines 278-283). We also address a potential issue that could have led live males to reject these modified dummies: we cannot be sure whether butterflies perceive these modifications as equivalent to natural coloration (lines 281-282). An additional control could have been used, adding black ink on the black dorsal parts of the pattern to assess its potential visual effect. The constraints on sampling unfortunately did not allow us to add another treatment.

      Overall, the fact that certain measurements only provide evidence for 1 of the 2 (sub)species (assortative mating, wing manipulation) or one sex of one of the species (bird visual systems) means overall interpretation and overgeneralization of the results to both allopatric or sympatric species should be done with care, and such nuances should ideally be discussed.

      The aim of the authors, "to investigate the antagonistic effects of selective pressures generated by mate recognition and shared predation" has not been achieved, and the conclusions regarding this aim are not supported by the results. Nevertheless, the iridescence colour measurements are solid, and some of the behavioural experiments and chemical profile measurements seem to yield interesting results. The study would benefit from less overinterpretation of the results in the framework of predation and more careful consideration of methodological difficulties, statistical insecurities, and nuances in the results.

      Overall, we would like to thank all reviewers for their thorough assessment of our work. We understand that the imbalance between mate choice data, visual model data and chemical data only gives us a partial assessment of species recognition in Morpho butterflies, thus requiring more precision in the interpretation and the discussion of our results. We made sure to add balanced interpretations in our discussion, by mentioning the lack of replicates for allopatric and sympatric populations (lines 391-392), and the lack of chemical characterization of allopatric species (lines 458-461, see previous comments) and by being more transparent on methodological limitations that we failed to convey in the first version of our manuscript. We brought nuance to our discussion and also discussed alternative hypotheses to predation to explain the convergence of iridescence found in sympatry.

      Reviewing Editor Comments:

      While all reviewers acknowledge the value of your data, they converge in their recommendations to tone down the evolutionary interpretations. Ideally, to test your main hypothesis, you would need several species pairs, or if only one, as in your case, replicated sympatric and allopatric sites for both species. Furthermore, your more specific hypotheses about convergence (vs. non-divergence), response to predators (vs. other environmental variables), and avoiding interspecific mating in sympatry (vs. not avoiding it in allopatry) would require appropriate alternative treatments/controls. We therefore recommend that you focus on those statements that you can support with your experiments and data, and introduce these statements in the introduction with reference to the appropriate literature.

      Reviewer #1 (Recommendations for the authors):

      (1) Line 25: This stated aim seems a bit off. The authors did not sensu stricto quantify 'how shared adaptive traits may shape genetic divergence' in this study. I suggest rewriting or deleting this whole sentence altogether. The study's aim is already clear in lines 29-34.

      We deleted the mention of the characterization of genetic divergence, since this study did not focus on any genetic analysis.

      (2) Line 34: The authors here state that they compared allopatric vs sympatric populations. This is strictly not true for M. achilles. Further, the results after this sentence focus solely on divergence/convergence in sympatry, nothing at the intraspecific level and implications of the findings.

      We now mention that we tested allopatric vs. sympatric species of M. helenor only (lines 28-29). We also mention that the behavioral experiments were based on intraspecific comparisons, and discuss the implications of this result in the discussion.

      (3) Line 35: 'convergence driven by predation': this is a strong statement and cannot be directly inferred from the present set of experiments. Consider toning it down.

      We added nuance to this statement by rephrasing it “suggesting that predation may favors local resemblance” (lines 32-33)

      (4) Line 36: Replace 'behavioral results' with 'behavioral experiments' or something similar.

      Corrected

      (5) Line 45-49: These opening statements need some citations.

      We provided references for the first few lines, by citing terHorst et al 2018 (line 44) underlining the importance of species interactions in trait evolution, and Blomberg et al 2003 (line 45) showing that closely-related species tend to resemble each other by quantifying the phylogenetic signal of various traits.

      (6) Line 83, 165: 'visual effect', not sure what the authors are referring to. Please rewrite.

      We defined “visual effect” as the way wing color patterns could be perceived by predators or mates. We removed mentions of “visual effect” and directly used its definition instead.

      (7) Line 105 onwards: This section of the introduction could benefit from more concise writing. The authors might consider reducing the number of specific examples and instead offering broader general statements, supported by citations from multiple studies.

      We reduced the number of examples given in this paragraph and used general statements supported by multiple citations as examples (lines 102-119).

      (8) Line 108-110: This sentence seems to be redundant with the previous one.

      We merged this sentence with the previous one to improve clarity (lines 103-105).

      (9) Line 140: 'with chemical defenses': include citations here.

      We added citations of Joron et al 1999 and Merrill et al 2014, which document the evolution of convergent wing patterns (mimicry) in butterfly species with chemical defenses.

      (10) Line 149: This is a bit of a stretch. Note that genetic divergence could be influenced by many other things, not only the processes that the authors examined.

      We agree with the reviewer that the study of the convergent vs. divergent evolution of visual cues is not enough to fully understand the mechanisms allowing genetic divergence between species. Because this paper does not focus on characterizing genetic divergence, we removed it from the manuscript to avoid oversimplification.

      (11) Line 151: Again. Here, the author's primary focus seems to be at an interspecific level. One is left to wonder about the need for comparisons at the intraspecific level in M. helenor and the implications. Please clarify.

      At the end of the introduction (lines 146-157), we specifically highlighted the importance of intraspecific comparisons. While studying the effect of sympatry on the evolution of the iridescent color pattern, we use this intraspecific comparison as a baseline to assess convergence or divergence of iridescence in a sympatric interspecific pair of Morpho, because under neutral evolution two subspecies are expected to be more similar than two different species (this assumption has been clarified in lines 147-148). We also used intraspecific mate choice to test for the use of visual cues in mate recognition (experiment 1) and to test what type of signal could be perceived by Morpho (the iridescent coloration or the iridescent pattern, experiments 2 and 3). These results help contextualize the interspecific mate choice experiment, which focused on determining whether visual cues could also be used in species recognition. Since we show that iridescent coloration is important in mate recognition at the intraspecific scale, this helps explain why species recognition is low at the interspecific scale, given the wing color convergence between M. helenor and M. achilles.

      (12) Line 154: 'signals on mate preferences'.

      Corrected.

      (13) Line 189: 'At the intraspecific level', maybe in the brackets include 'allopatric populations' just so the results are in a similar format as in the color contrast section below.

      We added details to make clearer that the intraspecific level is studied between allopatric Morpho populations (line 189).

      (14) Line 189-192: Please rearrange the figure (current B as A and vice versa) or present the results in order as in the figure (interspecific first and then intraspecific level).

      We rearranged Figure 3 so that the intraspecific comparison (allopatric population) appears as A and the interspecific level (sympatric population) appears as B, to follow the order of presentation in the main text.

      (15) Line 232: The motivation behind experiments 1, 2, and 3 is unclear. The authors have not made a strong point in the introduction about the need for these comparisons at an intraspecific level. Given that the authors are focused on divergence/convergence at an interspecific level, this set of experiments seems to be irrelevant to the present study. The implications of these findings are also not discussed.

      We added motivation for experiments 1, 2, and 3 in the introduction (lines 151-154) by stating that those experiments were used to assess whether blue color could indeed be used as a mating cue in Morpho helenor (experiment 1) and to try to understand what part of the visual signal is important in mate choice in Morpho helenor: the wing pattern (experiment 2) or the iridescent coloration (experiment 3). Although the motivation for these experiments was not detailed in the first version of our manuscript, we had already discussed the implications of the results of experiments 1, 2 and 3 in the discussion by stating that visual cues can take many forms and that considering both color AND pattern is important in understanding visual cues (lines 408-416). We carefully reworked this new version to make it more straightforward.

      (16) Line 260: Insert 'wild-type' before model to ensure similar wording as in the previous section.

      Corrected.

      (17) Line 286: Insert 'sympatric' after mimetic.

      Corrected.

      (18) Line 307: Include a reference to the figures or table where these results are presented.

      We now mention in the main text that the different proportions of beta-ocimene found between male M. helenor and M. achilles are shown in Table S2.

      (19) Line 343: These inferences are speculative. Add a line here, something like 'although this warrants further research in this species'.

      We detailed which additional experiments are needed in lines 388-396.

      (20) Line 357: The authors have not discussed their results on iridescence divergence in allopatric populations (line 190) and its implications.

      We now made clear in the beginning of the discussion that the divergence of iridescence in allopatric populations is used as a baseline to test for convergent iridescence between species (lines 339-343).

      (21) Line 361 onwards: This first paragraph is a bit confusing, as the results mainly focus on allopatry, while the title refers to sympatry.

      To avoid confusion between the title and the content of the discussion, we divided the last part of the discussion into two different parts. As the first paragraph mainly focuses on allopatry, we isolated it and titled it “Iridescent color patterns can be used as mate recognition cues in M. helenor” (line 498). The next paragraph of the discussion, focusing on the sympatric Morpho populations, has been titled “Evolution of visual and olfactory cues in mimetic sister-species living in sympatry” (line 418).

      (22) Line 383: visual cues 'as' poor species.

      Corrected.

      (23) Line 405: Why females here and not males? This is again confusing since the authors tested for male mate choice in the main experiments. Some background information on sex-specific mate choice in the methods might help.

      In this specific sentence, we talk about performing mate choice experiments to test for the discrimination of olfactory cues by females (and not males) because we found a high divergence in the chemical compounds found on male genitalia. Although female chemical compounds could also be used as a cue by males in mate recognition, olfactory mate choice is often driven by female choice in butterflies. We recognize that this perspective does not line up with the mate choice presented in our results section, which focused on male mate choice based on visual cues for ecological reasons (Morpho males, but not females, tend to be attracted to bright blue colorations) and technical reasons (in cages, females tend to hide away from the males or male dummies, and this behavior is not compatible with experiments involving flying around false males). In the discussion, we made sure to specify that the perspective we cite here is about testing the implications of divergence in male olfactory cues (line 454). We also added motivation for why we chose to investigate male (and not female) mate choice based on visual cues in the methods (lines 613-618) and in the results (lines 219-223).

      (24) Line 417: This inference is speculative. Consider toning it down.

      We rewrote the sentence: “We find evidence of converging iridescent patterns in sympatry suggesting that predation could play a major role in the evolution of iridescence. Further work is nevertheless needed to directly test this hypothesis and establish the importance of evasive mimicry in Morpho” (lines 465-468).

      (25) Line 429: 'Convergent trait evolution leads to mutualistic interactions enhancing coexistence'. Careful here. It is not very evident how convergent trait evolution (iridescence) is mutualistic in this case, as there is no experimental evidence for evasive mimicry yet. Consider rewording or toning this sentence down.

      We agree with the reviewer and removed this statement, only keeping the end of the sentence: “Altogether, this study addresses how convergence in one trait as a result of biotic interactions may alter selection on traits in other sensory modalities, resulting in a complex mosaic of biodiversity” (lines 479-481).

      (26) Line 442: Since the samples come from a breeding farm, I have a few questions. How are the authors sure about the location where the specimens were collected? How long have they been kept in captivity? Have they been subjected to any artificial selection? More details are needed here.

      Since M. helenor bristowi and M. helenor theodorus are only found in the wild in West and East Ecuador respectively, those M. helenor subspecies can only be collected in those two allopatric populations. Their phenotype is directly linked to their geographic distribution, which is how we confirmed their collection locality. The M. h. theodorus we used in this study were caught in East Ecuador in Tena, and the M. h. bristowi were caught in West Ecuador in Pedro Vicente Maldonado. We received pupae from the breeding farm, meaning that the Morpho used for the experiments were raised in captivity since their date of emergence. Upon emergence, they were transferred into cages for 4 to 5 days to wait for sexual maturity before performing the tetrad and mate choice experiments. This information was added to the methods (lines 490-496).

      (27) Line 476: Include some citations supporting this statement.

      We now cite Bennett and Théry (2007), reviewing avian color vision, and Briscoe (2008), characterizing the sensitivity of the photoreceptors found in the eyes of butterflies. Both citations show that the 300-700nm range is seen by avian and butterfly visual systems.

      (28) Line 480 onwards: Please clarify if the analysis used only one value (mean?) per species, sex, angle of measurement, and locality or included data from multiple individuals.

      The analyses of both colorimetric variables and global iridescence were performed using iridescence data from multiple individuals (10 males and 10 females from M. h. bristowi, M. h. theodorus, M. h. helenor and M. a. achilles), for which we measured iridescence at 21 angles of illumination. Sample sizes are given in lines 507, 515 and 540-542.

      (29) Line 510: Is there a specific reason that authors did not investigate achromatic contrasts? Provide some justification here. Or include the results of achromatic contrasts in the supplement.

      We added the achromatic results in the supplement and in the results (lines 200-204). For both the avian visual model and the Morpho visual model, the confidence intervals always overlapped with the JND threshold, showing that neither birds nor butterflies could theoretically discriminate the wing reflectance brightness in allopatric and sympatric populations.
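
The decision rule described above can be sketched as follows. This is a minimal illustration, not the authors' analysis: the function name, the threshold of 1 JND and the interval values are all assumptions made for the example, since the response does not state the threshold value or the estimated contrasts.

```python
def exceeds_jnd(ci_low, ci_high, threshold=1.0):
    """Return True only if the whole confidence interval lies above the threshold.

    ci_low, ci_high: bounds of the confidence interval of a contrast, in JND units.
    threshold: assumed discrimination threshold (1 JND is a common convention,
               not a value taken from the manuscript).
    """
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    # An interval that overlaps the threshold gives no evidence of discriminability.
    return ci_low > threshold


# Hypothetical intervals, for illustration only.
hypothetical_contrasts = {
    "interval entirely above threshold": (1.4, 2.3),
    "interval overlapping threshold": (0.6, 1.8),
}

for label, (low, high) in hypothetical_contrasts.items():
    print(label, "->", exceeds_jnd(low, high))
```

Under this rule, an interval that overlaps the threshold is read as "not theoretically discriminable", which is the reading applied to the achromatic contrasts in the response above.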

      (30) Line 552 onwards: I may have missed it. It is not entirely clear why the authors focused on male mate choice rather than female preference for visual cues. The authors should explicitly justify this choice and cite previous studies demonstrating that male mate choice, rather than female preference, is important in this species. This should be stated in the results section as well.

      We added a paragraph in the method (lines 613-618) to describe the ecological and technical reasons leading to testing only male mate choice using visual cues (also see our response to recommendation #23).

      (31) Line 537 onwards: What was the criterion used to score that mating had occurred? Why first mating and not how long they were mating? Please add these details.

      We stopped the experiment as soon as a male/female pair was formed by joining their genitalia (we added this information in the methods, lines 599-600). Since the tetrad experiment involves the interaction of two males and two females from different subspecies, we considered that mate choice happened before the formation of any couple, and is not necessarily dependent on how long they mate by observing their mating behavior. For instance, we witnessed avoidance behaviors from females that systematically hid their genitalia and refused to join their abdomen to some males, while being very ‘open’ to others (but did not quantify it).

      (32) Line 571: The authors used a black permanent marker to modify wing patterns but did not validate whether butterflies perceive these modifications as equivalent to natural coloration. It is possible that the alterations introduced unintended visual cues and may explain why most males rejected the dummies (line 267). The authors should acknowledge this limitation here.

      We now acknowledge this limitation in the method (lines 638-639) and in the results section (lines 278-283).

      (33) Line 591: Insert 'above' after protocol.

      Corrected.

      (34) Line 605: If the authors included random effects in their model, then it should be generalized linear mixed model (GLMM) and not GLM as they wrote.

      We did include a random effect in our model accounting for male ID and trial number; we have therefore replaced “GLM” with “GLMM” in the manuscript.

      (35) Line 615: This set of analyses does not seem to account for pseudo-replication, as the data were recorded from the same male more than once (Line 583). Please clarify and redo the analysis with the GLMM framework

      We ran new analyses using the GLMM framework: we used a binomial GLMM to test whether individuals preferentially interacted with dummy 1 vs. dummy 2 while accounting for pseudoreplication. The previously detected tendencies hold true with these new analyses, except for the visual mate discrimination of M. achilles: we now find statistical evidence that M. achilles tends to approach conspecifics more often during the mate choice experiment, although the signal is weak (lines 297-307). Indeed, while we previously concluded that both species in sympatry (M. helenor and M. achilles) could not discriminate their conspecific mates, we now emphasize that M. achilles is somewhat sensitive to some visual signals. However, its estimated probability of approaching a conspecific is only 0.54, which is low compared to the estimated probability of approaching (0.61) or touching (0.84) a con-subspecific for M. bristowi. We thus concluded that even though some visual cues could be relevant for mate recognition, they are less reliable for male choice in sympatric populations, where color patterns are more convergent, than in allopatric populations. We therefore updated Figure 4 and Figures S8 and S9, which now show the probability of approaching or touching a conspecific or con-subspecific with the updated p-values retrieved from the GLMM analyses. We also updated the results (lines 297-307) and the discussion (lines 430-438) to bring nuance to our previous results.
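
For readers less familiar with this framework, a binomial GLMM of the kind described here can be written as below. This is a sketch under stated assumptions rather than the authors' exact specification: the logit link and the intercept-only fixed part are assumed, and the two random intercepts (male ID and trial number) follow the response to recommendation #34.

```latex
% y_{ij} = 1 if male i interacts with the conspecific dummy in trial j, 0 otherwise
y_{ij} \sim \mathrm{Bernoulli}(p_{ij}), \qquad
\operatorname{logit}(p_{ij}) = \log\frac{p_{ij}}{1 - p_{ij}} = \beta_0 + u_i + v_j,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad v_j \sim \mathcal{N}(0, \sigma_v^2)

% Back-transforming the fixed intercept gives the estimated probability for a typical male and trial
\hat{p} = \frac{1}{1 + e^{-\hat{\beta}_0}}
```

Under these assumptions, a test of no preference is a test of \beta_0 = 0 (that is, p = 0.5), and an estimated probability of 0.54 corresponds to a log-odds of \ln(0.54/0.46) \approx 0.16. The random intercept u_i is what absorbs the repeated measurements taken on the same male.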

      (36) Line 963: Figure 3D. Is there a particular reason for comparing allopatric populations only within Ecuador rather than between Ecuador and French Guiana for M. helenor? Please clarify.

      We aimed at comparing the putative discrimination of blue coloration using visual models vs. what the butterflies actually discriminate using mate choice experiments. Since we only performed mate choice experiments involving M. h. bristowi x M. h. theodorus (allopatric populations within Ecuador) and M. h. helenor x M. a. achilles (sympatric population from Ecuador), we only looked at those comparisons using visual models. We added this clarification (lines 559-560).

      (37) Line 980: Are these predicted probabilities or just mean proportions as written in line 614? Then the label should be changed to 'Proportion of approaches' or something similar.

      Following our answer to recommendation #35, the points now represent the probability of touching a conspecific in the graph for each male, for every trial of every male tested. We corrected the legend of the figure. 

      Reviewer #2 (Recommendations for the authors):

      (1) Line 25: "...therefore facilitating co-existence in sympathy".

      Corrected.

      (2) Line 28: "contrasting" instead of contrasted.

      Corrected.

      (3) Line 33: begin a new sentence at the colon.

      Corrected.

      (4) Line 49: the phrase "habitat filtering" is unclear and should perhaps be defined or qualified.

      We replaced “habitat filtering” with its definition and cited Keddy (1992), describing the community assembly rules and defining habitat filtering (line 46).

      (5) Line 52: remove "even".

      Corrected.

      (6) Line 53: divergent suites may also result because traits are often constrained by genetic architecture (multivariate genetic covariances). This is discussed at length and specifically in relation to ornamental coloration by Kemp et al. 2023

      We rewrote the introduction and focused on only reviewing the ecological interactions promoting trait divergence in sympatric species, and did not mention genetics in this paper.

      (7) Line 87: (and throughout) refer to "colouration" or "colour pattern" rather than "colourations".

      Corrected.

      (8) Line 151: Remove "To do so,".

      Corrected.

      (9) Line 191: I would like to see the degrees of freedom for this test.

      We added the F-statistic=2.09 and the degrees of freedom df=1 of this test, and for all the following tests.

      (10) Line 201: (and throughout) replace "on" with "of".

      Corrected.

      (11) Line 205: modelling the visual properties of the wings allows one to infer what is theoretically visible/distinguishable. The modelling is useful but not necessarily definitive of vision/behaviour per se under different conditions in the wild. I therefore think it is appropriate to phrase the wording around the modelling approach more carefully. Perhaps refer to "theoretical" or "inferred" discriminability, or state (e.g.) that species should/should not be capable of perceiving differences based on the modelling data. You do this well in your wording of lines 207-209. This need not apply in the discussion because you're then dealing with the combination of modelling results and behaviour (mating trials).

      We agree with the reviewer that visual modelling only allows us to infer what is theoretically discriminated by the butterflies, and that the wording of our sentence was confusing. We therefore modified the sentence to reflect these points: “Morpho butterflies and predators can theoretically visually perceive the difference in the blue coloration between different subspecies of M. helenor…… using both bird and Morpho visual models” (lines 206-209).

      (12) Line 222: Either the chi-square test or Fisher's exact test should be sufficient (why report both?)

      The Chi-square test relies on large-sample assumptions (expected counts>5) whereas Fisher’s exact test does not and is valid even with small or unbalanced sample sizes. Since the M. bristowi female/M. h. theodorus male pairing only occurred 3 times, we do not meet the primary assumptions to apply a Chi-square test, although it is significant. We used a Fisher’s test to confirm the results. Using both and finding that both tests are significant shows that the results are robust, although they may appear redundant. To simplify, we removed the results of the Chi-square test and only kept the Fisher’s test in the methodology and the results.
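
The contrast between the two tests can be sketched in a few lines of standard-library Python. The 2x2 table below is hypothetical (the response does not report the full tetrad counts), and the function names are ours; the point is only to show how the expected-count check and the exact test are computed.

```python
from math import comb


def expected_counts(table):
    """Expected cell counts of a 2x2 table under independence."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[r * k / n for k in cols] for r in rows]


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table.

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def weight(x):
        # Unnormalised hypergeometric probability of x in the top-left cell.
        return comb(r1, x) * comb(r2, c1 - x)

    lowest, highest = max(0, c1 - r2), min(r1, c1)
    observed = weight(a)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / comb(n, c1)


# Hypothetical counts: rows = female subspecies, columns = male subspecies.
hypothetical_tetrad = [[8, 3], [1, 6]]

smallest_expected = min(min(row) for row in expected_counts(hypothetical_tetrad))
print("smallest expected count:", smallest_expected)
print("Fisher exact p-value:", fisher_exact_two_sided(hypothetical_tetrad))
```

When the smallest expected count falls below 5, as it does for this hypothetical table, the Chi-square approximation is not reliable, whereas the exact test remains valid because it enumerates the possible tables directly.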

      (13) Line 224 (and throughout): Degrees of freedom should be provided for statistical tests.

      We reported the statistic value and the degrees of freedom for all mentions of the statistical tests in the main text, except for Fisher’s exact test, which, being an exact test, does not rely on an asymptotic distribution such as the Chi-squared distribution.

      (14) Lines 266-267: This sentence has interest, but it is rather vague at present. Wouldn't your controls account for the effect of manipulation? This could be explained further.

      During our mate choice experiments, all Morpho female dummies used for the experiments were painted with black markers, either on their dorsal blue band to modify their blue iridescent phenotype, or on their ventral side, thus controlling for the effect of manipulation. However, we cannot rule out that the modification of the dorsal blue iridescence could have had a “repulsive” effect for males for several reasons. For example, depending on the visual discrimination of darker colors by Morphos, the painted black band could have a slightly different color compared to the dark “brown” usually surrounding their blue iridescent patterns. We now explain this in the results (lines 278-283) and in the methodology (lines 638-639).

      (15) Line 316: I'm not certain that the similarity is best described as "striking", given a P-value of 0.084 for this contrast

      We agree with the reviewer and removed this adjective for this line.

      (16) Lines 387-390: This sentence is puzzling because, theoretically speaking, we should expect selection on visual preference to be heightened (not relaxed) in sympatry if colouration is included among the traits used in mate selection. I'm not certain I have understood the meaning here.

      We would like to thank the reviewer for pointing out this typo. If shared predatory pressures favor convergent evolution of color pattern, then the visual signals become less reliable for species recognition. As a result, sexual selection on visual preference is heightened and becomes stronger, favoring the evolution of alternative cues used to discriminate conspecific mates. We changed the sentence and now write “the convergent evolution of iridescent wing patterns… may have negatively impacted visual discrimination and favored the evolution of divergent olfactory cues” (lines 457-458).

      (17) Line 529: Mating experiments. Given that these are quite large butterflies, I wondered whether a 3x3x2m cage would be sufficient in size to allow the expression of male courtship. A brief description of the courtship behaviour in these species or Morphos generally would be a useful addition to the paper.

      A cage this size was enough for the males to express a flight behavior similar to what can be seen in nature, while also being able to see the females (live females or dummies). We tried to perform mate choice experiments in a larger cage (7m x 5m x 3m) but the trials were not conclusive because males did not find the dummies depending on where they were flying in the cage. A 3m x 3m x 2m cage is a good compromise, maximizing interactions while still allowing enough space to fly. We now describe Morpho male behavior and female behavior in the methods (lines 613-618).

      (18) Line 546: Why are both tests needed (chi-square AND Fisher's exact)?

      As in our answer to recommendation #12, we used both tests to show robustness in the statistical results. We only kept the Fisher’s test results to simplify the results.

    1. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This valuable study investigates the role of HIF1a signalling in epicardial activation and neonatal heart regeneration in mice. Through a combination of genetic and pharmacological approaches, the authors show that stabilization of HIF1a enhances epicardial activation and extends the regenerative capacity of the heart beyond the typical neonatal window following myocardial infarction (MI). However, several aspects of the study remain incomplete and would benefit from further clarification and additional experimental support to solidify the conclusions.

      We reveal herein prolonged epicardial activation following myocardial infarction (MI) beyond post-natal days 1-7 (P1-P7) by genetic or pharmacological stabilisation of HIF-signalling. This extends the so-called “regenerative window” during an adult-like response to injury, leading to enhanced survived myocardium and functional improvement of the heart, even against a backdrop of persistent, albeit reduced, fibrosis. The epicardium is known to enhance cardiomyocyte proliferation and myocardial growth during heart development via trophic growth factor (for example, IGF-1, FGF, VEGF, TGFβ and BMP) signalling (reviewed in PMID:29592950) and epicardium-derived cell-conditioned medium reduces infarct size and improves heart function (PMID: 21505261). Further experiments, outside of the scope of the current study, are required to determine whether activated neonatal epicardium elicits similar paracrine support to sustain the myocardium and heart function after injury beyond P7 into adulthood.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript by Gamen et al. analyzed the functional role of HIF signaling in the epicardium, providing evidence that stabilization of the hypoxia signaling pathway might contribute to neonatal heart regeneration. By generating different conditional mouse mutants and performing pharmacological interventions, the authors demonstrate that stabilizing HIF signaling enhances cardiac regeneration after MI in P7 neonatal hearts.

      Strengths:

      The study presents convincing genetic and pharmacological approaches to the role of hypoxia signaling in enhancing the regenerative potential of the epicardium.

      Weaknesses:

      The major weakness is the lack of convincing evidence demonstrating the role of hypoxia signaling in EMT modulation in epicardial cells. Additionally, novel experimental approaches should be performed to allow for the translation of these findings to the clinical arena.

      We respectfully disagree that we have not convincingly demonstrated a role for HIF-signalling in promoting epicardial EMT. We adopt epicardial explant assays utilising a well characterised ex vivo protocol previously described for studying EMT in embryonic, neonatal and adult epicardium (PMID: 27023710, PMID: 12297106; PMID: 17108969, PMID: 19235142). These assays demonstrate in WT1<sup>CreERT2</sup>;Phd2<sup>fl/fl</sup> explants enhanced cobblestone to spindle-like change in cell morphology, increased cell migration, appearance of stress fibres and an up-regulation of the mesenchymal marker alpha-smooth muscle actin (αSMA); all parameters associated with EMT. In addition, our in vivo analyses of Wt1<sup>CreERT2</sup>;Phd2<sup>fl/fl</sup> hearts, in response to neonatal injury, reveal elevated numbers of WT1+ epicardial cells within the sub-epicardial region and underlying myocardium as is associated with active EMT and subsequent migration from the epicardium.

      Reviewer #2 (Public review):

      Summary:

      In this study, Gamen et al. investigated the roles of hypoxia and HIF1a signaling in regulating epicardial function during cardiac development and neonatal heart regeneration. They found that WT1<sup>+</sup> epicardial cells become hypoxic and begin expressing HIF1a from mid-gestation onward. During development, epicardial HIF1a signaling regulates WT1 expression and promotes coronary vasculature formation. In the postnatal heart, genetic and pharmacological upregulation of HIF1a sustained epicardial activation and improved regenerative outcomes.

      Strengths:

      HIF1a signaling was manipulated in an epicardium-specific manner using appropriate genetic tools.

      Weaknesses:

      There appears to be a discrepancy between some of the conclusions and the provided histological data. Additionally, the study does not offer mechanistic insight into the functional recovery observed.

      We respectfully disagree with the comment that our histological data does not support our conclusions and expand on this in the response to specific reviewer comments. We agree that further mechanistic experiments outside of the scope of the current study are required to identify precisely how activated neonatal epicardium results in increased healthy myocardium after injury beyond post-natal day 7 (P7).

      Reviewer #3 (Public review):

      Summary:

      The authors' research here was to understand the role of hypoxia and hypoxia-induced transcription factor Hif-1a in the epicardium. The authors noted that hypoxia was prevalent in the embryonic heart, and this persisted into neonatal stages until postnatal day 7 (P7). Hypoxic regions in the heart were noted in the outer layer of the heart, and expression of Hif-1a coincided with the epicardial gene WT1. It has been documented that at P7, the mouse heart cannot regenerate after myocardial infarction, and the authors speculated that the change in epicardial hypoxic conditions could play a role in regeneration. The authors then used genetic and pharmacological tools to increase the activity of Hif genes in the heart and noted that there was a significant improvement in cardiac function when Hif-1a was active in the epicardium. The authors speculated that the presence of Hif-1a improved cell survival.

      Strengths:

      A focus on hypoxia and its effects on the epicardium in development and after myocardial infarction. This study outlines the potential to extend the regenerative time window in neonatal mammalian hearts.

      We thank the reviewer for this positive endorsement and recognition of the importance of mechanistic insight into how to extend the window of neonatal heart regeneration.

      Weaknesses:

      While the observations of improved cardiac function are clear, the exact mechanism of how increased Hif-1a activity causes these effects is not completely revealed. The authors mention improved myocardium survival, but do not include studies to demonstrate this.

      We report an increase in healthy myocardium arising from prolonged activation of the epicardium during the neonatal window and following injury at post-natal day 7 (P7). We speculate this recapitulates the role of the epicardium during heart development, which is known to be a source of trophic growth factors that can enhance myocardial growth. Further experiments, outside the scope of this study, are required to define a mechanistic link between HIF-signalling, epicardial activation and myocardial survival in the setting of prolonged neonatal heart regeneration.

      There is an indication that fibrosis is decreased in hearts where Hif activity is prolonged, but there are no studies to link hypoxia and fibrosis.

      We believe the decreased fibrosis is a natural consequence of the increase in survived myocardium arising from the activated epicardium. There is strong precedent here following injury at post-natal day 1 (P1) in which fibrosis is evident early-on but is resolved over time with growth of the myocardium in the regenerating heart (PMID: 23248315).

      Recommendations for the authors:

      Reviewing Editor Comments:

      (1) Address issues related to image quality, colocalization, sample labeling, appropriate controls, and quantification - particularly in Figures 1, 2, 6, and Supplementary Figure 9. Increase sample size as noted by reviewers.

      The issues of co-localisation and sample labelling have been addressed under response to reviewers. We are unable to increase sample numbers but have clarified the number of regions per section and numbers of sections per heart analysed where appropriate.

      (2) Clarify the effects of epicardial HIF1a activation on neovascularization.

      We have removed reference in the abstract to an effect on neovascularisation.

      (3) Extend assessments of epicardial hypoxia and HIF1a expression to earlier embryonic stages, when epicardial EMT is more active.

      Our earliest timepoint of E12.5 marks the onset of epicardial EMT and E13.5 is the stage with the most significant mobilisation of epicardium-derived cells (EPDCs) into the sub-epicardial region and underlying myocardium (PMID: 32359445). In the same study, E11.5 lineage tracing of epicardial cells is restricted to outer layer of the heart; thus, our timepoints are representative in capturing both the onset and progression of in vivo EMT.

      (4) Strengthen EMT assays and mechanistic modeling. Provide evidence from physiologically relevant models, as current 2D culture assays do not adequately support conclusions about EMT. Include additional EMT markers and quantification where appropriate.

      We respectfully disagree that epicardial explants are not a valid assay for assessing EMT. As noted under responses to reviewers, such primary explants have been widely described elsewhere (PMID: 27023710, PMID: 12297106; PMID: 17108969, PMID: 19235142) and enable documentation of multiple parameters that are associated with active EMT, including an assessment of the extent of cell migration, cobblestone (epithelial) to spindle-like (mesenchymal) cell morphologies, stress fibre formation and expression of alpha-smooth muscle actin as a mesenchymal marker. We support our findings in explants by revealing reduced WT1+ epicardium-derived cells (EPDCs) in the sub-epicardial region and underlying myocardium of WT1<sup>CreERT2/+</sup>;Hif1a<sup>fl/fl</sup> embryonic hearts (data in Figure 2) indicative of impaired epicardial EMT and migration of EPDCs and in vivo following neonatal MI with pharmacological inhibition of PHD2, where we observe the reciprocal phenotype of increased numbers of epicardium-derived cells emerging from the outer epicardial layer (data in Figure 6).

      (5) Strengthen mechanistic insights into the role of epicardial cells in the functional recovery observed in MI hearts.

      We agree that further experiments, outside the scope of this study, are required to define a mechanistic link between HIF-signalling, epicardial activation and myocardial survival in the setting of prolonged neonatal heart regeneration.

      Reviewer #1 (Recommendations for the authors):

      The manuscript by Gamen et al. analyzed the functional role of HIF signaling in the epicardium, providing evidence that stabilization of the hypoxia signaling pathway might contribute to neonatal heart regeneration. By generating different conditional mouse mutants and performing pharmacological interventions, the authors demonstrate that stabilizing HIF signaling enhances cardiac regeneration after MI in P7 neonatal hearts. The study is potentially interesting, but it presents several major caveats.

      (1) One of the critical points reported in the early stages of this study is the early co-localization of Wt1, the hypoxic report (HP1), and HIF signaling pathways master regulators (i.e., HIF1a and HIF1b) during embryonic development. Figure 1 is meant to report such findings. However, unfortunately, I hardly see any co-localization at all in the Wt1+ epicardial cells for HP1; some colocalization is seen for HIF1 and 2 alpha, although none of these data are quantified. Thus, it is hard to believe such co-localization.

      We respectfully disagree with this comment. We highlight cells in Figure 1 that are co-stained for WT1 and HP1. In addition, we identify HIF1-α and HIF2-α positive cells which either reside within the epicardium, as the outer cell layer, or within the underlying sub-epicardial region, respectively.

      (2) The authors claimed that they have analyzed the expression of the hypoxic report, as well as Wt1 and the HIF signaling pathways master regulators (i.e., HIF1a and HIF1b) in the AV groove, as compared to the apex, in embryonic heart ranging from E12.5 to E18.5 (Figure 1). Unfortunately, all images provided that are tagged as AV groove are rather misleading. They do not represent the AV groove but part of the right ventricular free wall. If the authors want to refer to the AV groove, AV cushions should be visible underneath.

      We have removed specific reference to the AV groove and refer to the highlighted regions as the “Base” of the heart.

      (3) The authors analyzed the hypoxic condition of the developing heart from E12.5 to E18.5. However, it remains unclear why the authors only explored the hypoxic conditions from E12.5 onwards, since epicardial EMT mainly occurs earlier than this time point, i.e., E10.5 onwards. Therefore, it would be needed to explore it already at this earlier time point.

      We respectfully disagree with the reviewer and refer to the comment above regarding the fact that E12.5 marks the onset of epicardial EMT and E13.5 is the stage with the most significant mobilisation of epicardium-derived cells (EPDCs) into the sub-epicardial region and underlying myocardium (PMID: 32359445).

      (4) The authors reported a conditional mouse model of HIF1alpha deletion by using the Wt1CreERT2 driver. Curiously, Wt1 is dependent on hypoxia signaling (i.e., HIF1a). Therefore, it is unclear whether there is a negative feedback loop between the deletion of Hif1alpha and the activation of the Cre driver might have functional consequences. Convincing evidence should be provided that such crosstalk does not interfere with Hif1alpha inactivation, and therefore, appropriate controls should be run in parallel.

      We discount a negative feedback loop in this instance based on the fact we have utilised heterozygous mice for the WT1<sup>CreERT2/+</sup> line and observe a consistent and reproducible phenotype for the developing hearts on a Wt1<sup>CreERT2/+</sup>;Hif1a<sup>fl/fl</sup> background and following injury in Wt1<sup>CreERT2/+</sup>;Phd2<sup>fl/fl</sup> mice. Collectively this indicates that the WT1-CreERT2 driver is active in the context of diminishing HIF-1α and Phd2, respectively. In addition, we have carried out parallel experiments using epicardial explants derived from R26R-CreERT2;Phd2<sup>fl/fl</sup> (Figure 3) to circumvent any potential confounding issues; the results of which are consistent with increased epicardial EMT in support of our overall hypothesis.

      (5) On Figure 2a-f the authors reported that epicardial cells are diminished in Wt1CreERT2Hif1alpha mice as compared to controls. I am very sorry, but I do not see any difference. Furthermore, it is unclear to me how the authors quantified such differences, i.e., what marker signal did they use and how it was performed (Figure 2c and d)?

      We respectfully disagree with the reviewer and draw attention to the single channel panels of WT1+ staining in Figure 2, which show clear differences between numbers of epicardial cells in the mutant mice compared to controls (compare magenta cells in panel a versus panel b). Quantification was carried out for numbers of WT1+ cells residing within the PDPN-positive epicardium (and underlying PDPN-negative myocardium) across multiple images from multiple sections and multiple hearts.

      (6) On Figure 2g, the authors reported differences in total vessel length. Are they referring to impaired microvasculature development? Or is this analysis also including major coronary vessels? What about the major coronary vessels and trees, is there any affection?

      This analysis refers to the microvasculature and not the major coronary arteries or coronary trees.

      (7) The authors reported that there might be some differences in EMT markers, but unfortunately, all of them are analyzed on 2D cultures, where no substrate for EMT is present, i.e., an underlying ECM bed. Thus, the authors cannot claim that EMT is altered. Additional experiments using either collagen substrate and/or Matrigel are required to fully demonstrate that EMT is impaired. Furthermore, quantitative analyses of such differences should be provided.

      The 2D cultures are epicardial explants from mutant versus wild type hearts and represent a widely adopted, previously published ex vivo assay for investigating epicardial EMT across embryonic to adult stages (PMID: 27023710, PMID: 12297106, PMID: 17108969, PMID: 19235142), including an assessment of the extent of migration and cobblestone (epithelial) to spindle-like (mesenchymal) cell morphologies, stress fibre formation and expression of alpha-smooth muscle actin as a mesenchymal marker. We do not understand the comment regarding an “underlying ECM bed” as the cells exhibit EMT routinely on tissue culture plastic and will deposit their own ECM during the culture time course and in response to EMT/cell migration. In terms of quantification, this was carried out for scratch assay experiments, as a proxy for EMT and emergent mesenchymal cell migration, as presented in Figure 3i, j, with significantly enhanced scratch closure and cell migration following Molidustat treatment.

      (8) The description of data provided on Supplementary Figure 5 is spurious and should be removed. A note in the discussion might be sufficient.

      We respectfully disagree. The ChIP-seq data, in what is now Figure 2-figure supplement 3, highlight a HIF-1α binding site within the Wt1 locus, suggesting putative upstream regulation of WT1 by HIF-1α. This provides a potential explanation as to how HIF-1α may activate the epicardium through up-regulation of Wt1/WT1.

      (9) On Figure 3, the authors further illustrate the change of EMT markers using ex vivo cardiac explants. They reported increased expression of Snai2 that, although statistically significant, is most likely of no biological relevance (increase of only 20% at transcript level). What about Snai1, Prrx1, and other EMT promoters? Are they also induced? As previously stated, these 2D cultures do not provide supporting evidence that EMT is occurring, thus 3D gel assays should be performed in which Z-axis analyses will provide evidence on the different migratory behaviour of those cells.

      We respectfully suggest that a 20% change in Snai2 expression is biologically meaningful with respect to EMT. This in turn is supported by associated cell migration, reduced ZO-1 expression, increased stress fibres and increased alpha-SMA as a mesenchymal marker, all properties associated with active EMT. Other suggested markers have not been validated as formally required for EMT, for example Snai1 (PMID: 23097346). The migratory capacity of targeted versus epicardial cells was assessed by combined explant and scratch assay experiments.

      (10) The description of single-cell analyses is very incomplete. Which mice were used for these analyses, wildtype control, or hypoxic mice? Please provide a clearer description of the samples used. Additionally, the entire rationale of these analyses is dubious. Doing single-cell analyses to analyze a couple or three markers in a very small cell population is rather ridiculous. qPCR might be far more appropriate and convincing, or a bulk RNAseq analysis of isolated epicardial cells.

      The single-cell analyses represent an unbiased assessment of different pathways in epicardial cells (identified bioinformatically) between intact P1 and P7 stages in wild type (control) hearts, with a focus on hypoxia-related gene expression and HIF-dependent pathways. They were not designed to analyse a small number of genes, but rather global differences in the hypoxic states between P1 and P7 hearts. Selected genes (Vegfa, Pdk3, Egln1 (Phd2)) were analysed to highlight the key differences in hypoxic signalling across the regenerative window. The fact that the hearts were uninjured/intact is clarified in the text and legends for Figure 4 and what is now Figure 4-figure supplement 1.

      (11) The analyses provided in Figure 5 are very interesting and their findings are very relevant. However, I would think that the complementary experimental approach should also be done, i.e, MI followed by activation with tamoxifen, since that situation would be more realistic in the clinical setting.

      Tamoxifen causes respiratory failure in neonates with MI, so the two cannot be combined at the same time or soon after surgery. Moreover, tamoxifen takes significant time to take effect on targeted gene down-regulation which may negate sufficient activation of the epicardium following injury.

      The experiments in Figure 5 were designed to demonstrate that prolonged heart regeneration could be elicited in a cell-specific (epicardial-specific) manner via a genetic approach. The pharmacological experiments in Figure 6 are complementary in this regard by demonstrating equivalent effects with drug (Molidustat) delivery to reduce PHD2 and stabilise HIF post-MI.

      (12) In Figure 6, expression of Wt1 is highly prominent in P7 controls, mainly restricted to the epicardial lining while in the experimental setting, such Wt1 expression is broadly distributed on the subepicardial space, nicely demonstrating epicardial activation. However, it is very surprising to see such Wt1 expression in controls, something that is not expected, as compared to the data reported in Figure 4g. Could the authors please reconcile these findings?

      Figure 6 represents the injury setting and Figure 4g the intact setting (as clarified above, in the text and revised figure legends). Hence in the latter WT1 expression is significantly reduced in the P7 heart, as anticipated. With injury at P7 we anticipate activation of WT1 in control hearts, albeit restricted to the epicardial layer (as occurs in adult hearts, PMID: 21505261). In contrast, following Molidustat treatment of P7 hearts post-MI we observe extensive epicardial expansion into the sub-epicardial region and EPDC migration into the underlying myocardium (Figure 6b).

      Reviewer #2 (Recommendations for the authors):

      The role of hypoxia and HIF1a signaling in epicardial activation is an important topic, and the genetic approaches employed in this study are appropriate. However, several aspects of the study remain unclear and would benefit from further clarification or explanation by the authors:

      (1) The authors detected hypoxic regions using an anti-pimonidazole fluorescence-conjugated monoclonal antibody (HP1). The data would become more compelling if negative and positive controls were provided.

      We believe the HP1 staining is compelling in the images shown and is consistent with hypoxic regions of the developing heart. We reveal HP1 staining at cellular resolution with neighbouring cells positive and negative for the HP1 signal in the apex of the heart and within the epicardium and sub-epicardial regions at E12.5 (Figure 1a) and diminished/altered hypoxic/HP1 regional signal through subsequent developmental stages at E14.5-18.5 (Figure 1a-d).

      (2) Many HIF1a-positive cells in the AV groove region do not appear to overlap with HP1 staining (Figure 1a). Providing a low-magnification image of HIF1α expression would be helpful to better assess the extent of overlap with HP1 staining

      HIF-1 is highly unstable and hence detection of HIF-1+ cells will likely only sample of cells compared to HP1 which is a surrogate for broader regions of hypoxia.

      (3) Although the authors conclude that epicardial HIF1a deletion results in a significant reduction of WT1⁺ cells in both the epicardium and myocardium (Figure 2a-d), the provided images are not sufficiently clear to fully support this interpretation. Providing additional evidence to support this conclusion would be helpful.

      We respectfully disagree with the reviewer and draw attention to the single channel panels of WT1+ staining which show clear differences between numbers of epicardial cells in the mutant mice compared to controls (Figure 2a versus 2b; magenta WT1+ staining).

      (4) Similar to the point raised above, the authors' conclusion regarding the increased expression of WT1 following Molidustat treatment does not appear to be fully supported by the provided images (Figure 6b-f). Immunofluorescence staining for WT1 does not clearly demonstrate epicardial expression in the remote zone of either the control or Molidustat-treated hearts. In addition, while an increase of WT1<sup>+</sup> cells is observed in the infarct zone of the Molidustat-treated heart, it is somewhat unexpected that such expansion is not evident in the corresponding region of the control heart, given that epicardial cells typically expand near the infarct area. Clarification on these points would be helpful.

      Figure 6b reveals WT1 expression in controls (upper panel set) that is reactivated proximal to the infarct region (given WT1 is not expressed in adult epicardium) but restricted to the epicardial layer (as occurs in injured adult mouse hearts, PMID: 21505261). This contrasts with what is observed in the Molidustat-treated P7 hearts post-MI, where we observe epicardial expansion and migration of WT1+ cells into the underlying myocardium (Figure 6b, lower panel set, infarct zone).

      (5) The authors conclude that WT1<sup>+</sup> cells in the myocardial tissue exhibit endothelial identity based on the colocalization of WT1 and EMCN signals (Supplementary Figure 9c). However, this interpretation is difficult to assess, as WT1 is a nuclear marker and EMCN is a membrane protein, which makes precise colocalization challenging to confirm with confidence. Additional supporting evidence may be necessary to substantiate this conclusion.

      WT1 is known to be up-regulated in endothelial cells in response to injury, as shown previously in several studies (for example, PMID: 25681586). Here we show clear co-localisation of nuclear WT1 and cytoplasmic Endomucin (EMCN) in what is now Figure 6-figure supplement 1c and would encourage the reviewer and readers to magnify the image by zooming in on the relevant co-stained panel.

      (6) The authors conclude that activation of epicardial HIF1a signaling has no effect on neovascularization in postnatal MI hearts (Figure 5c). However, the abstract states: "Finally, a combination of genetic and pharmacological stabilisation of HIF ... increased vascularisation, augmented infarct resolution and preserved function beyond the 7-day regenerative window" (Lines 38-41). Clarification regarding this apparent discrepancy would be appreciated.

      The abstract has been altered to remove the statement of increased vascularisation.

      (7) The study appears somewhat incomplete, as it lacks mechanistic insight into the functional recovery observed following epicardial Phd2 deletion and Molidustat treatment in postnatal MI hearts. Although the authors suggest a potential paracrine role of the epicardium in protecting cardiomyocytes from apoptosis, this hypothesis has not been experimentally addressed. Incorporating such analysis would help to reinforce the study's conclusions.

      Further experiments, which are outside the scope of this study, are required to define a mechanistic link between the genetic or pharmacological stabilisation of HIF-signalling, epicardial activation and myocardial survival in the setting of prolonged neonatal heart regeneration.

      Other points:

      (1) Providing single-channel images for Figures 1a-d and 6g would be helpful for clarity and interpretation.

      We believe the combined channel views of co-staining for two markers, on a background of DAPI staining to pinpoint cell nuclei, are informative and support our conclusions.

      (2) Have the authors considered using AngioTool to quantify the number of vessels in Figure 5b-c?

      AngioTool™ was used to quantify the vessels, as we have done previously (PMID: 33462113), and this is now added to the methods and legend of Figure 2.

      Reviewer #3 (Recommendations for the authors):

      There are several areas where the manuscript can be improved, such that its conclusions can be solidified.

      (1) The authors highlight a point where blocking Phd2 can enhance survival of cardiac tissue, but did not report on survival markers. They surmised that apoptosis could be decreased in Phd2 mutant or Molidustat treatment but did not show this. The authors should determine if apoptosis is decreased in the myocardium and epicardium.

      We show evidence of increased levels of healthy myocardium in the genetic and pharmacological models of stabilised HIF-signalling. We exclude increased cardiac hypertrophy or increased cardiomyocyte proliferation as causative, so suggest as a reasonable alternative enhanced survival, albeit this need not necessarily be via an apoptotic pathway given the incidence of necrotic cell death during MI. We are unable to generate new surgeries and mutant/treated heart samples to analyse for apoptotic markers at this stage.

      (2) There appears to be no difference in cardiomyocyte proliferation in Molidustat-treated animals, but the experiment was only performed on 2 to 3 animals. This is too small a sample size to conclude from these results. The authors should increase the sample size to make this assertion.

      We respectfully disagree that we are unable to conclude no effect on cardiomyocyte proliferation. We analysed multiple heart regions per section, for EdU+/cTnT+ colocalised signals across several sections per heart, set against a consistency of effect on other parameters in hearts treated with Molidustat. We are unable to generate more P7 heart surgeries +/- Molidustat and +/- EdU at this stage.

      (3) It is curious as to how, after myocardial infarction, the fibrotic scar tissue is decreased in the Phd2 deletion but not as profound in Molidustat-treated mice at d21. Can the authors speculate why the difference exists and how this decrease arises? For example, are there decreased pro-inflammatory signals in Phd2 deleted mice? Is there decreased collagen deposition and ECM gene expression? Do macrophage recruitment into the infarct zone differ between mutant/treated vs WT?

      The representative images in Figure 6k reveal a trend towards reduced fibrosis with Molidustat treatment (Figure 6l), but across all hearts analysed this was not as significant as observed in the epicardial-specific deletion injured hearts (Figure 5g, h). This may be due to the relatively short half-life of Molidustat (approximately 4-10 hours, PMID: 32248614), the dosing regimen for the drug and/or the fact that it was not specifically delivered/targeted to the epicardium.

      (4) The magnified images in Figure 1 do not match the boxes in the whole heart images. It is unclear what the white boxes signify.

      The white boxes have been removed from Figure 1. The magnified image panels are from serial heart sections and this is now clarified in the Figure 1 legend.

    1. However, most societies do not value creative thinking and so our skills in generating ideas rapidly atrophies, as we do not practice it, and instead actively learn to suppress it (Csikszentmihalyi, M. (2014). Society, culture, and person: A systems view of creativity. Springer Netherlands). That time you said something creative and your mother called you weird? You learned to stop being creative. That time you painted something in elementary school and your classmate called it ugly? You learned to stop taking creative risks. That time you offered an idea in a class project and everyone ignored it? You must not be creative. Add up all of these little moments and where most people end up in life is possessing a strong disbelief in their ability to generate ideas.

      I agree with the idea that our society actively works to suppress creativity. This affirms my perspective that we often prioritize getting the right answers rather than thinking creatively in order to get a range of answers for a question. I think this because we, inherently, as humans think of things in black and white. If something isn't the "right" or "correct" idea, it is simply wrong. In reality, these answers may not be wrong and may just be different. Through my own experiences at school, I've seen how people are quick to shut down the idea generation process to just skip ahead to the solution. Especially with generative AI now, we're outsourcing our thinking. This is harmful because we need to be able to think. If we can't think, we can't create.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Weaknesses: 

      (1) The authors claim that choroidal neovascular tuft phenotypes are similar in TgfbrR1 KO and TgfbrR2 KO mice. However, the phenotypes look more severe in the TgfbrR1 KO rather than TgfbrR2 KO mice. Can the authors show a quantitative comparison of the number of choroidal neovascular tufts per whole eye cross-section in both genotypes? 

      Thank you for asking about this. Each VE-cad-CreER;TGFBR1 CKO/- and VE-cad-CreER;TGFBR2 CKO/- retina exhibits multiple zones of choroidal neovascularization. The examples in Figure 1 and Figure 1-Figure supplements 1 and 2 are mostly from retinas with loss of TGFBR1, but we could have chosen similar examples from retinas with loss of TGFBR2. The quantification in the original version of Figure 1-Figure supplement 1 panel C had a labeling error. It actually showed the quantification of choroidal neovascularization (CNV) in the sum of both VE-cad-CreER;TGFBR1 CKO/- and VE-cad-CreER;TGFBR2 CKO/- retinas, not only in VE-cad-CreER;TGFBR1 CKO/- retinas as originally labeled. The point that it made is that CNV is seen with loss of TGF-beta signaling but not in control retinas or retinas with loss of Norrin signaling. We have now updated that plot by separating the data points for VE-cad-CreER;TGFBR1 CKO/- and VE-cad-CreER;TGFBR2 CKO/- retinas, so that they can be compared to each other. The result shows ~2.5-fold more CNV in VE-cad-CreER;TGFBR2 CKO/- retinas compared to VE-cad-CreER;TGFBR1 CKO/-. We think it likely that a more extensive sampling would show little or no difference between these two genotypes, but the data are what they are. This is now described in the Results section.

      We have also added a panel D to Figure 1- Figure supplement 1, which shows a retina flatmount analysis of CNV.  This is done by mounting the retina with the photoreceptor side up so that the outer retina can be optimally imaged. 

      (2) In the analysis of Sulfo-NHS-Biotin leakage in the retina to assess blood-retina barrier maturation. The authors claim that there is increased vascular leakage in the TgfbR1 KO mice. However, it does not seem like Sulfo-NHS-biotin is leaking outside the vessels. Therefore, it cannot be increased vascular permeability. Can the authors provide a detailed quantification of the leakage phenotype? 

      Thank you for raising this point. Your comment prompted us to look at this question in greater depth with more experiments. We have expanded Figure 2 to show and quantify a comparison between control (i.e. phenotypically WT), NdpKO, and TGFBR1 endothelial KO retinas, and we have expanded the associated part of the Results section (Figure 2C and D). In a nutshell, control retinas show little Sulfo-NHS-biotin accumulation in or around the vasculature or in the parenchyma; NdpKO retinas show Sulfo-NHS-biotin accumulation in the vasculature and in the parenchyma (i.e., the area between the vessels); and VE-cad-CreER;TGFBR1 CKO/- retinas show Sulfo-NHS-biotin accumulation in the vascular tufts with minimal accumulation in the non-tuft vasculature and minimal leakage into the parenchyma. The conclusion is that the bulk of the retinal vasculature in TGFBR1 endothelial KO mice is minimally or not at all leaky, which is very different from the situation with loss of Norrin/Frizzled4 signaling.

      (3) The immune cell phenotyping by snRNAseq is premature, as the number of cells is very small. The authors should sort for CD45+ cells and perform single-cell RNA sequencing. 

      Thank you for raising this point. For the revised manuscript, we have performed additional snRNAseq analyses using the same tissue processing protocol as for our original snRNAseq data. We have opted to homogenize the tissue and prepare nuclei (our original method) rather than dissociate the tissue and FACS sort for CD45+ cells because the nuclear isolation approach is unbiased: we assume that nuclei from all cell types are present after tissue homogenization. By contrast, we cannot be certain that CD45 FACS will capture the full range of immune cells since some cells may not express CD45, may express CD45 at a low level, or may be tightly adherent to other cells, such as vascular endothelial cells. Additionally, by following the original protocol, we can combine the original snRNAseq dataset and the new snRNAseq dataset. In the revised manuscript we present the snRNAseq data from the combination of the original and the more recent snRNAseq datasets (revised Figure 4; N=628 immune cell nuclei). The new analysis comes to the same conclusions as the original analysis: the immune cell infiltrate in the mutant retinas is composed of a wide variety of immune cells.

      (4) The analysis of BBB leakage phenotype in TgfbR1 KO mice needs to be more detailed and include tracers as well as serum IgG leakage. 

      As described in our response to query 2, we have conducted additional experiments to look at vascular leakage in control, VE-cad-CreER;TGFBR1 CKO/-, and NdpKO retinas. We have also looked at Sulfo-NHS-biotin leakage in the VE-cad-CreER;TGFBR1 CKO/- brain, and it is indistinguishable from WT controls. Since Sulfo-NHS-biotin is a low MW tracer (<1,000 Da), this implies that loss of TGF-beta signaling does not increase non-specific diffusion of either low or high MW molecules. Therefore, the elevated levels of IgG in the brain parenchyma in young VE-cad-CreER;TGFBR1 CKO/- mice (Figure 8A) likely represent specific transport of IgG across the BBB. Such transport is known to occur via Fc receptors expressed on vascular endothelial cells, although it is normally greater in the brain-to-blood direction than in the blood-to-brain direction. For example, see Lafrance-Vanasse et al (2025) Leveraging neonatal Fc receptor (FcRn) to enhance antibody transport across the blood brain barrier. Nat Commun. 16:4143. This is now described in greater detail in the Results section.

      (5) A previous study (Zarkada et al., 2021, Developmental Cell) showed that EC-deletion of Alk5 affects the D tip cells. The phenotypes of those mice look very similar to those shown for TgfbrR1 KO mice. Are D-tip cells lost in these mutants by snRNAseq? 

      Please note: Alk5 is another name for TGFBR1. This is noted in the second sentence of paragraph 4 of the Introduction. The reviewer is correct: there are a lot of similarities because these are exactly the same KO mice. Also, Zarkada et al. and we used the same VE-cad-CreER to recombine the CKO allele. The proposed snRNAseq analysis would serve as an independent check on the diving (D) tip vs stalk cell analyses published in Zarkada et al (2021) Specialized endothelial tip cells guide neuroretina vascularization and blood-retina-barrier formation. Dev Cell 56:2237-2251. We have not gone in this direction because the question of tip vs. stalk cells and of subtypes of tip cells in WT vs. mutant retinas is beyond our focus on choroidal neovascularization and the role of immune cells and vascular inflammation. The proposed snRNAseq analysis would also require a major effort since tip cells are rare and must be harvested from large numbers of early postnatal retinas followed by FACS enrichment for vascular endothelial cells. Finally, we have no reason to doubt the results of Zarkada et al.

      Reviewer #2 (Public review): 

      Summary:

      The authors meticulously characterized EC-specific Tgfbr1, Tgfbr2, or double knockout in the retina, demonstrating through convincing immunostaining data that loss of TGF-β signaling disrupts retinal angiogenesis and choroidal neovascularization. Compared to other genetic models (Fzd4 KO, Ndp KO, VEGF KO), the Tgfbr1/2 KO retina exhibits the most severe immune cell infiltration. The authors proposed that TGF-β signaling loss triggers vascular inflammation, attracting immune cells - a phenotype specific to CNS vasculature, as non-CNS organs remain unaffected. 

      Strengths: 

      The immunostaining results presented are clear and robust. The authors performed well-controlled analyses against relevant mouse models. snRNA-seq corroborates immune cell leakage in the retina and vascular inflammation in the brain. 

      Weaknesses: 

      The causal link between TGF-β loss, vascular inflammation, and immune infiltration remains unresolved. The authors' model posits that EC-specific TGF-β loss directly causes inflammation, which recruits immune cells. However, an alternative explanation is plausible: Tgfbr1/2 KO-induced developmental defects (e.g., leaky vessels) permit immune extravasation, subsequently triggering inflammation. The observations that vein-specific upregulation of ICAM1 staining and the lack of immune infiltration phenotypes in the non-CNS tissues support the alternative model. Late-stage induction of Tgfbr1/2 KO (avoiding developmental confounders) could clarify TGF-β's role in retinal angiogenesis versus anti-inflammation. 

      Thank you for raising this point. Your comment prompted us to look at this question in greater depth with more experiments. We have expanded Figure 2 to show and quantify a comparison between control (i.e. phenotypically WT), NdpKO, and TGFBR1 endothelial KO retinas, and we have expanded the associated part of the Results section (Figure 2C and D). In a nutshell, control retinas show little Sulfo-NHS-biotin accumulation in or around the vasculature or in the parenchyma; NdpKO retinas show Sulfo-NHS-biotin accumulation in the vasculature and in the parenchyma (i.e., the area between the vessels); and VE-cad-CreER;TGFBR1 CKO/- retinas show Sulfo-NHS-biotin accumulation in the vascular tufts with minimal accumulation in the non-tuft vasculature and minimal leakage into the parenchyma. The conclusion is that the bulk of the retinal vasculature in TGFBR1 endothelial KO mice is minimally or not at all leaky, which is very different from the situation with loss of Norrin/Frizzled4 signaling.

      In the revised manuscript, we have expanded the Discussion section to address the two alternative hypotheses raised by the reviewer. Here are the relevant data in a nutshell: (1) vascular leakage into the parenchyma, as measured with Sulfo-NHS-biotin, in TGFBR1 endothelial CKO retinas is far less than in NdpKO retinas, where nearly all ECs convert to a fenestration+ (PLVAP+) phenotype and there is leakage of Sulfo-NHS-biotin, (2) ICAM1 in ECs in TGFBR1 endothelial CKO retinas increases several-fold more than in NdpKO or Frizzled4KO retinas, (3) TGFBR1 endothelial CKO retinas have more infiltrating immune cells than NdpKO or Frizzled4KO retinas, and (4) in TGFBR1 endothelial CKO retinas large numbers of immune cells are observed within and adjacent to blood vessels. We think that the simplest explanation for these data is that loss of TGF-beta signaling in ECs causes an endothelial inflammatory state with enhanced immune cell extravasation. That said, the case for this model is not water-tight, and there could be less direct mechanisms at play. In particular, this model does not explain why the inflammatory phenotype is limited to CNS (and especially retinal) vasculature.

      Regarding the last sentence of the reviewer’s comment (“Late stage induction…”), we have tried activating CreER recombination at different ages and we observe a large reduction in the inflammatory phenotype when recombination is initiated after vascular development is complete.   This observation suggests that the vascular developmental/anatomic defect – and perhaps the resulting retinal hypoxia response – is required for the inflammatory phenotype.  In the revised manuscript we have expanded the Results and Discussion sections to describe this observation.

      Reviewer #1 (Recommendations for the authors): 

      Suggestions for experiments: 

      (1) The authors need to show a quantitative comparison of the number of choroidal neovascular tufts per whole eye crosssection in both genotypes (TgfbR1 and TgfbR2 KO mice). 

      Thank you for raising this point. The quantification in the original version of Figure 1-Figure supplement 1 panel C was mislabeled. It quantifies choroidal neovascularization (CNV) in both VE-cad-CreER;TGFBR1 CKO/- and VE-cad-CreER;TGFBR2 CKO/- retinas, not VE-cad-CreER;TGFBR1 CKO/- retinas only as originally labeled. The point it makes is that CNV is seen with loss of TGF-beta signaling but not in control retinas or retinas with loss of Norrin signaling. We have now corrected that plot by separating the data points for VE-cad-CreER;TGFBR1 CKO/- and VE-cad-CreER;TGFBR2 CKO/- retinas, so that they can be compared to each other. The result shows ~2.5-fold more CNV in VE-cad-CreER;TGFBR2 CKO/- retinas compared to VE-cad-CreER;TGFBR1 CKO/-. This is now described in the Results section.

      (2) In the analysis of Sulfo-NHS-Biotin leakage in the retina to assess blood-retina barrier maturation. The authors should provide a detailed quantification of the leakage phenotype outside the vessels into the CNS parenchyma, both in the retina and brain, in TgfbR1 KO mice. 

      Thank you for raising this point. There is no detectable Sulfo-NHS-biotin leakage into the brain parenchyma in VE-cad-CreER;TGFBR1 CKO/- mice. We have expanded Figure 2 to show and quantify the data for retinal vascular leakage (Figure 2C and D). The data show that in VE-cad-CreER;TGFBR1 CKO/- mice there is accumulation of Sulfo-NHS-biotin in the vascular tufts but minimal accumulation elsewhere in the retinal vasculature and minimal leakage of Sulfo-NHS-biotin into the retinal parenchyma.

      (3) The immune cell phenotyping by snRNAseq is premature, as the number of cells is very small. The authors should sort for CD45+ cells and perform single-cell RNA sequencing to ascertain these preliminary data. 

      Thank you for raising this point. We have performed additional snRNAseq analyses using the same tissue processing protocol as for our original snRNAseq data to increase the numbers of cells. We have opted to homogenize the tissue and prepare nuclei (our original method) rather than dissociating the cells and FACS sorting for CD45+ cells because the nuclear isolation approach is unbiased – we assume that nuclei from all cell types are present. By contrast, we cannot be certain that CD45 FACS will capture the full range of immune cells, since some cells may not express CD45, may express CD45 at a low level, or may be tightly adherent to other cells, such as vascular endothelial cells. Additionally, by following the original protocol, we can combine the original snRNAseq dataset and the new snRNAseq dataset. In the revised manuscript we present the snRNAseq data from the combination of the original and the more recent snRNAseq datasets (revised Figure 4; N=628 immune cell nuclei). The new analysis comes to the same conclusion as in the original submission, namely that the immune cell infiltrate in the mutant retinas is composed of a wide variety of immune cells. The Results section has been expanded to describe these new data and analysis.

      (4) The analysis of BBB leakage phenotype in TgfbR1 KO mice needs to be more detailed and include tracers as well as serum IgG leakage. 

      Sulfo-NHS-biotin leakage in the VE-cad-CreER;TGFBR1 CKO/- brain is minimal, and it is indistinguishable from WT controls. Since Sulfo-NHS-biotin is a low MW tracer (<1,000 Da), this implies that loss of TGF-beta signaling does not increase non-specific diffusion of either low or high MW molecules. Therefore, the elevated levels of IgG in the brain parenchyma in young VE-cad-CreER;TGFBR1 CKO/- mice (Figure 8A) likely represent specific transport of IgG across the BBB. Such transport is known to occur via Fc receptors expressed on vascular endothelial cells, although it is normally greater in the brain-to-blood direction than in the blood-to-brain direction. For example, see Lafrance-Vanasse et al (2025) Leveraging neonatal Fc receptor (FcRn) to enhance antibody transport across the blood brain barrier. Nat Commun. 16:4143. This is now described in greater detail in the Results section.

      (5) The authors should perform a more detailed RNAseq analysis of tip and stalk cells in TgfbR1 KO mice to determine whether D tip cells are lost in these mutants by snRNAseq.

      The proposed snRNAseq analysis would serve as an independent check on the diving (D) tip vs stalk cell analyses published by Zarkada et al, who analyzed the same VE-cad-CreER;TGFBR1 CKO/- mutant mice, although they refer to the TGFBR1 gene by its alternate name ALK5 [Zarkada et al (2021) Specialized endothelial tip cells guide neuroretina vascularization and blood-retina-barrier formation. Dev Cell 56:2237-2251].  We have not gone in this direction because the question of tip vs. stalk cells and of subtypes of tip cells in WT vs. mutant retinas is beyond our focus on choroidal neovascularization and the role of immune cells and vascular inflammation.  The proposed snRNAseq analysis would also require a major effort since tip cells are rare and must be harvested from large numbers of early postnatal retinas followed by FACS enrichment for vascular endothelial cells.

      Suggestions for improving the manuscript:  

      (6) The statement that ECs acquire properties of immune cells (Page 2, Line 90) is incorrect. Endothelial cells may acquire characteristics of antigen presenting cells. 

      Thank you for that correction.  Based on the review from Amersfoort et al (2022) (Amersfoort J, Eelen G, Carmeliet P. (2022) Immunomodulation by endothelial cells - partnering up with the immune system? Nat Rev Immunol 22:576-588) and the articles cited in it, we have changed the sentence to “Although vascular endothelial cells (ECs) are not generally considered to be part of the immune system, in some locations and under some conditions they acquire properties characteristic of immune cells, including secretion of cytokines, surface display of co-stimulatory or co-inhibitory receptors, and antigen presentation in association with MHC class II proteins (Pober and Sessa, 2014; Amersfoort et al., 2022).”  

      (7) The statement in Page 3, Line 100-101 [In CNS ECs, quiescence is maintained in part by the actions of astrocyte-derived Sonic Hedgehog, with the result that few immune cells other than resident microglia are found within the CNS (Alvarez et al., 2011).] is incomplete. Wnt signaling also suppresses the expression of leukocyte adhesion molecules from endothelial cells and therefore helps with immune cell quiescence. 

      Thank you for raising that point.  We have expanded that sentence to include Wnt signaling in CNS endothelial cells, as described in the following reference: Lengfeld JE, Lutz SE, Smith JR, Diaconu C, Scott C, Kofman SB, Choi C, Walsh CM, Raine CS, Agalliu I, Agalliu D. (2017) Endothelial Wnt/beta-catenin signaling reduces immune cell infiltration in multiple sclerosis. Proc Natl Acad Sci USA 114:E1168-E1177.

      (8) It may be beneficial for the reader to separate the results of the vascular phenotypes related to choroidal neovascularization compared to retinal vascular development. 

      Thank you for this suggestion. The two topics are partly overlapping: choroidal neovascularization is described in Figure 1, and retinal development is described in Figures 1 and 2. The challenge is that some of the same images illustrate both phenotypes, as in Figure 1, so the topics cannot be easily separated.

      (9) In addition to comparing the phenotypes in Tgfb signaling mutant mice with Wnt signaling and VEGF-A signaling mutants, the authors should compare and contrast their data with those found in Alk5 KO mice, as there are a lot of similarities. 

      The reviewer has alerted us to a nomenclature challenge which we will try to resolve in the introduction: Alk5 is just another name for TGFBR1.  The reviewer is correct: there are a lot of similarities between the present study and that of Zarkada et al (2021) because both use the same TGFBR1(=Alk5) CKO mice.

      Reviewer #2 (Recommendations for the authors): 

      Figure 2 

      For 2B, the authors should clarify whether the two regions shown in the Tgfbr1 KO retina (P14) represent central vs. peripheral areas, as phenotype severity varies. 

      For 2C, does the uneven biotin accumulation reflect developmental gradients (e.g., central-peripheral maturation timing)? 

      Thank you for raising these points.  Regarding Figure 2B, these images are all from the mid-peripheral retina, where the phenotype is moderately severe.  This is now noted in the figure legend.

      Regarding Figure 2C, the reviewer is correct that the pattern of Sulfo-NHS-biotin is uneven in VE-cad-CreER;TGFBR1 CKO/- retinas – it accumulates only in the tufts. We have expanded Figure 2C to show a comparison between control (i.e. phenotypically WT), NdpKO, and TGFBR1 endothelial KO retinas, and we have expanded the associated part of the Results section. In a nutshell, control retinas show little Sulfo-NHS-biotin accumulation in the vasculature or in the parenchyma; NdpKO retinas show Sulfo-NHS-biotin accumulation in the vasculature and in the parenchyma (i.e., the area between the vessels); and VE-cad-CreER;TGFBR1 CKO/- retinas show Sulfo-NHS-biotin accumulation in the vascular tufts with minimal accumulation in the non-tuft vasculature and minimal leakage into the parenchyma. The conclusion is that the bulk of the retinal vasculature in TGFBR1 endothelial KO mice is not leaky – very different from the situation with loss of Norrin/Frizzled4 signaling.

      Figure 6 

      The claim that PECAM1+ rings on veins reflect EC-immune cell binding is uncertain, as PECAM1 is also known to be expressed by immune cells. The complete correlation of PECAM1 and CD45 staining signals suggests that a subset of immune cells upregulates PECAM1. The VEcadCreER;Tgfbr1 flox/-; SUN1:GFP reporter would be helpful to delineate EC-immune cell proximity. Super-resolution imaging with Z-stacks could also resolve spatial relationships (luminal vs. abluminal immune cell adhesion).

      Thank you for this comment.  The reviewer is correct that, at the resolution of these images, we cannot determine whether the PECAM1 immunostaining signal is derived from ECs, from leukocytes, or from both.  This is now stated in the Results section.  The PECAM1-rich endothelial ring structure associated with leukocyte extravasation has been characterized in various publications, for example in (1) Carman CV, Springer TA. (2004) A transmigratory cup in leukocyte diapedesis both through individual vascular endothelial cells and between them. J Cell Biol 167:377-388 and (2) Mamdouh Z, Mikhailov A, Muller WA. (2009) Transcellular migration of leukocytes is mediated by the endothelial lateral border recycling compartment. J Exp Med 206:2795-2808.  The ring structures visualized in Figure 6D by PECAM1 immunostaining conform to the ring structures described in these and other papers.  In showing these structures, our point is simply that they likely represent sites of leukocyte extravasation.  This is now clarified in the text.  We have also added some additional references on leukocyte extravasation and the ring structures.

      Figure 7 

      A time-course analysis of ICAM1 would strengthen the mechanistic model. Does ICAM1 upregulation precede immune infiltration (supporting inflammation as the primary defect)? Given that immune cells appear by P14 (per snRNA-seq), is ICAM1 elevated earlier? 

      This is an interesting idea, but based on what is known about leukocyte adhesion and extravasation we predict that there will not be a clean temporal separation between ICAM1 induction and leukocyte adhesion/infiltration.  That is, if the proinflammatory state causes an increase in the number of leukocytes, then as ICAM1 levels increase, leukocyte adhesion would also increase.  Similarly, if the presence of leukocytes increases the pro-inflammatory state, then as the number of leukocytes increases, the levels of ICAM1 would be predicted to increase.  Thus, we think that a time course analysis is unlikely to provide a definitive conclusion.

      Figure 8-SF1 

      In brain slices, a transient pan-IgG accumulation suggests a self-resolving defect in the BBB. However, this BBB impairment appears to be spatiotemporally distinct from ICAM1 upregulation. ICAM1 staining is restricted to the lesion site, aligning with immune cell-driven inflammation. 

      Thank you for raising these points.  The reviewer is correct that these observations don’t fit together in a clear way.  There does not appear to be a general increase in brain vascular permeability in VE-cad-CreER;TGFBR1 CKO/- mice, as shown by sulfo-NHS-biotin.  However, there is a large and transient increase in IgG in the brain parenchyma, suggestive of a general vascular alteration, and – as the reviewer correctly notes – it is not accompanied by a generalized increase in ICAM1 vascular immunostaining.  At this point, we don’t have any real insight into the mechanistic basis of the transient IgG increase.

      Thank you for handling this manuscript.

    1. Reviewer #1 (Public review):

      Summary:

      Zhang et al. addressed the question of whether advantageous and disadvantageous inequality aversion can be vicariously learned and generalized. Using an adapted version of the ultimatum game (UG), in three phases, participants first gave their own preference (baseline phase), then interacted with a "teacher" to learn their preference (learning phase), and finally were tested again on their own (transfer phase). The key measure is whether participants exhibited similar choice preference (i.e., rejection rate and fairness rating) influenced by the learning phase, by contrasting their transfer phase and baseline phase. Through a series of statistical modeling and computational modeling, the authors reported that both advantageous and disadvantageous inequality aversion can indeed be learned (Study 1), and even be generalised (Study 2).

      Strengths:

      This study is very interesting: it directly adapts the lab's previous work on the observational learning effect on disadvantageous inequality aversion to test both advantageous and disadvantageous inequality aversion. Social transmission of action, emotion, and attitude has started to be looked at recently, hence this research is timely. The use of computational modeling is mostly appropriate and motivated. Study 2, which examined vicarious inequality aversion in conditions where feedback was never provided, is interesting and important for strengthening the reported effects. Both studies have proper justifications for the sample size.

      Weaknesses:

      Despite the strengths, a few conceptual aspects and analytical decisions have to be explained, justified, or clarified.

      INTRODUCTION/CONCEPTUALIZATION

      (1) Two terms seem to be used interchangeably in this work, but should not be: vicarious/observational learning vs preference learning. For vicarious learning, individuals observe others' actions (and optionally also the corresponding consequence resulting directly from their own actions), whereas, for preference learning, individuals predict, or act on behalf of, the others' actions, and then receive feedback on whether that prediction is correct. For the current work, it seems that the experiment is more about preference learning and prediction, and less so about vicarious learning. But the introduction and setup are framed heavily around vicarious learning, and later the use of vicarious learning and preference learning is rather mixed in the text. I think the authors should either tone down the focus on vicarious learning, or discuss how the two differ. Some of the references here may be helpful: Charpentier et al., Neuron, 2020; Olsson et al., Nature Reviews Neuroscience, 2020; Zhang & Glascher, Science Advances, 2020

      EXPERIMENTAL DESIGN

      (2) For each offer type, the experiment "added a uniformly distributed noise in the range of (-10 ,10)". I wonder what this looks like? With only integers such as 25:75, or even with decimal points? More importantly, is it possible for either the 70:30 or the 90:10 option, after adding the noise, to have generated an 80:20 split shown to the participants? If so, for the analyses later, when participants saw the 80:20 split, which condition did this trial belong to? 70:30 or 90:10? And is such noise added only to the learning phase, or also to the baseline/transfer phases? This requires some clarification.

      (3) For the offer conditions (90:10, 70:30, 50:50, 30:70, 10:90) - are they randomized? If so, how is it done? Is it randomized within each participant, and/or also across participants (such that each participant experienced different trial sequences)? This is important, as the order, especially for the learning phase, can largely impact the preference learning of the participants.

      STATISTICAL ANALYSIS & COMPUTATIONAL MODELING

      (4) In Study 1 DI offer types (90:10, 70:30), the rejection rate for DI-AI averse looks consistently higher than that for DI averse (ie, blue line is above the yellow line). Is this significant? If so, how come? Since this is a between-subject design, I would not anticipate such a result (especially for the baseline). Also, for the LME results (eg, Table S3), only interactions were reported but not the main results.

      (5) I do not particularly find this analysis appealing: "we examined whether participants' changes in rejection rates between Transfer and Baseline, could be explained by the degree to which they vicariously learned, defined as the change in punishment rates between the first and last 5 trials of the Learning phase." Naturally, participants' behavior in the first 5 trials in the learning phase will be similar to that in the baseline; and their behavior in the last 5 trials in the learning phase would echo that in the transfer phase. I think it would be stronger to link the preference learning results to the change between baseline and transfer phase, eg, by looking at the difference between alpha (beta) at the end of the learning phase and the initial alpha (beta).

      (6) I wonder if data from the baseline and transfer phases can also be modeled, using a simple Fehr-Schmidt model? This way, the change in alpha/beta can also be examined between the baseline and transfer phase.

      (7) I quite liked Study 2, which tests the generalization effect, and I expected to see an adapted computational model that directly reflects this idea. Indeed, the authors wrote "[...] given that this model [...] assumes the sort of generalization of preferences between offer types [...]". But where exactly does the preference learning model assume the generalization? In the methods, the modeling seems to be only about Study 1; did the authors adapt their model to accommodate Study 2? The authors also ran simulations for the learning phase in Study 2 (Figure 6); how were the preferences updated (if at all) for offers (90:10 and 10:90) where feedback was not given? Extending/unpacking the computational modeling results for Study 2 would be very helpful for the paper.
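
      Comment (6) refers to the Fehr-Schmidt inequity-aversion model. As a minimal sketch of what applying it to the baseline and transfer phases might involve, the block below computes the standard Fehr-Schmidt utility of accepting a split and maps it to a rejection probability. The logistic choice rule, the `temperature` parameter, and the function names are assumptions made for illustration, not the authors' implementation. Offers are read as Proposer:Receiver, in line with the reviews' labelling of 90:10 as a disadvantageous offer for the participant.

```python
import math


def fehr_schmidt_utility(own, other, alpha, beta):
    """Utility of accepting a split under the Fehr-Schmidt model.

    alpha weights disadvantageous inequity (other > own, "envy");
    beta weights advantageous inequity (own > other, "guilt").
    """
    envy = max(other - own, 0.0)
    guilt = max(own - other, 0.0)
    return own - alpha * envy - beta * guilt


def p_reject(own, other, alpha, beta, temperature=10.0):
    """Probability of rejecting, assuming a logistic choice rule (illustrative).

    Rejecting leaves both players with nothing, so its utility is 0.
    """
    u_accept = fehr_schmidt_utility(own, other, alpha, beta)
    return 1.0 / (1.0 + math.exp(u_accept / temperature))


# A 90:10 offer (the participant, as Receiver, gets 10) with alpha = 0.5
u_di = fehr_schmidt_utility(own=10, other=90, alpha=0.5, beta=0.0)
# A 10:90 offer (the participant gets 90) with beta = 1.5
u_ai = fehr_schmidt_utility(own=90, other=10, alpha=0.0, beta=1.5)
```

      Fitting alpha and beta separately to each phase with such a function would let the change in both parameters be compared between baseline and transfer, which is the comparison the comment asks for.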

      Comments on revisions:

      I kept my original public review, so that future readers can see the progress and development of the manuscript.

      The authors have largely addressed my original questions/concerns, and I have two outstanding comments.

      (a) Related to my original comment #6, where I suggested applying the F-S model also to the baseline and transfer phases. The authors were inclined not to do it, but in fact later, in comment #7 and in the manuscript, they opted to apply a more complex F-S-based model to their learning phase. I agree that the rejection rate is indeed a clear indication, but for completeness, it'd be more consistent and compelling if the paper followed both a model-free (model-agnostic) and a model-based approach in all phases of the experiment.

      (b) Related to my original comment #4, I appreciate that the authors have provided more details of their LMMs. But I still don't think the specification is accurate. First, the offer levels (50:50, 30:70, 10:90) should not be coded as pure categorical levels. They have an ordinal meaning, so a single ordinal predictor with three levels should be used. This also avoids the excessive number of interactions the authors have pointed out.

      Second, running a model with only interactions without main effects is flawed. All textbooks on stats emphasize that without the presence of the main effects, the interpretation of interaction only is biased.

      So these LMMs need to be revised before the manuscript eventually gets to a version of record.

    2. Reviewer #2 (Public review):

      Summary:

      This study investigates whether individuals can learn to adopt egalitarian norms that incur a personal monetary cost, such as rejecting offers that benefit them more than the giver (advantageous inequitable offers). While these behaviors are uncommon, two experiments aim to demonstrate that individuals can learn to reject such offers by observing a "teacher" who follows these norms. The authors use computational modelling to argue that learners adopt these norms through a sophisticated process, inferring the latent structure of the teacher's preferences, akin to theory of mind.

      Strengths:

      This paper is well-written and tackles an important topic relevant to social norms, morality, and justice. The findings are promising (though further control conditions are necessary to support the conclusions). The study is well-situated in the literature, with a clever experimental design and a computational approach that may offer insights into latent cognitive processes. In the revision, the authors clarified some questions related to the initial submission.

      Weaknesses:

      Despite these strengths, I remain unconvinced that the current evidence supports the paper's central claims. Below, I outline several issues that, in my view, limit the strength of the conclusions.

      (1) Experimental Design and Missing Control Condition:

      The authors set out to test whether observing a "teacher" who is averse to advantageous inequity (Adv-I) will affect observers' own rejection of Adv-I offers. However, I think the design of the task lacks an important control condition needed to address this question. At present, participants are assigned to one of two teachers: DIS or DIS+ADV. Behavioral differences between these groups can only reveal relative differences in influence; they cannot establish whether (and how) either teacher independently affects participants' own behavior. For example, a significant difference between conditions can emerge even if participants are only affected by the DIS teacher and are not affected at all by the DIS+ADV teacher. What is crucially missing here is a no-teacher control condition, which can then be compared with each teacher condition separately. This control condition would also control for pure temporal effects unrelated to teacher influence (e.g., increasing Adv-I rejections due to guilt build-up).

      While this criticism applies to both experiments, it is especially apparent in Experiment 2. As shown in Figure 4, the interaction for 10:90 offers reflects a decrease in rejection rates following the DIS teacher, with no significant change following the DIS+ADV teacher. Ignoring temporal effects, this pattern suggests that participants may be learning NOT to reject from the DIS teacher, rather than learning to reject from the DIS+ADV teacher. On this basis, I do not see convincing evidence that participants' own choices were shaped by observing Adv-I rejections.

      In the Discussion, the authors write that "We found that participants' own Adv-I-averse preferences shifted towards the preferences of the Teacher they just observed, and the strength of these contagion effects related to the degree of behavior change participants exhibited on behalf of the Teachers, suggesting that they internalized, at least somewhat, these inequity preferences." However, there is no evidence that directly links the degree of behaviour change (on the teacher's behalf) to contagion effects (own behavioural change). I think there was a relevant analysis in the original version, but it was removed from the current version.

      (2) Modelling Efforts: The modelling approach is underdeveloped. The identification of the "best model" lacks transparency, as no model-recovery results are provided. Additionally, behavioural fits for the losing models are not shown, leaving readers in the dark about where these models fail. Readers would benefit from seeing qualitative/behavioural patterns that favour the winning model. Moreover, the reinforcement learning (RL) models used are overly simplistic, treating actions as independent when they are likely inversely related. For example, the feedback that the teacher would have rejected an offer provides evidence that rejection is "correct" but also that acceptance is "an error," and the latter is not incorporated into the modelling. In other words, offers are modelled as two-armed bandits (where separate values are learned for reject and accept actions), but the situation is effectively a one-armed bandit (if one action is correct, the other is mistaken). It is unclear to what extent this limitation affects the current RL formulations. Can the authors justify/explain their reasoning for including these specific variants? The manuscript only states Q-values for reject actions, but what are the Q-values for accept actions? This is unclear.

      In Experiment 2, only the preferred model is capable of generalization, so it is perhaps unsurprising that this model "wins." However, this does not strongly support the proposed learning mechanism, lacking a comparison with simpler generalizing mechanisms (see following comments).

      (3) Conceptual Leap in Modelling Interpretation: The distinction between simple RL models and preference-inference models seems to hinge on the ability to generalize learning from one offer to another. Whereas in the RL models, learning occurs independently for each offer (hence no cross-offer generalization), preference inference allows for generalization between different offers. However, the paper does not explore "model-free" RL models that allow generalization based on the similarity of features of the offers (e.g., payment for the receiver, payment for the offer-giver, who benefits more). Such models are more parsimonious and could explain the results without invoking a theory of mind or any modelling of the teacher. In such model versions, a learner acquires a functional form that allows prediction of the teacher's feedback based on offer features (e.g., linear or quadratic weighting). Because feedback for an offer modulates the parameters of this function (feature weights), generalization occurs without necessarily evoking any sophisticated model of the other person. This leaves open the possibility that RL models could perform just as well or even outperform the preference learning model, casting doubt on the authors' conclusions.

      Of note: even the behaviourists knew that when Little Albert was taught to fear rats, this fear generalized to rabbits. This could occur simply because rabbits are somewhat similar to rats. But this doesn't mean Little Albert had a sophisticated model of animals that he used to infer how they behave.

      In their rebuttal letter, the authors acknowledge these possibilities, but the manuscript still does not explore or address alternative mechanisms.

      (4) Limitations of the Preference-Inference Model: The preference-inference model struggles to capture key aspects of the data, such as the increase in rejection rates for 70:30 DI offers during the learning phase (e.g., Fig. 3A, AI+DI blue group). This is puzzling. Thinking about this, I realized the model makes quite strong, unintuitive predictions which are not examined. For example, if a subject begins the learning phase rejecting the 70:30 offer more than 50% of the time (meaning the starting guilt parameter is higher than 1.5), then, over learning, the tendency to reject will decrease to below 50% (the guilt parameter will be pulled down below 1.5). This is despite the fact that the teacher rejects 75% of the offers. In other words, as learning continues, learners will diverge from the teacher. On the other hand, if a participant begins learning by tending to accept this offer (guilt < 1.5), then during learning, they can increase their rejection rate but never above 50%. Thus, one can never fully converge on the teacher. I think this relates to the model's failure in accounting for the pattern mentioned above. I wonder if individuals actually abide by these strict predictions. In any case, these issues raise questions about the validity of the model as a representation of how individuals learn to align with a teacher's preferences (given that the model doesn't really allow for such an alignment).

      In their rebuttal letter, the authors acknowledged these anomalies and stated that they were able to build a better model (where anomalies are mitigated, though not fully eliminated). But they still report the current model and do not develop/discuss alternatives. A more principled model may be a Bayesian model where participants learn a belief distribution (rather than point estimates) regarding the teacher's parameters.

      (5) Statistical Analysis: The authors state in their rebuttal letter that they used the most flexible random effect structure in mixed-effects models. But this seems not to be the case in the model reported in Table SI3 (the very same model was used for other analyses too). Indeed, here it seems only intercepts are random effects. This left me confused about which models were used.
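
      The feature-based alternative raised in point (3) can be sketched as follows. This is a hypothetical illustration of a learner that predicts the teacher's feedback from offer features with a delta-rule (online logistic) update; it is not a model from the manuscript. The feature set, learning rate, toy teacher, and all names are assumptions. The learner is trained only on 70:30, 50:50 and 30:70 offers, and its predictions for the untrained 90:10 and 10:90 offers follow from the learned feature weights alone, without any explicit model of the teacher.

```python
import math


def features(proposer, receiver):
    """Hypothetical offer features, seen from the Receiver's side."""
    gap = (receiver - proposer) / 100.0
    # bias, size of disadvantageous gap, size of advantageous gap
    return [1.0, max(-gap, 0.0), max(gap, 0.0)]


def predict_reject(weights, x):
    """Predicted probability that the teacher rejects an offer with features x."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def update(weights, x, teacher_rejected, lr=0.5):
    """Delta-rule update: move the prediction toward the teacher's feedback."""
    error = (1.0 if teacher_rejected else 0.0) - predict_reject(weights, x)
    return [w + lr * error * xi for w, xi in zip(weights, x)]


# Toy teacher (assumed): rejects both moderately unequal offers, accepts the equal split.
training = [((70, 30), True), ((30, 70), True), ((50, 50), False)]

weights = [0.0, 0.0, 0.0]
for _ in range(500):
    for (proposer, receiver), rejected in training:
        weights = update(weights, features(proposer, receiver), rejected)

# Offers never seen during training: predictions come from feature weights alone.
p_90_10 = predict_reject(weights, features(90, 10))
p_10_90 = predict_reject(weights, features(10, 90))
p_70_30 = predict_reject(weights, features(70, 30))
p_50_50 = predict_reject(weights, features(50, 50))
```

      Because feedback on a 70:30 offer changes the weight on the disadvantageous-gap feature, the predicted rejection probability for 90:10 rises as well. Comparing a learner of this kind against the preference-inference model would address the concern that generalization alone does not imply inference about the teacher.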

    3. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      Zhang et al. addressed the question of whether advantageous and disadvantageous inequality aversion can be vicariously learned and generalized. Using an adapted version of the ultimatum game (UG), in three phases, participants first gave their own preference (baseline phase), then interacted with a "teacher" to learn their preference (learning phase), and finally were tested again on their own (transfer phase). The key measure is whether participants exhibited similar choice preferences (i.e., rejection rate and fairness rating) influenced by the learning phase, by contrasting their transfer phase and baseline phase. Through a series of statistical modeling and computational modeling, the authors reported that both advantageous and disadvantageous inequality aversion can indeed be learned (Study 1), and even be generalised (Study 2).

      Strengths:

      This study is very interesting: it directly adapted the lab's previous work on the observational learning effect on disadvantageous inequality aversion, to test both advantageous and disadvantageous inequality aversion in the current study. Social transmission of action, emotion, and attitude has started to be looked at recently, hence this research is timely. The use of computational modeling is mostly appropriate and motivated. Study 2, which examined the vicarious inequality aversion in conditions where feedback was never provided, is interesting and important to strengthen the reported effects. Both studies have proper justifications to determine the sample size.

      Weaknesses:

      Despite the strengths, a few conceptual aspects and analytical decisions have to be explained, justified, or clarified.

      INTRODUCTION/CONCEPTUALIZATION

      (1) Two terms seem to be used interchangeably in this work, but should not be: vicarious/observational learning vs preference learning. For vicarious learning, individuals observe others' actions (and optionally also the corresponding consequence resulting directly from their own actions), whereas, for preference learning, individuals predict, or act on behalf of, the others' actions, and then receive feedback on whether that prediction is correct. For the current work, it seems that the experiment is more about preference learning and prediction, and less so about vicarious learning. The introduction and setup are framed heavily around vicarious learning, and later the use of vicarious learning and preference learning is rather mixed in the text. I think the authors should either tone down the focus on vicarious learning, or discuss how the two differ. Some of the references here may be helpful: (Charpentier et al., Neuron, 2020; Olsson et al., Nature Reviews Neuroscience, 2020; Zhang & Glascher, Science Advances, 2020)

      We thank the Reviewer for raising this question and providing the references. In response to this comment we have elected to avoid, in most cases, use of the term ‘vicarious’ and instead focus the paper on learning of others’ preferences (without specific commitment to vicarious/observational learning per se). These changes are reflected throughout all sections of the revised manuscript, and in the revised title. We believe this simplified terminology has improved the clarity of our contribution.

      EXPERIMENTAL DESIGN

      (2) For each offer type, the experiment "added a uniformly distributed noise in the range of (-10 ,10)". I wonder what this looks like? With only integers such as 25:75, or even with decimal points? More importantly, is it possible to have either 70:30 or 90:10 option, after adding the noise, to have generated an 80:20 split shown to the participants? If so, for the analyses later, when participants saw the 80:20 split, which condition did this trial belong to? 70:30 or 90:10? And is such noise added only to the learning phase, or also to the baseline/transfer phases? This requires some clarification.

      We thank the Reviewer for pointing this out. The uniformly distributed noise was added to all three phases to make the proposers’ behavior more realistic. This added noise was rounded to integer numbers, constrained from -9 to 9, which means in both 70:30 and 90:10 offer types, an 80:20 split could not occur. We have made this feature of our design clear in the Method section Line 524 ~ 528:

      “In all task phases, we added uniformly distributed noise to each trial’s offer (ranging from -9 to 9, inclusive, rounding to the nearest integer) such that the random amount added (or subtracted) from the Proposer’s share was subtracted (or added) to the Receiver’s share. We adopted this manipulation to make the proposers’ behavior appear more realistic. The orders of offers participants experienced were fully randomized within each experiment phase.”
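To make the noise manipulation concrete, here is a minimal sketch of how such offers could be generated. It assumes integer noise drawn directly from -9 to 9 (the manuscript describes rounding a uniform draw, which may weight the endpoints differently); the function name `noisy_offer` and the seed are hypothetical and not taken from the authors' task code.

```python
import random

def noisy_offer(proposer_share, rng):
    """Return (proposer, receiver) shares after adding integer noise in [-9, 9].

    Whatever is added to the Proposer's share is removed from the
    Receiver's share, so the two always sum to 100.
    """
    noise = rng.randint(-9, 9)  # inclusive on both ends (assumed integer draw)
    proposer = proposer_share + noise
    receiver = (100 - proposer_share) - noise
    return proposer, receiver

rng = random.Random(0)  # arbitrary seed, for reproducibility of the sketch only
samples_70 = {noisy_offer(70, rng)[0] for _ in range(10_000)}
samples_90 = {noisy_offer(90, rng)[0] for _ in range(10_000)}

# A nominal 70:30 offer can reach at most 79:21, and a nominal 90:10 offer
# can fall to at most 81:19, so neither can be displayed as 80:20.
print(max(samples_70), min(samples_90))
```

Under this assumption the two offer types never overlap, which is the property the response relies on when assigning each displayed split to a single condition.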

      (3) For the offer conditions (90:10, 70:30, 50:50, 30:70, 10:90) - are they randomized? If so, how is it done? Is it randomized within each participant, and/or also across participants (such that each participant experienced different trial sequences)? This is important, as the order especially for the learning phase can largely impact the preference learning of the participants.

      We agree with the Reviewer that the order in which offers are experienced could be very important. The order of the conditions was randomized independently for each participant (i.e., each participant experienced a different trial sequence). We have made this point clear in the Methods section, Line 527 ~ 528:

      “The orders of offers participants experienced were fully randomized within each experiment phase.”

      STATISTICAL ANALYSIS & COMPUTATIONAL MODELING

      (4) In Study 1 DI offer types (90:10, 70:30), the rejection rate for DI-AI averse looks consistently higher than that for DI averse (ie, the blue line is above the yellow line). Is this significant? If so, how come? Since this is a between-subject design, I would not anticipate such a result (especially for the baseline). Also, for the LME results (eg, Table S3), only interactions were reported but not the main results.

      We thank the Reviewer for pointing out this feature of the results. Prompted by this comment, we compared the baseline rejection rates between the two conditions for these two offer types, finding in Experiment 1 that rejection rates in the DI-AI-averse condition were significantly higher than in the DI-averse condition (DI-AI-averse vs. DI-averse; Offer 90:10, β = 0.13, p < 0.001, Offer 70:30, β = 0.09, p < 0.034). We agree with the Reviewer that there should, in principle, be no difference, because the experiences of participants in these two conditions are identical in the Baseline phase. However, we did not observe this difference in baseline preferences in Experiment 2 (DI-AI-averse vs. DI-averse; Offer 90:10, β = 0.07, p < 0.100, Offer 70:30, β = 0.05, p < 0.193). Given the inconsistency of this effect across studies, we believe this is a spurious difference in preferences stemming from chance.

      Regarding the LME results, the reason why only interaction terms are reported is due to the specification of the model and the rationale for testing.

      Taking the model reported in Table S3 as an example (a logistic model which examines Baseline phase rejection rates as a function of offer level and condition), the between-subject conditions (DI-averse and DI-AI-averse) are represented by dummy-coded variables. Similarly, offer types were also dummy-coded, such that each of the five columns (90:10, 70:30, 50:50, 30:70, and 10:90) corresponded to a particular offer type. This model specification yields ten interaction terms (i.e., fixed effects) of interest; for example, the “DI-averse × Offer 90:10” term indicates baseline rejection rates for 90:10 offers in the DI-averse condition. Thus, to compare rejection rates across specific offer types, we estimate and report linear contrasts between these resultant terms. We have clarified the nature of these reported tests in our revised Results, for example, Line 189-190: “linear contrasts; e.g. 90:10 vs 10:90, all Ps<0.001, see Table S3 for logistic regression coefficients for rejection rates)”.

      Also, in response to this comment and a recommendation from Reviewer 2 (see below), we have revised our supplementary materials to make each model specification clearer, as in SI line 25:

      “RejectionRate ~ 0 + (Disl + Advl):(Offer10 + Offer30 + Offer50 + Offer70 + Offer90) + (1|Subject)”
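The cell-means coding described above can be sketched as follows. This is only an illustration of the fixed-effects layout: one indicator column per condition-by-offer cell, and a linear contrast as a difference of two cells. The column ordering and the helper names `interaction_row` and `contrast` are assumptions, and the random intercept `(1|Subject)` is omitted.

```python
CONDITIONS = ["DI-averse", "DI-AI-averse"]
OFFERS = ["90:10", "70:30", "50:50", "30:70", "10:90"]

def interaction_row(condition, offer):
    """Indicator row with a single 1 in the column for this condition x offer cell."""
    row = [0.0] * (len(CONDITIONS) * len(OFFERS))
    row[CONDITIONS.index(condition) * len(OFFERS) + OFFERS.index(offer)] = 1.0
    return row

def contrast(cell_a, cell_b):
    """Linear contrast vector: +1 on cell_a, -1 on cell_b, 0 elsewhere."""
    row_a = interaction_row(*cell_a)
    row_b = interaction_row(*cell_b)
    return [a - b for a, b in zip(row_a, row_b)]

# Contrast comparing 90:10 with 10:90 offers within the DI-averse condition.
c = contrast(("DI-averse", "90:10"), ("DI-averse", "10:90"))

# Applied to a vector of ten fitted coefficients, the contrast returns the
# difference between the two cells' coefficients (on the logit scale).
hypothetical_coefs = [float(i) for i in range(10)]  # placeholder values only
difference = sum(w * b for w, b in zip(c, hypothetical_coefs))
```

Because the intercept is suppressed (`0 +`), each coefficient is read directly as a cell estimate, which is why only interaction terms appear in the reported tables.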

      (5) I do not particularly find this analysis appealing: "we examined whether participants' changes in rejection rates between Transfer and Baseline, could be explained by the degree to which they vicariously learned, defined as the change in punishment rates between the first and last 5 trials of the Learning phase." Naturally, the participants' behavior in the first 5 trials in the learning phase will be similar to those in the baseline; and their behavior in the last 5 trials in the learning phase would echo those at the transfer phase. I think it would be stronger to link the preference learning results to the change between the baseline and transfer phase, eg, by looking at the difference between alpha (beta) at the end of the learning phase and the initial alpha (beta).

      Thanks for pointing this out. Also, considering the comments from Reviewer 2 concerning the interpretation of this analysis, we have elected to remove this result from our revision.

      (6) I wonder if data from the baseline and transfer phases can also be modeled, using a simple Fehr-Schmidt model. This way, the change in alpha/beta can also be examined between the baseline and transfer phase.

      We agree with the Reviewer that a simplified F-S model could be used, in principle, to characterize Baseline and Transfer phase behavior, but it is our view that the rejection rates provide readers with the clearest (and simplest) picture of how participants are responding to inequity. Put another way, we believe that the added complexity of using (and explaining) a new model to characterize simple, steady-state choice behavior (within these phases) would not be justified or add appreciable insights about participants’ behavior.

      (7) I quite liked Study 2 which tests the generalization effect, and I expected to see an adapted computational modeling to directly reflect this idea. Indeed, the authors wrote, "[...] given that this model [...] assumes the sort of generalization of preferences between offer types [...]". But where exactly did the preference learning model assume the generalization? In the methods, the modeling seems to be only about Study 1; did the authors advise their model to accommodate Study 2? The authors also ran simulation for the learning phase in Study 2 (Figure 6), and how did the preference update (if at all) for offers (90:10 and 10:90) where feedback was not given? Extending/Unpacking the computational modeling results for Study 2 will be very helpful for the paper.

      We are appreciative of the Reviewer’s positive impression of Experiment 2. Upon reflection, we realize that our original submission was not clear about the modeling done in Experiment 2, and we should clarify here that we did also fit the Preference Inference model to this dataset. As in Experiment 1, this model assumes that the participants have a representation of the teacher’s preference as a Fehr-Schmidt-form utility function and infer the Teacher’s Envy and Guilt parameters through learning. The model indicates that, on the basis of experience with the Teacher’s preferences on moderately unfair offers (i.e., offer 70:30 and offer 30:70), participants can successfully infer these two parameters and, in turn, compute Fehr-Schmidt utility to guide their decisions in the extreme unfair offers (i.e., offer 90:10 and offer 10:90).

      In response to this comment, we have made this clearer in our Results (Line 377-382):

      “Finally, following Experiment 1, we fit a series of computational models of Learning phase choice behavior, comparing the goodness-of-fit of the four best-fitting models from Experiment 1 (see Methods). As before, we found that the Preference Inference model provided the best fit of participants’ Learning Phase behavior (Figure S1a, Table S12). Given that this model is able to infer the Teacher’s underlying inequity-averse preferences (rather than learning offer-specific rejection preferences), it is unsurprising that this model best describes the generalization behavior observed in Experiment 2.”

      and in our revised Methods (Line 551-553)

      “We considered 6 computational models of Learning Phase choice behavior, which we fit to individual participants’ observed sequences of choices, in both Experiments 1 and 2, via Maximum Likelihood Estimation”

      Reviewer #2 (Public review):

      Summary:

      This study investigates whether individuals can learn to adopt egalitarian norms that incur a personal monetary cost, such as rejecting offers that benefit them more than the giver (advantageous inequitable offers). While these behaviors are uncommon, two experiments demonstrate that individuals can learn to reject such offers through vicarious learning - by observing and acting in line with a "teacher" who follows these norms. The authors use computational modelling to argue that learners adopt these norms through a sophisticated process, inferring the latent structure of the teacher's preferences, akin to theory of mind.

      Strengths:

      This paper is well-written and tackles a critical topic relevant to social norms, morality, and justice. The findings, which show that individuals can adopt just and fair norms even at a personal cost, are promising. The study is well-situated in the literature, with clever experimental design and a computational approach that may offer insights into latent cognitive processes. Findings have potential implications for policymakers.

      Weaknesses:

      Note: in the text below, the "teacher" will refer to the agent from which a participant presumably receives feedback during the learning phase.

      (1) Focus on Disadvantageous Inequity (DI): A significant portion of the paper focuses on responses to Disadvantageous Inequitable (DI) offers, which is confusing given the study's primary aim is to examine learning in response to Advantageous Inequitable (AI) offers. The inclusion of DI offers is not well-justified and distracts from the main focus. Furthermore, the experimental design seems, in principle, inadequate to test for the learning effects of DI offers. Because both teaching regimes considered were identical for DI offers the paradigm lacks a control condition to test for learning effects related to these offers. I can't see how an increase in rejection of DI offers (e.g., between baseline and generalization) can be interpreted as speaking to learning. There are various other potential reasons for an increase in rejection of DI offers even if individuals learn nothing from learning (e.g. if envy builds up during the experiment as one encounters more instances of disadvantageous fairness).

      We are appreciative of the Reviewer’s insight here and for the opportunity to clarify our experimental logic. We included DI offers in order to 1) expose participants to the full spectrum of offer types and avoid focusing participants exclusively upon AI offers, which might result in a demand characteristic, and 2) afford exploration of how learning dynamics might differ in DI contexts, which were examined to some extent in our previous study (FeldmanHall, Otto, & Phelps, 2018), versus AI contexts. Furthermore, as this work builds critically on our previous study, we reasoned that replicating these original findings (in the DI context) would be important for demonstrating the generality of the learning effects in the DI context across experimental settings. We now remark on this point in our revised Introduction, Line 129 ~ 132:

      “In addition, to mechanistically probe how punitive preferences are acquired in Adv-I and Dis-I contexts—in turn, assessing the replicability of our earlier study investigating punitive preference acquisition in the Dis context—we also characterize trial-by-trial acquisition of punitive behavior with computational models of choice.”

      (2) Statistical Analysis: The analysis of the learning effects of AI offers is not fully convincing. The authors analyse changes in rejection rates within each learning condition rather than directly comparing the two. Finding a significant effect in one condition but not the other does not demonstrate that the learning regime is driving the effect. A direct comparison between conditions is necessary for establishing that there is a causal role for the learning regime.

      We agree with the Reviewer and, upon reflection, believe that direct comparisons between conditions would be helpful to support the claim that the different learning conditions are responsible for the observed learning effects. In brief, these specific tests buttress the idea that exposure to AI-averse preferences results in increases in AI punishment rates in the Transfer phase (over and above the rates observed for participants who were only exposed to DI-averse preferences).

      Accordingly, our revision now reports statistics concerning the differences between conditions for AI offers in Experiment 1 (Line 198~ 207):

      “Importantly, when comparing these changes between the two learning conditions, we observed significant differences in rejection rates for Adv-I offers: compared to exposure to a Teacher who rejected only Dis-I offers, participants exposed to a Teacher who rejected both Dis-I and Adv-I offers were more likely to reject Adv-I offers and rated these offers as more unfair. This difference between conditions was evident in both 30:70 offers (Rejection rates: β(SE) = 0.10(0.04), p = 0.013; Fairness ratings: β(SE) = -0.86(0.17), p < 0.001) and 10:90 offers (Rejection rates: β(SE) = 0.15(0.04), p < 0.001; Fairness ratings: β(SE) = -1.04(0.17), p < 0.001). As a control, we also compared rejection rates and fairness rating changes between conditions in Dis-I offers (90:10 and 70:30) and Fair offers (i.e., 50:50) but observed no significant difference (all ps > 0.217), suggesting that observing an Adv-I-averse Teacher’s preferences did not influence participants’ behavior in response to Dis-I offers.”

      Line 222 ~ 230:

      “A mixed-effects logistic regression revealed a significantly larger (positive) effect of trial number on rejection rates of Adv-I offers for the Adv-Dis-I-Averse condition compared to the Dis-I-Averse condition. This relative rejection rate increase was evident both in 30:70 offers (Table S7; β(SE) = -0.77(0.24), p < 0.001) and in 10:90 offers (β(SE) = -1.10(0.33), p < 0.001). In contrast, comparing Dis-I and Fair offers when the Teacher showed the same tendency to reject, we found no significant difference between the two conditions (90:10 splits: β(SE) = -0.48(0.21), p = 0.593; 70:30 splits: β(SE) = -0.01(0.14), p = 0.150; 50:50 splits: β(SE) = -0.00(0.21), p = 0.086). In other words, participants by and large appeared to adjust their rejection choices in accordance with the Teacher’s feedback in an incremental fashion.”

      And in Experiment 2 Line 333 ~ 345:

      “Similar to what we observed in Experiment 1 (Figure 4a), compared to the participants in the Dis-I-Averse Condition, participants in the Adv-I-Averse Condition increased their rates of rejection of extreme Adv-I offers (i.e., 10:90) in the Transfer Phase, relative to the Baseline phase (β(SE) = -0.12(0.04), p < 0.004; Table S9), suggesting that participants’ learned (and adopted) Adv-I-averse preferences generalized from one specific offer type (30:70) to an offer type for which they received no Teacher feedback (10:90). Examining extreme Dis-I offers where the Teacher exhibited identical preferences across the two learning conditions, we found no difference in the changes of rejection rates from Baseline to Transfer phase between conditions (β(SE) = -0.05(0.04), p < 0.259). Mirroring the observed rejection rates (Figure 4b), participants’ fairness ratings for extreme Adv-I offers increased more from the Baseline to Transfer phase in the Adv-Dis-I-Averse Condition than in the Dis-I-Averse condition (β(SE) = -0.97(0.18), p < 0.001), but, importantly, changes in fairness ratings for extreme Dis-I offers did not differ significantly between learning conditions (β(SE) = -0.06(0.18), p < 0.723).”

      Line 361 ~ 368:

      “Examining the time course of rejection rates in Adv-I contexts during the Learning phase (Figure 5) revealed that participants learned over time to punish mildly unfair 30:70 offers, and these punishment preferences generalized to more extreme offers (10:90). Specifically, compared to the Dis-I-Averse Condition, in the Adv-Dis-I-Averse condition we observed a significantly larger increasing trend in rejection rates for 10:90 (Adv-I) offers (Figure 5, β(SE) = -0.81(0.26), p < 0.002, mixed-effects logistic regression, see Table S10). Again, when comparing the rejection rate increase in the extreme Dis-I offers (90:10), we did not find a significant difference between conditions (β(SE) = -0.25(0.19), p < 0.707).”

      (3) Correlation Between Learning and Contagion Effects:

      The authors argue that correlations between learning effects (changes in rejection rates during the learning phase) and contagion effects (changes between the generalization and baseline phases) support the idea that individuals who are better aligning their preferences with the teacher also give more consideration to the teacher's preferences later during generalization phase. This interpretation is not convincing. Such correlations could emerge even in the absence of learning, driven by temporal trends like increasing guilt or envy (or even by slow temporal fluctuations in these processes) on behalf of self or others. The reason is that the baseline phase is temporally closer to the beginning of the learning phase whereas the generalization phase is temporally closer to the end of the learning phase. Additionally, the interpretation of these effects seems flawed, as changes in rejection rates do not necessarily indicate closer alignment with the teacher's preferences. For example, if the teacher rejects an offer 75% of the time then a positive 5% learning effect may imply better matching the teacher if it reflects an increase in rejection rate from 65% to 70%, but it implies divergence from the teacher if it reflects an increase from 85% to 90%. For similar reasons, it is not clear that the contagion effects reflect how much a teacher's preferences are taken into account during generalization.

      This comment is very similar to a previous comment made by Reviewer 1, who also called into question the interpretability of these correlations. In response to both of these comments we have elected to remove these analyses from our revision.

      (4) Modeling Efforts: The modelling approach is underdeveloped. The identification of the "best model" lacks transparency, as no model-recovery results are provided, and fits for the losing models are not shown, leaving readers in the dark about where these models fail. Moreover, the reinforcement learning (RL) models used are overly simplistic, treating actions as independent when they are likely inversely related (for example, the feedback that the teacher would have rejected an offer provides feedback that rejection is "correct" but also that acceptance is "an error", and the latter is not incorporated into the modelling). It is unclear if and to what extent this limits current RL formulations. There are also potentially important missing details about the models. Can the authors justify/explain the reasoning behind including these variants they consider? What are the initial Q-values? If these are not free parameters what are their values?

      We are appreciative of the Reviewer for identifying these potentially unaddressed questions.

      The RL models we consider in the present study are naïve models which, in our previous study (FeldmanHall, Otto, & Phelps, 2018), we found to capture important aspects of learning. While simplistic, we believed these models serve as a reasonable baseline for evaluating more complex models, such as the Preference Inference model. We have made this point more explicit in our revised Introduction, Line 129 ~ 132:

      “In addition, to mechanistically probe how punitive preferences may be acquired in Adv-I and Dis-I contexts—in turn, assessing the replicability of our earlier study investigating punitive preference acquisition in the Dis-I context—we also characterize trial-by-trial acquisition of punitive behavior with computational models of choice.”

      Again, following from our previous modeling of observational learning (FeldmanHall et al., 2018), we believe that the feedback the Teacher provides here is ideally suited to the RL formalism. In particular, when the teacher indicates that the participant’s choice is what they would have preferred, the model receives a reward of ‘1’ (e.g., the participant rejects and the Teacher indicates they would have preferred rejection, resulting in a positive prediction error); otherwise, the model receives a reward of ‘0’ (e.g., the participant accepts and the Teacher indicates they would have preferred rejection, resulting in a negative prediction error), indicating that the participant did not choose in accordance with the Teacher’s preferences. Through an error-driven learning process, these models provide a naïve way of learning to act in accordance with the Teacher’s preferences.

      Regarding the requested model details: When treating the initial values as free parameters (model 5), we set Q(reject, offertype) as free values in [0,1] and Q(accept,offertype) as 0.5. This setting can capture participants' initial tendency to reject or accept offers from this offer type. When the initial values are fixed, for all offer types we set Q(reject, offertype) = Q(accept,offertype) = 0.5. In practice, when the initial values are fixed, setting them to 0.5 or 0 doesn’t make much difference. We have clarified these points in our revised Methods, Line 275 ~ 576:

      “We kept the initial values fixed in this model, that is Q_0(reject, offertype) = 0.5, offertype ∈ {90:10, 70:30, 50:50, 30:70, 10:90}”

      And Line 582 ~ 584:

      “Formally, this model treats Q_0(reject, offertype), offertype ∈ {90:10, 70:30, 50:50, 30:70, 10:90}, as free parameters with values between 0 and 1.”
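The naïve RL scheme described in this response (reward of 1 when the choice matches the Teacher's preference, 0 otherwise, with initial values of 0.5 or free initial 'reject' values) can be sketched as below. This is an illustrative delta-rule update, not the authors' fitting code: the function names, the learning-rate value, and the restriction of the update to the chosen action are assumptions.

```python
OFFER_TYPES = ["90:10", "70:30", "50:50", "30:70", "10:90"]

def init_q(q0_reject=None):
    """Initial action values per offer type.

    q0_reject optionally maps an offer type to a free initial 'reject'
    value in [0, 1]; anything not supplied is fixed at 0.5.
    The 'accept' value always starts at 0.5.
    """
    q0_reject = q0_reject or {}
    return {o: {"reject": q0_reject.get(o, 0.5), "accept": 0.5} for o in OFFER_TYPES}

def update(q, offer, choice, teacher_prefers, learning_rate):
    """Delta-rule update of the chosen action's value for one offer type."""
    reward = 1.0 if choice == teacher_prefers else 0.0
    prediction_error = reward - q[offer][choice]
    q[offer][choice] += learning_rate * prediction_error
    return prediction_error

q = init_q()
# Participant rejects a 30:70 offer and the Teacher would also have rejected:
pe_match = update(q, "30:70", "reject", "reject", learning_rate=0.3)
# Participant accepts a 30:70 offer but the Teacher would have rejected:
pe_mismatch = update(q, "30:70", "accept", "reject", learning_rate=0.3)
```

In this sketch each offer type keeps its own pair of action values, so feedback on one offer type leaves the others untouched. That independence is the property Reviewer 2 contrasts with preference inference.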

      (5) Conceptual Leap in Modeling Interpretation: The distinction between simple RL models and preference-inference models seems to hinge on the ability to generalize learning from one offer to another. Whereas in the RL models learning occurs independently for each offer (hence no cross-offer generalization), preference inference allows for generalization between different offers. However, the paper does not explore RL models that allow generalization based on the similarity of features of the offers (e.g., payment for the receiver, payment for the offer-giver, who benefits more). Such models are more parsimonious and could explain the results without invoking a theory of mind or any modelling of the teacher. In such model versions, a learner learns a functional form that allows it to predict the teacher's feedback based on said offer features (e.g., linear or quadratic form). Because feedback for an offer modulates the parameters of this function (feature weights), generalization occurs without necessarily evoking any sophisticated model of the other person. This leaves open the possibility that RL models could perform just as well or even show superiority over the preference learning model, casting doubt on the authors' conclusions. Of note: even the behaviourists knew that as Little Albert was taught to fear rats, this fear generalized to rabbits. This could occur simply because rabbits are somewhat similar to rats. But this doesn't mean Little Albert had a sophisticated model of animals he used to infer how they behave.

      We are appreciative of the Reviewer for their suggestion of an alternative explanation for the observed generalization effects. Our understanding of the suggestion, put simply, is that an RL model could capture the observed generalization effects if the model were to learn and update a functional form of the Teacher’s rejection preferences using an RL-like algorithm. This idea is similar, conceptually, to our account of preference learning, whereby the learner has a representation of the teacher’s preferences. In our experiment the offer lies in the range [0, 100], and the crux of this idea is why the participants should adopt a functional form (either v-shaped or quadratic) with its minimum at 50. This is important because, at the beginning of the learning phase, the rejection rates are already v-shaped with 50 as the minimum, so the participants do not need to adjust the minimum of this functional form. Thus, if we assume that the participants represent the teacher’s rejection rate as a v-shaped function with a minimum at 50:50, then this very likely implies that the participants have a representation that the teacher has a preference for fairness. Overall, we agree that with a suitable setup of the functional form, one could implement an RL model to capture the generalization effects without presupposing an internal “model” of the teacher’s preferences.

      However, there is another way of modeling the generalization effect: truly “model-free”, similarity-based reinforcement learning. In this approach, we do not assume any particular functional form of the teacher’s preferences, but rather assume that experience acquired in one offer type can be generalized to offers that are close (i.e., similar) to the original offer. Accordingly, we implemented this idea using a simple RL model in which the action values for each offer type are updated by a learning rate that is scaled by the distance between that offer and the experienced offer (i.e., the offer that generated the prediction error). This learning rate is governed by a Gaussian distribution, similar to the case in Gaussian process regression (cf. Schulz, Speekenbrink, & Krause, 2018). The initial value of the ‘Reject’ action, for each offer, is set to a free parameter between 0 and 1, and the initial value for the ‘Accept’ action is set to 0.5. The results show that even though this model exhibits the trend of increasing rejection rates observed in the AI-DI punish condition, the initial preferences (i.e., starting point of learning) diverge markedly from the Learning phase behavior we observed in Experiment 1:

      Author response image 1.
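The similarity-based update described above can be sketched as follows. It assumes that distance is measured on the Receiver's share, that the Gaussian scaling is unnormalized (equal to 1 for the experienced offer), and that a single prediction error from the experienced offer is propagated to the chosen action's value for every offer type. The `width` parameter and all names here are hypothetical.

```python
import math

# Receiver's share for each offer type (Proposer:Receiver).
RECEIVER_SHARE = {"90:10": 10, "70:30": 30, "50:50": 50, "30:70": 70, "10:90": 90}

def similarity(offer_a, offer_b, width):
    """Gaussian similarity between two offer types; 1.0 for identical offers."""
    d = RECEIVER_SHARE[offer_a] - RECEIVER_SHARE[offer_b]
    return math.exp(-(d * d) / (2.0 * width * width))

def generalized_update(q, experienced, choice, reward, learning_rate, width):
    """Propagate one prediction error to all offer types, scaled by similarity."""
    prediction_error = reward - q[experienced][choice]  # computed once, up front
    for offer in q:
        scaled_rate = learning_rate * similarity(offer, experienced, width)
        q[offer][choice] += scaled_rate * prediction_error

q = {o: {"reject": 0.5, "accept": 0.5} for o in RECEIVER_SHARE}
# Teacher feedback rewards rejecting a 30:70 offer.
generalized_update(q, "30:70", "reject", reward=1.0, learning_rate=0.4, width=20.0)
```

With these settings the 'reject' value rises most for 30:70, somewhat for its neighbours 10:90 and 50:50, and barely at all for 90:10. The generalization comes from offer similarity alone, without any representation of the Teacher's utility function.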

      This demonstrates that participants maintain, at a minimum, a representation of the teacher’s preference at the beginning of learning. That is, they have prior knowledge about the shape of this preference. We incorporated this property into the model; that is, we considered a new model that assumes v-shaped starting values for rejection with two parameters, alpha and beta, governing the slopes of this v-shaped function (this starting value mimics the shape of the preference functions of the Fehr-Schmidt model). We found that this new model (which we term the “Model RL Sim Vstart”) provided a satisfactory qualitative fit of the Transfer phase learning curves in Experiment 1 (see below).

      Author response image 2.

      However, we did not adopt this model as the best model, for the following reasons. First, this model yielded a larger AIC value (indicating worse quantitative fit) compared to our Preference Inference model in both Experiments 1 and 2, likely owing to its increased complexity (5 free parameters versus 4 in the Preference Inference model). Accordingly, we believe that inclusion of this model in our revised submission would be more distracting than helpful on account of the added complexity of explaining and justifying these assumptions, and of course its comparatively poor goodness of fit (relative to the Preference Inference model).
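For readers less familiar with the criterion, the complexity penalty at work here can be illustrated with the standard AIC formula, AIC = 2k - 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood. The log-likelihood values below are placeholders chosen for illustration only and are not results from the study.

```python
def aic(log_likelihood, n_free_params):
    """Akaike Information Criterion; lower values indicate a better trade-off."""
    return 2 * n_free_params - 2 * log_likelihood

# Placeholder log-likelihoods (not values from the study).
aic_five_params = aic(log_likelihood=-100.0, n_free_params=5)
aic_four_params = aic(log_likelihood=-100.0, n_free_params=4)

# With identical fit, the extra parameter costs exactly 2 AIC points, so the
# 5-parameter model must raise the log-likelihood by more than 1 to be preferred.
penalty = aic_five_params - aic_four_params
```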

      (6) Limitations of the Preference-Inference Model: The preference-inference model struggles to capture key aspects of the data, such as the increase in rejection rates for 70:30 DI offers during the learning phase (e.g. Figure 3A, AI+DI blue group). This is puzzling.

      Thinking about this I realized the model makes quite strong unintuitive predictions that are not examined. For example, if a subject begins the learning phase rejecting the 70:30 offer more than 50% of the time (meaning the starting guilt parameter is higher than 1.5), then, over the course of learning, the tendency to reject will decrease to below 50% (the guilt parameter will be pulled down below 1.5). This is despite the fact the teacher rejects 75% of the offers. In other words, as learning continues learners will diverge from the teacher. On the other hand, if a participant begins learning to tend to accept this offer (guilt < 1.5) then during learning they can increase their rejection rate but never above 50%. Thus one can never fully converge on the teacher. I think this relates to the model's failure in accounting for the pattern mentioned above. I wonder if individuals actually abide by these strict predictions. In any case, these issues raise questions about the validity of the model as a representation of how individuals learn to align with a teacher's preferences (given that the model doesn't really allow for such an alignment).

      In response to this comment, we explain our efforts to build a new model that might conceptually resolve the issue identified by the Reviewer.

      The key intuition guiding the Preference inference model is a Bayesian account of learning which we aimed to further simplify. In this setting, a Bayesian learner maintains a representation of the teacher’s inequity aversion parameters and updates it according to the teacher’s (observed) behavior. Intuitively, the posterior distribution shifts to the likelihood of the teacher’s action. On this view, when the teacher rejects, for instance, an AI offer, the learner should assign a higher probability to larger values of the Guilt parameter, and in turn the learner should change their posterior estimate to better capture the teacher’s preferences.

      In the current study, we simplified this idea, implementing this sort of learning using incremental “delta rule” updating (e.g. Equation 8 of the main text). Then the key question is to define the “teaching signal”. Assuming that the teacher rejects an offer 70:30, based on Bayesian reasoning, the teacher’s envy parameter (α) is more likely to exceed 1.5 (computed as 30/(50-30), per equation 7) than to be smaller than 1.5. Thus, 1.5, which is then used in equation 8 to update α, can be thought of as a teaching signal. We simply assumed that if the initial estimate is already greater than 1.5, which means the prior is consistent with the likelihood, no updating would occur. This assumption raises the question of how to set the learning rate range. In principle, an envy parameter that is larger than 1.5 should be the target of learning (i.e., the teaching signal), and thus our model definition allows the learning rate to be greater than 1, incorporating this possibility.

      Our simplified Preference Inference model has already successfully captured some key aspects of the participants’ learning behavior. However, it may fail in the following case: assume that the participant has an initial estimate of 1.51 for the envy parameter (α). Let’s say this corresponds to a rejection rate of 60%. Then, no matter how many times the teacher rejects the offer 70:30, the participant’s estimate of the envy parameter remains the same, but observing only one offer acceptance would decrease this estimate and, in turn, the model’s predicted rejection rate. We believe this is the anomalous behavior in 70:30 offers identified by the Reviewer, where the model does not appear able to recreate participants’ behavior.
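The anomaly described above can be reproduced with a small sketch. It takes the teaching signal from the expression quoted in the response, 30/(50-30) = 1.5, and assumes a standard delta-rule form for the update, since equation 8 itself is not reproduced in this response. The function names and the learning-rate value are hypothetical, and only disadvantageous offers (Receiver's share below 50) are covered.

```python
def envy_threshold(receiver_share):
    """Teaching signal for a disadvantageous offer, e.g. 30 / (50 - 30) = 1.5."""
    return receiver_share / (50 - receiver_share)

def update_envy(alpha, teacher_rejects, signal, learning_rate):
    """One-sided delta-rule update of the envy estimate (assumed form).

    A rejection implies the Teacher's envy exceeds the signal, so the
    estimate moves only if it is currently below the signal. An acceptance
    implies the opposite, so the estimate moves only if it is above it.
    """
    if teacher_rejects and alpha < signal:
        return alpha + learning_rate * (signal - alpha)
    if not teacher_rejects and alpha > signal:
        return alpha + learning_rate * (signal - alpha)
    return alpha  # estimate already consistent with the feedback: no update

signal = envy_threshold(30)  # 70:30 offer
after_reject = update_envy(1.51, True, signal, learning_rate=0.5)   # stays at 1.51
after_accept = update_envy(1.51, False, signal, learning_rate=0.5)  # drops toward 1.5
```

Starting from 1.51, any number of Teacher rejections leaves the estimate unchanged, while a single acceptance lowers it. Under these assumptions, a Teacher who mostly rejects 70:30 offers can therefore only pull such a learner's rejection rate downward.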

      This issue actually touches the core of our model specification, that is, the choice of the teaching signal. Because we chose 1.5 as the teaching signal (i.e., the bound on α implied whenever the teacher rejects or accepts a 70:30 offer), a very small deviation from 1.5 causes one direction of updating to fail. One way to mitigate this problem would be to choose a lower bound for α greater than 1.5, such that when the Teacher rejects a 70:30 offer, we assign a number greater than 1.5 (by ‘hard-coding’ this into the model via modification of equation 7). One sensible candidate value could be the midpoint between 1.5 and 10 (the maximum value of α per our model definition). Intuitively, the model in this setting could still pull up an α estimate of 1.51 when the teacher rejects 70:30, thus alleviating (but not completely eliminating) the anomaly.

      We fitted this modified Preference Inference model to the data from Experiment 1 (see Author response image 3 below) and found that even though this model has a smaller AIC (and thus better quantitative fit than the original Preference Inference model), it still doesn’t fully capture the participants’ behavior for 70:30 offers.

      Author response image 3.

      Accordingly, rather than revising our model to include an unprincipled ‘kludge’ to account for this minor anomaly in the model behavior, we have opted to report our original model in our revision as we still believe it parsimoniously captures our intuitions about preference learning and provides a better fit to the observed behavior than the other RL models considered in the present study.

      Reviewer #1 (Recommendations for the authors):

      (1) I do not particularly prefer the acronyms AI and DI for disadvantageous inequity and advantageous inequity. Although they have been used in the literature, not every single paper uses them. More importantly, AI these days has such a strong meaning of artificial intelligence, so when I was reading this, I'd need to very actively inhibit this interpretation. I believe for the readability for a wider readership of eLife, I would advise not to use AI/DI here, but rather use the full terms.

      We thank the Reviewer for this suggestion. As the full spellings of the two terms are somewhat lengthy and appear frequently in the figures, we have elected to change the abbreviations for disadvantageous inequity and advantageous inequity to Dis-I and Adv-I, respectively, in the main text and the supplementary information. We still use AI/DI in the response letter to make the terminology consistent.

      (2) Do "punishment rate" and "rejection rate" mean the same? If so, it would be helpful to stick with one single term, eg, rejection rate.

      We thank the Reviewer for this suggestion. As these terms have the same meaning, we have opted to use the term “rejection rate” throughout the main text.

      (3) For the linear mixed effect models, were other random effect structures also considered (eg, random slopes of experimental conditions)? It might be worth considering a few model specifications and selecting the best one to explain the data.

      Thanks for this comment. Following established best practices (Barr, Levy, Scheepers, & Tily, 2013) we have elected to use a maximal random effects structure, whereby all possible predictor variables in the fixed effects structure also appear in the random effects structure.

      (4) For equation (4), the softmax temperature is denoted as tau, but later in the text, it is called gamma. Please make it consistent.

      We are appreciative of the Reviewer’s attention to detail. We have corrected this error.

      Reviewer #2 (Recommendations for the authors):

      (1) Several Tables in SI are unclear. I wasn't clear if these report raw probabilities or coefficients of mixed models. For any mixed models, it would help to give the model specification (e.g., Walkins form) and explain how variables were coded.

      We are appreciative of the Reviewer’s attention to detail. We have clarified, in the captions accompanying our supplemental regression tables, that these coefficients represent log-odds. Regretfully we are unaware of the “Walkins form” the Reviewer references (even after extensive searching of the scientific literature). However, in our new revision we do include lme4 model syntax in our supplemental information, which we believe will be helpful for readers seeking to replicate our model specification.

      (2) In one of the models it was said that the guilt and envy parameters were bounded between 0-1 but this doesn't make sense and I think values outside this range were later reported.

      We are again appreciative of the Reviewer’s attention to detail. This was an error we have corrected— the actual range is [0,10].

      (3) It is unclear if the model parameters are recoverable.

      In response to this comment our revision now reports a basic parameter recovery analysis for the winning Preference Inference model. This is reported in our revised Methods:

      “Finally, to verify if the free parameters of the winning model (Preference Inference) are recoverable, we simulated 200 artificial subjects, based on the Learning Phase of Experiment 1, with free parameters randomly chosen (uniformly) from their defined ranges. We then employed the same model-fitting procedure as described above to estimate these parameter values. We found that all parameters of the model can be recovered (see Figure S2).”
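
      The logic of a parameter recovery check (draw parameters uniformly, simulate, refit, compare) can be sketched with a deliberately simple stand-in model. The example below is not the Preference Inference model: it assumes a one-parameter toy learner whose only free parameter is a rejection probability, fitted by grid-search maximum likelihood. The 100 trials per subject, the parameter range, and all function names are hypothetical.

```python
import math
import random


def simulate_choices(reject_prob, n_trials, rng):
    # True = reject, False = accept; each trial is an independent draw.
    return [rng.random() < reject_prob for _ in range(n_trials)]


def fit_reject_prob(choices, grid):
    # Grid-search maximum likelihood for a Bernoulli rejection probability.
    n_reject = sum(choices)
    n_accept = len(choices) - n_reject

    def log_likelihood(p):
        return n_reject * math.log(p) + n_accept * math.log(1.0 - p)

    return max(grid, key=log_likelihood)


def pearson_r(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)                      # fixed seed for reproducibility
grid = [i / 100 for i in range(1, 100)]     # candidate values 0.01 ... 0.99

# 200 artificial subjects with parameters drawn uniformly from their range.
true_values = [rng.uniform(0.05, 0.95) for _ in range(200)]
recovered = [
    fit_reject_prob(simulate_choices(p, 100, rng), grid) for p in true_values
]

# A high simulated-versus-recovered correlation indicates good recovery.
recovery_r = pearson_r(true_values, recovered)
```

      For a multi-parameter model the same loop would be run once per parameter, with one scatter plot of simulated against recovered values for each.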

      And scatter plots depicting these simulated (versus recovered) parameters are given in Figure S2 of our revised Supplementary Information:

      (4) I was confused about what Figure S2 shows. The text says this is about correlating contagious effects for different offers but the captions speak about learning effects. This is an important aspect which is unclear.

      We have removed this figure in response to both Reviewers’ comments about the limited insights that can be drawn on the basis of these correlations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Turner et al. present an original approach to investigate the role of Type-1 nNOS interneurons in driving neuronal network activity and in controlling vascular network dynamics in awake head-fixed mice. Selective activation or suppression of Type-1 nNOS interneurons has previously been achieved using either chemogenetic, optogenetic, or local pharmacology. Here, the authors took advantage of the fact that Type-1 nNOS interneurons are the only cortical cells that express the tachykinin receptor 1 to ablate them with a local injection of saporin conjugated to substance P (SP-SAP). SP-SAP causes cell death in 90 % of type1 nNOS interneurons without affecting microglia, astrocytes, and neurons. The authors report that the ablation has no major effects on sleep or behavior. Refining the analysis by scoring neural and hemodynamic signals with electrode recordings, calcium signal imaging, and wide-field optical imaging, the authors observe that Type-1 nNOS interneuron ablation does not change the various phases of the sleep/wake cycle. However, it does reduce low-frequency neural activity, irrespective of the classification of arousal state. Analyzing neurovascular coupling using multiple approaches, they report small changes in resting-state neural-hemodynamic correlations across arousal states, primarily mediated by changes in neural activity. Finally, they show that nNOS type 1 interneurons play a role in controlling interhemispheric coherence and vasomotion.

      In conclusion, these results are interesting, use state-of-the-art methods, and are well supported by the data and their analysis. I have only a few comments on the stimulus-evoked haemodynamic responses, and these can be easily addressed.

      We thank the reviewer for their positive comments on our work.

      Reviewer #2 (Public review):

      Summary:

      This important study by Turner et al. examines the functional role of a sparse but unique population of neurons in the cortex that express Nitric oxide synthase (Nos1). To do this, they pharmacologically ablate these neurons in the focal region of whisker-related primary somatosensory (S1) cortex using a saponin-substance P conjugate. Using widefield and 2-photon microscopy, as well as field recordings, they examine the impact of this cell-specific lesion on blood flow dynamics and neuronal population activity. Locally within the S1 cortex, they find changes in neural activity patterns, decreased delta band power, and reduced sensory-evoked changes in blood flow (specifically eliminating the sustained blood flow change after stimulation). Surprisingly, given the tiny fraction of cortical neurons removed by the lesion, they also find far-reaching effects on neural activity patterns and blood volume oscillations between the cerebral hemispheres.

      Strengths:

      This was a technically challenging study and the experiments were executed in an expert manner. The manuscript was well written and I appreciated the cartoon summary diagrams included in each figure. The analysis was rigorous and appropriate. Their discovery that Nos1 neurons can have far-reaching effects on blood flow dynamics and neural activity is quite novel and surprising (to me at least) and should seed many follow-up, mechanistic experiments to explain this phenomenon. The conclusions were justified by the convincing data presented.

      Weaknesses:

      I did not find any major flaws in the study. I have noted some potential issues with the authors' characterization of the lesion and its extent. The authors may want to re-analyse some of their data to further strengthen their conclusions. Lastly, some methodological information was missing, which should be addressed.

      We thank the reviewer for their enthusiasm for our work.

      Reviewer #3 (Public review):

      The role of type-I nNOS neurons is not fully understood. The data presented in this paper addresses this gap through optical and electrophysiological recordings in adult mice (awake and asleep).

      This manuscript reports on a study on type-I nNOS neurons in the somatosensory cortex of adult mice, from 3 to 9 months of age. Most data were acquired using a combination of IOS and electrophysiological recordings in awake and asleep mice. Pharmacological ablation of the type-I nNOS populations of cells led to decreased coherence in gamma band coupling between left and right hemispheres; decreased ultra-low frequency coupling between blood volume in each hemisphere; decreased (superficial) vascular responses to sustained sensory stimulus and abolishment of the post-stimulus CBV undershoot. While the findings shed new light on the role of type-I nNOS neurons, the etiology of the discrepancies between current observations and literature observations is not clear and many potential explanations are put forth in the discussion.

      We thank the reviewer for their comments.

      Reviewer #1 (Recommendations for the authors):  

      (1) Figure 3, Type-1 nNOS interneuron ablation has complex effects on neural and vascular responses to brief (.1s) and prolonged (5s) whisker stimulation. During 0.1 s stimulation, ablation of type 1 nNOS cells does not affect the early HbT response but only reduces the undershoot. What is the pan-neuronal calcium response? Is the peak enhanced, as might be expected from the removal of inhibition? The authors need to show the GCaMP7 trace obtained during this short stimulation.

      Unfortunately, we did not perform brief stimulation experiments in GCaMP-expressing mice. As we did not see a clear difference in the amplitude of the stimulus-evoked response with our initial electrophysiology recordings (Fig. 3a), we suspected that an effect might be visible with longer duration stimuli and thus pivoted to a pulsed stimulation over the course of 5 seconds for the remaining cohorts. It would have been beneficial to interweave short-stimulus trials for a direct comparison between the complementary experiments, but we did not do this.

      During 5s stimulation, both the early and delayed calcium/vascular responses are reduced. Could the authors elaborate on this? Does this mean that increasing the duration of stimulation triggers one or more additional phenomena that are sensitive to the ablation of type 1 nNOS cells and mask what is triggered by the short stimulation? Are astrocytes involved? How do they interpret the early decrease in neuronal calcium?

      As our findings show that ablation reduces the calcium/vascular response more prominently during prolonged stimulation, we do suspect that this is due to additional NO-dependent mechanisms or downstream responses. NO is a modulator of neural activity, generally increasing excitability (Kara and Friedlander 1999, Smith and Otis 2003), so any manipulation that changes NO levels will change (likely decrease) the excitability of the network, potentially resulting in a smaller hemodynamic response to sensory stimulation secondary to this decrease. While short stimuli engage rapid neurovascular coupling mechanisms, longer duration (>1s) stimulation could introduce additional regulatory elements, such as astrocytes, that operate on a slower time scale. On the right, we show a comparison of the control groups plotted together from Fig. 3a and 3b with vertical bars aligned to the peak. During the 5s stimulation, the time-to-peak is roughly 830 milliseconds later than the 0.1s stimulation, meaning it’s plausible that the signals don’t separate until later. Our interpretation is that the NVC mechanisms responsible for brief stimulus-evoked change are either NO-independent or are compensated for in the SSP-SAP group by other means due to the chronic nature of the ablation.

      We have added the following text to the Discussion (Line 368): “Loss of type-I nNOS neurons drove minimal changes in the vasodilation elicited by brief stimulation, but led to decreased vascular responses to sustained stimulation, suggesting that the early phase of neurovascular coupling is not mediated by these cells, consistent with the multiple known mechanisms for neurovascular coupling (Attwell et al 2010, Drew 2019, Hosford & Gourine 2019) acting through both neurons and astrocytes with multiple timescales (Le Gac et al 2025, Renden et al 2024, Schulz et al 2012, Tran et al 2018).”

      Author response image 1.

      (2) In Figures 4d and e, it is unclear to me why the authors use brief stimulation to analyze the relationship between HbT and neuronal activity (gamma power) and prolonged stimulation for the relationship between HbT and GCaMP7 signal. Could they compare the curves with both types of stimulation?

      As discussed previously, we did not use the same stimulation parameters across cohorts. The mice with implanted electrodes received only brief stimulation, while those undergoing calcium imaging received longer duration stimulus. 

      Reviewer #2 (Recommendations for the authors):

      (1) Results, how far-reaching is the cell-specific ablation? Would it be possible to estimate the volume of the cortex where Nos1 cells are depleted based on histology? Were there signs of neuronal injury more remotely, for example, beading of dendrites?

      We regularly see cell ablation spanning 1-2 mm in diameter within the somatosensory cortex of each animal, which is consistent with the spread of small molecules. Ribosome inactivating proteins like SAP are smaller than AAVs (~5 nm compared to ~25 nm in diameter) and thus diffuse slightly further. We observed no obvious indication of neuronal injury more remotely or in other brain regions, but we did not image or characterize dendritic beading, as this would require a sparse labeling of neurons to clearly see dendrites (NeuN only stains the cell body). Our histology shows no change in cell numbers.

      We have added the following text to the Results (Line 124): “Immunofluorescent labeling in mice injected with Blank-SAP showed labeling of nNOS-positive neurons near the injection site. In contrast, mice injected with SP-SAP showed a clear loss in nNOS-labeling, with a typical spread of 1-2 mm from the injection site, though nNOS-positive neurons both subcortically and in the entirety of the contralateral hemisphere remained intact.”

      (2) For histological analysis of cell counts after the lesion, more information is needed. How was the region of interest for counting cells determined (eg. 500um radius from needle/pipette tract?) and of what volume was analysed?

      The region of interest for both SSP-SAP and Blank SAP injections was a 1 mm diameter circle centered around the injection site and averaged across sections (typically 3-5 when available). In most animals, the SSP-SAP had a lateral spread greater than 500 microns and encompassed the entire depth of cortex (1-1.5 mm in SI, decreasing in the rostral to caudal direction). The counts within the 1 mm diameter ROI were averaged across sections and then converted into the cells per mm area as presented. Note the consistent decrease in type I nNOS cells seen across mice in Fig 1d, Fig S1b.

      We have added the following text in the Materials & Methods (Line 507): “The region of interest for analysis of cell counts was determined based on the injection site for both SP-SAP and Blank SAP injections, with a 1 mm diameter circle centered around the injection site and averaged across 3-5 sections where available. In most animals, the SP-SAP had a lateral spread greater than 500 microns and encompassed the entire depth of cortex (1-1.5 mm in SI).”
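
      The conversion from per-section counts to an areal density can be written out as a short worked example. It assumes only what is stated above (a 1 mm diameter circular ROI, counts averaged across sections); the function name and the example counts are hypothetical and are not data from the study.

```python
import math


def cells_per_mm2(counts_per_section, roi_diameter_mm=1.0):
    """Average the per-section counts, then divide by the circular ROI area."""
    roi_area_mm2 = math.pi * (roi_diameter_mm / 2.0) ** 2  # about 0.785 mm^2
    mean_count = sum(counts_per_section) / len(counts_per_section)
    return mean_count / roi_area_mm2


# Hypothetical counts from four sections of one animal (illustrative only).
example_counts = [4, 5, 3, 4]
density = cells_per_mm2(example_counts)
```

      Because the ROI area is smaller than 1 mm², the reported density is larger than the mean raw count by a factor of roughly 1.27.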

      (3) Based on Supplementary Figure 1, it appears that the Saponin conjugate not only depletes Nos neurons but also may affect vascular (endothelial perhaps) Nos expression. Some quantification of this effect and its extent may be insightful in terms of ascribing the effects of the lesion directly on neurons vs indirectly and perhaps more far-reaching via vascular/endothelial NOS.

      Thank you for this comment. While this is a possibility, we have found that although the high nNOS expression of type-I nNOS neurons makes NADPH diaphorase a good stain for detecting them, it is less useful for cell types that express NOS at lower levels. We have found that the absolute intensity of NADPH diaphorase staining is somewhat variable from section to section. Variability in overall NADPH diaphorase intensity is likely due to several factors, such as duration of staining, thickness of the section, and differences in PFA concentration within the tissue and between animals. As NADPH diaphorase staining is highly sensitive to the amount of PFA exposure, any small differences in processing could affect the intensity, and slight differences in perfusion quality and processing could account for this variability. A second, perhaps larger issue could be differences in the number of arteries (which will express NOS at much higher levels than veins, and thus will appear darker) in the section. We did not stain for smooth muscle and so cannot differentiate arteries and veins. Any difference in vessel intensity could be due to random variations in the numbers of arteries/veins in the section. While we believe that this is a potentially interesting question, our histological experiments were not able to address it.

      (4) The assessment for inflammation took place 1 month after the lesion, but the imaging presumably occurred ~ 2 weeks after the lesion. Note that it seemed somewhat ambiguous as to when approximately, the imaging, and electrophysiology experiments took place relative to the induction of the lesion. Presumably, some aspects of inflammation and disruption could have been missed, at the time when experiments were conducted, based on this disparity in assessment. The authors may want to raise this as a possible limitation.

      We apologize for our unclear description of the timeline. We began imaging experiments at least 4 weeks after ablation, the same time frame as when we performed our histological assays.

      We have added the following text to the Discussion (Line 379): “With imaging beginning four weeks after ablation, there could be compensatory rewiring of local and/or network activity following type-I nNOS ablation, where other signaling pathways from the neurons to the vasculature become strengthened to compensate for the loss of vasodilatory signaling from the type-I nNOS neurons.”

      (5) Results Figure 2, please define "P or delta P/P". Also, for Figure 2c-f, what do the black vertical ticks represent?

      ∆P/P is the change in the gamma-band power relative to the resting-state baseline, and black tick marks indicate binarized periods of vibrissae motion (‘whisking’). We have clarified this in Figure caption 2 (Line 174).

      (6) Figure 3b-e, is there not an undershoot (eventually) after 5s of stimulation that could be assessed?

      Previous work has shown that there is no undershoot in response to whisker stimulations of a few seconds (Drew, Shih, Kleinfeld, PNAS, 2011). The undershoot for brief stimuli happens within ~2.5 s of the onset/cessation of the brief stimulation; this is clearly lacking in the response to the 5s stim (Fig 3). The neurovascular coupling mechanisms recruited during the short stimulation are different than those recruited during the long stimulus, making a comparison of the undershoot between the two stimulation durations problematic.

      For Figures 3e and 6 how was surface arteriole diameter or vessel tone measured? 2P imaging of fluorescent dextran in plasma? Please add the experimental details of 2P imaging to the methods. Including some 2P images in the figures couldn't hurt to help the reader understand how these data were generated.

      We have added details about our 2-photon imaging (FITC-dextran, full-width at half-maximum calculation for vessel diameter) as well as a trace and vessel image to Figure 2.

      We have added the following text to the Materials & Methods (Line 477): “In two-photon experiments, mice were briefly anesthetized and retro-orbitally injected with 100 µL of 5% (weight/volume) fluorescein isothiocyanate–dextran (FITC) (FD150S, Sigma-Aldrich, St. Louis, MO) dissolved in sterile saline.”

      We have added the following text to the Materials & Methods (Line 532): “A rectangular box was drawn around a straight, evenly-illuminated vessel segment and the pixel intensity was averaged along the long axis to calculate the vessel’s diameter from the full-width at half-maximum (https://github.com/DrewLab/Surface-Vessel-FWHM-Diameter; (Drew, Shih et al. 2011)).”
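
      The full-width at half-maximum step described above can be sketched as follows. This is an independent illustration and not the code in the linked repository: it assumes the box is a list of rows running along the vessel, takes the minimum of the averaged profile as baseline, and locates the half-maximum crossings by linear interpolation. The function names and the pixel size are hypothetical.

```python
def average_along_vessel(box):
    """Collapse a 2D box (rows along the vessel) into one cross-vessel profile."""
    n_rows = len(box)
    n_cols = len(box[0])
    return [sum(row[c] for row in box) / n_rows for c in range(n_cols)]


def fwhm_diameter(profile, pixel_size_um):
    """Vessel diameter from the full-width at half-maximum of the profile."""
    baseline = min(profile)  # assumed baseline: dimmest pixel in the profile
    peak = max(profile)
    half = baseline + (peak - baseline) / 2.0
    above = [i for i, v in enumerate(profile) if v >= half]
    left, right = above[0], above[-1]

    def crossing(i_out, i_in):
        # Sub-pixel position where the profile crosses the half-maximum,
        # interpolated between a pixel below it and a pixel at or above it.
        v_out, v_in = profile[i_out], profile[i_in]
        return i_out + (half - v_out) / (v_in - v_out) * (i_in - i_out)

    left_edge = crossing(left - 1, left) if left > 0 else float(left)
    if right < len(profile) - 1:
        right_edge = crossing(right + 1, right)
    else:
        right_edge = float(right)
    return (right_edge - left_edge) * pixel_size_um


# Synthetic box: a bright 3-pixel-wide vessel on a dark background.
box = [[0, 0, 10, 10, 10, 0, 0] for _ in range(3)]
profile = average_along_vessel(box)
diameter_um = fwhm_diameter(profile, pixel_size_um=2.0)
```

      Averaging along the long axis before taking the width reduces pixel noise, which is why the text specifies a straight, evenly illuminated segment.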

      (7) Did the authors try stimulating other body parts (eg. limb) to estimate how specific the effects were, regionally? This is more of a curiosity question that the authors could comment on, I am not recommending new experiments.

      We did measure changes in [HbT] in the FL/HL representation of SI during locomotion (Line 205), which is known to increase neural activity in the somatosensory cortex (Huo, Smith and Drew, Journal of Neuroscience, 2014; Zhang et al., Nature Communications 2019). We observed a similar but not statistically significant trend of decreased [HbT] in SP-SAP compared to control. This may have been due to the sphere of influence of the ablation being centered on the vibrissae representation and not having fully encompassed the limb representation. We agree with the referee that it would be interesting to characterize these effects on other sensory regions as well as brain regions associated with tasks such as learning and behavior.

      (8) Regarding vasomotion experiments, are there no other components of this waveform that could be quantified beyond just variance? Amplitude, frequency? Maybe these don't add much but would be nice to see actual traces of the diameter fluctuations. Further, where exactly were widefield-based measures of vasomotion derived from? From some seed pixel or ~1mm ROI in the center of the whisker barrel cortex? Please clarify.

      The reviewer’s point is well taken. We have added power spectra of the resting-state data, which provide amplitude and frequency information. The integrated area under the curve of the power spectra is equal to the variance. Widefield-based measures of vasomotion were taken from the 1 mm ROI in the center of the whisker barrel cortex.

      We have added the following text to the Materials & Methods (Line 560): “Variance during the resting-state for both ∆[HbT] and diameter signals (Fig. 7) was taken from resting-state events lasting ≥10 seconds in duration. Average ∆[HbT] from within the 1 mm ROI over the vibrissae representation of SI during each arousal state was taken with respect to awake resting baseline events ≥10 seconds in duration.” 
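
      The statement that the area under the power spectrum equals the variance is Parseval's theorem, and it can be checked numerically. The sketch below assumes a plain discrete Fourier transform of a mean-subtracted signal with no windowing or tapering, so it will not match a multitaper or Welch estimate bin for bin. The test signal and function names are hypothetical.

```python
import cmath
import math


def one_sided_power(signal):
    """One-sided DFT power of the mean-subtracted signal, one value per bin."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        p = abs(coeff) ** 2 / n ** 2
        # Double every bin except Nyquist to fold in the negative frequencies.
        if not (n % 2 == 0 and k == n // 2):
            p *= 2
        power.append(p)
    return power


def population_variance(signal):
    mean = sum(signal) / len(signal)
    return sum((x - mean) ** 2 for x in signal) / len(signal)


# Synthetic trace: two oscillations plus a slow drift.
n_samples = 64
trace = [
    math.sin(2 * math.pi * 3 * t / n_samples)
    + 0.5 * math.cos(2 * math.pi * 10 * t / n_samples)
    + 0.1 * t / n_samples
    for t in range(n_samples)
]

total_power = sum(one_sided_power(trace))
variance = population_variance(trace)
```

      The equality holds for the population variance (dividing by N); the sample variance (dividing by N - 1) differs by a factor of N / (N - 1).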

      (9) On page 13, the title seems a bit strong. The data show a change in variance but that does not necessarily mean a change in absolute amplitude. Also, I did not see any reports of absolute vessel widths between groups from 2P experiments so any difference in the sampling of larger vs smaller arterioles could have affected the variance (ie. % changes could be much larger in smaller arterioles).

      We have updated the title of Figure 7 to specifically state power (which is equivalent to the variance) rather than amplitude (Line 331). We have also added absolute vessel widths to the Results (Line 340): “There was no difference in resting-state (baseline) diameter between the groups, with Blank-SAP having a diameter of 24.4 ± 7.5 μm and SP-SAP having a diameter of 23.0 ± 9.4 μm (t-test, p = 0.61).”

      (10) Big picture question. How could a manipulation that affects so few cells in 1 hemisphere (below 0.5% of total neurons in a region comprising 1-2% of the volume of one hemisphere) have such profound effects in both hemispheres? The authors suggest that some may have long-range interhemispheric projections, but that is presumably a fraction of the already small fraction of Nos1 neurons. Perhaps these neurons have specialized projections to subcortical brain nuclei (Nucleus Basalis, Raphe, Locus Coeruleus, reticular thalamus, etc) that then project widely to exert this outsized effect? Has there not been a detailed anatomical characterization of their efferent projections to cortical and sub-cortical areas? This point could be raised in the discussion.

      We apologize for the lack of clarity of our work on this point. We would like to clarify that the only analysis showing a change in the unablated hemisphere is the coherence/correlation analysis between the two hemispheres. Other metrics (LFP power and CBV power spectra) do not change in the hemisphere contralateral to the injection site, as we show in data added in two supplementary figures (Figs. S4 and S7). The coherence/correlation is a measure of the correlated dynamics in the two hemispheres. For this metric to change, there only needs to be a change in the dynamics of one hemisphere relative to another. If some aspects of the synchronization of neural and vascular dynamics across hemispheres are mediated by concurrent activation of type I nNOS neurons in both hemispheres, ablating them in one hemisphere will decrease synchrony. It is possible that type I nNOS neurons make some subcortical projections that were not reported in previous work (Tomioka 2005, Ruff 2024), but if these exist they are likely to be very small in number as they were not noted.

      We have added the text in the Results (Line 228): “In contrast to the observed reductions in LFP in the ablated hemisphere, we noted no gross changes in the power spectra of neural LFP in the unablated hemisphere (Fig. S7) or power of the cerebral blood volume fluctuations in either hemisphere (Fig. S4).”

      (Line 335): “The variance in ∆[HbT] during rest, a measure of vasomotion amplitude, was significantly reduced following type-I nNOS ablation (Fig. 7a), dropping from 40.9 ± 3.4 μM² in the Blank-SAP group (N = 24, 12M/12F) to 23.3 ± 2.3 μM² in the SP-SAP group (N = 24, 11M/13F) (GLME p = 6.9×10⁻⁵) with no significant difference in the unablated hemisphere (Fig. S7).”

      Reviewer #3 (Recommendations for the authors):

      (1)  The reporting would be greatly strengthened by following ARRIVE guidelines 2.0: https://arriveguidelines.org/: attrition rates and source of attrition, justification for the use of 119 (beyond just consistent with previous studies), etc.

      We performed a power analysis prior to our study aiming to detect a physiologically-relevant effect size of (Cohen’s d) = 1.3, or 1.3 standard deviations from the mean. Alpha and Power were set to the standard 0.05 and 0.80 respectively, requiring around 8 mice per group (SP-SAP, Blank, and for histology, naïve animals) for multiple independent groups (ephys, GCaMP, histology). To potentially account for any attrition due to failures in Type-I nNOS neuron ablation or other problems (such as electrode failure or window issues) we conservatively targeted a dozen mice for each group. Of mice that were imaged (1P/2P), two SP-SAP mice were removed from the dataset (24 SP-SAP remaining) post-histological analysis due to not showing ablation of nNOS neurons, an attrition rate of approximately 8%.

      We have added the following text to the Materials & Methods (Line 441): “Sample sizes are consistent with previous studies (Echagarruga et al 2020, Turner et al 2023, Turner et al 2020, Zhang et al 2021) and based on a power analysis requiring 8-10 mice per group (Cohen’s d = 1.3, α = 0.05, (1 - β) = 0.800). Experimenters were not blind to experimental conditions or data analysis except for histological experiments. Two SP-SAP mice were removed from the imaging datasets (24 SP-SAP remaining) due to not showing ablation of nNOS neurons during post-histological analysis, an attrition rate of approximately 8%.”
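
      As a rough check on the quoted figures, the sample size for comparing two independent groups can be approximated with the standard normal-approximation formula n = 2((z_α + z_power)/d)². The response does not state which test or software the authors used, so the sketch below is an assumption-laden approximation rather than their procedure; the function name and the `two_sided` switch are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80, two_sided=True):
    """Normal-approximation sample size per group for a two-group comparison."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_power = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_power) / effect_size_d) ** 2


# Cohen's d = 1.3, alpha = 0.05, power = 0.80, as quoted in the response.
n_two_sided = math.ceil(n_per_group(1.3))                   # rounds up to 10
n_one_sided = math.ceil(n_per_group(1.3, two_sided=False))  # rounds up to 8

# Attrition: 2 mice removed out of 24 remaining + 2 removed.
attrition_rate = 2 / (24 + 2)
```

      Under this approximation the two variants bracket the 8-10 mice per group quoted above, and 2 of 26 imaged SP-SAP mice corresponds to about 7.7%, in line with the stated attrition of approximately 8%. An exact calculation based on the t distribution would give slightly larger group sizes.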

      (2) Intro, line 38: Description of the importance of neurovascular coupling needs improvement. Coordinated haemodynamic activity is vital for maintaining neuronal health and the energy levels needed.

      We have added a sentence to the introduction (Line 41): “Neurovascular coupling plays a critical role in supporting neuronal function, as tightly coordinated hemodynamic activity is essential for meeting energy metabolism and maintaining brain health (Iadecola et al 2023, Schaeffer & Iadecola 2021).“

      (3) Given the wide range of mice ages, how was the age accounted for/its effects examined?

      Previous work from our lab has shown that there is no change in hemodynamic responses in awake mice over a wide range of ages (2-18 months), so the age range we used (3 to 9 months of age) should not impact this.

      We have added the following text in the Results (Line 437): “Previous work from our lab has shown that the vasodilation elicited by whisker stimulation is the same in 2–4-month-old mice as in 18-month-old mice (Bennett, Zhang et al. 2024). As the age range used here is spanned by this time interval, we would not expect any age-related differences.”

      (4) How was the susceptibility of low-frequency neuronal coupling signals to noise managed? How were the low-frequency bands results validated?

      We are not sure what the referee is asking here. Our electrophysiology recordings were made differentially using stereotrodes with tips separated by ~100µm, which provides excellent common-mode rejection to noise and a localized LFP signal. Previous publications from our lab (Winder et al., Nature Neuroscience 2017; Turner et al., eLife 2020) and others (Tu, Cramer, Zhang, eLife 2024) have repeatedly shown that there is a very weak correlation between the power in the low frequency bands and hemodynamic signals, so our results are consistent with this previous work.

      (5) It would be helpful to demonstrate the selectivity of cell *death* (as opposed to survival) induced by SP-SAP injections via assessments using markers of cell death.

      We agree that this would be a helpful complement to our histological studies, which show loss of type-I nNOS neurons but no loss of other cells and minimal inflammation with SP-saporin injections. However, we did not perform histology looking at cell death, only at surviving cells, given that we see no obvious inflammation or cell loss, which would be triggered by nonspecific cell death. Previous work has established that saporin is cytotoxic and specific only to cells that internalize it. Internalized saporin causes cell death via apoptosis (Bergamaschi, Perfetti et al. 1996), and the substance P receptor is internalized when the receptor is bound (Mantyh, Allen et al. 1995). Saporin treatment generates cellular debris that is phagocytosed by microglia, consistent with cell death (Seeger, Hartig et al. 1997). While it is possible that SP-saporin treatment causes type-I nNOS neurons to stop expressing nitric oxide synthase (which would make them disappear from our IHC staining), we think that this is unlikely given that the literature shows internalized saporin is clearly cytotoxic.

      We have added the following text to the Results (Line 131): “It is unlikely that the disappearance of type-I nNOS neurons is because they stopped expressing nNOS, as internalized saporin is cytotoxic. Exposure to SP-conjugated saporin causes rapid internalization of the SP receptor-ligand complex (Mantyh, Allen et al. 1995), and internalized saporin causes cell death via apoptosis (Bergamaschi, Perfetti et al. 1996). In the brain, the resulting cellular debris from saporin administration is then cleared by microglia phagocytosis (Seeger, Hartig et al. 1997).”

      (6) Was the decrease in inter-hemispheric correlation associated with any changes to the corpus callosum?

      We noted no gross changes to the structure of the corpus callosum in any of our histological reconstructions following SP-SAP administration; however, we did not specifically test for this. Again, as we note in our reply to reviewer 2, the decrease in interhemispheric synchronization does not imply that there are changes in the corpus callosum and could be mediated by the changes in neural activity in the hemisphere in which the type-I nNOS neurons were ablated.

      (7) How were automated cell counts validated?

      Criteria used for automated cell counts were validated with comparisons of manual counting as described in previous literature. We have added additional text describing the process in the Materials & Methods (Line 510): “For total cell counts, a region of interest (ROI) was delineated, and cells were automatically quantified under matched criteria for size, circularity and intensity. Image threshold was adjusted until absolute value percentages were between 1-10% of the histogram density. The function Analyze Particles was then used to estimate the number of particles with a size of 100-99999 pixels^2 and a circularity between 0.3 and 1.0 (Dao, Suresh Nair et al. 2020, Smith, Anderson et al. 2020, Sicher, Starnes et al. 2023). Immunoreactivity was quantified as mean fluorescence intensity of the ROI (Pleil, Rinker et al. 2015).”
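The size and circularity criteria quoted above amount to a simple filter over detected particles. The sketch below is an illustration only, not the authors' pipeline: it assumes each particle is summarized by an area (pixels^2) and a perimeter (pixels), uses the usual circularity definition 4π·area/perimeter², and caps the value at 1.0 as a choice made here. The function names and example measurements are hypothetical.

```python
from math import pi


def circularity(area, perimeter):
    """4*pi*area / perimeter**2, capped at 1.0 (1.0 = perfect circle)."""
    return min(1.0, 4 * pi * area / perimeter ** 2)


def count_particles(particles, min_area=100, max_area=99999,
                    min_circ=0.3, max_circ=1.0):
    """Count (area, perimeter) pairs that satisfy both criteria."""
    kept = 0
    for area, perimeter in particles:
        if (min_area <= area <= max_area
                and min_circ <= circularity(area, perimeter) <= max_circ):
            kept += 1
    return kept


# Hypothetical measurements in pixels^2 and pixels, for illustration only
example = [
    (pi * 10 ** 2, 2 * pi * 10),  # round, cell-sized blob: kept
    (50.0, 25.0),                 # too small: rejected
    (200.0, 402.0),               # thin streak, circularity ~0.016: rejected
    (400.0, 80.0),                # square-ish, circularity ~0.785: kept
]
n_cells = count_particles(example)
```

With these made-up measurements, two of the four particles pass both criteria. The lower circularity bound is what excludes elongated debris or vessel fragments that would otherwise pass the size filter.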

      (8) Given the weighting of the vascular IOS readout to the superficial tissue, it is important to qualify the extent of the hemodynamic contrast, ie the limitations of this readout.

      We have added the following text to the Discussion (Line 385): “Intrinsic optical signal readout is primarily weighted toward superficial tissue given the absorption and scattering characteristics of the wavelengths used. While surface vessels are tightly coupled with neural activity, it is still a matter of debate whether surface or intracortical vessels are a more reliable indicator of ongoing activity (Goense et al 2012; Huber et al 2015; Poplawsky & Kim 2014).”

      (9) Partial decreases observed through type-I iNOS neuronal ablation suggest other factors also play a role in regulating neural and vascular dynamics: data presented thus do *not* "indicate disruption of these neurons in diseases ranging from neurodegeneration to sleep disturbances," as currently stated. Please revise.

      We agree with the reviewer. We have changed the abstract sentence to read (Line 30): “This demonstrates that a small population of nNOS-positive neurons are indispensable for regulating both neural and vascular dynamics in the whole brain, raising the possibility that loss of these neurons could contribute to the development of neurodegenerative diseases and sleep disturbances.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Most human traits and common diseases are polygenic, influenced by numerous genetic variants across the genome. These variants are typically non-coding and likely function through gene regulatory mechanisms. To identify their target genes, one strategy is to examine if these variants are also found among genetic variants with detectable effects on gene expression levels, known as eQTLs. Surprisingly, this strategy has had limited success, and most disease variants are not identified as eQTLs, a puzzling observation recently referred to as "missing regulation". 

      In this work, Jeong and Bulyk aimed to better understand the reasons behind the gap between disease-associated variants and eQTLs. They focused on immune-related diseases and used lymphoblastoid cell lines (LCLs) as a surrogate for the cell types mediating the genetic effects. Their main hypothesis is that some variants without eQTL evidence might be identifiable by studying other molecular intermediates along the path from genotype to phenotype. They specifically focused on variants that affect chromatin accessibility, known as caQTLs, as a potential marker of regulatory activity. 

      The authors present data analyses supporting this hypothesis: several disease-associated variants are explained by caQTLs but not eQTLs. They further show that although caQTLs and eQTLs likely have largely overlapping underlying genetic variants, some variants are discovered only through one of these mapping strategies. Notably, they demonstrate that eQTL mapping is underpowered for gene-distal variants with small effects on gene expression, whereas caQTL mapping is not dependent on the distance to genes. Additionally, for some disease variants with caQTLs but no corresponding eQTLs in LCLs, they identify eQTLs in other cell types. 

      Altogether, Jeong and Bulyk convincingly demonstrate that for immune-related diseases, discovering the missing disease-eQTLs requires both larger eQTL studies and a broader range of cell types in expression assays. It remains to be seen what fractions of the missing disease-eQTLs will be discovered with either strategy and whether these results can be extended to other diseases or traits.

      We thank the reviewer for their accurate summary of our study and positive review of our findings for immune-related diseases.

      It should be noted that the problem of "missing regulation" has been investigated and discussed in several recent papers, notably Umans et al., Trends in Genetics 2021; Connally et al., eLife 2022; Mostafavi et al., Nat. Genet. 2023. The results reported by Jeong and Bulyk are not unexpected in light of this previous work (all of which they cite), but they add valuable empirical evidence that mostly aligns with the model and discussions presented in Mostafavi et al. 

      We thank the reviewer for their positive review of our results and manuscript. As Reviewer #1 noted, whether our and others' observation extends to other diseases or traits is an open question. For instance, Figure 2b in Mostafavi et al., Nat. Genet. (2023) demonstrated that there was a spectrum of depletion of eQTLs and enrichment of GWAS signals in constrained genes across various tissues and traits, respectively. Therefore, gene expression constraint may play a larger or smaller role in different diseases or traits. That immune cell types and cell states are extremely diverse (Schmiedel et al., Cell (2018) and Calderon et al., Nat. Genet. (2019), just to name a few) likely adds to the complexity of gene regulation that contributes to immune-mediated disease.

      Reviewer #2 (Public Review): 

      Summary: 

      eQTLs have emerged as a method for interpreting GWAS signals. However, some GWAS signals are difficult to explain with eQTLs. In this paper, the authors demonstrated that caQTLs can explain these signals. This suggests that for GWAS signals to actually lead to disease phenotypes, they must be accessible within the chromatin.

      However, fundamentally, caQTLs, like GWAS, have the limitation of not being able to determine which genes mediate the influence on disease phenotypes. This limitation is consistent with the constraints observed in this study. 

      We thank the reviewer for their accurate summary of our results.

      (1) For reproducibility, details are necessary in the method section.

      Details about adding YRI samples in ATAC-seq: For example, how many samples are there, and what is used among public data? There is LCL-derived iPSC and differentiated iPSC (cardiomyocytes) data, not LCL itself. How does this differ from LCL, and what is the rationale for including this data despite the differences?

      Banovich et al., Genome Research (2018) (PMID: 29208628), who generated data using LCL-derived iPSCs and differentiated iPSCs (cardiomyocytes), also generated ATAC-seq data from 20 YRI LCL samples. We analyzed those data to identify open chromatin regions (i.e., ATAC-seq peaks) in LCLs and merged the regions with open chromatin regions identified with 100 GBR LCL samples from two studies by Kumasaka et al. (Nature Genetics (2016) PMID: 26656845 and Nature Genetics (2019) PMID: 30478436). However, we restricted the caQTL analysis to only the 100 GBR samples because of possible ancestry effects and batch effects. We attempted caQTL analysis with the 20 YRI samples as well, but the result was noisy, likely due to smaller sample size and lower read depth of the ATAC-seq data.

      caQTL is described as having better power than eQTL despite having fewer samples. How does the number of ATAC peaks used in caQTL compare to the number of gene expressions used in eQTL?

      The number of ATAC peaks used in caQTL (99,320) is ~6.7 times greater than the number of genes (14,872) used in the eQTL analysis. Therefore, there is a higher chance of detecting a significant caQTL signal and a significant colocalization signal than there is for eQTLs. However, we reasoned that since distal eQTLs are more easily detected as caQTLs and since increasing the sample size of eQTLs through meta-analysis uncovered additional eQTL colocalization at loci with caQTL colocalization only, colocalized caQTLs are likely capturing disease-relevant regulatory effects.
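The "~6.7 times" figure follows directly from the two feature counts given in the response. A minimal check, with variable names chosen here for illustration:

```python
n_atac_peaks = 99_320  # peaks tested in the caQTL analysis (from the response)
n_genes = 14_872       # genes tested in the eQTL analysis (from the response)

ratio = n_atac_peaks / n_genes
print(f"{ratio:.1f}x more caQTL features than eQTL features")
```

The larger number of tested features is the reason the authors give for a higher baseline chance of a significant caQTL or colocalization signal; the ratio itself says nothing about per-feature power.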

      Details about RNA expression data: In the method section, it states that raw data (ERP001942) was accessed, and in data availability, processed data (E-GEUV-1) was used. These need to be consistent.

      Thank you for pointing this out. We used the processed data from Expression Atlas (https://www.ebi.ac.uk/gxa/experiments/E-GEUV-1/Results), and that's what we meant by "We downloaded RNA expression level data of the LCL samples from the Expression Atlas." We have revised the “RNA expression data preparation” section in our manuscript to make the text clearer.

      How many samples were used (the text states 373, but how was it reduced from the original 465, and the total genotype is said to be 493 samples while ATAC has n=100; what are the 20 others?), and it mentions European samples, but does this exclude YRI?

      We thank the reviewer for pointing out these points of confusion. Our reported count of 493 samples included YRI samples with RNA-seq data or ATAC-seq data that we ultimately did not use for QTL analyses. There were 373 European samples with RNA-seq data that we used for eQTL analysis, and 100 GBR samples (including some that overlap with the 373 European samples) that we used for caQTL analysis. We have revised the text to clarify these points.

      (2) Experimental results determining which TFs might bind to the representative signals of caQTL are required.

      We agree that caQTL colocalization is just the start of elucidating the regulatory mechanism of a GWAS locus. Determining which TFs are bound and which TFs' binding is altered would be necessary to describe the causal regulatory mechanism. For this, we utilized the Cistrome database to search for TFs whose binding overlaps the colocalized caQTL peaks. We present the results of this analysis in Supplementary Table 3 and Supplementary Figure 4, both of which we have added in our revised manuscript. Overall, protein factors associated with active transcription, such as POL2RA, and several immune cell TFs, including RUNX3, SPI1, and RELA, were frequently detected in those peaks. Detecting these factors in most peaks supports the likelihood that the colocalized caQTL peaks are active cis-regulatory elements. These results are consistent with our observation of enriched caQTL-mediated heritability in regions with active histone marks (Figure 1).

      (3) It is stated that caQTL is less tissue-specific compared to eQTL; would caQTL performed with ATAC-seq results from different cell types, yield similar results?

      We thank the reviewer for the question. Calderon et al. (PMID: 31570894) observed that "most effects on allelic imbalance (of ATAC-seq) were shared regardless of lineage or condition". Yet, there were regions where a different cell type or state would show inaccessibility (Figure 4d in Calderon et al.). Thus, we expect that ATAC-seq results from different cell types (e.g., T cells, B cells, monocytes, etc.) would lead to additional caQTLs showing colocalization at cell-type-specific open chromatin. However, if a region is accessible in both cell types, caQTL may be detected in both. Moreover, Alasoo et al., Nature Genetics (2018) (PMID: 29379200) observed that “many disease-risk variants affect chromatin structure in a broad range of cellular states, but their effects on expression are highly context specific.” In both studies, the authors investigated immune cell types, and there could be different observations in non-immune cell types and other diseases and traits.

      Reviewer #1 (Recommendations For The Authors): 

      I think it would strengthen the paper to explore gene-level differences in the discovery of caQTLs and eQTLs. For example, complex disease-relevant genes, on average, have more/longer regulatory domains (as shown by Wang and Goldstein, AJHG 2020; Mostafavi et al., Nat. Genet. 2023). Therefore, it is plausible that for such genes, caQTLs are much more easily discoverable than eQTLs due to (i) a larger mutational target size for caQTLs, and (ii) dispersion of expression heritability across multiple domains, which hampers the discovery of eQTLs but not caQTLs, which are studied independently of other domains in the region. In other words, discovered caQTLs and eQTLs likely vary in terms of their distance to genes (as the authors report), as well as their target genes.

      We thank the reviewer for the suggestion to explore gene-level differences. We expect the fact that complex disease-relevant genes have, on average, more/longer regulatory domains to explain our observations. We agree with both of your points: that many more regulatory elements are captured as accessible regions than there are expressed genes, and that genes often have multiple independent eQTLs, leading to dispersion of heritability. The gene-level trend that we described was the distance of the regulatory element from the genes. Additional analyses would be a relevant future direction.

      Also considering gene-level analysis, Mostafavi et al. show that the types of biases they report for eQTLs also apply to other molecular QTLs. It would be valuable to compare GWAS hits with versus without caQTL colocalization. Similarly, it would be insightful to compare GWAS hits with both colocalized caQTLs and eQTLs to GWAS hits with colocalized caQTLs but no eQTLs in any of the cell types. 

      We thank the reviewer for the comment. Investigating potential biases in the colocalized caQTLs would be useful, but we considered it beyond the scope of this work. In terms of biological factors, we demonstrated through mediated heritability analyses that more accessible chromatin (based on ATAC-seq read coverage) and regions with active histone marks were enriched for autoimmune disease associations (Figure 1). Furthermore, as greater distance of the regulatory variant from the transcription start site significantly reduced the cis-heritability, we would expect that distance would play a major role, similar to Mostafavi et al.’s conclusions.

      I don't think the argument for the role of natural selection contributing to the "missing regulation" is presented accurately. Specifically, large eQTLs acting on top trait-relevant genes are under stronger selection and thus, on average, segregate at lower frequencies. This makes them difficult to discover in eQTL assays. However, if not lost, they contribute as much, if not more, to trait heritability than weaker eQTLs at the same gene because their larger effects compensate for their lower frequency. At the most extreme, selection should have a "flattening" effect (e.g., see Simons et al., PLOS Biol 2018; O'Connor et al., AJHG 2019): weak and strong eQTLs at the same gene are expected to contribute equally to heritability. Therefore, the statement "Consequently, only weak eQTL variants, often in regions distal to the gene's promoter, may remain and affect traits" is not correct. If this turns out to be empirically true, other models, such as pleiotropic selection, need to explain it. 

      We thank the reviewer for the correction. We agree with the comment and have revised the sentences in the introduction accordingly.

      It is worth speculating why caQTLs may be more consistent across cell types than cis-eQTLs. Additionally, readers may infer from the paper that the focus should shift from eQTLs to caQTLs, which may not be the authors' intention. Perhaps these approaches are complementary: caQTLs can help with TSS-distal disease variants, while finding the target gene and regulatory context is more straightforward with eQTL colocalization. Addressing these points in the discussion will be helpful.

      We appreciate the reviewer's suggestion to clarify the advantages of incorporating cis-eQTLs and caQTLs. Our argument is exactly as you put it, and we added a paragraph on this in the Discussion.

      I believe the authors could do more to contextualize their findings within the existing literature on the subject, particularly Umans et al., Trends in Genetics 2021; Connally et al., eLife 2022; and Mostafavi et al., Nat. Genet. 2023. For instance, Umans et al. suggest that "if most standard eQTLs are generally benign, increasing sample size and adding more tissue types in an effort to identify even more standard eQTLs may not help us to explain many more disease risk mutations". Conversely, Mostafavi et al. argue for a multipronged approach, which appears more aligned with the authors' conclusions.

      We followed the reviewer’s suggestion to place our work in the context of existing literature on this topic. Moreover, we clarified what our recommendations for future data generation are.

      I thought Figures 1C-D were unclear. 

      We added a sentence in the figure legend describing that stronger and more significant enrichment indicates that mediated heritability is concentrated in that subset.

      Reviewer #2 (Recommendations For The Authors): 

      Complete workflow figures for caQTL calling and eQTL calling are required. 

      To improve clarity of the caQTL and eQTL calling workflow, we added Supplementary Figure 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      In this manuscript, Chen et al. investigate the role of the membrane estrogen receptor GPR30 in spinal mechanisms of neuropathic pain. Using a wide variety of techniques, they first provide convincing evidence that GPR30 expression is restricted to neurons within the spinal cord, and that GPR30 neurons are well-positioned to receive descending input from the primary sensory cortex (S1). In addition, the authors put their findings in the context of the previous knowledge in the field, presenting evidence demonstrating that GPR30 is expressed in the majority of CCK-expressing spinal neurons. Overall, this manuscript furthers our understanding of neural circuitry that underlies neuropathic pain and will be of broad interest to neuroscientists, especially those interested in somatosensation. Nevertheless, the manuscript would be strengthened by additional analyses and clarification of data that is currently presented.

      Strengths: 

      The authors present convincing evidence for the expression of GPR30 in the spinal cord that is specific to spinal neurons. Similarly, complementary approaches including pharmacological inhibition and knockdown of GPR30 are used to demonstrate the role of the receptor in driving nerve injury-induced pain in rodent models. 

      Weaknesses: 

      Although steps were taken to put their data into the broader context of what is already known about the spinal circuitry of pain, more considerations and analyses would help the authors better achieve their goal. For instance, to determine whether GPR30 is expressed in excitatory or inhibitory neurons, more selective markers for these subtypes should be used over CamK2. Moreover, quantitative analysis of the extent of overlap between GPR30+ and CCK+ spinal neurons is needed to understand the potential heterogeneity of the GPR30 spinal neuron population, and to interpret experiments characterizing descending S1 inputs onto GPR30 and CCK spinal neurons. Filling these gaps in knowledge would make their findings more solid.

      Thank you very much for your constructive feedback.

      In response to your suggestion, we have used more specific markers to distinguish excitatory (VGLUT2) and inhibitory (VGAT) neurons via in situ hybridization. These analyses revealed that GPR30 is predominantly expressed in excitatory neurons of the superficial dorsal horn (SDH), as presented in the Results section (lines 117-120) and in Figure 2A-B.

      Additionally, we performed a quantitative analysis to determine the extent of co-localization between GPR30+ and CCK+ neurons. The data were included in the Results (lines 131–132) and Figure 2G.

      Reviewer #2 (Public review):

      Using a variety of experimental manipulations, the authors show that the membrane estrogen receptor G protein-coupled estrogen receptor (GPER/GPR30) expressed in CCK+ excitatory spinal interneurons plays a major role in the pain symptoms observed in the chronic constriction injury (CCI) model of neuropathic pain. Intrathecal application of selective GPR30 agonist G-1 induced mechanical allodynia and thermal hyperalgesia in male and female mice. Downregulation of GPR30 in CCK+ interneurons prevented the development of mechanical and thermal hypersensitivity during CCI. They also show the up modulation of AMPA receptor expression by GPR30. 

      Generally, the conclusions are supported by the experimental results. I also would like to see significant improvements in the writing and the description of results. 

      Methodological details for some of the techniques are rather sparse. For example, when examining the co-localization of various markers, the authors do not indicate the number of animals/sections examined. Similarly, when examining the effect of shGper1, it is unclear how many cells/sections/animals were counted and analyzed. 

      In other sections, there is no description of the concentration of drugs used (for example, Figure 4H). In Figures 4C-E, there is no indication of the duration of the recordings, the ionic conditions, the effect of glutamate receptor blockers, etc.

      Some results appear anecdotal in the way they are described. For example, in Figure 5, it is unclear how many times this experiment was repeated. 

      We sincerely appreciate your valuable feedback and thoughtful recommendations.

      To address your concerns regarding methodological transparency, we have added the following details to the revised manuscript:

      The number of animals and sections analyzed in co-localization studies.

      The number of cells/sections/animals used in each quantification following shGper1 treatment.

      The concentrations of drugs administered (e.g., in Figure 4H).

      Detailed recording conditions, including duration, ionic composition, and pharmacological conditions (Figures 4C-E).

      In addition, we have thoroughly revised the writing throughout the manuscript to enhance clarity and precision in the description of our findings.

      Reviewer #3 (Public review): 

      Summary: 

      The authors convincingly demonstrate that a population of CCK+ spinal neurons in the deep dorsal horn express the G protein-coupled estrogen receptor GPR30 to modulate pain sensitivity in the chronic constriction injury (CCI) model of neuropathic pain in mice. Using complementary pharmacological and genetic knockdown experiments they convincingly show that GPR30 inhibition or knockdown reverses mechanical, tactile, and thermal hypersensitivity, conditioned place aversion, and c-fos staining in the spinal dorsal horn after CCI. They propose that GPR30 mediates an increase in postsynaptic AMPA receptors after CCI using slice electrophysiology which may underlie the increased behavioral sensitivity. They then use anterograde tracing approaches to show that CCK and GPR30 positive neurons in the deep dorsal horn may receive direct connections from the primary somatosensory cortex. Chemogenetic activation of these dorsal horn neurons proposed to be connected to S1 increased nociceptive sensitivity in a GPR30-dependent manner. Overall, the data are very convincing and the experiments are well conducted and adequately controlled. However, the proposed model of descending corticospinal facilitation of nociceptive sensitivity through GPR30 in a population of CCK+ neurons in the dorsal horn is not fully supported. 

      Strengths: 

      The experiments are very well executed and adequately controlled throughout the manuscript. The data are nicely presented and supportive of a role for GPR30 signaling in the spinal dorsal horn influencing nociceptive sensitivity following CCI. The authors also did an excellent job of using complementary approaches to rigorously test their hypothesis. 

      Weaknesses: 

      The primary weakness in this manuscript involves overextending the interpretations of the data to propose a direct link between corticospinal projections signaling through GPR30 on this CCK+ population of spinal dorsal horn neurons. For example, even in the cropped images presented, GPR30 is present in many other CCK-negative neurons. Only about a quarter of the cells labeled by the anterograde viral tracing experiment from S1 are CCK+. Since no direct evidence is provided for S1 signaling through GPR30, this conclusion should be revised. 

      Thank you for your encouraging comments and critical insights.

      We fully acknowledge the concern regarding the proposed direct involvement of corticospinal projections in modulating nociceptive behavior via GPR30 in CCK+ neurons. While our anterograde tracing experiments suggest anatomical overlap, we agree that definitive evidence of functional connectivity is lacking.

      Accordingly, we have revised the Abstract, Discussion, and Graphical Abstract to present our findings more cautiously. We now describe our observations as indicating that S1 projections potentially interact with GPR30<sup>+</sup> spinal neurons, rather than asserting a definitive functional link.

      To support this revised interpretation, we performed additional quantitative analyses examining the co-localization among S1 projections, CCK+, and GPR30+ neurons. Furthermore, we clarified that the chemogenetic activation studies targeted a mixed neuronal population and did not exclusively manipulate CCK+ neurons.

      These changes aim to better align our conclusions with the presented data and provide a more nuanced framework for future investigations.

      Reviewer #1 (Recommendations for the authors): 

      Major corrections 

      (1) Figure 2: The authors conclude that GPR30 is mainly expressed in excitatory spinal neurons because they are labeled by a virus with a Camk2 promoter. While there is evidence that Camk2 is specific to excitatory neurons in the brain, based on RNAseq datasets (e.g. Linnarsson Lab, http://mousebrain.org/adolescent/genesearch.html ) this is less clear cut within the spinal cord. A more direct way to assess the relative expression of GPR30 in excitatory versus inhibitory neurons would be to perform immunohistochemistry or FISH with GPR30/Vglut2/Vgat. 

      Alternatively, if this observation is not crucial for the overall arc of the story, I recommend the authors eliminate these data, as they do not support the idea that GPR30 is mainly in excitatory neurons.

      We thank the reviewer for highlighting this important limitation. To strengthen our conclusion regarding the neuronal identity of GPR30-expressing cells, we performed fluorescent in situ hybridization (FISH) using vGluT2 (marker for excitatory neurons) and VGAT (marker for inhibitory neurons). The results confirmed that GPR30 is predominantly expressed in vGluT2-positive excitatory neurons within the spinal cord. These new data are presented in the revised manuscript (lines 117-120) and shown in Figure 2A-B.

      (2) (2a) Figure 2: The authors also report that GPR30 is expressed in most CCK+ spinal neurons. A more rigorous way to present the data would be to perform quantification and report the % of CCK neurons that are GPR30. 

      (2b) More importantly, it is unclear what % of GPR30 neurons are CCK+. These types of quantifications would provide useful insights into the heterogeneity of CCK and GPR30 neuron populations, and help align the findings of the behavioral pharmacology experiments using GPR30 antagonists with those of the knockdown of Gper1 in CCK spinal neurons - for instance, does a population of GPR30+/CCK- neurons exist? If so, it would be worth discussing what role (if any) that population might play in nerve injury-induced mechanical allodynia.

      Understanding the breakdown of GPR30 populations becomes even more relevant when the authors characterize which cell types are targeted by descending projections from S1. It is clear that the vast majority of CCK+ neurons that receive descending input from S1 neurons are GPR30+, but there are many other GPR30+ neurons that do not receive input from S1 neurons presented in 5M. Is this simply because only a small fraction of CCK+/GPR30+ neurons are targeted by descending S1 projections, or could they represent a distinct population of GPR30 neurons?

      (2a) We appreciate the suggestion. Quantification showed that approximately 90% of CCK⁺ neurons express GPR30, and about 50% of GPR30⁺ neurons co-express CCK. These data are now provided in the revised Results (lines 131-132) and in Figure 2F-G.

      (2b) Indeed, our data reveal that a substantial portion of GPR30⁺ neurons do not co-express CCK. While this study focuses on GPR30 function in CCK⁺ neurons, we recognize the potential relevance of GPR30⁺/CCK⁻ populations. We have addressed this point in the Discussion (lines 303-306):

      “However, it should be noted that half of GPR30⁺ neurons are not co-localized with CCK⁺ neurons, and further studies are needed to explore the function of these GPR30⁺/CCK⁻ neurons in neuropathic pain.”

      Regarding descending input, our data in Figure 5 show that S1 projections selectively innervate a subset (~30%) of CCK⁺ neurons, most of which co-express GPR30. This suggests that S1-targeted CCK⁺/GPR30⁺ neurons may represent a functionally distinct population. We have added clarification to the revised manuscript, while acknowledging that further studies are needed to elucidate the roles of non-targeted GPR30⁺ neurons.
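      The two proportions quoted in (2a) are computed against different denominators, which is why both directions are worth reporting. A minimal sketch of that arithmetic follows; the cell counts are hypothetical values chosen only to reproduce the approximate proportions stated above, and the function name is illustrative rather than part of any analysis pipeline used in the study.

```python
from typing import Tuple


def colocalization_fractions(n_cck: int, n_gpr30: int, n_double: int) -> Tuple[float, float]:
    """Return (fraction of CCK+ cells that are GPR30+,
               fraction of GPR30+ cells that are CCK+)."""
    if n_cck <= 0 or n_gpr30 <= 0:
        raise ValueError("single-marker counts must be positive")
    if n_double < 0 or n_double > min(n_cck, n_gpr30):
        raise ValueError("double-positive count must lie between 0 and either single-marker count")
    # Same numerator, different denominators: the two fractions answer different questions.
    return n_double / n_cck, n_double / n_gpr30


# Hypothetical counts (NOT the study's data), picked to match ~90% and ~50%.
frac_cck_with_gpr30, frac_gpr30_with_cck = colocalization_fractions(
    n_cck=100, n_gpr30=180, n_double=90
)
print(f"CCK+ cells expressing GPR30: {frac_cck_with_gpr30:.0%}")
print(f"GPR30+ cells expressing CCK: {frac_gpr30_with_cck:.0%}")
```

      With these illustrative counts, half of the GPR30⁺ cells fall outside the CCK⁺ population even though nearly all CCK⁺ cells are GPR30⁺.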

      (3) Throughout the manuscript both male and female mice were used in experiments. Rather than referring to male and female mice as different genders, it would be more appropriate to describe them as different sexes. 

      As suggested, we have replaced all instances of “gender” with “sex” throughout the revised manuscript.

      (4) Figure 5: To increase the ease of interpreting the figure, in panels 5J and 5N, it would be helpful to indicate directly on the figure panel which other marker was assessed in double-labeling analyses.

      We have revised Figures 5J and 5N to include clear labels identifying the markers used in double-labeling analyses, to improve interpretability.

      Minor corrections: 

      (1) Line 36, I believe the authors mean to say "GPER/GPR30 in spinal neurons", rather than just "spinal". 

      Corrected as suggested. The sentence now reads (line 34):

      “Here we showed that the membrane estrogen receptor G-protein coupled estrogen receptor (GPER/GPR30) in spinal neurons was significantly upregulated in chronic constriction injury (CCI) mice…”

      (2) There are minor grammatical errors throughout the manuscript that interfere with comprehension. Proofreading/editing of the English language use may be beneficial. 

      We have thoroughly revised the manuscript for clarity and corrected grammatical and syntactic errors to improve readability.

      (3) Line 169-170, reads "Known that EPSCs are mediated by glutamatergic receptors like AMPA receptors and several studies have been reported the relationship between GPR30 and AMPA receptor25,29". Rewriting the sentence such that it better describes what the known relationship is between GPR30 and AMPA would be helpful in setting up the rationale of the experiment in Figure 4. 

      We have rewritten this section to better clarify the rationale behind the electrophysiological experiments (lines 161-164):

      “Given that EPSCs are primarily mediated through glutamatergic receptors such as AMPA receptors, and emerging evidence suggesting that GPR30 enhances excitatory transmission by promoting clustering of glutamatergic receptor subunits, we examined whether GPR30 modulates EPSCs via AMPA receptor-dependent mechanisms.”

      (4) Line 198-199 "Then we explored the possible connections among GPR30, S1-SDH projections and CCK+ neuron." In the context of spinal circuitry, "connections" may raise the expectation that synaptic connectivity will be evaluated. What I think best describes what the authors investigated in Figure 5 is the "relationship" between GPR30, S1-SDH projections, and CCK+ neurons. 

      We have revised the sentence accordingly (lines 184-186):

      “Building on previous findings suggesting a functional interaction between S1-SDH projections and spinal CCK⁺ neurons, our current study aimed to further elucidate the structural relationship among GPR30, S1-SDH projections, and CCK⁺ neurons.”

      (5) Figure 5: To increase the ease of interpreting the figure, in panels 5J and 5N, it would be helpful to indicate directly on the figure panel which other marker was assessed in double-labeling analyses.

      We have added direct labels to figure panels to clarify double-labeled analyses in the revised Figure 5J and 5N.

      Reviewer #2 (Recommendations for the authors): 

      (1) Can the authors provide more detail about the distribution of CCK+ cells in the spinal cord and, in particular, the localization of double-stained (CCK/cfos) neurons? 

      We thank the reviewer for this suggestion. To better characterize the distribution of CCK⁺ neurons within the spinal dorsal horn (SDH), we performed immunostaining in CCK-tdTomato mice using lamina-specific markers: CGRP (lamina I), IB4 (lamina II), and NF200 (lamina III–V). Our results demonstrate that CCK⁺ neurons are primarily localized in the deeper laminae of the SDH. These findings are now described in the revised Results (lines 126–129) and shown in Figure 2E.

      In addition, we conducted c-Fos immunostaining in CCK-Ai14 mice and found increased activation of CCK⁺ neurons following CCI. This supports the involvement of CCK⁺ neurons in neuropathic pain. These data are included in the Results (lines 129–131) and Supplementary Figure S4.

      (2) Figure 2A. There is no formal quantification of the percentage of TdTomato+ neurons that are also CCK+. The description of these results is insufficient. 

      We appreciate this point and have revised the description of Figure 2A accordingly. To strengthen our analysis, we conducted additional FISH experiments with vGluT2 and VGAT probes. Quantification revealed that GPR30 is predominantly expressed in excitatory neurons (approximately 60%). These data are shown in the revised Results (lines 117-119) and Figures 2A-B and S3. This supports our conclusion that GPR30 is largely localized to excitatory spinal interneurons.

      (3) Figure 4H. What is the evidence that these are AMPA-mediated currents? This is not explained in the text. 

      Thank you for raising this point. We now provide detailed experimental procedures to clarify that the recorded EPSCs are AMPA receptor–mediated. Specifically, spinal slices from CCK-Cre mice were used, and excitatory postsynaptic currents were recorded in the presence of APV (100 μM, NMDA receptor blocker), bicuculline (20 μM, GABA_A receptor blocker), and strychnine (0.5 μM, glycine receptor blocker), ensuring that the observed currents were AMPA-dependent. These methodological details are now clearly described in the revised Results (lines 165–173) and supported by prior literature (Zhang et al., J Biol Chem 2012; Hughes et al., J Neurosci 2010).

      (1) Yan Zhang, Xiao Xiao, Xiao-Meng Zhang, Zhi-Qi Zhao, Yu-Qiu Zhang (2012). Estrogen facilitates spinal cord synaptic transmission via membrane-bound estrogen receptors: implications for pain hypersensitivity. J Biol Chem. Sep 28;287(40):33268-81.

      (2) Ethan G Hughes, Xiaoyu Peng, Amy J Gleichman, Meizan Lai, Lei Zhou, Ryan Tsou, Thomas D Parsons, David R Lynch, Josep Dalmau, Rita J Balice-Gordon (2010). Cellular and synaptic mechanisms of anti-NMDA receptor encephalitis. J Neurosci. 2010 Apr 28;30(17):5866-75.

      (4) What is the signaling mechanism leading to a larger amplitude of currents after G-1 infusion? 

      We thank the reviewer for this important question. G-1 is a selective agonist for GPR30. Based on previous studies by Luo et al. (2016), we speculate that activation of GPR30 may increase the clustering of glutamatergic receptor subunits at postsynaptic sites, thereby enhancing AMPA receptor-mediated currents. While our current study did not directly address the intracellular signaling cascade, we have incorporated this mechanistic speculation in the Discussion.

      Jie Luo, X.H., Yali Li, Yang Li, Xueqin Xu, Yan Gao, Ruoshi Shi, Wanjun Yao, Juying Liu, Changbin Ke (2016). GPR30 disrupts the balance of GABAergic and glutamatergic transmission in the spinal cord driving to the development of bone cancer pain. Oncotarget 7, 73462-73472. 10.18632/oncotarget.11867.

      (5) Figure 4I. Please include error bars. 

      We have revised Figure 4I to include error bars, as requested.

      (6) Line 198. What is the evidence that AAV2/1 EF1α FLP is an antegrade trans monosynaptic marker? 

      We thank you for this request. AAV2/1 has been widely used for anterograde monosynaptic tracing based on its properties (Wang et al., Nat Neurosci 2024; Wu et al., Neurosci Bull 2021): (1) it infects neurons at the injection site and undergoes active anterograde transport; (2) newly assembled viral particles are released at synapses and infect postsynaptic partners; (3) in the absence of helper viruses, the spread halts at the first synapse, ensuring monosynaptic restriction. We have elaborated on this in the revised manuscript (line 198), citing Wang et al. (Nat Neurosci 2024) and Wu et al. (Neurosci Bull 2021).

      (1) Hao Wang, Qin Wang, Liuzhe Cui, Xiaoyang Feng, Ping Dong, Liheng Tan, Lin Lin, Hong Lian, Shuxia Cao, Huiqian Huang, Peng Cao, Xiao-Ming Li (2024). A molecularly defined amygdala-independent tetra-synaptic forebrain-to-hindbrain pathway for odor-driven innate fear and anxiety. Nat Neurosci. 2024 Mar;27(3):514-526.

      (2) Zi-Han Wu, Han-Yu Shao, Yuan-Yuan Fu, Xiao-Bo Wu, De-Li Cao, Sheng-Xiang Yan, Wei-Lin Sha, Yong-Jing Gao, Zhi-Jun Zhang (2021). Descending Modulation of Spinal Itch Transmission by Primary Somatosensory Cortex. Neurosci Bull. 2021 Sep;37(9):1345-1350.

      (7) Figure 5G. I do not understand the logic of this experiment. A Cre AAV is injected in the S1 cortex. Why should this lead to the expression of tdTomato on a downstream (postsynaptic?) neuron? The authors should quote the literature that supports this anterograde transsynaptic transport.

      We appreciate this question. As described in previous studies (e.g., Wu et al., Neurosci Bull 2021), AAV2/1-Cre injected into the S1 cortex leads to Cre expression in projection targets due to transsynaptic anterograde transport. Subsequent injection of a Cre-dependent AAV (AAV2/9-DIO-mCherry) into the spinal cord enables specific labeling of postsynaptic neurons that receive input from S1. We have clarified this mechanism in line 206 and provided the appropriate citation.

      Zi-Han Wu, Han-Yu Shao, Yuan-Yuan Fu, Xiao-Bo Wu, De-Li Cao, Sheng-Xiang Yan, Wei-Lin Sha, Yong-Jing Gao, Zhi-Jun Zhang (2021). Descending Modulation of Spinal Itch Transmission by Primary Somatosensory Cortex. Neurosci Bull. 2021 Sep;37(9):1345-1350.

      (8) The same question arises when interpreting the results obtained in Figure 6.

      We thank the reviewer for the question, and we have addressed it in point (7).

      (9) Line 257. How do the authors envision that estrogen would change its modulation of GPR30 under basal and neuropathic conditions? Is there any evidence for this speculation? 

      We thank the reviewer for raising this thoughtful question. In the current study, we focused on pharmacologically manipulating GPR30 activity via its selective agonist and antagonist. We did not directly investigate how endogenous estrogen regulates GPR30 under physiological and neuropathic states. We have recognized this limitation and highlighted the need for future research to investigate this regulatory mechanism.

      (10-20) In my opinion, the entire manuscript needs a careful revision of the English language. While one can follow the text, it contains numerous grammatical and syntactic errors that make the reading far from enjoyable. I am highlighting just a few of the many errors. 

      We appreciate the reviewer’s honest assessment. The manuscript has undergone thorough language editing by a native English speaker to correct grammatical errors, improve clarity, and enhance overall readability. We also restructured several sections, particularly the Discussion, to improve logical flow.

      (21) The discussion of results is a bit disorganized, with disconnected sentences and statements, and somewhat repetitive. For example, lines 303 to 306 lack adequate flow. It is also quite long and includes general statements that add little to the discussion of the new findings (lines 326-333). 

      We agree and have revised the Discussion extensively. Disconnected or repetitive sentences (e.g., lines 303-306, 326-333) have been removed or rewritten. For instance, we added a new transitional paragraph (lines 307-311) to improve flow:

      “Abnormal activation of neurons in the SDH is a key contributor to hyperalgesia, and enhanced excitatory synaptic transmission is a major mechanism driving increased neuronal excitability. Therefore, we evaluated excitatory postsynaptic currents (EPSCs) and observed increased amplitudes in CCK⁺ neurons following CCI, suggesting elevated excitability in these neurons.”

      We also removed redundant generalizations to maintain a focused discussion of our novel findings.

      Reviewer #3 (Recommendations for the authors): 

      (1) What is the distribution of GPR30 throughout the spinal cord and DRG? The authors demonstrate that this can overlap with a CCK+ population, but there are many GPR30+ and CCK negative neurons, even in the cropped images presented. It would be helpful to quantify the colocalization with CCK. 

      We thank the reviewer for this important point. As shown in the revised manuscript, GPR30 is expressed in both the spinal cord and dorsal root ganglia (DRG). However, our updated data (Figure 1B) demonstrate that Gper1 mRNA levels in the DRG are not significantly altered after CCI, suggesting a limited involvement of DRG GPR30 in neuropathic pain. These results are described in the revised Results (line 94).

      Regarding spinal co-expression, we performed a detailed quantification. Approximately 90% of CCK⁺ neurons express GPR30, while about 50% of GPR30⁺ neurons are CCK⁺. These co-localization results are now included in the revised Results and presented in Figure 2G.

      (2) It is clear that CCI and GPR30 influence excitatory synaptic transmission in CCK+ neurons. However, these experiments do not fully support the authors' claims of a postsynaptic upregulation of AMPARs. Comparing amplitudes and frequencies of spontaneous EPSCs cannot necessarily distinguish a pre- vs postsynaptic change since some of these EPSCs can arise from spontaneous action potential firing. I suggest revising this conclusion. 

      We appreciate these insightful comments. We fully agree that our data from spontaneous EPSC recordings (sEPSCs) in CCK⁺ neurons are not sufficient to distinguish between pre- and postsynaptic mechanisms, as sEPSCs may include spontaneous presynaptic activity. Therefore, we have revised the text throughout the manuscript to avoid overstating conclusions related to postsynaptic AMPA receptor upregulation.

      (3) What is the rationale for the evoked EPSC experiments from electrical stimulation in "the deep laminae of SDH?" I do not think that this experiment can rule out a presynaptic contribution of GPR30 to the evoked responses, particularly if these are Gs-coupled at presynaptic terminals. Paired-pulse stimulations could help answer this question, otherwise, alternative interpretations, also related to the point above, should be provided. 

      We thank the reviewer for this thoughtful critique. Indeed, electrical stimulation of the deep SDH laminae does not exclude presynaptic involvement, especially considering that GPR30 is a G protein–coupled receptor (GPCR) and could act presynaptically. We agree that paired-pulse ratio (PPR) analysis would be more informative in distinguishing pre- from postsynaptic effects, but this was not performed due to technical limitations in our current experimental setup.

      Accordingly, we have revised our interpretations in both the Results and Discussion to acknowledge that our data do not rule out presynaptic contributions. We now state that GPR30 activation enhances EPSCs in CCK⁺ neurons, while further studies are needed to dissect the precise site of action.
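      For readers unfamiliar with the paired-pulse analysis the reviewer suggests, the sketch below shows how a paired-pulse ratio is commonly computed: the mean amplitude of the second evoked EPSC divided by the mean amplitude of the first. The amplitudes are hypothetical numbers, not recordings from this study, and the function name is illustrative. As a general rule of thumb, a change in PPR alongside a change in amplitude points toward a presynaptic change in release probability, whereas an unchanged PPR is more consistent with a postsynaptic change; neither outcome is claimed here.

```python
from statistics import mean
from typing import Sequence


def paired_pulse_ratio(first_pa: Sequence[float], second_pa: Sequence[float]) -> float:
    """PPR = mean |EPSC2| / mean |EPSC1| across paired trials (amplitudes in pA)."""
    if len(first_pa) != len(second_pa) or len(first_pa) == 0:
        raise ValueError("need the same non-zero number of first and second responses")
    first_mean = mean([abs(a) for a in first_pa])
    second_mean = mean([abs(a) for a in second_pa])
    if first_mean == 0:
        raise ValueError("first-pulse amplitude is zero; PPR is undefined")
    return second_mean / first_mean


# Hypothetical inward-current amplitudes (pA), for illustration only.
ppr_baseline = paired_pulse_ratio([-100.0, -110.0, -90.0], [-150.0, -160.0, -140.0])
ppr_agonist = paired_pulse_ratio([-200.0, -220.0, -180.0], [-300.0, -320.0, -280.0])
print(f"baseline PPR = {ppr_baseline:.2f}, agonist PPR = {ppr_agonist:.2f}")
```

      In this made-up example the amplitudes double while the ratio stays the same, which is the pattern that would argue against a purely presynaptic explanation.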

      (4) I appreciate the challenging nature of the trans-synaptic viral labeling approaches, but the chemogenetic and Gper knockdown experiments do not selectively target this CCK+ population of deep dorsal horn neurons. The data are clear that each of these components (descending corticospinal projections, CCK neurons, and GPR30) can modulate nociceptive hypersensitivity, but I do not agree with the overall conclusion that each of are directly linked as the authors propose. I recommend revising the overall conclusion and title to reflect the convincing data presented. 

      We thank the reviewer for this critical observation. We agree that while our data show functional roles for descending cortical input, CCK⁺ neurons, and GPR30 in modulating pain hypersensitivity, the evidence does not establish a definitive direct circuit integrating all three components.

      In response, we have revised our conclusions to reflect this limitation. Specifically, we avoided claiming a direct functional link among S1 projections, CCK⁺ neurons, and GPR30. Instead, we now propose that GPR30 modulates neuropathic pain primarily through its action in CCK⁺ spinal neurons, with potential involvement of descending facilitation from the somatosensory cortex.

      Additionally, we have revised the manuscript title to better reflect our mechanistic focus: “GPR30 in spinal CCK-positive neurons modulates neuropathic pain.”

      Minor Corrections

      (1) The authors should refer to mice by sex, not gender. 

      Corrected throughout the manuscript.

      (2) Page 9, line 195: "significantly" is used to refer to co-localization of 28.1%. What is this significant to? 

      We have revised the sentence to accurately describe the observed percentage, without implying statistical significance:

      “Our co-staining results revealed that a high proportion of CCK⁺ S1-SDH postsynaptic neurons expressed GPR30” (lines 198-199).

      (3) I recommend modifying some of the transition phrases like "by the way," "what's more," and "besides". 

      All informal expressions have been replaced with academic alternatives including “Furthermore,” “Additionally,” and “Moreover.”

      (4) Additional guides to mark specific laminae in the dorsal horn would be useful. 

      We added immunostaining with laminar markers (CGRP for lamina I and NF200 for lamina III–V), and these data are now shown in Figure 2E and described in the Results (lines 126-129).

      (5) Page 5, line 115: immunochemistry should be immunohistochemistry. 

      Corrected as suggested.

      (6) Page 6, line 136: "Confirming the structural connnections" was not demonstrated here. Perhaps co-localization between GPR30 and CCK+. 

      The text was revised to “To functionally interrogate GPR30 and CCK⁺ neurons in neuropathic pain...” (line 133).

      (7) Page 8, line 166: unsure what "took and important role" means. 

      This phrasing was corrected for clarity and replaced with an accurate scientific description.

      (8) Page 8, line 168: "IPSCs of spinal CCK+ neurons" implies that they are sending inhibitory inputs. 

      We revised the term to “EPSCs” to correctly reflect excitatory synaptic currents in CCK⁺ neurons.

      (9) Page 8, line 169: "Known that EPSCs" is missing an introductory phrase. 

      The sentence was rewritten to include an appropriate introductory clause (lines 161–164):

      “Given that EPSCs are primarily mediated through glutamatergic receptors such as AMPA receptors...”

      (10) Page 10, line 227 and 228: "adequately" and "sufficiently" should be adequate and sufficient. 

      We corrected these terms to the proper adjective forms: “adequate” and “sufficient” (lines 224-225).

    1. Before we can outline these perspectives, it is necessary to make some general demarcations about what we mean when we talk about political significance in the context of data visualizations. First, ‘politics’ may be understood in narrow terms, as the workings of political parties, processes, and institutions. Politics may also be understood in a wider sense, as the struggle for power more broadly, as this struggle takes place both in the private as well as the cultural sphere, and by symbolic as well as material means.

      I think it's very interesting that the context of political significance is necessary here to explain how data and politics are intertwined. In a broad or a narrow sense, data visualizations are equally important in conveying things of importance, and with regard to politics, they can seriously affect how persuasive, informative, or even divisive a subject is.

    1. Back to the university: what are you supposed to be learning here? At minimum, you’ll probably pick up bits of knowledge here and there, but an effective education isn’t just about memorizing facts. It’s much more than about learning that but also learning how, especially given Cal Poly’s motto of “Learn by Doing.” But if you rely on using AI for your coursework, you might not even be learning that some particular thing is true. With AI and search engines, you can still access that knowledge you’re supposed to be learning, but being able to access x isn’t the same as internalizing x; the latter is much more useful, as we’ll discuss more below in part 3, “Future risks.”

      I think this is an important distinction. Just being able to access information with AI isn’t the same as actually learning and internalizing it. Memorizing facts may not be the point of education, but being able to apply and use knowledge is. If we skip the process of working through ideas ourselves, then we risk missing the deeper how of learning.

    1. Suzanne Briet: Physical evidence as document

      In part, I appreciate the pragmatism of Briet's approach. It would certainly make a cataloger's life easier to view documents in this way and, on its surface, it makes a tremendous amount of "sense".

      However, I can't help but feel this view is a little too limited. Certainly, it seems to me, the antelope itself would be a source of information. In one way it is an example of what an "antelope" is, but it is also an individual and, beyond that, an individual at a certain snapshot in time.

      In a very broad view, we can think that nothing is truly permanent as all things are constantly changing. I think it depends so much on how we observe and questions of time scale.

      Human beings are not even exactly what we were in the past. We grow (both physically and in other ways), we change (we age, we change our minds, we change our clothes, we get tattoos, we erase tattoos) and eventually we, as an individual, will cease to exist by any observable means (depending on your belief system) other than by the "things" we leave behind.

      We also continue to exist, in a sense, in the minds of those who knew us, but their memories cannot be a whole picture of who we were and certainly no one may know truly how we are inside our own heads. Others will certainly bring their own biases or preferences to their memories of us which may or may not be a complete picture of who we were.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Li et al. investigate Ca2+ signaling in T. gondii and argue that Ca2+ tunnels through the ER to other organelles to fuel multiple aspects of T. gondii biology. They focus in particular on TgSERCA as the presumed primary mechanism for ER Ca2+ filling. Although, when TgSERCA was knocked out there was still a Ca2+ release in response to TG present.

      Note that we did not generate a complete SERCA knockout, as this gene is essential, and its complete loss would not permit the isolation of viable parasites. Instead, we created conditional mutants that downregulate the expression of SERCA. Importantly, some residual activity is present in the mutant after 24 h of ATc treatment as shown in Fig 4C. This is consistent with our Western blots, which demonstrate the presence of residual SERCA protein at 1, 1.5 and 2 days post ATc treatment (Fig. 3B). We have clarified this point in the revised manuscript (lines 232-233). See also lines 97-102.

      Overall, the Ca2+ signaling data do not support the conclusion of Ca2+ tunneling through the ER to other organelles; in fact, they argue for direct Ca2+ uptake from the cytosol. The authors show EM membrane contact sites between the ER and other organelles, so Ca2+ released by the ER could presumably be taken up by other organelles, but that is not ER Ca2+ tunneling. They clearly show that SERCA is required for T. gondii function.

      Overall, the data presented do not fully support the conclusions reached.

      We agree that the data does not support Ca²⁺ tunneling as defined and characterized in mammalian cells. In response to this comment, we have modified the title and the text accordingly.

      However, we respectfully would like to emphasize that the study demonstrates more than just the role of SERCA in T. gondii “function”. Our findings reveal that the ER, through SERCA activity, sequesters calcium following influx through the PM (see reviewer 2 comment). The ER calcium pool is important for replenishing other intracellular compartments.

      The experiments support a model in which the ER actively takes up cytosolic Ca²⁺ as it enters the parasite and contributes to intracellular Ca²⁺ redistribution during transitions between distinct extracellular calcium environments. We believe that the role of the ER in modulating intracellular calcium dynamics is demonstrated in Figures 1H–K, 4G-H, and 5H–K. To highlight the relevance of these findings, we have included an expanded discussion in the revised manuscript. See lines 443-449 and 510-522.

      Data argue for direct Ca2+ uptake from the cytosol

      The ER most likely takes up calcium from the cytosol following its entry through the PM and redistributes it to the other organelles. We deleted any mention of the word “tunneling” and replaced it with transfer and re-distribution as they reflect our experimental findings more accurately.

      We interpret the experiments shown in Figure 1 H and I as re-distribution because the amount of calcium released after nigericin or GPN is greatly enhanced after TG addition. We first add calcium to allow intracellular stores to become filled, followed by the addition of TG, which allows calcium leakage from the ER. This leaked calcium can either enter the cytosol and be pumped out or be taken up by other organelles. Our interpretation is that this process leads to an increased calcium content in acidic compartments.

      We conducted an additional experiment in which SERCA was inhibited prior to calcium addition, allowing cytosolic calcium to be exported or taken up by acidic stores. We observed a change in the GPN response (Fig. S2A), possibly indicating that the PLVAC can sequester calcium when SERCA is inactive. While this may support the reviewer’s view, TG treatment does not reflect physiological conditions and may enhance calcium transfer to other compartments. Although the result is interesting, interpretation is complicated by the use of parasites in suspension and drug exposure in solution. Single-parasite measurements are not feasible due to weak signals, and adhered parasites are even less physiological than those in suspension.

      In support of our view, the experiments shown in Figs 4G and H show that downregulating SERCA significantly reduces the response to GPN, indicating diminished acidic store loading. In Fig 5I we observe that mitochondrial calcium uptake in response to GPN is reduced in the iDSERCA (+ATc) mutant. Fig 2B demonstrates that TgSERCA can take up calcium at 55 nM, close to resting cytosolic calcium, while in Figures 5E and S5B we show that the mitochondrion is not responsive to an increase of cytosolic calcium. Uptake by the mitochondria requires much higher concentrations (Fig 5B-C), which may be achieved within microdomains at MCS between the ER and mitochondrion. This is also consistent with findings reported by Li et al. (Nat Commun. 2021), where similar microdomain-mediated transfer of calcium to the apicoplast was observed (Fig. 7E and F of that reference).

      Reviewer 2 (Public review):

      The role of the endoplasmic reticulum (ER) calcium pump TgSERCA in sequestering and redistributing calcium to other intracellular organelles following influx at the plasma membrane.

      T. gondii transitions through life cycle stages within and exterior to the host cells, with very different exposures to calcium, adds significance to the current investigation of the role of the ER in redistributing calcium following exposure to physiological levels of extracellular calcium

      They also use a conditional knockout of TgSERCA to investigate its role in ER calcium store-filling and the ability of other subcellular organelles to sequester and release calcium. These knockout experiments provide important evidence that ER calcium uptake plays a significant role in maintaining the filling state of other intracellular compartments.

      We thank the reviewer.

      While it is clearly demonstrated, and not surprising, that the addition of 1.8 mM extracellular CaCl2 to intact T. gondii parasites preincubated with EGTA leads to an increase in cytosolic calcium and subsequent enhanced loading of the ER and other intracellular compartments, there is a caveat to the quantitation of these increases in calcium loading. The authors rely on the amplitude of cytosolic free calcium increases in response to thapsigargin, GPN, nigericin, and CCCP, all measured with fura2. This likely overestimates the changes in calcium pool sizes because the buffering of free calcium in the cytosol is nonlinear, and fura2 (with a Kd of 100-200 nM) is a substantial, if not predominant, cytosolic calcium buffer. Indeed, the increases in signal noise at higher cytosolic calcium levels (e.g. peak calcium in Figure 1C) are indicative of fura2 ratio calculations approaching saturation of the indicator dye.

      We acknowledge the limitations associated with using Fura-2 for cytosolic calcium measurements. However, according to the literature (Grynkiewicz, G. et al. (1985). J. Biol. Chem. 260 (6): 3440–3450. PMID 3838314), Fura-2 is suited for measurements between 100 nM and 1 µM calcium. The responses in our experiments were within that range, and the experiments with the SERCA mutant and mitochondrial GCaMPfs support the conclusions of our work.

      However, we agree with the reviewer that the experiment shown in Fig 1C (now Fig 1D) presents a response that approaches the limit of the linear range of Fura-2. In response to this, we have replaced this panel with a more representative experiment that remains within the linear range of the indicator (revised Fig 1D). Additionally, we have included new experiments adding GPN along with corresponding quantifications, which further support our conclusions regarding calcium dynamics in the parasite.
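      The concern about the linear range can be illustrated with the ratiometric calibration equation from the Grynkiewicz et al. (1985) paper cited above, [Ca²⁺] = Kd · β · (R − Rmin) / (Rmax − R). The sketch below is only an illustration of why estimates become unstable as R approaches Rmax; every calibration constant in it (Rmin, Rmax, Kd, β) is a hypothetical placeholder, not a value from this study.

```python
def fura2_calcium_nM(ratio: float, r_min: float, r_max: float,
                     kd_nM: float, beta: float) -> float:
    """Free [Ca2+] in nM from a 340/380 ratio via the Grynkiewicz calibration equation."""
    if not (r_min < ratio < r_max):
        raise ValueError("ratio must lie strictly between r_min and r_max")
    return kd_nM * beta * (ratio - r_min) / (r_max - ratio)


# Hypothetical calibration constants, for illustration only.
R_MIN, R_MAX, KD_NM, BETA = 0.3, 6.0, 200.0, 5.0


def calcium_step(ratio: float, delta: float = 0.1) -> float:
    """Change in estimated [Ca2+] produced by the same small change in ratio."""
    return (fura2_calcium_nM(ratio + delta, R_MIN, R_MAX, KD_NM, BETA)
            - fura2_calcium_nM(ratio, R_MIN, R_MAX, KD_NM, BETA))


step_near_rest = calcium_step(0.8)        # ratio well below saturation
step_near_saturation = calcium_step(5.0)  # ratio approaching R_MAX
print(f"d[Ca2+] per 0.1 ratio unit near rest: {step_near_rest:.1f} nM")
print(f"d[Ca2+] per 0.1 ratio unit near saturation: {step_near_saturation:.1f} nM")
```

      Because the denominator (Rmax − R) shrinks as the dye saturates, the same measurement noise in R translates into a much larger spread in the calcium estimate, which matches the reviewer's observation about noisier traces at peak calcium.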

      Another caveat, not addressed, is that loading of fura2/AM can result in compartmentalized fura2, which might modify free calcium levels and calcium storage capacity in intracellular organelles.

      We are aware of the potential issue of Fura-2 compartmentalization, and our protocol was designed to minimize this effect. We load cells with Fura-2 for 26 min at room temperature, then maintain them on ice, and restrict the use of loaded parasites to 2-3 hours. We have observed evidence of compartmentalization, which is reflected in increasing concentrations of resting calcium with time. We carry out experiments within a time frame in which the resting calcium stays within the 100 nM range. We have included a sentence on this in the Materials and Methods section (lines 604-606).

      Additionally, following this reviewer’s suggestion, we performed further experiments to directly assess compartmentalization. See below the full response to reviewer 2.

      The finding that the SERCA inhibitor cyclopiazonic acid (CPA) only mobilizes a fraction of the thapsigargin-sensitive calcium stores in T. gondii coincides with previously published work in another apicomplexan parasite, P. falciparum, showing that thapsigargin mobilizes calcium from both CPA-sensitive and CPA-insensitive calcium pools (Borges-Pereira et al., 2020, DOI: 10.1074/jbc.RA120.014906). It would be valuable to determine whether this reflects the off-target effects of thapsigargin or the differential sensitivity of TgSERCA to the two inhibitors.

      This is an interesting observation, and we now include a discussion of this result considering the Plasmodium study and include the citation. Lines 436-442.

      Figure S1 suggests differential sensitivity, and it shows that thapsigargin mobilizes calcium from both CPA-sensitive and CPA-insensitive calcium pools in T. gondii. Also important is that we used 1 µM TG as we are aware that TG has shown off-target effects at higher concentrations. TG is a well-characterized, irreversible SERCA inhibitor that ensures complete and sustained inhibition of SERCA activity. In contrast, CPA is a reversible inhibitor whose effectiveness is influenced by ATP levels, and it may only partially inhibit SERCA or dissociate over time, allowing residual Ca²⁺ reuptake into the ER.

      Additionally, as suggested by the reviewer we performed experiments using the Mag-Fluo-4 protocol to compare the inhibitory effects of CPA and TG. These results are presented in Fig. S3 (Lines 217-223). Under the conditions of the Mag-Fluo-4 assay with digitonin-permeabilized cells, both TG and CPA showed similar rates of Ca²⁺ leakage following the addition of the inhibitor. This may indicate that under the conditions of the Mag-Fluo-4 experiments the rate of Ca²⁺ leak is mostly determined by the intrinsic leak mechanism and not by the nature of the inhibitor. By contrast, in intact Fura-2–loaded cells, CPA induces a smaller cytosolic Ca²⁺ increase than TG, consistent with less efficient SERCA inhibition likely due to its reversibility and possibly incomplete inhibition under cellular conditions.

      The authors interpret the residual calcium mobilization response to Zaprinast observed after ATc knockdown of TgSERCA (Figures 4E, 4F) as indicative of a target calcium pool in addition to the ER. While this may well be correct, it appears from the description of this experiment that it was carried out using the same conditions as Figure 4A where TgSERCA activity was only reduced by about 50%.

      We partially agree with the reviewer that 50% knockdown of TgSERCA means that the ER may still be targeted by zaprinast, and that there is no definitive evidence of the involvement of another calcium pool. The Mag-Fluo-4 experiment, while we acknowledge that the fluorescence of Mag-Fluo-4 is not linear with calcium, indicates that SERCA activity is present even after 24 hr of ATc treatment. However, when Zaprinast is added after TG, we observed a significant calcium release in wild-type cells. This result suggests the presence of a large calcium pool other than the one mobilized by TG (PMID: 2693306).

      We recently published work describing the Golgi as a calcium store in Toxoplasma (PMID: 40043955) and we showed in Fig. S4 D-G of that work, that GPN treatment of tachyzoites loaded with Fura-2 diminished the Zaprinast response indicating that they could be impacting a similar store. In the present study we performed additional experiments in which TG was followed by GPN and Zaprinast showing a similar pattern. GPN significantly diminished the Zaprinast response. These results are shown now in Figure S2B. We address these possibilities in the discussion and interpretation of the result. Lines 451-460.

      The data in Figures 4A vs 4G and Figures 4B vs 4H indicate that the size of the response to GPN is similar to that with thapsigargin in both the presence and absence of extracellular calcium. This raises the question of whether GPN is only releasing calcium from acidic compartments or whether it acts on the ER calcium stores, as previously suggested by Atakpa et al. 2019 DOI: 10.1242/jcs.223883. Nonetheless, Figure 1H shows that there is a robust calcium response to GPN after the addition of thapsigargin.

      The results of the indicated experiments did not exclude the possibility that GPN can also mobilize some calcium from the ER besides acidic organelles. We don’t have any evidence to support that GPN can mobilize calcium from the ER either. Based on our unpublished work, we think GPN mainly releases calcium from the PLVAC. We included the mentioned citation and discussed the result considering the possibility that GPN may be acting on more than one store. Lines 451-460.

      An important advance in the current work is the use of state-of-the-art approaches with targeted genetically encoded calcium indicators (GECIs) to monitor calcium in important subcellular compartments. The authors have previously done this with the apicoplast, but now add the mitochondria to their repertoire. Despite the absence of a canonical mitochondrial calcium uniporter (MCU) in the Toxoplasma genome, the authors demonstrate the ability of T. gondii mitochondria to accumulate calcium, albeit at high calcium concentrations. Although the calcium concentrations here are higher than needed for mammalian mitochondrial calcium uptake, there too calcium uptake requires calcium levels higher than those typically attained in the bulk cytosolic compartment. And just like in mammalian mitochondria, the current work shows that ER calcium release can elicit mitochondrial calcium loading even when other sources of elevated cytosolic calcium are ineffective, suggesting a role for ER-mitochondrial membrane contact sites. With these new tools in hand, it will be of great value to elucidate the bioenergetics and transport pathways associated with mitochondrial calcium accumulation in T. gondii.

      We thank the reviewer for praising our work. Studies of bioenergetics and transport pathways associated with mitochondrial calcium accumulation are part of our future plans, mentioned in lines 520-522 and 545.

      The current studies of calcium pools and their interactions with the ER and dependence on SERCA activity in T. gondii are complemented by super-resolution microscopy and electron microscopy that do indeed demonstrate the presence of close appositions between the ER and other organelles (see also videos). Thus, the work presented provides good evidence for the ER acting as the orchestrating organelle delivering calcium to other subcellular compartments through contact sites in T. gondii, as has become increasingly clear from work in other organisms.

      Thank you

      Reviewer #3 (Public review):

      This manuscript describes an investigation of how intracellular calcium stores are regulated and provides evidence that is in line with the role of the SERCA-Ca2+ATPase in this important homeostasis pathway. Calcium uptake by mitochondria is further investigated and the authors suggest that ER-mitochondria membrane contact sites may be involved in mediating this, as demonstrated in other organisms.

      The significance of the findings is in shedding light on key elements within the mechanism of calcium storage and regulation/homeostasis in the medically important parasite Toxoplasma gondii whose ability to infect and cause disease critically relies on calcium signalling. An important strength is that despite its importance, calcium homeostasis in Toxoplasma is understudied and not well understood.

      We agree with the reviewer. Thank you

      A difficulty in the field, and a weakness of the work, is that following calcium in the cell is technically challenging and thus requires reliance on artificial conditions. In this context, the main weakness of the manuscript is the extrapolation of data. The language used could be more careful, especially considering that the way to measure the ER calcium is highly artificial - for example utilising permeabilization and over-loading the experiment with calcium. Measures are also indirect - for example, when the response to ionomycin treatment was not fully in line with the suggested model the authors hypothesise that the result is likely affected by other storage, but there is no direct support for that.

      The Mag-Fluo-4-based protocol for measuring intraluminal calcium is well established and has been extensively used in mammalian cells, DT40 cells and other cells for measuring intraluminal calcium, activity of SERCA and response to IP3 (Some examples: PMID: 32179239, PMID: 15963563, PMID: 19668195, PMID: 30185837, PMID: 19920131).

      Furthermore, we have successfully employed this protocol in previous work, including the characterization of the Trypanosoma brucei IP3R (PMID: 23319604) and the assessment of SERCA activity in Toxoplasma (PMID: 40043955 and 34608145). The citation PMID: 32179239 provides a detailed description of the protocol, including references to its prior use. In addition, the schematic at the top of Figure 2 summarizes the experimental workflow, reinforcing that the protocol follows established methodologies. We included more references and an expanded discussion, lines 425-435.

      We respectfully disagree with the concern regarding potential calcium overloading. The cells used in our assays were permeabilized, which is a critical step that allows precise control of calcium concentrations. All experiments were conducted at 220 nM free calcium, a concentration within the physiological range of cytosolic calcium fluctuations. This concentration was consistently used across all studies described above. Importantly, permeabilization ensures that the dye present in the cytosol becomes diluted and allows MgATP (which cannot cross intact membranes) to access the ER membrane, in addition to exposing the ER to precise calcium concentrations.

      The Mag-Fluo-4 loading conditions are designed to allow compartmentalization of the indicator to all intracellular compartments. The calcium uptake stimulated by MgATP occurs exclusively in the compartment occupied by SERCA, as only SERCA is responsive to MgATP-dependent transport in this experimental setup.

      Regarding the use of IO, we would like to clarify that its broad-spectrum activity is well documented. As a calcium ionophore, IO facilitates calcium release across multiple membranes, not just the ER, leading to a more substantial calcium release compared to the more selective effect of TG. The results observed with IO were consistent with this expected broader activity and support our interpretation.

      Lastly, we emphasize that the experiment in Figure 2 was designed specifically to assess SERCA activity in situ under defined conditions. It was not intended to provide a comprehensive characterization of the role of TgSERCA in the parasite. We now clarify this distinction in the revised Discussion lines 425-435.

      Below we provide some suggestions to improve controls, however, even with those included, we would still be in favour of revising the language and trying to avoid making strong and definitive conclusions. For example, in the discussion perhaps replace "showed" with "provide evidence that are consistent with..."; replace or remove words like "efficiently" and "impressive"; revise the definitive language used in the last few lines of the abstract (lines 13-17); etc. Importantly we recommend reconsidering whether the data is sufficiently direct and unambiguous to justify the model proposed in Figure 7 (we are in favour of removing this figure at this early point of our understanding of the calcium dynamic between organelles in Toxoplasma).

      We thank the reviewer for the suggestions, and we modified the language as suggested. We limited the use of the word "showed" to references to previously published work. We deleted the other words.

      Figure 7 is intended as a conceptual model to summarize our proposed pathways, and, like all models, it represents a working hypothesis that may not fully capture the complexity of calcium dynamics in the parasite. In light of the reviewer’s comments, we revised the figure and legend to clearly distinguish between pathways for which there is experimental evidence from those that are hypothetical.

      Another important weakness is poor referencing of previous work in the field. Lines 248-250 read almost as if the authors originally hypothesised the idea that calcium is shuttled between ER and mitochondria via membrane contact sites (MCS) - but there is extensive literature on other eukaryotes which should be first cited and discussed in this context. Likewise, the discussion of MCS in Toxoplasma does not include the body of work already published on this parasite by several groups. It is informative to discuss observations in light of what is already known.

      The sentence in which we state the hypothesis about the calcium transfer refers specifically to Toxoplasma. To clarify this, we have now added the phrase “In mammalian cells” (Line 311) and included additional citations, as suggested by the reviewer. While only a few studies have described membrane contact sites (MCSs) in Toxoplasma, we do cite several pertinent articles (e.g., lines 479-486). We believe that we cited all articles mentioning MCSs in T. gondii.

      However, we must clarify to the reviewer that the primary focus of our study is not to characterize or confirm the presence of MCSs in T. gondii, but rather to demonstrate functional calcium transfer between the ER and mitochondria. Our data support the conclusion that this transfer requires close apposition of these organelles, consistent with the presence of MCSs.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Line 45: change influx to release as Ca2+ influx usually referred to Ca2+ entry from the extracellular space. Same for line 71.

      Corrected, lines 47 and 73.

      (2) Line 54: consider toning down the strong statement of 'widely' accepted as ER Ca2+ subdomain heterogeneity remains somewhat debated.

      Changed the sentence to “it has been proposed”, Line 56

      (3) Line 119-21: A lower release in response to TG is typical and does not reflect TG specific for SERCA. It is due to the slow kinetics of Ca2+ leak out of the ER allowing other buffering and transport mechanisms to act. Also, could be a reflection of the duration after TG treatment to allow complete store depletion. Figure S1A-B shows that there is still Ca2+ in the stores following TG but the TG signal does not go back to baseline arguing that the leak is still active. Hence the current data does not address the specificity of TG for TgSERCA. Please revise the statement accordingly.

      Thank you for the suggestion. We changed the sentence to this: “This result could reflect the slow kinetics of Ca²⁺ leak from the ER, allowing other buffering and transport mechanisms to mitigate the phenomenon. Alternatively, it may indicate the duration after TG treatment allowing time to complete store depletion. As shown in Figure S1A-B, residual Ca²⁺ remains in the stores after TG treatment, and the TG-induced phenomenon does not return to baseline, suggesting that the leak remains active”. Lines 124-128.

      (4) Figure 1C: the authors interpret the data 'This Ca2+ influx appeared to be immediately taken up by the ER as the response to TG was much greater in parasites previously exposed to extracellular Ca2+'. I don't understand this interpretation: in Ca2+-containing solution it would be expected to have a larger signal, as TG is likely to activate store-operated Ca2+ entry, which would contribute to a larger cytosolic Ca2+ transient. Does T. gondii have SOCE? It cannot be uptake into the ER as SERCA is blocked. Unless the authors are arguing for another ER Ca2+ uptake pathway? But then Ca2+ uptake into the ER would lower the signal, whereas the data show an increased signal.

      We pre-incubated the suspension with calcium to allow filling of the stores while SERCA is still active, and added thapsigargin (TG) at 400 seconds to measure calcium release. The experiment was designed to introduce the concept that the ER may have access to extracellular calcium, a phenomenon not yet clearly demonstrated in Toxoplasma. We did not expect less release by TG, but if the ER were not efficient in filling after extracellular calcium entry, a similar response to TG would be expected. Yes, it is very possible that when we add TG we are also seeing more calcium entry through the PM, as we previously proposed that the increased cytosolic Ca²⁺ may regulate Ca²⁺ entry. However, the evidence does not support that this increased entry would be triggered by store depletion. The experiments with the SERCA mutant (Fig. 4D) show that in the conditional knockout mutant the ER is partially depleted, yet this does not lead to enhanced calcium entry, suggesting that depletion alone is not sufficient to trigger increased influx.

      There is no experimental evidence supporting the regulation of calcium entry by store depletion in Toxoplasma (PMID: 24867952). We revised the text to clarify this point and expanded the discussion on store-operated calcium entry (SOCE). While it is possible that a channel similar to Orai exists in Toxoplasma, it is highly unlikely to be regulated by store depletion, as there is no gene homologous to STIM. If store-regulated calcium entry does occur in Toxoplasma, it is likely mediated through a different, still unidentified, mechanism. Lines 461-467.

      (5) The choice of adding Ca2+ first followed by TG is curious as it is more difficult to interpret. Would be more informative to add TG, allow the leak to complete, and then add Ca2+ which would allow temporal separation between Ca2+ release from stores and Ca2+ influx from the extracellular space. Was this experiment done? If not would be useful to have the data.

      Yes, this experiment was already published: PMID: 24867952 and PMID: 38382669.

      It mainly highlighted that increased cytosolic calcium may regulate calcium entry most likely through a TRP channel. See our response to point 4 and the description of the new Fig. S2 in the response to point 7.

      (6) Line 136-39: these experiments as designed - partly because of the issues discussed above - do not address the ability of organelles to access extracellular Ca2+ or the state of refilling of intracellular Ca2+ stores. They can simply be interpreted as the different agents (TG, Nig, GPN, CCCP) inducing various levels of Ca2+ influx.

      Concerning TG, the experiment shown in Fig. 4D shows that depletion of the ER calcium does not result in stimulation of calcium entry, indicating the absence of classical SOCE activation in Toxoplasma.

      To our knowledge, neither mitochondria nor lysosomes (or other acidic compartments) are capable of triggering classical SOCE in mammalian cells.

      Given that the ER in Toxoplasma lacks the canonical components required to initiate SOCE, it is unclear why the mitochondria or acidic compartments would be able to do so. While it is possible that T. gondii utilizes an alternative mechanism for store-operated calcium entry, investigating such a pathway would require a comprehensive study. In mammalian systems, it took almost 15 years and the efforts of multiple research groups to identify the molecular components of SOCE. Expecting this complex question to be resolved within the scope of a single study is unrealistic.

      Our current data show that the mitochondrion is unable to access calcium from the cytosol, as shown in Figure 5E. Performing a similar experiment for the PLVAC would be ideal; however, expression of fluorescent calcium indicators in this organelle has not been successful. This is likely due to the presence of several proteases that degrade expressed proteins, as well as the acidic environment, which quenches fluorescence. These challenges have made studying calcium dynamics in the PLVAC particularly difficult.

      To address the reviewer’s comment, we performed an additional experiment presented in Fig. S2A. In this experiment, we first inhibited SERCA with thapsigargin (TG), preventing calcium uptake into the ER, and subsequently added calcium to the suspension. Under these conditions, calcium cannot be sequestered by the ER. We then applied GPN and quantified the response, comparing it to a similar experimental condition without TG. Indeed, under these conditions, we observed a significant but modest increase in the GPN-induced response, suggesting that the PLVAC may be capable of directly taking up calcium from the cytosol. However, this occurs under conditions of SERCA inhibition which creates nonphysiological conditions with elevated cytosolic calcium levels and the presence of TG may promote additional ER leakage, both of which could artificially enhance PLVAC uptake. Under physiological conditions, with functional SERCA activity, the ER would likely sequester cytosolic calcium more efficiently, thereby limiting calcium availability for PLVAC direct uptake. Thus, while the result is intriguing, it may not reflect calcium handling under normal cellular conditions. See lines 172-178.

      (7) Figure 1H-I: I disagree with the authors' interpretation of the results (lines 144-153). The data argue that by blocking ER Ca2+ uptake by TG, other organelles take up Ca2+ from the cytosol where it accumulates due to the leak and Ca2+ influx as is evident from the data allowing more release. The data does not argue for ER Ca2+ tunneling to other organelles. Tunneling would be reduced in the presence of TG (see PMID: 30046136, 24867608).

      We partially agree with this concern. In our experiments, TG was used to inhibit SERCA and block calcium uptake into the ER, allowing calcium to leak into the cytosol. We propose that this leaked calcium is subsequently taken up by other intracellular compartments. This effect is observed immediately upon TG addition. However, pre-incubation with TG or knockdown of SERCA reduces calcium storage in the ER, thereby diminishing the transfer of calcium to other stores.

      To further support our claim, we performed additional experiments in the absence of extracellular calcium, now presented in Figure 1J-K. We observed that calcium release triggered by GPN or nigericin was significantly enhanced when both agents were added after TG. These results suggest that calcium initially released from the ER can be sequestered by other compartments. As mentioned, we deleted any mention of “tunneling,” but we believe the data support the occurrence of calcium transfer. New results described in lines 166-171.

      The experiment in Fig S2A described in the response to (6) also addresses this concern. Under physiological conditions with functional SERCA, cytosolic calcium would likely be rapidly sequestered by the ER, limiting its availability to other compartments. See lines 172-178.

      (8) Line 175: SERCA-dependent Ca2+ uptake is higher at 880 nM as would be expected yet the authors state that it's optimal at 220 nM Ca2+ ?

      Yes, it is true that the SERCA-dependent Ca²⁺ uptake rate is higher at elevated Ca²⁺ concentrations. We chose to use 220 nM free calcium for several reasons: 1) this concentration is close to physiological fluctuations in cytosolic levels; 2) it is commonly used in studies of mammalian SERCA; and 3) calcium uptake is readily detectable at this level. While this may not represent the maximal activity conditions for SERCA, we believe it is a reasonable and physiologically relevant choice for assessing SERCA-dependent calcium transport activity. We added one sentence to the results explaining this reasoning (lines 204-207) and we deleted the word optimal.

      (9) Figure 3H: the saponin egress data support the conclusion that organelles Ca2+ take up cytosolic Ca2+ directly without the need for ER tunneling.

      The saponin concentration used permeabilizes the host cell membrane, allowing the intracellular tachyzoite to be surrounded by the added higher extracellular calcium concentration. The saponin concentration used does not affect the tachyzoite membrane, as the parasite is still moving and calcium oscillations were clearly seen under similar conditions (PMID: 26374900). The resulting calcium increase in the tachyzoite cytosol is what stimulates parasite motility and egress. Since SERCA activity is reduced in the mutant, cytosolic calcium accumulates more rapidly, reaching the threshold for egress sooner and thereby accelerating parasite exit. The result does not support a contribution of the other stores to this effect, because the ionomycin response shows that egress is diminished in the mutant, likely because the calcium stores are depleted. We added an explanation in the results, lines 262-269, and the discussion, lines 532-539.

      (10) Figure S2: the HA and SERCA signals do not match perfectly? Could this reflect issues with HA tagging, potentially off-target effects? Was this tested?

      These are not off-target effects, as we did not observe them in the control cells lacking HA tagging. The HA signal also disappeared after treatment with ATc, further confirming that the IFA signal is specific. We agree with the reviewer that the signals do not align perfectly. This discrepancy could be due to differences in antibody accessibility or the fact that the two antibodies recognize different regions of the protein. We added a sentence about this in the results, lines 240-243.

      Reviewer #2 (Recommendations for the authors):

      The description of the data of Figures 1B and S1A starting on line 108 would be easier to follow if Figure S1A was actually incorporated into Figure 1. It is not clear why these two complementary experiments were separated since they are both equally important in understanding and interpreting the data.

      We re-arranged figure 1 and incorporated S1A now as Fig 1C.

      As noted in the public comments, loading of fura2/AM can result in compartmentalized fura2, which can contaminate the cytosolic calcium measurements and might modify free calcium levels and calcium storage capacity in intracellular organelles. This can be assessed using the digitonin permeabilization method used in the MagFluo4 measurements, but in this case, detecting the fura2 signal remaining after cell permeabilization.

      As suggested by the reviewer, we measured Fura-2 compartmentalization by permeabilizing cells with digitonin as we do for the Mag-Fluo-4 and the fluorescence was reduced almost completely and was unresponsive to any additions (see Author response image 1).

      Author response image 1.

      T. gondii tachyzoites in suspension exposed to thapsigargin, calcium and GPN. The dashed line shows an experiment using the same conditions, but parasites were permeabilized with digitonin to release the cytosolic Fura-2. Part B shows a similar experiment with parasites exposed to MgATP.

      Following the public comment regarding the residual calcium mobilization response to Zaprinast observed after 24 h ATc knockdown of SERCA (Figures 4E, 4F, as explained in the legend to Figure 4), was there still a response to Zaprinast after 48 h knockdown, where the thapsigargin response was apparently fully ablated?

      Unfortunately, we were unable to perform this experiment as it is not possible to obtain sufficient cells at 48 h with ATc. Due to the essential role of TgSERCA, parasites are unable to replicate after 24 h.

      As noted in the public comments, the data in Figure 4A vs 4G and Figure 4B vs 4H appear to show that the calcium responses to GPN are similar to that with thapsigargin, which seems unexpected if the acidic compartment is loaded from the ER. The results with GPN addition after thapsigargin (Figure 1H) argue against this, but the authors should still cite the work of Atakpa et al.

      We think that the reviewer is concerned that GPN may also be acting on the ER. This is a possibility that we considered, and we now included the suggested citation (line 457). However, we believe that it is difficult to directly compare the responses, as the kinetics of calcium release from the ER may differ from those of release from the PLVAC. This could be due to differences in the calcium buffering capacity between the two compartments. Additionally, it is possible that calcium leaked from the ER is more efficiently sequestered by other stores or extruded through the plasma membrane than calcium released from the PLVAC. Besides, GPN is known to have a more disruptive effect on membranes compared to TG, which may also influence their responses. As noted by the reviewer, Figure 1H also supports the idea that the acidic compartment is loaded from the ER.

      The abbreviation for the plant-like vacuolar compartment (PLVAC) only appears in a figure legend but should be defined in the main text on first use.

      Corrected, lines 140-143.

      The authors should cite the previous study of Borges-Pereira et al., 2020 (PMID: 32848018) that also demonstrates the incomplete overlap of the calcium pools mobilized by thapsigargin and CPA in P. falciparum. The ability to measure calcium in intracellular stores using MagFluo4 opens the possibility to further investigate this discrepancy between CPA and thapsigargin, but CPA does not appear to have been used in the permeabilized cell experiments with MagFluo4. I would suggest that this could be added to Figure 2 and/or Figure 4, or at least as a supplementary figure.

      In response to this reviewer’s critique we performed additional experiments with Mag-Fluo-4-loaded parasites. These are presented in the new Figure S3. We added CPA and TG and combined them to inhibit SERCA and to allow calcium leak from the loaded organelle. Under these conditions, we observed a very similar leak rate after the addition of the inhibitors, as measured by the slope of Ca²⁺ leak. We believe that the leak rate is most likely determined by the intrinsic ER mechanism. See the discussion of this result in lines 436-442 and the previous response to the same reviewer comment.
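The comparison of leak rates by slope can be illustrated with an ordinary least-squares fit to the fluorescence trace after inhibitor addition. This is a minimal sketch under stated assumptions, not the authors' analysis pipeline. The function name `leak_slope`, the time window and the example traces are hypothetical, and the signal is treated as arbitrary fluorescence units because Mag-Fluo-4 fluorescence is not linear with calcium.

```python
# Illustrative sketch: estimate a leak rate as the least-squares slope of a
# fluorescence trace recorded after inhibitor addition. Data are made up.

def leak_slope(times_s, signal):
    """Return the least-squares slope (signal units per second)."""
    n = len(times_s)
    if n < 2 or n != len(signal):
        raise ValueError("need two or more paired time/signal points")
    mean_t = sum(times_s) / n
    mean_y = sum(signal) / n
    sxx = sum((t - mean_t) ** 2 for t in times_s)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_s, signal))
    return sxy / sxx


if __name__ == "__main__":
    times = [0, 10, 20, 30]              # seconds after inhibitor addition
    trace_tg = [1.00, 0.90, 0.80, 0.70]  # hypothetical TG trace
    trace_cpa = [1.00, 0.91, 0.79, 0.70]  # hypothetical CPA trace
    print("TG slope: ", leak_slope(times, trace_tg))
    print("CPA slope:", leak_slope(times, trace_cpa))
```

Fitting over the same fixed window after each addition keeps the slopes comparable between inhibitors. The choice of window is itself an assumption that would need to match the experimental protocol.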

      Reviewer #3 (Recommendations for the authors):

      Suggestions for improved or additional experiments, data, or analyses

      (1) Figure 1A is not mentioned in the main text even though it is discussed.

      Corrected

      (2) Figure 1G: Values do not match, how can GPN be so high?

      These figures were replaced by new traces and individual quantification analyses for each experiment.

      (3) Figure 1H and I: Is this type of data/results also available for the mitochondrion?

      Unfortunately, we were not able to include this experiment because we were unable to accurately quantify the mitochondrial calcium release. Instead, we used mitochondrial GECIs and the results are shown in Figure 5 to study mitochondrial calcium uptake.

      (4) Figure 1H: where does the calcium go after GPN addition? Taken up by another calcium store?

      Most likely calcium is extruded through the plasma membrane by the activity of the Calcium ATPase TgA1.

      However, the reviewer’s suggestion is also possible, and calcium could be taken up by another store such as the mitochondrion. In this regard, we did observe a large mitochondrial calcium increase (parasites expressing SOD2-GCaMp6) after adding GPN (Fig 5I), suggesting that the mitochondrion may take calcium from the organelle targeted by GPN. However, the calcium affinity of the mitochondrion is very low, so the concentration of calcium needs to be very high to activate it, and these concentrations are most likely achieved at the microdomains formed between the mitochondrion and other organelles.

      (5) Figure 2B-C: Further explanation of why these particular values were chosen for the follow-up experiments would be helpful for the reader.

      We tested a wide range of MgATP and free calcium concentrations to measure ER Ca<sup>2+</sup> uptake catalyzed by TgSERCA. The concentrations shown fall within the linear range.

      We followed the free calcium concentrations used by studies of mammalian SERCA (https://doi.org/10.1016/j.ceca.2020.102188). In this protocol they used 220 nM free calcium, which was close to cytosolic Ca<sup>2+</sup> levels. TgSERCA can take up calcium efficiently at this concentration, as shown in Fig 2. We used less MgATP than the mammalian cell protocols, since we did not observe a significant increase in SERCA activity beyond 0.5 mM MgATP. We added a sentence explaining this in the Results (lines 204-207).

      (6) Figure 3E: Revise the error bar? (and note that colours do not match the graph legend).

      The colors do match; the bars are hard to see because vacuoles containing a single parasite are virtually absent in the control group without ATc treatment.

      (7) Figure 3H: 'Interestingly, when testing egress after the addition of saponin in the presence of extracellular Ca2+, we observed that the tachyzoites egressed sooner (Figure 3H, saponin egress).' This is the only graph showing egress timing, and thus it is not clear what is the comparison. The egressed here is sooner compared to what condition? Egress in the absence of Ca2+? This requires clarification and might require the control data to be added.

      In the saponin experiment we compare time to egress of the mutant grown with or without ATc. The measurement is for time to egress after adding saponin. This experiment is in the presence of extracellular calcium. The protocol was previously used to measure time to egress: PMID: 40043955, PMID: 38382669, PMID: 26374900. See also response to question 9 of reviewer 1.

      (8) Figure 4C: There is a small peak appearing right after TG addition this should be discussed and explained.

      This trace was generated in a different fluorometer (F-4000). The small peak was an artifact caused by a jump in the signal when TG was added. Multiple repeats of the same experiment in the newer F7000 did not show the peak. We now mention in the Materials and Methods that the F-4000 fluorometer was used for some experiments (lines 609-610). We apologize for the omission.

      (9) Figure 5A: An important control that is missing is co-localisation with a mitochondrial marker.

      The expression of the SOD2-GCaMP6 has been characterized: PMID: 31758454

      (10) Figure 5H: This line was made for this study however the line genetic verification is missing.

      In response to this concern we now include a new Figure S5 showing the fluorescence of GCaMP6 in the mitochondrion of the iΔTgSERCA mutant (Fig. S5A). We include several parasites. In addition, we show fluorescence measurements after the addition of calcium, showing that the cells are unresponsive, which indicates that the indicator is not in the cytosol. Lines 650-651 and 344-348.

      (11) Figure 6D: since the membranes are hard to see, it is not clear whether the arrows show structures that are in line with the definition of membrane contact sites. The authors should provide an in-depth analysis of the length of the interaction between the membranes where the distance is less than 30 nm, and discuss how many structures corresponding to the definition were analysed.

      All the requested details are now included in the legend to Figure S3.

      Minor corrections to the text and figures

      (1) Unify statistical labelling throughout the paper replacing *** with p values.

      Corrected. We replaced *** with the actual p value in some figures. For Figure 2 and Fig S1, we still use *** because of space limitations.

      (2) Unify ATC vs ATc throughout the paper.

      Corrected

      (3) Unify capitalization of line name (iΔTgserca/i ΔTgSERCA) throughout the paper.

      Corrected

      (4) Unify capitalization of p value (p/P) throughout the paper.

      Corrected in figures

      (5) Unify Fig X vs Fig. X throughout the text.

      Corrected

      (6) Add values of scale bars to legends (eg Figure S2).

      Corrected

      (7) What is the time point for the data in Figures 4E-H, 5H, and S3? 24hrs? include in the legend.

      Added 24 h to the legends. Fig S3 is now S4.

      (8) Figure 3F: The second graph is NS thus perhaps no need for the p-value?

      Corrected

      (9) Figure 3G: Worth considering swapping the two around: first attachment and then invasion?

      Corrected. Invasion and attachment bars were swapped.

      (10) Figure 4A/B: Wrong colour match for Figure 4B.

      Corrected

      (11) Figure 4F: In the main text, the authors reference to Figure 1F, correct to 4F.

      Corrected

      (12) Figure 4H: In the main text, authors reference to Figure 1H, correct to 4H.

      Corrected

    1. Author response:

      Reviewer #1 (Public review):

      In this important study, the authors develop a suite of machine vision tools to identify and align fluorescent neuronal recording images in space and time according to neuron identity and position. The authors provide compelling evidence for the speed and utility of these tools. While such tools have been developed in the past (including by the authors), the key advancement here is the speed and broad utility of these new tools. While prior approaches based on steepest descent worked, they required hundreds of hours of computational time, while the new approaches outlined here are >600-fold faster. The machine vision tools here should be immediately useful to readers specifically interested in whole-brain C. elegans data, but also for more general readers who may be interested in using BrainAlignNet for tracking fluorescent neuronal recordings from other systems.

      I really enjoyed reading this paper. The authors had several ground truth examples to quantify the accuracy of their algorithms and identified several small caveats users should consider when using these tools. These tools were primarily developed for C. elegans, an animal with stereotyped development, but whose neurons can be variably located due to internal motion of the body. The authors provide several examples of how BrainAlignNet reliably tracked these neurons over space and time. Neuron identity is also important to track, and the authors showed how AutoCellLabeler can reliably identify neurons based on their fluorescence in the NeuroPAL background. A challenge with NeuroPAL though, is the high expression of several fluorophores, which compromises behavioral fidelity. The authors provide some possible avenues where this problem can be addressed by expressing fewer fluorophores. While using all four channels provided the best performance, only using the tagRFP and CyOFP channels was sufficient for performance that was close to full performance using all 4 NeuroPAL channels. This result indicates that the development of future lines with less fluorophore expression could be sufficient for reliable neuronal identification, which would decrease the genetic load on the animal, but also open other fluorescent channels that could be used for tracking other fluorescent tools/markers. Even though these tools were developed for C. elegans specifically, they showed BrainAlignNet can be applied to other organisms as well (in their case, the cnidarian C. hemisphaerica), which broadens the utility of their tools.

      Strengths:

      (1) The authors have a wealth of ground-truth training data to compare their algorithms against, and provide a variety of metrics to assess how well their new tools perform against hand annotation and/or prior algorithms.

      (2) For BrainAlignNet, the authors show how this tool can be applied to other organisms besides C. elegans.

      (3) The tools are publicly available on GitHub, which includes useful README files and installation guidance.

      We thank the reviewer for noting these strengths of our study.

      Weaknesses:

      (1) Most of the utility of these algorithms is for C. elegans specifically. Testing their algorithms (specifically BrainAlignNet) on more challenging problems, such as whole-brain zebrafish, would have been interesting. This is a very, very minor weakness, though.

      We appreciate the reviewer’s point that expanding to additional animal models would be valuable. In the study, we have so far tested our approaches on C. elegans and jellyfish. Given that this is considered a ‘very, very minor weakness’ and that it does not directly affect the results or analyses in the paper, we think this might be better addressed in future work.

      (2) The tools are benchmarked against their own prior pipeline, but not against other algorithms written for the same purpose.

      We agree that it would be valuable to benchmark other labs’ software pipelines on our datasets. We note that most papers in this area, which describe those pipelines, provide the same performance metrics that we do (accuracy of neuron identification, tracking accuracy, etc), so a crude, first-order comparison can be obtained by comparing the numbers in the papers. But, we agree that a rigorous head-to-head comparison would require applying these different pipelines to a common dataset. We considered performing these analyses, but we were concerned that using other labs’ software ‘off the shelf’ on our data might not represent those pipelines in their best light when compared to our pipeline that was developed with our data in mind. Data from different microscopy platforms can be surprisingly different and we wouldn’t want to perform an analysis that had this bias. Therefore, we feel that this comparison would be best pursued by all of these labs collaboratively (so that they can each provide input on how to run their software optimally). Indeed, this is an important area for future study. In this spirit, we have been sharing our eat-4::GFP datasets (that permit quantification of tracking accuracy) with other labs looking for additional ways to benchmark their tracking software.

      We also note that there are not really any pipelines to directly compare against CellDiscoveryNet, as we are not aware of any other fully unsupervised approach for neuron identification in C. elegans.

      (3) Considerable pre-processing was done before implementation. Expanding upon this would improve accessibility of these tools to a wider audience.

      Indeed, some pre-processing was performed on images before registration and neuron identification -- understanding these nuances can be important. The pre-processing steps are described in the Results section and detailed in the Methods. They are also all available in our open-source software. For BrainAlignNet, the key steps were: (1) selecting image registration problems, (2) cropping, and (3) Euler alignment. Steps (1) and (3) were critically important and are extensively discussed in the Results and Discussion sections of our study (lines 142-144, 218-234, 318-323, 704-712). Step (2) is standard in image processing. For AutoCellLabeler and CellDiscoveryNet, the pre-processing was primarily to align the 4 NeuroPAL color channels to each other (i.e. make sure the blue/red/orange/etc channels for an animal are perfectly aligned). This is also just a standard image processing step to ensure channel alignment. Thus, the more “custom” pre-processing steps were extensively discussed in the study and the more “common” steps are still described in the Methods. The implementation of all steps is available in our open-source software.

      Reviewer #2 (Public review):

      Summary:

      The paper introduced the pipeline to analyze brain imaging of freely moving animals: registering deforming tissues and maintaining consistent cell identities over time. The pipeline consists of three neural networks that are built upon existing models: BrainAlignNet for non-rigid registration, AutoCellLabeler for supervised annotation of over 100 neuronal types, and CellDiscoveryNet for unsupervised discovery of cell identities. The ambition of the work is to enable high-throughput and largely automated pipelines for neuron tracking and labeling in deforming nervous systems.

      Strengths:

      (1) The paper tackles a timely and difficult problem, offering an end-to-end system rather than isolated modules.

      (2) The authors report high performance within their dataset, including single-pixel registration accuracy, nearly complete neuron linking over time, and annotation accuracy that exceeds individual human labelers.

      (3) Demonstrations across two organisms suggest the methods could be transferable, and the integration of supervised and unsupervised modules is of practical utility.

      We thank the reviewer for noting these strengths of our study.

      Weaknesses:

      (1) Lack of solid evaluation. Despite strong results on their own data, the work is not benchmarked against existing methods on community datasets, making it hard to evaluate relative performance or generality.

      We agree that it would be valuable to benchmark many labs’ software pipelines on some common datasets, ideally from several different research labs. We note that most papers in this area, which describe the other pipelines that have been developed, provide the same performance metrics that we do (accuracy of neuron identification, tracking accuracy, etc), so a crude, first-order comparison can be obtained by comparing the numbers in the papers. But, we agree that a rigorous head-to-head comparison would require applying these different pipelines to a common dataset. We considered performing these analyses, but we were concerned that using other labs’ software ‘off the shelf’ and comparing the results to our pipeline (where we have extensive expertise) might bias the performance metrics in favor of our software. Therefore, we feel that this comparison would be best pursued by all of these labs collaboratively (so that they can each provide input on how to run their software optimally). Indeed, this is an important area for future study. In this spirit, we have been sharing our eat-4::GFP datasets (that permit quantification of tracking accuracy) with other labs looking for additional ways to benchmark their tracking software.

      We also note that there are not really any pipelines to directly compare against CellDiscoveryNet, as we are not aware of any other fully unsupervised approach for neuron identification in C. elegans.

      (2) Lack of novelty. All three models do not incorporate state-of-the-art advances from the respective fields. BrainAlignNet does not learn from the latest optical flow literature, relying instead on relatively conventional architectures. AutoCellLabeler does not utilize the advanced medNeXt3D architectures for supervised semantic segmentation. CellDiscoveryNet is presented as unsupervised discovery but relies on standard clustering approaches, with limited evaluation on only a small test set.

      We appreciate that the machine learning field moves fast. Our goal was not to invent entirely novel machine learning tools, but rather to apply and optimize tools for a set of challenging, unsolved biological problems. We began with the somewhat simpler architectures described in our study and were largely satisfied with their performance. It is conceivable that newer approaches would perhaps lead to even greater accuracy, flexibility, and/or speed. But, oftentimes, simple or classical solutions can adequately resolve specific challenges in biological image processing.

      Regarding CellDiscoveryNet, our claim of unsupervised training is precise: CellDiscoveryNet is trained end-to-end only on raw images, with no human annotations, pseudo-labels, external classifiers, or metadata used for training, model selection, or early stopping. The loss is defined entirely from the input data (no label signal). By standard usage in machine learning, this constitutes unsupervised (often termed “self-supervised”) representation learning. Downstream clustering is likewise unsupervised, consuming only image pairs registered by CellDiscoveryNet and neuron segmentations produced by our previously-trained SegmentationNet (which provides no label information).

      (3) Lack of robustness. BrainAlignNet requires dataset-specific training and pre-alignment strategies, limiting its plug-and-play use. AutoCellLabeler depends heavily on raw intensity patterns of neurons, making it brittle to pose changes. By contrast, current state-of-the-art methods incorporate spatial deformation atlases or relative spatial relationships, which provide robustness across poses and imaging conditions. More broadly, the ANTSUN 2.0 system depends on numerous manually tuned weights and thresholds, which reduces reproducibility and generalizability beyond curated conditions.

      Regarding BrainAlignNet: we agree that we trained on each species’ own data (worm, jellyfish) and we would suggest that other labs working on new organisms do the same based on our current state of knowledge. It would be fantastic if there were an alignment approach that generalized to all possible cases of non-rigid registration in all animals, which is an important area for future study. We also agree that pre-alignment was critical in worms and jellyfish, which we discuss extensively in our study (lines 142-144, 318-321, 704-712).

      Regarding AutoCellLabeler: the animals were not recorded in any standardized pose and were not aligned to each other beforehand – they were basically in a haphazard mix of poses and we used image augmentation to allow the network to generalize to other poses, as described in our study. It is still possible that AutoCellLabeler is somehow brittle to pose changes (e.g. perhaps extremely curved worms) – while we did not detect this in our analyses, we did not systematically evaluate performance across all possible poses. However, we do note that this network was able to label images taken from freely-moving worms, which by definition exhibit many poses (Figure 5D, lines 500-525); aggregating the network’s performance across freely-moving data points allowed it to nearly match its performance on high-SNR immobilized data. This suggests a degree of robustness of the AutoCellLabeler network to pose changes.

      Regarding ANTSUN 2.0: we agree that there are some hyperparameters (described in our study) that affect ANTSUN performance. We agree that it would be worthwhile to fully automate setting these in future iterations of the software.

      Evaluation:

      To make the evaluation more solid, it would be great for the authors to (1) apply the new method on existing datasets and (2) apply baseline methods on their own datasets. Otherwise, without comparison, it is unclear if the proposed method is better or not. The following papers have public challenging tracking data: https://elifesciences.org/articles/66410, https://elifesciences.org/articles/59187, https://www.nature.com/articles/s41592-023-02096-3.

      Please see our response to your point (1) under Weaknesses above.

      Methodology:

      (1) The model innovations appear incrementally novel relative to existing work. The authors should articulate what is fundamentally different (architectural choices, training objectives, inductive biases) and why those differences matter empirically. Ablations isolating each design choice would help.

      There are other efforts in the literature to solve the neuron tracking and neuron identification problems in C. elegans (please see paragraphs 4 and 5 of our Introduction, which are devoted to describing these). However, they are quite different in the approaches that they use, compared to our study. For example, for neuron tracking they use t->t+1 methods, or model neurons as point clouds, etc (a variety of approaches have been tried). For neuron identification, they work on extracted features from images, or use statistical approaches rather than deep neural networks, etc (a variety of approaches have been tried). Our assessment is that each of these diverse approaches has strengths and drawbacks; we agree that a meta-analysis of the design choices used across studies could be valuable.

      We also note that there are not really any pipelines to directly compare against CellDiscoveryNet, as we are not aware of any other fully unsupervised approach for neuron identification in C. elegans.

      (2) The pipeline currently depends on numerous manually set hyperparameters and dataset-specific preprocessing. Please provide principled guidelines (e.g., ranges, default settings, heuristics) and a robustness analysis (sweeps, sensitivity curves) to show how performance varies with these choices across datasets; wherever possible, learn weights from data or replace fixed thresholds with data-driven criteria.

      We agree that there are some ANTSUN 2.0 hyperparameters (described in our Methods section) that could affect the quality of neuron tracking. It would be worthwhile to fully automate setting these in future iterations of the software, ensuring that the hyperparameter settings are robust to variation in data/experiments.

      Appraisal:

      The authors partially achieve their aims. Within the scope of their dataset, the pipeline demonstrates impressive performance and clear practical value. However, the absence of comparisons with state-of-the-art algorithms such as ZephIR, fDNC, or WormID, combined with small-scale evaluation (e.g., ten test volumes), makes the strength of evidence incomplete. The results support the conclusion that the approach is useful for their lab's workflow, but they do not establish broader robustness or superiority over existing methods.

      We wish to remind the reviewer that we developed BrainAlignNet for use in worms and jellyfish. These two animals have different distributions of neurons and radically different anatomy and movement patterns. Data from the two organisms was collected in different labs (Flavell lab, Weissbourd lab) on different types of microscopes (spinning disk, epifluorescence). We believe that this is a good initial demonstration that the approach has robustness across different settings.

      Regarding comparisons to other labs’ C. elegans data processing pipelines, we agree that it will be extremely valuable to compare performance on common datasets, ideally collected in multiple different research labs. But we believe this should be performed collaboratively so that all software can be utilized in their best light with input from each lab, as described above. We agree that such a comparison would be very valuable.

      Impact:

      Even though the authors have released code, the pipeline requires heavy pre- and post-processing with numerous manually tuned hyperparameters, which limits its practical applicability to new datasets. Indeed, even within the paper, BrainAlignNet had to be adapted with additional preprocessing to handle the jellyfish data. The broader impact of the work will depend on systematic benchmarking against community datasets and comparison with established methods. As such, readers should view the results as a promising proof of concept rather than a definitive standard for imaging in deformable nervous systems.

      Regarding worms vs jellyfish pre-processing: we actually had the exact opposite reaction to that of the reviewer. We were surprised at how similar the pre-processing was for these two very different organisms. In both cases, it was essential to (1) select appropriate registration problems to be solved; and (2) perform initialization with Euler alignment. Provided that these two challenges were solved, BrainAlignNet mostly took care of the rest. This suggests a clear path for researchers who wish to use this approach in another animal. Nevertheless, we also agree with the reviewer’s caution that a totally different use case could require some re-thinking or re-strategizing. For example, the strategy of how to select good registration problems could depend on the form of the animal’s movement.

      Reviewer #3 (Public review):

      Context:

      Tracking cell trajectories in deformable organs, such as the head neurons of freely moving C. elegans, is a challenging task due to rapid, non-rigid cellular motion. Similarly, identifying neuron types in the worm brain is difficult because of high inter-individual variability in cell positions.

      Summary:

      In this study, the authors developed a deep learning-based approach for cell tracking and identification in deformable neuronal images. Several different CNN models were trained to: (1) register image pairs without severe deformation, and then track cells across continuous image sequences using multiple registration results combined with clustering strategies; (2) predict neuron IDs from multicolor-labeled images; and (3) perform clustering across multiple multicolor images to automatically generate neuron IDs.

      Strengths:

      Directly using raw images for registration and identification simplifies the analysis pipeline, but it is also a challenging task since CNN architectures often struggle to capture spatial relationships between distant cells. Surprisingly, the authors report very high accuracy across all tasks. For example, the tracking of head neurons in freely moving worms reportedly reached 99.6% accuracy, neuron identification achieved 98%, and automatic classification achieved 93% compared to human annotations.

      We thank the reviewer for noting these strengths of our study.

      Weaknesses:

      (1) The deep networks proposed in this study for registration and neuron identification require dataset-specific training, due to variations in imaging conditions across different laboratories. This, in turn, demands a large amount of manually or semi-manually annotated training data, including cell centroid correspondences and cell identity labels, which reduces the overall practicality and scalability of the method.

      We performed dataset-specific training for image registration and neuron identification, and we would encourage new users to do the same based on our current state of knowledge. This highlights how standardization of whole-brain imaging data across labs is an important issue for our field to address and that, without it, variations in imaging conditions could impact software utility. We refer the reviewer to an excellent study by Sprague et al. (2025) on this topic, which is cited in our study.

      However, at the same time, we wish to note that it was actually reasonably straightforward to take the BrainAlignNet approach that we initially developed in C. elegans and apply it to jellyfish. Some of the key lessons that we learned in C. elegans generalized: in both cases, it was critical to select the right registration problems to solve and to preprocess with Euler registration for good initialization. Provided that those problems were solved, BrainAlignNet could be applied to obtain high-quality registration and trace extraction. Thus, our study provides clear suggestions on how to use these tools across multiple contexts.

      (2) The cell tracking accuracy was not rigorously validated, but rather estimated using a biased and coarse approach. Specifically, the accuracy was assessed based on the stability of GFP signals in the eat-4-labeled channel. A tracking error was assumed to occur when the GFP signal switched between eat-4-negative and eat-4-positive at a given time point. However, this estimation is imprecise and only captures a small subset of all potential errors. Although the authors introduced a correction factor to approximate the true error rate, the validity of this correction relies on the assumption that eat-4 neurons are uniformly distributed across the brain - a condition that is unlikely to hold.

      We respectfully disagree with this critique. We considered the alternative suggested by the reviewer (in their private comments to the authors) of comparing against a manually annotated dataset. But this annotation would require humans to manually link ~150 neurons across ~1600 timepoints, i.e. >200,000 individual links for a single dataset. These datasets consist of densely packed neurons rapidly deforming over time in all 3 dimensions. Moreover, a single error in linking would propagate across timepoints, so the error tolerance of such annotation would be extremely low. Any such manually labeled dataset would be fraught with errors and should not be trusted. Instead, our approach relies on a simple, accurate assumption: GFP expression in a neuron should be roughly constant over a 16 min recording (after bleach correction) and the levels will be different in different neurons when it is sparsely expressed. Because all image alignment is done in the red channel, the pipeline never “peeks” at the GFP until it is finished with neuron alignment and tracking. The eat-4 promoter was chosen for GFP expression because (a) the nuclei labeled by it are scattered across the neuropil in a roughly salt-and-pepper fashion, with a mixture of eat-4-positive and eat-4-negative neurons found throughout the head; and (b) it is in roughly 40% of the neurons, giving very good overall coverage. Our view is that this approach of labeling subsets of neurons with GFP should become the standard in the field for assessing tracking accuracy: it has a simple, accurate premise; is not susceptible to human labeling error; is straightforward to implement; and, since it does not require manual labeling, is easy to scale to multiple datasets. We do note that it could be further strengthened by using multiple strains each with different ‘salt-and-pepper’ GFP expression patterns.
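
      The logic of a GFP-based accuracy estimate can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline's actual code: the function names, the fixed intensity threshold, the majority-vote definition of a track's GFP status, and the correction factor 2p(1-p) are all assumptions made here for illustration. The correction assumes that a mistaken link lands on a neuron that is GFP-positive with probability p regardless of location, which is the intermixing assumption the reviewer questions.

```python
def detectable_fraction(p_positive):
    """Chance that a random mislink joins two neurons of different GFP status.

    Assumes GFP-positive neurons make up a fraction p_positive of all neurons
    and are intermixed with GFP-negative ones (hypothetical simplification).
    """
    return 2.0 * p_positive * (1.0 - p_positive)


def mismatch_rate(tracks, threshold):
    """Fraction of timepoints whose GFP status differs from the track's majority status."""
    mismatched = 0
    total = 0
    for trace in tracks:
        positive = [value > threshold for value in trace]
        majority = 2 * sum(positive) > len(positive)
        mismatched += sum(1 for flag in positive if flag != majority)
        total += len(positive)
    return mismatched / total


def estimated_error_rate(tracks, threshold, p_positive):
    """Scale the visible mismatch rate up to account for mislinks GFP cannot reveal."""
    return mismatch_rate(tracks, threshold) / detectable_fraction(p_positive)


# Toy data (made up): two tracks, ten timepoints each, one suspicious timepoint.
tracks = [
    [1.0] * 9 + [0.0],  # GFP-positive track with one apparent dropout
    [0.0] * 10,         # GFP-negative track
]
rate = estimated_error_rate(tracks, threshold=0.5, p_positive=0.4)
```

      With roughly 40% of neurons labeled, about half of random mislinks would change the apparent GFP status under this assumption, so the visible mismatch rate would be roughly doubled to estimate the total. If labeled neurons were spatially clustered rather than intermixed, mislinks between neighbors would be detected less often and this scaling would underestimate the true error rate.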

      (3) Figure S1F demonstrates that the registration network, BrainAlignNet, alone is insufficient to accurately align arbitrary pairs of C. elegans head images. The high tracking accuracy reported is largely due to the use of a carefully designed registration sequence, matching only images with similar postures, and an effective clustering algorithm. Although the authors address this point in the Discussion section, the abstract may give the misleading impression that the network itself is solely responsible for the observed accuracy.

      Our tracking accuracy requires (a) a careful selection of registration problems, (b) highly accurate registration of the selected registration problems, and (c) effective clustering. We extensively discussed the importance of choosing the registration problems in the Results section (lines 218-234 and 318-321), Discussion section (lines 704-708), and Methods section (lines 955-970 and 1246-1250) of our paper. We also discussed the clustering aspect in the Results section (lines 247-259), Discussion section (lines 708-712), and Methods section (lines 1162-1206). In addition, our abstract states that BrainAlignNet needs to be “incorporated into an image analysis pipeline,” to inform readers that other aspects of image analysis need to occur (beyond BrainAlignNet) to perform tracking.

      (4) The reported accuracy for neuron identification and automatic classification may be misleading, as it was assessed only on a subset of neurons labeled as "high-confidence" by human annotators. Although the authors did not disclose the exact proportion, various descriptions (such as Figure 4F) imply that this subset comprises approximately 60% of all neurons. While excluding uncertain labels is justifiable, the authors highlight the high accuracy achieved on this subset without clarifying that the reported performance pertains only to neurons that are relatively easy to identify. Furthermore, they do not report what fraction of the total neuron population can be accurately identified using their methods, an omission of critical importance for prospective users.

      The reviewer raises two points here: (1) whether AutoCellLabeler accuracy is impacted by ease of human labeling; and (2) what fraction of total neurons are identified. We address them one at a time.

      Regarding (1), we believe that the reviewer overlooked an important analysis in our study. Indeed, to assess its performance, one can only compare AutoCellLabeler’s output against accurate human labels – there is simply no way around it. However, we noted that AutoCellLabeler was identifying some neurons with high confidence even when humans had low confidence or had not even tried to label the neurons (Fig. 4F). To test whether these were in fact accurate labels, we asked additional human labelers to spend extra time trying to label a random subset of these neurons (they were of course blinded to the AutoCellLabeler label). We then assessed the accuracy of AutoCellLabeler against these new human labels and found that they were highly accurate (Fig. 4H). This suggests that AutoCellLabeler has strong performance even when some human labelers find it challenging to label a neuron. However, we agree that we have not yet been able to quantify AutoCellLabeler performance on the small set of neuron classes that humans are unable to identify across datasets.

      Regarding (2), we agree that knowing how many neurons are labeled by AutoCellLabeler is critical. For example, labeling only 3 neurons per animal with 100% accuracy isn’t very helpful. We wish to emphasize that we did not omit this information: we reported the number of neurons labeled for every network that we characterized in the study, alongside the accuracy of those labels (please see Figures 4I, 5A, and 6G; Figure 4I also shows the number of human labels per dataset, which the reviewer requested). We also showed curves depicting the tradeoff between accuracy and number of neurons labeled, which fully captures how we balanced accuracy and number of neurons labeled (Figures 5D and S4A). It sounds like the reviewer also wanted to know the total number of recorded neurons. The typical number of recorded neurons per dataset can also be found in the paper in Fig. 2E.
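
      The tradeoff between accuracy and number of neurons labeled mentioned above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the AutoCellLabeler code: the function name `accuracy_vs_coverage`, the confidence values, and the correctness flags are all hypothetical.

```python
def accuracy_vs_coverage(confidences, is_correct, thresholds):
    """For each confidence threshold, count the labels kept and their accuracy.

    confidences: per-neuron confidence of the automatic label (hypothetical values)
    is_correct:  whether each automatic label matches the human label
    """
    curve = []
    for thr in thresholds:
        kept = [ok for conf, ok in zip(confidences, is_correct) if conf >= thr]
        accuracy = sum(kept) / len(kept) if kept else float("nan")
        curve.append((thr, len(kept), accuracy))
    return curve


# Made-up example data, for illustration only.
confidences = [0.99, 0.95, 0.90, 0.70, 0.60, 0.40]
is_correct = [True, True, True, False, True, False]

curve = accuracy_vs_coverage(confidences, is_correct, thresholds=[0.0, 0.5, 0.8])
for thr, n_labeled, accuracy in curve:
    print(f"threshold={thr:.1f}  labeled={n_labeled}  accuracy={accuracy:.2f}")
```

      Raising the threshold keeps fewer labels but increases their accuracy, which is why reporting both quantities together is informative.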

    1. Author response:

      eLife Assessment

      This valuable study presents a theoretical model of how punctuated mutations influence multistep adaptation, supported by empirical evidence from some TCGA cancer cohorts. This solid model is noteworthy for cancer researchers as it points to the case for possible punctuated evolution rather than gradual genomic change. However, the parametrization and systematic evaluation of the theoretical framework in the context of tumor evolution remain incomplete, and alternative explanations for the empirical observations are still plausible.

      We thank the editor and the reviewers for their thorough engagement with our work. The reviewers’ comments have drawn our attention to several important points that we have addressed in the updated version. We believe that these modifications have substantially improved our paper.

      There were two major themes in the reviewers’ suggestions for improvement. The first was that we should demonstrate more concretely how the results in the theoretical/stylized modelling parts of our paper quantitatively relate to dynamics in cancer.

      To this end, we have now included a comprehensive quantification of the effect sizes of our results across large and biologically-relevant parameter ranges. Specifically, following reviewer 1’s suggestion to give more prominence to the branching process, we have added two figures (Fig S3-S4) quantifying the likelihood of multi-step adaptation in a branching process for a large range of mutation rates and birth-death ratios. Formulating our results in terms of birth-death ratios also allowed us to provide better intuition regarding how our results manifest in models with constant population size vs models of growing populations. In particular, the added figure (Fig S3) highlights that the effect size of temporal clustering on the probability of successful 2-step adaptation is very sensitive to the probability that the lineage of the first mutant would go extinct if it did not acquire a second mutation. As a result, the phenomenon we describe is biologically likely to be most effective in those phases during tumor evolution in which tumor growth is constrained. This important pattern had not been described sufficiently clearly in the initial version of our manuscript, and we thank both reviewers for their suggestions to make these improvements.

      The second major theme in the reviewers’ suggestions was focused on how we relate our theoretical findings to readouts in genomic data, with both reviewers pointing to potential alternative explanations for the empirical patterns we describe.

      We have now extended our empirical analyses following some of the reviewers’ suggestions. Specifically, we have included analyses investigating how the contribution of reactive oxygen species (ROS)-related mutation signatures correlates with our proxies for multi-step adaptation; and we have included robustness checks in which we use Spearman instead of Pearson correlations. Moreover, we have included more discussion on potential confounds and the assumptions going into our empirical analyses as well as the challenges in empirically identifying the phenomena we describe.
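
      To make the Spearman-versus-Pearson robustness check concrete, the following sketch computes both coefficients with the standard library only. The data are invented for illustration (a monotonic but strongly nonlinear relation), and the variable names `signature` and `score` are hypothetical stand-ins for a signature contribution and a deactivation score; none of this reproduces the analysis in the manuscript.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: monotonic, but far from linear.
signature = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
score = [math.exp(8 * s) for s in signature]

r_pearson = pearson(signature, score)
r_spearman = spearman(signature, score)
print(f"Pearson={r_pearson:.3f}  Spearman={r_spearman:.3f}")
```

      Because Spearman depends only on ranks, it is unaffected by the nonlinearity and by extreme values, which is the motivation for using it as a robustness check.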

      Below, we respond in detail to the individual comments made by each reviewer.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Grasper et al. present a combined analysis of the role of temporal mutagenesis in cancer, which includes both theoretical investigation and empirical analysis of point mutations in TCGA cancer patient cohorts. They find that temporally elevated mutation rates contribute to cancer fitness by allowing fast adaptation when the fitness drops (due to previous deleterious mutations). This may be relevant in the case of tumor suppressor genes (TSG), which follow the 2-hit hypothesis (i.e., biallelic 2 mutations are necessary to deactivate TS), and in cases where temporal mutagenesis occurs (e.g., high APOBEC, ROS). They provide evidence that this scenario is likely to occur in patients with some cancer types. This is an interesting and potentially important result that merits the attention of the target audience. Nonetheless, I have some questions (detailed below) regarding the design of the study, the tools and parametrization of the theoretical analysis, and the empirical analysis, which I think, if addressed, would make the paper more solid and the conclusion more substantiated.

      Strengths:

      Combined theoretical investigation with empirical analysis of cancer patients.

      Weaknesses:

      Parametrization and systematic investigation of theoretical tools and their relevance to tumor evolution.

      We sincerely thank Reviewer 1 for their comments. As communicated in more detail in the point-by-point replies to the “Recommendations for the authors”, we have revised the paper to address these comments in various ways. To summarize, Reviewer 1 asked for (1) more comprehensive analyses of the parameter space, especially in ranges of small fitness effects and low mutation rates; (2) additional clarifications on details of mechanisms described in the manuscript; and (3) suggested further robustness checks to our empirical analyses. We have addressed these points as follows: we have added detailed analyses of dynamics and effect sizes for branching processes (see Sections SI2 and SI3 in the Supplementary Information, as well as Figures S3 and S4). As suggested, these additions provide characterizations of effect sizes in biologically relevant parameter ranges (low mutation rates and smaller fitness effect sizes), and extend our descriptions to processes with dynamically changing population sizes. Moreover, we have added further clarifications at suggested points in the manuscript, e.g. to elaborate on the non-monotonicities in Fig 3. Lastly, we have undertaken robustness checks using Spearman rather than Pearson correlation coefficients to quantify relations between TSG deactivation and APOBEC signature contribution, and have performed analyses investigating dynamics of reactive oxygen species-associated mutagenesis instead of APOBEC.

      Reviewer #2 (Public review):

      This work presents theoretical results concerning the effect of punctuated mutation on multistep adaptation and empirical evidence for that effect in cancer. The empirical results seem to agree with the theoretical predictions. However, it is not clear how strong the effect should be on theoretical grounds, and there are other plausible explanations for the empirical observations.

      Thank you very much for these comments. We have now substantially expanded our investigations of the parameter space as outlined in the response to the “eLife Assessment” above and in the detailed comments below (A(1)-A(3)) to convey more quantitative intuition for the magnitude of the effects we describe for different phases of tumor evolution. We agree that there could be potential additional confounders to our empirical investigations besides the challenges regarding quantification that we already described in our initial version of the manuscript. We have thus included further discussion of these in our manuscript (see replies to B(1)-B(3)), and we have expanded our empirical analyses as outlined in the response to the “eLife Assessment”.

      For various reasons, the effect of punctuated mutation may be weaker than suggested by the theoretical and empirical analyses:

      (A1) The effect of punctuated mutation is much stronger when the first mutation of a two-step adaptation is deleterious (Figure 2). For double inactivation of a TSG, the first mutation--inactivation of one copy--would be expected to be neutral or slightly advantageous. The simulations depicted in Figure 4, which are supposed to demonstrate the expected effect for TSGs, assume that the first mutation is quite deleterious. This assumption seems inappropriate for TSGs, and perhaps the other synergistic pairs considered, and exaggerates the expected effects.

      Thank you for highlighting this discrepancy between Figure 2 and Figure 4. For computational efficiency and for illustration purposes, we had opted for high mutation rates and large fitness effects in Figure 2; however, our results are valid even in the setting of lower mutation rates and fitness effects. To improve the connection to Figure 4, and to address other related comments regarding parameter dependencies, we have now added more detailed quantification of the effects we describe (Figures SF3 and SF4) to the revised manuscript. These additions show that the effects illustrated in Figure 2 retain large effect sizes when going to much lower mutation rates and much smaller fitness effects. Indeed, while under high mutation rates we only see the large relative effects if the first mutation is highly deleterious, these large effects become more universal when going to low mutation rates.

      In general, it is correct that the selective disadvantage (or advantage) conveyed by the first mutation affects the likelihood of successful 2-step adaptations. It is also correct that the magnitude of the ‘relative effect’ of temporal clustering on valley-crossing is highest if the lineage with only the first of the two mutations is vanishingly unlikely to produce a second mutant before going extinct. If the first mutation is strongly deleterious, the lineage of such a first mutant is likely to quickly go extinct – and therefore also more likely to do so before producing a second mutant.

      However, this likelihood of producing the second mutant is also low if the mutation rate is low. As our added figure (Figure SF3) illustrates, at low mutation rates appropriate for cancer cells, the relative effect is insensitive to the magnitude of the fitness disadvantage for large parts of the parameter space. Especially in populations of constant size (approximated by a birth/death ratio of 1), the relative effects for first mutations that reduce the birth rate by 0.5 or by 0.05 are indistinguishable (Figure SF3f).
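
      The two dependencies discussed above (on the fitness of the first mutant and on the mutation rate) can be illustrated with a deliberately stylized calculation. The sketch below is not the model from the manuscript. It assumes a discrete birth-death branching process in which each first-mutant cell either divides (probability `birth_prob`) or dies, and each division produces a double mutant with probability `mu`. The function name and parameters are hypothetical.

```python
import math


def prob_second_mutant(birth_prob, mu):
    """Probability that the lineage of one first-mutant cell produces at least
    one second mutant before going extinct (stylized model, see assumptions).

    The failure probability x solves x = (1 - p) + p * (1 - mu) * x**2:
    the cell dies, or it divides without mutating and both daughter
    lineages fail. The smallest root in [0, 1] is the relevant one.
    """
    if not 0.0 < birth_prob < 1.0 or not 0.0 < mu <= 1.0:
        raise ValueError("need 0 < birth_prob < 1 and 0 < mu <= 1")
    death_prob = 1.0 - birth_prob
    a = birth_prob * (1.0 - mu)
    if a == 0.0:  # every division yields a double mutant
        fail = death_prob
    else:
        fail = (1.0 - math.sqrt(1.0 - 4.0 * a * death_prob)) / (2.0 * a)
    return 1.0 - fail


# birth_prob = 0.5 corresponds to a birth/death ratio of 1 (neutral first mutant);
# smaller values correspond to a deleterious first mutation.
for p in (0.5, 0.45, 0.4):
    for mu in (1e-4, 1e-6):
        print(f"birth_prob={p:.2f}  mu={mu:.0e}  P={prob_second_mutant(p, mu):.3e}")
```

      In this toy model the probability falls both when the first mutation is more deleterious and when the mutation rate is lower, consistent with the qualitative argument in the text. It says nothing about the size of the clustering effect itself.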

      Moreover, as we discuss in the paper (Figures SF2 and SF3), the absolute effect (f<sub>k</sub> - f<sub>1</sub>) is largest not in regions of the parameter space in which the first mutant is infinitesimally unlikely to produce a second mutant (where f<sub>k</sub> and f<sub>1</sub> would both be infinitesimally small), but in regions in which the first mutant has a non-negligible chance of producing a second mutant. The absolute effect (f<sub>k</sub> - f<sub>1</sub>) therefore peaks around fitness-neutral first mutations. While the next comment (below) says that our empirical investigations more closely resemble comparisons of relative effects than of absolute effects, we would expect the observations in our data to come preferentially from multi-step adaptations with a large absolute effect, since the absolute effect is maximal when both f<sub>k</sub> and f<sub>1</sub> are relatively high.

      In summary, we believe Figure 2, while using exaggerated parameters for defensible reasons (computational efficiency and illustration), is not a misleading illustration of the general phenomenon or of its applicability in biological settings, as effect sizes remain large when moving to biologically realistic parameter ranges. To clarify this issue, we have largely rewritten the relevant paragraphs in the results section and have added two additional figures (Figures SF3 and SF4) as well as a section in the SI with detailed discussion (SI2).

      (A2) More generally, parameter values affect the magnitude of the effect. The authors note, for example, that the relative effect decreases with mutation rate. They suggest that the absolute effect, which increases, is more important, but the relative effect seems more relevant and is what is assessed empirically.

      Thank you for this comment. As noted in the replies to the above comments, we have now included extensive investigations of how sensitive effect sizes are to different parameter choices. We also apologize for insufficiently clearly communicating how the quantities in Figure 4 relate to the findings of our theoretical models.

      The challenge in relating our results to single-timepoint sequencing data is that we only observe the mutations that a tumor has acquired, but we do not directly observe the mutation rate histories that brought about these mutations. As an alternative readout, we therefore consider (through rough proxies: TSGs and APOBEC signatures) the amount of 2-step adaptations per acquired/retained mutation. While we unfortunately cannot control for the average mutation rate in a sample, we motivate using this “TSG-deactivation score” by the hypothesis that for any given mutation rate, we expect a positive relationship between the amount of temporal clustering and the amount of 2-step adaptations per acquired/retained mutation. This hypothesis follows directly from our theoretical model where it formally translates to the statement that for a fixed μ, f<sub>k</sub> is increasing in k.

      However, while both quantities f<sub>k</sub>/f<sub>1</sub> or f<sub>k</sub> - f<sub>1</sub> from our theoretical model relate to this hypothesis – both are increasing in k –, neither of them maps directly onto the formulation of our empirical hypothesis.

      We have now rewritten the relevant passages of the manuscript to more clearly convey our motivation for constructing our TSG deactivation score in this form (P. 4-6).

      (A3) Routes to inactivation of both copies of a TSG that are not accelerated by punctuation will dilute any effects of punctuation. An example is a single somatic mutation followed by loss of heterozygosity. Such mechanisms are not included in the theoretical analysis nor assessed empirically. If, for example, 90% of double inactivations were the result of such mechanisms with a constant mutation rate, a factor of two effect of punctuated mutagenesis would increase the overall rate by only 10%. Consideration of the rate of apparent inactivation of just one TSG copy and of deletion of both copies would shed some light on the importance of this consideration.
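
      The arithmetic behind the dilution example in (A3) can be written out explicitly. The helper below is a hypothetical illustration of that single calculation; its name and arguments are not taken from the manuscript or the review.

```python
def overall_fold_change(fraction_other_routes, fold_increase_affected):
    """Overall fold change in the double-inactivation rate when only the
    routes affected by punctuation are accelerated.

    fraction_other_routes:  share of double inactivations arising via routes
                            that punctuation does not accelerate (e.g. LOH)
    fold_increase_affected: fold increase for the remaining, affected routes
    """
    fraction_affected = 1.0 - fraction_other_routes
    return fraction_other_routes + fraction_affected * fold_increase_affected


# Reviewer's example: 90% unaffected routes, two-fold effect on the rest.
fold = overall_fold_change(0.9, 2.0)  # 0.9 + 0.1 * 2 = 1.1
print(f"overall rate changes by a factor of {fold:.2f}")
```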

      This is a very good point, thank you. In our empirical analyses, the main motivation was to investigate whether we would observe patterns that are qualitatively consistent with our theoretical predictions, i.e. whether we would find positive associations between valley-crossing and temporal clustering. Our aim in the empirical analyses was not to provide a quantitative estimate of how strongly temporally clustered mutation processes affect mutation accumulation in human cancers. We hence restricted attention to only one mutation process which is well characterized to be temporally clustered (APOBEC mutagenesis) and to only one category of (epi)genomic changes (SNPs, in which APOBEC signatures are well characterized). Of course, such an analysis ignores that other mutation processes (e.g. LOH, copy number changes, methylation in promoter regions, etc.) may interact with the mechanisms that we consider in deactivating tumor suppressor genes.

      We have now updated the text to include further discussion of this limitation and further elaboration to convey that our empirical analyses are not intended as a complete quantification of the effect of temporal clustering on mutagenesis in vivo (P. 10-11).

      Several factors besides the effects of punctuated mutation might explain or contribute to the empirical observations:

      (B1) High APOBEC3 activity can select for inactivation of TSGs (references in Butler and Banday 2023, PMID 36978147). This selective force is another plausible explanation for the empirical observations.

      Thank you for making this point. We agree that increased APOBEC3 activity, or any other similar perturbation, can change the fitness effect that any further changes/perturbations to the cell would bring about. Our empirical analyses therefore rely on the assumption that there are no major confounding structural differences in selection pressures between tumors with different levels of APOBEC signature contributions. We have expanded our discussion section to elaborate on this potential limitation (P. 10-11).

      While the hypothesis that APOBEC3 activity selects for inactivation of TSGs has been suggested, there remain other explanations. Either way, the ways in which selective pressures have been suggested to change would not interfere relevantly with the effects we describe. The paper cited in the comment argues that “high APOBEC3 activity may generate a selective pressure favoring” TSG mutations as “APOBEC creates a high [mutation] burden, so cells with impaired DNA damage response (DDR) due to tumor suppressor mutations are more likely to avert apoptosis and continue proliferating”. To motivate this reasoning, in the same passage, the authors cite a high prevalence of TP53 mutations across several cancer types with “high burden of APOBEC3-induced mutations”, but also note that “this trend could arise from higher APOBEC3 expression in p53-mutated tumors since p53 may suppress APOBEC3B transcription via p21 and DREAM proteins”.

      Translated to our theoretical framework, this reasoning builds on the idea that APOBEC3 activity increases the selective advantage of mutants with inactivation of both copies of a TSG. In contrast, the mechanism we describe acts by altering the chances of mutants with only one TSG allele inactivated to inactivate the second allele before going extinct. If homozygous inactivation of TSGs generally conveys relatively strong fitness advantages, lineages with homozygous inactivation would already be unlikely to go extinct. Further increasing the fitness advantage of such lineages would thus manifest mostly in a quicker spread of these lineages, rather than in changes in the chance that these lineages survive. In turn, such a change would have limited effect on the “rate” at which such 2-step adaptations occur, but would mostly affect the speed at which they fixate. It would be interesting to investigate these effects empirically by quantifying the speed of proliferation and chance of going extinct for lineages that newly acquired inactivating mutations in TSGs.

      Beyond this explicit mention of selection pressures, the cited paper also discusses high occurrences of mutations in TSGs in relation to APOBEC. These enrichments, however, are not uniquely explained by an APOBEC-driven change in selection pressures. Indeed, our analyses would also predict such enrichments.

      (B2) Without punctuation, the rate of multistep adaptation is expected to rise more than linearly with mutation rate. Thus, if APOBEC signatures are correlated with a high mutation rate due to the action of APOBEC, this alone could explain the correlation with TSG inactivation.

      Thank you for making this point. Indeed, an identifying assumption that we make is that average mutation rates are balanced between samples with a higher vs lower APOBEC signature contribution. We cannot cleanly test this assumption, as we only observe aggregate mutation counts but not mutation rates. However, the fact that we observe an enrichment for APOBEC-associated mutations among the set of TSG-inactivating mutations (see Figure 4F) would be consistent with APOBEC-mutations driving the correlations in Fig 4D, rather than just average mutation rates. We have now added a paragraph to our manuscript to discuss these points (P. 10-11).

      (B3) The nature of mutations caused by APOBEC might explain the results. Notably, one of the two APOBEC mutation signatures, SBS13, is particularly likely to produce nonsense mutations. The authors count both nonsense and missense mutations, but nonsense mutations are more likely to inactivate the gene, and hence to be selected.

      Thank you for making this point. We have included it in our discussion of potential confounders/limitations in the revised manuscript (P. 10-11).

    1. Reviewer #1 (Public review):

      Summary:

      This study focuses on characterizing the EEG correlates of item-specific proportion congruency effects. In particular, two types of learned associations are characterized. One being associations between stimulus features and control states (SC), and the other being stimulus features and responses (SR). Decoding methods are used to identify SC and SR correlates and to determine whether they have similar topographies and dynamics.

      The results suggest SC and SR associations are simultaneously coactivated and have shared topographies, with the inference being that these associations may share a common generator.

      Strengths:

      Fearless, creative use of EEG decoding to test tricky hypotheses regarding latent associations.

      Nice idea to orthogonalize the ISPC condition (MC/MI) from stimulus features.

      Weaknesses:

      (1) I'm relatively concerned that these results may be spurious. I hope to be proven wrong, but I would suggest taking another look at a few things.

      While a nice idea in principle, the ISPC manipulation seems to be quite confounded with the trial number. E.g., color-red is MI only during phase 2, and is MC primarily only during Phase 3 (since phase 1 is so sparsely represented). In my experience, EEG noise is highly structured across a session and easily exploited by decoders. Plus, behavior seems quite different between Phase 2 and Phase 3. So, it seems likely that the classes you are asking the decoder to separate are highly confounded with temporally structured noise.

      I suggest thinking of how to handle this concern in a rigorous way. A compelling way to address this would be to perform "cross-phase" decoding, however I am not sure if that is possible given the design.
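
      The concern about temporally structured noise can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not an analysis of the authors' data: a single hypothetical "EEG feature" carries only a slow linear drift plus Gaussian noise (no class signal), and a leave-one-out nearest-class-mean classifier stands in for the decoder. All names and parameter values are invented for illustration.

```python
import random


def simulate_feature(n_trials, drift, rng):
    """One feature per trial: slow linear drift across the session plus
    unit-variance Gaussian noise. There is no true class signal."""
    return [drift * t / (n_trials - 1) + rng.gauss(0.0, 1.0) for t in range(n_trials)]


def loo_nearest_mean_accuracy(x, labels):
    """Leave-one-out accuracy of a nearest-class-mean classifier (1-D feature)."""
    correct = 0
    for i in range(len(x)):
        means = {}
        for c in (0, 1):
            vals = [x[j] for j in range(len(x)) if j != i and labels[j] == c]
            means[c] = sum(vals) / len(vals)
        pred = min(means, key=lambda c: abs(x[i] - means[c]))
        correct += int(pred == labels[i])
    return correct / len(x)


rng = random.Random(0)
n = 400
x = simulate_feature(n, drift=4.0, rng=rng)

# Classes confounded with time (like conditions tied to experimental phases) ...
blocked = [0 if t < n // 2 else 1 for t in range(n)]
# ... versus classes interleaved across the whole session.
interleaved = [t % 2 for t in range(n)]

acc_blocked = loo_nearest_mean_accuracy(x, blocked)
acc_interleaved = loo_nearest_mean_accuracy(x, interleaved)
print(f"blocked labels: {acc_blocked:.2f}   interleaved labels: {acc_interleaved:.2f}")
```

      With blocked labels the classifier performs well above chance purely because of the drift, whereas interleaved labels stay near chance. This is the pattern a cross-phase or drift-controlled analysis would need to rule out.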

      The time courses also seem concerning. What are we to make of the SR and SC timecourses, which have aggregate decoding dynamics that look to be <1Hz?

      Some sanity checks would be one place to start. Time courses were baselined, but this is often not necessary with decoding; it can cause bias (10.1016/j.jneumeth.2021.109080), and can mask deeper issues. What do things look like when not baselined? Can variables be decoded when they should not be decoded? What does cross-temporal decoding look like - everything stable across all times, etc.?

      (2) The nature of the shared features between SR and SC subspaces is unclear.

      The simulation is framed in terms of the amount of overlap, revealing the number of shared dimensions between subspaces. In reality, it seems like it's closer to 'proportion of volume shared', i.e., a small number of dominant dimensions could drive a large degree of alignment between subspaces.

      What features drive the similarity? What features drive the distinctions between SR and SC? Aside from the temporal confounds I mentioned above, is it possible that some low-dimensional feature, like EEG congruency effect (e.g., low-D ERPs associated with conflict), or RT dynamics, drives discriminability among these classes? It seems plausible to me - all one would need is non-homogeneity in the size of the congruency effect across different items (subject-level idiosyncrasies could contribute: 10.1016/j.neuroimage.2013.03.039).

      (3) The time-resolved within-trial correlation of RSA betas is a cool idea, but I am concerned it is biased. Estimating correlations among different coefficients from the same GLM design matrix is, in general, biased, i.e., when the regressors are non-orthogonal. This bias comes from the expected covariance of the betas and is discussed in detail here (10.1371/journal.pcbi.1006299). In short, correlations could be inflated due to a combination of the design matrix and the structure of the noise. The most established solution, to cross-validate across different GLM estimations, is unfortunately not available here. I would suggest that the authors think of ways to handle this issue.
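
      The source of this bias can be shown with a minimal simulation. The sketch below assumes a two-regressor ordinary least squares model without intercept, true coefficients of zero, and independent unit-variance noise. It is a generic illustration of the covariance of estimates, which is proportional to the inverse of X'X, and not a reconstruction of the RSA design in the paper. All names and values are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ols_two_regressors(x1, x2, y):
    """Closed-form OLS estimates for y = b1*x1 + b2*x2 + noise (no intercept)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * v for a, v in zip(x1, y))
    t2 = sum(b * v for b, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det


rng = random.Random(1)
n_obs = 100
x1 = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
x2 = [0.6 * a + 0.8 * rng.gauss(0.0, 1.0) for a in x1]  # non-orthogonal to x1

# Correlation between the two estimates implied by the inverse of X'X.
s11 = sum(a * a for a in x1)
s22 = sum(b * b for b in x2)
s12 = sum(a * b for a, b in zip(x1, x2))
expected_corr = -s12 / math.sqrt(s11 * s22)

# Monte Carlo: pure-noise data, so any correlation between estimates is spurious.
b1s, b2s = [], []
for _ in range(2000):
    y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    b1, b2 = ols_two_regressors(x1, x2, y)
    b1s.append(b1)
    b2s.append(b2)
simulated_corr = pearson(b1s, b2s)
print(f"expected={expected_corr:.3f}  simulated={simulated_corr:.3f}")
```

      Even though the true coefficients are all zero, the estimates are correlated because the regressors overlap. In this two-regressor case the sign is opposite to the regressor correlation, and in richer designs the direction and size of the bias depend on the design matrix and the noise structure.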

      (4) Are results robust to running response-locked analyses? Especially the EEG-behavior correlation. Could this be driven by different RTs across trials & trial-types? I.e., at 400 ms post-stim onset, some trials would be near or at RT/action execution, while others may not be nearly as close, and so EEG features would differ & "predict" RT.

      (5) I suggest providing more explanation about the logic of the subspace decoding method - what trialtypes exactly constitute the different classes, why we would expect this method to capture something useful regarding ISPC, & what this something might be. I felt that the first paragraph of the results breezes by a lot of important logic.

      In general, this paper does not seem to be written for readers who are unfamiliar with this particular topic area. If authors think this is undesirable, I would suggest altering the text.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      1. General Statements

      We thank the reviewers for providing thoughtful and constructive feedback, which will help us improve the clarity and rigor of the paper. On balance, the reviews were positive. Reviewer 1 mentioned that “This is a strong manuscript with few problems and all important findings well justified, indeed this is a nicely polished…..high-quality manuscript,” and that “this paper makes a major breakthrough, showing that cell autonomous defects in hTSCs are very likely at the heart of the pathology observed in GIN-prone murine mutants.” Reviewer 3 stated that “The study is well designed, and the manuscript is very well written. The conclusions are supported by the evidence presented.” Reviewer 2 was less enthusiastic, with the main concern being that “The paper is mostly descriptive and often quite confusing leaving one not much closer to understanding the mechanistic basis for the interesting sex-biased semi-lethal phenotype.” Reviewer 2 also felt that figure titles/section headers overstated the results, and recommended improving some technical aspects and tempering conclusions. We think the proposed edits address most issues raised by the reviewers, either through rewriting or by adding data as described below.

      In response to reviewer #1 comments:

      Major comments:

      • I am confused as to the basis of the sex-skewing phenomenon. Is the problem that lack of maternally loaded WT Mcm4 worsens the phenotype, or is the issue that Mcm4C3/C3 dams are less able to retain pregnancies, perhaps being a more inflammatory environment? Also, while there is quite consistent evidence for reduced viability of Mcm4C3/C3McmGt/+ progeny, especially for female progeny, how confident can we be that the genotype of the dam vs. sire is important? Notably on a Ddx58 background, the progeny of the Mcm4C3/C3 sire included seven live male Mcm4C3/C3McmGt/+ but no female.

      Regarding the first point (sex skewing only when female is C3/C3), we also suspected either: 1) the maternal uterine environment, or 2) reduced oocyte quality. Although not reported in this manuscript, we tested #1 by performing embryo transfer experiments. Transferring 2-cell stage embryos from sex-skewing mating to WT females did not rescue the sex-bias. We then examined oocytes from C3/C3 females. We found evidence for compromised mitochondria and transcriptome disruption. However, we are not sure why this happens (poor follicle support? Oocyte intrinsic phenomenon?). We are reserving these results and additional experiments for another paper, especially since this one mainly deals with GIN and placenta development. If the reviewers feel strongly that the embryo transfer data is crucial, we can include it.

      Regarding how confident we are that the genotype of the dam vs. sire is important, this stems from our previous paper by McNairn et al 2019 (the percentage of female C3/C3 M2/+ from sex-skewing mating is 20% compared to 60% from the reciprocal mating), which was quite dramatic. Consistent with this, MCM levels were significantly reduced in the placentae only when the dam was C3/C3 and the sire C3/+ M2/+, but not in the reciprocal cross. The reviewer makes a good observation about the Ddx58 cross; we can only hypothesize that the mutation somehow sensitizes females in this scenario and will make mention of it in the revision. We also realize that we neglected to write in Methods that the Ddx58 allele was coisogenic in the C3H background.

      • I'm not sure what Supplementary Figure 6 is showing (faster differentiation of C3 but less TGC?). Regardless, it's hard to draw too much conclusion from one not-very-pretty Western blot. This figure requires both additional replicates and a better explanation of how it fits with the other conclusions of the paper.

      We hypothesized that the JZ defect observed in the semi-lethal genotype placentas could arise either from impaired maintenance of the progenitor pool or from reduced capacity of mutant trophoblast progenitors to differentiate into the JZ lineage. The blot in Supplementary Figure 6 was intended as a qualitative demonstration that mutant trophoblast stem cells can differentiate into JZ lineages. We recognize that the figure is not definitive and will revise the text to clarify its purpose. A replicate(s) of the Western will be performed as suggested.

      • Supplementary Figure 7F-G is puzzling. Half of the mESCs have gamma-H2AX at all times, including most in S or G2 phase? In Figure S7E, do the quadrants correspond to being negative or positive for gamma-H2AX? At very least, IF images showing clear gamma-H2AX foci would be much more convincing.

      The gates for γH2AX FACS analysis were established using negative controls lacking primary antibody. As reported previously, embryonic stem cells display high basal levels of γH2AX staining (Chuykin et al., Cell Cycle 2008; Turinetto et al., Stem Cells 2012; Ahuja et al., Nat Comm 2016), which likely explains the broad signal observed across cell cycle phases. Regardless, we will provide immunofluorescence staining of γH2AX and foci counts in our revision.

      • The methods section is well detailed, but it would be ideal to clarify how many replicates each Western Blot or flow cytometry experiment is representative of.

      Thanks for the suggestion. We will update this for Fig4 and Fig5.

      Minor comments:

      • Is it possible that cGAS-STING and RIG pathways act redundantly to cause inflammation and lethality, or that other innate immune components are involved? I don't expect the authors to make compound mutants to test this but at least this possibility should be discussed textually.

      We appreciate the reviewer’s point and had the same suspicion. Supporting this, new RNA-seq analysis of Tmem173 KO placentas revealed elevated inflammatory gene expression compared to C3/C3 M2/+ controls, consistent with potential redundancy or feedback regulation. We will add this analysis to the supplementary figures.

      In response to reviewer #2 comments:

      Major comments:

      A major concern throughout the paper is that conclusions are often overstating their data. The title of figure 2 is "placentae with replication stress have smaller junctional and labyrinth zones". However, there is no measure of replication stress in this figure, just a histological evaluation of the placentae from the different mutants. The title of figure 3 is "Impact of GIN on LZ is less than JZ," but there is no measure of GIN, but instead measurement of number of cells in cell cycle and some bulk RNA-seq analysis. Title of figure 4 is "TSCs with increased genomic instability exhibit abnormal phenotypes." Again there is no measure of GIN, but instead staining of derived TSCs for proliferation, cell death, and a TSC marker. Title of figure 5 is "DNA damage responses and G2/M checkpoint activation drive premature TSC differentiation." However, there does not appear to be a difference in gH2AX between the two mutant genotypes. Checkpoint proteins might be up, but need quantification and reproduction. > 4C is the only marker of differentiation. Importantly, all the analyses here are associations, not connections, so cannot use the word "drive". Similar issues can be raised with a number of the supplementary figures.

      The Chaos3 (chromosome aberrations occurring spontaneously 3) model is a well-established system of intrinsic chronic replication stress and GIN. It is characterized by a ~20-fold elevation of blood micronuclei (Shima et al., Nature 2007), a hallmark of GIN (Soxena et al., Mol Cell 2022); a destabilized MCM2-7 helicase prone to replication fork collapse (Bai et al., PLoS Genet 2016); and increased mitotic chromosome abnormalities and decreased dormant origins (Kawabata et al., Mol Cell 2011; Chuang et al., Nucleic Acid Res 2012) that are known to cause GIN and replication stress (Ibarra et al., PNAS 2008). Also, in our previous work (McNairn et al., Nature 2019), we showed that placentae from C3/C3 dams exhibit significantly elevated γH2AX as well as reduced MCM2 and MCM4 protein levels. In our current study, we also observe elevated γH2AX in mutant TSCs (C3/C3 and C3/C3 M2/+), consistent with genomic instability. Nevertheless, we acknowledge that we did not formally demonstrate replication stress (RS) in TSCs, so where appropriate, we will revise figure titles, for example to say “cells/placentae with a GIN or RS genotype.”

      We acknowledge the reviewer's concern regarding western blots. We will provide quantification and statistics in our revision.

      1) A deeper analysis of the cell lines is likely to be the most fruitful path to reveal interesting mechanisms. It is very surprising that there is no phenotype in ESCs. Authors should check for increased apoptosis. Maybe the phenotypic cells are lost. Or do ESCs use different MCMs/mechanisms of DNA replication or are they better able to handle replication stress and GIN? How many passages were the TSCs and ESCs cultured for? Does GIN (i.e. aneuploidy, CNVs) develop in TSCs and ESCs with passaging? How do the MCM mutations impact the molecular identity of the ESC and TSC cells including their heterogeneity in the population.

      We assessed apoptosis using cleaved caspase 3 flow cytometry in mutant ESCs and observed no difference compared to controls (we will add this data as Supplementary Fig. 7).

      We believe there are intrinsic differences between TSCs and ESCs in their ability to respond to and counteract replication stress and DNA damage. ESCs are known to license more replication origins than somatic cells, and at a higher rate, which protects them from short G1-induced replication stress (Ahuja et al., Nat Comm 2016; Ge et al., Stem Cell Rep 2015; Matson et al., eLife 2017). Human placental cells physiologically exhibit high mutation rates and chromosomal instability in vivo (Coorens et al., Nature 2021). Supporting this, Wang, D., et al. (Nat Comm 2025) reported that several cell cycle and DDR regulators are differentially expressed in human TSCs vs human pluripotent stem cells. Whether such transcriptional differences directly contribute to functional outcomes remains to be determined.

      All experiments in this study were conducted using early-passage ESCs and TSCs (i.e. Finally, we showed that close to 90% mutant ESCs are KLF4+ (a naive pluripotency marker) whereas EOMES+ cells were significantly reduced in TSCs carrying the GIN genotype (Fig. 4E–F and Supplementary Fig. 7), highlighting lineage-specific differences.

      Minor Comments:

      1) There is a lack of quantification and repeats for all Westerns. At minimum there should be three repeats for each experiment, quantification including normalization to a reference protein, and stats confirming any proposed differences between conditions.

      We will update our revision with quantification and statistics for western blots.

      2) I would recommend moving the results in supp table 1 to figure 1. While negative, they are the newer results. The results shown in current figure 1 are essentially a reproduction of their previous work.

      The placental observations presented in Fig.1 are new. In particular, the placental and embryonic weight measurements graphed in Fig1B and C have not been published by our group. Fig1A reproduces our previous observation on embryo viability in GIN mutants (McNairn et al., Nature 2019), while the schematic was provided for better flow and readability given the complex mating schemes. We are agnostic on the Suppl Table 1. It could be changed to a new Table 1 in the main section depending on the journal.

      In response to reviewer #3 comments:

      Major Comments

      While the inclusion of bulk RNAseq data of whole placental tissue is appreciated, the interpretation of the results is somewhat problematic, as it is acknowledged that the cell type composition of the placentas is drastically different between groups. Making conclusions based upon GSEA analysis of two different groups with drastically different cell type composition is somewhat misleading, as based on the results, it is a direct reflection of the cell types present. It would be more helpful to perform cell type deconvolution of the RNAseq data to estimate the proportion of each cell type within the bulk samples and compare that to what is seen histologically and not dive too deeply into the pathways since the results could just be a reflection of the cell types e.g. angiogenesis pathways from more endothelial cells. Additionally, the RNAseq data can be leveraged to look at expression of inflammatory genes by sex, which may show interesting patterns based on the other results.

      We agree that the representation of cell types in the placenta is problematic especially for underrepresented genes. We propose to use the BayesPrism tool (Chu et al., Nat Cancer 2022) to deconvolute bulk RNA-seq for better representation of transcriptional changes in the placenta.

      Section: GIN impairs trophoblast stem cell establishment and maintenance. To support the assertion in the first paragraph, beyond measuring apoptosis, it would be helpful at this stage to look at RNA expression levels indicative of the activation of DNA damage checkpoint genes

      We have performed RNA-seq on mutant ESC and TSCs and are in the process of data analysis. We will update these results in the revision.

      Please include additional methodological details in the methods section on the statistical analysis done for differential expression analysis. Specifically, what type of normalization was used, if lowly expressed genes were filtered out and at what cutoff, what statistical model was used (did you include covariates?), what comparisons were made? Did you stratify by sex? What cutoff was used for statistical significance? Did you perform multiple testing correction?

      We will update RNA-Seq data analysis methods in our full revision.

      2. Description of the revisions that have already been incorporated in the transferred manuscript

      Reviewer #1 comments:

      • Supplementary Table 1. would be enhanced greatly showing comparable tables for Mcm4C3/C3 x Mcm4C3/+McmGt/+ in mice without the Tmem173 or Ddx58 mutations. It is fine to recycle data from McNairn 2019 here, as long as the source is indicated, but a comparison is needed.

      Thanks for pointing this out. We have incorporated this suggestion in Supplementary Table 1.

      • In Figure S3E-F, is the box above each graph supposed to show the genotype of the dam?

      Yes. Thanks for pointing this out. We have added a description in the figure legend to make it clear.

      • "Indeed, the placenta and embryo weights of E13.5 Mcm4C3/C3 Mcm2Gt/+ Mcm3Gt/+ animals were significantly improved vs. Mcm4C3/C3 Mcm2Gt/+ animals, rendering them similar to Mcm4C3/C3 littermates (Fig. 6A-C). The JZ (but not LZ) area in Mcm4C3/C3 Mcm2Gt/+ Mcm3Gt/+ placentae also increased to the level of Mcm4C3/C3 littermates (Fig. 6D-H)." There are two problems here. First, the figure calls are wrong. Second, the description of the data is not quite right, it looks like the C3/C3 and C3/C3 M2/+ M3/+ LZs are a similar size to each and are statistically indistinguishable.

      Thanks for catching this. We have updated these in the main text.

      Reviewer #2 comments:

      Minor comment

      • Need to review citations to figures. For example, no citations are made to figure 4a and 4c.

      Thanks for catching this. We have updated the text.

      Reviewer #3 comments:

      Define the first use of >4C DNA content to help readers understand this potentially unfamiliar term.

      We have edited this part to indicate cells with more than 4C DNA content for better clarity.

      iDEP tool - please include citation to manuscript instead of link

      We have updated this citation.

      Check citations. Some citations to BioRxiv that are now published e.g. 13.

      We have updated this citation.

      3. Description of analyses that authors prefer not to carry out

      Reviewer 2

      2) Along similar lines, most of the in vivo phenotypic analyses are performed at E13.5, long after defects are likely beginning to express themselves especially given that they see phenotypes in the TSCs, which represent the polar TE of a E4.5. To understand the primary defects of the in vivo phenotype, they should be looking much earlier. Supplemental figure 5 is a start but represents a rather superficial analysis.

      The peri-implantation period, namely E4.5, represents a “black box” of embryonic development given that this is a critical stage for implantation. Aside from being an extremely difficult stage to analyze technically, we don’t think it is essential to the conclusions (or doable in a timely manner), especially given the use of TSCs. If we complete EdU studies on E6.5 embryos, we will include them.

      3) Fig. 6 would benefit from evidence that MCM3 mutant is rescuing MCM4 levels in the chromatin fraction of cells and the DNA damage phenotype.

      The genetic evidence presented is strong, and although we didn’t do the suggested experiment, we feel that our previous studies (McNairn et al., Nature 2019 and Chuang et al., PLoS Genet 2010) on the effects of MCM3 as a nuclear export factor (as it is in yeast (Liku et al., Mol Biol Cell 2005)) are a reasonable basis for not repeating such experiments. Furthermore, we are no longer maintaining the Mcm3 line and it would take over a year to reconstitute and rebreed triple mutants.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      This paper by Poverlein et al reports the substantial membrane deformation around the oxidative phosphorylation super complex, proposing that this deformation is a key part of super complex formation. I found the paper interesting and well-written but identified a number of technical issues that I suggest should be addressed:

      We thank Reviewer 1 for finding our work interesting. We have addressed the technical issues below.

      (1) Neither the acyl chain chemical makeup nor the protonation state of CDL are specified. The acyl chain is likely 18:2/18:2/18:2/18:2, but the choice of the protonation state is not straightforward.

      We thank the Reviewer for highlighting this missing information. We have now added this information in the Materials and Methods section:

      "…were performed in a POPC:POPE:cardiolipin (2:2:1) membrane containing 5 mol% QH<sub>2</sub> / Q (1:1 ratio). Cardiolipin was modeled as tetraoleoyl cardiolipin (18:1/18:1/18:1/18:1) with a headgroup modeled in a singly protonated state (with Q<sub>tot</sub>=-1)."

      (2) The analysis of the bilayer deformation lacks membrane mechanical expertise. Here I am not ridiculing the authors - the presentation is very conservative: they find a deformed bilayer, do not say what the energy is, but rather try a range of energies in their Monte Carlo model - a good strategy for a group that focuses on protein simulations. The bending modulus and area compressibility modulus are part of the standard model for quantifying the energy of a deformed membrane. I suppose in theory these might be computed by looking at the per-lipid distribution in thickness fluctuations, but this route is extremely perilous on a per-molecule basis. Instead, the fluctuation in the projected area of a lipid patch is used to imply the modulus [see Venable et al "Mechanical properties of lipid bilayers from molecular dynamics simulation" 2015 and citations within]. Variations in the local thickness of the membrane imply local variations of the leaflet normal vector (the vector perpendicular to the leaflet surface), which is curvature. With curvature and thickness, the deformation energy is analyzed.

      See:

      Two papers: "Gramicidin A Channel Formation Induces Local Lipid Redistribution" by Olaf Andersen and colleagues. Here the formation of a short peptide dimer is experimentally linked to hydrophobic mismatch. The presence of a short lipid reduces the influence of the mismatch. See below regarding their model cardiolipin, which they claim is shorter than the surrounding lipid matrix.

      Also, see:

      Faraldo-Gomez lab "Membrane transporter dimerization driven by differential lipid solvation energetics of dissociated and associated states", 2021. Mondal et al "Membrane Driven Spatial Organization of GPCRs" 2013 and many citations within these papers.

      While I strongly recommend putting the membrane deformation into standard model terms, I believe the authors should retain the basic conservative approach that the membrane is strongly deformed around the proteins and that making the SC reduces the deformation, then exploring the consequences with their discrete model.

      We thank the Reviewer for the suggestions and for pointing out the additional references, which are now cited in the revised manuscript. The analysis is indeed significantly more complex for large multi-million atom supercomplexes in comparison to small peptides (gramicidin A) or model systems of lipid membranes. However, in the revised manuscript, we have conducted further analysis on the membrane curvature effects based on the suggestions. We were able to estimate the energetic contribution of the changes in local membrane thickness and curvature, which are now summarized in Table 1, and described in the main text and SI. We find that both the curvature and local thickness contribute to the increased stability of SC.

      We have now extensively modified the result to differentiate between different components of membrane strain properly:

      "We observe a local decrease in the membrane thickness at the protein-lipid interface (Fig. 2G, Fig S2A,D,E), likely arising from the thinner hydrophobic belt region of the OXPHOS proteins (ca. 30 Å, Fig. S1A) relative to the lipid membrane (40.5 Å, Fig. S1). We further observe ∼30% accumulation of cardiolipin at the thinner hydrophobic belt regions (Fig. 2H, Fig. S2B,F,G), with an inhomogeneous distribution around the OXPHOS complexes. While specific interactions between CDL and protein residues may contribute to this enrichment (Fig. 2N), CDL prefers thermodynamically thinner membranes (∼38 Å, Fig. S1B, Fig. S5F). These changes are further reflected in the reduced end-toend distance of lipid chains in the local membrane belt (see Methods, Fig. S6, cf. also Refs. (41-44). In addition to the perturbations in the local membrane thickness, the OXPHOS proteins also induce a subtle inward curvature towards the protein-lipid interface (Fig. S5G), which could modulate the accessibility of the Q/QH2 substrate into the active sites of CI and CIII<sub>2</sub> (see below, section Discussion). This curvature is accompanied by a distortion of the local membrane plane itself (Fig. 2A-F, Fig. S4AC, Fig. S7), with perpendicular leaflet displacements reaching up to ~2 nm relative to the average leaflet plane.

      To quantify the membrane strain effects, we analyzed the cgMD trajectories by projecting the membrane surface onto a 2-dimensional grid and calculating the local membrane height and thickness at each grid point. From these values, we quantified the local membrane curvature (Fig. S5H), which measures the energetic cost of deforming the membrane from a flat geometry (ΔG<sub>curv</sub>). We also computed the energetics associated with changes in the membrane thickness, assessed from the deviations from an ideal local membrane in the absence of embedded proteins (ΔG<sub>thick</sub>, see Supporting Information, for technical details). Our analysis suggests that both contributions are substantially reduced upon formation of the SC, with the curvature penalty decreasing by 19.8 ± 1.3 kcal mol<sup>-1</sup> and the thickness penalty by 2.8 ± 2.0 kcal mol<sup>-1</sup> (Table 1). These results indicate a significant thermodynamic advantage for SC formation, as it minimizes lipid deformation and stabilizes the membrane environment surrounding Complex I and III.”

      […]

      “Taken together, the analysis suggests that the OXPHOS complexes affect the mechanical properties of the membranes by inducing a small inwards curvature towards the protein-lipid interface (Fig. S5), resulting in a membrane deformation effect, while the SC formation releases some deformation energy relative to the isolated OXPHOS complexes. The localization of specific lipids around the membrane proteins, as well as local membrane perturbation effects, is also supported by simulations of other membrane proteins (45, 46), suggesting that the effects could arise from general protein-membrane interactions.”

      Our Supporting Information section now provides additional information about the membrane curvature.

      (41) R. M. Venable, F. L. H. Brown, R. W. Pastor, Mechanical properties of lipid bilayers from molecular dynamics simulation. Chemistry and Physics of Lipids 192, 60-74 (2015).

      (42) R. Chadda et al., Membrane transporter dimerization driven by differential lipid solvation energetics of dissociated and associated states. eLife 10, e63288 (2021).

      (43) S. Mondal et al., Membrane Driven Spatial Organization of GPCRs. Scientific Reports 3, 2909 (2013).

      (44) J. A. Lundbæk, S. A. Collingwood, H. I. Ingólfsson, R. Kapoor, O. S. Andersen, Lipid bilayer regulation of membrane protein function: gramicidin channels as molecular force probes. Journal of The Royal Society Interface 7, 373-395 (2009).

      We also expanded our SI Method section to account for the new calculations:

      “Analysis of lipid chain end-to-end length

      To probe the protein-induced deformation effect of the membrane, the membrane curvature (H), and the end-to-end distance between the lipid chains, were computed based on aMD and cgMD simulations. The lipid chain length was computed from simulations A1-A6 and C1 based on the first and last carbon atoms of each lipid chain. For example, the end-to-end length of a cardiolipin chain was determined as the distance between atom “CA1” and atom “CA18”.

      “Membrane Curvature and Deformation Energy

      The local mean curvature of the membrane midplane was computed by approximating the membrane surface as a height function Z(x,y), defined as the average location of the N-side and P-side leaflets at each grid point. Based on this, the mean curvature H(x,y) was calculated as,

      where the derivatives are defined as .

      The thickness deformation energy was computed from the local thickness d(x,y) relative to a reference thickness distribution F(d), derived from membrane-only simulations, and converted to a free energy profile via Boltzmann inversion. At each grid point, the F(d) was summed over the grid,

      The bending deformation energy was computed from the mean curvature field H(x,y), assuming a constant bilayer bending modulus κ (taken as 20 kJ mol<sup>-1</sup> = 4.78 kcal mol<sup>-1</sup>):

      where ΔA is the area of the grid cell.

      The thickness and curvature fields were obtained by projecting the coarse-grained MD trajectories (one frame per ns) onto a 2D-grid with a resolution of 0.5 nm. Grid points with low occupancy were downweighted to mitigate noise. More specifically, points with counts below 50% of the median grid count were scaled linearly by their relative count value. To focus the analysis on the region around the protein-membrane interface, only grid points within a radius of 20 nm from the center of the complex were included in the energy calculations. Energies were normalized to an effective membrane area of 1000 nm<sup>2</sup> to facilitate the comparison between systems. Bootstrapping with resampling over frames was performed to estimate the standard deviations of G<sub>thick</sub> and G<sub>curv</sub>.

      We find that G<sub>curv</sub> converges slowly due to its sensitivity to local derivatives and the small grid size required to resolve the curvature contribution near the protein. Consequently, tens of microseconds of simulations were necessary to obtain well-converged estimates of the curvature energy.”
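      The grid-based analysis described in this Methods excerpt can be illustrated with a short Python sketch. It is not the authors' code. It assumes the standard Monge-gauge expression for the mean curvature of a height field, H = [(1+Z_y²)Z_xx − 2 Z_x Z_y Z_xy + (1+Z_x²)Z_yy] / [2(1+Z_x²+Z_y²)^(3/2)], finite-difference derivatives from `numpy.gradient`, a bending energy summed as 2κ Σ H² ΔA with zero spontaneous curvature, and a low-occupancy weight that ramps linearly from 0 to 1 at 50% of the median count. All function names are hypothetical.

```python
import numpy as np

KAPPA = 4.78  # kcal/mol; bending modulus value quoted in the text


def mean_curvature(z, spacing):
    """Mean curvature H(x, y) of a height field z on a regular grid (Monge gauge)."""
    zx, zy = np.gradient(z, spacing)        # dZ/dx (axis 0) and dZ/dy (axis 1)
    zxx, zxy = np.gradient(zx, spacing)     # d2Z/dx2 and d2Z/dxdy
    zyy = np.gradient(zy, spacing, axis=1)  # d2Z/dy2
    numerator = (1 + zy**2) * zxx - 2 * zx * zy * zxy + (1 + zx**2) * zyy
    return numerator / (2 * (1 + zx**2 + zy**2) ** 1.5)


def occupancy_weights(counts, threshold=0.5):
    """Assumed weighting: 1 at or above threshold*median, linear ramp to 0 below it."""
    cutoff = threshold * np.median(counts)
    return np.clip(counts / cutoff, 0.0, 1.0)


def bending_energy(h, spacing, weights=None, kappa=KAPPA):
    """Assumed convention: E = 2 * kappa * sum(w * H^2 * dA), zero spontaneous curvature."""
    w = np.ones_like(h) if weights is None else weights
    return 2.0 * kappa * np.sum(w * h**2) * spacing**2


def bootstrap_std(per_frame_values, n_boot=1000, seed=0):
    """Standard deviation of the mean, from resampling frames with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(per_frame_values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    return values[idx].mean(axis=1).std(ddof=1)


# Sanity check on a spherical cap of radius 50 nm sampled on a 0.5 nm grid:
# the curvature at the apex should be close to -1/50 nm^-1.
x = np.arange(-10, 10.5, 0.5)
X, Y = np.meshgrid(x, x, indexing="ij")
H = mean_curvature(np.sqrt(50.0**2 - X**2 - Y**2), 0.5)
```

      Because the curvature depends on second derivatives of a noisy height field, the result is sensitive to the grid spacing, which is consistent with the slow convergence noted above.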

      (1) If CDL matches the hydrophobic thickness of the protein it would disrupt SC formation, not favor it. The authors' hypothesis is that the SC stabilizes the deformed membrane around the separated elements. Lipids that are compatible with the monomer deformed region stabilize the monomer, similarly to a surfactant. That is, if CDL prefers the interface because the interface is thin and their CDL is thin, CDL should prevent SC formation. A simpler hypothesis is that CDL's unique electrostatics are part of the glue.

      We rephrased the corresponding paragraph in the Discussion section to reflect the role of electrostatics for the behavior of cardiolipin.

      "…supporting the involvement of CDL as a "SC glue". In this regard, electrostatic effects arising from the negatively charged cardiolipin headgroup could play an important role in the interaction of the OXPHOS complexes."

      Generally our simulations suggest that CDL prefers thinner membranes, which could rationalize these findings.

      "We find that CDL prefers thinner membranes relative to the neutral phospholipids (PE/PC, Fig. S5F),[…]”

      (2) Error bars for lipid and Q* enrichments should be computed averaging over multi-lipid regions of the protein interface, e.g., dividing the protein-lipid interface into six to ten domains, in particular functionally relevant regions. Anionic lipids may have long, >500 ns residence times, which makes lipid enrichment large and characterization of error bars challenging in short simulations. Smaller regions will be noisy. The plots depicted in, for example, Figure S2 are noisy.

      It is indeed challenging to capture lipid movements on the timescales accessible for atomistic MD, and hence the data in Figure S2 contains some noise. In this regard, for the cgMD data presented in the revised Fig. S2H,I, the concentration data was averaged for six domains of the protein-lipid interface.
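      One way to obtain region-averaged enrichments of this kind is to bin interface lipids into angular sectors around the complex and report the spread across sectors. The Python sketch below is illustrative only: the equal-angle sector definition, the normalization by a bulk fraction, and the function names are assumptions, not the procedure used for Fig. S2H,I.

```python
import numpy as np


def sector_enrichment(xy, is_target, center, bulk_fraction, n_sectors=6):
    """Enrichment of a target lipid in each angular sector around `center`.

    xy            : (n_lipids, 2) in-plane positions of interface lipids
    is_target     : boolean mask marking the lipid type of interest
    bulk_fraction : mole fraction of that lipid in the bulk membrane
    """
    angles = np.arctan2(xy[:, 1] - center[1], xy[:, 0] - center[0])
    sector = ((angles + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    enrichment = np.full(n_sectors, np.nan)  # NaN marks empty sectors
    for s in range(n_sectors):
        members = sector == s
        if members.any():
            enrichment[s] = is_target[members].mean() / bulk_fraction
    return enrichment


def mean_and_sem(values):
    """Mean and standard error over the non-empty sectors."""
    v = values[~np.isnan(values)]
    return v.mean(), v.std(ddof=1) / np.sqrt(len(v))
```

      Treating each sector as one sample gives an error bar that reflects heterogeneity around the interface. It does not correct for long lipid residence times within a sector.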

      (3) The membrane deformation is repeatedly referred to as "entropic" without justification. The bilayer has significant entropic and enthalpic terms just like any biomolecule, why are the authors singling out entropy? The standard "Helfrich" energetic Hamiltonian is a free energy model in that it implicitly integrates over many lipid degrees of freedom.

      We apologize for the unclear message – our intention was not to claim that the effects are purely entropic, but could arise from a combination of both entropic and enthalpic effects. We hope that this has now been better clarified in the revised manuscript. We also agree that it is difficult to separate between entropic and enthalpic effects. However, we wish to point out that, e.g., the temperature-dependence of the SC formation suggests that the entropic contribution is also affecting the process.

      Regarding the Helfrich Hamiltonian, we note that the standard model assumes a homogeneous fluid-like sheet. We therefore have difficulty applying this model to capture the local effects.

      Revisions / clarifications in the main manuscript:

      "SC formation is affected by both enthalpic and entropic effects."

      "We have shown here that the respiratory chain complexes perturb the IMM by affecting the local membrane dynamics. The perturbed thickness and alteration in the lipid dynamics lead to an energetic penalty, which can be related to molecular strain effects, as suggested by the changes of both the internal energy of lipid and their interaction with the surroundings (Fig. S2, S5, S6), which are likely to be of enthalpic origin. However, lipid binding to the OXPHOS complex also results in a reduction in the translational and rotational motion of the lipids and quinone (Fig. S8-S9), which could result in entropic changes. The strain effects are therefore likely to arise from a combination of enthalpic and entropic effects."

      (4) Figure S7 shows the surface area per lipid and leaflet height. This appears to show a result that is central to the interpretation of SC formation but which makes very little sense. One simply does not increase both the height and area of a lipid. This is a change in the lipid volume! The bulk compressibility of most anything is much higher than its Young's modulus [similar to area compressibility]. Instead, something else has happened. My guess is that there is *bilayer* curvature around these proteins and that it has been misinterpreted as area/thickness changes with opposite signs of the two leaflets. If a leaflet gets thin, its area expands. If the manuscript had more details regarding how they computed thickness I could help more. Perhaps they measured the height of a specific atom of the lipid above the average mid-plane normal? The mid-plane of a highly curved membrane would deflect from zero locally and could be misinterpreted as a thickness change.

      We thank the Reviewer for this insightful comment. We chose to define the membrane thickness based on the height of the lipid P-atoms above the average midplane normal. The Reviewer is correct that this measurement gives a changing thickness for a highly curved membrane. However, in this scenario, the thickness would always be overestimated [d<sub>true</sub> = d<sub>measured</sub> × cos(θ), where θ is the angle between the global mid-plane normal and the local mid-plane normal]. Therefore, since we observe a smaller thickness at the protein-lipid interface, the effect is not likely to result from an artifact. For further clarification, see Fig. S4I showing the averaged local position of the P-atoms in the cgMD simulations, which further supports that there is a local deformation of the lipid.
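      The direction of this bias can be checked with a small geometric sketch. It assumes a locally planar bilayer patch whose normal is tilted by an angle θ from the global normal, so that a thickness measured along the global normal equals the perpendicular thickness divided by cos(θ). The function name is hypothetical.

```python
import math


def perpendicular_thickness(measured, tilt_deg):
    """Perpendicular thickness of a tilted planar patch, given the thickness
    measured along the global mid-plane normal (same length units)."""
    return measured * math.cos(math.radians(tilt_deg))


# A patch measured as 40 Å along the global normal, at increasing tilt angles
for tilt in (0, 10, 20, 30):
    print(tilt, round(perpendicular_thickness(40.0, tilt), 2))
```

      Under this assumption the measured value is never smaller than the perpendicular thickness, so local tilt alone cannot produce an apparent thinning.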

      The changes in the local membrane thickness are also supported by our analysis of the membrane thickness (Fig.S2A) and by the lipid chain length distributions (Fig.S6).

      (5) The authors write expertly about how conformational changes are interpreted in terms of function but the language is repeatedly suggestive. Can they put their findings into a more quantitative form with statistical analysis? "The EDA thus suggests that the dynamics of CI and CIII2 are allosterically coupled."

      We extended our analysis on the allosteric effects, which is now described in the revised main text, the SI and the Methods section:

      "In this regard, our graph theoretical analysis (Fig. S11C,D) further indicates that ligand binding to Complex I induces a dynamic crosstalk between NDUFA5 and NDUFA10, consistent with previous work (50, 51), and affecting also the motion of UQCRC2 with respect to its surroundings. Taken together, these effects suggest that the dynamics of CI and CIII<sub>2</sub> show some correlation that could result in allosteric effects, as also indicated based on cryo-EM analysis (40)."

      “Extended Methods

      Allosteric Network Analysis. Interactions between amino acid residues were modeled as an interaction graph, where each residue was represented by a vertex. Two vertices were connected by an edge if the Cα atoms of the corresponding amino acid residues were closer than 7.5 Å for more than 50% of the frames of simulations S1-S6 (time step of frames: 1 ns). (7) This analysis was carried out for the aMD simulations of the supercomplex, analyzing differences between the Q bound and apo states (simulations A1+A2+A3 vs. A4+A5+A6).”
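      The edge criterion stated above (a distance below 7.5 Å in more than 50% of frames) can be sketched in Python as follows. This is a minimal illustration with a hypothetical function name and an assumed input layout (an array of Cα coordinates per frame). It is not the authors' implementation and omits the subsequent graph-theoretical analysis.

```python
import numpy as np


def contact_graph(ca_coords, cutoff=7.5, min_fraction=0.5):
    """Boolean adjacency matrix of a residue interaction graph.

    ca_coords : array of shape (n_frames, n_residues, 3), coordinates in Å
    An edge is set when two residues are closer than `cutoff` in more than
    `min_fraction` of the frames.
    """
    n_frames, n_res, _ = ca_coords.shape
    counts = np.zeros((n_res, n_res), dtype=int)
    for frame in ca_coords:
        # Pairwise distances within one frame
        diff = frame[:, None, :] - frame[None, :, :]
        counts += np.linalg.norm(diff, axis=-1) < cutoff
    adjacency = counts / n_frames > min_fraction
    np.fill_diagonal(adjacency, False)  # no self-edges
    return adjacency
```

      Looping over frames keeps memory use at one distance matrix at a time, which matters for a multi-thousand-residue supercomplex. For very large systems a neighbor search would be preferable to the full pairwise matrix.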

      (6) The authors write "We find that an increase in the lipid tail length decreases the relative stability of the SC (Figure S5C)" This is a critical point but I could not interpret Figure S5C consistently with this sentence. Can the authors explain this?

      We apologize for this oversight. This sentence should refer to Fig. S5F, which has now been corrected. We have additionally updated the figure to provide an improved estimation of the thickness contribution based on the lipid tail length.

      "We find that an increase in the lipid tail length decreases the relative stability of the SC (Fig. S5F)"

      (7) The authors use a 6x6 and 15x15 lattice to analyze SC formation. The SC assembly has 6 units of E_strain favoring its assembly, which they take up to 4 kT. At 3 kT, the SC should be favored by 18 kT, or a Boltzmann factor of 10^8. With only 225 sites, specific and non-specific complex formation should be robust. Can the authors please check their numbers or provide a qualitative guide to the data that would make clear what I'm missing?
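      The arithmetic behind this estimate can be reproduced in a few lines of Python. The sketch assumes that the six strain contributions add independently, as the comment implies. The function name is illustrative.

```python
import math


def boltzmann_factor(n_units, strain_per_unit_kT):
    """Relative statistical weight exp(n * E / kT) for n additive strain units,
    with the per-unit energy given in units of kT."""
    return math.exp(n_units * strain_per_unit_kT)


factor = boltzmann_factor(6, 3.0)  # 6 units at 3 kT each = 18 kT
print(f"{factor:.2e}", round(math.log10(factor), 2))
```

      For 18 kT this gives roughly 6.6 × 10^7, i.e. just under the 10^8 quoted in the comment.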

      In the revised manuscript, we have now clarified the definition of the lattice model and the respective energies:

      In summary, the qualitative data presented are interesting (especially the combination of molecular modeling with simpler Monte Carlo modeling aiding broader interpretation of the results) ... but confusing in terms of the non-standard presentation of membrane mechanics and the difficulty of this reviewer to interpret some of the underlying figures: especially, the thickness of the leaflets around the protein and the relative thickness of cardiolipin. Resolving the quantitative interpretation of the bilayer deformation would greatly enhance the significance of their Monte Carlo model of SC formation.

      We thank the Reviewer for the helpful suggestion. We hope that the revisions help to clarify the non-standard presentation and connect to concepts used in the lipid membrane community.

      Reviewer #2 (Public review):

      Summary:

      The authors have used large-scale atomistic and coarse-grained molecular dynamics simulations on the respiratory chain complex and investigated the effect of the complex on the inner mitochondrial membrane. They have also used a simple phenomenological model to establish that the super complex (SC) assembly and stabilisation are driven by the interplay between the "entropic" forces due to strain energy and the enthalpic forces (specific and non-specific) between lipid and protein domains. The authors also show that the SC in the membrane leads to thinning and there is preferential localisation of certain lipids (Cardiolipin) in the annular region of the complex. The data reports that the SC assembly has an effect on the conformational dynamics of individual proteins making up the assembled complex and they undergo "allosteric crosstalk" to maintain the stable functional complex. From their conformational analyses of the proteins (individual and while in the complex) and membrane "structural" properties (such as thinning/lateral organization etc) as well as from the output of their phenomenological lattice model, the authors have provided possible implications and molecular origin about the function of the complex in terms of aspects such as charge currents in the inner mitochondrial membrane, proton transport activity and ATP synthesis.

      Strengths:

      The work is bold in terms of undertaking modelling and simulation of such a large complex that requires simulations of about a million atoms for long time scales. This requires technical acumen and resources. Also, the effort to make connections to experimental readouts has to be appreciated (though it is difficult to connect functional pathways with limited (additive forcefield) simulations).

      We thank the Reviewer for recognizing the challenge in simulating multimillion atom membrane proteins. We also thank the Reviewer for recognizing the connections we have made to different experiments. Our work indeed relies on atomistic and coarse-grained molecular simulations, which are widely recognized to provide accurate models of membrane proteins.

      Weakness:

      There are several weaknesses in the paper (please see the list below). Claims such as "entropic effect", "membrane strain energy" and "allosteric cross talks" are not properly supported by evidence and seem far-fetched at times. There are other weaknesses as well. Please see the list below.

      We thank the Reviewer for pointing out that key concepts needed further clarification. Please see answers to specific questions below:

      (i) Membrane "strain energy" has been loosely used and no effort is made to explain what the authors mean by the term and how they would quantify it. If the membrane is simulated in stress-free conditions, where are strains building up from?

      We thank the Reviewer for this important question. In the revised manuscript, we have toned down the assignment of the effects into pure entropic or enthalpic effects. We have also provided further clarification of the effects observed in the membrane.

      Example of revisions / clarifications in the main text:

      "SC formation is affected by both enthalpic and entropic effects."

      "We have shown here that the respiratory chain complexes perturb the IMM by affecting the local membrane dynamics. The perturbed thickness and alteration in the lipid dynamics lead to an energetic penalty, which can be related to molecular strain effects, as suggested by the changes of both the internal energy of lipid and their interaction with the surroundings (Fig. S2, S5, S6), which are likely to be of enthalpic origin. However, lipid binding to the OXPHOS complex, also results in a reduction in the translational and rotational motion of the lipids and quinone (Fig. S8-S9), which could result in entropic changes. The strain effects are therefore likely to arise from a combination of enthalpic and entropic effects."

      We have also revised the result section, where we now have explicitly defined and clarified the different contributions to membrane strain, observed in our simulations:

      In the following, we define membrane strain as the local perturbations of the lipid bilayer induced by protein-membrane interactions. These include changes in (i) membrane thickness, (ii) the local membrane composition, (iii) lipid chain configurations, and (iv) local curvature of the membrane plane relative to an undisturbed, protein-free bilayer. Together, these phenomena reflect the thermodynamic effects associated with accommodating large protein complexes within the membrane.

      We now also provide a more quantitative estimation of the membrane strain based on the contribution of changes in local thickness and curvature, summarized in Table 1.

      (ii) In result #1 (Protein membrane interaction modulates the lipid dynamics ....), I strongly feel that the readouts from simulations are overinterpreted. Membrane lateral organization in terms of lipids having preferential localisation is not new (see doi: 10.1021/acscentsci.8b00143) nor membrane thinning and implications to function (https://doi.org/10.1091/mbc.E20-12-0794). The distortions that are visible could be due to a mismatch in the number of lipids that need to be there between the upper and lower leaflets after the protein complex is incorporated. Also, the physiological membrane will have several chemically different lipids that will minimise such distortions as well as would be asymmetric across the leaflets - none of which has been considered. Connecting chain length to strain energy is also not well supported - are the authors trying to correlate membrane order (Lo vs Ld) with strain energy?

      We thank the Reviewer for the suggestions. The role of the membrane in driving supercomplex formation has not, to our knowledge, been suggested before. There are certainly many important studies, which have been better highlighted in the revised manuscript. In this context, we now also cite the papers by Srivastava and coworkers and by Tieleman and coworkers.

      “The localization of specific lipids around the membrane proteins, as well as local membrane perturbation effects, are also supported by simulations of other membrane proteins (45, 46), suggesting that the effects could arise from general protein-membrane interactions.”

      (45) V. Corradi et al., Lipid–Protein Interactions Are Unique Fingerprints for Membrane Proteins. ACS Central Science 4 (June 13, 2018).

      (46) K. Baratam, K. Jha, A. Srivastava, Flexible pivoting of dynamin pleckstrin homology domain catalyzes fission: insights into molecular degrees of freedom. Molecular Biology of the Cell 32 (2021 Jul 1).

      Physiological membrane will have several chemically different lipids that will minimise such distortions as well as would be asymmetric across the leaflets

      We agree with this point. As shown in Figs. 2H,N, S6, S13, we suggest that cardiolipin functions as a buffer molecule. However, very little is experimentally known about the asymmetric distribution of lipids in the IMM. Therefore, modelling the effect of asymmetry across the leaflets is outside the scope of this study. Moreover, as now better clarified in the revised manuscript, we agree that it is difficult to unambiguously divide the effect into enthalpic and entropic contributions.

      To address the main concern of the Reviewer, we have updated the main text and Supporting Information to clearly state the different aspects of how the protein-membrane interactions induce perturbations of the lipid bilayer. We define these effects as membrane strain. We now use the changes in local thickness and local curvature to quantify the effect of membrane strain on the stability of the respiratory SC.

      (iii) Entropic effect: What is the evidence towards the entropic effect? If strain energy is entropic, the authors first need to establish that. They discuss enthalpy-entropy compensation but there is no clear data or evidence to support that argument. The lipids will rearrange themselves or have a preference to be close to certain regions of the protein and that generally arises because of enthalpic reasons (see the body of work done by Carol Robinson with Mass Spec where certain lipids prefer proteins in the GAS phase, certainly there is no entropy at play there). I find the claims of entropic effects very unconvincing.

      We agree that it is difficult to distinguish the entropic vs. enthalpic contributions. In the revised manuscript, we better clarify that both effects are likely to be involved.

      The native MS work by Robinson and coworkers and others supports that many lipids are strongly bound to membrane proteins, as also supported by the local binding of certain lipid molecules, such as CDL to the SC (Figs. S2, S6, S13).

      We suggest that the accumulation of cardiolipin at the protein-lipid interface involves a combination of entropic and enthalpic effects, arising from the reduction of the lipid mobility (entropy) as indicated by lowered diffusion (Fig. S9), and formation of noncovalent bonds between the lipid and the OXPHOS protein (Fig. S14).

      We added further clarification to the Discussion section.

      “Taken together, our combined findings suggest that the SC formation is affected by thermodynamic effects that reduce the molecular strain in the lipid membrane, whilst the perturbed micro-environment also affects the lipid and Q dynamics, as well as the dynamics of the OXPHOS proteins (see below).”

      (iv) The changes in conformations dynamics are subtle as reported by the authors and the allosteric arguments are made based on normal mode analyses. In the complex, there are large overlapping regions between the CI, CIII2, and SCI/III2. I am not sure how the allosteric crosstalk claim is established in this work - some more analyses and data would be useful. Normal mode analyses (EDA) suggest that the motions are coupled and correlated - I am not convinced that it suggests that there is allosteric cross-talk.

      Our analysis suggests that the SC changes the dynamics of the system. Although it is difficult to assign how these effects result in activity modulation of the system, we note these changes relate to sites that are central for the charge transfer reactions. We thank the Reviewer for suggesting to extend the analysis, which further suggests that regions of the proteins could be allosterically coupled.

      (v) The lattice model should be described better and the rationale for choosing the equation needs to be established. Specific interactions look unfavourable in the equation as compared to non-specific interactions.

      We have now provided further clarification of the lattice model in the Methods section. Addition to the main text:

      “Lattice model of SC formation. A lattice model of the CI and CIII<sub>2</sub> was constructed (Fig. 4A,B) by modeling the OXPHOS proteins in unique grid positions on a 2D N×N lattice. Depending on the relative orientation, the protein-protein interaction was described by specific interactions (giving rise to the energetic contribution E<sub>specific</sub> < 0) and non-specific interactions (E<sub>non-specific</sub> > 0). The membrane-protein interaction determined the strain energy of the membrane (E<sub>strain</sub>), based on the number of neighboring "lipid" occupied grids that are in contact with proteins (Fig. 4A). The interaction between the lipids was indirectly accounted for by the background energy of the model. The proteins could occupy four unique orientations on a grid ([North, East, South, West]). The states and their respective energies that the system can visit are summarized in Table S6.”

      “The conformational landscape was sampled by Monte Carlo (MC) using 10<sup>7</sup> MC iterations with 100 replicas. Temperature effects were modeled by varying β, and the effect of different protein-to-lipid ratios by increasing the grid area. The simulation details can be found in Table S7.”
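
To make the structure of such a lattice model concrete, the following is a heavily simplified Metropolis Monte Carlo sketch. It is not the authors' model: it assumes that each protein occupies a single site on a periodic N×N lattice (N ≥ 3), that the "correct" orientation means the two proteins face each other, and that the strain term counts lipid sites adjacent to a protein (8 when the proteins are apart, 6 when they are neighbors). All function names, parameter values, and the energy form are illustrative assumptions.

```python
import math
import random

DIRS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # North, East, South, West

def classify(state, n):
    """Return 'sc', 'nonspecific', or 'free' for two proteins (x, y, orientation)."""
    (xa, ya, oa), (xb, yb, ob) = state
    for k, (dx, dy) in enumerate(DIRS):
        if ((xa + dx) % n, (ya + dy) % n) == (xb, yb):
            # Specific contact only if the proteins point at each other.
            facing = oa == k and ob == (k + 2) % 4
            return "sc" if facing else "nonspecific"
    return "free"

def energy(state, n, e_spec, e_nonspec, e_strain):
    """Assumed energy: strain per lipid-contact site plus a contact term."""
    kind = classify(state, n)
    if kind == "free":
        return 8 * e_strain           # 4 lipid neighbours per protein
    contact = e_spec if kind == "sc" else e_nonspec
    return 6 * e_strain + contact     # the shared edge removes 2 lipid contacts

def sc_population(n=6, beta=1.0, e_spec=-1.0, e_nonspec=1.0, e_strain=1.0,
                  steps=20000, seed=0):
    """Fraction of Metropolis steps spent in the specific (SC) state."""
    rng = random.Random(seed)
    state = [(0, 0, 0), (n // 2, n // 2, 0)]
    e_old = energy(state, n, e_spec, e_nonspec, e_strain)
    sc_count = 0
    for _ in range(steps):
        i = rng.randrange(2)
        x, y, o = state[i]
        if rng.random() < 0.5:        # translate to a neighbouring site
            dx, dy = rng.choice(DIRS)
            trial = ((x + dx) % n, (y + dy) % n, o)
        else:                         # pick a new orientation
            trial = (x, y, rng.randrange(4))
        new_state = list(state)
        new_state[i] = trial
        if new_state[0][:2] != new_state[1][:2]:   # reject overlapping proteins
            e_new = energy(new_state, n, e_spec, e_nonspec, e_strain)
            if e_new <= e_old or rng.random() < math.exp(-beta * (e_new - e_old)):
                state, e_old = new_state, e_new
        if classify(state, n) == "sc":
            sc_count += 1
    return sc_count / steps

low = sc_population(e_strain=0.0)    # no membrane strain
high = sc_population(e_strain=2.0)   # strain penalises exposed protein-lipid contacts
print(low, high)
```

In this toy version, raising `e_strain` increases the SC population because bringing the two proteins together removes two strained lipid contacts, which is the qualitative mechanism the lattice model is meant to capture.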

      Reviewer #3 (Public review):

      Summary:

      In this contribution, the authors report atomistic, coarse-grained, and lattice simulations to analyze the mechanism of supercomplex (SC) formation in mitochondria. The results highlight the importance of membrane deformation as one of the major driving forces for SC formation, which is not entirely surprising given prior work on membrane protein assembly, but certainly of major mechanistic significance for the specific systems of interest.

      Strengths:

      The combination of complementary approaches, including an interesting (re)analysis of cryo-EM data, is particularly powerful and might be applicable to the analysis of related systems. The calculations also revealed that SC formation has interesting impacts on the structural and dynamical (motional correlation) properties of the individual protein components, suggesting further functional relevance of SC formation. Overall, the study is rather thorough and highly creative, and the impact on the field is expected to be significant.

      Weaknesses:

      In general, I don't think the work contains any obvious weaknesses, although I was left with some questions.

      We thank the Reviewer for acknowledging that our work is thorough and creative, and that it is likely to have a significant impact on the field.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Diffusion is quantified in speed units (Figure S8). The authors should explain why they have used an apparently incorrect model for quantifying diffusion. The variance of the distribution of a diffusing molecule is linear with time, not its standard deviation (as I suppose I would use for computing effective molecular speed). Perhaps they are quantifying residence times, in which molecules near a wall (protein) will appear to have half the movements of a bulk molecule. This is confusing.
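
The Reviewer's point that the variance, not the standard deviation, grows linearly with time can be illustrated with a short random-walk sketch. It assumes free two-dimensional Brownian motion with a known diffusion coefficient (arbitrary units); the function name and all parameters are illustrative and unrelated to the simulations discussed here.

```python
import random

def simulate_msd(n_walkers=2000, n_steps=100, d_true=1.0, dt=1.0, seed=1):
    """Mean squared displacement of 2D Brownian walkers after each step."""
    rng = random.Random(seed)
    sigma = (2.0 * d_true * dt) ** 0.5   # per-axis step standard deviation
    msd = [0.0] * (n_steps + 1)
    for _ in range(n_walkers):
        x = y = 0.0
        for step in range(1, n_steps + 1):
            x += rng.gauss(0.0, sigma)
            y += rng.gauss(0.0, sigma)
            msd[step] += (x * x + y * y) / n_walkers
    return msd

msd = simulate_msd()
t_short, t_long = 25, 100                      # times in units of dt = 1
d_short = msd[t_short] / (4.0 * t_short)       # D = MSD / (4 t) in two dimensions
d_long = msd[t_long] / (4.0 * t_long)
speed_short = msd[t_short] ** 0.5 / t_short    # RMS displacement / time
speed_long = msd[t_long] ** 0.5 / t_long
print(d_short, d_long, speed_short, speed_long)
```

The diffusion coefficient estimated from MSD/(4t) is essentially the same at both times, whereas the "effective speed" falls roughly as t<sup>-1/2</sup>, so a speed-like quantity depends on the observation window.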

      We thank the Reviewer for the comment. The data shown in the previous version of Figure S8 corresponded to the effective molecular velocity, which is now clarified in the revised figure (now Fig. S9). This measure was used to reflect the average residence time of the groups in the vicinity of the sites.

      However, as suggested by the Reviewer, we now also analyzed the position-dependent diffusion of the quinone in the new Figure S9:

      (2) With a highly charged bilayer a large water layer is necessary to verify that the concentration of salt is plateauing at 150 mM at the box edge. 45 Å appears to be the default in CHARMM-GUI, but this default guidance is not based on the charge of the bilayer. I suggest the authors plot the average concentration of both anions and cations in mM units along the z coordinate of the simulation cell.

      We thank the Reviewer for the suggestion. We have now provided an analysis of the average ion concentrations along the z coordinate, supporting that the salt concentration plateaus at 150 mM at the box edge.
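
A concentration profile of this kind can be sketched as follows. This is a minimal illustration, assuming a constant lateral box area and ion z coordinates already extracted per frame; the function name, argument names, and the toy numbers are assumptions, and a real analysis would read coordinates from the trajectory and treat each ion species separately.

```python
AVOGADRO = 6.02214076e23
A3_TO_LITRE = 1.0e-27  # 1 Å^3 = 1e-27 L

def concentration_profile(z_per_frame, box_xy_area, z_min, z_max, n_bins):
    """Average ion concentration (mM) in slabs along z.

    z_per_frame: list of frames, each a list of ion z coordinates in Å.
    box_xy_area: lateral box area in Å^2 (assumed constant for simplicity).
    """
    width = (z_max - z_min) / n_bins
    counts = [0] * n_bins
    for zs in z_per_frame:
        for z in zs:
            if z_min <= z < z_max:
                index = min(int((z - z_min) / width), n_bins - 1)
                counts[index] += 1
    slab_litres = box_xy_area * width * A3_TO_LITRE
    n_frames = len(z_per_frame)
    # ions per frame -> mol -> mol/L -> mmol/L
    return [1000.0 * c / n_frames / (AVOGADRO * slab_litres) for c in counts]

# Toy input: one frame, 9 ions in the first 10 Å slab of a 100 Å x 100 Å box.
profile = concentration_profile([[5.0] * 9], box_xy_area=1.0e4,
                                z_min=0.0, z_max=20.0, n_bins=2)
print(profile)
```

For orientation, 150 mM corresponds to about 9 ions in a 100 Å × 100 Å × 10 Å slab, so well-converged profiles need averaging over many frames.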

      Typos:

      SI: "POPC/POPE or CLD" should be CDL

      We apologize for the mistake. We have corrected the typos:

      "of the membrane thickness in a POPC/POPE/CDL/QH2 membrane and a CDL membrane."

      "a pure CDL membrane"

      Reviewer #2 (Recommendations for the authors):

      (1) Suggestion regarding membrane strain energy claims:

      Changes in area per lipid and membrane thinning are surely not akin to membrane strain energy changes. At best, the authors should calculate the area compressibility (both in bilayers with and without proteins) and then make comments. In general, if they are talking about the in-plane properties (bilayer being liquid in 2D), I do not see how they can discuss membrane strain energy with the NPT (1 atm) barostat reservoir that they are simulating against. At least they can try to plot the membrane lateral pressures in various conditions and then start making such comments. If it was a closed vesicle, I would expect some tension in the membrane due to the closed surface but in the conditions in which the simulations are run, I do not see how strain is so important. If the authors want to be more rigorous, they can calculate "atomic virial" values by doing a tessellation and showing the data to make their point. Strain energy would mean that there is a modulus in-plane. Bending modulus would surely change with membrane thinning and area compressibility changes (simple plate theory) but linear strain is surely something to be defined well before making claims out of it.

      Our work shows that the OXPHOS proteins alter the local membrane thickness and curvature, and we now quantify the deformation penalty associated with that (Table 1). As stated above, we now provide a better definition and description of 'membrane strain' and the observed effect, which is likely to contain both enthalpic and entropic contributions.

      As suggested by the Reviewer, we have computed the lateral pressure profiles around the OXPHOS proteins, further supporting that there are energetic effects related to the "solvation" of the membrane proteins in the IMM. To this end, Figs. S2D,E, S4I, and S5G,H show the membrane distortion effect, while Fig. S5A supports that the 'internal energy' of the lipids changes as a result of the SC formation, further justifying that these effects can be assigned as 'strain effects'. The analysis has also been extended by computing the lipid end-to-end distances, shown in Fig. S6.

      Unfortunately, it is technically unfeasible to accurately estimate the area compressibility, bending modulus, or the atomic virial for the present multi-million atom membrane protein simulations.

      Summary of Revisions/Additions:

      Fig. S2 [...] (D, E) Difference in the membrane thickness around the SC relative to CI (left) or relative to CIII<sub>2</sub> (right) from (D) aMD and (E) cgMD.

      Fig. S4. [...] (I) Visualization of the membrane distortion effect.

      Fig. S5. Analysis of membrane-induced distortion effects. (A) Relative strain effect relative to a lipid membrane from atomistic MD simulations of the SCI/III2, CI, and CIII<sub>2</sub>, suggesting reduction of the membrane strain (blue patches) in the SC surroundings. The figure shows the non-bonded energies relative to the average non-bonded energies from membrane simulations (simulation M4, Table S1). (B) The lipid strain contribution for different lipids calculated from non-bonded interaction energies of the lipids relative to the average lipid interaction in an IMM membrane model (simulation M4). The figure shows the relative strain contribution for nearby lipids (r < 2 Å, in color from panel (C)) and lipids > 5 Å from the OXPHOS proteins. (C) Selection of lipids (< 2 Å) interacting with the OXPHOS proteins. (D) Potential of mean force (PMF) of membrane thickness derived from thickness distributions from cgMD simulations of a membrane, the SCI/III2, CI, and CIII<sub>2</sub>. (E) Membrane thickness as a function of CDL concentration from cgMD simulations. (F) ΔG<sub>thick</sub> of the SC as a function of membrane thickness based on cgMD simulations. (G) Membrane curvature around the SCI/III2 (left), CI (middle), and CIII<sub>2</sub> (right) from atomistic simulations. (H) Squared membrane curvature obtained from cgMD simulations, within a 20 nm radius around the center of the system. These maps correspond to the curvature field used in the calculation of the bending deformation energy term (G<sub>curv</sub>).

      Fig. S6. Analysis of lipid end-to-end distance from aMD simulations of (A) SC, (B) CI, (C) CIII<sub>2</sub>.
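
A potential of mean force derived from a sampled distribution, as in Fig. S5D, generally follows from Boltzmann inversion, G(h) = -k<sub>B</sub>T ln P(h). The sketch below illustrates that relation only; the binning, the shift of the minimum to zero, the function name, and the toy thickness samples are assumptions and not the authors' protocol. It assumes at least one sample falls inside the histogram range.

```python
import math

def pmf_from_samples(samples, lo, hi, n_bins, kT=1.0):
    """Boltzmann-invert a histogram: G = -kT ln(P / P_max), in units of kT."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in samples:
        if lo <= s < hi:
            counts[min(int((s - lo) / width), n_bins - 1)] += 1
    top = max(counts)
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    # Empty bins were never visited, so their free energy is set to infinity.
    pmf = [(-kT * math.log(c / top)) if c > 0 else math.inf for c in counts]
    return centers, pmf

# Toy thickness samples in Å (made-up numbers): 40 Å is ten times more
# populated than 38 Å or 42 Å.
samples = [38.0] * 10 + [40.0] * 100 + [42.0] * 10
centers, pmf = pmf_from_samples(samples, lo=37.0, hi=43.0, n_bins=3)
print(centers, pmf)
```

A tenfold drop in population corresponds to ln(10) ≈ 2.3 k<sub>B</sub>T, which gives a sense of scale for thickness PMFs of this type.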

      (2) Membrane distortions:

      Membrane distortions can arise due to a mismatch in the area between the upper leaflet and the lower leaflet, especially when a protein is embedded. Authors can carefully choose the numbers to keep the membrane stable.

      We have further clarified in the revised manuscript that the membranes are stable in all simulation setups. When building the simulation setups, we took care that no leaflet had a higher lipid density that could result in artificial displacements. Our results on the local changes in the lipid dynamics and structure around the OXPHOS complexes are independently supported by both our atomistic and coarse-grained simulations, the latter containing significantly larger membranes. Moreover, as discussed in our work, the local membrane distortion is also experimentally supported by cryoEM analysis as well as recent in situ cryoTEM data, showing that the OXPHOS proteins indeed affect the local membrane properties.

      Clarifications/Additions to the main text:

      “We find that the individual OXPHOS complexes, CI and CIII<sub>2</sub>, induce pronounced membrane strain effects, supported both by our aMD (Fig. S2A) and cgMD simulations with a large surrounding membrane (Fig. 2G).”

      “The localization of specific lipids around the membrane proteins, as well as local membrane perturbation effects, are also supported by simulations of other membrane proteins (45, 46), suggesting that the effects could arise from general protein-membrane interactions.”

      "During construction of the simulation setups, it was carefully considered that no leaflet introduced higher lipid densities that could result in artificial displacement effects."

      (3) Strain energy as an entropic effect:

      Please establish that the strain energy (if at all present) can be called an entropic effect.

      We have now better clarified that the SC formation results from combined enthalpic and entropic effects. We apologize that the previous version of the text was unclear in this respect.

      To further probe the involvement of entropic effects, we derived entropic and enthalpic contributions from our lattice model. The model supports that increased strain contributions also alter the entropic contributions, further supporting the coupling between the effects.

      We have also clarified our definition of the effects:

      "The perturbed thickness and alteration in the lipid dynamics lead to an energetic penalty, which can be related to molecular strain effects, as suggested by the changes of both the internal energy of lipid and their interaction with the surroundings (Fig. S2, S5, S6), which are likely to be of enthalpic origin. However, lipid binding to the OXPHOS complex also results in a reduction in the translational and rotational motion of the lipids and quinone (Fig. S8-S9), which could result in entropic changes. The strain effects are therefore likely to arise from a combination of enthalpic and entropic effects."

      (4) Allosteric cross-talk:

      A thorough network analysis (looking at aspects like graph Laplacian, edge weights, eigenvector centrality, changes in characteristic path length, etc.) can be undertaken to establish allostery (see https://doi.org/10.1093/glycob/cwad094, Ruth Nussinov/Ivet Bahar papers).

      We have expanded the network analysis as suggested by the Reviewer. In this regard, we have expanded the analysis by computing the covariance matrix, further supporting that the SC could involve correlated protein dynamics. We observe a prominent change especially with respect to the ligand state of Complex I, indicative of some degree of allostery, while we find that the apo state of Complex I leads to a slight uncoupling of the motion between CI and CIII<sub>2</sub>.

      Additions in the main text:

      In this regard, our graph theoretical analysis (Fig. S11) further indicates that ligand binding to Complex I induces a dynamic crosstalk between NDUFA5 and NDUFA10, consistent with previous work (48, 49), and affecting also the motion of UQCRC2 with respect to its surroundings. Taken together, these effects suggest that the dynamics of CI and CIII<sub>2</sub> show some correlation that could result in allosteric effects, as also indicated based on the cryoEM analysis.
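
Correlated motion of the kind referred to above is commonly quantified by a normalized covariance (dynamic cross-correlation) between the positional fluctuations of two sites. The sketch below shows that standard measure for a single pair; it is an illustration under the assumption of pre-aligned trajectories, with hypothetical names and toy data, and is not the analysis pipeline used in the manuscript.

```python
import math

def cross_correlation(traj_i, traj_j):
    """Normalized covariance of two position time series.

    traj_i, traj_j: equal-length lists of (x, y, z) positions, assumed to be
    already aligned to a common reference frame.
    Returns a value in [-1, 1]: +1 correlated, -1 anti-correlated motion.
    """
    n = len(traj_i)
    mean_i = [sum(p[k] for p in traj_i) / n for k in range(3)]
    mean_j = [sum(p[k] for p in traj_j) / n for k in range(3)]
    cov = var_i = var_j = 0.0
    for p, q in zip(traj_i, traj_j):
        dp = [p[k] - mean_i[k] for k in range(3)]   # fluctuation of site i
        dq = [q[k] - mean_j[k] for k in range(3)]   # fluctuation of site j
        cov += sum(a * b for a, b in zip(dp, dq))
        var_i += sum(a * a for a in dp)
        var_j += sum(b * b for b in dq)
    return cov / math.sqrt(var_i * var_j)

# Toy data: site b moves in step with site a, site c moves against it.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
c = [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
corr_ab = cross_correlation(a, b)
corr_ac = cross_correlation(a, c)
print(corr_ab, corr_ac)
```

Evaluating this quantity for all residue pairs gives the full correlation matrix, which can then be compared between ligand-bound and apo simulations.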

      (5) Lattice model:

      The equation needs to be rationalised. For example, specific interaction (g_i g_j) favours separation (lower energy when i and j are not next to each other), and nonspecific interaction favours proximity. Why is that? Also, the notation for degeneracy in partition function and the notation for lattice point. It is mentioned that the "interaction between the lipids was indirectly accounted for by the "background energy" of the model". If the packing/thinning etc are so important to the molecular simulations, will not the background energy change with changing lipid organising during complex formation?

      We have further expanded the technical discussion of the energy terms in our lattice model.

      For example, specific interaction (g_i g_j) favours separation (lower energy when i and j are not next to each other), and non-specific interaction favours proximity. Why is that?

      "The g<sub>i</sub>g<sub>j</sub> -term assigns a specific energy contribution when the OXPHOS complexes are in adjacent lattice points only in a correct orientation (modeling a specific non-covalent interaction between the complexes such as the Arg29<sup>FB4</sup>-Asp260<sup>C1</sup>/Glu259<sup>C1</sup> interaction between CI and CIII<sub>2</sub>). The d<sub>i</sub>d<sub>j</sub> -term assigns a non-specific interaction for the OXPHOS complexes when they are in adjacent lattice points, but in a "wrong" orientation relative to each other to form a specific interaction. The term introduces a strain into all lattice points surrounding an OXPHOS complex, mimicking the local membrane perturbation effects observed in our molecular simulations.

      This leads to the partition function,

      where w<sub>i</sub> is the degeneracy of the state, modeling that the SC and OXPHOS proteins can reside at any lattice position of the membrane, and where β=1/k<sub>B</sub>T (k<sub>B</sub>, Boltzmann's constant; T, temperature). The probability of a given state i was calculated as,

      with the free energy (G) defined as,
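
The displayed equations are not reproduced in this text. For orientation only, the standard canonical expressions consistent with the symbols defined above (w<sub>i</sub>, β, k<sub>B</sub>, T) are given below; the state energy symbol E<sub>i</sub> and the exact form of the free energy are assumptions and may differ from the manuscript.

```latex
Z = \sum_{i} w_i \, e^{-\beta E_i},
\qquad
p_i = \frac{w_i \, e^{-\beta E_i}}{Z},
\qquad
G = -k_{\mathrm{B}} T \ln Z
```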

      This discussion has been included in the methods sections to ensure that our work remains readable for the biological community studying supercomplexes from biochemical, metabolic, and physiological perspectives.

      (6) This is a minor issue but the paper is poorly organised and can be fixed readily. The figures are not referenced in order. For example, Figure 2G is discussed before discussing Figures 2A-2F (never discussed). Figure S2 is referenced before Figure S1.

      Answer: We thank the Reviewer for pointing this out. The order of the figures was revised.

      Reviewer #3 (Recommendations for the authors):

      A few minor questions/suggestions, not necessarily in the order of importance:

      (1) The discussion of the timescale of simulations is a bit misleading. For example, the discussion cites a timescale of 0.3 ms of CG simulations. The value is actually the sum of multiple CG simulations on the order of 50-75 microseconds. These are already very impressive lengths of CG simulations, there is no need to use the aggregated time to claim even longer time scales.

      We thank the Reviewer for the suggestion on this important clarification. We have now modified the text and tables accordingly:

      "(0.3 ms in cumulative simulation time, 50-75 μs/cgMD simulation)"

      (2) The observation of cardiolipin (CDL) accumulation is interesting. How close are the head groups, relative to the electrostatic screening length at the interface? Should one worry about the potential change of protonation state coupled with the CDL redistribution?

      Answer: We thank the Reviewer for this excellent comment, which has also been on our mind. The CDLs indeed form contacts with various functional groups at the protein interface (as shown in Fig. S13), as well as with bulk ions (sodium) that could tune the pK<sub>a</sub> of the CDLs and result in a protonation change. We have clarified these effects in the revised manuscript:

      "While CDL was modeled here in the singly anionic charged state (but cf. Fig. S5E), we note that the local electrostatic environment could tune its pK<sub>a</sub>, resulting in protonation changes of the lipid, consistent with its function as a proton collecting antenna (62)."

      (3) The authors refer to the membrane strain effect as entropic. Since membrane bending implicates a free energy change that includes both enthalpic and entropic components, I wonder how the authors reached the conclusion that the effect is largely entropic in nature.

      We agree with the Reviewer that the effects are likely to comprise both enthalpic and entropic contributions, which are difficult to separate in practice. To reflect this, we have now better clarified why we consider that both contributions are involved. We apologize that our previous version of the manuscript was unclear in this respect. Clarifications in the main text:

      “The perturbed thickness and alteration in the lipid dynamics lead to an energetic penalty, which can be related to molecular strain effects, as suggested by the changes of both the internal energy of lipid and their interaction with the surroundings (Fig. S2, S5, S6), which are likely to be of enthalpic origin. However, lipid binding to the OXPHOS complex also results in a reduction in the translational and rotational motion of the lipids and quinone (Fig. S8-S9), which could result in entropic changes. The strain effects are therefore likely to arise from a combination of enthalpic and entropic effects.”

      (4) The authors refer to the computed dielectric constant as epsilon_perpendicular. Did the authors really distinguish the parallel and perpendicular component of the dielectric tensor, as was done by, for example, R. Netz and co-workers for planar surfaces?

      We have extracted the perpendicular dielectric constant from the total dielectric profiles. We clarify that this differs from the formal definition by Netz and coworkers.

      “The calculations were performed by averaging the total M over fixed z values from the membrane plane. Note that this treatment differs from extraction of radial and axial contributions of the dielectric tensor, as developed by Netz and co-workers (cf. Ref. (3) and refs therein) that requires a more elaborate treatment, which is outside the scope of the present work.”

      (3) P. Loche, C. Ayaz, A. Schlaich, Y. Uematsu, R.R. Netz. Giant Axial Dielectric Response in Water-Filled Nanotubes and Effective Electrostatic Ion-Ion Interactions from a Tensorial Dielectric Model. J Phys Chem B 123, 10850-10857 (2019).

      (5) Regarding the effect of SC formation on protein structure and dynamics, especially allosteric effects, most of the discussions are rather qualitative in nature. More quantitative analysis would be valuable. For example, the authors did compute covariance matrix although it appears that they chose not to discuss the results in depth. Is the convergence of concern and therefore no thorough discussion is given?

      We have now expanded the analysis by computing the covariance matrix, further supporting that the SC could involve correlated protein dynamics. We observe a prominent change, especially with respect to the ligand state of Complex I, indicative of some degree of allostery, while we find that the apo state of Complex I leads to a slight uncoupling of the motion between CI and CIII<sub>2</sub>.

      Additions in the main text:

      “In this regard, our graph theoretical analysis (Fig. S11) further indicates that ligand binding to Complex I induces a dynamic crosstalk between NDUFA5 and NDUFA10, consistent with previous work (48, 49), and affecting also the motion of UQCRC2 with respect to its surroundings. Taken together, these effects suggest that the dynamics of CI and CIII<sub>2</sub> show some correlation that could result in allosteric effects, as also indicated based on the cryoEM analysis (40).”

      (6) The discussion of quinone diffusion is interesting, although I'm a bit intrigued by the unit of the diffusion constant cited in the discussion. Perhaps a simple typo?

      The plot showed the molecular velocity, which roughly corresponds to the residence times. However, as suggested by the Reviewer, we have now also analyzed the position-dependent diffusion of the quinone in the new Figure S9:

    1. Reviewer #3 (Public review):

      Summary:

      Lmx1a is an orthologue of apterous in flies, which is important for dorsal-ventral border formation in the wing disc. Previously, this research group has described the importance of the chicken Lmx1b in establishing the boundary between sensory and non-sensory domains in the chicken inner ear. Here, the authors described a series of cellular changes during border formation in the chicken inner ear, including alignment of cells at the apical border and concomitant constriction basally. The authors extended these observations to the mouse inner ear and showed that these morphological changes occurred at the border of Lmx1a positive and negative regions, and these changes failed to develop in Lmx1a mutants. Furthermore, the authors demonstrated that the ROCK-dependent actomyosin contractility is important for this border formation and blocking ROCK function affected epithelial basal constriction and border formation in both in vitro and in vivo systems.

      Strengths:

      The morphological changes described during border formation in the developing inner ear are interesting. Linking these changes to the function of Lmx1a and ROCK-dependent actomyosin contractile function is provocative.

      Weaknesses:

      There are several outstanding issues that need to be clarified before one can pin down the observed morphological changes as being causal to border formation and conclude that Lmx1a and ROCK are involved.

      Comments on the latest version:

      The revised manuscript has provided clarity of their results on some levels, but unfortunately, the basal restriction during border formation remains unclear and the study did not advance the understanding of the role of Lmx1a in boundary formation. Overall comments are indicated below:

      (1) The authors state in the rebuttal, "we do not think that ROCK activity is required for the formation or maintenance of the basal constriction at the interface of Lmx1a-expressing and non-expressing cells". If the above is the sentiment of the authors, then the manuscript is not written to support this sentiment clearly, starting with this misleading sentence in the Abstract, "The boundary domain is absent in Lmx1a-deficient mice, which exhibit defects in sensory organ segregation, and is disrupted by the inhibition of ROCK-dependent actomyosin contractility."

      (2) As acknowledged by the authors, the data as they currently stand could be explained by Lmx1a functioning in specifying the non-sensory fate and may not function directly in boundary formation. With this caveat in mind, the role of Lmx1a in boundary formation remains unclear.

      (3) I feel like the word "orchestrate" in the title is an overstatement.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This manuscript investigated the mechanism underlying boundary formation necessary for proper separation of vestibular sensory end organs. In both chick and mouse embryos, it was shown that a population of cells abutting the sensory (marked by high Sox2 expression)/nonsensory cell populations (marked by Lmx1a expression) undergoes apical expansion, elongation, alignment and basal constriction to separate the lateral crista (LC) from the utricle. Using an Lmx1a mouse mutant, organ cultures, pharmacological and viral-mediated Rock inhibition, it was demonstrated that the Lmx1a transcription factor and Rock-mediated actomyosin contractility are required for boundary formation and LC-utricle separation.

      Strengths:

      Overall, the morphometric analyses were done rigorously and revealed novel boundary cell behaviors. The requirement of Lmx1a and Rock activity in boundary formation was convincingly demonstrated.

      Weaknesses:

      However, the precise roles of Lmx1a and Rock in regulating cell behaviors during boundary formation were not clearly fleshed out. For example, phenotypic analysis of Lmx1a was rather cursory; it is unclear how Lmx1a, expressed in half of the boundary domain, controls boundary cell behaviors and prevents cell mixing between Lmx1a+ and Lmx1a- compartments. Well-established mechanisms and molecules for boundary formation were not investigated (e.g. differential adhesion via cadherins, cell repulsion via ephrin-Eph signaling). Moreover, within the boundary domain, it is unclear whether apical multicellular rosettes and basal constrictions are drivers of boundary formation, as the boundary can still form when these cell behaviors were inhibited. Involvement of other cell behaviors, such as radial cell intercalation and oriented cell division, also warrants consideration. With these lingering questions, the mechanistic advance of the present study is somewhat incremental.

      We have acknowledged the lingering questions this referee points out in our Discussion and agree that the roles of differential cell adhesion and cell intercalation would be worth exploring in further studies. Despite these remaining questions, the conceptual advances are significant, since this study provides the first evidence that a tissue boundary forms in between segregating sensory organs in the inner ear (there are only a handful of embryonic tissues in which a tissue boundary has been found in vertebrates) and highlights the evolutionary conservation of this process. This work also provides a strong descriptive basis for any future study investigating the mechanisms of tissue boundary formation in the mouse and chicken embryonic inner ear. 

      Reviewer #2 (Public review):

      Summary:

      Chen et al. describe the mechanisms that separate the common pan-sensory progenitor region into individual sensory patches, which presage the formation of the sensory epithelium in each of the inner ear organs. By focusing on the separation of the anterior and then lateral cristae, they find that long supra-cellular cables form at the interface of the pansensory domain and the forming cristae. They find that at these interfaces, the cells have a larger apical surface area, due to basal constriction, and Sox2 is down-regulated. Through analysis of Lmx1 mutants, the authors suggest that while Lmx1 is necessary for the complete segregation of the sensory organs, it is likely not necessary for the initial boundary formation, and the down-regulation of Sox2.

      Strengths:

      The manuscript adds to our knowledge and provides valuable mechanistic insight into sensory organ segregation. Of particular interest are the cell biological mechanisms: The authors show that contractility directed by ROCK is important for the maintenance of the boundary and segregation of sensory organs.

      Weaknesses:

      The manuscript would benefit from a more in-depth look at contractility - the current images of PMLC are not too convincing. Can the authors look at p or ppMLC expression in an apical view? Are they expressed in the boundary along the actin cables? Does Y-27632 inhibit this expression?

      The authors suggest that one role for ROCK is the basal constriction. I was a little confused about basal constriction. Are these the initial steps in the thinning of the intervening nonsensory regions between the sensory organs? What happens to the basally constricted cells as this process continues?

      In our hands, the PMLC immunostaining gave a punctate staining in epithelial cells and was difficult to image and interpret in whole-mount preparations, which did not allow us to investigate its specific association with the actin-cable-like structures. It is a very valuable suggestion to try alternative methods of fixation to improve the quality of the staining and images in future work.

      The basal constriction of the cells at the border of the sensory organs was not always clearly visible in freshly-fixed samples, and was absent in the majority of short-term organotypic cultures in control medium, which made it impossible to ascertain the role of ROCK in its formation using pharmacological approaches in vitro (see Figure 7 and corresponding Results section). On the other hand, the overexpression of a dominant-negative form of ROCK (RCII-GFP) in ovo using RCAS revealed a persistence of basal constriction in transfected cells despite a disorganisation of the boundary domain (Figure 8). We conclude from these experiments that ROCK activity is not necessary for the formation and maintenance of the basal constriction. We also remain uncertain about the exact role of this basal constriction. It could be either a cause or consequence of the expansion of the apical surface of cells in the boundary domain, it could contribute to the limitation of cell intermingling and the formation of the actin-cable-like structure at the interface of Lmx1a-expressing and non-expressing cells, and it may indeed prefigure some of the further changes in cell morphology occurring in non-sensory domains separating the sensory organs (cell flattening and constrictions of the epithelial walls in between sensory organs).

      The steps the authors explore happen after boundaries are established. This correlates with a down-regulation of Sox2, and the formation of a boundary. What is known about the expression of molecules that may underlie the apparent interfacial tension at the boundaries? Is there any evidence for differential adhesion or for Eph-Ephrin signalling? Is there a role for Notch signalling or a role for Jag1 as detailed in the group's 2017 paper?

      Great questions. It is indeed likely that some form of differential cell tension and/or adhesion participates in the formation and maintenance of this boundary, and we have mentioned in the discussion some of the usual suspects (cadherins, Eph/ephrin signalling, …), although it is beyond the scope of this paper to determine their roles in this context.

      As we have discussed in this paper and in our 2017 study (see also Ma and Zhang, Development, 2015 Feb 15;142(4):763-73. doi: 10.1242/dev.113662), we believe that Notch signalling maintains prosensory character, and its down-regulation by Lmx1a/b expression is required for the specification of the non-sensory domains in between segregating sensory organs. Although we have not tested this directly in this study, any disruption in Notch signalling would be expected to indirectly affect the formation or maintenance of the boundary domain.

      A comment on whether cellular intercalation/rearrangements may underlie some of the observed tissue changes.

      We have not addressed this topic directly in the present study but we have included a brief comment on the potential implication of cellular intercalation and rearrangements in the discussion: “It is also possible that the repositioning of cells through medial intercalation could contribute to the straightening of the boundary as well as the widening of the nonsensory territories in between sensory patches.”

      The change in the long axis appears to correlate with the expression of Lmx1a (Fig 5d). The authors could discuss this more. Are these changes associated with altered PCP/Vangl2 expression?

      We are not sure about the first point raised by the referee. We have quantified cell elongation and orientation in Lmx1a-GFP heterozygous and homozygous (null) mice, and our results suggest that the elongation of the cells occurs throughout the boundary domain, and is probably not dependent on Lmx1a expression (boundary cells are in fact more elongated in the Lmx1a mutant).  We have not investigated the expression of components of the planar cell polarity pathway. This is a very interesting suggestion, worth exploring in further studies.

      Reviewer #3 (Public review):

      Summary:

      Lmx1a is an orthologue of apterous in flies, which is important for dorsal-ventral border formation in the wing disc. Previously, this research group has described the importance of the chicken Lmx1b in establishing the boundary between sensory and non-sensory domains in the chicken inner ear. Here, the authors described a series of cellular changes during border formation in the chicken inner ear, including alignment of cells at the apical border and concomitant constriction basally. The authors extended these observations to the mouse inner ear and showed that these morphological changes occurred at the border of Lmx1a positive and negative regions, and these changes failed to develop in Lmx1a mutants. Furthermore, the authors demonstrated that the ROCK-dependent actomyosin contractility is important for this border formation and blocking ROCK function affected epithelial basal constriction and border formation in both in vitro and in vivo systems.

      Strengths:

      The morphological changes described during border formation in the developing inner ear are interesting. Linking these changes to the function of Lmx1a and ROCK-dependent actomyosin contractile function is provocative.

      Weaknesses:

      There are several outstanding issues that need to be clarified before one could pin down the observed morphological changes as being causal to border formation and conclude that Lmx1a and ROCK are involved.

      We have addressed the specific comments and suggestions of the reviewer below. We wish, however, to point out that we do not think that ROCK activity is required for the formation or maintenance of the basal constriction at the interface of Lmx1a-expressing and non-expressing cells (see previous answer to referee #2).

      Reviewer #1 (Recommendations for the authors):

      Specific comments:

      (1) Figures 1 and 2, and related text. Based on the whole-mount images shown, the anterior otocyst appeared to be a stratified epithelium with multiple cell layers. If so, it should be clarified whether the x-y views in the "apical" and "basal" planes are from cells residing in the apical and basal layers, respectively. Moreover, it would be helpful to include a "stage 4", a later stage to show if and when basal constrictions resolve.

      In fact, at these early stages of development, the otic epithelium is “pseudostratified”: it is formed by a single layer of irregularly shaped cells, each extending from the base to the apical aspect of the epithelium, but with their nuclei residing at distinct positions along this basal-apical axis as mitotic cells progress through the cell cycle. The nuclei divide at the surface of the epithelium, then move back to the most basal planes within daughter cells during interphase. This process, known as interkinetic nuclear migration, has been well described in the embryonic neural tube and occurs throughout the developing otic epithelium (e.g. Orr, Dev Biol. 1975, 47, 325-340; Ohta et al., Dev Biol. 2010 Sep 15;347(2):369–381. doi: 10.1016/j.ydbio.2010.09.002). Consequently, the nuclei visible in apical or basal planes in x-y views belong to cells extending from the base to the apex of the epithelium, but which are at different stages of the cell cycle.

      We have not included a late stage of sensory organ segregation in this study (apart from a P0 stage in the mouse inner ear, see Figure 4) since data about later stages of sensory organ morphogenesis are available in other studies, including our Mann et al. eLife 2017 paper describing Lmx1a-GFP expression in the embryonic mouse inner ear.

      (2) Related to above, the observed changes in cell organization raised the possibility that the apical multicellular rosettes and basal constrictions observed in Stage 3 (and 2) could be intermediates of radial cell intercalations, which would lead to expansion of the space between sensory organs and thinning of the boundary domains. To see if it might be happening, it would be helpful to include DAPI staining to show the overall tissue architecture at different stages and use optical reconstruction to assess the thickness of the epithelium in the presumptive boundary domain over time.

      We agree with this referee. Besides cell addition by proliferation and/or changes in cell morphology, radial cell intercalations could indeed contribute to the spatial segregation of inner ear sensory organs (a brief statement on this possibility was added to the Discussion). It is clear from images shown in Figure 4 (and from other studies) that the non-sensory domain separating the cristae from the utricle gets flatter and its cells also enlarge as development proceeds. We do not think that DAPI staining is required to demonstrate this. Perhaps the best way to show that radial cell intercalations occur would be to perform live imaging of the otic epithelium, but this is technically challenging in the mouse or chicken inner ear. An alternative model system might be the zebrafish inner ear, in which some live-imaging data have shown a progressive down-regulation of Jag1 expression during sensory organ segregation (and a flattening of “boundary domains”), suggesting a conservation of the basic mechanisms at play (Ma and Zhang, Development, 2015 Feb 15;142(4):763-73. doi: 10.1242/dev.113662).

      (3) Similarly, it would be helpful to include the DAPI counterstain in Figures 4, 7, and 8 to show the overall tissue architecture.

      We do not have DAPI staining for these particular images but in most cases, Sox2 immunostaining gives a decent indication of tissue morphology. 

      (4) Figure 2(z) and Figure 4d. The arrows pointing at the basal constrictions are obstructing the view of the basement membrane area, making it difficult to appreciate the morphological changes. They should be moved to the side. Can the authors comment whether they saw evidence for radial intercalations (e.g. thinning of the boundary domain) or partial unzippering of adjoining compartments along the basal constrictions?

      The arrows in Figure 2(z) and Figure 4d have been moved to the side of the panels. 

      See previous comment. Besides the presence of multicellular rosettes, we have not seen direct evidence of radial cell intercalation – this would be best investigated using live imaging. As development proceeds, the epithelial domain separating adjoining sensory organs becomes wider. The cells that compose it gradually enlarge and flatten, as can be seen for example at P0 in the mouse inner ear (Figure 4g).

      (5) Figures 3 and 5, and related text. It should be clarified whether the measurements were all taken from the surface cells. For Fig. 3e and 5d, the mean alignment angles of the cell long axis in the boundary regions should be provided in the text.

      The sensory epithelium in the otocyst is pseudostratified, hence, the measurement was taken from the surface of all epithelial cells labelled with F-actin. 

      We have added histograms representing the angular distribution of the cell long axis orientations in the boundary region to Figure 3 and Figure 5 Supplementary 1. We believe that this type of representation is more informative than the numerical value of the mean alignment angles of the cell long axis for defined sub-domains. 
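      The contrast drawn above between a single mean alignment angle and a full angular distribution can be illustrated with a minimal sketch. It is not the analysis code used in the study; the function names, the 30-degree bin width, and the sample angles are all assumptions made for illustration. Cell long-axis orientations are axial data (0° and 180° describe the same axis), so the usual approach is to double the angles before averaging and to report the resultant length alongside the mean.

```python
import math

def axial_mean_deg(angles_deg):
    """Return (mean orientation in [0, 180), resultant length in [0, 1]).

    Hypothetical helper: angles are long-axis orientations in degrees.
    """
    # Double each angle so that theta and theta + 180 map to the same direction.
    s = sum(math.sin(math.radians(2 * a)) for a in angles_deg)
    c = sum(math.cos(math.radians(2 * a)) for a in angles_deg)
    # Resultant length: near 1 means tightly aligned, near 0 means no single preferred axis.
    r = math.hypot(s, c) / len(angles_deg)
    # Halve the mean of the doubled angles to return to orientation space.
    mean = (math.degrees(math.atan2(s, c)) / 2) % 180
    return mean, r

def orientation_histogram(angles_deg, bin_width=30):
    """Count orientations in equal-width bins covering [0, 180)."""
    counts = [0] * (180 // bin_width)
    for a in angles_deg:
        counts[int((a % 180) // bin_width)] += 1
    return counts

# Illustrative values only, not measurements from the manuscript.
aligned = [170, 175, 5, 10]   # clustered around the 0/180 axis
bimodal = [0, 90, 0, 90]      # two perpendicular populations

mean_aligned, r_aligned = axial_mean_deg(aligned)
mean_bimodal, r_bimodal = axial_mean_deg(bimodal)
hist_bimodal = orientation_histogram(bimodal)
```

      In the bimodal sample the resultant length is close to zero, so the mean angle carries little information, whereas the histogram still shows the two populations. This is one situation in which a distribution plot is more informative than a mean value.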

      (6) It would be helpful to also quantify basal constrictions using the cell skeleton analysis. In addition, it would be helpful to show x-y views of cell morphology at the level of basal constrictions in the mouse tissue, similar to the chick otocyst shown in Figure 2.

      The data that we have collected do not allow a precise quantification of basal constrictions with cell skeleton analysis, due to the generally fuzzy nature of F-actin staining in the basal planes of the epithelium. However, we have followed the referee’s advice and analysed F-actin staining in x-y views in the Lmx1a-GFP knock-in (heterozygous) mice. We found that the first signs of basal F-actin enrichment and multicellular actin-cable-like structures at the interface of Lmx1a-positive and negative cells are visible at E11.5, and F-actin staining in the basal planes increases in intensity and extent at E13.5 (shown in new Figure 4 – Supplementary Figure 1).

      (7) Figure 5 and related text. It would be informative to analyze Lmx1a mutants at early stages (E11-E13) to pinpoint cell behavior defects during boundary formation.

      We chose the E15 stage because it is one at which we can unequivocally recognize and easily image and analyse the boundary domain from a cytoarchitectural point of view. We recognize that it would have been worth including earlier stages in this analysis but have not been able to perform these additional studies due to time constraints and unavailability of biological material. 

      (8) Figure 5-Figure S1, the quantifications suggest that Lmx1a loss had both cell-autonomous and non-autonomous effects on boundary cell behaviors. This is an interesting finding, and its implication should be discussed.

      It is well-known that the absence of Lmx1a function induces a very complex (and variable) phenotype in terms of inner ear morphology and patterning defects. It is also clear from this study that the absence of Lmx1 causes non-cell autonomous defects in the boundary domain and we have already mentioned this in the discussion: “Finally, the patterning abnormalities in Lmx1a<sup>GFP/GFP</sup> samples occurred in both GFP-positive and negative territories, which points at some type of interaction between Lmx1a-expressing and non-expressing cells, and the possibility that the boundary domain is also a signalling centre influencing the differentiation of adjacent territories.”

      (9) Figure 6 and related text. To correlate myosin II activity with boundary cell behaviors, it would be important to immunolocalize pMLC in the boundary domain in whole-mount otocyst preparations from stage 1 to stage 3.

      We tried to perform the suggested immunostaining experiments, but in our hands at least, the antibody used did not produce good quality staining in whole-mount preparations. We have therefore included images of sectioned otic tissue, which show some enrichment in pMLC immunostaining at the interface of segregating organs (Figure 6).

      (10) Figures 7 and 8. A caveat of long-term Rock inhibition is that it can affect cell proliferation and differentiation of both sensory and non-sensory cells, which would cause secondary effects on boundary formation. This caveat was not adequately addressed. For example, does Rock signaling control either the rate or the orientation of cell division to promote boundary formation? Together with the mild effect of acute Rock inhibition, the precise role of Rock signaling in boundary formation remains unclear.

      We absolutely agree that the exact function of ROCK could not be ascertained in the in vitro experiments, for the reasons we have highlighted in the manuscript (no clear effect in short term treatments, great level of tissue disorganisation in long-term treatments). This prompted us to turn to an in ovo approach. The picture remains uncertain in relation to the role of ROCK in regulating cell division/intercalation but we have been at least able to show a requirement for the maintenance of an organized and regular boundary. 

      (11) Figure 8. RCII-GFP likely also has non-autonomous effects on cell apical surface area. In 8d, it would be informative to include cell area quantifications of the GFP control for comparison.

      It is possible that some non-autonomous effects are produced by RCII-GFP expression, but these were not the focus of the present study and are not particularly relevant in the context of large patches of overexpression, as obtained with RCAS vectors. 

      We have added cell surface area quantifications of the control RCAS-GFP construct for comparison (Figure 8e).

      (12) The significance of the presence of cell divisions shown in Figure 9 is unclear. It would be informative to include some additional analysis, such as a) quantify orientation of cell divisions in and around the boundary domain and b) determine whether patterns of cell division in the sensory and nonsensory regions are disrupted in Lmx1a mutants.

      These are indeed fascinating questions, but they would require considerable work to answer and are beyond the scope of this paper.

      Minor comments:

      (1) Figure 1. It should be clarified whether e', h' and k' are showing cortical F-actin of surface cells. Do the arrowheads in i' and l' correspond to the position of either of the arrowheads in h' and k', respectively?

      The epithelium in the otocyst is pseudostratified. Therefore, images e’, h’, k’ display F-actin labelling on the surface of tissue composed of a single cell layer. We have added arrows to images e”, h”, and k” to indicate the corresponding position of z-projections and included appropriate explanation in the legend of Figure 1: “Black arrows on the side of images e”, h”, and k” indicate the corresponding position of z-projections.”

      (2) Figure 3-Figure S1. Please mark the orientation of the images shown.

      We labelled the sensory organs in the figure to allow for recognizing the orientation. 

      (3) Figure 4. Orthogonal reconstructions should be labeled (z) to be consistent with other figures.

      We have corrected the labelling in the orthogonal reconstruction to (z). 

      (4) Figure 4g. It is not clear what is in the dark area between the two bands of Lmx1a+ cells next to the utricle and the LC. Are those cells Lmx1a negative? It is unclear whether a second boundary domain formed or the original boundary domain split into two between E15 and P0? Showing the E15 control tissue from Figure 5 would be more informative than P0.

      In this particular sample there seems to be a folding of the tissue (visible in z-reconstructions) that could affect the appearance of the projection shown in 4g. We believe the P0 is a valuable addition to the E15 data, showing a slightly later stage in the development of the vestibular organs.

      (5) Figure 5a, e. Magnified regions shown in b and f should be boxed correspondingly.

      This figure has been revised. We realized that the previous low-magnification shown in (e) (now h) was from a different sample than the one shown in the high-magnification view. The new figure now includes the right low-magnification sample (in h) and the regions shown in the high-magnification views have been boxed.

      (6) Figure 8f, h, j. Magnified regions shown in g, i and k should be boxed correspondingly.

      The magnified regions were boxed in Figure 8 f, h, and j. Additionally, black arrows have been placed next to images 8g", 8i", and 8k" to highlight the positions of the z-projections. An appropriate explanation has also been added to the figure legend.

      (9) Figure 8. It would be helpful to show merged images of GFP and F-actin, to better appreciate cell morphology of GFP+ and GFP- cells.

      As requested, we have added images showing overlap of GFP and F-actin channels in Figure 8.

      Reviewer #2 (Recommendations for the authors):

      The PMLC staining could be improved. Two decent antibodies are the p-MLC and pp-MLC antibodies from CST. pp-MLC works very well after TCA fixation as detailed in https://www.researchsquare.com/article/rs-2508957/latest . As phalloidin does not work well after TCA fixation, afadin works very well for segmenting cells.

      If the authors do not wish to repeat the pMLC staining, the details of the antibody used should be mentioned.

      We used mouse IgG1 Phospho-Myosin Light Chain 2 (Ser19) from Cell Signaling Technology (catalogue number #3675) in our immunohistochemistry for PMLC. This is one of the two antibodies recommended by reviewer #2. Information about this antibody has now been included in the Materials and Methods. This antibody has been referenced in many manuscripts, but unfortunately, in our hands at least, it did not perform well in whole-mount preparations.

      A statement on the availability of the data should be included.

      We have included a statement on the data availability: “All data generated or analysed during this study is available upon request.”

      Reviewer #3 (Recommendations for the authors):

      Outstanding issues:

      (1) Morphological description: The apical alignment of epithelial cells at the border is clear but not the upward pull of the basal lamina. Very often, it seems to be the Sox2 staining that shows the upward pull better than the F-actin staining. Perhaps, adding an anti-laminin staining to indicate the basement membrane may help.

      Indeed, the upward pull of the basement membrane is not always very clear. We performed some anti-laminin immunostaining on mouse cryosections and provide below (Figure 1) an example of such an experiment. The results appear to confirm an upward displacement of the basement membrane in the region separating the lateral crista from the utricle in the E13 mouse inner ear, but given the preliminary nature of these experiments, we believe that these results do not warrant inclusion in the manuscript. The term “pull” somewhat implies that the epithelial cells are responsible for the upward movement of the basement membrane, but since we do not have direct evidence that this is the case, we have replaced “pull” by “displacement” throughout the text.

      (2) It is not clear how well the cellular changes are correlated with the timing of border formation as some of the ages shown in the study seem to be well after the sensory patches were separated and the border was established.

      For some experiments (for example E15 in the comparison of mouse Lmx1a-GFP heterozygous and homozygous inner ear tissue; E6 for the RCAS experiments), the early stages of boundary formation are not covered because we decided to focus our analysis on the late consequences of manipulating Lmx1a/ROCK activity in terms of sensory organ segregation. The dataset is more comprehensive for the control developmental series in the chicken and mouse inner ear. 

      (3) The Lmx1a data, as they currently stand could be explained by Lmx1a being required for non-sensory development and not necessarily border formation. Additionally, the relationship between ROCK and Lmx1a was not investigated. Since the investigators have established the molecular mechanisms of Lmx1 function using the chicken system previously, the authors could try to correlate the morphological events described here with the molecular evidence for Lmx1 functioning during border formation in the same chicken system. Right now, only the expression of Sox2 is used to correlate with the cellular events, and not Lmx1, Jag1 or notch.

      These are valid points. Exploring in detail the epistatic relationships between Notch signalling/Lmx1a/ROCK/boundary formation in the chicken model would indeed be very interesting but would require extensive work using both gain and loss-of-function approaches, combined with the analysis of multiple markers (Jag1/Sox2/Lmx1b/PMLC/F-actin, etc.). At this point, and in agreement with the referee’s comment, we believe that Lmx1a is above all required for the adoption of the non-sensory fate. The loss of Lmx1a function in the mouse inner ear produces defects in the patterning and cellular features of the boundary domain, but these may be late consequences of the abnormal differentiation of the non-sensory domains that separate sensory organs. Furthermore, ROCK activity does not appear to be required for Sox2 expression (i.e. adoption or maintenance of the sensory fate) since the overexpression of RCII-GFP does not prevent Sox2 expression in the chicken inner ear. This fits with a model in which Notch/Lmx1a regulate cell differentiation whilst ROCK acts independently or downstream of these factors during boundary formation.

      Specific comments:

      (1) Figure 1. The downregulation of Sox2 is consistent between panels h and k, but not between panels e and h. The orthogonal sections showing basal constriction in h' and k' are not clear.

      The downregulation is noticeable along the lower edge of the crista shown in h; the region selected for the high-magnification view sits at an intermediate level of segregation (and Sox2 downregulation). 

      The basal constriction is not very clear in h, but becomes easier to visualize in k. We have displaced the arrow pointing at the constriction, which hopefully helps. 

      (2) Figure 2. Where was the Z axis taken from? One seems to be able to imagine the basal constriction better in the anti-Sox2 panel than the F-actin panel. A stain outlining the basement membrane better could help.

      Arrows have been added on the side of the horizontal views to mark the location of the z-reconstruction. See our previous replies to comments addressing the upward displacement of the basement membrane.

      (3) Figure 4

      I question the ROI being chosen in this figure, which seems to be in the middle of a triad between LC, prosensory/utricle and the AC, rather than between AC and LC. If so, please revise the title of the figure. This could also account for the better evidence of the apical alignment in the upper part of the f panel.

      We have corrected the text. 

      In this figure, the basal constriction is a little clearer in the orthogonal cuts, but it is not clear where these sections were taken from.

      We have added black arrows next to images 4c’, 4f’, and 4i’ to indicate the positions of the z-projections.

      By E13.5, the LC is a separate entity from the utricle, it makes one wonder how well the basal constriction is correlated with border formation. The apical alignment is also present by P0, which raises the question that the apical alignment and basal restriction may be more correlated with differentiation of non-sensory tissue rather than associated with border formation.

      We agree E13.5 is a relatively late stage, and the basal constriction was not always very pronounced. The new data included in the revised version include images of basal planes of the boundary domain at E11.5, which reveal F-actin enrichment and the formation of an actin-cable-like structure (Figure 4 – Suppl. Fig 1). Furthermore, the chicken dataset shows that the changes in cell size, alignment, and the formation of an actin-cable-like structure precede sensory patch segregation and are visible when Sox2 expression starts to be downregulated in prospective non-sensory tissue (Figure 1, Figure 2). Considering the results from both species, we conclude that these localised cellular changes occur relatively early in the sequence of events leading to sensory patch segregation, as opposed to being a late consequence of the differentiation of the non-sensory territories.

      I don't follow the (x) cuts for panels h and I, as to where they were taken from and why there seems to be an epithelial curvature and what it was supposed to represent.

      We have added black arrows next to the panels 4c’, 4f’, and 4i’ to indicate the positions of the z-projections and modified the legend accordingly. The epithelial curvature is probably due to the folding of the tissue bordering the sensory organs during the manipulation/mounting of the tissue for imaging.

      (4) Figure 5 The control images do not show the apical alignment and the basal constriction well. This could be because of the age of choice, E15, was a little late. Unfortunately, the unclarity of the control results makes it difficult for illustrating the lack of cellular changes in the mutant. The only take-home message that one could extract from this figure is a mild mixing of Sox2 and Lmx1a-Gfp cells in the mutant and not much else. Also, please indicate the level where (x) was taken from.

      Black arrows have been placed next to images 5e and 5l to highlight the positions of the z-projections. The stage E15 chosen for analysis was appropriate to compare the boundary domains once segregation is normally completed. We believe the results show some differences in the cellular features of the boundary domain in the Lmx1a-null mouse, and we have in fact quantified this using Epitool in Figure 5 – Suppl. Fig 1. Cells are more elongated and better aligned in the Lmx1a-null than in the heterozygous samples.

      (5) Figure 7. I think the cellular disruption caused by the ROCK inhibitor, shown in q', is too severe to be able to pin to a specific effect of ROCK on border formation. In that regard, the ectopic expression of the dominant negative form of ROCK using RCAS approach is better, even though because it is a replication competent form of RCAS, it is still difficult to correlate infected cells to functional disruption.

      We used a replication-competent construct to induce a large patch of infection, increasing our chances of observing a defect in sensory organ segregation and boundary formation. We agree that this approach does not allow us to control the timing of overexpression, but the mosaicism in gene expression, allowing us to compare in the same tissue large regions with/without perturbed ROCK activity, proved more informative than the pharmacological/in vitro experiments.

      (6) Figure 8. Outline the ROI of i in h, and k in j. Outline in k the comparable region in k'. In k", F-actin staining is not uniform. Indicate where (x) was taken from in K.

      The magnified regions have been boxed in Figure 8 f, h, and j. The region outlined in panels k’-k” has also been outlined in the corresponding region of panel k. Additionally, black arrows have been placed next to images 8g", 8i", and 8k" to highlight the positions of the z-projections. An appropriate explanation has also been added to the figure legend.

      Minor comments:

      (1) P.18, 1st paragraph, extra bracket at the end of the paragraph.

      Bracket removed

      (2) P.22, line 11, in ovo may be better than in vivo in this case.

      We agree, this has been corrected. 

      (3) P.25, be consistent whether it is GFP or EGFP.

      Corrected to GFP.

      (4) P.26, line 5. Typo on "an"

      Corrected to “and”

      Author response image 1.

      Expression of Laminin and Sox2 in the E13 mouse inner ear. a-a’’’) Low magnification view of the utricle, the lateral crista, and the non-sensory (Sox2-negative) domain separating these. Laminin staining is detected at relatively high levels in the basement membrane underneath the sensory patches. At higher magnification (b-b’’’), an upward displacement of the basement membrane (arrow) is visible in the region of reduced Sox2 expression, corresponding to the “boundary domain” (bracket). 

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Strengths: 

      Sarpaning et al. provide a thorough characterization of putative Rnt1 cleavage of mRNA in S. cerevisiae. Previous studies have discovered Rnt1 mRNA substrates anecdotally, and this global characterization expands the known collection of putative Rnt1 cleavage sites. The study is comprehensive, with several types of controls to show that Rnt1 is required for several of these cleavages.

      Weaknesses: 

      (1) Formally speaking, the authors do not show a direct role of Rnt1 in mRNA cleavage - no studies were done (e.g., CLIP-seq or similar) to define direct binding sites. Is the mutant Rnt1 expected to trap substrates? Without direct binding studies, the authors rely on genetics and structure predictions for their argument, and it remains possible that a subset of these sites is an indirect consequence of rnt1. This aspect should be addressed in the discussion.

      We have added to this point in the discussion, as requested. We do not, however, agree that CLIP-seq or other methods are needed to address this point, or would even be helpful in addressing the question the reviewer raises.

      Importantly, we show that recombinant Rnt1 purified from E. coli cleaves the same sites as those mapped in vivo. This does provide direct evidence that Rnt1 directly binds those RNAs. Furthermore, it shows that it can bind these RNAs without the need for other proteins. Our observation that many mRNAs are cleaved at -14 and +16 positions from NGNN stem loops to leave 2-nt 3’ overhangs provides further support that these are the products of an RNase III enzyme, and Rnt1 is the only family member in yeast. Thus, we disagree with the reviewer that our studies do not show direct targeting.

      CLIP-seq experiments would be valuable, but they would address a different point. CLIP-seq measures protein binding to RNA targets, and it is likely that Rnt1 binds some RNAs without cleaving them. In addition, only a transient interaction is needed for cleavage, and such transient interactions might not be readily detected by CLIP-seq. Thus, CLIP-seq would reveal the RNAs bound by Rnt1, but would not help identify which ones are cleaved. Catala et al (2004) showed that the catalytically inactive mutant of Rnt1 carries out some functions that are important for the cell cycle. The CLIP-seq studies would be valuable to determine these non-catalytic roles of Rnt1, but we consider those questions beyond the scope of the current study.

      (2) The comprehensive list of putative Rnt1 mRNA cleavage sites is interesting insofar as it expands the repertoire of Rnt1 on mRNAs, but the functional relevance of the majority of these sites remains unknown. Along these lines, the authors should present a more thorough characterization of putative Rnt1 sites recovered from in vitro Rnt1 cleavage.

      We have included new data that confirm that YDR514C cleavage by Rnt1 is relevant to yeast cell physiology. We show that YDR514C overexpression is indeed toxic, as we previously postulated. More importantly, we generated an allele of YDR514C that has synonymous mutations designed to disrupt the stem-loop recognized by Rnt1. We show that at 37 °C, both the wild-type and mutant allele are toxic to rnt1∆ cells, but that in cells that express Rnt1, the wild-type cleavable allele is more toxic than the allele with the mutated stem-loop. This genetic interaction provides strong evidence that cleavage of YDR514C by Rnt1 is relevant to cell physiology. 

      We have also added PARE analysis of poly(A)-enriched and poly(A)-depleted reactions and show that compared to Dcp2, Rnt1 preferentially targets poly(A)+ mRNAs, consistent with it targeting nuclear RNAs. We discuss in more detail that by cleaving nuclear RNA, Rnt1 provides a kinetic proofreading mechanism for mRNA export competence.

      (3) The authors need to corroborate the rRNA 3'-ETS tetraloop mutations with a northern analysis of 3'-ETS processing to confirm an ETS processing defect (which might need to be done in decay mutants to stabilize the liberated ETS fragment). They state that the tetraloop mutation does not yield a growth defect and use this as the basis for concluding that rRNA cleavage is not the major role of Rnt1 in vivo, which is a surprising finding. But it remains possible that tetraloop mutations did not have the expected disruptive effect in vivo; if the ETS is processed normally in the presence of tetraloop mutations, it would undermine this interpretation. This needs to be more carefully examined.

      We have removed the rRNA 3'-ETS tetraloop mutations, because initial northern blot analysis indicated that Rnt1 cleavage is not completely blocked by the mutations we designed. Therefore, the reviewer is correct that tetraloop mutations did not have the expected disruptive effect in vivo. Future investigations will be required to fully understand this. This was a minor point, and removing it focuses the paper on its major contributions.

      (4) To support the assertion that YDR514C cleavage is required for normal "homeostasis," and more specifically that it is the major contributor to the rnt1∆ growth defect, the authors should express the YDR514C-G220S mutant in the rDNA∆ strains with mutations in the 3'-ETS (assuming they disrupt ETS processing, see above). This simple experiment should provide a relative sense of "importance" for one or the other cleavage being responsible for the rnt1∆ defect. Given the accepted role of Rnt1 cleavage in rRNA processing and a dogmatic view that this is the reason for the rnt1∆ growth defect, such a result would be surprising and elevate the functional relevance and significance of Rnt1 mRNA cleavage.

      We agree that the experiment proposed by the reviewer is very simple, but we are puzzled by the rationale. First, our experiments do not support that there is anything special about the G220S mutation in YDR514C. A complete loss of function (ydr514c∆) also suppresses the growth defect, suggesting that ydr514c-G220S is a simple loss of function allele. We have clarified that the G220S mutation is distant from the stem-loop recognized by Rnt1 and is unlikely to affect cleavage by Rnt1. Instead, Rnt1 cleavage and the G220S mutation are independent alternative ways to reduce Ydr514c function. We have clarified this point in the text. 

      As mentioned in response to point #2, we have included other additional experiments that address the same overall question raised here: the importance of YDR514C mRNA cleavage by Rnt1.

      (5) Given that some Rnt1 mRNA cleavage is likely nuclear, it is possible that some of these targets are nascent mRNA transcripts, as opposed to mature but unexported mRNA transcripts, as proposed in the manuscript. A role for Rnt1 in co-transcriptional mRNA cleavage would be conceptually similar to Rnt1 cleavage of the rRNA 3'-ETS to enable RNA Pol I "torpedo" termination by Rat1, described by Proudfoot et al (PMID 20972219). To further delineate this point, the authors could e.g., examine the poly-A tails on abundant Rnt1 targets to establish whether they are mature, polyadenylated mRNAs (e.g., northern analysis of oligo-dT purified material). A more direct test would be PARE analysis of oligo-dT enriched or depleted material to determine the poly-A status of the cleavage products. Alternatively, their association with chromatin could be examined. 

      We have added the requested PARE analysis of oligo-dT enriched or depleted material to determine the poly(A) status of the cleavage products and related discussions. These confirm our proposal that Rnt1 cleaves mature but unexported mRNA transcripts.

      We also note that the northern blots shown in figures 2E, 4C, and 5B use oligo dT selected RNA because the signal was undetectable when we used total RNA. This suggests that the cleaved mRNAs are indeed polyadenylated. 

      The term nascent is somewhat ambiguous, but if the reviewer means RNA that is still associated with Pol II and has not yet been cleaved by the cleavage and polyadenylation machinery, we think that is inconsistent with our findings. We have also re-analyzed the NET-seq data from https://pubmed.ncbi.nlm.nih.gov/21248844/ and find no prominent peaks for our Rnt1 sites in Pol II associated RNAs, although for BDF2 NET-seq does suggest that “spliceosome-mediated decay” is co-transcriptional as would be expected. Altogether these data confirm our previous proposal that Rnt1 mainly cleaves mRNAs that have completed polyadenylation but are not yet exported.

      (6) While laboratory strains of budding yeast have a single RNase III ortholog Rnt1, several other budding yeast have a functional RNAi system with Dcr and Ago (PMID 19745116), and laboratory yeast strains are a derived state due to pressure from the killer virus to lose the RNAi system (PMID 21921191). The current study could provide new insight into the relative substrate preferences of Rnt1 and budding yeast Dicer, which could be experimentally confirmed by expressing Dcr in RNT1 and rnt1∆ strains. In lieu of experiments, discussion of the relevance of Rnt1 cleavage compared to yeast RNAi should be included in the discussion before the "human implications" section.

      The reviewer points out that most other eukaryotic species have multiple RNase III family members, which is a general point we discussed and have now expanded on. The reviewer specifically points to papers that study a species that was incorrectly referred to as Saccharomyces castellii in PMID 19745116, but whose current name is Naumovozyma castellii, reflecting that it is not that closely related to S. cerevisiae (diverged about 86 million years ago; for the correct species phylogeny, see http://ygob.ucd.ie/browser/species.html, as both of the published papers the reviewer cites have some errors in the phylogeny). 

      The other species discussed in PMID 19745116 (Vanderwaltozyma polyspora and Candida albicans) are even more distant. There have been several studies on substrate specificity of Dcr1 versus Rnt1 (including PMID 19745116). 

      The reviewer suggests that expressing Dcr1 in S. cerevisiae would be a valuable addition. However, we can’t envision a mechanism by which S. cerevisiae maintained physiologically relevant Dcr1 substrates in the absence of Dcr1. The results from the proposed study would, in our opinion, be limited to identifying RNAs that can be cleaved in this particular artificial system. We think an important implication of our work is that studies similar to ours should be carried out in rnt1∆, dcr1∆, and double mutants in either S. pombe or N. castellii, as well as in drosha knockouts in animals, and we discuss this in more detail in the revised paper.

      (7) For SNR84 in Figure S3D, it appears that the TSS may be upstream of the annotated gene model. Does RNA-seq coverage (from external datasets) extend upstream to these additional mapped cleavages? The assertion that the mRNA is uncapped is concerning; an alternative explanation is that the nascent mRNA has a cap initially but is subsequently cleaved by Rnt1. This point should be clarified or reworded for accuracy.

      We agree with the reviewer that the most likely explanation is that the primary SNR84 transcript is capped and then 5’ end processed by Rnt1 and Rat1 to make a mature 5’ monophosphorylated SNR84, and we have clarified the text accordingly. We suspect our usage of “uncapped” might have been confusing. “Uncapped” was not meant to indicate that the primary transcript did not receive a cap, but instead that the mature transcript did not have a cap. We now use “5’ end processed” and “5’ monophosphorylated”.

      Reviewer #2 (Public review):  

      The yeast double-stranded RNA endonuclease Rnt1, a homolog of bacterial RNase III, mediates the processing of pre-rRNA, pre-snRNA, and pre-snoRNA molecules. Cells lacking Rnt1 exhibit pronounced growth defects, particularly at lower temperatures. In this manuscript, Notice-Sarpaning examines whether these growth defects can be attributed at least in part to a function of Rnt1 in mRNA degradation. To test this, the authors apply parallel analysis of RNA ends (PARE), which they developed in previous work, to identify polyA+ fragments with 5' monophosphates in RNT1 yeast that are absent in rnt1Δ cells. Because such RNAs are substrates for 5' to 3' exonucleolytic decay by Rat1 in the nucleus or Xrn1 in the cytoplasm, these analyses were performed in a rat1-ts xrn1Δ background. The data recapitulate known Rnt1 cleavage sites in rRNA, snRNAs, and snoRNAs, and identify 122 putative novel substrates, approximately half of which are mRNAs. Of these, two-thirds are predicted to contain double-stranded stem loop structures with A/UGNN tetraloops, which serve as a major determinant of Rnt1 substrate recognition. Rnt1 resides in the nucleus, and it likely cleaves mRNAs there, but cleavage products seem to be degraded after export to the cytoplasm, as analysis of published PARE data shows that some of them accumulate in xrn1Δ cells. The authors then leverage the slow growth of rnt1Δ cells for experimental evolution. Sequencing analysis of thirteen faster-growing strains identifies mutations predominantly mapping to genes encoding nuclear exosome co-factors. Some of the strains have mutations in genes encoding a lariat-debranching enzyme, a ribosomal protein nuclear import factor, poly(A) polymerase 1, and the RNA-binding protein Puf4. In one of the puf4 mutant strains, a second mutation is also present in YDR514C, which the authors identify as an mRNA substrate cleaved by Rnt1. Deletion of either puf4 or ydr514C marginally improves the growth of rnt1Δ cells, which the authors interpret as evidence that mRNA cleavage by Rnt1 plays a role in maintaining cellular homeostasis by controlling mRNA turnover.

      While the PARE data and their subsequent in vitro validation convincingly demonstrate Rnt1-mediated cleavage of a small subset of yeast mRNAs, the data supporting the biological significance of these cleavage events is substantially less compelling. This makes it difficult to establish whether Rnt1-mediated mRNA cleavage is biologically meaningful or simply "collateral damage" due to a coincidental presence of its target motif in these transcripts.

      We thank the reviewer and have added additional data to support our conclusion that mRNA cleavage, at least for YDR514C, is not simply collateral damage, but a physiologically relevant function of Rnt1. From an evolutionary perspective, cleavage of mRNAs by Rnt1 might have initially been collateral damage, but if there is a way to use this mechanism, evolution is probably going to use it.

      (1) A major argument in support of the claim that "several mRNAs rely heavily on Rnt1 for turnover" comes from comparing number of PARE reads at the transcript start site (as a proxy for fraction of decapped transcripts) and at the Rnt1 cleavage site (as a proxy for fraction of Rnt1-cleaved transcripts). The argument for this is that "the major mRNA degradation pathway is through decapping". However, polyA tail shortening usually precedes decapping, and transcripts with short polyA tails would be strongly underrepresented in PARE sequencing libraries, which were constructed after two rounds of polyA+ RNA selection. This will likely underestimate the fraction of decapped transcripts for each mRNA. There is a wide range of well-established methods that can be used to directly measure differences in the half-life of Rnt1 mRNA targets in RNT1 vs rnt1Δ cells. Because the PARE data rely on the presence of a 5' phosphate to generate sequencing reads, they also cannot be used to estimate what fraction of a given mRNA transcript is actually cleaved by Rnt1. 

      The reviewer is correct that decapping preferentially affects mRNAs with shortened poly(A) tails, that Rnt1 cleavage likely affects mostly newly made mRNAs with long poly(A) tails, and that PARE may underestimate the decay of mRNAs with shortened poly(A) tails. We have reanalyzed our previously published data where we performed PARE on both the poly(A)-enriched fraction and the poly(A)-depleted fraction (that remains after two rounds of oligo dT selection). Rnt1 products are over-represented in the poly(A)-enriched fraction, while decapping products are enriched in the poly(A)-depleted fraction, providing further support to our conclusion that Rnt1 cleaves nuclear RNA. We have re-written key sections of the paper accordingly.

      The reviewer also points out that “There is a wide range of well-established methods that can be used to directly measure differences in the half-life of Rnt1 mRNA targets in RNT1 vs rnt1Δ cells.” However, all of those methods measure mRNA degradation rates from the steady state pool, which is mostly cytoplasmic. We have, in different contexts, used these methods, but as we pointed out they are inappropriate to measure degradation of nuclear RNA. There are some studies that measure nuclear degradation rates, but this requires purifying nuclei. There are two major drawbacks to this. First, it cannot distinguish between degradation in the nucleus and export from the nucleus because both processes cause disappearance from the nucleus. Second, the purification of yeast nuclei requires “spheroplasting” or enzymatically removing the rigid cell wall. This spheroplasting is likely to severely alter the physiological state of the yeast cell. Given these significant drawbacks and the substantial time and money required, we chose not to perform this experiment.  

      (2) Rnt1 is almost exclusively nuclear, and the authors make a compelling case that its concentration in the cytoplasm would likely be too low to result in mRNA cleavage. The model for Rnt1-mediated mRNA turnover would therefore require mRNAs to be cleaved prior to their nuclear export in a manner that would be difficult to control. Alternatively, the Rnt1 targets would need to re-enter prior to cleavage, followed by export of the cleaved fragments for cytoplasmic decay. These processes would need to be able to compete with canonical 5' to 3' and 3' to 5' exonucleolytic decay to influence mRNA fate in a biologically meaningful way.

      We disagree that mRNA export would be difficult to control, as is elegantly demonstrated by the 13 kDa HIV Rev protein. The export of many other RNAs is tightly controlled such that many RNAs are rapidly degraded in the nucleus by, for example, Rat1 and the RNA exosome, while other RNAs are rapidly exported. Indeed, the competition between RNA export and nuclear degradation is generally thought to be an important quality control for a variety of mRNAs and ncRNAs. We do agree with the reviewer that re-import of mRNAs appears unlikely (which is why we do not discuss it), although it occurs efficiently for other Rnt1-cleaved RNAs such as snRNAs. We have clarified the text accordingly, including in the introduction, results, and discussion.

      (3) The experimental evolution clearly demonstrates that mutations in nuclear exosome factors are the most frequent suppressors of the growth defects caused by Rnt1 loss. This can be rationalized by stabilization of nuclear exosome substrates such as misprocessed snRNAs or snoRNAs, which are the major targets of Rnt1. The rescue mutations in other pathways linked to ribosomal proteins (splicing, ribosomal protein import, ribosomal mRNA binding) support this interpretation. By contrast, the potential suppressor mutation in YDR514C does not occur on its own but only in combination with a puf4 mutation; it is also unclear whether it is located within the Rnt1 cleavage motif or if it impacts Rnt1 cleavage at all. This can easily be tested by engineering the mutation into the endogenous YDR514C locus with CRISPR/Cas9 or expressing wild-type and mutant YDR514C from a plasmid, along with assaying for Rnt1 cleavage by northern blot. Notably, the growth defect complementation of YDR514C deletion in rnt1Δ cells is substantially less pronounced than the growth advantage afforded by nuclear exosome mutations (Figure S9, evolved strains 1 to 5). These data rather argue for a primary role of Rnt1 in promoting cell growth by ensuring efficient ribosome biogenesis through pre-snRNA/pre-snoRNA processing. 

      The reviewer makes several points. 

      First, we have clarified that the ydr514c-G220S mutation is not near the Rnt1 cleavage motif and is unlikely to affect cleavage by Rnt1. This is exactly what would be expected for a mutation that was selected for in an rnt1∆ strain. Although the reviewer appears to expect it, a mutation that affects Rnt1 cleavage could not be selected for in a strain that lacks Rnt1.

      Second, the reviewer points out that the original ydr514c mutations arose in a strain that also had a puf4 deletion. However, we show that ydr514c∆ also suppresses rnt1∆. Furthermore, we have added additional data that overexpressing an uncleavable YDR514C mRNA affects yeast growth at 37 °C more than the wild-type cleavable form, further supporting that the cleavage of YDR514C by Rnt1 is physiologically relevant.

      Reviewer #2 (Recommendations for the authors): 

      (1) The description of the PARE library construction protocol and data analysis workflow is insufficient to ensure their robustness and reproducibility. The library construction protocol should include details of the individual steps, and the data analysis workflow description should include package versions and exact commands used for each analysis step.

      We have clarified that the experiments were performed exactly as previously described and have included very detailed methods. The Galaxy server does not require commands and instead we have indicated the parameters chosen in the various steps. We have also added that the PARE libraries for poly(A)+ and poly(A)- fractions were generated in the lab of Pam Green according to their protocol, which is not exactly the same as ours. Nevertheless, the Rnt1 sites are also evident from those libraries, further demonstrating the robustness of our data. 

      (2) PARE signal is expressed as a ratio of sequencing coverage at a given nucleotide in RNT1 vs rnt1Δ cells. This poses challenges to estimating fold changes: by definition, there should be no coverage at Rnt1 cleavage sites in rnt1Δ cells, as there will not be any 5' monophosphate-containing mRNA fragments to be ligated to the library construction linker. This should be accounted for in the data analysis pipeline - the DESeq2 package, for example, handles this very well (https://support.bioconductor.org/p/64014/).

      The reviewer is correct and we have clarified how we do account for the possibility of having 0 reads by adding an arbitrary 0.01 cpm to all PARE scores for wild type and mutant. In the original manuscript this was not explicitly mentioned and the reader would have to go to our previous paper to learn about this detail. Adding this 0.01 cpm pseudocount avoids dividing by 0 when we calculate a comPARE score. This means we actually underestimate the fold change. As can be seen in the red line in the image below, the y-axis modified log2FC score maxes out along a diagonal line at log2([average RNT1 reads]/0.01) instead of at infinity. That is, at a wild type peak height of 1 cpm, the maximum possible score is log2(1.01/.01), which equals 6.66, and at 10 cpm, the maximum score is ~10, etc. As can be seen, many of the scores fall along this diagonal, reflecting that indeed, there are 0 reads in the rnt1∆ samples.

      Author response image 1.

      There are multiple ways to deal with this issue, and ours is not uncommon. DESeq2, suggested by the reviewer, uses a different method, which relies on the assumption that the dispersion of read counts for genes of any given expression strength is constant, and then uses that dispersion to “correct” the 0 read counts. While this is a valid approach to differential gene expression when comparing similar RNAs, the underlying assumption that the dispersion is similar for all genes of similar expression level is questionable when comparing, for example, mRNAs, snoRNAs, and snRNAs. Thus, we are not convinced that this is a better way to deal with 0 counts. Our analysis accepts that 0 might be the best estimate for the number of counts that are expected from rnt1∆ samples.
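
      The pseudocount arithmetic described in this response can be illustrated with a minimal sketch. It assumes only what is stated above (0.01 cpm added to both the wild-type and mutant signal before taking a log2 ratio); the function names `modified_log2fc` and `max_score` are hypothetical and are not taken from the authors' pipeline.

```python
import math

# Pseudocount in counts per million (cpm), as stated in the response.
PSEUDOCOUNT_CPM = 0.01


def modified_log2fc(wt_cpm, mutant_cpm, pseudocount=PSEUDOCOUNT_CPM):
    """Log2 ratio of wild-type to mutant signal after adding the
    pseudocount to both values, so a mutant value of 0 is allowed."""
    return math.log2((wt_cpm + pseudocount) / (mutant_cpm + pseudocount))


def max_score(wt_cpm, pseudocount=PSEUDOCOUNT_CPM):
    """Largest score attainable at a given wild-type peak height,
    reached when the mutant sample has 0 reads at that position."""
    return modified_log2fc(wt_cpm, 0.0, pseudocount)


if __name__ == "__main__":
    for wt in (1.0, 10.0):
        print(f"wild-type peak {wt:g} cpm: ceiling {max_score(wt):.2f}")
```

      With these assumptions, the ceiling grows with the wild-type peak height rather than being infinite, which is why sites with 0 reads in the mutant fall along a diagonal in a plot of score against wild-type signal.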

      (3) The analysis in Figure S8 is insufficient to demonstrate that the four mRNAs depicted are significantly more abundant in rnt1Δ vs RNT1 cells - differences in coverage could simply be a result of different sequencing depth. Please use an appropriate method for estimating differential expression from RNA-Seq data (e.g., DESeq2). 

      Unfortunately, the previously published data we included as figure S8 (now figure S9) did not include replicates, and we agree that it does not rigorously show an effect. The reviewer suggests that we analyze the data by DESeq2, which requires replicates, and thus, cannot be done. Instead we have clarified this. If the reviewer is not satisfied with this, we are prepared to delete it.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The objective of this study was to infer the population dynamics (rates of differentiation, division, and loss) and lineage relationships of clonally expanding NK cell subsets during an acute immune response. 

      Strengths: 

      A rich dataset and thorough analysis of a particular class of stochastic models. 

      We thank the reviewer for the positive comment.

      Weaknesses: 

      The stochastic models used are quite simple; each population is considered homogeneous with first-order rates of division, death, and differentiation. In Markov process models such as these, there is no dependence of cellular behavior on its history of divisions. In recent years models of clonal expansion and diversification, in the settings of T and B cells, have progressed beyond this picture. So I was a little surprised that there was no mention of the literature exploring the role of replicative history in differentiation (e.g. Bresser Nat Imm 2022), nor of the notion of family 'division destinies' (either in division number or the time spent proliferating, as described by the Cyton and Cyton2 models developed by Hodgkin and collaborators; e.g. Heinzel Nat Imm 2017). The emerging view is that variability in clone (family) size may arise predominantly from the signals delivered at activation, which dictate each precursor's subsequent degree of expansion, rather than from the fluctuations deriving from division and death modeled as Poisson processes. 

      As you pointed out, the Gerlach and Buchholz Science papers showed evidence for highly skewed distributions of family sizes and correlations between family size and phenotypic composition. Is it possible that your observed correlations could arise if the propensity for immature CD27+ cells to differentiate into mature CD27- cells increases with division number? The relative frequency of the two populations would then also be impacted by differences in the division rates of each subset - one would need to explore this. But depending on the dependence of the differentiation rate on division number, there may be parameter regimes (and time points) at which the more differentiated cells can predominate within large clones even if they divide more slowly than their immature precursors. One might not then be able to rule out the two-state model. I would like to see a discussion or rebuttal of these issues. 

      We thank the reviewer for the insightful comment and for drawing our attention to the Cyton models. We have discussed the Cyton models in the Introduction (lines 80-95) and the Discussion (lines 538-553) sections of the revised manuscript and carried out simulations for the variant of the Cyton model suggested by the reviewer. The two-state model showed that for certain parameters it can give rise to a negative correlation between the clone size and the percentage of immature (CD27+) NK cells in the absence of any death, suggesting the potential importance of division destiny along with stochastic fluctuations in giving rise to the heterogeneity observed in NK cell clone size distributions in the expansion phase. In addition, we also considered a two-state model where the NK cell activation time in individual cells varies following a log-normal distribution; this two-state model also shows the presence of negative correlations between clone sizes and the percentage of immature NK cells within the clones. We have added new results (Figs. S2-3) and discussed the results (lines 223-232) in the Results and the Discussion (lines 538-553) sections. We believe these additional simulations provide new insights into the results we obtained with our two- and three-state models. 

      Reviewer #2 (Public review): 

      Summary: 

      Wethington et al. investigated the mechanistic principles underlying antigen-specific proliferation and memory formation in mouse natural killer (NK) cells following exposure to mouse cytomegalovirus (MCMV), a phenomenon predominantly associated with CD8+ T cells. Using a rigorous stochastic modeling approach, the authors aimed to develop a quantitative model of NK cell clonal dynamics during MCMV infection. 

      Initially, they proposed a two-state linear model to explain the composition of NK cell clones originating from a single immature Ly49+CD27+ NK cell at 8 days post-infection (dpi). Through stochastic simulations and analytical investigations, they demonstrated that a variant of the two-state model incorporating NK cell death could explain the observed negative correlation between NK clone sizes at 8 dpi and the percentage of immature (CD27+) NK cells (Page 8, Figure 1e, Supplementary Text 1). However, this two-state model failed to accurately reproduce the first (mean) and second (variance and covariance) moments of the measured CD27+ and CD27- NK cell populations within clones at 8 dpi (Figure 1g). 

      To address this limitation, the authors increased the model's complexity by introducing an intermediate maturation state, resulting in a three-stage model with the transition scheme: CD27+Ly6C- → CD27-Ly6C- → CD27-Ly6C+. This three-stage model quantitatively fits the first and second moments under two key constraints: (i) immature CD27+ NK cells exhibit faster proliferation than CD27- NK cells, and (ii) there is a negative correlation (upper bound: -0.2) between clone size and the fraction of CD27+ cells. The model predicted a high proliferation rate for the intermediate stage and a high death rate for the mature CD27-Ly6C+ cells. 

      Using NK cell reporter mouse data from Adams et al. (2021), which tracked CD27+/- cell population dynamics following tamoxifen treatment, the authors validated the three-stage model. This dataset allowed discrimination between NK cells originating from the bone marrow and those pre-existing in peripheral blood at the onset of infection. To test the prediction that mature CD27- NK cells have a higher death rate, the authors measured Ly49H+ NK cell viability in the mouse spleen at different time points post-MCMV infection. Experimental data confirmed that mature (CD27-) NK cells exhibited lower viability compared to immature (CD27+) NK cells during the expansion phase (days 4-8 post-infection). 

      Further mathematical analyses using a variant of the three-stage model supported the hypothesis that the higher death rate of mature CD27- cells contributes to a larger proportion of CD27- cells in the dead cell compartment, as introduced in the new variant model. 

      Altogether, the authors proposed a three-stage quantitative model of antigen-specific expansion and maturation of naïve Ly49H+ NK cells in mice. This model delineates a maturation trajectory: (i) CD27+Ly6C- (immature) → (ii) CD27-Ly6C- (mature I) → (iii) CD27-Ly6C+ (mature II). The findings highlight the highly proliferative nature of the mature I (CD27-Ly6C-) phenotype and the increased cell death rate characteristic of the mature II (CD27-Ly6C+) phenotype. 

      Strengths: 

      By designing models capable of explaining correlations, first and second moments, and employing analytical investigations, stochastic simulations, and model selection, the authors identified the key processes underlying antigen-specific expansion and maturation of NK cells. This model distinguishes the processes of antigen-specific expansion, contraction, and memory formation in NK cells from those observed in CD8+ T cells. Understanding these differences is crucial not only for elucidating the distinct biology of NK cells compared to CD8+ T cells but also for advancing the development of NK cell therapies currently under investigation. 

      We thank the reviewer for the positive comments.

      Weaknesses: 

      The conclusions of this paper are largely supported by the available data. However, a comparative analysis of model predictions with more recent works in the field would be desirable. Moreover, certain aspects of the simulations, parameter inference, and modeling require further clarification and expansion, as outlined below: 

      (1) Initial Conditions and Grassmann Data: The Grassmann data is used solely as a constraint, while the simulated values of CD27+/CD27- cells could have been directly fitted to the Grassmann data, which assumes a 1:1 ratio of CD27+/CD27- at t = 0. This approach would allow for an alternative initial condition rather than starting from a single CD27+ cell, potentially improving model applicability. 

      We fit the moments of the cell populations along with the ratio of resulting cells from an initial condition of 1:1 ratio of CD27+/CD27- cells at t=0 in the model. The initial condition agrees with the experimental data. However, this fit produced parameter values that will lead to greater growth of mature CD27- NK cells compared to that of immature CD27+ NK cells. This could result from the equal weights given to the ratio as well as to the different moments, and a realistic parameter estimate could correspond to an unequal weight between the ratio and the moments. Imposing the constraint Δ_k > 0 in the fitting confines the parameter search to a region that seems to alleviate this issue and produces estimates of the rates consistent with higher growth of immature NK cells. We included Table S6 and accompanying description to show this, as well as an additional section in the Materials and Methods (lines 669-676). 

      (2) Correlation Coefficients in the Three-State Model: Although the parameter scan of the three-state model (Figure 2) demonstrates the potential for achieving negative correlations between colony size and the fraction of CD27+ cells, the authors did not present the calculated correlation coefficients using the estimated parameter values from fitting the three-state model to the data. Including these simulations would provide additional insight into the parameter space that supports negative correlations and further validate the model.  

      We have included this figure (Figure 2d) in the revised manuscript.

      (3) Viability Dynamics and Adaptive Response: The authors measured the time evolution of CD27+/- dynamics and viability over 30 days post-infection (Figure 4). It would be valuable to test whether the three-state model can reproduce the adaptive response of CD27- cells to MCMV infection, particularly the observed drop in CD27- viability at 5 dpi (prior to the 8 dpi used in the study) and its subsequent rebound at 8 dpi. Reproducing this aspect of the experiment is critical to determine whether the model can simultaneously explain viability dynamics and moment dynamics. Furthermore, this analysis could enable sensitivity analysis of CD27- viability with respect to various model parameters. 

      We have compared the expansion kinetics of the adoptively transferred Ly49H+ NK cells (Figure 2) and endogenous Ly49H+ NK cells, where the endogenous NK cells show slower growth rates than their adoptively transferred counterparts (see lines 422-429). The data shown in Figure 4 refer to the relative percentage of the mature and immature endogenous NK cells, thus cannot be explained by the three-state model calibrated by the expansion of the adoptively transferred NK cells. One of the issues with using the viability data for parameter estimation for endogenous cells is the need to assume a model for dead cell clearance. We assume a model where dead cells are cleared according to a first-order decay reaction and vary the rate of this reaction to show that the qualitative results are in line with our model rates. This model cannot recreate the dip and rebound observed in the data, and instead monotonically and asymptotically approaches a percentage of live cells. We have attached a figure showing this behavior below. Rather, we intend to use this model as qualitative validation that the relative viability of mature NK cells is lower than that of immature NK cells. Models that include time-dependence of clearance of dead cells, or models with a higher-order (i.e. second) reaction for clearance of dead cells in which propensity for clearance is lower at early times and greater at later times may be better suited for this purpose but are beyond the scope of our validation. 

      Author response image 1.
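To illustrate why a first-order clearance model cannot produce a dip and rebound in viability, here is a minimal sketch for a single exponentially growing population. It is a one-population simplification, not the authors' multi-state model, and the rate values are illustrative assumptions. Live cells L obey dL/dt = g·L (g is division minus death), dead cells D obey dD/dt = d·L − c·D with D(0) = 0, and the live fraction L/(L + D) then has a closed form that decreases monotonically to the plateau (g + c)/(g + c + d).

```python
import math


def live_fraction(t, net_growth, death, clearance):
    """Live fraction L/(L+D) at time t for
        dL/dt = net_growth * L
        dD/dt = death * L - clearance * D,  with D(0) = 0.
    Closed-form solution; requires net_growth + clearance > 0."""
    k = net_growth + clearance
    if k <= 0:
        raise ValueError("net_growth + clearance must be positive")
    dead_per_live = death * (1.0 - math.exp(-k * t)) / k
    return 1.0 / (1.0 + dead_per_live)


# Illustrative per-day rates (assumptions, not fitted values).
rates = dict(net_growth=0.5, death=0.3, clearance=1.0)
days = [0, 1, 2, 4, 8, 16]
fractions = [live_fraction(t, **rates) for t in days]
plateau = (rates["net_growth"] + rates["clearance"]) / (
    rates["net_growth"] + rates["clearance"] + rates["death"]
)

for t, f in zip(days, fractions):
    print(f"day {t:>2}: live fraction = {f:.3f}")
print(f"plateau = {plateau:.3f}")
```

Varying `clearance` changes the plateau and how quickly it is reached, but never the monotonic shape, which is consistent with the behaviour described in the response.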

      Reviewer #1 (Recommendations for the authors):  

      I think the manuscript could be improved substantially by exploring alternative models that incorporate replicative history. At the very least it needs a deeper discussion of the literature relating to clonal expansion, putting the existing models in the context of these studies, and arguing convincingly that your conclusions are robust.  

      We have substantially expanded our explorations with alternative models, in particular we considered a variant of the Cyton model suggested by Reviewer#1, a model where NK cells become activated at different times, and a model with asymmetric NK cell division. We have shown the results (Figs. S2-3) in the Supplementary material and discussed the results in the Results and Discussion sections. Please refer to our response #1 to Reviewer #1 for more details. 

      Reviewer #2 (Recommendations for the authors): 

      (1) Possible Typo (Page 12, Line 254): 

      The phrase: "immature NK cells compared to their immature counterparts" appears to contain a typo. Consider rephrasing for clarity. 

      Done. Thanks for finding this. 

      (2) Clarification of Data Source and Computational Procedure: 

      In the statement: "The NK cell clones reported by Flommersfeld et al. contained mixtures of CD27+ and CD27- NK cells. We evaluated the percentage of CD27+ NK cells in each clone and computed the correlation (Csize-CD27+) of the size of the clone with the percentage of CD27+ NK cells in the clones." Please clarify the data source and computational methodology for evaluating the percentage of CD27+ cells within clones. Additionally, consider including the curated data in the supplementary materials. Since the data originates from different immune compartments, explain which compartments were used. If data from all compartments were included, discuss how the calculated correlation changes when stratifying data from different sources (e.g., spleen and lymph nodes).  

      We have clarified the data source (spleen) where appropriate.

      (3) Figure 1b (Correlation Coefficient): 

      While the correlation coefficient with p-value is mentioned, it would be beneficial to also provide the standard deviation of the correlation coefficient and a 95% confidence band for the fitted line. This is particularly relevant as the authors use -0.2 as the upper bound for the correlation coefficient when fitting the three-stage model. 

      We have included the CI and the p-value for the correlation shown in Figure 1b. The plot with the 95% confidence band (appended below), in which both axes are on a linear scale, is visually less clear than Figure 1b, in which the clone sizes are shown on a log scale. Thus, we did not include the confidence band in Figure 1b but display the CI and p-values on the figure. If the reviewer prefers, we can include the figure with the confidence band in the SI.

      Author response image 2.
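For readers who want to see how an uncertainty estimate on a correlation coefficient can be obtained, below is a minimal sketch of a percentile bootstrap. This is one common approach and is not necessarily the method the authors used; the function name `bootstrap_corr_ci` and the synthetic clone data are assumptions made for illustration.

```python
import numpy as np


def bootstrap_corr_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Pearson correlation with a percentile-bootstrap confidence interval.

    Returns (r, lower, upper). Pairs (x_i, y_i) are resampled together,
    with replacement.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample clones with replacement
        estimates[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return np.corrcoef(x, y)[0, 1], lower, upper


# Synthetic example only: larger clones contain fewer immature cells.
data_rng = np.random.default_rng(1)
log_size = data_rng.normal(4.0, 1.0, size=60)            # log clone size
pct_immature = np.clip(60 - 8 * log_size + data_rng.normal(0, 5, size=60), 0, 100)

r, lo, hi = bootstrap_corr_ci(log_size, pct_immature)
print(f"r = {r:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

An interval whose upper limit lies below a chosen threshold (for example the -0.2 bound mentioned by the reviewer) would indicate that the bound is compatible with the sampling uncertainty in the data.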

      (4) Confidence Intervals in Tables: 

      If confidence intervals in the tables are calculated using bootstrapping, please mention this explicitly in the table headings for clarity. 

      Done.

      (5) Figure 2d-e (Simulation Method): 

      Specify the simulation method used (e.g., stochastic simulation algorithm [SSA], as mentioned in the materials and methods). Panel (e) lacks a caption-please provide one. Additionally, it would be interesting to include the correlation between clone size and the fraction of CD27+ cells in the clones (similar to the experimental data from Flommersfeld et al., 2021). 

      Done.
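The kind of simulation referred to here can be sketched with a Gillespie stochastic simulation algorithm (SSA) for a two-state model with death of mature cells. The reaction scheme follows the reviewer's summary (division of both subsets, differentiation CD27+ → CD27-, death of CD27- cells), but the rate values, the function name `simulate_clone`, and the choice to correlate against log clone size are assumptions made for illustration, not the authors' fitted parameters or code.

```python
import random

import numpy as np

# Hypothetical per-day rates, chosen only to keep the example fast.
RATES = {"div_imm": 0.9, "diff": 0.3, "div_mat": 0.5, "death_mat": 0.2}


def simulate_clone(rates, t_end=8.0, rng=None, max_cells=100_000):
    """Gillespie SSA for one clone founded by a single CD27+ cell.
    Returns (n_immature, n_mature) at time t_end (days)."""
    rng = rng or random.Random()
    n_imm, n_mat, t = 1, 0, 0.0
    while n_imm + n_mat < max_cells:
        propensities = [
            rates["div_imm"] * n_imm,    # CD27+ -> 2 CD27+
            rates["diff"] * n_imm,       # CD27+ -> CD27-
            rates["div_mat"] * n_mat,    # CD27- -> 2 CD27-
            rates["death_mat"] * n_mat,  # CD27- -> dead
        ]
        total = sum(propensities)
        if total == 0.0:
            break  # no reaction can fire (extinct clone or all rates zero)
        t += rng.expovariate(total)      # waiting time to the next reaction
        if t > t_end:
            break
        r = rng.random() * total         # pick a reaction by its propensity
        if r < propensities[0]:
            n_imm += 1
        elif r < propensities[0] + propensities[1]:
            n_imm -= 1
            n_mat += 1
        elif r < sum(propensities[:3]):
            n_mat += 1
        elif n_mat > 0:
            n_mat -= 1
    return n_imm, n_mat


rng = random.Random(0)
clones = [simulate_clone(RATES, rng=rng) for _ in range(300)]
nonempty = [(imm, mat) for imm, mat in clones if imm + mat > 0]
sizes = np.array([imm + mat for imm, mat in nonempty], dtype=float)
pct_immature = np.array([100.0 * imm / (imm + mat) for imm, mat in nonempty])
corr = np.corrcoef(np.log10(sizes), pct_immature)[0, 1]
print(f"{len(sizes)} non-empty clones, correlation = {corr:.2f}")
```

Repeating this over a grid of rates is one way to map which parameter regimes give a negative correlation between clone size and the percentage of CD27+ cells.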

      (6) Figure 3 (Confidence Band): 

      Include a 95% confidence band for the simulated values to enhance the interpretability of the plots. 

      Done.

      (7) Materials and Methods Section:  Include a mathematical formula defining the metrics described, ensuring clarity and precision. 

      Done. See newly added lines 587-599, as well as existing content in the Supplementary Materials.

      (8) Supplementary Text 1 (Numerical Integration and AICc): 

      The section "Numerical Integration of Master Equation and Calculation of the AICc" is well done. However, given that the master equation involves a system of 106 coupled ODEs, it would be highly appreciated if the authors provided the formulation in matrix representation for better comprehension. 

      We have included a supplementary text (Supplementary Text I) and a schematic figure within the text to provide the details.

      (9) Figure S7b (Three-State Model Validation): 

      Given that the three-state model fits the data, assess whether it can also fit the first and second-moment data effectively. This validation would strengthen the robustness of the model.

      Although we showed that the best fit of the clonal burst data (moments) vastly overestimates the growth rates of endogenous cells (Figure S9a, previously Figure S7a), we did not fully emphasize the differences in the datasets that make fitting both with the same parameters impossible. We have added additional text in the main text where Figure S9a is located (lines 427-429) to discuss this.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The manuscript by Raices et al., provides novel insights into the role and interactions between SPO-11 accessory proteins in C. elegans. The authors propose a model of meiotic DSBs regulation, critical to our understanding of DSB formation and ultimately crossover regulation and accurate chromosome segregation. The work also emphasizes the commonalities and species-specific aspects of DSB regulation.

      Strengths:

      This study capitalizes on the strengths of the C. elegans system to uncover genetic interactions between a large number of SPO-11 accessory proteins. In combination with physical interactions, the authors synthesize their findings into a model, which will serve as the basis for future work, to determine mechanisms of DSB regulation.

      Weaknesses:

      The methodology, although standard, lacks quantification. This includes the mass spectrometry data, along with the cytology. The work would also benefit from clarifying the role of the DSB machinery on the X chromosome versus the autosomes.

      • We have uploaded the MS data and added a summary table with the number of peptides and coverage.

      • We have added statistics to the comparisons of DAPI body counts.

      • We have provided additional images of the change in HIM-5 localization.

      • We have quantified the overlap (or lack thereof) between XND-1 and HIM-17 and the DNA axis.

      Reviewer #2 (Public Review):

      Summary:

      Meiotic recombination initiates with DNA double-strand break (DSB) formation, catalyzed by the conserved topoisomerase-like enzyme Spo11. Spo11 requires accessory factors that are poorly conserved across eukaryotes. Previous genetic studies have identified several proteins required for DSB formation in C. elegans to varying degrees; however, how these proteins interact with each other to recruit the DSB-forming machinery to chromosome axes remains unclear.

      In this study, Raices et al. characterized the biochemical and genetic interactions among proteins that are known to promote DSB formation during C. elegans meiosis. The authors examined pairwise interactions using yeast two-hybrid (Y2H) and co-immunoprecipitation and revealed an interaction between a chromatin-associated protein HIM-17 and a transcription factor XND-1. They further confirmed the previously known interaction between DSB-1 and SPO-11 and showed that DSB-1 also interacts with a nematode-specific HIM-5, which is essential for DSB formation on the X chromosome. They also assessed genetic interactions among these proteins, categorizing them into four epistasis groups by comparing phenotypes in double vs. single mutants. Combining these results, the authors proposed a model of how these proteins interact with chromatin loops and are recruited to chromosome axes, offering insights into the process in C. elegans compared to other organisms.

      Weaknesses:

      This work relies heavily on Y2H, which is notorious for having high rates of false positives and false negatives. Although the interactions between HIM-17 and XND-1 and between DSB-1 and HIM-5 were validated by co-IP, the significance of these interactions was not tested, and cataloging Y2H interactions does not yield much more insight.

      We appreciate that the reviewer recognized the value of our IP data, but we beg to differ that we rely too heavily on the Y2H. We also provide genetic analysis on bivalent formation to support the physical interaction data. We do acknowledge that there are caveats with Y2H, however, including that a subset of the interactions can only be examined with proteins in one orientation due to auto-activation. While we acknowledge that it would be nice to have IP data for all of the proteins using CRISPR-tagged, functional alleles, these strains are not all feasible (e.g. no functional rec-1 tag has been made) and are beyond the scope of the current work.

      Moreover, most experiments lack rigor, which raises serious concerns about whether the data convincingly supports the conclusions of this paper. For instance, the XND-1 antibody appears to detect a band in the control IP; however, there was no mention of the specificity of this antibody.

      We previously showed the specificity of this antibody in its original publication showing lack of staining in the xnd-1 mutant by IF (Wagner et al., 2010). To further address this, however, we have now included a new supplementary figure (Figure S1) demonstrating the specificity of the XND-1 antibody by Western blot. The antibody detects a distinct band in extracts from wild-type (N2) worms, but this band is absent in two independent xnd-1 mutant strains. This confirms that the antibody specifically recognizes XND-1, supporting the validity of the IP results shown in the main figures.

      Additionally, epistasis analysis of various genetic mutants is based on the quantification of DAPI bodies in diakinesis oocytes, but the comparisons were made without statistical analyses.

      We have added statistical analysis to all datasets where quantification was possible, strengthening the rigor and interpretation of our findings.

      For cytological data, a single representative nucleus was shown without quantification and rigorous analysis. The rationale for some experiments is also questionable (e.g. the rescue by dsb-2 mutants by him-5 transgenes in Figure 2), making the interpretation of the data unclear. Overall, while this paper claims to present "the first comprehensive model of DSB regulation in a metazoan", cataloging Y2H and genetic interactions did not yield any new insights into DSB formation without rigorous testing of their significance in vivo. The model proposed in Figure 4 is also highly speculative.

      Regarding the cytology, we provide new images and quantification of HIM-17 and XND-1 overlap with the DNA axes. We also added full germ line images showing HIM-5 localization in wild type and dsb-1 mutants, to provide a more complete and representative view of the observed phenotype. To further support our findings, we’ve also included images demonstrating that this phenotype is consistently observed both in live worms with the him-5::GFP transgene and in fixed worms with an endogenously tagged version of HIM-5.

      Reviewer #3 (Public Review):

      During meiosis in sexually reproducing organisms, double-strand breaks are induced by a topoisomerase-related enzyme, Spo11, which is essential for homologous recombination, which in turn is required for accurate chromosome segregation. Additional factors control the number and genome-wide distribution of breaks, but the mechanisms that determine both the frequency and preferred location of meiotic DSBs remain only partially understood in any organism.

      The manuscript presents a variety of different analyses that include variable subsets of putative DSB factors. It would be much easier to follow if the analyses had been more systematically applied. It is perplexing that several factors known to be essential for DSB formation (e.g., cohesins, HORMA proteins) are excluded from this analysis, while it includes several others that probably do not directly contribute to DSB formation (XND-1, HIM-17, CEP-1, and PARG-1).

      We respectfully disagree with the reviewer’s statement regarding the selection of factors included in our analysis. In this work, our focus was specifically on SPO-11 accessory factors — proteins that directly interact with or regulate SPO-11 activity during double-strand break formation. Cohesins and chromosome axis proteins (such as the HORMA domain proteins) are essential for establishing the correct chromosome architecture that supports DSB formation, but there is no evidence that they are direct accessory factors of SPO-11. Therefore, they were intentionally excluded from this study to maintain a clear and focused scope on proteins that more directly modulate SPO-11 function.

      Conversely, XND-1, HIM-17, CEP-1, and PARG-1 have all been implicated in regulating aspects of SPO-11-mediated DSB formation or its immediate environment. Although their contributions may involve broader chromatin or DNA damage response regulation, prior literature supports their inclusion as relevant modulators of SPO-11 activity, justifying their analysis within the context of this work.

      The strongest claims seem to be that "HIM-5 is the determinant of X-chromosome-specific crossovers" and "HIM-5 coordinates the actions of the different accessory factors subgroups." Prior work had already shown that mutations in him-5 preferentially reduce meiotic DSBs on the X chromosome. While it is possible that HIM-5 plays a direct role in DSB induction on the X chromosome, the evidence presented here does not strongly support this conclusion. It is also difficult to reconcile this idea with evidence from prior studies that him-5 mutations predominantly prevent DSB formation on the sex chromosomes, while the protein localizes to autosomes.

      HIM-5 is not the only protein that is autosomally enriched but preferentially affects the X chromosome: MES-4 and MRG-1 are both autosomally-enriched but influence silencing of the X chromosome. While HIM-5 appears autosomally-enriched, it does not appear to be autosomal-exclusive. While we would ideally perform ChIP to determine its localization on chromatin, this method for assaying DSB sites is likely insufficient to identify DSB sites which differ in each nucleus and for which there are no known hotspots in the worm.

      him-5 mutants confer an ~50% reduction in total number of breaks and a very profound change in break dynamics (seen by RAD-51 foci (Meneely et al., 2012)). Since the autosomes receive sufficient breaks in this context to attain a crossover in >98% of nuclei, this indicates that the autosomes are much less profoundly impacted by loss of DSB functions than is the X chromosome. Indeed, prior data from co-author Monica Colaiacovo showed that fewer breaks occur on the X (Gao, 2015), likely resulting from differences in the chromatin composition of the X and autosomes resulting from X chromosome silencing.

      The conclusion that HIM-5 must be required for breaks on the X comes from the examination of DSB levels and their localization in different mutants that impair but do not completely abrogate breaks. In any situation where HIM-5 protein expression is affected (xnd-1, him-17, and him-5 null alleles), breaks on the X are reduced/eliminated. By contrast, in dsb-2 mutants, where HIM-5 expression is unaffected, both X and autosomal breaks are impacted equally. As discussed above, in the absence of HIM-5 function, there are ~15 breaks/nucleus. The Ppie-1::him-5 transgene is expressed at lower levels than Phim-5::him-5, but in the best case, the ectopic expression of this protein should give a maximum of ~15 breaks (the total # of breaks is thought to be ~30/nucleus). By these estimates, Ppie-1::him-5; him-17 and him-5 null mutants have the same number of breaks. Yet, in the former case, breaks occur on the X, whereas in the latter they do not. The best explanation for this discrepancy is that HIM-5 is sufficient to recruit the DSB machinery to the X chromosome.

      The one experiment that seems to elicit the conclusion that HIM-5 expression is sufficient for breaks on the X chromosome is flawed (see below). The conclusion that HIM-5 "coordinates the activities of the different accessory sub-groups" is not supported by data presented here or elsewhere.

      We have reorganized the discussion to more directly address the reviewers’ concerns. We raise the possibility that HIM-5 has an important role in bringing together SPO-11 and its interacting components (DSB-1/2/3) with the other DSB-inducing factors, including those that regulate DSB timing (XND-1), coordination with the cell cycle (REC-1), association with the chromosome axis (PARG-1, MRE-11), and coupling to downstream resection and repair (MRE-11, CEP-1).

      This raises a natural question: if HIM-5 has such a central role, why are the phenotypes of him-5 loss so mild? We propose that while the loss of DSBs on the X appears mild, more profound effects are seen in the total number, timing, and placement of the DSBs across the genome, all of which are diminished or altered in the absence of HIM-5. The phenotypes of him-5 loss are reminiscent of those observed in Prdm9-/- mice, where breaks are relocated to transcriptional start sites and show a significant delay in formation. As with PRDM9, which promotes proper DSB formation in most mammals, the comparatively subtle phenotypes of HIM-5 loss do not diminish its critical role.

      Like most other studies that have examined DSB formation in C. elegans, this work relies on indirect assays, here limited to the cytological appearance of RAD-51 foci and bivalent chromosomes, as evidence of break formation or lack thereof. Unfortunately, neither of these assays has the power to reveal the genome-wide distribution or number of breaks. These assays have additional caveats, due to the fact that RAD-51 association with recombination intermediates and successful crossover formation both require multiple steps downstream of DSB induction, some of which are likely impaired in some of the mutants analyzed here. This severely limits the conclusions that can be drawn. Given that the goal of the work is to understand the effects of individual factors on DSB induction, direct physical assays for DSBs should be applied; many such assays have been developed and used successfully in other organisms.

      We appreciate the reviewer’s thoughtful comments. We agree that RAD-51 foci are an indirect readout of DSB formation and that their dynamics can be influenced by defects in downstream repair processes. However, in C. elegans, the available methods for directly detecting DSBs are limited. Unlike other organisms, C. elegans lacks γH2AX, eliminating the possibility of using γH2AX as a DSB marker. TUNEL assays, while conceptually appealing, have proven unreliable and poorly reproducible in the germline context. Similarly, RPA foci do not consistently correlate with the number of DSBs and are influenced by additional processing steps.

      Given these limitations, RAD-51 foci remain the most widely accepted surrogate for monitoring DSB formation in C. elegans. While we fully acknowledge the caveats associated with this approach — particularly the potential effects of downstream repair defects — RAD-51 analysis continues to provide valuable insight into DSB dynamics and regulation, especially when interpreted in combination with other phenotypic assessments.

      Throughout the manuscript, the writing conflates the roles played by different factors that affect DSB formation in very different ways. XND-1 and HIM-17 have previously been shown to be transcription factors that promote the expression of many germline genes, including genes encoding proteins that directly promote DSBs. Mutations in either xnd-1 or him-17 result in dysregulation of germline gene expression and pleiotropic defects in meiosis and fertility, including changes in chromatin structure, dysregulation of meiotic progression, and (for xnd-1) progressive loss of germline immortality. It is thus misleading to refer to HIM-17 and XND-1 as DSB "accessory factors" or to lump their activities with those of other proteins that are likely to play more direct roles in DSB induction.

      It is clear that we will not reach agreement here about the direct vs indirect roles of chromatin remodelers/transcription factors in break formation. There is precedent in yeast for SPP1 and in mouse for Prdm9, both of which could also be described as transcription factors, having roles in break formation by creating an open chromatin environment for the break machinery. We envision that these proteins function in the same fashion. The changes in histone acetylation in the xnd-1 mutants support such a claim.

      We do not know what the reviewer is referring to in the statement that “XND-1 and HIM-17 have previously been shown to be transcription factors that promote the expression of many germline genes.” While the Carelli et al. paper indeed shows a role for HIM-17 in expression of many germline genes, there is only one reference to XND-1 in that manuscript (Figure S3A), which shows that half of XND-1 binding sites overlap with the co-opted germline promoters. There are no transcriptional data at all on xnd-1 mutants, save our studies (referenced herein) showing that XND-1 regulates him-5 expression.

      For example, statements such as the following sentence in the Introduction should be omitted or explained more clearly: "xnd-1 is also unique among the accessory factors in influencing the timing of DSBs; in the absence of xnd-1, there is precocious and rapid accumulation of DSBs as monitored by the accumulation of the HR strand-exchange protein RAD-51."

      We are not sure what is confusing here. The distribution of RAD-51 foci is significantly altered in xnd-1 mutants and peak levels of breaks are achieved as nuclei leave the transition zone (Wagner et al., 2010; McClendon et al., 2016). There is no other mutation that causes this type of change in RAD-51 distribution.

      "The evidence that HIM-17 promotes the expression of him-5 presented here corroborates data from other publications, notably the recent work of Carelli et al. (2022), but this conclusion should not be presented as novel here.

      We have clarified this in the text. We note that Carelli et al. (2022) showed alterations in him-5 levels by RNA-Seq but did not validate these results with quantitative RT-PCR. Thus, our studies do provide an important validation of their prior results.

      The other factors also fall into several different functional classes, some of which are relatively well understood, based largely on studies in other organisms. The roles of RAD50 and MRE-11 in DSB induction have been investigated in yeast and other organisms as well as in several prior studies in C. elegans. DSB-1, DSB-2, and DSB-3 are homologs of relatively well-studied meiotic proteins in other organisms (Rec114 and Mei4) that directly promote the activity of Spo11, although the mechanism by which they do so is still unclear.

      Whilst we agree that we understand some of the functions of the homologs, there are clearly examples in other processes of conserved proteins adopting unique regulatory functions. We should not presume evolutionary conservation until it is proven. Indeed, the comparison between the Mer2 proteins becomes particularly relevant here. For example, the RMM complex in plants does not contain PRD3, although this protein is thought to function in DSB formation and repair (Lambing et al., 2022; Vrielynck et al., 2021; Thangavel et al., 2023). In Sordaria, as well, the Mer2 homolog has distinct functions (Tesse et al., 2017).

      Mutations in PARG-1 (a Poly-ADP ribose glycohydrolase) likely affect the regulation of polyADP-ribose addition and removal at sites of DSBs, which in turn are thought to regulate chromatin structure and recruitment of repair factors; however, there is no convincing evidence that PARG-1 directly affects break formation.

      Our prior collaborative studies on PARG-1 showed that it has a non-catalytic function that promotes DSBs and that is independent of accumulation of PAR (Janisiw et al., 2020; Trivedi et al., 2022).

      CEP-1 is a homolog of p53 and is involved in the DNA damage response in the germline, but again is unlikely to directly contribute to DSB induction.

      We respectfully disagree with the reviewer’s statement. While CEP-1 is indeed a homolog of p53 and plays a major role in the DNA damage response, prior work from Brent Derry’s lab and from our group (Mateo et al., 2016) demonstrated that specific cep-1 separation-of-function alleles affect DSB induction and/or repair pathway choice independently of canonical DNA damage checkpoint activation. In particular, defects in DSB formation observed in certain cep-1 mutants can be rescued by exogenous irradiation, supporting a direct or closely linked role in promoting DSB formation rather than merely responding to damage. Thus, based on these functional data, we considered CEP-1 a relevant factor to include in our analysis. We have now clarified this rationale in the revised manuscript.

      HIM-5 and REC-1 do not have apparent homologs in other organisms and play poorly understood roles in promoting DSB induction. A mechanistic understanding of their functions would be of value to the field, but the current work does not shed light on this. A previous paper (Chung et al. G&D 2015) concluded that HIM-5 and REC-1 are paralogs arising from a recent gene duplication, based on genetic evidence for a partially overlapping role in DSB induction, as well as an argument based on the genomic location of these genes in different species; however, these proteins lack any detectable sequence homology and their predicted structures are also dissimilar (both are largely unstructured but REC-1 contains a predicted helical bundle lacking in HIM-5). Moreover, the data presented here do not reveal overlapping sets of genetic or physical interactions for the two genes/proteins. Thus, this earlier conclusion was likely incorrect, and this idea should not be restated uncritically here or used as a basis to interpret phenotypes.

      Actually, there is quite good bioinformatic analysis indicating that the rec-1 and him-5 loci evolved from a gene duplication and that each shares features of the ancestral protein (Chung et al., 2015). We are sorry if the reviewer casts aspersions on the prior literature and analyses. The homology of these genes to the ancestral protein is nearly the same as that of dsb-1, dsb-2, or dsb-3 to their ancestral homologs (<17%).

      DSB-1 was previously reported to be strictly required for all DSB and CO formation in C. elegans. Here the authors test whether the expression of HIM-5 from the pie-1 promoter can rescue DSB formation in dsb-1 mutants, and claim to see some rescue, based on an increase in the number of nuclei with one apparent bivalent (Figure 2C). This result seems to be the basis for the claim that HIM-5 coordinates the activities of other DSB proteins. However, this assay is not informative, and the conclusion is almost certainly incorrect. Notably, a substantial number of nuclei in the dsb-1 mutant (without Ppie-1::him-5) are reported as displaying a single bivalent (11 DAPI staining bodies) despite prior evidence that DSBs are absent in dsb-1 mutants; this suggests that the way the assay was performed resulted in false positives (bivalents that are not actually bivalents), likely due to inclusion of nuclei in which univalents could not be unambiguously resolved in the microscope. A slightly higher level of nuclei with a single unresolved pair of chromosomes in the dsb-1; Ppie-1::him-5 strain is thus not convincing evidence for rescue of DSBs/CO formation, and no evidence is presented that these putative COs are X-specific. The authors should provide additional experimental evidence - e.g., detection of RAD-51 and/or COSA-1 foci or genetic evidence of recombination - or remove this claim. The evidence that expression of Ppie-1::him-5 may partially rescue DSB abundance in dsb-2 mutants is hard to interpret since it is currently unknown why C. elegans expresses 2 paralogs of Rec114 (DSB-1 and DSB-2), and the age-dependent reduction of DSBs in dsb-2 mutants is not understood.

      We have removed this claim, in part because we have been unable to create the triple mutant strains needed to analyze COSA-1 foci.

      To the point about 11 vs 12 DAPI bodies: the literature is actually replete with examples of 11 DAPI bodies vs 12 in mutants with no breaks:

      Hinman et al., 2021: null allele of dsb-3 has an average of 11.6 +/- 0.6 breaks;

      Stamper et al., 2013, show just over 60% of dsb-1 nuclei with 12 DAPI bodies and 5-10% with 10 DAPI bodies (Figure 1);

      In addition, we also previously showed (Machovina et al., 2016) that a subset of meiotic nuclei have a single RAD-51 focus and can achieve a crossover. RAD-51 foci in spo-11 were also reported in Colaiacovo et al., 2003.

      Several of the factors analyzed here, including XND-1, HIM-17, HIM-5, DSB-1, DSB-2, and DSB-3, have been shown to localize broadly to chromatin in meiotic cells. Coimmunoprecipitation of pairs of these factors, even following benzonase digestion, is not strong evidence to support a direct physical interaction between proteins.

      Similarly, the super-resolution analysis of XND-1 and HIM-17 (Figure 1EF) does not reveal whether these proteins physically interact with each other, and does not add to our understanding of these proteins functions, since they are already known to bind to many of the same promoters. Promoters are also likely to be located in chromatin loops away from the chromosome axis, so in this respect, the localization data are also confirmatory rather than novel.

      While the binding to promoters would be expected to be on DNA loops, that has not been definitively shown in the worm germ line. The supplemental data of the Carelli paper suggest that there are ~250 binding sites for each protein at these co-opted promoters. This could not account for the crossover map seen in C. elegans.

      The reviewer correctly states that we do not reveal that these proteins interact, but we have shown that the two proteins co-IP and have a Y2H interaction. This interaction is supported by a recent publication (Blazickova et al., 2025) that corroborates this conclusion and identifies XND-1 in HIM-17 co-IPs, also in the presence of benzonase. We do now show, however, by immuno-localization that the two proteins appear to be adjacent but nonoverlapping. As now described in the text, AlphaFold 3 modeling and structural analysis suggest that the two proteins do interact directly and that the tagged 5’ end of HIM-17 used in our studies is likely to be at least 200nm from the putative XND-1 binding interface, a distance that is consistent with our confocal images showing frequent juxtaposition of the two proteins.

      The phenotypic analysis of double mutant combinations does not seem informative. A major problem is that these different strains were only assayed for bivalent formation, which (as mentioned above) requires several steps downstream of DSB induction. Additionally, the basis for many of the single mutant phenotypes is not well understood, making it particularly challenging to interpret the effects of double mutants. Further, some of the interactions described as "synergistic" appear to be additive, not synergistic. While additive effects can be used as evidence that two genes work in different pathways, this can also be very misleading, especially when the function of individual proteins is unknown. I find that the classification of genes into "epistasis groups" based on this analysis does not shed light on their functions and indeed seems in some cases to contradict what is known about their functions.

      As described above, each of the proteins analyzed is thought to have a direct role in regulating meiotic DSB formation, and the single mutant phenotypes are consistent with this interpretation. In almost all, if not all, of these cases, IR-induced breaks suppress the univalent phenotypes (or uncover a downstream repair defect, e.g., in mre-11), supporting this conclusion. Because this is not strict epistasis, we have changed the terminology from “epistasis groups” to “functional groups”.

      The yeast two-hybrid (Y2H) data are only presented as a single colony. While it is understandable to use a 'representative' colony, it is ideal to include a dilution series for the various interactions, which is how Y2H data are typically shown.

      The Y2H data are presented as spots on a plate and are from three to four individual transformants per interaction tested; they are not individual colonies. The experiment was repeated in triplicate from different transformations. We have now made this clearer in the materials and methods section. This approach has been successfully used to examine protein interactions in our prior manuscripts on yeast and human proteins [Gaines et al (2015) Nat. Comms 6:7834; Kondrashova et al (2017) Cancer Discovery 7:984; Garcin et al (2019) PLoS Genetics 15:e1008355; Bonilla et al (2021) eLife 1: e68080; Prakash et al (2022) PNAS 119: e2202727119, etc].

      Additional (relatively minor) concerns about these data:

      (1) Several interactions reported here seem to be detected in only one direction - e.g., MRE-11-AD/HIM-5-BD, REC-1-AD/XND-1-BD, and XND-1-AD/HIM-17-BD - while no interactions are seen with the reciprocal pairs of fusion proteins. I'm not sure if some of this is due to pasting "positive" colony images into the wrong position in the grid, but this should be addressed.

      The asymmetry in the interactions observed is due to the well-known phenomenon in yeast two-hybrid (Y2H) assays where certain plasmids exhibit self-activation when fused in one orientation, making interpretation of reciprocal interactions challenging. In our experiment, some of the plasmids indeed showed self-activation in one direction, which likely accounts for the lack of interaction seen with the reciprocal pairs of fusion proteins. We have clarified this point in the Methods.

      (2) DSB-3 was only assayed in pairwise combinations with a subset of other proteins; this should be explained; it is also unclear why the interaction grids are not symmetrical about the diagonal.

      We have now completed the analysis by adding the interactions of DSB-3 with the remaining proteins that were missing from the initial set.

      (3) I don't understand why the graphic summaries of Y2H data are split among 3 different figures (1, 2, and 3).

      We chose to split the graphic summaries of the Y2H data across Figures 1, 2, and 3 because we felt this organization better aligns with the flow of the results presented in each figure. Each set of interactions is shown in the context of the specific experiments and findings discussed in those sections, which we believe helps provide a clearer and more logical presentation of the data.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Figure 1: B) The IP is difficult to interpret - there is a band of the corresponding size to XND-1 in the control lane calling into question the specificity of the IP/Western.

      We added a supplemental figure demonstrating the specificity of the antibody and showing that there is a non-specific background band.

      C) More information about the mass spectrometry should be included. No indication of the number of times a peptide was identified, or the overall coverage of the identified proteins.

      Done

      This is important as in the results section (line 114) the authors indicate that there was "strong" interaction yet there is no way to assess this.

      D) Why wasn't hatching measured in the him-5p::him-5; him-17(ok424) strain?

      Great question. I guess we need to do this while back out for review. If anyone has suggestions of what to say here. Clearly we overlooked this point but do have the strain.

      E) Quantification of the cytology should be included.

      We have now quantified overlap between XND-1 and HIM-17

      Figure 2: C) Statistics should be included.

      Done

      E) Quantification should be included for the cytology. I recommend changing the eals15 to HIM-5.

      We included better images showing whole gonads instead of one or two nuclei. We were not sure what the reviewers want us to quantify here since the relocalization of the protein to the cytoplasm is very clear.

      I have a general issue with the use of the term epistasis - this is used to order gene function based on different mutant phenotypes, usually with null alleles. While I think the authors have valid points with how they group the different SPO-11 accessory proteins, I do not think they should use the word epistasis, but rather genetic interactions.

      We appreciate the reviewer's thoughts on this matter and have removed the term epistasis; we now use functional groups or genetic interactions throughout the text.

      Figure 4 and the nature of the X chromosome: First, I think it would help the non-C. elegans reader to include a little more information on the X chromosome with respect to its differences compared to the autosomes. I also think that, if possible, it would be beneficial to include a model of the X in Figure 4.

      We have added more about X/autosome differences in the intro and during the discussion of HIM-5 function, and have added a figure showing the differences in the behavior of the X and the autosomes during DSB/crossover formation.

      Minor points:

      Abstract: Given the findings of Silva and Smolikove on SPO-11 breaks, I recommend removing "early" from line 28 in the Abstract.

      Done

      Introduction (line 93): I think "biochemical studies" is a stretch here - I recommend "interaction studies".

      Done

      Results: (lines 160-161): mutations are not required for breaks. Line 172, there is a problem with the sentence.

      Corrected

      Reviewer #2 (Recommendations For The Authors):

      Major comments:

      (1) Figure 1B - The signal for XND-1 seems to appear both in the control and him-17::HA IP. Have the authors tested the specificity of the XND-1 antibody?

      We included a supplementary figure demonstrating the specificity of the XND-1 antibody by Western blot. This was also previously published (Wagner et al., 2010)

      (2) Figure 1D - can the authors provide an explanation why the him-5p::him-5 transgene that drives a higher expression than pie-1p::him-5 fails to suppress the Him phenotype seen in him-17? What are the HIM-5 levels like in these two strains compared to N2 and him-17 null mutants? Can this information provide explanation for the differential effect of the him-5 transgenes?

      We previously reported that him-5p::him-5 drives higher expression than pie-1p::him-5 (McClendon et al, 2016).

      The reason that him-5p::him-5 does not rescue, despite its higher expression in wild type, is that HIM-17 directly regulates expression of him-5. Since HIM-17 does not regulate the pie-1 promoter, the pie-1p::him-5 construct can at least partially suppress the him-17 mutation.

      We have (hopefully) explained this better in the text.  

      (3) Line 102 - the subheading "HIM-5 is the essential factor for meiotic breaks in the X chromosome" may not be appropriate for this section. This is what has previously been known. However, the results in Figure 1 demonstrate that a him-5 transgene can partially rescue the him-17 and xnd-1 phenotype, but not that it is essential for meiotic DSB formation on X chromosomes.

      We think some of the concern here is semantic and have changed the phraseology to say that HIM-5 is SUFFICIENT for DSBs on the X, which had not previously been shown.

      Vis-à-vis the X chromosome, in all genetic backgrounds examined, the absence of HIM-5 consistently results in a complete lack of DSBs on the X. For instance, in dsb-2 mutants, where HIM-5 is still expressed, DSBs are reduced genome-wide, but the X chromosome occasionally retains breaks. In contrast, even a weak allele of him-17 results specifically in the loss of X chromosome breaks, underscoring a unique requirement for HIM-5 in promoting DSBs on the X. While Figure 1 shows that a him-5 transgene can partially rescue him-17 and xnd-1 phenotypes, the consistent observation that X breaks are absent without HIM-5 supports its classification as sufficient for DSB formation on the X chromosome.

      (4) Figure 1E - please consider enlarging the images and showing multiple examples.

      Done.

      I also suggest that the authors perform a more rigorous analysis to support the conclusion that XND-1 and HIM-17 localize away from the axis by quantifying multiple images and doing line-scan analysis.

      Provided. New images are provided in both the main and supplemental figures, and quantification is included. There is no detectable overlap of the two proteins with one another or with the DNA axes (see quantification of overlap in Fig. 1).

      (5) Line 162 - This is the first mention of DSB-1, DSB-2, and DSB-3 in the paper. DSB-1 and DSB-2 are Rec114 homologs in C. elegans (Tesse et al., 2017), while DSB-3 is a homolog of Mei4 (Hinman et al., 2021). These proteins should be properly introduced in the introduction with appropriate citations.

      Done. We appreciate the reviewer pointing out that this was the first reference to these genes.

      (6) Line 169 - the rationale for this experiment is unclear. Why did the Y2H interaction between HIM-5 and DSB-1 prompt the authors to test the rescue of dsb-1 or dsb-2 phenotypes by the ectopic expression of him-5? Do the authors have evidence that HIM-5 level is reduced in dsb-1 or dsb-2 mutants?

      We have reorganized this section to better explain the motivation for looking at these interactions. We did see a difference in the localization of HIM-5 in the dsb-1 mutant animals, and we did have a sense that HIM-5 was critical for breaks on the X. We reasoned that it could have independent functions in promoting breaks that were not yet appreciated, and so wanted to do this experiment.

      (7) Line 172 - "very slightly reduced". This claim requires statistical analysis.

      We added statistical analysis, but we also removed this claim.

      (8) Figures 2C and 2D - Can the authors provide an explanation why the pie-1p::him-5 transgene fails to suppress the phenotypes in dsb-1, while the him-5p::him-5 transgene can? Again, the rationale for these experiments is unclear. Because of this, the interpretation is also unclear.

      The difference in rescue between the pie-1p::him-5 and him-5p::him-5 transgenes likely reflects differences in expression levels. As previously shown (McClendon et al., 2016), the him-5p::him-5 construct results in significantly higher expression of HIM-5 protein compared to pie-1p::him-5. This elevated expression likely explains its ability to partially rescue the dsb-1 phenotype. In contrast, the lower expression driven by the pie-1 promoter is insufficient to compensate for the absence of dsb-1 function. We have clarified the rationale and interpretation of these experiments in the revised manuscript to better reflect this point.

      (9) Lines 184-185 - the data for endogenously tagged HIM-5::3xHA are not shown anywhere in the paper. This must be shown.

      We have added this in the supplemental figures.

      (10) Figure 2D and 2E - what does the localization of pie-1p::him-5::GFP (eaIs15) and him-5p::him-5::GFP (eaIs4) look like in wild-type and dsb-1 mutants? Are the cytoplasmic aggregates caused by increased levels of HIM-5 expression? Can the differential behavior of him-5 transgenes provide explanation for differential rescues?

      We now show both live and fixed images of Phim-5::him-5::gfp transgenes, as well as the localization of the endogenously HA-tagged HIM-5 locus (Figure 2 and S3). In all cases, the protein is initially nuclear and then absent from meiotic nuclei with similar timing. The Ppie-1::him-5 transgene was very difficult to image due to low expression (even in wild type), so it is not shown here. We presume it is the slightly elevated level of expression of the Phim-5::him-5::gfp that can explain the differential rescue.

      (11) Lines 221-222, where are the results shown? Please refer to Figure S3.

      Done

      (12) Figure S3 - these need statistical analyses.

      Done

      (13) Lines 230-231 - what about the rec-1; parg-1; cep-1 triple mutant?

      This is an excellent suggestion and one we have not yet pursued. Given the lack of strong phenotypes in all combinations of double mutants, we prioritized other experiments. However, we agree that examining the rec-1; parg-1; cep-1 triple mutant would provide a valuable test of whether these factors act in the same pathway, and we appreciate the reviewer highlighting this potential future direction.

      (14) Line 298 - I suggest the authors take a look at the Alphafold prediction of DSB-1/DSB-2/DSB-3 and the comparison to human and budding yeast Rec114/Mei4 complex in Guo et al., 2022 eLife, which could provide insights into the Y2H results.

      We thank the reviewer for these comments and have indeed used these interactions and predicted homologies to zero in on a region of interaction between these proteins that resembles what is seen in humans and yeast, in which a dimer of REC114-like proteins wraps around and stabilizes a central Mei4 helix. This is now shown in Figure 3H, I. Satisfyingly, this modeling predicts that a trimer comprised of 2 DSB-1 proteins with DSB-3 is more stable than a DSB-1-DSB-2-DSB-3 trimer. This might explain why DSB-2 is not required in young adults and only becomes essential as DSB-1 levels drop in older animals (Rosu et al., 2013).

      (15) Can the authors introduce mutations within the DSB-1 interfaces that disrupt the interaction to either SPO-11 or DSB-2?

      We have begun to address this question by introducing targeted mutations within DSB-1. As shown in Figure 3E and 3F, mutations in the C-terminal region of DSB-1, which includes a core of four α-helices, disrupt its interaction with DSB-2 and DSB-3, but not with SPO-11. These findings suggest that the C-terminus mediates interactions specifically with DSB-2 and DSB-3.

      (16) Line 323 - The him-5 phenotypes are too weak to support the idea that it serves as the linchpin for the whole DSB complex. Do the authors have an explanation for why him-5 mutants exhibit X-chromosome-specific DSB defects?

      In response to the reviewer, above, and in the text, we have included a more detailed explanation of why we think HIM-5 has a key role in coordinating meiotic break formation. Although HIM-5 was identified for its role on the X, the phenotypes associated with DSB formation in the mutant are really quite pleiotropic and severe.

      (17) Line 436 - C. elegans lacks DSB hotspots.

      Removed

      Minor comments:

      (1) Figure 1A - please show the raw data for the yeast two-hybrid.

      We show representative yeast colonies in Figure S3.

      (2) It looks like the labeling for Figure 1B and 1C are switched.

      Fixed.

      (3) Figure 1B - what does the red box indicate? Please explain it in the legend.

      It indicates the XND-1 band. We added that information in the legend.

      (4) Figure 1C - in the legend, it was noted that the results are from GFP pulldowns of HIM-17::GFP. However, the method for Figure 1B and the method section noted that HIM-17 was tagged with 3xHA, and the pull-down was performed using anti-HA affinity matrix. Please reconcile this discrepancy.

      That’s because they were done in two different sets of experiments. For the IPs we used a HIM-17::HA strain and for the MS, a HIM-17::GFP strain.

      (5) Also in Figure 1C - please call Table S2 in the main text when discussing the mass spec results. Also, it is not clear what HIM-17 and GFP indicate in the table. What makes CKU80 different from the other proteins listed under GFP? Please explain more clearly in the legend.

      We have moved the table to the supplemental data, where we have included all of the peptide counts and gene coverage. In the revised Methods, we have included the rationale for inclusion in this table, which explains why CKU-80 differs.

      (6) Line 527 - it is unclear what experiment was done for HIM-17. Please revise it to indicate that this is for "HIM-17 immunoprecipitation". Also please indicate the strain used for HIM-17 pull-down (AV280?).

      (7) Line 113- please be specific about how the HIM-17 IP was performed. Which epitope and strains are used for pull-downs?

      This indeed was AV280. This has been added to the text and methods.

      (8) Figure 1D- What does ND mean? In the text, it was stated that there was only a minor suppression of hatching rates. The hatching rate for him-5p::him-5; him-17 must have been measured, and the data must be presented.

      ND does mean not determined. We have removed the statement about “minor suppression”. We only tested the overall population dynamics in the Phim-5::him-5;him-17(ok424) and the DAPI body counts. The failure to suppress the latter suggests there would be no effect on hatching rates, although we did not test this directly. Since we had done this for the Ppie-1::him-5;him-17 strain, we provided this information to further support the claims of genetic rescue by ectopic expression.

      (9) Line 151 - please specify that STED was used.

      We have removed the STED images, and just show the confocal images with Lightning Processing.

      (10) Figure 1E- the authors suggested that HIM-17 and XND-1 mainly localize to autosomes but not the X chromosome. However, there is not enough evidence that the chromosome excluded from HIM-17 staining is indeed an X chromosome.

      (11) Figure 1E (Line 154) - what are the active chromatin markers examined? Where are the data?

      We have previously shown that the chromosome lacking XND-1 staining is the X (Wagner et al., 2010). The X is heterochromatic and chromatin marks associated with active transcription, including H3K4me3 and HTZ-1 (a variant H2A), preferentially localize to autosomes, effectively anti-marking the X chromosome. As shown in the new Figure 1E, a single chromosome has very little XND-1 and HIM-17 associated proteins. This is the X chromosome.

      (12) Line 172 - It should be a comma instead of the period after "In dsb-1 mutants".

      Fixed

      (13) Figure S3H-K - I suggest the authors indicate the alleles of mre-11 (null vs. iow1) on the graph, similarly to him-5(e1490) to avoid confusion.

      Done

      (14) Lines 294 and 600 - Guo et al. 2022 is now published in eLife. The authors must cite the published paper, not the preprint.

      Fixed

      (15) Line 407 - the reference Carelli et al., 2022 is missing.

      Added

      (16) Line 766 - please remove "is" before nuclear.

      Done

      Reviewer #3 (Recommendations For The Authors):

      Major issues:

      In my view, the most interesting mechanistic finding in the paper is the evidence that HIM-5 may not bind to chromatin in the absence of DSB-1. If validated, this would suggest that HIM-5 is likely to be directly involved in a process that promotes break formation, in contrast to factors such as HIM-17 and XND-1. It does not, however, support the idea that HIM-5 is at the top of a hierarchy of DSB factors, as it is interpreted here. More importantly, the data supporting this claim are unconvincing; only a single image of an unfixed gonad from an animal expressing HIM-5::GFP is shown. Immunofluorescence should be performed and the results must be quantified.

      We have provided additional images of the HIM-5 relocalization to show that we observed this in both fixed and live worms with two different tagged strains. The exclusion from the nucleus is seen in all scenarios. Whether the protein now accumulates exclusively in the cytoplasm or is destabilized is challenging to address with the fixed images, because defining “background” staining is arbitrary.

      More generally, this type of analysis, looking at the interdependence of different factors for their association with chromosomes, is much more informative than the genetic interaction data presented in the paper, which does not seem to provide any mechanistic insights into the functions of the factors analyzed. The paper could potentially be greatly improved through a more extensive, systematic analysis of the interdependence of DSB-promoting factors for their localization to chromosomes.

      We have added this analysis for XND-1 and HIM-17 and show that they are not interdependent for chromosome association. We also provide, for the first time, data on the localization of HIM-5 in the dsb-1 mutant. Many of the other interactions have already been shown in the literature and/or were not warranted based on the lack of genetic interaction we present here.

      Minor issues:

      The title is vague and inconclusive. A more concrete title summarizing the major findings would help readers to assess whether the work is of interest.

      We have discussed the title extensively with all authors and all would like to keep the current title.

      The authors claim that the expression of HIM-5 from a different promoter (Ppie-1::him-5) but not its endogenous promoter (Phim-5::him-5) can partially rescue the DSB defect in him-17 mutants. To support this claim, they should really quantify the germline expression of HIM-5 in wild-type, him-17, him-17; Ppie-1::him-5, and Phim-5::him-5; him-17.

      We previously reported the expression of both transgenes in the N2 background (McClendon et al., 2016).

      Panel O appears to be missing from Figure S3.

      Fixed

      The evidence for chromosome fusions in cep-1; mre-11 mutants shown in S4D is not convincing and the claim should be removed unless stronger evidence can be obtained.

      A clearer image has been added

      The basis of the following statement is unclear: "Furthermore, rec-1;him-5 double mutants give an age-dependent severe loss of DSBs (like dsb-2 mutants) suggesting that the ancestral function of the protein may have a more profound effect on break formation." The manuscript does not seem to include data regarding age-dependent loss of DSBs and no other publication is cited to support this claim. The interpretation is also perplexing; I think that it may be predicated on the idea that REC-1 and HIM-5 are paralogs, but as stated above, this claim is not well supported and is likely specious.

      We have added the reference. This was shown in Chung et al., 2013 – the paper that presented the cloning of the rec-1 locus.

  4. Sep 2025
    1. Author response:

      Joint Public Review

      This manuscript puts forward the provocative idea that a posttranslational feedback loop regulates daily and ultradian rhythms in neuronal excitability. The authors used in vivo long-term tip recordings of the long trichoid sensilla of male hawkmoths to analyze spontaneous spiking activity indicative of the ORNs' endogenous membrane potential oscillations. This firing pattern was disrupted by pharmacological blockade of the Orco receptor. They then use these recordings together with computational modeling to predict that Orco receptor neuron (ORN) activity is required for circadian, not ultradian, firing patterns. Orco did not show a circadian expression pattern in a qPCR experiment, and its conductance was proposed to be regulated by cyclic nucleotide levels. This evidence led the authors to conclude that a post-translational feedback loop (PTFL) clockwork, associated with the ORN plasma membrane, allows for temporal control of pheromone detection via the generation of multi-scale endogenous membrane potential oscillations. The findings will interest researchers in neurophysiology, circadian rhythms, and sensory biology. However, the manuscript has limited experimental evidence to support its central hypothesis and is undermined by several questionable assumptions that underlie their data analysis and model builds, as well as insufficient biological data, including critical controls to validate and/or fully justify the model the authors are proposing.

      We thank the reviewers for their thorough and thoughtful comments and believe that the manuscript will be much stronger once we incorporate the requested changes.

      Please note that we used ORN as acronym for “olfactory receptor neuron” throughout the manuscript. ORNs contain odorant receptors (ORs), and in insects these ORs have to associate with the olfactory receptor co-receptor (Orco) in the cilium of the neuron to form functional OR-Orco complexes for odorant detection. Besides this chaperone function, Orco can form homomers with the potential to act as ionic pacemaker channels; a role which we explore in this study.

      Strengths:

      The study is notable for its combination of long-term in vivo tip recordings with computational modeling, which is technically challenging and adds weight to the authors' claims. The link between Orco, cyclic nucleotides, and circadian regulation is potentially important for sensory neuroscience, and the modeling framework itself - a stochastic Hodgkin-Huxley formulation that explicitly incorporates channel noise - is a solid and forward-looking contribution. Together, these elements make the study conceptually bold and of clear interest to circadian and olfactory biologists.

      Major weaknesses:

      At the same time, several limitations temper the conclusions. The pharmacological evidence relies on a single antagonist and concentration, without key controls. The circadian analysis is based on relatively small numbers of neurons, with rhythms detected only in subsets, and the alignment procedure used in constant darkness raises concerns of bias. The molecular evidence is sparse, with only three qPCR timepoints, and the model, while creative, rests on assumptions that are not yet fully supported by in vivo data.

      Please see our responses to the detailed comments.

      Detailed comments are provided below:

      (1) The role for Orco proposed in the authors' model largely stems from the effects seen following the administration of (a single dose) of the Orco antagonist, OLC15. However, this hypothesis is undercut by the lack of adequate pharmacological controls, including a basic multipoint OLC15 dose-response series in addition to the administration of blockers for the other channels that are embedded in their model, but which were ruled out as being involved in the modulation of biological rhythms. In addition, these studies would (ideally) also benefit from the inclusion of the same concentration (series) of an inactive OLC15 analog to better control for off-target effects.

      The Orco agonist VUAA1 (Jones et al., 2011) binds directly to Orco and increases the channel open time probability. In M. sexta hawkmoths, we have already published that VUAA1 increases the low spontaneous activity of ORNs in a dose-dependent fashion (Nolte et al., 2016). Chen and Luetje (2012) systematically varied the chemical structure of VUAA1 to identify new Orco ligands and discovered 22 Orco Ligand Candidates (OLC) that either activated or inhibited Orco. In their heterologous expression system, Orco was most sensitive to inhibition by OLC15. Based on these results, we published a dose-response curve of OLC15 inhibition (1-100 µM) using in vivo tip recordings of pheromone-sensitive long trichoid sensilla of M. sexta (Nolte et al., 2016). In that study, we also demonstrated that OLC15 antagonizes the VUAA1 activation of Orco.

      Furthermore, we tested other published Orco antagonists in in vivo assays in intact hawkmoths, focusing on amiloride-derived antagonists, because we previously identified an amiloride-sensitive cation channel in hawkmoth ORNs. We found that, in contrast to OLC15, the amilorides HMA and MIA were not Orco-specific but instead affected different targets depending on time of day (Nolte et al., 2016). Based on those experiments and the dose-response curves, we determined that the Orco agonist VUAA1 (Jones et al., 2011) and the Orco antagonist OLC15 (Chen and Luetje, 2012) worked best in hawkmoth ORNs to target Orco pharmacologically. Based on comparative tests with other published Orco antagonists, we have since used a dose of 50 µM OLC15 in all further experiments.

      We will clarify the Methods section accordingly.

      (2) The expression pattern of Orco was assessed using qPCR at only three timepoints. Rhythmic transcripts can easily be missed with such sparse sampling (Hughes et al., 2017). A minimum of six evenly spaced timepoints across a 24-hour cycle would be required to confidently rule out circadian transcriptional regulation. In addition, the use of the timeless mRNA control from another study is not acceptable. Furthermore, qPCR analysis measures transcript abundance, not transcription, as the authors repeatedly state. Transcriptional studies would require nuclear run-off or, more recently, can be done with snRNAseq analysis. Taken together, these concerns undermine the authors' desire to rule out TTFL-based control that directly led them to implicate a PTFL-based model.

      We agree with the referees that more time points and a direct comparison between timeless and Orco mRNA levels should be included in this manuscript. We will include these additional qPCR experiments and edit the manuscript to make clear that we measure transcript abundance, but we will not perform snRNAseq analysis because of time and financial constraints. We are currently working on the transcriptional control of Orco, both during ontogeny and throughout the day, but this work in progress is beyond the scope of this manuscript.
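
      The sampling concern in point (2) can be made concrete with a minimal sketch. It assumes a single-component cosinor model (mean plus one 24-h cosine and sine term) and one mean expression value per timepoint; this is an illustration, not the analysis used in the manuscript. The function names and the expression values are hypothetical and invented for the example. With three timepoints the three-parameter model fits any data exactly, so no residual degrees of freedom remain to test for rhythmicity; with six timepoints, three remain.

```python
import math

def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def cosinor_fit(times_h, values, period_h=24.0):
    """Least-squares fit of y = M + A*cos(wt) + B*sin(wt).

    Returns (coefficients, residual sum of squares, residual degrees of freedom).
    """
    w = 2.0 * math.pi / period_h
    X = [[1.0, math.cos(w * t), math.sin(w * t)] for t in times_h]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(X, values)) for i in range(3)]
    beta = solve_3x3(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in X]
    rss = sum((y - f) ** 2 for y, f in zip(values, fitted))
    return beta, rss, len(values) - 3

# Hypothetical relative expression values, invented for illustration only.
_, rss3, df3 = cosinor_fit([0, 8, 16], [1.0, 1.3, 0.8])
_, rss6, df6 = cosinor_fit([0, 4, 8, 12, 16, 20], [1.0, 1.3, 0.8, 1.1, 0.9, 1.2])

print(f"3 timepoints: RSS = {rss3:.3g}, residual df = {df3}")
print(f"6 timepoints: RSS = {rss6:.3g}, residual df = {df6}")
```

      Biological replicates at each timepoint would add degrees of freedom, but they would not add information about the shape of the waveform between the sampled phases.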

      (3) The modelling presented is based on Orco as a ZT-dependent conductance tied to the cAMP oscillations that were reported by this group in the cockroach and from the presence and functionality in Manduca of homomeric Orco complexes that are devoid of tuning ORs. While these complexes have been generated in cell culture and other heterologous expression systems, as well as presumably exist in vivo in the Drosophila empty neuron and other tuning OR mutants, there is no evidence that these complexes exist in wild-type Manduca ORNs. While this doesn't necessarily undermine every aspect of their models, the authors should note the presence of Orco/OR complexes rather than Orco homomeric complexes.

      Our ELISAs found circadian oscillations in cAMP levels not only in antennae of the Madeira cockroach (Schendzielorz et al., 2014, 2012), but also in hawkmoth antennae (Schendzielorz et al., 2015). We will add the 2015 citation to the Modeling chapter in the Methods section to clarify this.

      We agree with the referees that we cannot distinguish between Orco homo- and heteromers in the different compartments of our hawkmoth ORNs. Thus, as the referee suggests, we will add text regarding the presence and localization of OR-Orco heteromers. However, we have indications that Orco homomers could indeed be present in the hawkmoth ORNs. In a heterologous expression system, MsexOrco expression alone was sufficient to increase intracellular Ca<sup>2+</sup> levels in response to VUAA1 application (Nolte et al., 2013). In differentiating primary cell cultures of hawkmoth antennae, Orco expression started during a developmental time window where ORNs did not yet express pheromone receptors, and Orco affected spontaneous activity (Nolte et al., 2016). Thus, Orco homomers are present in developing hawkmoth ORNs during a time window where ORNs already express spontaneous activity but cannot heteromerize with pheromone receptors. However, we do not know whether and in what ratio homo- and heteromers of Orco and ORs are present in the respective sensillum compartments of adult hawkmoths (Nolte et al., 2013; Stengl, 1994; Stengl and Hildebrand, 1990).

      We will clarify our manuscript accordingly.

      (4) Some aspects of the authors' models, most notably the decision to phase align/optimize their DD and OLC15 recordings, are likely to bias their interpretations.

      There is consensus that insects display daily and circadian rhythms in pheromone-dependent mating, odor-gated feeding, and egg-laying behavior that phase-lock to environmental rhythms and correspond to daily/circadian rhythms of sensory neuron physiology (e.g., Merlin et al., 2007; Rymer et al., 2007; Schendzielorz et al., 2015, 2012). However, circadian rhythms are easily masked by stress, such as the disturbances during a very challenging long-term recording experiment over several days. In addition, we observed in our animal-rearing facility that, under LD 17:7 light-dark cycles, the originally nocturnal hawkmoths M. sexta distribute their activity over the course of the day, so that we find nocturnal as well as diurnal individuals. Thus, light-dark cycles were not enough to ensure phase-synchronized behavioral rhythms, and it is very likely that the nocturnal hawkmoths rely heavily on pheromone/odor-dependent synchronization, as also found in other moth species (Ghosh et al., 2024). Here, we used isolated males that were never exposed to female pheromones, so their circadian activity patterns readily disperse. Therefore, in free-running conditions it became necessary first to determine the respective behavioral rhythm of each animal and then to phase-align the activity patterns to allow for statistical analysis. Otherwise, circadian differences would average out in a free-running population. As requested by the referees in point (7), we will use additional tests for rhythmicity in each of our recordings and revise the manuscript accordingly.

      Assuming that hawkmoths need pheromone presence as additional Zeitgeber, we are currently working on a new set of experiments where we attempt to improve synchronization by exposure to LD cycles and pheromone before DD and OLC15 recordings. We will add these experiments to the manuscript.
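
      The averaging argument in the response to (4) can be illustrated with a minimal sketch. It assumes idealized, noise-free 24-h sinusoidal firing rates sampled hourly for two days, with phases spread evenly around the cycle; the helper names and all values are hypothetical. Each series is aligned by the phase of its own 24-h Fourier component, which is only one of several possible alignment procedures and not necessarily the one used in the manuscript. The sketch shows why an unaligned population average can appear arrhythmic; it does not address the reviewers' concern that alignment itself may introduce bias when the individual recordings are noisy.

```python
import math

PERIOD_H = 24.0
N_HOURS = 48  # two full cycles, sampled hourly (assumed)
W = 2.0 * math.pi / PERIOD_H

def first_harmonic(series):
    """Return (amplitude, phase in radians) of the 24-h Fourier component."""
    n = len(series)
    c = 2.0 / n * sum(y * math.cos(W * t) for t, y in enumerate(series))
    s = 2.0 / n * sum(y * math.sin(W * t) for t, y in enumerate(series))
    return math.hypot(c, s), math.atan2(s, c)

def simulate(phase_rad):
    """Idealized firing rate: baseline 1 with a unit-amplitude 24-h rhythm."""
    return [1.0 + math.cos(W * t - phase_rad) for t in range(N_HOURS)]

def align_to_peak(series):
    """Circularly shift a series so that its 24-h component peaks at t = 0."""
    _, phase = first_harmonic(series)
    shift = round(phase / W)  # peak time in hours
    n = len(series)
    return [series[(t + shift) % n] for t in range(n)]

def mean_series(all_series):
    return [sum(vals) / len(vals) for vals in zip(*all_series)]

# Eight hypothetical animals whose phases are dispersed evenly over the day.
animals = [simulate(2.0 * math.pi * j / 8) for j in range(8)]

amp_raw, _ = first_harmonic(mean_series(animals))
amp_aligned, _ = first_harmonic(mean_series([align_to_peak(a) for a in animals]))

print(f"24-h amplitude of unaligned average: {amp_raw:.3f}")
print(f"24-h amplitude of aligned average:   {amp_aligned:.3f}")
```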

      (5) The tip recordings from long trichoid sensilla are critical aspects of this study. These recordings were carried out on upper sensillar tips located on the distal-most second annulus. Since there are approximately 80 annuli on the Manduca antennae, it is unclear whether the recordings are representative of the antennal response.

      We think the reviewers might have misinterpreted our description of the recording site. In the Methods, we state that we clip off the 20 most distal annuli (leaving a stump of about 60 annuli) and insert the reference electrode into the flagellum up to the second annulus from the cut end, i.e., the recording site is located at 2/3 – 3/4 of the antenna length as seen from the head of the animal. We will make this clearer in the Methods section.

      In addition, our lab showed with antibody staining against Orco that apparently all ORNs that innervate long and short trichoid sensilla along the whole flagellum display the same staining pattern (Nolte et al., 2016). Furthermore, our patch clamp recordings of primary cell cultures of whole male antennae found largely overlapping ion channel populations across ORNs. This would indicate that all ORNs, whether they express pheromone or general odorant receptors, could potentially share the same Orco-dependent spontaneous activity rhythms. In our lab, experimenters who recorded over several years from long trichoid sensilla on different annuli did not detect obvious differences in either the spontaneous activity or the pheromone responses (cf. Dolzer et al., 2003; Gawalek and Stengl, 2018; Schneider et al., 2025). Thus, it is very likely that we are reporting a general encoding mechanism that is not locally restricted along the antennal flagellum.

      (5.1) The authors do not provide any data in support of their cAMP/cGMP-based Orco gating…

      There are publications supporting cyclic nucleotide gating of Orco in Drosophila, but only after previous phosphorylation via protein kinase C (PKC; review: Wicher and Miazzi, 2021). Since Orco is highly conserved among insect species, it is likely that these PKC- and cGMP/cAMP-dependent regulations are present in other insect species. We are currently running thorough tip-recording experiments on the regulation of Orco gating, which are beyond the scope of this manuscript. However, we will add a set of experiments to this manuscript that demonstrates cAMP gating of Orco.

      (5.2)… and the PTFL model proposed is somewhat disappointing.

      For a detailed introduction of our PTFL membrane clock hypothesis please see our opinion paper (Stengl and Schneider, 2024).

      (5.3) The model seems to be influenced by their long-held proposal that insect olfactory signaling has a critical metabotropic component involving cyclic nucleotides, PKC, etc, a view that may be influenced by the use of Orco homomeric complexes generated in HEK cells.

      Indeed, we propose a metabotropic pheromone-transduction cascade, which in moths and cockroaches is based on G-protein-mediated activation of phospholipase C but not on adenylyl cyclase activation. Our hypothesis is not influenced by HEK cell heterologous expression studies of Orco but is supported by our own work comparing in vivo tip recordings of intact hawkmoths with patch clamp experiments on hawkmoth primary cell cultures of olfactory receptor neurons, which are able to respond to their species-specific pheromones in vitro (Schneider et al., 2025; Stengl, 2010; Stengl and Funk, 2013; Wicher and Miazzi, 2021). In addition, a multitude of publications by other laboratories with in vivo and in vitro studies using physiological, genetic, and immunocytochemical assays all support a metabotropic signal transduction cascade in insect olfaction (reviews: Stengl, 2010; Stengl and Funk, 2013; Wicher and Miazzi, 2021). In contrast, the hypothesis suggesting a solely ionotropic pheromone- and general odor-dependent transduction cascade for all insect species rests on very sparse experimental evidence, primarily heterologous expression studies in systems such as HEK cells that lack the insect’s WT molecular surroundings and thus cannot predict OR-Orco function in vivo. Furthermore, the ionotropic hypothesis relies heavily on the argument that an inverse 7TM receptor cannot couple to G-proteins, an argument that lacks careful backing from biochemical and structural studies. In addition, the ionotropic hypothesis lacks support from carefully performed physiological in vivo studies in different insect species that analyze the distinct kinetic components of ORNs' odor/pheromone responses and employ physiological concentrations and durations of odor/pheromone stimuli (please see our most recent publication, Schneider et al. (2025)).

      (5.4) Nevertheless, structural studies on Orco do not support a cyclic nucleotide binding site, although PKC-based phosphorylation has been implicated in the fine-tuning/adaptation of olfactory signaling.

      While structural studies did not find evidence for conserved known cyclic nucleotide binding sites on Orco, this does not exclude the presence of so far unknown binding sites, or of sites that become accessible only after a specific sequence of phosphorylations of the many phosphorylation sites on Orco. Indeed, physiological studies in Drosophila presented evidence for cyclic nucleotide dependence of Orco after previous PKC-dependent phosphorylation (Getahun et al., 2013). Our ongoing in vivo experiments in hawkmoths further corroborate a PKC- and cAMP-dependent modulation of Orco. These studies will be published in a follow-up publication.

      (6) Because only 5/11 LD and 7/10 DD animals showed daily rhythms, with averages lacking clear daily modulation, the methods are not sufficiently reliable to reveal novel underlying mechanisms of circadian rhythm generation. The reported results are therefore not yet reliable or quantifiable. To quantify their results, the authors should apply tests for circadian rhythmicity using methods such as RAIN, JTK_CYCLE, MetaCycle, or ECHO. The use of FFT and Wavelet is applauded, but these methods do not have tests of significance for rhythms and can be biased when analyzing data in which there could only be 1-3 circadian cycles. Because the conclusions appear to be based on 11-12 neurons that were recorded for 2-4 days, the reader is concerned that the methods are not yet perfected to provide strong evidence for circadian regulation of spontaneous firing of ORNs. The average data (e.g., Figure 3Bii and 3Cii) highlight the apparent lack of daily rhythms. In summary, the results would be more compelling if more than 50% of the recordings had significant circadian amplitudes and with similar periods and phases.

      The long-term tip recordings of intact hawkmoths are very challenging and take a very long time to accomplish; thus, we are very happy that we succeeded in obtaining so many of them (N=34). Since 5/11 LD recordings and 7/10 DD recordings revealed daily/circadian rhythmicity, and since many other physiological recordings at different ZTs by different members of our laboratory all revealed ZT-dependent pheromone transduction, we can be certain that the physiology of hawkmoth antennae is under strict circadian control. Please see also our response to (4) above, which comments on the phase dispersal of activity rhythms observed in our experiments, as well as in the behavior of hawkmoth males in the mating cage.

      Nevertheless, we will follow the advice of the referees to apply additional tests for significance of rhythms in spontaneous activity, and we are thankful for the tests suggested that we were not aware of.
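
      As a minimal sketch of what a significance test for rhythmicity adds beyond a spectral estimate, the following uses a simple permutation test on the 24-h Fourier amplitude of a binned firing-rate series. This is a deliberately simpler stand-in and is not RAIN, JTK_CYCLE, MetaCycle, or ECHO; the function names, bin width, and simulated data are assumptions made for the example. Because shuffling destroys any autocorrelation in the series, the test is likely to be anti-conservative for real spike-rate data and should be read as an illustration only.

```python
import math
import random

def amplitude_at_period(series, dt_h, period_h=24.0):
    """Amplitude of the Fourier component at period_h for evenly binned data."""
    w = 2.0 * math.pi / period_h
    n = len(series)
    mean = sum(series) / n
    c = sum((y - mean) * math.cos(w * i * dt_h) for i, y in enumerate(series))
    s = sum((y - mean) * math.sin(w * i * dt_h) for i, y in enumerate(series))
    return 2.0 * math.hypot(c, s) / n

def permutation_p_value(series, dt_h, period_h=24.0, n_perm=1000, seed=0):
    """Fraction of time-shuffled series whose amplitude reaches the observed one."""
    rng = random.Random(seed)
    observed = amplitude_at_period(series, dt_h, period_h)
    shuffled = list(series)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if amplitude_at_period(shuffled, dt_h, period_h) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Simulated data, invented for illustration: three days in 1-h bins.
noise = random.Random(1)
w24 = 2.0 * math.pi / 24.0
rhythmic = [5.0 + 2.0 * math.cos(w24 * t) + noise.gauss(0.0, 1.0) for t in range(72)]
arrhythmic = [5.0 + noise.gauss(0.0, 1.0) for _ in range(72)]

p_rhythmic = permutation_p_value(rhythmic, dt_h=1.0)
p_arrhythmic = permutation_p_value(arrhythmic, dt_h=1.0)
print(f"rhythmic series:   p = {p_rhythmic:.4f}")
print(f"arrhythmic series: p = {p_arrhythmic:.4f}")
```

      Applied per recording, a test of this kind yields a p-value and an amplitude for each neuron, which is the per-recording quantification the reviewers ask for in points (6) and (7).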

      (7) The statement that circadian patterns of ORN firing are lost with the Orco antagonist (OLC15) is not strongly supported. The manuscript should be revised to quantify how Orco changed circadian amplitude in the 12 recorded neurons. Measures of circadian amplitude can avoid confusing/vague statements like Line 394 “low and high frequency bands appeared to merge during the activity phase around ZT 0 in the animals that showed clear circadian rhythms (N = 5 of 11 in LD)”. The conclusion that Orco blocks circadian firing appears to be contradicted by Figure 6, which indicates that ~6 of these neurons had circadian periods detected by wavelet. The manuscript would be strengthened with details about the specificity and reproducibility of the Orco antagonist. The authors quantify the gradual decrease in firing with the slope of a linear fit to estimate how the “effectiveness [of OLC15] increased over time.” They conclude that the drug “obliterated circadian rhythms and attenuated the spontaneous activity in several, but not all experiments (N = 8 of 12).” The report would be greatly strengthened with corroborating data from additional Orco antagonists and additional doses of OLC15 (the authors use only 50 µM OLC15).

      We will revise our data analysis, according to the valuable suggestions of the referees.

      However, based upon our previous studies with other Orco antagonists and different doses of OLC15 (Nolte et al., 2016) we found that 50 µM OLC15 is the best Orco antagonist dose in M. sexta to target Orco-dependent modulation of spontaneous action potential activity of hawkmoth olfactory receptor neurons. Please see also our response to (1).

      (8) The manuscript includes several statements that are more speculation than conclusion. For example, there is no evidence for tuning or plasticity in this report. Statements like the following should be removed or addressed with experiments that show changes in odor response specificity or sensitivity: "ORN signalosomes are highly plastic endogenous PTFL clocks comprising receptors for circadian and ultradian Zeitgebers that allow to tune into internal physiological and external environmental rhythms as basis for active sensing." (Discussion Line 622). The paper concludes that (line 380) "mean frequency of spontaneous spiking and the frequency of bursting expressed daily modulation, and are both most likely controlled via a circadian clock that targets the leak channel Orco." This is too bold given the available results.

      We will revise the discussion accordingly and clarify which statements are supported via published evidence and which are predictions based upon our novel hypothesis published in our opinion paper (Stengl and Schneider, 2024).

      (9.1) Because Orco conductance is modulated by cyclic nucleotides, it remains highly plausible that circadian regulation occurs upstream at the level of signaling pathways (e.g., calcium, calcium-binding proteins, GPCRs, cyclases, phosphodiesterases).

      We agree with the referees that it is very likely that there are multiple layers of interconnected feedback cycles that control Orco localization and activity. Our novel hypothesis suggests interlocked TTFL and PTFL control of physiological circadian rhythms, not strictly hierarchical TTFL control, which would require a daily turnover of membrane proteins and transcriptional control via the established TTFL clock in insect ORNs. We are currently searching for TTFL control at all levels of odor/pheromone transduction using ZT-dependent transcriptomics in combination with qPCR and single-nucleus transcriptomics, including all the molecules suggested by the referees. These studies are ongoing, time-consuming, and costly, and are beyond the scope of this manuscript.

      (9.2) The possibility that circadian oscillations of cyclic nucleotides are generated by the canonical TTFL mechanism has not been excluded. In fact, extensive work in Drosophila has demonstrated that the TTFL-based molecular clock proteins are required for circadian rhythms in olfaction.

      Our experiments that test circadian TTFL control at different levels of the cAMP transduction cascade in hawkmoth antennae are under way and are part of another publication. We will revise our discussion accordingly.

      The published experiments on TTFL-dependent control of Drosophila olfaction that we are aware of (Krishnan et al., 1999; Tanoue et al., 2004) do not exclude interlinked PTFL and TTFL clocks. Krishnan et al. (1999) demonstrate that the TTFL clock in antennal olfactory receptor neurons correlates with circadian rhythms in odor responses measured in electroantennogram (EAG) recordings, not in single sensillum recordings as in our experiments. EAG recordings comprise not only voltage responses of the olfactory sensory neurons but also voltage changes generated in non-neuronal antennal cells such as trichogen and tormogen cells, which build the transepithelial potential gradient via V-ATPases that generate the high K<sup>+</sup> concentration in the sensillum lymph (Jain et al., 2024; Klein, 1992; Thurm and Küppers, 1980). In addition, EAG recordings most likely contain responses of afferent neurons originating from somata in the brain that maintain central control of the antennae. Thus, EAG recordings are difficult to interpret.

      (11) A defining feature of circadian oscillators is the feedback mechanism that generates a time delay (e.g., PERIOD/TIMELESS repressing their own transcription). While the authors describe how cyclic nucleotides can regulate Orco conductance, they do not provide a convincing explanation of how Orco activity could, in turn, feed back into the proposed PTFL to sustain oscillations. For these reasons, the authors should consider:

      a) Providing a broader discussion of non-TTFL models of circadian rhythms (e.g., redox cycles, post-translational modifications).

      We will revise the discussion accordingly.

      b) Reassessing Orco expression using a higher-resolution temporal sampling (≥6 timepoints per 24 h).

      We will add those experiments to the revised version of the manuscript (see our response to (2)).

      c) Clarifying or revising the PTFL model to explicitly address how feedback would be achieved. Alternatively, the data may be more consistent with Orco conductance rhythms being regulated by post-translational mechanisms downstream of the canonical TTFL oscillator, as suggested by the Drosophila olfactory system literature.

      We will revise the manuscript accordingly.

      Minor weaknesses:

      (1) The authors should compare the firing patterns of ORN neurons to the bursts, clusters, and packets of retinal efferent spikes reported in Liu JS and Passaglia CL (2011; JBR). By comparing measures in moths to measures in Limulus, the authors might be able to address the question: Is the daily firing pattern of ORN neurons likely a conserved feature of circadian control of sensory sensitivity?

      We will revise the discussion accordingly.

      (2) The methods need further details. For example, it is unclear if or how single neuron activity was discriminated and whether the results were compromised by the relatively large environmental fluctuations in temperature (21-27 °C), humidity (35-60%), or other cues known to modulate spontaneous firing.

      We will clarify the Methods section.

      References

      Chen S, Luetje CW. 2012. Identification of New Agonists and Antagonists of the Insect Odorant Receptor Co-Receptor Subunit. PLOS ONE 7:e36784. doi:10.1371/journal.pone.0036784

      Dolzer J, Fischer K, Stengl M. 2003. Adaptation in pheromone-sensitive trichoid sensilla of the hawkmoth Manduca sexta. J Exp Biol 206:1575–1588. doi:10.1242/jeb.00302

      Gawalek P, Stengl M. 2018. The Diacylglycerol Analogs OAG and DOG Differentially Affect Primary Events of Pheromone Transduction in the Hawkmoth Manduca sexta in a Zeitgebertime-Dependent Manner Apparently Targeting TRP Channels. Front Cell Neurosci 12:218. doi:10.3389/fncel.2018.00218

      Getahun MN, Olsson SB, Lavista-Llanos S, Hansson BS, Wicher D. 2013. Insect Odorant Response Sensitivity Is Tuned by Metabotropically Autoregulated Olfactory Receptors. PLOS ONE 8:e58889. doi:10.1371/journal.pone.0058889

      Ghosh S, Suray C, Bozzolan F, Palazzo A, Monsempès C, Lecouvreur F, Chatterjee A. 2024. Pheromone-mediated command from the female to male clock induces and synchronizes circadian rhythms of the moth Spodoptera littoralis. Curr Biol 34:1414-1425.e5. doi:10.1016/j.cub.2024.02.042

      Jain K, Prelic S, Hansson BS, Wicher D. 2024. Expression of Drosophila melanogaster V-ATPases in Olfactory Sensillum Support Cells. Insects 15:1016. doi:10.3390/insects15121016

      Jones PL, Pask GM, Rinker DC, Zwiebel LJ. 2011. Functional agonism of insect odorant receptor ion channels. Proc Natl Acad Sci 108:8821–8825. doi:10.1073/pnas.1102425108

      Klein U. 1992. The insect V-ATPase, a plasma membrane proton pump energizing secondary active transport: immunological evidence for the occurrence of a V-ATPase in insect ion-transporting epithelia. J Exp Biol 172:345–354. doi:10.1242/jeb.172.1.345

      Krishnan B, Dryer SE, Hardin PE. 1999. Circadian rhythms in olfactory responses of Drosophila melanogaster. Nature 400:375–378. doi:10.1038/22566

      Merlin C, Lucas P, Rochat D, François M-C, Maïbèche-Coisne M, Jacquin-Joly E. 2007. An Antennal Circadian Clock and Circadian Rhythms in Peripheral Pheromone Reception in the Moth Spodoptera littoralis. J Biol Rhythms 22:502–514. doi:10.1177/0748730407307737

      Nolte A, Funk NW, Mukunda L, Gawalek P, Werckenthin A, Hansson BS, Wicher D, Stengl M. 2013. In situ Tip-Recordings Found No Evidence for an Orco-Based Ionotropic Mechanism of Pheromone-Transduction in Manduca sexta. PLOS ONE 8:e62648. doi:10.1371/journal.pone.0062648

      Nolte A, Gawalek P, Koerte S, Wei H, Schumann R, Werckenthin A, Krieger J, Stengl M. 2016. No Evidence for Ionotropic Pheromone Transduction in the Hawkmoth Manduca sexta. PLOS ONE 11:e0166060. doi:10.1371/journal.pone.0166060

      Rymer J, Bauernfeind AL, Brown S, Page TL. 2007. Circadian rhythms in the mating behavior of the cockroach, Leucophaea maderae. J Biol Rhythms 22:43–57. doi:10.1177/0748730406295462

      Schendzielorz J, Schendzielorz T, Arendt A, Stengl M. 2014. Bimodal Oscillations of Cyclic Nucleotide Concentrations in the Circadian System of the Madeira Cockroach Rhyparobia maderae. J Biol Rhythms 29:318–331. doi:10.1177/0748730414546133

      Schendzielorz T, Peters W, Boekhoff I, Stengl M. 2012. Time of Day Changes in Cyclic Nucleotides Are Modified via Octopamine and Pheromone in Antennae of the Madeira Cockroach. J Biol Rhythms 27:388–397. doi:10.1177/0748730412456265

      Schendzielorz T, Schirmer K, Stolte P, Stengl M. 2015. Octopamine Regulates Antennal Sensory Neurons via Daytime-Dependent Changes in cAMP and IP3 Levels in the Hawkmoth Manduca sexta. PLOS ONE 10:e0121230. doi:10.1371/journal.pone.0121230

      Schneider AC, Schröder K, Chang Y, Nolte A, Gawalek P, Stengl M. 2025. Hawkmoth Pheromone Transduction Involves G-Protein–Dependent Phospholipase Cβ Signaling. eNeuro 12:ENEURO.0376-24.2024. doi:10.1523/ENEURO.0376-24.2024

      Stengl M. 2010. Pheromone Transduction in Moths. Front Cell Neurosci 4:133. doi:10.3389/fncel.2010.00133

      Stengl M. 1994. Inositol-trisphosphate-dependent calcium currents precede cation currents in insect olfactory receptor neurons in vitro. J Comp Physiol A 174:187–194. doi:10.1007/BF00193785

      Stengl M, Funk NW. 2013. The role of the coreceptor Orco in insect olfactory transduction. J Comp Physiol A 199:897–909. doi:10.1007/s00359-013-0837-3

      Stengl M, Hildebrand JG. 1990. Insect olfactory neurons in vitro: morphological and immunocytochemical characterization of male-specific antennal receptor cells from developing antennae of male Manduca sexta. J Neurosci 10:837–847. doi:10.1523/JNEUROSCI.10-03-00837.1990

      Stengl M, Schneider AC. 2024. Contribution of membrane-associated oscillators to biological timing at different timescales. Front Physiol 14:1243455. doi:10.3389/fphys.2023.1243455

      Tanoue S, Krishnan P, Krishnan B, Dryer SE, Hardin PE. 2004. Circadian Clocks in Antennal Neurons Are Necessary and Sufficient for Olfaction Rhythms in Drosophila. Curr Biol 14:638–649. doi:10.1016/j.cub.2004.04.009

      Thurm U, Küppers J. 1980. Epithelial physiology of insect sensilla. In: Locke M, Smith DS, editors. Insect Biology in the Future. Academic Press. pp. 735–763. doi:10.1016/B978-0-12-454340-9.50039-2

      Wicher D, Miazzi F. 2021. Functional properties of insect olfactory receptors: ionotropic receptors and odorant receptors. Cell Tissue Res 383:7–19. doi:10.1007/s00441-020-03363-x

    1. Botryllus schlosseri (Tunicata) is a colonial chordate that has long been studied for its multiple developmental pathways and regenerative abilities and its genetically determined allorecognition system based on a polymorphic locus that controls chimerism and cell parasitism. We present the first chromosome-level genome assembly from an isogenic colony of B. schlosseri clade A1 using a mix of long and short reads scaffolded using Hi-C. This haploid assembly spans 533 Mb, of which 96% are found in 16 chromosome-scale scaffolds. With a BUSCO completeness of 91.2%, this complete and contiguous B. schlosseri genome assembly provides a valuable genomic resource for the scientific community and lays the foundation for future investigations into the molecular mechanisms underlying coloniality, regeneration, histocompatibility, and the immune system in tunicates.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf097), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3: Cristian Canestro

      TO THE AUTHORS

      In this MS entitled 'First chromosome-level genome assembly of the colonial chordate model Botryllus schlosseri (Tunicata)', Olivier De Thier and colleagues report the first chromosome-scale assembly of this colonial ascidian species, paying special attention to differences with previously published assemblies and, importantly, between haplotypes. The MS is very well written and very easy and pleasant to read. It provides data of great quality that are very relevant not only for the ascidian/tunicate community, but also for the field of genome structural evolution. I firmly recommend it for publication, although I think that the authors could discuss it in deeper detail. In particular, I miss a more elaborate discussion of what the results add to our understanding of the similarities and differences between clades that have been published in recent years (I have not been able to find some relevant articles in this regard cited in the bibliography). I also feel that a deeper analysis of the differences between haplotypes could be very interesting, unless they are artifactual effects of the assemblies. As mentioned below, unless this is part of a longer story for a different MS beyond the scope of this one, I encourage the authors to validate some of the differences they find between haplotypes, to try to correlate the structural variations with differences in gene counts between haplotypes, and to explore whether these differences could be correlated with aspects of biological relevance. I miss, for instance, Venn diagrams of gene content comparing previous assemblies with the haplotypes and haploid genome reported here. In any case, I firmly recommend this MS for publication, since most of my suggestions are intended not to question the results of the MS but to improve it, and I understand that some may go beyond the scope of this MS.

      Minor points: Introduction, page 1: "the basic body plan of adult tunicates is highly conserved across the entire subphylum [3]". This sentence, which could be OK for ascidians, probably provides a highly simplified view of tunicate adult morphologies, especially when comparing the divergent morphologies of thaliaceans and appendicularians. Please elaborate on the sentence.

      To follow the comparisons between the data of this MS and previously reported genomes, it seems crucial to understand the meaning of the "clades and subclades". Please include in the introduction (or where needed) how those clades are defined, what their origins and biological/geographical differences are, and any other critical information that will help non-tunicate readers in particular to understand the results.

      Results: The authors refer to the presence of large-scale genomic palindromes in Bs1 and Bs3, but it is unclear what these structures are. Please provide a more detailed explanation of the palindromic nature of these regions.

      The data from the haplotype-resolved assemblies are very interesting. I wonder if it is possible to somehow measure the amount of heterozygosity between haplotypes 1 and 2, and between those and the previous versions of the genome, to better understand intra- and inter-subclade variation? The differences in the size of some regions between Colombera and this study, and even between haplotypes 1 and 2, are very interesting. I would find it more informative to merge the three graphs of Figure S9 into one single graph, so that we can also easily compare the differences in size between the haplotypes and the haploid assembly. If some of those differences are actually due to deletions, that would deserve further analysis. If this analysis is not part of another ongoing project that will be published somewhere else, I suggest identifying some of those differences with a dot-plot, especially between haplotypes, and validating with long reads crossing those regions whether some of the deletions are real or artifactual. Please include the dot-plot together with the two haplotypes in Figure S10. In those cases that could be real, it would be very interesting to know which genes are gone, and whether those genes have been placed somewhere else in the genome as a result of translocations or are actually gone and could explain some of the differences reported in the gene count between haplotypes.

      The authors mention the presence of multiple structural variations, although some of them could be artifacts of misassemblies. Interestingly, the plot of the synteny blocks between the two haplotypes in Figure S11 shows some of those structural variations, including cases of: - deletions: for instance, there are "blank" regions in Bs1A and Bs3A with no lines, which may reflect areas that are not present in haplotype B. - duplications and translocations within chromosomes or between chromosomes of different haplotypes. Just looking at this plot, I wonder how the distribution of chromosomes between haplotypes is done. For instance, I see that Bs7B shares a duplicated synteny block with chromosomes Bs10B and Bs14B, but not with Bs10A and Bs10B, which means that the duplications are intra-haplotype, present in B but not in A. But I wonder if it is possible that Bs10B and Bs14B could in fact be switched to haplotype A, in which case there would be no duplication or deletion in one of the haplotypes, just a simple translocation. I may be wrong in the interpretation, but I'm curious to understand the graph. In any case, again, as mentioned above, it would be worthwhile to validate some of those variations with long reads, which could illuminate the biological relevance of the differences between the haplotypes and rule out potential assembly artifacts.

      I notice that in Figures 7 and S13, some lines are thicker than others. Is this because many "thin" lines overlap and so look like a "thick" line? Otherwise, the visual effect of different thicknesses could be misleading. Please clarify.

      In the analysis of the Hox cluster the authors say "[…] our new assembly revealed that B. schlosseri's Hox genes are not scattered. Instead, eight of them were clustered on the second largest scaffold (Bs2), whereas two other ones are found on the 15th largest scaffold (Bs15)." Generally, describing Hox genes as being in a cluster refers to the fact that they are in close vicinity, with few or no other genes in between them. Therefore, I would not describe the eight Hox genes as clustered simply because they are on the same chromosome (maybe even on different arms).

    1. Abstract. Background: Reference genomes for the entire sea turtle clade have the potential to reveal the genetic basis of traits driving the ecological and phenotypic diversity in these ancient and iconic marine species. Furthermore, these genomic resources can support conservation efforts and deepen our understanding of their unique evolution. Results: We present haplotype-resolved, chromosome-level reference genomes and high-quality gene annotations for five sea turtle species. This completes the catalog of reference genomes of the entire sea turtle clade when combined with our previously published reference genomes. Our analysis reveals remarkable genome synteny and collinearity across all species, despite the clade’s origin dating back more than 60 million years. Regions of high interspecific genetic distance and intraspecific genetic diversity are consistently clustered in genomic hotspots, which are enriched with genes coding for immune response proteins, olfactory receptors, zinc fingers, and G-protein-coupled receptors. These hotspot regions may offer insights into the genetic mechanisms driving phenotypic divergence among species, and represent areas of significant adaptive potential. Ancient demographic analysis revealed a synchronous population expansion among sea turtle species during the Pleistocene, with varying magnitudes of demographic change, likely shaped by their diverse ecological adaptations and biogeographic contexts. Conclusions: Our work provides genomic resources for exploring genetic diversity, evolutionary adaptations, and demographic histories of sea turtles. We outline genomic regions with increased diversity, linked to immune response, sensory evolution, and adaptation to varying environments, that have historically been subject to strong diversifying selection and will likely underpin sea turtles’ responses to future environmental change. These reference genomes can assist conservation by providing insights into the demographic and evolutionary processes that sustain and threaten these iconic species.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf105), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Brendan Reid

      The authors of this work provide a fantastic addition to the genomic resources currently available for marine turtles with five new, apparently high-quality reference genomes. These new resources enable a number of interesting cross-species analyses in this group, including phylogenetic reconstruction, inference of demographic history, and identification of hotspots of diversity and divergence. I thought this paper was quite clearly written and easy to read overall, and I have one major and a few more minor comments/suggestions.

      Major comment: there is an extensive literature on hybridization among marine turtle lineages (see Vilaca et al. 2021, https://doi.org/10.1111/mec.16113, for a recent genomic example), with lots of evidence for ancient gene flow after initial lineage divergence as well as recent hybridization. The authors do not really mention this phenomenon at all, and since I think it has a lot of bearing on all of the results it would make sense to re-think your findings in light of the fact that some level of gene flow has occurred. Would extensive synteny/lack of genomic rearrangements potentially enable hybridization? Is overall low divergence among lineages potentially a function of gene flow? Are regions of high divergence the result of selection (as you suggest), or could these regions potentially be resistant to gene flow? I believe that IQtree assumes a strictly bifurcating tree, and gene flow can influence PSMC inferences (see Mazet et al. 2016, https://doi.org/10.1038/hdy.2015.104) - how would gene flow among lineages affect your inference of divergence dates and demographic histories?

      Minor comments [note - line numbers would have been helpful for providing comments on specific items! I will refer to the lower-left page numbers and paragraph instead]:

      page 3, paragraph 2: Some of the applications you refer to here don't seem terribly germane to the relevance of "genomic resources" in management and conservation per se, and several are just methods using some kind of genetic data ... e.g., "abundance"/close-kin mark recapture doesn't require full genomes (and the reference you cite used microsat data), and the "community"/eDNA applications don't generally rely on genomes but instead on databases of a few (usually mitochondrial) genes. Either include methods that truly benefit from the development of high-quality reference genomes or broaden this to something like "growth in molecular ecology techniques".

      page 4, paragraph 2: last sentence is a bit of a run-on, could break this up a bit.

      page 10, paragraph 3: for me, the ROH methods need some additional explanation and interpretation. The more detailed methods indicate that the ROH were identified on the basis of lower-than-average heterozygosity rather than true homozygosity - I can understand why this might have been done (since the baseline level of heterozygosity varies across species) but it still seems a bit arbitrary and could risk mistaking stretches with simply low variation for IBD tracts. I wonder if a ROH-detection method like ROHan that explicitly incorporates baseline genomic heterozygosity into its model would be more appropriate for comparing results across species and could give different results. I also question a bit the interpretation of these low-diversity tracts as evidence of inbreeding per se. The authors do not comment much on the length distributions of these ROH - given that many of them are quite short I would expect that if there was mating between close kin it probably happened far back in the past and the IBD tracts have been broken up by recombination.

      page 11, paragraph 2: for PSMC analyses it is important to note that the method assumes that differences in coalescence time/Ne across the genome result from demography alone. If portions of the genome are under balancing/diversifying selection (such as the areas of high diversity that you detect in this study), the local Ne inferred for these regions would be expected to be larger than for the rest of the genome, which could lead to the spurious detection of population expansion or contraction (more likely a contraction for balancing selection). See Boitard et al. 2022 (https://doi.org/10.1093/genetics/iyac008) for a more detailed treatment. I would try excluding the regions putatively under diversifying selection and re-running PSMC to see if your inferences change.
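      The exclusion step suggested above can be sketched as a simple interval filter. This is only an illustration under stated assumptions: the window size, the chromosome name, and the hotspot coordinates are hypothetical, and the helper names are not part of PSMC or of the authors' pipeline. The retained windows would still have to be converted into whatever mask or input format the PSMC workflow in use expects.

```python
from typing import List, Tuple

# (chromosome, start, end), 0-based half-open coordinates
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two half-open intervals on the same chromosome overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def exclude_hotspots(windows: List[Interval], hotspots: List[Interval]) -> List[Interval]:
    """Keep only the windows that do not overlap any hotspot interval."""
    return [w for w in windows if not any(overlaps(w, h) for h in hotspots)]


# Hypothetical 100 kb windows and one hypothetical high-diversity hotspot.
windows = [("chr1", i * 100_000, (i + 1) * 100_000) for i in range(5)]
hotspots = [("chr1", 150_000, 260_000)]

# Windows touching the hotspot are dropped; the rest are kept for re-analysis.
kept = exclude_hotspots(windows, hotspots)
```

      Comparing the demographic curve inferred from the retained windows with the genome-wide curve would indicate how sensitive the inference is to the putatively selected regions.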

    1. Abstract. The vast majority of cancers exhibit Somatic Copy Number Alterations (SCNAs): gains and losses of variable regions of DNA. SCNAs can shape the phenotype of cancer cells, e.g. by increasing their proliferation rates, removing tumor suppressor genes, or immortalizing cells. While many SCNAs are unique to a patient, certain recurring patterns emerge as a result of shared selectional constraints or common mutational processes. To discover such patterns in a robust way, the size of the dataset is essential, which necessitates combining SCNA profiles from different cohorts, a non-trivial task. To achieve this, we developed CNSistent, a Python package for imputation, filtering, consistent segmentation, feature extraction, and visualization of cancer copy number profiles from heterogeneous datasets. We demonstrate the utility of CNSistent by applying it to the publicly available TCGA, PCAWG, and TRACERx cohorts. We compare different segmentation and aggregation strategies on cancer type and subtype classification tasks using deep convolutional neural networks. We demonstrate an increase in accuracy over training on individual cohorts and efficient transfer learning between cohorts. Using integrated gradients we investigate lung cancer classification results, highlighting SOX2 amplifications as the dominant copy number alteration in lung squamous cell carcinoma.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf104), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Ellen Visscher

      The paper introduces a Python package for imputation, filtering, segmentation, feature extraction and visualisation of CNA profiles. It explains some of the elements of the package, and then demonstrates how data from multiple cohorts can be processed and combined using the package's preprocessing pipeline. The authors then use processed data from 3 different cohorts to perform cancer type prediction using a CNN. From this, they obtain an interesting result, identifying a biomarker that differentiates two different lung cancers. Throughout, they show visualisations produced with their package. The package itself seems well documented and designed to be used. Some clarification is required in the methods section, specifically around the CNN training and the models therein. There is also one major question of whether all the preprocessing steps are actually required for the downstream CNN analysis. Overall, however, this is a well written manuscript, providing a useful software tool for further analysis of CNA data.

      Major comments:
      - CNN section: how are the segments decided? Is it based on all the training data, or just the data in a batch?
      - Throughout the results pertaining to Figure 3A-C, you call it test accuracy. To be clear, is this based on your CV hold-outs? If so, this should be reworded everywhere to reflect it: as cross-validation indicates, this is a validation set rather than a test set, which is also the way you use it.
      - Regarding the above, you have a comment saying: "the best test accuracy without cross-validation was 92.34%". Could you please clarify what you mean by this? Only in the CNN section do you describe your training approach, which does not mention a test or separate validation set.
      - It reads slightly unclearly: you have a section called "model transfer", but are you training 3 different models, one per dataset? You only have one figure for training results, which suggests one dataset, but then you have this section called model transfer?
      - Re all the above, please dedicate a small subsection in the methods to making this clearer. Are there dedicated test sets? If your main results are for aggregated data, then what are you testing on to ensure generalisability? What is the point of training the 3 different models on 3 different datasets? Perhaps it would make more sense to hold one dataset out as your test set. In some ways, that is what the model transfer is showing, but it would be less confusing to clarify that aim instead of suddenly introducing 3 models.
      - If the CNN architecture is essentially the same as in Attique et al., the performance is basically the same, and they use only CNs at gene locations, how does this demonstrate that the preprocessing from CNSistent is necessary or advantageous for this task? Maybe having a result which combines CN calls naively over gene locations and comparing to this across the aggregate datasets would be a good way of comparing, i.e., showing that preprocessing does offer an advantage when combining different datasets together? This is also what you argue in your abstract. For this analysis you would have to make sure you also compare across the same samples to differentiate between filtering/other preprocessing steps.
      - In Figure 3I, you say "notice the similarity of chromosome 3 pattern for the correctly classified LUSC samples (red) and the misclassified ones (orange)". This is confusing because the orange and red are not similar. In fact, for this whole section, it seems that Figure 3I does not align with what you are saying?
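      The suggestion to hold one dataset out as the test set can be illustrated with a minimal leave-one-cohort-out split. This is a sketch under assumptions, not the authors' training code: the sample IDs are hypothetical, and only the three cohort names are taken from the manuscript.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_cohort_out(
    cohort_of: Dict[str, str],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_cohort, train_ids, test_ids), holding out one cohort at a time."""
    for held_out in sorted(set(cohort_of.values())):
        train = sorted(s for s, c in cohort_of.items() if c != held_out)
        test = sorted(s for s, c in cohort_of.items() if c == held_out)
        yield held_out, train, test


# Hypothetical sample IDs mapped to the three cohorts named in the manuscript.
cohort_of = {
    "s1": "TCGA",
    "s2": "TCGA",
    "s3": "PCAWG",
    "s4": "PCAWG",
    "s5": "TRACERx",
}

# Each split trains on two cohorts and tests on the third, so no sample
# from the test cohort is ever seen during training or model selection.
splits = list(leave_one_cohort_out(cohort_of))
```

      Cross-validation for model selection would then run inside the training cohorts only, which keeps the held-out cohort a genuine test set in the sense the comments above ask for.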

      Minor comments/errors:
      - Please clarify why CNSistent needs a reference genome if it is dealing with segments. How is this information used? Is it just for the known gaps?
      - The caption of Supplementary Figure 1 has a typo about a breakpoint at 16 instead of 14.
      - You do not explain how you use the knee point to filter (i.e., is it samples above or below the knee point?).
      - Your CNN graphic is difficult to interpret and non-standard.
      - The CNN section should clarify at the beginning what the input is and what the output is (i.e., a prediction that a sample belongs to a particular cancer type) before explaining the architectural details.
      - Even though you control for class imbalance, some cancer types are so poorly represented that it is unlikely a CNN could learn them. You do kind of mention this in the discussion, but maybe some sort of minimum threshold for inclusion would make sense.
      - For Fig. 2D you refer to it as GND, but the axes/title say hemizygosity. Are these things equivalent? E.g., a sample could have 3-3, so low hemizygosity, but not be diploid? Or, if it is aggregated across the whole genome, is it assumed to be equivalent?
      - There is a grammatical error: "Runtimes decreased in a near-linearly with the number of compute cores".
      - You make a comment that "We therefore suspect some TCGA lung cancers might be cases of co-occurring adeno and squamous carcinomas." This is a possibility, but given the pleiotropy of many phenotypes, it may also be that the biomarker is not always unique to squamous carcinomas.

      Suggestions/Nice to haves:
      - Maybe make it clearer inside the paper which visualisations come with CNSistent. Looking at the software documentation, there are obviously a lot of useful visualisations that come with it, and you have used some of them in Figure 3, for example.
      - Given that there are more total CN callers, it may be good to mention somewhere how CNSistent would work for total CNs only.
      - You remove profiles that you say are uninformative. Could you not include these and then just show how accuracy correlates with the number of breakpoints (for example)? In some ways one might think that there could be useful information in profiles with few alterations, because those alterations might be more upstream/causal.
      - The aggregation step could maybe affect downstream analysis. I.e., taking the average could introduce CNs that were never called. Even using min/max implies a constant copy number in that region, which may lose information; e.g., if it is a functional region, having two different CNs across a gene might imply non-functionality. Did you explore the effect of the aggregation step? Perhaps taking a small enough resolution of segment types would account for this anyway.
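      The concern about aggregation can be made concrete with a small worked example. The function below is a hypothetical illustration, not the CNSistent API: it aggregates the copy numbers of the segments overlapping one bin by length-weighted mean, minimum, and maximum, using made-up coordinates.

```python
from typing import Dict, List, Tuple

# (start, end, copy_number), half-open coordinates
Segment = Tuple[int, int, int]


def aggregate_bin(segments: List[Segment], bin_start: int, bin_end: int) -> Dict[str, float]:
    """Summarise the copy numbers of segments overlapping [bin_start, bin_end)."""
    total = 0       # number of bases of the bin covered by segments
    weighted = 0    # sum of overlap length times copy number
    values = []     # copy numbers of all overlapping segments
    for start, end, cn in segments:
        overlap = min(end, bin_end) - max(start, bin_start)
        if overlap > 0:
            total += overlap
            weighted += overlap * cn
            values.append(cn)
    if total == 0:
        raise ValueError("no segment overlaps the bin")
    return {"mean": weighted / total, "min": min(values), "max": max(values)}


# Hypothetical bin covered half by copy number 2 and half by copy number 4.
segments = [(0, 500, 2), (500, 1000, 4)]
summary = aggregate_bin(segments, 0, 1000)
```

      Here the mean is 3, a copy number that was never called in either segment, while min (2) and max (4) each hide the breakpoint at position 500 by implying a constant value across the bin.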

    1. Abstract. Polyadenylation is a dynamic process which is important in cellular physiology. Oxford Nanopore Technologies direct RNA-sequencing provides a strategy for sequencing the full-length RNA molecule and analysis of the transcriptome and epi-transcriptome. There are currently several tools available for poly(A) tail-length estimation, including well-established tools such as tailfindr and nanopolish, as well as two more recent deep learning models: Dorado and BoostNano. However, there has been limited benchmarking of the accuracy of these tools against gold-standard datasets. In this paper we evaluate four poly(A) estimation tools using synthetic RNA standards (Sequins), which have known poly(A) tail-lengths and provide a valuable approach to measuring the accuracy of poly(A) tail-length estimation. All four tools generate mean tail-length estimates which lie within 12% of the correct value. Overall, Dorado is recommended as the preferred approach due to its relatively fast run times, low coefficient of variation and ease of use with integration with base-calling.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf098), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Christoph Dieterich

      In this manuscript, the authors present a benchmark to assess the performance of different tools designed for estimation of polyA tail length from Nanopore direct RNA-sequencing data. These tools include tailfindr, nanopolish, Dorado and BoostNano. Benchmarks on tools and algorithms to analyze Nanopore data, both third party tools and official ONT releases, are of utmost importance for the field. The use of synthetic constructs with known ground truth is recommended as well. Consequently, this study has the potential to provide a significant contribution to the field.

      In its current form, however, I cannot recommend it for publication in GigaScience. My major concerns are: a) Use of only RNA002 data. This chemistry is outdated and thus the benchmark is only relevant for old, possibly already published data. A comprehensive benchmark should also include RNA004 and the tools available for it (at least Dorado). b) The current data set only contains two polyA tail lengths, which are relatively short and do not cover the longer polyA tails that are common e.g. in mammalian cells. A proper benchmark should show the performance of the analyzed tools over a range of polyA tail lengths.

      Minor comments:
      1) Abstract: "All four tools generate mean tail-length estimates which lie within 13% of the correct value." The value of 13% is given in the abstract from the submission system, whereas the abstract in the main text says 12%. Which value is correct?
      2) Background, first paragraph: the role of the polyA tail in RNA circularization, which is required for efficient translation of cellular mRNAs, is not mentioned. A reference is missing for "is increasingly recognised as a dynamic process which influences timing and degree of protein production."
      3) Background, second paragraph: Chiron seems to be a relatively old basecaller (no models for new chemistries). It should be mentioned here that it is required for BoostNano.
      4) Mis-priming of internal polyA sites may be an important confounding (and currently overlooked) source of errors in Nanopore sequencing. This should be quantified properly and analyzed in more detail (length of these stretches, influence of other nucleotides within the A-rich stretch, etc.). This should also be done on whole-transcriptome data with more possible mispriming sites.
      5) Why do the authors think that the poly(T) stretch of the RTA might be truncated? This is composed of DNA oligos, which should be quite stable.
      6) What are the parameters for filtering used by Dorado and BoostNano? Can the authors explain why the filtered reads differ?
      7) Dorado seems to systematically underestimate polyA tail length. Is this also true for data generated with RNA004 chemistry and longer polyA tails?
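      The two summary statistics at issue in the abstract, the relative error of the mean estimate and the coefficient of variation, can be sketched as follows. The per-read estimates and the 30 nt reference length are hypothetical values chosen for illustration; they are not taken from the manuscript or from any of the benchmarked tools.

```python
from statistics import mean, stdev
from typing import List


def relative_error(estimates: List[float], true_length: float) -> float:
    """Signed relative error of the mean estimate with respect to the known length."""
    return (mean(estimates) - true_length) / true_length


def coefficient_of_variation(estimates: List[float]) -> float:
    """Sample standard deviation of the estimates divided by their mean."""
    return stdev(estimates) / mean(estimates)


# Hypothetical per-read tail-length estimates for a standard whose
# known tail length is assumed here to be 30 nt.
estimates = [26.0, 27.0, 27.0, 28.0]
err = relative_error(estimates, 30.0)
cv = coefficient_of_variation(estimates)
```

      With these made-up numbers the mean estimate is 27 nt, so the signed relative error is -0.10: a 10% underestimate that would fall inside either the 12% or the 13% bound. A negative sign across all standards would correspond to the systematic underestimation raised in comment 7.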

    1. AbstractThe ability to differentiate between viable and dead microorganisms in metagenomic data is crucial for various microbial inferences, ranging from assessing ecosystem functions of environmental microbiomes to inferring the virulence of potential pathogens from metagenomic analysis. While established viability-resolved genomic approaches are labor-intensive as well as biased and lacking in sensitivity, we here introduce a new fully computational framework that leverages nanopore sequencing technology to assess microbial viability directly from freely available nanopore signal data. Our approach utilizes deep neural networks to learn features from such raw nanopore signal data that can distinguish DNA from viable and dead microorganisms in a controlled experimental setting of UV-induced Escherichia cell death. The application of explainable AI tools then allows us to pinpoint the signal patterns in the nanopore raw data that allow the model to make viability predictions at high accuracy. Using the model predictions as well as explainable AI, we show that our framework can be leveraged in a real-world application to estimate the viability of obligate intracellular Chlamydia, where traditional culture-based methods suffer from inherently high false negative rates. This application shows that our viability model captures predictive patterns in the nanopore signal that can be utilized to predict viability across taxonomic boundaries. We finally show the limits of our model’s generalizability through antibiotic exposure of a simple mock microbial community, where a new model specific to the killing method had to be trained to obtain accurate viability predictions. 
While the potential of our computational framework’s generalizability and applicability to metagenomic studies needs to be assessed in more detail, we here demonstrate for the first time the analysis of freely available nanopore signal data to infer the viability of microorganisms, with many potential applications in environmental, veterinary, and clinical settings. Author summary: Metagenomics investigates the entirety of DNA isolated from an environment or a sample to holistically understand microbial diversity in terms of known and newly discovered microorganisms and their ecosystem functions. Unlike traditional culturing of microorganisms, genomic approaches are not able to differentiate between viable and dead microorganisms since DNA might persist under different environmental circumstances. The viability of microorganisms is, however, of importance when making inferences about a microorganism’s metabolic potential, a pathogen’s virulence, or an entire microbiome’s impact on its environment. As existing viability-resolved genomic approaches are labor-intensive, expensive, and lack sensitivity, we here investigate our hypothesis that freely available nanopore sequencing signal data that captures DNA molecule information beyond the DNA sequence might be leveraged to infer such viability. This hypothesis assumes that DNA from dead microorganisms accumulates certain damage signatures that reflect microbial viability and can be read from nanopore signal data using fully computational frameworks. We here show first evidence that such a computational framework might be feasible by training a deep model on controlled experimental data to predict viability at high accuracy, exploring what the model has learned, and using it in a real-world application to a bacterial species of veterinary relevance. We finally show that a specific model has to be trained to accurately predict viability after antibiotic exposure of a mock microbial community. 
While the generalizability of our computational framework therefore needs to be assessed in much more detail, we here demonstrate that freely available data might be usable for relevant viability inferences in environmental, veterinary, and clinical settings.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf100), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Finlay Maguire

      In this paper the authors train a ResNet-based model to predict whether individual 10,000 sample chunks of nanopore signal data originate from live or killed bacterial isolate cultures. From live and UV-killed (at exponential phase) E. coli K-12 cultures DNA was extracted and sequenced using separate R10.4.1 flowcells on a MinION. Signal data from each read in the live and dead extractions were then processed by discarding the first 1,500 samples and dividing the remaining signals into 10,000 sample chunks. These were then split into a balanced 60:20:20 train, test, and validation datasets with the constraint that no two chunks from the same read would end up in the same dataset (e.g., chunk 1 and chunk 2 of 1st read in the killed culture would hypothetically be separated into train and test). During this they also explored/compared the impact of chunk size, model architecture, and performance of a sequence based model using the E. coli data. With a nicely performed class-activation map and masking approach they then identified the signal regions most strongly associated with dead-predictions (such as twisting/kinking/pore blockage of DNA around pyrimidine dimers). Finally, they applied their trained model to a live and heat-killed Chlamydia abortus culture and compared their results to stained microscopy and propidium monoazide PCR measures of viability. They found equivalent performance on the C. abortus data to their E. coli data (despite a different killing-method and taxa).
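
      The preprocessing summarised above (trim the start of each read, then cut the remainder into fixed-size chunks) can be illustrated with a minimal sketch. This is not the authors' code: the function name `chunk_signal` is hypothetical, and discarding a trailing partial chunk is an assumption, since the review does not say how leftover samples are handled.

```python
from typing import List, Sequence


def chunk_signal(signal: Sequence[float],
                 skip: int = 1500,
                 chunk_size: int = 10000) -> List[Sequence[float]]:
    """Drop the first `skip` samples, then cut the rest into
    non-overlapping chunks of `chunk_size` samples."""
    trimmed = signal[skip:]
    # Assumption: a trailing chunk shorter than `chunk_size` is discarded.
    n_full = len(trimmed) // chunk_size
    return [trimmed[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


# Toy read with 32,000 signal samples (values are just sample indices).
toy_read = list(range(32000))
toy_chunks = chunk_signal(toy_read)
print(len(toy_chunks), toy_chunks[0][0])
```

      With these defaults a 32,000-sample read leaves 30,500 samples after trimming, so three full chunks are kept and the first chunk starts at sample 1,500.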

      The manuscript is well written and the methods are clearly described (including well documented code and deposited data). The authors' explainability methodology is excellent, although it would have been nice to see a bit more in-depth interpretation of those results. The authors have also presented a convincing case that nanopore signal data does contain information that can be used to distinguish signal chunks from live and dead bacterial monocultures. This method has the potential to be useful in clinical and environmental genomics if it can be extended to more heterogeneous metagenomic samples. However, despite the title and framing of this manuscript (i.e., "metagenomics"), their analyses do not involve any metagenomic data and their results so far do not demonstrate whether this is feasible. Currently, the overall framing (and title) of the manuscript is not appropriate given the work performed at this point. Similarly, given that both E. coli and C. abortus "dead" cultures resulted in a median read length less than half that of the live cultures, the authors do not fully make the case that the signal and ResNet approach is actually required relative to simpler baseline models. Finally, although they did evaluate performance on a completely separate dataset, the authors should at least explore/quantify the correlation of live/dead prediction across chunks of the same read, given the default expectation of non-independence of signal chunks from the same read.

      Major - Although the title and framing of the paper suggest that the authors are classifying live and dead bacteria in metagenomic datasets, the actual experiments and method developed are entirely based around sequencing of cultured clonal bacterial isolates. Metagenomic datasets are going to have considerably more heterogeneity in viability, species composition, and DNA signal characteristics. Given this, the paper's title, introduction, and parts of the discussion are a bit of an oversell and inappropriate. This manuscript should be revised to more clearly reflect the work actually performed.

      • This paper doesn't establish whether a ResNet + Signal approach actually outperforms a much simpler baseline. For example, given that there are clear extraction and median read-length differences between live and dead samples, it is possible that a much simpler logistic model using basic features such as read length and/or translocation could perform equivalently.

      • Although the C. abortus analysis demonstrates limited impact of leakage, I'm still a bit concerned about the potential non-independence of chunks from the same read (i.e., chunk 1 and chunk 3 of the same read are more likely to share similar live/dead signal characteristics than chunk 1 and chunk 3 of different reads). By not having multiple chunks of the same read in the training, validation, or test datasets the authors may have avoided issues with longer reads being more represented in their datasets. However, this has the potential to introduce data leakage between train and test set (which may impact generalisability when they attempt to extend this method to metagenomics). I think this paper would be improved by some exploration of the correlation of live/dead prediction across chunks of the same read. How often do different chunks of the same read disagree? How does this impact the overall performance of the model? Does taking the average prediction across chunks of the same read improve or degrade performance? Would this problem be better suited to a multiple instance learning approach (i.e., a live/dead label applied to all chunks from a single read), especially in more heterogeneous datasets? To what degree do longer reads with more chunks contribute disproportionately to the overall performance in the C. abortus dataset?
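
      The read-level questions raised in the last point (how often chunks of one read disagree, and what an averaged call looks like) could be explored with something like the sketch below. It is only an illustration under stated assumptions: the input format of `(read_id, p_dead)` pairs, the 0.5 threshold, and the function name `read_level_summary` are all hypothetical and not taken from the manuscript.

```python
from collections import defaultdict
from statistics import mean


def read_level_summary(chunk_preds, threshold=0.5):
    """Aggregate per-chunk 'dead' probabilities to one summary per read.

    chunk_preds: iterable of (read_id, p_dead) pairs, one pair per chunk.
    """
    by_read = defaultdict(list)
    for read_id, p_dead in chunk_preds:
        by_read[read_id].append(p_dead)

    summary = {}
    for read_id, probs in by_read.items():
        chunk_calls = [p >= threshold for p in probs]
        avg = mean(probs)
        summary[read_id] = {
            "n_chunks": len(probs),
            "mean_p_dead": avg,
            "read_call_dead": avg >= threshold,        # averaged read-level call
            "chunks_disagree": len(set(chunk_calls)) > 1,
        }
    return summary


# Toy predictions: read r1 is consistent, read r2 has conflicting chunks.
toy_preds = [("r1", 0.9), ("r1", 0.8), ("r2", 0.2), ("r2", 0.7), ("r2", 0.3)]
toy_summary = read_level_summary(toy_preds)
for read_id, row in toy_summary.items():
    print(read_id, row)
```

      The fraction of reads with `chunks_disagree` set gives a direct answer to "how often do different chunks of the same read disagree?", and `n_chunks` allows performance to be stratified by read length.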

      Minor

      • SRA records don't seem to be live yet (https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=1123127)

      • Are the actual pod5 files available?

      • Read-level performance should be analysed and reported.

      • Figure 1B: the test subplot numbers are almost too small to read; the subplot may benefit from being its own panel.

      • Plot axis labels are not always clear (e.g., Figure 3): percentage of what? Chunks or reads? It would be nice to see consistent capitalisation of labels and legends.

      • Predictions on viable E. coli and viable C. abortus seem surprisingly similar (91.44% vs 91.34% viable and 8.56% vs 8.66% dead) despite different taxa, potentially different underlying viable cell proportions, and different output probability densities. This would benefit from further discussion/analysis: do misclassified chunks have any common characteristics? Would you expect the E. coli to have a similar microscopy/PCR-measured viability percentage to the C. abortus?

      • It would be good to see a bit more discussion/exploration of the impact of mixed live/dead cells given the ~37.6% viability measure in the C. abortus sample (e.g., how well do models perform with different ratios of live/dead reads?). This could potentially be achieved using in-silico spike-ins.
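
      One way to picture the in-silico spike-in suggested in the last point is to draw reads from separate live and dead pools at a chosen ratio and then score the model on each mixture. The sketch below is a hypothetical illustration only: the function name `spike_in_mixture`, the read identifiers, and the fixed seed are assumptions, and it requires each pool to contain at least as many reads as are sampled from it.

```python
import random


def spike_in_mixture(live_reads, dead_reads, viable_fraction, n_total, seed=0):
    """Build a labelled mixture with a target fraction of live reads.

    Sampling is without replacement, so each pool must be large enough.
    """
    rng = random.Random(seed)
    n_live = round(n_total * viable_fraction)
    n_dead = n_total - n_live
    mixture = [(r, "live") for r in rng.sample(live_reads, n_live)]
    mixture += [(r, "dead") for r in rng.sample(dead_reads, n_dead)]
    rng.shuffle(mixture)
    return mixture


# Toy pools of read identifiers; 37.6% viable mirrors the figure quoted above.
live_pool = [f"live_{i}" for i in range(100)]
dead_pool = [f"dead_{i}" for i in range(100)]
toy_mixture = spike_in_mixture(live_pool, dead_pool, 0.376, 50)
n_live_in_mix = sum(1 for _, label in toy_mixture if label == "live")
print(n_live_in_mix, len(toy_mixture) - n_live_in_mix)
```

      Repeating this over a grid of `viable_fraction` values would show how the predicted viable proportion tracks the true one.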

    1. There is a third kind of answer that, without competing with the previous two, demonstrates the value of philosophy, even (perhaps, especially) for students like our imagined protagonist: philosophy is the antidote to the uncritical acceptance of the world and ourselves as we are.

      I like the phrase "antidote to the uncritical acceptance" quite a lot. At first, you may think that an "uncritical acceptance" isn't necessarily a bad thing. However, thinking about it more, do you really want to just blindly accept the world around you? Looking critically at yourself and the world allows you to make changes and work to improve the lives of yourself and others, among many other things, simply because you dared to question.

    2. The deep underlying idea is that if we have to choose a social and political arrangement without knowing the position that we may occupy in society, we will choose fair principles to govern our social and political institutions. My teacher had our class re-enact a scenario very much like this one in class. We discussed the principles that would govern our imagined society before we picked our fate out of a hat. Until that point in my young life, I had never thought about justice in that way

      This is a very interesting way to think about justice. The author introduces this method to imagine a fair society with no bias. This works so well because not knowing what your position in society will be allows you to genuinely try your best to make society as fair and enjoyable as possible for every individual.

    3. Therefore, the first step in this kind of philosophical education is to shake students out of a complacent and uncritical acceptance of the world as it is.

      I think this is one of the most important reasons why we need to study philosophy. When we repeat our daily routines and become accustomed to them, we tend to overlook the injustices within them or we may not even recognize them as injustices. Philosophy enables us to think more critically about the society we live in, its institutions, and the impact they have on us.

    4. When students take this imaginative exercise seriously, they start to feel as discomfited as Descartes himself must have. The ground starts shaking under them. It is at this moment that philosophy starts its work.

      By asking so many bizarre questions that one does not normally consider on a day-to-day basis, philosophy pushes us outside of our comfort zone and forces us to take a step into the unknown. This encourages our brains to work in ways they normally would not, to ask questions beyond our general scope of thinking, and to create new connections and ideas that we may not normally have considered. I think this emphasizes the importance of philosophy because it teaches us how to react when we are pushed outside of our comfort zone and how to think beyond our normal flow of consciousness.

    5. Many philosophers have persuasively criticized Rawls’ use of the original position as an argumentative tool. But we often forget, I think, how successfully it harnesses the power of the imagination to construct an alternative vision of what society could be like.

      We are so used to the life we live that in some ways we become comfortable in it. When imagining a different reality, one in which they may be less high up or wealthy, it becomes difficult for some to acknowledge just how much privilege they once had. The "Theory of Justice" gives people a different perspective on life and on how different each and every person's life is from one another's.

    6. The deep underlying idea is that if we have to choose a social and political arrangement without knowing the position that we may occupy in society, we will choose fair principles to govern our social and political institutions. My teacher had our class re-enact a scenario very much like this one in class. We discussed the principles that would govern our imagined society before we picked our fate out of a hat. Until that point in my young life, I had never thought about justice in that way. The power of this exercise contributed in no small way to my becoming a philosopher. I have recreated a similar activity in various classes I have taught. The discussion it generates among students is reliably superb, but the best moment is when students discover their fate – whether they end up being a doctor or a garbage truck driver or a poor young mother – and have to reckon (at least for that class period) with their principles. Many philosophers have persuasively criticized Rawls’ use of the original position as an argumentative tool. But we often forget, I think, how successfully it harnesses the power of the imagination to construct an alternative vision of what society could be like.

      Though it was a little difficult for me to picture this in real life, as it is not realistic for society to be completely unaware of one's capabilities before choosing their position in the social hierarchy, I think that this is fascinating to imagine. We often forget that we may not be as secure in our social status or career as we think we are, so it is important to be aware of those of lower status around you and not take your position for granted.

    7. Now, ask yourself: what could philosophy do for you?

      I think this is a very interesting start to this article! It puts us into the shoes of someone in a difficult position, in which they must tirelessly work away to simply have a shot at a decent, livable lifestyle. I feel that this scenario they painted for us so vividly is really powerful when leading into this question, because I think people in the current climate of the world tend to underestimate the importance of philosophy, or don't really think about it at all. While maybe a lot of us don't completely relate to the situation of the young mother, a lot of us DO have our own struggles and might find ourselves lost in the grueling work that may come with everyday life. And when simply going through with our daily lives is hard enough, why should we bother with philosophy? Personally, I don't really think about the idea of philosophy at all, and I never really thought it would be relevant to me based on what I want to do in life. And when people don't think something is relevant, why bother with it, right? Life is busy enough as it is. But really, it probably has a lot more relevancy in my life than I think, and I believe that this idea is somewhat being conveyed in this part. That's just how I saw this paragraph, but I thought it was a strong opening!

    8. The deep underlying idea is that if we have to choose a social and political arrangement without knowing the position that we may occupy in society, we will choose fair principles to govern our social and political institutions. My teacher had our class re-enact a scenario very much like this one in class. We discussed the principles that would govern our imagined society before we picked our fate out of a hat. Until that point in my young life, I had never thought about justice in that way. The power of this exercise contributed in no small way to my becoming a philosopher. I have recreated a similar activity in various classes I have taught. The discussion it generates among students is reliably superb, but the best moment is when students discover their fate – whether they end up being a doctor or a garbage truck driver or a poor young mother – and have to reckon (at least for that class period) with their principles. Many philosophers have persuasively criticized Rawls’ use of the original position as an argumentative tool. But we often forget, I think, how successfully it harnesses the power of the imagination to construct an alternative vision of what society could be like.

      This is a brilliant way to describe others' lived experiences and how what might not affect you could affect someone else. Using philosophical teachings can reveal the privileges of some and the shortcomings of others and hopefully create a better understanding of everyone's blind spots in day-to-day life. Truly a very powerful and humbling exercise that can help create common ground and allow others to empathize with each other and hopefully create a more just society.

    9. The deep underlying idea is that if we have to choose a social and political arrangement without knowing the position that we may occupy in society, we will choose fair principles to govern our social and political institutions. My teacher had our class re-enact a scenario very much like this one in class. We discussed the principles that would govern our imagined society before we picked our fate out of a hat. Until that point in my young life, I had never thought about justice in that way. The power of this exercise contributed in no small way to my becoming a philosopher. I have recreated a similar activity in various classes I have taught. The discussion it generates among students is reliably superb, but the best moment is when students discover their fate – whether they end up being a doctor or a garbage truck driver or a poor young mother – and have to reckon (at least for that class period) with their principles. Many philosophers have persuasively criticized Rawls’ use of the original position as an argumentative tool. But we often forget, I think, how successfully it harnesses the power of the imagination to construct an alternative vision of what society could be like.

      This idea that we must get rid of the idea of "safety" within our lives and experiences can be imagined as a vision of the future that we, as people, don't want to imagine. Being a "poor mother" or a "garbage truck driver" can be thought of as a disappointing fate by many who attend college; it can even be a fate so poor in the minds of students that it serves as motivation in their eyes. To not be like "them" is a phrase that sticks with many who hold themselves to a high idea of success. But I believe in and resonate with this idea of harnessing imagination, as it broadens our perspective on education and life, because no matter how safe we feel behind a wall of education or wealth, there can always be a force of society that challenges our goals.

    1. Abstract: Water buffalo is a cornerstone livestock species in many low- and middle-income countries, yet major gaps persist in its genomic characterization, complicated by the divergent karyotypes of its two sub-species (swamp and river). Such genomic complexity makes water buffalo a particularly good candidate for the use of graph genomics, which can capture variation missed by linear reference approaches. However, the utility of this approach to improve water buffalo has been largely unexplored. We present a comprehensive pangenome that integrates four newly generated, highly contiguous assemblies of Pakistani river buffalo with available assemblies from both sub-species. This doubles the number of accessible high-quality river buffalo genomes and provides the most contiguous assemblies for the sub-species to date. Using the pangenome to assay variation across 711 global samples, we uncovered extensive genomic diversity, including thousands of large structural variants absent from the reference genome, spanning over 140 Mb of additional sequence. We demonstrate the utility of these data by identifying putative functional indels and structural variants linked to selective sweeps in key genes involved in productivity and immune response across 26 populations. This study represents one of the first successful applications of graph genomics in water buffalo and offers valuable insights into how integrating assemblies can transform analyses of water buffalo and other species with complex evolutionary histories. We anticipate that these assemblies, and the pangenome and putative functional structural variants we have released, will accelerate efforts to unlock water buffalo’s genetic potential, improving productivity and resilience in this economically important species.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf099), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 4: Wai Yee Low

      Review of "A comprehensive water buffalo pangenome reveals extensive structural variation linked to population specific signatures of selection". This is an impressive work at the frontier of buffalo genomics. I truly enjoyed reading the work and my questions/comments are aimed at improving it further. My detailed comments are below: Line 30: I think it would be better to include the actual number of publicly available assemblies used to create the pangenome graph. Line 71: There is now a swamp buffalo reference genome with annotation too (NCBI accession: PCC_UOA_SB_1v2). Perhaps consider citing the swamp buffalo ref https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae053/7753516 and rewriting the sentence to say that a pangenome can be used for both swamp and river buffalo, but a single linear ref from either subspecies is not good enough for read mapping. Line 79: "highlighted" Line 82: What do you mean by "higher quality"? The assemblies have been discussed in this review: https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.629861/full Line 105: Technically, the graph method for bovine species, which includes water buffalo, is being investigated by the Bovine Pangenome Consortium (BPC). However, nothing useful has been published on the buffalo graph, but perhaps consider citing the BPC since your paper overlaps with it (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02975-0). Line 165: It would be good to add a bit more context on the PanGenie method here, as researchers in the buffalo community are not used to it. Additionally, it would be great if all code were made available on GitHub or as Supplementary Info. Line 170: To produce a phased pangenome graph, don't you need all input assemblies to be phased? Are all input assemblies phased? The UOA_WB_1 is locally phased, not phased throughout the genome. Line 235: "a list of 403 unrelated individuals." 
What does this translate to in terms that geneticists can understand? Do you mean siblings have been removed? Or individuals sharing the same grandparents were removed? Line 246: Can you please explain how you got the coordinates to match between the GATK and PanGenie methods? You'll need matching coordinates for concordance analysis. As I understand it, the GATK calls were based on UOA_WB_1? Line 254: Why these 3 chromosomes? Line 257: If you had not filtered for relatedness, how would it impact the selective sweep work? I think including some context will help the readers. Line 259: Do you mean at least six samples per group? If yes, are 6 samples enough? Line 261: Genotype quality less than 25 according to bcftools? Since you only used biallelic variants, please provide the breakdown between biallelic and multiallelic. Line 281: "… we first PacBio HiFi sequenced one female" Please rewrite this. Line 282: How common are these two breeds in percentage? Line 291: Is this already known? Perhaps cite the literature to show the agreement with previous studies? Fig 1D: This is a bit too small to see, especially the SV distribution at the bottom. I can hardly see the median. Line 310: Why did you choose UOA_WB_1 as the reference? Line 311: Do the ~32.8 million variants include SNPs as well? Fig 2: This is probably a panel of a figure but should not be the entire figure. The size of the circle indicates sample size, but there should be a legend on the plot giving the sizes, right? Should a darker colour be used instead of white to highlight the countries with samples? Maybe this could be a Supp figure too. Line 356: S Figure 4 and 5 should be main figures? You will need to annotate the abbreviation of sample-country in the legend of S Figure 5. Line 360: "To enable reuse we have made this dataset available …" The dataset should be made available to reviewers? Line 368: "76% of SNVs were called by both callers" 76% seems low. Also, called does not mean concordant. 
What is the concordance among SNVs called by both? Did the pangenome approach call most of the variants found by GATK? If not, what might be the reasons? Fig 3B: It is not immediately clear what the difference is between non-repetitive and repetitive regions. The overlapping text on the x-axes makes it hard to read. Line 390: "Analyses such as the study of selective sweeps or genome-wide association studies where low frequency variants are often filtered out will benefit less from the advantages of GATK, particularly given its longer run time." From here on, in this paragraph, it's Discussion, not Results. Line 418: Why human? Could you use cattle? Line 427: I tried the browser and am not sure what I can learn from it. It would be helpful to have a README with some examples of what can be explored. Line 450: How large does a variant need to be before you consider it a larger variant? Does this ability to study larger variants still hold despite using only ~10 assemblies in the graph? Will the use of short reads for the selective sweep study still benefit from being able to incorporate these larger variants? As I understand it, the larger variants were found only from the graph, not from the short reads. As such, the selective sweep may not be associated with any larger variants? Line 470: Fig S8 should be a main figure? Line 513: Instead of the UniProt link, perhaps consider including this as Supplementary info or text. The info in the link may change in the future. Line 551: However, without scaffolding, the assemblies of Pakistani river buffalo may not be good enough to function as reference genomes for river buffalo? Line 552: When considering new bases, did you do this for each assembly independently, or were the new bases discovered cumulatively? Line 581: Some of my questions at Line 450 can be discussed here. Line 586: Perhaps consider discussing the limitations of the small number of assemblies used to create the graph. 
As such, many SVs are likely still missing and we are still unable to properly assess allele frequency of these larger SVs. Additionally, while some SVs may not be considered as large in this work, it does not mean they have no impact.
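
      The distinction drawn at Line 368 between a site being called by both callers and the two callers agreeing on the genotype can be made concrete with a small sketch. This is a hypothetical illustration, not the manuscript's method: the site key `(chrom, pos, ref, alt)`, the genotype strings, the toy values, and the choice of the union of sites as the denominator for the shared fraction are all assumptions.

```python
def compare_callsets(calls_a, calls_b):
    """Compare two call sets keyed by (chrom, pos, ref, alt) -> genotype.

    Returns the fraction of sites called by both (over the union of sites)
    and the genotype concordance among the shared sites only.
    """
    shared = calls_a.keys() & calls_b.keys()
    union = calls_a.keys() | calls_b.keys()
    concordant = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return {
        "n_shared": len(shared),
        "shared_fraction": len(shared) / len(union) if union else 0.0,
        "genotype_concordance": concordant / len(shared) if shared else 0.0,
    }


# Toy call sets for one sample; the values are invented for illustration.
toy_gatk = {
    ("1", 100, "A", "G"): "0/1",
    ("1", 200, "C", "T"): "1/1",
    ("1", 300, "G", "A"): "0/1",
}
toy_pangenie = {
    ("1", 100, "A", "G"): "0/1",
    ("1", 200, "C", "T"): "0/1",   # called by both, but genotypes differ
    ("1", 400, "T", "C"): "1/1",
}
toy_comparison = compare_callsets(toy_gatk, toy_pangenie)
print(toy_comparison)
```

      In the toy data two of four distinct sites are shared, yet only one of those two has matching genotypes, which is why the two numbers need to be reported separately.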

    1. Abstract. Background: Influenza A virus (IAV) poses a significant threat to animal health globally, with its ability to overcome species barriers and cause pandemics. Rapid and accurate IAV subtypes and host source prediction is crucial for effective surveillance and pandemic preparedness. Deep learning has emerged as a powerful tool for analyzing viral genomic sequences, offering new ways to uncover hidden patterns associated with viral characteristics and host adaptation. Findings: We introduce WaveSeekerNet, a novel deep learning model for accurate and rapid prediction of IAV subtypes and host source. The model leverages attention-based mechanisms and efficient token mixing schemes, including the Fourier Transform and the Wavelet Transform, to capture intricate patterns within viral RNA and protein sequences. Extensive experiments on diverse datasets demonstrate WaveSeekerNet’s superior performance to existing models that use the traditional self-attention mechanism. Notably, WaveSeekerNet rivals VADR (Viral Annotation DefineR) in subtype prediction using the high-quality RNA sequences, achieving the maximum score of 1.0 on metrics including the Balanced Accuracy, F1-score (Macro Average), and Matthews Correlation Coefficient (MCC). Our approach to subtype and host source prediction also exceeds the pre-trained ESM-2 (Evolutionary Scale Modeling) models with respect to generalization performance and computational cost. Furthermore, WaveSeekerNet exhibits remarkable accuracy in distinguishing between human, avian, and other mammalian hosts. The ability of WaveSeekerNet to flag potential cross-species transmission events underscores its significant value for real-time surveillance and proactive pandemic preparedness efforts. Conclusions: WaveSeekerNet’s superior performance, efficiency, and ability to flag potential cross-species transmission events highlight its potential for real-time surveillance and pandemic preparedness. 
This model represents a significant advancement in applying deep learning for IAV classification and holds promise for future epidemiological, veterinary studies, and public health interventions.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf089), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Will Dampier

      The manuscript presented by Nguyen et al. is well written, well researched, and well executed. The use of this new "wavelet style" neural network shows both an increased training efficiency and improved accuracy at detecting influenza subtypes for surveillance. However, I think their comparison to a 'plain' Transformer model does not take advantage of the improvements in pre-training and transfer-learning that have become standard practice in deep-learning. I have also included some stylistic suggestions to improve the figures as presented. After addressing these comments, I believe that this will become a very strong manuscript.

      Major Comments:

      The authors present a comparison between their new wavelet architecture and a standard transformer architecture using a one-hot encoded vector of amino acids. I believe that this is the correct 'null model' to compare your wavelet architecture to; however, it does not represent the 'state of the art' in utilizing transformers for sequence analysis. As I'm sure the authors are aware, the disadvantage of transformers is that they take an extensive amount of training (they note the transformer-only models take 2-4X more training epochs to converge). However, the advantage they bring is that they can be extensively trained for one task and then transfer that learning to another related task. A number of models have been pre-trained on giant collections of proteins (Asgari et al., https://doi.org/10.1371/journal.pone.0141287, and Rives et al., https://doi.org/10.1073/pnas.2016239118), which then allow one to transfer that knowledge to different domains with fewer examples, as demonstrated in Dampier et al., https://doi.org/10.3389/fviro.2022.880618. It would be interesting to see whether your wavelet model defeats these pre-trained models with transfer learning. If you showed that, you could argue that there is no need for the extensive expense of 'foundational models'.

      The authors discuss that there is a significant imbalance in the training set and they used up-sampling and limiting to balance out the class representation. Since the classes are not equally represented, the model may not be equally able to predict each class. And the high metrics may only be a representation of its ability to predict the popular classes correctly. The authors should include an additional set of figures (supplemental is fine) that show the metrics broken out by Subtype. It would also be interesting to see a graph of the class-size (before up-sampling) vs F1-score (or another metric) on that class. This could provide lower-bounds for how many samples are needed to train the model.
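
      The per-subtype breakdown requested above amounts to computing F1 for each class alongside its support (the class size before up-sampling). A minimal sketch follows, assuming plain lists of true and predicted labels; the function name `per_class_f1` and the toy subtype labels are hypothetical and not taken from the manuscript.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: {"support": n_true, "f1": score}} for every class seen."""
    support = Counter(y_true)
    results = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never appears.
        results[cls] = {"support": support[cls], "f1": 2 * tp / denom if denom else 0.0}
    return results


# Toy labels: one common, one medium, and one rare subtype.
toy_true = ["H1N1"] * 4 + ["H3N2"] * 3 + ["H5N1"]
toy_pred = ["H1N1"] * 4 + ["H3N2", "H3N2", "H1N1"] + ["H3N2"]
toy_scores = per_class_f1(toy_true, toy_pred)
for cls, row in toy_scores.items():
    print(cls, row["support"], round(row["f1"], 3))
```

      Plotting `support` against `f1` for each class gives the class-size versus performance graph the reviewer describes; in the toy data the rare class scores 0.0 even though the common classes score well.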

      Minor Comments:

      Figures 3, 4, and 5: These would benefit from a linked y-axis. It is hard to compare across A/B/C/D when the axes have different y-limits.

    1. Author response:

      We thank both reviewers for their valuable comments. We have prepared a point-by-point response below.

      Reviewer #1 (Public review):

      Weaknesses:

      (1) The conclusions regarding the links between neural and behavioral mechanisms are mostly well supported by the data. However, what is less convincing is the authors' argument that their study offers evidence of 'priming'. An important hallmark of priming, at least as is commonly understood by cognitive scientists, is that it is stimulus specific: i.e., a repeated stimulus facilitates response times (repetition priming), or a repeated but previously ignored stimulus increases response times (negative priming). That is, it is an effect on a subsequent repeated stimulus, not ANY subsequent stimulus. Because (prime or target) stimuli are not repeated in the current experiments, the conditions necessary for demonstrating priming effects are not present. Instead, a different phenomenon seems to be demonstrated here, and one that might be more akin to approach/avoidance behavior to a novel or salient stimulus following an appetitive/aversive stimulus, respectively.

      (2) On a similar note, the authors' claim that 'priming' per se has not been well studied in non-human animals is not quite correct and would need to be revised. Priming effects have been demonstrated in several animal types, although perhaps not always described as such. For example, the neural underpinnings of priming effects on behavior have been very well characterized in human and non-human primates, in studies more commonly described as investigations of 'response suppression'.

      We thank the reviewer for these critical comments. After careful consideration of both reviews, we agree that “priming” may not be the most accurate term to describe the behavioral phenomenon. We plan to revise our terminology throughout the manuscript accordingly to better capture the generalized nature of the effect we observe.

      (3) The outcome measure - i.e., difference scores between the two odors or odor and non-odor (i.e., the number of flies choosing to approach the novel odor versus the number approaching the non-odor (air)) - appears to be reasonable to account for a natural preference for odors in the mock-trained group. However, it does not provide sufficient clarification of the results. The findings would be more convincing if these relative scores were unpacked - that is, instead of analyzing difference scores, the results of the interaction between group and odor preference (e.g., novel or air) (or even within the pre- and post-training conditions with the same animals) would provide greater clarity. This more detailed account may also better support the argument that the results are not due to conditioning of the US with pure air.

      We use the PI score as a standard metric to quantify odor preference in all behavioral assays because it allows for robust comparison across different genetic or treatment groups under the same experimental setting. In the T-maze, real-time tracking of fly trajectories is technically difficult. With olfactory arenas, we showed some examples of fly distribution in quadrants over the entire odor choice test period (Figure 2—figure supplement 2) for both pre-trained and post-trained groups and discussed the trajectories in the Discussion. We will ensure this point is clarified in the revised text.

      Reviewer #2 (Public review):

      […] They finally recorded from different mushroom body output neurons, including the one (MBON-γ4γ5) likely affected by the increased activity of the corresponding γ4 reward dopaminergic neurons after shock preexposure. They recorded odour-evoked responses from these neurons before and after shock preexposure, but did not find any plasticity, while they found a logical effect during spaced cycles of aversive training.

      We thank the reviewer for the summary. We would like to clarify that we did, in fact, observe plasticity in MBON-γ4γ5 following shock exposure, as shown in Figure 4B.

      Overall, the study is very interesting with a substantial amount of behavioural analysis and in vivo 2-photon calcium imaging data, but some major (and some minor) issues have to be resolved to strengthen their conclusions.

      (1) According to neuropsychological work (Henson, Encyclopedia of Neuroscience (2009), vol. 7, pp. 1055-1063), "Priming refers to a change in behavioral response to a stimulus, following prior exposure to the same, or a related, stimulus. Examples include faster reaction times to make a decision about the stimulus, a bias to produce that stimulus when generating responses, or the more accurate identification of a degraded version of the stimulus". Or "Repetition priming refers to a change in behavioural response to a stimulus following re-exposure" (PMID: 18328508). I therefore do not think that the effects observed by the authors are really the investigation of the neural mechanisms of priming. To me, the effect they observed seems more related to sensitisation, especially for the activation of sweet-sensing neurons. For the shock effect, it could be a safety phenomenon, as in Jacob and Waddell, 2020, involving (as for sugar reward) different subsets for short-term and long-term safety.

      As noted in our response to Reviewer #1, we plan to revise our use of the term “priming” in the manuscript to more accurately interpret the behavioral phenomenon.

      (2) The author missed the paper from Thomas Preat, The Journal of Neuroscience, October 15, 1998, 18(20):8534-8538 (Decreased Odor Avoidance after Electric Shock in Drosophila Mutants Biases Learning and Memory Tests). In this paper, one of the effects observed by the authors has already been described, and the molecular requirement of memory-related genes is investigated. This paper should be mentioned and discussed.

      We thank the reviewer for bringing this important reference to our attention. We will cite the Preat (1998) paper and discuss its relevant findings in relation to our own in the revised manuscript.

      (3) Overall, the bidirectional effect they observed is interesting; however, their results are not always clear, and the use of a delta PI is sometimes misleading. The authors have mentioned that shocks induced attraction to the novel odour, while they should stick to the increase or decrease in preference/avoidance.

      The ΔPI is calculated either as (trained PI – mock PI) for different animals or as (post PI – pre PI) for the same animals, with the specific calculation clarified in each figure legend. A positive ΔPI signifies an increase in preference for the odor, which is equivalent to a relative attraction or a decrease in avoidance.
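
The two ΔPI conventions described above can be sketched as follows. The PI formula used here, (odor − air) / total, is an assumption based on the reviewer's description of a difference score between flies choosing the odor and flies choosing air; the response does not state the exact formula, and the fly counts are hypothetical.

```python
def preference_index(n_odor, n_air):
    """Assumed PI: (flies at odor - flies at air) / total flies."""
    total = n_odor + n_air
    if total == 0:
        raise ValueError("no flies counted")
    return (n_odor - n_air) / total


def delta_pi(pi_treated, pi_reference):
    """Delta PI as (trained PI - mock PI) or (post PI - pre PI)."""
    return pi_treated - pi_reference


# Hypothetical counts: both groups avoid the odor, the trained group less so.
mock_pi = preference_index(n_odor=30, n_air=70)     # -0.4
trained_pi = preference_index(n_odor=45, n_air=55)  # -0.1
change = delta_pi(trained_pi, mock_pi)

print(f"mock PI={mock_pi:.2f}, trained PI={trained_pi:.2f}, delta PI={change:.2f}")
```

In this toy case both PIs are negative yet ΔPI is positive, which illustrates why a positive ΔPI can reflect reduced avoidance rather than outright attraction.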

      As not all experiments are done in parallel logic, it is not always easy to understand which protocol the authors are using. For example, only optogenetics is used in the appetitive preexposure. Does exposing flies to sugar or activating reward dopaminergic neurons also increase odour avoidance? Does the increased odour avoidance observed after optogenetic activation of sweet-sensing neurons involve reward (e.g., decreased response) and/or punishment (e.g., increased response)?

      We used different behavioral assays (T-maze or arena), stimuli (real shock or optogenetics), and protocols (different or same animal groups) to robustly demonstrate the phenomenon across platforms. We explained each protocol in the figures or text, and we will make them easier to follow in the revised version. We focused on activating a clean set of sugar-sensing neurons because this optogenetic stimulus is an effective and efficient substitute for real sugar. We agree that testing reward dopaminergic neuron activation is a logical extension and will consider adding these experiments in the revised work.

      The author should always statistically test the fly behavioural performances against 0 to have an idea of random choice or a clear preference toward an odour.

      Our primary focus is on the change in preference induced by training, rather than the innate odor preference itself, which can be highly variable due to physiological and environmental factors. Statistical testing against 0 for innate preference scores is not standard practice in this specific paradigm, as the critical question is whether a treatment alters behavior relative to a control.

      On the appetitive side, the internal hunger state would play an important role. The author should test it or at least discuss it.

      For appetitive experiments, we always starve the flies on 1% agar for two days prior to behavioral tests to standardize their hunger state. We will consider adding fed flies as control groups in the revised work.

      (4) The authors found a discrepancy between genetic backgrounds; sometimes the same odour can be attractive or aversive.

      We observed minor discrepancies in innate odor preferences across genetic backgrounds, which is a known and common occurrence. Different genotypes and temperatures can result in different baseline PI scores. However, the key finding is that the relative change in odor preference following an aversive stimulus is consistent: it increases the relative preference for an odor compared to air. This sometimes reverses valence (aversion to attraction) and other times simply reduces aversion. Our analysis focuses on this consistent, relative change.

      Different effects between the T-maze and the olfactory arena are found. The authors proposed that: "Punishment priming effect was still not detected, probably due to the insensitivity of the optogenetic arena". This is unclear to me, considering all prior work using this arena. The author should discuss it more clearly.

      The punishment effect with CS+ present was reliably detected in the T-maze (Figure 1A) but was not significant in the olfactory arena (Figure 2—figure supplement 1B-C). We hypothesize that the olfactory arena assay is less sensitive than the T-maze for detecting such subtle behavioral changes. This is evidenced by the fact that even classical odor-shock conditioning yields lower PI in the arena (typically ~0.4) than in the T-maze (~0.8), likely due to the greater distance flies must explore and travel. The higher variance in the arena may therefore mask more modest effects. Here the effect under investigation was induced by optogenetically activating only a small subset of aversive dopaminergic neurons, a stimulus that is likely weaker than full electric shock. This reduced stimulus strength may have contributed to the challenge of detecting a significant effect in the less sensitive arena paradigm.

      They mentioned that flies could not be conditioned with air and electric shock. However, flies could be conditioned with the context + shock, which is changing in the T-maze and not in the optogenetic arena.

      While flies can be conditioned to context, during the optogenetic stimulation period in the arena, the light is delivered uniformly across all four quadrants. Therefore, any potential context conditioning would be equivalent across the entire chamber and should not bias the final distribution of flies between the odor and air quadrants during the test, nor affect the calculated PI score.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The authors revealed the cellular heterogeneity of companion cells (CCs) and demonstrated that the florigen gene FT is highly expressed in a specific subpopulation of these CCs in Arabidopsis. Through a thorough characterization of this subpopulation, they further identified NITRATE-INDUCIBLE GARP-TYPE TRANSCRIPTIONAL REPRESSOR 1 (NIGT1)-like transcription factors as potential new regulators of FT. Overall, these findings are intriguing and valuable, contributing significantly to our understanding of florigen and the photoperiodic flowering pathway. However, there is still room for improvement in the quality of the data and the depth of the analysis. I have several comments that may be beneficial for the authors. 

      Strengths: 

      The usage of snRNA-seq to characterize the FT-expressing companion cells (CCs) is very interesting and important. Two findings are novel: 1) Expression of FT in CCs is not uniform. Only a subcluster of CCs exhibits high expression level of FT. 2) Based on consensus binding motifs enriched in this subcluster, they further identify NITRATE-INDUCIBLE GARP-TYPE TRANSCRIPTIONAL REPRESSOR 1 (NIGT1)-like transcription factors as potential new regulators of FT. 

      We are pleased to hear that reviewer 1 noted the novelty and importance of our work. As reviewer 1 mentioned, we are also excited about the identification of a subcluster of companion cells with very high FT expression. We believe that this work is an initial step to describe the molecular characteristics of these FT-expressing cells. We are also excited to share our new findings on NIGT1s as potential FT regulators. We believe this finding will attract a broader audience, as the molecular factor coordinating plant nutrition status with flowering time remains largely unknown, although the phenomenon itself is well known.

      Weaknesses: 

      (1) Title: "A florigen-expressing subpopulation of companion cells". It is a bit misleading. The conclusion here is that only a subset of companion cells exhibit high expression of FT, but this does not imply that other companion cells do not express it at all. 

      We agree with this comment; it was not our intention to imply that FT is not produced in companion cells other than the subpopulation we identified. We revised the title to reflect this point more accurately. The new title is “Companion cells with high florigen production express other small proteins and reveal a nitrogen-sensitive FT repressor.”

      (2) Data quality: Authors opted for fluorescence-activated nuclei sorting (FANS) instead of traditional cell sorting method. What is the rationale behind this decision? Readers may wonder, especially given that RNA abundance in single nuclei is generally lower than that in single cells. This concern also applies to snRNA-seq data. Specifically, the number of genes captured was quite low, with a median of only 149 genes per nucleus. Additionally, the total number of nuclei analyzed was limited (1,173 for the pFT:NTF and 3,650 for the pSUC2:NTF). These factors suggest that the quality of the snRNA-seq data presented in this study is quite low. In this context, it becomes challenging for the reviewer to accurately assess whether this will impact the subsequent conclusions of the paper. Would it be possible to repeat this experiment and get more nuclei?

      We appreciate this comment; we noticed that we did not clearly explain the rationale for using single-nucleus RNA sequencing (snRNA-seq) instead of single-cell RNA-seq (scRNA-seq). As reviewer 1 mentioned, RNA abundance in scRNA-seq is higher than in snRNA-seq. To conduct scRNA-seq using plant cells, protoplasting is a necessary step. However, in our study, protoplasting has many drawbacks for isolating our target cells from the phloem. First, it is technically challenging to efficiently isolate protoplasts of phloem companion cells, which are deeply embedded in plant tissues. Typically, at least several hours of enzymatic incubation are required to obtain protoplasts from companion cells (often using semi-isolated vasculatures), and the efficiency of protoplasting vasculature cells remains low. Second, preserving time-of-day information is also crucial for our analysis. Therefore, we employed a more rapid isolation method. In the revised manuscript, we added four new sentences in the Introduction section to clearly explain our rationale for choosing snRNA-seq given these technical limitations.

      Reviewer 1 also raised a concern about the quality of our snRNA-seq data, referring to the relatively low read counts per nucleus. Although we believe that shallow reads do not necessarily indicate low quality and are confident in the accuracy of our snRNA-seq data, as supported by the detailed follow-up experiments (e.g., imaging analysis in Fig. 4B), we agree that it is important to address this point in the revision and alleviate readers’ concerns regarding the data quality.

      We believe the primary reason for the low read counts per nucleus is the small amount of RNA present in each Arabidopsis vascular cell nucleus that we isolated. For bulk nuclei RNA-seq, we collected 15,000 nuclei. However, the total RNA amount was approximately 3 ng. This indicates that each isolated nucleus contains a very limited amount of RNA (by simple calculation, 3,000 pg / 15,000 nuclei = 0.2 pg/nucleus). It appears that the size of cells and nuclei was still small in 2-week-old seedlings; thus, each nucleus may contain lower levels of RNA. During the optimization process, we also tried fixing the tissues in the hope of retaining more nuclear RNA, but unfortunately, in our hands, fixation caused nuclei aggregation that hindered the sorting process, making it unsuitable for single-nucleus RNA-seq.
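
The per-nucleus estimate quoted above can be reproduced with a short calculation. The inputs are the figures stated in the response (approximately 3 ng total RNA from 15,000 sorted nuclei); the variable names are illustrative only.

```python
PG_PER_NG = 1_000  # 1 ng = 1,000 pg

total_rna_ng = 3       # approximate total RNA recovered, as stated in the response
sorted_nuclei = 15_000  # number of nuclei collected for bulk RNA-seq

rna_per_nucleus_pg = total_rna_ng * PG_PER_NG / sorted_nuclei
print(f"{rna_per_nucleus_pg:.1f} pg RNA per nucleus")
```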

      Reviewer 1 suggested that we repeat the same snRNA-seq experiment. We agree that having more cells increases the reliability of data. However, to our knowledge, higher cell numbers enhance the confidence of clustering, but not read counts per cell. In our snRNA-seq data, our target, FT-expressing cells, were observed in cluster 7, which projected at an obvious distance from other cell clusters. Therefore, we think that having more nuclei would not significantly help in separating the high FT-expressing cluster 7 cells from other cell types, although we might obtain more DEGs from the cluster 7 cells. Considering the costs and time required for additional snRNA-seq experiments, we think that adding more follow-up molecular biology data would be more practical. We clearly stated the limitations of our approach in the Discussion section: “A drawback of our snRNA-seq analysis was shallow reads per nucleus. It appears mainly due to the low abundance of mRNA in nuclei from 2-week-old leaves. Based on our calculation, the average mRNA level per nucleus is approximately 0.2 pg (3,000 pg mRNA from 15,000 sorted nuclei). Future technological advance is needed to improve the data quality.”

      In this revised version of the manuscript, we silenced FT gene expression using an amiRNA against FT driven by tissue-specific promoters [pROXY10, cluster 7; pSUC2, companion cells; pPIP2.6, cluster 4 (for the spatial expression pattern of PIP2.6, please see the new data shown in Fig. S8F); pGC1, guard cells]. Given that both FT and ROXY10 were highly expressed in cluster 7 of our snRNA-seq dataset, we anticipated a late flowering phenotype for pROXY10:amiR-ft. As we expected, pROXY10:amiR-ft but not pPIP2.6:amiR-ft lines showed delayed flowering phenotypes (Fig. S14A), supporting the validity of our snRNA-seq approach. We are also now more confident in the resolution of our snRNA-seq analysis, since the cluster 4-specific pPIP2.6:amiR-ft construct did not cause late flowering even though PIP2.6 has higher basal expression than ROXY10 (Fig. S14B).

      (3) Another disappointment is that the authors did not utilize reporter genes to identify the specific locations of the FT-high expressing cells (cluster 7 cells) within the CC population in vivo. Are there any discernible patterns that can be observed? 

      In the original manuscript, as we showed only limited spatial images of overlap between FT and other cluster 7 genes in Fig. 4B, this comment is totally understandable. To respond to it, we added whole leaf images showing the spatial expression of FT and other cluster 7 genes (Fig. S12). These data indicate that cluster 7 genes including FT are expressed highly in minor veins in the distal part of the leaf but weakly in the main vein. We also added enlarged images of spatial expression of FT and cluster 7 genes (FLP1 and ROXY10) to note that those genes do not overlap completely (Fig. S13).

      In contrast to cluster 7 genes, genes highly expressed in cluster 4, such as LTP1 and MLP28, are reportedly highly expressed in the main leaf vein. To further confirm it, we established a transgenic line that expresses a GFP-fusion protein controlled by the promoter of a cluster 4-specific gene PIP2.6 (Fig. S8F). It also showed strong GFP signals in the main vein, consistent with previous observations of LTP1 and MLP28. In summary, FT-expressing cells (cluster 7 cells) are enriched in companion cells in the minor vein, and their expression patterns show a clear distinction from genes expressed in the main vein (e.g., cluster 4-specific genes).

      (4) The final disappointment is that the authors only compared FT expression between the nigtQ mutants and the wild type. Does this imply that the mutant does not have a flowering time defect particularly under high nitrogen conditions? 

      We agree with reviewer 1 that more experiments are required to conclude the role of NIGT1 on FT regulation, in addition to our Y1H data, flowering time data of NIGT1 overexpressors, and FT expression in NIGT1 overexpressors and nigtQ mutant.

      First, to test the direct regulation of NIGT1s on FT transcription, we conducted a transient luciferase (LUC) assay in tobacco leaves using effectors (p35S:NIGT1.2, p35S:NIGT1.4, and p35S:GFP) and reporters [pFT:LUC (FT promoter fused with LUC) and pFTm:LUC (the same FT promoter with mutations in NIGT1-binding sites fused with LUC)]. Our result showed that NIGT1.2 and NIGT1.4, but not GFP, decreased the activity of pFT:LUC but not pFTm:LUC (Fig. 5C). This indicates that NIGT1s directly repress the FT gene.

      Second, to address reviewer 1’s suggestion about the effect of the nigtQ mutation on flowering time, we grew WT and nigtQ plants on 20 mM and 2 mM NH<sub>4</sub>NO<sub>3</sub>. Under 20 mM NH<sub>4</sub>NO<sub>3</sub>, the nigtQ line bolted earlier than WT; under 2 mM NH<sub>4</sub>NO<sub>3</sub>, nigtQ and WT bolted at almost the same time (Fig. S17D and E). This result suggests that the nigtQ mutation affects flowering timing depending on nitrogen nutrient status. However, leaf numbers of bolted plants were not different between WT and nigtQ lines (Fig. S17E). Therefore, it appears that the nigtQ mutation accelerated overall plant growth rather than specifically promoting flowering. We also measured flowering time by counting leaf numbers of the nigtQ and WT plants at bolting on nitrogen-rich soil. The mutant generated slightly more leaves than WT when they flowered (Fig. S17G). These results suggest that the NIGT1-dependent fine-tuning of FT regulation is conditional on higher nitrogen conditions.

      Minor: 

      (1) Abstract: "Our bulk nuclei RNA-seq demonstrated that FT-expressing cells in cotyledons and in true leaves differed transcriptionally.". This sentence is not informative. What exactly is the difference in FT-expressing cells between cotyledons and true leaves? 

      We modified the sentence to clarify the differences between cotyledons and true leaves. “Our bulk nuclei RNA-seq demonstrated that FT-expressing cells in cotyledons and true leaves showed differences especially in FT repressor genes.”

      (2) As a standard practice, to support the direct regulation of FT by NIGT1, the authors should provide EMSA and ChIP-seq data. Ideally, they should also generate promoter constructs with deletions or mutations in the NIGT1 binding sites. 

      To test the direct interaction of NIGT1 with the FT promoter sequences, we performed a transient reporter assay using an FT promoter-driven luciferase reporter (Fig. 5C). NIGT1.2 and NIGT1.4 repressed the FT promoter activity; however, with NIGT1 binding site mutations, this repression was not observed, indicating that NIGT1 binds to the cis-elements in the FT promoter to repress its transcription.

      (3) Sorting: Did the authors fix the samples before preparing the nuclei suspension? If not, could this be the reason the authors observed the JA-responsive clusters (Fig. 2J)? Please provide more details related to nuclei sorting in the Methods section. 

      We added a new subsection in the Materials and Methods section to describe the nuclei sorting procedure in detail. We did not include a sample fixation step. We tried formaldehyde fixation; however, it clumped the nuclei, which made the samples unsuitable for snRNA-seq. Moreover, fixation steps generally reduce read counts in single-cell RNA-seq according to the 10X Genomics guidelines.

      We agree that JA responses were triggered during the FANS nuclei isolation. Therefore, we added the following sentence. “Since our FANS protocol did not include a sample fixation step to avoid clumping, these cells likely triggered wounding responses during the chopping and sorting process (Fig. S1B).”

      Reviewer #2 (Public review): 

      This manuscript submitted by Takagi et al. details the molecular characterization of the FT-expressing cell at a single-cell level. The authors examined what genes are expressed specifically in FT-expressing cells and other phloem companion cells by exploiting bulk nuclei and single-nuclei RNA-seq and transgenic analysis. The authors found the unique expression profile of FT-expressing cells at a single-cell level and identified new transcriptional repressors of FT such as NIGT1.2 and NIGT1.4.

      Although previous researchers have known that FT is expressed in phloem companion cells, they have tended to neglect the molecular characterization of the FT-expressing phloem companion cells. To understand how FT, which is expressed in tiny amounts in phloem companion cells that make up a very small portion of the leaf, can be a key molecule in the regulation of the critical developmental step of floral transition, it is important to understand the molecular features of FT-expressing cells in detail. In this regard, this manuscript provides insight into the understanding of detailed molecular characteristics of the FT-expressing cell. This endeavor will contribute to the research field of flowering time. 

      We are grateful that reviewer 2 recognizes the importance of transcriptome profiling of FT-expressing cells at the single-cell level.

      Here are my comments on how to improve this manuscript. 

      (1) The most novel finding of this manuscript is the identification of NIGT1.2 as the upstream regulator of FT-expressing cluster 7 gene expression. The flowering phenotypes of the nigtQ mutant and the transgenic plants in which NIGT1.2 was expressed under the SUC2 gene promoter support that NIGT1.2 functions as a floral repressor upstream of the FT gene. Nevertheless, the expression patterns of NIGT1.2 genes do not appear to have much overlap with those of NIGT1.2-downstream genes in the cluster 7 (Figs S14 and F3). An explanation for this should be provided in the discussion section.

      We agree with reviewer 2 that the spatial expression patterns of NIGT1.2 and cluster 7 genes do not overlap much, and some discussion should be provided in the manuscript. Although we do not have a concrete answer for this phenomenon, we obtained the new data showing that NIGT1.2 and NIGT1.4 directly repress the FT gene in planta (Fig. 5C). As NIGT1.2/1.4 are negative regulators of FT, it is plausible that NIGT1.2/1.4 may suppress FT gene expression in non-cluster 7 cells to prevent the misexpression of FT. We added this point in the Results section.

      (2) To investigate gene expression in the nuclei of specific cell populations, the authors generated transgenic plants expressing a fusion gene encoding a Nuclear Targeting Fusion protein (NTF) under the control of various cell type-specific promoters. Since the public audience would not know about NTF without reading reference 16, some explanation of NTF is necessary in the manuscript. Please provide a schematic of constructs the authors used to make the transformants.

      As reviewer 2 pointed out, we lacked a clear explanation of why we used NTF in this study. NTF is the fusion protein that consists of a nuclear envelope targeting WPP domain, GFP, and a biotin acceptor peptide. It was initially designed for the INTACT (isolation of nuclei tagged in specific cell types) method, which enables us to isolate bulk nuclei from specific tissues. Although our original intention was to profile the bulk transcriptome of mRNAs that exist in nuclei of the FT-expressing cells using INTACT, we utilized our NTF transgenic lines for snRNA-seq analysis. To explain what NTF is to readers, we included a schematic diagram of NTF (Fig. S1A) and more explanation about NTF in the Results section.

      Again, we appreciate all reviewers’ careful and constructive comments. With these changes, we hope our revised manuscript is now satisfactory.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The study by Klug et al. investigated the pathway specificity of corticostriatal projections, focusing on two cortical regions. Using a G-deleted rabies system in D1-Cre and A2a-Cre mice to retrogradely deliver channelrhodopsin to cortical inputs, the authors found that M1 and MCC inputs to direct and indirect pathway spiny projection neurons (SPNs) are both partially segregated and asymmetrically overlapping. In general, corticostriatal inputs that target indirect pathway SPNs are likely to also target direct pathway SPNs, while inputs targeting direct pathway SPNs are less likely to also target indirect pathway SPNs. Such asymmetric overlap of corticostriatal inputs has important implications for how the cortex itself may determine striatal output. Indeed, the authors provide behavioral evidence that optogenetic activation of M1 or MCC cortical neurons that send axons to either direct or indirect pathway SPNs can have opposite effects on locomotion and different effects on action sequence execution. The conclusions of this study add to our understanding of how cortical activity may influence striatal output and offer important new clues about basal ganglia function. 

      The conceptual conclusions of the manuscript are supported by the data, but the details of the magnitude of afferent overlap and causal role of asymmetric corticostriatal inputs on some behavioral outcomes may be a bit overstated given technical limitations of the experiments. 

      For example, after virally labeling either direct pathway (D1) or indirect pathway (D2) SPNs to optogenetically tag pathway-specific cortical inputs, the authors report that a much larger number of "non-starter" D2-SPNs from D2-SPN labeled mice responded to optogenetic stimulation in slices than "non-starter" D1 SPNs from D1-SPN labeled mice did. Without knowing the relative number of D1 or D2 SPN starters used to label cortical inputs, it is difficult to interpret the exact meaning of the lower number of responsive D2-SPNs in D1 labeled mice (where only ~63% of D1-SPNs themselves respond) compared to the relatively higher number of responsive D1-SPNs (and D2-SPNs) in D2 labeled mice. While relative differences in connectivity certainly suggest that some amount of asymmetric overlap of inputs exists, differences in infection efficiency and ensuing differences in detection sensitivity in slice experiments make determining the degree of asymmetry problematic. 

      It is also unclear if retrograde labeling of D1-SPN- vs D2-SPN- targeting afferents labels the same densities of cortical neurons. This gets to the point of specificity in some of the behavioral experiments. If the target-based labeling strategies used to introduce channelrhodopsin into specific SPN afferents label significantly different numbers of cortical neurons, might the difference in the relative numbers of optogenetically activated cortical neurons itself lead to behavioral differences? 

      We thank the reviewer for the comments and for raising additional interpretations of our results. We agree that determining the relative number of D1- versus D2-SPN starter cells would allow a more accurate estimate of connectivity. However, due to current technical limitations, achieving this level of precision remains challenging. As the reviewer also noted, differences in the number of cortical neurons targeting D1- versus D2-SPNs could introduce additional complexity to the functional effects observed in the behavioral experiments. Moreover, functional heterogeneity is likely to exist not only among cortical neurons projecting to striatal D1- or D2-SPNs, but also within the striatal D1- and D2-SPN populations themselves. Addressing these questions at the single-neuron level will require more refined viral tools in combination with improved recording and manipulation techniques. Despite these limitations, our results suggest that a subpopulation of cortical neurons selectively targets striatal D1-SPNs, supporting a functional dichotomy of pathway-specific corticostriatal subcircuits in the control of behavior.   

      Reviewer #2 (Public review): 

      Summary: 

      Klug et al. use monosynaptic rabies tracing of inputs to D1- vs D2-SPNs in the striatum to study how separate populations of cortical neurons project to D1- and D2-SPNs. They use rabies to express ChR2, then patch D1- or D2-SPNs to measure synaptic input. They report that cortical neurons labeled as D1-SPN-projecting preferentially project to D1-SPNs over D2-SPNs. In contrast, cortical neurons labeled as D2-SPN-projecting project equally to D1- and D2-SPNs. They go on to conduct pathway-specific behavioral stimulation experiments. They compare direct optogenetic stimulation of D1- or D2-SPNs to stimulation of MCC inputs to DMS and M1 inputs to DLS. In three different behavioral assays (open field, intra-cranial self-stimulation, and a fixed ratio 8 task), they show that stimulating MCC or M1 cortical inputs to D1-SPNs is similar to D1-SPN stimulation, but that stimulating MCC or M1 cortical inputs to D2-SPNs does not recapitulate the effects of D2-SPN stimulation (presumably because both D1- and D2-SPNs are being activated by these cortical inputs). 

      Strengths: 

      Showing these same effects in three distinct behaviors is strong. Overall, the functional verification of the consequences of the anatomy is very nice to see. It is a good choice to patch only from mCherry-negative non-starter cells in the striatum. This study adds to our understanding of the logic of corticostriatal connections, suggesting a previously unappreciated structure. 

      Weaknesses: 

      One limitation is that all inputs to SPNs are expressing ChR2, so they cannot distinguish between different cortical subregions during patching experiments. Their results could arise because the same innervation patterns are repeated in many cortical subregions or because some subregions have preferential D1-SPN input while others do not. 

      Thank you for raising this thoughtful concern. It is indeed not feasible to restrict ChR2 expression to a specific cortical region using the first-generation rabies-ChR2 system alone. A more refined approach would involve injecting Cre-dependent TVA and RG into the striatum of D1- or A2A-Cre mice, followed by rabies-Flp infection. Subsequently, a Flp-dependent ChR2 virus could be injected into the MCC or M1 to selectively label D1- or D2-projecting cortical neurons. This strategy would allow for more precise targeting and address many of the current limitations.

      However, a significant challenge lies in the cytotoxicity associated with rabies virus infection. Neuronal health begins to deteriorate substantially around 10 days post-infection, which provides an insufficient window for robust Flp-dependent ChR2 expression. We have tested several new rabies virus variants with extended survival times (Chatterjee et al., 2018; Jin et al., 2024), but unfortunately, they did not perform effectively or suitably in the corticostriatal systems we examined.

      In our experimental design, the aim was to delineate the probability that cortical neurons connect to D1- or D2-SPNs. We considered several possibilities: that similar innervation patterns occur across multiple cortical subregions, that some subregions show preferential input to D1-SPNs while others do not, or a combination of both. This led us to perform a series of behavioral tests using optogenetic activation of the D1- or D2-projecting cortical populations to see which scenario might apply.

      In the cortical areas we examined, MCC and M1, the behavioral results are consistent with our electrophysiological results. Specifically, when we stimulated the D1-projecting cortical neurons in either MCC or M1, mice exhibited facilitated locomotion in the open field test, the same effect as activating D1-SPNs in the striatum alone (MCC: Fig 3C & D vs. I; M1: Fig 3F & G vs. L). Conversely, stimulation of D2-projecting MCC or M1 cortical neurons resulted in behavioral effects that appeared to combine characteristics of both D1- and D2-SPN activation in the striatum (MCC: Fig 3C & D vs. J; M1: Fig 3F & G vs. M). Similar results were observed in the ICSS test. Our interpretation of these results is that the activation of D1-projecting neurons in the cortex induces behavioral changes akin to D1 neuron activation, while activation of D2-projecting neurons in the cortex leads to a combined effect of both D1 and D2 neuron activation. This suggests that at least the cortical regions we tested follow the hypothesis we proposed.

      There are also some caveats with respect to the efficacy of rabies tracing. Although they only patch non-starter cells in the striatum, only 63% of D1-SPNs receive input from D1-SPN-projecting cortical neurons. It's hard to say whether this is "high" or "low," but one question is how far from the starter cell region they are patching. Without this spatial indication of where the cells that are being patched are relative to the starter population, it is difficult to interpret if the cells being patched are receiving cortical inputs from the same neurons that are projecting to the starter population. The authors indicate they are patching from mCherry-negative neurons within the region of the mCherry-positive neurons, but since the mCherry population will include both true starter cells and monosynaptically connected cells, this is not perfectly precise. Convergence of cortical inputs onto SPNs may vary with distance from the starter cell region quite dramatically, as other mapping studies of corticostriatal inputs have shown specialized local input regions can be defined based on cortical input patterns (Hintiryan et al., Nat Neurosci, 2016, Hunnicutt et al., eLife 2016, Peters et al., Nature, 2021). 

      This is a valid concern regarding anatomical studies. Investigating cortico-striatal connectivity at the single-cell level remains technically challenging due to current methodological limitations. At present, we rely on rabies virus-mediated trans-synaptic retrograde tracing to identify D1- or D2-projecting cortical populations. This anatomical approach is coupled with ex vivo slice electrophysiology to assess the functional connectivity between these projection-defined cortical neurons and striatal SPNs. This enables us to quantify connection ratios, for example, the proportion of D1-projecting cortical neurons that functionally synapse onto non-starter D1-SPNs.

      To ensure the robustness of our conclusions, it is essential that both the starter cells and the recorded non-starter SPNs receive comparable topographical input from the cortex and other brain regions. Therefore, we carefully designed our experiments so that all recorded cells were located within the injection site, were mCherry-negative (i.e., non-starter cells), and were surrounded by ChR2-mCherry-positive neurons. This configuration ensured that the distance between recorded and starter cells did not exceed 100 µm, maintaining close anatomical proximity and thereby preserving the likelihood of shared cortical innervation within the examined circuitry.

      These methodological details are also described in the section on ex vivo brain slice electrophysiology, specifically in the Methods section, lines 453–459:

      “D1-SPNs (eGFP-positive in D1-eGFP mice, or eGFP-negative in D2-eGFP mice) or D2-SPNs (eGFP-positive in D2-eGFP mice, or eGFP-negative in D1-eGFP mice) that were ChR2-mCherry-negative, but in the injection site and surrounded by cells expressing ChR2-mCherry were targeted for recording. This configuration ensured that the distance between recorded and starter cells did not exceed 100 µm, maintaining close anatomical proximity and thereby preserving the likelihood of shared cortical innervation within the examined circuitry.”

      This experimental strategy was implemented to control for potential spatial biases and to enhance the interpretability of our connectivity measurements.

      A caveat for the optogenetic behavioral experiments is that these optogenetic experiments did not include fluorophore-only controls, although a different control (with light delivered in M1) is provided in Supplementary Figure 3. Another point of confusion is that other studies (Cui et al, J Neurosci, 2021) have reported that stimulation of D1-SPNs in DLS inhibits rather than promotes movement. This study may have given different results due to subtly different experimental parameters, including fiber optic placement and NA.

      We appreciate the reviewer’s thoughtful evaluation and comments. We have added a short discussion of Cui et al.’s study on optogenetic stimulation of D1-SPNs in the DLS (lines 341-343), which reports findings that contrast with ours and those of other studies.

      Reviewer #3 (Public review): 

      Review of resubmission: The authors provided a response to the reviews from myself and other reviewers. While some points were made satisfactorily, particularly in clarification of the innervation of cortex to striatum and the effects of input stimulation, many of my points remain unaddressed. In several cases, the authors chose to explain their rationale rather than address the issues at hand. A number of these issues (in fact, the majority) could be addressed simply by toning down the confidence in conclusions, so it was disappointing to see that the authors by and large did not do this. I repeat my concerns below and note whether I find them to have been satisfactorily addressed or not. 

      In the manuscript by Klug and colleagues, the investigators use a rabies virus-based methodology to explore potential differences in connectivity from cortical inputs to the dorsal striatum. They report that the connectivity from cortical inputs onto D1 and D2 MSNs differs in terms of their projections onto the opposing cell type, and use these data to infer that there are differences in cross-talk between cortical cells that project to D1 vs. D2 MSNs. Overall, this manuscript adds to the overall body of work indicating that there are differential functions of different striatal pathways which likely arise at least in part by differences in connectivity that have been difficult to resolve due to difficulty in isolating pathways within striatal connectivity, and several interesting and provocative observations were reported. Several different methodologies are used, with partially convergent results, to support their main points. 

      However, I have significant technical concerns about the manuscript as presented that make it difficult for me to interpret the results of the experiments. My comments are below. 

      Major: 

      There is generally a large caveat to the rabies studies performed here, which is that both TVA and the ChR2-expressing rabies virus have the same fluorophore. It is thus essentially impossible to determine how many starter cells there are, what the efficiency of tracing is, and which part of the striatum is being sampled in any given experiment. This is a major caveat given the spatial topography of the cortico-striatal projections. Furthermore, the authors make a point in the introduction about previous studies not having explored absolute numbers of inputs, yet this is not at all controlled in this study. It could be that their rabies virus simply replicates better in D1-MSNs than D2-MSNs. No quantifications are done, and these possibilities do not appear to have been considered. Without a greater standardization of the rabies experiments across conditions, it is difficult to interpret the results. 

      This is still an issue. The authors point out why they chose various vectors. I can understand why the authors chose the fluorophores etc. that they did, yet the issues I raised previously are still valid. The discussion should mention that this is a potential issue. It does not necessarily invalidate results, but it is an issue. Furthermore, it is possible (in all systems) that rabies replicates better/more efficiently in some cells than others. This is one possible interpretation that has not really been explored in any study. I don't suggest the authors attempt to do that, but it should be raised as a potential interpretation. If the rabies results could mean several different things, the authors owe it to the readership to state all possible interpretations of data.

      We thank the reviewer for the comments and suggestions. Because the same fluorophore (mCherry) was used in both TVA- and ChR2-expressing viruses, it was not possible to distinguish true starter SPNs from TVA-only SPNs or monosynaptically labeled SPNs. This limitation makes it difficult to precisely assess the efficiency of rabies labeling and retrograde tracing in our experimental setup. Moreover, differences in rabies replication efficiency between D1- and D2-SPNs could potentially lead to an apparent lower connection probability from D1-projecting cortical neurons to D2-SPNs than from D2-projecting cortical neurons to D1-SPNs. We have added this clarification to the Discussion (lines 280-297).

      The authors claim using a few current clamp optical stimulation experiments that the cortical cells are healthy, but this result was far from comprehensive. For example, membrane resistance, capacitance, general excitability curves, etc are not reported. In Figure S2, some of the conditions look quite different (e.g., S2B, input D2-record D2, the method used yields quite different results that the authors write off as not different). Furthermore, these experiments do not consider the likely sickness and death that occurs in starter cells, as has been reported elsewhere. Health of cells in the circuit is overall a substantial concern that alone could invalidate a large portion, if not all, of the behavioral results. This is a major confound given those neurons are thought to play critical roles in the behaviors being studied. This is a major reason why first-generation rabies viruses have not been used in combination with behavior, but this significant caveat does not appear to have been considered, and controls e.g., uninfected animals, infected with AAV helpers, etc, were not included. 

      This issue remains unaddressed. I did not request clarity about experimental design, but rather, raised issues about the potential effects of toxicity. I believe this to be a valid concern that needs to be discussed in the manuscript, especially given what look visually like potential differences in S2. 

      We understand and appreciate the reviewer’s concern regarding the potential cytotoxicity of rabies virus infection. Although we performed the in vivo optogenetic behavioral experiments during a period when rabies-infected cells are generally considered relatively healthy, some deficits in starter cells may still occur and could contribute to the observed effects of optogenetic cortical stimulation. We have added this clarification to the Discussion (lines 298-306).

      The overall purity (e.g., EnvA pseudotyping efficiency) of the RABV prep is not shown. If there was a virus that was not well EnvA-pseudotyped and thus could directly infect cortical (or other) inputs, it would degrade specificity. This issue has not been addressed. Viral strain is irrelevant. The quality of the specific preparations used is what matters.

      While most of the study focuses on the cortical inputs, in slice recordings, inputs from the thalamus are not considered, yet likely contribute to the observed results. Related to this, in in vivo optogenetic experiments, technically, if the thalamic or other inputs to the dorsal striatum project to the cortex, their method will not only target cortical neurons but also terminals of other excitatory inputs. If this cannot be ruled out, the claim that the authors are able to selectively activate the cortical inputs to one or the other population should be toned down. 

      The authors added text to the discussion to address this point. While it largely does what is intended, based on the one study cited, I disagree with the authors' conclusions that it is "clear" that potential contamination from other sites does not play a role. The simplest interpretation is the one the authors state, and there is some supporting evidence to back up that assertion, but to me that falls short of making the point "clear" that there are no other interpretations. 

      The statements about specificity of connectivity are not well founded. It may be that in the specific case where they are assessing outside of the area of injections, their conclusions may hold (e.g., excitatory inputs onto D2s have more inputs onto D1s than vice versa). However, how this relates to the actual site of injection is not clear. At face value, if such a connectivity exists, it would suggest that D1-MSNs receive substantially more overall excitatory inputs than D2s. It is thus possible that this observation would not hold over other spatial intervals. This was not explored and thus the conclusions are over-generalized. e.g., the distance from the area of red cells in the striatum to recordings was not quantified, what constituted a high level of cortical labeling was not quantified, etc. Without more rigorous quantification of what was being done, it is difficult to interpret the results. 

      Again, the goal here would be to make a statement about this in the discussion to clarify limitations of the study. I don't expect the authors to re-do all of these experiments, but since they are discussing the corticostriatal circuits, which have multiple subdomains, this remains a relevant point. It has not been addressed. 

      The results in Figure 3 are not well controlled. The authors show contrasting effects of optogenetic stimulation of D1-MSNs and D2-MSNs in the DMS and DLS, results which are largely consistent with the canon of basal ganglia function. However, when stimulating cortical inputs, stimulating the inputs to D1-MSNs gives the expected results (increased locomotion) while stimulating putative inputs to D2-MSNs had no effect. This is not the same as showing a decrease in locomotion - showing no effect here is not possible to interpret. 

      I think that the caveat of showing no clear effects of inputs to D2 stimulation should be pointed out. Yes, I understand that the viruses appeared to express etc., but again it remains possible that the results are driven by a lack of e.g., sufficient ChR2 expression. Short of a full quantification of the number of cells expressing ChR2 and of the overlap between fiber placement and ChR2 expression (which I don't suggest), this remains a possibility and should be pointed out. 

      In the light of their circuit model, the result showing that inputs to D2-MSNs drive ICSS is confusing. How can the authors account for the fact that these cells are not locomotor-activating, stimulation of their putative downstream cells (D2-MSNs) does not drive ICSS, yet the cortical inputs drive ICSS? Is the idea that these inputs somehow also drive D1s? If this is the case, how do D2s get activated, if all of the cortical inputs tested net activate D1s and not D2s? Same with the results in Figure 4 - the inputs and putative downstream cells do not have the same effects. Given potential caveats of differences in viral efficiency, spatial location of injections, and cellular toxicity, I cannot interpret these experiments. 

      The explanation the authors provide in their rebuttal makes sense, however this should be included in the discussion of the manuscript, as it is interesting and relevant. 

      We thank the reviewer for the valuable comments and suggestions. In line with the reviewer’s recommendation, we have incorporated these explanations into the Discussion (lines 242–279) to help interpret the complex behavioral outcomes of optogenetic stimulation of cortical neurons projecting to D1- or D2-SPNs.

      Reviewer #2 (Recommendations for the authors): 

      I appreciate the authors' responses, which helped clarify some experimental choices. I appreciate that the experiment in Fig S3 serves as a reasonable light control for optogenetics experiments. The careful comparison with methods in Cui et al (2021) is useful, although not added to the main manuscript. Some of the other citations here don't really address the controversy, e.g. Kravitz et al is in DMS, but perhaps fully addressing this issue is outside the scope of the current manuscript and awaits further experiments. I also appreciate the clarification for recording locations that "This configuration ensured that the distance between recorded and starter cells did not exceed 100 µm, maintaining close anatomical proximity and thereby preserving the likelihood of shared cortical innervation within the examined circuitry." However, the statement in the reviewer response does not seem to be added to the manuscript's methods, which I think would be helpful. The criteria for choosing recorded cells are still a bit fuzzy without a map of recording locations and histology. There is also a problem that mCherry-positive cells could be starter cells or could be monosynaptically traced cells, so it is hard to know the area of the starter cell population in these experiments for sure. My evaluation of the manuscript remains largely the same as the original. However, I have adjusted my public review a bit to incorporate the authors' responses. I still think this paper has valuable information, suggesting an interesting and previously unappreciated structure of corticostriatal inputs that I hope this group and others will continue to investigate and incorporate into models of basal ganglia function.

      We thank the reviewer for the valuable suggestions. We have now included a comparison with Cui et al. in the Discussion. In addition, we have added the criteria for selecting recorded cells to the Methods section: ‘This configuration ensured that the distance between recorded and starter cells did not exceed 100 µm, maintaining close anatomical proximity and thereby preserving the likelihood of shared cortical innervation within the examined circuitry.’

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary: 

      This paper applies methods for segmentation, annotation, and visualization of acoustic analysis to zebra finch song. The paper shows that these methods can be used to predict the stage of song development and to quantify acoustic similarity. The methods are solid and are likely to provide a useful tool for scientists aiming to label large datasets of zebra finch vocalizations. The paper has two main parts: 1) establishing a pipeline/ package for analyzing zebra finch birdsong and 2) a method for measuring song imitation. 

      Strengths: 

      It is useful to see existing methods for syllable segmentation compared to new datasets.

      It is useful, but not surprising, that these methods can be used to predict developmental stage, which is strongly associated with syllable temporal structure.

      It is useful to confirm that these methods can identify abnormalities in deafened and isolated songs. 

      Weaknesses: 

      For the first part, the implementation seems to be a wrapper on existing techniques. For instance, the first section talks about syllable segmentation; they made a comparison between WhisperSeg (Gu et al, 2024), TweetyNet (Cohen et al, 2022), and amplitude thresholding. They found that WhisperSeg performed the best, and they included it in the pipeline. They then used WhisperSeg to analyze syllable duration distributions and rhythm of birds of different ages and confirmed past findings on this developmental process (e.g. Aronov et al, 2011). Next, based on the segmentation, they assign labels by performing UMAP and HDBSCAN on the spectrogram (nothing new; that's what people have been doing). Then, based on the labels, they claimed they developed a 'new' visualization - syntax raster (line 180). That was done by Sainburg et al. 2020 in Figure 12E and also in Cohen et al, 2020 - so the claim to have developed 'a new song syntax visualization' is confusing. The rest of the paper is about analyzing the finch data based on AVN features (which are essentially acoustic features already in the classic literature). 

      First, we would like to thank this reviewer for their kind comments and feedback on this manuscript. It is true that many of the components of this song analysis pipeline are not entirely novel in isolation. Our real contribution here is bringing them together in a way that allows other researchers to seamlessly apply automated syllable segmentation, clustering, and downstream analyses to their data. That said, our approach to training TweetyNet for syllable segmentation is novel. We trained TweetyNet to recognize vocalizations vs. silence across multiple birds, such that it can generalize to new individual birds, whereas TweetyNet had previously only been used to annotate song syllables from birds included in its training set. Our validation of TweetyNet and WhisperSeg in combination with UMAP and HDBSCAN clustering is also novel, providing valuable information about how these systems interact, and how reliable the completely automatically generated labels are for downstream analysis. We have added a couple of sentences to the introduction to emphasize the novelty of this approach and validation.

      Our syntax raster visualization does resemble Figure 12E in Sainburg et al. 2020; however, it differs in a few important ways, which we believe warrant its consideration as a novel visualization method. First, Sainburg et al. represent the labels across bouts in real time; their position along the x axis reflects the time at which each syllable is produced relative to the start of the bout. By contrast, our visualization considers only the index of syllables within a bout (i.e., first syllable vs. second syllable, etc.) without consideration of the true durations of each syllable or the silent gaps between them. This makes it much easier to detect syntax patterns across bouts, as the added variability of syllable timing is removed. Considering only the sequence of syllables rather than their timing also allows us to more easily align bouts according to the first syllable of a motif, further emphasizing the presence or absence of repeating syllable sequences without interference from the more variable introductory notes at the start of a motif. Finally, instead of plotting all bouts in the order in which they were produced, our visualization orders bouts such that bouts with the same sequence of syllables will be plotted together, which again serves to emphasize the most common syllable sequences that the bird produces. These additional processing steps mean that our syntax raster plot has much starker contrast between birds with stereotyped syntax and birds with more variable syntax, as compared to the more minimally processed visualization in Sainburg et al. 2020. There do not appear to be any similar visualizations in Cohen et al. 2020. 
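
      The three processing steps described above (indexing syllables by position, aligning bouts on the first motif syllable, and grouping identical sequences) can be illustrated with a minimal sketch. This is not the AVN implementation: the function name `syntax_raster`, the `motif_start` argument, and the padding convention are assumptions made only for illustration.

```python
from collections import Counter


def syntax_raster(bouts, motif_start, pad=None):
    """Build rows for a syntax-raster-style plot (hypothetical helper).

    bouts: list of bouts, each a list of syllable labels in sung order.
    motif_start: label assumed to mark the first syllable of the motif.
    Returns (rows, offset), where rows are equal-length lists padded with
    `pad` and offset is the column holding the first motif syllable.
    """
    # Index of the first motif syllable in each bout (0 if it never occurs).
    starts = [b.index(motif_start) if motif_start in b else 0 for b in bouts]
    offset = max(starts, default=0)

    # Left-pad so the first motif syllable lands in the same column.
    aligned = [[pad] * (offset - s) + list(b) for b, s in zip(bouts, starts)]

    # Right-pad so every row has the same width.
    width = max((len(r) for r in aligned), default=0)
    rows = [r + [pad] * (width - len(r)) for r in aligned]

    # Plot identical sequences together, most common sequence first.
    counts = Counter(tuple(r) for r in rows)
    rows.sort(key=lambda r: (-counts[tuple(r)], [str(x) for x in r]))
    return rows, offset


bouts = [
    ["i", "i", "a", "b", "c"],
    ["i", "a", "b", "c"],
    ["i", "i", "a", "b", "c"],
    ["a", "b", "d"],
]
rows, offset = syntax_raster(bouts, motif_start="a")
```

      Each row can then be drawn as a strip of colored cells, one color per label. Because only the syllable index is used, timing variability does not enter the plot, which is the property the response emphasizes.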

      The second part may be something new, but there are opportunities to improve the benchmarking. It is about the pupil-tutor imitation analysis. They introduce a convolutional neural network that takes triplets as input. Each triplet is essentially three images stacked together as (anchor, positive, negative): the anchor is a reference spectrogram from, say, finch A; the positive is a different spectrogram from finch A with the same label as the anchor; and the negative is a spectrogram that is either unrelated to A or has a different syllable label. The network is then trained to produce a low-dimensional embedding by ensuring that the embedding distance between anchor and positive is smaller than that between anchor and negative by a certain margin. Based on the embedding, they then use earth mover's distance to quantify the similarity of the syllable distributions among finches. They then compared the performance of their approach with that of Sound Analysis Pro (SAP) and a variant of SAP. A more natural comparison, which they didn't include, is with the VAE approach by Goffinet et al. In that paper (https://doi.org/10.7554/eLife.67855, Fig 7), the authors also analyzed tutor and pupil song.  
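
      For readers unfamiliar with the training objective summarized above, a minimal sketch of a triplet margin loss on already-computed embeddings follows. It assumes Euclidean distance and a margin of 1.0; the distance, margin, and network architecture actually used in the manuscript are not specified in this review, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchors, positives, negatives, margin=1.0):
    """Mean of max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero for a triplet once the negative is farther from the
    anchor than the positive by at least `margin`.
    """
    losses = [
        max(0.0, euclidean(a, p) - euclidean(a, n) + margin)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(losses) / len(losses)


# Positive close to the anchor, negative far away: constraint satisfied.
satisfied = triplet_margin_loss([[0.0, 0.0]], [[0.0, 1.0]], [[3.0, 4.0]])
# Roles swapped: the constraint is violated and the loss is positive.
violated = triplet_margin_loss([[0.0, 0.0]], [[3.0, 4.0]], [[0.0, 1.0]])
```

      In practice the embeddings would be the outputs of the convolutional network for each spectrogram, and the loss would be minimized by gradient descent over the network weights.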

      We thank the reviewer for this suggestion. We have included a comparison of our triplet loss embedding model to the VAE model proposed in Goffinet et al. 2021. We also included comparisons of similarity scoring using each of these embedding models combined with either earth mover’s distance (EMD) or maximum mean discrepancy (MMD) to calculate the similarity of the embeddings, as was done in Goffinet et al. 2021. As discussed in the updated results section of the paper and shown in the new Figure 6–figure supplement 1, the Triplet loss model with MMD performs best for evaluating song learning on new birds, not included in model training. We’ve updated the main text of the paper to reflect this switch from EMD to MMD for the primary similarity scoring approach.
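
      As a rough illustration of the MMD-based similarity score mentioned above, the sketch below computes a biased estimate of squared MMD between two sets of syllable embeddings. The Gaussian kernel, the bandwidth of 1.0, and the function names are assumptions for illustration; the kernel and estimator used in the manuscript may differ.

```python
import math


def rbf_kernel(u, v, bandwidth=1.0):
    """Gaussian (RBF) kernel between two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of embeddings.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    """
    def mean_kernel(first, second):
        total = sum(rbf_kernel(a, b, bandwidth) for a in first for b in second)
        return total / (len(first) * len(second))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


tutor = [[0.0, 0.0], [0.0, 1.0]]
close_pupil = [[0.1, 0.0], [0.0, 0.9]]
distant_pupil = [[10.0, 10.0], [10.0, 11.0]]

same = mmd_squared(tutor, tutor)
near = mmd_squared(tutor, close_pupil)
far = mmd_squared(tutor, distant_pupil)
```

      A lower value indicates that the pupil's syllable embeddings are distributed more like the tutor's. Unlike earth mover's distance, this estimate needs only pairwise kernel evaluations and no optimal transport solve.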

      Reviewer #2 (Public Review):

      Summary: 

      In this work, the authors present a new Python software package, Avian Vocalization Network (AVN) aimed at facilitating the analysis of birdsong, especially the song of the zebra finch, the most common songbird model in neuroscience. The package handles some of the most common (and some more advanced) song analyses, including segmentation, syllable classification, featurization of song, calculation of tutor-pupil similarity, and age prediction, with a view toward making the entire process friendlier to experimentalists working in the field.

      For many years, Sound Analysis Pro has served as a standard in the songbird field, the first package to extensively automate songbird analysis and facilitate the computation of acoustic features that have helped define the field. More recently, the increasing popularity of Python as a language, along with the emergence of new machine learning methods, has resulted in a number of new software tools, including the vocalpy ecosystem for audio processing, TweetyNet (for segmentation), t-SNE and UMAP (for visualization), and autoencoder-based approaches for embedding.

      Strengths: 

      The AVN package overlaps several of these earlier efforts, albeit with a focus on more traditional featurization that many experimentalists may find more interpretable than deep learning-based approaches. Among the strengths of the paper are its clarity in explaining the several analyses it facilitates, along with high-quality experiments across multiple public datasets collected from different research groups. As a software package, it is open source, installable via the pip Python package manager, and features high-quality documentation, as well as tutorials. For experimentalists who wish to replicate any of the analyses from the paper, the package is likely to be a useful time saver.

      Weaknesses: 

      I think the potential limitations of the work are predominantly on the software end, with one or two quibbles about the methods.

      First, the software: it's important to note that the package is trying to do many things, of which it is likely to do several well and few comprehensively. Rather than a package that presents a number of new analyses or a new analysis framework, it is more a codification of recipes, some of which are reimplementations of existing work (SAP features), some of which are essentially wrappers around other work (interfacing with WhisperSeg segmentations), and some of which are new (similarity scoring). All of this has value, but in my estimation, it has less value as part of a standalone package and potentially much more as part of an ecosystem like vocalpy that is undergoing continuous development and has long-term support. 

      We appreciate this reviewer’s comments and concerns about the structure of the AVN package and its long-term maintenance. We have considered incorporating AVN into the VocalPy ecosystem but have chosen not to for a few key reasons. (1) AVN was designed with ease of use for experimenters with limited coding experience top of mind. VocalPy provides excellent resources for researchers with some familiarity with object-oriented programming to manage and analyze their datasets; however, we believe it may be challenging for users without such experience to adopt VocalPy quickly. AVN’s ‘recipe’ approach, as you put it, is very easily accessible to new users, and allows users with intermediate coding experience to easily navigate the source code to gain a deeper understanding of the methodology. AVN also consistently outputs processed data in familiar formats (tables in .csv files which can be opened in Excel), in an effort to make it more accessible to new users, something which would be challenging to reconcile with VocalPy’s emphasis on its `dataset` classes. (2) AVN and VocalPy differ in their underlying goals and philosophies when it comes to flexibility vs. standardization of analysis pipelines. VocalPy is designed to facilitate mixing and matching of different approaches to spectrogram generation, segmentation, annotation, etc., so that researchers can design and implement their own custom analysis pipelines. This flexibility is useful in many cases. For instance, it could allow researchers who have very different noise filtering and annotation needs, like those working with field recordings versus acoustic chamber recordings, to analyze their data using this platform. However, when it comes to comparisons across zebra finch research labs, this flexibility comes at the expense of direct comparison and integration of song features across research groups. This is the context in which AVN is most useful. It presents a single approach to song segmentation, labeling, and featurization that has been shown to generalize well across research groups, and which allows direct comparisons of the resulting features. AVN’s single, extensively validated, standard pipeline approach is fundamentally incompatible with VocalPy’s emphasis on flexibility. We are excited to see how VocalPy continues to evolve in the future, and recognize the value that both AVN and VocalPy bring to the songbird research community, each with their own distinct strengths, weaknesses, and ideal use cases. 

      While the code is well-documented, including web-based documentation for both the core package and the GUI, the latter is available only on Windows, which might limit the scope of adoption. 

      We thank the reviewer for their kind words about AVN’s documentation. We recognize that the GUI’s exclusive availability on Windows is a limitation, and we would be happy to collaborate with other researchers and developers in the future to build a Mac-compatible version, should the demand present itself. That said, the Python package works on all operating systems, so non-Windows users can still use AVN that way.

      That is to say, whether AVN is adopted by the field in the medium term will have much more to do with the quality of its maintenance and responsiveness to users than any particular feature, but I believe that many of the analysis recipes that the authors have carefully worked out may find their way into other code and workflows. 

      Second, two notes about new analysis approaches:

      (1) The authors propose a new means of measuring tutor-pupil similarity based on first learning a latent space of syllables via a self-supervised learning (SSL) scheme and then using the earth mover's distance (EMD) to calculate transport costs between the distributions of tutors' and pupils' syllables. While to my knowledge this exact method has not previously been proposed in birdsong, I suspect it is unlikely to differ substantially from the approach of autoencoding followed by MMD used in the Goffinet et al. paper. That is, SSL, like the autoencoder, is a latent space learning approach, and EMD, like MMD, is an integral probability metric that measures discrepancies between two distributions. (Indeed, the two are very closely related: https://stats.stackexchange.com/questions/400180/earth-movers-distance-andmaximum-mean-discrepency.) Without further experiments, it is hard to tell whether these two approaches differ meaningfully. Likewise, while the authors have trained on a large corpus of syllables to define their latent space in a way that generalizes to new birds, it is unclear why such an approach would not work with other latent space learning methods.  

      We recognize the similarities between these approaches and have included comparisons of the VAE and MMD, as in the Goffinet paper, to our triplet loss model and EMD. As discussed in the updated results section of the paper and shown in the new Figure 6–figure supplement 1, the triplet loss model with MMD performs best for evaluating song learning on new birds that were not included in model training. We’ve updated the main text of the paper to reflect this switch from EMD to MMD for the primary similarity scoring approach. 
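
      To make the quantity being compared concrete, the following is a minimal sketch of a biased squared-MMD estimate with a Gaussian kernel between two sets of syllable embeddings. It is not AVN’s implementation: the function names, the kernel choice, the bandwidth `sigma`, and the toy 2-dimensional points are all assumptions made for illustration (the embeddings discussed here are 8-dimensional).

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma=1.0):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two samples of embeddings."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))

# Hypothetical toy embeddings (2-D for readability), not data from the paper.
tutor = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
good_pupil = [(0.0, 0.1), (0.1, 0.1), (1.0, 0.9)]
poor_pupil = [(3.0, 3.0), (3.1, 2.9), (4.0, 4.0)]

mmd_good = mmd_squared(tutor, good_pupil)  # small: distributions overlap
mmd_poor = mmd_squared(tutor, poor_pupil)  # larger: distributions are far apart
print(mmd_good, mmd_poor)
```

      In this sketch a lower value means the pupil’s syllable distribution lies closer to the tutor’s. The comparison is only meaningful when both birds are embedded by the same model.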

      (2) The authors propose a new method for maturity scoring by training a model (a generalized additive model) to predict the age of the bird based on a selected subset of acoustic features. This is distinct from the "predicted age" approach of Brudner, Pearson, and Mooney, which predicts based on a latent representation rather than specific features, and the GAM nicely segregates the contribution of each. As such, this approach may be preferred by many users who appreciate its interpretability.  

      In summary, my view is that this is a nice paper detailing a well-executed piece of software whose future impact will be determined by the degree of support and maintenance it receives from others over the near and medium term.

      Reviewer #3 (Public Review):

      Summary: 

      The authors invent song and syllable discrimination tasks that they use to train deep networks. They then use these networks as a basis for routine song analysis and song evaluation tasks. For the analysis, they consider both data from their own colony and from another colony the network has not seen during training. They validate the analysis scores of the network against expert human annotators, achieving a correlation of 80-90%. 

      Strengths: 

      (1) Robust Validation and Generalizability: The authors demonstrate a good performance of the AVN across various datasets, including individuals exhibiting deviant behavior. This extensive validation underscores the system's usefulness and broad applicability to zebra finch song analysis, establishing it as a potentially valuable tool for researchers in the field.

      (2) Comprehensive and Standardized Feature Analysis: AVN integrates a comprehensive set of interpretable features commonly used in the study of bird songs. By standardizing the feature extraction method, the AVN facilitates comparative research, allowing for consistent interpretation and comparison of vocal behavior across studies.

      (3) Automation and Ease of Use: By being fully automated, the method is straightforward to apply and should present hardly any barrier to adoption by other labs.

      (4) Human experts were recruited to perform extensive annotations (of vocal segments and of song similarity scores). These annotations, released as public datasets, are potentially very valuable. 

      Weaknesses: 

      (1) Poorly motivated tasks. The approach is poorly motivated and many assumptions come across as arbitrary. For example, the authors implicitly assume that the task of birdsong comparison is best achieved by a system that optimally discriminates between typical, deaf, and isolated songs. Similarly, the authors assume that song development is best tracked using a system that optimally estimates the age of a bird given its song. My issue is that these are fake tasks since clearly, researchers will know whether a bird is an isolated or a deaf bird, and they will also know the age of a bird, so no machine learning is needed to solve these tasks. Yet, the authors imagine that solving these placeholder tasks will somehow help with measuring important aspects of vocal behavior.  

      We appreciate this reviewer’s concerns and apologize for not providing a sufficiently clear rationale for the inclusion of our phenotype classifier and age regression models in the original manuscript. These tasks are not intended to be taken as a final, ultimate culmination of the AVN pipeline. Rather, we consider the carefully engineered set of 55 interpretable features to be AVN’s final output, and these analyses serve merely as examples of how that feature set can be applied. That said, each of these models does have valid experimental use cases that we believe are important and would like to bring to the attention of the reviewer.

      For one, we showed that the LDA model, which discriminates between typical, deaf, and isolate birds’ songs, not only allows us to evaluate which features are most important for discriminating between these groups, but also allows comparison of the FoxP1 knock-down (FP1 KD) birds to each of these phenotypes. Based on previous work (Garcia-Oscos et al. 2021), we hypothesized that FP1 KD in these birds specifically impaired tutor song memory formation while sparing a bird’s ability to refine its own vocalizations through auditory feedback. Thus, we would expect their songs to resemble those of isolate birds, who lack a tutor song memory, but not those of deaf birds, who lack both a tutor song memory and auditory feedback of their own vocalizations to guide learning. The LDA model allowed us to make this comparison quantitatively for the first time and confirm our hypothesis that FP1 KD birds’ songs are indeed most like isolates’. In the future, as more research groups publish their birds’ AVN feature sets, we hope to be able to make even more fine-grained comparisons between different groups of birds, either using LDA or other similar interpretable classifiers. 

      The age prediction model also has valid real-world use cases. For instance, one might imagine an experimental manipulation that is hypothesized to accelerate or slow song maturation in juvenile birds. This age prediction model could be applied to the AVN feature sets of birds having undergone such a manipulation to determine whether their predicted ages systematically lead or lag their true biological ages, and which song features are most responsible for this difference. We didn’t have access to data for any such birds for inclusion in this paper, but we hope that others in the future will be able to take inspiration from our methodology and use this or a similar age regression model with AVN features in their research. We have added a couple lines to the ‘Comparing Song Disruptions with AVN Features’ and ‘Tracking Song Development with AVN Features’ sections of the results to make this more clear. 
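
      The lead/lag comparison described above can be sketched as a mean signed difference between predicted and true ages. This is an illustration only: the function name and all age values below are hypothetical, and the predictions would in practice come from an age regression model applied to AVN features.

```python
from statistics import mean

def mean_age_offset(predicted_ages, true_ages):
    """Mean of (predicted - true) age in days.

    Positive values mean the songs look older than the birds are (lead);
    negative values mean they look younger (lag).
    """
    return mean(p - t for p, t in zip(predicted_ages, true_ages))

# Hypothetical ages in days post hatch; not data from the paper.
true_ages = [50, 60, 70, 80]
control_predicted = [49, 61, 70, 80]
treated_predicted = [55, 66, 77, 86]

control_offset = mean_age_offset(control_predicted, true_ages)
treated_offset = mean_age_offset(treated_predicted, true_ages)
print(control_offset, treated_offset)
```

      A real analysis would also test whether the offset differs between groups, which this sketch omits.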

      Along similar lines, authors assume that a good measure of similarity is one that optimally performs repeated syllable detection (i.e. to discriminate same syllable pairs from different pairs). The authors need to explain why they think these placeholder tasks are good and why no better task can be defined that more closely captures what researchers want to measure. Note: the standard tasks for self-supervised learning are next word or masked word prediction, why are these not used here? 

      This reviewer appears to have misunderstood our similarity scoring embedding model and our rationale for using it. We will explain it in more depth here and have added a paragraph to the ‘Measuring Song Imitation’ section of the results explaining this rationale more briefly.

      First, nowhere are we training a model to discriminate between same and different syllable pairs. The triplet loss network is trained to embed syllables in an 8-dimensional space such that syllables with the same label are closer together than syllables with different labels. The loss function is related to the relative distance between embeddings of syllables with the same or different labels, not the classification of syllables as same or different. This approach was chosen because it has repeatedly been shown to be a useful data compression step (Schroff et al. 2015, Thakur et al. 2019) before further downstream tasks are applied on its output, particularly in contexts where there is little data per class (syllable label). For example, Schroff et al. 2015 trained a deep convolutional neural network with triplet loss to embed images of human faces from the same individual closer together than images of different individuals in a 128-dimensional space. They then used this model to compute 128-dimensional representations of additional face images, not included in training, which were used for individual facial recognition (this is a same vs. different category classifier), and facial clustering, achieving better performance than the previous state of the art. The triplet loss function results in a model that can generate useful embeddings of previously unseen categories, like new individuals’ faces, or new zebra finches’ syllables, which can then be used in downstream analyses. This meaningful, lower dimensional space allows comparisons of distributions of syllables across birds, as in Brainard and Mets 2008, and Goffinet et al. 2021. 
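
      To illustrate that the objective depends on relative distances rather than on a same/different classification, here is a minimal sketch of the standard triplet margin loss on squared Euclidean distances. The margin value, the function names, and the 2-dimensional toy embeddings are assumptions for illustration; this is not the training code used for AVN.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on relative distances.

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin` (in squared distance).
    """
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin, 0.0)

# Hypothetical embeddings: anchor and positive share a syllable label.
anchor = (0.0, 0.0)
positive = (0.1, 0.0)
negative_far = (1.0, 1.0)    # different label, already well separated
negative_near = (0.1, 0.1)   # different label, still too close

loss_far = triplet_loss(anchor, positive, negative_far)
loss_near = triplet_loss(anchor, positive, negative_near)
print(loss_far, loss_near)
```

      Only the second triplet contributes a gradient in training, because the first already satisfies the margin. No class prediction is produced at any point.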

      Next word and masked word prediction are indeed common self-supervised learning tasks for models working with text data, or other data with meaningful sequential organization. That is not the case for our zebra finch syllables, where every bird’s syllable sequence depends only on its tutor’s sequence, and there is no evidence for strong universal syllable sequencing rules (James et al. 2020). Rather, our embedding model is an example of a computer vision task, as it deals with sets of two-dimensional images (spectrograms), not sequences of categorical variables (like text). It is also not, strictly speaking, a self-supervised learning task, as it does require syllable labels to generate the triplets. A common self-supervised approach for dimensionality reduction in a computer vision task such as this one would be to train an autoencoder to compress images to a lower dimensional space, then faithfully reconstruct them from the compressed representation. This has been done using a variational autoencoder trained on zebra finch syllables in Goffinet et al. 2021. In keeping with the suggestions from reviewers #1 and #2, we have included a comparison of our triplet loss model with the Goffinet et al. VAE approach in the revised manuscript. 

      (2) The machine learning methodology lacks rigor. The aims of the machine learning pipeline are extremely vague and keep changing like a moving target. Mainly, the deep networks are trained on some tasks but then authors evaluate their performance on different, disconnected tasks. For example, they train both the birdsong comparison method (L263+) and the song similarity method (L318+) on classification tasks. However, they evaluate the former method (LDA) on classification accuracy, but the latter (8-dim embeddings) using a contrast index. In machine learning, usually, a useful task is first defined, then the system is trained on it and then tested on a held-out dataset. If the sensitivity index is important, why does it not serve as a cost function for training?

      Again, this reviewer seems not to understand our similarity scoring methodology. Our similarity scoring model is not trained on a classification task, but rather on an embedding task. It learns to embed spectrograms of syllables in an 8-dimensional space such that syllables with the same label are closer together than syllables with different labels. We could report the loss values for this embedding task on our training and validation datasets, but these wouldn’t have any clear relevance to the downstream task of syllable distribution comparison where we are using the model’s embeddings. We report the contrast index as this has direct relevance to the actual application of the model and allows comparisons to other similarity scoring methods, something that the triplet loss values wouldn’t allow. 

      The triplet loss method was chosen because it has been shown to yield useful low-dimensional representations of data, even in cases where there is limited labeled training data (Thakur et al. 2019). While we have one of the largest manually annotated datasets of zebra finch songs, it is still quite small by industry deep learning standards, which is why we chose a method that would perform well given the size of our dataset. Training a model on a contrast index directly would be extremely computationally intensive and require many more pairs of birds with known relationships than we currently have access to. It could be an interesting approach to take in the future, but one that would be unlikely to perform well with a dataset size typical of songbird research. 

      Also, usually, in solid machine learning work, diverse methods are compared against each other to identify their relative strengths. The paper contains almost none of this, e.g. authors examined only one clustering method (HDBSCAN).  

      We did compare multiple methods for syllable segmentation (WhisperSeg, TweetyNet, and amplitude thresholding) as this hadn’t been done previously. We chose not to perform an extensive comparison of different clustering methods as Sainburg et al. 2020 already did so and we felt no need to duplicate this effort. We encourage this reviewer to refer to Sainburg et al.’s excellent work for comparisons of multiple clustering methods applied to zebra finch song syllables.

      (3) Performance issues. The authors want to 'simplify large-scale behavioral analysis' but it seems they want to do that at a high cost. (Gu et al 2023) achieved syllable scores above 0.99 for adults, which is much larger than the average score of 0.88 achieved here (L121). Similarly, the syllable scores in (Cohen et al 2022) are above 94% (their error rates are below 6%, albeit in Bengalese finches, not zebra finches), which is also better than here. Why is the performance of AVN so low? The low scores of AVN argue in favor of some human labeling and training on each bird.  

      Firstly, the syllable error rate scores reported in Cohen et al. 2022 are calculated very differently than the F1 scores we report here and are based on a model trained with data from the same bird as was used in testing, unlike our more general segmentation approach where the model was tested on different birds than were used in training. Thus, the scores reported in Cohen et al. and the F1 scores that we report cannot be compared. 

      The discrepancy between the F1_seg scores reported in Gu et al. 2023 and the segmentation F1 scores that we report is likely due to differences in the underlying datasets. Our UTSW recordings tend to have higher levels of both stationary and non-stationary background noise, which makes segmentation more challenging. The recordings from Rockefeller were less contaminated by background noise, and they resulted in slightly higher F1 scores. That said, we believe that the primary factor accounting for this difference in scores with Gu et al. 2023 is the granularity of our ‘ground truth’ syllable segments. In our case, whenever there was any ambiguity as to whether vocal elements should be segmented into two short syllables with a very short gap between them or merged into a single longer syllable, we chose to split them. WhisperSeg had a strong tendency to merge the vocal elements in ambiguous cases such as these. This results in a higher rate of false negative syllable onset detections, reflected in the low recall scores achieved by WhisperSeg (see Figure 2–figure supplement 1b), but still very high precision scores (Figure 2–figure supplement 1a). While WhisperSeg did frequently merge these syllables in a way that differed from our ground truth segmentation, it did so consistently, meaning it had little impact on downstream measures of syntax entropy (Figure 3c) or syllable duration entropy (Figure 3–figure supplement 2a). It is for that reason that, despite a lower F1 score, we still consider AVN’s automatically generated annotations to be sufficiently accurate for downstream analyses. 
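
      The effect of merging on the scores can be illustrated with a small onset-matching sketch. The matching rule (greedy, one-to-one, within a fixed tolerance), the tolerance value, and the onset times are assumptions for illustration and may differ from how the F1 scores in the manuscript were computed.

```python
def onset_scores(true_onsets, predicted_onsets, tolerance=0.01):
    """Precision, recall and F1 for onset detection (times in seconds).

    Each predicted onset is greedily matched to at most one unmatched
    true onset lying within `tolerance` of it.
    """
    unmatched = sorted(true_onsets)
    true_positives = 0
    for onset in sorted(predicted_onsets):
        match = next((t for t in unmatched if abs(t - onset) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1
    precision = true_positives / len(predicted_onsets) if predicted_onsets else 0.0
    recall = true_positives / len(true_onsets) if true_onsets else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# The ground truth splits two short elements (onsets at 0.30 s and 0.36 s);
# the hypothetical prediction merges them into one syllable starting at 0.30 s.
true_onsets = [0.10, 0.30, 0.36, 0.60]
merged_prediction = [0.10, 0.30, 0.60]

precision, recall, f1 = onset_scores(true_onsets, merged_prediction)
print(precision, recall, f1)
```

      Merging leaves every predicted onset correct, so precision stays at 1.0, while the missed onset at 0.36 s lowers recall to 0.75 and pulls the F1 score down with it.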

      Should researchers require a higher degree of accuracy and precision with their annotations (for example, to detect very subtle changes in song before and after an acute manipulation) we suggest they turn toward one of the existing tools for supervised song annotation, such as TweetyNet.

      (4) Texas bias. It is true that comparability across datasets is enhanced when everyone uses the same code. However, the authors' proposal essentially is to replace the bias between labs with a bias towards birds in Texas. The comparison with Rockefeller birds is nice, but it amounts to merely N=1. If birds in Japanese or European labs have evolved different song repertoires, the AVN might not capture the associated song features in these labs well.  

      We appreciate the reviewer’s concern about a bias toward birds from the UTSW colony. However, this paper shows that despite training (for the similarity scoring) and hyperparameter fitting (for the HDBSCAN clustering) on the UTSW birds, AVN performs as well if not better on birds from Rockefeller than from UTSW. To our knowledge, there are no publicly available datasets of annotated zebra finch songs from labs in Europe or in Asia, but we would be happy to validate AVN on such datasets, should they become available. Furthermore, there is no evidence to suggest that there is dramatic drift in zebra finch vocal repertoire between continents which would necessitate such additional validation. While we didn’t have manual annotations for this dataset (which would allow validation of our segmentation and labeling methods), we did apply AVN to recordings shared with us by the Wada lab in Japan, where visual inspection of the resulting annotations suggested comparable accuracy to the UTSW and Rockefeller datasets. 

      (5) The paper lacks an analysis of the balance between labor requirement, generalizability, and optimal performance. For tasks such as segmentation and labeling, fine-tuning for each new dataset could potentially enhance the model's accuracy and performance without compromising comparability. E.g., how many hours does it take to annotate a hundred song motifs? How much would the performance of AVN increase if the network were to be retrained on these? The paper should be written in more neutral terms, letting researchers reach their own conclusions about how much manual labor they want to put into their data.  

      With standardization and ease of use in mind, we designed AVN specifically to perform fully automated syllable annotation and downstream feature calculations. We believe that we have demonstrated in this manuscript that our fully automated approach is sufficiently reliable for downstream analyses across multiple zebra finch colonies. That said, if researchers require an even higher degree of annotation precision and accuracy, they can turn toward one of the existing methods for supervised song annotation, such as TweetyNet. Incorporating human annotations for each bird processed by AVN is likely to improve its performance, but this would require significant changes to AVN’s methodology, and is outside the scope of our current efforts.

      (6) Full automation may not be everyone's wish. For example, given the highly stereotyped zebra finch songs, it is conceivable that some syllables are consistently mis-segmented or misclassified. Researchers may want to be able to correct such errors, which essentially amounts to fine-tuning AVN. Conceivably, researchers may want to retrain a network like the AVN on their own birds, to obtain a more fine-grained discriminative method.  

      Other methods exist for supervised or human-in-the-loop annotation of zebra finch songs, such as TweetyNet and DAN (Alam et al. 2023). We invite researchers who require a higher degree of accuracy than AVN can provide to explore these alternative approaches for song annotation. Incorporating human feedback into AVN was never the goal of our pipeline, would require significant changes to AVN’s design and is outside the scope of this manuscript.

      (7) The analysis is restricted to song syllables and fails to include calls. No rationale is given for the omission of calls. Also, it is not clear how the analysis deals with repeated syllables in a motif, whether they are treated as two-syllable types or one.  

      It is true that we don’t currently have any dedicated features to describe calls. This could be a useful addition to AVN in the future. 

      What a human expert inspecting a spectrogram would typically call ‘repeated syllables’ in a bout are almost always assigned the same syllable label by the UMAP+HDBSCAN clustering. The syntax analysis module includes features examining the rate of syllable repetitions across syllable types, as mentioned in lines 222-226 of the revised manuscript. See https://avn.readthedocs.io/en/latest/syntax_analysis_demo.html#Syllable-Repetitions for further details.

      (8) It seems not all human annotations have been released and the instruction sets given to experts (how to segment syllables and score songs) are not disclosed. It may well be that the differences in performance between (Gu et al 2023) and (Cohen et al 2022) are due to differences in segmentation tasks, which is why these tasks given to experts need to be clearly spelled out. Also, the downloadable files contain merely labels but no identifier of the expert. The data should be released in such a way that lets other labs adopt their labeling method and cross-check their own labeling accuracy.  

      All human annotations used in this manuscript have indeed been released as part of the accompanying dataset. Syllable annotations are not provided for all pupils and tutors used to validate the similarity scoring, as annotations are not necessary for similarity comparisons. We have expanded our description of our annotation guidelines in the methods section of the revised manuscript. All the annotations were generated by one of two annotators. The second annotator always consulted with the first annotator in cases of ambiguous syllable segmentation or labeling, to ensure that they had consistent annotation styles. Unfortunately, we haven’t retained records about which birds were annotated by which of the two annotators, so we cannot share this information along with the dataset. The data is currently available in a format that should allow other research groups to use our annotations either to train their own annotation systems or check the performance of their existing systems on our annotations.  

      (9) The failure modes are not described. What segmentation errors did they encounter, and what syllable classification errors? It is important to describe the errors to be expected when using the method. 

      As we discussed in our response to this reviewer’s point (3), WhisperSeg has a tendency to merge syllables when the gap between them is very short, which explains its lower recall score compared to its precision on our dataset (Figure 2–figure supplement 1). In rare cases, WhisperSeg also fails to recognize syllables entirely, again impacting its recall score. TweetyNet hardly ever completely ignores syllables, but it does tend to occasionally merge syllables together or over-segment them. Whereas WhisperSeg does this very consistently for the same syllable types within the same bird, TweetyNet merges or splits syllables more inconsistently. This inconsistent merging and splitting has a larger effect on syllable labeling, as manifested in the lower clustering v-measure scores we obtain with TweetyNet compared to WhisperSeg segmentations. TweetyNet also has much lower precision than WhisperSeg, largely because TweetyNet often recognizes background noises (like wing flaps or hopping) as syllables whereas WhisperSeg hardly ever segments non-vocal sounds. 

      Many errors in syllable labeling stem from differences in syllable segmentation. For example, if two syllables with labels ‘a’ and ‘b’ in the manual annotation are sometimes segmented as two syllables, but sometimes merged into a single syllable, the clustering is likely to find 3 different syllable types: one corresponding to ‘a’, one corresponding to ‘b’, and one corresponding to ‘ab’ merged. Because of how we align syllables across segmentation schemes for the v-measure calculation, this will look like syllable ‘b’ always has a consistent cluster label (or is missing a label entirely), but syllable ‘a’ can carry two different cluster labels, depending on the segmentation. In certain cases, even in the absence of segmentation errors, a group of syllables bearing the same manual annotation label may be split into 2 or 3 clusters (it is extremely rare for a single manual annotation group to be split into more than 3 clusters). In these cases, it is difficult to conclusively say whether the clustering represents an error, or if it actually captured some meaningful systematic difference between syllables that was missed by the annotator. Finally, sometimes rare syllable types with their own distinct labels in the manual annotation are merged into a single cluster. Most labeling errors can be explained by this kind of merging or splitting of groups relative to the manual annotation, not by occasional misclassifications of one manual label type as another.
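
      How a split of one manual label into two clusters shows up in the v-measure can be sketched as follows. This uses the standard homogeneity and completeness definitions, written out by hand for illustration. The label sequences are hypothetical, and the alignment of syllables across segmentation schemes mentioned above is not modeled.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(targets, given):
    """H(targets | given) for two paired label sequences."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())

def v_measure(manual, clusters):
    """Return (homogeneity, completeness, v-measure)."""
    h_manual, h_clusters = entropy(manual), entropy(clusters)
    homogeneity = (1.0 if h_manual == 0
                   else 1.0 - conditional_entropy(manual, clusters) / h_manual)
    completeness = (1.0 if h_clusters == 0
                    else 1.0 - conditional_entropy(clusters, manual) / h_clusters)
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v

# Hypothetical case: manual label 'a' is split across clusters 0 and 1.
manual = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
split_clusters = [0, 0, 1, 1, 2, 2, 2, 2]

homogeneity, completeness, v = v_measure(manual, split_clusters)
print(homogeneity, completeness, v)
```

      Every cluster here contains a single manual label, so homogeneity stays at 1. Only completeness is penalized by the split, which is why this kind of error lowers the v-measure without any syllable being assigned to the wrong manual type.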

      For examples of these types of errors, we encourage this reviewer and readers to refer to the example confusion matrices in figure 2f and Figure 2–figure supplement 3b&e. We also added two paragraphs to the end of the ‘Accurate, fully unsupervised syllable labeling’ section of the Results in the revised manuscript. 

      (10) Usage of Different Dimensionality Reduction Methods: The pipeline uses two different dimensionality reduction techniques for labeling and similarity comparison - both based on the understanding of the distribution of data in lower-dimensional spaces. However, the reasons for choosing different methods for different tasks are not articulated, nor is there a comparison of their efficacy.  

      We apologize for not making this distinction sufficiently clear in the manuscript and have added a paragraph to the ‘Measuring Song Imitation’ section of the Results explaining the rationale for using an embedding model for similarity scoring. 

      We chose to use UMAP for syllable labeling because it is a common embedding methodology to precede hierarchical clustering and has been shown to result in reliable syllable labels for birdsong in the past (Sainburg et al. 2020). However, it is not appropriate for similarity scoring, because comparing EMD or MMD scores between birds requires that all the birds’ syllable distributions exist within the same shared embedding space. This can be achieved by using the same triplet loss-trained neural network model to embed syllables from all birds. This cannot be achieved with UMAP because all birds whose scores are being compared would need to be embedded in the same UMAP space, as distances between points cannot be compared across UMAPs. In practice, this would mean that every time a new tutor-pupil pair needs to be scored, their syllables would need to be added to a matrix with all previously compared birds’ syllables, a new UMAP would need to be computed, and new EMD or MMD scores between all bird pairs would need to be calculated using their new UMAP embeddings. This is very computationally expensive and quickly becomes unfeasible without dedicated high power computing infrastructure. It also means that similarity scores couldn’t be compared across papers without recomputing everything each time, whereas EMD and MMD scores obtained with triplet loss embeddings can be compared, provided they use the same trained model (which we provide as part of AVN) to embed their syllables in a common latent space. 

      (11) Reproducibility: are the measurements reproducible? Systems like UMAP always find a new embedding given some fixed input, so the output tends to fluctuate.

      There is indeed a stochastic element to UMAP embeddings, which results in different embeddings and therefore different syllable labels across repeated runs with the same input. We observed that V-measure scores were quite consistent within birds across repeated UMAP runs, and have added a supplementary figure to the revised manuscript showing this (Figure 2–figure supplement 4).
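
      The consistency check described above can be sketched as follows. This is a minimal, pure-Python illustration of the V-measure (the harmonic mean of homogeneity and completeness), not AVN's implementation. The label sequences `manual` and `runs` are hypothetical stand-ins for manual annotations and for labels produced by repeated UMAP-plus-clustering runs on the same bird.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(a | b) for two equal-length label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log(c / b_counts[bv])
                for (_, bv), c in joint.items())


def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness (beta = 1)."""
    h_true = entropy(true_labels)
    h_pred = entropy(pred_labels)
    homogeneity = (1.0 if h_true == 0
                   else 1.0 - conditional_entropy(true_labels, pred_labels) / h_true)
    completeness = (1.0 if h_pred == 0
                    else 1.0 - conditional_entropy(pred_labels, true_labels) / h_pred)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical data: one manual annotation and three repeated automatic runs.
manual = ["a", "a", "b", "b", "c", "c"]
runs = [
    [0, 0, 1, 1, 2, 2],  # perfect agreement
    [2, 2, 0, 0, 1, 1],  # same partition, cluster IDs permuted
    [0, 0, 1, 1, 1, 2],  # one syllable assigned to a different cluster
]
scores = [v_measure(manual, run) for run in runs]
spread = max(scores) - min(scores)  # small spread = consistent labeling
```

      Because the V-measure depends only on how syllables are partitioned, runs that differ only in arbitrary cluster numbering score identically; only genuine changes in the partition move the score.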

      Reviewer #1 (Recommendations For The Authors):

      (1) Benchmark their similarity score to the method used by Goffinet et al, 2021 from the Pearson group. Such a comparison would be really interesting and useful.  

      This has been added to the paper. 

      (2) Please clarify exactly what is new and what is applied from existing methods to help the reader see the novelty of the paper.  

      We have added more emphasis on the novel aspects of our pipeline to the paper’s introduction. 

      Minor:

      It's unclear if AVN is appropriate as the paper deals only with zebra finch song - the scope is more limited than advertised.

      We assume this is in reference to ‘Birdsong’ in the paper’s title and ‘Avian’ in Avian Vocalization Network. There is a brief discussion of how these methods are likely to perform on other commonly studied songbird species at the end of the discussion section.

      Reviewer #2 (Recommendations For The Authors):

      A few points for the authors to consider that might strengthen or inform the paper:

      (1) In the public review, I detailed some ways in which the SSL+EMD approach is unlikely to be appreciably distinct from the VAE+MMD approach -- in fact, one could mix and match here. It would strengthen the authors' claim if they showed via experiments that their method outperforms VAE+MMD, but in the absence of that, a discussion of the relation between the two is probably warranted.  

      This comparison has been added to the paper.

      (2) ll. 305-310: This loss of accuracy near the edge is expected on general Bayesian grounds. Any regression approach should learn to estimate the conditional mean of the age distribution given the data, so ages estimated from data will be pulled inward toward the location of most training data. This bias is somewhat mitigated in the Brudner paper by a more flexible model, but it's a general (and expected) feature of the approach.

      (3) While the online AVA documentation looks good, it might benefit from a page on design philosophy that lays out how the various modules fit together - something between the tutorials and the nitty-gritty API. That way, users would be able to get a sense of where they should look if they want to harness pieces of functionality beyond the tutorials.

      Thank you for this suggestion. We will add a page on AVN’s design philosophy to the online documentation. 

      (4) While the manuscript does compare AVN to packages like TweetyNet and AVA that share some functionality, it doesn't really mention what's been going on with the vocalpy ecosystem, where the maintainers have been doing a lot to standardize data processing, integrate tools, etc. I would suggest a few words about how AVN might integrate with these efforts.

      We thank the reviewer for this suggestion.

      (5) ll. 333-336: It would be helpful to provide a citation to some of the self-supervised learning literature this procedure is based on. Some citations are provided in methods, but the general approach is worth citing, in my opinion. 

      We have added a paragraph to the results section with more background on self-supervised learning for dimensionality reduction, particularly in the context of similarity scoring.

      (6) One software concern for medium-term maintenance: AVN docs say to use Python 3.8, and GitHub says the package is 3.9 compatible. I also saw in the toml file that 3.10 and above are not supported. It's worth noting that Python 3.9 reaches its end of life in October 2025, so some dependencies may have to be altered or changed for the package to be viable going forward.  

      Thank you for this comment. We will continue to maintain AVN and update its dependencies as needed.

      Minor points:

      (1) It might be good to note that WhisperSeg is a different install from AVN. May be hard for novice users, though there's a web interface that's available. 

      We’ve added a line to the methods section making this clear. 

      (2) Figure 6b: Some text in the y-axis labels is overlapping here. 

      This has been fixed. Thank you for bringing it to our attention. 

      (3) The name of the Python language is always capitalized.  

      We’ve fixed this capitalization error throughout the manuscript. Thank you.

      Reviewer #3 (Recommendations For The Authors):

      (1) I recommend that the authors improve the motivation of the chosen tasks and data or choose new tasks that more clearly speak to the optimizations they want to perform. 

      We have included more details about the motivation for our LDA classification analysis, age prediction model and embedding model for similarity scoring in the results of the revised manuscript, as discussed in more detail in the above responses to this reviewer. Thank you for these suggestions. 

      (2) They need to rigorously report the (classification) scores on the test datasets: these are the scores associated with the cost function used during training.  

      Based on this reviewer’s ‘Weaknesses: 3’ comment in the public reviews, we believe that they are referring to a classification score for the triplet loss model. As we explained in response to that comment, this is not a classification task; therefore, there is no classification score to report. The loss function used to train the model was a triplet loss function. While we could report these values, they are not informative about how well this approach would perform in a similarity scoring context, as explained above. As such, we prefer to include contrast index and tutor contrast index scores to compare the models’ performance for similarity scoring, as these are directly relevant to the task and are established in the field for said task.

      (3) They need to explain the reasons for the poor performance (or report on the inconsistencies with previous work) and why they prefer a fully automated system rather than one that needs some fine-tuning on bird-specific data.

      We’ve addressed this comment in the public response to this reviewer’s weakness points 3, 5, and 6. 

      (4) They should consider applying their method to data from Japanese and European labs.  

      We’ve addressed this comment in the public response to this reviewer’s weakness point 4.

      (5) They need to document the failure modes and report all details about the human annotations.  

      We’ve added additional description of the failure modes for our segmentation and labeling approaches in the results section of the revised manuscript.

      Details: 

      The introduction is very vague, it fails to make a clear case of what the problem is and what the approach is. It reads a bit like an advertisement for machine learning: we are given a hammer and are looking for a nail.  

      We thank the reviewer for this viewpoint; however, we disagree and have decided to keep our Introduction largely unchanged. 

      L46 That interpretability is needed to maximize the benefits of machine learning is wrong; see self-driving cars and ChatGPT.  

      This line states that ‘To truly maximize the benefits of machine learning and deep learning methods for behavior analysis, their power must be balanced with interpretability and generalizability’. We firmly believe that interpretability is critically important when using machine learning tools to gain a deeper scientific understanding of data, including animal behavior data in a neuroscience context. We believe that the introduction and discussion of this paper already provide strong evidence for this claim. 

      L64 What about zebra finches that repeat a syllable in the motif, how are repetitions dealt with by AVN?  

      This is already described in the results section in lines 222-226, and in the methods in the ‘Syntax Features: Repetition Bouts’ section.

      L107 Say a bit more here, what exactly has been annotated?  

      We’ve added a sentence in the introduction to clarify this. Line 113-115. 

      L112 Define spectrogram frames. Do these always fully or sometimes partially contain a vocalization? 

      Spectrogram frames are individual time bins used to compute the spectrogram with a short-time Fourier transform. As described in the ‘Methods: Labeling: UMAP Dimensionality Reduction’ section, our spectrograms are computed using ‘The short term Fourier transform of the normalized audio for each syllable […] with a window length of 512 samples and a hop length of 128 samples’. Given that the song files have a standard sampling rate of 44.1 kHz, each time bin represents 11.6 ms of song data (512 / 44,100 s), with successive frames advancing in time by 2.9 ms (128 / 44,100 s). Each frame therefore contains only a small fraction of a vocalization. 
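
      The frame durations quoted above follow directly from the window length, hop length, and sampling rate. The short sketch below reproduces that arithmetic. The function names are illustrative, and the frame count assumes no padding at the syllable edges, which may differ from the convention used by a given STFT library.

```python
def frame_timing_ms(window_length, hop_length, sample_rate):
    """Return (window duration, hop duration) in milliseconds."""
    window_ms = 1000.0 * window_length / sample_rate
    hop_ms = 1000.0 * hop_length / sample_rate
    return window_ms, hop_ms


def n_frames(n_samples, window_length, hop_length):
    """Number of full frames that fit in a signal, assuming no padding."""
    if n_samples < window_length:
        return 0
    return 1 + (n_samples - window_length) // hop_length


# Values stated in the response: 512-sample window, 128-sample hop, 44.1 kHz.
window_ms, hop_ms = frame_timing_ms(512, 128, 44100)
print(round(window_ms, 1), round(hop_ms, 1))  # 11.6 2.9

# A hypothetical 100 ms syllable (4410 samples) spans many frames.
frames_in_syllable = n_frames(4410, 512, 128)
```

      Under these assumptions, a 100 ms syllable spans 31 frames, which illustrates why a single frame covers only a small fraction of a vocalization.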

      L122 The reported TweetyNet score of 0.824 is lower than the one reported in Figure 2a.  

      The center line in the box plot in Figure 2a represents the median of the distribution of TweetyNet V-measure scores. Given that there are a couple of outlying birds with very low scores, the mean (0.824, as reported in the text of the results section) is lower than the median. This is not an error.
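
      The effect of a few low outliers on the mean, but not the median, can be seen in a small example. The scores below are made up purely for illustration and are not the values from the paper.

```python
import statistics

# Hypothetical per-bird scores: most birds score high, two score very low.
scores = [0.93, 0.91, 0.92, 0.90, 0.94, 0.45, 0.50]

mean_score = statistics.mean(scores)      # pulled down by the two outliers
median_score = statistics.median(scores)  # middle value, robust to outliers

print(f"mean = {mean_score:.3f}, median = {median_score:.3f}")
```

      With a left-skewed distribution like this one, the mean falls well below the median, so a box plot's center line and a reported mean can legitimately differ.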

      L155 Some of the differences in performance are very small, reporting of the P value might be necessary. 

      These methods are unlikely to differ statistically significantly in their validation scores. This doesn’t mean that we cannot use the reported mean/median values to justify favoring one method over another, which is why we’ve chosen not to report p-values here.

      L161 The authors have not really tested more than a single clustering method, failing to show a serious attempt to achieve good performance.  

      We’ve addressed this comment in the public response to this reviewer’s weakness point 2.

      L186 Did isolate birds produce stereotyped syllables that can be clustered? 

      Yes, they did. The validation for clustering of isolate bird songs can be found in Figure 2–figure supplement 4. 

      Fig. 3e: How were the multiple bouts aligned?

      This is described in lines 857-876 in the ‘Methods: Song Timing Features: Rhythm Spectrograms’ section of the paper.

      L199 There is a space missing in front of (n=8).  

      Thank you for bringing this to our attention. It’s been corrected in the updated manuscript. 

      L268 Define classification accuracy.  

      We’ve added a sentence in lines 953-954 of the methods section defining classification accuracy. 

      L325 How many motifs need to be identified, and why does this need to be done manually? There are semiautomated methods that can allow scaling; these should be cited here. Also, the mention of bias here should be removed in favor of a more extensive discussion of experimenter bias (traditional bias vs. the Texas bias in this paper).  

      All of the methods cited in this line have graphical user interfaces that require users to select a file containing song and manually highlight the start and end of each motif to be compared. The exact number of motifs required varies depending on the specific context (e.g. more examples are needed to detect more subtle differences or changes in song similarity), but it is fairly standard for reviewers to score 30–100 pairs of motifs. 

      We’ve discussed the tradeoffs between full automation and supervised or human-in-the-loop methods in response to this reviewer’s public comment ‘weakness #5 and 6’. Briefly, AVN’s aim is to standardize song analysis, to allow direct comparisons between song features and similarity scores across research groups. We believe, as explained in the paper, that this can be best achieved by having different research groups use the same deep learning models, which perform consistently well across those groups. Introducing semi-automated methods would defeat this benefit of AVN. 

      We’ve also addressed the question of ‘Texas bias’ in response to this reviewer’s public comment ‘Weakness #4’. 

      L340 How is EMD applied? Syllables are points in 8-dim space, but now suddenly authors talk about distributions without explaining how they got from points to distributions. Same in L925.  

      We apologize for the confusion here. The syllable points in the 8-d space are collectively an empirical distribution, not a probability distribution. We referred to them simply as ‘distributions’ to limit technical jargon in the results of the paper, but have changed this to more precise language in the revised manuscript.
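
      To make the step from points to distributions concrete, the sketch below treats each bird's syllable embeddings as an empirical distribution with equal weight on every point and computes the earth mover's distance (EMD) between two such sets. It is an illustration only, not AVN's implementation: it assumes the two sets have the same number of points (in which case the optimal transport plan is a one-to-one matching) and finds that matching by brute force, which is feasible only for very small sets. The function name and the 2-D toy points are hypothetical; a practical pipeline would use an optimal transport solver and the 8-dimensional embeddings.

```python
import math
from itertools import permutations


def emd_equal_size(xs, ys):
    """EMD between two equal-size point sets with uniform weights.

    Each point carries mass 1/n, so the optimal transport plan is a
    permutation; we search all permutations (only viable for tiny n).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch requires equal-size point sets")
    n = len(xs)
    cost = [[math.dist(x, y) for y in ys] for x in xs]
    best_total = min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n


# Toy example: the "pupil" set is the "tutor" set shifted by the vector (3, 4).
tutor = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
pupil = [(x + 3.0, y + 4.0) for x, y in tutor]

same = emd_equal_size(tutor, tutor)      # identical sets
shifted = emd_equal_size(tutor, pupil)   # every point moves a distance of 5
```

      Identical sets give a distance of zero, and a rigid shift of the whole set gives a distance equal to the length of the shift, which matches the intuition of EMD as the average distance mass must travel.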

      L351 Why do authors now use 'contrast index' to measure performance and no longer 'classification accuracy'?  

      We’ve addressed this comment in the public response to this reviewer’s weakness points 1 and 2.

      Figure 6 What is the confusion matrix, i.e. how well can the model identify pupil-pupil pairings from pupil-tutor and from pupil-unrelated pairings? I guess that would amount to something like classification accuracy.  

      There is no model classifying comparisons as pupil-pupil vs. pupil-tutor etc. These comparisons exist only to show the behavior of the similarity scoring approach, which consists of a dissimilarity measure (MMD or EMD) applied to low-dimensional representations of syllables generated by the triplet loss model or VAE. This was clarified further in our public response to this reviewer’s weakness points 1 and 2. 

      L487 What are 'song files', and what do they contain?   

      ‘Song files’ are .wav files containing recordings of zebra finch song. They typically contain a single song bout, but they can include multiple song bouts if they are produced close together, or incomplete song bouts if the introductory notes were very soft or the bouts were very long (>30s from the start of the file). Details of these recordings are provided in the ‘Methods: Data Acquisition: UTSW Dataset’ section of the manuscript.

      L497 Calls were only labelled for tweetynet but not for other tasks.  

      That is correct. The rationale for this is provided in the ‘Methods: Manual Song Annotation’ section of the manuscript. 

      L637 There is a contradiction (can something be assigned to the 'own manual annotation category' when the same sentence states that this is done 'without manual annotation'?) 

      We believe there is confusion here between automated annotation and validation. Any bird can be automatically annotated without the need for any existing manual annotations for that individual bird. However, manual labels are required to compare automatically generated annotations against for validation of the method.

      L970 Spectrograms of what? (what is the beginning of a song bout, L972). 

      The beginning of a song bout is the first introductory note produced by a bird after a period without vocalizations. This is standard.

    1. Reviewer #3 (Public review):

      Summary:

      The aim of this study was to investigate the temporal progression of the neural response to event boundaries in relation to uncertainty and error. Specifically, the authors asked (1) how neural activity changes before and after event boundaries, (2) if uncertainty and error both contribute to explaining the occurrence of event boundaries, and (3) if uncertainty and error have unique contributions to explaining the temporal progression of neural activity.

      Strengths:

      One strength of this paper is that it builds on an already validated computational model. It relies on straightforward and interpretable analysis techniques to answer the main question, with a smart combination of pattern similarity metrics and FIR. This combination of methods may also be an inspiration to other researchers in the field working on similar questions. The paper is well written and easy to follow. The paper convincingly shows that (1) there is a temporal progression of neural activity change before and after an event boundary, and (2) event boundaries are predicted best by the combination of uncertainty and error signals.

      Weaknesses:

      Regarding question 3, I am less convinced by the results. They show that overlapping but somewhat distinct sets of brain regions relate to uncertainty and error boundaries over time. And that some regions show distinct patterns of temporal progressions in pattern change with both types of boundaries. However, most of the effects they observe in this analysis may still be driven by shared variance, as suggested by the results in Figure 6 and the high correlation between the two boundary time series. More specific comments are provided below.

      Impact:

      If these comments can be addressed sufficiently, I expect that this work will impact the field in its thinking on what drives event boundaries and spur interest in understanding the mechanisms behind the temporal progression of neural activity around these boundaries.

      Comments

      (1) The current analysis of the neural data does not convincingly show that uncertainty and prediction error both contribute to the neural responses. As both terms are modelled in separate FIR models, it may be that the responses we see for both are mostly driven by shared variance. Given that the correlation between the two is very high (r=0.49), this seems likely. The strong overlap in the neural responses elicited by both, as shown in Figure 6, also suggests that what we see may mainly be shared variance. To improve the interpretability of these effects, I think it is essential to know whether uncertainty and error explain similar or unique parts of the variance. The observation that they have distinct temporal profiles is suggestive of some dissociation, but not as convincing as adding them both to a single model.

      (2) The results for uncertainty and error show that uncertainty has strong effects before or at boundary onset, while error is related to more stabilization after boundary onset. This makes me wonder about the temporal contribution of each of these. Could it be the case that increases in uncertainty are early indicators of a boundary, and errors tend to occur later?

      (3) Given that there is a 24-second period during which the neural responses are shaped by event boundaries, it would be important to know more about the average distance between boundaries and the variability of this distance. This will help establish whether the FIR model can properly capture a return to baseline.

      (4) Given that there is an early onset and long-lasting response of the brain to these event boundaries, I wonder what causes this. Is it the case that uncertainty or errors already increase at 12 seconds before the boundaries occur? Or are there other markers in the movie that the brain can use to foreshadow an event boundary? And if uncertainty or errors do increase already 12 seconds before an event boundary, do you see a similar neural response at moments with similar levels of error or uncertainty, which are not followed by a boundary? This would reveal whether the neural activity patterns are specific to event boundaries or whether these are general markers of error and uncertainty.

      (5) It is known that different brain regions have different delays of their BOLD response. Could these delays contribute to the propagation of the neural activity across different brain areas in this study?

      (6) In the FIR plots, timepoints -12, 0, and 12 are shown. These long intervals preclude an understanding of the full temporal progression of these effects.

    2. Author response:

      Reviewer #1 (Public review):

      Summary:

      This paper investigates the control signals that drive event model updating during continuous experience. The authors apply predictions from previously published computational models to fMRI data acquired while participants watched naturalistic video stimuli. They first examine the time course of BOLD pattern changes around human-annotated event boundaries, revealing pattern changes preceding the boundary in anterior temporal and then parietal regions, followed by pattern stabilization across many regions. The authors then analyze time courses around boundaries generated by a model that updates event models based on prediction error and another that uses prediction uncertainty. These analyses reveal overlapping but partially distinct dynamics for each boundary type, suggesting that both signals may contribute to event segmentation processes in the brain.

      Strengths:

      (1) The question addressed by this paper is of high interest to researchers working on event cognition, perception, and memory. There has been considerable debate about what kinds of signals drive event boundaries, and this paper directly engages with that debate by comparing prediction error and prediction uncertainty as candidate control signals.

      (2) The authors use computational models that explain significant variance in human boundary judgments, and they report the variance explained clearly in the paper.

      (3) The authors' method of using computational models to generate predictions about when event model updating should occur is a valuable mechanistic alternative to methods like HMM or GSBS, which are data-driven.

      (4) The paper utilizes an analysis framework that characterizes how multivariate BOLD pattern dissimilarity evolves before and after boundaries. This approach offers an advance over previous work focused on just the boundary or post-boundary points.

      We appreciate this reviewer’s recognition of the significance of this research problem, and of the value of the approach taken by this paper.

      Weaknesses:

      (1) While the paper raises the possibility that both prediction error and uncertainty could serve as control signals, it does not offer a strong theoretical rationale for why the brain would benefit from multiple (empirically correlated) signals. What distinct advantages do these signals provide? This may be discussed in the authors' prior modeling work, but is left too implicit in this paper.

      We added a brief discussion in the introduction highlighting the complementary advantages of prediction error and prediction uncertainty, and cited prior theoretical work that elaborates on this point. Specifically, we now note that prediction error can act as a reactive trigger, signaling when the current event model is no longer sufficient (Zacks et al., 2007). In contrast, prediction uncertainty is framed as proactive, allowing the system to prepare for upcoming changes even before they occur (Baldwin & Kosie, 2021; Kuperberg, 2021). Together, this makes clearer why these two signals could each provide complementary benefits for effective event model updating.

      "One potential signal to control event model updating is prediction error—the difference between the system’s prediction and what actually occurs. A transient increase in prediction error is a valid indicator that the current model no longer adequately captures the current activity. Event Segmentation Theory (EST; Zacks et al., 2007) proposes that event models are updated when prediction error increases beyond a threshold, indicating that the current model no longer adequately captures ongoing activity. A related but computationally distinct proposal is that prediction uncertainty (also termed "unpredictability"), in addition to error, serves as the control signal (Baldwin & Kosie, 2021). The advantage of relying on prediction uncertainty to detect event boundaries is that it is inherently proactive: the cognitive system can start looking for cues about what might come next before the next event starts (Baldwin & Kosie, 2021; Kuperberg, 2021)."

      (2) Boundaries derived from prediction error and uncertainty are correlated for the naturalistic stimuli. This raises some concerns about how well their distinct contributions to brain activity can be separated. The authors should consider whether they can leverage timepoints where the models make different predictions to make a stronger case for brain regions that are responsive to one vs the other.

      We addressed this concern by adding an analysis that explicitly tests the unique contributions of prediction error– and prediction uncertainty–driven boundaries to neural pattern shifts. In the revised manuscript, we describe how we fit a combined FIR model that included both boundary types as predictors and then compared this model against versions with only one predictor. This allowed us to identify the variance explained by each boundary type over and above the other. The results revealed two partially dissociable sets of brain regions sensitive to error- versus uncertainty-driven boundaries (see Figure S1), strengthening our argument that these signals make distinct contributions.
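
      The model comparison described above can be sketched as a likelihood ratio test between nested linear models. The example below is a minimal illustration under stated assumptions, not the authors' code: it assumes ordinary least squares fits with Gaussian residuals, so the test statistic reduces to n · ln(RSS_reduced / RSS_full), and it evaluates the chi-square tail probability with a standard series expansion of the incomplete gamma function. The residual sums of squares, the number of observations, and the number of extra parameters are made-up values.

```python
import math


def lr_statistic(rss_reduced, rss_full, n_obs):
    """Likelihood ratio statistic for nested Gaussian linear models."""
    return n_obs * math.log(rss_reduced / rss_full)


def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution.

    Uses the series expansion of the regularized lower incomplete
    gamma function P(df/2, x/2) and returns 1 - P.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total) and n < 10000:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


# Hypothetical fit results for one parcel.
n_obs = 200              # number of timepoints entering the model
rss_error_only = 120.0   # reduced model: error-driven boundaries only
rss_combined = 100.0     # full model: error- and uncertainty-driven boundaries
extra_params = 5         # FIR coefficients added by the uncertainty boundaries

stat = lr_statistic(rss_error_only, rss_combined, n_obs)
p_value = chi2_sf(stat, extra_params)
print(f"LR statistic = {stat:.2f}, p = {p_value:.2e}")
```

      Comparing the combined model against the error-only model in this way tests the unique contribution of the uncertainty boundaries; swapping in the uncertainty-only model as the reduced model tests the unique contribution of the error boundaries.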

      "To account for the correlation between uncertainty-driven boundaries and error-driven boundaries, we also fitted a FIR model that predicts pattern dissimilarity from both types of boundaries (combined FIR) for each parcel. Then, we performed two likelihood ratio tests: combined FIR to error FIR, which measures the unique contribution of uncertainty boundaries to pattern dissimilarity, and combined FIR to uncertainty FIR, which measures the unique contribution of error boundaries to pattern dissimilarity. The analysis also revealed two dissociable sets of brain regions associated with each boundary type (see Figure S1)."

      (3) The authors refer to a baseline measure of pattern dissimilarity, which their dissimilarity measure of interest is relative to, but it's not clear how this baseline is computed. Since the interpretation of increases or decreases in dissimilarity depends on this reference point, more clarity is needed.

      We clarified how the FIR baseline is estimated in the methods section. Specifically, we now explain that the FIR coefficients should be interpreted relative to a reference level, which reflects the expected dissimilarity when timepoints are far from an event boundary. This makes it clear what serves as the comparison point for observed increases or decreases in dissimilarity.
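
      One way to picture this baseline is through the FIR design matrix itself. The sketch below is illustrative only (the function name, lag range, and boundary position are assumptions, and the authors' model may be specified differently): each row is a timepoint, the first column is an intercept, and each remaining column marks whether that timepoint sits at a given lag from a boundary. Timepoints far from any boundary have only the intercept switched on, so the intercept estimates the baseline and the lag coefficients estimate deviations from it.

```python
def fir_design(n_timepoints, boundaries, lags):
    """Build an FIR design matrix as a list of rows.

    Column 0 is the intercept; column k + 1 is 1 when the timepoint lies
    at lags[k] relative to some boundary (negative lag = before it).
    """
    boundary_set = set(boundaries)
    rows = []
    for t in range(n_timepoints):
        row = [1]  # intercept: the baseline level
        for lag in lags:
            row.append(1 if (t - lag) in boundary_set else 0)
        rows.append(row)
    return rows


# Hypothetical run: 20 timepoints, one boundary at t = 10, lags -2..+2.
lags = [-2, -1, 0, 1, 2]
design = fir_design(20, [10], lags)

# Rows with no lag indicator set contribute only to the intercept.
baseline_rows = [row for row in design if sum(row[1:]) == 0]
```

      In this toy run, 15 of the 20 timepoints lie outside the modeled window and therefore define the reference level against which increases or decreases in dissimilarity are read.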

      "The coefficients from the FIR model indicate changes relative to baseline, which can be conceptualized as the expected value when far from the boundary."

      (4) The authors report an average event length of ~20 seconds, and they also look at +20 and -20 seconds around each event boundary. Thus, it's unclear how often pre- and post-boundary timepoints are part of adjacent events. This complicates the interpretations of the reported time courses.

      This is related to Reviewer 2's comment and will be addressed below.

      (5) The authors describe a sequence of neural pattern shifts during each type of boundary, but offer little setup of what pattern shifts we might expect or why. They also offer little discussion of what cognitive processes these shifts might reflect. The paper would benefit from a more thorough setup for the neural results and a discussion that comments on how the results inform our understanding of what these brain regions contribute to event models.

      We thank the reviewer for this advice on how better to set the context for the different potential outcomes of the study. We expanded both the introduction and discussion to better set up expectations for neural pattern shifts and to interpret what these shifts may reflect. In the introduction, we now describe prior findings showing that sensory regions tend to update more quickly than higher-order multimodal regions (Baldassano et al., 2017; Geerligs et al., 2021, 2022), and we highlight that it remains unclear whether higher-order updates precede or follow those in lower-order regions. We also note that our analytic approach is well-suited to address this open question. In the discussion, we then interpret our results in light of this framework. Specifically, we describe how we observed early shifts in higher-order areas such as anterior temporal and prefrontal cortex, followed by shifts in parietal and dorsal attention regions closer to event boundaries. This pattern runs counter to the traditional bottom-up temporal hierarchy view and instead supports a model of top-down updating, where high-level representations are updated first and subsequently influence lower-level processing (Friston, 2005; Kuperberg, 2021). To make this interpretation concrete, we added an example: in a narrative where a goal is reached midway—for instance, a mystery solved before the story formally ends—higher-order regions may update the event representation at that point, and this updated model then cascades down to shape processing in lower-level regions. Finally, we note that the widespread stabilization of neural patterns after boundaries may signal the establishment of a new event model.

      Excerpt from Introduction:

      “More recently, multivariate approaches have provided insights into neural representations during event segmentation. One prominent approach uses hidden Markov models (HMMs) to detect moments when the brain switches from one stable activity pattern to another (Baldassano et al., 2017) during movie viewing; these periods of relative stability were referred to as "neural states" to distinguish them from subjectively perceived events. Sensory regions like visual and auditory cortex showed faster transitions between neural states. Multi-modal regions like the posterior medial cortex, angular gyrus, and intraparietal sulcus showed slower neural state shifts, and these shifts aligned with subjectively reported event boundaries. Geerligs et al. (2021, 2022) employed a different analytical approach called Greedy State Boundary Search (GSBS) to identify neural state boundaries. Their findings echoed the HMM results: short-lived neural states were observed in early sensory areas (visual, auditory, and somatosensory cortex), while longer-lasting states appeared in multi-modal regions, including the angular gyrus, posterior middle/inferior temporal cortex, precuneus, anterior temporal pole, and anterior insula. Particularly prolonged states were found in higher-order regions such as lateral and medial prefrontal cortex...

      The previous evidence about evoked responses at event boundaries indicates that these are dynamic phenomena evolving over many seconds, with different brain areas showing different dynamics (Ben-Yakov & Henson, 2018; Burunat et al., 2024; Kurby & Zacks, 2018; Speer et al., 2007; Zacks, 2010). Less is known about the dynamics of pattern shifts at event boundaries, because the HMM and GSBS analysis methods do not directly provide moment-by-moment measures of pattern shifts. For example, one question is whether shifts in higher-order regions precede or follow shifts in lower-level regions. Both the spatial and temporal aspects of evoked responses and pattern shifts at event boundaries have the potential to provide evidence about potential control processes for event model updating.”

      Excerpt from Discussion:

      “We first characterized the neural signatures of human event segmentation by examining both univariate activity changes and multivariate pattern changes around subjectively identified event boundaries. Using multivariate pattern dissimilarity, we observed a structured progression of neural reconfiguration surrounding human-identified event boundaries. The largest pattern shifts were observed near event boundaries (~4.5s before) in dorsal attention and parietal regions; these correspond with regions identified by Geerligs et al. (2022) as shifting their patterns on an intermediate timescale. We also observed smaller pattern shifts roughly 12 seconds prior to event boundaries in higher-order regions within anterior temporal cortex and prefrontal cortex, and these are slow-changing regions identified by Geerligs et al. (2022). This is puzzling. One prevalent proposal, based on the idea of a cortical hierarchy of increasing temporal receptive windows (TRWs), suggests that higher-order regions should update representations after lower-order regions do (Chang et al., 2021). In this view, areas with shorter TRWs (e.g., word-level processors) pass information upward, where it is integrated into progressively larger narrative units (phrases, sentences, events). This proposal predicts neural shifts in higher-order regions to follow those in lower-order regions. By contrast, our findings indicate the opposite sequence. Our findings suggest that the brain might engage in top-down event representation updating, with changes in coarser-grain representations propagating downward to influence finer-grain representations (Friston, 2005; Kuperberg, 2021). 
For example, in a narrative where the main goal is achieved midway—such as a detective solving a mystery before the story formally ends—higher-order regions might update the overarching event representation at that point, and this updated model could then cascade down to reconfigure how lower-level regions process the remaining sensory and contextual details. In the period after a boundary (around +12 seconds), we found widespread stabilization of neural patterns across the brain, suggesting the establishment of a new event model. Future work could focus on understanding the mechanisms behind the temporal progression of neural pattern changes around event boundaries.”

      Reviewer #2 (Public review):

      Summary:

      Tan et al. examined how multivoxel patterns shift in time windows surrounding event boundaries caused by both prediction errors and prediction uncertainty. They observed that some regions of the brain show earlier pattern shifts than others, followed by periods of increased stability. The authors combine their recent computational model to estimate event boundaries that are based on prediction error vs. uncertainty and use this to examine the moment-to-moment dynamics of pattern changes. I believe this is a meaningful contribution that will be of interest to memory, attention, and complex cognition research.

      Strengths:

      The authors have shown exceptional transparency in terms of sharing their data, code, and stimuli, which is beneficial to the field for future examinations and to the reproduction of findings. The manuscript is well written with clear figures. The study starts from a strong theoretical background to understand how the brain represents events and has used a well-curated set of stimuli. Overall, the authors extend the event segmentation theory beyond prediction error to include prediction uncertainty, which is an important theoretical shift that has implications in episodic memory encoding, the use of semantic and schematic knowledge, and attentional processing.

      We thank the reviewer for their support for our use of open science practices, and for their appreciation of the importance of incorporating prediction uncertainty into models of event comprehension.

      Weaknesses:

      The data presented is limited to the cortex, and subcortical contributions would be interesting to explore. Further, the temporal window around event boundaries of 20 seconds is approximately the length of the average event (21.4 seconds), and many of the observed pattern effects occur relatively distal from event boundaries themselves, which makes the link to the theoretical background challenging. Finally, while multivariate pattern shifts were examined at event boundaries related to either prediction error or prediction uncertainty, there was no exploration of univariate activity differences between these two different types of boundaries, which would be valuable.

      The fact that we observed neural pattern shifts well before boundaries was indeed unexpected, and we now offer a more extensive interpretation in the discussion section. Specifically, we added text noting that shifts emerged in higher-order anterior temporal and prefrontal regions roughly 12 seconds before boundaries, whereas shifts occurred in lower-level dorsal attention and parietal regions closer to boundaries. This sequence contrasts with the traditional bottom-up temporal hierarchy view and instead suggests a possible top-down updating mechanism, in which higher-order representations reorganize first and propagate changes to lower-level areas (Friston, 2005; Kuperberg, 2021). (See excerpt for Reviewer 1’s comment #5.)

      With respect to univariate activity, we did not find strong differences between error-driven and uncertainty-driven boundaries. This makes the multivariate analyses particularly informative for detecting differences in neural pattern dynamics. To support further exploration, we have also shared the temporal progression of univariate BOLD responses on OpenNeuro for interested researchers.

      Reviewer #3 (Public review):

      Summary:

      The aim of this study was to investigate the temporal progression of the neural response to event boundaries in relation to uncertainty and error. Specifically, the authors asked (1) how neural activity changes before and after event boundaries, (2) if uncertainty and error both contribute to explaining the occurrence of event boundaries, and (3) if uncertainty and error have unique contributions to explaining the temporal progression of neural activity.

      Strengths:

      One strength of this paper is that it builds on an already validated computational model. It relies on straightforward and interpretable analysis techniques to answer the main question, with a smart combination of pattern similarity metrics and FIR. This combination of methods may also be an inspiration to other researchers in the field working on similar questions. The paper is well written and easy to follow. The paper convincingly shows that (1) there is a temporal progression of neural activity change before and after an event boundary, and (2) event boundaries are predicted best by the combination of uncertainty and error signals.

      We thank the reviewer for their thoughtful and supportive comments, particularly regarding the use of the computational model and the analysis approaches.

      Weaknesses:

      (1) The current analysis of the neural data does not convincingly show that uncertainty and prediction error both contribute to the neural responses. As both terms are modelled in separate FIR models, it may be that the responses we see for both are mostly driven by shared variance. Given that the correlation between the two is very high (r=0.49), this seems likely. The strong overlap in the neural responses elicited by both, as shown in Figure 6, also suggests that what we see may mainly be shared variance. To improve the interpretability of these effects, I think it is essential to know whether uncertainty and error explain similar or unique parts of the variance. The observation that they have distinct temporal profiles is suggestive of some dissociation, but not as convincing as adding them both to a single model.

      We appreciate this point. It is closely related to Reviewer 1's comment 2; please refer to our response above.
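      The reviewer's question about shared versus unique variance can be illustrated with a small variance-partitioning sketch. This is not the analysis reported in the manuscript; it is a minimal example on synthetic data, and the function names (`r_squared`, `partition_variance`) and all simulated signals are assumptions made for illustration. The idea is to compare the fit of a joint model with the fits of each single-predictor model.

```python
import numpy as np

def r_squared(y, predictors):
    """R^2 of an ordinary least-squares fit of y on the predictors plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

def partition_variance(y, a, b):
    """Split the joint R^2 into parts unique to a, unique to b, and shared."""
    r2_a = r_squared(y, [a])
    r2_b = r_squared(y, [b])
    r2_ab = r_squared(y, [a, b])
    return {
        "unique_a": r2_ab - r2_b,        # gain from adding a to a model with b
        "unique_b": r2_ab - r2_a,        # gain from adding b to a model with a
        "shared": r2_a + r2_b - r2_ab,   # variance either predictor could explain
        "total": r2_ab,
    }

# Synthetic example (hypothetical data): two correlated predictors,
# but the response depends only on the first one.
rng = np.random.default_rng(0)
common = rng.standard_normal(2000)
uncertainty = common + rng.standard_normal(2000)
error = common + rng.standard_normal(2000)
response = uncertainty + 0.5 * rng.standard_normal(2000)

parts = partition_variance(response, uncertainty, error)
```

      In this toy case the second predictor correlates with the response only through its overlap with the first, so its unique contribution is close to zero even though its single-predictor fit is well above zero.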

      (2) The results for uncertainty and error show that uncertainty has strong effects before or at boundary onset, while error is related to more stabilization after boundary onset. This makes me wonder about the temporal contribution of each of these. Could it be the case that increases in uncertainty are early indicators of a boundary, and errors tend to occur later?

      We share the intuition that increases in uncertainty are early indicators of a boundary and that errors tend to occur later. If that were the case, we would expect a lag between prediction uncertainty and prediction error. We examined the lagged correlation between prediction uncertainty and prediction error and found that the optimal lag is 0 for both the uncertainty-driven and error-driven models. This indicates that when prediction uncertainty rises, prediction error rises simultaneously.

      Author response image 1.
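      A lagged correlation of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`lagged_correlation`, `optimal_lag`), the maximum lag, and the synthetic time series are all hypothetical. The sketch computes the Pearson correlation between one series and a shifted copy of the other, then picks the lag with the highest correlation.

```python
import numpy as np

def lagged_correlation(x, y, max_lag):
    """Pearson correlation between x[t] and y[t + lag] for each lag in [-max_lag, max_lag]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]   # y trails x by `lag` samples
        else:
            a, b = x[-lag:], y[:len(y) + lag]  # y leads x by `-lag` samples
        result[lag] = float(np.corrcoef(a, b)[0, 1])
    return result

def optimal_lag(x, y, max_lag):
    """Lag at which the correlation is largest."""
    corrs = lagged_correlation(x, y, max_lag)
    return max(corrs, key=corrs.get)

# Synthetic check (hypothetical data): a copy of the series delayed by 3 samples.
rng = np.random.default_rng(0)
uncertainty = rng.standard_normal(500)
delayed_copy = np.roll(uncertainty, 3)

lag_delayed = optimal_lag(uncertainty, delayed_copy, max_lag=10)
lag_same = optimal_lag(uncertainty, uncertainty, max_lag=10)
```

      An optimal lag of 0, as reported in the response, would correspond to the second case here, where the two series rise and fall together.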

      (3) Given that there is a 24-second period during which the neural responses are shaped by event boundaries, it would be important to know more about the average distance between boundaries and the variability of this distance. This will help establish whether the FIR model can properly capture a return to baseline.

      We have added details about the distribution of event lengths. Specifically, we now report that the mean length of subjectively identified events was 21.4 seconds (median 22.2 s, SD 16.1 s). For model-derived boundaries, the average event lengths were 28.96 seconds for the uncertainty-driven model and 24.7 seconds for the error-driven model.

      "For each activity, a separate group of 30 participants had previously segmented each movie to identify fine-grained event boundaries (Bezdek et al., 2022). The mean event length was 21.4 s (median 22.2 s, SD 16.1 s). Mean event lengths for the uncertainty-driven and error-driven models were 28.96 s and 24.7 s, respectively."

      (4) Given that there is an early onset and long-lasting response of the brain to these event boundaries, I wonder what causes this. Is it the case that uncertainty or errors already increase at 12 seconds before the boundaries occur? Or are there other markers in the movie that the brain can use to foreshadow an event boundary? And if uncertainty or errors do increase already 12 seconds before an event boundary, do you see a similar neural response at moments with similar levels of error or uncertainty, which are not followed by a boundary? This would reveal whether the neural activity patterns are specific to event boundaries or whether these are general markers of error and uncertainty.

      We appreciate this point; it is similar to Reviewer 2’s comment 2. Please see our response to that comment above.

      (5) It is known that different brain regions have different delays of their BOLD response. Could these delays contribute to the propagation of the neural activity across different brain areas in this study?

      Our analyses use ±20 s FIR windows, and the key effects we report include shifts ~12 s before boundaries in higher-order cortex and ~4.5 s before boundaries in dorsal attention/parietal areas. Region-dependent BOLD delays reported in the literature are much smaller (~1–2 s) than the temporal structure we observe (Taylor et al., 2018), making it unlikely that HRF lag alone explains our multi-second, region-specific progression.
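      For readers unfamiliar with finite impulse response (FIR) models, the windowed design can be sketched as below. This is a generic illustration, not the authors' pipeline: the function names (`fir_design_matrix`, `fit_fir`), the window size in scans, and the toy signal are assumptions, and the mapping from a ±20 s window to a number of scans depends on the repetition time, which is not stated here. Each column marks the scans at one fixed offset from a boundary, so the fitted weights trace the response at each offset without assuming a response shape.

```python
import numpy as np

def fir_design_matrix(n_scans, boundary_scans, pre, post):
    """Indicator matrix with one column per offset from -pre to +post scans."""
    offsets = np.arange(-pre, post + 1)
    X = np.zeros((n_scans, len(offsets)))
    for b in boundary_scans:
        for col, off in enumerate(offsets):
            t = b + off
            if 0 <= t < n_scans:       # ignore offsets that fall outside the run
                X[t, col] = 1.0
    return X, offsets

def fit_fir(signal, X):
    """Least-squares FIR weights, with an intercept column for the baseline."""
    design = np.column_stack([X, np.ones(len(signal))])
    beta, *_ = np.linalg.lstsq(design, signal, rcond=None)
    return beta[:-1]                   # drop the intercept

# Toy example (hypothetical data): a response that occurs 2 scans BEFORE each boundary.
n_scans = 200
boundaries = [50, 120]
X, offsets = fir_design_matrix(n_scans, boundaries, pre=5, post=5)
signal = np.zeros(n_scans)
signal[[48, 118]] = 2.0
betas = fit_fir(signal, X)
```

      Because negative offsets get their own columns, a pre-boundary effect such as the one described above appears as a non-zero weight at a negative offset rather than being forced after the boundary.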

      (6) In the FIR plots, timepoints -12, 0, and 12 are shown. These long intervals preclude an understanding of the full temporal progression of these effects.

      To limit page length, we did not include all timepoints. We have uploaded an animation of all timepoints to OpenNeuro for interested researchers.

      References

      Taylor, A. J., Kim, J. H., & Ress, D. (2018). Characterization of the hemodynamic response function across the majority of human cerebral cortex. NeuroImage, 173, 322–331. https://doi.org/10.1016/j.neuroimage.2018.02.061

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Weaknesses: 

      The main weakness in this paper lies in the authors' reliance on a single model to derive conclusions on the role of local antigen during the acute phase of the response by comparing T cells in model antigen-vaccinia virus (VV-OVA) exposed skin to T cells in contralateral skin exposed to DNFB 5 days after the VV-OVA exposure. In this setting, antigen-independent factors may contribute to the difference in CD8+ T cell number and phenotype at the two sites. For example, it was recently shown that very early memory precursors (formed 2 days after exposure) are more efficient at seeding the epithelial TRM compartment than those recruited to skin at later times (Silva et al, Sci Immunol, 2023). DNFB-treated skin may therefore recruit precursors with reduced TRM potential. In addition, TRM-skewed circulating memory precursors have been identified (Kok et al, JEM, 2020), and perhaps VV-OVA exposed skin more readily recruits this subset compared to DNFB-exposed skin. Therefore, when the DNFB challenge is performed 5 days after vaccinia virus, the DNFB site may already be at a disadvantage in the recruitment of CD8+ T cells that can efficiently form TRM. In addition, CD8+ T cell-extrinsic mechanisms may be at play, such as differences in myeloid cell recruitment and differentiation or local cytokine and chemokine levels in VV-infected and DNFB-treated skin that could account for differences seen in TRM phenotype and function between these two sites. Although the authors do show that providing exogenous peptide antigen at the DNFB-site rescues their phenotype in relation to the VV-OVA site, the potential antigen-independent factors distinguishing these two sites remain unaddressed. 
In addition, there is a possibility that peptide treatment of DNFB-treated skin initiates a second phase of priming of new circulatory effectors in the local draining lymph nodes that are then recruited to form TRM at the DNFB site, and that the effect does not solely rely on TRM precursors at the DNFB-treated skin site at the time of peptide treatment. 

      Thank you for pointing out these potential caveats to our work. We have considered the possibility that late application of peptide or cell-extrinsic differences could affect the interpretation of our results. We would like to highlight that in our prior publication on this topic [1], we found that OT-1 responses in mice infected with VV-OVA and VV-N (irrelevant antigen) yielded the same responses as in our VV-OVA/DNFB models. In addition, in both our prior publication and our current manuscript, application of peptide to DNFB-painted sites results in T<sub>RM</sub> with a similar phenotype to those in the VV-OVA site. Thus, we are confident that it is the presence of cognate antigen in the skin that drives the augmented T<sub>RM</sub> fitness that we observe.

      Secondly, although the authors conclusively demonstrate that TGFBRIII is induced by TCR signals and required for conferring increased fitness to local-antigen-experienced CD8+ TRM compared to local antigen-inexperienced cells, this is done in only one experiment, albeit repeated 3 times. The data suggest that antigen encounter during TRM formation induces sustained TGFBRIII expression that persists during the antigen-independent memory phase. It remains unclear why only the antigen encounter in skin, but not already in the draining lymph nodes, induces sustained TGFBRIII expression. Further characterizing the dynamics of TGFBRIII expression on CD8+ T cells during priming in draining lymph nodes and over the course of TRM formation and persistence may shed more light on this question. Probing the role of this mechanism at other sites of TRM formation would also further strengthen their conclusions and enhance the significance of this finding. 

      This is an intriguing point. We do not understand why expression of TGFbR3 in T<sub>RM</sub> requires antigen encounter in the skin when T<sub>RM</sub> at all sites have clearly encountered antigen during priming in the LN. We speculate that durable TGFbR3 expression may require antigen encounter in the context of additional cues present in the periphery, or only once cells have committed to the T<sub>RM</sub> lineage. A more detailed characterization of the dynamics of TGFbR3 expression in multiple tissues would be informative and represents a promising future direction for this project. We note that performing these experiments robustly would likely require a reporter mouse.

      Reviewer #2 (Public review): 

      Weaknesses: 

      Overall, the authors' conclusions are well supported, although there are some instances where additional controls, experiments, or clarifications would add rigor. The conclusions regarding skin-localized TCR signaling leading to increased skin CD8+ TRM proliferation in-situ and increased TGFBR3 expression would be strengthened by assessing skin CD8+ TRM proliferation and TGFBR3 expression in models of high versus low avidity topical OVA-peptide exposure.

      Thank you for these helpful suggestions. We did not attempt these experiments because we were concerned that, given the relatively modest expansion differences observed with the APL, resolving differences in TGFbR3 and BrdU would prove unreliable. However, this is something that we could attempt as we continue working on this project.

      The authors could further increase the novelty of the paper by exploring whether TGFBR3 is regulated at the RNA or protein level. To this end, they could perform analysis of their single-cell RNA sequencing data (Figure 1), comparing Tgfbr3 mRNA in DNFB versus VV-treated skin. 

      As discussed above, a more detailed analysis of TGFbR3 regulation is of great interest.  These experiments would likely require the creation of additional tools (e.g. a reporter mouse) to provide robust data.  However, as suggested, we have re-analyzed our scRNAseq looking for expression of Tgfbr3. Pseudobulk analysis of cells isolated from VV or DNFB sites suggests that Tgfbr3 appears to be elevated in antigen-experienced TRM at steady-state (Author response image 1).

      Author response image 1.

      Pseudobulk analysis by average gene expression of Tgfbr3 in cells isolated from either VV or DNFB treated flanks, divided by the average gene expression of Tgfbr3 in naïve CD8 T cells from the same dataset.
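      The ratio described in this legend can be written out as a short calculation. The sketch below is an illustration under stated assumptions, not the authors' analysis code: the function name `pseudobulk_ratio`, the group labels, and the expression values are hypothetical toy inputs. It averages one gene's expression over all cells in a group and divides by the average over the reference (naïve) cells.

```python
import numpy as np

def pseudobulk_ratio(expr, labels, group, reference):
    """Mean expression in `group` divided by mean expression in `reference`."""
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    group_mean = expr[labels == group].mean()
    ref_mean = expr[labels == reference].mean()
    if ref_mean == 0:
        raise ValueError("reference group has zero mean expression")
    return group_mean / ref_mean

# Toy values (hypothetical, not taken from the dataset).
expression = [4.0, 6.0, 2.0, 2.0, 1.0, 1.0]
cell_group = ["VV", "VV", "DNFB", "DNFB", "naive", "naive"]

vv_ratio = pseudobulk_ratio(expression, cell_group, "VV", "naive")
dnfb_ratio = pseudobulk_ratio(expression, cell_group, "DNFB", "naive")
```

      A ratio above 1 indicates higher average expression than in the reference cells. Because the calculation pools all cells in a group, it does not show how expression is distributed across individual cells.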

      For clarity, when discussing antigen exposure throughout the paper, it would be helpful for the authors to be more precise that they are referring to the antigen in the skin rather than in the draining lymph node. A more explicit summary of some of the lab's previous work focused on CD8+ TRM and the role of TGFb would also help readers better contextualize this work within the existing literature on which it builds. 

      We appreciate this feedback, and we have clarified this in the text.

      For rigor, it would be helpful where possible to pair flow cytometry quantification with the existing imaging data.

      Thank you for these suggestions. In terms of quantification of the number of T<sub>RM</sub> by flow cytometry, we have previously demonstrated as much as a 36-fold decrease in cell count when compared to numbers directly visualized by immunofluorescence [1]. Thus, for enumeration of T<sub>RM</sub> we rely primarily on direct IF visualization and use flow cytometry primarily for phenotyping.

      Additional controls, namely enumerating TRM in the opposite, untreated flank skin of VV-only-treated mice and the treated flank skin of DNFB-only treated mice, would help contextualize the results seen in dually-treated mice in Figure 2.

      Without a source of inflammation (e.g. VV infection or DNFB) we see very few T<sub>RM</sub> in untreated skin. A representative image is provided (Author response image 2). A single DNFB stimulation does not recruit any CD8+ T cells to the skin without prior sensitization [2].

      Author response image 2.

      Representative images of epidermal whole mounts of VV treated flank skin, and an untreated site from the same mouse isolated on day 50 post infection and stained for CD8a.

      In figure legends, we suggest clearly reporting unpaired T tests comparing relevant metrics within VV or DNFB-treated groups (for example, VV-OVA PBS vs VV-OVA FTY720 in Figure 3F).

      Thank you for this suggestion.  The figure legends have been amended.

      Finally, quantifying right and left skin draining lymph node CD8+ T cell numbers would clarify the skin specificity and cell trafficking dynamics of the authors' model. 

      We quantified the numbers of CD8 T cells in left and right skin draining lymph nodes by flow cytometry in mice at day 50 after VV infection and DNFB pull. We observe similar numbers of cells at both sites (Author response image 3).

      Author response image 3.

      Quantification of total number of CD8+ T cells in left and right inguinal lymph nodes. Each symbol represents paired data from the same individual animal, and this is representative of 3 separate experiments.

      Reviewer #1 (Recommendations for the authors): 

      (1) Figures 1D and S1C demonstrate that 80-90 % of TRM at both VV and DNFB sites express CD103+. In contrast, the sequencing data suggests the TRM at the VV site has much higher Itgae expression. Also, clusters 3 and 4, which express significantly more Itgae than all other clusters, together comprise only ~30% of CD8+ T cells at the VV-infected skin site. How can these discrepancies between transcript and protein expression be explained? 

      Thank you for these excellent comments. T<sub>RM</sub> at both VV and DNFB sites appear to express similarly high levels of CD103 protein, both in the OT-I system, as we previously published [1], and in a polyclonal system using tetramers. We attribute the lower penetrance of Itgae expression in the scRNAseq data to a lack of sensitivity, which is common with this modality. The relatively increased expression of Itgae in clusters 3 and 4 is interesting and may suggest increased Itgae production/stability. However, in the absence of any effect on protein expression, we chose not to focus on these mRNA differences.

      (2) For the experiments in Figure 3D, in order to exclude a contribution from circulating memory cells, FTY720 should have been administered during the duration of, not prior to, the initiation of the recall response. The effect of FTY720 wears off quickly, so the current experimental setting likely allows for circulating cells to enter the skin. This concern is mitigated by the results of anti-Thy1.1 mAb treatment, but documenting the experiment as in Figure D will likely be confusing to readers. 

      Thank you for this comment. We relied on the literature indicating that the half-life of FTY720 in blood is longer than 6 days [3-5]. However, on reviewing this again, there are other reports suggesting a shorter half-life. Thank you for pointing out this potential caveat. As mentioned above, we do not think this affects the interpretation of our data, as similar results were obtained with anti-Thy1.1.

      (3) Similar to what is described in the weaknesses section, the data on TGFBRIII expression is lacking. When is TGFBRIII induced? In the LN during primary activation and it is then sustained by a secondary antigen exposure at the peripheral target tissue site? Or is it only induced in the peripheral tissue, and there is interesting biology to uncover in regard to how it is induced by the TCR only after secondary exposure, etc.? 

      Thank you for these comments. As discussed above, a more detailed analysis of TGFbR3 regulation is of great interest.  These experiments would likely require the creation of additional tools (e.g. a reporter mouse) to provide robust data and are part of our future directions.

      (4) As described in the weakness section, there could be TCR-independent differences between the VV-OVA and DNFB sites that lead to phenotypic changes in the TRMs that are formed there, both CD8+ T cell-intrinsic (kinetics; with regard to time after initial priming) and extrinsic (microenvironmental differences due to the nature of the challenge, recruited cell types, cytokines, chemokines, etc.). Since the authors report the use of both VV and VV-ova, we recommend an experimental strategy that controls for this by challenging one site with VV and another with VV-OVA concomitantly, followed by repeating the key experiments reported in this manuscript. 

      As discussed above, we have previously published a very similar experiment using VV-OVA and VV-N infection on opposite flanks [1].

      (5) In Figure 6J please indicate means and provide more of the statistics comparing the groups (such as comparing VV-WT vehicle to VV-KO vehicle etc.), and potentially display on a linear scale as with all of the other figures looking at cells/mm2 to help convince the reader of the conclusions and support the secondary findings mentioned in the text such as "Notably, numbers of Tgfbr3ΔCD8 TRM in cohorts treated with vehicle remained at normal levels indicating that loss of TGFβRIII does not affect TRM epidermal residence in the steady state" despite it looking like there is a decrease when looking at the graph. 

      We appreciate the feedback on the readability of this figure; we have updated Figure 6J to use a linear scale and added additional statistics to the figure legend. The difference between Tgfbr3<sup>WT</sup> and Tgfbr3<sup>∆CD8</sup> at steady state is an excellent point, and we agree that there could be a trend towards a reduction in huNGFR+ T<sub>RM</sub> across both groups, even without CWHM12 administration. We did not see statistically significant reductions in steady-state Tgfbr3<sup>∆CD8</sup> T<sub>RM</sub>, but the slight reduction in both VV-OVA and DNFB treated flanks suggests that TGFβRIII may play a role in steady-state maintenance of all T<sub>RM</sub>. Perhaps with more sensitive tools to better visualize TGFβRIII expression, we could identify stepwise upregulation of TGFβRIII depending on TCR signal strength, possibly starting in the lymph node. We have also amended our description of this figure in the text to allow for the possibility that a low level of TGFβRIII, below the limit of detection, could play a role in steady-state maintenance of both local antigen-experienced and bystander T<sub>RM</sub>.

      Minor points: 

      (1) In describing Figure 4B, the term "doublets" for pairs of connected dividing cells is confusing. 

      Thank you for this comment. The term has been revised to “dividing cells” in the text and figure.

      (2) Figure legend 4F: BrdU is not "expressed" . 

      Very true, it has been changed to “incorporation”.

      (3) Do CreERT2 and/or huNGFR expressed by transferred OT-I cells act as foreign antigens in C57BL/6 mice, potentially causing elimination of circulating memory cells? If that were the case, this would not necessarily confound the read-out of TRM persistence studied here, since skin TRM are likely protected from at least antibody-mediated deletion and their numbers are not maintained by recruitment of circulating cells at steady-state. However, it would be useful to be aware of this potential limitation of this and similar models. 

      Thank you for raising this important technical concern. In our prior work [1] and this work, we monitored the levels of transferred OT-I cells in the blood over time. We have not observed rejection of huNGFR+ cells. We also note that others using the same system have not observed rejection either [6].

      (4) In Figure 6J, means or medians should be indicated 

      This has been updated in Figure 6J.

      (5) Using the term "antigen-experienced" to specifically refer to TRM at the VV site could be confusing, since those at the DNFB site are also Ag-experienced (in the LN draining the VV skin site). 

      We agree that it is a challenging term, as all T<sub>RM</sub> are memory cells. That is why in the text we refer to T<sub>RM</sub> isolated from the VV site as “local antigen-experienced T<sub>RM</sub>”, to distinguish them from bystanders that did not experience local antigen.

      (6) The Title essentially restates what was already reported in the authors' prior study. If the data supporting the TGFBRIII-mediated mechanism is studied in more depth, maybe adding this aspect to the title may be useful? 

      Thank you for this suggestion.  I think the current title is probably most suitable for the current manuscript but we are willing to change it should the editors support an alternative title.

      Reviewer #2 (Recommendations for the authors): 

      (1) Definition of bystander CD8+ TRM: The first paragraph of the introduction defines CD8+ TRM. To improve the clarity of this definition, we suggest being explicit that bystander TRM experience cognate antigen in the SDLNs but, in contrast to other TRM, do not experience cognate antigen in the skin. 

      Thank you, we have clarified this in the text.

      (2) Consider softening the language when comparing the efficiency of CD8+ recruitment of the skin between DNFB and VV-treated flanks. For example, substitute "equal efficiency" with "comparable efficiency" since it is difficult to directly compare the extent of inflammation between viral and hapten-based treatments. 

      We have adjusted this terminology throughout the paper.

      (3) Throughout figure legends, we appreciate the indication of the number of experimental repeats performed. We suggest, either through statistics or supplemental figures, demonstrating the degree of variability between experiments to aid readers in understanding the reproducibility of results. 

      Thank you for this suggestion.  In key figures we show data from individual mice across multiple experiments. Thus, inter-experiment variability is captured in our figures.  

      (4) Figure 1: 

      a) Add control mice treated with either vaccinia virus or DNFB and harvest back skin at day 52 to demonstrate baseline levels of polyclonal and B8R tetramer-positive CD8s in the epidermis. These controls would clarify the background CD8+ expansion that might occur in DNFB-treated mice in the absence of vaccinia virus. 

      This point was addressed above.

      b) Figure 1: It would be helpful to see the %Tet+ population specifically in the CD103+ population, recognizing that the majority of the CD8+ from the skin are CD103+. 

      We did look only at CD103+ CD8 T cells from the skin for our tetramer analysis, so this has been clarified in the figure legend.

      c) Provide a UMAP, very similar to 1H, where CD8+ T cells, vaccinia virus, and DNFB-treated flanks are overlaid.

      Thank you for this suggestion.  A UMAP combining aspects of 1G (cell types from the whole ImmgenT dataset) with 1H (our data) results in a figure that is very difficult to interpret.  Thus, we have separated cell types across the entire ImmgenT data set (e.g. CD8+ T cells) and our data into 2 separate panels.

      d) 1D: left flow plot has numbered axis while the right flow plot does not. 

      Thank you, this has been fixed.

      (5) Figure 2: 

      a) In the figure legend, define what is meant by the grey line present in Figures 2C and 2D. 

      This has been updated in the figure legend.

      b) Edit the Y axis of 2C and 2D to specify the TRM signature score. 

      This has been updated in the figure.

      c) Include panel 1D from 1S into Figure 2 to help clarify for the reader what genes are expressed in the 0 - 5 clusters.

      We appreciate the feedback, but we found the heatmap made the figure look too busy, so we feel comfortable keeping it available within supplemental figure 1.

      d) In the body of the text, explicitly discuss that the TRM module used to calculate a signature score was created using virus infection modules (HSV, LCMV and influenza), and thus some of the transcriptional similarity between the authors' vaccinia virus-treated CD8+ TRM and the TRM module might be due to viral infection rather than TRM status.

      Thank you for this comment.  We have now emphasized this point in the text.

      (6) Figure 3: 

      a) If there are leftover tissue sections, it would be optimal to show specific staining for CD103. We recognize that this data has been previously published by the lab, but it would be ideal to show it once in this paper. 

      Unfortunately, we do not have leftover tissue sections, so we are unable to measure CD103 by I.F. in these experiments.

      b) If you did collect skin draining lymph nodes in the Thy1.1 depletion model, it would be nice to see flow data showing the depletion effects in the skin draining lymph nodes in addition to the blood. 

      Unfortunately, we did not collect the skin draining lymph nodes, and do not have that data for the relevant experiments.

      c) Figure 3 F & G: Perform a T-test comparing vaccinia virus PBS to FTY720 and isotype to anti-Thy1.1 within the same treatment group. Showing no significance with these two comparisons would strengthen the authors' claims. Statistics can be described in legend. 

      We have included this analysis in the figure legend.

      (7) Figure 4: 

      a) It would be helpful to have the CD69+/CD103+ population in this model discussed/defined more. The CD69 expression seen in 4E is lower than the reviewers would've predicted, and it would be interesting to see CD103 expression as well.

      We have found that CD103 is generally a stronger marker in the skin by flow, as CD69 staining is somewhat less robust in the colors we have chosen. By way of example, we present gating we did upstream in that experiment, gated previously on live CD45+CD3+CD8+ events (Author response image 4).

      Author response image 4.

      Representative flow cytometric plots showing CD69 and CD103 expression in gated live CD45+CD8+CD90.1+ cells isolated from VV-OVA- or DNFB-treated flanks.

      (8) Figure 5: 

      a) Define APL and its purpose in both the body of text and the figure legend. 

      We have clarified this in the text and the figure legend.

      b) Using in-vivo BrdU, compare proliferation between high avidity N4 and low avidity Y3 OVA-peptide at the primary recall timepoint. 

      We considered this, but due to the lack of sensitivity of the BrdU incorporation and the relatively subtle phenotype of the Y3, we did not think the assay would be sensitive enough to identify differences.

      (9) Figure 6: 

      a) Compare TGFBR3 expression in CD8+ T cells from mice receiving high avidity N4 versus low avidity Y3 OVA-peptide at the primary recall timepoint. 

      This point was discussed above.

      b) Either 1) examine TGFBR3 mRNA expression in VV vs DNFB skin from scRNA-seq dataset or 2) perform a qPCR on epidermal CD8+ T cells from mice receiving high avidity N4 versus low avidity Y3 at the primary recall timepoint. This would help distinguish whether TGFBR3 regulation occurs at the mRNA versus protein level. 

      This point has been discussed above.

      c) Figure 6A: Not required, but it seems like the TGFBR3 gate could be shifted to the right a bit. 

      The gates were set using FMO.

      d) Figure 6C: What comparison is the asterisk indicating significance referring to?

      It is the Dunnett’s test comparing VV-OVA to DNFB and untreated skin; the figure has been amended to clarify this point.

      e) Figure 6: To increase the rigor of the claim that CWHM12 is creating a TGFb limiting condition, the authors could either 1) perform an ELISA or cell-based assay measuring active TGFb, 2) recapitulate results of 6J using monoclonal antibody against avb6 as done in Hirai et al., 2021, Immunity., or 3) examine Tgfbr3 mRNA expression in your single cell RNAseq data, comparing cluster 0 and cluster 3.

      We are pleased to have the opportunity to show Tgfbr3 mRNA, which is above in figure R1.

      (10) Material and methods: 

      Specify how the localization of the back skin used for imaging was made consistent between the right and left flanks. 

      We have updated this methodology in the text.

      Literature Cited

      (1) Hirai, T., et al., Competition for Active TGFβ Cytokine Allows for Selective Retention of Antigen-Specific Tissue- Resident Memory T Cells in the Epidermal Niche. Immunity, 2021. 54(1): p. 84-98.e5.

      (2) Manresa, M.C., Animal Models of Contact Dermatitis: 2,4-Dinitrofluorobenzene-Induced Contact Hypersensitivity, in Animal Models of Allergic Disease: Methods and Protocols, K. Nagamoto-Combs, Editor. 2021, Springer US: New York, NY. p. 87-100.

      (3) Müller, H.C., et al., The Sphingosine-1 Phosphate receptor agonist FTY720 dose dependently affected endothelial integrity in vitro and aggravated ventilator-induced lung injury in mice. Pulmonary Pharmacology & Therapeutics, 2011. 24(4): p. 377-385.

      (4) Nofer, J.-R., et al., FTY720, a Synthetic Sphingosine 1 Phosphate Analogue, Inhibits Development of Atherosclerosis in Low-Density Lipoprotein Receptor–Deficient Mice. Circulation, 2007. 115(4): p. 501-508.

      (5) Brinkmann, V., et al., Fingolimod (FTY720): discovery and development of an oral drug to treat multiple sclerosis. Nat Rev Drug Discov, 2010. 9(11): p. 883-97.

      (6) Andrews, L.P., et al., A Cre-driven allele-conditioning line to interrogate CD4+ conventional T cells. Immunity, 2021. 54(10): p. 2209-2217.e6.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This study investigates how collective navigation improvements arise in homing pigeons. Building on the Sasaki & Biro (2017) experiment on homing pigeons, the authors use simulations to test seven candidate social learning strategies of varying cognitive complexity, ranging from simple route averaging to potentially cognitively demanding selective propagation of superior routes. They show that only the simplest strategy, equal route averaging, quantitatively matches the experimental data in both route efficiency and social weighting. More complex strategies, while potentially more effective, fail to align with the observed data. The authors also introduce the concept of "effective group size," showing that the chaining design leads to a strong dilution of earlier individuals' contributions. Overall, they conclude that cognitive simplicity rather than cumulative cultural evolution explains collective route improvements in pigeons.

      Strengths:

      The manuscript addresses an important question and provides a compelling argument that a simpler hypothesis is necessary and sufficient to explain findings of a recent influential study on pigeon route improvements, via a rigorous systematic comparison of seven alternative hypotheses. The authors should be commended for their willingness to critically re-examine established interpretations. The introduction and discussion are broad and link pigeon navigation to general debates on social learning, wisdom of crowds, and CCE.

      We thank the reviewer for their positive comments.

      Weaknesses:

      The code and data for this manuscript are not available, which is a notable gap given that it critically examines and proposes alternative hypotheses for an important published work.

      We thank the reviewer for their comment. The code and data for our manuscript are an important aspect of the study, and we had intended to make them publicly available upon publication. They are available on figshare (https://doi.org/10.6084/m9.figshare.28950032.v1). We will also add this link to the Data Availability Statement in our revised version.

      Reviewer #2 (Public review):

      Summary:

      The manuscript investigates which social navigation mechanisms, with different cognitive demands, can explain experimental data collected from homing pigeons. Interestingly, the results indicate that the simplest strategy - route averaging - aligns best with the experimental data, while the most demanding strategy - selectively propagating the best route - offers no advantage. Further, the results suggest that a mixed strategy of weighted averaging may provide significant improvements.

      The manuscript addresses the important problem of identifying possible mechanisms that could explain observed animal behavior by systematically comparing different candidate models. A core aspect of the study is the calculation of collective routes from individual bird routes using different models that were hypothesized to be employed by the animals, but which differ in their cognitive demands.

      The manuscript is well-written, with high-quality figures supporting both the description of the approach taken and the presentation of results. The results should be of interest to a broad community of researchers investigating (collective) animal behavior, ranging from experiment to theory. The general approach and mathematical methods appear reasonable and show no obvious flaws. The statistical methods also appear.

      Strengths:

      The main strength of the manuscript is the systematic comparison of different meta-mechanisms for social navigation by modeling social trajectories from solitary trajectories and directly comparing them with experimental results on social navigation. The results show that the experimentally observed behavior could, in principle, arise from simple route averaging without the need to identify "knowledgeable" individuals. Another strength of the work is the establishment of a connection between social navigation behavior and the broader literature on the wisdom of crowds through the concept of effective group size.

      We thank the reviewer for their positive comments.

      Weaknesses:

      However, there are two main weaknesses that should be addressed:

      (1) The first concerns the definition of "mechanism" as used by the authors, for example, when writing "navigation mechanism." Intuitively, one might assume that what is meant is a behavioral mechanism in the sense of how behavior is generated as a dynamic process. However, here it is used at a more abstract (meta) level, referring to high-level categories such as "averaging" versus "leader-follower" dynamics. It is not used in the sense of how an individual makes decisions while moving, where the actual route followed in a social context emerges from individuals navigating while simultaneously interacting with conspecifics in space and time. In the presented work, the approach is to directly combine (global) route data of solitary birds according to the considered "meta-mechanisms" to generate social trajectories. Of course, this is not how pigeon social navigation actually works: they do not sit together before the flight and say, "This is my route, this is your route, let's combine them in this way." A mechanistic modeling approach would instead be some form of agent-based model that describes how agents move and interact in space and time. Such a "bottom-up" approach, however, has its drawbacks, including many unknown parameters and often strongly simplifying (implicit) assumptions. I do not expect the authors to conduct agent-based modeling, but at the very least, they should clearly discuss what they mean by "mechanism" and clarify that while their approach has advantages, such as naturally accounting for the statistical features of solitary routes and allowing a direct comparison of different meta-mechanisms, it is also limited, as it does not address how behavior is actually generated. For example, the approach lacks any explicit modeling of errors, uncertainty, or stochasticity more broadly (e.g., due to environmental influences). Thus, while the presented study yields some interesting results, it can only be considered an intermediate step toward understanding actual behavioral mechanisms.

      We thank the reviewer for their comment and thoughtful suggestions. We agree that the inherent behavioral mechanisms and their biological basis cannot be determined from the navigational data alone. For instance, it remains unexplored whether pigeons are adapting their behavior based only on social cues from their partners or are also using other navigational features such as landmarks or roads, the location of the sun, geomagnetic cues or previously learnt routes. However, we do agree (as also pointed out by the reviewer) that these behavioral rules generate an emergent ‘meta-mechanism’ in which the bird pairs behave as if their preferred routes are averaged during a flight. It will be important in future work to explore the biological basis of these mechanisms, but our current approach allows us to describe the mechanisms with confidence only in a meta sense. Considering this, we believe that our analysis is a more top-down approach towards describing the outcomes of these underlying mechanisms in an abstract sense. We would also like to point the reviewer to Dalmaijer, 2024 [1], who used a bottom-up approach with naive agents and showed that cumulative route improvements emerged in the absence of any sophisticated communication in the same dataset, in agreement with our approach. Considering these points, we will make changes in our revised version to clearly elaborate on what the definition of ‘mechanism’ should include, in line with the reviewer’s feedback.
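
      The "as if averaged" meta-mechanism discussed above can be illustrated with a minimal sketch. This is not the authors' code. The function names, the assumption that both routes are resampled to the same number of (x, y) points, the chain design in which each new naive bird is paired with the bird carrying the current route, and the use of the inverse participation ratio as a stand-in for "effective group size" are all hypothetical choices made for illustration.

```python
def average_routes(route_a, route_b, weight_a=0.5):
    """Pointwise weighted average of two routes.

    Assumption: both routes are lists of (x, y) points that have already
    been resampled to the same length.
    """
    if len(route_a) != len(route_b):
        raise ValueError("routes must be resampled to the same length")
    weight_b = 1.0 - weight_a
    return [
        (weight_a * xa + weight_b * xb, weight_a * ya + weight_b * yb)
        for (xa, ya), (xb, yb) in zip(route_a, route_b)
    ]


def chain_weights(n_generations, weight_prev=0.5):
    """Contribution of each individual to the final route of a chain.

    Assumption: individual 0 flies alone, and in every later generation the
    current route is averaged with the route of one new naive individual.
    """
    weights = [1.0]
    for _ in range(n_generations):
        # Everyone already in the chain is diluted; the newcomer gets the rest.
        weights = [w * weight_prev for w in weights] + [1.0 - weight_prev]
    return weights


def effective_group_size(weights):
    """Inverse participation ratio (hypothetical stand-in for the paper's measure)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


if __name__ == "__main__":
    weights = chain_weights(4)
    print(weights)
    print(effective_group_size(weights))
```

      Under these assumptions, equal averaging halves the contribution of every earlier individual at each generation, so the weights concentrate on the most recent birds and the effective group size stays well below the number of birds in the chain.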

      (2) While the presented study raises important questions about the applicability and viability of cumulative cultural evolution (CCE) in explaining certain animal behaviors such as social navigation, I find that it falls short in discussing them. What are the implications regarding the applicability of CCE to animal data and to previously claimed experimental evidence for CCE? Should these experiments be re-analyzed or critically reassessed? If not, why? What are good examples from animal behavior where CCE should not be doubted? Furthermore, what about the cited definitions and criteria of CCE? Are they potentially too restrictive? Should they be revised, and if so, how? Conversely, if the definitions become too general, is CCE still a useful concept for studying certain classes of animal behavior? I think these are some of the very important questions that could be addressed or at least raised in the discussion to initiate a broader debate within the community.

      We thank the reviewer for their comments and interesting questions regarding our study. We agree with the reviewer that our study opens up new avenues for critically analysing the criteria previous studies have used for providing evidence of CCE in non-human animals. In our literature review, we found that the field has usually been motivated to think about CCE in a ‘process’-focused manner (Reindl et al. [2]), with regard to individuals being able to compare strategies and select those resulting in higher individual fitness. This preferential selection of strategies (termed innovations) allows for the stereotypical ratcheting effect seen in CCE. In our study, we propose that in the case of homing pigeons, the ratcheting effect is more of a statistical outcome than the result of deliberate individual judgement. We believe that this strategy is also amenable to certain task types (which in our study was homing route choice) and may change for others (for example, solving a puzzle box), and that the task also needs to be sufficiently complex for animals to benefit from the use of social information (Caldwell and Millen 2008 [3]). Thus, we recommend that future work address which classes of problems would fit well within the definition of “emergent” CCE and which would not. Keeping this framework in mind, studies should clearly state what definition of CCE they are using and should be critically evaluated for their underlying task type and cognitive mechanisms before being deemed CCE. Considering these points, we will expand our discussion to highlight these key questions, which could be critical to consider in future research.

      References:

      (1) Dalmaijer ES (2024) Cumulative route improvements spontaneously emerge in artificial navigators even in the absence of sophisticated communication or thought. PLoS Biol. 22:e3002644.

      (2) Reindl, E., Gwilliams, A.L., Dean, L.G. et al. (2020) Skills and motivations underlying children’s cumulative cultural learning: case not closed. Palgrave Commun 6, 106.

      (3) Caldwell CA, Millen AE (2008) Studying cumulative cultural evolution in the laboratory. Phil. Trans. R. Soc. B 363:3529-3539.

    1. Author response:

      Description of the planned revisions

      Reviewer #1 (Evidence, reproducibility and clarity):

      Summary

      The authors focused on medaka retinal organoids to investigate the mechanism underlying eye cup morphogenesis. The authors succeeded in inducing lens formation in fish retinal organoids using 3D suspension culture with minimal growth factor-containing media containing HEPES. At day 1, Rx3:H2B-GFP+ cells appear in the surface region of organoids. At day 1.5, Prox1+ cells appear in the interface area between the organoid surface and the core of the central cell mass, which later develops into a spherical lens. Thus, Prox1+ cells cover the surface of the internal lens cell core. At day 2, foxe3:GFP+ cells appear in the Prox1+ area, where the early lens fiber marker, LFC, starts to be expressed. In addition, foxe3:GFP+ cells incorporate EdU, indicating that foxe3:GFP+ cells have lens epithelial cell character. At day 4, cry:EGFP+ cells differentiate inside the spherical lens core, whose surface area consists of LFC+ and Prox1+ cells. Furthermore, at day 4, the lens core moves towards the surface of retinal organoids to form an eye cup-like structure, although this "inside-out" morphogenetic mechanism differs from the cellular "outside-in" mechanism of eye cup formation in vivo. From these data, the authors conclude that optic cup formation, especially the positioning of the lens, is established in retinal organoids through a mechanism different from that of in vivo morphogenesis.

      Overall, the manuscript is nicely presented. However, some points concerning the underlying mechanism remain unclear. My comments are given below.

      Major comments

      (1) At the initial stage of retinal organoid morphogenesis, a spherical lens is centrally positioned inside the retinal organoids, with the central lens core covered by an outer sheet of retinal precursor cells. I wonder if the formation of this structure may be explained by differential cell adhesion or mechanical tension between lens core cells and the retinal cell sheet, as in the previous study by the Heisenberg lab on the spatial patterning of endoderm, mesoderm and ectoderm (Nat. Cell Biol. 10, 429-436 (2008)). Lens core cells may be integrated inside the retinal cell mass by cell sorting through direct interaction between retinal cells and lens cells, or between lens cells and the culture media. It is also possible that, after day 1, the lens core moves towards the surface of retinal organoids because the adhesive/tensile state of lens core cells is changed by secretion of extracellular matrix. I wonder whether the authors have measured the physical properties, such as adhesive activity and solidity, of retinal precursor cells and lens core cells. If retinal organoids at day 1 are dissociated and cultured again, do they show the same pattern of an internal lens core covered by an outer retinal cell sheet?

      The question of whether differential adhesive activity is involved in cell sorting and lens formation is indeed very intriguing. To address this point, we will include an additional experiment (see Revision Plan, experiment 1). This experiment will be based on the dissociation and re-aggregation of lens-forming organoids, as suggested by the reviewer. To monitor cell type-specific sorting, we will employ the lens progenitor reporter line Foxe3::GFP and the retina-specific Rx2::H2B-RFP. If different adhesive activities of lens and retinal progenitor cells are involved and drive the process of cell sorting, dissociation and re-aggregation will result in cell sorting based on their identity.

      (2) The optic cup is evaginated from the lateral wall of the neuroepithelium of the diencephalon. In zebrafish, cell movement occurs from the pigment epithelium to the neural retina during eye morphogenesis in an FGF-dependent manner. How is medaka optic cup morphogenesis coordinated? I also wonder whether the authors could track cell migration during optic cup morphogenesis to reveal how cell migration and cell division are regulated in the lens of the medaka retinal organoids. It would also be interesting to examine how retinal cell movement is coordinated in medaka retinal organoids.

      Looking into the details of how the optic cup-like tissue arrangement of ocular organoids is achieved at the cellular level is of course interesting. Our previous study showed that optic vesicles of medaka retinal organoids do not form optic cups (for details please see Zilova et al., 2021, eLife). We assume that the formation of the cup-like structure of the ocular organoids is mediated by the following processes. First, retina and lens domains are established at specific regions of the organoid: retina on the surface and lens in the center (see Figure S2 d and Figure 3e, and Figure 4). Then, the centrally formed lens is displaced towards the organoid periphery through the retina layer, which places the lens at the periphery while retinal cells stay static. We assume that the “cup-like” shape is acquired by extrusion of the lens from the center of the organoid. To clarify this process with respect to tissue rearrangements and cell movements, we will include additional experiments (see Revision Plan, experiment 2) and follow lens- and retina-fated cells (by employing lens-specific Foxe3::GFP and retina-specific Rx2::H2B-RFP reporter lines) through the process of lens extrusion to dissect the individual contributions of retinal and lens cells to this process (cross-reference with Reviewer #2).

      (3) The authors showed that blockade of FGF signaling affects lens fiber differentiation at day 1-2, whereas lens formation seems to be intact in the presence of the FGF receptor inhibitor at day 0-1. I suggest that the authors examine which tissue is the target of FGF signaling in retinal organoids, using markers such as pea3, which is a downstream target of the ERK branch of FGF signaling. Since FGF signaling promotes cell proliferation, is the lens core size normal in organoids treated with SU5402 from day 0 to day 1?

      Assessing the activity of FGF signaling (cross-reference to Reviewer #3) in the organoids is indeed an important point. To address which tissue is the target of FGF signaling, we will include additional experiments and assess the phosphorylation status of ERK (pERK) and the expression of the ERK downstream target pea3, as suggested by the reviewer (see Revision Plan, experiment 3). This will allow us to identify the tissue within the organoid that responds to FGF signaling.

      Lens core size of organoids treated with SU5402 from day 0 to day 1 is fully comparable to the control (please see Figure 6b).

      (4) Fig. 3f and 3g indicate that there is a cell population located between foxe3:GFP+ cells and rx2:H2B-RFP+ cells. What cell type occupies the interface area between foxe3:GFP+ cells and rx2:H2B-RFP+ cells?

      That is certainly an interesting question. We are aware of this population of cells, but we currently do not have data that would clarify the fate of those cells with certainty. We are following up on this question using scRNA sequencing; however, we will not be able to address it in the current manuscript.

      (5) Fig. 5e indicates the depth of Rx3 expression at day 1. Is the depth the thickness of the Rx3-expressing cell sheet that covers the central lens core in the organoids? If so, I wonder whether the total number of cells in the Rx3-expressing sheet differs with the number of seeded cells, because the thickness is the same across seeded-cell numbers, but the surface area may differ depending on the size of the underlying lens core. Please clarify this point.

      Yes. Figure 5e indicates the thickness of the cell sheet expressing Rx3 that lies on the surface of the organoid. Indeed, the number of Rx3-expressing cells (and lens cells) scales with the size of the organoid as stated in the submitted manuscript.

      (6) Noggin application inhibits lens formation at day 0-1. BMP signaling regulates formation of lens placode and olfactory placode at the early stage of development. It is interesting to examine whether Noggin-treated organoid expands olfactory placode area. Please check forebrain territory markers.

      What tissue differentiates at the expense of the lens in BMP inhibitor-treated organoids is of course an intriguing question. To address the identity of cells differentiated under this condition, we will include an additional experiment, as suggested by the reviewer (see Revision Plan, experiment 4). We will check for the expression of Lhx2, Otx2 and HuC/D to address this point.

      I have no minor comments

      Referees cross-commenting

      I agree that all reviewers have similar suggestions, which are reasonable and provided the same estimated time for revision.

      Reviewer #1 (Significance):

      Strength:

      This study is unique. The authors examined eye cup morphogenesis using fish retinal organoids. The eye cup normally consists of the lens, the neural retina, pigment epithelium and optic stalk. However, retinal organoids seem to be simple and consist of two cell types, lens and retina. Interestingly, a similar optic cup-like structure is achieved in both cases; however, the underlying mechanism is different. It is interesting to investigate how eye morphogenesis is regulated in retinal organoids, under the unconstrained embryo-free environment.

      Limitation:

      The description is adequate, but the analysis lacks depth. More molecular- and cellular-level analysis is needed, such as tracking of cell movement and visualization of FGF signaling in organoid tissues.

      Advancement:

      The current study is descriptive. It needs a conceptual advance that would impact the cell biology field or medical science.

      Audience:

      The target audience of the current study is still within the ophthalmology and neuroscience communities, perhaps translational/clinical rather than basic biology. To reach beyond these specific fields, the study needs to formulate a general principle for cell and developmental biology.

      Reviewer #2 (Evidence, reproducibility and clarity):

      In this study from Stahl et al., the authors demonstrate that medaka pluripotent embryonic cells can self-organise into eye organoids containing both retina and lens tissues. While these organoids can self-organize into an eye structure that resembles the vertebrate eye, they are built from a fundamentally different morphogenetic process – an “inside-out” mechanism where the lens forms centrally and moves outward, rather than the normal “outside-in” embryonic process. This is a very interesting discovery, both for our understanding of developmental biology and the potential for tissue engineering applications. The study would benefit from some additional experiments and a few clarifications.

      The authors suggest that the lens cells are the ones that move from the central to a more superficial position. Is this an active movement of lens cells or just the passive consequence of the retina cells acquiring a cup shape? Are the retina cells migrating behind the lens or the lens cells pushing outwards? High-resolution imaging of organoid cup formation, tracking retina cells in combination with membrane labeling of all cells would help elucidate the morphogenetic processes occurring in the organoids. Membrane labeling would also be useful as Prox1 positive lens cells appear elongated in embryos while in the organoids, cell shapes seem less organised, less compact and not elongated (for example as shown in Fig 3f,g).

      Looking into the details of how the optic cup-like tissue arrangement of ocular organoids is achieved at the cellular level is of course interesting. We assume that the formation of the cup-like structures of the ocular organoids is mediated by the following processes. First, retina and lens domains are established at specific regions of the organoid: retina on the surface and lens in the center (see Figure S2 d and Figure 3e, and Figure 4). Then, the centrally formed lens is displaced towards the organoid periphery through the retina layer, which places the lens at the periphery while retinal cells stay static. We assume that the “cup-like” shape is acquired by extrusion of the lens. To clarify this process with respect to tissue rearrangements and cell movements, we will include additional experiments (see Revision Plan, experiment 2). We will follow lens- and retina-fated cells (by employing lens-specific Foxe3::GFP and retina-specific Rx2::H2B-RFP reporter lines) through the process of lens extrusion to dissect the individual contributions of retinal and lens cells to this process (cross-reference with Reviewer #1).

      The organoids could be a useful tool to address how cell fate is linked to cell shape acquisition. In the forming organoids, retinal tissue initially forms on the outside, while non-retinal tissue is located in the centre; this central tissue later expresses lens markers. Do the authors have any insights into why fate acquisition occurs in this pattern? Is there a difference in proliferation rates between the centrally located cells and the external ones? Could it be that highly proliferative cells give rise to neural retina (NR), while lower proliferating cells become lens?

      The question of how the retinal and lens domains are established in this specific manner is indeed intriguing and very interesting. We dedicated a part of the discussion to this topic. We discuss the role of the diffusion limit and the potential contribution of BMP and FGF signaling to this arrangement. Additional experiments (see Revision Plan, experiment 3) addressing the source and target tissues of FGF and BMP signaling in the organoid will bring more clarity to our understanding of the tissue arrangements in the organoid.

      Although an analysis of proliferation at the surface and in the central region of the organoid might show some differences in proliferation rates between lens and retinal cells, we do not have any indication that the proliferation rate itself would be instructive for, or take precedence over, the cell fate decisions.

      What happens in organoids that do not form lenses? Do these organoids still generate foxe3 positive cells that fail to develop into a proper lens structure? And in the absence of lens formation, does the retina still acquire a cup shape?

      Lens formation is primarily dependent on acquisition/specification of Foxe3-expressing lens placode progenitors. If those are not present, a lens does not develop. Once Foxe3-expressing progenitors are established, a lens is formed in unperturbed conditions (as measured by the expression of crystallin proteins). In such conditions, organoids that do not have a lens do not carry Foxe3-expressing cells.

      In the absence of the lens, the organoid is composed of retinal neuroepithelium that does not form an optic cup (for details of such phenotypes please see Zilova et al., 2021, eLife).

      The authors suggest that lens formation occurs even in the absence of Matrigel. Is the process slower in these conditions? Are the resulting organoids smaller? While there are indeed some LFC-expressing cells by day 2, these cells are not very well organised and the pattern of expression seems dotty. Moreover, LFC staining seems to localise posterior to the LFC-negative, lens-like structure (e.g. Fig. S1, 3 o'clock).

      How do these organoids develop beyond day 4? Do they maintain their structural integrity at later stages?

      The role of HEPES in promoting organoid formation is intriguing. Do the authors have any insights into why it is important in this context? Have the authors tried other culture conditions and does culture condition influence the morphogenetic pathways occurring within the organoids?

      We thank the reviewer for pointing this out. We were not clear in the wording and description of our observation. Indeed, Matrigel is not required for acquisition of lens fate, which can be demonstrated by the expression of lens-specific markers. However, the presence of Matrigel has a profound impact on the structural aspects of organoid formation. Matrigel is essential for the organization of retinal-committed cells into the retinal epithelium (Zilova et al., 2021, eLife). The absence of a structured retinal epithelium can indeed negatively impact the cellular organization and the overall lens structure. To clarify the contribution of Matrigel to the speed of organoid lens development and to the overall structure of the organoid lens, we will perform additional experiments (see Revision Plan, experiment 5). Using the Foxe3::GFP reporter line, we will measure the onset of lens-specific gene expression. In addition, we will use immunohistochemistry to assess the gross morphology and size of the organoids grown without Matrigel (cross-reference with Reviewer #3).

      The role of the HEPES in lens formation is indeed very intriguing and currently under investigation. As HEPES is mainly used to regulate pH of the culture media and pH might have an impact on multiple cellular processes, it will require significant time investment to dissect molecular mechanism underlying the effect of HEPES on the process of lens formation (cross reference with Reviewer #3) and therefore cannot be addressed in the current manuscript.

      Referees cross-commenting

      Pleased to see that all the other reviewers are positive about the study and raise similar concerns and comments

      Reviewer #2 (Significance):

      This is a very interesting paper, and it will be important to determine whether this alternative morphogenetic process is specific to medaka or if similar developmental routes can be recapitulated in organoid cultures from other vertebrate species.

      Reviewer #3 (Evidence, reproducibility and clarity):

      Summary:

      The manuscript by Stahl and colleagues reports an approach to generate ocular organoids composed of retinal and lens structures, derived from Medaka blastula cells. The authors present a comprehensive characterisation of the timeline followed by lens and retinal progenitors, showing these have distinct origins, and that they recapitulate the expression of differentiation markers found in vivo. Despite this molecular recapitulation, morphogenesis is strikingly different, with lens progenitors arising at the centre of the organoid, and subsequently translocating to the outside.

      Comments:

      - The manuscript presents a beautiful set of high quality images showing expression of lens differentiation markers over time in the organoids. The set of experiments is very robust, with high numbers of organoids analysed and reproducible data. The mechanism by which lens specification is promoted in these organoids is, however, poorly analysed, and the reader does not get a clear understanding of what is different in these experiments, as compared to previous attempts, to support lens differentiation. There is a mention of HEPES supplementation, but no further analysis is provided, and the fact that the process is independent of ECM contradicts, as the authors point out, previous reports. The manuscript would benefit from a more detailed analysis of the mechanisms that lead to lens differentiation in this setting.

      The role of HEPES in lens formation is indeed very intriguing and currently under investigation. As HEPES is mainly used to regulate the pH of the culture media, and pH might affect multiple cellular processes, dissecting the molecular mechanism underlying the effect of HEPES on lens formation will require a significant time investment (cross-reference with Reviewer #2) and therefore unfortunately cannot be addressed in the current manuscript.

      To clarify the contribution of Matrigel to organoid lens development, we will perform additional experiments (see Revision Plan, experiment 5). Using the Foxe3::GFP reporter line, we will measure the onset of lens-specific gene expression. In addition, we will use immunohistochemistry to assess the gross morphology and size of the organoids grown without Matrigel (cross-reference with Reviewer #2).

      - The markers analysed to show onset of lens differentiation in the organoids seem to start being expressed, in vivo, when the lens placode starts invaginating. An analysis of earlier stages is not presented. This would be very informative, allowing one to determine whether progenitors differentiate as placode and neuroepithelium first, to subsequently continue differentiating into lens and retina, respectively. Could early placodal and anterior neural plate markers be analysed in the organoids? This would provide a more complete sequence of lens vs retina differentiation in this model.

      Yes. The figures show the expression of lens and retinal markers in the embryo at later developmental stages, and the timing of their expression can be documented with higher temporal resolution. In the revised version of the manuscript, we will provide information about the onset of expression of Rx3::H2B-GFP (retina) and Foxe3::GFP (lens) (see Author response image 1). Rx3 is one of the earliest markers labeling the presumptive eye field within the region of the anterior neural plate (S16, late gastrula). Foxe3::GFP expression can be detected within the head surface ectoderm before the lens placode is formed, showing that Foxe3 is a suitable marker of placodal progenitors in medaka.

      We are convinced that the onset of Rx3 and Foxe3-driven reporters is early enough to make the claim about the separate origin of the lens (placodal) and retinal (anterior neuroectoderm) tissues within the ocular organoids.

      Author response image 1.

      - The analysis of BMP and Fgf requirement for lens formation and differentiation is suggestive, but the source of these signals is not resolved or mentioned in the manuscript. Are BMP4 and Fgf8 expressed by the organoids? Where are they coming from?

      Indeed, addressing the source of BMP and FGF activation would bring more clarity in understanding the mechanism of retina/lens specification within the ocular organoids (cross reference with Reviewer #1). To address this point, we will include additional experiments (see Revision Plan, experiment 3). We will analyze the expression of respective ligands (Bmp4 and Fgf8) and activation of downstream effectors of BMP and FGF signaling pathways within the ocular organoids as suggested by Reviewer #1 and Reviewer #3.

      - The fact that the lens becomes specified in the centre of the organoid is striking, but it is for me difficult to visualise how it ends up being extruded from the organoid. Did the authors try to follow this process in movies? I understand that this may be technically challenging, but it would certainly help to understand the process that leads to the final organisation of retinal and lens tissues in the organoid. There is no discussion of why the morphogenetic mechanism is so different from the in vivo situation. The manuscript would benefit from explicitly discussing this.

      Following the extruding lens in vivo is indeed a very relevant suggestion. To clarify the process of ocular organoid formation with respect to tissue rearrangements and cell movements, we will include an additional experiment (see Revision Plan, experiment 2). We will follow lens- and retina-fated cells (by employing the lens-specific Foxe3::GFP and retina-specific Rx2::H2B-RFP reporter lines) through the process of lens extrusion (cross-reference with Reviewer #1 and Reviewer #2).

      Referees cross-commenting

      We all seem to have similar comments and concerns. I think overall the suggestions are feasible and realistic for the timeframe provided.

      Reviewer #3 (Significance):

      This study describes a reproducible approach to differentiate ocular organoids composed of lens and retinal tissues. The characterisation of lens differentiation in this model is very detailed, and despite the morphogenetic differences, the molecular mechanisms show many similarities to the in vivo situation. The manuscript however does not highlight, in my opinion, why this model may be relevant. Clearly articulating this relevance, particularly in the discussion, will enhance the study and provide more clarity to the readers regarding the significance of the study for the field of organoid research, ocular research and regenerative studies.

      Revision Plan:

      (1) To address whether differential adhesion properties of retinal and lens progenitors mediate cell sorting to establish retina and lens domains in the organoids (Reviewer #1, comment 1), we will dissociate the organoids on day 1 and subsequently re-aggregate them. This experiment will allow us to follow the cell type-specific adhesion properties of lens and retinal progenitor cells. We will employ the lens progenitor reporter line Foxe3::GFP and the retina-specific Rx2::H2B-RFP to monitor cell type-specific sorting with fluorescence microscopy.

      (2) Multiple reviewers (Reviewer #1, Reviewer #2, Reviewer #3) asked for a detailed in vivo imaging experiment showing the individual contributions of retina- and lens-fated cells to the resulting tissue organization within the ocular organoid. To address this point, we will perform an in vivo live imaging experiment to follow the movements of individual lens (Foxe3::GFP) and retinal (Rx2::H2B-GFP) cells from day 1 to day 2 of organoid development.

      (3) Reviewer #1 and Reviewer #3 raised questions concerning the role of FGF and BMP signaling and the sources of these signaling pathway activities in ocular organoid tissue arrangement. To address this point and shed more light on the molecular mechanisms regulating lens and retina tissue arrangement in the organoid, we will perform an additional experiment. We will assess the expression of candidate FGF and BMP ligands (Fgf8, Bmp7 and Bmp4), the activation of downstream effectors (p-ERK, p-SMAD) and the direct transcriptional target of Fgf signaling (Pea3) in the developing organoids. This will allow us to identify the tissue producing the ligand on the one hand and the tissue responding to the signal on the other, and will help to narrow down the molecular mechanism controlling tissue arrangements in the organoid.

      (4) We will analyze the expression of forebrain territory markers in organoids treated with the BMP inhibitor to determine the identity of the tissue that differentiates at the expense of lens under BMP inhibition (suggested by Reviewer #1). To address this point, we will label Noggin-treated organoids with antibodies against Lhx2, Otx2 and HuC/D.

      (5) We will provide a more comprehensive analysis of the organoids grown without Matrigel and compare them to the organoids grown in the presence of Matrigel (mentioned by Reviewer #2 and Reviewer #3). Using the lens progenitor-specific Foxe3::GFP reporter line, we will measure the onset of lens-specific gene expression. In addition, we will use immunohistochemistry to assess the gross morphology and size of the organoids grown without Matrigel.

      Description of analyses that authors prefer not to carry out

      Reviewer #1:

      (4) Fig. 3f and 3g indicate that there is some cell population located between foxe3:GFP+ cells and rx2:H2B-RFP+ cells. What kind of cell-type is occupied in the interface area between foxe3:GFP+ cells and rx2:H2B-RFP+ cells?

      That is certainly an interesting question. We are aware of this population of cells. We currently do not have data that would clarify the fate of those cells with certainty. We are following up on this question using scRNA sequencing; however, we will not be able to address it in the current manuscript.

      Reviewer #2:

      The role of HEPES in promoting organoid formation is intriguing. Do the authors have any insights into why it is important in this context? Have the authors tried other culture conditions and does culture condition influence the morphogenetic pathways occurring within the organoids?

      The role of HEPES in lens formation is indeed very intriguing and currently under investigation. As HEPES is mainly used to regulate the pH of the culture media, and pH might affect multiple cellular processes, dissecting the molecular mechanism underlying the effect of HEPES on lens formation will require a significant time investment (cross-reference with Reviewer #3) and cannot be addressed in the current manuscript.

      Is there a difference in proliferation rates between the centrally located cells and the external ones? Could it be that highly proliferative cells give rise to neural retina (NR), while lower proliferating cells become lens?

      Although an analysis of the proliferation rate of cells at the surface and in the central region of the organoid might show some differences between lens and retinal cells, we do not have any indication that the proliferation rate itself would be instructive for, or take precedence over, the cell fate decisions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript by Lopez-Blanch and colleagues, 21 microexons are selected for a deep analysis of their impacts on behavior, development, and gene expression. The authors begin with a systematic analysis of microexon inclusion and conservation in zebrafish and use these data to select 21 microexons for further study. The behavioral, transcriptomic, and morphological data presented are for the most part convincing. Furthermore, the discussion of the potential explanations for the subtle impacts of individual microexon deletions versus loss-of-function in srrm3 and/or srrm4 is quite comprehensive and thoughtful. One major weakness: data presentation, methods, and jargon at times affect readability / might lead to overstated conclusions. However, overall this manuscript is well-written, easy to follow, and the results are of broad interest.

      We thank the Reviewer for their positive comments on our manuscript. In the revised version, we will try to improve readability, reduce jargon and avoid overstatements.  

      Strengths:

      (1) The study uses a wide variety of techniques to assess the impacts of microexon deletion, ranging from assays of protein function to regulation of behavior and development.

      (2) The authors provide comprehensive analyses of the molecular impact of their microexon deletions, including examining how host-gene and paralog expression is affected.

      Weaknesses:

      Major Points:

      (1) According to the methods, it seems that srrm3 social behavior is tested by pairing a 3mpf srrm3 mutant with a 30dpf srrm3 het. Is this correct? The methods seem to indicate that this decision was made to account for a slower growth rate of homozygous srrm3 mutant fish. However, the difference in age is potentially a major confound that could impact the way that srrm3 mutants interact with hets and the way that srrm3 mutants interact with one another (lower spread for the ratio of neighbour in front value, higher distance to neighbour value). This reviewer suggests testing het-het behavior at 3 months to provide age-matched comparisons for del-del, testing age-matched rather than size-matched het-del behavior, and also suggests mentioning this in the main text / within the figure itself so that readers are aware of the potential confound.

      Thank you for bringing up this point. For the tests shown in Figure 5, we indeed decided to match the pairs involving srrm3 mutant fish by fish size, since we reasoned this would be more comparable to the other lines, both biologically and methodologically (in terms of video tracking, etc.). However, we are confident the results would be very similar if matched by age, since the differences in social interactions between the srrm3 homozygous mutants and their control siblings are very dramatic at any age. As an example, in line with the Reviewer's suggestion, this can be appreciated in Videos S2 and S3, which show groups of five 5 mpf fish that are either srrm3 mutant or wild type. The behavior of 5 mpf WT fish (Video S3) is very similar to that of 1 mpf WT fish pairs, with very small interindividual distances, while the difference with respect to the srrm3 mutant group (Video S2) is dramatic. We nonetheless agree that this decision on the experimental design should be clearly stated in the main text and figure legend, and we have done so in the revised version.

      (2) Referring to srrm3+/+; srrm4-/- controls for double mutant behavior as "WT for simplicity" is somewhat misleading. Why do the authors not refer to these as srrm4 single mutants?

      This comment applies to Figure 4 as well as the associated figure supplements. We reasoned that this made the understanding of plots easier, but the Reviewer is correct that it can be misleading. As a middle ground, we have now changed Figure 4 to follow the nomenclature of Figure 3D (WD, HD, DD), which is further explained in the legend, but kept the original format in the figure supplements for consistency with the (many) other plots in those figures.

      (3) It's not completely clear how "neurally regulated" microexons are defined / how they are different from "neural microexons"? Are these terms interchangeable?

      Yes, they are interchangeable. We have now double checked the wording to avoid confusion and for consistency.

      (4) Overexpression experiments driving srrm3 / srrm4 in HEK293 cells are not described in the methods.

      We apologize for this omission. We now briefly describe the data and associated methods in more detail in the revised version; however, please note that the data were obtained from a previous publication (Torres-Mendez et al, 2019), where the detailed methodology is reported.

      (5) Suggest including more information on how neurite length was calculated. In representative images, it appears difficult to determine which neurites arise from which soma, as they cross extensively. How was this addressed in the quantification?

      We have added further details to the revised version. With regards to the specific question, we would like to mention that this has not been a very common issue for the time points used in the manuscript (10 hap and 24 hap). At those stages, it was nearly always evident how to track each individual neurite. Dubious cases were simply ignored and not measured, as we aimed for 100 neurites per well. Of course, such complex cases become much more common at later time points (48 and 72 hap), which were not used in this study.

      Reviewer #2 (Public review):

      Summary:

      This manuscript explores in zebrafish the impact of genetic manipulation of individual microexons and two regulators of microexon inclusion (Srrm3 and Srrm4). The authors compare molecular, anatomical, and behavioral phenotypes in larvae and juvenile fish. The authors test the hypothesis that phenotypes resulting from Srrm3 and 4 mutations might in part be attributable to individual microexon deletions in target genes.

      The authors uncover substantial alterations in in vitro neurite growth, locomotion, and social behavior in Srrm mutants but not any of the individual microexon deletion mutants. The individual mutations are accompanied by broader transcript level changes which may resemble compensatory changes. Ultimately, the authors conclude that the severe Srrm3/4 phenotypes result from additive and/or synergistic effects due to the de-regulation of multiple microexons.

      Strengths:

      The work is carefully planned, well-described, and beautifully displayed in clear, intuitive figures. The overall scope is extensive with a large number of individual mutant strains examined. The analysis bridges from molecular to anatomical and behavioral read-outs. Analysis appears rigorous and most conclusions are well-supported by the data.

      Overall, addressing the function of microexons in an in vivo system is an important and timely question.

      Weaknesses:

      The main weakness of the work is the interpretation of the social behavior phenotypes in the Srrm mutants. It is difficult to conclude that the mutations indeed impact social behavior rather than sensory processing and/or vision which precipitates apparent social alterations as a secondary consequence. Interpreting the phenotypes as "autism-like" is not supported by the data presented.

      The Reviewer is absolutely right. It was not our intention to imply that these social defects should be interpreted simply as autistic-like. It is indeed very likely that the main reason for the social alterations displayed by the srrm3 mutants is their impaired vision. We have now added this discussion point explicitly in the revised version. 

      Reviewer #3 (Public review):

      Summary:

      Microexons are highly conserved alternative splice variants, the individual functions of which have thus far remained mostly elusive. The inclusion of microexons in mature mRNAs increases during development, specifically in neural tissues, and is regulated by SRRM proteins. Investigation of individual microexon function is a vital avenue of research since microexon inclusion is disrupted in diseases like autism. This study provides one of the first rigorous screens (using zebrafish larvae) of the functions of individual microexons in neurodevelopment and behavioural control. The authors precisely excise 21 microexons from the genome of zebrafish using CRISPR-Cas9 and assay the downstream impacts on neurite outgrowth, larvae motility, and sociality. A small number of mild phenotypes were observed, which contrasts with the more dramatic phenotypes observed when microexon master regulators SRRM3/4 are disrupted. Importantly, this study attempts to address the reasons why mild/few phenotypes are observed and identify transcriptomic changes in microexon mutants that suggest potential compensatory gene regulatory mechanisms.

      Strengths:

      (1) The manuscript is well written with excellent presentation of the data in the figures.

      (2) The experimental design is rigorous and explained in sufficient detail.

      (3) The identification of a potential microexon compensatory mechanism by transcriptional alterations represents a valued attempt to begin to explain complex genetic interactions.

      (4) Overall this is a study with a robust experimental design that addresses a gap in knowledge of the role of microexons in neurodevelopment.

      Thank you very much for your positive comments on our manuscript.

      Reviewer #1 (Recommendations for the authors):

      Minor Suggestions

      (1) Axes are often scaled differently even between panels in the same figure. For example in Figure 5 - supplement 10, the srrm3_17 y axis scales from 0-20, while the neighboring panels scale from ~1-2.5. This somewhat underrepresents the finding that srrm3 mutants have much larger inter-individual distances. Similarly, in the panel above (src_1), the y-axis is scaled to include a single point around 17cm. As a result, it appears at first glance that the src_1 trials resulted in much lower inter-individual distance. Suggest scaling all of these the same to improve readability.

      While the Reviewer is certainly correct, after careful consideration we decided to keep autoscaled axes, prioritizing visualization within plots (i.e. among genotypes within an experiment) over comparison across plots (i.e. among experiments and lines).

      (2) Attention to italicizing gene names.

      Thanks.

      (3) In many points in the methods, we are instructed to "see below." Suggest directing the reader to a particular section heading.

      We found only one such instance, and we directed the reader to the specific section, as suggested.

      (4) In Methods, remove "in the corpus callosum." This is not an accurate descriptor for the site at which Mauthner axons cross.

      This is absolutely correct, apologies for this mistake.

      Clarify:

      (1) In the results section, "tissue-specific regulation was validated..." - suggest mentioning that this was performed in adult tissues / describe dissection in the methods.

      Added.

      (2) In the results section, the meaning of "no event ortholog" is not clear. Does this mean that a microexon does not have a human homolog? If so, suggest stating more clearly.

      Correct. We have added additional information.

      (3) In the results, the authors state that 78% of microexons are affected by srrm3/4 loss-of-function. Suggest stating the method used here (e.g. RNA-seq in mutants as compared to siblings)

      Added.

      (4) It is not clear what "siblings for the main founders" means, for example in 3D. Is this effectively the analysis of microexon knockouts across multiple independent lines? Are the lines pooled for stats, for example in 3C?

      The main founder corresponds to the one listed as _1 and is the default for experiments in which only one founder is used. We now explicitly state this.

      For 3C, the lines are not pooled for stats; the stats correspond only to the main founder for each line. However, for each main founder line, multiple experiments are usually analyzed together and the stats are done taking their data structure into account (i.e. not simply pooling the values).

      (5) The purpose and a general description of NanoBRET assays should be included in the results.

      We added the main purpose of the NanoBRET assays (testing protein-protein interactions).

      (6) Specify that baseline behavior is analyzed in the light.

      Added.

      (7) In Figure 4A, adult fish are schematized being placed into a 96-well plate. Suggest using the larval diagram as in Figure 6 for accuracy.

      Done.

      (8) In Figure 4, plot titles could be made more accessible, especially in 4 F. Suggest removing extraneous information / italicizing gene names, etc. In G, suggest writing out Baseline, Dark, and Light to make it more accessible. Same in 4B.

      We have implemented some of the suggestions. In particular, italics were not used, since we are referring to the founder line, not the gene.

      (9) Figure 6 legend B - after (barplots), suggest inserting the word "and", to make clear that barplots indicate host gene *and* closely related paralogs are indicated by dots.

      Done.

      (10) In methods: "To better capture all microexons..." This sentence is difficult to understand. Suggested edit: "we excluded *from our calculation?* tissues with known or expected partial overlap... from comparison (for example, ...).

      Done.

      (11) In the methods, "which were defined with similar parameters but -min_rep 2." Suggest spelling this out, e.g. "with similar parameters, but requiring sufficient read coverage in at least n=2 samples per valid tissue group, whereas we only required one.".

      Done.
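
      The coverage requirement discussed in point (11) can be illustrated with a minimal sketch. It assumes that an event is retained only if every tissue group has at least `min_rep` samples with sufficient read coverage; the function name, the data layout and the example values are hypothetical and are not taken from the manuscript or from any specific tool.

```python
def passes_min_rep(coverage_by_group, min_rep=1):
    """Return True if every tissue group has at least `min_rep`
    samples flagged as having sufficient read coverage.

    coverage_by_group: dict mapping a tissue group name to a list of
    booleans, one per sample (hypothetical layout for illustration).
    """
    return all(sum(flags) >= min_rep for flags in coverage_by_group.values())


# Hypothetical coverage flags for one event across two tissue groups.
event = {"neural": [True, True, False], "muscle": [True, False]}

print(passes_min_rep(event, min_rep=1))  # one covered sample per group suffices
print(passes_min_rep(event, min_rep=2))  # stricter: two covered samples per group
```

      Under these assumptions, the example event passes with `min_rep=1` but fails with `min_rep=2`, because the muscle group has only one sufficiently covered sample.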

      (12) RNA was extracted for event and knockout validations. What does event mean here?

      Event refers to the validation of the exon regulatory pattern in WT tissues. We added this information.

      Provide definitions for abbreviations:

      (1) (Figure 6) Delta corrected VST Expression.

      Done.

      (2) "Mic-hosting genes" paralogs.

      Done.

      (3) In Figure 1F, "emic" is not defined.

      Done.

      Misspellings:

      All corrected.

      (1) Figure 6B (percentile is spelled percentil).

      (2) Figure 6B legend (bottom or top decile*).

      (3) Figure 6D - Schizophrenia* genes.

      (4) In Zebrafish husbandry and genotyping: suggest "srrm3 mutants grew more slowly.".

      (5) In results, "reduced body size at 90pdf" > 90dpf.

      Reviewer #2 (Recommendations for the authors):

      (1) Characterization of microexon mutants (Figure 2): The semi-quantitative PCR with flanking primers (Figure 2, supplement1) is well-suited to assess successful deletion of the exon and enables detection of potential mis-splicing around the alternative segment. However, it does not quantify the impact on total transcript levels. The authors should complement those experiments with qPCR measures of the transcript levels - otherwise, it is difficult to link mutant phenotypes to isoforms (as opposed to alterations in the level of gene expression). This point is somewhat addressed in Figure 6 by the RNA Seq analysis but it might help to add data specifically in Figure 2.

      As the Reviewer says, this point is explicitly addressed in Figure 6, where we show the change in the host gene's expression that follows the removal of some microexons. We prefer to keep this in Figure 6, for consistency, as we believe this is not a direct (regulatory) consequence of the removal, but more likely a compensation effect.

      (2) Social behavior alterations in juvenile fish: The authors report "increased leadership" in Srrm3 mutant fish. However, these fish have impaired vision. Thus, "increased leadership" may simply reflect the fact that they do not perceive their conspecifics and, thus, do not follow them. The heterozygous conspecific will then mostly follow the Srrm3 mutant which appears as the mutant exhibiting an increase in leadership. Figure 5D suggests that Srrm3 del and het fish have the same ratio of "neighbor in front" which would be consistent with the hypothesis that the change in this metric is a consequence of a loss of following behavior due to a loss of vision. The authors should either adjust the discussion of this point or assess with additional experiments whether this is indeed a "social phenotype" or rather a secondary consequence of a loss of vision.

      The Reviewer is absolutely correct, and we have thus modified the short discussion directly related to these patterns.

      (3) The discussion centers on potential reasons why only mild phenotypes are observed in the single microexon mutants. One caveat of the phenotypic analysis provided in the manuscript is that it does not very deeply explore the phenotypic space of neuronal morphologies or circuit function. The behavioral and anatomical read-outs are rather coarse. There are no experiments exploring fine-structure of neuronal projections in vivo or synapse number, morphology, or function. Moreover, no attempts are made to explore which cell types normally express the microexons to potentially focus the loss-of-function analysis to these specific cell types. Of course, such analysis would substantially expand the scope of a study that already covers a large number of mutant alleles. However, the authors may want to add a discussion of these limitations in the manuscript.

      The Reviewer is correct. We aimed at covering this when referring to "(i) we may not be assessing the traits that these microexons are impacting, (ii) we may not have the sensitivity to robustly measure the magnitude of the changes caused by microexon removal". We have now added some of the specific points raised by the Reviewer as examples.

      (4) Note typos in Figure 6D: "schizoFrenia", "WNT signIalling"

      Done.

      Reviewer #3 (Recommendations for the authors):

      I only have a few minor suggestions for the authors.

      (1) It is interesting that a not insignificant number of microexon deletions (3/21) result in cryptic inclusions of intron fragments, and perhaps alludes to an as yet unreported molecular function of microexons in the regulation of host gene expression. Is it possible that microexon inclusion in these 3 genes could be important for expression? I think this requires some further discussion, as (if I'm not mistaken) microexons have thus far only been hypothesised to act as modulators of protein function, not as gene regulatory units.

      While we see that microexon removal can impact expression of the host gene (Figure 6), this is likely a compensatory mechanism (or so we suggest). We do not think these three cases are related to a putative physiological regulation, since the cryptic exons appear only in the deletion line. On the contrary, we think these are "regulatory artifacts" that originate in the non-WT mutated context. That is, we removed the exon, but some splicing signals remained in the intron; these are then recognized by the spliceosome, which incorrectly includes a different piece of the intron.

      (2) The flow of the text accompanying the molecular investigation of microexon function for evi5b and vav in Figure 3 could be improved. The text currently fades out with a speculative explanation for the lack of evi5b interaction phenotype. This final sentence could be moved to the discussion and replaced with a more general summary of the data.

      We have now swapped the order in which these results are described and leave out the discussion about evi5b's microexon function.

      (3) Is this a co-submission with Calhoun et al? If so, both papers should reference each other in the discussion and discuss the relative contributions of each.

      Done

      (4) "1 × 104 cells" in methods Nanobret paragraph should be superscript.

      Done

    1. Cyrus conquered Babylon bloodlessly and became a sort of patron of the Jews. This relationship may have enhanced the influence of Cyrus' religion, Zoroastrianism, on the development of Jewish monotheism, as we will discuss shortly. Cyrus also planned and began building infrastructure like the Royal Road.

      Cyrus is such a fascinating leader! He conquered Babylon without bloodshed, supported the Jews, and even started building amazing projects like the Royal Road. It’s wild to think how his actions might have even influenced the development of Jewish monotheism!

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      Reviewer #1

      Evidence, reproducibility and clarity

      SUMMARY

      In this study, Fernandes and colleagues addressed the question of the role of micro-RNAs in regulating the coupling between organ growth and developmental timing. Using Drosophila, they identified the conserved micro-RNA miR-184 as a regulator of the developmental transition between juvenile larval stages and metamorphosis. This transition is under the control of the steroid hormone Ecdysone, and has been shown to be modulated in cases of abnormal tissue growth to adjust the duration of larval growth in response to developmental perturbations. The relaxin-like hormone Dilp8 has been identified as a key secreted factor involved in this coupling. Here, the authors show that miR-184 is involved in the regulation of Dilp8 expression both in physiological conditions and upon growth perturbation. They propose that this function is carried out in imaginal tissues, where miR-184 levels are modulated by tissue stress. While several factors have already been implicated in triggering sharp dilp8 induction at the transcriptional level, this study adds another level of complexity to the regulation of Dilp8 by proposing that its expression is fine-tuned post-transcriptionally through repression by miR-184.

      MAJOR COMMENTS

      Overall, the manuscript is well organized, and the logic of the experimental plan is well presented. The results are clear, and I appreciate the quality of the pupariation curves. However, I believe that two main conclusions of the paper are not fully supported by the results presented in the figures: the direct regulation of the dilp8 3'UTR by miR-184, and the specificity of this regulation in imaginal discs. I develop these two aspects in more detail below.

      Comment 1) The strategy of the 3'UTR sensor is not fully optimized. Indeed, in most experiments, qRT-PCR is used to assess dilp8 expression levels, although it reflects both transcriptional and post-transcriptional regulation. Importantly, to show that post-transcriptional regulation is involved in the response to tissue damage, the levels of the 3'UTR sensor should be analyzed in discs expressing RAcs (showing at the same time that the response is cell-autonomous in the discs). The expected upregulation of the sensor should be prevented by simultaneous expression of miR-184. This approach would shed light on the relative contribution of transcriptional versus post-transcriptional regulation of dilp8 in response to growth perturbation.

      Response: We thank the reviewer for this comment. We agree that qRT-PCR does not distinguish between transcriptional and post-transcriptional changes in dilp8 levels in response to changes in miR-184 levels and tissue damage. In addition to the qRT-PCR data, we have looked at the dilp8-3’UTR-GFP reporter in response to overexpression of miR-184 in the wing disc using the patched-GAL4 driver, which shows downregulation of the GFP reporter in the ptc domain (Fig 4C-D’). This suggests that dilp8 mRNA is a direct target of miR-184, regulated post-transcriptionally through its 3’UTR. Further, to confirm the specificity of the effect of miR-184 on the dilp8-3’UTR, we generated a dilp8-3’UTR mutant in which the single target site for miR-184 was mutated. We show that the mutated dilp8-3’UTR reporter doesn’t show any regulation in response to miR-184 overexpression in the ptc domain of the wing disc (Fig. 4E, E’, F, F’). This experiment confirms the specificity of the dilp8-3’UTR regulation by miR-184.

      As suggested by the reviewer, we analysed dilp8-3’UTR-GFP reporter expression while overexpressing RicinA using the ptcGAL4 driver in the wing imaginal disc (Fig. S6F-G’). We observed a slight but consistent increase in dilp8-3’UTR-GFP reporter expression, indicating post-transcriptional regulation of dilp8 expression in response to tissue damage. However, the increase in reporter GFP levels observed in this experiment in response to tissue damage (Fig. S6F-G’) is milder than expected based on the qRT-PCR results (Fig S6A and B). We have added this new data to the manuscript (Fig. S6F-G’).

      We propose the following reasons to explain this result:

      a) both transcriptional and post-transcriptional regulation of dilp8 mRNA contribute to the response to developmental perturbations, whereas the reporter captures only the post-transcriptional part

      b) the data on 3’UTR reporter GFP is specifically from the ptc domain expression of RicinA, whereas for dilp8 transcript levels we have expressed RicinA in all larval imaginal tissues, or in the entire wing imaginal disc, which could be one of the reasons for the stronger effect seen on dilp8 mRNA levels

      c) we are not certain that the tubulin-promoter-driven dilp8-3’UTR GFP reporter reflects post-transcriptional regulation of dilp8 by miR-184 as efficiently as qRT-PCR does. Because the tubulin promoter drives the reporter-GFP-3’UTR at very high levels, a majority of this reporter-GFP mRNA may not be relieved from degradation by the moderate reduction of miR-184 in response to RicinA overexpression.

      Thus, our experiments suggest that dilp8 levels are regulated post-transcriptionally by miR-184, which contributes to pupariation delays in response to tissue damage. In support of this, we could rescue pupariation delays and dilp8 induction caused by RicinA expression by overexpressing miR-184 (Figs 5B, C). Thus, we confirm that post-transcriptional regulation by miR-184 during developmental perturbations also contributes to dilp8 induction and pupariation delays. Unfortunately, due to experimental limitations we could not perform simultaneous expression of RicinA and miR-184 to evaluate the rescue of dilp8-3’UTR-GFP sensor expression. The levels of the dilp8-3’UTR sensor GFP are reduced efficiently by miR-184 overexpression (Fig 4D), which prevented us from attempting the rescue of the moderate increase of dilp8-3’UTR GFP levels in response to RicinA.

      Comment 2) In my opinion, the use of a 3'UTR sensor is not sufficient to conclude that the regulation by miR-184 is direct, as miR-184 could also regulate an intermediate factor that acts on dilp8 post-transcriptional regulation. To solve this issue, a common strategy is to generate a 3'UTR sensor with mutated binding sites that should abolish the regulation by miR-184. This mutated 3'UTR might also respond differently to tissue damage, which would strongly support the conclusions of the study.

      Response: We couldn’t agree more with the reviewer; this comment is addressed in the response to comment 1. We have confirmed the specificity of regulation of the dilp8-3’UTR by miR-184 using a target-site-mutated dilp8-3’UTR (new figures added to the manuscript, Fig. 4E, E’, F, F’). We tested whether the changes in dilp8 mRNA levels in response to tissue damage are post-transcriptional and mediated by miR-184. We observe a slight but consistent increase of dilp8-3’UTR GFP reporter levels in the ptc domain of the wing disc in response to RicinA expression, suggesting a role for miR-184-mediated post-transcriptional regulation of dilp8. However, we have not yet tested the mutated dilp8-3’UTR GFP reporter in response to tissue damage.

      Comment 3) Concerning the tissue-specific regulation of Dilp8 by miR-184, these results need to be strengthened. Indeed, this comes mostly from phenotypes observed with rn-GAL4. Although this is a classical tool for driving expression in imaginal discs, rn-GAL4 also drives strong expression in other tissues that could contribute to triggering a delay, such as the CNS and part of the gut (proventriculus). In our hands, some growth phenotypes in the wing obtained with rn-GAL4 could be fully reverted by blocking GAL4 in the CNS, indicating that the phenotype was not wing-specific. Importantly, miR-184 seems to be highly expressed in the CNS according to FlyBase, reinforcing the possibility that it plays a role in this organ. Here I propose approaches to confirm that miR-184-mediated regulation of dilp8 and developmental timing indeed occurs in the discs:

      - Another driver with less secondary expression sites could be used (pdmR11F02-GAL4), or rn-GAL4 could be combined with an elav-GAL80 to prevent expression in most neurons.

      - The authors could identify the source of Dilp8 upregulation in miR-184 mutants using tissue-specific qRT-PCR instead of whole larvae expression like in Fig 4A-B.

      - This tissue-specific upregulation could be functionally tested using a rescue experiment, in which the delay observed in miR-184 mutants could be rescued by disc-specific downregulation of Dilp8 (using pdm2-GAL4 for instance).

      Response: We are thankful to the reviewer, and agree that it is important to show that the effects we see using rn-GAL4 are specific to imaginal discs and not due to an effect in the CNS. We tested this by expressing the miR-184 sponge in the CNS. Though miR-184 is highly expressed in the larval CNS, pan-neuronal downregulation of miR-184 using elav-GAL4 had no effect on pupariation timing. We have added this as supplementary data (Figure S4). Therefore, we believe that the miR-184 downregulation phenotype in the rn-GAL4 background can be mainly attributed to its role in the imaginal discs. In addition, as suggested by the reviewer, we have also demonstrated that downregulation of miR-184 in the imaginal discs using the rn-GAL4 driver leads to an increase in dilp8 expression (Fig S5B), confirming that dilp8 mRNA is enhanced in the imaginal discs by blocking miR-184.

      OPTIONAL: Because it is known that dilp8 is strongly regulated at the transcriptional level, the relative input from post-transcriptional upregulation is an important question arising from this study. Although it might be a more long-term approach, I believe that generating a Dilp8 mutant lacking its 3'UTR or, even better, with mutated miR-184 binding sites, would shed light on the role of this regulation for the response to growth perturbation and/or developmental stability (fluctuating asymmetry).

      Response: We thank the reviewer for the suggestion. This would have been an interesting experiment to carry out especially in the context of fluctuating asymmetry.

      MINOR COMMENTS

      1. I think that a number of results could be moved to SI as they are either controls, or reproduce published data without bringing novelty. For instance, results in Fig 5A-D are similar to data published by Sanchez et al, as stated in the text. Fig6A as well.

      Response: We thank the reviewer for this suggestion. Fig. 5A-D and F have been moved to Fig. S6A-E. We have also moved data from Fig. 6 to Fig. 5; as a result, Fig 6 A-D has become Fig. 5 B-D.

      2. Fig 6D is quite mysterious, as it suggests that basal JNK activation regulates miR-184, which is different from a context of tissue damage. I think that this result could be removed. Alternatively, if the authors want to dig in that direction, more experiments should be provided, such as bskDN expression in an RAcs context and the effects on miR-184 levels and the 3'UTR sensor (since transcript levels are already published).

      Response: We would like to clarify that our experiments suggest that endogenous JNK signalling negatively regulates miR-184, as blocking basal JNK signalling using bskDN increased the levels of miR-184 (changed to Fig 5D). Enhanced JNK signalling has been reported to be involved in tissue damage responses, and we propose that the RicinA-mediated increase in JNK signalling leads to the reduction of miR-184 (changed to Fig 5A, S6D-E). However, we do not make this claim strongly, as we did not co-express RicinA and bskDN to show that JNK signalling is responsible for the drop in miR-184 levels in response to tissue damage. We thank the reviewer for seeking this explanation; we have rewritten the results section to improve clarity.

      3. The references related to Dilp8 should be checked more in detail in the intro and discussion. About Dilp8 and developmental stability: remove the ref to Colombani et al 2012, instead put Boone et al 2016 and add Blanco-Obregon et al 2022 (in addition to Garelli et al 2012 who initially identified this phenotype). About Lgr3 as the receptor for Dilp8: add Colombani et al, Current Biology 2015, and cite here Vallejo et al 2015, Garelli et al 2015. Among the important transcriptional regulators of Dilp8, Xrp1 could be mentioned (Boulan et al 2019, Destefanis et al 2022) as it plays a complementary function to JNK depending on the type of tissue stress.

      Response: We are really sorry for the glaring errors in citing appropriate references. We thank the reviewer for correcting this for us. We have made the necessary changes to the text.

      Significance

      GENERAL ASSESSMENT This study provides convincing data showing that the conserved microRNA miR-184 plays a role in regulating developmental timing in Drosophila through modulating the levels of Dilp8, a key factor in the coupling between tissue growth and developmental transitions. The results are convincing, but the general conclusions of the paper need to be strengthened regarding the direct regulation of dilp8 by miR-184 and the tissue-specificity of this interaction.

      ADVANCE Dilp8 is a key factor that modulates growth and timing in response to developmental perturbations and contributes to developmental precision in physiological conditions. As such, its regulation has been studied by different groups in the last decade, leading to the identification of several inputs for its transcriptional regulation. Here, the authors uncover a post-transcriptional regulation by miR-184, adding another level of regulation of Dilp8 that contributes to ensuring proper regulation of developmental timing, and opening the possibility that miR-184 might play similar roles in other species.

      AUDIENCE This study is of interest for researchers in the field of basic science, with a focus on developmental timing, tissue damage and biological function of microRNAs.

      REVIEWER EXPERTISE Drosophila, growth control, developmental timing, Dilp8.

      Reviewer #2

      Evidence, reproducibility and clarity

      Drosophila has helped to characterize the mechanisms that coordinate tissue growth with developmental timing. The insulin/relaxin-like peptide Dilp8 has been identified as a key factor that communicates the abnormal growth status of larval imaginal discs to neuroendocrine neurons responsible for regulating the timing of metamorphosis. Dilp8, derived from imaginal discs, targets four Lgr3-positive neurons in the central nervous system, activating cyclic-AMP signaling in an Lgr3-dependent manner. This signaling pathway reduces the production of the molting hormone, ecdysone, delaying the onset of metamorphosis. Simultaneously, the growth rates of healthy imaginal tissues slow down, enabling the development of proportionate individuals.

      In this manuscript "miR-184 modulates dilp8 to control developmental timing during normal growth conditions and in response to developmental perturbations" by Dr. Varghese and colleagues, the authors identify a new post transcriptional regulator of Dilp8. The authors show that miR-184 plays a pivotal role in tissue damage responses by inducing dilp8 expression, which in turn delays pupariation to allow sufficient time for damage repair mechanisms to take effect.

      Major points:

      Comment 1) In most of the experiments for percentage of pupariation, the 50% pupariation in control is around 110 hours AED in figures 1, 2 and 3. In figures 5 and 6 using the UAS Ricin, the controls are more around 90 hours AED. Why this discrepancy?

      Response: We thank the reviewer for asking for this clarification. The former experiments (Figs 1-3) were carried out at 25 °C, while the latter experiments with a cold-sensitive version of RicinA (UAS-RAcs), Figs 5 and 6 (now changed to Figs. 5 and S6 as suggested by reviewer #1), were carried out at 29 °C (permissive temperature). This difference in temperature led to the difference in pupariation timing. We apologise for not having mentioned this in the text; we have now corrected the methods section to indicate this clearly.

      Comment 2) What is the mechanism behind the expression of miR-184 in stress conditions? Is miR-184 also implicated in other conditions giving rise to a developmental delay (X-rays irradiation or animal bearing rasV12, scrib-/- tumors)?

      Response: We thank the reviewer for these questions.

      a) In response to developmental perturbations by RicinA, we believe that activation of JNK signalling controls miR-184 expression. We propose this because our experiments show that imaginal disc damage leads to enhancement of JNK signalling and an increase in dilp8 mRNA levels (as reported earlier by Colombani et al 2012; Sánchez et al 2019), and a simultaneous reduction of miR-184 (Figs. S6A, D, E). We have also performed new experiments to show that in response to RicinA expression in the wing disc there is a moderate increase in dilp8-3’UTR-GFP sensor expression (Figs. S6F-G’), indicating post-transcriptional regulation of dilp8 expression in response to tissue stress. We also show that RicinA-induced dilp8 expression and pupariation delay can be rescued by increasing miR-184 levels (Fig 5B and C), suggesting that the reduction of miR-184 in response to tissue damage contributes to the damage responses. In a separate experiment we show that blocking the endogenous JNK pathway by the expression of bskDN enhances miR-184 levels, suggesting that miR-184 is under the regulation of JNK signalling (Fig 5D). Hence, we speculate that during tissue stress, activation of JNK signalling leads to a reduction of miR-184 levels, which contributes to regulating the levels of dilp8 post-transcriptionally and results in pupariation delays. The text has been modified to explain this better.

      b) In a previous paper by Shu et al., 2017 (https://doi.org/10.18632/oncotarget.22226), decreased expression of miR-184 was observed in a lglRNAi; RasV12 tumor background. Apart from this, various studies have shown that dilp8 levels increase in response to tumour, radiation stress, apoptosis, and tissue damage (Yeom et al 2021, Ray et al 2019, Demay et al 2014, Katsuyama et al 2015, Colombani et al 2012, Garelli et al 2012). Whether the regulation of dilp8 by miR-184 occurs in these backgrounds is yet to be tested. We have now discussed this possibility in the manuscript.

      Comment 3) dilp8 mutant animals have also been shown to be more resistant to starvation or desiccation (https://doi.org/10.3389/fendo.2020.00461). Is miR-184 implicated in this response?

      Response: We thank the reviewer for this question. In our earlier experiments, miR-184 was shown to be regulated by nutrition in the larval stages, and lack of miR-184 led to enhanced larval death in response to diet restriction (Fernandes et al., 2022). miR-184 was also shown to play a role in the insulin producing cells (IPCs) in regulating lifespan (Fernandes & Varghese, 2022). In the current work, we propose that miR-184 acts upstream of dilp8 in response to stress stimuli. Hence, it is possible that miR-184 is involved in responses to starvation and desiccation stress in adult female flies, by regulating dilp8 levels post-transcriptionally. However, whether the miR-184 regulation of dilp8 plays a role in resistance to starvation or desiccation in adult females has not yet been tested, as this was not within the scope of the current study. We have now added this reference in the discussion section.

      Comment 4) dilp8 expression has been also shown to be regulated by Xrp1 in response to ribosome stress (https://doi.org/10.1016/j.devcel.2019.03.016). This paper should be included in the manuscript. Is it possible that the expression levels of miR184 are regulated by Xrp1?

      Response: We thank the reviewer for the suggestion and have incorporated the reference into the paper. During ribosome stress in the larval imaginal discs, the stress-response transcription factor Xrp1 acts through dilp8 in regulating systemic growth. We agree with the reviewer that expression of miR-184 could be regulated by Xrp1. We have not yet explored this possibility, and have now added it to the discussion section.

      Minor points:

      1. Does the overexpression of miR184 induce an increased fluctuating asymmetry?

      Response: We thank the reviewer for asking this question. The role of dilp8 in fluctuating asymmetry is only observed in the dilp8 hypomorphic mutant background. To replicate this, we would have to overexpress miR-184 in either the whole larvae or the wing discs. Unfortunately, overexpression of miR-184 in the wing discs (using rnGAL4) leads to pupal lethality, whereas overexpression of miR-184 in the whole larvae leads to embryonic lethality. We were therefore not able to conclude from our experiments whether miR-184 overexpression induces increased fluctuating asymmetry.

      2. There are 2 references Colombani et al. (2012 for Dilp8 and 2015 for Lgr3). Can you double check that they are used accordingly?

      Response: We thank the reviewer for pointing these errors out and we have incorporated these changes into the paper.

      Significance

      Altogether, the paper presents compelling lines of evidence supporting the proposed model. The experiments are well designed and are convincing. The paper is interesting and relevant for a broad audience.

      Reviewer #3

      Evidence, reproducibility and clarity (Required):

      This is an interesting study demonstrating an interaction between miR-184 and the Drosophila insulin-like peptide 8 (dilp8) in the tissue damage response. The authors show that Dilp8 activity is negatively regulated by miR-184, apparently through direct interaction between miR-184 and the dilp8-3'UTR, which leads to lower dilp8 mRNA transcript levels, via an undetermined mechanism, supposedly its degradation? Furthermore, the authors show that during aberrant tissue growth, miR-184 levels are very slightly downregulated (see comment below), and based on other experiments, imply causation of this with the increased dilp8 mRNA levels that occur in these tissues, again via an unclear mechanism: upregulation or stabilization of dilp8 mRNA. The authors present evidence that the JNK pathway, which had been known to be critical for dilp8 mRNA upregulation upon tissue damage, does so via miR-184.

      Major Comments:

      Comment 1: The data showing the direct regulation of dilp8-3'UTR by miR-184 are not very strong and would require more controls to strengthen the claim, as described below.

      Response: We have performed new experiments to validate that dilp8-3’UTR is regulated by miR-184. Please see the detailed responses to comments 10-12 below.

      Comment 2: The miR-184 effects are also very small (less than 2-fold reduction with tissue damage; or less than 2-fold induction with JNK-pathway inhibition via bskDN). These two points are the weakest part of the manuscript and model.

      Response: We agree with the reviewers on this point. The reduction in miR-184 levels in response to RicinA expression is modest (25–30%), and the induction of miR-184 in response to bskDN expression is less than two-fold (Figs. 5A and D). In contrast, dilp8 transcript levels increase several-fold in response to RicinA expression (Fig. 5C, S6A and B). Since we measure dilp8 transcript levels by qPCR, we detect both transcriptional and post-transcriptional contributions to dilp8 regulation. In addition, we have performed a new experiment to check the post-transcriptional regulation of dilp8 in response to tissue damage. Though the change in the dilp8-3′UTR GFP reporter upon RicinA expression in the ptc domain of the wing disc is mild (Figs. S6F-G’), it strongly suggests a post-transcriptional effect of the reduction of miR-184 levels on dilp8. Hence, we propose that tissue damage induces strong transcriptional activation of dilp8, while the reduction of miR-184, despite its smaller magnitude, contributes to dilp8 upregulation via post-transcriptional regulation. In support of this, our experiments demonstrate direct regulation of the dilp8-3′UTR by miR-184 (Figs. 4C-F’), and show strong dilp8 mRNA upregulation in miR-184-deficient conditions (Fig. 4A and B), suggesting a role for miR-184 in maintaining dilp8 levels. We also show that RicinA-induced effects on dilp8 and pupariation delay are reversed by co-expression of miR-184 (Fig. 5C). We do not claim that regulation by miR-184 is the sole mechanism driving dilp8 induction during tissue damage, but suggest that miR-184-mediated post-transcriptional regulation acts in a complementary manner to transcriptional responses. Furthermore, we believe that the mild effect of JNK signaling on miR-184 (as shown by the bskDN experiment) is sufficient for the moderate reduction of miR-184 in response to tissue damage.

      Comment 3: Regarding the expression levels, it does not help that the authors show bar graphs with standard errors of the mean instead of the actual data points to allow reliable appreciation of the data dispersion.

      Response: We have modified our figures and have performed statistical analysis according to the suggestions of the reviewers, please see responses to comments 1-9, and 13-19.

      Comment 4: It is difficult to understand how minute changes in miR-184 levels can lead to over an order of magnitude differences (in some cases) in dilp8 mRNA levels considering that it is a stoichiometric relationship. Maybe ?miR-184-Dicer1? complexes are highly stable and re-used for multiple dilp8 transcripts - the authors could discuss how they understand this occurring in their manuscript.

      Along the same lines, the discussion is also rather weak regarding the mechanism of control of dilp8 mRNA levels by miR-184. Please discuss, e.g., the evidence for mRNA degradation induction by microRNAs with this UTR binding profile (imperfect UTR binding, Fig S4) and, if appropriate, how other possible regulatory models (direct and indirect) could explain the findings.

      Response: We accept the reviewer's comment that a 25-30% reduction of miR-184 is low in comparison to the many-fold increase in dilp8 levels. We believe that both post-transcriptional and transcriptional changes are responsible for the induction of dilp8 in response to tissue damage. However, our experiments suggest a role for post-transcriptional regulation by miR-184, as the pupariation delay is rescued by miR-184 overexpression (please also see the response to the previous comment). We are not ruling out transcriptional regulation of dilp8 mRNA; rather, we are suggesting that both transcriptional and post-transcriptional means are responsible for changes in dilp8. Moreover, we have not performed absolute measurements of miR-184 in the imaginal discs (what we show is a comparison between control and RicinA expression), hence we do not have an exact estimate of how many miR-184 molecules are lost, or of whether that number is roughly equal to or greater than the number of dilp8 mRNA molecules gained; likewise, when measuring dilp8 mRNA we do not determine exactly how many dilp8 molecules are gained. As the reviewer suggests, it is possible that miR-184-RISC is stable enough to act on multiple dilp8 molecules one after the other, so the relationship between miR-184 and dilp8 need not be 1:1. We have included this in the manuscript. It is also known that imperfect 3’UTR binding, as seen for most animal microRNAs, leads to translational repression and mRNA deadenylation, which eventually results in mRNA degradation.
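
      The size argument in this exchange can be made concrete with a minimal steady-state sketch. The model below is an illustrative assumption and is not taken from the manuscript: it treats dilp8 mRNA (m) as produced at a rate α and degraded with first-order kinetics, with miR-184-RISC at level R adding a decay term κR on top of a miR-184-independent decay δ0. Tissue damage is represented by hypothetical factors a (fold change in transcription) and r (fraction of miR-184 remaining).

```latex
% Hypothetical first-order model; all symbols are illustrative assumptions.
\begin{align}
\frac{dm}{dt} &= \alpha - (\delta_0 + \kappa R)\, m \\
m^{*} &= \frac{\alpha}{\delta_0 + \kappa R} \\
\frac{m^{*}_{\mathrm{damage}}}{m^{*}_{\mathrm{control}}}
  &= a \cdot \frac{\delta_0 + \kappa R}{\delta_0 + \kappa r R}
  \;\le\; \frac{a}{r}, \qquad 0 < r \le 1
\end{align}
```

      Under these assumptions, a 25–30% reduction of miR-184 (r ≈ 0.7) raises dilp8 by at most about 1/0.7 ≈ 1.4-fold through the post-transcriptional term, so a several-fold increase would have to come largely from the transcriptional factor a. A non-linear dependence on miR-184 would change this bound; the sketch only illustrates why the two contributions are described as complementary.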

      Comment 5: We suggest the authors carefully revise their citations to cite appropriate work that supports the claims, and also to avoid missing the seminal studies that report the claims they cite.

      Response: We sincerely apologize for the errors in citing the key references. We are grateful to the reviewers for correcting this for us. We have made changes to the text to include and correct the references.

      We have the suggestions below which we hope will help the authors improve their manuscript. If the authors address these points raised above, we believe the manuscript should be a valuable contribution to the field, and help in the understanding of how tissues respond to growth aberrations and the regulation of transcript levels by microRNAs.

      Detailed Comments:

      Comment 1. Results 1st paragraph: please describe the screen in more detail. As written, one only discovers it was a miRNA loss-of-function screen when reading the legend of Table S1. Please show the original data of the screen - with dispersion if possible.

      Response: We thank the reviewers for these suggestions, we have now included the data from the screen with SEM, and p-values.

      Comment 2. Results 1st paragraph, Fourth line, "While several miRNAs caused delays in pupariation by 12 hours or more..". Please correct, as actually loss of miRNAs caused delays.

      Response: We thank the reviewer for pointing out this error, we have corrected the text accordingly.

      Comment 3. Results (Figure 1) - It says that data from three independent experiments are shown. However there is no dispersion in the data. Could the authors please explain this? Are the results of the three experiments summed and presented as one? or is this one of the three?

      Response: We thank the reviewers for these suggestions and have plotted data with the SEM values.

      Comment 4. It is reported in the legend of Figure S2 that LogRank test was performed to determine statistical significance. However, no statistical data is presented. Please show the results.

      Response: We thank the reviewers for these suggestions to improve the data presentation; we have incorporated the p-value as suggested.

      Comment 5. Fig2A and B. Please show the data points in the bar graphs (as in Figure 2C), or choose another data representation. Please consider redoing statistical analysis with a simple t-test. It is not clear to me why ANOVA was used to compare two samples. Please state that data are normalized also to control (tub-GAL4>UAS-scramble). Please state the hours post-hatching from which the RNA samples were collected (as in Fig 2C for 20HE quantification).

      Response: We thank the reviewers for these suggestions to improve the data presentation; we have incorporated all changes as suggested. Similar changes have been incorporated in the rest of the figures of the manuscript as well. Hours post-hatching information for each figure is now added to the figure legends.
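
      The statistical point raised in Comment 5 can be illustrated with a minimal sketch: for exactly two groups, a one-way ANOVA and an equal-variance two-sample t-test are equivalent (F = t²), so the t-test is the simpler way to report the comparison. The expression values below are hypothetical and are not data from the manuscript; the helper names (pooled_t, one_way_anova_f, sem) are likewise introduced only for illustration.

```python
from math import isclose, sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Student's two-sample t statistic, assuming equal variances."""
    na, nb = len(a), len(b)
    # Pooled sample variance across both groups.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


def one_way_anova_f(*groups):
    """One-way ANOVA F statistic for any number of groups."""
    values = [x for g in groups for x in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def sem(values):
    """Standard error of the mean (sample SD divided by sqrt(n))."""
    return sqrt(variance(values) / len(values))


# Hypothetical normalized expression values (NOT from the manuscript).
control = [1.00, 0.92, 1.08, 1.03]
knockdown = [1.61, 1.74, 1.52, 1.80]

t_stat = pooled_t(control, knockdown)
f_stat = one_way_anova_f(control, knockdown)

# With two groups, the ANOVA F statistic equals the squared t statistic.
assert isclose(t_stat ** 2, f_stat, rel_tol=1e-9)
print(f"t = {t_stat:.3f}, F = {f_stat:.3f}")
print(f"SEM control = {sem(control):.3f}, SEM knockdown = {sem(knockdown):.3f}")
```

      Because the SEM shrinks with sample size, it says little about the spread of individual replicates, which is why plotting the individual data points, as requested, is more informative than bars with SEM alone.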

      Comment 6. Fig2C. Fig legend states the bar graphs are "absolute values". Please specify if the bar represents the average, median or something else.

      Response: We thank the reviewer for pointing this out. We have made the suggested changes.

      Comment 7. Throughout the manuscript: please use GAL4 in capital letters or at least standardize it throughout the ms. Currently there are GAL4s and Gal4s.. eg compare Fig 2 and 3 legends.

      Response: We thank the reviewer for pointing this out. We have incorporated all changes as recommended.

      Comment 8. FigS3A and B. Please revise as Fig2A and B above. and apply the same criteria in the respective figure legend.

      Response: We thank the reviewer for pointing this out. We have made the changes as recommended.

      Comment 9. Fig. 4 - please indicate on the figures what is whole larvae and what is wing imaginal discs. This will facilitate understanding of the figure.

      Response: We thank the reviewers for these suggestions and have included this information in all the figures.

      Comment 10. Fig 4 - Data - Authors do not show that rn-GAL4>miR-184-sponge causes up regulation of dilp8 mRNA levels, hence the model is weakened. Doing this experiment would significantly strengthen the study whatever the result is.

      Response: We thank the reviewer for pointing this out and we have included this in the manuscript (Fig S5B).

      Comment 11. The dilp8-3'UTR experiment is weak especially because its generation is not sufficiently well described in the manuscript. "The dilp8 3'UTR-GFP reporter line was created as described in (Vargheese & Cohen, 2007)" is not sufficient. Please describe the construct generation in sufficient detail so that the experiments can be reproduced by others.

      Response: We thank the reviewer for pointing this out. We have elaborated in the methods section on how we generated the dilp8 3'UTR-GFP reporter and dilp8 3'UTR mutant GFP reporter lines. The plasmid was originally created in Steve Cohen’s lab at EMBL by modifying the pCasper4 plasmid to introduce a tubulin promoter, EGFP and a multiple cloning site, which allows one to clone 3’UTRs of target genes into this plasmid. NotI and XhoI sites were used to clone the dilp8-3’UTR and mut-3’UTR. We hope this explains our strategy sufficiently.

      Comment 12. Making assumptions, if the construct is as described in Vargheese & Cohen, 2007 and contains all of the dilp8 3'UTR - it should be a Tubulin-driven GFP gene with a dilp8-3'UTR "Tub-GFP-(dilp8 3'UTR)". In this case the authors need to rule out the alternative interpretation of the result in Fig. 4D by showing that the expression of miR-184 does not down regulate Tub-GFP expression itself. The best scenario would be to have a mutated dilp8 3'UTR for the miR-184 recognition site. This experiment would significantly strengthen the study and model.

      Response: We thank the reviewer for pointing this out. We agree with the reviewers that this experiment is needed to prove direct regulation of the dilp8-3’UTR by miR-184. We have mutated the sequences complementary to the seed region of miR-184 in the dilp8-3’UTR, and demonstrated that overexpression of miR-184 does not regulate the mutated tub-GFP-(dilp8 3'UTR) expression. This confirms that the dilp8 gene is a direct target of miR-184. This data is added to the manuscript as Figs 4E-F’.
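
      To make the logic of the seed-site mutation concrete, the following is a minimal sketch that locates the stretch of a 3'UTR complementary to miRNA nucleotides 2-8 and replaces it so that pairing is lost. The sequences are toy strings invented for illustration and are not the real miR-184 or dilp8 3'UTR sequences; swapping each base for its complement is only one possible mutagenesis choice and is an assumption here, not the authors' stated design.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_match_site(mirna):
    """Sequence in the target mRNA that pairs with miRNA nucleotides 2-8."""
    seed = mirna[1:8]
    # The target site is the reverse complement of the seed.
    return "".join(COMPLEMENT[base] for base in reversed(seed))

def mutate_site(utr, site):
    """Replace each occurrence of the site with its base-wise complement."""
    mutated = "".join(COMPLEMENT[base] for base in site)
    return utr.replace(site, mutated)

# Toy sequences for illustration only; NOT the real miR-184 or dilp8 3'UTR.
toy_mirna = "UAGCAUCGGAUCCGAUUAGCAU"
site = seed_match_site(toy_mirna)
toy_utr = "GGCAUUA" + site + "CCAUUG"
toy_mutant_utr = mutate_site(toy_utr, site)

print("site present in wild-type UTR:", site in toy_utr)
print("site present in mutant UTR:   ", site in toy_mutant_utr)
```

      A reporter carrying the mutated site should no longer respond to the miRNA if the regulation is direct, which is the comparison the response describes.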

      Comment 13. Figure 4C-D please separate dilp8 from 3'UTR with a space or hyphen.

      Response: We thank the reviewer for pointing this out and have separated dilp8 from 3’UTR with a hyphen.

      Comment 14. Figure 4E. Please name the dilp8 allele as MI00727 as it is not a KO, but rather a hypomorphic mutation (fully WT dilp8 transcripts are still generated, albeit at a much lower level).

      Response: We thank the reviewer for pointing this out and we have made the necessary changes.

      Comment 15. Figure 6D: please add UAS to bskDN/+. All figures have rn-GAL4 alone or with UAS-GFP as control. This finding would be strengthened with this other control, especially because the size effect is small. This being said a general comment for all experiments is that hemi-controls are generally missing for all figures. eg, in Fig 3. One would typically include controls such as A. Phm>+ and +>miR.184; B. aug21>+ and +>miR.184; C. ptth>+ and +>miR.184; D. rn>+ and +>miR.184

      Response: We thank the reviewer for pointing this out. We have added UAS to bskDN (now Fig 5D) and have also added the rn-GAL4/+ control. We have also performed various hemi-control experiments, as suggested by the reviewer, to the best of our capabilities. We have added a separate graph with the hemi-controls as Reviewer Response Figure 1.

      Comment 16. Figure 7: Are IPCs necessary for the model? If not, I suggest removing them and placing the Lgr3 neuron cell bodies much more anterior in this scheme. Their cell bodies are as anterior and rostral as it gets, approximately where the IPCs are depicted in this type of view of the CNS.

      Response: We thank the reviewer for pointing this out and have removed IPCs from the figure. This figure is now labelled as Fig. 6.

      Comment 17. Table S1- It would be preferable to see the data of these experiments, but if the authors prefer to show this data in a table, please at least add the dispersion analyses (eg standard deviation.. OR median+-quartiles OR Confidence intervals..), N of animals analysed, and statistics against controls.

      Response: We thank the reviewer for pointing this out. We have added the number of larvae analysed, SEM values and statistics against the control condition.

      Comment 18. In all figures with pupariation time: please also indicate significant findings in the graphs (with an asterisk, for instance) and adjust figure legends accordingly. This could facilitate understanding the data.

      Response: Thanks for the suggestion. We have incorporated this information into figure legends.

      Comment 19. Please revise Figure legends for punctuation.

      Response: We have rectified all the errors in punctuation. We thank the reviewers for suggesting this.

      Comment 20.

      a) Abstract:

      Line 10: What is the evidence to call Dilp8 a "paracrine" factor?

      Response: We thank the reviewer for pointing this out. We have changed the text to ‘secreted factor’.

      b) Introduction:

      4th paragraph, 3rd sentence "Dilp8... buffers developmental noise and delays pupariation..." Buffering of developmental noise was first shown in Garelli et al., Science 2012, so this publication should be cited. 4th paragraph, 5th sentence: please include Jaszczak et al., Genetics 2016. This paper was published together with the 2015 papers, just a matter of timing that it got a 2016 date. Moreover, I do not think Katsuyama et al., 2015 is well cited to back up the statement in this sentence, hence I recommend removing that citation in this sentence.

      Response: We thank the reviewer for pointing this out and have made necessary changes.

      c) 6th paragraph: 5th line "targeting dilp8" : please specify if you mean the gene or the mRNA, or both. Same for line 7.

      Response: We thank the reviewer for pointing this out and have made necessary changes.

      d) Results Page 10, 1st paragraph, 1st sentence: the works cited are not the appropriate studies that demonstrated what is being stated. This was shown in Garelli et al., Science 2012 and Colombani et al., Science 2012. Results Page 10, 1st paragraph, line 11: Please also cite Colombani et al., Science 2012, who first showed that JNK is required for dilp8 regulation.

      Response: We thank the reviewer for pointing this out and apologize for this oversight. We have made the necessary changes to the manuscript.

      e) Discussion, 2nd paragraph, line 4: again, please indicate the rationale for using "paracrine" to describe Dilp8's activities. The current widely accepted model is that Dilp8 acts on interneurons in the brain (eg, reviewed in Juarez-Carreno et al., Cell Stress, 2018; Gontijo and Garelli, Mech Dev, 2018; Mirth and Shingleton, Front Cell Dev Biol, 2019; Texada et al., Genetics 2020; Boulan and Leopold, 2021). In order to reach the brain, Dilp8 has to be secreted from the discs and travel to the brain. This is as endocrine a mechanism as it gets for a small larva, considering that some discs can be on the opposite side of the larva (eg, genital discs). While this does not exclude that Dilp8 could also act paracrinally, the only evidence that I am aware of comes from other contexts such as during transdetermination, where Dilp8 has been proposed to work in an autocrine or paracrine fashion, via Drl in imaginal discs (Nemoto et al., Genes to Cells, 2023). However, this is not cited appropriately in this manuscript and is less related to the Lgr3-dependent pathway being studied here.

      Response: We fully agree with the reviewer and appreciate their clarifying this for us. We have made the necessary changes to the text.

      f) Discussion Page 13, 1st paragraph, This claim is supported by data presented in Garelli et al., Science 2012, not the other two papers. Garelli et al., 2015 shows that the Lgr3 receptor also participates in buffering developmental noise. Other studies have corroborated the Garelli et al., 2012 finding (eg, Colombani et al., Curr Biol 2015; Boone et al., Nat Commun 2016; Blanco-Obregon et al., Nat Commun 2022). Many other studies have shown that Dilp8 promotes developmental stability under tissue stress and challenges.

      Discussion Page 12, 3rd paragraph, 2nd sentence: "The Lgr3 neurons directly interact with ... PTTH ...and insulin-producing neurons" Please cite Colombani et al., 2015 and Vallejo et al., Science 2015. Vallejo et al., propose that circuit with insulin-producing neurons. In the 3rd sentence, only Jaszczak et al., 2016 is cited, whereas this claim/model comes from many studies (such as Halme et al., Curr Biol, 2010; Hackney et al., PLoS One 2012; Garelli et al. Science 2012; Colombani et al., Science, 2012; and the Lgr3 papers from 2015). Jaszczak et al., actually propose that Lgr3 is also required in the ring gland in addition to neurons.

      Discussion page 14 last paragraph,10 line, "In Aedes aegypti ....regulates ilp8 (Ling et al., 2017)". As far as I understand mosquitoes do not have a dilp8 orthologue (see for instance Gontijo and Gontijo, Mech Dev 2018; and Jan Veenstra's work). ilp nomenclature (numbering) does not follow that of Drosophila, so ilp8 is probably a typical Insulin/IGF-like peptide and is NOT an orthologue of Dilp8, a relaxin, so this citation needs to be removed or placed into the broader context of microRNA regulation of ilps.

      Response: We are really sorry for the numerous glaring errors in the references. We thank the reviewers for correcting this for us. We have made necessary changes to the text.

      Thank you for the opportunity to review your interesting work,

      Alisson Gontijo and Rebeca Zanini

      Reviewer #3 (Significance (Required)):

      If the authors address these points raised above, we believe the manuscript should be a valuable contribution to the field, and help in the understanding of how tissues respond to growth aberrations and the regulation of transcript levels by microRNAs.

      Author’s concluding response:

      We thank all the reviewers for the overall positive comments and suggestions that we believe have helped us to improve our manuscript. We have incorporated all the changes suggested, especially regarding errors in citing key references. We have performed most of the experimental suggestions. Also, we have modified the way in which graphs are presented, including statistical tests as suggested by the reviewers. Several controls have been performed to strengthen the manuscript further. We believe that this review process aided in significantly improving this manuscript.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      1. General Statements

      We thank the reviewers for their positive comments regarding the research article titled "The Ketogenic Diet Metabolite β-Hydroxybutyrate Promotes Mitochondrial Elongation via Deacetylation and Improves Autism-like Behaviour in Zebrafish" by Uddin GM and colleagues. We appreciate their input and address each comment below with a specific response.

      The main changes in the updated manuscript are as follows:

      We have revised the introduction to now incorporate additional background information on mitochondria, NAD, and mitochondrial dynamics and function. This addition aims to provide readers with a broader understanding of the mitochondrial context in relation to our study.

      Furthermore, we recognize that previous studies have explored mitochondrial function in the context of the ketogenic diet. While our specific investigation centered on mitochondrial morphology, we acknowledge the importance of comprehensively investigating mitochondrial function. To this end, we have added new data showing how BHB impacts mitochondrial oxidative phosphorylation in HeLa cells (Sup Fig 2), and how both BHB and NMN impact oxygen consumption/glycolysis in zebrafish (Fig 7).

      We have also added new behaviour analysis of the zebrafish (Fig 6), and have re-framed the discussion around neurodevelopment generally, rather than ASD specifically.

      Finally, we have now included a section in our manuscript that discusses the limitations of our study. These limitations can be further investigated to explore and characterize the full mechanistic potential behind the effects of the ketogenic diet and/or NMN on mitochondrial dynamics.

      2. Point-by-point description of the revisions

      *Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Uddin GM and colleagues presented a research article entitled 'The Ketogenic Diet Metabolite β-Hydroxybutyrate Promotes Mitochondrial Elongation via Deacetylation and Improves Autism-like Behaviour in Zebrafish'. Roles of ketogenic diet (KD) and NAD+ precursors in health promotion and longevity, as well as in the alleviation of a broad range of diseases, are evident. However, their roles in autism are not well studied, which is the novelty of the current study. Addressing the questions below will improve the quality of the paper.

      Major concerns 1. In the introduction section, a broad overview of the roles of ketogenic diet (KD) in neurodegenerative disease (and ageing, if possible) should be provided. E.g., the authors should summarize exciting progress on the use of KD to treat Alzheimer's disease in animal models (PMID: 23276384). *

      Response: Thank you for your valuable suggestion. While it is true that the KD appears to be beneficial in neurodegenerative (and other disease) models, our focus in this paper is looking at neurodevelopment, rather than all potential benefits of the KD. Nonetheless, we have addressed this comment by incorporating a brief overview of the roles of the KD in neurodegenerative diseases, including Alzheimer's disease (AD), in the introduction section of the manuscript. Specifically, we have summarized the exciting progress made in utilizing KD to treat AD in animal models, as highlighted in the suggested study. This addition helps to provide a better overview of the potential therapeutic effects of KD in neurodegenerative diseases and strengthens the introduction section of the manuscript.

      • Roles of high fat diet to treat diseases could be extended to rare premature ageing diseases. In such scenario, high fat and NAD+ boosting shared some joint mechanisms (PMID: 25440059 ). *

      Response: This information and the reference are now added to the discussion.

      *In the introduction, a more detailed introduction of NAD+ and its roles in mitochondrial homeostasis (especially mitophagy and the mitochondrial fusion-fission balance) should be included (PMID: 24813611; PMID: 30742114; PMID: 31577933). *

      Response: Although our paper focused primarily on mitochondrial fission and fusion, we have incorporated a new paragraph in the introduction to provide a more detailed introduction detailing NAD+ and its roles in mitochondrial homeostasis, specifically highlighting mitophagy. We have included the suggested references.

      • In regarding to the statement of KD increases NAD+, was it due to increased generation (to check protein levels and activities of different NAD+ synthetic enzymes, such as iNAMPT, NMNAT1-3, and NRK) and/or reduced consumption (in addition to reduced glycolysis, does KD inhibit the activities of CD38 and PARPs? In this paper, Sirtuins' activities is (are increased)). Detailed exploration of the activities of these proteins will unveil a clear molecular mechanisms on how KD affects/regulates NAD+. *

      Response: Thank you for the comment. We agree that exploring the detailed mechanism of how the ketogenic diet (KD) affects NAD+ is an interesting question that will have important implications once answered. However, fully elucidating the mechanism of action would require a more comprehensive investigation, which is beyond the scope of this current project. We have now added this as a future direction in the manuscript.

      *Fig. 1: in the NAD+ field, the NR/NMN concentrations normally used are high, from 500 µM to 2-5 mM (as the NAD+ levels in cells are high). In addition to using 50 µM, the authors are strongly encouraged to perform a dose-dependent study (50 µM, 500 µM, 1, 2, 5 mM), and see changes of mitochondrial function and parameters. In this condition, NAD+ levels should also be checked.*

      Response: We have added new supplemental data showing the initial dose response of the effects of BHB and NMN on mitochondrial morphology, which led us to choosing the relevant doses for the remainder of the paper. Our objective was not to investigate the broad impacts of different NMN concentrations on mitochondrial function and parameters, or NAD+ levels. As such, we have only focused on doses where we see effects on mitochondrial morphology.

      *Fig. 2: a comprehensive characterization of mitochondrial fusion-fission should be performed. In addition to the protein evaluated, changes on other key fusion-fission proteins, like Bax, Bak, Mfn-1, Mfn-2, etc should be performed (PMID: 17035996; PMID: 24813611). *

      Response: We agree that looking at other key proteins involved in mediating mitochondrial fission and fusion could provide additional insight. Indeed, given the changes in global acetylation that we see, it is expected that some other proteins may also be regulated in this way. However, there are at least a dozen proteins involved in mediating mitochondrial fusion and fission, not to mention many more proteins that regulate these proteins. Unfortunately, it is not feasible to analyze all the proteins involved in mitochondrial fusion-fission. Moreover, looking only at protein levels, doesn't necessarily inform about the activity of any protein. Instead, we concentrated in this paper on investigating known links between protein acetylation and mitochondrial dynamics, particularly focusing on the proteins that have known links to acetylation (i.e., DRP1, OPA1, MFNs). We have added a note in the discussion acknowledging that other means of regulation could also be occurring in parallel.

      *Figs. 1-5 were focused on mitochondrial morphology; whether KD and NMN changed mitochondrial function should be explored, such as by using Seahorse to check ECAR and OCR.*

      Response: Although our question was focused on morphology, we agree that mitochondrial function is important. We have added new data showing that BHB increases basal oxygen consumption in HeLa cells (Sup Fig 2), as well as new data showing that BHB and NMN influence oxygen consumption and glycolysis in our zebrafish model (Fig 7).

      • Fig. 6: NR/NMN used in animal studies (via gavage or in drinking water in mice, and on plate for worms and flies) are normally high (e.g., in drinking water for mice could be 4-12 mM; for worms and flies are normally 1-5 mM); for zebrafish, while they are swimming in water, this reviewer concerned whether it was true that 50 µM of NMN was sufficient to show the benefit presented.*

      Response: Our data show that these doses are indeed sufficient. We did look at some higher doses for NMN, but these were toxic, leading to poor survival and were not studied further.

      *Minor concerns 1. Line 26: For 'a growing list of neurological disorders, including autism spectrum disorder (ASD)', please add AD in. *

      Response: Line 26 is part of the abstract, which we feel should be focused more on the main message of the paper, which does not involve AD. As addressed above, we have added AD as an example in the introduction.

      *Line 57: For 'with side effects such as gastrointestinal disturbances, nausea/vomiting, diarrhea, constipation, and hypertriglyceridemia being reported', rate of frequency shall be provided if any. *

      Response: We have modified the statement to indicate the relative percent of patients suffering the various side effects.

      *Reviewer #1 (Significance (Required)):

      The novelty of the current study was to investigate effects of KD and NAD+ on autism. This investigation was not performed before and thus is the novelty.

      Weakness, effects of KD and NAD+/NMN on mitochondrial function were not well-investigated and should be done. Introduction was not well done, many key information in the fields were not provided which may mislead the readers an over-evaluation of the novelty of the current study.*

      Response: As outlined above, we have edited the introduction to include additional information requested by the reviewer. Moreover, our focus in this manuscript was to look at the mechanisms underlying changes in mitochondrial morphology, not mitochondrial function per se, though this is clearly important and related. Nonetheless, as discussed above, we have also added new data showing how BHB impacts mitochondrial function.

      *My expertise lies in NAD+, mitochondria, and brain health.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      The study examined the effect of beta-hydroxybutyrate and nicotinamide mononucleotide on mitochondrial morphology and the molecular pathways which mitigate this effect as well as the effect of these treatments on behavior in zebrafish. The study is well done and well written. The only thing I think could be improved is the bars in the graphs marking the significant comparisons. It is sometimes difficult to see which groups are being compared.*

      Response: We're happy to adjust how the data is displayed in the relevant bar graphs, but it is not clear exactly what changes the reviewer would like. To some degree this will depend on the specific guideline of the final journal where we hope the manuscript will be published. As such, we have not made changes at this point.

      ***Referees cross-commenting**

      The other reviewers do have some fair comments. Multiple doses would be helpful and showing bioenergetic data would complement the morphological measurements. Additionally, behavioral assays showing changes in social behavior in the Zebrafish would provide a stronger link to ASD. *

      Response: As discussed above, we have added new information on doses and mitochondrial bioenergetics. With respect to behaviour, we have added thigmotaxis data and reworked the discussion around behaviour and neurodevelopment so that it is less specific to ASD.

      *Reviewer #2 (Significance (Required)):

      As beta-hydroxybutyrate is an important substrate for the ketogenic diet, this study helps explain the potential mechanisms in which the ketogenic diet may enhance mitochondrial function.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      In this paper, Uddin and colleagues have investigated components of the ketogenic diet to understand changes in both mitochondrial morphology and protein expression, and zebrafish locomotor behaviour. They investigate whether beta-hydroxybutyrate (BHB) or nicotinamide mononucleotide (NMN) application can alter human mitochondria in HeLa cell lines, and also rescue a locomotion defect in shank3b+/- zebrafish larvae that have previously been proposed as a model for autism. This study is strengthened by showing data from two species; however the link between the HeLa cell line data and larval zebrafish is not strong. The study would be improved by assessing zebrafish mitochondrial changes after drug application, and testing more than one concentration of BHB and NMN in the behavioural assay. This is an interesting study, and it is nicely written and presented. I have made some comments to strengthen the study below.

      Major comments My expertise is in modelling some aspects of autism in zebrafish. To this end I have focussed on the zebrafish part of this manuscript more fully. I have several comments related to the zebrafish experiments. 1. The changes in mitochondrial morphology, peroxisome number and mitochondrial protein levels were measured in HeLa cells and no comparable data are shown for zebrafish. The same experiments should be repeated using larval zebrafish or a zebrafish cell line. *

      Response: We chose to use HeLa cells for the mechanistic studies due to practical reasons. Cell lines offer a controlled and well-established system for investigating cellular processes and molecular mechanisms. Measuring these parameters in tissues is significantly more challenging and requires different reagents (e.g., antibodies) and methodology (electron microscopy) that are not feasible in the current study.

      On the other hand, zebrafish larvae were employed for the behavior studies, which cannot be conducted using cell lines. By utilizing zebrafish, we were able to examine the effects of beta-hydroxybutyrate (BHB) and nicotinamide mononucleotide (NMN) on locomotor behavior, providing valuable insights into potential therapeutic implications for autism.

      While we acknowledge the limitations of not directly measuring mitochondrial morphology, peroxisome number, and mitochondrial protein levels in zebrafish, we believe that our study provides significant contributions to understanding the effects of BHB and NMN in zebrafish behavior. Future studies could certainly consider incorporating zebrafish-specific experiments to complement the findings in HeLa cells.

      • How did you choose the concentration of BHB and NMN to use in behavioural experiments? And the timing of application - I don't really understand why you waited 3 days after drug application to measure locomotion. *

      Response: These doses were chosen initially as they were similar to the doses that induced mitochondrial elongation in HeLa cells and were tolerated by the fish larvae. As we saw promising effects at these initial doses, we decided to explore them in more detail. While we agree that it would be worth comparing the effects of additional doses, as well as looking at their effects at other timepoints, such work would be a major endeavour and is beyond the scope of our initial investigations, which we feel are worth reporting in their current state.

      With respect to the treatment paradigm, fish larvae were treated 10-48 hours post fertilization, as this is a critical neurogenic developmental timepoint that is often used for exposure studies. Fish do not fully hatch until 3-4 days post fertilization, and display only minimal movement before 5 days, which is why we waited until 5 days to look at movement.

      • Do the shank3b+/- larvae show any morphological deficits? Their decrease in locomotion is striking. Is the morphology also rescued by drug application? Can you tie this to the mitochondrial changes that you observed in HeLa cells?*

      Response: We do not observe any gross changes in fish morphology that might explain a decrease in locomotion. Unfortunately, it is not feasible to look at mitochondrial morphology in the fish at this time. However, based on previously published work showing that the ketogenic diet promotes mitochondrial elongation in mouse brains (PMID:32380723), we would expect mitochondrial morphology also to be changed in the fish. Nonetheless, as we have not examined this directly in fish, we are not making this specific claim in this manuscript.

      • In figure 6A you use time spent swimming as a readout of distance. This doesn't really make sense, because without also showing speed of swimming it is not possible to know whether time and distance correlate in the same way across genotypes. This figure could be improved by showing more detail - speed of swimming, time spent immobile etc. This can easily be extracted from the films that you have already made using the ViewPoint software. *

      Response: As requested, we have reanalyzed the zebrafish movement data for a more refined analysis. In the revised version (Fig 6), we include analysis of both speed and distance travelled within a defined time. Importantly, these findings still support differences between WT and shank3b+/- fish that are restored by BHB and NMN to varying degrees.
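
      The distinction the reviewer draws between time spent swimming, distance and speed can be sketched from a tracked path as follows. This is a minimal illustration, assuming the tracking output can be exported as one (x, y) position per frame; the function name, the immobility threshold and the coordinates are hypothetical and do not describe the ViewPoint software or the authors' pipeline.

```python
import math

def locomotion_metrics(track, dt, still_threshold):
    """Summarize a tracked path.

    track: list of (x, y) positions in mm, one per video frame.
    dt: seconds between frames.
    still_threshold: speed (mm/s) below which a frame interval counts as immobile.
    """
    steps = [math.dist(p, q) for p, q in zip(track, track[1:])]
    distance = sum(steps)
    duration = dt * len(steps)
    time_immobile = dt * sum(1 for s in steps if s / dt < still_threshold)
    return {
        "distance_mm": distance,
        "mean_speed_mm_s": distance / duration if duration else 0.0,
        "time_moving_s": duration - time_immobile,
        "time_immobile_s": time_immobile,
    }

# Hypothetical 1 Hz track: the larva moves 3 mm, pauses, then moves 4 mm.
track = [(0.0, 0.0), (3.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
metrics = locomotion_metrics(track, dt=1.0, still_threshold=0.5)
print(metrics)
```

      Two animals can spend the same time moving yet cover different distances if their speeds differ, which is why reporting these quantities separately is more informative than time alone.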

      • Showing a change in locomotion is not enough to claim that a model is autism-like. At a minimum I think that you need to show changes in social behaviour - likely using older fish (more than three weeks) that interact with each other. Changes in locomotion can be caused by so many factors, many of which are not indicative of autism. It is important that as a field we do not simply claim that locomotion can be used as a proxy for more complex disease phenotypes. This recent review may help you with this point:* https://www.frontiersin.org/articles/10.3389/fnmol.2020.575575/full.

      Response: The reviewer makes an important point that the movement behaviour phenotypes that we see do not necessarily represent classic ASD phenotypes (i.e., repetitive behaviour, reduced sociability, and reduced communication). To begin to address this issue, we analyzed thigmotaxis, which can be a measure of anxiety. Notably, we also see differences that are reversed by BHB and NMN. However, we cannot model all ASD behaviours in a fish model, and we are not set up to look at social behaviour, especially in the young fish that we were studying. As such, even though Shank3 is a recognized ASD gene, and the shank3b+/- model we are studying is a validated ASD model (PMID: 29619162), we have re-phrased the manuscript in the context of neurodevelopment generally, rather than with respect to ASD specifically. As such, we ascribe the movement and thigmotaxis phenotypes as neurodevelopmental phenotypes that are improved by BHB and NMN.
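
      Thigmotaxis is commonly summarized as the share of time an animal spends near the wall of its arena. The following is a minimal sketch of that idea for a circular well, assuming per-frame (x, y) positions are available; the function name, the zone boundary at half the well radius and the coordinates are illustrative assumptions, not the authors' settings.

```python
import math

def thigmotaxis_fraction(track, center, well_radius, outer_fraction=0.5):
    """Fraction of tracked frames spent in the outer zone of a circular well.

    The outer zone is the ring beyond `outer_fraction * well_radius` from the
    center; the 0.5 default is an illustrative choice, not a standard.
    """
    if not track:
        raise ValueError("empty track")
    boundary = outer_fraction * well_radius
    in_outer = sum(1 for p in track if math.dist(p, center) > boundary)
    return in_outer / len(track)

# Hypothetical positions (mm) in a well of radius 5 mm centered at the origin.
positions = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (0.0, 4.5)]
fraction = thigmotaxis_fraction(positions, center=(0.0, 0.0), well_radius=5.0)
print(f"fraction of frames in outer zone: {fraction:.2f}")
```

      Because the outer ring covers more area than the inner disc under this boundary, the zone definition should be reported alongside the result so that values can be compared across studies.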

      *For the statistics, as far as I can tell, all of the data should be analysed by ANOVA or the non-parametric equivalent followed by a post-hoc test. Please check this and add information about normality in. *

      Response: As requested, we have clarified our statistical methodology throughout the manuscript.

      For the mechanistic data, we used t-tests for direct comparisons between two groups (e.g., vehicle vs. treatment). While multiple conditions such as vehicles, NMN, BHB, or etomoxir were tested, statistical comparisons were only conducted comparisons between the vehicle and each treatment group individually. As we are not also making comparisons between treatments this is not a multiple comparison, and ANOVA is not applicable in this context. We have clarified this rationale in the manuscript to avoid any confusion.

      For the zebrafish study, where multiple factors were involved (e.g., treatments across different time points or conditions), we performed a two-way ANOVA followed by Tukey's post-hoc test to identify specific group differences. This approach was appropriate for analyzing these datasets and ensures robust conclusions.

      With respect to normality testing, all datasets were assessed for normality using the Shapiro-Wilk test, and no violations of normality were observed. The updated text now includes these details.

      *Minor comments

      1. Make sure that you refer to the fish line as shank3b+/- throughout - see abstract.*

      This has been corrected.

      • Please add a space between all numbers and units (e.g. 5 Mm). *

      This has been corrected.

      • There is a spelling error on line 340 page 16: finings instead of findings. *

      This has been corrected.

      • In figure 1, if each dot represents a different sample, then there appear to be many fewer samples analysed in 1D compared to 1B. Can you comment upon this please*

      Response: A total of 80-150 cells were counted per condition, and the analyses were performed on 3 independent replicates with 2 independent technical replicates for each treatment condition. The quantification of mean mitochondrial branch length in Figure 1B was measured using ImageJ and the MiNA plugin. The measurements were taken from three independent replicates using a standard region of interest (ROI) and randomly selected areas from each image.

      In Figure 1D, NAD+ levels were measured 24 hours after treatment of vehicle, βHB, NMN, or Eto+βHB in HeLa cells (n=3-6/group). Each sample lysate represents an independent experimental dish from which coverslips were collected for image analysis.

      The difference in sample numbers between Figure 1B and 1D arises because image analysis involves individual cells fixed and stained on coverslips, whereas the NAD assay requires the whole lysate from the entire cell culture dish. Therefore, the higher cell count in Figure 1B represents the number of cells analyzed on coverslips, while Figure 1D represents NAD levels from the lysate normalized to the protein concentration.

      *Reviewer #3 (Significance (Required)):

      I think that this will be interesting to autism researchers and it could lead to more investigation of the ketogenic diet. Some more work is needed, likely in other model organisms, before this research can be translated to human patients. *

      Response: We agree that the findings of our study could be of interest to autism researchers and have implications for further investigation of the ketogenic diet (KD). It is important to note that further work, including studies in other model organisms, would be beneficial before translating this research to human patients.

      Our study aimed to provide mechanistic insights into the effects of the KD on mitochondrial morphology and behavior. We recognize that the translation of research findings to human patients requires rigorous investigation, including preclinical and clinical studies. Our study contributes to the understanding of the underlying mechanisms involved in the KD's effects, laying the groundwork for future research and potential therapeutic avenues.

      We appreciate your perspective and emphasize that our intention is to provide valuable insights into the mechanisms underlying the KD's effects rather than suggesting immediate translation to human patients. Further investigation and validation in diverse models and clinical settings will be necessary before considering clinical applications.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #3

      Evidence, reproducibility and clarity

      In this paper, Uddin and colleagues have investigated components of the ketogenic diet to understand changes in both mitochondrial morphology and protein expression, and zebrafish locomotor behaviour. They investigate whether beta-hydroxybutyrate (BHB) or nicotinamide mononucleotide (NMN) application can alter human mitochondria in HeLa cell lines, and also rescue a locomotion defect in shank3b+/- zebrafish larvae that have previously been proposed as a model for autism. This study is strengthened by showing data from two species; however, the link between the HeLa cell line data and larval zebrafish is not strong. The study would be improved by assessing zebrafish mitochondrial changes after drug application, and testing more than one concentration of BHB and NMN in the behavioural assay.

      This is an interesting study, and it is nicely written and presented. I have made some comments to strengthen the study below.

      Major comments

      My expertise is in modelling some aspects of autism in zebrafish. To this end I have focussed on the zebrafish part of this manuscript more fully. I have several comments related to the zebrafish experiments.

      1. The changes in mitochondrial morphology, peroxisome number and mitochondrial protein levels were measured in HeLa cells and no comparable data are shown for zebrafish. The same experiments should be repeated using larval zebrafish or a zebrafish cell line.
      2. How did you choose the concentration of BHB and NMN to use in behavioural experiments? And the timing of application - I don't really understand why you waited 3 days after drug application to measure locomotion.
      3. Do the shank3b+/- larvae show any morphological deficits? Their decrease in locomotion is striking. Is the morphology also rescued by drug application? Can you tie this to the mitochondrial changes that you observed in HeLa cells?
      4. In figure 6A you use time spent swimming as a readout of distance. This doesn't really make sense, because without also showing speed of swimming it is not possible to know whether time and distance correlate in the same way across genotypes. This figure could be improved by showing more detail - speed of swimming, time spent immobile etc. This can easily be extracted from the films that you have already made using the ViewPoint software.
      5. Showing a change in locomotion is not enough to claim that a model is autism-like. At a minimum I think that you need to show changes in social behaviour - likely using older fish (more than three weeks) that interact with each other. Changes in locomotion can be caused by so many factors, many of which are not indicative of autism. It is important that as a field we do not simply claim that locomotion can be used as a proxy for more complex disease phenotypes. This recent review may help you with this point: https://www.frontiersin.org/articles/10.3389/fnmol.2020.575575/full.
      6. For the statistics, as far as I can tell, all of the data should be analysed by ANOVA or the non-parametric equivalent followed by a post-hoc test. Please check this and add information about normality in.
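
      The concern in point 4 (time spent swimming does not determine distance unless speed is also known) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the track format, the sampling interval `dt`, the immobility threshold, and the function name are hypothetical and do not reflect the ViewPoint export format or the authors' analysis.

```python
import math

def swim_metrics(track, dt, move_threshold):
    """Summarise one larva's track.

    track: list of (x, y) positions in mm sampled every `dt` seconds (assumed format).
    move_threshold: per-frame displacement in mm below which the larva
        counts as immobile (assumed value, chosen by the analyst).
    """
    distance = 0.0
    time_swimming = 0.0
    time_immobile = 0.0
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        step = math.hypot(x1 - x0, y1 - y0)
        if step >= move_threshold:
            distance += step
            time_swimming += dt
        else:
            time_immobile += dt
    # Mean speed is computed over swimming time only.
    mean_speed = distance / time_swimming if time_swimming > 0 else 0.0
    return {
        "distance_mm": distance,
        "time_swimming_s": time_swimming,
        "time_immobile_s": time_immobile,
        "mean_speed_mm_s": mean_speed,
    }

# Two hypothetical larvae that swim for the same time but cover different distances.
slow = swim_metrics([(0, 0), (1, 0), (2, 0), (2, 0)], dt=1.0, move_threshold=0.5)
fast = swim_metrics([(0, 0), (3, 0), (6, 0), (6, 0)], dt=1.0, move_threshold=0.5)
```

      In this toy example both larvae have identical swimming and immobile times, yet one covers three times the distance of the other, which is why reporting speed and immobility alongside time makes the readout interpretable.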

      Minor comments

      1. Make sure that you refer to the fish line as shank3b+/- throughout - see abstract.
      2. Please add a space between all numbers and units (e.g. 5 Mm).
      3. There is a spelling error on line 340 page 16: finings instead of findings.
      4. In figure 1, if each dot represents a different sample, then there appear to be many fewer samples analysed in 1D compared to 1B. Can you comment upon this please?

      Significance

      I think that this will be interesting to autism researchers and it could lead to more investigation of the ketogenic diet. Some more work is needed, likely in other model organisms, before this research can be translated to human patients.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review)

      The weaknesses are in the clarity and resolution of the data that forms the basis of the model. In addition to whole embryo morphology that is used as evidence for convergent extension (CE) defects, two forms of data are presented, co-expression and IP, as well as a strong reliance on IF of exogenously expressed proteins. Thus, it is critical that both forms of evidence be very strong and clear, and this is where there are deficiencies: 1) For the vast majority of experiments, general morphology and LWR were used as evidence of effects on convergent extension movements rather than Keller explants or actual cell movements in the embryo. 2) The study would benefit from high or super resolution microscopy, since in many cases the differences in protein localization are not very pronounced. 3) The IP and Western analysis data often show subtle differences, which are not apparent in some cases. 4) It is not clear how many biological repeats were performed or how and whether statistical analyses were performed.

      (1) To more objectively assess the convergent extension phenotypes, we developed a Fiji macro to automatically quantify the LWR in various injected Xenopus embryos, as detailed in the Methods section. We acknowledge that a limitation in the current manuscript is how to link our mechanistic model at the molecular level with the actual cellular behavior during convergent extension, and we plan to perform cell biological studies in the future to elucidate the link;

      (2) We have repeated some of the imaging experiments in DMZ explants using a Zeiss LSM 900 confocal equipped with Airyscan2 detector that can increase the resolution to ~100 nm. The new data are in Suppl. Fig. 4, 9, 11, 16;

      (3) We have repeated all IP and western blots at least three times and provided quantification and statistical analyses;

      (4) We have added the information on biological repeats and statistical analyses in all figures and figure legends.
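
      Point (1) refers to automated quantification of the length-to-width ratio (LWR). The authors' Fiji macro is described in their Methods and is not reproduced here. As a rough illustration of the idea only, the sketch below estimates an LWR from the coordinates of a segmented embryo, taking the long axis as the first principal axis of the point cloud. The function name and input format are hypothetical, and a straight-axis measure like this would underestimate the length of strongly curved embryos.

```python
import math

def length_width_ratio(points):
    """Length-to-width ratio of a shape from its pixel or outline coordinates.

    The long axis is taken as the first principal axis of the point cloud;
    length and width are the extents along and across that axis.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Second central moments of the coordinates.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Orientation of the major principal axis.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    along = [(x - mx) * ux + (y - my) * uy for x, y in points]
    across = [-(x - mx) * uy + (y - my) * ux for x, y in points]
    length = max(along) - min(along)
    width = max(across) - min(across)
    if width == 0:
        raise ValueError("degenerate shape with zero width")
    return length / width
```

      Because the axis is derived from the shape itself, the result does not depend on how the embryo is oriented in the image, which is one reason an automated measure can be more objective than manual line measurements.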

      Reviewer #2 (Public Review):

      The protein localization experiments in animal cap assays are for the most part convincing, but with the caveat that the authors assume that the proteins are acting within the same cell. As Fzd and Vangl2 are thought to localize to opposite cell ends in many contexts, can the authors be sure that the effects they observe are not due to trans interactions? 

      In our previous publication, we provided evidence that Vangl is necessary and sufficient to recruit Dvl to the plasma membrane within the same cell (Figure 3 in 10.1093/hmg/ddx095). In a more recent publication (10.1038/s41467-025-57658-0), we further elucidated a mechanism through which Dvl oligomerization switches its binding from Vangl to Fz, and determined that Dvl binding to Vangl and Fz are differentially mediated by its PDZ and DEP domain, respectively. In the current manuscript, we also performed co-IP experiments under various conditions to demonstrate binding between Dvl and Vangl. We feel that this evidence together provides a strong argument for our model where Vangl2 acts within the same cell to sequester Dvl from Fz.

      In regards to the Dvl patches induced by Wnt11 (Fig. 3 and Suppl. Fig. 9), we performed separate injection of EGFP- and mSc-tagged Dvl into adjacent blastomeres, and demonstrated that the Wnt11-induced patches arise from symmetrical accumulation of Dvl at contact of two neighboring cells (Suppl. Fig. 9a-c’). This scenario is different from epithelial PCP where Fz/Dvl and Vangl/Pk are asymmetrically accumulated at the contact between two adjacent cells.

      The authors propose a model whereby Vangl2 acts as an adaptor between Dvl and Ror, to first prevent ectopic activation of signaling, and then to relay Dvl to Fzd upon Wnt stimulation. This is based on the observation that Ror2 can be co-IPed with Vangl2 but not Dvl; and secondly that the distribution of Ror2 in membrane patches after Wnt11 stimulation is broader than that of Fzd7/Dvl, while Vangl2 localizes to the edges of these patches. The data for both these points is not wholly convincing. The co-IP of Ror2 and Vangl2 is very weak, and the input of Dvl into the same experiment is very low, so any direct interaction could have been missed. Secondly, the broader distribution of Ror2 in membrane patches is very subtle, and further analysis would be needed to firm up this conclusion. 

      (1) We repeated the co-IP experiment with Myc-tagged Vangl or Dvl. Using the same anti-Myc antibody and experimental condition (including the expression level of Vangl, Dvl and Ror2), we still found that Ror2 could be pulled down by Vangl but not Dvl (Suppl. Fig. 15b). While these data confirm our previous conclusion, we acknowledge that a negative result does not fully exclude the possibility of direct binding between Ror and Dvl.

      (2) We re-analyzed the signal intensity of Dvl and Ror in Wnt11-induced patches. By quantifying the intensity ratio between Ror and Dvl along the patches, we found an increase of over twofold at the border of the patches (Fig. 7j, bottom panel). We interpret these data to suggest that Ror is accumulated to a higher level than Dvl at the patch borders.
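
      The ratio analysis described in (2) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the function names, the background subtraction, and the way the border and center segments are defined by a fixed number of samples are all hypothetical.

```python
def ratio_profile(ror, dvl, background=0.0):
    """Pointwise Ror/Dvl intensity ratio along a line drawn across a patch."""
    if len(ror) != len(dvl):
        raise ValueError("profiles must have the same length")
    ratios = []
    for r, d in zip(ror, dvl):
        d_corr = d - background
        if d_corr <= 0:
            raise ValueError("Dvl intensity must exceed background")
        ratios.append((r - background) / d_corr)
    return ratios

def border_to_center_fold(ratios, border_width):
    """Mean ratio in the two border segments divided by the mean ratio in the center."""
    if border_width <= 0 or 2 * border_width >= len(ratios):
        raise ValueError("border_width leaves no center segment")
    border = ratios[:border_width] + ratios[-border_width:]
    center = ratios[border_width:-border_width]
    return (sum(border) / len(border)) / (sum(center) / len(center))

# Made-up profiles: Ror falls off less steeply than Dvl toward the patch edges.
ratios = ratio_profile(ror=[40, 50, 50, 40], dvl=[20, 50, 50, 20])
fold = border_to_center_fold(ratios, border_width=1)
```

      Taking the ratio point by point addresses the reviewer's normalization concern in part: a Ror curve that is merely a scaled-down copy of the Dvl curve gives a flat ratio, whereas a ratio that rises at the edges indicates a genuinely broader distribution.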

      A final caveat to these experiments is that in the animal cap assays, loss of function and gain of function both cause convergence and extension defects, so any genetic interactions need to be treated with caution i.e. two injected factors enhancing a phenotype does not imply they act in the same direction in a pathway, in particular as there are both cis/trans and positive/negative feedbacks between the PCP proteins. 

      We agree with the reviewer that a difficulty in studying PCP/ non-canonical signaling is that both loss and gain of function of any of its components can cause convergence and extension defects. Genetic interactions, especially synergistic interactions, should be interpreted with caution. But we do want to point out that, in a number of cases, we were also able to demonstrate epistasis. For instance, we found that Dvl2 over-expression induced CE defects can be rescued by Pk over-expression (Fig. 1e and f), whereas Vangl/ Pk co-injection induced severe CE defects can be reciprocally rescued by Dvl2 over-expression (Fig. 1g). Likewise, we showed that Fz2/ Dvl2 co-injection induced CE defects can be rescued by wild-type Vangl2 but not Vangl2 RH mutant (Suppl. Fig. 6b), and Ror2 can rescue Vangl2 overexpression induced CE defect (Suppl. Fig. 14). Collectively, these functional interaction data consistently demonstrate an antagonism between Dvl/ Fz/ Ror2 and Vangl2/ Pk, which is correlated with our imaging and biochemical studies.

      As you can see from the reviews, the referees generally agree that your paper is a potentially valuable contribution to the field. Your observations are important because of the novel model based on the inhibitory feedback regulation between planar cell polarity (PCP) protein complexes. However, the reviewers also stated that the model is only partly supported by data because of insufficient clarity and missing controls in several experiments supporting the proposed model. The paper would be significantly improved if your conclusions are backed up by additional experimentation. Specifically, the referees wanted to see the reproducibility of the results shown in Figures 3, 4, 8, S3, S7, S12. 

      We hope that you are able to revise the paper along the lines suggested by the referees to increase the impact of your study on the current understanding of PCP signaling mechanisms. 

      We thank the reviewers for their careful reading of our manuscript and for their constructive critiques and suggestions. We have repeated the animal cap studies in original Figures 3, 4, 8 and S3 with DMZ explants, and the new data are in Supplementary Fig. 9, 11, 16 and 4, respectively. We also repeated the biochemical studies in original Figures S7 and S12, and the new data are in Supplementary Fig. 8 and 15.

      Reviewer #1 (Recommendations For The Authors):

      Major points:(1) The author conducted an analysis of the subcellular localization of PCP core proteins, including Vangl2, Pk, Fz, and Dvl, within animal cap explants (ectodermal explants). To validate the model proposing that 'non-canonical Wnt induces Dvl to transition from Vangl to Fz, while PK inhibits this transition, and they function synergistically with Vangl to suppress Dvl during Convergent Extension (CE),' it is crucial to assess the subcellular localization of PCP core proteins in dorsal marginal zone (DMZ) cells, which are known to undergo CE. Notably, the overexpression of Wnt11 alone, as employed by the author, does not induce animal cap elongation. Therefore, the use of animal cap explants may not be sufficient to substantiate the model during Convergent Extension (CE). Indeed, previous knowledge indicates that Vangl2 and Pk localize to the anterior region in DMZ explants. However, the results presented in this manuscript appear to differ from this established understanding. Consequently, to provide more robust support for the proposed model, it is advisable to replicate the key experiments (Figures 3, 4, 8, and Figure S3) using DMZ explants. 

      We repeated the experiments in Figures 3, 4, 8 and Figure S3 with DMZ explants, and the new data are in the new Supplementary Fig. 9, 11, 16 and 4, respectively. In regard to “previous knowledge indicates that Vangl2 and Pk localize to the anterior region in DMZ explants”, we are aware of Vangl/ Pk localization to the anterior cell cortex in neural epithelium from the studies by the Sokol and Wallingford labs, but are not aware of similar reports in DMZ explants. When we examined the localization of a small amount of injected EGFP-mPk2 (0.1 ng mRNA) in DMZ explants, we saw a somewhat uniform distribution on the plasma membrane (Suppl. Fig. 4). In addition, in a related recent publication, we examined endogenous XVangl2 protein localization in activin-induced animal cap explants that do undergo CE. What we observed was that whereas low-level injected Dvl2 and Fz form clusters on the plasma membrane, endogenous XVangl2 remains uniformly distributed on the plasma membrane (Suppl. Fig. 3S-Z in 10.1038/s41467-025-57658-0). These observations may suggest potential differences in PCP protein localization during neural vs. mesodermal convergence and extension.

      (2) The author suggests that 'Vangl2 and Pk together synergistically disrupt Fz7-Dvl2 patches.' As shown in Figure 4 (panels J' to I'), it is evident that the co-expression of Pk and Vangl2 increases Fz7 endocytosis. Nevertheless, a significant amount of Fz7 still co-localizes with Dvl2. To strengthen the author's hypothesis, additional clear assay is required such as Fluorescence resonance energy transfer (FRET) assay. 

      We appreciate this valuable advice. Since none of the tagged Fz/ Dvl/ Vangl proteins we had were suitable for FRET, we made proteins tagged with mClover and mRuby2, which were reported as optimized FRET pairs. But in our hands mRuby2 seems to require a very long time (~2 days) to mature and become detectable at room temperature, and is not suitable for our Xenopus experiments. We are in the process of establishing a luciferase-based NanoBiT system to detect Fz-Dvl and Dvl-Vangl interactions in live cells and cell lysates, and will use it in future studies to investigate their interaction dynamics.

      For the current manuscript, we reason that a substantial reduction of Fz7-Dvl2 clusters with Vangl2/ Pk co-injection would still support our idea that Vangl2 and Pk act synergistically to sequester Dvl from Fz to prevent their clustering in response to non-canonical Wnt ligands.

      (3) The IP data is less clear and evident. A couple of examples are: a) Fig 2g where the authors report that the Vangl2 R177H variant reduced Vangl2 interaction with Pk and recruitment of Pk to the plasma membrane, but it appears that the variant interacts slightly better than WT Vangl2 with Pk. In Fig. S7a, the authors state that Pk overexpression can indeed significantly reduce Wnt11-induced dissociation of EGFP-Vangl2 and Flag-Dvl2 in the DMZ. However, there is a minimal impact when compared to the Wnt11 absent control. Based on the results presented in Fig S12a the authors indicate that Wnt11 reduces the association between Vangl2 and Dvl2, which can be discerned, but loss of Ror2 does not change this in any obvious way - but the authors indicate it does. In S12b, the authors have suggested that Ror and Dvl do not form a direct binding interaction. However, the interpretation of Figure S12b is not entirely convincing due to several issues. Notably, the expression levels of each protein appear inconsistent, the bands are not sufficiently clear, and there is the detection of three different tag proteins on a single blot. To strengthen the validity of these findings, it is advisable to repeat this experiment with improved quality. 

      We repeated all the co-IP and western blot analyses pointed out by the reviewer, and performed quantification and statistical analyses.

      Fig 2g had a mistake in the labeling and is replaced with new Figure 2g;

      Fig. S7a is replaced by new data in Supplementary Figure 8a and b;

      Fig. S12a and 12b are replaced by new data in Supplementary Figure 15a, a’ and b, respectively. In 15a and a’, we noticed a consistent decrease of Dvl2-Vangl2 co-IP in Xror2 morphant. The reason for this is not yet clear and will need further study in the future.

      Minor points: (1) In all the whole embryo injection assays examining morphology, no Western analysis is performed to show roughly equivalent and appropriate levels of the various proteins are being expressed. Differences will affect the data. 

      Although we did not do western analyses to examine the protein levels in various functional interaction assays, we did examine how co-expression of Vangl2, mPk2 or Dvl2 may impact each other’s protein levels in Supplementary Fig. 2, which did not reveal any significant change when co-injected in different combinations.

      (2) The author's prior publication (Bimodal regulation of Dishevelled function by Vangl2 during morphogenesis, Hum Mol Genet. 2017) presented clear evidence of Vangl2 overexpression inducing Dvl2 membrane localization. However, Figure S4 in the current manuscript did not provide clear evidence of membrane localization. To strengthen the hypothesis that Vangl2-RH mutant also induces Dvl2 membrane localization, further comprehensive imaging analysis is needed. 

      We re-analyzed the imaging data and replaced old Figure S4 with a new Supplementary Fig. 5.

      (3) In Supplementary Figure 9, the authors propose that the overexpression of Vangl2/Pk induces Fz7 endocytosis, as indicated by its co-localization with FM4-64. However, it raises a question: how does the Fz7-GFP protein internalize into the cells without endocytosis, as seen in Figures S9a-c'? To enhance readers' understanding, a discussion addressing this point should be included. 

      We think that this might be a technical issue. As detailed in the Methods section, we only incubated the embryos transiently with FM4-64 for 30 minutes, and the embryos were subsequently washed and dissected in 0.1X MMR without the dye. Therefore, only the Fz7-GFP protein endocytosed during the 30-minute incubation would be labeled by FM4-64, whereas that endocytosed before or after the incubation would not. Alternatively, the very few Fz7-GFP puncta occasionally observed in the absence of Vangl2/Pk overexpression could be vesicles trafficking to the plasma membrane.

      (4) Statistical analyses are absent for several results, including those in Figure 2f, Figure S4d, and Figure S7b. 

      We repeated these experiments and included statistical analyses. The new data are in Figure 2f, Supplementary Fig. 5d and Supplementary Fig. 8b.

      (5) This manuscript lacks any results regarding Ck1. Therefore, it is advisable to consider removing the discussion or mention of CK1. 

      We agree, and have toned down the discussion of CK1 and removed CK1 from our model in Fig. 9.

      Reviewer #2 (Recommendations For The Authors):

      (1) In all the convergence and extension assays, the authors should report n numbers (i.e. number of animals), what statistical test is used, and what the error bars show. Ideally dot-plots would be used instead of bar charts as they give a better insight into the data distribution. It might be useful to give a section on the statistical analyses used in the M&M, including e.g. any power calculations carried out, as now required by many journals. 

      We have followed the advice to use dot-plots for all the quantification analyses in the manuscript. We include in the figure legends the statistical test used and what the error bars show. The number of embryos analyzed is included in each panel in the figures. We also provided more details in the Methods section on how the LWR quantification was carried out.

      (2) I think Figure 2g is wrongly labelled? FLAG bands are in all three lanes in the western blot, but not labelled as such in the schematic. 

      We corrected the schematic labeling in Figure 2g, and thank the reviewer for catching this mistake.

      (3) In Figure S7, the authors show that co-IP of Dvl and Vangl2 is reduced by Wnt11 and the effects of Wnt are blocked by Pk. Does Pk have any effect in the absence of Wnt? 

      We examined the effect of Pk over-expression on Dvl2-Vangl2 co-IP as advised, and did not see a significant impact in the absence of Wnt11 co-injection. The data is included in the new Supplementary Figure 8a. We interpret the data to suggest that “at least under the condition of our co-IP experiment, Pk may not directly impact the steady-state binding between Vangl and Dvl”.

      (4) In Figure 3, the authors show (as published previously) that Wnt11 induces patches of Dvl at the plasma membrane. It would be useful to see Dvl in the absence of Wnt and Vangl2/Dvl in the absence of Wnt. 

      Dvl is widely known as a cytoplasmic protein and its localization has been published by many labs over the past 20-30 years. In our recent publication (10.1038/s41467-025-57658-0), we also re-examined Dvl localization when injected at various dosages. So we did not feel it was necessary to show its localization in the absence of Wnt11 again, but included a reference to our prior publication. In regard to Vangl/Dvl distribution in the absence of Wnt11, the readers can see Suppl. Fig. 5b as an example, in addition to our previous publications referenced in the manuscript.

      (5) In the review figures, the difference in Fz7-GFP patch formation in d' and e' (vs e.g. a') is not very clear. Could the images be improved or (better) quantified in some way? 

      We assume that “review figures” refer to Figure 3 or 4? If so, we felt that Fz7-GFP patch formation was clear in Fig. 3d’, e’ or Fig. 4d’, e’. Nevertheless, we repeated these experiments in DMZ explants as advised by Reviewer 1, and additional examples of Fz7-EGFP patch formation can be seen in the new Suppl. Fig. 9d-f’ and Suppl. Fig. 11d-f’.

      (6) In Figure 6d, I'm concerned that the loss of flag-Dvl2 might occur via dephosphorylation in the IP reaction. Also the M&M don't include methodological details about buffers and whether phosphatase inhibitors were used. A compelling control would be anti-FLAG pulldown showing retention of phosphorylation. Also Figure 6f shows a reduced ratio of fast-to-slow migrating bands of Dvl with Vangl2/Pk - unless I have misunderstood, is this ratio the wrong way round? 

      We added co-IP buffer and protease inhibitor information in Methods.

      We agree that the concern about dephosphorylation during the IP reaction is valid, and that direct pull-down of Dvl to show the phosphorylated form is a compelling control. We therefore note that in Suppl. Fig. 8a and 15b, direct pull-down of Flag-Dvl or Myc-Dvl (with anti-Flag or anti-Myc) did show the slower migrating, phosphorylated form. Additional examples in which Vangl only co-IPs the faster migrating unphosphorylated Dvl include Suppl. Fig. 15a, and a related paper we published recently (Fig. 3R and R’ in 10.1038/s41467-025-57658-0).

      Finally, we did wrongly label Figure 6f in the last submission, and the ratio should have been “slow/fast”. We have made the correction, and appreciate the reviewer's meticulousness in perusing our manuscript.

      (7) In Figure 7, what does Ror2 look like in the absence of Wnt11? 

      We included new Figure 7a-c to show that without Wnt11 co-injection, Ror2 is uniformly distributed on the plasma membrane.

      (8) Also in Figure 7, Ror2 patches are said to be slightly wider than Dvl2 patches "reminiscent of Vangl2" - I wouldn't describe them as being similar. Vangl2 shows a distinct dip in the center of the Dvl patches, Ror2 does not show a dip, and is only (at best) in a slightly wider patch, and I would want to see further examples to be convinced that the localization domain is reproducibly wider. The merge of many samples in 7d may actually be making the distribution harder to see and if the Xror2 and Dvl2 intensities were normalized I'm not sure how different the curves would appear. (i.e. the Xror2 curve looks like a flattened version of the Dvl2 curve). 

      We have added an additional panel in the new Figure 7j to compare the intensity ratio of Ror/ Dvl2 along the patches, and this analysis reveals a more than twofold increase of the ratio at the border region. This quantification may make a more convincing argument that at the patch border region, Dvl is diminished whereas Ror2 accumulates with Vangl2.

      (9) In Figure S12a, the authors suggest Wnt11 induced dissociation of Dvl from Vangl2 (by co-IP), and this is reduced after Ror2 MO. This would be more convincing with replicates and quantitation. 

      We have repeated this experiment with Vangl2 pull down and added quantification. The data is in the new Suppl. Fig. 15a.

      (10) In Figure S12b, the authors suggest Ror2 can co-IP Vangl2 but not Dvl. This is not very convincing, as the Dvl input band is very weak, and the Vangl2 co-IP band is very weak. 

      We repeated the co-IP experiment with Myc-tagged Vangl or Dvl. Using the same anti-Myc antibody and experimental condition (including the expression level of Vangl, Dvl and Ror2), we still found that Ror2 could be pulled down by Vangl but not Dvl (Suppl. Fig. 15b).

      (11) "Prickle" spelled "Prickel" in the abstract (and abbreviated to "PK" not "Pk" at one place in the abstract and several places in text) 

      We have corrected these typos.

      (12) Quite a lot of interesting observations are in supplemental figures. Normally it might be expected that extra data supporting a conclusion would be in supplemental, but here some of the supplemental data feels like it is more than simply additional evidence. For instance supplemental Figures 2 and 3 feel more than just supplemental (and Supplemental Figure 3 if merged with Figure 2 would make it easier for the reader). Moreover, for example, the description of the results in Figure 2 is punctuated by references to supplemental Figures 4 and 5 that contain key data to support the conclusions, which means the reader has to flick backwards and forwards from place to place in the manuscript to follow the argument. It is of course up to the authors, but in some cases putting supplemental data back into the main figures (for which there is no size or number limit) would increase clarity. 

      These are excellent points; in the resubmitted manuscript we have a total of 24 data figures, and we used 8 as main figures since we felt that they provide the most relevant and conclusive evidence to our model. We will consult the copy editors at eLife on how to arrange the rest as main vs. supporting figures when requesting publication as version of record.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      We thank the reviewers for their thoughtful comments and overall very supportive feedback.

      Reviewer #1 writes: "The study is very thorough and the experiments contain the appropriate controls. (...) The findings of the study can have relevance for human conditions involving disrupted mitochondrial dynamics, caused for example by mutations in mitofusins." Reviewer #2 writes: "The dataset is rich and the time-resolved approach strong." Reviewer #3 writes: "I admire the philosophy of the research, acknowledging an attempt to control for the many possible confounding influences. (...) This is a powerful and thoughtful study that provides a collection of new mechanistic insights into the link between physical and genetic properties of mitochondria in yeast."

      We address all points below. We have not yet updated our text and figures since we expect substantial additions from new experiments. But we have included Figure R1 with some additional analyses of existing data at the bottom of the manuscript.

      Reviewer1

      1.1. Statistical comparisons are missing throughout the manuscript (with the exception of Fig. 2c). Appropriate statistical tests, along with p-values, should be used and reported where different groups are compared, for example (but not limited to) Fig. 3d and most panels of Fig. 4.

      We initially decided not to add too many extra labels to the already very busy plots, given that the magnitude of change mostly speaks for itself. However, we will try to find meaningful statistical tests together with a sensible graphical representation for all of the figures. For one example see Figure R1A.

      1.2. I do not agree with the use of Atp6 protein as a direct read-out of mtDNA content. While Atp6 protein levels will decrease with decreasing mtDNA content, the inverse is not necessarily true: decreased Atp6 protein levels do not necessarily indicate decreased mtDNA levels, because they could alternatively or additionally be caused by decreased transcription and/or translation. Therefore, please do not equate Atp6 protein levels to mtDNA levels, and instead rephrase the text referencing the Atp6 experiments in the Results and Discussion sections to measure "mtDNA expression" or "mt-encoded protein" or similar. For example, on p. 14 line 431 should read "mtDNA expression" rather than "decreased synthesis of mtDNA", and line 440 on the same page "mean mtDNA levels" should be "mtDNA expression" or similar.

      All three reviewers agree that using Atp6-NG as a direct proxy for mtDNA requires more validation, or at least rephrasing of the text. We agree that this is the most important point to address. We had previously tried using the mtDNA LacO array (Osman et al. 2015) to directly assess the number of nucleoids per cell. However, the altered mitochondrial morphology of the Fzo1-depleted cells, combined with the LacI-GFP that remains in mitochondria even when mtDNA is gone, increases the noise level to a point where we cannot interpret the signal. While this manuscript was in the submission process, the Schmoller lab (co-authors #2 and #7) adapted the HI-NESS system to label mtDNA in live yeast cells (Deng et al. 2025). This system promises a much better signal-to-noise ratio, and we expect that it will allow us to address all concerns regarding the actual count of nucleoids per cell. Should this unexpectedly fail for technical reasons, we will try to calibrate the Atp6 levels with DAPI staining at defined time points and will rephrase the text as the reviewer suggests.

      1.3. In Fig. 3, the authors use the fluorescence intensity of a mitochondrially-targeted mCardinal as a read-out of mitochondrial mass. Please provide evidence that this is not affected by MMP, either with relevant references or by control experiments (e.g. comparing it to N-acridine orange or other MMP-independent dyes or methods).

      Whether or not the import of any mitochondrial protein is dependent on the MMP depends largely on the signal sequence. The preSu9 signal sequence was previously characterized as largely independent of the MMP compared to other presequences (Martin, Mahlke, and Pfanner 1991), which is why Vowinckel et al. (2015) and others (Di Bartolomeo et al. 2020; Perić et al. 2016; Ebert et al. 2025) have previously used it as a neutral reference to the strongly MMP-dependent pre-Cox4 signal to estimate MMP. As one control in our own data, we consider that the population-averaged mitochondrial fluorescent signal (Figure S3C) stays constant in the first few hours, in agreement with the total averaged mitochondrial proteome (Fig R1E). As an additional control, we plan to compare the signal to an MMP-independent dye as the reviewer suggests.

      1.4. In Fig. 2e-f, the authors use a promoter reporter with Neongreen to answer whether the reduced levels of the nuclear-encoded mitochondrial proteins Mrps5 and Qcr7 are due to decreased expression or to protein degradation, and find no evidence of degradation of the Neongreen reporter protein. However, subcellular localization might affect the availability of the protein to proteases. Although not absolutely required, it would be relevant to know if the Neongreen fusion protein is found in the same subcellular compartment as Mrps5 and Qcr7 at 0h and 9h after Fzo1 depletion.

      Here, it seems we need to explain the set-up and interpretation of the data better. The key point we are trying to make with the promoter-Neongreen construct is that the regulation is not mainly at the level of transcription. We are showing that the reduction in the levels of the actual protein (orange bars) is not (mainly) explained by a reduction in expression, since the promoter is similarly active at 0 and at 9 hours (grey bars). If expression from the promoter were strongly reduced, the Neongreen would be diluted with growth and would also decrease, but this is not the case. The fluorophore itself is just floating around in the cytosol and is not subject to the same post-translational regulation as Mrps5 and Qcr7, so there is no reason to expect degradation.

      1.5. Fzo1 depletion leads to a very rapid drop in MMP during the first hour of depletion. In the Discussion, can the authors speculate on the possible mechanism of this rapid MMP drop that occurs well before mtDNA or mt-encoded proteins are decreased in level?

      This is indeed an interesting point. We think there are likely three reasons for this initial drop: firstly, due to the fragmentation, the mixing of mitochondrial content is disturbed and smaller fragments may have suboptimal stoichiometry of components (see also Khan et al. 2024, who look at this in detail, including the Fzo1 deletion); secondly, already fairly early, some mitochondrial fragments may not contain any mtDNA and will therefore be unable to synthesize ETC proteins; thirdly, altered morphological features such as changes in the surface-to-volume ratio may play a role. Unfortunately, following up on this mechanistically is not possible with the tools at our disposal and is therefore outside the scope of this manuscript. But we are happy to include these speculations in our discussion.

      1.6. In Fig. 2a, the mtDNA copy number of Fzo1-depleted cells is ca 1.3-fold of the control cells at the 0h timepoint. Why might this be? Is it an impact of one of the inducers? If so, we might be looking at the combination of two different processes when measuring copy number: one that is an induction caused by the inducer(s), and the other a consequence of Fzo1 depletion itself.

      We believe that this 30% increase is within the noise of the experiment rather than an effect of the induction. Since we normalize to t=0 uninduced, the first black data point does not have error bars, emphasizing this difference. None of the protein data suggests that there is an increase in mtDNA encoded proteins (see e.g. 2B, or Atp6 fluorescence data). In the planned HI-NESS experiment, we will see in our single cell data whether there is an actual increase in mtDNA upon TIR induction. Additionally, we will run a qPCR to carefully determine mtDNA levels of untreated wild-type cells, tetracycline treated wild-type cells and tetracycline induced TIR expressing cells to exclude effects of tetracycline as well as the expression of TIR on mtDNA.

      Minor comments:

      1.7. p. 3, line 71: "ten thousands of dividing cells.." should be "tens of thousands of dividing cells".

      Thank you, will correct.

      1.8.-p.4, line 116: please be even more clear with what the "depleted" cells and controls are treated with: are depleted cells treated with both inducers, and controls with neither?

      We will make this clearer. Depleted cells are treated with both inducers; the control cells are treated with neither. In addition, Figure 1A and Figure S1 include controls showing that inducing TIR per se or adding aTC per se does not change growth rate or mitochondrial morphology.

      1.9. -p.5, lines 147-148: the authors write "the rate with which the abundance of Cox2 and Var1 proteins decreases was similar to the rate of mtDNA loss" though the actual rate is not shown. Please calculate and show rates for these processes side by side to make comparison possible, or alternatively rephrase the statement.

      Indeed this was not phrased well. We will call it dynamics rather than rates.

      1.10. -Fig. 2d: changing the y-axis numbering to match those in panels a and b would facilitate comparisons.

      Makes sense, we will change this.

      1.11. Fig. 2e: it is recommended to label the western blot panels to indicate what protein is being imaged in each (Neongreen, Mrps5, Qcr7).

      We will adapt the labelling to make it more clear.

      1.12. -p.9, line 262: I suggest referencing Fig. 4e at the end of the first sentence for clarity.

      We will modify the sentence as suggested.

      1.13. -In the sections related to Fig. 3a and Fig. 5a as well as the connected supplemental data, the authors discuss both the median and the mean of mitochondrial mass and Atp6 protein, respectively. For purposes of clarity, I suggest decreasing the focus on the mean (that is provided only in the supplemental data) and focusing the text mainly on the median. The two show differing trends and it is very good that both are shown, but the clarity of the text can be improved by focusing more on the median where possible.

      We will check the phrasing and simplify.

      1.14. -p. 14, line 435: the statement that mt mass is maintained over the first 9h of depletion is only true for the mean mt mass, not for the median. Please make this clear or rephrase.

      We will check the phrasing, make it clearer, and also point out the extended proteomics data (see Fig R1), which correspond to the mean of the populations.

      1.15.-p.14, line 452: "mitofusions" should be "mitofusins".

      Thanks for catching this.

      Reviewer 2:

      2.1. While inducible TIR is used to reduce background, the manuscript should rigorously exclude auxin/TIR off-targets (growth, mitochondrial phenotypes, gene expression). Please include full matched controls: ±auxin, ±TIR, epitope tag alone, and a degron control on an unrelated mitochondrial membrane protein.

      We agree that rigorous controls are crucial for the interpretation of the results. However, we think we have already included most of the controls the reviewer is asking for, but we might have not pointed this out clearly enough. For example, in Fig 1A, we could make it more clear by adding more labels in which samples we added aTC, which is only described in the figure legend.

      Here is a list of all the controls:

      • Each depletion experiment is always matched with an experiment of the same strain without induction. So the genetic background as well as effects such as light exposure, time spent in the microfluidics system, etc. are controlled for.
      • Figure S1D shows that the growth rate is wild-type-like in a strain containing either the AID tag or the TIR protein AND upon addition of both chemicals. It also shows that the final genetic background (AID tag and TIR) grows like wild type if the inducers are not added. This conclusively shows that neither the tags/constructs nor the chemicals per se affect growth rate.
      • In Figure S1C we show the mitochondrial morphology of the same controls. We will make sure to label them more consistently to match panel D, and include an actual wild type and a FLAG-AID-Fzo1 strain without TIR treated with both aTC and 5-Ph-IAA as a direct comparison.
      • In Figure 1A we compare the Fzo1 protein levels of a strain with and without TIR. We show that in the absence of TIR, adding either aTC or auxin does not change Fzo1 levels, and that the levels are comparable in the strain that is able to deplete Fzo1 directly before addition of 5-Ph-IAA (after 2 h of induction of TIR through addition of tetracycline).
      • Additionally, in Figure S2C we show that two hours after adding aTC, the entire proteome does not change significantly apart from a strong induction of TIR. We can also make this clearer in the figure legend.
      • Additionally, we will run a qPCR to carefully determine mtDNA levels of untreated wild-type cells, tetracycline-treated wild-type cells and tetracycline-induced TIR-expressing cells to exclude effects of tetracycline as well as of TIR expression on mtDNA (also in response to 1.6).

      In summary, we think we have controlled sufficiently for all confounding parameters and, most importantly, showed that addition of either aTC or auxin as well as the FLAG-AID tag per se does not disturb mitochondria or cell growth. We do not see what a degron control on an unrelated protein would tell us. Depending on the nature of the protein, it may or may not have a phenotype that may or may not be related to morphology changes etc.

      2.2. The Mitoloc preSu9 vs Cox4 import ratio is only a proxy of mitochondrial membrane potential (ΔΨm) and itself depends on mitochondrial mass, protein expression, matrix ATP, and import saturation. The authors need to calibrate ΔΨm with orthogonal dyes (TMRE/TMRM) and pharmacologic titrations (FCCP/antimycin/oligomycin) to generate a response curve; show that Mitoloc tracks dye-based ΔΨm across the relevant range and corrects for mass/photobleaching. Report single-cell ΔΨm vs mass residuals.

      We completely agree that the MitoLoc system is only a rough proxy for the actual membrane potential. That is why we make no quantitative claims on the absolute value or absolute difference between groups of cells. We also make very clear in Fig 3B what we are actually measuring and can emphasize again in the text that this is only a proxy. We agree that it is a good idea to compare MitoLoc values to TMRE staining as the reviewer suggests; we will do these experiments in depleted and control cells at different timepoints. Please note, though, that dye staining also has its caveats, especially in dynamic live cell experiments. TMRM, for example, is not compatible with the acidic pH 5 medium that is typically used for yeast, and subjecting cells to washing steps and higher pH may change both the morphology of mitochondria and the MMP, especially in cells that are already “stressed”. We prefer not to carry out elaborate pharmacological titration experiments because, firstly, this was extensively done in the original MitoLoc paper by the Ralser lab (Vowinckel et al. 2015; cited 120 times); secondly, the value of the MMP is not the most critical claim of the manuscript. See also 3.12. Please note that in Figure S4D we had already plotted MMP vs mitochondrial concentration.

      2.3. To use Atp6-mNeon as a proxy for mtDNA is an assumption. Interpreting Atp6 intensity as "functional mtDNA" could be confounded by translation, turnover, or assembly. Please (i) report mtDNA copy number time courses (you have qPCR), nucleoid counts (DAPI/PicoGreen or TFAM/Abf2 tagging), and (ii) assess translation (e.g., 35S-labeling or puromycin proxies) and turnover (proteasome/AAA protease inhibition, mitophagy mutants -some data are alluded to- plus mRNA levels for mtDNA-encoded genes). This will support the "reduced synthesis" versus "increased degradation" conclusion.

      We agree with all three reviewers that Atp6 is only a proxy for mtDNA (Jakubke et al. 2021; Roussou et al. 2024) and the correlation should be checked more carefully. We will use the very recently established HI-NESS system to follow nucleoids/mtDNA during depletion experiments. See detailed reply to 1.2.

      (ii) In Figure 2C we inhibit mitochondrial translation and show that in this case control and depleted cells have the same level of Cox2, at least suggesting that degradation is not the key mechanism controlling the levels of mtDNA-encoded proteins. We cannot do proteasome inhibitor assays since the AID-TIR system requires an active proteasome. In Figure S5C we show that the Atp6 depletion is similar in an atg32 deletion. This does not completely exclude a contribution of mitophagy to the observed phenotype, but does confirm that mitophagy is not the primary reason for cells becoming petite.

      2.4. The promoter-NeonGreen reporters argue against transcriptional down-regulation of nuclear OXPHOS. Please add mRNA (RT-qPCR/RNA-seq) for representative genes and a pulse-chase or degradation-pathway dependency (e.g., proteasome/mitophagy/autophagy mutants) to firmly assign active degradation. The authors need to normalize proteomics to mitochondrial mass (e.g., citrate synthase/porin) to separate organelle abundance from protein turnover.

      While we are happy to perform qPCR experiments for selected genes, a full RNA-seq experiment seems outside the scope of this study. As explained above, a proteasome inhibitor experiment is not possible in this set-up. Bulk mitophagy/autophagy seems unlikely to be the cause of the decrease of the nuclear-encoded OXPHOS proteins, since most other mitochondrial proteins do not decrease on average at the population level in the first hours. These data are now plotted as an additional figure (see below) and will be included in the supplementary material of the revised manuscript (Fig R1E).

      2.5. Using preSu9-mCardinal intensity as "mitochondrial concentration" is sensitive to expression, import competence, and morphology/segmentation. The authors should provide validation that this metric tracks 3D volume across fragmentation states (e.g., correlation with mito-GFP volumetrics; detergent-free CS activity; TOMM20/Por1 immunoblot per cell).

      We agree that this is an important point, and the co-authors discussed it quite intensively. In Figure S3A and B we show (using confocal data) that there is a very strong correlation between the total fluorescence signal and the 3D volume reconstruction. However, the slope of the correlation differs between tubular and fragmented mitochondria (compare panels A and B; see also the figure legend). Since we are dealing with diffraction-limited objects, it is likely that the 3D reconstruction is sensitive to morphology, especially if mitochondria are “clumping”. We therefore think that the total fluorescence signal is actually a better estimate of mitochondrial mass per cell than the 3D volume reconstruction (especially for our data obtained with a conventional epifluorescence microscope). The mean of the total mitochondrial fluorescence also better matches the population average mitochondrial proteome (Fig R1E). To consolidate this assumption, we will additionally compare our data to a strain with Tom70-Neongreen and to MMP-independent dyes.

      Notably, since the morphology is similarly altered in mothers and buds this is of minor impact for our main point – the unequal distribution between mother and buds.

      2.6. The unequal mother-daughter distribution is compelling, but causality remains inferred. Test whether modulating inheritance machinery (actin cables/Myo2, Num1, Mmr1) or altering fission (Dnm1 inhibition) modifies segregation defects and rescues mtDNA/Atp6 decline. Complementation with Fzo1 re-expression at defined times would help order the phenotype cascade.

      We agree that rescue experiments would be very useful. We have some preliminary data for tether experiments, for example with Num1. The general problem is that the fragmented mitochondria clump together. We have not found a method to restore an equal distribution between mother and daughter cells. We will try to optimize the assay, but are not overly confident it will work. Mmr1 deletion aggravates the Fzo1 phenotype, likely also because the distribution becomes even more heterogeneous, but we have not rigorously analyzed this.

      We like the idea of the Fzo1 re-expression and will run such experiments. This will be especially powerful in combination with the new HI-NESS mtDNA reporter. We may be able to track exactly when cells reach the point-of-no return and become petite. This will also help connecting our mathematical model more directly to the data.

      2.7. The model is useful but should include parameter sensitivity (segregation variance, synthesis slopes, initial nucleoid number) and prospective validation (e.g., predict rescue upon partial restoration of synthesis or inheritance, then test experimentally).

      We will refine our model to include the to-be-measured nucleoids/mtDNA values. We will include a parameter sensitivity analysis with the updated model.

      Reviewer 3:

      3.1. About the use of Atp6 as a good proxy for mtDNA content. This is assumed from l285 onwards, based on a previous publication. As the link is fairly central to part of the paper's arguments, and the system in this study is being perturbed in several different ways, a stronger argument or demonstration that this link remains intact (and unchanged, as it is used in comparisons) would seem important.

      We agree, see 1.2.

      3.2. About confounding variables and processes. The study does an admirable job of being transparent and attempting to control for the many different influences involved in the physical-genetic link. But some remain less clearly unpacked, including some I think could be quite important. For example, there is a lot of focus on mito concentration -- but given the phenotypes are changing the sizes of cells, do concentration changes come from volume changes, mito changes, or both? In "ruling out" mitophagy -- a potentially important (and intuitive) influence, the argument is not presented as directly as it could be and it's not completely clear that it can in fact be ruled out in this way. There are a couple of other instances which I've put in the smaller points below.

      Thank you for acknowledging our efforts to show transparent and well-controlled experiments! We address each of the specific points below.

      3.3. full genus name when it first appears

      We will add the full name.

      3.4. I may be wrong here, but I thought the petite phenotype more classically arises from mtDNA deletion mutations, not loss? The way this is phrased implies that mtDNA loss is [always] the cause. Whether I'm wrong on that point or not, the petite phenotype should be described and referenced.

      We can expand the text and cite additional relevant papers. The term “petite” refers to any strain that is respiratory incompetent and leads to small colonies (not necessarily small cells!) (Seel et al. 2023). This can be mutations or gene loss (fragments) on the mtDNA (these are called cytoplasmic petite), or chemically induced loss of mtDNA (e.g. EtBr), or mutations of nuclear genes required for respiration (these are termed nuclear petite; some nuclear petites show loss of mtDNA in addition to the mutation in the nuclear genome) (Contamine and Picard 2000).

      3.5. para starting l59 -- should mention for context that mitochondria in (healthy, wildtype) yeast are generally much more fused than in other organisms

      ok.

      3.6. Fig 1C -- very odd choice of y-axis range! either start at zero or ensure that the data fill as much vertical space of the plot as possible

      True, this was probably some formatting relic. We will adapt the axis to fill the full space. Most of our axes start at 0, but that doesn’t make so much sense here, since we consider the solidity in the control as “baseline”.

      3.7. "wild-type like more tubular mitochondria" reads rather awkwardly. "more tubular mitochondria (as in the wild-type)"?

      Thank you, sounds better.

      3.8. l106 -- imaging artefacts? are mitos fragmenting because of photo stress? -- this is mentioned in l577-8 in the Methods, but the data from the growth rate and MMP comparison isn't given -- an SI figure would be helpful here. It would be reassuring to know that mito morphology wasn't changing in response to phototoxicity too.

      In the methods we just briefly point out that we have done all our “due diligence” controls to check that we do not generate phototoxicity, something that we highlight in the cited review. We do not explicitly have a figure for this, but figure S1A shows that the solidity of the mitochondrial network in control cells stays the same over 9 hours, even though these cells are exposed to the same cultivation and imaging regime as the depleted cells. We will also add a picture of control cells after 9 h. In S1B we show that control cells containing TIR but no AID tag treated with both chemicals imaged over 9 hours also show the same solidity (~mitochondrial morphology) as untreated control. Also, the doubling times of cells grown in our imaging system (Fig R1B) are very similar to the shake flask (Fig R1A). All in all, we are very confident that our imaging settings did not impact our reported phenotypes.

      3.9. para l146 -- so this suggests mtDNA-encoded proteins have a very rapid turnover, O(hours) -- is this known/reasonable?

      Reference (Christiano et al. 2014) suggests that respiratory chain proteins are shorter lived than the average yeast protein. However, based on Figure 2C we think the dynamics mostly speak for a dilution by growth.
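
      To illustrate what pure dilution by growth would look like, here is a minimal sketch in Python. It assumes that synthesis stops completely, that the protein is perfectly stable, and that cells keep doubling at a constant rate. The function name and the 1.5 h doubling time are hypothetical illustration values, not measurements from this study.

```python
def remaining_fraction(hours, doubling_time_h):
    """Fraction of the initial per-cell concentration left after `hours`
    if synthesis has stopped and the only loss is dilution by growth."""
    if doubling_time_h <= 0:
        raise ValueError("doubling time must be positive")
    # Each doubling of cell volume halves the concentration.
    return 0.5 ** (hours / doubling_time_h)


# Hypothetical doubling time of 1.5 h (an assumption, not a measured value).
after_3h = remaining_fraction(3.0, 1.5)   # two doublings
after_9h = remaining_fraction(9.0, 1.5)   # six doublings
print(after_3h, after_9h)
```

      Under these assumptions the concentration falls to a quarter after two doublings and to under 2% after six, so a decline over a few hours does not by itself require rapid protein turnover.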

      3.10. section l189 -- it's hard to reason fully about these statistics of mitochondrial concentration given that the petite phenotype is fundamentally affecting overall cell volume. can we have details on the cell size distribution in parallel with these results? to put it another way -- how does mitochondrial *amount* per cell change?

      This is a good point. We report mostly on mitochondrial “concentrations” because we think this is what the cell actually cares about (mitochondrial activity in relationship to cytosolic activity). But we will include additional graphs on mitochondrial amount as well as size distributions (Fig R1C, related to Fig 4F). We can already point out that the size distribution of the population does not change much in the first hours. The “petite” phenotype refers to small colonies on growth medium with limited supply of a fermentable carbon source, not to smaller size of single cells.

      3.11. l199 the mean in Fig S3C certainly does change -- it increases, clearly relative both to control and to its initial value. rather than sweeping this under the carpet we should look in more detail to understand it (a consequence of the increased skew of the distribution)?

      This relates somewhat to the previous point. The increase in average concentration is not due to an increased amount in the population, but due to the fact that it is the small buds that get a very high amount of the mitochondria which “exaggerates” the asymmetric/heterogenous distribution. This will be clarified by the figures we mention in the point above.
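
      The arithmetic behind this statement can be sketched with a toy example in Python. The amounts and volumes below are invented for illustration (arbitrary units) and the function names are hypothetical; the point is only that the average of per-cell concentrations can rise while the total amount in the population stays the same.

```python
def mean_of_concentrations(cells):
    """Average of per-cell concentrations; every cell counts equally."""
    return sum(amount / volume for amount, volume in cells) / len(cells)


def pooled_concentration(cells):
    """Total amount divided by total volume across the population."""
    total_amount = sum(amount for amount, _ in cells)
    total_volume = sum(volume for _, volume in cells)
    return total_amount / total_volume


# Each cell is (mitochondrial amount, cell volume); values are made up.
even = [(8.0, 8.0), (2.0, 2.0)]      # mother and bud at equal concentration
skewed = [(4.0, 8.0), (6.0, 2.0)]    # same total amount, small bud gets most

print(mean_of_concentrations(even), pooled_concentration(even))
print(mean_of_concentrations(skewed), pooled_concentration(skewed))
```

      In the skewed case the pooled concentration is unchanged, but the small bud with a very high concentration pulls the per-cell mean upwards.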

      3.12. para line 206 -- this doesn't make it clear whether your MMP signal is integrated over all mitochondria in the cell, or normalised by mitochondrial content? this matters quite a lot for the interpretation if the distributions of mitochondrial content are changing. reading on, this is even more important for para line 222. Reading further on, there is an equation on l612 that gives a definition, but it doesn't really clarify (apologies if I'm misunderstanding).

      For each cell, we basically calculate the relative mitochondrial enrichment of the MMP sensitive vs the MMP insensitive pre-sequence.

      So, MMP = (total intensity of mitochondrial pre-Cox4 Neongreen / total intensity of mitochondrial pre-Su9 Cardinal) / (total cytosolic pre-Cox4 Neongreen / total cytosolic pre-Su9 Cardinal).

      We calculate this value for each cell, but we do not have the optical resolution to calculate it for individual mitochondrial fragments.
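
      The per-cell ratio described above can be sketched in Python as follows. This is only an illustration of the formula as stated here: the function name, argument names and intensity values are hypothetical, and segmentation and background subtraction are assumed to have happened beforehand.

```python
def mmp_proxy(mito_cox4, mito_su9, cyto_cox4, cyto_su9):
    """Relative mitochondrial enrichment of the MMP-sensitive pre-Cox4 signal
    over the MMP-insensitive pre-Su9 signal, for a single cell.

    All arguments are total intensities in the mitochondrial or cytosolic
    compartment of that cell (arbitrary units).
    """
    if min(mito_su9, cyto_cox4, cyto_su9) <= 0:
        raise ValueError("denominator intensities must be positive")
    mito_ratio = mito_cox4 / mito_su9   # pre-Cox4 vs pre-Su9 inside mitochondria
    cyto_ratio = cyto_cox4 / cyto_su9   # the same ratio in the cytosol
    return mito_ratio / cyto_ratio


# Invented intensities, not measured data.
well_imported = mmp_proxy(800.0, 400.0, 100.0, 100.0)
poorly_imported = mmp_proxy(400.0, 400.0, 200.0, 100.0)
# Doubling the pre-Su9 signal in both compartments leaves the value unchanged.
rescaled = mmp_proxy(800.0, 800.0, 100.0, 200.0)
print(well_imported, poorly_imported, rescaled)
```

      Because the pre-Su9 signal appears in both the numerator and the denominator, an overall change in its level cancels out, which is why the value reports relative import rather than mitochondrial content.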

      Both constructs are driven by the same strong promoter, so transcription of the fluorophore should never limit the uptake. Also, in Figure 3D we compare control and depleted cells with similar total mitochondrial concentration, so the difference must be due to a different import of the two fluorophores, see also Fig S4D. The calculated “MMP” value is of course only a crude proxy for the actual membrane potential in millivolts and we do not want to make any claims on absolute values or quantitative differences. But essentially what we are interested in is “mitochondrial health/activity” and we think the system is good at reporting this. See also 2.2.

      3.13. l230 -- a point of personal interest -- low mito concentrations are connected to low "function" (MMP) and give extended division times -- this is interestingly exactly the model needed to reproduce observations in HeLa cells (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002416). That model went on to predict several aspects of downstream cellular behaviour -- it would be very interesting to see how compatible that picture (parameterised using HeLa observations) is with yeast!

      Thank you for pointing out your interesting paper, which we will include in our discussion. Another recent preprint about fission yeast (Chacko et al. 2025) also fits into this picture. Since you were kind enough to disclose your identity, we would be happy to follow up and discuss this further with you in person.

      3.14. l239 "less mitochondria" -- a bit tricky but I'd say "fewer mitochondria" or "less mitochondrial content"

      Thanks, we will think about how to best rephrase this, probably less mitochondrial content.

      3.15. Section l234 So here (and in Fig 4) the focus is on overall distributions of mitochondrial concentration in different cells (mother-to-be, mother, bud; gen 1, gen >1). But we've just seen that one effect of fzo1 is to broaden the distribution of mitochondrial concentration across cells. Can't we look in more depth at the implications of this heterogeneity? For example in Fig 4F (which is cool) we look at the distribution of all fzo1 mothers-to-be, mothers, and buds. But this loses information about the provenance. For example, do mothers-to-be with extremely low mito concentrations just push everything to the bud, while mothers-to-be with high mito concentrations distribute things more evenly? It would seem very easy and very interesting to somehow subset the distribution of mothers-to-be by concentration and see how different subsets behave.

      This is a good point. When analyzing the data, we pretty much plotted everything against everything and then chose the graphs that we think will best guide the reader through the story-line. We can make additional supplementary plots where we show the starting concentrations/amounts of the mother in relationship to the resulting split ratio at the end of the cycle (Fig R1D).
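
      One possible way to organise such a plot is sketched below in Python. The definition of the split ratio (fraction of the pair's mitochondrial signal that ends up in the bud), the bin edge, the function names and all numbers are assumptions made for illustration. They are not the analysis used for Fig R1D.

```python
from statistics import median


def split_ratio(mother_amount, bud_amount):
    """Fraction of the mother-bud pair's total signal found in the bud
    at the end of the cycle (assumed definition)."""
    total = mother_amount + bud_amount
    if total <= 0:
        raise ValueError("total amount must be positive")
    return bud_amount / total


def median_split_by_bin(records, edges):
    """Group cells by starting concentration and return the median split
    ratio per bin. `records` holds (start_concentration, mother_amount_end,
    bud_amount_end); `edges` are ascending bin boundaries."""
    bins = {}
    for conc, mother, bud in records:
        idx = sum(conc >= edge for edge in edges)  # index of the bin
        bins.setdefault(idx, []).append(split_ratio(mother, bud))
    return {idx: median(vals) for idx, vals in sorted(bins.items())}


# Invented example data, arbitrary units.
records = [
    (0.2, 1.0, 3.0),
    (0.3, 2.0, 2.0),
    (1.0, 6.0, 2.0),
    (1.2, 3.0, 1.0),
]
result = median_split_by_bin(records, edges=[0.5])
print(result)
```

      With real data, the bin edges could for instance be chosen as quantiles of the starting concentration so that each subset contains a similar number of cells.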

      3.16. l285 -- experimental design -- do we know that Atp6 will continue to be a good proxy for functional mtDNA in the face of the perturbations provided by Fzo1 depletion? Especially if there is impact on the expression of mitoribosomes, the relationship between mtDNA and Atp6 may look rather different in the mutant?

      This is actually our top-priority experiment now. We will use the HI-NESS system and possibly DAPI staining to make a more direct link to mtDNA/ nucleoid numbers, see 1.2.

      3.17. l290 -- ruled out mitophagy. This message could be much clearer. Comparing Fig S5C and Fig 3A side-by-side is a needlessly difficult task -- put Fig 3A into Fig S5. Then we see that when mitophagy is compromised, the distribution of mitochondrial concentration has a lower median and much lower upper quartile than in the mitophagy-equipped Fzo1 mutant? What is going on here? For a paper motivated by disentangling coupled mechanisms, this should be made clearer!

      Thanks for pointing this out. We can of course easily include the control in the corresponding figure. Compromising mitophagy is likely to generally affect mitochondrial health and turnover a little bit, independent of what is going on with Fzo1. The second evidence that speaks against large-scale mitophagy is the proteomics data: On population level the dynamics of the respiratory chain proteins are very different from those of other (nuclear encoded) mitochondrial proteins. We will add additional supplementary figures to make this more clear, see Fig R1E. Most mitochondrial proteins in the proteomics experiment stay constant in the first few hours, consistent with the imaging data showing that the mean mitochondrial content of the population does not change initially. This again highlights that it is the unequal distribution which is the problem and not massive degradation of mitochondria.

      3.18. With the Atp6 signal, how do we know that fluorescence from different cells is comparable? Buds will be smaller than mother cells for example, potentially leading to less occlusion of the fluorescent signal by other content in the cytoplasm

      This is of course a general problem that anyone faces doing quantitative fluorescence microscopy. From the technical side, we have done the best we could by taking a reasonable amount of z-slices and by choosing fluorophores that are in a range with little cellular background fluorescence (e.g. Neongreen is much better than GFP). From a practical standpoint, we are always comparing to the control, which is subject to the same technical limitations as the depleted cells and the cell sizes are very similar. So, even if we are systematically overestimating the Atp6 concentration in the bud by a few %, the difference to the control would still be qualitatively true. We therefore do not think that any of our conclusions are affected by this.

      3.19. l343 -- maintenance of mtDNA -- here the point about l285 (is the Atp6-mtDNA relationship the same in the Fzo1 mutant) is particularly important, as we're directly tying findings about the protein product to implications about the mtDNA

      We will carefully address this, see above.

      3.20. l367 -- on a first read this description of the model feels like lots of choices have been made without being fully justified. Why a log-normal distribution (when the fit to the data looks rather flawed); why the choice of 5 groups for nucleoid number (why not 3? or 8?); the process used for parameter fitting is very unclear (after reading the methods I think some of these values are read directly from the data, but the shapes of the distributions remain unexplained). l705 -- presumably the ratio was drawn from a log-normal distribution and then the corresponding nucleoid numbers were rounded to integers? the ratio itself wasn't rounded? (also l367) How were the log-normal distributions fitted to experiments (Figs. S7A,B)? Just by eye?

      We will update our model based on measured nucleoid counts and then explain more stringently the choices we make/ parameters we select.
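As an alternative to fitting by eye, a log-normal distribution can be fitted by maximum likelihood, which reduces to the mean and standard deviation of the log-transformed data. The sketch below uses synthetic ratios; the function name `fit_lognormal` and the parameter values are assumptions for illustration and do not describe how the authors fitted Figs. S7A,B.

```python
import numpy as np


def fit_lognormal(samples):
    """Maximum-likelihood log-normal fit: mean and std of the log-transformed data."""
    logs = np.log(np.asarray(samples, dtype=float))
    mu = logs.mean()
    sigma = logs.std()  # ddof=0 is the maximum-likelihood estimate
    return float(mu), float(sigma)


rng = np.random.default_rng(1)
# Synthetic inheritance ratios (NOT the study's measurements).
ratios = rng.lognormal(mean=-0.5, sigma=0.4, size=5000)

mu_hat, sigma_hat = fit_lognormal(ratios)
print(f"mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```

Comparing the fitted density with the histogram, or using a quantile-quantile plot of the log-transformed data against a normal distribution, would then show whether the log-normal shape is adequate.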

      3.21. l711 by random selection -- just at random? ("selection" could be confusing) Overall, it feels like the model may be too complicated for what it needs to show. Either (a) the model should show qualitatively that unequal inheritance and reduced production leads to rapid loss -- which a much simpler model, probably just involving a couple of lines of algebra, could show. Or (b) the model should quantitatively reproduce the particular numerical observations from the experiments -- it's not totally clear that it does this (do the cell-cycle-based decay timescales in Fig 7 correspond to the hour-based decay timescales in other plots, for example). At the moment the model is at a (b) level of detail but it's only clear that it's reporting the (a) level of results.

      If the HI-NESS and Fzo1 re-addition experiments work as explained above, all parameters will have direct experimental data, and we should get much closer to (a).
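For orientation, the kind of minimal model the reviewer describes under (a) might look like the sketch below. It is not the authors' model. It assumes that each copy is duplicated with probability `p_replicate` per cycle and that the followed cell keeps a Beta-distributed share of the copies at division, so the mean copy number changes by a factor of (1 + `p_replicate`) / 2 per cycle. All names and parameter values are hypothetical.

```python
import numpy as np


def fraction_without_copies(n_cycles, p_replicate, beta_shape,
                            n_lineages=20000, start_copies=20, seed=0):
    """Follow one cell per lineage through `n_cycles` divisions.

    Returns the fraction of lineages that end with zero copies.
    """
    rng = np.random.default_rng(seed)
    copies = np.full(n_lineages, start_copies, dtype=np.int64)
    for _ in range(n_cycles):
        # Production: each existing copy is duplicated with probability p_replicate.
        copies = copies + rng.binomial(copies, p_replicate)
        # Inheritance: the followed cell keeps a random share of the copies.
        # A large beta_shape gives shares close to 0.5; a small one gives very unequal splits.
        share = rng.beta(beta_shape, beta_shape, size=n_lineages)
        copies = rng.binomial(copies, share)
    return float(np.mean(copies == 0))


frac_equal = fraction_without_copies(10, p_replicate=1.0, beta_shape=50.0)
frac_unequal = fraction_without_copies(10, p_replicate=1.0, beta_shape=0.5)
frac_both = fraction_without_copies(10, p_replicate=0.5, beta_shape=0.5)
print(frac_equal, frac_unequal, frac_both)
```

Following a single cell per division is a simplification of a growing population, and cycles here are abstract units rather than hours, so this only addresses the qualitative point.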

      3.22. A lot of the discussion repeats the results; depending on editorial preferences some of this text could probably be pared back to focus on the literature connections and context.

      We will think about streamlining the discussion once some of the additional material alluded to above has been added.

      3.23. Data availability -- it looks like much of the data required to reproduce the results is not going to be made available. Images and proteomic data are promised, but the data associated with mitochondrial concentration and other features are not mentioned. For FAIR purposes all the data (including statistics from analysis of the images) should be published.

      We maybe didn’t phrase this clearly. All data will be made available. Where technically feasible, this will be directly accessible in a repository, otherwise by request to the corresponding author.

      On our OMERO server, we have deposited many TB of raw images as well as all the intermediate steps such as segmentation masks, and the csv files with all the extracted data for each cell (including background corrections etc). Additionally, we can include csvs with the data grouped in the way that we used to generate all the box plots etc. As of now, the OMERO data is unfortunately only available by requesting a personal guest login from our bioinformatics facility, but we were promised that with the next technical update there will be a public link available. The proteomics data and the model are already fully accessible. The raw western blot images with corresponding Ponceau staining will be included with the final publication either as additional supplementary material or in whatever format matches the journal requirements.

      3.24 l660 -- can an overview of the EM protocol be given, to avoid having to buy the Mayer 2024 article?

      The cited paper is open access. But we can also include more details in our method section.

      References:

      Chacko, L. A., H. Nakaoka, R. Morris, W. Marshall, and V. Ananthanarayanan. 2025. 'Mitochondrial function regulates cell growth kinetics to actively maintain mitochondrial homeostasis', bioRxiv.

      Christiano, R., N. Nagaraj, F. Frohlich, and T. C. Walther. 2014. 'Global proteome turnover analyses of the Yeasts S. cerevisiae and S. pombe', Cell Rep, 9: 1959-65.

      Contamine, V., and M. Picard. 2000. 'Maintenance and integrity of the mitochondrial genome: a plethora of nuclear genes in the budding yeast', Microbiol Mol Biol Rev, 64: 281-315.

      Deng, Jingti, Lucy Swift, Mashiat Zaman, Fatemeh Shahhosseini, Abhishek Sharma, Daniela Bureik, Francesco Padovani, Alissa Benedikt, Amit Jaiswal, Craig Brideau, Savraj Grewal, Kurt M. Schmoller, Pina Colarusso, and Timothy E. Shutt. 2025. 'A novel genetic fluorescent reporter to visualize mitochondrial nucleoids', bioRxiv: 2023.10.23.563667.

      Di Bartolomeo, F., C. Malina, K. Campbell, M. Mormino, J. Fuchs, E. Vorontsov, C. M. Gustafsson, and J. Nielsen. 2020. 'Absolute yeast mitochondrial proteome quantification reveals trade-off between biosynthesis and energy generation during diauxic shift', Proc Natl Acad Sci U S A, 117: 7524-35.

      Ebert, A. C., N. L. Hepowit, T. A. Martinez, H. Vollmer, H. L. Singkhek, K. D. Frazier, S. A. Kantejeva, M. R. Patel, and J. A. MacGurn. 2025. 'Sphingolipid metabolism drives mitochondria remodeling during aging and oxidative stress', bioRxiv.

      Jakubke, C., R. Roussou, A. Maiser, C. Schug, F. Thoma, R. Bunk, D. Horl, H. Leonhardt, P. Walter, T. Klecker, and C. Osman. 2021. 'Cristae-dependent quality control of the mitochondrial genome', Sci Adv, 7: eabi8886.

      Khan, Abdul Haseeb, Xuefang Gu, Rutvik J. Patel, Prabha Chuphal, Matheus P. Viana, Aidan I. Brown, Brian M. Zid, and Tatsuhisa Tsuboi. 2024. 'Mitochondrial protein heterogeneity stems from the stochastic nature of co-translational protein targeting in cell senescence', Nature Communications, 15: 8274.

      Martin, J., K. Mahlke, and N. Pfanner. 1991. 'Role of an energized inner membrane in mitochondrial protein import. Delta psi drives the movement of presequences', J Biol Chem, 266: 18051-7.

      Osman, C., T. R. Noriega, V. Okreglak, J. C. Fung, and P. Walter. 2015. 'Integrity of the yeast mitochondrial genome, but not its distribution and inheritance, relies on mitochondrial fission and fusion', Proc Natl Acad Sci U S A, 112: E947-56.

      Perić, Matea, Peter Bou Dib, Sven Dennerlein, Marina Musa, Marina Rudan, Anita Lovrić, Andrea Nikolić, Ana Šarić, Sandra Sobočanec, Željka Mačak, Nuno Raimundo, and Anita Kriško. 2016. 'Crosstalk between cellular compartments protects against proteotoxicity and extends lifespan', Scientific Reports, 6: 28751.

      Roussou, Rodaria, Dirk Metzler, Francesco Padovani, Felix Thoma, Rebecca Schwarz, Boris Shraiman, Kurt M. Schmoller, and Christof Osman. 2024. 'Real-time assessment of mitochondrial DNA heteroplasmy dynamics at the single-cell level', The EMBO Journal, 43: 5340-59.

      Seel, A., F. Padovani, M. Mayer, A. Finster, D. Bureik, F. Thoma, C. Osman, T. Klecker, and K. M. Schmoller. 2023. 'Regulation with cell size ensures mitochondrial DNA homeostasis during cell growth', Nat Struct Mol Biol, 30: 1549-60.

      Vowinckel, J., J. Hartl, R. Butler, and M. Ralser. 2015. 'MitoLoc: A method for the simultaneous quantification of mitochondrial network morphology and membrane potential in single cells', Mitochondrion, 24: 77-86.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #3

      Evidence, reproducibility and clarity

      This article addresses the connection between perturbed mitochondrial structure and genetics in yeast. When mitochondrial fusion is compromised, what is the chain of causality -- the mechanism -- that leads to mtDNA populations becoming depleted? This is a fascinating question, linking physical cell biology to population genetics. I admire the philosophy of the research, acknowledging and attempting to control for the many possible confounding influences. The manuscript describes the context and the research tightly and digestibly; the figures illustrate the results in a clear and natural way.

      For transparency, I am Iain Johnston and I am happy for this review to be treated as public domain. To my eyes my most important shortcoming as a reviewer is my relative lack of familiarity with the yeast fzo1 mutant; while I am familiar with analysis of yeast mito morphology and mtDNA segregation, a reviewer familiar with the nuances of this strain and its culture would be a useful complement.

      I have a few more general points and a collection of smaller points below that I believe might help make the story more robust.

      General points

      1. About the use of Atp6 as a good proxy for mtDNA content. This is assumed from l285 onwards, based on a previous publication. As the link is fairly central to part of the paper's arguments, and the system in this study is being perturbed in several different ways, a stronger argument or demonstration that this link remains intact (and unchanged, as it is used in comparisons) would seem important.
      2. About confounding variables and processes. The study does an admirable job of being transparent and attempting to control for the many different influences involved in the physical-genetic link. But some remain less clearly unpacked, including some I think could be quite important. For example, there is a lot of focus on mito concentration -- but given the phenotypes are changing the sizes of cells, do concentration changes come from volume changes, mito changes, or both? In "ruling out" mitophagy -- a potentially important (and intuitive) influence, the argument is not presented as directly as it could be and it's not completely clear that it can in fact be ruled out in this way. There are a couple of other instances which I've put in the smaller points below.

      Smaller points

      l47 full genus name when it first appears

      l58 I may be wrong here, but I thought the petite phenotype more classically arises from mtDNA deletion mutations, not loss? The way this is phrased implies that mtDNA loss is [always] the cause. Whether I'm wrong on that point or not, the petite phenotype should be described and referenced.

      para starting l59 -- should mention for context that mitochondria in (healthy, wildtype) yeast are generally much more fused than in other organisms

      Fig 1C -- very odd choice of y-axis range! either start at zero or ensure that the data fill as much vertical space of the plot as possible

      l105 "wild-type like more tubular mitochondria" reads rather awkwardly. "more tubular mitochondria (as in the wild-type)"?

      l106 -- imaging artefacts? are mitos fragmenting because of photo stress? -- this is mentioned in l577-8 in the Methods, but the data from the growth rate and MMP comparison isn't given -- an SI figure would be helpful here. It would be reassuring to know that mito morphology wasn't changing in response to phototoxicity too.

      para l146 -- so this suggests mtDNA-encoded proteins have a very rapid turnover, O(hours) -- is this known/reasonable?

      section l189 -- it's hard to reason fully about these statistics of mitochondrial concentration given that the petite phenotype is fundamentally affecting overall cell volume. can we have details on the cell size distribution in parallel with these results? to put it another way -- how does mitochondrial amount per cell change?

      l199 the mean in Fig S3C certainly does change -- it increases, clearly relative both to control and to its initial value. rather than sweeping this under the carpet we should look in more detail to understand it (a consequence of the increased skew of the distribution)?

      para line 206 -- this doesn't make it clear whether your MMP signal is integrated over all mitochondria in the cell, or normalised by mitochondrial content? this matters quite a lot for the interpretation if the distributions of mitochondrial content are changing. reading on, this is even more important for para line 222. Reading further on, there is an equation on l612 that gives a definition, but it doesn't really clarify (apologies if I'm misunderstanding).

      l230 -- a point of personal interest -- low mito concentrations are connected to low "function" (MMP) and give extended division times -- this is interestingly exactly the model needed to reproduce observations in HeLa cells (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002416). That model went on to predict several aspects of downstream cellular behaviour -- it would be very interesting to see how compatible that picture (parameterised using HeLa observations) is with yeast!

      l239 "less mitochondria" -- a bit tricky but I'd say "fewer mitochondria" or "less mitochondrial content"

      Section l234 So here (and in Fig 4) the focus is on overall distributions of mitochondrial concentration in different cells (mother-to-be, mother, bud; gen 1, gen >1). But we've just seen that one effect of fzo1 is to broaden the distribution of mitochondrial concentration across cells. Can't we look in more depth at the implications of this heterogeneity? For example in Fig 4F (which is cool) we look at the distribution of all fzo1 mothers-to-be, mothers, and buds. But this loses information about the provenance. For example, do mothers-to-be with extremely low mito concentrations just push everything to the bud, while mothers-to-be with high mito concentrations distribute things more evenly? It would seem very easy and very interesting to somehow subset the distribution of mothers-to-be by concentration and see how different subsets behave.

      l285 -- experimental design -- do we know that Atp6 will continue to be a good proxy for functional mtDNA in the face of the perturbations provided by Fzo1 depletion? Especially if there is impact on the expression of mitoribosomes, the relationship between mtDNA and Atp6 may look rather different in the mutant?

      l290 -- ruled out mitophagy. This message could be much clearer. Comparing Fig S5C and Fig 3A side-by-side is a needlessly difficult task -- put Fig 3A into Fig S5. Then we see that when mitophagy is compromised, the distribution of mitochondrial concentration has a lower median and much lower upper quartile than in the mitophagy-equipped Fzo1 mutant? What is going on here? For a paper motivated by disentangling coupled mechanisms, this should be made clearer!

      With the Atp6 signal, how do we know that fluorescence from different cells is comparable? Buds will be smaller than mother cells for example, potentially leading to less occlusion of the fluorescent signal by other content in the cytoplasm

      l336 -- similar to the Jajoo et al. mechanism in fission yeast -- but are you talking about feedback control of the mtDNA or the protein (or mRNA) product?

      l343 -- maintenance of mtDNA -- here the point about l285 (is the Atp6-mtDNA relationship the same in the Fzo1 mutant) is particularly important, as we're directly tying findings about the protein product to implications about the mtDNA

      l367 -- on a first read this description of the model feels like lots of choices have been made without being fully justified. Why a log-normal distribution (when the fit to the data looks rather flawed); why the choice of 5 groups for nucleoid number (why not 3? or 8?); the process used for parameter fitting is very unclear (after reading the methods I think some of these values are read directly from the data, but the shapes of the distributions remain unexplained). l705 -- presumably the ratio was drawn from a log-normal distribution and then the corresponding nucleoid numbers were rounded to integers? the ratio itself wasn't rounded? (also l367) How were the log-normal distributions fitted to experiments (Figs. S7A,B)? Just by eye? l711 by random selection -- just at random? ("selection" could be confusing) Overall, it feels like the model may be too complicated for what it needs to show. Either (a) the model should show qualitatively that unequal inheritance and reduced production leads to rapid loss -- which a much simpler model, probably just involving a couple of lines of algebra, could show. Or (b) the model should quantitatively reproduce the particular numerical observations from the experiments -- it's not totally clear that it does this (do the cell-cycle-based decay timescales in Fig 7 correspond to the hour-based decay timescales in other plots, for example). At the moment the model is at a (b) level of detail but it's only clear that it's reporting the (a) level of results.

      A lot of the discussion repeats the results; depending on editorial preferences some of this text could probably be pared back to focus on the literature connections and context.

      Data availability -- it looks like much of the data required to reproduce the results is not going to be made available. Images and proteomic data are promised, but the data associated with mitochondrial concentration and other features are not mentioned. For FAIR purposes all the data (including statistics from analysis of the images) should be published.

      l660 -- can an overview of the EM protocol be given, to avoid having to buy the Mayer 2024 article?

      Significance

      This is a powerful and thoughtful study that provides a collection of new mechanistic insights into the link between physical and genetic properties of mitochondria in yeast. Cell biologists, geneticists, and the mitochondrial field will find this of potentially deep interest. Because of the mode and dynamics of inheritance in budding yeast, findings here may not be directly transferrable to other eukaryotes, but these insights are still of interest for researchers outside of yeast for their insight into how this well-studied system manages its mitochondrial populations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Weakness:

      Although a familiarity preference is not found, it is possible that this is related to the nature of the stimuli and the amount of learning that they offer. While infants here are exposed to the same perceptual stimulus repeatedly, infants can also be familiarised to more complex stimuli or scenarios. Classical statistical learning studies for example expose infants to specific pseudo-words during habituation/familiarisation, and then test their preference for familiar vs novel streams of pseudo-words. The amount of learning progress in these probabilistic learning studies is greater than in perceptual studies, and familiarity preferences may thus be more likely to emerge there. For these reasons, I think it is important to frame this as a model of perceptual habituation. This would also fit well with the neural net that was used, which is processing visual stimuli rather than probabilistic structures. If statements in the discussion are limited to perceptual paradigms, they would make the arguments more compelling. 

      Thank you for your thoughtful feedback. We have now qualified our claims more explicitly throughout the manuscript to clarify the scope of our study. Specifically, we have made the following revisions:

      (1) Title Update: We have modified the title to “A stimulus-computable rational model of visual habituation in infants and adults” to explicitly specify the domain of our model.

      (2) Qualifying Language Throughout Introduction: We have refined our language throughout the introduction to ensure the scope of our claims is clear. Specifically, we have emphasized that our model applies to visual habituation paradigms by incorporating qualifying language where relevant. At the end of Section 1, we have revised the statement to: "Habituation and dishabituation to sequential visual stimuli are well described by a rational analysis of looking time." This clarification makes sure that our model is framed within the context of visual habituation paradigms, particularly those involving structured sequences of stimuli, while acknowledging that habituation extends beyond the specific cases we study.

      (3) New Paragraph on Scope in the Introduction: We have added language in the Introduction acknowledging that while visual habituation is a fundamental mechanism for learning, it is not the only form of habituation. Specifically, we highlight that: “While habituation is a broadly studied phenomenon across cognitive domains—including language acquisition, probabilistic learning, and concept formation—our focus here is on visual habituation, where infants adjust their attention based on repeated exposure to a visual stimulus.”

      (4) New Paragraph on Scope in the General Discussion: We have also revisited this issue in the General Discussion. We added a dedicated paragraph discussing the scope: “This current work focuses on visual habituation, a fundamental but specific form of habituation that applies to sequential visual stimuli. While habituation has been studied across various domains, our model is specifically designed to account for looking time changes in response to repeated visual exposure. This focus aligns with our choice of perceptual representations derived from CNNs, which process visual inputs rather than abstract probabilistic structures. Visual habituation plays a foundational role in infant cognition, as it provides a mechanism for concept learning based on visual experience. However, it does not encompass all forms of habituation, particularly those involving complex rule learning or linguistic structures. Future work should investigate whether models like RANCH can be extended to capture habituation mechanisms in other learning contexts.”

      Reviewer #2 (Public review):

      There are no formal tests of the predictions of RANCH against other leading hypotheses or models of habituation. This makes it difficult to evaluate the degree to which RANCH provides an alternative account that makes distinct predictions from other accounts. I appreciate that because other theoretical descriptions haven't been instantiated in formal models this might be difficult, but some way of formalising them to enable comparison would be useful. 

      We appreciate the reviewer's concern regarding formal comparisons between RANCH and other leading hypotheses of habituation. A key strength of RANCH is that it provides quantitative, stimulus-computable predictions of looking behavior, something that existing theoretical accounts do not offer. Because previous models cannot generate predictions about behavior, we cannot directly compare them with RANCH.

      The one formal model that the reviewer might be referring to is the Goldilocks model, discussed in the introduction and shown in Figure 1. We did in fact spend considerable time in an attempt to implement a version of the Goldilocks model as a stimulus-computable framework for comparison. However, we found that it required too many free parameters, such as the precise shape of the inverted U-shape that the Goldilocks model postulates, making it difficult to generate robust predictions that we would feel confident attributing to this model specifically. This assertion may come as a surprise to a reader who expects that formal models should be able to make predictions across many situations, but prior models 1) cannot be applied to specific stimuli, and 2) do not generate dynamics of looking time within each trial. These are both innovations of our work. Instead, even prior formal proposals derive metrics (e.g., surprisal) that can only be correlated with aggregate looking time. And prior, non-formalized theories, such as the Hunter and Ames model, are simply not explicit enough to implement. 

      To clarify this point, we have now explicitly stated in the Introduction that existing models are not stimulus-computable and do not generate predictions for looking behavior at the level of individual trials: 

      “Crucially, RANCH is the first stimulus-computable model of habituation, allowing us to derive quantitative predictions from raw visual stimuli. Previous theoretical accounts have described broad principles of habituation, but they do not generate testable, trial-by-trial predictions of looking behavior. As a result, direct comparisons between RANCH and these models remain challenging: existing models do not specify how an agent decides when to continue looking or disengage, nor do they provide a mechanistic link between stimulus properties and looking time. By explicitly modeling these decision processes, RANCH moves beyond post-hoc explanations and offers a computational framework that can be empirically validated and generalized to new contexts.” 

      We also highlight that our empirical comparisons in Figure 1 evaluate theoretical predictions based on existing conceptual models using behavioral data, rather than direct model-to-model comparisons: 

      “Addressing these three challenges allowed us to empirically test competing hypotheses about habituation and dishabituation using our experimental data (Figure \ref{fig:conceptual}). However, because existing models do not generate quantitative predictions, we could not directly compare RANCH to alternative computational models. Instead, we evaluated whether RANCH accurately captured key behavioral patterns in looking time.”

      The justification for using the RMSEA fitting approach could also be stronger - why is this the best way to compare the predictions of the formal model to the empirical data? Are there others? As always, the main issue with formal models is determining the degree to which they just match surface features of empirical data versus providing mechanistic insights, so some discussion of the level of fit necessary for strong inference would be useful. 

      Thank you for recommending additional clarity on our choice of evaluation metrics. RMSE is a very standard measure (for example, it’s the error metric used in fitting standard linear regression!). On the other hand, it captures absolute rather than relative errors. Correlation-based measures (e.g., r and r²-type measures) provide a measure of relative distance between predictive measures. In our manuscript we reported both RMSE and R². In the revised manuscript, we have now:

      (1) Added a paragraph in the main text explaining that RMSE captures the absolute error in the same units as looking time, whereas r² reflects the relative proportion of variance explained by the model: 

      “RANCH predictions qualitatively matched habituation and dishabituation in both infants and adults. To quantitatively evaluate these predictions, we fit a linear model (adjusting model‐generated samples by an intercept and scaling factor) and then assessed two complementary metrics. First, the root mean squared error (RMSE) captures the absolute error in the same units as looking time. Second, the coefficient of determination ($R^2$) measures the relative variation in looking time that is explained by the scaled model predictions. Since each metric relies on different assumptions and highlights distinct aspects of predictive accuracy, they together provide a more robust assessment of model performance. We minimized overfitting by employing cross‐validation—using a split‐half design for infant data and ten‐fold for adult data—to compute both RMSE and $R^2$ on held‐out samples.”

      (2) We updated Table 1 to include both RMSE and R² for each model variant and linking hypothesis. We now report both RMSE and R² across the two experiments.

      We hope these revisions address your concerns by offering a more comprehensive and transparent assessment of our model’s predictive accuracy.
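The evaluation described in the quoted paragraph (a linear rescaling with an intercept and scaling factor, then RMSE and R² on held-out data) can be sketched as below. This is a generic illustration on synthetic data, not the authors' code: the function name `cross_validated_fit`, the random fold assignment, and the averaging of per-fold metrics are assumptions.

```python
import numpy as np


def cross_validated_fit(model_pred, observed, n_folds=10, seed=0):
    """Mean held-out RMSE and R^2 after a linear rescaling fitted on the training folds."""
    model_pred = np.asarray(model_pred, dtype=float)
    observed = np.asarray(observed, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(observed.size), n_folds)
    rmse_per_fold, r2_per_fold = [], []
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Fit the scaling factor and intercept on the training folds only.
        slope, intercept = np.polyfit(model_pred[train_idx], observed[train_idx], deg=1)
        residuals = observed[test_idx] - (intercept + slope * model_pred[test_idx])
        # RMSE: absolute error, in the same units as the observed looking time.
        rmse_per_fold.append(np.sqrt(np.mean(residuals ** 2)))
        # R^2: share of held-out variance explained by the rescaled predictions.
        ss_tot = np.sum((observed[test_idx] - observed[test_idx].mean()) ** 2)
        r2_per_fold.append(1.0 - np.sum(residuals ** 2) / ss_tot)
    return float(np.mean(rmse_per_fold)), float(np.mean(r2_per_fold))


rng = np.random.default_rng(2)
# Synthetic predictions and looking times (NOT the study's data).
pred = rng.uniform(0.0, 1.0, size=200)
looking_time = 2.0 + 5.0 * pred + rng.normal(0.0, 0.5, size=200)

rmse, r2 = cross_validated_fit(pred, looking_time, n_folds=10)
print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```

Setting `n_folds=2` corresponds to a split-half design, and `n_folds=10` to ten-fold cross-validation.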

      Regarding your final question, the desired level of fit for insight, our view is that – at least in theory development – measures of fit should always be compared between alternatives (rather than striving for some absolute level of prediction). We have attempted to do this by comparing fit within- and across-samples and via various ablation studies. We now make this point explicit in the General Discussion:

      More generally, while there is no single threshold for what constitutes a “good” model fit, the strength of our approach lies in the relative comparisons across model variants, linking hypotheses, and ablation studies. In this way, we treat model fit not as an absolute benchmark, but as an empirical tool to adjudicate among alternative explanations and assess the mechanistic plausibility of the model’s components.

      The difference in model predictions for identity vs number relative to the empirical data seems important but isn't given sufficient weight in terms of evaluating whether the model is or is not providing a good explanation of infant behavior. What would falsification look like in this context? 

      We appreciate the reviewer’s observation regarding the discrepancy between model predictions and the empirical data for identity vs. number violations. We were also very interested in this particular deviation and we discuss it in detail in the General Discussion, noting that RANCH is currently a purely perceptual model, whereas infants’ behavior on number violations may reflect additional conceptual factors. Moreover, because this analysis reflects an out-of-sample prediction, we emphasize the overall match between RANCH and the data (see our global fit metrics) rather than focusing on a single data point. Infant looking time data also exhibit considerable noise, so we caution against over-interpreting small discrepancies in any one condition. In principle, a more thorough “falsification” would involve systematically testing whether larger deviations persist across multiple studies or stimulus sets, which is beyond the scope of the current work.

      For the novel image similarity analysis, it is difficult to determine whether any differences are due to differences in the way the CNN encodes images vs in the habituation model itself - there are perhaps too many free parameters to pinpoint the nature of any disparities. Would there be another way to test the model without the CNN introducing additional unknowns? 

      Thank you for raising this concern. In our framework, the CNN and the habituation model operate jointly to generate predictions, so it can be challenging to parse out whether any mismatches arise specifically from one component or the other. However, we are not worried that the specifics of our CNN procedure introduce free parameters because:

      (1) The CNN introduces no additional free parameters in our analyses, because it is a pre‐trained model not fitted to our data. 

      (2) We tested multiple CNN embeddings and observed similar outcomes, indicating that the details of the CNN are unlikely to be driving performance (Figure 12).

      Moreover, the key contribution of our second study is precisely that the model can generalize to entirely novel stimuli without any parameter adjustments. By combining a stable, off‐the‐shelf CNN with our habituation model, we can make out‐of‐sample predictions—an achievement that, to our knowledge, no previous habituation model has demonstrated.

      Related to that, the model contains lots of parts - the CNN, the EIG approach, and the parameters, all of which may or may not match how the infant's brain operates. EIG is systematically compared to two other algorithms, with KL working similarly - does this then imply we can't tell the difference between an explanation based on those two mechanisms? Are there situations in which they would make distinct predictions where they could be pulled apart? Also in this section, there doesn't appear to be any formal testing of the fits, so it is hard to determine whether this is a meaningful difference. However, other parts of the model don't seem to be systematically varied, so it isn't always clear what the precise question addressed in the manuscript is (e.g. is it about the algorithm controlling learning? or just that this model in general when fitted in a certain way resembles the empirical data?) 

      Thank you for highlighting these points about the model’s components and the comparison of EIG- vs. KL-based mechanisms. Regarding the linking hypotheses (EIG, KL, and surprisal), our primary goal was to assess whether rational exploration via noisy perceptual sampling could account for habituation and dishabituation phenomena in a stimulus-computable fashion. Although RANCH contains multiple elements—including the CNN for perceptual embedding, the learning model, and the action policy (EIG or KL)—we did systematically vary the “linking hypothesis” (i.e., whether sampling is driven by EIG, KL, or surprisal). We found that EIG and KL gave very similar fits, while surprisal systematically underperformed.

      We agree that future experiments could be designed to produce diverging predictions between EIG and KL, but examining these subtle differences is beyond the scope of our current work. Here, we sought to establish that a rational model of habituation, driven by noisy perceptual sampling, can deliver strong quantitative predictions—even for out-of-sample stimuli—rather than to fully disentangle forward- vs. backward-looking information metrics.

      We disagree, however, that we did not evaluate or formally compare other aspects of the model. In Table 1 we report ablation studies of different aspects of the model architecture (e.g., removal of learning and noise components). Further, the RMSE and R² values reported in Table 1 and Section 4.2.3 can be treated as out-of-sample estimates of performance and used for direct comparison (because Table 1 uses cross-validation and Section 4.2.3 reports out of sample predictions). 
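
      As an illustration only, the kind of out-of-sample comparison described above can be sketched as follows. This is not the authors' code: it assumes a simple linear scaling from model samples to looking time and a k-fold split, and the function names (`fit_line`, `cross_validated_fit`) are hypothetical.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least-squares slope and intercept (assumed linear scaling)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(pred, obs):
    """Root mean squared error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def r_squared(pred, obs):
    """Coefficient of determination of predictions against observations."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for p, o in zip(pred, obs))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def cross_validated_fit(model_samples, looking_times, n_folds=5, seed=0):
    """Fit the scaling on training folds, score only on held-out points."""
    idx = list(range(len(model_samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    preds = [0.0] * len(idx)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line(
            [model_samples[i] for i in train],
            [looking_times[i] for i in train],
        )
        for i in held_out:
            preds[i] = slope * model_samples[i] + intercept
    return rmse(preds, looking_times), r_squared(preds, looking_times)
```

      Because every prediction is made for a point excluded from fitting, the resulting RMSE and R² can be compared directly across model variants without a separate significance test.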

      Perhaps the reviewer is interested in statistical hypothesis tests, but we do not believe these are appropriate here. Cross-validation provides a metric of out-of-sample generalization and model selection based on the resulting numerical estimates. Significance testing is not typically recommended, except in a limited subset of cases (see e.g. Vanwinckelen & Blockeel, 2012 and Raschka, 2018).

      Reviewer #1 (Recommendations for the authors):

      "We treat the number of samples for each stimulus as being linearly related to looking time duration." Looking times were not log transformed? 

      Thank you for your question. The assumption of a linear relationship between the model’s predicted number of samples and looking time duration is intended as a measurement transformation, not a strict assumption about the underlying distribution of looking times. This linear mapping is used simply to establish a direct proportionality between model-generated samples and observed looking durations.

      However, in our statistical analyses, we do log-transform the empirical looking times to account for skewness and stabilize variance. This transformation is standard practice when analyzing infant looking time data but is independent of how we map model predictions to observed times. Since there is no a priori reason to assume that the number of model samples must relate to looking time in a strictly log-linear way, we retained a simple linear mapping while still applying a log transformation in our analytic models where appropriate.

      It would be nice to have figures showing the results of the grid search over the parameter values. For example, a heatmap with sigma on x and eta on y, and goodness of fit indicated by colour, would show the quality of the model fit as a function of the parameters' values, but also if the parameters estimates are correlated (they shouldn't be). 

      Thank you for the suggestion. We agree that visualizing the grid search results can provide a clearer picture of how different parameter values affect model fit. In the supplementary materials, we already present analyses where we systematically search over one parameter at a time to find the best-fitting values.

      We also explored alternative visualizations, including heatmaps where sigma and eta are mapped on the x and y axes, with goodness-of-fit indicated by color. However, we found that the goodness of fit was very similar across parameter settings, making the heatmaps difficult to interpret due to minimal variation in color. This lack of variation in fit reflects the observation that our model predictions are robust to changes in parameter settings, which allows us to report strong out of sample predictions in Section 4. Instead, we opted to use histograms to illustrate general trends, which provide a clearer and more interpretable summary of the model fit across different parameter settings. Please see the heatmaps below, if you are interested. 
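
      For readers who want to reproduce this kind of check, a minimal sketch of a two-parameter grid search is given below. It is an assumption-laden illustration, not the authors' pipeline: `toy_model` is a hypothetical stand-in for a model run, and the parameter names simply follow the reviewer's sigma/eta suggestion.

```python
import itertools
import math

def rmse(pred, obs):
    """Root mean squared error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def grid_search(model, observed, sigmas, etas):
    """Return {(sigma, eta): RMSE} for every combination on the grid."""
    return {
        (s, e): rmse(model(s, e), observed)
        for s, e in itertools.product(sigmas, etas)
    }

def fit_spread(grid):
    """Range of RMSE across the grid; a small value means a flat heatmap."""
    scores = list(grid.values())
    return max(scores) - min(scores)

def toy_model(sigma, eta):
    """Hypothetical stand-in: predictions depend only weakly on parameters."""
    return [10.0 - t + 0.01 * sigma + 0.01 * eta for t in range(5)]

observed = [10.0, 9.0, 8.0, 7.0, 6.0]
grid = grid_search(toy_model, observed, sigmas=[0.5, 1.0, 2.0], etas=[0.1, 0.5, 1.0])
best = min(grid, key=grid.get)  # parameter pair with the lowest RMSE
```

      The dictionary returned by `grid_search` can be reshaped into a sigma-by-eta matrix for a heatmap, and `fit_spread` gives a single number summarizing how much the fit varies across settings.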

      Author response image 1.

      Model fit (measured by RMSE) across a grid of prior values for Alpha, Beta, and V shows minimal variation. This indicates that the model’s performance is robust to changes in prior assumptions.

      Regarding section 5.4, paragraph 2: It might be interesting to notice that a potential way to decorrelate these factors is to look at finer timescales (see Poli et al., 2024, Trends in Cognitive Sciences), which the current combination of neural nets and Bayesian inference could potentially be adapted to do. 

      Thank you for this insightful suggestion. We agree that examining finer timescales of looking behavior could provide valuable insights into the dynamics of attention and learning. In response, we have incorporated language in Section 5.4 to highlight this as a potential future direction: 

      Another promising direction is to explore RANCH’s applicability to finer timescales of looking behavior, enabling a more detailed examination of within-trial fluctuations in attention. Recent work suggests that analyzing moment-by-moment dynamics can help disentangle distinct learning mechanisms \autocite{poli2024individual}. Since RANCH models decision-making at the level of individual perceptual samples, it is well-suited to capture these fine-grained attentional shifts.

      Previous work integrating neural networks with Bayesian (like) models could be better acknowledged: Blakeman, S., & Mareschal, D. (2022). Selective particle attention: Rapidly and flexibly selecting features for deep reinforcement learning. Neural Networks, 150, 408-421. 

      Thank you for this feedback. We have now incorporated this citation into our discussion section: 

      RANCH integrates structured perceptual representations with Bayesian inference, allowing for stimulus-computable predictions of looking behavior and interpretable parameters at the same time. This integrated approach has been used to study selective attention \autocite{blakeman2022selective}.

      Unless I missed it, I could not find an OSF repository (although the authors refer to an OSF repository for a previous study that has not been included). In general, sharing the code would greatly help with reproducibility. 

      Thanks for this comment. We apologize that, although all of our code and data were available through GitHub, we did not provide links in the manuscript. We have now added these links at the end of the introduction section. 

      Reviewer #2 (Recommendations for the authors):

      Page 7 "infants clearly dishabituated on trials with longer exposures" - what are these stats comparing? Novel presentation to last familiar? 

      Thank you for pointing out this slightly confusing passage. The statistics reported compare looking time between the novel and familiar test trials after longer exposures. We have now added the following language: 

      Infants clearly dishabituated on trials with longer exposures, looking longer at the novel stimulus than the familiar stimulus after long exposure.

      Order effects were covaried in the model - does the RANCH model predict similar order effects to those observed in the empirical data, ie can it model more generic changes in attention as well as the stimulus-specific ones? 

      Thank you for this question. If we understand correctly, you are asking whether RANCH can capture order effects over the course of the experiment, such as general decreases in attention across blocks. Currently, RANCH does not model these block-level effects—it is designed to predict stimulus-driven looking behavior rather than more general attentional changes that occur over time such as fatigue. In our empirical analysis, block number was included as a covariate to account for these effects statistically, but RANCH itself does not have a mechanism to model block-to-block attentional drift independent of stimulus properties. This is an interesting direction for future work, where a model could integrate global attentional dynamics alongside stimulus-specific learning. To address this, we have added a sentence in the General Discussion saying:

      Similarly, RANCH does not capture more global attention dynamics, such as block-to-block attentional drift independent of stimulus properties.

      "We then computed the root mean squared error (RMSE) between the scaled model results and the looking time data." Why is this the most appropriate approach to considering model fit? Would be useful to have a brief explanation. 

      Thank you for pointing this out. We believe that we have now addressed this issue in Response to Comment #2 from Reviewer 1. 

      The title of subsection 3.3 made me think that you would be comparing RANCH to alternate hypotheses or models but this seems to be a comparison of ways of fitting parameters within RANCH - I think worth explaining that. 

      We have now added a sentence in the subsection to make the content of the comparison more explicit: 

      Here we evaluated different ways of specifying RANCH's decision-making mechanism (i.e., different "linking hypotheses" within RANCH).

      3.5 would be useful to have some statistics here - does performance significantly improve? 

      As discussed above, we systematically compared model variants using cross-validated RMSE and R² values, which provide quantitative evidence of improved performance. While these differences are substantial, we do not report statistical hypothesis tests, as significance testing is not typically appropriate for model comparison based on cross-validation (see Vanwinckelen & Blockeel, 2012; Raschka, 2018). Instead, we rely on out-of-sample predictive performance as a principled basis for evaluating model variants.

      It would be very helpful to have a formal comparison of RANCH and other models - this seems to be largely descriptive at the moment (3.6).

      We believe that we have now addressed this issue in our response to the first comment.

      Does individual infant data show any nonlinearities? Sometimes the position of the peak look is very heterogenous and so overall there appears to be no increase but on an individual level there is. 

      Thank you for your question. Given our experimental design, each exposure duration appears in separate blocks rather than in a continuous sequence for each infant. Because of this, the concept of an individual-level nonlinear trajectory over exposure durations does not directly apply. Instead, each infant contributes looking time data to multiple distinct conditions, rather than following a single increasing-exposure sequence. Any observed nonlinear trend across exposure durations would therefore be a group-level effect rather than a within-subject pattern.

      In 4.1, why 8 or 9 exposures rather than a fixed number? 

      We used slightly variable exposure durations to reduce the risk that infants develop fixed expectations about when a novel stimulus will appear. We have now clarified this point in the text.

      Why do results differ for the model vs empirical data for identity? Is this to do with semantic processing in infants that isn't embedded in the model? 

      Thank you for your comment. The discrepancy between the model and empirical data for identity violations is related to the discrepancy we discussed for number violations in the General Discussion. As noted there, RANCH relies on perceptual similarity derived from CNN embeddings, which may not fully capture distinctions that infants make.

      The model suggests the learner’s prior on noise is higher in infants than adults, so produces potentially mechanistic insights. 

      We agree! One of the key strengths of RANCH is its ability to provide mechanistic insights through interpretable parameters. The finding that infants have a higher prior on perceptual noise than adults aligns with previous research suggesting that early visual processing in infants is more variable and less precise.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This is a manuscript describing outbreaks of Pseudomonas aeruginosa ST 621 in a facility in the US using genomic data. The authors identified and analysed 254 P. aeruginosa ST 621 isolates collected from a facility from 2011 to 2020. The authors described the relatedness of the isolates across different locations, specimen types (sources), and sampling years. Two concurrently emerged subclones were identified from the 254 isolates. The authors predicted that the most recent common ancestor for the isolates can be dated back to approximately 1999 after the opening of the main building of the facility in 1996. Then the authors grouped the 254 isolates into two categories: 1) patient-to-patient; or 2) environment-to-patient using SNP thresholds and known epidemiological links. Finally, the authors described the changes in resistance gene profiles, virulence genes, cell wall biogenesis, and signaling pathway genes of the isolates over the sampling years.

      Strengths:

      The major strength of this study is the utilisation of genomic data to comprehensively describe the characteristics of a long-term Pseudomonas aeruginosa ST 621 outbreak in a facility. This fills the data gap of a clone that could be clinically important but easily missed from microbiology data alone.

      Weaknesses:

      The work would further benefit from a more detailed discussion on the limitations due to the lack of data on patient clinical information, ward movement, and swabs collected from healthcare workers to verify the transmission of Pseudomonas aeruginosa ST 621, including potential healthcare worker to patient transmission, patient-to-patient transmission, patient-to-environment transmission, and environment-to-patient transmission. For instance, the definition given in the manuscript for patient-to-patient transmission could not rule out the possibility of the existence of a shared contaminated environment. Equally, as patients were not routinely swabbed, unobserved carriers of Pseudomonas aeruginosa ST 621 could not be identified and the possibility of misclassifying the environment-to-patient transmissions could not be ruled out. Moreover, reporting of changes in rates of resistance to imipenem and cefepime could be improved by showing the exact p-values (perhaps with three decimal places) rather than dichotomising the value at 0.05. By doing so, readers could interpret the strength of the evidence of changes.

      Impact of the work:

      First, the work adds to the growing evidence implicating sinks as long-term reservoirs for important MDR pathogens, with direct infection control implications. Moreover, the work could potentially motivate investments in generating and integrating genomic data into routine surveillance. The comprehensive descriptions of the Pseudomonas aeruginosa ST 621 clones outbreak is a great example to demonstrate how genomic data can provide additional information about long-term outbreaks that otherwise could not be detected using microbiology data alone. Moreover, identifying the changes in resistance genes and virulence genes over time would not be possible without genomic data. Finally, this work provided additional evidence for the existence of long-term persistence of Pseudomonas aeruginosa ST 621 clones, which likely occur in other similar settings.

      We thank the reviewer for their thorough evaluation of our work, and for the suggested improvements. A main goal of this study was to show that integrating routine WGS in the clinic is a game changer for infection control efforts. We appreciate that this aspect was highlighted as a strength by this reviewer. While some of the weaknesses identified are inherent to the data (or lack thereof) available for this study, we have revised the manuscript to include a detailed discussion on limitations (sampling, thresholds of genetic relatedness, definition and categories etc.) that could influence the genomic inferences. We also provided exact p-values for the changes in rates of resistance, as requested. Finally, we have positively answered all the specific recommendations suggested by the reviewer and modified the manuscript accordingly.

      Reviewer #2 (Public Review):

      Summary:

      The authors present a report of a large Pseudomonas aeruginosa hospital outbreak affecting more than 80 patients with first sampling dates in 2011 that stretched over more than 10 years and was only identified through genomic surveillance in 2020. The outbreak strain was assigned to the sequence type 621, an ST that has been associated with carbapenem resistance across the globe. Ongoing transmission coincided with both increasing resistance without acquisition of carbapenemase genes as well as the convergence of mutations towards a host-adapted lifestyle.

      Strengths:

      The convincing genomic analyses indicate spread throughout the hospital since the beginning of the century and provide important benchmark findings for future comparison.

      The sampling was based on all organisms sent to the Multidrug-resistant Organism Repository and Surveillance Network across the U.S. Military Health System.

      Using sequencing data from patient and environmental samples for phylogenetic and transmission analyses as well as determining recurring mutations in outbreak isolates allows for insights into the evolution of potentially harmful pathogens with the ultimate aim of reducing their spread in hospitals.

      Weaknesses:

      The epidemiological information was limited and the sampling methodology was inconsistent, thus complicating the inference of exact transmission routes. Epidemiological data relevant to this analysis include information on the reason for sampling, patient admission and discharge data, and underlying frequency of sampling and sampling results in relation to patient turnover.

      We thank the reviewer for their thoughtful feedback on our manuscript and for highlighting the quality of the genomic analyses. We agree that the lack of patient epi data (e.g. date of admission and discharge) and the inconsistent sampling through the years are limitations of this study. We have revised the manuscript to acknowledge these limitations and discuss how not having this data complicates the inference of exact transmission routes. Finally, we have positively answered all the specific recommendations suggested by the reviewer and modified the manuscript accordingly.

      Reviewer #3 (Public Review):

      Summary:

      This paper by Stribling and colleagues sheds light on a decade-long P. aeruginosa outbreak of the high-risk lineage ST-621 in a US Military hospital. The origins of the outbreak date back to the late 90s and it was mainly caused by two distinct subclones SC1 and SC2. The data of this outbreak showed the emergence of antibiotic resistance to cephalosporin, carbapenems, and colistin over time highlighting the emerging risk of extensively resistant infections due to P. aeruginosa and the need for ongoing surveillance.

      Strengths:

      This study overall is well constructed and clearly written. Since detailed information on floor plans of the building and transfers between facilities was available, the authors were able to show that these two subclones emerged in two separate buildings of the hospital. The authors support their conclusions with prospective environmental sampling in 2021 and 2022 and link the role of persistent environmental contamination to sustaining nosocomial transmission. Information on resistance genes in repeat isolates for the same patients allowed the authors to detect the emergence of resistance within patients. The conclusions have broader implications for infection control at other facilities. In particular, the paper highlights the value of real-time surveillance and environmental sampling in slowing nosocomial transmission of P. aeruginosa.

      Weaknesses:

      My major concern is that the authors used fixed thresholds and definitions to classify the origin of an infection. As such, they were not able to give uncertainty measures around transmission routes nor quantify the relative contribution of persistent environmental contamination vs patient-to-patient transmission. The latter would allow the authors to quantify the impact of certain interventions. In addition, these results represent a specific US military facility and the transmission patterns might be specific to that facility. The study also lacked any data on antibiotic use that could have been used to relate to and discuss the temporal trends of antimicrobial resistance.

      We thank the reviewer for their evaluation of our work and for highlighting the broad implications of our findings regarding the application of real-time surveillance to suppress nosocomial transmission. We agree with the reviewer that fixed thresholds and definitions are imperfect to classify the origin of an infection. The design of this study (e.g. inconsistent sampling through time) was not conducive to providing a comprehensive/quantitative measurement of transmission routes. Thus, we decided to apply conservative thresholds of genetic relatedness and strict conditions (e.g. time between isolate collection, shared hospital location etc.) to favor specificity, as our goal was simply to establish that cases of environment-to-patient transmission did happen. In the absence of a truth set, we have not performed sensitivity analysis, but we are conducting a follow-up study to compare inferences from MCMC models to our original fixed-thresholds predictions. This limitation is now discussed in the revised manuscript. Finally, we have positively answered all the specific recommendations suggested by the reviewer and modified the manuscript accordingly, including the addition of Figure S3.

      Reviewer #1 (Recommendations For The Authors):

      The definitions used on lines 391-396 are necessarily somewhat arbitrary, but it would be helpful to have a little bit more justification for the choices made, particularly for the definition of environmental involving the "3x the number of years they were separated". It seems a little hard to square this with the more relaxed 10 SNP cutoff for a patient-to-patient designation. Are there reasons for thinking SNP differences associated with environmental transmission should be smaller than for patient-to-patient, or is the aim here just to set the bar higher for assuming an environmental source? Because these definitions are quite arbitrary, there could also be some value in exploring the sensitivity of the results to these assumptions.

      Thank you. We agree with the reviewers that SNP thresholds, albeit necessary, are arbitrary and that more discussion/justification was needed to put the genomic inferences in context. We have revised the manuscript to indicate that: 1/ the 10 SNP cutoff for a patient-to-patient designation was set to account for the known evolution rate of P. aeruginosa (inferred by BEAST at 2.987E-7 subs/site/year in this study and similar to previous estimates PMID: 24039595) and the observed within-host variability (now displayed in revised Fig. 1E). We note that this SNP distance alone was not sufficient and that an epi link (patients on the same ward at the same time) needed to be established. 2/ the environment-to-patient definition was indeed set to be most conservative (nearly identical isolates in two patients from the same ward with no known temporal overlap for > 365 days). This was done to favor high specificity, as this inference relied solely on clinical isolates (i.e. the identical environmental strain in the patient-environment-patient chain was not sampled). For these clinical isolates to have acquired no/very little mutation in that much time, no/low replication is expected and, although unsampled, we propose this most likely happened on hospital surfaces.
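
      To make the scale of the 10 SNP cutoff concrete, the arithmetic behind the substitution rate can be sketched as below. Only the rate (2.987E-7 subs/site/year) comes from the response above; the genome size of 6.5 Mb is an assumed round figure for P. aeruginosa, and the number of sites actually used in the BEAST analysis may differ, so the outputs are indicative only.

```python
# Rate reported in the response above (substitutions per site per year).
RATE_PER_SITE_PER_YEAR = 2.987e-7

# ASSUMPTION: round genome size for illustration; not stated in the response.
ASSUMED_GENOME_SIZE_BP = 6_500_000

def expected_snps_per_year(rate_per_site_per_year, genome_size_bp):
    """Expected substitutions accumulated along one lineage in one year."""
    return rate_per_site_per_year * genome_size_bp

def expected_pairwise_snps(rate_per_site_per_year, genome_size_bp, years):
    """Expected SNP distance between two isolates whose lineages split
    `years` ago; both lineages accumulate changes, hence the factor of 2."""
    return 2 * expected_snps_per_year(rate_per_site_per_year, genome_size_bp) * years

per_year = expected_snps_per_year(RATE_PER_SITE_PER_YEAR, ASSUMED_GENOME_SIZE_BP)
two_year_pair = expected_pairwise_snps(RATE_PER_SITE_PER_YEAR, ASSUMED_GENOME_SIZE_BP, 2)
```

      Under these assumptions a lineage accumulates roughly two substitutions per year, so a 10 SNP distance corresponds to a few years of divergence, while near-identical isolates separated by more than a year imply little or no replication in the interval.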

      While the term "core genome" should be familiar to most readers, "shell genome" and "cloud genome" are less widely known, and an explanation of what these terms mean here would be helpful.

      Thank you. We have revised the manuscript to define the core, shell, and cloud genomes as gene sets found in ≥ 99%, ≥ 95% and ≥ 15% of isolates, respectively.

      In the first paragraph of the discussion, it could be added that in many cases for clinically important Gram negatives short read sequencing alone will fail to detect transmission events as outbreaks can be driven by plasmid spread with only very limited clonal spread (see, for example, https://www.nature.com/articles/s41564-021-00879-y )

      Thank you. We agree this is an important/emerging aspect of surveillance. However, the goal of this discussion point was to explain why such a large outbreak was missed prior to implementing WGS (short-read) surveillance. We feel that discussing “plasmid outbreaks” (which are not at play here, and are relatively rare in P. aeruginosa compared to the Enterobacteriaceae) and the need for long-read sequencing would distract from the narrative. 

      line 599 What does "Mock" mean here? Would it be more accurate to say it is a simplified floor plan?

      Thank you. “Mock” was changed to “simplified”

      IPAC abbreviation is only used once - spelling it out in full would increase readability.

      Revised manuscript was edited as suggested.

      MHS is only used twice.

      Revised manuscript was edited to spell out Military Health System

      Line 364: full stop missing.

      Revised manuscript was edited as suggested.

      Line 401: Bayesian rather than bayesian.

      Revised manuscript was edited as suggested.

      Reviewer #2 (Recommendations For The Authors):

      Thank you for giving me the opportunity to review this interesting manuscript.

      The conclusions of this paper are mostly well supported by the data presented, but epidemiological information was limited and the sampling methodology was inconsistent, thus complicating inference of exact transmission routes.

      Major issues:

      What was the baseline frequency of clinical and/or screening samples of Pseudomonas aeruginosa at the hospital? Neither Figure 1D nor Table S1 allows for differentiating between clinical and screening samples. Most isolates were cultured from clinical materials, and there is no information about the patients' length of stay and their respective sampling dates. Is there any possibility of finding out whether the samples were collected for clinical or screening purposes? Would it be possible to include the patients' admission data to determine whether the strains were imported into the hospital or related to a previous stay, e.g. among known carriers? Also, the issue of sampling dates vs. patient stay on the ward should be addressed, as there may be an overlap in patients' stay on the ward but no overlap in terms of sampling dates or even missing samples (missing links).

      We have revised the manuscript to address this important point: i) 16 isolates were from surveillance swabs and are labelled “Surveillance” in Table S1. The remaining 237 were clinical isolates; ii) unfortunately, because the sampling was done under a public health surveillance framework, we do not have access to historical patient data (admission/discharge date, wards, rooms, etc.) and we cannot calculate length of stay or better identify patient overlap. These limitations are now acknowledged in the discussion of the revised manuscript.

      In order to evaluate the extent of the outbreak, more epidemiological data would be useful. What is the size of the hospital, what is the average patient turnover, and what is the average length of stay in ICU and non-ICU? Is there any specialization besides the military label?

      We have revised the manuscript to indicate that Facility A is a 425-bed medical center and is the only Level 1 trauma center in the Military Health System. Unfortunately, the data to calculate length of stay, throughout the years, in ICU and non-ICU, were not available to us. This limitation is now also acknowledged in the discussion.

      Perhaps the authors could attempt to discuss the extent to which large outbreaks like these may be considered as part of unavoidable evolutionary processes within the hospital microbiome as opposed to accumulation and transmission of potentially harmful genes/clones, and differentiate between the putative community spread without any epidemiological links on the one hand, and hospital outbreaks that could be targeted by local infection prevention activities on the other hand.

      We respectfully disagree with the suggestion that this large outbreak “may be considered as part of unavoidable evolutionary processes within the hospital microbiome” and should be opposed to “transmission of potentially harmful genes/clones”. As a matter of fact, our data showed that infection control staff at Facility A responded with multiple interventions, including closing sinks, replacing tubing, and using foaming detergents. This resulted in slowing the spread of the ST621 outbreak with just 3 cases identified in 2022, 0 cases in 2023 and 1 case in 2024. This is now discussed in the revised manuscript.

      Page 5, lines 88-92 and lines 101-104. It seems as if the outbreak was identified only by means of genomic surveillance. This raises questions as to the rationale for sampling and sequencing, especially prior to 2020. Considering 11 cases per year between 2011 and 2016, one could assume such an outbreak would have been noticed without sequencing data.

      The MRSN was created in 2010, in response to the outbreak of MDR Acinetobacter baumannii in US military personnel returning from Iraq and Afghanistan. Between 2011 and 2017, the MRSN collected MDR isolates (mandate for all MDR ESKAPE but compliance varied between years and facilities) from across the Military Health System and, for select isolates (e.g. high-risk isolates carrying ESBLs or carbapenemases) performed molecular typing by PFGE. In 2017 the MRSN started to perform whole genome sequencing of its entire repository. In 2020, a routine prospective sequencing service was started and first detected the ST621 outbreak. A retrospective analysis of historical isolate genomes (2011-2019) identified additional cases. The first paragraph of the discussion lists possible factors to explain why the ST621 escaped detection by traditional approaches. We believe 11 cases per year is not a strong signal when stratified by month, wards, or both, especially for a clone lacking a carbapenemase and without a remarkable antibiotic susceptibility profile. 

      Did the infection control personnel suspect transmission? If yes, was the sampling and submission of samples to the MRSN adapted based on the epidemiologic findings?

      The ST621 outbreak was unsuspected before the initial genomic detection in 2020. Until that point, MDR isolates only (Magiorakos et al PMID: 21793988) were collected but compliance was variable through time. Quickly thereafter (starting in 2021), complete sampling of all clinical P. aeruginosa (MDR or not) from Facility A was started. The manuscript was revised to clarify those details of the sampling strategy.

      Is there any information about how many environmental sites were sampled without evidence of ST621 / screening samples were cultured without evidence of Pseudomonas aeruginosa?

      For patient isolates, only 16 isolates were from surveillance swabs. The remaining 237 were clinical isolates. No denominator data was available to calculate P. aeruginosa and ST-621 positivity rate in surveillance swabs throughout the time period. For environmental isolates, a total of 159 swabs were taken from 55 distinct locations in 8 wards/units including the ER. This data is now included in the revised manuscript. However, a complete analysis of these swabs (positivity rate for ESKAPE pathogens, P. aeruginosa, per ward/floor/room, per swab type (sink drain, bed rail etc.) etc.) is beyond the scope of this study and is being performed as a follow up investigation.

      Page 5 line 89 and page 39 Figure S1B. Please describe how the allelic distance for the cluster threshold was selected.

      As indicated in the legend of Figure S1B, no thresholds were applied. All ST621 isolates ever sequenced by the MRSN were included. All except 3 isolates differed by 0-23 cgMLST alleles. The remaining 3 were distant by 88-89 allelic differences. The text was revised to clarify this point.

      Page 5 lines 99-100. Could the authors please provide some distribution measures (e.g. IQR).

      Done as requested. The revised manuscript now reads “…of just 38 single nucleotide polymorphisms (SNPs), and an IQR of 19 (Fig. 1A, Table S1).”

      Page 5 line 102. Could the authors please provide some distribution measures (e.g. IQR).

      Please see above. A chart was created and is now included as Fig. S2.

      Page 6 line 107 and page 34 figure 1c. In the text it is stated that isolates were collected in 27 wards, the figure 1C depicts 26 wards and n/a.

      Thank you for spotting this inconsistency. This has been fixed in the revised manuscript.

      Page 6 lines 117-118. Samples collected in the emergency room would imply samples collected on admission, already addressed previously. Did the authors investigate a potential import into the hospital from community reservoirs or were all these isolates collected among patients who had been previously admitted to the hospital and/or tested positive for the outbreak strain?

      We agree that samples collected in the ER imply samples collected on admission. Of the 29 ER isolates only 9 (31%) were primary isolates (first detection in a new patient) which suggests a majority were from returning patients at Facility A. Because the sampling was done under a public health surveillance framework, we do not have access to historical patient data (admission/discharge date, wards, rooms, etc.) to investigate/confirm that these 9 patients had previous visits at Facility A. This point is now discussed in the revised manuscript.

      Page 6 line 128. This could also represent increased selective pressure. However, according to Table S1, the 28 isolates collected in 2011 (the number does not match with Figure 1D) were from many different wards, thus indicating earlier spread throughout the hospital.

      Yes, we agree. Please note that Table S1 lists all isolates for 2011 whereas Figure 1D focuses on primary isolates (the first isolate from each patient) only.

      Page 7 line 133. Both Figure 2 and the discussion section, page 13 line 296 suggest the year 2005 instead of 2004?

      Thank you for catching this typographical error. This was corrected to 2004 in the revised manuscript.

      Figure 1E. The figure should also depict intra-patient diversity for comparison.

      Thank you for this great suggestion. We have revised Figure 1E accordingly.

      Page 7, lines 146-147 Could the authors attempt explaining the upper part of the bimodal peaks?

      This is an all-vs-all SNP analysis for all inter-patient isolates. For each isolate, all distances to other isolates are reported, not only the smallest. The upper peaks represent comparisons to isolates from a different outbreak subclone (SC1 vs SC2).

      Page 7, line 150 This is a very small number considering the extent of the outbreak and suggests a large number of missing links. Or does this rather imply continuous import and evolution over time that does not necessarily represent transmission within the hospital?

      We believe all cases were due to transmission happening within the hospital. Based on conservative thresholds (genetic relatedness and epi link, or lack thereof) the precise origin from another patient (n=10) or a contaminated surface (n=12) can be inferred. For the remaining 60 patients, with the available sampling, the conditions we chose are not met and we simply do not conclude whether a direct patient-to-patient or an environmental origin was more likely.

      Page 8 line 155. What does the temporal overlap refer to - sampling date versus patient's stay on the ward? Please specify.

      The temporal overlap was investigated from sampling dates, as dates of patient admission/discharge were not available.

      Page 8, line 157: What does primary/serial isolate mean - first and follow-up samples of ST621 per patient?

      Yes. Primary isolate is used to designate the first isolate from a patient. Serial isolates designate follow-up samples of ST621.

      Page 8 line 165: Table S3 and Figure 3 only refer to environmental samples from three wards. Ward 20 rooms 2 and 18 as well as ward 1 rooms 1 and 6 were hotspots - is there any information on the specific infection control/disinfection measures? Addressed in discussion page 12, lines 273-275, but no information on what was actually done.

      The manuscript was revised to indicate the precise disinfection measures that were taken. A follow-up study is ongoing to assess long-term efficacy and monitor possible retrograde growth from previously contaminated sinks.

      Page 8 line 175: Evaluation of change in resistance fraction over time - There may have been a selection bias with an inconsistent number of strains sequenced per year.

      Yes, incomplete sampling and possible selection bias are now listed with other limitations of this study in the discussion of the revised manuscript.

      Page 9 line 183: The referral to Table S1 is unclear, I could not find the number and the specific isolates selected for long-read sequencing.

      Thank you. This has been added to the revised Table S1.

      Page 10 lines 217-225 and Figure 4C: Perhaps it is possible to better align what is written in the text and the caption of the figure. The caption does not clarify that only one patient develops colistin resistance (what was the reason to include the other patients?).

      Thank you. We have revised the text and the caption of the figure to clarify that only isolates from one patient developed colistin resistance. The isolates from the other patients on Fig. 4C are shown to provide context and accurately map the emergence of the PhoQE77fs mutation.  

      Page 10, lines 228-229 and Table S5: How is it possible to identify those 64 genes in Table S5?

      We have revised Table S5 to facilitate the identification of the 64 genes with ≥ 2 independently acquired mutations (excluding SYN). Specifically, we have added column E labeled “Counts independent mutations per locus (excluding SYN)”. A total of 205 rows (in this table each row is a variant) have a value ≥ 2 and these represent 64 genes (upon deduplication of locus tags).  
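
      The filtering and deduplication step described above can be sketched in a few lines. This is a minimal illustration, not the authors' script: the field names `locus_tag` and `independent_count` (standing in for column E) and the toy rows are hypothetical.

```python
def recurrent_mutation_genes(rows, min_count=2):
    """Return (variant rows with count >= min_count, deduplicated locus tags).

    Each row is one variant. 'independent_count' is a hypothetical name for
    the per-locus count of independently acquired, non-synonymous mutations.
    """
    flagged = [r for r in rows if r["independent_count"] >= min_count]
    genes = sorted({r["locus_tag"] for r in flagged})
    return flagged, genes


# Toy rows (hypothetical locus tags): several variants can map to one gene.
toy_rows = [
    {"locus_tag": "locus_A", "independent_count": 2},
    {"locus_tag": "locus_A", "independent_count": 2},
    {"locus_tag": "locus_B", "independent_count": 3},
    {"locus_tag": "locus_C", "independent_count": 1},
]
flagged_rows, flagged_genes = recurrent_mutation_genes(toy_rows)
print(len(flagged_rows), flagged_genes)
```

      Applied to the revised Table S5, the same logic would be expected to reduce the 205 flagged variant rows to the 64 unique genes reported in the response.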

      Page 13, lines 280-281: Where is the information on chronic infection presented? Serial cultures would not necessarily mean chronic infection.

      Yes, we agree this was not the appropriate characterization, and it was revised to ‘long-term’ infections.

      Page 14 line 306: Emergence of colistin resistance in a single patient, correct?

      Yes. This was further clarified in the text.

      Page 14 lines 315-320: This should go to the results section. In particular disinfection, closing, and replacing of tubing should be mentioned in the results section in reference to the results presented in Table S3.

      Thank you. We have considered this suggestion and have decided to leave this discussion as the closing paragraph of this publication. A follow-up study is ongoing to assess the long-term efficacy of these interventions on ST-621 but also on other outbreak clones at Facility A.

      Methods

      Page 15 lines 330-333: Perhaps it is possible to avoid redundancy.

      Thank you. We have revised the text accordingly.

      Page 15 lines 341: Information on which isolates were subjected to long-read sequencing is missing.

      Thank you. This has been added to the revised Table S1.

      Page 16 line 345: Was there a particular reason why Newbler was chosen?

      No. At the time Newbler was the default assembler built in the MRSN bacterial genome analysis pipeline and QC processes.

      Page 16, line 357-358: What was the rationale for selecting this isolate as reference genome?

      This isolate was chosen because it was collected early in the outbreak and phylogenetic analysis revealed it had low root to tip divergence.

      Page 16 line 361: Why 310 isolates, if only 253 were assigned to the outbreak clone and only a subset of those were collected in facility A?

      This was a typographical error that has been corrected in the revised manuscript (it now reads “…set of 253 isolates.”).

      Page 17 lines 387-395: What is the reason that intra-patient diversity was not included in the set of criteria for SNP distances?

      The observed within host variability (now displayed in revised Fig. 1E) was taken into consideration when setting SNP thresholds for categorizing patient-to-patient transmission or environment-to-patient event. This is now clarified in the revised manuscript.

      Page 17 line 392: How was the threshold of <=10 SNPs determined?

      The 10 SNP cutoff to infer a patient-to-patient transmission event was set to account for the known evolution rate of P. aeruginosa (inferred by BEAST at 2.987E-7 subs/site/year in this study, and similar to previous estimates PMID: 24039595) and the observed within host variability (now displayed in revised Fig. 1E). We note that this SNP distance was not sufficient and that an epi link (patients on the same ward within the same month) needed to be established.
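
      The fixed-threshold logic described in these responses can be illustrated with a short sketch. It is not the authors' implementation. The 10-SNP cutoff, the same-ward/same-month epi link, and the 365-day gap come from the responses; approximating "same month" as 31 days, reusing the 10-SNP cutoff for the environmental rule, and the function and parameter names are all assumptions made for illustration.

```python
from datetime import date


def classify_origin(snps, same_ward, sampled_case, sampled_other,
                    max_snps=10, epi_window_days=31, env_gap_days=365):
    """Label the likely origin of a case from one comparison isolate.

    snps            SNP distance between the two isolates
    same_ward       True if both isolates were sampled on the same ward
    sampled_*       sampling dates (admission dates were not available)
    epi_window_days assumed stand-in for "within the same month"
    """
    gap = abs((sampled_case - sampled_other).days)
    if snps > max_snps:
        return "undetermined"            # not closely related enough
    if same_ward and gap <= epi_window_days:
        return "patient-to-patient"      # close genome plus epi link
    if gap > env_gap_days:
        return "environment-to-patient"  # near-identical, no temporal overlap
    return "undetermined"                # conservative default


print(classify_origin(3, True, date(2020, 3, 10), date(2020, 3, 1)))
print(classify_origin(3, True, date(2021, 6, 1), date(2020, 3, 1)))
print(classify_origin(3, False, date(2020, 3, 10), date(2020, 3, 1)))
```

      Defaulting to "undetermined" whenever a rule is not met favors specificity over sensitivity, which matches the conservative intent stated by the authors.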

      Page 17 line 395 and Figure 2: What was the assumed average mutation rate per genome per year?

      Thank you. The mean substitution rate inferred by BEAST was 2.987E-7 subs/site/year, similar to estimates from previous studies on P. aeruginosa outbreaks (e.g. PMID: 24039595).
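
      Because the reviewer asked for a rate per genome, the per-site rate can be converted with simple arithmetic. The sketch below assumes a genome length of 6.5 Mb purely for illustration; this value is not taken from the manuscript, and the number of sites actually analysed (for example, a core alignment) would give a different figure.

```python
RATE_PER_SITE_PER_YEAR = 2.987e-7   # from the BEAST estimate quoted above
ASSUMED_GENOME_SITES = 6_500_000    # hypothetical genome length (assumption)


def subs_per_genome_per_year(rate=RATE_PER_SITE_PER_YEAR,
                             sites=ASSUMED_GENOME_SITES):
    """Expected substitutions per genome per year along one lineage."""
    return rate * sites


def expected_pairwise_snps(years_since_split, rate=RATE_PER_SITE_PER_YEAR,
                           sites=ASSUMED_GENOME_SITES):
    """Expected SNPs between two isolates that diverged `years_since_split`
    years ago; both lineages accumulate changes, hence the factor of 2."""
    return 2 * rate * sites * years_since_split


per_genome = subs_per_genome_per_year()
print(round(per_genome, 2))
print(round(expected_pairwise_snps(2.5), 1))
```

      Under this assumed genome length, the rate corresponds to roughly two substitutions per genome per year.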

      Reviewer #3 (Recommendations For The Authors):

      Please find (line-by-line comments) on each section of the manuscript below:

      Introduction

      Line 86: I am wondering why the authors state ">28 facilities" instead of the exact number of facilities from which these lineages were recovered.

      Thank you. Manuscript was revised to provide the exact number of facilities. It now reads “…recovered from 37 and 28 facilities, respectively.”

      Methods

      It's not clear to me which criteria were used for collecting these isolates (both prospective and retrospective). I understand that some of the data are described in more detail in Lebreton et al., but I did not find the specific criteria for the collection of the isolates, and I imagine that these might differ between facilities. Would it be possible to comment on that and add a short paragraph in the Methods section?

      Thank you. This lack of clarity was also raised by other reviewers, and we have revised the manuscript to indicate that: 1) MDR isolates only (Magiorakos et al PMID: 21793988) were collected from 2011-2020 with the same criteria for all facilities, although compliance was variable through time and between facilities; and 2) starting in 2021, all P. aeruginosa isolates, irrespective of their susceptibility profile, were collected from Facility A.

      The data comes from a US Military hospital. Is this related to the US Veterans Affairs Healthcare system? Is there more detailed information about the demographics of the patient population?

      Facility A is part of the Military Health System (MHS) which provides care for active service members and their families. This is distinct from the US Veterans Affairs Healthcare system. Only limited patient data was accessible to us as this study was done as part of our public health surveillance activities. Patient age (avg. 57.2 +/- 21.0) and gender (ratio male/female 1.7) are provided in the revised manuscript. 

      Line 384ff: The origin of infection was inferred based on the SNP threshold and epidemiological links. However, recombination events can complicate the interpretation of SNP data. Have the authors attempted to account for this?

      Thank you. We agree that recombination events can complicate the interpretation of SNP data. We used Gubbins v2.3.1 to filter out recombination from the core SNP alignment, as indicated in the revised manuscript.

      The authors' definition of environment-to-patient transmission seems conservative (nearly identical strain and no known temporal overlap for > 365 days). Have the authors changed the threshold, performed sensitivity analyses, and tested how this would affect their results?

      Indeed, acknowledging that fixed thresholds have limitations in their ability to accurately predict the origin of infections, we took a conservative approach to favor specificity as our goal was simply to establish that cases of environment-to-patient transmission did happen. In the absence of a truth set, we have not performed sensitivity analysis, but we are conducting a follow-up study to compare inferences from MCMC models to our original predictions. This limitation is now discussed in the revised manuscript.

      The authors don't seem to incorporate the role of healthcare workers in the transmission process. Could they comment on this? I am assuming that environment-to-patient transmission could either be directly from the environment to the patient or via a healthcare worker. I think it's fine to make simplifying assumptions here but it would be great if this was explicitly described.

      Thank you for this suggestion. We have not sampled the hands of healthcare workers in this study. As a result, the reviewer is correct to say that we made the simplifying assumption that healthcare workers would be possible intermediates in either environment-to-patient or patient-to-patient transmissions, as previously described by others (PMID: 8452949). This limitation is now discussed in the revised manuscript.

      Page 5, line 100: What does "all vs all" mean? Based on the supplement, I assume it's the pairwise distance and then averaged across all of those. It would improve the readability of the manuscript if the authors could briefly define this term and then maybe refer to Table S1.

      Thank you. We have created Fig.S2 and revised the manuscript to state that ST-621 isolates from facility A belonged to the same outbreak clone with a distance (averaged all vs all pairwise comparison) of just 38 single nucleotide polymorphisms (SNPs), and an IQR of 19 (Fig. S2, Table S1).
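
      To make the "all vs all" term concrete, the sketch below computes every pairwise SNP distance in a toy alignment and summarises the distances by their mean and IQR. It is a minimal illustration using only the Python standard library; the sequences and isolate names are invented, and the authors' own pipeline is not shown in the response.

```python
from itertools import combinations
from statistics import mean, quantiles


def snp_distance(seq_a, seq_b):
    """Count aligned positions that differ, ignoring gaps and ambiguous bases."""
    return sum(1 for a, b in zip(seq_a, seq_b)
               if a != b and a in "ACGT" and b in "ACGT")


def all_vs_all_summary(alignment):
    """Summarise every unordered pair of isolates (each pair counted once)."""
    dists = [snp_distance(alignment[i], alignment[j])
             for i, j in combinations(sorted(alignment), 2)]
    q1, _, q3 = quantiles(dists, n=4, method="inclusive")
    return {"n_pairs": len(dists), "mean": mean(dists), "iqr": q3 - q1}


# Toy core-genome alignment (hypothetical isolates, 8 sites).
toy_alignment = {
    "iso1": "ACGTACGT",
    "iso2": "ACGTACGA",
    "iso3": "ACGAACGA",
    "iso4": "TCGAACGA",
}
summary = all_vs_all_summary(toy_alignment)
print(summary)
```

      Reporting the IQR alongside the mean matters here because an all-vs-all distribution can be bimodal when isolates from two subclones are compared.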

      Figure 1D: It would be interesting to see additional figures in the supplement on the percentage of sequenced isolates per year and whether it varies across the different sources/sites. Is there any information on which isolates were chosen for sequencing?

      Lack of clarity in the sampling/sequencing scheme was raised by multiple reviewers and we have provided a thorough response to earlier comments. We also have revised the material and methods section accordingly. Finally, we have created Fig. S3 to show the percentage of sequenced isolates per year across different sources/sites, as suggested by the reviewer. No noticeable patterns were observed. 

      It seems like only a subset of all clinical isolates were sequenced. Would it be possible that SC2 was present already earlier but not picked up until a certain date?

      Although all isolates received by the MRSN were sequenced, compliance varied through time so it is true that not all clinical isolates were sequenced between 2011-2019. As such, we fully agree with this hypothesis and discuss this possibility as BEAST analysis placed the origin of SC2 in 2004 while the first detection of an SC2 isolate was in December 2012. This limitation is now discussed in the revised manuscript.

      Could the authors elaborate on whether the isolates resulted from single-colony picks? Is it possible that the apparent absence of a subclone is due to the fact that only a single colony was picked?

      Yes, the isolates resulted from single-colony picks except when the presence of different colony morphologies was noted. In the latter case, a representative isolate for each colony morphology was processed. We have revised the methods to make that clear.

      Figure 2: It is difficult to see which nodes belong to which patient due to the small font size. I wonder if it was possible to color the nodes for each patient, to make it more readable.

      We tried coloring the nodes but with > 60 distinct patients/colors we decided it did not improve clarity. We have revised figure 2 to increase the font size.  

      Page 7-8, lines 154-155: Did the authors check whether there were isolates of the same strain (that were found in the environment) present in other patients elsewhere in the ward?

      Yes. In rare cases, we observed virtually genetically identical isolates from two patients collected in different wards. Because we only have access to clinical isolate data (collected from patient X in ward Y) and do not have access to patient data (admission/discharge date, wards, rooms, etc.), we cannot exclude that these patients overlapped in a room prior to the sampling of their P. aeruginosa isolates. We designed our fixed thresholds to be conservative. As a result, in this analysis, these cases are labelled as “undetermined”.

      Page 8: Do the authors have any information on antibiotic use during this timeframe? From the discussion, it seems like there is no patient-level prescription data. Is there any data on overall trends? How were trends in antibiotic use correlated with trends in antibiotic resistance?

      Unfortunately, patient-level prescription data (or any other data not linked to the bacterial specimens) was not accessible to us as this study was done as part of our public health surveillance activities.

      To infer the origin of infection, the authors used a static method with fixed thresholds and definitions. This study does not provide any uncertainty with their estimates. Maybe the authors could add a sentence in the discussion section noting that MCMC methods to infer transmission trees incorporating WGS could provide these estimates. These methods have not been applied to PA a lot, but here are two examples where MCMC methods have been used without WGS (though the definition of environmental contamination may differ between these studies and this study).

      https://doi.org/10.1186/s13756-022-01095-x

      https://doi.org/10.1371/journal.pcbi.1006697

      Thank you for this great suggestion. We have revised the manuscript to include a discussion on the limitations of fixed thresholds to infer transmission chains/origins, and to discuss existing alternatives including MCMC methods. 

      Line 322-323: This sentence is a bit vague since not all of these HAI are due to P. aeruginosa. I would suggest citing a number that is specific to PA.

      Thank you. While our paper shows a particular example of protracted P. aeruginosa outbreak, the roll-out of routine WGS surveillance in the clinic will help prevent hospital-associated drug-resistant infections for more than this species. We believe that broadening the scope in the last sentence of the manuscript is important and we decline to revise as suggested.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Strengths:

      The authors embarked on an ambitious journey to seek the answer regarding 3D genome changes predisposing to metastatic organotropism. The authors succeeded in assembling a comprehensive panel of breast cancer cell lines and aggregating the 3D genome structure data to conduct a hypothesis-driven computational analysis. The authors also succeeded in including proper controls representing normal non-cancerous epithelium and the end organ of interest. The authors did well in citing relevant references in 3D genome organization and EMT.

      Weaknesses:

      (1) The authors should clearly indicate how they determine the patterns of spread of the breast cancer cell lines being utilized in this manuscript. How did the authors arrive at the conclusion that certain cell lines would be determined as "localized spread" and "metastatic tropism to the lung"? This definition is crucial, and I will explain why.

      It is indeed a critical point to clearly define and explain what qualifies as metastatic potential to particular organs in our system. Here, we intentionally limited our scope to metastasis that had occurred within the human system. Our cell lines are chosen based on their sites of origin and etiological history in the patients from which they were derived. For example, the cancer cell line BT474 was classified as “localized” because these cells were derived from a solid tumor in the breast itself. Meanwhile, MCF7 and T47D cell lines are considered lung metastatic because these cells were collected from the pleural effusion from the lung. We therefore model human organotropism from the breast to the lung by using cells that originated from infiltrative ductal carcinoma (human breast) but were collected from pleural effusions (human lung). We then use as a comparison a human lung cancer-derived cell line that was itself purified from a pleural effusion. In this way, we can compare the genome structure of a lung cancer cell in the lung environment to a breast cancer cell that has metastasized to the lung environment.

      In our revised version, we further clarify this definition in the text as well as in additional annotations in our supplemental table of all cell line information.

      Todd Golub's team from the Broad Institute of MIT and Harvard published "A metastasis map of human cancer cell lines" to exhaustively create a first-generation metastasis map (MetMap) that reveals organ-specific patterns of metastasis. (By the way, this work was not cited in the references of this manuscript.) The MetMap Explorer (https://depmap.org/metmap/vis-app/index.html) is a public resource that can be openly accessed to visualize, in the format of petal plots, the metastatic potential of each cell line as determined by the in vivo barcoding approach described in the MetMap paper. Five organs were tested in the MetMap paper: brain, lung, liver, kidney, and bone. The authors would discover that some of the organ-specific metastasis patterns defined in the MetMap Explorer differ from the authors' classification. For example, the authors defined MCF7 as a lung metastatic line, and rightly so: the MetMap charted a signal towards lung with low penetrance and low metastatic potential. The authors defined ZR751 as a line with localized spread; however, the MetMap charted a signal towards the kidney with low penetrance and low metastatic potential, with a signal strength similar to the lung metastasis in MCF7. A similar argument could be made for T47D. The TNBC line MDA-MB-231 is indeed highly metastatic; however, in MetMap data, its metastasis is not specific to the lung but directed towards all 5 organs with high penetrance and metastatic potential. The authors defined the 2 lung cancer cell lines mentioned in this study, A549 and H460, as localized spread to the lung. However, the MetMap data clearly indicated that A549 and H460 are highly metastatic to all 5 organs with high penetrance and high metastatic potential.

      We acknowledge the valuable contributions of animal models in metastatic cancer studies, but we also want to avoid the potentially confounding variable of the animal microenvironment. The MetMap Explorer contains valuable information (and as part of our clarification on this point, we now cite the MetMap in the text), but the “metastatic potential of each cell line” for this tool is measured in a mouse environment. Knowing that a particular cell line, which originated from a human lung metastasis, can further metastasize to other organs in a mouse does not necessarily mean that those cells could do so in humans. The microenvironment responses to metastatic colonization recapitulate the events in wound repair, and these can differ among species (https://pubmed.ncbi.nlm.nih.gov/28916657/ https://pubmed.ncbi.nlm.nih.gov/39729995/ ). Further, the changes a cell needs to make to adapt to a new organ system in a mouse could be confounded by the changes needed to adapt to mouse conditions in general. Finally, migration from a site of ectopic injection may not mimic migration from an initial tumor site. These factors lead to well known cases where MetMap does not reflect the metastatic potential of cancers in humans. As a classic example, prostate cancer frequently metastasizes to bone in humans, and the PC3 cell line was derived from a bone metastatic prostate cancer. However, MetMap shows no evidence of PC3 being able to metastasize to bone in a mouse.

      We agree that the very best data would come from matched primary and metastatic tumors in the same human patient, but those data do not currently exist and generating them would require future work beyond the scope of this study.

      Since results will vary among different experimental models testing metastatic organotropism (intracardiac injection was the metastasis model adopted in the MetMap), the authors should state more clearly which experimental model system served as the basis for their definition of organ-specific metastasis. In my opinion, this is the most crucial first step for this entire study to be sound and solid.

      Taking all the above into account, in our revision, we have now included further clarification in the main text to more clearly explain how and why we chose the cell lines we did and what the advantages and limitations of this choice are.

      (2) Figure 1b: The authors found that "MDA-MB-231 cells were grouped with the lung carcinoma cells. This implies that the genome organization of this cell line is closer to that of lung cells than to other breast epithelial cell lines.". In fact, another TNBC line BT549 was also clustered under the same clade. So this clade consisted of normal-like and highly metastatic lines. Therefore, the authors should be mindful of the fact that the compartment features might not directly link to metastasis (or even metastatic organotropism).

      In figure 1b, the grouping that includes MDA-MB-231 (lung metastatic breast cancer) connected to A549 and H460 (lung cancer) occurs at a distance of about 0.2. If the clustering tree were cut at a distance of 0.26, 6 separate clusters would result: two clusters of Luminal subtypes (all labeled red), one that includes all healthy epithelial cells (both lung and breast, all labeled green), one that links two localized breast cancers, one that links MDA-MB-231 to lung carcinoma cell lines, and then BT549 by itself. So, while BT549 appears next to MDA-MB-231 along the horizontal axis, this is just a coincidence of the representation: the dendrogram shows it is quite distant from all the other cell lines in this cluster according to compartment profile.

      So, it is only MDA-MB-231 that is very closely linked with the lung cancer cell types.
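
      The point that leaf order in a dendrogram does not imply similarity can be illustrated by cutting a tree at a fixed height. The sketch below is only an illustration: it assumes single linkage (the linkage method used for figure 1b is not stated in the response), and the sample labels and distances are invented rather than taken from the compartment data.

```python
from itertools import combinations


def cut_single_linkage(labels, dist, threshold):
    """Clusters obtained by cutting a single-linkage tree at `threshold`.

    Under single linkage this is equivalent to taking the connected
    components of the graph that joins every pair with distance <= threshold.
    """
    parent = {x: x for x in labels}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(labels, 2):
        if dist[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for x in labels:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())


# Invented distances: s5 is far from everything, wherever it is drawn.
samples = ["s1", "s2", "s3", "s4", "s5"]
toy_dist = {frozenset(pair): 0.5 for pair in combinations(samples, 2)}
toy_dist[frozenset(("s1", "s2"))] = 0.10
toy_dist[frozenset(("s3", "s4"))] = 0.20
print(cut_single_linkage(samples, toy_dist, 0.26))
```

      In this toy case the sample `s5` remains a singleton at the 0.26 cut, even though a plotting routine could place it next to any other leaf.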

      It is true that the healthy lung cells (HTBE) are clustered separately and are more similar to normal/non tumorigenic breast epithelial cells (HMEC and MCF10A) than to any cancer cell type. This could suggest that there are aspects of the compartment pattern that represent any healthy epithelium as compared to cancer. What we find in the compartment profile, in both the clustering and the PCA analysis, is that compartment signatures contain information about cell properties on several overlapping levels: there is an aspect of the compartment profile that distinguishes healthy from cancerous cells, an aspect that distinguishes luminal cancers from other subtypes, a part that associates with organotropism, and an aspect that captures EMT status. The final compartment status is a composite of these numerous factors.

      We have clarified the text to indicate that we mean MDA-MB-231 clusters near lung cancer, not necessarily healthy lung cell models.

      (3) Figure 3: In the text, the authors stated, "To further investigate this result, we examined the transcription status of genes that changed compartment across the EMT spectrum and, conversely, the compartment status of genes that changed transcription (Fig. 3b, c, and d)". However, it was not apparent in the figure that the cell lines were arranged according to an EMT spectrum.

      To display these comparisons more clearly, we have now revised figure 3b, c, and d in two ways: First, we have defined the gene and cell line clustering by one set of data (for example, compartment identity in 3b) and then displayed the other data (gene expression) with all genes and cell lines in the same order. Therefore, for each column, genes and cell lines can be compared visually between top and bottom rows. Second, we have colored cell line names from purple to yellow according to their EMT scores as shown in Supplementary Figure 1a. This allows a visual indication of how the clustering separates cell lines by EMT status.

      Also, the clustering heatmaps did not provide sufficient information regarding the genes with concordant/divergent compartments vs transcription changes. It would be more informative if the authors could spend more effort in annotating these genes/pathways.

      We want to clarify that the genes plotted in the heatmaps in Figure 3 are also the genes whose functional enrichment we present in figures 1 and 2. So, the genes that segregate strongly based on A/B compartment (but not gene expression) in figure 3b are the same genes whose GO terms are annotated in Figure 1d. Likewise, the genes that segregate strongly based on gene expression, but not A/B compartment, in figure 3c and d are the same genes whose GO terms are annotated in Figure 2b. We have now made this connection clearer in the text.

      But, we also agree with the reviewer that it is important to explore a bit further the relationship between these divergent sets of genes. Our explorations have led to several observations:

      (1) In some cases, the compartment-segregated genes and the transcription-segregated genes are different members of the same pathways. In Author response image 1 below, for example, we show interactions (according to STRING) for genes from figure 3c that are highly expressed in the epithelial-like cell lines and are annotated as involved in epithelial development (green). We then added to the network genes from figure 3b that are specifically in the A compartment in the epithelial-like cell lines but not mesenchymal cell lines that are also annotated as involved in epithelial development (red). Most of these epithelial development genes that change expression are in the A compartment in all cell lines and therefore do not rely on spatial compartment changes for their regulation. But some additional epithelial development genes, which are interconnected in this same network, are changing compartments across the EMT spectrum. One example, FOXA1, is a key hub in the network and is known to be a pioneer transcription factor involved in development and differentiation. Controlling this gene at the level of spatial genome organization rather than local transcriptional control could be important in the stable cell fate changes that can happen with EMT.

      Author response image 1.

      (2) Overall, the set of genes that change compartments does not have as strong functional enrichment as the transcription change set of genes. This could indicate that some of the compartment changes that occur with EMT are not directly gene regulatory but rather enable an overall conformational change of the chromatin that is needed for the alterations in physical cell state or to accomplish long distance gene regulation changes.

      (3) Related to long distance gene regulation changes, we also see cases in which the gene that changes transcription but not compartment across EMT is adjacent to regions that switch compartments.

      A good example is TFF3 (yellow, Supplementary figure 1C). TFF3 is one of the genes that strongly segregates across EMT by transcription, being more highly expressed in epithelial-like (bottom 4 tracks) than in mesenchymal-like (top 4 tracks) cancers. Despite this differential expression, it is almost always in the A compartment across all cell lines. However, it is adjacent to regions that show strong compartment change EMT signatures. So, even though this specific gene region is not changing compartment, its regulation may be influenced by the entire region being A-associated in epithelial-like cancers but neighboring regions becoming B-associated in mesenchymal-like cancers.

      TFF3 is expressed in normal breast epithelium and has been implicated as a biomarker for endocrine therapy response in breast cancer.

      Meanwhile, many genes that are in these compartment switching regions (BACE2, DSCAM, PDE9A) are not among the strongest expression signature genes.

      (4) Interestingly, some of the regions (such as the region shown in Supplementary figure 1C) that change compartment across the breast cancer spectrum overlap with regions that we found change compartment in the progression of prostate cancer, as shown in the string.db enrichment analysis below.

      Author response image 2.

      In our revised manuscript, we now include more of these explanations in the text and include the example offset compartment and transcription change region shown above as panel c of Supplementary Figure 1.

      (4) Figure 4: The title of the subheading of this section was "Lung metastatic breast cancer cell lines acquire lung-like genome architecture". Echoing my comments in point 1, I am a bit hesitant to term it as "lung metastatic" but rather "metastatic" in general since cell lines such as MDA-MB-231 do metastasize to other organs as well. However, I do get the point that the definition of "lung metastasis" is derived from the common metastasis features among the cell lines here (MCF7, T47D, SKBR3, MDA-MB-231). There might be another argument about whether the "lung" carcinoma cell lines can be considered "localized" since they are also capable of metastasizing to other organs.

      Rather than classifying cells on metastatic “potential” (as measured in a mouse), our cell lines are chosen based on their sites of origin and etiological history in the patients from which they were derived. Cancer cell lines called “lung metastasis” were collected from the pleural effusion from the human lung. Likewise, we call a cancer “localized” because it was taken from the tissue where the cancer originated, even if it might, if placed into a different context, be able to metastasize. We would argue that the genome structure features of the “localized” cancers reflect cancers that have not yet metastasized (even if they could in the future) while the “metastatic” cancers have already gone to a certain location (even if they could in theory have gone to a different location).

      In a way, what the authors probably were trying to leverage here is the "tissue" identity of that organ.

      Having said this, in addition to showing the "lung permissive changes", the authors should show the "breast identity conservation" as well. Because this section started to deal with the concept of "tissue/lineage identity", the authors should also clarify whether these breast cancer cell lines capable of making lung metastasis are also preserving their original tissue identity from the compartment features (which would most likely be the case).

      This is a great question. We have now more explicitly checked the proportions of genomic regions that change compartments to match lung vs. maintaining breast-specific compartment identity. The graphs in Supplementary Figure 2 begin with all genomic bins that have distinctive compartment identity between non-cancerous breast and lung epithelial cells. Then, the plots show what fraction of these tissue-specific bins change compartment to match lung vs. maintaining breast identity in each breast cancer cell line category. As we have shown in other graphs, particularly for switches to the A compartment, more bins change to match lung in the metastatic vs. primary site cell lines. In most cases, more than 50% of the tissue-specific bins shift to look more like lung.
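
      The bin-level comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `tissue_switch_fractions`, the 'A'/'B' list encoding, and the toy compartment calls are all hypothetical.

```python
def tissue_switch_fractions(breast, lung, cancer):
    """Among bins where healthy breast and lung differ, report the fraction
    of bins in which the cancer line matches lung vs. keeps breast identity.

    Each argument is a list of 'A'/'B' compartment calls, one per genomic bin
    (hypothetical encoding; the same bin order is assumed in all three lists).
    """
    if not (len(breast) == len(lung) == len(cancer)):
        raise ValueError("all compartment tracks must cover the same bins")
    # Keep only tissue-specific bins: breast and lung disagree.
    specific = [(b, l, c) for b, l, c in zip(breast, lung, cancer) if b != l]
    if not specific:
        return {"n_specific": 0, "match_lung": 0.0, "keep_breast": 0.0}
    n = len(specific)
    match_lung = sum(1 for _, l, c in specific if c == l)
    keep_breast = sum(1 for b, _, c in specific if c == b)
    return {
        "n_specific": n,
        "match_lung": match_lung / n,
        "keep_breast": keep_breast / n,
    }


# Toy example with made-up calls (not data from the paper).
breast = ["A", "B", "B", "A", "B", "A"]
lung = ["A", "A", "A", "B", "B", "B"]
cancer = ["A", "A", "B", "B", "B", "B"]
result = tissue_switch_fractions(breast, lung, cancer)
print(result)
```

      With binary A/B calls, every tissue-specific bin either matches lung or keeps breast identity, so the two fractions sum to one.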

      (5) Rest of the sections: The authors started to claim that the organ-specific metastasis permissive compartmental features mimic the destination organ. The authors utilized additional non-breast cancer cell lines (prostate cancer cell lines LNCaP as localized and DU145 as brain metastatic) in brain metastasis to strengthen this claim. (DU145 in MetMap again is highly metastatic to lung, brain, and kidney). However, this makes one wonder: for cell lines that are capable of metastasizing to multiple organ sites (e.g., MDA-MB-231, DU145, A549, H460), does it mean that they all acquire the permissive features for all these organs? This scenario is clinically relevant in Stage 4 patients who often present with not only one metastatic lesion in one single organ but multiple metastatic lesions in more than one organ (e.g., concomitant liver and lung metastasis). Do the authors think that there might be different clones having different tropism-permissive 3D genome features or that there might be an evolutionary trajectory in this?

      In my opinion, to further prove this point, the authors might need to consider doing in vivo experiments to collect paired primary and organ-specific metastatic samples to look at the 3D genome changes.

      We agree that an ideal experimental follow up to this study would be to collect paired metastatic and primary tumors, either in mouse xenograft or, even better, from patients. This is beyond the scope of what we can do for our current paper, but we have added a statement to the discussion of further experiments that would be required to clarify this point.

      (6) Technically, the study utilized public Hi-C data without generating new Hi-C data. The resolution of the Hi-C data for compartments was set at 250KB as the binning size indicating that the Hi-C data was at lower resolution so it might not be ideal to address other 3D genome architecture changes such as TADs or long-range loops. It is therefore unknown whether there might be permissive TAD/loop changes associated with organotropism and this is the limitation of this study.

      Our decision to focus on A/B compartmentalization rather than TAD or loop structure in this analysis was intentional and biologically motivated, rather than solely being a reflection of data resolution. Both compartments and topologically associated domains (TADs) are key parts of genome organization and disruption of these structures has the potential to alter downstream gene regulation, as shown by numerous studies. However, compartments have been found, more so than TADs, to be strongly associated with cell type and cell fate. Therefore, in this manuscript, we decided to focus only on the compartment organization changes between different healthy and cancerous cells, as they are more likely to represent the stable alterations of genome organization that accompany malignant transformation.

      (7) In the final sentence of the discussion the authors stated "Overall, our results suggest that genome spatial compartment changes can help encode a cell state that favors metastasis (EMT)". The "metastasis (EMT)" was in fact not clearly linked inside the manuscript. The authors did not provide a strong link between metastasis and EMT in their result description. It is also unclear whether the EMT-associated compartment identity would also correlate with the organotropic compartment identity.

      We agree that this statement involves too strong of an assumption. The literature on this topic is vast and complex, and while there is abundant evidence that pathways of EMT can play important roles in facilitating metastasis, there are other pathways at play in the metastatic process as well (https://journals.plos.org/Plosbiology/article?id=10.1371/journal.pbio.3002487). We have made a clearer statement about this in the text now.

      To address the question of whether the organotropic changes are related to the EMT changes, we calculated the overlap between the genomic bins that strongly segregated cell lines in the compartment principal component analysis (PC1) and those that showed “organotropic” changes. As you can see in supplementary table 3, this overlap is very small: only 3% of bins are important both for the EMT segregation of cell lines and for organotropism.

      We have now included this overlap information as supplementary table 3 and have addressed this in the text.
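
      An overlap calculation of this kind can be sketched with set operations. The response does not state which denominator the 3% figure uses, so this illustration reports the shared bins relative to each set and to their union; the function name `bin_overlap` and the (chromosome, start) bin identifiers are hypothetical, and the example bins are made up.

```python
def bin_overlap(emt_bins, organotropic_bins):
    """Count bins shared by two bin sets and express the overlap as a
    percentage of each set and of their union."""
    emt, organ = set(emt_bins), set(organotropic_bins)
    shared = emt & organ
    union = emt | organ

    def pct(part, whole):
        # Guard against empty sets to avoid division by zero.
        return 100.0 * len(part) / len(whole) if whole else 0.0

    return {
        "n_shared": len(shared),
        "pct_of_emt": pct(shared, emt),
        "pct_of_organotropic": pct(shared, organ),
        "pct_of_union": pct(shared, union),
    }


# Toy bins identified by (chromosome, start) at 250 kb spacing; not real data.
emt_bins = [("chr1", 0), ("chr1", 250_000), ("chr2", 500_000), ("chr3", 0)]
organ_bins = [("chr1", 250_000), ("chr4", 750_000)]
stats = bin_overlap(emt_bins, organ_bins)
print(stats)
```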

      Reviewer #2 (Public review):

      Summary:

      This work addresses an important question of chromosome architecture changes associated with organotropic metastatic traits, showing important trends in genome reorganization. The most important observation is that 3D genome changes are consistent with adaptations for new microenvironments, including lung metastatic breast cells exhibiting signatures of the genome architecture typical of a lung cell-like conformation and brain metastatic prostate cancer cells showing compartment shifts toward a brain-like state.

      Strengths:

      This work presents interesting original results, which will be important for future studies and biomedical implications of epigenetic regulation in norm and pathology.

      Weaknesses:

      The authors used publicly available data for 15 cell types. They should show how many different sources the data were obtained from and demonstrate that obtained results are consistent if the data from different sources were used.

      In our revised version, we have provided a clarified table of information about all the publicly available data used from all the cell lines, indicating the sources of the data. The 17 datasets used come from 8 different studies. So, indeed, the reviewer is correct that many different sources of data were used. To address the question of whether our results would be consistent if data from different sources were used, we created a comparison map of the A/B compartment profiles for data from multiple sources when it was available. You can see below that the Hi-C data from different sources for the same cell lines cluster quite closely and show high correlation and are well separated from different cell lines. So, we do not think that source batch effects play a major role in our results.

      Author response image 3.
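
      The source-consistency check described in the response above can be illustrated with a pairwise Pearson correlation of compartment profiles. This is a sketch only: the response does not say which correlation measure or clustering method was used, the `pearson` helper and the study labels are hypothetical, and the eigenvector values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length compartment profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2 bins)")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(vx * vy)


# Made-up compartment eigenvector values per bin, keyed by (cell line, source).
profiles = {
    ("MCF7", "study_1"): [0.8, 0.5, -0.6, -0.9, 0.3],
    ("MCF7", "study_2"): [0.7, 0.6, -0.5, -0.8, 0.2],
    ("T47D", "study_3"): [-0.4, 0.9, 0.7, -0.2, -0.8],
}
same_line = pearson(profiles[("MCF7", "study_1")], profiles[("MCF7", "study_2")])
diff_line = pearson(profiles[("MCF7", "study_1")], profiles[("T47D", "study_3")])
print(f"same line, different source: {same_line:.3f}")
print(f"different lines: {diff_line:.3f}")
```

      If batch effects were minor, one would expect the same-line correlation to exceed the between-line correlation, as in this toy case.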

      Recommendations for the authors:  

      Reviewer #1 (Recommendations for the authors):

      (1) Figure 1a: This figure could be re-formatted without the arrows. Arrows usually indicate upstream-to-downstream relationships along certain processes. Using arrows here would mislead people to think that the cell lines were derived from one another. The same could apply to the supplementary figures.

      We have now edited figure 1a to include lines linking cell lines, indicating conceptual relationships, rather than arrows, which would imply direct derivation.

      (2) Figure 1c: The PCA (PC2 axis) indeed seemed to separate the HER2 status quite well. One concern is MCF7, it is labeled as ERpos/HER2neg in MetMap but seems to be clustered as HER2pos in this study. Are they the same? (This again highlights the importance of cell line definition and annotation).

      It is a good point that MCF7, while generally considered HER2 negative (we indicate this negative status in Supplementary Table 1), falls near HER2 positive cells in PCA space. This indicates that PCA captures tendencies but is not a perfect classifier. In a high dimensional, complex system, it is expected that an unsupervised analysis such as this will not capture just one biological feature in a given principal component, and therefore something like HER2 status may not segregate perfectly. However, this analysis does suggest that MCF7 3D genome structure has features that are more similar to other HER2+ cell lines. This raises the interesting possibility that it may actually behave like HER2+ cells in some ways even while being HER2- itself. We have more clearly stated the MCF7 discrepancy in the text.

      Reviewer #2 (Recommendations for the authors):

      (1) The description of results can be shortened, to make it easier to read and understand.

      In our revision, we have tried to clarify where possible, but it was difficult to shorten without losing important caveats and context (especially to make important points emphasized by reviewer 1).

      (2) "100 most positive and negative eigenvalues for PC1" - please provide the correct description.

      We have altered this to make it clearer and more correct: “using the genes from the regions with the top 100 most positive and 100 most negative eigenvector loadings for this PC1”
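
      The corrected description (selecting regions by eigenvector loading rather than by eigenvalue) can be sketched as below. The function name `extreme_loading_bins`, the dictionary layout, and the toy loadings are hypothetical; mapping the selected regions to genes is a separate step that is not shown.

```python
def extreme_loading_bins(loadings, n=100):
    """Return the n bins with the most positive PC1 loadings and the n bins
    with the most negative PC1 loadings.

    loadings: dict mapping a bin identifier to its loading on PC1.
    """
    positive = sorted(
        (item for item in loadings.items() if item[1] > 0),
        key=lambda item: item[1],
        reverse=True,  # largest positive loading first
    )
    negative = sorted(
        (item for item in loadings.items() if item[1] < 0),
        key=lambda item: item[1],  # most negative loading first
    )
    return [b for b, _ in positive[:n]], [b for b, _ in negative[:n]]


# Toy loadings (invented values); n=2 keeps the example small.
toy_loadings = {
    "bin_a": 0.9, "bin_b": -0.7, "bin_c": 0.1,
    "bin_d": -0.05, "bin_e": 0.4, "bin_f": -0.3,
}
top_pos, top_neg = extreme_loading_bins(toy_loadings, n=2)
print(top_pos, top_neg)
```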

    1. Good night, ladies, good night, sweet ladies, good night, good night.

      It's interesting to me how Eliot ends this section of The Waste Land with Ophelia's last words before she commits suicide. Lines before, we get references to "Bill," "Lou," and "May," indicating that the speaker is bidding farewell from the pub setting. Ophelia's line, on the other hand, bids farewell on behalf of not just Lil and the woman in the pub, but all the "sweet ladies" of the waste land. This idea of death as a fate is super interesting. The women have their emotional and spiritual deaths connected to Ophelia's physical death. This is yet another instance where we see suicide in a female in The Waste Land. If I think about what Eliot is trying to get at with women x waste land, especially with this Ophelia connection, I'd say the waste land is a world where the modes of expressing experiences like song, symbol, and even madness have been stripped of their meaning and beauty, leaving only bad nerves, dirty gossip, and the last call of the pub. This is obviously not the ideal place for women; hence, modern society is not fit for women to flourish.

    1. One critique of all of these approaches, however, is that no design, no matter how universal, will equally serve everyone. This is the premise of design justice (Costanza-Chock, S. (2020). Design justice: Community-led practices to build the worlds we need. MIT Press), which observes that design is fundamentally about power, in that designs may not only serve some people less well, but systematically exclude them in surprising, often unintentional ways.

      I agree with this. I am privileged to often forget about the exclusion of certain groups in "universal" designs. An example of this that I thought of was pens. I found out recently that a lot of left-handed people have a hard time with ink pens as their palms tend to smear the wet ink immediately after writing. Another example I could think of was the original Band-Aid colors, and how they did a poor job of representing people of all skin tones. Any design that leaves out a certain group of people should always have a substitute version for those people or should not be designed altogether.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      This study provides a thorough analysis of Nup107's role in Drosophila metamorphosis, demonstrating that its depletion leads to developmental arrest at the third larval instar stage due to disruptions in ecdysone biosynthesis and EcR signaling. Importantly, the authors establish a novel connection between Nup107 and Torso receptor expression, linking it to the hormonal cascade regulating pupariation.

      However, some contradictory results weaken the conclusions of the study. The authors claim that Nup107 is involved in the translocation of EcR from the cytoplasm to the nucleus. However, the evidence provided in the paper suggests it more likely regulates EcR expression positively, as EcR is undetectable in Nup107-depleted animals, even below background levels.

      We appreciate the concern raised in this public review. However, we must clarify that we do not claim that Nup107 directly regulates the translocation of EcR from the cytoplasm to the nucleus; rather, Nup107 regulates Ecdysone hormone (20E) synthesis, which in turn affects EcR translocation. In the manuscript, we posed the hypothesis of whether Nup107 regulates EcR nuclear translocation (9th line of 2nd paragraph on page 6). We have spelled this out more clearly as the 3rd subsection title of the Results section, and in the discussion (8th line of 2nd paragraph on page 11).

      20E acts through the EcR to induce the transcription of EcR responsive genes, including EcR itself. This creates a positive autoregulatory loop that enhances the EcR level through ecdysone signaling (1). Since Nup107 depletion leads to a reduction in ecdysone levels, it disrupts the transcriptional autoregulatory loop of EcR expression. This can contribute to the reduced EcR levels seen in Nup107-depleted animals.

      Additionally, the link between Nup107 and Torso is not fully substantiated. While overexpression of Torso appears to rescue the lack of 20E production in the prothoracic gland, the distinct phenotypes of Torso and Nup107 depletion (developmental delay in the former versus complete larval arrest in the latter) complicate understanding of Nup107's precise role.

      We understand that there are differences in the developmental delay when Torso and Nup107 depletion are analyzed. However, the two molecules being compared here are very different, and variability in their depletion could contribute to the observed phenotypic differences (2). Even if there is no variability in the depletion of Torso and Nup107, we believe that Nup107, being more widely expressed and involved in the regulation of various cellular processes, induces stronger defects.

      Further, we think that RNAi-mediated depletion of Nup107 in prothoracic glands (PG) causes significant reduction in the PG size, which may exert a pronounced defect in 20E biosynthesis through the Halloween genes, inducing a stronger developmental arrest.

      To clarify these discrepancies, further investigation into whether Nup107 interacts with other critical signaling pathways related to the regulation of ecdysone biosynthesis, such as EGFR or TGF-β, would be beneficial and could strengthen the findings.

      In summary, although the study presents some intriguing observations, several conclusions are not well-supported by the experimental data.

      We agree with the reviewer’s suggestion. As noted in the literature, five RTKs (torso, InR, EGFR, Alk, and Pvr) stimulate the PI3K/Akt pathway, which plays a crucial role in PG functioning and in controlling pupariation and body size (3). We have checked torso and EGFR signaling. We rescued Nup107 defects with torso overexpression; however, constitutively active EGFR (BL-59843) did not rescue the phenotype (data not shown). Nonetheless, we plan to examine EGFR pathway activation by measuring pERK levels in Nup107-depleted PGs.

      Reviewer #2 (Public review):

      Summary:

      The manuscript by Kawadkar et al investigates the role of Nup107 in developmental progression via the regulation of ecdysone signaling. The authors identify an interesting phenotype of Nup107 whole-body RNAi depletion in Drosophila development - developmental arrest at the late larval stage. Nup107-depleted larvae exhibit mis-localization of the Ecdysone receptor (EcR) from the nucleus to the cytoplasm and reduced expression of EcR target genes in salivary glands, indicative of compromised ecdysone signaling. This mis-localization of EcR in salivary glands was phenocopied when Nup107 was depleted only in the prothoracic gland (PG), suggesting that it is not nuclear transport of EcR but the presence of ecdysone (normally secreted from PG) that is affected. Consistently, whole-body levels of ecdysone were shown to be reduced in Nup107 KD, particularly at the late third instar stage when a spike in ecdysone normally occurs. Importantly, the authors could rescue the developmental arrest and EcR mislocalization phenotypes of Nup107 KD by adding exogenous ecdysone, supporting the notion that Nup107 depletion disrupts biosynthesis of ecdysone, which arrests normal development. Additionally, they found that rescue of the Nup107 KD phenotype can also be achieved by over-expression of the receptor tyrosine kinase torso, which is thought to be the upstream regulator of ecdysone synthesis in the PG. Transcript levels of the torso are also shown to be downregulated in the Nup107KD, as are transcript levels of multiple ecdysone biosynthesis genes. Together, these experiments reveal a new role of Nup107 or nuclear pore levels in hormone-driven developmental progression, likely via regulation of levels of torso and torso-stimulated ecdysone biosynthesis.

      Strengths:

      The developmental phenotypes of an NPC component presented in the manuscript are striking and novel, and the data appears to be of high quality. The rescue experiments are particularly significant, providing strong evidence that Nup107 functions upstream of torso and ecdysone levels in the regulation of developmental timing and progression.

      Weaknesses:

      The underlying mechanism is however not clear, and any insight into how Nup107 may regulate these pathways would greatly strengthen the manuscript. Some suggestions to address this are detailed below.

      Major questions:

      (1) Determining how specific this phenotype is to Nup107 vs. to reduced NPC levels overall would give some mechanistic insight. Does knocking down other components of the Nup107 subcomplex (the Y-complex) lead to similar phenotypes? Given the published gene regulatory function of Nup107, do other gene regulatory Nups such as Nup98 or Nup153 produce these phenotypes?

      We thank this public review for raising this concern. Working with a Nup-complex like the Nup107 complex, this concern is anticipated but difficult to address as many Nups function beyond their complex identity. Our observations with all other members of the Nup107-complex, including dELYS, suggest that except Nup107, none of the other tested Nup107-complex members could induce larval developmental arrest.

      In this study, we primarily focused on the Nup107 complex (outer ring complex) of the NPC. However, previous studies have reported that Nup98 and Nup153 interact with chromatin, with these investigations conducted in Drosophila S2 cells (4, 5, 6). We have now examined other nucleoporins outside of this complex, such as Nup153.

      We ubiquitously depleted Nup153 using the Actin5C-Gal4 driver and assessed the pupariation profile of the knockdown larvae in comparison to control larvae. In contrast to the Nup107 knockdown, when Nup153 is depleted to less than 50% of control levels, no impact on pupariation was observed (Author response image 1).

      Author response image 1.

      Nup153 depletion does not affect Drosophila metamorphosis. Actin5C-Gal4 is used as a ubiquitous driver. (A) Comparison of pupariation profiles of control and Nup153 knockdown organisms. (B) Quantification of Nup153 knockdown efficiency. Data are represented from at least three independent experiments. Statistical significance was derived from the Student’s t-test. Error bars represent SEM. ***p < 0.001.
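
      For readers unfamiliar with the statistics named in this legend, the SEM and the Student's t statistic can be computed as sketched below. The legend does not say whether the test was paired or unpaired, so this sketch assumes an unpaired two-sample test with pooled variance; the helper names and the relative expression values are hypothetical. Converting the t statistic to a p-value requires the t distribution and is not shown.

```python
import math
from statistics import mean, variance


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return math.sqrt(variance(values) / len(values))


def student_t(a, b):
    """Unpaired two-sample Student's t statistic with pooled variance
    (equal variances assumed). Returns (t, degrees_of_freedom)."""
    na, nb = len(a), len(b)
    dof = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / dof
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, dof


# Invented relative transcript levels from three independent experiments.
control = [1.00, 0.95, 1.05]
knockdown = [0.40, 0.45, 0.35]
t_stat, dof = student_t(control, knockdown)
print(f"control SEM = {sem(control):.4f}, t = {t_stat:.2f}, dof = {dof}")
```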

      (2) In a related issue, does this level of Nup107 KD produce lower NPC levels? It is expected to, but actual quantification of nuclear pores in Nup107-depleted tissues should be added. These and the above experiments would help address a key mechanistic question - is this phenotype the result of lower numbers of nuclear pores or specifically of Nup107?

      We agree with the concern raised here. To address it, we stained the control and Nup107-depleted salivary glands with the mAb414 antibody (which exclusively recognizes FG-repeat Nups). While Nup107 intensities are significantly reduced at the nuclear envelope in Nup107-depleted salivary glands, the mAb414 staining seems unperturbed (Author response image 2).

      Author response image 2.

      Nup107 depletion does not perturb overall NPC composition. Comparison of salivary gland nuclei from control and Nup107 knockdown animals. Nup107 is shown in green, and mAb414, which stains other FG-repeat-containing nucleoporins, is shown in red. Scale bars, 5µm.

      (3) Additional experiments on how Nup107 regulates the torso would provide further insight. Does Nup107 regulate transcription of the torso or perhaps its mRNA export? Looking at nascent levels of the torso transcript and the localization of its mRNA can help answer this question. Or alternatively, does Nup107 physically bind the torso?

      While the concern regarding torso transcript level is genuine, we have already reported in the manuscript that Nup107 directly regulates torso expression. When Nup107 is depleted, torso levels go down, which in turn controls ecdysone production and subsequent EcR signaling (Figure 6B of the manuscript).

      However, the exact nature of Nup107 regulation of torso expression is still unclear. Since Nup107 is known to interact with chromatin (7), it may affect torso transcription. A stable and physiologically relevant interaction between Nup107 and the torso in a cellular context is unlikely, largely due to their distinct subcellular localizations. Investigating this further would require a significant amount of time to obtain reagents and perform experiments, and it currently stands beyond the scope of this manuscript.

      (4) The depletion level of Nup107 RNAi specifically in the salivary gland vs. the prothoracic gland should be compared by RT-qPCR or western blotting.

      Although we know that the Nup107 protein signal is reduced in SG upon knockdown (Figure 3B), we have not compared the Nup107 transcript level in these two tissues (SG and PG) upon RNAi. As suggested here, we evaluated the knockdown efficiency of Nup107 using the salivary gland-specific driver AB1-Gal4 and the prothoracic gland-specific driver Phm-Gal4. Our results indicate a significant reduction in Nup107 transcript levels upon Nup107 RNAi in both SG and PG compared to their respective controls (Author response image 3).

      Author response image 3.

      Nup107 levels are significantly reduced upon Nup107<sup>KK</sup> RNAi. Quantification of Nup107 transcript levels from control and Nup107 depleted larvae [tissue specific depletion using AB1-Gal4 (A) and Phm-Gal4 (B)]. Data are represented from at least three independent experiments. Statistical significance was derived from the Student’s t-test. Error bars represent SEM. **p < 0.004.

      (5) The UAS-torso rescue experiment should also include the control of an additional UAS construct - so Nup107; UAS-control vs Nup107; UAS-torso should be compared in the context of rescue to make sure the Gal4 driver is functioning at similar levels in the rescue experiment.

      This is a very valid point, and we took this into account while planning the experiment. In such cases, GAL4 dilution can often be critical. We have demonstrated in Figure S7 that GAL4 dilution is not confounding our observations. We used Nup107<sup>KK</sup>; UAS-GFP as a control alongside Nup107<sup>KK</sup>; UAS-torso. We conclude that the presence of GFP signals in prothoracic glands and their reduced size indicate that genes downstream of both UAS sequences are transcribed, and that GAL4 dilution does not play a role here.

      Minor:

      (6) Figures and figure legends can stand to be more explicit and detailed, respectively.

      We have revisited all figures and their corresponding legends to ensure appropriate and explicit details are provided.

      Reviewer #3 (Public review):

      Summary:

      In this study by Kawadkar et al, the authors investigate the developmental role of Nup107, a nucleoporin, in regulating the larval-to-pupal transition in Drosophila through RNAi knockdown and CRISPR-Cas9-mediated gene editing. They demonstrate that Nup107, an essential component of the nuclear pore complex (NPC), is crucial for regulating ecdysone signaling during developmental transitions. The authors show that the depletion of Nup107 disrupts these processes, offering valuable insights into its role in development.

      Specifically, they find that:

      (1) Nup107 depletion impairs pupariation during the larval-to-pupal transition.

      (2) RNAi knockdown of Nup107 results in defects in EcR nuclear translocation, a key regulator of ecdysone signaling.

      (3) Exogenous 20-hydroxyecdysone (20E) rescues pupariation blocks, but rescued pupae fail to close.

      (4) Nup107 RNAi-induced defects can be rescued by activation of the MAP kinase pathway.

      Strengths:

      The manuscript provides strong evidence that Nup107, a component of the nuclear pore complex (NPC), plays a crucial role in regulating the larval-to-pupal transition in Drosophila, particularly in ecdysone signaling.

      The authors employ a combination of RNAi knockdown, CRISPR-Cas9 gene editing, and rescue experiments, offering a comprehensive approach to studying Nup107's developmental function.

      The study effectively connects Nup107 to ecdysone signaling, a key regulator of developmental transitions, offering novel insights into the molecular mechanisms controlling metamorphosis.

      The use of exogenous 20-hydroxyecdysone (20E) and activation of the MAP kinase pathway provides a strong mechanistic perspective, suggesting that Nup107 may influence EcR signaling and ecdysone biosynthesis.

      Weaknesses:

      The authors do not sufficiently address the potential off-target effects of RNAi, which could impact the validity of their findings. Alternative approaches, such as heterozygous or clonal studies, could help confirm the specificity of the observed phenotypes.

      This is a very valid point, and we are aware of the consequences of off-target effects of RNAi. To confirm that the observed effects are due to authentic RNAi and to minimize off-target effects, we used two RNAi lines (Nup107<sup>GD</sup> and Nup107<sup>KK</sup>) against Nup107. Both RNAi lines induced comparable levels of Nup107 reduction, and ubiquitous and PG-specific knockdown with these lines produced similar phenotypes. Although the Nup107<sup>GD</sup> line exhibited a relatively stronger knockdown than the Nup107<sup>KK</sup> line, we preferentially used the Nup107<sup>KK</sup> line because the Nup107<sup>GD</sup> line is based on a P-element insertion whose exact landing site is unknown. Furthermore, an off-target is predicted for the Nup107<sup>GD</sup> line, in which a 19 bp sequence aligns with the bifocal (bif) sequence. The bif-encoded protein is involved in axon guidance and regulation of axon extension. In contrast, the Nup107<sup>KK</sup> line has no predicted off-target, and its precise landing site on the second chromosome is known. Thus, the Nup107<sup>KK</sup> line was ultimately used for experimentation because of its clearer and more reliable genetic background.

      We are also investigating Nup107 knockdown in the prothoracic gland, which exhibits polyteny. Additionally, the number of cells in the prothoracic gland is quite limited, approximately 50-60 cells (8). Given this, there is a possibility that a clonal study may not yield the phenotype.

      NPC Complex Specificity: While the authors focus on Nup107, it remains unclear whether the observed defects are specific to this nucleoporin or if other NPC components also contribute to similar defects. Demonstrating similar results with other NPC components would strengthen their claims.

      We thank the reviewer for raising this concern. When working with a Nup complex such as the Nup107 complex, this concern is anticipated but difficult to address, as many Nups function beyond their complex identity. Our observations with all other members of the Nup107 complex, including dELYS, suggest that none of them except Nup107 could induce larval developmental arrest. Since the study is primarily focused on the Nup107 complex (outer ring complex) of the NPC, we have not examined many nucleoporins outside of this complex. However, knockdown of Nup153, a nuclear basket nucleoporin, is comparable to control, with no delay in development (Author response image 1).

      Although the authors show that Nup107 depletion disrupts EcR signaling, the precise molecular mechanism by which Nup107 influences this process is not fully explored. Further investigation into how Nup107 regulates EcR nuclear translocation or ecdysone biosynthesis would improve the clarity of the findings.

      We appreciate the concern raised. Based on our observations, we have proposed that Nup107 acts upstream of the PTTH-torso-20E-EcR axis regulating developmental transitions. We know that Nup107 regulates torso levels, but we do not know whether Nup107 directly interacts with torso. We would also like to address whether Nup107 exerts control over PTTH levels.

      However, we must emphasize that Nup107 does not directly regulate the translocation of EcR. On the contrary, we have demonstrated that when Nup107 is depleted only in the salivary gland, EcR translocates into the nucleus. Thus, we conclude that EcR translocation is 20E dependent and Nup107 independent. Further, we have argued that Nup107 regulates the expression of Halloween genes required for ecdysone biosynthesis. We are interested in identifying whether Nup107 associates with chromatin, directly or through another protein, to bring about the changes in gene expression required for normal development.

      There are some typographical errors and overly strong phrases, such as "unequivocally demonstrate," which could be softened. Additionally, the presentation of redundant data in different tissues could be streamlined to enhance clarity and flow.

      Response: We thank the reviewer for this observation. We have made our best effort to remove all typographical errors and have now made more measured statements based on our conclusions.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      The manuscript presents compelling evidence that Nup107 plays a role in regulating ecdysone production. However, significant concerns remain regarding the effects on EcR localization and expression, as well as the claimed link between PTTH/Torso signaling and Nup107's function, as the evidence provided is not conclusive.

      The hypothesis that Nup107 mediates EcR translocation from the cytoplasm to the nucleus appears misinterpreted by the authors. Based on the presented images, particularly for the prothoracic gland (PG) Figure 3C, Nup107 depletion seems to impact EcR protein levels rather than its localization. This conclusion is supported by data showing that EcR transcripts are autonomously downregulated in the absence of Nup107. Furthermore, the restoration of nuclear EcR levels upon exogenous 20E supplementation suggests that (1) Nup107 is dispensable for EcR activation and function, and (2) its primary role lies in regulating ecdysone production.

      We appreciate the concern raised by the reviewer. However, we must clarify that we do not claim that Nup107 directly regulates the translocation of EcR from the cytoplasm; rather, Nup107 regulates ecdysone hormone (20E) synthesis, which in turn affects EcR translocation. In the manuscript, we posed the question of whether Nup107 regulates EcR nuclear translocation (9th line of 2nd paragraph on page 6). We have spelled this out more clearly in the title of the 3rd subsection of the Results section and in the discussion (8th line of 2nd paragraph on page 11).

      20E acts through EcR to induce the transcription of EcR-responsive genes, including EcR itself. This creates a positive autoregulatory loop that enhances EcR levels through ecdysone signaling (1). Since Nup107 depletion leads to a reduction in ecdysone levels, it disrupts this transcriptional autoregulatory loop of EcR expression. This can contribute to the reduced EcR levels seen in Nup107-depleted animals.

      Given that nucleoporins are known to influence mRNA transport (for instance, Nup107 has been shown to control Scn5a mRNA transport; Guan et al., 2019), the observed effects on Halloween gene and EcR expression may stem from disruptions in mRNA transport to the cytoplasm. The downregulation of Shade further supports this hypothesis, as restricted ecdysone biosynthesis typically induces Shade upregulation in peripheral tissues. Quantifying potential mRNA accumulation in the nuclei of PG cells in Nup107-depleted animals would clarify this.

      The reviewer raised a valid point, and we fully agree that Nup107 has been shown to control Scn5a mRNA transport (Guan et al., 2019). The observed effects on Halloween gene and EcR expression could indeed stem from disruptions in efficient mRNA export to the cytoplasm. However, if Nup107 were regulating the mRNA export of Halloween genes and EcR, we would not expect torso overexpression to rescue the developmental delay phenotype of Nup107 depletion. Yet by overexpressing torso in the Nup107 depletion background, we activate torso pathway-dependent Halloween gene expression and rescue the developmental delay phenotype of Nup107 depletion.

      With the current data, it is difficult to conclusively claim a role for Nup107 in EcR translocation or expression. Additional experiments, such as EcR overexpression in Nup107-depleted animals or Nup107 overexpression, would help determine its precise role.

      We appreciate the concern raised by the reviewer. We did attempt to rescue the Nup107 depletion phenotype by overexpressing EcR (BL-6868) in the Nup107-RNAi background. However, we were unable to rescue the Nup107 depletion-dependent developmental delay phenotype with this approach. This further suggests that the phenotype is not merely due to low levels of EcR, but to low availability of ecdysone hormone and reduced EcR signaling.

      The second major issue is the proposed link between Nup107 and PTTH/Torso signaling. The authors suggest that Nup107 regulates ecdysone production through Torso expression based on rescue experiments. However, this is inconsistent with the distinct phenotypes observed when Nup107 or Torso signaling is disrupted. While PTTH/Torso signaling causes only a modest developmental delay (12 hours to 2 days, depending on the mutant), Nup107 depletion results in a complete developmental arrest at the larval stage. This discrepancy raises doubts about the assertion that Torso overexpression alone rescues such a severe phenotype. One possibility is that PTTH levels are upregulated in Nup107-depleted animals, leading to overactivation of the pathway when Torso is overexpressed. Quantifying PTTH levels in Nup107-depleted animals could address this.

      The reviewer raised a valid point, and we fully acknowledge this concern. While we do not completely agree with the idea of PTTH upregulation in Nup107 depleted larvae, as suggested here, we believe that quantifying PTTH levels upon Nup107 depletion can provide a useful insight. To address it, we quantified PTTH levels in Nup107-depleted larvae and found no significant change in PTTH expression compared to controls (Author response image 4).

      Author response image 4.

      Nup107 knockdown does not affect the PTTH level. Quantitation of PTTH transcript levels from control and Nup107-depleted larvae (prothoracic gland-specific depletion using Phm-Gal4). Data represent at least three independent experiments. Statistical significance was derived from the Student's t-test. ns, non-significant.

      Another possibility is that the stock used for Torso overexpression, which includes a trk mutant, may introduce genetic interactions that overactivate the pathway. Using a clean UAS-Torso stock would resolve this issue.

      We appreciate the reviewer’s observation regarding the use of the Torso overexpression line (BL-92604), which carries the trk null allele on the second chromosome. The cleaved form of trk serves as a ligand for the torso receptor. Since trk may serve as a ligand for torso, it is unclear to us how a line bearing the trk null allele would overactivate the pathway when used for torso overexpression studies.

      We were aware of this concern, and the fly line used in this study and reported in the manuscript was generated from the BL-92604 line through the following genetic strategy. First, a double balancer stock (Sco/CyO; MKRS/TM6.Tb) was used to generate the Sco/CyO; UAS-torso/UAS-torso genotype. This recombinant line was subsequently combined with the Nup107<sup>KK</sup> line. Through the double balancer strategy, we effectively replaced the second chromosome with the one carrying the Nup107 RNAi, thereby ensuring that our final experimental setup is free from any trk mutant contamination.

      Moreover, the rescue of Nup107 depletion phenotypes by RasV12 overexpression suggests that multiple RTKs, not just Torso, are affected. EGFR signaling, the primary regulator of ecdysone biosynthesis in the PG during the last larval stage, is notably absent from the authors' analysis. EGFR inactivation is known to arrest development, and previous studies indicate that Nup107 can reduce EGFR pathway activity (Kim et al, 2010). The authors should analyze EGFR pathway activity in the absence of Nup107. Overexpressing EGF ligands like Vein or Spitz in the PG (rather than the receptor) in a Nup107-depleted background would provide more relevant insights.

      The RasGTPase is one of the common effector molecules downstream of an activated receptor kinase. Rescue with a constitutively activated form of RasGTPase (RasV12) suggests one of the routes which is activated downstream of the torso receptor. It does not directly suggest all different RTKs are affected and are involved. Our idea of performing a rescue experiment was to see if the pathway activated downstream of the torso involves RasGTPase. 

      As noted in the literature, five RTKs (torso, InR, EGFR, Alk, and Pvr) stimulate the PI3K/Akt pathway, which plays a crucial role in the PG in controlling pupariation and body size (3). Although EGFR signaling is important, PTTH/Torso signaling is considered the primary mediator of metamorphic timing. In response to the suggestion to analyze EGFR pathway activity in the absence of Nup107, we attempted to rescue the phenotype by overexpressing constitutively active EGFR (BL-59843) in the Nup107-depleted background (data not shown). We used constitutively active EGFR to bypass any dependence on the availability of its ligands (vein and spitz). We were unable to rescue the phenotype with this approach, which further suggests that EGFR is not the targeted RTK pathway in this context. By rescuing with torso, we found that Nup107 regulates torso-mediated Ras/Erk signaling to control metamorphosis.

      Additional issues require clarification:

      (1) RNAi Efficiency: In Figure 1C, the Nup107GD line shows a stronger knockdown effect than Nup107KK, yet most experiments were conducted with the weaker line. This might explain the residual Nup107 protein observed in Figure 2. Could the authors justify this choice?

      This is a very valid point, and we are aware of the consequences of off-target effects of RNAi. To confirm that the observed effects are due to authentic RNAi and to minimize off-target effects, we used two RNAi lines (Nup107<sup>GD</sup> and Nup107<sup>KK</sup>) against Nup107. Both RNAi lines induced comparable levels of Nup107 reduction, and ubiquitous and PG-specific knockdown with these lines produced similar phenotypes. Although the Nup107<sup>GD</sup> line exhibited a relatively stronger knockdown than the Nup107<sup>KK</sup> line, we preferentially used the Nup107<sup>KK</sup> line because the Nup107<sup>GD</sup> line is based on a P-element insertion whose exact landing site is unknown. Furthermore, an off-target is predicted for the Nup107<sup>GD</sup> line, in which a 19 bp sequence aligns with the bifocal (bif) sequence. The bif-encoded protein is involved in axon guidance and regulation of axon extension. In contrast, the Nup107<sup>KK</sup> line has no predicted off-target, and its precise landing site on the second chromosome is known. Thus, the Nup107<sup>KK</sup> line was ultimately used for experimentation because of its clearer and more reliable genetic background.

      (2) Control Comparisons: In Figure 3, the effects of Nup107 depletion on EcR expression in salivary glands (SG) and PG are shown, but only SG controls are provided. Including PG controls would enable proper comparisons. These controls should also be added to Figures 5, 6, and S5.

      As suggested by the reviewer, we have also checked EcR localization in the prothoracic gland (Author response image 5). When PGs isolated from control, Nup107-RNAi, and torso-overexpressing Nup107-RNAi larvae were stained for EcR, the observations were indistinguishable from those made in SGs of the same genetic combinations. This indicates that Nup107 regulates EcR signaling by regulating 20E biosynthesis.

      Author response image 5.

      Prothoracic gland-specific torso expression rescues EcR nuclear translocation defects. Immunofluorescence-based detection of the nucleocytoplasmic distribution of EcR (EcR antibody, red) in third instar larval prothoracic gland nuclei of control, prothoracic gland-specific Nup107 knockdown (Phm-Gal4>Nup107<sup>KK</sup>), and torso-overexpressing PG-specific Nup107 knockdown (Phm-Gal4>Nup107<sup>KK</sup>; UAS-torso) larvae. DNA is stained with DAPI. Scale bars, 20 μm.

      (3) Clarify the function of Torso in the text: The authors must revise their description of Torso signaling as the primary regulator of ecdysone production in both the results and discussion sections. Specifically, in the results section, the claim that Torso depletion induces developmental arrest is inaccurate. Instead, available evidence, including Rewitz et al. 2009, demonstrates that Torso depletion causes a delay of approximately five days rather than a complete developmental arrest. This discrepancy should be corrected to avoid overstating the role of Torso signaling in ecdysone regulation and to align the manuscript with established findings.

      We agree with the reviewer. We have incorporated the suggestion at the relevant place in the main manuscript.

      Reviewer #3 (Recommendations for the authors):

      These findings suggest that Nup107 is involved in regulating ecdysone signaling during developmental transitions, with depletion of Nup107 disrupting hormone-regulated processes. Moreover, the rescue experiments hint that Nup107 might directly influence EcR signaling and ecdysone biosynthesis, though the precise molecular mechanism remains unclear.

      Overall, the manuscript presents compelling data supporting Nup107's role in regulating developmental transitions. However, I have a few comments for consideration:

      Major Comments:

      (1) RNAi Specificity: While RNAi is a powerful tool, the authors do not sufficiently address potential off-target effects, which could undermine the conclusions. Although a mutant Nup107 is described, it is lethal; are heterozygous or clonal studies possible to validate the findings more robustly?

      This is a very valid point, and we are aware of the consequences of off-target effects of RNAi. To confirm that the observed effects are due to authentic RNAi and to minimize off-target effects, we used two RNAi lines (Nup107<sup>GD</sup> and Nup107<sup>KK</sup>) against Nup107. Both RNAi lines induced comparable levels of Nup107 reduction, and ubiquitous and PG-specific knockdown with these lines produced similar phenotypes. Although the Nup107<sup>GD</sup> line exhibited a relatively stronger knockdown than the Nup107<sup>KK</sup> line, we preferentially used the Nup107<sup>KK</sup> line because the Nup107<sup>GD</sup> line is based on a P-element insertion whose exact landing site is unknown. Furthermore, an off-target is predicted for the Nup107<sup>GD</sup> line, in which a 19 bp sequence aligns with the bifocal (bif) sequence. The bif-encoded protein is involved in axon guidance and regulation of axon extension. In contrast, the Nup107<sup>KK</sup> line has no predicted off-target, and its precise landing site on the second chromosome is known. Thus, the Nup107<sup>KK</sup> line was ultimately used for experimentation because of its clearer and more reliable genetic background.

      Following the reviewer's suggestion, we considered conducting heterozygous and clonal analyses using the Nup107 mutant. We have carried out Nup107 knockdown studies in the prothoracic gland, which has a limited number of cells (50-60 cells) and is known to exhibit polyteny (8). Given these features of the prothoracic gland, a clonal study is unlikely to yield the phenotype. However, we will also consider moving forward with this approach.

      (2) NPC Complex Specificity: It remains unclear whether the observed defects are specific to Nup107 or if other NPC components also cause similar defects. If the authors are unable to use Nup107 mutants, they could demonstrate similar defects with other critical NPC members to bolster their claim.

      We thank the reviewer for raising this concern. When working with a Nup complex such as the Nup107 complex, this concern is anticipated but difficult to address, as many Nups function beyond their complex identity. Our analysis of Nup153-depleted organisms indicates no developmental delay or defect. We have also assessed the effects of knocking down all other members of the Nup107 complex, including dELYS; apart from Nup107, no member of the complex induced developmental arrest at the third instar stage with a resulting lack of pupariation. However, the null mutant of Nup133, the direct interactor of Nup107 in the Nup107 complex, induces a delay in pupariation (unpublished data).

      (3) Molecular Mechanism of EcR Signaling: The manuscript shows that Nup107 depletion affects EcR signaling and ecdysone biosynthesis, but the molecular basis of this regulation is not fully explored. Does phosphorylated ERK (p-ERK) fail to enter the nucleus? Clarifying this mechanism would strengthen the study's impact.

      We appreciate the reviewer’s insightful comment and fully agree with the concern. To address this, we examined the subcellular localization of phosphorylated ERK (p-ERK) in the prothoracic gland of control larvae, Nup107-depleted larvae, and Nup107-depleted larvae with torso overexpression. In control larvae, p-ERK was predominantly localized in the nucleus. However, in Nup107-depleted larvae, p-ERK was largely retained in the cytoplasm, indicating impaired pathway activation and nuclear translocation. Notably, overexpression of the torso in the Nup107-depleted background restored nuclear localization of p-ERK in the prothoracic gland (Author response image 6). These findings suggest that Nup107 regulates Drosophila metamorphosis, in part, through modulation of torso-mediated MAPK signaling.

      Author response image 6.

      Nup107 regulates torso activation-dependent p-ERK localization. Detection of the nucleocytoplasmic distribution of p-ERK (anti-p-ERK antibody, green) in the third instar larval prothoracic glands of control, PG-specific Nup107 knockdown (Phm-Gal4>Nup107<sup>KK</sup>), and PG-specific torso overexpression in the Nup107 knockdown background (Phm-Gal4>Nup107<sup>KK</sup>; UAS-torso). DNA is stained with DAPI. Scale bars, 20 µm.

      Minor Comments:

      (1) The manuscript contains typographical errors that may hinder readability. Additionally, some phrases (e.g., "unequivocally demonstrate") may be overly strong. Consider adjusting language to reflect the nature of the data more accurately.

      We agree with the reviewer. We have edited the manuscript to iron out such typographical errors and have softened overly strong phrasing at the relevant places in the main manuscript.

      (2) The data presentation could be improved by eliminating redundancy. Some sections repeat similar findings in different tissues, which could be consolidated to improve clarity and flow.

      While we agree with the comment, we could not avoid tissue redundancy when presenting our EcR translocation data, as no alternative tissue was available to us. However, we have included EcR localization and p-ERK translocation data in the responses to provide a non-redundant tissue perspective (Author response images 5 and 6).

      References:

      (1) Varghese, Jishy, and Stephen M Cohen. “microRNA miR-14 acts to modulate a positive autoregulatory loop controlling steroid hormone signaling in Drosophila.” Genes & development vol. 21,18 (2007): 2277-82. doi:10.1101/gad.439807

      (2) Rewitz, Kim F et al. “The insect neuropeptide PTTH activates receptor tyrosine kinase torso to initiate metamorphosis.” Science (New York, N.Y.) vol. 326,5958 (2009): 1403-5. doi:10.1126/science.1176450

      (3) Pan, Xueyang, and Michael B O'Connor. “Coordination among multiple receptor tyrosine kinase signals controls Drosophila developmental timing and body size.” Cell reports vol. 36,9 (2021): 109644. doi:10.1016/j.celrep.2021.109644

      (4) Pascual-Garcia, Pau et al. “Metazoan Nuclear Pores Provide a Scaffold for Poised Genes and Mediate Induced Enhancer-Promoter Contacts.” Molecular cell vol. 66,1 (2017): 63-76.e6. doi:10.1016/j.molcel.2017.02.020

      (5) Pascual-Garcia, Pau et al. “Nup98-dependent transcriptional memory is established independently of transcription.” eLife vol. 11 e63404. 15 Mar. 2022, doi:10.7554/eLife.63404

      (6) Kadota, Shinichi et al. “Nucleoporin 153 links nuclear pore complex to chromatin architecture by mediating CTCF and cohesin binding.” Nature communications vol. 11,1 2606. 25 May. 2020, doi:10.1038/s41467-020-16394-3

      (7) Gozalo, Alejandro et al. “Core Components of the Nuclear Pore Bind Distinct States of Chromatin and Contribute to Polycomb Repression.” Molecular cell vol. 77,1 (2020): 67-81.e7. doi:10.1016/j.molcel.2019.10.017

      (8) Shimell, MaryJane, and Michael B O'Connor. “Endoreplication in the Drosophila melanogaster prothoracic gland is dispensable for the critical weight checkpoint.” microPublication biology vol. 2023. 21 Feb. 2023, doi:10.17912/micropub.biology.000741

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      In this manuscript entitled "Molecular dynamics of the matrisome across sea anemone life history", Bergheim and colleagues report the prediction, using an established sequence analysis pipeline, of the "matrisome" (that is, the compendium of genes encoding constituents of the extracellular matrix) of the starlet sea anemone Nematostella vectensis. Re-analysis of an existing scRNA-Seq dataset allowed the authors to identify the cell types expressing matrisome components at different developmental stages. Last, the authors apply time-resolved proteomics to provide experimental evidence of the presence of the extracellular matrix proteins at three different stages of the life cycle of the sea anemone (larva, primary polyp, adult) and show that different subsets of matrisome components are present in the ECM at different life stages with, for example, basement membrane components accompanying the transition from larva to primary polyp and elastic fiber components and matricellular proteins accompanying the transition from primary polyp to the adult stage.

      Strengths: 

      The ECM is a structure that has evolved to support the emergence of multicellularity and different transitions that have accompanied the complexification of multicellular organisms. Understanding the molecular makeup of structures that are conserved throughout evolution is thus of paramount importance. 

      The in-silico predicted matrisome of the sea anemone has the potential to become an essential resource for the scientific community to support big data annotation efforts and understand better the evolution of the matrisome and of ECM proteins, an important endeavor to better understand structure/function relationships. This study is also an excellent example of how integrating datasets generated using different -omic modalities can shed light on various aspects of ECM metabolism, from identifying the cell types of origins of matrisome components using scRNA-Seq to studying ECM dynamics using proteomics. 

      We greatly appreciate the positive feedback regarding the design of our study and the evolutionary significance of our findings.

      Weaknesses: 

      My concerns pertain to the three following areas of the manuscript: 

      (1) In-silico definition of the anemone matrisome using sequence analysis: 

      a) While a similar computational pipeline has been applied to predict the matrisome of several model organisms, the authors fail to provide a comprehensive definition of the anemone matrisome: In the text, the authors state the anemone matrisome is composed of "551 proteins, constituting approximately 3% of its proteome" (see page 6, line 14), but Figure 1 lists 829 entries as part of the "curated" matrisome, Supplementary Table S1 lists the same 829 entries, and the authors state that "Here, we identified 829 ECM proteins that comprise the matrisome of the sea anemone Nematostella vectensis" (see page 17, line 10). Is the sea anemone matrisome composed of 551 or 829 genes? If we refer to the text, the additional 278 entries should not be considered as part of the matrisome, but what is confusing is that some are listed as glycoproteins, and the "new_manual_annotation" entries proposed by the authors, which refer to the protein domains found in these additional proteins, suggest that some could or should in fact be classified as matrisome proteins. For example, shouldn't the two lectins encoded by NV2.3951 and NV2.3157 be classified as matrisome-affiliated proteins? Based on what has been done for other model organisms, receptors have typically been excluded from the "matrisome" but included as part of the "adhesome"; for consistency with previously published matrisomes, the reviewer is left wondering whether the components classified as "Other" / "Receptor" should not be excluded from the matrisome and moved to a separate "adhesome" list.

      In addition to receptors, the authors identify nearly 70 glycoproteins classified as "Other". Here, does other mean "non-matrisome" or "another matrisome division" that is not core or associated? If the latter, could the authors try to propose a unifying term for these proteins? Unfortunately, since the authors do not provide the reasons for excluding these entries from the bona fide matrisome (list of excluding domains present, localization data), the reader is left wondering how to treat these entries. 

      Overall, the study would gain in strength if the authors could be more definitive and, if needed, even propose novel additional matrisome annotations to include the components for now listed as "Other" (as was done, for example, for the Drosophila or C. elegans matrisomes). 

      The reviewer is correct to point out the confusing terminology used throughout our manuscript, where both the total of 829 proteins constituting the curated list of ECM domain proteins and the actual matrisome (excluding "others") were referred to as "matrisomes". In general, we followed the example set by Naba & Hynes in their 2012 paper (Mol Cell Proteomics. 2012 Apr;11(4):M111.014647. doi: 10.1074/mcp.M111.014647), where they define the "matrisome" as encompassing all components of the extracellular matrix ("core matrisome") and those associated with it ("matrisome-associated" proteins). This corresponds to our group of 551 proteins, comprising both core matrisome and matrisomeassociated proteins. The Naba & Hynes paper also contains the inclusive and exclusive domain lists for the matrisome that we applied for our dataset. In the revised manuscript, we have now labelled the group of 829 proteins as "curated ECM domain proteins/genes", which includes all proteins positively selected for containing a bona fide ECM domain. After excluding non-matrisomal proteins such as receptors, we arrive at the 551 proteins that constitute the "Nematostella matrisome". We have maintained this terminology throughout the revised manuscript and have revised Figures 1B and 4B accordingly.

      Regarding the category of "other" proteins, which by definition are not part of the matrisome although they contain ECM domains, we have taken the reviewer's advice and classified these in more detail. We categorized all receptors as "adhesome" (202 proteins). The remaining group of "other" secreted ECM domain proteins was then further subcategorized. Those exhibiting significant matches in the ToxProt database were subclassified as "putative venoms" (15 proteins). This group also includes the two lectins (NV2.3951 and NV2.3157), which had originally been shifted to the "other" category due to their classification as venoms. We categorized as "adhesive proteins" (28 proteins) factors such as coadhesins that, due to their domain architecture, resemble bioadhesive proteins described in proteomic studies of other invertebrate species, such as corals or sponges (see also https://doi.org/10.1016/j.jprot.2022.104506). Further sub-categories are stress/injury response proteins (9 proteins) and ion channels (6 proteins). The remaining 17 proteins were categorized as "uncharacterized ECM domain proteins". These include highly diverse proteins possessing either single ECM domains or novel domain combinations. We decided to retain those in our dataset as candidates for future functional characterization.

      b) It is surprising that the authors are not providing the full currently accepted protein names to the entries listed in Supplementary Table S1 and have used instead "new_manual_annotation" that resembles formal protein names. This liberty is misleading. In fact, the "new_manual_annotation" seems biased toward describing the reason the proteins were positively screened for through sequence analysis, but many are misleading because there is, in fact, more known about them, including evidence that they are not ECM proteins. The authors should at least provide the current protein names in addition to their "new_manual_annotations". 

      c) To truly serve as a resource, the Table should provide links to each gene entry in the Stowers Institute for Medical Research genome database used and some sort of versioning (this could be added to columns A, B, or D). Such enhancements would facilitate the assessment of the rigor of the list beyond the manual QC of just a few entries. 

      d) Since UniProt is the reference protein knowledge database, providing the UniProt IDs associated with the predicted matrisome entries would also be helpful, giving easy access to information on protein domains, protein structures, orthology information, etc. 

      e) In conclusion, at present, the study only provides a preliminary draft that should be more rigorously curated and enriched with more comprehensive and authoritative annotations if the authors aspire the list to become the reference anemone matrisome and serve the community. 

      Table S1 has been updated to include links to the respective Stowers Institute IDs (first two columns), as well as SwissProt IDs and current descriptions from both the Stowers Institute (SI) and SwissProt.

      In our manual annotations, we prioritized these over automated ones due to the considerable effort invested in examining each sequence individually. The cnidaria-specific minicollagens and NOWA proteins might serve as an example. According to the SI descriptions, the minicollagens are annotated as “keratin-associated protein, predicted or hypothetical protein, collagen-like protein and pericardin”. We classified these as minicollagens on the basis of overall domain architecture and of signature domains and sequence motifs, such as minicollagen cysteine-rich domains (CRDs) and polyproline stretches (doi: 10.1016/j.tig.2008.07.001). NOWA is a CTLD/CRD-containing protein that is part of nematocyst tubules (doi:10.1016/j.isci.2023.106291). The first two NOWA isoforms, according to the SI descriptions, were annotated as aggrecan and brevican core proteins, which is very misleading. We therefore feel that our manual annotations better serve the cnidarian research community in classifying these proteins.

      Automated annotations of ECM proteins often rely on similarities between individual domains, neglecting overall domain composition. For example, Swissprot descriptions annotate 31 TSP1 domain-containing proteins in our list as "Hemicentin-1", but closer inspection reveals that only one sequence (NV2.24790) qualifies as Hemicentin-1 due to its characteristic vWFA, Ig-like, TSP1, G2 nidogen, and EGF-like domain architecture. Regarding novel protein annotations, NV2.650 might serve as an example. While SI descriptions annotate this protein as "epidermal growth factor" based on the presence of several EGF-like domains, our analysis reveals two integrin alpha N-terminal domains that classify this sequence as integrin-related. We have therefore assigned a description (Secreted integrin-N-related protein) that references this defining domain and avoids misclassification within the EGF family.

      In cases where the automated annotation (including those in Genbank) matched our own findings, we adopted the existing description, as seen with netrin-1 (NV2.7734). We acknowledge that our manual annotations are not flawless and will be refined by future research. Nonetheless, we offer them as an approximation to a more accurate definition of the identified protein list.

      (2) Proteomic analysis of the composition of the mesoglea during the sea anemone life cycle: 

      a) The product of 287 of the 829 genes proposed to encode matrisome components was detected by proteomics. What about the other ~550 matrisome genes? When and where are they expressed? The wording employed by the authors (see line 11, page 13) implies that only these 287 components are "validated" matrisome components. Is that to say that the other ~550 predicted genes do not encode components of the ECM? This should be discussed. 

      Obviously, our wording was not sufficiently accurate here. In the revised Fig. 1B we indicated that 210 of the 551 matrisome (core and associated) proteins were confirmed by mass spectrometry. In total, 287 proteins were identified by mass spectrometry, meaning that 77 of those are non-matrisomal proteins belonging to the “adhesome” (47) and “other” (30) groups. The fact that the remaining 542 proteins of the matrisome predicted by our in silico analysis could not be identified has two major reasons: (1) Our study was focussed on the molecular dynamics of the mesoglea. Therefore, only mesogleas were isolated for the mass spectrometry analysis and nematocysts were mostly excluded by extensive washing steps. As nematocysts contribute significantly to the predicted matrisome, this group of proteins is underrepresented in the mass spectrometry analysis. (2) A significant fraction of the predicted ECM proteins constitutes soluble factors and transmembrane receptors. These might not be necessarily part of the mesoglea isolates. In addition, the isolation and solubilization method we applied might have technical limitations. Although we used harsh conditions for solubilizing the mesoglea samples (90°C and high DTT concentrations), we cannot exclude that we missed proteins which resisted solubilization and thus trypsinization. We confirmed that all genes predicted by the in silico analysis have transcriptomic profiles as demonstrated in supplementary table S4. We have clarified these points in the revised results part (p.6) and also revised the statement in line 16, page 13.
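The counts quoted in this response can be cross-checked with a few lines of arithmetic. The following is only a sketch: the variable names are illustrative and not taken from the manuscript, and the final figure follows when the 287 detections are subtracted from the 829 curated ECM domain proteins.

```python
# Counts quoted in the response above; variable names are illustrative only.
curated_ecm_domain_proteins = 829
matrisome_confirmed_by_ms = 210   # core and matrisome-associated proteins
adhesome_confirmed_by_ms = 47
other_confirmed_by_ms = 30

# Non-matrisomal proteins detected by mass spectrometry.
non_matrisomal_confirmed = adhesome_confirmed_by_ms + other_confirmed_by_ms

# All proteins from the curated list that were detected.
total_confirmed_by_ms = matrisome_confirmed_by_ms + non_matrisomal_confirmed

# Curated proteins that were predicted in silico but not detected.
not_detected = curated_ecm_domain_proteins - total_confirmed_by_ms

print(non_matrisomal_confirmed)  # 77
print(total_confirmed_by_ms)     # 287
print(not_detected)              # 542
```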

      b) Can the authors comment on how they have treated zero TMT values or proteins for which a TMT ratio could not be calculated because unique to one life stage, for example? 

      We did not include these proteins in the analysis of the respective statistical comparison. This involved only very few proteins (about 10).  
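One way such an exclusion could look in practice is sketched below. This is not the authors' pipeline: the function names, the data layout, and the example intensities are all hypothetical, and the sketch only illustrates skipping a protein in a given pairwise comparison when either value is zero or missing.

```python
import math

def quantifiable(intensity_a, intensity_b):
    """Return True when both values are present and positive, so a ratio is defined."""
    return all(
        v is not None and not math.isnan(v) and v > 0
        for v in (intensity_a, intensity_b)
    )

def log2_ratios(intensities, stage_a, stage_b):
    """Compute log2(stage_b / stage_a) per protein, skipping unquantifiable ones."""
    ratios = {}
    for protein, by_stage in intensities.items():
        a = by_stage.get(stage_a)
        b = by_stage.get(stage_b)
        if quantifiable(a, b):
            ratios[protein] = math.log2(b / a)
    return ratios

# Hypothetical intensities: protein_2 has a zero value and protein_3 lacks a stage.
example = {
    "protein_1": {"larva": 100.0, "polyp": 400.0},
    "protein_2": {"larva": 0.0, "polyp": 250.0},
    "protein_3": {"larva": 80.0},
}
result = log2_ratios(example, "larva", "polyp")
print(result)  # {'protein_1': 2.0}
```

Proteins dropped from one comparison remain available for any other comparison in which both values are present.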

      c) Could the authors provide a plot showing the distribution of protein abundances for each matrisome category in the main figure 4? In mammals, the bulk of the ECM is composed of collagens, followed by fibrillar ECM glycoproteins, the other matrisome components being more minor. Is a similar distribution observed in the sea anemone mesoglea? 

      We have included such a plot showing protein abundances across life stages and protein categories (Fig. 4A). Collagens and basement membrane proteoglycans (perlecan) are the most abundant protein categories in the core matrisome while secreted factors dominate in the matrisome-associated group.

      d) Prior proteomic studies on the ECM of vertebrate organisms have shown the importance of allowing certain post-translational modifications during database search to ensure maximizing peptide-to-spectrum matching. Such PTMs include the hydroxylation of lysines and prolines that are collagen-specific PTMs. Multiple reports have shown that omitting these PTMs while analyzing LC-MS/MS data would lead to underestimating the abundance of collagens and the misidentification of certain collagens. The authors may want to reanalyze their dataset and include these PTMs as part of their search criteria to ensure capturing all collagen-derived peptides. 

      Thank you for this suggestion. We have re-analyzed our dataset including lysine and proline hydroxylation as PTMs. While we obtained a total of 70 more proteins using this approach, this additional group did not contain any large collagen or minicollagen that we had not detected before. We only obtained two additional collagen-like proteins with very short triple helical domains (V2t013973001.1, NV2t024002001.1), one being a fragment. We don’t feel this justifies implementing a re-analysis of the proteome in our study.

      e) The authors should ensure that reviewers are provided with access to the private PRIDE repository so the data deposited can also be evaluated. They should also ensure that sufficient metadata is provided using the SDRF format to allow the re-use of their LC-MS/MS datasets. 

      We apologize for not providing the reviewer access in our initial submission and have asked the editorial office to forward the PRIDE repository link to all reviewers immediately after receiving the reviews. We did upload a metadata.csv file with the proteomics dataset. This file contains an annotation of all TMT labels to the samples, conditions, and replicates used in the manuscript. It contains similar information to an SDRF format file. In addition, the search output files at the protein and PSM level have been provided. So, from our point of view, we provided all necessary information to reproduce the analysis.

      (3) Supplementary tables: 

      The supplementary tables are very difficult to navigate. They would become more accessible to readers and non-specialists if they were accompanied by brief legends or "README" tabs and if the headers were more detailed (see, for example, Table S2, what does "ctrl.ratio_Larvae_rep2" exactly refer to? Or Table S6 whose column headers using extensive abbreviations are quite obscure). Similarly, what do columns K to BX in Supplementary Table S1 correspond to? Without more substantial explanations, readers have no way of assessing these data points. 

      We have revised the tables and removed any redundant data columns. We also included detailed explanations of the used abbreviations, both in the headers and in a separate README file. Some of the information was apparently lost during the conversion to pdf files. We will therefore upload the original .xls files when submitting the revised manuscript.

      Reviewer #2 (Public review): 

      This work set out to identify all extracellular matrix proteins and associated factors present within the starlet sea anemone Nematostella vectensis at different life stages. Combining existing genomic and transcriptomic datasets, alongside new mass spectrometry data, the authors provide a comprehensive description of the Nematostella matrisome. In addition, immunohistochemistry and electron microscopy were used to image whole mount and decellularized mesoglea from all life stages. This served to validate the de-cellularization methods used for proteomic analyses, but also resulted in a very nice description of mesoglea structure at different life stages. A previously published developmental cell type atlas was used to identify the cell type specificity of the matrisome, indicating that the core matrisome is predominantly expressed in the gastrodermis, as well as cnidocytes. The analyses performed were rigorous and the results were clear, supporting the conclusions made by the authors. 

      Thank you. We greatly appreciate the positive assessment of our study.

      Reviewer #3 (Public review): 

      Summary: 

      This manuscript by Bergheim et al investigates the molecular and developmental dynamics of the matrisome, a set of gene products that comprise the extracellular matrix, in the sea anemone Nematostella vectensis using transcriptomic and proteomic approaches. Previous work has examined the matrisome of the hydra, a medusozoan, but this is the first study to characterize the matrisome in an anthozoan. The major finding of this work is a description of the components of the matrisome in Nematostella, which turns out to be more complex than that previously observed in hydra. The authors also describe the remodeling of the extracellular matrix that occurs in the transition from larva to primary polyp, and from primary polyp to adult. The authors interpret these data to support previously proposed (Steinmetz et al. 2017) homology between the cnidarian endoderm with the bilaterian mesoderm. 

      Strengths: 

      The data described in this work are robust, combining both transcriptome and proteomic interrogation of key stages in the life history of Nematostella, and are of value to the community. 

      Thank you for your positive assessment of our dataset. 

      Weaknesses: 

      The authors offer numerous evolutionary interpretations of their results that I believe are unfounded. The main problem with extending these results, together with previous results from hydra, into an evolutionary synthesis that aims to reconstruct the matrisome of the ancestral cnidarian is that we are considering data from only two species. I agree with the authors' depiction of hydra as "derived" relative to other medusozoans and see it as potentially misleading to consider the hydra matrisome as an exemplar for the medusozoan matrisome. Given the organismal and morphological diversity of the phylum, a more thorough comparative study that compares matrisome components across a selection of anthozoan and medusozoan species using formal comparative methods to examine hypotheses is required. 

      Specifically, I question the authors' interpretation of the evolutionary events depicted in this statement: 

      "The observation that in Hydra both germ layers contribute to the synthesis of core matrisome proteins (Epp et al. 1986; Zhang et al. 2007) might be related to a secondary loss of the anthozoan-specific mesenteries, which represent extensions of the mesoglea into the body cavity sandwiched by two endodermal layers." 

      Anthozoans and medusozoans are evolutionary sisters. Therefore, the secondary loss of "anthozoan-like mesenteries" in hydrozoans is at least as likely as the gain of this character state in anthozoans. By extension, there is no reason to prefer the hypothesis that the state observed in Nematostella, where gastroderm is responsible for the synthesis of the core matrisome components, is the ancestral state of the phylum. Moreover, the fossil evidence provided in support of this hypothesis (Ou et al. 2022) is not relevant here because the material described in that work is of a crown group anthozoan, which diversified well after the origin of Anthozoa. The phylogenetic structure of Cnidaria has been extensively studied using phylogenomic approaches and is generally well supported (Kayal et al. 2018; DeBiasse et al. 2024). Based on these analyses, anthozoans are not on a "basal" branch, as the authors suggest. The structure of cnidarian phylogeny bifurcates with Anthozoa forming one clade and Medusozoa forming the other. From the data reported by Bergheim and coworkers, it is not possible to infer the evolutionary events that gave rise to the different matrisome states observed in Nematostella (an anthozoan) and hydra (a medusozoan). Furthermore, I take the observation in Fig 5 that anthozoan matrisomes generally exhibit a higher complexity than other cnidarian species to be more supportive of a lineage-specific expansion of matrisome components in the Anthozoa, rather than those components being representative of an ancestral state for Cnidaria. Whatever the implication, I take strong issue with the statement that "the acquisition of complex life cycles in medusozoa, that are distinguished by the pelagic medusa stage, led to a secondary reduction in the matrisome repertoire." 
      There is no causal link in any of the data or analyses reported by Bergheim and co-workers to support this statement and, as stated above, while we are dealing with limited data, insufficient to address this question, it seems more likely to me that the matrisome expanded in anthozoans, contrasting with the authors' conclusions. While the discussion raises many interesting evolutionary hypotheses related to the origin of the cnidarian matrisome, which is of vital interest if we are to understand the origin of the bilaterian matrisome, a more thorough comparative analysis, inclusive of a much greater cnidarian species diversity, is required if we are to evaluate these hypotheses. 

      DeBiasse MB, Buckenmeyer A, Macrander J, Babonis LS, Bentlage B, Cartwright P, Prada C, Reitzel AM, Stampar SN, Collins A, et al. 2024. A Cnidarian Phylogenomic Tree Fitted With Hundreds of 18S Leaves. Bulletin of the Society of Systematic Biologists [Internet] 3. Available from: https://ssbbulletin.org/index.php/bssb/article/view/9267

      Epp L, Smid I, Tardent P. 1986. Synthesis of the mesoglea by ectoderm and endoderm in reassembled hydra. J Morphol [Internet] 189:271-279. Available from: https://pubmed.ncbi.nlm.nih.gov/29954165/ 

      Kayal E, Bentlage B, Sabrina Pankey M, Ohdera AH, Medina M, Plachetzki DC, Collins AG, Ryan JF. 2018. Phylogenomics provides a robust topology of the major cnidarian lineages and insights on the origins of key organismal traits. BMC Evol Biol [Internet] 18:1-18. Available from: https://bmcecolevol.biomedcentral.com/articles/10.1186/s12862-018-1142-0

      Ou Q, Shu D, Zhang Z, Han J, Van Iten H, Cheng M, Sun J, Yao X, Wang R, Mayer G. 2022. Dawn of complex animal food webs: A new predatory anthozoan (Cnidaria) from Cambrian. The Innovation 3:100195 

      Steinmetz PRH, Aman A, Kraus JEM, Technau U. 2017. Gut-like ectodermal tissue in a sea anemone challenges germ layer homology. Nature Ecology & Evolution [Internet] 1:1535-1542. Available from: https://www.nature.com/articles/s41559-017-0285-5

      Zhang X, Boot-Handford RP, Huxley-Jones J, Forse LN, Mould AP, Robertson DL, Li L, Athiyal M, Sarras MP. 2007. The collagens of hydra provide insight into the evolution of metazoan extracellular matrices. J Biol Chem [Internet] 282:6792-6802. Available from: https://pubmed.ncbi.nlm.nih.gov/17204477/ 

      We agree with the reviewer that only the analysis of several additional anthozoan and medusozoan representatives will yield a valid basis for a reconstruction of the ancestral cnidarian matrisome and allow statements about ancestral or novel features within the phylum. We have therefore revised our statements in the discussion part of the manuscript by incorporating the cited literature as well as findings from medusozoan genome analysis (e.g. Gold et al., 2018) demonstrating that changes in gene content are as common in anthozoans as in medusozoans, which calls into question the previously stated “basal” state of Nematostella or of anthozoans in general.

      Reviewer #1 (Recommendations for the authors): 

      (1) In Figure 2A, an "o" is missing in the labeling of the "developing cnidcytes" population. 

      Thank you, we have corrected the typo.

      (2) It would be helpful to have the different life stages indicated as headers of the heat maps presented in Figure 4. 

      We have included symbolic representations for the different life stages on top of the heat maps in addition to the respective labels at the bottom.

      Reviewer #2 (Recommendations for the authors): 

      Important changes: 

      (1) Figure 2B The x-axis tissue names should be changed to something more easily readable/understandable - some are clear, but others are not. Perhaps abbreviations could be expanded in the legend. 

      We have expanded the legend in Fig. 2B to render it more easily readable. We have also rotated the maps in A to align them with the ones in Fig. 3B.

      (2) Figure 3B This figure would be improved by the inclusion of cluster names, to understand better the mapping. 

      We have added relevant cluster names to Fig. 3B and as stated above aligned the orientation of the maps in Fig. 2B and Fig. 3B.

      (3) Figure 3C As with 2B, I find the y-axis cnidocyte cell state names to be unclear at times. Perhaps abbreviations could be expanded in the legend. 

      All abbreviations were expanded in the Fig. 3C axis labels.

      (4) Many of the supplementary tables are not well exported or easily readable as is (gene names are truncated, headers truncated, etc), which means that they may not be easily usable by researchers in the field interested in following up on this work in other contexts. Indeed, to be more usable, please consider sharing these supplementary data as .csv files, for example, instead of as .pdfs. 

      We are sorry for this inconvenience, which was obviously caused by the conversion to pdf files. We will upload the original csv files when submitting the revised manuscript.

      Smaller nitpicky comments: 

      (5) Page 2 line 4 & page 3 line 7: Please consider a term other than "pre-bilaterian". The drawing/ordering of a phylogeny of extant species is not meaningful in terms of more or less ancestral. e.g. if the tips are flipped in the drawing of the tree, can we say that bilaterians are pre-cnidarians? What does that mean? 

      We have used that term on the basis that cnidarians existed before the appearance of bilaterians according to the fossil record and molecular phylogenies (McFadden et al., 2021; Adoutte et al., 2000; Cavalier-Smith et al., 1996; Collins, 1998; Kim et al., 1999; Medina et al., 2001; Wainright et al., 1993). To acknowledge remaining uncertainties in the timing of origin of animals, we will use the term “early-diverging metazoans” instead, which is widely accepted in the cnidarian community. 

      (6) Page 3 line 9 I was confused by the use of "gastrula-shaped body" to describe cnidarians, which are on the whole very morphologically diverse and don't all resemble gastrulae (that can also be quite diverse). 

      This term is sometimes used to refer to the diploblastic cnidarian body plan (outer ectoderm, inner endoderm) with a mouth that corresponds to the blastopore. To avoid misunderstandings, we changed it in the revised manuscript to “Cnidarians, the sister group to bilaterians, are characterized by a simple body plan with a central body cavity and a mouth opening surrounded by tentacles.”

      Reviewer #3 (Recommendations for the authors): 

      (1) In general, I felt there was a lot of discussion about protein structure and diversity that is difficult to follow without a figure. I think some of the information in Supplementary Figures S5, S9, and S11 should be in the main figures. 

      Following the reviewer’s suggestion, we have integrated Fig. S5 (collagens) into the main Fig. 2 and Fig. S9 (polydoms) into Fig. 4. As metalloproteases are not extensively discussed in the manuscript (and also due to the large size of the figure) we have kept Fig. S11 as a supplementary figure.

      (2) Page 3, Line 7: The use of the term "pre-bilaterian" is inappropriate. Cnidarians and bilaterians are evolutionary sisters. Therefore, each lineage derives from the same split and is the same age. The cnidarian lineage is not older than the bilaterian lineage. 

      Following a similar request by reviewer 2, we have replaced this term with “early-diverging metazoans”.

      (3) Page 5, Line 10. How were in silico matrisomes from early-branching metazoan species predicted? 

      We applied the same bioinformatic pipeline as for the Nematostella matrisome. We clarified this in the respective methods part.

      (4) Page 16, Line 8: This should be Thus. 

      Obviously, the wording of this sentence was ambiguous. We changed it to “In contrast, the adult mesoglea is significantly enriched in elastic fiber components, such as fibrillins and fibulin. This compositional shift likely adds to the visco-elastic properties (Gosline 1971a, b) of the growing body column (Fig. 4B,D, supplementary table S7).”

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer 1:

      While BAP1 mutant UM cell lines were included for some of the experiments, it seems the in-vivo data mentioned in the response to the reviewer's comment is missing? The authors stated that "MP46 (Supplementary Fig. 3a) is BAP1-null uveal melanoma cell line with no detectable protein expression (Amirouchene-Angelozzi et al., Mol Oncol 2014), and we have observed strong tumor growth inhibition in this CDX model with our BAF ATPase inhibitor." But the CDX model data shown in Figure 4 is from 92.1 cells. If this data is available, then the manuscript would benefit from its addition.

      We thank the reviewer for bringing this to our attention. As the reviewer mentioned, we show the 92-1 CDX model in our manuscript. Additionally, strong tumor growth inhibition in the MP46 CDX model treated with our BAF ATPase inhibitor can be found in Vaswani et al., 2025 (PMID:39801091, https://pubmed.ncbi.nlm.nih.gov/39801091/).

      Reviewer 3:

      Supplementary Figure 2C

      Is the T910M mutation in the parental MP41 cells heterozygous? If so, the authors should indicate this in the figure legend. If this is a homozygous mutation, the authors should explain how the inhibitors suppress SMARCA4 activity in cells that have a LOF mutation.

      We thank the reviewer for bringing this to our attention. We updated the figure legend accordingly to reflect the genotype of the mutations highlighted in the table.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The presented study by Centore and colleagues investigates the inhibition of BAF chromatin remodeling complexes. The study is well-written, and includes comprehensive datasets, including compound screens, gene expression analysis, epigenetics, as well as animal studies. This is an important piece of work for the uveal melanoma research field, and sheds light on a new inhibitor class, as well as a mechanism that might be exploited to target this deadly cancer for which no good treatment options exist.

      Strengths:

      This is a comprehensive and well-written study.

      Weaknesses:

      There are minimal weaknesses.

      We thank the reviewer for the positive comments.

      Reviewer #2 (Public Review):

      Summary:

      The authors generate an optimized small molecule inhibitor of SMARCA2/4 and test it in a panel of cell lines. All uveal melanoma (UM) cell lines in the panel are growth-inhibited by the inhibitor, making them the focus of the paper. This inhibition is correlated with the loss of promoter occupancy of key melanocyte transcription factors, e.g. SOX10. SOX10 overexpression and a point mutation in SMARCA4 can rescue growth inhibition exerted by the SMARCA2/4 inhibitor. Treatment of a UM xenograft model results in growth inhibition and regression, which correlates with reduced expression of SOX10 but no discernible toxicity in the mice. Collectively, the data suggest a novel treatment of uveal melanoma.

      Strengths:

      There are many strengths of the study including the strong challenge of the on-target effect, the assays used, and the mechanistic data. The results are compelling as are the effects of the inhibitor. The in vivo data is dose-dependent and doses are low enough to be meaningful and associated with evidence of target engagement.

      Weaknesses:

      The authors introduce the field stating that SMARCA4 inhibitors are more effective in SMARCA2 deficient cancers and the converse. Since the desirable outcome of cancer therapy would be synthetic lethality it is not clear why a dual inhibitor is desirable. Wouldn't this be associated with more side effects? It is not known how the inhibitor developed here impacts normal cells, in particular T cells which are essential for any durable response to cancer therapies in patients. Another weakness is that the UM cell lines used do not molecularly resemble metastatic UM. These UM most frequently have mutations in the BAP1 tumor suppressor gene. It is not clear if the described SMARCA2/4 inhibitor is efficacious in BAP1 mutant UM cell lines in vitro or BAP1 mutant patient-derived xenografts in vivo.

      We thank the reviewer for their insightful and constructive comments. As we demonstrate in Fig. 1d, uveal melanoma cells are selectively and deeply sensitive to BAF ATPase inhibition, which provides a therapeutic window. This is confirmed in Fig. 4a-c, where we demonstrated robust tumor growth inhibition at a dose that was well tolerated in the xenograft study. FHD-286, a dual BRM/BRG1 inhibitor similar to FHT-1015 with optimized physical properties, has been evaluated in a Phase I trial in patients with metastatic uveal melanoma (NCT04879017), and a manuscript describing the results of this clinical trial is currently in preparation.

      As the reviewer mentioned, BAP1 loss is a signature of metastatic uveal melanoma. MP38 is a BAP1 mutant uveal melanoma cell line, and we demonstrated growth inhibition and robust caspase 3/7 activity in response to FHT-1015 (Supplementary Fig. 3a and 3f). MP46 (Supplementary Fig. 3a) is a BAP1-null uveal melanoma cell line with no detectable protein expression (Amirouchene-Angelozzi et al., Mol Oncol 2014), and we have observed strong tumor growth inhibition in this CDX model with our BAF ATPase inhibitor.

      Reviewer #3 (Public Review):

      Summary:

      This manuscript reports the discovery of new compounds that selectively inhibit SMARCA4/SMARCA2 ATPase activity and that work through a different mode than previously developed SMARCA4/SMARCA2 inhibitors. The authors also demonstrate the anti-tumor effects of the compounds on uveal melanoma cell proliferation and tumor growth. The findings indicate that the drugs exert their effects by altering chromatin accessibility at binding sites for lineage-specific transcription factors within gene enhancer regions. In uveal melanoma, altered expression of the transcription factor SOX10 and of SOX10 target genes underlies the anti-proliferative effects of the compounds. This study is significant because the discovery of new SMARCA4/SMARCA2 inhibitory compounds that can abrogate uveal melanoma tumorigenicity has therapeutic value. In addition, the findings provide evidence for the therapeutic use of these compounds in other transcription factor-dependent cancers.

      Strengths:

      The strengths of this manuscript include biochemical evidence that the new compounds are selective for SMARCA4/SMARCA2 over other ATPases and that the mode of action is distinct from a previously developed compound, BRM014, which binds the RecA lobe of SMARCA2. There is also strong evidence that FHT1015 suppresses uveal melanoma proliferation by inducing apoptosis. The in vivo suppression of tumor growth without toxicity validates the potential therapeutic utility of one of the new drugs. The conclusion that FHT1015 primarily inhibits SMARCA4 activity and thereby suppresses chromatin accessibility at lineage-specific enhancers is substantiated by ATAC-seq and ChIP-seq studies.

      Weaknesses:

      The weaknesses include a lack of more precise information on which SMARCA4/SMARCA2 residues the drugs bind. Although the I1173M/I1143M mutations are evidence that the critical residues for binding reside outside the RecA lobe, this site is conserved in CHD4, which is not affected by the compounds. Hence, this site may be necessary but not sufficient for drug binding or specifying selectivity. A more precise evaluation of the region specifying the effect of the new compounds would strengthen the evidence that they work through a novel mode and that they are selective. Another concern is that the mechanisms by which FHT1015 promotes apoptosis rather than simply cell cycle arrest are not clear. Does SOX10 or another lineage-specific transcription factor underlie the apoptotic effects of the compounds?

      We thank the reviewer for the valuable comments.

      We believe that our dual ATPase inhibitor is selective and additional insights into binding specificity and selectivity for earlier stage compounds of this series were recently published in Vaswani et al., 2025 (PMID:39801091, https://pubmed.ncbi.nlm.nih.gov/39801091/).

      The reviewer also poses a great question regarding the mechanism of apoptosis. The mechanism of apoptosis is extremely complex, but we observed a decrease in pro-survival BCL-2 protein expression in response to FHT-1015, in the experiment corresponding to Supplementary Fig. 5e. In the experiment described in Fig. 3k, we also monitored caspase 3/7 activity over time, and SOX10 overexpression rescued 92-1 cells from FHT-1015 induced apoptosis. This suggests the role of SOX10 as an important mediator of response to BAF ATPase inhibition, including apoptosis induced by FHT-1015.

      Additional Reviews:

      The referees would like to draw the authors' attention to the following issues that would best benefit from additional revision. 

      The clinical relevance of the study would be strengthened by the use of uveal melanoma cell lines with BAP1 mutations that better represent metastatic uveal melanoma. The use of patient-derived xenografts would also be pertinent and would be a useful addition. Similarly, attention to the effects of the inhibitor on non-cancerous proliferative cells such as blood/T/immune cells would also strengthen the manuscript. As the study reports the administration of one of the inhibitors in mice for the xenograft experiments, it would be important to assess any potential effects on blood cell counts and better discuss the eventual toxicity or lack of toxicity and how it was assessed. 

      The authors should better explain how SOX10 over expression can rescue viability in the presence of the inhibitor. Similarly given the critical roles of BRG1, SOX10, and MITF in cutaneous melanoma some specific discussion on the sensitivity of cutaneous melanoma cells to the inhibitor should be considered, and potential differences with uveal melanoma highlighted. 

      Aside from these issues, the authors are urged to consider the other points mentioned below. 

      Reviewer #1 (Recommendations For The Authors): 

      Figure 1d, as well as the text in the manuscript referring to this figure, would benefit from indicating specific cell lines used for UM. The same for the sentence in line 153. 

      We thank the reviewer for bringing this to our attention. We have added the cell line names and updated the manuscript accordingly.

      For any of the studies conducted, is there any link with the genetics of UM? E.g. BAP1 wildtype/BAP1 mutant? 

      As addressed above in the public review section, MP38 is a BAP1 mutant uveal melanoma cell line, and we demonstrated growth inhibition and robust caspase 3/7 activity in response to FHT-1015 (Supplementary Fig. 3a and 3f). MP46 (Supplementary Fig. 3a) is a BAP1-null uveal melanoma cell line with no detectable protein expression (Amirouchene-Angelozzi et al., Mol Oncol 2014), and we have observed strong tumor growth inhibition in this CDX model with our BAF ATPase inhibitor.

      Row 191 - How were peaks classified as enhancer-occupied? 

      We used the annotatePeaks function of the HOMER package to annotate genomic locations, together with H3K27ac ChIP-seq to annotate peaks as enhancer-occupied. We thank the reviewer for pointing this out and have updated the manuscript to include this information.

      Row 259, the two cell lines should be named, also in Figure 3i. 

      We have added the cell line names and updated the manuscript accordingly.

      Reviewer #2 (Recommendations For The Authors): 

      As a proof of concept, this study is truly excellent and the authors should be commended. However, it is desirable that new knowledge in cancer is translated to the clinic. To this end there are a few things needed to strengthen the study. 

      I am rephrasing my statements from the public review to say that I would recommend testing the inhibitor in T cells (side effects) and BAP1 mutant cell lines (for clinical relevance). 

      As addressed in the public review section, MP38 is a BAP1 mutant uveal melanoma cell line, and we demonstrated growth inhibition and robust caspase 3/7 activity in response to FHT-1015 (Supplementary Fig. 3a and 3f). MP46 (Supplementary Fig. 3a) is a BAP1-null uveal melanoma cell line with no detectable protein expression (Amirouchene-Angelozzi et al., Mol Oncol 2014), and we have observed strong tumor growth inhibition in this CDX model with our BAF ATPase inhibitor.

      Regarding concerns for any potential side effect on T cells, we observed an increase in both CD4 and CD8 T-cell populations in the peripheral blood and the spleen, when naïve, non-tumor bearing CD-1 mice were dosed with SMARCA2/4 dual ATPase inhibitor FHD-286 once daily for 14 days. FHD-286 is a compound similar to FHT-1015 described in Vaswani et al., 2025 (PMID:39801091, https://pubmed.ncbi.nlm.nih.gov/39801091/). In addition, FHD-286 has been tested in tumor bearing syngeneic models. When B16F10 tumor bearing C57BL/6 were dosed with FHD-286 for 10 days, we observed an increase in CD69+ activated CD8 T-cell infiltration in the tumor microenvironment (doi:10.1136/jitc-2022-SITC2022.0888).

      Reviewer #3 (Recommendations For The Authors): 

      (1) Determine drug binding by crystal structure or generate additional SMARCA4 or SMARCA2 mutations in the region near I1173/I1143 that are not conserved in CHD4 and test them in an ATPase assay for effects on drug inhibition. For example, Q1166 in SMARCA4 and Q1136 in SMARCA2 could be changed to Alanine as in CHD4. Would this abrogate drug inhibition? 

      We believe that our dual ATPase inhibitor is selective and additional insights into binding specificity and selectivity for earlier stage compounds of this series were recently published in Vaswani et al., 2025 (PMID:39801091, https://pubmed.ncbi.nlm.nih.gov/39801091/).

      (2) The finding that SOX10 can rescue the antiproliferative effects of FHT1015 suggests that SMARCA4 is primarily needed for SOX10 expression. However, the co-occupancy of SMARCA4 and SOX10 at enhancers suggests that they cooperate to promote chromatin accessibility. It is unclear how over-expression of SOX10 can promote chromatin accessibility in drug-inhibited cells since SOX10 does not have chromatin remodeling activity. ATAC-seq in cells over-expressing SOX10 and treated with the drug could identify SOX10-dependent targets that do not require SMARCA4 activity and clarify the mechanism. It would also be informative to determine if SOX10 over-expression abrogates the effects of FHT1015 on both cell cycle and apoptosis, helping to resolve whether it is a partial or complete rescue of proliferation. 

      We agree that running ATAC-seq in cells overexpressing SOX10 would clarify this mechanism. However, shifts in corporate strategy deprioritized any further experiments for this project. One potential mechanism by which SOX10 overexpression could partially rescue the BAF inhibition phenotype is localization of the overexpressed SOX10 to open chromatin regions (mostly promoters) across the genome. We know from our ATAC-seq data (Fig. 2) that BAF inhibition leads to loss of chromatin accessibility at SOX10 enhancer sites, while promoter regions are only partially affected. Therefore, we think that overexpression of SOX10 would allow upregulation of its target genes via binding to the promoter regions. In this model, the enhancer-driven SOX10 target genes are likely to remain silenced.

      (3) Although the in vivo studies indicate that the drugs are well-tolerated, additional in vitro studies to determine the effects of the drug on the proliferation/survival of non-cancerous cells would further validate their therapeutic utility.

      Author Response: The reviewer raises a critical question. FHD-286, a dual BRM/BRG1 inhibitor similar to FHT-1015 with optimized physical properties, has been evaluated in a Phase I trial in patients with metastatic uveal melanoma (NCT04879017), and it was well tolerated at a continuous daily dose of up to 7.5 mg QD and at an intermittent dose of up to 17.5 mg QD. A manuscript describing the results of this clinical trial is currently in preparation.

    1. Author response:

      Reviewer #1 (Public review):

      It appears obvious that with no or a little fitness penalty, it becomes beneficial to have MHC-coding genes specific to each pathogen. A more thorough study that takes into account a realistic (most probably non-linear in gene number) fitness penalty, various numbers of pathogens that could grossly exceed the self-consistent fitness limit on the number of MHC genes, etc, could be more informative.

      The reviewer seems to be referring to the cost of excessively high presentation breadth.  Such a cost is irrelevant to the inferior fitness of a polymorphic population with heterozygote advantage compared to a monomorphic population with merely doubled gene copy number.  It is relevant to the possibility of a fitness valley separating these two states, but this issue is addressed explicitly in the manuscript.

      An addition or removal of one of the pathogens is reported to affect "the maximum condition", a key ecological characteristic of the model, by an enormous factor 10^43, naturally breaking down all the estimates and conclusions made in [RS]. This observation is not substantiated by any formulas, recipes for how to compute this number numerically, or other details, and is presented just as a self-standing number in the text.

      It is encouraging that the reviewer agrees that this observation, if correct, would cast doubt on the conclusions of Siljestam and Rueffler.  I would add that it is not the enormity of this factor per se that invalidates those conclusions, but the fact that the automatic compensatory adjustment of c<sub>max</sub> conceals the true effects of removing a pathogen, which are quite large.

      I am not sure why the reviewer doubts that this observation is correct.  The factor of 2.7∙10<sup>43</sup> was determined in a straightforward manner in the course of simulating the symmetric Gaussian model of Siljestam and Rueffler with the specified parameter values.  A simple way to determine this number is to have the simulation code print the value to which c<sub>max</sub>  is set, or would be set, by the procedure of Siljestam and Rueffler for different parameter values.  In another section of this response I will describe how to do this with the simulation code written and used by Siljestam and Rueffler; doing so confirms the value that I obtained with my own code.  Furthermore, I will now give a theoretical derivation of this factor.

      As specified by Siljestam and Rueffler, the positions of the m pathogens in (m-1)-dimensional antigenic space correspond to the vertices of a regular simplex centered at the origin, with distance between vertices equal to 1.  The squared distance from the origin to each of the m vertices of such a simplex is (m-1)/2m (https://polytope.miraheze.org/wiki/Simplex).  Thus, the sum of the m squared distances is (m-1)/2.  For the (0, 0) homozygote, condition is multiplied by a factor of exp(-(vr)<sup>2</sup>/2) for each pathogen, where r is the distance from the origin.  It follows that, with v=20, all the pathogens together decrease condition by a factor of exp(20<sup>2</sup>∙(m-1)/4) = exp(100∙(m-1)).  Thus, increasing or decreasing m by 1 changes this value by a factor of exp(100) = 2.7∙10<sup>43</sup>.
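      The derivation above can be checked numerically. The following is a minimal sketch, not code from either manuscript: the helper names `simplex_vertices` and `log_condition_factor` are hypothetical, and the simplex is built from scaled standard basis vectors in m dimensions, which is one convenient construction of a unit-edge regular simplex centered at the origin.

```python
import math


def simplex_vertices(m):
    """Return m vertices of a regular simplex with unit edge length,
    centered at the origin (hypothetical helper for illustration).

    Construction: scale the standard basis vectors of R^m by 1/sqrt(2)
    so that every pair is at distance 1, then subtract the centroid.
    """
    s = 1.0 / math.sqrt(2.0)
    centroid = s / m
    return [[(s if i == j else 0.0) - centroid for j in range(m)]
            for i in range(m)]


def log_condition_factor(m, v):
    """Sum over pathogens of (v*r)^2 / 2 for a homozygote at the origin,
    i.e. the natural log of the factor by which condition is reduced."""
    return sum(v ** 2 * sum(x * x for x in vert) / 2.0
               for vert in simplex_vertices(m))


m, v = 8, 20.0
verts = simplex_vertices(m)
edge = math.dist(verts[0], verts[1])      # should be 1
r_squared = sum(x * x for x in verts[0])  # should be (m-1)/(2m)

print("edge length:", edge)
print("squared distance from origin:", r_squared, "vs", (m - 1) / (2 * m))
print("log factor, m=8:", log_condition_factor(8, v))
print("log factor, m=7:", log_condition_factor(7, v))
print("exp(100):", math.exp(100))
```

      The difference between the two log factors is the exponent 100 discussed above, and `math.exp(100)` gives the corresponding multiplicative factor.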

      This begs the conclusion that the branching remains robust to changes in c_max that span 4 decades as well.

      That shows only that the results are not extremely sensitive to c<sub>max</sub> or K.  They are, nonetheless, exquisitely sensitive to m and v.  This difference in sensitivities is the reason that a relatively small change to m leads to such a large compensatory change in c<sub>max</sub>, a change large enough to have a major effect on the results.

      As I wrote above, there is no explanation behind this number, so I can only guess that such a number is created by the removal or addition of a pathogen that is very far away from the other pathogens. Very far in this context means being separated in the x-space by a much greater distance than 1/v, the width of the pathogens' Gaussians. Once again, I am not totally sure if this was the case, but if it were, some basic notions of how models are set up were broken. It appears very strange that nothing is said in the manuscript about the spatial distribution of the pathogens, which is crucial to their effects on the condition c.

      I did not explicitly describe the distribution of pathogens in antigenic space because it is exactly the same as in Siljestam and Rueffler, Fig. 4: the vertices of a regular simplex, centered at the origin, with unity edge length.

      The number in question (2.7∙10<sup>43</sup>) pertains to the Gaussian model with v=20.  As specified by Siljestam and Rueffler, each pathogen lies at a distance of 1 from every other pathogen, so the distance of any pathogen from the others is indeed much greater than 1/v.  This condition holds, however, for most of the parameter space explored by Siljestam and Rueffler (their Fig. 4), and for all of the parameter space that seemingly supports their conclusions.  Thus, if this condition indicates that “basic notions of how models are set up were broken”, they must have been broken by Siljestam and Rueffler.

      Overall, I strongly suspect that an unfortunately poor setup of the model reported in the manuscript has led to the conclusions that dispute the much better-substantiated claims made in [SD].

      The reviewer seems to be suggesting that my simulations are somehow flawed and my conclusions unreliable.  I will therefore describe how my conclusions about sensitivity to parameter values can be verified using the simulation code provided by Siljestam and Rueffler themselves, with only small, easily understood modifications.  I will consider adding this description as a supplement when I revise the manuscript.

      The starting point is the Matlab file MHC_sim_Dryad.m, available at https://doi.org/10.5061/dryad.69p8cz98j.  First, we can add a line that prints the value of the variable logcmax, which represents the natural logarithm of cmax determined and used by the code.  Below line 116 (‘prework’), add the line ‘logcmax’ (with no semicolon).

      Now, at the Matlab prompt, execute MHC_sim_Dryad(false, 8, 20, 1) to run the simulation for the Gaussian model with m=8, v=20, and K=1.  The output will indicate that logcmax=700, in accord with the theoretical factor exp(100*(m-1)) derived above.  The allelic diversity, n<sub>e</sub>, will rise to a steady-state level of about 140, as in the red curve of my Fig. 2.

      Now lower m to 7, i.e., run MHC_sim_Dryad(false, 7, 20, 1).  The output will indicate that logcmax=600.  This confirms that lowering m by 1 causes the code to lower the value of c<sub>max</sub> by a factor exp(100)=2.7∙10<sup>43</sup>, which must also be the factor by which the condition of the most fit homozygote would increase without this adjustment.

      With the change of m to 7 and the compensatory change in c<sub>max</sub>, steady-state allelic diversity remains high.  But what if m changes but c<sub>max</sub> remains the same, as it would in reality?

      To find out, we can fix the value of c<sub>max</sub> to the value used with m=8 by adding the following line below the line previously added: ‘logcmax = 700’.  With this additional modification in place, executing MHC_sim_Dryad(false, 7, 20, 1) confirms that without a compensatory change to c<sub>max</sub>, lowering m from 8 to 7 mostly eliminates allelic diversity, in accord with the corresponding curve in my Fig. 2.  Similarly, raising m from 8 to 9, or changing v from 20 to 19.5 or 20.5 (executing MHC_sim_Dryad(false, 8, 19.5, 1) or MHC_sim_Dryad(false, 8, 20.5, 1)), largely eliminates diversity, confirming the other results in my Fig. 2.  Results for the bitstring model can also be confirmed, though this requires additional changes to the code.
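      To see how far each perturbed parameter set moves away from the fixed value logcmax = 700, the exponent from the simplex derivation can be evaluated directly. This is a sketch under a stated assumption: it uses the theoretical expression v<sup>2</sup>(m-1)/4 for the Gaussian model and does not run the Dryad code, and the response confirms agreement with the printed logcmax only for m=8 and m=7 at v=20. The function name `predicted_log_cmax` is hypothetical.

```python
def predicted_log_cmax(m, v):
    """Theoretical exponent v^2 * (m - 1) / 4 for the Gaussian model
    (assumption: the standardization tracks this value for all m and v)."""
    return v ** 2 * (m - 1) / 4.0


# Reference parameter set used when logcmax is fixed at 700.
baseline = predicted_log_cmax(8, 20.0)

# Perturbations discussed in the text: m -> 7 or 9, v -> 19.5 or 20.5.
cases = [(7, 20.0), (9, 20.0), (8, 19.5), (8, 20.5)]
shifts = {}
for m, v in cases:
    shifts[(m, v)] = predicted_log_cmax(m, v) - baseline
    print(f"m={m}, v={v}: predicted log c_max = "
          f"{predicted_log_cmax(m, v):.4f}, shift = {shifts[(m, v)]:+.4f}")
```

      Under this assumption, each of the four perturbations shifts the exponent by tens of natural-log units, which is the size of the mismatch that arises once c<sub>max</sub> is held fixed rather than re-standardized.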

      Thus, the extreme sensitivity of the results of Siljestam and Rueffler to parameter values can be verified with the code that they used for their simulations, indicating that my conclusions are not consequences of my having done a “poor setup of the model”.

      Response to Reviewer #2 (Public review):

      (1) The statement that the model outcome of Siljestam and Rueffler is very sensitive to parameter values is, in this form, not correct. The sensitivity is only visible once a strong assumption by Siljestam and Rueffler is removed. This assumption is questionable, and it is well explained in the manuscript by J. Cherry why it should not be used. This may be seen as a subtle difference, but I think it is important to pin down the exact nature of the problem (see, for example, the abstract, where this is presented in a misleading way).

      I appreciate the distinction, and the importance of clearly specifying the nature of the problem.  However, Siljestam and Rueffler do not invoke the implausible assumption that changes to the number of pathogens or their virulence will be accompanied by compensatory changes to c<sub>max</sub>.  Rather, they describe the adjustment of c<sub>max</sub> (Appendix 7) as a “helpful” standardization that applies “without loss of generality”.  Indeed, my low-diversity results could be obtained, despite such adjustment, by combining the small change to m or v with a very large change to K (e.g., a factor of 2.7∙10<sup>43</sup>).  In this sense there is no loss of generality, but the automatic adjustment of c<sub>max</sub> obscures the extreme sensitivity of the results to m and v.

      (2) The title of the study is very catchy, but it needs to be explained better in the text.

      I had hoped that the final paragraph of the Discussion would make the basis for the title clear.  I will consider whether this can be clarified in a revision.

    1. Chapter 4: Common Writing Assignments College writing assignments serve a different purpose than the typical writing assignments you completed in high school. The textbook Successful Writing explains that high school teachers generally focus on teaching you to write in a variety of modes and formats, including personal writing, expository writing, research papers, creative writing, and writing short answers and essays for exams. Over time, these assignments help you build a foundation of writing skills. In college, many instructors will expect you to already have that foundation. Your college composition courses will focus on writing for its own sake, helping you make the transition to college-level writing assignments. However, in most other college courses, writing assignments serve a different purpose. In those courses, you may use writing as one tool among many for learning how to think about a particular academic discipline. Additionally, certain assignments teach you how to meet the expectations for professional writing in a given field. Depending on the class, you might be asked to write a lab report, a case study, a literary analysis, a business plan, or an account of a personal interview. You will need to learn and follow the standard conventions for those types of written products. Finally, personal and creative writing assignments are less common in college than in high school. College courses emphasize expository writing, writing that explains or informs. Often expository writing assignments will incorporate outside research, too. Some classes will also require persuasive writing assignments in which you state and support your position on an issue. College instructors will hold you to a higher standard when it comes to supporting your ideas with reasons and evidence. Common Types of College Writing Assignments Below you will find a list of different types of writing assignments you may write as you pursue your academic goals. 
Review each assignment and think about the writing you’ve done in high school and how these assignments might look different in your college composition classes.   Figure 1   After reviewing Figure 1 and the descriptions of various types of writing assignments, watch the following video about the writing process. No matter what type of assignment you are writing, it will be important for you to follow a writing process: a series of steps a writer takes to complete a writing task. Making use of a writing process ensures that you stay organized and focused while allowing you to break up a larger assignment into several distinct tasks. Not every writer follows the same process, and part of the work you will do in your writing classes is to discover the writing process that works best for you. Even though the writing process is often presented as a linear set of steps that writers follow from beginning to end, composition scholars now recognize the recursive nature of writing. In other words, many writers repeat steps in the process and not all writers invest an equal amount of time in each stage. Instead, writers often loop back to individual stages as needed in order to develop and refine their work. As you watch the video below, consider your current writing process (if you have one) and reflect upon how you might develop your process to support your growth as a writer—and to save yourself time and stress when completing college writing assignments. In the previous chapters, we covered college writing at CNM and reading strateg

      The key point is that there are different types of common writing assignments.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      This work computationally characterized the threat-reward learning behavior of mice in a  recent study (Akiti et al.), which had prominent individual differences. The authors  constructed a Bayes-adaptive Markov decision process model and fitted the behavioral data  by the model. The model assumed (i) hazard function starting from a prior (with free mean  and SD parameters) and updated in a Bayesian manner through experience (actually no real  threat or reward was given in the experiment), (ii) risk-sensitive evaluation of future  outcomes (calculating lower 𝛼 quantile of outcomes with free 𝛼 parameter), and (iii) heuristic  exploration bonus. The authors found that (i) brave animals had more widespread hazard  priors than timid animals and thereby quickly learned that there was in fact little real threat,  (ii) brave animals may also be less risk-aversive than timid animals in future outcome  evaluation, and (iii) the exploration bonus could explain the observed behavioral features,  including the transition of behavior from the peak to steady-state frequency of bout. Overall,  this work is a novel interesting analysis of threat-reward learning, and provides useful  insights for future experimental and theoretical work. However, there are several issues that I  think need to be addressed.

      Strengths:

      (1) This work provides a normative Bayesian account for individual differences in  braveness/timidity in reward-threat learning behavior, which complements the analysis by  Akiti et al. based on model-free threat reinforcement learning.

      (2) Specifically, the individual differences were characterized by (i) the difference in the  variance of hazard prior and potentially also (ii) the difference in the risk-sensitivity in the  evaluation of future returns.

      Weakness:

      (1) Theoretically the effect of prior is diluted over experience whereas the effect of biased  (risk-aversive) evaluation persists, but these two effects could not be teased apart in the  fitting analysis of the current data.

      (2) It is currently unclear how (whether) the proposed model corresponds to neurobiological (rather than behavioral) findings, different from the analysis by Akiti et al.

      We thank reviewer #1 for their useful feedback which we’ve used to improve the discussion,  formatting and clarity of the paper, and for highlighting important questions for future  extensions of our work.

      Major points:

      (1) Line 219

      It was assumed that the exploration bonus was replenished at a steady rate when the animal  was at the nest. An alternative way would be assuming that the exploration bonus slowly  degraded over time or experience, and if doing so, there appears to be a possibility that the  transition of the bout rate from peak to steady-state could be at least partially explained by  such a decrease in the exploration bonus.

      Section 2.2.3 explains the mechanism of the exploration bonus which motivates approach.  We think that the mechanism suggested by the reviewer is, in essence, what is happening in  the model. The exploration pool is indeed depleted over time or bouts of experience at the  object. In the peak confident phase for brave animals and the peak cautious phase for timid  animals, the rate of depletion exceeds the rate of regeneration, since the agent spends only  a single turn at the nest between bouts. In the steady-state phase, the exploration pool has  depleted so much previously that the agent must wait multiple turns at the nest for the pool  to regenerate to a sufficiently high value to justify approaching the object again.

      We have updated section 2.2.3 to explain that agents spend one turn at the nest during peak  phase but multiple turns during steady-state phase. Hopefully, this makes our mechanism  clear:

      “In simulations, when 𝐺(𝑡) is high, the agent has a high motivation to explore the object, spending only a single turn in the nest state between bouts. In other words, the depletion from 𝐺0 substantially influences the time point at which approach makes a transition from peak to steady-state; the steady-state time then depends on the dynamics of depletion (when at the object) and replenishment (when at the nest). In particular, in the steady-state phases, the agent must wait multiple turns at the nest for 𝐺(𝑡) to regenerate so that informational reward once again exceeds the potential cost of hazard.”
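      The qualitative mechanism in this passage can be illustrated with a toy sketch. It is not the authors' model: the fixed per-bout depletion, the constant per-turn regeneration, the fixed threshold standing in for the potential cost of hazard, and all numerical values are assumptions chosen only to show a switch from one nest turn per bout to several.

```python
def nest_turns_between_bouts(g0, deplete, regen, threshold, n_bouts):
    """Toy exploration pool (hypothetical, for illustration only).

    Each bout at the object removes `deplete` from the pool G; each turn
    at the nest adds `regen`. After a bout the agent spends at least one
    turn at the nest and approaches again once G >= threshold.
    Returns the number of nest turns taken after each bout.
    """
    if regen <= 0:
        raise ValueError("regen must be positive, or the agent never returns")
    g = g0
    waits = []
    for _ in range(n_bouts):
        g -= deplete              # bout at the object depletes the pool
        turns = 0
        while True:
            g += regen            # one turn at the nest replenishes it
            turns += 1
            if g >= threshold:    # informational reward exceeds hazard cost
                break
        waits.append(turns)
    return waits


# Assumed values: depletion per bout exceeds regeneration per nest turn.
waits = nest_turns_between_bouts(g0=10.0, deplete=1.5, regen=0.5,
                                 threshold=2.0, n_bouts=12)
print(waits)
```

      With these assumed values the pool starts well above the threshold, so early bouts are separated by a single nest turn while the pool drains; once it has drained, each bout must be followed by several nest turns before the threshold is reached again.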

      (2) Line 237- (Section 2.2.6, 2.2.7, Figures 7, 9)

      I was confused by the descriptions about nCVaR. I looked at the cited original literature  Gagne & Dayan 2022, and understood that nCVaR is a risk-sensitive version of expected  future returns (equation 4) with parameter α (α-bar) (ranging from 0 to 1) representing risk  preference. Line 269-271 and Section 4.2 of the present manuscript described (in my  understanding) that α was a parameter of the model. Then, isn't it more natural to report  estimated values of α, rather than nCVaR, for individual animals in Section 2.2.6, 2.2.7,  Figures 7, 9 (even though nCVaR monotonically depends on α)? In Figures 7 and 9, nCVaR  appears to be upper-bounded to 1. The upper limit of α is 1 by definition, but I have no idea why nCVaR was also bounded by 1. So I would like to ask the authors to add more detailed  explanations on nCVaR. Currently, CVaR is explained in Lines 237-243, but actually, there is  no explanation about nCVaR rather than its formal name 'nested conditional value at risk' in  Line 237.

      Thank you for pointing out this error. We have corrected the paper to use nCVaR to refer to  the objective and nCVaR's α, or sometimes just α, to refer to the risk sensitivity parameter  and thus the degree of risk sensitivity.
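      To make the role of α concrete, here is a minimal sketch of plain (non-nested) CVaR on equally weighted outcome samples, meaning the mean of the worst α fraction of outcomes. The nested variant (nCVaR) used in the manuscript is not reproduced here, and the function name `cvar_lower` and the sample values are hypothetical.

```python
def cvar_lower(samples, alpha):
    """Mean of the lowest alpha-fraction of equally weighted samples.

    alpha = 1 recovers the ordinary mean (risk-neutral evaluation);
    smaller alpha puts all weight on the worst outcomes (risk-averse).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if not samples:
        raise ValueError("samples must be non-empty")
    xs = sorted(samples)
    mass = alpha * len(xs)        # tail size, possibly fractional
    remaining = mass
    total = 0.0
    for x in xs:
        w = min(1.0, remaining)   # last tail sample may count partially
        total += w * x
        remaining -= w
        if remaining <= 0.0:
            break
    return total / mass


outcomes = [-10.0, 0.0, 2.0, 4.0]   # hypothetical returns
for a in (1.0, 0.5, 0.25):
    print(f"alpha={a}: CVaR = {cvar_lower(outcomes, a)}")
```

      In this toy case the evaluation moves from the mean of all outcomes at α = 1 towards the single worst outcome as α shrinks, which is the sense in which α indexes the degree of risk sensitivity.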

      (3) Line 333 (and Abstract)

      Given that animals' behaviors could be equally well fitted by the model having both nCVaR (free α) and hazard prior and the alternative model having only hazard prior (with α = 1), may it be difficult to confidently claim that brave (/timid) animals had risk-neutral (/risk-aversive) preference in addition to widespread (/low-variance) hazard prior? Then, it might be good to somewhat weaken the corresponding expression in the Abstract (e.g., add 'potentially also' to the result for risk sensitivity) or mention the inseparability of risk sensitivity and prior belief pessimism (e.g., "... although risk sensitivity and prior belief pessimism could not be teased apart").

      Thank you for this suggestion, we have duly weakened the wording in the Abstract to say  “potentially more risk neutral”:

      “Some animals begin with cautious exploration, and quickly transition to confident approach  to maximize exploration for reward; we classify them as potentially more risk neutral, and  enjoying a flexible hazard prior. By contrast, other animals only ever approach in a cautious  manner and display a form of  self-censoring; they are characterized by potential risk  aversion and high and inflexible hazard priors.”

      Reviewer #2 (Public Review):

      Shen and Dayan build a Bayes adaptive Markov decision process model with three key components: an adaptive hazard function capturing potential predation, an intrinsic reward function providing the urge to explore, and a conditional value at risk (CVaR, closely related to probability distortion explanations of risk traits). The model itself is very interesting and has many strengths including considering different sources of risk preference in generating behavior under uncertainty. I think this model will be useful to consider for those studying approach/avoid behaviors in dynamic contexts.

      The authors argue that the model explains behavior in a very simple and unconstrained behavioral task in which animals are shown novel objects and retreat from them in various manners (different body postures and patterns of motor chunks/syllables). The model itself does capture lots of the key mouse behavioral variability (at least on average on a mouse-by-mouse basis) which is interesting and potentially useful. However, the variables in the model - and the internal states it implies the mice have during the behavior - are relatively unconstrained given the wide range of explanations one can offer for the mouse behavior in the original study (Akiti et al). This reviewer commends the authors on an original and innovative expansion of existing models of animal behaviour, but recommends that the authors revise their study to reflect the obvious challenges. I would also recommend a reduction in claiming that this exercise gives a normative-like or at least quantitative account of mental disorders.

      We thank reviewer #2 for highlighting some of the strengths of our paper as well as pointing  out important limitations of Akiti et al’s original study which we’ve inherited as well as some  limitations of our own method. We address their concerns below.

      We have added a paragraph to the discussion discussing the limitations of the state  representation we adopted from Akiti’s study.

      (Reviewer #1 had the same concern, see above) “Motivated by tail-behind versus  tail-exposed in Akiti et al. (2022), we model approach using a dichotomy between cautious  and confident approach states [...]”

      We have reduced the suggestion that our model provides an account of mental disorders in  the abstract.

      Before:

      “On the other hand, “timid” animals, characterized by risk aversion and high and inflexible  hazard priors, display self-censoring that leads to the sort of asymptotic maladaptive  behavior that is often associated with psychiatric illnesses such as anxiety and depression.”

      After:

      “By contrast, other animals only ever approach in a cautious manner and display a form of self-censoring; they are characterized by potential risk aversion and high and inflexible hazard priors.”

      My main comment is that this paper is a very nice model creation that can characterize the heterogeneity of rodent behavior in a very simple approach/avoid context (Akiti et al; when a novel object is placed in an arena) that itself can be interpreted in a multitude of ways. The use of terms like "exploration", "brave", etc in this context is tricky because the task does not allow the original authors (Akiti et al) to quantify these "internal states" or "traits" with the appropriate level of quantitative detail to say whether this model is correct or not in capturing the internal states that result in the rodent behavior. That said, the original behavioral setup is so simple that one could imagine capturing the behavioral variability in multiple ways (potentially without invoking complex computations that the original authors never showed the mouse brain performs). I would recommend reframing the paper as a new model that proposes a set of internal states that could give rise to the behavioral heterogeneity observed in Akiti et al, but nonetheless is at this time only a hypothesis. Furthermore, an explanation of what would be really required to test this would be appreciated to make the point clearer.

      We thought very hard about using terms that might be considered to be anthropomorphic such as ‘timid’ and ‘brave’. We are, of course, aware of the concerns articulated by investigators such as LeDoux about this. However, we think that, provided that we are clear on the first appearance (using ‘scare’ quotes) that we are indeed using them as labels for latent characteristics that capture correlations in various aspects of behaviour, they are more helpful than harmful in making our descriptions understandable.

      Reviewer #3 (Public Review):

      Summary:

      The manuscript presents computational modelling of the behaviour of mice during encounters with novel and familiar objects, originally reported by Akiti et al. (Neuron 110, 2022). Mice typically perform short bouts of approach followed by a retreat to a safe distance, presumably to balance exploration to discover possible rewards with the potential risk of predation. However, there is considerable heterogeneity in this exploratory behaviour, both across time as an individual subject becomes more confident in approaching the object, and across subjects; with some mice rapidly becoming confident to closely explore the object, while other timid mice never become fully confident that the object is safe. The current work aims to explain both the dynamics of adaptation of individual animals over time, and the quantitative and qualitative differences in behaviour between subjects, by modelling their behaviour as arising from model-based planning in a Bayes adaptive Markov Decision Process (BAMDP) framework, in which the subjects maintain and update probabilistic estimates of the uncertain hazard presented by the object, and rationally balance the potential reward from exploring the object with the potential risk of predation it presents.

      In order to fit these complex models to the behaviour the authors necessarily make substantial simplifying assumptions, including coarse-graining the exploratory behaviour into phases quantified by a set of summary statistics related to the approach bouts of the animal. Inter-individual variation between subjects is modelled both by differences in their prior beliefs about the possible hazard presented by the object and by differences in their risk preference, modelled using a conditional value at risk (CVaR) objective, which focuses the subject's evaluation on different quantiles of the expected distribution of outcomes. Interestingly these two conceptually different possible sources of inter-subject variation in brave vs timid exploratory behaviour turn out not to be dissociable in the current dataset as they can largely compensate for each other in their effects on the measured behaviour. Nonetheless, the modelling captures a wide range of quantitative and qualitative differences between subjects in the dynamics of how they explore the object, essentially through differences in how subjects' beliefs about the potential risk and reward presented by the object evolve over the course of exploration, and are combined to drive behaviour.

      Exploration in the face of risk is a ubiquitous feature of the decision-making problem faced  by organisms, with strong clinical relevance, yet remains poorly understood and  under-studied, making this work a timely and welcome addition to the literature.

      Strengths:

      (1) Individual differences in exploratory behaviour are an interesting, important, and  under-studied topic.

      (2) Application of cutting-edge modelling methods to a rich behavioural dataset, successfully accounting for diverse quantitative and qualitative features of the data in a normative framework.

      (3) Thoughtful discussion of the results in the context of prior literature.

      Limitations:

      (1) The model-fitting approach used of coarse-graining the behaviour into phases and fitting  to their summary statistics may not be applicable to exploratory behaviours in more complex  environments where coarse-graining is less straightforward.

      (2) Some aspects of the work could be more usefully clarified within the manuscript.

      We thank reviewer #3 for their positive feedback and helping us to improve the clarity of our  paper. We have added discussion they thought was missing.

      Reviewer #1 (Recommendations for the authors):

      (1) Line 25-28

      This part of the Abstract might give an impression that timidity (but not braveness) is  potentially associated with psychiatric illness and even that timidity is thus inferior to  braveness. However, even though extreme timidity might indeed be associated with anxiety  or depression, extreme braveness could also be associated with other psychiatric or  behavioral problems. Moreover, as a population, the existence of both timid and brave  individuals could be advantageous, and it could be a reason why both types of individuals  evolutionarily survived in the case of wild animals (although Akiti et al. used mice, which may  have no or very limited genetic varieties, and so things may be different). So I would like to  encourage the authors to elaborate on the expression of this part of the Abstract and/or  enrich the related discussion in the Discussion.

      This is an important point. We note on line 38 that excessive novelty seeking (potentially  caused by excessive braveness) could also be maladaptive.

      Additionally, we have added a paragraph to the discussion discussing heterogeneity in risk  sensitivity within a population.

      “Our data show that there is substantial variation in the degrees of risk sensitivity across the mice. Previous works have reported substantial interpopulation and intrapopulation differences in risk-sensitivity in humans which depend on gender, age, socioeconomic status, personality characteristics, wealth and culture (Rieger et al., 2015; Frey et al., 2017). Despite the normative appeal of 𝛼 = 1, it is possible that a population may benefit from including individuals with 𝛼 different from 1 or highly negative priors. For example, more cautious individuals could learn from merely observing the risky behavior of less cautious individuals. Furthermore, we have only considered risk-sensitivity under epistemic uncertainty in our work. Risk averse individuals, for instance with 𝛼 < 1, may be more successful than risk-neutral agents in environments where there are unexpected dangers (unknown unknowns). Risk-aversion is thus a temperament of ecological and evolutionary significance (Réale et al., 2007).”

      (2) Line 149

      Section 2.2 consists of eight subsections. I think this organization may not be very  appealing, because there are a bit too many subsections, and their relations are not  immediately clear to readers. So I would like to encourage the authors to make an  elaboration. For example, since 2.2.1 - 2.2.5 describes a summary of model construction  and model fitting whereas 2.2.6-2.2.8 shows the results, it could be good to divide these into  separate sections (2.2.1 - 2.2.5 and 2.3.1 - 2.3.3).

      Thank you for pointing this out. We’ve renumbered the sections as you’ve suggested.

      (3) Line 347-8

      Theoretically, the effect of prior is diluted over experience whereas the effect of biased  (risk-aversive) evaluation persists, as the authors mentioned in Lines 393-394. Then isn't it  possible to consider environments/conditions in which the two effects can be separated?

      We appreciate this suggestion. Indeed, our original thought in modeling this experiment was  that this would be exactly the case here - with epistemic uncertainty reducing as the object  became more familiar. However, proving to an animal that a single environment is  completely stationary/fixed is hard - reflected in our conclusion here that the exploration  bonus pool replenishes. Thus, we argued in the discussion that a series of environments  would be necessary to separate risk sensitivity from priors.
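
      The reviewer's premise, that the effect of a prior is diluted over experience, can be illustrated with a toy calculation. The sketch below is not the hazard model used in the paper: it assumes a simple Beta prior over a Bernoulli hazard rate, and the function name and prior values are hypothetical. It only shows how two very different priors converge once enough safe visits are observed in a stationary environment, which is exactly the condition the response above notes is hard to guarantee.

```python
def posterior_hazard_mean(prior_a, prior_b, hazards_seen, visits):
    """Posterior mean of a Bernoulli hazard rate under a Beta(prior_a, prior_b) prior.

    Illustrative assumption only: the paper parameterises its hazard
    function differently (via a mean and SD per time step).
    """
    return (prior_a + hazards_seen) / (prior_a + prior_b + visits)


# Hypothetical priors: a "timid" animal expects danger, a "brave" one does not.
timid_prior = (8.0, 2.0)   # prior mean 0.8
brave_prior = (2.0, 8.0)   # prior mean 0.2

for visits in (0, 10, 90):
    timid = posterior_hazard_mean(*timid_prior, hazards_seen=0, visits=visits)
    brave = posterior_hazard_mean(*brave_prior, hazards_seen=0, visits=visits)
    # The gap between the two beliefs shrinks as safe visits accumulate.
    print(visits, round(timid, 3), round(brave, 3), round(timid - brave, 3))
```

      A persistent risk-averse evaluation would not shrink in this way, which is why a series of environments, rather than a single one, is suggested for separating the two effects.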

      (4) Line 407

      It would be nice to add a brief phrase explaining how (in what sense) this model's  assumption was consistent with the reported behavior. Also, should the assumption of  having two discrete approach states (cautious and confident) itself be regarded as a  limitation of the model? If the tail-behind and tail-exposure approaches were not merely  operationally categorized but were indicated to be two qualitatively distinct behaviors in the  experiment by Akiti et al., it is reasonable to model them as two discrete states, but  otherwise, the assumption of two discrete states would need to be mentioned as a  simplification/limitation.

      We have now removed line 407, and now have an additional paragraph in the discussion discussing the limitations of the tail-behind and tail-exposure state representation: “Motivated by tail-behind versus tail-exposed in Akiti et al. (2022), we model approach using a dichotomy between cautious and confident approach states. This is likely a crude approximation to the continuous and multifaceted nature of animal approach behavior. For example, during approach animals likely adjust their levels of vigilance continuously (or discretely; Lloyd and Dayan (2018)) to monitor threat, and choose different velocities for movement, and different attentional strategies for inspecting the novel object. We hope future works will model these additional behavioral complexities, perhaps with additional internal states, and corroborate these states with neurobiological data.”

      (5) Line 418

      The authors contrasted their model-based analyses with the model-free analyses of Akiti et  al. Another aspect of differences between the authors' model and the model of Akiti et al. is  whether it is normative or mechanistic: while how the model of Akiti et al. can be biologically  implemented appears to be clear (TS dopamine represents threat TD error, and TS  dopamine-dependent cortico-striatal plasticity implements TD error-based update of  model-free threat prediction), biological implementation of the authors' model seems more  elusive. Given this, it might be a fruitful direction to explore how these two models can be  integrated in the future.

      We enthusiastically agree that it would be most interesting in the future to explore the integration of the two models - and, in the discussion (Lines 537-548, 454-461), point to some first steps that might be fruitful along these lines. There are two separate considerations here: one is that our account is mostly computational and algorithmic, whereas Akiti’s model is mostly algorithmic and implementational; the second is, as noted by the reviewer, that our account is model-based, whereas Akiti’s model is model-free (in the sense of reinforcement learning; RL). These are related - thanks in no small part to the work from the group including Akiti, we know a lot more about the implementation of model-free than model-based RL. However, our model-based account does reach additional features of behavior not captured in Akiti et al.’s model such as bout duration, frequency, and approach type. Thus, the temptation of unification.

      (6) Line 426

      Related to the previous point, it would be nice to more specifically describe what variable TS  dopamine can represent in the authors' model if possible.

      In the discussion (Lines 454-461), we speculate that TS dopamine could still respond to the physical salience of the novel object and affect choices by determining the potential cost of the encountered threat or the prior on the hazard function. For example, perhaps ablating TS dopamine reduces the hazard priors which leads to faster transition from cautious to confident approach and longer bout durations, consistent with the optogenetics behavioral data reported in Akiti et al.

      Reviewer #2 (Recommendations for the authors):

      My guess is simpler versions of the model would not fit the data well. But this does not mean for example that the mice have probability distortions (CVaR) or that even probabilistic reasoning and the internal models necessary to support them are acting in the behavioral context studied by Akiti. So related to the above, I would ask what other models would fit and would not fit the data? And what does this mean?

      These are good points. Our model provides an approximately normative account of the  animals’ behavior  in terms of what it achieves relative to a utility function. In practice, the  animals could deploy a precompiled model-free policy (which does not rely on probabilistic  computations) that is exactly equivalent to our model-based policy. With the current  experiment, we cannot conclude whether or not the animals are performing the prospective  calculations in an online manner. Of course, the extent to which animals or humans are  performing probabilistic computations online and have internal models are on-going  questions of study.

      Model comparison is difficult because currently we do not know of any other risk-sensitive exploration models. We cannot directly compare to the model in Akiti et al. since our model explains additional features of behavior: bout duration, frequency, and approach type. Indeed, our model is as simple as it can be in the sense that, with the exception of nCVaR, removing any of the other parameters makes it difficult to fit some animals in our dataset. In the future, our model could be used to fit other datasets of risk-sensitive exploration and, ideally, be compared to other models.

      Explaining why animals avoid the novel object in what the authors call a benign environment is a very tricky issue. In Akiti et al, the readers are not yet convinced that the mice know that this environment is benign. Being placed in an arena with a novel object presents mice with a great uncertainty and we do not know whether they treat this as benign. Therefore, the alternative explanations in this study need to be carefully discussed in light of the limitations of the initial study.

      It is certainly true that it is unclear if the arena is completely benign to the animals. However, the amount of time the animal spends in the center of the arena decreases significantly from habituation to novelty days. This suggests that the animals avoid the novel object largely because of the object itself, rather than the potential danger associated with the arena. Furthermore, the animals are not reported as exhibiting more extreme behaviours such as freezing. In any case, our account is relative in the sense that we are comparing the time the animal spends at the object versus elsewhere in the environment, driven by the relative novelty and relative risk of the environment versus the object. Trying to get more absolute measures of these quantities would require a richer experimental set-up, for instance with different degrees of habituation or experience of the occurrence of (other) novel objects, in general.

      We added a short note to the discussion to explain this:

      “Fourth, we modeled the relative amount of time the animal spends at the object versus  elsewhere in the environment which depends on the differential risk in the two states.  However, it is likely the animals avoid the novel object largely because of the object itself,  rather than the potential danger associated with the arena since they spend much less time  at the center of the arena during novelty than habituation days.”

      Figure 2 - how confident are the authors that each mouse differs from y=1? Related to this,  the behavior in Akiti is very noisy and changes across time. I am not sure if the authors fully  describe at what levels their model captures the behavior vs not in a detailed enough  fashion.

      We have performed a random permutation test on the minute-to-minute data. We have updated Figure 2 so that brave animals for which the test of y>1 passes the Benjamini–Hochberg procedure at level q=0.05 are represented with solid green dots and animals that don’t pass are represented with hollow dots. 8 out of 11 brave animals passed Benjamini–Hochberg.
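
      For readers unfamiliar with the Benjamini–Hochberg step-up procedure mentioned here, a minimal sketch follows. It assumes one p-value per animal (for example, from the permutation test) and is not taken from the released code; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level q.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= (k / m) * q, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # keep the largest rank that satisfies the bound
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical per-animal p-values, for illustration only.
example_p = [0.01, 0.5, 0.03, 0.2]
print(benjamini_hochberg(example_p, q=0.05))
```

      Because the rule is step-up, a p-value that fails its own rank threshold can still be rejected if a larger p-value passes at a higher rank.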

      Reviewer #3 (Recommendations for the authors):

      (1) I could not find information in the preprint about code availability. Please consider making  the code public to help others apply these modelling methods.

      We have released code and included the url in the paper in the Methods section.

      (2) Though the manuscript was generally clearly written, there were a number of places  where some additional information or clarification would be useful:

      a) Please define and explain the terms 'tail-behind' and 'tail-exposed' (used to describe  approach bout types) when first used.

      We have added definitions when we first mention these terms:

      “[...] 'tail-behind' (bouts where the animal's nose was closer to the object than the tail for the  entire bout) and 'tail-exposed' (bouts where the animal's tail is closer to the object than the  nose at some point during the bout), associated respectively with cautious risk-assessment  and engagement”

      b) At lines 57-58 when contrasting the 'model-free' account of Akiti et al with the 'model-based' account of the current work, it would be worth clarifying that these terms are  being used in the RL sense rather than e.g. a model-based analysis of the data.  

      We have updated the relevant lines to say “model-free/based reinforcement learning”.

      c) Line 61, the phrase 'the significant long-run approach of timid animals despite having  reached the "avoid" state' is unclear as the 'avoid' state has not been defined.

      We updated the terminology to “avoidance behavior” to be consistent with Akiti et al.  Avoidance refers to the animal routinely avoiding the object and therefore being unable to  learn whether it is safe.

      d) It was not completely clear to me how the coarse-graining of the behaviour was  implemented. Specifically, how were animals assigned to the brave, intermediate, or timid  group, and how were the parameters of the resulting behavioural phases fit?

      Sorry that this was not clear. Section 2.1 explains how the minute-to-minute behavioral data  was coarse-grained and how animal groups were assigned. We have added further  explanation of Figure 2 to the main text:

      “Fig 2 summarizes our categorization of the animals into the three groups: brave, intermediate, and timid based on the phases identified in the animal's exploratory trajectories. Timid animals spend no time in confident approach and are plotted in orange at the origin of Fig 2. Brave animals differ from intermediate animals in that their approach time during the first ten minutes of the confident phase is greater than the last ten minutes (steady-state phase). Brave animals are plotted in green above and intermediate animals are plotted in black below the y=1 line in Fig 2.”
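
      The grouping rule described in this passage can be written as a short decision function. This is a sketch under stated assumptions rather than the released code: the function and argument names are hypothetical, and treating a tie (equal early and late approach time) as intermediate is an assumption the text does not settle.

```python
def classify_animal(confident_minutes, early_approach_time, late_approach_time):
    """Assign an animal to 'timid', 'intermediate', or 'brave'.

    confident_minutes:   time spent in confident approach (0 for timid animals)
    early_approach_time: approach time in the first ten minutes of the confident phase
    late_approach_time:  approach time in the last ten minutes (steady-state phase)
    """
    if confident_minutes == 0:
        return "timid"          # never reaches confident approach
    if early_approach_time > late_approach_time:
        return "brave"          # above the y=1 line in Fig 2
    return "intermediate"       # at or below the y=1 line (tie handling is assumed)


print(classify_animal(0, 0.0, 0.0))
print(classify_animal(30, 5.0, 2.0))
print(classify_animal(30, 2.0, 5.0))
```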

      We also added extra information to outline the goal, and methodology of coarse-graining and  animal grouping:

      “We sought to capture these qualitative differences (cautious versus confident) as well as aspects of the quantitative changes in bout durations and frequencies as the animal learns about their environment. To make this readily possible, we abstracted the data in two ways: averaging bout statistics over time, and clustering the animals into three groups with operationally distinct behaviors.”

      e) What purpose does the 'retreat' state serve in the BAMDP model (as opposed to  transitioning directly from 'object' to 'nest' states), and why do subjects not pass through it  following 'detect' states?

      Thank you for pointing this out. We have updated Figure 3 to note that the two “detected  states” also point to the “retreat” state. The reviewer is correct that there could be alternative  versions of the state diagram, and the ‘retreat’ state could indeed have been eliminated.  However, we thought that it was helpful to structure the animal’s progress through state  space.

      f) Why was the hazard function parameterised via the mean and SD at each time step rather  than with a parametric form of the mean and SD as a function of time?

      Since the agent can only spend 2, 3, or 4 turns at the object states, we didn’t see a need to  parameterize the mean and SD as a function of time. Doing so is a good solution to scaling  up the hazard function to more time-steps.

      (3) There were also a couple of points that could potentially be usefully touched on in the  discussion:

      a) What, if any, is the relationship between the CVaR objective and distributional RL? They  seem potentially related due to both focussing on quantiles of the outcome distribution.

      We have added a paragraph to the discussion discussing the connection between  distributional RL and CVaR:

      “CVaR is known to come in different flavors in the case of temporally-extended behavior. Gagne and Dayan (2021) introduce two alternative time-consistent formulations of CVaR: nested CVaR (nCVaR) and precommitted CVaR (pCVaR). nCVaR and pCVaR both enjoy Bellman equations which make it possible to compute approximately optimal policies without directly computing whole distributions of the outcomes. We use nCVaR in this study for its computational efficiency. There is, of course, great current interest in distributional reinforcement learning (Bellemare et al., 2023b) which does acquire such whole distributions, not the least because of prominent observations linking non-linearities in the response functions of dopamine neurons to methods for learning distributions of outcomes (Dabney et al., 2020; Masset et al., 2023; Sousa et al., 2023). One functional motivation for considering entire outcome distributions is the possibility of using them to determine risk-sensitive policies (Gagne and Dayan, 2021).

      While it is possible to compute CVaR directly from return distributions, Gagne and Dayan  (2021) showed that this can lead to temporally inconsistent policies where the agent  deviates from its original plans (the authors called this the fixed CVaR or fCVaR measure).

      Rather further removed from our model-based methods is work from Antonov and Dayan  (2023), who consider a model-free exploration strategy which exploits full return distributions  to compute the value of perfect information which is used as a heuristic for trying actions  with uncertain consequences. Future works can examine risk-sensitive versions of Antonov  and Dayan (2023)'s computationally efficient model-free algorithm as one solution to the  burdensome computations in our model-based method.”
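
      As background to the quoted passage, the single-step CVaR that these time-consistent variants build on can be estimated from sampled returns. The sketch below is illustrative only and is not the nCVaR computation used in the paper: it assumes the lower-tail convention implied above (𝛼 = 1 is risk-neutral, 𝛼 < 1 is risk-averse), and the function name and sample values are hypothetical.

```python
import math


def cvar_lower_tail(returns, alpha):
    """Empirical CVaR: the mean of the worst ceil(alpha * n) sampled returns.

    alpha = 1 recovers the ordinary mean (risk-neutral); smaller alpha
    focuses on progressively worse outcomes (risk-averse).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if not returns:
        raise ValueError("returns must be non-empty")
    ordered = sorted(returns)                      # worst outcomes first
    k = max(1, math.ceil(alpha * len(ordered)))    # size of the lower tail
    return sum(ordered[:k]) / k


# Hypothetical returns: one rare, costly outcome among small rewards.
samples = [-10.0, 0.0, 1.0, 2.0]
print(cvar_lower_tail(samples, 1.0))   # risk-neutral mean
print(cvar_lower_tail(samples, 0.5))   # mean of the worst half
```

      Applying such a measure directly to whole return distributions is the route that can produce the temporally inconsistent policies described above, which is the motivation for the nested and precommitted formulations.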

      b) Why normatively might subjects have non-neutral risk preference as captured by the CVaR?

      We also added a paragraph to the discussion discussing the advantage of heterogeneity in  risk sensitivity within a population:

      (Reviewer #1 had the same question, see above) “Our data show that there is substantial  variation in the degrees of risk sensitivity across the mice.  Previous works have reported  substantial interpopulation and intrapopulation differences in risk-sensitivity in humans which  depend on gender, age, socioeconomic status, personality characteristics, wealth and culture [...]”

      c) Relevance of the current modelling work to clinical conditions characterised by dysregulation of risk assessment (e.g. anxiety or PTSD).

      We’ve added a paragraph to the discussion:

      “Inter-individual differences in risk sensitivity are also of critical importance in psychiatry, reflected in a panoply of anxiety disorders (Butler and Mathews, 1983; Giorgetta et al., 2012; Maner et al., 2007; Charpentier et al., 2017), along with worry and rumination (Gagne and Dayan, 2022). Understanding the spectrum of extreme priors and extreme values of 𝛼 could have therapeutic implications, adding significance to the search for tasks that can more cleanly separate them.”

      d) Is it surprising to see differences in risk preference (nCVaR) between the familiar object  and novel object condition, given that risk preference might be conceptualised as a trait  rather than a state variable?

      Thank you for raising this point. You are right that we expected risk sensitivity (nCVaR alpha)  to be the same between FONC and UONC animals on average. It is difficult to know if alpha  is higher for FONC than UONC animals due to the non-identifiability between alpha and  hazard priors. We have added this discussion to the paper:

      “This is surprising if we interpret 𝛼 as a trait that is stable through time. Unfortunately, due to  the non-identifiability between 𝛼 and hazard priors, we cannot verify whether 𝛼 is actually  higher for FONC animals than UONC animals.”

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The study is methodologically solid and introduces a compelling regulatory model. However, several mechanistic aspects and interpretations require clarification or additional experimental support to strengthen the conclusions.

      Strengths:

      (1) The manuscript presents a compelling structural and biochemical analysis of human glutamine synthetase, offering novel insights into product-induced filamentation.

      (2) The combination of cryo-EM, mutational analysis, and molecular dynamics provides a multifaceted view of filament assembly and enzyme regulation.

      (3) The contrast between human and E. coli GS filamentation mechanisms highlights a potentially unique mode of metabolic feedback in higher organisms.

      Weaknesses:

      (1) The mechanism underlying spontaneous di-decamer formation in the absence of glutamine is insufficiently explored and lacks quantitative biophysical validation.

      (2) Claims of decamer-only behavior in mutants rely solely on negative-stain EM and are not supported by orthogonal solution-based methods.

      We thank the reviewer for the summary and noting of the strengths. We agree that the evolutionary divergence of metabolic feedback in GS homologs is a fruitful avenue for future studies. With regard to the weaknesses, the di-decamer in the absence of glutamine only forms under high (higher than physiological) concentrations of enzyme. Our primary evidence for the mutant behavior was the lack of crosslinking (Figure 1E), with supplementary support from the negative stain. In the revised version we will soften the language to say “reduced” rather than “did not support” filament formation.

      Reviewer #2 (Public review):

      The authors set out to resolve the high-resolution structure of a glutamine synthetase (GS) decamer using cryo-EM, investigate glutamine binding at the decamer interface, and validate structural observations through biochemical assays of ATP hydrolysis linked to enzyme activity. Their work sits at the intersection of structural and functional biology, aiming to bridge atomic-level details with biological mechanisms - a goal with clear relevance to researchers studying enzyme catalysis and metabolic regulation.

      Strengths and weaknesses of methods and results:

      A key strength of the study lies in its use of cryo-EM, a technique well-suited for resolving large, dynamic macromolecular complexes like the GS decamer. The reported resolutions (down to 2.15 Å) initially suggest the potential for detailed structural insights, such as side-chain interactions and ligand density. However, several methodological limitations significantly undermine the reliability of the results:

      (1) Cryo-EM data processing: The absence of critical details about B-factor sharpening - a standard step to enhance map interpretability - is a major concern. For high-resolution maps (<3 Å), sharpening is typically applied to resolve side-chain features, yet the submitted maps (e.g., those in Figures 1D, 2D, and supplementary figures) appear unprocessed, with density quality inconsistent with the claimed resolutions. This makes it difficult to evaluate whether observed features (e.g., glutamine binding) are genuine or artifacts of unsharpened data.

      (2) Modeling and density consistency: The structural models, particularly for glutamine binding at the decamer interface, do not align with the reported resolution. The maps shown in Figure 2D and Supplementary Figure S7 lack sufficient density to confidently place glutamine or even surrounding residues, conflicting with claims of 2.15 Å resolution. Additionally, fitting a non-symmetric ligand (glutamine) into a symmetry-refined map requires justification, as symmetry constraints may distort ligand placement.

      (3) Biochemical assay controls: While the enzyme activity assays aim to link structure to function, they lack essential controls (e.g., blank reactions without GS or substrates, substrate omission tests) to confirm that ATP hydrolysis is GS-dependent. The use of TCEP, a reducing agent, is also not paired with experiments to rule out unintended effects on the PK/LDH system, further limiting confidence in activity measurements.

      Achievement of aims and support for conclusions:

      The study falls short of convincingly achieving its goals. The claimed high-resolution structural details (e.g., side-chain densities, ligand binding) are not supported by the provided maps, which lack sharpening and show inconsistencies in density quality. Similarly, the biochemical data do not robustly validate the structural claims due to missing controls. As a result, the evidence is insufficient to confirm glutamine binding at the decamer interface or the functional relevance of the observed structural features.

      Likely impact and utility:

      If these methodological gaps are addressed, the work could make a meaningful contribution to the field. A well-resolved GS decamer structure would advance understanding of enzyme assembly and ligand recognition, while validated biochemical assays would strengthen the link between structure and function. Improved data processing and clearer reporting of validation steps would also make the structural data more reliable for the community, providing a resource for future studies on GS or related enzymes.

      We disagree with the reviewer’s overall assessment.

      With regard to sharpening and resolution: we examined sharpened maps and, in a revised version, will present additional supplementary figures showing these maps side by side. We note that the resolutions reported are global and that the most interesting features are, of course, in the periphery and subject to conformational and compositional heterogeneity. In the revision, we will include supplementary figures of core side-chain densities, which are closer to what the reviewer expects.

      With regard to modeling: the apo filament and turnover filament datasets were handled nearly identically. The additional density is therefore unlikely to be an artefact of the symmetry operator; however, the lower resolution in this region noted by the reviewer is worthy of further exploration. The maps are public, and we think this is the most plausible interpretation of the density, which we based primarily on the biochemical data; we will include more speculation in the revised version.

      With regard to the biochemical controls: we point the reviewer to Figure S1, which shows that omission of ammonia or glutamate in the wild-type (tagless) system removes any coupling of the reactions. We will perform the additional controls to publication quality in the revised version along with the TCEP control. We note that the reducing agent is present across all experiments, ruling out an effect on any specific result. The inclusion of TCEP is also very standard in other published uses of the coupled ATPase assay (e.g. PMID: 31778111 and PMID: 32483380 by our first author).

      Additional context:

      Cryo-EM has transformed structural biology by enabling high-resolution analysis of large complexes, but its success hinges on rigorous data processing and validation steps that are critical to ensuring reproducibility. The challenges highlighted here are not unique to this study; they reflect broader issues in the field where incomplete reporting of methods can obscure the reliability of results. By addressing these points, the authors would not only strengthen their current work but also set a positive example for transparent and rigorous structural biology research.

      All the data is public and the reviewer or anyone is free to reinterpret the maps and models - and we encourage that rather than just an interpretation of our static figures. In addition, we will upload the raw micrograph data for the apo filament and turnover filament datasets to EMPIAR prior to submitting the revision.

      Reviewer #3 (Public review):

      In this manuscript, the authors propose a product-dependent negative-feedback mechanism of human glutamine synthetase, whereby the product glutamine facilitates filament formation, leading to reduced catalytic specificity for ammonia. Using time-resolved cryo-EM, the authors demonstrate filament formation under product-rich conditions. Multiple high-quality structures, including decameric and di-decameric assemblies, were resolved under different biochemical states and combined with MD simulations, revealing that the conformational space of the active site loop is critical for the GS catalysis. The study also includes extensive steady-state kinetic assays, supporting the view that glutamine regulates GS assembly and its catalytic activity. Overall, this is a detailed and comprehensive study. However, I would advise that a few points be addressed and clarified.

      (1) In Figure 2D and Supplementary Figure 7, the extra density observed between the two decamers does not appear to have the defining features of a glutamine. A less defined density may be expected given the nature of the complex, but even though mutagenesis assays were performed to support this assignment, none of these results constitutes direct and conclusive evidence for glutamine binding at this site. I would thus suggest showing the density maps at multiple contour thresholds to allow readers to also better evaluate the various small molecules under turnover conditions that cannot be well fitted based on this density map, helping to provide a more balanced interpretation of the results.

      (2) On the same point regarding the density for the enzyme under turnover conditions, more details should be provided about the symmetry expansion and classification performed, and also show the approximate ratio of reconstructions that include this density. Did you try symmetry expansion followed by focused classification, especially on the interface region?

      (3) The interface between the two decamers of the model needs to be double-checked and reassigned, especially for the residues surrounding the fitted glutamine. For example, the side chain of the Lys residue shown in the attached figure is most likely modeled incorrectly.

      We thank the reviewer for the feedback. As noted above, we will include supplemental figures that show maps at multiple thresholds and sharpening schemes. We noted in the manuscript and above that our interpretation here is based on integrating biochemical evidence alongside the density and will make that even more clear in the revised manuscript. The filaments +/- the putative glutamine density were processed nearly identically, but we will attempt various schemes of focused classification/symmetry expansion in the revision as well. However, we point out that there is extensive averaging there that makes modeling a bit trickier than expected given the global resolution.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer 1:

      We thank Reviewer 1 for the discussion on the possible causes of ERPs and their relevance for the interpretation of changes in aperiodic activity. We have changed the relevant paragraph to read as follows: For example, ERPs may reflect changes in periodic activity, such as phase resets (Makeig et al., 2002), or baseline shifts (Nikulin et al., 2007). ERPs may also capture aperiodic activity, either in the form of evoked transients triggered by an event (Shah et al., 2004) or induced changes in the ongoing background signal. This has important implications: evoked transients can alter the broadband spectrum without implying shifts in ongoing background activity, whereas induced aperiodic changes may signal different neural mechanisms, such as shifts in the excitation-inhibition balance (Gao et al., 2017).

      Reviewer 1 argued that a time point-by-time point comparison between ERPs and aperiodic parameters may not be the most appropriate approach, since aperiodic time series have lower temporal resolution than ERPs. The reviewer suggested comparing their topographies instead. We had already done this in the first version of the paper (see Fig. S7: https://elifesciences.org/reviewedpreprints/101071v1#s10). However, in the second version, we opted to use linear mixed models for each channel-time point in order to maintain consistency with the other analyses in the paper (e.g. the comparison between FOOOF parameters and baseline-corrected power).

      Nevertheless, we repeated the topographic correlations as in the first version, and the results are shown below. Correlations were computed for each time point, subject and condition, and then averaged across these dimensions for visualisation. The pattern differs from that of the linear mixed-model results (see Fig. S14), with notable correlations appearing after ~0.5 s for the exponent and after ~1.0 s for the offset. Still, the correlations remain low, suggesting that aperiodic parameters and ERPs encode different information (at least in this dataset).

      Author response image 1.
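
      The topographic correlation described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the array layout (subjects, conditions, channels, times), the function name `topographic_correlation`, and the random data are all assumptions, and averaging over subjects and conditions is only one reading of "averaged across these dimensions".

```python
import numpy as np

def topographic_correlation(erp, aperiodic):
    """Pearson correlation across channels for every (subject, condition, time) cell.

    Both inputs are assumed to have shape (subjects, conditions, channels, times).
    Returns an array of shape (subjects, conditions, times).
    """
    # Centre each topography across the channel axis (axis 2).
    erp_c = erp - erp.mean(axis=2, keepdims=True)
    ap_c = aperiodic - aperiodic.mean(axis=2, keepdims=True)
    # Covariance-like numerator and the product of norms as denominator.
    num = (erp_c * ap_c).sum(axis=2)
    den = np.sqrt((erp_c ** 2).sum(axis=2) * (ap_c ** 2).sum(axis=2))
    return num / den

# Hypothetical data: 5 subjects, 3 conditions, 64 channels, 20 time points.
rng = np.random.default_rng(0)
erp = rng.normal(size=(5, 3, 64, 20))
exponent = rng.normal(size=(5, 3, 64, 20))

r = topographic_correlation(erp, exponent)
# Average over subjects and conditions to obtain one value per time point.
mean_r = r.mean(axis=(0, 1))
```

      Each correlation here compares two scalp maps at a single time point, so it does not depend on the two measures having the same temporal resolution, which is the concern the reviewer raised about the time point-by-time point models.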

      Additionally, to control for the effect of smearing, we performed the same linear mixed model analysis as in Fig. S14 on low-pass filtered ERPs (with a cut-off of 10 Hz), and the results were largely similar to those in Fig. S14.

      Reviewer 1 discussed two possible explanations for the observed correlations between baseline-corrected power and FOOOF parameters (Figure 4): “The correlation between the exponent and low-frequency activity could be of either direction: low frequency power changes could reflect 1/f shifts, or exponent estimates might be biased by undetected delta/theta activity. I think that one other piece of evidence /…/ to intuitively highlight why the latter is more likely is the /…/ decrease at high ("transbeta") frequencies, which suggests a rotational shift /…/.” We agree with the interpretation that low-frequency power changes in our data primarily reflect 1/f shifts. However, we are uncertain about the reviewer’s statement that the “latter” explanation (i.e., bias in exponent estimates due to delta/theta activity) is more likely. Given the context, we believe the reviewer may have intended to say the “former” explanation is more likely.

      We agree with the reviewers' observation that rhythmicity, as estimated using the pACF, can be independent of power (Myrov et al., 2024, Fig. 1). However, it seems that in real (non-simulated) datasets, the pACF and power spectral density (PSD) are often moderately correlated (e.g. Myrov et al., 2024, Fig. 5).

      Reviewer 1 asked whether we had examined aperiodic changes in the data before and after subtracting the response-locked ERPs. We did not carry out this extra analysis because, as the reviewer suggests, it would have been excessive: the current version of the paper already contains more than 60 figures. As mentioned in the manuscript, we acknowledge the possibility that response-locked ERPs contribute to the second aperiodic component. However, due to the weak correlation between reaction times and aperiodic activity, the presence of both components throughout the entire epoch (in at least the first and third datasets) and the distinct differences between the ERPs and the aperiodic activity in the different conditions (see Fig. 8 vs. Fig. S13), we cannot conclusively determine whether the second aperiodic component is directly related to motor responses. Finally, we agree with the reviewer that the distribution of the response-locked ERP more closely resembles the frontocentral (earlier) aperiodic component than the later post-response component. We have amended the relevant paragraph in the Discussion to include these observations: “While it is possible that response-related ERPs contributed to the second aperiodic component, several observations suggest otherwise: both aperiodic components were present throughout the entire epoch, differences between conditions diverged between ERPs and aperiodic activity (compare Figure 8 and Figure S16), and the associations with reaction times were weak. Moreover, the distribution of the response-locked ERP qualitatively resembled the earlier frontocentral aperiodic component more than the later post-response component. Taken together, these findings suggest that ERPs and aperiodic activity capture distinct aspects of neural processing, rather than reflecting the same underlying phenomenon.”

      We agree with Reviewer 1 that our introduction of aperiodic activity was abrupt, and that the term 'aperiodic exponent' required definition. We have now defined it as the spectral steepness in log–log space (i.e. the slope), and have added a brief explanatory sentence to the introduction.
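
      To make the definition concrete, the sketch below fits a straight line to a power spectrum in log–log space and reads the exponent off the slope. It is a plain least-squares fit on synthetic data, not the specparam/FOOOF fitting procedure; the function name `fit_aperiodic_loglog` and the convention that the exponent is the negative of the slope (so that a steeper spectrum gives a larger exponent) are assumptions made for illustration.

```python
import numpy as np

def fit_aperiodic_loglog(freqs, power):
    """Least-squares line in log10-log10 space; returns (offset, exponent).

    Assumed model: log10(power) = offset - exponent * log10(freqs).
    """
    slope, intercept = np.polyfit(np.log10(freqs), np.log10(power), deg=1)
    # The exponent is reported as the negated slope of the fitted line.
    return intercept, -slope

# Synthetic, noise-free 1/f^2 spectrum with offset 1.5 (hypothetical values).
freqs = np.arange(1.0, 41.0)
power = 10 ** 1.5 / freqs ** 2.0

offset, exponent = fit_aperiodic_loglog(freqs, power)
```

      On real spectra, oscillatory peaks would bias such a naive fit, which is one reason spectral parameterisation models peaks and the aperiodic component together.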

      Reviewer 1 noted that the phrase 'task-related changes in overall power' could be misinterpreted as referring to total (broadband) power, and recommended that we specify a frequency range. We agree, so we have replaced 'overall power' with 'spectral power within a defined frequency range'.

      We agree with Reviewer 1 that the way we worded things in the Discussion section regarding alpha activity and inhibitory processes was awkward and could easily be misread. We have rephrased the sentences and added a brief explanation to avoid implying a direct link between alpha attenuation and neural inhibition.

      Furthermore, based on the reviewer’s suggestion, we added a brief comment in the Discussion section (Theoretical and methodological implications) on theoretical perspectives regarding the interaction between age and aperiodic activity.

      Reviewer 1 suggested including condition as a fixed effect in order to examine whether the relationship between FOOOF parameters and baseline-corrected power is modulated by condition. Specifically, the reviewer proposed changing our model from

      baseline_corrected_power ~ 1 + fooof_parameter + (1|modality) + (1|nback) + (1|stimulus) + (1|subject)

      to

      baseline_corrected_power ~ 1 + fooof_parameter + modality*nback *stimulus + (1|subject)

      While we appreciate this suggestion, we believe that including design variables as fixed effects would confound the interpretation of (marginal) R² as a measure of the association between FOOOF parameters and baseline-corrected power. Our primary question in this analysis was about the fundamental relationship between these measures, not how experimental conditions moderate this relationship.
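
      The argument about marginal R² can be illustrated with a small numerical sketch. It assumes a variance-partitioning definition in which marginal R² is the fixed-effect variance divided by the total variance (fixed, random and residual); the response does not state which definition was used, and the function name and all variance values below are hypothetical.

```python
def marginal_r2(var_fixed, var_random, var_residual):
    """Share of total variance attributed to the fixed effects.

    var_random is a list with one variance per random-effect term.
    """
    total = var_fixed + sum(var_random) + var_residual
    return var_fixed / total

# Design variance (3.0) modelled as a random effect: only the FOOOF
# parameter (2.0) counts towards the fixed-effect variance.
r2_random_design = marginal_r2(var_fixed=2.0, var_random=[1.0, 3.0], var_residual=4.0)

# The same design variance moved into the fixed part of the model:
# marginal R² now also reflects the experimental conditions.
r2_fixed_design = marginal_r2(var_fixed=2.0 + 3.0, var_random=[1.0], var_residual=4.0)
```

      With these made-up numbers the value rises from 0.2 to 0.5 even though the contribution of the FOOOF parameter is unchanged, which is the sense in which fixed design variables would blur the interpretation of marginal R².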

      To address the reviewer's concern regarding condition-specific effects, we conducted separate analyses for each condition using a simpler model:

      baseline_corrected_power ~ 1 + fooof_parameter + (1|subject)

      The results (now included in the Supplement, Fig. S4–S6) show generally smaller effect sizes compared to our original random-effects model, with notable differences between conditions. The 2-back conditions, particularly the non-target trials, exhibited the weakest associations. Despite these differences, the overall patterns remained consistent with our original findings: exponent and offset exhibited positive associations at low frequencies (delta, theta) and negative associations at higher frequencies (beta, low gamma), while periodic activity correlated substantially with baseline-corrected power in the alpha, beta, and gamma ranges.

      However, this condition-specific approach has important limitations. With only 47 subjects per condition, the statistical power is insufficient for stable correlation estimates (Schönbrodt & Perugini, 2013; https://doi.org/10.1016/j.jrp.2013.05.009). This likely explains why the effects are smaller and less stable than in our original model, which uses the full dataset's power while appropriately accounting for condition-related variance through random effects. Since these additional analyses do not alter our primary conclusions, we have included them in the Supplement for completeness and made a minor change in the Discussion section.

      Reviewer 1 asked which channels the lines in Figure 9 are based on. As stated in the Methods section, “We fitted models in a mass univariate manner, that is for each channel, frequency (where applicable), and time point separately. /…/ For the purposes of visualisation, p-values were averaged across channels (for heatmaps or lines) or across time (for topographies).” Therefore, the lines and heatmaps are based on all channels.

      Reviewer 2:

      We would like to thank reviewer 2 for their detailed explanation of the expected behaviour of the specparam algorithm. We have added the following explanation to the Methods section:

      Importantly, as noted by the reviewer, this behaviour reflects an explicit design choice of the algorithm: to avoid overfitting ambiguous peaks at the edges of the spectrum, FOOOF excludes peaks that are too close to the boundaries. This exclusion is controlled by the _bw_std_edge parameter, which defines the distance that a peak must be from the edge in order to be retained (in units of standard deviation; set to 1.0 by default). Therefore, although the algorithm is functioning as intended, users should be careful when interpreting aperiodic parameters in datasets where low-frequency oscillatory activity might be expected.
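
      The edge rule described above can be sketched as a simple mask. This is an illustrative re-implementation under stated assumptions, not the library's source: the helper name `keep_peaks_away_from_edges`, its arguments, and the example peaks are hypothetical, and only the idea that a peak is kept when its centre frequency lies more than `bw_std_edge` standard deviations from both edges of the fitted range is taken from the text.

```python
import numpy as np

def keep_peaks_away_from_edges(center_freqs, peak_stds, freq_range, bw_std_edge=1.0):
    """Boolean mask: True for peaks far enough from both edges of freq_range.

    A peak is kept only if its centre frequency is more than
    bw_std_edge * (peak standard deviation) away from each edge.
    """
    cf = np.asarray(center_freqs, dtype=float)
    margin = np.asarray(peak_stds, dtype=float) * bw_std_edge
    low, high = freq_range
    return (np.abs(cf - low) > margin) & (np.abs(cf - high) > margin)

# Hypothetical candidate peaks in a 2-40 Hz fit: a delta/theta-range peak
# close to the lower edge, an alpha peak, and a peak near the upper edge.
mask = keep_peaks_away_from_edges(
    center_freqs=[2.5, 10.0, 39.0],
    peak_stds=[1.0, 1.5, 2.0],
    freq_range=(2.0, 40.0),
)
```

      In this example only the 10 Hz peak survives. A low-frequency peak that is dropped in this way is no longer modelled as periodic activity, which is why the aperiodic estimates deserve extra care when such activity is expected.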

      In line with the reviewer’s suggestion, we now state the version of specparam used in the paper.

      We thank reviewer 2 for pointing out two studies that used a time-resolved approach to spectral parameterisation. We have updated the text accordingly:

      Although a similar approach has been used to track temporal dynamics in sleep and resting state (e.g., Wilson et al., 2022; Ameen et al., 2024), as well as in task-based contexts (e.g., Barrie et al., 1996; Preston et al., 2025), its specific application to working memory paradigms remains underexplored.

      Reviewer 3:

      Reviewer 3 notes that the revised manuscript feels less intriguing than the original version. While we understand this concern, we believe this difference arises from a misalignment in expectations regarding the scope and purpose of our study. We think the reviewer is interpreting our work as focusing on whether theta activity is elicited in a paradigm that reliably produces theta oscillations. In contrast, our study is framed around a working memory task in which, based on prior literature, we expected to observe theta activity but instead found an absence of theta spectral peaks in almost all participants. Note that the absence of theta is already noteworthy in itself, given that theta oscillations are believed to play a crucial role in working memory.

      Importantly, Van Engen et al. (2024) have recently reported similar findings:

      “While we did not observe load-dependent aperiodic changes over the frontal midline, we did reveal the possibility that previous frontal midline theta results that do not correct for aperiodic activity likely do not reflect theta oscillations. /…/ While our results do not invalidate previous research into extracranial theta oscillations in relation to WM, they challenge popular and widely held beliefs regarding the mechanistic role for theta oscillations to group or segregate channels of information”.

      From this perspective, we maintain that the following statements are still justified:

      “substantial portion of the changes often attributed to theta oscillations in working memory tasks may be influenced by shifts in the spectral slope of aperiodic activity”

      "Note that although no prominent oscillatory peak in the theta range was observed at the group level, and some of this activity could potentially fall within the delta range, similar low-frequency patterns have often been referred to as 'theta' in previous work, even in the absence of a clear spectral peak"

      These formulations are intended to emphasize existing interpretations of changes in low-frequency power as theta oscillations in related research.

      Next, Reviewer 3 pointed out that “spectral reflection (peak?) in spectral power plot does not imply that an event is repeating (i..e. oscillatory).” We agree with the reviewer that not every spectral peak implies a true oscillation. To address this, we complemented the power analyses with a measure of rhythmicity (phase autocorrelation function, pACF) after the first round of reviews, and the pACF results were largely similar to those for periodic activity. These results suggest that, in our case, periodic activity is indeed largely oscillatory.

      However, we do agree with the reviewer that the term “oscillatory” is not interchangeable with “periodic”. To address this, we reviewed the paper for all appearances of “oscillations”, “oscillatory” and related terms, and replaced them with “power”, “spectral” or “periodic activity” where appropriate (all changes are marked in red in the latest version of the manuscript).

      Examples of corrections:

      “Changes in aperiodic activity appear as low-frequency oscillations in baseline-corrected time-frequency plots” → low-frequency power

      “The periodic component includes only the parameterised oscillatory peak” → spectral peak

      “FOOOF decomposition may miss low-frequency oscillations near the edges of the spectrum” → low-frequency peaks

      We disagree with the reviewer’s assertion that the subtitle “Aperiodic parameters are largely independent of oscillatory activity” is misleading for a methods-oriented paper. The full subtitle is “Rhythmicity analysis reveals aperiodic parameters are largely independent of oscillatory activity”. Since rhythmicity is a phase-based measure that requires repeating dynamics and is therefore indicative of oscillations, we believe this phrasing is technically accurate.

      Finally, we would like to emphasise our contribution once again. Our analyses of rhythmicity, spectrally parameterised power, and baseline-corrected power offer different perspectives on the data. Each of these analyses may lead to different interpretations, but performing all of them on the same data provides a more comprehensive insight into what is actually going on in the data.

      Our findings demonstrate that conclusions drawn from a single analytical approach may be incomplete or misleading. For example, as we discuss in the paper, many studies examine theta-gamma coupling in scalp EEG during n-back tasks without first establishing whether theta activity genuinely oscillates (e.g. Rajji et al., 2016). The absence of true theta oscillations would undermine the validity of such analyses. Our multifaceted approach provides researchers with a systematic framework for validating oscillatory assumptions before proceeding with more complex analyses.

    1. Praising students for merely meeting expectations may reduce student behavior over time as it “cheapens” your praise.

      This is something I agree with wholeheartedly, and I think it is because I see this in my job. We have an "Employee of the Quarter" program, and it sounds wonderful on the surface, but the unfortunate reality is that every single person will eventually get this award even if they don't deserve it. This will cause employees to think, "Oh, I can get this extra special recognition and this award just for being here/doing below the bare minimum/doing the bare minimum."

    1. Group G Ben Braniff, Kim Maynard, Nick Devic, Maria Echeverri Solis, Sam Yalda

      1. Design has a major impact on the world and society. Even the little things can add up to a lot. Sustainability is a revolutionary idea that should be at the core of every design now.

      2. Society is another bottom line meaning all design inherently affects humans and/or is designed for humans. It's important to design for the extremes and the edge cases like people with disabilities.

      3. Corporations output a lot of waste. When they make small changes to be more sustainable, it results in big changes and saving a lot of material. Small changes can include anything from using 2% less plastic per water bottle to using wood buttons instead of plastic ones.

      4. A lot of people don't consider themselves disabled, but it's very common at some point in people's lives to have a certain level of impairment. It's important to keep this in mind when designing as you're designing for the general population--not just a specific individual.

      5. Addressing issues like world hunger may require rethinking the way we design food production. One example they gave is choosing kangaroo meat over beef as a more environmentally sustainable option.

      6. Thoughtful design choices, such as the video's example of adding white circles inside letters to reduce ink use, can improve efficiency and conserve resources.

      7. It is interesting how he opens up his discussion to slowly introduce that design isn't just about doing it for marketing or 'profit' as he pointed out. When watching this it helps a person realize that design is so much more powerful than that if you put it towards another cause. Design could end up being the solution to some of the biggest problems in society.

      8. A very important point he made was that improving accessibility benefits many more people than just those who initially needed it, such as people with disabilities. From this I think a good takeaway is that design should always be considerate of any disabilities/needs that the audience might have, because sometimes that design is just better for everyone in general.

      9. My first take is design should go beyond money and aesthetics. By thinking about sustainability and accessibility the designers can create solutions that are socially responsible and environmentally friendly.

      10. My second take is that when you design with people with disabilities, you end up with solutions that are more usable and inclusive.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work aims to elucidate the molecular mechanisms affected in hypoxic conditions, causing reduced cortical interneuron migration. They use human assembloids as a migratory assay of subpallial interneurons into cortical organoids and show substantially reduced migration upon 24 hours of hypoxia. Bulk and scRNA-seq show adrenomedullin (ADM) up-regulation, as well as its receptor RAMP2, confirmed at the protein level. Adding ADM to the culture medium after hypoxic conditions rescues the migration deficits, even though the subtype of interneurons affected is not examined. However, the authors demonstrate very clearly that ineffective ADM does not rescue the phenotype, and blocking RAMP2 also interferes with the rescue. The authors are also applauded for using 4 different cell lines and using human fetal cortex slices as an independent method to explore the DLXi1/2GFP-labelled iPSC-derived interneuron migration in this substrate with and without ADM addition (after confirming that also in this system ADM is up-regulated). Finally, the authors demonstrate PKA-CREB signalling mediating the effect of ADM addition, which also leads to up-regulation of GABA receptors. Taken together, this is a very carefully done study on an important subject - how hypoxia affects cortical interneuron migration. In my view, the study is of great interest.

      Strengths:

      The strengths of the study are the novelty and the thorough work using several culture methods and 4 independent lines.

      Weaknesses:

      The main weakness is that other genes regulated upon hypoxia are not confirmed, such that readers will not know until which fold change/stats cut-off data are reliable.

      Reviewer #2 (Public review):

      Summary

      The manuscript by Puno and colleagues investigates the impact of hypoxia on cortical interneuron migration and downstream signaling pathways. They establish two models to test hypoxia, cortical forebrain assembloids, and primary human fetal brain tissue. Both of these models provide a robust assay for interneuron migration. In addition, they find that ADM signaling mediates the migration deficits and rescue using exogenous ADM.

      Strengths:

      The findings are novel and very interesting to the neurodevelopmental field, revealing new insights into how cortical interneurons migrate and as well, establishing exciting models for future studies. The authors use sufficient iPSC lines including both XX and XY, so the analysis is robust. In addition, the RNAseq data with re-oxygenation is a nice control to see what genes are changed specifically due to hypoxia. Further, the overall level of validation of the sequencing data and involvement of ADM signaling is convincing, including the validation of ADM at the protein level. Overall, this is a very nice manuscript.

      Weaknesses:

      I have a few comments and suggestions for the authors. See below.

      Reviewer #3 (Public review):

      Summary:

      The authors aimed to test whether hypoxia disrupts the migration of human cortical interneurons, a process long suspected to underlie brain injury in preterm infants but previously inaccessible for direct study. Using human forebrain assembloids and ex vivo developing brain tissue, they visualized and quantified interneuron migration under hypoxic conditions, identified molecular components of the response, and explored the effect of pharmacological intervention (specifically ADM) on restoring the migration deficits.

      Strengths:

      The major strength of this study lies in its use of human forebrain assembloids and ex vivo prenatal brain tissue, which provide a direct system to study interneuron migration under hypoxic conditions. The authors combine multiple approaches: long-term live imaging to directly visualize interneuron migration, bulk and single-cell transcriptomics to identify hypoxia-induced molecular responses, pharmacological rescue experiments with ADM to establish therapeutic potential, and mechanistic assays implicating the cAMP/PKA/pCREB pathway and GABA receptor expression in mediating the effect. Together, this rigorous and multifaceted strategy convincingly demonstrates that hypoxia disrupts interneuron migration and that ADM can restore this defect through defined molecular mechanisms.

      Overall, the authors achieve their stated aims, and the results strongly support their conclusions. The work has a significant impact by providing the first direct evidence of hypoxia-induced interneuron migration deficits in the human context, while also nominating a candidate therapeutic avenue. Beyond the specific findings, the methodological platform - particularly the combination of assembloids and live imaging - will be broadly useful to the community for probing neurodevelopmental processes in health and disease.

      Weaknesses:

      The main weakness of the study lies in the extent to which forebrain assembloids recapitulate in vivo conditions, as the migration of interneurons from hSO to hCO does not fully reflect the native environment or migratory context of these cells. Nevertheless, this limitation is tempered by the fact that the work provides the first direct observation of human interneuron migration under hypoxia, representing a major advance for the field. In addition, while the transcriptomic analyses are valuable and highlight promising candidates, more in-depth exploration will be needed to fully elucidate the molecular mechanisms governing neuronal migration and maturation under hypoxic conditions.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The authors should examine if all cortical interneurons are affected by ADM or only subtypes (Parvalbumin/Somatostatin).

      We thank the reviewer for raising this important question. In our study, we utilized the Dlx1/2b::eGFP reporter to broadly label cortical interneurons; however, this system does not distinguish specific interneuron subtypes. To address this, in the revised version of the manuscript we will use the single-cell RNA sequencing data and immunostainings to provide this information. Based on previous analyses from Birey et al. (Cell Stem Cell, 2022), we expect interneurons within assembloids to express mostly calbindin (CALB2) and somatostatin (SST) at this in vitro stage of development; the parvalbumin subtype appears later, based on data from Birey et al. (Nature, 2017) and more recently from Varela et al. (bioRxiv, 2025).

      In parallel, we will analyze available scRNA-seq data from developing human primary brain tissue of a similar age to that used in the manuscript, and check whether these subtypes of interneurons are similar to the ones within assembloids.

      (2) The authors should test more candidates from their bulk RNA-seq data with different fold changes for regulation after hypoxia, to allow the reader to judge at which cut-off the DEGs may be reproducible. This would make this database much more valuable for the field of hypoxia research.

      We appreciate the reviewers’ thoughtful suggestion. In addition to the bulk RNA-seq analysis, we did validate several upregulated hypoxia-responsive genes with varying fold changes by qPCR; these include PDK1, PFKP, VEGFA (Figure S1). 

      We do agree that an in-depth investigation of specific cut-offs would be interesting; however, this could be the focus of a different manuscript.

      Reviewer #2 (Recommendations for the authors):

      (1) Can the authors comment on the possibility of inflammatory response pathways being activated by hypoxia? Has this been shown before? While not the focus of the manuscript, it could be discussed in the Discussion as an interesting finding and potential involvement of other cells in the Hypoxic response.

      We thank the reviewer for this important comment about inflammation. Indeed, hypoxia has been shown to activate inflammatory response pathways. In various studies, HIF-1α was found to interact with NF-κB signaling, leading to the upregulation of pro-inflammatory cytokines such as IL-1β, IL-6, and TNF-α (Rius et al., Cell, 2008; Hagberg et al., Nat Rev Neurol, 2015).

      In our transcriptomics data (Figure 2D), and to the reviewer’s point, we identified enrichment of inflammatory signaling responses following hypoxic exposure. Since hSOs at the time of analysis do contain astrocytes, we think these glia contribute to the observed pro-inflammatory changes. Based on these results, and because ADM is known to have strong anti-inflammatory properties, the effects of ADM on hypoxic astrocytes should be investigated in future studies focused on hypoxia-induced inflammation. In the revision, we will address this comment in the Discussion section and cite the appropriate papers.

      (2) Could the authors comment on the mechanism at play here with respect to ADM and binding to RAMP2 receptors - is this a potential autocrine loop, or is the source of ADM from other cell types besides inhibitory neurons? Given the scRNA-seq data, what cell-to-cell mechanisms can be at play? Since different cells express ADM, there could be different mechanisms in place in ventral vs dorsal areas.

      Based on our scRNA-seq data in hSOs showing significant upregulation of ADM expression in astrocytes and progenitors, we speculate that the primary mechanism is likely to involve paracrine interactions. However, we cannot exclude autocrine mechanisms with the included experiments. Dissecting these interactions in a cell-type specific manner could be an important focus for future ADM-related studies.

      To address the question about the possible different mechanisms in ventral versus dorsal areas, in the revision we will plot and include in the figures the data about the cell-type expression of ADM and its receptors in hCOs.

      (3) For data from Figure 6 - while the ELISA assays are informative to determine which pathways (PKA, AKT, ERK) are active, there is no positive control to indicate these assays are "working" - therefore, if possible, western blot analysis from assembloid tissue could be used (perhaps using the same lysates from Figure 3) as an alternative to validate changes at the protein level (however, this might prove difficult); further to this, is P-CREB activated at the protein level using WB?

      We thank the reviewer for this comment and the observation. Although we did not include a traditional positive control in these ELISA assays, several lines of evidence indicate that the measurements are reliable. First, the standard curves behaved as expected, and all sample values fell within the assay’s dynamic range. Second, technical replicates showed low variability, and the observed changes across experimental conditions (e.g., hypoxia vs. control) were consistent with the expected biological responses based on previous literature. We agree that including western blot validation would strengthen the findings, and we will note this for our future studies focused on CREB and ADM.

      (4) Could the authors comment further on the mechanism and what biological pathways and potential events are downstream of ADM binding to RAMP2 in inhibitory neurons? What functional impact would this have linked to the CREB pathway proposed? While the link to GABA receptors is proposed, CREB has many targets beyond this.

      We appreciate the reviewer’s insightful question. Currently, little is known about the molecular pathways and downstream cellular events triggered by ADM binding to RAMP2 in inhibitory neurons, or in brain cells in general. Our study provides the first information about the cell-type-specific expression of ADM in baseline and hypoxic conditions, which is one of its key novelties.

      While the signaling landscape of ADM in interneurons is largely unexplored, several studies in other (non-brain) cell types have demonstrated that ADM binding to RAMP2 can activate downstream cascades such as the cAMP/PKA/CREB pathway, PI3K/AKT, and ERK/MAPK, all of which are also known to be critical regulators of neuronal development and survival. These previously published data, along with our CREB-targeted findings in hypoxic interneurons, suggest that ADM–RAMP2 signaling could influence multiple aspects of interneuron biology, but these remain to be evaluated in future studies.

      We agree with the reviewer that CREB has a wide range of transcriptional targets. We decided to focus on GABA as a target of CREB for two main reasons: (i) GABA signaling has previously been shown to play an important role in the migration of cortical interneurons, and (ii) a previous study by Birey et al. (Cell Stem Cell, 2022) demonstrated that CREB pathway activity is essential for regulating interneuron migration in assembloid models of Timothy syndrome, providing further evidence that dysregulation of CREB activity disrupts migration dynamics.

      While our study provides a first step toward uncovering the mechanisms of interneuron migration protection by ADM, we fully acknowledge that future work will be needed to delineate the full spectrum of ADM–RAMP2 downstream signaling events in inhibitory neurons and other brain cells.

      (5) Does hypoxia cause any changes to inhibitory neurogenesis (earlier stages than migration?) - this might already be known, but was not discussed.

      We appreciate this question from the reviewer; however, this was not something that we focused on in this manuscript due to the already large amount of data included. A separate study focusing on neurogenesis defects and the molecular mechanisms of injury for that specific developmental process would be an important next step.

      (6) In the Discussion section, it might be worth detailing to the readers what the functional impact of delayed/reduced migration of inhibitory neurons into the cortex might result in, in terms of functional consequences for neural circuit development.

      We thank the Reviewer for the suggestion of detailing the functional impact of reduced inhibitory neuron migration. We will revise the manuscript by incorporating a paragraph about this in the Discussion section.

      Reviewer #3 (Recommendations for the authors):

      Most of the evidence presented is convincing in supporting the conclusions, and I have only minor suggestions for improvement:

      (1) The bulk RNA-seq was performed in hSOs only, which may not fully capture the phenotypes of migrating or migrated interneurons. It would be valuable, if feasible, to sort migrated cells from hSO-hCO assembloids and specifically examine their molecular mediators.

      We thank the reviewer for this suggestion. While it is likely that the cellular environment will have some influence on a subset of the molecular changes, based on all the data from the manuscript and our specific target, the RNA-sequencing on hCOs was sufficient to capture essential changes like ADM upregulation. The in-depth exploration on differential responses of migrated versus non-migrated interneurons to hypoxia could be the focus of a different project.

      (2) In Figure 3, it is striking that cell-type heterogeneity dominates over hypoxia vs. control conditions. A joint embedding of hSO and hCO cells could provide further insight into molecular differences between migrated and non-migrated interneurons.

      We thank the reviewer for this observation and opportunity to clarify. Since we manually separated the assembloids before the analyses, we processed these samples separately. That is why they separate like this. In the revision, we will add data about ADM expression and its receptors’ expression in the hCOs.

      (3) It would be helpful to expand the discussion on how closely the migration observed in hSO-hCO assembloids reflects in vivo conditions, and what environmental aspects are absent from this model. This would better frame the interpretation and translational relevance of the findings.

      We thank the Reviewer for bringing up this important point. Although the assembloid model offers the unique advantage of allowing the direct investigation of migration patterns of hypoxic interneurons, we agree that it does not fully recapitulate the in vivo environment. While there are multiple aspects that cannot be recapitulated in vitro at this time (e.g., cellular complexity, vasculature, immune response), we are encouraged by the validation of our main findings in ex vivo developing human brain tissue, which strongly supports the validity of our findings for in vivo conditions.

      We will expand our discussion to include more details and the need to validate these findings using in vivo models, while also acknowledging that different species (e.g. rodents versus non-human primates versus humans) might have different responses to hypoxia.

      (4) The authors suggest that hypoxia is also associated with delayed interneuron maturation, yet the bulk RNA-seq data primarily reveal stress and hypoxia-related genes. A more detailed discussion of why genes linked to interneuron maturation and function were not strongly affected would clarify this point.

      We thank the Reviewer for the opportunity to clarify.

      The RNA-seq was performed during the acute stages of hypoxia/reoxygenation, and we think a maturation phenotype might be difficult to capture at this point; detecting it would require analysis at later in vitro assembloid maturation stages.

      Our speculation about a possible maturation defect is based on previous developmental biology studies showing that failure of interneurons to reach their final cortical location within a specified developmental window impairs their integration within the neuronal network, and thus leads to maturation defects and possible elimination by apoptosis.

      Since preterm infants suffer from countless hypoxic events over multiple months, we suggest these repetitive events are likely to induce cumulative delays in migration, inability of interneurons to reach their target in time, followed by abnormal integration within the excitatory network, and eventual elimination of some of these interneurons through apoptosis. However, the direct demonstration of this effect following a hypoxic insult would require prolonged in vivo experiments in rodents to follow the migration, network integration and apoptosis of interneurons; to our knowledge this experimental design is not technically feasible at this time.

      (5) Relatedly, while the focus on interneuron migration is well justified, acknowledging how hypoxia might also impact other aspects of cortical development (e.g., progenitor proliferation, neuronal maturation, or circuit integration) would place the findings in a broader developmental framework and strengthen their relevance.

      We appreciate the Reviewer’s suggestion to discuss the role of hypoxia on other processes during cortical development. In the revised manuscript, we will include citations about the effects of hypoxia on interneuron proliferation, maturation and circuit integration as available, and also expand to other cell types known to be affected.

      (6) Very minor: in Figure S3C and D, it was not stated what the colors mean (grey: control, yellow: hypoxia)

      Thank you for pointing out this error; we will correct it in our revision.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Overall, the conclusions of the paper are mostly supported by the data but may be overstated in some cases, and some details are also missing or not easily recognizable within the figures. The provision of additional information and analyses would be valuable to the reader and may even benefit the authors' interpretation of the data. 

      We thank the reviewer for the thoughtful and constructive feedback. We are pleased that the reviewer found the overall conclusions of our paper to be well supported by the data, and we appreciate the suggestions for improving figure clarity and interpretive accuracy. Below, we address each point with corresponding revisions.

      The conclusion that DREADD expression gradually decreases after 1.5-2 years is only based on a select few of the subjects assessed; in Figure 2, it appears that only 3 hM4Di cases and 2 hM3Dq cases are assessed after the 2-year timepoint. The observed decline appears consistent within the hM4Di cases, but not for the hM3Dq cases (see Figure 2C: the AAV2.1-hSyn-hM3Dq-IRES-AcGFP line is increasing after 2 years.) 

      We agree that our interpretation should be stated more cautiously, given the limited number of cases assessed beyond the two-year timepoint. In the revised manuscript, we have clarified in the Results that the observed decline is based on a subset of animals. We have also added text stating that while a consistent decline was observed in hM4Di-expressing monkeys, the trajectory for hM3Dq expression was more variable, with at least one case showing an increased signal beyond two years.

      Revised Results section:

      Lines 140, “hM4Di expression levels remained stable at peak levels for approximately 1.5 years, followed by a gradual decline observed in one case after 2.5 years, and after approximately 3 years in the other two cases (Figure 2B, a and e/d, respectively). Compared with hM4Di expression, hM3Dq expression exhibited greater post-peak fluctuations. Nevertheless, it remained at ~70% of peak levels after about 1 year. This post-peak fluctuation was not significantly associated with the cumulative number of DREADD agonist injections (repeated-measures two-way ANOVA, main effect of activation times, F<sub>(1,6)</sub> = 5.745, P = 0.054). Beyond 2 years post-injection, expression declined to ~50% in one case, whereas another case showed an apparent increase (Figure 2C, c and m, respectively).”

      Given that individual differences may affect expression levels, it would be helpful to see additional labels on the graphs (or in the legends) indicating which subject and which region are being represented for each line and/or data point in Figure 1C, 2B, 2C, 5A, and 5B. Alternatively, for Figures 5A and B, an accompanying table listing this information would be sufficient. 

      We thank the reviewer for these helpful suggestions. In response, we have revised the relevant figures (Fig. 1C, 2B, 2C, and 5) as noted in the “Recommendations for the authors”, including simplifying visual encodings and improving labeling. We have also updated Table 2 to explicitly indicate the animal ID and brain regions associated with each data point shown in the figures.

      While the authors comment on several factors that may influence peak expression levels, including serotype, promoter, titer, tag, and DREADD type, they do not comment on the volume of injection. The range in volume used per region in this study is between 2 and 54 microliters, with larger volumes typically (but not always) being used for cortical regions like the OFC and dlPFC, and smaller volumes for subcortical regions like the amygdala and putamen. This may weaken the claim that there is no significant relationship between peak expression level and brain region, as volume may be considered a confounding variable. Additionally, because of the possibility that larger volumes of viral vectors may be more likely to induce an immune response, which the authors suggest as a potential influence on transgene expression, not including volume as a factor of interest seems to be an oversight. 

      We thank the reviewer for raising this important issue. We agree that injection volume could act as a confounding variable, particularly since larger volumes were used only in handheld cortical injections. This overlap makes it difficult to disentangle the effect of volume from those of brain region or injection method. Moreover, data points associated with these larger volumes also deviated when volume was included in the model.

      To address this, we performed a separate analysis restricted to injections delivered via microinjector, where a comparable volume range was used across cases. In this subset, we included injection volume as an additional factor in the model and found that volume did not significantly impact peak expression levels. Instead, the presence of co-expressed protein tags remained a significant predictor, while viral titer no longer showed a significant effect. These updated results have replaced the originals in the revised Results section and in the new Figure 5. We have also revised the Discussion to reflect these updated findings.

      The authors conclude that vectors encoding co-expressed protein tags (such as HA) led to reduced peak expression levels, relative to vectors with an IRES-GFP sequence or with no such element at all. While interesting, this finding does not necessarily seem relevant for the efficacy of long-term expression and function, given that the authors show in Figures 1 and 2 that peak expression (as indicated by a change in binding potential relative to non-displaced radioligand, or ΔBPND) appears to taper off in all or most of the constructs assessed. The authors should take care to point out that the decline in peak expression should not be confused with the decline in longitudinal expression, as this is not clear in the discussion; i.e. the subheading, "Factors influencing DREADD expression," might be better written as, "Factors influencing peak DREADD expression," and subsequent wording in this section should specify that these particular data concern peak expression only. 

      We appreciate this important clarification. In response, we have revised the title to "Protein tags reduce peak DREADD expression levels" in the Results section and “Factors influencing peak DREADD expression levels” in the Discussion section. Additionally, we specified that our analysis focused on peak ΔBP<sub>ND</sub> values around 60 days post-injection. We have also explicitly distinguished these findings from the later-stage changes in expression seen in the longitudinal PET data in both the Results and Discussion sections.

      Reviewer #1 (Recommendations for the authors):

      (1) Will any of these datasets be made available to other researchers upon request?

      All data used to generate the figures have been made publicly available via our GitHub repository (https://github.com/minamimoto-lab/2024-Nagai-LongitudinalPET.git). This has been stated in the "Data availability" section in the revised manuscript.

      (2) Suggested modifications to figures:

      a) In Figures 2B and C, the inclusion of "serotype" as a separate legend with individual shapes seems superfluous, as the serotype is also listed as part of the colour-coded vector

      We agree that the serotype legend was redundant since this information is already included in the color-coded vector labels. In response, we have removed the serotype shape indicators and now represent the data using only vector-construct-based color coding for clarity in Figure 2B and C.

      b) In Figures 3A and B, it would be nice to see tics (representing agonist administration) for all subjects, not just the two that are exemplified in panels C-D and F-H. Perhaps grey tics for the non-exemplified subjects could be used.

      In response, we have included black and white ticks to indicate all agonist administrations across all subjects in Figure 3A and B, with the type of agonist clearly specified.

      c) In Figure 4C, a Nissl-stained section is said to demonstrate the absence of neuronal loss at the vector injection sites. However, if the neuronal loss is subtle or widespread, this might not be easily visualized by Nissl. I would suggest including an additional image from the same section, in a non-injected cortical area, to show there is no significant difference between the injected and non-injected region.

      To better demonstrate the absence of neuronal loss at the injection site, we have included an image from the contralateral, non-injected region of the same section for comparison (Fig. 4C).

      d) In Figure 5A: is it possible that the hM3Dq construct with a titer of 5×10^13 gc/ml is an outlier, relative to the other hM3Dq constructs used?

      We thank the reviewer for raising this important observation. To evaluate whether the high-titer construct represented a statistical outlier that might artifactually influence the observed trends, we performed a permutation-based outlier analysis. This assessment identified the point in question, as well as one additional case (titer 4.6×10^13 gc/ml, #255, L_Put), as significant outliers relative to the distribution of the dataset.

      Accordingly, we excluded these two data points from the analysis. Importantly, this exclusion did not meaningfully alter the overall trend or the statistical conclusions; specifically, the significant effect of co-expressed protein tags on peak expression levels remains robust. We have updated the Methods section to describe this outlier handling and added a corresponding note in the figure legend.

      Reviewer #2 (Public review): 

      Weaknesses 

      This study is a meta-analysis of several experiments performed in one lab. The good side is that it combined a large amount of data that might not have been published individually; the downside is that all things were not planned and equated, creating a lot of unexplained variances in the data. This was yet judiciously used by the authors, but one might think that planned and organized multicentric experiments would provide more information and help test more parameters, including some related to inter-individual variability, and particular genetic constructs. 

      We thank the reviewer for bringing this important point to our attention. We fully acknowledge that the retrospective nature of our dataset, which was compiled from multiple studies conducted within a single laboratory, introduces variability related to differences in injection parameters and scanning timelines. While this reflects the practical realities and constraints of long-term NHP research, we agree that more standardized and prospectively designed studies would better control such sources of variance. To address this, we have added the following statement to the "Technical consideration" section in the Discussion:

      Lines 297, "This study included a retrospective analysis of datasets pooled from multiple studies conducted within a single laboratory, which inherently introduced variability across injection parameters and scan intervals. While such an approach reflects real-world practices in long-term NHP research, future studies, including multicenter efforts using harmonized protocols, will be valuable for systematically assessing inter-individual differences and optimizing key experimental parameters."

      Reviewer #2 (Recommendations for the authors):

      I just have a few minor points that might help improve the paper:

      (1) Figure 1C y-axis label: should add deltaBPnd in parentheses for clarity.

      We have added “ΔBP<sub>ND</sub>” to the y-axis label for clarity.

      The choice of a sigmoid curve is the simplest clear fit, but it doesn't really consider the presence of the peak described in the paper. Would there be a way to fit the dynamic including fitting the peak?

      We agree that using a simple sigmoid curve for modeling expression dynamics is a limitation. In response to this and a similar comment from Reviewer #3, we tested a double logistic function (as suggested) to see if it better represented the rise and decline pattern. However, as described below, the original simple sigmoid curve was a better fit for the data. We have included a discussion of this limitation of the analysis. See Reviewer #3 recommendations (2) for details.

      The colour scheme in Figure 1C should be changed to make things clearer, and maybe use another dimension (like dotted lines) to separate hM4Di from hM3Dq.

      We have improved the visual clarity of Figure 1C by modifying the color scheme to represent vector construct and using distinct line types (dashed for hM4Di and solid for hM3Dq data) to separate DREADD type.

      (2) Figure 2

      I don't understand how the referencing to 100 was made: was it by selecting the overall peak value or the peak value observed between 40 and 80 days? If the former then I can't see how some values are higher than the peak. If the second then it means some peak values occurred after 80 days and data are not completely re-aligned.

      We thank the reviewer for the opportunity to clarify this point. The normalization was based on the peak value observed between 40 and 80 days post-injection, as this window typically captured the peak expression phase in our dataset (see Figure 1). However, in some long-term cases where PET scans were limited during this period (e.g., with only one scan performed at day 40), it is possible that the actual peak occurred later. Therefore, instances where ΔBP<sub>ND</sub> values slightly exceeded the reference peak at later time points likely reflect this sampling limitation. We have clarified this methodological detail in the revised Results section to improve transparency.
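The normalization described above can be illustrated with a minimal sketch. The function name, the data values, and the scan days are hypothetical and chosen only for illustration; they are not taken from the study. The sketch shows how a series with a single scan inside the 40 to 80 day window can yield later values above 100%.

```python
def normalize_to_window_peak(days, values, window=(40, 80)):
    """Express each value as a percentage of the peak observed inside `window`."""
    lo, hi = window
    in_window = [v for d, v in zip(days, values) if lo <= d <= hi]
    if not in_window:
        raise ValueError("no scan falls inside the reference window")
    reference = max(in_window)  # reference peak, defined as 100%
    return [100.0 * v / reference for v in values]


# Hypothetical series: only the day-40 scan falls inside the reference window,
# so the true peak (here at day 100) is missed by the reference.
days = [20, 40, 100, 400]
delta_bp_nd = [0.30, 0.80, 1.00, 0.90]
normalized = normalize_to_window_peak(days, delta_bp_nd)
```

In this hypothetical case the day-100 and day-400 values normalize to more than 100% because the reference was taken before the actual peak, which is the sampling limitation described in the response.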

      The methods section mentions the use of CNO but this is not in the main paper which seems to state that only DCZ was used: the authors should clarify this

      Although DCZ was the primary agonist used, CNO and C21 were also used in a few animals (e.g., monkeys #153, #221, and #207) for behavioral assessments. We have clarified this in the Results section and revised Figure 3 to indicate the specific agonist used for each subject. Additionally, we have updated the Methods section to clearly specify the use and dosage of DCZ, CNO, and C21, to avoid any confusion regarding the experimental design.

      Reviewer #3 (Public review): 

      Minor weaknesses are related to a few instances of suboptimal phrasing, and some room for improvement in time course visualization and quantification. These would be easily addressed in a revision. These findings will undoubtedly have a very significant impact on the rapidly growing but still highly challenging field of primate chemogenetic manipulations. As such, the work represents an invaluable resource for the community.

      We thank the reviewer for the positive assessment of our manuscript and for the constructive suggestions. We address each comment in the following point-by-point responses and have revised the manuscript accordingly.

      Reviewer #3 (Recommendations for the authors):

      (1) Please clarify what the reasoning was behind restricting the analysis in Figure 1 only to 7 monkeys with subcortical AAV injection.

      We focused the analysis shown in Figure 1 on 7 monkeys with subcortical AAV injections that received comparable injection volumes. These data were primarily part of vector test studies, allowing for repeated PET scans within 150 days post-injection. In contrast, monkeys with cortical injections, including those with larger volumes, were allocated to behavioral studies and therefore were not scanned as frequently during the early phase. We will clarify this rationale in the Results section.

      (2) Figure 1: Not sure if a simple sigmoid is the best model for these, mostly peaking and then descending somewhat, curves. I suggest testing a more complex model, for instance, double logistic function of a type f(t) = a + b/(1+exp(-c*(t-d))) - e/(1+exp(-g*(t-h))), with the first logistic term modeling the rise to peak, and the second term for partial decline and stabilization

      We appreciate the reviewer’s thoughtful suggestion to use a double logistic function to better model both the rising and declining phases of the expression curve. In response to this and a similar comment from Reviewer #2, we tested the proposed model and found that, while it could capture the peak and subsequent decline, the resulting fit appeared less biologically plausible (see below). Moreover, model comparison using BIC favored the original simple sigmoid model (BIC = 61.1 vs. 62.9 for the simple and double logistic models, respectively). This information has been included in the revised figure legend for clarity.

      Given these results, we retained the original simple sigmoid function in the revised manuscript, as it provides a sufficient and interpretable approximation of the early expression trajectory—particularly the peak expression-time estimation, which was the main purpose of this analysis. We have updated the Methods section to clarify our modeling and rationale as follows:

      Lines 530, "To model the time course of DREADD expression, we used a single sigmoid function, referencing past in vivo fluorescent measurements (Diester et al., 2011). Curve fitting was performed using least squares minimization. For comparison, a double logistic function was also tested and evaluated using the Bayesian Information Criterion (BIC) to assess model fit."
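The model comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the four-parameter form of the single sigmoid, the least-squares form of the BIC (n·ln(RSS/n) + k·ln(n), which assumes Gaussian errors), the data, and the parameter values are all hypothetical. The double logistic follows the form proposed by the reviewer. The fitting step is omitted; the parameter tuples stand in for the output of a least-squares optimizer.

```python
import math


def sigmoid(t, a, b, c, d):
    """Single logistic: baseline a, amplitude b, slope c, midpoint d (4 parameters)."""
    return a + b / (1.0 + math.exp(-c * (t - d)))


def double_logistic(t, a, b, c, d, e, g, h):
    """Rise term minus a second logistic for partial decline (7 parameters)."""
    return sigmoid(t, a, b, c, d) - e / (1.0 + math.exp(-g * (t - h)))


def bic_least_squares(residuals, n_params):
    """BIC for a least-squares fit, assuming Gaussian errors."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + n_params * math.log(n)


# Hypothetical time course (days post-injection, normalized signal).
days = [10, 20, 30, 45, 60, 90, 120, 150]
observed = [0.05, 0.20, 0.55, 0.90, 1.00, 0.95, 0.90, 0.92]

# Hypothetical parameters standing in for fitted values.
sig_params = (0.0, 0.95, 0.15, 28.0)
dbl_params = (0.0, 1.00, 0.15, 28.0, 0.08, 0.10, 90.0)

res_sig = [y - sigmoid(t, *sig_params) for t, y in zip(days, observed)]
res_dbl = [y - double_logistic(t, *dbl_params) for t, y in zip(days, observed)]

bic_sig = bic_least_squares(res_sig, n_params=4)
bic_dbl = bic_least_squares(res_dbl, n_params=7)
```

The model with the lower BIC is preferred. Because the penalty term grows with the number of parameters, the seven-parameter double logistic must reduce the residual sum of squares enough to offset three extra ln(n) terms, which is difficult with a small number of scans.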

      We also acknowledge that a more detailed understanding of post-peak expression changes will require additional PET measurements, particularly between 60- and 120-days post-injection, across a larger number of animals. We have included this point in the revised Discussion to highlight the need for future work focused on finer-grained modeling of expression decline:

      Lines 317, “Although we modeled the time course of DREADD expression using a single sigmoid function, PET data from several monkeys showed a modest decline following the peak. While the sigmoid model captured the early-phase dynamics and offered a reliable estimate of peak timing, additional PET scans—particularly between 60- and 120-days post-injection—will be essential to fully characterize the biological basis of the post-peak expression trajectories.”

      Author response image 1.

      (3) Figure 2: It seems that the individual curves are for different monkeys, I counted 7 in B and 8 in C, why "across 11 monkeys"? Were there several monkeys with both hM4Di and hM3Dq? Does not look like that from Table 1. Generally, I would suggest associating specific animals from Tables 1 and 2 to the panels in Figures 1 and 2.

      Some animals received multiple vector types, leading to more curves than individual subjects. We have revised the figure legends and updated Table 2 to explicitly relate each curve with the specific animal and brain region.

      (4) I also propose plotting the average of (interpolated) curves across animals, to convey the main message of the figure more effectively.

      We agree that plotting the mean of the interpolated expression curves would help convey the group trend. We added averaged curves to Figure 2BC.

      (5) Similarly, in line 155 "We assessed data from 17 monkeys to evaluate ... Monkeys expressing hM4Di were assessed through behavioral testing (N = 11) and alterations in neuronal activity using electrophysiology (N = 2)..." - please explain how 17 is derived from 11, 2, 5 and 1. It is possible to glean from Table 1 that the calculation is 11 (including 2 with ephys) + 5 + 1 = 17, but it might appear as a mistake if one does not go deep into Table 1.

      We have clarified in both the text and Table 1 that some monkeys (e.g., #201 and #207) underwent both behavioral and electrophysiological assessments, resulting in the overlapping counts. Specifically, the dataset includes 11 monkeys for hM4Di-related behavior testing (two of which underwent electrophysiology testing), 5 monkeys assessed for hM3Dq with FDG-PET, and 1 monkey assessed for hM3Dq with electrophysiology, totaling 19 assessments across 17 monkeys. We have revised the Results section to make this distinction more explicit to avoid confusion, as follows:

      Lines 164, "Monkeys expressing hM4Di (N = 11) were assessed through behavioral testing, two of which also underwent electrophysiological assessment. Monkeys expressing hM3Dq (N = 6) were assessed for changes in glucose metabolism via [<sup>18</sup>F]FDG-PET (N = 5) or alterations in neuronal activity using electrophysiology (N = 1).”

      (6) Line 473: "These stock solutions were then diluted in saline to a final volume of 0.1 ml (2.5% DMSO in saline), achieving a dose of 0.1 ml/kg and 3 mg/kg for DCZ and CNO, respectively." Please clarify: the injection volume was always 0.1 ml? then it is not clear how the dose can be 0.1 ml/kg (for a several kg monkey), and why DCZ and CNO doses are described in ml/kg vs mg/kg?

      We thank the reviewer for pointing out this ambiguity. We apologize for the oversight and also acknowledge that we omitted mention of C21, which was used in a small number of cases. To address this, we have revised the “Administration of DREADD agonist” section of the Methods to clearly describe the preparation, the volume, and dosage for each agonist (DCZ, CNO, and C21) as follows:

      Lines 493, “Deschloroclozapine (DCZ; HY-42110, MedChemExpress) was the primary agonist used. DCZ was first dissolved in dimethyl sulfoxide (DMSO; FUJIFILM Wako Pure Chemical Corp.) and then diluted in saline to a final volume of 1 mL, with the final DMSO concentration adjusted to 2.5% or less. DCZ was administered intramuscularly at a dose of 0.1 mg/kg for hM4Di activation, and at 1–3 µg/kg for hM3Dq activation. For behavioral testing, DCZ was injected approximately 15 min before the start of the experiment unless otherwise noted. Fresh DCZ solutions were prepared daily.

      In a limited number of cases, clozapine-N-oxide (CNO; Toronto Research Chemicals) or Compound 21 (C21; Tocris) was used as an alternative DREADD agonist for some hM4Di experiments. Both compounds were dissolved in DMSO and then diluted in saline to a final volume of 2–3 mL, also maintaining DMSO concentrations below 2.5%. CNO and C21 were administered intravenously at doses of 3 mg/kg and 0.3 mg/kg, respectively.”

      (7) Figure 5A: What do regression lines represent? Do they show a simple linear regression (then please report statistics such as R-squared and p-values), or is it related to the linear model described in Table 3 (but then I am not sure how separate DREADDs can be plotted if they are one of the factors)?

      We thank the reviewer for the insightful question. In the original version of Figure 5A, the regression lines represented simple linear fits used to illustrate the relationship between viral titer and peak expression levels, based on our initial analysis in which titer appeared to have a significant effect without any notable interaction with other factors (such as DREADD type).

      However, after conducting a more detailed analysis that incorporated injection volume as an additional factor and excluded cortical injections and statistical outliers (as suggested by Reviewer #1), viral titer was no longer found to significantly predict peak expression levels. Consequently, we revised the figure to focus on the effect of reporter tag, which remained the most consistent and robust predictor in our model.

      In the updated Figure 5, we have removed the regression lines that showed the relationship between viral titer and expression level.

    1. This manuscript examines preprint review services and their role in the scholarly communications ecosystem.  It seems quite thorough to me. In Table 1 they list many peer-review services that I was unaware of e.g. SciRate and Sinai Immunology Review Project.

      To help elicit critical & confirmatory responses for this peer review report I am trialling Elsevier’s suggested “structured peer review” core questions, and treating this manuscript as a research article.

      Introduction

      1. Is the background and literature section up to date and appropriate for the topic?

        Yes.

      2. Are the primary (and secondary) objectives clearly stated at the end of the introduction?

        No. Instead the authors have chosen to put the two research questions on page 6 in the methods section. I wonder if they ought to be moved into the introduction – the research questions are not methods in themselves. Might it be better to state the research questions first and then detail the methods one uses to address those questions afterwards (as Elsevier’s structured template seems implicitly to prefer)?

      Methods

      1. Are the study methods (including theory/applicability/modelling) reported in sufficient detail to allow for their replicability or reproducibility?

        I note with approval that the version number of the software they used (ATLAS.ti) was given.

        I note with approval that the underlying data is publicly archived under CC BY at figshare.

        The ATLAS.ti report data spreadsheet could do with some small improvement – the column headers are a little cryptic, e.g. “Nº  ST ” and “ST”, which I eventually deduced were Number of Schools of Thought and Schools of Thought (?)

        Is there a rawer form of the data that could be deposited with which to evidence the work done? The ATLAS.ti report spreadsheet seemed like it was downstream output data from ATLAS.ti. What was the rawer input data entered into ATLAS.ti? Can this be archived somewhere in case researchers want to reanalyse it using other tools and methods?

        I note with disapproval that ATLAS.ti is proprietary software which may hinder the reproducibility of this work. Nonetheless I acknowledge that ATLAS.ti usage is somewhat ‘accepted’ in social sciences despite this issue.

        I think the qualitative text analysis is a little vague and/or under-described: “Using ATLAS.ti Windows (version 23.0.8.0), we carried out a qualitative analysis of text from the relevant sites, assigning codes covering what they do and why they have chosen to do it that way.” That’s not enough detail. Perhaps an example or two could be given? Was inter-rater reliability assessed when ‘assigning codes’? How do we know the ‘codes’ were assigned accurately?

      2. Are statistical analyses, controls, sampling mechanism, and statistical reporting (e.g., P-values, CIs, effect sizes) appropriate and well described?

        This is a descriptive study (and that’s fine) so there aren’t really any statistics on show here other than simple ‘counts’ (of Schools of Thought). There are probably some statistical processes going on within the proprietary qualitative analysis of text done in ATLAS.ti but it is under-described and so hard for me to evaluate.

      Results

      1. Is the results presentation, including the number of tables and figures, appropriate to best present the study findings?

        Yes. However, I think a canonical URL to each service should be given.  A URL is very useful for disambiguation, to confirm e.g. that the authors mean this Hypothesis (www.hypothes.is) and NOT this Hypothesis (www.hyp.io). I know exactly which Hypothesis is the one the authors are referring to but we cannot assume all readers are experts 😊

        Optional suggestion: I wonder if the authors couldn’t present the table data in a slightly more visual and/or compact way? It’s not very visually appealing in its current state. Purely as an optional suggestion, to make the table more compact one could recode the answers given in one or more of the columns 2, 3 and 4 in the table e.g. "all disciplines = ⬤ , biomedical and life sciences = ▲, social sciences =  ‡  , engineering and technology = † ". I note this would give more space in the table to print the URLs for each service that both reviewers have requested.

        | Service name | Developed by | Scientific disciplines | Types of outputs |
        | --- | --- | --- | --- |
        | Episciences | Other | ⬤ | blah blah blah. |
        | Faculty Opinions | Individual researcher | ▲ | blah blah blah. |
        | Red Team Market | Individual researcher | ‡ | blah blah blah. |

        The "Types of outputs" column might even lend itself to mini-colour-pictograms (?), which could be more concise and more visually appealing. A table of text alone might be scientifically 'correct', but it is incredibly dull for readers, in my opinion.

      2. Are additional sub-analyses or statistical measures needed (e.g., reporting of CIs, effect sizes, sensitivity analyses)?

        No / Not applicable. 

      Discussion

      1. Is the interpretation of results and study conclusions supported by the data and the study design?

        Yes.

      2. Have the authors clearly emphasized the limitations of their study/theory/methods/argument?

        No. Perhaps a discussion of the linguistic/comprehension bias of the authors might be appropriate for this manuscript. What if there are ‘local’ or regional Chinese, Japanese, Indonesian or Arabic language preprint review services out there? Would this authorship team really be able to find them?

      Additional points:

      • Perhaps the points made in this manuscript about financial sustainability (p24) are a little too pessimistic. I get it, there is merit to this argument, but there is also some significant investment going on there if you know where to look. Perhaps it might be worth citing some recent investments e.g. Gates -> PREreview (2024) https://content.prereview.org/prereview-welcomes-funding/ and Arcadia’s $4 million USD to COAR for the Notify Project which supports a range of preprint review communities including Peer Community In, Episciences, PREreview and Harvard Library (source: https://coar-repositories.org/news-updates/coar-welcomes-significant-funding-for-the-notify-project/).

      • Although I note they are mentioned, I think more needs to be written about the similarity and overlap between ‘overlay journals’ and preprint review services. Are these arguably not just two different terms for kinda the same thing? If you have Peer Community In, which has its overlay component in the form of the Peer Community Journal, why not mention other overlay journals like Discrete Analysis and The Open Journal of Astrophysics? I think Peer Community In (& its PCJ) is the go-to example of the thinness of the line that separates (or doesn’t!) overlay journals and preprint review services. Some more exposition on this would be useful.

    2. Thank you very much for the opportunity to review the preprint titled “Preprint review services: Disrupting the scholarly communication landscape?” (https://doi.org/10.31235/osf.io/8c6xm) The authors review services that facilitate peer review of preprints, primarily in the STEM (science, technology, engineering, and math) disciplines. They examine how these services operate and their role within the scholarly publishing ecosystem. Additionally, the authors discuss the potential benefits of these preprint peer review services, placing them in the context of tensions in the broader peer review reform movement. The discussions are organized according to four “schools of thought” in peer review reform, as outlined by Waltman et al. (2023), which provides a useful framework for analyzing the services. In terms of methodology, I believe the authors were thorough in their search for preprint review services, especially given that a systematic search might be impractical.

      As I see it, the adoption of preprints and reforming peer review are key components of the move towards improving scholarly communication and open research. This article is a useful step along that journey, taking stock of current progress, with a discussion that illuminates possible paths forward. It is also well-structured and easy for me to follow. I believe it is a valuable contribution to the metaresearch literature.

      On a high level, I believe the authors have made a reasonable case that preprint review services might make peer review more transparent and rewarding for all involved. Looking forward, I would like to see metaresearch which gathers further evidence that these benefits are truly being realised.

      In this review, I will present some general points which merit further discussion or clarification to aid an uninitiated reader. Additionally, I raise one issue regarding how the authors framed the article and categorised preprint review services and the disciplines they serve. In my view, this problem does not fundamentally undermine the robust search, analyses, and discussion in this paper, but it risks putting off some researchers and constrains how broadly one should derive conclusions.

      General comments

      Some metaresearchers may be aware of preprints, but not all readers will be familiar with them. I suggest briefly defining what they are, how they work, and which types of research have benefited from preprints, similar to how “preprint review service” is clearly defined in the introduction.

      Regarding Waltman et al.’s (2023) “Equity & Inclusion” school of thought, does it specifically aim for “balanced” representation by different groups as stated in this article? There is an important difference between “balanced” versus “equitable” representation, and I would like to see it addressed in this text.

      Another analysis I would like to see is whether any of the 23 services reviewed present any evidence that their approach has improved research quality. For instance, the discussion on peer review efficiency and incentives states that there is currently “no hard evidence” that journals want to utilise reviews by Rapid Reviews: COVID-19, and that “not all journals are receptive” to partnerships. Are journals skeptical of whether preprint review services could improve research quality? Or might another dynamic be at work?

      The authors cite Nguyen et al. (2015) and Okuzaki et al. (2019), stating that peer review is often “overloaded”. I would like to see a clearer explanation of what “overloaded” means in this context so that a reader does not have to read the two cited papers.

      To the best of my understanding, one of the major sticking points in peer review reform is whether to anonymise reviewers and/or authors. Consequently, I appreciate the comprehensive discussion about this issue by the authors.

      However, I am only partially convinced by the statement that double anonymity is “essentially incompatible” with preprint review. For example, there may be, as yet not fully explored, ways to publish anonymous preprints with (a) a notice that it has been submitted to, or is undergoing, peer review; and (b) that the authors will be revealed once peer review has been performed (e.g. at least one review has been published). This would avoid the issue of publishing only after review is concluded as is the case for Hypothesis and Peer Community In.

      Additionally, the authors describe 13 services which aim to “balance transparency and protect reviewers’ interests”. This is a laudable goal, but I am concerned that framing this as a “balance” implies a binary choice, and that to have more of one, we must lose an equal amount of the other. Thinking only in terms of “balance” prevents creative, win-win solutions. Could a case be made for non-anonymity to be complemented by a reputation system for authors and reviewers? For example, major misconduct (e.g. retribution against a critical review) would be recorded in that system and dissuade bad actors. Something similar can already be seen in the reviewer evaluation system of CrowdPeer, which could plausibly be extended or modified to highlight misconduct.

      I also note that misconduct and abusive behaviour already occur even in fully or partially anonymised peer review, and they are not limited to the review of preprints. While I am not aware of existing literature on this topic, academics’ fears seem reasonable. For example, there are at least anecdotal testimonies that a reviewer would deliberately reject a paper to retard the progress of a rival research group, while taking the ideas of that paper and beating their competitors to winning a grant. Or, a junior researcher might refrain from giving a negative review out of fear that the senior researcher whose work they are reviewing might retaliate. These fears, real or not, seem to play a part in the debates about if and how peer review should (or should not) be anonymised. I would like to see an exploration of whether de-anonymisation will improve or worsen this behaviour and in what contexts. And if such studies exist, it would be good to discuss them in this paper.

      I found it interesting that almost all preprint review services claim to be complementary to, and not to compete with, traditional journal-based peer review. The methodology described in this article cannot definitively explain what is going on, but I suspect there may be a connection between this aversion to compete with traditional journals, and (a) the skepticism of journals towards partnering with preprint review services and (b) the dearth of publisher-run options. I hypothesise that there is a power dynamic at play, where traditional publishers have a vested interest in maintaining the power they hold over scholarly communication, and that preprint review services stress their complementarity (instead of competitiveness) as a survival mechanism. This may be an avenue for further metaresearch.

      To understand preprints from which fields of research are actually present on the services categorised under “all disciplines,” I used the Random Integer Set Generator by the Random.org true random number service (https://www.random.org/integer-sets/) to select five services for closer examination: Hypothesis, Peeriodicals, PubPeer, Qeios, and Researchers One. Of those, I observed that Hypothesis is an open source web annotation service that allows commenting on and discussion of any web page on the Internet regardless of whether it is research or preprints. Hypothesis has a sub-project named TRiP (Transparent Review in Preprints), which is their preprint review service in collaboration with Cold Spring Harbor Laboratory. It is unclear to me why the authors listed Hypothesis as the service name in Table 1 (and elsewhere) instead of TRiP (or other similar sub-projects). In addition, Hypothesis seems to be framed as a generic web annotation service that is used by some as a preprint review tool. This seems fundamentally different from others who are explicitly set up as preprint review services. This difference seems noteworthy to me.

      To aid readers, I also suggest including hyperlinks to the 23 services reviewed in this paper. My comments on disciplinary representation in these services are elaborated further below.

      One minor point of curiosity is that several services use an “automated tool” to select reviewers. It would be helpful to describe in this paper exactly what those tools are and how they work, or report situations where services do not explain it.

      Lastly, what did the authors mean by “software heritage” in section 6? Are they referring to the organisation named Software Heritage (https://www.softwareheritage.org/) or something else? It is not clear to me how preprint reviews would be deposited in this context.

      Respecting disciplinary and epistemic diversity

      In the abstract and elsewhere in the article, the authors acknowledge that preprints are gaining momentum “in some fields” as a way to share “scientific” findings. After reading this article, I agree that preprint review services may disrupt publishing for research communities where preprints are in the process of being adopted or already normalised. However, I am less convinced that such disruption is occurring, or could occur, for scholarly publishing more generally.

      I am particularly concerned about the casual conflation of “research” and “scientific research” in this article. Right from the start, it mentions how preprints allow sharing “new scientific findings” in the abstract, stating they “make scientific work available rapidly.” It also notes that preprints enable “scientific work to be accessed in a timely way not only by scientists, but also…” This framing implies that all “scholarly communication,” as mentioned in the title, is synonymous with “scientific communication.” Such language excludes researchers who do not typically identify their work as “scientific” research. Another example of this conflation appears in the caption for Figure 1, which outlines potential benefits of preprint review services. Here, “users” are defined as “scientists, policymakers, journalists, and citizens in general.” But what about researchers and scholars who do not see themselves as “scientists”?

      Similarly, the authors describe the 23 preprint review services using six categories, one of which is “scientific discipline”. One of those disciplines is called “humanities” in the text, and Table 1 lists it as a discipline for Science Open Reviewed. Do the authors consider “humanities” to be a “scientific” discipline? If so, I think that needs to be justified with very strong evidence.

      Additionally, Waltman et al.’s four schools of thought for peer review reform work well with the 23 services analysed. However, at least three out of the four are explicitly described as improving “scientific” research.

      Related to the above is how the five “scientific disciplines” are described as the “usual organisation” of the scholarly communication landscape. On what basis should they be considered “usual”? In this formulation, research in literature, history, music, philosophy, and many other subjects would all be lumped together into the “humanities”, which sit at the same hierarchical level as “biomedical and life sciences”, arguably a much more specific discipline. My point is not to argue for a specific organisation of research disciplines, but to highlight a key epistemic assumption underlying the whole paper that comes across as very STEM-centric (science, technology, engineering, and math).

      How might this part of the methodology affect the categories presented in Table 1? “Biomedical and life sciences” appear to be overrepresented compared to other “disciplines”. I’d like to see a discussion that examines this pattern, and considers why preprint review services (or maybe even preprints more generally) appear to cover mostly the biomedical or physical sciences.

      In addition, there are 12 services described as serving “all disciplines”. I believe this paper can be improved by at least a qualitative assessment of the diversity of disciplines actually represented on those services. Because it is reported that many of these services stress improving the “reproducibility” of research, I suspect most of them serve disciplines which rely on experimental science.

      I randomly selected five services for closer examination, as mentioned above. Of those, only Qeios has demonstrated an attempt to at least split “arts and humanities” into subfields. The others either don’t have such categories altogether, or have a clear focus on a few disciplines (e.g. life sciences for Hypothesis/TRiP). In all cases I studied, there is a heavy focus on STEM subjects, especially biology or medical research. However, they are all categorised by the authors as serving “all disciplines”.

      If preprint review services originate from, or mostly serve, a narrow range of STEM disciplines (especially experiment-based ones), it would be worth examining why that is the case, and whether preprints and reviews of them could (or could not) serve other disciplines and epistemologies.

      It is postulated that preprint review services might “disrupt the scholarly communication landscape in a more radical way”. Considering the problematic language I observed, what about fields of research where peer-reviewed journal publications are not the primary form of communication? Would preprint review services disrupt their scholarly communications?

      To be clear, my concern is not just the conflation of language in a linguistic sense but rather inequitable epistemic power. I worry that this conflation would (a) exclude, minoritise, and alienate researchers of diverse disciplines from engaging with metaresearch; and (b) blind us to a clear pattern in these 23 services, namely their strong focus on the life sciences and medical research, and to a discussion of why that might be the case. Critically, what message are we sending to, for example, a researcher of 18th century French poetry with the language and framing of this paper? I believe the way “disciplines” are currently presented here poses a real risk of devaluing and minoritising certain subject areas and ways of knowing. In its current form, I believe that while this paper is a very valuable contribution, one should not derive from it any conclusions which apply to scholarly publishing as a whole.

      The authors have demonstrated inclusive language elsewhere. For example, they have consciously avoided “peer” when discussing preprint review services, clearly contrasting them to “journal-based peer review”. Therefore, I respectfully suggest that similar sensitivity be adopted to avoid treating “scientific research” and “research” as the same thing. A discussion, or reference to existing works, on the disciplinary skew of preprints (and reviews of them) would also add to the intellectual rigour of this already excellent piece.

      Overall, I believe this paper is a valuable reflection on the state of preprints and services which review them. Addressing the points I raised, especially the use of more inclusive language with regard to disciplinary diversity, would further elevate its usefulness in the metaresearch discourse. Thank you again for the chance to review.

      Signed:

      Dr Pen-Yuan Hsing (ORCID ID: 0000-0002-5394-879X)

      University of Bristol, United Kingdom

      Data availability

      I have checked the associated dataset, but still suggest including hyperlinks to the 23 services analysed in the main text of this paper.

    1. In "Researchers Are Willing to Trade Their Results for Journal Prestige: Results from a Discrete Choice Experiment", the authors investigate researchers’ publication preferences using a discrete choice experiment in a cross-sectional survey of international health and medical researchers. The study examines how researchers negotiate trade-offs among factors such as journal impact factor, review helpfulness, formatting requirements, and usefulness for promotion when deciding where to publish. The research is timely; as the authors point out, reform of research assessment is currently a very active topic. The design and methods of the study are suitable and robust. The use of focus groups and interviews in developing the attributes for study shows care in the design. The survey instrument itself is generally very well-designed, with important tests of survey fatigue, understanding (dominant choice task) and respondent choice consistency (repeat choice task) included. Respondent performance was good or excellent across all these checks. Analysis methods (pMMNL and latent class analysis) are well-suited to the task. Pre-registration and sharing of data and code show commitment to transparency. Limitations are generally well-described.

      In the below, I give suggestions for clarification/improvement. Except for some clarifications on limitations and one narrower point (reporting of qualitative data analysis methods), my suggestions are only that – the preprint could otherwise stand, as is, as a very robust and interesting piece of scientific work.

      1. Respondents come from a broad range of countries (63), with 47 of those countries represented by fewer than 10 respondents. Institutional cultures of evaluation can differ greatly across nations. And we can expect variability in exposure to the messages of DORA (seen, for example, in level of permeation of DORA as measured by signatories in each country, https://sfdora.org/signers/). In addition, some contexts may mandate or incentivise publication in some venues (using measures including IF, but also requiring journals to be in certain databases like WoS or Scopus, or having preferred journal lists). I would suggest the authors include in the Sampling section a rationale for taking this international approach, including any potentially confounding factors it may introduce, and then add the latter to the limitations as well.

      2. Reporting of qualitative results: In the introduction and methods, the role of the focus groups and interviews seems to have been just to inform the design of the experiment. But results from that qualitative work then appear as direct quotes within the discussion to contextualise or explain results. In this sense, the qualitative results are being used as new data. Given this, I feel that the methods section should include a description of the methods and tools used for qualitative data analysis (currently it does not). But in addition, to my understanding (and this may be a question of disciplinary norms – I’m not a health/medicine researcher), generally new data should not be introduced in the discussion section of a research paper. Rather the discussion is meant to interpret, analyse, and provide context for the results that have already been presented. I personally hence feel that the paper would benefit from the qualitative results being reported separately within the results section.

      3. Impact factors – Discussion section: While there is interesting new information on the relative trade-offs amongst other factors, the most emphasised finding, that impact factors still play a prominent role in publication venue decisions, is hardly surprising. More could perhaps be done to compare how the levels of importance reported here differ with previous results from other disciplines or over time (I know a like-for-like comparison is difficult but other studies have investigated these themes, e.g., https://doi.org/10.1177/01655515209585). In addition, beyond the question of whether impact factors are important, a more interesting question in my view is why they still persist. What are they used for and why are they still such important “driver[s] of researchers’ behaviour”? This was not the authors’ question, and they do provide some contextualisation by quoting their participants, but still I think they could do more to contextualise what is known from the literature on that to draw out the implications here. The attribute label in the methods for IF is “ranking”, but ranking according of what and for what? Not just average per-article citations in a journal over a given time frame. Rather, impact factors are used as a proxy indicators of less-tangible desirable qualities – certainly prestige (as the title of this article suggests), but also quality, trust (as reported by one quoted focus group member “I would never select a journal without an impact factor as I always publish in journals that I know and can trust that are not predatory”, p.6), journal visibility, importance to the field, or improved chances of downstream citations or uptake in news media/policy/industry etc. Picking apart the interactions of these various factors in researchers’ choices to make use of IFs (which is not in all cases bogus or unjustified) could add valuable context. 
I’d especially recommend engaging at least briefly with more work from Science and Technology Studies - especially Müller and de Rijcke’s excellent Thinking with Indicators study (doi: 10.1093/reseval/rvx023), but also those authors’ other work, as well as work from Ulrike Felt, Alex Rushforth (esp https://doi.org/10.1007/s11024-015-9274-5), Björn Hammarfelt and others.

      4. Disciplinary coverage: (1) A lot of the STS work I talk about above emphasises epistemic diversity and the ways cultures of indicator use differ across disciplinary traditions. For this reason, I think it should be pointed out in the limitations that this is research in Health/Med only, with questions on generalisability to other fields. (2) Also, although the abstract and body of the article do make clear the disciplinary focus, the title does not. Hence, I believe the title should be slightly amended (e.g., “Health and Medical Researchers Are Willing to Trade …”)

    1. when we are immersed in something, surrounded by it the way we are by images from the media, we may come to accept them as just part of the real and natural world.

      We’re constantly surrounded by media images so it’s easy to take them for granted. I think Hall is saying for us to take a step back and think critically about what they show and why.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):  

      Summary:  

      The authors state the study's goal clearly: "The goal of our study was to understand to what extent animal individuality is influenced by situational changes in the environment, i.e., how much of an animal's individuality remains after one or more environmental features change." They use visually guided behavioral features to examine the extent of correlation over time and in a variety of contexts. They develop new behavioral instrumentation and software to measure behavior in Buridan's paradigm (and variations thereof), the Y-maze, and a flight simulator. Using these assays, they examine the correlations between conditions for a panel of locomotion parameters. They propose that inter-assay correlations will determine the persistence of locomotion individuality.

      Strengths:  

      The OED defines individuality as "the sum of the attributes which distinguish a person or thing from others of the same kind," a definition mirrored by other dictionaries and the scientific literature on the topic. The concept of behavioral individuality can be characterized as: (1) a large set of behavioral attributes, (2) with inter-individual variability, that are (3) stable over time. A previous study examined walking parameters in Buridan's paradigm, finding that several parameters were variable between individuals, and that these showed stability over separate days and up to 4 weeks (DOI: 10.1126/science.aaw718). The present study replicates some of those findings and extends the experiments from temporal stability to examining correlation of locomotion features between different contexts.  

      The major strength of the study is using a range of different behavioral assays to examine the correlations of several different behavior parameters. It shows clearly that the inter-individual variability of some parameters is at least partially preserved between some contexts, and not preserved between others. The development of high-throughput behavior assays and sharing the information on how to make the assays is a commendable contribution.

      Weaknesses:  

      The definition of individuality considers a comprehensive or large set of attributes, but the authors consider only a handful. In Supplemental Fig. S8, the authors show a large correlation matrix of many behavioral parameters, but these are illegible and are only mentioned briefly in Results. Why were five or so parameters selected from the full set? How were these selected? Do the correlation trends hold true across all parameters? For assays in which only a subset of parameters can be directly compared, were all of these included in the analysis, or only a subset?  

      The correlation analysis is used to establish stability between assays. For temporal re-testing, "stability" is certainly the appropriate word, but between contexts it implies that there could be 'instability'. Rather, instead of the 'instability' of a single brain process, a different behavior in a different context could arise from engaging largely (or entirely?) distinct context-dependent internal processes, and have nothing to do with process stability per se. For inter-context similarities, perhaps a better word would be "consistency".  

      The parameters are considered one-by-one, not in aggregate. This focuses on the stability/consistency of the variability of a single parameter at a time, rather than holistic individuality. It would appear that an appropriate measure of individuality stability (or individuality consistency) that accounts for the high-dimensional nature of individuality would somehow summarize correlations across all parameters. Why was a multivariate approach (e.g. multiple regression/correlation) not used? Treating the data with a multivariate or averaged approach would allow the authors to directly address 'individuality stability', along with the analyses of single-parameter variability stability.

      The correlation coefficients are sometimes quite low, though highly significant, and are deemed to indicate stability. For example, in Figure 4C top left, the % of time walked at 23°C and 32°C are correlated by 0.263, which corresponds to an R² of 0.069, i.e. just 7% of the 32°C variance is predictable by the 23°C variance. Is it fair to say that 7% determination indicates parameter stability? Another example: "Vector strength was the most correlated attention parameter... correlations ranged... to -0.197," which implies that 96% (1 - R²) of Y-maze variance is not predicted by Buridan variance. At what level does an r value not represent stability?
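
      The arithmetic behind these percentages can be checked with a minimal Python sketch. Only the two r values come from the text above; the function name and the dictionary labels are illustrative assumptions, not anything taken from the manuscript.

```python
# For a simple two-variable correlation, the coefficient of
# determination is the square of Pearson's r.
def variance_explained(r: float) -> float:
    """Share of variance in one variable predictable from the other."""
    return r * r


# Correlation values quoted in the review (labels are illustrative).
quoted = {
    "% time walked, 23 vs 32 degrees C": 0.263,
    "vector strength, Buridan vs Y-maze": -0.197,
}

for label, r in quoted.items():
    r2 = variance_explained(r)
    print(f"{label}: r = {r:+.3f}, R^2 = {r2:.3f}, unexplained = {1 - r2:.1%}")
```

      The sign of r drops out when squaring, so a correlation of -0.197 explains no more variance than one of +0.197.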

      The authors describe a dissociation between inter-group differences and inter-individual variation stability, i.e. sometimes large mean differences between contexts, but significant correlation between individual test and retest data. Given that correlation is sensitive to slope, this might be expected to underestimate the variability stability (or consistency). Is there a way to adjust for the group differences before examining correlation? For example, would it be possible to transform the values to in-group ranks prior to correlation analysis?
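
      The rank transformation suggested here could look like the following sketch. It ranks each condition separately and then correlates the ranks, which is equivalent to Spearman's rank correlation. The data and all function names are hypothetical. One caveat worth stating: Pearson's r is already unchanged by a shift or linear rescaling of one group, so ranking mainly helps when the change between contexts is monotone but nonlinear, as in the made-up numbers below.

```python
def average_ranks(values):
    """Rank values 1..n within one group; ties share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def rank_correlation(test, retest):
    """Correlate within-condition ranks (Spearman's rho)."""
    return pearson(average_ranks(test), average_ranks(retest))


# Hypothetical walking speeds of the same five flies in two conditions:
# a large mean shift and a nonlinear stretch, but identical rank order.
cool = [1.0, 2.0, 3.0, 4.0, 5.0]
warm = [11.0, 12.5, 13.0, 19.0, 40.0]

print("Pearson on raw values:", round(pearson(cool, warm), 3))
print("Correlation of in-group ranks:", round(rank_correlation(cool, warm), 3))
```

      In this toy case the rank order is perfectly preserved, so the rank-based coefficient is higher than the raw Pearson value even though the group means differ strongly.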

      What is gained by classifying the five parameters into exploration, attention, and anxiety? To what extent have these classifications been validated, both in general, and with regard to these specific parameters? Is increased walking speed at higher temperature necessarily due to increased 'explorative' nature, or could it be attributed to increased metabolism, dehydration stress, or a heat-pain response? To what extent are these categories subjective?

      The legends are quite brief and do not link to descriptions of specific experiments. For example, Figure 4a depicts a graphical overview of the procedure, but I could not find a detailed description of this experiment's protocol.

      Using the current single-correlation analysis approach, the aims would benefit from re-wording to appropriately address single-parameter variability stability/consistency (as distinct from holistic individuality). Alternatively, the analysis could be adjusted to address the multivariate nature of individuality, so that the claims and the analysis are in concordance with each other.

      The study presents a bounty of new technology to study visually guided behaviors. The Github link to the software was not available. To verify successful transfer of open-hardware and open-software, a report would demonstrate transfer by collaboration with one or more other laboratories, which the present manuscript does not appear to do. Nevertheless, making the technology available to readers is commendable.

      The study discusses a number of interesting, stimulating ideas about interindividual variability and presents intriguing data that speaks to those ideas, albeit with the issues outlined above.

      While the current work does not present any mechanistic analysis of interindividual variability, the implementation of high-throughput assays sets up the field to more systematically investigate fly visual behaviors, their variability, and their underlying mechanisms.  

      Comments on revisions:  

      I want to express my appreciation for the authors' responsiveness to the reviewer feedback. They appear to have addressed my previous concerns through various modifications, including GLM analysis; however, some areas still require clarification for the benefit of an audience that includes geneticists.

      (1) GLM Analysis Explanation (Figure 9)  

      While the authors state that their new GLM results support their original conclusions, the explanation of these results in the text is insufficient. Specifically:

      The interpretation of coefficients and their statistical significance needs more detailed explanation. The audience includes geneticists and other nonstatistical people, so the GLM should be explained in terms of the criteria or quantities used to assess how well the results conform with the hypothesis, and to what extent they diverge.

      The criteria used to judge how well the GLM results support their hypothesis are not clearly stated.

      The relationship between the GLM findings and their original correlation-based conclusions needs better integration and connection, leading the reader through your reasoning.

      We thank the reviewer for highlighting this important point. We have revised the Results section in the revised manuscript to include a more detailed explanation of the GLM analysis. Specifically, we now clarify the interpretation of the model coefficients, including the direction and statistical significance, in relation to the hypothesized effects. We also outline the criteria we used to assess how well the GLM supports our original correlation-based conclusions, namely whether the sign and significance of the coefficients align with the expected relationships derived from our prior analysis. Finally, we explicitly describe how the GLM results confirm or extend the patterns observed in the correlation-based analysis, to guide readers through our reasoning and the integration of both approaches.

      (2) Documentation of Changes  

      One struggle with the revised manuscript is that no "tracked changes" version was included, so it is hard to know exactly what was done. Without access to the previous version of the manuscript, it is difficult to fully assess the extent of revisions made. The authors should provide a more comprehensive summary of the specific changes implemented, particularly regarding:

      We thank the reviewer for bringing this to our attention. We were equally confused to learn that the tracked-changes version was not visible, despite having submitted one to eLife as part of our revision. 

      Upon contacting the editorial office, they confirmed that we did submit a tracked-changes version, but clarified that it did not contain embedded figures (as they were added manually to the clean version). The editorial response said in detail: “Regarding the tracked-changes file: it appears the version with markup lacked figures, while the figure-complete PDF had markup removed, which likely caused the confusion mentioned by the reviewers.” We hope this answer from eLife clarifies the reviewers’ concern.

      (3) Statistical Method Selection  

      The authors mention using "ridge regression to mitigate collinearity among predictors" but do not adequately justify this choice over other approaches. They should explain:

      Why ridge regression was selected as the optimal method  

      How the regularization parameter (λ) was determined  

      How this choice affects the interpretation of environmental parameters' influence on individuality

      We appreciate the reviewer’s thoughtful question regarding our choice of statistical method. In response, we have expanded the Methods section in the revised manuscript to provide a more detailed justification for the use of a GLM, including ridge regression. Specifically, we explain that ridge regression was selected to address collinearity and to control for overfitting.

      We now also describe how the regularization parameter (λ) was selected: we used 5-fold cross-validation over a log-spaced grid (10⁻⁶ to 10⁶) to identify the optimal value that minimized the mean squared error (MSE).

      Finally, we clarify in both the Methods and Results sections how this modeling choice affects the interpretation of our findings. 
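
      For readers unfamiliar with this procedure, the λ selection described above (5-fold cross-validation over a log-spaced grid from 10⁻⁶ to 10⁶, minimizing MSE) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the simulated data, the three 0/1 predictors, the one-point-per-decade grid spacing, the unpenalized intercept, and all function names are assumptions made here.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ridge_fit(X, y, lam):
    """Return (intercept, coefficients) minimizing SSE + lam * ||beta||^2.
    X and y are centred first so the intercept is not penalized."""
    n, p = len(X), len(X[0])
    xm = [sum(row[j] for row in X) / n for j in range(p)]
    ym = sum(y) / n
    Xc = [[row[j] - xm[j] for j in range(p)] for row in X]
    yc = [v - ym for v in y]
    gram = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * v for r, v in zip(Xc, yc)) for i in range(p)]
    beta = solve(gram, rhs)
    return ym - sum(b * m for b, m in zip(beta, xm)), beta


def cv_mse(X, y, lam, k=5, seed=0):
    """Mean squared prediction error over k held-out folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)  # same folds for every lambda
    sq_err = 0.0
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        b0, beta = ridge_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i in fold:
            pred = b0 + sum(b * x for b, x in zip(beta, X[i]))
            sq_err += (y[i] - pred) ** 2
    return sq_err / len(y)


def select_lambda(X, y, grid=None, k=5):
    """Pick the grid value with the lowest cross-validated MSE."""
    grid = grid or [10.0 ** e for e in range(-6, 7)]  # 1e-6 ... 1e6
    return min(grid, key=lambda lam: cv_mse(X, y, lam, k))


# Simulated example: three hypothetical 0/1 predictors and a noisy response.
rng = random.Random(1)
X = [[float(rng.random() < 0.5) for _ in range(3)] for _ in range(60)]
y = [1.0 * row[0] + 3.0 * row[1] + rng.gauss(0, 1) for row in X]
best = select_lambda(X, y)
print("selected lambda:", best)
```

      Because the penalty shrinks coefficients toward zero, coefficient magnitudes from such a fit depend on the selected λ, which is presumably why the reviewer asked how the choice affects interpretation.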

      Reviewer #2 (Public review):  

      Summary:  

      The authors repeatedly measured the behavior of individual flies across several environmental situations in custom-made behavioral phenotyping rigs.

      Strengths:  

      The study uses several different behavioral phenotyping devices to quantify individual behavior in a number of different situations and over time. It seems to be a very impressive amount of data. The authors also make all their behavioral phenotyping rig design and tracking software available, which I think is great, and I'm sure other folks will be interested in using and adapting to their own needs.

      Weaknesses/Limitations:  

      I think an important limitation is that while the authors measured the flies under different environmental scenarios (i.e. with different lighting, temperature) they didn't really alter the "context" of the environment. At least within behavioral ecology, context would refer to the potential functionality of the expressed behaviors so for example, an anti-predator context, or a mating context, or foraging. Here, the authors seem to really just be measuring aspects of locomotion under benign (relatively low risk perception) contexts. This is not a flaw of the study, but rather a limitation to how strongly the authors can really say that this demonstrates that individuality is generalized across many different contexts. It's quite possible that rank-order of locomotor (or other) behaviors may shift when the flies are in a mating or risky context.  

      I think the authors are missing an opportunity to use much more robust statistical methods. It appears as though the authors used Pearson correlations across time/situations to estimate individual variation; however, far more sophisticated and elegant methods exist. The problem is that Pearson correlation coefficients can be anti-conservative and, additionally, the authors have thus had to perform many, many tests to correlate behaviors across the different trials/scenarios. I don't see any evidence that the authors are controlling for multiple testing, which I think would also help. Alternatively, though, the paper would be a lot stronger and, my guess is, much more streamlined if the authors employed hierarchical mixed models to analyse these data, which are the standard analytical tools in the study of individual behavioral variation. In this way, the authors could partition the behavioral variance into its among- and within-individual components and quantify repeatability of different behaviors across trials/scenarios simultaneously. This would remove the need to estimate 3 different correlations for day 1 & day 2, day 1 & 3, day 2 & 3 (or stripe 0 & stripe 1, etc.) and instead just report a single repeatability for e.g. the time spent walking among the different stripe patterns (e.g. Figure 3). Additionally, the authors could then use multivariate models where the response variables are all the behaviors combined and the authors could estimate the among-individual covariance in these behaviors. I see that the authors state they include generalized linear mixed models in their updated MS, but I struggled a bit to understand exactly how these models were fit. What exactly was the response? What exactly were the predictors? (I just don't understand what Line404 means: "a GLM was trained using the environmental parameters as predictors (0 when the parameter was not changed, 1 if it was) and the resulting individual rank differences as the response".)
So were different models run for each scenario? for different behaviors? Across scenarios? What exactly? I just harp on this because I'm actually really interested in these data and think that updating these methods can really help clarify the results and make the main messages much clearer!
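
      To illustrate what "a single repeatability" means, here is a minimal sketch of variance partitioning. Note that it is not a fitted mixed model: it uses the simpler one-way ANOVA estimator of repeatability, which serves the same purpose for a balanced design with one random effect (individual). The fly labels, the measurements, and the function name are hypothetical.

```python
def repeatability(measurements):
    """ANOVA-based repeatability R = V_among / (V_among + V_within).

    `measurements` maps each individual to its repeated measures.
    Assumes a balanced design (same number of repeats per individual).
    """
    groups = list(measurements.values())
    n_ind = len(groups)
    k = len(groups[0])
    if any(len(g) != k for g in groups):
        raise ValueError("balanced design required for this estimator")

    grand = sum(sum(g) for g in groups) / (n_ind * k)
    means = [sum(g) / k for g in groups]

    # Mean squares among and within individuals.
    ms_among = k * sum((m - grand) ** 2 for m in means) / (n_ind - 1)
    ms_within = sum(
        (v - m) ** 2 for g, m in zip(groups, means) for v in g
    ) / (n_ind * (k - 1))

    # Among-individual variance component, truncated at zero.
    v_among = max((ms_among - ms_within) / k, 0.0)
    return v_among / (v_among + ms_within)


# Hypothetical % time walking for three flies measured on three days.
flies = {
    "fly1": [10.0, 11.0, 12.0],
    "fly2": [20.0, 21.0, 19.0],
    "fly3": [30.0, 29.0, 31.0],
}
print("repeatability:", round(repeatability(flies), 3))
```

      One number summarizes all three days at once, in place of the three pairwise day-to-day correlations. A full mixed model would additionally allow fixed effects for situation and unbalanced data.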

      I appreciate that the authors now included their sample sizes in the main body of text (as opposed to the supplement) but I think that it would still help if the authors included a brief overview of their design at the start of the methods. It is still unclear to me how many rigs each individual fly was run through? Were the same individuals measured in multiple different rigs/scenarios? Or just one?

      I really think a variance partitioning modeling framework could certainly improve their statistical inference and likely highlight some other cool patterns as these methods could better estimate stability and covariance in individual intercepts (and potentially slopes) across time and situation. I also genuinely think that this will improve the impact and reach of this paper as they'll be using methods that are standard in the study of individual behavioral variation.

      Reviewer #3 (Public review):  

      This manuscript is a continuation of past work by the last author where they looked at stochasticity in developmental processes leading to inter-individual behavioural differences. In that work, the focus was on a specific behaviour under specific conditions while probing the neural basis of the variability. In this work, the authors set out to describe in detail how stable individuality of animal behaviours is in the context of various external and internal influences. They identify a few behaviours to monitor (read outs of attention, exploration, and 'anxiety'); some external stimuli (temperature, contrast, nature of visual cues, and spatial environment); and two internal states (walking and flying).

      They then use high-throughput behavioural arenas - most of which they have built and made plans available for others to replicate - to quantify and compare combinations of these behaviours, stimuli, and internal states. This detailed analysis reveals that:

      (1) Many individualistic behaviours remain stable over the course of many days.  

      (2) That some of these (walking speed) remain stable over changing visual cues. Others (walking speed and centrophobicity) remain stable at different temperatures.

      (3) All the behaviours they tested fail to remain stable over spatially varying environment (arena shape).

      (4) and only angular velocity (a read out of attention) remains stable across varying internal states (walking and flying)

      Thus, the authors conclude that there is a hierarchy in the influence of external stimuli and internal states on the stability of individual behaviours.

      The manuscript is a technical feat with the authors having built many new high-throughput assays. The number of animals is large and many variables have been tested - different types of behavioural paradigms, flying vs walking, varying visual stimuli, different temperature among others.  

      Comments on revisions:  

      The authors have addressed my previous concerns.  

      We thank the reviewer for the positive feedback and are glad our revisions have satisfactorily addressed the previous concerns. We appreciate the thoughtful input that helped us improve the clarity and rigor of the manuscript.

      Reviewer #1 (Recommendations for the authors):  

      Comment on Revised Manuscript  

      Recommendations for Improvement  

      (1) Expand the Results section for Figure 9 with a more detailed interpretation of the GLM coefficients and their biological significance

      (2) Provide explicit criteria (or at least explain in detail) for how the GLM results confirm or undermine their original hypothesis about environmental context hierarchy

      While the claims are interesting, the additional statistical analysis appears promising. However, clearer explanation of these new results would strengthen the paper and ensure that readers from diverse backgrounds can fully understand how the evidence supports the authors' conclusions about individuality across environmental contexts. 

      We thank the reviewer for these constructive suggestions. In response to these suggestions, we have expanded both the Methods and Results sections to provide a more detailed explanation of the GLM coefficients, including their interpretation and how they relate to our original correlation-based findings.

      We now clarify how the direction, magnitude, and statistical significance of specific coefficients reflect the influence of different environmental factors on the persistence of individual behavioral traits. To make this accessible to readers from diverse backgrounds, we explicitly outline the criteria we used to evaluate whether the GLM results support our hypothesis about the hierarchical influence of environmental context, namely, whether the structure and strength of effects align with the patterns predicted from our prior correlation analysis.

      These additions improve clarity and help readers understand how the new statistical results reinforce our conclusions about the context-dependence of behavioral individuality.

      Reviewer #2 (Recommendations for the authors):  

      Thanks for the revision of the paper! I updated my review to try and provide a little more guidance by what I mean about updating your analyses. I really think this is a super cool data set and I genuinely wish this were MY dataset so that way I could really dig into it to partition the variance. These variance partitioning methods are standard in my particular subfield (study of individual behavioral variation in ecology and evolution) and so I think employing them is 1) going to offer a MUCH more elegant and holistic view of the behavioral variation (e.g. you can report a single repeatability estimate for each behavior rather than 3 different correlations) and 2) improve the impact and readership for your paper as now you'll be using methods that a whole community of researchers are very familiar with. It's just a suggestion, but I hope you consider it!

      We sincerely thank the reviewer for the insightful and encouraging feedback and for introducing us to this modeling approach. In response to this suggestion, we have incorporated a hierarchical linear mixed-effects model into our analysis (now presented in Figure 10), accompanied by a new supplementary table (Table T3). We also updated the Methods, Results, and Discussion sections to describe the rationale, implementation, and implications of the mixed-model analysis.

      We agree with the reviewer that this approach provides a more elegant way to quantify behavioral variation and individual consistency across contexts. In particular, the ability to estimate repeatability directly aligns well with the core questions of our study. It facilitates improved communication of our findings to ecology, evolution, and behavior researchers. We greatly appreciate the suggestion; it has significantly strengthened both the analytical framework and the interpretability of the manuscript.

    1. Today, teachers are continually faced with the challenge of effectively reaching out to their classroom of students who span the spectrum of learning readiness, personal interests, skills, knowledge, and perspective. We know that not all students are alike.

      This is why I think it's important to survey your students at the beginning of the year in order to learn about their interests. Gaining insight into how your students learn best can help you, as the teacher, vary your teaching methods. Yes, habits can be good so students can know what the expectations are, but offering different sources, different instructional strategies, and diversifying your classroom layout can cover a wide range of learners. Keeping in mind that people learn through all of the major senses can truly help students retain information. For example, I am an auditory learner. I have to read aloud or talk things out. That's why I have to read things a few times to really grasp the material when it's a quiet setting. Therefore, timed tests really get my anxiety levels up. Not everyone has this problem or even recognizes it. Being an auditory learner may be great in the college setting during lectures, but it becomes very difficult in the test setting when everything is quiet. How could a teacher make an adjustment in the test setting for my scenario?

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      This work addresses an important question in the field of Drosophila aggression and mating. Prior social isolation is known to increase aggression in males, manifesting as increased lunging, which is suppressed by group housing (GH). However, it is also known that single housed (SH) males, despite their higher attempts to court females, are less successful. Here, Gao et al., develop a modified aggression assay to address this issue by recording aggression in Drosophila males for 2 hours, with a virgin female immobilized by burying its head in the food. They found that while SH males frequently lunge in this assay, GH males switch to higher intensity but very low frequency tussling. Constitutive neuronal silencing and activation experiments implicate cVA sensing Or67d neurons in promoting high frequency lunging, similar to earlier studies, whereas Or47b neurons promote low frequency but higher intensity tussling. Optogenetic activation revealed that three pairs of pC1SS2 neurons increase tussling. Cell-type-specific DsxM manipulations combined with morphological analysis of pC1SS2 neurons and side-by-side tussling quantification link the developmental role of DsxM to the functional output of these aggression-promoting cells. In contrast, although optogenetic activation of P1a neurons in the dark did not increase tussling, thermogenetic activation under visible light drove aggressive tussling. Using a further modified aggression assay, GH males exhibit increased tussling and maintain territorial control, which could contribute to a mating advantage over SH males, although direct measures of reproductive success are still needed.

      Strengths:

      Through a series of clever neurogenetic and behavioral approaches, the authors implicate specific subsets of ORNs and pC1 neurons in promoting distinct forms of aggressive behavior, particularly tussling. They have devised a refined territorial control paradigm, which appears more robust than earlier assays using a food cup (Chen et al., 2002). This new setup is relatively clutter-free and could be amenable to future automation using computer vision approaches. The updated Figure 5, which combines cell-type-specific developmental manipulation of pC1SS2 neurons with behavioral output, provides a link between developmental mechanisms and functional aggression circuits. The manuscript is generally well written, and the claims are largely supported by the data.

      Thank you for the precise summary of the manuscript and acknowledgment of the novelty and significance of the study.

      Weakness:

      Although most concerns have been addressed, the manuscript still lacks a rigorous, objective method for quantifying lunging and tussling. Because scoring appears to have been done manually and a single lunge in a 30 fps video spans only 2-3 frames, the 0.2 s cutoff seems arbitrary, and there are no objective criteria distinguishing reciprocal lunging from tussling. Despite this, the study offers valuable insights into the neural and behavioral mechanisms of Drosophila aggression.

      Thank you for this comment. The duration of each lunge was measured by analyzing the videos frame by frame—from the frame before the initiation of the lunge to the frame after its completion—resulting in an average span of 3–5 frames. Given a frame rate of 30 fps, this corresponds to approximately 0.1–0.17 seconds. We acknowledge that there are certain limitations for manually quantifying the two types of aggressive behaviors, which has now been stated in the newly added “Limitations of the Study” section in the revised manuscript.
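
      The frame-to-duration conversion in this exchange is simple enough to verify with a short sketch. The 30 fps frame rate, the frame counts, and the 0.2 s cutoff come from the text above; the function names are illustrative.

```python
def frames_to_seconds(n_frames: int, fps: float = 30.0) -> float:
    """Duration spanned by n_frames at the given frame rate."""
    return n_frames / fps


def seconds_to_frames(seconds: float, fps: float = 30.0) -> int:
    """Nearest whole number of frames for a duration."""
    return round(seconds * fps)


for n in (2, 3, 5):
    print(f"{n} frames at 30 fps = {frames_to_seconds(n):.3f} s")

print("0.2 s cutoff =", seconds_to_frames(0.2), "frames")
```

      At 30 fps each frame covers about 33 ms, so the 0.2 s cutoff corresponds to 6 frames, one more than the longest lunge span reported by the authors.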

      Reviewer #2 (Public review):

      Summary:

      Gao et al. investigated the change of aggression strategies by the social experience and its biological significance by using Drosophila. Two modes of inter-male aggression in Drosophila are known: lunging, high-frequency but weak mode, and tussling, low-frequency but more vigorous mode. Previous studies have mainly focused on the lunging. In this paper, the authors developed a new behavioral experiment system for observing tussling behavior and found that tussling is enhanced by group rearing, while lunging is suppressed. They then searched for neurons involved in the generation of tussling. Although olfactory receptors named Or67d and Or65a have previously been reported to function in the control of lunging, the authors found that these neurons do not function in the execution of tussling and another olfactory receptor, Or47b, is required for tussling, as shown by the inhibition of neuronal activity and the gene knockdown experiments. Further optogenetic experiments identified a small number of central neurons pC1[SS2] that induce the tussling specifically. These neurons express doublesex (dsx), a sex-determination factor, and knockdown of dsx strongly suppresses the induction of tussling. In order to further explore the ecological significance of the aggression mode change in group-rearing, a new behavioral experiment was performed to examine the territorial control and the mating competition. And finally, the authors found that differences in the social experience (group vs. solitary rearing) and the associated change in aggression strategy are important in these biologically significant competitions. These results add a new perspective to the study of aggression behavior in Drosophila. Furthermore, this study proposes an interesting general model in which the social experience modified behavioral changes play a role in reproductive success.

      Strengths:

      A behavioral experiment system that allows stable observation of tussling, which could not be easily analyzed due to its low-frequency, would be very useful. The experimental setup itself is relatively simple, just the addition of a female to the platform, so it should be applicable to future research. The finding about the relationship between the social experience and the aggression mode change is quite novel. Although the intensity of aggression changes with the social experience was already reported in several papers (Liu et al., 2011 etc), the fact that the behavioral mode itself changes significantly has rarely been addressed, and is extremely interesting. The identification of sensory and central neurons required for the tussling makes appropriate use of the genetic tools and the results are clear. A major strength of this study in neurobiology is the finding that another group of neurons (Or47b-expressing olfactory neurons and pC1[SS2] neurons), distinct from the group of neurons previously thought to be involved in low-intensity aggression (i.e. lunging), function in the tussling behavior. Furthermore, the results showing that the regulation of aggression by pC1[SS2] neurons is based on the function of the dsx gene will bring a new perspective to the field. Further investigation of the detailed circuit analysis is expected to elucidate the neural substrate of the conflict between the two aggression modes. The experimental systems examining the territory control and the reproductive competition in Fig. 6 are novel and have advantages in exploring their biological significance. It is important to note that in addition to showing the effects of age and social experience on territorial and mating behaviors, the authors experimentally demonstrated that altered fighting strategy has effects with respect to these behaviors.

      Thank you for your precise summary of our study and being very positive on the novelty and significance of the study.

      Reviewer #3 (Public review):

      In this revised manuscript, Gao et al. presented a series of well-controlled behavioral data showing that tussling, a form of high-intensity fighting among male fruit flies (Drosophila melanogaster) is enhanced specifically among socially experienced and relatively old males. Moreover, results of behavioral assays led authors to suggest that increased tussling among socially experienced males may increase mating success. They also concluded that tussling is controlled by a class of olfactory sensory neurons and sexually dimorphic central neurons that are distinct from pathways known to control lunges, a common male-type attack behavior.

      A major strength of this work is that it is the first attempt to characterize the behavioral function and neural circuit associated with Drosophila tussling. Many animal species use both low-intensity and high-intensity tactics to resolve conflicts. High-intensity tactics are mostly reserved for escalated fights, which are relatively rare. Because of this, tussling in flies, like high-intensity fights in other animal species, has not been systematically investigated. Previous studies on fly aggressive behavior have often used socially isolated, relatively young flies within a short observation duration. Their discoveries that 1) older (14-day-old) flies tend to tussle more often than younger (2- to 7-day-old) flies, 2) group-reared flies tend to tussle more often than socially isolated flies, and 3) flies tend to tussle at a later stage (mostly ~15 minutes after the onset of fighting) are the result of their creativity in looking outside of conventional experimental settings. These new findings are key for quantitatively characterizing this interesting yet under-studied behavior.

      Newly presented data have made several conclusions convincing. Detailed descriptions of the methods used to quantify behaviors improve transparency and help readers understand the basis of the claims. However, I remain concerned about the authors' persistent attempt to link high-intensity aggression to reproductive success. The authors' effort to "tone down" the link between the two phenomena remains insufficient. They are purely correlational. I reiterate this issue because the overall value of the manuscript would not change with or without this claim.

      Thank you for acknowledging the novelty and significance of the study. Regarding the relationship you mentioned between high-intensity aggression and reproductive success, we have further toned down the statements linking them throughout the revised manuscript. We have also modified the title to “Social Experience Shapes Fighting Strategies in Drosophila”. In addition, we have added a ‘Limitations of the Study’ section to clearly state that the link between tussling and reproductive success is only correlational.

      Reviewer #1 (Recommendations for the authors):

      If possible, mention the EM-connectome data showing the minimal interneuronal path from Or47b ORNs to pC1SS2 neurons (even if derived from the female connectome), which can strengthen the model of parallel sensory-central pathways.

      Thank you for this comment. According to data from the EM connectome, connecting Or47b ORNs to pC1d neurons requires at least two intermediate neurons. An example minimal pathway is: ORN_VA1v (L) → AL-AST1 (L) → PLP245 (L) → pC1d (R). We have added this point in the Discussion section of the revised manuscript.

      I'm not convinced that labeling lunges as "gentle" combat behavior works, either in the abstract or elsewhere. While lunging is indeed a lower-intensity form of aggression compared to tussling, applying anthropomorphic descriptors risks misleading readers.

      Thank you for this comment. We now use “low-intensity” instead of “gentle” to describe lunging.

      In Materials & Methods, please cross-check all figure-panel references after the recent re-numbering (e.g. "Figure 5A6A" etc.).

      Thank you for this comment. We have thoroughly verified the figure panel references in the Materials & Methods section.

      Ensure that Table S1 is clearly cited in the main text where you first describe fly genotypes.

      Thank you for this comment. We have now cited Table S1 in the main text.

      There are multiple grammatical errors and typos throughout the manuscript. Please correct them. Some examples are below, but this is not an exhaustive list:

      Line 98-102 requires rephrasing as the results are already published and not being observed by the authors.

      Thank you for this comment. We have revised the manuscript to “we occasionally observed the high-intensity boxing and tussling behavior in male flies as previously reported (Chen et al., 2002; Nilsen et al., 2004), which….”

      line 116- lower not 'lowed'.

      Corrected.

      line 942 & 945- knock-down males not 'knocking down males'.

      Corrected. Thank you very much for these comments.

      Reviewer #2 (Recommendations for the authors):

      The authors have almost completely answered the major comments I noted on the ver. 1 manuscript: (1) They clearly show changes in fighting strategy in the territory control behavior experiment in the Fig. 6-figure supplements. (2) They provide a detailed description of how aggressive behavior is measured. Thus, I am convinced by this revision.

      Thank you for these comments that make the manuscript a better version.

      Furthermore, Fig. 5, which examines the relationship between pC1[SS2] characteristics and the function of dsx, presents novel and very interesting data. I look forward to further developments.

      Thank you. We will continue to explore this part in our future study.

      However, one point still concerns me.

      Line 192: Although the authors describe it as "usage-dependent," the trans-Tango technique is essentially a postsynaptic cell-labeling technique. It is possible that the labeling intensity in postsynaptic cells increases from the change in expression levels of the Or47b gene due to GH. However, there is no difference in the expression level of the Or47b gene labeled by GFP between SH and GH. Therefore, we cannot conclude that the expression of the Or47b gene is increased by rearing conditions.

      The original paper on trans-TANGO (Talay et al., 2017) does not discuss the usage-dependency. A review of trans-synaptic labeling techniques (Ni, Front Neural Circuits. 2021) discusses that the increase in trans-TANGO signaling with aging may be related to synaptic strength, but there is no experimental evidence for this. In my opinion, the results in Figure 3-figure supplement 2 only weakly suggest that the increase in trans-TANGO signaling may be explained by an increase in synaptic strength due to group rearing.

      We appreciate the reviewer’s insightful comment regarding the interpretation of the trans-Tango signal. Indeed, the original trans-Tango study (Talay et al., 2017) does not claim that the method is usage-dependent. The observed increase in trans-Tango labeling with age, as reported in their supplemental figures, may reflect accumulation over time, potentially influenced by synaptic maturation or increased component expression. To avoid overstating our results, we have revised the relevant statement in the manuscript to remove the term "usage-dependent" and now describe the change in trans-Tango signal more cautiously.  

      Reviewer #3 (Recommendations for the authors):

      Below are the cases where their professed attempts to "tone down the statement" appear ignored:

      Lines 27-29:

      "Our findings... suggest how social experience shapes fighting strategies to optimize reproductive success".

      We have now revised the manuscript to “Our findings… suggest that social experience may shape fighting strategies to optimize reproductive success.”

      Lines 85-86:

      "... discover that this infrequent yet intense form of combat is... crucial for territory dominance and mating competition".

      We have now revised the manuscript to “…discover that this infrequent yet intense form of combat is enhanced by social enrichment, while the low-intensity lunging is suppressed by social enrichment.” 

      Lines 335-339:

      "Here, we found that... GH males tend to... increase the high-intensity tussling, which enhances their territorial and mating competition."

      We have removed “which enhances their territorial and mating competition” in the revised manuscript.

      Lines 343-344:

      "... presenting a paradox between social experience, aggression and reproductive success. Our result resolved this paradox..."

      We have now revised the manuscript to “...Our results provide an explanation for this paradox…”

      Lines 355-358:

      "Interestingly, we found that the mating advantage gained through social enrichment can even offset the mating disadvantage associated with aging, further supporting the vital role of shifting fighting strategies in experienced, aged males."

      We have removed “further supporting the vital role of shifting fighting strategies in experienced, aged males” in the revised manuscript.

      Lines 361-362:

      "These results separate the function of the two fighting forms and rectify out understanding of how social experiences regulate aggression and reproductive success."

      We have removed this sentence in the revised manuscript.

      Some may say that a speculative statement is harmless, but I think it indeed is harmful unless it is clearly indicated as a speculation. It is regrettable that authors remain reluctant to change their claim without providing any new supporting evidence. All three reviewers raised the same concern in the first round of review.

      We apologize for not making the speculative nature of the statement clearer in the previous version. In the revised manuscript, we have now explicitly rephrased sentences to only suggest a correlation but not a causal link between tussling and reproductive success.

      I have no choice but to keep my evaluation of the manuscript as "Incomplete" unless the authors thoroughly eliminate any attempt to link these two. This must go beyond changing a few words in the lines listed above.

      Thank you for this comment. In addition to the lines listed above, we carefully checked all statements regarding the correlation between fighting strategies and reproductive success throughout the full text. Furthermore, we have also added a “Limitations of the Study” section to address the shortcomings of this study in the revised manuscript.

      I do not have the same level of concern over the interpretation of Fig. 6A-C, because this is directly linked to aggressive interactions. Even if the socially isolated males do not engage in tussling, it is not a leap to assume that a different fighting tactic of socially experienced males can give them an advantage in defending a territory. To me, this is a sufficient ethological link with the observed behavioral change.

      Thank you for this insightful comment.

      The following are relatively minor, although important, concerns.

      I beg to differ over the authors' definition of "tussling". Supplemental movies S1 and S2 appear to include "tussling" bouts in which two flies lunge at each other in rapid succession, and supplemental movie S3 appears to include bouts of "holding", in which one fly holds the opponent's wings and shakes vigorously. These cases suggest that the definition of "tussling" as opposed to "lunging" has a subjective element. However, I will not dwell on this matter further because it is impossible to be completely objective over behavioral classification, even by using a computational method. An important point is that the definition is applied consistently within the publication. I have no reason to doubt that this was the case.

      Thank you for this comment. Since the analysis of tussling behavior was conducted manually, it is challenging to achieve complete objectivity. However, we made every effort to apply consistent criteria throughout the analysis. We have added a “Limitations of the Study” section in the revised manuscript to clearly state this caveat. We appreciate your understanding.

      The authors now state that "all tester flies were loaded by cold anesthesia" (lines 432-433). I would like to draw attention to the fact that anesthesia, whether by ice or by CO2, has long been known to affect a fly's subsequent behavior (for aggression, see Trannoy S. et al., Learn. Mem. 2015. 22: 64-68). It would be prudent to acknowledge the possibility that this handling method could have contributed to the unusually high levels of spontaneous tussling, which have not been reported elsewhere before.

      Thank you for this comment. The increased tussling behavior observed in our study is unlikely to be due to cold anesthesia: as noted by Trannoy S. et al. (2015), cold anesthesia profoundly reduces locomotion and general aggressiveness in flies. We acknowledge that the use of cold anesthesia in behavioral experiments may have potential effects on aggression. To minimize this influence, we allowed the flies to recover and adapt for at least 30 minutes before behavioral recording. Moreover, both control and experimental groups were treated in exactly the same manner to ensure consistency.

      It is intriguing that pC1SS2 neurons are dsx+ but fru-. Authors convincingly demonstrated that these neurons are clearly distinct from the P1a neurons, a well-characterized hub for male social behaviors. It is possible that pC1SS2 neurons overlap with previously characterized dsx+ neurons that are important for male aggressions (measured by lunges), such as in Koganezawa et al., Curr. Biol. 2016 and Chiu et al., Cell 2020, a point authors could have explicitly raised.

      Thank you for this comment. We have added this point into the Discussion section of the revised manuscript, as follows: “That tussling-promoting… aggression (Koganezawa et al., 2016). Moreover, the anatomical features of pC1SS2 neurons are highly similar to the male-specific aggression-promoting (MAP) neurons identified by another previous study (Chiu et al., 2021).”

      I acknowledge the authors' courage in initiating an investigation into a less characterized, high-intensity fighting behavior. Tussling requires the simultaneous engagement of two flies. Even if there is confusion over the distinction between lunges and tussling, the authors' conclusion that socially experienced flies and socially isolated flies employ distinct fighting strategies is convincing. The concern I raised above is about the interpretation of the data, not about the quality of the data.

      Thank you for your constructive comments to make this manuscript better.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      LRRK2 protein is familially linked to Parkinson's disease by the presence of several gene variants that all confer a gain-of-function effect on LRRK2 kinase activity. 

      The authors examine the effects of BDNF stimulation in immortalized neuron-like cells, cultured mouse primary neurons, hIPSC-derived neurons, and synaptosome preparations from the brain. They examine an LRRK2 regulatory phosphorylation residue, LRRK2 binding relationships, and measures of synaptic structure and function. 

      Strengths: 

      The study addresses an important research question: how does a PD-linked protein interact with other proteins, and contribute to responses to a well-characterized neuronal signalling pathway involved in the regulation of synaptic function and cell health? 

      They employ a range of good models and techniques to fairly convincingly demonstrate that BDNF stimulation alters LRRK2 phosphorylation and binding to many proteins. Some effects of BDNF stimulation appear impaired in (some of the) LRRK2 knock-out scenarios (but not all). A phosphoproteomic analysis of PD mutant Knock-in mouse brain synaptosomes is included. 

      We thank this Reviewer for pointing out the strengths of our work. 

      Weaknesses: 

      The data sets are disjointed, conclusions are sweeping, and not always in line with what the data is showing. Validation of 'omics' data is very light. Some inconsistencies with the major conclusions are ignored. Several of the assays employed (western blotting especially) are likely underpowered, findings key to their interpretation are addressed in only one or other of the several models employed, and supporting observations are lacking. 

      We appreciate the Reviewer’s overall evaluation. In this revised version, we have provided several novel results that strengthen the omics data and the mechanistic experiments and bring the conclusions in line with the data.

      As examples to aid reader interpretation: (a) pS935 LRRK2 seems to go up at 5 minutes but goes down below pre-stimulation levels after (at times when BDNF-induced phosphorylation of other known targets remains very high). This is ignored in favour of discussion/investigation of initial increases, and the fact that BDNF does many things (which might indirectly contribute to initial but unsustained changes to pLRRK2) is not addressed.  

      We thank the Reviewer for raising this important point, which we agree deserves additional investigation. Although phosphorylation does decrease below pre-stimulation levels, a reduction is also observed for ERK/AKT upon sustained exposure to BDNF in our experimental paradigm (Figure 1F-G). This phenomenon is well known in response to a number of extracellular stimuli and can be explained by mechanisms related to cellular negative feedback regulation, receptor desensitization (e.g. phosphorylation or internalization), or cellular adaptation. The effect on pSer935, however, is peculiar as phosphorylation goes below the unstimulated level, as pointed out by the Reviewer. In contrast to ERK and AKT, whose phosphorylation is almost absent under unstimulated conditions (Figure 1F-G), the stoichiometry of Ser935 phosphorylation under unstimulated conditions is high. This observation is consistent with MS determination of the relative abundance of pSer935 (e.g. in whole brain, LRRK2 is nearly 100% phosphorylated at Ser935; see Nirujogi et al., Biochem J 2021). Thus, we hypothesized that the modest increase in phosphorylation driven by BDNF likely reflects a saturation or ceiling effect, indicating that the phosphorylation level is already near its maximum under resting conditions. Prolonged BDNF stimulation would bring phosphorylation down below pre-stimulation levels through the negative feedback mechanisms (e.g. phosphatase activity) explained above. To test this hypothesis, we conducted an experiment under conditions in which LRRK2 was pre-inhibited with MLi-2 for 90 minutes to reduce basal phosphorylation of S935. After MLi-2 washout, we stimulated with BDNF at different time points. We used GFP-LRRK2 stable lines for this experiment, since the ceiling effect was particularly evident (Figure S1A) and this model has been used for the interactomic study. As shown below (and incorporated in Fig. S1B in the manuscript), LRRK2 responds robustly to BDNF stimulation both in terms of pSer935 and pRABs. Phosphorylation peaks at 5-15 mins, while it decreases to unstimulated levels at 60 and 180 minutes. Notably, while the peak of pSer935 at 5-15 mins is similar to the untreated condition (supporting that Ser935 is nearly saturated in unstimulated conditions), the phosphorylation of RABs during this time period exceeds unstimulated levels. These findings support the notion that, under basal conditions, RAB phosphorylation is far from saturation. The antibodies used to detect RAB phosphorylation are the following: RAB10 Abcam # ab230261 and RAB8 (pan RABs) Abcam # ab230260.

      Given the robust response of RAB10 phosphorylation upon BDNF stimulation, we further investigated RAB10 phosphorylation during BDNF stimulation in naïve SH-SY5Y cells. We confirmed that the increase in pSer935 is coupled to an increase in pT73-RAB10. Also in this case, RAB10 phosphorylation does not go below the unstimulated level, which aligns with the low pRAB10 stoichiometry in brain (Nirujogi et al., Biochem J 2021). This experiment adds the novel and exciting finding that BDNF stimulation increases LRRK2 kinase activity (RAB phosphorylation) in neuronal cells.

      Note that the new supplemental figure 1 now includes: A) a comparison of LRRK2 pS935 and total protein levels before and after RA differentiation; B) differentiated GFP-LRRK2 SH-SY5Y (unstimulated, BDNF, MLi-2, BDNF+MLi-2); C) the kinetics of the BDNF response in differentiated GFP-LRRK2 SH-SY5Y.

      (b) Drebrin coIP itself looks like a very strong result, as does the increase after BDNF, but this was only demonstrated with a GFP over-expression construct despite several mouse and neuron models being employed elsewhere and available for coIP of endogenous LRRK2. Also, the coIP is only demonstrated in one direction. Similarly, the decrease in drebrin levels in mice is not assessed in the other model systems, coIP wasn't done, and mRNA transcripts are not quantified (even though others were). Drebrin phosphorylation state is not examined.

      We appreciate the Reviewer's suggestions and have provided additional experimental evidence supporting the functional relevance of the LRRK2-drebrin interaction.

      (1) As suggested, we performed qPCR and observed that 1 month-old KO midbrain and cortex express lower levels of Dbn1 as compared to WT brains (Figure 5G). This result is in agreement with the western blot data (Figure 5H). 

      (2) To further validate the physiological relevance of the LRRK2-drebrin interaction, we performed two experiments:

      i) Western blots looking at pSer935 and pRab8 (pan Rab) in Dbn1 WT and knockout brains. As reported and quantified in Figure 2I, we observed a significant decrease in pSer935 and a trend decrease in pRab8 in Dbn1 KO brains. This finding supports the notion that Drebrin forms a complex with LRRK2 that is important for its activity, e.g. upon BDNF stimulation. 

      ii) Reverse co-immunoprecipitation of YFP-drebrin full-length, N-terminal domain (1-256 aa) and C-terminal domain (256-649 aa) (plasmids kindly received from Professor Phillip R. Gordon-Weeks, Worth et al., J Cell Biol, 2013) with Flag-LRRK2 co-expressed in HEK293T cells. As shown in supplementary Fig. S2C, we confirm that YFP-drebrin binds LRRK2, with the N-terminal region of drebrin appearing to be the major contributor to this interaction. This result is important as the N-terminal region contains the ADF-H (actin-depolymerising factor homology) domain and a coiled-coil region known to directly bind actin (Shirao et al., J Neurochem 2017; Koganezawa et al., Mol Cell Neurosci. 2017). Interestingly, both full-length Drebrin and its truncated C-terminal construct cause the same morphological changes in F-actin, indicating that Drebrin-induced morphological changes in F-actin are mediated by its N-terminal domains rather than its intrinsically disordered C-terminal region (Shirao et al., J Neurochem, 2017; Koganezawa et al., Mol Cell Neurosci. 2017). Given the role of LRRK2 in actin-cytoskeletal dynamics and its binding to multiple actin-related proteins (Fig. 2 and Meixner et al., Mol Cell Proteomics. 2011; Parisiadou and Cai, Commun Integr Biol 2010), these results suggest the possibility that LRRK2 controls actin dynamics by competing with drebrin binding to actin, and open new avenues for future studies.

      (3) To address the request for examining drebrin phosphorylation state, we decided to perform another phosphoproteomic experiment, leveraging a parallel analysis incorporated in our latest manuscript (Chen et al., Mol Therapy 2025). In this experiment, we isolated total striatal proteins from WT and G2019S KI mice and enriched the phospho-peptides. Unlike the experiment presented in Fig. 7, phosphopeptides were enriched from total striatal lysates rather than synaptosomal fractions, and phosphorylation levels were normalized to the corresponding total protein abundance. This approach was intended to avoid bias toward synaptic proteins, allowing for the analysis of a broader pool of proteins derived from a heterogeneous ensemble of cell types (neurons, glia, endothelial cells, pericytes etc.). We were pleased to find that this new experiment confirmed drebrin S339 as a differentially phosphorylated site, with a 3.7-fold higher abundance in G2019S Lrrk2 KI mice. The fact that this experiment showed increased rather than decreased phosphorylation stoichiometry in G2019S mice is likely due to the normalization of each peptide by its corresponding total protein. Gene ontology analysis of differentially phosphorylated proteins using stringent term size (<200 genes) showed post-synaptic spines and presynaptic active zones as enriched categories (Fig. 3F). A SynGO analysis confirms both pre- and postsynaptic categories, with high significance for terms related to postsynaptic cytoskeleton (Fig. 3G). As noted, this is particularly interesting as the starting material was whole striatal tissue – not synaptosomes as previously – indicating that the most significant phosphorylation differences occur in synaptic compartments. This once again reinforces our hypothesis that LRRK2 has a prominent role in the synapse. 
Overall, we confirmed with an independent phosphoproteomic analysis that LRRK2 kinase activity influences the phosphorylation state of proteins related to synaptic function, particularly postsynaptic cytoskeleton. For clarity in data presentation, as mentioned by the Reviewers, we removed Figure 7 and incorporated this new analysis in figure 3, alongside the synaptic cluster analysis. 

      Altogether, three independent OMICs approaches – (i) experimental LRRK2 interactomics in neuronal cells, (ii) a literature-based LRRK2 synaptic/cytoskeletal interactor cluster, and (iii) a phospho-proteomic analysis of striatal proteins from G2019S KI mice (to model LRRK2 hyperactivity) – converge on the synaptic actin cytoskeleton as a key hub of LRRK2 neuronal function.

      (c) The large differences in the CRISPR KO cells in terms of BDNF responses are not seen in the primary neurons of KO mice, suggesting that other differences between the two might be responsible, rather than the lack of LRRK2 protein. 

      Considering that some variability is expected for these types of cultures and across different species, any difference in response magnitude and kinetics could be attributed to the levels of TrkB and downstream components expressed by the two cell types.

      We are confident that differentiated SH-SY5Y cells provide a reliable model for our study, as we could translate the results obtained in SH-SY5Y cells to other models. However, to rule out the possibility that the more pronounced effect observed in SH-SY5Y KO cells with respect to Lrrk2 KO primary neurons was due to a CRISPR off-target effect, we performed an off-target analysis. Specifically, we selected the first 8 putative off-targets exhibiting a CFD (Cutting Frequency Determination) off-target score >0.2.

      As shown in supplemental file 1, sequence disruption was observed only in the LRRK2 on-target site in LRRK2 KO SH-SY5Y cells, while the 8 off-target regions remained unchanged across the genotypes and relative to the reference sequence.

      (d) No validation of hits in the G2019S mutant phosphoproteomics, and no other assays related to the rest of the paper/conclusions. Drebrin phosphorylation is different but unvalidated, or related to previous data sets beyond some discussion. The fact that LRRK2 binding occurs, and increases with BDNF stimulation, should be compared to its phosphorylation status and the effects of the G2019S mutation. 

      As illustrated in the response to point (b), we performed a new phosphoproteomics investigation – with total striatal lysates instead of striatal synaptosomes and normalization of phospho-peptides to total proteins – and found that S339 phosphorylation increases when LRRK2 kinase activity increases (G2019S). Regarding the request to validate drebrin phosphorylation, the main limitation is that there are no available antibodies against Ser339. While we tried Phos-tag gels in striatal lysates, we could not detect any reliable and specific signal with the same drebrin antibody used for western blot (Thermo Fisher Scientific: MA120377) due to technical limitations of the Phos-tag method. We are confident that phosphorylation at S339 has physiological relevance, as it was identified 67 times across multiple proteomic discovery studies and is among the most frequently phosphorylated sites in drebrin (https://www.phosphosite.org/proteinAction.action?id=2675&showAllSites=true).

      To infer a possible role of this phosphorylation, we looked at the predicted pathogenicity of amino acid substitutions at this site using AlphaMissense (Cheng et al., Science 2023). As shown in the supplementary figure (Fig. S3), amino acid substitutions within this site are predicted not to be pathogenic, also due to the low confidence of the AlphaFold structure.

      Ser339 in human drebrin is located just before the proline-rich region (PP domain) of the protein. This region is situated between the actin-binding domains and the C-terminal Homer-binding sequences and plays a role in protein-protein interactions and cytoskeletal regulation (Worth et al., J Cell Biol, 2013). Of interest, this region was previously shown to be the interaction site of afadin (AFDN), a protein involved in multiple cytoskeletal-related processes, including synapse formation and function by regulating puncta adherentia junctions, presynaptic differentiation, and cadherin complex assembly, which are essential for hippocampal excitatory synapses, spine formation, and learning and memory processes (Beaudoin, G. M., 3rd et al., J Neurosci, 2013). Of note, afadin is in the list of LRRK2 interacting proteins (https://www.ebi.ac.uk/intact/home), supporting a possible functional relevance of LRRK2-mediated drebrin phosphorylation in afadin-drebrin complex formation. This is now addressed in the Discussion section.

      The aim of this MS analysis in G2019S KI mice – now included in figure 3 – was to further validate the crucial role of LRRK2 kinase activity in the context of synaptic regulation, rather than to discover and characterize novel substrates. Consequently, Figure 7 has been eliminated. 

      Reviewer #2 (Public Review):  

      Taken as a whole, the data in the manuscript show that BDNF can regulate PD-associated kinase LRRK2 and that LRRK2 modifies the BDNF response. The chief strength is that the data provide a potential focal point for multiple observations across many labs. Since LRRK2 has emerged as a protein that is likely to be part of the pathology in both sporadic and LRRK2 PD, the findings will be of broad interest. At the same time, the data used to imply a causal throughline from BDNF to LRRK2 to synaptic function and actin cytoskeleton (as in the title) are mostly correlative and the presentation often extends beyond the data. This introduces unnecessary confusion. There are also many methodological details that are lacking or difficult to find. These issues can be addressed. 

      We appreciate the Reviewer’s positive feedback on our study. We also value the suggestion to present the data in a more streamlined and coherent way. In response, we have updated the title to better reflect our overall findings: “LRRK2 Regulates Synaptic Function through Modulation of Actin Cytoskeletal Dynamics.” Additionally, we have included several experiments that we believe enhance and unify the study.

      (1) The writing/interpretation gets ahead of the data in places and this was confusing. For example, the abstract highlights prior work showing that Ser935 LRRK2 phosphorylation changes LRRK2 localization, and Figure 1 shows that BDNF rapidly increases LRRK2 phosphorylation at this site. Subsequent figures highlight effects at synapses or with synaptic proteins. So is the assumption that LRRK2 is recruited to (or away from) synapses in response to BDNF? Figure 2H shows that LRRK2-drebrin interactions are enhanced in response to BDNF in retinoic acid-treated SH-SY5Y cells, but are synapses generated in these preps? How similar are these preps to the mouse and human cortical or mouse striatal neurons discussed in other parts of the paper (would it be anticipated that BDNF act similarly?) and how valid are SHSY5Y cells as a model for identifying synaptic proteins? Is drebrin localization to synapses (or its presence in synaptosomes) modified by BDNF treatment +/- LRRK2? Or do LRRK2 levels in synaptosomes change in response to BDNF? The presentation requires re-writing to stay within the constraints of the data or additional data should be added to more completely back up the logic. 

      We thank the Reviewer for the thorough suggestions and comments. We have extensively revised the text to accurately reflect our findings without overinterpreting. In particular, we agree with the Reviewer that differentiated SH-SY5Y cells are not identical to primary mouse or human neurons; however, both neuronal models respond to BDNF. Indeed, a common protocol for differentiating SH-SY5Y cells involves BDNF in combination with retinoic acid (Martin et al., Front Pharmacol, 2022; Kovalevich et al., Methods in mol bio, 2013). Additionally, it has been reported that SH-SY5Y cells can form functional synapses (Martin et al., Front Pharmacol, 2022). While we are aware that BDNF, drebrin or LRRK2 can also affect non-synaptic pathways, we focused on synapses when we moved to mouse models since: (i) MS and phosphoMS identified several cytoskeletal proteins enriched at the synapse; (ii) we and others have previously reported a role for LRRK2 in governing synaptic and cytoskeletal-related processes; (iii) the synapse is a critical site that becomes dysfunctional in the early stages of PD. We have now clarified and adjusted the text as needed. We have also performed additional experiments to address the Reviewer’s concern:

      (1) “Is the assumption that LRRK2 is recruited to (or away from) synapses in response to BDNF”? This is a very important point. There is consensus in the field that detecting endogenous LRRK2 in brain slices or in primary neurons via immunofluorescence is very challenging with the commercially available antibodies (Fernandez et al., J Parkinsons Dis, 2022). We established a method in our previous studies to detect LRRK2 biochemically in synaptosomes (Cirnaru et al., Front Mol Neurosci, 2014; Belluzzi et al., Mol Neurodegener., 2016). While these data indicate LRRK2 is present in the synaptic compartments, it would be quite challenging to apply this method to the present study. In fact, applying acute BDNF stimulation in vivo and then isolating synaptosomes is a complex experiment beyond the timeframe of the revision, because it requires ethical approval for the mouse work. However, this is definitely an intriguing angle to explore in the future.

      (2)“Is drebrin localization to synapses (or its presence in synaptosomes) modified by BDNF treatment +/- LRRK2?” To try and address this question, we adapted a previously published assay to measure drebrin exodus from dendritic spines. During calcium entry and LTP, drebrin exits dendritic spines and accumulates in the dendritic shafts and cell body (Koganezawa et al., 2017). This facilitates the reorganization of the actin cytoskeleton (Shirao et al., 2017). Given the known role of drebrin and its interaction with LRRK2, we hypothesized that LRRK2 loss might affect drebrin relocalization during spine maturation.

      To test this, we treated DIV14 primary cortical neurons from Lrrk2 WT and KO mice with BDNF for 5 minutes, 15 minutes, and 24 hours, then performed confocal imaging of drebrin localization (Author response image 1). Neurons were transfected at DIV4 with GFP (cell filler) and PSD95 (dendritic spines) for visualization, and endogenous drebrin was stained with an anti-drebrin antibody. We then measured drebrin's overlap with PSD95-positive puncta to track its localization at the spine.
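
      As a purely illustrative sketch of the kind of overlap measurement described above (not the analysis pipeline actually used), the fraction of drebrin signal falling within PSD95-positive pixels could be computed as follows. The function name, the intensity threshold, and the tiny example images are all hypothetical.

```python
# Hypothetical sketch: share of drebrin intensity located in PSD95-positive pixels.
# The threshold and the 3x3 example images are invented for illustration only.

def drebrin_fraction_in_psd95(drebrin, psd95, psd95_threshold):
    """Return the fraction of total drebrin intensity found in pixels where
    the PSD95 channel is at or above psd95_threshold."""
    total = 0.0
    inside = 0.0
    for drebrin_row, psd95_row in zip(drebrin, psd95):
        for d, p in zip(drebrin_row, psd95_row):
            total += d
            if p >= psd95_threshold:
                inside += d
    if total == 0:
        return 0.0  # no drebrin signal at all
    return inside / total


# Two single-plane "images" given as nested lists of pixel intensities.
drebrin_img = [[10, 0, 0], [30, 40, 0], [0, 20, 0]]
psd95_img = [[0, 0, 0], [200, 180, 0], [0, 10, 0]]

fraction = drebrin_fraction_in_psd95(drebrin_img, psd95_img, psd95_threshold=100)
print(f"Drebrin fraction in PSD95-positive pixels: {fraction:.2f}")
```

      A lower fraction after stimulation would correspond to drebrin leaving the spine; real data would additionally require background subtraction and per-neuron normalization.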

      In Lrrk2 WT neurons, drebrin relocalized from spines after BDNF stimulation, peaking at 15 minutes and showing higher co-localization with PSD95 at 24 hours, indicating that spine remodeling occurred. In contrast, Lrrk2 KO neurons showed no drebrin exodus. These findings support the notion that LRRK2's interaction with drebrin is important for spine remodeling via BDNF. However, additional experiments with larger sample sizes are needed, which were not feasible within the revision timeframe (here n=2 experiments with independent neuronal preparations, n=4-7 neurons analyzed per experiment). Thus, we included the relevant figure as Author response image 1 but chose not to add it to the manuscript (figure 3).

      Author response image 1.

      Lrrk2 affects drebrin exodus from dendritic spines. After the exposure to BDNF for different times (5 minutes, 15 minutes and 24 hours), primary neurons from Lrrk2 WT and KO mice have been transfected with GFP and PSD95 and stained for endogenous drebrin at DIV4. The amount of drebrin localizing in dendritic spines outlined by PSD95 has been assessed at DIV14. The graph shows a pronounced decrease in drebrin content in WT neurons during short time treatments and an increase after 24 hours. KO neurons present no evident variations in drebrin localization upon BDNF stimulation. Scale bar: 4 μm.

      (2) The experiments make use of multiple different kinds of preps. This makes it difficult at times to follow and interpret some of the experiments, and it would be of great benefit to more assertively insert "mouse" or "human" and cell type (cortical, glutamatergic, striatal, gabaergic) etc. 

      We thank the Reviewer for pointing this out. We have now more clearly specified the cell type and species identity throughout the text to improve clarity and interpretation.

      (3) Although BDNF induces quantitatively lower levels of ERK or Akt phosphorylation in LRRK2KO preps based on the graphs (Figure 4B, D), the western blot data in Figure 4C make clear that BDNF does not need LRRK2 to mediate either ERK or Akt activation in mouse cortical neurons and in 4A, ERK in SH-SY5Y cells. The presentation of the data in the results (and echoed in the discussion) writes of a "remarkably weaker response". The data in the blots demand more nuance. It seems that LRRK2 may potentiate a response to BDNF that in neurons is independent of LRRK2 kinase activity (as noted). This is more of a point of interpretation, but the words do not match the images.  

      We thank the Reviewer for pointing this out. We have rephrased our data presentation to better convey our findings. We were not surprised to find that loss of LRRK2 causes only a reduction of ERK and AKT activation upon BDNF rather than a complete loss. This is because these pathways are complex and redundant and are activated by a number of cellular effectors. The fact that LRRK2 is one among many players whose function can be compensated by other signaling molecules is also supported by the phenotype of Lrrk2 KO mice that is measurable at 1 month but disappears with adulthood (4 and 18 months) (figure 5).

      Moreover, we removed the sentence “Of note, 90 mins of Lrrk2 inhibition (MLi-2) prior to BDNF stimulation did not prevent phosphorylation of Akt and Erk1/2, suggesting that LRRK2 participates in BDNF-induced phosphorylation of Akt and Erk1/2 independently from its kinase activity but dependently from its ability to be phosphorylated at Ser935 (Fig. 4C-D and Fig. 1B-C)” since the MLi-2 treatment prior to BDNF stimulation was not quantified and our new data point to an involvement of LRRK2 kinase activity upon BDNF stimulation.

      (4) Figure 4F/G shows an increase in PSD95 puncta per unit length in response to BDNF in mouse cortical neurons. The data do not show spine induction/dendritic spine density/or spine morphogenesis as suggested in the accompanying text (page 8). Since the neurons are filled/express gfp, spine density could be added or spines having PSD95 puncta. However, the data as reported would be expected to reflect spine and shaft PSDs and could also include some nonsynaptic sites. 

      The Reviewer is right. We have rephrased the text to reflect an increase in postsynaptic density (PSD) sites, which may include both spine and shaft PSDs, as well as potential nonsynaptic sites.

      (5) Experimental details are missing that are needed to fully interpret the data. There are no electron microscopy methods outside of the figure legend. And for this and most other microscopy-based data, there are few to no descriptions of what cells/sites were sampled, how many sites were sampled, and how regions/cells were chosen. For some experiments (like Figure 5D), some detail is provided in the legend (20 segments from each mouse), but it is not clear how many neurons this represents, where in the striatum these neurons reside, etc. For confocal z-stacks, how thick are the optical sections and how thick is the stack? The methods suggest that data were analyzed as collapsed projections, but they cite Imaris, which usually uses volumes, so this is confusing. The guide (sgRNA) sequences that were used should be included. There is no mention of sex as a biological variable. 

      We thank the Reviewer for pointing out this missing information. We have now included:

      (1) EM methods (page 24)

      (2) Methods for ICC and confocal microscopy now incorporate the Z-stack thickness (0.5 μm x 6 = 3 μm) on page 23.

      (3) Methods for Golgi-Cox staining now incorporate the Z-stack thickness and number of neurons and segments per neuron analyzed.

      (4) The sex of mice is mentioned in the material and methods (page 17): “Approximately equal numbers of males and females were used for every experiment”.

      (6) For Figures 1F, G, and E, how many experimental replicates are represented by blots that are shown? Graphs/statistics could be added to the supplement. For 1C and 1I, the ANOVA p-value should be added in the legend (in addition to the post hoc value provided). 

      The blots in figure 1F, G and E are representative of several blots (at least n=5). The same readouts are part of figure 4, where quantifications are provided. We added the ANOVA p-value in the legend for figure 1C, 1I and 1K.

      (7) Why choose 15 minutes of BDNF exposure for the mass spec experiments when the kinetics in Figure 1 show a peak at 5 mins?  

      This is an important point. We repeated the experiment in GFP-LRRK2 SH-SY5Y cells (figure S1C) and included the 15 min time point. In addition to confirming that pSer935 increases similarly at 5 and 15 minutes, we also observed an increase in RAB phosphorylation at these time points. As mentioned in our response to Reviewer 1, we pretreated with MLi-2 for 90 minutes in this experiment to reduce the high basal phosphorylation stoichiometry of pSer935.

      (8) The schematic in Figure 6A suggests that iPSCs were plated, differentiated, and cultured until about day 70 when they were used for recordings. But the methods suggest they were differentiated and then cryopreserved at day 30, and then replated and cultured for 40 more days. Please clarify if day 70 reflects time after re-plating (30+70) or total time in culture (70). If the latter, please add some notes about re-differentiation, etc. 

      We thank the Reviewer for prompting us to clarify the iPSC methodology. In the submitted manuscript, 70DIV represents the total time in vitro, and the process involved a cryostorage event at 30DIV, with a thaw of the cells and a further 40 days of maturation before measurement. We have adjusted the methods in both the text and figure (new schematic) to clarify this. The cryopreservation step has been used in other iPSC methods to great effect (Drummond et al., Front Cell Dev Biol, 2020). Due to the complexity and length of the iPSC neuronal differentiation process, cryopreservation represents a useful method with which to shorten experiments, make them easier to repeat, and reduce the considerable variation between differentiations. Because culture conditions differ for each batch of neurons thawed, each batch can usefully be treated as a new and separate N compared to the next batch of neurons.

      (9) When Figures 6B and 6C are compared it appears that mEPSC frequency may increase earlier in the LRRK2KO preps than in the WT preps since the values appear to be similar to WT + BDNF. In this light, BDNF treatment may have reached a ceiling in the LRRK2KO neurons.

      We thank the reviewer for his/her comment and observations about the ceiling effects. It is indeed possible that the loss of LRRK2 and the application of BDNF could cause the same elevation in synaptic neurotransmission. In such a situation, the increased activity as a result of BDNF treatment would be masked by the increased activity  observed as a result of LRRK2 KO. To better visualize the difference between WT and KO cultures and the possible ceiling effect, we merged the data in one single graph.  

      (10) Schematic data in Figures 5A and C and Figures 5B and E are too small to read/see the data. 

      We thank the Reviewer for this suggestion. We have now enlarged figure 5A and moved the graph of figure 5D in supplemental figure S5, since this analysis of spine morphology is secondary to the one shown in figure 5C.

      Reviewer #1 (Recommendations For The Authors): 

      Please forgive any redundancy in the comments, I wanted to provide the authors with as much information as I had to explain my opinion. 

      Primary mouse cortical neurons at div14, 20% transient increase in S935 pLRRK2 5min after BDNF, which then declines by 30 minutes (below pre-stim levels, and maybe LRRK2 protein levels do also). 

      In differentiated SHSY5Y cells there is a large expected increase in pERK and pAKT that is sustained way above pre-stim for 60 minutes. There is a 50% initial increase in pLRRK2 (but the blot is not very clear and no double band in these cells), which then looks like reduced well below pre-stim by 30 & 60 minutes. 

      We thank the Reviewer for bringing up this important point. We have extensively addressed this issue in the public review rebuttal. In essence, the phosphorylation of Ser935 is near saturation under unstimulated conditions, as evidenced by its high basal stoichiometry, whereas Rab phosphorylation is far from saturation, showing an increase upon BDNF stimulation before returning to baseline levels. This distinction highlights that while pSer935 exhibits a ceiling effect due to its near-maximal phosphorylation at rest, pRab responds dynamically to BDNF, indicating low basal phosphorylation and a significant capacity for increase. Figure 1 in the rebuttal summarizes the new data collected.

      GFP-fused overexpressed LRRK2 coIPs with drebrin, and this is double following 15 min BDNF. Strong result.

      We thank the Reviewer.

      BDNF-induced pAKT signaling is greatly impaired, and pERK is somewhat impaired, in CRISPR LKO SHSY5Y cells. In mouse primaries, both AKT and Erk phosph is robustly increased and sustained over 60 minutes in WT and LKO. This might be initially less in LKO for Akt (hard to argue on a WB n of 3 with huge WT variability), regardless they are all roughly the same by 60 minutes and even look higher in LKO at 60. This seems like a big disconnect and suggests the impairment in the SHSy5Y cells might have more to do with the CRISPR process than the LRRK2. Were the cells sequenced for off-target CRISPR-induced modifications?  

      Following the Reviewer's suggestion – and as discussed in the public review section – we performed an off-target analysis. Specifically, we selected the first 8 putative off-targets exhibiting a CFD (Cutting Frequency Determination) off-target score >0.2. As shown in supplemental file 1, sequence disruption was observed only in the LRRK2 on-target site in LRRK2 KO SH-SY5Y cells, while the 8 off-target regions remained unchanged across the genotypes and relative to the reference sequence.
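
      For illustration only, the selection step described above could be sketched as below. The candidate loci and scores are invented, and ranking the candidates by descending CFD score before taking the first 8 is an assumption on our part, not a statement of how the prediction tool orders its output.

```python
# Hypothetical sketch of the off-target selection rule: keep candidates with
# CFD score > 0.2 and take the first 8. Loci and scores are invented, and the
# descending-score ordering is an assumption.
from typing import NamedTuple


class OffTarget(NamedTuple):
    locus: str        # placeholder label, not a real genomic coordinate
    cfd_score: float  # predicted CFD off-target score (0 to 1)


def select_off_targets(candidates, min_score=0.2, max_sites=8):
    """Return up to max_sites candidates whose CFD score exceeds min_score,
    ordered from highest to lowest score."""
    passing = [c for c in candidates if c.cfd_score > min_score]
    passing.sort(key=lambda c: c.cfd_score, reverse=True)
    return passing[:max_sites]


example_scores = [0.91, 0.15, 0.62, 0.33, 0.21, 0.20, 0.48, 0.77, 0.05, 0.29, 0.56, 0.40]
candidates = [
    OffTarget(f"site_{i:02d}", score) for i, score in enumerate(example_scores, start=1)
]

selected = select_off_targets(candidates)
for site in selected:
    print(site.locus, site.cfd_score)
```

      Each selected region would then be amplified and sequenced in WT and KO cells and compared with the reference sequence.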

      No difference in the density of large PSD-95 puncta in dendrites of LKO primary relative to WT, and the small (10%) increase seen in WT after BDNF might be absent in LKO (it is not clear to me that this is absent in every culture rep, and the data is not highly convincing). This is also referred to as spinogenesis, which has not been quantified. Why not is confusing as they did use a GFP fill... 

      The Reviewer is right that spinogenesis is not the appropriate term for the process analyzed. We replaced “spinogenesis” with “morphological alteration of dendritic protrusions” or “synapse maturation”, which is correlated with the number of PSD95-positive puncta (El-Husseini et al., Science, 2000).

      There is a difference in the percentage of dendritic protrusions classified as filopodia to more being classified as thin spines in LKO striatal neurons at 1 month, which is not seen at any other age, The WT filopodia seems to drop and thin spine percent rise to be similar to LKO at 4 months. This is taken as evidence for delayed maturation in LKO, but the data suggest the opposite. These authors previously published decreased spine and increased filopodia density at P15 in LKO. Now they show that filopodia density is decreased and thin spine density increased at one month. How is that shift from increased to decreased filopodia density in LKO (faster than WT from a larger initial point) evidence of impaired maturation? Again this seems accelerated? 

      We agree with the Reviewer that the initial interpretation was indeed confusing. To adhere closely to our data and avoid overinterpretation – as also suggested by Reviewer 2 – we revised the text and moved figure 5D to supplementary materials. In essence, our data point to alterations in the structural properties of dendritic protrusions in young KO mice, specifically a reduction in their size (head width and neck height) and a decrease in postsynaptic density (PSD) length, as observed with TEM. These findings suggest that LRRK2 is involved in morphological processes during spine development.

      Shank3 and PSD95 mRNA transcript levels were reduced in the LKO midbrain, only shank3 was reduced in the striatum and only PSD was reduced in the cortex. No changes to mRNA of BDNF-related transcripts. None of these mRNA changes protein-validated. Drebrin protein (where is drebrin mRNA?) levels are reduced in LKO at 1&4 but not clearly at 18 months (seems the most robust result but doesn't correlate with other measures, which here is basically a transient increase (1m) in thin striatal spines).  

      As described above, we performed qPCR for Dbn1 and found that its expression is significantly reduced in the cortex and midbrain and non-significantly reduced in the striatum (1-month-old mice, a different cohort from that used for the other analyses in figure 5).

      24h BDNF increases the frequency of mEPSCs on hIPSC-derived cortical-like neurons, but not LKO, which is already high. There are no details of synapse number or anything for these cultures and compares 24h treatment. BDNF increases mEPSC frequency within minutes PMC3397209, and acute application while recording on cells may be much more informative (effects of BDNF directly, and no issues with cell-cell / culture variability). Calling mEPSC "spontaneous electrical activity" is not standard.  

      We thank the Reviewer for this point. We provided information about synapse number (Bassoon/Homer colocalization) in supplementary figure S7. The lack of an mEPSC response in LRRK2 KO cultures is likely due to increased release probability, as the number of synapses does not change between the two genotypes.

      The pattern of LRRK2 activation is very disconnected from that of BDNF signalling onto other kinases. Regarding pLRRK2, s935 is a non-autophosph site said to be required for LRRK2 enzymatic activity, that is mostly used in the field as a readout of successful LRRK2 inhibition, with some evidence that this site regulates LRRK2 subcellular localization (which might be more to do with whether or not it is p at 935 and therefor able to act as a kinase). 

      The authors imply BDNF is activating LRRK2, but really should have looked at other sites, such as the autophospho site 1292 and 'known' LRRK2 substrates like T73 pRab10 (or other e.g., pRab12) as evidence of LRRK2 activation. One can easily argue that the initial increase in pLRRK2 at this site is less consequential than the observation that BDNF silences LRRK2 activity based on p935 being sustained to being reduced after 5 minutes, and well below the prestim levels... not that BDNF activates LRRK2. 

      As described above, we have collected new data showing that BDNF stimulation increases LRRK2 kinase activity toward its physiological substrates Rab10 and Rab8 (using a pan-phospho-Rab antibody) (Figure 1 and Figure S1). Additionally, we have also extensively commented on the ceiling effect of pS935.

      BDNF does a LOT. What happens to network activity in the neural cultures with BDNF application? Should go up immediately. Would increasing neural activity (i.e., through depolarization, forskolin, disinhibition, or something else without BDNF) give a similar 20% increase in pS935 LRRK2? Can this be additive, or occluded? This would have major implications for the conclusions that BDNF and pLRRK2 are tightly linked (as the title suggests).  

      These are very valuable observations; however, they fall outside the scope and timeframe of this study. We agree that future research should focus on gaining a deeper mechanistic understanding of how LRRK2 regulates synaptic activity, including vesicle release probability and postsynaptic spine maturation, independently of BDNF.

      Figures 1A & H "Western blot analysis revealed a rapid (5 mins) and transient increase of Ser935 phosphorylation after BDNF treatment (Fig. 1B and 1C). Of interest, BDNF failed to stimulate Ser935 phosphorylation when neurons were pretreated with the LRRK2 inhibitor MLi-2" . The first thing that stands out is that the pLRRK2 in WB is not very clear at all (although we appreciate it is 'a pig' to work with, I'd hope some replicates are clearer); besides that, the 20% increase only at 5min post-BDNF stimulation seems like a much less profound change than the reduction from base at 60 and more at 180 minutes (where total LRRK2 protein is also going down?). That the blot at 60 minutes in H is representative of a 30% reduction seems off... makes me wonder about the background subtraction in quantification (for this there is much less pLRRK2 and more total LRRK2 than at 0 or 5). LRRK2 (especially) and pLRRK2 seem very sketchy in H. Also, total LRRK2 appears to increase in the SHSY5Y cell not the neurons, and this seems even clearer in 2 H. 

      To better visualize the dynamics of pS935 variation relative to time=0, we presented the data as the difference between t=0 and t=x. It clearly shows that pSer935 goes below prestimulation levels, whereas pRab10 does not. The large difference in the initial stoichiometry of these two phosphorylations is extensively discussed above.
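
      The baseline-subtracted presentation mentioned above amounts to a simple transformation, sketched here with invented numbers (these are not our measured values, and the function name is hypothetical): each time point is expressed as its difference from the t=0 value of the same time course.

```python
# Hypothetical sketch: express each time point as the difference from t=0.
# All values below are invented placeholders, not measured data.

def delta_from_baseline(timecourse):
    """timecourse maps time in minutes to a normalized phospho/total ratio
    and must contain the key 0 (the unstimulated baseline)."""
    baseline = timecourse[0]
    return {t: value - baseline for t, value in sorted(timecourse.items())}


ps935 = {0: 1.0, 5: 1.25, 30: 0.75, 60: 0.5}   # rises briefly, then drops below baseline
prab10 = {0: 1.0, 5: 1.5, 30: 1.25, 60: 1.0}   # rises, then returns to baseline

ps935_delta = delta_from_baseline(ps935)
prab10_delta = delta_from_baseline(prab10)

for t in ps935_delta:
    print(f"t={t:>2} min  pS935: {ps935_delta[t]:+.2f}  pRab10: {prab10_delta[t]:+.2f}")
```

      In this representation, negative values mark time points below prestimulation levels, which makes the divergence between the two readouts easy to read off.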

      That MLi2 eliminates pLRRK2 (and seems to reduce LRRK2 protein?) isn't surprising, but a 90min pretreatment with MLi-2 should be compared to MLi-2's vehicle alone (MLi-2 is notoriously insoluble and the majority of diluents have bioactive effects like changing activity)... especially if concluding increased pLRRK2 in response to BDNF is a crucial point (when comparing against effects on other protein modifications such as pAKT). This highlights a second point... the changes to pERK and pAKT are huge following BDNF (nothing to massive quantities), whereas pLRRK2 increases are 20-50% at best. This suggests a very modest effect of BDNF on LRRK in neurons, compared to the other kinases. I worry this might be less consequential than claimed. Change in S1 is also unlikely to be significant... 

      These comments have been thoroughly addressed in the previous responses. Regarding fig. S1, we added an additional experiment (Figure S1C) in GFP-LRRK2 cells showing robust activation of LRRK2 (pS935, pRabs) at the timepoint of MS (15 min).

      "As the yields of endogenous LRRK2 purification were insufficient for AP-MS/MS analysis, we generated polyclonal SH-SY5Y cells stably expressing GFP-LRRK2 wild-type or GFP control (Supplementary Fig. 1)" . I am concerned that much is being assumed regarding 'synaptic function' from SHSY5Y cells... also overexpressing GFP-LRRK2 and looking at its binding after BDNF isn't synaptic function.  

      We appreciate the reviewer’s comment. We would like to clarify that the interactors enriched upon BDNF stimulation predominantly fall into semantic categories related to the synapse and actin cytoskeleton. While this does not imply that these interactors are exclusively synaptic, it suggests that this tightly interconnected network likely plays a role in synaptic function. This interpretation is supported by several lines of evidence: (1) previous studies have demonstrated the relevance of this compartment to LRRK2 function; (2) our new phosphoproteomics data from striatal lysate highlight enrichment of synaptic categories; and (3) analysis of the latest GWAS gene list (134 genes) also indicates significant enrichment of synapse-related categories. Taken together, these findings justify further investigation into the role of LRRK2 in synaptic biology, as discussed extensively in the manuscript’s discussion section.

      Figure 2A isn't alluded to in text and supplemental table 1 isn't about LRRK2 binding, but mEPSCs. 

      We have added Figure 2A and added supplementary .xls table 1, which refers to the excel list of genes with modulated interaction upon BDNF (uploaded in the supplemental material).

      We added the extension .xls also for supplementary table 2 and 3. 

      Figure 2A is useless without some hits being named, and the donut plots in B add nothing beyond a statement that "35% of 'genes' (shouldn't this be proteins?) among the total 207 LRRK2 interactors were SynGO annotated" might as well [just] be the sentence in the text. 

      We have now included the names of the most significant hits, including cytoskeletal and translation-related proteins, as well as known LRRK2 interactors. We decided to retain the donut plots, as we believe they simplify data interpretation for the reader, reducing the need to jump back and forth between the figures and the text.

      Validation of drebrin binding in 2H is great... although only one of 8 named hits; could be increased to include some of the others. A concern alludes to my previous point... there is no appreciable LRRK2 in these cells until GFP-LRRK2 is overexpressed; is this addressed in the MS? Conclusions would be much stronger if bidirectional coIP of these binding candidates were shown with endogenous (GFP-ve) LRRK2 (primaries or hIPSCs, brain tissue?) 

      To address the Reviewer’s concerns to the best of our abilities, we have added a blot in Supplemental figure S1A showing how the expression levels of LRRK2 increase after RA differentiation. Moreover, we have included several new datasets further strengthening the functional link between LRRK2 and drebrin, including qPCR of Dbn1 in one-month-old Lrrk2 KO brains, western blots of Lrrk2 and Rab in Dbn1 KO brains, and co-IP with drebrin N- and C-terminal domains.

      Figures 3 A-C are not informative beyond the text and D could be useful if proteins were annotated. 

      To avoid overcrowding, proteins were annotated in A and the same network structure reported for synaptic and actin-related interactors. 

      Figure 4. Is this now endogenous LRRK2 in the SHSY5Y cells? Again not much LRRK2 though, and no pLRRK shown. 

      We confirm that these are naïve SH-SY5Y cells differentiated with RA and LRRK2 is endogenous. We did not assess pS935 in this experiment, as the primary goal was to evaluate pAKT and pERK1/2 levels. To avoid signal saturation, we loaded less total protein (30 µg instead of the 80 µg typically required to detect pS935). pS935 levels were extensively assessed in Figure 1. This experimental detail has now been added in the material and methods section (page 18).

      In C (primary neurons) There is very little increase in pLRRK2 / LRRK2 at 5 mins, and any is much less profound a change than the reduction at 30 & 60 mins. I think this is interesting and may be a more substantial consequence of BDNF treatment than the small early increase. Any 5 min increase is gone by 30 and pLRRK2 is reduced after. This is a disconnect from the timing of all the other pProteins in this assay, yet pLRRK2 is supposed to be regulating the 'synaptic effects'? 

      The first part of the question has already been extensively addressed. Regarding the timing, one possibility is that LRRK2 is activated upstream of AKT and ERK1/2, a hypothesis supported by the reduced activation of AKT and ERK1/2 observed in LRRK2 KO cells, as discussed in the manuscript, and in MLi-2 treated cells (Author response image 2). Concerning the synaptic effects, it is well established that synaptic structural and functional plasticity occurs downstream of receptor activation and kinase signaling cascades. These changes can be mediated by both rapid mechanisms (e.g., mobilization of receptor-containing endosomes via the actin cytoskeleton) and slower processes involving gene transcription of immediate early genes (IEGs). Since structural and functional changes at the synapse generally manifest several hours after stimulation, we typically assessed synaptic activity and structure 24 hours post-stimulation.

      Akt Erk1&2 both go up rapidly after BDNF in WT, although Akt seems to come down with pLRRK2. If they aren't all the same Akt is probably the most different between LKO and WT but I am very concerned about an n=3 for wb, wb is semi-quantitative at best, and many more than three replicates should be assessed, especially if the argument is that the increases are quantitively different between WT v KO (huge variability in WT makes me think if this were done 10x it would all look same). Moreover, this isn't similar to the LKO primaries  "pulled pups" pooled presumably. 

      Despite some variability in the magnitude of the pAKT/pERK response in naïve SH-SY5Y cells, all three independent replicates consistently showed a reduced response in LRRK2 KO cells, yielding a highly significant result in the two-way ANOVA test. In contrast, the difference in response magnitude between WT and LRRK2 KO primary cultures was less pronounced, which justified repeating the experiments with n=9 replicates. We hope the Reviewer acknowledges the inherent variability often observed in western blot experiments, particularly when performed in a fully independent manner (different cultures and stimulations, independent blots).

      To further strengthen the conclusion that this effect is reproducible and dependent on LRRK2 kinase activity upstream of AKT and ERK, we probed the membranes in figure 1H with pAKT/total AKT and pERK/total ERK. All things considered and consistent with our hypothesis, MLi-2 significantly reduced BDNF-mediated AKT and ERK1/2 phosphorylation levels (Author response image 2). 

      Author response image 2.

      Western blot (same experiments as in Figure 1) was performed using antibodies against phospho-Thr202/185 ERK1/2, total ERK1/2, phospho-Ser473 AKT, and total AKT. Retinoic acid-differentiated SH-SY5Y cells were stimulated with 100 ng/mL BDNF for 0, 5, 30, or 60 min. MLi-2 was used at 500 nM for 90 min to inhibit LRRK2 kinase activity.

      G lack of KO effect seems to be skewed from one culture in the plot (grey). The scatter makes it hard to read, perhaps display the culture mean +/- BDNF with paired bars. The fact that one replicate may be changing things is suggested by the weirdly significant treatment effect and no genotype effect. Also, these are GFP-filled cells, the dendritic masks should be shown/explained, and I'm very surprised no one counted the number (or type?) of protrusions, especially as the text describes this assay (incorrectly) as spinogenesis... 

      As suggested by the Reviewer we have replotted the results as bar graphs. Regarding the number of protrusions, we initially counted the number of GFP+ puncta in the WT and did not find any difference (Author response image 3). Due to our imaging setup (confocal microscopy rather than super-resolution imaging and Imaris 3D reconstruction), we were unable to perform a fine morphometric analysis. However, this was not entirely unexpected, as BDNF is known to promote both the formation and maturation of dendritic spines. Therefore, we focused on quantifying PSD95+ puncta as a readout of mature postsynaptic compartments. While we acknowledge that we cannot definitively conclude that each PSD95+ punctum is synaptically connected to a presynaptic terminal, the data do indicate an increase in the number of PSD95+ structures following BDNF stimulation.

      Author response image 3.

      GFP+ puncta per unit of neurite length (µm) in DIV14 WT primary neurons, untreated or after 24 hours of BDNF treatment (100 ng/mL). No significant differences were observed (n=3).

      Figure 5. "Dendritic spine maturation is delayed in Lrrk2 knockout mice". The only significant change is at 1 month in KO which shows fewer filopodia and increased thin spines (50% vs wt). At 4 months the % of thin spines is increased to 60% in both... Filopodia also look like 4m in KO at 1m... How is that evidence for delayed maturation? If anything it suggests the KO spines are maturing faster. "the average neck height was 15% shorter and the average head width was 27% smaller, meaning that spines are smaller in Lrrk2 KO brains" - it seems odd to say this before saying that actually there are just MORE thin spines, the number of mature "mushroom' is same throughout, and the different percentage of thin comes from fewer filopodia. This central argument that maturation is delayed is not supported and could be backwards, at least according to this data. Similarly, the average PSD length is likely impacted by a preponderance of thin spines in KO... which if mature were fewer would make sense to say delayed KO maturation, but this isn't the case, it is the fewer filopodia (with no PSD) that change the numbers. See previous comments of the preceding manuscript. 

      We agree that thin spines, while often considered more immature, represent an intermediate stage in spine development. The data showing an increase in thin spines at 1 month in the KO mice, along with fewer filopodia, could suggest a faster stabilization of these spines, which might indeed be indicative of premature maturation rather than delayed maturation. This change in spine morphology may indicate that the dynamics of synaptic plasticity are affected. Regarding the PSD length, as the Reviewer pointed out, the increased presence of thin spines in KO might account for the observed changes in PSD measurements, as thin spines typically have smaller PSDs. This further reinforces the idea that the overall maturation process may be altered in the KO, but not necessarily delayed. 

      We have rephrased the interpretation of these data and moved Figure 5D to Supplemental Figure S4.

      "To establish whether loss of Lrrk2 in young mice causes a reduction in dendritic spines size by influencing BDNF-TrkB expression" - there is no evidence of this.  

      We agree and reorganized the text, removing this sentence.  

      Shank and PSD95 mRNA changes being shown without protein adds very little. Why is drebrin RNA not shown? Also should be several housekeeping RNAs, not one (RPL27)? 

      We measured Dbn1 mRNA, which shows a significant reduction in midbrain and cortex. Moreover, we have now normalized the transcript levels against the geometric mean of the relative abundance of three housekeeping genes (RPL27, actin, and GAPDH).
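The normalization described above can be illustrated with a minimal sketch. It assumes the relative abundances are already on a linear scale; the function name `normalize_to_housekeeping` and all numbers are hypothetical and are not taken from the manuscript.

```python
from statistics import geometric_mean


def normalize_to_housekeeping(target_abundance, housekeeping_abundances):
    """Divide a target transcript's relative abundance by the geometric
    mean of the reference (housekeeping) gene abundances."""
    if not housekeeping_abundances:
        raise ValueError("at least one housekeeping gene is required")
    return target_abundance / geometric_mean(housekeeping_abundances)


# Hypothetical relative abundances (arbitrary linear units) for one sample.
reference = {"RPL27": 2.0, "actin": 4.0, "GAPDH": 1.0}
dbn1_raw = 4.0

# The geometric mean of (2, 4, 1) is the cube root of 8, i.e. about 2.
dbn1_normalized = normalize_to_housekeeping(dbn1_raw, list(reference.values()))
print(round(dbn1_normalized, 6))
```

The geometric mean is less sensitive than the arithmetic mean to a single highly expressed reference gene, which is the usual motivation for combining several housekeeping genes this way.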

      Drebrin levels being lower in KO seems to be the strongest result of the paper so far (shame no pLRRK2 or coIP of drebrin to back up the argument). DrebrinA KO mice have normal spines, what about haploinsufficient drebrin mice (LKO seem to have half drebrin, but only as youngsters?)

      As extensively explained in the public review, we used Dbn1 KO mouse brains and were able to show reduced Lrrk2 activity.

      Figure 6. hIPSC-derived cortical neurons. The WT 'cortical' neurons have a very low mEPSC frequency at 0.2Hz relative to KO. Is this because they are more or less mature? What is the EPSC frequency of these cells at 30 and 90 days for comparison? Also, it is very very hard to infer anything about mEPSC frequency in the absence of estimates of cell number and more importantly synapse number. Furthermore, where are the details of cell measures such as capacitance, resistance, and quality control e.g., Ra? Table s1 seems redundant here, besides suggesting that the amplitude is higher in KO at base. 

      We agree that the developmental trajectory of iPSC-derived neurons is critical to accurately interpreting synaptic function and plasticity. In response, we have included additional data now presented in the supplementary figure S7 and summarize key findings below:

      At DIV50, both WT and LRRK2 KO neurons exhibit low basal mEPSC activity (~0.5 Hz) and no response to 24 h BDNF stimulation (50 ng/mL).

      At DIV70 WT neurons show very low basal activity (~0.2 Hz), which increases ~7.5-fold upon BDNF treatment (1.5 Hz; p < 0.001), and no change in synapse number. KO neurons display elevated basal activity (~1 Hz) similar to BDNF-treated WT neurons, with no further increase upon BDNF exposure (~1.3 Hz) and no change in synapse number.

      At DIV90, BDNF had no significant effect in either WT or KO neurons, indicating a possible saturation of plastic responses. The lack of BDNF response at DIV90 may be due to endogenous BDNF production or culture-based saturation effects. While these factors warrant further investigation (e.g., ELISA, co-culture systems), they do not confound the key conclusions regarding the role of LRRK2 in synaptic development and plasticity:

      LRRK2 Enables BDNF-Responsive Synaptic Plasticity. In WT neurons, BDNF induces a significant increase in neurotransmitter release (mEPSC frequency) with no reduction in synapse number. This dissociation suggests BDNF promotes presynaptic functional potentiation. KO neurons fail to show changes in either synaptic function or structure in response to BDNF, indicating that LRRK2 is required for activity-dependent remodeling.

      LRRK2 Loss Accelerates Synaptic Maturation. At DIV70, KO neurons already exhibit high spontaneous synaptic activity equivalent to BDNF-stimulated WT neurons. This suggests that LRRK2 may act to suppress premature maturation and temporally gate BDNF responsiveness, aligning with the differences in maturation dynamics observed in KO mice (Figure 5).  

      As suggested by the reviewer we reported the measurement of resistance and capacitance for all DIV (Table 1, supplemental material). A reduction in capacitance was observed in WT neurons at DIV90, which may reflect changes in membrane complexity. However, this did not correlate with differences in synapse number and is unlikely to account for the observed differences in mEPSC frequency. To control for cell number between groups, cell count prior to plating was performed (80k/cm2; see also methods) on the non-dividing cells to keep cell number consistent.

      The presence of BDNF in WT seems to make them look like LKO, in the rest of the paper the suggestion is that the LKO lack a response to BDNF. Here it looks like it could be that BDNF signalling is saturated in LKO, or they are just very different at base and lack a response.

      Knowing which is important to the conclusions, and acute application (recording and BDNF wash-in) would be much more convincing.

      We agree with the Reviewer’s point that saturation of BDNF could influence the interpretation of the data if it were to occur. However, it is important to note that no BDNF is present in the media of control and KO neuronal cultures at baseline. This is different from other culture conditions and allows us to investigate the effects of BDNF treatment. Thus, the increased mEPSC frequency observed in KO neurons compared to WT neurons is determined only by the deletion of the gene and not by other extrinsic factors, which were kept consistent between the groups. The lack of response or change in mEPSC frequency in KO is proposed to be a compensatory mechanism due to the loss of LRRK2. Of note, LRRK2 as a “synaptic break” has already been described (Beccano-Kelly et al., Hum Mol Gen, 2015). However, a comprehensive analysis of the underlying molecular mechanisms will require future studies beyond the scope of this paper.

      "The LRRK2 kinase substrates Rabs are not present in the list of significant phosphopeptides, likely due to the low stoichiometry and/or abundance" Likely due to the fact mass spec does not get anywhere near everything. 

      We removed this sentence in light of the new phosphoproteomic analysis.

      Figure 7 is pretty stand-alone, and not validated in any way, hard to justify its inclusion?  

      As extensively explained, we removed Figure 7 and included the new phospho-MS as part of Figure 3.

      Writing throughout shows a very selective and shallow use of the literature.  

      We extensively reviewed the citations.

      "while Lrrk1 transcript in this region is relatively stable during development" The authors reference a very old paper that barely shows any LRRK1 mRNA, and no protein. Others have shown that LRRK1 is essentially not present postnatally PMC2233633. This isn't even an argument the authors need to make. 

      We thank the reviewer and have included this more appropriate citation.

      Reviewer #2 (Recommendations For The Authors): 

      Cyfip1 (Fig 3A) is part of the WAVE complex (page 13). 

      We thank the reviewer and specified it.

      The discussion could be more focused. 

      We extensively revised the discussion to keep it more focused.

      Note that we updated the GO ontology analyses to reflect the updated information present in g:Profiler.

      References.

      Nirujogi, R. S., Tonelli, F., Taylor, M., Lis, P., Zimprich, A., Sammler, E., & Alessi, D. R. (2021). Development of a multiplexed targeted mass spectrometry assay for LRRK2-phosphorylated Rabs and Ser910/Ser935 biomarker sites. The Biochemical journal, 478(2), 299–326. https://doi.org/10.1042/BCJ20200930

      Worth, D. C., Daly, C. N., Geraldo, S., Oozeer, F., & Gordon-Weeks, P. R. (2013). Drebrin contains a cryptic F-actin-bundling activity regulated by Cdk5 phosphorylation. The Journal of cell biology, 202(5), 793–806. https://doi.org/10.1083/jcb.201303005

      Shirao, T., Hanamura, K., Koganezawa, N., Ishizuka, Y., Yamazaki, H., & Sekino, Y. (2017). The role of drebrin in neurons. Journal of neurochemistry, 141(6), 819–834. https://doi.org/10.1111/jnc.13988

      Koganezawa, N., Hanamura, K., Sekino, Y., & Shirao, T. (2017). The role of drebrin in dendritic spines. Molecular and cellular neurosciences, 84, 85–92. https://doi.org/10.1016/j.mcn.2017.01.004

      Meixner, A., Boldt, K., Van Troys, M., Askenazi, M., Gloeckner, C. J., Bauer, M., Marto, J. A., Ampe, C., Kinkl, N., & Ueffing, M. (2011). A QUICK screen for Lrrk2 interaction partners--leucine-rich repeat kinase 2 is involved in actin cytoskeleton dynamics. Molecular & cellular proteomics: MCP, 10(1), M110.001172. https://doi.org/10.1074/mcp.M110.001172

      Parisiadou, L., & Cai, H. (2010). LRRK2 function on actin and microtubule dynamics in Parkinson disease. Communicative & integrative biology, 3(5), 396–400. https://doi.org/10.4161/cib.3.5.12286

      Chen, C., Masotti, M., Shepard, N., Promes, V., Tombesi, G., Arango, D., Manzoni, C., Greggio, E., Hilfiker, S., Kozorovitskiy, Y., & Parisiadou, L. (2024). LRRK2 mediates haloperidol-induced changes in indirect pathway striatal projection neurons. bioRxiv : the preprint server for biology, 2024.06.06.597594. https://doi.org/10.1101/2024.06.06.597594

      Cheng, J., Novati, G., Pan, J., Bycroft, C., Žemgulytė, A., Applebaum, T., Pritzel, A., Wong, L. H., Zielinski, M., Sargeant, T., Schneider, R. G., Senior, A. W., Jumper, J., Hassabis, D., Kohli, P., & Avsec, Ž. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science (New York, N.Y.), 381(6664), eadg7492. https://doi.org/10.1126/science.adg7492

      Beaudoin, G. M., 3rd, Schofield, C. M., Nuwal, T., Zang, K., Ullian, E. M., Huang, B., & Reichardt, L. F. (2012). Afadin, a Ras/Rap effector that controls cadherin function, promotes spine and excitatory synapse density in the hippocampus. The Journal of neuroscience : the official journal of the Society for Neuroscience, 32(1), 99–110. https://doi.org/10.1523/JNEUROSCI.4565-11.2012

      Fernández, B., Chittoor-Vinod, V. G., Kluss, J. H., Kelly, K., Bryant, N., Nguyen, A. P. T., Bukhari, S. A., Smith, N., Lara Ordóñez, A. J., Fdez, E., Chartier-Harlin, M. C., Montine, T. J., Wilson, M. A., Moore, D. J., West, A. B., Cookson, M. R., Nichols, R. J., & Hilfiker, S. (2022). Evaluation of Current Methods to Detect Cellular Leucine-Rich Repeat Kinase 2 (LRRK2) Kinase Activity. Journal of Parkinson's disease, 12(5), 1423–1447. https://doi.org/10.3233/JPD-213128

      Cirnaru, M. D., Marte, A., Belluzzi, E., Russo, I., Gabrielli, M., Longo, F., Arcuri, L., Murru, L., Bubacco, L., Matteoli, M., Fedele, E., Sala, C., Passafaro, M., Morari, M., Greggio, E., Onofri, F., & Piccoli, G. (2014). LRRK2 kinase activity regulates synaptic vesicle trafficking and neurotransmitter release through modulation of LRRK2 macromolecular complex. Frontiers in molecular neuroscience, 7, 49. https://doi.org/10.3389/fnmol.2014.00049

      Belluzzi, E., Gonnelli, A., Cirnaru, M. D., Marte, A., Plotegher, N., Russo, I., Civiero, L., Cogo, S., Carrion, M. P., Franchin, C., Arrigoni, G., Beltramini, M., Bubacco, L., Onofri, F., Piccoli, G., & Greggio, E. (2016). LRRK2 phosphorylates pre-synaptic N-ethylmaleimide sensitive fusion (NSF) protein enhancing its ATPase activity and SNARE complex disassembling rate. Molecular neurodegeneration, 11, 1. https://doi.org/10.1186/s13024-015-0066-z

      Martin, E. R., Gandawijaya, J., & Oguro-Ando, A. (2022). A novel method for generating glutamatergic SH-SY5Y neuron-like cells utilizing B-27 supplement. Frontiers in pharmacology, 13, 943627. https://doi.org/10.3389/fphar.2022.943627

      Kovalevich, J., & Langford, D. (2013). Considerations for the use of SH-SY5Y neuroblastoma cells in neurobiology. Methods in molecular biology (Clifton, N.J.), 1078, 9–21. https://doi.org/10.1007/978-1-62703-640-5_2

      Drummond, N. J., Singh Dolt, K., Canham, M. A., Kilbride, P., Morris, G. J., & Kunath, T. (2020). Cryopreservation of Human Midbrain Dopaminergic Neural Progenitor Cells Poised for Neuronal Differentiation. Frontiers in cell and developmental biology, 8, 578907. https://doi.org/10.3389/fcell.2020.578907

      Tao, X., Finkbeiner, S., Arnold, D. B., Shaywitz, A. J., & Greenberg, M. E. (1998). Ca2+ influx regulates BDNF transcription by a CREB family transcription factor-dependent mechanism. Neuron, 20(4), 709–726. https://doi.org/10.1016/s0896-6273(00)81010-7

      El-Husseini, A. E., Schnell, E., Chetkovich, D. M., Nicoll, R. A., & Bredt, D. S. (2000). PSD95 involvement in maturation of excitatory synapses. Science (New York, N.Y.), 290(5495), 1364–1368.

      Glebov OO, Cox S, Humphreys L, Burrone J. Neuronal activity controls transsynaptic geometry. Sci Rep. 2016 Mar 8;6:22703. doi: 10.1038/srep22703. Erratum in: Sci Rep. 2016 May 31;6:26422. doi: 10.1038/srep26422. PMID: 26951792; PMCID: PMC4782104.

      Beccano-Kelly DA, Volta M, Munsie LN, Paschall SA, Tatarnikov I, Co K, Chou P, Cao LP, Bergeron S, Mitchell E, Han H, Melrose HL, Tapia L, Raymond LA, Farrer MJ, Milnerwood AJ. LRRK2 overexpression alters glutamatergic presynaptic plasticity, striatal dopamine tone, postsynaptic signal transduction, motor activity and memory. Hum Mol Genet. 2015 Mar 1;24(5):1336-49. doi: 10.1093/hmg/ddu543. Epub 2014 Oct 24. PMID: 25343991.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      (1) Summary:

      The authors note that it is challenging to perform diffusion MRI tractography consistently in both humans and macaques, particularly when deep subcortical structures are involved. The scientific advance described in this paper is effectively an update to the tracts that the XTRACT software supports. The claims of robustness are based on a very small selection of subjects from a very atypical dMRI acquisition (n=50 from HCP-Adult) and an even smaller selection of subjects from a more typical study (n=10 from ON-Harmony).

      Strengths:

      The changes to XTRACT are soundly motivated in theory (based on anatomical tracer studies) and practice (changes in seeding/masking for tractography), and I think the value added by these changes to XTRACT should be shared with the field. While other bundle segmentation software typically includes these types of changes in release notes, I think papers are more appropriate.

      We would like to thank the reviewer for their assessment and we appreciate the comments for improving our manuscript. We have added new results, sampling from a larger cohort with a typical dMRI protocol (N=50 from UK Biobank), as well as showcasing examples from individual subject reconstructions (Supplementary figures S6, S7). We also demonstrate comparisons against another approach that has been proposed for extracting parts of the cortico-striatal bundle in a bundle segmentation fashion, as the reviewer suggests (see comment and Author response image 1 below). 

      We would also like to take the opportunity to summarise the novelty of our contributions, as detailed in the Introduction, which we believe extend beyond a mere software update; this is a byproduct of this work rather than the aim.

      i) We devise for the first time standard-space protocols for 21 challenging cortico-subcortical bundles for both human and macaque and we interrogate them in a comprehensive manner.

      ii) We demonstrate robustness of these protocols using criteria grounded on neuroanatomy, showing that tractography reconstructions follow topographical principles known from tracers both in WM and GM and for both species. We also show that these protocols capture individual variability as assessed by respecting family structure in data from the HCP twins.

      iii) We use high-resolution dMRI data (HCP and post-mortem macaque) to showcase feasibility of these reconstructions, and we show that reconstructions are also plausible with more conventional data, such as the ones from the UK Biobank.

      iv) We further showcase robustness and the value of cross-species mapping by using these tractography reconstructions to predict known homologous grey matter (GM) regions across the two species, both in cortex and subcortex, on the basis of similarity of grey matter areal connection patterns to the set of proposed white matter bundles.

      Weaknesses

      (2) The demonstration of the new tracts does not include a large number of carefully selected scans and is only compared to the prior methods in XTRACT. The small n and limited statistical comparisons are insufficient to claim that they are better than an alternative. Qualitatively, this method looks sound.

      We appreciate the suggestion for larger sample size, so we performed the same analysis using 50 randomly drawn UK Biobank subjects, instead of ON-Harmony, matching the N=50 randomly drawn HCP subjects (detailed explanation in the comment below, Main text Figure 4A; Supplementary Figures S4). We also generated results using the full set of N=339 HCP unrelated subjects (Supplementary Figure S5 compares 10, 50 and 339 unrelated HCP subjects). We provide further details in the relevant point (3) below. 

      With regard to comparisons with other methods, there are few analogous approaches that we can compare against. To our knowledge, there are no previous cross-species, standard-space tractography protocols for the tracts we considered in this study (including Muratoff, amygdalofugal, different parts of the extreme and external capsules, along with their neighbouring tracts). We therefore i) directly compared against independent neuroanatomical knowledge and patterns (Figures 2, 3, 5), ii) confirmed that the new tracts show patterns with respect to data quality and individual variability similar to those observed for the more established cortical tracts (Figure 4), and iii) indirectly assessed efficacy by performing a demanding task, such as homologue identification on the basis of the tracts we reconstruct (Figures 6, 7).

      We need to point out that our approach is not “bundle segmentation”, in the sense of “data-driven” approaches that cluster streamlines into bundles following full-brain tractography. The latter is different in spirit and assigns a label to each generated streamline; as full-brain tractography is challenging (Maier-Hein, Nature Comms 2017), we instead follow the approach of imposing anatomical constraints to mitigate some of these challenges, as suggested in (Maier-Hein, 2017).

      Nevertheless, we used TractSeg (one of the few alternatives that considers cortico-striatal bundles) to perform some comparisons. Author response image 1 below shows average path distributions across 10 HCP subjects for a few bundles that we also reconstruct in our paper (no temporal part of the striatal bundle is generated by TractSeg). We can observe that the TractSeg output for each tract is highly overlapping across subjects, indicating that little individual variability is captured. We also see reduced specificity in the connectivity end-points of the bundles.

      Author response image 1.

      Comparison between 10-subject averages for example subcortical tracts using TractSeg and XTRACT. We chose example bundles shared between our set and TractSeg. Per subject, TractSeg produces a binary mask rather than a path distribution per tract. Furthermore, the mask is highly overlapping across subjects. Where direct correspondence was not possible, we found the closest matching tract. Specifically, we used ST_PREF for StBf, and merged ST_PREC with ST_POSTC to match StBm. There was no correspondence for the temporal part of StB.

      We subsequently performed the twinness test using both TractSeg and XTRACT (Author response image 2), as a way to assess whether aspects of individual variability can be captured. Due to heritability of brain organisation features, we anticipate that monozygotic twins have more similar tract reconstructions compared to dizygotic twins and subsequently non-twin siblings. This pattern is reproduced using our proposed approach, but not using TractSeg, which provides a rather flat pattern.

      Author response image 2.

      Violin plots of the mean pairwise Pearson’s correlations across tracts between 72 monozygotic (MZ) twin pairs, 72 dizygotic (DZ) twin pairs, 72 non-twin sibling pairs, and 72 unrelated subject pairs from the Human Connectome Project, using TractSeg (left) and XTRACT (right). About 12 cortico-subcortical tracts were considered, as closely matched as possible between the two approaches. For TractSeg we considered: 'CA', 'FX', 'ST_FO', 'ST_M1S1' (merged ‘ST_PREC’ and ‘ST_POSTC’ to approximate the sensorimotor part of our striatal bundle), 'ST_OCC', 'ST_PAR', 'ST_PREF', 'ST_PREM', 'T_M1S1' (merged ‘T_PREC’ and ‘T_POSTC’ to approximate the sensorimotor part of our striatal bundle), 'T_PREF', 'T_PREM', 'UF'. For XTRACT we considered: 'ac', 'fx', 'StB<sub>f</sub>', 'StB<sub>m</sub>', 'StB<sub>p</sub>', 'StB<sub>t</sub>', 'EmC<sub>f</sub>', 'EmC<sub>p</sub>', 'EmC<sub>t</sub>', 'MB', 'amf', 'uf'. The mean (μ) and standard deviation (σ) are shown for each group. There were no significant differences between groups using TractSeg.
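As a rough illustration of the similarity measure named in this caption, the sketch below computes, for one pair of subjects, the Pearson correlation per tract and then averages across tracts. It is only a sketch under assumptions: each tract is represented as a flattened list of voxel values in a common space, and the function names and toy data are hypothetical rather than taken from the XTRACT or TractSeg code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def pair_similarity(subject_a, subject_b):
    """Mean per-tract Pearson correlation over the tracts both subjects share.

    Each subject is a dict mapping a tract name to flattened voxel values.
    """
    tracts = sorted(set(subject_a) & set(subject_b))
    if not tracts:
        raise ValueError("subjects share no tracts")
    return sum(pearson(subject_a[t], subject_b[t]) for t in tracts) / len(tracts)


# Toy data: two tracts, four voxels each (hypothetical values).
twin_a = {"fx": [0, 1, 2, 3], "uf": [1, 0, 1, 0]}
twin_b = {"fx": [0, 2, 4, 6], "uf": [1, 0, 1, 0]}
unrelated = {"fx": [3, 2, 1, 0], "uf": [1, 0, 1, 0]}

print(pair_similarity(twin_a, twin_b))
print(pair_similarity(twin_a, unrelated))
```

Repeating this for every pair within a group (MZ, DZ, siblings, unrelated) would give the per-group distributions that a violin plot summarises.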

      Taken together, these results indicate, as a minimum, that the two approaches have potentially different aims. Their different behaviour can be desirable and beneficial for different applications (for instance WM ROI segmentation vs connectivity analysis), but it makes like-for-like comparisons challenging.

      (3) “Subject selection at each stage is unclear in this manuscript. On page 5 the data are described as "Using dMRI data from the macaque (𝑁 = 6) and human brain (𝑁 = 50)". Were the 50 HCP subjects selected to cover a range of noise levels or subject head motion? Figure 4 describes 72 pairs for each of monozygotic, dizygotic, non-twin siblings, and unrelated pairs - are these treated separately? Similarly, NH had 10 subjects, but each was scanned 5 times. How was this represented in the sample construction?”

      We appreciate the suggestions and we agree that some of the choices in terms of group sizes may have been confusing. The short answer is that we did not perform any subject selection; subjects were randomly drawn from what we had available. The 72 twin pairs are simply the maximum number of monozygotic twin pairs available in the HCP cohort, so we used 72 pairs in all categories to match this number in these specific tests. The N=6 animals are good-quality post-mortem dMRI data that were acquired in the past and that we cannot easily expand. For the rest of the points, we have now made the following changes:

      We have replaced our comparison to the ON-Harmony dataset (10 subjects) with a comparison to 50 unrelated UK Biobank subjects (to match the 50 unrelated HCP subject cohort used throughout). Updated results can be seen in Figure 4A and Supplementary Figure S4. This allows a comparison of tractography reconstruction between high quality and more conventional quality data for the same N.

      We looked at QC metrics to ensure our chosen cohorts were representative of the full cohorts we had available. The N=50 unrelated HCP cohort and N=50 unrelated UK Biobank cohort we used in the study captured well the range of the full N=339 unrelated HCP cohort and N=7192 UK Biobank cohort in terms of absolute/relative motion (Author response image 3A and 3B, respectively). A similar pattern was observed in terms of SNR and CNR ranges (Author response image 4).

      We generated tractography reconstructions for single subjects, corresponding to the 10th percentile (P<sub>10</sub>), median (P<sub>50</sub>) and 90th percentile (P<sub>90</sub>) of the distributions with respect to similarity to the cohort average maps. These are now shown in Supplementary Figures S6, S7. We also checked the QC metrics for these single subjects and confirmed that average absolute subject motion was highest for the P<sub>10</sub>, followed by the P<sub>50</sub> and lowest for the P<sub>90</sub> subject, capturing a range of within-cohort data quality.

      We generated reconstructions for an even larger HCP cohort (all 339 unrelated HCP subjects) and these look very similar to the N=50 reconstructions (Supplementary Figure S5).
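The percentile-based choice of single subjects mentioned in the list above can be sketched as follows. This is an illustrative sketch only: it assumes one similarity score per subject (similarity to the cohort average map) and uses a nearest-rank percentile rule, which is an assumption and not necessarily the rule the authors used. The function name and subject IDs are hypothetical.

```python
def pick_percentile_subjects(similarity_by_subject, percentiles=(10, 50, 90)):
    """Return the subject ID sitting at each requested percentile.

    similarity_by_subject maps a subject ID to its similarity to the
    cohort-average map. Percentiles are integers in 1..100 and are resolved
    with the nearest-rank rule (rank = ceil(p * n / 100)).
    """
    if not similarity_by_subject:
        raise ValueError("no subjects provided")
    # Sort subjects from least to most similar to the cohort average.
    ranked = sorted(similarity_by_subject.items(), key=lambda item: item[1])
    n = len(ranked)
    chosen = {}
    for p in percentiles:
        rank = (p * n + 99) // 100          # integer ceiling of p * n / 100
        index = min(n - 1, max(0, rank - 1))
        chosen[p] = ranked[index][0]
    return chosen


# Hypothetical cohort of 10 subjects with similarity scores 0.1 .. 1.0.
similarity = {f"s{i:02d}": i / 10 for i in range(1, 11)}
selected = pick_percentile_subjects(similarity)
print(selected)
```

Picking actual subjects at fixed percentiles, rather than interpolated values, keeps each example tied to a real reconstruction that can be displayed.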

      Author response image 3.

      Subsets chosen from the HCP and UKB reflect a similar range of average motion (relative and absolute) to the corresponding full cohorts. (A) Absolute and relative motion comparison between N=50 and N=339 unrelated HCP subjects. (B) Absolute and relative motion comparison between N=50 and N=7192 super-healthy UKB subjects.

      Author response image 4.

      Average SNR and CNR values show a similar range between the N=50 UKB subset and the full UK Biobank cohort of N=7192.

      (4) In the paper, the authors state "the mean agreement between HCP and NH reconstructions was lower for the new tracts, compared to the original protocols (𝑝 < 10^−10). This was due to occasionally reconstructing a sparser path distribution, i.e., slightly higher false negative rate," - how can we know this is a false negative rate without knowing the ground truth?

      We apologise for the confusing terminology, which we have now corrected. Indeed, we cannot call it a false negative; what we meant is that reconstructions of these bundles from lower-resolution data were in general sparser than those from the high-resolution data, potentially missing parts of the tract. We have now revised the text accordingly.

      Reviewer #2 Public Review:

      (5) Summary:

      In this article, Assimopoulos et al. expand the FSL-XTRACT software to include new protocols for identifying cortical-subcortical tracts with diffusion MRI, with a focus on tracts connecting to the amygdala and striatum. They show that the amygdalofugal pathway and divisions of the striatal bundle/external capsule can be successfully reconstructed in both macaques and humans while preserving large-scale topographic features previously defined in tract tracing studies. The authors set out to create an automated subcortical tractography protocol, and they accomplished this for a subset of specific subcortical connections for users of the FSL ecosystem.

      Strengths:

      A main strength of the current study is the translation of established anatomical knowledge to a tractography protocol for delineating cortical-subcortical tracts that are difficult to reconstruct. Diffusion MRI-based tractography is highly prone to false positives; thus, constraining tractography outputs by known anatomical priors is important. Key additional strengths include 1) the creation of a protocol that can be applied to both macaque and human data; 2) demonstration that the protocol can be applied to high quality data (3 shells, > 250 directions, 1.25 mm isotropic, 55 minutes) and lower quality data (2 shells, 100 directions, 2 mm isotropic, 6.5 minutes); and 3) validation that the anatomy of cortical-subcortical tracts derived from the new method is more similar in monozygotic twins than in siblings and unrelated individuals.

      We thank the Reviewer for the globally positive evaluation of this work and the pertinent comments that have helped us to improve the paper.

      Weaknesses

      (6) Although this work validates the general organizational location and topographic organization of tractography-derived cortical-subcortical tracts against prior tract tracing studies (a clear strength), the validation is purely visual and thus only qualitative. Furthermore, it is difficult to assess how the current XTRACT method may compare to currently available tractography approaches to delineating similar cortical-subcortical connections. Finally, it appears that the cortical-subcortical tractography protocols developed here can only be used via FSL-XTRACT (yet not with other dMRI software), somewhat limiting the overall accessibility of the method.

      We agree that a more quantitative comparison against gold standard tracing data would be ideal. However, there are practical challenges that prohibit such a comparison at this stage: i) Access to data. There are no quantifiable, openly shared, large scale/whole brain tracing data available. The Markov study provided the only openly available weighted connectivity matrices measured by tracers in macaques (Markov, Cereb Cortex 2014), which are only cortico-cortical and do not provide the white matter routes; they only quantify the relative contrast in connection terminals. ii) 2D microscopy vs 3D tractography. The vast majority of tracing data one can find in neuroanatomy labs is on 2D microscopy slices with restricted field of view, which is also the case for the data we had access to for this study. This significantly complicates like-for-like comparisons against 3D whole-brain tractography reconstructions. iii) Quantifiability is tricky even in the case of gold standard axonal tracing, as it depends on nuisance factors, e.g. injection site, injection size, injection uniformity and coverage, which confound the gold-standard measurements but are not relevant for tractography. For these reasons, a number of high-profile NIH BRAIN CONNECTS Centres (for instance https://connects.mgh.harvard.edu/, https://mesoscaleconnectivity.org/) are resourced to address these challenges at scale in the coming years and provide the tools to the community to perform such quantitative comparisons in the future.

      In terms of comparison with other approaches, we have performed new tests and detail a response to a similar comment (2) from Reviewer 1.

      Finally, our protocols have been FSL-tested, but have nothing that is FSL-specific. We cannot speak of performance when used with other tools, but there is nothing that prohibits translation of these standard space protocols to other tools. In fact, the whole idea behind XTRACT was to generate an approach open to external contributions for bundle-specific delineation protocols, both for humans and for non-human species. A number of XTRACT extensions have been published over the last 5 years for other NHP species (Roumazeilles et al. (2020); Bryant et al. (2020); Wang et al. (2025)), and similar approaches have been used in commercial packages (Boshkovski et al, 2106, ISMRM 2022).

      Recommendations To the Authors:

      (7) Superiority of the FSL-XTRACT approach to delineating cortical-subcortical tracts. The Introduction of the article describes how "Tractography protocols for white matter bundles that reach deeper subcortical regions, for instance the striatum or the amygdala, are more difficult to standardize" due to the size, proximity, complexity, and bottlenecks associated with cortical-subcortical tracts. It would be helpful for the authors to better describe how the analytic approach adopted here overcomes these various challenges. What does the present approach do differently than prior efforts to examine cortical-subcortical connectivity?

      There have not been many prior efforts to standardise cortico-subcortical connectivity reconstructions, as we overview in the Introduction. As outlined in (Schilling et al. (2020), https://doi.org/10.1007/s00429-020-02129-z), tractography reconstructions can be highly accurate if we guide them using constraints that dictate where pathways are supposed to go and where they should not go. This is the philosophy behind XTRACT and all the proposed protocols, which provide neuroanatomical constraints across different bundles. At the same time, these constraints are relatively coarse so that they are species-generalisable. We have clarified that in the Discussion. The approach we took was to first identify anatomical constraints from neuroanatomy literature for each tract of interest independently, derive and test these protocols in the macaque, and then optimise in an iterative fashion until the protocols generalise well to humans and until, when considering groups of bundles, the generated reconstructions can follow topographical principles known from tract tracing literature. This process took years, as we performed these iterations as meticulously as we could. We have modified the first sections in Methods to reflect this better (3rd paragraph of 1st Methods section), as well as modified the third and second to last paragraphs of the Introduction (“We propose an approach that addresses these challenges…”).

      (8) Relatedly, it is difficult to fully evaluate the utility of the current approach to dissecting cortical-subcortical tracts without a qualitative or quantitative comparison to approaches that already exist in the field. Can the authors show that (or clarify how) the FSL-XTRACT approach is similar to - or superior to - currently available methods for defining cortical-striatal and amygdalofugal tracts (e.g., methods they cite in the Introduction)?

      Of the few similar approaches that exist, we performed some comparisons against TractSeg; please see Reply to Comment 2 from Reviewer 1. We have also expanded the relevant text in the Introduction to clarify the differences:

      “…However, these either utilise labour-intensive single-subject protocols (22,26), are not designed to be generalisable across species (42, 43), or are based mostly on geometrically-driven parcellations that do not necessarily preserve topographical principles of connections (40). We propose an approach that addresses these challenges and is automated, standardised, generalisable across two species and includes a larger set of cortico-subcortical bundles than considered before, yielding tractography reconstructions that are driven by neuroanatomical constraints.”

      (9) Future applications of the tractography protocol:

      It would be helpful for the authors to describe the contexts in which the automated tractography approach developed here can (and cannot) be applied in future studies. Are future applications limited to diffusion data that has been processed with FSL's BEDPOSTX and PROBTRACKX? Can FSL-XTRACT take in diffusion data modelled in other software (e.g., with CSD in mrtrix or with GQI in DSI Studio)? Can the seed/stop/target/exclusion ROIs be applied to whole-brain tractography generated in other software? Integration with other software suites would increase the accessibility of the new tract dissection protocols.

      We have added some text in the Discussion to clarify this point. Our protocols have been FSL-tested, but have nothing that is FSL-specific. We cannot speak of performance of other tools, but there is nothing that prohibits translation of these standard space protocols to other tools. As described before, the protocols are recipes with anatomical constraints including regions the corresponding white matter pathways connect to and regions they do not, constructed with cross-species generalisability in mind. In fact, a number of other packages (even commercial) have adopted the XTRACT protocols with success in the past, so we do not see anything in principle that prevents these new protocols from being similarly adopted.

      We cannot comment on the protocols’ relevance for segmenting whole-brain tractograms, as these can induce more false positives than tractography reconstructions from smaller seed regions and may require stricter exclusions.

      (10) It was great to see confirmation that the XTRACT approach can be successfully applied in both high-quality diffusion data from the HCP and in the ON-Harmony data. Given the somewhat degraded performance in the lower quality dataset (e.g., Figure 4A), can the authors speak to the minimum data requirements needed to dissect these new cortical-subcortical tracts? Will the approach work on single-shell, low b data? Is there a minimum voxel resolution needed? Which tracts are expected to perform best and worst in lower-quality data?

      Thank you for these comments. Although we have not tested data with lower (spatial and angular) resolution, given the proximity of the tracts considered, as well as the small size of some bundles, we would not recommend a lower resolution than that of the UK Biobank protocol. In general, we would consider the UK Biobank protocol (2mm, 2 shells) as the minimum, and any modern clinical scanner can achieve this in 6-8 minutes. We hence evaluated performance from high quality HCP to lower quality UK Biobank data, covering a considerable range (scan time from 55 minutes down to 6 minutes).

      In terms of which tract reconstructions were more reproducible for UK Biobank data, the tracts with the lowest correlations across subjects (Figure 4) were the anterior commissure (AC) and the temporal part of the Extreme Capsule (EmC<sub>t</sub>), while the highest correlations were for the Muratoff Bundle (MB) and the temporal part of the Striatal Bundle (StB<sub>t</sub>). Interestingly, for the HCP data, the temporal part of the Extreme Capsule (EmC<sub>t</sub>) and the Muratoff Bundle were also the tracts with the lowest/highest correlations, respectively. Hence, certain tract reconstructions were consistently more variable than others across subjects, which may hint that they are also more challenging to reconstruct. We have now clarified these aspects in the corresponding Results section.

      (11) Anatomical validation of the new cortical-subcortical tracts

      I really appreciated the use of prior tract tracing findings to anatomically validate the cortical-subcortical tractography outputs for both the cortical-striatal and amygdalofugal tracts. It struck me, however, that the anatomical validation was purely qualitative, focused on the relative positioning or the topographical organization of major connections. The anatomical validation would be strengthened if profiles of connectivity between cortical regions and specific subcortical nuclei or subcortical subdivisions could be quantitatively compared, if at all possible. Can the differential connectivity shown visually for the putamen in Figure 3 be quantified for the tract tracing data and the tractography outputs? Does the amygdalofugal bundle show differential/preferential connectivity across amygdala nuclei in tract tracing data, and is this seen in tractography?

      We appreciate the comment; please see Reply to your comment 6 above. In addition to the challenges described there, we do not have access to terminal fields other than in the striatum, and these are 2D, so we make a qualitative comparison of the relevant connectivity contrasts. We expect that a number of currently ongoing high-profile BRAIN CONNECTS Centres (such as the LINC and the CMC) will be addressing such challenges in the coming years and will provide the tools and data to the community to perform such quantitative comparisons at scale.

      (12) I believe that all visualizations of the macaque and human tractography showed group-averaged maps. What do these tracts look like at the individual level? Understanding individual-level performance and anatomical variation is important, given the Discussion paragraph on using this method to guide neuromodulation.

      We now demonstrate some representative examples of individual subject reconstructions in Supplementary Figures S6, S7, ranking subjects by the average agreement of individual tract reconstructions to the mean and depicting the 10th percentile, median and 90th percentile of these subjects. We have also shown more results in Author response images 1-2, generated by TractSeg, to indicate how a different bundle segmentation approach would handle individual variability compared to our approach.
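
      As a rough illustration of the subject selection described above (a sketch, not the authors' code), subjects can be ranked by a per-subject agreement score and the 10th percentile, median and 90th percentile subjects picked from that ranking. The agreement values, the subject IDs and the nearest-rank percentile rule are all assumptions made for this example.

```python
def pick_representatives(agreement, percentiles=(10, 50, 90)):
    """Return one subject ID per requested percentile of the agreement ranking.

    `agreement` maps a subject ID to that subject's average agreement with
    the group-mean reconstruction (hypothetical values in this sketch).
    Percentiles are mapped to ranks with a simple nearest-rank rule.
    """
    ranked = sorted(agreement, key=agreement.get)  # lowest agreement first
    last = len(ranked) - 1
    picks = {}
    for p in percentiles:
        idx = min(last, max(0, round(p / 100 * last)))
        picks[p] = ranked[idx]
    return picks


# Toy data: 11 made-up subjects with steadily increasing agreement scores.
toy_agreement = {f"s{i:02d}": 0.5 + 0.04 * i for i in range(11)}
representatives = pick_representatives(toy_agreement)
print(representatives)
```

      Showing subjects from both tails of the ranking, rather than only the best cases, gives a fairer picture of individual-level variability.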

      (13) Connectivity-based comparisons across species:

      Figures 5 and 6 of the manuscript show that, as compared to using only cortico-cortical XTRACT tracts, using the full set of XTRACT tracts (with new cortical-subcortical tracts) allows for more specific mapping of homologous subcortical and cortical regions across humans and macaques. Is it possible that this result is driven by the fact that the "connectivity blueprints" for the subcortex did not use an intermediary GM x WM matrix to identify connection patterns, whereas the connectivity blueprints for the cortex did? I was surprised that a whole brain GM x WM connectivity matrix was used in the cortical connectivity mapping procedure, given known problems with false positives etc., when doing whole brain tractography - especially after such anatomical detail was considered when deriving the original tracts. Perhaps the intermediary step lowers connectivity specificity and accuracy overall (as per Figure 9), accounting for the poorer performance for cortico-cortical tracts?

      The point is well taken; however, it cannot drive the results in Figures 5 and 6. Before explaining this further, let us clarify the rationale of using the GMxWM connectivity matrix, which we have published quite extensively in the past for cortico-cortical connections (Mars, eLife 2018; Warrington, Neuroimage 2020; Roumazeilles, PLoS Biology 2020; Warrington, Science Advances 2022; Bryant, J Neuroscience 2025).

      Having established the bodies of the tracts using the XTRACT protocols, we use this intermediate step of multiplying with a GM x WM connectivity matrix to estimate the grey matter projections of the tracts. The most obvious approach of tracking towards the grey matter (i.e. simply finding where tracts intersect GM) has the problem that one moves through bottlenecks in the cortical gyrus, after which fibres fan out. Most tractography algorithms have problems resolving this fanning. We therefore take the opposite approach of tracking from the grey matter surface towards the white matter (GMxWM connectivity matrix), thus following the direction in which the fibres are expected to merge, rather than to fan out. We then multiply the GMxWM tractogram with that of the body of the tract to identify the grey matter endpoints of the tract. This avoids some of the major problems associated with tracking towards the surface. In fact, using this approach improves connectivity specificity towards the cortex, rather than the opposite. We provide some indicative results here for a few tracts:

      Author response image 5.

      Connectivity profiles for example cortico-cortical tracts with and without using the intermediary GMxWM matrix. Tracts considered are the Superior Longitudinal Fasciculus 1 (SLF<sub>1</sub>), Superior Longitudinal Fasciculus 2 (SLF<sub>2</sub>), the Frontal Aslant (FA) and the Inferior Fronto-Occipital Fasciculus (IFO). We see that the surface connectivity patterns without using the GMxWM intermediary matrix are more diffuse (effect of “fanning out” gyral bias), with reduced specificity, compared to when using the GMxWM matrix.

      Tracking to/from subcortical nuclei does not have the same tractography challenges as tracking towards the cortex and in fact we found that using the intermediary GMxWM matrix is less favourable for subcortex (Figure 9), which is why we opted for not using it. 

      Regardless of how cortical and subcortical connectivity patterns are obtained, the results in Figures 5 and 6 utilise only cortical connectivity patterns. Hence, no matter what tracts are considered (cortico-cortical or cortico-subcortical) to build the connectivity patterns, these results have always been obtained using the intermediate step of multiplying with the GMxWM connectivity matrix (i.e. it is not the case that cortical features are obtained with the intermediate step and subcortical features without; all of them have the intermediate step applied, as the connectivity patterns consist of cortical endpoints). Figure 9 is only applicable for subcortical endpoints, which play no role in the comparisons shown in Figures 5 and 6. We hope this clarifies this point.
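
      The intermediate step described in this reply can be pictured as a matrix product. The sketch below is illustrative only and uses made-up toy matrices; the names `gm_by_wm`, `tract_bodies` and `matmul` are hypothetical, and real data would involve far larger, probabilistic matrices.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists (rows of a, columns of b)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy GM x WM matrix: 3 grey matter locations (rows) by 4 white matter
# voxels (columns). Each entry stands for how strongly tracking seeded at
# that grey matter location reaches that white matter voxel.
gm_by_wm = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]

# Toy tract bodies: 4 white matter voxels (rows) by 2 tracts (columns),
# marking which voxels belong to the body of each tract.
tract_bodies = [
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
]

# GM x tract matrix: each row is the connectivity pattern of one grey
# matter location across tracts, i.e. an estimate of the tract endpoints.
gm_endpoints = matmul(gm_by_wm, tract_bodies)
print(gm_endpoints)
```

      In this toy case the first grey matter location connects only to the first tract, the third only to the second, and the middle one to both, which is the kind of per-location pattern the comparison across species relies on.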

      (14) Methodological clarifications:

      The Methods describe how anatomical masks used in tractography were delineated in standard macaque space and then translated to humans using "correspondingly defined landmarks". Can the authors elaborate as to how this translation from macaques to humans was accomplished?

      For a given tract, our process for building a protocol involved looking into the wider anatomical literature, including the standard white matter atlas of Schmahmann and Pandya (2006) and numerous anatomy papers that are referenced in the protocol description, to determine the expected path the tract was meant to take in white matter and which cortical and subcortical regions are connected. This helped us define constraints and subsequently the corresponding masks. The masks were created through the combination of hand-drawn ROIs and standard space atlases. We started with the macaque, where tracer literature is more abundant, but, importantly, our protocol definitions have been designed such that the same protocol can be applied to the human and macaque brain. All choices were made with this aspect in mind, hence corresponding landmarks between the two brains were considered in the mask definition (for instance “the putamen”, “a sub-commissural white matter mask”, the “whole frontal pole” etc, as described in the protocol descriptions).

      The protocols have not been created by a single expert but have been collated from multiple experts (co-authors SA, SW, DF, KB, SH, SS drove this aspect) and the final definitions have been agreed upon by the authors. 

      (15) The article heavily utilizes spatial path distribution maps/normalized path distributions, yet does not describe precisely what these are and how they were generated. Can the authors provide more detail, along with the rationale for using these with Pearson's correlations to compare tracts across subjects (as opposed to, e.g., overlap sensitivity/specificity or the Jaccard coefficient)?

      We have now clarified in the text how these plots are generated, particularly when compared using correlation values. We tried Jaccard indices on binarized masks of the tracts and these gave similar trends to the correlations reported in Figure 4 (i.e. higher similarities within than across cohorts). However, we feel that correlations are better suited than Jaccard indices: the latter assume binary masks, so they focus on spatial overlap and ignore the actual values of the path distributions. We hence kept correlations in the paper.
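
      To illustrate the difference between the two measures (a toy sketch with made-up values, not the analysis used in the paper): two path distributions can overlap perfectly once binarised and still differ in where their high values lie. The helper names and the threshold are assumptions for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of voxel values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def jaccard(x, y, threshold):
    """Jaccard index of two maps after binarising both at `threshold`."""
    mask_x = [v > threshold for v in x]
    mask_y = [v > threshold for v in y]
    union = sum(a or b for a, b in zip(mask_x, mask_y))
    intersection = sum(a and b for a, b in zip(mask_x, mask_y))
    return intersection / union if union else 0.0


# Flattened toy path distributions for two subjects (made-up values).
# Both occupy the same four voxels, but their peaks sit in different places.
subject_a = [0.0, 0.1, 0.8, 0.9, 0.2, 0.0]
subject_b = [0.0, 0.9, 0.2, 0.1, 0.8, 0.0]

overlap = jaccard(subject_a, subject_b, threshold=0.05)  # hypothetical threshold
correlation = pearson(subject_a, subject_b)
print(overlap, correlation)
```

      Here the binarised masks are identical, so the Jaccard index is 1, while the correlation is negative (about -0.2) because the high-probability voxels do not coincide.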

      Reviewing Editor Comments

      “The reviewers had broadly convergent comments and were enthusiastic about the work. As further detailed by Reviewer 3 (see below), if the authors choose to pursue revisions, there are several elements that have the potential to enhance impact.”

      Thank you, we have replied accordingly and aimed to address most of the comments of the Reviewers.   

      “Comparison to existing methods. How does this approach compare to other approaches cited by the authors?”

      Please see replies to Comment 2 of Reviewer 1 and Comment 7 of Reviewer 2. Briefly, we have now generated new results and clarified aspects in the text. 

      “Minimum data requirements. How broadly can this approach be used across scan variation? How does this impact data from individual participants? Displaying individual participants may help, in addition to group maps.”

      Please see replies to Comment 10 of Reviewer 2 on minimum data requirements and individual participants, as well as to Comment 3 of Reviewer 1 on the actual groups considered. Briefly, we have generated new figures and regenerated results using UK Biobank data.

      “Software. What are the software requirements? Is the approach interoperable with other methods?”

      Please see Reply to Comment 9 of Reviewer 2. Our protocols can be used to guide tractography using other types of data, as they consist of guiding ROIs for a given tract. So, although we have not tested them beyond FSL-XTRACT, we believe they can be useful with other tractography packages as well, as there is nothing FSL-specific in these anatomically-informed recipes.

      “Comparisons with tract tracing. To the degree possible, quantitative comparisons with tract tracing data would bolster confidence in the method.”

      Please see Replies to Comments 6 and 11 of Reviewer 2. Briefly, we appreciate the comment and it is something we would love to do, but there are no data readily available that would allow such quantitative comparison in a meaningful way. This is a known challenge in the tractography field, which is why NIH has invested in two 5-year Centres to address it. Our approach will provide a solid starting point for optimising and comparing further cortico-subcortical tractography reconstructions against microscopy and tracers in the same animal and at scale.

    1. You may receive an assignment prompt that asks you to write from your memory, recapturing the experience of reading a special book or text from your childhood or adolescence. Think of this as a chance to recapture something significant from your past, to explore its importance, and to reconstruct it in writing for others to appreciate. Certain books we’ve read live in our memories. When we first read these books or when they were read to us, they spoke to us in some important way. They may still speak to us. Find a book that played an important role in your life when you were a child or an adolescent. Why was it important? What was it like to read this book? Did you read it on your own or did someone read it to you? If someone read it to you, who was it, and what was the experience like? Is there a connection between this book and learning to read on your own? Re-read the book. (If it is long, like Little Women, for example, it is all right to skim it, although you may find yourself re-reading certain parts.) In your essay, use the book as a springboard for your writing by focusing on an insight (a discovery) you have made about the book. Be sure to cite passages and tell the effect they had on you. As you shape your drafts, give attention to organization, the way you build your story. Decide what the reader needs to know in the beginning, and think about the order the events happened and how much to tell the reader at each point. Give attention also to the pictures you create: try to reconstruct key moments by showing what happened rather than merely telling that it happened. Dialogue and scene descriptions often help to make those moments come alive. Finally, give careful thought to the story’s theme or controlling idea.

      brainstorm on how to write a narrative

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      Summary:

      The manuscript by Zhang et al describes the use of a protein language model (pLM) to analyse disordered regions in proteins, with a focus on those that may be important in biological phase separation. While the paper is relatively easy to read overall, my main comment is that the authors could perhaps make it clearer which observations are new, and which support previous work using related approaches. Further, while the link to phase separation is interesting, it is not completely clear which data supports the statements made, and this could also be made clearer.

      We thank the reviewer for their thoughtful evaluation of our manuscript and for the supportive comments. As outlined in the responses below, we have made substantial revisions to clarify the novel observations presented in our study and to strengthen the connection between sequence conservation and phase separation.

      Comment 1: With respect to putting the work in a better context of what has previously been done before, this is not to say that there is not new information in it, but what the authors do is somewhat closely related to work by others. I think it would be useful to make those links more directly.

      We have addressed the specific comments as outlined below.

      Comment 1a: Alderson et al (reference 71) analysed in detail the conservation of IDRs (via pLDDT, which is itself related to conservation) to show, for example, that conserved residues fold upon binding. This analysis is very similar to the analysis used in the current study (using ESM2 as a different measure of conservation). Thus, the result that "Given that low ESM2 scores generally reflect mutational constraint in folded proteins, the presence of region a among disordered residues suggests that certain disordered amino acids are evolutionarily conserved and likely functionally significant" is in some ways very similar to the results of that (Alderson et al) paper.

      We thank the reviewer for the comment. However, we would like to clarify that our findings show subtle but important differences from those reported by Alderson et al. Specifically, Alderson et al. used AlphaFold2 predictions to identify IDRs that undergo disorder-to-order transitions, which the authors termed conditionally folded IDRs. These regions could potentially be functionally important, assuming that the function of IDRs necessitates folding.

      We argue, however, that the validity of this structure-function relationship for IDRs remains to be tested. In our opinion, the most direct way to evaluate functional significance is to evaluate evolutionary conservation.

      As shown in Author response image 1, the correlation between pLDDT scores and the conservation score, while noticeable, is significantly weaker than that between the ESM2 score and the conservation score.

      Author response image 1.

      Comparison of the correlation between AlphaFold2 pLDDT scores and conservation scores with the correlation between ESM2 scores and conservation scores. Calculations were performed using proteins in the MLO-hProt dataset. (A) Correlation between the mean AlphaFold2 pLDDT scores and conservation scores for various amino acids. Pearson correlation coefficients (r) are indicated in the figure legends. The four panels on the right present analogous correlation plots for amino acids grouped by structural order, as defined by their pLDDT scores. (B) Same as in part A, but for ESM2 scores.

      Therefore, we believe that ESM2 score is a better indicator than AlphaFold2 pLDDT score for functional relevance.

      Furthermore, for the human IDRs, we explicitly selected amino acids with pLDDT scores ≤ 70.

      These would be classified as structureless, disordered amino acids, according to the study by Alderson et al. Nevertheless, as shown in Figures 2 and 3 of the main text, our analyses still identify conserved regions. Therefore, these regions may function via mechanisms distinct from the disorder-to-order transition.

      We now discuss the novelty of our work in the context of existing studies in the newly added Conclusions and Discussion: Related Work, as quoted below.

      “Numerous studies have sought to identify functionally relevant amino acid groups within IDRs [cite]. For instance, using multiple sequence alignment, several groups have identified evolutionarily conserved residues that contribute to phase separation [cite]. Alderson et al. employed AlphaFold2 to detect disordered regions with a propensity to adopt structured conformations, suggesting potential functional relevance [alderson et al].

      In contrast, our approach based on ESM2 is more direct: it identifies conserved residues without relying on alignment or presupposing that functional significance requires folding into stable 3D structures. Notably, many of the conserved residues identified in our analysis exhibit low pLDDT scores (Figure 2), implying potential functional roles independent of stable conformations.”

      Comment 1b: Dasmeh et al, Lu et al and Ho & Huang analysed conservation in IDRs, including aromatic residues and their role in phase separation.

      We thank the reviewer for bringing these works to our attention! We now explicitly discuss these studies in both the Discussion section as mentioned above and in the Introduction as quoted below.

      “Evolutionary analysis of IDRs is challenging due to difficulties in sequence alignment [cite], though several studies have attempted alignment of disordered proteins with promising results [Dasmeh et al, Lu et al and Ho & Huang].”

      Comment 1c: A number of groups have performed proteome-wide saturation scans using pLMs, including variants of the ESM family, including Meier (reference 89, but cited about something else) and Cagiada et al (https://doi.org/10.1101/2024.05.21.595203) that analysed variant effects in IDRs using a pLM. Thus, I think statements such as "their applicability to studying the fitness and evolutionary pressures on IDRs has yet to be established" should possibly be qualified.

      We added a new paragraph in the Introduction to discuss the application of protein language models to IDRs and cited the suggested references.

      “While protein language models have been widely applied to structured proteins [cite], it is important to emphasize that these models themselves are not inherently biased toward folded domains. For example, the Evolutionary Scale Model (ESM2) [cite] is trained as a probabilistic language model on raw protein sequences, without incorporating any structural or functional annotations. Its unsupervised learning paradigm enables ESM2 to capture statistical patterns of residue usage and evolutionary constraints without relying on explicit structural information. Thus, the success of ESM2 in modeling the mutational landscapes of folded proteins [cite] reflects the model’s ability to learn sequence-level constraints imposed by natural selection, a property that is equally applicable to IDRs if those regions are also under functional selection. Indeed, protein language models are increasingly being used to analyze variant effects in IDRs [cite].”

      Comment 2: On page 4, the authors write, "The conserved residues are primarily located in regions associated with phase separation." These results are presented as a central part of the work, but it is not completely clear what the evidence is.

      We thank the reviewer for this insightful comment. We realized that our wording was not as precise as it should have been. We meant to state that the regions associated with phase separation are significantly enriched in these conserved residues. This is a significant finding and indicates that phase separation could be a source of evolutionary pressure in dictating IDP sequence conservation. However, we do not intend to suggest that phase separation is the only evolutionary pressure.

      The sentence has been revised to

      “Notably, regions associated with phase separation are significantly enriched in these conserved residues.”

      We further replaced the section title "Conserved, Disordered Residues Localize in Regions Driving Phase Separation" with "Regions Driving Phase Separation Are Enriched with Conserved, Disordered Residues" to further clarify our findings and avoid overinterpretation.

      Finally, we revised the following sentence in the discussion

      “Notably, these conserved, disordered residues are predominantly located in regions actively involved in phase separation, contributing to the formation of membraneless organelles.”

      to

      “Notably, regions actively involved in phase separation are enriched with these conserved, disordered residues, supporting their potential role in the formation of membraneless organelles.”

      The submitted manuscript provides clear evidence supporting the enrichment of conserved residues in MLO-driving IDRs. Specifically, Figures 4A and 4C demonstrate that these IDRs exhibit a substantially higher fraction of conserved residues compared to other IDRs involved in phase separation.

      In this analysis, the nMLO-hIDR group serves as a baseline, representing the distribution of conservation in disordered regions lacking MLO-related functions. In contrast, IDRs from MLO-associated groups show a pronounced downward shift in their median and interquartile ranges, indicating stronger evolutionary constraints. Within the dMLO cohort, the degree of conservation follows a distinct gradient: driving residues exhibit the highest levels of conservation, followed by participant residues, with non-participant residues showing values closer to the nMLO baseline. This pattern reflects the relative functional importance of each group in phase separation, with conservation levels corresponding to their roles in MLO scaffolding.

      To further support this, we computed, for each IDR, the fraction of conserved amino acids. As shown in Figure S11B, for IDRs that actively contribute to phase separation, the fraction is indeed higher than for those not involved in phase separation. This analysis is now included in the SI.

      During the revision, we explicitly evaluated whether conserved residues are preferentially located in regions associated with phase separation. To this end, for each protein in the MLO-hProt dataset, we calculated the probability p of finding conserved residues within regions contributing to phase separation. These regions include both "driving" and "participating" segments as defined in Figure 4 of the main text.

      Figure S11A presents the distribution of p across all proteins. For comparison, we also include the distribution of 1− p, representing the probability of finding conserved residues in regions not associated with phase separation. On average, p exceeds 0.5, suggesting a tendency for conserved residues to be more frequently located in phase-separating regions. However, the difference between the two distributions is not statistically significant. This result may be due to the generally low density of conserved residues in IDRs, which makes the estimation of p challenging for individual proteins. Additionally, some conserved sites may be involved in functions unrelated to phase separation.
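
      One possible reading of the quantity p, sketched with toy data: for a single protein, p is the share of its conserved residues that fall inside phase-separating regions, and 1 − p is the share that falls outside. The function name, the boolean-mask representation and the handling of proteins without conserved residues are assumptions of this sketch, not details given in the response.

```python
def conserved_in_ps_fraction(is_conserved, in_ps_region):
    """Share of conserved residues that lie in phase-separating regions.

    Both arguments are per-residue boolean lists of equal length.
    Returns None when the protein has no conserved residues, because
    the fraction is undefined in that case.
    """
    total_conserved = sum(is_conserved)
    if total_conserved == 0:
        return None
    inside = sum(c and r for c, r in zip(is_conserved, in_ps_region))
    return inside / total_conserved


# Toy protein of 8 residues (made-up annotations).
is_conserved = [True, False, True, True, False, False, True, False]
in_ps_region = [True, True, True, True, False, False, False, False]

p = conserved_in_ps_fraction(is_conserved, in_ps_region)
print(p, 1 - p)
```

      With only four conserved residues in this toy protein, moving a single residue changes p by 0.25, which shows how a low density of conserved residues makes per-protein estimates of p noisy.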

      We added the following text to the Discussion section of the main text.

      “We emphasize that the results presented in Figure 4 do not directly demonstrate that conserved residues are preferentially located in regions associated with phase separation. Although these regions are more enriched in conserved amino acids, their total sequence length can be smaller than that of non-phase-separating regions. As a result, the absolute number of conserved residues may still be higher outside phase-separating regions. To quantitatively assess this, we calculated, for each protein in the MLO-hProt dataset, the probability p of finding conserved residues within regions contributing to phase separation. These regions include both "driving" and "participating" segments, as defined in Figure 4 of the main text. Figure S11 shows the distribution of p across all proteins. For comparison, we also present the distribution of 1− p, which reflects the probability of finding conserved residues in non-phase-separating regions. While the average value of p exceeds 0.5, indicating a trend toward conserved residues being more frequently located in phase-separating regions, the difference between the two distributions is not statistically significant. Future studies with expanded datasets may be necessary to clarify this trend.”

      Comment 3: It would be useful with an assessment of what controls the authors used to assess whether there are folded domains within their set of IDRs.

      We acknowledge that our previous labeling may have caused some confusion. Protein sequences used in Figures 2 and 3 include both folded and disordered domains. Results presented in these figures were constructed using full-length protein sequences to highlight the similarities and differences in ESM2 scores between folded and disordered domains.

      In contrast, the analyses presented in Figures 4 and 5 focus exclusively on IDRs to examine their role in phase separation.

      To prevent further confusion, we have renamed the dataset used in Figures 2 and 3 as MLO-hProt, emphasizing that the analysis pertains to entire protein sequences. The term MLO-hIDR is now reserved for a new dataset that includes only disordered residues, as used in Figures 4 and 5, and corresponding SI Figures.

      For the dMLO-IDR dataset, all except one amino acid (P40967, residue G592) are annotated as disordered in the MobiDB database (https://mobidb.org/). This database characterizes disordered regions based on a combination of predictive algorithms and experimental data. As illustrated in Figure S5A, 25.5% of the proteins in the dataset have direct experimental evidence supporting their disorderedness. These experimental annotations are derived from a diverse range of techniques (Figure S5B). For the remaining proteins, disorder was predicted by one or more computational tools. Although not all tools were applied to every protein, each protein in the dataset was identified as disordered by at least one method.

      For human proteins, IDRs were identified based on AlphaFold2 pLDDT scores, using a threshold of 70. As established in prior studies [1, 2], the pLDDT score provides a quantitative measure of local structural confidence, with lower values indicating greater structural disorder. IDRs associated with conditional folding or disorder-to-order transitions generally exhibit high pLDDT values (e.g., >70).
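
      As a rough illustration of the thresholding step, the sketch below groups consecutive low-confidence residues into segments. It assumes a plain list of per-residue pLDDT values and an inclusive cutoff (pLDDT ≤ 70); the function name and the example scores are hypothetical, and whether the cutoff is inclusive is an assumption, since the response uses both "less than 70" and "≤ 70".

```python
from typing import List, Sequence, Tuple


def disordered_segments(
    plddt: Sequence[float], threshold: float = 70.0
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of consecutive residues
    whose pLDDT is at or below `threshold`."""
    segments: List[Tuple[int, int]] = []
    start = None
    for idx, score in enumerate(plddt):
        if score <= threshold:
            if start is None:
                start = idx  # open a new low-confidence segment
        elif start is not None:
            segments.append((start, idx))  # close the current segment
            start = None
    if start is not None:
        segments.append((start, len(plddt)))
    return segments


# Invented per-residue scores for an 8-residue toy sequence.
plddt = [92.1, 88.0, 65.3, 40.2, 55.0, 81.7, 69.9, 95.4]
segments = disordered_segments(plddt)
lengths = [end - start for start, end in segments]
print(segments, lengths)
```

      Collecting segment lengths in this way is also what a length tally of short low-pLDDT stretches would require.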

      Author response image 2 shows a violin plot of AlphaFold2 pLDDT scores for the various MLO-hIDR groups. The consistently low scores support the conclusion that these regions are structurally disordered.

      We also cross-checked the MLO-hIDR regions against the MobiDB database. As shown in Figure S6, approximately 76% of the proteins in the dataset are predicted to contain disordered regions. Among the non-labeled segments with pLDDT scores ≤ 70, the majority are relatively short, with segments of 1–5 residues accounting for approximately 80%.

      Author response image 2.

      AlphaFold pLDDT scores of hIDRs in different MLO-related groups.

      In addition to renaming the dataset, we also revised the manuscript to highlight the validation of disorderedness in the Results section: Regions Driving Phase Separation Are Enriched with Conserved, Disordered Residues.

      “The presence of evolutionarily conserved disordered residues raises the question of their functional significance. To explore this, we identified disordered regions of MLO-hProt using a pLDDT score less than 70 and partitioned these regions into two categories: drivers (dMLO-hIDR), which actively drive phase separation, and clients (cMLO-hIDR), which are present in MLOs under certain conditions but do not promote phase separation themselves [cite]. Additionally, IDRs from human proteins not associated with MLOs, termed nMLO-hIDR, were included as a control. To enhance statistical robustness, we extended our dataset by incorporating driver proteins from additional species [cite], resulting in the expanded dMLO-IDR dataset. Beyond the pLDDT-based classification, the majority of residues in these datasets are also predicted to be disordered by various computational tools and supported by experimental evidence (Figures S5 and S6).”

      Recommendation 1: The authors use the terms "evolutionary fitness of IDRs" (abstract and p. 5, for example), "fitness of amino acids" (p. 4), and "quantify the fitness of particular residues at specific sites" (p. 5). It is not clear what is meant by fitness in this context.

      We thank the reviewer for pointing out the ambiguity in the term fitness. To enhance clarity, we have replaced “fitness” with “mutational tolerance” to more directly emphasize the evolutionary conservation of specific residues.

      Recommendation 2: The authors write (p. 6) "Previous studies have demonstrated a strong correlation between ESM2 scores and changes in free energy related to protein structure stability". While that may be true, it might be worth noting that ESM2 scores report on the effects of mutations and function more broadly than stability because these models have previously been shown to capture conservation effects beyond stability.

      We fully agree with the reviewer’s comment and have revised the main text accordingly. Specifically, the referenced sentence has been revised and relocated, as shown below.

      “Our analysis demonstrated that HP1α’s structured domains consistently yield low ESM2 scores, reflecting strong mutational constraints characteristic of folded regions. These constraints are further evident in the local LLR predictions, as shown in Figure 2B, where we illustrate the folded region G120-T130. Given the functional importance of preserving the 3D structure of structured domains, mutations with greater detrimental effects are likely to disrupt protein folding substantially. This interpretation is consistent with previous studies reporting a significant correlation between ESM2 LLRs and changes in free energy associated with protein structural stability [cite].”

      Recommendation 3: p. 10: The authors write "To exclude sequences that no longer qualify as homologs, we filtered for sequences with at least 20% identity to the reference". How did they decide on 20% and why? And over which residues are these 20% calculated.

      We apologize for the earlier lack of clarity. Sequence alignment was performed using the full-length protein sequences, encompassing both folded and disordered regions. For each sequence, we calculated the percent identity by counting the number of positions, denoted as n, at which the amino acid matches the reference. The percent identity was then computed as n/N, where N represents the total length of the aligned reference sequence. This total includes residues in folded and disordered regions, as well as gap positions introduced during alignment.
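
      The identity calculation n/N can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: both sequences are taken as rows of the same alignment (equal length, gaps written as "-"), positions where both rows carry a gap are not counted as matches, and sequences at exactly 20% are kept. The function names and the toy alignment are hypothetical.

```python
from typing import List


def percent_identity(reference: str, candidate: str, gap: str = "-") -> float:
    """Compute n / N for two aligned rows.

    n counts positions where the candidate's amino acid matches the
    reference (gap-to-gap positions are not counted, an assumption).
    N is the full aligned reference length, gap positions included.
    """
    if len(reference) != len(candidate):
        raise ValueError("aligned rows must have the same length")
    if not reference:
        raise ValueError("reference must not be empty")
    n = sum(1 for r, c in zip(reference, candidate) if r == c and r != gap)
    return n / len(reference)


def filter_homologs(
    reference: str, candidates: List[str], min_identity: float = 0.20
) -> List[str]:
    """Keep aligned rows whose identity to the reference is at least
    `min_identity` (inclusive boundary is an assumption)."""
    return [s for s in candidates if percent_identity(reference, s) >= min_identity]


# Toy alignment of length 10 (sequences invented for illustration).
reference = "MKTA-YLLGE"
rows = ["MKTAAYLLGD", "M--------E", "Q---------"]

identities = [percent_identity(reference, row) for row in rows]
kept = filter_homologs(reference, rows)
print(identities, kept)
```

      Dividing by the full aligned length rather than by the number of ungapped columns penalizes rows that match the reference only over a short segment, which matches the stated purpose of the filter.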

      We updated the Methods section of the main text to clarify.

      “We performed multi-sequence alignment (MSA) analysis using HHblits from the HH-suite3 software suite [citations], a widely used open-source toolkit known for its sensitivity in detecting sequence similarities and identifying protein folds. HHblits builds MSAs through iterative database searches, sequentially incorporating matched sequences into the query MSA with each iteration. Sequence alignment was performed using the full-length protein sequences, encompassing both folded and disordered regions.

      ...

      To refine alignment quality by focusing on closely related homologs, we filtered out sequences with ≤ 20% identity to the query, excluding weakly related sequences where only short segments show similarity to the reference. For each sequence, we calculated the percent identity by counting the number of positions, denoted as n, at which the amino acid matches the reference. The percent identity was then computed as n/N, where N represents the total length of the aligned reference sequence. This total includes residues in folded and disordered regions, as well as gap positions introduced during alignment.”

      We selected a 20% sequence identity threshold to balance inclusion of true homologs with exclusion of distant matches that may not share functional relevance. To determine this cutoff, we compared identity thresholds of 0%, 10%, 20%, and 40% and examined the resulting distributions of conservation and ESM2 scores across aligned residues for the MLO-hProt dataset (Author response image 3). Thresholds of 10%, 20%, and 40% produced qualitatively similar results, with a consistent correspondence between low ESM2 scores and high conservation. Lower thresholds introduced highly divergent sequences that added noise to the alignment, resulting in reduced overall conservation scores. In contrast, higher thresholds excluded homologs with potentially meaningful conservation, particularly in disordered regions where conservation scores tend to be relatively low.

      Author response image 3.

      Histograms of the ESM2 score and the conservation score, presented in a format consistent with Figure 3B of the main text. The conservation scores were computed using aligned sequences with identity thresholds of ≥0%, ≥10%, ≥20%, and ≥40% (left to right). Contour lines represent different levels of −log P(CS, ESM2), where P is the joint probability density of conservation score (CS) and ESM2 score. Contours are spaced at 0.5-unit intervals, highlighting regions of distinct density.

      Recommendation 4: In their description of the "motif" searching algorithm (p. 20), I think that the search algorithm would give a different result depending on whether the search is performed N->C or C->N (because the first residue (i) needs to have a score <0.5 but the last (j) could have a score >0.5 as long as the average is below 0.5). Is that correct? And if so, why did they choose an asymmetric algorithm?

      We thank the reviewer for highlighting the asymmetry in our motif-search algorithm.

      To investigate this issue, we repeated the algorithm starting from the C-terminus and compared the resulting motifs with those obtained from the N-terminal scan. We found that the two sets of motifs overlap entirely: each motif identified from the C-terminal direction has a corresponding counterpart from the N-terminal scan. However, the motifs are not identical. The directionality of the search introduces additional amino acids—referred to here as peripheral residues—at the motif boundaries, which differ between the two sets.

      As shown in Author response image 4, the number of peripheral residues is small relative to the total motif length.

      To eliminate asymmetry and ambiguity, we have revised our method to perform bidirectional scans—from both the N- and C-termini—and define each motif as the overlapping region identified by both directions. This approach emphasizes the conserved core and avoids the inclusion of spurious terminal residues. The updated procedure is described in Methods: Motif Identification.

      “To identify motifs within a given IDR, we implemented the following iterative procedure. Starting from either the N– or C–terminus of the sequence, we first locate the initial residue i whose ESM2 score falls below 0.5. From i, residues are sequentially appended…”

      Author response image 4.

      Number of peripheral residues and their relative length to the full-motif length identified from both sides. (A) The unique motifs identified from N-to-C terminal direction. (B) The unique motifs identified from C-to-N terminal direction.

      “…in the direction toward the opposite terminus until the segment’s average ESM2 score exceeds 0.5; the first residue to breach this threshold is denoted j. The segment (i,i+1,..., j−1) is then recorded as a candidate motif. This process repeats starting from j until the end of the IDR is reached.

      We perform this full procedure independently from both termini and designate the final motif as the intersection of the two candidate-motif sets. This bidirectional overlap strategy excludes terminal residues that might transiently satisfy the average-score criterion only due to adjacent low-scoring regions, thereby isolating the conserved core of each motif. All other residues—those not included in either directional pass—are classified as non-motif regions, minimizing peripheral artifacts.”
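
      The bidirectional procedure quoted above can be sketched roughly as follows. This is one reading of the description, not the authors' code: the helper names are hypothetical, the start criterion is taken as a score strictly below the threshold, segments are half-open index ranges, and the "intersection" is computed as the pairwise overlap of segments from the two passes.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # half-open (start, end) residue indices


def scan(scores: Sequence[float], threshold: float = 0.5) -> List[Segment]:
    """One directional pass: start at the first residue scoring below
    `threshold`, extend while the running average stays at or below it,
    stop at the first residue j that pushes the average above it, then
    repeat from j."""
    segments: List[Segment] = []
    n = len(scores)
    pos = 0
    while pos < n:
        i = pos
        while i < n and scores[i] >= threshold:
            i += 1  # skip residues that cannot start a motif
        if i == n:
            break
        total = 0.0
        j = i
        while j < n:
            total += scores[j]
            if total / (j - i + 1) > threshold:
                break  # residue j breaches the average criterion
            j += 1
        segments.append((i, j))
        pos = j
    return segments


def bidirectional_motifs(
    scores: Sequence[float], threshold: float = 0.5
) -> List[Segment]:
    """Overlap of candidate segments from the N-to-C and C-to-N passes."""
    n = len(scores)
    forward = scan(scores, threshold)
    # Scan the reversed sequence, then map indices back to N-to-C order.
    backward = [(n - e, n - s) for s, e in scan(scores[::-1], threshold)]
    motifs: List[Segment] = []
    for fs, fe in forward:
        for bs, be in backward:
            start, end = max(fs, bs), min(fe, be)
            if start < end:
                motifs.append((start, end))
    return sorted(motifs)


# Invented per-residue scores for an 8-residue toy IDR.
scores = [2.0, 0.1, 0.2, 0.9, 3.0, 0.3, 0.1, 2.5]
print(scan(scores))
print(bidirectional_motifs(scores))
```

      In this toy input, the residue at index 3 (score 0.9) is absorbed by the N-to-C pass because the two low-scoring residues before it keep the running average under the threshold, but it is excluded by the C-to-N pass. It is therefore a peripheral residue in the sense used above and drops out of the overlap.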

      Accordingly, we have updated the Supplementary material ESM2_motif_with_exp_ref.csv to contain the newly identified motifs common to both the N-terminal and C-terminal searches. Minor changes were observed in the set of motifs discussed, but these do not affect the main conclusions. Figures 5C, 5D, and S6 have been revised accordingly.

      Reviewer #2:

      Summary:

      Unfortunately, I do not believe that the results can be trusted. ESM2 has not been validated for IDRs through experiments. The authors themselves point out its little use in that context. In this study, they do not provide any further rationale for why this situation might have changed. Furthermore, they mention that experimental perturbations of the predicted motifs in in vivo studies may further elucidate their functional importance, but none of that is done here. That some of the motifs have been previously validated does not give any credibility to the use of ESM2 here, given that such systems were probably seen during the training of the model.

      We thank the reviewer for their detailed and thoughtful critique of our manuscript. We recognize the importance of careful model validation, especially in the context of IDRs, and appreciate the opportunity to clarify the scope and rationale of our study. Below, we respond point-by-point to the main concerns.

      (1) The use of ESM2 is not validated for IDRs, and the authors provide no rationale for its applicability in this context.

      We thank the reviewer for raising this important point.

      First, we emphasize that ESM2 is a probabilistic language model trained entirely on amino acid sequences, without any structural supervision. The model does not receive any input about protein structure — folded or disordered — during training. Instead, it learns to estimate the likelihood of each amino acid at a given position, conditioned on the surrounding sequence context. This makes ESM2 agnostic to whether a sequence is folded or disordered; the model’s capacity to identify patterns of residue usage arises solely from the statistics of natural sequences.

      As such, ESM2 is not inherently biased toward folded proteins, even though previous studies have demonstrated its usefulness in identifying conserved and functionally constrained residues in structured domains [3–9]. These findings support the broader utility of language models for uncovering evolutionary constraints — and by extension, suggest that similar signatures could exist in IDRs, particularly if they are under functional selection.

      Indeed, if certain residues or motifs in IDRs are conserved due to their importance in biological processes (e.g., phase separation), we would expect such selection to be reflected in sequence-based features, which ESM2 is designed to detect. The model’s applicability to IDRs, then, is a natural extension of its core probabilistic architecture.

      To further evaluate this, we carried out an independent in silico validation using multiple sequence alignments (MSAs). This analysis allowed us to compute the evolutionary conservation of individual amino acids without any reliance on ESM2. We then compared these conservation scores to ESM2 scores and found a strong correlation between the two. This provides direct, quantitative support for the idea that ESM2 is capturing biologically meaningful sequence constraints — even in disordered regions.

      While we agree that experimental testing would ultimately provide the most compelling validation, we believe that our MSA-based comparison constitutes a strong and arguably ideal computational validation of the model’s predictions. It offers an orthogonal measure of evolutionary pressure that confirms the biological plausibility of ESM2 scores.

      We added the following text in the introduction to highlight the applicability of ESM2 to IDRs.

      “While protein language models have been widely applied to structured proteins, it is important to emphasize that these models themselves are not inherently biased toward folded domains. For example, the Evolutionary Scale Model (ESM2) [cite] is trained as a probabilistic language model on raw protein sequences, without incorporating any structural or functional annotations. It operates by estimating the likelihood of observing a given amino acid at a particular position, conditioned on the entire surrounding sequence context. This unsupervised learning paradigm enables ESM2 to capture statistical patterns of residue usage and evolutionary constraints without relying on explicit structural information. Thus, the success of ESM2 in modeling fitness landscapes of folded proteins reflects the model’s ability to learn sequence-level constraints imposed by natural selection, a property that is equally applicable to IDRs if those regions are also under functional selection. Indeed, protein language models are increasingly being used to analyze variant effects in IDRs [cite].”

      (2) There is no experimental validation of the ESM2-based predictions in this study.

      We agree that experimental validation would provide definitive support for the utility of ESM2 in IDRs, and we explicitly state this as a limitation in the revised manuscript as quoted below.

      “Limitations: Despite the promising findings, our study has several limitations. Most notably, our analysis is purely computational, relying on ESM2-derived predictions and sequence-based conservation without accompanying experimental validation. While the strong correlation between ESM2 scores and evolutionary conservation provides compelling evidence that the identified motifs are functionally constrained, the precise biological roles of these motifs remain uncharacterized. ESM2 is well-suited for highlighting regions under selective pressure, but it does not provide mechanistic insights into how conserved motifs contribute to specific molecular functions such as phase separation, molecular recognition, or dynamic regulation. Determining these roles will require targeted experimental investigations, including mutagenesis and biophysical characterization.”

      In addition, we revised the manuscript title from “Protein Language Model Identifies Disordered, Conserved Motifs Driving Phase Separation” to “Protein Language Model Identifies Disordered, Conserved Motifs Implicated in Phase Separation”. This revision softens the original claim to better reflect the absence of direct experimental evidence for the motifs’ role in phase separation.

      However, we also emphasize that the goal of our study is not to claim definitive predictive power, but rather to explore whether ESM2-derived mutational profiles align with known biological features of IDRs — and in doing so, to generate new, testable hypotheses.

      In addition, while no in vivo experiments were performed, our study does include an in silico validation step, as detailed in the response to the previous comment. The strong correlation between ESM2 scores and conservation scores provides direct support for the utility of ESM2 in identifying residues under evolutionary constraint in disordered regions.

      (3) The overlap between predicted motifs and known ones may be due to training data leakage.

      We respectfully clarify that training data leakage is not possible in this case, as ESM2 is trained using unsupervised learning on raw protein sequences alone. The model has no access to experimental annotations, functional labels, or knowledge of which motifs are involved in phase separation. It only models statistical sequence patterns derived from evolutionarily observed proteins.

      Therefore, any agreement between ESM2-derived predictions and previously validated motifs arises not from memorization of experimental data, but from the model’s ability to learn meaningful sequence constraints from the natural distribution of proteins.

      (4) The authors should revamp the study with a testable predictive framework.

      We respectfully suggest that a full revamp is not necessary or appropriate in this context.

      As outlined in our previous responses, we believe that certain misunderstandings about the nature and capabilities of ESM2 may have influenced the reviewer’s assessment.

      Importantly, both Reviewer 1 and Reviewer 3 express strong support for the significance and novelty of this work, and recommend publication following minor revisions.

      In this context, we believe the manuscript provides a useful contribution as a first step toward understanding disordered regions using language models, and that it has value even in the absence of direct experimental testing. We have now better positioned the manuscript in this light, clarified limitations, and suggested concrete next steps for follow-up research.

      We hope these clarifications and revisions address the reviewer’s concerns, and we thank them again for helping us strengthen the framing, rigor, and clarity of our study.

      Reviewer #3:

      Summary:

      This is a very nice and interesting paper to read about motif conservation in protein sequences and mainly in IDRs regions using the ESM2 language model. The topic of the paper is timely, with strong biological significance. The paper can be of great interest to the scientific community in the field of protein phase transitions and future applications using the ESM models. The ability of ESM2 to identify conserved motifs is crucial for disease prediction, as these regions may serve as potential drug targets. Therefore, I find these findings highly significant, and the authors strongly support them throughout the paper. The work motivates the scientific community towards further motif exploration related to diseases.

      Strengths:

      (1) Revealing conserved regions in IDRs by the ESM-2 language model.

      (2) Identification of functionally significant residues within protein sequences, especially in IDRs.

      (3) Findings supported by useful analyses.

      We appreciate the reviewer’s thoughtful words and support for our work.

      Weaknesses:

      (1) Lack of examples demonstrating the potential biological functions of these conserved regions.

      As detailed in the Response to Recommendation 6, we conducted additional analyses to connect the identified conserved regions with their biological functions.

      (2) Very limited discussion of potential future work and of limitations.

      We have substantially revised the Conclusions and Discussion section to provide a detailed analysis of the study’s limitations and to propose several directions for future research, as elaborated in our Response to Recommendation 5 below.

      Recommendation 1: The authors describe the ESM2 score such that lower scores are associated with conserved residues, stating that "lower scores indicate higher mutational constraint and reduced flexibility, implying that these residues are more likely essential for protein function, as they exhibit fewer permissible mutational states." However, when examining intrinsically disordered regions (IDRs), which are known to drive phase separation, I observe that the ESM2 score is relatively high (Figure 3C, pLDDT < 50, and Supplementary Figure S2). Could the authors clarify how this relatively high score aligns with the conservation of motifs that drive phase separation?

      We thank the reviewer for this insightful comment. We would like to clarify that most amino acids in the IDRs are not conserved, even for IDRs that contribute to phase separation. Only a small set of amino acids in these IDRs, which we term motifs, are evolutionarily conserved with low ESM2 scores. Therefore, the ESM2 scores exhibit a bimodal distribution at high and low values, as shown in Figures 4A and 4C of the manuscript. When averaged over all the amino acids, the mean ESM2 scores, plotted in Figure 3C, are relatively high due to the dominant population of non-conserved amino acids.

      Recommendation 2: The authors mention: "We first analyzed the relationship between ESM2 and pLDDT scores for human Heterochromatin Protein 1 (HP1, residues 1-191)". I appreciate this example as a demonstration of amino acid conservation in IDRs. However, it is questionable whether the authors could provide some more examples to support amino acid conservation particularly within the IDRs along with lower ESM2 score (e.g, Could the authors provide some additional examples of "conserved disordered" regions in various proteins which are associated with relatively low ESM2 score as appear in Figure 2A).

      We thank the reviewer for this valuable suggestion. We would like to note that conserved residues in IDRs are prevalent, as indicated in Figures 2D and 3B. To further illustrate the prevalence of “conserved disordered” regions, we generated ESM2 versus pLDDT score plots for the full dMLO–hProt dataset (82 proteins) in Figure S2. In these plots, residues with pLDDT ≤ 70 are highlighted in blue to denote structural disorder (dMLO-hIDR), and those disordered residues with ESM2 score ≤ 1.5 are shown in purple to indicate conserved disordered segments.

      Recommendation 3: Could the authors plot a Violin conservation score plot for Figure 4A to emphasise the relationship between ESM2 scores and conservation scores of disordered residues?

      We thank the reviewer for this suggestion. We included a violin plot illustrating the distribution of conservation scores for disordered residues across all four IDR groups, shown in Author response image 5. Consistent with the findings in Figure 4A, the phase separation drivers (dMLO-hIDR and dMLO-IDR) exhibit a higher proportion of conserved amino acids compared to the client group (cMLO-hIDR).

      We also note that the nMLO-hIDR group may contain conserved residues due to functions unrelated to MLO formation, which could contribute to the higher observed levels of conservation in this group.

      Author response image 5.

      Violin plots illustrating the distribution of conservation scores for disordered residues across the nMLO–hIDR, cMLO–hIDR, dMLO–hIDR, and dMLO–IDR datasets. Pairwise statistical comparisons were conducted using two-sided Mann–Whitney U tests on the conservation score distributions (null hypothesis: the two groups have equal medians). P-values give the probability, under the null hypothesis, of obtaining rank differences at least as extreme as those observed. Statistical significance is denoted as follows: ***: p < 0.001; **: p < 0.01; *: p < 0.05.
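
      For readers unfamiliar with the test named in this caption, the sketch below computes a two-sided Mann–Whitney U test with the large-sample normal approximation and maps the p-value to the star notation. It is a self-contained illustration, not the authors' analysis code: the function names, the "n.s." label, and the sample values are assumptions, no continuity correction is applied, and the normal approximation is only rough for samples as small as the toy example.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value) using average ranks
    for ties and a tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values tied: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, min(p, 1.0)


def significance_stars(p: float) -> str:
    """Map a p-value to the star notation used in the caption."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "n.s."  # label for non-significant results is an assumption


# Invented conservation scores for two small groups.
group_a = [0.1, 0.2, 0.3, 0.4, 0.5]
group_b = [0.6, 0.7, 0.8, 0.9, 1.0]
u_stat, p_value = mann_whitney_u(group_a, group_b)
print(u_stat, p_value, significance_stars(p_value))
```

      In practice a statistics library routine with exact small-sample handling would normally be preferred; the hand-written version is shown only to make the rank-based nature of the comparison explicit.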

      Recommendation 4: It will be appreciated if the authors could add to Figure 4 Violin plots, a statistical comparison between the different groups.

      We thank the reviewer for this valuable suggestion. We included the p-values for Figures 4A and 4C to quantify the statistical significance of differences in the distributions.

      Most comparisons are highly significant (p < 0.001), while the largest p-value (p = 0.089), for the comparison between the conservation scores of the driving and non-participating groups (Figure 4C), still suggests a marginally significant trend.

      Recommendation 5: Could the authors expand more on potential future research directions using ESM2, given its usefulness in identifying conserved motifs? Specifically, how do the authors envision conserved motifs will contribute to future discoveries/applications/models using ESM (e.g, discuss the importance of conserved motifs, especially in IDRs motifs, in protein phase transition prediction in relation to diseases).

      We thank the reviewer for this insightful comment. To further assess the functional relevance of the conserved motifs, we incorporated pathogenic variant data from ClinVar [10, 11] to evaluate mutational impacts. As shown in Figure S12A and B, a substantial number of pathogenic variants in MLO-hProt proteins are associated with low ESM2 LLR values. This pattern holds for both folded and disordered residues.

      Moreover, we observed that variants located within motifs are more frequently pathogenic compared to those outside motifs (Figure S12C). In the main text, motifs were defined only for driver proteins; however, the available variant data for this subset are limited (6 data points). To improve statistical power, we extended motif identification to include both client and driver human proteins, following the same methodology described in the main text. Consistent with previous findings, variants within motifs in this expanded set are also more likely to be pathogenic. These results further support the functional importance of both low ESM2-scoring residues and the conserved motifs in which they reside.

      The following text was added in the Discussion section of the manuscript to discuss these results and outline future research directions.

      “Several promising directions could extend this work, both to refine our mechanistic understanding and to explore clinical relevance. One avenue is testing the hypothesis that conserved motifs in scaffold proteins act as functional stickers, mediating strong intermolecular interactions. This could be evaluated computationally via free energy calculations or experimentally via interaction assays. Deletion of such motifs in client proteins may also reduce their partitioning into condensates, illuminating their roles in molecular recruitment.

      To explore potential clinical implications, we analyzed pathogenicity data from ClinVar [10, 11]. As shown in Figure S12A, single-point mutations with low LLR values, indicative of constrained residues, are enriched among clinically reported pathogenic variants, while benign variants typically exhibit higher LLR values. Moreover, mutations within conserved motifs are significantly more likely to be pathogenic than those in non-motif regions (Figure S12B). These findings highlight the potential of ESM2 as a first-pass screening tool for identifying clinically relevant residues and suggest that the conserved motifs described here may serve as priorities for future studies, both mechanistic and therapeutic.”

      Moreover, the functional significance of conserved motifs, particularly their implications in disease and pathology, warrants further investigation. As an initial analysis, we incorporated ClinVar pathogenic variant data [citation] to assess mutational effects within our datasets. As illustrated in Figure R12A, single-point mutations with low LLR values are enriched among clinically reported pathogenic variants, whereas benign variants are more commonly associated with higher LLR values. Notably, mutations within conserved motifs are substantially more likely to be pathogenic compared to those in non-motif regions. These findings highlight the potential of ESM2 as a first-pass tool for identifying residues of clinical relevance. The conserved motifs identified here may be prioritized in future studies aimed at elucidating their biological roles and evaluating their viability as therapeutic targets.

      Recommendation 6: The authors mention: "Our findings provide strong evidence for evolutionary pressures acting on specific IDRs to preserve their roles in scaffolding phase separation mechanisms, emphasizing the functional importance of entire motifs rather than individual residues in MLO formation." They also present a word cloud of functional motifs in Figure 5D. Although it makes sense that evolutionarily conserved motifs, especially within IDRs, act as functional units, I think there is no direct evidence for such functionality (e.g., examples of biological pathways associated with IDRs and phase separation). Hence, there is no justification to write in the figure caption: "ESM2 Identifies Functional Motifs in driving IDRs" unless the authors provide some examples of such functionality. Providing them would make the paper even stronger by establishing a clear connection to biological pathways, so that these motifs can serve as potential drug targets.

      We thank the reviewer for this insightful suggestion. We have replaced “functional motifs” with “conserved motifs” in the figure caption.

      Identifying the precise biological pathways associated with the conserved motifs is a complex task and a comprehensive investigation lies beyond the scope of this study. Nonetheless, as an initial effort, we explored the potential functions of these motifs using annotations available in DisProt (https://disprot.org/).

      DisProt is the leading manually curated database dedicated to IDPs, providing both structural and functional annotations. Expert curators compile experimentally validated data, including definitions of disordered regions, associated functional terms, and supporting literature references. Author response image 6 presents a representative DisProt entry for DNA topoisomerase 1 (UniProt ID: P11387), illustrating its structural and biological annotation.

      For each motif, we located the corresponding DisProt entry and assigned a functional annotation based on the annotated IDR from which the motif originates. We emphasize that this functional assignment should be regarded as an approximation. Because experimental annotations often pertain to the entire IDR, regions outside the motif may also contribute to the reported function.

      Nevertheless, the annotations provide valuable insights.
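
      The assignment step described above can be sketched as follows. This is a minimal illustration, assuming each annotated IDR is available as a (start, end, function term) tuple with 1-based inclusive coordinates and that a motif inherits the terms of any annotated region that fully contains it. The function name, the data layout, and the example coordinates are hypothetical and do not reflect the DisProt API or any real entry.

```python
def annotate_motif(motif_start, motif_end, idr_annotations):
    """Return the function terms of annotated IDRs that contain the motif.

    idr_annotations: list of (region_start, region_end, term) tuples
    (a hypothetical layout; coordinates are 1-based and inclusive).
    An empty list means the motif has no functional annotation.
    """
    terms = []
    for region_start, region_end, term in idr_annotations:
        # The motif "originates from" a region if it lies entirely inside it.
        if region_start <= motif_start and motif_end <= region_end:
            terms.append(term)
    return terms

# Made-up annotations for illustration only.
annotations = [
    (1, 50, "molecular condensate scaffold activity"),
    (100, 150, "protein binding"),
]
print(annotate_motif(10, 20, annotations))
print(annotate_motif(60, 70, annotations))
```

      Because the term is inherited from the whole annotated region, the result is an approximation in the sense noted above: residues outside the motif may be responsible for the reported function.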

      Author response image 6.

      Screenshot of information provided by the DisProt database. Detailed annotations of biological functions and structural features, along with experimental references, are accessible via mouse click.

      Approximately 50% of ESM2-predicted IDR motifs lack functional annotations. Among those that are annotated, motifs from the dMLO-IDR dataset are predominantly associated with “molecular condensate scaffold activity,” followed by various biomolecular binding functions (Author response image 7A). These findings support the role of these motifs in MLO formation.

      For comparison, we applied the same identification procedure (described in Methods: Motif Identification) to motifs from the nMLO-hIDR dataset. In contrast to the dMLO-IDR motifs, these exhibit a broader range of annotated functions related to diverse cellular processes. Collectively, these results suggest that motifs identified by ESM2 are aligned with biologically relevant functions captured in current databases.

      Finally, as illustrated in Figure S12 and discussed in the Response to Recommendation 5, variants occurring within identified motifs are more likely to be pathogenic than those in non-motif regions, further underscoring their functional importance.

      Author response image 7.

      Biological functions of ESM2-predicted motifs. (A) Distribution of biological functions associated with all identified motifs from dMLO-IDR driving groups. (B) Distribution of biological functions associated with all identified motifs from nMLO-hIDR groups.

      Recommendation 7: In Figure 2C the authors present FE (I assume this is free energy). Some discussion of the difference in free energy in the "a" region is missing (i.e., both "Folded" and "Disordered" regions are associated with a low ESM score but with low and high free energy (FE), respectively).

      We thank the reviewer for the comments. FE indeed abbreviates free energy. To improve clarity and avoid confusion, we have updated all figure captions by replacing “FE” with “−logP” to explicitly denote the logarithm of probability in the contour density plots.

      We used “a” in Figures 2C and 2D to refer to regions with low ESM2 scores, which appear as a local minimum in both plots. Since most residues in folded regions are conserved, region a has lower free energy than region b in Figure 2C. On the other hand, as most residues in disordered regions are not conserved, as we elaborated in Response to Recommendation 1, region a has lower population and higher free energy than region b.

      To avoid confusion, we have replaced “a” and “b” in Figure 2D with “I” and “II”.
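
      The relationship between population and −logP mentioned above can be illustrated with a short sketch, assuming the density is estimated by simple two-dimensional binning. The function name, bin width, and sample points are hypothetical; the authors' actual density estimation is not described in this response.

```python
import math
from collections import Counter

def neg_log_prob(points, bin_width=1.0):
    """Map each occupied 2D bin to -log(P), where P is the fraction of points in it.

    points: iterable of (x, y) pairs, e.g. two per-residue scores.
    Sparsely populated bins get larger -log(P) values ("higher free energy").
    """
    counts = Counter(
        (math.floor(x / bin_width), math.floor(y / bin_width)) for x, y in points
    )
    total = sum(counts.values())
    return {bin_index: -math.log(count / total) for bin_index, count in counts.items()}

# Illustrative points: three fall in bin (0, 0), one in bin (1, 0).
surface = neg_log_prob([(0.1, 0.1), (0.2, 0.3), (0.4, 0.4), (1.5, 0.2)])
print(surface)
```

      In this toy example the bin holding three of the four points has the lower −logP, which mirrors the statement that a less populated region corresponds to a higher free energy.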

      Recommendation 8: Figure S2: It would be useful to plot the same figure for structured and disordered regions as well.

      We are not certain we fully understood this comment, as we believe the requested analysis has already been addressed. In Figure S2, we used the AlphaFold2 pLDDT score to represent the structural continuum of different protein regions, where residues with pLDDT > 70 (red and light-red bars) are classified as structured, while those with pLDDT ≤ 70 (blue and light-blue bars) are classified as disordered.
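
      The threshold rule stated above can be written as a short sketch, assuming per-residue pLDDT scores are available as a list of numbers. The function name and the example scores are hypothetical.

```python
def classify_residues(plddt_scores, threshold=70.0):
    """Label each residue by its pLDDT score.

    Residues with pLDDT strictly above the threshold are 'structured';
    residues at or below it are 'disordered', matching the rule in the text.
    """
    return [
        "structured" if score > threshold else "disordered"
        for score in plddt_scores
    ]

# Made-up scores; note that exactly 70 falls on the disordered side.
print(classify_residues([95.2, 70.0, 70.1, 35.0]))
```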

      Minor suggestion 1: Could the authors clarify the meaning of the abbreviation "FE" in the colorbar of the contour line? I assume this is free energy.

      We have updated all contour density plot figure captions by replacing “FE” with “−logP” to explicitly denote the logarithm of probability.

      Minor suggestion 2: In Figure 2A - do the authors mean "Conserved folded" instead of just "Folded"? If so, could the authors indicate this?

      We thank the reviewer for this comment. The ESM2 scores indeed suggest that, within folded regions, there may be multiple distinct groups exhibiting varying degrees of evolutionary conservation. However, as our primary focus is on IDRs, we chose not to investigate these distinctions further.

      Figure 2A illustrates a randomly selected folded region based on AlphaFold2 pLDDT scores.

      References

      (1) Ruff, K. M.; Pappu, R. V. AlphaFold and Implications for Intrinsically Disordered Proteins. Journal of Molecular Biology 2021, 433, 167208.

      (2) Alderson, T. R.; Pritišanac, I.; Kolarić, Đ.; Moses, A. M.; Forman-Kay, J. D. Systematic Identification of Conditionally Folded Intrinsically Disordered Regions by AlphaFold2. Proceedings of the National Academy of Sciences of the United States of America, 120, e2304302120.

      (3) Brandes, N.; Goldman, G.; Wang, C. H.; Ye, C. J.; Ntranos, V. Genome-Wide Prediction of Disease Variant Effects with a Deep Protein Language Model. Nature Genetics 2023, 55, 1512–1522.

      (4) Lin, Z. et al. Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. 2023.

      (5) Zeng, W.; Dou, Y.; Pan, L.; Xu, L.; Peng, S. Improving Prediction Performance of General Protein Language Model by Domain-Adaptive Pretraining on DNA-binding Protein. Nature Communications 2024, 15, 7838.

      (6) Gong, J. et al. THPLM: A Sequence-Based Deep Learning Framework for Protein Stability Changes Prediction upon Point Variations Using Pretrained Protein Language Model. Bioinformatics 2023, 39, btad646.

      (7) Lin, W.; Wells, J.; Wang, Z.; Orengo, C.; Martin, A. C. R. Enhancing Missense Variant Pathogenicity Prediction with Protein Language Models Using VariPred. Scientific Reports 2024, 14, 8136.

      (8) Saadat, A.; Fellay, J. Fine-Tuning the ESM2 Protein Language Model to Understand the Functional Impact of Missense Variants. Computational and Structural Biotechnology Journal 2025, 27, 2199–2207.

      (9) Chu, S. K. S.; Narang, K.; Siegel, J. B. Protein Stability Prediction by Fine-Tuning a Protein Language Model on a Mega-Scale Dataset. PLOS Computational Biology 2024, 20, e1012248.

      (10) Landrum, M. J.; Lee, J. M.; Riley, G. R.; Jang, W.; Rubinstein, W. S.; Church, D. M.; Maglott, D. R. ClinVar: Public Archive of Relationships among Sequence Variation and Human Phenotype. Nucleic Acids Research 2014, 42, D980–D985.

      (11) Landrum, M. J. et al. ClinVar: Improving Access to Variant Interpretations and Supporting Evidence. Nucleic Acids Research 2018, 46, D1062–D1067.

    1. Author response:

      Reviewer #1 (Public Review):

      Summary:

      The present study evaluates the role of visual experience in shaping functional correlations between extrastriate visual cortex and frontal regions. The authors used fMRI to assess "resting-state" temporal correlations in three groups: sighted adults, congenitally blind adults, and neonates. Previous research has already demonstrated differences in functional correlations between visual and frontal regions in sighted compared to early blind individuals. The novel contribution of the current study lies in the inclusion of an infant dataset, which allows for an assessment of the developmental origins of these differences.

      The main results of the study reveal that correlations between prefrontal and visual regions are more prominent in the blind and infant groups, with the blind group exhibiting greater lateralization. Conversely, correlations between visual and somato-motor cortices are more prominent in sighted adults. Based on these data, the authors conclude that visual experience plays an instructive role in shaping these cortical networks. This study provides valuable insights into the impact of visual experience on the development of functional connectivity in the brain.

      Strengths:

      The dissociations in functional correlations observed among the sighted adult, congenitally blind, and neonate groups provide strong support for the study's main conclusion regarding experience-driven changes in functional connectivity profiles between visual and frontal regions.

      In general, the findings in sighted adult and congenitally blind groups replicate previous studies and enhance the confidence in the reliability and robustness of the current results.

      Split-half analysis provides a good measure of robustness in the infant data.

      Weaknesses:

      There is some ambiguity in determining which aspects of these networks are shaped by experience.

      This uncertainty is compounded by notable differences in data acquisition and preprocessing methods, which could result in varying signal quality across groups. Variations in signal quality may, in turn, have an impact on the observed correlation patterns.

      The study's findings could benefit from being situated within a broader debate surrounding the instructive versus permissive roles of experience in the development of visual circuits.

      Reviewer #2 (Public Review):

      Summary:

      Tian et al. explore the developmental origins of cortical reorganization in blindness. Previous work has found that a set of regions in the occipital cortex show different functional responses and patterns of functional correlations in blind vs. sighted adults. In this paper, Tian et al. ask: how does this organization arise over development? Is the "starting state" more like the blind pattern, or more like the adult pattern? Their analyses reveal that the answer depends on the particular networks investigated; some functional connections in infants look more like blind than sighted adults; other functional connections look more like sighted than blind adults; and others fall somewhere in the middle, or show an altogether different pattern in infants compared with both sighted and blind adults.

      Strengths:

      The question raised in this paper is extremely important: what is the starting state in development for visual cortical regions, and how is this organization shaped by experience? This paper is among the first to examine this question, particularly by comparing infants not only with sighted adults but also blind adults, which sheds new light on the role of visual (and cross-modal) experience. Another clear strength lies in the unequivocal nature of many results. Many results have very large effect sizes, critical interactions between regions and groups are tested and found, and infant analyses are replicated in split halves of the data. 

      Weaknesses:

      A central claim is that "infant secondary visual cortices functionally resemble those of blind more than sighted adults" (abstract, last paragraph of intro). I see two potential issues with this claim. First, a minor change: given the approaches used here, no claims should be made about the "function" of these regions, but rather their "functional correlations". Second (and more importantly), the claim that the secondary visual cortex in general resembles blind more than sighted adults is still not fully supported by the data. In fact, this claim is only true for one aspect of secondary visual area functional correlations (i.e., their connectivity to A1/M1/S1 vs. PFC). In other analyses, the infant secondary visual cortex looks more like sighted adults than blind adults (i.e., in within vs. across hemisphere correlations), or shows a different pattern from both sighted and blind adults (i.e., in occipito-frontal subregion functional connectivity). It is not clear from the manuscript why the comparison to PFC vs. non-visual sensory cortex is more theoretically important than hemispheric changes or within-PFC correlations (in fact, if anything, the within-PFC correlations strike me as the most important for understanding the development and reorganization of these secondary visual regions). It seems then that a more accurate conclusion is that the secondary visual cortex shows a mix of instructive effects of vision and reorganizing effects of blindness, albeit to a different extent than the primary visual cortex.

      Relatedly, group differences in overall secondary visual cortex connectivity are particularly striking as visualized in the connectivity matrices shown in Figure S1. In the results (lines 105-112), it is noted that while the infant FC matrix is strongly correlated with both adult groups, the infant group is nonetheless more strongly correlated with the blind than sighted adults. I am concerned that these results might be at least partially explained by distance (i.e., local spread of the BOLD signal), since a huge portion of the variance in these FC matrices is driven by stronger correlations between regions within the same system (e.g., secondary-secondary visual cortex, frontal-frontal cortex), which are inherently closer together, relative to those between different systems (e.g., visual to frontal cortex). How do results change if only comparisons between secondary visual regions and non-visual regions are included (i.e., just the pairs of regions within the bold black rectangle on the figure), which limits the analysis to long-range connections only? Indeed, looking at the off-diagonal comparisons, it seems that in fact there are three altogether different patterns here in the three groups. Even if the correlation between the infant pattern and blind adult pattern survives, it might be more accurate to claim that infants are different from both adult groups, suggesting both instructive effects of vision and reorganizing effects of blindness. It might help to show the correlation between each group and itself (across independent sets of subjects) to better contextualize the relative strength of correlations between the groups.
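
      The restricted comparison suggested above can be sketched as follows. This is a minimal illustration, assuming each group's FC matrix is a nested list indexed by ROI and that the visual and non-visual ROI indices are known. The function names, index sets, and matrix values are hypothetical, and Pearson correlation is an assumed choice of similarity measure.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def between_system_similarity(fc_a, fc_b, visual_idx, nonvisual_idx):
    """Correlate two FC matrices using only visual-to-non-visual entries.

    Within-system entries (visual-visual, frontal-frontal), which are
    dominated by short-range correlations, are ignored.
    """
    block_a = [fc_a[i][j] for i in visual_idx for j in nonvisual_idx]
    block_b = [fc_b[i][j] for i in visual_idx for j in nonvisual_idx]
    return pearson(block_a, block_b)

# Toy 4-ROI matrices: ROIs 0-1 are "visual", ROIs 2-3 are "non-visual".
group_a = [[1.0, 0.8, 0.1, 0.2], [0.8, 1.0, 0.3, 0.4],
           [0.1, 0.3, 1.0, 0.7], [0.2, 0.4, 0.7, 1.0]]
group_b = [[1.0, 0.2, 0.2, 0.4], [0.2, 1.0, 0.6, 0.8],
           [0.2, 0.6, 1.0, 0.1], [0.4, 0.8, 0.1, 1.0]]
print(between_system_similarity(group_a, group_b, [0, 1], [2, 3]))
```

      In the toy data the two groups differ in their within-system entries but share the same between-system pattern, so the restricted similarity is high even though a whole-matrix comparison would be pulled down by the within-system differences.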

      It is not clear that differences between groups should be attributed to visual experience only. For example, despite the title of the paper, the authors note elsewhere that cross-modal experience might also drive changes between groups. Another factor, which I do not see discussed, is possible ongoing experience-independent maturation. The infants scanned are extremely young, only 2 weeks old. Although no effects of age are detected, it is possible that cortex is still undergoing experience-independent maturation at this very early stage of development. For example, consider Figure 2; perhaps V1 connectivity is not established at 2 weeks, but eventually achieves the adult pattern later in infancy or childhood. Further, consider the possibility that this same developmental progression would be found in infants and children born blind. In that case, the blind adult pattern may depend on blindness-related experience only (which may or may not reflect "visual" experience per se). To deal with these issues, the authors should add a discussion of the role of maturation vs. experience and temper claims about the role of visual experience specifically (particularly in the title). 

      The authors measure functional correlations in three very different groups of participants and find three different patterns of functional correlations. Although these three groups differ in critical, theoretically interesting ways (i.e., in age and visual/cross-modal experience), they also differ in many uninteresting ways, including at least the following: sampling rate (TR), scan duration, multi-band acceleration, denoising procedures (CompCor vs. ICA), head motion, ROI registration accuracy, and wakefulness (I assume the infants are asleep).

      Addressing all of these issues is beyond the scope of this paper, but I do feel the authors should acknowledge these confounds and discuss the extent to which they are likely (or not) to explain their results. The authors would strengthen their conclusions with analyses directly comparing data quality between groups (e.g., measures of head motion and split-half reliability would be particularly effective).

      Response #1: We appreciate the reviewer’s comments. In response, we have revised the paper to provide a more balanced summary of the data and clarified in the introduction which signatures the paper focuses on and why. Additionally, we have included several control analyses to account for other plausible explanations for the observed group differences. Specifically, we randomly split the infant dataset into two halves and performed split-half cross-validation. Across all comparisons, the results from the two halves were highly similar, suggesting that the effects are robust (see Supplementary Figures S3 and S4).

      Furthermore, we compared the split-half noise ceiling across the groups (infants, sighted adults, and blind adults) and found no significant differences between them (details in response #6). Finally, we repeated our analysis after excluding infants with a radiology score of 4 or 5, and the results remained consistent, indicating that our findings are not confounded by potential brain anomalies (details in response #2).

      We hope these control analyses help strengthen our conclusions.
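
      A split-half check of the kind described above can be sketched as follows. This is a minimal illustration, assuming each subject contributes one vector of connectivity values (one value per ROI pair). The function names, the random seed, and the example vectors are hypothetical, and the sketch is not the authors' actual pipeline.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def split_half_reliability(subject_vectors, seed=0):
    """Randomly split subjects into two halves and correlate the half averages.

    subject_vectors: list of per-subject connectivity vectors of equal length.
    Returns the correlation between the mean pattern of each half.
    """
    rng = random.Random(seed)
    order = list(range(len(subject_vectors)))
    rng.shuffle(order)
    half = len(order) // 2
    first = [subject_vectors[i] for i in order[:half]]
    second = [subject_vectors[i] for i in order[half:]]
    avg_first = [mean(column) for column in zip(*first)]
    avg_second = [mean(column) for column in zip(*second)]
    return pearson(avg_first, avg_second)

# Made-up subjects sharing one underlying pattern with different offsets.
subjects = [[0.10, 0.50, 0.90], [0.20, 0.60, 1.00],
            [0.00, 0.40, 0.80], [0.15, 0.55, 0.95]]
print(split_half_reliability(subjects))
```

      Running the same function separately on each group gives one reliability value per group, which is one way to compare data quality across groups; repeating it over several seeds would show how much the estimate depends on the particular split.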

      Reviewer #3 (Public Review):

      Summary:

      This study aimed to investigate whether the differences observed in the organization of visual brain networks between blind and sighted adults result from a reorganization of an early functional architecture due to blindness, or whether the early architecture is immature at birth and requires visual experience to develop functional connections. This question was investigated through the comparison of 3 groups of subjects with resting-state functional MRI (rs-fMRI). Based on convincing analyses, the study suggests that: 1) secondary visual cortices showed higher connectivity to prefrontal cortical regions (PFC) than to non-visual sensory areas (S1/M1 and A1) in sighted infants like in blind adults, in contrast to sighted adults; 2) the V1 connectivity pattern of sighted infants lies between that of sighted adults (stronger functional connectivity with non-visual sensory areas than with PFC) and that of blind adults (stronger functional connectivity with PFC than with non-visual sensory areas); 3) the laterality of the connectivity patterns of sighted infants resembled those of sighted adults more than those of blind adults, but sighted infants showed a less differentiated fronto-occipital connectivity pattern than adults.

      Strengths:

      The question investigated in this article is important for understanding the mechanisms of plasticity during typical and impaired development, and the approach considered, which compares different groups of subjects including neonates/infants and blind adults, is highly original.

      Overall, the analyses considered are solid and well-detailed. The results are quite convincing, even if the interpretation might need to be revised downwards, as factors other than visual experience may play a role in the development of functional connections with the visual system.

      Weaknesses:

      While it is informative to compare the "initial" state (close to birth) and the "final" states in blind and sighted adults to study the impact of post-natal and visual experience, this study does not analyze the chronology of this development and when the specialization of functional connections is completed. This would require investigating when experience-dependent mechanisms are important for the establishment of multiple functional connections within the visual system. This could be achieved by analyzing different developmental periods in the same way, using open databases such as the Baby Connectome Project. Given the early, "condensed" maturation of the visual system after birth, we might expect sighted infants to show connectivity patterns similar to those of adults a few months after birth.

      The rationale for mixing full-term neonates and preterm infants (scanned at term-equivalent age) from the dHCP 3rd release is not understandable since preterms might have a very different development related to prematurity and to post-natal (including visual) experience. Although the authors show that the difference between the connectivity of visual and other sensory regions, and the one of visual and PFC regions, do not depend on age at birth, they do not show that each connectivity pattern is not influenced by prematurity. Simply not considering the preterm infants would have made the analysis much more robust, and the full-term group in itself is already quite large compared with the two adult groups. The current study setting and the analyses performed do not seem to be an adequate and sufficient model to ascertain that "a few weeks of vision after birth is ... insufficient to influence connectivity".

      In a similar way, excluding the few infants with detected brain anomalies (radiological scores higher or equal to 4) would strengthen the group homogeneity by focusing on infants supposed to have a rather typical neurodevelopment. The authors quote all infants as "sighted" but this is not guaranteed as no follow-up is provided.

      Response #2: We appreciate the reviewer’s suggestion. We re-analyzed the infant cohort after excluding all cases with radiological scores ≥ 4 (n = 39 infants excluded). The revised analysis confirmed that the connectivity patterns reported in the main text remain statistically unchanged (see Supplementary Fig. S11). This demonstrates the robustness of our findings to confounding effects from potential brain anomalies. We have explicitly clarified this in the revised Methods section (page 14, line 391 in the manuscript).

      In our dataset, newborns (average age at scan = 2.79 weeks) have very limited and immature vision. We agree with the reviewer that long-term visual outcomes cannot be guaranteed without follow-up data. The term "sighted infants" was used operationally to distinguish this cohort from congenitally blind populations.

      The post-menstrual age (PMA) at scan of the infants is also not described. The methods indicate that all were scanned at "term-equivalent age" but does this mean that there is some PMA variability between 37 and 41 weeks? Connectivity measures might be influenced by such inter-individual variability in PMA, and this could be evaluated.

      The rationale for presenting results on the connectivity of secondary visual cortices before that of the primary visual cortex (V1) was not clear. Also, it might be relevant to better justify why only the connectivity of visual regions to non-visual sensory regions (S1-M1, A1) and prefrontal cortex (PFC) was considered in the analyses, and not the connectivity to other brain regions.

      In relation to the question explored, it might be informative to reposition the study in relation to what others have shown about the developmental chronology of structural and functional long-distance and short-distance connections during pregnancy and the first postnatal months.

      The authors acknowledge the methodological difficulties in defining regions of interest (ROIs) in infants in a similar way as adults. The reliability and the comparability of the ROIs positioning in infants is definitely an issue. Given that brain development is not homogeneous and synchronous across brain regions (in particular with the frontal and parietal lobes showing delayed growth), the newborn brain is not homothetic to the adult brain, which poses major problems for registration. The functional specialization of cortical regions is incomplete at birth. This raises the question of whether the findings of this study would be stable/robust if slightly larger or displaced regions had been considered, to cover with greater certainty the same areas as those considered in adults. And have other cortical parcellation approaches been considered to assess the ROIs robustness (e.g. MCRIB-S for full-terms)?

      Recommendations for the Authors:

      Reviewer #1(Recommendations for the authors):

      Further consideration should be given to the underlying changes in network architecture that may account for differences in functional correlations across groups. An increase (or decrease) in correlation between two regions could signify an increase (decrease) in connection or communication between those regions. Alternatively, it might reflect an increase in communication or connection with a third region, while the physical connections/interactions between the two original regions remain unchanged. These possibilities lead to distinct mechanistic interpretations. For example, there are substantial changes in connectivity during early visual (e.g. Burkhalter A. 1993, Cerebral Cortex) and visuo-motor development (e.g., Csibra et al. 2000 Neuroreport). It's not clear whether increases in communication within the visual network and improvements in visuo-motor behavior (e.g., Yizhar et al. 2023 Frontiers in Neuroscience) wouldn't produce a qualitatively similar pattern of results.

      Relatedly, the within-network correlation patterns between visual ROIs and frontal ROIs appear markedly different between sighted adults and infants (Supplementary Figure S1). To what extent do the differences in long-range correlations between visual and frontal regions reflect these within-network differences in functional organization?

      Response #3: The reviewer is raising some interesting questions about possible mechanisms and network changes. Resting state studies are indeed always subject to the possibility that some effects are mediated by a third, unobserved region. Prior whole-cortex connectivity analyses have observed primarily changes in occipito-frontal connectivity in blindness, so there is not a clear cortical ‘third region’ candidate (Deen et al., 2015). However, some thalamic effects have also been observed and could contribute to the phenomenon (Bedny et al., 2011). Resting state changes in correlation between two areas do not imply changes in strength of long-range anatomical connectivity. Indeed, in the current case they may well reflect differential functional coupling, rather than strengthening or weakening of anatomical connections. We now discuss this in the Discussion section on page 12, line 301 as follows:

      “Despite these insights, many questions remain regarding the neurobiological mechanisms underlying experience-based functional connectivity changes and their relationship to anatomical development. Long-range anatomical connections between brain regions are already present in infants—even prenatally—though they remain immature (Huang et al., 2009; Kostović et al., 2019, 2021; Takahashi et al., 2012; Vasung, 2017). Functional connectivity changes may stem from local synaptic modifications within these stable structural pathways, consistent with findings that functional connectivity can vary independently of structural connection strength (Fotiadis et al., 2024). Moreover, functional connectivity has been shown to outperform structural connectivity in predicting individual behavioral differences, suggesting that experience-based functional changes may reflect finer-scale synaptic or network-level modulations not captured by macrostructural measures (Ooi et al., 2022). Prior studies also suggest that, even in adults, coordinated sensory-motor experience can lead to enhancement of functional connectivity across sensory-motor systems, indicating that large-scale changes in functional connectivity do not necessarily require corresponding changes in anatomical connectivity (Guerra-Carrillo et al., 2014; Li et al., 2018).”

      It is not clear how changes in correlation patterns among visual areas would produce the connectivity between visual areas and prefrontal areas reported in the current study. Activity in visual areas drives correlations both among visual areas and between visual and prefrontal areas, and the same is true of prefrontal cortices.

      The findings from this study should be more closely linked to the extensive literature surrounding the debate on whether experience plays an instructive or permissive role in visual development (e.g., Crair 1999 Current Opin Neurobiol; Sur et al. 1999 J Neurobiol; Kiorpes 2016 J Neurosci; Stellwagen & Shatz 2002 Neuron; Roy et al. 2020 Nature Communications).

      Response #4: The instructive role suggests that specific experiences or patterns of neural activity directly shape and organize neural circuitry, while the permissive role indicates that such experiences or activity merely enable other factors, such as molecular signals, to influence neural circuit formation (Crair, 1999; Sur et al., 1999). To distinguish whether experience plays an instructive or permissive role, it is essential to manipulate the pattern or information content of neural activity while maintaining a constant overall activity level (Crair, 1999; Roy et al., 2020; Stellwagen & Shatz, 2002). However, both the sighted and blind adult groups have had extensive experience and neural activity in the visual cortices. For the sighted group, activity in the visual cortex is partly driven by bottom-up input from the external environment, through the retina, LGN, and ultimately to the cortex. In contrast, the blind group’s visual cortex activity is partially driven by top-down input from non-visual networks. The precise role of this activity in shaping the observed connectivity patterns remains unclear. Although our study cannot speak to this issue directly, we now link to the relevant literature on page 12, line 320 of the manuscript in the Discussion section as follows:

      “The current findings reveal both effects of vision and effects of blindness on the functional connectivity patterns of the visual cortex. A further open question is whether visual experience plays an instructive or permissive role in shaping neural connectivity patterns. An instructive role suggests that specific sensory experiences or patterns of neural activity directly shape and organize neural circuitry. In contrast, a permissive role implies that sensory experience or neural activity merely facilitates the influence of other factors—such as molecular signals—on the formation and organization of neural circuits (Crair, 1999; Sur et al., 1999). Studies with animals that manipulate the pattern or informational content of neural activity while keeping overall activity levels constant could distinguish between these hypotheses (Crair, 1999; Roy et al., 2020; Stellwagen & Shatz, 2002).”

      The assertion that a few weeks of vision after birth is insufficient to influence connectivity is provocative. Though supported by the study's results, it would benefit from integration with research in animal models showing considerable malleability of networks from early experience (e.g., Akerman et al. 2002 Neuron; Li et al. 2006 Nature Neuroscience; Stacy et al. 2023 J Neuroscience).

      Response #5: We thank the reviewer for their suggestion. The present study found that several weeks of postnatal visual experience is insufficient to significantly alter the long-term connectivity patterns of the visual cortices. In contrast, animal studies have shown that acute visual experience, or even exposure to visual stimuli through unopened eyelids, can robustly influence visual system development (Akerman et al., 2002; Li et al., 2008; Van Hooser et al., 2012). We think this discrepancy may be attributed to the substantial differences in developmental timelines between species. The human lifespan is much longer, and so is the human critical period, making it unclear how to map duration from one species to another. We briefly touched upon the time course issue on page 11, line 289 in the Discussion section as follows:

      “The present results reveal the effects of experience on development of functional connectivity between infancy and adulthood, but do not speak to the precise time course of these effects. Infants in the current sample had between 0 and 20 weeks of visual experience. Comparisons across these infants suggest that several weeks of postnatal visual experience is insufficient to produce a sighted-adult connectivity profile. The time course of development could be anywhere between a few months and years and could be tested by examining data from children of different ages.”

      Substantial differences between the groups are evident in several key aspects of the study, including the number of subjects, brain sizes, imaging parameters, and data preprocessing, all of which are likely to have an impact on the overall signal quality. To clarify how these differences might have impacted correlation differences between groups, it would be essential to include information on the noise ceilings for each correlation analysis within each group.

      Response #6: We thank the reviewer for their suggestion. We now report the split-half noise ceiling for adult and infant groups. For each participant, we first split the rs-fMRI time series into two halves, then calculated the ROI-wise rsFC pattern from the two splits. The split-half noise ceiling was estimated according to Lage-Castellanos et al. (2019). The noise ceilings of the three groups (infants: 0.90 ± 0.056, blind adults: 0.88 ± 0.041, sighted adults: 0.90 ± 0.055) showed no significant difference (one-way ANOVA, F(2,552) = 2.348, p = 0.097). Therefore, we believe that overall signal quality is unlikely to impact our results. We also added the relevant context to the Methods section on page 16, line 447 as follows:

      “Substantial differences between the groups exist in this study, including the number of subjects, brain sizes, imaging parameters, and data preprocessing, all of which are likely to have an impact on the overall signal quality. To address this concern, we compared the split-half noise ceiling across the groups (infants, sighted adults, and blind adults). For each participant, we first split the rs-fMRI time series into two halves, then calculated the ROI-wise rsFC pattern from the two splits. The split-half noise ceiling was estimated according to Lage-Castellanos et al. (2019). The noise ceilings of the three groups (infants: 0.90 ± 0.056, blind adults: 0.88 ± 0.041, sighted adults: 0.90 ± 0.055) showed no significant difference (one-way ANOVA, F(2,552) = 2.348, p = 0.097). Therefore, overall signal quality is unlikely to impact our results.”
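
      The split-half procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the simulated ROI time series, the function names, and the Spearman-Brown correction step are not taken from the manuscript, and the cited procedure of Lage-Castellanos et al. (2019) may differ in detail.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fc_pattern(roi_series):
    """ROI-wise connectivity pattern: one correlation per ROI pair."""
    k = len(roi_series)
    return [pearson(roi_series[i], roi_series[j])
            for i in range(k) for j in range(i + 1, k)]

def split_half_noise_ceiling(roi_series):
    """Correlate the FC patterns of the two halves of the time series.

    The Spearman-Brown correction on the last line is an assumption of
    this sketch, not a detail confirmed by the response letter.
    """
    half = len(roi_series[0]) // 2
    first = [s[:half] for s in roi_series]
    second = [s[half:2 * half] for s in roi_series]
    r = pearson(fc_pattern(first), fc_pattern(second))
    return 2 * r / (1 + r)

def simulate_participant(n_timepoints=400, seed=0):
    """Hypothetical data: five ROIs loading differently on one shared signal."""
    rng = random.Random(seed)
    loadings = [0.2, 0.5, 0.8, 1.1, 1.4]
    shared = [rng.gauss(0, 1) for _ in range(n_timepoints)]
    return [[w * s + rng.gauss(0, 1) for s in shared] for w in loadings]

ceiling = split_half_noise_ceiling(simulate_participant())
print(f"split-half noise ceiling: {ceiling:.2f}")
```

      Per-participant values of this kind could then be compared across groups, as in the one-way ANOVA reported above.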

      In general, it appears that the infant correlations are stronger compared to the other groups. While this could reflect increased coherence or lack of differentiation, it is also possible that it is simply due to the presence of a non-neuronal global signal. Such a signal has the potential to substantially limit the effective range of functional correlations and comparisons with adults. To address this, it is advisable to conduct control analyses aimed at assessing and potentially removing global signals.

      Response #7: We agree with the reviewer that global signal regression (GSR) may help reduce non-neuronal artifacts, such as motion, cardiac, and respiratory signals, which are known to correlate with the global signal. However, the global signal also contains neural signals from gray matter, and removing it can introduce unwanted artifacts, especially for the current study. First, GSR can reduce the physiological accuracy of functional connectivity (FC); second, GSR may have differential effects across groups, potentially introducing additional artifacts in between-group comparisons, as noted by Murphy and Fox (2017). The CompCor method (Behzadi et al., 2007; Whitfield-Gabrieli & Nieto-Castanon, 2012) can estimate global non-neuronal artifacts, as GSR does. Because it estimates these artifacts from signals within the white matter (WM) and cerebrospinal fluid (CSF) masks, but not the gray matter (GM), CompCor is expected to introduce only minimal unwanted bias to the GM signal.
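
      The idea behind a CompCor-style correction can be illustrated with a deliberately simplified sketch: a nuisance time course is estimated only from WM/CSF signals and then regressed out of a GM signal. All data below are simulated, all names are hypothetical, and only the single leading component is used, whereas actual CompCor pipelines typically retain several components alongside other confounds.

```python
import math
import random

def center(x):
    m = sum(x) / len(x)
    return [v - m for v in x]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def pearson(x, y):
    xc, yc = center(x), center(y)
    return dot(xc, yc) / math.sqrt(dot(xc, xc) * dot(yc, yc))

def leading_noise_component(noise_series, n_iter=50):
    """First principal time course of the WM/CSF series (power iteration)."""
    X = [center(s) for s in noise_series]
    u = X[0][:]                                    # starting guess
    for _ in range(n_iter):
        weights = [dot(x, u) for x in X]           # project each voxel onto u
        u = [sum(w * x[t] for w, x in zip(weights, X)) for t in range(len(u))]
        norm = math.sqrt(dot(u, u))
        u = [v / norm for v in u]
    return u

def regress_out(y, regressor):
    """Remove the part of y that is linearly explained by the regressor."""
    yc = center(y)
    beta = dot(yc, regressor) / dot(regressor, regressor)
    return [a - beta * b for a, b in zip(yc, regressor)]

rng = random.Random(1)
T = 300
physio = [rng.gauss(0, 1) for _ in range(T)]       # simulated non-neuronal artifact
neural = [rng.gauss(0, 1) for _ in range(T)]       # simulated neural signal
wm_csf = [[g * p + 0.3 * rng.gauss(0, 1) for p in physio]
          for g in (0.8, 1.0, 1.2, 0.9)]           # artifact seen in WM/CSF voxels
gm = [n + p for n, p in zip(neural, physio)]       # GM = neural + artifact

component = leading_noise_component(wm_csf)
gm_clean = regress_out(gm, component)

print(f"GM vs neural before cleaning: {pearson(gm, neural):.2f}")
print(f"GM vs neural after cleaning:  {pearson(gm_clean, neural):.2f}")
```

      Because the nuisance regressor is built without any GM signal, the neural part of the GM time series is left largely intact in this toy setting.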

      Was there a difference in correlations for preterm vs term neonates? Recent research has suggested that preterm births can have an impact on functional networks, particularly in frontal cortices (e.g., Tokariev et al. 2019; Li et al. 2021 eLife; Zhang et al. 2022 Frontiers in Neuroscience).

      Response #8: We have compared preterm and term neonates for all the main results, including the connectivity from the secondary visual cortex/V1 to non-visual sensory cortices versus prefrontal cortices, the laterality of occipito-frontal connectivity, and the specialization across different fronto-occipital networks. This information is reported in Page 6 line 169 and Supplementary Figure S7. The connectivities of full-term infants are generally higher than those of preterm infants. However, the connectivity patterns of term and preterm infants are very similar.

      The consistency between the current results and prior work (e.g., Burton et al. 2014) is notable, particularly in the observed greater correlations in prefrontal regions and weaker correlations in somato-motor regions for early blind individuals compared to sighted. However, almost all visual-frontal correlations in both groups were negative in that prior study. Some discussion on why positive correlations were found in the current study could help to clarify.

      Response #9: Many other papers have reported positive correlations similar to those found in our study (e.g., Deen et al., 2015; Kanjlia et al., 2021). In contrast, Burton's study identified predominantly negative visual-frontal correlations. We think this is likely because the global signal was regressed out during preprocessing, a methodological choice that can lead to an increase in negative connections (Murphy & Fox, 2017).
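
      The tendency of global signal regression to produce negative correlations can be reproduced in a toy simulation. The sketch below uses made-up data in which every region shares one positive signal; after the across-region mean is regressed out, the residuals sum to zero at every time point, which pushes the average pairwise correlation below zero. Region count, noise levels, and function names are illustrative assumptions.

```python
import math
import random

def center(x):
    m = sum(x) / len(x)
    return [v - m for v in x]

def pearson(x, y):
    xc, yc = center(x), center(y)
    sxy = sum(a * b for a, b in zip(xc, yc))
    return sxy / math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))

def regress_out(y, regressor):
    """Residual of y after removing a linear fit to the regressor."""
    yc, rc = center(y), center(regressor)
    beta = sum(a * b for a, b in zip(yc, rc)) / sum(b * b for b in rc)
    return [a - beta * b for a, b in zip(yc, rc)]

def mean_pairwise_corr(series):
    k = len(series)
    pairs = [pearson(series[i], series[j])
             for i in range(k) for j in range(i + 1, k)]
    return sum(pairs) / len(pairs)

rng = random.Random(2)
n_timepoints, n_regions = 500, 6
shared = [rng.gauss(0, 1) for _ in range(n_timepoints)]
regions = [[s + rng.gauss(0, 1) for s in shared] for _ in range(n_regions)]

# Global signal: mean across regions at each time point.
global_signal = [sum(values) / n_regions for values in zip(*regions)]
regions_gsr = [regress_out(r, global_signal) for r in regions]

before = mean_pairwise_corr(regions)
after = mean_pairwise_corr(regions_gsr)
print(f"mean correlation without GSR: {before:.2f}")
print(f"mean correlation with GSR:    {after:.2f}")
```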

      The term "secondary visual areas" used throughout the paper lacks specificity, and its usage in terms of underlying anatomical and functional areas has been inconsistent in the literature. It would be advisable to adopt a more precise characterization based on functional and/or anatomical criteria.

      Response #10: We specified in the article that the occipital ROIs in the current study are functional areas identified in prior studies of people born blind. These regions respond to three non-visual tasks (language, math, and executive function) in the congenitally blind population and show functional connectivity changes in blind adults (Kanjlia et al., 2016, 2021; Lane et al., 2015; see Figure 1). They are referred to collectively as ‘secondary visual areas’ to distinguish them from V1. Anatomically, these three regions cover the majority of the lateral occipital cortex and part of the ventral occipital cortex, providing a good sample of the connectivity profile of higher-order visual areas. In blind individuals, although these regions respond to non-visual tasks, their exact functions are unknown.

      The inclusion of the ventral temporal cortex in the visual ROIs is currently only depicted in Supplementary Figure S7. To enhance the clarity of the areas of interest analyzed, it would be advisable to illustrate the ventral temporal areas in the main text. Were there notable differences in the frontal correlations between the lateral occipital visual areas and ventral temporal areas?

      Response #11: We thank the reviewer for pointing out this issue. We added a statement about the ventral visual cortex in describing the location of the ROI and added the ventral view of ROIs in Figure 1. The language-responsive and math-responsive ROIs cover both the lateral and ventral visual cortex, whereas executive function (response-conflict) regions cover only the lateral visual cortex. We compared the connectivity patterns of these three regions and found no differences (see Supplementary Fig. S2).

      The blind group results are characterized as reflecting a reorganization in comparison to sighted adults while the results for sighted adults compared to infants are discussed more as a maturation ("adult pattern isn't default but requires experience to establish"). Both the sighted and blind adult groups showed differences from the infant group, and these differences are attributed to the role of experience. Why use "reorganization" for one result and maturation for another?

      Response #12: We agree with the reviewer that both of the adult groups should be thought of as equal in relation to the infants. In other words, the brain develops under one set of experiential conditions or another. We do not think that the adult sighted pattern reflects maturation. Rather, the sighted adult pattern reflects the combined influence of maturation and visual experience. The adult blind pattern reflects the combined influence of maturation and blindness. We use the term ‘reorganization’ to label differences in the blind adults relative to sighted infants. We do so for the purpose of clarity and to remain consistent with terminology in prior literature. However, we agree with the reviewer that the blind group does not reflect ‘reorganization’ intrinsically any more than the sighted adult group.

      The statement that "visual experience is required to set up long-range functional connectivity" is unclear, especially since the infant and blind groups showed stronger long-range functional correlations with PFC.

      Response #13: We revised this sentence to read “visual experience establishes elements of the sighted-adult long-range connectivity” in the Abstract (line 17).

      The statement that the visual ROIs roughly correspond to "the anatomical location of areas such as V5/MT+, LO, V3a, and V4v" appears imprecise. From Supplementary Figure S7, these areas cover anterior portions of ventral temporal cortex (do these span the anatomical location of putative category-selective areas?) and into the intraparietal sulcus.

      Response #14: Thanks to the reviewer for the clarification. The ventral ROIs cover the middle and part of the anterior portion of the ventral temporal lobe, including the putative category-selective areas. Additionally, the dorsal ROIs extend beyond the occipital lobe to the intraparietal sulcus and superior parietal lobule. We have added a more detailed description of the anatomical location of the ROI in the Methods section Page 17 line 489 as follows:

      “Each functional ROI spans multiple anatomical regions and together the secondary visual ROIs tile large portions of lateral occipital, occipito-temporal, dorsal occipital and occipito-parietal cortices. In sighted people, the secondary visual occipital ROIs include the anatomical locations of functional regions such as motion area V5/MT+, the lateral occipital complex (LO), category specific ventral occipitotemporal cortices and dorsally, V3a and V4v. The occipital ROI also covers the middle of the ventral temporal lobe. Dorsally, it extends to the intraparietal sulcus and superior parietal lobule.”

      The motivation for assessing correlations with motor and frontal regions was briefly discussed in the introduction. It would be helpful to reiterate this motivation when first introducing the analyses in the results.

      Response #15: Thank you for the thoughtful suggestion. Upon reflection, we chose to substantially revise the Introduction to more clearly and comprehensively explain the rationale for examining the couplings with motor and frontal regions, rather than reiterating it in the Results section. We believe this revised framing provides a stronger foundation for the analyses that follow, while avoiding redundancy across sections. We hope this addresses the reviewer’s concern.

      Reviewer #2 (Recommendations for the authors):

      Congratulations on a well-written paper and an interesting set of results.

      Reviewer #3 (Recommendations for the authors):

      Abstract:

      Mentioning "sighted infants" does not seem adequate.

      Response #16: In our dataset, newborns (average age at scan = 2.79 weeks) have very limited and immature vision. We agree with the reviewer that long-term visual outcomes cannot be guaranteed without follow-up data. The term "sighted infants" was used operationally to distinguish this cohort from congenitally blind populations.

      In sentences after "Specifically...", it was not clear whether the authors referred to V1 connectivity.

      Response #17: We thank the reviewer for this comment. In the revised abstract, we have removed the original "Specifically..." phrasing and clarified the results.

      Introduction

      Talking about the "instructive effects" of vision might be confusing or misleading. Visual experiences like exposure to oral language are part of the normal/spontaneous environment that allows the infant behavioral acquisitions (contrarily with learnings that occur later during development with instruction like for reading).

      Response #18: We appreciate the reviewer’s concern and would like to clarify that the term “instructive effect” is used here in the sense established in neurodevelopmental studies (Crair, 1999; Sur et al., 1999). In this context, “instructive” refers to activity-dependent mechanisms where patterns of neural activity actively guide the organization of synaptic connectivity, emphasizing that spontaneous or sensory-driven activity (e.g., retinal waves, visual experience) can directly shape circuit refinement, as seen in ocular dominance column formation. In the context of our study, we emphasize that vision plays an instructive role in setting up the balance of connectivity between occipital cortex and non-visual networks.

      For references on the development of connectivity, I would advise citing MRI studies but also studies based on histological approaches (see for example the detailed review by Kostovic et al, NeuroImage 2019).

      Response #19: We thank the reviewer for this suggestion. We have incorporated a discussion on the long-range anatomical connections that emerge as early as infancy, referencing studies that employed diffusion MR imaging and histological methods, as detailed below.

      “Many long-range anatomical connections between brain regions are already established in infants, even before birth, although they are not yet mature (Huang et al., 2009; Kostović et al., 2019, 2021; Takahashi et al., 2012; Vasung, 2017).” (Page 12, line 303 in the manuscript)

      Results

      P7 l170: It might be helpful to be precise that this is "compared with inter-hemispheric connectivity".

      Response #20: We thank the reviewer for this suggestion. To align with our established terminology, we have revised the statement to explicitly contrast within-hemisphere connectivity with between-hemisphere connectivity. The modified text now reads (page 7, line 183 in the manuscript):

      “Compared to sighted adults, blind adults exhibited a stronger dominance of within-hemisphere connectivity over between-hemisphere connectivity. That is, in people born blind, left visual networks are more strongly connected to left PFC, whereas right visual networks are more strongly connected to right PFC.”

      L176-181: It was not clear to me what was the difference between "across" and "between hemisphere connectivity". Would it be informative to test the difference between blind and sighted adults?

      Response #21: We clarify that there is no distinction between the terms “across” and “between hemisphere connectivity”—they refer to the same concept. To ensure consistency, we have revised the text to exclusively use “between hemisphere connectivity” throughout the manuscript. Regarding the comparison between blind and sighted adults, we conducted statistical comparisons between these groups in our analysis, and the results have been incorporated into the revised version (Page 7, line 187 in the manuscript).

      Adding statistics on Figure 3, but also on Figures 1 and 2 might help the reading.

      Response #22: We have added the statistics to Figures 1-4.

      Adding the third comparison in Figure 4 would be possible in my view.

      Response #23: We explored integrating the response-conflict region into Figure 4, but this would require a 3x3 bar chart with pairwise statistical significance markers, which introduced excessive visual complexity that hindered readers’ ability to grasp our intended message. To ensure clarity, we retained the original Figure 4 while providing the complete three-region analysis (including all statistical comparisons) in Supplementary Figure S8 to ensure completeness.

      Methods

      The authors might have to specify ages at birth, and ages at scan (median + range?).

      Response #24: We have added that information in the Methods section as follows:

      “The average age from birth at scan = 2.79 weeks (SD = 3.77, median = 1.57, range = 0 – 19.71); average gestational age at scan = 41.23 weeks (SD = 1.77, median = 41.29, range = 37 – 45.14); average gestational age at birth = 38.43 weeks (SD = 3.73, median = 39.71, range = 23 – 42.71).” (Page 14, line 379 in the manuscript)

      It might be relevant to comment on the range of available fMRI volumes, and the fact that connectivity measures might then be less robust in infants.

      Response #25: We report the range of fMRI volumes in the Methods section (Page 16, Line 449). Adult participants (blind and sighted) underwent 1–4 scanning sessions, each containing 240 volumes (mean scan duration: 710.4 seconds per participant). For infants, all subjects had 2300 fMRI volumes, and we retained a subset of 1600 continuous volumes per subject with the minimum number of motion outliers. While infant connectivity measures may inherently exhibit lower robustness due to developmental and motion-related factors, our infant cohort’s large sample size (n=475) and stringent motion censoring criteria enhance the reliability of group-level inferences. We have integrated this clarification into the Methods section (Page 16, Line 444) as follows:

      "While infant connectivity estimates may be less robust at the individual level compared to adults due to shorter scan durations and higher motion, our cohort’s large sample size (n=475) and rigorous motion censoring mitigate these limitations for group-level analyses. "

      The mention of dHCP 2nd release should be removed from the paragraph on data availability.

      Response #26: We have removed it.

    1. Reviewer #3 (Public review):

      A bias in how people infer the amount of control they have over their environment is widely believed to be a key component of several mental illnesses including depression, anxiety, and addiction. Accordingly, this bias has been a major focus in computational models of those disorders. However, all of these models treat control as a unidimensional property, roughly, how strongly outcomes depend on action. This paper proposes---correctly, I think---that the intuitive notion of "control" captures multiple dimensions in the relationship between action and outcome. In particular, the authors identify one key dimension: the degree to which outcome depends on how much *effort* we exert, calling this dimension the "elasticity of control". They additionally argue that this dimension (rather than the more holistic notion of controllability) may be specifically impaired in certain types of psychopathology. This idea has the potential to change how we think about several major mental disorders in a substantial way, and can additionally help us better understand how healthy people navigate challenging decision-making problems. More concisely, it is a *very good idea*.

      The more concrete contributions, however, are not as strong. In particular, evidence for the paper's most striking claims is weak. Quoting the abstract, these claims are (1) "the elasticity of control [is] a distinct cognitive construct guiding adaptive behavior" and (2) "overestimation of elasticity is associated with elevated psychopathology involving an impaired sense of control."

      Main issues

      I'll highlight the key points.

      - The task cannot distinguish elasticity inference from general learning processes

      - Participants were explicitly instructed about elasticity, with labeled examples

      - The psychopathology claims rely on an invalid interpretation of CCA, and are contradicted by simple correlations (elasticity bias and the sense of agency scale is r=0.03)

      Distinct construct

      Starting with claim 1, there are three subclaims here. (1A) People's behavior is sensitive to differences in elasticity; (1B) there are mental processes specific to elasticity inference, i.e., not falling out of general learning mechanisms; and, implicitly, (1C) people infer elasticity naturally as they go about their daily lives. The results clearly support 1A. However, 1B and 1C are not well supported.

      (1B) The data cannot support the "distinct cognitive construct" claim because the task is too simple to dissociate elasticity inference from more general learning processes (also raised by Reviewer 1). The key behavioral signature for elasticity inference (vs. generic controllability inference) is the transfer across ticket numbers, illustrated in Fig 4. However, this pattern is also predicted by a standard Bayesian learner equipped with an intuitive causal model of the task. Each ticket gives you another chance to board and the agent infers the probability that each attempt succeeds. Crucially, this logic is not at all specific to elasticity or even control. An identical model could be applied to inferring the bias of a coin from observations of whether any of N tosses were heads, a task that is formally identical to this one (at least, the intuitive model of the task; see first minor comment).
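
      A minimal sketch of the kind of generic Bayesian learner described above is given below. It is not the reviewer's or the authors' model: the uniform prior, the grid approximation, and the function names are assumptions made purely for illustration. The learner tracks a single per-attempt success probability p and only ever observes whether any of n attempts succeeded.

```python
def posterior_over_p(observations, grid_size=1000):
    """Grid posterior over the per-attempt success probability p.

    observations: list of (n_attempts, succeeded) pairs, where succeeded
    records whether at least one of the n attempts worked.
    A uniform prior over p is assumed.
    """
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    weights = [1.0] * grid_size
    for n, succeeded in observations:
        for i, p in enumerate(grid):
            p_any = 1 - (1 - p) ** n          # P(at least one success in n tries)
            weights[i] *= p_any if succeeded else 1 - p_any
    total = sum(weights)
    return grid, [w / total for w in weights]

def predicted_success(grid, posterior, n):
    """Posterior-predictive probability that at least one of n attempts succeeds."""
    return sum(w * (1 - (1 - p) ** n) for p, w in zip(grid, posterior))

# Five failed trips, each made with three attempts (hypothetical data).
grid, post = posterior_over_p([(3, False)] * 5)
one_ticket = predicted_success(grid, post, 1)
three_tickets = predicted_success(grid, post, 3)
print(f"predicted success with 1 attempt:  {one_ticket:.3f}")
print(f"predicted success with 3 attempts: {three_tickets:.3f}")
```

      Outcomes observed with three attempts change the prediction for one attempt, so transfer across ticket numbers arises without any elasticity-specific mechanism; the same code applies unchanged to the coin-toss example.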

      Importantly, this point cannot be addressed by showing that the authors' model fits data better than this or any other specific Bayesian model. It is not a question of whether one particular updating rule explains data better than another. Rather, it is a question of whether the task can distinguish between biases in *elasticity* inference versus biases in probabilistic inference more generally. The present task cannot make this distinction because it does not make separate measurements of the two types of inference. To provide compelling evidence that elasticity inference is a "distinct cognitive construct", one would need to show that there are reliable individual differences in elasticity inference that generalize across contexts but do not generalize to computationally similar types of probabilistic inference (e.g. the coin flipping example).

      (1C) The implicit claim that people infer elasticity outside of the experimental task is undermined by the experimental design. The authors explicitly tell people about the two notions of control as part of the training phase: "To reinforce participants' understanding of how elasticity and controllability were manifested in each planet, [participants] were informed of the planet type they had visited after every 15 trips."

      In the revisions, the authors seem to go back and forth on whether they are claiming that people infer elasticity without instruction (I won't quote it here). I'll just note that the examples they provide in the most recent rebuttal are all cases in which one never receives explicit labels about elasticity. If people only infer elasticity when it is explicitly labeled, I struggle to see its relevance for understanding human cognition and behavior.

      Psychopathology

      Finally, I turn to claim 2, that "overestimation of elasticity is associated with elevated psychopathology involving an impaired sense of control." The CCA analysis is in principle unable to support this claim. As the authors correctly note in their latest rebuttal, the CCA does show that "there is a relationship between psychopathology traits and task parameters". The lesion analysis further shows that "elasticity bias specifically contributes to this relationship" (and similarly for the Sense of Agency scale). Crucially, however, this does *not* imply that there is a relationship between those two variables. The most direct test of that relationship is the simple correlation, which the authors report only in a supplemental figure: there is no relationship (r=0.03). Although it is of course possible that there is a relationship that is obscured by confounding variables, the paper provides no evidence, statistical or otherwise, that such a relationship exists.

      Minor comments

      The statistical structure of the task is inconsistent with the framing. In the framing, participants can make either one or two second boarding attempts (jumps) by purchasing extra tickets. The additional attempt(s) will thus succeed with probability p for one ticket and 2p - p^2 for two tickets; the p^2 captures the fact that you only take the second attempt if you fail on the first. A consequence of this is buying more tickets has diminishing returns. In contrast, in the task, participants always jumped twice after purchasing two tickets, and the probability of success with two tickets was exactly double that with one ticket. Thus, if participants are applying an intuitive causal model to the task, the researcher could infer "biases" in elasticity inference that are probably better characterized as effective use of prior information (encoded in the causal model).
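
      The gap between the framing and the implemented task can be made concrete with a short calculation. The per-attempt probability used below is an arbitrary illustrative value, and the function names are hypothetical.

```python
def framing_success(p, tickets):
    """Success probability if each extra ticket buys one independent attempt
    that is only needed when the previous attempts failed."""
    return 1 - (1 - p) ** tickets

def task_success(p, tickets):
    """Success probability as described for the task, where two tickets
    give exactly double the one-ticket probability."""
    return tickets * p

p = 0.3  # illustrative per-attempt success probability
for tickets in (1, 2):
    print(tickets, round(framing_success(p, tickets), 2), round(task_success(p, tickets), 2))
```

      With p = 0.3, the second ticket adds p(1 - p) = 0.21 under the framing but a full 0.30 in the task, which is the diminishing-returns difference noted above.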

      The model is heuristically defined and does not reflect Bayesian updating. For example, it over-estimates maximum control by not using losses with less than 3 tickets (intuitively, the inference here depends on your beliefs about elasticity). Including forced three-ticket trials at the beginning of each round makes this less of an issue; but if you want to remove those trials, you might need to adjust the model. The need to introduce the modified model with kappa is likely another symptom of the heuristic nature of the model updating equations.

    2. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      This research takes a novel theoretical and methodological approach to understanding how people estimate the level of control they have over their environment and how they adjust their actions accordingly. The task is innovative and both it and the findings are well-described (with excellent visuals). They also offer thorough validation for the particular model they develop. The research has the potential to theoretically inform understanding of control across domains, which is a topic of great importance.

      We thank the Reviewer for their favorable appraisal and valuable suggestions, which have helped clarify and strengthen the study’s conclusion. 

      In its revised form, the manuscript addresses most of my previous concerns. The main remaining weakness pertains to the analyses aimed at addressing my suggestion of Bayesian updating as an alternative to the model proposed by the authors. My suggestion was to assume that people perform a form of function approximation to relate resource expenditure to success probability. The authors performed a version of this where people were weighing evidence for a few canonical functions (flat, step, linear), and found that this model underperformed theirs. However, this Bayesian model is quite constrained in its ability to estimate the function relating resources. A more robust test would be to assume a more flexible form of updating that is able to capture a wide range of distributions (e.g., using basis functions, Gaussian processes, or nonparametric estimators; see, e.g., work by Griffiths on human function learning). The benefit of testing this type of model is that it would make contact with a known form of inference that individuals engage in across various settings and therefore could offer a more parsimonious and generalizable account of function learning, whereby learning of resource elasticity is a special case. I defer to the authors as to whether they'd like to pursue this direction, but if not I think it's still important that they acknowledge that they are unable to rule out a more general process like this as an alternative to their model. This pertains also to inferences about individual differences, which currently hinge on their preferred model being the most parsimonious.

      We thank the Reviewer for this thoughtful suggestion. We acknowledge that more flexible function learning approaches could provide a stronger test in favor of a more general account. Our Bayesian model implemented a basis function approach where the weights of three archetypal functions (flat, step, linear) are learned from experience. Testing models with more flexible basis functions would likely require a task with more than three levels of resource investment (1, 2, or 3 tickets). This would make an interesting direction for future work expanding on our current findings. We now incorporate this suggestion in more detail in our updated manuscript (335-341):

      “Second, future models could enable generalization to levels of resource investment not previously experienced. For example, controllability and its elasticity could be jointly estimated via function approximation that considers control as a function of invested resources. Although our implementation of this model did not fit participants’ choices well (see Methods), other modeling assumptions drawn from human function learning [30] or experimental designs with continuous action spaces may offer a better test of this idea.”

      Reviewer #2 (Public review):

      This research investigates how people might value different factors that contribute to controllability in a creative and thorough way. The authors use computational modeling to try to dissociate "elasticity" from "overall controllability," and find some differential associations with psychopathology. This was a convincing justification for using modeling above and beyond behavioral output and yielded interesting results. Notably, the authors conclude that these findings suggest that biased elasticity could distort agency beliefs via maladaptive resource allocation. Overall, this paper reveals important findings about how people consider components of controllability. The authors have gone to great lengths to revise the manuscript to clarify their definitions of "elastic" and "inelastic" and bolster evidence for their computational model, resulting in an overall strong manuscript that is valuable for elucidating controllability dynamics and preferences. 

      We thank the Reviewer for their constructive feedback throughout the review process, which has substantially strengthened our manuscript and clarified our theoretical framework.

      One minor weakness is that the justification for the analysis technique for the relationships between the model parameters and the psychopathology measures remains lacking given the fact that simple correlational analyses did not reveal any significant associations.

      We note that the existence of bivariate relationships is not a prerequisite for the existence of multivariate relationships. Conditioning the latter on the former, therefore, would risk missing important relationships in the data. Ultimately, correlations between pairs of variables do not offer a sensitive test of the general hypothesis that there is a relationship between two sets of variables. As an illustration, consider that elasticity bias correlated in our data (r = .17, p<.001) with the difference between SOA (sense of agency) and SDS (self-rating depression). Notably, SOA and SDS were positively correlated (r = .47, p<.001), and neither of them was correlated with elasticity bias (SOA: r=.04, p=.43; SDS: r=-.06, p=.16). It was a dimension that ran between them that mapped onto elasticity bias. This specific finding is incidental and uncorrected for multiple comparisons, hence we do not report it in the manuscript, but it illustrates the kinds of relationships that cannot be accounted for by looking at bivariate relationships alone.
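
      The pattern described above (two positively correlated scales, neither clearly related to a third variable, whose difference nonetheless tracks it) can be reproduced in a toy simulation. The sketch below is illustrative only: the variable names, the loadings (0.3, 5.8), and the sample size are assumptions chosen for the demonstration, not values taken from the study's data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, seed=0):
    """Toy data: two scales share variance; a third variable tracks their contrast."""
    rng = random.Random(seed)
    soa, sds, bias = [], [], []
    for _ in range(n):
        shared = rng.gauss(0, 1)    # variance common to both hypothetical scales
        contrast = rng.gauss(0, 1)  # dimension running between the two scales
        soa.append(shared + 0.3 * contrast)
        sds.append(shared - 0.3 * contrast)
        # Noisy readout of the contrast dimension (assumed noise level)
        bias.append(contrast + 5.8 * rng.gauss(0, 1))
    return soa, sds, bias


soa, sds, bias = simulate()
diff = [a - b for a, b in zip(soa, sds)]

print("corr(soa, sds)       =", round(pearson(soa, sds), 3))
print("corr(soa, bias)      =", round(pearson(soa, bias), 3))
print("corr(sds, bias)      =", round(pearson(sds, bias), 3))
print("corr(soa-sds, bias)  =", round(pearson(diff, bias), 3))
```

      In this construction, each scale on its own carries only a small share of the contrast dimension, so its marginal correlation with the third variable is much weaker than the correlation obtained from the difference score.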

      Reviewer #3 (Public review):

      A bias in how people infer the amount of control they have over their environment is widely believed to be a key component of several mental illnesses including depression, anxiety, and addiction. Accordingly, this bias has been a major focus in computational models of those disorders. However, all of these models treat control as a unidimensional property, roughly, how strongly outcomes depend on action. This paper proposes---correctly, I think---that the intuitive notion of "control" captures multiple dimensions in the relationship between action and outcome.

      In particular, the authors identify one key dimension: the degree to which outcome depends on how much *effort* we exert, calling this dimension the "elasticity of control". They additionally argue that this dimension (rather than the more holistic notion of controllability) may be specifically impaired in certain types of psychopathology. This idea has the potential to change how we think about several major mental disorders in a substantial way and can additionally help us better understand how healthy people navigate challenging decision-making problems. More concisely, it is a very good idea.

      We thank the Reviewer for their thoughtful engagement with our manuscript. We appreciate their recognition of elasticity as a key dimension of control that has the potential to advance our understanding of psychopathology and healthy decision-making.

      Starting with theory, the authors do not provide a strong formal characterization of the proposed notion of elasticity. There are existing, highly general models of controllability (e.g., Huys & Dayan, 2009; Ligneul, 2021) and the elasticity idea could naturally be embedded within one of these frameworks. The authors gesture at this in the introduction; however, this formalization is not reflected in the implemented model, which is highly task-specific.

      Our formal definition of elasticity, detailed in Supplementary Note 1, naturally extends the reward-based and information-theoretic definitions of controllability by Huys & Dayan (2009) and Ligneul (2021). We now further clarify how the model implements this formalized definition (lines 156-159).

      “Conversely, in the ‘elastic controllability model’, the beta distributions represent a belief about the maximum achievable level of control (𝑎<sub>Control</sub>, 𝑏<sub>Control</sub>) coupled with two elasticity estimates that specify the degree to which successful boarding requires purchasing at least one (𝑎<sub>elastic≥1</sub>, 𝑏<sub>elastic≥1</sub>) or specifically two (𝑎<sub>elastic2</sub>, 𝑏<sub>elastic2</sub>) extra tickets. As such, these elasticity estimates quantify how resource investment affects control. The higher they are, the more controllability estimates can be made more precise by knowing how much resources the agent is willing and able to invest (Supplementary Note 1).”

      Moreover, the authors present elasticity as if it is somehow "outside of" the more general notion of controllability. However, effort and investment are just specific dimensions of action; and resources like money, strength, and skill (the "highly trained birke") are just specific dimensions of state. Accordingly, the notion of elasticity is necessarily implicitly captured by the standard model. Personally, I am compelled by the idea that effort and resource (and therefore elasticity) are particularly important dimensions, ones that people are uniquely tuned to. However, by framing elasticity as a property that is different in kind from controllability (rather than just a dimension of controllability), the authors only make it more difficult to integrate this exciting idea into generalizable models.

      We respectfully disagree that we present elasticity as outside of, or different in kind from, controllability. Throughout the manuscript, we explicitly describe elasticity as a dimension of controllability (e.g., lines 70-72, among many other examples). This is also expressed in our formal definition of elasticity (Supplementary Note 1).

      The argument that vehicle/destination choice is not trivial because people occasionally didn't choose the instructed location is not compelling to me; if anything, the exclusion rate is unusually low for online studies. The finding that people learn more from non-random outcomes is helpful, but this could easily be cast as standard model-based learning very much like what one measures with the Daw two-step task (nothing specific to control here). Their final argument is the strongest, that to explain behavior the model must assume "a priori that increased effort could enhance control." However, more literally, the necessary assumption is that each attempt increases the probability of success (e.g., you're more likely to get a heads in two flips than one). I suppose you can call that "elasticity inference", but I would call it basic probabilistic reasoning.

      We appreciate the Reviewer’s concerns but feel that some of the more subjective comments might not benefit from further discussion. We only note that controllability and its elasticity are features of environmental structure, so in principle any controllability-related inference is a form of model-based learning. The interesting question is whether people account in their world model for that particular feature of the environment.   

      The authors try to retreat, saying "our research question was whether people can distinguish between elastic and inelastic controllability." I struggle to reconcile this with the claim in the abstract "These findings establish the elasticity of control as a distinct cognitive construct guiding adaptive behavior". That claim is the interesting one, and the one I am evaluating the evidence in light of.

      In real-world contexts, it is often trivial that sometimes further investment enhances control and sometimes it does not. For example, students know that if they prepare more extensively for their exams they will likely be able to achieve better grades, but they also know that there is uncertainty in this regard – their grades could improve significantly, modestly, or in some cases, they might not improve at all, depending on the type of exams their study program administers and the knowledge or skills being tested. Our research question was whether in such contexts people learn from experience the degree to which controllability is elastic to invested resources and adapt their resource investment accordingly. Our findings show that they do. 

      The authors argue for CCA by appeal to the need to "account for the substantial variance that is typically shared among different forms of psychopathology". I agree. A simple correlation would indeed be fairly weak evidence. Strong evidence would show a significant correlation after *controlling for* other factors (e.g. a regression predicting elasticity bias from all subscales simultaneously). CCA effectively does the opposite, asking whether, with the help of all the parameters and all the surveys, one can find any correlation between the two sets of variables. The results are certainly suggestive, but they provide very little statistical evidence that the elasticity parameter is meaningfully related to any particular dimension of psychopathology.

      We agree with the Reviewer on the relationship between elasticity and any particular dimension of psychopathology. The CCA asks a different question, namely, whether there is a relationship between psychopathology traits and task parameters, and whether elasticity bias specifically contributes to this relationship. 

      I am very concerned to see that the authors removed the discussion of this limitation in response to my first review. I quote the original explanation here:

      - In interpreting the present findings, it needs to be noted that we designed our task to be especially sensitive to overestimation of elasticity. We did so by giving participants free 3 tickets at their initial visits to each planet, which meant that upon success with 3 tickets, people who overestimate elasticity were more likely to continue purchasing extra tickets unnecessarily. Following the same logic, had we first had participants experience 1 ticket trips, this could have increased the sensitivity of our task to underestimation of elasticity in elastic environments. Such underestimation could potentially relate to a distinct psychopathological profile that more heavily loads on depressive symptoms. Thus, by altering the initial exposure, future studies could disambiguate the dissociable contributions of overestimating versus underestimating elasticity to different forms of psychopathology.

      The logic of this paragraph makes perfect sense to me. If you assume low elasticity, you will infer that you could catch the train with just one ticket. However, when elasticity is in fact high, you would find that you don't catch the train, leading you to quickly infer high elasticity, eliminating the bias. In contrast, if you assume high elasticity, you will continue purchasing three tickets and will never have the opportunity to learn that you could be purchasing only one, so the bias remains.

      The authors attempt to argue that this isn't happening using parameter recovery. However, they only report the *correlation* in the parameter, whereas the critical measure is the *bias*. Furthermore, in parameter recovery, the data-generating and data-fitting models are identical, which will yield the best possible recovery results. Although finding no bias in this setting would support the claims, it cannot outweigh the logical argument for the bias that they originally laid out. Finally, parameter recovery should be performed across the full range of plausible parameter values; using fitted parameters (a detail I could only determine by reading the code) yields biased results because the fitted parameters are themselves subject to the bias (if present). That is, if true low elasticity is inferred as high elasticity, then you will not have any examples of low elasticity in the fitted parameters and will not detect the inability to recover them.

      The logic the Reviewer describes breaks down when one considers the dynamics of participants’ resource investment choices. A low elasticity bias in a participant’s prior belief would make them persist for longer in purchasing a single ticket despite failure, as compared to a person without such a bias. Indeed, the ability of the experimental design to demonstrate low elasticity biases is evidenced by the fact that the majority of participants were fitted with a low elasticity bias (μ = .16 ± .14, where .5 is unbiased). 

      Originally, the Reviewer was concerned that elasticity bias was being confounded with a general deficit in learning. The weak inter-parameter correlations in the parameter recovery test resolved this concern, especially given that, as we now noted, the simulated parameter space encompassed both low and high elasticity biases (range=[.02,.76]). Furthermore, regarding the Reviewer's concern about bias in the parameter recovery, we found no such significant bias with respect to the elasticity bias parameter (Δ(Simulated, Recovered)= -.03, p=.25), showing that our experiment could accurately identify low and high elasticity biases.
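
      The distinction drawn in this exchange between recovery *correlation* and recovery *bias* can be made concrete with a small sketch. It uses a hypothetical estimator that shifts every estimate upward by a constant; the offset (0.1), noise level (0.05), and sample size are assumptions for illustration and do not describe the authors' model or their recovery results.

```python
import math
import random
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def recovery_summary(simulated, recovered):
    """Return (correlation, mean signed error) between simulated and recovered values."""
    r = pearson(simulated, recovered)
    bias = fmean(rec - sim for sim, rec in zip(simulated, recovered))
    return r, bias


rng = random.Random(1)
# Simulated "true" parameters spread over the full plausible range
simulated = [rng.uniform(0.0, 1.0) for _ in range(500)]
# Hypothetical biased estimator: constant upward shift plus small noise
recovered = [s + 0.1 + rng.gauss(0, 0.05) for s in simulated]

r, bias = recovery_summary(simulated, recovered)
print("recovery correlation:", round(r, 3))
print("mean signed error   :", round(bias, 3))
```

      A high correlation therefore does not rule out a systematic shift, which is why the mean signed difference between simulated and recovered values is the more direct check.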

      The statistical structure of the task is inconsistent with the framing. In the framing, participants can make either one or two second boarding attempts (jumps) by purchasing extra tickets. The additional attempt(s) will thus succeed with probability p for one ticket and 2p - p<sup>2</sup> for two tickets; the p<sup>2</sup> term captures the fact that you only take the second attempt if you fail on the first. A consequence of this is that buying more tickets has diminishing returns. In contrast, in the task, participants always jumped twice after purchasing two tickets, and the probability of success with two tickets was exactly double that with one ticket. Thus, if participants are applying an intuitive causal model to the task, they will appear to "underestimate" the elasticity of control. I don't think this seriously jeopardizes the key results, but any follow-up work should ensure that the task's structure is consistent with the intuitive causal model.

      We thank the Reviewer for this comment, and agree the participants may have employed the intuitive understanding the Reviewer describes. This is consistent with our model comparison results, which showed that participants did not assume that control increases linearly with resource investment (lines 677-692). Consequently, this is also not assumed by our model, except perhaps by how the prior is implemented (a property that was supported by model comparison). In the text, we acknowledge that this aspect of the model and participants’ behavior deviates from the true task's structure, and it would be worthwhile to address this deviation in future studies. 

      That said, there is no reason that this will make participants appear to be generally underestimating elasticity. Following exposure to outcomes for one and three tickets, any nonlinear understanding of probabilities would only affect the controllability estimate for two tickets. This would have contrasting effects on the elasticity estimated for the second and third tickets, but on average, it would not change the overall elasticity estimated. On the other hand, if such a participant is only exposed to outcomes for two and three tickets, they would come to judge the difference between the first and second tickets too highly, thereby overestimating elasticity.
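
      The two success-probability schemes contrasted in this exchange can be compared numerically. The sketch below assumes independent attempts that each succeed with the same probability p (the "intuitive causal model"), against a scheme in which success probability scales linearly with the number of attempts (as the Reviewer describes the task); the function names and example values of p are illustrative.

```python
def p_sequential(p, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts


def p_linear(p, attempts):
    """Success probability that scales linearly with the number of attempts."""
    return attempts * p


for p in (0.1, 0.2, 0.3):
    seq, lin = p_sequential(p, 2), p_linear(p, 2)
    # For two attempts, the sequential scheme equals 2p - p**2,
    # so it falls short of the linear scheme by exactly p**2.
    print(f"p={p}: sequential={seq:.3f}, linear={lin:.3f}, gap={lin - seq:.3f}")
```

      The gap between the two schemes is p<sup>2</sup> for two attempts, so the discrepancy grows with the per-attempt success probability.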

      The model is heuristically defined and does not reflect Bayesian updating. For example, it overestimates maximum control by not using losses with less than 3 tickets (intuitively, the inference here depends on your beliefs about elasticity). Including forced three-ticket trials at the beginning of each round makes this less of an issue; but if you want to remove those trials, you might need to adjust the model. The need to introduce the modified model with kappa is likely another symptom of the heuristic nature of the model updating equations.

      Note that we have tested a fully Bayesian model (lines 676-691), but found that this model fitted participants’ choices worse. 

      You're right; saying these analyses provide "no information" was unfair. I agree that this is a useful way to link model parameters with behavior, and they should remain in the paper. However, my key objection still holds: these analyses do not tell us anything about how *people's* prior assumptions influence behavior. Instead, they tell us about how *fitted model parameters* depend on observed behavior. You can easily avoid this misreading by adding a small parenthetical, e.g.

      Thus, a prior assumption that control is likely available **(operationalized by \gamma_controllability)** was reflected in a futile investment of resources in uncontrollable environments.

      We thank the Reviewer for the suggestion and have added this parenthetical (lines 219, 225).

    1. Reviewer #2 (Public review):

      Summary:

      This paper considers the effects of cognitive load (using an n-back task related to font color), predictability, and age on reading times in two experiments. There were main effects of all predictors, but more interesting effects of load and age on predictability. The effect of load is very interesting, but the manipulation of age is problematic, because we don't know what is predictable for different participants (in relation to their age). There are some theoretical concerns about prediction and predictability, and a need to address literature (reading time, visual world, ERP studies).

      Strengths/weaknesses

      It is important to be clear that predictability is not the same as prediction. A predictable word is processed faster than an unpredictable word (something that has been known since the 1970/80s), e.g., Rayner, Schwanenfluegel, etc. But this could be due to ease of integration. I think this issue can probably be dealt with by careful writing (see point on line 18 below). To be clear, I do not believe that the effects reported here are due to integration alone (i.e., that nothing happens before the target word), but the evidence for this claim must come from actual demonstrations of prediction.

      The effect of load on the effects of predictability is very interesting (and also, I note that the fairly novel way of assessing load is itself valuable). Assuming that the experiments do measure prediction, it suggests that they are not cost-free, as is sometimes assumed. I think the researchers need to look closely at the visual world literature, most particularly the work of Huettig. (There is an isolated reference to Ito et al., but this is one of a large and highly relevant set of papers.)

      There is a major concern about the effects of age. See the Results (161-5): this depends on what is meant by word predictability. It's correct if it means the predictability in the corpus. But it may or may not be correct if it refers to how predictable a word is to an individual participant. The texts are unlikely to be equally predictable to different participants, and in particular to younger vs. older participants, because of their different experiences. To put it informally, the newspaper articles may be more geared to the expectations of younger people. But there is also another problem: the LLM may have learned on the basis of language that has largely been produced by young people, and so its predictions are based on what young people are likely to say. Both of these possibilities strike me as extremely likely. So it may be that older adults are affected more by words that they find surprising, but it is also possible that the texts are not what they expect, or the LLM predictions from the text are not the ones that they would make. In sum, I am not convinced that the authors can say anything about the effects of age unless they can determine what is predictable for different ages of participants. I suspect that this failure to control is an endemic problem in the literature on aging and language processing and needs to be systematically addressed.

      Overall, I think the paper makes enough of a contribution with respect to load to be useful to the literature. But for discussion of age, we would need something like evidence of how younger and older adults would complete these texts (on a word-by-word basis) and that they were equally predictable for different ages. I assume there are ways to get LLMs to emulate different participant groups, but I doubt that we could be confident about their accuracy without a lot of testing. But without something like this, I think making claims about age would be quite misleading.

    1. Reviewer #2 (Public review):

      Summary:

      This study investigates the influence of prior stimuli over multiple time scales in a position discrimination task, using pupillometry data and a reanalysis of EEG data from an existing dataset. The authors report consistent history-dependent effects across task-related, task-unrelated, and stimulus-related dimensions, observed across different time scales. These effects are interpreted as reflecting a unified mechanism operating at multiple temporal levels, framed within predictive coding theory.

      Strengths:

      The authors have done a good job in their revision, clarifying important points and stating the limitations of the study clearly.

      I also think they made a valid effort to address and correct issues arising from the temporal dependency confound, although I still wonder whether the best approach would have been to design an experiment in a way that avoided this confound in the first place.

      Overall, this is a substantially improved version, and I particularly appreciate the clarification and correction regarding the direction of the bias in the EEG data (repulsive rather than attractive).

      Weaknesses:

      These are now relatively minor points.

      I believe this latter aspect, the repulsive bias, may deserve further discussion, especially in relation to their behavioral findings and, in particular, to earlier work proposing multi-stage frameworks of serial dependence, where low-level repulsion interacts with attractive biases at higher-level stages (Fritsche et al., 2020; Pascucci et al., 2019; Sheehan & Serences, 2022). The authors may also consider citing some key reviews on serial dependence that discuss both repulsion and attraction in forced-choice and reproduction tasks (Manassi et al., 2023; Pascucci et al., 2023).

      Related to this, after finding the opposite pattern, is the sentence in line 472-473 ("Further, we found an attractive...") and the related argument still valid?

      Regarding my earlier point about former line 197 and Figure 3b,c: what I noticed, similar to the patterns reported in the studies I referenced, is that the data cannot be simply described as showing faster and more accurate responses for small deltas. Responses also appear faster and more accurate for very large deltas, with performance being worse in between. Indeed, as the authors state: "The peak in precision for large Deltas locations is consistent with alternate events being encoded more precisely, while the peak for small offsets may be explained by the attractive bias towards the previous target." I wonder whether it is necessary, or unequivocally supported by the data, to hypothesize two separate mechanisms here. An alternative could be interference effects between consecutive stimuli that are neither identical nor completely different, making the previous one more likely to interfere with the current stimulus representation.

      Finally, this is definitely a minor point, but I still find the reply to my comment about the prediction of stable retinal input rather speculative. Such a prediction would seem more plausible in world-centered coordinates.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The manuscript is quite dense, with some concepts that may prove difficult for the non-specialist. I recommend spending a few more words (and maybe some pictures) describing the difference between task-relevant and task-irrelevant planes. Nice technique, but not instantly obvious. Then we are hit with "stimulus-related", which definitely needs some words (also because it is orthogonal to neither of the above). 

      We agree that the original description of the planes was too terse and have expanded on this in the revised manuscript.

      Line 85 - To test the influence of attention, trials were sorted according to two spatial reference planes, based on the location of the stimulus: task-related and task-unrelated (Fig. 1b). The task-related plane corresponded to participants’ binary judgement (Fig 1b, light cyan vertical dashed line) and the task-unrelated plane was orthogonal to this (Fig 1b, dark cyan horizontal dashed line). For example, if a participant was tasked with performing a left-or-right of fixation judgement, then their task-related plane was the vertical boundary between the left and right side of fixation, while their task-unrelated plane was the horizontal boundary. The former (left-right) axis is relevant to their task while the latter (top-bottom) axis is orthogonal and task irrelevant. This orthogonality can be leveraged to analyze the same data twice (once according to the task-related plane and again according to the task-unrelated plane) in order to compare performance when the relative location of an event is either task relevant or irrelevant.

      Line 183 - whereas task planes were constant, the stimulus-related plane was defined by the location of the stimulus on the previous trial, and thus varied from trial to trial. That is, on each trial, the target is considered a repeat if it changes location by <|90°| relative to its location on the previous trial, and an alternate if it moves by >|90°|.
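
      The repeat/alternate rule quoted above can be sketched as a small function on angular positions in degrees. This is an illustrative reading of the stated rule, not the authors' analysis code; the function names are hypothetical, and returning `None` for an offset of exactly 90° is an assumption, since the text only defines the cases below and above 90°.

```python
def angular_offset(current_deg, previous_deg):
    """Signed smallest angular difference in degrees, in the range [-180, 180)."""
    return (current_deg - previous_deg + 180) % 360 - 180


def classify_trial(current_deg, previous_deg):
    """Label a trial relative to the previous target location."""
    offset = abs(angular_offset(current_deg, previous_deg))
    if offset < 90:
        return "repeat"
    if offset > 90:
        return "alternate"
    return None  # exactly 90 degrees: not covered by the stated rule (assumption)


print(classify_trial(10, 350))   # targets 20 degrees apart across the 0/360 wrap
print(classify_trial(200, 100))  # targets 100 degrees apart
```

      Wrapping the difference before taking its absolute value matters because locations on either side of the 0°/360° boundary are close together on the circle.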

      (2) While I understand that the authors want the three classical separations, I actually found it misleading. Firstly, for a perceptual scientist to call intervals in the order of seconds (rather than milliseconds), "micro" is technically coming from the raw prawn. Secondly, the divisions are not actually time, but events: micro means one-back paradigm, one event previously, rather than defined by duration. Thirdly, meso isn't really a category, just a few micros stacked up (and there's not much data on this). And macro is basically patterns, or statistical regularities, rather than being a fixed time. I think it would be better either to talk about short-term and long-term, which do not have the connotations I mentioned. Or simply talk about "serial dependence" and "statistical regularities". Or both. 

      We agree that the temporal scales defined in the current study are not the only way one could categorize perceptual time. We also agree that by using events to define scales, we ignore the influence of duration. In terms of the categories, we selected these for two reasons: 1) they conveniently group previous phenomena, and 2) they loosely correspond to iconic-, short- and long-term memory. We agree that one could also potentially split it up into two categories (e.g., short- and long-term), but in general, we think any form of discretization will have limitations. For example, Reviewer 1 suggests that the meso category is simply a few micros stacked together. However, there is a rich literature on phenomena associated with sequences of an intermediate length that do not appear to be entirely explained by stacking micro effects (e.g., sequence learning and sequential dependency). We also find that when controlling for micro level effects, there are clear meso level effects. Also, by the logic that meso level effects are just stacked micro effects, one could also argue the same for macro effects. We don’t think this argument is incorrect; rather, we think it exemplifies the challenge of discretising temporal scales. Ultimately, the current study aimed to test whether seemingly disparate phenomena identified in previous work could be captured by unifying principles. To this end we found that these categories were the most useful. However, we have included a “Limitations and future directions” section in the Discussion of the revised manuscript that acknowledges both the alternative scheme proposed by Reviewer 1, and the value of extending this work to consider the influence of duration (as well as events).

      Line 488 - Limitations and future directions. One potential limitation of the current study is the categorization of temporal scales according to events, independent of the influence of event duration. While this simplification of time supports comparison between different phenomena associated with each scale (e.g., serial dependence, sequential dependencies, statistical learning), future work could investigate the role of duration to provide a more comprehensive understanding of the mechanisms identified in the current study.

      Related to this, while the temporal scales applied here conveniently categorized known sensory phenomena, and partially correspond to iconic-, short-, and long-term memory, they are but one of multiple ways to delineate time. For example, temporal scales could alternatively be defined simply as short- and long-term (e.g., by combining micro and meso scale phenomena). However, this could obscure meaningful differences between phenomena associated with sensory persistence and short-term memory, or qualitative differences in the way that short sequences of events are processed.

      (3) More serious is the issue of precision. Again, this is partially a language problem. When people use the engineering terms "precision" and "accuracy" together, they usually use the same units, such as degrees. Accuracy refers to the distance from the real position (so average accuracy gives bias), and precision is the clustering around the average bias, usually measured as standard deviation. Yet here accuracy is percent correct: also a convention in psychology, but not when contrasting accuracy with precision, in the engineering sense. I suggest you change "accuracy" to "percent correct". On the other hand, I have no idea how precision was defined. All I could find was: "mixture modelling was used to estimate the precision and guess rate of reproduction responses, based on the concentration (k) and height of von Mises and uniform distributions, respectively". I do not know what that means.

      In the case of a binary decision, it seems reasonable to use the term “accuracy” to refer to the correspondence between the target state and the response on a task. However, we agree that while our (main) task is binary, the target is not, and nor is the secondary task. We thank the reviewer for bringing this to our attention, as we agree that this will be a likely cause of confusion. To avoid confusion we have specifically referred to “task accuracy” throughout the revised manuscript.

      With regards to precision, our measure of precision is consistent with what Reviewer 1 describes as such, i.e., the clustering of responses. In particular, the von Mises distribution is essentially a Gaussian distribution in circular space, and the kappa parameter defines the width of the distribution, regardless of the mean, with larger values of kappa indicating narrower (more precise) distributions. We could have used standard deviation to assess precision; however, this would incorrectly combine responses on which participants failed to encode the target (e.g., because of a blink) and were simply guessing. To account for these trials, we applied mixture modelling of guess and genuine responses to isolate the precision of genuine responses, as is standard in the visual working memory literature. However, we agree that this was not sufficiently described in the original manuscript and have elaborated on this method in the revised version.

      Line 598 - From the reproduction task, we sought to estimate participants’ recall precision. It is likely that on some trials participants failed to encode the target and were forced to make a response guess. To isolate the recall precision from guess responses, we used mixture modelling to estimate the precision and guess rate of reproduction responses, based on the concentration (k) and height of von Mises and uniform distributions, respectively (Bays et al., 2009). The k parameter of the von Mises distribution reflects its width, which indicates the clustering of responses around a common location.
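
      The mixture of a von Mises component (genuine responses) and a uniform component (guesses) described above can be written down directly. The following is a minimal sketch of such a mixture density and its log-likelihood, not the authors' fitting code; the function names, the series implementation of the Bessel function, and the simulated parameter values (k = 8, guess rate = 0.2) are assumptions for illustration.

```python
import math
import random


def bessel_i0(x, tol=1e-12):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    term, total, k = 1.0, 1.0, 0
    while term > tol * total:
        k += 1
        term *= (x / 2) ** 2 / (k * k)
        total += term
    return total


def von_mises_pdf(theta, mu, kappa):
    """Von Mises density on the circle (angles in radians)."""
    return math.exp(kappa * math.cos(theta - mu)) / (2 * math.pi * bessel_i0(kappa))


def mixture_pdf(error, mu, kappa, guess_rate):
    """Mixture of genuine responses (von Mises) and guesses (uniform on the circle)."""
    return (1 - guess_rate) * von_mises_pdf(error, mu, kappa) + guess_rate / (2 * math.pi)


def log_likelihood(errors, mu, kappa, guess_rate):
    """Summed log density of response errors under the mixture."""
    return sum(math.log(mixture_pdf(e, mu, kappa, guess_rate)) for e in errors)


def simulate_errors(n, kappa, guess_rate, seed=0):
    """Simulate response errors in [-pi, pi) from the mixture (illustrative only)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n):
        if rng.random() < guess_rate:
            errors.append(rng.uniform(-math.pi, math.pi))
        else:
            # vonmisesvariate returns an angle in [0, 2*pi); wrap it to [-pi, pi)
            angle = rng.vonmisesvariate(0.0, kappa)
            errors.append((angle + math.pi) % (2 * math.pi) - math.pi)
    return errors


errors = simulate_errors(1000, kappa=8.0, guess_rate=0.2)
print("log-likelihood at generating parameters:", log_likelihood(errors, 0.0, 8.0, 0.2))
print("log-likelihood at mismatched parameters:", log_likelihood(errors, 0.0, 1.0, 0.8))
```

      In practice the parameters would be estimated by maximizing this log-likelihood; larger k values correspond to narrower (more precise) response distributions, while the guess rate absorbs responses that are unrelated to the target.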

      (4) Previous studies show serial dependence can increase bias but decrease scatter (inverse precision) around the biased estimate. The current study claims to be at odds with that. But are the two measures of precision relatable? Was the real (random) position of the target subtracted from each response, leaving residuals from which the inverse precision was calculated? (If so, the authors should say so.) But if serial dependence biases responses in essentially random directions (depending on the previous position), it will increase the average scatter, decreasing the apparent precision. 

      Previous studies have shown that when serial dependence is attractive there is a corresponding increase in precision around small offsets from the previous item (citations). Indeed, attractive biases will lead to reduced scattering (increased precision) around a central attractor. Consistent with previous studies, and this rationale, we also found an attractive bias coupled with increased precision. To clarify, for the serial dependency analysis, we calculated bias and precision by binning reproduction responses according to the offset between the current and previous target and then performing the same mixture modelling described above to estimate the mean (bias) and kappa (precision) parameters of the von Mises distribution fit to the angular errors. This was not explained in the original manuscript, so we thank Reviewer 1 for bringing this to our attention and have clarified the analysis in the revised version.

      Line 604 - For the serial dependency analysis, we calculated bias and precision by binning reproduction responses according to the angular offset between the current and previous target and then performing mixture modelling to estimate the mean (bias) and κ (precision) parameters of the von Mises distribution.
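The binning step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's code: the helper names (`wrap_deg`, `bin_errors_by_offset`), the 45° bin width, and the sign convention (offset = previous target minus current target) are all hypothetical choices for the example.

```python
import math


def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def bin_errors_by_offset(targets, responses, bin_width=45.0):
    """Group angular response errors by the offset of the previous target.

    targets and responses are sequences of angles in degrees, in trial order.
    Returns a dict mapping each bin centre to the list of errors in that bin.
    """
    bins = {}
    for t in range(1, len(targets)):
        # Offset of the previous target relative to the current one (assumed sign).
        offset = wrap_deg(targets[t - 1] - targets[t])
        # Response error relative to the current target.
        error = wrap_deg(responses[t] - targets[t])
        centre = math.floor(offset / bin_width) * bin_width + bin_width / 2.0
        bins.setdefault(centre, []).append(error)
    return bins
```

The errors collected in each bin would then be passed to a mixture-model fit, yielding one mean (bias) and one κ (precision) estimate per offset bin.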

      (5) I suspect they are not actually measuring precision, but location accuracy. So the authors could use "percent correct" and "localization accuracy". Or be very clear what they are actually doing. 

      As explained in our response to Reviewer 1’s previous comment, we are indeed measuring precision.

      Reviewer #2 (Public review):

      (1) The abstract should more explicitly mention that conclusions about feedforward mechanisms were derived from a reanalysis of an existing EEG dataset. As it is, it seems to present behavioral data only.

      It is not clear what relevance the fact that the data has been analyzed previously has to the results of the current study. However, we do think that it is important to be clear that the EEG recordings were collected separately from the behavioural and eyetracking data, so we have clarified this in the revised abstract.

      Line 7 - By integrating behavioural and pupillometry recordings with electroencephalographical recordings from a previous study, we identify two distinct mechanisms that operate across all scales.

      (2) The EEG task seems quite different from the others, with location and color changes, if I understand correctly, on streaks of consecutive stimuli shown every 100 ms, with the task involving counting the number of target events. There might be different mechanisms and functions involved, compared to the behavioral experiments reported. 

      As stated above, we agree that it is important that readers are aware that the EEG recordings were collected separately to the behavioural and eyetracking data. We were forthright about this in the original manuscript and have now clarified this in the revised abstract. We agree that collecting both sets of data in the same experiment would be a useful validation of the current results and have acknowledged this in a new Limitations and future directions section of the Discussion of the revised manuscript.

      Line 501 - Another limitation of the current study is that the EEG recordings were collected in a separate experiment to the behavioural and pupillometry data. The stimuli and task were similar between experiments, but not identical. For example, the EEG experiment employed coloured arc stimuli presented at a constant rate of ~3.3 Hz and participants were tasked with counting the number of stimuli presented at a target location. By contrast, in the behavioural experiment, participants viewed white blobs presented at an average rate of ~2.8 Hz and performed a binary spatial task coupled with an infrequent reproduction task. An advantage of this was that the sensory responses to stimuli in the EEG recordings were not conflated with motor responses; however, future work combining these measures in the same experiment would serve as a validation for the current results.

      (3) How is the arbitrary choice of restricting EEG decoding to a small subset of parieto-occipital electrodes justified? Blinks and other artifacts could have been corrected with proper algorithms (e.g., ICA) (Zhang & Luck, 2025) or even left in, as decoders are not necessarily affected by noise. Moreover, trials with blinks occurring at the stimulus time should be better removed, and the arbitrary selection of a subset of electrodes, while reducing the information in input to the decoder, does not account for trials in which a stimulus was missed (e.g., due to blinks).

      Electrode selection was based on several factors: 1) reduction of eye movement/blink artifacts (as noted in the original manuscript), 2) consistency with the previous EEG study (Rideaux, 2024) and other similar decoding studies (Buhmann et al., 2024; Harrison et al., 2023; Rideaux et al., 2023), 3) improved signal-to-noise by including only sensors that carry the most position information (as shown in Supplementary Figure 1a and the previous EEG study). We agree that this was insufficiently explained in the original manuscript and have clarified our sensor selection in the revised version.

      Line 631 - We only included the parietal, parietal-occipital, and occipital sensors in the analyses to i) reduce the influence of signals produced by eye movements, blinks, and non-sensory cortices, ii) maintain consistency with similar previous decoding studies (Buhmann et al., 2024; Rideaux, 2024; Rideaux et al., 2025), and iii) improve decoding accuracy by restricting sensors to those that carried spatial position information (Supplementary Fig. 1a).

      (4) The artifact that appears in many of the decoding results is puzzling, and I'm not fully convinced by the speculative explanation involving slow fluctuations. I wonder if a different high-pass filter (e.g., 1 Hz) might have helped. In general, the nature of this artifact requires better clarification and disambiguation.

      We agree that the nature of this artifact requires more clarification and disambiguation. Due to relatively slow changes in the neural signal, which are not stimulus-related, there is a degree of temporal autocorrelation in the recordings. This can be filtered out, for example, by using a stricter high-pass filter; however, we tried a range of filters and found that a cut-off of at least 0.7 Hz is required to remove the artifact, and even a filter of 0.2 Hz introduces other (stimulus-related) artifacts, such as above-chance decoding prior to stimulus onset. These stimulus-related artifacts are due to the temporal smearing of data, introduced by the filtering, and have a more pronounced and complex influence on the results and are more difficult to remove through other means, such as the baseline correction applied in the original manuscript.

      The temporal autocorrelation is detected by the decoder during training and biases it to classify/decode targets that are presented nearby in time as similar. That is, it learns the neural pattern for a particular stimulus location based on the activity produced by the stimulus and the temporal autocorrelation (determined by slow, stimulus-unrelated fluctuations). The latter only accounts for a relatively small proportion of the variance in the neural recordings under normal circumstances and would typically go undetected when simply plotting decoding accuracy as a function of position. However, it becomes weakly visible when decoding accuracy is plotted as a function of distance from the previous target, as now the bias (towards temporally adjacent targets) aligns with the abscissa. Further, it becomes highly visible when the stimulus labels are shuffled, as now the decoder can only learn from the variance associated with the temporal autocorrelation (and not from the activity produced by the stimulus).

      In the linear discriminant analysis, this led to temporally proximal items being more likely to be classified as on the same side. This is why there is above-chance performance for repeat trials (Supplementary Figure 2b), and below-chance performance for alternate trials, even when the labels are shuffled – the temporal autocorrelation produces a general bias towards classifying temporally proximate stimuli as on the same side, which selectively improves the classification accuracy of repeat trials. Fortunately, the bias is relatively constant as a function of time within the epoch and is straightforward to estimate by shuffling the labels, which means that it can be removed through a baseline correction. However, to further demonstrate that the autocorrelation confound cannot account for the differences observed between repeat and alternate trials in the micro classification analysis, we now additionally show the results from a more strictly filtered version of the data (0.7 Hz). These results show a similar pattern as the original, with the additional stimulus-related artifacts introduced by the strict filter, e.g., above-chance decoding prior to stimulus onset.
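The mechanism can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions, not the analysis performed in the study: the simulated "recordings" contain only a slow drift (a random walk per channel) plus noise and no stimulus signal, the labels are drawn independently of the signal (which is what shuffling produces), and a leave-one-out 1-nearest-neighbour classifier stands in for the linear discriminant analysis. All function names and parameter values are hypothetical.

```python
import random


def simulate_drift(n_trials, n_channels, rng, noise_sd=0.3):
    """Slow drift (random walk per channel) plus white noise; no stimulus signal."""
    state = [0.0] * n_channels
    trials = []
    for _ in range(n_trials):
        state = [s + rng.gauss(0.0, 1.0) for s in state]
        trials.append([s + rng.gauss(0.0, noise_sd) for s in state])
    return trials


def loo_nearest_neighbour(trials, labels):
    """Leave-one-out predictions from a 1-nearest-neighbour classifier."""
    preds = []
    for i, x in enumerate(trials):
        best_j, best_d = None, float("inf")
        for j, y in enumerate(trials):
            if j == i:
                continue
            d = sum((a - b) ** 2 for a, b in zip(x, y))
            if d < best_d:
                best_j, best_d = j, d
        preds.append(labels[best_j])
    return preds


def repeat_alternate_accuracy(labels, preds):
    """Accuracy split by whether the label repeats or alternates from the last trial."""
    hits = {"repeat": [], "alternate": []}
    for t in range(1, len(labels)):
        kind = "repeat" if labels[t] == labels[t - 1] else "alternate"
        hits[kind].append(preds[t] == labels[t])
    return {kind: sum(v) / len(v) for kind, v in hits.items()}


def run_demo(seed=0, n_trials=400, n_channels=16):
    rng = random.Random(seed)
    trials = simulate_drift(n_trials, n_channels, rng)
    # Labels carry no information about the signal, as after shuffling.
    labels = [rng.choice([0, 1]) for _ in trials]
    preds = loo_nearest_neighbour(trials, labels)
    return repeat_alternate_accuracy(labels, preds)
```

In this toy setting, `run_demo()` returns above-chance accuracy for repeat trials and below-chance accuracy for alternate trials, even though the labels are unrelated to the signal. Because the offset arises from the drift alone, an estimate obtained from shuffled labels can be subtracted from the unshuffled accuracy as a baseline.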

      In the inverted encoding analysis, the same temporal autocorrelation manifests as temporally proximal trials being decoded as more similar locations. This is why there is increased decoding accuracy for targets with small angular offsets from the previous target, even when the labels are shuffled (Supplementary Figure 3c), because it is on these trials that the bias happens to align with the correct position. This leads to an attractive bias towards the previous item, which is most prominent when the labels are shuffled.

      To demonstrate the phenomenon, we simulated neural recordings from a population of tuning curves and performed the inverted encoding analysis on a clean version of the data and a version in which we introduced temporal autocorrelation. We then repeated this after shuffling the labels. The simulation produced very similar results to those we observed in the empirical data, with a single exception: while precision in the simulated shuffled data was unaffected by autocorrelation, precision in the unshuffled data was clearly affected by this manipulation. This may explain why we did not find a correlation between the shuffled and unshuffled precision in the original manuscript. 

      These results echo those from the classification analysis, albeit in a more continuous space. However, whereas in the classification analysis it was straightforward to perform a baseline correction to remove the influence of general temporal dependency, the more complex nature of the accuracy, precision, and bias parameters over the range of time and delta location makes this approach less appropriate. For example, the bias in the shuffled condition ranged from -180 to 180 degrees, which when subtracted from the bias in the unshuffled condition would produce an equally spurious outcome, i.e., the equal opposite of this extreme bias. Instead, for the inverted encoding analysis, we used the data high-pass filtered at 0.7 Hz. As with the classification analysis, this removed the influence of general temporal dependencies, as indicated by the results of the shuffled data analysis (Supplementary Figure 3f), but it also temporally smeared the stimulus-related signal, resulting in above-chance decoding accuracy prior to stimulus onset (Supplementary Figure 3d). However, we were primarily interested in the pattern of accuracy, precision, and bias as a function of delta location, and less concerned with the precise temporal dynamics of these changes, which appeared relatively stable in the filtered data. Thus, this was the more suitable approach to removing the general temporal dependencies in the inverted encoding analysis and the one that is presented in Figure 3.

      We have updated the revised manuscript in light of these changes, including a fuller description of the artifact and the results from the abovementioned control analyses.

      Figure 3 updated.

      Figure 3 caption - e) Decoding accuracy for stimulus location, from reanalysis of previously published EEG data (17). Inset shows the EEG sensors included in the analysis (blue dots), and black rectangles indicate the timing of stimulus presentations (solid: target stimulus, dashed: previous and subsequent stimuli). f) Decoding accuracy for location, as a function of time and Δ location. Bright colours indicate higher decoding accuracy; absolute accuracy values can be inferred from (e). g-i) Average location decoding (g) accuracy, (h) precision, and (i) bias from 50 – 500 ms following stimulus onset. Horizontal bar in (e) indicates cluster corrected periods of significance; note, all time points were significantly above chance due to temporal smear introduced by strict high-pass filtering (see Supplementary Figure 3 for full details). Note, the temporal abscissa is aligned across (e & f). Shaded regions indicate ±SEM.

      Line 218 - To further investigate the influence of serial dependence, we applied inverted encoding modelling to the EEG recordings to decode the angular location of stimuli. We found that decoding accuracy of stimulus location sharply increased from ~60 ms following stimulus onset (Fig. 3e). Note, to reduce the influence of general temporal dependencies, we applied a 0.7 Hz high-pass filter to the data, which temporally smeared the stimulus-related information, resulting in above-chance decoding accuracy prior to stimulus presentation (for full details, see Supplementary Figure 3). To understand how serial dependence influences the representation of these features, we inspected decoding accuracy for location as a function of both time and Δ location (Fig. 3f). We found that decoding accuracy varied not only as a function of time, but also as a function of Δ location. To characterise this relationship, we calculated the average decoding accuracy from 50 ms until the end of the epoch (500 ms), as a function of Δ location (Fig. 3g). This revealed higher accuracy for targets with larger Δ location. We found a similar pattern of results for decoding precision (Fig. 3h). These results are consistent with the micro temporal context (behavioural) results, showing that targets that alternated were recalled more precisely. Lastly, we calculated the decoding bias as a function of Δ location and found a clear repulsive bias away from the previous item (Fig. 3i). While this result is inconsistent with the attractive behavioural bias, it is consistent with recent studies of serial dependence suggesting an initial pattern of repulsion followed by an attractive bias during the response period (20–22).

      Line 726 - As shown in Supplementary Figure 3, we found the same general temporal dependencies in the decoding accuracy computed using inverted encoding that were found using linear discriminant classification. However, as a baseline correction would not have been appropriate or effective for the parameters decoded with this approach, we instead used a high-pass filter of 0.7 Hz to remove the confound, while being cautious about interpreting the timing of effects produced by this analysis due to the temporal smear introduced by the filter.

      Supplementary Figure 2 updated.

      Supplementary Figure 2 caption - Removal of general micro temporal dependencies in EEG responses. We found that there were differences in classification accuracy for repeat and alternate stimuli in the EEG data, even when stimulus labels were shuffled. This is likely due to temporal autocorrelation within the EEG data due to low frequency signal changes that are unrelated to the decoded stimulus dimension. This signal trains the decoder to classify temporally proximal stimuli as the same class, leading to a bias towards repeat classification. For example, in general, the EEG signal during trial one is likely to be more similar to that during trial two than during trial ten, because of low frequency trends in the recordings. If the decoder has been trained to classify the signal associated with trial one as a leftward stimulus, then it will be more likely to classify trial two as a leftward stimulus too. These autocorrelations are unrelated to stimulus features; thus, to isolate the influence of stimulus-specific temporal context, we subtracted the classification accuracy produced by shuffling the stimulus labels from the unshuffled accuracy (as presented in Figure 2e, f). We confirmed that using a stricter high-pass filter (0.7 Hz) removes this artifact, as indicated by the equal decoding accuracy between the two shuffled conditions. However, the stricter high-pass filter temporally smears the stimulus-related signal, which introduces other (stimulus-related) artifacts, e.g., above-chance decoding accuracy prior to stimulus presentation, that are larger and more complex, i.e., changing over time. Thus, we opted to use the original high-pass filter (0.1 Hz) and apply a baseline correction. a) The uncorrected classification accuracy along task-related and -unrelated planes. Note that these results are the same as the corrected version shown in Figure 2e, because the confound is only apparent when accuracy is grouped according to temporal context.

      b) Same as (a), but split into repeat and alternate stimuli, along (left) task-related and (right) unrelated planes. Classification accuracy when labels are shuffled is also shown. Inset in (a) shows the EEG sensors included in the analysis (blue dots). (c, d) Same as (a, b), but on data filtered using a 0.7 Hz high-pass filter. Black rectangles indicate the timing of stimulus presentations (solid: target stimulus, dashed: previous and subsequent stimuli). Shaded regions indicate ±SEM.

      Supplementary Figure 3 updated.

      Supplementary Figure 3 caption - Removal of general temporal dependencies in EEG responses for inverted encoding analyses. As described in Methods - Neural Decoding, we used inverted encoding modelling of EEG recordings to estimate the decoding accuracy, precision, and bias of stimulus location. Just as in the linear discriminant classification analysis, we also found the influence of general temporal dependencies in the results produced by the inverted encoding analysis. In particular, there was increased decoding accuracy for targets with low Δ location. This was weakly evident in the period prior to stimulus presentation, but clearly visible when the labels were shuffled. These results mirror those from the classification analysis, albeit in a more continuous space. However, whereas in the classification analysis it was straightforward to perform a baseline correction to remove the influence of general temporal dependency, the more complex nature of the accuracy, precision, and bias parameters over the range of time and Δ location makes this approach less appropriate. For example, the bias in the shuffled condition ranged from -180° to 180°, which when subtracted from the bias in the unshuffled condition would produce an equally spurious outcome, i.e., the equal opposite of this extreme bias. Instead, for the inverted encoding analysis, we used the data high-pass filtered at 0.7 Hz. As with the classification analysis, this significantly reduced the influence of general temporal dependencies, as indicated by the results of the shuffled data analysis, but it also temporally smeared the stimulus-related signal, resulting in above-chance decoding accuracy prior to stimulus onset. However, we were primarily interested in the pattern of accuracy, precision, and bias as a function of Δ location, and less concerned with the precise temporal dynamics of these changes. 
Thus, this was the more suitable approach to removing the general temporal dependencies in the inverted encoding analysis and the one that is presented in Figure 3. (a) Decoding accuracy as a function of time for the EEG data filtered using a 0.1 Hz high-pass filter. Inset shows the EEG sensors included in the analysis (blue dots), and black rectangles indicate the timing of stimulus presentations (solid: target stimulus, dashed: previous and subsequent stimuli). (b, c) The same as (a), but as a function of time and Δ location for (b) the original data and (c) data with shuffled labels. (d-f) Same as (a-c), but for data filtered using a 0.7 Hz high-pass filter. Shaded regions in (a, d) indicate ±SEM. Horizontal bars in (a, d) indicate cluster corrected periods of significance; note, all time points in (d) were significantly above chance. Note, the temporal abscissa is vertically aligned across plots (a-c & d-f).

      In the process of performing these additional analyses and simulations, we became aware that the sign of the decoding bias in the inverted encoding analyses had been interpreted in the wrong direction. That is, where we previously reported an initial attractive bias followed by a repulsive bias relative to the previous target, we have in fact found the opposite, an initial repulsive bias followed by an attractive bias relative to the previous target. Based on the new control analyses and simulations, we think that the latter attractive bias was due to general temporal dependencies. That is, in the filtered data, we only observe a repulsive bias. While the bias associated with serial dependence was not a primary feature of the study, this (somewhat embarrassing) discovery has led to reinterpretation of some results relating to serial dependence. However, it is encouraging to see that our results now align with those of recent studies (Fischer et al., 2024; Luo et al., 2025; Sheehan et al. 2024).

      Line 385 - Our corresponding EEG analyses revealed better decoding accuracy and precision for stimuli preceded by those that were different and a bias away from the previous stimulus. These results are consistent with the finding that alternating stimuli are recalled more precisely. Further, while the repulsive pattern of biases is inconsistent with the observed behavioural attractive biases, it is consistent with recent work on serial dependence indicating an initial period of repulsion, followed by an attractive bias during the response period (20–22). These findings indicate that serial dependence and first-order sequential dependencies can be explained by the same underlying principle.

      (5) Given the relatively early decoding results and surprisingly early differences in decoding peaks, it would be useful to visualize ERPs across conditions to better understand the latencies and ERP components involved in the task.

      A rapid presentation design was used in the EEG experiment, and while this is well suited to decoding analyses, unfortunately we cannot resolve ERPs because the univariate signal is dominated by an oscillation at the stimulus presentation frequency (~3 Hz). We agree that this could be useful to examine in future work.

      (6) It is unclear why the precision derived from IEM results is considered reliable while the accuracy is dismissed due to the artifact, given that both seem to be computed from the same set of decoding error angles (equations 8-9).

      This point has been addressed in our response to point (4).

      (7) What is the rationale for selecting five past events as the meso-scale? Prior history effects have been shown to extend much further back in time (Fritsche et al., 2020). 

      We used five previous items in the meso analyses to be consistent with previous research on sequential dependencies (Bertelson, 1961; Gao et al., 2009; Jentzsch & Sommer, 2002; Kirby, 1976; Remington, 1969). However, we agree that these effects likely extend further and have acknowledged this in the revised version of the manuscript.

      Line 240 - Higher-order sequential dependences are an example of how stimuli (at least) as far back as five events in the past can shape the speed and task accuracy of responses to the current stimulus (9, 10); however, note that these effects have been observed for more than five events (20).

      (8) The decoding bias results, particularly the sequence of attraction and repulsion, appear to run counter to the temporal dynamics reported in recent studies (Fischer et al., 2024; Luo et al., 2025; Sheehan & Serences, 2022). 

      This point has been addressed in our response to point (4).

      (9) The repulsive component in the decoding results (e.g., Figure 3h) seems implausibly large, with orientation differences exceeding what is typically observed in behavior. 

      As noted in our response to point (4), this bias was likely due to the general temporal dependency confound and has been removed in the revised version of the manuscript.

      (10) The pattern of accuracy, response times, and precision reported in Figure 3 (also line 188) resembles results reported in earlier work (Stewart, 2007) and in recent studies suggesting that integration may lead to interference at intermediate stimulus differences rather than improvement for similar stimuli (Ozkirli et al., 2025).

      Thank you for bringing this to our attention, we have acknowledged this in the revised manuscript.

      Line 197 - Consistent with our previous binary analysis, and with previous work (19), we also found that responses were faster and more accurate when Δ location was small (Fig. 3b, c).

      (11) Some figures show larger group-level variability in specific conditions but not others (e.g., Figures 2b-c and 5b-c). I suggest reporting effect sizes for all statistical tests to provide a clearer sense of the strength of the observed effects. 

      Yes, as noted in the original manuscript, we find significant differences in variance between the task-related and -unrelated conditions. We think this is due to opposing forces in the task-related condition: 

      “The increased variability of response time differences across the task-related plane likely reflects individual differences in attention and prioritization of responding either quickly or accurately. On each trial, the correct response (e.g., left or right) was equally probable. So, to perform the task accurately, participants were motivated to respond without bias, i.e., without being influenced by the previous stimulus. We would expect this to reduce the difference in response time for repeat and alternate stimuli across the task-related plane, but not the task-unrelated plane. However, attention may amplify the bias towards making faster responses for repeat stimuli, by increasing awareness of the identity of stimuli as either repeats or alternations (17). These two opposing forces vary with task engagement and strategy and thus would be expected to produce increased variability across the task-related plane.” We agree that providing effect sizes may provide a clearer sense of the observed effects and have done so in the revised version of the manuscript.

      Line 739 - For Wilcoxon signed rank tests, the rank-biserial correlation (r) was calculated as an estimate of effect size, where 0.1, 0.3, and 0.5 indicate small, medium, and large effects, respectively (54). For Friedman’s ANOVA tests, Kendall’s W was calculated as an estimate of effect size, where 0.1, 0.3, and 0.5 indicate small, medium, and large effects, respectively (55).
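As a concrete illustration of these two effect sizes, a minimal sketch is given below. It is not the code used in the study, and the manuscript does not state which formulas were used; the sketch assumes the matched-pairs rank-biserial correlation, r = (W+ − W−) / (W+ + W−), and Kendall's W without a tie correction, W = 12S / (n²(k³ − k)). The function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation for paired samples x and y.

    Zero differences are dropped; at least one non-zero difference is required.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (w_pos - w_neg) / (w_pos + w_neg)


def kendalls_w(data):
    """Kendall's W for data given as one list of k condition scores per subject."""
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        for cond, r in enumerate(average_ranks(row)):
            rank_sums[cond] += r
    mean_sum = n * (k + 1) / 2.0
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12.0 * s / (n ** 2 * (k ** 3 - k))
```

Both measures are bounded: the rank-biserial correlation lies between -1 and 1, and Kendall's W lies between 0 (no agreement in condition rankings across participants) and 1 (identical rankings).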

      (12) The statement that "serial dependence is associated with sensory stimuli being perceived as more similar" appears inconsistent with much of the literature suggesting that these effects occur at post-perceptual stages (Barbosa et al., 2020; Bliss et al., 2017; Ceylan et al., 2021; Fischer et al., 2024; Fritsche et al., 2017; Sheehan & Serences, 2022). 

      In light of the revised analyses, this statement has been removed from the manuscript.

      (13) If I understand correctly, the reproduction bias (i.e., serial dependence) is estimated on a small subset of the data (10%). Were the data analyzed by pooling across subjects?

      The dual reproduction task only occurred on 10% of trials. There were approximately 2000 trials, so ~200 reproduction responses. For the micro and macro analyses, this was sufficient to estimate precision within each of the experimental conditions (repeat/alternate, expected/unexpected). However, it is likely that we were not able to reproduce the effect of precision at the meso level across both experiments because we lacked sufficient responses to reliably estimate precision when split across the eight sequence conditions. Despite this, the data was always analysed within subjects.

      (14) I'm also not convinced that biases observed in forced-choice and reproduction tasks should be interpreted as arising from the same process or mechanism. Some of the effects described here could instead be consistent with classic priming. 

      We agree that the results associated with the forced-choice task (response time and task accuracy) were likely due to motor priming, but that a separate (predictive) mechanism may explain the (precision) results associated with the reproduction task. We think these two mechanisms operate across the three temporal scales investigated in the current study.

      Reviewing Editor Comments:

      (1) Clarify task design and measurement: The dense presentation makes it difficult to understand key design elements and their implications. Please provide clearer descriptions of all task elements, and how they relate to each other (EEG vs. behaviour, stimulus plane vs. TR and TU plane, reproduction vs. discrimination and role of priming), and clearly explain how key measures were computed for each of these (e.g., precision, accuracy, reproduction bias).

      In the revised manuscript, we have expanded on descriptions of the source and nature of the data (behavioural and EEG), the different planes analyzed in the behavioural task, and how key metrics (e.g., precision) were computed.

      (2) Offer more insight into underlying data, including original ERP waveforms to aid interpretation of decoding results and the timing of effects. In particular, unpack the decoding temporal confound further.

      In the revised manuscript, we have offered considerably more insight into the decoding results, in particular, the nature of the temporal confound. We were unable to assess ERPs due to the rapid presentation design employed in the EEG experiment.

      (3) Justify arbitrary choices such as electrode selection for EEG decoding (e.g., limiting to parieto-occipital sensors), number of trials in meso scale, and the time terminology itself.

      In the revised manuscript, we have clarified the reasons for electrode selection.

      (4) Discuss deviations from literature: Several findings appear to contradict or diverge from previous literature (e.g., effects of serial dependence). These discrepancies could be discussed in more depth. 

      Upon re-analysis of the serial dependence bias and removal of the temporal confound, the results of the revised manuscript now align with those from previous literature, which has been acknowledged.

      Reviewer #1 (Recommendations for the authors):

      (1) I would like to use my reviewer's prerogative to mention a couple of relevant publications. 

      Galluzzi et al (Journal of Vision, 2022) "Visual priming and serial dependence are mediated by separate mechanisms" suggests exactly that, which is relevant to this study.

      Xie et al. (Communications Psychology, 2025) "Recent, but not long-term, priors induce behavioral oscillations in peri-saccadic vision" also seems relevant to the issue of different mechanisms. 

      Thank you for bringing these studies to our attention. We agree that they are both relevant and have referenced both appropriately in the revised version of the manuscript.

      Reviewer #2 (Recommendations for the authors): 

      (1) I find the discussion on attention and awareness (from line 127 onward) somewhat vague and requiring clarification.

      We agree that this statement was vague and referred to “awareness” without operationalizing it. We have revised this statement to improve clarity.

      Line 135 - However, task-relatedness may amplify the bias towards making faster responses for repeat stimuli, by increasing attention to the identity of stimuli as either repeats or alternations (17).

      (2) Line 140: It's hard to argue that there are expectations that the image of an object on the retina is likely to stay the same, since retinal input is always changing. 

      We agree that retinal input is often changing, e.g., due to saccades, self-motion, and world motion. However, for a prediction to be useful, e.g., to reduce metabolic expenditure or speed up responses, it must be somewhat precise, so a prediction that retinal input will change is not necessarily useful, unless it can specify what it will change to. Given retinal input of x at time t, the range of possible values of x at time t+1 (predicting change) is infinite. By contrast, if we predict that x=x at time t+1 (no change), then we can make a precise prediction. There is, of course, other information that could be used to reduce the parameter space of predicted change from x at time t, e.g., the value of x at time t-1, and we think this drives predictions too. However, across the infinite distribution of changes from x, zero change will occur more frequently than any other value, so we think it’s reasonable to assert that the brain may be sensitive to this pattern.

      (3) Line 564: The gambler's fallacy usually involves sequences longer than just one event.

      Yes, we agree that this phenomenon is associated with longer sequences. This section of the manuscript was in regards to previous findings that were not directly relevant to the current study and has been removed in the revised version.

      (4) In the shared PDF, the light and dark cyan colors used do not appear clearly distinguishable. 

      I expect this is due to poor document processing or low-quality image embeddings. I will check that they are distinguishable in the final version.

      References: 

      Barbosa, J., Stein, H., Martinez, R. L., Galan-Gadea, A., Li, S., Dalmau, J., Adam, K. C. S., Valls-Solé, J., Constantinidis, C., & Compte, A. (2020). Interplay between persistent activity and activity-silent dynamics in the prefrontal cortex underlies serial biases in working memory. Nature Neuroscience, 23(8). https://doi.org/10.1038/s41593-020-0644-4

      Bliss, D. P., Sun, J. J., & D'Esposito, M. (2017). Serial dependence is absent at the time of perception but increases in visual working memory. Scientific Reports, 7(1), 14739.

      Ceylan, G., Herzog, M. H., & Pascucci, D. (2021). Serial dependence does not originate from low-level visual processing. Cognition, 212, 104709. https://doi.org/10.1016/j.cognition.2021.104709

      Fischer, C., Kaiser, J., & Bledowski, C. (2024). A direct neural signature of serial dependence in working memory. eLife, 13. https://doi.org/10.7554/eLife.99478.1

      Fritsche, M., Mostert, P., & de Lange, F. P. (2017). Opposite effects of recent history on perception and decision. Current Biology, 27(4), 590-595. 

      Fritsche, M., Spaak, E., & de Lange, F. P. (2020). A Bayesian and efficient observer model explains concurrent attractive and repulsive history biases in visual perception. eLife, 9, e55389. https://doi.org/10.7554/eLife.55389

      Gekas, N., McDermott, K. C., & Mamassian, P. (2019). Disambiguating serial effects of multiple timescales. Journal of Vision, 19(6), 24.

      Luo, M., Zhang, H., Fang, F., & Luo, H. (2025). Reactivation of previous decisions repulsively biases sensory encoding but attractively biases decision-making. PLOS Biology, 23(4), e3003150. https://doi.org/10.1371/journal.pbio.3003150

      Ozkirli, A., Pascucci, D., & Herzog, M. H. (2025). Failure to replicate a superiority effect in crowding. Nature Communications, 16(1), 1637. https://doi.org/10.1038/s41467-025-56762-5

      Sheehan, T. C., & Serences, J. T. (2022). Attractive serial dependence overcomes repulsive neuronal adaptation. PLOS Biology, 20(9), e3001711.

      Stewart, N. (2007). Absolute identification is relative: A reply to Brown, Marley, and Lacouture (2007). Psychological Review, 114, 533-538. https://doi.org/10.1037/0033-295X.114.2.533

      Treisman, M., & Williams, T. C. (1984). A theory of criterion setting with an application to sequential dependencies. Psychological Review, 91(1), 68.

      Zhang, G., & Luck, S. J. (2025). Assessing the impact of artifact correction and artifact rejection on the performance of SVM- and LDA-based decoding of EEG signals. NeuroImage, 316, 121304. https://doi.org/10.1016/j.neuroimage.2025.121304

  5. mathieubcd.github.io
    1. Scenario analysis is a method in which multiple potential future states (or outcomes) are forecast. It is not constrained by events of the past, which may not capture the impact of changes in the environment; rather it uses both trends (the known) and uncertainties (the unknown) to predict a range of possible future scenarios.

      Healthcare leaders often rely too heavily on past data, even though the future rarely unfolds the same way as the past. Scenario analysis encourages organizations to think in terms of possibilities, not certainties, which is especially relevant in healthcare, where conditions can change quickly. For example, we can plan for best-case, worst-case, and most likely outcomes during a pandemic. This improves resource planning and highlights the risks of making decisions based on outdated assumptions. It’s a reminder that uncertainty should be treated as part of strategy, not just as an obstacle.

    1. Are we to keep the people of India ignorant in order that we may keep them submissive? Or do we think that we can give them knowledge without awakening ambition? Or do we mean to awaken ambition and to provide it with no legitimate vent? Who will answer any of these questions in the affirmative? Yet one of them must be answered in the affirmative, by every person who maintains that we ought permanently to exclude the natives from high office. I have no fears. The path of duty is plain before us: and it is also the path of wisdom, of national prosperity, of national honor.

      Here, Macaulay challenges the logic of permanently excluding Indians from higher office under British rule. He frames the issue as a series of rhetorical questions, pointing out the contradictions in denying education and advancement to Indians while still claiming to rule justly. His language reveals both a moral stance and a pragmatic one: keeping India submissive through ignorance is unjust and also unwise for Britain’s long-term prosperity. By insisting that knowledge will naturally create ambition, he argues that denying Indians political opportunity would lead to instability. Overall, the passage reveals Macaulay’s conviction that the gradual inclusion of Indians into governance was not only a duty but also a means to strengthen Britain’s honor and secure its empire.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      The authors aim to explore the effects of the electrogenic sodium-potassium pump (Na⁺/K⁺-ATPase) on the computational properties of highly active spiking neurons, using the weakly-electric fish electrocyte as a model system. Their work highlights how the pump's electrogenicity, while essential for maintaining ionic gradients, introduces challenges in neuronal firing stability and signal processing, especially in cells that fire at high rates. The study identifies compensatory mechanisms that cells might use to counteract these effects, and speculates on the role of voltage dependence in the pump's behavior, suggesting that Na⁺/K⁺-ATPase could be a factor in neuronal dysfunctions and diseases.

      Strengths:

      (1) The study explores a less-examined aspect of neural dynamics: the effects of Na⁺/K⁺-ATPase electrogenicity. It offers a new perspective by highlighting the pump's role not only in ion homeostasis but also in its potential influence on neural computation.

      (2) The mathematical modeling used is a significant strength, providing a clear and controlled framework to explore the effects of the Na⁺/K⁺-ATPase on spiking cells. This approach allows for the systematic testing of different conditions and behaviors that might be difficult to observe directly in biological experiments.

      (3) The study proposes several interesting compensatory mechanisms, such as sodium leak channels and extracellular potassium buffering, which provide useful theoretical frameworks for understanding how neurons maintain firing rate control despite the pump's effects.

      Weaknesses:

      (1) While the modeling approach provides valuable insights, the lack of experimental data to validate the model's predictions weakens the overall conclusions.

      (2) The proposed compensatory mechanisms are discussed primarily in theoretical terms without providing quantitative estimates of their impact on the neuron's metabolic cost or other physiological parameters.

      Comments on revisions:

      The revised manuscript is notably improved.

      We thank the reviewer for their concise and accurate summary and appreciate the constructive feedback on the article’s strengths and weaknesses. Experimental work is beyond the scope of our modeling-based study. However, we would like our work to serve as a framework for future experimental studies into the role of the electrogenic pump current (and its possible compensatory currents) in disease, and its role in evolution of highly specialized excitable cells (such as electrocytes).

      Quantitative estimates of metabolic costs in this study are limited to the ATP that is required to fuel the Na⁺/K⁺ pump. By integrating the net pump current over time and dividing by one elementary charge, one can find the ATP consumed by the Na⁺/K⁺ pump, and hence its rate of ATP consumption, for either compensatory mechanism. The difference in net pump current is thus proportional to ATP consumption, which allows for a direct comparison of the cost efficiency of the Na⁺/K⁺ pump for each proposed compensatory mechanism. The Na⁺/K⁺ pump is, however, not the only ATP-consuming element in the electrocyte, and some of the compensatory mechanisms induce other costs related to cell ‘housekeeping’ or presynaptic processes. We have now added a section in the appendix titled ‘Considerations on metabolic costs of compensatory mechanisms’ (section 11.4), where we provide rough estimates of the influence of the compensatory mechanisms on the total metabolic costs of the cell and membrane space occupation. Although we argue that, according to these rough estimates, the impact of the discussed compensatory mechanisms could be significant, the absence of more detailed experimental quantification means that a plausible quantitative cost estimate at the whole-cell level remains beyond the scope of this article.
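
      The charge-to-ATP conversion described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each pump cycle hydrolyses one ATP and moves one net elementary charge out of the cell (three Na⁺ out, two K⁺ in), and the function names and the constant 1 nA test current are hypothetical.

```python
# Minimal sketch: ATP use from a sampled net pump current.
# Assumption: one ATP per pump cycle, one net elementary charge exported per cycle.

E_CHARGE = 1.602176634e-19  # elementary charge in coulombs


def atp_consumed(times_s, pump_current_a):
    """Integrate net pump current (A) over time (s) with the trapezoidal rule
    and convert the resulting charge to a number of ATP molecules."""
    charge_c = 0.0
    for k in range(1, len(times_s)):
        dt = times_s[k] - times_s[k - 1]
        charge_c += 0.5 * (pump_current_a[k] + pump_current_a[k - 1]) * dt
    return charge_c / E_CHARGE


def atp_rate(times_s, pump_current_a):
    """Average ATP molecules consumed per second over the sampled interval."""
    duration_s = times_s[-1] - times_s[0]
    return atp_consumed(times_s, pump_current_a) / duration_s


# Hypothetical example: a constant net pump current of 1 nA for one second.
times = [0.0, 0.5, 1.0]
current = [1e-9, 1e-9, 1e-9]
rate = atp_rate(times, current)
```

      Comparing two compensatory mechanisms then amounts to calling `atp_rate` on the pump current trace of each simulation and taking the ratio or difference.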

      Reviewer #1 (Recommendations for the authors):

      I just have a few recommendations on the updated manuscript.

      (1) When exploring the different roles of Na⁺/K⁺-ATPase in the Results section, the authors employed many different models. For instance, the voltage equation on page 15, voltage equation (2) on page 22, voltage equation (12) on page 24, voltage equation (30) on page 32, and voltage equation (38) on page 35 are presented as the master equations for their respective biophysical models. Meanwhile, the phase models are presented on page 29 and page 33. I would recommend that the authors clearly specify which equations correspond to each subsection of the Results section and explicitly state which equations were used to generate the data in each figure. This would help readers more easily follow the connections between the models, the results, and the figures.

      We thank the reviewer for pointing out that the links between the different voltage equations and the results could be expressed more explicitly in the article. All simulations were done using the ‘master equation’ expressed in Eq. 2, and the other voltage equations that are specified in the article (in the new version of the article Eqs. 13, 31, and 39) are reformulations of Eq. 2 to analytically show different properties of the voltage equation (Eq. 2). This has now been mentioned in the article when formulating the voltage equations, and the equation for the total leak current (in the new version Eq. 3) has been added for completeness.

      (2) The authors may want to revisit their description and references concerning Eigenmannia virescens. For example, wave-type weakly electric fish (e.g., Eigenmannia) and pulse-type weakly electric fish (e.g., Gymnotus carapo) exhibit large differences, so references 52-55 may be inappropriate for subsection 4.3.1, as these studies focus on Gymnotus carapo. Additionally, even within wave-type species, chirp patterns vary. For example, Eigenmannia can exhibit short "pause"-type chirps, whereas Apteronotus leptorhynchus (another wave-type fish) does not (https://pubmed.ncbi.nlm.nih.gov/14692494/).

      We thank the reviewer for pointing this out. The citations and phrasing in sections 4.3.1 and 4.3.2 have been updated to specifically refer to the weakly electric fish E. virescens.

      (3) Table on page 21: Please explain why the parameter value (13.5 mM) of [Na⁺]_in is 10 times larger than its value (1.35 mM) in reference [26]. How does this value (13.5 mM) compare with the range of the variable [Na⁺]_in in equation (6)?

      The intracellular sodium concentration in reference [26] was reported to be 1.35 mM, but the authors also reported an extracellular sodium concentration of 120 mM, and a sodium reversal potential of 55 mV. Upon calculating the sodium reversal potential, we found that an intracellular sodium concentration of 1.35 mM would give a sodium reversal potential of 113 mV. An intracellular sodium concentration of 13.5 mM, on the other hand, leads to the reported and physiological reversal potential of 55 mV. This has now been clarified in the article, and the connection between this value and Eq. 6 (Eq. 7 in the new version) has also been clarified.
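
      The reversal-potential check above follows from the Nernst equation, E = (RT/zF) ln([Na⁺]_out/[Na⁺]_in). The sketch below reproduces the comparison under one stated assumption: the temperature (about 20 °C) is not given in the response and is chosen here for illustration, as are the function and variable names.

```python
import math

R_GAS = 8.314462618    # molar gas constant, J/(mol K)
FARADAY = 96485.33212  # Faraday constant, C/mol


def nernst_mv(c_out_mm, c_in_mm, temp_k=293.15, valence=1):
    """Nernst reversal potential in mV for an ion with the given valence.
    temp_k defaults to about 20 degrees C, which is an assumption."""
    return 1000.0 * R_GAS * temp_k / (valence * FARADAY) * math.log(c_out_mm / c_in_mm)


# Extracellular sodium of 120 mM with the two candidate intracellular values.
e_na_reported_conc = nernst_mv(120.0, 1.35)   # far above the reported 55 mV
e_na_corrected_conc = nernst_mv(120.0, 13.5)  # close to the reported 55 mV
```

      Under this temperature assumption, the 1.35 mM value gives a reversal potential near 113 mV, whereas 13.5 mM gives a value near 55 mV, consistent with the figures quoted in the response.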

      Reviewer #2 (Public review):

      Summary:

      The paper by Weerdmeester, Schleimer, and Schreiber uses computational models to present the biological constraints under which electrocytes - specialized, highly active cells that facilitate electro-sensing in weakly electric fish-may operate. The authors suggest potential solutions that these cells could employ to circumvent these constraints.

      Electrocytes are highly active or spiking (greater than 300 Hz) for sustained periods (for minutes to hours), and such activity is possible due to an influx of sodium ions into, and an efflux of potassium ions from, these cells with each spike. The resulting ion imbalance must be restored, which in electrocytes, as with many other biological cells, is facilitated by the Na-K pumps at the expense of biological energy, i.e., ATP molecules. For each ATP molecule the pump uses, three positively charged sodium ions from the intracellular space are exchanged for two positively charged potassium ions from the extracellular space. This creates a net efflux of positive ions into the extracellular space, resulting in hyperpolarized potentials for the cell over time. For most cells, this does not pose an issue, as their firing rate is much slower, and other compensatory mechanisms and pumps can effectively restore the ion imbalances. However, in the electrocytes of weakly electric fish, which spike at exceptionally high rates, the net efflux of positive ions presents a challenge. Additionally, these cells are involved in critical communication and survival behaviors, underscoring their essential role in reliable functioning.

      In a computational model, the authors test four increasingly complex solutions to the problem of counteracting the hyperpolarized states that occur due to continuous NaK pump action to sustain baseline activity. First, they propose a solution for a well-matched Na leak channel that operates in conjunction with the NaK pump, counteracting the hyperpolarizing states naturally. Their model shows that when such an orchestrated Na leak current is not included, quick changes in the firing rates could have unexpected side effects. Secondly, they study the implications of this cell in the context of chirps-a means of communication between individual fish. Here, an upstream pacemaking neuron entrains the electrocyte to spike, which ceases to produce a so-called chirp - a brief pause in the sustained activity of the electrocytes. In their model, the authors demonstrate that including the extracellular potassium buffer is necessary to obtain a reliable chirp signal. Thirdly, they tested another means of communication in which there was a sudden increase in the firing rate of the electrocyte, followed by a decay to the baseline. For this to occur reliably, the authors emphasize that a strong synaptic connection between the pacemaker neuron and the electrocyte is necessary. Finally, since these cells are energy-intensive, they hypothesize that electrocytes may have energy-efficient action potentials, for which their NaK pumps may be sensitive to the membrane voltages and perform course correction rapidly.

      Strengths:

      The authors extend an existing electrocyte model (Joos et al., 2018) based on the classical Hodgkin and Huxley conductance-based models of sodium and potassium currents to include the dynamics of the sodium-potassium (NaK) pump. The authors estimate the pump's properties based on reasonable assumptions related to the leak potential. Their proposed solutions are valid and may be employed by weakly electric fish. The authors explore theoretical solutions to electrosensing behavior that compound and suggest that all these solutions must be simultaneously active for the survival and behavior of the fish. This work provides a good starting point for conducting in vivo experiments to determine which of these proposed solutions the fish employ and their relative importance. The authors include testable hypotheses for their computational models.

      Weaknesses:

      The model for action potential generation simplifies ion dynamics by considering only sodium and potassium currents, excluding other ions like calcium. The ion channels considered are assumed to be static, without any dynamic regulation such as post-translational modifications. For instance, a sodium-dependent potassium pump could modulate potassium leak and spike amplitude (Markham et al., 2013).

      This work considers only the sodium-potassium (NaK) pumps to restore ion gradients. However, in many cells, several other ion pumps, exchangers, and symporters are simultaneously present and actively participate in restoring ion gradients. When sodium currents dominate action potentials, and thus when NaK pumps play a critical role, such as the case in Eigenmannia virescens, the present study is valid. However, since other biological processes may find different solutions to address the pump's non-electroneutral nature, the generalizability of the results in this work to other fast-spiking cell types is limited. For example, each spike could include a small calcium ion influx that could be buffered or extracted via a sodium-calcium exchanger.

      We thank the reviewer for the detailed summary and the updated identified strengths and weaknesses. The current article indeed focuses on and isolates the interplay between sodium currents, potassium currents, and sodium-potassium pump currents. As discussed in section 5.1, in excitable cells where these currents are the main players in action-potential generation, the results presented in this article are applicable. The contribution of post-translational effects of ion channels, other ionic currents, and other active transporters and pumps, could be exciting avenues for further studies.

      Reviewer #2 (Recommendations for the authors):

      Thank you for addressing my comments.

      All the figures are now consistent. The color schema used is clear.

      The methods and discussions expansions improve the paper.

      Including the model assumptions and simplifications is appreciated.

      Including internal references is helpful.

      The equations are clear, and the references have been fixed.

      I am content with the changes. I have updated my review accordingly.

      We thank the reviewer for their initial constructive comments that led to the significant improvement of the article.

      Page : 3 Line : 113 Author : Unknown Author 07/24/2025 

      Although this is technically correct, the article is about electrocommunication signals and does not focus on sensing.

      Page : 3 Line : 153 Author : Unknown Author 07/24/2025

      electrocommunication

      Page : 4 Line : 164 Author : Unknown Author 07/24/2025 

      Judging from the cited article, I think this should be a sodium-dependent potassium current.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors developed a sequence-based method to predict drug-interacting residues in IDPs, building on their recent work on predicting the transverse relaxation rates (R2) of IDPs, which was trained on 45 IDP sequences and their corresponding R2 values. The main finding is that IDPs interact with drugs mostly through aromatic residues, which is easy to understand because most drugs contain aromatic rings. They validated the method using several case studies, and the predictions are in accordance with chemical shift perturbations and MD simulations. The location of the predicted residues serves as a starting point for ligand optimization.

      Strengths:

      This work provides the first sequence-based prediction method to identify potential drug-interacting residues in IDPs. The validity of the method is supported by case studies. It is easy to use, and no time-consuming MD simulations and NMR studies are needed.

      Weaknesses:

      The method does not depend on the information of binding compounds, which may give general features of IDP-drug binding. However, due to the size and chemical structures of the compounds (for example, how many aromatic rings), the number of interacting residues varies, which is not considered in this work. Lacking specific information may restrict its application in compound optimization, aiming to derive specific and potent binding compounds.

      We fully recognize that different compounds may have different interaction propensity profiles along the IDP sequence. In future studies, we will investigate compound-specific parameter values. The limiting factor is training data, but such data are beginning to be available.

      Reviewer #2 (Public review):

      Summary:

      In this work, the authors introduce DIRseq, a fast, sequence-based method that predicts drug-interacting residues (DIRs) in IDPs without requiring structural or drug information. DIRseq builds on the authors' prior work looking at NMR relaxation rates, and presumes that those residues that show enhanced R2 values are the residues that will interact with drugs, allowing these residues to be nominated from the sequence directly. By making small modifications to their prior tool, DIRseq enables the prediction of residues seen to interact with small molecules in vivo.

      Strengths:

      The preprint is well written and easy to follow.

      Weaknesses:

      (1) The DIRseq method is based on SeqDYN, which itself is a simple (which I do not mean as a negative - simple is good!) statistical predictor for R2 relaxation rates. The challenge here is that R2 rates cover a range of timescales, so the physical intuition as to what exactly elevated R2 values mean is not necessarily consistent with "drug interacting". Presumably, the authors are not using the helix boost component of SeqDYN here (it would be good to explicitly state this). This is not necessarily a weakness, but I think it would behove the authors to compare a few alternative models before settling on the DIRseq method, given the somewhat ad hoc modifications to SeqDYN to get DIRseq.

      Actually, the factors that elevate R2 are well-established. These are local interactions and residual secondary structures (if any). The basic assumption of our method is that intra-IDP interactions that elevate R2 convert to IDP-drug interactions. This assumption was supported by our initial observation that the drug interaction propensity profiles predicted using the original SeqDYN parameters already showed good agreement with CSP profiles. We only made relatively small adjustments to the parameters to improve the agreement. Indeed, we did not apply the helix boost portion of SeqDYN to DIRseq, and now state so explicitly (p. 4, second last paragraph). We now also compare DIRseq with several alternative models, as summarized in new Table S2.

      Specifically, the authors previously showed good correlation between the stickiness parameter of Tesei et al. and the inferred "q" parameter for SeqDYN; as such, I am left wondering if comparable accuracy would be obtained simply by taking the stickiness parameters directly and using these to predict "drug interacting residues", at which point I'd argue we're not really predicting "drug interacting residues" as much as we're predicting "sticky" residues, using the stickiness parameters. It would, I think, be worth the authors comparing the predictive power obtained from DIRseq with the predictive power obtained by using the lambda coefficients from Tesei et al. in the model, local density of aromatic residues, local hydrophobicity (note that Tesei et al. have tabulated a large set of hydrophobicity scores!) and the raw SeqDYN predictions. In the absence of lots of data to compare against, this is another way to convince readers that DIRseq offers reasonable predictive power.

      We now compare predictions of these various parameter sets, and report the results in Table S2. In short, among all the tested parameter sets, DIRseq has the best performance as measured by (1) strong correlations between prediction scores and CSPs and (2) high true positives and low false positives (p. 7-9).
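
      The two performance measures named above can be illustrated with a short sketch. This is not the authors' evaluation code: the Pearson correlation is one plausible choice of correlation measure, the cutoffs that define "predicted" and "observed" interacting residues are hypothetical, and the per-residue scores and CSP values are made-up toy numbers.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def confusion_counts(scores, csps, score_cut, csp_cut):
    """Count true positives, false positives and false negatives, calling a
    residue 'predicted' if its score reaches score_cut and 'observed' if its
    CSP reaches csp_cut (both cutoffs are illustrative assumptions)."""
    tp = fp = fn = 0
    for score, csp in zip(scores, csps):
        predicted = score >= score_cut
        observed = csp >= csp_cut
        if predicted and observed:
            tp += 1
        elif predicted:
            fp += 1
        elif observed:
            fn += 1
    return tp, fp, fn


# Toy per-residue data (not from the paper).
scores = [0.2, 0.9, 0.8, 0.1, 0.7, 0.3]
csps = [0.01, 0.08, 0.06, 0.02, 0.01, 0.05]
r = pearson(scores, csps)
tp, fp, fn = confusion_counts(scores, csps, score_cut=0.5, csp_cut=0.04)
```

      Running the same two functions on the output of each candidate parameter set would give a like-for-like comparison of the kind summarized in a table such as Table S2.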

      (2) Second, the DIRseq is essentially SeqDYN with some changes to it, but those changes appear somewhat ad hoc. I recognize that there is very limited data, but the tweaking of parameters based on physical intuition feels a bit stochastic in developing a method; presumably (while not explicitly spelt out) those tweaks were chosen to give better agreement with the very limited experimental data (otherwise why make the changes?), which does raise the question of if the DIRseq implementation of SeqDYN is rather over-parameterized to the (very limited) data available now? I want to be clear, the authors should not be critiqued for attempting to develop a model despite a paucity of data, and I'm not necessarily saying this is a problem, but I think it would be really important for the authors to acknowledge to the reader the fact that with such limited data it's possible the model is over-fit to specific sequences studied previously, and generalization will be seen as more data are collected.

      We have explained the rationale for the parameter tweaks, which were limited to q values for four amino-acid types, i.e., to deemphasize hydrophobic interactions and slightly enhance electrostatic interactions (p. 4-5). We now add that these tweaks were motivated by observations from MD simulations of drug interactions with α-syn (ref 13). As already noted in the response to the preceding comment, we now also present results for the original parameter values as well as for when the four q values are changed one at a time.

      (3) Third, perhaps my biggest concern here is that - implicit in the author's assumptions - is that all "drugs" interact with IDPs in the same way and all drugs are "small" (motivating the change in correlation length). Prescribing a specific length scale and chemistry to all drugs seems broadly inconsistent with a world in which we presume drugs offer some degree of specificity. While it is perhaps not unexpected that aromatic-rich small molecules tend to interact with aromatic residues, the logical conclusion from this work, if one assumes DIRseq has utility, is that all IDRs bind drugs with similar chemical biases. This, at the very least, deserves some discussion.

      The reviewer raises a very important point. In Discussion, we now add that it is important to further develop DIRseq to include drug-specific parameters when data for training become available (p. 12-13). To illustrate this point, we use drug size as a simple example, which can be modeled by making the b parameter dependent on drug molecule size.

      (4) Fourth, the authors make some general claims in the introduction regarding the state of the art, which appear to lack sufficient data to be made. I don't necessarily disagree with the author's points, but I'm not sure the claims (as stated) can be made absent strong data to support them. For example, the authors state: "Although an IDP can be locked into a specific conformation by a drug molecule in rare cases, the prevailing scenario is that the protein remains disordered upon drug binding." But is this true? The authors should provide evidence to support this assertion, both examples in which this happens, and evidence to support the idea that it's the "prevailing view" and specific examples where these types of interactions have been biophysically characterized.

      We now cite nine studies showing that IDPs remain disordered upon drug binding.

      Similarly, they go on to say:

      "Consequently, the IDP-drug complex typically samples a vast conformational space, and the drug molecule only exhibits preferences, rather than exclusiveness, for interacting with subsets of residues." But again, where is the data to support this assertion? I don't necessarily disagree, but we need specific empirical studies to justify declarative claims like this; otherwise, we propagate lore into the scientific literature. The use of "typically" here is a strong claim, implying most IDP complexes behave in a certain way, yet how can the authors make such a claim? 

      Here again we add citations to support the statement.

      Finally, they continue to claim:

      "Such drug interacting residues (DIRs), akin to binding pockets in structured proteins, are key to optimizing compounds and elucidating the mechanism of action." But again, is this a fact or a hypothesis? If the latter, it must be stated as such; if the former, we need data and evidence to support the claim.

      We add citations to both compound optimization and mechanism of action.

      Reviewer #1 (Recommendations for the authors):

      (1) The authors should compare the sequences of the IDPs in the case studies with the 45 IDPs in training the SeqDYN model to make sure that they are not included in the training dataset or are highly homologous.

      Please note that the data used for training SeqDYN were R2 rates, which are independent of the property being studied here, i.e., drug interacting residues. Therefore whether the IDPs studied here were in the training set for SeqDYN is immaterial.

      (2) The authors manually tuned four parameters in SeqDYN to develop the model for predicting drug-interacting residues without giving strict testing or explanations. More explanations, testing of more values, and ablation testing should be given.

      As noted in our response above, we have now both expanded the explanation and presented more test results.

      (3) The authors changed the q values of L, I, and M to the value of V. What are the results if these values are not changed?

      These results are shown in Table S2 (entry named SeqDYN_orig).

      (4) Only one b value is chosen based on the assumption that a drug molecule interacts with 3-4 residues at a time. However, the number of interacting residues is related to the size of the drug molecule. Adjusting the b value with the size of the ligand may provide improvement. It is better to test the influence of adjusting b values. At least, this should be discussed.

      Good point! We now state that b can potentially be adjusted according to ligand size (pp. 12-13). In addition, we show the effect of varying b on the prediction results (Table S2; p. 8, last paragraph).

      (5) The authors add 12 Q to eliminate end effects. However, an explanation of why 12 Qs were chosen should be given. What about other numbers of Qs, or other residues (e.g., those commonly used in linkers, such as GS/PS or A)?

      As we already explained, “Gln was selected because its 𝑞 value is at the middle of the 20 𝑞 values.” (p. 5, second paragraph). Also, 12 Qs are sufficient to remove any end effects; a higher number of Qs does not make any difference.
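      The padding step discussed in this exchange can be illustrated with a minimal sketch. This is not the DIRseq implementation: the function names are hypothetical, and padding both termini symmetrically is an assumption, since the response states only that 12 Gln residues are added to remove end effects.

```python
def pad_with_flanks(sequence: str, n_flank: int = 12, residue: str = "Q") -> str:
    """Return the sequence with n_flank copies of `residue` added to each end.

    Assumption: both termini are padded; the source only says that
    12 Gln residues are added to eliminate end effects.
    """
    flank = residue * n_flank
    return flank + sequence + flank


def strip_flank_scores(scores, n_flank: int = 12):
    """Drop the per-residue scores that belong to the padding residues."""
    return list(scores[n_flank:len(scores) - n_flank])


padded = pad_with_flanks("MKV")
# One placeholder score per residue of the padded sequence (hypothetical values).
scores = list(range(len(padded)))
native_scores = strip_flank_scores(scores)
```

      With this layout, any per-residue predictor can be run on the padded sequence, and the scores for the native residues are recovered by trimming the same number of positions from each end.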

      Reviewer #2 (Recommendations for the authors):

      (1) The authors make reference to the "C-terminal IDR" in cMyc, but the region they note is found in the bHLH DNA binding domain (which falls from residue ~370-420).

      We now clarify that this region is disordered on its own but forms a helix-loop-helix structure upon heterodimerization with Max (p. 11, last paragraph).

      (2) Given the fact that X-seq names are typically associated with sequencing-based methods, it's perhaps confusing to name this method DIRseq?

      We appreciate the reviewer’s point, but by now the preprint posted in bioRxiv is in wide circulation, and the DIRseq web server has been up for several months, so changing its name would cause a great deal of confusion.

      (3) I'd encourage the authors just to spell out "drug interacting residues" and retain an IDR acronym for IDRs. Acronyms rarely make writing clearer, and asking folks to constantly flip between IDR and DIR is asking a lot of an audience (in this reviewer's opinion, anyway).

      The reviewer makes a good point; we now spell out “drug-interacting residues”.

      (4) The assumption here is that CSPs result from direct drug:IDR interactions. However, CSPs result from a change in the residue chemical environment, which could in principle be an indirect effect (e.g., in the unbound state, residues A and B interact; in the bound state, residue A is now free, such that it experiences a CSP despite not engaging directly). While I recognize such assumptions are commonly made, it behoves the authors to explicitly make this point so the reader understands the relationship between CSPs and binding.

      We have added caveats about CSPs to the Introduction (p. 3, second paragraph).

      (5) On the figures, please label which protein is which figure, as well as provide a legend for the annotations on the figures (red line, blue bar, cyan region, etc.)

      We now label protein names in Fig. 1. Annotations of display items are already explained in the captions of Figs. 2 and 3; we have now added them to the Fig. 4 caption.

      (6) abstract: "These successes augur well for deciphering the sequence code for IDP-drug binding." - This is not grammatically correct, even if augur were changed to agree. Suggest rewriting.

      “Augur well” means to be a good sign (for something). We use this phrase here in this meaning.

      (6) page 5: "we raised the 𝑞 value of Asp to be the same as that of Glu" → suggested "increased" instead of raised.

      We have made the suggested change.

      (7) The authors should consider releasing the source code (it is available via the .js implementation on the server, but this is not very transferable/shareable, so I'd encourage the authors to provide a stand-alone implementation that's explicitly shareable).

      We have now added a link for the user to download the source code.

    1. Author response:

      The following is the authors’ response to the current reviews.

      eLife Assessment

      The authors examine the effect of cell-free chromatin particles (cfChPs) derived from human serum or from dying human cells on mouse cells in culture and propose that these cfChPs can serve as vehicles for cell-to-cell active transfer of foreign genetic elements. The work presented in this paper is intriguing and potentially important, but it is incomplete. At this stage, the claim that horizontal gene transfer can occur via cfChPs is not well supported because it is only based on evidence from one type of methodological approach (immunofluorescence and fluorescent in situ hybridization (FISH)) and is not validated by whole genome sequencing.

      We disagree with the eLife assessment that our study is incomplete because we did not perform whole genome sequencing. Tens of thousands of genomes have been sequenced, and yet they have failed to detect the presence of the numerous “satellite genomes” that we describe in our paper. To that extent whole genome sequencing has proved to be an inappropriate technology. Rather, eLife should have commended us for the numerous control experiments that we have done to ensure that our FISH probes and antibodies are target specific and do not cross-react.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Horizontal gene transfer is the transmission of genetic material between organisms through ways other than reproduction. Frequent in prokaryotes, this mode of genetic exchange is scarcer in eukaryotes, especially in multicellular eukaryotes. Furthermore, the mechanisms involved in eukaryotic HGT are unknown. This article by Banerjee et al. claims that HGT occurs massively between cells of multicellular organisms. According to this study, the cell free chromatin particles (cfChPs) that are massively released by dying cells are incorporated in the nucleus of neighboring cells.

      The reviewer is mistaken. We do not claim that the internalized cfChPs are incorporated into the nucleus. We show throughout the paper that the cfChPs perform their novel functions autonomously outside the genome without being incorporated into the nucleus. This is clearly seen in all our chromatin fibre images, metaphase spreads and our video abstract. Occasionally, when the cfChP fluorescent signals overlie the chromosomes, we have been careful to state that the cfChPs are associated with the chromosomes without implying that they have integrated.

      These cfChPs are frequently rearranged and amplified to form concatemers, they are made of open chromatin, expressed, and capable of producing proteins. Furthermore, the study also suggests that cfChPs transmit transposable elements (TEs) between cells on a regular basis, and that these TEs can transpose, multiply, and invade receiving cells. These conclusions are based on a series of experiments consisting in releasing cfChPs isolated from various human sera into the culture medium of mouse cells, and using FISH and immunofluorescence to monitor the state and fate of cfChPs after several passages of the mouse cell line.

      Strengths:

      The results presented in this study are interesting because they may reveal unsuspected properties of some cell types that may be able to internalize free-circulating chromatin, leading to its chromosomal incorporation, expression, and unleashing of TEs. The authors propose that this phenomenon may have profound impacts in terms of diseases and genome evolution. They even suggest that this could occur in germ cells, leading to within-organism HGT with long-term consequences.

      Again the reviewer makes the same mistake. We do not claim that the internalized cfChPs are incorporated into the chromosomes. We have addressed this issue above.

      We have a feeling that the reviewer has not understood our work – which is the discovery of “satellite genomes” which function autonomously outside the nuclear genome.

      Weaknesses:

      The claims of massive HGT between cells through internalization of cfChPs are not well supported because they are only based on evidence from one type of methodological approach: immunofluorescence and fluorescent in situ hybridization (FISH) using protein antibodies and DNA probes. Yet, such strong claims require validation by at least one, but preferably multiple, additional orthogonal approaches. This includes, for example, whole genome sequencing (to validate concatemerization, integration in receiving cells, transposition in receiving cells), RNA-seq (to validate expression), ChiP-seq (to validate chromatin state).

      We disagree with the reviewer that our study is incomplete because we did not perform whole genome sequencing. Tens of thousands of genomes have been sequenced, and yet they have failed to detect the presence of the numerous “satellite genomes” that we describe in our paper. To that extent whole genome sequencing has proved to be an inappropriate approach. Rather, the reviewer should have commended us for the numerous control experiments that we have done to ensure that our FISH probes and antibodies are target specific and do not cross-react.

      Should HGT through internalization of circulating chromatin occur on a massive scale, as claimed in this study, and as illustrated by the many FISH foci observed on Fig 3 for example, one would expect that the level of somatic mosaicism may be so high that it would prevent assembling a contiguous genome for a given organism. Yet, telomere-to-telomere genomes have been produced for many eukaryote species, calling into question the conclusions of this study.

      The reviewer has raised a related issue below and we have responded to both of them together.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I thank the authors for taking my comments and those of the other reviewer into account and for adding new material to this new version of the manuscript. Among other modifications/additions, they now mention that they think that NIH3T3 cells treated with cfChPs die out after 250 passages because of genomic instability which might be caused by horizontal transfer of cfChPs DNA into the genome of treated cells (pp. 45-46, lines 725-731). However, no definitive formal proof of genomic instability and horizontal transfer is provided.

      We mention that the NIH3T3 cells treated with cfChPs die out after 250 passages in response to the reviewer’s earlier comment “Should HGT through internalization of circulating chromatin occur on a massive scale, as claimed in this study, and as illustrated by the many FISH foci observed in Fig 3 for example, one would expect that the level of somatic mosaicism may be so high that it would prevent assembling a contiguous genome for a given organism”.

      We have agreed with the reviewer and have simply speculated that the cells may die because of extreme genomic instability. We have left it as a speculation without diverting our paper in a different direction to prove genomic instability.

      The authors now refer to an earlier study they conducted in which they Illumina-sequenced NIH3T3 cells treated with cfChPs (pp. 48, lines. 781-792). This study revealed the presence of human DNA in the mouse cell culture. However, it is unclear to me how the authors can conclude that the human DNA was inside mouse cells (rather than persisting in the culture medium as cfChPs), and it is also unclear how this supports horizontal transfer of human DNA into the genome of mouse cells. Horizontal transfer implies integration of human DNA into mouse DNA, through the formation of phosphodiester bonds between human nucleotides and mouse nucleotides. The previous Illumina-sequencing study and the current study do not show that such integration has occurred. I might be wrong, but I tend to think that DNA FISH signals showing that human DNA lies next to mouse DNA do not necessarily imply that human DNA has integrated into mouse DNA. Perhaps such signals could result from interactions at the protein level between human cfChPs and mouse chromatin?

      With due respect, our earlier genome sequencing study that the reviewer refers to was done on two single cell clones developed following treatment with cfChPs. So, the question of cfChPs lurking in the culture medium does not arise.

      The authors should be commended for doing so many FISH experiments. But in my opinion, and as already mentioned in my earlier review of this work, horizontal transfer of human DNA into mouse DNA should first be demonstrated by strong DNA sequencing evidence (multiple long and short reads supporting human/mouse breakpoints; discarding technical DNA chimeras) and only then eventually confirmed by FISH.

      As mentioned earlier, we disagree with the reviewer that our study is incomplete because we did not perform whole genome sequencing. Tens of thousands of genomes have been sequenced, and yet they have failed to detect the presence of the numerous “satellite genomes” that we describe in our paper. To that extent whole genome sequencing has proved to be an inappropriate approach. Rather, the reviewer should have commended us for the numerous control experiments that we have done to ensure that our FISH probes and antibodies are target specific and do not cross-react.

      Regarding my comment on the quantity of human cfChPs that has been used for the experiments, the authors replied that they chose this quantity because it worked in a previous study. Could they perhaps explain why they chose this quantity in the earlier study? Is there any biological reason to choose 10 ng and not more or less? Is 10 ng realistic biologically? Could it be that 10 ng is orders of magnitude higher than the quantity of cfChPs normally circulating in multicellular organisms and that this could explain, at least in part, the results obtained in this study?

      The reviewer again raises the same issue, which we have already addressed in our revised manuscript. To quote: “We chose to use 10ng based on our earlier report in which we had obtained robust biological effects such as activation of DDR and activation of apoptotic pathways using this concentration of cfChPs (Mittra I et al., 2015)”.

      It is also mentioned in the response that RNA-seq has been performed on mouse cells treated with cfChPs, and that this confirms human-mouse fusion (genomic integration). Since these results are not included in the manuscript, I cannot judge how robust they are and whether they reflect a biological process rather than technical issues (technical chimeras formed during the RNA-seq protocol are a well-known artifact). In any case, I do not think that genomic integration can be demonstrated through RNA-seq, as junctions between human and mouse RNA could occur at the RNA level (i.e. after transcription). RNA-seq could however show whether human-mouse chimeras that have been validated by DNA-sequencing are expressed or not.

      We did perform transcriptome sequencing as suggested earlier by the reviewer, but realized that the amount of material required to be incorporated into the manuscript to include “material and methods”, “results”, “discussion”, “figures” and “legends to figures” and “supplementary figures and tables” would be so massive that it will detract from the flow of our work and hijack it in a different direction. We have, therefore, decided to publish the transcriptome results as a separate manuscript.

      Given these comments, I believe that most of the weaknesses I mentioned in my review of the first version of this work still hold true.

      An important modification is that the work has been repeated in other cell lines, hence I removed this criticism from my earlier review.

      Additional changes made

      (1) We have now rewritten the “Abstract” to 250 words to comply with eLife’s instructions. (It was not possible to reduce the word count further.)

      (2) We have provided Video 1 as a separate file instead of a link.

      (3) Some of the Figure Supplements (which were stand-alone) are now given as main figures. We have re-arranged Figures and Figure Supplements in accordance with eLife’s instructions.

      (4) We have now provided a list of the various cell lines used in this study, their tissue origin and procurement source in Supplementary File 3.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Horizontal gene transfer is the transmission of genetic material between organisms through ways other than reproduction. Frequent in prokaryotes, this mode of genetic exchange is scarcer in eukaryotes, especially in multicellular eukaryotes. Furthermore, the mechanisms involved in eukaryotic HGT are unknown. This article by Banerjee et al. claims that HGT occurs massively between cells of multicellular organisms. According to this study, the cell free chromatin particles (cfChPs) that are massively released by dying cells are incorporated in the nucleus of neighboring cells. These cfChPs are frequently rearranged and amplified to form concatemers, they are made of open chromatin, expressed, and capable of producing proteins. Furthermore, the study also suggests that cfChPs transmit transposable elements (TEs) between cells on a regular basis, and that these TEs can transpose, multiply, and invade receiving cells. These conclusions are based on a series of experiments consisting in releasing cfChPs isolated from various human sera into the culture medium of mouse cells, and using FISH and immunofluorescence to monitor the state and fate of cfChPs after several passages of the mouse cell line.

      Strengths:

      The results presented in this study are interesting because they may reveal unsuspected properties of some cell types that may be able to internalize free-circulating chromatin, leading to its chromosomal incorporation, expression, and unleashing of TEs. The authors propose that this phenomenon may have profound impacts in terms of diseases and genome evolution. They even suggest that this could occur in germ cells, leading to within-organism HGT with long-term consequences.

      Weaknesses:

      The claims of massive HGT between cells through internalization of cfChPs are not well supported because they are only based on evidence from one type of methodological approach: immunofluorescence and fluorescent in situ hybridization (FISH) using protein antibodies and DNA probes. Yet, such strong claims require validation by at least one, but preferably multiple, additional orthogonal approaches. This includes, for example, whole genome sequencing (to validate concatemerization, integration in receiving cells, transposition in receiving cells), RNA-seq (to validate expression), ChiP-seq (to validate chromatin state).

      We have responded to this criticism under “Reviewer #1 (Recommendations for the authors, item no. 1-4)”.

      Another weakness of this study is that it is performed only in one receiving cell type (NIH3T3 mouse cells). Thus, rather than a general phenomenon occurring on a massive scale in every multicellular organism, it could merely reflect aberrant properties of a cell line that for some reason became permeable to exogenous cfChPs. This begs the question of the relevance of this study for living organisms.

      We have responded to this criticism under “Reviewer #1 (Recommendations for the authors, item no. 6)”.

      Should HGT through internalization of circulating chromatin occur on a massive scale, as claimed in this study, and as illustrated by the many FISH foci observed in Fig 3 for example, one would expect that the level of somatic mosaicism may be so high that it would prevent assembling a contiguous genome for a given organism. Yet, telomere-to-telomere genomes have been produced for many eukaryote species, calling into question the conclusions of this study.

      The reviewer is right in expecting that the level of somatic mosaicism may be so high that it would prevent assembling a contiguous genome. This is indeed the case, and we find that beyond ~ 250 passages the cfChP-treated NIH3T3 cells begin to die out, apparently because their genomes have become too unstable for survival. This point will be highlighted in the revised version (pp. 45-46, lines 725-731).

      Reviewer #2 (Public review):

      I must note that my comments pertain to the evolutionary interpretations rather than the study's technical results. The techniques appear to be appropriately applied and interpreted, but I do not feel sufficiently qualified to assess this aspect of the work in detail.

      I was repeatedly puzzled by the use of the term "function." Part of the issue may stem from slightly different interpretations of this word in different fields. In my understanding, "function" should denote not just what a structure does, but what it has been selected for. In this context, where it is unclear if cfChPs have been selected for in any way, the use of this term seems questionable.

      We agree. We have removed the term “function” wherever we felt we had used it inappropriately.

      Similarly, the term "predatory genome," used in the title and throughout the paper, appears ambiguous and unjustified. At this stage, I am unconvinced that cfChPs provide any evolutionary advantage to the genome. It is entirely possible that these structures have no function whatsoever and could simply be byproducts of other processes. The findings presented in this study do not rule out this neutral hypothesis. Alternatively, some particular components of the genome could be driving the process and may have been selected to do so. This brings us to the hypothesis that cfChPs could serve as vehicles for transposable elements. While speculative, this idea seems to be compatible with the study's findings and merits further exploration.

      We agree with the reviewer’s viewpoint. We have replaced the term “predatory genome” with a more realistic term “satellite genome” in the title and throughout the manuscript. We have also thoroughly revised the discussion section and elaborated on the potential role of LINE-1 and Alu elements carried by the concatemers in mammalian evolution. (pp. 46-47, lines 743-756).

      I also found some elements of the discussion unclear and speculative, particularly the final section on the evolution of mammals. If the intention is simply to highlight the evolutionary impact of horizontal transfer of transposable elements (e.g., as a source of new mutations), this should be explicitly stated. In any case, this part of the discussion requires further clarification and justification.

      As mentioned above, we have revised the “discussion” section taking into account the issues raised by the reviewer and highlighted the potential role of cfChPs in evolution by acting as vehicles of transposable elements.

      In summary, this study presents important new findings on the behavior of cfChPs when introduced into a foreign cellular context. However, it overextends its evolutionary interpretations, often in an unclear and speculative manner. The concept of the "predatory genome" should be better defined and justified or removed altogether. Conversely, the suggestion that cfChPs may function at the level of transposable elements (rather than the entire genome or organism) could be given more emphasis.

      As mentioned above, we have replaced the term “predatory genome” with “satellite genome” and revised the “discussion” section taking into account the issues raised by the reviewer.

      Reviewer #1 (Recommendations for the authors):

      (1) I strongly recommend validating the findings of this study using other approaches. Whole genome sequencing using both short and long reads should be used to validate the presence of human DNA in the mouse cell line, as well as its integration into the mouse genome and concatemerization. Breakpoints between mouse and human DNA can be searched in individual reads. Finding these breakpoints in multiple reads from two or more sequencing technologies would strengthen their biological origin. Illumina and ONT sequencing are now routinely performed by many labs, such that this validation should be straightforward. In addition to validating the findings of the current study, it would allow performance of an in-depth characterization of the rearrangements undergone by both human cfChPs and the mouse genome after internalization of cfChPs, including identification of human TE copies integrated through bona fide transposition events into the mouse genome. New copies of LINE and Alu TEs should be flanked by target site duplications. LINE copies should be frequently 5' truncated, as observed in many studies of somatic transposition in human cells.

      (2) Furthermore, should the high level of cell-to-cell HGT detected in this study occur on a regular basis within multicellular organisms, validating it through a reanalysis of whole genome sequencing data available in public databases should be relatively easy. One would expect to find a high number of structural variants that for some reason have so far gone under the radar.

      (3) Short and long-read RNA-seq should be performed to validate the expression of human cfChPs in mouse cells. I would also recommend performing ChIP-seq on routinely targeted histone marks to validate the chromatin state of human cfChPs in mouse cells.

      (4) The claim that fused human proteins are produced in mouse cells after exposing them to human cfChPs should be validated using mass spectrometry.

      The reviewer has suggested a plethora of techniques to validate our findings. Clearly, it is neither possible to undertake all of them nor to incorporate them into the manuscript. However, as suggested by the reviewer, we did conduct transcriptome sequencing of cfChPs treated NIH3T3 cells and were able to detect the presence of human-human fusion sequences (representing concatemerisation) as well as human-mouse fusion sequences (representing genomic integration). However, we realized that the amount of material required to be incorporated into the manuscript to include “material and methods”, “results”, “discussion”, “figures” and “legends to figures” and “supplementary figures and tables” would be so massive that it will detract from the flow of our work and hijack it in a different direction. We have, therefore, decided to publish the transcriptome results as a separate manuscript. However, to address the reviewer’s concerns we have now referred to results of our earlier whole genome sequencing study of NIH3T3 cells similarly treated with cfChPs wherein we had conclusively detected the presence of human DNA and human Alu sequences in the treated mouse cells. These findings have now been added as an independent paragraph (pp. 48, lines. 781-792).
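      The cross-platform breakpoint check recommended in item (1) can be sketched as follows. This is only an illustration of the bookkeeping, not a replacement for a structural-variant caller: it assumes that a split-read alignment step has already assigned each aligned segment of a read to the human or mouse reference, and all names (`Segment`, `candidate_breakpoints`, the `window` binning) are hypothetical. Binning coordinates is a crude way of grouping nearby junctions and can split a junction that falls on a bin boundary.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    read_id: str
    technology: str  # e.g. "illumina" or "ont"
    genome: str      # "human" or "mouse"
    chrom: str
    pos: int         # reference coordinate of the segment end nearest the junction


def candidate_breakpoints(segments, window=50, min_technologies=2):
    """Group chimeric reads by approximate human/mouse junction.

    Returns {junction_key: number_of_supporting_reads} for junctions that are
    supported by reads from at least `min_technologies` sequencing platforms.
    """
    by_read = defaultdict(list)
    for seg in segments:
        by_read[seg.read_id].append(seg)

    support = defaultdict(set)  # junction key -> {(technology, read_id)}
    for read_id, segs in by_read.items():
        humans = [s for s in segs if s.genome == "human"]
        mice = [s for s in segs if s.genome == "mouse"]
        # Only reads with both a human and a mouse segment contribute.
        for h in humans:
            for m in mice:
                key = (h.chrom, h.pos // window, m.chrom, m.pos // window)
                support[key].add((h.technology, read_id))

    result = {}
    for key, reads in support.items():
        technologies = {tech for tech, _ in reads}
        if len(technologies) >= min_technologies:
            result[key] = len(reads)
    return result


example = [
    Segment("r1", "illumina", "human", "chr1", 1010),
    Segment("r1", "illumina", "mouse", "chr5", 2020),
    Segment("r2", "ont", "human", "chr1", 1020),
    Segment("r2", "ont", "mouse", "chr5", 2030),
    Segment("r3", "illumina", "human", "chr2", 500),
    Segment("r3", "illumina", "mouse", "chr7", 900),
]
supported = candidate_breakpoints(example)
```

      In this toy input, the junction seen by both an Illumina read and an ONT read is retained, whereas the junction seen by a single Illumina read is dropped as a possible technical chimera.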

      (5) It is unclear from what is shown in the paper (increase in FISH signal intensity using Alu and L1 probes) if the increase in TE copy number is due to bona fide transposition or to amplification of cfChPs as a whole, through mechanisms other than transposition. It is also unclear whether human TEs end up being integrated into the neighboring mouse genome. This should be validated by whole genome sequencing.

      Our results suggest that TEs amplify and increase their copy number due to their association with DNA polymerase and their ability to synthesize DNA (Figure 14a and b). Our study design cannot demonstrate transposition, which would require real-time imaging.

      The possibility of incorporation of TEs into the mouse genome is supported by our earlier genome sequencing work, referred to above, wherein we detected multiple human Alu sequences in the mouse genome (pp. 48, lines. 781-792).

      (6) In order to be able to generalize the findings of this study, I strongly encourage the authors to repeat their experiments using other cell types.

      We thank the reviewer for this suggestion. We have now used four different cell lines derived from four different species and demonstrated that horizontal transfer of cfChPs occurs in all of them, suggesting that it is a universal phenomenon (pp. 37, lines 560-572; Supplementary Fig. S14a-d).

      We have also mentioned this in the abstract (pp. 3, lines 52-54).

      (7) Since the results obtained when using cfChPs isolated from healthy individuals are identical to those shown when using cfChPs from cancer sera, I wonder why the authors chose to focus mainly on results from cancer-derived cfChPs and not on those from healthy sera.

      Most of the experiments were conducted using cfChPs isolated from cancer patients because of our particular interest in cancer, and because our earlier results (Mittra et al., 2015) had shown that cfChPs isolated from cancer patients had significantly greater activity in terms of DNA damage and activation of apoptotic pathways than those isolated from healthy individuals. We have now incorporated this justification (pp. 6, lines. 124-128).

      (8) Line 125: how was the 10-ng quantity (of human cfChPs added to the mouse cell culture) chosen and how does it compare to the quantity of cfChPs normally circulating in multicellular organisms?

      We chose to use 10 ng based on our earlier report in which we had obtained robust biological effects such as activation of DDR and apoptotic pathways using this concentration of cfChPs (Mittra I et al., 2015). We have now incorporated the justification for using this dose in our manuscript (pp. 51-52, lines. 867-870).

      (9) Could the authors explain why they repeated several of their experiments in metaphase spreads, in addition to interphase?

      We conducted experiments on metaphase spreads in addition to those on chromatin fibres because of the current heightened interest in extrachromosomal DNA in cancer, studies of which have largely been based on metaphase spreads. We were interested to see how the cfChP concatemers might relate to the characteristics of cancer extrachromosomal DNA and whether the latter in fact represents cfChP concatemers acquired from surrounding dying cancer cells. We have now mentioned this on pp. 7, lines 150-155.

      (10) Regarding negative controls consisting in checking whether human probes cross-react with mouse DNA or proteins, I suggest that the stringency of washes (temperature, reagents) should be clearly stated in the manuscript, such that the reader can easily see that it was identical for controls and positive experiments.

      We were fully aware of these issues and were careful to ensure that washing steps were conducted meticulously. The careful washing steps have been repeatedly emphasized under the section on “Immunofluorescence and FISH” (pp. 54-55, lines. 922-944).

      (11) I am not an expert in Immuno-FISH and FISH with ribosomal probes but it can be expected that ribosomal RNA and RNA polymerase are quite conserved (and thus highly similar) between humans and mice. A more detailed explanation of how these probes were designed to avoid cross-reactivity would be welcome.

      We were aware of this issue and conducted a negative control experiment to ensure that the human ribosomal RNA probe and RNA polymerase antibody did not cross-react with mouse cells. Please see Supplementary Fig. S4c.

      (12) Finally, I could not understand why the cfChPs internalized by neighboring cells are called predatory genomes. I could not find any justification for this term in the manuscript.

      We agree, and this criticism has also been made by Reviewer #2. We have now replaced the term “predatory” genomes with “satellite” genomes.

      Reviewer #2 (Recommendations for the authors):

      (1) P2 L34: The term "role" seems to imply "what something is supposed to do" (similar to "function"). Perhaps "impact" would be more neutral. Additionally, "poorly defined" is vague-do you mean "unknown"?

      We thank the reviewer for this suggestion. We have now rephrased the sentence to read “Horizontal gene transfer (HGT) plays an important evolutionary role in prokaryotes, but it is thought to be less frequent in mammals.” (p. 2, lines 26-27).

      (2) P2 L35: It seems that the dash should come after "human blood."

      Thank you, we have changed the position of the dash (p. 2, line 29).

      (3) P2 L37: Must we assume these structures have a function? Could they not simply be side effects of other processes?

      We think this is a matter of semantics, especially since we show that cfChPs once inside the cell perform many functions such as replication, DNA synthesis, RNA synthesis, protein synthesis etc. We, therefore, think the word “function” is not inappropriate.

      (4) Abstract: After reading the abstract, I am unclear on the concept of a "predatory genome." Based on the summarized results, it seems one cannot conclude that these elements provide any adaptive value to the genome.

      We agree. We have now replaced the term “predatory” genomes with a more realistic term, viz. “satellite” genomes.

      (5) Video abstract: The video abstract does not currently stand on its own and needs more context to be self-explanatory.

      Thank you for pointing this out. We have now created a new and much more professional video with more context which we hope will meet with the reviewer’s approval.

      (6) P4 L67: Again, I am uncertain that HGT should be said to have "a role" in mammals, although it clearly has implications and consequences. Perhaps "role" here is intended to mean "consequence"?

      We have now changed the sentence to read as follows: “However, defining the occurrence of HGT in mammals has been a challenge” (p. 4, line 73).

      (7) P6 L111: The phrase "to obtain a new perspective about the process of evolution" is unclear. What exactly is meant by this statement?

      We have replaced this sentence altogether; it now reads “The results of these experiments are presented in this article which may help to throw new light on mammalian evolution, ageing and cancer” (pp. 5-6, lines 116-118).

      (8) P38 L588: The term "predatory genome" has not been defined, making it difficult to assess its relevance.

      This issue has been addressed above.

      (9) P39 L604: The statement "transposable elements are not inherent to the cell" suggests that some TEs could originate externally, but this does not rule out that others are intrinsic. In other words, TEs are still inherent to the cell.

      This part of the discussion section has been rewritten and the above sentence has been deleted.

      (10) P39 L609: The phrase "may have evolutionary functions by acting as transposable elements" is unclear. Perhaps it is meant that these structures may serve as vehicles for TEs?

      This sentence has disappeared altogether in the revised discussion section.

      (11) P41 L643: "Thus, we hypothesize ... extensively modified to act as foreign genetic elements." This sentence is unclear. Are the authors referring to evolutionary changes in mammals in general (which overlooks the role of standard mutational processes)? Or is it being proposed that structural mutations (including TE integrations) could be mediated by cfChPs in addition to other mutational mechanisms?

      We have replaced this sentence; it now reads “Thus, “within-self” HGT may occur in mammals on a massive scale via the medium of cfChP concatemers that have undergone extensive and complex modifications resulting in their behaviour as “foreign” genetic elements” (p. 47, lines 763-766).

      (12) P41 L150: The paragraph beginning with "It has been proposed that extreme environmental..." transitions too abruptly from HGT to adaptation. Is it being proposed that cfChPs are evolutionary processes selected for their adaptive potential? This idea is far too speculative at this stage and requires clarification.

      We agree. This paragraph has been removed.

      (13) P43 L681: This summary appears overly speculative and unclear, particularly as the concept of a "predatory genome" remains undefined and thus cannot be justified. It suggests that cfChPs represent an alternative lifestyle for the entire genome, although alternative explanations seem far more plausible at this point.

      We have now replaced the term “predatory” genome with “satellite” genome. The relevant part of the summary section has also been partially revised (pp. 49-50, lines 817-831).

      Changes independent of reviewers’ comments.

      We have made the following additions / modifications.

      (1) The abstract has been modified and its “conclusion” section has been rewritten.

      (2) Section 1.14 has been newly added together with accompanying Figures 15a, b and c.

      (3) The “Discussion” section has been greatly modified and parts of it have been rewritten.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Common comments

      (1) Significance of zero mutation rate

      Reviewers asked why we included mutation rate even though setting mutation rate to zero doesn’t change results. We think that including a non-zero mutation rate makes our results more generalisable, and thus is a strength rather than a weakness. To better motivate this choice, we have added text to the beginning of Results (quoted in our response to point (5) of the first public review below).

      (2) Writing the mu=0 case first

      Reviewers suggested that we first focus on the mu=0 case and then generalize the result. The suggestion is certainly good. However, given the large amount of work involved in a reorganization, we have decided to adhere to our current narrative. We now include only the mu=0 equations in the main text, and have moved the case of nonzero mutation rate to Supplementary Information.

      (3) Making equations more accessible

      We have taken three steps to make equations more readable.

      ● Equations in the main text correspond to the case of zero-mutation rate.

      ● The original section on equation derivation is now in a box in the main text, so that readers can choose to skip it while interested readers can still get the gist of where the equations came from.

      ● We have provided a much more detailed interpretation of the equation (see page 10).

      (4) Validity of the Gaussian approximation

      Reviewers raised concerns about the validity of the Gaussian approximation for the F frequency 𝑓(𝜏). The fact that our calculations closely match simulations suggests that this approximation is reasonable. Still, we added a discussion about the validity of this approximation in Box 1.

      We also added a figure to the SI covering various initial S and F sizes. This figure shows that when either initial S or initial F is small, the distribution of 𝑓(𝜏) is not normal. However, if initial S and F are both on the order of hundreds, then the distribution of 𝑓(𝜏) is approximately Gaussian.
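
      The dependence on initial sizes described above can be explored with a small simulation. The following is only a sketch under stated assumptions, not the authors' code: it assumes pure-birth (Yule) growth with no mutation, and the growth rates, maturation time, and initial sizes are hypothetical values chosen for illustration. It uses the standard result that a Yule process started from n0 cells with per-capita birth rate r adds a negative-binomially distributed number of cells by time tau.

```python
import numpy as np


def simulate_f(s0, f0, r_s=0.5, r_f=0.6, tau=4.0, n_rep=20000, seed=0):
    """Sample the F frequency f(tau) after pure-birth growth from (s0, f0).

    Assumptions (hypothetical, for illustration only): no mutation, no
    death, and the parameter values given as defaults above.
    """
    rng = np.random.default_rng(seed)
    # Cells added by time tau ~ NegBin(n0, p = exp(-r * tau)).
    s = s0 + rng.negative_binomial(s0, np.exp(-r_s * tau), size=n_rep)
    f = f0 + rng.negative_binomial(f0, np.exp(-r_f * tau), size=n_rep)
    return f / (s + f)


def shape_summary(x):
    """Return (standard deviation, skewness, excess kurtosis) of a sample."""
    z = (x - x.mean()) / x.std()
    return float(x.std()), float(np.mean(z**3)), float(np.mean(z**4) - 3.0)


# Small initial sizes versus initial sizes in the hundreds.
small = simulate_f(s0=2, f0=2)
large = simulate_f(s0=500, f0=500)
print("small initial sizes:", shape_summary(small))
print("large initial sizes:", shape_summary(large))
```

      For a Gaussian sample, skewness and excess kurtosis are both close to zero, so comparing the two summaries gives a quick, informal check of when the approximation is plausible.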

      Public Reviews:

      Summary:

      The authors demonstrate with a simple stochastic model that the initial composition of the community is important in achieving a target frequency during the artificial selection of a community.

      Strengths:

      To my knowledge, the intra-collective selection during artificial selection has not been seriously theoretically considered. However, in many cases, the species dynamics during the incubation of each selection cycle are important and relevant to the outcome of the artificial selection experiment. Stochasticity from birth and death (demographic stochasticity) plays a big role in these species' abundance dynamics. This work uses a simple framework to tackle this idea meticulously.

      This work may or may not be hysteresis (path dependency). If this is true, maybe it would be nice to have a discussion paragraph talking about how this may be the case. Then, this work would even attract the interest of people studying dynamic systems.

      We have added this clarification in the main text:

      “Note that here, selection outcome is path-dependent in the sense of being sensitive to initial conditions. This phenomenon is distinct from hysteresis where path-dependence results from whether a tuning parameter is increased or decreased.”

      Weaknesses:

      (1) Connecting structure and function

      In the typical artificial selection literature, most studies select the community based on collective function. Here in this paper, the authors are selecting a target composition. Although there is a schematic cartoon illustrating the relationship between collective function (y-axis) and the community composition in the main Figure 1, there is no explicit explanation or justification of what may be the origin of this relationship. I think giving the readers a naïve idea about how this structure-function relationship arises in the introduction section would help. This is because the conclusion of this paper is that the intra-collective selection makes it hard to artificially select a community that has an intermediate frequency of f (or s). If there is really evidence or theoretical derivation from this framework that indeed the highest function comes from the intermediate frequency of f, then the impact of this paper would increase because the conclusions of this stochastic model could allude to the reasons for the prevalent failures of artificial selection in literature.

      We have added this to introduction: “This is a common quest: whenever a collective function depends on both populations, collective function is maximised, by definition, at an intermediate frequency (e.g. too little of either population will hamper function [23]).”

      (2) Explain intra-collective and inter-collective selection better for readers.

      The abstract, the introduction, and the result section use the terms intra-collective and inter-collective selection without much explanation. For the wide readership of eLife, a clear definition in the beginning would help the audience grasp the importance of this paper, because these concepts are at the core of this work.

      This is a great point. We have added in Abstract:

      “Such collective selection is dictated by two opposing forces: during collective maturation, intra-collective selection acts like a waterfall, relentlessly driving the S-frequency to lower values, while during collective reproduction, inter-collective selection resembles a rafter striving to reach the target frequency. Due to this model structure, maintaining a target frequency requires the continued action of inter-collective selection.”

      and in Introduction

      “A selection cycle consists of three stages (Fig. 1). During collective maturation, intra-collective selection favors fast-growing individuals within a collective. At the end of maturation, inter-collective selection acts on collectives and favors those achieving the target composition. Finally during collective reproduction, offspring collectives sample stochastically from the parents, a process dominated by genetic drift.”
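
      The three stages quoted above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes pure-birth growth without mutation, selects only the single Adult collective closest to the target, and approximates reproduction by binomial sampling of N0 cells. All function names and parameter values are hypothetical.

```python
import numpy as np


def grow(n0, rate, tau, rng):
    """Pure-birth growth of each entry of n0 for time tau (zeros stay zero)."""
    n0 = np.asarray(n0)
    # negative_binomial needs n > 0, so clamp to 1 and mask the result.
    added = rng.negative_binomial(np.maximum(n0, 1), np.exp(-rate * tau))
    return n0 + np.where(n0 > 0, added, 0)


def selection_cycle(f_newborn, n0, target, r_s, r_f, tau, rng):
    """Run one cycle; f_newborn holds the F count of each Newborn collective."""
    # Stage 1, maturation: intra-collective selection favours fast-growing F.
    s_adult = grow(n0 - f_newborn, r_s, tau, rng)
    f_adult = grow(f_newborn, r_f, tau, rng)
    freq = f_adult / (s_adult + f_adult)
    # Stage 2, inter-collective selection: keep the Adult closest to target.
    best = int(np.argmin(np.abs(freq - target)))
    # Stage 3, reproduction: each offspring samples n0 cells from the parent
    # (binomial sampling is an assumption; it ignores finite parent size).
    offspring = rng.binomial(n0, freq[best], size=f_newborn.size)
    return offspring, float(freq[best])


rng = np.random.default_rng(0)
newborns = np.full(100, 50)  # 100 collectives, each with 50 F cells out of 1000
for cycle in range(5):
    newborns, chosen = selection_cycle(
        newborns, n0=1000, target=0.1, r_s=0.5, r_f=0.6, tau=3.0, rng=rng
    )
    print(cycle, round(chosen, 4))
```

      Varying `n0`, the number of collectives, or the starting F count in this sketch is one informal way to see how the bottleneck size and initial composition interact.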

      (3) Achievable target frequency strongly depending on the degree of demographic stochasticity.

      I would expect that the experimentalists would find these results interesting and would want to consider these results during their artificial selection experiments. The main Figure 4 indicates that the Newborn size N0 is a very important factor to consider during the artificial selection experiment. This would be equivalent to how much bottleneck is imposed on the artificial selection process in every iteration step (i.e., the ratio of serial dilution experiment). However, with a low population size, all target frequencies can be achieved, and therefore in these regimes, the initial frequency now does not matter much. It would be great for the authors to provide what the N0 parameter actually means during the artificial selection experiments. Maybe relative to some other parameter in the model. I know this could be very hard. But without this, the main result of this paper (initial frequency matters) cannot be taken advantage of by the experimentalists.

      We have added to the SI an analytical approximation for \breve{N}_0, the Newborn size below which all target frequencies can be achieved.

      Also, we have added lines indicating \breve{N}_0 in Fig. 4a.

      (4) Consideration of environmental stochasticity.

      The success (gold area of Figure 2d) in this framework mainly depends on the size of the demographic stochasticity (birth-only model) during the intra-collective selection. However, during experiments, a lot of environmental stochasticity appears to be occurring during artificial selection. This may be out of the scope of this study. But it would definitely be exciting to see how much environmental stochasticity relative to the demographic stochasticity (variation in the Gaussian distribution of F and S) matters in succeeding in achieving the target composition from artificial selection.

      You are correct that our work considers only demographic stochasticity.

      Indeed, considering other types of stochasticity will be an exciting future research direction. We added in the main text:

      “Overall our model considers mutational stochasticity, as well as demographic stochasticity in terms of stochastic birth and stochastic sampling of a parent collective by offspring collectives. Other types of stochasticity, such as environmental stochasticity and measurement noise, are not considered and require future research.”

      (5) Assumption about mutation rates

      If setting the mutation rates to zero does not change the result of the simulations and the conclusion, what is the purpose of having the mutation rates \mu? Also, is the unidirectional (S -> F -> FF) mutation realistic? I didn't quite understand how the mutations could fit into the story of this paper.

      This is a great point. We have added this to the beginning of Results to better motivate our study:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations. This scenario is encountered in biotechnology: an engineered pathway will slow down growth, and breaking the pathway (and thus faster growth) is much easier than the other way around. When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      See Point 1 of Common comments.

      (6) Minor points

      In Figure 3b, it is not clear to me how the frequency difference for the Intra-collective and the Inter-collective selection is computed.

      We added a description to the caption of Fig. 3b.

      In Figure 5b, the gold region (success) near the FF is not visible. Maybe increase the size of the figure or have an inset for zoom-in. Why is the region not as big as the bottom gold region?

      We increased the resolution of Fig 5b so that the gold region near FF is more visible.

      We have added Fig 5c and the following explanation to the main text:

      “From numerical simulations, we identified two accessible regions: a small region near FF and a band region spanning from S to F (gold in Fig. 5b i). Intuitively, the rate at which FF grows faster than S+F is greater than the rate at which F grows faster than S (see section VIII in Supplementary Information). Thus, the problem can initially be reduced to a two-population problem (i.e. FF versus F+S; Fig. 5c left), and then expanded to a three-population problem (Fig. 5c right).”

      Recommendations For The Authors

      Since the conclusion of the model greatly depends on the noise (variation) of F and S in the Gaussian distribution, it would be nice to have a plot where the y-axis is the variation in terms of frequency and the x-axis is the s_0 or f_0 (frequency). In the plot, I would love to see how the variation in the frequency depends on the initial frequency of S and F. Maybe this is just trivial.

      In the SI, we added Fig6a, as per your request. Previous Fig6 became Fig6b.

      Reviewer #2 (Public review):

      The authors provide an analytical framework to model the artificial selection of the composition of communities composed of strains growing at different rates. Their approach takes into account the competition between the targeted selection at the level of the meta-community and the selection that automatically favors fast-growing cells within each replicate community. Their main finding is a tipping point or path-dependence effect, whereby compositions dominated by slow-growing types can only be reached by community-level selection if the community does not start and never crosses into a range of compositions dominated by fast growers during the dynamics.

      These results seem to us both technically correct and interesting. We commend the authors on their efforts to make their work reproducible even when it comes to calculations via extensive appendices, though perhaps a table of contents and a short description of these appendices at the start of SI would help navigate them.

      Thank you for the suggestion. We have added a paragraph at the beginning of SI.

      The main limitation in the current form of the article is that it could clarify how its assumptions and findings differ from and improve upon the rest of the literature:

      -  Many studies discuss the interplay between community-level evolution and species- or strain-level evolution. But "evolution" can be a mix of various forces, including selection, drift/randomness, and mutation/innovation.

      - This work's specificity is that it focuses strictly on constant community-level selection versus constant strain-level selection, all other forces being negligible (neither stochasticity nor innovation/mutation matter at either level, as we try to clarify now).

      Note that intra-collective selection is not strictly “constant” in the sense that selection favoring F is the strongest at intermediate F frequency (Fig 3). However, we think that you mean that intra- and inter-collective selection are present in every cycle, and this is correct for our case, and for community selection in general.
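
      Why selection within a collective is weakest near the extremes can be seen from a short calculation. The following is a sketch in the deterministic, mutation-free limit only, with r_F and r_S denoting the (assumed constant) growth rates of F and S; it is not the stochastic expression of the paper, and the change over a full maturation period need not peak at exactly the same frequency.

```latex
% Exponential growth of each type, with f = F / (S + F):
\frac{dF}{d\tau} = r_F F, \qquad \frac{dS}{d\tau} = r_S S
% Quotient rule:
\frac{df}{d\tau}
  = \frac{r_F F (S + F) - F (r_S S + r_F F)}{(S + F)^2}
  = (r_F - r_S)\,\frac{F S}{(S + F)^2}
  = (r_F - r_S)\, f (1 - f)
% The factor f(1 - f) vanishes at f = 0 and f = 1
% and is largest at f = 1/2.
```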

      -  Regarding constant community-level selection, it is only briefly noted that "once a target frequency is achieved, inter-collective selection is always required to maintain that frequency due to the fitness difference between the two types" [pg. 3 §2]. In other words, action from the selector is required indefinitely to maintain the community in the desired state. This assumption is found in a fraction of the literature, but is still worth clarifying from the start as it can inform the practical applicability of the results.

      This is a good point. We have added to abstract:

      “Such collective selection is dictated by two opposing forces: during collective maturation, intra-collective selection acts like a waterfall, relentlessly driving the S-frequency to lower values, while during collective reproduction, inter-collective selection resembles a rafter striving to reach the target frequency. Due to this model structure, maintaining a target frequency requires the continued action of inter-collective selection.”

      - More importantly, strain-level evolution also boils down here to pure selection with a constant target, which is less usual in the relevant literature. Here, (1) drift from limited population sizes is very small, with no meaningful counterbalancing of selection, (2) pure exponential regime with constant fitness, no interactions, no density- or frequency-dependence, (3) there is no innovation in the sense that available types are unchanging through time (no evolution of traits such as growth rate or interactions) and (4) all the results presented seem unchanged when mutation rate mu = 0 (as noted in Appendix III), meaning that the conclusions are not "about" mutation in any meaningful way.

      With regard to point (1), Figure 4a (reproduced below) shows how Newborn size affects the region of achievable targets. Indeed at large Newborn size (e.g. 5000 and above), no target frequency is achievable (since drift is too small to generate sufficient inter-community variation and consequently all communities are dominated by fast-growing F). However at Newborn size of for example 1000, there are two regions of accessible target frequencies. At smaller Newborn size, all target frequencies become achievable due to drift becoming sufficiently strong.
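
      The link between Newborn size and drift can be made concrete with a back-of-the-envelope calculation. This sketch assumes that Newborn composition is set by binomial sampling of N0 cells from a parent with F frequency f, which is an illustrative simplification rather than the paper's full model; the function name is hypothetical.

```python
import math


def newborn_freq_sd(f, n0):
    """Standard deviation of the Newborn F frequency under binomial sampling.

    If the F count is Binomial(n0, f), the frequency has variance
    f * (1 - f) / n0.
    """
    return math.sqrt(f * (1.0 - f) / n0)


# Spread among Newborns at an intermediate parent frequency of 0.5.
for n0 in (100, 1000, 5000):
    print(n0, newborn_freq_sd(0.5, n0))
```

      The spread falls as one over the square root of N0, so larger Newborns give inter-collective selection less variation to act on.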

      With regard to points (2) and (3), we have added to Introduction

      “To enable the derivation of an analytical expression, we have made the following simplifications.

      First, growth is always exponential, without complications such as resource limitation, ecological interactions between the two populations, or density-dependent growth. Thus, the exponential growth equation can be used. Second, we consider only two populations (genotypes or species): the fast-growing F population with size F and the slow-growing S population with size S. We do not consider a spectrum of mutants or species, since with more than two populations, an analytical solution becomes very difficult.”

      With regard to point (4), we view this as a strength rather than weakness. We have added the following to the beginning of Results and Discussions:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations.”

      “When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      See Point 1 of Common comments.

      - Furthermore, the choice of mutation mechanism is peculiar, as it happens only from slow to fast grower: more commonly, one assumes random non-directional mutations, rather than purely directional ones from less fit to fitter (which is more of a "Lamarckian" idea). Given that mutation does not seem to matter here, this choice might create unnecessary opposition from some readers or could be considered as just one possibility among others.

      We have added the following justification:

      “This scenario is encountered in biotechnology: an engineered pathway will slow down growth, and breaking the pathway (and thus faster growth) is much easier than the other way around.”

      It would be helpful to have all these points stated clearly so that it becomes easy to see where this article stands in an abundant literature and contributes to our understanding of multi-level evolution, and why it may have different conclusions or focus than others tackling very similar questions.

      Finally, a microbial context is given to the study, but the assumptions and results are in no way truly tied to that context, so it should be clear that this is just for flavor.

      We have deleted “microbial” from the title, and revised our abstract.

      Recommendations For The Authors

      (1) More details concerning our main remark above:

      - The paragraph discussing refs [24, 33] is not very clear in how they most importantly differ from this study. Our impression is that the resource aspect is not very important for instance, and the main difference is that these other works assume that strains can change in their traits.

      We are fairly sure that resource depletion is important in Rainey group’s study, as the attractor only evolved after both strains grew fast enough to deplete resources by the end of maturation. Indeed, evolution occurred in interaction coefficients which dictate the competition between strains for resources.

      Regardless, you raised an excellent point. As discussed earlier, we have added the following:

      “To enable the derivation of an analytical expression, we have made the following simplifications.

      First, growth is always exponential, without complications such as resource limitation, ecological interactions between the two populations, or density-dependent growth. Thus, the exponential growth equation can be used. Second, we consider only two populations (genotypes or species): the fast-growing F population with size F and the slow-growing S population with size S. We do not consider a spectrum of mutants or species, since with more than two populations, an analytical solution becomes very difficult.”

      - We would advise the main text to focus on mu = 0, and only say in discussion that results can be generalized.

      Your suggestion is certainly good. However, given the large amount of work involved in a reorganisation, we have decided to adhere to our current narrative. However, as discussed earlier, we have added this at the beginning of Results to help orient readers:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations.”

      “When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      (2) We think the material on pg. 5 "Intra-collective evolution is the fastest at intermediate F frequencies, creating the "waterfall" phenomenon", although interesting, could be presented in a different way. The mathematical details on how to find the probability distribution of the maximum of independent random variables (including Equation 1) will probably be skipped by most of the readers (for experienced theoreticians, it is standard content; for experimentalists, it is not the most relevant), as such I would recommend displacing them to SM and report only the important results.

      This is an excellent suggestion. We have put a sketch of our calculations in a box in the main text to help orient interested readers. As before, details are in SI.
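
      For readers unfamiliar with the standard result the reviewer refers to, the distribution of an extreme value among independent draws can be stated briefly. This is the textbook identity only, written with hypothetical symbols (g independent collectives, each with frequency CDF P and density p); it is not a reproduction of Equation 1 of the manuscript.

```latex
% Maximum of g i.i.d. variables X_1, ..., X_g with CDF P(x) and density p(x):
\Pr\!\left[\max_i X_i \le x\right] = P(x)^{g},
\qquad
p_{\max}(x) = g\, P(x)^{g-1}\, p(x)
% Minimum of the same variables:
\Pr\!\left[\min_i X_i \le x\right] = 1 - \bigl(1 - P(x)\bigr)^{g},
\qquad
p_{\min}(x) = g\, \bigl(1 - P(x)\bigr)^{g-1}\, p(x)
```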

      Similarly, Equations 2, 3, and 4 are hard to read given the large amount of parameters and the low amount of simplification. Although exploring the effect of the different parameters through Figures 3 and 4 is useful, I think the role of the equations should be reconsidered:

      i. Is it possible to rewrite them in terms of effective variables in a more concise way?

      See Point 3 of Common comments.

      ii. Is it possible to present extreme/particular cases in which they are easier to interpret?

      We have focused on the case where the mutation rate is zero. This makes the mathematical expressions much simpler (see above).

      (3) Is it possible to explain more in detail why the distribution of f_k+1 conditional to f_k^* is well approximated by a Gaussian? Also, have you explored to what extent the results would change if this were not true (in light of the few universal classes for the maximum of independent variables)?

      Despite the appeal to the CLT and the histograms in the Appendix suggesting that the distribution looks a bit like a Gaussian at a certain scale, fluctuations on that scale are not necessarily what is relevant for the results - a rapid (and maybe wrong) attempt at a characteristic function calculation suggests that in your case, one does not obtain convergence to Gaussians unless we renormalize by S(t=0) and F(t=0), so it seems there is a justification missing in the text as is for the validity of this approximation (or that it is simply assumed).

      See point 4 of Common comments.

      Reviewer #3 (Public Reviews):

      The authors address the process of community evolution under collective-level selection for a prescribed community composition. They mostly consider communities composed of two types that reproduce at different rates, and that can mutate one into the other. Due to such differences in 'fitness' and to the absence of density dependence, within-collective selection is expected to always favour the fastest grower, but the collective-level selection can oppose this tendency, to a certain extent at least. By approximating the stochastic within-generation dynamics and solving it analytically, the authors show that not only high frequencies of fast growers can be reproducibly achieved, aligned with their fitness advantage. Small target frequencies can also be maintained, provided that the initial proportion of fast growers is sufficiently small. In this regime, similar to the 'stochastic corrector' model, variation upon which selection acts is maintained by a combination of demographic stochasticity and of sampling at reproduction. These two regions of achievable target compositions are separated by a gap, encompassing intermediate frequencies that are only achievable when the bottleneck size is small enough or the number of communities is (disproportionately) larger.

      A similar conclusion, that stochastic fluctuations can maintain the system over evolutionary time far from the prevalence of the faster-growing type, is then confirmed by analyzing a three-species community, suggesting that the qualitative conclusions of this study are generalizable to more complex communities.

      I expect that these results will be of broad interest to the community of researchers who strive to improve community-level selection, but are often limited to numerical explorations, with prohibitive costs for a full characterization of the parameter space of such embedded populations. The realization that not all target collective functions can be as easily achieved and that they should be adapted to the initial conditions and the selection protocol is also a sobering message for designing concrete applications.

      A major strength of this work is that the qualitative behaviour of the system is captured by an analytically solvable approximation so that the extent of the 'forbidden region' can be directly and generically related to the parameters of the selection protocol.

      Thanks so much for these positive comments.

      I however found the description of the results too succinct and I think that more could be done to unpack the mathematical results in a way that is understandable to a broader audience. Moreover, the phenomenon the authors characterize is of purely ecological nature. Here, mutations of the growth rate are, in my understanding, neither necessary (non-trivial equilibria can be maintained also when \mu =0) nor sufficient (community-level selection is necessary to keep the system far from the absorbing state) for the phenomenon described. Calling this dynamics community evolution reflects a widespread ambiguity, and is not ascribable just to this work. I find that here the authors have the opportunity to make their message clearer by focusing on the case where the 'mutation' rate \mu vanishes (Equations 39 & 40 of the SI) - which is more easily interpretable, at least in some limits - while they may leave the more general equations 3 & 4 in the SI.

      See points 1-4 of Common comments.

      Combined with an analysis of the deterministic equations, that capture the possibility of maintaining high frequencies of fast growers, the authors could elucidate the dynamics that are induced by the presence of a second level of selection, and speculate on what would be the result of real open-ended evolution (not encompassed by the simple 'switch mutations' generally considered in evolutionary game theory), for instance discussing the invasibility (or not) of mutant types with slightly different growth rates.

      Indeed, evolution is not restricted to two types. However, our main goal here is to derive an analytical expression, and it was difficult for even two types. For three-type collectives, we had to resort to simulations. Investigating the case where fitness effects of mutations are continuously distributed is beyond the scope of this study.

      The single most important model hypothesis that I would have liked to be discussed further is that the two types do not interact. Species interactions are not only essential to achieve inheritance of composition in the course of evolution but are generally expected to play a key role even on ecological time scales. I hope the authors plan to look at this in future work.

      In our system, the S and F do interact in a competitive fashion: even though S and F are not competing for nutrients (which are always in excess), they are competing for space. This is because a fixed number of cells are transferred to the next cycle. Thus, the presence of F will for example reduce the chance of S being propagated. We have added this clarification to our main text:

      “Note that even though S and F do not compete for nutrients, they compete for space: because the total number of cells transferred to the next cycle is fixed, an overabundance of one population will reduce the likelihood of the other being propagated.”

      Recommendations For The Authors

      I felt the authors could put some additional effort into making their theoretical results meaningful for a population of readers who, though not as highly mathematically educated as they are, can nonetheless appreciate the implications of simple relations or scaling. Below, you find some suggestions:

      (1) In order to make it clear that there is a 'natural' high-frequency equilibrium that can be reached even in the absence of selection, the authors could examine first the dynamics of the deterministic system in the absence of mutations, and use its equilibria to elucidate the combined role of the 'fitness' difference \omega and of the generation duration \tau in setting its value. The fact that these parameters always occur in combination (when there are no mutations) is a general and notable feature of the stochastic model as well. Moreover, this model would justify why you only focus on decreasing the frequency in the new generation.

      Note that the ‘natural’ high-frequency equilibrium in the absence of collective selection is when fast grower F becomes fixed in the population. Following your suggestion, we have introduced two parameters, R<sub>τ</sub> and W<sub>τ</sub>, to reflect the coupling between ‘fitness’ and ‘generation duration’:

      (2) Since the phenomenon described in the paper is essentially ecological in nature (as the author states, it does not change significantly if the 'mutation rate' \mu is set to zero), I would put in the main text Equations 39 & 40 of the SI in order to improve intelligibility.

      See Point 2 at the beginning of this letter.

      These equations can be discussed in some detail, especially in the limit of small f^*_k, where I think it is worth discussing the different dependence of the mean and the variance of the frequency distribution on the system's parameters.

      This is a great suggestion. We have added the following:

      “In the limit of small , Equation (3) becomes f while Equation (4) becomes . Thus, both Newborn size (N<sub>0</sub>) and fold-change in F/S during maturation (W<sub>τ</sub>) are important determinants of selection progress.

      (3) I would have appreciated an explanation in words of what are the main conceptual steps involved in attaining Equation 2, the underlying hypotheses (notably on community size and distributions), and the expected limits of validity of the approximation.

      See points 3 and 4 at the beginning of this letter.

      (4) I think that some care needs to be put into explaining where extreme value statistics is used, and why is the median of the conditional distribution the most appropriate statistics to look at for characterizing the evolutionary trajectory (which seems to me mostly reliant on extreme values).

      Great point! We added an explanation of the use of the median value in Box 1, and also added Figure 7 to the SI to explain it.

      Showing in a figure the different distributions you are considering (for instance, plotting the conditional distribution for one generation in the trajectories displayed in Figure 2) would be useful to understand what information \bar f provides on a sequence of collective generations, where in principle there may be memory effects.

      Thanks for this suggestion. We have added panel d to Fig. 2 to illustrate the shape and position of the F frequency distributions at each step in the first two selection cycles.

      (5) Similarly, I do not understand why selecting the 5% best communities should push the system's evolution towards the high-frequency solution, instead of just slowing down the improvement (unless you are considering the average composition of the top best communities - which should be justified). I think that such sensitivity to the selection intensity should be appropriately referenced and discussed in the main text, as it is a parameter that experimenters are naturally led to manipulate.

      In the main text, we have added this explanation:

      “In contrast with findings from an earlier study [23], choosing top 1 is more effective than the less stringent “choosing top 5%”. In the earlier study, variation in the collective trait is partly due to nonheritable factors such as random fluctuations in Newborn biomass. In that context, a less stringent selection criterion proved more effective, as it helped retain collectives with favorable genotypes that might have exhibited suboptimal collective traits due to unfavorable nonheritable factors. However, since this study excludes nonheritable variations in collective traits, selecting the top 1 collective is more effective than selecting the top 5% (see Fig. 11 in Supplementary Information).”

      (6) Equation 1 could be explained in simpler terms as the product between the probability that one collective reaches the transmitted value times the probability that all others do worse than that. The current formulation is unclear, perhaps just a matter of English formulation.

      We have revised our description to state:

      “Equation (1) can be described as the product between two terms related to probability: (i) describes the probability density that any one of the g Adult collectives achieves f given , and (ii) describes the probability that all other g – 1 collectives achieve frequencies above f and thus not selected.”
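      The two factors in this description correspond to the standard density of the lowest of g independent draws, g · p(f) · [1 − P(f)]^(g − 1), where p and P are the density and cumulative distribution of a single Adult's F frequency. The following is a minimal sketch of that relation, not the authors' Equation (1): the Gaussian shape, the function names, and the parameter values are illustrative assumptions, and it assumes the target lies below every attainable frequency, so that "closest to the target" means "lowest".

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density p(f) of a single Adult frequency (Gaussian is an assumption)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Cumulative distribution P(f) of a single Adult frequency."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def selected_density(f, g, mean, sd):
    """Density of the lowest of g i.i.d. frequencies.

    Term (i): g * p(f), any one of the g collectives achieves f.
    Term (ii): (1 - P(f))**(g - 1), the other g - 1 collectives lie above f.
    """
    return g * normal_pdf(f, mean, sd) * (1.0 - normal_cdf(f, mean, sd)) ** (g - 1)


def mean_from_density(g, mean, sd, steps=20000):
    """Midpoint-rule integration; returns (mean of selected f, total probability)."""
    lo, hi = mean - 10.0 * sd, mean + 10.0 * sd
    h = (hi - lo) / steps
    first_moment = 0.0
    total = 0.0
    for k in range(steps):
        f = lo + (k + 0.5) * h
        d = selected_density(f, g, mean, sd)
        first_moment += f * d * h
        total += d * h
    return first_moment, total


def mean_from_simulation(g, mean, sd, trials, rng):
    """Draw g Adults per trial, keep the lowest frequency, and average over trials."""
    picked = (min(rng.gauss(mean, sd) for _ in range(g)) for _ in range(trials))
    return sum(picked) / trials
```

      Comparing `mean_from_density` with `mean_from_simulation` for the same g, mean, and sd gives a quick check that the product form reproduces what direct sampling of the selection step yields.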

      (7) I think that the discussion of the dependence of the boundaries of the 'waterfall' region with the difference in growth rate \omega is important and missing, especially if one wants to consider open-ended evolution of the growth rate - which can occur at steps of different magnitude.

      We added a new chapter and figure in supplementary information on the threshold values when \omega varies. As expected, smaller \omega enlarges the success area.

      We have also added a new figure panel to show how maturation time affects selection efficacy.

      (8) Notations are a bit confusing and could be improved. First of all, in most equations in the main text and SI, what is initially introduced as \omega appears as s. This is confusing because the letter s is also used for the frequency of the slow type.

      The letter S is used to denote an attribute of cells (S cells), the type of cells (Equations 1-3 of the SI) and the number of these cells in the population, sometimes with different meanings in the same sentence. This is confusing, and I suggest referring to slow cells or fast cells instead (or at least to S-cells and F-cells), and keeping S and F as variables for the number of cells of the two types.

      All typos related to the notation have been fixed. We use S and F as types, and S and F (italic) as population numbers.

      (9) On page 3, when introducing the sampling of newborns as ruled by a binomial distribution, the information that you are just transmitting one collective is needed, while it is conveyed later.

      We have added this emphasis:

      “At the end of a cycle, a single Adult with the highest function (with F frequency f closest to the target frequency ) is chosen to reproduce g Newborn collectives each with N<sub>0</sub> cells (‘Selection’ and ’Reproduction’ in Fig. 1).”

      (10) I found that the abstract talks too early about the 'waterfall' phenomenon. As this is a concept introduced here, I suggest the authors first explain what it is, then use the term. It is a useful metaphor, but it should not obscure the more formal achievements of the paper.

      We feel that the “waterfall” analogy offers a gentle helping hand to orient those who have not thought much about the phenomenon. We view the abstract as an opportunity to attract readership, and thus the more accessible the better.

      (11) In the SI there are numerous typos and English language issues. I suggest the authors read carefully through it, and add line numbers to the next version so that more detailed feedback is possible.

      Thank you for going through the SI. We have gone through it and fixed the problems.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The present work studies the coevolution of HIV-1 and the immune response in clinical patient data. Using the Marginal Path Likelihood (MPL) framework, they infer selection coefficients for HIV mutations from time-series data of virus sequences as they evolve in a given patient.

      Strengths:

      The authors analyze data from two human patients, consisting of HIV population sequence samples at various points in time during the infection. They infer selection coefficients from the observed changes in sequence abundance using MPL. Most beneficial mutations appear in viral envelope proteins. The authors also analyze SHIV samples in rhesus macaques, and find selection coefficients that are compatible with those found in the corresponding human samples.

      Weaknesses:

      The MPL method used by the authors considers only additive effects of mutations, thus ignoring epistasis.

      As suggested, we have now addressed this limitation by inferring epistatic fitness landscapes for CH505, CH848, SHIV.CH505, and SHIV.CH848. Indeed, the computational burden of the epistasis inference procedure was one constraint that motivated us to consider only additive fitness in the previous version of our paper. The original approach developed by Sohail et al. (2022) tested only sequences with <50 sites due to this limitation, far smaller than the ones we consider. Beyond this computational constraint, we also believed that 1) an additive fitness model may suffice to capture local fitness landscapes, and practically, 2) epistatic interactions are more challenging to validate than the effects of individual mutations, making the interpretation of the model more complex.

      However, after performing the analyses described in this paper, we developed a new approach for identifying epistatic interactions that can scale to much longer sequences (Shimagaki et al., Genetics, in press). We therefore applied this method to infer an epistatic fitness landscape for the HIV and SHIV data sets that we studied. As in that work, we focused on short-range (<50 bp) interactions which we could more confidently estimate from data. We have added a section in the SI describing the epistatic fitness model and our analysis. 

      Overall, we found substantial agreement between the epistatic and purely additive models in terms of the estimated fitness effects of individual mutations (new Supplementary Fig. 8) and overall fitness (Supplementary Fig. 9). Consistent with our prior work, we did not find substantial evidence for very strong epistatic interactions (Supplementary Fig. 10). This does not necessarily mean that strong epistatic interactions do not exist; rather, this shows that strong interactions don’t substantially improve the fit of the model to data, and thus many are regularized toward zero. While the biological validation of epistatic interactions is challenging, we found that the largest epistatic interactions, which we defined as the top 1% of all short-range interactions, were modestly but significantly enriched in the CD4 binding site, V1 and V5 regions for CH505 and in the CD4 binding site, V4, and V5 for CH848. In addition, mutation pairs N280S/V281A and E275K/V281G, which confer resistance to CH235, ranked in the top 15% of all epistatic interactions in CH505.

      We have now included an additional section in the Results, “Robustness of inferred selection to changes in the fitness model and finite sampling”, which discusses our epistatic analyses (page 6, lines 415-464), along with the above Supplementary Figures and a technical section in the SI summarizing the epistasis inference approach.

      Although the evolution of broadly neutralizing antibodies (bnAbs) is a motivating question in the introduction and discussion sections (and the title), the relevance of the analysis and results to better understanding how bnAbs arise is not clear. The only result presented in direct connection to bnAbs is Figure 6.

      It is true that, while bnAb development is a major motivator of our study, our analysis focuses on HIV-1 and does not directly consider antibody evolution. We have now brought attention to this point as a limitation directly in the Discussion. Following the suggestion below in the “Recommendations for the authors,” we have edited our manuscript to place more emphasis on viral fitness and somewhat reduce the emphasis on bnAbs, though this remains an important motivating factor. Specifically, the Abstract now begins

      Human immunodeficiency virus (HIV)-1 evolves within individual hosts to escape adaptive immune responses while maintaining its capacity for replication. Coevolution between HIV-1 and the immune system generates extraordinary viral genetic diversity. In some individuals, this process also results in the development of broadly neutralizing antibodies (bnAbs) that can neutralize many viral variants, a key focus of HIV-1 vaccine design. However, a general understanding of the forces that shape virus-immune coevolution within and across hosts remains incomplete. Here we performed a quantitative study of HIV-1 evolution in humans and rhesus macaques, including individuals who developed bnAbs.

      We have similarly modified the Discussion to focus first on viral fitness. In response to comments from Reviewer 3, we have also more clearly articulated how our work might contribute to the understanding of bnAb development in the Discussion.

      Questions or suggestions for further discussion:

      I list here a number of points for which I believe the paper would benefit if additional discussion/results were included.

      The MPL method used by the authors considers only additive effects of mutations, thus ignoring epistasis. In Sohail et al (2022) MBE 39(10), p. msac199  (https://doi.org/10.1093/molbev/msac199) an extension of MPL is developed allowing one to infer epistasis. Can the authors comment on why this was not attempted here?

      I presume one possible reason is that epistasis inference requires considerably more computational effort (and more data). However, since the authors find most beneficial mutations occurring in Env, perhaps restricting the analysis to Env genes only (e.g. the trimer shown in Figure 2) can lead to tractable inference of epistasis within this segment (instead of the full genome).

      As described above, we have now addressed this comment by inferring epistatic fitness landscapes for the data sets that we consider. Our overall results using the epistatic fitness model are consistent with the ones that we previously obtained with an additive model.

      Do the authors find correlations in the inferred selection coefficients of the two samples CH505 and CH848? I could not find any discussion of this in the manuscript. Only correlations between Humans and RM are discussed.

      To address this question, we compared the fitness values and individual selection coefficients across CH505 and CH848 data sets. We found little correlation between CH505 and CH848 fitness values (shown in a new Supplementary Fig. 6) or selection coefficients. We found only 199 common mutations between HIV-1 amino acid sequences from CH505 and CH848 out of 868 and 1,406 total mutations, respectively. Thus, we were not surprised to find no strong relationship between fitness estimates from CH505 and CH848 data sets. 

      Reviewer #2 (Public review):

      Summary:

      This paper combines a biological topic of interest with the demonstration of important theoretical/methodological advances. Fitness inference is the foundation of the quantitative analysis of adapting systems. It is a hard and important problem and this paper highlights a compelling approach (MPL) first presented in (1) and refined in (2), roughly summarized in equation 12.

      (1) Sohail, M. S., Louie, R. H., McKay, M. R. & Barton, J. P. Mpl resolves genetic linkage in fitness inference from complex evolutionary histories. Nature biotechnology 39, 472-479 (2021).

      (2) Shimagaki, K. & Barton, J. P. Bézier interpolation improves the inference of dynamical models from data. Physical Review E 107, 024116 (2023).

      The authors find that positive selection shapes the variable regions of env in shared patterns across two patient donors. The patterns of positive selection are interesting in and of themselves, they confirm the intuition that hyper-variation in env is the result of immune evasion rather than a broadly neutral landscape (flatness). They show that the immune evasion patterns due to CD8 T and naive B-cell selection are shared across patients. Furthermore, they suggest that a particular evolutionary history (larger flux to high fitness states) is associated with bNAb emergence. Mimicking this evolutionary pattern in vaccine design may help us elicit bNAbs in patients in the future.

      There is a lot of information to be found in the full fitness landscape of env. The enormous strength of reversion-to-consensus in the patterns is a known pattern of HIV post-infection populations but they are nicely quantified here. Agreement between SHIV and HIV evolution is shown. They find selection is larger for autologous antibodies than the bNAbs themselves (perhaps bNAbs are just too small a component of the host response to drive the bulk of selection?), and that big fitness increases precede antibody breadth in rhesus macaques, suggesting that this fitness increase is the immune challenge required to draw forth a bNAb. This is all of high interest to HIV researchers.

      Strength of evidence:

      One limitation is, of course, that the fitness model is constant in time when the immune challenge is variable and changing. This simplification may complicate some interpretations.

      We agree that this is a limitation of our current approach. In prior work, we have found that the constant fitness effects of mutations that we infer typically reflect the time-averaged fitness effect when the selection changes over time (Gao and Barton, PNAS 2025; Lee et al., Nat Commun 2025). It could be difficult, however, to capture changes in selection that fluctuate rapidly with underlying immune responses. We have added a new paragraph in the Discussion that more clearly sets out some of the limitations of our analysis, including our assumption of constant selection coefficients.

      There are additional methodological and technical limitations that should be considered in the interpretation of our results. Most notably, we assume that the viral fitness landscape is static in time. While we do not expect selection for effective replication (“intrinsic” fitness) to change substantially over time, pressure for immune escape could vary along with the immune responses that drive them. In prior work, we have found that constant selection coefficients typically reflect the average fitness effect of a mutation when its true contribution to fitness is time-varying [42,43]. This may not adequately describe mutational effects that undergo large or rapid shifts in time. Future work should also examine temporal patterns in selection for individual mutations.

      Equation 12 in the methods is really a beautiful tool because it is so simple, but accounts for linkage and can be solved precisely even in the presence of detailed mutational and selection models. However, the reliance on incomplete observations of the frequency leads to complications that must be carefully (re)addressed here.

      For instance, the consistent finding of strong selection in hypervariable regions is biologically intuitive but so striking, that I worry that it might be the result of a bias for selection in high entropy regions. 

      Thank you for this suggestion. We agree that it is important to carefully interrogate these results. To assess the effects of general sequence variability on inferred selection, we first computed a position-specific entropy measure, H<sub>i</sub>, for each site i. We first defined the time-dependent entropy H<sub>i</sub>(t) = −∑<sub>a</sub> x<sub>i</sub>(a, t) log x<sub>i</sub>(a, t), where x<sub>i</sub>(a, t) represents the frequency of amino acid/nucleotide a at position i and time t, at each sample time. We then computed H<sub>i</sub> as the average of H<sub>i</sub>(t) across all sample times. A new Supplementary Fig. 1 plots the entropy against the inferred selection coefficients. Although some sequence variation must be observed in order for us to infer that a mutation is beneficial, we did not find a systematic bias toward larger (more beneficial) selection coefficients at more variable sites. Overall, we found only a modest correlation between inferred selection coefficients and entropy (Pearson’s r = 0.33 and 0.29 for CH505 and CH848, respectively), which appears to be partly driven by the tendency for mutations inferred to be significantly deleterious to occur at sites with low entropy. In addition to the new Supplementary Figure, we have added a reference to this analysis in the main text:

      To test whether our results might be biased by overall sequence variability, we examined the relationship between our inferred selection coefficients and entropy, a common measure of sequence variability. Overall, we found only a modest correlation between selection and entropy, suggesting that the signs of selection that we observe are not due to increased sequence variability alone (Supplementary Fig. 1).
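      The entropy measure described in this response (per-time entropy at a site, averaged over sample times) can be sketched as follows. This is an illustrative reading of the definition, not the authors' code: the function names and the input layout (one list of equal-length sequences per sample time) are assumptions, and the natural logarithm is used because the response does not state the base.

```python
import math
from collections import Counter


def allele_frequencies(column):
    """Frequencies x_i(a, t) of each state a in one alignment column at one time."""
    counts = Counter(column)
    n = len(column)
    return {state: count / n for state, count in counts.items()}


def entropy(freqs):
    """H = -sum_a x(a) log x(a); states with zero frequency contribute nothing."""
    return -sum(x * math.log(x) for x in freqs.values() if x > 0.0)


def time_averaged_entropy(samples_by_time, site):
    """Average of the per-time entropies H_i(t) at position `site`.

    samples_by_time: list with one entry per sample time, each a non-empty
    list of equal-length sequences (hypothetical input layout).
    """
    per_time = [
        entropy(allele_frequencies([seq[site] for seq in seqs]))
        for seqs in samples_by_time
    ]
    return sum(per_time) / len(per_time)
```

      For example, a site that is split evenly between two states at one sample time and fixed at another has a time-averaged entropy of (log 2)/2.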

      Mutational and covariance terms in equation 12 might be underestimated, due to finite sampling effect in highly diverse populations. Sampling effects lead to zeros in x(t) when actual frequency zeros might be rare at the population sizes of HIV viral loads and mutation rates. Both mutational flux and C underestimation will bias selection upward in eq. 12. 

      The prior papers (1) and (2) seem to show robustness to finite sampling effects, but, again, more care needs to be shown that this robustness transfers to the amino acid inference under these conditions. That synonymous sites are rarely selected for in the nucleotide level is a good sign, and it may be a matter of simply fully explaining the amino-acid level model.

      As above, we agree that these tests are important. To assess the robustness of our results to finite sampling, we performed bootstrap sampling on the viral sequences and inferred selection coefficients using the resampled sequences. Specifically, we resampled the same number of sequences as in the original data at each time point and repeated this for all time points across all HIV-1 and SHIV data sets. A new Supplementary Fig. 11 shows a typical comparison of the original selection coefficients vs. those obtained through bootstrap resampling. Overall, we observe a high degree of consistency between the selection coefficients in each case, which is surely aided by the long time series in these data sets. As pointed out by the reviewer, uncertainty in low-frequency mutations is a particular concern, though the effects on inferred selection are mitigated by regularization. 

      We have added a section in the Results, “Robustness of inferred selection to changes in the fitness model and finite sampling”, which includes this analysis:

      Finite sampling of sequence data could also affect our analyses. To further test the robustness of our results, we inferred selection coefficients using bootstrap resampling, where we resample sequences from the original ensemble, maintaining the same number of sequences for each time point and subject. The selection coefficients from the bootstrap samples are consistent with the original data (see Supplementary Fig. 11), with Pearson’s r values of around 0.85 for HIV-1 data sets and 0.95 for SHIV data sets, respectively.
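      The resampling step described here can be sketched as below: sequences are drawn with replacement within each time point, keeping the number of sequences at each time point unchanged, and two sets of coefficients are then compared with Pearson's r. The selection inference itself (MPL) is not reproduced; the function names and input layout are assumptions made for illustration.

```python
import math
import random


def bootstrap_resample(samples_by_time, rng):
    """Resample sequences with replacement, separately for each time point.

    samples_by_time: list with one non-empty list of sequences per sample time
    (hypothetical layout). Each resampled time point keeps its original size.
    """
    return [rng.choices(seqs, k=len(seqs)) for seqs in samples_by_time]


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers,
    e.g. original vs. bootstrap selection coefficients."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

      In practice, each resampled data set would be passed through the same inference pipeline as the original data, and `pearson_r` applied to the two resulting vectors of selection coefficients.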

      Uncertainty propagates to the later parts of the paper, e.g., HIV and SIV shared patterns might be the result of shared biases in the method application. However, this worry does not extend to the apples-to-apples comparison of fitness trajectories across individuals (Figures 5 and 6), which I think are robust (for these sample sizes).

      One way to address this uncertainty is to compare the fitness values and individual selection coefficients across CH505 and CH848 data sets, which was also requested by Reviewer 1. Overall, we found little correlation between CH505 and CH848 fitness values (shown in a new Supplementary Fig. 6) or selection coefficients. This suggests that similarities between HIV-1 and SHIV landscapes are not solely determined by potential biases in the inference approach. We have now added a reference to this point in the main text:

      In contrast, the inferred fitness landscapes of CH505 and CH848, which share few mutations in common, are poorly correlated (Supplementary Fig. 6). This suggests that the similarities between viral fitness values in humans and RMs are not artifacts of the model, but rather stem from similarities in underlying evolutionary drivers.

      The timing evidence is slightly weakened by the fact that bNAb detection is different from bNAb presence and the possibility that fitness increases occurred after the bNAbs appeared remains. Still, their conclusion is plausible and fits in with the other observations which form a coherent and compelling picture.

      Yes, we agree that this is a limitation of our analysis — bNAbs may have been present at low levels before they were detected, and we cannot definitively reject selection by bNAbs. Nonetheless, in at least one case (RM5695), rapid fitness gains were substantially separated in time from bNAb detection (roughly 2 weeks after infection vs. 16 weeks, respectively). We have now added this point in a new paragraph in the Discussion:

      While we found a strong relationship between viral fitness dynamics and the emergence of bnAbs, it may not be true that the former stimulates the latter. For example, bnAbs may have been present within each host before they were experimentally detected. Rapid viral fitness gains within hosts that developed broad antibody responses could then have been driven by undetected bnAb lineages. However, we did not find strong selection for known bnAb resistance mutations, and in at least one case (RM5695), rapid fitness gains (roughly 2 weeks after infection) substantially preceded bnAb detection (16 weeks). Still, given the limited size of the data set that we studied, it is unclear the extent to which our results will transfer to larger and broader data sets.

      Overall this is a convincing paper, part of a larger admirable project of accurately inferring complete fitness landscapes.

      Reviewer #3 (Public review):

      Summary:

      Shimagaki et al. investigate the virus-antibody coevolutionary processes that drive the development of broadly neutralizing antibodies (bnAbs). The study's primary goal is to characterize the evolutionary dynamics of HIV-1 within hosts that accompany the emergence of bnAbs, with a particular focus on inferring the landscape of selective pressures shaping viral evolution. To assess the generality of these evolutionary patterns, the study extends its analysis to rhesus macaques (RMs) infected with simian-human immunodeficiency viruses (SHIV) incorporating HIV-1 Env proteins derived from two human individuals.

      Strengths:

      A key strength of the study is its rigorous assessment of the similarity in evolutionary trajectories between humans and macaques. This cross-species comparison is particularly compelling, as it quantitatively establishes a shared pattern of viral evolution using a sophisticated inference method. The finding that similar selective pressures operate in both species adds robustness to the study's conclusions and suggests broader biological relevance.

      Weaknesses:

      However, the study has some limitations. The most significant weakness is that the authors do not sufficiently discuss the implications of the observed similarities. While the identification of shared evolutionary patterns (e.g., Figure 5) is intriguing, the study would benefit from a more explicit discussion of what these findings mean, for instance in the context of HIV vaccine design, immunotherapy, or fundamental viral-host interactions. Even speculative interpretations could provide valuable insights into the broader significance of these results.

      Thank you for this suggestion. We have now clarified the potential implications of our work in several areas. While speculative, one possible application is in vaccine design: it may be beneficial to design sequential immunogens to mimic the patterns of viral evolution associated with rapid fitness gains. This “population-based” design principle is different from typical approaches, which have focused on molecular details of virus surface proteins. 

      We have extended our discussion of our results in the context of viral evolution within and across hosts and related host species. Overall, our work suggests that there may be relatively few paths to significantly higher viral fitness in vivo. Evolutionary “contingencies” such as shifting immune pressure or epistatic interactions could influence the direction of evolution, but not so dramatically that the dynamics that we see in different hosts are not comparable. We have also connected our work more broadly to the literature in evolutionary parallelism in HIV-1 in different contexts.

      A secondary, albeit less critical, limitation is the placement of methodological details in the Supplementary Information. While it is understandable that the authors focus on results in the main text - especially since the methodology is not novel and has been previously described in earlier publications - some readers might benefit from a more thorough presentation of the method within the main paper.

      We have now modified the main text to add a new section, “Model overview,” that lays out the key steps of our approach. While we reserve technical details for the Methods, we believe that this new section provides more intuition about how our results were obtained (including a discussion of the important Eq. 12, now Eq. 3 in the main text) and our underlying assumptions.

      Conclusions:

      Overall, the study presents a compelling analysis of HIV-1 evolution and its parallels in SHIV-infected macaques. While the quantitative comparison between species is a notable contribution, a deeper discussion of its broader implications would strengthen the paper's impact.

      Reviewer #1 (Recommendations for the authors):

      I suggest de-emphasizing bnAbs and focusing on selection landscape inference, which seems to be the actual focus of the paper.

      While we do not directly study antibody development in this work, bnAb development is certainly an important motivating factor. As described in the responses above, we have now modified the Abstract and Discussion to place relatively more emphasis on fitness comparisons and relatively less on bnAb development.

      Reviewer #2 (Recommendations for the authors):

      Please make sure that the MPL method is defined in this paper and its limitations are at least partially repeated.

      As noted in responses above, we have now included more methodological details in the main text of the paper, which we hope will make the intuition and assumptions involved in our analysis clearer.

      I'd like the code to better show or describe the model, I could not figure out the model details by looking at the code. It seems mostly just to be csv exporting for use with preexisting MPL code. A longer code readme would be helpful.

      We have now updated the README on GitHub to include a conceptual overview of our inference approach, which references how each step is implemented in the code.

      Reviewer #3 (Recommendations for the authors):

      Try to give some more details (not necessarily giving the full mathematical derivation) on the statistical method utilized.

      As noted above, we have now expanded our discussion of the statistical methods and assumptions in the main text.

      Figures 3 and 4 are somewhat 'messy'. Although I do not have a constructive suggestion here, I feel that with a little more effort maybe the authors could come up with something more clean.

      It is true that the mutation frequency dynamics are somewhat “choppy” and difficult to follow intuitively. To attempt to make these figures easier to parse visually, we have increased the transparency on the lines and added exponential smoothing to the mutation frequencies, resulting in smoother trajectories. The trajectories without smoothing are retained in Supplementary Fig. 3. Here we also note that this smoothing is for visual purposes only; we use the original frequency trajectories for inference, rather than the smoothed ones.
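
      The display-only smoothing described above can be sketched as follows. This is a minimal illustration of simple exponential smoothing, not the authors' implementation: the function name, the smoothing factor `alpha`, and the example frequencies are all assumptions made for the sketch.

```python
def exponential_smoothing(values, alpha=0.5):
    """Simple exponential smoothing.

    s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t - 1].
    `alpha` is a hypothetical smoothing factor; the value used for the
    published figures is not stated in the response.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            # The first point is kept as-is to initialize the recursion.
            smoothed.append(x)
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Made-up mutation frequencies at successive sampling times.
raw_frequencies = [0.0, 0.4, 0.1, 0.6, 0.5, 0.9]

# A new list is returned for plotting; the raw trajectory is left
# untouched and is what would be passed to any inference step.
display_frequencies = exponential_smoothing(raw_frequencies, alpha=0.5)
```

      Returning a new list rather than modifying the input keeps the plotted trajectory separate from the raw one, which mirrors the distinction the authors draw between visualization and inference.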

    1. Author response:

      Reviewer #1 (Public review)

      Summary:

      Ever since the surprising discovery of the membrane-associated Periodic Skeleton (MPS) in axons, a significant body of published work has been aimed at trying to understand its assembly mechanism and function. Despite this, we still lack a mechanistic understanding of how this amazing structure is assembled in neuronal cells. In this article, the authors report a "gap-and-patch" pattern of labelled spectrin in iPSC-derived human motor neurons grown in culture. The mid-sections of these axons exhibit patches with reasonably well-organized MPS that are separated by gaps lacking any detectable MPS and having low spectrin content. Further, they report that the intensity modulation of spectrin is correlated with intensity modulations of tubulin as well. However, neurofilament fluorescence does not show any correlation. Using DIC imaging, the authors show that often the axonal diameter remains uniform across segments, showing a patch-gap pattern. Gaps are seen more abundantly in the midsection of the axon, with the proximal section showing continuous MPS and the distal segment showing continuous spectrin fluorescence but no organized MPS. The authors show that spectrin degradation by caspase/calpain is not responsible for gap formation, and the patches are nascent MPS domains. The gap and patch pattern increases with days in culture and can be enhanced by treating the cells using the general kinase inhibitor staurosporine. Treatment with the actin depolymerizing agent Latrunculin A reduces gap formation. The reasons for the last two observations are not well understood/explained.

      We thank the reviewer for the detailed and accurate description of the data shown and its relevance to further our understanding of MPS assembly mechanism and function.

      Strengths:

      The claims made in the paper are supported by extensive imaging work and quantification of MPS. Overall, the paper is well written and the findings are interesting. Although much of the reported data are from axons treated with staurosporine, this may be a convenient system to investigate the dynamics of MPS assembly, which is still an open question.

      We thank the reviewer for the positive comments on the manuscript, the techniques used and the proposed model.

      Weaknesses:

      Much of the analysis is on staurosporine-treated cells, and the effects of this treatment can be broad. The increase in patch-gap pattern with days in culture is intriguing, and the reason for this needs to be checked carefully. It would have been nice to have live cell data on the evolution of the patch and gap pattern using a GFP tag on spectrin. The evolution of individual patches and possible coalescence of patches can be observed even with confocal microscopy if live cell super-resolution observation is difficult.

      We will consider including live-imaging experiments using the expression of C-terminus-tagged human beta2-spectrin in the revised version of the manuscript. We are familiar with live-imaging and FRAP experiments and we will explore how to develop these experiments to generate data for inclusion in a revised submission.

      Some more comments:

      (1) Axons can undergo transient beading or regularly spaced varicosity formation during media change if changes in osmolarity or chemical composition occur. Such shape modulations can induce cytoskeletal modulations as well (the authors report modulations in microtubule fluorescence). The authors mention axonal enlargements in some instances. Although they present DIC images to argue that the axons showing gaps are often tubular, possible beading artefacts need to be checked. Beading can be transient and can be checked by doing media changes while observing the axons on a microscope.

      We do not rule out the presence of “nano beads” in these axons. It was recently suggested that the normal morphology of axons indeed resembles “pearls-on-a-string” (Griswold et al., 2025), with “nano beads” separated by thin tubular "connectors" (also referred to as NSVs, for non-synaptic varicosities). However, it is unlikely that the gap-patch pattern of beta2-spectrin can be attributed to such a morphology, given that we used formaldehyde as a fixative, and Griswold and colleagues show that aldehyde-based fixatives do not preserve NSVs. We are able to see scattered axonal enlargements (“micro beads”), as we described in distal portions in Fig. 1C(C2) and E. However, the number, appearance and staining of these are not compatible with the gap-patch pattern in beta2-spectrin. Moreover, we would have expected to see these NSVs in our extensive STED imaging, yet we did not. We will discuss this further in the resubmission.

      (2) Why do microtubules appear patchy? One would imagine the microtubule lengths to be greater than the patch size and hence to be more uniform.

      Our stainings are for the tubulin protein isoforms beta-III and alpha-II. That is, they label microtubules, but free tubulin as well. The slight decrease in tubulin intensity within gaps is indeed something to investigate, but we do not interpret it as “patchy microtubules”. If the Reviewer refers to Fig. 2C-D, it is actually difficult to discern the slight decrease in intensity with the naked eye. To further support this, we will consider including stainings and quantitative analyses for microtubules in the resubmission. We are familiar with the use of permeabilizing conditions during fixation (in protocols known as “cytoskeletal fixation”) to label microtubules (and not free tubulin).

      (3) Why do axons with gaps increase with days in culture? If patches are nascent MPS that progressively grow, one would have expected fewer gaps with increasing days in culture. Is this indicative of some sort of degeneration of axons?

      We agree that there is an apparent discrepancy. However, one has to take into account that these axons are still elongating even at 2 weeks in culture. Hence, at any time point, there is a recently added axonal compartment with low beta2-spectrin and no MPS. The dynamic evolution of the MPS also has to take into account the beta2-spectrin supply. If supply is somehow lower than a given threshold, more gaps are expected, given that the new, more distant parts of the axons have a lower supply of beta2-spectrin. To explore this formally, we are working on simulations of these multifactorial dynamic systems, which, together with key experimental observations, should improve our understanding of overall MPS assembly in growing axons. However, the findings of this project will be the subject of another manuscript.

      (4) It is surprising that Latrunculin A reduces gap formation induced by staurosporine (also seems to increase MPS correlation) while it decreases actin filament content. How can this be understood? If the idea is to block actin dynamics, have the authors tried using Jasplakinolide to stabilize the filaments?

      The results of the co-treatment with Latrunculin A and staurosporine are indeed intriguing, and provide clear evidence that the gap-and-patch pattern arises from local assembly of the MPS, which requires new actin filaments. However, the fact that F-actin within the pre-formed MPS seems unaffected is not surprising. There are many different populations of F-actin in axons (i.e. MPS rings, longitudinal filaments, actin patches, actin trails). Latrunculin A affects filaments indirectly: its target is not actin filaments, but free monomers. It ultimately affects actin filaments because they lose monomers and, deprived of new monomers, get shorter and eventually disappear. The drastic decrease in F-actin in our axons reflects that. The fact that F-actin in the MPS is preserved only indicates that these filaments are stable: if they are not losing monomers in the time frame of the treatment, the filaments remain unaffected. We will support this with more observations and imaging, and with a more extensive discussion summarizing the literature on the matter, in the resubmission.

      On the other hand, the use of F-actin-stabilizing drugs (like Jasplakinolide) would have a different effect. For the resubmission, we will study how an experiment with these drugs could be informative about the process under investigation.

      (5) The authors speculate that the patches are formed by the condensation of free spectrins, which then leaves the immediate neighborhood depleted of these proteins. This is an interesting hypothesis, and exploring this in live cells using spectrin-GFP constructs will greatly strengthen the article. Will the patch-gap regions evolve into continuous MPS? If so, do these patches expand with time as new spectrin and actin are recruited and merge with neighboring patches, or can the entire patch "diffuse" and coalesce with neighboring patches, thus expanding the MPS region?

      We agree with the reviewer's interpretation. A virtue of our experimental model and our interpretations of the observations in fixed cells is that they give rise to informative questions such as the ones posed by the reviewer. As stated above, we will consider including live-imaging experiments using the expression of C-terminus-tagged human beta2-spectrin in the revised version of the manuscript. We are familiar with live-imaging and FRAP experiments and we think we can provide the evidence suggested.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Gazal et al. describe the presence of unique gaps and patches of BetaII-spectrin in medial sections of long human motor neuron axons. BII-spectrin, along with Alpha-spectrin, forms horizontal linkers between 180nm spaced F-actin rings in axons. These F-actin rings, along with the spectrin linkers, form membrane periodic structures (MPS) which are critical for the maintenance of the integrity, size, and function of axons. The primary goal of the authors was to address whether long motor axons, particularly those carrying familial mutations associated with the neurodegenerative disorder ALS, show defects in gaps and patches of BetaII-spectrin, ultimately leading to degradation of these neurons.

      We thank the reviewer for the detailed and accurate description of the data shown.

      Strengths:

      The experiments are well-designed, and the authors have used the right methods and cutting-edge techniques to address the questions in this manuscript. The use of human motor neurons and the use of motor neurons with different familial ALS mutations is a strength. The use of isogenic controls is a positive. The induction of gaps and patches by the kinase inhibitor staurosporine and their rescue by Latrunculin A is novel and well-executed. The use of biochemical assays to explore the role of calpains is appropriate and well-designed. The use of STED imaging to define the periodicity of MPS in the gaps and patches of spectrin is a strength.

      We thank the reviewer for the positive comments on the manuscript, the techniques used and the proposed model.

      Weaknesses:

      The primary weakness is the lack of rigorous evaluation to validate the proposed model of spectrin capture from the gaps into adjacent patches by the use of photobleaching and live imaging. Another point is the lack of investigation into how gaps and patches change in axons carrying the familial ALS mutations as they age, since 2 weeks is not a time point when neurodegeneration is expected to start.

      We will consider including live-imaging experiments using the expression of tagged human beta2-spectrin in the revised version of the manuscript. We are familiar with live-imaging and FRAP experiments and we believe we can provide the evidence suggested. We do not rule out the possibility that axons carrying familial ALS mutations will show defects in MPS formation and/or stability when observed at longer culture times, or under culture conditions that promote neuronal aging (Guix et al., 2021). Thus, we will continue to work with these cells, but the goal of that project lies well beyond the primary message of the present manuscript, and we anticipate that the revised version will not include new data on this matter.

      Reviewer #3 (Public review):

      Summary:

      Gazal et al present convincing evidence supporting a new model of MPS formation where a gap-and-patch MPS pattern coalesces laterally to give rise to a lattice covering the entire axon shaft.

      Strengths:

      (1) This is a very interesting study that supports a change in paradigm in the model of MPS lattice formation.

      (2) Knowledge on MPS organization is mainly derived from studies using rat hippocampal neurons. In the current manuscript, Gazal et al use human IPS-derived motor neurons, a highly relevant neuron type, to further the current knowledge on MPS biology.

      (3) The quality of the images provided, specifically of those involving super-resolution, is of a high standard. This adequately supports the conclusions of the authors.

      We thank the reviewer for the positive comments on the manuscript, the techniques used and the proposed model.

      Weaknesses:

      (1) The main concern raised by the manuscript is the assumption that staurosporine-induced gap and patch formation recapitulates the physiological assembly of gaps and patches of betaII-spectrin.

      We will further explore the inclusion of more measurements of other parameters and variables towards establishing whether these gaps-and-patches patterns are equivalent structures in control and staurosporine-treated cells. 

      (2) One technical challenge that limits a more compelling support of the new model of MPS formation is that fixed neurons are imaged, which precludes the observation of patch coalescence.

      As stated before regarding similar comments by other reviewers, we will consider the inclusion of live imaging experiments in the revised version of the manuscript.

      Nicolas Unsain, PhD, and Thomas Durcan, PhD.

      References

      Griswold, J.M., Bonilla-Quintana, M., Pepper, R. et al. Membrane mechanics dictate axonal pearls-on-a-string morphology and function. Nat Neurosci 28, 49–61 (2025). https://doi.org/10.1038/s41593-024-01813-1

      Guix, F.X., Marrero Capitán, A., Casadomé-Perales, A., Palomares-Pérez, I., López Del Castillo, I., Miguel, V., Goedeke, L., Martín, M.G., Lamas, S., Peinado, H., Fernández-Hernando, C., Dotti, C.G. Increased exosome secretion in neurons aging in vitro by NPC1-mediated endosomal cholesterol buildup. Life Sci Alliance 4(8), e202101055 (2021). https://doi.org/10.26508/lsa.202101055

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Manuscript number: RC-2025-03098

      Corresponding author: Pedro Escoll

      1. General Statements

      Our study investigates the interplay between the metabolism of host cells and the intracellular replication of Salmonella enterica serovar Typhimurium (ST). Type III Secretion Systems (T3SSs) are considered essential for ST to replicate within macrophages. However, we found that restricting macrophages to different bioenergetic contexts, such as supplementing them with glycerol, modulates bacterial replication and remarkably, enables a T3SS-deficient ST mutant (ΔprgHssaV) to replicate intracellularly. This T3SS-independent replication occurs within the Salmonella-containing vacuole (SCV) and is driven by the capacity of the host cell to provide these preferred nutrients, rather than by the host glycolytic activity itself.

      2. Description of the planned revisions

      Reviewer #1 (Evidence, reproducibility and clarity):

      Summary:

      In this manuscript, the authors investigate how host cell metabolic heterogeneity influences the intracellular replication of Salmonella enterica serovar Typhimurium. They use live-cell imaging of infected human primary macrophages to reveal that bacterial replication does not occur uniformly across infected cells. They demonstrate that supplementation with specific carbon sources-used by Salmonella during infection-promotes bacterial replication and increases the proportion of macrophages supporting intracellular growth. These effects are seen even in the absence of functional Type III Secretion Systems (T3SS), using a ΔprgHssaV double mutant. The authors further suggest that this replication enhancement is not strictly dependent on host glycolytic activity but rather on the host cell's ability to import nutrients. Their findings imply that intracellular Salmonella can exploit host cell metabolism to grow, even without its canonical virulence secretion systems, under nutrient-favorable conditions.

      Major Concern:

      While the topic is potentially interesting, the novelty is not fully clear. The concept that nutrient availability impacts intracellular Salmonella replication, largely via T3SS2 function, has been addressed previously (e.g., Liss et al., 2017). The finding that added exogenous carbon sources can enhance bacterial growth is thus not unexpected. The key claim-that Salmonella can replicate intracellularly even in the absence of T3SS function-would be significantly strengthened by demonstrating whether this is specific to Salmonella, or whether similar effects are seen with non-intracellular organisms such as E. coli K-12. If the phenomenon is unique to Salmonella, this would suggest a pathogen-specific mechanism beyond general metabolic support.

      As acknowledged by the Reviewer, the novelty and key claim of our work is that Salmonella can replicate intracellularly even in the absence of T3SS. To support that claim experimentally, we showed that providing macrophages with the preferred carbon sources used by Salmonella during infection, such as glycerol, bypasses the requirement for both T3SSs, allowing Salmonella to grow intravacuolarly inside macrophages.

      With respect to the article mentioned by the Reviewer (Liss et al. 2017, ref 36 in the manuscript), there are three important novel insights provided by our work: i) we show that Salmonella can replicate intracellularly in the SCV even in the absence of T3SS if certain carbon sources are provided; ii) we show the preference of Salmonella for certain carbon sources intracellularly such as glycerol and galactose (but not preferentially glucose); and iii) we have extended our observations to primary human macrophages in addition to RAW cells.

      We are not convinced that the experiment suggested by the Reviewer using E. coli K12 (ECK12) is necessary to support our findings for Salmonella, but we propose to add it. Briefly, we will infect hMDMs and RAW macrophages with ST-WT-GFP, ST-ΔprgHΔssaV or ECK12-WT-GFP, while culturing macrophages on different carbon sources (glucose, glycerol, galactose, fructose). We will then monitor intracellular bacterial growth. Comparing the growth of the ST double mutant with that of ECK12-WT-GFP under favorable carbon sources such as glycerol should establish whether this phenomenon is unique to Salmonella.

      Specific Comments:

      1. Figure 1H: The effect shown here is not compelling due to inconsistent y-axis scaling. Panels 1B, 1C, and 1D should use a unified axis range with 1H to allow direct visual comparison of growth dynamics.

      Thank you, we will change it as suggested.

      Figures 1B, 1C, 1G, 1H: The current presentation of individual growth traces makes it difficult to appreciate the population-level trend. A smoothed average line overlaid on these plots could better represent the average dynamics of replicative vs. non-replicative infections. Or alternatively the total fraction of cells that proliferate summarized as a segmented bar plot (possibly binned per time point).

      We will plot the results as suggested, the total fraction of infected cells harboring bacteria that proliferate as a segmented bar plot, binned per time point.
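
      One way the summary values behind such a segmented bar plot could be computed is sketched below. It is only an illustration: the per-cell trace layout, the time bins, and the fold-change threshold used to call a cell "replicative" are hypothetical and do not come from the manuscript.

```python
def fraction_replicative_per_bin(traces, bin_edges, fold_threshold=2.0):
    """Fraction of infected cells classed as replicative in each time bin.

    traces: dict mapping a cell id to a list of (time_h, signal) points,
            ordered by time (hypothetical data layout).
    bin_edges: ascending bin boundaries in hours; bins are [start, end).
    fold_threshold: assumed fold increase over the first time point that
            counts as replication (not a value from the manuscript).
    Returns a list of (start, end, fraction); fraction is None when no
    cell has a measurement in that bin.
    """
    results = []
    for start, end in zip(bin_edges[:-1], bin_edges[1:]):
        n_cells = 0
        n_replicative = 0
        for points in traces.values():
            baseline = points[0][1]
            in_bin = [signal for time_h, signal in points if start <= time_h < end]
            if not in_bin or baseline <= 0:
                continue  # cell not observed in this bin, or unusable baseline
            n_cells += 1
            if max(in_bin) / baseline >= fold_threshold:
                n_replicative += 1
        fraction = n_replicative / n_cells if n_cells else None
        results.append((start, end, fraction))
    return results


# Two made-up cells: "a" triples its bacterial signal by 8 h, "b" stays flat.
example_traces = {
    "a": [(0, 1.0), (4, 1.1), (8, 3.0)],
    "b": [(0, 1.0), (4, 0.9), (8, 1.0)],
}
summary = fraction_replicative_per_bin(example_traces, [0, 6, 12])
```

      Each bar segment would then be the replicative fraction and its complement (one minus that fraction) for the corresponding time bin.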

      Figure 2G: This panel would benefit from including a comparable condition with the SPI-1/SPI-2 double mutant to aid interpretation. Additionally, the authors should explore whether this nutrient-supported replication is seen in non-phagocytic cells such as HeLa or Caco-2, which would help delineate whether the observed phenomenon is macrophage-specific.

      The graph requested by the Reviewer is shown in Figure S1D. Because we plot ST growth only in macrophages that support Salmonella replication, some conditions, such as lactate, cannot be shown for infections with the double mutant: no cells support replication of the double mutant under those conditions, so there are no cells to plot.

      As suggested, we are also going to perform the same experiments in HeLa cells to investigate whether the observed phenomenon is macrophage specific.

      Line 117: The sentence stating that the double mutant can undergo "exponential intracellular growth even in the absence of T3SS-dependent secretion" is an overstatement. The data suggest only a modest improvement in growth, restricted to a minority of infected cells. This claim should be revised accordingly, as should similar overstatements in the discussion (e.g., lines 203-204).

      We will remove the term 'exponential' and revise the sentence at line 117 and those in the discussion. Line 203-204 will be: 'we demonstrated that providing macrophages with preferred nutrients allows a subpopulation of ST to replicate intracellularly without the need for a functional T3SS'.

      Line 162: The authors should clarify that glycerol had the strongest effect in primary macrophages, while multiple alternative carbon sources had notable effects primarily in RAW cells.

      We will add this clarification in the text.

      Lines 198-201: This relates to the major concern. The authors should assess whether the observed growth enhancement is unique to Salmonella by testing other bacteria not known for intracellular replication. This would clarify whether the effect is due to general nutrient-driven host cell permissivity or a pathogen-specific adaptation.

      As outlined above, we will perform the suggested experiment with E. coli K12 to answer whether this phenomenon is unique to Salmonella or not.

      RAW 264.7 Observations: The modest intracellular growth of SPI-1/SPI-2 double mutants in RAW cells is consistent with prior observations in the field. The idea that nutrient availability explains this is noteworthy. The authors might consider whether differences in standard culture media (e.g., glucose concentration) influence these outcomes. This could have broader implications for reproducibility in infection models.

      Thank you for the suggestion. We will include a paragraph discussing whether differences in standard culture media might influence bacterial replication. Indeed, to also address a question from Reviewer #2, we will include a new supplementary Figure in which we have already compared "no Glucose" (0 mM), "low Glucose" (2 mM) and standard culture media Glucose levels (10 mM). Our results show that differences in Glucose levels in the culture media influence Salmonella intracellular growth in hMDMs and RAW macrophages (see Figure below).

      Reviewer #1 (Significance):

      This manuscript highlights how host cell metabolism and nutrient availability can influence intracellular Salmonella replication. While the findings are intriguing, the current framing overstates their novelty and impact. Key revisions-such as comparative experiments with non-pathogenic bacteria and non-phagocytic cells, consistent figure scaling, and more measured language-would improve the clarity and significance of the work. If the authors can show Salmonella-specific mechanisms at play, the study could offer important insights into host-pathogen metabolic interactions.

      We believe that performing all experiments suggested by the Reviewers, as well as the requested changes in the text to avoid overstatements, will improve the manuscript and will offer readers new insights and details to better understand the metabolic interactions happening between host and pathogens and how they can shape bacterial virulence.

      Reviewer #2 (Evidence, reproducibility and clarity):

      Summary: In their study titled "Provision of Preferred Nutrients to Macrophages Enables Salmonella to Replicate Intracellularly Without Relying on Type III Secretion Systems", Dr. Garcia-Rodriguez et al. describe the influence of the host cell metabolism on the intracellular proliferation potential of Salmonella during infection. The authors investigate whether the supplementation of the media with different carbon sources has an impact on the intracellular lifestyle of Salmonella. By using single cell tracking in live-cell microscopy, including the use of different reporter strains, they describe that glycerol benefits Salmonella's ability to grow within its vacuolar niche, in part, interestingly, in a Type-3-Secretion System independent manner.

      They furthermore highlight the dependence on host background for this observation by showing that effects differ between cells of varying metabolic activity. Throughout their study, they use cutting-edge methodologies, as well as Salmonella strains that could be of versatile use in other investigations. This work, while limited to in vitro models for now, has implications for the better understanding of how pathogens and their host are intertwined. This, in turn, has significance for the development of new anti-infective strategies further down the line. I therefore believe that it should be disseminated to the research community. The following comments summarize ideas how the quality of the study could be improved:

      Major comments:

      1. Salmonella, especially when cultured to activate the SPI-1 T3SS, introduce rapid cell death in their host - most commonly through activation of the NLRC4 inflammasome and downstream pyroptotic signaling. The authors don't describe the effect of the infection in differently supplemented media on host cell death, yet it would be important to elucidate whether this cellular response is also altered.

      We have performed these experiments and tracked host cell death by measuring Annexin-V levels in single cells during infection under the different supplementation conditions. We will include these results in the revised version of the manuscript and main text. Please see the Figure below, which shows that the different carbon sources did not significantly affect macrophage cell death (future Figures S1E and S1F).

      The aspect of partially T3SS-independent growth enhancement by glycerol (and depending on the host background glucose) is most curious. The authors quantify this by determining the percentage of cells containing proliferating Salmonella and by tracking individual cells over the time course of the infection. I am missing a general statement on whether the initial infection rate (i.e. timepoint 0) is comparable across conditions and mutants, and whether possible discrepancies in the infection rate could have downstream effects on the statements and claims made in the manuscript. This is, to my mind, also important for the quantification of cytosolic and vacuolar bacteria. There, the authors always speak in "percent of infected cells", so it is relevant whether the number of infected cells varies among conditions (see e.g. Figure 3).

      We thank the reviewer for this comment. The initial infection rate at t=0 significantly differs between WT and mutants in RAW 264.7 macrophages, and carbon source supplementation has no effect. However, as we only analyze infected cells, this does not affect the final results. In any case, we are going to add the graphs of % of infected cells at t=0 as supplementary Figures S1G-K.

      The authors use a concentration of 10mM for all supplemented alternative carbon sources. It would be useful to discuss the rationale behind this approach, including whether all chemicals have the same ability to be taken up by the cell. A concentration series (at least for some of the tested compounds) may be beneficial to bolster the conclusions that the authors make.

      We use 10 mM because this is the concentration of Glucose in standard culture media. Using 10 mM for all the different carbon sources lets us compare them at a constant concentration. To also address Reviewer #1's comment, we will include in the manuscript a paragraph discussing whether differences in standard culture media might influence bacterial replication. As this Reviewer suggested, we will include a new supplementary Figure comparing no Glucose (0 mM), low Glucose (2 mM) and standard culture media Glucose levels (10 mM), showing that the concentration of glucose has a gradual effect in supporting the replication of the T3SS-deficient strain in RAW macrophages (see Figure below).

      I think it would strengthen the study, if the authors used host cell mutants in certain metabolite transporters, or alternatively Salmonella mutants that are deficient in uptake or metabolism of some of the compounds used in this study. This point is alluded to in the discussion, and I believe if the authors could show that in certain host mutant backgrounds the impact of supplementation with alternative carbon sources can be reversed, it would immensely bolster the strength of the claims.

      Following the Reviewer's suggestion, we generated ST metabolic mutants unable to metabolize glycerol, galactose or fructose. As seen in the Figures below, during infection, supplementation with glycerol or galactose does not boost replication of the metabolic mutants as it does in WT, demonstrating that the supplemented carbon sources do reach the bacteria within the SCV and are used by intracellular Salmonella to grow. These Figures will become Figure 4J-N in the revised manuscript.

      I think it would be useful to include the meaning of this work for other intracellular pathogens in the discussion section: Do the authors believe that this phenotype is Salmonella-specific? If the pathogens are at hand, it might be interesting to infect with other intracellular bacteria, such as Shigella or Francisella to investigate if the boosting of growth by glycerol also holds true for these.

      We have performed experiments with Legionella pneumophila and galactose (see figure below), showing that the effect of this carbon source is specific to Salmonella (as shown in Figure 4F in the manuscript). We could also perform experiments with L. pneumophila and glycerol to answer the Reviewer's question. However, we think that the results with Legionella might fall outside the focus of this article and would constitute a new article in themselves, as the two pathogens have very different, non-comparable intracellular metabolisms. Thus, the experiment suggested by Reviewer #1, using E. coli K12 (ECK12) while culturing macrophages on different carbon sources (glucose, glycerol, galactose, fructose), is in our opinion a better fit. We will monitor intracellular bacterial growth and compare the ST-ΔprgHssaV double mutant with ECK12-WT-GFP under favorable carbon sources such as glycerol; the results will show whether this phenomenon is unique to Salmonella or not.

      Minor comments:

      • Line 41: The authors write "are required for", but given their findings, it might be more accurate to phrase this as "have previously been described to be required for" or "have previously been described essential for".

      We will change it.

      • Line 86: Is the referencing of Figure S1C correct or should it be S1A?

      Yes, thank you, it is S1A, we will change it.

      • Lines 119,120: Related to what is displayed in Figure 2G: Are these differences significant?

      Glucose, galactose and lactate curves are significantly different compared to control (p

      • Lines 126,127: What is the change for glycerol, and is the intracellular growth significantly higher compared to the control?

      6.2 ± 1.9% in glycerol vs. 2 ± 1% in control, p

      • Figure 1E&F: Related to one of the major comments: Would it be possible to quantify this at timepoint 0 to ensure that the initial infection rates are the same across conditions?

      As outlined above, we will add the graphs of % of infected cells at t=0 as supplementary Figures S1G-K (Major Comment number 2 from this Reviewer)

      • Figure 3E,F: Why does the sum of the curves not add up to 100% (especially in the beginning)? And related to that, why do both the percentage of cytosolic and vacuolar cells grow over time? Since this infection is performed with gentamycin present, re-infection should not be possible.

      The localization module of the SINA plasmid relies on transcriptional reporters, whose expression requires time for induction and detection. Therefore, at early time points, infected cells are not classified as vacuolar or cytoplasmic because the reporters have not yet been expressed (as described in PLoS Pathog. 2021;17(4):e1009550, PMID: 33930101).

      At later time points, a subset of cells harbors bacteria that do not express any of the reporters. These bacteria are considered dormant, representing about 10% of the population, as detailed in the same article. In addition, a small percentage of infected cells simultaneously contain both STvac and STcyt. Such cells are subclassified as harboring STcyt but also STvac. Consequently, the total proportion of infected cells carrying STvac and STcyt may also exceed 100%.

      • Figure S1A: While significance testing is described in the legend, there are no indications of significance in the figure panels.

      The Reviewer is right: there are no significant changes between conditions. We will change the significance annotation to ns (non-significant).

      • Figure S1B: Due to the stark discrepancies between hMDMs and RAW264.7, it might make sense to plot them on two different y-axes. Furthermore, I would clarify the y-axis: In the legend, it seems as CFU counts are shown, while CFU/ml/t2 rather describes a change over time.

      We agree. However, we will maintain the scale of the Y-axis, as Reviewer #1 asked for the Y-axes to be consistent. We will change the legend to indicate that we plot CFU/ml/t2.

      • Figure S1C: The prgH-mutant seems to outperform the wildtype in intracellular proliferation, while the double mutant underperforms compared to the ssaV-mutant. Could you please discuss/explain how the prgH-deletion has seemingly opposite effects on intracellular proliferation, depending on whether it is introduced in a wildtype or ssaV-KO background?

      As T3SS-1 plays a role in inducing macrophage cell death via activation of the NLRC4 inflammasome, macrophages infected with bacteria carrying a functional T3SS-1 (such as WT) are more prone to undergo cell death at late time points, which disrupts bacterial proliferation and reduces the proportion of infected cells. These dead cells were not considered in the analysis. Even if cell death of ST-WT-infected RAW macrophages remains below 5%, more ΔprgH-infected cells are considered in the analyses at late time points, and ST-ΔprgH continues replicating (and increasing in ST area).

      • Figure S2A: As for the comments related to Figure 3, I am unsure how the sum of STvac and STcyt can deviate from 100. This is especially puzzling for the red curve (glycerol) at e.g. 3hpi, when the sum of the two clearly seems to be larger than 100.

      At early time points, no infected cells are classified as vacuolar or cytoplasmic because the reporters have not yet been expressed. At later time points, a subset of cells harbor bacteria that do not express any of the reporters, which are considered dormant (10% of the population). Finally, a small percentage of infected cells simultaneously contain both STvac and STcyt, therefore the total proportion of infected cells carrying STvac and STcyt may also exceed 100%.

      **Cross-commenting** I agree in principle with the comments raised by Reviewer #1 - especially when it comes to the enhancement in significance if the authors assess the species specificity. Elucidating whether the growth enhancement is Salmonella-specific, occurs for other intracellular pathogens (e.g. Shigella, Francisella) or also for extracellular bacteria (e.g. E. coli, Yersinia), would definitely strengthen the study.

      As said before, for the revision we are going to perform the experiments suggested by Reviewer #1 of using E. coli K12 (ECK12) while culturing macrophages on different carbon sources (glucose, glycerol, galactose, fructose). And to satisfy this Reviewer's curiosity, we are going to perform experiments also with L. pneumophila and glycerol.

      Reviewer #2 (Significance):

      General assessment:

      As the authors write in their discussion, the strength of this study is also its limitation: Using single cell tracking in microscopy is a very elegant and powerful approach, yet conversely, it limits the scope of the study to in vitro approaches. While it enables assessment of bacterial pathogenicity and host-dependence on a single-cell level, it remains to be investigated whether the conclusions that the authors draw from their work will hold in more complex or physiologically relevant models.

      During the preparation of this Revision Plan, we discovered the article published in PLoS Pathogens by Andrew Grant and Pietro Mastroeni "Attenuated Salmonella Typhimurium Lacking the Pathogenicity Island-2 Type 3 Secretion System Grow to High Bacterial Numbers inside Phagocytes in Mice" (PLoS Pathog 2012 8(12): e1003070, PMID: 23236281). In this article, the authors showed that our main conclusion is also relevant in vivo (Salmonella Typhimurium can replicate within macrophages in the absence of T3SS). This will be addressed in the Discussion of the revised manuscript. Our study provides a metabolic explanation, at the single-cell level, for those observations.

      A further small shortcoming of the study is the heavy focus on the bacterial aspect in this host-pathogen interaction. While the authors do link the proliferative potential of the intracellular bacteria to the metabolic status of the individual host cell, more could be done with respect to host responses in the varying media compositions, including investigating alterations to the cell cycle, induction of cell death, or the ability to activate inflammatory signaling.

      We agree, and we are actively investigating how restricting macrophages to specific carbon sources impacts other host responses, such as cytokine production. For the revised manuscript, we will add the results on the induction of cell death.

      Nonetheless, this study is of large interest to the field and the systematic approach to addressing their hypotheses speaks to the scientific excellence of the investigators.

      Thank you.

      3. Description of the revisions that have already been incorporated in the transferred manuscript

      N/A

      4. Description of analyses that authors prefer not to carry out

      N/A

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The authors describe a massively parallel reporter assay (MPRA) screen focused on identifying polymorphisms in 5' and 3' UTRs that affect translation efficiency and thus might have a functional impact on cells. The topic is of timely interest, and indeed, several related efforts have recently been published and preprinted (e.g., https://pubmed.ncbi.nlm.nih.gov/37516102/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635273/). This study has several major issues with the results and their presentation.

      Major comments:

      • The main issue remains that it appears that the screen has largely failed, and the reasons for that remain unclear, which makes it difficult to judge how useful the resulting data are. The authors mention batch effects as a potential contributor. The authors start with a library that includes ~6,000 variants, which makes it a medium-size MPRA. But then, only 483 pairs of WT/mutated UTRs yield high-confidence information, which is already a small number for any downstream statistical analysis, particularly since most don't actually affect translation in the reporter screen setting (which is not unexpected). It is unclear why >90% of the library did not give high-confidence information. The profiles presented as base-case examples in Fig. 2B don't look very informative or convincing. All the subsequent analysis is done on a very small set of UTRs that have an effect, and it is unclear to this reviewer how these can yield statistically significant and/or biologically relevant associations.

      • From the variants that had an effect, the authors go on to carry out some protein-level validations, and see some changes, but it is not clear if those changes are in the same direction as observed in the screen. In their rebuttal the authors explain that they largely cannot infer directionality of changes from the screen, which further limits its utility.

      • It is particularly puzzling how the authors can build a machine learning predictor with >3,000 features when the dataset they use for training the model has just a few dozens of translation-shifting variants.

      We recognize that RNA distribution within polysomes is inherently less stable than the associated protein components. This instability has been noted in previous studies, including those cited by the reviewer, which used RNA from bulk polysomes to infer the translatome without fractionation. Acknowledging this limitation, we purposely adopted a conservative strategy: (i) performing gross fractionation of polysomes, and (ii) collaborating with biostatisticians at the Institute of Statistical Science, Academia Sinica, to design a conservative yet optimized analysis pipeline that minimized batch effects.

      This approach proved robust: representative cases in Fig. 2B clearly demonstrate distinct distributions of reference and alternative alleles. From our high-confidence dataset, we applied a well-established statistical framework specifically designed to accommodate multiple influencing factors in relatively small datasets (Elements of Statistical Learning by Hastie, Tibshirani, and Friedman). We further conducted sensitivity analyses to select an optimal QC cutoff across a range of stringencies, ensuring maximal reliability of our results. We have therefore successfully shortlisted UTR variants that have a strong effect on translation.

      Building upon these conservative measures, we developed a predictive model for translation effects of UTR variants. Importantly, this model was validated not only with our internal test dataset but also with independent external datasets. In addition, the sequence features identified by the model were validated through reporter assays and in vivo CRISPR editing. These external and functional validations establish the generalizability and robustness of our approach.

      A more detailed analysis of the directionality of changes in translation efficiency is under active investigation. These results will be reported in a separate manuscript currently in preparation.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The authors describe a massively parallel reporter assay (MPRA) screen focused on identifying polymorphisms in 5' and 3' UTRs that affect translation efficiency and thus might have a functional impact on cells. The topic is of timely interest, and indeed, several related efforts have recently been published and preprinted (e.g., https://pubmed.ncbi.nlm.nih.gov/37516102/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635273/). This study has several major issues with the results and their presentation.

      Major comments:

      (1) The main issue is that it appears that the screen has largely failed, yet the reasons for that are unclear, which makes it difficult to interpret. The authors start with a library that includes approximately 6,000 variants, which makes it a medium-sized MPRA. But then, only 483 pairs of WT/mutated UTRs yield high-confidence information, which is already a small number for any downstream statistical analysis, particularly since most don't actually affect translation in the reporter screen setting (which is not unexpected). It is unclear why >90% of the library did not give high-confidence information. The profiles presented as base-case examples in Figure 2B don't look very informative or convincing. All the subsequent analysis is done on a very small set of UTRs that have an effect, and it is unclear to this reviewer how these can yield statistically significant and/or biologically relevant associations.

      To make sure our final results are technically and statistically sound, we applied stringent selection criteria and cutoffs in our analytics workflow. First, from our RNA-seq dataset, we filtered the UTRs with at least 20 reads in a polysome profile across all three repeated experiments. Secondly, in the following main analysis using a negative binomial generalized linear model (GLM), we further excluded the UTRs that displayed batch effect, i.e. their batch-related main effect and interaction are significant. We believe our measure has safeguarded the filtered observations (UTRs) from the (potential) high variation of our massively parallel translation assays and thus gives high confidence to our results.
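
      The first filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the UTR names and read counts are invented, and it assumes that "at least 20 reads in a polysome profile" means the summed mono, light and heavy counts within each of the three replicates.

```python
# Hypothetical read counts: UTR -> one (mono, light, heavy) tuple per replicate.
MIN_READS = 20  # threshold stated in the response


def passes_read_filter(replicates, min_reads=MIN_READS):
    """True if the summed polysome profile reaches min_reads in every replicate.

    Assumption: the threshold applies to the sum over fractions per replicate.
    """
    return all(sum(profile) >= min_reads for profile in replicates)


counts = {
    "UTR_A": [(30, 12, 9), (25, 10, 8), (40, 15, 11)],  # passes in all replicates
    "UTR_B": [(10, 5, 3), (22, 9, 7), (31, 14, 6)],     # replicate 1 sums to 18
}

kept = [utr for utr, reps in counts.items() if passes_read_filter(reps)]
print(kept)
```

      Requiring the threshold in every replicate, rather than on the pooled total, is the stricter reading and is what drops a UTR that is well covered in only two of three experiments.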

      Regarding the interpretation of Figure 2B, since we aimed to identify the UTRs whose interaction term of genotype and fractions is significant in our generalized linear model, it is statistically conventional to double-check the interaction of the two variables using such a graph. For instance, in the top left panel of Figure 2B (5'UTR of ANK2:c.-39G>T), we can see that read counts of WT samples consistently decreased from Mono to Light, whereas the read counts of mutant samples were roughly the same in the two fractions; the trend differs between WT and mutant. Ergo, the distinct distribution patterns of two genotypes across three fractions in Figure 2B offer the readers a convincing visual supplement to our statistics from GLM.
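
      What a genotype-by-fraction interaction captures can be illustrated with a simple contrast of log fold changes. The sketch below is not the negative binomial GLM used in the study; the function name, the pseudocount and the counts are all hypothetical, chosen to mimic a WT profile that drops from Mono to Light while the mutant stays flat.

```python
import math


def interaction_contrast(wt, mut, pseudocount=1.0):
    """Mutant minus WT log2 fold change between two fractions.

    wt and mut are (count_in_fraction_1, count_in_fraction_2) pairs.
    A value near 0 means both genotypes follow the same trend.
    """
    def log2_fc(pair):
        first, second = pair
        return math.log2((second + pseudocount) / (first + pseudocount))

    return log2_fc(mut) - log2_fc(wt)


# Hypothetical (mono, light) counts: WT decreases, mutant is unchanged.
wt_counts = (400, 100)
mut_counts = (200, 200)
print(interaction_contrast(wt_counts, mut_counts))
```

      Because the contrast compares trends rather than levels, a mutant that is simply expressed at twice the WT level in every fraction gives a contrast of zero.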

      In contrast to Figure 2B, the graphs of nonsignificant UTRs (shown below) reveal that the trends between the two genotypes are similar across the 'Mono and Light' and 'Light and Heavy' polysome fractions. Importantly, our analysis remains unaffected by differential expression levels between WT and mutant, as it specifically distinguishes polysome profiles with different distributions. This consistent trend further supports the lack of interaction between genotype and polysome fractions for these UTRs.

      Author response image 1.

      Examples of non-significant UTR pairs in massively parallel polysome profiling assays.

      (2) From the variants that had an effect, the authors go on to carry out some protein-level validations and see some changes, but it is not clear if those changes are in the same direction as observed in the screen.

      To infer the directionality of translation efficiency from polysome profiles, a common approach involves pooling polysome fractions and comparing them with free or monosome fractions to identify 'translating' fractions. However, this method has two major potential pitfalls: (i) it sacrifices resolution and does not account for potential bias toward light or heavy polysomes, and (ii) it fails to account for discrepancies between polysome load and actual protein output (as discussed in https://doi.org/10.1016/j.celrep.2024.114098 and https://doi.org/10.1038/s41598-019-47424-w). Therefore, our analysis focused on the changes within polysome profiles themselves. 'Significant' candidates were identified based on a significant interaction between genotype and polysome distribution using a negative binomial generalized linear model, without presupposing the direction of change on protein output. 

      (3) The authors follow up on specific motifs and specific RBPs predicted to bind them, but it is unclear how many of the hits in the screen actually have these motifs, or how significant motifs can arise from such a small sample size.

      We calculated the Δmotif enrichment in significant UTRs versus nonsignificant UTRs using Fisher’s exact test. For example, the enrichment of the Δ‘AGGG’ motif in 3’ UTRs is shown below:

      Author response table 1.

      This test yields a P-value of 0.004167 by Fisher’s exact test. The P-values and Odds ratios of Δmotifs in relation to polysome shifting are included in Supplementary Table S4, and we will update the detailed motif information in the revised Supplementary Table S4.
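
      For readers who want to see how such an enrichment is computed, here is a small standard-library sketch of a two-sided Fisher's exact test on a 2x2 table. The counts in the example are invented for illustration and are not those of Author response table 1, so the output is not expected to reproduce the P-value quoted above.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the hypergeometric
    probabilities of all tables with the same margins that are no more likely
    than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    p_value = sum(
        p for p in map(prob, range(lowest, highest + 1)) if p <= p_obs * (1 + 1e-9)
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Hypothetical counts. Rows: significant / nonsignificant UTRs;
# columns: motif changed by the variant / motif unchanged.
odds, p = fisher_exact_two_sided(8, 20, 25, 300)
print(f"odds ratio = {odds:.2f}, p = {p:.4g}")
```

      An odds ratio above 1 indicates enrichment of the motif change among polysome-shifting UTRs, and a value below 1 indicates depletion.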

      (4) It is particularly puzzling how the authors can build a machine learning predictor with >3,000 features when the dataset they use for training the model has just a few dozens of translation-shifting variants.

      We understand the concern regarding the relatively small number of translation-shifting variants compared to the large number of features. To address this, we employed LASSO regression, which, according to The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, is particularly suitable for datasets where the number of features p is much larger than the number of samples N. LASSO effectively performs feature selection by shrinking less important coefficients to zero, allowing us to build a robust and generalizable model despite the limited number of variants.
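
      The shrinkage behaviour described here can be seen in a toy example. The following is a minimal sketch of LASSO fitted by cyclic coordinate descent, written for illustration only: it is not the authors' model, and the design matrix, response and penalty value are hypothetical. It uses 4 samples and 6 features, so the number of features exceeds the number of samples.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2N) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Hypothetical data: only feature 0 drives the response (y = 2 * feature 0).
X = [
    [1, 1, 1, 1, 1, 0],
    [-1, 1, -1, 1, 0, 0],
    [1, -1, -1, 1, 0, 1],
    [-1, -1, 1, 1, 0, 1],
]
y = [2, -2, 2, -2]
beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)
```

      In this toy case the fit keeps a single nonzero coefficient, shrunk below its true value of 2 by the penalty, and sets the other five exactly to zero. Whether a sparse fit generalises still depends on validation with held-out data.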

      (5) The lack of meaningful validation experiments altering the SNPs in the endogenous loci by genome editing limits the impact of the results.

      Following the reviewer’s suggestion, we assessed the endogenous mutant effect by generating CRISPR knock-in clones carrying the IRF6:c.-4609G>A variant. We showed that this G>A variant generates a deleterious upstream open reading frame, which dramatically reduced protein expression of the main open reading frame (Fig. 7B-D). The genome editing further demonstrated that the G>A variant reduced endogenous IRF6 protein expression to 23% or 44% in two independent clones. We have incorporated the genome editing results in the revised main text and the new Figure 7E&F:

      “To further validate the endogenous effect of the novel upstream ATG (uATG), we generated CRISPR knock-in clones carrying the IRF6:c.-4609G>A variant and examined its impact on gene expression. The introduction of the uATG reduced RNA levels to 88% and 37% of the wild-type in two independent clones (Fig. 7E), and protein levels to 44% and 23%, respectively (Fig. 7F), resulting in an overall reduction of translation efficiency to 50–62%.” (p.18)
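
      The 50–62% range follows from the quoted RNA and protein values if translation efficiency is taken as protein output per unit of RNA. That definition is an assumption here, since the response does not spell it out, but it reproduces the reported numbers:

```python
# Values quoted for the two knock-in clones, as percent of wild-type.
clones = {
    "clone 1": {"rna": 88, "protein": 44},
    "clone 2": {"rna": 37, "protein": 23},
}


def translation_efficiency(rna_pct, protein_pct):
    """Protein per unit RNA, as percent of wild-type (assumed definition)."""
    return 100.0 * protein_pct / rna_pct


for name, values in clones.items():
    te = translation_efficiency(values["rna"], values["protein"])
    print(f"{name}: {te:.0f}% of wild-type")
```

      Normalising protein by RNA separates the translational effect of the uATG from the drop in transcript level, which is substantial in the second clone.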

      Reviewer #2 (Public Review):

      Summary:

      In their paper "Massively Parallel Polyribosome Profiling Reveals Translation Defects of Human Disease-Relevant UTR Mutations" the authors use massively parallel polysome profiling to determine the effects of 5' and 3' UTR SNPs (from dbSNP/ClinVar) on translational output. They show that some UTR SNPs cause a change in the polysome profile with respect to the wild-type and that pathogenic SNPs are enriched in the polysome-shifting group. They validate that some changes in polysome profiles are predictive of differences in translational output using transiently expressed luciferase reporters. Additionally, they identify sequence motifs enriched in the polysome-shifting group. They show that 2 enriched 5' UTR motifs increase the translation of a luciferase reporter in a protein-dependent manner, highlighting the use of their method to identify translational control elements.

      Strengths:

      This is a useful method and approach, as UTR variants have been more difficult to study than coding variants. Additionally, their evidence that pathogenic mutations are more likely to cause changes in polysome association is well supported.

      Weaknesses:

      The authors acknowledge that they "did not intend to immediately translate the altered polysome profile into an increase or decrease in translation efficiency, as the direction of the shift was not readily evident. Additionally, sedimentation in the sucrose gradient may have been partially affected by heavy particles other than ribosomes." However, shifted polysome distribution is used as a category for many downstream analyses. Without further clarity or subdivision, it is very difficult to interpret the results (for example in Figure 5A, is it surprising that the polysome shifting mutants decrease structure? Are the polysome "shifts" towards the untranslated or heavy fractions?)

      Our approach, combining polysome fractionation of the UTR library with negative binomial generalized linear model (GLM) analysis of RNA-seq data, systematically identifies variants that affect translational efficiency. The GLM model is specifically designed to detect UTR pairs with significant interactions between genotype and polysome fractions, relying solely on changes in polysome profiles to identify variants that disrupt translation. Consequently, our analytical method does not determine the direction of translation alteration.

      Following the massively parallel polysome profiling, we sought to understand how these polysome-shifting variants influence the translation process. To do this, we examined their effects on RNA characteristics related to translation, such as RBP binding and RNA structure. In Figure 5A, we observed a notable trend in significant hits within 5’ UTRs—they tend to increase ΔG (weaker folding energy) in response to changes in polysome profiles, regardless of whether protein production increases or decreases (Fig. 3).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Minor comments:

      (1) Figure 3A - the claim that 5'UTR variants had a stronger effect than 3'UTR is based on the two UTRs with the strongest effect. It is unclear how these differences between 5' and 3'UTRs are significant.

      We carried out a Wilcoxon rank-sum test to compare the mut/WT fold change of translation efficiency between the 3’ and 5’ UTR variants. The results showed that the 5’ UTR variants exhibited a greater change of translation efficiency. We have inserted this result in the revised Figure 3C and refer to this figure in the main text: “Furthermore, we observed that 5’ UTR variants had a greater impact on translation activity relative to 3’ UTR variants (Fig. 3C).” (p. 12)
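
      As a rough illustration of this kind of comparison, the sketch below implements a two-sided Wilcoxon rank-sum test with a normal approximation. It is a simplified version (midranks for ties, no tie or continuity correction), and the fold-change magnitudes are invented rather than taken from Figure 3.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using a normal approximation.

    Ties receive midranks; no tie or continuity correction is applied.
    Returns (U statistic for x, p-value).
    """
    values = sorted(list(x) + list(y))
    rank_of = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    u1 = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return u1, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical |log2(mut/WT)| changes in translation efficiency.
five_prime = [1.8, 2.4, 1.1, 3.0, 2.2, 1.6]
three_prime = [0.3, 0.7, 0.2, 0.9, 0.5, 0.4]
u_stat, p_value = rank_sum_test(five_prime, three_prime)
print(u_stat, p_value)
```

      With samples this small an exact test would normally be preferred over the normal approximation; the approximation is used here only to keep the sketch short.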

      (2) Figures 2B and S1, S2 - what is the meaning of less signal for a light chain and a similar signal for a heavy chain? How can this situation, while being a significant difference between the profiles, lead to a biologically relevant difference in eventual protein output?

      Taking 3’UTR ACADSB:c.*4177G>A (bottom-left panel in Figure 2B) as an example: in the light polysome-containing fraction, WT transcripts have lower read counts (in units of log(CPM)) than transcripts carrying the mutant UTR, whereas the read counts of the two genotypes are approximately the same in the heavy polysome-containing fraction.

      In line with our reply to Reviewer 1’s major comment 1, we aimed to identify the UTRs whose interaction term of genotype and fractions is significant in our generalized linear model (GLM). That is, the UTR pairs whose WT and mutant have different trends across the fractions (Mono to Light & Light to Heavy) are our targets. In Figure 2B, 3’UTR ACADSB:c.*4177G>A is a perfect example of our significant hits, as it displays the clear distinction of the trends of the two genotypes across three fractions.

      It is well established that a change in polysome distribution indicates a change in translational efficiency. Our GLM helped us identify the UTR pairs whose WT and mutant have different polysome profiling patterns and thus likely have distinct translational efficiency. Nevertheless, since we only had a limited number of polysome fractions in our experiments, we further validated our significant hits and confirmed the direction of regulation using luciferase reporter assays.

      (3) The paragraph starting with "Even with the high confidence dataset, we did not intend to immediately translate the altered polysome profile into an increase or decrease in translation efficiency" is confusing. The whole premise of the screen used by the authors is that polysome profiling is a useful proxy for estimating levels of translation, so claiming that it doesn't necessarily measure translation is counterintuitive.

      In line with our reply to the last question, our goal is to use the alteration of polysome profiling patterns as a proxy for the change of translational efficiency. However, due to the limited number of fractions in our experiment, we could not directly infer the direction of regulation, i.e. an increase or decrease of translational efficiency, for the statistically significant variants. That is why we refrained from drawing any conclusion about the direction of regulation for the significant hits and proceeded to validate them using luciferase reporter assays.

      (4) Figure S5A - this is normalized to the nucleotide distribution in 5' or 3'UTRs? Is this statistic being applied to 27 SNPs in 3'UTRs?

      To identify sequence features associated with altered polysome association, we systematically analyzed both significant and nonsignificant UTRs for nucleotide and motif-level changes. Fisher’s exact test was employed to evaluate whether specific nucleotide or motif alterations were enriched or depleted in polysome-shifting UTRs, compared to nonsignificant UTR pairs. For example, in the case of nucleotide C (see table below; also Table S4 and new Fig. S6A), only four significant 3’ UTRs involved a change in C, resulting in a significant depletion of this nucleotide change among polysome-shifting 3’ UTRs (odds ratio = 0.22, p = 0.0069). Expanding this approach to all 1-7 nt motifs, we identified multiple motif and nucleotide changes that were significantly associated with altered polysome association.

      Author response table 2.

      (5) "uATG in the 5' UTR was not identified by the model as a widespread feature explaining polysome shifting". Is this because of the method of ribosome profiling or because of the sequences in the library? Can having more sequences in the library specifically looking at 5'UTR give more power for such an effect to emerge?

      Our assay design accounted for the presence of upstream ATG codons and the strength of adjacent Kozak sequences. However, additional factors known to influence the function of upstream open reading frames (uORFs), such as the reading frame of the uORF relative to the main coding sequence and the use of non-ATG initiation codons, were not systematically included. As a result, the current assay may have limited sensitivity in detecting uORF-related regulatory effects. A dedicated design specifically tailored to uORF variants is likely to enhance the detection power and better capture their contribution to translational control.

      (6) Figure 7B - it is not clear whether the luciferase reporter and the GFP reporter in the library function in a similar manner; is it creating an out-of-frame or an in-frame uORF? Also, it is not clear if the differences are statistically significant.

      In the MPRA library, the IRF6 uORF is out of frame relative to the GFP coding sequence. To directly assess its translational impact, we employed a luciferase reporter assay by fusing luciferase downstream of the IRF6 uORF. These constructs revealed a significant reduction in protein production, as shown in Figures 3 and 7B–F. Although the clinically relevant IRF6 uORF is out-of-frame with the main ORF, we engineered an in-frame uORF variant to validate translation initiation at the upstream ATG (uATG) (Fig. 7B-D). The in-frame construct confirmed uATG usage and led to a significant reduction in luciferase protein expression. Together, these results support the conclusion that the IRF6:c.-4609G>A variant gives rise to an active uORF that suppresses translation of the main ORF.

      Reviewer #2 (Recommendations For The Authors):

      (1) It would be helpful for the authors to subcategorize their data in ways that they consider meaningful and interpretable (e.g. shifts from all monosome to heavy, all heavy to monosome/free, etc.) Relatedly, what do the authors think the functional meaning is when a given transcript has high mono/heavy occupancy but low light occupancy (like what is shown in Figure 2B for ANK2) in the polysome profiling experiment? It is not apparent why a transcript with a high ribosome occupancy (heavy) would also have light occupancy (light).

      From the amplicon sequencing data, we obtained read counts for each UTR variant across the monosome, light, and heavy polysome fractions. Notably, this approach does not preserve the original relative abundance of transcripts among the three fractions. That is, despite a greater abundance of mRNAs in the heavy polysome fraction, comparable numbers of sequencing reads were recovered from the monosome and light fractions. As a result, this method is not suitable for interpreting the global directionality of translational shifts but is well-suited for detecting relative differences in polysome association. Therefore, our experimental and analytical design—combining targeted amplicon sequencing with generalized linear modeling (GLM)—was optimized to identify UTR variants that alter polysome association, independently of absolute transcript abundance in each fraction.

      (2) The method put forward in Figure 2 would be more convincing if there was data showing reproducibility in the massively parallel reporter assay. Perhaps the mut/WT ratio for all transcripts can be plotted against each other and a statistical test of correlation can be performed.

      Thank you for pointing this out. To demonstrate the reproducibility of our massively parallel reporter assay, we have plotted scatter plots of the ratios of all transcripts (summing the monosome, light, and heavy fractions) across different batches using our high-confidence dataset. We calculated the Pearson correlation coefficients and corresponding p-values for these comparisons. The results show strong correlation between each batch, supporting the reproducibility of our assay. We have incorporated this analysis in the main text as well as Supplemental Figure 3: “Pearson correlation analysis revealed R coefficients ranging from 0.59 to 0.71 for the mut-to-WT transcript ratios across three independent experiments (Supplemental Fig. 3).”
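
      The batch-to-batch comparison described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the read counts are hypothetical, and the log2 scale and the pseudocount are assumptions added here to keep the ratios well behaved.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def log2_mut_wt_ratio(mut_counts, wt_counts, pseudocount=1.0):
    """log2(mut/WT) after summing reads over monosome, light and heavy fractions.

    The pseudocount is an assumption of this sketch, not taken from the source.
    """
    return math.log2((sum(mut_counts) + pseudocount) / (sum(wt_counts) + pseudocount))

# Hypothetical counts per variant: ((mut mono, light, heavy), (WT mono, light, heavy)).
batch_1 = [((120, 80, 60), (100, 90, 70)),
           ((30, 40, 90), (60, 50, 40)),
           ((200, 150, 100), (90, 80, 70)),
           ((15, 20, 25), (40, 45, 50))]
batch_2 = [((110, 95, 50), (105, 85, 75)),
           ((35, 30, 80), (55, 60, 35)),
           ((180, 160, 120), (100, 70, 65)),
           ((20, 15, 30), (35, 50, 45))]

ratios_1 = [log2_mut_wt_ratio(mut, wt) for mut, wt in batch_1]
ratios_2 = [log2_mut_wt_ratio(mut, wt) for mut, wt in batch_2]
r_between_batches = pearson_r(ratios_1, ratios_2)
print(f"Pearson r between batches: {r_between_batches:.2f}")
```

      With three batches, the same function would be applied to each of the three pairwise comparisons.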

      (3) The dots in Figure 2B indicate separate experiments, but the y-axis is log(counts). Values could be normalized (perhaps a ratio of mut/WT) for comparison between experiments.

      We aimed to compare UTR distribution across polysome fractions and recognized the importance of presenting the distribution patterns for both genotypes. This approach allows us to more clearly illustrate the differences or similarities in polysome association between the two genotypes.

      (4) When describing the 5' UTRs used for the validation experiments in Figure 3, more information about the 5' UTR sequence used is necessary. It is not clear how much or what part of the 5' UTRs were removed, or why this was necessary considering the same experiment was conducted using full-length UTRs.

      In the initial library design, technical limitations of bulk oligonucleotide synthesis constrained the UTRs to 155 nucleotides, comprising 115-nt of endogenous human UTR sequence flanked by 20-nt priming sites on both ends. Variants were centered at the 58th nucleotide within the 115-nt UTR sequence. When one flanking region of the native UTR was shorter than 57 nt, the variant was shifted accordingly toward the shorter arm to maintain the 115-nt UTR length (Fig. 2A).
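
      The windowing rule described above can be sketched as a small function. The function name and the 0-based coordinates are illustrative assumptions, and the source does not say how UTRs shorter than 115 nt were handled, so this sketch simply rejects them.

```python
FLANK = 57                # nucleotides on each side of a centered variant
WINDOW = 2 * FLANK + 1    # 115-nt native UTR segment

def variant_window(utr_length, variant_index):
    """Return (start, end, offset) for the 115-nt segment around a variant.

    Coordinates are 0-based and half-open. `offset` is the variant position
    inside the window; 57 corresponds to the 58th nucleotide (centered).
    When one flank is shorter than 57 nt, the window is clamped to the UTR,
    which shifts the variant toward the shorter arm.
    """
    if utr_length < WINDOW:
        raise ValueError("UTR shorter than the window; not covered by this sketch")
    if not 0 <= variant_index < utr_length:
        raise ValueError("variant index outside the UTR")
    start = variant_index - FLANK
    start = max(0, min(start, utr_length - WINDOW))
    return start, start + WINDOW, variant_index - start

print(variant_window(300, 150))  # variant far from both ends: centered
print(variant_window(300, 10))   # short upstream flank: variant shifted toward the start
print(variant_window(300, 290))  # short downstream flank: variant shifted toward the end
```

      Adding the two 20-nt priming sites to the 115-nt segment gives the 155-nt oligonucleotide length stated above.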

      Given that endogenous UTRs in the human genome are often longer than 155 nt, we further evaluated the functional consequences of variants within full-length UTR sequences (Fig. 3B). While the mutant effects observed in the library setting were largely recapitulated, their magnitude was diminished in the full-length context, likely due to the increased sequence and structural complexity.

      To clarify the experimental design related to Figure 3, we modified the text as follows: “The variants significantly altering the polysome profile were then individually validated by means of high-sensitivity luciferase reporter assays (Fig. 3A). To that end, we resynthesized both the variant and corresponding wildtype alleles in the same library format - 115-nt native UTR segments centered on the variant and flanked by 20-nt priming sites. These UTRs were then cloned upstream (5’) or downstream (3’) of the firefly luciferase coding sequence, depending on their genomic location.” (p. 11)

      (5) The conclusions from inserting RBP-binding motifs into 5' UTRs and assaying translational output (Figure 4) would be strengthened by including luciferase reporters containing endogenous 5' UTRs containing these motifs, and versions where the motifs are disrupted.

      Several variants that altered translation efficiency were validated in their native sequence contexts, including 5’ UTR variants in DMD and NF1 that affect SRSF1/2 binding sites, as well as a 3’ UTR variant in AL049650.1 that impacts a KHSRP binding site (Fig. 3 and Supplemental Figs. S1 & S2). To address the functional relevance of these variants within their native regulatory landscapes, we have incorporated the following clarification into the text (p. 13): “This observation is consistent with additional findings where variants that create or disrupt specific RBP binding sites—such as SRSF1/2 (e.g., in DMD and NF1; Fig. 2 and Supplementary Fig. S4) and KHSRP (e.g., in AL049650.1; Fig. 2 and Supplementary Figs. S4 & S5)—led to significant changes in translation efficiency within their native UTR contexts.”

      (6) Figure 5C shows that 5' UTR SNPs that form an uAUG are associated with greater structural changes, but this does not "indicate" that "structure‐modifying UTR variants may control primary ORF translation partly by interfering with translation initiation from a uORF." The data presented in Figure 5 and luciferase/polysome data presented previously do not distinguish whether translation is occurring at an uAUG or canonical AUG. The statement quoted above is speculative and it should be clear that it is a hypothesis generated by the data and is not conclusive.

      We appreciate the reviewer’s suggestion. We have therefore modified our text to: “Therefore, while changes in uATG may not be common explanatory factors for polysome-shifting mutations, our results suggest that structure-modifying UTR variants may control primary ORF translation partly by interfering with translation initiation from a uORF.” (p. 14)

      Minor points/questions

      (1) The authors should clarify whether during library construction for massively parallel polysome profiling the 3' UTR constructs contain a common 5' UTR? Likewise, do the 5' UTR constructs contain a common 3' UTR? Perhaps the lack of a 5' UTR in the 3' UTR constructs, which is implied by Figure 2A, would influence differences seen between 3' UTR pairs (and likewise for 5' UTR pairs).

      There are short common 5’ UTRs appended to the 3’ UTR library, and likewise, a common short 3’ UTR is included in the 5’ UTR library. The common 5’ UTR comprises partial sequences from the CMV promoter and the plasmid backbone of pEGFP-N1 vector. The common 3’ UTR includes sequences from the pEGFP-N1 backbone and a short polyadenylation signal from HBA1 (hemoglobin subunit alpha 1). While we cannot entirely rule out potential crosstalk between 5’ and 3’ UTRs, the design ensures that all constructs are compared in a controlled and consistent context, enabling valid pairwise comparisons between variant and wildtype alleles.

      To clarify the library design, we have revised the main text to include this explanation: 

      “The entire library of UTR oligonucleotides (UTR library) was subsequently ligated upstream or downstream of an enhanced GFP (EGFP) coding region, along with a CMV promoter and a common UTR sequence on the opposite end. Cells transfected with the UTR library were treated with cycloheximide 14 hours post transfection and then subjected to polysome fractionation (see Methods).” (p. 11)

      “The variants significantly altering the polysome profile were then individually validated through high-sensitivity luciferase reporter assays (Fig. 3A). To this end, we resynthesized both the variant and corresponding wildtype alleles in the same library format - 115-nt native UTR segments centered on the variant and flanked by 20-nt priming sites. These UTRs were then cloned upstream (5’) or downstream (3’) of the firefly luciferase coding sequence, depending on their genomic location. As in the initial library design, the test UTR segment differs only by one nucleotide, while a shared short UTR fragment is present on the opposite end of the coding sequence to ensure consistency across constructs (Fig. 2A).” (p. 12)

      (2) The lines connecting the polysome distribution points make the plots appear busy and difficult to read, the data would be easier to interpret if they were removed.

      We employed a generalized linear model (GLM) to identify the variants that altered the polysome association of the corresponding transcripts. Statistically, we were looking for variants that produced a significant interaction between genotype and polysome fraction. Displaying the lines therefore gives readers a direct visualization of this interaction: when the lines for the WT and Mut groups are not parallel, genotype and polysome fraction interact. Showing the lines from all three batches of experiments also helps readers assess the reproducibility of our experiments. Taken together, the lines make our plots more informative.
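
      The link between non-parallel lines and a genotype-by-fraction interaction can be illustrated with a short descriptive sketch. This is not the GLM used in the study: it only computes, for one hypothetical variant, how far the per-fraction log2(Mut/WT) ratio deviates from its mean across fractions. On a log scale, parallel WT and Mut lines give deviations of zero in every fraction.

```python
import math

FRACTIONS = ("monosome", "light", "heavy")

def interaction_contrasts(wt_counts, mut_counts, pseudocount=1.0):
    """Per-fraction log2(Mut/WT) minus the mean log-ratio across fractions.

    All values near zero mean the WT and Mut profiles are parallel on a log
    scale (no interaction); non-zero values mean the mutant shifts reads
    between fractions. The pseudocount is an assumption of this sketch.
    """
    log_ratios = [
        math.log2((m + pseudocount) / (w + pseudocount))
        for w, m in zip(wt_counts, mut_counts)
    ]
    mean_ratio = sum(log_ratios) / len(log_ratios)
    return {f: lr - mean_ratio for f, lr in zip(FRACTIONS, log_ratios)}

# Hypothetical counts. The first mutant is simply half as abundant everywhere
# (parallel lines); the second moves from the heavy to the monosome fraction.
parallel = interaction_contrasts((100, 200, 400), (50, 100, 200), pseudocount=0.0)
shifted = interaction_contrasts((100, 100, 100), (200, 100, 50), pseudocount=0.0)
print(parallel)
print(shifted)
```

      Because each contrast is a within-fraction ratio, rescaling the sequencing depth of any single fraction leaves it unchanged, which matches the point made earlier about relative rather than absolute abundance.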

    1. For example, in the Logic & Communication column, we see many light-orange cells – the AI often thought papers were a bit clearer or better argued (by its judgment) than the human evaluators did.

      I wonder if we should normalize this in a few ways, at least as an alternative measure.

      I suspect the AI's distribution of ratings may differ from the human distribution of ratings overall, and the "bias" may also differ by category.

      Actually, that might be something to do first -- compare the distributions of (middle -- later more sophisticated) ratings for humans and for LLMs in an overall sense.

      One possible normalization would be to state these as percentiles relative to the other stated percentiles within that group (humans, LLMs), or even within categories of paper/field/cause area. (I suspect there's some major difference between the more applied and niche-EA work and the standard academic work; the latter is also probably concentrated in GH&D and environmental econ.) On the other hand, the systematic differences between LLM and human ratings on average might also tell us something interesting. So I wouldn't want to only use normalized measures.
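
      A rough sketch of the within-group normalization, assuming ratings arrive as (group, paper, rating) tuples. The data and names are made up for illustration, and mid-rank percentiles are one choice among several.

```python
from collections import defaultdict

def percentile_ranks(values):
    """Mid-rank percentiles in (0, 100): share below plus half the share tied."""
    n = len(values)
    ranks = []
    for v in values:
        below = sum(1 for u in values if u < v)
        equal = sum(1 for u in values if u == v)
        ranks.append(100.0 * (below + 0.5 * equal) / n)
    return ranks

def normalize_within_group(ratings):
    """Map (group, paper) to the rating's percentile among that group's ratings."""
    by_group = defaultdict(list)
    for group, paper, rating in ratings:
        by_group[group].append((paper, rating))
    normalized = {}
    for group, items in by_group.items():
        pct = percentile_ranks([rating for _, rating in items])
        for (paper, _), p in zip(items, pct):
            normalized[(group, paper)] = p
    return normalized

# Hypothetical ratings: the LLM scores every paper 20 points higher.
ratings = [
    ("human", "paper_a", 50), ("human", "paper_b", 60), ("human", "paper_c", 70),
    ("llm", "paper_a", 70), ("llm", "paper_b", 80), ("llm", "paper_c", 90),
]
normalized = normalize_within_group(ratings)
print(normalized[("human", "paper_b")], normalized[("llm", "paper_b")])
```

      After normalization the uniform 20-point gap disappears, which is exactly why the raw difference is worth reporting alongside the normalized measure. Grouping by (rater type, category) instead of rater type alone would give the per-category variant.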

      I think a more sophisticated version of this normalization just becomes a statistical (random effects?) model where you allow components of variation along several margins.

      It's true the ranks thing gets at this issue to some extent, as I guess Spearman also does? But I don't think it fully captures it.
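
      To make the Spearman point concrete, here is a small sketch with made-up ratings. Rank correlation is unchanged by any strictly increasing transformation of one rater's scores, so it ignores an overall level difference, but by the same token it cannot show that the level difference exists.

```python
import math

def average_ranks(values):
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

human = [40, 55, 70, 85]            # hypothetical human ratings
llm_shifted = [50, 65, 80, 95]      # same ordering, uniformly 10 points higher
llm_reordered = [50, 80, 65, 95]    # two papers swap places

print(spearman_rho(human, llm_shifted))
print(spearman_rho(human, llm_reordered))
```

      The first comparison gives perfect rank agreement despite the 10-point offset; only the second, where the ordering changes, lowers the coefficient.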

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      In this paper, the GFP-GBP system for mistargeting protein localization was used in fission yeast cells to discover new protein interactions involved in vesicular trafficking during cytokinesis. This approach uncovered a new association between the F-BAR protein Rga7 and its binding partner Rng10 with the Munc13 protein Ync13 at the cell division site. Additional associations were observed between Rga7-Rng10, Ync13 and the glucan synthases Ags1 and Bgs4, and the vesicle fusion protein Sec1. These interactions identified by the GFP-GBP system were further supported by co-immunoprecipitation experiments and by defining localization dependencies with live cell imaging in a variety of mutant strains. The imaging data are all of high quality and for the most part support the conclusions. However, in my opinion some of the interpretations are overstated, and the manuscript would benefit from providing additional mechanistic information. Major and minor recommendations are outlined below.

      Major suggestions 1. The co-IP data are interpreted to suggest that all the above-mentioned proteins form a single "big complex." However, as noted in the manuscript and reflected in the model, the multipass integral membrane proteins Bgs4 and Ags1 are embedded in the vesicle membrane and likely only indirectly associate with the scaffold Rga7-Rng10 via Ync13, without forming a 'complex'. One would expect the entirety of these vesicle contents to co-IP if the model is correct. The first paragraph of page 11 should be revised to more clearly reflect this scenario and to align with the proposed model.

      Response: We thank the reviewer for this thoughtful clarification. In the original manuscript, we stated “…indicating these proteins do interact or form big protein complexes… These results suggest that Rga7, Rng10, and Ync13 form a protein complex.” We agree that our initial wording may have unintentionally implied that all proteins detected in co-IP experiments assemble into a single, large physical complex. As the reviewer correctly noticed, the multipass integral membrane proteins Bgs4 and Ags1 are embedded within vesicle membranes and are more likely to associate indirectly with the Rga7-Rng10-Ync13 complex, rather than being part of one unified protein complex. To avoid overinterpretation, we have modified the last sentence of the first paragraph on the original page 11 as below: “These results suggest that Rga7, Rng10, and Ync13 do form a protein complex, although maybe dynamic and not super stable (see Discussion). Our data indicate that Rga7 interacts with both Ync13 and Rng10 to form a module on the plasma membrane for targeting of the vesicles containing cargos such as glucan synthases Bgs4 and Ags1. However, these glucan synthases are multipass integral membrane embedded proteins and likely only indirectly associate with the module Rng10-Rga7-Ync13, without forming a big protein complex.”

      Can Ync13 be artificially directed or tethered to the division site independently of Rga7-Rng10 (e.g., via Imp2)? If so, can this rescue the phenotypes of rga7Δ cells? This experiment could clarify whether Ync13 is the key functional effector of the Rga7-Rng10 complex.

      Response: We thank the reviewer for suggesting this interesting experiment. We agree that testing whether correctly localized Ync13 is sufficient to execute the division-site function of the Rga7–Rng10 complex would clarify its role. To test this, we artificially targeted Ync13 to the division site independently of Rga7 by tethering it to the scaffold protein Pmo25. Pmo25, an MO25 family protein, localizes to both the plasma membrane at the division site and the spindle pole body (mainly one of the SPBs) during mitosis and cytokinesis, enabling us to mislocalize Ync13 to these structures through the GFP–GBP system. We did not use Imp2 because its localization pattern (mainly to the contractile ring [1, 2]) differs from that of Ync13. Microscopy revealed robust localization of Ync13 at the division site and the SPB in rga7Δ cells, and this tethered Ync13 persisted along the cleavage furrow throughout ring constriction. Importantly, enforced division-site localization of Ync13 significantly rescued the cytokinesis defects and cell lysis of rga7Δ. Consistently, growth assays on Phloxin B (PB) plates showed that the elevated lysis/death in rga7Δ cells was rescued by tethering Ync13 to Pmo25-GBP. Together, these findings support that Ync13 is a key functional effector acting downstream of the Rga7–Rng10 scaffold at the division site. We have added these results in the new Figure 6 and the associated text in the revised manuscript. We have also updated the model in Figure 8 to reflect this new result.

      1. Demeter J, Sazer S. imp2, a new component of the actin ring in the fission yeast Schizosaccharomyces pombe. J Cell Biol. 1998;143(2):415-27. PubMed PMID: 9786952.
      2. Martin-Garcia R, Coll PM, Perez P. F-BAR domain protein Rga7 collaborates with Cdc15 and Imp2 to ensure proper cytokinesis in fission yeast. J Cell Sci. 2014;127(Pt 19):4146-58. Epub 2014/07/24. doi: 10.1242/jcs.146233. PubMed PMID: 25052092.
      3. The authors should consider structural or computational modeling of the proposed Rga7-Rng10-Ync13 complex. Such analysis could offer insight into how these components interact and strengthen the proposed model.

      Response: We thank the reviewer for this valuable suggestion. Following the recommendation, we performed structural modeling of the Rga7–Rng10–Ync13 complex using AlphaFold3. Our previous work demonstrated that the F-BAR protein Rga7 forms a stable dimer and its F-BAR domain binds the C-terminal (aa751–1038) region of Rng10 [3]. Based on these findings, we constructed an input model consisting of two full-length Rga7 subunits, two Rng10(751–1038) subunits, and one full-length Ync13. The predicted structure revealed a modular organization in which Rng10(751–1038) associated strongly with the F-BAR domain of the Rga7 dimer, consistent with our prior biochemical data [3]. In addition, the model suggested that Ync13 interacted with the GAP domain of Rga7, positioning Ync13 in close proximity to the Rga7–Rng10 interface (Fig. S5, A, B, D and F). Further domain specific predictions confirmed the interactions between Rga7-GAP and Ync13 N-terminus (pTM: 0.63, ipTM: 0.64), two Rga7 F-BARs (pTM: 0.74, ipTM: 0.71), as well as Rga7 F-BAR and Rng10(751–1038) (pTM: 0.56, ipTM: 0.78) (Fig. S5, C-F). Overlay analyses revealed that the interacting domains align well with the structure of whole complex as the root mean square differences (RMSDs) are

      Liu Y, McDonald NA, Naegele SM, Gould KL, Wu J-Q. The F-BAR domain of Rga7 relies on a cooperative mechanism of membrane binding with a partner protein during fission yeast cytokinesis. Cell Rep. 2019;26(10):2540-8.e4. doi: 10.1016/j.celrep.2019.01.112. PubMed PMID: 30840879; PubMed Central PMCID: PMCPMC6425953.

      Minor text edits 1. Define "SIN" in the discussion section for clarity.

      Response: We defined the SIN pathway in the Discussion section as suggested: “At low restrictive temperatures, the lethality of mutant sid2, the most downstream kinase in the Septation Initiation Network, is partially rescued by upregulating Rho1. Thus, it has been suggested that the Septation Initiation Network activates Rho1, which in turn activates the glucan synthases [4].”

      Alcaide-Gavilán M, Lahoz A, Daga RR, Jimenez J. Feedback regulation of SIN by Etd1 and Rho1 in fission yeast. Genetics. 2014;196(2):455-70. Epub 2013/12/18. doi: 10.1534/genetics.113.155218. PubMed PMID: 24336750; PubMed Central PMCID: PMCPMC3914619.

      Figure S3, the protein schematics should start at residue "1" and not "0".

      Response: We apologize for the mistake. The schematics in revised figure (now Figure S4A) have been corrected to start at residue 1.

      Mass spectrometry data referenced in the text are not provided in the manuscript.

      Response: We apologize for the omission. The mass spectrometry data are now shown in Table S1.

      In Figure 4A. The Ags1 rim localization does not appear decreased as the authors claim.

      Response: After examining the data again, we agree with the reviewer’s assessment. We have therefore reworded the sentence as follows: “We also found that in ync13Δ cells, the Bgs4 intensity at the rim of the septum was much lower than in WT after ring constriction (Fig. 4B).”


      On page 13: "both Rga7 and Rng10 can mistarget Trs120 to mitochondria."

      Response: Thank you. The typo “mistargeting” has been corrected to “mistarget”.

      Minor figure edits 1. Consider inverting single-channel images to display fluorescence on a white background, which would improve visual clarity.

      Response: We appreciate the reviewer’s suggestion. However, we have chosen to retain the original display format, with fluorescence shown on a black background, to be consistent with our (and some others’) previous publications. We believe this format preserves clarity while allowing easier comparison with previously published work.

      The Figure 1 legend should describe the experimental setup rather than restating conclusions.

      Response: We thank the reviewer for this helpful suggestion. The Figure 1 legend has been revised to describe the experimental setup and imaging conditions rather than summarizing conclusions, as follows:

      Fig. 1. Physical interactions among the key cytokinetic proteins in plasma membrane deposition and septum formation revealed by ectopic mistargeting to mitochondria by Tom20-GBP. Arrowheads mark examples of colocalization at mitochondria. (A) Ync13 colocalizes with Rga7 and Rng10 at cell tips and the division site. (B-F) Tom20-GBP can ectopically mistarget Rga7/Rng10-mEGFP and their interacting partners tagged with tdTomato/RFP/mCherry to mitochondria. Tom20–GBP was used to recruit mEGFP-tagged Rga7 or Rng10 to mitochondria, and colocalization was assessed with tdTomato/RFP/mCherry-tagged candidate binding partners. Cells were grown at 25ºC in YE5S + 1.2 M sorbitol medium for ~36 to 48 h and then were washed with YE5S without sorbitol and grown in YE5S for 4 h before imaging. (B) Rga7/Rng10-Ync13. (C) Rga7/Rng10-Trs120. (D) Rga7/Rng10-Bgs4. (E) Rga7/Rng10-Ags1. (F) Rga7-Smi1. Bars, 5 μm.

      Reduce the number of arrows indicating co-localization in microscopy images; highlighting 1-2 representative examples is sufficient and less visually cluttered.

      Response: We appreciate the reviewer’s suggestion. We have revised the micrographs to reduce the number of arrowheads, highlighting several representative examples of co-localization per image. This improves clarity and reduces visual clutter while still guiding the reader to the key observations.

      Figure 3F, the scale bar is listed as 5 μm in the legend but it appears to my eye to be 2 μm.

      Response: We thank the reviewer for noticing this error. After rechecking the original imaging data, we have added a new 5 μm scale bar.

      The orientation of Bgs4/Smi1 should be inverted in the schematic within vesicles so that Smi1 is always on the cytoplasmic side.

      Response: We thank the reviewer for pointing out this error. The schematic has been corrected so that Bgs4 and Smi1 are oriented appropriately, with Smi1 consistently placed on the cytoplasmic side of vesicles because it does not have a transmembrane domain. The revised schematic is included in the updated Figure 8.

      6. Also in the schematic, Mid1 is not at the constricting CR and therefore needs to be removed.

      Response: Thank you for the suggestion. Mid1 has been removed from the model figure.

      Reviewer #1 (Significance (Required)):

      From the data presented in the manuscript, it is proposed that Rga7 and Rng10 form a scaffold at the division site for delivery of exocytic vesicles marked by the TRAPPII complex but not the exocyst complex. Further, it is proposed that these vesicles deliver specifically the glucan synthases necessary for septation. Overall, this study builds on previous work from the Wu lab to clarify how the TRAPPII-decorated vesicles are specifically delivered to the cell division site, adding some new information about vesicle trafficking regulation during cytokinesis. It also provides new insight into the role of a F-BAR scaffold protein.

      This paper will be of interest to those studying cytokinesis and also those studying mechanisms of intracellular trafficking.

      Reviewer expertise: Cell division, signaling, membrane biology

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary:

      This paper provides a comprehensive analysis of the roles of Rng10, Rga7, and Ync13 in cytokinesis using fission yeast as a model system. The authors demonstrate that Ync13/Rga7/Rng10 not only interact with each other but also associate with components of glucan synthases, which are essential for secondary septum formation but not for the primary septum. They further show that Ync13 is involved in exocytosis through its interaction with Sec1 and plays a role in membrane trafficking via interaction with the TRAPP-II complex. Collectively, their findings reveal a coordinated mechanism that ensures the timely formation of the secondary septum during cytokinesis, as deletion of these proteins disrupts septum formation and leads to cell lysis.

      The conclusions drawn in this paper are well-supported by the data, with a clear methodology and robust statistical analyses that enhance reproducibility. However, I have the following major and minor comments:

      Major Comments - 1) The authors propose that Ync13, Rng10, and Rga7 interact to form a protein complex, supported by their mislocalization studies. While these findings are suggestive, additional co-immunoprecipitation (co-IP) data specifically demonstrating a direct interaction between Ync13 and Rng10 would strengthen the claim.

      Response: We thank the reviewer for this suggestion. The direct interaction between Rga7 and Rng10 has already been established and published by our group [3, 5]. Here we found that Rga7 and Ync13 interact directly in an in vitro binding assay (Figure 2, D and E). While our current data do not suggest a direct physical interaction between Ync13 and Rng10, our mislocalization results and other data do provide strong support for their functional association. In particular, ectopic tethering of Ync13 to mitochondria recruits Rng10 to the same sites and vice versa (Figures 1B and S2A). Additionally, division-site tethering of Ync13 by Pmo25-GBP rescues both the growth and cell-lysis phenotypes of rga7Δ (Figure 6), consistent with the idea that Ync13 functions downstream of Rga7-Rng10 because Rga7 localization depends on Rng10 (Figure 8). Furthermore, our AlphaFold3 modeling predicts that Rng10 binds the BAR domain of Rga7, whereas Ync13 binds the GAP domain of Rga7, suggesting that Rng10 and Ync13 are positioned within the same complex through Rga7 without direct interaction (Figure S5).

      The predicted lack of direct interaction between Ync13 and Rng10(751–1038) is supported by the experiment described below in answer to the minor question from Reviewer 3. We tested the mistargeting of mECitrine-Rng10(751–1038) in *rga7Δ tom20-GBP* cells and found that Ync13-tdTomato could not be recruited to mitochondria (Figure S4H). This indicates that Ync13 cannot interact with Rng10(751–1038) independently of Rga7, supporting our proposed model that Rga7 interacts with Rng10 through the BAR domain and with Ync13 through the GAP domain. We have added these clarifications to the revised manuscript (Results and Discussion) to better contextualize the evidence for the Rga7–Rng10–Ync13 assembly.
      

      Liu Y, McDonald NA, Naegele SM, Gould KL, Wu J-Q. The F-BAR Domain of Rga7 Relies on a Cooperative Mechanism of Membrane Binding with a Partner Protein during Fission Yeast Cytokinesis. Cell Rep. 2019;26(10):2540-8.e4. doi: 10.1016/j.celrep.2019.01.112. PubMed PMID: 30840879; PubMed Central PMCID: PMCPMC6425953.

      Liu Y, Lee I-J, Sun M, Lower CA, Runge KW, Ma J, et al. Roles of the novel coiled-coil protein Rng10 in septum formation during fission yeast cytokinesis. Mol Biol Cell. 2016;27(16):2528-41. Epub 2016/07/08. doi: 10.1091/mbc.E16-03-0156. PubMed PMID: 27385337; PubMed Central PMCID: PMCPMC4985255.

      2) It remains unclear whether Ync13 directly interacts with components of the glucan synthase complex (Bgs4/Ags1), or if this association is mediated through other factors (Rng10, Rga7). Clarifying the nature of this interaction would significantly enhance the mechanistic insight.

      Response: We thank the reviewer for this thoughtful comment. As pointed out by Reviewer 1 in major comment 1, the multipass integral membrane proteins Bgs4 and Ags1 are embedded within vesicle membranes and are more likely to associate indirectly with the Rga7–Rng10–Ync13 complex than to be part of one unified protein complex, although Rga7 co-IPs with Bgs4 and its binding partner Smi1 (Figure 1, A-C). We would like to make it clear that our model and manuscript do not claim direct interactions between the Ync13-Rga7-Rng10 module and the glucan synthase complexes; rather, they suggest that the module aids in the selection of vesicle targeting sites on the plasma membrane. To clarify, we have revised the text to state more clearly that our co-IP and in vitro binding results demonstrate that Rga7 physically associates with Ync13 and Rng10, and that vesicle-associated proteins such as Bgs4 and Ags1 are likely recruited through indirect interactions.

      Minor comments: 1) The manuscript refers to mass spectrometry-based interaction data, but the corresponding dataset is not included. Providing this would enhance transparency and reproducibility.

      Response: We apologize for the omission. The mass spectrometry data are now shown in Table S1.

      2) In Figure 2D, the MBP-6x pull-down lane shows a faint band around 76 kDa. The authors should clarify what this band represents and whether it has any relevance to the study.

      Response: We thank the reviewer for noticing this faint band. The weak ~76 kDa band in the MBP-6x pull-down lane is non-specific background binding of MBP and Rga7. We added a note in the figure legend to clarify this point.


      3) A quantification graph corresponding to the data in Figure 3G would aid in better interpreting the results and assessing their significance.

      Response: We thank the reviewer for this suggestion. We have now added two quantification graphs corresponding to Figure 3G, showing the measured Rng10 signal intensities across the division site. Statistical analysis shows that the full width at half maximum (FWHM) is significantly different between WT and ync13Δ cells, and the figure legend and text have been updated accordingly in the revised manuscript.

      4) Figure 4D appears to be missing time legends, which are essential for interpreting the dynamics of the experiment.

      Response: We thank the reviewer for noticing this. We apologize for the confusing statement in the figure legend. We would like to clarify that the full width at half maximum (FWHM) was calculated from line scans using single-time-point images of cells at the end of contractile-ring constriction. Those line scans were fitted with a Gaussian distribution to calculate the mean and standard deviation of the FWHM. We have updated the figure legend to make this clearer in the revised manuscript.
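
      For reference, the FWHM of a Gaussian follows directly from its fitted standard deviation, FWHM = 2·sqrt(2·ln 2)·σ ≈ 2.355·σ. The sketch below shows that relation and, as a cross-check, a direct half-maximum measurement on a sampled profile. The profile is synthetic, the function names are illustrative, and the fitting procedure actually used for the line scans is not described in enough detail here to reproduce.

```python
import math

def fwhm_from_sigma(sigma):
    """FWHM of a Gaussian with standard deviation sigma."""
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma

def fwhm_from_profile(positions, intensities):
    """Width at half maximum of a single-peaked, baseline-subtracted line scan.

    Crossing points on each side of the peak are found by linear interpolation.
    """
    peak = max(intensities)
    half = peak / 2.0
    i_peak = intensities.index(peak)
    left = right = None
    for i in range(i_peak, 0, -1):          # walk left from the peak
        if intensities[i - 1] < half <= intensities[i]:
            frac = (half - intensities[i - 1]) / (intensities[i] - intensities[i - 1])
            left = positions[i - 1] + frac * (positions[i] - positions[i - 1])
            break
    for i in range(i_peak, len(intensities) - 1):   # walk right from the peak
        if intensities[i + 1] < half <= intensities[i]:
            frac = (intensities[i] - half) / (intensities[i] - intensities[i + 1])
            right = positions[i] + frac * (positions[i + 1] - positions[i])
            break
    if left is None or right is None:
        raise ValueError("profile does not fall below half maximum on both sides")
    return right - left

# Synthetic line scan: Gaussian with sigma = 0.4 (arbitrary length units).
sigma = 0.4
positions = [i * 0.01 - 3.0 for i in range(601)]
intensities = [math.exp(-(x ** 2) / (2.0 * sigma ** 2)) for x in positions]

print(fwhm_from_sigma(sigma))
print(fwhm_from_profile(positions, intensities))
```

      The two values agree closely for a clean Gaussian; on noisy line scans the fit-based value is usually the more stable of the two.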

      Reviewer #2 (Significance (Required)):

      Nature and Significance of the Advance: This study provides a conceptual and mechanistic advance in understanding the spatial and temporal regulation of membrane trafficking during cytokinesis. It identifies a conserved module (Ync13-Rga7-Rng10) that directs the selective tethering and fusion of secretory vesicles at the division site, functioning independently of the exocyst complex. This finding challenges the prevailing model that the exocyst is universally required for vesicle tethering during cytokinesis. While previous work has underscored the roles of TRAPP-II and vesicle trafficking in septum formation (Wang et al., 2016; Arellano et al., 1997; Gerien and Wu, 2018), the precise mechanism targeting vesicles to the division site remained unclear. This study fills that gap by elucidating how Ync13 and Rga7 coordinate vesicle delivery and glucan synthase localization (Liu et al., 2016; Zhu et al., 2018), thereby extending our understanding of septum biogenesis and membrane remodeling beyond actomyosin ring dynamics.

      Relevant Audience: This work is relevant to:

      • Cell biologists investigating cytokinesis, membrane trafficking, or vesicle fusion.
      • Yeast geneticists interested in conserved cell division pathways.
      • Researchers focused on SNARE-mediated membrane dynamics and trafficking regulation.
      • Biomedical scientists exploring analogous processes in mammalian systems, particularly those studying cell division defects linked to disease.

      The findings have implications across both basic and translational research in cell biology and membrane dynamics.

      My Expertise: My research focuses on membrane fusion, specifically the SNARE-mediated fusion process. I study the spatio-temporal regulation of fusion events and the coordinated action of regulatory proteins in determining the structural and functional outcomes of membrane fusion. This background provides me with the framework to critically evaluate studies investigating cytokinesis and trafficking mechanisms at the molecular level.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Zhang et al. elucidate key roles of a conserved module, the Ync13-Rga7-Rng10 complex, in coordinating selective tethering, docking, and fusion of glucan synthase-containing vesicles with the plasma membrane, a process crucial for cell wall synthesis and survival of fission yeast at division. Using methods including mistargeting proteins to mitochondria, co-immunoprecipitation, in vitro binding assays, genetic and cellular methods, electron microscopy, and live-cell confocal microscopy, the authors demonstrate that this module controls a vesicle targeting pathway mediated by the TRAPP-II complex and the SM protein Sec1, which ensures that the glucan synthases Bgs4 and Ags1 are deposited at the division site in a spatiotemporally controlled manner.

      Major comments: The authors report aberrant accumulation of Bgs4 and Ags1 in the center of the septum after actomyosin ring constriction in ync13del cells and detect no overall defects in Bgs1 distribution there (Figure 4). When similar experiments were analyzed in this paper (https://pmc.ncbi.nlm.nih.gov/articles/PMC6249806/), Bgs1 distribution and level did change in cells lacking Ync13, although these Bgs1 phenotypes appeared later than those of Bgs4. I wonder whether there could be a second wave of Bgs1 arrival in ync13del cells at later time points, after the ring has fully constricted. Could this late recruitment of Bgs1 depend on Rga7 and Rng10, since these proteins are enriched in the middle of the septum in ync13del cells? Or, as the authors mention in the Discussion, could the Rho GTPase regulated by the Rga7 GAP also play a role in Bgs1 accumulation or fusion with the septum in the above scenario, if no obvious accumulation of vesicles is observed in ync13del cells by electron microscopy? How does Bgs1 localize in the ync13-19 rng10del double mutant?

      Response: We thank the reviewer for this insightful observation. We repeated the experiment to observe the localization of Bgs1 in WT and ync13Δ cells. We confirmed our earlier observation reported in this manuscript that the localization of Bgs1 at the rim of the division site and its distribution along the division plane in ync13Δ cells are not very different from those in WT cells, although its intensity is higher and more variable in ync13Δ cells (Figure above). As suggested by the reviewer, we used microscopy to test Bgs1 localization in the ync13-19 temperature-sensitive mutant, rng10Δ, ync13-19 rng10Δ, and WT cells (Fig. S7). Although the line-scan curves for Bgs1 localization at the division site appear steep for the ync13-19 rng10Δ double mutant, its FWHM is not significantly different from that of the WT control (Fig. S7). Please note that we used different confocal systems, cameras, and laser powers for Fig. 4, C and E (PerkinElmer UltraVIEW Vox CSUX1) and Fig. S7 (Nikon W1+SoRa), so the FWHMs are not comparable between the two figures.

      To test whether there is a second wave of Bgs1 localization at the division site, we tracked the fluorescence intensity of Bgs1 throughout 2-h movies and plotted the Bgs1 intensity profile at the division site over time. The data clearly show only one peak of Bgs1 and no later accumulation at the division site, although Bgs1 intensity is more variable in ync13-19 and ync13-19 rng10Δ cells and is higher in ync13-19 rng10Δ cells. Together, these experiments indicate that the Ync13-Rga7-Rng10 module affects the localization of the glucan synthases essential for the secondary septum (Bgs4 and Ags1) but not that of the synthase for the primary septum (Bgs1).

      The assessments of protein abundance by Western blotting (Figure 3C and 3D) would benefit from quantification.

      Response: We thank the reviewer for this suggestion. We have now quantified the Western blot bands in Figures 3C and 3D, which have been added as supplementary figures along with the Western blot for Rng10 (Fig. S6, A-C) in the revised figures.

      Minor comments: Based on a series of experiments in which mistargeted Rga7 and Rng10 truncations drive Ync13-tdTomato to mitochondria, the authors suggest that Rga7, Rng10, and Ync13 have multivalent interactions with each other. A previous study (https://pmc.ncbi.nlm.nih.gov/articles/PMC6425953/) demonstrated that in cells co-expressing Tom20-GBP and mECitrine-Rng10(751-950), Rga7 was efficiently mistargeted to mitochondria. This raises the possibility that Ync13 mistargeted by mECitrine-Rng10(751-1038) could be recruited via Rga7 that is strongly associated with Rng10(751-1038) on mitochondria. I wonder whether the authors could compare some of the truncation mistargeting experiments in the original manuscript with ones in which either Rga7 or Rng10 is deleted, e.g., Tom20-GBP mECitrine-Rng10(751-1038) experiments in rga7del cells, if cells are still viable in this genetic background.

      Response: We thank the reviewer for this insightful suggestion. We tested the mistargeting of mECitrine-Rng10(751–1038) in rga7Δ tom20-GBP cells and found that Ync13-tdTomato could not be recruited to mitochondria. This indicates that Ync13 cannot interact with the Rng10 C-terminus independently of Rga7, supporting the AlphaFold3 modeling and our proposed model that Rga7 interacts with Rng10 through its BAR domain and with Ync13 through its GAP domain. We have added the new data to the revised manuscript (Fig. S4H and associated text) and included a brief discussion highlighting that Rga7 is required for the Rng10–Ync13 interaction. We removed the mention of multivalent interactions from the manuscript to minimize confusion.

      It is interesting that rga7del rng10del double mutants survive better in EMM or YES with sorbitol. I wonder whether this would allow the authors to test whether the interaction between Ync13 and Sec1 is modulated by the presence of Rga7 and Rng10 or even the entire vesicle. Is mistargeted Ync13 overexpressed from the 3nmt1 promoter still capable of driving Sec1 to mitochondria in rga7del rng10del cells?

      Response: We thank the reviewer for this suggestion. While we did not succeed in constructing the pentamutant deleting both rga7 and rng10 and mislocalizing Ync13 to mitochondria, we were able to make a quadruple mutant deleting rng10 and mislocalizing Ync13 to mitochondria. We tested whether mistargeted Ync13 overexpressed using the 3nmt1 promoter can recruit Sec1 to mitochondria in rng10Δ cells. Our results show that overexpressed Ync13 is still able to drive Sec1 localization to mitochondria without Rng10 (Fig. S2G). This suggests that Rng10 (together with Rga7) primarily functions to recruit and position Ync13 at the division site rather than being strictly required for the interaction between Ync13 and Sec1. This is also consistent with our Pmo25-GBP mislocalization experiments where we found that rga7Δ 3nmt1-mECitrine-ync13 cells even under the repressed condition for the 3nmt1 promoter can partially rescue the lysis phenotype of rga7Δ cells (Figure 6).

      The endogenous level of Ync13 is not particularly high. Is this low level of Ync13 crucial for its function? Does a mildly elevated level of Ync13 promote vesicle fusion at the closing septum?

      Response: We thank the reviewer for this insightful question. To test whether Ync13 levels correlate with vesicle fusion at the division site, we mildly overexpressed Ync13 from the 3nmt1 promoter in YE5S rich medium without added thiamine to obtain cells with different Ync13 levels (the rich medium contains residual thiamine, which partially represses the nmt1 promoter) and then tracked vesicles labeled with the Rab11 GTPase Ypt3. This resulted in increased levels of both Ync13 and Ypt3 at the division site (Fig. S8B). We measured the Ync13 intensity at the division site and counted the number of Ypt3 vesicles reaching the division site in 2-minute continuous movies at the middle focal plane. We observed that increasing the Ync13 level promoted the tethering and accumulation of Ypt3 vesicles at the division site until it reached a plateau (Fig. S8B). Thus, the Ync13 level is important for vesicle fusion at the division site. Collectively, Ync13, working with Rga7 and Rng10, plays an important role in vesicle targeting to and fusion with the plasma membrane at the division site during cytokinesis. This is consistent with our results that overexpressed Ync13 can mislocalize Sec1 to mitochondria in rng10Δ cells (Fig. S2G) and can rescue rga7Δ cells (Fig. 6).

      Reviewer #3 (Significance (Required)):

      Most of the conclusions are well supported by a combination of methods. Out of curiosity, I wonder how much of the Bgs4 or Smi1 detected in the Co-IP experiments exists in the vesicle-bound form. The authors propose a very interesting working model that addresses several key challenges in achieving vesicle targeting specificity when the timely delivery of various enzymes to their respective locations along the primary and secondary septum must be orchestrated. I think this manuscript will be of interest to a broad audience.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Joint Public Review:

      Weaknesses:

      The lack of pleiotropy is an unconfirmable assumption of MR, and the addition of those models is therefore quite important, as this is a primary weakness of the MR approach. Given that concern, I read the sensitivity analyses using pleiotropy-robust models as the main result, and in that case, they can't test their hypotheses as these models do not show a BMI instrumental variable association. The other weakness, which might be remedied, is that the power of the tests here is not described. When a hypothesis is tested with an under-powered model, the apparent lack of association could be due to inadequate sample size rather than a true null. Typically, when a statistically significant association is reported, power concerns are discounted as long as the study is not so small as to create spurious findings. That is the case with their primary BMI instrumental variable model - they find an association so we can presume it was adequately powered. But the primary models they share are not the pleiotropy-robust methods MR-Egger, weighted median, and weighted mode. The tests for these models are null, and that could mean a couple of things: (1) the original primary significant association between the BMI genetic instrument was due to pleiotropy, and they therefore don't have a robust model to explore the effects of the tobacco genetic instrument. (2) The power for the sensitivity analysis models (the pleiotropy-robust methods) is inadequate, and the authors share no discussion about the relative power of the different MR approaches. If they do have adequate power, then again, there is no need to explore the tobacco instrument.

      Reviewing Editor Comments:

      We suggest that the authors add power estimates to assess whether the sample size is sufficient, given the strength and variability of the genetic instruments. It would also be helpful to present effect estimates for the tobacco instruments alone, to clarify their independent contribution and improve the interpretation of the joint models. In addition, the role of pleiotropy should be addressed more clearly, including which model is considered primary. Stratified analyses by smoking status are encouraged, as prior studies indicate that BMI-HNC associations may differ between smokers and non-smokers. Finally, the comparison with previous studies should be revised, as most reported null findings without accounting for tobacco instruments. If this study finds an association, it should not be framed as a replication.

      We would like to highlight that post-hoc power calculations are often considered redundant since the statistical power estimated for an observed association is directly related to its p-value[1]. In other words, the uncertainty of the association is already reflected in its 95% confidence interval. However, we understand power calculations may still be of interest to the reader, so we have incorporated them in the revised manuscript. We have edited the text as follows (lines 151-155):“Consequently, we used the total R<sup>2</sup> values to examine the statistical power in our study[42]. However, we acknowledge that the value of post-hoc power calculations is limited, since the statistical power estimated for an observed association is already reflected in the 95% confidence interval presented alongside the point estimate[43].” We have also added supplementary figures 1 and 2.

      We can see that when using the latest HEADSpAcE data we were able to detect BMI-HNC ORs as small as 1.16 with 80% power, while the GAME-ON dataset only permitted the detection of ORs as small as 1.26 using the same BMI instruments (Figure B). We have explained these figures in the results section as follows (lines 257-263): “Using the BMI genetic instruments (total R<sup>2</sup>= 4.8%) and an α of 0.05, we had 80% statistical power to detect an OR as small as 1.16 for HNC risk (Supplementary Figure 1). For WHR (total R<sup>2</sup>= 3.1%) and WC (total R<sup>2</sup>= 4.4%), we could detect odds ratios (ORs) as small as 1.20 and 1.17, respectively. This is an improvement in terms of statistical power compared to the GAME-ON analysis published by Gormley et al.[28], for which there was 80% power to detect an OR as small as 1.26 using the same BMI genetic instruments (Supplementary Figure 2).”
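
      For readers who wish to reproduce the order of magnitude of these figures, the following is a minimal sketch based on one commonly used normal approximation for MR power with a binary outcome, in which the test statistic has mean |log OR| · sqrt(N · R² · p · (1 − p)), with p the proportion of cases. Whether this is the exact calculation behind Supplementary Figures 1 and 2 is an assumption, and the function names are hypothetical. The inputs (N=31,523, 12,264 cases, R²=4.8%, α=0.05) are the values quoted in this response.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def mr_power_binary(odds_ratio, r2, n_total, n_cases, alpha=0.05):
    """Approximate power to detect a causal OR per 1-SD of exposure.
    Only the dominant tail of the two-sided test is counted."""
    p = n_cases / n_total
    # Expected z-statistic under the alternative (normal approximation)
    expected_z = abs(math.log(odds_ratio)) * math.sqrt(n_total * r2 * p * (1 - p))
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(expected_z - z_crit)

def smallest_detectable_or(r2, n_total, n_cases, power=0.80, alpha=0.05):
    """Smallest OR (> 1) detectable with the requested power, same approximation."""
    p = n_cases / n_total
    z_sum = _STD_NORMAL.inv_cdf(1 - alpha / 2) + _STD_NORMAL.inv_cdf(power)
    return math.exp(z_sum / math.sqrt(n_total * r2 * p * (1 - p)))

# BMI instruments in the HEADSpAcE analysis, using the numbers quoted above
power_bmi = mr_power_binary(1.16, 0.048, 31523, 12264)
min_or_bmi = smallest_detectable_or(0.048, 31523, 12264)
```

      Under this approximation, the BMI inputs give a power of approximately 0.80 for an OR of 1.16, in line with the value reported above.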

      We use inverse variance weighted (IVW) Mendelian randomization (MR) to obtain our main results, rather than the pleiotropy-robust methods mentioned by the reviewer/editors (i.e., MR-Egger, weighted median and weighted mode), because the former has greater statistical power than the latter[2]. Hence, instead of focussing on the statistical significance of the pleiotropy-robust analyses, we consider it more valuable to compare the size and direction of the effect estimates across methods. Any evidence of such consistency increases our confidence in our main findings, since each method relies on different assumptions. As we cannot be sure about the presence and nature of horizontal pleiotropy, it is useful to compare results across methods even though they are not equally powered. It is true that our results for the genetically predicted effects of body mass index (BMI) on the risk of head and neck cancer (HNC) differ across methods. This is precisely what led us to question the validity of our main finding (suggesting a positive effect of BMI on HNC risk). We have now clarified this in the methods section of the revised manuscript as advised. Lines 165-171:

      “Because the IVW method assumes all genetic variants are valid instruments[44], which is unlikely the case, three pleiotropy-robust two-sample MR methods (i.e., MR-Egger[45], weighted median[46] and weighted mode[47]) were used in sensitivity analyses. When the magnitude and direction of effect estimates are consistent across methods that rely on different assumptions, the main findings are more convincing. As we cannot be sure about the presence and nature of horizontal pleiotropy, it is useful to compare results across methods even if they are not equally powered.”
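
      The contrast between IVW and a pleiotropy-robust estimator can be illustrated with a minimal sketch on summary statistics. This is not the analysis code used for the manuscript: the function names and the five toy variants are hypothetical, standard errors are omitted from the MR-Egger function for brevity, and the weighted median and weighted mode estimators are not shown. IVW is computed as a weighted regression of variant-outcome on variant-exposure effects through the origin, and MR-Egger as the same regression with an intercept that absorbs directional pleiotropy.

```python
def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate: weighted regression through the origin."""
    weights = [1.0 / s ** 2 for s in se_outcome]
    num = sum(w * bx * by for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    den = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    return num / den, (1.0 / den) ** 0.5  # estimate, standard error

def egger_estimate(beta_exposure, beta_outcome, se_outcome):
    """MR-Egger: weighted regression with an intercept (slope, intercept)."""
    weights = [1.0 / s ** 2 for s in se_outcome]
    # Orient every variant so that its exposure effect is positive
    xs = [abs(bx) for bx in beta_exposure]
    ys = [by if bx >= 0 else -by for bx, by in zip(beta_exposure, beta_outcome)]
    sw = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / sw
    mean_y = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy data: true causal effect 0.5 plus a constant pleiotropic effect of 0.01
bx = [0.02, 0.04, 0.06, 0.08, 0.10]
by = [0.5 * x + 0.01 for x in bx]
se = [0.01] * 5

ivw_beta, ivw_se = ivw_estimate(bx, by, se)
egger_slope, egger_intercept = egger_estimate(bx, by, se)
```

      In this toy example, the IVW estimate is biased upwards (about 0.64), whereas MR-Egger recovers the slope of 0.5 and attributes the offset of 0.01 to its intercept. With real data, the additional intercept parameter makes MR-Egger considerably less precise, which is the power trade-off discussed above.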

      We understand that the reviewer/editors are concerned that we do not have a robust model to explore the role of tobacco consumption in the link between BMI and HNC. However, we have a different perspective on the matter. If indeed, the main IVW finding for BMI and HNC is due to pleiotropy (since some of the pleiotropy-robust methods suggest conflicting results), then the IVW multivariable MR method is a way to explore the potential source of this bias[3]. We were particularly interested in exploring the role of smoking in the observed association because smoking and adiposity are known to influence each other [4-9] and share a genetic basis[10, 11].

      We agree that it would be useful to present the univariable MR effect estimates for smoking behaviour and HNC risk alongside those obtained using multivariable MR. We have now included the univariable MR estimates for both smoking behaviour variables as a note under Supplementary Table 11 and in the manuscript (lines 316-318): “In univariable IVW MR, both CSI and SI were linked to an increased risk of HNC (CSI OR=4.47 per 1-SD higher CSI, 95%CI 3.31–6.03, p<0.001; SI OR=2.07 per 1-SD higher SI, 95%CI 1.60–2.68, p<0.001) (Additional File 2: note in Supplementary Table 11).”

      We understand the appeal of conducting stratified MR analyses by smoking status. However, we anticipate such analyses would hinder the interpretation of our findings as they can induce collider bias which could spuriously lead to different effect estimates across strata[12, 13].

      We thank the reviewer/editors for their comment regarding the way we frame our findings. We have now edited the discussion section to highlight that our study results differ from those obtained in studies that do not account for smoking behaviour. Lines 398-401: “With a much larger sample (N=31,523, including 12,264 cases), our IVW MR analysis suggested BMI may play a role in HNC risk, in contrast to previous studies. However, our sensitivity analyses implied that causality was uncertain.”

      Reviewer #1 (Recommendations for the authors):

      The authors do share a table of the percent variance explained of the different genetic instruments, which vary widely, and that table is very welcome because we can get some sense of their utility. The problem is that they don't translate that into a power estimate for the case-control study size that they use. They say that it is the biggest to date, which is good, but without some formal power estimate, it is not particularly reassuring. A framework for MR study power estimates was reported in PMID: 19174578, but that was using very simple MR constructs in use in 2009, and it isn't clear to me if that framework can be used here. That power paper suggests that weak genetic instruments need very large sample sizes, far larger than what is used in the current manuscript. I am unable to estimate the true strength of the instruments used here, and so I am unsure of whether power is an issue or not.

      We have now included power calculations in our manuscript to address the reviewer’s concerns. Nevertheless, as mentioned above, post-hoc power calculations are of limited value, as statistical power is already reflected in the uncertainty around the point estimates (the 95% confidence intervals). Hence, it is important to avoid drawing conclusions regarding the likelihood of true effects or false negatives based on these calculations.

      Although the hypothesis here is that smoking accounts for the apparent BMI association previously reported for HNC, it would have been preferable to see the estimates for their 2 genetic instruments for tobacco alone. The current results only show the BMI instruments alone and then with the tobacco instruments. I would like to see what the risk estimates are for the tobacco instrument alone, so that I can judge for myself what happens in the joint models. As presented, one can only do that for the BMI instruments.

      We thank the reviewer for this comment. The univariable IVW MR estimate of smoking initiation was OR=2.07 (95%CI 1.60 to 2.68, p<0.001), while the one for comprehensive smoking index was OR=4.47 (95%CI 3.31 to 6.03, p<0.001). We have included this information in the manuscript as requested (please see response to reviewing editor above).
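
      As an illustration of how the confidence interval already carries the information about precision, the following minimal sketch recovers the standard error, z-statistic, and two-sided p-value on the log-odds scale from a reported OR and its 95% CI. It assumes a symmetric Wald-type interval on the log scale, which is an assumption about how the intervals were constructed, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def z_and_p_from_or_ci(odds_ratio, lower, upper, level=0.95):
    """Recover SE(log OR), the z-statistic and the two-sided p-value
    from an OR and its confidence interval (Wald interval assumed)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(0.5 + level / 2)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    p = 2 * (1 - norm.cdf(abs(z)))
    return se, z, p

# Univariable IVW estimates quoted above
se_si, z_si, p_si = z_and_p_from_or_ci(2.07, 1.60, 2.68)     # smoking initiation
se_csi, z_csi, p_csi = z_and_p_from_or_ci(4.47, 3.31, 6.03)  # comprehensive smoking index
```

      For smoking initiation this gives a z-statistic of about 5.5, consistent with the reported p<0.001.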

      On line 319, they write that "We did not find evidence against bias due to correlated pleiotropy..." I find this difficult to parse, but I think it means that they should believe that correlated pleiotropy remains a problem. So again, they seem to see their primary model as compromised, and so do I. This limitation is again stated by the authors on lines 351-352.

      We apologise if the wording of the sentence was not easy to understand. When using the CAUSE method, we did not find evidence to reject the null hypothesis that the sharing (correlated pleiotropy) model fits the data at least as well as the causal model. In other words, our CAUSE finding and the inconsistencies observed across our other sensitivity analyses led us to believe that our main IVW MR estimate for BMI-HNC was likely biased by correlated pleiotropy. We believe it is important to explore the source of this bias, which is why we used multivariable MR to investigate the direct effect of BMI on HNC risk while accounting for smoking behaviour.

      In the following paragraphs (lines 358-369), the authors state that their findings are consistent with prior reports, but that doesn't seem to be the case if we take their primary BMI instrument as representing the outcome of this manuscript. Here, they find an association between the BMI instrument and HNC risk, but in each of the other papers they present the primary finding was null without the extensive model changes or the aim of accounting for tobacco with another instrument. I don't see that as replication.

      This is a good point. We have now edited the discussion of our manuscript to avoid giving the impression that our findings replicate those from studies that do not account for smoking behaviour in their analyses. We have edited lines 384-401 as follows:

      “Previous MR studies suggest adiposity does not influence HNC risk[27-29]. Gormley et al.[28] did not find a genetically predicted effect of adiposity on combined oral and oropharyngeal cancer when investigating either BMI (OR=0.89 per 1-SD, 95% CI 0.72–1.09, p=0.26), WHR (OR=0.98 per 1-SD, 95% CI 0.74–1.29, p=0.88) or waist circumference (OR=0.73 per 1-SD, 95% CI 0.52–1.02, p=0.07) as risk factors. Similarly, a large two-sample MR study by Vithayathil et al.[29] including 367,561 UK Biobank participants (of which 1,983 were HNC cases) found no link between BMI and HNC risk (OR=0.98 per 1-SD higher BMI, 95% CI 0.93–1.02, p=0.35). Larsson et al.[27] meta-analysed Vithayathil et al.’s[29] findings with results obtained using FinnGen data to increase the sample size even further (N=586,353, including 2,109 cases), but still did not find a genetically predicted effect of BMI on HNC risk (OR=0.96 per 1-SD higher BMI, 95% CI 0.77–1.19, p=0.69). With a much larger sample (N=31,523, including 12,264 cases), our IVW MR analysis suggested BMI may play a role in HNC risk, in contrast to previous studies. However, our sensitivity analyses implied that causality was uncertain.”

      We also deleted part of a sentence in the discussion section, so lines 416-418 now look as follows: “An important strength of our study was that the HEADSpAcE consortium GWAS used had a large sample size which conferred more statistical power to detect effects of adiposity on HNC risk compared to previous MR analyses[27-29].”

      On lines 384-386 they note a strength is that this is the largest study to date, but I would reiterate that larger and more powerful does not equate to adequately powered.

      This is true. We have included power calculations in the manuscript as requested.

      It's well known that different HNC subsites have different etiologies, as they mention on lines 391-392, and it is implicit in their use of data on HPV positive and negative oropharyngeal cancer. They say that they did not find evidence for heterogeneity in this study, but that would only be true for the null BMI instrument. The effect sizes for their smoking instruments are strikingly different between the subsites.

      We agree and are sorry for the confusion we may have caused by the way we worded our findings. We have edited the text to clarify that the lack of subsite heterogeneity only applied to our results for BMI/WHR/WC-HNC risk. Lines 418-424 now read as follows:

      “Furthermore, the availability of data on more HNC subsites, including oropharyngeal cancers by HPV status, allowed us to investigate the relationship between adiposity and HNC risk in more detail than previous MR studies which limited their subsite analyses to oral cavity and overall oropharyngeal cancers[28, 68]. This is relevant because distinct HNC subsites are known to have different aetiologies[69], although we did not find evidence of heterogeneity across subsites in our analyses investigating the genetically predicted effects of BMI, WHR and WC on HNC risk.”

      Finally, the literature on mutational patterns gives us strong reason to believe that HNCs caused by tobacco are biologically distinct from tumors not caused by tobacco. The authors report in the introduction that traditional observational studies of BMI and HNC have reported different findings in smokers versus never smokers, so I would assume there is a possibility that the BMI instrument could have different associations with tumors of the tobacco-induced phenotype and tumors with a non-tobacco-induced phenotype. I would assume that the authors have access to the data on self-reported tobacco use behavior, even if they can't separate these tumors by molecular type. Stratifying their analysis by tobacco use might reveal different results with the BMI instrument.

      We appreciate the reviewer’s comment. We agree that it would have been interesting to present stratified analyses by smoking status alongside our main findings. However, we decided against this because of the risk of inducing collider bias in our MR analyses, i.e., stratifying on smoking status may induce spurious associations between the adiposity instruments and confounding factors. Multivariable MR is considered a better way of investigating the direct effects of an exposure (adiposity) on an outcome (HNC) while accounting for a third variable (smoking)[14], which is why we opted for this method instead.
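
      The collider-bias mechanism can be illustrated with a small simulation. This is a toy model, not an analysis of the study data: the variable names, effect sizes, and the assumption that smoking status depends equally on the genetic instrument and on an unmeasured confounder are all hypothetical. The instrument and the confounder are generated independently, yet they become correlated once the sample is restricted to smokers.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate_collider(n=20000, seed=1):
    """Return corr(instrument, confounder) overall and within smokers."""
    rng = random.Random(seed)
    g_all, u_all, g_smk, u_smk = [], [], [], []
    for _ in range(n):
        g = rng.gauss(0, 1)  # genetic instrument for adiposity
        u = rng.gauss(0, 1)  # unmeasured confounder, independent of g
        # Smoking status is a common effect (collider) of g and u
        smoker = g + u + rng.gauss(0, 1) > 0
        g_all.append(g)
        u_all.append(u)
        if smoker:
            g_smk.append(g)
            u_smk.append(u)
    return pearson(g_all, u_all), pearson(g_smk, u_smk)

corr_all, corr_smokers = simulate_collider()
```

      In the full sample the correlation is close to zero, whereas within smokers it is clearly negative, so the instrument is no longer independent of the confounder in the stratum.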

      References:

      (1) Heinsberg LW, Weeks DE: Post hoc power is not informative. Genet Epidemiol 2022, 46(7):390-394.

      (2) Burgess S, Butterworth A, Thompson SG: Mendelian randomization analysis with multiple genetic variants using summarized data. Genet Epidemiol 2013, 37(7):658-665.

      (3) Burgess S, Davey Smith G, Davies NM, Dudbridge F, Gill D, Glymour MM, Hartwig FP, Kutalik Z, Holmes MV, Minelli C et al: Guidelines for performing Mendelian randomization investigations: update for summer 2023. Wellcome Open Res 2019, 4:186.

      (4) Morris RW, Taylor AE, Fluharty ME, Bjorngaard JH, Asvold BO, Elvestad Gabrielsen M, Campbell A, Marioni R, Kumari M, Korhonen T et al: Heavier smoking may lead to a relative increase in waist circumference: evidence for a causal relationship from a Mendelian randomisation meta-analysis. The CARTA consortium. BMJ Open 2015, 5(8):e008808.

      (5) Taylor AE, Morris RW, Fluharty ME, Bjorngaard JH, Asvold BO, Gabrielsen ME, Campbell A, Marioni R, Kumari M, Hallfors J et al: Stratification by smoking status reveals an association of CHRNA5-A3-B4 genotype with body mass index in never smokers. PLoS Genet 2014, 10(12):e1004799.

      (6) Taylor AE, Richmond RC, Palviainen T, Loukola A, Wootton RE, Kaprio J, Relton CL, Davey Smith G, Munafo MR: The effect of body mass index on smoking behaviour and nicotine metabolism: a Mendelian randomization study. Hum Mol Genet 2019, 28(8):1322-1330.

      (7) Asvold BO, Bjorngaard JH, Carslake D, Gabrielsen ME, Skorpen F, Smith GD, Romundstad PR: Causal associations of tobacco smoking with cardiovascular risk factors: a Mendelian randomization analysis of the HUNT Study in Norway. Int J Epidemiol 2014, 43(5):1458-1470.

      (8) Carreras-Torres R, Johansson M, Haycock PC, Relton CL, Davey Smith G, Brennan P, Martin RM: Role of obesity in smoking behaviour: Mendelian randomisation study in UK Biobank. BMJ 2018, 361:k1767.

      (9) Freathy RM, Kazeem GR, Morris RW, Johnson PC, Paternoster L, Ebrahim S, Hattersley AT, Hill A, Hingorani AD, Holst C et al: Genetic variation at CHRNA5-CHRNA3-CHRNB4 interacts with smoking status to influence body mass index. Int J Epidemiol 2011, 40(6):1617-1628.

      (10) Thorgeirsson TE, Gudbjartsson DF, Sulem P, Besenbacher S, Styrkarsdottir U, Thorleifsson G, Walters GB, Consortium TAG, Oxford GSKC, consortium E et al: A common biological basis of obesity and nicotine addiction. Transl Psychiatry 2013, 3(10):e308.

      (11) Wills AG, Hopfer C: Phenotypic and genetic relationship between BMI and cigarette smoking in a sample of UK adults. Addict Behav 2019, 89:98-103.

      (12) Coscia C, Gill D, Benitez R, Perez T, Malats N, Burgess S: Avoiding collider bias in Mendelian randomization when performing stratified analyses. Eur J Epidemiol 2022, 37(7):671-682.

      (13) Hamilton FW, Hughes DA, Lu T, Kutalik Z, Gkatzionis A, Tilling K, Hartwig FP, Davey Smith G: Non-linear Mendelian randomization: evaluation of effect modification in the residual and doubly-ranked methods with simulated and empirical examples. Eur J Epidemiol 2025.

      (14) Sanderson E, Davey Smith G, Windmeijer F, Bowden J: An examination of multivariable Mendelian randomization in the single-sample and two-sample summary data settings. Int J Epidemiol 2019, 48(3):713-727.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer # 1 (Public review)

      This study aims to elucidate the mechanisms by which stress-induced α2A-adrenergic receptor (α2A-AR) internalization leads to cytosolic noradrenaline (NA) accumulation and subsequent neuronal dysfunction in the locus coeruleus (LC). While the manuscript presents an interesting but ambitious model involving calcium dynamics, GIRK channel rundown, and autocrine NA signaling, several key limitations undermine the strength of the conclusions. 

      (1) First, the revision does not include new experiments requested by reviewers to validate core aspects of the mechanism. Specifically, there is no direct measurement of cytosolic NA levels or MAO-A enzymatic activity to support the link between receptor internalization and neurochemical changes. The authors argue that such measurements are either not feasible or beyond the scope of the study, leaving a significant gap in the mechanistic chain of evidence. 

      Although Reviewer #1 commented that “The authors argue that such measurements are either not feasible or beyond the scope of the study, leaving a significant gap in the mechanistic chain of evidence”, we believe that this comment may be unfair. 

      It may be unfair for Reviewer #1 to disregard our responses to the original reviewer comments regarding the direct measurement of cytosolic NA levels. None of the recommended methods for directly measuring cytosolic NA levels is feasible, as described in the original authors’ response (see the original authors’ response to the comment raised by Reviewer #1 as Recommendations for the authors (2)). To measure extracellular NA with GRAB-NE photometry, α2A-ARs must be expressed in the cell membrane. GRAB-NE photometry is not applicable unless α2A-ARs are expressed, whereas increases in cytosolic NA levels are caused by internalization of α2A-ARs in our study.

      In our study, we took care to detect changes in MAO-A protein by Western blotting instead of examining MAO-A enzymatic activity. Because the relative quantification of active AEP and Tau N368 proteins by Western blot analysis should accurately reflect changes in MAO-A enzymatic activity, an enzymatic assay may not be strictly required, although we acknowledge that an enzymatic assay would better demonstrate MAO-A activity, as discussed in the previously revised manuscript (R1, page 10, lines 314-315). 

      We used the phrase “beyond the scope of the current study” for “the mechanism how Ca<sup>2+</sup> activates MAO-A” as described in the original authors’ responses (see the original authors’ response to the comment raised by the Reviewer #1 as Weakness (3)). We do not think that this mechanism must be investigated in the present study because the Ca<sup>2+</sup> dependent nature of MAO-A activity is already known (Cao et al., 2007). 

      On the other hand, because it is not possible to measure cytosolic NA levels with currently available methods, the quantification of the connection between α2A-AR internalization and increased cytosolic NA levels must be considered outside the scope of the study. However, our study demonstrated the qualitative relationship between α2A-AR internalization and active-AEP/TauN-368 reflecting increased cytosolic NA levels, leaving “a small gap in the mechanistic chain of evidence.” Therefore, it may be unreasonable to criticize our study as “leaving a significant gap in the mechanistic chain of evidence” with the phrase “beyond the scope of the current study.” 

      (2) Second, the behavioral analysis remains insufficient to support claims of cognitive impairment. The use of a single working memory test following an anxiety test is inadequate to verify memory dysfunction behaviors. Additional cognitive assays, such as the Morris Water Maze or Novel Object Recognition, are recommended but not performed.

      As described in the original authors’ response (see the original authors’ response to the comment raised by the Reviewer #1 as Weakness (4)), we had already done another behavioral test using elevated plus maze (EPM) test. By combining the two tests, it may be possible to more accurately evaluate the results of Y-maze test by differentiating the memory impairment from anxiety. However, the results obtained by these behavioral tests showed that chronic RS mice displayed both anxiety-like and memory impairment-like behaviors. Accordingly, we have softened the implication of anxiety and memory impairment (page 13, lines 396-399) and revised the abstract (page 2, line 59) in the revised manuscript (R2).  

      (3) Third, concerns regarding the lack of rigor in differential MAO-A expression in fluorescence imaging were not addressed experimentally. Instead of clarifying the issue, the authors moved the figure to supplementary data without providing further evidence (e.g., an enzymatic assay or quantitative reanalysis of Western blot, or re-staining of IF for MAO-A) to support their interpretation.

      Because the quantification of MAO-A expression can be performed with greater accuracy by means of Western blot than by immunohistochemistry, we have moved the immunohistochemical results (shown in Figure 5) to the supplemental data (Figure S8) following the suggestion made by the Reviewer #3. As the relative quantification of active AEP and Tau N368 proteins by Western blot analysis may accurately reflect changes in the MAO-A enzymatic activity which is consistent with the result of Western blot analysis of MAO-A, enzymatic assay or re-staining of immunofluorescence for MAO-A may not be necessarily required. We do not think that a new experiment of Western blot analysis is necessary to re-evaluate MAO-A just because of the lack of the less-reliable quantification of immunohistochemical staining.

      (4) Fourth, concerns regarding TH staining remain unresolved. In Figure S7, the α2A-AR signal appears to resemble TH staining, and vice versa, raising the possibility of labeling errors. It is recommended that the authors re-examine this issue by either double-checking the raw data or repeating the immunostaining to validate the staining.

      The reviewer #3 is misunderstanding Figure S7. In Figure S7, there are two types of α2A-AR expressing neurons; one is TH-positive LC neuron and the other is TH-negative neuron in mesencephalic trigeminal nucleus (MTN). This clearly indicates that TH staining is specific. Furthermore, α2A-AR staining was much more extensive in MTN neurons than in LC neurons. Thus, α2A-AR signal is not similar to TH signal and there are no labeling errors, which is also evident in the merged image (Figure S7C).

      (5) Overall, the manuscript offers a potentially interesting framework but falls short in providing the experimental rigor necessary to establish causality. The reliance on indirect reasoning and reorganizing of existing data, rather than generating new evidence, limits the overall impact and interpretability of the study.

      Overall, the reviewer #1 was not satisfied with our revision regardless of the authors’ responses. As detailed above in our responses to the replies (1)~(4), we believe that in the original authors’ responses and in the above-described responses we effectively responded to the criticisms by the reviewer #1.

      Reviewer #2 (Public review): 

      Comments on revisions: 

      The authors have addressed all of the reviewers' comments.

      We appreciate constructive and helpful comments made by the reviewer #2.

      Reviewer #3 (Public review): 

      Weaknesses:  

      Nevertheless, the manuscript currently reads as a sequence of discrete experiments rather than a single causal chain. Below, I outline the key points that should be addressed to make the model convincing.

      Please see the responses to the recommendation for the authors made by reviewer #3.

      Reviewer #3 (Recommendations for the authors):

      (1) Causality across the pathway  

      Each step (α2A internalisation, GIRK rundown, Ca<sup>2+</sup> rise, MAO-A/AEP upregulation) is demonstrated separately, but no experiment links them in a single preparation. Consider in vivo Ca<sup>2+</sup> or GRAB NE photometry during restraint stress while probing α2A levels with i.p. clonidine injection or optogenetic over excitation coupled to biochemical readouts. Such integrated evidence would help to overcome the correlational nature of the manuscript to a more mechanistic study. 

      Authors response: It is not possible to measure free cytosolic NA levels with GRAB NE photometry when α2A AR is internalized as described above (see the response to the comment made by reviewer #1 as the recommendation for the authors).

      The core idea behind my comment, as well as that of Reviewer 1, was to encourage integrating your individual findings into a more cohesive in vivo experiment. Using GRAB-NE to measure extracellular NA could serve as an indirect readout of NA uptake via NAT, and ultimately, cytosolic NA levels. Connecting these experiments would significantly strengthen the manuscript and enhance its overall impact. 

      It may be true that the measurement of extracellular NA could serve as an indirect readout of NA uptake via NAT, and ultimately of cytosolic NA levels. However, the reviewer #3 still misunderstands the applicability of the GRAB-NE method for detecting NA in our study. As described in the original authors’ response, there appears to be no fluorescent probe for labeling cytosolic NA at present. In particular, the GRAB-NE method recommended by the reviewers #1 and #3 can detect NA only when α2A-AR is expressed in the cell membrane. Therefore, when increases in cytosolic NA levels are caused by internalization of α2A-ARs, NA measurement with GRAB-NE photometry is not applicable.

      (2) Pharmacology and NE concentration  

      The use of 100 µM noradrenaline saturates α and β adrenergic receptors alike. Please provide ramp measurements of GIRK current in dose-response at 1-10 µM NE (blocked by atipamezole) to confirm that the rundown really reflects α2A activity rather than mixed receptor effects. 

      Authors response: It is true that 100 µM noradrenaline activates both α and β adrenergic receptors alike. However, it was clearly shown that enhancement of GIRK-I by 100 µM noradrenaline was completely antagonized by 10 µM atipamezole and that the Ca<sup>2+</sup> dependent rundown of NA-induced GIRK-I was prevented by 10 µM atipamezole. Considering the Ki values of atipamezole for α2A AR (=1~3 nM) (Vacher et al., 2010, J Med Chem) and β AR (>10 µM) (Virtanen et al., 1989, Arch Int Pharmacodyn Ther), these results reflect α2A AR activity and not β AR activity (Figure S5). Furthermore, because it is already well established that NA-induced GIRK-I is mediated by α2A AR activity in LC neurons (Arima et al., 1998, J Physiol; Williams et al., 1985, Neuroscience), it is not necessary to re-examine the effect of 1-10 µM NA on GIRK-I.

      While the milestone papers by Williams remain highly influential, they should be re-evaluated in light of more recent findings, given that they date back over 40 years. Advances in our understanding now allow for a more nuanced interpretation of some of their results. For example, see McKinney et al. (eLife, 2023). This study demonstrates that presynaptic β-adrenergic receptors-particularly β2-can enhance neuronal excitability via autocrine mechanisms. This suggests that your post-activation experiments using atipamezole may not fully exclude a contribution of β-adrenergic signaling. Such a role might become apparent when conducting more detailed titration experiments.

      The reviewer #3 may be misunderstanding the report by McKinney et al. (eLife, 2023). This paper did not demonstrate that presynaptic β-adrenergic receptors, particularly β2, can enhance neuronal excitability via autocrine mechanisms. It is impossible for LC neurons to increase their excitability by activating β-adrenergic receptors, as we have clearly shown that enhancement of GIRK-I by 100 µM noradrenaline was completely antagonized by 10 µM atipamezole. Considering the difference in Ki values of atipamezole for α2-AR (= 2~4 nM) (Vacher et al., 2010, J Med Chem) and β-AR (>10 µM) (Virtanen et al., 1989, Arch Int Pharmacodyn Ther), such a complete antagonization (of 100 µM NA-induced GIRK-I) by 10 µM atipamezole reflects α2A-AR activity and not β-AR activity (Figure S5). Furthermore, it is already well established that NA-induced GIRK-I is mediated by α2-AR activity in LC neurons (Arima et al., 1998, J Physiol). McKinney et al. (eLife, 2023) have just found the absence of lateral inhibition on adjacent LC neurons by NA autocrine caused respective spike activity. This has nothing to do with autoinhibition.

      (4) Age mismatch and disease claims 

      All electrophysiology and biochemical data come from juvenile (< P30) mice, yet the conclusions stress Alzheimer-related degeneration. Key endpoints need to be replicated in adult or aged mice, or the manuscript should soften its neurodegenerative scope. 

      Authors response: As described in the section of Conclusion, we never stress Alzheimer-related degeneration, but might give such an impression. To avoid such a misunderstanding, we have added a description “However, the present mechanism must be proven to be valid in adult or old mice, to validate its involvement in the pathogenesis of AD.” (R1, page 14, lines 448-450).

      It would be great to see this experiment performed in aged mice-you are the one who has everything in place to do it right now! 

      In our future separate studies, we would like to prove that the present mechanism is valid in aged mice, to validate its involvement in the pathogenesis of AD. This is partly because the patch-clamp study in aged mice is extremely difficult and takes much time.

      Authors response: In the abstract, you suggest that internalization of α2A-adrenergic receptors could represent a therapeutic target for Alzheimer's disease. "...Thus, it is likely that internalization of α2A-AR increased cytosolic NA, as reflected in AEP increases, by facilitating reuptake of autocrine-released NA. The suppression of α2A-AR internalization may have a translational potential for AD treatment."

      α2A-AR internalization was involved in the degeneration of LC neurons. Because we confirmed that spike-frequency adaptation reflecting α2A-AR-mediated autoinhibition can be induced in adult mice as prominently as in juvenile mice (Figure S10), it is reasonable to suggest that the suppression of α2A-AR internalization may have a translational potential for anxiety/AD treatment (see Discussion; R2, page 14, lines 445-449).

      (6) Quantitative histology  

      Figure 5 presents attractive images, but no numerical analysis is provided. Please provide ROI-based fluorescence quantification (with n values) or move the images to the supplement and rely on the Western blots. 

      Author response: We have moved the immunohistochemical results in Fig. 5 to the supplement, as we believe the quantification of immunohistochemical staining is not necessarily correct.   

      What do you mean by that " ...immunohistochemical staining is not necessarily correct."  

      It is evident that in terms of quantification, Western blot analysis is a more accurate method than immunohistochemical staining. In this sense, it is the contention of our study that the ROI-based fluorescence quantification of immunohistochemical staining is not necessarily an accurate or correct procedure, compared to the quantification by Western blot analysis.

    1. Author response:

      Notes to Editors

      We previously received comments from three reviewers at Biological Psychiatry, which we have addressed in detail below. The following is a summary of the reviewers’ comments along with our responses.

      Reviewers 1 and 2 sought clearer justification for studying the cognition-mental health overlap (covariation) and its neuroimaging correlates. In the revised manuscripts, we expanded the Introduction and Discussion to explicitly outline the theoretical implications of investigating this overlap with machine learning. We also added nuance to the interpretation of the observed associations.

      Reviewer 1 raised concerns about the accessibility of the machine learning methodology for readers without expertise in this field. We revised the Methods section to provide a clearer, step-by-step explanation of our machine learning approach, particularly the two-level machine learning through stacking. We also enhanced the description of the overall machine learning design, including model training, validation, and testing.

      In response to Reviewer 2’s request for deeper interpretation of our findings and stronger theoretical grounding, we have expanded our discussion by incorporating a thorough interpretation of how mental health indices relate to cognition, material that was previously included only in supplementary materials due to word limit constraints. We have further strengthened the theoretical justification for our study design, with particular emphasis on the importance of examining shared variance between cognition and mental health through the derivation of neural markers of cognition. Additionally, to enhance the biological interpretation of our results, we included new analyses of feature importance across neuroimaging modalities, providing clearer insights into which neural features contribute most to the observed relationships.

      Notably, Reviewer 3 acknowledged the strength of our study, including multimodal design, robust analytical approach, and clear visualization and interpretation of results. Their comments were exclusively methodological, underscoring the manuscript’s quality.

      Reviewer 1:

      The authors try to bridge mental health characteristics, global cognition and various MRI-derived (structural, diffusion and resting state fMRI) measures using the large dataset of UK Biobank. Each MRI modality alone explained max 25% of the cognition-mental health covariance, and when combined together 48% of the variance could be explained. As a peer-reviewer not familiar with the used methods (machine learning, although familiar with imaging), the manuscript is hard to read and I wonder what the message for the field might be. In the end of the discussion the authors state '... we provide potential targets for behavioural and physiological interventions that may affect cognition', the real relevance (and impact) of the findings is unclear to me.

      Thank you for your thorough review and practical recommendations. We appreciate your constructive comments and suggestions and hope our revisions adequately address your concerns.

      Major questions

      (1) The methods are hard to follow for people not in this specific subfield, and therefore, I expect that for readers it is hard to understand how valid and how useful the approach is.

      Thank you for your comment. To enhance accessibility for readers without a machine learning background, we revised the Methods section to clarify our analyses while retaining important technical details needed to understand our approach. Recognizing that some concepts may require prior knowledge, we provide detailed explanations of each analysis step, including the machine learning pipeline in the Supplementary Methods.

      Line 188: “We employed nested cross-validation to predict cognition from mental health indices and 72 neuroimaging phenotypes (Fig. 1). Nested cross-validation is a robust method for evaluating machine-learning models while tuning their hyperparameters, ensuring that performance estimates are both accurate and unbiased. Here, we used a nested cross-validation scheme with five outer folds and ten inner folds.

      We started by dividing the entire dataset into five outer folds. Each fold took a turn being held out as the outer-fold test set (20% of the data), while the remaining four folds (80% of the data) were used as an outer-fold training set. Within each outer-fold training set, we performed a second layer of cross-validation – this time splitting the data into ten inner folds. These inner folds were used exclusively for hyperparameter tuning: models were trained on nine of the inner folds and validated on the remaining one, cycling through all ten combinations.

      We then selected the hyperparameter configuration that performed best across the inner-fold validation sets, as determined by the minimal mean squared error (MSE). The model was then retrained on the full outer-fold training set using this hyperparameter configuration and evaluated on the outer-fold test set, using four performance metrics: Pearson r, the coefficient of determination (R<sup>2</sup>), the mean absolute error (MAE), and the MSE. This entire process was repeated for each of the five outer folds, ensuring that every data point is used for both training and testing, but never at the same time. We opted for five outer folds instead of ten to reduce computational demands, particularly memory and processing time, given the substantial volume of neuroimaging data involved in model training. Five outer folds led to an outer-fold test set of at least n = 4 000, which should be sufficient for model evaluation. In contrast, we retained ten inner folds to ensure robust and stable hyperparameter tuning, maximising the reliability of model selection.

      To model the relationship between mental health and cognition, we employed Partial Least Squares Regression (PLSR) to predict the g-factor from 133 mental health variables. To model the relationship between neuroimaging data and cognition, we used a two-step stacking approach [15–17,61] to integrate information from 72 neuroimaging phenotypes across three MRI modalities. In the first step, we trained 72 base (first-level) PLSR models, each predicting the g-factor from a single neuroimaging phenotype. In the second step, we used the predicted values from these base models as input features for stacked models, which again predicted the g-factor. We constructed four stacked models based on the source of the base predictions: one each for dwMRI, rsMRI, sMRI, and a combined model incorporating all modalities (“dwMRI Stacked”, “rsMRI Stacked”, “sMRI Stacked”, and “All MRI Stacked”, respectively). Each stacked model was trained using one of four machine learning algorithms – ElasticNet, Random Forest, XGBoost, or Support Vector Regression – selected individually for each model (see Supplementary Materials, S6).

      For rsMRI phenotypes, we treated the choice of functional connectivity quantification method – full correlation, partial correlation, or tangent space parametrization – as a hyperparameter. The method yielding the highest performance on the outer-fold training set was selected for predicting the g-factor (see Supplementary Materials, S5).

      To prevent data leakage, we standardized the data using the mean and standard deviation derived from the training set and applied these parameters to the corresponding test set within each outer fold. This standardization was performed at three key stages: before g-factor derivation, before regressing out modality-specific confounds from the MRI data, and before stacking. Similarly, to maintain strict separation between training and testing data, both base and stacked models were trained exclusively on participants from the outer-fold training set and subsequently applied to the corresponding outer-fold test set.

      To evaluate model performance and assess statistical significance, we aggregated the predicted and observed g-factor values from each outer-fold test set. We then computed a bootstrap distribution of Pearson’s correlation coefficient (r) by resampling with replacement 5 000 times, generating 95% confidence intervals (CIs) (Fig. 1). Model performance was considered statistically significant if the 95% CI did not include zero, indicating that the observed associations were unlikely to have occurred by chance.”
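
      The nested cross-validation scheme quoted above can be sketched in a few lines. The following is a minimal, illustrative sketch and not the authors' pipeline: it swaps in a one-dimensional ridge regression on synthetic data for the PLSR and stacked models, and all function names (`make_folds`, `fit_ridge`, `nested_cv`), the penalty grid, and the simulated data are assumptions chosen for the example. It keeps the structure described in the text: five outer folds, ten inner folds used only for hyperparameter selection by minimal MSE, and standardisation parameters estimated on training data only.

```python
import random
from statistics import mean, pstdev


def make_folds(indices, k, rng):
    """Shuffle the indices and split them into k near-equal folds."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def fit_ridge(x, y, lam):
    """1-D ridge fit; x is standardised with training mean/SD only (no leakage)."""
    mx, sx = mean(x), pstdev(x) or 1.0
    my = mean(y)
    z = [(v - mx) / sx for v in x]
    slope = sum(a * (b - my) for a, b in zip(z, y)) / (sum(a * a for a in z) + lam)
    # The returned predictor re-uses the training mean/SD on unseen data.
    return lambda v: my + slope * (v - mx) / sx


def mse(model, x, y):
    return mean((model(a) - b) ** 2 for a, b in zip(x, y))


def nested_cv(x, y, lambdas, outer_k=5, inner_k=10, seed=0):
    rng = random.Random(seed)
    outer = make_folds(list(range(len(x))), outer_k, rng)
    results = []
    for i, test_idx in enumerate(outer):
        # Outer-fold training set: every outer fold except the held-out one.
        train_idx = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner = make_folds(train_idx, inner_k, rng)

        def inner_score(lam):
            """Mean validation MSE of one penalty value across the inner folds."""
            scores = []
            for v, val_idx in enumerate(inner):
                fit_idx = [j for f, fold in enumerate(inner) if f != v for j in fold]
                m = fit_ridge([x[j] for j in fit_idx], [y[j] for j in fit_idx], lam)
                scores.append(mse(m, [x[j] for j in val_idx], [y[j] for j in val_idx]))
            return mean(scores)

        best = min(lambdas, key=inner_score)
        # Retrain on the full outer-fold training set, then test once.
        final = fit_ridge([x[j] for j in train_idx], [y[j] for j in train_idx], best)
        results.append({
            "test_idx": test_idx,
            "best_lambda": best,
            "test_mse": mse(final, [x[j] for j in test_idx], [y[j] for j in test_idx]),
        })
    return results


# Synthetic data (assumption): a weak linear signal plus noise.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(200)]
y = [0.4 * v + data_rng.gauss(0, 1) for v in x]
lambdas = [0.0, 1.0, 10.0, 100.0]
results = nested_cv(x, y, lambdas)
for fold, r in enumerate(results):
    print(fold, r["best_lambda"], round(r["test_mse"], 3))
```

      Because the inner folds are drawn only from the outer-fold training set, the outer-fold test set never influences the choice of penalty, which is the property that keeps the reported performance estimate unbiased.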

      (2) If only 40% of the cognition-mental health covariation can be explained by the MRI variables, how to explain the other 60% of the variance? And related to this %: why do the author think that 'this provides us confidence in using MRI to derive quantitative neuromarkers of cognition'?

      Thank you for this insightful observation. Using the MRI modalities available in the UK Biobank, we were able to account for 48% of the covariation between cognition and mental health. The remaining 52% of unexplained variance may arise from several sources. One possibility is the absence of certain neuroimaging modalities in the UK Biobank dataset, such as task-based fMRI contrasts, positron emission tomography, arterial spin labeling, and magnetoencephalography/electroencephalography. Prior research from our group and others has consistently demonstrated strong predictive performance from specific task-based fMRI contrasts, particularly those derived from tasks like the n-Back working memory task and the face-name episodic memory task, none of which is available in the UK Biobank.

      Moreover, there are inherent limitations in using MRI as a proxy for brain structure and function. Measurement error and intra-individual variability, such as differences in a cognitive state between cognitive assessments and MRI acquisition, may also contribute to the unexplained variance. According to the Research Domain Criteria (RDoC) framework, brain circuits represent only one level of neurobiological analysis relevant to cognition. Other levels, including genes, molecules, cells, and physiological processes, may also play a role in the cognition-mental health relationship.

      Nonetheless, neuroimaging provides a valuable window into the biological mechanisms underlying this overlap – insights that cannot be gleaned from behavioural data alone. We have now incorporated these considerations into the Discussion section.

      Line 658: “Although recent debates [18] have challenged the predictive utility of MRI for cognition, our multimodal marker integrating 72 neuroimaging phenotypes captures nearly half of the mental health-explained variance in cognition. We demonstrate that neural markers with greater predictive accuracy for cognition also better explain cognition-mental health covariation, showing that multimodal MRI can capture both a substantial cognitive variance and nearly half of its shared variance with mental health. Finally, we show that our neuromarkers explain a substantial portion of the age- and sex-related variance in the cognition-mental health relationship, highlighting their relevance in modeling cognition across demographic strata.

      The remaining unexplained variance in the relationship between cognition and mental health likely stems from multiple sources. One possibility is the absence of certain neuroimaging modalities in the UK Biobank dataset, such as task-based fMRI contrasts, positron emission tomography, arterial spin labeling, and magnetoencephalography/electroencephalography. Prior research has consistently demonstrated strong predictive performance from specific task-based fMRI contrasts, particularly those derived from tasks like the n-Back working memory task and the face-name episodic memory task, none of which is available in the UK Biobank [15,17,61,69,114,142,151].

      Moreover, there are inherent limitations in using MRI as a proxy for brain structure and function. Measurement error and intra-individual variability, such as differences in a cognitive state between cognitive assessments and MRI acquisition, may also contribute to the unexplained variance. According to the RDoC framework, brain circuits represent only one level of neurobiological analysis relevant to cognition [14]. Other levels, including genes, molecules, cells, and physiological processes, may also play a role in the cognition-mental health relationship.

      Nonetheless, neuroimaging provides a valuable window into the biological mechanisms underlying this overlap – insights that cannot be gleaned from behavioural data alone. Ultimately, our findings validate brain-based neural markers as a fundamental neurobiological unit of analysis, advancing our understanding of mental health through the lens of cognition.”

      Regarding our confidence in using MRI to derive neural markers for cognition, we base this on the predictive performance of MRI-based models. As we note in the Discussion (Line 554: “Consistent with previous studies, we show that MRI data predict individual differences in cognition with a medium-size performance (r ≈ 0.4) [15–17, 28, 61, 67, 68].”), the medium effect size we observed (r ≈ 0.4) agrees with existing literature on brain-cognition relationships, confirming that machine learning leads to replicable results. This effect size represents a moderate yet meaningful association in neuroimaging studies of aging, consistent with reports linking brain to behaviour in adults (Krämer et al., 2024; Tetereva et al., 2022). For example, a recent meta-analysis by Vieira and colleagues (2022) reported a similar effect size (r = 0.42, 95% CI [0.35;0.50]). Our study includes over 15000 participants, comparable to or more than typical meta-analyses, allowing us to characterise our work as a “mega-analysis”. And on top of this predictive performance, we found our neural markers for cognition to capture half of the cognition-mental health covariation, boosting our confidence in our approach.

      Krämer C, Stumme J, da Costa Campos L, Dellani P, Rubbert C, Caspers J, et al. Prediction of cognitive performance differences in older age from multimodal neuroimaging data. GeroScience. 2024;46:283–308.

      Tetereva A, Li J, Deng JD, Stringaris A, Pat N. Capturing brain cognition relationship: Integrating task‐based fMRI across tasks markedly boosts prediction and test‐retest reliability. NeuroImage. 2022;263:119588.

      (3) Imagine that we can increase the explained variance using multimodal MRI measures, why is it useful? What does it learn us? What might be the implications?

      We assume that by variance, Reviewer 1 referred to the cognition-mental health covariation mentioned in point 2) above.

      If we can increase the explained cognition-mental health covariation using multimodal MRI measures, it would mean that we have developed a reasonable neuromarker that is close to RDoC’s neurobiological unit of analysis for cognition. RDoC treats cognition as one of the main basic functional domains that transdiagnostically underlie mental health. According to RDoC, mental health should be studied in relation to cognition, alongside other domains such as negative and positive valence systems, arousal and regulatory systems, social processes, and sensorimotor functions. RDoC further emphasizes that each domain, including cognition, should be investigated not only at the behavioural level but also through its neurobiological correlates. This means RDoC aims to discover neural markers of cognition that explain the covariation between cognition and mental health. We approach the development of such neural markers using multimodal neuroimaging. We have now explained the motivation of our study in the first paragraph of the Introduction.

      Line 43: “Cognition and mental health are closely intertwined [1]. Cognitive dysfunction is present in various mental illnesses, including anxiety [2, 3], depression [4–6], and psychotic disorders [7–12]. National Institute of Mental Health’s Research Domain Criteria (RDoC) [13,14] treats cognition as one of the main basic functional domains that transdiagnostically underlie mental health. According to RDoC, mental health should be studied in relation to cognition, alongside other domains such as negative and positive valence systems, arousal and regulatory systems, social processes, and sensorimotor functions. RDoC further emphasizes that each domain, including cognition, should be investigated not only at the behavioural level but also through its neurobiological correlates. In this study, we aim to examine how the covariation between cognition and mental health is reflected in neural markers of cognition, as measured through multimodal neuroimaging.”

      More specific issues:

      Introduction

      (4) In the intro the sentence 'in some cases, altered cognitive functioning is directly related to psychiatric symptom severity' is in contrast to the next sentence '... are often stable and persist upon alleviation of psychiatric symptoms'.

      Thank you for pointing this out. The first sentence refers to cases where cognitive deficits fluctuate with symptom severity, while the second emphasizes that core cognitive impairments often remain stable even during symptom remission. To avoid this confusion, we have removed these sentences.

      (5) In the intro the text on the methods (various MRI modalities) is not needed for the Biol Psych readers audience.

      We appreciate your comment. While some members of our target audience may have backgrounds in neuroimaging, machine learning, or psychiatry, we recognize that not all readers will be familiar with all three areas. To ensure accessibility for those who are not familiar with neuroimaging, we included a brief overview of the MRI modalities and quantification methods used in our study to provide context for the specific neuroimaging phenotypes. Additionally, we provided background information on the machine learning techniques employed, so that readers without a strong background in machine learning can still follow our methodology.

      (6) Regarding age of the study sample: I understand that at recruitment the subjects' age ranges from 40 to 69 years. At MRI scanning the age ranges between about 46 to 82. How is that possible? And related to the age of the population: how did the authors deal with age in the analyses, since age is affecting both cognition as the brain measures?

      Thank you for noticing this. In the Methods section, we first outline the characteristics of the UK Biobank cohort, including the age at first recruitment (40-69 years). Table 1 then shows the characteristics of participant subsamples included in each analysis. Since our study used data from Instance 2 (the second in-person visit), participants were approximately 5-13 years older at scanning, resulting in the age range of 46 to 82 years. We clarified the Table 1 caption as follows:

      Line 113: “Table 1. Demographics for each subsample analysed: number, age, and sex of participants who completed all cognitive tests, mental health questionnaires, and MRI scanning”

      We acknowledge that age may influence cognitive and neuroimaging measures. In our analyses, we intentionally preserved age-related variance in brain-cognition relationships across mid and late adulthood, as regressing out age completely would artificially remove biologically meaningful associations. At the same time, we rigorously addressed the effects of age and sex through additional commonality analyses quantifying age and sex contributions to the relationship between cognition and mental health.

      As noted by Reviewer 1 and illustrated in Figure 8, age and sex shared substantial overlapping variance with both mental health and neuroimaging phenotypes in explaining cognitive outcomes. For example, in Figure 8i, age and sex together accounted for 43% of the variance in the cognition-mental health relationship:

      (2.76 + 1.03) / (2.76 + 1.03 + 3.52 + 1.45) ≈ 0.43

      Furthermore, neuromarkers from the all-MRI stacked model explained 72% of this age/sex-related variance:

      2.76 / (2.76 + 1.03) ≈ 0.72

      This indicates that our neuromarkers captured a substantial portion of the cognition-mental health covariation that varied with age and sex, highlighting their relevance in age/sex-sensitive cognitive modeling.
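
      The two ratios above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the analysis code used in the study; the variable names are illustrative labels inferred from how each Figure 8i component enters the two ratios.

```python
def proportion(part, whole):
    """Return part / whole, refusing an empty denominator."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    return part / whole


# Components quoted from Figure 8i (percent of variance in cognition).
# The labels below are assumptions inferred from the two ratios in the text.
age_sex_captured_by_mri = 2.76      # age/sex-related, also captured by neuromarkers
age_sex_not_captured_by_mri = 1.03  # age/sex-related, not captured by neuromarkers
not_age_sex_related = [3.52, 1.45]  # remaining components of the relationship

age_sex_related = age_sex_captured_by_mri + age_sex_not_captured_by_mri
total = age_sex_related + sum(not_age_sex_related)

# Share of the cognition-mental health relationship related to age and sex
age_sex_share = proportion(age_sex_related, total)
# Share of that age/sex-related part that the neuromarkers also capture
mri_share = proportion(age_sex_captured_by_mri, age_sex_related)

print(f"age/sex share: {age_sex_share:.3f}")
print(f"MRI share of age/sex-related variance: {mri_share:.3f}")
```

      Because the inputs are the rounded values shown in the figure, the ratios come out near 0.433 and 0.728.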

      In the Methods, Results, and Discussion, we say:

      Methods

      Line 263: “To understand how demographic factors, including age and sex, contribute to this relationship, we also conducted a separate set of commonality analyses treating age, sex, age², age×sex, and age²×sex as an additional set of explanatory variables (Fig. 1).”

      Results

      Line 445: “Age and sex shared substantial overlapping variance with both mental health and neuroimaging in explaining cognition, accounting for 43% of the variance in the cognition-mental health relationship. Multimodal neural marker of cognition based on three MRI modalities (“All MRI Stacked”) explained 72% of this age and sex-related variance (Fig. 8i–l and Table S21).”

      Discussion

      Line 660: “We demonstrate that neural markers with greater predictive accuracy for cognition also better explain cognition-mental health covariation, showing that multimodal MRI can capture both a substantial cognitive variance and nearly half of its shared variance with mental health. Finally, we show that our neuromarkers explain a substantial portion of the age- and sex-related variance in the cognition-mental health relationship, highlighting their relevance in modeling cognition across demographic strata.”

      (7) Regarding the mental health variables: were characteristics with positive value (e.g. happiness and subjective wellbeing) reverse-scored (compared to the negative items, such as anxiety, addiction, etc.)?

      We appreciate you noting this. These composite scores primarily represent standard clinical measures such as the GAD-7 anxiety scale and N-12 neuroticism scale. We did not reverse the scores, which preserves their original directionality and keeps their interpretation consistent with the studies from which they were derived (e.g., Davis et al., 2020; Dutt et al., 2022). Complete descriptive statistics for all mental health indices and detailed derivation procedures are provided in the Supplementary Materials (S2). On Page 6, Supplementary Methods, we say:

      Line 92: “Composite mental health scores included the Generalized Anxiety Disorder (GAD-7), the Posttraumatic Stress Disorder (PTSD) Checklist (PCL-6), the Alcohol Use Disorders Identification Test (AUDIT), the Patient Health Questionnaire (PHQ-9) [12], the Eysenck Neuroticism (N-12), Probable Depression Status (PDS), and the Recent Depressive Symptoms (RDS-4) scores [13, 14]. To calculate the GAD-7, PCL-6, AUDIT, and PHQ-9, we used questions introduced at the online follow-up [12]. To obtain the N-12, PDS, and RDS-4 scores [14], we used data collected during the baseline assessment [13, 14].

      We subcategorized depression and GAD based on frequency, current status (ever had depression or anxiety and current status of depression or anxiety), severity, and clinical diagnosis (depression or anxiety confirmed by a healthcare practitioner). Additionally, we differentiated between different depression statuses, such as recurrent depression, depression triggered by loss, etc. Variables related to self-harm were subdivided based on whether a person has ever self-harmed with the intent to die.

      To make response scales more intuitive, we recoded responses within the well-being domain such that the lower score corresponded to a lesser extent of satisfaction (“Extremely unhappy”) and the higher score indicated a higher level of happiness (“Extremely happy”). For all questions, we assigned the median values to “Prefer not to answer” (-818 for in-person assessment and -3 for online questionnaire) and “Do not know” (-121 for in-person assessment and -1 for online questionnaire) responses. We excluded the “Work/job satisfaction” question from the mental health derivatives list because it included a “Not employed” response option, which could not be reasonably coded.

      To calculate the risk of PTSD, we used questions from the PCL-6 questionnaire. Following Davis and colleagues [12], PCL-6 scores ranged from 6 to 29. A PCL-6 score of 12 or below corresponds to a low risk of meeting the Clinician-Administered PTSD Scale diagnostic criteria. PCL-6 scores between 13 and 16 and between 17 and 25 are indicative of an increased risk and high risk of PTSD, respectively. A score of above 26 is interpreted as a very high risk of PTSD [12, 15]. PTSD status was set to positive if the PCL-6 score exceeded or was equal to 14 and encompassed stressful events instead of catastrophic trauma alone [12].

      To assess alcohol consumption, alcohol dependence, and harm associated with drinking, we calculated the sum of the ten questions from the AUDIT questionnaire [16]. We additionally subdivided the AUDIT score into the alcohol consumption score (questions 1-3, AUDIT-C) and the score reflecting problems caused by alcohol (questions 4-10, AUDIT-P) [17]. In questions 2-10 that followed the first trigger question (“Frequency of drinking alcohol”), we replaced missing values with 0 as they would correspond to a “Never” response to the first question.

      An AUDIT score cut-off of 8 suggests moderate or low-risk alcohol consumption, and scores of 8 to 15 and above 15 indicate severe/harmful and hazardous (alcohol dependence or moderate-severe alcohol use disorder) drinking, respectively [16, 18]. Subsequently, hazardous alcohol use and alcohol dependence status correspond to AUDIT scores of ≥ 8 and ≥ 15, respectively. The “Alcohol dependence ever” status was set to positive if a participant had ever been physically dependent on alcohol. To reduce skewness, we log(x + 1)-transformed the AUDIT, AUDIT-C, and AUDIT-P scores [17].”
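
      The AUDIT scoring steps quoted above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name, the input layout (a list of ten item scores with None for missing values), and the use of the natural logarithm for the log(x + 1) transform are all assumptions.

```python
import math


def audit_scores(answers):
    """Compute AUDIT, AUDIT-C and AUDIT-P from ten item scores.

    `answers` is a hypothetical input format: ten numbers, question 1
    first, with None marking a missing value.
    """
    if len(answers) != 10:
        raise ValueError("AUDIT has ten questions")
    if answers[0] is None:
        raise ValueError("question 1 is missing; this sketch does not impute it")

    # Questions 2-10: missing values are replaced with 0, as described in the text
    filled = [answers[0]] + [0 if a is None else a for a in answers[1:]]

    total = sum(filled)        # all ten questions
    audit_c = sum(filled[:3])  # questions 1-3 (consumption)
    audit_p = sum(filled[3:])  # questions 4-10 (problems caused by alcohol)

    return {
        "AUDIT": total,
        "AUDIT-C": audit_c,
        "AUDIT-P": audit_p,
        # log(x + 1) transform; natural log is an assumption
        "log_AUDIT": math.log1p(total),
        "log_AUDIT-C": math.log1p(audit_c),
        "log_AUDIT-P": math.log1p(audit_p),
    }


example = audit_scores([4, 3, 2, 1, None, 0, 1, None, 2, 0])
print(example["AUDIT"], example["AUDIT-C"], example["AUDIT-P"])
```

      Using log1p keeps a score of 0 at 0 after the transform, so participants who never drink remain at the bottom of the scale.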

      Davis KAS, Coleman JRI, Adams M, Allen N, Breen G, Cullen B, et al. Mental health in UK Biobank – development, implementation and results from an online questionnaire completed by 157 366 participants: a reanalysis. BJPsych Open. 2020;6:e18.

      Dutt RK, Hannon K, Easley TO, Griffis JC, Zhang W, Bijsterbosch JD. Mental health in the UK Biobank: A roadmap to self-report measures and neuroimaging correlates. Hum Brain Mapp. 2022;43:816–832.

      (8) In the discussion section (page 23, line 416-421), the authors refer to specific findings that are not described in the results section > I would add these findings to the main manuscript (including the discussion / interpretation).

      We appreciate your careful reading. We agree that our original Results section did not explicitly describe the factor loadings for mental health in the PLSR model, despite discussing their implications later in the paper. We needed to include this part of the discussion in the Supplementary Materials to meet the word limit of the original submission. However, in response to your suggestion, we have now added the results regarding factor loadings to the Results section. We also moved the discussion of the association between mental health features and general cognition from the Supplementary Material to the manuscript’s Discussion.

      Results

      Line 298: “On average, information about mental health predicted the g-factor at mean R² = 0.10 and mean r = 0.31 (95% CI [0.291, 0.315]; Fig. 2b and 2c and Supplementary Materials, S9, Table S12). The magnitude and direction of factor loadings for mental health in the PLSR model allowed us to quantify the contribution of individual mental health indices to cognition. Overall, the scores for mental distress, alcohol and cannabis use, and self-harm behaviours relate positively, and the scores for anxiety, neurological and mental health diagnoses, unusual or psychotic experiences, happiness and subjective well-being, and negative traumatic events relate negatively to cognition.”

      Discussion

      Line 492: “Factor loadings derived from the PLSR model showed that the scores for mental distress, alcohol and cannabis use, and self-harm behaviours related positively, and the scores for anxiety, neurological and mental health diagnoses, unusual or psychotic experiences, happiness and subjective well-being, and negative traumatic events related negatively to the g-factor. Positive PLSR loadings of features related to mental distress may indicate greater susceptibility to or exaggerated perception of stressful events, psychological overexcitability, and predisposition to rumination in people with higher cognition [72]. On the other hand, these findings may be specific to the UK Biobank cohort and the way the questions for this mental health category were constructed. In particular, to evaluate mental distress, the UK Biobank questionnaire asked whether an individual sought or received medical help for or suffered from mental distress. In this regard, the estimate for mental distress may be more indicative of whether an individual experiencing mental distress had an opportunity or aspiration to visit a doctor and seek professional help [73]. Thus, people with better cognitive abilities and also with a higher socioeconomic status may indeed be more likely to seek professional help.

      Limited evidence supports a positive association between self-harm behaviours and cognitive abilities, with some studies indicating higher cognitive performance as a risk factor for non-suicidal self-harm. Research shows an inverse relationship between cognitive control of emotion and suicidal behaviours that weakens over the life course [73,74]. Some studies have found a positive correlation between cognitive abilities and the risk of nonsuicidal self-harm, suicidal thoughts, and suicidal plans that may be independent of or, conversely, affected by socioeconomic status [75,76]. In our study, the magnitude of the association between self-harm behaviours and cognition was low (Fig. 2), indicating a weak relationship.

      Positive PLSR loadings of features related to alcohol and cannabis may also indicate the influence of other factors. Overall, this relationship is believed to be largely affected by age, income, education, social status, social equality, social norms, and quality of life [79–80]. For example, education level and income correlate with cognitive ability and alcohol consumption [79,81–83]. Research also links a higher probability of having tried alcohol or recreational drugs, including cannabis, to a tendency of more intelligent individuals to approach evolutionarily novel stimuli [84,85]. This hypothesis is supported by studies showing that cannabis users perform better on some cognitive tasks [86]. Alternatively, frequent drinking can indicate higher social engagement, which is positively associated with cognition [87]. Young adults often drink alcohol as a social ritual in university settings to build connections with peers [88]. In older adults, drinking may accompany friends or family visits [89,90]. Mixed evidence on the link between alcohol and drug use and cognition makes it difficult to draw definite conclusions, leaving an open question about the nature of this relationship.

      Consistent with previous studies, we showed that anxiety and negative traumatic experiences were inversely associated with cognitive abilities [90–93]. Anxiety may be linked to poorer cognitive performance via reduced working memory capacity, increased focus on negative thoughts, and attentional bias to threatening stimuli that hinder the allocation of cognitive resources to a current task [94–96]. Individuals with PTSD consistently showed impaired verbal and working memory, visual attention, inhibitory function, task switching, cognitive flexibility, and cognitive control [97–100]. Exposure to traumatic events that did not reach the PTSD threshold was also linked to impaired cognition. For example, childhood trauma is associated with worse performance in processing speed, attention, and executive function tasks in adulthood, and age at a first traumatic event is predictive of the rate of executive function decline in midlife [101,102]. In the UK Biobank cohort, adverse life events have been linked to lower cognitive flexibility, partially via depression level [103].

      In agreement with our findings, cognitive deficits are often found in psychotic disorders [104,105]. We treated neurological and mental health symptoms as predictor variables and did not stratify or exclude people based on psychiatric status or symptom severity. Since no prior studies have examined isolated psychotic symptoms (e.g., recent unusual experiences, hearing unreal voices, or seeing unreal visions), we avoid speculating on how these symptoms relate to cognition in our sample.

      Finally, negative PLSR loadings of the features related to happiness and subjective well-being may be specific to the study cohort, as these findings do not agree with some previous research [107–109]. On the other hand, our results agree with the study linking excessive optimism or optimistic thinking to lower cognitive performance in memory, verbal fluency, fluid intelligence, and numerical reasoning tasks, and suggesting that pessimism or realism indicates better cognition [110]. The concept of realism/optimism as indicators of cognition is a plausible explanation for a negative association between the g-factor and friendship satisfaction, as well as a negative PLSR loading of feelings that life is meaningful, especially in older adults who tend to reflect more on the meaning of life [111]. The latter is supported by the study showing a negative association between cognitive function and the search for the meaning of life and a change in the pattern of this relationship after the age of 60 [112]. Finally, a UK Biobank study found a positive association of happiness with speed and visuospatial memory but a negative relationship with reasoning ability [113].”

      (9) In the discussion section (page 24, line 440-449), the authors give an explanation on why the diffusion measures have limited utility, but the arguments put forward also concern structural and rsfMRI measures.

      Thank you for this important observation. Indeed, the argument about voxel-averaged diffusion components (“… these metrics are less specific to the properties of individual white matter axons or bundles, and instead represent a composite of multiple diffusion components averaged within a voxel and across major fibre pathways”) could theoretically apply across other MRI modalities. We have therefore removed this point from the discussion to avoid overgeneralization. However, we maintain our central argument about the biological specificity of conventional tractography-derived diffusion metrics as their particular sensitivity to white matter microstructure (e.g., axonal integrity, myelin content) may make them better suited for detecting neuropathological changes than dynamic cognitive processes. This interpretation aligns with the mixed evidence linking these metrics to cognitive performance, despite their established utility in detecting white matter abnormalities in clinical populations (e.g., Bergamino et al., 2021; Silk et al., 2009). We clarify this distinction in the manuscript.

      Line 572: “The somewhat limited utility of diffusion metrics derived specifically from probabilistic tractography in serving as robust quantitative neuromarkers of cognition and its shared variance with mental health may stem from their greater sensitivity and specificity to neuronal integrity and white matter microstructure rather than to dynamic cognitive processes. Critically, probabilistic tractography may be less effective at capturing relationships between white matter microstructure and behavioural scores cross-sectionally, as this method is more sensitive to pathological changes or dynamic microstructural alterations like those occurring during maturation. While these indices can capture abnormal white matter microstructure in clinical populations such as Alzheimer’s disease, schizophrenia, or attention deficit hyperactivity disorder (ADHD) [117–119], the empirical evidence on their associations with cognitive performance is controversial [114, 120–126].”

      Bergamino M, Walsh RR, Stokes AM. Free-water diffusion tensor imaging improves the accuracy and sensitivity of white matter analysis in Alzheimer’s disease. Sci Rep. 2021;11:6990.

      Silk TJ, Vance A, Rinehart N, Bradshaw JL, Cunnington R. White-matter abnormalities in attention deficit hyperactivity disorder: a diffusion tensor imaging study. Hum Brain Mapp. 2009;30:2757–2765.

      Reviewer 2:

      This is an interesting study combining a lot of data to investigate the link between cognition and mental health. The description of the study is very clear, it's easy to read for someone like me who does not have a lot of expertise in machine learning.

      We thank you for your thorough review and constructive feedback. Your insightful comments have helped us identify conceptual and methodological aspects that required improvement in the manuscript. We have incorporated relevant changes throughout the paper, and below, we address each of your points in detail.

      Comment 1: My main concern with this manuscript is that it is not yet clear to me what it exactly means to look at the overlap between cognition and mental health. This relation is r=0.3 which is not that high, so why is it then necessary to explain this overlap with neuroimaging measures? And, could it be that the relation between cognition and mental health is explained by third variables (environment? opportunities?). In the introduction I miss an explanation of why it is important to study this and what it will tell us, and in the discussion I would like to read some kind of 'answer' to these questions.

      Thank you. It’s important to clarify why we investigated the relationship between cognition and mental health, and what we found using data from the UK Biobank.

      Conceptually, our work is grounded in the Research Domain Criteria (RDoC; Insel et al., 2010) framework. RDoC conceptualizes mental health not through traditional diagnostic categories, but through core functional domains that span the full spectrum from normal to abnormal functioning. These domains include cognition, negative and positive valence systems, arousal and regulatory systems, social processes, and sensorimotor functions. Within this framework, cognition is considered a fundamental domain that contributes to mental health across diagnostic boundaries. Meta-analytic evidence supports a link between cognitive functioning and mental health (Abramovitch et al., 2021; East-Richard et al., 2020). In the context of a large, population-based dataset like the UK Biobank, this implies that cognitive performance – as measured by various cognitive tasks – should be meaningfully associated with available mental health indicators.

      However, because cognition is only one of several functional domains implicated in mental health, we do not expect the covariation between cognition and mental health to be very high. Other domains, such as negative and positive valence systems, arousal and regulatory systems, or social processing, may also play significant roles. Theoretically, this places an upper bound on the strength of the cognition-mental health relationship, especially in normative, nonclinical samples.

      Our current findings from the UK Biobank reflect this. Most of the 133 mental health variables showed relatively weak individual correlations with cognition (mean r = 0.01, SD = 0.05, min r = –0.08, max r = 0.17; see Figure 2). However, using a PLS-based machine learning approach, we were able to integrate information across all mental-health variables to predict cognition, yielding an out-of-sample correlation of r = 0.31 [95% CI: 0.29, 0.32].

      We believe this estimate approximates the true strength of the cognition-mental health relationship in normative samples, consistent with both theoretical expectations and prior empirical findings. Theoretically, this aligns with the RDoC view that cognition is one of several contributing domains. Empirically, our results are consistent with findings from our previous mega-analysis in children (Wang et al., 2025). Moreover, in the field of gerontology, an effect size of r = 0.31 is not considered small. According to Brydges (2019), it falls around the 70th percentile of effect sizes reported in gerontological studies and approaches the threshold for a large effect (r = 0.32). Given that most studies report within-sample associations, our out-of-sample results are likely more robust and generalizable (Yarkoni & Westfall, 2017).

      To answer, “why is it then necessary to explain this overlap with neuroimaging measures”, we again draw on the conceptual foundation of the RDoC framework. RDoC emphasizes that each functional domain, such as cognition, should be studied not only at the behavioural level but also across multiple neurobiological units of analysis, including genes, molecules, cells, circuits, physiology, and behaviour.

      MRI-based neural markers represent one such level of analysis. While other biological systems (e.g., genetic, molecular, or physiological) also contribute to the cognition-mental health relationship, neuroimaging provides unique insights into the brain mechanisms underlying this association – insights that cannot be obtained from behavioural data alone.

      In response to the related question, “Could the relationship between cognition and mental health be explained by third variables (e.g., environment, opportunities)?”, we note that developing a neural marker of cognition capable of capturing its relationship with mental health is the central aim of this study. Using the MRI modalities available in the UK Biobank, we were able to account for 48% of the covariation between cognition and mental health.

      The remaining 52% of unexplained variance may stem from several sources. According to the RDoC framework, neuromarkers could be further refined by incorporating additional neuroimaging modalities (e.g., task-based fMRI, PET, ASL, MEG/EEG, fNIRS) and integrating other units of analysis such as genetic, molecular, cellular, and physiological data.
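
      For readers less familiar with commonality analysis, the idea behind a split such as 48% versus 52% can be sketched for the simplest case of two predictor sets. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names are hypothetical, the R² values in the example are invented for illustration, and we assume the reported proportion is the common component divided by all variance in cognition explained by mental health.

```python
def commonality_two_sets(r2_a, r2_b, r2_ab):
    """Partition explained variance for two predictor sets A and B.

    r2_a, r2_b: R² of models using only A or only B
    r2_ab:      R² of the model using A and B together
    Returns (unique_a, unique_b, common).
    """
    unique_a = r2_ab - r2_b
    unique_b = r2_ab - r2_a
    common = r2_a + r2_b - r2_ab
    return unique_a, unique_b, common


def share_of_a_captured_by_b(r2_a, r2_b, r2_ab):
    """Fraction of the variance explained by A that is shared with B."""
    unique_a, _, common = commonality_two_sets(r2_a, r2_b, r2_ab)
    return common / (common + unique_a)


# Hypothetical R² values chosen only for illustration (A = mental health,
# B = neuromarkers); they are not results from the study.
unique_mh, unique_mri, common = commonality_two_sets(0.10, 0.20, 0.25)
print(round(unique_mh, 3), round(unique_mri, 3), round(common, 3))
print(round(share_of_a_captured_by_b(0.10, 0.20, 0.25), 3))
```

      In this two-set case the denominator, common plus unique A, equals the R² of the model using A alone, so the share is simply the common component relative to everything A explains.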

      Once more comprehensive neuromarkers are developed, capturing a greater proportion of the cognition-mental health covariation, they may also lead to a new research direction – to investigate how environmental factors and life opportunities influence these markers. However, exploring those environmental contributions lies beyond the scope of the current study.

      We discuss these considerations and explain the motivation of our study in the revised Introduction and Discussion.

      Introduction

      Line 43: “Cognition and mental health are closely intertwined [1]. Cognitive dysfunction is present in various mental illnesses, including anxiety [2, 3], depression [4–6], and psychotic disorders [7–12]. The National Institute of Mental Health’s Research Domain Criteria (RDoC) [13,14] treats cognition as one of the main basic functional domains that transdiagnostically underlie mental health. According to RDoC, mental health should be studied in relation to cognition, alongside other domains such as negative and positive valence systems, arousal and regulatory systems, social processes, and sensorimotor functions. RDoC further emphasizes that each domain, including cognition, should be investigated not only at the behavioural level but also through its neurobiological correlates. In this study, we aim to examine how the covariation between cognition and mental health is reflected in neural markers of cognition, as measured through multimodal neuroimaging.”

      Discussion

      Line 481: “Our analysis confirmed the validity of the g-factor [31] as a quantitative measure of cognition, demonstrating that it captures nearly half (39%) of the variance across twelve cognitive performance scores, consistent with prior studies [63–68]. Furthermore, we were able to predict cognition from 133 mental health indices, showing a medium-sized relationship that aligns with existing literature [69,70]. Although the observed mental health-cognition association is lower than within-sample estimates in conventional regression models, it aligns with our prior mega-analysis in children [69]. Notably, this effect size is not considered small in gerontology. In fact, it falls around the 70th percentile of reported effects and approaches the threshold for a large effect at r = 0.32 [71]. While we focused specifically on cognition as an RDoC core domain, the strength of its relationship with mental health may be bounded by the influence of other functional domains, particularly in normative, non-clinical samples – a promising direction for future research.”

      Line 658: “Although recent debates [18] have challenged the predictive utility of MRI for cognition, our multimodal marker integrating 72 neuroimaging phenotypes captures nearly half of the mental health-explained variance in cognition. We demonstrate that neural markers with greater predictive accuracy for cognition also better explain cognition-mental health covariation, showing that multimodal MRI can capture both a substantial cognitive variance and nearly half of its shared variance with mental health. Finally, we show that our neuromarkers explain a substantial portion of the age- and sex-related variance in the cognition-mental health relationship, highlighting their relevance in modeling cognition across demographic strata.

      The remaining unexplained variance in the relationship between cognition and mental health likely stems from multiple sources. One possibility is the absence of certain neuroimaging modalities in the UK Biobank dataset, such as task-based fMRI contrasts, positron emission tomography, arterial spin labeling, and magnetoencephalography/electroencephalography. Prior research has consistently demonstrated strong predictive performance from specific task-based fMRI contrasts, particularly those derived from tasks like the n-Back working memory task and the face-name episodic memory task, none of which is available in the UK Biobank [15,17,61,69,114,142,151].

      Moreover, there are inherent limitations in using MRI as a proxy for brain structure and function. Measurement error and intra-individual variability, such as differences in a cognitive state between cognitive assessments and MRI acquisition, may also contribute to the unexplained variance. According to the RDoC framework, brain circuits represent only one level of neurobiological analysis relevant to cognition [14]. Other levels, including genes, molecules, cells, and physiological processes, may also play a role in the cognition-mental health relationship.

      Nonetheless, neuroimaging provides a valuable window into the biological mechanisms underlying this overlap – insights that cannot be gleaned from behavioural data alone. Ultimately, our findings validate brain-based neural markers as a fundamental neurobiological unit of analysis, advancing our understanding of mental health through the lens of cognition.”

      Insel T, Cuthbert B, Garvey M, Heinssen R, Pine DS, Quinn K, et al. Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. AJP. 2010;167:748–751.

      Abramovitch A, Short T, Schweiger A. The C Factor: Cognitive dysfunction as a transdiagnostic dimension in psychopathology. Clinical Psychology Review. 2021;86:102007.

      East-Richard C, R.-Mercier A, Nadeau D, Cellard C. Transdiagnostic neurocognitive deficits in psychiatry: A review of meta-analyses. Canadian Psychology / Psychologie Canadienne. 2020;61(3):190–214.

      Wang Y, Anney R, Pat N. The relationship between cognitive abilities and mental health as represented by cognitive abilities at the neural and genetic levels of analysis. eLife. 2025;14:RP105537.

      Brydges CR. Effect Size Guidelines, Sample Size Calculations, and Statistical Power in Gerontology. Innovation in Aging. 2019;3(4):igz036.

      Yarkoni T, Westfall J. Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspect Psychol Sci. 2017;12(6):1100-1122.

      Comment 2 Title: - Shouldn't it be "MRI markers" (plural)?

      We used the singular form (“marker”) intentionally, as it refers to the composite neuroimaging marker derived from all three MRI modalities in our stacked model. This multimodal marker represents the combined predictive power of all modalities and captures the highest proportion of the mental health-cognition relationship in our analyses.

      Comment 3: Introduction - I miss an explanation of why it is useful to look at cognition-mental health covariation

      We believe we have sufficiently addressed this comment in our response to Reviewer 2, comment 1 above.

      Comment 4: - "Demonstrating that MRI-based neural indicators of cognition capture the covariation between cognition and mental health will thereby support the utility of such indicators for understanding the etiology of mental health" (page 4, line 56-58) - how/why?

      Previous research has largely focused on developing MRI-based neural indicators that accurately predict cognitive performance (Marek et al., 2022; Vieira et al., 2020). Building on this foundation, our findings further demonstrate that the predictive performance of a neural indicator for cognition is closely tied to its ability to explain the covariation between cognition and mental health. In other words, the robustness of a neural indicator – its capacity to capture individual differences in cognition – is strongly associated with how well it reflects the shared variance between cognition and mental health.

      This insight is particularly important within the context of the RDoC framework, which seeks to understand the etiology of mental health through functional domains (such as cognition) and their underlying neurobiological units of analysis (Insel et al., 2010). According to RDoC, for a neural indicator of cognition to be informative for mental health research, it must not only predict cognitive performance but also capture its relationship with mental health.

      Furthermore, RDoC emphasizes the integration of neurobiological measures to investigate the influence of environmental and developmental factors on mental health. In line with this, our neural indicators of cognition may serve as valuable tools in future research aimed at understanding how environmental exposures and developmental trajectories shape mental health outcomes. We discuss this in more detail in the revised Discussion.

      Line 481: “Our analysis confirmed the validity of the g-factor [31] as a quantitative measure of cognition [31], demonstrating that it captures nearly half (39%) of the variance across twelve cognitive performance scores, consistent with prior studies [63–68]. Furthermore, we were able to predict cognition from 133 mental health indices, showing a medium-sized relationship that aligns with existing literature [69,70]. Although the observed mental health-cognition association is lower than within-sample estimates in conventional regression models, it aligns with our prior mega-analysis in children [69]. Notably, this effect size is not considered small in gerontology. In fact, it falls around the 70th percentile of reported effects and approaches the threshold for a large effect at r = 0.32 [71]. While we focused specifically on cognition as an RDoC core domain, the strength of its relationship with mental health may be bounded by the influence of other functional domains, particularly in normative, non-clinical samples – a promising direction for future research.”

      Line 658: “Although recent debates [18] have challenged the predictive utility of MRI for cognition, our multimodal marker integrating 72 neuroimaging phenotypes captures nearly half of the mental health-explained variance in cognition. We demonstrate that neural markers with greater predictive accuracy for cognition also better explain cognition-mental health covariation, showing that multimodal MRI can capture both a substantial cognitive variance and nearly half of its shared variance with mental health. Finally, we show that our neuromarkers explain a substantial portion of the age- and sex-related variance in the cognition-mental health relationship, highlighting their relevance in modeling cognition across demographic strata.

      The remaining unexplained variance in the relationship between cognition and mental health likely stems from multiple sources. One possibility is the absence of certain neuroimaging modalities in the UK Biobank dataset, such as task-based fMRI contrasts, positron emission tomography, arterial spin labeling, and magnetoencephalography/electroencephalography. Prior research has consistently demonstrated strong predictive performance from specific task-based fMRI contrasts, particularly those derived from tasks like the n-Back working memory task and the face-name episodic memory task, none of which is available in the UK Biobank [15,17,61,69,114,142,151].

      Moreover, there are inherent limitations in using MRI as a proxy for brain structure and function. Measurement error and intra-individual variability, such as differences in a cognitive state between cognitive assessments and MRI acquisition, may also contribute to the unexplained variance. According to the RDoC framework, brain circuits represent only one level of neurobiological analysis relevant to cognition [14]. Other levels, including genes, molecules, cells, and physiological processes, may also play a role in the cognition-mental health relationship.

      Nonetheless, neuroimaging provides a valuable window into the biological mechanisms underlying this overlap – insights that cannot be gleaned from behavioural data alone. Ultimately, our findings validate brain-based neural markers as a fundamental neurobiological unit of analysis, advancing our understanding of mental health through the lens of cognition.”

      Marek S, Tervo-Clemmens B, Calabro FJ, Montez DF, Kay BP, Hatoum AS, et al. Reproducible brain-wide association studies require thousands of individuals. Nature. 2022;603:654–660.

      Vieira S, Gong QY, Pinaya WHL, et al. Using Machine Learning and Structural Neuroimaging to Detect First Episode Psychosis: Reconsidering the Evidence. Schizophr Bull. 2020;46(1):17-26.

      Insel T, Cuthbert B, Garvey M, Heinssen R, Pine DS, Quinn K, et al. Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. AJP. 2010;167:748–751.

      Comment 5: - The explanation about the stacking approach is not yet completely clear to me. I don't understand how the target variable can be the dependent variable in both step one and two. Or are those different variables? It would be helpful to also give an example of the target variable in line 88 on page 5

      Thank you for this excellent question. In our stacking approach, the same target variable, the g-factor, is indeed used across both modeling stages, but with a key distinction in how predictions are generated and integrated.

      In the first-level models, we trained separate Partial Least Squares Regression (PLSR) models for each of the 72 neuroimaging phenotypes, each predicting the g-factor independently. The predicted values from these 72 models were then used as input features for the second-level stacked model, which combined them to generate a final prediction of the g-factor. This two-stage framework enables us to integrate information across multiple imaging modalities while maintaining a consistent prediction target.

      To avoid data leakage, both modeling stages were conducted entirely within the training set for each cross-validation fold. Only after the second-level model was trained was it applied to the outer-fold test participants who were not involved in any part of the model training process.
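
      The two-stage logic described above can be sketched as follows. This is an illustration only, not the code used in the study: it assumes NumPy, uses synthetic data, substitutes ordinary least squares for both the first-level PLSR models and the second-level algorithm, and uses a single 80/20 split in place of the nested scheme. The function names (`fit_ols`, `standardize`, `stacked_predict`) are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with intercept (a stand-in for PLSR / the stacked learner)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]

def standardize(train, test):
    """Scale both sets using the training mean and SD only, to avoid leakage."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sd, (test - mu) / sd

def stacked_predict(blocks_train, y_train, blocks_test):
    """blocks_*: list of (n, p_k) arrays, one array per phenotype."""
    feats_train, feats_test = [], []
    for X_tr, X_te in zip(blocks_train, blocks_test):
        X_tr, X_te = standardize(X_tr, X_te)
        coef = fit_ols(X_tr, y_train)                # first level: one model per phenotype
        feats_train.append(predict_ols(coef, X_tr))  # predictions become new features
        feats_test.append(predict_ols(coef, X_te))
    Z_tr, Z_te = standardize(np.column_stack(feats_train),
                             np.column_stack(feats_test))
    meta = fit_ols(Z_tr, y_train)                    # second level: same target variable
    return predict_ols(meta, Z_te)                   # applied to held-out participants only

# Synthetic example: three hypothetical "phenotypes" sharing one latent signal.
rng = np.random.default_rng(0)
n, n_blocks, p = 500, 3, 4
latent = rng.normal(size=n)
blocks = [latent[:, None] + rng.normal(size=(n, p)) for _ in range(n_blocks)]
y = latent + rng.normal(scale=0.5, size=n)

test_idx = np.arange(n) < n // 5                     # 20% held out as the test set
train_idx = ~test_idx
pred = stacked_predict([b[train_idx] for b in blocks], y[train_idx],
                       [b[test_idx] for b in blocks])
r = np.corrcoef(pred, y[test_idx])[0, 1]
print(f"held-out Pearson r = {r:.3f}")
```

      The target `y_train` enters both stages, but the second-level model only ever sees first-level predictions, and the test participants contribute nothing to either fit or to the scaling parameters.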

      To improve accessibility, we have revised the Methods section (see Page 10) to clarify this approach, ensuring that the description remains technically accurate while being easier to follow.

      Line 188: “We employed nested cross-validation to predict cognition from mental health indices and 72 neuroimaging phenotypes (Fig. 1). Nested cross-validation is a robust method for evaluating machine-learning models while tuning their hyperparameters, ensuring that performance estimates are both accurate and unbiased. Here, we used a nested cross-validation scheme with five outer folds and ten inner folds.

      We started by dividing the entire dataset into five outer folds. Each fold took a turn being held out as the outer-fold test set (20% of the data), while the remaining four folds (80% of the data) were used as an outer-fold training set. Within each outer-fold training set, we performed a second layer of cross-validation – this time splitting the data into ten inner folds. These inner folds were used exclusively for hyperparameter tuning: models were trained on nine of the inner folds and validated on the remaining one, cycling through all ten combinations.

      We then selected the hyperparameter configuration that performed best across the inner-fold validation sets, as determined by the minimal mean squared error (MSE). The model was then retrained on the full outer-fold training set using this hyperparameter configuration and evaluated on the outer-fold test set, using four performance metrics: Pearson r, the coefficient of determination (R<sup>2</sup>), the mean absolute error (MAE), and the MSE. This entire process was repeated for each of the five outer folds, ensuring that every data point is used for both training and testing, but never at the same time. We opted for five outer folds instead of ten to reduce computational demands, particularly memory and processing time, given the substantial volume of neuroimaging data involved in model training. Five outer folds led to an outer-fold test set of at least n = 4 000, which should be sufficient for model evaluation. In contrast, we retained ten inner folds to ensure robust and stable hyperparameter tuning, maximising the reliability of model selection.

      To model the relationship between mental health and cognition, we employed Partial Least Squares Regression (PLSR) to predict the g-factor from 133 mental health variables. To model the relationship between neuroimaging data and cognition, we used a two-step stacking approach [15–17,61] to integrate information from 72 neuroimaging phenotypes across three MRI modalities. In the first step, we trained 72 base (first-level) PLSR models, each predicting the g-factor from a single neuroimaging phenotype. In the second step, we used the predicted values from these base models as input features for stacked models, which again predicted the g-factor. We constructed four stacked models based on the source of the base predictions: one each for dwMRI, rsMRI, sMRI, and a combined model incorporating all modalities (“dwMRI Stacked”, “rsMRI Stacked”, “sMRI Stacked”, and “All MRI Stacked”, respectively). Each stacked model was trained using one of four machine learning algorithms – ElasticNet, Random Forest, XGBoost, or Support Vector Regression – selected individually for each model (see Supplementary Materials, S6).

      For rsMRI phenotypes, we treated the choice of functional connectivity quantification method – full correlation, partial correlation, or tangent space parametrization – as a hyperparameter. The method yielding the highest performance on the outer-fold training set was selected for predicting the g-factor (see Supplementary Materials, S5).

      To prevent data leakage, we standardized the data using the mean and standard deviation derived from the training set and applied these parameters to the corresponding test set within each outer fold. This standardization was performed at three key stages: before g-factor derivation, before regressing out modality-specific confounds from the MRI data, and before stacking. Similarly, to maintain strict separation between training and testing data, both base and stacked models were trained exclusively on participants from the outer-fold training set and subsequently applied to the corresponding outer-fold test set.

      To evaluate model performance and assess statistical significance, we aggregated the predicted and observed g-factor values from each outer-fold test set. We then computed a bootstrap distribution of Pearson’s correlation coefficient (r) by resampling with replacement 5 000 times, generating 95% confidence intervals (CIs) (Fig. 1). Model performance was considered statistically significant if the 95% CI did not include zero, indicating that the observed associations were unlikely to have occurred by chance.”
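
      The bootstrap step in the last quoted paragraph can be illustrated with a short sketch. It assumes a simple percentile interval, which the quoted text does not specify, and it runs on synthetic data; the helper names (`pearson_r`, `bootstrap_ci`) are hypothetical and only the Python standard library is used.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def bootstrap_ci(observed, predicted, n_boot=5000, alpha=0.05, seed=0):
    """Percentile CI for r, resampling participants with replacement."""
    rng = random.Random(seed)
    n = len(observed)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample (observed, predicted) pairs
        stats.append(pearson_r([observed[i] for i in idx],
                               [predicted[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic stand-in for the aggregated outer-fold test predictions.
rng = random.Random(1)
observed = [rng.gauss(0, 1) for _ in range(300)]
predicted = [0.5 * o + rng.gauss(0, 1) for o in observed]

r = pearson_r(observed, predicted)
lo, hi = bootstrap_ci(observed, predicted, n_boot=1000)  # fewer draws to keep the demo quick
significant = lo > 0 or hi < 0                           # 95% CI excludes zero
print(f"r = {r:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}], significant = {significant}")
```

      Resampling whole (observed, predicted) pairs keeps each participant's values together, so the interval reflects sampling variability in the correlation itself.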

      Comment 6: Methods - It's not clear from the text and Figure 1 which 12 scores from 11 tests are being used to derive the g-factor. Figure 1 shows only 8 bullet points with 10 scores in A and 13 tests under 'Cognitive tests' in B. Moreover, Supplement S1 describes 12 tests and 14 measures (Prospective Memory test is in the text but not in Supplementary Table 1).

      Thank you for identifying this discrepancy. In the original Figure 1b and in the Supplementary Methods (S1), the “Prospective Memory” test was accidentally listed twice; it was, however, present in Supplementary Table 1 (Line 53, Supplementary Table 1). We have now corrected both for consistency. To clarify: Figure 1a presents the global mental health and cognitive domains studied, while Figure 1b now accurately lists 1) the 12 cognitive scores from 11 tests used to derive the g-factor (with the Trail Making Test contributing two measures – numeric and alphabetic trails) and 2) the three main categories of mental health indices used as machine learning features.

      We also corrected the Supplementary Materials to remove the duplicate test from the first paragraph. In Supplementary Table 1, there were 11 tests listed, and for the Trail Making test, we specified in the “Core measures” column that this test had 2 derivative scores: duration to complete the numeric path (Trail 1) and duration to complete the alphabetic path (Trail 2).

      Supplementary Materials, Line 46: “We used twelve scores from the eleven cognitive tests that represented the following cognitive domains: reaction time and processing speed (Reaction Time test), working memory (Numeric Memory test), verbal and numerical reasoning (Fluid Intelligence test), executive function (Trail Making Test), non-verbal fluid reasoning (Matrix Pattern Completion test), processing speed (Symbol Digit Substitution test), vocabulary (Picture Vocabulary test), planning abilities (Tower Rearranging test), verbal declarative memory (Paired Associate Learning test), prospective memory (Prospective Memory test), and visual memory (Pairs Matching test) [1].”

      Comment 7: - For the mental health measures: If I understand correctly, the questionnaire items were used individually, but also to create composite scores. This seems counterintuitive, because I would assume that if the raw data is used, the composite scores would not add additional information to that. When reading the Supplement, it seems like I'm not correct… It would be helpful to clarify the text on page 7 in the main text.

      You raise an excellent observation regarding the use of both individual questionnaire items and composite scores. This dual approach was methodologically justified by the properties of Partial Least Squares Regression (PLSR), our chosen first-level machine learning algorithm, which benefits from rich feature sets and can handle multicollinearity through dimensionality reduction. PLSR transforms correlated features into latent variables, meaning both individual items and composite scores can contribute unique information to the model. We elaborate on PLSR's mathematical principles in Supplementary Materials (S5).

      To directly address this concern, we conducted comparative analyses showing that the PLSR model (a single 80/20% training/test split), incorporating all 133 mental health features (both items and composites), outperformed models using either type alone. The full model achieved superior performance (MSE = 0.458, MAE = 0.537, R<sup>2</sup> = 0.112, Pearson r = 0.336, p-value = 6.936e-112) compared to using only composite scores (93 features; MSE = 0.461, MAE = 0.538, R<sup>2</sup> = 0.107, Pearson r = 0.328, p-value = 5.8e-106) or only questionnaire items (40 features; MSE = 0.499, MAE = 0.561, R<sup>2</sup> = 0.033, Pearson r = 0.184, p-value = 2.53e-33). These results confirm that including both data types provides complementary predictive value. We expand on these considerations in the revised Methods section.

      Line 123: “Mental health measures encompassed 133 variables from twelve groups: mental distress, depression, clinical diagnoses related to the nervous system and mental health, mania (including bipolar disorder), neuroticism, anxiety, addictions, alcohol and cannabis use, unusual/psychotic experiences, traumatic events, self-harm behaviours, and happiness and subjective well-being (Fig. 1 and Tables S4 and S5). We included both self-report questionnaire items from all participants and composite diagnostic scores computed following Davis et al. and Dutt et al. [35,36] as features in our first-level (for explanation, see Data analysis section) Partial Least Squares Regression (PLSR) model. This approach leverages PLSR’s ability to handle multicollinearity through dimensionality reduction, enabling simultaneous use of granular symptom-level information and robust composite measures (for mental health scoring details, see Supplementary Materials, S2). We assess the contribution of each mental health index to general cognition by examining the direction and magnitude of its PLSR-derived loadings on the identified latent variables.”

      Comment 8: - Results - The colors in Figure 4 B are a bit hard to differentiate.

      We have updated Figure 4 to enhance colour differentiation by adjusting saturation and brightness levels, improving visual distinction. For further clarity, we split the original figure into two separate figures.

      Comment 9: - Discussion - "Overall, the scores for mental distress, alcohol and cannabis use, and self-harm behaviours relate positively, and the scores for anxiety, neurological and mental health diagnoses, unusual or psychotic experiences, happiness and subjective well-being, and negative traumatic events relate negatively to cognition," - this seems counterintuitive, that some symptoms relate to better cognition and others relate to worse cognition. Could you elaborate on this finding and what it could mean?

      We appreciate you highlighting this important observation. While some associations between mental health indices and cognition may appear counterintuitive at first glance, these patterns are robust (emerging consistently across both univariate correlations and PLSR loadings) and align with previous literature (e.g., Karpinski et al., 2018; Ogueji et al., 2022). For instance, the positive relationship between cognitive ability and certain mental health indicators like help-seeking behaviour has been documented in other population studies (Karpinski et al., 2018; Ogueji et al., 2022), potentially reflecting greater health literacy and access to care among cognitively advantaged individuals. Conversely, the negative associations with conditions like psychotic experiences mirror established neurocognitive deficits in these domains.

      As was initially detailed in Supplementary Materials (S12) and now expanded in our Discussion, these findings likely reflect complex multidimensional interactions. The positive loadings for mental distress indicators may capture: (1) greater help-seeking behaviour among those with higher cognition and socioeconomic resources, and/or (2) psychological overexcitability and rumination tendencies in high-functioning individuals. These interpretations are particularly relevant to the UK Biobank's assessment methods, where mental distress items focused on medical help-seeking rather than symptom severity per se (e.g., as a measure of mental distress, the UK Biobank questionnaire asked whether an individual sought or received medical help for or suffered from mental distress).

      Line 492: “Factor loadings derived from the PLSR model showed that the scores for mental distress, alcohol and cannabis use, and self-harm behaviours related positively, and the scores for anxiety, neurological and mental health diagnoses, unusual or psychotic experiences, happiness and subjective well-being, and negative traumatic events related negatively to the g-factor. Positive PLSR loadings of features related to mental distress may indicate greater susceptibility to or exaggerated perception of stressful events, psychological overexcitability, and predisposition to rumination in people with higher cognition [72]. On the other hand, these findings may be specific to the UK Biobank cohort and the way the questions for this mental health category were constructed. In particular, to evaluate mental distress, the UK Biobank questionnaire asked whether an individual sought or received medical help for or suffered from mental distress. In this regard, the estimate for mental distress may be more indicative of whether an individual experiencing mental distress had an opportunity or aspiration to visit a doctor and seek professional help [73]. Thus, people with better cognitive abilities and also with a higher socioeconomic status may indeed be more likely to seek professional help.

      Limited evidence supports a positive association between self-harm behaviours and cognitive abilities, with some studies indicating higher cognitive performance as a risk factor for non-suicidal self-harm. Research shows an inverse relationship between cognitive control of emotion and suicidal behaviours that weakens over the life course [73,74]. Some studies have found a positive correlation between cognitive abilities and the risk of nonsuicidal self-harm, suicidal thoughts, and suicidal plans that may be independent of or, conversely, affected by socioeconomic status [75,76]. In our study, the magnitude of the association between self-harm behaviours and cognition was low (Fig. 2), indicating a weak relationship.

      Positive PLSR loadings of features related to alcohol and cannabis may also indicate the influence of other factors. Overall, this relationship is believed to be largely affected by age, income, education, social status, social equality, social norms, and quality of life [79–80]. For example, education level and income correlate with cognitive ability and alcohol consumption [79,81–83]. Research also links a higher probability of having tried alcohol or recreational drugs, including cannabis, to a tendency of more intelligent individuals to approach evolutionary novel stimuli [84,85]. This hypothesis is supported by studies showing that cannabis users perform better on some cognitive tasks [86]. Alternatively, frequent drinking can indicate higher social engagement, which is positively associated with cognition [87]. Young adults often drink alcohol as a social ritual in university settings to build connections with peers [88]. In older adults, drinking may accompany friends or family visits [89,90]. Mixed evidence on the link between alcohol and drug use and cognition makes it difficult to draw definite conclusions, leaving an open question about the nature of this relationship.

      Consistent with previous studies, we showed that anxiety and negative traumatic experiences were inversely associated with cognitive abilities [90–93]. Anxiety may be linked to poorer cognitive performance via reduced working memory capacity, increased focus on negative thoughts, and attentional bias to threatening stimuli that hinder the allocation of cognitive resources to a current task [94–96]. Individuals with PTSD consistently showed impaired verbal and working memory, visual attention, inhibitory function, task switching, cognitive flexibility, and cognitive control [97–100]. Exposure to traumatic events that did not reach the PTSD threshold was also linked to impaired cognition. For example, childhood trauma is associated with worse performance in processing speed, attention, and executive function tasks in adulthood, and age at a first traumatic event is predictive of the rate of executive function decline in midlife [101,102]. In the UK Biobank cohort, adverse life events have been linked to lower cognitive flexibility, partially via depression level [103].

      In agreement with our findings, cognitive deficits are often found in psychotic disorders [104,105]. We treated neurological and mental health symptoms as predictor variables and did not stratify or exclude people based on psychiatric status or symptom severity. Since no prior studies have examined isolated psychotic symptoms (e.g., recent unusual experiences, hearing unreal voices, or seeing unreal visions), we avoid speculating on how these symptoms relate to cognition in our sample.

      Finally, negative PLSR loadings of the features related to happiness and subjective well-being may be specific to the study cohort, as these findings do not agree with some previous research [107–109]. On the other hand, our results agree with the study linking excessive optimism or optimistic thinking to lower cognitive performance in memory, verbal fluency, fluid intelligence, and numerical reasoning tasks, and suggesting that pessimism or realism indicates better cognition [110]. The concept of realism/optimism as indicators of cognition is a plausible explanation for a negative association between the g-factor and friendship satisfaction, as well as a negative PLSR loading of feelings that life is meaningful, especially in older adults who tend to reflect more on the meaning of life [111]. The latter is supported by the study showing a negative association between cognitive function and the search for the meaning of life and a change in the pattern of this relationship after the age of 60 [112]. Finally, a UK Biobank study found a positive association of happiness with speed and visuospatial memory but a negative relationship with reasoning ability [113].”

      Karpinski RI, Kinase Kolb AM, Tetreault NA, Borowski TB. High intelligence: A risk factor for psychological and physiological overexcitabilities. Intelligence. 2018;66:8–23.

      Ogueji IA, Okoloba MM. Seeking Professional Help for Mental Illness: A Mixed-Methods Study of Black Family Members in the UK and Nigeria. Psychol Stud. 2022;67:164–177.

      Comment 10: - All neuroimaging factors together explain 48% of the variance in the cognition-mental health relationship. However, this relationship is only r=0.3 - so then the effect of neuroimaging factors seems a lot smaller… What does it mean?

      Thank you for raising this critical point. We have addressed it in our responses to Reviewer 1, comments 2 and 3, and Reviewer 2, comment 1.

      Briefly, cognition is related to mental health at around r = 0.3 and to neuroimaging phenotypes at around r = 0.4. These levels of relationship strength are consistent with what has been shown in the literature (e.g., Wang et al., 2025 and Vieira et al., 2020). We discussed the relationship between cognition and mental health in our response to Reviewer 2, comment 1 above. In short, this relationship reflects just one functional domain – mental health may also be associated with other domains such as negative and positive valence systems, arousal and regulatory systems, social processes, and sensorimotor functions. Moreover, in the context of gerontology research, this effect size is considered relatively large (Brydges, 2019).

      We conducted a commonality analysis to investigate the unique and shared variance of mental health and neuroimaging phenotypes in explaining cognition. As we discussed in our response to Reviewer 1, comment 2, we were able to account for 48% of the covariation between cognition and mental health using the MRI modalities available in the UK Biobank. The remaining 52% of unexplained variance may arise from several sources.

      One possibility is the absence of certain neuroimaging modalities in the UK Biobank dataset, such as task-based fMRI contrasts, positron emission tomography, arterial spin labeling, and magnetoencephalography/electroencephalography. Prior research from our group and others has consistently demonstrated strong predictive performance from specific task-based fMRI contrasts, particularly those derived from tasks like the n-Back working memory task and the face-name episodic memory task, none of which is available in the UK Biobank (Tetereva et al., 2025).

      Moreover, there are inherent limitations in using MRI as a proxy for brain structure and function. Measurement error and intra-individual variability, such as differences in a cognitive state between cognitive assessments and MRI acquisition, may also contribute to the unexplained variance. According to the RDoC framework, brain circuits represent only one level of neurobiological analysis relevant to cognition. Other levels, including genes, molecules, cells, and physiological processes, may also play a role in the cognition-mental health relationship.

      We have now incorporated these considerations into the Discussion section.
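
      For readers less familiar with commonality analysis, the sketch below shows the standard two-predictor-set decomposition and how a percentage of shared variance relates to absolute variance in cognition. It is an illustration under stated assumptions: the R<sup>2</sup> inputs are hypothetical values chosen so that the shared portion equals 48% of the mental health-explained variance, they are not results from the study, and the function name `commonality` is hypothetical.

```python
def commonality(r2_mh, r2_mri, r2_both):
    """Two-set commonality decomposition of variance explained in cognition.

    r2_mh   : R^2 from mental health predictors alone
    r2_mri  : R^2 from neuroimaging predictors alone
    r2_both : R^2 from both predictor sets together
    """
    unique_mh = r2_both - r2_mri        # explained by mental health only
    unique_mri = r2_both - r2_mh        # explained by neuroimaging only
    common = r2_mh + r2_mri - r2_both   # explained jointly by both sets
    return {
        "unique_mh": unique_mh,
        "unique_mri": unique_mri,
        "common": common,
        # Share of the mental health-explained variance that MRI also captures.
        "share_of_mh_r2": common / r2_mh,
    }

# Hypothetical inputs for illustration only (not values reported by the authors).
parts = commonality(r2_mh=0.10, r2_mri=0.16, r2_both=0.212)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
```

      With these hypothetical inputs the shared component is 0.048 of the total variance in cognition, which is 48% of the 0.10 explained by mental health. The percentage is therefore expressed relative to the mental health-explained variance, not to the total variance in cognition.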

      Line 481: “Our analysis confirmed the validity of the g-factor [31] as a quantitative measure of cognition [31], demonstrating that it captures nearly half (39%) of the variance across twelve cognitive performance scores, consistent with prior studies [63–68]. Furthermore, we were able to predict cognition from 133 mental health indices, showing a medium-sized relationship that aligns with existing literature [69,70]. Although the observed mental health-cognition association is lower than within-sample estimates in conventional regression models, it aligns with our prior mega-analysis in children [69]. Notably, this effect size is not considered small in gerontology. In fact, it falls around the 70th percentile of reported effects and approaches the threshold for a large effect at r = 0.32 [71]. While we focused specifically on cognition as an RDoC core domain, the strength of its relationship with mental health may be bounded by the influence of other functional domains, particularly in normative, non-clinical samples – a promising direction for future research.”

      Line 658: “Although recent debates [18] have challenged the predictive utility of MRI for cognition, our multimodal marker integrating 72 neuroimaging phenotypes captures nearly half of the mental health-explained variance in cognition. We demonstrate that neural markers with greater predictive accuracy for cognition also better explain cognition-mental health covariation, showing that multimodal MRI can capture both a substantial cognitive variance and nearly half of its shared variance with mental health. Finally, we show that our neuromarkers explain a substantial portion of the age- and sex-related variance in the cognition-mental health relationship, highlighting their relevance in modeling cognition across demographic strata.

      The remaining unexplained variance in the relationship between cognition and mental health likely stems from multiple sources. One possibility is the absence of certain neuroimaging modalities in the UK Biobank dataset, such as task-based fMRI contrasts, positron emission tomography, arterial spin labeling, and magnetoencephalography/electroencephalography. Prior research has consistently demonstrated strong predictive performance from specific task-based fMRI contrasts, particularly those derived from tasks like the n-Back working memory task and the face-name episodic memory task, none of which is available in the UK Biobank [15,17,61,69,114,142,151].

      Moreover, there are inherent limitations in using MRI as a proxy for brain structure and function. Measurement error and intra-individual variability, such as differences in a cognitive state between cognitive assessments and MRI acquisition, may also contribute to the unexplained variance. According to the RDoC framework, brain circuits represent only one level of neurobiological analysis relevant to cognition [14]. Other levels, including genes, molecules, cells, and physiological processes, may also play a role in the cognition-mental health relationship.

      Nonetheless, neuroimaging provides a valuable window into the biological mechanisms underlying this overlap – insights that cannot be gleaned from behavioural data alone. Ultimately, our findings validate brain-based neural markers as a fundamental neurobiological unit of analysis, advancing our understanding of mental health through the lens of cognition.”

      Wang Y, Anney R, Pat N. The relationship between cognitive abilities and mental health as represented by cognitive abilities at the neural and genetic levels of analysis. eLife. 2025;14:RP105537.

      Vieira S, Gong QY, Pinaya WHL, et al. Using Machine Learning and Structural Neuroimaging to Detect First Episode Psychosis: Reconsidering the Evidence. Schizophr Bull. 2020;46(1):17-26.

      Brydges CR. Effect Size Guidelines, Sample Size Calculations, and Statistical Power in Gerontology. Innovation in Aging. 2019;3(4):igz036.

      Tetereva A, Knodt AR, Melzer TR, et al. Improving Predictability, Reliability and Generalisability of Brain-Wide Associations for Cognitive Abilities via Multimodal Stacking. Preprint. bioRxiv. 2025;2024.05.03.589404.

      Reviewer 3:

      Buianova et al. present a comprehensive analysis examining the predictive value of multimodal neuroimaging data for general cognitive ability, operationalized as a derived g-factor. The study demonstrates that functional MRI holds the strongest predictive power among the modalities, while integrating multiple MRI modalities through stacking further enhances prediction performance. The inclusion of a commonality analysis provides valuable insight into the extent to which shared and unique variance across mental health features and neuroimaging modalities contributes to the observed associations with cognition. The results are clearly presented and supported by high-quality visualizations. Limitations of the sample are stated clearly.

      Thank you once more for your constructive and encouraging feedback. We appreciate your careful reading and valuable methodological insights. Your expertise has helped us clarify key methodological concepts and improve the overall rigour of our study.

      Suggestions for improvement:

      (1) The manuscript would benefit from the inclusion of permutation testing to evaluate the statistical significance of the predictive models. This is particularly important given that some of the reported performance metrics are relatively modest, and permutation testing could help ensure that results are not driven by chance.

      Thank you, this is an excellent point. We agree that evaluating the statistical significance of our predictive models is essential.

      In our original analysis, we assessed model performance by generating a bootstrap distribution of Pearson’s r, resampling the data with replacement 5,000 times (see Figure 3b). In response to your feedback, we have made the following updates:

      (1) Improved Figure 3b to explicitly display the 95% confidence intervals.

      (2) Supplemented the results by reporting the exact confidence interval values.

      (3) Clarified our significance testing procedure in the Methods section.

      We considered model performance statistically significant when the 95% confidence interval did not include zero, indicating that the observed associations are unlikely to have occurred by chance.

      We chose bootstrapping over permutation testing because, while both can assess statistical significance, bootstrapping additionally provides uncertainty estimates in the form of confidence intervals. Given the large sample size in our study, significance testing can be less informative, as even small effects may reach statistical significance. Bootstrapping offers a more nuanced understanding of model uncertainty.

      Line 233: “To evaluate model performance and assess statistical significance, we aggregated the predicted and observed g-factor values from each outer-fold test set. We then computed a bootstrap distribution of Pearson’s correlation coefficient (r) by resampling with replacement 5 000 times, generating 95% confidence intervals (CIs) (Fig. 1). Model performance was considered statistically significant if the 95% CI did not include zero, indicating that the observed associations were unlikely to have occurred by chance.”
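
      The bootstrap procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names (`pearson_r`, `bootstrap_r_ci`), the percentile method for the interval, and the synthetic observed and predicted values are all assumptions made for the example.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_r_ci(observed, predicted, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson r.

    (observed, predicted) pairs are resampled together, with replacement.
    Returns (point estimate, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(observed)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson_r([observed[i] for i in idx],
                            [predicted[i] for i in idx]))
    rs.sort()
    lower = rs[int((alpha / 2) * n_boot)]
    upper = rs[int((1 - alpha / 2) * n_boot) - 1]
    return pearson_r(observed, predicted), lower, upper


# Synthetic stand-in for the pooled outer-fold test predictions (assumed data).
data_rng = random.Random(1)
observed = [data_rng.gauss(0, 1) for _ in range(300)]
predicted = [0.3 * g + data_rng.gauss(0, 1) for g in observed]

r, ci_low, ci_high = bootstrap_r_ci(observed, predicted, n_boot=1000)
# Decision rule from the text: significant if the 95% CI excludes zero.
significant = ci_low > 0 or ci_high < 0
print(f"r = {r:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}], significant: {significant}")
```

      Resampling pairs rather than each variable separately keeps the link between every observed value and its prediction, which is what the correlation measures.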

      (2) Applying and testing the trained models on an external validation set would increase confidence in generalisability of the model.

      We appreciate this excellent suggestion. While we considered this approach, implementing it would require identifying an appropriate external dataset with comparable neuroimaging and behavioural measures, along with careful matching of acquisition protocols and variable definitions across sites. These challenges extend beyond the scope of the current study, though we fully agree that this represents an important direction for future research.

      Our findings, obtained from one of the largest neuroimaging datasets to date with training and test samples exceeding most previous studies, align closely with existing literature: the predictive accuracy of each neuroimaging phenotype and modality for cognition matches the effect size reported in meta-analyses (r ≈ 0.4; e.g., Vieira et al., 2020). The ability of dwMRI, rsMRI and sMRI to capture the cognition-mental health relationship is, in turn, consistent with our previous work in pediatric populations (Wang et al., 2025; Pat et al., 2022).

      Vieira S, Gong QY, Pinaya WHL, et al. Using Machine Learning and Structural Neuroimaging to Detect First Episode Psychosis: Reconsidering the Evidence. Schizophr Bull. 2020;46(1):17-26.

      Wang Y, Anney R, Pat N. The relationship between cognitive abilities and mental health as represented by cognitive abilities at the neural and genetic levels of analysis. eLife. 2025;14:RP105537.

      Pat N, Wang Y, Anney R, Riglin L, Thapar A, Stringaris A. Longitudinally stable, brain-based predictive models mediate the relationships between childhood cognition and socio-demographic, psychological and genetic factors. Hum Brain Mapp. 2022;43:5520–5542.

      (3) The rationale for selecting a 5-by-10-fold cross-validation scheme is not clearly explained. Clarifying why this structure was preferred over more commonly used alternatives, such as 10-by-10 or 5-by-5 cross-validation, would strengthen the methodological transparency.

      Thank you for this important methodological question. Our choice of a 5-by-10-fold cross-validation scheme was motivated by the need to balance robust hyperparameter tuning with computational efficiency, particularly memory and processing time. Retaining five outer folds allowed us to rigorously assess model performance across multiple data partitions, yielding outer-fold test sets of at least n = 4 000 while leaving a substantial amount of neuroimaging data for model training. In contrast, employing ten inner folds ensured robust and stable hyperparameter tuning that maximizes the reliability of model selection. In short, with our large sample, five outer folds provided a sufficiently large out-of-sample test set for reliable model evaluation and efficient computation, while ten inner folds enabled robust hyperparameter tuning. We now provide additional rationale for this design decision on Page 10.

      Line 188: “We employed nested cross-validation to predict cognition from mental health indices and 72 neuroimaging phenotypes (Fig. 1). Nested cross-validation is a robust method for evaluating machine-learning models while tuning their hyperparameters, ensuring that performance estimates are both accurate and unbiased. Here, we used a nested cross-validation scheme with five outer folds and ten inner folds.

      We started by dividing the entire dataset into five outer folds. Each fold took a turn being held out as the outer-fold test set (20% of the data), while the remaining four folds (80% of the data) were used as an outer-fold training set. Within each outer-fold training set, we performed a second layer of cross-validation – this time splitting the data into ten inner folds. These inner folds were used exclusively for hyperparameter tuning: models were trained on nine of the inner folds and validated on the remaining one, cycling through all ten combinations.

      We then selected the hyperparameter configuration that performed best across the inner-fold validation sets, as determined by the minimal mean squared error (MSE). The model was then retrained on the full outer-fold training set using this hyperparameter configuration and evaluated on the outer-fold test set, using four performance metrics: Pearson r, the coefficient of determination (R<sup>2</sup>), the mean absolute error (MAE), and the MSE. This entire process was repeated for each of the five outer folds, ensuring that every data point is used for both training and testing, but never at the same time. We opted for five outer folds instead of ten to reduce computational demands, particularly memory and processing time, given the substantial volume of neuroimaging data involved in model training. Five outer folds led to outer-fold test sets of at least n = 4 000, which should be sufficient for model evaluation. In contrast, we retained ten inner folds to ensure robust and stable hyperparameter tuning, maximising the reliability of model selection.”
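
      The nested scheme above (five outer folds, ten inner folds, selection by minimal inner-fold MSE, refit on the full outer-fold training set) can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: closed-form ridge regression stands in for their models, the penalty grid `alphas` and all function names are hypothetical, the data are synthetic, and the four metrics are computed on pooled out-of-fold predictions rather than per fold.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression on centred data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def k_folds(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    return np.array_split(rng.permutation(n), k)


def nested_cv(X, y, alphas, n_outer=5, n_inner=10, seed=0):
    """Return out-of-fold predictions from a nested cross-validation."""
    rng = np.random.default_rng(seed)
    y_pred = np.empty_like(y, dtype=float)
    for test_idx in k_folds(len(y), n_outer, rng):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        X_tr, y_tr = X[train_idx], y[train_idx]

        # Inner loop: tune the penalty using only the outer-fold training set.
        inner = k_folds(len(y_tr), n_inner, rng)
        mean_mse = []
        for alpha in alphas:
            fold_mse = []
            for val_idx in inner:
                fit_idx = np.setdiff1d(np.arange(len(y_tr)), val_idx)
                w, b = ridge_fit(X_tr[fit_idx], y_tr[fit_idx], alpha)
                fold_mse.append(np.mean((y_tr[val_idx] - (X_tr[val_idx] @ w + b)) ** 2))
            mean_mse.append(np.mean(fold_mse))
        best_alpha = alphas[int(np.argmin(mean_mse))]

        # Refit on the full outer-fold training set, then predict the held-out fold.
        w, b = ridge_fit(X_tr, y_tr, best_alpha)
        y_pred[test_idx] = X[test_idx] @ w + b
    return y_pred


def metrics(y, y_pred):
    """Pearson r, R^2, MAE and MSE between observed and predicted values."""
    resid = y - y_pred
    return {
        "r": float(np.corrcoef(y, y_pred)[0, 1]),
        "R2": float(1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)),
        "MAE": float(np.mean(np.abs(resid))),
        "MSE": float(np.mean(resid ** 2)),
    }


# Synthetic data standing in for one neuroimaging phenotype (assumed).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(500, 20))
y = X @ (0.3 * data_rng.normal(size=20)) + data_rng.normal(size=500)

y_pred = nested_cv(X, y, alphas=[0.1, 1.0, 10.0, 100.0])
scores = metrics(y, y_pred)
print(scores)
```

      Because the hyperparameter is chosen without ever touching the outer-fold test set, the pooled predictions give a performance estimate that is not inflated by the tuning step.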

      (4) A more detailed discussion of which specific brain regions or features within each neuroimaging modality contributed most strongly to the prediction of cognition would enhance neurobiological relevance of the findings.

      Thank you for this thoughtful suggestion. To address this point, we have included feature importance plots for the top-performing neuroimaging phenotypes within each modality (Figure 5 and Figures S2–S4), demonstrating the relative contributions of individual features to the predictive models. While we maintain our primary focus on cross-modality performance comparisons in the main text, as this aligns with our central aim of evaluating multimodal MRI markers at the integrated level, we outline the contribution of neuroimaging features with the highest predictive performance for cognition in the revised Results and Discussion.

      Methods

      Line 255: “To determine which neuroimaging features contribute most to the predictive performance of top-performing phenotypes within each modality, while accounting for the potential latent components derived from neuroimaging, we assessed feature importance using the Haufe transformation [62]. Specifically, we calculated Pearson correlations between the predicted g-factor and scaled and centred neuroimaging features across five outer-fold test sets. We also examined whether the performance of neuroimaging phenotypes in predicting cognition per se is related to their ability to explain the link between cognition and mental health. Here, we computed the correlation between the predictive performance of each neuroimaging phenotype and the proportion of the cognition-mental health relationship it captures. To understand how demographic factors, including age and sex, contribute to this relationship, we also conducted a separate set of commonality analyses treating age, sex, age<sup>2</sup>, age×sex, and age<sup>2</sup>×sex as an additional set of explanatory variables (Fig. 1).”
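
      The feature-importance step as described, Pearson correlations between the predicted g-factor and scaled and centred features, can be illustrated with the short sketch below. The function name `haufe_importance` and the synthetic features and predictions are assumptions for the example; it shows the correlation form stated in the text, not the authors' implementation.

```python
import numpy as np


def haufe_importance(X_test, y_pred):
    """Pearson r between each scaled, centred feature and the model's predictions.

    X_test: (n_samples, n_features) test-set features.
    y_pred: (n_samples,) predicted target values for the same samples.
    """
    Xz = (X_test - X_test.mean(axis=0)) / X_test.std(axis=0)
    yz = (y_pred - y_pred.mean()) / y_pred.std()
    # The mean product of z-scores (population SD) equals Pearson r.
    return Xz.T @ yz / len(yz)


# Synthetic example: predictions depend on features 0 and 1, not on feature 2.
rng = np.random.default_rng(0)
X_test = rng.normal(size=(200, 3))
y_pred = 2.0 * X_test[:, 0] - 1.0 * X_test[:, 1]

importance = haufe_importance(X_test, y_pred)
print(importance)  # positive for feature 0, negative for feature 1, near zero for feature 2
```

      Correlating features with the predictions, rather than reading off model weights, keeps the sign and size of each value interpretable even when features are correlated with one another.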

      Results

      dwMRI

      Line 331: “Overall, models based on structural connectivity metrics performed better than TBSS and probabilistic tractography (Fig. 3). TBSS, in turn, performed better than probabilistic tractography (Fig. 3 and Table S13). The number of streamlines connecting brain areas parcellated with aparc MSA-I had the best predictive performance among all dwMRI neuroimaging phenotypes (R<sup>2</sup><sub>mean</sub> = 0.052, r<sub>mean</sub> = 0.227, 95% CI [0.212, 0.235]). To identify features driving predictions, we correlated streamline counts in aparc MSA-I parcellation with the predicted g-factor values from the PLSR model. Positive associations with the predicted g-factor were strongest for left superior parietal-left caudal anterior cingulate, left caudate-right amygdala, and left putamen-left hippocampus connections. The most marked negative correlations involved left putamen-right posterior thalamus and right pars opercularis-right caudal anterior cingulate pathways (Fig. 5 and Supplementary Fig. S2).”

      rsMRI

      Line 353: “Among RSFC metrics for 55 and 21 ICs, tangent parameterization matrices yielded the highest performance in the training set compared to full and partial correlation, as indicated by the cross-validation score. Functional connections between the limbic (IC10) and dorsal attention (IC18) networks, as well as between the ventral attention (IC15) and default mode (IC11) networks, displayed the highest positive association with cognition. In contrast, functional connectivity between the limbic (IC43, the highest activation within network) and default mode (IC11) and limbic (IC45) and frontoparietal (IC40) networks, between the dorsal attention (IC18) and frontoparietal (IC25) networks, and between the ventral attention (IC15) and frontoparietal (IC40) networks, showed the highest negative association with cognition (Fig. 5 and Supplementary Fig. S3 and S4).”

      sMRI

      Line 373: “FreeSurfer subcortical volumetric subsegmentation and ASEG had the highest performance among all sMRI neuroimaging phenotypes (R<sup>2</sup><sub>mean</sub> = 0.068, r<sub>mean</sub> = 0.244, 95% CI [0.237, 0.259] and R<sup>2</sup><sub>mean</sub> = 0.059, r<sub>mean</sub> = 0.235, 95% CI [0.221, 0.243], respectively). In FreeSurfer subcortical volumetric subsegmentation, volumes of all subcortical structures, except for left and right hippocampal fissures, showed positive associations with cognition. The strongest relations were observed for the volumes of bilateral whole hippocampal head and whole hippocampus (Fig. 5 and Supplementary Fig. S5 for feature importance maps). Grey matter morphological characteristics from ex vivo Brodmann Area Maps showed the lowest predictive performance (R<sup>2</sup><sub>mean</sub> = 0.008, r<sub>mean</sub> = 0.089, 95% CI [0.075, 0.098]; Fig. 3 and Table S15).”

      Discussion

      dwMRI

      Line 562: “Among dwMRI-derived neuroimaging phenotypes, models based on structural connectivity between brain areas parcellated with aparc MSA-I (streamline count), particularly connections with bilateral caudal anterior cingulate (left superior parietal-left caudal anterior cingulate, right pars opercularis-right caudal anterior cingulate), left putamen (left putamen-left hippocampus, left putamen-right posterior thalamus), and amygdala (left caudate-right amygdala), result in a neural indicator that best reflects microstructural resources associated with cognition, as indicated by predictive modeling, and more importantly, shares the highest proportion of the variance with mental health-g, as indicated by commonality analysis.”

      rsMRI

      Line 583: “We extend findings on the superior performance of rsMRI in predicting cognition, which aligns with the literature [15, 28], by showing that it also explains almost a third of the variance in cognition that mental health captures. At the rsMRI neuroimaging phenotype level, this performance is mostly driven by RSFC patterns among 55 ICA-derived networks quantified using tangent space parameterization. At a feature level, these associations are best captured by the strength of functional connections among limbic, dorsal attention and ventral attention, frontoparietal and default mode networks. These functional networks have been consistently linked to cognitive processes in prior research [127–130].”

      sMRI

      Line 608: “Integrating information about brain anatomy by stacking sMRI neuroimaging phenotypes allowed us to explain a third of the link between cognition and mental health. Among all sMRI neuroimaging phenotypes, those that quantified the morphology of subcortical structures, particularly volumes of bilateral hippocampus and hippocampal head, explain the highest portion of the variance in cognition captured by mental health. Our findings show that, at least in older adults, volumetric properties of subcortical structures are not only more predictive of individual variations in cognition but also explain a greater portion of cognitive variance shared with mental health than structural characteristics of more distributed cortical grey and white matter. This aligns with the Scaffolding Theory that proposes stronger compensatory engagement of subcortical structures in cognitive processing in older adults [138–140].”

      (5) The formatting of some figure legends could be improved for clarity - for example, some subheadings were not formatted in bold (e.g., Figure 2 c)

      Thank you for noticing this. We have updated the figures to enhance clarity, keeping subheadings plain while bolding figure numbers and MRI modality names.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      Summary:

      The experiment is interesting and well executed and describes in high detail fish behaviour in thermally stratified waters. The evidence is strong but the experimental design cannot distinguish between temperature and vertical position of the treatments.

      Strengths:

      High statistical power, solid quantification of behaviour.

      Weaknesses:

      A major issue with the experimental design is the vertical component of the experiment. Many thermal preference and avoidance experiments are run using horizontal division in shuttlebox systems or in annular choice flumes. These remove the vertical stratification component so that hot and cold can be compared equally, without the vertical layering as a confounding factor. The method chosen, with its vertical stratification, is inherently unable to control for this effect because warm water is always above, and cold water is always below. This complicates the interpretations and makes firm conclusions about thermal behaviour difficult.

      We highly appreciate this evaluation and have addressed the reviewer’s specific comments below.

      The sentence "Further, the metabolic performance (and thus functions including growth, reproduction, and locomotion) of ectotherms takes the form of a bell-shaped curve as a function of temperature6, peaking within a range of optimal temperatures (the 'preferendum') and going to zero at lower and upper temperature limits7." contains several over-simplifications and misconceptions:

      (1) Thermal performance curves are never bell-shaped.

      (2) The optimum for various traits often shows different TPCs.

      (3) The preferendum rarely lines up with the thermal optimum for various trait TPCs.

      (4) Performance for various traits rarely reaches zero at upper or lower limits, instead they can reach zero at less extreme temperatures (e.g. growth) or maintain high function all the way up to and sometimes beyond thermal limits (e.g. aerobic scope, heart rate).

      We highly appreciate this input. We have replaced that sentence with: L69-71: “Because temperature influences the rates of most physiological processes, rapid warming or cooling can affect fish performance traits, including metabolic rates, swimming ability, and thermal tolerance (Jutfelt et al. 2024).”

      The use of adaptation instead of acclimation is confusing. Adaptation should be reserved for evolutionary change. This is an issue in several parts of the manuscript.

      Thanks for this input, we have replaced the word adapt with acclimate in two instances: L79 and L398.

      It is not true that "very few quantitative studies of thermotaxis have been conducted in fish". There exists an extensive literature on thermal preference and avoidance in fish that the manuscript downplays.

      Thanks a lot for this input. We understand that thermal preference is ultimately driven by mechanistic responses to thermal gradients, and that thermotaxis and thermokinesis are the two mechanisms used by fish to navigate heterothermal environments. Our study and analysis are focused on understanding these mechanisms in vertically stratified conditions, not to understand thermal preferences per se. We have modified our text to clarify this aspect. Our literature review was focused on the behavioral mechanisms and our understanding is that the establishment of thermal preferences has a different goal compared to understanding how fish respond to rapid changes in water temperature. We have deleted that sentence and replaced it by (L107-110): “While the thermal preference of fish is a well-established field of research, very few quantitative studies of the behavioral mechanisms allowing fish to seek their preferendum (i.e. thermotaxis) have been conducted in fish.”

      (Methods) It is unclear why the blue dye was used in all experiments. The fish can see the differently coloured water layer and that may have affected their choices. Five control trials without dye were run but finding no difference there could also be due to low statistical power.

      We appreciate this comment. The blue dye was used to visualize the precise location of the thermal interface and was therefore necessary in all experiments (see Methods section ‘Visualization and evolution of the thermal interface’). We acknowledge that fish can perceive the colored water layer, but since the dye concentration and resulting color intensity were consistent across all treatments, we do not see how it could have acted as a confounding variable. While we recognize the possibility of some behavioral influence from the dye, the clear behavioral differences across treatments indicate that it was not a determining factor. To emphasize this we have added the following to the manuscript (L701-703): “Furthermore, because the dye concentration and resulting color intensity were consistent across all treatments, the dye did not act as a confounding variable in our statistical comparisons.”

      Regarding statistical power, our control experiment without dye (N = 16 fish, 4 replicates; see Fig. S34 and S35) provides sufficient statistical power to assess whether the dye influenced behavior. The reviewer indicated that the high statistical power was a strength of the paper, which aligns with our view that our study design enables robust statistical comparisons. It seems contradictory that statistical power is a concern for the control trials, given that our main experiments were conducted with a similar sample size. Indeed, the number of replicates used is consistent with similar studies and balances statistical rigor with the ethical goal of reducing the number of animals used in experimentation. To emphasize this, we have added the following to the manuscript (L865-868): “The number of replicates used in this study reflects a balance between statistical rigor and the ethical imperative to minimize the use of animals in experimentation. Regarding statistical power, our design (five replicates with groups of four fish each) is consistent with similar studies and represents an adequate sample size.”

      A major issue with the experimental design is the vertical component of the experiment. Many thermal preference and avoidance experiments are run using horizontal division in shuttlebox systems or in annular choice flumes. These remove the vertical stratification component so that hot and cold can be compared equally, without the vertical layering as a confounding factor. The method chosen, with its vertical stratification, is inherently unable to control for this effect because warm water is always above, and cold water is always below. This complicates the interpretations and makes firm conclusions about thermal behaviour difficult. This issue should be thoroughly discussed.

      Thank you very much for this comment. We revised the manuscript accordingly, to clearly indicate that our goal was to assess the response of fish to vertically thermally stratified water, a scenario that occurs frequently in nature. We have added the following paragraph to the discussion (L523-530): “However, a generalization of our observations to horizontally oriented thermal gradients remains elusive. Our results are inherently tied to the vertical stratification created in our experiments. As warm water was always positioned above and cold water below, we could not control for the effect of vertical position (i.e., we could not do cold over warm layer experiments). This limits our ability to directly compare our findings to those obtained from horizontally oriented thermal gradients. On the other hand, the case we addressed is of direct environmental relevance, as natural waters often experience vertical thermal stratification.”

      It is unclear why the authors assume an "optimal temperature" (undefined for which trait) of 12°C for brown trout parr, and why they assume the preference temperature would match that "optimal" temperature. The thermal biology for any fish species is more complex than a single perfect temperature, with various traits showing differing optima and often a mismatch with the preferred temperature. The literature suggests brown trout growth optima between 13 and 16°C, and preference temperature has even been suggested to be as high as 21°C. In light of this, the authors' conclusion that brown trout avoid cold and don't avoid warm water is possibly misguided. It is possible that the brown trout had a preference temperature higher than 12°C, which should be acknowledged and discussed.

      This is indeed a very important aspect, which was partly (but indeed not fully) already addressed in the discussion. To reflect these considerations, we have expanded the existing paragraph in the discussion (additions are in yellow). (L422 - L439): “We conclude from the behavior of fish when warmer water was available that their acute thermal preferendum exceeded 12 °C, departing from the acclimation temperature we had chosen based on the thermal preferendum for trout reported in literature[33]. Indeed, the thermal biology for any fish species is more complex than a single, static thermal preferendum: Many internal and external factors, such as hypoxia, satiation, time of day, and life stage[5], can influence the temperature preference of fish. For example, the level of satiation can have an impact because when fish are well fed, their growth rate increases with body temperature as metabolic performance increases[40]. This modifies the preferred temperature, as observed in Bear Lake sculpin (Cottus extensus) that ascend into warmer water after feeding to stimulate digestion and thereby achieve a three-fold higher growth rate[41]. In contrast, field studies with adult fish have observed movement from warm to cold water in summer[42,43], allowing fish to lower their metabolic rate, likely in effort to conserve energy[2,44]. We propose that the behavior of trout parr upon exposure to warmer water in our experiments served to achieve a higher body temperature to ultimately increase growth rate, which is critical for this life stage[45,46]. Indeed, growth experiments on brown trout populations have shown that optimal growth temperatures can range between 15 and 19 °C, depending on the stream of origin[46].”

      The figures are unnecessarily complex and introduce a long list of abbreviations and Greek characters for no apparent reason. There are many simpler ways for showing the results so unclear why they are so opaque.

      We appreciate the reviewer’s feedback and agree on the importance of clarity. However, in the absence of specific suggestions, we did not make changes to the figures or the use of Greek characters (which align with convention), as we believe they effectively convey the results. We highlight that the data themselves are very rich (multiple fish, multiple phases, multiple treatments, etc.) and we wanted to convey this richness in a compact and transparent manner.

      Reviewer #2:

      This paper investigates an interesting question: how do fish react to and avoid thermal disturbances from the optimum that occur on fast timescales? Previous work has identified potential strategies for warm avoidance in fish on short timescales while strategies for cold avoidance are far more elusive. The work combines a clever experimental paradigm with careful analysis to show that trout parr avoid cold water by limiting excursions across a warm-cold thermal interface. While I found the paper interesting and convincing overall, there are a few omissions and choices in the presentation that limit interpretability and clarity.

      A main question concerns the thermal interface itself. The authors track this interface using a blue dye that is mixed in with either colder or warmer water before a gate is opened that leads to gravitational flow overlaying the two water temperatures. The dye likely allows the identification of convective currents, which could lead to rapid mixing of water temperatures. However, it is less clear whether it accurately reflects thermal diffusion. This is problematic as the authors identify upward turning behavior around the interface which appears to be the behavioral strategy for avoiding cold water but not warm water. Without knowing the extent of the gradient across the interface, it is hard to know what the fish are sensing. The authors appear to treat the interface as essentially static, leading them to the conclusion that turning away before the interface is reached is likely related to associative learning. However, thermal diffusion could very likely create a gradient across centimeters which is used as a cue by the fish to initiate the turn. In an ideal world, the authors would use a thermal camera to track the relationship between temperature and the dye interface. Absent that, the simulation that is mentioned in passing in the methods section should be discussed in detail in the main text, and results should be displayed in Figure 1. Error metrics on the parameters used in the simulation could then be used to identify turns in subsequent figures that likely are or aren't affected by a gradient formed across the interface.

      The authors assume that the thermal interface triggers the upward-turning behavior. However, an alternative explanation, which should be discussed, is that cold water increases the tendency for upward turns. This could be an adaptive strategy since, for temperatures > 4 °C, swimming upwards is likely a good strategy to reach warmer water.

      The paper currently also suffers from a lack of clarity which is largely created by figure organization. Four main and 38 supplemental figures are very unusual. I give some specific recommendations below but the authors should decide which data is truly supplemental, versus supporting important points made in the paper itself. There also appear to be supplemental figures that are never referenced in the text which makes traversing the supplements unnecessarily tedious.

      The N that was used as the basis for statistical tests and plots should be identified in the figures to improve interpretability. To improve rigor, the experimental procedures should be expanded.

      Specifically, the paper uses two thermal models which are not detailed at all in the methods section.

      We appreciate these crucial comments to our paper. We have addressed these points in detail below.

      As stated above, a characterization of the thermal interface is critical. Ideally via measurement or at least by expanding on the simulation.

      We appreciate the idea of using thermal cameras and, indeed, we had initially tried to use them. However, thermal cameras generally cannot see through plexiglass or glass-like material due to the way infrared radiation interacts with these materials. While thin plastics can transmit some infrared, thicker plastics and reflective materials like glass tend to block or reflect infrared light.

      We have attempted to better characterize the thermal interface thickness, namely the spatial extent of the thermal gradient over the time period of our experiments (20 min). Indeed, our simulations in the original SI were conducted precisely to estimate the thermal interface thickness, though based on thermal diffusion in still water, while turbulence generated by the moving gravity current can smear out the interface, particularly in the initial phase. To account for this in the revised manuscript, we adopted a phenomenological approach to estimate the initial increase in thickness of the thermal interface due to turbulence and present this refined simulation in our manuscript.

      Our analysis suggests that, rather than assuming an initial interface thickness of zero (as in the original version of the manuscript), the thermal diffusion simulations should begin with an initial thickness of 2.8 mm in TR1. To incorporate this adjustment, we set the initial interface thickness to 2.8 mm and ran the simulation forward for t = 20 min, assuming diffusion. This approach resulted in a final interface thickness ranging between 4 and 6 cm (see Fig. 29 in the Supplementary Information).
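      As a rough cross-check of the order of magnitude described above, the broadening of an interface by pure diffusion can be sketched analytically. This is a minimal sketch and not the authors' simulation: it assumes a one-dimensional error-function temperature profile, a textbook thermal diffusivity of water of about 1.4e-7 m²/s, and a thickness defined as the distance between the levels that bracket a chosen fraction (here 80% or 90%) of the temperature jump. The function name and the thickness definition are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist

KAPPA_WATER = 1.4e-7  # m^2/s, assumed textbook thermal diffusivity of water


def interface_thickness(t_s, delta0_m, kappa=KAPPA_WATER, frac=0.9):
    """Thickness (m) of a diffusing erf-shaped interface after t_s seconds.

    Assumption: thickness is the distance between the two levels that
    bracket `frac` of the total temperature jump. An initial thickness
    delta0_m is treated as a sharp step that has already diffused for a
    "virtual" time t0.
    """
    # erfinv(frac) via the normal quantile: erf(x) = 2*Phi(x*sqrt(2)) - 1
    x = NormalDist().inv_cdf((1.0 + frac) / 2.0) / sqrt(2.0)
    c = 4.0 * x  # thickness = c * sqrt(kappa * t) for an initially sharp step
    t0 = (delta0_m / c) ** 2 / kappa  # virtual time matching delta0_m
    return c * sqrt(kappa * (t0 + t_s))


if __name__ == "__main__":
    for frac in (0.8, 0.9):
        d = interface_thickness(20 * 60, 2.8e-3, frac=frac)
        print(f"frac={frac}: thickness after 20 min = {100 * d:.1f} cm")
```

      Under these assumptions, an initial thickness of 2.8 mm grows to a few centimetres after 20 min, which is the same order as the range quoted in the response. The exact value depends on how the thickness is defined.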

      To reflect this refinement, we have added a new paragraph (L717-758: "Characterization of the thermal gradient") to the Methods section. Additionally, we have updated Fig. S29 in the Supplementary Information and included an average (over time and across treatments) gradient thickness of 5 cm in Figs. 2 and 3 of the manuscript. The revised Figs. 2 and 3 now explicitly indicate the estimated vertical extent of the thermal gradient, with an extended caption detailing these changes.

      The simulation should be detailed in the methods so that its validity can be evaluated and ideally, it should involve curved interfaces as encountered in the experiment.

      To account for the effect of turbulence during the initial, inertia-dominated phase after the gate removal, we have provided a correction for the initial thickness of the interface (see the addition to the Methods section). Thank you for your suggestion regarding the incorporation of curved interfaces in the simulations. We believe that including curved interfaces in the simulations would not significantly affect the results. As shown in the manuscript, the interface is curved primarily during the initial phase of the process (first 2 min where the flow is inertia-dominated), which is currently not included in our data analysis (phase 1 begins 2 min after the gate removal).

      In that vein, distances from the interface rather than height above the interface should be reported for the fish.

      We acknowledge the reviewer’s suggestion to report distances from the interface rather than height above or below it. However, beyond the initial phase, we do not see a strong justification for using the orthogonal distance over the vertical distance, as the choice is inherently arbitrary (e.g., one could also measure the distance along the fish’s orientation vector). We have therefore kept our assessment based on the vertical distance.

      Absent measurements, the paragraph on associative learning should be struck from the discussion as it is purely speculative.

      We agree that the original paragraph on associative learning may have sounded overly speculative. However, after updating our manuscript with additional simulations of the thermal gradient's vertical extent, we found that fish perform upward turns not only above the thermal interface, but also before entering the thermal gradient itself. This observation makes us hesitant to attribute the response solely to thermotaxis. We believe it is essential to provide a plausible explanation, albeit speculative, for how fish initiate these turns before directly encountering the cold-water gradient. To support this, we have extended the discussion in this paragraph and added Supplementary Fig. 39. The new text now reads (additions in yellow): (L487 – 499): “Our findings show that fish were able to perform upward turns while still located above the thermal interface, that is, before actually sampling the cold water below the interface. In fact, our simulation of the vertical extent of the thermal gradient revealed that a substantial fraction of upward turns occurred before fish encountered the gradient itself, that is, prior to any sensory detection of the temperature change (Supplementary Fig. 39). This finding may be evidence of associative learning, whereby fish used information regarding the presence of colder water at depth obtained at prior times. While the current data do not provide conclusive evidence in this regard, they prompt the possibility that, rather than responding solely to immediate thermal cues, fish use spatial memory or associative learning to anticipate the location of colder water based on prior experience. Indeed, fish are able to perform associative learning based on non-visual cues[53], create mental maps of their surroundings[54] and retain memory for hours[55], days[56] and months[57,58].”

      The body-temperature simulations need to be detailed in the methods.

      Thanks for this comment. We have removed the supplementary text section and have included the paragraph “Body cooling during cold-water excursions” into the methods section of our manuscript (L804 - L829).

      Constant temperature experiments could be helpful in addressing the importance of a gradient/interface for triggering upward turning

      We agree; however, we were limited (for ethical reasons) to a maximum number of fish we could use in the experiments. Hence, we focused on getting approval to run experiments focused on the responses to thermal gradients. However, occupancy during the acclimation phase at 12 °C showed that fish were much more stationary and primarily occupied the lower half of the tank.

      A lot of ease of reading could be gained by labeling the conditions according to either the second temperature or perhaps even better the delta temperature (i.e. TR[-2C] instead of TR1).

      We agree that labeling conditions by the second temperature or delta temperature could in principle improve readability. However, since T_bottom and T_top are explicitly mentioned in each main figure at least once, they can be directly associated with the respective treatment. Therefore, we have opted to retain the current labeling for consistency.

      The figure legends are often short and do not accurately label all figure elements. This is especially true for supplemental figure legends which often appear rushed (e.g., the legend for Figure S2 stops mid-sentence, the legend of Figure S3 does not indicate what Ttop or Tbottom are).

      We appreciate the reviewer’s comment and have carefully revised all figure legends to ensure clarity and completeness. Specifically, we have corrected figure labels, expanded the descriptions for supplemental figures, and ensured that all elements are accurately defined. For instance, we have completed the legend for Figure S2 and clarified the definitions of T_top and T_bottom in Figure S3. Additionally, we have systematically reviewed all figure legends to prevent inconsistencies and omissions.

      For Figure S3, to improve clarity, plotting the standard deviation at different points in the tank across the phases could be more informative than the hard-to-distinguish multi-line plots in different shades of red.

      We appreciate the reviewer’s suggestion regarding Figure S3. However, the primary goal of this figure is to illustrate how the thermal interface moves over time. While plotting the standard deviation at different points in the tank could provide additional statistical insights, it would detract from the intended visualization of the interface dynamics. For this reason, we have opted to retain the current multi-line representation. Nevertheless, we have ensured that the figure is as clear as possible by refining the color contrast and improving the legend for better readability.

      There is an inconsistency in in-text citation styles (mixture of superscript and numbers in brackets).

      Thank you for pointing this out. We have carefully reviewed the manuscript and corrected any inconsistencies in the in-text citation style to ensure uniform formatting throughout.

      While the statement in the introduction, that increases in movement frequency could be purely metabolic in nature is correct, at least for larval zebrafish it has been shown that sensory neural activity is predictive of motor neuron activity and swim rates (Haesemeyer, 2018, cited by the authors).

      This is an interesting finding. It is however unclear to us why this information is crucial in our context of brown trout parr.

      Examples of summary results from Supplementary Figures 8-10 should be bundled in a main text figure since this appears to be important information supporting the conclusions.

      We agree that Supplementary Figures 8–10 contain important information (i.e. Boxplots) on vertical occupancy and the time individuals spent in different water temperatures. However, this information is already integrated into Figure 2C, D, F, and G, which display the vertical distributions of fish across treatments and over time. Given the current length of the manuscript, adding another main-text figure could dilute rather than enhance clarity. For this reason, we have opted to keep these details in the Supplementary Materials while ensuring they are appropriately referenced in the main text.

      The distributions of excursion length for all treatments should be graphed in a main figure to support the point made in the third paragraph of the "Trout parr... do not avoid warm water" section of the results.

      We appreciate the reviewer’s suggestion. However, we do not believe that plotting excursion length is necessary to support this statement, as the key finding is already well represented in the manuscript. Specifically, the transition to bimodal depth occupancy, with fish spending comparable time above and below the interface in warm-water treatments (TR6–TR9), is clearly conveyed in Figure 2F and Supplementary Figure 8B. Additionally, this information is explicitly stated in the results section (L235): "Fish did not avoid warmer water in any of the warm-water treatments (TR6–TR9). Instead, fish transitioned to a bimodal depth occupancy, with comparable time spent above and below the interface (Fig. 2F; Supplementary Fig. 8B)." Given this, we believe that adding an additional figure would not enhance clarity but may instead introduce redundancy.

      There should be a main figure panel that statistically compares the turn biases around the interface for the different conditions and the +/- 5cm interface line mentioned in the text should be visualized in the appropriate figures - incidentally, this length scale is on par with the diffusion seen in simulations further suggesting that fish in fact sense a gradient here rather than remembering an interface.

      To address the reviewer’s comment, we have made the following updates:

      • Extended and incorporated simulations of the thermal interface thickness (see Methods and Supplementary Fig. 29).

      • Plotted the vertical locations of up-turning events relative to the phase-averaged position of the thermal interface (see Supplementary Fig. 39), which includes the estimated 5 cm vertical extent of the thermal gradient.

      • Added the thermal interface thickness to the main figures (Fig. 3F,G and Fig. 2E,H) where applicable.

      While we do not claim that memory alone explains cold-water avoidance, our data still suggest that it may contribute to the observed behavior, particularly since a substantial number of upturns occurred before the fish entered the thermal gradient (see also Author response image 1 below). Our aim is not to statistically disentangle the relative contribution of thermotaxis versus associative learning, but to propose a plausible interpretation of this observed anticipatory behavior with due caution to clarify that this is only a possibility.

      Given that the thermal gradient is now visualized and characterized in detail, we respectfully suggest that an additional statistical comparison of turn biases would not add further clarity. We believe there is evidence that vertical turning, away from the cold, occurred within and above the thermal gradient. However, we welcome the reviewer’s perspective and, to demonstrate that turning points occur outside and above the thermal interface, we have plotted them against gradient growth over time (see Author response image 1 below).

      Author response image 1.

      The colored area indicates the temporal growth of thermal interface thickness.

      Reviewer #3:

      In this study, the authors measured the behavioural responses of brown trout to the sudden availability of a choice between thermal environments. The data clearly show that these fish avoid colder temperatures than the acclimation condition, but generally have no preference between the acclimation condition or warmer water (though I think the speculation that the fish are slowly warming up is interesting). Further, the evidence is compelling that avoidance of cold water is a combination of thermotaxis and thermokinesis. This is a clever experimental approach and the results are novel, interesting, and have clear biological implications as the authors discuss. I also commend the team for an extremely robust, transparent, and clear explanation of the experimental design and analytical decisions. The supplemental material is very helpful for understanding many of the methodological nuances, though I admit that I found it overwhelming at times and wonder if it could be pruned slightly to increase readability. Overall, I think the conclusions are generally well-supported by the data, and I have no major concerns.

      Minor comments

      P2 intro paragraphs 1/3 - it is not clear that thermal preference generally reflects the thermal optimum, partly because it is not clear what trait is being optimized (fitness?). Some nuance here would be helpful, and would also link nicely to the discussion on p10.

      Thank you for this comment. We have now refined this section as follows (L67–71): "As most fish species are ectotherms, their body temperature fluctuates with the surrounding water temperature. Because temperature influences the rates of most physiological processes, rapid warming or cooling can affect fish performance traits, including metabolic rates, swimming ability, and thermal tolerance[6]."

      To further clarify how thermal preference relates to thermal optimum and what trait is being optimized, we have incorporated additional nuance in this section. Specifically, we now acknowledge that thermal preference may not always align with the thermal optimum for performance or fitness.

      P2 intro paragraph 2 - "adapt physiologically" implies evolution, but here you are referring to plasticity. Suggest saving the word "adapt/adaptation" for evolutionary changes (see also p9).

      Thank you for this comment. We have revised the wording to "acclimate physiologically" (L79) to more accurately reflect plastic responses rather than evolutionary adaptation.

      P7 - "This difference in probabilities (ρup - ρdown) was particularly large in the region immediately above and below the interface (-5 cm < D < 5 cm; Fig. 3F) and is a hallmark of a thermotactic behavior." I agree that the result provides compelling evidence for thermotaxis, but would it be possible to bolster this case by statistically testing for a difference in probabilities among the treatment groups here?

      In addition to Fig. 3F, we are presenting statistical evidence that for colder water temperatures, fish penetrate less deeply into the cold lower water. The decreasing trend was statistically significant (Mann–Kendall test, p < 0.001; Supplementary Table 6) and is presented in Fig. 4C. The depth reached during each cold-water excursion is determined by the location of the vertical turning point, which redirects the fish upward toward the surface. We think this is sufficient evidence for thermotaxis.
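      For readers unfamiliar with the Mann–Kendall trend test mentioned above, the following is a minimal sketch of how it works, using the normal approximation with a tie correction. The input series is invented for illustration only and is not data from the manuscript. How the authors ordered their observations and which implementation they used are not stated here, so both are assumptions.

```python
from collections import Counter
from math import erfc, sqrt


def mann_kendall(values):
    """Two-sided Mann-Kendall trend test (normal approximation, tie-corrected).

    Returns (S, z, p), where S is the sum of signs of all pairwise
    later-minus-earlier differences.
    """
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of the difference

    # Variance of S, reduced for groups of tied values
    tie_sizes = Counter(values).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)) / 18.0
    if var_s == 0:
        return s, 0.0, 1.0  # no information (too few points or all tied)

    # Continuity-corrected z score
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0
    p = erfc(abs(z) / sqrt(2.0))  # two-sided p-value
    return s, z, p


if __name__ == "__main__":
    # Hypothetical excursion depths (cm), ordered from the mildest to the
    # coldest treatment; NOT values from the study.
    depths = [12.0, 10.5, 9.8, 7.1, 6.4, 4.9, 3.2]
    s, z, p = mann_kendall(depths)
    print(f"S = {s}, z = {z:.2f}, p = {p:.4f}")
```

      A negative S with a small p-value indicates a monotonic decrease along the ordered series, which is the kind of trend the response describes for excursion depth.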

      P9 paragraph 3 - "recent studies suggest that fish may instead respond to temporal changes of their internal body temperature." It seems like a citation is missing here. Would be useful to briefly summarize the evidence for internal temperature sensing that is the basis of this modelling exercise.

      Thanks, we have added that citation (L385).

      P10 "Our findings provide the first experimental evidence for this mode of behavioral thermoregulation in which fish navigate their heterothermal environment to achieve gradual body warming."

      I think this statement overreaches given the presented data. While there may be a trend towards fish in the warm treatment spending increasing amounts of time in the upper half of the tank, I do not see this pattern supported statistically. There is also no evidence of gradual body warming, and even if there was I disagree that this would constitute experimental evidence that this was happening "intentionally". By this reasoning, any shuttlebox experiment in which fish actively shuttle between relatively warm and cool sides to end up with a preference that is above the starting condition would also constitute evidence for gradual warming. Overall, this is an interesting pattern, but I do not think there is sufficient evidence to conclude that fish are strategically warming.

      We appreciate the reviewer’s comment and acknowledge that our original wording may have overstated the evidence. We have revised the sentence to better reflect the evidence presented (L411-415): “Our observations resemble this mode of behavioral thermoregulation, in which fish progressively favor warmer regions within a heterothermal environment. However, additional experimental evidence is required to determine the mechanisms underlying this behavior.”

      P11 "Despite the avoidance response of cold water, fish engaged in repeated cold-water excursions..."

      This is an interesting speculation, but I think it would be helpful to also point out that these fish are biased towards the bottom of the tank (based on control measurements) and this pattern may therefore simply reflect a desire to be lower in the water column.

      Thank you for this helpful comment. We have now added this point to the revised text, which reads (L475-477): “Despite the avoidance response to cold water, fish engaged in repeated cold-water excursions, potentially reflecting a behavioral strategy to map the thermal environment. This pattern may also reflect an inherent tendency to occupy the lower part of the tank, as observed during homogeneous temperature of 12 °C during the acclimation phase.”

      P13 - why was the dye always added to the right side of the tank, instead of being assigned to a side randomly? I think the control experiment is good evidence that the dye did not substantially affect behaviour, but it seems like it would have been nice to separate dye and novel temperature exposure.

      We agree that randomizing the side of dye application would have been ideal. The dye was consistently added to the right side to maintain procedural consistency, ensuring that the “incoming” or “novel” temperature was always dyed. That said, our control experiment provides strong evidence that the dye itself did not influence behavior (as discussed above and in the manuscript).

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      Evidence, reproducibility and clarity

      The manuscript by Egawa and colleagues investigates differences in nodal spacing in an avian auditory brain stem circuit. The results are clearly presented and data are of very high quality. The authors make two main conclusions:

      (1) Node spacing, i.e. internodal length, is intrinsically specified by the oligodendrocytes in the region they are found in, rather than axonal properties (branching or diameter).

      (2) Activity is necessary (we don't know what kind of signaling) for normal numbers of oligodendrocytes and therefore the extent of myelination.

      These are interesting observations, albeit phenomenological. I have only a few criticisms that should be addressed:

      (1) The use of the term 'distribution' when describing the location of nodes is confusing. I think the authors mean rather than the patterns of nodal distribution, the pattern of nodal spacing. They have investigated spacing along the axon. I encourage the authors to substitute node spacing or internodal length for node distribution.

      Thanks for your suggestion to avoid confusion. We used the phrase "nodal spacing" instead of "nodal distribution" throughout the revised manuscript.

      (2) In Seidl et al. (J Neurosci 2010) it was reported that axon diameter and internodal length (nodal spacing) were different for regions of the circuit. Can the authors help me better understand the difference between the Seidl results and those presented here?

      As a key distinction, our study focuses specifically on the main trunk of the contralateral projection of NM axons. This projection features a sequential branching structure known as the delay line, where collateral branches form terminal arbors and connect to the ventral dendritic layer of NL neurons. This structural organization plays a critical role in influencing the dynamic range of ITD detection by regulating conduction delays along the NM axon trunk.

      The study by Seidl et al. (2010) is a pioneering work that measured the diameter of NM axons using electron microscopy, providing highly reliable data. However, due to the technical limitations of electron microscopy, which does not allow for the continuous tracing of individual axons, it is not entirely clear whether the axons measured in the ventral NL region correspond to terminal arbors of collateral branches or the main trunk of NM axons (see Figure 9E, F in their paper). Instead, they categorized axon diameters based on their distance from the NL cell layer, showing that axon diameter increases distally (see Figure 9G in their paper). Notably, the diameters of ventral axons located more than 120 μm away from the NL cell layer are almost identical to those in the midline.

      As illustrated in our Figure 4D and Supplementary Video 2, the main trunk of the contralateral NM projection is predominantly located in these distal regions. Therefore, our findings complement those of Seidl et al. (2010) rather than contradicting them. We made this point as clear as possible in text (page 7, line 3).

      (3) The authors looked only in very young animals - are the results reported here applicable only to development, or does additional refinement take place with aging?

      In this study, we examined chick embryos from E9 to just before hatching (E21) and post-hatch chicks up to P9. Chickens begin to perceive sound around E12 and possess sound localization abilities at the time of hatching (Grier et al., 1967) (added to page 4, line 9). Therefore, by E21, the sound localization circuit is largely established.

      On the other hand, additional refinement of the circuit with aging is certainly possible. A key cue for sound localization, interaural time difference (ITD), depends on the distance between the two ears, which increases as the animal grows. As shown in Figure 2G, internodal length increased by approximately 20% between E18 and P9 while maintaining regional differences. Given that NM axons are nearly fully myelinated by E21 (Figure 4D, 6C), this suggests that myelin extends in proportion to the overall growth of the head and brain volume. We described this possibility in the text (page 5, line 21).

      Thus, our study covers not only the early stages of myelination but also the post-functional maturation in the sound localization circuit.

      (4) The fact that internodal length is specified by the oligodendrocyte suggests that activity may not modify the location of nodes of Ranvier - although again, the authors have only looked during early development. This is quite different than this reviewer's original thoughts - that activity altered internodal length and axon diameter. Thus, the results here argue against node plasticity. The authors may choose to highlight this point or argue for or against it based on results in adult birds?

      In this study, we demonstrated that although vesicular release did not affect internodal length, it selectively promoted oligodendrogenesis, thereby supporting the full myelination and hence the pattern of nodal spacing along the NM axons. We believe that this finding falls within the broader scope of 'activity-dependent plasticity' involving oligodendrocytes and nodes.

      As summarized in the excellent review by Bonetto et al. (2021), activity-dependent plasticity in oligodendrocytes encompasses a wide range of phenomena, not limited to changes in internodal length but also including oligodendrogenesis. Moreover, the effects of neuronal activity are not uniform but likely depend on the diversity of both neurons and oligodendrocytes. For example, in the mouse visual cortex, activity-dependent myelination occurs in interneurons but not in excitatory neurons (Yang et al., 2020). Additionally, expression of TeNT in axons affected myelination heterogeneously in zebrafish; some axons were impaired in myelination and the others were not affected at all (Koudelka et al., 2016). In the mouse corpus callosum, neuronal activity influences oligodendrogenesis, which in turn facilitates adaptive myelination (Gibson et al., 2014).

      Thus, rather than refuting the role of activity-dependent plasticity in nodal spacing, our findings emphasize the diversity of underlying regulatory mechanisms. We described these explicitly in text (page 10, line 18).

      Significance

      This paper may argue against node plasticity as a mechanism for tuning of neural circuits. Myelin plasticity is a very hot topic right now and node plasticity reflects myelin plasticity. This seems to be a circuit where perhaps plasticity is NOT occurring. That would be interesting to test directly. One limitation is that this is limited to development.

      This paper does not argue against node plasticity, but rather demonstrates that oligodendrocytes in the NL region exhibit a form of plasticity; they proliferate in response to vesicular release from NM axons, yet do not undergo morphological changes, ensuring adequate oligodendrocyte density for the full myelination of the auditory circuit. Thus, activity-dependent plasticity involving oligodendrocytes would contribute in various ways to each neural circuit, which is presumably attributable to the fact that myelination is driven by complex multicellular interactions between diverse axons and oligodendrocytes. Oligodendrocytes are known to exhibit heterogeneity in morphology, function, responsiveness, and gene profiles (Foerster et al., 2019; Sherafat et al., 2021; Osanai et al., 2022; Valihrach et al., 2022), but the functional significance of this heterogeneity remains largely unclear. This paper also provides insight into how oligodendrocyte heterogeneity may contribute to the fine-tuning of neural circuit function, adding further value to our findings. Importantly, our study covers a wide range of development in the sound localization circuit, from pre-myelination (E9) to post-functional maturation (P9), revealing how the nodal spacing pattern along the axon in this circuit emerges and matures.

      Reviewer #2:

      Evidence, reproducibility and clarity

      Egawa et al describe the developmental timeline of the assembly of nodes of Ranvier in the chick brainstem auditory circuit. In this unique system, the spacing between nodes varies significantly in different regions of the same axon from early stages, which the authors suggest is critical for accurate sound localization. Egawa et al set out to determine which factors regulate this differential node spacing. They do this by using immunohistological analyses to test the correlation of node spacing with morphological properties of the axons, and properties of oligodendrocytes, glial cells that wrap axons with the myelin sheaths that flank the nodes of Ranvier. They find that axonal structure does not vary significantly, but that oligodendrocyte density and morphology vary in the different regions traversed by these axons, which suggests this is a key determinant of the region-specific differences in node density and myelin sheath length. They also find that differential oligodendrocyte density is partly determined by secreted neuronal signals, as (presumed) blockage of vesicle fusion with tetanus toxin reduced oligodendrocyte density in the region where it is normally higher. Based on these findings, the authors propose that oligodendrocyte morphology, myelin sheath length, and consequently nodal distribution are primarily determined by intrinsic oligodendrocyte properties rather than neuronal factors such as activity.

      Major points, detailed below, need to be addressed to overcome some limitations of the study.

      Major comments:

      (1) It is essential that the authors validate the efficiency of TeNT to prove that vesicular release is indeed inhibited, to be able to make any claims about the effect of vesicular release on oligodendrogenesis/myelination.

      eTeNT is a widely used genetically encoded silencing tool and constructs similar to the one used in this study have been successfully applied in primates and rodents to suppress target behaviors via genetic dissection of specific pathways (Kinoshita et al., 2012; Sooksawate et al., 2013). However, precisely quantifying the extent of vesicular release inhibition from NM axons in the brainstem auditory circuit is technically problematic.

      One major limitation is that while A3V efficiently infects NM neurons, its transduction efficiency does not reach 100%. In electrophysiological evaluations, NL neurons receive inputs from multiple NM axons, meaning that responses may still include input from uninfected axons. Additionally, failure to evoke synaptic responses could either indicate successful silencing or failure to stimulate NM axons, making a clear distinction difficult. Furthermore, unlike in motor circuits, we cannot assess the effect of silencing by observing behavioral outputs.

      Thus, we instead quantified the expression efficiency of GFP-tagged eTeNT in the cell bodies of NM neurons. The proportion of NM neurons expressing GFP-tagged eTeNT was 89.7 ± 1.6% (N = 6 chicks), which is consistent with previous reports evaluating A3V transduction efficiency in the brainstem auditory circuit (Matsui et al., 2012). These results strongly suggest that synaptic transmission from NM axons was globally silenced by eTeNT at the NL region. We have described this explicitly in the text (page 8, line 2).

      (2) Related to 1, can the authors clarify if their TeNT expression system results in the whole tract being silenced? It appears from Fig. 6 that their approach leads to sparse expression of TeNT in individual neurons, which enables them to measure myelination parameters. Can the authors discuss how silencing a single axon can lead to a regional effect in oligodendrocyte number?

      Figure 6D depicts a representative axon selected from a dense population of GFP-positive axons in a 200-μm-thick slice after A3V-eTeNT infection to bilateral NM. As shown in Supplementary Video 1 and 2, densely labeled GFP-positive axons can be traced along the main trunk. To prevent any misinterpretation, we have revised the description of Figure 6 in the main text and Figure legend (page 31, line 9), and stated that the A3V-eTeNT infection efficiency was 89.7 ± 1.6% in NM neurons, as mentioned above. Based on this efficiency, we interpreted that the global occlusion of vesicular release from most of the NM axons altered the pericellular microenvironment of the NL region, which led to the regional effect on the oligodendrocyte density.

      On the other hand, your question regarding whether sparse expression of eTeNT still has an effect is highly relevant. As we also discussed in our reply to comment 4 by Reviewer #1, the relationship between neuronal activity and oligodendrocytes is highly diverse. In some types of axons, vesicular release is essential for normal myelination, and this process was disrupted by TeNT (Koudelka et al., 2016), suggesting that direct interaction with oligodendrocytes via vesicle release may actively promote myelination in these types of axons.

      To clarify whether the phenotype observed in Figure 6 arises from changes in the pericellular microenvironment at the NL region or from the direct suppression of axon-oligodendrocyte interactions, we included a new Supplementary Figure (Figure 6—figure supplement 1). In this figure, we evaluated node formation on axons sparsely expressing eTeNT after electroporation into the unilateral NM. The results showed that sparse eTeNT expression did not increase the percentages of heminodes or unmyelinated segments. This finding supports our conclusion that the increase in unmyelinated segments caused by A3V-eTeNT resulted from impaired synaptic transmission at NM terminals and subsequent alterations of the pericellular microenvironment at the NL region.

      (3) The authors need to fully revise their statistical analyses throughout and supply additional information that is needed to assess if their analyses are adequate:

      Thank you for your valuable suggestions to improve the rigor of our statistical analyses. We have reanalyzed all statistical tests using R software. In the revised Methods section and Figure Legends, we have clarified the rationale for selecting each statistical test, specified which test was used for each figure, and explicitly defined both n and N. After reevaluation with the Shapiro-Wilk test, we adjusted some analyses to non-parametric tests where appropriate. However, these adjustments did not alter the statistical significance of our results compared to the original analyses.

      (3.1) the authors use a variety of statistical tests and it is not always obvious why they chose a particular test. For example, in Fig. 2G they chose a Kruskal-Wallis test instead of a two-way ANOVA or Mann-Whitney U test, which are much more common in the field. What is the rationale for the test choice?

      We have revised the explanation of our statistical test choices to provide greater clarity and precision. For example, in Figure 2G, we first assessed the normality of the data in each of the four groups using the Shapiro-Wilk test, which revealed that some datasets did not follow a normal distribution. Given this, we selected the Kruskal-Wallis test, a commonly used non-parametric test for comparisons across three or more groups. Since the Kruskal-Wallis test indicated a significant difference, we conducted a post hoc Steel-Dwass test to determine which specific group comparisons were statistically significant.
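
      The omnibus step of this workflow can be illustrated with a minimal sketch of the Kruskal-Wallis H statistic. It is not the authors' R analysis: the function name `kruskal_wallis_h` and the example values are hypothetical, and the Shapiro-Wilk normality check and the Steel-Dwass post hoc test are not reproduced here. The p-value would be obtained by comparing H against a chi-square distribution with k - 1 degrees of freedom.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) for a list of samples.

    H is tie-corrected. The function is hypothetical and written for
    illustration; it assumes the pooled data are not all identical.
    """
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)
    counts = Counter(pooled)

    # Tied values share the mean of the 1-based ranks they occupy.
    first_position = {}
    for position, value in enumerate(pooled, start=1):
        first_position.setdefault(value, position)
    rank_of = {v: first_position[v] + (counts[v] - 1) / 2 for v in counts}

    # Sum over groups of (rank sum squared) / (group size).
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

    # Correction factor for ties; equals 1 when every value is unique.
    tie_sum = sum(t ** 3 - t for t in counts.values())
    correction = 1 - tie_sum / (n_total ** 3 - n_total)
    return h / correction, len(groups) - 1


# Hypothetical values, not data from the study.
h_stat, dof = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

      Working on ranks rather than raw values is what makes the test usable when some groups fail a normality check, as described above.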

      (3.2) in some cases, the choice of test appears wholly inappropriate. For example, in Fig. 3H-K, an unpaired t-test is inappropriate if the two regions were analysed in the same samples. In Fig. 5, was a t-test used for comparisons between multiple groups in the same dataset? If so, an ANOVA may be more appropriate.

      In the case of Figures 3H-K, we compared oligodendrocyte morphology between regions. However, since the number of sparsely labeled oligodendrocytes differs both between regions and across individual samples, there is no strict correspondence between paired measurements. On the other hand, in Figures 5B, C, and E, we compared the density of labeled cells between regions within the same slice, establishing a direct correspondence between paired data points. For these comparisons, we appropriately used a paired t-test.

      (3.3) in some cases, the authors do not mention which test was used (Fig 3: E-G no test indicated, despite asterisks; G/L/M - which regression test was used? What does r indicate?)

      We have specified the statistical tests used for each figure in the Methods section and Figure Legends for better clarity. Additionally, we have revised the descriptions for Figure 4G, L, and M and their corresponding Figure Legends to explicitly indicate that Spearman’s rank correlation coefficient (rₛ) was used for evaluation.
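
      Spearman's rank correlation coefficient is the Pearson correlation computed on ranks, which a short sketch can make concrete. The function names `average_ranks` and `spearman_rs` are hypothetical, and the code is an illustration rather than the authors' analysis.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread_x = sum((a - mx) ** 2 for a in rx)
    spread_y = sum((b - my) ** 2 for b in ry)
    return covariance / (spread_x * spread_y) ** 0.5
```

      Because only the ordering of values matters, the coefficient captures any monotonic relationship and does not assume linearity or normally distributed data.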

      (3.4) more concerningly, throughout the results, data may have been pseudo-replicated. t-tests and ANOVAs assume that each observation in a dataset is independent of the other observations. In figures 1-4 and 6 there is a very large "n" number, but the authors do not indicate what this corresponds to. This leaves it open to interpretation, and the large values suggest that the number of nodes, internodal segments, or cells may have been used. These are not independent experimental units, and should be averaged per independent biological replicate - i.e. per animal (N).

      We have now clarified what “n” represents in each figure, as well as the number of animals (N) used in each experiment, in the Figure Legends.

      In this study, developmental stages of chick embryos were defined by HH stage (Hamburger and Hamilton, 1951), minimizing individual variability. Additionally, since our study focuses on the distribution of morphological characteristics of individual cells, averaging measurements per animal would obscure important cellular-level variability and potentially mislead interpretation of data. Furthermore, we employed a strategy of sparse genetic labeling in many experiments, which naturally results in variability in the number of measurable cells per animal. Given the clear distinctions in our data distributions, we believe that averaging per biological replicate is not essential in this case.
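
      For reference, the per-animal averaging raised in comment 3.4 can be sketched as follows. The function name `mean_per_animal`, the animal labels, and the values are all hypothetical and are not data from the study.

```python
from collections import defaultdict
from statistics import mean


def mean_per_animal(measurements):
    """Collapse (animal_id, value) pairs to one mean value per animal."""
    by_animal = defaultdict(list)
    for animal_id, value in measurements:
        by_animal[animal_id].append(value)
    return {animal: mean(values) for animal, values in by_animal.items()}


# Hypothetical internodal lengths in micrometres: n = 5 segments, N = 2 animals.
example = [
    ("chick1", 30.0), ("chick1", 34.0),
    ("chick2", 40.0), ("chick2", 44.0), ("chick2", 48.0),
]
per_animal = mean_per_animal(example)
n_observations = len(example)   # n: individual measurements
n_animals = len(per_animal)     # N: independent biological replicates
```

      The sketch shows the trade-off at issue: a test run on `per_animal` uses N independent values, whereas a test run on `example` uses n values and keeps the cell-level spread.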

      To further ensure the robustness of our statistical analysis, data presented as boxplots were preliminarily assessed using PlotsOfDifferences, a web-based application that calculates and visualizes effect sizes and 95% confidence intervals based on bootstrapping (https://huygens.science.uva.nl/PlotsOfDifferences/; https://doi.org/10.1101/578575). Effect sizes can serve as a valuable alternative to p-values (Ho, 2018; https://www.nature.com/articles/s41592019-0470-3). The significant differences reported in our study are also supported by clear differences in effect sizes, ensuring that our conclusions remain robust regardless of the statistical approach used.

      If requested, we would be happy to provide PlotsOfDifferences outputs as supplementary source data files, similar to those used in eLife publications, for each figure.
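
      The kind of estimate described here, an effect size with a bootstrapped 95% confidence interval, can be sketched in a few lines. This is a minimal percentile bootstrap of the difference in medians and is not the implementation used by PlotsOfDifferences; the function name, the default of 2000 resamples, the fixed seed, and the example values are assumptions made for illustration.

```python
import random
from statistics import median


def bootstrap_median_difference(a, b, n_resamples=2000, seed=0):
    """Return (observed difference, (lower, upper)) for median(b) - median(a).

    The interval is a 95% percentile interval from resampling each group
    with replacement. All names and defaults here are hypothetical.
    """
    rng = random.Random(seed)
    differences = []
    for _ in range(n_resamples):
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        differences.append(median(resample_b) - median(resample_a))
    differences.sort()
    lower = differences[int(0.025 * n_resamples)]
    upper = differences[int(0.975 * n_resamples) - 1]
    return median(b) - median(a), (lower, upper)


# Hypothetical values, not data from the study.
region_a = [10, 11, 12, 13, 14]
region_b = [20, 21, 22, 23, 24]
effect, (ci_low, ci_high) = bootstrap_median_difference(region_a, region_b)
```

      An interval that excludes zero indicates a difference in the same direction as the observed effect, which is how effect sizes can complement a p-value.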

      (3.5) related to the pseudo-replication issue, can the authors include individual datapoints in graphs for full transparency, per biological replicates, in addition or in alternative to bar-graphs (e.g. Fig. 5 and 6).

      We have now incorporated individual data points into the bar graphs in Figures 5 and 6.

      (4) The main finding of the study is that the density of nodes differs between two regions of the chicken auditory circuit, probably due to morphological differences in the respective oligodendrocytes. Can the authors discuss if this finding is likely to be specific to the bird auditory circuit?

      The morphological differences of oligodendrocytes between white and gray matter are well established (i.e., shorter myelin sheaths in gray matter), but their correspondence with the nodal spacing pattern along the long axonal projections of cortical neurons is not well understood. Future research may find similarities with our findings. Additionally, as mentioned in the final section of the Discussion, the mammalian brainstem auditory circuit is functionally analogous to the avian ITD circuit. Regional differences in nodal spacing along axons have also been observed in the mammalian system, raising the important question of whether these differences are supported by regional heterogeneity in oligodendrocytes. Investigating this possibility will facilitate our understanding of the underlying logic and mechanisms for determining node spacing patterns along axons, as well as provide valuable insights into evolutionary convergence in auditory processing mechanisms. We have described this explicitly in the text (page 11, line 34).

      (5) Provided the authors amend their statistical analyses, and assuming significant differences remain as shown, the study shows a correlation (but not causation) between node spacing and oligodendrocyte density, but the authors did not manipulate oligodendrocyte density per se (i.e. cell-autonomously). Therefore, the authors should either include such experiments, or revise some of their phrasing to soften their claims and conclusions. For example, the word "determine" in the title could be replaced by "correlate with" for a more accurate representation of the work. Similar sentences throughout the main text should be amended.

      As you summarized in your comment, our results demonstrated that A3V-eTeNT suppressed oligodendrogenesis in the NL region, leading to a reduction in oligodendrocyte density (Figures 6L, M), which caused the emergence of unmyelinated segments. While this is an indirect manipulation of oligodendrocyte density, it nonetheless provides evidence supporting a causal relationship between oligodendrocyte density and nodal spacing.

      The emergence of unmyelinated segments at the NL region further suggests that the myelin extension capacity of oligodendrocytes differs between regions, highlighting regional differences in the intrinsic properties of oligodendrocytes as the most prominent determinant of nodal spacing variation. However, as you correctly pointed out, our findings do not establish direct causation.

      In the future, developing methods to artificially manipulate myelin length could provide a more definitive demonstration of causality. Given these considerations, we have modified the title to replace "determine" with "underlie", ensuring that our conclusions are presented with appropriate nuance.

      (6) The authors fail to introduce, or discuss, very pertinent prior studies, in particular to contextualize their findings with:

      (6.1) known neuron-autonomous modes of node formation prior to myelination, e.g. Zonta et al (PMID 18573915); Vagionitis et al (PMID 35172135); Freeman et al (PMID 25561543)

      (6.2) known effects of vesicular fusion directly on myelinating capacity and oligodendrogenesis, e.g. Mensch et al (PMID 25849985)

      (6.3) known correlation of myelin length and thickness with axonal diameter, e.g. Murray & Blakemore (PMID 7012280); Ibrahim et al (PMID 8583214); Hildebrand et al (PMID 8441812).

      (6.4) regional heterogeneity in the oligodendrocyte transcriptome (page 9, studies summarized in PMID 36313617)

      Thank you for your insightful suggestions. We have incorporated the relevant references you provided and revised the manuscript accordingly to contextualize our findings within the existing literature.

      Minor comments:

      (7) Can the authors amend Fig. 1G with the correct units of measurement, not millimetres.

      Response: 

      Thank you for your suggestion. We have corrected the units in Figure 1G to µm.

      (8) The Olig2 staining in Fig 2C does not appear to be nuclear, as would be expected of a transcription factor and as is well established for Olig2, but rather appears to be excluded from the nucleus, as it is in a ring or donut shape. Can the authors comment on this?

      Oligodendrocytes and OPCs have small cell bodies, often comparable in size to their nuclei. The central void in the ring-like Olig2 staining pattern appears too small to represent the nucleus. Additionally, a similar ring-like appearance is observed in BrdU labeling (Figure 5G), suggesting that this staining pattern may reflect nuclear morphology or other structural features.

      Significance

      In our view the study tackles a fundamental question likely to be of interest to a specialized audience of cellular neuroscientists. This descriptive study is suggestive that in the studied system, oligodendrocyte density determines the spacing between nodes of Ranvier, but further manipulations of oligodendrocyte density per se are needed to test this convincingly.

      The main finding of our study is that the primary determinant of the biased nodal spacing pattern in the sound localization circuit is the regional heterogeneity in the morphology of oligodendrocytes due to their intrinsic properties (e.g., their ability to produce and extend myelin sheaths) rather than the density of the cells. This was based on our observations that a reduction of oligodendrocyte density by A3V-eTeNT expression caused unmyelinated segments but did not increase internodal length (Figure 6), further revealing the importance of oligodendrocyte density in ensuring full myelination for the axons with short internodes. Thus, we think that our study, through experimental manipulation of oligodendrocyte density, points to the significance of oligodendrocyte heterogeneity for circuit function as well as for nodal spacing.

      Reviewer #3:

      Evidence, reproducibility and clarity

      The authors have investigated the myelination pattern along the axons of chick avian cochlear nucleus. It has already been shown that there are regional differences in the internodal length of axons in the nucleus magnocellularis. In the tract region across the midline, internodes are longer than in the nucleus laminaris region. Here the authors suggest that the difference in internodal length is attributed to heterogeneity of oligodendrocytes. In the tract region oligodendrocytes would contribute longer myelin internodes, while oligodendrocytes in the nucleus laminaris region would synthesize shorter myelin internodes. Not only length of myelin internodes differs, but also along the same axon unmyelinated areas between two internodes may vary. This is an interesting contribution since all these differences contribute to differential conduction velocity regulating ipsilateral and contralateral innervation of coincidence detector neurons. However, the demonstration falls rather short of being convincing. I have some major concerns:

      (1) The authors neglect the possibility that nodal clusters may be formed prior to myelin deposition. They have investigated stages E12 (no nodal clusters) and E15 (nodal cluster plus MAG+ myelin). Fig. 1D is of dubious quality. It would be important to investigate stages between E12 and E15 to observe the formation of pre-nodes, i.e., clustering of nodal components prior to myelin deposition.

      Thank you for your insightful comment regarding the potential role of pre-nodal clusters in determining internodal length. Indeed, studies in zebrafish have suggested that pre-nodal clustering of node components prior to myelination may prefigure internodal length (Vagionitis et al., 2022). We have incorporated a discussion on whether such pre-nodal clusters could contribute to regional differences in nodal spacing in our manuscript (page 9, line 35).

      Whether pre-nodal clusters are detectable before myelination appears to depend on neuronal subpopulation (Freeman et al., 2015). To investigate the presence of pre-nodal clusters along NM axons in the brainstem auditory circuit, we previously attempted to visualize AnkG signals at E13 and E14. However, we did not observe clear structures indicative of pre-nodal clusters; instead, we only detected sparse fibrous AnkG signals with weak Nav clustering at their ends, consistent with hemi-node features. This result does not exclude the possibility of pre-nodal clusters on NM axons, as the detection limit of immunostaining cannot be ruled out. In brainstem slices, where axons are densely packed, nodal molecules are expressed at low levels across a wide area, leading to a high background signal in immunostaining, which may mask weak pre-nodal cluster signals prior to myelination. Regarding the comment on Figure 1D, we assume you are referring to Figure 2D based on the context. The lack of clarity in the high-magnification images in Figure 2D results from both the high background signal and the limited penetration of the MAG antibody. Furthermore, we are unable to verify Neurofascin accumulation at pre-nodal clusters, as there is currently no commercially available antibody suitable for use in chickens, despite our over 20 years of efforts to identify one for AIS research. Therefore, current methodologies pose significant challenges in visualizing pre-nodal clusters in our model. Future advancements, such as exogenous expression of fluorescently tagged Neurofascin at appropriate densities or knock-in tagging of endogenous molecules, may help overcome these limitations.

      However, a key issue to be discussed in this study is not merely the presence or absence of pre-nodal clusters, but rather whether pre-nodal clusters, if present, would determine regional differences in internodal length. To address this possibility, we have added new data in Figure 6I, measuring the length of unmyelinated segments that emerged following A3V-eTeNT expression.

      If pre-nodal clusters were fixed before myelination and predetermined internodal length, then the length of unmyelinated segments should be equal to or a multiple of the typical internodal length. However, our data showed that unmyelinated segments in the NL region were less than half the length of the typical NL internodal length, contradicting the hypothesis that fixed pre-nodal clusters determine internodal length along NM axons in this region.

      (2) The claim that axonal diameter is constant along the axonal length needs to be demonstrated at the EM level. This would also allow measurement of possible regional differences in the thickness of the myelin sheath and number of myelin wraps.

      As mentioned in our reply to comment 2 by Reviewer #1, the diameter of NM axons was already evaluated using electron microscopy (EM) in the pioneering study by Seidl et al. (2010). Additionally, EM-based analysis makes it difficult to clearly distinguish between the main trunk of NM axons and thin collateral branches at the NL region. Accordingly, we did not perform EM analysis in this revision.

      In Figure 4, we used palGFP, which is targeted to the cell membrane, allowing us to measure axon diameter by evaluating the distance between two membrane signal peaks. This approach minimizes the influence of the blurring of fluorescence signals on diameter measurements. Thus, we believe that our method is sufficient to evaluate the relative difference in axon diameters between regions and hence to show that axon diameter is not the primary determinant of the 3-fold difference in internodal length between regions. 
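
      The peak-to-peak measurement described above can be sketched as follows. This is an illustration of the idea only, not the authors' image-analysis procedure: the function name `membrane_peak_distance`, the intensity profile, and the pixel size are hypothetical.

```python
def membrane_peak_distance(profile, pixel_size_um):
    """Estimate diameter as the distance between the two brightest peaks.

    `profile` is a list of fluorescence intensities sampled along a line
    drawn across the axon. With a membrane-targeted label, the two
    membranes appear as two local maxima. Hypothetical helper.
    """
    # A local maximum is higher than its left neighbour and at least as
    # high as its right neighbour, so a flat plateau counts once.
    peaks = [
        i for i in range(1, len(profile) - 1)
        if profile[i - 1] < profile[i] >= profile[i + 1]
    ]
    if len(peaks) < 2:
        raise ValueError("profile does not contain two membrane peaks")
    brightest = sorted(peaks, key=lambda i: profile[i], reverse=True)[:2]
    left, right = sorted(brightest)
    return (right - left) * pixel_size_um


# Hypothetical profile and pixel size, not data from the study.
profile = [1, 2, 9, 4, 3, 4, 10, 2, 1]
diameter_um = membrane_peak_distance(profile, pixel_size_um=0.1)
```

      Measuring between peak positions, rather than across the full width of the signal, is what keeps blur from inflating the estimate: blur broadens each peak but shifts its centre far less.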

      (3) The claim that the difference in internodal length is explained by heterogeneity in the sources of oligodendrocytes is not convincing. Oligodendrocytes a priori from the same origin remyelinate with shorter internodes after a demyelination event.

      The heterogeneity in oligodendrocyte morphology would reflect differences in gene profiles, which, in turn, may arise from differences in their developmental origin and/or the pericellular microenvironment of OPCs. We have made this point as clear as possible in the Discussion (page 9, line 21).

      Significance

      The authors suggest that the difference in internodal length is attributed to heterogeneity of oligodendrocytes. In the tract region oligodendrocytes would contribute longer myelin internodes, while oligodendrocytes in the nucleus laminaris region would synthesize shorter myelin internodes. Not only length of myelin internodes differs, but also along the same axon unmyelinated areas between two internodes may vary. This is an interesting contribution since all these differences contribute to differential conduction velocity regulating ipsilateral and contralateral innervation of coincidence detector neurons.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The paper sets out to examine the social recognition abilities of a 'solitary' jumping spider species. It demonstrates that based on vision alone spiders can habituate and dishabituate to the presence of conspecifics. The data support the interpretation that these spiders can distinguish between conspecifics on the basis of their appearance.

      We appreciate the reviewer’s summary. We indeed aimed at investigating the social recognition abilities of the solitary jumping spider (Phidippus regius), using visual cues alone. By employing a habituation-dishabituation paradigm, well-established in developmental psychology, we found support for the interpretation that these spiders can distinguish between conspecifics based on their appearance, as the reviewer noted.

      Strengths:

      The study presents two experiments. The second set of data recapitulates the findings of the first experiment with an independent set of spiders, highlighting the strength of the results. The study also uses a highly quantitative approach to measuring relative interest between pairs of spiders based on their distance.

      We appreciate the reviewer's acknowledgement of the strengths of our study. The second set of data underscores the robustness and reliability of the results. Additionally, however, the second experiment served the purpose of disentangling whether the habituation effect observed over sessions was caused by ‘physical’ or ‘cognitive’ fatigue by employing ‘long-term’ dishabituation trials at the end of Session 3. These trials are critical in our study as they help to differentiate between recognition of individual identities versus recognition of familiar individuals (as opposed to unfamiliar ones) and to determine if the observed effects are due to ‘general habituation’ or ‘specific recognition’. We will elaborate on this further below in this revision.

      As stated by the reviewer, we employed a highly quantitative approach to measure relative interest between pairs of spiders based on their distance, providing precise and objective data to support our conclusions.

      Weaknesses:

      The study design is overly complicated, missing key controls, and the data presented in the figures are not clearly connected to the study. The discussion is challenging to understand and appears to make unsupported conclusions.

      While we acknowledge that the study design is indeed complex, this complexity is essential for conducting an experiment that is well controlled and balanced across experimental conditions.

      The habituation-dishabituation paradigm is a well-established paradigm in developmental psychology with non-verbal infants. It is understood that during the habituation phase, an individual's attention to a repeated stimulus decreases as they engage in information processing and form a mental representation of it. As the stimulus becomes familiar, it loses its novelty and interest. When a new stimulus is introduced, a recovery of attention suggests that the individual has compared this new stimulus to the stored memory of the habituation stimulus and detected a difference. This process suggests that the individual not only remembered the original stimulus but also recognized the new one as distinct (for a review Kavšek & Bornstein, 2010).

      This paradigm has also been extensively applied in animal research, where, like infants, nonverbal subjects rely on recognition and discrimination processes to demonstrate their cognitive abilities. The use of this paradigm dates back to seminal studies such as Humphrey (1974), which explored the perceptual world of monkeys, illustrating how species and individuals are perceived and recognized. In another previous study (Dahl, Logothetis, and Hoffman, 2007), we utilized an even more complex experimental design that incorporated dedicated baseline trials for both habituation and dishabituation phases, which was well-received despite its complexity. In the current study, we contrast dishabituation and habituation trials directly, creating a sequential cascade where each trial is evaluated against the preceding one as its baseline.

      On the basis of these arguments, we respectfully disagree with the claim that this paradigm is inappropriate or lacks key controls. Our study design, though complex, is rigorously grounded in established methodologies and offers a robust framework for exploring individual recognition in Phidippus regius.

      However, we take the reviewer’s comments seriously and are committed to identifying and addressing the aspects in our manuscript that may have led to misunderstandings. We clarify these areas in our revision of the manuscript. Modifications were made in the Introduction, Methods, and Discussion sections.

      Dahl, C. D., Logothetis, N. K., & Hoffman, K. L. (2007). Individuation and holistic processing of faces in rhesus monkeys. Proceedings of the Royal Society B: Biological Sciences, 274(1622), 2069-2076.

      Humphrey, N. K. (1974). Species and individuals in the perceptual world of monkeys. Perception, 3(1), 105-114.

      Kavšek, M., & Bornstein, M. H. (2010). Visual habituation and dishabituation in preterm infants: A review and meta-analysis. Research in developmental disabilities, 31(5), 951-975.

      (1) Study design: The study design is rather complicated and as a result, it is difficult to interpret the results. The spiders are presented with the same individual twice in a row, called a habituation trial. Then a new individual is presented twice in a row. The first of these is a dishabituation trial and the second is another habituation trial (but now habituating to a second individual). This is done with three pairings and then this entire structure is repeated over three sessions. 

      While we acknowledge that the design is complex, this complexity is essential for conducting a well-controlled experiment, as described earlier. As the reviewer noted, our design involves presenting the same individual to the focal spider twice in a row (habituation trial), followed by a new individual (dishabituation trial), and then repeating this structure. This approach is fundamental to the habituation-dishabituation paradigm, which allows us to systematically compare the responses to a familiar individual with those elicited by a novel one. If the spiders exhibit different behaviours in terms of the distance they maintain when encountering the same individual versus a new one, it indicates that they are processing the stimuli differently, consistent with recognition memory. This differential response is a key indicator that the spiders can distinguish between familiar and unfamiliar individuals, demonstrating not only a decrease in interest or engagement due to repeated exposure but also a cognitive process where the lack of a matching memory template triggers a distinct behavioural response when confronted with novel stimuli.

      By repeating this sequence two more times (Session 2 and 3), we aim to assess the consistency of this recognition process over time. If the focal spider does not remember the individuals from the previous session (one hour ago), we expect consistent behavioural responses across sessions. Conversely, if there is a decrease in response magnitude but the overall response patterns are maintained, we can infer that the focal spider recognizes the previously presented individuals and exhibits habituation, reflected in reduced response intensity. In other words, over sessions and repeated exposure to the same individuals, the memory traces become more firmly established, leading to a situation where a dishabituation trial introduces less novelty, as the spider's recognition of previously encountered individuals becomes more robust and consistent to the point where “habituation” and “dishabituation” trials become indistinguishable, as observed in Session 3. This method allows us to assess the duration of identity recognition in these spiders, indicating how long the memory of specific individuals persists. 

      All of these outcomes were anticipated before we began Experiment 1. Given that the results aligned with our predictions, we then sought to determine whether the observed reduction in the magnitude of the effect (i.e., the difference between habituation and dishabituation trials) was due to a physical fatigue effect, where the spiders might simply be getting tired, or a cognitive fatigue effect, where the spiders recognized the individuals and as a result did not exhibit any novelty response. To address this, we replicated the experiment with a new group of spiders and introduced special (long-term dishabituation) trials at the end, where the focal spider was presented with a novel spider. 

      These extra trials allowed us to disentangle the nature of the diminishing response across repeated sessions: a lack of dishabituation (remaining distant) would suggest general physical fatigue, whereas a strong dishabituation response (approaching closely) to the novel spider would indicate cognitive fatigue, thereby confirming that the spiders were indeed recognizing the familiar individuals throughout the experiment. 

      In light of these considerations, we believe that the complexity of our design is not only justified but absolutely necessary to rigorously test the cognitive capabilities of the spiders. Nonetheless, we understand the need for clarity in presenting our findings and are committed to refining our manuscript to better communicate the rationale and results of our study.

      The data appear to show the strong effects of differences between habituation and dishabituation trials in the first session. The decrease in differential behavior between the so-called habituation and dishabituation trials in sessions 2 and 3 is explained as a consequence of the spiders beginning to habituate in general to all of the individuals.

      The key question, as mentioned above, is to determine the underlying cause of this general habituation across sessions. Specifically, we aim to differentiate between two potential causes: physical fatigue, where the spiders may simply become less responsive due to the demands of the three-hour testing period, or cognitive fatigue, where the repeated exposure to the same individuals leads to a decreased response because the spiders have started to recognize these individuals over multiple repetitions.

      To address this, we replicated the experiment and introduced each focal spider to a new individual in what we termed "long-term dishabituation" trials. By comparing the spiders' responses to these novel individuals with their responses in earlier trials, we sought to better understand the underlying mechanisms of habituation and the duration of individual recognition. The strong dishabituation response observed in these trials is indicative of cognitive fatigue, supporting the presence of recognition memory rather than a general physical fatigue effect.

      The claim that the spiders remember specific individuals is somewhat undercut because all of the 'dishabituation' trials in session 2 are toward spiders they already met for 14 minutes previously but seemingly do not remember in session 2. 

      We appreciate the reviewer’s comment regarding the claim that spiders do not remember specific individuals. This assessment does not align with the rationale of our experiment. The reviewer noted that the dishabituation trials in session 2 involved spiders previously encountered and suggested that the lack of a clear memory response might undercut the claim of specific individual recognition. 

      However, as we explained earlier, we expect habituation in Session 2 relative to Session 1 precisely because spiders recognize each other in Session 2. If there were no such habituation in Sessions 2 or 3, it would suggest that the spiders’ recognition memory does not persist beyond one hour. 

      Additionally, it is important to correct the timing noted by the reviewer: each individual spider reencounters the same spider exactly one hour later, not 14 minutes. This is detailed in Table 2 of the manuscript, which outlines that each trial lasts 7 minutes, with a 3-minute visual separation between trials. With six trials per session, this totals to 1 hour per session. Thus, every pair of spiders re-encounters exactly 1 hour after their last interaction.
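
      The arithmetic above can be checked with a minimal sketch. The constant and function names are illustrative and do not come from the manuscript, and the sketch assumes that sessions run back to back with the same trial order, as the response describes:

```python
# Timing values as stated in the response (Table 2 of the manuscript).
TRIAL_MIN = 7            # duration of one trial, in minutes
SEPARATION_MIN = 3       # visual separation following each trial, in minutes
TRIALS_PER_SESSION = 6   # number of trials in one session


def session_length_min(trials=TRIALS_PER_SESSION, trial=TRIAL_MIN, gap=SEPARATION_MIN):
    """Total length of one session: each trial is followed by one separation."""
    return trials * (trial + gap)


def reencounter_interval_min(sessions_apart=1):
    """Onset-to-onset interval between the same trial slot in two sessions.

    Assumption: sessions follow each other without a break and repeat the
    same trial order, so a given pair meets again one session length later.
    """
    return sessions_apart * session_length_min()


print(session_length_min())          # 60
print(reencounter_interval_min(1))   # 60
print(reencounter_interval_min(2))   # 120
```

      Under these assumptions, a pair that meets in Session 1 meets again 60 minutes later in Session 2 and 120 minutes later in Session 3, rather than 14 minutes later.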

      Again, it is important to clarify that the observed decrease in differential behaviour is not indicative of a failure to remember specific individuals. Rather, it reflects a systematic pattern of habituation, which is a common and expected outcome in such paradigms. This systematic decrease in response strength suggests that the spiders recognize the previously encountered individuals and become less responsive over repeated exposures, consistent with the process of habituation. In other words, repeated exposure to the same individuals leads to more firmly established memory traces, so that a dishabituation trial introduces less novelty as the spider's recognition of previously encountered individuals becomes more robust and consistent.

      Based on the explanations provided above, we respectfully disagree with the assessment that our claim that the spiders remember specific individuals “is somewhat undercut […]”. On the contrary, the very strength of our study lies in demonstrating that spiders possess robust recognition memory, as evidenced by a clear dissociation of habituation and dishabituation trials in Session 1, followed by a gradually diminishing effect over Sessions 2 and 3 as the spiders are increasingly exposed to the same individuals. This is further supported by the strong rebound from habituation observed in the long-term dishabituation trials, where the spiders were exposed to novel individuals. 

      This misunderstanding suggests that we should take additional care in the revised manuscript to clarify our explanations and provide more detail, ensuring that the rationale behind our experimental design and our findings are communicated effectively.

      In session 3 it is ambiguous what is happening because the spiders no longer differentiate between the trial types. This could be due to fatigue or familiarity. 

      The reviewer proposes that the absence of differentiation between 'habituation' and 'dishabituation' trials in Session 3 might be attributed to either fatigue or familiarity. We interpret "fatigue" as what we have termed the “physical fatigue effect” and "familiarity" as “cognitive fatigue effect.” In this context, we concur with the reviewer’s observation, and this very line of reasoning prompted us to conduct a further experiment following the outcome of Experiment 1.

      A second experiment is done to show that introducing a totally novel individual recovers a large dishabituation response, suggesting that the lack of differences between 'habituation' and 'dishabituation' trials in session 3 is the result of general habituation to all of the spiders in the session rather than fatigue. As mentioned before, these data do support the claim that spiders differentiate among individuals.

      As the reviewer rightly noted, we addressed these possibilities in our second experiment by introducing a completely novel individual to the spiders, which resulted in a strong dishabituation response. This outcome suggests that the lack of differentiation in Session 3 is more likely due to cognitive habituation rather than physical fatigue. The robust response to novel individuals demonstrates that the spiders are capable of distinguishing between familiar and unfamiliar individuals, suggesting that the reduced differentiation is a consequence of habituation from repeated encounters with the same individuals. 

      We appreciate the reviewer's recognition that these findings support the conclusion that spiders are capable of differentiating between individual conspecifics.

      Additionally, it is important to clarify the structure of our sessions. Each of the 6 trials lasts 7 minutes with a 3-minute visual separation, resulting in a total of 1 hour per session. This ensures that each pair of spiders is encountered exactly one hour later, which controls for the timing and allows us to evaluate the spiders' recognition memory over repeated sessions.

      In summary, while the data show a decrease in differential behaviour between habituation and dishabituation trials in Sessions 2 and 3, the results from our second experiment support the interpretation that this is due to ‘cognitive habituation’ (familiarization) rather than ‘physical fatigue’ (general habituation). This habituation effect underscores the spiders' ability to recognize and become familiar with specific individuals over time, reinforcing our conclusion that they can differentiate among individuals.

      The data from session 1 are easy to interpret. The data from sessions 2 and 3 are harder to understand, but these are the trials in which they meet an individual again after a substantial period of separation. 

      The data from Session 1 are straightforward to interpret, showing clear differences between habituation and dishabituation trials. However, the data from Sessions 2 and 3 are more complex, as these sessions involve the spiders re-encountering individuals after a 1-hour period of separation. Importantly, this outcome is not an artefact of our experiment, but the consequence of a deliberate choice in the experimental design to assess whether spiders can recognise each other after this duration. We believe that this complexity aligns with our expectations, based on the assumption that spiders can recognise each other after one hour. The observed pattern of habituation in Sessions 2 and 3 suggests that the spiders retain memory of the individuals, leading to decreased responsiveness upon repeated encounters. This interpretation is further supported by Experiment 2, which introduced a novel individual and elicited a strong dishabituation response. This finding confirms that the reduced differentiation in later sessions is due to cognitive habituation rather than physical fatigue, supporting the conclusion that recognition memory lasts at least one hour.

      We hope this explanation clarifies our findings and the rationale behind our relatively complex experimental design choice. 

      Other studies looking at recognition in ants and wasps (cited by the authors) have done a 4 trial design in which focal animal A meets B in the first trial, then meets C in the second trial, meets B again in the third trial, and then meets D in the last trial. In that scenario trials 1, 2, and 4 are between unfamiliar individuals and trial 3 is between potentially familiar individuals. In both the ants and wasps, high aggression is seen in species with and without recognition on trial 1, with low aggression specifically for trials with familiar individuals in species with recognition. Across different tests, species or populations that lack recognition have shown a general reduction in aggression towards all individuals that become progressively less aggressive over time (reminiscent of the session 2 and 3 data) while others have maintained modest levels of aggression across all individuals. The 4 session design used in those other studies provides an unambiguous interpretation of the data while controlling for 'fatigue'. 

      We acknowledge that there are multiple ways to design experiments to test recognition memory. In fact, we considered using a paradigm similar to the one proposed by the reviewer and used in studies like Dreier et al., which involves a series of trials with unfamiliar and familiar individuals over extended intervals. However, we then opted for a more complex design to rigorously assess how habituation and recognition memory develop over repeated sessions with shorter intervals.

      In the following, we would like to describe the advantages and disadvantages of both paradigms and outline how we ended up using the more complex version:

      Advantages of our paradigm: 

      As pointed out, by repeating the sequence in exactly the same manner (every pair of spiders re-encounters each other after exactly 1 and 2 hours), we can comprehensively evaluate the effect of habituation over multiple exposures. This allows us to assess the extent of the spiders’ memory, when a spider shows stronger habituation to individuals that were novel in Session 1 but “familiar” by the time it encounters them again in Session 2. To achieve this, each trial and visual separation must be precisely timed, ensuring consistent intervals between encounters. As a consequence, each individual spider undergoes the exact same experimental protocol. Most critical, however, are the novel individuals presented after Session 3 (long-term dishabituation trials), which help differentiate between cognitive habituation and physical fatigue.

      Disadvantages of our paradigm:

      The sequences of habituation and dishabituation trials may make the design more complex, as pointed out by the reviewer. As a consequence, the interpretation will become more difficult. However, the data perfectly align with our predictions, and the outcomes were as anticipated in two independently run experiments with two groups of spiders. This highlights the reliability of our experimental design and robustness of our findings.

      Advantages of the 4-trial paradigm proposed by the reviewer:

      Clearly, the structure of the proposed design is simpler, making interpretation easier. The paradigm also accommodates longer intervals between trials (e.g., 24 hours). Longer intervals could theoretically have been applied in our study. (However, we chose not to leave the spiders in the experimental box longer than necessary, opting instead to return them to their home containers for the night to ensure their well-being. In addition, a 24-hour interval targets a different phase in the process of long-term memory; we return to this topic further below.)

      Disadvantages of the 4-trial paradigm proposed by the reviewer:

      Strictly replicating the 4-trial design would result in one familiar encounter versus three unfamiliar ones. This imbalance might introduce bias and limit the robustness of the measurements. Additionally, the design provides less data overall, as the focal individual will be confronted with three other individuals, who will then be excluded from further testing as focal subjects themselves. In contrast, our design ensures a balanced number of familiar (habituation) and novel encounters (dishabituation) for each focal individual, allowing for more efficient and comprehensive data collection without excluding individuals from further testing.

      Given the aforementioned considerations, we determined that the advantages of our experimental design, in particular the assessment of a cognitive fatigue effect when encountering the same individuals again, outweigh those of the proposed 4-trial design. The mentioned limitations of the 4-trial design, such as the potential for bias and less comprehensive data collection, do not justify re-running the study, especially when the best case scenario is fewer insights than our already existing findings. Our current paradigm yielded results that align perfectly with our predictions, offering a thorough and reliable understanding of recognition memory and habituation in spiders. Therefore, we believe our approach provides a more complete and robust answer to our research questions.

      However, we acknowledge that there might be insufficient information in the manuscript addressing the rationale behind our design choices, and we will revise the manuscript to provide a clearer explanation of why our approach is well suited to answering the research questions at hand.

      That all trials in sessions 2 and 3 are always with familiar individuals makes it challenging to understand how much the spiders are habituating to each other versus having some kind of associative learning of individual identity and behavior.

      We understand the reviewer's concern that having all trials in Sessions 2 and 3 involve familiar individuals could make it challenging to distinguish between general habituation and associative learning of individual identities. In our study, we contrast habituation and dishabituation trials: If general habituation were occurring, we would expect uniformly reduced responses (around the zero line) to all individuals over time, indicating that the spiders are getting used to any individual regardless of their specific identity. However, this is not the case. Our data show that while the responses in Session 2 are reduced in effect size compared to Session 1, they are not flat (around the zero line). This indicates that the spiders still differentiate between a repetition of a spider identity (habituation trials) and two different spider identities (dishabituation trials), albeit with a reduced response strength. The systematicity in the data suggests that the spiders are not merely habituating to any individual, but are instead retaining some level of recognition between specific individuals.

      Only by Session 3 do the spiders fully habituate to the point where the responses to habituation and dishabituation trials converge, indicating a complete habituation effect. The introduction of novel individuals in our long-term dishabituation trials further supports the idea that the spiders are recognizing specific individuals rather than exhibiting general habituation. If the spiders were experiencing general habituation, we would not expect the strong dishabituation response observed in our study.

      The data presentation is also very complicated. How is it the case that a negative proportion of time is spent? The methods reveal that this metric is derived by comparing the time individuals spent in each region relative to the previous time they saw that individual. 

      We understand the reviewer's concern regarding the complexity of the data presentation and the calculation of the negative proportion of time. Regarding the complexity of the design, we have already justified our choice of a more intricate experimental setup. This complexity is necessary for accurately assessing recognition memory and habituation over repeated sessions. 

      The metric is derived by comparing the time individuals spent in each region (relative to the transparent front panel) in the current trial (n) relative to the previous trial (n-1). With multiple trials, this results in a cascade of trials and conditions. This method was established in Humphrey’s and our previous studies (Humphrey, 1974; Dahl, Logothetis, Hoffman, 2007), where we demonstrated its effectiveness in assessing individuation of faces in macaque monkeys.

      Also in our current experimental design, each current trial is contrasted with the preceding one, allowing us to compare distributions of distances taken in two trials. In this context, every preceding trial serves as baseline for every current trial. 

      Figure 1 of the manuscript illustrates the structure and analysis of the trials:

      Panel a depicts the baseline, habituation, and dishabituation trials, where spiders are exposed to different conspecifics.

      Baseline (left panel, red): When two spiders are visually exposed to each other for the first time, it is expected that they will explore each other closely, exhibiting high levels of proximity (initial exploratory behaviour).

      Habituation (centre panel, green): When the same spiders are reintroduced in a subsequent round of exposure, it is anticipated that they will exhibit reduced exploratory behaviour and maintain a greater distance compared to the baseline trial, if they recognize each other from the previous encounter (indicative of habituation).

      Panel b (upper and middle panels; red and green): Demonstrates the theoretical assumptions and expected changes in behaviour:

      By subtracting the distribution of distances in the baseline trial from the habituation trial, we generate a delta distribution. This delta distribution reveals negative values near the transparent panel (indicating reduced proximity in the habituation trial) and positive values at mid- to far-distances (indicating increased distancing behaviour). This delta distribution is also what is reported in Figure 2. 

      Dishabituation: In this trial, a new spider (different from the one in the habituation trial) is introduced. The dishabituation trial will be considered in contrast to the habituation trial described above. If the spider recognizes the new individual as different, it is expected to show increased exploratory behaviour and reduced distance, similar to the initial baseline trial.

      By subtracting the distribution of distances in the habituation trial from the dishabituation trial, we obtain another delta distribution. This delta distribution should reveal positive values near the transparent panel (indicating increased proximity in the dishabituation trial) and negative values at mid- to far-distances (indicating decreased proximity compared to the habituation trial).
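
      The two subtractions described above, and the reason a "negative proportion of time" can arise, can be illustrated with a minimal sketch. The bin edges, the sample values, and the function names are hypothetical and are not taken from the manuscript; the sketch assumes positions sampled at a fixed rate, so that the proportion of samples in a distance bin stands in for the proportion of time spent there:

```python
def proportion_histogram(distances, bin_edges):
    """Proportion of samples falling in each bin [edge_i, edge_{i+1})."""
    counts = [0] * (len(bin_edges) - 1)
    for d in distances:
        for i in range(len(counts)):
            if bin_edges[i] <= d < bin_edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def delta_distribution(current, previous, bin_edges):
    """Current-trial proportions minus previous-trial proportions, per bin."""
    cur = proportion_histogram(current, bin_edges)
    prev = proportion_histogram(previous, bin_edges)
    return [c - p for c, p in zip(cur, prev)]


# Hypothetical distances from the transparent panel (arbitrary units).
bin_edges = [0, 2, 4, 6]                 # near, mid, far
baseline = [0.5, 1.0, 1.5, 3.0]          # first encounter: mostly near
habituation = [1.0, 3.0, 5.0, 5.5]       # same individual again: further away
dishabituation = [0.5, 1.0, 1.0, 3.5]    # different individual: near again

print(delta_distribution(habituation, baseline, bin_edges))        # [-0.5, 0.0, 0.5]
print(delta_distribution(dishabituation, habituation, bin_edges))  # [0.5, 0.0, -0.5]
```

      Because each delta is a difference of two proportion vectors that each sum to one, its entries sum to zero, so any positive bin must be offset by negative bins. A negative value therefore means less time in that region than in the preceding trial, not a negative amount of time.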

      We hope this clarifies the rationale behind our data presentation and the methodological approach we employed. We have revised the figure to enhance its clarity and make it more intuitive for the reader.

      Dahl, C. D., Logothetis, N. K., & Hoffman, K. L. (2007). Individuation and holistic processing of faces in rhesus monkeys. Proceedings of the Royal Society B: Biological Sciences, 274(1622), 2069-2076.

      Humphrey, N. K. (1974). Species and individuals in the perceptual world of monkeys. Perception, 3(1), 105-114.

      At the very least, data showing the distribution of distances from the wall would be much easier to interpret for the reader.

      We understand the reviewer's concern that data showing the distribution of distances from the wall would be much easier to interpret for the reader. We initially considered this option but came to the conclusion that this approach is not straightforward. For instance, if both spiders are positioned at the very front but in different corners, the distance to the panel would be very small, but the distance between the spiders would be large. Thus, using distances from the wall could misrepresent the actual spatial distribution between the spiders.
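
      The corner example can be made concrete with a small sketch. The coordinate frame and the positions are hypothetical: the transparent panel is assumed to lie along y = 0, with the focal spider on the positive side and the other spider on the negative side, and the units are arbitrary:

```python
import math


def distance_to_panel(pos):
    """Perpendicular distance from a position (x, y) to the panel at y = 0."""
    return abs(pos[1])


def inter_spider_distance(a, b):
    """Straight-line distance between two positions (x, y)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Hypothetical positions: both spiders close to the panel, in opposite corners.
focal = (0.5, 0.2)
other = (9.5, -0.2)

print(distance_to_panel(focal), distance_to_panel(other))  # both small
print(inter_spider_distance(focal, other))                 # large by comparison
```

      In this configuration both panel distances are 0.2 units while the spiders are more than 9 units apart, which is the discrepancy the response describes.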

      (2) "Long-term social memory": It is not entirely clear what is meant by the authors when they say 'long-term social memory', though typically long-term memory refers to a form of a memory that requires protein synthesis.  

      To address this conceptually, we used the term "long-term social memory" to describe the spiders' ability to recognize and remember individual conspecifics over multiple experimental sessions. While social memory refers to the ability of an individual to recognize other individuals within a social context, long-term memory typically involves the retention of information over extended periods. Recognizing that the term “long-term social memory” is not commonly used, we have revised the manuscript to use the more standard term “long-term memory.”

      While the precise timing of memory formation varies across species and contexts, a general rule is that long-term memory should last for > 24 hours (e.g., Dreier et al 2007 Biol Letters). The longest time that spiders are apart in this trial setup is something like an hour. There is no basis to claim that spiders have long-term social memory as they are never asked to remember anyone after a long time apart.

      We appreciate the reviewer’s feedback regarding the term "long-term social memory." The statement "long-term memory should last for > 24 hours" is a generalisation in discussions about memory. It oversimplifies a more complex topic. That is, long-term memory is typically distinguished from short-term memory by its persistence over time, often lasting from hours to a lifetime. However, the exact duration that qualifies memory as "long-term" varies depending on the context, model species, and type of memory. In studies of synaptic plasticity (LTP), the objective may indeed be to examine memory that persists for at least 24 hours, using this as a criterion for long-term memory. In studies of cellular and/or molecular mechanisms, where the stabilization and consolidation of memory traces over time are key areas of interest, this 24-hour interval is very common. However, defining long-term memory strictly by a 24-hour duration is by no means universally accepted, nor does it apply across all fields of study.

      To clarify, long-term memory is a process involving consolidation starting within minutes to hours after learning. Clearly, full consolidation can take longer, while memory persisting 24 hours is considered fully consolidated. But this does not mean that memories lasting less than 24 hours are not part of long-term memory. 

      In fact, Shiffrin and Atkinson (1969) proposed that information entering short-term memory remains there for about 20 to 30 seconds before being displaced due to space limitations. During this brief interval, initial encoding processes begin transferring information to long-term memory, establishing an initial memory trace. This transfer is not indicative of full consolidation but represents the initial "laying down" of the memory trace (encoding). In our study, the focal spider’s brain forms initial memory traces of the individuals it encounters. This process continues during the period of visual separation. Upon re-encountering the same individual a few minutes later, the spider accesses the initial memory trace stored in long-term memory. This trace is fragile and not fully consolidated. The re-encounter acts as a rehearsal, reactivating specific memory traces and potentially strengthening them through additional encoding processes, allowing the spider to recognize the individual even an hour later.

      According to Markowitsch (2013), initial encoding in long-term memory begins within seconds to minutes. It is also important to note that we argue for identity recognition rather than identity recall. Recognition involves correctly identifying a stimulus when it is presented again, while recall requires the volitional generation of information without an external stimulus. Thus, recall may rely on deeper forms of memory consolidation than recognition.

      Is protein synthesis required for long-term memory? 

      The role of protein synthesis in long-term memory has been extensively studied. According to Castellucci et al. (1978), explicit memory comprises a short-term phase that does not require protein synthesis and a long-term phase that does. Hebbian learning in its initial phase (early LTP) does not necessarily require protein synthesis. This phase involves the rapid strengthening of synapses through existing proteins and signaling pathways, such as the activation of NMDA receptors and the influx of Ca2+ ions. For the changes to persist (late LTP), protein synthesis is important. This phase involves the production of new proteins that contribute to long-term structural changes at the synapse, such as the growth of new synaptic connections or the stabilization of existing ones.

      This differentiation between the early and late phases of LTP highlights that long-term memory can begin forming without immediate protein synthesis. Our study focuses on this early phase of memory encoding, which involves the initial formation of memory traces that do not yet depend on protein synthesis. 

      It is however worth noting that recent research suggests that there is an early phase of protein synthesis (within minutes to hours) through the activation of immediate early genes (IEGs) and transcription factors. In this context, protein synthesis supports initial synaptic modifications. What the reviewer refers to is the consolidation phase (late phase), where continued synthesis of proteins induces structural changes at synapses, leading to the formation of new synaptic connections. In our study, it is plausible to assume that an early form of protein synthesis may contribute to stabilizing the initial memory traces during the encoding phase. However, whether or not protein synthesis occurred in our spiders is beyond the scope of this investigation and was not specifically addressed.

      The critical aspect of our study is that the information transitioned from short-term memory to long-term memory during an early encoding phase, allowing recognition after an hour. Due to the inherent limitations and transient nature of short-term memory, it is implausible for spiders to retain these memory representations solely within short-term memory for such durations. Our findings suggest that the initial encoding processes were robust enough to transfer these experiences into long-term memory, where they were stabilized and could be accessed later. 

      In sum, it is important to note that long-term memory is a dynamic process, and while testing after 24 hours is a convention in some studies, this timing is arbitrary and not universally applicable to all contexts or species. The more critical consideration here is that we are dealing with a species where no prior evidence of long-term memory exists. Debating a 24-hour delay or the specifics of protein synthesis, while potentially interesting for future studies, detracts from the true significance of our findings. Our study is the first to show something akin to long-term memory representations in this species, and this should remain the focus.

      Shiffrin, R. M., & Atkinson, R. C. (1969). Storage and retrieval processes in long-term memory. Psychological Review, 76(2), 179.

      Markowitsch, H. J. (2013). Memory and self–Neuroscientific landscapes. International Scholarly Research Notices, 2013(1), 176027.

      Castellucci, V. F., Carew, T. J., & Kandel, E. R. (1978). Cellular analysis of long-term habituation of the gill-withdrawal reflex of Aplysia californica. Science, 202(4374), 1306-1308.

      The odd phrasing of the 'long-term dishabituation' trial makes it seem that it is testing a long-term memory, but it is not. The spiders have never met. The fact that they are very habituated to one set of stimuli and then respond to a new stimulus is not evidence of long-term memory. To clearly test memory (which is the part really lacking from the design), the authors would need to show that spiders - upon the first instance of re-encountering a previously encountered individual - are already 'habituated' to them but not to some other individuals. The current data suggest this may be the case, but it is just very hard to interpret given the design does not directly test the memory of individuals in a clear and unambiguous manner.

      While we appreciate the reviewer's feedback, we believe there may have been some misunderstanding regarding the term “long-term dishabituation.” The introduction of novel individuals at the end of Session 3 was not intended to test long-term memory by having spiders recognize these novel individuals. Instead, it aimed to investigate the nature of the habituation observed over the three sessions.

      The novel individuals introduced at the end of Session 3 serve to differentiate between general habituation (a decline in response due to repeated exposure to any stimulus) and specific habituation (recognition and reduced response to previously encountered individuals). The novel spiders have never been encountered before, so the focal spiders cannot have prior representations of them. Thus, the strong dishabituation response to these novel individuals indicates that the habituation observed earlier is not due to a general fatigue effect or loss of interest but rather a specific habituation effect to the familiar individuals. By showing such a strong and increased response to novel individuals, the study demonstrates that the spiders' increasingly reduced responses in Sessions 2 and 3 are not merely due to a general decrease in responsiveness but suggest cognitive habituation. This cognitive habituation implies that the spiders remember the familiar individuals (as each of them occurred three times across the three sessions), a process that relies on long-term memory. Therefore, while the novel spiders themselves are not a direct test of long-term memory, the use of these novel spiders helps us infer that the habituation observed over the three sessions is indeed due to the formation of long-term memory traces.

      In other words, the organism detects and processes the novel stimulus as different from the habituated one. In our study, if a spider showed a strong dishabituation response to a novel individual introduced at the end of Session 3, it would indicate that the spider had formed specific representations of the individuals they encountered during the three sessions. These representations allow the spiders to recognise the novel individuals as different, leading to renewed interest and a stronger behavioural response. It is the absence of a prior representation for the novel spiders that triggers this dishabituation response. Since the novel spider does not match any stored representations of the previously encountered spiders, the focal spider responds more strongly.

      The introduction of novel individuals at the end of Session 3 helps clarify that the increasing habituation observed in Sessions 2 and 3 is specific to familiar individuals, indicating cognitive habituation. This supports the presence of long-term memory processes in the spiders, as they can distinguish between previously encountered individuals and new ones. The habituation-dishabituation paradigm thus effectively demonstrates the spiders' ability to form and reactivate encoded memory traces, providing clear evidence of recognition memory. 

      For these reasons, we are convinced that our interpretation is accurate and hope this clarification renders the additional request for an entirely new experiment unnecessary.

      (3) Lack of a functional explanation and the emphasis on 'asociality': It is entirely plausible that recognition is a pleiotropic byproduct of the overall visual cognition abilities in the spiders. 

      We agree with the reviewer that it is essential to consider the broader context of individual recognition and its potential adaptive significance. The possibility that recognition in jumping spiders could be a pleiotropic byproduct of their advanced visual cognition abilities is indeed a plausible explanation and has been discussed in our manuscript.

      However, the discussion that discounts territoriality as a potential explanation is not well laid out. First, many species that are 'asocial' nevertheless defend territories. It is perhaps best to say such species are not group living, but they have social lives because they encounter conspecifics and need to interact with them.

      The reviewer also correctly points out that many 'asocial' species still defend territories and have social interactions. Our use of the term 'asocial' was meant to indicate that jumping spiders do not live in cohesive social groups, but we acknowledge that they do have social lives in terms of interactions with conspecifics. It is more accurate to describe these spiders as a non-group-living, yet socially interactive, species. A better term is “non-social”, which refers to the jumping spider as a species that does not live in stable social groups and does not exhibit associated behaviours, such as cooperation. This also implies that individuals still interact with conspecifics, especially in contexts like mating, territorial disputes or aggression. We therefore change the term from “asocial” to “non-social” in the manuscript.

      Indeed, there are many examples of solitary living species that show the dear enemy effect, a form of individual recognition, towards familiar territorial neighbors. The authors in this case note that territorial competition is mediated by the size or color of the chelicerae (seemingly a trait that could be used to distinguish among individuals). Apparently, the fact that previous work has suggested that territorial disputes can be mediated by a trait in the absence of familiarity has led them to discount the possibility that keeping track of the local neighbors in a potentially cannibalistic species could be a sufficient functional reason. In any event, the current evidence presented certainly does not warrant discounting that hypothesis.

      The “dear enemy effect”, where solitary living species recognize and show reduced aggression towards familiar territorial neighbors, is a relevant consideration. This effect demonstrates that individual recognition can have significant functional implications even in species that are not group-living. We will elaborate on this effect in the revised manuscript to provide a more comprehensive discussion.

      The reviewer mentioned that territorial disputes can be mediated by the size or color of the chelicerae, potentially serving as a feature for individual recognition. Our intention was not to discount the role of such traits but to highlight that the level of identity recognition we observed represents subordinate classification. This is different from the basic-level classification, such as distinguishing between male and female based on chelicerae colour. While we acknowledge that colour can be an important feature for identity discrimination, our findings suggest that individual recognition in jumping spiders goes beyond simple colour differentiation. 

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, the authors investigated whether a salticid spider, Phidippus regius, recognizes other individuals of the same species. The authors placed each spider inside a container from which it could see another spider for 7 minutes, before having its view of the other spider occluded by an opaque barrier for 3 minutes. The spider was then either presented with the same individual again (habituation trial) or a different individual (dishabituation trial). The authors recorded the distance between the two spiders during each trial. In habituation trials, the spiders were predicted to spend more time further away from each other and, in dishabituation trials, the spiders were predicted to spend more time closer to each other. The results followed these predictions, and the authors then considered whether the spiders in habituation trials were generally fatigued instead of being habituated to the appearance of the other spider, which may have explained why they spent less time near the other individual. The authors presented the spiders with a different (novel) individual after a longer period of time (which they considered to be a long-term dishabituation trial), and found that the spiders switched to spending more time closer to the other individual again during this trial. This suggested that the spiders had recognized and had habituated to the individual that they had seen before and that they became dishabituated when they encountered a different individual.

      We appreciate the reviewer's detailed summary of our study. The reviewer's summary accurately captures the essence of our experimental design, predictions, and findings.

      Strengths:

      It is interesting to consider individual recognition by Phidippus regius. Individual recognition by an invertebrate is already known, for instance, in a species of social wasp, but Phidippus regius is a different animal. Importantly and more specifically, P. regius is a salticid spider, and these spiders are known to have exceptional eyesight for animals of their size, potentially making them especially suitable for studies on individual recognition. In the current study, the results from experiments were consistent with the authors' predictions, suggesting that the spiders were recognizing each other by being habituated to individuals they had encountered before and by being dishabituated to individuals they had not encountered before. This is a good start in considering individual recognition by this species.

      We appreciate the reviewer's positive summary and acknowledgment of the strengths of our study. We would like to point out some more details: 

      While the exceptional eyesight of salticid spiders is indeed a significant factor, our study reaches deeper in terms of processing. We argue not at the level of sensation but at the level of perception; moreover, identity recognition is a higher-level perceptual process. This distinction is crucial: we are not merely examining the spiders' sensory capabilities (such as good eyesight), but rather how their brains interpret and represent what they “see”. This involves a cognitive process in which the sensory input (sensation) is processed and integrated into meaningful constructs (perception) and memorised in the form of representations.

      Our study also suggests that P. regius engages in “higher-level” perceptual processes. This most likely involves complex representations of individual conspecifics, which in mammalian brains are associated with regions such as the central inferior temporal (cIT) and anterior inferior temporal (aIT) areas. We provide evidence that these spiders do not just sense visual stimuli but interpret and recognize individual identities, indicating sophisticated perceptual and cognitive abilities. In other words, the spiders do not merely respond to visual stimuli in a reflexive manner, but rather engage in sophisticated perceptual and cognitive processes that allow them to recognize and distinguish between individual identities. This indicates that the spiders are not simple Braitenberg vehicles reacting to stimuli, but are thinking organisms capable of complex mental representations. This resonates with current trends in animal cognition research, which increasingly recognize some level of consciousness and advanced cognitive abilities across a wide range of animal species. Moreover, this aligns with the growing interest in and recognition of spider cognition, where research is beginning to provide evidence for the cognitive complexity and perceptual capabilities of these often underestimated creatures (Jackson and Cross, 2011).

      Jackson, R. R., & Cross, F. R. (2011). Spider cognition. Advances in Insect Physiology, 41, 115-174.

      Weaknesses:

      The experiments in this manuscript (habituation/dishabituation trials) are a good start for considering whether individuals of a salticid species recognize each other. I am left wondering, however, what features the spiders were specifically paying attention to when recognizing each other. The authors cited Sheehan and Tibbetts (2010) who stated that "Individual recognition requires individuals to uniquely identify their social partners based on phenotypic variation." Also, recognition was considered in a paper on another salticid by Tedore and Johnsen (2013).

      Tedore, C., & Johnsen, S. (2013). Pheromones exert top-down effects on visual recognition in the jumping spider Lyssomanes viridis. The Journal of Experimental Biology, 216, 1744-1756. doi: 10.1242/jeb.071118 

      In this elegant study, the authors presented spiders with manipulated images to find out what features matter to these spiders when recognizing individuals.

      The reviewer raises an important point regarding the specific features that Phidippus regius might be paying attention to when recognizing individual conspecifics. Our study indeed cited Sheehan and Tibbetts (2010) to highlight the importance of phenotypic variation in individual recognition. Additionally, we referenced the work by Tedore and Johnsen (2013) on visual recognition in another salticid species, which suggests that multiple sensory modalities, including visual and pheromonal cues, may be involved in the recognition process. While our current study focused on demonstrating that Phidippus regius can recognize individual conspecifics, we acknowledge that it does not specifically identify the phenotypic features involved in this recognition. 

      Part of the problem with using two living individuals in experiments is that the behavior of one individual can influence the behavior of the other, and this can bias the results.  

      We appreciate the reviewer's observation regarding the potential bias introduced by using two living individuals in experiments, as the behaviour of one individual can indeed influence the behaviour of the other. We shared this concern initially; however, the consistency of the data with our hypotheses suggests that this potential bias did not adversely affect the validity of our findings, rendering the concern largely illusory at least in the context of our study.

      We opted for the living-individual paradigm for the following reasons:

      There is a growing trend in ethological as well as animal cognition research towards more ecologically valid and biologically relevant settings, while simultaneously advancing the precision and quantification of the data collected. This is referred to as computational ethology.

      This approach advocates for assessing behaviour in environments that more closely resemble natural conditions, rather than relying solely on sterile and artificial experimental setups. The rationale is that such naturalistic arenas allow animals to exhibit a broader range of behaviours and interactions, providing a more accurate reflection of their cognitive and social abilities. The challenge, however, lies in navigating the inherent trade-off between the strict control offered by standardized procedures and the ecological validity of more naturalistic interactions.

      By allowing two spiders to confront each other, we aimed to capture authentic behavioural responses while maintaining a degree of experimental standardization through the use of a controlled setup. Our approach ensures that the behaviours observed are not merely artifacts of an artificial environment but are representative of genuine social interactions. Also, to minimize potential biases arising from mutual behavioural influences, we employed a controlled and repeatable experimental environment. 

      We believe that the chosen approach provides a meaningful balance (in the above-mentioned trade-off) between ecological validity and experimental rigour. By combining a standardized environment with the naturalistic interaction of real spiders, we ensured that our findings are both scientifically robust and biologically relevant.

      However, this issue can be readily avoided because salticids are well known, for example, to be highly responsive to lures (e.g. dead prey glued in lifelike posture onto cork disks) and to computer animation. 

      While it is true that salticid spiders are responsive to lures and computer animations, we carefully considered the most appropriate and ecologically valid approach for our study. Our aim was to capture genuine behavioural patterns in a context that closely mimics the natural encounters these spiders experience.

      Additionally, creating comparable video stimuli of spiders presents its own set of challenges: Video recordings or computer animations may not fully capture the nuanced behaviours and subtle variations that occur during real-life interactions. There is also a risk that such stimuli could be perceived differently by the spiders, potentially introducing new biases or confounding factors.

      Scientific progress is not made by merely relying on previously established paradigms, especially when they may not be suitable for the specific context of a study. While alternative methods like lures or computer animations can be valuable in certain situations, our approach was deliberately chosen to best capture the naturalistic and interactive aspects of spider behaviour.

      These methods have already been successful and helpful for standardizing the different stimuli presented during many different experiments for many different salticid spiders, and they would be helpful for better understanding how Phidippus regius might recognize another individual on the basis of phenotypic variation. There are all sorts of ways in which a salticid might recognize another individual. Differences in face or body structure, or body size, or all of these, might have an important role in recognition, but we won't know what these are using the current methods alone. Also, I didn't see any details about whether body size was standardized in the current manuscript.

      As mentioned previously, the goal of our study was to demonstrate that identity recognition occurs in spiders. This alone is of significant importance, as it challenges existing assumptions about the cognitive capabilities of small-brained animals. We did not aim at providing a proximate explanation (mechanism) for identity recognition in spiders.

      The problem with what the reviewer suggested is this: As long as we do not have conclusive evidence that spiders recognize individual conspecifics, any attempt to design and manipulate stimuli would lack a solid foundation. Without understanding whether spiders have this capability, we cannot make informed decisions about which features or characteristics to manipulate in stimuli. In other words, this uncertainty means we lack a starting point for our assumptions, making it nearly impossible to create stimuli that would be useful or relevant in testing identity recognition.

      Additionally, it is nearly impossible to artificially generate a stimulus set that encompasses the natural variance in features that spiders use for visual individuation. There is no guarantee that artificial stimuli, such as lures or computer animations, would capture the relevant features that spiders use in natural interactions.

      In other words, the question of how Phidippus regius recognizes another individual will be the subject of further investigation. In this study, we focus on whether or not they individuate others.

      For another perspective, my thoughts turn to a paper by Cross et al.

      Cross, F. R., Jackson, R. R., & Taylor, L. A. (2020). Influence of seeing a red face during the male-male encounters of mosquito-specialist spiders. Learning & Behavior, 48, 104-112. doi: 10.3758/s13420-020-00411-y

      These authors found that males of Evarcha culicivora, another salticid species that is known to have a red face, become less responsive to their own mirror images after having their faces painted with black eyeliner than if their faces remained red. In all instances, the spiders only saw their own mirror images and never another spider, and these results cannot be interpreted on the basis of habituation/dishabituation because the spiders were not responding differently when they simply saw their mirror image again. Instead, it was specifically the change to the spider's face which resulted in a change of behavior. The findings from this paper and from Tedore and Johnsen can help give us additional perspectives that the authors might like to consider. On the whole, I would like the authors to further consider the features that P. regius might use to discern and recognize another individual.

      We acknowledge that identifying the specific features used by P. regius for identity recognition is a valuable direction for future research. However, we must emphasise that without first establishing whether spiders are capable of individuating each other, it would be premature and challenging to determine the specific features they rely on for this process. A lack of response to certain features could either suggest that those features are not relevant or, more critically, that the spider does not recognize individual identities at all. Thus, our initial focus on demonstrating identity recognition is essential before delving into the specific cues or characteristics involved.

      While the call for addressing the proximate causation of identity recognition in jumping spiders is valid, we need to also reiterate the significance of our findings and why they stand on their own merit:

      Our study demonstrates for the first time that Phidippus regius can systematically individuate conspecifics, showing habituation within short intervals (10 minutes) and over longer intervals (1 hour). This behaviour is not due to general habituation or physical fatigue but is a result of cognitive habituation, as illustrated by the spiders' response to novel individuals introduced after repeated encounters with familiarized ones. 

      What are the implications of this? Our findings indicate that these spiders possess long-term memory and form representations that can be reactivated after an hour. While this is most likely not fully consolidated memory formation (see our reply to Reviewer 1), it represents an encoded long-term memory. This implies that small-brained animals can remember, represent, and potentially build internal mental images, which are crucial for sophisticated cognitive processing.

      Reviewer #3 (Public Review):

      Summary:

      Jumping spiders (family Salticidae) have extraordinarily good eyesight, but little is known about how sensitive these small animals might be to the identity of other individuals that they see. Here, experiments were carried out using Phidippus regius, a salticid spider from North America. There were three steps in the experiments; first, a spider could see another spider; then its view of the other spider was blocked; and then either the same or a different individual spider came into view. Whether it was the same or a different individual that came into view in the third step had a significant effect on how close together or far apart the spiders positioned themselves. It has been demonstrated before that salticids can discriminate between familiar and unfamiliar individuals while relying on chemical cues, but this new research on P. regius provides the first experimental evidence that a spider can discriminate by sight between familiar and unfamiliar individuals.

      Clark RJ, Jackson RR (1995) Araneophagic jumping spiders discriminate between the draglines of familiar and unfamiliar conspecifics. Ethology, Ecology and Evolution 7:185-190

      We appreciate the reviewer's comprehensive summary and acknowledgment of the significance of our findings.

      Strengths:

      This work is a useful step toward a fuller understanding of the perceptual and cognitive capacities of spiders and other animals with small nervous systems. By providing experimental evidence for a conclusion that a spider can, by sight, discriminate between familiar and unfamiliar individuals, this research will be an important milestone. We can anticipate a substantial influence on future research.

      We appreciate the reviewer’s recognition of the strengths and significance of our study. We are pleased that the reviewer considers our research an important milestone. Our findings indeed suggest that even animals with relatively simple nervous systems can perform complex cognitive tasks, which has substantial implications for the broader study of animal cognition.

      As pointed out by the reviewer, we also hope that our study will have a substantial influence on future research. By establishing a methodology and providing clear evidence of visual discrimination, we aim to encourage further investigations into the cognitive abilities of jumping spiders and other arthropods. Future research can build on our findings to explore the specific visual cues and mechanisms involved in individual recognition (as Reviewer 2 pointed out), as well as the ecological and evolutionary implications of these abilities.

      Weaknesses:

      (1) The conclusions should be stated more carefully.

      We agree that clarity in our conclusions is paramount. We will revise the manuscript to ensure that our conclusions are presented with precision and appropriately reflect the data. Specifically, we will emphasize the evidence supporting our findings of visual individual recognition and clarify the limitations and scope of our conclusions to avoid any potential overstatements.

      (2) It is not clearly the case that the experimental methods are based on 'habituation' (learning to ignore; learning not to respond). Saying 'habituation' seems to imply that certain distances are instances of responding and other distances are instances of not responding but, as a reasonable alternative, we might call distance in all instances a response. However, whether all distances are responses or not is a distracting issue because being based on habituation is not a necessity.

      We appreciate the reviewer's feedback and understand the concern regarding the use of the term 'habituation.' We agree that all distances maintained by the spiders are responses; we interpret them as the spiders’ “active decisions”, based on their perception of the other individual and modulated by their recognition of the same or a different individual.

      The terms 'habituation' and 'dishabituation' are used to label trial types for ease of discussion and to describe the expected behavioural modulation.

      (3) Besides data related to distances, other data might have been useful. For example, salticids are especially well known for the way they communicate using distinctive visual displays and, unlike distance, displaying is a discrete, unambiguous response.

      We appreciate the reviewer’s suggestion to incorporate data on visual displays, which are indeed well-known communication methods among salticids. We agree that visual displays are discrete and unambiguous responses that could provide additional insights into the spiders' recognition abilities.

      Our primary focus on distance measurements was driven by the need to quantify behaviour in a continuous and scalable manner, that is, how spiders modulate their proximity based on familiarity with other individuals.

      We acknowledge the potential value of including visual display measurements; however, in our study, we aimed to establish a foundational understanding of recognition behaviour through proximity measures first. Also, capturing displays requires a different experimental paradigm, in which the displays are clearly visible and analyzable.

      (4) Methods more aligned with salticids having extraordinarily good eyesight would be useful. For example, with salticids, standardising and manipulating stimuli in experiments can be achieved by using mounts, video playback, and computer-generated animation.

      There is no doubt that salticids have excellent eyesight. However, our study focuses on higher-level perceptual processes that require complex brain analysis, not just visual acuity. The goal was to investigate whether spiders can individuate and recognize conspecifics, which involves interpreting visual information and forming long-term representations.

      Clearly, methods like video playback and computer animations are useful in controlled settings, where the spider is mounted, but they pose challenges for our specific research question. At this stage of research, we lack precise knowledge of which visual features are critical for individual recognition in spiders, making it difficult to design effective artificial stimuli. 

      Our primary objective was to determine if spiders can individuate others. Before exploring the proximate mechanisms of how they individuate others, it was essential to establish that they have this capability. This foundational question needed to be addressed before delving into more detailed mechanistic studies.

      (5) An asocial-versus-social distinction is too imprecise, and it may have been emphasised too much. With P. regius, irrespective of whether we use the label asocial or social, the important question pertains to the frequency of encounters between the same individuals and the consequences of these encounters.

      Our intent was to convey that P. regius does not live in cohesive social groups but does engage in individual interactions that can have significant behavioral consequences. We will revise the manuscript to reduce the emphasis on the asocial-versus-social distinction. As discussed above, we also will change the term “asocial” to “non-social” in the manuscript.

      (6) Hypotheses related to not-so-strictly adaptive factors are discussed and these hypotheses are interesting, but these considerations are not necessarily incompatible with more strictly adaptive influences being relevant as well.

      We appreciate the reviewer's observation regarding the discussion of hypotheses related to not-so-strictly adaptive factors. We agree that our considerations of these factors do not preclude the relevance of more strictly adaptive influences.

      We will revise the manuscript to explicitly discuss how our findings can be interpreted in the context of adaptive hypotheses. This will provide a more comprehensive understanding of the evolutionary significance of individual recognition in P. regius. Modifications were made in the Discussion section.

      In the following, we comment on issues not mentioned in the “public reviews” section.

      Reviewer #1 (Recommendations For The Authors):

      (1) I would suggest conducting experiments that actually test for recognition memory, as this seems to be a claim that the authors make. Following the ant studies by Dreier cited in this manuscript would be sufficient to test for memory. Given the relative simplicity of the measures being taken (location of spiders), this would seem like a very simple addition that would provide a much stronger and more readily interpreted dataset.

      As previously explained in our detailed responses (public reviews), we believe that the current design effectively addresses the questions at hand. Our approach, using a habituation-dishabituation paradigm, provides robust evidence for recognition memory within the framework of early long-term memory.

      Additionally, we have explained why using the distance to the panel as a measure is not appropriate in this context. Specifically, using such a measure can misrepresent the actual interests of the spiders in each other.

      While we acknowledge the merits of the ant studies by Dreier, our current design allows for a detailed understanding of the spiders' recognition capabilities over short (10 min) and slightly longer intervals (up to one hour). This is sufficient to demonstrate the presence of recognition memory without the necessity of further experiments. The observed patterns of habituation and dishabituation responses in our study clearly indicate that the spiders can distinguish between familiar and novel individuals, which supports our claims.

      Given these points, we respectfully maintain that the current data and experimental design are adequate to support our findings and provide a comprehensive understanding of recognition memory in Phidippus regius.

      (2) The writing is rather impenetrable. The results explain the basic finding in terms of statistical variables rather than simply stating the results. A clear and straightforward statement such as 'the spiders showed reduced interest upon habituation trials, indicating xyz' (and then citing the stats) is preferable to the introduction of results as a statistical model. The statistical model is a means of assessing the results. It is not the result. Describe the data.

      We tried to improve that in the current version.

      (3) Showing more straightforward data such as distance from the joint barrier would make the paper much easier to understand.

      This paper has been on bioRxiv for some time and my guess is that it has ended up here because it is having trouble in review. Collecting new data that more directly test the question at hand, presenting the data in a more direct manner, and more critically evaluating your own claims will improve the paper.

      While it is true that the paper has been on bioRxiv for a while, this submission marks the first instance where it has undergone peer review. Prior to this, the manuscript was submitted to other journals but was not reviewed.

      We hope the explanations provided in the “public reviews” section, along with the revised manuscript, sufficiently clarify our study and its conclusions. We believe the current data robustly address the research questions, and as outlined in our detailed responses, we have critically evaluated our claims and presented the data clearly. Given these clarifications, we do not see the necessity for new experiments as the existing data adequately support our findings. We trust that these revisions and explanations will clarify any misunderstandings.

      I am totally sold that the spiders are paying attention to identity at some level. The key now is to understand what that actually means in terms of recognition (i.e. memory of individuals) not just habituation.

      We appreciate the reviewer’s emphasis on the distinction between habituation and memory-based individual recognition. As detailed in the preceding discussion, we have taken great care to clarify how our paradigm distinguishes simple habituation effects from true memory for individual identity, and we trust this makes clear how our findings go beyond simple habituation to establish genuine individual recognition.

      Reviewer #2 (Recommendations For The Authors):

      Aside from the comments in the public review, I have some additional comments that the authors may wish to consider.

      Numerous times in the manuscript, the authors mentioned that recognizing individuals requires recognition memory. This seems rather obvious, and I wonder if the authors could instead be more precise about what they mean by 'recognition memory'?

      Recognition memory refers to the cognitive ability to identify previously encountered stimuli, individuals, or events as familiar. It involves both encoding and retrieval processes, allowing an organism to distinguish between novel and familiar stimuli. This form of memory is a fundamental component of cognitive functioning and is supported by neural mechanisms that, in the mammalian brain, involve the hippocampus and other brain regions associated with memory processing.

      In our study, we aimed to test whether Phidippus regius recognizes conspecifics, or, in other words, utilizes recognition memory to distinguish between familiar and unfamiliar conspecifics. With the habituation - dishabituation paradigm, we assessed the spiders' ability to recognize previously encountered individuals and demonstrate memory retention over short (10 min) and extended periods (1 hour).

      Encoding: In the initial trial, when a spider encounters an individual for the first time (Figure 1A, “Baseline” or “Dishabituation” for every following trial), it encodes the visual information related to that specific individual. This encoding process involves creating a memory trace of the individual's phenotypic characteristics.

      Storage: During the visual separation period, this encoded information is stored in the spider's memory system. The memory trace, though initially fragile, starts to stabilize over the separation period. Whether or not this leads to some form of consolidated memory remains unaddressed. This aspect was highlighted by the first reviewer, but our focus is on the early process rather than on late processes, such as consolidation. 

      Retrieval: In the subsequent trial, when the same individual is presented again, the spider retrieves the stored memory trace. If the spider recognizes the individual, its behaviour reflects habituation, indicating memory retrieval. Conversely, when a novel individual is introduced, the lack of stored memory trace triggers a different behavioural response, indicating dishabituation. This differential response demonstrates the spider's ability to distinguish between familiar and unfamiliar individuals. This differential response is also key to understanding the nature of habituation over the three sessions, as introducing novel spiders leads to a significant dishabituation response after the three sessions in Experiment 2.

      In Line 39, the authors state that they used "a naturalistic experimental procedure". I would like to know how this experiment is 'naturalistic'. The authors' use of an arena does not appear naturalistic, or something the spiders would encounter in the wild.

      We appreciate the reviewer's comment regarding our use of the term 'naturalistic'. We acknowledge that the experimental arena itself does not replicate the conditions found in the wild. Our approach aimed to incorporate elements of natural behaviour by allowing two spiders to freely move and interact within the controlled environment. This approach aligns with principles from computational ethology, which seeks to balance the trade-off between repeatability/standardization and observing free, naturalistic behaviour. By using this paradigm, we aimed to capture behaviours that closely resemble those exhibited in their natural habitat. This setup was chosen to balance the need for ecological validity with the requirements for standardized data collection. 

      Also, and this point has been raised above, by observing the spiders' natural interactions without restraining them or using artificial stimuli like computer animations, we aimed to capture behaviours that closely resemble their natural responses to conspecifics. In contrast, we would not have any clear expectations regarding responses to arbitrarily designed artificial stimuli. This method provides a more ecologically valid assessment of the spiders' recognition abilities.

      There are a few details wrong in Line 41. 'Salticidae' is a family name and shouldn't be italicized. Also, the sentence suggests that there is a spider called a 'jumping spider' in the family Salticidae, which is technically called Phidippus regius. To clarify, all spiders in the family Salticidae are known as jumping spiders, and one species of jumping spiders is called Phidippus regius.

      We will correct this in the manuscript to accurately reflect the classification and terminology. Thank you for pointing out these inaccuracies.

      A manuscript on individual recognition by a salticid should include citations to earlier papers that have already considered individual recognition by salticids. As well as the paper by Tedore and Johnsen (2013), the authors should be aware of the following papers.

      Clark, R. J., & Jackson, R. R. (1994). Portia labiata, a cannibalistic jumping spider, discriminates between its own and foreign egg sacs. International Journal of Comparative Psychology, 7, 38-43.

      Clark, R. J., & Jackson, R. R. (1994). Self-recognition in a jumping spider: Portia labiata females discriminate between their own draglines and those of conspecifics. Ethology, Ecology & Evolution, 6, 371-375.

      Clark, R. J., & Jackson, R. R. (1995). Araneophagic jumping spiders discriminate between the draglines of familiar and unfamiliar conspecifics. Ethology, Ecology & Evolution, 7, 185-190.

      We appreciate the reviewer's suggestion to include citations to these earlier papers. We will add the recommended references to provide a comprehensive background.

      In Line 203, I would not consider "interaction with human caretakers and experimenters" to be a form of behavioral enrichment. This kind of interaction has the potential to be stressful for the spiders, rather than enriching. I suggest deleting that part of the sentence.

      We appreciate the reviewer's feedback and agree that interactions with human caretakers and experimenters might not always be enriching and could potentially be stressful for the spiders. We will remove that part of the sentence to better reflect the intended meaning.

      Reviewer #3 (Recommendations For The Authors):

      This manuscript is useful and interesting, and I predict that it will be influential, but more attention should be given to stating the objective and conclusion accurately and clearly. As I understand it, the objective was to investigate a specific hypothesis: that Phidippus regius has a capacity to identify conspecific individuals as particular individuals (i.e., individual identification). Strong evidence supporting this hypothesis being true would be especially remarkable because I am unaware of any published work having shown evidence of a spider expressing this specific perceptual capacity.

      Thank you for recognizing the significance and potential influence of our manuscript. We agree that clearly stating the objective and conclusions is essential for conveying the importance of our findings. Our results provide robust evidence supporting the hypothesis that Phidippus regius can recognize and remember individual conspecifics. We will revise the manuscript to more clearly highlight the objective and our conclusions, emphasizing the novel evidence for individual identification in these spiders.

      Based on reading this manuscript and based on my understanding of the meaning of 'individual identification', it seems to me that the hypothesis that P. regius has a capacity for individual identification might or might not be true, and the experiments in this manuscript cannot tell us which is the case. 

      We respectfully disagree with the reviewer's assessment. Our experiments were carefully designed to test whether P. regius has the capacity for individual identification, and our results provide clear evidence supporting this hypothesis. The systematic differences in the spiders' behaviour when encountering familiar versus novel individuals indicate that they can recognize and remember specific conspecifics. We will revise the manuscript to ensure that the evidence and conclusions are stated more clearly to address any potential misunderstandings.

      Determining which is the case would have required research that made better use of the literature, displayed more critical thinking, addressed credible alternative hypotheses, and adopted experimental methods that focused more strictly on individual identification.

      Whether P. regius has a capacity for individual identification is not ambiguous in our study. Our findings clearly demonstrate this capacity through systematic behavioural responses to familiar versus novel individuals. As pointed out above, the experimental procedure might be complex, but the results are systematic despite this complexity. The experiments were designed to directly address the hypothesis of individual identification, and the data robustly support our conclusions. While considering alternative hypotheses is important, the results we present provide a coherent and compelling case for individual identification in P. regius. We will ensure our manuscript clearly articulates this narrative and the supporting evidence.

      At the same time, I also appreciate that asking for all of that at once would be asking for too much. As I see it, this manuscript tells us about research that moves us closer to a clear focus on the details and questions that will matter in the context of considering a hypothesis that is strictly about individual identification. More importantly, I think this research reveals a perceptual capacity that is remarkable even if it is not strictly a capacity for individual identification.

      We understand the desire for a more focused exploration of individual identification with paradigms more familiar to the reviewers and we acknowledge that further detailed studies could enhance our understanding of this capacity. However, our findings do indeed suggest that Phidippus regius exhibits a remarkable perceptual capacity for recognizing and remembering individual conspecifics. The systematic behavioural responses observed in our experiments strongly indicate that these spiders possess the ability for individual recognition. While our study may not have explored every potential detail (e.g. which features are most crucial for the memory matching processes), the evidence we present robustly supports the conclusion of individual identification.

      We acknowledge that it is indeed valuable to follow established paradigms and build upon the frameworks that have been used successfully in similar species and studies. These paradigms provide a solid foundation for scientific inquiry and allow for comparability across different research efforts. However, it is equally important to acknowledge and explore alternative approaches. Scientific progress is driven not only by replication but also by innovation. By employing new paradigms, researchers can uncover novel insights and push the boundaries of current understanding. The paradigm we used in our study, while different from those traditionally applied to similar research, is not an invention but a well-established method in various domains. It represents an innovative application in the context of our specific research questions, offering a fresh perspective and contributing to the advancement of the field.

      As I understand it, 'individual identification' means identifying another individual as being a particular individual instead of a member of a larger set (or 'class') of individuals. An 'individual' is a set containing a single individual. Interesting examples of identifying members of larger sets include discriminating between familiar and unfamiliar individuals. In the context of the specific experiments in this manuscript, familiar-unfamiliar discrimination means discriminating between recently-seen and not-so-recently-seen individuals. My impression is that the experiments in this manuscript have given us a basis for concluding that P. regius has a capacity for familiar-unfamiliar (recently seen versus not so recently seen) discrimination. If this is the case, then I think this is the conclusion that should be emphasised. This would be an important conclusion.

      I appreciate that, depending on how we use the words, familiar-unfamiliar discrimination might be construed as being 'individual identification'. An individual is identified as 'the individual recently seen'. As a casual way of speaking, it can be reasonable to call this 'individual identification'. The difficulty comes from the way calling this 'individual identification' can suggest something more than has been demonstrated. To navigate through this difficulty, we need an expression to use for a capacity that goes beyond familiar-unfamiliar discrimination. In the context of this manuscript about P. regius, we need expressions that will make it easy to consider two things. One of these things is a capacity for familiar-unfamiliar discrimination. The other is the capacity to identify another individual as being a particular individual.

      We appreciate the reviewer's insightful comments on the distinction between familiar-unfamiliar discrimination and individual identity recognition. Our study indeed focuses on demonstrating that Phidippus regius can recognize and remember individual conspecifics, providing evidence for individual identity recognition.

      Two specific behavioural hallmarks speak against mere familiarity recognition:

      First, the significant dishabituation response to novel individuals introduced after multiple sessions underscores the specificity of the recognition. This shows that the spiders' habituation is not general but specific to familiar individuals. 

      Second, the pattern of habituation over the sessions provides further evidence: We observed the strongest systematic modulation in Session 1, a reduced modulation in Session 2, and a further diminished effect in Session 3. If the spiders were only responding based on familiarity, we would expect a more drastic decrease, resulting in a washed-out non-effect by Session 2. However, the continued, though diminishing, differentiation between habituation and dishabituation trials across sessions indicates that the spiders are not merely responding to a general sense of familiarity but are engaging in individual recognition. In other words, the spiders' ability to distinguish between familiar and novel individuals even after repeated exposures suggests that they are not just recognizing a familiar status but are identifying specific individuals.

      Things people do might help clarify what this means. People have an extraordinary capacity for identifying other individuals as particular individuals. Often this is based on giving each other names. Imagine we are letting somebody see photographs and asking them to identify who they see. The answer might be, 'somebody familiar' or 'somebody I saw recently' (familiar-unfamiliar discrimination); or the question might be answered by naming a particular individual (individual identification).

      We appreciate the reviewer's efforts to clarify the distinction between familiar-unfamiliar discrimination and individual recognition using human examples. However, we believe this comparison might not fully capture the complexity of individual recognition in non-human animals. 

      Familiarity recognition refers to recognizing someone as having been seen or encountered before without necessarily distinguishing them from others in the same category. On the other hand, identity recognition involves recognizing a specific individual based on unique characteristics (or features). In humans, this often involves naming, but more critically, like in most animals, it involves recognizing visual, auditory, chemical or other sensory cues. In animals, including spiders, individual recognition does not involve, let alone rely on, naming; it relies on the ability to distinguish between individuals based on sensory cues and learnt associations. This is a valid and well-documented form of individual recognition across many species.

      Individual recognition does not require naming or the assignment of a referential label. Animals can distinguish between specific individuals based on previously perceived and stored features and characteristics. Naming is the exception rather than the rule in the animal kingdom. Only a few species, such as humans and maybe certain cetaceans, use naming for identity recognition. This is an evolutionary rarity and not the standard mechanism for individual recognition, which primarily relies on sensory cues and learnt associations. Furthermore, the mechanism of recognition in both humans and animals involves a complex process of matching incoming sensory and perceptual information with stored memory representations. Naming is merely a tool for communication, allowing us to convey which individual we are referring to. It is not the mechanism by which recognition occurs. The core of individual recognition is this matching process, where sensory cues (visual, auditory, chemical, etc.) are compared to memory traces of previously encountered individuals. Therefore, the suggestion that individual identification necessitates naming misrepresents the actual cognitive processes involved. 

      We can think of individual identification being based on more fine-grained discrimination (with this, set size = one), with familiar-unfamiliar discrimination being more coarse-grained discrimination (with this, set size can be more than one). Restricting the expression 'individual identification' to instances of having the capacity to identify another individual as being a particular individual (set size = one) is better aligned with normal usage of this expression.

      Absolutely, the distinction between fine-grained and coarse-grained discrimination aligns with the concept of different category levels, such as basic and subordinate levels, put forward by Eleanor Rosch (e.g. Rosch, 1973). In the context of individual recognition, fine-grained discrimination (where set size = one) refers to the ability to identify a specific individual based on unique characteristics. This is referred to as subordinate level categorization. Coarse-grained discrimination (where set size can be more than one) refers to recognizing someone as familiar without distinguishing them from others in the same category, more similar to basic level categorization. 

      Rosch, E.H. (1973). "Natural categories". Cognitive Psychology. 4 (3): 328–50. doi:10.1016/0010-0285(73)90017-0

      There is a strong emphasis on an asocial-social distinction in this manuscript. It seems to me that this needs to be focused more clearly on the specific factors that would make a capacity for individual identification beneficial. In the context of this manuscript, the term 'social' may suggest too much. It seems to me that the issue that matters the most is whether individuals live in situations where important encounters occur frequently between the same individuals. Irrespective of whether other notions of the meaning of 'social' also apply, there are salticids that live in aggregated situations where they frequently have important encounters with each other. This is the case with Phidippus regius in the field in Florida, but I realize that there may not be much published information about the natural history of this salticid. Even so, there are salticids to which the word 'social' has been applied in published literature.

      We appreciate the reviewer's comments on the asocial-social distinction and we agree that this terminology might need refinement. Our intent was not to categorize Phidippus regius rigidly but to explore the contextual factors influencing the benefits of individual identification. The critical factor in our study is indeed the frequency and importance of encounters between individuals, rather than a broader social structure. We will revise the manuscript to reflect this more nuanced perspective, focusing on the ecological validity of our experimental design and the adaptive significance of individual recognition in environments where repeated encounters can occur.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      (1) Some details are not described for experimental procedures. For example, what were the pharmacological drugs dissolved in, and what vehicle control was used in experiments? How long were pharmacological drugs added to cells?

      We apologise for the oversight. These details have now been added to the methods section of the manuscript as well as to the relevant figure legends.

      Briefly, latrunculin was used at a final concentration of 250 nM and Y27632 at a final concentration of 50 μM. Both drugs were dissolved in DMSO. Vehicle controls were performed with the higher of the two final DMSO concentrations used for the drugs.

      The details of the drug treatments and their duration have been added to the methods and to figures 6, S10, and S12.

      (2) Details are missing from the Methods section and Figure captions about the number of biological and technical replicates performed for experiments. Figure 1C states the data are from 12 beads on 7 cells. Are those same 12 beads used in Figure 2C? If so, that information is missing from the Figure 2C caption. Similarly, this information should be provided in every figure caption so the reader can assess the rigor of the experiments. Furthermore, how heterogenous would the bead displacements be across different cells? The low number of beads and cells assessed makes this information difficult to determine.

      We apologise for the oversight. We have now added this data to the relevant figure panels.

      To gain a further understanding of the heterogeneity of bead displacements across cells, we have replotted the relevant graphs using different colours to indicate different cells. This reveals that different cells appear to behave similarly and that the behaviour appears controlled by distance to the indentation or the pipette tip rather than cell identity.

      We agree with the reviewer that the number of cells examined is low. This is due to the challenging nature of the experiments, which means that many attempts are necessary to obtain a successful measurement.

      The experiments in Fig 1C are a verification of a behaviour documented in a previous publication [1]. Here, we just confirm the same behaviour and therefore we decided that only a small number of cells was needed.

      The experiments in Fig 2C (that allow for a direct estimation of the cytoplasm’s hydraulic permeability) require formation of a tight seal between the glass micropipette and the cell, something known as a gigaseal in electrophysiology. The success rate of this first step is 10-30% of attempts for an experienced experimenter. The second step is forming a whole cell configuration, in which a hydraulic link is formed between the cell and the micropipette. This step has a success rate of ~ 50%. Whole cell links are very sensitive to any disturbance. After reaching the whole cell configuration, we applied relatively high pressures that occasionally resulted in loss of link between the cell and the micropipette. In summary, for the 12 successful measurements, hundreds of unsuccessful attempts were carried out.
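
      As a back-of-the-envelope illustration of these success rates, the short sketch below estimates how many attempts 12 successful measurements would take. It assumes that the gigaseal and whole-cell steps succeed independently, and it ignores the additional losses that occur when the link breaks under applied pressure, so it gives only a lower bound. The function name `attempts_per_success` is hypothetical.

```python
# Rough lower-bound estimate of the attempts needed for the Fig 2C measurements.
# Assumption: the two steps (gigaseal, whole-cell) succeed independently.
# Links lost later under applied pressure are NOT modelled here.

def attempts_per_success(p_seal: float, p_whole_cell: float) -> float:
    """Expected attempts per success for two independent sequential steps."""
    return 1.0 / (p_seal * p_whole_cell)

n_success = 12        # successful measurements reported
p_whole_cell = 0.5    # ~50% success for the whole-cell step

for p_seal in (0.10, 0.30):  # 10-30% gigaseal success rate
    expected = n_success * attempts_per_success(p_seal, p_whole_cell)
    print(f"seal success {p_seal:.0%}: ~{expected:.0f} attempts "
          f"for {n_success} measurements")
```

      Under these assumptions the estimate is on the order of 80 to 240 attempts before any link loss at high pressure is counted, which is in line with the "hundreds of unsuccessful attempts" described above.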

      (3) The full equation for displacement vs. time for a poroelastic material is not provided. Scaling laws are shown, but the full equation derived from the stress response of an elastic solid and viscous fluid is not shown or described.

      We thank the reviewer for this comment. Based on our experiments, we found that the cytoplasm behaves as a poroelastic material. However, to understand the displacements of the cell surface in response to localised indentation, we show that we also need to take the tension of the submembranous cortex into account. In summary, the interplay between cell surface tension generated by the cortex and the poroelastic cytoplasm controls the cell behaviour. To our knowledge, no simple analytical solutions to this type of problem exist.

      In Fig 1, we show that the response of the cell to local indentation is biphasic with a short time-scale displacement followed by a longer time-scale one. In Figs 2 and 3, we directly characterise the kinetics of cell surface displacement in response to microinjection of fluid. These kinetics are consistent with the long time-scale displacement but not the short time-scale one. Scaling considerations led us to propose that tension in the cortex may play a role in mediating the short time-scale displacement. To verify this hypothesis, we have now added new data showing that the length-scale of an indentation created by an AFM probe depends on tension in the cortex (Fig S5).  

      In a previous publication [2], we derived the temporal dynamics of cell surface displacement for a homogenous poroelastic material in response to a change in osmolarity. In the current manuscript, the composite nature of the cell (membrane, cortex, cytoplasm) needs to be taken into account as well as a realistic cell shape. Therefore, we did not attempt to provide an analytical solution for the displacement of the cell surface versus time in the current work. Instead, we turned to finite element modelling to show that our observations are qualitatively consistent with a cell that comprises a tensed submembranous actin cortex and a poroelastic cytoplasm (Fig 4). We have now added text to make this clearer for the reader.

      Reviewer #2 (Public review):

      Comments & Questions:

      The authors state, "Next, we sought to quantitatively understand how the global cellular response to local indentation might arise from cellular poroelasticity." However, the evidence presented in the following paragraph appears more qualitative than strictly quantitative. For instance, the length scale estimate of ~7 μm is only qualitatively consistent with the observed ~10 μm, and the timescale 𝜏𝑧 ≈ 500 ms is similarly described as "qualitatively consistent" with experimental observations. Strengthening this point would benefit from more direct evidence linking the short timescale to cell surface tension. Have you tried perturbing surface tension and examining its impact on this short-timescale relaxation by modulating acto-myosin contractility with Y-27632, depolymerizing actin with Latrunculin, or applying hypo/hyperosmotic shocks?

      Upon rereading our manuscript, we agree with the reviewer that some of our statements are too strong. We have now moderated these and clarified the goal of that section of the text.

      The reviewer asks if we have examined the effect of various perturbations on the short time-scale displacements. In our experimental conditions, we cannot precisely measure the time-scale of the fast relaxation because its duration is comparable to the frame rate of our image acquisition. However, we examined the amplitude of the displacement of the first phase in response to sucrose treatment and we have carried out new experiments in which we treat cells with 250 nM Latrunculin to partially depolymerise cellular F-actin. Neither of these treatments had an impact on the amplitude of vertical displacements (Fig. S3).

      The absence of change in response to Latrunculin may be because the treatment decreases both the elasticity of the cytoplasm E and the cortical tension γ. As the length-scale λ of the deformation of the surface scales as λ ~ γ/E, the two effects of latrunculin treatment may therefore compensate one another and result in only small changes in λ. We have now added this data to supplementary information and comment on this in the text.

      The reviewer’s comment also made us want to determine how cortical tension affects the length-scale of the cell surface deformation created by localised microindentation. To isolate the role of the cortex from that of cell shape, we decided to examine rounded mitotic cells. In our experiments, we indented a mitotic cell expressing a membrane targeted GFP with a sharp AFM tip (Fig. S5).

      In these experiments, we adjusted the force to generate a 2 μm deep indentation and we imaged the cell profile with confocal microscopy before and during indentation. Segmentation of this data allowed us to determine the cell surface displacement resulting from indentation and measure a length scale of deformation. In control conditions, the length scale of the deformation is on the order of 1.2 μm. When we inhibited myosin contractility with blebbistatin, the length-scale of deformation decreased significantly to 0.8 μm, as expected if we decrease the surface tension γ without affecting the cytoplasmic elasticity. We have now added this data to our manuscript.

      The authors demonstrate that the second relaxation timescale increases (Figure 1, Panel D) following a hyperosmotic shock, consistent with cytoplasmic matrix shrinkage, increased friction, and consequently a longer relaxation timescale. While this result aligns with expectations, is a seven-fold increase in the relaxation timescale realistic based on quantitative estimates given the extent of volume loss?

      We thank the reviewer for this interesting question. Upon re-examining our data, we realised that the numerical values in the text related to the average rather than the median of our measurements. The median of the poroelastic time constant increases from ~0.4 s in control conditions to 1.4 s in sucrose, an approximately 3.5-fold increase.

      Previous work showed that HeLa cell volume decreases by ~40% in response to hyperosmotic shock [3]. The fluid volume fraction in cells is ~65-75%. If we assume that the water is contained in N pores of volume ξ^3, we can express the cell volume as V = N ξ^3 + V_s, with V_s the volume of the solid fraction. We can rewrite V = (1 + φ) N ξ^3,

      with φ = 0.42-0.6. As V_s does not change in response to osmotic shock, we can rewrite the volume change to obtain the change in pore size ξ.

      The poroelastic diffusion constant scales as D_p ~ ξ^2 and the poroelastic timescale scales as τ_p ~ 1/D_p. Therefore, the measured change in volume leads to a predicted increase in poroelastic diffusion time of 1.7-1.9 fold, smaller than observed in our experiments. This suggests that some intuition can be gained in a straightforward manner assuming that the cytoplasm is a homogenous porous material.
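
      The scaling argument above can be checked numerically with a short sketch. It rests on assumptions that are stated here rather than taken from the manuscript: the cell volume is the sum of a fluid part held in N pores of size xi and a fixed solid part, the solid-to-fluid volume ratio `phi` is derived from the quoted fluid fraction (one possible reading of that parameter), the diffusion constant scales as xi squared, and the timescale scales as the inverse of the diffusion constant at a fixed length scale. The function name `timescale_fold_change` is hypothetical.

```python
# Sketch of the pore-size scaling argument (illustrative assumptions only):
#   V = N*xi**3 + V_s, with V_s = phi * N*xi**3 unchanged by the osmotic shock
#   D_p ~ xi**2 and tau_p ~ 1/D_p at a fixed length scale

def timescale_fold_change(volume_ratio: float, phi: float) -> float:
    """Predicted tau_after / tau_before for a given V_after / V_before."""
    # (xi_after / xi_before)**3 follows from keeping the solid volume fixed
    pore_volume_ratio = (1.0 + phi) * volume_ratio - phi
    if pore_volume_ratio <= 0:
        raise ValueError("volume loss exceeds the available fluid volume")
    pore_size_ratio = pore_volume_ratio ** (1.0 / 3.0)
    return 1.0 / pore_size_ratio ** 2  # tau_p ~ 1 / xi**2

volume_ratio = 0.6       # ~40% volume loss after hyperosmotic shock
observed = 1.4 / 0.4     # ratio of the median time constants quoted above

for fluid_fraction in (0.75, 0.65):
    phi = (1.0 - fluid_fraction) / fluid_fraction  # solid-to-fluid volume ratio
    predicted = timescale_fold_change(volume_ratio, phi)
    print(f"fluid fraction {fluid_fraction:.2f}: predicted {predicted:.2f}-fold, "
          f"observed {observed:.1f}-fold")
```

      With these assumptions the predicted increase is roughly 1.7 to 1.9 fold, about half the observed change in the median time constant, which illustrates why the homogeneous porous picture gives intuition but not a full quantitative account.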

      However, the reality is more complex and the hydraulic pore size is distinct from the entanglement length of the cytoskeleton mesh, as we discussed in a previous publication [4]. When the fluid fraction becomes sufficiently small, macromolecular crowding will impact diffusion further and non-linearities will arise. We have now added some of these considerations to the discussion.

      If the authors' hypothesis is correct, an essential physiological parameter for the cytoplasm could be the permeability k and how it is modulated by perturbations, such as volume loss or gain. Have you explored whether the data supports the expected square dependency of permeability on hydraulic pore size, as predicted by simple homogeneity assumptions?

      We thank the reviewer for this comment. As discussed above, we have explored such considerations in a previous publication (see discussion in [4]). Briefly, we find that the entanglement length of the F-actin cytoskeleton does play a role in controlling the hydraulic pore size but is distinct from it. Membrane bounded organelles could also contribute to setting the pore size. In our previous publication, we derived a scaling relationship that indicates that four different length-scales contribute to setting cellular rheology: the average filament bundle length, the size distribution of particles in the cytosol, the entanglement length of the cytoskeleton, and the hydraulic pore size. Many of these length-scales can be dynamically controlled by the cell, which gives rise to complex rheology. We have now added these considerations to our discussion.

      Additionally, do you think that the observed decrease in k in mitotic cells compared to interphase cells is significant? I would have expected the opposite naively as mitotic cells tend to swell by 10-20 percent due to the mitotic overshoot at mitotic entry (see Son Journal of Cell Biology 2015 or Zlotek Journal of Cell Biology 2015).

      We thank the reviewer for this interesting question. Based on the same scaling arguments as above, we would expect that a 10-20% increase in cell volume would give rise to a 10-20% increase in the diffusion constant. However, we also note that metaphase leads to a dramatic reorganisation of the cell interior and in particular of membrane-bounded organelles. In summary, we do not know why such a decrease could take place. We now highlight this as an interesting question for further research.

      Based on your results, can you estimate the pore size of the poroelastic cytoplasmic matrix? Is this estimate realistic? I wonder whether this pore size might define a threshold above which the diffusion of freely diffusing species is significantly reduced. Is your estimate consistent with nanobead diffusion experiments reported in the literature? Do you have any insights into the polymer structures that define this pore size? For example, have you investigated whether depolymerizing actin or other cytoskeletal components significantly alters the relaxation timescale?

      We thank the reviewer for this comment. We cannot directly estimate the hydraulic pore size from the measurements performed in the manuscript. Indeed, while we understand the general scaling laws, the prefactors of such relationships are unknown.

      We carried out experiments aiming at estimating the hydraulic pore size in previous publications [3,4] and others have shown spatial heterogeneity of the cytoplasmic pore size [5]. In our previous experiments, we examined the diffusion of PEGylated quantum dots (14nm in hydrodynamic radius). In isosmotic conditions, these diffused freely through the cell but when the cell volume was decreased by a hyperosmotic shock, they no longer moved [3,4]. This gave an estimate of the pore radius of ~15nm.

      Previous work has suggested that F-actin plays a role in dictating this pore size but microtubules and intermediate filaments do not [4].

      There are no quantifications in Figure 6, nor is there a direct comparison with the model. Based on your model, would you expect the velocity of bleb growth to vary depending on the distance of the bleb from the pipette due to the local depressurization? Specifically, do blebs closer to the pipette grow more slowly?

      We apologise for the oversight. The quantifications are presented in Fig S10 and Fig S12. We have now modified the figure legends accordingly.

      Blebs are very heterogeneous in size and growth velocity within a cell and across cells in the population in normal conditions [6]. Other work has shown that bleb size is controlled by a competition between pressure driving growth and actin polymerisation arresting it [7]. Therefore, we did not attempt to determine the impact of depressurisation on bleb growth velocity or size.

      In experiments in which we suddenly increased pressure in blebbing cells, we did notice a change in the rate of growth of blebs that occurred after we increased pressure (Author response image 1). However, the experiments are technically challenging and we decided not to perform more.

      Author response image 1.

      A. A hydraulic link is established between a blebbing cell and a pipette. At time t>0, a step increase in pressure is applied. B. Kymograph of bleb growth in a control cell (top) and in a cell subjected to a pressure increase at t=0s (bottom). Top: In control blebs, the rate of growth is slow and approximately constant over time. The black arrow shows the start of blebbing. Bottom: The black arrow shows the start of blebbing. The dashed line shows the timing of pressure application and the red arrow shows the increase in growth rate of the bleb when the pressure increase reaches the bleb. This occurs with a delay δt.

      I find it interesting that during depressurization of the interphase cells, there is no observed volume change, whereas in pressurization of metaphase cells, there is a volume increase. I assume this might be a matter of timescale, as the microinjection experiments occur on short timescales, not allowing sufficient time for water to escape the cell. Do you observe the radius of the metaphase cells decreasing later on? This relaxation could potentially be used to characterize the permeability of the cell surface.

      We thank the reviewer for this comment.

      First, we would like to clarify that both metaphase and interphase cells increase their volume in response to microinjection. The effect is easier to quantify in metaphase cells because we assume spherical symmetry and just monitor the evolution of the radius (Fig 3). However, the displacement of the beads in interphase cells (Fig 2) clearly shows that the cell volume increases in response to microinjection. For both interphase and metaphase cells, when the injection is prolonged, the membrane eventually detaches from the cortex and large blebs form until cell lysis. In contrast to the reviewer’s intuition, we never observe a relaxation in cell volume, probably because we inject fluid faster than the cell can compensate volume change through regulatory mechanisms involving ion channels.

      When we depressurise metaphase cells, we do not observe any change in volume (Fig S10). This contrasts with the increase that we observe upon pressurisation. The main difference between these two experiments is the pressure differential. During depressurisation experiments, this is the hydraulic pressure within the cell ~500Pa (Fig 6A); whereas during pressurisation experiments, this is the pressure in the micropipette, ranging from 1.4-10 kPa (Fig 3). We note in particular that, when we used the lowest pressures in our experiments, the increase in volume was very slow (see Fig 3C). Therefore, we agree with the reviewer that it is likely the magnitude of the pressure differential that explains these differences.

      I am curious about the saturation of the time lag at 30 microns from the pipette in Figure 4, Panel E for the model's prediction. A saturation which is not clearly observed in the experimental data. Could you comment on the origin of this saturation and the observed discrepancy with the experiments (Figure E panel 2)? Naively, I would have expected the time lag to scale quadratically with the distance from the pipette, as predicted by a poroelastic model and the diffusion of displacement. It seems weird to me that the beads start to move together at some distance from the pipette or else I would expect that they just stop moving. What model parameters influence this saturation? Does membrane permeability contribute to this saturation?

      We thank the reviewer for pointing this out. In our opinion, the saturation occurring at 30 microns arises from the geometry of the model. At the largest distance away from the micropipette, the cortex becomes dominant in the mechanical response of the cell because it represents an increasing proportion of the cellular material.

      To test this hypothesis, we will rerun our finite element models with a range of cell sizes. This will be added to the manuscript at a later date.
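
      The reviewer's expectation that the time lag grows quadratically with distance follows from treating pressure propagation as a diffusive process. The scaling can be sketched as below. This is a minimal illustration only: it assumes a homogeneous medium, omits any prefactor, and uses a hypothetical diffusion constant (`D_P_ASSUMED`) that is not taken from the manuscript. It does not include the cortex or the cell geometry, which the response above suggests are the origin of the saturation.

```python
def diffusive_time_lag(distance_um: float, diffusion_um2_per_s: float) -> float:
    """Order-of-magnitude time lag t ~ x**2 / D for a diffusive front.

    distance_um: distance from the pipette, in micrometres.
    diffusion_um2_per_s: assumed poroelastic diffusion constant, in um^2/s.
    """
    if diffusion_um2_per_s <= 0:
        raise ValueError("diffusion constant must be positive")
    return distance_um ** 2 / diffusion_um2_per_s


# Hypothetical value chosen only to make the example concrete.
D_P_ASSUMED = 40.0  # um^2/s

for x in (10.0, 20.0, 30.0):
    lag = diffusive_time_lag(x, D_P_ASSUMED)
    print(f"x = {x:4.1f} um -> lag ~ {lag:.2f} s")
```

      Under these assumptions, doubling the distance quadruples the lag, so a plateau in the lag at large distances cannot arise from this simple scaling alone.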

      Reviewer #3 (Public review):

      Weaknesses: I have two broad critical comments:

      (1) I sense that the authors are correct that the best explanation of their results is the passive poroelastic model. Yet, to be thorough, they have to try to explain the experiments with other models and show why their explanation is parsimonious. For example, one potential explanation could be some mechanosensitive mechanism that does not involve cytoplasmic flow; another could be viscoelastic cytoskeletal mesh, again not involving poroelasticity. I can imagine more possibilities. Basically, be more thorough in the critical evaluation of your results. Besides, discuss the potential effect of significant heterogeneity of the cell.

      We thank the reviewer for these comments and we agree with their general premise.

      Some observations could qualitatively be explained in other ways. For example, if we considered the cell as a viscoelastic material, we could define a time constant τ = η/E, with η the viscosity and E the elasticity of the material. The increase in relaxation time with sucrose treatment could then be explained by an increase in viscosity. However, work by others has previously shown that, in the exact same conditions as our experiment, viscoelasticity cannot account for the observations [1]. In its discussion, this study proposed poroelasticity as an alternative mechanism but did not investigate that possibility. This was consistent with our work that showed that the cytoplasm behaves as a poroelastic material and not as a viscoelastic material [4]. Therefore, we decided not to consider viscoelasticity as a possibility. We now explain this reasoning better and have added a sentence about a potential role for mechanotransductory processes in the discussion.

      (2) The study is rich in biophysics but a bit light on chemical/genetic perturbations. It could be good to use low levels of chemical inhibitors for, for example, Arp2/3, PI3K, myosin etc, and see the effect and try to interpret it. Another interesting question - how adhesive strength affects the results. A different interesting avenue - one can perturb aquaporins. Etc. At least one perturbation experiment would be good.

      We agree with the reviewer. In our previous studies, we already examined what biological structures affect the poroelastic properties of cells [2,4]. Therefore, the most interesting aspect to examine in our current work would be perturbations to the phenomenon described in Fig 6G and, in particular, to investigate what volume regulation mechanisms enable sustained intracellular pressure gradients. However, these experiments are particularly challenging and with very low throughput. Therefore, we feel that these are out of the scope of the present report and we mention these as promising future directions.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Please add more information to Materials and methods and figure captions to more clearly share how many different cells and trials the data are coming from.

      This has been done.

      Please add the full equation for displacement vs. time for the poroelastic model and describe appropriately.

      This cannot be done but we explain why.

      Overall, the clarity of the writing in the manuscript could be improved.

      This has been done.

      Please increase text size in some of the figures.

      This has been done.

      Reviewer #2 (Recommendations for the authors):

      Figure 1 would benefit from some revisions for clarity. In Panel D, for the control experiment with 7 cells, why are only 3 data points shown?

      This was due to the use of Excel for generating the box plot. Some data points overlap. We have now used different software.

      In Panel E, there is no legend explaining the red dots in the whisker plots.

      This has now been added.

      Additionally, the inset in Panel D lacks a legend, and it is unclear how k was computed.

      This inset panel has been removed.

      Moreover, I find Figure 1, Panel C somewhat pixelated, which makes it challenging to interpret. As I am colorblind, I need to zoom in significantly to distinguish the colors, and the current resolution makes this difficult. Improving the image resolution would be helpful.

      Apologies for this. We have now verified the quality of images on our submission.  

      I am unsure about the method used to compute the relaxation timescale in Figure S2. If an exponential relaxation is assumed, I would expect a function of the form:

      which implies that for t=t1+tau_p, the result should be d1+0.6*Delta d which does not correspond to the formula given. Have you tried fitting the data with an exponential function or using the model to extract tau_p without assuming a specific functional form?

      We thank the reviewer for pointing this out. We have now added further explanation of the fitting to the figure legend.
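
      For reference, a single-exponential relaxation of the kind the reviewer describes can be sketched as follows. The functional form and the names `d1`, `delta_d`, `t1` and `tau_p` are reconstructed from the reviewer's wording and are assumptions. They are not the formula used in Figure S2.

```python
import math


def exponential_relaxation(t: float, t1: float, d1: float,
                           delta_d: float, tau_p: float) -> float:
    """Assumed form: d(t) = d1 + delta_d * (1 - exp(-(t - t1) / tau_p)) for t >= t1."""
    if tau_p <= 0:
        raise ValueError("tau_p must be positive")
    if t < t1:
        return d1  # no displacement before the step is applied
    return d1 + delta_d * (1.0 - math.exp(-(t - t1) / tau_p))


# Fraction of the total displacement reached one time constant after t1.
fraction_at_one_tau = 1.0 - math.exp(-1.0)
print(f"fraction of delta_d reached at t1 + tau_p: {fraction_at_one_tau:.3f}")
```

      With this form, the displacement at t = t1 + tau_p is d1 + (1 - 1/e) * delta_d, that is approximately d1 + 0.63 * delta_d, which matches the "0.6*Delta d" quoted by the reviewer.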

      References:

      (1) Rosenbluth, M. J., Crow, A., Shaevitz, J. W. & Fletcher, D. A. Slow stress propagation in adherent cells. Biophys J 95, 6052-6059 (2008). https://doi.org/10.1529/biophysj.108.139139

      (2) Esteki, M. H. et al. Poroelastic osmoregulation of living cell volume. iScience 24, 103482 (2021). https://doi.org/10.1016/j.isci.2021.103482

      (3) Charras, G. T., Mitchison, T. J. & Mahadevan, L. Animal cell hydraulics. J Cell Sci 122, 3233-3241 (2009). https://doi.org/10.1242/jcs.049262

      (4) Moeendarbary, E. et al. The cytoplasm of living cells behaves as a poroelastic material. Nat Mater 12, 253-261 (2013). https://doi.org/10.1038/nmat3517

      (5) Luby-Phelps, K., Castle, P. E., Taylor, D. L. & Lanni, F. Hindered diffusion of inert tracer particles in the cytoplasm of mouse 3T3 cells. Proc Natl Acad Sci U S A 84, 4910-4913 (1987). https://doi.org/10.1073/pnas.84.14.4910

      (6) Charras, G. T., Coughlin, M., Mitchison, T. J. & Mahadevan, L. Life and times of a cellular bleb. Biophys J 94, 1836-1853 (2008). https://doi.org/10.1529/biophysj.107.113605

      (7) Tinevez, J. Y. et al. Role of cortical tension in bleb growth. Proc Natl Acad Sci U S A 106, 18581-18586 (2009). https://doi.org/10.1073/pnas.0903353106

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      The authors observed a decline in autophagy and proteasome activity in the context of Milton knockdown. Through proteomic analysis, they identified an increase in the protein levels of eIF2β, subsequently pinpointing a novel interaction within eIF subunits where eIF2β contributes to the reduction of eIF2α phosphorylation levels. Furthermore, they demonstrated that overexpression of eIF2β suppresses autophagy and leads to diminished motor function. It was also shown that in a heterozygous mutant background of eIF2β, Milton knockdown could be rescued. This work represents a novel and significant contribution to the field, revealing for the first time that the loss of mitochondria from axons can lead to impaired autophagy function via eIF2β, potentially influencing the acceleration of aging.

      Thank you so much for your review and comments.

      Reviewer #2 (Public Review):

      In the manuscript, the authors aimed to elucidate the molecular mechanism that explains neurodegeneration caused by the depletion of axonal mitochondria. In Drosophila, starting with siRNA depletion of Milton and Miro, the authors attempted to demonstrate that the depletion of axonal mitochondria induces the defect in autophagy. From proteome analyses, the authors hypothesized that autophagy is impacted by the abundance of eIF2β and the phosphorylation of eIF2α. The authors followed up the proteome analyses by testing the effects of eIF2β overexpression and depletion on autophagy. With the results from those experiments, the authors proposed a novel role of eIF2β in proteostasis that underlies neurodegeneration derived from the depletion of axonal mitochondria.

      The manuscript has several weaknesses. The reader should take extra care while reading this manuscript and when acknowledging the findings and the model in this manuscript.

      The defect in autophagy by the depletion of axonal mitochondria is one of the main claims in the paper. The authors should work more on describing their results of LC3-II/LC3-I ratio, as there are multiple ways to interpret the LC3 blotting for the autophagy assessment. Lysosomal defects result in the accumulation of LC3-II thus the LC3-II/LC3-I ratio gets higher. On the other hand, the defect in the early steps of autophagosome formation could result in a lower LC3-II/LC3-I ratio. From the results of the actual blotting, the LC3-I abundance is the source of the major difference for all conditions (Milton RNAi and eIF2β overexpression and depletion).

      Thank you so much for your review and comments. As the reviewer pointed out, LC3-II/LC3-I ratio changes do not necessarily indicate autophagy defects. However, since p62 accumulation was also observed (Figure 2B, 2E, 3E, Figure 8C, Figure 9C), these results collectively suggest that autophagy is lowered.

      As the reviewer pointed out and as we described in v2, milton knockdown, eIF2β overexpression and heterozygosity increase LC3-I abundance. We do not know at this moment how these conditions increase LC3-I. We will investigate the cause of the increase in LC3-I by milton knockdown and how it contributes to impaired autophagy. We added this discussion as:

      Lines 388-393; ‘Our results also suggest that milton knockdown and overexpression of eIF2β affect autophagy via increased LC3-I abundance (Figures 2 and 7), suggesting an unconventional mechanism of autophagy suppression. To our knowledge, the roles of eIF2β in aging and autophagy independent of ISR have not been reported. Our results revealed a novel function of eIF2β to maintain proteostasis during aging, while further investigation is required to elucidate underlying mechanisms.’

      Another main point of the paper is the up-regulation of eIF2β by depleting the axonal mitochondria leads to the proteostasis crisis. This claim is formed by the findings from the proteome analyses. The authors should have presented their proteomic data with a much more thorough presentation and explanation. As in the experiment scheme shown in Figure 4A, the author did two proteome analyses: one from the 7-day-old sample and the other from the 21-day-old sample. The manuscript only shows a plot of the result from the 7-day-old sample, but not that of the result from the 21-day-old sample. For the 21-day-old sample, the authors only provided data in the supplemental table, in which the abundance ratio of eIF2β from the 21-day-old sample is 0.753, meaning eIF2β is depleted in the 21-day-old sample. The authors should have explained the impact of the eIF2β depletion in the 21-day-old sample, so the reader could fully understand the authors' interpretation of the role of eIF2β on proteostasis.

      Thank you for pointing it out. Plots of the 21-day-old proteome results were included in the main figure (Figure 4C) in v2. In this revision, we further analyzed age-dependent changes of eIF2β levels by western blotting (Figure 4G). We found that eIF2β levels increased during aging until 49-day-old and then decreased at 63-day-old (Figure 4G in the revised manuscript). At a young age (7-day-old), eIF2β levels were higher in milton knockdown brains than in the control, whereas at 21-day-old they were lower than in the control. These results suggest that milton knockdown accelerates age-dependent changes in eIF2β. We added these results and discussion to the revised manuscript.

      Lines 240-243: ‘We also investigated age-dependent changes in eIF2β by western blotting of control flies at 7-, 21-, 35-, and 49-, and 63-day-old. eIF2β levels increased during aging until 49-day-old (Figure 4G). These results suggest that upregulation of eIF2β in milton knockdown fly brain reflects early an onset of age-dependent increase of eIF2β levels.’

      Lines 363-368: ‘We also found that eIF2β protein levels increase in an age-dependent manner until 49-day-old and reduces after that (Figure 4G). In the brains with neuronal knockdown of milton, eIF2β levels were higher at 7-day-old than those in control and lower at the 21-day-old (Figure 4D and Supplementary table). These results suggest that milton knockdown is likely accelerating age-dependent changes rather than increasing their magnitude.’

      Our new data indicate that eIF2β levels increase during aging in control flies until 49-day-old, then reduce at 63-day-old (included as Figure 4G in the revised manuscript). These age-dependent changes might explain the reduction in eIF2β levels in Milton knockdown compared to the control in middle age: higher eIF2β levels in milton knockdown flies at a young age than control and lower eIF2β levels in the middle-aged flies may reflect premature aging.

      We included these sentences in the discussion section:

      Lines 240-243:‘We also investigated age-dependent changes in eIF2β by western blotting of control flies at 7-, 21-, 35-, and 49-, and 63-day-old. eIF2β levels increased during aging until 49-day-old (Figure 4G). These results suggest that upregulation of eIF2β in milton knockdown fly brain reflects early an onset of age-dependent increase of eIF2β levels.’

      Lines 359-371: ‘Our results suggest that the loss of axonal mitochondria is an event upstream of proteostasis collapse during aging. The number of puncta of ubiquitinated proteins was higher in milton knockdown at 14-day-old, but there was no significant difference at 30-day-old (Figure 1). Proteome analyses also showed that age-related pathways, such as immune responses, are enhanced in young flies with milton knockdown (Table 2). We also found that eIF2β protein levels increase in an age-dependent manner until 49-day-old and reduces after that (Figure 4G). In the brains with neuronal knockdown of milton, eIF2β levels were higher at 7-day-old than those in control and lower at the 21-day-old (Figure 4D and Supplementary table). These results suggest that milton knockdown is likely accelerating age-dependent changes rather than increasing their magnitude. Disruption of proteostasis is expected to contribute neurodegeneration38 , and it would be interesting to analyze the sequence of protein accumulation and axonal degeneration in milton knockdown (24,29 and Figure 1) in detail with higher time resolution.’


      With our new data, we revised some of our responses to the first round of reviewer’s comments.

      Reviewer #1 (Public Review):

      The authors observed a decline in autophagy and proteasome activity in the context of Milton knockdown. Through proteomic analysis, they identified an increase in the protein levels of eIF2β, subsequently pinpointing a novel interaction within eIF subunits where eIF2β contributes to the reduction of eIF2α phosphorylation levels. Furthermore, they demonstrated that overexpression of eIF2β suppresses autophagy and leads to diminished motor function. It was also shown that in a heterozygous mutant background of eIF2β, Milton knockdown could be rescued. This work represents a novel and significant contribution to the field, revealing for the first time that the loss of mitochondria from axons can lead to impaired autophagy function via eIF2β, potentially influencing the acceleration of aging. To further support the authors' claims, several improvements are necessary, particularly in the methods of quantification and the points that should be demonstrated quantitatively. It is crucial to investigate the correlation between aging and the proteins eIF2β and eIF2α.

      Thank you so much for your review and comments. We included analyses of protein levels of eIF2α, eIF2β, and eIF2γ at 7 days and 21 days (Figure 4D). The manuscript was revised as below;

      Lines 246-249 ‘As for the other subunits of eIF2 complex, proteome analysis did not detect a significant difference in the protein levels of eIF2α and eIF2γ between milton knockdown and control flies at 7 and 21 days (Figure 4D).’

      NEW TEXT: We analyzed age-dependent changes of eIF2β levels in more detail by western blotting (Figure 4G). We found that eIF2β levels increased during aging until 49-day-old and then decreased at 63-day-old (Figure 4G in the revised manuscript). At a young age (7-day-old), eIF2β levels were higher in milton knockdown brains than in the control, whereas at 21-day-old they were lower than in the control. These results suggest that Milton knockdown accelerates age-dependent changes in eIF2β. We added these results and discussion to the revised manuscript.

      NEW TEXT: Lines 240-243: ‘We also investigated age-dependent changes in eIF2β by western blotting of control flies at 7-, 21-, 35-, and 49-, and 63-day-old. eIF2β levels increased during aging until 49-day-old (Figure 4G). These results suggest that upregulation of eIF2β in milton knockdown fly brain reflects early an onset of age-dependent increase of eIF2β levels.’

      NEW TEXT: Lines 363-368: ‘We also found that eIF2β protein levels increase in an age-dependent manner until 49-day-old and reduces after that (Figure 4G). In the brains with neuronal knockdown of milton, eIF2β levels were higher at 7-day-old than those in control and lower at the 21-day-old (Figure 4D and Supplementary table). These results suggest that milton knockdown is likely accelerating age-dependent changes rather than increasing their magnitude.’

      Reviewer #2 (Public Review):

      In the manuscript, the authors aimed to elucidate the molecular mechanism that explains neurodegeneration caused by the depletion of axonal mitochondria. In Drosophila, starting with siRNA depletion of Milton and Miro, the authors attempted to demonstrate that the depletion of axonal mitochondria induces the defect in autophagy. From proteome analyses, the authors hypothesized that autophagy is impacted by the abundance of eIF2β and the phosphorylation of eIF2α. The authors followed up the proteome analyses by testing the effects of eIF2β overexpression and depletion on autophagy. With the results from those experiments, the authors proposed a novel role of eIF2β in proteostasis that underlies neurodegeneration derived from the depletion of axonal mitochondria.

      The manuscript has several weaknesses. The reader should take extra care while reading this manuscript and when acknowledging the findings and the model in this manuscript.

      The defect in autophagy by the depletion of axonal mitochondria is one of the main claims in the paper. The authors should work more on describing their results of LC3-II/LC3-I ratio, as there are multiple ways to interpret the LC3 blotting for the autophagy assessment. Lysosomal defects result in the accumulation of LC3-II thus the LC3-II/LC3-I ratio gets higher. On the other hand, the defect in the early steps of autophagosome formation could result in a lower LC3-II/LC3-I ratio. From the results of the actual blotting, the LC3-I abundance is the source of the major difference for all conditions (Milton RNAi and eIF2β overexpression and depletion). In the text, the authors simply state the observation of their LC3 blotting. The manuscript lacks an explanation of how to evaluate the LC3-II/LC3-I ratio. Also, the manuscript lacks an elaboration on what the results of the LC3 blotting indicate about the state of autophagy by the depletion of axonal mitochondria.

      Thank you for pointing it out, and we apologize for an insufficient description of the result. We included quantitation of the levels of LC3-I and LC3-II in Figures 2A, 2D, 3D, 7B (Figure 6B in the previous version), and 8B (Figure 7B in the previous version). As the reviewer pointed out, LC3-II/LC3-I ratio changes do not necessarily indicate autophagy defects. However, since p62 accumulation was also observed (Figure 2B, 2E, 3E, 7C (Figure 6C in the previous version), 8C (Figure 7C in the previous version)), these results collectively suggest that autophagy is lowered. We revised the manuscript to include this discussion as below:

      Lines 174-186 ‘During autophagy progression, LC3 is conjugated with phosphatidylethanolamine to form LC3-II, which localizes to isolation membranes and autophagosomes. LC3-I accumulation occurs when autophagosome formation is impaired, and LC3-II accumulation is associated with lysosomal defects31,32. p62 is an autophagy substrate, and its accumulation suggests autophagic defects31,32. We found that milton knockdown increased LC3-I, and the LC3-II/LC3-I ratio was lower in milton knockdown flies than in control flies at 14-day-old (Figure 2A). We also analyzed p62 levels in head lysates sequentially extracted using detergents with different stringencies (1% Triton X-100 and 2% SDS). Western blotting revealed that p62 levels were increased in the brains of 14-day-old of milton knockdown flies (Figure 2B). The increase in the p62 level was significant in the Triton X-100-soluble fraction but not in the SDS-soluble fraction (Figure 2B), suggesting that depletion of axonal mitochondria impairs the degradation of less-aggregated proteins.’

      Line 189-190: 'At 30 day-old, LC3-I was still higher, and the LC3-II/LC3-I ratio was lower, in milton knockdown compared to the control (Figure 2D).’

      Line 202-203: ‘However, in contrast with milton knockdown, Pfk knockdown did not affect the levels of LC3-I, LC3-II or the LC3-II/LC3-I ratio (Figure 3D).’

      Line 279-285: ‘Neuronal overexpression of eIF2β increased LC3-II, while the LC3-II/LC3-I ratio was not significantly different (Figure 7A and B). Overexpression of eIF2β significantly increased the p62 level in the Triton X-100-soluble fraction (Figure 7C, 4-fold vs. control, p <0.005 (1% Triton X-100)) but not in the SDS-soluble fraction (Figure 7C, 2-fold vs. control, p = 0.062 (2% SDS)), as observed in brains of milton knockdown flies (Figure 2B). These data suggest that neuronal overexpression of eIF2β accumulates autophagic substrates.’

      Line 311-319: ‘Neuronal knockdown of milton causes accumulation of autophagic substrate p62 in the Triton X-100-soluble fraction (Figure 2B), and we tested if lowering eIF2β ameliorates it. We found that eIF2β heterozygosity caused a mild increase in LC3-I levels and decreases in LC3-II levels, resulting in a significantly lower LC3-II/LC3-I ratio in milton knockdown flies (Figure 8B). eIF2β heterozygosity decreased the p62 level in the Triton X-100-soluble fraction in the brains of milton knockdown flies (Figure 8C). The p62 level in the SDS-soluble fraction, which is not sensitive to milton knockdown (Figure 2B), was not affected (Figure 8C). These results suggest that suppression of eIF2β ameliorates the impairment of autophagy caused by milton knockdown.’

      Another main point of the paper is the up-regulation of eIF2β by depleting the axonal mitochondria leads to the proteostasis crisis. This claim is formed by the findings from the proteome analyses. The authors should have presented their proteomic data with a much more thorough presentation and explanation. As in the experiment scheme shown in Figure 4A, the author did two proteome analyses: one from the 7-day-old sample and the other from the 21-day-old sample. The manuscript only shows a plot of the result from the 7-day-old sample, but not that of the result from the 21-day-old sample. For the 21-day-old sample, the authors only provided data in the supplemental table, in which the abundance ratio of eIF2β from the 21-day-old sample is 0.753, meaning eIF2β is depleted in the 21-day-old sample. The authors should have explained the impact of the eIF2β depletion in the 21-day-old sample, so the reader could fully understand the authors' interpretation of the role of eIF2β on proteostasis.

      NEW TEXT: Thank you for pointing it out. We included plots of the 21-day-old proteome results as a part of the main figure (Figure 4C). As the reviewer pointed out, eIF2β protein levels are lower in milton knockdown background at the 21-day-old compared to the control. Since a reduction in the eIF2β ameliorated milton knockdown-induced locomotor defects in aged flies (Figure 7D), the reduction in eIF2β observed in the 21-day-old milton knockdown flies is not likely to negatively contribute to milton knockdown-induced defects. Our new data indicate that eIF2β levels increase during aging in control flies until 49-day-old, then reduce at 63-day-old (included as Figure 4G in the revised manuscript). These age-dependent changes might explain the reduction in eIF2β levels in Milton knockdown compared to the control in middle age: higher eIF2β levels in milton knockdown flies at a young age than control and lower eIF2β levels in the middle-aged flies may reflect premature aging.

      NEW TEXT: We included these sentences in the discussion section:

      NEW TEXT: Lines 240-243:‘We also investigated age-dependent changes in eIF2β by western blotting of control flies at 7-, 21-, 35-, and 49-, and 63-day-old. eIF2β levels increased during aging until 49-day-old (Figure 4G). These results suggest that upregulation of eIF2β in milton knockdown fly brain reflects early an onset of age-dependent increase of eIF2β levels.’

      NEW TEXT: Lines 359-371: ‘Our results suggest that the loss of axonal mitochondria is an event upstream of proteostasis collapse during aging. The number of puncta of ubiquitinated proteins was higher in milton knockdown at 14-day-old, but there was no significant difference at 30-day-old (Figure 1). Proteome analyses also showed that age-related pathways, such as immune responses, are enhanced in young flies with milton knockdown (Table 2). We also found that eIF2β protein levels increase in an age-dependent manner until 49-day-old and reduces after that (Figure 4G). In the brains with neuronal knockdown of milton, eIF2β levels were higher at 7-day-old than those in control and lower at the 21-day-old (Figure 4D and Supplementary table). These results suggest that milton knockdown is likely accelerating age-dependent changes rather than increasing their magnitude. Disruption of proteostasis is expected to contribute neurodegeneration38 , and it would be interesting to analyze the sequence of protein accumulation and axonal degeneration in milton knockdown (24,29 and Figure 1) in detail with higher time resolution.’

      The manuscript consists of several weaknesses in its data and explanation regarding translation.

      (1) The authors are likely misunderstanding the effect of phosphorylation of eIF2α on translation. The P-eIF2α is inhibitory for translation initiation. However, the authors seem to be mistaken that the down-regulation of P-eIF2α inhibits translation.

      We are sorry for our insufficient explanation in the previous version. As the reviewer pointed out, it is well known that the phosphorylated form of eIF2α inhibits translation initiation. Neuronal knockdown of milton caused a reduction in p-eIF2α (Figure 5D and E (Figure 4J and K in the previous version)), and it also lowered translation (Figure 6 (Figure 5 in the previous version)); the relationship between these two events is currently unclear. We do not think that a reduction in p-eIF2α suppressed translation; rather, we propose that an imbalance in the expression levels of the components of the eIF2 complex negatively affects translation. We revised the discussion section to describe our interpretation in more detail as below:

      Line 374-384: ‘eIF2β is a component of eIF2, which meditates translational regulation and ISR initiation. When ISR is activated, phosphorylated eIF2α suppresses global translation and induces translation of ATF4, which mediates transcription of autophagy-related genes39,40. Since ISR can positively regulate autophagy, we suspected that suppression of ISR underlies a reduction in autophagic protein degradation. We found neuronal knockdown of milton reduced phosphorylated eIF2α, suggesting that ISR is reduced (Figure 5). However, we also found that global translation was reduced (Figure 6). Increased levels of eIF2β might disrupt the eIF2 complex or alter its functions. The stoichiometric mismatch caused by an imbalance of eIF2 components may inhibit ISR induction. Supporting this model, we found that eIF2β upregulation reduced the levels of p-eIF2α (Figure 7).’

      We have revised the graphical abstract and removed the eIF2 complex since its role in the loss of proteostasis caused by milton knockdown has not been elucidated yet.

      (2) The result of polysome profiling in Figure 4H is implausible. By 10%-25% sucrose density gradient, polysomes are not expected to be observed. The authors should have used a gradient with much denser sucrose, such as 10-50%.

      Thank you for pointing it out. This was a typographical error: the gradient we used was 10-50%, and we apologize for the oversight. It has been corrected (Figure 6 (Figure 5 in the previous version)).

      (3) Also on the polysome profiling, as in the method section, the authors seemed to fractionate ultra-centrifuged samples from top to bottom and then measured A260 by a plate reader. In that case, the authors should have provided a line plot with individual data points, not the smoothly connected ones in the manuscript.

      Thank you for pointing it out. We revised the graph (Figure 6 (Figure 5 in the previous version)).

      (4) For both the results from polysome profiling and puromycin incorporation (Figure 4H and I), the difference between control siRNA and Milton siRNA are subtle, if not nonexistent. This might arise from the lack of spatial resolution in their experiment as the authors used head lysate for these data but the ratio of Phospho-eIF2α/eIF2α only changes in the axons, based on their results in Figure 4E-G. The authors could have attempted to capture the spatial resolution for the axonal translation to see the difference between control siRNA and Milton siRNA.

      Thank you for your comment. We agree that it would be an interesting experiment, but it will take a considerable amount of time to analyze axonal translation with spatial resolution. We will try to include such analyses in the future. For this manuscript, we revised the discussion section to include the reviewer's suggestion as below:

      Lines 355-357: ‘Further analyses to dissect the effects of milton knockdown on proteostasis and translation in the cell body and axon by experiments with spatial resolution would be needed.’

      Recommendations for the authors:

      From the Reviewing Editor:

      As the Reviewing Editor, I have read your manuscript and the associated peer reviews. I have concerns about publishing this work in its current form. I think that your manuscript cannot claim to have found a novel function of eIF2beta because of technical uncertainties and conceptual problems that should be addressed.

      Thank you so much for your review and comments. We addressed all the concerns raised by the reviewers. Point-by-point responses are listed below.

      First, your manuscript is based partly on what appears to be a mistaken understanding of the mechanistic basis of the ISR. Specifically, eIF2 is a heterotrimeric complex of alpha, beta, and gamma subunits. When eIF2a is phosphorylated, the heterotrimer adopts a new conformation. This conformation directly binds and inhibits eIF2B, the decameric GEF that exchanges the GDP bound to the gamma subunit of the eIF2 complex for GTP. Unless I misunderstood your paper, you seem to propose that decreasing levels of phospho-eIF2a will inhibit translation, but this is backward from what we know about the ISR.

      Thank you for your insightful comment, and we are sorry for the confusion. We did not mean to propose that decreasing levels of phospho-eIF2a inhibit translation. We apologize for our insufficient explanation, which might have caused a misunderstanding (Lines 312-318 in the original version). We agree with the reviewer that ‘mismatch due to elevated eIF2-beta could change the behavior of the ISR’. We revised the text in the result section as follows:

      Lines 263-268 (in the Result section) ‘Phosphorylation of eIF2α induces conformational changes in the eIF2 complex and inhibits global translation36. To analyze the effects of milton knockdown on translation, we performed polysome gradient centrifugation to examine the level of ribosome binding to mRNA. Since p-eIF2α was downregulated, we hypothesized that milton knockdown would enhance translation. However, unexpectedly, we found that milton knockdown significantly reduced the level of mRNAs associated with polysomes (Figure 6A and B).’

      Lines 374-384 (in the Discussion section): ‘eIF2β is a component of eIF2, which meditates translational regulation and ISR initiation. When ISR is activated, phosphorylated eIF2α suppresses global translation and induces translation of ATF4, which mediates transcription of autophagy-related genes39,40. Since ISR can positively regulate autophagy, we suspected that suppression of ISR underlies a reduction in autophagic protein degradation. We found neuronal knockdown of milton reduced phosphorylated eIF2α, suggesting that ISR is reduced (Figure 5). However, we also found that global translation was reduced (Figure 6). Increased levels of eIF2β might disrupt the eIF2 complex or alter its functions. The stoichiometric mismatch caused by an imbalance of eIF2 components may inhibit ISR induction. Supporting this model, we found that eIF2β upregulation reduced the levels of p-eIF2α (Figure 7).’

      It may be possible that a stoichiometric mismatch due to elevated eIF2-beta could change the behavior of the ISR, but your paper doesn't adequately address the expression levels of all three eIF2 subunits: alpha, beta, and gamma. The proteomic data shown in Fig 4B is unconvincing on its own because the changes in the beta subunit are subtle. The Western blot in Figure 4C suggests that the KD changes the mass or mobility of the beta subunit, and most importantly, there are no Western blots measuring the levels of eIF2a, eIF2a-phospho, or eIF2-gamma.

      We appreciate the reviewer’s comment and agree that the stoichiometric mismatch due to elevated eIF2β may interfere with ISR. We found that overexpression of eIF2β lowered p-eIF2α (Figure S2 in V1), which supports this model. We included these data in the main figure in the revised manuscript (Figure 7D) and revised the text as below:

      Lines 286-289: ‘Since milton knockdown reduced the p-eIF2α level (Figure 5E), we asked whether an increase in eIF2β affects p-eIF2α. Neuronal overexpression of eIF2β did not affect the eIF2α level but significantly decreased the p-eIF2α level (Figure 7D and E).’

      Expression data for eIF2α and eIF2γ were extracted from the proteome analyses and included as a table (Figure 4D). Western blots of phospho-eIF2α (Figure S1 in V1) are now included in the main figure (Figure 5B). The result section was revised as below:

      Lines 246-249: ‘As for the other subunits of eIF2 complex, proteome analysis did not detect a significant difference in the protein levels of eIF2α and eIF2γ between milton knockdown and control flies at 7 and 21 days (Figure 4D).’

      NEW TEXT: We also analyzed age-dependent changes of eIF2β by western blotting and found that eIF2β increased during aging until 49-day-old. We included this result as Figure 4G and added these sentences in the result section:

      NEW TEXT: Line 240-243: ‘We also investigated age-dependent changes in eIF2β by western blotting of control flies at 7-, 21-, 35-, and 49-, and 63-day-old. eIF2β levels increased during aging until 49-day-old (Figure 4G). These results suggest that upregulation of eIF2β in milton knockdown fly brain reflects early an onset of age-dependent increase of eIF2β levels.’

      Reviewer #1 (Recommendations For The Authors):

      L125-128: In this section, while the efficiency of Milton knockdown is referenced from a previous publication, it is necessary to also mention that the Miro knockdown has been similarly reported in the literature. Additionally, the Methods section lacks details on the Miro RNAi line used, and Table 2 does not include the genotype for Miro RNAi. This information should be included for clarity and completeness.

      Thank you for pointing it out. Knockdown efficiency with this strain has been reported (Iijima-Ando et al., PLoS Genet, 2012). We revised the text to include the citation and knockdown efficiency as follows:

      Lines 136-147: ‘There was no significant increase in ubiquitinated proteins in milton knockdown flies at 1-day old, suggesting that the accumulation of ubiquitinated proteins caused by milton knockdown is age-dependent (Figure S1). We also analyzed the effect of the neuronal knockdown of Miro, a partner of milton, on the accumulation of ubiquitin-positive proteins. Since severe knockdown of Miro in neurons causes lethality, we used UAS-Miro RNAi strain with low knockdown efficiency, whose expression driven by elav-GAL4 caused 30% reduction of Miro mRNA in head extract24. Although there was a tendency for increased ubiquitin-positive puncta in Miro knockdown brains, the difference was not significant (Figure 1B, p>0.05 between control RNAi and Miro RNAi). These data suggest that the depletion of axonal mitochondria induced by milton knockdown leads to the accumulation of ubiquitinated proteins before neurodegeneration occurs.’

      L132-L136: The current phrasing in this section suggests an increase in ubiquitinated proteins for both Milton and Miro knockdowns. However, since there is no significant difference noted for Miro, it is incorrect to state an increase in ubiquitin-positive puncta. Furthermore, combining the results of Milton knockdown to claim an increase in ubiquitinated proteins prior to neurodegeneration is misleading. At the very least, the expression here needs to be moderated to accurately reflect the findings.

      Thank you for pointing it out. We revised the text as above.

      L137-L141: Results in Figure 1 indicate that Milton knockdown leads to an increase in ubiquitinated proteins at 14 days, while Miro knockdown shows no difference from the control at either 14 or 30 days. Conversely, both the control and Miro exhibit an increase in ubiquitinated proteins with aging, but this trend does not seem to apply to Milton knockdown. This observation suggests that Milton KD may not affect the changes in protein quality control associated with aging. It implies that Milton's function might be more related to protein homeostasis in younger cells, or that changes due to aging might overshadow the effects of Milton knockdown. These interpretations should be included in the Results or Discussion sections for a more comprehensive analysis.

      NEW TEXT: Thank you for your insightful comment. As you mentioned, the accumulation of ubiquitinated proteins significantly increases only in young flies. Age-related pathways, such as immune responses, are highlighted in young milton knockdown flies but not in the aged flies. Our new result indicates that eIF2β increases during aging in control flies (included as Figure 4G in the revised manuscript), and upregulation of eIF2β in milton knockdown is only observed at a young age. These results suggest that milton knockdown does not increase the magnitude of age-dependent changes but accelerates their onset. We revised the text to include those points as follows:

      NEW TEXT: Lines 152-153: ‘These results suggest that depletion of axonal mitochondria may have more impact on proteostasis in young neurons than in old neurons.’

      NEW TEXT: Lines 359-371: ‘Our results suggest that the loss of axonal mitochondria is an event upstream of proteostasis collapse during aging. The number of puncta of ubiquitinated proteins was higher in milton knockdown at 14-day-old, but there was no significant difference at 30-day-old (Figure 1). Proteome analyses also showed that age-related pathways, such as immune responses, are enhanced in young flies with milton knockdown (Table 2). We also found that eIF2β protein levels increase in an age-dependent manner until 49-day-old and reduces after that (Figure 4G). In the brains with neuronal knockdown of milton, eIF2β levels were higher at 7-day-old than those in control and lower at the 21-day-old (Figure 4 and Supplementary table). These results suggest that milton knockdown is likely accelerating age-dependent changes rather than increasing their magnitude. Disruption of proteostasis is expected to contribute neurodegeneration38 , and it would be interesting to analyze the sequence of protein accumulation and axonal degeneration in milton knockdown (24,29 and Figure 1) in detail with higher time resolution.’

      L143 : Please remove the erroneously included quotation mark.

      Thank you for pointing it out. We corrected it.

      L145-L147:

      While it is understood that Milton knockdown results in a reduction of mitochondria in axons, as reported previously and seemingly indicated in Figure 1E, this paper repeatedly refers to axonal depletion of mitochondria. Therefore, it would be beneficial to quantitatively assess the number of mitochondria in the axonal terminals located in the lamina via electron microscopy. Such quantification would robustly reinforce the argument that mitochondrial absence in axons is a consequence of Milton knockdown.

      Thank you for pointing it out. We included quantitation of the number of mitochondria in the synaptic terminals (Figure 1E).

      The text and figure legend were revised accordingly:

      Lines 156-157: ‘As previously reported24, the number of mitochondria in presynaptic terminals decreased in milton knockdown (Figure 1E).’

      The knockdown of Milton is known to reduce mitochondrial transport from an early stage, but what about swelling? By observing swelling at 1 day and 14 days, it may be possible to confirm the onset of swelling and discuss its correlation with the accumulation of ubiquitinated proteins.

      Quantitation of axonal swelling has also been included (Figure 1F).

      We appreciate the reviewer's comments on the correlation between the accumulation of ubiquitinated proteins and axonal swelling. Axonal swelling was not observed at 3-days-old (Iijima-Ando et al., PLoS Genetics, 2012), indicating that axonal swelling is an age-dependent event. Dense materials are found in swollen axons more often than in normal axons, suggesting a positive correlation between disruption of proteostasis and axonal damage. It would be interesting to analyze the time course of events further; however, we feel it is beyond the scope of this manuscript. We revised the text to include this discussion as:

      Lines 157-160: ‘The swelling of presynaptic terminals, characterized by the enlargement and roundness, was not reported at 3-day-old24 but observed at this age with about 4% of total presynaptic terminals (Figure 1F, asterisks).’

      Lines 162-167: ‘Dense materials are rarely found in age-matched control neurons, indicating that milton knockdown induces abnormal protein accumulation in the presynaptic terminals (Figure 1G and H). In milton knockdown neurons, dense materials are found in swollen presynaptic terminals more often than in presynaptic terminals without swelling, suggesting a positive correlation between the disruption of proteostasis and axonal damage (Figure 1G).’

      Lines 369-371: ‘Disruption of proteostasis is expected to contribute neurodegeneration38 , and it would be interesting to analyze the sequence of protein accumulation and axonal degeneration in milton knockdown (24,29 and Figure 1) in detail with higher time resolution.’

      L147-L151: Though Figures 1F and 1G provide qualitative representations, it is advisable to quantitatively assess whether dense materials significantly accumulate. Such quantitative analysis would be required to verify the accumulation of dense materials in the context of the study.

      Thank you for pointing it out. We included quantitation of the number of neurons with dense material (Figure 1G). We revised the manuscript as follows:

      Line 162-164: ‘Dense materials are rarely found in age-matched control neurons, indicating that milton knockdown induces abnormal protein accumulation in the presynaptic terminals (Figure 1G and H).’

      Regarding Figure 1B, C:

      Even though the count of puncta in the whole brain appears to be fewer than 400, the magnification of the optic lobe suggests a substantial presence of puncta. Please clarify in the Methods section what constitutes a puncta and whether the quantification in the whole brain is based on a 2D or 3D analysis. Detail the methodology used for quantification.

      Thank you for your comment. We revised the method section to include more details as below:

      Lines 440-443: ‘Quantitative analysis was performed using ImageJ (National Institutes of Health) with maximum projection images derived from Z-stack images acquired with same settings. Puncta was identified with mean intensity and area using ImageJ.’
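
      As an illustration of the kind of quantification described above, here is a minimal sketch in plain Python rather than ImageJ. It collapses a Z-stack by maximum projection and then counts bright connected regions that pass mean-intensity and area cutoffs. The function names, the 4-connectivity rule, and all threshold values are assumptions made for this example; the response does not state the settings actually used.

```python
def max_projection(stack):
    """Collapse a Z-stack (list of 2D lists) into one image of per-pixel maxima."""
    rows, cols = len(stack[0]), len(stack[0][0])
    return [[max(plane[r][c] for plane in stack) for c in range(cols)]
            for r in range(rows)]


def count_puncta(image, pixel_threshold, min_area, min_mean):
    """Count 4-connected regions of pixels >= pixel_threshold whose area and
    mean intensity both reach the given cutoffs (all cutoffs are hypothetical)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] < pixel_threshold:
                continue
            # Flood fill to collect every pixel of this bright region.
            todo, values = [(r, c)], []
            seen[r][c] = True
            while todo:
                y, x = todo.pop()
                values.append(image[y][x])
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] >= pixel_threshold):
                        seen[ny][nx] = True
                        todo.append((ny, nx))
            if len(values) >= min_area and sum(values) / len(values) >= min_mean:
                count += 1
    return count


# Toy two-plane Z-stack: one 2-pixel spot in plane 0, one 1-pixel spot in plane 1.
plane0 = [[0, 0, 0, 0, 0], [0, 9, 9, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
plane1 = [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 8], [0, 0, 0, 0, 0]]
projection = max_projection([plane0, plane1])
print(count_puncta(projection, pixel_threshold=5, min_area=2, min_mean=5))
```

      Because the count is taken on a projection, the analysis in this sketch is two-dimensional: puncta that overlap along the Z axis merge into a single region.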

      What about 1-day-old specimens? Does Milton knockdown already show an increase in ubiquitinated protein accumulation at this early stage? Investigating whether ubiquitin-protein accumulation is involved in aging promotion or is already prevalent during developmental stages is a necessary experiment.

      Thank you for your comment. We carried out immunostaining with an anti-ubiquitin antibody in the brains at 1-day-old. No significant difference was detected between the control and milton knockdown. This result has been included as Figure S1 in the revised manuscript. The result section was revised as below:

      Line 136-139 ‘There was no significant increase in ubiquitinated proteins in milton knockdown flies at 1-day old, suggesting that the accumulation of ubiquitinated proteins caused by milton knockdown is age-dependent (Figure S1).’

      For Figure 1E: In the Electron Microscopy section of the Methods, define how swollen axons were identified and describe the quantification methodology used.

      Thank you for your comment. Swollen axons are, unlike normal axons, round in shape and enlarged. We revised the text as below:

      Lines 157-160: ‘The swelling of presynaptic terminals, characterized by the enlargement and roundness, was not reported at 3-day-old24 but observed at this age with about 4% of total presynaptic terminals (Figure 1F, asterisks).’

      Lines 689-691, Figure 1 legend: ‘Swollen presynaptic terminals (asterisks in (F)), characterized by the enlargement and higher circularity, were found more frequently in milton knockdown neurons.’

      L218-L219: Throughout the text, the expression 'eIF2β is "upregulated" in response to Milton knockdown' is frequently used. However, considering the presented results, it might be more accurate to interpret that under the condition of Milton knockdown, eIF2β is not undergoing degradation but rather remains stable.

      Thank you for pointing it out. We replaced ‘upregulated’ with ‘increased’ throughout the text.

      L234-L235: On what basis is the conclusion drawn that there is a reduction? Given that three experiments have been conducted, it would be possible and more convincing to quantify the results to determine if there is a significant decrease.

      Thank you for pointing it out. We quantified the AUC of polysome fraction and carried out a statistical analysis. There is a significant decrease in polysome in milton knockdown, and this result has been included in Figure 5B. We revised the figure and the legend accordingly.
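
      A minimal sketch of how an area under the curve (AUC) for the polysome region of an A260 profile could be computed, using the trapezoidal rule. The A260 readings, the index at which the polysome fractions are assumed to start, and the normalization to the total profile are all hypothetical choices for illustration; the response does not specify these details.

```python
def trapezoid_auc(x, y):
    """Area under the curve y(x) by the trapezoidal rule."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    return sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0
               for i in range(len(x) - 1))


def polysome_share(a260, first_polysome_index):
    """Polysome AUC divided by total AUC, with fraction numbers as the x axis.

    first_polysome_index is a hypothetical boundary between the monosome
    and polysome fractions; it must be chosen from the actual profile.
    """
    fractions = list(range(len(a260)))
    total = trapezoid_auc(fractions, a260)
    poly = trapezoid_auc(fractions[first_polysome_index:],
                         a260[first_polysome_index:])
    return poly / total


# Hypothetical A260 readings per fraction, top to bottom of the gradient.
control = [0.9, 1.2, 0.8, 0.5, 0.4, 0.4, 0.3, 0.2]
knockdown = [0.9, 1.3, 0.8, 0.4, 0.3, 0.2, 0.2, 0.1]

for name, profile in (("control", control), ("knockdown", knockdown)):
    print(name, round(polysome_share(profile, first_polysome_index=4), 3))
```

      Computing one share per biological replicate gives the per-sample values that a statistical test between control and knockdown would then compare.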

      L236: 5H-> 4H

      Thank you for pointing it out, and we are sorry for the confusion. We corrected it.

      L238-L239: Since there is no significant difference observed, it may not be accurate to interpret a reduction in puromycin incorporation.

      Thank you for pointing it out. As described above, quantification of polysome fractions showed that milton knockdown significantly reduced polysomes (Figure 6B (Figure 5B in the previous version)). We revised the manuscript as below:

      Lines 267-268: ‘However, unexpectedly, we found that milton knockdown significantly reduced the level of mRNAs associated with polysomes (Figure 6A and B).’

      Figure 5D and Figure 6D: Climbing assays have been conducted, but I believe experiments should also be performed to examine whether overexpression or heterozygous mutants of eIF2β induce or suppress degeneration.

      Thank you for pointing it out. We analyzed the eyes with eIF2β overexpression for neurodegeneration. Although there was a tendency of elevated neurodegeneration in the retina with eIF2β overexpression, the difference between control and eIF2β overexpression did not reach statistical significance (Figure S2). This result has been included as Figure S2 in the revised manuscript, and the following sentences have been included in the text:

      Lines 292-297: ‘We asked if eIF2β overexpression causes neurodegeneration, as depletion of axonal mitochondria in the photoreceptor neurons causes axon degeneration in an age-dependent manner24. eIF2β overexpression in photoreceptor neurons tends to increase neurodegeneration in aged flies, while it was not statistically significant (p>0.05, Figure S2).’

      L271-L272: The results in Figure 6B are surprising. I anticipated a greater increase compared to the Milton knockdown alone. While p62 appears to be reduced, it is not clear why these results lead to the conclusion that lowering eIF2β rescues autophagic impairment. Please add a discussion section to address this point.

      Thank you for pointing it out. We apologize for the unclear description of the result. Milton knockdown flies show p62 accumulation (Figure 2), and deleting one copy of eIF2β in the milton knockdown background reduced p62 accumulation (Figure 8C (Figure 7C in the previous version)). We revised the text as below:

      Lines 311-319: ‘Neuronal knockdown of milton causes accumulation of autophagic substrate p62 in the Triton X-100-soluble fraction (Figure 2B), and we tested if lowering eIF2β ameliorates it. We found that eIF2β heterozygosity caused a mild increase in LC3-I levels and decreases in LC3-II levels, resulting in a significantly lower LC3-II/LC3-I ratio in milton knockdown flies (Figure 8B). eIF2β heterozygosity decreased the p62 level in the Triton X-100-soluble fraction in the brains of milton knockdown flies (Figure 8C). The p62 level in the SDS-soluble fraction, which is not sensitive to milton knockdown (Figure 2B), was not affected (Figure 8C). These results suggest that suppression of eIF2β ameliorates the impairment of autophagy caused by milton knockdown.’

      L369: Please specify the source of the anti-ubiquitin antibody used.

      Thank you for pointing it out. We included the antibody information in the method section.

      Figure 7: While the relationship between Milton knockdown and the eIF2β and eIF2α proteins has been elucidated through the authors' efforts, I would like to see an investigation into whether eIF2β is upregulated and eIF2α phosphorylation is reduced in simply aged Drosophila. This would help us understand the correlation between aging and eIF2 protein dynamics.

      Thank you for your comment. We agree that it is an important question, and we are working on it. However, we feel that it is beyond the scope of the current manuscript.

      L645-L646: If the mushroom body is identified using mito-GFP, then include mito-GFP in the genotype listed in Supplementary Table 2.

      We are sorry for the oversight. We corrected it in Supplementary Table 2.

      Additionally, while it is presumed that the mito-GFP signal decreases in axons with Milton RNAi, how was the lobe tips area accurately selected for analysis? Please include these details along with a comprehensive description of the quantification methodology in the Methods section.

      Thank you for your comment. Although the mito-GFP signal in the axon is weak in the milton knockdown neurons, it is sufficient to distinguish the mushroom body structure from the background. We revised the method section to include this information:

      Line 443-447: ‘For eIF2α and p-eIF2α immunostaining, the mushroom body was detected by mitoGFP expression.’

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      I am impressed with the thoroughness with which the authors addressed my concerns. I don't have any further concerns and think that this paper makes an interesting and significant contribution to our understanding of VWM. I would only suggest adding citations to the newly added paragraph where the authors state "It could be argued that preparatory attention relies on the same mechanisms as working memory maintenance." They could cite work by Bettencourt and Xu, 2016; and Sheremata, Somers, and Shomstein (2018).

      We thank the reviewer for the positive feedback. We have now cited the referenced work in the manuscript (Page. 19, Line 371).

      Reviewer #2 (Public review):

      Overall, I think that the authors' revision has addressed most, if not all, of my major concerns noted in my previous comments. The results appear convincing and I do not have additional comments.

      We thank the reviewer for the positive feedback and are pleased that the revision addressed the major concerns.

      Reviewer #3 (Public review):

      (1) The authors addressed most of my previous concerns and provided additional data analysis. They conducted further analyses to demonstrate that the observed changes in network communication are associated with behavioral RTs, supporting the idea that the impulse-driven sensory-like template enhances informational connectivity between sensory and frontoparietal areas, and relates to behavior.

      We are pleased that the revision addressed the major concerns.

      (2) I would like to further clarify my previous points regarding the definition of the two types of templates and the evidence for their coexistence. The authors stated that the sensory-like template likely existed in a latent state and was reactivated by visual pings, proposing that sensory and non-sensory templates coexist. However, it remains unclear whether this reflects a dynamic switch between formats or true coexistence. If the templates are non-sensory in nature, what exactly do they represent? Are they meant to be abstract or conceptual representations, or, put simply, just "top-down attentional information"? If so, why did the generalization analyses (training classifiers on activity during the stimulus selection period and testing on preparatory activity) fail to yield significant results? While the stimulus selection period necessarily encodes both target and distractor information, it should still contain attentional information. I would appreciate more discussion from this perspective.

      We thank the reviewer for the helpful clarification of previous comments. Since we addressed similar comments from Reviewer 2 (Point 2) in the previous round, our response below may appear somewhat repetitive. First, regarding whether our findings reflect a dynamic switch between non-sensory and sensory-like templates, or the ‘coexistence’ of two template formats, we acknowledge that the temporal limitations of fMRI prevent us from directly testing dynamic representations. However, several aspects of our data favor the latter interpretation: (1) our key findings remained consistent in the subset of participants (N=14) who completed both No-Ping and Ping sessions in counterbalanced order. This makes it unlikely that participants systematically switched cognitive strategies (e.g., using non-sensory templates in the No-Ping session versus sensory-like templates in the Ping session) in response to the task-irrelevant, uninformative visual impulse; (2) while we agree that the temporal dynamics between the two templates remain unclear, it is difficult to imagine that orientation-specific templates observed in the Ping session emerged de novo from purely non-sensory templates and an exogenous ping. In other words, if there is no orientation information at all to begin with, how does it come into being from an orientation-less external ping? A more parsimonious explanation is that orientation information was already present in a latent format and was activated by the ping, in line with the models of “activity-silent” working memory. However, since the detailed circuit-level mechanisms underlying such reactivation remain unclear, we acknowledge that this interpretation warrants direct investigation in future studies. This point is discussed in the main text (Page 19-20, Line 389-402).

      Second, while our data cannot definitively determine the nature of the non-sensory template, we consider categorical coding a plausible candidate based on prior visual search studies. For instance, categorical attributes (e.g., left-tilted vs. right-tilted) have been shown to effectively guide attention in orientation search tasks (Wolfe et al., 1992), similar to our paradigm. Further, categorical templates are more tolerant of stimulus variability, making them well-suited to our task, which involved trial-by-trial variations in target orientation around a reference (see Page 21, Lines 427-437, for a more detailed discussion).

      Third, the lack of generalization from stimulus selection to preparatory attention in the Ping session may relate to the limited overlap in shared information between these two periods. Neural activity during stimulus selection encodes sensory information about both orientations, along with sensory-like attentional signals (as indicated by the attention decoding and cross-task generalization from perception task to the stimulus-selection period). In contrast, preparatory activity likely involves a dominant non-sensory template, a latent sensory-like template, and residual sensory effects from the impulse stimulus. The limited overlap in sensory-like attentional signals may therefore be insufficient to support generalization across the two periods.

      Reviewer #2 (Recommendations for the authors)

      I think the central prediction of greater pattern similarity between 'attend leftward' and 'perceived leftward' in the ping session in comparison to the no-ping session (the same also holds for 'attend rightward' and 'perceived rightward') could be directly examined by a two-way ANOVA (session × whether the attended orientation is the same as or different from the perceived orientation) for each ROI (V1 and EVC). A three-way ANOVA might complicate readers' intuitive understanding of the implications of the statistical results.

      We thank the reviewer for the suggestion. Following the reviewer’s suggestion, we defined a new condition label based on orientation consistency between attended and perceived orientations: (1) same orientation: averaging “attend leftward/perceive leftward” and “attend rightward/perceive rightward”; and (2) different orientation: averaging “attend leftward/perceive rightward” and “attend rightward/perceive leftward”. A two-way mixed ANOVA (session × orientation consistency) on Mahalanobis distance revealed a main effect of orientation consistency in V1 (F(1,38) = 4.21, p = 0.047, η<sub>p</sub><sup>2</sup> = 0.100), indicating that activity patterns were more similar when attended and perceived orientations matched. No significant main effect of session was found (p = 0.923). Importantly, a significant interaction was found in V1 (F(1,38) = 5.00, p = 0.031, η<sub>p</sub><sup>2</sup> = 0.116), suggesting that the visual impulse enhanced the similarity between the preparatory attentional template and the perception of the corresponding orientation. In EVC, the same analysis revealed only a main effect of orientation consistency (F(1,38) = 5.87, p = 0.020, η<sub>p</sub><sup>2</sup> = 0.134), with no other significant effects (ps > 0.240). The interaction results were consistent with those reported in the original three-way ANOVA. We have now replaced the previous analysis with the new one in the main text (Page 11-12, Lines 231-242).
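      As a quick consistency check on the statistics quoted above, partial eta squared can be recovered from an F value and its degrees of freedom using the standard identity η<sub>p</sub><sup>2</sup> = (F × df<sub>effect</sub>) / (F × df<sub>effect</sub> + df<sub>error</sub>). The sketch below is illustrative only and is not the authors' analysis code; the function name and the dictionary labels are hypothetical, and the inputs are simply the F values and effect sizes reported in the response.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F value, reported partial eta squared), all with df = (1, 38) as quoted above.
# The labels are hypothetical shorthand for the effects named in the response.
reported = {
    "V1 consistency main effect": (4.21, 0.100),
    "V1 session x consistency interaction": (5.00, 0.116),
    "EVC consistency main effect": (5.87, 0.134),
}

for label, (f_value, reported_eta) in reported.items():
    recomputed = partial_eta_squared(f_value, df_effect=1, df_error=38)
    print(f"{label}: recomputed {recomputed:.3f}, reported {reported_eta:.3f}")
```

      Under these assumptions, the recomputed values agree with the reported effect sizes to within rounding.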

    1. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Yamazaki et al. conducted multiple microscopy-based GFP localization screens, from which they identified proteins that are associated with PM/cell wall damage stress response. Specifically, the authors identified that bud-localized TMD-containing proteins and endocytotic proteins are associated with PM damage stress. The authors further demonstrated that polarized exocytosis and CME are temporally coupled in response to PM damage, and CME is required for polarized exocytosis and the targeting of TMD-containing proteins to the damage site. From these results, the authors proposed a model that CME delivers TMD-containing repair proteins between the bud tip and the damage site.

      Strengths:

      Overall, this is a well-written manuscript, and the experiments are well-conducted. The authors identified many repair proteins and revealed the temporal coordination of different categories of repair proteins. Furthermore, the authors demonstrated that CME is required for targeting of repair proteins to the damage site, as well as cellular survival in response to stress related to PM/cell wall damage. Although the roles of CME and bud-localized proteins in damage repair are not completely new to the field, this work does have conceptual advances by identifying novel repair proteins and proposing the intriguing model that the repairing cargoes are shuttled between the bud tip and the damaged site through coupled exocytosis and endocytosis.

      Weaknesses:

      While the results presented in this manuscript are convincing, they might not be sufficient to support some of the authors' claims. Especially in the last two results sections, the authors claimed CME delivers TMD-containing repair proteins from the bud tip to the damage site. The model is no doubt highly plausible based on the data, but caveats still exist. For example, the repair proteins might not be transported from one location to another, but instead degraded and resynthesized. Although the Gal-induced expression system can further support the model to some extent, I think more direct verification (such as FLIP or photo-convertible fluorescence tags to distinguish between pre-existing and newly synthesized proteins) would significantly improve the strength of evidence.

      Major experiment suggestions:

      (1) The authors may want to provide more direct evidence for "protein shuttling" and for excluding the possibility that proteins at the bud are degraded and synthesized de novo near the damage site. For example, if the authors could use FLIP to bleach bud-localized fluorescent proteins, and the damaged site does not show fluorescent proteins upon laser damage, this will strongly support the authors' model. Alternatively, the authors could use photo-convertible tags (e.g., Dendra) to differentiate between pre-existing repair proteins and newly synthesized proteins.

      (2) In line with point 1, the authors used Gal-inducible expression, which supported their model. However, the authors may need to show protein abundance in galactose, glucose, and upon PM damage. Western blot would be ideal to show the level of full-length proteins, or whole-cell fluorescence quantification can also roughly indicate the protein abundance. Otherwise, we cannot assume that the tagged proteins are only expressed when the cells are growing in galactose-containing media.

      (3) Similarly, for Myo2 and Exo70 localization in CME mutants (Figure 4), it might be worth doing a western or whole-cell fluorescence quantification to exclude the caveat that CME deficiency might affect protein abundance or synthesis.

      (4) From the authors' model in Figure 7, it looks like the repair proteins contribute to bud growth. Does laser damage to the mother cell prevent bud growth due to the reduction of TMD-containing repair proteins at the bud? If the authors could provide evidence for that, it would further support the model.

      (5) Is the PM repair cell-cycle-dependent? For example, would the recruitment of repair proteins to the damage site be impaired when the cells are under alpha-factor arrest?

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The small conductance calcium-activated potassium channel 2 (SK2) is an important drug target for treating neurological and cardiovascular diseases. However, structural information on this subtype of SK channels has been lacking, and it has been difficult to draw conclusions about activator and inhibitor binding and action in the absence of structural information.

      Here the authors set out to (1) determine the structure of the transmembrane regions of a mammalian SK2 channel, (2) determine the binding site of apamin, a historically important SK2 inhibitor whose mode of action is unclear, and (3) use the structural information to generate a novel set of activators/inhibitors that selectively target SK2.

      The authors largely achieved all the proposed goals, and they present their data clearly.

      Unable to solve the structure of the human SK2 due to excessive heterogeneity in its cytoplasmic regions, the authors create a chimeric construct using SK4, whose structure was previously solved, and use it for structural studies. The data reveal a unique extracellular structure formed by the S2-S3 loop, which appears to directly interact with the selectivity filter and modulate its conductivity. Structures of SK2 in the absence and presence of the activating Ca2+ ions both possess non-K+-selective/conductive selectivity filters, where only sites 3 and 4 are preserved. The S6 gates are captured in closed and open states, respectively. Apamin binds to the S2-S3 loop, and unexpectedly, induces a K+ selective/conductive conformation of the selectivity filter while closing the S6 gate.

      Through high-throughput screening of small compound libraries and compound optimization, the group identified a reasonably selective inhibitor and a related compound that acts as an activator. The characterization shows that these compounds bind in a novel binding site. Interestingly, the inhibitor, despite binding in a site different from that of apamin, also induces a K+ selective/conductive conformation of the selectivity filter while the activator induces a non-K+ selective/conductive conformation and an open S6 gate.

      The data suggest that the selectivity filter and the S6 gate are rarely open at the same time, and the authors hypothesize that this might be the underlying reason for the small conductance of SK2. The data will be valuable for understanding the mechanism of SK2 channel (and other SK subtypes).

      Overall, the data is of good quality and supports the claims made by the authors. However, a deeper analysis of the cryo-EM data sets might yield some important insights, i.e., about the relationship between the conformation of the selectivity filter and the opening of the S6 gate.

      We attempted focused 3D classification to identify subsets of particles with the S6 open and the SF in a conductive state but were not able to isolate such a particle class. This indicates that either none or a very small percentage of particles exists in a fully conductive state. This sentence was included in the results section: 

      “Focused 3D classification of the S3-S4 linker was unsuccessful in identifying particles subsets with a dilated extracellular constriction suggesting that either none or a very small percentage of Ca<sup>2+</sup>-bound SK2-4 is in a conductive state”

      Some insight and discussion about the allosteric networks between the SF and the S6 gate would also be a valuable addition.

      The extracellular constriction is in the same non-conductive conformation in the Ca<sup>2+</sup> bound and Ca<sup>2+</sup> -free SK2-4 structures suggesting that the conformation of S3-S4 linker/SF and the S6 are not allosterically coupled. We predict that Ca<sup>2+</sup> opens the intracellular gate and another physiological factor (not yet identified) promotes extracellular gate opening. These sentences were added to the results and discussion: “This along with the similar conformation of the S3-S4 linker in the Ca<sup>2+</sup> -bound and Ca<sup>2+</sup> -free states of SK2-4 suggest that Ca<sup>2+</sup> -dependent intracellular gate dynamics are not coupled to the conformation of the S3-S4 linker. Other yet to be identified physiological factors may be required to dilate the extracellular constriction.”

      “Alternatively, other physiological factors, such as PIP2[46,47] or protein-protein interactions[48-50], may exist in live cells that modulate the interaction between S3-S4 linker and the selectivity filter.”

      Reviewer #2 (Public review):

      Summary:

      The authors have used single-particle cryoEM imaging to determine how small-molecule regulators of the SK channel interact with it and modulate their function.

      Strengths:

      The reconstructions are of high quality, and the structural details are well described.

      Weaknesses:

      The electrophysiological data are poorly described. Several details of the structural observations require a mechanistic context, perhaps better relating them to what is known about SK channels or other K channel gating dynamics.

      As recommended, additional details for electrophysiological data were added to the results, methods, and figure legends for clarification.  

      The most pressing point I have to make, which could help improve the manuscript, relates to the selectivity filter (SF) conformation. Whether the two ion-bound state of SK2-4 (Figure 4A) represents a non-selective, conductive SF occluded by F243 or represents a C-type inactivated SF, further occluded by F243, is unclear. It would be important to discuss this. Reconstructions of Kv1.3 channels also feature a similar configuration, which has been correlated to its accelerated C-type inactivation.

      Structural overlays of Ca<sup>2+</sup> bound SK2-4, HCN, and C-type inactivated Kv1.3 selectivity filters demonstrate that each have conformational differences and it is difficult to definitively determine if the SK2-4 selectivity filter is in a non-selective conformation like HCN or a C-type inactivated conformation like Kv1.3. Based on the number of ions observed in the filter and the position of Tyr361 we believe the selectivity filter most closely resembles that of HCN. Importantly, the selectivity filter conformation observed in the SK2-4 Ca<sup>2+</sup> -bound and Ca<sup>2+</sup> -free structures is ultimately nonconductive due to the Phe243 extracellular constriction blocking K<sup>+</sup> efflux.

      A comparison of the SK2-4 selectivity filter to HCN and C-type inactivated Kv1.3 was included in Figure 4 and this sentence was included in the results section:

      “The selectivity filter of SK2-4 resembles that of HCN in both the position of Tyr361 and the number of K<sup>+</sup> coordination sites (Fig 4E,F,G,H)”

      Furthermore, binding of a toxin derivative to Kv1.3 restores the SF into a conductive form, though occluded by the toxin. It appears that apamin binding to SK2-4 might be doing something similar. Although I am not sure whether SK channels undergo C-type inactivation like gating, classical MTS accessibility studies have suggested that dynamics of the SF might play a role in the gating of SK channels. It would be really useful (if not essential) to discuss the SF dynamics observed in the study and relate them better to aspects of gating reported in the literature.

      Extracellular toxin binding to SK2-4 and K<sub>v</sub>1.3 induces a conformational change in the selectivity filter to produce a canonical K<sup>+</sup> selective structure with four coordination sites. However, the mechanism by which the toxins produce the conformational change is different. For SK2-4, apamin interacts primarily with S3-S4 linker residues and induces a shift in the S3-S4 linker away from the pore axis. This in turn prevents the hydrogen bonds between Arg240 and Tyr245 of the S3-S4 linker and Asp363 at the C-terminus of the selectivity filter to produce a selectivity filter conformation with four K<sup>+</sup> coordination sites. For K<sub>v</sub>1.3, the sea anemone toxin ShK binds directly to the C-terminus of the selectivity filter disrupting interactions required for the C-type inactivated structure and thereby inducing the conformational change. These sentences were added to the results:

      “Toxin induced selectivity filter conformational change has also been reported for K<sub>v</sub>1.3 with the sea anemone toxin ShK. However, unlike apamin binding to SK2-4, ShK binds directly to the K<sub>v</sub>1.3 selectivity filter to convert a C-type inactivated conformation to a canonical K<sup>+</sup> selective structure with four coordination sites [39,40]. The change in selectivity filter conformation in apamin-bound SK2-4 seems to be driven instead by the weakening of interactions between the selectivity filter and the S3-S4 linker.”

      The SF of K channels, in conductive states, are usually stabilized by an H-bond network involving water molecules bridged to residues behind the SF (D363 in the down-flipped conformation and Y361). Considering the high quality of the reconstructions, I would suspect that the authors might observe speckles of density (possibly in their sharpened map) at these sites, which overlap with water molecules identified in high-resolution X-ray structures of KcsA, MthK, NaK, NaK2K, etc. It could be useful to inspect this region of the density map.

      We did not observe strong density near Y361 or D363 that could be confidently modeled as water. However, in the structures of SK2-4 bound to apamin and compound 1, Tyr361 in the selectivity filter rotates 180° and forms a hydrogen bond with Thr355 in the pore helix. The homologous hydrogen bond is also observed in SK4 and the conductive/K<sup>+</sup> selective selectivity filter conformation of Kv1.3. The rotation of Tyr361 to form a hydrogen bond with Thr355, reorientation of Asp363 and Trp350 into hydrogen bonding position, and the presence of four K<sup>+</sup> coordination sites upon binding of apamin and compound 1 strongly suggest that the selectivity filter is in a K<sup>+</sup> selective/conductive conformation. The Tyr361/Thr355 hydrogen bond is now described in the paper and shown in Figures 4D, 5D, and S6F.

      Reviewer #3 (Public review):

      This is a fundamentally important study presenting cryo-EM structures of a human small conductance calcium-activated potassium (SK2) channel in the absence and presence of calcium, or with interesting pharmacological probes bound, including the bee toxin apamin, a small molecule inhibitor, and a small molecule activator. As efforts to solve structures of the wild-type hSK2 channel were unsuccessful, the authors engineered a chimera containing the intracellular domain of the SK4 channel, the subtype of SK channel that was successfully solved in a previous study (reference 13). The authors present many new and exciting findings, including opening of an internal gate (similar to SK4), for the first time resolving the S3-S4 linker sitting atop the outer vestibule of the pore and unanticipated plasticity of the ion selectivity filter, and the binding sites for apamin, one new small molecule inhibitor and another small molecule activator. Appropriate functional data are provided to frame interpretations arising from the structures of the chimeric protein; the data are compelling, the interpretations are sound, and the writing is clear. This high-quality study will be of interest to membrane protein structural biologists, ion channel biophysicists, and chemical biologists, and will be valuable for future drug development targeting SK channels.

      The following are suggestions for strengthening an already very strong and solid manuscript:

      (1) It would be good to include some information in the text of the results section about the method and configuration used to obtain electrophysiological data and the limitations. It is not until later in the text that the Qube instrument is mentioned in the results section, and it is not until the methods section that the reader learns it was used to obtain all the electrophysiological data. Even there, it is not explicitly mentioned that a series of different internal solutions were used in each cell where the free calcium concentration was varied to obtain the data in Figure 1C. Also, please state the concentration of free calcium for the data in Figure 1B.

      As recommended, additional details for electrophysiological data were added to the results, methods, and figure legends for clarification.  

      (2) The authors do a nice job of discussing the conformations of the selectivity filter they observed here in SK as they relate to previous work on NaK and HCN, but from my perspective the authors are missing an opportunity to point out even more striking relationships with slow C-type inactivation of the selectivity filter in Shaker and Kv1 channels. C-type inactivation of the filter in Shaker was seen in 150 mM K using the W434F mutant (PMC8932672) or in 4 mM K for the WT channel (PMC8932672), and similar results have been reported for Kv1.2 (PMC9032944; PMC11825129) and for Kv1.3 (PMC9253088; PMC8812516) channels. For Kv1.3, C-type inactivation occurs even in 150 mM K (PMC9253088; PMC8812516). Not unlike what is seen here with apamin, binding of the sea anemone toxin (ShK) with a Fab attached (or the related dalazatide) inserts a Lys into the selectivity filter and stabilizes the conducting conformation of Kv1.3 even though the Lys depletes occupancy of S1 by potassium (PMC9253088; PMC8812516). Or might the conformation of the filter be controlled by regulatory processes in SK2 channels? I think connecting the dots here would enhance the impact of this study, even if it remains relatively speculative.

      Please see the response to reviewer 2’s comments for a comparison of the selectivity filter structure between SK2-4 and C-type inactivated K<sub>v</sub>1.3 and a discussion of toxin induced selectivity filter conformational change.

      What is known about how the functional properties of SK2 channels (where the filter changes conformation) differ from SK4, where the filter remains conducting (reference 13)? Is there any evidence that SK2 channels inactivate?

      Compared with SK4, SK2 has some unique properties such as lower conductance and the ability to switch between low- and high-open probability states. Mutation of Phe243 suggests that the S3-S4 linker conformation contributes to the low conductance. This is included in the discussion.

      “Such a mechanism may explain some properties of SK2 that are not observed in SK4, which lacks an S3-S4 linker, such as its low conductance (~10 pS) and the ability to switch between low- and high-open probability states[3,4]. Indeed, mutation of Phe243 in rat SK2 produced a 2-fold increase in channel conductance[5].”

      Or might the conformation of the filter be controlled by regulatory processes in SK2 channels? I think connecting the dots here would enhance the impact of this study, even if it remains relatively speculative.

      Please see the response to reviewer 1’s comments for a discussion of the potential physiological role of the S3-S4 linker/extracellular constriction and its mechanism for opening.

      Reviewer #1 (Recommendations for the authors):

      I enjoyed reading your paper and am intrigued by your findings on the selectivity filter of SK2. I've got a few recommendations for data analysis and a couple of questions that might contribute to the discussion.

      In your Ca2+-bound dataset, have you tried to parse out any alternative conformations (e.g., by using 3D classification, or 3D variability)? Do you think there might be a small(er) population of particles that adopt a fully open conformation? If you haven't done this already, I would recommend doing so. You have a rather large number of particles in your final 3D reconstruction (~660k), so there might be some hidden conformations that could contribute to our understanding of the system.

      I would recommend doing the same for your compound 4-bound data set.

      Please see above for response to this recommendation.

      Do you think apamin works solely as a pore blocker, or does its binding perhaps also affect the S6 gate via allosteric networks (perhaps the same ones that induce the formation of the K+ conductive SF through binding of compound 1 above the S6 gate?)?

      Apamin binding does not change the conformation of the pore helices (S5 or S6) and thus we believe it acts primarily as a pore blocker. The following was added to the results section:

      “Overall, the apamin-bound SK2-4/CaM structure resembles Ca<sup>2+</sup>-bound SK2-4. The N-terminal lobe of CaM engages with the S<sub>45</sub>A helix, the S5 and S6 helices adopt a similar conformation, and the intracellular gate Val390 is open with a radius of 3.5 Å (Fig 2D). The most significant conformational change is in the position of the S3-S4 linker, which shifts ~2 Å away from the pore axis to accommodate apamin binding.”

      Is there a mechanistic explanation for why it might be difficult/energetically costly for the SF to be conductive and the S6 gate to be open at the same time?

      Not to our knowledge.

      I also have these minor recommendations:

      -In all figures showing density, include the threshold/sigma value at which density is shown.

      -For all ligands and ions, include half-map data.

      Sigma values were added to all figure legends displaying cryoEM density. The displayed maps are the sharpened full maps.

      Reviewer #2 (Recommendations for the authors):

      Is it possible to provide a structure-sequence guided explanation for the different affinity of compound 1 for SK2 vs SK4?

      Yes. The following is now included in the results section and a panel was added to Figure S6D.

      “However, for SK4 Thr212 replaces SK2 Ser318 and Trp216 (homologous to SK2 Trp322) is conserved but adopts a different rotamer conformation (Fig S6D). Both changes occlude the compound 1 binding site in SK4 and would likely reduce compound 1 potency on SK4 as observed in the functional data.”

      Is it possible to propose a model of modulation by compound 1/4 where the authors can comment on the conformational dependence of compound binding? That is, do they bind exclusively to the identified conformational states of the channel, or are they able to bind to both closed and open channels, but bias one state over the other?

      The clash between compound 1 and Thr386 in the open conformation of the S6 helices suggests that compound 1 would preferentially bind to closed state of SK2. Similarly, the clash between compound 4 and Ile380 in the closed conformation of the S6 helices suggests that compound 4 would preferentially bind to the open state of SK2. This was included in the discussion:

      “This proposed mechanism of modulation suggests that compound 1 may bind preferentially to the closed conformation of the S6 helices and compound 4 may bind preferentially to the open conformation of the S6 helices.” 

      Please provide the calcium concentration used to generate the data in Figure 1B.

      The calcium concentration is now stated in the legend for Fig 1B:

      “Intracellular solution contains 2 µM Ca<sup>2+</sup> based on calculation using Maxchelator (see methods)”

      Essential and critically important descriptions of experiments in Figure 7A are lacking. It would be essential to describe properly, with care, what the currents and the conditions of measurements are. If these currents are obtained by subtracting leak currents by adding other drugs, it would be good to comment on whether the latter compete with compounds 1/4.

      As recommended, additional details for electrophysiological data were added to the results, methods, and figure legends for clarification. SK currents were obtained by subtracting leak currents, measured by adding UCL1684 only at the end of experiments. UCL1684 is not expected to interfere with the effect of compound 1 or 4 given their different binding sites and mechanisms.

      If Compound 1 changes the structure of the SF (Figure 6F), would it also promote apamin binding? Given that both these agents produce a similar change in the SF, could each favor the binding of the other?

      Since apamin binds to the S3-S4 linker it is unlikely that the selectivity filter conformational change observed in the compound 1 bound structure would affect apamin binding.

    1. These temporary limitations will pass. The physics engines that underpin VR are improving. In years to come, the headsets will get smaller, and we will transition to glasses, contact lenses, and eventually retinal or brain implants. The resolution will get better, until a virtual world looks exactly like a nonvirtual world. We will figure out how to handle touch, smell, and taste. We may spend much of our lives in these environments, whether for work, socializing, or entertainment.

      It's so crazy to me how much VR can and will change the world. I think that it's really cool to use as a fun game or activity, but I do not think that it should be incorporated into everyday life. I feel as though it's going to make the world into such a fake environment and ruin true socialness and connection.

    2. Reality exists, independently of us. The truth matters. There are truths about reality, and we can try to find them. Even in an age of multiple realities, I still believe in objective reality.

      I find it interesting to say that reality exists independently of us. I mean, everyone lives a completely different life and we tend to forget that. This can also be referred to as sonder. I think sonder can also be applied to the concept of whether virtual reality is reality and where the truths are within reality. If reality is just within our minds, how does one go about trying to find out what is true? Just by living? I think we can create certain realities in our brains that may or may not come true.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      Summary:

      It seems as if the main point of the paper is about the new data related to ratfish, although your title describes it as extant cartilaginous fishes and you bounce around between the little skate and ratfish. So here's an opportunity for you to adjust the title to emphasize ratfish, given the fact that later you describe how this is your significant new data contribution. Either way, the organization of the paper can be adjusted so that the reader can follow along the same order for all sections so that it's very clear for comparative purposes of new data and what they mean. My opinion is that I want to read, for each subheading in the results, about the ratfish first because this is your most interesting novel data. Then I want to know any confirmation about morphology in little skate. And then I want to know about any gaps you fill with the cat shark. (It is ok if you keep the order of "skate, ratfish, then shark", but I think it undersells the new data.)

      The main points of the paper are 1) to define terms for chondrichthyan skeletal features in order to unify research questions in the field, and 2) add novel data on how these features might be distributed among chondrichthyan clades. However, we agree with the reviewer that many readers might be more interested in the ratfish data, so we have adjusted the order of presentation to emphasize ratfish throughout the manuscript.

      Strengths:

      The imagery and new data availability for ratfish are valuable and may help to determine new phylogenetically informative characters for understanding the evolution of cartilaginous fishes. You also allude to the fossil record.

      Thank you for the nice feedback.

      Opportunities:

      I am concerned about the statement of ratfish paedomorphism because stage 32 and 33 were not statistically significantly different from one another (figure and prior sentences). So, these ratfish TMDs overlap the range of both 32 and 33. I think you need more specimens and stages to state this definitely based on TMD. What else leads you to think these are paedomorphic? Right now they are different, but it's unclear why. You need more outgroups.

      Sorry, but we had reported that the TMD of centra from little skate did significantly increase between stage 32 and 33. Supporting our argument that ratfish had features of little skate embryos, TMD of adult ratfish centra was significantly lower than TMD of adult skate centra (Fig1).  Also, it was significantly higher than stage 33 skate centra, but it was statistically indistinguishable from that of stage 33 and juvenile stages of skate centra.  While we do agree that more samples from these and additional groups would bolster these data, we feel they are sufficiently powered to support our conclusions for this current paper.

      Your headings for the results subsection and figures are nice snapshots of your interpretations of the results and I think they would be better repurposed in your abstract, which needs more depth.

      We have included more data summarized in results sub-heading in the abstract as suggested (lines 32-37).

      Historical literature is more abundant than what you've listed. Your first sentence describes a long fascination and only goes back to 1990. But there are authors that have had this fascination for centuries and so I think you'll benefit from looking back. Especially because several of them have looked into histology and development of these fishes.

      I agree that in the past 15 years or so a lot more work has been done because it can be done using newer technologies, and I don't think your list is exhaustive. You need to expand this list and history, which will help with your ultimate comparative analysis without you needing to sample too many new data yourself.

      We have added additional recent and older references: Kölliker, 1860; Daniel, 1934; Wurmbach, 1932; Liem, 2001; Arratia et al., 2001.

      I'd like to see modifications to figure 7 so that you can add more continuity between the characters, illustrated in figure 7 and the body of the text.

      We address a similar comment from this reviewer in more detail below, hoping that any concerns about continuity have been addressed with inclusion of a summary of proposed characters in a new Table 1, re-writing of the Discussion, and modified Fig7 and re-written Fig7 legend.

      Generally Holocephalans are the outgroup to elasmobranchs - right now they are presented as sister taxa with no ability to indicate derivation. Why isn't the catshark included in this diagram?

      While it was a little unclear exactly what was requested, we restructured the branches to indicate that holocephalans diverged earlier from the ancestors that led to elasmobranchs. Also in response to this comment, we added catshark (S. canicula) and little skate (L. erinacea) specifically to the character matrix.

      In the last paragraph of the introduction, you say that "the data argue" and I admit, I am confused. Whose data? Is this a prediction or results or summary of other people's work? Either way, could be clarified to emphasize the contribution you are about to present.

      Sorry for this lack of clarity, and we have changed the wording in this revision to hopefully avoid this misunderstanding.

      Reviewer #2 (Public Review):

      General comment:

      This is a very valuable and unique comparative study. An excellent combination of scanning and histological data from three different species is presented. Obtaining the material for such a comparative study is never trivial. The study presents new data and thus provides the basis for an in-depth discussion about chondrichthyan mineralised skeletal tissues.

      many thanks for the kind words

      I have, however, some comments. Some information is lacking and should be added to the manuscript text. I also suggest changes in the result and the discussion section of the manuscript.

      Introduction:

      The reader gets the impression that almost no research on chondrichthyan skeletal tissues was done before 2010 ("last 15 years", L45). I suggest correcting that and also citing previous studies on chondrichthyan skeletal tissues, including studies from before 1900.

      We have added additional older references, as detailed above.

      Material and Methods:

      Please complete L473-492: Three different Micro-CT scanners were used for three different species? SkyScan 117 for the skate samples. Catshark different scanner, please provide full details. Chimera Synchrotron scan? Please provide full details for all scanning protocols.

      We clarified exact scanners and settings for each micro-CT experiment in the Methods (lines 476-497).

      TMD is established in the same way in all three scanners? Actually not possible. Or, all specimens were scanned with the same scanner to establish TMD? If so please provide the protocol.

      Indeed, the same scanner was used for TMD comparisons, and we included exact details on how TMD was established and compared with internal controls in the Methods. (lines 486-488)

      Please complete L494 ff: Tissue embedding medium and embedding protocol is missing. Specimens have been decalcified, if yes how? Have specimens been sectioned non-decalcified or decalcified?

      Please complete L506 ff: Tissue embedding medium and embedding protocol is missing. Description of controls are missing.

      Methods were updated to include these details (lines 500-503).

      Results:

      L147: It is valuable and interesting to compare the degree of mineralisation in individuals from the three different species. It appears, however, not possible to provide numerical data for Tissue Mineral Density (TMD). First requirement, all specimens must be scanned with the same scanner and the same calibration values. This is not stated in the M&M section. But even if this was the case, all specimens derive from different sample locations and have been preserved differently. Type of fixation, extension of fixation time in formalin, frozen, unfrozen, conditions of sample storage, age of the samples, and many more parameters, all influence TMD values. Likewise the relative age of the animals (adult is not the same as adult) influences TMD. One must assume different sampling and storage conditions and different types of progression into adulthood. Thus, the observation of different degrees of mineralisation is very interesting but I suggest not to link this observation to numerical values.

      These are very good points, but for the following reasons we feel that they were not sufficiently relevant to our study, so the quantitative data for TMD remain scientifically valid and critical for the field moving forward.  Critically, 1) all of the samples used for TMD calculations underwent the same fixation protocols, and 2) most importantly, all samples for TMD were scanned on the same micro-CT scanner using the same calibration phantoms for each scanning session.  Finally, while the exact age of each adult was not specified, we note for Fig1 that clear statistically significant differences in TMD were observed among various skeletal elements from ratfish, shark, and skate.  Indeed, ratfish TMD was considerably lower than TMD reported for a variety of fishes and tetrapods (summarized in our paper about icefish skeletons, who actually have similar TMD to ratfish: https://doi.org/10.1111/joa.13537).

      In response, however, we added a caveat to the paper’s Methods (lines 466-469), stating that adult ratfish were frozen within 1 or 2 hours of collection from the wild, staying frozen for several years prior to thawing and immediate fixation.

      Parts of the results are mixed with discussion. Sometimes, a result chapter also needs a few references but this result chapter is full of references.

      As mentioned above, we reduced background-style writing and citations in each Results section.

      Based on different protocols, the staining characteristics of the tissue are analysed. This is very good and provides valuable additional data. The authors should inform the reader not only about the staining (positive or negative) but also about the histochemical characters of the staining. L218: "fast green positive" means what? L234: "marked by Trichrome acid fuchsin" means what? And so on, see also L237, L289, L291

      We included more details throughout the Results upon each dye’s first description on what is generally reflected by the specific dyes of the staining protocols. (lines 178, 180, 184, 223, 227, and 243-244)

      Discussion

      Please completely remove figure 7, and please adjust and severely downsize the discussion related to figure 7. It is very interesting and valuable to compare three species from three different groups of elasmobranchs. Results of this comparison also validate an interesting discussion about possible phylogenetic aspects. This is, however, not the basis for claims about the skeletal tissue organisation of all extinct and extant members of the groups to which the three species belong. The discussion refers to "selected representatives" (L364), but how representative are the selected species? Can there be an extant species that represents the entire large group, all sharks, rays or chimeras? Are the three selected species basal representatives with a generalist life style?

      These are good points, and yes, we certainly appreciate that the limited sampling in our data might lead to faulty general conclusions about these clades.  In fact, we stated this limitation clearly in the Introduction (lines 126-128), and we removed “representative” from this revision.  We also replaced general reference to chondrichthyans in the Title by listing the specific species sampled.  However, in the Discussion, we also compare our data with previously published additional species evaluated with similar assays, which confirms the trend that we are concluding.  We look forward to future papers specifically testing the hypotheses generated by our conclusions in this paper, which serves as a benchmark for identifying shared and derived features of the chondrichthyan endoskeleton.

      Please completely remove the discussion about paedomorphosis in chimeras (already in the result section). This discussion is based on a wrong idea about the definition of paedomorphosis. Paedomorphosis can occur in members of the same group. Humans have paedomorphic characters within the primates; Ambystoma mexicanum is paedomorphic within the urodeles. Paedomorphosis does not extend to members of different vertebrate branches. That elasmobranchs have a developmental stage that resembles chimera vertebra mineralisation does not define chimera vertebra centra as paedomorphic. Teleosts have a heterocercal caudal fin anlage during development; that does not mean the heterocercal fins in sturgeons or elasmobranchs are paedomorphic characters.

      We agree with the reviewer that discussion of paedomorphosis should apply to members of the same group.  In our paper, we are examining paedomorphosis in a holocephalan, relative to elasmobranch fishes in the same group (Chondrichthyes), so this is an appropriate application of paedomorphosis.  In response to this comment, we clarified that our statement of paedomorphosis in ratfish was made with respect to elasmobranchs (lines 37-39; 418-420).

      L432-435: In the times of Gadow & Abbott (1895), science had completely wrong ideas about the phylogenetic position of chondrichthyans within the gnathostomes. It is curious that Gadow & Abbott (1895) are being cited in support of the paedomorphosis claim.

      If paedomorphosis is being examined within Chondrichthyes, such as in our paper and in the Gadow and Abbott paper, then it is an appropriate reference, even if Gadow and Abbott (and many others) got the relative position of Chondrichthyes among other vertebrates incorrect.

      The SCPP part of the discussion is unrelated to the data obtained by this study. Kawasaki & Weiss (2003) describe a gene family (called SCPP) that controls Ca-binding extracellular phosphoproteins in enamel, in bone and dentine, in saliva and in milk. It evolved by gene duplication and differentiation. They date it back to a first enamel matrix protein in conodonts (Reif 2006). Conodonts, a group of enigmatic invertebrates, have mineralised structures but these structures are neither bone nor mineralised cartilage. Catfish (6 % of all vertebrate species), on the other hand, have bone but do not have SCPP genes (Lui et al. 206). Other calcium binding proteins, such as osteocalcin, were initially believed to be required for mineralisation. It turned out that osteocalcin is rather a mineralisation inhibitor; at best it regulates the arrangement of collagen fiber bundles. The osteocalcin -/- mouse has fully mineralised bone. As the function of the SCPP gene product for bone formation is unknown, there is no need to discuss SCPP genes. It would perhaps be better to finish the manuscript with a summary that focuses on the subject and the methodology of this nice study.

      We completely agree with the reviewer that many papers claim to associate the functions of SCPP genes with bone formation, or even mineralization generally.  The Science paper with the elephant shark genome made it very popular to associate SCPP genes with bone formation, but we feel that this was a false comparison (for many reasons)!  In response to the reviewer’s comments, however, we removed the SCPP discussion points, moving the previous general sentence about the genetic basis for reduced skeletal mineralization to the end of the previous paragraph (lines 435-439).  We also added another brief Discussion paragraph afterwards, ending as suggested with a summary of our proposed shared and derived chondrichthyan endoskeletal traits (lines 440-453).

      Reviewer #1 (Recommendations For The Authors):

      Further Strengths and Opportunities:

      Your headings for the results subsection and figures are nice snapshots of your interpretations of the results and I think they would be better repurposed in your abstract, which needs more depth. It's a little unusual to try and state an interpretation of results as the heading title in a results section and the figures so it feels out of place. You could also use the headings as the last statement of each section, after you've presented the results. In order I would change these results subheadings to:

      Tissue Mineral Density (TMD)

      Tissue Properties of Neural Arches

      Trabecular mineralization

      Cap zone and Body zone Mineralization Patterns

      Areolar mineralization

      Developmental Variation

      Sorry, but we feel that summary Results sub-headings are the best way to effectively communicate to readers the story that the data tell, and this style has been consistently used in our previous publications.  No changes were made.

      You allude to the fossil record and that is great. That said, historical literature is more abundant than what you've listed. Your first sentence describes a long fascination and only goes back to 1990. But there are authors that have had this fascination for centuries and so I think you'll benefit from looking back. Especially because several of them have looked into histology of these fishes. You even have one sentence citing Coates et al. 2018, Frey et al., 2019 and Ørvig 1951 to talk about the potential that fossils displayed trabecular mineralization. That feels like you are burying the lead and may have actually been part of the story for where you came up with your hypothesis in the beginning... or the next step in future research. I feel like this is really worth spending some more time on in the intro and/or the discussion.

      We’ve added older REFs as pointed out above.  Regarding fossil evidence for trabecular mineralization, no, those studies did not lead to our research question.  But after we discovered how widespread trabecular mineralization was in extant samples, we consulted these papers, which did not focus on the mineralization patterns per se, but certainly led us to emphasize how those patterns fit in the context of chondrichthyan evolution, which is how we discussed them.

      I agree that in the past 15 years or so a lot more work has been done because it can be done using newer technologies. That said, there's a lot more work by Mason Dean's lab starting in 2010 that you should take a look at related to tesserae structure... they're looking at additional taxa beyond what you did as well. It will be valuable for you to be able to make any sort of phylogenetic inference as part of your discussion and enhance the info you present in figure 7. Go further back in time... For example:

      de Beer, G. R. 1932. On the skeleton of the hyoid arch in rays and skates. Quarterly Journal of Microscopical Science. 75: 307-319, pls. 19-21.

      de Beer, G. R. 1937. The Development of the Vertebrate Skull. The University Press, Oxford.

      Indeed, we have read all of Mason’s work, citing 9 of his papers, and where possible, we have incorporated their data on different species into our Discussion and Fig7.  Thanks for the de Beer REFs.  While they contain histology of developing chondrichthyan elements, they appear to refer principally to gross anatomical features, so were not included in our Intro/Discussion.

      Most sections within the results read more like a discussion than a presentation of the new data, and you jump directly into using an argument of those data too early. Go back in and remove the references or save those paragraphs for the discussion section. Particularly because this journal has you skip the method section until the end, I think it's important to set up this section with a little bit more brevity and conciseness.  For instance, in the first section about tissue mineral density, change that subheading to just say tissue mineral density. Then you can go into the presentation of what you see in the ratfish, and then what you see in the little skate, and then that's it. You save the discussion about what other elasmobranchs are mineralizing their neural arches, etc. for another section.

      We dramatically reduced background-style writing and citations in each Results section (other than the first section of minor points about general features of the ratfish, compared to catshark and little skate), keeping only a few to briefly remind the general reader of the context of these skeletal features.

      I like that your first sentence in the paragraph is describing why you are doing a particular method and comparison, because it shows me (the reader) where you're sampling from. Another suggestion: as part of the first figure, rather than having just the graphs, include a small sketch for little skate and catshark to show where you sampled from for comparative purposes. That would then relate back to clarifying other figures as well.

      done (also adding a phylogenetic tree).

      The second instance is your section on trabecular mineralization. This has so many references in it. It does not read like results at all. It looks like a discussion. However, the trabecular mineralization is one of the most interesting aspects of this paper, and how you are describing it as a unique feature. I really just want a very clear description of what the definition of this trabecular mineralization is going to be.

      In addition to adding Table 1 to define each proposed endoskeletal character state, we have changed the structure of this section and hope it better communicates our novel trabecular mineralization results.  We also moved the topic of trabecular mineralization to the first detailed Discussion point (lines 347-363) to better emphasize this specific topic.

      Carry this reformatting through for all subsections of the results.

      As mentioned above, we significantly reduced background-style writing and citations in each Results section.

      I'd like to see modifications to figure 7 so that you can add more continuity between the characters illustrated in figure 7 and the body of the text. I think you can give the characters a number so that you can actually refer to them in each subsection of the results. They can even be numbered sequentially so that they are presented in a standard character matrix format that future researchers can add directly to their own character matrices. You could actually turn it into a separate table so it doesn't take up that entire space of the figure, because there need to be additional taxa referred to on the diagram. Namely, you don't have any outgroups in figure 7, so it's hard to describe any state specifically as ancestral or derived. Generally Holocephalans are the outgroup to elasmobranchs - right now they are presented as sister taxa with no ability to indicate derivation. Why isn't the catshark included in this diagram?

      The character matrix is a fantastic idea, and we should have included it in the first place!  We created Table 1 summarizing the traits and terminology at the end of the Introduction, also adding the character matrix in Fig7 as suggested, including specific fossil and extant species.  For the Fig7 branching and catshark inclusion, please see above. 

      You can repurpose the figure captions as narrative body text. Use less narrative in the figure captions. These are your results actually, so move that text to the results section as a way to truncate and get to the point faster.

      By figure captions, we assume the reviewer refers to figure legends.  We like to explain figures to some degree of sufficiency in the legends, since some people do not read the main text and simply skim a manuscript’s abstract, figures, and figure legends.  That said, we did reduce the wording, as requested.

      More specific comments about semantics are listed here:

      The abstract starts negative and doesn't state a question although one is referenced. Potential revision - "Comprehensive examination of mineralized endoskeletal tissues warranted further exploration to understand the diversity of chondrichthyans... Evidence suggests for instance that trabecular structures are not common, however, this may be due to sampling (bring up fossil record.) We expand our understanding by characterizing the skate, cat shark, and ratfish... (Then add your current headings of the results section to the abstract, because those are the relevant takeaways.)"

      We re-wrote much of the abstract, hoping that the points come across more effectively.  For example, we started with “Specific character traits of mineralized endoskeletal tissues need to be clearly defined and comprehensively examined among extant chondrichthyans (elasmobranchs, such as sharks and skates, and holocephalans, such as chimaeras) to understand their evolution”.  We also stated an objective for the experiments presented in the paper: “To clarify the distribution of specific endoskeletal features among extant chondrichthyans”. 

      In the last paragraph of the introduction, you say that "the data argue" and I admit, I am confused. Whose data? Is this a prediction or results or summary of other people's work? Either way, could be clarified to emphasize the contribution you are about to present.

      Sorry for this lack of clarity, and we have changed the wording in this revision to hopefully avoid this misunderstanding.

      In the second paragraph of the TMD section, you mention the synarcual comparison. I'm not sure I follow. These are results, not methods. Tell me what you are comparing directly. The non-centrum part of the synarcual separate from the centrum? They both have both parts... did you mean the comparison of those both to the catshark? Just be specific about which taxon, which region, and which density. No need to go into reasons why you chose those regions here. Put those into methods and discussion for interpretation.

      We hope that we have now clarified wording of that section.

      Label the spokes somehow either in caption or on figure direction. I think I see it as part of figure 4E, I, and J, but maybe I'm misinterpreting.

      Based upon histological features (e.g., regions of very low cellularity with Trichrome unstained matrix) and hypermineralization, spokes in Fig4 are labelled with * and segmented in blue.  We detailed how spokes were identified in main text (lines 241-243; 252-254) and figure legend (lines 597-603). 

      Reviewer #2 (Recommendations For The Authors):

      Other comments

      L40: remove paedomorphism

      no change; see above

      L53: down tune language, remove "severely" and "major"

      done (lines 57-59)

      L86: provide species and endoskeletal elements that are mineralized

      no change; this paragraph was written generally, because the papers cited looked at cap zones of many different skeletal elements and neural arches in many different species

      L130: remove TMD, replace by relative, descriptive, values

      no change; see above

      L135: What are "segmented vertebral neural arches and centra" ?

      changed to “neural arches and centra of segmented vertebrae” (lines 140-141)

      L166: L168 "compact" vs. "irregular". Partial mineralisation is not necessarily irregular.

      thanks for pointing out this issue; we changed wording, instead contrasting “non-continuous” and “continuous” mineralization patterns (lines 171-174)

      L192: "several endoskeletal regions". Provide all regions

      all regions provided (lines 198-199)

      L269: "has never been carefully characterized in chimeras". Carefully means what? Here, also only one chimera is analysed, not several species.

      sentence removed

      302: Can't believe there is no better citation for elasmobranch vertebral centra development than Gadow and Abott (1895)

      added Arratia and Kölliker REFs here (lines 293-295)

      L318 ff: remove discussion from result chapter

      references to paedomorphism were removed from this Results section

      L342: refer to the species studied, not to the entire group.

      sorry, the line numbering for the reviewer and our original manuscript have been a little off for some reason, and we were unclear exactly to which line of text this comment referred.  Generally in this revision, however, we have tried to restrict our direct analyses to the species analyzed, but in the Discussion we do extrapolate a bit from our data when considering relevant published papers of other species.

      346: "selected representative". Selection criteria are missing

      “selected representative” removed

      L348: down tune, remove "critical"

      Done

      L351: down tune, remove "critical"

      done

      L 364: "Since stem chondrichthyans did not typically mineralize their centra". Means there are fossil stem chondrichthyans with fully mineralised centra?

      Re-worded to “Stem chondrichthyans did not appear to mineralize their centra” (line 379)

      L379: down tune and change to: "we propose the term "non-tesseral trabecular mineralization". Possibly a plesiomorphic (ancestral) character of chondrichthyans"

      no change; sorry, but we feel this character state needs to be emphasized as we wrote in this paper, so that its evolutionary relationship to other chondrichthyan endoskeletal features, such as tesserae, can be clarified.

      L407: suggests so far palaeontologists have not been "careful" enough?

      apologies; sentence re-worded, emphasizing that synchrotron imaging might increase details of these descriptions (lines 406-408)

      414: down tune, remove "we propose". Replace by "possibly" or "it can be discussed if"

      sentence re-worded and “we propose” removed (lines 412-415)

      L420: remove paragraph

      no action; see above

      L436: remove paragraph

      no action; see above

      L450: perhaps add a summary of the discussion. A summary that focuses on the subject and the methodology of this nice study.

      yes, in response to the reviewer’s comment, we finished the discussion with a summary of the current study.  (lines 440-453)

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Manuscript number: RC-2025-03094

      Corresponding author(s): Saurabh S. Kulkarni

      1. General Statements

      We thank the reviewers for their strong praise of the manuscript, highlighting its rigor, depth, and conceptual importance. They consistently described the study as a beautiful, fascinating, and conceptually strong piece of work that addresses a timely question in multiciliated cells. They also noted the high quality of the data, careful quantification, and the use of multiple genetic and pharmacological approaches, all of which improve the reproducibility and credibility of the findings. Importantly, they emphasized the novelty of discovering a direct mechanistic link between Piezo1-mediated mechanotransduction and Foxj1-driven transcriptional control of multiciliation, representing a significant breakthrough for both the cilia field and mechanobiology more broadly. Collectively, these strengths highlight the manuscript’s wide impact and make it highly suitable for publication in a high-impact journal.

      2. Description of the planned revisions

      Reviewer #1:


      There are two experiments that would significantly strengthen these claims.

      • First, if their model is correct then even short term treatment with Yoda1 should induce the pathway and affect centriole numbers. While I appreciate the challenge of long term Yoda1 treatment, it's not clear to me why it would be needed if short term treatment is setting off the transcriptional cascade. Yoda1 is used throughout the paper to induce all the pathways but we don't know if it actually induces the phenotype. I think this should be addressed with either short term treatments or a dose response to find a dose that does not lead to skin peeling. It is hard to ignore this obvious deficiency.
      • Second, the model predicts that all of this is to regulate Foxj1 levels to regulate the subtle balance between cell size and centriole number. If this is correct, then the overexpression of Foxj1 should have a profound effect on centriole number in multiciliated cells. This is such an easy experiment that would validate many of the claims.

      RESPONSE:

      We recognize that the reviewer is asking us to test the sufficiency of the pathway with these comments: “If their model is correct, then they should be able to activate the pathway in one way or another to stimulate centriole number. This is a significant limitation to their overall model.” And “If this is correct, then the overexpression of Foxj1 should have a profound effect on centriole number in multiciliated cells.”

      To address reviewers’ suggestions, we will perform the following experiments.

      1. A brief exposure to Yoda1 (15 or 30 mins) followed by a 3-hour wait before examining changes in centriole amplification. This will avoid skin peeling from long-term exposure.
      2. A brief exposure to Yoda1 (15 mins) followed by a 30-minute wait period, with the cycle repeated 4 times (3 hours in total) before examining centriole amplification.
      3. The above two experiments will also be done in a constitutively active-Yap background to increase the probability that synergistic activation can lead to centriole amplification.
      4. Although Foxj1 is essential for multiciliogenesis, it is not sufficient to induce multiciliogenesis, as shown by multiple previous studies. Therefore, we do not expect overexpression of Foxj1 to have a profound effect on centriole number. While we will conduct the experiments because we truly want to address the suggestions and gain insight into the answers ourselves, we respectfully ask the Reviewer to consider the following responses to their concern.

      Yoda1 sufficiency: We agree that testing whether acute Yoda1 treatment can induce centriole amplification is an important question. We will conduct experiments with short-pulse and cyclic Yoda1 exposure, including in a constitutively active-YAP background (listed above), to address this possibility. However, several challenges complicate interpretation: (i) PIEZO1 adapts and desensitizes upon activation, (ii) transient signaling may be sufficient to cause secondary signaling but insufficient to drive the stable transcriptional programs required for amplification, and (iii) centriole number is inherently variable, making modest effects difficult to resolve. We also note that failure to observe sufficiency under these conditions would not invalidate the model, for two reasons: (1) absence of evidence is not evidence of absence, so we may simply not have found the right experimental design; and (2) PIEZO1–YAP is a necessary input but is not sufficient on its own, as elaborated below. For both reasons, we have been careful in the manuscript to limit our interpretation to what the loss-of-function data show: that this pathway is necessary for centriole amplification.

      Foxj1 overexpression: Foxj1 is a well-established regulator essential for motile ciliogenesis and multiciliogenesis across species (Xenopus, zebrafish, mouse). Loss of Foxj1 reduces cilia number in MCCs, but its activation alone does not have a profound effect on ciliogenesis/cilia number in MCCs. This is because Foxj1 is part of a larger network essential for multiciliogenesis. This parallels the behavior of other transcriptional regulators, such as Myb, where loss of function impairs centriole amplification, but overexpression does not drive the formation of supernumerary centrioles. The studies establishing the roles of Foxj1 and Myb were seminal discoveries in the field of ciliogenesis, but they did not demonstrate the sufficiency of these molecules/pathways. Thus, our results, demonstrating that Foxj1 is necessary to induce tension-dependent centriole amplification, are significant, as the reviewer mentioned. The lack of Foxj1 sufficiency to induce centriole amplification is not a deficiency of the study, but rather evidence that Foxj1 is part of a larger network essential for tension-dependent centriole amplification.

      Necessity versus sufficiency: We respectfully emphasize that sufficiency is not a prerequisite for demonstrating the significance of a pathway. Mechanochemical signaling is inherently complex, involving many mechanosensitive proteins and pathways. In our case, mechanical stretch increases centriole amplification, with PIEZO1–YAP signaling identified as a key mediator. However, we do not claim that PIEZO1–YAP alone is sufficient. Other pathways, including cadherin-mediated junctions, F-actin–myosin contractility, integrin–focal adhesion signaling, and nuclear mechanotransduction, likely contribute and may regulate unique downstream effectors that collectively promote centriole amplification. Therefore, PIEZO1–YAP should be regarded as one essential component within a larger network.


      TIMELINE: We will perform these additional proposed experiments. Since the first author, a postdoctoral researcher on this manuscript, has started a new job and will be coming in on weekends to complete the experiments, we estimate it will take approximately 2-3 months to finish them.


      Reviewer #2:

      1. Considering the Yap-piezo mechanism of action, the authors' logic for the selection of myb, foxj, plk4 and ccno as transcriptional targets is clear, but the HCR-derived signal and the differences seen in the yap morphants are not very strong, notwithstanding the statistical significance. There appear to be distinct subgroups within the treated populations (in Figure S6B, although these data seem quite different in Fig. 7H, so a comment on the technical differences might be helpful), so that the extent to which Yap1 regulates (Myb-)Foxj1 expression in MCCs is not clearly demonstrated by this experiment. Related to this point, it is unclear why 20-25% of the yap1/ piezo1 MO-treated embryos do not show a decline in FOXj1 in Fig. 6, given the qualitative nature of the scoring. Assuming the KD penetrance would vary on a cell-to-cell basis, rather than an embryo-to-embryo basis, this may suggest that there are additional relevant targets (some of which are discussed by the authors). Single-cell analysis might be a way to address this; however, this is not a trivial experiment, it might be sufficient to include a caveat in the text. Furthermore, the conclusion that Foxj1 regulates centriole amplification in a tension-dependent manner is well-supported by the data.

      RESPONSE: We appreciate the reviewer’s thoughtful observation. Differences in the expression of Foxj1 from experiment to experiment are possible due to a combination of factors, including heterogeneity in MCC development across embryos, slightly different embryonic stages, differences in embryo quality between fertilizations, and variability in morpholino delivery and knockdown penetrance, which can occur both across embryos and on a cell-to-cell basis within an embryo. We also note that technical aspects of HCR RNA-FISH, such as proteinase K treatment and washing steps, can affect signal intensity, potentially contributing to the appearance of distinct subgroups within treated populations.

      We agree that single-cell analysis would be a powerful way to dissect these differences, but as the reviewer notes, this is not a trivial experiment and is beyond the scope of the present study. We have therefore added clarifications in the text and discussion to acknowledge these sources of variability and to highlight the possibility of parallel pathways regulating foxj1 expression.

      ********************************************

      Controls for the knockdowns by the various MOs should be provided.

      RESPONSE: We appreciate the reviewer’s comment. The piezo1 MO has been previously established in Kulkarni et al. (2021). Additionally, the current manuscript includes MO control experiments for both erk2 and yap1, through KD at the 1-cell stage using the MO oligonucleotide, followed by mosaic-rescue with the respective WT RNA constructs (mCherry-ERK2 and yap1-GFP) and a nuclear tracer molecule such as H2B-RFP (Fig. 5, E-H, Fig. S5, C&D, Fig. 3, D-F). The mosaic-rescue is a robust experiment that provides an internal control within the same embryo, thereby avoiding differences that may arise due to embryo-to-embryo variability, embryo quality, or differences in fertilization batches. This approach also serves as a valuable tool for detecting cell-autonomous effects, providing a clear readout against uninjected neighboring cells, as the injected cells are labeled with a tracer. We will perform a similar mosaic-rescue experiment for the foxj1 MO.

      TIMELINE: We will conduct mosaic-rescue experiments for the foxj1 MO. We will need 1 month to complete the experiment.

      ********************************************

      Minor comments:

      Autocorrection of ERK1/2 or MEK1/2 pathways to 1/2 should be avoided. – We are unclear on this comment. Could the reviewer please clarify what they mean?


      Reviewer #3

      Major concerns

      1- The presented data do not yet establish a specific, direct pathway linking mechanotransduction to centriole number, because the molecular players tested (PIEZO1, Ca²⁺, PKC, ERK, YAP, Foxj1) are highly pleiotropic. As such, the observed centriole number phenotypes, and some of the major conclusions, could be indirect. It is therefore critical to test the specificity and causality of the proposed pathway. This could be done with the authors' own strategies and/or with the following potential approaches:

      • Genetic dependency and sufficiency tests: It could be shown that Yoda1 has no effect in PIEZO1 loss-of-function MCCs, and that wild-type PIEZO1, but not conductance-dead PIEZO1 pore mutants, restores Yoda1 responsiveness across centriole number, pERK, and YAP readouts. For example, the PIEZO1 C terminus was shown to govern Ca²⁺ influx and ERK1/2 activation. Comparing full-length PIEZO1 with a C-terminal deletion in MCC-restricted rescue; loss of rescue of centriole amplification and ERK/YAP activation with the C-terminal deletion can provide a genetics-anchored specificity test beyond broad inhibitors.

      RESPONSE:

      • To address the reviewer’s concern, we will test whether Yoda1 affects ERK and Yap activation when Piezo1 is depleted. We appreciate the reviewer’s thoughtful suggestion to employ genetic rescue experiments with Piezo1 mutants. Unfortunately, these are not technically feasible in Xenopus, as the Piezo1 coding sequence is exceptionally large (~7.5 kb), and repeated attempts by our group to generate and express stable, translatable transcripts have been unsuccessful. To address genetic dependency and specificity despite these technical barriers, we have employed a combination of orthogonal strategies that together provide strong genetic and mechanistic evidence:

      • Mosaic loss-of-function experiments (Fig. 1) demonstrate that Piezo1 regulates centriole number in a cell-autonomous manner, ruling out global epithelial or indirect tissue-wide effects.

      • Pharmacological activation/inhibition with Piezo1-specific agonist (Yoda1) and inhibitors (GsMTx4, gadolinium) produced consistent phenotypes, including activation of downstream ERK and YAP readouts. Notably, Yoda1 is a Piezo-specific agonist, not a broad pharmacological agent.
      • Downstream pathway dissection (calcium chelation, PKC inhibition, ERK2 depletion, and YAP1 knockdown/rescue) consistently converges on the same phenotypes, reduced centriole amplification and altered Foxj1 expression, providing multiple independent lines of evidence that the Piezo1–Ca²⁺–PKC–ERK–YAP axis specifically controls centriole number.
      • Positive feedback regulation of Piezo1 expression by YAP/Foxj1 (Fig. 7) further strengthens the argument for a pathway-specific role rather than pleiotropic, indirect effects.

      Taken together, while full-length Piezo1 rescue experiments are technically not possible in Xenopus due to gene size constraints, our data employ state-of-the-art genetic, pharmacological, and orthogonal functional assays to rigorously test pathway specificity. These complementary approaches provide compelling evidence for the causal role of Piezo1-mediated mechanotransduction in centriole number control in MCCs.

      • Downstream bypass/rescue experiments: In PIEZO1 loss-of-function or BAPTA conditions, can enforcing MEK/ERK activation or YAP rescue the centriole number defect? Conversely, can MEK inhibitors block Yoda1-induced effects?

      RESPONSE: We appreciate the reviewer’s insightful questions.

      • We will express CA Yap in the Piezo1 KD background to assess if we can rescue centriole number. We also note that the converse experiment has already been performed in our study: 1) PKC inhibition abolishes Yoda1-induced ERK phosphorylation and nuclear localization (Fig. 2), 2) both MEK inhibition and ERK2 depletion block Yoda1-induced Yap activation and nuclear entry (Figs. 4, S2). Thus, we have directly demonstrated that MEK inhibition prevents Yoda1-induced effects, satisfying this aspect of the reviewer’s concern.

      ********************************************

      2- Image quantification and analysis must be described in greater detail in the Methods section, as they are central to the major conclusions of the manuscript. For example, the authors should explain how nuclear, cytoplasmic, and centriole segmentation were performed, and how relative protein levels in the nucleus versus the cytoplasm (e.g., YAP, volume- or area-based) were quantified. Specifically, the thresholds and segmentation criteria applied to different cellular structures under various conditions, as well as the use of Imaris and other software, should be clearly detailed.

      RESPONSE: We will describe the methods in greater detail.

      ********************************************

      3- PIEZO1 mRNA was shown to increase in a Foxj1-linked feedback loop. Does this increase translate into an increase in total protein levels?

      RESPONSE: If the reviewer is referring to Figure 7B, that panel was generated with the Piezo1 antibody, so yes, Piezo1 protein levels have increased.

      If the reviewer is referring to Figure 7C and D, we show that loss of Foxj1 leads to a reduction in Piezo1 mRNA expression.

      ********************************************

      4- Is the proposed signaling cascade active in mammalian multiciliated cells (e.g., airway epithelium)? If possible, testing this by using one of the major players of the pathway as a readout, such as ERK phosphorylation or YAP nuclear localization in mammalian MCCs, will reveal whether regulation of centriole number through this pathway is conserved and would strengthen the generality.


      RESPONSE: We agree with the reviewer that testing conservation of this pathway in mammalian MCCs is of great interest. Indeed, another group is currently investigating the role of Yap in the mammalian airway epithelium; in their temporally controlled Yap knockout model (the global Yap KO being embryonic lethal), they observed that Yap loss led to a reduction in centriole number. To avoid overlap and direct competition with this ongoing work, we chose to focus our efforts on Xenopus.

      Importantly, Xenopus has become a widely recognized and powerful system for MCC biology, enabling mechanistic dissection of centriole amplification and ciliogenesis. Several key discoveries in the field, including the identification of MCIDAS as a master regulator of MCC fate, were first made in Xenopus before being validated in mammals. Similarly, our study provides a mechanistic framework in Xenopus that can inform and guide ongoing studies in the mammalian airway.

      ********************************************

      5- Throughout the results section, there are multiple times where the authors raised specific hypotheses about their data (e.g. foxj1 regulation of number control, apical actin/YAP). However, they have not tested them. These hypotheses are very exciting and testing them experimentally, if possible, would strengthen the conclusions associated with them.

      RESPONSE: We are not sure what the reviewer means by “authors raised specific hypotheses about their data (e.g. foxj1 regulation of number control, apical actin/YAP). However, they have not tested them”, because:

      • Foxj1 regulation of centriole number: We demonstrate a clear reduction in centriole number upon Foxj1 depletion, and importantly, we extend this finding by showing that the reduction is tension-dependent (Fig. 6). We will perform a rescue assay to demonstrate the specificity.
      • Foxj1 and YAP: We never claimed that Foxj1 regulates YAP expression, and this is not part of our proposed model. Instead, our data show that Piezo1–ERK–YAP signaling regulates Foxj1.
      • Foxj1 and apical actin: Foxj1 regulation of apical F-actin has already been established in prior work, and in our study, we clearly observe reduced apical actin intensity in Foxj1-depleted MCCs (Fig. 6). To further strengthen this conclusion, we will provide a quantitative analysis of apical actin intensity in Foxj1 morphants.

      ********************************************

      TIMELINE: We will perform these additional proposed experiments. Since the first author, a postdoc on this manuscript, has started a new job and will be coming in on weekends to finish the experiments, we estimate it will take approximately 2-3 months to complete them.

      Minor comments

      MCC vs non MCC identification (Fig. 1): Clarify how non MCCs were distinguished from MCCs (e.g. markers/criteria). – Can the reviewer please clarify which panel or panels? Or provide more specific text that needs to be changed.

      Add the Kintner group reference linking motile cilia number and centriole number in Xenopus MCCs.– Can the reviewer clarify where and which reference? Thank you.

      3. Description of the revisions that have already been incorporated in the transferred manuscript

      Please insert a point-by-point reply describing the revisions that were already carried out and included in the transferred manuscript. If no revisions have been carried out yet, please leave this section empty.

      Reviewer 2

      Major comments:

      1. It should be clarified whether the immunoblots and the related quantitations in Figs. 2 and S2 are all from separate blots/ exposures. If so, they are not useful as controls, and these blots should be repeated with the relevant samples analyzed in parallel. Size markers and labels should be included (2B, 2G; S2B and S2G). An increase in total ERK would alter the interpretation of the increase in nuclear pERK in the IF experiments.

      RESPONSE: We thank the reviewer for raising this important point regarding clarification of the immunoblots. All experimental groups were analyzed in parallel with their corresponding controls. Because the primary antibodies for pERK and ERK were both raised in rabbit, we optimized our workflow to prevent protein loss during stripping and to ensure accurate visualization. Specifically, lysates from each experimental group were loaded in duplicate on the same gel, separated by a molecular weight ladder that served as a reference point. After transfer, the blot was cut along the ladder, and the two halves were processed in parallel: one probed with anti-pERK and the other with anti-ERK. This strategy ensured that all samples from a single experiment (e.g., Control and Yoda1-treated groups) were analyzed under identical conditions, with staining and imaging performed together at the same exposure. To enhance clarity, we have provided these data as uncut, full-length blots in Supplemental Figure 7 (Figure S7) in the revised manuscript.

      ********************************************

      Minor comments:

      1. Reference list should be checked for completeness; some citations lack journal/ volume/ page/ year details. – We have corrected the references.
      2. An 'overexposed' version of the image selected for centrioles in Figure 5F might be included with the Chibby-BFP at the same level as in the other figures. At present, the Yap KD cell in the image appears to have normal centrioles; this is potentially confusing, even though the authors clearly explain the matter in the text. – We have added a new panel to Fig. 5F to avoid confusion.
      3. It might be clearer to present injected/ uninjected in the same orientation in Fig. 6A and B. – Unfortunately, that is not possible because the injected and uninjected sides are left and right, and they cannot be in the same orientation.
      4. Figure 7B lacks the schematic described in the figure legend. – We have removed the Schematic sentence from the figure legend. That was an error on our side. Thank you for catching it.


      Reviewer 3


      1. Abstract: "how MCCs regulate centriole/cilia numbers remains a major knowledge gap" overstates the field; please soften to reflect recent advances (mechanics/apical area scaling; PIEZO1 implication). – We changed the text to “incompletely understood”.
      2. GsMTx4 rationale: State that GsMTx4 is a spider venom peptide that inhibits cationic mechanosensitive channels (including PIEZO1) and justify its use alongside Yoda1.– GsMTx4 was used in the previous manuscript, and its use was justified there. Here, we are only comparing the results. However, we have added a sentence describing what GsMTx4 is. We have also included a sentence explaining the use of Yoda1. “GsMTx4, a spider venom peptide used in our previous study, inhibits cationic mechanosensitive channels, including Piezo1.”

      “For this experiment, we used the Piezo1 channel-specific chemical agonist, Yoda1, to increase the sensitivity of Piezo1 and upregulate calcium entry into cells”

      Timeline statement: "Centriole amplification to migration and apical docking takes ~4-5 h (personal observation)" is not appropriate; either cite time lapse literature or include your own time lapse data.– We have added a reference that showed imaging for 2 hours, which was not long enough to capture the entire process from intercalation to maturation, so we have also kept “personal observation” in the manuscript. We are unaware of any study that has performed time-lapse imaging for 4 hours to capture the entire process of centriole amplification.

      Redundancy: The description of Yoda1 as a channel specific agonist is repeated; keep only once.- Removed

      "WT yap1 GFP construct previously used by Dr. Lance Davidson ..." should move construct description to Methods and keep only the citation in Results.– We moved it to Methods.

      "(Unpublished data; Dr. Mahjoub)" should be removed unless data are shown.- Removed

      Replace "as shown previously in our eLife paper" with "as we previously showed or shown previously (Kulkarni et al., 2021)".– We have made the change.

      The two hypotheses for how Foxj1 could regulate number under tension (actin remodeling vs. transcriptional control of amplification genes) belong in the Discussion unless tested. Moreover, the part discussing Yap sequestration by apical actin and the two possibilities presented should also go to the Discussion. – We have moved both to the discussion section.

      4. Description of analyses that authors prefer not to carry out

      Please include a point-by-point response explaining why some of the requested data or additional analyses might not be necessary or cannot be provided within the scope of a revision. This can be due to time or resource limitations or in case of disagreement about the necessity of such additional data given the scope of the study. Please leave empty if not applicable.

      Reviewer 3

      1- The hypothesis about the centriole pool of Piezo as the mechanosensor for centriole number regulation is very exciting and novel. Can localization-controlled variants be used to test whether a centriole-associated pool directly senses tension for number control (for example, centrosome-targeted PIEZO1 via a PACT tag)? Alternatively, broad cellular Ca sensors (GCaMP) or centrosome-proximal Ca sensors (e.g., PACT-GCaMP) can be used to detect local calcium microdomains during tethering or Yoda1 treatment.

      RESPONSE: We appreciate the reviewer's curiosity and excitement; however, these experiments will not alter the conclusion of this paper and will be part of the next study, which aims to delve deeper into how different pools of Piezo1 at centrioles versus cell junctions function in MCCs. To that point, we had thought about these experiments. As mentioned earlier, the Piezo1 coding sequence is exceptionally large (~7.5 kb), and repeated attempts by our group to generate and express stable, translatable transcripts have been unsuccessful. Thus, although the idea of centrosome-targeted PIEZO1 via a PACT tag is very exciting, it is not technically feasible. Beyond size, PIEZO1 is a large, trimeric plasma-membrane mechanosensitive channel that requires proper ER processing and bilayer incorporation. PACT localizes cargo to the centriole/pericentriolar material, not a membrane compartment; thus, a PACT-anchored PIEZO1 would be membrane-mismatched and almost certainly nonfunctional even if expressed.

      Second, centrosome-proximal GCaMP (PACT-GCaMP) would show correlation, not causation. This experiment does not address the question of the “centriole pool of Piezo as the mechanosensor for centriole number regulation”. It would only show whether Ca2+ influx is happening at the basal bodies, not whether and how that Ca2+ is essential for centriole amplification. For this purpose, we would need to find a way to block Ca2+ influx specifically at basal bodies, rather than junctions, which would require extensive controls.

      We do not claim that any specific Piezo1 or Ca2+ pool is critical for controlling centriole number and thus the suggested experiment would not alter the manuscript's conclusions. We therefore view the above as exciting future directions rather than prerequisites.

      ********************************************

      2- Because the proposed pathway is tension-sensing and the YAP pathway is tightly linked to the actin cytoskeleton, the role of the actin cytoskeleton in the proposed pathway should be tested directly. The authors mention different hypotheses around actin but have not tested them in the manuscript. For example, actin-dependent sequestration of Yap at the apical surface is intriguing. Does actin polymerization induced by drugs release Yap from the apical surface?

      RESPONSE: We would like to thank the reviewer for their suggestion. As per the reviewer's suggestion, we have moved this section to the discussion, stating that “In the future, we plan to address this question by examining how Yap is sequestered by apical actin.”

      However, we appreciate the reviewer’s enthusiasm and would like to share some experiments we are planning to test the hypothesis.

      We plan to examine whether actin polymerization or contractility is responsible for Yap sequestration at, or release from, the apical surface with the following experiments: 1) testing whether Yap is displaced by Jasplakinolide treatment, which stabilizes filamentous actin; 2) using a ROCK inhibitor to decrease contractility in the absence or presence of Yoda1; and 3) using genetic constructs such as Shroom3 to increase ROCK-mediated contractility and observe changes in Yap localization and dynamics.

      Although these experiments are interesting, they do not alter the conclusion of the current manuscript, and they represent future directions for our research.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Reviewer #1

      Summary: The authors have previously published Mass-spectrometry data that demonstrates a physical interaction between Sall4 and the BAF chromatin complex in iPSC derived neurectodermal cells that are a precursor cell state to neural crest cells. The authors sought to understand the basis of this interaction and investigate the role of Sall4 and the BAF chromatin remodelling complex during neural crest cell specification. The authors first validate this interaction with a co-IP between ARID1B subunit and Sall4 confirming the mass spec data. The authors then utilise in silico modelling to identify the specific interaction between the BAF complex and Sall4, suggesting that this contact is mediated through the BAF complex member DPF2. To functionally validate the role of Sall4 during neural crest specification, the authors utilise CRISPR-Cas9 to introduce a premature stop codon on one allele of Sall4 to generate iPSCs that are haploinsufficient for Sall4. Due to the reports of Sall4's role in pluripotency, the authors confirm that this model doesn't disrupt pluripotent stem cells and is viable to model the role of Sall4 during neural crest induction. The authors expand this assessment of Sall4 function further during their differentiation model to cranial neural crest cells, assessing Sall4 binding with Cut+Run sequencing, revealing that Sall4 binds to motifs that correspond to key genes in neural crest differentiation. Moreover, reduction in Sall4 expression also reduces the binding of the BAF complex, through Cut and Run for BRG1. Overall, the authors then propose a model by which Sall4 and BRG1 bind to and open enhancer regions in neurectodermal cells that enable complete differentiation to cranial neural crest cells.

      Overall, the data is clear and reproducible and offers a unique insight into the role of chromatin remodellers during cell fate specification.

      We thank the Reviewer for the nice words of appreciation of our manuscript.

      However, I have some minor comments.

      1- Using AlphaFold in silico modelling, the authors propose the interaction between the BAF complex with Sall4 is mediated by DPF2, but don't test it. Does a knockout, or knockdown of DPF2 prevent the interaction?

      We agree with the Reviewer that we are not functionally validating our computational prediction that DPF2 is the specific BAF subunit directly linking SALL4 with BAF. We chose not to perform the validation experiment for two main reasons:

      1) This would be outside of the scope of the paper. In fact, from a mechanistic point of view, we have confirmed via both Mass-spectrometry and co-IP with ARID1B that SALL4 and BAF interact in our system. Moreover, mechanistically we also extensively demonstrate that the interaction with SALL4 is required to recruit BAF at the neural crest induction enhancers and we further demonstrate that depletion of SALL4 impairs this. In our view, this was the focus of the manuscript. On the other hand, detecting with certainty which BAF subunit mediates the interaction with SALL4 would be outside the scope of the paper.

      2) Moreover, after careful consideration, we don’t think that even a knock-out of DPF2 would provide a definite answer to which exact BAF subunit mediates the interaction with SALL4. In fact, knock out of DPF2 could potentially disrupt BAF assembly or stability, and this could result in a disruption of the interaction with SALL4 even if DPF2 is not the very subunit mediating it (in other words the experiment could provide a false positive result). In our opinion, the only effective experiment would be mutating the DPF2 residues that we computationally predicted as responsible for the interaction with SALL4, but again this would be very laborious and out of the scope.

      That being said, we agree with the Reviewer that while the SALL4-BAF interaction was experimentally validated with robust approaches, the role of DPF2 in the interaction was only computationally predicted, which is a limitation of the study. We have now added a dedicated paragraph in the discussion to acknowledge this limitation.

      2- OPTIONAL: Does knockout of DPF2 phenocopy the Sall4 ko? This would be very interesting to include in the manuscript, but it would perhaps be a larger body of work.

      See point-1.

      3- Figure 1, the day of IP is not clearly described until later in the text. Please indicate it in the figure.

      We thank the Reviewer for pointing this out. This has been fixed.

      3- What is the expression of Sall1 (and other Sall paralogs) during differentiation? The same with the protein levels of Sall4: do these remain below 50%, or is this just during pluripotency?

      As recommended by the Reviewer, we have performed time-course WB of SALL1 and SALL4. These experiments revealed that SALL1 remains very lowly expressed in wild-type conditions across time points and all the way through differentiation until CNCC (see updated Supplementary Fig. S9). This is consistent with previous studies that demonstrated that SALL4, but not SALL1, is required for early mammalian development (see for example Miller et al. 2016, Development, and Koulle et al. 2025, bioRxiv). We performed the same time-course WB for SALL4, which revealed that SALL4 expression progressively decreases after day-5 (as expected) and is very low at the CNCC stage (day-14); therefore, we would expect the KO to remain at an even lower level at this stage.

      4- The authors hypothesise that Sall4 binds to enhancers- with the criteria for an enhancer being that these peaks > 1KB from the TSS are enhancers. Can this be reinforced by overlaying with other ChIP tracks that would give more confidence in this? There are several datasets from Joanna Wysocka's lab that also utilise this protocol which can give you more evidence to reinforce the claim and provide further detail as to the role of Sall4.

      We thank the Reviewer for this great suggestion. As recommended, we have used publicly available ChIP-seq data generated by the Wysocka lab (H3K4me1, H3K4me3) and also generated new H3K27ac ChIP-seq data. These experiments and analyses confirmed that these regions are putative CNCC enhancers (and a minority of them putative promoters), decorated with H3K4me1 and showing a progressive increase in H3K27ac after CNCC induction (day-5). See new Supplementary Figure S6.

      5- The authors state that cells fail to become cranial neural crest cells; however, they do not propose what the cells do instead. Do they become neural? Or do they stay pluripotent, which is one option given the higher expression of Nanog, OCT4 and OTX2 that are all expressed in pluripotent stem cells?

      We think that it is likely a mix of both. Pluripotency markers show mixed expression, but neuroectodermal markers are also highly expressed. This suggests that most cells successfully reach the neuroectodermal stage but fail to go beyond it, while some of the cells simply do not differentiate or regress back to pluripotency. We would rather refrain from overinterpreting what the KO cells become, as it is likely an aberrant cell type, but following the Reviewer’s indication we have added a paragraph in the discussion to speculate on this.

      6- In general, I would like to see the gating strategy and controls for the flow cytometry in a supplemental figure.

      As recommended by the Reviewer, we have added the gating strategy in Supplementary Fig. S4.

      7- For supplementary figure 1- please include the gene names in the main image panels rather than just the germ layer.

      Done. The figure is now Supplementary Figure S3 since two supplementary figures were added before.


      Reviewer #2

      Summary In this manuscript, the authors build on their previous work (Pagliaroli et al., 2021) where they identified an interaction between the transcription factor SALL4 and the BAF chromatin remodeling complex at Day-5 of an iPSC to CNCC differentiation protocol. In their current work, the authors begin by exploring this interaction further, leveraging AlphaFold to predict interaction surfaces between SALL4 and BAF complex members, considering both SALL4 splice isoforms: a longer SALL4A (associated with developmental processes) and a shorter SALL4B (associated with pluripotency). They propose that SALL4A may interact with DPF2, a BAF complex member, in an isoform-dependent manner. The authors next explore the role of SALL4 in craniofacial development, motivated by patient heterozygous loss of function mutations, leveraging iPSC cells with an engineered SALL4 frameshift mutation (SALL4-het-KO). Using this model, the authors first demonstrate that a reduced expression of SALL4 does not impact the iPSC identity, perhaps due to compensation via upregulation of SALL1. Upon differentiation to neuroectoderm, SALL4 haploinsufficiency causes a reduction in newly accessible sites which are associated with a reduction in SALL4 binding and therefore a loss of BAF complex recruitment. Interestingly, however, there were few transcriptional changes at this stage. Later in the CNCC differentiation at Day-14 when the wildtype cells have switched expression of CNCC markers, the SALL4-het-KO cells fail to switch cadherin expression associated with a transition from epithelial to mesenchymal state, and fail to induce CNCC specification and post-migratory markers. Together the authors propose that SALL4 recruits BAF to CNCC enhancers as early as the neuroectodermal stage, and failure of BAF recruitment in SALL4-het-KO lines results in a loss of open chromatin at regulatory regions required later for induction of the CNCC programme. 
The failure of the later differentiation is compelling in the light of the early stages of the differentiation progressing normally, and the authors outline an interesting proposed mechanism whereby SALL4 recruits BAF to remodel chromatin ahead of CNCC enhancer activation, a model that can be tested further in future work. The link between SALL4 DNA binding and BAF recruitment is nicely argued, and very interesting as altered chromatin accessibility at Day 5 in the neuroectodermal stage is associated with only few changes in gene expression, while gene expression is greatly impacted later in the CNCC stage at Day 14. The in silico predictions of SALL4-BAF interaction interfaces are perhaps less convincing, requiring experimental follow-up outside the scope of this paper. Some of the associated figures could perhaps be moved to the supplement to enhance the focus on the later functional genomics experiments.

      We thank the Reviewer for the nice words of appreciation of our manuscript.

      Major comments

      1. A lot of emphasis is placed on the AlphaFold predictions in Figure 1, however the predictions in Figure 1B appear to be mostly low or very low confidence scores (coloured yellow and orange). It is unclear how much weight can be placed on these predictions without functional follow-up, e.g. mutating certain residues and showing impact on the interaction by co-IP. The latter parts of the manuscript are much better supported experimentally, and therefore perhaps some of the Figure 1 could move to a Supplemental Figure (e.g. the right-hand part of 1B, and the lower part of Figure 1C showing SALL4B predicted interactions). The limitations of AlphaFold predictions should be acknowledged and the authors should discuss how these predicted interactions could be experimentally explored further in the future.

      As recommended by the Reviewer, we have moved part of the AlphaFold predictions to Supplementary Figure S1, and we added a paragraph in the discussion to acknowledge the limitations of AlphaFold.

      The authors only show data for one heterozygous knockout clone for SALL4. It is usual to have more than one clone to mitigate potential clonal effects. The authors should comment why they only have one clone and include any data for a second clone for key experiments if they already have this. Alternatively, the authors could provide any quality control information generated during production of this line, for example if any additional genotyping was performed.

      We apologize for the confusion and for our lack of clarity on this. We have used two clones (one generated with an 11 bp deletion, one with a 19 bp deletion, both in exon-1; see also point 6 of your minor points). The two clones were used as biological replicates: for example, the two ATAC-seq replicates at each time point were performed with the two different clones, and the three RNA-seq replicates comprised two technical replicates of the clone with the 11 bp deletion and one replicate of the clone with the 19 bp deletion. We have clarified this in the methods section of the manuscript and added a Supplementary Figure (S2) showing the editing strategy for the two clones. Thank you for catching this.

      The authors show all genomics data (ATAC-seq, CUT&RUN and ChIP-seq) as heatmaps and average profiles. It would be valuable to see some representative loci for the ATAC seq (perhaps along with SALL4 and BRG1 recruitment) at some representative and interesting loci.

      As recommended by the Reviewer, we have added Genome Browser screenshots of representative loci in Fig. 6.

      Figure 4A. The schematic could be improved by including brightfield or immunofluorescent images at the three stages of the differentiation. Are the iPS cells seeded as single cells, or passaged as colonies before starting the differentiation. Further details are required in the methods to clarify how the differentiation is performed, for example at what Day are the differentiating cells passaged, this is not shown on the schematic in Figure 4A.

      As recommended, we added IF images in the Fig. 4A schematic, and added more details in the methods.

      There is likely some heterogeneity of cell types in the differentiation at Day 5 and Day 14. Can the authors comment on this from previous publications or perhaps conduct some IF for markers to demonstrate what proportions of cells are neuroectoderm at Day 5 and CNCCs at Day 14.

      The differentiation starts with single cells that aggregate to form neuroectodermal clusters, as per original protocol. The CNCCs that we obtain with this protocol homogeneously express CNCC markers, as shown by IF of SOX9 in Fig. 4A. For the day-5, as recommended we have added IF for PAX6 also showing homogeneous expression (Fig. 4A).

      For the motif analysis for Day 5-specific SALL4 binding sites (Figure 4E), was de novo motif calling performed? Were any binding sites reminiscent of a SALL4 binding site observed (e.g. an AT-rich motif)? Could the authors comment on this in the text - if there is no SALL4 binding motif, does this suggest SALL4 is recruited indirectly to these sites via interaction with another transcription factor for example?

      Similar to SALL4, SALL1 also recognizes AT-rich motifs. However, while we found AT-rich motifs to be enriched in our day-5 motif analysis (in the regions that gain SALL4 binding upon differentiation), the enrichment is not particularly strong, and several other motifs are significantly more enriched, suggesting that, as the Reviewer mentioned, SALL4 might be recruited indirectly to these sites by other factors. We have added a paragraph on this in the discussion.

      Does SALL1 remain upregulated at Day-5 and Day-14 of the differentiation for the SALL4-het-KO line? Are binding sites known for this TF and were they detected in the motif analysis performed? Further discussion of the impact of the overexpression of SALL1 on the phenotypes observed is warranted - e.g. for Figure 5F, could the sites associated with a gain of BRG1 peaks upon loss of SALL4 be associated with SALL1 being upregulated and 'hijacking' BAF recruitment to distinct sites associated with nervous system development? Is SALL1 still upregulated at Day 5?

      As mentioned above, SALL1 also recognizes AT-rich motifs but, similar to SALL4, also binds nonspecifically, likely in cooperation with other TFs. As the Reviewer suggested, it is certainly possible that some of the sites associated with a gain of BRG1 peaks upon loss of SALL4 could be associated with SALL1 being upregulated and 'hijacking' BAF recruitment to distinct sites. While this is speculative, we have added a paragraph on this in the discussion.

      Related to the point above, SALL4A is proposed to have an isoform-specific interaction with the BAF complex. It would be valuable to plot SALL4A and SALL4B expression from the available RNA-seq data at Day 0, 5 and 14 to explore whether stage-specific isoform expression matches with the proposed role of SALL4A to interact with BAF at Day 5. It could be valuable to also look at expression of SALL1, 2 and 3 across the time course to see whether additional compensation mechanisms are at play during the differentiation.

      Thanks for suggesting this. We performed a time-course analysis of isoform-specific gene expression, which showed that SALL4B expression remains low throughout differentiation, while SALL4A expression increases upon differentiation cues and remains at high levels until the end. We have added this to Supplementary Fig. S9. Moreover, we have performed an additional experiment using pomalidomide, a thalidomide derivative that selectively degrades SALL4A but not SALL4B. Notably, SALL4A degradation recapitulated the main findings obtained with the CRISPR-KO of SALL4, further supporting that SALL4A is the isoform involved in CNCC induction (see new Fig. 8).

      At line 264, The authors state "SALL4 recruits the BAF complex at CNCC developmental enhancers to increase chromatin accessibility". Given that this analysis is performed at Day 5 of the differentiation, which is labelled as neuroectoderm what evidence do the authors have that these are specifically CNCC enhancers? Statements relating to enhancers should generally be re-phrased to putative enhancers (as no functional evidence is provided for enhancer activity), and further evidence could be provided to support that these are CNCC-specific regulatory elements, e.g. showing representative gene loci from CNCC-specific genes. Discussion of the RNA-seq presented in Supplementary Figure 2B may also be appropriate to introduce here given that large numbers of accessible chromatin sites are detected while the expression of very few genes is impacted, suggesting these sites may become active enhancers at a later developmental stage.

      As also recommended by the other Reviewer, to further characterize these sites, we have used publicly available histone modification ChIP-seq data (H3K4me1, H3K4me3) generated by the Wysocka lab and also generated new H3K27ac ChIP-seq data. These experiments and analyses confirmed that these regions are putative CNCC enhancers (and a minority of them putative promoters), all decorated with H3K4me1, and all showing a progressive increase in H3K27ac after CNCC induction (day-5). See new Supplementary Figure S6.

      1. Do any of the putative CNCC enhancers detected at Day 5 as being sensitive to SALL4 downregulation and loss of BAF recruitment overlap with previously tested VISTA enhancers (https://enhancer.lbl.gov/vista/)?

      Yes, we have found examples of overlap and have included two of them in the updated Figure 6 as Genome Browser screenshots.

      Minor comments

      1. The authors are missing references in the introduction "a subpopulation of neural crest cells that migrate dorsolaterally to give rise to the cartilage and bones of the face and anterior skull, as well as cranial neurons and glia".

      Fixed, thank you.

      The discussion of congenital malformations associated with SALL4 haploinsufficiency is brief in the introduction. From OMIM, SALL4 heterozygous mutations are implicated with the condition Duane-radial ray syndrome (DRRS) with "upper limb anomalies, ocular anomalies, and, in some cases, renal anomalies... The ocular anomalies usually include Duane anomaly". That Duane anomaly is one phenotype among a number for patients with SALL4 haploinsufficiency could be clarified in the introduction. Of note, this is stated more clearly in the discussion but needs re-wording in the introduction.

      Done, thank you.

      The statements "show that the SALL4A isoform directly interacts with the BAF complex subunit DPF2 through its zinc-finger-3 domain" and "this interaction occurs between the zinc-finger-cluster-3 (ZFC3) domain of SALL4A and the plant homeodomains (PHDs) of DPF2" in the introduction appear overstated and should be toned down. To show this the authors would need to mutate or delete the proposed important zinc-finger domains from SALL4A, which is outside the scope of this work. Notably, this is less strongly-stated elsewhere in the manuscript, e.g "predict that this interaction is mediated by the BAF subunit DPF2", Line 162.

      Done, thank you.

      Could the authors clarify why 3 Alphafold output models are shown for SALL4B in Figure 1C, and only one output model for SALL4A?

      AlphaFold3 produces five separate predicted models per protein combination (Model_0 … Model_4), each derived from slightly different network parameters or initializations. The final output prioritizes the model with the highest confidence score. This multi-model strategy enables the identification of the most robust conformation while providing a measure of structural uncertainty (as per GitHub documentation for AlphaFold3). We have conducted the same analysis for SALL4A as we did for SALL4B. Specifically, SALL4A interacts with the AT-rich DNA in models 0, 1, and 2; therefore, models 3 and 4 were excluded. When analysing models 1 and 2, we found a higher number of residues involved in the interaction (>800 instead of 396). As for Model 0, only the interactions between residues belonging to an annotated functional domain (ZFs and PHDs) were considered.

      In Model 1: SALL4A and DPF2 interact mainly through ZF6 and ZF7, and not ZF5 as in Model 0.

      In Model 2: SALL4A and DPF2 interact mainly through ZF5 and ZF6, and not ZF7 as in Model 0. In contrast, this model shows an interaction with ZF1 not shown in the other two models, but with a higher PAE (31 on average, compared to 25 to 27 on average for the other two ZFs).

      Therefore, we considered Model 0, as it is the model with the highest confidence and is representative of all significant models (it includes ZF5, ZF6, and ZF7).

      Line 121. The authors state "DPF2, a broadly expressed BAF subunit,", but don't show expression during their CNCC differentiation. It would be good to include expression of DPF2 in Figure 1E.

      Done, thank you.

      The text states "a 11 bp deletion within the 3'-terminus of exon 1 of SALL4", while the figure legend states, "Sanger sequencing confirming the 19 bp deletion in one allele of SALL4 is displayed". The authors should clarify this disparity and experimentally confirm the deletion, e.g. by TA-cloning the two alleles and sequencing these separately to show that one allele is wildtype and the other has a frameshift deletion.

      We apologize for the confusion. As stated above (point-2 of the major comments), we have used two clones (one generated with an 11 bp deletion, one with a 19 bp deletion, both in exon-1; see also point 6 of your minor points). The two clones were used as biological replicates (see response above for details). The deletion in both clones was experimentally confirmed by Sanger sequencing by the company that generated the lines for us (Synthego). The strategy for the two clones is now also shown in Supplementary Fig. S2.

      The authors generate an 11-bp (or 19-bp?) deletion in exon-1 - it would be valuable to include a discussion whether patients have been identified with deletions and frame-shift mutations in this region of SALL4 exon-1. And also clarify, if not clearly stated in the text, that both SALL4A and SALL4B will be impacted by this mutation. Are there examples of patient mutations which only impact SALL4A?

      As requested, we have added a discussion paragraph to discuss this. And, yes, both SALL4A and SALL4B are impacted by both deletions in both clones (11 bp and 19 bp deletion).

      Regarding patient variants in exon 1 and patient variants that only impact SALL4A, we could only find one published pathogenic 170 bp deletion in exon 1 (VCV000642045.7). The majority of the pathogenic or likely pathogenic variants are located in exon 2. In particular, of the 63 reported pathogenic (or likely pathogenic) clinical variants, 42 were located in exon 2. Among these, 28 are located in the portion shared by both SALL4A and SALL4B, while the remaining 14 are SALL4A-specific.

      For the SALL4 blots in Figure 2B, is the antibody expected to detect both isoforms (SALL4A and SALL4B), and which isoform is shown? If two isoforms are detected, they should both be presented in the figure.

      Yes, the antibody detects both isoforms, and we now present both in Figure 2, as recommended.

      SALL4 expression should be shown for Figure 2C to see whether the >50% down-regulation of SALL4 at the protein level may be partially driven by transcriptional changes.

      Done, thank you. As expected, SALL4 mRNA expression in the KO line is comparable to wild-type conditions, yet the SALL4 protein level is significantly decreased, likely because of autoregulatory mechanisms coupled with nonsense-mediated decay of the mutated allele. We also note that SALL4 usually forms homodimers; an insufficient amount of protein could therefore also lead to degradation of the monomers.

      The number of experimental replicates should be indicated in all figure legends where relevant. Raw data points should be plotted visibly over the violin plots (e.g. Figure 2C).

      Done, thank you.

      For Figure 3A, the images of the DAPI and NANOG/OCT4 staining should be shown separately in addition to the overlay.

      Done, thank you.

      The metric 'Corrected Total Cell Fluorescence (CTCF)' should be described in the methods. The number of images used for the quantification in Figure 3A should be

      Done, thank you.

      Figure 3C - what are the 114 differentially expressed genes? Some interesting genes could be labelled on the plot and the data used to generate this plot should be included as a Supplementary Table. Supplementary Tables should similarly be provided for Figure 6C, Day 14 and Supplementary Figure 2B, Day 5.

      As recommended, we have highlighted some interesting genes in the volcano plot and also included all the expression data for all genes in Supplementary Table S3.

      Figure 4B. The shared peaks are not shown. For completeness, it would be ideal to show these sites also.

      Done, thank you.

      Figure 4C is difficult to interpret. Why is the plot asymmetric to the left versus right? What does the axis represent - % of binding sites?

      The asymmetry arises because more peaks lie downstream of the TSS than upstream of it. This is consistent with the fact that many SALL4 peaks are in introns, likely representing intronic enhancers.

      Line 224-225. What do n= 3,729 and n= 6,860 refer to? There appear to be many more binding sites indicated in Figure 4B, therefore these numbers cannot represent 86% and 97% of sites?

      Thank you for pointing this out; we should have specified this in the text. Those numbers refer to the genes whose TSS is closest to each SALL4 peak. Notably, multiple peaks can share the same closest TSS, hence the discrepancy between the number of peaks and the number of nearest genes.

      Raw numbers:

      • Day-0 RAW = 6,104 (peaks = 6,114);
      • Day-5 RAW = 17,131 (peaks = 17,137). The raw data are now reported in Supplementary Table 4.
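
      The difference between a peak count and a nearest-gene count can be illustrated with a minimal sketch. It assumes a single chromosome, a sorted list of TSS coordinates, and peak centers as plain integers; the gene names, coordinates, and the function `nearest_tss_gene` are hypothetical and are not taken from the manuscript or its pipeline.

```python
from bisect import bisect_left

def nearest_tss_gene(peak_center, tss_positions, tss_genes):
    """Return the gene whose TSS is closest to peak_center.

    tss_positions must be sorted ascending and parallel to tss_genes.
    Ties are resolved in favor of the upstream (lower-coordinate) TSS.
    """
    i = bisect_left(tss_positions, peak_center)
    # Only the TSSs immediately left and right of the peak can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(tss_positions)]
    best = min(candidates, key=lambda j: abs(tss_positions[j] - peak_center))
    return tss_genes[best]

# Hypothetical toy data: three genes, six peaks on one chromosome.
tss_positions = [1_000, 50_000, 120_000]
tss_genes = ["GENE_A", "GENE_B", "GENE_C"]
peak_centers = [900, 1_400, 3_000, 49_000, 51_500, 119_000]

assigned = [nearest_tss_gene(p, tss_positions, tss_genes) for p in peak_centers]
n_peaks = len(assigned)       # every peak gets exactly one nearest gene
n_genes = len(set(assigned))  # several peaks may share the same gene

print(f"{n_peaks} peaks map to {n_genes} distinct nearest genes")
```

      In this toy case three peaks share GENE_A and two share GENE_B, so the number of distinct nearest genes is smaller than the number of peaks.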

      Figure 4E. Several TFs mentioned in the text (Line 243) are not shown in the figure, it would be good to show all TFs motifs mentioned in the text in this figure. Again, there is no mention of whether a sequence-specific motif is detected for SALL4 (e.g. an AT-rich sequence) from this motif analysis.

      Done, thank you. An AT-rich sequence, resembling the SALL4 motif, was detected in a small minority of sites (this is now shown in Supplementary Figure S5), suggesting that SALL4 engages chromatin in a broad manner, going beyond its preferred motif, possibly in cooperation with other TFs. This is consistent with many studies in mESCs showing that SALL4 binds at OCT4/NANOG/SOX2 target motifs. This is now discussed in a dedicated paragraph in the discussion.

      Figure 4G. How was the ATAC-seq data normalized for the WT and SALL4-het-KO lines for this comparison? The background levels of accessibility seem quite different in Replicate 1.

      The bigWig files used to make the heatmaps are normalized by sequencing depth using the deepTools suite (RPKM normalization).
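
      For reference, per-bin RPKM scaling divides the reads in a bin by the library size in millions and the bin length in kilobases. The sketch below illustrates that arithmetic only; the function name `rpkm_per_bin`, the read counts, the library sizes, and the 50 bp bin size are illustrative assumptions, not values from the manuscript.

```python
def rpkm_per_bin(bin_counts, total_mapped_reads, bin_length_bp):
    """Scale raw per-bin read counts to RPKM.

    RPKM per bin = reads in bin / (mapped reads in millions * bin length in kb)
    """
    scale = (total_mapped_reads / 1e6) * (bin_length_bp / 1e3)
    return [count / scale for count in bin_counts]

# Hypothetical libraries: the WT sample is sequenced twice as deeply as the KO,
# so its raw counts are twice as high over the same three bins.
wt = rpkm_per_bin([10, 40, 20], total_mapped_reads=20_000_000, bin_length_bp=50)
ko = rpkm_per_bin([5, 20, 10], total_mapped_reads=10_000_000, bin_length_bp=50)

print(wt)
print(ko)
```

      After scaling, the two toy tracks coincide, which is the intended effect of depth normalization when the underlying signal is the same.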

      Figures 5B-C could be exchanged to flow better with the text. A Venn diagram could be included to show the overlap between the sites losing BRG1 in SALL4-het-KO (13,505 sites) and the Day5-specific SALL4 sites (17,137 sites).

      Done, thank you.

      At Day 5, the authors suggest a shift towards neural differentiation. It could be interesting for the authors to perform qRT-PCR at Day 5 for some neural markers or look in the Day 14 data for markers of neural differentiation at the expense of CNCC markers.

      See updated Supplementary Fig. S8, where we show timecourse expression of several genes, including neural markers.

      Is the data used to plot Figure 5D the same as Figure 4G. If so, why is only one replicate shown in Figure 5D?

      Only one replicate was shown in the main figure purely for lack of space, but the experiment was replicated twice (with the two different clones), and the results were exactly the same. See plots below for your convenience:

      Figure 6A. How many replicates are shown? If n=2, boxplots are not an appropriate way to represent the distribution of the data. Please include n= X in the figure legend and plot the raw data points also.

      Done, thank you, and as suggested we are no longer using boxplots for this panel.

      Figure 6B. What is the significance of CD99 for CNCC differentiation?

      Figure 6F. No error bars are shown; how many replicates were performed for this time course? The linear regression line does not appear to add much value and could be removed.

      As suggested, we have removed these plots and replaced them with individual genes plots, which include error bars. See updated Supplementary Figure S8.

      At line 304, the authors state "while SALL4-het-KO showed a significant downregulation of these genes". Perhaps 'failed to induce these genes' may be more accurate unless they were expressed at Day 5 and downregulated at Day 14.

      Done, thank you.

      Lines 332-335. The genes selected for pluripotency, neural plate border, CNCC specification could be plotted separately in the Supplement to show individual gene expression dynamics.

      Done, thank you, see point 24.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      In this manuscript, Singh, Wu and colleagues explore functional links between septins and the exocyst complex. The exocyst is a conserved octameric complex that mediates the tethering of secretory vesicles for exocytosis in eukaryotes. In fission yeast cells, the exocyst is necessary for cell division, where it localizes mostly at the rim of the division plane, but septins, which localize in a similar manner, are non-essential. The main findings of the work are that septins are required for the specific localization of the exocyst to the rim of the division plane, and the likely consequent localization of the glucanase Eng1 at this same location, where it is known to promote cell separation. In the absence of septins, the exocyst still localizes to the division plane but is not restricted to the rim. They also show some defects in the localization of secretory vesicles and glucan synthase cargo. They further propose that interactions between septins and exocysts are direct, as shown through Alphafold2 predictions (of unclear strength) and clean coIP experiments. 

      Strengths: 

      The septin, exocyst and Eng1 localization data are well supported, showing that the septin rim recruits the exocyst and (likely consequently) the Eng1 glucanase at this location. One major finding of the manuscript is that of a physical interaction between septins and exocyst subunits. Indeed, many of the coIPs supporting this discovery are very clear. 

      Weaknesses: 

      I am less convinced by the strength of the physical interaction of septins with the exocyst complex. Notably, one important open question is whether septins interact with the intact exocyst complex, as claimed in the text, or whether the interactions occur only with individual subunits. The two-hybrid and coIP data only show weak interactions with individual subunits, and some coIPs (for instance Sec3 and Exo70 with Spn1 and Spn4) are negative, suggesting that the exocyst complex does not remain intact in these experiments.

      Given the known structure of the full exocyst complex and septin filaments (at least in S. cerevisiae), the Alphafold2 predicted structure could be used to probe whether the proposed interaction sites are compatible with full complex formation.  

      We thank the reviewer for these important and insightful comments. We agree that our current data, particularly the data from yeast two-hybrid and co-immunoprecipitation (coIP) assays, primarily reveal interactions between individual septin and exocyst subunits, and do not conclusively demonstrate binding of septins to the fully assembled exocyst complex. We recognize this as a key limitation and have revised the manuscript text accordingly to clarify this point.

      We also appreciate the reviewer’s suggestion to use structural prediction to further assess the plausibility of these interactions. We have now used the structure of the full Saccharomyces cerevisiae exocyst complex (4.4 Å resolution) published by the Guo group (Mei et al., 2018) to examine the interfaces of the septin-exocyst interactions, assuming that the S. pombe exocyst has a similar structure. We checked all the interacting residues on the exocyst complex and septins from our AlphaFold modeling to determine whether these predicted interactions are structurally compatible. Our analysis reveals that the majority of subunit interactions are sterically feasible, while a few would likely require partial disassembly or flexible conformations. These new insights have been added to the revised Results and Discussion sections (Figure Supplement S4, S5 and Videos 4-7).

      While we cannot fully resolve whether septins engage with the whole exocyst complex versus selected subunits, our combined data support a model that septins scaffold or spatially regulate the exocyst localization at the division site, potentially through dynamic and multivalent interactions. We now explicitly state this more cautious interpretation in the revised manuscript.

      Mei, K., Li, Y., Wang, S., Shao, G., Wang, J., Ding, Y., Luo, G., Yue, P., Liu, J.-J., Wang, X., Dong, M.-Q., Wang, H.-W. and Guo, W. 2018. Cryo-EM structure of the exocyst complex. Nature Structural & Molecular Biology, 25(2), pp. 139-146.

      The effect of spn1∆ on Eng1 localization is very clear, but the effect on secretory vesicles (Ypt3, Syb1) and glucan synthase Bgs1 is less convincing. The effect is small, and it is not clear how the cells are matched for the stage of cytokinesis. 

      For localizations and quantifications of Eng1, Ypt3, Syb1, and Bgs1 shown in Figures 6 and 7, cells with a closed septum (at or after the end of contractile-ring constriction) were quantified or highlighted. Fluorescence intensity at the division site was quantified by line scan with a line width of 3 pixels. For Syb1 (Figure 6D), we quantified cells at the end of ring constriction (when Rlc1-tdTomato had constricted to a dot) in the middle focal plane. The exact same lines were drawn in both the Rlc1 and Syb1 channels. The center of the line scan was defined as the pixel with the brightest Rlc1 value. All data were aligned by the center and plotted. For Bgs1 (Figure 7A), we quantified cells in which the Rlc1 signal had disappeared from the division site. The line was drawn in the Bgs1 channel in the middle focal plane. The center of the line scan was defined as the pixel with the brightest Bgs1 value.

      All data were aligned by the center and plotted. These details were added to the Materials and Methods.
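
      The alignment step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: each line scan is assumed to be a one-dimensional list of intensities already averaged across the line width, and the function names (`align_by_peak`, `mean_profile`) and the intensity values are hypothetical.

```python
def align_by_peak(reference, signal, half_width):
    """Center a line-scan profile on the brightest pixel of a reference profile.

    Returns 2 * half_width + 1 values from `signal`; positions that fall
    outside the scan are reported as None.
    """
    center = max(range(len(reference)), key=lambda i: reference[i])
    window = range(center - half_width, center + half_width + 1)
    return [signal[i] if 0 <= i < len(signal) else None for i in window]

def mean_profile(aligned_profiles):
    """Average aligned profiles position by position, skipping missing values."""
    means = []
    for column in zip(*aligned_profiles):
        values = [v for v in column if v is not None]
        means.append(sum(values) / len(values) if values else None)
    return means

# Hypothetical two-channel scans from two cells (reference = Rlc1, signal = Syb1).
rlc1_a, syb1_a = [1, 2, 9, 2, 1], [3, 5, 4, 6, 3]
rlc1_b, syb1_b = [1, 8, 2, 1, 1], [5, 2, 7, 3, 1]

aligned = [
    align_by_peak(rlc1_a, syb1_a, half_width=1),
    align_by_peak(rlc1_b, syb1_b, half_width=1),
]
print(mean_profile(aligned))
```

      For a single-channel case such as Bgs1, the same function can be called with the signal profile passed as its own reference, so that each scan is centered on its brightest pixel.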

      Reviewer #2 (Public Review): 

      Summary: 

      This interesting study implicates the direct interaction between two multi-subunit complexes, known as the exocyst and septin complexes, in the function of both complexes during cytokinesis in fission yeast. While previous work from several labs had implicated roles for the exocyst and septin complexes in cytokinesis and cell separation, this study describes the importance of protein:protein interaction between these complexes in mediating the functions of these complexes in cytokinesis. Previous studies in neurons had suggested interactions between septins and exocyst complexes occur but the functional importance of such interactions was not known. Moreover, in baker's yeast where both of these complexes have been extensively studied - no evidence of such an interaction has been uncovered despite numerous studies which should have detected it. Therefore while exocyst:septin interactions appear to be conserved in several systems, it appears likely that budding yeast are the exception--having lost this conserved interaction. 

      Strengths: 

      The strengths of this work include the rigorous analysis of the interaction using multiple methods including Co-IP of tagged but endogenously expressed proteins, 2 hybrid interaction, and Alphafold Multimer. Careful quantitative analysis of the effects of loss of function in each complex and the effects on localization and dynamics of each complex was also a strength. Taken together this work convincingly describes that these two complexes do interact and that this interaction plays an important role in post Golgi vesicle targeting during cytokinesis. 

      Weaknesses: 

      The authors used Alphafold Multimer to predict (largely successfully) which subunits were most likely to be involved in direct interactions between the complexes. It would be very interesting to compare this to a parallel analysis on the budding yeast septin and exocyst complexes where it is quite clear that detectable interactions between the exocyst and septins (using the same methods) do not exist. Presumably the resulting pLDDT scores will be significantly lower. These are in silico experiments and should not be difficult to carry out. 

      We thank the reviewer for this insightful suggestion. To assess the specificity of the predicted interactions between septins and the exocyst complex in S. pombe, we performed a comparative AlphaFold2 analysis using some of the homologous subunits from Saccharomyces cerevisiae. We modeled two interactions, Cdc10-Sec5 and Cdc10-Sec15 (Cdc10 is the Spn2 homolog), using the same pipeline and parameters, at the time when we did the modeling for S. pombe. We did not find interactions between them using the criteria we applied to the fission yeast proteins in this study. These results support the notion that the predicted septin–exocyst interactions in S. pombe are not generalizable to budding yeast. Unfortunately, we did not test all other combinations at that time, and the AlphaFold2 platform is not available to us now (it showed system error messages when we tried recently). We thank the reviewer again for this helpful suggestion, which should strengthen the evolutionary interpretation of the septin-exocyst interactions once it can be carried out systematically.

      Reviewer #3 (Public Review): 

      Septins in several systems are thought to guide the location of exocytosis, and they have been found to interact with the exocyst vesicle-tethering complex in some cells. However, it is not known whether such interactions are direct or indirect. Moreover, septin-exocyst physical associations were not detected in several other systems, including yeasts, making it unclear whether such interactions reflect a conserved septin-exocytosis link or whether they may be missed if they depend on septin polymerization or association into higher-order structures. Singh et al. set out to define whether and how septins influence the exocyst during S. pombe cytokinesis. Based on three lines of evidence, the authors conclude that septins directly bind to exocyst subunits to regulate localization of the exocyst and vesicle secretion during cytokinesis. The conclusions are consistent with the data presented, but some interpretations need to be clarified and extended: 

      (1) The first line of evidence examines septin and exocyst localization during cytokinesis in wild-type and septin-mutant or exocyst-mutant yeast. Quantitative imaging convincingly shows that the detailed localization of the exocyst at the division site is perturbed in septin mutants, and that this is accompanied by modest accumulation of vesicles and vesicle cargos. Whether that is sufficient to explain the increased thickness of the division septum in septin mutants remains unclear.

      The modest accumulation of vesicles and vesicle cargos at the division site is one of the reasons for the increased thickness of the division septum in septin mutants. It is more likely that the misplaced exocyst can still tether vesicles along the division plane (less likely at the rim) without septins. Due to the lack of the glucanase Eng1 at the rim of the division plane in septin mutants, daughter-cell separation is delayed and then cells continue to thicken the septum. We have added these points to the Discussion.

      (2) The second line of evidence involves a comprehensive Alphafold2 analysis of potential pair-wise interactions between septin and exocyst subunits. This identifies several putative interactions in silico, but it is unclear whether the identified interaction surfaces would be available in the full septin or exocyst complexes.  

      We thank the reviewer for raising this important point. We fully agree that a key limitation of pairwise AlphaFold predictions is that they do not account for the higher-order structural context of multimeric protein complexes, such as septin hetero-oligomers or the assembled exocyst complex. As a result, some of the predicted interfaces could indeed be conformationally restricted in the native state.

      To address this concern, we predicted the S. pombe exocyst and septin structures using AlphaFold3 and mapped the predicted contact residues onto the predicted structures. Most predicted interfaces (86% for the exocyst and 86-96% for septins) appear to be located on accessible surfaces in the assembled complexes (Figure supplements S4 and S5; Videos 4-7), suggesting that these interactions are sterically plausible. We have added this important caveat to the text of the revised manuscript, highlighting the interface accessibility within the assembled complexes. We appreciate the reviewer’s insight, which helped us strengthen the interpretation and limitations of the AlphaFold-based analysis.
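
      As an illustration of how a percentage such as "86% of predicted interface residues are accessible" can be computed, the following is a minimal sketch only. The response does not say how accessibility was scored, so the function name `accessible_fraction`, the use of relative solvent accessibility, and the 0.2 cutoff are all assumptions made for this example, and the residue values are hypothetical.

```python
def accessible_fraction(interface_residues, relative_accessibility, threshold=0.2):
    """Return the fraction of interface residues that are solvent-exposed.

    interface_residues: iterable of (subunit, residue_number) keys predicted
        to contact the partner complex.
    relative_accessibility: dict mapping the same keys to relative solvent
        accessibility (0 = buried, 1 = fully exposed) measured in the
        assembled complex.
    threshold: assumed cutoff at or above which a residue counts as accessible.
    """
    residues = list(interface_residues)
    if not residues:
        raise ValueError("no interface residues given")
    # Residues missing from the accessibility table are treated as buried.
    exposed = [r for r in residues
               if relative_accessibility.get(r, 0.0) >= threshold]
    return len(exposed) / len(residues)


# Hypothetical values for illustration only, not data from the manuscript.
interface = [("Sec15", 10), ("Sec15", 11), ("Sec15", 12), ("Sec15", 13)]
rsa = {("Sec15", 10): 0.50, ("Sec15", 11): 0.05, ("Sec15", 12): 0.30}
fraction = accessible_fraction(interface, rsa)
```

      Treating residues without an accessibility value as buried is a conservative choice; it can only lower the reported fraction.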

      (3) The third line of evidence uses co-immunoprecipitation and yeast two hybrid assays to show that several physical interactions predicted by Alphafold2 can be detected, leading the authors to conclude that they have identified direct interactions. However, both methods leave open the possibility that the interactions are indirect and mediated by other proteins in the fission yeast extract (co-IP) or budding yeast cell (two-hybrid). 

      We thank the reviewer for this important clarification. We agree that co-immunoprecipitation (co-IP) and yeast two-hybrid (Y2H) assays cannot conclusively distinguish between direct and indirect interactions. As the reviewer points out, co-IPs may reflect associations mediated by bridging proteins within the fission yeast extract, and Y2H readouts can be influenced by fusion context or endogenous host proteins. In our manuscript, we have now revised the relevant statements in the Results and Discussion sections to clarify that the observed associations are consistent with direct interactions predicted by AlphaFold2, but cannot alone establish direct binding. We have also tempered our terminology, replacing phrases such as “direct interaction” with “physical association consistent with direct binding,” where appropriate.

      (4) Based on prior studies it would be expected that the large majority of both septins and exocyst subunits are present in cells and extracts as stoichiometric complexes. Thus, one would expect any septin-exocyst interaction to yield associations detectable with multiple subunits, yet co-IPs were not detected in some combinations. It is therefore unclear whether the interactions reflect associations between fully-formed functional complexes or perhaps between transient folding intermediates. 

      We thank the reviewer for this thoughtful observation. We agree that both septins and exocyst subunits are generally understood to exist in cells as stable, stoichiometric complexes, and that interactions between fully assembled complexes might be expected to yield co-immunoprecipitation signals involving multiple subunits from each complex. However, it was also found that >50% of septins Spn1 and Spn4 are in the cytoplasm even during cytokinesis when the septin double rings are formed (Table 1 of Wu and Pollard, Science 2005, PMID: 16224022). Thus, it is possible that there are pools of free septin and exocyst subunits in the cytoplasm, which were detected in our Co-IP assays. 

      In our experiments, we observed selective co-IP signals between certain septin and exocyst subunits, while other combinations did not yield detectable interactions. We believe these findings could reflect several other possibilities besides the possible interactions among the free subunits in the cytoplasm:

      (1) Some interactions may only be strong enough between specific subunits at exposed interfaces under the Co-IP conditions, rather than through whole complex–complex interactions;

      (2) The detergent and/or salt conditions used in our co-IPs may disrupt labile complex interfaces or partially dissociate multimeric assemblies.

      To address this concern, we now include in the Discussion a paragraph highlighting the possibility that some of the observed interactions may not reflect binding between fully assembled, functional complexes. Notably, most detected interaction pairs are consistent with the AlphaFold predictions, which suggest that specific subunit interfaces may be responsible for mediating contact. While we cannot fully resolve whether septins engage with the whole exocyst complex versus selected subunits, our combined data support a model in which septins scaffold or spatially regulate exocyst localization at the division site, potentially through dynamic and multivalent interactions. We now explicitly state this more cautious interpretation in the revised manuscript. Future biochemical studies using native complex purifications, cross-linking mass spectrometry, in vitro reconstitution with fully assembled septin and exocyst complexes, or in vivo FRET assays will be essential to clarify whether the interactions we observe occur between intact assemblies or intermediate forms.

      Reviewer #1 (Recommendations for the Authors): 

      A major finding from the manuscript is the description of physical interaction of septin subunits with exocyst subunits. The analysis starts from Alphafold2 predictions, shown in Figures 3 and S3. However, some of the most useful metrics of Alphafold, the PAE plot and the pTM and ipTM values, are not provided. It is thus very difficult to estimate the value of the predicted structures (which are also obscured by all side chains). The power of a predicted structure is that it suggests binding interfaces, which is not explored here. At the very least, it would not be difficult to examine whether the proposed binding interfaces are free in the septin filaments and octameric exocyst complex. 

      Please also see response to reviewer #1 (Public Review).

      We thank the reviewer for these very helpful suggestions. We agree that inclusion of AlphaFold2 model confidence metrics—specifically the Predicted Aligned Error (PAE) plots, as well as pTM and ipTM values—is essential for evaluating the reliability of the predicted septin–exocyst interfaces.

      In the revised manuscript, we have now included the PAE plots (Figures 3 and S3) and summarized the pTM scores for each predicted septin–exocyst subunit pair. We also provide a short description of these metrics in the figure legend to help guide interpretation. The older AlphaFold2 version that we used (alphafold2advanced) does not report ipTM scores, so they are not included. However, according to our methodology, we only counted interacting residues with pLDDT scores >50%, so we expect that the resulting ipTM scores would not be very low.
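
      The pLDDT criterion described above can be sketched in a few lines. This is an illustrative example rather than the pipeline used in the study: it relies on the fact that AlphaFold PDB output stores per-residue pLDDT in the B-factor column, while the function names, the any-atom contact definition, and the 4 Å distance cutoff are assumptions introduced here.

```python
import math


def parse_atoms(pdb_text):
    """Yield (chain, residue_number, (x, y, z), plddt) for each ATOM record.

    AlphaFold PDB output stores the per-residue pLDDT in the B-factor column.
    """
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            chain = line[21]
            resnum = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            plddt = float(line[60:66])
            yield chain, resnum, xyz, plddt


def confident_interface_residues(pdb_text, chain_a, chain_b,
                                 cutoff=4.0, min_plddt=50.0):
    """Return the residues of chain_a and chain_b that contact each other.

    Two residues count as a contact if any pair of their atoms lies within
    `cutoff` angstroms (an assumed value) and both residues have
    pLDDT > min_plddt.
    """
    atoms = list(parse_atoms(pdb_text))
    side_a = [a for a in atoms if a[0] == chain_a and a[3] > min_plddt]
    side_b = [a for a in atoms if a[0] == chain_b and a[3] > min_plddt]

    contacts_a, contacts_b = set(), set()
    for _, res_a, xyz_a, _ in side_a:
        for _, res_b, xyz_b, _ in side_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                contacts_a.add(res_a)
                contacts_b.add(res_b)
    return sorted(contacts_a), sorted(contacts_b)
```

      Calling `confident_interface_residues(pdb_text, "A", "B")` on a two-chain model returns two sorted lists of residue numbers, one per chain; the all-against-all distance loop is adequate for a single subunit pair but would need a spatial index for whole complexes.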

      In addition, we have updated Figures 3 and S3 to show simplified ribbon diagrams of the interface regions, with side chains hidden by default and selectively displayed only at predicted interaction hotspots. This improves structural clarity and makes the interface regions easier to interpret. We mentioned in the Discussion that the preliminary studies show that the predicted interacting interfaces of Sec15 and Sec5 with septin subunits are accessible for interaction in the whole exocyst complex. The new Figure Supplement S4 and S5 and Videos 4-7 now show the interface residues of both the exocyst and septins that are involved in the interactions.

      Two further points on the interaction: 

      The 2H interaction data is not very convincing. The insets showing beta-gal assays do not look very different from the negative control (compare for instance in panel 4E the Sec15BD alone, last column, with the Sec15-BD in combination with Spn4-AD, third column: roughly same color), which suggests it is mostly driven by autoactivation of Sec15-BD. Providing growth information in addition to beta-gal may be helpful. 

      We appreciate the reviewer’s close evaluation of the yeast two-hybrid (Y2H) assay data, and we agree that the signal observed for the Spn4–Sec15 combination is indeed weak. Unfortunately, we did not perform growth assays. However, we would like to clarify that this is consistent with the nature of the interactions that we are investigating. The interactions between individual septin and exocyst subunits are weak and/or transient, as supported by the weak interactions detected in the Co-IP experiments. Given that the exocyst only tethers/docks vesicles on the plasma membrane for tens of seconds before vesicle fusion, the multivalent interactions between septins and the exocyst should be very dynamic and not too strong. 

      As evidenced by our Co-IP experiments and the multivalent interactions predicted by AlphaFold2, the interaction between Spn4 and Sec15 is detectable but weak, suggesting that this may be a low-affinity or transient interaction. Given that Y2H assays have known limitations in detecting such low-affinity interactions, especially those that depend on conformational context or are not optimal in the yeast nucleus, it is perhaps not surprising that the X-gal color development is subtle. These limitations of the Y2H system have been well-documented (e.g., Braun et al., 2009; Vidal & Fields, 2014), particularly for interactions with affinities in the micromolar range or those requiring conformational specificity. Therefore, the weak signal observed is in line with expectations for a low-affinity, transient interaction such as that between Spn4 and Sec15.

      Vidal, M. and Fields, S., 2014. The yeast two-hybrid assay: still finding connections after 25 years. Nature methods, 11(12), pp.1203-1206.

      Braun, P., Tasan, M., Dreze, M., Barrios-Rodiles, M., Lemmens, I., Yu, H., Sahalie, J.M., Murray, R.R., Roncari, L., De Smet, A.S. and Venkatesan, K., 2009. An experimentally derived confidence score for binary protein-protein interactions. Nature methods, 6(1), pp.91-97.

      In the coIP experiments, I am confused by the presence of tubulin signal in some of the IPs. For instance, in Fig 4B, but not 4D, where the same Sec15-GFP is immunoprecipitated. There is also a signal in 4C but not 4A. This needs to be clarified. 

      The presence of tubulin in some immunoprecipitates is not unexpected, particularly in experiments involving cytoskeleton-associated proteins such as septins and exocyst subunits. The occasional presence of tubulin in our co-IP samples is consistent with well-documented reports showing tubulin as a frequent non-specific co-purifying protein, particularly under native lysis conditions used to preserve large complexes (Vega and Hsu, 2003; Gavin et al., 2006; Mellacheruvu et al., 2013; Hein et al., 2015). The CRAPome database and quantitative interactomics studies highlight tubulin as one of the most common background proteins in affinity-based workflows. Importantly, tubulin was used as a loading control but not as a marker for interaction in our study, and its variable presence does not reflect a specific interaction with Sec15-GFP or other bait proteins, and we have clarified this point in the revised figure legend.

      Gavin, A.C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C., Jensen, L.J., Bastuck, S., Dümpelfeld, B. and Edelmann, A., 2006. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440(7084), pp.631-636.

      Mellacheruvu, D., Wright, Z., Couzens, A.L., Lambert, J.P., St-Denis, N.A., Li, T., Miteva, Y.V., Hauri, S., Sardiu, M.E., Low, T.Y. and Halim, V.A., 2013. The CRAPome: a contaminant repository for affinity purification–mass spectrometry data. Nature methods, 10(8), pp.730-736.

      Hein, M.Y., Hubner, N.C., Poser, I., Cox, J., Nagaraj, N., Toyoda, Y., Gak, I.A., Weisswange, I., Mansfeld, J., Buchholz, F. and Hyman, A.A., 2015. A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell, 163(3), pp.712-723.

      Vega, I.E., Hsu, S.C. 2003. The septin protein Nedd5 associates with both the exocyst complex and microtubules and disruption of its GTPase activity promotes aberrant neurite sprouting in PC12 cells. Neuroreport, 14, pp.31-37.

      Regarding the localization of Ypt3 and Syb1 in WT and spn1∆ in Figure 6C-D and Bgs1 in Figure 7A, it would help to add a contractile ring marker to be able to match the timing of cytokinesis between WT and mutants and ensure that cells of same stage are compared (and add some quantification for Ypt3). In fact, in Figure 7A, next to the cells being pointed at, there are very similar localizations of Bgs1 in WT and spn1∆ at the rim of the ingressing septum, which makes me wonder how the quantified cells were chosen. 

      For localizations and quantifications of Eng1, Ypt3, Syb1, and Bgs1 shown in Figures 6 and 7, cells with a closed septum (at or after the end of contractile-ring constriction) were quantified or highlighted. Fluorescence intensity at the division site was quantified by line scans with a line width of 3 pixels. For Syb1 (Figure 6D), we quantified cells at the end of ring constriction (when Rlc1-tdTomato constricted to a dot) in the middle focal plane. The exact same lines were drawn in both Rlc1 and Syb1 channels. The center of each line scan was defined as the pixel with the brightest Rlc1 value. All data were aligned by the center and plotted. For Bgs1 (Figure 7A), we quantified cells in which the Rlc1 signal had disappeared from the division site. The line was drawn in the Bgs1 channel in the middle focal plane. The center of each line scan was defined as the pixel with the brightest Bgs1 value. All data were aligned by the center and plotted. These details were added to the Materials and Methods.
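
      The alignment step described above (center each line scan on its brightest pixel, then pool the scans) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes the intensity profiles have already been extracted from the images (for example with a 3-pixel-wide line), and the function names and example values are hypothetical.

```python
from statistics import mean


def brightest_index(profile):
    """Return the index of the brightest pixel (the first one if tied)."""
    return max(range(len(profile)), key=lambda i: profile[i])


def align_line_scans(signal_profiles, reference_profiles=None):
    """Align line-scan profiles on a per-cell center and average them.

    signal_profiles: list of intensity lists, one per cell (channel to plot).
    reference_profiles: optional list of intensity lists drawn along the same
        lines in a second channel; when given, the center of each cell is the
        brightest reference pixel, otherwise the brightest signal pixel.
    Returns a dict mapping pixel offset from the center to mean intensity.
    """
    if reference_profiles is None:
        reference_profiles = signal_profiles
    if len(signal_profiles) != len(reference_profiles):
        raise ValueError("need one reference profile per signal profile")

    by_offset = {}
    for signal, reference in zip(signal_profiles, reference_profiles):
        if len(signal) != len(reference):
            raise ValueError("signal and reference lines must have equal length")
        center = brightest_index(reference)
        for i, value in enumerate(signal):
            by_offset.setdefault(i - center, []).append(value)

    return {offset: mean(values) for offset, values in sorted(by_offset.items())}


# Made-up intensities: two cells, Syb1 centered on the brightest Rlc1 pixel.
rlc1 = [[1, 9, 2, 1], [1, 2, 8, 1]]
syb1 = [[3, 5, 4, 2], [2, 3, 6, 4]]
aligned = align_line_scans(syb1, rlc1)
```

      Passing only one list of profiles, as in `align_line_scans(bgs1_profiles)`, centers each scan on its own brightest pixel, which corresponds to the Bgs1 case. Offsets near the ends of the lines are averaged over fewer cells, so the edges of the pooled profile are noisier than the center.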

      Finally, the manuscript would benefit from some figure reorganization/compaction. Unless work on the binding interfaces is added, Figure 3 and S3 could be removed and summarized by providing the pTM and ipTM values of the predicted interactions. Figure 5 could be combined with Figure 2, as it is essentially a repeat with additional exocyst subunits. 

      Because analysis of the binding interfaces has been added, we have kept Figures 3 and S3. The experiments in Figure 5 could not be performed before the interaction tests between septins and the exocyst. Thus, to aid the flow of the story, we have kept Figures 2 and 5 separate.

      Minor comments: 

      The last sentence of the first paragraph of the results does not make much sense at this point of the paper. After the first paragraph, there is no evidence that colocalization would be required for proper function.  

      We agree that the sentence in question may have overstated the functional implications of colocalization too early in the Results section, before presenting supporting evidence. Our intention was to introduce the hypothesis that spatial proximity between septins and exocyst subunits may be relevant for their coordination during cytokinesis, which we examine in later figures. We have revised the sentence to more accurately reflect the observational nature of the data at this stage in the manuscript as below:

      "These observations suggest the spatial proximity between septins and the exocyst during certain stage of cytokinesis, raising the possibility of their functional coordination, which we would further investigate below."

      What is the indicated n in Figure 6B? Number of cells? 

      Yes, the n in Figure 6B refers to the number of electron microscopy thin sections quantified in the analysis. We have now updated the figure legend to explicitly state this for clarity.

      The causal inference made between the alteration of Exocyst localization in septin mutants and the thicker septum is possible, but by no means certain. It should be phrased more cautiously. 

      We agree that our original phrasing may have overstated the causal relationship between altered exocyst localization in septin mutants and septum thickening. Our data support a correlation between these phenotypes, but additional experiments would be required to establish direct causality.

      To reflect this, we have revised the relevant sentence in the Discussion to read:

      “The modest accumulation of vesicles and vesicle cargos at the division site is one of the reasons for the increased thickness of the division septum in septin mutants. It is more likely that the misplaced exocyst can still tether vesicles along the division plane without septins. Due to the lack of the glucanase Eng1 at the rim of the division plane in septin mutants, daughter-cell separation is delayed and then cells continue to thicken the septum.”

      Reviewer #2 (Recommendations for the Authors): 

      (1) In the display of the AlphaFold Model for the interactions (Figure 3 and Supplemental Figure 3) it is difficult to identify which subunits are where. Residue numbers and subunits should be labeled and only side chains important for the interactions should be present in the model. 

      We appreciate this valuable suggestion. We agree that clearer visual labeling is essential for interpreting the predicted interactions and have revised Figures 3 and S3 accordingly to improve readability and emphasize key structural features.

      Specifically, we have:

      • Labeled each subunit with its name and color-coded consistently across panels.

      • Annotated key interface residues with residue numbers directly in the figure.

      • Removed non-interacting side chains to declutter the model and highlight only those involved in predicted interactions, and expanded the figure legend to explain them.

      (2) In Table 1 the column label "Genetic Interaction at 25C" is confusing when synthetic growth defects are shown with a "plus". Rather this column could be labeled "Growth of double mutants at 25C" and then designate the relative growth rate observed at 25C as in Table 2. Designating a negative effect on growth with a plus is confusing. 

      Thanks for the thoughtful suggestions. We have made the suggested changes by deleting the last column so that Tables 1 and 2 are consistent.

      (3) In Figure 4, why is tubulin being co-immunoprecipitated in two of the four anti-GFP IPs? Are the IPs dirty and if so why does it vary between the four experiments? If they are dirty can the non-specific tubulin be removed by additional washes with IP buffer or conversely is it necessary to do minimal washes in order to detect the exocyst-septin interaction by coIP? A comment on this would be helpful. 

      The presence of tubulin in some immunoprecipitates is not unexpected, particularly in experiments involving cytoskeleton-associated proteins such as septins and exocyst subunits. The occasional presence of tubulin in our co-IP samples is consistent with well-documented reports showing tubulin as a frequent non-specific co-purifying protein, particularly under native lysis conditions used to preserve large complexes (Vega and Hsu, 2003; Gavin et al., 2006; Mellacheruvu et al., 2013; Hein et al., 2015). The CRAPome database and quantitative interactomics studies highlight tubulin as one of the most common background proteins in affinity-based workflows. Importantly, tubulin was used as a loading control but not as a marker for interaction in our study, and its variable presence does not reflect a specific interaction with Sec15-GFP or other bait proteins, and we have clarified this point in the revised figure legend.

      Gavin, A.C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C., Jensen, L.J., Bastuck, S., Dümpelfeld, B. and Edelmann, A., 2006. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440(7084), pp.631-636.

      Mellacheruvu, D., Wright, Z., Couzens, A.L., Lambert, J.P., St-Denis, N.A., Li, T., Miteva, Y.V., Hauri, S., Sardiu, M.E., Low, T.Y. and Halim, V.A., 2013. The CRAPome: a contaminant repository for affinity purification–mass spectrometry data. Nature methods, 10(8), pp.730-736.

      Hein, M.Y., Hubner, N.C., Poser, I., Cox, J., Nagaraj, N., Toyoda, Y., Gak, I.A., Weisswange, I., Mansfeld, J., Buchholz, F. and Hyman, A.A., 2015. A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell, 163(3), pp.712-723.

      Vega, I.E., Hsu, S.C. 2003. The septin protein Nedd5 associates with both the exocyst complex and microtubules and disruption of its GTPase activity promotes aberrant neurite sprouting in PC12 cells. Neuroreport, 14, pp.31-37. 

      In response to the second part of the reviewer’s comment, we washed the pulldown product five times, each time with 1 ml of IP buffer at 4°C. We used this standard protocol for all the Co-IP experiments to detect the interactions between different septin and exocyst subunits. We are therefore not sure whether, or how, more washes or more stringent buffer conditions would interfere with detection of the interactions.

      Reviewer #3 (Recommendations for the Authors): 

      In addition to the issues noted in the public review, there were some confusing findings and references to previous literature that merit further consideration or discussion: 

      • The current gold standard for validating Alphafold predictions involves making targeted mutants suggested by the structural predictions. The absence of any such validation weakens the conclusions significantly. 

      We agree that targeted mutagenesis based on AlphaFold2-predicted interaction interfaces represents a powerful approach to experimentally validate the in silico models. While we did not pursue structure-guided mutagenesis in this study, our goal was to identify putative interactions between septin and exocyst subunits as a foundation for future functional work. Our current conclusions are intentionally limited to proposing putative interfaces, supported by co-immunoprecipitation and genetic interaction data.

      We recognize that direct validation of specific contact residues would significantly strengthen the model. Accordingly, we have revised the Discussion to explicitly state this limitation and to note that structure-based mutagenesis will be an important next step to test the functional relevance of predicted interactions. We have added the following statement:

      “Future studies are needed to refine the residues involved in the interactions because the predicted interacting residues from AlphaFold are too numerous. However, it is encouraging that most of the predicted interacting residues are clustered in several surface patches. Experimental validation through targeted mutagenesis is an important next step.”

      • Much of the writing appears to imply that differences in mutant phenotypes indicate differences in septin (or exocyst) subunit behaviors/functions. However, my reading of the work in budding yeast is that such differences reflect the partial functionality that can be conferred by aberrant partial septin complexes that assemble and may polymerize in mutants lacking different subunits. In this view, which is supported by data showing that essentially all septins are in stoichiometric octameric complexes in cells, the wild-type functions are all mediated by the full complex. Similarly, the separate exocyst subunit localizations based on tagged Sec3 (Finger et al) were not supported by later work from the Brennwald lab with untagged Sec3, and the idea that different exocyst subunits may function separately from the full complex has very limited support in yeast. I would suggest that the text be edited to better reflect the literature, or that different views be better justified. 

      Thanks for the suggestions. We have revised the text accordingly.

      • The comprehensive set of Alphafold2 predictions is a major strength of the paper, but it is unclear to this reader whether the multiple predicted interactions truly reflect multivalent multimode interactions or whether many (most?) predictions would not be consistent with interactions between full complexes and may not indicate physiological interactions. Better discussion of these issues is needed to interpret the findings. 

      We appreciate the reviewer’s suggestion to use structural prediction to further assess interaction plausibility. We have now used the full Saccharomyces cerevisiae exocyst complex structure (4.4 Å resolution) published by the Guo group to examine the interfaces of the septin-exocyst interactions, assuming that the S. pombe exocyst has a similar structure. We mapped predicted contact residues onto the predicted structure. Most predicted interfaces (86% for the exocyst and 86-96% for septins) appear to be located on accessible surfaces in the assembled complexes (Figure supplements S4 and S5; Videos 4-7), suggesting that these interactions are sterically plausible. We have added this important caveat to the text of the revised manuscript, highlighting the interface accessibility within the assembled complexes. We appreciate the reviewer’s insight, which helped us strengthen the interpretation and limitations of the AlphaFold-based analysis.

      • Some but not all co-IP blots appear to show tubulin (negative control) coming down with the GFP pull-downs. Why is that, and what does it imply for the reliability of the co-IP protocol? 

      The presence of tubulin in some immunoprecipitates is not unexpected, particularly in experiments involving cytoskeleton-associated proteins such as septins and exocyst subunits. The occasional presence of tubulin in our co-IP samples is consistent with well-documented reports showing tubulin as a frequent non-specific co-purifying protein, particularly under native lysis conditions used to preserve large complexes (Vega and Hsu, 2003; Gavin et al., 2006; Mellacheruvu et al., 2013; Hein et al., 2015). The CRAPome database and quantitative interactomics studies highlight tubulin as one of the most common background proteins in affinity-based workflows. Importantly, tubulin was used as a loading control but not a marker for interaction in our study, and its variable presence does not reflect a specific interaction with Sec15-GFP or other bait proteins, and we have clarified this point in the revised figure legend.

      Gavin, A.C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C., Jensen, L.J., Bastuck, S., Dümpelfeld, B. and Edelmann, A., 2006. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440(7084), pp.631-636.

      Mellacheruvu, D., Wright, Z., Couzens, A.L., Lambert, J.P., St-Denis, N.A., Li, T., Miteva, Y.V., Hauri, S., Sardiu, M.E., Low, T.Y. and Halim, V.A., 2013. The CRAPome: a contaminant repository for affinity purification–mass spectrometry data. Nature methods, 10(8), pp.730-736.

      Hein, M.Y., Hubner, N.C., Poser, I., Cox, J., Nagaraj, N., Toyoda, Y., Gak, I.A., Weisswange, I., Mansfeld, J., Buchholz, F. and Hyman, A.A., 2015. A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell, 163(3), pp.712-723.

      Vega, I.E., Hsu, S.C. 2003. The septin protein Nedd5 associates with both the exocyst complex and microtubules and disruption of its GTPase activity promotes aberrant neurite sprouting in PC12 cells. Neuroreport, 14, pp.31-37.

      • Why were two different protocols used for different yeast-two-hybrid analyses? 

      The purpose of using two protocols was to test which protocol is more reliable and sensitive.

      • The different genetic interactions between septin and exocyst mutants when combined with TRAPP-II mutants merits further discussion: might the difference reflect relocation of exocyst from rim to center in septin mutants versus inactivation of exocyst in exocyst mutants? 

      We appreciate this insightful comment and agree that this distinction is likely meaningful. The reviewer correctly notes that septin mutants may not abolish exocyst function but rather cause its spatial mislocalization from the rim to the center of the division site, whereas the exocyst mutants likely result in partial or complete loss of vesicle tethering activity at the plasma membrane.

      To address this important nuance, we have expanded the Discussion as follows:

      “The genetic interactions between mutations in the exocyst and septins when combined with TRAPP-II mutants may reflect fundamentally different consequences of compromising exocyst function (Tables 1 and 2). In septin mutants, the exocyst complex still localizes to the division site but is mispositioned from the rim to the center of the division plane. This mislocalization allows partial retention of exocyst function, leading to very mild synthetic or additive defects when combined with compromised TRAPP-II trafficking and tethering. In contrast, in exocyst subunit mutants, the exocyst becomes partially or completely non-functional, resulting in a more severe loss of exocyst activity. These differing consequences could explain the qualitative differences in genetic interactions observed with TRAPP-II mutants (Tables 1 and 2). Thus, septins and the exocyst also work in different genetic pathways for certain functions in fission yeast cytokinesis.”

      • The vesicle accumulation in septin mutants was quite modest. Does that imply that most vesicles are still fusing in the septum? Further discussion would be beneficial to understand what the authors think this means. 

      We thank the reviewer for this important point. We agree that the modest vesicle accumulation observed in septin mutants suggests that a significant proportion of vesicles continue to successfully fuse at the division site, even in the absence of fully functional septin structures.

      We now discuss this in greater detail in the revised manuscript:

      “The relatively modest vesicle accumulation in septin mutants suggests that septins are not absolutely required for vesicle tethering or fusion per se at the division site. Instead, septins primarily function to spatially organize the targeting sites of exocyst-directed vesicles by stabilizing the localization of the exocyst at the rim of the cleavage furrow. In septin mutants, mislocalization of the exocyst reduces the spatial precision of membrane insertion but still permits vesicle tethering and fusion, albeit in a less controlled manner. Thus, septins likely play a modulatory rather than essential role in exocytic vesicle delivery during cytokinesis. This interpretation aligns with our localization and genetic interaction data, which indicate that septins act as scaffolds to optimize secretion geometry, rather than as core components of the fusion machinery.”

      • It was unclear to this reader why relocation of some exocyst complexes from the rim to the center of the septal region would lead to dramatic thickening of the septum. Further discussion would be beneficial to understand what the authors think this means. 

      The modest accumulation of vesicles and vesicle cargos at the division site is one of the reasons for the increased thickness of the division septum in septin mutants. It is more likely that the misplaced exocyst can still tether vesicles along the division plane without septins. Because of the lack of glucanase Eng1 at the rim of the division plane in septin mutants, daughter-cell separation is delayed and then cells continue to thicken the septum. We have added these points to the Discussion.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      The authors make a bold claim that a combination of repetitive transcranial magnetic stimulation (intermittent theta burst-iTBS) and transcranial alternating current stimulation (gamma tACS) causes slight improvements in memory in a face/name/profession task.

      Strengths:

      The idea of stimulating the human brain non-invasively is very attractive because, if it worked, it could lead to a host of interesting applications. The current study aims to evaluate one such exciting application.

      Weaknesses:

      (1) The title refers to the "precuneus-hippocampus" network. A clear definition of what is meant by this terminology is lacking. More importantly, mechanistic evidence that the precuneus and the hippocampus are involved in the potential effects of stimulation remains unconvincing.

      Thank you for the observation. We believe that the evidence collected supports our statements regarding the stimulation of the precuneus and the involvement of the hippocampus. In particular, given the existing evidence on TMS methodology and non-invasive precuneus stimulation (see Koch et al., Brain, 2022; Koch et al., Alzheimer's Research & Therapy, 2025), the biophysical model of the E-field that we computed (see Biophysical modeling and E-field calculation section in the supplementary information), and the individual identification of the precuneus through MRI (see iTBS+γtACS neuromodulation protocol and MRI data acquisition in the main text), we can reasonably assume that the individually identified precuneus (PC) was stimulated.

      As we acknowledged in the Limitations section, we cannot entirely rule out the possibility that our results might also reflect stimulation of more superficial parietal regions adjacent to the precuneus. Nor do we provide direct evidence of microscopic changes in the precuneus following stimulation. However, the changes we report in precuneus oscillatory activity and in precuneus-hippocampi connectivity support both our claim that the precuneus was stimulated and the involvement of the hippocampi in the stimulation effects.

      Despite this consideration, we agree that a clear definition of what is meant by the term “precuneus-hippocampus network” is lacking. Moreover, our data and previous evidence support the notion of PC stimulation, but this study does not produce direct evidence of stimulation of the hippocampi, only of the effect of the neuromodulation protocol on their connection with the precuneus. We have therefore softened the claim in the title by removing the mention of the precuneus-hippocampus network. The modified title is as follows: “Dual transcranial electromagnetic stimulation of the precuneus boosts human long-term memory.”

      (2) The question of the extent to which the stimulation approach and the stimulation parameters used in these experiments causes specific and functionally relevant neural effects remains open. Invasive recordings that could address this question remain out of the scope of this non-invasive study. The authors conducted scalp EEG experiments in an attempt to address this question using non-invasive methods. However, the results shown in Fig. 3 are unclear. The results are inconsistently reported in units of microvolts squared in some panels (3A, 3B) and in units of microvolts in other panels (3C). Also, there is insufficient consideration of potential contamination by signal components reflecting eye movements, other muscle artifacts, or another volume-conducted signal reflecting aggregate activity inside the brain.

      As you correctly noted, Figure 3 presents results obtained from the TMS–EEG recordings. However, there is no inconsistency regarding the measurement units, as we are referring to two distinct indices: one in the frequency domain (oscillatory power, shown in Figures 3A and 3B and expressed in microvolts squared, μV²) and one in the time domain (the TMS-evoked potential, shown in Figure 3C and expressed in microvolts, μV).
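
      The distinction between the two units can be illustrated with a minimal sketch. This is not the analysis pipeline used in the study: the signal is a synthetic 40 Hz sinusoid of 3 μV amplitude, and the function name `band_power_uv2` and all parameters are hypothetical. A time-domain amplitude keeps the unit of the signal (μV), whereas spectral power is a squared quantity (μV²).

```python
import math

def band_power_uv2(signal_uv, fs_hz, f_lo, f_hi):
    """Sum of mean-square power (uV^2) of the DFT components in [f_lo, f_hi] Hz."""
    n = len(signal_uv)
    power = 0.0
    for k in range(1, n // 2):
        freq = k * fs_hz / n
        if f_lo <= freq <= f_hi:
            re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(signal_uv))
            im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(signal_uv))
            amplitude_uv = 2.0 * math.hypot(re, im) / n  # sinusoid amplitude, in uV
            power += amplitude_uv ** 2 / 2.0             # mean-square power, in uV^2
    return power

fs_hz = 1000   # assumed sampling rate
n_samples = 1000
# Synthetic 40 Hz sinusoid with an amplitude of 3 uV (illustrative only)
signal_uv = [3.0 * math.sin(2 * math.pi * 40 * i / fs_hz) for i in range(n_samples)]

peak_amplitude_uv = max(abs(x) for x in signal_uv)         # time domain: uV
gamma_power_uv2 = band_power_uv2(signal_uv, fs_hz, 30, 50)  # frequency domain: uV^2
```

      For this synthetic signal the peak amplitude is close to 3 μV, while the 30-50 Hz band power is 3²/2 = 4.5 μV².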

      Regarding the concern about artifacts, this is an important issue on which our group has strong expertise, having published well-established, highly cited procedures on how to record and clean TMS-EEG signals (e.g., Casula et al., Clinical Neurophysiology, 2017; Rocchi et al., Brain Stimulation, 2021). In the current study, we adopted a well-established and rigorous approach for both data acquisition and preprocessing. This ensured that the recorded TMS–EEG signals were not contaminated by physiological or electrical artifacts.

      As regards the recording procedure, all participants were instructed to fixate on a black cross to minimize eye movements. To avoid auditory-related components caused by the TMS click, we adopted an ad-hoc procedure optimized for TMS-EEG recordings (Rocchi et al., Brain Stimulation, 2021). First, participants were given earphones that continuously played an ad-hoc masking noise composed of white noise mixed with specific time-varying frequencies of the TMS click (Rocchi et al., Brain Stimulation, 2021). The masking noise volume was adjusted to ensure that participants could not detect the TMS click, or as much as tolerated (always below 90 dB). To further reduce the impact of the TMS click on the EEG signal, we placed ear defenders (SNR=30) on top of the earphones. Please see TMS–EEG data acquisition section in the main text.

      As regards the offline cleaning process, we applied Independent Component Analysis (INFOMAX-ICA) to the EEG data to identify and remove components associated with muscle activity, eye movements, blinking, and residual TMS-related artifacts, in line with the most recent guidelines on TMS–EEG preprocessing (Hernandez-Pavon et al., Brain Stimulation, 2023). Specifically, for TMS-related muscle artifacts, we strictly followed the criteria based on their scalp topography, spectral content, timing, and amplitude, which we published in a paper focused on this topic (Casula et al., Clinical Neurophysiology, 2017). We have added this detail to the TMS–EEG preprocessing and analysis section in the supplementary information (lines 119-120).

      (3) Figure 3 indicates "Precuneus oscillatory activity ...", but evidence that the activity presented reflects precuneus activity is lacking. The maps shown at the bottom of Figure 3C suggest that the EEG signals recorded with scalp EEG reflect activity generated across a wide spatial range, with a peak encompassing at least tens of centimeters. Thus, evidence that effects specifically reflect precuneus activity, as the paper's title and text throughout the manuscript suggest, is lacking.

      We believe there may have been a misunderstanding. As indicated in the figure caption, panels A and B represent oscillatory activity, whereas panel C displays the TMS-evoked potentials (TEPs). Therefore, the topographical maps mentioned (i.e., those in panel C) did not refer to oscillatory activity, but to differences in TEP amplitude. Specifically, the topographies shown in Figure 3C illustrate statistically significant differences in TEP amplitudes between post-stimulation time points (T1—immediately after stimulation, and T2—20 minutes after stimulation) and the pre-stimulation baseline (T0).

      In this figure, we focused our analysis on a cluster of electrodes overlying the individually identified precuneus, capturing EEG responses to single TMS pulses delivered to that target. This approach, widely used in previous literature (e.g., Koch et al., NeuroImage, 2018; Casula et al., Annals of Neurology, 2022; Koch et al., Brain, 2022; Maiella et al., Clinical Neurophysiology, 2024; Koch et al., Alzheimer’s Research & Therapy, 2025), supports the interpretation that the observed responses reflect precuneus-related activity. Furthermore, the spatially widespread change you mention was statistically significant only when TMS-EEG was conducted over the precuneus (i.e., when single TMS pulses were administered over the precuneus) and not when it was performed over the left parietal cortex. We modified the Discussion section in the main text to make this clearer (lines 196-199).

      “Moreover, we observed specific cortical changes in the posteromedial parietal areas, as evidenced by the whole-brain analysis conducted on TMS-EEG data when performed over the precuneus and the absence of effect when TMS-EEG was performed on the lateral posterior parietal cortex used as a control condition.”

      That said, we do not state that the effects observed specifically reflect precuneus activity; indeed, we think the effect of the stimulation is broader, as discussed in the Discussion section. Rather, in line with the literature (Koch et al., NeuroImage, 2018; Koch et al., Brain, 2022; Koch et al., Alzheimer's Research & Therapy, 2025), we maintain that the effects observed are a consequence of the stimulation of the precuneus by the dual stimulation protocol.

      (4) The paper as currently presented (e.g., Figure 3) also lacks rigorous evidence of relevant oscillatory activity. Prior to filtering EEG signals in a particular frequency band, clear evidence of oscillations in the frequency band of interest should be shown (e.g., demonstration of a clear peak that emerges naturally in the frequency range of interest when spectral analysis is applied to "raw" signals). The authors claim that gamma oscillations change because of the stimulation, but a clear peak in the gamma range prior to stimulation is not apparent in the data as currently presented. Thus, the extent to which spectral measurements during stimulation reflect physiological gamma oscillations remains unclear.

      If we understand correctly, your concern relates to the lack of a clear gamma peak before neuromodulation, which may suggest uncertainty about the observed changes in gamma oscillatory activity. Is that correct?

      First, it is important to underline that the natural frequency typically observed in the precuneus falls within the beta range, not the gamma range (see Rosanova et al., Journal of Neuroscience, 2009; Casula et al., Annals of Neurology, 2022). This explains why a prominent gamma peak is not expected at baseline (T0).

      By contrast, our neuromodulatory protocol was specifically aimed at boosting gamma oscillatory activity, given its well-established role in learning and memory processes (Griffiths & Jensen, Trends in Neurosciences, 2023). Thus, to assess the effect of the neuromodulatory protocol, we compared the oscillatory activity before (T0) and after stimulation (T1 and T2), which showed a clear increase in the gamma band. This effect is visible in the raw oscillatory power plot and is most clearly represented in Figure 3B, where the gamma band emerged as the only frequency range showing significant changes across time points.

      (5) Concerns remain regarding the rigor of statistical analyses in the revised manuscript (see also point 8 below). Figure 3B shows an undefined statistical test with p<0.05. The statistical test that was used is not explained. Also, a description of how corrections for multiple comparisons were made is missing. Figures 3A and 3C are not accompanied by statistics, making the results difficult to interpret. For Figure 4C, a claim was made based on a significant p-value for one statistical test and a non-significant p-value in another test. This is a common statistical mistake (see Figure 1 and accompanying discussion in Makin and Orban de Xivry (2019) Science Forum: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. eLife 8:e48175).

      All statistical tests are described in the Statistical Analysis section of the main text. Specifically, to assess cortical oscillation changes in Experiment 3, we conducted repeated-measures ANOVAs with stimulation condition (iTBS+γtACS vs. iTBS+sham-tACS) and time (ΔT1 = T1–T0; ΔT2 = T2–T0) as within-subject factors, for each frequency band. To further explore the effects of stimulation at each time point, we performed paired t-tests with Bonferroni correction for multiple comparisons. A one-tailed hypothesis was adopted, based on our a priori prediction of gamma-band increase derived from previous work (Maiella et al., 2022).
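
      As a rough illustration of the post-hoc step described above (not the authors' actual analysis code), the sketch below computes a paired t statistic and applies a Bonferroni adjustment to a set of p-values. The function names, the toy measurements, and the example p-values are all hypothetical. Converting the t statistic into a one-tailed p-value requires the t distribution, which in practice would come from a statistics package and is not computed here.

```python
import math
import statistics

def paired_t_statistic(cond_a, cond_b):
    """Return (t, degrees of freedom) for paired samples cond_a vs cond_b."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical gamma-power changes for four participants (illustrative values only)
real_stim = [2.0, 3.0, 5.0, 4.0]
sham_stim = [1.0, 1.0, 2.0, 3.0]
t_stat, dof = paired_t_statistic(real_stim, sham_stim)

# Hypothetical uncorrected p-values for two time points (delta T1, delta T2)
adjusted = bonferroni([0.01, 0.04])
```

      With two comparisons, an uncorrected p-value of 0.04 becomes 0.08 after adjustment and no longer passes a 0.05 threshold, which is why the number of comparisons being corrected for matters when reading Figure 3B.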

      Please note that Figures 3A and 3C are purely descriptive and are therefore not accompanied by statistical tests. Figure 3A shows the full spectral profile across frequencies and conditions, while statistical significance for these data is reported in Figure 3B. Similarly, the upper part of Figure 3C displays the TMS-evoked potential (TEP) in the precuneus, while the statistical comparison of TEP amplitudes across time points is shown in the lower part of Figure 3C.

      Regarding Figure 4C and the article you cited, are you referring to the error described as “Interpreting comparisons between two effects without directly comparing them”? If we understand correctly, this refers to the mistake of inferring an effect by observing that a significant result occurs in one condition or group, while the corresponding result in another condition or group is not significant, without directly testing the difference between them.

      In the case of Experiment 4, which investigates fMRI effects and is illustrated in Figure 4, we employed a general linear model that explicitly modeled both conditions and time points, allowing for a direct statistical comparison. Therefore, the connectivity effect reported does not fall into the category of the error you mentioned.

      Importantly, Figure 4C does not depict the effect of the neuromodulatory protocol itself. Rather, its purpose is to show that, within the real stimulation condition, there is a correlation between the observed effect and the integrity of the bilateral Middle Longitudinal Fasciculus. No conclusions or assumptions were made based on the absence of a significant correlation in the sham condition. However, since it was an exploratory analysis, we decided to soften our claims regarding the neural mechanism in the discussion section of the main text (lines 241-246).

      (6) In the second question posed in the original review, I highlighted that it was unclear how such stimulation would produce memory enhancement. The authors replied that, in the absence of mechanisms, there are many other studies that suffer from the same problem. This raises the question of placebo effects. The paper does not sufficiently address or discuss the possibility that any potential stimulation effects may reflect placebo effects.

      We agree with the reviewer on the potential role of a placebo effect in our study. For this reason, our experimental design included several stimulation conditions, among them a placebo condition (sham-iTBS+sham-tACS), which did not produce any effect.

      (7) The third major concern in the original review was the lack of evidence for a mechanism that is specific to the precuneus. Evidence for specific involvement of the precuneus remains lacking in the revised manuscript. The authors state: "the non-invasive stimulation protocol was applied to an individually identified precuneus for each participant". However, the meaning of this statement is unclear. Specifically, it is unclear how the authors know that they are specifically targeting the precuneus. Without directly recording from the precuneus and directly demonstrating effects, which is outside of the scope of the study, specific involvement of the precuneus seems speculative. Also, it does not seem as though a figure was included in the paper to show how the stimulation protocol specifically targets the precuneus. In their response to the original reviews, the authors state that posterior medial parietal areas are the only regions that show significant differences following the stimulation, but they did not cite a specific figure, or statistics reported in the text, that show this. In any event, posterior medial parietal areas encompass a wide area of the brain, so this would still not provide evidence for an effect specifically involving the precuneus.

      We respectfully disagree with the claim that targeting the precuneus in our study is speculative. The statement that “without directly recording from the precuneus and directly demonstrating effects, which is outside the scope of the study, specific involvement of the precuneus seems speculative” would, by that logic, implicitly call into question a large body of cognitive neuroscience research employing non-invasive techniques such as EEG and fMRI.

      Our methodological approach—combining MRI-guided stimulation, biophysical modeling, and TMS–EEG—is well established and widely used for targeting and studying the role of specific cortical regions, including the precuneus (e.g., Wang et al., Science, 2014; Koch et al., NeuroImage, 2018; Casula et al., Annals of Neurology, 2022, 2023; Koch et al., Brain, 2022; Maiella et al., Clinical Neurophysiology, 2024; Koch et al., Alzheimer’s Research & Therapy, 2025).

      In line with previously published protocols (Santarnecchi et al., Human Brain Mapping, 2018; Özdemir et al., PNAS, 2020; Mantovani et al., Journal of Psychiatric Research, 2021), we identified individual targets (i.e., the precuneus) for each participant based on structural and resting-state functional MRI data (see MRI Data Acquisition and Preprocessing section in the main text). This target was then accurately localized using MRI-guided stereotaxic neuronavigation, ensuring reproducible and anatomically precise stimulation across subjects.

      Finally, concerning the last comment about the lack of figures/statistics showing how the stimulation protocol targets the precuneus and the specificity of the effect observed, we would like to draw attention to:

      Figure 3 in the main text, where we show the results of the TMS-EEG over the posterior medial parietal areas;

      Figure S1 in the supplementary information, which shows, through the E-field simulation, how the stimulation protocol targets the brain;

      the Precuneus iTBS+γtACS increases gamma oscillatory activity section in the main text results, where we report the results of the statistical analysis of the TMS-EEG conducted over the precuneus and the left posterior parietal cortex, used as a control condition to test for the specificity of the neuromodulation protocol.

      (8) Regarding chance levels, it is unfortunate that the authors cannot quantify what chance levels are in the immediate and delayed recall conditions. This makes interpretation of the results challenging. In the immediate and delayed conditions, the authors state that the chance level is 33%. It would be useful to mark this in the figures. If I understand correctly, chance is 33% in Fig. 2A. If this is the case and if I am interpreting the figure correctly:

      Gray bars for the sham condition appear to be below chance (~20-25%). Why is this condition associated with an accuracy level that is lower than chance?

      Cyan bars and red bars do not appear to be significantly different from chance (i.e., 33%), with red slightly higher than cyan. What statistic was performed to obtain the level of significance indicated in the figure? The highest average value for the red condition appears to be around 35%. More details are needed to fully explain this figure and to support the claims associated with this figure.

      The immediate and delayed recall conditions you mention correspond to a free recall task. In this case, the notion of a fixed "chance level" is not as straightforward as it would be in recognition or forced-choice paradigms, which is why we did not quantify it at first. We explain this in detail below.

      Unlike multiple-choice tasks, where participants select the answer from a limited set of alternatives and the probability of a correct response by chance can be precisely quantified (e.g., 33% in a 3-alternative forced choice), free recall involves the spontaneous retrieval of items from memory without external cues or predefined options. As such, the response range in free recall is essentially unconstrained, encompassing the entire vocabulary of the participant.

      Because of this open-ended nature, the probability of correctly recalling a studied item purely by chance is exceedingly low and can be approximated as zero. Moreover, in our task, participants had to correctly recollect both name and occupation, which further enlarges the space of possible answers.

      This assumption is further supported by the fact that random guesses in free recall are unlikely to match any of the studied items, given the vast number of possible alternatives. As a result, performance above zero can be reasonably interpreted as reflecting genuine memory retrieval, rather than random guessing.
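
      The contrast between the two paradigms can be made concrete with a small sketch. It rests on strong simplifying assumptions that are not taken from the study: guesses are uniform and independent, and the pool sizes (1000 candidate names, 500 candidate occupations) are arbitrary placeholders. The function names are hypothetical.

```python
def forced_choice_chance(n_alternatives):
    """Chance accuracy when one of n equally likely alternatives is picked."""
    return 1.0 / n_alternatives

def free_recall_chance(n_candidate_names, n_candidate_occupations):
    """Chance of guessing one specific name-occupation pair, assuming
    uniform, independent guesses over both candidate pools."""
    return 1.0 / (n_candidate_names * n_candidate_occupations)

chance_3afc = forced_choice_chance(3)          # 1/3, i.e. about 33%
chance_recall = free_recall_chance(1000, 500)  # 1/500000 under the assumed pool sizes
```

      Under these assumptions the free-recall chance level is several orders of magnitude below the 33% of a 3-alternative forced choice, so treating it as approximately zero is reasonable.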

      As regards statistics, we performed repeated-measures ANOVAs with stimulation condition as a within-subject factor (i.e., iTBS+γtACS; iTBS+sham-tACS; sham-iTBS+sham-tACS) for each dependent variable (see statistical analysis section in main text).

      (9) In the revised version of the paper, the authors did not address concerns associated with the block design (please see question 4d in the original review).

      We are sorry for the misunderstanding. We did not address your concerns related to block design since it does not apply to our study. As reported in the paper you mentioned in the original review, block design involves data collection performed in response to different stimuli of a given class presented in succession. If this is the case, it does not correspond to our experimental design since both TMS-EEG and fMRI were conducted in the resting state (i.e., without the presentation of stimuli) on different days according to the different randomized stimulation conditions.  

      In sum, this study presents an admirable aspirational goal, the notion that a non-invasive stimulation protocol could modulate activity in specific brain regions to enhance memory. However, the evidence presented at the behavioral level and at the mechanistic level (e.g. the putative involvement of specific brain regions) remains unconvincing.

      We hope our response will be carefully considered, fostering a constructive exchange and leading to a reassessment of your evaluation.

      Reviewer #2 (Public review):

      Summary:

      The manuscript by Borghi and colleagues provides evidence that the combination of intermittent theta burst TMS stimulation and gamma transcranial alternating current stimulation (γtACS) targeting the precuneus increases long-term associative memory in healthy subjects compared to iTBS alone and sham conditions. Using a rich dataset of TMS-EEG and resting-state functional connectivity (rs-FC) maps and structural MRI data, the authors also provide evidence that dual stimulation increased gamma oscillations and functional connectivity between the precuneus and hippocampus. Enhanced memory performance was linked to increased gamma oscillatory activity and connectivity through white matter tracts.

      Strengths:

      The combination of personalized repetitive TMS (iTBS) and gamma tACS is a novel approach to targeting the precuneus, and thereby, connected memory-related regions to enhance long-term associative memory. The authors leverage an existing neural mechanism engaged in memory binding, theta-gamma coupling, by applying TMS at theta burst patterns and tACS at gamma frequencies to enhance gamma oscillations. The authors conducted a thorough study that suggests that simultaneous iTBS and gamma tACS could be a powerful approach for enhancing long-term associative memory. The paper was well-written, clear, and concise.

      Comments on Revision:

      I thank the authors for their thoughtful responses to my first review and their inclusion of more detailed methodological discussion of their rationale for the stimulation protocol conditions and timing. Regarding the apparent difference in connectivity at baseline between conditions, the explanation that this is due to intrinsic dynamics, state, or noise implies the baseline is reflecting transient changes in dynamics rather than a true or stable baseline. Based on this, it looks like iTBS solely is significantly greater than the baseline before the iTBS and γtACS condition but maybe not that much lower than post-stimulation period for iTBS and γtACS. A longer baseline period should be used to ensure transient states are not driving baseline levels such that these endogenous fluctuations would average out. This also raises questions about whether the effect of iTBS and γtACS or iTBS alone are dependent on the intrinsic state at the time when stimulation begins. Their additional clarification of memory scoring is helpful but also reveals that the effect of dual iTBS+γtACS specifically on the association between faces and names is just significant. This modest increase in associative memory should be taken into consideration when interpreting these findings.

      We thank the reviewer for the feedback. We fully agree that considering baseline dynamics is critical when assessing the neurophysiological and connectivity effects of stimulation protocols.

      In Experiments 3 and 4, baseline measurements were specifically included in our design to account for the possibility that intrinsic dynamics, state, or noise could influence the observed effects of neuromodulation. Indeed, if we had compared only post-stimulation connectivity between the real and sham conditions, the effects might have appeared larger. The inclusion of baseline measurements allows us to contextualize and better isolate the neuromodulatory impact by controlling for such endogenous fluctuations. Importantly, the fMRI connectivity measurements, including the baseline, are derived from 10-minute BOLD signal acquisitions, which help mitigate the influence of transient fluctuations and provide a fairly stable estimate of intrinsic connectivity.

      Moreover, regarding the possibility that stimulation effects may depend on the intrinsic state at stimulation onset, we hypothesize that gamma-frequency entrainment induced by tACS could reduce the variability of intrinsic dynamics, promoting a more stable neural state that is favorable for the induction of long-term plasticity.

      As regards the memory scoring, we would like to clarify that the significant improvement observed in the dual iTBS+γtACS condition does not pertain solely to the face–name association. Rather, it concerns the more demanding task of recalling the association between face, name, and occupation. While we agree that the observed effect could be considered modest, it is worth noting that it follows from only 3 minutes of stimulation.

      Reviewer #3 (Public review):

      Summary:

      Borghi and colleagues present results from 4 experiments aimed at investigating the effects of dual γtACS and iTBS stimulation of the precuneus on behavioral and neural markers of memory formation. In their first experiment (n = 20), they find that a 3-minute offline (i.e., prior to task completion) stimulation that combines both techniques leads to superior memory recall performance in an associative memory task immediately after learning associations between pictures of faces, names, and occupation, as well as after a 15-minute delay, compared to iTBS alone (+ tACS sham) or no stimulation (sham for both iTBS and tACS). Performance in a second task probing short-term memory was unaffected by the stimulation condition. In a second experiment (n = 10), they show that these effects persist over 24 hours and up to a full week after initial stimulation. A third (n = 14) and fourth (n = 16) experiment were conducted to investigate neural effects of the stimulation protocol. The authors report that, once again, only combined iTBS and γtACS increases gamma oscillatory activity and neural excitability (as measured by concurrent TMS-EEG) specific to the stimulated area at the precuneus compared to a control region, as well as precuneus-hippocampus functional connectivity (measured by resting state MRI), which seemed to be associated with structural white matter integrity of the bilateral middle longitudinal fasciculus (measured by DTI).

      Strengths:

      Combining non-invasive brain stimulation techniques is a novel, potentially very powerful method to maximize the effects of these kinds of interventions that are usually well-tolerated and thus accepted by patients and healthy participants. It is also very impressive that the stimulation-induced improvements in memory performance resulted from a short (3 min) intervention protocol. If the effects reported here turn out to be as clinically meaningful and generalizable across populations as implied, this approach could represent a promising avenue for treatment of impaired memory functions in many conditions.

      Methodologically, this study is expertly done! I don't see any serious issues with the technical setup in any of the experiments. It is also very commendable that the authors conceptually replicated the behavioral effects of experiment 1 in experiment 2 and then conducted two additional experiments to probe the neural mechanisms associated with these effects. This certainly increases the value of the study and the confidence in the results considerably.

      The authors used a within-subject approach in their experiments, which increases statistical power and allows for stronger inferences about the tested effects. They also individualized stimulation locations and intensities, which should further optimize the signal-to-noise ratio.

      Weaknesses:

      I think one of the major weaknesses of this study is the overall low sample size in all of the experiments (between n = 10 and n = 20). This is, as I mentioned when discussing the strengths of the study, partly mitigated by the within-subject design and individualized stimulation parameters. The authors mention that they performed a power analysis but this analysis seemed to be based on electrophysiological readouts similar to those obtained in experiment 3. It is thus unclear whether the other experiments were sufficiently powered to reliably detect the behavioral effects of interest. In the revised manuscript, the authors provide post-hoc sensitivity analyses that help contextualize the strength of the findings.

      While the authors went to great lengths trying to probe the neural changes likely associated with the memory improvement after stimulation, it is impossible from their data to causally relate the findings from experiments 3 and 4 to the behavioral effects in experiments 1 and 2. This is acknowledged by the authors and there are good methodological reasons for why TMS-EEG and fMRI had to be collected in separate experiments, but readers should keep in mind that this limits inferences about how exactly dual iTBS and γtACS of the precuneus modulate learning and memory.

      We thank the reviewer for the feedback.

      Reviewer #1 (Recommendations for the authors):

      I suggest:

      (1) Removing all mechanistic claims about the precuneus and hippocampus.

      We have softened our claims about the precuneus-hippocampus network.

      (2) Repeating and focusing on the behavioral experiments with a much larger number of images and stronger statistical power to try to demonstrate a compelling behavioral correlate of the proposed stimulation protocol.

      We clarified the misunderstanding relative to the chance level of the behavioral experiments raised by the reviewer.

      Reviewer #2 (Recommendations for the authors):

      Use longer baseline to establish stable gamma level for comparisons in Figure 3

      If we understand correctly, you propose to increase the baseline to establish the gamma oscillatory activity as expressed in Figure 3 (showing the results of experiment 3). Is that right? In the figure, you see a baseline of -100 to 0 ms, which we use for purely graphical reasons, since no activity is usually observable before the TMS pulse. However, to establish the level of gamma, we used a longer baseline correction window ranging from -700 ms to -300 ms (i.e., 400 ms). We added this important information in the cortical oscillation section of the supplementary information (lines 134-135).

      Reviewer #3 (Recommendations for the authors):

      I think that the authors did a great job responding to the concerns raised by the reviewers. All of my own comments have been satisfactorily addressed. I will update my public review to be more concise, so that it only includes the overall assessment of the manuscript, including the strengths and weaknesses, but without the requests for clarification. Strengths and weaknesses remain largely the same, as the authors did not conduct additional experiments.

      Thank you.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      I applaud the authors for providing a thorough response to my comments from the first round of review. The authors have addressed the points I raised on the interpretation of the behavioral results as well as the validation of the model (fit to the data) by conducting new analyses, acknowledging the limitations where required and providing important counterpoints. As a result of this process, the manuscript has considerably improved. I have no further comments and recommend this manuscript for publication.

      We are pleased that our revisions have addressed all the concerns raised by Reviewer #1.

      Reviewer #2 (Public review):

      Summary:

      This manuscript proposes that the use of a latent cause model for assessment of memory-based tasks may provide improved early detection in Alzheimer's Disease as well as more differentiated mapping of behavior to underlying causes. To test the validity of this model, the authors use a previously described knock-in mouse model of AD and subject the mice to several behaviors to determine whether the latent cause model may provide informative predictions regarding changes in the observed behaviors. They include a well-established fear learning paradigm in which distinct memories are believed to compete for control of behavior. More specifically, it's been observed that animals undergoing fear learning and subsequent fear extinction develop two separate memories for the acquisition phase and the extinction phase, such that the extinction does not simply 'erase' the previously acquired memory. Many models of learning require the addition of a separate context or state to be added during the extinction phase and are typically modeled by assuming the existence of a new state at the time of extinction. The Niv research group, Gershman et al. 2017, have shown that the use of a latent cause model applied to this behavior can elegantly predict the formation of latent states based on a Bayesian approach, and that these latent states can facilitate the persistence of the acquisition and extinction memory independently. The authors of this manuscript leverage this approach to test whether deficits in production of the internal states, or the inference and learning of those states, may be disrupted in knock-in mice that show both a build-up of amyloid-beta plaques and a deterioration in memory as the mice age.

      Strengths:

      I think the authors' proposal to leverage the latent cause model and test whether it can lead to improved assessments in an animal model of AD is a promising approach for bridging the gap between clinical and basic research. The authors use a promising mouse model and apply this to a paradigm in which the behavior and neurobiology are relatively well understood - an ideal situation for assessing how a disease state may impact both the neurobiology and behavior. The latent cause model has the potential to better connect observed behavior to underlying causes and may pave a road for improved mapping of changes in behavior to neurobiological mechanisms in diseases such as AD.

      The authors also compare the latent cause model to the Rescorla-Wagner model and a latent state model allowing for better assessment of the latent cause model as a strong model for assessing reinstatement.

      Weaknesses:

      I have several substantial concerns which I've detailed below. These include important details on how the behavior was analyzed, how the model was used to assess the behavior, and the interpretations that have been made based on the model.

      (1) There is substantial data to suggest that during fear learning in mice separate memories develop for the acquisition and extinction phases, with the acquisition memory becoming more strongly retrieved during spontaneous recovery and reinstatement. The Gershman paper, cited by the authors, shows how the latent causal model can predict this shift in latent causes by allowing for the priors to decay over time, thereby increasing the posterior of the acquisition memory at the time of spontaneous recovery. In this manuscript, the authors suggest a similar mechanism of action for reinstatement, yet the model does not appear to return to the acquisition memory after reinstatement, at least based on the simulation and examples shown in figures 1 and 3. More specifically, in figure 1, the authors indicate that the posterior probability of the latent cause, z<sub>A</sub> (the putative acquisition memory), increases, partially leading to reinstatement. This does not appear to be the case as test 3 (day 36) appears to have similar posterior probabilities for z<sub>A</sub> as well as similar weights for the CS as compared to the last days of extinction. Rather, the model appears to mainly modify the weights in the most recent latent cause, z<sub>B</sub> - the putative 'extinction state', during reinstatement. The authors suggest that previous experimental data have indicated that spontaneous recovery or reinstatement effects are due to an interaction of the acquisition and extinction memory. These studies have shown that conditioned responding at a later time point after extinction is likely due to a balance between the acquisition memory and the extinction memory, and that this balance can shift towards the acquisition memory naturally during spontaneous recovery, or through artificial activation of the acquisition memory or inhibition of the extinction memory (see Lacagnina et al. for example). 
Here the authors show that the same latent cause learned during extinction, z<sub>B</sub>, appears to dominate during the learning phase of reinstatement, with rapid learning to the context - the weight for the context goes up substantially on day 35 - in z<sub>B</sub>. This latent cause, z<sub>B</sub>, dominates at the reinstatement test, and due to the increased associative strength between the context and shock, there is a strong CR. For the simulation shown in figure 1, it's not clear why a latent cause model is necessary for this behavior. This leads to the next point.

      We would like to first clarify that our behavioral paradigm did not last for 36 days, as noted by the reviewer. Our reinstatement paradigm contained 7 phases and 36 trials in total: acquisition (3 trials), test 1 (1 trial), extinction 1 (19 trials), extinction 2 (10 trials), test 2 (1 trial), unsignaled shock (1 trial), test 3 (1 trial). The day is labeled under each phase in Figure 2A. 

      We have provided explanations on how the reinstatement is explained by the latent cause model in the first round of the review. Briefly, both acquisition and extinction latent causes contribute to the reinstatement (test 3). The former retains the acquisition fear memory, and the latter has the updated w<sub>context</sub> from unsignaled shock. Although the reviewer is correct that the z<sub>B</sub> in Figure 1D makes a great contribution during the reinstatement, we would like to argue that the elevated CR from test 2 (trial 34) to test 3 (trial 36) is the result of the interaction between z<sub>A</sub> and z<sub>B</sub>.

      We provided Author response image 1 using the same data in Figure 1D and 1E to further clarify this point. The posterior probability of z<sub>A</sub> increased after an unsignaled shock (trial 35), which may be attributed to the return of acquisition fear memory. The posterior probability of z<sub>A</sub> then decreased again after test 3 (trial 36) because there was no shock in this trial. Along with the weight change, the expected shock changed substantially in these three trials, resulting in reinstatement. Note that the mapping of expected shock to CR in the latent cause model is controlled by the parameters θ and λ. Once the expected shock exceeds the threshold θ, the CR will increase rapidly if λ is smaller.
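
      The role of θ and λ described above can be illustrated with a minimal sketch. It assumes the probit-style link CR = 1 − Φ(θ; V, λ) from Gershman et al. (2017), where V is the expected shock, and it treats λ as the standard deviation of the Gaussian; both the function name and that reading of λ are assumptions made for illustration, not the code used in the study.

```python
import math

def cr_from_expected_shock(v, theta, lam):
    """Map expected shock v to a conditioned response in [0, 1].

    Assumed link: CR = 1 - Phi(theta; v, lam), which equals the standard
    normal CDF evaluated at (v - theta) / lam. Here lam is treated as a
    standard deviation (an assumption of this sketch).
    """
    z = (v - theta) / lam
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical values: the same expected shock just above the threshold
# yields a much larger CR when lam is small (steeper curve).
for lam in (0.1, 0.01):
    print(lam, cr_from_expected_shock(0.4, theta=0.3, lam=lam))
```

      Under this sketch the CR equals 0.5 exactly at V = θ, and a smaller λ makes the transition around θ sharper in both directions.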

      Lastly, accepting the idea that separate memories are responsible for acquisition and extinction in the memory modification paradigm, the latent cause model (LCM) is a rational candidate for modeling this idea. Please see the following reply on why a simple model like the Rescorla-Wagner (RW) model is not sufficient to fully explain the behaviors observed in this study.

      Author response image 1.

      The sum posterior probability (A), the sum of associative weight of CS (B), and the sum of associative weight of context (C) of acquisition and extinction latent causes in Figure 1D and 1E.

      (2) The authors compared the latent cause model to the Rescorla-Wagner model. This is very commendable, particularly since the latent cause model builds upon the RW model, so it can serve as an ideal test for whether a more simplified model can adequately predict the behavior. The authors show that the RW model cannot successfully predict the increased CR during reinstatement (Appendix figure 1). Yet there are some issues with the way the authors have implemented this comparison:

      (2A) The RW model is a simplified version of the latent cause model and so should be treated as a nested model when testing, or at a minimum, the number of parameters should be taken into account when comparing the models using a method such as the Bayesian Information Criterion, BIC.

      We acknowledge that the number of parameters was not taken into consideration when we compared the models. We thank the reviewer for the suggestion to use the Bayesian Information Criterion (BIC). However, we did not use BIC in this study for the following reasons. We wanted a model that can explain fear conditioning, extinction and reinstatement, so our first priority was to fit the test phases. Models that simulate CRs well in non-test phases can yield lower BIC values even if they fail to capture reinstatement. When we calculate the BIC by using the half normal distribution (μ = 0, σ = 0.3) as the likelihood for prediction error in each trial, the BIC of the 12-month-old control is -37.21 for the RW model (Appendix 1–figure 1C) and -11.60 for the LCM (Figure 3C). Based on this result, the RW model would be preferred, yet the LCM was penalized by the number of parameters, even though it fit better in trial 36. Because we did not think this aligned with our purpose to model reinstatement, we chose to rely on the practical criteria to determine whether the estimated parameter set is accepted or not for our purpose (see Materials and Methods). The number of accepted samples can thus roughly be seen as the model's ability to explain the data in this study. These exclusion criteria then created imbalances in accepted samples across models (Appendix 1–figure 2). In the RW model, only one or two samples met the criteria, preventing meaningful statistical comparisons of BIC within each group. Overall, though we agree that BIC is one of the reasonable metrics in model comparison, we do not think it aligns with our purpose in this study.
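
      For readers who want to follow the BIC calculation described above, here is a minimal sketch. It assumes the standard definition BIC = k·ln(n) − 2·ln(L) and a half-normal likelihood (σ = 0.3) over the absolute per-trial prediction error. The error values and parameter counts below are hypothetical placeholders, so the output is not expected to reproduce the figures quoted in the response.

```python
import math

def half_normal_logpdf(x, sigma):
    """Log density of a half-normal distribution (mode 0, scale sigma) at x >= 0."""
    if x < 0:
        raise ValueError("half-normal density is defined for x >= 0")
    return 0.5 * math.log(2.0 / math.pi) - math.log(sigma) - x * x / (2.0 * sigma * sigma)

def bic(abs_errors, n_params, sigma=0.3):
    """BIC = k * ln(n) - 2 * ln(L), with one absolute prediction error per trial."""
    n = len(abs_errors)
    log_lik = sum(half_normal_logpdf(e, sigma) for e in abs_errors)
    return n_params * math.log(n) - 2.0 * log_lik

# Hypothetical example: 36 trials with identical errors for two models.
# Only the parameter count differs, so the difference is the pure penalty.
errors = [0.1] * 36
print(bic(errors, n_params=3))   # e.g. a 3-parameter model
print(bic(errors, n_params=10))  # e.g. a model with more free parameters
```

      With identical errors, the second value exceeds the first by exactly (10 − 3)·ln(36), which shows how a model with more parameters can be disfavored by BIC even when it fits a single critical trial better.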

      (2B) The RW model provides the associative strength between stimuli and does not necessarily require a linear relationship between V and the CR. This is the case in the original RW model as well as in the LCM. To allow for better comparison between the models, the authors should be modeling the CR in the same manner (using the same probit function) in both models. In fact, there are many instances in which a sigmoid has been applied to RW associative strengths to predict CRs. I would recommend modeling CRs in the RW as if there is just one latent cause. Or perhaps run the analysis for the LCM with just one latent cause - this would effectively reduce the LCM to RW and keep any other assumptions identical across the models.

      Regarding the suggestion to run the analysis using the LCM with one latent cause, we agree that this method is almost identical to the RW model, which is also mentioned in the original paper (Gershman et al., 2017). Importantly, it would also eliminate the RW model’s advantage of assigning distinct learning rates to different stimuli, highlighted in the next comment (2C).

      We thank the reviewer for suggesting applying the transformation of associative strength (V) to CR as in the LCM. We examined this possibility by heuristically selecting parameter values to test how such a transformation would influence the RW model (Author response image 2A). Specifically, we set α<sub>CS</sub> = 0.5, α<sub>context</sub> = 1, β = 1, and introduced the additional parameters θ and λ, as in the LCM. This parameter set is determined heuristically to address the reviewer’s concern about a higher learning rate of context. The dark blue line is the plain associative strength. The remaining lines are CR curves under different combinations of θ and λ.

      Consistent with the reviewer’s comment, under certain parameter settings (θ = 0.01, λ = 0.01), the extended RW model can reproduce higher CRs at test 3, thereby approximating the discrimination index observed in the 12-month-old control group. However, this modification changes the characteristics of CRs in other phases from those in the plain RW model. In the acquisition phase, the CRs rise more sharply. In the extinction phase, the CRs remain high when θ is small. Though changing λ can modulate the steepness, the CR curve is flat on the second day of the extinction phase, which does not reproduce the pattern in observed data (Figure 2B). These trade-offs suggest that the RW model with the sigmoid transformation does not improve fit quality and, in fact, sacrifices features that were well captured by simpler RW simulations (Appendix 1–figure 1A to 1D). To further evaluate this extended RW model (RW*), we applied the same parameter estimation method used in the LCM for individual data (see Materials and Methods). For each animal, α<sub>CS</sub>, α<sub>context</sub>, β, θ, and λ were estimated with their lower and upper bounds set as previously described (see Appendix 1, Materials and Methods). The results showed that the number of accepted samples slightly increased compared to the RW model without sigmoidal transformation of CR (RW* vs. RW in Author response image 2B, 2C). However, this improvement did not surpass the LCM (RW* vs. LCM in Author response image 2B, Author response image 1C). Overall, these results suggest that while using the same method to map the expected shock to CR, the RW model does not outperform the LCM. Practically, further extension, such as adding novel terms, might improve the fitting level. We would like to note that such extensions should be carefully validated if they are reasonable and necessary for an internal model, which is beyond the scope of this study. 
We hope this addresses the reviewer's concerns about the implementation of the RW model. 

      Author response image 2.

      Simulation (A) and parameter estimation (B and C) in the extended Rescorla-Wagner model.

      (2C) In the paper, the model fits for the alphas in the RW model are the same across the groups. Were the alphas for the two models kept as free variables? This is an important question as it gets back to the first point raised. Because the modeling of the reinstatement behavior with the LCM appears to be mainly driven by latent cause z<sub>B</sub>, the extinction memory, it may be possible to replicate the pattern of results without requiring a latent cause model. For example, the 12-month-old App NL-G-F mice behavior may have a deficit in learning about the context. Within the RW model, if the alpha for context is set to zero for those mice, but kept higher for the other groups, say alpha_context = 0.8, the authors could potentially observe the same pattern of discrimination indices in figure 2G and 2H at test. Because the authors don't explicitly state which parameters might be driving the change in the DI, the authors should show in some way that their results cannot simply be due to poor contextual learning in the 12-month-old App NL-G-F mice, as this can presumably be predicted by the RW model. The authors' model fits using RW don't show this, but this is because they don't consider the possibility that the alpha for context might be disrupted in the 12-month-old App NL-G-F mice. Of course, using the RW model with these alphas won't lead to as nice of fits of the behavior across acquisition, extinction, and reinstatement as the authors' LCM; the number of parameters is substantially reduced in the RW model. Yet the important pattern of the DI would be replicated with the RW model (if I'm not mistaken), which is the important test for assessment of reinstatement.

      We would like to clarify that we estimated three parameters in the RW model for individuals: α<sub>CS</sub>, α<sub>context</sub>, and β. Even so, many samples did not satisfy our criteria (Appendix 1–figure 2). Please refer to the “Evaluation of model fit” in Appendix 1 and the legend of Appendix 1–figure 1A to 1D, where we have written the estimated parameter values.

      We do not agree that disabling contextual learning by setting α<sub>context</sub> to 0 in the RW model can explain the CR curve of 12-month-old AD mice well. Specifically, the RW model cannot capture the between-day extinction dynamics (i.e., the increase in CR at the beginning of day 2 extinction) and the higher CR at test 3 relative to test 2 (i.e., DI between test 3 and test 2 is greater than 0.5). In addition, because the context input (= 0.2) was relatively lower than the CS input (= 1), and there is only a single unsignaled shock trial, even setting α<sub>context</sub> = 1 results in only a limited increase in CR (Appendix 1–figure 1A to 1D; see also Author response image 2). Thus, the RW model cannot replicate the reinstatement effect or the critical pattern of discrimination index, even under conditions of stronger contextual learning.
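
      The argument about the context learning rate can be checked with a minimal Rescorla-Wagner sketch. It assumes the textbook update Δw<sub>i</sub> = α<sub>i</sub>·β·(US − V)·x<sub>i</sub>, the 36-trial schedule listed earlier in this response, a CS input of 1 and a context input of 0.2. The function name, the binary US coding, and the use of raw associative strength V (no CR transformation) are assumptions of this illustration, not the authors' implementation.

```python
def simulate_rw(alpha_cs, alpha_context, beta, context_input=0.2):
    """Return the predicted shock V on each of the 36 trials (before updating)."""
    # (CS present, US present) for each trial, following the stated schedule.
    schedule = (
        [(1, 1)] * 3      # acquisition
        + [(1, 0)] * 1    # test 1
        + [(1, 0)] * 19   # extinction 1
        + [(1, 0)] * 10   # extinction 2
        + [(1, 0)] * 1    # test 2 (trial 34)
        + [(0, 1)] * 1    # unsignaled shock (trial 35), context only
        + [(1, 0)] * 1    # test 3 (trial 36)
    )
    w_cs, w_ctx = 0.0, 0.0
    predictions = []
    for cs, us in schedule:
        x_cs, x_ctx = float(cs), context_input
        v = w_cs * x_cs + w_ctx * x_ctx      # summed prediction
        predictions.append(v)
        delta = us - v                       # prediction error
        w_cs += alpha_cs * beta * delta * x_cs
        w_ctx += alpha_context * beta * delta * x_ctx
    return predictions

# Hypothetical comparison: strong vs. absent contextual learning.
for a_ctx in (1.0, 0.0):
    v = simulate_rw(alpha_cs=0.5, alpha_context=a_ctx, beta=1.0)
    print(a_ctx, "test 2:", v[33], "test 3:", v[35])
```

      In this sketch, a single unsignaled shock can raise the test 3 prediction only through the context weight scaled by the small context input, so the increase from test 2 to test 3 stays small even with α<sub>context</sub> = 1, and there is no increase at all with α<sub>context</sub> = 0.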

      (3) As stated by the authors in the introduction, the advantage of the fear learning approach is that the memory is modified across the acquisition-extinction-reinstatement phases. Although perhaps not explicitly stated by the authors, the post-reinstatement test (test 3) is the crucial test for whether there is reactivation of a previously stored memory, with the general argument being that the reinvigorated response to the CS can't simply be explained by relearning the CS-US pairing, because re-exposure to the US alone leads to an increased response to the CS at test. Of course there are several explanations for why this may occur, particularly when also considering the context as a stimulus. This is what I understood to be the justification for the use of a model, such as the latent cause model, that may better capture and compare these possibilities within a single framework. As such, it is critical to look at the level of responding to both the context alone and to the CS. It appears that the authors only look at the percent freezing during the CS, and it is not clear whether this is due to the contextual-US learning during the US re-exposure or to increased responding to the CS - presumably caused by reactivation of the acquisition memory. The authors do perform a comparison between the preCS and CS period, but it is not clear whether this is taken into account in the LCM. For example, the instance of the model shown in figure 1 indicates that the 'extinction cause', or cause z6, develops a strong weight for the context during the reinstatement phase of presenting the shock alone. This state then leads to increased freezing during the final CS probe test as shown in the figure. If they haven't already, I think the authors must somehow incorporate these different phases (CS vs ITI) into their model, particularly since this type of memory retrieval that depends on assessing latent states is specifically why the authors justified using the latent causal model. 
In more precise terms, it's not clear whether the authors incorporate a preCS/ITI period each day the cue is presented as a vector of just the context in addition to the CS period in which the vector contains both the context and the CS. Based on the description, it seemed to me that they only model the CRs during the CS period on days when the CS is presented, and thereby the context is only ever modeled on its own (as just the context by itself in the vector) on extinction days when the CS is not presented. If they are modeling both timepoints each day that the CS is presented, then I would recommend explicitly stating this in the methods section.

      In this study, we did not model the preCS freezing rate, and we thank the reviewer for the suggestion to model preCS periods as separate context-only trials. In our view, however, this approach is not consistent with the assumptions of the LCM. Our rationale is that the available periods of context and the CS are different. We assume that observation of the context lasts from preCS to CS. If we simulate both preCS (context) and CS (context and tone), the weight of context would be updated twice. Instead, we follow the same method as described in the original code from Gershman et al. (2017) to consider the context effect. We agree that explicitly modeling preCS could provide additional insights, but we believe it would require modifying or extending the LCM. We consider this an important direction for future research, but it is outside the scope of this study.

      (4) The authors fit the model using all data points across acquisition and learning. As one of the other reviewers has highlighted, it appears that there is a high chance for overfitting the data with the LCM. Of course, this would result in much better fits than models with substantially fewer free parameters, such as the RW model. As mentioned above, the authors should use a method that takes into account the number of parameters, such as the BIC.

      Please refer to the reply to public review (2A) for the reason we did not take the suggestion to use BIC. In addition, we feel that we have adequately addressed the concern of overfitting in the first round of the review. 

      (5) The authors have stated that they do not think the Barnes maze task can be modeled with the LCM. Whether or not this is the case, if the authors do not model this data with the LCM, the Barnes maze data doesn't appear valuable to the main hypothesis. The authors suggest that more sophisticated models such as the LCM may be beneficial for early detection of diseases such as Alzheimer's, so the Barnes maze data is not valuable for providing evidence of this hypothesis. Rather, the authors make an argument that the memory deficits in the Barnes maze mimic the reinstatement effects, providing support that memory is disrupted similarly in these mice. Although the authors state that the deficits in memory retrieval are similar across the two tasks, the authors are not explicit as to the precise deficits in memory retrieval in the reinstatement task - it's a combination of overgeneralizing latent causes during acquisition, poor learning rate, and over-differentiation of the stimuli.

      We would like to clarify that we valued the latent cause model not solely because it is more sophisticated and fits more data points, but because it is an internal model that implicates the cognitive process. Please also see the reply to the recommendations to authors (3) about the reason why we did not take the suggestion to remove this data.

      Reviewer #3 (Public review):

      Summary:

      This paper seeks to identify underlying mechanisms contributing to memory deficits observed in Alzheimer's disease (AD) mouse models. By understanding these mechanisms, they hope to uncover insights into subtle cognitive changes early in AD to inform interventions for early-stage decline.

      Strengths:

      The paper provides a comprehensive exploration of memory deficits in an AD mouse model, covering early and late stages of the disease. The experimental design was robust, confirming age-dependent increases in Aβ plaque accumulation in the AD model mice and using multiple behavior tasks that collectively highlighted difficulties in maintaining multiple competing memory cues, with deficits most pronounced in older mice.

      In the fear acquisition, extinction, and reinstatement task, AD model mice exhibited a significantly higher fear response after acquisition compared to controls, as well as a greater drop in fear response during reinstatement. These findings suggest that AD mice struggle to retain the fear memory associated with the conditioned stimulus, with the group differences being more pronounced in the older mice.

      In the reversal Barnes maze task, the AD model mice displayed a tendency to explore the maze perimeter rather than the two potential target holes, indicating a failure to integrate multiple memory cues into their strategy. This contrasted with the control mice, which used the more confirmatory strategy of focusing on the two target holes. Despite this, the AD mice were quicker to reach the target hole, suggesting that their impairments were specific to memory retrieval rather than basic task performance.

      The authors strengthened their findings by analyzing their data with a leading computational model, which describes how animals balance competing memories. They found that AD mice showed somewhat of a contradiction: a tendency to both treat trials as more alike than they are (lower α) and similar stimuli as more distinct than they are (lower σx) compared to controls.

      Weaknesses:

      While conceptually solid, the model struggles to fit the data and to support the key hypothesis about AD mice's inability to retain competing memories. These issues are evident in Figure 3:

      (1) The model misses trends in the data, including the gradual learning of fear in all groups during acquisition, the absence of a fear response at the start of the experiment, and the faster return of fear during reinstatement compared to the gradual learning of fear during acquisition. It also underestimates the increase in fear at the start of day 2 of extinction, particularly in controls.

      (2) The model explains the higher fear response in controls during reinstatement largely through a stronger association to the context formed during the unsignaled shock phase, rather than to any memory of the conditioned stimulus from acquisition (as seen in Figure 3C). In the experiment, however, this memory does seem to be important for explaining the higher fear response in controls during reinstatement (as seen in Author Response Figure 3). The model does show a necessary condition for memory retrieval, which is that controls rely more on the latent causes from acquisition. But this alone is not sufficient, since the associations within that cause may have been overwritten during extinction. The Rescorla-Wagner model illustrates this point: it too uses the latent cause from acquisition (as it only ever uses a single cause across phases) but does not retain the original stimulus-shock memory, updating and overwriting it continuously. Similarly, the latent cause model may reuse a cause from acquisition without preserving its original stimulus-shock association.

      These issues lead to potential overinterpretation of the model parameters. The differences in α and σx are being used to make claims about cognitive processes (e.g., overgeneralization vs. over differentiation), but the model itself does not appear to capture these processes accurately.

      The authors could benefit from a model that better matches the data and captures the retention and retrieval of fear memories across phases. While they explored alternatives, including the Rescorla-Wagner model and a latent state model, these showed no meaningful improvement in fit. This highlights a broader issue: these models are well-motivated but may not fully capture observed behavior.

      Conclusion:

      Overall, the data support the authors' hypothesis that AD model mice struggle to retain competing memories, with the effect becoming more pronounced with age. While I believe the right computational model could highlight these differences, the current models fall short in doing so.

      We thank the reviewer for the insightful comments. For comments (1) and (2), please refer to our previous author response to comments #26 and #27. We recognize that the models tested in this study have limitations and, as noted, do not fully capture all aspects of the observed behavioral data. We see this as an important direction for future research and value the reviewer’s suggestions.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      I have maintained some of the main concerns included in the first round of reviews as I think they remain concerns with the new draft, even though the authors have included substantially more analysis of their data, which is appreciated. I particularly found the inclusion of the comparative modeling valuable, although I think the analysis comparing the models should be improved.

      (1) This relates to point 1 in the public assessment or #16 in the response to reviewers from the authors. The authors raise the point that even a low posterior can drive behavioral expression (lines 361-365 in the response to authors), and so the acquisition latent cause may partially drive reinstatement. Yet in the simulation shown in figure 1D, this does not seem to be the case. As I mentioned in the public response, in figure 1, the posteriors for z<sub>A</sub> are similar on day 34 and day 36, yet only on day 36 is there a strong CR. At least in this example, it does not appear that z<sub>A</sub> contributes to the increased responding from day 34 (test 2) to day 36 (test 3). There may be a slight increase in z<sub>1</sub> in figure 3C, but the dominant change from day 34 to day 36 appears to be the increase in the posterior of z<sub>3</sub> and the substantial increase in w<sub>3</sub>. The authors then cite several papers which have shown the shift in balance between what is the putative acquisition memory and extinction memory (i.e. Lacagnina et al.). Yet I do not see how this modeling fits with most of the previous findings. For example, in the Lacagnina et al. paper, activation of the acquisition ensemble or inhibition of the extinction ensemble drives freezing, whereas the opposite pattern reduces freezing. What appears to be the pattern in the modeling in this paper is primarily learning of context in the extinction latent cause to predict the shock. As I mention in point 2C of the public review, it's not clear why this pattern of results would require a latent cause model. Would a high alpha for context and not the CS not give a similar pattern of results in the RW model? At least for giving similar results of the DIs in figure 2?

      First, we would like to clarify that the x-axis in Figure 1D is labeled “Trial,” not “Day.” Please refer to the reply to public review (1), where we clarified the posterior probability of the latent cause from trials 34 to 36. Second, although we did not have direct neural circuit evidence in this study, we discussed the similarities between previous findings and the modeling in the first review. Briefly, our main point focuses on the interaction between acquisition and extinction memory. In other words, responses at different times arise from distinct internal states made up of competing memories. We assume that the reviewer expects a modeling result showing nearly full recovery of acquisition memory, which aligns with previous findings where optogenetic activation of the acquisition engram can partially mimic reinstatement (Zaki et al., 2022; see also the response to comment #12 in the first round of review). We acknowledge that such a modeling result cannot be achieved with the latent cause model and see it as a potential future direction for model improvement.

      Please also refer to the reply to public review (2) about how a high alpha for context in the RW model cannot explain the pattern we observed in the reinstatement paradigm.

      (2) This is related to point 3 in the public comments and #13 in the response to reviewers. I raised the question of comparing the preCS/ITI period with the CS period, but my main point was why not include these periods in the LCM itself, as mentioned in more detail in point 3 in the current public review. The inclusion of the comparisons the authors performed helped, but my main point was that the authors could have a better measure of w<sub>context</sub> if they included the preCS period as a stimulus each day (when only the context is included in the stimulus). This would provide better estimates of w<sub>context</sub>. As stated in the public review, perhaps the authors did this, but from my understanding of the methods this was not the case; rather, it seems the authors only included the CS period for CRs within the model (at least on days when the CS was present).

      Please refer to the reply to public review (3) about the reason why we did not model the preCS freezing rate.

      (3) This relates to point 4 in the public review and #15 and #24 in the response to authors. The authors have several points for why the two experiments are similar and how results may be extrapolated - lines 725-733. The first point is that associative learning is fundamental in spatial learning. I'm not sure that this broad connection between the two studies is particularly insightful for why one supports the other as associative learning is putatively involved in most behavioral tasks. In the second point about reversals, why not then use a reversal paradigm that would be easier to model with LCM? This data is certainly valuable and interesting, yet I don't think it's helpful for this paper to state qualitatively the similarities in the potential ways a latent cause framework might predict behavior on the Barnes maze. I would recommend that the authors either model the behavior with LCM, remove the experiment from the paper, or change the framing of the paper that LCM might be an ideal approach for early detection of dementia or Alzheimer's disease.

      We would like to clarify that our aim was not to present the LCM as an ideal tool for early detection of AD symptoms. Rather, our focus is on the broader idea of utilizing internal models and estimating individual internal states in early-stage AD. Regarding a reversal paradigm that would be easier to model with the LCM, the most straightforward approach would be to use another type of fear conditioning paradigm and then examine the extent to which similar behavioral characteristics are observed between paradigms within subjects. However, re-exposing the same mice to such paradigms is constrained by strong carry-over effects, limiting the feasibility of this experiment. Other behavioral tasks relevant to AD that avoid shock generally involve action selection for subsequent observation (Webster et al., 2014), which falls outside the structure of the LCM. Our rationale for including the Barnes maze task is that spatial memory deficit is implicated in the early stage of AD, making it relevant for translational research. While we acknowledge that exact modeling of Barnes maze behavior would require a more sophisticated model (as discussed in the first round of review), our intention in using the reversal Barnes maze paradigm is to suggest that a presumed memory modification process also operates in a non-fear-conditioning paradigm. We also discussed whether similar deficits in memory modification could be observed across the two behavioral tasks.

      (4) Reviewer # mentioned that the change in pattern of behavior only shows up in the older mice questioning the clinical relevance of early detection. I do think this is a valid point and maybe should be addressed. There does seem to be a bit of a bump in the controls on day 23 that doesn't appear in the 6-month group. Perhaps this was initially a spontaneous recovery test indicated by the dotted vertical line? This vertical line does not appear to be defined in the figure 1 legend, nor in figures 2 and 3.

      We would like to emphasize that the App<sup>NL-G-F</sup> knock-in mouse is widely considered a model of early-stage AD, characterized by Aβ accumulation with little to no neurofibrillary tangle pathology or neuronal loss (see Introduction). By examining different ages, we can assess the contribution of both the amount and duration of Aβ accumulation as well as age-related factors. Modeling the deficit in the memory modification process in the older App<sup>NL-G-F</sup> knock-in mice, we suggested a diverged internal state in early-stage AD in older age, and this does not diminish the relevance of the model for studying early cognitive changes in AD.

      We would also like to clarify again that the x-axis in the figure is “Trial,” not “Day.” The vertical dashed lines in these figures indicate phase boundaries, and they were defined in the figure legend: in Figure 1C, “The vertical dashed lines separate the phases.”; in Figure 2B, “The dashed vertical line separates the extinction 1 and extinction 2 phases.”; in Figure 3, “The vertical dashed lines indicate the boundaries of phases.”

      (5) Are the examples in figure 3 good examples? The example for the 12-month-old control shows a substantial increase in weights for the context during test 3, but not for the CS. Yet in the bar plots in Figure 4 G and H, this pattern seems to be different. The weights for the context appear to substantially drop in the "after extinction" period as compared to the "extinction" period. It's hard to tell the change from "extinction" to "after extinction" for the CS weights (the authors change the y-axis for the CS weights but not for the context weights from panels G to H).

      We would like to clarify that in Figure 3C, the increase in weights for context does not occur during test 3 (trial 36), as noted by the reviewer; rather, it occurs during the unsignaled shock phase (trial 35).

      We suspect that the reviewer may have understood the labels on the left in Figure 4, “Acquisition”, “Extinction”, and “After extinction”, to indicate time points. However, the data shown in Figure 4C to 4H are all from the same time point: test 3 (trial 36). The grouping reflects the classification of latent causes based on the trial in which they were inferred. In addition, for Figures 4G and 4H, the y‐axis limits were not set identically because the data range for “Sum of w<sub>CS</sub>” varied. This was done to ensure the visibility of all data points. In Figure 4, each dot represents one animal. Take Figure 3D as an example. The point in Figure 4G is the sum of w<sub>3</sub> and w<sub>4</sub> in trial 36, and the point in Figure 4H is w<sub>5</sub> in trial 36, where the subscript numerals indicate the latent cause index. We hope this addresses the reviewer’s question about the difference between the two figures.


      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      The authors show certain memory deficits in a mouse knock-in model of Alzheimer's Disease (AD). They show that the observed memory deficits can be explained by a computational model, the latent cause model of associative memory. The memory tasks used include the fear memory task (CFC) and the 'reverse' Barnes maze. Research on AD is important given its known huge societal burden. Likewise, better characterization of the behavioral phenotypes of genetic mouse models of AD is also imperative to advance our understanding of the disease using these models. In this light, I applaud the authors' efforts.

      Strengths:

      (1) Combining computational modelling with animal behavior in genetic knock-in mouse lines is a promising approach, which will be beneficial to the field and potentially explain any discrepancies in results across studies as well as provide new predictions for future work.

      (2) The authors' usage of multiple tasks and multiple ages is also important to ensure generalization across memory tasks and 'modelling' of the progression of the disease.

      Weaknesses:

      [#1] (1) I have some concerns regarding the interpretation of the behavioral results. Since the computational model then rests on the authors' interpretation of the behavioral results, it, in turn, makes judging the model's explanatory power difficult as well. For the CFC data, why do knock-in mice have stronger memory in test 1 (Figure 2C)? Does this mean the knock-in mice have better memory at this time point? Is this explained by the latent cause model? Are there some compensatory changes in these mice leading to better memory? The authors use a discrimination index across tests to infer a deficit in re-instatement, but this indicates a relative deficit in re-instatement from memory strength in test 1. The interpretation of these differential DIs is not straightforward. This is evident when test 1 is compared with test 2, i.e., the time point after extinction, which also shows a significant difference across groups, Figure 2F, in the same direction as the re-instatement. A clarification of all these points will help strengthen the authors' case.

      We thank the reviewer for the critical comments. According to the latent cause framework, the strength of the memory is influenced by at least two parameters: the associative weight between CS and US given a latent cause, and the posterior probability of that latent cause. The modeling results showed that a higher posterior probability of the acquisition latent cause, but not a higher associative weight, drove the higher test 1 CR in App<sup>NL-G-F</sup> mice (Results and Discussion; Figure 4 – figure supplement 3B, 3C). In terms of the posterior, we agree that App<sup>NL-G-F</sup> mice have strong fear memory. On the other hand, this suggests that App<sup>NL-G-F</sup> mice exhibited a tendency toward overgeneralization, favoring modification of old memories, which adversely affected the ability to retain competing memories. The strong memory in test 1 would be a compensatory effect of overgeneralization.
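
      The two contributions described above can be sketched as follows, assuming the expected US is the posterior-weighted sum of each latent cause's stimulus-weight product. The function name, feature layout, and all numbers are hypothetical and are not taken from the manuscript; any mapping from expected US to freezing is omitted.

```python
def expected_us(posteriors, weights, stimulus):
    """Posterior-weighted US prediction (illustrative sketch).

    posteriors: probability of each latent cause k
    weights:    weights[k][d], associative weight of feature d under cause k
    stimulus:   stimulus[d], feature intensities, e.g. [CS, context]
    """
    total = 0.0
    for p_k, w_k in zip(posteriors, weights):
        total += p_k * sum(w * x for w, x in zip(w_k, stimulus))
    return total


# Hypothetical values: cause 0 is acquisition-like, cause 1 is extinction-like.
weights = [[0.8, 0.1], [0.0, 0.1]]
stimulus = [1.0, 1.0]  # CS and context both present

# Same weights, different posterior over causes.
high_posterior = expected_us([0.9, 0.1], weights, stimulus)
low_posterior = expected_us([0.4, 0.6], weights, stimulus)
```

      With identical weights, shifting posterior mass toward the acquisition-like cause alone raises the prediction, which is the sense in which posterior rather than weight can drive a higher CR.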

      To estimate the magnitude of reinstatement, one would have to compare at least the CRs between test 2 (extinction) and test 3 (reinstatement), as well as those between test 1 (acquisition) and test 3. These comparisons represent the extent to which the memory at reinstatement is far from that at extinction and close to that at acquisition. Since the discrimination index (DI) has been widely used as a normalized measure of the extent to which a system can distinguish between two conditions, we applied DI consistently to the behavioral and simulated data in the reinstatement experiment and to the behavioral data in the reversal Barnes maze experiment, allowing us to evaluate the discriminability of an agent in these experiments. In addition, we examined the correlation between DI and the estimated parameters, enabling us to explore how individual discriminability may relate to the internal state. We have already discussed the differences in DI between test 3 and test 1, as well as CR in test 1 between control and App<sup>NL-G-F</sup> mice, in the manuscript and further elaborated on this point in Line 232, 745-748.
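
      As a rough illustration of a normalized two-condition comparison, the sketch below assumes a common contrast form, DI = (A − B) / (A + B). This formula, the function name, and the example freezing rates are assumptions for illustration; the exact DI definition used in the manuscript is not reproduced here.

```python
def discrimination_index(cr_a, cr_b):
    """Normalized contrast between two conditioned responses.

    Assumed form: (cr_a - cr_b) / (cr_a + cr_b), bounded in [-1, 1]
    for non-negative inputs. Returns 0.0 when both responses are zero.
    """
    total = cr_a + cr_b
    if total == 0:
        return 0.0  # no response in either condition: no discrimination
    return (cr_a - cr_b) / total


# Hypothetical freezing rates (fractions of time) in test 3 and test 2.
di_test3_vs_test2 = discrimination_index(0.6, 0.2)
```

      Under this assumed form, a positive value indicates a higher CR in the first condition, and swapping the arguments flips the sign.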

      [#2] (2) I have some concerns regarding the interpretation of the Barnes maze data as well, where there already seems to be a deficit in the memory at probe test 1 (Figure 6C). Given that there is already a deficit in memory, would not a more parsimonious explanation of the data be that general memory function in this task is impacted in these mice, rather than the authors' preferred interpretation? How does this memory weakening fit with the CFC data showing stronger memories at test 1? While I applaud the authors for using multiple memory tasks, I am left wondering if the authors tried fitting the latent cause model to the Barnes maze data as well.

      While we agree that the deficits shown in probe test 1 may imply impaired memory function in App<sup>NL-G-F</sup> mice in this task, it would be difficult to explain this solely in terms of impairments in general memory function. The learning curve and the daily strategy changes suggested that App<sup>NL-G-F</sup> mice have virtually intact learning ability in the initial training phase (Figure 6B, 6F, Figure 6 – figure supplement 1 and 3). For the correspondence between reinstatement and reversal Barnes maze learning in terms of the memory modification process, please also see our reply to comment #24. We have explained why we did not fit the latent cause model to the Barnes maze data in the provisional response.

      [#3] (3) Since the authors use the behavioral data for each animal to fit the model, it is important to validate that the fits for the control vs. experimental groups are similar to the model (i.e., no significant differences in residuals). If that is the case, one can compare the differences in model results across groups (Figures 4 and 5). Some further estimates of the performance of the model across groups would help.

      We have added the residual (i.e., observed CR minus simulated CR) in Figure 3 – figure supplement 1D and 1E. The fit was similar between control and App<sup>NL-G-F</sup> mice groups in the test trials, except test 3 in the 12-month-old group. The residual was significantly higher in the 12-month-old control mice than App<sup>NL-G-F</sup> mice, suggesting the model underestimated the reinstatement in the control, yet the DI calculated from the simulated CR replicates the behavioral data (Figure 3 – figure supplement 1A to 1C). These results suggest that the latent cause model fits our data with little systematic bias such as an overestimation of CR for the control group in the reinstatement, supporting the validity of the comparisons in estimated parameters between groups. These results and discussion have been added in the manuscript Line 269-276.
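
      The residual defined above (observed CR minus simulated CR) can be computed per trial as in the following sketch. The function name and the example values are hypothetical and do not come from the study's data.

```python
from statistics import mean


def residuals(observed, simulated):
    """Per-trial residuals: observed CR minus simulated CR."""
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have the same length")
    return [o - s for o, s in zip(observed, simulated)]


# Hypothetical freezing rates for three test trials of one animal.
observed_cr = [0.70, 0.20, 0.50]
simulated_cr = [0.65, 0.25, 0.35]

res = residuals(observed_cr, simulated_cr)
mean_residual = mean(res)  # positive mean: the model underestimates the CR
```

      A positive residual means the model underestimates the observed response on that trial; comparing residuals between groups trial by trial is one way to check for group-specific bias in the fit.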

      One may notice that the latent cause model overestimated the CR in acquisition trials in all groups in Figure 3 – figure supplement 1D and 1E. We have discussed this point in the replies to comments #26 and #34 raised by Reviewer 3.

      [#4] (4) Is there an alternative model the authors considered, which was outweighed in terms of prediction by this model? 

      Yes, we have further evaluated two alternative models: the Rescorla-Wagner (RW; Rescorla & Wagner, 1972) model and the latent state model (LSM; Cochran & Cisler, 2019). The RW model serves as a baseline, given its known limitations in explaining fear return after extinction. The LSM is another contemporary model that shares several concepts with the latent cause model (LCM), such as building upon the RW model, assuming a latent variable inferred by Bayes’ rule, and involving a ruminative update for memory modification. We evaluated the three models in terms of prediction accuracy and reproducibility of key behavioral features. Please refer to Appendix 1 for detailed methods and results for these two models.

      As expected, the RW model fit the data well until the end of extinction but failed to reproduce reinstatement (Appendix 1 – figure 1A to 1D). Due to a large prediction error in test 3, few samples met the acceptance criteria we set (Appendix 1 – figure 2 and 3A). Conversely, the LSM reproduced reinstatement, as well as gradual learning in acquisition and extinction phases, particularly in the 12-month-old control (Appendix 1 – figure 1G). The number of accepted samples in the LSM was higher than in the RW model but generally lower than in the LCM (Appendix 1 – figure 2). The sum of prediction errors over all trials in the LSM was comparable to that in the LCM in the 6-month-old group (Appendix 1 – figure 4A), whereas it was significantly lower in the 12-month-old group (Appendix 1 – figure 4B). In particular, the LSM generated smaller prediction errors during the acquisition trials than the LCM, suggesting that the LSM might be better at explaining the behaviors of acquisition (Appendix 1 – figure 4A and 4B; but see the reply for comment #34). While the LSM generated smaller prediction errors than the LCM in test 2 of the control group, it failed to replicate the observed DIs, a critical behavioral phenotype difference between control and App<sup>NL-G-F</sup> mice (Appendix 1 – figure 6A to 6C; cf. Figure 2F to 2H, Figure 3 – figure supplement 1A to 1C).

      Thus, although each model could capture different aspects of reinstatement, relying on the LCM to explain reinstatement better aligns with our purpose. It should also be noted that we did not explore the full parameter space of the LSM; hence, we cannot rule out the possibility that alternative parameter sets could provide a better fit and explain the memory modification process well. A more comprehensive parameter search in the LSM may be a valuable direction for future research.

      [#5] One concern here is also parameter overfitting. Did the authors try leaving out some data (trials/mice) and predicting their responses based on the fit derived from the training data?

      Following the reviewer’s suggestion, we checked whether overfitting occurred when all trials were used to estimate parameters. Estimating parameters while actually leaving out trials would disrupt the time lapse across trials, and thereby the prior of latent causes in each trial. Instead, we removed the constraint on prediction error by setting the error threshold to 1 for certain trials, virtually leaving these trials out. We treated these trials as a virtual “test” dataset, while the rest of the trials served as a “training” dataset. For the median CR data of each group (Figure 3), we estimated parameters under 6 conditions with unique training and test trials, then evaluated the prediction error for the training and test trials. Note that training and test trials were arbitrarily decided. Also, the error threshold for the acquisition trials was set to 1 as described in Materials and Methods (the reason is discussed further in the reply to comment #34), so we treated acquisition trials separately from the test trials. We expect that the contribution of the data from the acquisition and test trials to parameter estimation would be discounted compared to that from the training trials with the constraint, and that if overfitting occurred, the prediction error in the test trials would be worse than that in the training trials.

      Author response image 1A to 1F showed the simulated and observed CR under each condition, where acquisition trials were in light-shaded areas, test trials were in dark-shaded areas, and the rest of the trials were training trials. Author response image 1G showed mean squared prediction error across the acquisition, training and test trials under each condition. The dashed gray line showed the mean squared prediction error of training trials in Figure 3 as a baseline.

      In conditions i and ii, where two or four trials in the extinction were used for training (Author response image 1A and 1B), the prediction error was generally higher in test trials than in training trials. In conditions iii and iv, where ten trials in the extinction were used for training (Author response image 1C and 1D), the difference in prediction error between test and training trials became smaller. These results suggest that providing more extinction trial data would reduce overfitting. In condition v (Author response image 1E), the results showed that using trials until extinction can predict reinstatement in control mice but not App<sup>NL-G-F</sup> mice. Similarly, in condition vi (Author response image 1F), where test phase trials were left out, the prediction error differences were greater in App<sup>NL-G-F</sup> mice. These results suggest that the test trials should be used for the parameter estimation to minimize prediction error for all groups. Overall, this analysis suggests that using all trials would reduce prediction error with little overfitting.

      Author response image 1.

      Leaving trials out in parameter estimation in the latent cause model. (A – F) The observed CR (colored line) is the median freezing rate during the CS presentation over the mice within each group, which is the same as that in Figure 3. The colors indicate different groups: orange represents 6-month-old control, light blue represents 6-month-old App<sup>NL-G-F</sup> mice, pink represents 12-month-old control, and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice. Under six different leave-out conditions (i – vi), parameters were estimated and used for generating simulated CR (gray line). In each condition, trials were categorized as acquisition (light-shaded area), training data (white area), and test data (dark-shaded area) based on the error threshold during parameter estimation. Only the error threshold of the test data trials was different from the original method (see Materials and Methods) and set to 1. In conditions i to iv, the number of test data trials is 27, 25, 19, and 19, respectively, in extinction phases. In condition v, the number of test data trials is 2 (trials 35 and 36). In condition vi, test data trials were the 3 test phases (trials 4, 34, and 36). (G) Each subplot shows the mean squared prediction error for the test data trials (gray circles), training data trials (white squares), and acquisition trials (gray triangles) in each group. The left y-axis corresponds to data from test and training trials, and the right y-axis corresponds to data from acquisition trials. The dashed line indicates the results calculated from Figure 3 as a baseline.

      Reviewer #1 (Recommendations for the authors):

      Minor:

      [#6] (1) I would like the authors to further clarify why 'explaining' the reinstatement deficit in the AD mouse model is important in working towards the understanding of AD i.e., which aspect of AD this could explain etc.

      In this study, we utilized the reinstatement paradigm with the latent cause model as an internal model to illustrate how estimating internal states can improve understanding of cognitive alteration associated with extensive Aβ accumulation in the brain. Our findings suggest that misclassification in the memory modification process, manifesting as overgeneralization and overdifferentiation, underlies the memory deficit in the App<sup>NL-G-F</sup> knock-in model mice. 

      The parameters in the internal model associated with AD pathology (e.g., α and σ<sub>x</sub><sup>2</sup> in this study) can be viewed as computational phenotypes, filling the explanatory gap between neurobiological abnormalities and cognitive dysfunction in AD. This would advance the understanding of cognitive symptoms in the early stages of AD beyond conventional behavioral endpoints alone.

      We further propose that altered internal states in App<sup>NL-G-F</sup> knock-in mice may underlie a wide range of memory-related symptoms in AD as we observed that App<sup>NL-G-F</sup> knock-in mice failed to retain competing memories in the reversal Barnes maze task. We speculate on how overgeneralization and overdifferentiation may explain some AD symptoms in the manuscript:

      - Line 565-569: overgeneralization may explain deficits in discriminating highly similar visual stimuli reported in early-stage AD patients as they misclassify the lure as previously learned object

      - Line 576-579: overdifferentiation may explain impaired ability to transfer previously learned association rules in early-stage AD patients as they misclassify them as separated knowledge. 

      - Line 579-582: overdifferentiation may explain delusions in AD patients as an extended latent cause model could simulate the emergence of delusional thinking

      We provide one more example here that overgeneralization may explain that early-stage AD patients are more susceptible to proactive interference than cognitively normal elders in semantic memory tests (Curiel Cid et al., 2024; Loewenstein et al., 2015, 2016; Valles-Salgado et al., 2024), as they are more likely to infer previously learned material. Lastly, we expect that explaining memory-related symptoms within a unified framework may facilitate future hypothesis generation and contribute to the development of strategies for detecting the earliest cognitive alteration in AD.  

      [#7] (2) The authors state in the abstract/introduction that such computational modelling could be most beneficial for the early detection of memory disorders. The deficits observed here are pronounced in the older animals. It will help to further clarify if these older animals model the early stages of the disease. Do the authors expect severe deficits in this mouse model at even later time points?

      The early stage of the disease is marked by abnormal biomarkers associated with Aβ accumulation and neuroinflammation, while cognitive symptoms are mild or absent. This stage can persist for several years, during which the level of Aβ may reach a plateau. As the disease progresses, tau pathology and neurodegeneration emerge and drive the transition into the late stage and the onset of dementia. The App<sup>NL-G-F</sup> knock-in mice recapitulate the features present in the early stage (Saito et al., 2014), where extensive Aβ accumulation and neuroinflammation worsen with age (Figure 2 – figure supplement 1). Since App<sup>NL-G-F</sup> knock-in mice are centered on Aβ pathology without tauopathy and neurodegeneration, it should be noted that they do not represent the full spectrum of the disease even at advanced ages. Therefore, older animals still model the early stages of the disease and are suitable for studying the long-term effect of Aβ accumulation and neuroinflammation.

      The ages tested in previous reports using App<sup>NL-G-F</sup> mice spanned a wide range, from 2 months old to 24 months old. Different behavioral tasks have varied sensitivity but overall suggest that the dysfunction worsens with aging (Bellio et al., 2024; Mehla et al., 2019; Sakakibara et al., 2018). We have previously run the reinstatement experiment with 17-month-old App<sup>NL-G-F</sup> mice (Author response image 2). They showed more advanced deficits with the same trends observed in 12-month-old App<sup>NL-G-F</sup> mice, but their freezing rates were overall at a lower level. Because possible hearing loss may affect the results and their interpretation, we decided to focus on the 12-month-old data.

      Author response image 2.

      Freezing rate across reinstatement paradigm in the 17-month-old App<sup>NL-G-F</sup> mice. Dashed and solid lines indicate the median freezing rate over 34 mice before (preCS) and during (CS) tone presentation, respectively. Red, blue, and yellow backgrounds represent acquisition, extinction, and unsignaled shock in Figure 2A. The dashed vertical line separates the extinction 1 and extinction 2 phases.

      [#8] (3) There are quite a few 'marginal' p-values in the paper at p>0.05 but near it. Should we accept them all as statistically significant? The authors need to clarify if all the experimental groups are sufficiently powered.

      For our study, we decided a priori that p < 0.05 would be considered statistically significant, as described in the Materials and Methods. Therefore, in our Results, we did not consider these marginal values as statistically significant but reported the trend, as they may indicate substantive significance.

      We described our power analysis method in the manuscript Line 897-898 and have provided the results in Tables S21 and S22.

      [#9] (4) The authors emphasize here that such computational modelling enables us to study the underlying 'reasoning' of the patient (in the abstract and introduction), I do not see how this is the case. The model states that there is a latent i.e. another underlying variable that was not previously considered.

      Our use of the term “reasoning” was to distinguish the internal model, which describes how an agent makes sense of the world, from other generative models implemented for biomarker and disease progression prediction. However, we agree that using “reasoning” may be misleading and imprecise, so to reduce ambiguity we have removed this word in our manuscript Line 27: Nonetheless, internal models of the patient remain underexplored in AD; Line 85: However, previous approaches did not suppose an internal model of the world to predict future from current observation given prior knowledge.   

      [#10] (5) The authors combine knock-in mice with controls to compute correlations of parameters of the model with behavior of animals (e.g. Figure 4B and Figure 5B). They run the risk of spurious correlations due to differences across groups, which they have indeed shown to exist (Figure 4A and 5A). It would help to show within-group correlations between DI and parameter fit, at least for the control group (which has a large spread of data).

      We agree that genotype (control, App<sup>NL-G-F</sup>) could be a confounder between the estimated parameters and DI, thereby generating spurious correlations. To address this concern, we have provided within-group correlations in Figure 4 – figure supplement 2 for the 12-month-old group and Figure 5 – figure supplement 2 for the 6-month-old group.

      In the 12-month-old group, the significant positive correlation between σ<sub>x</sub><sup>2</sup> and DI remained in both control and App<sup>NL-G-F</sup> mice even after adjusting for the genotype effect, suggesting that the correlations in Figure 4B are very unlikely to be due to genotype-related confounding. On the other hand, the positive correlation between α and DI was significant in the control mice but not in the App<sup>NL-G-F</sup> mice. Most α values were distributed around the lower bound in App<sup>NL-G-F</sup> mice, which possibly reduced the variance and the correlation coefficient. These results support our original conclusion that α and σ<sub>x</sub><sup>2</sup> are parameters associated with a lower magnitude of reinstatement in aged App<sup>NL-G-F</sup> mice.

      In the 6-month-old group, the correlations shown in Figure 5B were not preserved within subgroups, suggesting genotype would be a confounder for α, σ<sub>x</sub><sup>2</sup>, and DI. We recognized that significant correlations in Figure 5B may arise from group differences, increased sample size, or greater variance after combining control and App<sup>NL-G-F</sup> mice. 

      Therefore, we concluded that α and σ<sub>x</sub><sup>2</sup> are associated with the magnitude of reinstatement but modulated by the genotype effect depending on the age. 

      We have added interpretations of within-group correlation in the manuscript Line 307-308, 375-378.
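
      As an illustration of why pooling two groups can yield a correlation that is absent, or even reversed, within each group, a minimal sketch follows. The numbers are hypothetical and are not data from this study; the variable names are chosen for illustration only.

```python
from math import sqrt

def pearson(pairs):
    """Pearson correlation between the two columns of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical (model parameter, DI) pairs; the two groups differ in both variables.
control = [(0.8, 0.60), (1.0, 0.55), (1.2, 0.62), (0.9, 0.58)]
knock_in = [(0.2, 0.20), (0.4, 0.15), (0.3, 0.22), (0.5, 0.18)]

pooled_r = pearson(control + knock_in)   # driven mostly by the group difference
within_r = {"control": pearson(control), "knock_in": pearson(knock_in)}
```

      In this toy example the pooled correlation is strongly positive while the within-group correlations are weak or negative, which is the pattern that within-group analyses are meant to detect.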

      [#11] (6) It is unclear to me why overgeneralization of internal states will lead to the animals having trouble recalling a memory. Would this not lead to overgeneralization of memory recall instead?

      We assume that the reviewer is referring to “overgeneralization of internal states,” a case in which the animal’s internal state remained the same regardless of the observation, thereby leading to “overgeneralization of memory recall.” We agree that this could be one possible situation and appears less problematic than the case in which this memory is no longer retrievable. 

      However, in our manuscript, we did not deal with the case of “overgeneralization of internal states”. Rather, our findings illustrated how the memory modification process falls into overgeneralization or overdifferentiation and how it adversely affects the retention of competing memories, thereby causing App<sup>NL-G-F</sup> mice to have trouble recalling the same memory as the control mice. 

      According to the latent cause model, retrieval failure is explained by a mismatch of internal states: when an agent perceives that the current cue does not match a previously experienced one, the old latent cause is less likely to be inferred due to its low likelihood (Gershman et al., 2017). For example, if a mouse exhibited higher CR in test 2, it would be interpreted as a successful fear memory retrieval due to overgeneralization of the fear memory. However, it reflects a failure of extinction memory retrieval due to the mismatch between the internal states at extinction and test 2. This is an example in which overgeneralization of memory induces a failure of memory retrieval.

      On the other hand, App<sup>NL-G-F</sup> mice exhibited higher CR in test 1, which is conventionally interpreted as a successful fear memory retrieval. When estimating their internal states, they would infer that their observation in test 1 closely matches those under the acquisition latent causes, that is, overgeneralization of fear memory, as shown by a higher posterior probability of the acquisition latent causes in test 1 (Figure 4 – figure supplement 3). This is an example in which overgeneralization of memory does not always induce retrieval failure, as we explained in the reply to comment #1.

      Reviewer #2 (Public review):

      Summary:

      This manuscript proposes that the use of a latent cause model for the assessment of memory-based tasks may provide improved early detection of Alzheimer's Disease as well as more differentiated mapping of behavior to underlying causes. To test the validity of this model, the authors use a previously described knock-in mouse model of AD and subject the mice to several behaviors to determine whether the latent cause model may provide informative predictions regarding changes in the observed behaviors. They include a well-established fear learning paradigm in which distinct memories are believed to compete for control of behavior. More specifically, it's been observed that animals undergoing fear learning and subsequent fear extinction develop two separate memories for the acquisition phase and the extinction phase, such that the extinction does not simply 'erase' the previously acquired memory. Many models of learning require the addition of a separate context or state to be added during the extinction phase and are typically modeled by assuming the existence of a new state at the time of extinction. The Niv research group, Gershman et al. 2017, have shown that the use of a latent cause model applied to this behavior can elegantly predict the formation of latent states based on a Bayesian approach, and that these latent states can facilitate the persistence of the acquisition and extinction memory independently. The authors of this manuscript leverage this approach to test whether deficits in the production of the internal states, or the inference and learning of those states, may be disrupted in knock-in mice that show both a build-up of amyloid-beta plaques and a deterioration in memory as the mice age.

      Strengths:

      I think the authors' proposal to leverage the latent cause model and test whether it can lead to improved assessments in an animal model of AD is a promising approach for bridging the gap between clinical and basic research. The authors use a promising mouse model and apply this to a paradigm in which the behavior and neurobiology are relatively well understood - an ideal situation for assessing how a disease state may impact both the neurobiology and behavior. The latent cause model has the potential to better connect observed behavior to underlying causes and may pave a road for improved mapping of changes in behavior to neurobiological mechanisms in diseases such as AD.

      Weaknesses:

      I have several substantial concerns which I've detailed below. These include important details on how the behavior was analyzed, how the model was used to assess the behavior, and the interpretations that have been made based on the model.

      [#12] (1) There is substantial data to suggest that during fear learning in mice separate memories develop for the acquisition and extinction phases, with the acquisition memory becoming more strongly retrieved during spontaneous recovery and reinstatement. The Gershman paper, cited by the authors, shows how the latent causal model can predict this shift in latent states by allowing for the priors to decay over time, thereby increasing the posterior of the acquisition memory at the time of spontaneous recovery. In this manuscript, the authors suggest a similar mechanism of action for reinstatement, yet the model does not appear to return to the acquisition memory state after reinstatement, at least based on the examples shown in Figures 1 and 3. Rather, the model appears to mainly modify the weights in the most recent state, putatively the 'extinction state', during reinstatement. Of course, the authors must rely on how the model fits the data, but this seems problematic based on prior research indicating that reinstatement is most likely due to the reactivation of the acquisition memory. This may call into question whether the model is successfully modeling the underlying processes or states that lead to behavior and whether this is a valid approach for AD.

      We thank the reviewer for insightful comments. 

      We agree that, as demonstrated in Gershman et al. (2017), the latent cause model accounts for spontaneous recovery via the inference of new latent causes during extinction and the temporal compression property provided by the prior. Moreover, it was also demonstrated that even a relatively low posterior can drive behavioral expression if the weight in the acquisition latent cause is preserved. For example, when the interval between retrieval and extinction was long enough that acquisition latent cause was not dominant during extinction, spontaneous recovery was observed despite the posterior probability of acquisition latent cause (C1) remaining below 0.1 in Figure 11D of Gershman et al. (2017). 

      In our study, a high response in test 3 (reinstatement) is explained by both the acquisition and extinction latent causes. The former preserves the associative weight of the initial fear memory, while the latter has w<sub>context</sub> learned in the unsignaled shock phase. These positive w were weighted by their posterior probabilities and together contributed to the increased expected shock in test 3. Although the posterior probability of the acquisition latent cause was lower than that of the extinction latent cause in test 3 due to the passage of time, this parallels the instance mentioned above. To clarify their contributions to reinstatement, we have conducted additional simulations and discuss them in our reply to the reviewer’s next comment (see the reply to comment #13).
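
      The posterior-weighted combination described above can be sketched as follows. This is a simplified illustration with hypothetical numbers, assuming the expected shock is the sum over latent causes of each cause's posterior probability times its weighted cue input; the function name and values are ours and do not correspond to any fitted animal.

```python
def expected_us(posteriors, weights, cues):
    """Posterior-weighted prediction: sum over causes of P(z_k) * (w_k . x)."""
    return sum(
        p * sum(w_i * x_i for w_i, x_i in zip(w, cues))
        for p, w in zip(posteriors, weights)
    )

# Cue vector is (CS, context). All numbers are hypothetical.
posteriors = [0.1, 0.9]              # acquisition cause, extinction cause
weights = [(0.8, 0.1), (0.0, 0.3)]   # (w_CS, w_context) for each cause

cs_and_context = expected_us(posteriors, weights, cues=(1, 1))
context_only = expected_us(posteriors, weights, cues=(0, 1))
```

      Even with a posterior of only 0.1, the preserved w<sub>CS</sub> of the acquisition cause raises the prediction for context plus CS above the prediction for context alone, while the extinction cause contributes through its context weight.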

      We recognize that our results might appear to deviate from the notion that reinstatement results from the strong reactivation of acquisition memory, where one would expect a high posterior probability of the acquisition latent cause. However, we would like to emphasize that the return of fear emerges from the interplay of competing memories. Previous studies have shown that contextual or cued fear reinstatement involves a neural activity switch back to fear state in the medial prefrontal cortex (mPFC), including the prelimbic cortex and infralimbic cortex, and the amygdala, including ventral intercalated amygdala neurons (ITCv), medial subdivision of central nucleus of the amygdala (CeM), and the basolateral amygdala (BLA) (Giustino et al., 2019; Hitora-Imamura et al., 2015; Zaki et al., 2022). We speculate that such transition is parallel to the internal states change in the latent cause model in terms of posterior probability and associative weight change.

      Optogenetic manipulation experiments have further revealed how fear and extinction engrams contribute to extinction retrieval and reinstatement. For instance, Gu et al. (2022) used a cued fear conditioning paradigm and found that inhibition of extinction engrams in the BLA, ventral hippocampus (vHPC), and mPFC after extinction learning artificially increased freezing to the tone cue. Similar results were observed in contextual fear conditioning, where silencing extinction engrams in the hippocampal dentate gyrus (DG) impaired extinction retrieval (Lacagnina et al., 2019). These results suggest that weakening the extinction memory can induce a return of the fear response even without a reminder shock. On the other hand, Zaki et al. (2022) showed that inhibition of fear engrams in the BLA, DG, or hippocampal CA1 attenuated contextual fear reinstatement. However, they also reported that stimulation of these fear engrams was not sufficient to induce reinstatement, suggesting that these fear engrams only partially account for reinstatement.

      In summary, reinstatement likely results from bidirectional changes in the fear and extinction circuits, supporting our interpretation that both acquisition and extinction latent causes contribute to the reinstatement. Although it remains unclear whether these memory engrams represent latent causes, one possible interpretation is that w<sub>context</sub> update in extinction latent causes during unsignaled shock indicates weakening of the extinction memory, while preservation of w in acquisition latent causes and their posterior probability suggests reactivation of previous fear memory. 

      [#13] (2) As stated by the authors in the introduction, the advantage of the fear learning approach is that the memory is modified across the acquisition-extinction-reinstatement phases. Although perhaps not explicitly stated by the authors, the post-reinstatement test (test 3) is the crucial test for whether there is reactivation of a previously stored memory, with the general argument being that the reinvigorated response to the CS can't simply be explained by relearning the CS-US pairing, because re-exposure the US alone leads to increase response to the CS at test. Of course there are several explanations for why this may occur, particularly when also considering the context as a stimulus. This is what I understood to be the justification for the use of a model, such as the latent cause model, that may better capture and compare these possibilities within a single framework. As such, it is critical to look at the level of responding to both the context alone and to the CS. It appears that the authors only look at the percent freezing during the CS, and it is not clear whether this is due to the contextual US learning during the US re-exposure or to increased response to the CS - presumably caused by reactivation of the acquisition memory. For example, the instance of the model shown in Figure 1 indicates that the 'extinction state', or state z6, develops a strong weight for the context during the reinstatement phase of presenting the shock alone. This state then leads to increased freezing during the final CS probe test as shown in the figure. By not comparing the difference in the evoked freezing CR at the test (ITI vs CS period), the purpose of the reinstatement test is lost in the sense of whether a previous memory was reactivated - was the response to the CS restored above and beyond the freezing to the context? 
I think the authors must somehow incorporate these different phases (CS vs ITI) into their model, particularly since this type of memory retrieval that depends on assessing latent states is specifically why the authors justified using the latent causal model.

      To clarify the contribution of context, we have provided the preCS freezing rate across trials in Figure 2 – figure supplement 2. As the reviewer pointed out, the preCS freezing rate did not remain at the same level across trials, especially within the 12-month-old control and App<sup>NL-G-F</sup> groups (Figure 2 – figure supplement 2A and 2B), suggesting an effect of context. A paired-samples t-test comparing preCS freezing (Figure 2 – figure supplement 2E) and CS freezing (Figure 2E) in test 3 revealed significant differences in all groups: 6-month-old control, t(23) = -6.344, p < 0.001, d = -1.295; 6-month-old App<sup>NL-G-F</sup>, t(24) = -4.679, p < 0.001, d = -0.936; 12-month-old control, t(23) = -4.512, p < 0.001, d = -0.921; 12-month-old App<sup>NL-G-F</sup>, t(24) = -2.408, p = 0.024, d = -0.482. These results indicate that the response to the CS was above and beyond the response to context alone. We also compared the change in freezing rate (CS freezing rate minus preCS freezing rate) between test 2 and test 3 to examine the net response to the tone. A significant difference was found in the control groups, but not in the App<sup>NL-G-F</sup> groups (Author response image 3). The increased net response to the tone in the control groups suggests that the reinstatement was partially driven by reactivation of the acquisition memory, not solely by contextual US learning during the unsignaled shock phase. We have added these results and discussion in the manuscript Line 220-231.
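
      For readers who wish to check how the paired statistics relate to each other, a minimal sketch is given below. It assumes Cohen's d is computed on the paired differences (mean difference divided by the standard deviation of the differences), in which case t = d·√n; for example, -6.344/√24 ≈ -1.295. The freezing values in the sketch are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_and_d(a, b):
    """Paired-samples t statistic and Cohen's d for the differences a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    d = mean(diffs) / stdev(diffs)   # Cohen's d on paired differences
    t = d * sqrt(len(diffs))         # t = d * sqrt(n), so t and d share a sign
    return t, d

# Hypothetical freezing rates (%) for five animals in test 3.
pre_cs = [10.0, 20.0, 15.0, 5.0, 12.0]
cs = [40.0, 35.0, 50.0, 30.0, 45.0]
t_stat, cohen_d = paired_t_and_d(pre_cs, cs)
```

      With preCS entered first, lower preCS freezing gives negative values for both t and d.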

      Author response image 3.

      Net freezing rate in test 2 and test 3. Net freezing rate is defined as the CS freezing rate (i.e., freezing rate during the 1 min CS presentation) minus the preCS freezing rate (i.e., freezing rate during the 1 min before CS presentation). The dashed horizontal line indicates no change in freezing rate from the preCS period to the CS presentation. *p < 0.05 by paired-sample Student’s t-test, with the alternative hypothesis that the freezing rate change in test 2 is less than that in test 3. Colors indicate different groups: orange represents 6-month-old control (n = 24), light blue represents 6-month-old App<sup>NL-G-F</sup> mice (n = 25), pink represents 12-month-old control (n = 24), and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice (n = 25). Each black dot represents one animal. Statistical results were as follows: t(23) = -1.927, p = 0.033, Cohen’s d = -0.393 in 6-month-old control; t(24) = -1.534, p = 0.069, Cohen’s d = -0.307 in 6-month-old App<sup>NL-G-F</sup>; t(23) = -1.775, p = 0.045, Cohen’s d = -0.362 in 12-month-old control; t(24) = 0.86, p = 0.801, Cohen’s d = 0.172 in 12-month-old App<sup>NL-G-F</sup>.

      According to the latent cause model, if the reinstatement is merely induced by an association between the context and the US in the unsignaled shock phase, the CR given context only and that given context and CS in test 3 should be equal. However, the simulation conducted for each mouse using their estimated parameters confirmed that this was not the case in this study. The results showed that simulated CR was significantly higher in the context+CS condition than in the context only condition (Author response image 4). This trend is consistent with the behavioral results we mentioned above.

      Author response image 4.

      Simulation of context effect in test 3. The estimated parameter set of each sample was used to run simulations in which only the context or the context with the CS was present in test 3 (trial 36). The data are shown as median with interquartile range, where white bars with colored lines represent CR for context only and colored bars represent CR for context with CS. Colors indicate different groups: orange represents 6-month-old control (n = 15), light blue represents 6-month-old App<sup>NL-G-F</sup> mice (n = 12), pink represents 12-month-old control (n = 20), and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice (n = 18). Each black dot represents one animal. **p < 0.01 and ***p < 0.001 by Wilcoxon signed-rank test comparing context only and context + CS in each group, with the alternative hypothesis that CR in context is not equal to CR in context with CS. Statistical results were as follows: W = 15, p = 0.008, effect size r = -0.66 in 6-month-old control; W = 0, p < 0.001, effect size r = -0.88 in 6-month-old App<sup>NL-G-F</sup>; W = 25, p = 0.002, effect size r = -0.67 in 12-month-old control; W = 9, p = 0.002, effect size r = -0.75 in 12-month-old App<sup>NL-G-F</sup>.
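
      One common convention for the effect size of a Wilcoxon signed-rank test is r = Z/√n, with Z taken from the normal approximation of W. The sketch below assumes this convention, without tie or continuity corrections; whether the statistical software used here applies exactly this formula is an assumption. Under it, the values for the two 6-month-old groups are reproduced to two decimals.

```python
from math import sqrt

def wilcoxon_effect_size(w, n):
    """Effect size r = Z / sqrt(n) from a signed-rank statistic W.

    Uses the normal approximation of W with no tie or continuity correction.
    """
    mean_w = n * (n + 1) / 4
    sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    return z / sqrt(n)

r_6mo_control = wilcoxon_effect_size(w=15, n=15)   # W and n taken from the caption
r_6mo_app = wilcoxon_effect_size(w=0, n=12)
```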

      [#14] (3) This is related to the second point above. If the question is about the memory processes underlying memory retrieval at the test following reinstatement, then I would argue that the model parameters that are not involved in testing this hypothesis be fixed prior to the test. Unlike the Gershman paper that the authors cited, the authors fit all parameters for each animal. Perhaps the authors should fit certain parameters on the acquisition and extinction phase, and then leave those parameters fixed for the reinstatement phase. To give a more concrete example, if the hypothesis is that AD mice have deficits in differentiating or retrieving latent states during reinstatement which results in the low response to the CS following reinstatement, then perhaps parameters such as the learning rate should be fixed at this point. The authors state that the 12-month-old AD mice have substantially lower learning rate measures (almost a 20-fold reduction!), which can be clearly seen in the very low weights attributed to the AD mouse in Figure 3D. Based on the example in Figure 3D, it seems that the reduced learning rate in these mice is most likely caused by the failure to respond at test. This is based on comparing the behavior in Figures 3C to 3D. The acquisition and extinction curves appear extremely similar across the two groups. It seems that this lower learning rate may indirectly be causing most of the other effects that the authors highlight, such as the low σx, and the changes to the parameters for the CR. It may even explain the extremely high K. Because the weights are so low, this would presumably lead to extremely low likelihoods in the posterior estimation, which I guess would lead to more latent states being considered as the posterior would be more influenced by the prior.

      We thank the reviewer for the suggestion about fitting and fixing certain parameters in different phases.

      However, this strategy may not be optimal for our study for the following scientific reasons.

      Our primary purpose is to explore internal states in the memory modification process that are associated with the deficit found in App<sup>NL-G-F</sup> mice in the reinstatement paradigm. We did not restrict the question to memory retrieval, nor did we have a particular hypothesis such that only a few parameters of interest account for the impaired associative learning or structure learning in App<sup>NL-G-F</sup> mice while all other parameters are comparable between groups. We are concerned that restricting questions to memory retrieval at the test is too parsimonious and might lead to misinterpretation of the results. As we explain in reply to comment #5, removing trials in extinction during parameter estimation reduces the model fit performance and runs the risk of overfitting within the individual. Therefore, we estimated all parameters for each animal, with the assumption that the estimated parameter set represents individual internal state (i.e., learning and memory characteristics) and should be fixed within the animal across all trials.  

      Figure 3 shows the parameter estimation and simulation results obtained by treating the median data of each group as an individual. The estimated parameter values represent one possible case in that group and demonstrate how the latent cause model fits a typical learning curve. The “20-fold reduction in learning rate” mentioned by the reviewer is a comparison of two data points, not an actual comparison between groups. The comparison between control and App<sup>NL-G-F</sup> mice in the 12-month-old group for all parameters is provided in Table S7. The Mann-Whitney U test did not reveal a significant difference in learning rate (η): 12-month-old control (Mdn = 0.09, IQR = 0.23) vs. 12-month-old App<sup>NL-G-F</sup> (Mdn = 0.12, IQR = 0.23), U = 199, p = 0.587.

      We agree that a lower learning rate could bias learning toward inferring a new latent cause. However, this tendency may depend on the values of other parameters and vary across phases of the reinstatement paradigm. Here, we use α as an example and demonstrate the interaction in Appendix 2 – table 2 with relatively extreme values, α = {1, 3} and η = {0.01, 0.5}, while the remaining parameters are fixed at their initial guess values.

      When α = 1, the numbers of latent causes across phases (K<sub>acq</sub>, K<sub>ext</sub>, K<sub>rem</sub>) remained unchanged and their posterior probabilities in test 3 were comparable even when η increased from 0.01 to 0.5. This is an example in which a lower η does not lead to inferring new latent causes because α is low. The effect of a low learning rate manifests in the test 3 CR through low w<sub>context, acq</sub> and w<sub>context, ext</sub>.

      When α = 3, the number of acquisition latent causes (K<sub>acq</sub>) was higher for η = 0.01 than for η = 0.5, showing the effect mentioned by the reviewer. However, the test 1 CR is much lower when η = 0.01, indicating unsuccessful learning even after a new latent cause is inferred. This case was not observed in this study. During the extinction phases, the effect of η is outweighed by the effect of high α: the number of extinction latent causes (K<sub>ext</sub>) is high and not affected by η. After the extinction phases, the effect of K becomes apparent as the total number of latent causes reaches its upper bound (K = 33 in this example), especially for η = 0.01. A new latent cause is inferred after extinction when η = 0.5, but the test 3 CR is still high because w<sub>context, acq</sub> and w<sub>context, ext</sub> are high. This is an example in which a new latent cause is inferred despite a higher η.

      Overall, the learning rate would not have a prominent effect alone throughout the reinstatement paradigm, and it has a joint effect with other parameters. Note that the example here did not cover our estimated results, as the estimated learning rate was not significantly different between control and App<sup>NL-G-F</sup> mice (see above). Please refer to the reply to comment #31 for more discussion about the interaction among parameters when the learning rate is fixed. We hope this clarifies the reviewer’s concern.

      [#15] (4) Why didn't the authors use the latent causal model on the Barnes maze task? The authors mention in the discussion that different cognitive processes may be at play across the two tasks, yet reversal tasks have been suggested to be solved using latent states to be able to flip between the two different task states. In this way, it seems very fitting to use the latent cause model. Indeed, it may even be a better way to assess changes in σx as there are presumably 12 observable stimuli/locations.

      Please refer to our provisional response about the application of the latent cause model to the reversal Barnes maze task. Briefly, it would be difficult to directly apply the latent cause model to the Barnes maze data because this task involves operant learning, and thereby almost all conditions in the latent cause model are not satisfied. Please also see our reply to comment #24 for the discussion of the link between the latent cause model and Barnes maze task. 

      Reviewer #2 (Recommendations for the authors):

      [#16] (1) I had a bit of difficulty finding all the details of the model. First, I had to mainly rely on the Gershman 2017 paper to understand the model. Even then, there were certain aspects of the model that were not clear. For instance, it's not quite clear to me when the new internal states are created and how the maximum number of states is determined. After reading the authors' methods and the Gershman paper, it seems that a new internal state is generated at each time point, aka zt, and that the prior for that state decays onwards from alpha. Yet because most 'new' internal states don't ever take on much of a portion of the posterior, most of these states can be ignored. Is that a correct understanding? To state this another way, I interpret the equation on line 129 to indicate that the prior is determined by the power law for all existing internal states and that each new state starts with a value of alpha, yet I don't see the rule for creating a new state, or for iterating k other than that k iterates at each timestep. Yet this seems to not be consistent with the fact that the max number of states K is also a parameter fit. Please clarify this, or point me to where this is better defined.

      I find this to be an important question for the current paper as it is unclear to me when the states were created. Most notably, in Figure 3, it's important to understand why there's an increase in the posterior of z<sub>5</sub> in the AD 12-month mice at test. Is state z<sub>5</sub> generated at trial 5? If so, the prior would be extremely small by trial 36, making it even more perplexing why z<sub>5</sub> has such a high posterior. If its weights are similar to z<sub>3</sub> and z<sub>4</sub>, and they have been much more active recently, why would z<sub>5</sub> come into play?

      We assume that the “new internal state" the reviewer is referring to is the “new latent cause." We would like to clarify that “internal state" in our study refers to all the latent causes at a given time point and observation. As this manuscript is submitted as a Research Advance article in eLife, we did not rephrase all the model details. Here, we explain when a new latent cause is created (i.e., the prior probability of a new latent cause is greater than 0) with the example of the 12-month-old group (Figure 3C and 3D). 

      Suppose that before the start of each trial, an agent inferred the most likely latent cause with maximum posterior, and it inferred k latent causes so far. A new latent cause can be inferred at the computation of the prior of latent causes at the beginning of each trial.  

      In the latent cause model, the prior follows a distance-dependent Chinese Restaurant Process (CRP; Blei and Frazier, 2011). The prior of each old latent cause is its posterior probability, which is the final count of the EM update before the current trial. In addition, the prior of old latent causes is sensitive to the passage of time, so that it decreases exponentially according to a forgetting function modulated by g (see Figure 2 in Gershman et al., 2017). Simultaneously, the prior of a new cause is assigned α. The new latent cause is inferred at this moment. Hence, the prior of latent causes is jointly determined by α, g, and their posterior probabilities. The maximum number of latent causes K is set a priori and does not affect the prior while k < K (see also the reply to comment #30 for the discussion of the bounds set for K and comment #31 for the discussion of the interaction between α and K). Note that only one new latent cause can be inferred in each trial, and the (k+1)<sup>th</sup> latent cause, which has never been inferred so far, is chosen as the new latent cause.
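
      The rule for making a new latent cause available can be sketched as follows. This is a simplified illustration and not the fitting code: it takes the time-discounted mass of each old cause as given (the forgetting function modulated by g is not implemented), and the function and argument names are hypothetical.

```python
def latent_cause_prior(decayed_counts, alpha, max_causes):
    """Normalized prior over the k old causes plus, if allowed, one new cause.

    decayed_counts: time-discounted posterior mass of each previously inferred cause
    alpha: unnormalized prior mass assigned to a new cause
    max_causes: upper bound K on the number of latent causes
    """
    unnormalized = list(decayed_counts)
    if len(decayed_counts) < max_causes:   # a (k+1)-th cause exists only while k < K
        unnormalized.append(alpha)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

prior = latent_cause_prior([2.0, 1.0], alpha=1.0, max_causes=5)    # [0.5, 0.25, 0.25]
capped = latent_cause_prior([2.0, 1.0], alpha=1.0, max_causes=2)   # no new cause once k = K
```

      The sketch makes two points from the text explicit: at most one new cause is added per trial, and K matters only once the number of inferred causes reaches it.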

      In our manuscript, the subscript number in z<sub>k</sub> denotes the order in which the latent causes were inferred, not the trial number. In Figures 3C and 3D, z<sub>3</sub> and z<sub>4</sub> were inferred in trials 5 and 6 during extinction; z<sub>5</sub> is a new latent cause inferred in trial 36. Therefore, the prior of z<sub>5</sub> is not extremely small compared to z<sub>4</sub> and z<sub>3</sub>.

      In both control and App<sup>NL-G-F</sup> mice in the 12-month-old group (Figures 3C and 3D), z<sub>3</sub> is dominant until trial 35. The unsignaled shock at trial 35 generates a large prediction error, as only the context is presented and followed by the US. This prediction error reduces the posterior of z<sub>3</sub>, while increasing the posterior of z<sub>4</sub> and w<sub>context</sub> in z<sub>3</sub> and z<sub>4</sub>. This decrease in the posterior of z<sub>3</sub> is more pronounced in the App<sup>NL-G-F</sup> group than in the control group, prompting the App<sup>NL-G-F</sup> mice to infer a new latent cause z<sub>5</sub> (Figure 3C and 3D). Although Figure 3C and 3D are illustrative examples, as we explained in the reply to comment #14, this interpretation is plausible because the App<sup>NL-G-F</sup> group inferred a significantly larger number of latent causes after extinction, with slightly higher posteriors, than the control group (Figure 4E).

      [#17] (2) Related to the above, Are the states z<sub>A</sub> and z<sub>B</sub> defined by the authors to help the reader group the states into acquisition and extinction states, or are they somehow grouped by the model? If the latter is true, I don't understand how this would occur based on the model. If the former, could the authors state that these states were grouped together by the author?

      We used the z<sub>A</sub> and z<sub>B</sub> annotations to assist with the explanation; they are not grouped by the model. We have stated this in the manuscript (Lines 181-182).

      [#18] (3) This expands on the third point above. In Figure 3D, internal states z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub> appear to be pretty much identical in weights in the App group. It's not clear to me why then the posterior of z<sub>5</sub> would all of a sudden jump up. If I understand correctly, the posterior is the likelihood of the observations given the internal state (presumably this should be similar across z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub>), multiplied by the prior of the state. z<sub>3</sub> and z<sub>4</sub> are the dominant inferred states up to state 36. Why would z<sub>5</sub> become more likely if there doesn't appear to be any error? I'm inferring no error because there are little or no changes in weights on trial 36, most prominently no changes in z<sub>3</sub> which is the dominant internal state in step 36. If there's little change in weights, or no errors, shouldn't the prior dominate the calculation of the posterior which would lead to z<sub>3</sub> and z<sub>4</sub> being most prominent at trial 36?

      We have explained how z<sub>5</sub> of the 12-month-old App<sup>NL-G-F</sup> was inferred in the reply to comment #16. Here, we explain the process underlying the rapid changes of the posterior of z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub> from trial 35 to 36.

      During the extinction, the mice inferred z<sub>3</sub> given the CS and the context in the absence of the US. In trial 35, they observed the context and the unsignaled shock in the absence of the CS. This reduced the likelihood of the CS under z<sub>3</sub> and thereby the posterior of z<sub>3</sub>, while relatively increasing the posterior of z<sub>4</sub>. The associative weight between the context and the US, w<sub>context</sub>, indeed increased in both z<sub>3</sub> and z<sub>4</sub>, but w<sub>context</sub> of z<sub>4</sub> was updated more than that of z<sub>3</sub> due to its higher posterior probability. At the beginning of trial 36, a new latent cause z<sub>5</sub> was inferred with a certain prior (see also the reply to comment #16), and w<sub>5</sub> = w<sub>0</sub>, where w<sub>0</sub> is the initial value of the weight. After normalizing the prior over latent causes, the emergence of z<sub>5</sub> reduced the prior probability of the other latent causes compared to the case where the prior of z<sub>5</sub> is 0. Since the CS was presented while the US was absent in trial 36, the likelihood of the CS and that of the US under z<sub>3</sub>, and especially z<sub>4</sub>, given the cues and w became lower than in the case in which z<sub>5</sub> has not been inferred yet. Consequently, the posterior of z<sub>5</sub> became salient (Figure 3D).
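
      The normalization step described above can be illustrated with a minimal sketch of Bayes' rule over latent causes. This is not the fitting code used in the study: the function names and all numbers are hypothetical and were chosen only to show how a newly added cause takes prior mass from the old ones and can then dominate the posterior when the observation is more likely under it.

```python
def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def posterior(prior, likelihood):
    """Bayes' rule over latent causes: posterior is proportional to prior x likelihood."""
    return normalize([p * l for p, l in zip(prior, likelihood)])


# Illustrative, made-up numbers (assumptions, not fitted values from the study).
old_prior = [0.6, 0.4]                          # unnormalized prior mass of z3 and z4
prior_without_z5 = normalize(old_prior)
prior_with_z5 = normalize(old_prior + [0.25])   # z5 enters with its own prior mass

# Hypothetical likelihoods of the trial-36 observation (CS present, US absent)
# under z3, z4, and z5; z4 is set lowest, following the description above.
likelihood = [0.10, 0.05, 0.50]
post = posterior(prior_with_z5, likelihood)

print("prior of z3 without z5:", round(prior_without_z5[0], 3))
print("prior of z3 with z5:   ", round(prior_with_z5[0], 3))
print("posterior (z3, z4, z5):", [round(p, 3) for p in post])
```

      With these made-up inputs, the normalized prior of z<sub>3</sub> is smaller once z<sub>5</sub> exists, and z<sub>5</sub> receives the largest posterior even though its prior is the smallest of the three.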

      To maintain consistency across panels, we used a uniform y-axis range. However, we acknowledge that this may make it harder to notice the changes of associative weights in Figure 3D. We have provided the subpanel in Figure 3D with a smaller y-axis limit to reveal the weight changes at trial 35 in Author response image 5.

      Author response image 5.

      Magnified view of w<sub>context</sub> and w<sub>CS</sub> in the last 3 trials in Figure 3D. The graph format is the same as in Figure 3D. The weight for CS (w<sub>CS</sub>) and that for context (w<sub>context</sub>) in each latent cause across trials 34 (test 2), 35 (unsignaled shock), and 36 (test 3) in the 12-month-old App<sup>NL-G-F</sup> mouse in Figure 3D are magnified in the upper and lower magenta boxes, respectively.

      [#19] (8) In Figure 4B - The figure legend didn't appear to indicate at which time points the DIs are plotted.

      We have amended the figure legend to indicate that DI between test 3 and test 1 is plotted.

      [#20] (9) Lines 301-303 state that the posterior probabilities of the acquisition internal states in the 12-month AD mice were much higher at test 1 and that this resulted in different levels of CR across the control and 12-month App group. This is shown in the Figure 4A supplement, but this is not apparent in Figure 3 panels C and D. Is the example shown in panel D not representative of the group? The CRs across the two examples in Figure 3 C and D look extremely similar at test 1. Furthermore, the posteriors of the internal states look pretty similar across the two groups for the first 4 trials. Both the App and control have substantial posterior probabilities for the acquisition period, I don't see any additional states at test 1. The pattern of states during acquisition looks strikingly similar across the two groups, whereas the weights of the stimuli are considerably different. I think it would help the authors to use an example that better represents what the authors are referring to, or provide data to illustrate the difference. Figure 4C partly shows this, but it's not very clear how strong the posteriors are for the 3rd state in the controls.

      Figure 3 serves as an example to explain the internal states in each group (see also the third paragraph in the reply to comment #14). Figure 4C to H showed the results from each sample for between-group comparison in selected features. Therefore, the results of direct comparisons of the parameter values and internal states between genotypes in Figure 3 are not necessarily the same as those in Figure 4. Both examples in Figure 3C and 3D inferred 2 latent causes during the acquisition. In terms of posterior till test 1 (trial 4), the two could be the same. However, such examples were not rare, as the proportion of the mice that inferred 2 latent causes during the acquisition was slightly lower than 50% in the control, and around 90% in the App<sup>NL-G-F</sup> mice (Figure 4C). The posterior probability of acquisition latent cause in test 1 showed a similar pattern (Figure 4 – figure supplement 3), with values near 1 in around 50% of the control mice and around 90% of the App<sup>NL-G-F</sup> mice.  

      [#21] (10) Line 320: This is a confusing sentence. I think the authors are saying that because the App group inferred a new state during test 3, this would protect the weights of the 'extinction' state as compared to the controls since the strength of the weight updates depends on the probability of the posterior.

      In order to address this, we have revised this sentence to “Such internal states in App<sup>NL-G-F</sup> mice would diverge the associative weight update from those in the control mice after extinction.” in the manuscript Line 349-351.

      [#22] (11) In lines 517-519 the authors address the difference in generalizing the occurrence of stimuli across the App and control groups. It states that App mice with lower alpha generalized observations to an old cause rather than attributing it as a new state. Going back to statement 3 above, I think it's important to show that the model fit of a reduction in alpha does not go hand-in-hand with a reduction in the learning rates and hence the weights. Again, if the likelihoods are diminished due to the low weights, then the fit of alpha might be reduced as well. To reiterate my point above, if the observations in changes in generalization and differentiation occur because of a reduction in the learning rate, the modeling may not be providing a particularly insightful understanding of AD, other than that poor learning leads to ineffectual generalization and differentiation. Do these findings hold up if the learning rates are more comparable across the control and App group?

      These findings were explained on the basis of comparable learning rates between control and App<sup>NL-G-F</sup> mice in the 12-month-old group (see the reply to comment #14). In addition, we have conducted simulations for different ⍺ and σ<sub>x</sub><sup>2</sup> values under the condition of a fixed learning rate, where overgeneralization and overdifferentiation still occurred (see the reply to comment #26).

      [#23] (12) Lines 391 - 393. This is a confusing sentence. "These results suggest that App NL-G-F mice could successfully form a spatial memory of the target hole, while the memory was less likely to be retrieved by a novel observation such as the absence of the escape box under the target hole at the probe test 1." The App mice show improved behavior across days of approaching the correct hole. Is this statement suggesting that once they've approached the target hole, the lack of the escape box leads to a reduction in the retention of that memory?

      We speculated that when the mice observed the absence of the escape box, a certain prediction error would be generated, which may have driven the memory modification. In App<sup>NL-G-F</sup> mice, such modification, either overgeneralization or overdifferentiation, could render the memory of the target hole vulnerable; if overgeneralization occurred, the memory would be quickly overwritten by one in which the goal no longer exists at this position in this maze, whereas if overdifferentiation occurred, a novel memory would be formed in which the goal does not exist in a maze that differs from the previous one. In either case of misclassification, the probability of retrieving the goal position would be reduced. To reduce ambiguity in this sentence, we have revised the description in the manuscript (Lines 432-434) as follows: “These results suggest that App<sup>NL-G-F</sup> mice could successfully form a spatial memory of the target hole, while they did not retrieve the spatial memory of the target hole as strongly as control mice when they observed the absence of the escape box during the probe test.”

      [#24] (13) The connection between the results of Barnes maze and the fear learning paradigm is weak. How can changes in overgeneralization due to a reduction in the creation of inferred states and differentiation due to a reduced σx lead to the observations in the Barnes maze experiment?

      We extrapolated our interpretation of the reinstatement modeling to behaviors in a different behavioral task, to explore the explanatory power of the latent cause framework in formalizing mechanisms of associative learning and memory modification. Here, we explain the results of the reversal Barnes maze paradigm in terms of the latent cause model, while referring to the reinstatement paradigm.

      Whilst we acknowledge that fear conditioning and spatial learning are not fully comparable, the reversal Barnes maze paradigm used in our study shares several key learning components with the reinstatement paradigm. 

      First, associative learning is fundamental in spatial learning (Leising & Blaisdell, 2009; Pearce, 2009). Although we did not make any specific assumptions about what kind of associations were learned in the Barnes maze, performance improvements in the learning phases likely reflect trial-and-error updates of these associations involving sensory preconditioning or secondary conditioning. Second, the reversal training phases could resemble the extinction phase in the reinstatement paradigm in that both challenge previously established memory. In terms of the latent cause model, both the reversal learning phase in the reversal Barnes maze paradigm and the extinction phase in the reinstatement paradigm induce a mismatch of the internal state. This process likely introduces large prediction errors, triggering memory modification to reconcile competing memories.

      Under the latent cause framework, we posit that the mice would either infer new memories or modify existing memories for the unexpected observations in the Barnes maze (e.g., changed location or absence of escape box) as in the reinstatement paradigm, but learn a larger number of association rules between stimuli in the maze compared to those in the reinstatement. In the reversal Barnes maze paradigm, the animals would infer that a latent cause generates the stimuli in the maze at certain associative weights in each trial, and would adjust behavior by retaining competing memories.

      Both overgeneralization and overdifferentiation could explain the lower exploration time of the target hole in the App<sup>NL-G-F</sup> mice in probe test 1. In the case of overgeneralization, the mice would overwrite the existing spatial memory of the target hole with a memory that the escape box is absent. In the case of overdifferentiation, the mice would infer a new memory such that the goal does not exist in the novel field, in addition to the old memory where the goal exists in the previous field. In both cases, the App<sup>NL-G-F</sup> mice would not infer that the location of the goal is fixed at a particular point and would fail to retain competing spatial memories of the goal, leading them to rely on a less precise, non-spatial strategy to solve the task.

      Since there is no established way to formalize the Barnes maze learning in the latent cause model, we did not directly apply the latent cause model to the Barnes maze data. Instead, we used the view above to explore common processes in memory modification between the reinstatement and the Barnes maze paradigm. 

      The above description was added to the manuscript on page 13 (Line 410-414) and page 19-20 (Line 600-602, 626-639).

      [#25] (14) In the fear conditioning task, it may be valuable to separate responding to the context and the cue at the time of the final test. The mice can learn about the context during the reinstatement, but there must be an inference to the cue as it's not present during the reinstatement phase. This would provide an opportunity for the model to perhaps access a prior state that was formed during acquisition. This would be more in line with the original proposal by Gershman et al. 2017 with spontaneous recovery.

      Please refer to the reply to comment #13 regarding separating the response to context in test 3.  

      Reviewer #3 (Public review):

      Summary:

      This paper seeks to identify underlying mechanisms contributing to memory deficits observed in Alzheimer's disease (AD) mouse models. By understanding these mechanisms, they hope to uncover insights into subtle cognitive changes early in AD to inform interventions for early-stage decline.

      Strengths:

      The paper provides a comprehensive exploration of memory deficits in an AD mouse model, covering the early and late stages of the disease. The experimental design was robust, confirming age-dependent increases in Aβ plaque accumulation in the AD model mice and using multiple behavior tasks that collectively highlighted difficulties in maintaining multiple competing memory cues, with deficits most pronounced in older mice.

      In the fear acquisition, extinction, and reinstatement task, AD model mice exhibited a significantly higher fear response after acquisition compared to controls, as well as a greater drop in fear response during reinstatement. These findings suggest that AD mice struggle to retain the fear memory associated with the conditioned stimulus, with the group differences being more pronounced in the older mice.

      In the reversal Barnes maze task, the AD model mice displayed a tendency to explore the maze perimeter rather than the two potential target holes, indicating a failure to integrate multiple memory cues into their strategy. This contrasted with the control mice, which used the more confirmatory strategy of focusing on the two target holes. Despite this, the AD mice were quicker to reach the target hole, suggesting that their impairments were specific to memory retrieval rather than basic task performance.

      The authors strengthened their findings by analyzing their data with a leading computational model, which describes how animals balance competing memories. They found that AD mice showed somewhat of a contradiction: a tendency to both treat trials as more alike than they are (lower α) and similar stimuli as more distinct than they are (lower σx) compared to controls.

      Weaknesses:

      While conceptually solid, the model struggles to fit the data and to support the key hypothesis about AD mice's ability to retain competing memories. These issues are evident in Figure 3:

      [#26] (1) The model misses key trends in the data, including the gradual learning of fear in all groups during acquisition, the absence of a fear response at the start of the experiment, the increase in fear at the start of day 2 of extinction (especially in controls), and the more rapid reinstatement of fear observed in older controls compared to acquisition.

      We acknowledge these limitations and explain why they arise in the latent cause model as follows.

      a. Absence of a fear response at the start of the experiment and the gradual learning of fear during acquisition 

      In the latent cause model, the CR is derived from a sigmoidal transformation from the predicted outcome with the assumption that its mapping to behavioral response may be nonlinear (see Equation 10 and section “Conditioned responding” in Gershman et al., 2017). 

      The magnitude of the unconditioned response (trial 1) is determined by w<sub>0</sub>, θ, and λ. An example was given in Appendix 2 – table 3. In general, a higher w<sub>0</sub> and a lower θ produce a higher trial 1 CR when other parameters are fixed. During the acquisition phase, once the expected shock exceeds θ, CR rapidly approaches 1, and further increases in expected shock produce few changes in CR. This rapid increase was also evident in the spontaneous recovery simulation (Figure 11) in Gershman et al. (2017). The steepness of this rapid increase is modulated by λ such that a higher value produces a shallower slope. This is a characteristic of the latent cause model, assuming CR follows a sigmoid function of expected shock, while the ordinal relationship over CRs is maintained with or without the sigmoid function, as Gershman et al. (2017) mentioned. If one assumes that the CR should be proportional to the expected shock, the model can reproduce the gradual response as a linear combination of w and posteriors of latent causes while omitting the sigmoid transformation (Figure 3). 
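
      The behavior described above can be sketched numerically. The exact mapping is defined in Equation 10 of Gershman et al. (2017); the form below is an assumption made for illustration, namely a Gaussian cumulative distribution function centred on θ with λ treated as its scale. The function name `conditioned_response` and the example values are hypothetical.

```python
import math


def conditioned_response(expected_shock, theta, lam):
    """Sigmoidal CR as a function of expected shock.

    Assumed form (illustrative only): a Gaussian CDF centred on the
    threshold `theta`, with `lam` acting as the scale. A larger `lam`
    gives a shallower slope around the threshold.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    z = (expected_shock - theta) / (lam * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


# Hypothetical parameter values, chosen only to show the shape of the curve.
theta = 0.3
for lam in (0.05, 0.2):
    curve = [round(conditioned_response(v, theta, lam), 3)
             for v in (0.0, 0.2, 0.3, 0.4, 1.0)]
    print(f"lambda={lam}: {curve}")
```

      Under this assumed form, the CR is 0.5 when the expected shock equals θ and saturates quickly above it, so further increases in expected shock change the CR very little. That is the reason the simulated acquisition curve looks step-like rather than gradual.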

      b. Increase in fear at the start of day 2 extinction

      This point is partially reproduced by the latent cause model. As shown in Figure 3, trial 24 (the first trial of day 2 extinction) showed an increase in both the posterior probability of the latent cause retaining the fear memory and the simulated CR in all groups except the 6-month-old control group, though the increase in CR was small due to the sigmoid transformation (see above). This can be explained by the latent cause model because the 24 h interval between extinction 1 and 2 decreases the prior of the previously inferred latent cause, leading to an increase in the priors of other latent causes.

      Unlike other groups, the 6-month-old control did not exhibit increased observed CR at trial 24 but at trial 25 (Figure 3A). The latent cause model failed to reproduce it, as there was no increase in posterior probability in trial 24 (Figure 3A). This could be partially explained by the low value of g, which counteracts the effect of the time interval between days: lower g keeps prior of the latent causes at the same level as those in the previous trial. Despite some failures in capturing this effect, our fitting policy was set to optimize prediction among the test trials given our primary purpose of explaining reinstatement.

      c. More rapid reinstatement of fear observed in older controls compared to acquisition

      We would like to point out that this was replicated by the latent cause model as shown in Figure 3 – figure supplement 1C. The DI between test 3 and test 1 calculated from the simulated CR was significantly higher in 12-month-old control than in App<sup>NL-G-F</sup> mice (cf. Figure 2C to E).  

      [#27] (2) The model attributes the higher fear response in controls during reinstatement to a stronger association with the context from the unsignaled shock phase, rather than to any memory of the conditioned stimulus from acquisition. These issues lead to potential overinterpretation of the model parameters. The differences in α and σx are being used to make claims about cognitive processes (e.g., overgeneralization vs. overdifferentiation), but the model itself does not appear to capture these processes accurately. The authors could benefit from a model that better matches the data and that can capture the retention and recollection of a fear memory across phases.

      First, we would like to clarify that the latent cause model explains the reinstatement not only by the extinction latent cause with increased w<sub>context</sub> but also by the acquisition latent cause with preserved w<sub>CS</sub> and w<sub>context</sub> (see also reply to comment #13). Second, the latent cause model primarily attributes the higher fear reinstatement in control to a lower number of latent causes inferred after extinction (Figure 4E) and higher w<sub>context</sub> in extinction latent cause (Figure 4G). We noted that there was a trend toward significance in the posterior probability of latent causes inferred after extinction (Figure 4E), which in turn influences those of acquisition latent causes. Although the posterior probability of acquisition latent cause appeared trivial and no significance was detected between control and App<sup>NL-G-F</sup> mice (Figure 4C), it was suppressed by new latent causes in App<sup>NL-G-F</sup> mice (Author response image 6).

      This indicates that App<sup>NL-G-F</sup> mice retrieved acquisition memory less strongly than control mice. Therefore, we argue that the latent cause model attributed a higher fear response in control during reinstatement not solely to the stronger association with the context but also to CS fear memory from acquisition. Although we tested whether additional models fit the reinstatement data in individual mice, these models did not satisfy our fitting criteria for many mice compared to the latent cause model (see also reply to comment #4 and #28).

      Author response image 6.

      Posterior probability of acquisition, extinction, and after extinction latent causes in test 3. The values within each bar indicate the mean posterior probability of acquisition latent cause (darkest shade), extinction latent cause (medium shade), and latent causes inferred after extinction (lightest shade) in test 3 over mice within genotype. Source data are the same as those used in Figure 4C–E (posterior of z).

      Conclusion:

      Overall, the data support the authors' hypothesis that AD model mice struggle to retain competing memories, with the effect becoming more pronounced with age. While I believe the right computational model could highlight these differences, the current model falls short in doing so.

      Reviewer #3 (Recommendations for the authors):

      [#28] Other computational models may better capture the data. Ideally, I'd look for a model that can capture the gradual learning during acquisition, and, in some mice, the inferring of a new latent cause during extinction, allowing the fear memory to be retained and referenced at the start of day 2 extinction and during later tests.

      We have further evaluated another computational model, the latent state model, and compared it with the latent cause model. The simulation of reinstatement and parameter estimation method of the latent state model were described in the Appendix.

      The latent state model proposed by Cochran and Cisler (2019) shares several concepts with the latent cause model, and well replicates empirical data under certain conditions. We expect that it can also explain the reinstatement. 

      Following the same analysis flow for the latent cause model, we estimated the parameters and simulated reinstatement in the latent state model from individual CRs and median of them. In the median freezing rate data of the 12-month-old control mice, the simulated CR replicated the observed CR well and exhibited the ideal features that the reviewer looked for: gradual learning during acquisition and an increased fear at the start of the second-day extinction (Appendix 1 – figure 1G). However, a lot of samples did not fit well to the latent state model. The number of anomalies was generally higher than that in the latent cause model (Appendix 1 – figure 2). Within the accepted samples, the sum of squared prediction error in all trials was significantly lower in the latent state model, which resulted from lower prediction error in the acquisition trials (Appendix 1 – figure 4A and 4B). In the three test trials, the squared prediction error was comparable between the latent state model and the latent cause model except for the test 2 trials in the control group (Appendix 1 – figure 4A and 4B, rightmost panel). On the other hand, almost all accepted samples continued to infer the acquisition latent states during extinction without inferring new states (Appendix 1 – figure 5B and 5E, left panel), which differed from the ideal internal states the reviewer expected. While the latent state model fit performance seems to be better than the latent cause model, the accepted samples cannot reproduce the lower DI between test 3 and test 1 in aged App<sup>NL-G-F</sup> mice (Appendix 1 – figure 6C). These results make the latent state model less suitable for our purpose and therefore we decided to stay with the latent cause model. It should also be noted that we did not explore all parameter spaces of the latent state model hence we cannot rule out the possibility that alternative parameter sets could provide a better fit and explain the memory modification process well. 
      A more comprehensive parameter search in the latent state model may be a valuable direction for future research.

      If you decide not to go with a new model, my preference would be to drop the current modeling. However, if you wish to stay with the current model, I'd like to see justification or acknowledgment of the following:

      [#29] (1) Lower bound on alpha of 1: This forces the model to infer new latent causes, but it seems that some mice, especially younger AD mice, might rely more on classical associative learning (e.g., Rescorla-Wagner) rather than inferring new causes.

      We acknowledge that the default value set in Gershman et al. (2017) is 0.1, and the constraint we set is a much higher value. However, ⍺ = 1 does not always force the model to infer new latent causes.

      In the standard form of the Chinese restaurant process (CRP), the prior that the n<sup>th</sup> observation is assigned to a new cluster is given by ⍺ / (n - 1 + ⍺) (Blei & Gershman, 2012). When ⍺ = 1, the prior of the new cluster for the 2nd observation will be 0.5; when ⍺ = 3, this prior increases to 0.75. Thus, when ⍺ > 1, the prior of the new cluster is above chance early in the sequence, which may relate to the reviewer’s concern. However, this effect diminishes as the number of observations increases. For instance, the prior of the new cluster drops to 0.1 and 0.25 for the 10th observation when ⍺ = 1 and 3, respectively. Furthermore, the prior in the latent cause model is governed not only by α but also by g, a scaling parameter for the temporal difference between successive observations (see Results in the manuscript) following the “distance-dependent” CRP, and is then normalized over all latent causes including a new latent cause. Thus, ⍺ greater than 1 does not necessarily force agents to infer a new latent cause. As shown in Appendix 2 – table 4, the number of latent causes does not inflate in each trial when α = 1. On the other hand, the high number of latent causes due to α = 2 can be suppressed when g = 0.01. More importantly, the driving force is the prediction error generated in each trial (see also comment #31 about the interaction between ⍺ and σ<sub>x</sub><sup>2</sup>). Raising the value of ⍺ per se can be viewed as increasing the probability of inferring a new latent cause, not as forcing the model to do so by higher α alone.
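
      The numbers quoted above follow directly from the standard CRP formula ⍺ / (n - 1 + ⍺). The short sketch below reproduces them; it covers only the standard CRP, not the distance-dependent variant with g used in the latent cause model, and the function name `new_cluster_prior` is hypothetical.

```python
def new_cluster_prior(n, alpha):
    """Prior probability that the n-th observation opens a new cluster
    under the standard Chinese restaurant process: alpha / (n - 1 + alpha)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return alpha / (n - 1 + alpha)


# Values discussed in the reply: 2nd and 10th observation, alpha = 1 and 3.
for n in (2, 10):
    for alpha in (1, 3):
        print(f"n={n}, alpha={alpha}: {new_cluster_prior(n, alpha):.2f}")
```

      The printed values are 0.50 and 0.75 for the 2nd observation and 0.10 and 0.25 for the 10th, matching the figures given in the text.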

      During parameter exploration using the median behavioral data under a wider range of ⍺ with a lower boundary at 0.1, the estimated value eventually exceeded 1. Therefore, we set the lower bound of ⍺ to 1 to reduce inefficient sampling.

      [#30] (2) Number of latent causes: Some mice infer nearly as many latent causes as trials, which seems unrealistic.

      We set the upper boundary for the maximum number of latent causes (K) to be 36 to align with the infinite features of CRP. This allowed some mice to infer more than 20 latent causes in total. When we checked the learning curves in these mice, we found that they largely fluctuated or did not show clear decreases during the extinction (Author response image 7, colored lines). The simulated learning curves were almost flat in these trials (Author response image 7, gray lines). It might be difficult to estimate the internal states of such atypical mice if the sampling process tried to fit them by increasing the number of latent causes. Nevertheless, most of the samples have a reasonable total number of latent causes: 12-month-old control mice, Mdn = 5, IQR = 4; 12-month-old App<sup>NL-G-F</sup> mice, Mdn = 5, IQR = 1.75; 6-month-old control mice, Mdn = 7, IQR = 12.5; 6-month-old App<sup>NL-G-F</sup> mice, Mdn = 5, IQR = 5.25. These data were provided in Tables S9 and S12.  

      Author response image 7.

      Samples with a high number of latent causes. Observed CR (colored line) and simulated CR (gray line) for individual samples with a total number of inferred latent causes exceeding 20. 

      [#31] (3) Parameter estimation: With 10 parameters fitting one-dimensional curves, many parameters (e.g., α and σx) are likely highly correlated and poorly identified. Consider presenting scatter plots of the parameters (e.g., α vs σx) in the Supplement.

      We have provided the scatter plots with a correlation matrix in Figure 4 – figure supplement 1 for the 12-month-old group and Figure 5 – figure supplement 1 for the 6-month-old group. As pointed out by the reviewer, there are significant rank correlations between parameters including ⍺ and σ<sub>x</sub><sup>2</sup> in both the 6 and 12-month-old groups. However, we also noted that there are no obvious linear relationships between the parameters.

      The correlation above raises a potential problem of non-identifiability among parameters. First, we computed the variance inflation factor (VIF) for all parameters to examine the risk of multicollinearity, though we did not consider a linear regression between parameters and DI in this study. All VIF values were below the conventional threshold of 10 (Appendix 2 – table 5), suggesting that severe multicollinearity is unlikely to bias our conclusions. Second, we have conducted simulations with different combinations of ⍺, σ<sub>x</sub><sup>2</sup>, and K to clarify their contribution to the overgeneralization and overdifferentiation observed in the 12-month-old group.
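
      For readers unfamiliar with the VIF, the sketch below shows one common way to compute it: VIF<sub>j</sub> = 1 / (1 - R<sub>j</sub><sup>2</sup>), which equals the j-th diagonal element of the inverse of the correlation matrix. This is a pure-Python illustration with toy data, not the analysis code behind Appendix 2 – table 5; all function names and values are hypothetical.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in scaled] for ci in scaled]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (intended for small matrices)."""
    size = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def variance_inflation_factors(columns):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Toy data (made up): two parameters with a Pearson correlation of 0.8.
a = [1, 2, 3, 4, 5]
b = [2, 1, 4, 3, 5]
print([round(v, 3) for v in variance_inflation_factors([a, b])])
```

      In this toy case both VIF values equal 1 / (1 - 0.8<sup>2</sup>), approximately 2.78, which is well below the threshold of 10 even though the two columns are clearly correlated. Note that the VIF captures linear dependence only.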

      In Appendix 2 – table 6, the values of ⍺ and σ<sub>x</sub><sup>2</sup> were either their upper or lower boundary set in parameter estimation, while the value of K was selected heuristically to demonstrate its effect. Given the observed positive correlation between ⍺ and σ<sub>x</sub><sup>2</sup>, and their negative correlation with K (Figure 4 - figure supplement 1), we considered the product of K = {4, 35}, ⍺ = {1, 3}, and σ<sub>x</sub><sup>2</sup> = {0.01, 3}. Among these combinations, the representative condition for the control group is α = 3, σ<sub>x</sub><sup>2</sup> = 3, and that for the App<sup>NL-G-F</sup> group is α = 1, σ<sub>x</sub><sup>2</sup> = 0.01. In the latter condition, overgeneralization and overdifferentiation were strongly induced, as indicated by a higher test 1 CR, a lower number of acquisition latent causes (K<sub>acq</sub>), a lower test 3 CR, a lower DI between test 3 and test 1, and a higher number of latent causes after extinction (K<sub>rem</sub>).
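
      The set of simulated conditions can be enumerated as a full factorial grid. The sketch below only lists the combinations, using the values as stated in the sentence above; it does not run the latent cause model, and the variable names are hypothetical.

```python
from itertools import product

# Values as listed in the text for the simulation grid.
K_VALUES = [4, 35]
ALPHA_VALUES = [1, 3]
SIGMA_X2_VALUES = [0.01, 3]

# Every combination of K, alpha, and sigma_x^2 (2 x 2 x 2 = 8 conditions).
grid = [
    {"K": k, "alpha": a, "sigma_x2": s}
    for k, a, s in product(K_VALUES, ALPHA_VALUES, SIGMA_X2_VALUES)
]

for condition in grid:
    print(condition)
```

      Enumerating all eight conditions, rather than only the two representative ones, is what allows combinations that fall outside the empirical correlation to be examined.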

      We found that conditions falling outside the empirical correlation, such as ⍺ = 3, σ<sub>x</sub><sup>2</sup> = 0.01, also reproduced overgeneralization and overdifferentiation. Similarly, the combination ⍺ = 1, σ<sub>x</sub><sup>2</sup> = 3 exhibited control-like behavior when K = 4 but shifted toward App<sup>NL-G-F</sup>-like behavior when K = 36. The effect of K was also evident when ⍺ = 3 and σ<sub>x</sub><sup>2</sup> = 3, where K = 36 led to overdifferentiation. We note that these conditions were set artificially and are likely not biologically plausible. These results underscore the non-identifiability concern raised by the reviewer. Therefore, we acknowledge that merely attributing overgeneralization to lower ⍺ or overdifferentiation to lower σ<sub>x</sub><sup>2</sup> may be overly reductive. Instead, these patterns likely arise from the joint effect of ⍺, σ<sub>x</sub><sup>2</sup>, and K. We have revised the manuscript accordingly in Results and Discussion (pages 11-13, 18-19).

      [#32] (4) Data normalization: Normalizing the data between 0 and 1 removes the interpretability of % freezing, making mice with large changes in freezing indistinguishable from mice with small changes.

      As we describe in our reply to comment #26, the conditioned response in the latent cause model is scaled between 0 and 1, and we assume that 0 and 1 correspond to the minimal and maximal CR within each mouse, respectively. Furthermore, although we initially tried to fit simulated CRs to raw CRs, the fit was poor because of individual differences in the degree of behavioral expression: some mice exhibited a larger range of CR, while others showed a narrower one. Thus, we decided to normalize the data. We agree that this processing makes mice with large changes in % freezing indistinguishable from those with small changes. However, the relative changes in % freezing within each mouse were preserved, and the normalization did not affect the discrimination index.
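
      To make this trade-off concrete, the following is a minimal sketch of within-mouse min-max scaling, assuming that 0 and 1 are mapped to each mouse's minimal and maximal CR as described above. The function name `normalize_cr` and the example values are hypothetical and are not taken from the data.

      ```python
      def normalize_cr(freezing):
          """Scale one mouse's freezing rates to [0, 1] using its own minimum and maximum."""
          lo, hi = min(freezing), max(freezing)
          if hi == lo:
              # A mouse with a constant response carries no within-mouse contrast.
              return [0.0 for _ in freezing]
          return [(value - lo) / (hi - lo) for value in freezing]


      # Hypothetical mice: one with a wide range of % freezing, one with a narrow range.
      wide_range_mouse = [20, 80, 50]
      narrow_range_mouse = [40, 60, 50]
      ```

      Both hypothetical mice map to the same normalized profile, which is the loss of between-mouse scale noted by the reviewer, while the ordering and relative spacing of responses within each mouse are unchanged.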

      [#33] (5) Overlooking parameter differences: Differences in parameters, like w<sub>0</sub>, that didn't fit the hypothesis may have been ignored.

      Our initial hypothesis was that internal states are altered in App<sup>NL-G-F</sup> mice, and we did not have a specific hypothesis about which parameter would contribute to such a state. We mainly focus on the parameters (1) that are significantly different between control and App<sup>NL-G-F</sup> mice and (2) that are significantly correlated with the empirical behavioral data, namely the DI between test 3 and test 1.

      In the 12-month-old group, besides ⍺ and σ<sub>x</sub><sup>2</sup>, w<sub>0</sub> and K showed marginal p-values in the Mann-Whitney U test (Table S7) and moderate correlations with the DI (Table S8). While differences in K were already discussed in the manuscript, the previous version overlooked the possibility that w<sub>0</sub> could contribute to the differences in w between control and App<sup>NL-G-F</sup> mice (Figure 4G). We explain the contribution of w<sub>0</sub> to the reinstatement results here. When other parameters are fixed, a higher w<sub>0</sub> leads to a higher CR in test 3, because a higher w<sub>0</sub> allows w<sub>context</sub> to increase following the unsignaled shock, leading to reinstatement (Appendix 2 – table 7). It is likely that higher w<sub>0</sub> values were sampled during parameter estimation in the 12-month-old control group but not in the App<sup>NL-G-F</sup> group. On the other hand, the number of latent causes is not sensitive to w<sub>0</sub> when the other parameters are fixed at their initial guess values (Appendix 2 – table 1), suggesting that w<sub>0</sub> makes only a small contribution to the memory modification process.

      Thus, we speculate that although the difference in w<sub>0</sub> between control and App<sup>NL-G-F</sup> mice may arise from the sampling process, resulting in a positive correlation with the DI between test 3 and test 1, its contribution to the diverged internal states would be smaller than that of α or σ<sub>x</sub><sup>2</sup>, as a wide range of w<sub>0</sub> values has no effect on the number of latent causes (Appendix 2 – table 7). We have added a discussion of the differences in w<sub>0</sub> in the 12-month-old group to the manuscript (Lines 357-359).

      In the 6-month-old group, besides ⍺ and σ<sub>x</sub><sup>2</sup>, 𝜃 is significantly higher in the AD mouse group (Table S10) but is not correlated with the DI (Table S11). We have already discussed this point in the manuscript.

      [#34] (6) Initial response: Higher initial responses in the model at the start of the experiment may reflect poor model fit.

      Please refer to our reply to comment #26 for our explanation of what contributes to high initial responses in the latent cause model.

      In addition, achieving a good fit for the acquisition CRs was not our primary purpose, as the response measured in the acquisition phase includes not only a conditioned response to the CS and context but also an unconditioned response to the novel stimuli (CS and US). This mixed response presumably increased the variance of the measured freezing rate across individuals; therefore, we did not cover these results in the Discussion.

      Rather, we favor models that at least replicate the establishment of conditioning, extinction, and reinstatement of fear memory in order to explain the memory modification process. As we mentioned in the reply to comment #4, the alternative models, namely the latent state model and the Rescorla-Wagner model, failed to replicate these observations (cf. Figure 3 – figure supplement 1A-1C). Thus, we chose to rely on the latent cause model, as it aligns better with the purpose of this study.

      [#35] In addition, please be transparent if data is excluded, either during the fitting procedure or when performing one-way ANCOVA. Avoid discarding data when possible, but if necessary, provide clarity on the nature of excluded data (e.g., how many, why were they excluded, which group, etc?).

      We clarify the details of the excluded data as follows. We had 25 mice in the 6-month-old control group, 26 mice in the 6-month-old App<sup>NL-G-F</sup> group, 29 mice in the 12-month-old control group, and 26 mice in the 12-month-old App<sup>NL-G-F</sup> group (Table S1).

      Our first exclusion procedure was applied to the freezing rate data in the test phases. If a mouse had a freezing rate outside the 1.5 × IQR range in any of the test phases, it was regarded as an outlier and removed from the analysis (see Statistical analysis in Materials and Methods). One mouse in the 6-month-old control group, one mouse in the 6-month-old App<sup>NL-G-F</sup> group, five mice in the 12-month-old control group, and two mice in the 12-month-old App<sup>NL-G-F</sup> group were excluded.
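
      The first exclusion rule can be sketched as follows. This is an illustration under stated assumptions, not the analysis script: it assumes the usual fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, computes quartiles with the "inclusive" method of Python's `statistics.quantiles` (the quartile convention actually used may differ), and is meant to be applied to one experimental group at a time. The names `iqr_bounds` and `outlier_ids` and the example values are hypothetical.

      ```python
      from statistics import quantiles


      def iqr_bounds(values, k=1.5):
          """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
          q1, _, q3 = quantiles(values, n=4, method="inclusive")
          iqr = q3 - q1
          return q1 - k * iqr, q3 + k * iqr


      def outlier_ids(freezing_by_test, k=1.5):
          """Flag mice whose freezing rate falls outside the fences in any test phase.

          `freezing_by_test` maps a test name to a {mouse_id: freezing_rate} dict
          for a single experimental group.
          """
          flagged = set()
          for rates in freezing_by_test.values():
              lower, upper = iqr_bounds(list(rates.values()), k)
              flagged |= {mouse for mouse, rate in rates.items()
                          if rate < lower or rate > upper}
          return flagged


      # Hypothetical freezing rates (%) for five mice in two test phases.
      example_group = {
          "test1": {"m1": 10, "m2": 12, "m3": 11, "m4": 13, "m5": 90},
          "test2": {"m1": 50, "m2": 52, "m3": 51, "m4": 53, "m5": 54},
      }
      ```

      In this example only the hypothetical mouse `m5` is flagged, because its test 1 value lies above the upper fence even though its test 2 value is unremarkable.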

      Our second exclusion procedure was applied during fitting and parameter estimation (see Parameter estimation in Materials and Methods). We have provided the number of anomalous samples detected during parameter estimation in Appendix 1 – figure 2.

      Lastly, we would like to state that none of the sample sizes reported in the figure legends include the outliers detected through the exclusion procedures mentioned above.

      [#36] Finally, since several statistical tests were used and the differences are small, I suggest noting that multiple comparisons were not controlled for, so p-values should be interpreted cautiously.

      We have provided power analyses in Tables S21 and S22, with the methods described in the manuscript (Lines 897-898), and added a note that not all of the multiple comparisons were corrected for (Lines 898-899).

      References cited in the response letter only 

      Bellio, T. A., Laguna-Torres, J. Y., Campion, M. S., Chou, J., Yee, S., Blusztajn, J. K., & Mellott, T. J. (2024). Perinatal choline supplementation prevents learning and memory deficits and reduces brain amyloid Aβ42 deposition in App<sup>NL-G-F</sup> Alzheimer’s disease model mice. PLOS ONE, 19(2), e0297289. https://doi.org/10.1371/journal.pone.0297289

      Blei, D. M., & Frazier, P. I. (2011). Distance Dependent Chinese Restaurant Processes. Journal of Machine Learning Research, 12(74), 2461–2488.

      Cochran, A. L., & Cisler, J. M. (2019). A flexible and generalizable model of online latent-state learning. PLOS Computational Biology, 15(9), e1007331. https://doi.org/10.1371/journal.pcbi.1007331

      Curiel Cid, R. E., Crocco, E. A., Duara, R., Vaillancourt, D., Asken, B., Armstrong, M. J., Adjouadi, M., Georgiou, M., Marsiske, M., Wang, W., Rosselli, M., Barker, W. W., Ortega, A., Hincapie, D., Gallardo, L., Alkharboush, F., DeKosky, S., Smith, G., & Loewenstein, D. A. (2024). Different aspects of failing to recover from proactive semantic interference predicts rate of progression from amnestic mild cognitive impairment to dementia. Frontiers in Aging Neuroscience, 16. https://doi.org/10.3389/fnagi.2024.1336008

      Giustino, T. F., Fitzgerald, P. J., Ressler, R. L., & Maren, S. (2019). Locus coeruleus toggles reciprocal prefrontal firing to reinstate fear. Proceedings of the National Academy of Sciences, 116(17), 8570–8575. https://doi.org/10.1073/pnas.1814278116

      Gu, X., Wu, Y.-J., Zhang, Z., Zhu, J.-J., Wu, X.-R., Wang, Q., Yi, X., Lin, Z.-J., Jiao, Z.-H., Xu, M., Jiang, Q., Li, Y., Xu, N.-J., Zhu, M. X., Wang, L.-Y., Jiang, F., Xu, T.-L., & Li, W.-G. (2022). Dynamic tripartite construct of interregional engram circuits underlies forgetting of extinction memory. Molecular Psychiatry, 27(10), 4077–4091. https://doi.org/10.1038/s41380-022-01684-7

      Lacagnina, A. F., Brockway, E. T., Crovetti, C. R., Shue, F., McCarty, M. J., Sattler, K. P., Lim, S. C., Santos, S. L., Denny, C. A., & Drew, M. R. (2019). Distinct hippocampal engrams control extinction and relapse of fear memory. Nature Neuroscience, 22(5), 753–761. https://doi.org/10.1038/s41593-019-0361-z

      Loewenstein, D. A., Curiel, R. E., Greig, M. T., Bauer, R. M., Rosado, M., Bowers, D., Wicklund, M., Crocco, E., Pontecorvo, M., Joshi, A. D., Rodriguez, R., Barker, W. W., Hidalgo, J., & Duara, R. (2016). A Novel Cognitive Stress Test for the Detection of Preclinical Alzheimer’s Disease: Discriminative Properties and Relation to Amyloid Load. The American Journal of Geriatric Psychiatry: Official Journal of the American Association for Geriatric Psychiatry, 24(10), 804–813. https://doi.org/10.1016/j.jagp.2016.02.056

      Loewenstein, D. A., Greig, M. T., Curiel, R., Rodriguez, R., Wicklund, M., Barker, W. W., Hidalgo, J., Rosado, M., & Duara, R. (2015). Proactive Semantic Interference Is Associated With Total and Regional Abnormal Amyloid Load in Non-Demented Community-Dwelling Elders: A Preliminary Study. The American Journal of Geriatric Psychiatry: Official Journal of the American Association for Geriatric Psychiatry, 23(12), 1276–1279. https://doi.org/10.1016/j.jagp.2015.07.009

      Valles-Salgado, M., Gil-Moreno, M. J., Curiel Cid, R. E., Delgado-Álvarez, A., Ortega-Madueño, I., Delgado-Alonso, C., Palacios-Sarmiento, M., López-Carbonero, J. I., Cárdenas, M. C., Matías-Guiu, J., Díez-Cirarda, M., Loewenstein, D. A., & Matias-Guiu, J. A. (2024). Detection of cerebrospinal fluid biomarkers changes of Alzheimer’s disease using a cognitive stress test in persons with subjective cognitive decline and mild cognitive impairment. Frontiers in Psychology, 15. https://doi.org/10.3389/fpsyg.2024.1373541

      Zaki, Y., Mau, W., Cincotta, C., Monasterio, A., Odom, E., Doucette, E., Grella, S. L., Merfeld, E., Shpokayte, M., & Ramirez, S. (2022). Hippocampus and amygdala fear memory engrams reemerge after contextual fear relapse. Neuropsychopharmacology, 47(11), 1992–2001. https://doi.org/10.1038/s41386-022-01407-0