10,000 Matching Annotations
  1. Nov 2025
  2. social-media-ethics-automation.github.io
    1. Sarah T. Roberts. Behind the Screen. Yale University Press, September 2021. URL: https://yalebooks.yale.edu/9780300261479/behind-the-screen (visited on 2023-12-08).
       [o2] Tarleton Gillespie. Custodians of the Internet. Yale University Press, August 2021. URL: https://yalebooks.yale.edu/9780300261431/custodians-of-the-internet (visited on 2023-12-08).
       [o3] Reddit. URL: https://www.reddit.com/ (visited on 2023-12-08).
       [o4] ShiningConcepts. r/TheoryOfReddit: reddit is valued at more than ten billion dollars, yet it is extremely dependent on mods who work for absolutely nothing. Should they be paid, and does this lead to power-tripping mods? November 2021. URL: www.reddit.com/r/TheoryOfReddit/comments/qrjwjw/reddit_is_valued_at_more_than_ten_billion_dollars/ (visited on 2023-12-08).
       [o5] Wikipedia. URL: https://www.wikipedia.org/ (visited on 2023-12-08).
       [o6] Wikipedia:Administrators. November 2023. Page Version ID: 1187624916. URL: https://en.wikipedia.org/w/index.php?title=Wikipedia:Administrators&oldid=1187624916 (visited on 2023-12-08).
       [o7] Wikipedia:Paid-contribution disclosure. November 2023. Page Version ID: 1184161032. URL: https://en.wikipedia.org/w/index.php?title=Wikipedia:Paid-contribution_disclosure&oldid=1184161032 (visited on 2023-12-08).
       [o8] Wikipedia:Wikipedians. November 2023. Page Version ID: 1184672006. URL: https://en.wikipedia.org/w/index.php?title=Wikipedia:Wikipedians&oldid=1184672006 (visited on 2023-12-08).
       [o9] Brian Resnick. The 2018 Nobel Prize reminds us that women scientists too often go unrecognized. Vox, October 2018. URL: https://www.vox.com/science-and-health/2018/10/2/17929366/nobel-prize-physics-donna-strickland (visited on 2023-12-08).
       [o10] Maggie Fick and Paresh Dave. Facebook's flood of languages leaves it struggling to monitor content. Reuters, April 2019. URL: https://www.reuters.com/article/idUSKCN1RZ0DL/ (visited on 2023-12-08).
       [o11] David Gilbert. Facebook Is Ignoring Moderators’ Trauma: ‘They Suggest Karaoke and Painting’. Vice, May 2021. URL: https://www.vice.com/en/article/m7eva4/traumatized-facebook-moderators-told-to-suck-it-up-and-try-karaoke (visited on 2023-12-08).
       [o12] Billy Perrigo. TikTok's Subcontractor in Colombia Under Investigation. Time, November 2022. URL: https://time.com/6231625/tiktok-teleperformance-colombia-investigation/ (visited on 2023-12-08).
       [o13] Mike Masnick, Randy Lubin, and Leigh Beadon. Moderator Mayhem: A Content Moderation Game. URL: https://moderatormayhem.engine.is/ (visited on 2023-12-17).

      Sarah T. Roberts’ Behind the Screen really opened my eyes to how hidden and emotionally damaging content moderation work can be. The book reveals how the people who clean up the internet—filtering through disturbing images, videos, and hate speech—are often underpaid, outsourced, and given little emotional support. What struck me the most was how invisible this labor is, even though it’s essential for keeping social media platforms usable. Reading about the trauma moderators face makes me think differently about platforms like Facebook or TikTok, which profit from user-generated content but rely on poorly supported workers to make it “safe.” It makes me question whether platforms should be legally required to provide better pay, mental health care, and transparency about their moderation processes.

    2. Mike Masnick, Randy Lubin, and Leigh Beadon. Moderator Mayhem: A Content Moderation Game. URL: https://moderatormayhem.engine.is/ (visited on 2023-12-17).

      This game shows how difficult it is to correctly identify the context of each scenario. While an AI moderator could easily flag a broad baseline of keywords or language, humans understand how the world works, and the game forces moderators to stay informed about what is going on in the world. The player is pretty much blind unless they use the “look further” button, which might not be feasible given the endless amount of media created every second of the day.

    1. Experienced Game Developers In Jaipur, India

      I’m a Unity Developer with a focus on AR and VR experiences. Currently working at Orion InfoSolutions, where I design and build immersive solutions using Unity3D, MRTK, and XR Toolkit. Passionate about creating interactive and intuitive mixed reality experiences that bridge the digital and physical worlds. Throughout my career, I have collaborated with cross-functional teams to develop impactful applications across various industries, from entertainment and gaming to education and enterprise. Let's connect and explore how we can collaborate to bring your mixed reality visions to life.

    1. Game theory explains why we get these things wrong, and we need to find a way out of this which involves restraint.

      Exactly, and how do you solve a collective action problem to achieve that restraint?

    2. That game theoretic relationship creates a topology that is actually driving us in a self-terminating direction, and nobody’s steering because there is no sacred.

      A classic example of a S mis-explanation which sounds compelling on the surface.

      It's not the game theory topology that is driving us here per se; it is a feature of underlying human nature (self-interest, difficulty coordinating, free riding), as with global warming.

      And it is not because there is no sacred but because it is not perceived, and specifically not perceived communally. And that broke down for some good reasons.

    1. Reviewer #1 (Public review):

      Summary:

      The NF-kB signaling pathway plays a critical role in the development and survival of conventional alpha beta T cells. Gamma delta T cells are evolutionarily conserved T cells that occupy a unique niche in the host immune system and that develop and function in a manner distinct from conventional alpha beta T cells. Specifically, unlike the case for conventional alpha beta T cells, a large portion of gamma delta T cells acquire functionality during thymic development, after which they emigrate from the thymus and populate a variety of mucosal tissues. Exactly how gamma delta T cells are functionally programmed remains unclear. In this manuscript, Islam et al. use a wide variety of mouse genetic models to examine the influence of the NF-kB signaling pathway on gamma delta T cell development and survival. They find that the inhibitor of kappa B kinase complex (IKK) is critical to the development of gamma delta T1 subsets, but not adaptive/naïve gamma delta T cells. In contrast, IKK-dependent NF-kB activation is required for their long-term survival. They find that caspase 8-deficiency renders gamma delta T cells sensitive to RIPK1-mediated necroptosis, and they conclude that IKK repression of RIPK1 is required for the long-term survival of gamma delta T1 and adaptive/naïve gamma delta T cell subsets. These data will be invaluable in comparing and contrasting the signaling pathways critical for the development/survival of both alpha beta and gamma delta T cells.

      The conclusions of the paper are mostly well-supported by the data, but some aspects need to be clarified.

      (1) The authors appear to be excluding a significant fraction of the TCRlow gamma delta T cells from their analysis in Figure 1A. Since this population is generally enriched in CD25+ gamma delta T cells, this gating strategy could significantly impact their analysis due to the exclusion of progenitor gamma delta T cell populations.

      (2) The overall phenotype of the IKKDeltaTCd2 mice is not described in any great detail. For example, it is not clear if these mice possess altered thymocyte or peripheral T cell populations beyond that of gamma delta T cells. Given that gamma delta T cell development has been demonstrated to be influenced by gamma delta T cells (i.e., trans-conditioning), this information could have aided in the interpretation of the data. Related to this, it would have been helpful if the authors provided a comparison of the frequencies of each of the relevant subsets, in addition to the numbers.

      (3) The manner in which the peripheral gamma delta T cell compartment was analyzed is somewhat unclear. The authors appear to have assessed both spleen and lymph node separately. The authors show representative data from only one of these organs (usually the lymph node) and show one analysis of peripheral gamma delta T cell numbers, where they appear to have summed up the individual spleen and lymph node gamma delta T cell counts. Since gamma delta T17 and gamma delta T1 are distributed somewhat differently in these compartments (lymph node is enriched in gamma delta T17, while spleen is enriched in gamma delta T1), combining these data does not seem warranted. The authors should have provided representative plots for both organs and calculated and analyzed the gamma delta T cell numbers for both organs separately in each of these analyses.

      (4) The authors make extensive use of surrogate markers in their analysis. While the markers that they choose are widely used, there is a possibility that the expression of some of these markers may be altered in some of their genetic mutants. This could skew their analysis and conclusions. A better approach would have been to employ either nuclear stains (Tbx21, RORgammaT) or intracellular cytokine staining to definitively identify functional gamma delta T1 or gamma delta T17 subsets.

      (5) The analysis and conclusion of the data in Figure 3A are not convincing. Because the data are graphed on log scale, the magnitude of the rescue by kinase dead RIPK1 appears somewhat overstated. A rough calculation suggests that in type 1 gamma delta T cells, there is ~ 99% decrease in gamma delta T cells in the Cre+WT strain and a ~90% decrease in the Cre+KD+ strain. Similarly, it looks as if the numbers for adaptive gamma delta T cells are a 95% decrease and an 85% decrease, respectively. Comparing these data to the data in Figure 5, which clearly show that kinase dead RIPK1 can completely rescue the Caspase 8 phenotype, the conclusion that gamma delta T cells require IKK activity to repress RIPK1-dependent pathways does not appear to be well-supported. In fact, the data seem more in line with a conclusion that IKK has a significant impact on gamma delta T cell survival in the periphery that cannot be fully explained by invoking Caspase8-dependent apoptosis or necroptosis. Indeed, while the authors seem to ultimately come to this latter conclusion in the Discussion, they clearly state in the Abstract that "IKK repression of RIPK1 is required for survival of peripheral but not thymic gamma delta T cells." Clarification of these conclusions and seeming inconsistencies would greatly strengthen the manuscript. With respect to the actual analysis in Figure 3A, it appears that the authors used a succession of non-parametric t-tests here without any correction. It may be helpful to determine if another analysis, such as ANOVA, may be more appropriate.

      (6) The conclusion that the alternative pathway is redundant for the development and persistence of the major gamma delta T cell subsets is at odds with a previous report demonstrating that Relb is required for gamma delta T17 development (Powolny-Budnicka, I., et al., Immunity 34: 364-374, 2011). This paper also reported the involvement of RelA in gamma delta T17 development. The present manuscript would be greatly improved by the inclusion of a discussion of these results.

      (7) The data in Figures 1C and 3A are somewhat confusing in that while both are from the lymph nodes of IKKdeltaTCD2 mice, the data appear to be quite different (In Figure 3A, the frequency of gamma delta T cells increases and there is a near complete loss of the CD27+ subset. In Figure 1A, the frequency of gamma delta T cells is drastically decreased, and there is only a slight loss of the CD27+ subset.)
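
      The arithmetic behind point (5) above can be sketched in a few lines. This is only an illustration: the cell counts are hypothetical values chosen to reproduce the reviewer's rough ~99% and ~90% figures, and the Holm step-down adjustment is shown as one common way to correct a series of pairwise tests. It is an assumption here, not the ANOVA the reviewer mentions and not the authors' analysis.

```python
import math

def percent_decrease(control, mutant):
    """Percent loss of cells in the mutant relative to the control."""
    return 100.0 * (1.0 - mutant / control)

def holm_adjust(p_values):
    """Holm step-down adjustment for a family of p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical counts, not data from the manuscript.
control = 1_000_000
cre_wt = 10_000    # about a 99% decrease
cre_kd = 100_000   # about a 90% decrease

loss_wt = percent_decrease(control, cre_wt)
loss_kd = percent_decrease(control, cre_kd)
# On a log axis the two mutants sit one full decade apart,
# even though both have lost at least 90% of the control population.
log_gap = math.log10(cre_kd / cre_wt)

# Hypothetical uncorrected p-values from three pairwise tests.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

      A tenfold "rescue" therefore looks large on a log plot while still leaving roughly a tenth of the control population, which is the gap the reviewer is pointing at.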

    1. Have you ever reported a post/comment for violating social media platform rules?

      I have not made a post that has violated a social media platform's rules. I have however tried to create usernames in video games and my proposed names have been disallowed for some reason or another. This can be because a name has some profanity, or that there are words within the username that could spell out something the game doesn't want you to spell.

    1. Reviewer #3 (Public review):

      Summary:

      In this paper, the authors demonstrate the inevitability of the emergence of spatial information in sufficiently complex systems, even those that are only trained on object recognition (i.e. not a "spatial" system). As such, they present an important null hypothesis that should be taken into consideration for experimental design and data analysis of spatial tuning and its relevance for behavior.

      Strengths:

      The paper's strengths include the use of a large multi-layer network trained in a detailed visual environment. This illustrates an important message for the field: that spatial tuning can be a result of sensory processing. While this is a historically recognized and often-studied fact in experimental neuroscience, it is made more concrete with the use of a complex sensory network. Indeed, the manuscript is a cautionary tale for experimentalists and computational researchers alike against blindly applying and interpreting metrics without adequate controls. The addition of the deep network, i.e. the argument that sufficient processing increases the likelihood of such a confound, is a novel and important contribution.

      Weaknesses:

      However, the work has a number of significant weaknesses. Most notably: the spatial tuning that emerges is precisely what we would expect from visually-tuned neurons, and the authors do not engage with literature that controls for these confounds or compare the quality or degree of spatial tuning with neural data; the ability to linearly decode position from a large number of units is not a strong test of spatial cognition; and the authors make strong but unjustified claims as to the implications of their results in opposition to, as opposed to contributing to, work being done in the field.

      The first weakness is that the degree and quality of spatial tuning that emerges in the network is not analyzed to the standards of evidence that have been used in well-controlled studies of spatial tuning in the brain. Specifically, the authors identify place cells, head direction cells, and border cells in their network, and their conjunctive combinations. However, these forms of tuning are the most easily confounded by visual responses, and it's unclear if their results will extend to observed forms of spatial tuning that are not.

      For example, consider the head direction cells in Figure 3C. In addition to increased activity in some directions, these cells also have a high degree of spatial nonuniformity, suggesting they are responding to specific visual features of the environment. In contrast, the majority of HD cells in the brain are only very weakly spatially selective, if at all, once an animal's spatial occupancy is accounted for (Taube et al 1990, JNeurosci). While the preferred orientation of these cells is anchored to prominent visual cues, when they rotate with changing visual cues the entire head direction system rotates together (cells' relative orientation relationships are maintained, including those that encode directions facing AWAY from the moved cue), and thus these responses cannot be simply independent sensory-tuned cells responding to the sensory change (Taube et al 1990 JNeurosci, Zugaro et al 2003 JNeurosci, Ajabi et al 2023).

      As another example, the joint selectivity of detected border cells with head direction in Figure 3D suggests that they are "view of a wall from a specific angle" cells. In contrast, experimental work on border cells in the brain has demonstrated that these are robust to changes in the sensory input from the wall (e.g. van Wijngaarden et al 2020), or that many of them are not directionally selective (Solstad et al 2008).

      The most convincing evidence of "spurious" spatial tuning would be the emergence of HD-independent place cells in the network. However, these cells are a very small minority (in contrast to hippocampal data, Thompson and Best 1984 JNeurosci, Rich et al 2014 Science), and the examples provided in Figure 3 are significantly more weakly tuned than those observed in the brain.

      Indeed, the vast majority of tuned cells in the network are conjunctively selective for HD (Figure 3A). While this conjunctive tuning has been reported, many units in the hippocampus/entorhinal system are not strongly HD selective (Muller et al 1994 JNeurosci, Sargolini et al 2006 Science, Carpenter et al 2023 bioRxiv). Further, many studies have been done to test and understand the nature of sensory influence (e.g. Acharya et al 2016 Cell), and spatially tuned cells tend to have a complex relationship with a variety of sensory cues, which cannot readily be explained by straightforward sensory processing (rev: Poucet et al 2000 Rev Neurosci, Plitt and Giocomo 2021 Nat Neuro). E.g. while some place cells are sometimes reported to be directionally selective, this directional selectivity is dependent on behavioral context (Markus et al 1995, JNeurosci), and emerges over time with familiarity to the environment (Navratilova et al 2012 Front. Neural Circuits). Thus, the question is not whether spatially tuned cells are influenced by sensory information, but whether feed-forward sensory processing alone is sufficient to account for their observed tuning properties and responses to sensory manipulations.

      These issues indicate a more significant underlying issue of scientific methodology relating to the interpretation of their result and its impact on neuroscientific research. Specifically, in order to make strong claims about experimental data, it is not enough to show that a control (i.e. a null hypothesis) exists, one needs to demonstrate that experimental observations are quantitatively no better than that control.

      Where the authors state that "In summary, complex networks that are not spatial systems, coupled with environmental input, appear sufficient to decode spatial information." what they have really shown is that it is possible to decode some degree of spatial information. This is a null hypothesis (that observations of spatial tuning do not reflect a "spatial system"), and the comparison must be made to experimental data to test if the so-called "spatial" networks in the brain have more cells with more reliable spatial info than a complex-visual control.

      Further, the authors state that "Consistent with our view, we found no clear relationship between cell type distribution and spatial information in each layer. This raises the possibility that "spatial cells" do not play a pivotal role in spatial tasks as is broadly assumed." Indeed, this would raise such a possibility, if 1) the observations of their network were indeed quantitatively similar to the brain, and 2) the presence of these cells in the brain were the only evidence for their role in spatial tasks. However, 1) the authors have not shown this result in neural data, they've only noticed it in a network and mentioned the POSSIBILITY of a similar thing in the brain, and 2) the "assumption" of the role of spatially tuned cells in spatial tasks is not just from the observation of a few spatially tuned cells, but from many other experiments including causal manipulations (e.g. Robinson et al 2020 Cell, de Lavilléon et al 2015 Nat Neuro), which the authors conveniently ignore. Thus, I do not find their argument, as strongly stated as it is, to be well-supported.

      An additional weakness is that linear decoding of position is not a measure of spatial cognition. The ability to decode position from a large number of weakly tuned cells is not surprising. However, based on this ability to decode, the authors claim that "'spatial' cells do not play a privileged role in spatial cognition". To justify this claim, the authors would need to use the network to perform e.g. spatial navigation tasks, then investigate the networks' ability to perform these tasks when tuned cells were lesioned.
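
      The claim that position can be read out from many weakly tuned cells can be illustrated with a toy simulation. The sketch below is not the authors' network or analysis: the unit model (a small `tanh` dependence on position buried in unit-variance noise), the `tuning_gain` and `ridge` values, and the function name `decoding_error` are all assumptions made for illustration.

```python
import numpy as np

def decoding_error(n_units, n_samples=4000, tuning_gain=0.2, ridge=1.0, seed=0):
    """Mean held-out position error of a ridge-regularised linear decoder."""
    rng = np.random.default_rng(seed)
    # Random (x, y) positions in a square arena spanning [-1, 1].
    positions = rng.uniform(-1.0, 1.0, size=(n_samples, 2))
    # Each unit depends only weakly on position; noise dominates its activity.
    weights = rng.normal(size=(2, n_units))
    signal = tuning_gain * np.tanh(positions @ weights)
    activity = signal + rng.normal(size=signal.shape)
    # Add a constant column so the decoder can fit an offset.
    X = np.hstack([activity, np.ones((n_samples, 1))])
    half = n_samples // 2
    X_train, y_train = X[:half], positions[:half]
    X_test, y_test = X[half:], positions[half:]
    # Closed-form ridge regression: (X'X + ridge * I)^-1 X'y.
    gram = X_train.T @ X_train + ridge * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, X_train.T @ y_train)
    errors = np.linalg.norm(X_test @ coef - y_test, axis=1)
    return float(errors.mean())

few_units_error = decoding_error(n_units=10)
many_units_error = decoding_error(n_units=500)
```

      In this toy setting the held-out error shrinks as units are added even though no single unit is well tuned, which is why decodability alone says little about whether individual cells matter for a spatial task.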

      Finally, I find a major weakness of the paper to be the framing of the results in opposition to, as opposed to contributing to, the study of spatially tuned cells. For example, the authors state that "If a perception system devoid of a spatial component demonstrates classically spatially-tuned unit representations, such as place, head-direction, and border cells, can "spatial cells" truly be regarded as 'spatial'?" Setting aside the issue of whether the perception system in question does indeed demonstrate spatially-tuned unit representations comparable to those in the brain, I ask "Why not?" This seems to be a semantic game of reading more into a name than is necessarily there. The names (place cells, grid cells, border cells, etc) describe an observation (that cells are observed to fire in certain areas of an animal's environment). They need not be a mechanistic claim (that space "causes" these cells to fire) or even, necessarily, a normative one (these cells are "for" spatial computation). This is evidenced by the fact that even within e.g. the place cell community, there is debate as to these cells' mechanisms and function (e.g. memory, navigation, etc), or if they can even be said to serve only a single function. However, they are still referred to as place cells, not as a statement of their function but as a history-dependent label that refers to their observed correlates with experimental variables. Thus, the observation that spatially tuned cells are "inevitable derivatives of any complex system" is itself an interesting finding which contributes to, rather than contradicts, the study of these cells. It seems that the authors have a specific definition in mind when they say that a cell is "truly" "spatial" or that a biological or artificial neural network is a "spatial system", but this definition is not stated, and it is not clear that the terminology used in the field presupposes their definition.

      In sum, the authors have demonstrated the existence of a control/null hypothesis for observations of spatially-tuned cells. However, 1) it is not enough to show that a control (null hypothesis) exists; one needs to test whether experimental observations are no better than control in order to make strong claims about experimental data, 2) the authors do not acknowledge the work that has been done in many cases specifically to control for this null hypothesis in experimental work or to test the sensory influences on these cells, and 3) the authors do not rigorously test the degree or source of spatial tuning of their units.

      Comments on revisions:

      While I'm happy to admit that standards of spatial tuning are not unified or consistent across the field, I do not believe the authors have addressed my primary concern: they have pointed out a null model, and then have constructed a strong opinion around that null model without actually testing if it's sufficient to account for neural data. I've slightly modified my review to that effect.

      I do think it would be good for the authors to state in the manuscript what they mean when they say that a cell is "truly" "spatial" or that a biological or artificial neural network is a "spatial system". This is implied throughout, but I was unable to find what would distinguish a "truly" spatial system from a "superfluous" one.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      but see Franzius, Sprekeler, Wiskott, PLoS Computational Biology, 2007

      We have discussed the differences with this work in the response to Editor recommendations above.

      While the findings reported here are interesting, it is unclear whether they are the consequence of the specific model setting, and how well they would generalize.

      We have considered deep vision models across different architectures in our paper, which include traditional feedforward convolutional neural networks (VGG-16), convolutional neural networks with skip connections (ResNet-50) and the Vision Transformer (ViT) which employs self-attention instead of convolution as its core information processing unit.

      In particular, examining the pictures shown in Fig. 1A, it seems that local walls of the ’box’ contain strong oriented features that are distinct across different views. Perhaps the response of oriented visual filters can leverage these features to uniquely determine the spatial variable. This is concerning because this is a very specific setting that is unlikely to generalize.

      The experimental setup is based on experimental studies of spatial cognition in rodents. They are typically foraging in square or circular environments. Indeed, square environments will have more borders and corners that will provide information about the spatial environment, which is true in both empirical studies and our simulations. In any navigation task, and especially more realistic environments, visual information such as borders or landmarks likely plays a major role in spatial information available to the agent. In fact, studies that do not consider sensory information to contribute to spatial information are likely missing a major part of how animals navigate.

      The prediction would be that place cells/head direction cells should go away in darkness. This implies that key aspects of functional cell types in the spatial cognition are missing in the current modeling framework.

      We addressed this comment in our response to the editor’s highlight. To briefly recap, we do not intend to propose a comprehensive model of the brain that captures all spatial phenomena, as we would not expect this from an object recognition network. Instead, we show that such a simple and nonspatial model can reproduce key signatures of spatial cells, raising important questions about how we interpret spatial cell types that dominate current research.

      Reviewer #2 (Public Review):

      The network used in the paper is still guided by a spatial error signal [...] one could say that the authors are in some way hacking this architecture and turning it into a spatial navigation one through learning.

      To be clear, the base networks we use do not undergo spatial error training. They have either been pre-trained on image classification tasks or are untrained. We used a standard neuroscience approach: training linear decoders on representations to assess the spatial information present in the network layers. The higher decoding errors in early layer representations (Fig. 2A) indicate that spatial information differs across layers—an effect that cannot be attributed to the linear decoder alone.

      My question is whether the paper is fighting an already won battle.

      Intuitive cell type discoveries are still being celebrated. Concentrating on this kind of cell type discovery has broader implications that could be deleterious to the future of science. One point to note is that this issue depends on the area or subfield of neuroscience. In some subfields, papers that claim to find cell types with a strong claim of specific functions are relatively rare, and population coding is common (e.g., cognitive control in primate prefrontal cortex, neural dynamics of motor control). Although rodent neuroscience as a field is increasingly adopting population approaches, influential researchers and labs are still publishing “cell types” and in top journals (here are a few from 2017-2024: Goal cells (Sarel et al., 2017), Object-vector cells (Høydal et al., 2019), 3D place cells (Grieves et al., 2020), Lap cells (Sun et al., 2020), Goal-vector cells (Ormond and O’Keefe, 2022), Predictive grid cells (Ouchi and Fujisawa, 2024)).

      In some cases, identification of cell types is only considered a part of the story, and there are analyses on behavior, neural populations, and inactivation-based studies. However, our view (and we suggest this is shared amongst most researchers) is that a major reason these papers are reviewed and accepted to top journals is because they have a simple, intuitive “cell type” discovery headline, even if it is not the key finding or analysis that supports the insightful aspects of the work. This is unnecessary and misleading to students of neuroscience, related fields, and the public; it affects private and public funding priorities and in turn the future of science. Worse, it could lead the field down the wrong path, or at the least divert attention and resources away from methods and papers that could be providing deeper insights. Consistent with the central message of our work, we believe the field should prioritize theoretical and functional insights over the discovery of new “cell types”.

      Reviewer #3 (Public Review):

      The ability to linearly decode position from a large number of units is not a strong test of spatial information, nor is it a measure of spatial cognition

      Using a linear decoder to test what information is contained in a population of neurons available for downstream areas is a common technique in neuroscience (Tong and Pratte, 2012; DiCarlo et al., 2012) including spatial cells (e.g., Diehl et al. 2017; Horrocks et al. 2024). A linear decoder is used because it is a direct mapping from neurons to potential output behavior. In other words, it only needs to learn some mapping to link one set of neurons to another set which can “read out” the information. As such, it is a measure of the information contained in the population, and it is a lower bound of the information contained - as both biological and artificial neurons can do more complex nonlinear operations (as the activation function is nonlinear).

      We understand the reviewer may understand this concept but we explain it here to justify our position and for completeness of this public review.

      For example, consider the head direction cells in Figure 3C. In addition to increased activity in some directions, these cells also have a high degree of spatial nonuniformity, suggesting they are responding to specific visual features of the environment. In contrast, the majority of HD cells in the brain are only very weakly spatially selective, if at all, once an animal’s spatial occupancy is accounted for (Taube et al 1990, JNeurosci). While the preferred orientation of these cells is anchored to prominent visual cues, when they rotate with changing visual cues the entire head direction system rotates together (cells’ relative orientation relationships are maintained, including those that encode directions facing AWAY from the moved cue), and thus these responses cannot be simply independent sensory-tuned cells responding to the sensory change (Taube et al 1990 JNeurosci, Zugaro et al 2003 JNeurosci, Ajabi et al 2023).

      As we have noted in our response to the editor, one of the main issues is that the criteria used to assess the property of interest are created in a subjective, biased, and circular fashion (seeing spatial-like responses, developing criteria to determine a spatial response, selecting a threshold).

      All the examples the reviewer provides concentrate on strict criteria developed after finding such cells. What is the purpose of these cells for function, for behavior? Just finding a cell that looks like it is tuned to something does not explain its function. Neuroscience began with tuning curves in part due to methodological constraints, which was a promising start, but we propose that this is not the way forward.

      The metrics used by the authors to quantify place cell tuning are not clearly defined in the methods, but do not seem to be as stringent as those commonly used in real data. (e.g. spatial information, Skaggs et al 1992 NeurIPS).

      We identified place cells following the definition from Tanni et al. (2022), by one of the leading labs in the field. Since neurons in DNNs lack spikes, we adapted their criteria by focusing on the number of spatial bins in the ratemap rather than spike-based measures. However, our central argument is that the very act of defining spatial cells is problematic. Researchers set out to find place cells to study spatial representations, find spatially selective cells with subjective, qualitative criteria (sometimes combined with prior quantitative criteria, also subjectively defined), then try to fine-tune the criteria to more “stringent” criteria, depending on the experimental data at hand. It is not uncommon to see methodological sections that use qualitative judgments, such as: “To avoid bias ... we applied a loose criteria for place cells” (Tanaka et al., 2018), which reflects the lack of clarity in, and the subjectivity of, place cell selection criteria.

      A simple literature survey reveals inconsistent criteria across studies. For place field selection, Dombeck et al. (2010) required mean firing rates exceeding 25% of peak rate, while Tanaka et al. (2018) used a 20% threshold. Speed thresholds also vary dramatically: Dombeck et al. (2010) calculated firing rates only when mice moved faster than 8.3 cm/s, whereas Tanaka et al. (2018) used 2 cm/s. Additional criteria differ further: Tanaka et al. (2018) required firing rates between 1-10 Hz and excluded cells with place fields larger than 1/3 of the area, while Dombeck et al. (2010) selected fields above 1.5 Hz, and Tanni et al. (2022) used a threshold of 10 spatial bins to 1/2 of the area. As Dombeck et al. (2010) noted, differences in recording methods and place field definitions lead to varying numbers of identified place cells. Moreover, Grijseels et al. (2021) demonstrated that different detection methods produce vastly different place cell counts with minimal overlap between identified populations.
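
      To make the sensitivity to thresholds concrete, the sketch below applies one toy selection rule at two peak-rate fractions to the same set of ratemaps and reports how many units pass each version and how much the selected populations overlap. The 25% and 20% fractions echo the values quoted above; the selection rule itself, the 1/3 area bound as used here, the random ratemaps, and all function names are simplifying assumptions and do not reproduce any of the cited pipelines.

```python
import numpy as np

def field_bins(ratemap, peak_fraction):
    """Boolean mask of bins whose rate exceeds a fraction of the map's peak."""
    return ratemap > peak_fraction * ratemap.max()

def is_place_like(ratemap, peak_fraction, max_area_fraction=1 / 3):
    """Toy rule: the above-threshold region may cover only a limited share of the arena."""
    return bool(field_bins(ratemap, peak_fraction).mean() <= max_area_fraction)

# Hypothetical ratemaps: 200 units on a 17 x 17 grid of spatial bins.
rng = np.random.default_rng(1)
ratemaps = rng.gamma(shape=2.0, scale=1.0, size=(200, 17, 17))

at_25 = np.array([is_place_like(m, 0.25) for m in ratemaps])
at_20 = np.array([is_place_like(m, 0.20) for m in ratemaps])

n_both = int(np.logical_and(at_25, at_20).sum())
n_either = int(np.logical_or(at_25, at_20).sum())
overlap = n_both / n_either if n_either else float("nan")
print(f"selected at 25%: {at_25.sum()}, at 20%: {at_20.sum()}, overlap: {overlap:.2f}")
```

      Even with a single rule and identical data, moving the threshold changes which units are labelled; combining different rules, speed filters, and rate bounds compounds the divergence.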

      This reflects a deeper issue. Unlike structurally and genetically defined cell types (e.g., pyramidal neurons, interneurons, dopaminergic neurons, cFos expressing neurons), spatial cells lack such clarity in terms of structural or functional specialization and it is unclear whether such “cell types” should be considered cell types in the same way. While scientific progress requires standardized definitions, the question remains whether defining spatial cells through myriad different criteria advances our understanding of spatial cognition. Are researchers finding the same cells? Could they be targeting different populations? Are they missing cells crucial for spatial cognition that they exclude due to the criteria used? We think this is likely. The inconsistency matters because different criteria may capture genuinely different neural populations or computational processes.

      Variability in definitions and criteria is an issue in any field. However, as we have stated, the deeper issue is whether we should be defining and selecting these cells at all before commencing analysis. By defining and restricting to spatial “cell types”, we risk comparing fundamentally different phenomena across studies, and worse, missing the fundamental unit of spatial cognition (e.g., the population).

      We have added a paragraph in Discussion (lines 357-366) noting the inconsistency in place cell selection criteria in the literature and the consequences of using varying criteria.

      We have also added a sentence (lines 354-356) raising the comparison of functionally defined spatial cell types with structurally and genetically defined cell types in the Discussion.

      Thus, the question is not whether spatially tuned cells are influenced by sensory information, but whether feed-forward sensory processing alone is sufficient to account for their observed tuning properties and responses to sensory manipulations.

      These issues indicate a more significant underlying issue of scientific methodology relating to the interpretation of their result and its impact on neuroscientific research. Specifically, in order to make strong claims about experimental data, it is not enough to show that a control (i.e. a null hypothesis) exists, one needs to demonstrate that experimental observations are quantitatively no better than that control.

      Where the authors state that ”In summary, complex networks that are not spatial systems, coupled with environmental input, appear sufficient to decode spatial information.” what they have really shown is that it is possible to decode *some degree* of spatial information. This is a null hypothesis (that observations of spatial tuning do not reflect a ”spatial system”), and the comparison must be made to experimental data to test if the so-called ”spatial” networks in the brain have more cells with more reliable spatial info than a complex-visual control.

      We agree that good null hypotheses with quantitative comparisons are important. However, it is not clear that researchers in the field have been using a null hypothesis; rather, they make the assumption that these cell types exist and are functional in the way they assume. We provide one null hypothesis. The field can and should develop more and stronger null hypotheses.

      In our work, we mainly focus on the criteria for finding spatial cells, and argue that simply doing this is misleading. Researchers develop criteria and find such cells, but often do not go further to assess whether they are real cell “types”, and excluding other cells can be misleading if those cells also play a role in the function of interest.

      But from many other experiments including causal manipulations (e.g. Robinson et al 2020 Cell, DeLauilleon et al 2015 Nat Neuro), which the authors conveniently ignore. Thus, I do not find their argument, as strongly stated as it is, to be well-supported.

      We acknowledge that there are several studies that have performed inactivation studies that suggest a strong role for place cells in spatial behavior. Most studies do not conduct comprehensive analyses to confirm that their place cells are in fact crucial for the behavior at hand.

      One question is how the criteria were determined. Did the researchers make their criteria based on what “worked”, so they did not exclude cells relevant to the behavior? What if their criteria were different, then the argument could have been that non-place cells also contribute to behavior.

      Another question is whether these cells are the same kinds of cells across studies and animals, given the varied criteria across studies. As most studies do not follow the same procedures, it is unclear whether we can generalize these results across cells and indeed, across tasks and spatial environments.

      Finally, does the fact that the place cells – the strongly selective cells with a place field – have a strong role in navigation provide any insight into the mechanism? Identifying cells by itself does not contribute to our understanding of how they work. Consistent with our main message, we argue that performing analyses and building computational models that uncover how the function of interest works is more valuable than simply naming cells.

      Finally, I find a major weakness of the paper to be the framing of the results in opposition to, as opposed to contributing to, the study of spatially tuned cells. For example, the authors state that ”If a perception system devoid of a spatial component demonstrates classically spatially-tuned unit representations, such as place, head-direction, and border cells, can ”spatial cells” truly be regarded as ’spatial’?” Setting aside the issue of whether the perception system in question does indeed demonstrate spatially-tuned unit representations comparable to those in the brain, I ask ”Why not?” This seems to be a semantic game of reading more into a name than is necessarily there. The names (place cells, grid cells, border cells, etc) describe an observation (that cells are observed to fire in certain areas of an animal’s environment). They need not be a mechanistic claim... This is evidenced by the fact that even within e.g. the place cell community, there is debate about these cells’ mechanisms and function (eg memory, navigation, etc), or if they can even be said to serve only a single function. However, they are still referred to as place cells, not as a statement of their function but as a history-dependent label that refers to their observed correlates with experimental variables. Thus, the observation that spatially tuned cells are ”inevitable derivatives of any complex system” is itself an interesting finding which *contributes to*, rather than contradicts, the study of these cells. It seems that the authors have a specific definition in mind when they say that a cell is ”truly” ”spatial” or that a biological or artificial neural network is a ”spatial system”, but this definition is not stated, and it is not clear that the terminology used in the field presupposes their definition.

      We have to agree to disagree with the reviewer on this point. Although researchers may reflect on their work and discuss what the mechanistic roles of these cells are, cell type discovery is widely perceived as important by journals and funders due to its intuitive appeal and easy-to-understand impact, even if there is no finding of interest to be reported. As noted in the comment above, papers claiming cell type discovery continue to be published in top journals and continue to be funded.

      Our argument is that “cell type” discovery research should perhaps not be celebrated in the way it is, and that such cells should not be presented as discovered cell types when they are not genuine cell types in the way structural or genetic cell types are. Using this term makes them appear to be something they are not, which is misleading. They may be important cells, but giving them a name like “place” cell also suggests that other cells are not encoding space, which is very unlikely to be true.

      In sum, our view is that finding and naming cells through a flawed theoretical lens, when those cells may not actually function as their names suggest, can lead us down the wrong path and be detrimental to science.

      Reviewer #1 (Recommendations For The Authors):

      The novelty of the current study relative to the work by Franzius, Sprekeler, Wiskott (PLoS Computational Biology, 2007) needs to be carefully addressed. That study also modeled the spatial correlates based on visual inputs.

      Our work differs from Franzius et al. (2007) on both theoretical and experimental fronts. While both studies challenge the mechanisms underlying spatial cell formation, our theoretical contributions diverge. Franzius et al. (2007) assume spatial cells are inherently important for spatial cognition and propose a sensory-driven computational mechanism as an alternative to mainstream path integration frameworks for how spatial cells arise and support spatial cognition. In contrast, we challenge the notion that spatial cells are special at all. Using a model with no spatial grounding, we demonstrate that spatial cells 1) naturally emerge from complex non-linear processing and 2) are not particularly useful for spatial decoding tasks, suggesting they are not crucial for spatial cognition.

      Our approach employs null models with fixed weights—either pretrained on classification tasks or entirely random—that process visual information non-sequentially. These models serve as general-purpose information processors without spatial grounding. In contrast, Franzius et al. (2007)’s model learns directly from environmental visual information, and the emergence of spatial cells (place or head-direction cells) in their framework depends on input statistics, such as rotation and translation speeds. Notably, their model does not simultaneously generate both place and head-direction cells; the outcome varies with the relative speed of rotation versus translation. Their sensory-driven model indirectly incorporates motion information through learning, exhibiting a time-dependence influenced by slow-feature analysis.

      Conversely, our model simultaneously produces units with place and head-direction cell profiles by processing visual inputs sampled randomly across locations and angles, independent of temporal or motion-related factors. This positions our model as a more general and fundamental null hypothesis, ideal for challenging prevailing theories on spatial cells due to its complete lack of spatial or motion grounding.

      Finally, unlike Franzius et al. (2007), who do not evaluate the functional utility of their spatial representations, we test whether the emergent spatial cells are useful for spatial decoding. We find that not only do spatial cells emerge in our non-spatial model, but they also fail to significantly aid in location or head-direction decoding. This is the central contribution of our work: spatial cells can arise without spatial or sensory grounding, and their functional relevance is limited. We have updated the manuscript to clarify the novelty of the current contribution to previous work (lines 324-335).

      In Fig. 2, it may be useful to plot the error in absolute units, rather than the normalized error. The direction decoding can be quantified in terms of degrees. Also, it would be helpful to compare the accuracy of spatial localization to that of the actual place cells in rodents.

      We argue that it makes more sense, and puts the comparison in perspective, to normalize the error by dividing by the maximal error possible under each task. For transparency, we plot the errors in the absolute physical units used by the Unity game engine in the updated Appendix (Fig. 1).
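
      The normalization can be sketched as follows. This is a minimal illustration under stated assumptions: the maximal location error is taken to be the diagonal of a rectangular arena and the maximal direction error to be 180 degrees. Both choices, and the function names, are hypothetical and are not taken from the manuscript.

```python
import math

def normalized_location_error(predicted, actual, width, height):
    """Euclidean error divided by the arena diagonal (assumed maximal error)."""
    max_error = math.hypot(width, height)
    return math.dist(predicted, actual) / max_error

def angular_error_deg(predicted_deg, actual_deg):
    """Smallest absolute difference between two headings, in degrees (0 to 180)."""
    diff = abs(predicted_deg - actual_deg) % 360.0
    return min(diff, 360.0 - diff)

def normalized_direction_error(predicted_deg, actual_deg):
    """Angular error divided by 180 degrees (assumed maximal error)."""
    return angular_error_deg(predicted_deg, actual_deg) / 180.0

print(normalized_location_error((0.0, 0.0), (3.0, 4.0), width=3.0, height=4.0))
print(angular_error_deg(350.0, 10.0))
```

      Dividing by the task-specific maximum puts location and direction errors on a common 0 to 1 scale, which is what allows them to be compared across tasks.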

      Reviewer #2 (Recommendations For The Authors):

      Regarding the involvement of ’classified cells’ in decoding, I think a useful way to present the results would be to show the relationship between ’placeness’, ’directioness’ and ’borderness’ and the strength of the decoder weights. Either as a correlation or as a full scatter plot.

      We appreciate your suggestion to visualize the relationship between units’ spatial properties and their corresponding decoder weights. We believe it would be an important addition to our existing results. Based on the exclusion analyses, we anticipated the correlation to be low, and the additional results support this expectation.

      As an example, we present unit plots below for VGG-16 (pre-trained and untrained, at its penultimate layer with a sampling rate of 0.3; Author response image 1 and 2). Additional plots for various layers and across models are included in the supplementary materials (Fig. S12-S28). Consistently across conditions, we observed no significant correlations between units’ spatial properties (e.g., placeness) and their decoding weight strengths. These results further corroborate the conclusions drawn from our exclusion analyses.
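
      A comparison of this kind can be sketched in a few lines. In the sketch below, the per-unit weight strength is assumed to be the L2 norm of a unit's decoder weights across output dimensions, and the relationship is summarized with a Pearson correlation; both choices, the random inputs, and the function names are illustrative assumptions rather than the exact procedure behind the figures.

```python
import numpy as np

def weight_strength(decoder_weights):
    """Per-unit strength: L2 norm across outputs; weights have shape (n_units, n_outputs)."""
    return np.linalg.norm(np.asarray(decoder_weights, dtype=float), axis=1)

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical inputs: a spatial score per unit and a fitted decoder weight matrix.
rng = np.random.default_rng(2)
placeness = rng.uniform(0.0, 1.0, size=300)
decoder_weights = rng.normal(size=(300, 2))

r = pearson_r(placeness, weight_strength(decoder_weights))
print(f"correlation between placeness and weight strength: {r:.3f}")
```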

      Reviewer #3 (Recommendations For The Authors):

      My main suggestions are that the authors: -perform manipulations to the sensory environment similar to those done in experimental work, and report if their tuned cells respond in similar ways -quantitatively compare the degree of spatial tuning in their networks to that seen in publicly available data -re-frame the discussion of their results to critically engage with and contribute to the field and its past work on sensory influences to these cells

      As we noted in our opening section, our model is not intended as a model of the brain. It is a non-spatial null model, and we present the surprising finding that even such a model contains spatial cell-like units if identified using criteria typically used in the field. This raises the question whether simply finding cells that show spatial properties is sufficient to grant the special status of “cell type” that is involved in the brain function of interest.

      Author response image 1.

      VGG-16 (pre-trained), penultimate layer units, show no apparent relationship between spatial properties and their decoder weight strengths.

      Author response image 2.

      VGG-16 (untrained), penultimate layer units, show no apparent relationship between spatial properties and their decoder weight strengths.

      Furthermore, our main simulations were designed to be compared to experimental work where rodents foraged around square environments in the lab. We did not do an extensive set of simulations as the purpose of our study is not to show that we capture exactly every single experimental finding, but rather raise the issues with the functional cell type definition and identification approach for progressing neuroscientific knowledge.

      Finally, as we note in more detail below, different labs use different criteria for identifying spatial cells, which depend both on the lab and the experimental design. Our point is that we can identify such cells using criteria set by neuroscientists, and that such cell types may not reflect any special status in spatial processing. Additional simulations that show less alignment with certain datasets will not provide support for or against our general message.

      References

      Banino A, Barry C, Uria B, Blundell C, Lillicrap T, Mirowski P, Pritzel A, Chadwick MJ, Degris T, Modayil J, Wayne G, Soyer H, Viola F, Zhang B, Goroshin R, Rabinowitz N, Pascanu R, Beattie C, Petersen S, Sadik A, Gaffney S, King H, Kavukcuoglu K, Hassabis D, Hadsell R, Kumaran D (2018) Vector-based navigation using grid-like representations in artificial agents. Nature 557(7705):429–433, DOI 10.1038/s41586-018-0102-6, URL http://www.nature.com/articles/s41586-018-0102-6

      DiCarlo JJ, Zoccolan D, Rust NC (2012) How Does the Brain Solve Visual Object Recognition? Neuron 73(3):415–434, DOI 10.1016/J.NEURON.2012.01.010, URL https://www.cell.com/neuron/fulltext/S0896-6273(12)00092-X

      Diehl GW, Hon OJ, Leutgeb S, Leutgeb JK (2017) Grid and Nongrid Cells in Medial Entorhinal Cortex Represent Spatial Location and Environmental Features with Complementary Coding Schemes. Neuron 94(1):83–92.e6, DOI 10.1016/j.neuron.2017.03.004, URL https://linkinghub.elsevier.com/retrieve/pii/S0896627317301873

      Dombeck DA, Harvey CD, Tian L, Looger LL, Tank DW (2010) Functional imaging of hippocampal place cells at cellular resolution during virtual navigation. Nature Neuroscience 13(11):1433–1440, DOI 10.1038/nn.2648, URL https://www.nature.com/articles/nn.2648

      Ebitz RB, Hayden BY (2021) The population doctrine in cognitive neuroscience. Neuron 109(19):3055–3068, DOI 10.1016/j.neuron.2021.07.011, URL https://linkinghub.elsevier.com/retrieve/pii/S0896627321005213

      Grieves RM, Jedidi-Ayoub S, Mishchanchuk K, Liu A, Renaudineau S, Jeffery KJ (2020) The place-cell representation of volumetric space in rats. Nature Communications 11(1):789, DOI 10.1038/s41467-020-14611-7, URL https://www.nature.com/articles/s41467-020-14611-7

      Grijseels DM, Shaw K, Barry C, Hall CN (2021) Choice of method of place cell classification determines the population of cells identified. PLOS Computational Biology 17(7):e1008835, DOI 10.1371/journal.pcbi.1008835, URL https://dx.plos.org/10.1371/journal.pcbi.1008835

      Horrocks EAB, Rodrigues FR, Saleem AB (2024) Flexible neural population dynamics govern the speed and stability of sensory encoding in mouse visual cortex. Nature Communications 15(1):6415, DOI 10.1038/s41467-024-50563-y, URL https://www.nature.com/articles/s41467-024-50563-y

      Høydal ØA, Skytøen ER, Andersson SO, Moser MB, Moser EI (2019) Object-vector coding in the medial entorhinal cortex. Nature 568(7752):400–404, DOI 10.1038/s41586-019-1077-7, URL https://www.nature.com/articles/s41586-019-1077-7

      Ormond J, O’Keefe J (2022) Hippocampal place cells have goal-oriented vector fields during navigation. Nature 607(7920):741–746, DOI 10.1038/s41586-022-04913-9, URL https://www.nature.com/articles/s41586-022-04913-9

      Ouchi A, Fujisawa S (2024) Predictive grid coding in the medial entorhinal cortex. Science 385(6710):776–784, DOI 10.1126/science.ado4166, URL https://www.science.org/doi/10.1126/science.ado4166

      Sarel A, Finkelstein A, Las L, Ulanovsky N (2017) Vectorial representation of spatial goals in the hippocampus of bats. Science 355(6321):176–180, DOI 10.1126/science.aak9589, URL https://www.science.org/doi/10.1126/science.aak9589

      Sun C, Yang W, Martin J, Tonegawa S (2020) Hippocampal neurons represent events as transferable units of experience. Nature Neuroscience 23(5):651–663, DOI 10.1038/s41593-020-0614-x, URL https://www.nature.com/articles/s41593-020-0614-x

      Tanaka KZ, He H, Tomar A, Niisato K, Huang AJY, McHugh TJ (2018) The hippocampal engram maps experience but not place. Science 361(6400):392–397, DOI 10.1126/science.aat5397, URL https://www.science.org/doi/10.1126/science.aat5397

      Tanni S, De Cothi W, Barry C (2022) State transitions in the statistically stable place cell population correspond to rate of perceptual change. Current Biology 32(16):3505–3514.e7, DOI 10.1016/j.cub.2022.06.046, URL https://linkinghub.elsevier.com/retrieve/pii/S0960982222010089

      Tong F, Pratte MS (2012) Decoding Patterns of Human Brain Activity. Annual Review of Psychology 63(1):483–509, DOI 10.1146/annurev-psych-120710-100412, URL https://www.annualreviews.org/doi/10.1146/annurev-psych-120710-100412

    1. 3. This isn't the Whose Life Sucks More game. You have seen moments I can never imagine.

      Absence of a disability hierarchy -- disability often put under one framework, but really that framework only holds true in some ways, disability awareness is often seen as only one thing, but there's a lot under that umbrella term

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer 1:

      Strengths:

      The innovation on the task alone is likely to be impactful for the field, extending recent continuous report (CPR) tasks to examine other aspects of perceptual decision-making and allowing more naturalistic readouts. One interesting and novel finding is the observation of dyadic convergence of confidence estimates even when the partner is incidental to the task performance, and that dyads tend to be more risk-seeking (indicating greater confidence) than when playing solo. The paper is well-written and clear.

      We thank reviewer 1 for this encouraging evaluation. Below we address the identified weaknesses and recommendations.

      (1) Do we measure metacognitive confidence?

      One concern with the novel task is whether confidence is disambiguated from a tracking of stimulus strength or coherence. […] But in the context of an RDK task, one simple strategy here is to map eccentricity directly to (subjective) motion coherence - such that the joystick position at any moment in time is a vector with motion direction and strength. This would still be an interesting task - but could be solved without invoking metacognition or the need to estimate confidence in one's motion direction decision. […] what the subjects might be doing is tracking two features of the world - motion strength and direction. This possibility needs to be ruled out if the authors want to claim a mapping between eccentricity and decision confidence […].”

      We thank reviewer 1 for pointing out that the joystick tilt responses of our subjects could potentially be driven by stimulus coherence instead of metacognitive decision confidence. Below, we present four arguments to address this point of concern:

      (1.1) Similar physical coherence between high and low confidence states

      Nominal motion coherence is a discrete value, but the random noisiness in the stimulus causes the actual frame-by-frame coherence to be distributed around this nominal value. Because of this, subjects might scale their joystick tilt report according to the coherence fluctuations around the nominal value. To check if this was the case, we use a median split to separate stimulus states into states with large versus small joystick tilt, individually for each nominal coherence. For each stimulus state, we extracted the actual instantaneous (frame-to-frame) motion coherence, which is based on the individual movements of dots in the stimulus patch between two frames, recorded in our data files.

      First, we compared the motion coherence between stimulus states with large versus small joystick tilt. For each stimulus state, we calculated the average instantaneous motion coherence, and analyzed the difference of the medians for the large versus small tilt distributions for each subject and each coherence level. The resulting histograms show the distribution of differences across all 38 subjects for each nominal coherence, and are, except for the coherence of 22%, not significantly different from zero across subjects (Author response image 1). For the 22% coherence condition, the difference amounts to 0.19%, a very small, non-perceptible difference. Thus, we do not find systematic differences between the average motion coherence in states with high versus low joystick tilt.
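
      The median-split comparison for one subject and one nominal coherence can be sketched as below. The input arrays (one average tilt and one average instantaneous coherence per stimulus state) and the function name are hypothetical, and assigning states at exactly the median to the small-tilt group is an assumption of this sketch.

```python
import numpy as np

def median_split_difference(tilt, coherence):
    """Median coherence of large-tilt states minus that of small-tilt states."""
    tilt = np.asarray(tilt, dtype=float)
    coherence = np.asarray(coherence, dtype=float)
    large = tilt > np.median(tilt)  # states at the median fall in the small-tilt group
    return float(np.median(coherence[large]) - np.median(coherence[~large]))

# Hypothetical per-state averages for a single subject at one nominal coherence.
tilt = [0.1, 0.2, 0.8, 0.9]
coherence = [20.0, 22.0, 24.0, 26.0]
print(median_split_difference(tilt, coherence))
```

      A positive value would indicate higher average coherence in the large-tilt states; collecting this difference across subjects gives a distribution that can be tested against zero.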

      Author response image 1.

      Histograms of within-subject difference between medians of average coherence distributions with large and small joystick tilt for all subjects. Coherence is color-coded (cyan – 0%, magenta – 98%). On top, the title of each panel illustrates the number of significant differences (Ranksum test in each subject) without correction for multiple comparisons (see Author response table 1 below). In the second row of the title, we show the result of the population t-test against zero. Only 22% coherence shows a significant bias. Positive values indicate higher average coherence for large joystick tilt.  

      Author response table 1.

      List of all individual significantly different coherence distributions between high and low tilt states, without correction for multiple comparisons. Median differences do not show a consistent bias (i.e. positive values) that would indicate higher average coherence for the large tilts.

      (1.2) Short-term stimulus fluctuations have no effect

      […] But to fully characterise the task behaviour it also seems important to ask how and whether fluctuations in motion energy (assuming that the RDK frames were recorded) during a steady state phase are affecting continuous reporting of direction and eccentricity, prior to asking how social information is incorporated into subjects' behaviour.

      In addition to the analysis of stimulus coherence and tilt averaged across each stimulus state (1.1), we analyzed the moment-to-moment relationship between instantaneous coherence and ongoing reports of accuracy and tilt. Below, we provide evidence that short-term fluctuations in the instantaneous coherence (i.e. the motion energy of the stimulus) do not result in correlated changes in joystick responses, neither for tilt nor accuracy. For each continuous stimulus state, we calculated cross-correlation functions between the instantaneous coherence, tilt and accuracy, and then averaged the cross-correlation across all states of the same nominal coherence, and then across subjects. The resulting average cross-correlation functions are essentially flat. This further supports our interpretation that the joystick reports do not reflect short-term fluctuations of motion energy.
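
      For a single stimulus state, a normalized cross-correlation function of this kind can be sketched as follows. The z-scoring, the lag convention (a positive lag means the second signal follows the first), and the synthetic signals are assumptions of this sketch and need not match the original analysis code.

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Normalized cross-correlation of x and y for lags in [-max_lag, max_lag]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    values = np.empty(len(lags))
    for i, lag in enumerate(lags):
        if lag >= 0:
            a, b = x[: n - lag], y[lag:]   # pairs x[t] with y[t + lag]
        else:
            a, b = x[-lag:], y[: n + lag]  # pairs x[t] with y[t - |lag|]
        values[i] = np.mean(a * b)
    return lags, values

# Synthetic check: y is a copy of x delayed by three samples.
rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = np.roll(x, 3)
lags, cc = cross_correlation(x, y, max_lag=10)
print("peak lag:", lags[np.argmax(cc)])
```

      A flat function, with no peak at any lag, is what one would expect if the joystick signal does not follow the stimulus fluctuations; a peak at a positive lag would indicate that the second signal trails the first.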

      Author response image 2.

      Cross-correlation between the length of the resultant vector with joystick accuracy (left) and tilt (right). Coherence is color-coded. Shaded background illustrates 95% confidence intervals.

      (1.3) Joystick tilt changes over time despite stable average stimulus coherence

      If perceptual confidence is derived from evidence integration, we should see changes over time even when the stimulus is stable. Here, we have analyzed the average slope of the joystick tilt as a function of time within each stimulus state for each subject and each coherence, to verify if our participants tilted their joystick more with additional evidence. This is illustrated with a violin plot below (Author response image 3). The linear slopes of the joystick tilt progression over the course of stimulus states are different between coherence levels. High coherence causes more tilt over time, resulting in positive slopes for most subjects. In contrast, low/no coherence results mostly in flat or negative slopes. This tilt progression over time indicates that low coherence results in lower confidence, as subjects do not wager more with weak evidence. In contrast, high coherence causes subjects to exhibit more confidence, indicated by positive slope of the joystick tilt.

      Author response image 3.

      Violin plots showing the fitted slopes of the joystick tilt time course in the last 200 samples (1667 ms) leading up to a next stimulus direction (cf. Figure 2D). Positive values signify an increase in joystick tilt over time. Each dot shows the average slope for one subject. Coherence is color-coded. The dashed line at zero indicates unchanged joystick tilt over the analyzed time window.
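
      The slope estimate for one stimulus state can be sketched as a first-order polynomial fit over the final samples of the tilt trace. The 120 Hz sampling rate is inferred from the caption (200 samples spanning 1667 ms), and the function name and synthetic trace are illustrative assumptions.

```python
import numpy as np

def tilt_slope(tilt, n_samples=200, sample_rate_hz=120.0):
    """Linear slope (tilt units per second) over the last n_samples of a tilt trace."""
    window = np.asarray(tilt, dtype=float)[-n_samples:]
    time_s = np.arange(window.size) / sample_rate_hz
    slope, _intercept = np.polyfit(time_s, window, 1)
    return float(slope)

# Synthetic trace: tilt rising by 0.1 units per second over 300 samples.
time_s = np.arange(300) / 120.0
trace = 0.5 + 0.1 * time_s
print(f"{tilt_slope(trace):.3f}")
```

      Averaging these slopes over the states of one coherence level yields one value per subject, which corresponds to a single dot in the violin plot.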

      (1.4) Cross-correlation between response accuracy and joystick tilt

      Similar to 1.2 above, we have cross-correlated the frame-by-frame changes of joystick accuracy and tilt for each individual stimulus state and each subject. Across subjects, changes in tilt occur later than changes in accuracy, indicating that changes in the quality of the report are followed by changes in the size of the wager. Given that this process is not driven by short-term changes in the motion energy of the stimulus (see 1.2 above), we interpret this as additional evidence for a metacognitive assessment of the quality of the behavioral report (i.e. accuracy) reflected in the size of the wager (our measure for confidence). (See Figure 2E).

      (2) Peri-decision wagering is different to post-decision wagering

      […] One route to doing this would be to ask whether the eccentricity reports show statistical signatures of confidence that have been established for more classical punctate tasks. Here a key move has been to identify qualitative patterns in the frame of reference of choice accuracy - with confidence scaling positively with stimulus strength for correct decisions, and negatively with stimulus strength for incorrect decisions (the so-called X-pattern, for instance Sanders et al. 2016 Neuron […].

      We thank reviewer 1 for the constructive feedback. Our behavioral data do not show similar signatures to the previously reported post-decision confidence expression (Desender et al., 2021; Sanders et al., 2016). The previously described patterns show, first of all, that confidence for incorrect type 1 decisions diverges from that for correct type 1 decisions, declining with stimulus strength (e.g. coherence), as compared to an increase for correct decisions. In our task, there is a graded accuracy and (putative) confidence expression, but there are no correct or incorrect decisions; instead, there are hits and misses of the reward targets presented at nominal directions. Instead of a decline for misses, we observe an equally positive scaling of confidence with coherence, both for hits and misses (Author response image 4A). This is because in our peri-decision wagering task, the expression of confidence causally determines the binary hit or miss outcome. The outcome in our task is a function of the two-dimensional joystick response: higher tilt (confidence) requires a more accurate response to successfully hit a target. Thus, a subject can display a high (but not high enough) level of accuracy and confidence but still remain unsuccessful. If we instead median-split the confidence reports by high and low accuracy (Author response image 4C), we observe a slight separation, especially for higher coherences, but still no clear difference in slopes.

      We do observe the other two dynamic signatures of confidence (Desender et al., 2021): signature 2 – monotonically increasing accuracy as a function of confidence (Author response image 4), and signature 3 – steeper type 1 psychometric performance (accuracy) for high versus low confidence (Author response image 4D).

      Author response image 4.

      Confidence (i.e., joystick tilt, left column) and accuracy reports (right column) for different stimulus coherence, sorted by discrete outcome (hit versus miss, upper row) and the complementary joystick dimension (lower row, based on median split).

      Author response image 5.

      Accuracy reports correlate positively with confidence reports. For each stimulus state, we averaged the joystick response in the time window between 500 ms (60 samples) after a direction change until the first reward target appearance. If there was no target, we took all samples until the next RDP direction change into account. This corresponds to data snippets averaged in Figure 2D. Thus, for each stimulus state, we extracted a single value for joystick accuracy and for tilt (confidence). Subsequently, we fitted a linear regression to the accuracy-confidence scatter within each subject and within each coherence level. The plot above shows the average linear regression between accuracy and confidence across all subjects (i.e., the slopes and intercepts were averaged across n=38 subjects). Coherence is color-coded.
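The averaging procedure in this caption (one fit per subject and coherence level, then averaging slopes and intercepts across subjects) could look roughly like the sketch below. The data layout, the function names, and the choice to regress tilt on accuracy (rather than the reverse) are assumptions made for illustration; the caption does not specify them.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def average_fits(data):
    """data: {subject: {coherence: (accuracy_values, tilt_values)}}.
    Fits one line per subject and coherence, then averages the
    slopes and intercepts across subjects for each coherence."""
    fits_by_coherence = {}
    for subject_data in data.values():
        for coherence, (accuracy, tilt) in subject_data.items():
            fits_by_coherence.setdefault(coherence, []).append(fit_line(accuracy, tilt))
    return {
        coherence: (mean(s for s, _ in fits), mean(i for _, i in fits))
        for coherence, fits in fits_by_coherence.items()
    }

# Hypothetical toy data: two subjects, one coherence level.
toy = {
    "s1": {0.5: ([0.2, 0.4, 0.6], [0.3, 0.5, 0.7])},
    "s2": {0.5: ([0.2, 0.4, 0.6], [0.2, 0.3, 0.4])},
}
avg_slope, avg_intercept = average_fits(toy)[0.5]
```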

      (3)  Additional analyses regarding the continuous nature of our data

      I was surprised not to see more analysis of the continuous report data as a function of (lagged) task variables. […]

      Reviewer 1 requested more analyses regarding the continuous nature of our data. We agree that this is a useful addition to our paper, and thank reviewer 1 for this suggestion. To address this point, we revised main Figure 2 and provided additional panels. Panel D illustrates the continuous ramp-up of both accuracy and tilt (confidence) for high coherence levels, suggesting ongoing evidence integration and meta-cognitive assessment. Panel E shows the cross-correlation between frame-by-frame changes in accuracy and tilt (see 1.4 above). Here, we demonstrate that changes in the accuracy precede changes in joystick tilt, characterizing the continuous nature of the perceptual decision-making process.
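To make the lead-lag logic concrete, here is a minimal sketch of a cross-correlation between frame-by-frame changes of two signals. It is not the authors' implementation: the helper names, the synthetic random-walk signal, and the 3-frame delay are all hypothetical, chosen only so that the expected lag is known in advance.

```python
import random

def diff(x):
    """Frame-by-frame change of a signal."""
    return [b - a for a, b in zip(x, x[1:])]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def lagged_correlation(lead, follow, max_lag):
    """Correlate lead[t] with follow[t + lag] for each lag in
    [-max_lag, max_lag]. A peak at a positive lag means that
    changes in `lead` precede changes in `follow`."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = lead[:len(lead) - lag], follow[lag:]
        else:
            a, b = lead[-lag:], follow[:len(follow) + lag]
        result[lag] = pearson(a, b)
    return result

# Synthetic example: "tilt" copies "accuracy" with a 3-frame delay.
rng = random.Random(0)
accuracy = [0.0]
for _ in range(199):
    accuracy.append(accuracy[-1] + rng.gauss(0, 1))
delay = 3
tilt = [accuracy[0]] * delay + accuracy[:-delay]

xcorr = lagged_correlation(diff(accuracy), diff(tilt), max_lag=10)
best_lag = max(xcorr, key=xcorr.get)
```

In this toy case the peak sits at the imposed delay, which is the pattern the response describes for accuracy changes preceding tilt changes.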

      (4) Explicit motivation regarding continuous social experiments

      This paper is innovating on a lot of fronts at once - developing a new CPR task for metacognition, and asking exploratory questions about how a social setting influences performance on this novel task. However, the rationale for this combination was not made explicit. Is the social manipulation there to help validate the new task as a measure of confidence as dissociated from other perceptual variables? (see query 1 below). Or is the claim that the social influence can only be properly measured in the naturalistic CPR task, and not in a more established metacognition task?

      Our rationale for the combination of real-time decision making and social settings was twofold:

      i. Primates, including humans, are social species. Naturally, most behavior is centered around a social context and continuously unfolds in real-time. We wanted to showcase a paradigm in which distinct aspects of continuous perceptual decision-making could be assessed over time in individual and social environments.

      ii. Human behavior is susceptible to what others think and do. We wanted to demonstrate that the sheer presence of a co-acting social partner affects continuous decision-making, and quantify the extent and direction of social modulation.

      We agree that the motivation for combining the new task and this specific type of social co-action should be clearer. We have clarified this aspect in the Introduction, lines 92-109. In brief, the continuous, free-flowing nature of the CPR task and real-time availability of social information made this design a very suitable paradigm for assessing unconstrained social influences. We see this study as the first step toward disentangling the neural basis of social modulation in primates. See also the response to reviewer 2, point 2, below.

      (5) Response to minor points

      (5.1)  Clarification on behavioral modulation patterns

      Lines 295-298, isn't it guaranteed to observe these three behavioral patterns (both participants improving, both getting worse, only one improving while the other gets worse) even in random data?

      The reviewer is correct. We now simply illustrate these possibilities in Figure 4B and how these patterns could lead to divergence or convergence between the participants (see also line 282). Unlike random data, our results predominantly demonstrate convergence.

      (5.2) Clarification on AUC distributions

      Lines 703-707, it wasn't clear what the AUC values referred to here (also in Figure 3) - what are the distributions that are being compared? I think part of the confusion here comes from AUC being mentioned earlier in the paper as a measure of metacognitive sensitivity (correct vs. incorrect trial distributions), whereas my impression here is that here AUC is being used to investigate differences in variables (e.g., confidence) between experimental conditions.

      We apologize for the confusion. Indeed, the AUC analysis was used for the two purposes:

      (i) To assess the metacognitive sensitivity (line 175, Supplementary Figure 2).

      (ii) To assess the social modulation of accuracy and confidence (starting at line 232, Figures 3-6). 

      We now introduce the second AUC approach for assessing social modulation, and the underlying distributions of accuracy and confidence derived from each stimulus state, separately in each subject, in line 232.
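As a rough illustration of the second use of AUC (comparing two distributions of the same variable between conditions), the sketch below computes the area under the ROC curve as the probability that a value from one sample exceeds a value from the other. The function name and toy values are hypothetical, and the response does not state that the authors computed AUC this way.

```python
def auc(sample_a, sample_b):
    """Probability that a random value from sample_b exceeds a random
    value from sample_a, with ties counted as one half.
    0.5 means no separation; values above 0.5 mean sample_b tends
    to be larger (for example, dyadic versus solo reports)."""
    wins = 0.0
    for a in sample_a:
        for b in sample_b:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5
    return wins / (len(sample_a) * len(sample_b))

# Hypothetical per-stimulus-state values for one subject.
solo = [1, 2, 3]
dyadic = [2, 3, 4]
modulation = auc(solo, dyadic)
```

Under this reading, a value above 0.5 would indicate that the measure increased in the dyadic condition and a value below 0.5 that it decreased.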

      (5.3) Clarification of potential ceiling effects

      Could the findings of the worse solo player benefitting more than the better solo player (Figure 4c) be partly due to a compressive ceiling effect - e.g., there is less room to move up the psychometric function for the higher-scoring player?

      We thank the reviewer for this insight. First, even better performing participants were not at ceiling most of the time, even at the highest coherence (cf. Figure 2 and Supplementary Figure 3C). To test for the potential ceiling effect in the better solo players, we correlated their social modulation (expressed as AUC as in Figure 4) to the solo performance. There was no significant negative correlation for the accuracy (p > 0.063), but there was a negative correlation for the confidence (r = -0.39, p = 0.0058), indicating that indeed low performing “better players in a dyad” showed more positive social modulation. We note however that this correlation was driven mainly by a few such initially low performing “better” players, who mostly belonged to the dyads where both participants improved in confidence (green dots, Figure 4B), and that even the highest solo average confidence was not at ceiling (<0.95). To conclude, the asymmetric social modulation effect we observe is mainly due to the better players declining (orange and red dots, Figure 4B), rather than due to both players improving but the better player improving less (green dots, Figure 4B).

      Reviewer 2:

      Strengths:

      There are many things to like about this paper. The visual psychophysics has been undertaken with much expertise and care to detail. The reporting is meticulous and the coverage of the recent previous literature is reasonable. The research question is novel.

      We thank reviewer 2 for this positive evaluation. Below we address the identified weaknesses and recommendations.

      (1) Streamlining the text to make the paper easier to read

      The paper is difficult to read. It is very densely written, with little to distinguish between what is a key message and what is an auxiliary side note. The Figures are often packed with sometimes over 10 panels and very long captions that stick to the descriptive details but avoid clarity. There is much that could be shifted to supplementary material for the reader to get to the main points.

      We thank reviewer 2 for the honest assessment that our article was difficult to read and understand, and for providing specific examples of confusion. We substantially improved the clarity:

      We added a Glossary that defines key terms, including Accuracy and Hit rate. 

      We replaced the confusing term “eccentricity” with joystick “tilt”.

      We simplified Figures 3 and 5, moving some panels into supplementary figures.

      We substantially redesigned and simplified our main Figure 4, displaying the data in a more straightforward, less convoluted way, and removing several panels. This change was accompanied by corresponding changes in the text (section starting at line 277).

      More generally, we shortened the Introduction, substantially revised the Results and the figure legends, and streamlined the Discussion.

      (2) Dyadic co-action vs joint dyadic decision making

      A third and very important one is what the word "dyadic" refers to in the paper. The subjects do not make any joint decisions. However, the authors calculate some "dyadic score" to measure if the group has been able to do better than individuals. So the word dyadic sometimes refers to some "nominal" group. In other places, dyadic refers to the social experimental condition. For example, we see in Figure 3c that AUC is compared for solo vs dyadic conditions. This is confusing.

      […] my key criticism is that the paper makes strong points about collective decision-making and compares its own findings with many papers in that field when, in fact, the experiments do not involve any collective decision-making. The subjects are not incentivized to do better as a group either. […]

      The reviewer is correct to highlight these important aspects. We did, in fact, not investigate a situation where two players had to reach a joint decision with interdependent payoff and there was no incentive to collaborate or even incorporate the information provided by the other player. To make the meaning of “dyadic” in our context more explicit, we have clarified the nature of the co-action and independent payoff (e.g. lines 107, 211, 482, 755 - Glossary), and used the term “nominal combined score” (line 224) and “nominal “average accuracy” within a dyad” (line 439).

      Concerning the key point about embedding our findings into the literature on collective decision-making, we would like to clarify our motivation. Outside of the recent study by Pescetelli and Yeung, 2022, we are not aware of any perceptual decision-making studies that investigated co-action without any explicit joint task. So naturally, we were stimulated by the literature on collective decisions, and felt it was appropriate to compare our findings to the principles derived from this exciting field. Besides developing a continuous – in time and in “space” (direction) – peri-decision wagering CPR game, the social co-action context is the main novel contribution of our work. Although it is possible to formulate cooperative or competitive contexts for the CPR, we leveraged the free-flowing continuous nature of the task that makes it most readily amenable to studying spontaneously emerging social information integration.

      We now more explicitly emphasize that most prior work has been done using the joint decision tasks, in contrast to the co-action we study here, in Introduction and Discussion.

      (3) Addition of relevant literature to Discussion

      […] To see why this matters, look at Lorenz et al PNAS (https://www.pnas.org/doi/10.1073/pnas.1008636108) and the subsequent commentary that followed it from Farrell (https://www.pnas.org/doi/full/10.1073/pnas.1109947108). The original paper argued that social influence caused herding which impaired the wisdom of crowds. Farrell's reanalysis of the paper's own data showed that social influence and herding benefited the individuals at the expense of the crowd demonstrating a form of tradeoff between individual and joint payoff. It is naive to think that by exposing the subjects to social information, we should, naturally, expect them to strive to achieve better performance as a group.

      Another paper that is relevant to the relationship between the better and worse performing members of the dyad is Mahmoodi et al PNAS 2015 (https://www.pnas.org/doi/10.1073/pnas.1421692112). Here too the authors demonstrate that two people interacting with one another do not "bother" figuring out each others' competence and operate under "equality assumption". Thus, the lesser competent member turns out to be overconfident, and the more competent one is underconfident. The relevance of this paper is that it manages to explain patterns very similar to Schneider et al by making a much simpler "equality bias" assumption.

      We thank reviewer 2 for pointing out these highly relevant references, which we have now integrated in the Discussion (lines 430 and 467). Regarding the debate between Lorenz et al. and Farrell, although it concerns a very different type of task – single-shot factual knowledge estimation – it is very illuminating for understanding the differing perspectives on individual vs group benefit. We fully agree that it is naïve to assume that during independent co-action in our highly demanding task participants would strive to achieve better performance as a group – if anything, we expected less normative and more informational, reliability-driven effects as a way to cope with task demands.

      Mahmoodi et al. is a particularly pertinent and elegant study, and the equality bias they demonstrate may indeed underlie the effects we see. We admit that we did not know this paper at the time of our initial writing, but it is encouraging to see the convergence [pun intended] despite task and analysis differences. As highlighted above (2), our novel contributions remain that we observe mutual alignment, or convergence, in real-time without explicitly formulated collective decision task and associated social pressure, and that we separate asymmetric social effects on accuracy and confidence.

      Other reviewer-independent changes:

      Additional information: Angular error in Figure 2

      In panel A of the main Figure 2, we have added the angular error of the solo reports (blue dashed line) to give readers an impression about the average deviation of subjects’ joystick direction from the nominal stimulus direction. We have pointed out that angular error is the basis for accuracy calculation.

      Data alignment

      In the previous version of the manuscript, we presented data with different alignments: Accuracy values were aligned to the appearance of the first target in a stimulus state (target-alignment) to avoid the predictive influence of target location within the remaining stimulus state, while the joystick tilt was extracted at the end of each stimulus state (state-alignment) to allow subjects more time to make a deliberate, confidence-guided report (Methods). We realized that this is confusing as it compares the social modulation of the two response dimensions at different points in time. In the revision, we use state-aligned data in most figures and analyses and clearly indicate which alignment type has been used. We kept the target-alignment for the illustration of the angular error in the solo-behavior (Figure 2). Specifically, this has only changed the reporting on accuracy statistics. None of the results have changed fundamentally, but the social modulation on accuracy became even stronger in state-aligned data.

      In summary, we hope that these revisions have resulted in an easier-to-understand and convincing article, with clear terminology and concise and important takeaway messages.

      We thank both reviewers and the editors again for their time and effort, and look forward to the reevaluation of our work.

      References

      Desender K, Donner TH, Verguts T. 2021. Dynamic expressions of confidence within an evidence accumulation framework. Cognition 207:104522. doi:10.1016/j.cognition.2020.104522

      Pescetelli N, Yeung N. 2022. Benefits of spontaneous confidence alignment between dyad members. Collective Intelligence 1. doi:10.1177/26339137221126915

      Sanders JI, Hangya B, Kepecs A. 2016. Signatures of a Statistical Computation in the Human Sense of Confidence. Neuron 90:499–506. doi:10.1016/j.neuron.2016.03.025

    1. What we should at all times look at is the fact that: We are all oppressed by the same system. That we are oppressed to varying degrees is a deliberate design to stratify us not only socially but also in terms of the enemy’s aspirations. Therefore it is to be expected that in terms of the enemy’s plan there must be this suspicion and that if we are committed to the problem of emancipation to the same degree it is part of our duty to bring to the black people the deliberateness of the enemy’s subjugation scheme. That we should go on with our programme, attracting to it only committed people and not just those eager to see an equitable distribution of groups amongst our ranks. This is a game common amongst liberals. The one criterion that must govern all our action is commitment.

      This part states that black people are all hurt by the same system, which tries to divide them or tear them apart. The author therefore argues for working together only with those who are truly committed.

    1. Join MKT 4D Hari Ini for Fresh Results and Winning Big  If you are searching for MKT 4D Hari Ini in Malaysia then you do not need to scroll through endless websites. Just visit Clubmy and get reliable 4D updates. We simplify everything from placing a bet to viewing results, especially for Malaysian players. This makes your 4D experience more responsible, informed and enjoyable.  What is MKT 4D Hari Ini and How to Win Big in it The real meaning of the phrase MKT 4D is Market 4D. That refers to the famous 4D lottery format in Malaysia. And the word Hari Ini means today. The combined meaning is today’s 4D market. In Malaysia, licensed operators conduct draws. And in this draw, 4D patterns are selected randomly. Its simplicity makes it more appealing in Malaysia. The process of winning the 4D lottery is more effortless than you could think. Choose a pattern of four digits between 0000 and 9999. Then place the bet and after the draw, check whether your number matches the winning combinations or not. If it matches perfectly, then it would be great. Because that means you have won. This game is a perfect combination of strategy, luck and pattern spotting. We make your everyday play full of excitement. Why Everyone in Malaysia Talks About MKT 4D Many Malaysians search for MKT 4D Hari Ini on Google. In other words, they are looking for today’s winning 4D numbers. To match their tickets while hoping they have landed a jackpot. It could also mean that they are searching for recent 4D trends or historical data. This helps them choose a perfect winning combination in the next draw.  When you follow the updates consistently then you are able to avoid missing hot (frequently winning) numbers or live news regarding 4D results. Every player wants to access reliable and 100% authentic results. So when they search MKT 4D, it also points at how the players are curious about the guidelines on how to interpret the 4D result. 
At Clubmy, we have all this information and data in one place. So without a further ado, you can quickly view and check today’s 4D results. Through this, you can understand the 4D patterns so planning the next step becomes seamless. Check Today’s MKT 4D Only on Clubmy Malaysia There are many 4D platforms in Malaysia. And many of you must have a question in mind, why Clubmy only? Because we are the only trusted platform that offers the Hari Ini MKT 4D section in Malaysia.  When we say it all in one platform, we don’t only claim it but also prove it. Just visit our official page now, you will see clear 4D draw numbers with dates and operator details. We also publish live results with accuracy so no more delays in updates. If you are searching for previous draws then visit our archive section. Here you will approach all the past results and track the 4D number pattern and its frequency as well.  Are you tired of placing bets through guessing? Then join our page now. We provide all the essential information and analysis for players who want to understand the logic behind their picks. In this way, we promote a responsible 4D gaming environment in Malaysia. And encourage the concept of 4D just for enjoyment, not a financial strategy. Useful Tricks & Strategies for Hari Ini MKT 4D If you are following today’s draw then these tips and habits will definitely help you improve your 4D experience. Keep a Record of History Draw Before chasing the MKT 4D Hari Ini, you must make a record of past results. You can take a screenshot or write your favourite number combinations. Because with time, it is possible you might forget the repetition or skipped patterns. So our archive section helps you maintain track of those records. Set Your Budget and Do not Follow Losses  In a lottery game, you have no idea of winning or losing. Because the draws are truly random. That is why stop chasing your losses. Once you lose then do not invest again and again to convert this loss into a win. 
This habit disturbs your budget and also removes the fun factor from your lottery game. Do Not Rely Only on Luck  Many Malaysian players choose their 4D pattern randomly. Or some follow their lucky numbers, birthdays or anniversaries for choosing a bet number. These patterns might be too simple to lead you to win big. So avoid this habit and pick your 4D number with logic and luck both. This approach definitely brings big wins for you. Connect with a Reliable and Trustworthy Source  There are many online scams in the market. And countless fake 4D websites that post incorrect data. So if you want 100% authentic results then it is essential to find a reliable source. When Clubmy provides you with live results updates officially then you do not need to search here and there. Here you will receive verified, confirmed and transparent information. Join Clubmy, Your Trusted Source for Today’s MKT 4D Hari Ini The MKT 4D Hari Ini is not all about the draw or results. In fact it is a mixture of traditional lottery excitement in Malaysia. At Clubmy, we are proud to provide you with authentic information, live results and useful guidance. All crafted especially for the Malaysian players to promote a responsible and informed gaming culture. Stay lucky and play smart. Enjoy the best experience online on your smartphone or desktop. We are always available for your support and to keep you ahead of the curve in 4D games. Frequently Asked Questions When can I view the MKT 4D Hari Ini today’s draw at Clubmy? At Clubmy, our draws usually take place in the late afternoon or evening. But there is no fixed time of announcements because the draw timing varies with the operator. And we always publish the results after the official draw.  Is your prediction regarding 4D winning numbers 100% true? Our prediction is most likely true. But when you can say 100% true, it is not possible because no one tells you the winning number. 
It is just a prediction so you can utilize this to analyze trends. Is 4D culture legal for the Malaysian players? Yes it is legal for Malaysian players. Only when you play the 4D lottery with a licensed source. So always verify the legal status of the platform before buying tickets. Clubmy is your licensed partner. Here you can play the 4D lottery without any fear.

      【MKT 4D Hari Ini: Start Your Lucky Journey with Clubmy🎯】

      In Malaysia, searching for "MKT 4D Hari Ini" is not just about checking numbers - it is a lifestyle full of anticipation. Clubmy now offers the most reliable 4D update service, making every bet smarter and every expectation better grounded.

      From instant draw results to historical data analysis, from hot-number tracking to betting strategy guides, Clubmy builds a safe and transparent 4D gaming environment for Malaysian players with professionalism and integrity. Let luck go hand in hand with wisdom, and let the game return to being fun in itself. 👉 [Read the Full Article] Start Your Smart 4D Journey

    1. Author response:

      The following is the authors’ response to the original reviews

      Response to the Editors’ Comments

      Thank you for this summary of the reviews and recommendations for corrections. We respond to each in turn, and have documented each correction with specific examples contained within our response to reviewers below.

      ‘They all recommend to clarify the link between hypotheses and analyses, ground them more clearly in, and conduct critical comparisons with existing literature, and address a potential multiple comparison problem.’

      We have restructured our introduction to include the relevant literature outlined by the reviewers, and to more clearly ground the goals of our model and broader analysis. We have additionally corrected for multiple comparisons within our exploratory associative analyses. We have also signposted exploratory tests more clearly.

      ‘Furthermore, R1 also recommends to include a formal external validation of how the model parameters relate to participant behaviour, to correct an unjustified claim of causality between childhood adversity and separation of self, and to clarify role of therapy received by patients.’

      We have now tempered our language in the abstract which unintentionally implied causality in the associative analysis between childhood trauma and other-to-self generalisation. To note, in the sense that our models provide causal explanations for behaviour across all three phases of the task, we argue that our model comparison provides some causal evidence for algorithmic biases within the BPD phenotype. We have included further details of the exclusion and inclusion criteria of the BPD participants within the methods.

      ‘R2 specifically recommends to clarify, in the introduction, the specific aim of the paper, what is known already, and the approach to addressing it.’

      We have more thoroughly outlined the current state of the art concerning behavioural and computational approaches to self insertion and social contagion, in health and within BPD. We have linked these more clearly to the aims of the work.

      ‘R2 also makes various additional recommendations regarding clarification of missing information about model comparison, fit statistics and group comparison of parameters from different models.’

      Our model comparison approach and algorithm are outlined within the original paper for Hierarchical Bayesian Model comparison (Piray et al., 2019). We have outlined the concepts of this approach in the methods. We have now additionally improved clarity by placing descriptions of this approach more obviously in the results, and added points of greater detail in the methods, such as which statistics for comparison we extracted on the group and individual level.

      In addition, in response to the need for greater comparison of parameters from different models, we have also hierarchically force-fitted the full suite of models (M1-M4) to all participants. We report all group differences from each model individually – assuming their explanation of the data - in Table S2. We have also demonstrated strong associations between parameters of equivalent meaning from different models to support our claims in Fig S11. Finally, we show minimal distortion to parameter estimates in between-group analysis when models are either fitted hierarchically to the entire population, or group wise (Figure S10).

      ‘R3 additionally recommends to clarify the clinical and cognitive process relevance of the experiment, and to consider the importance of the Phase 2 findings.’

      We have now included greater reference to the assumptions in the social value orientation paradigm we use in the introduction. We have also responded to the specific point about the shift in central tendencies in phase 2 from the BPD group, noting that, while BPD participants do indeed become relatively more competitive vs. CON participants, they remain strikingly neutral with respect to the overall state space. Importantly, model M4 does not preclude more competitive distributions existing.

      ‘Critically, they also share a concern about analyzing parameter estimates fit separately to two groups, when the best-fitting model is not shared. They propose to resolve this by considering a model that can encompass the full dynamics of the entire sample.’

      We have hierarchically force-fitted the full suite of models (M1-M4) to all participants to allow for comparison between parameters within each model assumption. We report all group differences from each model individually – assuming their explanation of the data - in Table S2 and Table S3. We have also demonstrated strong associations between parameters of equivalent meaning from different models to support our claims in Fig S11. We also show minimal distortion to parameter estimates in between-group analysis when models are either fitted hierarchically to the entire population, or group wise (Figure S10).

      Within model M1 and M2, the parameters quantify the degree to which participants believe their partner to be different from themselves. Under M1 and M2 model assumptions, BPD participants have meaningfully larger versus CON (Fig S10), which supports the notion that a new central tendency may be more parsimonious in phase 2 (as in the case of the optimal model for BPD, M4). We also show strong correlations across models between under M1 and M2, and the shift in central tendencies of beliefs between phase 1 and 2 under M3 and M4. This supports our primary comparison, and shows that even under non-dominant model assumptions, parameters demonstrate that BPD participants expect their partner’s relative reward preferences to be vastly different from themselves versus CON.

      ‘A final important point concerns the psychometric individual difference analyses which seem to be conducted on the full sample without considering the group structure.’

      We have now more clearly focused our psychometric analysis. We control for multiple comparisons, and compare parameters across the same model (M3) when assessing the relationship between paranoia, trauma, trait mentalising, and social contagion. We have relegated all other exploratory analyses to the supplementary material and noted where p values survive correction using False Discovery Rate.
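For readers unfamiliar with False Discovery Rate correction, the sketch below shows the Benjamini-Hochberg step-up procedure, one common way to control it. Whether the authors used this particular procedure is an assumption; the response only names "False Discovery Rate". The function name and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which p values survive
    Benjamini-Hochberg FDR correction at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_rank = rank
    # All tests ranked at or below that k are declared significant.
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            survives[idx] = True
    return survives

# Hypothetical p values from four exploratory correlations.
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.2])
```

Here only the first test survives, even though three of the four raw p values are below 0.05.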

      Reviewer 1:

      ‘The manuscript's primary weakness relates to the number of comparisons conducted and a lack of clarity in how those comparisons relate to the authors' hypotheses. The authors specify a primary prediction about disruption to information generalization in social decision making & learning processes, and it is clear from the text how their 4 main models are supposed to test this hypothesis. With regards to any further analyses however (such as the correlations between multiple clinical scales and eight different model parameters, but also individual parameter comparisons between groups), this is less clear. I recommend the authors clearly link each test to a hypothesis by specifying, for each analysis, what their specific expectations for conducted comparisons are, so a reader can assess whether the results are/aren't in line with predictions. The number of conducted tests relating to a specific hypothesis also determines whether multiple comparison corrections are warranted or not. If comparisons are exploratory in nature, this should be explicitly stated.’

      We have now corrected for multiple comparisons when examining the relationship between psychometric findings and parameters, using partial correlations and bootstrapping for robustness. These latter analyses were indeed not preregistered, and so we have more clearly signposted that these tests were exploratory. We chose to focus on the influence of psychometrics of interest on social contagion under model M3 given that this model explained a reasonable minority of behaviour in each group. We have now fully edited this section in the main text in response, and relegated all other correlations to the supplementary materials.
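To make the correction step concrete, the following is a minimal sketch of a False Discovery Rate adjustment using the Benjamini-Hochberg procedure. The response does not state which FDR procedure was used, so Benjamini-Hochberg is an assumption here, and the function name `benjamini_hochberg` and the example p-values are purely illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical uncorrected p-values from four exploratory correlations.
raw_p = [0.001, 0.02, 0.03, 0.40]
p_fdr = benjamini_hochberg(raw_p)
```

A test would then be reported as surviving correction when its adjusted value in `p_fdr` falls below the chosen threshold (for example 0.05).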

      ‘Furthermore, the authors present some measures for external validation of the models, including comparison between reaction times and belief shifts, and correlations between model predicted accuracy and behavioural accuracy/total scores. However it would be great to see some more formal external validation of how the model parameters relate to participant behaviour, e.g., the correlation between the number of pro-social choices and ß-values, or the correlation between the change in absolute number of pro-social choices and the change in ß. From comparing the behavioural and computational results it looks like they would correlate highly, but it would be nice to see this formally confirmed.’

      We have included this further examination within the Generative Accuracy and Recovery section:

      ‘We also assessed the relationship (Pearson rs) between modelled participant preference parameters in phase 1 and actual choice behaviour: was negatively correlated with prosocial versus competitive choices (r=-0.77, p<0.001) and individualistic versus competitive choices (r=-0.59, p<0.001); was positively correlated with individualistic versus competitive choices (r=0.53, p<0.001) and negatively correlated with prosocial versus individualistic choices (r=-0.69, p<0.001).’

      ‘The statement in the abstract that 'Overall, the findings provide a clear explanation of how self-other generalisation constrains and assists learning, how childhood adversity disrupts this through separation of internalised beliefs' makes an unjustified claim of causality between childhood adversity and separation of self - and other beliefs, although the authors only present correlations. I recommend this should be rephrased to reflect the correlational nature of the results.’

      Sorry – this was unfortunate wording: we did not intend to imply causation with our second clause in the sentence mentioned. We have amended the language to make it clear this relationship is associative:

      ‘Overall, the findings provide a clear explanation of how self-other generalisation constrains and assists learning, how childhood adversity is associated with separation of internalised beliefs, and makes clear causal predictions about the mechanisms of social information generalisation under uncertainty.’

      ‘Currently, from the discussion the findings seem relevant in explaining certain aberrant social learning and -decision making processes in BPD. However, I would like to see a more thorough discussion about the practical relevance of their findings in light of their observation of comparable prediction accuracy between the two groups.’

      We have included a new paragraph in the discussion to address this:

      ‘Notably, despite differing strategies, those with BPD achieved similar accuracy to CON participants in predicting their partners. All participants were more concerned with relative versus absolute reward; only those with BPD changed their strategy based on this focus. Practically this difference in BPD is captured either through disintegrated priors with a new median (M4) or very noisy, but integrated priors over partners (M1) if we assume M1 can account for the full population. In either case, the algorithm underlying the computational goal for BPD participants is far higher in entropy and emphasises a less stable or reliable process of inference. In future work, it would be important to assess this mechanism alongside momentary assessments of mood to understand whether more entropic learning processes contribute to distressing mood fluctuation.’

      ‘Relatedly, the authors mention that a primary focus of mentalization based therapy for BPD is 'restoring a stable sense of self' and 'differentiating the self from the other'. These goals are very reminiscent of the findings of the current study that individuals with BPD show lower uncertainty over their own and relative reward preferences, and that they are less susceptible to social contagion. Could the observed group differences therefore be a result of therapy rather than adverse early life experiences?’

      This is something that we wish to explore in further work. While verbal and model descriptions appear parsimonious, this is not straightforward. As we see, clinical observation and phenomenological dynamics may not necessarily map in an intuitive way onto parameters of interest. It may be that compartmentalisation of self and other – as we see in BPD participants within our data – may counter-intuitively express as a less stable self. The evolutionary mechanisms that make social insertion and contagion enduring may also be the same that foster trust and learning.

      ‘Regarding partner similarity: It was unclear to me why the authors chose partners that were 50% similar when it would be at least equally interesting to investigate self-insertion and social contagion with those that are more than 50% different to ourselves? Do the authors have any assumptions or even data that shows the results still hold for situations with lower than 50% similarity?’

      While our task algorithm had a high probability of matching individuals who were approximately 50% different with respect to their observed behaviour, there was variation either side of this value. The value of 50% median difference was chosen for two reasons: 1. We wanted to ensure participants had to learn about their partner to some degree relative to their own preferences and 2. we did not want to induce extreme over- or under-familiarity given the (now replicated) relationship between participant-partner similarity and intentional attributions (see below). Nevertheless, we did have some variation around the 50% median. Figure 3A in the top left panel demonstrates this fluctuation in participant-partner similarity and the figure legend further describes this distribution (mean = 49%, sd = 12%). In future work we want to more closely manipulate the median similarity between participants and partners to understand how this facilitates or inhibits learning and generalisation.

      There is some analysis of the relationship between degrees of similarity and behaviour. In the third paragraph of page 15 we report the influence of participant-partner similarity on reaction times. In prior work (Barnby et al., 2022; Cognition) we had shown that similarity was associated with reduced attributions of harm about a partner, irrespective of their true parameters (e.g. whether they were prosocial/competitive). We replicate this previous finding with a double dissociation illustrated in Figure 4, showing that greater discrepancies in participant-partner prosociality increase explicit harmful intent attributions (but not self-interest), and discrepancies in participant-partner individualism reduce explicit self-interest attributions (but not harmful intent). We have made these clearer in our results structure, and included FDR correction values for multiple comparisons.

      The methods section is rather dense and at least I found it difficult to keep track of the many different findings. I recommend the authors reduce the density by moving some of the secondary analyses in the supplementary materials, or alternatively, to provide an overall summary of all presented findings at the end of the Results section.

      We have now moved several of our exploratory findings into the supplementary materials, notably the analysis of participant-partner similarity on reaction times (Fig S9), as well as the uncorrected correlation between parameters (Fig S7).

      Fig 2C) and Discussion p. 21: What do the authors mean by 'more sensitive updates'? more sensitive to what?

      We have now edited the wording to specify ‘more belief updating’ rather than ‘sensitive’ to be clearer in our language.

      P14 bottom: please specify what is meant by axial differences.

      We have changed this to ‘preference type’ rather than using the term ‘axial’.

      It may be helpful to have Supplementary Figure 1 in the main text.

      Thank you for this suggestion. Given the volume of information in the main text we hope that it is acceptable for Figure S1 to remain in the supplementary materials.

      Figure 3D bottom panel: what is the difference between left and right plots? Should one of them be alpha not beta?

      The left and right plots are of the change in standard deviation (left) and central tendency (right) of participant preference change between phase 1 and 3. This is currently noted in the figure legend, but we have added some text to be clearer that this is over prosocial-competitive beliefs specifically. We chose to use this belief as an example given the centrality of prosocial-competitive beliefs in the learning process in Figure 2. We also noticed a small labelling error in the bottom panels of 3D, which should have noted that each plot was either with respect to the precision or mean-shift in beliefs during phase 3.

      ‘The relationship between uncertainty over the self and uncertainty over the other with respect to the change in the precision (left) and median-shift (right) in phase 3 prosocial-competitive beliefs.’

      Supplementary Figure 4: The prior presented does not look neutral to me, but rather right-leaning, so competitive, and therefore does indeed look like it was influenced by the self-model? If I am mistaken please could the authors explain why.

      This example distribution is taken from a single BPD participant. In this case, indeed, the prior is somewhat right-shifted. However, on a group level, priors over the partner were closely centred around 0 (see reported statistics in paragraph 2 under the heading ‘Phase 2 – BPD Participants Use Disintegrated and Neutral Priors’). However, we understand how this may come across as misleading. For clarity we have expanded upon Figure S4 to include the phase 1 and prior phase 2 distributions for the entire BPD population for both prosocial and individualistic beliefs. This further demonstrates that those with BPD held surprisingly neutral beliefs over the expectations about their partners’ prosociality, but had minor shifts between their own individualistic preferences and the expected individualistic preferences of their partners. This is also visible in Figure S2.

      Reviewer 2:

      ‘There are two major weaknesses. First, the paper lacks focus and clarity. The introduction is rather vague and, after reading it, I remained confused about the paper's aims. Rather than relying on specific predictions, the analysis is exploratory. This implies that it is hard to keep track, and to understand the significance, of the many findings that are reported.’

      Thank you for this opportunity to be clearer in our framing of the paper. While the model makes specific causal predictions with respect to behavioural dynamics conditional on algorithmic differences, our other analyses were indeed exploratory. We did not preregister this work but now, given the intriguing findings, we intend to preregister our future analyses.

      We have made our introduction clearer with respect to the aims of the paper:

      ‘Our present work sought to achieve two primary goals: 1. Extend prior causal computational theories to formalise the interrelation between self-insertion and social contagion within an economic paradigm, the Intentions Game and 2., Test how a diagnosis of BPD may relate to deficits in these forms of generalisation. We propose a computational theory with testable predictions to begin addressing this question. To foreshadow our results, we found that healthy participants employ a mixed process of self-insertion and contagion to predict and align with the beliefs of their partners. In contrast, individuals with BPD exhibit distinct, disintegrated representations of self and other, despite showing similar average accuracy in their learning about partners. Our model and data suggest that the previously observed computational characteristics in BPD, such as reduced self-anchoring during ambiguous learning and a relative impermeability of the self, arise from the failure of information about others to transfer to and inform the self. By integrating separate computational findings, we provide a foundational model and a concise, dynamic paradigm to investigate uncertainty, generalization, and regulation in social interactions.’

      ‘Second, although the computational approach employed is clever and sophisticated, there is important information missing about model comparison which ultimately makes some of the results hard to assess from the perspective of the reader.’

      Our model comparison employed state-of-the-art random-effects Bayesian model comparison (Piray et al., 2019; PLOS Comp. Biol.). It initially fits each individual to each model using Laplace approximation, and subsequently ‘races’ each model against each other on the group level and individual level through hierarchical constraints and random-effect considerations. We included this in the methods but have now expanded on the description of how we compared models:

      In the results -

      ‘All computational models were fitted using a Hierarchical Bayesian Inference (HBI) algorithm which allows hierarchical parameter estimation while assuming random effects for group and individual model responsibility (Piray et al., 2019; see Methods for more information). We report individual and group-level model responsibility, in addition to protected exceedance probabilities between-groups to assess model dominance.’

      We added to our existing description in the methods –

      ‘All computational models were fitted using a Hierarchical Bayesian Inference (HBI) algorithm which allows hierarchical parameter estimation while assuming random effects for group and individual model responsibility (Piray et al., 2019). During fitting we added a small noise floor to distributions (2.22e-16) before normalisation for numerical stability. Parameters were estimated using the HBI in untransformed space drawing from broad priors (μ_M = 0, σ²_M = 6.5; where M = {M1, M2, M3, M4}). This process was run independently for each group. Parameters were transformed into model-relevant space for analysis. All models and hierarchical fitting was implemented in Matlab (Version R2022B). All other analyses were conducted in R (version 4.3.3; arm64 build) running on Mac OS (Ventura 13.0). We extracted individual and group level responsibilities, as well as the protected exceedance probability to assess model dominance per group.’
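The noise-floor step described in this passage can be illustrated with a short sketch. The original models were implemented in Matlab, so this Python version is only an illustration under assumptions: the function name `normalise_with_floor` and the example weights are hypothetical, and only the floor value (2.22e-16, close to double-precision machine epsilon) comes from the text.

```python
import math

NOISE_FLOOR = 2.22e-16  # floor value quoted in the methods text


def normalise_with_floor(weights, floor=NOISE_FLOOR):
    """Add a small floor to every entry, then rescale so the entries sum to 1."""
    floored = [w + floor for w in weights]
    total = sum(floored)
    return [w / total for w in floored]


# Hypothetical unnormalised belief over four grid points, two of them exactly zero.
belief = normalise_with_floor([0.0, 3.0, 1.0, 0.0])

# Every entry is now strictly positive, so log-probabilities are finite.
log_belief = [math.log(p) for p in belief]
```

Without the floor, the zero entries would yield a domain error (or negative infinity) when logged, which is the kind of numerical instability the floor is meant to avoid.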

      (1) P3, third paragraph: please define self-insertion

      We have now more clearly defined this in the prior paragraph when introducing concepts.

      ‘To reduce uncertainty about others, theories of the relational self (Anderson & Chen, 2002) suggest that people have available to them an extensive and well-grounded representation of themselves, leading to a readily accessible initial belief (Allport, 1924; Krueger & Clement, 1994) that can be projected or integrated when learning about others (self-insertion).’

      (2) Introduction: the specific aim of the paper should be clarified - at the moment, it is rather vague. The authors write: "However, critical questions remain: How do humans adjudicate between self-insertion and contagion during interaction to manage interpersonal generalization? Does the uncertainty in self-other beliefs affect their generalizability? How can disruptions in interpersonal exchange during sensitive developmental periods (e.g., childhood maltreatment) inform models of psychiatric disorders?". Which of these questions is the focus of the paper? And how does the paper aim at addressing it?

      (3) Relatedly, from the introduction it is not clear whether the goal is to develop a theory of self-insertion and social contagion and test it empirically, or whether it is to study these processes in BPD, or both (or something else). Clarifying which specific question(s) is addressed is important (also clarifying what we already know about that specific question, and how the paper aims at elucidating that specific question).

      We have now included the specific aims of the paper. We note this in the above response to the reviewer's general comments.

      (4) "Computational models have probed social processes in BPD, linking the BPD phenotype to a potential over-reliance on social versus internal cues (Henco et al., 2020), 'splitting' of social latent states that encode beliefs about others (Story et al., 2023), negative appraisal of interpersonal experiences with heightened self-blame (Mancinelli et al., 2024), inaccurate inferences about others' irritability (Hula et al., 2018), and reduced belief adaptation in social learning contexts (Siegel et al., 2020). Previous studies have typically overlooked how self and other are represented in tandem, prompting further investigation into why any of these BPD phenotypes manifest." Not clear what the link between the first and second sentence is. Does it mean that previous computational models have focused exclusively on how other people are represented in BPD, and not on how the self is represented? Please spell this out.

      Thank you for the opportunity to be clearer in our language. We have now spelled out our point more precisely, and included some extra relevant literature helpfully pointed out by another reviewer.

      ‘Computational models have probed social processes in BPD, although almost exclusively during observational learning. The BPD phenotype has been associated with a potential over-reliance on social versus internal cues (Henco et al., 2020), ‘splitting’ of social latent states that encode beliefs about others (Story et al., 2023), negative appraisal of interpersonal experiences with heightened self-blame (Mancinelli et al., 2024), inaccurate inferences about others’ irritability (Hula et al., 2018), and reduced belief adaptation in social learning contexts (Siegel et al., 2020). Associative models have also been adapted to characterize  ‘leaky’ self-other reinforcement learning (Ereira et al., 2018), finding that those with BPD overgeneralize (leak updates) about themselves to others (Story et al., 2024). Altogether, there is currently a gap in the direct causal link between insertion, contagion, and learning (in)stability.’

      (5) P5, first paragraph. The description of the task used in phase 1 should be more detailed. The essential information for understanding the task is missing.

      We have updated this section to point toward Figure 1 and the Methods where the details of the task are more clearly outlined. We hope that it is acceptable not to explain the full task at this point for brevity and to not interrupt the flow of the results.

      ‘Detailed descriptions of the task can be found in the methods section and Figure 1.’

      (6) P5, second paragraph: briefly state how the Psychometric data were acquired (e.g., self-report).

      We have now clarified this in the text.

      ‘All participants also self-reported their trait paranoia, childhood trauma, trust beliefs, and trait mentalizing (see methods).’

      (7) "For example, a participant could make prosocial (self=5; other=5) versus individualistic (self=10; other=5) choices, or prosocial (self=10; other=10) versus competitive (self=10; other=5) choices". Not sure what criteria are used for distinguishing between individualistic and competitive - they look the same?

      Sorry. This paragraph was not clear that the issue is that the interpretation of the choice depends on both members of the pair of options. Here, in one pair {(self=5,other=5) vs (self=10,other=5)}, it is highly pro-social for the self to choose (5,5), sacrificing 5 points for the sake of equality. In the second pair {(self=10,other=10) vs (self=10,other=5)}, it is highly competitive to choose (10,5), denying the other 5 points at no benefit to the self. We have clarified this:

      ‘We analyzed the ‘types’ of choices participants made in each phase (Supplementary Table 1). The interpretation of a participant’s choice depends on both values in a choice. For example, a participant could make prosocial (self=5; other=5) versus individualistic (self=10; other=5) choices, or prosocial (self=10; other=10) versus competitive (self=10; other=5) choices. There were 12 of each pair in phases 1 and 3 (individualistic vs. prosocial; prosocial vs. competitive; individualistic vs. competitive).’  
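The point that a choice can only be interpreted relative to the option that was rejected can be sketched as follows. This is an illustrative helper, not the utility model from the paper: the function `describe_choice` and its output fields are hypothetical, and only the two option pairs come from the example above.

```python
def describe_choice(chosen, rejected):
    """Compare a chosen (self, other) option with the option that was rejected."""
    return {
        # Points the chooser gave up by not taking the rejected option.
        "self_sacrifice": rejected[0] - chosen[0],
        # Points the partner gained (+) or lost (-) relative to the rejected option.
        "other_change": chosen[1] - rejected[1],
        # Absolute payoff gap between chooser and partner in the chosen option.
        "inequality": abs(chosen[0] - chosen[1]),
    }


# Pair 1: choosing (5, 5) over (10, 5) costs the chooser 5 points for equality.
prosocial_pick = describe_choice(chosen=(5, 5), rejected=(10, 5))

# Pair 2: choosing (10, 5) over (10, 10) denies the partner 5 points at no gain to self.
competitive_pick = describe_choice(chosen=(10, 5), rejected=(10, 10))
```

The option (self=10; other=5) appears in both pairs, yet choosing it is individualistic in the first pair and competitive in the second, which is why the label depends on both options.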

      (8) "In phase 1, both CON and BPD participants made prosocial choices over competitive choices with similar frequency (CON=9.67[3.62]; BPD=9.60[3.57])" please report t-test - the same applies also various times below.

      We have now included the t test statistics with each instance.

      ‘In phase 3, both CON and BPD participants continued to make equally frequent prosocial versus competitive choices (CON=9.15[3.91]; BPD=9.38[3.31]; t=-0.54, p=0.59); CON participants continued to make significantly less prosocial versus individualistic choices (CON=2.03[3.45]; BPD=3.78 [4.16]; t=2.31, p=0.02). Both groups chose equally frequent individualistic versus competitive choices (CON=10.91[2.40]; BPD=10.18[2.72]; t=-0.49, p=0.62).’

      (9) P 9: "Models M2 and M3 allow for either self-insertion or social contagion to occur independently" what's the difference between M2 and M3?

      Model M2 hypothesises that participants use their own self representation as priors when learning about the other in phase 2, but are not influenced by their partner. M3 hypothesises that participants form an uncoupled prior (no self-insertion) about their partner in phase 2, and their choices in phase 3 are influenced by observing their partner in phase 2 (social contagion). In Figure 1 we illustrate the difference between M2 and M3. In Table 1 we specifically report the parameterisation differences between M2 and M3. We have also now included a correlational analysis of parameters between models to demonstrate the relationship between model parameters of equivalent value between models (Fig S11). We have also force fitted all models (M1-M4) to the data independently and reported group differences within each (see Table S2 and Table S3).

      (10) P 9, last paragraph: I did not understand the description of the Beta model.

      The beta model is outlined in detail in Table 1. We have also clarified the description of the beta model on page 9:

      ‘The ‘Beta model’ is equivalent to M1 in its causal architecture (both self-insertion and social contagion are hypothesized to occur) but differs in richness: it accommodates the possibility that participants might only consider a single dimension of relative reward allocation, which is typically emphasized in previous studies (e.g., Hula et al., 2018).’

      (11) P 9: I wonder whether one could think about more intuitive labels for the models, rather than M1, M2 etc.. This is just a suggestion, as I am not sure a short label would be feasible here.

      Thank you for this suggestion. We apologise that the labels are not very intuitive. Given the various terms we use to explain the different processes of generalisation that might occur between self and other, and given that each model is a different combination of these, we felt that numbering the models was the lesser evil. We hope that the reader will be able to reference both Figure 1 and Table 1 to get a good feel for how the models and their causal implications differ.

      (12) Model comparison: the information about what was done for model comparison is scant, and little about fit statistics is reported. At the moment, it is hard for a reader to assess the results of the model comparison analysis.

      Model comparison and fitting were conducted using simultaneous hierarchical fitting and random-effects comparison. This is implemented in the HBI package (Piray et al., 2019), where the assumptions and fitting procedures are outlined in great detail. In short, our comparison allows for individual and group-level hierarchical fitting and comparison. This overcomes the issue of interdependence between and within model fitting within a population, which is often estimated separately.

      We have outlined this in the methods, although appreciate we do not touch upon it until the reader reaches that point. We have added a clarification statement on page 9 to rectify this:

      ‘All computational models were fitted using a Hierarchical Bayesian Inference (HBI) algorithm which allows hierarchical parameter estimation while assuming random effects for group and individual model responsibility (Piray et al., 2019; see Methods for more information). We report individual and group-level model responsibility, in addition to protected exceedance probabilities between-groups to assess model dominance.’

      (13) P 14, first paragraph: "BPD participants were also more certain about both types of preference" what are the two types of preferences?

      The two types of preferences are relative (prosocial-competitive) and absolute (individualistic) reward utility. These are expressed as β and α respectively. We have expanded the sentence in question to make this clearer:

      ‘BPD participants were also more certain about both self-preferences for absolute and relative reward ( = -0.89, 95%HDI: -1.01, -0.75; = -0.32, 95%HDI: -0.60, -0.04) versus CON participants (Figure 2B).’

      (14) "Parameter Associations with Reported Trauma, Paranoia, and Attributed Intent" the results reported here are intriguing, but not fully convincing as there is the problem of multiple comparisons. The combinations between parameters and scales are rather numerous. I suggest to correct for multiple comparisons and to flag only the findings that survive correction.

      We have now corrected this and controlled for multiple comparisons through partial correlation analysis, bootstrapping assessment for robustness, permutation testing, and False Discovery Rate correction. We only report those that survive bootstrapping and permutation testing, reporting both corrected (p[fdr]) and uncorrected (p) significance.
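As an illustration of the permutation-testing step, the sketch below estimates a two-sided permutation p-value for a correlation. It uses a plain Pearson correlation rather than the partial correlations described above, and the function names, the number of permutations, and the data are all hypothetical.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any pairing between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # The add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical psychometric scores and model parameters for eight participants.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
params = [0.9, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9]
r_obs = pearson_r(scores, params)
p_perm = permutation_p(scores, params)
```

The resulting uncorrected p-values from several such tests could then be passed through an FDR adjustment to obtain the p[fdr] values reported alongside them.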

      (15) Results page 14 and page 15. The authors compare the various parameters between groups. I would assume that these parameters come from M1 for controls and from M4 for BPD? Please clarify if this is indeed the case. If it is the case, I am not sure this is appropriate. To my knowledge, it is appropriate to compare parameters between groups only if the same model is fit to both groups. If two different models are fit to each group, then the parameters are not comparable, as the parameters have, so to speak, different "meaning" in the two models. Now, I want to stress that my knowledge on this matter may be limited, and that the authors' approach may be sound. However, to be reassured that the approach is indeed sound, I would appreciate a clarification on this point and a reference to relevant sources about this approach.

      This is an important point. First, we confirmed all our main conclusions about parameter differences using the maximal model M1 to fit all the participants. We added Supplementary Table 2 to report the outcome of this analysis. Second, we did the same for parameters across all models M1-M4, fitting each to participants without comparison. This is particularly relevant for M3, since at least a minority of participants of both groups were best explained by this model. We report these analyses in Fig S11:

      Since M4 is nested within M1, we argue that this comparison is still meaningful, and note explanations in the text for why the effects noted between groups may occur given the differences in their causal meaning, for example in the results under phase 2 analyses:

      ‘Belief updating in phase 2 was less flexible in BPD participants. Median change in beliefs (from priors to posteriors) about a partner’s preferences was lower versus CON ( = -5.53, 95%HDI: -7.20, -4.00; = -10.02, 95%HDI: -12.81, -7.30). Posterior beliefs about partner were more precise in BPD versus CON ( = -0.94, 95%HDI: -1.50, -0.45; = -0.70, 95%HDI: -1.20, -0.25). This is unsurprising given the disintegrated priors of the BPD group in M4, meaning they need to ‘travel less’ in state space. Nevertheless, even under assumptions of M1 and M2 for both groups, BPD showed smaller posterior median changes versus CON in phase 2 (see Table T2). These results converge to suggest those with BPD form rigid posterior beliefs.’

      (16) "We built and tested a theory of interpersonal generalization in a population of matched participants" this sentence seems to be unwarranted, as there is no theory in the paper (actually, as it is now, the paper looks rather exploratory)

      We thank the reviewer for their perspective. Formal models can be used as a theoretical statement on the causal algorithmic process underlying decision making and choice behaviour; the development of formal models is an essential theoretical tool for precision and falsification (Haslbeck et al., 2022). In this sense, we have built several competing formal theories that test, using causal architectures, whether the latent distribution(s) that generate one’s choices generalise into one’s predictions about another person, and simultaneously whether one’s latent distribution(s) that represent beliefs about another person are used to inform future choices.

      Reviewer 3:

      ‘My broad question about the experiment (in terms of its clinical and cognitive process relevance): Does the task encourage competition or give participants a reason to take advantage of others? I don't think it does, so it would be useful to clarify the normative account for prosociality in the introduction (e.g., some of Robin Dunbar's work).’

      We agree that our paradigm does not encourage competition. We use a reward structure that requires participants to exceed a particular threshold before earning rewards, but there is no competitive element to this, in that points earned or not earned by partners have no bearing on the outcomes for the participant. This is important given the recursive properties that arise in mixed-motive games; we wanted to focus purely on observational learning in phase 2, and on repercussion-free choices made by participants in phases 1 and 3, meaning that the choices of participants, and the decisions of a partner, are theoretically in line with self-preferences irrespective of the judgement of others. We have included a clearer statement of the structure of this type of task, and more clearly cited the origin for its structure (Murphy & Ackerman, 2011):

      ‘Our present work sought to achieve two primary goals. 1. Extend prior causal computational theories to formalise and test the interrelation between self-insertion and social contagion on learning and behaviour to better probe interpersonal generalisation in health, and 2., Test whether previous computational findings of social learning changes in BPD can be explained by infractions to self-other generalisation. We accomplish these goals by using a dynamic, sequential social value economic paradigm, the Intentions Game, building upon a Social Value Orientation Framework (Murphy & Ackerman, 2011) that assumes motivational variation in joint reward allocation.’

      Given the introduction's structure as it stands, we felt that providing another paragraph on the normative assumptions of such a game was outside the scope of this article.

      ‘The finding that individuals with BPD do not engage in self-other generalization on this task of social intentions is novel and potentially clinically relevant. The authors find that BPD participants' tendency to be prosocial when splitting points with a partner does not transfer into their expectations of how a partner will treat them in a task where they are the passive recipient of points chosen by the partner. In the discussion, the authors reasonably focus on model differences between groups (Bayesian model comparison), yet I thought this finding -- BPD participants not assuming prosocial tendencies in phase 2 while CON participant did -- merited greater attention. Although the BPD group was close to 0 on the \beta prior in Phase 2, their difference from CON is still in the direction of being more mistrustful (or at least not assuming prosociality). This may line up with broader clinical literature on mistrustfulness and attributions of malevolence in the BPD literature (e.g., a 1992 paper by Nigg et al. in Journal of Abnormal Psychology). My broad point is to consider further the Phase 2 findings in terms of the clinical interpretation of the shift in \beta relative to controls.’

      This is an important point, which we contextualize within the parameterisation of our utility model. While the shift toward 0 in the BPD participants is indeed more competitive, as the reviewer notes, it is surprisingly centred closely around 0, with only a slight bias to be prosocial (mean = -0.47;  = -6.10, 95%HDI: -7.60, -4.60). Charitably, we might argue that BPD participants are expecting more competitive preferences from their partner. Even so, given the variance around their priors in phase 2, they are uncertain or unconfident about this. We take a more conservative approach in the paper and say that, given the tight proximity to 0 and the variance of their group priors, they are likely to be ‘hedging their bets’ on whether their partner is going to be prosocial or competitive. While the movement from phase 1 to 2 is indeed in the competitive direction, it still lands in neutral territory. Model M4 does not preclude central tendencies at the start of Phase 2 being more in the competitive direction.

      ‘First, the authors note that they have "proposed a theory with testable predictions" (p. 4 but also elsewhere) but they do not state any clear predictions in the introduction, nor do they consider what sort of patterns will be observed in the BPD group in view of extant clinical and computational literature. Rather, the paper seems to be somewhat exploratory, largely looking at group differences (BPD vs. CON) on all of the shared computational parameters and additional indices such as belief updating and reaction times. Given this, I would suggest that the authors make stronger connections between extant research on intention representation in BPD and their framework (model and paradigm). In particular, the authors do not address related findings from Ereira (2020) and Story (2024) finding that in a false belief task that BPD participants *overgeneralize* from self to other. A critical comparison of this work to the present study, including an examination of the two tasks differ in the processes they measure, is important.’

      Thank you for this opportunity to include more of the important work that has preceded the present manuscript. Prior work has tended to focus on either descriptive explanations of self-other generalisation (e.g. through the use of RW type models) or has focused on observational learning instability in the absence of a causal model of where initial self-other beliefs may arise. While the prior work cited by the reviewer [Ereira (2020; Nat. Comms.) and Story (2024; Trans. Psych.)] does examine the inter-trial updating between self and other, it does not integrate a self model into a self’s belief about an other prior to observation. Rather, it focuses almost exclusively on prediction error ‘leakage’ generated during learning about individual reward (i.e. one-sided reward). These findings are important, but lie in a slightly different domain. They also do not cut against ours, and in fact, we argue in the discussion that the sort of learning instability described above and splitting (as we cite from Story et al., 2024; Psych. Rev.) may result from a lack of the self-anchoring typical of CON participants. Nevertheless, we agree these works provide an important premise to contrast and set the groundwork for our present analysis, and we have included them in the framing of our introduction, as well as contrasting them to our data in the discussion.

      In the introduction:

      ‘The BPD phenotype has been associated with a potential over-reliance on social versus internal cues (Henco et al., 2020), ‘splitting’ of social latent states that encode beliefs about others (Story et al., 2023), negative appraisal of interpersonal experiences with heightened self-blame (Mancinelli et al., 2024), inaccurate inferences about others’ irritability (Hula et al., 2018), and reduced belief adaptation in social learning contexts (Siegel et al., 2020). Associative models have also been adapted to characterize  ‘leaky’ self-other reinforcement learning (Ereira et al., 2018), finding that those with BPD overgeneralize (leak updates) about themselves to others (Story et al., 2024). Altogether, there is currently a gap in the direct causal link between insertion, contagion, and learning (in)stability.’

      In the discussion:

      ‘Disruptions in self-to-other generalization provide an explanation for previous computational findings related to task-based mentalizing in BPD. Studies tracking observational mentalizing reveal that individuals with BPD, compared to those without, place greater emphasis on social over internal reward cues when learning (Henco et al., 2020; Fineberg et al., 2018). Those with BPD have been shown to exhibit reduced belief adaptation (Siegel et al., 2020) along with ‘splitting’ of latent social representations (Story et al., 2024a). BPD is also shown to be associated with overgeneralisation in self-to-other belief updates about individual outcomes when using a one-sided reward structure (where participant responses had no bearing on outcomes for the partner; Story et al., 2024b). Our analyses show that those with BPD are equal to controls in their generalisation of absolute reward (outcomes that only affect one player) but disintegrate beliefs about relative reward (outcomes that affect both players) through adoption of a new, neutral belief. We interpret this together in two ways: 1. There is a strong concern about social relativity when those with BPD form beliefs about others, 2. The absence of constrained self-insertion about relative outcomes may predispose to brittle or ‘split’ beliefs. In other words, those with BPD assume ambiguity about the social relativity preferences of another (i.e. how prosocial or punitive) and are quicker to settle on an explanation to resolve this. Although self-insertion may be counter-intuitive to rational belief formation, it has important implications for sustaining adaptive, trusting social bonds via information moderation.’

      ‘In addition, perhaps it is fairer to note more explicitly the exploratory nature of this work. Although the analyses are thorough, many of them are not argued for a priori (e.g., rate of belief updating in Figure 2C) and the reader amasses many individual findings that need to be synthesized.’

      We have now noted the primary goals of our work in the introduction, and have included caveats about the exploratory nature of our analyses. We would note that our model is in effect a causal combination of prior work cited within the introduction (Barnby et al., 2022; Moutoussis et al., 2016). This renders our computational models in effect a causal theory to test, although we agree that our dissection of the results is exploratory. We have more clearly signposted this:

      ‘Our present work sought to achieve two primary goals. 1. Extend prior causal computational theories to formalise and test the interrelation between self-insertion and social contagion on learning and behaviour to better probe interpersonal generalisation in health, and 2., Test whether previous computational findings of social learning changes in BPD can be explained by infractions to self-other generalisation. We accomplish these goals by using a dynamic, sequential economic paradigm, the Intentions Game, building upon a Social Value Orientation Framework (Murphy & Ackerman, 2011) that assumes innate motivational variation in joint reward allocation.’

      ‘Second, in the discussion, the authors are too quick to generalize to broad clinical phenomena in BPD that are not directly connected to the task at hand. For example, on p. 22: "Those with a diagnosis of BPD also show reduced permeability in generalising from other to self. While prior research has predominantly focused on how those with BPD use information to form impressions, it has not typically examined whether these impressions affect the self." Here, it's not self-representation per se (typically, identity or one's view of oneself), but instead cooperation and prosocial tendencies in an economic context. It is important to clarify what clinical phenomena may be closely related to the task and which are more distal and perhaps should not be approached here.’

      Thank you for this important point. We agree that social value orientation, particularly in this economically assessed form, is but one aspect of the self, and we did not test any others. A version of the social contagion phenomenon is also present in other aspects of the self in intertemporal (Moutoussis et al., 2016), economic (Suzuki et al., 2016) and moral preferences (Yu et al., 2021). It would be most interesting to attempt to correlate the degrees of insertion and contagion across the different tasks.

      We take seriously the wider concern that behaviour in our tasks based on economic preferences may not have clinical validity. This issue is central in the whole field of computational psychiatry, much of which is based on generalizing from tasks like ours, and discussing correlations with psychometric measures. We hope that it is acceptable to leave such discussions to the many reviews on computational psychiatry (Montague et al., 2012; Hitchcock et al., 2022; Huys et al., 2016). Here, we have just put a caveat in the discussion:

      ‘Finally, a limitation may be that behaviour in tasks based on economic preferences may not have clinical validity. This issue is central to the field of computational psychiatry, much of which is based on generalising from tasks like that within this paper and discussing correlations with psychometric measures. Extrapolating  economic tasks into the real world has been the topic of discussion for the many reviews on computational psychiatry (e.g. Montague et al., 2012; Hitchcock et al., 2022; Huys et al., 2016). We note a strength of this work is the use of model comparison to understand causal algorithmic differences between those with BPD and matched healthy controls. Nevertheless, we wish to further pursue how latent characteristics captured in our models may directly relate to real-world affective change.’

      ‘On a more technical level, I had two primary concerns. First, although the authors consider alternative models within a hierarchical Bayesian framework, some challenges arise when one analyzes parameter estimates fit separately to two groups, particularly when the best-fitting model is not shared. In particular, although the authors conduct a model confusion analysis, they do not as far I could tell (and apologies if I missed it) demonstrate that the dynamics of one model are nested within the other. Given that M4 has free parameters governing the expectations on the absolute and relative reward preferences in Phase 2, is it necessarily the case that the shared parameters between M1 and M4 can be interpreted on the same scale? Relatedly, group-specific model fitting has virtues when believes there to be two distinct populations, but there is also a risk of overfitting potentially irrelevant sample characteristics when parameters are fit group by group.

      To resolve these issues, I saw one straightforward solution (though in modeling, my experience is that what seems straightforward on first glance may not be so upon further investigation). M1 assumes that participants' own preferences (posterior central tendency) in Phase 1 directly transfer to priors in Phase 2, but presumably the degree of transfer could vary somewhat without meriting an entirely new model (i.e., the authors currently place this question in terms of model selection, not within-model parameter variation). I would suggest that the authors consider a model parameterization fit to the full dataset (both groups) that contains free parameters capturing the *deviations* in the priors relative to the preceding phase's posterior. That is, the free parameters $\bar{\alpha}_{par}^m$ and $\bar{\beta}_{par}^m$ govern the central tendency of the Phase 2 prior parameter distributions directly, but could be reparametrized as deviations from Phase 1 $\theta^m_{ppt}$ parameters in an additive form. This allows for a single model to be fit all participants that encompasses the dynamics of interest such that between-group parameter comparisons are not biased by the strong assumptions imposed by M1 (that phase 1 preferences and phase 2 observations directly transfer to priors). In the case of controls, we would expect these deviation parameters to be centred on 0 insofar as the current M1 fit them best, whereas for BPD participants should have significant deviations from earlier-phase posteriors (e.g., the shift in \beta toward prior neutrality in phase 2 compared to one's own prosociality in phase 1). 
      I think it's still valid for the authors to argue for stronger model constraints for Bayesian model comparison, as they do now, but inferences regarding parameter estimates should ideally be based on a model that can encompass the full dynamics of the entire sample, with simpler dynamics (like posterior -> prior transfer) being captured by near-zero parameter estimates.’
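
      The additive reparameterisation suggested in the comment above could be written as follows. This is only a sketch: the deviation terms $\delta_{\alpha}$ and $\delta_{\beta}$ are hypothetical symbols introduced here, and $\alpha_{ppt}^{m}$ and $\beta_{ppt}^{m}$ are assumed notation for the participant's Phase 1 posterior central tendencies (the components of $\theta^m_{ppt}$).

```latex
\begin{aligned}
\bar{\alpha}_{par}^{m} &= \alpha_{ppt}^{m} + \delta_{\alpha}, \\
\bar{\beta}_{par}^{m}  &= \beta_{ppt}^{m} + \delta_{\beta}.
\end{aligned}
```

      Under this form, $\delta_{\alpha} = \delta_{\beta} = 0$ corresponds to the direct posterior-to-prior transfer assumed by M1, while a non-zero $\delta_{\beta}$ would capture a shift of the Phase 2 prior away from one's own Phase 1 preference.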

      Thank you for the chance to be clearer in our modelling. In particular, the suggestion to include a model that can be fit to all participants with the equivalent of partial social insertion, to check whether the results stand, can actually be accomplished through our existing models. That is, the parameter that governs the flexibility over beliefs in phase 2 under models M1 (dominant for CON participants) and M2 parameterises the degree to which participants think their partner may be different from themselves. Thus, forcibly fitting M1 and M2 hierarchically to all participants, and then separately to BPD and CON participants, can quantify the issue raised: if BPD participants indeed distinguish partners as different enough from themselves to warrant a new central tendency, this flexibility parameter should be quantitatively higher in BPD vs CON participants under M1 and M2.

      We therefore tested this, reporting the distributional differences in this flexibility parameter between BPD and CON participants under M1, both when fitted together as a population and as separate groups. As it is higher for BPD participants under both conditions for M1 and M2, this supports our claim and adds context for the comparison: the flexibility may be large enough in BPD that a new central tendency to anchor beliefs is a more parsimonious explanation.

      We cross-checked this result by assessing the discrepancy between the participant’s and assumed partner’s central tendencies for both prosocial and individualistic preferences via the best-fitting model M4 for the BPD group. We thereby examined whether belief disintegration is uniform across preferences (relative vs absolute reward) or whether one tendency was shifted dramatically more than another. We found that beliefs over prosocial-competitive preferences were dramatically shifted, whereas those over individualistic preferences were not.

      We have added the following to the main text results to explain this:

      Model Comparison:

      ‘We found that CON participants were best fit at the group level by M1 (Frequency = 0.59, Protected Exceedance Probability = 0.98), whereas BPD participants were best fit by M4 (Frequency = 0.54, Protected Exceedance Probability = 0.86; Figure 2A). We first analyse the results of these separate fits. Later, in order to assuage concerns about drawing inferences from different models, we examined the relationships between the relevant parameters when we forced all participants to be fit to each of the models (in a hierarchical manner, separated by group). In sum, our model comparison is supported by convergence in parameter values when comparisons are meaningful. We refer to both types of analysis below.’

      Phase 1:

      ‘These differences were replicated when considering parameters between groups when we fit all participants to the same models (M1-M4; see Table S2).’

      Phase 2:

      ‘To check that these conclusions about self-insertion did not depend on the different models, we found that only under M1 and M2 were consistently larger in BPD versus CON. This supports the notion that new central tendencies for BPD participants in phase 2 were required, driven by expectations about a partner’s relative reward. (see Fig S10 & Table S2). and parameters under assumptions of M1 and M2 were strongly correlated with median change in belief between phase 1 and 2 under M3 and M4, suggesting convergence in outcome (Fig S11).’

      ‘Furthermore, even under assumptions of M1-M4 for both groups, BPD showed smaller posterior median changes versus CON in phase 2 (see Table T2). These results converge to suggest those with BPD form rigid posterior beliefs.’

      ‘Assessing this same relationship under M1- and M2-only assumptions reveals a replication of this group effect for absolute reward, but the effect is reversed for relative reward (see Table S3). This accords with the context of each model, where under M1 and M2, BPD participants had larger phase 2 prior flexibility over relative reward (leading to larger initial surprise), which was better accounted for by a new central tendency under M4 during model comparison. When comparing both groups under M1-M4 informational surprise over absolute reward was consistently restricted in BPD (Table S3), suggesting a diminished weight of this preference when forming beliefs about an other.’

      Phase 3

      ‘In the dominant model for the BPD group—M4—participants are not influenced in their phase 3 choices following exposure to their partner in phase 2. To further confirm this we also analysed absolute change in median participant beliefs between phase 1 and 3 under the assumption that M1 and M3 was the dominant model for both groups (that allow for contagion to occur). This analysis aligns with our primary model comparison using M1 for CON and M4 for BPD  (Figure 2C). CON participants altered their median beliefs between phase 1 and 3 more than BPD participants (M1: linear estimate = 0.67, 95%CI: 0.16, 1.19; t = 2.57, p = 0.011; M3: linear estimate = 1.75, 95%CI: 0.73, 2.79; t = 3.36, p < 0.001). Relative reward was overall more susceptible to contagion versus absolute reward (M1: linear estimate = 1.40, 95%CI: 0.88, 1.92; t = 5.34, p<0.001; M3: linear estimate = 2.60, 95%CI: 1.57, 3.63; t = 4.98, p < 0.001). There was an interaction between group and belief type under M3 but not M1 (M3: linear estimate = 2.13, 95%CI: 0.09, 4.18, t = 2.06, p=0.041). There was only a main effect of belief type on precision under M3 (linear estimate = 0.47, 95%CI: 0.07, 0.87, t = 2.34, p = 0.02); relative reward preferences became more precise across the board. Derived model estimates of preference change between phase 1 and 3 strongly correlated between M1 and M3 along both belief types (see Table S2 and Fig S11).’

      ‘My second concern pertains to the psychometric individual difference analyses. These were not clearly justified in the introduction, though I agree that they could offer potentially meaningful insight into which scales may be most related to model parameters of interest. So, perhaps these should be earmarked as exploratory and/or more clearly argued for. Crucially, however, these analyses appear to have been conducted on the full sample without considering the group structure. Indeed, many of the scales on which there are sizable group differences are also those that show correlations with psychometric scales. So, in essence, it is unclear whether most of these analyses are simply recapitulating the between-group tests reported earlier in the paper or offer additional insights. I think it's hard to have one's cake and eat it, too, in this regard and would suggest the authors review Preacher et al. 2005, Psychological Methods for additional detail. One solution might be to always include group as a binary covariate in the symptom dimension-parameter analyses, essentially partialing the correlations for group status. I remain skeptical regarding whether there is additional signal in these analyses, but such controls could convince the reader. Nevertheless, without such adjustments, I would caution against any transdiagnostic interpretations such as this one in the Highlights: "Higher reported childhood trauma, paranoia, and poorer trait mentalizing all diminish other-to-self information transfer irrespective of diagnosis." Since many of these analyses relate to scales on which the groups differ, the transdiagnostic relevance remains to be demonstrated.’
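
      The adjustment proposed in the comment above, partialling group status out of a symptom-parameter correlation, can be illustrated with a minimal sketch. The function names and the toy data are hypothetical, and this is not the analysis reported in the paper; it only shows how a correlation driven entirely by a group difference disappears once group is included as a binary covariate.

```python
import numpy as np

def residualize(y, covariate):
    """Residuals of y after least-squares regression on an intercept and one covariate."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(covariate, dtype=float)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

def partial_corr(x, y, group):
    """Pearson correlation of x and y after removing the linear effect of group."""
    return float(np.corrcoef(residualize(x, group), residualize(y, group))[0, 1])

# Toy data (hypothetical): both variables differ by group but are unrelated within group.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scale = np.array([1, 2, 1, 2, 11, 12, 11, 12], dtype=float)   # e.g. a psychometric score
param = np.array([1, 1, 2, 2, 11, 11, 12, 12], dtype=float)   # e.g. a model parameter

raw_r = float(np.corrcoef(scale, param)[0, 1])
adj_r = partial_corr(scale, param, group)
print(f"raw r = {raw_r:.2f}, group-adjusted r = {adj_r:.2f}")
```

      In the toy data the raw correlation is close to 1 while the group-adjusted correlation is zero, which is the situation the reviewer warns about.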

      We have restructured the psychometric section to ensure transparency and clarity in our analysis. Namely, in response to these comments and those of the other reviewers, we have opted to remove the parameter analyses that aimed to cross-correlate psychometric scores with latent parameters from different models: as the reviewer points out, we do not have parity between dominant models for each group to warrant this, and fitting the same model to both groups artificially makes the parameters qualitatively different. Instead we have opted to focus on social contagion, or rather restrictions on , between phases 1 and 3 explained by M3. This provides us with an opportunity to examine social contagion on the whole population level isolated from self-insertion biases. We performed bootstrapping (1000 reps) and permutation testing (1000 reps) to assess the stability and significance of each edge in the partial correlation network, and then applied FDR correction (p[fdr]), thus controlling for multiple comparisons. We note that while we focused on M3 to isolate the effect across the population, social contagion across both relative and absolute reward under M3 strongly correlated with social contagion under M1 (see Fig S11).

      ‘We explored whether social contagion may be restricted as a result of trauma, paranoia, and less effective trait mentalizing under the assumption of M3 for all participants (where everyone is able to be influenced by their partner). To note, social contagion under M3 was highly correlated with contagion under M1 (see Fig S11). We conducted partial correlation analysis to estimate relationships conditional on all other associations and retained all that survived bootstrapping (1000 reps), permutation testing (1000 reps), and subsequent FDR correction. Persecution and CTQ scores were both moderately associated with MZQ scores (RGPTSB r = 0.41, 95%CI: 0.23, 0.60, p = 0.004, p[fdr]=0.043; CTQ r = 0.354 95%CI: 0.13, 0.56, p=0.019, p[fdr]=0.02). MZQ scores were in turn moderately and negatively associated with shifts in prosocial-competitive preferences () between phase 1 and 3 (r = -0.26, 95%CI: -0.46, -0.06, p=0.026, p[fdr]=0.043). CTQ scores were also directly and negatively associated with shifts in individualistic preferences (; r = -0.24, 95%CI: -0.44, -0.13, p=0.052, p[fdr]=0.065). This provides some preliminary evidence that trauma impacts beliefs about individualism directly, whereas trauma and persecutory beliefs impact beliefs about prosociality through impaired mentalising (Figure 4A).’
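
      The testing steps described above (permutation testing followed by FDR correction) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a plain Pearson correlation rather than the partial correlation network used in the paper, the Benjamini-Hochberg procedure is assumed as the FDR method, and all function names are hypothetical.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    # Shuffling y breaks any x-y association and builds the null distribution.
    null = np.array([np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_perm)])
    # The add-one correction keeps the p-value strictly above zero.
    p = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)
    return float(observed), float(p)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Usage: adjust a set of edge-wise p-values (illustrative numbers only).
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

      A bootstrap step for edge stability would follow the same pattern, resampling participants with replacement instead of shuffling one variable.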

      (1) As far as I could tell, the authors didn't provide an explanation of this finding on page 5: "However, CON participants made significantly fewer prosocial choices when individualistic choices were available" While one shouldn't be forced to interpret every finding, the paper is already in that direction and I found this finding to be potentially relevant to the BPD-control comparison.

      Thank you for this observation. This sentence reports the fact that CON participants were effectively more selfish than BPD participants. This is captured by the lower value of reported in Figure 2, and suggests that CON participants were more focused on absolute value – acting in a more ‘economically rational’ manner – versus BPD participants. This fits in with our fourth paragraph of the discussion where we discuss prior work that demonstrates a heightened social focus in those with BPD. Indeed, the finding the reviewer highlights further emphasises the point that those with BPD are much more sensitive to, and motivated to choose, options concerning relative reward than are CON participants. The text in the discussion reads:

      ‘We also observe this in self-generated participant choice behaviour, where CON participants were more concerned over absolute reward versus their BPD counterparts, suggesting a heightened focus on relative vs. absolute reward in those with BPD.’

      (2) The adaptive algorithm for adjusting partner behavior in Phase 2 was clever and effective. Did the authors conduct a manipulation check to demonstrate that the matching resulted in approximately 50% difference between one's behavior in Phase 1 and the partner in Phase 2? Perhaps Supplementary Figure suffices, but I wondered about a simpler metric.

      Thanks for this point. We highlight this in Figure 3B and within the same figure legend, although we appreciate that the panel is quite small and may be missed. We have now highlighted this manipulation check more clearly in the behavioural analysis section of the main text:

      ‘Server matching between participant and partner in phase 2 was successful, with participants being approximately 50% different to their partners with respect to the choices each would have made on each trial in phase 2 (mean similarity=0.49, SD=0.12).’
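
      The similarity metric reported above can be illustrated with a short sketch. It assumes that similarity is the proportion of phase 2 trials on which the participant's and the partner's choices coincide; the array layout (rows are participants, columns are trials) and the toy values are hypothetical.

```python
import numpy as np

def choice_similarity(participant_choices, partner_choices):
    """Per-participant proportion of trials on which the two choice sequences agree."""
    a = np.asarray(participant_choices)
    b = np.asarray(partner_choices)
    if a.shape != b.shape:
        raise ValueError("choice arrays must have the same shape")
    return np.mean(a == b, axis=1)

# Toy data (hypothetical): 2 participants, 4 trials, options coded 1 or 2.
participant = np.array([[1, 1, 2, 2],
                        [1, 2, 1, 2]])
partner = np.array([[1, 2, 2, 1],
                    [2, 2, 1, 1]])

similarity = choice_similarity(participant, partner)
print(f"mean similarity = {similarity.mean():.2f}, SD = {similarity.std():.2f}")
```

      A mean near 0.5 across participants would indicate that the matching produced partners who differ on roughly half of the trials.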

      (3) The resolution of point-range plots in Figure 4 was grainy. Perhaps it's not so in the separate figure file, but I'd suggest checking.

      Apologies. We have now updated and reorganised the figure to improve clarity.

      (4) p. 21: Suggest changing to "different" as opposed to "opposite" since the strategies are not truly opposing: "but employed opposite strategies."

      We have amended this.

      (5) p. 21: I found this sentence unclear, particularly the idea of "similar updating regime." I'd suggest clarifying: "In phase 2, CON participants exhibited greater belief sensitivity to new information during observational learning, eventually adopting a similar updating regime to those with BPD."

      We have clarified this statement:

      ‘In observational learning in phase 2, CON participants initially updated their beliefs in response to new information more quickly than those with BPD, but eventually converged to a similar rate of updating.’

      (6) p. 23: The content regarding psychosis seemed out of place, particularly as the concluding remark. I'd suggest keeping the focus on the clinical population under investigation. If you'd like to mention the paradigm's relevance to psychosis (which I think could be omitted), perhaps include this as a future direction when describing the paradigm's strengths above.

      We agree the paragraph is somewhat speculative. We have omitted it in aid of keeping the messaging succinct and to the point.

      (7) p. 24: Was BPD diagnosis assessed using unstructured clinical interview? Although psychosis was exclusionary, what about recent manic or hypomanic episodes or Bipolar diagnosis? A bit more detail about BPD sample ascertainment would be useful, including any instruments used to make a diagnosis and information about whether you measured inter-rater agreement.

      Participants diagnosed with BPD were recruited from specialist personality disorder services across various London NHS mental health trusts. The diagnosis of BPD was established by trained assessors at the clinical services and confirmed using the Structured Clinical Interview for DSM-IV (SCID-II) (First et al., 1997). Individuals with a history of psychotic episodes, severe learning disability or neurological illness/trauma were excluded. We have now included this extra detail within our methods in the paper:

      ‘The majority of BPD participants were recruited through referrals by psychiatrists, psychotherapists, and trainee clinical psychologists within personality disorder services across 9 NHS Foundation Trusts in London, and 3 NHS Foundation Trusts across England (Devon, Merseyside, Cambridgeshire). Four BPD participants were also recruited by self-referral through the UCLH website, where the study was advertised. To be included in the study, all participants needed to have, or meet criteria for, a primary diagnosis of BPD (or emotionally-unstable personality disorder or complex emotional needs) based on a professional clinical assessment conducted by the referring NHS trust (for self-referrals, the presence of a recent diagnosis was ascertained through thorough discussion with the participant, whereby two of the four also provided clinical notes). The patient participants also had to be under the care of the referring trust or have a general practitioner whose details they were willing to provide. Individuals with psychotic or mood disorders, recent acute psychotic episodes, severe learning disability, or current or past neurological disorders were not eligible for participation and were therefore not referred by the clinical trusts.’

    1. Current research has shown that the interactive use of a smartphone, computer, or video game console in the hour before bedtime increases the likelihood of both reported difficulty falling asleep and having unrefreshing sleep.

      This is something that hits home for me. I am guilty of scrolling social media on my way to sleep and I have noticed less quality sleep. I would really like to practice not being on my phone to see if I can't improve my sleep.

    2. Current research has shown that the interactive use of a smartphone, computer, or video game console in the hour before bedtime increases the likelihood of both reported difficulty falling asleep and having unrefreshing sleep.

      This comment felt very real. As a person who uses their phone before they go to sleep, I can agree that this is true. When you use your phone to "wind down" you are only amping yourself up. When I talk to my significant other before falling asleep and put my phone down, I tend to sleep much better. I do not have a hundred thoughts running through my head about worldly problems, games, or notifications. I think it may even be a good idea to place your phone in another room before going to sleep.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      The authors present a substantial improvement to their existing tool, MorphoNet, intended to facilitate assessment of 3D+t cell segmentation and tracking results, and curation of high-quality analysis for scientific discovery and data sharing. These tools are provided through a user-friendly GUI, making them accessible to biologists who are not experienced coders. Further, the authors have re-developed this tool to be a locally installed piece of software instead of a web interface, making the analysis and rendering of large 3D+t datasets more computationally efficient. The authors evidence the value of this tool with a series of use cases, in which they apply different features of the software to existing datasets and show the improvement to the segmentation and tracking achieved. 

      While the computational tools packaged in this software are familiar to readers (e.g., cellpose), the novel contribution of this work is the focus on error correction. The MorphoNet 2.0 software helps users identify where their candidate segmentation and/or tracking may be incorrect. The authors then provide existing tools in a single user-friendly package, lowering the threshold of skill required for users to get maximal value from these existing tools. To help users apply these tools effectively, the authors introduce a number of unsupervised quality metrics that can be applied to a segmentation candidate to identify masks and regions where the segmentation results are noticeably different from the majority of the image. 

      This work is valuable to researchers who are working with cell microscopy data that requires high-quality segmentation and tracking, particularly if their data are 3D time-lapse and thus challenging to segment and assess. The MorphoNet 2.0 tool that the authors present is intended to make the iterative process of segmentation, quality assessment, and re-processing easier and more streamlined, combining commonly used tools into a single user interface.   

      We sincerely thank the reviewer for their thorough and encouraging evaluation of our work. We are grateful that they highlighted both the technical improvements of MorphoNet 2.0 and its potential impact for the broader community working with complex 3D+t microscopy datasets. We particularly appreciate the recognition of our efforts to make advanced segmentation and tracking tools accessible to non-expert users through a user-friendly and locally installable interface, and for pointing out the importance of error detection and correction in the iterative analysis workflow. The reviewer’s appreciation of the value of integrating unsupervised quality metrics to support this process is especially meaningful to us, as this was a central motivation behind the development of MorphoNet 2.0. We hope the tool will indeed facilitate more rigorous and reproducible analyses, and we are encouraged by the reviewer’s positive assessment of its utility for the community.

      One of the key contributions of the work is the unsupervised metrics that MorphoNet 2.0 offers for segmentation quality assessment. These metrics are used in the use cases to identify low-quality instances of segmentation in the provided datasets, so that they can be improved with plugins directly in MorphoNet 2.0. However, not enough consideration is given to demonstrating that optimizing these metrics leads to an improvement in segmentation quality. For example, in Use Case 1, the authors report their metrics of interest (Intensity offset, Intensity border variation, and Nuclei volume) for the uncurated silver truth, the partially curated and fully curated datasets, but this does not evidence an improvement in the results. Additional plotting of the distribution of these metrics on the Gold Truth data could help confirm that the distribution of these metrics now better matches the expected distribution. 

      Similarly, in Use Case 2, visual inspection leads us to believe that the segmentation generated by the Cellpose + Deli pipeline (shown in Figure 4d) is an improvement, but a direct comparison of agreement between segmented masks and masks in the published data (where the segmentations overlap) would further evidence this. 

      We agree that demonstrating the correlation between metric optimization and real segmentation improvement is essential. We have added new analysis comparing the distributions of the unsupervised metrics with the gold truth data before and after curation. Additionally, we provided overlap scores where ground truth annotations are available, confirming the improvement. We also explicitly discussed the limitation of relying solely on unsupervised metrics without complementary validation.

      We would appreciate the authors addressing the risk of decreasing the quality of the segmentations by applying circular logic with their tool; MorphoNet 2.0 uses unsupervised metrics to identify masks that do not fit the typical distribution. A model such as StarDist can be trained on the "good" masks to generate more masks that match the most common type. This leads to a more homogeneous segmentation quality, without consideration for whether these metrics actually optimize the segmentation.

      We thank the reviewer for this important and insightful comment. It raises a crucial point regarding the risk of circular logic in our segmentation pipeline. Indeed, relying on unsupervised metrics to select “good” masks and using them to train a model like StarDist could lead to reinforcing a particular distribution of shapes or sizes, potentially filtering out biologically relevant variability. This homogenization may improve consistency with the chosen metrics, but not necessarily with the true underlying structures.

      We fully agree that this is a key limitation to be aware of. We have revised the manuscript to explicitly discuss this risk, emphasizing that while our approach may help improve segmentation quality according to specific criteria, it should be complemented with biological validation and, when possible, expert input to ensure that important but rare phenotypes are not excluded.

      In Use case 5, the authors include details that the errors were corrected by "264 MorphoNet plugin actions ... in 8 hours actions [sic]". The work would benefit from explaining whether this is 8 hours of human work, trying plugins and iteratively improving, or 8 hours of compute time to apply the selected plugins. 

      We clarified that the “8 hours” refer to human interaction time, including exploration, testing, and iterative correction using plugins. 

      Reviewer #2 (Public review):

      Summary: 

      This article presents MorphoNet 2.0, a software designed to visualise and curate segmentations of 3D and 3D+t data. The authors demonstrate its capabilities on five published datasets, showcasing how even small segmentation errors can be automatically detected, easily assessed, and corrected by the user. This allows for more reliable ground truths, which will in turn be very valuable for analysis and for training deep learning models. MorphoNet 2.0 offers intuitive 3D inspection and functionalities accessible to a non-coding audience, thereby broadening its impact. 

      Strengths: 

      The work proposed in this article is expected to be of great interest to the community by enabling easy visualisation and correction of complex 3D(+t) datasets. Moreover, the article is clear and well written, making MorphoNet more likely to be used. The goals are clearly defined, addressing an undeniable need in the bioimage analysis community. The authors use a diverse range of datasets, successfully demonstrating the versatility of the software. 

      We would also like to highlight the great effort that was made to clearly explain which type of computer configurations are necessary to run the different datasets and how to find the appropriate documentation according to your needs. The authors clearly carefully thought about these two important problems and came up with very satisfactory solutions. 

      We would like to sincerely thank the reviewer for their positive and thoughtful feedback. We are especially grateful that they acknowledged the clarity of the manuscript and the potential value of MorphoNet 2.0 for the community, particularly in facilitating the visualization and correction of complex 3D(+t) datasets. We also appreciate the reviewer’s recognition of our efforts to provide detailed guidance on hardware requirements and access to documentation—two aspects we consider crucial to ensuring the tool is both usable and widely adopted. Their comments are very encouraging and reinforce our commitment to making MorphoNet 2.0 as accessible and practical as possible for a broad range of users in the bioimage analysis community.

      Weaknesses: 

      There is still one concern: the quantification of the improvement of the segmentations in the use cases and, therefore, the quantification of the potential impact of the software. While it appears hard to quantify the quality of the correction, the proposed work would be significantly improved if such metrics could be provided. 

      The authors show some distributions of metrics before and after segmentations to highlight the changes. This is a great start, but there seem to be two shortcomings: first, the comparison and interpretation of the different distributions does not appear to be trivial. It is therefore difficult to judge the quality of the improvement from these. Maybe an explanation in the text of how to interpret the differences between the distributions could help. A second shortcoming is that the before/after metrics displayed are the metrics used to guide the correction, so, by design, the scores will improve, but does that accurately represent the improvement of the segmentation? It seems to be the case, but it would be nice to maybe have a better assessment of the improvement of the quality. 

      We thank the reviewer for this constructive and important comment. We fully agreed that assessing the true quality improvement of segmentation after correction is a central and challenging issue. While we initially focused on changes in the unsupervised quality metrics to illustrate the effect of the correction, we acknowledged that interpreting these distributions was not always straightforward, and that relying solely on the metrics used to guide the correction introduced an inherent bias in the evaluation.

      To address the first point, we revised the manuscript to provide clearer guidance on how to interpret the changes in metric distributions before and after correction, with additional examples to make this interpretation more intuitive.

      Regarding the second point, we agreed that using independent, external validation was necessary to confirm that the segmentation had genuinely improved. To this end, we included additional assessments using complementary evaluation strategies on selected datasets where ground truth was accessible, to compare pre- and post-correction segmentations with an independent reference. These results reinforced the idea that the corrections guided by unsupervised metrics generally led to more accurate segmentations, but we also emphasized their limitations and the need for biological validation in real-world cases.

      Reviewer #3 (Public review): 

      Summary: 

      A very thorough technical report of a new standalone, open-source software for microscopy image processing and analysis (MorphoNet 2.0), with a particular emphasis on automated segmentation and its curation to obtain accurate results even with very complex 3D stacks, including timelapse experiments. 

      Strengths: 

      The authors did a good job of explaining the advantages of MorphoNet 2.0, as compared to its previous web-based version and to other software with similar capabilities. What I particularly found more useful to actually envisage these claimed advantages is the five examples used to illustrate the power of the software (based on a combination of Python scripting and the 3D game engine Unity). These examples, from published research, are very varied in both types of information and image quality, and all have their complexities, making them inherently difficult to segment. I strongly recommend the readers to carefully watch the accompanying videos, which show (although not thoroughly) how the software is actually used in these examples. 

      We sincerely thank the reviewer for their thoughtful and encouraging feedback. We are particularly pleased that the reviewer appreciated the comparative analysis of MorphoNet 2.0 with both its earlier version and existing tools, as well as the relevance of the five diverse and complex use cases we selected. Demonstrating the software’s versatility and robustness across a variety of challenging datasets was a key goal of this work, and we are glad that this aspect came through clearly. We also appreciate the reviewer’s recommendation to watch the accompanying videos, which we designed to provide a practical sense of how the tool is used in real-world scenarios. Their positive assessment is highly motivating and reinforces the value of combining scripting flexibility with an interactive 3D interface.

      Weaknesses: 

      Being a technical article, the only possible comments are on how methods are presented, which is generally adequate, as mentioned above. In this regard, and in spite of the presented examples (chosen by the authors, who clearly gave them a deep thought before showing them), the only way in which the presented software will prove valuable is through its use by as many researchers as possible. This is not a weakness per se, of course, but just what is usual in this sort of report. Hence, I encourage readers to download the software and give it time to test it on their own data (which I will also do myself).   

      We fully agreed that the true value of MorphoNet 2.0 would be demonstrated through its practical use by a wide range of researchers working with complex 3D and 3D+t datasets. In this regard, we improved the user documentation and provided a set of example datasets to help new users quickly familiarize themselves with the platform. We were also committed to maintaining and updating MorphoNet 2.0 based on user feedback to further support its usability and impact.

      In conclusion, I believe that this report is fundamental because it will be the major way of initially promoting the use of MorphoNet 2.0 by its target audience. The software itself holds the promise of being very impactful for the microscopists' community. 

      Reviewer #1 (Recommendations for the authors): 

      (1) In Use Case 1, when referring to Figure 3a, they describe features of 3b? 

      We corrected the mismatch between Figure 3a and 3b descriptions.

      (2) In Figure 3g-I, columns for Curated Nuclei and All Nuclei appear to be incorrectly labelled, and should be the other way around. 

      We corrected the swapped labels for “Curated Nuclei” and “All Nuclei.”

      (3) Some mention of how this will be supported in the future would be of interest. 

      We added a note on long-term support plans.

      (4) Could Morphonet be rolled into something like napari and integrated into its environment with access to its plugins and tools? 

      We thank the reviewer for this pertinent suggestion. We fully recognize the growing importance of interoperability within the bioimage analysis community, and we have been working on establishing a bridge between MorphoNet and napari to enable data exchange and complementary use of the two tools. As a platform, all new developments are first evaluated by our beta testers before being officially released to the user community and subsequently documented. The interoperability component is still under active development and will be announced shortly in a beta-testing phase. For this reason, we were not able to include it in the present manuscript, but we plan to document it in a future release.

      (5) Can meshes be extracted/saved in another format? 

      We agreed that the ability to extract and save meshes in standard formats was highly useful for interoperability with other tools. We implemented this feature in the new version of MorphoNet, allowing users to export meshes in commonly used formats such as OBJ or STL.

      Reviewer #2 (Recommendations for the authors): 

      As a comment, since the authors mentioned the recent progress in 3D segmentation of various biological components, including organelles, it could be interesting to have examples of Morphonet applied to investigate subcellular structures. These present different challenges in visualization and quantification due to their smaller scale.

      We thank the reviewer for this insightful suggestion. We fully agree that applying MorphoNet 2.0 to the analysis of sub-cellular structures is a promising direction, particularly given the specific challenges these datasets present in terms of resolution, visualization, and quantification. While our current use cases focus on cellular and tissue-level segmentation, we are actively interested in extending the applicability of the tool to finer scales. We are currently exploring plugins for spot detection and curation in single-molecule FISH data. However, this requires more time to properly validate relevant use cases, and we plan to include this functionality in the next release.

      Another comment is that the authors briefly mention two other state-of-the-art softwares (namely FIJI and napari) but do not really position MorphoNet against them. The text would likely benefit from such a comparison so the users can better decide which one to use or not. 

      We agreed that providing a clearer comparison between MorphoNet 2.0 and other widely used tools such as FIJI and Napari would greatly benefit readers and potential users. In response, we included a new paragraph in the supplementary materials of the revised manuscript, highlighting the main features, strengths, and limitations of each tool in the context of 3D+t segmentation, visualization, and correction workflows. This addition helped users better understand the positioning of MorphoNet 2.0 and make informed choices based on their specific needs.

      Minor comments: 

      L 439: The Deli plugin is mentioned but not introduced in the main text; it could be helpful to have an idea of what it is without having to dive into the supplementary material. 

      We included a brief description in the main text and thoroughly revised the help pages to improve clarity.

      Figure 4: It is not clear how the potential holes created by the removal of objects are handled. Are the empty areas filled by neighboring cells, for example, are they left empty? 

      We clarified this in the legend of Figure 4.

      Please remove from the supplementary the use cases that are already in the main text. 

      We cleaned up redundant use case descriptions.

      Typos: 

      L 22: the end of the sentence is missing. 

      L 51: There are two "."   

      L 370: replace 'et' with 'and'.   

      L 407-408, Figure 3: panels g-i, the columns 'curated nuclei' and 'all nuclei' seem to be inverted. 

      L 549: "four 4". 

      Reviewer #3 (Recommendations for the authors): 

      Dear Authors, what follows are "minor comments" (the only sort of comment I have for this nice report): 

      Minor issues: 

      (1) Not being a user of MorphoNet, I found that reading the manuscript was a bit hard due to the several names of plugins or tools that are mentioned, many times without a clear explanation of what they do. One way of improving this could be to add a table, a sort of glossary, with those names, a brief explanation of what they are, and a link to their "help" page on the web. 

      We understood that the manuscript might be difficult to follow for readers unfamiliar with MorphoNet, especially due to the numerous plugin and tool names referenced. To address this, we carried out a complete overhaul of the help pages to make them clearer, more structured, and easier to navigate.

      (2) Figure 4d, orthogonal view: It is claimed that this segmentation is correct according to the original intensity image, but it is not clear why some cells in the border actually appear a lot bigger than other cells in the embryo. It does look like an incomplete segmentation due to the poor image quality at the border. Whether this is the case or if the authors consider the contrary, it should be somehow explained/discussed in the figure legend or the main text. 

      We revised the figure legend and main text to acknowledge the challenge of segmenting peripheral regions with low signal-to-noise ratios and discussed how this affects segmentation.

      Small writing issues I could spot:   

      Line 247: there is a double point after "Sup. Mat..". 

      Line 329: probably a diagrammation error of the pdf I use to review, there is a loose sentence apparently related to a figure: "Vegetal view ofwith smoothness". 

      Line 393 (and many other places): avoid using numbers when it is not a parameter you are talking about, and the number is smaller than 10. In this case, it should be: "The five steps...". 

      Line 459: Is "opposite" referring to "Vegetal", like in g? In addition, it starts with a lowercase letter. 

      Lines 540-541: Check if redaction is correct in "...projected the values onto the meshed dual of the object..." (it sounds obscure to me). 

      Lines 548-549: Same thing for "...included two groups of four 4 nuclei and one group of 3 fused nuclei.". 

      Line 637: Should it be "Same view as b"? 

      Line 646: "The property highlights..."? 

      Line 651: In the text, I have seen a "propagation plugin" named as "Prope", "Propa", and now "Propi". Are they all different? Is it a mistake? Please, see my first "Minor issue", which might help readers navigate through this sort of confusing nomenclature. 

      Line 702: I personally find the use of the term "eco-system" inappropriate in this context. We scientists know what an ecosystem is, and the fact that it has now become a fashionable word for politicians does not make it correct in any context. 

      We thank the reviewer for their careful reading of the manuscript and for pointing out these writing and typographic issues. We corrected all the mentioned points in the revised version, including punctuation, sentence clarity, consistent naming of tools (e.g., the propagation plugin), and appropriate use of terms such as “ecosystem.” We also appreciated the suggestion to avoid numerals for numbers under ten when not referring to parameters, and we ensured consistency throughout the text. These corrections improved the clarity and readability of the manuscript, and we were grateful for the reviewer’s attention to detail.

    1. When and how to use Standard English Maybe you have cousins or friends in other parts of the country, and there have been times when you have misunderstood each other? Perhaps you were trying to play a game that has different names in different parts of the country. Such local words, which are not Standard English, should not be used in formal situations such as in an exam or going for a job interview. In formal situations, it is required that you use Standard English, which also means not using slang words that you would use with your friends.

      Talk among friends is not Standard English, because it differs in other parts of the country.


    1. As in most gathering and hunting societies, women’s economic functions, along with childbearing, are absolutely crucial. Women typically generate more food through gathering than the men who hunt animals or look for game that has already been killed. Gathering and hunting societies appear to have developed d

      Women were very important in early Africa, as they were in many different tribes and societies around the world. Women have played a huge role in history and continue to do so.

  3. social-media-ethics-automation.github.io social-media-ethics-automation.github.io
    1. In what ways have you experienced going viral?

      I had an interesting experience of going viral on TikTok during COVID, when we were all locked indoors, and I will never forget it. I was always a bit obsessed with going viral during COVID, as any middle schooler at the time was. It was right when the video game Among Us was going viral itself, and I decided to try to benefit from that. I played the game a lot and really enjoyed it, so I decided to create a fresh TikTok account that would post funny Among Us content. The videos were 60 seconds of my gameplay with funny sound effects over it, and they went pretty viral. I worked up to 170 thousand followers and a total of around 5 million likes and even more views. It was a very fun but also stressful experience because once I reached that viral status, I was constantly worried about keeping it and not going down in views.

    1. A meme is a piece of culture that might reproduce in an evolutionary fashion, like a hummable tune that someone hears and starts humming to themselves, perhaps changing it, and then others overhearing next. In this view, any piece of human culture can be considered a meme that is spreading (or failing to spread) according to evolutionary forces. So we can use an evolutionary perspective to consider the spread of:

      This reminds me of something quite silly, but I think it's worth mentioning. While this term was later adapted to refer to what we today call a meme, it was still in use with this definition before then and did circulate through media, which makes that media retroactively very comedic through the redefining of the word meme. My favorite example of this is the 2013 game Metal Gear Rising: Revengeance, which has plot points revolving around how the only thing that truly matters to a person's self and decisions is memes, the ideas that their culture passes on to them. But with our modern definition, all the thoughtful speeches throughout the game become unintentionally very funny.

    2. A meme is a piece of culture that might reproduce in an evolutionary fashion, like a hummable tune that someone hears and starts humming to themselves, perhaps changing it, and then others overhearing next.

      This passage made me think about how memes are almost like a game of telephone throughout online communities and across generations. A millennial may see the same meme in a completely different way than someone in Generation Z or Alpha, and vice versa.

    1. ith contrastive analysis and code-switching, teachers learn tools to accurately assess and effectively respond to the standard literacy needs of their vernacular-speaking students. Teachers gain confidence to foster the broader student writer, encouraging students to pursue their ideas and vision in well-developed, well-structured essays. Then in the end game of the writing process, teachers help students edit for Standard English, if that is the language appropriate to the writing task (se

      When teachers learn how to differentiate dialect features from actual mistakes, they also feel more accomplished as teachers who help their students, rather than feeling as though the kids cannot learn.

    1. Brittany questioned the form and function of a test, so it made sense for her to try and create one that met her goals. In the end, she created what we might now call an example of high school and college alignment—an exam in high school that might have prepared her for our college writing class. It is wishful thinking, but classmates were prompted to talk about how to approach tests that they needed to take but didn’t agree with, and my colleagues and I learned that alignment discussions can be had among all stakeholders, rather than among teachers and administrators alone.

      Relates to the public communication, like adapting a game demo for investors. Could invention projects like this replace traditional essays? This reaffirms that audience awareness develops through experimentation, not memorization.

      (SAYS-DOES) Charlton says Brittany’s creative testing aligns audiences, and this does illustrate authentic transfer of learning.

    1. Thinking through Hutchinson’s and Moore’s perspectives, we could argue that Kojima’s strategy of using racial ambiguity to cater to both the Japanese and the Western audience permits him to embody Japaneseness without any historical baggage.

      Furthermore, can you stop to think what budget the game may have?

    2. Noting the racially ambiguous design of the MGS series’ protagonist Snake, Hutchinson argues that the white-passing body welcomes Western players to empathize with its message.

      I know that this is colonising, you don't have to shove it upon me... but isn't it a justified concession? Isn't the inherent peace-cooperation argument embedded in the game akin to the reparatory non-repetition argument that underlies historical memory?

      For me, it is not, and I say this having played a large chunk of the game while focusing on utilitarian EA ethics. It is not, because it may avoid tokenisation, sure, but Sam Porter is not a slave, he is a hero. Not only that: although it prefaces the quest of reaching white people with anti-war logics, the game has war, the game has fights, and its sequel does too. These are surrounded with mysticism and fantastic events which cloud the statements and leave them open to interpretation in a way that most players are sure to miss them. It's not provocative, it's an eco-tourism chore. The cutscenes and events are a MacGuffin to visit places and trek through them to feel epic.

      To influence a mass of players, and not just get critical acclaim, it would have needed to be more straightforward.

  4. Oct 2025
    1. He couldn’t compete with them head-to-head from a product standpoint, and couldn’t possibly outspend them in marketing. The solution was for Fraser and his team to question every facet of their business, including product packaging, pricing and advertising. The result was the world’s first baking soda and peroxide toothpaste, Mentadent, which was very successful.

      This story is a perfect example. When you can't fight the same way, you must change the game. By questioning everything, they found a new way to win.

    1. other important volumes were kept on a high shelf. As there were no book stores in his neighborhood, his grandmother took him to secondhand stores to purchase books, looking especially for ones with maps, one of his passions. Both boys owned a DS (dual-screen hand-held game console) and other electronic toys and games

      This paragraph provides a vivid counterexample to the stereotype that low-income families lack educational resources or value literacy. The detailed descriptions of the boys’ homes—filled with books, newspapers, maps, magazines, and even technology like iPods and GPS devices—show how these families actively create literacy-rich environments that reflect their interests, cultures, and daily lives. I believe this approach is absolutely correct. My mother was also born in northern China, an area with scarce educational resources. Yet her mother relentlessly pushed every child in the family to study hard, sending them all to university. That's why I can now enjoy a quality education in a great city. In their eyes, education truly changed their destiny—all because of those old books sold one by one at street stalls.

    1. Democracies thrive when politicians believe they are better off playing by the rules of that game — even when they lose elections — because that’s the way to maximize their self-interest over time.

      But, what changed?

    1. better understand how one instance of poor time management can trigger a cascading situation with disastrous results, imagine that a student has an assignment due in a business class. She knows that she should be working on it, but she isn’t quite in the mood. Instead she convinces herself that she should think a little more about what she needs to complete the assignment and decides to do so while looking at social media or maybe playing a couple more rounds of a game on her phone. In a little while, she suddenly realizes that she has become distracted and the evening has slipped away. She has little time left to work on her assignment. She stays up later than usual trying to complete the assignment but cannot finish it. Exhausted, she decides that she will work on it in the morning during the hour she had planned to study for her math quiz. She knows there will not be enough time in the morning to do a good job on the assignment, so she decides that she will put together what she has and hope she will at least receive a passing grade.

      This passage describes a student with poor time management.

    1. This process can continue until you find an idea everyone believes in. Flesh out the winning sketches with details before moving to prototype creation and user testing.

      This way of brainstorming isn't familiar to me, so I can see how this is useful. It does seem like a fun kind of "game" to get the juices flowing.

    1. When Elgin took up his post in Istanbul in 1799, he and his compatriots saw it as their patriotic duty to outdo the French in this race to grab history.

      are you guys even doing it for the love of the game or are you just doing it to beat the french

    1. 10.2.5. Are things getting better?# We could look at inventions of new accessible technologies and think the world is getting better for disabled people. But in reality, it is much more complicated. Some new technologies make improvements for some people with some disabilities, but other new technologies are continually being made in ways that are not accessible. And, in general, cultures shift in many ways all the time, making things better or worse for different disabled people.

      I personally think things are getting better. New technologies and settings are being created for people with disabilities to use online services. People with disabilities are also just being considered more today when companies invent a new app or game. For example, text to speech or video games that offer color blind settings.

    1. blowout

      This is the second time that the author has used this word. Many people likely take "blowout" to mean different things. In my opinion, the Packers did not win in a blowout, since they only won by ten points. However, other people have a different concept of this word in terms of a football game.

    2. Jordan Love didn't need Sunday night's thumping

      First sentence is an informed opinion or speculation because Love didn't come out and say these things, but there was evidence of this from last night's game.

  5. www.tripleeframework.com www.tripleeframework.com
    1. However, we can look a little more deeply at engagement by considering if the technology is not just capturing the interest of the student, but if it is actually engaging them actively in the content

      I think this is important to remember when using new "shiny" technologies. One that comes to mind for me is Kahoot. I have found that Kahoot is a great way to engage students in, for example, a review of a topic for a quiz. I use it in Astronomy, 9th grade science and other classes. I almost always use it in "classic" mode, where I have quiz questions and students compete for points. However, there are other game modes where students earn more time to keep playing or possibly added features to a game by answering questions correctly. At one point there was a snowball fight version where you answered questions correctly to get more snowballs. Those other game formats seemed to capture students' interest, but did not really engage them in the actual review questions. The questions were just there to help them keep playing the game. The technology needs to engage students in content to be effective.

    2. It is important to look for "time on task" engagement.

      This brief statement is powerful for me! Engagement can be described in different ways, but if students are busy being "engaged" by selecting user names, avatars or playing a "reward" game of flappy bird, engagement isn't "time on task." Students must be engaged in the content, as the previous statement, for this engagement to qualify as worthwhile. There is a balance, to be sure, of entertainment and education, but it must favor the content of the lesson.

    1. on the day called "Carnival" schoolboys bring fighting-cocks to their schoolmaster,

      In this article Carnival seems to be a day for boys to bring their roosters to fight each other. Previously, we read this was a celebration before Lent, "a rejoicing period of time" (Milliman, 597). Lent, from my understanding, is where you give something up in order to develop or strengthen your relationship to God. This seems to be an interesting game, especially since this would be a game to relax the students before a time of ceremonial divinity. This game also fits in with the ceremonial combat mentioned by Milliman (591). I wonder why it was deemed morally acceptable to have animals fight but not to have tournaments. This seems like it would feed a love of violence. However, this book does not mention details of how far the cock-fight would go, so maybe not. There seems to be a respect for knowing how to fight as a discipline and actually fighting as a sport.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer 1:

      The authors frequently refer to their predictions and theory as being causal, both in the manuscript and in their response to reviewers. However, causal inference requires careful experimental design, not just statistical prediction. For example, the claim that "algorithmic differences between those with BPD and matched healthy controls" are "causal" in my opinion is not warranted by the data, as the study does not employ experimental manipulations or interventions which might predictably affect parameter values. Even if model parameters can be seen as valid proxies to latent mechanisms, this does not automatically mean that such mechanisms cause the clinical distinction between BPD and CON, they could plausibly also refer to the effects of therapy or medication. I recommend that such causal language, also implicit to expressions like "parameter influences on explicit intentional attributions", is toned down throughout the manuscript.

      Thank you for this chance to be clearer in the language. Our models and paradigm introduce a form of temporal causality, given that latent parameter distributions are directly influenced by latent parameter estimates at a previous point in time (self-uncertainty and other uncertainty directly govern social contagion). Nevertheless, we appreciate the reviewer's perspective and have now toned down the language to reflect this.

      Abstract:

      ‘Our model makes clear predictions about the mechanisms of social information generalisation concerning both joint and individual reward.’

      Discussion:

      ‘We can simulate this by modelling a framework that incorporates priors based on both self and a strong memory impression of a notional other (Figure S3).’

      ‘We note a strength of this work is the use of model comparison to understand algorithmic differences between those with BPD and matched healthy controls.’

      Although the authors have now much more clearly outlined the study's aims, there still is a lack of clarity with respect to the authors' specific hypotheses. I understand that their primary predictions about disruptions to self-other generalisation processes underlying BPD are embedded in the four main models that are tested, but it is still unclear what specific hypotheses the authors had about group differences with respect to the tested models. I recommend the authors specify this in the introduction rather than referring to prior work where the same hypotheses may have been mentioned.

      Thank you for this further critique, which has enabled us to more clearly refine our introduction. We have now edited our introduction to be more direct about our hypotheses, that these hypotheses are instantiated into formal models, and what our predictions were. We have also included a small section on how previous predictions from other computational assessments of BPD link to our exploratory work, and highlighted this throughout the manuscript.

      ‘This paper seeks to address this gap by testing explicitly how disruptions in self-other generalization processes may underpin interpersonal disruptions observed in BPD. Specifically, our hypotheses were: (i) healthy controls will demonstrate evidence for both self-insertion and social contagion, integrating self and other information during interpersonal learning; and (ii) individuals with BPD will exhibit diminished self-other integration, reflected in stronger evidence for observations that assume distinct self-other representations.

      We tested these hypotheses by designing a dynamic, sequential, three-phase Social Value Orientation (Murphy & Ackerman, 2014) paradigm—the Intentions Game—that would provide behavioural signatures assessing whether BPD differed from healthy controls in these generalization processes (Figure 1A). We coupled this paradigm with a lattice of models (M1-M4) that distinguish between self-insertion and social contagion (Figure 1B), and performed model comparison:

      M1. Both self-to-other (self-insertion) and other-to-self (social contagion) occur before and after learning
      M2. Self-to-other transfer only occurs
      M3. Other-to-self transfer only occurs
      M4. Neither transfer process, suggesting distinct self-other representations

      We additionally ran exploratory analysis of parameter differences and model predictions between groups following from prior work demonstrating changes in prosociality (Hula et al., 2018), social concern (Henco et al., 2020), belief stability (Story et al., 2024a), and belief updating (Story, 2024b) in BPD to understand whether discrepancies in self-other generalisation influences observational learning. By clearly articulating our hypotheses, we aim to clarify the theoretical contribution of our findings to existing literature on social learning, BPD, and computational psychiatry.’

      Caveats should also be added about the exploratory nature of the many parameter group comparisons. If there are any predictions about group differences that can be made based on prior literature, the authors should make such links clear.

      Thank you for this. We have now included caveats in the text to highlight the exploratory nature of these group comparisons, and added direct links to relevant literature where able:

      Introduction

      ‘We additionally ran exploratory analysis of parameter differences and model predictions between groups following from prior work demonstrating changes in prosociality (Hula et al., 2018), social concern (Henco et al., 2020), belief stability (Story et al., 2024a), and belief updating (Story, 2024b) in BPD to understand whether discrepancies in self-other generalisation influences observational learning. By clearly articulating our hypotheses, we aim to clarify the theoretical contribution of our findings to existing literature on social learning, BPD, and computational psychiatry.’

      Model Comparison

      ‘We found that CON participants were best fit at the group level by M1 (Frequency = 0.59, Exceedance Probability = 0.98), whereas BPD participants were best fit by M4 (Frequency = 0.54, Exceedance Probability = 0.86; Figure 2A). This suggests CON participants are best fit by a model that fully integrates self and other when learning, whereas those with BPD are best explained as holding disintegrated and separate representations of self and other that do not transfer information back and forth.

      We first explore parameters between separate fits (see Methods). Later, in order to assuage concerns about drawing inferences from different models, we examined the relationships between the relevant parameters when we forced all participants to be fit to each of the models (in a hierarchical manner, separated by group). In sum, our model comparison is supported by convergence in parameter values when comparisons are meaningful (see Supplementary Materials). We refer to both types of analysis below.’

      Phase 2 analysis

      ‘Prior work predicts those with BPD should focus more intently on public social information, rather than private information that only concerns one party (Henco et al., 2020). In BPD participants, only new beliefs about the relative reward preferences – mutual outcomes for both players – of partners differed (see Fig 2E): new median priors were larger than median preferences in phase 1 (mean = -0.47; = -6.10, 95%HDI: -7.60, -4.60).’

      ‘Models of moral preference learning (Story et al., 2024) predict that BPD vs non-BPD participants have more rigid beliefs about their partners. We found that BPD participants were equally flexible around their prior beliefs about a partner’s relative reward preferences (= -1.60, 95%HDI: -3.42, 0.23), and were less flexible around their beliefs about a partner’s absolute reward preferences (= -4.09, 95%HDI: -5.37, -2.80), versus CON (Figure 2B).’

      Phase 3 analysis

      ‘Prior work predicts that human economic preferences are shaped by observation (Panizza, et al., 2021; Suzuki et al. 2016; Yu et al, 2021), although little-to-no work has examined whether contagion differs for relative vs. absolute preferences. Associative models predict that social contagion may be exaggerated in BPD (Ereira et al., 2018).… As a whole, humans are more susceptible to changing relative preferences more than selfish, absolute reward preferences, and this is disrupted in BPD.’

      Psychometric and Intentional Attribution analysis

      ‘Childhood trauma, persecution, and poor mentalising in BPD are all predicted to disrupt one’s ability to change (Fonagy & Luyten, 2009).’

      ‘Prior work has also predicted that partner-participant preference disparity influences mental state attributions (Barnby et al., 2022; Panizza et al., 2021).’

      I'm not sure I understand why the authors, after adding multiple comparison correction, now list two kinds of p-values. To me, this is misleading and precludes the point of multiple comparison corrections, I therefore recommend they report the FDR-adjusted p-values only. Likewise, if a corrected p-value is greater than 0.05 this should not be interpreted as a result.

      We have now adjusted the exploratory results to include only the FDR corrected values in the text.

      ‘We assessed conditional psychometric associations with social contagion under the assumption of M3 for all participants. We conducted partial correlation analyses to estimate relationships conditional on all other associations and retained all that survived bootstrapping (5000 reps), permutation testing (5000 reps), and subsequent FDR correction. When not controlled for group status, RGPTSB and CTQ scores were both moderately associated with MZQ scores (RGPTSB r = 0.41, 95%CI: 0.23, 0.60, p[fdr]=0.043; CTQ r = 0.354 95%CI: 0.13, 0.56, p[fdr]=0.02). This was not affected by group correction. CTQ scores were moderately and negatively associated with shifts in individualistic reward preferences (; r = -0.25, 95%CI: -0.46, -0.04, p[fdr]=0.03). This was not affected by group correction. MZQ scores were in turn moderately and negatively associated with shifts in prosocial-competitive preferences () between phase 1 and 3 (r = -0.26, 95%CI: -0.46, -0.06, p[fdr]=0.03). This was diminished when controlled for group status (r = 0.13, 95%CI: -0.34, 0.08, p[fdr]=0.20). Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).

      Prior work has predicted that partner-participant preference disparity influences mental state attributions (Barnby et al., 2022; Panizza et al., 2021). We tested parameter influences on explicit intentional attributions in Phase 2 while controlling for group status. Attributions included the degree to which they believed their partner was motivated by harmful intent (HI) and self-interest (SI). According with prior work (Barnby et al., 2022), greater disparity of absolute preferences before learning was associated on a trend level with reduced attributions of SI (= -0.23, p[fdr]=0.08), and greater disparity of relative preferences before learning exaggerated attributions of HI (= 0.21, p[fdr]=0.08), but did not survive correction (Figure S4B). This is likely due to partners being significantly less individualistic and prosocial on average compared to participants (= -5.50, 95%HDI: -7.60, -3.60; = 12, 95%HDI: 9.70, 14.00); partners are recognised as less selfish and more competitive.’
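
      The p[fdr] values quoted above come from a False Discovery Rate correction. As a rough illustration only, the sketch below implements the Benjamini-Hochberg step-up adjustment, which is one common FDR procedure; the response does not say which variant the authors used, so this choice, the function name `bh_adjust`, and the example p-values are all assumptions made for illustration.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR variant, for illustration)."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing that adjusted values never increase as rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
    print(bh_adjust(raw))
```

      Reporting only the adjusted values, as the reviewer requests, means a test counts as a result only if its adjusted p-value falls below the chosen threshold.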

      Can the authors please elaborate why the algorithm proposed to be employed by BPD is more 'entropic', especially given both their self-priors and posteriors about partners' preferences tended to be more precise than the ones used by CON? As far as I understand, there's nothing in the data to suggest BPD predictions should be more uncertain. In fact, this leads me to wonder, similarly to what another reviewer has already suggested, whether BPD participants generate self-referential priors over others in the same way CON participants do, they are just less favourable (i.e., in relation to oneself, but always less prosocial) - I think there is currently no model that would incorporate this possibility? It should at least be possible to explore this by checking if there is any statistical relationship between the estimated θ_ppt^m and p(θ_par | D^0).

      Thank you for this opportunity to be clearer in our wording. We believe the reviewer is referring to this line in the discussion: ‘In either case, the algorithm underlying the computational goal for BPD participants is far higher in entropy and emphasises a less stable or reliable process of inference.’

      We note in the revised Figure 2 panel E and in the results that those with BPD under M4 show insertion along absolute reward (they still expect diminished selfishness in others), but neutral priors over relative reward (around 0, suggesting expectations of neither prosocial or competitive tendencies of others). Thus, θ_ppt^m (self preference) and θ_par^m (other preference) are tightly associated for absolute, but not relative reward.

      In our wording, we meant that whether under model M4 or M1, those with BPD either show a neutral prior over relative reward (M4) or a prior with large variance over relative reward (M1), showing expectations of difference between themselves and their partner. In both cases, expectation about a partner’s absolute reward preferences is diminished vs. CON participants. We have strengthened our language in the discussion to clarify this:

      ‘In either case, the algorithm underlying the computational goal for BPD participants is far higher in uncertainty, whether through a neutral central tendency (M4) or large variance (M1) prior over relative reward in phase 2, and emphasises a less certain and reliable expectation about others.’

      "To note, social contagion under M3 was highly correlated with contagion under M1 (see Fig S11). This provides some preliminary evidence that trauma impacts beliefs about individualism directly, whereas trauma and persecutory beliefs impact beliefs about prosociality through impaired trait mentalising" - I don't understand what the authors mean by this, can they please elaborate and add some explanation to the main text?

      We have now clarified this in the text:

      ‘Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).’

      I noted that at least some of the newly added references have not been added to the bibliography (e.g., Hitchcock et al. 2022).

      Thank you for noticing this omission. We have now ensured all cited works are in the reference list.

      Reviewer 2:

      The paper is not based on specific empirical hypotheses formulated at the outset, but, rather, it uses an exploratory approach. Indeed, the task is not chosen in order to tackle specific empirical hypotheses. This, in my view, is a limitation since the introduction reads a bit vague and it is not always clear which gaps in the literature the paper aims to fill. As a further consequence, it is not always clear how the findings speak to previous theories on the topic.

      As I wrote in the public review, however, I believe that an important limitation of this work is that it was not based on testing specific empirical hypotheses formulated at the outset, and on selecting the experimental paradigm accordingly. This is a limitation because it is not always clear which gaps in the literature the paper aims to fill. As a consequence, although it has improved substantially compared to the previous version, the introduction remains a bit vague. As a further consequence, it is not always clear how the findings speak to previous theories on the topic. Still, despite this limitation, the paper has many strengths, and I believe it is now ready for publication.

      Thank you for this further critique. We appreciate your appraisal that the work has improved substantially and is ready for publication. We nevertheless have opted to clarify our introduction and a priori predictions throughout the manuscript (please see response to Reviewer 1).

      Reviewer 3:

      Although the authors note that their approach makes "clear and transparent a priori predictions," the paper could be improved by providing a clear and consolidated statement of these predictions so that the results could be interpreted vis-a-vis any a priori hypotheses.

      In line with comments from both Reviewer 1 and 2, we have clarified our introduction to make it clear what our a priori predictions and hypotheses are about our core aims and exploratory analyses (see response to Reviewer 1).

      The approach of using a partial correlation network with bootstrapping (and permutation) was interesting, but the logic of the analysis was not clearly stated. In particular, there are large group (Table 1: CON vs. BPD) differences in the measures introduced into this network. As a result, it is hard to understand whether any partial correlations are driven primarily by mean differences in severity (correlations tend to be inflated in extreme groups designs due to the absence of observation in middle of scales forming each bivariate distribution). I would have found these exploratory analyses more revealing if group membership was controlled for.

      Thank you for this chance to be clearer in our methods. We have now written a more direct exposition of this exploratory method:

      ‘Exploratory Network Analysis

      To understand the individual differences of trait attributes (MZQ, RGPTSB, CTQ) with other-to-self information transfer () across the entire sample we performed a network analysis (Borsboom, 2021). Network analysis allows for conditional associations between variables to be estimated; each association is controlled for by all other associations in the network. It also allows for visual inspection of the conditional relationships to get an intuition for how variables are interrelated as a whole (see Fig S11). We implemented network analysis with the bootnet package in R using the ‘estimateNetwork’ function with partial correlations (Epskamp, Borsboom & Fried, 2018). To assess the stability of the partial correlations we further implemented bootstrap resampling with 5000 repetitions using the ‘bootnet’ function. We then additionally shuffled the data and refitted the network 5000 times to determine a p_permuted value; this indicates the probability that a conditional relationship in the original network was within the null distribution of each conditional relationship. We then performed False Discovery Rate correction on the resulting p-values. We additionally controlled for group status for all variables in a supplementary analysis (Table S4).’
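
      The logic of this analysis can be sketched in a few lines. The example below is not the bootnet pipeline the authors describe; it is a simplified stand-in, assuming only three variables (so the partial correlation conditions on a single third variable rather than on a whole network) and assuming a permutation scheme that shuffles one variable. All function names and the toy data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on a single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def permutation_p(x, y, z, reps=5000, seed=0):
    """Share of shuffled datasets whose |partial r| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(partial_corr(x, y, z))
    shuffled = list(x)
    hits = 0
    for _ in range(reps):
        rng.shuffle(shuffled)  # break any link between x and the other variables
        if abs(partial_corr(shuffled, y, z)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (reps + 1)


if __name__ == "__main__":
    # Toy data standing in for three questionnaire or parameter scores.
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    y = [1.5, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.4]
    z = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
    print(partial_corr(x, y, z), permutation_p(x, y, z, reps=1000))
```

      In the full analysis each edge in the network would get its own permuted p-value, and those p-values would then be FDR-corrected together.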

      We have also further corrected for group status and reported these results as a supplementary table, and also within the main text alongside the main results. We have opted to relegate Figure 4 into a supplementary figure to make the text clearer.

      ‘We explored conditional psychometric associations with social contagion under the assumption of M3 for all participants (where everyone is able to be influenced by their partner). We conducted partial correlation analyses to estimate relationships conditional on all other associations and retained all that survived bootstrapping (5000 reps), permutation testing (5000 reps), and subsequent FDR correction. When not controlled for group status, RGPTSB and CTQ scores were both moderately associated with MZQ scores (RGPTSB r = 0.41, 95%CI: 0.23, 0.60, p[fdr]=0.043; CTQ r = 0.354 95%CI: 0.13, 0.56, p[fdr]=0.02). This was not affected by group correction. CTQ scores were moderately and negatively associated with shifts in individualistic reward preferences (; r = -0.25, 95%CI: -0.46, -0.04, p[fdr]=0.03). This was not affected by group correction. MZQ scores were in turn moderately and negatively associated with shifts in prosocial-competitive preferences () between phase 1 and 3 (r = -0.26, 95%CI: -0.46, -0.06, p[fdr]=0.03). This was diminished when controlled for group status (r = 0.13, 95%CI: -0.34, 0.08, p[fdr]=0.20). Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).’

      Discussion first para: "effected -> affected"

      Thanks for spotting this. We have now changed it.

      Add "s" to "participant": "Notably, despite differing strategies, those with BPD achieved similar accuracy to CON participant."

      We have now changed this.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      Zhang et al. addressed the question of whether advantageous and disadvantageous inequality aversion can be vicariously learned and generalized. Using an adapted version of the ultimatum game (UG), in three phases, participants first gave their own preference (baseline phase), then interacted with a "teacher" to learn their preference (learning phase), and finally were tested again on their own (transfer phase). The key measure is whether participants exhibited similar choice preferences (i.e., rejection rate and fairness rating) influenced by the learning phase, by contrasting their transfer phase and baseline phase. Through a series of statistical modeling and computational modeling, the authors reported that both advantageous and disadvantageous inequality aversion can indeed be learned (Study 1), and even be generalised (Study 2).

      Strengths:

      This study is very interesting, it directly adapted the lab's previous work on the observational learning effect on disadvantageous inequality aversion, to test both advantageous and disadvantageous inequality aversion in the current study. Social transmission of action, emotion, and attitude have started to be looked at recently, hence this research is timely. The use of computational modeling is mostly appropriate and motivated. Study 2, which examined the vicarious inequality aversion in conditions where feedback was never provided, is interesting and important to strengthen the reported effects. Both studies have proper justifications to determine the sample size.

      Weaknesses:

      Despite the strengths, a few conceptual aspects and analytical decisions have to be explained, justified, or clarified.

      INTRODUCTION/CONCEPTUALIZATION

      (1) Two terms seem to be interchangeable, which should not, in this work: vicarious/observational learning vs preference learning. For vicarious learning, individuals observe others' actions (and optionally also the corresponding consequence resulting directly from their own actions), whereas, for preference learning, individuals predict, or act on behalf of, the others' actions, and then receive feedback if that prediction is correct or not. For the current work, it seems that the experiment is more about preference learning and prediction, and less so about vicarious learning. The intro and set are heavily around vicarious learning, and later the use of vicarious learning and preference learning is rather mixed in the text. I think either tone down the focus on vicarious learning, or discuss how they are different. Some of the references here may be helpful: (Charpentier et al., Neuron, 2020; Olsson et al., Nature Reviews Neuroscience, 2020; Zhang & Glascher, Science Advances, 2020)

      We are appreciative of the Reviewer for raising this question and providing the reference. In response to this comment we have elected to avoid, in most cases, use of the term ‘vicarious’ and instead focus the paper on learning of others’ preferences (without specific commitment to vicarious/observational learning per se). These changes are reflected throughout all sections of the revised manuscript, and in the revised title. We believe this simplified terminology has improved the clarity of our contribution.

      EXPERIMENTAL DESIGN

      (2) For each offer type, the experiment "added a uniformly distributed noise in the range of (-10 ,10)". I wonder what this looks like? With only integers such as 25:75, or even with decimal points? More importantly, is it possible to have either 70:30 or 90:10 option, after adding the noise, to have generated an 80:20 split shown to the participants? If so, for the analyses later, when participants saw the 80:20 split, which condition did this trial belong to? 70:30 or 90:10? And is such noise added only to the learning phase, or also to the baseline/transfer phases? This requires some clarification.

      We thank the Reviewer for pointing this out. The uniformly distributed noise was added to all three phases to make the proposers’ behavior more realistic. This added noise was rounded to integer numbers, constrained from -9 to 9, which means in both 70:30 and 90:10 offer types, an 80:20 split could not occur. We have made this feature of our design clear in the Method section Line 524 ~ 528:

      “In all task phases, we added uniformly distributed noise to each trial’s offer (ranging from -9 to 9, inclusive, rounding to the nearest integer) such that the random amount added (or subtracted) from the Proposer’s share was subtracted (or added) to the Receiver’s share. We adopted this manipulation to make the proposers’ behavior appear more realistic. The orders of offers participants experienced were fully randomized within each experiment phase. ”
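
      A minimal sketch of this manipulation is given below, under stated assumptions: the noise is drawn directly as a uniform integer in [-9, 9] (the quoted Methods text describes rounding to the nearest integer, which may weight the endpoints differently), and the number of trials per offer type (`trials_per_offer`) is a hypothetical parameter not taken from the paper. Function and variable names are illustrative.

```python
import random

# The five base offer types (Proposer share, Receiver share) named in the review.
BASE_OFFERS = [(90, 10), (70, 30), (50, 50), (30, 70), (10, 90)]


def jitter_offer(proposer, receiver, rng, max_noise=9):
    """Shift the split by an integer in [-max_noise, max_noise]; the total is unchanged."""
    noise = rng.randint(-max_noise, max_noise)  # both endpoints inclusive
    return proposer + noise, receiver - noise


def build_phase(rng, trials_per_offer=4):
    """Create one phase of trials and fully randomize their order."""
    trials = []
    for proposer, receiver in BASE_OFFERS:
        for _ in range(trials_per_offer):
            shown = jitter_offer(proposer, receiver, rng)
            trials.append({"offer_type": (proposer, receiver), "shown": shown})
    rng.shuffle(trials)  # a fresh order for each participant and phase
    return trials


if __name__ == "__main__":
    rng = random.Random(42)  # one generator per participant, for example
    for trial in build_phase(rng, trials_per_offer=1):
        print(trial)
```

      Because the noise never exceeds 9 in absolute value, a 70:30 trial can show at most 79:21 and a 90:10 trial at least 81:19, so a displayed split stays attributable to one offer type, which is the point the authors make in their reply.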

      (3) For the offer conditions (90:10, 70:30, 50:50, 30:70, 10:90) - are they randomized? If so, how is it done? Is it randomized within each participant, and/or also across participants (such that each participant experienced different trial sequences)? This is important, as the order especially for the learning phase can largely impact the preference learning of the participants.

      We agree with the Reviewer that the order in which offers are experienced could be very important. The order of the conditions was randomized independently for each participant (i.e. each participant experienced different trial sequences). We made this point clear in the Methods part. Line 527 ~ 528:

      “The orders of offers participants experienced were fully randomized within each experiment phase.”

      STATISTICAL ANALYSIS & COMPUTATIONAL MODELING

      (4) In Study 1 DI offer types (90:10, 70:30), the rejection rate for DI-AI averse looks consistently higher than that for DI averse (ie, the blue line is above the yellow line). Is this significant? If so, how come? Since this is a between-subject design, I would not anticipate such a result (especially for the baseline). Also, for the LME results (eg, Table S3), only interactions were reported but not the main results.

      We thank the Reviewer for pointing out this feature of the results. Prompted by this comment, we compared the baseline rejection rates between the two conditions for these two offer types, finding in Experiment 1 that rejection rates in the DI-AI-averse condition were significantly higher than in the DI-averse condition (DI-AI-averse vs. DI-averse; Offer 90:10, β = 0.13, p < 0.001, Offer 70:30, β = 0.09, p < 0.034). We agree with the Reviewer that there should, in principle, be no difference, because the experiences of participants in these two conditions are identical in the Baseline phase. However, we did not observe this difference in baseline preferences in Experiment 2 (DI-AI-averse vs. DI-averse; Offer 90:10, β = 0.07, p < 0.100, Offer 70:30, β = 0.05, p < 0.193). Given the inconsistency of this effect across studies, we believe this is a spurious difference in preferences stemming from chance.

      Regarding the LME results, only interaction terms are reported because of the specification of the model and the rationale for testing.

      Taking the model reported in Table S3 as an example (a logistic model which examines Baseline phase rejection rates as a function of offer level and condition), the between-subject conditions (DI-averse and DI-AI-averse) are represented by dummy-coded variables. Similarly, offer types were also dummy-coded, such that each of the five columns (90:10, 70:30, 50:50, 30:70, and 10:90) corresponded to a particular offer type. This model specification yields ten interaction terms (i.e., fixed effects) of interest; for example, the “DI-averse × Offer 90:10” term indicates baseline rejection rates for 90:10 offers in the DI-averse condition. Thus, to compare rejection rates across specific offer types, we estimate and report linear contrasts between these resultant terms. We have clarified the nature of these reported tests in our revised Results, for example, Line 189 ~ 190: “(linear contrasts; e.g. 90:10 vs 10:90, all Ps<0.001, see Table S3 for logistic regression coefficients for rejection rates).”

      Also, in response to this comment and a recommendation from Reviewer 2 (see below), we have revised our supplementary materials to make each model specification clearer, as in SI Line 25:

      “RejectionRate ~ 0 + (Disl + Advl):(Offer10 + Offer30 + Offer50 + Offer70 + Offer90) + (1|Subject)”
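
      To make the structure of this specification concrete, the following Python sketch enumerates the ten condition-by-offer terms and builds the weights of one linear contrast. It is an illustration only, not the lme4 code used for the analysis; the helper names `design_row` and `contrast` are hypothetical, and the condition and offer labels are taken from this response rather than from the variable names in the formula.

```python
from itertools import product

conditions = ["DI-averse", "DI-AI-averse"]
offers = ["90:10", "70:30", "50:50", "30:70", "10:90"]

# One fixed-effect term per condition-by-offer cell (no intercept, no main effects).
cells = list(product(conditions, offers))

def design_row(condition, offer):
    """Dummy-coded row: 1 in the column of the matching cell, 0 elsewhere."""
    return [int((condition, offer) == cell) for cell in cells]

def contrast(cell_a, cell_b):
    """Weights for the linear contrast 'cell_a minus cell_b' over the cell terms."""
    return [int(cell == cell_a) - int(cell == cell_b) for cell in cells]

row = design_row("DI-averse", "90:10")
# Example: baseline 90:10 rejection, DI-AI-averse versus DI-averse.
weights = contrast(("DI-AI-averse", "90:10"), ("DI-averse", "90:10"))
```

      Under this coding each coefficient is a cell-level log-odds of rejection, so any comparison between offer types or conditions is a weighted difference of coefficients rather than a conventional main effect.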

      (5) I do not particularly find this analysis appealing: "we examined whether participants' changes in rejection rates between Transfer and Baseline, could be explained by the degree to which they vicariously learned, defined as the change in punishment rates between the first and last 5 trials of the Learning phase." Naturally, the participants' behavior in the first 5 trials in the learning phase will be similar to those in the baseline; and their behavior in the last 5 trials in the learning phase would echo those at the transfer phase. I think it would be stronger to link the preference learning results to the change between the baseline and transfer phase, eg, by looking at the difference between alpha (beta) at the end of the learning phase and the initial alpha (beta).

      Thanks for pointing this out. Also, considering the comments from Reviewer 2 concerning the interpretation of this analysis, we have elected to remove this result from our revision.

      (6) I wonder if data from the baseline and transfer phases can also be modeled, using a simple Fehr-Schmidt model. This way, the change in alpha/beta can also be examined between the baseline and transfer phase.

      We agree with the Reviewer that a simplified F-S model could be used, in principle, to characterize Baseline and Transfer phase behavior, but it is our view that the rejection rates provide readers with the clearest (and simplest) picture of how participants are responding to inequity. Put another way, we believe that the added complexity of using (and explaining) a new model to characterize simple, steady-state choice behavior (within these phases) would not be justified or add appreciable insights about participants’ behavior.

      (7) I quite liked Study 2 which tests the generalization effect, and I expected to see an adapted computational modeling to directly reflect this idea. Indeed, the authors wrote, "[...] given that this model [...] assumes the sort of generalization of preferences between offer types [...]". But where exactly did the preference learning model assume the generalization? In the methods, the modeling seems to be only about Study 1; did the authors advise their model to accommodate Study 2? The authors also ran simulation for the learning phase in Study 2 (Figure 6), and how did the preference update (if at all) for offers (90:10 and 10:90) where feedback was not given? Extending/Unpacking the computational modeling results for Study 2 will be very helpful for the paper.

      We are appreciative of the Reviewer’s positive impression of Experiment 2. Upon reflection, we realize that our original submission was not clear about the modeling done in Experiment 2, and we should clarify here that we did also fit the Preference Inference model to this dataset. As in Experiment 1, this model assumes that the participants represent the teacher’s preference as a Fehr-Schmidt form utility function and infer the Teacher’s Envy and Guilt parameters through learning. The model indicates that, on the basis of experience with the Teacher’s preferences on moderately unfair offers (i.e., offer 70:30 and offer 30:70), participants can successfully infer these two parameters and, in turn, compute Fehr-Schmidt utility to guide their decisions for the extreme unfair offers (i.e., offer 90:10 and offer 10:90).

      In response to this comment, we have made this clearer in our Results (Line 377-382):

      “Finally, following Experiment 1, we fit a series of computational models of Learning phase choice behavior, comparing the goodness-of-fit of the four best-fitting models from Experiment 1 (see Methods). As before, we found that the Preference Inference model provided the best fit of participants’ Learning Phase behavior (Figure S1a, Table S12). Given that this model is able to infer the Teacher’s underlying inequity-averse preferences (rather than learns offer-specific rejection preferences), it is unsurprising that this model best describes the generalization behavior observed in Experiment 2.”

      and in our revised Methods (Line 551-553)

      “We considered 6 computational models of Learning Phase choice behavior, which we fit to individual participants’ observed sequences of choices, in both Experiments 1 and 2, via Maximum Likelihood Estimation”

      Reviewer #2 (Public review):

      Summary:

      This study investigates whether individuals can learn to adopt egalitarian norms that incur a personal monetary cost, such as rejecting offers that benefit them more than the giver (advantageous inequitable offers). While these behaviors are uncommon, two experiments demonstrate that individuals can learn to reject such offers through vicarious learning - by observing and acting in line with a "teacher" who follows these norms. The authors use computational modelling to argue that learners adopt these norms through a sophisticated process, inferring the latent structure of the teacher's preferences, akin to theory of mind.

      Strengths:

      This paper is well-written and tackles a critical topic relevant to social norms, morality, and justice. The findings, which show that individuals can adopt just and fair norms even at a personal cost, are promising. The study is well-situated in the literature, with clever experimental design and a computational approach that may offer insights into latent cognitive processes. Findings have potential implications for policymakers.

      Weaknesses:

      Note: in the text below, the "teacher" will refer to the agent from which a participant presumably receives feedback during the learning phase.

      (1) Focus on Disadvantageous Inequity (DI): A significant portion of the paper focuses on responses to Disadvantageous Inequitable (DI) offers, which is confusing given the study's primary aim is to examine learning in response to Advantageous Inequitable (AI) offers. The inclusion of DI offers is not well-justified and distracts from the main focus. Furthermore, the experimental design seems, in principle, inadequate to test for the learning effects of DI offers. Because both teaching regimes considered were identical for DI offers the paradigm lacks a control condition to test for learning effects related to these offers. I can't see how an increase in rejection of DI offers (e.g., between baseline and generalization) can be interpreted as speaking to learning. There are various other potential reasons for an increase in rejection of DI offers even if individuals learn nothing from learning (e.g. if envy builds up during the experiment as one encounters more instances of disadvantageous fairness).

      We are appreciative of the Reviewer’s insight here and for the opportunity to clarify our experimental logic. We included DI offers in order to 1) expose participants to the full spectrum of offer types and avoid focusing participants exclusively upon AI offers, which might result in a demand characteristic, and 2) afford exploration of how learning dynamics might differ in DI contexts (which were, to some extent, examined in our previous study; FeldmanHall, Otto, & Phelps, 2018) versus AI contexts. Furthermore, as this work builds critically on our previous study, we reasoned that replicating these original findings (in the DI context) would be important for demonstrating the generality of the learning effects in the DI context across experimental settings. We now remark on this point in our revised Introduction, Line 129 ~ 132:

      “In addition, to mechanistically probe how punitive preferences are acquired in Adv-I and Dis-I contexts—in turn, assessing the replicability of our earlier study investigating punitive preference acquisition in the Dis context—we also characterize trial-by-trial acquisition of punitive behavior with computational models of choice.”

      (2) Statistical Analysis: The analysis of the learning effects of AI offers is not fully convincing. The authors analyse changes in rejection rates within each learning condition rather than directly comparing the two. Finding a significant effect in one condition but not the other does not demonstrate that the learning regime is driving the effect. A direct comparison between conditions is necessary for establishing that there is a causal role for the learning regime.

      We agree with the Reviewer and, upon reflection, believe that direct comparisons between conditions would be helpful to support the claim that the different learning conditions are responsible for the observed learning effects. In brief, these specific tests buttress the idea that exposure to AI-averse preferences results in increases in AI punishment rates in the Transfer phase (over and above the rates observed for participants who were only exposed to DI-averse preferences).

      Accordingly, our revision now reports statistics concerning the differences between conditions for AI offers in Experiment 1 (Line 198~ 207):

      “Importantly, when comparing these changes between the two learning conditions, we observed significant differences in rejection rates for Adv-I offers: compared to exposure to a Teacher who rejected only Dis-I offers, participants exposed to a Teacher who rejected both Dis-I and Adv-I offers were more likely to reject Adv-I offers and rated these offers as more unfair. This difference between conditions was evident in both 30:70 offers (Rejection rates: β(SE) = 0.10(0.04), p = 0.013; Fairness ratings: β(SE) = -0.86(0.17), p < 0.001) and 10:90 offers (Rejection rates: β(SE) = 0.15(0.04), p < 0.001; Fairness ratings: β(SE) = -1.04(0.17), p < 0.001). As a control, we also compared rejection rates and fairness rating changes between conditions in Dis-I offers (90:10 and 70:30) and Fair offers (i.e., 50:50) but observed no significant difference (all ps > 0.217), suggesting that observing an Adv-I-averse Teacher’s preferences did not influence participants’ behavior in response to Dis-I offers.”

      Line 222 ~ 230:

      “A mixed-effects logistic regression revealed a significantly larger (positive) effect of trial number on rejection rates of Adv-I offers for the Adv-Dis-I-Averse condition compared to the Dis-I-Averse condition. This relative rejection rate increase was evident both in 30:70 offers (Table S7; β(SE) = -0.77(0.24), p < 0.001) and in 10:90 offers (β(SE) = -1.10(0.33), p < 0.001). In contrast, comparing Dis-I and Fairness offers when the Teacher showed the same tendency to reject, we found no significant difference between the two conditions (90:10 splits: β(SE) = -0.48(0.21), p = 0.593; 70:30 splits: β(SE) = -0.01(0.14), p = 0.150; 50:50 splits: β(SE) = -0.00(0.21), p = 0.086). In other words, participants by and large appeared to adjust their rejection choices in accordance with the Teacher’s feedback in an incremental fashion.”

      And in Experiment 2 Line 333 ~ 345:

      “Similar to what we observed in Experiment 1 (Figure 4a), compared to the participants in the Dis-I-Averse Condition, participants in the Adv-I-Averse Condition increased their rates of rejection of extreme Adv-I offers (i.e., 10:90) in the Transfer Phase, relative to the Baseline phase (β(SE) = -0.12(0.04), p < 0.004; Table S9), suggesting that participants’ learned (and adopted) Adv-I-averse preferences generalized from one specific offer type (30:70) to an offer type for which they received no Teacher feedback (10:90). Examining extreme Dis-I offers where the Teacher exhibited identical preferences across the two learning conditions, we found no difference in the changes of rejection rates from Baseline to Transfer phase between conditions (β(SE) = -0.05(0.04), p < 0.259). Mirroring the observed rejection rates (Figure 4b), relative to the Dis-I-Averse Condition, participants’ fairness ratings for extreme Adv-I offers increased more from the Baseline to Transfer phase in the Adv-Dis-I-Averse Condition than in the Dis-I-Averse condition (β(SE) = -0.97(0.18), p < 0.001), but, importantly, changes in fairness ratings for extreme Dis-I offers did not differ significantly between learning conditions (β(SE) = -0.06(0.18), p < 0.723).”

      Line 361 ~ 368:

      “Examining the time course of rejection rates in Adv-I contexts during the Learning phase (Figure 5) revealed that participants learned over time to punish mildly unfair 30:70 offers, and these punishment preferences generalized to more extreme offers (10:90). Specifically, compared to the Dis-I-Averse Condition, in the Adv-Dis-I-Averse condition we observed a significantly larger increasing trend in rejection rates for 10:90 (Adv-I) offers (Figure 5, β(SE) = -0.81(0.26), p < 0.002, mixed-effects logistic regression, see Table S10). Again, when comparing the rejection rate increase in the extreme Dis-I offers (90:10), we did not find a significant difference between conditions (β(SE) = -0.25(0.19), p < 0.707).”

      (3) Correlation Between Learning and Contagion Effects:

      The authors argue that correlations between learning effects (changes in rejection rates during the learning phase) and contagion effects (changes between the generalization and baseline phases) support the idea that individuals who are better aligning their preferences with the teacher also give more consideration to the teacher's preferences later during generalization phase. This interpretation is not convincing. Such correlations could emerge even in the absence of learning, driven by temporal trends like increasing guilt or envy (or even by slow temporal fluctuations in these processes) on behalf of self or others. The reason is that the baseline phase is temporally closer to the beginning of the learning phase whereas the generalization phase is temporally closer to the end of the learning phase. Additionally, the interpretation of these effects seems flawed, as changes in rejection rates do not necessarily indicate closer alignment with the teacher's preferences. For example, if the teacher rejects an offer 75% of the time then a positive 5% learning effect may imply better matching the teacher if it reflects an increase in rejection rate from 65% to 70%, but it implies divergence from the teacher if it reflects an increase from 85% to 90%. For similar reasons, it is not clear that the contagion effects reflect how much a teacher's preferences are taken into account during generalization.

      This comment is very similar to a previous comment made by Reviewer 1, who also called into question the interpretability of these correlations. In response to both of these comments we have elected to remove these analyses from our revision.

      (4) Modeling Efforts: The modelling approach is underdeveloped. The identification of the "best model" lacks transparency, as no model-recovery results are provided, and fits for the losing models are not shown, leaving readers in the dark about where these models fail. Moreover, the reinforcement learning (RL) models used are overly simplistic, treating actions as independent when they are likely inversely related (for example, the feedback that the teacher would have rejected an offer provides feedback that rejection is "correct" but also that acceptance is "an error", and the latter is not incorporated into the modelling). It is unclear if and to what extent this limits current RL formulations. There are also potentially important missing details about the models. Can the authors justify/explain the reasoning behind including the variants they consider? What are the initial Q-values? If these are not free parameters, what are their values?

      We are appreciative of the Reviewer for identifying these potentially unaddressed questions.

      The RL models we consider in the present study are naïve models which, in our previous study (FeldmanHall, Otto, & Phelps, 2018), we found to capture important aspects of learning. While simplistic, we believe these models serve as a reasonable baseline for evaluating more complex models, such as the Preference Inference model. We have made this point more explicit in our revised Introduction, Line 129 ~ 132:

      “In addition, to mechanistically probe how punitive preferences may be acquired in Adv-I and Dis-I contexts—in turn, assessing the replicability of our earlier study investigating punitive preference acquisition in the Dis-I context—we also characterize trial-by-trial acquisition of punitive behavior with computational models of choice.”

      Again, following from our previous modeling of observational learning (FeldmanHall et al., 2018), we believe that the feedback the Teacher provides here is ideally suited to the RL formalism. In particular, when the Teacher indicates that the participant’s choice is what they would have preferred, the model receives a reward of ‘1’ (e.g., the participant rejects and the Teacher indicates they would have preferred rejection, resulting in a positive prediction error); otherwise, the model receives a reward of ‘0’ (e.g., the participant accepts and the Teacher indicates they would have preferred rejection, resulting in a negative prediction error), indicating that the participant did not choose in accordance with the Teacher’s preferences. Through an error-driven learning process, these models provide a naïve way of learning to act in accordance with the Teacher’s preferences.
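
      The reward coding described above can be illustrated with a minimal error-driven update. This sketch is not the fitted model code: the function names, the learning rate of 0.3, and the inverse temperature of 5.0 are hypothetical choices for illustration, and the starting value of 0.5 for every action is assumed purely to keep the example simple.

```python
import math

OFFER_TYPES = ["90:10", "70:30", "50:50", "30:70", "10:90"]
ACTIONS = ["accept", "reject"]

def init_q(initial_value=0.5):
    """Separate action values for each offer type (assumed starting value)."""
    return {offer: {action: initial_value for action in ACTIONS} for offer in OFFER_TYPES}

def p_reject(q, offer, inverse_temperature=5.0):
    """Two-action softmax; the inverse temperature is a hypothetical value."""
    diff = q[offer]["reject"] - q[offer]["accept"]
    return 1.0 / (1.0 + math.exp(-inverse_temperature * diff))

def update(q, offer, choice, teacher_prefers, learning_rate=0.3):
    """Reward is 1 if the choice matches the Teacher's stated preference, else 0."""
    reward = 1.0 if choice == teacher_prefers else 0.0
    prediction_error = reward - q[offer][choice]
    q[offer][choice] += learning_rate * prediction_error  # only the chosen action moves
    return prediction_error

q = init_q()
pe_match = update(q, "30:70", choice="reject", teacher_prefers="reject")
pe_mismatch = update(q, "30:70", choice="accept", teacher_prefers="reject")
```

      Note that values are stored per offer type, so feedback on 30:70 offers leaves the values for 10:90 offers untouched; this is the sense in which such a model cannot generalize across offers on its own.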

      Regarding the requested model details: When treating the initial values as free parameters (model 5), we set Q(reject, offertype) as free values in [0,1] and Q(accept, offertype) as 0.5. This setting can capture participants' initial tendency to reject or accept offers of a given offer type. When the initial values are fixed, for all offer types we set Q(reject, offertype) = Q(accept, offertype) = 0.5. In practice, when the initial values are fixed, setting them to 0.5 or 0 doesn’t make much difference. We have clarified these points in our revised Methods, Line 575 ~ 576:

      “We kept the initial values fixed in this model, that is Q<sub>0</sub>(reject,offertype) =0.5, (offertype ∈ 90:10, 70:30, 50:50, 30:70, 10:90)”

      And Line 582 ~ 584:

      “Formally, this model treats Q<sub>0</sub>(reject,offertype) =0.5, (offertype ∈ 90:10, 70:30, 50:50, 30:70, 10:90) as free parameters with values between 0 and 1.”

      (5) Conceptual Leap in Modeling Interpretation: The distinction between simple RL models and preference-inference models seems to hinge on the ability to generalize learning from one offer to another. Whereas in the RL models learning occurs independently for each offer (hence no cross-offer generalization), preference inference allows for generalization between different offers. However, the paper does not explore RL models that allow generalization based on the similarity of features of the offers (e.g., payment for the receiver, payment for the offer-giver, who benefits more). Such models are more parsimonious and could explain the results without invoking a theory of mind or any modelling of the teacher. In such model versions, a learner learns a functional form that allows it to predict the teacher's feedback based on said offer features (e.g., linear or quadratic form). Because feedback for an offer modulates the parameters of this function (feature weights), generalization occurs without necessarily evoking any sophisticated model of the other person. This leaves open the possibility that RL models could perform just as well or even show superiority over the preference learning model, casting doubt on the authors' conclusions. Of note: even the behaviourists knew that as Little Albert was taught to fear rats, this fear generalized to rabbits. This could occur simply because rabbits are somewhat similar to rats. But this doesn't mean Little Albert had a sophisticated model of animals he used to infer how they behave.

      We are appreciative of the Reviewer for their suggestion of an alternative explanation for the observed generalization effects. Our understanding of the suggestion, put simply, is that an RL model could capture the observed generalization effects if the model were to learn and update a functional form of the Teacher’s rejection preferences using an RL-like algorithm. This idea is similar, conceptually, to our account of preference learning, whereby the learner has a representation of the teacher’s preferences. Because offers in our experiment lie in the range [0, 100], the crux of this idea is why participants should adopt a functional form (either v-shaped or quadratic) with its minimum at 50. This is important because, at the beginning of the learning phase, the rejection rates are already v-shaped with 50 as the minimum, so participants do not need to adjust the minimum of this functional form. Thus, if we assume that participants represent the teacher’s rejection rate as a v-shaped function with a minimum at [50,50], then this very likely implies that participants have a representation that the teacher has a preference for fairness. Overall, we agree that with a suitable setup of the functional form, one could implement an RL model to capture the generalization effects without presupposing an internal “model” of the teacher’s preferences.

      However, there is another way of modeling the generalization effect: truly “model-free”, similarity-based reinforcement learning. In this approach, we do not assume any particular functional form of the teacher’s preferences, but rather assume that experience acquired with one offer type can be generalized to offers that are close (i.e., similar) to the original offer. Accordingly, we implement this idea using a simple RL model in which the action values for each offer type are updated with a learning rate that is scaled by the distance between that offer and the experienced offer (i.e., the offer that generated the prediction error). This scaling follows a Gaussian function of distance, similar to the case in Gaussian process regression (cf. Schulz, Speekenbrink, & Krause, 2018). The initial value of the ‘Reject’ action, for each offer, is set to a free parameter between 0 and 1, and the initial value for the ‘Accept’ action is set to 0.5. The results show that even though this model exhibits the trend of increasing rejection rates observed in the AI-DI punish condition, the initial preferences (i.e., the starting point of learning) diverge markedly from the Learning phase behavior we observed in Experiment 1:

      Author response image 1.

      This demonstrated that the participant at least maintains a representation of the teacher’s preference at the beginning. That is, they have prior knowledge about the shape of this preference. We incorporated this property into the model, that is, we considered a new model that assumes v-shaped starting values for rejection with two parameters, alpha and beta, governing the slope of this v-shaped function (this starting value actually mimics the shape of the preference functions of the Fehr-Schmidt model). We found that this new model (which we term the “Model RL Sim Vstart”) provided a satisfactory qualitative fit of the Transfer phase learning curves in Experiment 1 (see below).

      Author response image 2.
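
      The similarity-based generalization underlying both of these RL variants can be sketched as follows. This is a simplified illustration rather than the fitted model: only ‘Reject’ values are tracked, all starting values are set to 0.5 instead of being free or v-shaped, and the kernel width of 20 and base learning rate of 0.3 are hypothetical values.

```python
import math

OFFERS = [90, 70, 50, 30, 10]  # Proposer's share for each offer type

def similarity(offer_a, offer_b, width=20.0):
    """Gaussian similarity between two offers; width is a hypothetical parameter."""
    return math.exp(-((offer_a - offer_b) ** 2) / (2.0 * width ** 2))

def update_all(q_reject, experienced, reward, base_learning_rate=0.3, width=20.0):
    """Apply the prediction error from the experienced offer to every offer type,
    with a learning rate that shrinks as offers become less similar."""
    prediction_error = reward - q_reject[experienced]
    for offer in q_reject:
        rate = base_learning_rate * similarity(offer, experienced, width)
        q_reject[offer] += rate * prediction_error

# Assumed flat starting values, for illustration only.
q_reject = {offer: 0.5 for offer in OFFERS}

# The Teacher endorses rejecting a 30:70 offer (reward of 1 for 'Reject').
update_all(q_reject, experienced=30, reward=1.0)
```

      After this single update the ‘Reject’ value rises most for 30:70, less for the neighboring 10:90 and 50:50 offers, and only negligibly for 90:10, which is how generalization arises here without any representation of the Teacher’s utility function.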

      However, we did not adopt this model as the best model for the following reasons. First, this model yielded a larger AIC value (indicating worse quantitative fit) compared to our Preference Inference model in both Experiments 1 and 2, likely owing to its increased complexity (5 free parameters versus 4 in the Preference Inference model). Second, we believe that inclusion of this model in our revised submission would be more distracting than helpful on account of the added complexity of explaining and justifying these assumptions and, of course, its comparatively poor goodness of fit (relative to the Preference Inference model).

      (6) Limitations of the Preference-Inference Model: The preference-inference model struggles to capture key aspects of the data, such as the increase in rejection rates for 70:30 DI offers during the learning phase (e.g. Figure 3A, AI+DI blue group). This is puzzling.

      Thinking about this I realized the model makes quite strong unintuitive predictions that are not examined. For example, if a subject begins the learning phase rejecting the 70:30 offer more than 50% of the time (meaning the starting guilt parameter is higher than 1.5), then over learning the tendency to reject will decrease to below 50% (the guilt parameter will be pulled down below 1.5). This is despite the fact the teacher rejects 75% of the offers. In other words, as learning continues learners will diverge from the teacher. On the other hand, if a participant begins learning to tend to accept this offer (guilt < 1.5) then during learning they can increase their rejection rate but never above 50%. Thus one can never fully converge on the teacher. I think this relates to the model's failure in accounting for the pattern mentioned above. I wonder if individuals actually abide by these strict predictions. In any case, these issues raise questions about the validity of the model as a representation of how individuals learn to align with a teacher's preferences (given that the model doesn't really allow for such an alignment).

      In response to this comment, we explain our efforts to build a new model that might be able to conceptually resolve the issue identified by the Reviewer.

      The key intuition guiding the Preference Inference model is a Bayesian account of learning, which we aimed to further simplify. In this setting, a Bayesian learner maintains a representation of the teacher’s inequity aversion parameters and updates it according to the teacher’s (observed) behavior. Intuitively, the posterior distribution shifts toward the likelihood of the teacher’s action. On this view, when the teacher rejects, for instance, an AI offer, the learner should assign a higher probability to larger values of the Guilt parameter, and in turn the learner should change their posterior estimate to better capture the teacher’s preferences.

      In the current study, we simplified this idea, implementing this sort of learning using incremental “delta rule” updating (e.g. Equation 8 of the main text). Then the key question is to define the “teaching signal”. Assuming that the teacher rejects an offer 70:30, based on Bayesian reasoning, the teacher’s envy parameter (α) is more likely to exceed 1.5 (computed as 30/(50-30), per equation 7) than to be smaller than 1.5. Thus, 1.5, which is then used in equation 8 to update α, can be thought of as a teaching signal. We simply assumed that if the initial estimate is already greater than 1.5, which means the prior is consistent with the likelihood, no updating would occur. This assumption raises the question of how to set the learning rate range. In principle, an envy parameter that is larger than 1.5 should be the target of learning (i.e., the teaching signal), and thus our model definition allows the learning rate to be greater than 1, incorporating this possibility.

      Our simplified Preference Inference model already captures some key aspects of the participants’ learning behavior. However, it may fail in the following case: assume that the participant has an initial estimate of 1.51 for the envy parameter (α), and that this corresponds to a rejection rate of 60%. No matter how many times the teacher rejects a 70:30 offer, the participant’s estimate of the envy parameter remains the same, but observing even one acceptance would decrease this estimate and, in turn, decrease the model’s predicted rejection rate. We believe this is the anomalous behavior identified by the Reviewer: for 70:30 offers, the model does not appear able to recreate participants’ behavior.

      This issue touches the core of our model specification, that is, the choice of the teaching signal. Because we chose 1.5 as the teaching signal (i.e., the bound on α implied whenever the teacher rejects or accepts a 70:30 offer), an estimate that deviates only slightly from 1.5 disables updating in one direction. One way to mitigate this problem would be to choose a teaching signal for α greater than 1.5, such that when the Teacher rejects a 70:30 offer, we assign a number greater than 1.5 (by ‘hard-coding’ this into the model via modification of equation 7). One sensible candidate value could be the midpoint between 1.5 and 10 (the maximum value of α per our model definition). Intuitively, under this setting the model could still pull up an estimate of α such as 1.51 when the teacher rejects 70:30, thus alleviating (but not completely eliminating) the anomaly.
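
      The anomaly and the proposed mitigation can be reproduced with a small sketch of a one-sided delta-rule update. This is an illustration under assumptions, not a transcription of equations 7 and 8: the function name `update_envy`, the learning rate of 0.5, and the exact rule for when no update occurs are hypothetical, while the threshold of 1.5, the maximum of 10, and the midpoint candidate come from the text above.

```python
ALPHA_MAX = 10.0                   # maximum envy value per the model definition
THRESHOLD_70_30 = 30 / (50 - 30)   # 1.5, the value quoted for 70:30 offers
MIDPOINT_SIGNAL = (THRESHOLD_70_30 + ALPHA_MAX) / 2  # candidate modified signal

def update_envy(alpha, teacher_rejects, reject_signal=THRESHOLD_70_30, learning_rate=0.5):
    """One delta-rule step on the envy estimate after feedback on a 70:30 offer.
    Assumption: no update when the estimate already agrees with the feedback."""
    if teacher_rejects:
        if alpha >= reject_signal:
            return alpha  # estimate already at or above the teaching signal
        return alpha + learning_rate * (reject_signal - alpha)
    if alpha <= THRESHOLD_70_30:
        return alpha      # acceptance is already consistent with the estimate
    return alpha + learning_rate * (THRESHOLD_70_30 - alpha)

# Original signal: repeated rejections leave an estimate of 1.51 unchanged...
alpha_stuck = 1.51
for _ in range(20):
    alpha_stuck = update_envy(alpha_stuck, teacher_rejects=True)

# ...whereas a single acceptance pulls it down toward 1.5.
alpha_after_accept = update_envy(1.51, teacher_rejects=False)

# Modified signal: a rejection now pulls the same estimate upward.
alpha_modified = update_envy(1.51, teacher_rejects=True, reject_signal=MIDPOINT_SIGNAL)
```

      With the original signal, only acceptances can move an estimate of 1.51, so the predicted rejection rate can only fall; with the midpoint signal, rejections move the estimate upward again, although estimates already above the midpoint would still be unaffected by further rejections.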

      We fitted this modified Preference Inference model to the data from Experiment 1 (see Author response image 3 below) and found that even though this model has a smaller AIC (and thus better quantitative fit than the original Preference Inference model), it still doesn’t fully capture the participants’ behavior for 70:30 offers.

      Author response image 3.

      Accordingly, rather than revising our model to include an unprincipled ‘kludge’ to account for this minor anomaly in the model behavior, we have opted to report our original model in our revision as we still believe it parsimoniously captures our intuitions about preference learning and provides a better fit to the observed behavior than the other RL models considered in the present study.

      Reviewer #1 (Recommendations for the authors):

      (1) I do not particularly prefer the acronyms AI and DI for disadvantageous inequity and advantageous inequity. Although they have been used in the literature, not every single paper uses them. More importantly, AI these days has such a strong meaning of artificial intelligence, so when I was reading this, I'd need to very actively inhibit this interpretation. I believe for the readability for a wider readership of eLife, I would advise not to use AI/DI here, but rather use the full terms.

      We thank the Reviewer for this suggestion. As the full spellings of the two terms are somewhat lengthy and appear frequently in the figures, we have elected to change the abbreviations for disadvantageous inequity and advantageous inequity to Dis-I and Adv-I, respectively, in the main text and the supplementary information. We still use AI/DI in the response letter to keep the terminology consistent.

      (2) Do "punishment rate" and "rejection rate" mean the same? If so, it would be helpful to stick with one single term, eg, rejection rate.

      We thank the Reviewer for this suggestion. As these terms have the same meaning, we have opted to use the term “rejection rate” throughout the main text.

      (3) For the linear mixed effect models, were other random effect structures also considered (eg, random slopes of experimental conditions)? It might be worth considering a few model specifications and selecting the best one to explain the data.

      Thanks for this comment. Following established best practices (Barr, Levy, Scheepers, & Tily, 2013) we have elected to use a maximal random effects structure, whereby all possible predictor variables in the fixed effects structure also appear in the random effects structure.

      (4) For equation (4), the softmax temperature is denoted as tau, but later in the text, it is called gamma. Please make it consistent.

      We are appreciative of the Reviewer’s attention to detail. We have corrected this error.

      Reviewer #2 (Recommendations for the authors):

      (1) Several Tables in SI are unclear. I wasn't clear if these report raw probabilities or coefficients of mixed models. For any mixed models, it would help to give the model specification (e.g., Walkins form) and explain how variables were coded.

      We are appreciative of the Reviewer’s attention to detail. We have clarified, in the captions accompanying our supplemental regression tables, that these coefficients represent log-odds. Regretfully, we are unaware of the “Walkins form” the Reviewer references (even after extensive searching of the scientific literature). However, in our new revision we do include lme4 model syntax in our supplemental information, which we believe will be helpful for readers seeking to replicate our model specification.

      (2) In one of the models it was said that the guilt and envy parameters were bounded between 0-1 but this doesn't make sense and I think values outside this range were later reported.

      We are again appreciative of the Reviewer’s attention to detail. This was an error, which we have corrected: the actual range is [0,10].

      (3) It is unclear if the model parameters are recoverable.

      In response to this comment our revision now reports a basic parameter recovery analysis for the winning Preference Inference model. This is reported in our revised Methods:

      “Finally, to verify if the free parameters of the winning model (Preference Inference) are recoverable, we simulated 200 artificial subjects, based on the Learning Phase of Experiment 1, with free parameters randomly chosen (uniformly) from their defined ranges. We then employed the same model-fitting procedure as described above to estimate these parameter values. We found that all parameters of the model can be recovered (see Figure S2).”
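
      The recovery procedure quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: a static Fehr-Schmidt choice rule with a logistic link stands in for the Preference Inference model, a grid search stands in for the authors' fitting procedure, the inverse temperature is fixed at 1, and the number of trials per offer type (40) is hypothetical. Only the 200 simulated subjects, the uniform sampling of parameters, and the [0, 10] range come from the text.

```python
import numpy as np

# Responder's share and proposer's share (fractions of the pie) for the five
# offer types; "90:10" is read as proposer:responder (an assumption).
OFFERS = np.array([[0.1, 0.9], [0.3, 0.7], [0.5, 0.5], [0.7, 0.3], [0.9, 0.1]])
GRID = np.linspace(0.0, 10.0, 41)  # candidate values for alpha and beta


def accept_prob(alpha, beta, own, other, inv_temp=1.0):
    """Probability of accepting an offer under a Fehr-Schmidt utility."""
    utility = (own
               - alpha * np.maximum(other - own, 0.0)   # disadvantageous inequity
               - beta * np.maximum(own - other, 0.0))   # advantageous inequity
    return 1.0 / (1.0 + np.exp(-inv_temp * utility))


def simulate(alpha, beta, n_per_offer, rng):
    """Simulate accept/reject choices for one artificial subject."""
    own = np.repeat(OFFERS[:, 0], n_per_offer)
    other = np.repeat(OFFERS[:, 1], n_per_offer)
    accepted = rng.random(own.size) < accept_prob(alpha, beta, own, other)
    return own, other, accepted


def fit(own, other, accepted):
    """Grid-search maximum-likelihood estimate of (alpha, beta)."""
    a, b = np.meshgrid(GRID, GRID, indexing="ij")
    p = accept_prob(a[..., None], b[..., None], own, other)
    p = np.clip(p, 1e-9, 1 - 1e-9)  # avoid log(0)
    loglik = np.where(accepted, np.log(p), np.log(1 - p)).sum(axis=-1)
    i, j = np.unravel_index(np.argmax(loglik), loglik.shape)
    return GRID[i], GRID[j]


def recover(n_subjects=200, n_per_offer=40, seed=0):
    """Draw true parameters uniformly, simulate, refit, and return both sets."""
    rng = np.random.default_rng(seed)
    true = rng.uniform(0.0, 10.0, size=(n_subjects, 2))
    est = np.array([fit(*simulate(a, b, n_per_offer, rng)) for a, b in true])
    return true, est


true, est = recover()
for k, name in enumerate(["alpha", "beta"]):
    r = np.corrcoef(true[:, k], est[:, k])[0, 1]
    print(f"{name}: correlation between simulated and recovered = {r:.2f}")
```

      A high correlation between simulated and recovered values (the kind of relationship a scatter plot would display) is the usual informal criterion for recoverability; what counts as "high enough" is a judgment call.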

      And scatter plots depicting these simulated (versus recovered) parameters are given in Figure S2 of our revised Supplementary Information:

      (4) I was confused about what Figure S2 shows. The text says this is about correlating contagious effects for different offers but the captions speak about learning effects. This is an important aspect which is unclear.

      We have removed this figure in response to both Reviewers’ comments about the limited insights that can be drawn on the basis of these correlations.

    2. Reviewer #1 (Public review):

      Summary:

      Zhang et al. addressed the question of whether advantageous and disadvantageous inequality aversion can be vicariously learned and generalized. Using an adapted version of the ultimatum game (UG), in three phases, participants first gave their own preference (baseline phase), then interacted with a "teacher" to learn their preference (learning phase), and finally were tested again on their own (transfer phase). The key measure is whether participants exhibited similar choice preference (i.e., rejection rate and fairness rating) influenced by the learning phase, by contrasting their transfer phase and baseline phase. Through a series of statistical modeling and computational modeling, the authors reported that both advantageous and disadvantageous inequality aversion can indeed be learned (Study 1), and even be generalised (Study 2).

      Strengths:

      This study is very interesting: it directly adapts the lab's previous work on the observational learning effect on disadvantageous inequality aversion to test both advantageous and disadvantageous inequality aversion. Social transmission of action, emotion, and attitude has started to be looked at recently, hence this research is timely. The use of computational modeling is mostly appropriate and motivated. Study 2, which examined vicarious inequality aversion in conditions where feedback was never provided, is interesting and important for strengthening the reported effects. Both studies have proper justifications for the sample size.

      Weaknesses:

      Despite the strengths, a few conceptual aspects and analytical decisions have to be explained, justified, or clarified.

      INTRODUCTION/CONCEPTUALIZATION

      (1) Two terms seem to be used interchangeably in this work, which they should not be: vicarious/observational learning vs preference learning. For vicarious learning, individuals observe others' actions (and optionally also the corresponding consequences resulting directly from their own actions), whereas, for preference learning, individuals predict, or act on behalf of, the others' actions, and then receive feedback on whether that prediction is correct or not. For the current work, it seems that the experiment is more about preference learning and prediction, and less so about vicarious learning. But the intro and setup are framed heavily around vicarious learning, and later the use of vicarious learning and preference learning is rather mixed in the text. I suggest either toning down the focus on vicarious learning or discussing how the two differ. Some of the references here may be helpful: Charpentier et al., Neuron, 2020; Olsson et al., Nature Reviews Neuroscience, 2020; Zhang & Glascher, Science Advances, 2020

      EXPERIMENTAL DESIGN

      (2) For each offer type, the experiment "added a uniformly distributed noise in the range of (-10, 10)". I wonder what this looks like. With only integers such as 25:75, or even with decimal points? More importantly, is it possible for either the 70:30 or the 90:10 option, after adding the noise, to generate an 80:20 split shown to the participants? If so, for the analyses later, when participants saw the 80:20 split, which condition did this trial belong to? 70:30 or 90:10? And is such noise added only to the learning phase, or also to the baseline/transfer phases? This requires some clarification.

      (3) For the offer conditions (90:10, 70:30, 50:50, 30:70, 10:90) - are they randomized? If so, how is it done? Is it randomized within each participant, and/or also across participants (such that each participant experienced different trial sequences)? This is important, as the order, especially for the learning phase, can largely impact the preference learning of the participants.

      STATISTICAL ANALYSIS & COMPUTATIONAL MODELING

      (4) In Study 1 DI offer types (90:10, 70:30), the rejection rate for DI-AI averse looks consistently higher than that for DI averse (ie, blue line is above the yellow line). Is this significant? If so, how come? Since this is a between-subject design, I would not anticipate such a result (especially for the baseline). Also, for the LME results (eg, Table S3), only interactions were reported but not the main results.

      (5) I do not particularly find this analysis appealing: "we examined whether participants' changes in rejection rates between Transfer and Baseline, could be explained by the degree to which they vicariously learned, defined as the change in punishment rates between the first and last 5 trials of the Learning phase." Naturally, participants' behavior in the first 5 trials in the learning phase will be similar to that in the baseline, and their behavior in the last 5 trials in the learning phase would echo that in the transfer phase. I think it would be stronger to link the preference learning results to the change between baseline and transfer phase, eg, by looking at the difference between alpha (beta) at the end of the learning phase and the initial alpha (beta).

      (6) I wonder if data from the baseline and transfer phases can also be modeled, using a simple Fehr-Schmidt model? This way, the change in alpha/beta can also be examined between the baseline and transfer phase.

      (7) I quite liked Study 2, which tests the generalization effect, and I expected to see an adapted computational model that directly reflects this idea. Indeed, the authors wrote "[...] given that this model [...] assumes the sort of generalization of preferences between offer types [...]". But where exactly did the preference learning model assume the generalization? In the methods, the modeling seems to be only about Study 1; did the authors revise their model to accommodate Study 2? The authors also ran simulations for the learning phase in Study 2 (Figure 6), so how were the preferences updated (if at all) for offers (90:10 and 10:90) where feedback was not given? Extending/unpacking the computational modeling results for Study 2 will be very helpful for the paper.
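
      For readers unfamiliar with the Fehr-Schmidt model referred to in comments (6) and (7), the standard inequity-aversion utility can be sketched as below. This is a generic illustration, not the authors' implementation: the function names, the logistic acceptance rule, the inverse temperature, and the example parameter values are assumptions made for the sketch, as is reading an offer such as "90:10" as proposer:responder.

```python
import math


def fehr_schmidt_utility(own, other, alpha, beta):
    """Fehr-Schmidt utility of a split from the responder's point of view.

    alpha (envy) weights disadvantageous inequity (other > own);
    beta (guilt) weights advantageous inequity (own > other).
    """
    envy = max(other - own, 0.0)
    guilt = max(own - other, 0.0)
    return own - alpha * envy - beta * guilt


def p_accept(own, other, alpha, beta, inv_temp=1.0):
    """Logistic probability of accepting; rejecting is assumed to yield utility 0."""
    u = fehr_schmidt_utility(own, other, alpha, beta)
    return 1.0 / (1.0 + math.exp(-inv_temp * u))


# Hypothetical parameter values, chosen only for illustration.
alpha, beta = 1.5, 0.5
for proposer, responder in [(90, 10), (70, 30), (50, 50), (30, 70), (10, 90)]:
    u = fehr_schmidt_utility(responder, proposer, alpha, beta)
    print(f"{proposer}:{responder} -> utility {u:.1f}")
```

      Fitting alpha and beta separately to baseline and transfer choices, as comment (6) suggests, would amount to maximizing the likelihood of the observed accept/reject decisions under a rule of this kind in each phase.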

      Comments on revisions:

      I kept my original public review, so that future readers can see the progress and development of the manuscript.

      The authors have largely addressed my original questions/concerns, and I have two outstanding comments.

      (a) Related to my original comment #6, where I suggested to apply the F-S model also to the baseline and transfer phase. The authors were inclined not to do it, but in fact later in comment #7 and in the manuscript they opted to use a more complex F-S-based model to their learning phase. I agree that the rejection rate is indeed a clear indication, but for completeness, it'd be more consistent and compelling if the paper follows a model-free (model-agnostic) and model-based approach in all phases of the experiment.

      (b) Related to my original comment #4, I appreciate that the authors have provided more details of their LMM models. But I don't think the specification is accurate regardless. First, the offer levels (50:50, 30:70, 10:90) should not be coded as pure categorical levels. Because they have an ordinal meaning, a single ordinal predictor with three levels should be used. This also avoids the excessive number of interactions the authors have pointed out.

      Second, running a model with only interactions without main effects is flawed. All textbooks on stats emphasize that without the presence of the main effects, the interpretation of interaction only is biased.

      So these LMMs need to be revised before the manuscript eventually gets to a version of record.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #2 (Public Review):

      I would like to express my appreciation for the authors' dedication to revising the manuscript. It is evident that they have thoughtfully addressed numerous concerns I previously raised, significantly contributing to the overall improvement of the manuscript.

      Response: We appreciate the reviewers’ recognition of our efforts in revising the manuscript.

      My primary concern regarding the authors' framing of their findings within the realm of habitual and goal-directed action control persists. I will try to explain my point of view and perhaps clarify my concerns. While acknowledging the historical tendency to equate procedural learning with habits, I believe a consensus has gradually emerged among scientists, recognizing a meaningful distinction between habits and skills or procedural learning. I think this distinction is crucial for a comprehensive understanding of human action control. While these constructs share similarities, they should not be used interchangeably. Procedural learning and motor skills can manifest either through intentional and planned actions (i.e., goal-directed) or autonomously and involuntarily (habitual responses).

      Response: We would like to clarify that, contrary to the reviewer’s assertion of a scientific consensus on this matter, the discussion surrounding the similarities and differences between habits and skills remains an ongoing and unresolved topic of interest among scientists (Balleine and Dezfouli, 2019; Du and Haith, 2023; Graybiel and Grafton, 2015; Haith and Krakauer, 2018; Hardwick et al., 2019; Kruglanski and Szumowska, 2020; Robbins and Costa, 2017). We absolutely agree with the reviewer that “Procedural learning and motor skills can manifest either through intentional and planned actions (i.e., goal-directed) or autonomously and involuntarily (habitual responses)”. But so can habits. Some researchers also highlight the intentional/goal-directed nature of habits (e.g., Du and Haith, 2023, “Habits are not automatic” (preprint), or Kruglanski and Szumowska, 2020, “Habitual behavior is goal-driven”: “definitions of habits that include goal independence as a foundational attribute of habits are begging the question; they effectively define away, and hence dispose of, the issue of whether habits are goal-driven” (p. 1258)). Therefore, there is no clear consensus concerning the concept of habit.

      While we acknowledge the meaningful distinctions between habits and skills, we also recognize a substantial body of literature supporting the overlap between these concepts (cited in our manuscript), particularly at the neural level. The literature clearly indicates that both habits and skills are mediated by subcortical circuits, with a progressive disengagement of cognitive control hubs in frontal and cingulate cortices as repetition evolves. We do not use these concepts interchangeably. Instead, we simply present evidence supporting the assertion that our trained app sequences meet several criteria for their habitual nature.

      Our choice of Balleine and Dezfouli (2018)'s criteria stemmed from the comprehensive nature of their definitions, which effectively synthesized insights from various researchers (Mazar and Wood, 2018; Verplanken et al., 1998; Wood, 2017, etc). Importantly, their list highlights the positive features of habits that were previously overlooked. However, these authors still included a controversial criterion ("habits as insensitive to changes in their relationship to their individual consequences and the value of those consequences"), even though they acknowledged the problems of using outcome devaluation methods and of relying on a null-effect. According to Kruglanski and Szumowska (2020), this criterion is highly problematic as “If, by definition, habits are goal-independent, then any behavior found to be goal-dependent could not be a habit on sheer logical grounds” (p. 1257). In their definition, “habitual behavior is sensitive to the value of the reward (i.e., the goal) it is expected to mediate and is sensitive to the expectancy of goal attainment (i.e., obtainment of the reward via the behavior)” (p. 1265). In fact, some recent analyses of habitual behavior are not using devaluation or revaluation as a criterion (Du and Haith, 2023). This article, for example, ascertains habits using different criteria and provides supporting evidence for trained action sequences being understood as skills, with both goal-directed and habitual components.

      In the discussion of our manuscript, we explicitly acknowledge that the app sequences can be considered habitual or goal-directed in nature and that this terminology does not alter the fact that our overtrained sequences exhibit clear habitual features.

      Watson et al. (2022) aptly detailed my concerns in the following statements: "Defining habits as fluid and quickly deployed movement sequences overlaps with definitions of skills and procedural learning, which are seen by associative learning theorists as different behaviors and fields of research, distinct from habits."

      "...the risk of calling any fluid behavioral repertoire 'habit' is that clarity on what exactly is under investigation and what associative structure underpins the behavior may be lost." I strongly encourage the authors, at the very least, to consider Watson et al.'s (2022) suggestion: "Clearer terminology as to the type of habit under investigation may be required by researchers to ensure that others can assess at a glance what exactly is under investigation (e.g., devaluation-insensitive habits vs. procedural habits)", and to refine their terminology accordingly (to make this distinction clear). I believe adopting clearer terminology in these respects would enhance the positioning of this work within the relevant knowledge landscape and facilitate future investigations in the field.

      Response: We would like to highlight that we have indeed followed Watson et al. (2022)’s recommendations on focusing on other features/criteria of habits at the expense of the outcome devaluation/contingency degradation paradigm, which has been more controversial in the human literature. Our manuscript clearly aligns with Watson et al. (2022)’s recommendations: “there are many other features of habits that are not captured by the key metrics from outcome devaluation/contingency degradation paradigms such as the speed at which actions are performed and the refined and invariant characteristics of movement sequences (Balleine and Dezfouli, 2019). Attempts are being made to develop novel behavioral tasks that tap into these positive features of habits, and this should be encouraged as should be tasks that are not designed to assess whether that behavior is sensitive to outcome devaluation, but capture the definition of habits through other measures”.

      Regarding the authors' use of Balleine and Dezfouli's (2018) criteria to frame recorded behavior as habitual, as well as to acknowledge the study's limitations, it's important to highlight that while the authors labelled the fourth criterion (which they were not fulfilling) as "resistance to devaluation," Balleine and Dezfouli (2018) define it as "insensitive to changes in their relationship to their individual consequences and the value of those consequences." In my understanding, this definition is potentially aligned with the authors' re-evaluation test, namely, it is conceptually adequate for evaluating the fourth criterion (which is the most accepted in the field and probably the one that differentiates habits from skills). Notably, during this test, participants exhibited goal-directed behavior.

      The authors characterized this test as possibly assessing arbitration between goal-directed and habitual behavior, stating that participants in both groups "demonstrated the ability to arbitrate between prior automatic actions and new goal-directed ones." In my perspective, there is no justification for calling it a test of arbitration. Notably, the authors inferred that participants were habitual before the test based on some criteria, but then transitioned to goal-directed behavior based on a different criterion. While I agree with the authors' comment that: "Whether the initiation of the trained motor sequences in experiment 3 (arbitration) is underpinned by an action-outcome association (or not) has no bearing on whether those sequences were under stimulus-response control after training (experiment 1)." they implicitly assert a shift from habit to goal-directed behavior without providing evidence that relies on the same probed mechanism. Therefore, I think it would be more cautious to refer to this test as solely an outcome revaluation test. Again, the results of this test, if anything, provide evidence that the fourth criterion was tested but not met, suggesting participants have not become habitual (or at least undermines this option).

      Response: In our previously revised manuscript, we duly acknowledged that the conventional (perhaps nowadays considered outdated) goal devaluation criterion was not met, primarily due to constraints in designing the second part of the study. We did cite evidence from another similar study that had used devaluation app-trained action sequences to demonstrate habitual qualities (but the reviewer ignored this).

      The reviewer points out that we did use a manipulation of goal revaluation in one of the follow-up tests conducted (although this was not a conventional goal revaluation test inasmuch as it was conducted in a novel context). In this test, please note that we used 2 manipulations: monetary and physical effort. Although we did show that subjects, including OCD patients, were apparently goal-directed in the monetary reward manipulation, this was not so clear when goal re-evaluation involved the physical effort expended. In this effort manipulation, participants were less goal-oriented and OCD patients preferred performing the longer, familiar sequence to the shorter, novel one, thus exhibiting significantly greater habitual tendencies, as compared to controls. Hence, we cannot decisively conclude that the action sequence is goal-directed as the reviewer is arguing. In fact, the evidence is equivocal and may reflect both habitual and goal-directed qualities in the performance of this sequence, consistent with recent interpretations of skilled/habitual sequences (Du and Haith, 2023). Relying solely on this partially met criterion to conclude that the app-trained sequences are goal-directed, and therefore not habitual, would be an inaccurate assessment for several reasons: 1) the action sequences did satisfy all other criteria for being habitual; 2) this approach would rest on a problematic foundation for defining habits, as emphasized by Kruglanski & Szumowska (2020); and 3) it would succumb to the pitfall of subscribing to a zero-sum game perspective, as cautioned by various researchers, including the review by Watson et al. (2022) cited by the referee, thus oversimplifying the nuanced nature of human behavior.

      While we have previously complied with the reviewer’s suggestion on relabelling our follow-up test as a “revaluation test” instead of an “arbitration test”, we have now explicitly removed all mentions of the term “arbitration” (which seems to raise concerns) throughout the manuscript. As the reviewer has suggested, we now use a more refined terminology by explicitly referring to the measured behavior as "procedural habits". We have also extensively revised the discussion section of our manuscript to incorporate the reviewer’s viewpoint. We hope that these adjustments enhance the clarity and accuracy of our manuscript, addressing the concerns raised during this review process.

      In essence, this is an ontological and semantic matter that does not alter our findings in any way. Whether the sequences are considered habitual or goal-directed does not change our findings that 1) both groups displayed equivalent procedural learning and automaticity attainment; 2) OCD patients exhibit greater subjective habitual tendencies via self-reported questionnaires; 3) patients who had elevated compulsivity and habitual self-reported tendencies engaged significantly more with the motor habit-training app, practiced more and reported symptom relief at the end of the study; 4) these particular patients also show an augmented inclination to attribute higher intrinsic value to familiar actions, a possible mechanism underlying compulsions.

      Reviewer #2 (Recommendations For The Authors):

      A few more small comments (with reference to the point numbers indicated in the rebuttal):

      (14) I am not entirely sure why the suggested analysis is deemed impractical (i.e., why it cannot be performed by "pretending" participants received the points they should have received according to their performance). This can further support (or undermine) the idea of effect of reward on performance rather than just performance on performance.

      Response: We have now conducted this analysis, generating scores for each trial of practice after day 20, when participants no longer gained points for their performance. This analysis assesses whether participants’ trial-wise behavioral changes exhibit a similar pattern following simulated relative increases or decreases in scores, as if they had been receiving points at this stage. Note that this analysis has fewer trials available, around 50% fewer on average.

      Before presenting our results, we wish to emphasize the importance of distinguishing between the effects of performance on performance and the effects of reward on performance. In response to a reviewer's suggestion, we assessed the former in the first revision of our manuscript. We normalized the movement time variable and evaluated how normalized behavioral changes responded to score increments and decrements. The results from the original analyses were consistent with those from the normalized data.

      Regarding the phase where participants no longer received scores, we believe this phase primarily helps us understand the impact of 'predicted' or 'learned' rewards on performance. Once participants have learned the simple association between faster performance and larger scores, they can be expected to continue exhibiting the reward sensitivity effects described in our main analysis. We consider it is not feasible to assess the effects of performance on performance during the reward removal phase, which occurs after 20 days. Therefore, the following results pertain to how the learned associations between faster movement times and scores persist in influencing behavior, even when explicit scores are no longer displayed on the screen.

      Results: The main results of the effect of reward on behavioral changes persist, supporting that relative increases or decreases in scores (real or imagined/inferred) modulate behavioral adaptations trial-by-trial in a consistent manner across both cohorts. The direction of the effects of reward is the same as in the main analyses presented in the manuscript: larger mean behavioral changes (smaller std) following ∆R- . First, concerning changes in “normalized” movement time (MT) trial-by-trial, we conducted a 2 x 2 factorial analysis of the centroid of the Gaussian distributions with the same factors Reward, Group and Bin. This analysis demonstrated a significant main effect of Reward (P = 2e-16), but not of Group (P = 0.974) or Bin (P = 0.281). There were no significant interactions between factors. The main Reward effect can be observed in the top panel of the figure below. The same analysis applied to the spread (std) of the Gaussian distributions revealed a significant main effect of Reward (P = 0.000213), with no additional main effects or interactions.

      Author response image 1.

      Next, conducting the same 2 x 2 factorial analyses on the centroid and spread of the Gaussian distributions fitted to the Consistency data, we also obtained a robust significant main effect of Reward. For the centroid variable, we obtained a significant main effect of Reward (P = 0.0109) and Group (P = 0.0294), while Bin and the factor interactions were non-significant. See the top panel of the figure below.

      On the other hand, Reward also modulated significantly the spread of the Gaussian distributions fitted to the Consistency data, P = 0.00498. There were no additional significant main effects or interactions. See the bottom panel in the figure below.

      Note that here the factorial analysis was performed on the logarithmic transformation of the std.

      Author response image 2.

      (16) I find this result interesting and I think it might be worthwhile to include it in the paper.

      Response: We have now included this result in our revised manuscript (page 28)

      (18) I referred to this sentence: "The app preferred sequence was their preferred putative habitual sequence while the 'any 6' or 'any 3'-move sequences were the goal-seeking sequences." In my understanding, this implies one choice is habitual and another indicates goal-directedness.

      One last small comment:
      In the Discussion it is stated: "Moreover, when faced with a choice between the familiar and a new, less effort-demanding sequence, the OCD group leaned toward the former, likely due to its inherent value. These insights align with the theory of goal-direction/habit imbalance in OCD (Gillan et al., 2016), underscoring the dominance of habits in particular settings where they might hold intrinsic value."

      This could equally be interpreted as goal-directed behavior, so I do not think there is conclusive support for this claim.

      Response: The choice of the familiar/trained sequence, as opposed to the 'any 6' or 'any 3'-move sequences, cannot be explicitly considered goal-directed: firstly, because the app familiar sequences were associated with less monetary reward (in the any-6 condition), and secondly, because participants would clearly need more effort and time to perform them. Even though these were automatic, it would still be much easier and faster to simply tap one finger sequentially 6 times (any-6) or 3 times (any-3). Therefore, the choice for the app-sequence would not be optimal/goal-directed. In this sense, that choice aligns with the current theory of goal-direction/habit imbalance of OCD. We found that OCD patients prefer to perform the trained app sequences in the physical effort manipulation (any-3 condition). While this, on the one hand, cannot be explicitly considered a goal-directed choice, we agree that there is another possible goal involved here, which links to the intrinsic value associated with the familiar sequence. In this sense the action could potentially be considered goal-directed. This highlights the difficulty of this concept of value and agrees with: 1) Hommel and Wiers (2017): “Human behavior is commonly not driven by one but by many overlapping motives . . . and actions are commonly embedded into larger-scale activities with multiple goals defined at different levels. As a consequence, even successful satiation of one goal or motive is unlikely to also eliminate all the others” (p. 942) and 2) Kruglanski & Szumowska (2020)’s account that “habits that may be unwanted from the perspective of an outsider and hence “irrational” or purposeless, may be highly wanted from the perspective of the individual for whom a habit is functional in achieving some goal” (p. 1262) and therefore habits are goal-driven.

      References:

      Balleine BW, Dezfouli A. 2019. Hierarchical Action Control: Adaptive Collaboration Between Actions and Habits. Front Psychol 10:2735. doi:10.3389/fpsyg.2019.02735

      Du Y, Haith A. 2023. Habits are not automatic. doi:10.31234/osf.io/gncsf

      Graybiel AM, Grafton ST. 2015. The Striatum: Where Skills and Habits Meet. Cold Spring Harb Perspect Biol 7:a021691. doi:10.1101/cshperspect.a021691

      Haith AM, Krakauer JW. 2018. The multiple effects of practice: skill, habit and reduced cognitive load. Current Opinion in Behavioral Sciences 20:196–201. doi:10.1016/j.cobeha.2018.01.015

      Hardwick RM, Forrence AD, Krakauer JW, Haith AM. 2019. Time-dependent competition between goal-directed and habitual response preparation. Nat Hum Behav 1–11. doi:10.1038/s41562-019-0725-0

      Hommel B, Wiers RW. 2017. Towards a Unitary Approach to Human Action Control. Trends Cogn Sci 21:940–949. doi:10.1016/j.tics.2017.09.009

      Kruglanski AW, Szumowska E. 2020. Habitual Behavior Is Goal-Driven. Perspect Psychol Sci 15:1256–1271. doi:10.1177/1745691620917676

      Mazar A, Wood W. 2018. Defining Habit in Psychology In: Verplanken B, editor. The Psychology of Habit: Theory, Mechanisms, Change, and Contexts. Cham: Springer International Publishing. pp. 13–29. doi:10.1007/978-3-319-97529-0_2

      Robbins TW, Costa RM. 2017. Habits. Current Biology 27:R1200–R1206. doi:10.1016/j.cub.2017.09.060

      Verplanken B, Aarts H, van Knippenberg A, Moonen A. 1998. Habit versus planned behaviour: a field experiment. Br J Soc Psychol 37 ( Pt 1):111–128. doi:10.1111/j.2044-8309.1998.tb01160.x

      Watson P, O’Callaghan C, Perkes I, Bradfield L, Turner K. 2022. Making habits measurable beyond what they are not: A focus on associative dual-process models. Neurosci Biobehav Rev 142:104869. doi:10.1016/j.neubiorev.2022.104869

      Wood W. 2017. Habit in Personality and Social Psychology. Pers Soc Psychol Rev 21:389–403. doi:10.1177/1088868317720362

    1. Author response:

      The following is the authors’ response to the previous reviews

      We have thoroughly addressed all the reviewers’ comments and meticulously revised the manuscript. Key modifications include the following:

      (a) Organizing the Logic and Highlighting Key Findings: We have revised the manuscript to emphasize key findings (especially the distinctions between the SEC and WOI groups) according to the following logic: constructing a receptive endometrial organoid, comparing its molecular characteristics with those of the receptive endometrium, highlighting its main features (hormone response, enhanced energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition), and exploring the function involved in embryo interaction.

      (b) Clarity and Better Description of Bioinformatic Analyses: We have revised the sections involving bioinformatic analyses to provide a more streamlined and comprehensible explanation. Instead of overwhelming the reader with excessive details, we focused on the most important findings, and performed additional experimental validation.

      (c) Rationale for Gene Selection: We have clarified the rationale for selecting certain genes and pathways for inclusion in the analysis and manuscript. The associated gene expression data for all figures have been provided in the attached Dataset.

      (d) In the response letter, we have provided the detailed presentation of the methodological optimization for constructing this endometrial assembloids, along with optimization and comparison of endometrial organoid culture media. Furthermore, in the Limitations section, we have explicitly stated that stromal cells and immune cells gradually diminish with increasing passage numbers. Therefore, this study primarily utilized endometrial assembloids within the first three passages for all investigations.

      Below, we provide a point-by-point response to each comment, with all modifications highlighted in the revised manuscript. We respectfully hope that these revisions effectively address the concerns raised by the reviewers.

      Public Reviews:

      Reviewer #1 (Public Review):

      This study generated 3D cell constructs from endometrial cell mixtures that were seeded in the Matrigel scaffold. The cell assemblies were treated with hormones to induce a "window of implantation" (WOI) state. The authors did their best to revise their study according to the reviewers' comments. However, the study remains unconvincing and at the same time too dense and not focused enough.

      (1) The use of the term organoids is still confusing and should be avoided. Organoids are epithelial tissue-resembling structures. Hence, the multiple-cell aggregates developed here are rather "coculture models" (or "assembloids"). It is still unexpected (unlikely) that these structures containing epithelial, stromal and immune cells can be robustly passaged in the epithelial growth conditions used. All other research groups developing real organoids from endometrium have shown that only the epithelial compartment remains in culture at passaging (while the stromal compartment is lost). If the authors keep to their idea, they should perform scRNA-seq on both early and late (passage 6-10) "organoids". And they should provide details of culturing/passaging/plating etc. that differ from those of other groups and might explain why they keep stromal and immune cells in their culture for such a long time. In other words, they should then compare their method in detail to the standard method of all other researchers in the field, and show the differences in survival and growth of the stromal and immune cells.

      (1) We appreciate your feedback and have revised the term 'organoids' to 'assembloids'.

      (2)

      I. Due to budget constraints, this study did not perform scRNA-seq on both early and late passages (P6-P10). Instead, immunofluorescence staining confirmed the persistence of stromal cells at passage 6 (as shown below).

      Author response image 1.

      Whole-mount immunofluorescence showed that Vimentin+ F-actin+ cells (stromal cells) were arranged around the glandular spheres that were only F-actin+(passage 6).

      II. Improvements in this study include the following.

      a. Optimization of endometrial tissue processing: The procedures for tissue collection, pretreatment, digestion, and culture were refined to maximize the retention of endometrial epithelial cells, stromal cells, and immune cells (detailed optimizations are provided in Response Table 1).

      b. Enhanced culture medium formulation: Based on previous protocols, WNT3A was added to promote organoid development and differentiation (PMID: 27315476), while FGF2 was supplemented to improve stromal cell survival (PMID: 35224622) (see Response Table 2 for medium comparisons). Representative culture outcomes are shown in the figure below.

      We acknowledge that the stromal and immune cells in this system still exhibit differences compared to their in vivo counterparts. In this study, we utilized the first three passages, which offer optimal cell diversity and viability, to meet experimental needs. However, replicating and maintaining the full complexity of endometrial cell types in vitro remains a major challenge in the field—one that we are actively working to address.

      Author response table 1.

      Methodological Optimization of Endometrial Organoids (Construction, Passaging, and Cryopreservation)

      Author response table 2.

      Optimization and comparison of endometrial organoid culture media

      Author response image 2.

      Bright-field microscopy captures the expansion of glands and surrounding stromal cells across passages 0 to 2 (scale bar = 200 μm) (Yellow arrows: stromal cells; White arrows: glands).

      (2) The paper is still much too dense, touching upon all kind of conclusions from the manifold bioinformatic analyses. The latter should be much clearer and better described, and then some interesting findings (pathways/genes) should be highlighted without mentioning every single aspect that is observed. The paper needs a lot of editing to better focus and extract take-home messages, not bombing the reader with a mass of pathways, genes etc which makes the manuscript just not readable or 'digest-able'. There is no explanation whatever and no clear rationale why certain genes are included in a list while others are not. There is the impression that mass bioinformatics is applied without enough focus.

      Thanks for your suggestions. We have made improvements and revisions in the following areas:

      (a) Clarity and Better Description of Bioinformatic Analyses: We have revised the sections involving bioinformatic analyses to provide a more streamlined and comprehensible explanation. Instead of overwhelming the reader with excessive details, we focused on the most important findings.

      (b) Organizing the Logic and Highlighting Key Findings: We have revised the manuscript to emphasize key findings according to the following logic: constructing a receptive endometrial organoid, comparing its molecular characteristics with those of the receptive endometrium, highlighting its main features (hormone response, enhanced energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition), and exploring the function involved in embryo interaction.

      (c) Rationale for Gene Selection: We have clarified the rationale for selecting certain genes and pathways for inclusion in the analysis and manuscript.

      We hope these revisions address your concerns and improve the overall quality and clarity of the manuscript. Thank you once again for your valuable input.

      (3) The study is much too descriptive and does not show functional validation or exploration (except glycogen production). Some interesting findings extracted from the bioinformatics must be functionally tested.

      Thanks for your suggestions. We have restructured the logic and revised the manuscript, incorporating functional validation. The focus is on the following points: highlighting its main features (hormone response, enhanced energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition), and exploring the functions involved in embryo interaction.

      (4) In contrast to what was found in vivo (Wang et al. 2020), no abrupt change in gene expression pattern is mentioned here from the (early-) secretory to the WoI phase. Should be discussed. Although the bioinformatic analyses point into this direction, there are major concerns which must be solved before the study can provide the needed reliability and credibility for revision.

      To further investigate the abrupt change, the Mfuzz algorithm was used to analyze gene expression across the three groups, focusing on gene clusters that were progressively upregulated or downregulated. Mitochondrial and cilia-related genes exhibited the highest expression levels in WOI endometrial organoids, whereas genes related to cell junctions and negative regulation of cell differentiation were downregulated (Figure 4A).
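
      As an illustration only (this is not the authors' analysis code), the soft-clustering idea behind Mfuzz, which is based on fuzzy c-means, can be sketched as follows. The centroids, the fuzzifier value m = 2, and all function names are hypothetical choices made for this sketch; each gene profile is standardised across the three groups and then receives a graded membership in every cluster.

```python
import math

def standardise(profile):
    """Scale one gene's expression profile to mean 0 and sample SD 1."""
    mean = sum(profile) / len(profile)
    var = sum((v - mean) ** 2 for v in profile) / (len(profile) - 1)
    if var == 0:
        raise ValueError("constant profile cannot be standardised")
    sd = math.sqrt(var)
    return [(v - mean) / sd for v in profile]

def memberships(profile, centroids, m=2.0):
    """Fuzzy c-means membership of one profile in each cluster centroid.

    m is the fuzzifier (m > 1); m = 2.0 is an assumed value for this sketch.
    """
    dists = [math.dist(profile, c) for c in centroids]
    zero = [d == 0 for d in dists]
    if any(zero):
        # Profile coincides with one or more centroids: share membership among them.
        share = 1.0 / sum(zero)
        return [share if z else 0.0 for z in zero]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]

# Hypothetical centroids for a rising and a falling trend (CTRL, SEC, WOI).
RISING = [-1.0, 0.0, 1.0]
FALLING = [1.0, 0.0, -1.0]
```

      In this sketch, a gene whose standardised profile rises from CTRL through SEC to WOI receives a membership close to 1 in the rising cluster, and memberships across all clusters always sum to 1.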

      (5) All data should be benchmarked to the Wang et al 2020 and Garcia-Alonso et al. 2021 papers reporting very detailed scRNA-seq data, and not only the Stephen R. Quake 2020 paper.

      We appreciate your suggestion. By integrating data from Garcia-Alonso et al. (2021) (shown in the figure below), we observed that both WOI organoids and SEC organoids exhibit increased glandular secretory epithelium and developed ciliated epithelium, mirroring features of mid-secretory endometrium. The findings exhibit parallels when contrasting these two papers.

      Author response image 3.

      UMAP visualization of integrated scRNA-seq data (our dataset and Garcia-Alonso et al. 2021) showing: (A) cell types, (B) WOI-org, (C) CTRL-org, (D) SEC-org versus published mid-secretory samples.

      (6) Fig. 2B: Vimentin staining is not at all clear. F-actin could be used to show the typical morphology of the stromal cells?

      We appreciate your suggestion. We performed additional staining for F-actin in combination with Vimentin, and found that Vimentin+ F-actin+ cells (stromal cells) were arranged around the glandular spheres that were only F-actin+.

      (7) Where does the term "EMT-derived stromal cells" come from? On what basis has this term been coined?

      Within endometrial biology, stromal cells in the transition from epithelial to mesenchymal phenotype are specifically referred to as 'stromal EMT transition cells' (PMID: 39775038, PMID: 39968688).

      In certain cancers or fibrotic diseases, epithelial cells can transition into a mesenchymal phenotype, contributing to the stromal environment that supports tumor growth or tissue remodeling (PMID: 20572012).

      (8) CD44 is shown in Fig. 2D but the text mentions CD45 (line 159)?

      In Fig 2D, T cells are defined as a cluster of CD45+CD3+ cells, further subdivided into CD4+ and CD8+ T cells based on their expression of CD4 and CD8. This figure does not include data on CD44.

      (9) All quantification experiments (of stainings etc) should be in detail described how this was done. It looks very difficult (almost not feasible) when looking at the provided pictures to count the stained cells.

      a. Manual Measurement:

      For TEM-observed pinopodes, glycogen particles, microvilli, and cilia, manual region-of-interest (ROI) selection was performed using ImageJ software for quantitative analysis of counts, area, and length. Twenty randomly selected images per experimental group were analyzed for each morphological parameter.

      b. Automated Measurement:

      We quantified the fluorescence images using ImageJ. First, the images were preprocessed by adjusting brightness and contrast and by removing background noise with the “Subtract Background” feature.

      Second, a threshold was set to highlight the cells, and the regions of interest (ROIs) were selected using the selection tools. Third, cells were counted via Analyze > Analyze Particles. To measure fluorescence intensity and area, the “Measurement” options were set to mean gray value. Parameters were adjusted as needed, the results were read from the “Results” window, and the data were saved for further analysis. The same settings were used throughout all measurements to keep the results consistent and reliable.

      For 3D fluorescence quantification, ZEN software (Carl Zeiss) was exclusively used, with 11 images analyzed per experimental group. This part has been incorporated into “Supporting Information”, Lines 94-100.
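
      The automated steps above (background subtraction, thresholding, particle counting, mean gray value) can be illustrated with a minimal sketch. It is not the ImageJ implementation: the constant background level, the 4-connectivity rule, and all function names are assumptions made for illustration, and images are represented as plain nested lists of pixel values.

```python
from collections import deque

def subtract_background(image, background):
    """Subtract a constant background level from every pixel, clipping at zero."""
    return [[max(px - background, 0) for px in row] for row in image]

def threshold(image, level):
    """Return a binary mask that is True where a pixel exceeds `level`."""
    return [[px > level for px in row] for row in image]

def count_particles(mask, min_size=1):
    """Count 4-connected foreground regions containing at least `min_size` pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill to measure the size of this region.
            size = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if size >= min_size:
                count += 1
    return count

def mean_gray_value(image, mask):
    """Mean pixel value over the masked region (0.0 if the mask is empty)."""
    values = [px for row, mrow in zip(image, mask)
              for px, keep in zip(row, mrow) if keep]
    return sum(values) / len(values) if values else 0.0
```

      Applying the same background level, threshold, and minimum particle size to every image corresponds to the consistency requirement stated above.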

      c. Normalization Method:

      For fluorescence quantification, DAPI was used as an internal reference for normalization, where both DAPI and target fluorescence channel intensities were quantified simultaneously. The normalized target signal intensity (target/DAPI ratio) was then compared across experimental groups. A minimum of 15 images were analyzed for each parameter per group. This part has been incorporated into “Supporting Information” Line 101-104.
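
      The target/DAPI normalization can be written out as a short sketch. The function names and the nested-list image representation are illustrative assumptions, not the authors' code; the calculation itself is the ratio of mean gray values described above, averaged over the images of a group.

```python
def mean_gray(image):
    """Mean gray value over all pixels of a single-channel image."""
    pixels = [px for row in image for px in row]
    return sum(pixels) / len(pixels)

def normalized_intensity(target_image, dapi_image):
    """Relative fluorescence: mean target signal divided by mean DAPI signal."""
    dapi = mean_gray(dapi_image)
    if dapi == 0:
        raise ValueError("DAPI channel has zero mean intensity")
    return mean_gray(target_image) / dapi

def group_mean(image_pairs):
    """Average the target/DAPI ratio over all (target, DAPI) image pairs in a group."""
    ratios = [normalized_intensity(t, d) for t, d in image_pairs]
    return sum(ratios) / len(ratios)
```

      Because both channels are measured on the same field of view, the ratio compensates for differences in cell density between images before groups are compared.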

      (10) Fig. 3C: it is unclear how quantification can be reliably done. Moreover, OLFM4 looks positive in all cells of Ctrl, but authors still see an increase?

      (a) Fluorescence images were quantitatively analyzed using ImageJ by measuring the mean gray values. For normalization, DAPI staining served as an internal reference, with simultaneous measurement of mean gray values in both the target fluorescence channel and the DAPI channel. The relative fluorescence intensity was then calculated as the ratio of target channel to DAPI signal for inter-group quantitative comparisons.

      (b) OLFM4 is an E2-responsive gene. Its expression in endometrial organoids of the CTRL group is physiologically normal (PMID: 31666317). However, its fluorescence intensity (quantified as mean gray value) was significantly stronger in both the SEC and WOI groups compared to the CTRL group (quantitative method as described above).

      (11) Fig. 3F: Met is downregulated which is not in accordance with the mentioned activation of the PI3K-AKT pathway.

      We appreciate your careful review. Our initial description was imprecise. In the revised manuscript, this statement has been removed entirely.

      (12) Lines 222-226: transcriptome and proteome differences are not significant; so, how meaningful are the results then? Then, it is very hard to conclude an evolution from secretory phase to WoI.

      We appreciate your feedback. The manuscript has been comprehensively revised, and the aforementioned content has been removed.

      (13) WoI organoids show an increased number of cilia. However, some literature shows the opposite, i.e. less ciliated cells in the endometrial lining at WoI (to keep the embryo in place). How to reconcile?

      Thank you for raising this question. We conducted a statistical analysis of the proportion of ciliated cells across endometrial phases.

      (a) Based on the 2020 study by Stephen R. Quake and Carlos Simon’s team published in Nature Medicine (PMID: 32929266), the mid-secretory phase (Days 19–23) exhibited a higher proportion of ciliated cells compared to the early-secretory (Days 15–18) and late-secretory phases (Days 24–28) (Fig. R13 A).

      (b) According to the 2021 study by Roser Vento-Tormo’s team in Nature Genetics, ciliated cell abundance peaked in the early-to-mid-secretory endometrium across all phases (Fig. R13 B-C).

      Data were sourced from the Reproductive Cell Atlas.

      (14) How are pinopodes distinguished from microvilli? Moreover, Fig. 3 does not show the typical EM structure of cilia.

      Thank you for this insightful question.

      (a) Pinopodes are large, bulbous protrusions with a smooth apical membrane. Under transmission electron microscopy (TEM), it can be observed that the pinopodes contain various small particles, which are typically extracellular fluid and dissolved substances.

      Microvilli are elongated, finger-like projections that typically exhibit a uniform and orderly arrangement, forming a "brush border" structure. Under transmission electron microscopy, dense components of the cytoskeleton, such as microfilaments and microtubules, can be seen at the base of the microvilli.

      (b) You may refer to the ciliated TEM structure shown in the current manuscript's Fig. 2E (originally labeled as Fig. 2H in the draft). The cilium is composed of microtubules. The cross-section shows that the periphery of the cilium is surrounded by nine pairs of microtubules arranged in a ring. The longitudinal section shows that the cilium has a long cylindrical structure, with the two central microtubules being quite prominent and located at the center of the cilium.

      (15) There is a recently published paper demonstrating another model for implantation. This paper should be referenced as well (Shibata et al. Science Advances, 2024).

      Thanks for your valuable comments. We have cited this reference in the manuscript at Line 77-78.

      (16) Line 78: two groups were the first here (Turco and Boretto) and should both be mentioned.

      Thanks for your valuable comments. We have cited this reference in the manuscript at Line 74-76.

      (17) Line 554: "as an alternative platform" - alternative to what? Authors answer reviewers' comments by just changing one word, but this makes the text odd.

      Thank you for your review. Here, we propose that this WOI organoid serves as an alternative research platform for studying endometrial receptivity and maternal-fetal interactions, compared to current secretory-phase organoids. In the revised manuscript, we have supplemented the data by co-culturing this WOI organoid with blastoid, demonstrating its robust embryo implantation potential.

      Reviewer #2 (Public Review):

      In this research, Zhang et al. have pioneered the creation of an advanced organoid culture designed to emulate the intricate characteristics of endometrial tissue during the crucial Window of Implantation (WOI) phase. Their method involves the incorporation of three distinct hormones into the organoid culture, coupled with additives that replicate the dynamics of the menstrual cycle. Through a series of assays, they underscore the striking parallels between the endometrial tissue present during the WOI and their crafted organoids. Through a comparative analysis involving historical endometrial tissue data and control organoids, they establish a system that exhibits a capacity to simulate the intricate nuances of the WOI. The authors made a commendable effort to address the majority of the statements. Developing an endometrial organoid culture methodology that mimics the window of implantation is a game-changer for studying the implantation process. However, the authors should strive to enhance the results to demonstrate how different WOI organoids are from SEC organoids, ensuring whether they are worth using in implantation studies, or a proper demonstration using implantation experiments.

      Thank you for your valuable suggestions. The WOI organoids differ from SEC organoids in the following aspects.

      (1) Structurally, WOI endometrial organoids exhibit subcellular features characteristic of the implantation window: densely packed pinopodes on the luminal side of epithelial cells, abundant glycogen granules, elongated and tightly arranged microvilli, and increased cilia (Figure 2F).

      (2) At the molecular level, WOI organoids show enlarged and functionally active mitochondria, enhanced ciliary assembly and motility, and single-cell signatures resembling mid-secretory endometrium.

      Specifically, mitochondrial- and cilia-related genes/proteins are most highly expressed in WOI organoids (Figure 4A,B). TEM analysis revealed that WOI organoids have the largest average mitochondrial area (Figure 4C). Mitochondrial-related genes display an increasing trend across the three organoid groups, and WOI organoids produce more ATP and IL-8 (Figure 4D,E).

      For cilia, WOI organoids upregulate genes/proteins involved in ciliary assembly, basal bodies, and motile cilia, while downregulating non-motile cilia markers (Figure 5A-C).

      Single-cell analysis further confirms that WOI organoids recapitulate mid-secretory endometrial traits in mitochondrial metabolism and cell adhesion (Figure 2G).

      (3) Functionally, WOI organoids demonstrate superior embryo implantation potential. Given the scarcity and ethical constraints of human embryos, we used blastoids for implantation assays (Figure 6A). These blastoids successfully grew within endometrial organoids, established interactions (Figure 6B), and exhibited normal trilineage differentiation (epiblast: OCT4; hypoblast: GATA6; trophoblast: KRT18) (Figure 6C). WOI organoids achieved significantly higher blastoid survival (66% vs. 19% in CTRL and 28% in SEC) and interaction rates (90% vs. 47% in CTRL and 53% in SEC), confirming their robust embryo-receptive capacity (Figure 6D,E).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      In conclusion, it is needed to first meet all the concerns of the reviewers and then submit an appropriately adapted and comprehensive paper (also showing the robustness of the "organoids" and functionality of the findings) instead of this still fully descriptive paper. Further comments are included in the rebuttal document of the authors and will be provided by the editor as PDF.

      Reviewer #2 (Recommendations For The Authors):

      The authors made a good effort to reply to all the statements. However, there are some points that the authors need to address.

      • There is an inconsistency in the manuscript regarding the number of passages in which the organoids are used; in the response to the reviewers, it mentions 5 passages, while in the Materials and Methods section, it states 3 passages.

      We sincerely appreciate your thorough review. In this study, organoids within the first three passages were used. To address the reviewer's question comprehensively, we have now provided a detailed account of the organoid passage history in our response.

      • We agree that the difference between SEC and WOI organoids may be subtle, but in response to this, the authors should explain what they mean by "the most notable differences lie in the more comprehensive differentiation and varied cellular functions exhibited by WOI organoids..."

      In the original manuscript, this statement indicated that, at the single-cell level, WOI endometrial organoids exhibited more functionally mature and thoroughly differentiated characteristics compared to SEC endometrial organoids (See details below).

      In the revised version, we have restructured this section to focus on the following aspects: hormone response, energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition, and embryo implantation potential. Consequently, the statement "the most notable differences lie in the more comprehensive differentiation and varied cellular functions exhibited by WOI organoids..." has been removed.

      (1) Varied cellular functions:

      a. Secretory Epithelium: Compared to SEC organoids, WOI organoids exhibit enhanced peptide metabolism and mitochondrial energy metabolism in their secretory epithelium, supporting endometrial decidualization and embryo implantation (Figure 3F).

      b. Proliferative Epithelium: Compared to SEC organoids, WOI organoids demonstrate enhanced GTPase activity, angiogenesis, cytoskeletal assembly, cell differentiation, and RAS protein signaling in their proliferative epithelium (Figure S2G).

      c. Ciliated Epithelium: The ciliated epithelium of WOI endometrial organoids is associated with the regulation of vascular development and exhibits higher transcriptional activity compared to SEC organoids (Figure 5E).

      d. Stromal Cells: Compared to SEC organoids, WOI organoids exhibit enhanced cell junctions, cell migration, and cytoskeletal regulation in EMT-derived stromal cells (Figure S4A right panel). Similarly, cell junctions are also strengthened in stromal cells (Figure S4A left panel).

      (2) Comprehensive differentiation:

      a. Compared to SEC organoids, WOI organoids exhibit more complete differentiation from proliferative epithelium to secretory epithelium (Figure 3G).

      b. The WOI organoids demonstrate more robust ciliary differentiation compared to SEC organoids (Figure 5F).

      c. The proliferative epithelium progressively differentiates into EMT-derived cells. Compared to SEC organoids, WOI organoids are predominantly localized at the terminal end of the differentiation trajectory, indicating more complete differentiation (Figure S4B).

      • What do the authors mean by "average intensity" when referring to the extra reagents added to the WOI? The results that the authors show in response to Reviewer 2's Q1 must be included as part of the results and explain how it was done in the materials and methods section.

      This parameter indicates the growth status of organoids. It measures the gray value of organoids through long-term live-cell tracking. When organoids undergo apoptosis, they progressively condense into denser solid spheres, leading to an increase in gray value (average intensity). This content has been incorporated into the Results section (Line 129) and is further explained in the Supporting Information "Materials and Methods" (Lines 70-77).

      • In panel 1C, it is not possible to see the stromal cells around because they are brightfield images.

      You are partly right. Bright-field images alone indeed make it difficult to distinguish stromal cells. However, by combining whole-mount immunofluorescence staining with the characteristic elongated spindle-shaped morphology of stromal cells, we were able to roughly determine their distribution in the bright-field images.

      • Responding to Reviewer 2's question Q7, the authors indicate how they establish the cluster. However, they do not specify whether they extrapolate the data from a database or create the cluster themselves based on the literature. It should be stated from which classification list (or classification database) the extrapolation has been made.

      Within endometrial biology, stromal cells in the transition from epithelial to mesenchymal phenotype are specifically referred to as 'stromal EMT transition cells' (PMID: 39775038, PMID: 39968688).

      In certain cancers or fibrotic diseases, epithelial cells can transition into a mesenchymal phenotype, contributing to the stromal environment that supports tumor growth or tissue remodeling (PMID: 20572012).

      • Regarding Reviewer 2's question Q8, if the authors have not been able to make comparisons with, at least, SEC organoids, unfortunately, the ERT loses much of its strength and should not serve as support.

      We agree with you at this point. These results have been moved to the supplementary figures.

      • If the differences in the transcriptome and proteome between SEC and WOI organoids are not significant, the result does not support the authors' model. If there are barely any differences at the proteome and transcriptome level between SEC and WOI organoids, why would anyone choose to use their model over SEC organoids?

      We sincerely appreciate your valuable feedback. In this revised manuscript, we have further integrated transcriptomic and proteomic analyses, revealing that WOI organoids exhibit enlarged and functionally active mitochondria, along with enhanced cilia assembly and motility compared to SEC organoids. Additionally, using a blastoid model, we demonstrated that WOI organoids possess superior embryo implantation potential, significantly outperforming SEC organoids. Our research group aims to develop an embryo co-culture model. Through systematic comparisons of structural, molecular, and co-culture characteristics between SEC and WOI organoids, we ultimately confirmed the superior performance of WOI organoids.

      • SEC and WOI organoids must be different enough to establish a new model, and the authors do not demonstrate that they are.

      Thank you for your valuable feedback. In the revised manuscript, we have emphasized the distinctions between SEC and WOI organoids in terms of structure, molecular characteristics, and functionality (co-culture with blastoid), as detailed below.

      (1) Structurally, WOI endometrial organoids exhibit subcellular features characteristic of the implantation window: densely packed pinopodes on the luminal side of epithelial cells, abundant glycogen granules, elongated and tightly arranged microvilli, and increased cilia (Figure 2F).

      (2) At the molecular level, WOI organoids show enlarged and functionally active mitochondria, enhanced ciliary assembly and motility, and single-cell signatures resembling mid-secretory endometrium.

      Specifically, mitochondrial- and cilia-related genes/proteins are most highly expressed in WOI organoids (Figure 4A,B). TEM analysis revealed that WOI organoids have the largest average mitochondrial area (Figure 4C). Mitochondrial-related genes display an increasing trend across the three organoid groups, and WOI organoids produce more ATP and IL-8 (Figure 4D,E).

      For cilia, WOI organoids upregulate genes/proteins involved in ciliary assembly, basal bodies, and motile cilia, while downregulating non-motile cilia markers (Figure 5A-C).

      Single-cell analysis further confirms that WOI organoids recapitulate mid-secretory endometrial traits in mitochondrial metabolism and cell adhesion (Figure 2G).

      (3) Functionally, WOI organoids demonstrate superior embryo implantation potential. Given the scarcity and ethical constraints of human embryos, we used blastoids for implantation assays (Figure 6A). These blastoids successfully grew within endometrial organoids, established interactions (Figure 6B), and exhibited normal trilineage differentiation (epiblast: OCT4; hypoblast: GATA6; trophoblast: KRT18) (Figure 6C). WOI organoids achieved significantly higher blastoid survival (66% vs. 19% in CTRL and 28% in SEC) and interaction rates (90% vs. 47% in CTRL and 53% in SEC), confirming their robust embryo-receptive capacity (Figure 6D,E).

      • Regarding Q16, Boretto et al. 2017 and Turco et al. 2017 also manage to isolate stromal cells, but they lose them between passages. It's not a matter of isolating them from the tissue or not, but rather how they justify their maintenance in culture. In the images added by the authors, it can be seen that the majority of stromal cells are lost from P0 to P1 after thawing. I still believe that the epithelial part can be passed and maintained, but the rest cannot, and that should be mentioned in the paper as a limitation. However, the authors can demonstrate the maintenance of stromal cells by performing immunostaining with vimentin from passages 4, 5, and 6.

      Thank you for your valuable comments. We have added the statement 'Stromal cells and immune cells are difficult to pass down stably and their proportion is lower than that in the in vivo endometrium' to the Limitations section (Lines 364-365). Additionally, we performed immunostaining with vimentin starting from passage 6 and confirmed the presence of Vimentin+ F-actin+ stromal cells (as shown in Author response image 1).

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study provides important evidence supporting the ability of a new type of neuroimaging, OPM-MEG system, to measure beta-band oscillation in sensorimotor tasks on 2-14 years old children and to demonstrate the corresponding development changes, since neuroimaging methods with high spatiotemporal resolution that could be used on small children are quite limited. The evidence supporting the conclusion is solid but lacks clarifications about the much-discussed advantages of OPM-MEG system (e.g., motion tolerance), control analyses (e.g., trial number), and rationale for using sensorimotor tasks. This work will be of interest to the neuroimaging and developmental science communities.

      We thank the editors and reviewers for their time and comments on our manuscript. We have responded in detail to the comments, on a point-by-point basis, below. Included in our responses (and our revised manuscript) are additional analyses to control for trial count, clarification of the advantages of OPM-MEG, and justification of our use of sensory (as distinct from motor) stimulation. In what follows, our responses are in bold typeface; additions to our manuscript are in bold italic typeface. 

      Reviewer #1 (Public Review):

      Summary:

      Compared with conventional SQUID-MEG, OPM-MEG offers theoretical advantages of sensor configurability (that is, sizing to suit the head size) and motion tolerance (the sensors are intrinsically in the head reference frame). This study purports to be the first to experimentally demonstrate these advantages in a developmental study from age 2 to age 34. In short, while the theoretical advantages of OPM-MEG are attractive - both in terms of young child sensitivity and in terms of motion tolerance - neither was in fact demonstrated in this manuscript. We are left with a replication of SQUID-MEG observations, which certainly establishes OPM-MEG as "substantially equivalent" to conventional technology but misses the opportunity to empirically demonstrate the much-discussed theoretical advantages/opportunities.

      Thank you for reviewing our manuscript. We agree that our results demonstrate substantial equivalence with conventional MEG. However, as mentioned by Reviewer 3, most past studies have “focused on older children and adolescents (e.g., 9-15 years old)” whereas our youngest group is 2-5 years. We believe that by obtaining data of sufficient quality in these age groups, without the need for any restriction of head movement, we have demonstrated the advantage of OPM-MEG. We have now made this clear in our discussion:

      “…our primary aim was to test the feasibility of OPM-MEG for neurodevelopmental studies. Our results demonstrate we were able to scan children down to age 2 years, measuring high-fidelity electrophysiological signals and characterising the neurodevelopmental trajectory of beta oscillations. The fact that we were able to complete this study demonstrates the advantages of OPM-MEG over conventional-MEG, the latter being challenging to deploy across such a large age range…”

      Strengths:

      A replication of SQUID-MEG observations, which certainly establishes OPM-MEG as "substantially equivalent" to conventional technology but misses the opportunity to empirically demonstrate the much-discussed theoretical advantages/opportunities.

      As noted above the demonstration of equivalence was one of our primary aims. We have elaborated further on the advantages below.

      Weaknesses:

      The authors describe 64 tri-axial detectors, which they refer to as 192 channels. This is in keeping with some of the SQUID-MEG description, but possibly somewhat disingenuous. For the scientific literature, perhaps "64 tri-axial detectors" is a more parsimonious description.

      The number of channels in a MEG system refers to the number of independent measurements of magnetic field. This, in turn, tells us the number of degrees of freedom in the data that can be exploited by algorithms like signal space separation (SSS) or beamforming. For example, the MEGIN (cryogenic) MEG system has 306 channels: 102 magnetometers and 204 planar gradiometers. Sensors are constructed as “triple sensor elements” with one magnetometer and 2 gradiometers (in orthogonal orientations) centred on a single location. In our system, each sensor makes three orthogonal measurements of magnetic field which are (by definition) independent. We have 64 such sensors, and therefore 192 independent channels – indeed when implementing algorithms like SSS we have shown we can exploit this number of degrees of freedom.1 A count of 192 channels is therefore an accurate description of the system.

      A small fraction (<20%) of trials were eliminated for analysis because of "excess interference" - this warrants further elaboration.

      We agree that this is an important point. We now state in our methods section:

      “…Automatic trial rejection was implemented with trials containing abnormally high variance (exceeding 3 standard deviations from the mean) removed. All experimental trials were also inspected visually by an experienced MEG scientist, to exclude trials with large spikes/drifts that were missed by the automatic approach. In the adult group, there was a significant overlap between automatically and manually detected bad trials (0.7 ± 1.6 trials were only detected manually). In the children, 10.0 ± 9.4 trials were only detected manually…”
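
      The automatic step quoted above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name is hypothetical, each trial is assumed to be a single-channel list of samples, and "3 standard deviations from the mean" is read as a threshold on per-trial variance relative to the distribution of variances across trials.

```python
import statistics


def reject_high_variance_trials(trials, n_sd=3.0):
    """Split trial indices into (kept, rejected) by a variance threshold.

    Assumption: a trial is rejected when its variance lies more than
    n_sd standard deviations above the mean variance across all trials.
    """
    variances = [statistics.pvariance(trial) for trial in trials]
    mean_var = statistics.fmean(variances)
    sd_var = statistics.pstdev(variances)
    threshold = mean_var + n_sd * sd_var

    kept, rejected = [], []
    for index, variance in enumerate(variances):
        if variance > threshold:
            rejected.append(index)
        else:
            kept.append(index)
    return kept, rejected
```

      Trials that pass this step would still go on to the visual inspection described in the quoted methods text; the sketch covers only the automatic part.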

      We also note that the other reviewers and editor questioned whether the higher rejection rate in children had any bearing on results. This is an extremely important question. In revising the manuscript this has also been taken into account with all data reanalysed with equal trial counts in children and adults. Results are presented in Supplementary Information Section 5.

      Figure 3 shows a reduced beta ERD in the youngest children. Although the authors claim that OPM-MEG would be similarly sensitive for all ages and that SQUID-MEG would be relatively insensitive to young children, one trivial counterargument that needs to be addressed is that OPM has NOT in fact increased the sensitivity to young child ERD. This can possibly be addressed by analogous experiments using a SQUID-based system. An alternative would be to demonstrate similar sensitivity across ages using OPM to a brain measure such as evoked response amplitude. In short, how does Figure 3 demonstrate the (theoretical) sensitivity advantage of OPM-MEG in small heads?

      We completely understand the referee’s point – indeed the question of whether a neuromagnetic effect really changes with age, or apparently changes due to a drop in sensitivity (caused by reduced head size or - in conventional MEG and fMRI - increased subject movement) is a question that can be raised in all neurodevelopmental studies.

      Our authors have many years’ experience conducting studies using conventional MEG (including in neurodevelopment) and agreed that the idea of scanning subjects down to age two in conventional MEG would not be practical; their heads are too small and they typically fail to tolerate an environment where they are forced to remain still for long periods. Even if we tried a comparative study using conventional MEG, the likely data exclusion rate would be so high that the study would be confounded. This is why most conventional MEG studies only scan older children and adolescents. For this reason, we cannot undertake the comparative study the reviewer suggests. There are however two reasons why we believe sensitivity is not driving the neurodevelopmental effects that we observe:

      Proximity of sensors to the head: 

      For an ideal wearable MEG system, the distance between the sensors and the scalp surface (sensor proximity) would be the same regardless of age (and size), ensuring maximum sensitivity in all subjects. To test how our system performed in this regard, we undertook analyses to compute scalp-to-sensor distances. This was done in two ways:

      (1) Real distances in our adaptable system: We took the co-registered OPM sensor locations and computed the Euclidean distance from the centre of the sensitive volume (i.e. the centre of the vapour cell) to the closest point on the scalp surface. This was measured independently for all sensors, and an average across sensors calculated. We repeated this for all participants (recall participants wore helmets of varying size and this adaptability should help minimise any relationship between sensor proximity and age).

      (2) Simulated distances for a non-adaptable system: Here, the aim was to see how proximity might have changed with age, had only a single helmet size been used. We first identified the single example subject with the largest head (scanned wearing the largest helmet) and extracted the scalp-to-sensor distances as above. For all other subjects, we used a rigid body transform to co-register their brain to that of the example subject (placing their head (virtually) inside the largest helmet). Proximity was then calculated as above and an average across sensors calculated. This was repeated for all participants.
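
      The proximity measure used in both analyses can be sketched as below. This is an illustrative reading only: the function name is hypothetical, and the scalp surface is assumed to be available as a list of 3D points (for example, mesh vertices), so that "closest point on the scalp surface" becomes a nearest-vertex search.

```python
import math


def mean_scalp_to_sensor_distance(sensor_positions, scalp_points):
    """Average, across sensors, of the distance to the nearest scalp point.

    sensor_positions: (x, y, z) centres of each sensor's sensitive volume.
    scalp_points: (x, y, z) points sampled on the scalp surface (assumed).
    """
    nearest_distances = []
    for sensor in sensor_positions:
        # Euclidean distance from this sensor to every scalp point;
        # keep only the smallest one.
        nearest = min(math.dist(sensor, point) for point in scalp_points)
        nearest_distances.append(nearest)
    return sum(nearest_distances) / len(nearest_distances)
```

      For the simulated non-adaptable case, the same function would be applied after the scalp points have been rigidly transformed into the largest helmet; that transform is not shown here.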

      In both analyses, sensor proximity was plotted against age and significant relationships probed using Pearson correlation. 

      In addition, we wanted to probe the relation between sensor proximity and head circumference. Head circumference was estimated by binarising the whole head MRI (to delineate the volume of the head) and selecting the axial slice with the largest circumference. We then plotted sensor proximity versus head circumference, for both the real (adaptive) and simulated (non-adaptive) case (expecting a negative relationship – i.e. larger heads mean closer sensor proximity). The slope of the relationship was measured and we used a permutation test to determine whether the use of adaptable helmets significantly lowered the identified slope (i.e. do adaptable helmets significantly improve sensor proximity in those with smaller head circumference).
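
      A sketch of the two statistics described above (Pearson correlation and a permutation test on the difference between slopes) is given below. The response does not state how labels were permuted, so the scheme here is an assumption: the real and simulated distances of each subject are randomly swapped, and the slope difference is recomputed each time. All function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def permutation_test_slope_difference(circumference, dist_adaptive,
                                      dist_fixed, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the difference between slopes.

    Assumed scheme: within each subject, the adaptive and fixed-helmet
    distances are swapped with probability 0.5 on every permutation.
    """
    rng = random.Random(seed)
    observed = slope(circumference, dist_adaptive) - slope(circumference, dist_fixed)
    extreme = 0
    for _ in range(n_perm):
        perm_a, perm_f = [], []
        for d_a, d_f in zip(dist_adaptive, dist_fixed):
            if rng.random() < 0.5:
                d_a, d_f = d_f, d_a
            perm_a.append(d_a)
            perm_f.append(d_f)
        diff = slope(circumference, perm_a) - slope(circumference, perm_f)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)
```

      Other permutation schemes are possible, and the choice affects the null distribution; the sketch is only meant to make the logic of the comparison concrete.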

      Results are shown in Figure R1. We found no measurable relationship between sensor proximity and age (r = -0.195; p = 0.171) in the case of the real helmets (panel A). When simulating a non-adaptable helmet, we did see a significant effect of age on scalp-to-sensor distance (r = -0.46; p = 0.001; panel B). This demonstrates the advantage of the adaptability of OPM-MEG; without the ability to flexibly locate sensors, we would have a significant confound of sensor proximity. 

      Plotting sensor proximity against head circumference we found a significant negative relationship in both cases (r = -0.37; p = 0.007 and r = -0.78; p = 0.000001); however, the difference between slopes was significant according to a permutation test (p < 0.025), suggesting that the adaptable helmets have indeed improved sensor proximity in those with smaller head circumference. This again shows the benefits of adaptability to head size.

      Author response image 1.

      Scalp-to-sensor distance as a function of age (A/B) and head circumference (C/D). A and C show the case for the real helmets; B and D show the simulated non-adaptable case.

      In sum, the ideal wearable system would see sensors located on the scalp surface, to get as close as possible to the brain in all subjects. Our system of multiple helmet sizes is not perfect in this regard (there is still a significant relationship between proximity and head circumference). However, our solution has offered a significant improvement over a (simulated) non-adaptable system. Future systems should aim to improve even further on this, either by using additively manufactured bespoke helmets for every subject (this is a gold standard, but also costly for large studies), or potentially adaptable flexible helmets.

      Burst amplitudes:

      The reviewer suggested to “demonstrate similar sensitivity across ages using OPM to a brain measure”. We decided not to use the evoked response amplitude (as suggested), since this would be expected to change with age. Instead, we used the amplitude of the bursts.

      Our manuscript shows a significant correlation between beta modulation and burst probability – implying that the stimulus-related drop in beta amplitude occurs because bursts are less likely to occur. Further, we showed significant age-related changes in both beta amplitude and burst probability leading to a conclusion that the age dependence of beta modulation was caused by changes in the likelihood of bursts (i.e. bursts are less likely to ‘switch off’ during sensory stimulation in children). We have now extended these analyses to test whether burst amplitude also changes significantly with age – we reasoned that if burst amplitude remained the same in children and adults, this would not only suggest that beta modulation is driven by burst probability (distinct from burst amplitude), but also show directly that the beta effects we see are not attributable to a lack of sensitivity in younger people.

      We took the (unnormalized) beamformer projected electrophysiological time series from sensorimotor cortex and filtered it 5-48 Hz (the motivation for the large band was because bursts are known to be pan-spectral and have lower frequency content in children; this band captures most of the range of burst frequencies highlighted in our spectra). We then extracted the timings of the bursts, and for each burst took the maximum projected signal amplitude. These values were averaged across all bursts in an individual subject, and plotted for all subjects against age.
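
      The per-subject burst amplitude described above can be sketched as follows. The sketch assumes the time series has already been filtered and that burst timings are available as a per-sample on/off mask (in the manuscript these come from an HMM, which is not reproduced here). Taking the absolute value as the "amplitude" is also an assumption, and the function name is hypothetical.

```python
def mean_burst_peak_amplitude(signal, burst_mask):
    """Mean, over bursts, of the peak absolute amplitude within each burst.

    signal: filtered time series (list of floats).
    burst_mask: per-sample flags, truthy while a burst is on (assumed input).
    Returns None when the mask contains no bursts.
    """
    peaks = []
    current_peak = None  # None means we are currently outside a burst
    for sample, in_burst in zip(signal, burst_mask):
        if in_burst:
            amplitude = abs(sample)
            if current_peak is None or amplitude > current_peak:
                current_peak = amplitude
        elif current_peak is not None:
            # The burst has just ended; store its peak.
            peaks.append(current_peak)
            current_peak = None
    if current_peak is not None:
        # The recording ended while a burst was still on.
        peaks.append(current_peak)
    if not peaks:
        return None
    return sum(peaks) / len(peaks)
```

      One value per subject computed this way could then be plotted against age, as in the analysis described above.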

      Author response image 2.

      Beta burst amplitude as a function of age; A) shows index finger stimulation trials; B) shows little finger stimulation trials. In both cases there was no significant modulation of burst amplitude with age.

      Results (see Figure R2) showed that the amplitude of the beta burst showed no significant age-related modulation (R² = 0.01, p = 0.48 for the index finger and R² = 0.01, p = 0.57 for the little finger). This is distinct from both burst probability and task-induced beta modulation. This adds weight to the argument that the diminished beta modulation in children is not caused by a lack of sensitivity to the MEG signal and supports our conclusion that burst probability is the primary driver of the age-related changes in beta oscillations.

      Both of the above analyses have been added to our supplementary information and mentioned in the main manuscript. The first shows no confound of sensor proximity to the scalp with age in our study. The second shows that the bursts underlying the beta signal are not significantly lower amplitude in children – which we reasoned they would be if sensitivity was diminished at younger ages. We believe that the two together suggest that we have mitigated a sensitivity confound in our study.

      The data do not make a compelling case for the motion tolerance of OPM-MEG. Although an apparent advantage of a wearable system, an empirical demonstration is still lacking. How was motion tracked in these participants?

      We agree that this was a limitation of our experiment. 

      We have the equipment to track motion of the head during an experiment, using IR retroreflective markers placed on the helmet and a set of IR cameras located inside the MSR. However, the process takes a long time to set up, it lacks robustness, and would have required an additional computer (the one we typically use was already running the somatosensory stimulus and video). When the study was designed, we were concerned that the increased set up time for motion tracking would cause children to get bored, and result in increased participant drop out. For this reason we decided not to capture motion of the head during this study.

      With hindsight this was a limitation which – as the reviewer states – makes us unable to prove that motion robustness was a significant advantage for this study. That said, during scanning there was both a parent and an experimenter in the room for all of the children scanned, and anecdotally we can say that children tended to move their head during scans – usually to talk to the parent. Whilst this cannot be quantified (and is therefore unsatisfactory) we thought it worth mentioning in our discussion, which reads:

      “…One limitation of the current study is that practical limitations prevented us from quantitatively tracking the extent to which children (and adults) moved their head during a scan. Anecdotally however, experimenters present in the room during scans reported several instances where children moved, for example to speak to their parents who were also in the room. Such levels of movement could not be tolerated in conventional MEG or MRI and so this again demonstrates the advantages afforded by OPM-MEG…”

      As a note, empirical demonstrations of the motion tolerance of OPM-MEG have been published previously: early demonstrations included Boto et al.2, who captured beta oscillations in adults playing a ball game, and Holmes et al., who measured visual responses as participants moved their head to change viewing angle3. In more recent demonstrations, Seymour et al. measured the auditory evoked field in standing mobile participants4; Rea et al. measured beta modulation as subjects carried out a naturalistic handwriting task5; and Holmes et al. measured beta modulation as a subject walked around a room6.

      Furthermore, while the introduction discusses at some length the phenomenon of PMBR, there is no demonstration of the recording of PMBR (or post-sensory beta rebound). This is a shame because there is literature suggesting an age-sensitivity to this, that the optimal sensitivity of OPM-MEG might confirm/refute. There is little evidence in Figure 3 for adult beta rebound. Is there an explanation for the lack of sensitivity to this phenomenon in children/adolescents? Could a more robust paradigm (button-press) have shed light on this?

      We understand the question. There are two limitations to the current study in respect to measuring the PMBR:

      Firstly, sensory tasks generally do not induce as strong a PMBR as motor tasks and with this in mind a stronger rebound response could have been elicited using a button press. However, it was our intention to scan children down to age 2 and we were sceptical that the youngest children would carry out a button press as instructed. For this reason we opted for entirely passive stimulation, requiring no active engagement from our participants. The advantage of this was a stimulus that all subjects could engage with. However, this was at the cost of a diminished rebound.

      The second limitation relates to trial length. Multiple studies have shown that the PMBR can last over ~10 s 7,8. Indeed, Pfurtscheller et al. argued in 1999 that it was necessary to leave 10 s between movements to allow the PMBR to return to a true baseline9, though this has rarely been adhered to in the literature. Here, we wanted to keep recordings short for the comfort of the younger participants, so we adopted a short trial duration. However, a consequence of this short trial length is that it becomes impossible to access the PMBR directly; one can only measure beta modulation with the task. This limitation has now been addressed explicitly in our discussion:

      “…this was the first study of its kind using OPM-MEG, and consequently aspects of the study design could have been improved. Firstly, the task was designed for children; it was kept short while maximising the number of trials (to maximise signal to noise ratio). However, the classical view of beta modulation includes a PMBR which takes ~10 s to reach baseline following task cessation7–9. Our short trial duration therefore doesn’t allow the rebound to return to baseline between trials, and so conflates PMBR with rest. Consequently, we cannot differentiate the neural generators of the task induced beta power decrease and the PMBR; whilst this helped ensure a short, child friendly task, future studies should aim to use longer rest windows to independently assess which of the two processes is driving age related changes…”

      Data on functional connectivity are valuable but do not rely on OPM recording. They further do not add strength to the argument that OPM MEG is more sensitive to brain activity in smaller heads - in fact, the OPM recordings seem plagued by the same insensitivity observed using conventional systems.

      Given the demonstration above that bursts are not significantly diminished in amplitude in children relative to adults; and further given the demonstrations in the literature (e.g. Seedat et al.10) that functional connectivity is driven by bursts, we would argue that the effects of connectivity changing with age are not related to sensitivity but rather genuinely reflect a lack of coordination of brain activity.

      The discussion of burst vs oscillations, while highly relevant in the field, is somewhat independent of the OPM recording approach and does not add weight to the OPM claims.

      We agree that the burst vs. oscillations discussion does not add weight to the OPM claims per se. However, we had two aims of our paper, the second being to “investigate how task-induced beta modulation in the sensorimotor cortices is related to the occurrence of pan-spectral bursts, and how the characteristics of those bursts change with age.” As the reviewer states, this is highly relevant to the field, and therefore we believe adds impact, not only to the paper, but also by extension to the technology.

      In short, while the theoretical advantages of OPM-MEG are attractive - both in terms of young child sensitivity and in terms of motion tolerance, neither was in fact demonstrated in this manuscript. We are left with a replication of SQUID-MEG observations, which certainly establishes OPM-MEG as "substantially equivalent" to conventional technology but misses the opportunity to empirically demonstrate the much-discussed theoretical advantages/opportunities.

      We thank the referee for their time and important contributions to this paper. We believe the fact that we were able to record good data in children as young as two years old was, in itself, an experimental realisation of the ‘theoretical advantages’ of OPM-MEG. Our additional analyses, inspired by the reviewers’ comments, help to clarify the advantages of OPM-MEG over conventional technology. The reviewers’ insights have without doubt improved the paper.

      Reviewer #2 (Public Review):

      Summary:

      The authors introduce a new 192-channel OPM system that can be configured using different helmets to fit individuals from 2 to 34 years old. To demonstrate the veracity of the system, they conduct a sensorimotor task aimed at mapping developmental changes in beta oscillations across this age range. Many past studies have mapped the trajectory of beta (and gamma) oscillations in the sensorimotor cortices, but these studies have focused on older children and adolescents (e.g., 9-15 years old) and used motor tasks. Thus, given the study goals, the choice of a somatosensory task was surprising and not justified. The authors recorded a final sample of 27 children (2-13 years old) and 24 adults (21-34 years) and performed a time-frequency analysis to identify oscillatory activity. This revealed strong beta oscillations (decreases from baseline) following the somatosensory stimulation, which the authors imaged to discern generators in the sensorimotor cortices. They then computed the power difference between the 0.3-0.8 s period and the 1.0-1.5 s post-stimulation period and showed that the beta response became stronger with age (more negative relative to the stimulation period). Using these same time windows, they computed the beta burst probability and showed that this probability increased as a function of age. They also showed that the spectral composition of the bursts varied with age. Finally, they conducted a whole-brain connectivity analysis. The goals of the connectivity analysis were not as clear, as prior studies of sensorimotor development have not conducted such analyses and typically such whole-brain connectivity analyses are performed on resting-state data, whereas here the authors performed the analysis on task-based data. In sum, the authors demonstrate that they can image beta oscillations in young children using OPM and discern developmental effects.

      Thank you for this summary and for taking the time to review our manuscript.

      Strengths:

      Major strengths of the study include the novel OPM system and the unique participant population going down to 2-year-olds. The analyses are also innovative in many respects.

      Thank you – we also agree that the major strength is in the unique cohort.

      Weaknesses:

      Several weaknesses currently limit the impact of the study. 

      First, the choice of a somatosensory stimulation task over a motor task was not justified. The authors discuss the developmental motor literature throughout the introduction, but then present data from a somatosensory task, which is confusing. Of note, there is considerable literature on the development of somatosensory responses so the study could be framed with that.

      We completely understand the referee’s point, and we agree that the motivation for the somatosensory task was not made clear in our original manuscript.

      Our choice of task was motivated completely by our targeted cohort; whilst a motor task would have been our preference, it was generally felt that making two-year-olds comply with instructions to press a button would have been a significant challenge. In addition, there would likely have been differences in reaction times. By opting for a passive sensory stimulation we ensured compliance, and the same stimulus for all subjects. We have added text on this to our introduction as follows:

      “…Here, we combine OPM-MEG with a burst analysis based on a Hidden Markov Model (HMM) 10–12 to investigate beta dynamics. We scanned a cohort of children and adults across a wide age range (upwards from 2 years old). Because of this, we implemented a passive somatosensory task which can be completed by anyone, regardless of age…”

      We also state in our discussion:

      “…here we chose to use passive (sensory) stimulation. This helped ensure compliance with the task in subjects of all ages and prevented confounds of e.g. reaction time, force, speed and duration of movement which would be more likely in a motor task.7,8 However, there are many other systems to choose and whether the findings here regarding beta bursts and the changes with age also extend to other brain networks remains an open question.…”

      Regarding the neurodevelopmental literature – we are aware of the literature on somatosensory evoked responses – particularly median nerve stimulation – but we can find little on the neurodevelopmental trajectory of somatosensory induced beta oscillations (the topic of our paper). We have edited our introduction as follows:

      “…All these studies probed beta responses to movement execution; in the case of tactile stimulation (i.e. sensory stimulation without movement) both task induced beta power loss, and the post stimulus rebound have been consistently observed in adults9,13–18. Further, beta amplitude in sensory cortex has been related to attentional processes19 and is broadly thought to carry top-down influence on primary areas20. However, there is less literature on how beta modulation changes with age during purely sensory tasks.…”

      We would be keen for the reviewer to point to any specific papers in the literature that we may have missed.

      Second, the primary somatosensory response actually occurs well before the time window of interest in all of the key analyses. There is an established literature showing mechanical stimulation activates the somatosensory cortex within the first 100 ms following stimulation, with the M50 being the most robust response. The authors focus on a beta decrease (desynchronization) from 0.3-0.8 s which is obviously much later, despite the primary somatosensory response being clear in some of their spectrograms (e.g., Figure 3 in older children and adults). This response appears to exhibit a robust developmental effect in these spectrograms so it is unclear why the authors did not examine it. This raises a second point; to my knowledge, the beta decrease following stimulation has not been widely studied and its function is unknown. The maps in Figure 3 suggest that the response is anterior to the somatosensory cortex and perhaps even anterior to the motor cortex. Since the goal of the study is to demonstrate the developmental trajectory of well-known neural responses using an OPM system, should the authors not focus on the best-understood responses (i.e., the primary somatosensory response that occurs from 0.0-0.3 s)?

      We understand the reviewer’s point. The original aim of our manuscript was to investigate the neurodevelopmental trajectory of beta oscillations, not the evoked response. In fact, the evoked response in this paradigm is complicated by the fact that there are three stimuli in a very short (<500 ms) time window. For this reason, we prefer the focus of our paper to remain on oscillations.

      Nevertheless, we agree that not including the evoked responses was a missed opportunity.  We have now added evoked responses to our analysis pipeline and manuscript. As surmised by the reviewer, the M50 shows neurodevelopmental changes (an increase with age). Our methods section has been updated accordingly and Figure 3 has been modified. The figure and caption are copied below for the convenience of the reviewer.

      Author response image 3.

      Beta band modulation with age: (A) Brain plots show slices through the left motor cortex, with a pseudo-T-statistical map of beta modulation (blue/green) overlaid on the standard brain. Peak MNI coordinates are indicated for each subgroup. Time frequency spectrograms show modulation of the amplitude of neural oscillations (fractional change in spectral amplitude relative to the baseline measured in the 2.5-3 s window). Vertical lines indicate the time of the first braille stimulus. In all cases results were extracted from the location of peak beta desynchronisation (in the left sensorimotor cortex). Note the clear beta amplitude reduction during stimulation. The inset line plots show the 4-40 Hz trial averaged phase-locked evoked response, with the expected prominent deflections around 20 and 50 ms. (B) Maximum difference in beta-band amplitude (0.3-0.8 s window vs 1-1.5 s window) plotted as a function of age (i.e., each data point shows a different participant; triangles represent children, circles represent adults). Note significant correlation (R² = 0.29, p = 0.00004*). (C) Amplitude of the P50 component of the evoked response plotted against age. There was no significant correlation (R² = 0.04, p = 0.14). All data here relate to the index finger stimulation; similar results are available for the little finger stimulation in Supplementary Information Section 1.

      Regarding the developmental effects, the authors appear to compute a modulation index that contrasts the peak beta window (.3 to .8) to a later 1.0-1.5 s window where a rebound is present in older adults. This is problematic for several reasons. First, it prevents the origin of the developmental effect from being discerned, as a difference in the beta decrease following stimulation is confounded with the beta rebound that occurs later. A developmental effect in either of these responses could be driving the effect. From Figure 3, it visually appears that the much later rebound response is driving the developmental effect and not the beta decrease that is the primary focus of the study. Second, these time windows are a concern because a different time window was used to derive the peak voxel used in these analyses. From the methods, it appears the image was derived using the .3-.8 window versus a baseline of 2.5-3.0 s. How do the authors know that the peak would be the same in this other time window (0.3-0.8 vs. 1.0-1.5)? Given the confound mentioned above, I would recommend that the authors contrast each of their windows (0.3-0.8 and 1.0-1.5) with the 2.5-3.0 window to compute independent modulation indices. This would enable them to identify which of the two windows (beta decrease from 0.3-0.8 s or the increase from 1.0-1.5 s) exhibited a developmental effect. Also, for clarity, the authors should write out the equation that they used to compute the modulation index. The direction of the difference (positive vs. negative) is not always clear.
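
      To make the reviewer's recommendation concrete, the sketch below computes a separate modulation index for each window against a common baseline. It is an illustration only and is not the equation used in the manuscript, which is not given here: the index is assumed to be a fractional change relative to the mean baseline amplitude, and the function names, the default baseline and the half-open window convention are all hypothetical choices.

```python
def window_mean(amplitude, times, t_start, t_end):
    """Mean amplitude over samples whose time t satisfies t_start <= t < t_end."""
    values = [a for a, t in zip(amplitude, times) if t_start <= t < t_end]
    return sum(values) / len(values)


def fractional_change(amplitude, times, window, baseline=(2.5, 3.0)):
    """Fractional change of a window's mean amplitude relative to baseline.

    Negative values indicate a decrease from baseline (e.g. during
    stimulation); positive values indicate an increase (e.g. a rebound).
    Windows and times must share the same unit.
    """
    base = window_mean(amplitude, times, *baseline)
    return (window_mean(amplitude, times, *window) - base) / base
```

      Calling `fractional_change` once with the 0.3-0.8 s window and once with the 1.0-1.5 s window would give two indices with an unambiguous sign convention, which could then be related to age separately.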

      We completely understand the referee’s point; referee 1 made a similar point. In fact, there are two limitations of our paradigm regarding the measurement of PMBR versus the task-induced beta decrease:

      Firstly, sensory tasks generally do not induce as strong a PMBR as motor tasks and with this in mind a stronger rebound response could have been elicited using a button press. However, as described above it was our intention to scan children down to age 2 and we were sceptical that the youngest children would carry out a button press as instructed.

      The second limitation relates to trial length. Multiple studies have shown that the PMBR can last over ~10 s (7, 8). Indeed, Pfurtscheller et al. argued in 1999 that it was necessary to leave 10 s between movements to allow the PMBR to return to a true baseline (9). Here, we wanted to keep recordings relatively short for the younger participants, and so we adopted a short trial duration. However, a consequence of this short trial length is that it becomes impossible to access the PMBR directly because the PMBR of the nth trial is still ongoing when the (n+1)th trial begins. Because of this, there is no genuine rest period, and so the stimulus induced beta decrease and subsequent rebound cannot be disentangled. This limitation has now been made clear in our discussion as follows:

      “…this was the first study of its kind using OPM-MEG, and consequently aspects of the study design could have been improved. Firstly, the task was designed for children; it was kept short while maximising the number of trials (to maximise signal to noise ratio). However, the classical view of beta modulation includes a PMBR which takes ~10 s to reach baseline following task cessation (7–9). Our short trial duration therefore doesn’t allow the rebound to return to baseline between trials, and so conflates PMBR with rest. Consequently, we cannot differentiate the neural generators of the task induced beta power decrease and the PMBR; whilst this helped ensure a short, child friendly task, future studies should aim to use longer rest windows to independently assess which of the two processes is driving age related changes…”

      To clarify our method of calculating the modulation index, we have added the following statement to the methods:

      “The beta modulation index was calculated using the equation , where , and are the average Hilbert-envelope-derived amplitudes in the stimulus (0.3-0.8s), post-stimulus (1-1.5s) and baseline (2.5-3s) windows, respectively.”
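
One form consistent with the description above is a post-stimulus minus stimulus difference normalised by baseline. This is offered only as an assumption: the symbols $A_{\mathrm{stim}}$, $A_{\mathrm{post}}$ and $A_{\mathrm{base}}$ are illustrative names, not necessarily the manuscript's notation, and the sign convention is a guess. Written this way, the index separates into the two baseline-referenced terms that the reviewer recommends computing independently:

```latex
\beta_{\mathrm{mod}}
  = \frac{A_{\mathrm{post}} - A_{\mathrm{stim}}}{A_{\mathrm{base}}}
  = \underbrace{\frac{A_{\mathrm{post}} - A_{\mathrm{base}}}{A_{\mathrm{base}}}}_{\text{1--1.5 s vs. baseline}}
  - \underbrace{\frac{A_{\mathrm{stim}} - A_{\mathrm{base}}}{A_{\mathrm{base}}}}_{\text{0.3--0.8 s vs. baseline}}
```

Under this assumed form, a larger index with age could come from a larger first term (stronger rebound), a more negative second term (deeper decrease), or both, which is the ambiguity the reviewer describes.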

      Another complication of using a somatosensory task is that the literature on bursting is much more limited and it is unclear what the expectations would be. Overall, the burst probability appears to be relatively flat across the trial, except that there is a sharp decrease during the beta decrease (.3-.8 s). This matches the conventional trial-averaging analysis, which is good to see. However, how the bursting observed here relates to the motor literature and the PMBR versus beta ERD is unclear.

      Again, we agree completely; a motor task would have better framed the study in the context of existing burst literature – but as mentioned above, making 2-year-olds comply with the instructions for a motor task would have been difficult. Interestingly, in a recent paper, Rayson et al. used EEG to investigate burst activity in infants (9 and 12 months) and adults during observed movement execution, with results showing a stimulus-induced decrease in beta burst rate at all ages, with the largest effects in adults (21). This paper was not yet published when we submitted our article but does help us to frame our burst results, since there is strong agreement between their study and ours. We now mention this study in both our introduction and discussion.

      Another weakness is that all participants completed 42 trials, but 19% of the trials were excluded in children and 9% were excluded in adults. The number of trials is proportional to the signal-to-noise ratio. Thus, the developmental differences observed in response amplitude could reflect differences in the number of trials that went into the final analyses.

      This is an important observation and we thank the reviewer for raising the issue. We have now re-analysed all of our data, removing trials in the adults such that the overall number of trials was the same as for the children. All effects with age remained significant. We chose to keep the Figures in the main manuscript with all good trials (as previously) and present the additional analyses (with matched trial numbers) in supplementary information. However, if the reviewer feels strongly, we could do it the other way around (there is very little difference between the results).
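
The trial-matching step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the response does not say how trials were chosen for removal, so random subsampling is an assumption, and the function name `match_trial_count` and the trial counts are hypothetical.

```python
import random


def match_trial_count(trials, n_keep, seed=0):
    """Randomly drop trials so that n_keep remain, preserving trial order.

    ASSUMPTION: trials are removed at random; the response does not state
    which selection rule was actually used.
    """
    if n_keep >= len(trials):
        return list(trials)
    rng = random.Random(seed)
    kept_indices = sorted(rng.sample(range(len(trials)), n_keep))
    return [trials[i] for i in kept_indices]


# Illustrative counts only: 42 trials with roughly 9% rejected in adults
# and roughly 19% rejected in children.
adult_trials = list(range(38))   # indices of one adult's good trials
n_child_trials = 34              # typical number of good trials in a child
matched = match_trial_count(adult_trials, n_child_trials)
print(len(matched))
```

Fixing the seed keeps the subsample reproducible; repeating the analysis over several seeds would show whether the age effects depend on which trials happen to be dropped.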

      Reviewer #3 (Public Review):

      This study demonstrated the application of OPM-MEG in neurodevelopment studies of somatosensory beta oscillations and connections with children as young as 2 years old. It provides a new functional neuroimaging method that has high spatiotemporal resolution and is also wearable, which makes it a useful new tool for studies in young children. The authors have constructed a 192-channel wearable OPM-MEG system that includes field compensation coils, which allow scanning with free head movement and a relatively high ratio of usable trials. Beta band oscillations during somatosensory tasks are well localized, and modulation with age is found in the amplitude, connectivity, and panspectral burst probability. It is demonstrated that the wearable OPM-MEG could be used in children as a quite practical and easy-to-deploy neuroimaging method with performance as good as conventional MEG. With both good spatial (several millimeters) and temporal (milliseconds) resolution, it provides a novel and powerful technology for neurodevelopment research and clinical applications not limited to somatosensory areas.

      We thank the reviewer for their summary, and their time in reviewing our manuscript.

      The conclusions of this paper are mostly well supported by data acquired under the proper method. However, some aspects of data analysis need to be improved and extended.

      (1) The colour bars selected for the pseudo-T-statistic pictures of beta modulation in Figures 2 and 3, which are blue/black and red/black, are not easily distinguished from the anatomical images, which are grey-scale. A colour bar without black/white would make these figures better. We also suggest marking the peak locations in Figure 2 and the averaged locations, with error bars, in Figure 3.

      Thank you for this comment which we certainly agree with. The colour scheme used has now been changed to avoid black. We have also added peak locations. 

      (2) The data points in plots are not consistent across figures. In Figures 3 and 5, they are classified into triangles and circles for children and adults, but all are circles in Figures 4 and 6.

      Thank you! We apologise for the confusion. Data points are now consistent across plots.

      (3) Although MEG is much less susceptible to conductivity inhomogeneity of the head than EEG, the forward modelling may still be impacted by the small head profile. Add more information about source localization accuracy and stability across ages or head size.

      This is an excellent point. We have added to our discussion relating to the accuracy of the forward model. 

      “…We failed to see a significant difference in the spatial location of the cortical representations of the index and little finger; there are three potential reasons for this. First, the system was not designed to look for such a difference – sensors were sparsely distributed to achieve whole head coverage (rather than packed over sensory cortex to achieve the best spatial resolution in one area (22)). Second, our “pseudo-MRI” approach to head modelling (see Methods) is less accurate than acquisition of participant-specific MRIs, and so may mask subtle spatial differences. Third, we used a relatively straightforward technique for modelling magnetic fields generated by the brain (a single shell forward model). Although MEG is much less susceptible to conductivity inhomogeneity of the head than EEG, the forward model may still be impacted by the small head profile. This may diminish spatial resolution and future studies might look to implement more complex models based on e.g. finite element modelling (23). Finally, previous work (24) suggested that, for a motor paradigm in adults, only the beta rebound, and not the power reduction during stimulation, mapped motortopically. This may also be the case for purely sensory stimulation. Nevertheless, it remains the case that by placing sensors closer to the scalp, OPM-MEG should offer improved spatial resolution in children and adults; this should be the topic of future work…”

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      Major items to further test include the differing number of trials, the windowing issue, and the focus on motor findings in the intro and discussion. First, I would recommend the authors adjust the number of trials in adults to equate them between groups; this will make their developmental effects easier to interpret.  

      Thank you for raising this important point. This has now been done and appears in our supplementary information as discussed above.

      Second, to discern which responses are exhibiting developmental effects, the authors need to contrast the 0.3-0.8 window with the later window (2.5-3.0), not the window that appears to have the PMBR-like response. This artificially accentuates the response. I also think they should image the 1.0-1.5 vs 2.5-3.0s window to determine whether the response in this time window is in the same location as the decrease and then contrast this for beta differences. 

      We completely understand this point, which relates to separating the reduction in beta amplitude during stimulation and the rebound post stimulation. However, as explained above, doing so unambiguously would require the use of much longer trials. Here we were only able to measure stimulus induced beta modulation (distinct from the separate contributions of the task induced beta power reduction and rebound). It may be that future studies, with >10 s trial length, could probe the role of the PMBR, but such studies require long paradigms which are challenging to implement with children.

      Third, changing the framing of the study to highlight the somatosensory developmental literature would also be an improvement.

      We have added to our introduction as stated in the responses above.

      Finally, the connectivity analysis on data from a somatosensory task did not make sense given the focus of the study and should be removed in my opinion. It is very difficult to interpret given past studies used resting state data and one would expect the networks to dynamically change during different parts of the current task (i.e., stimulation versus baseline).

      We appreciate the point regarding connectivity. However, it was our intention to examine the developmental trajectory of beta oscillations, and a major role of beta oscillations is in mediating connectivity. It is true that most studies are conducted in the resting state (or more recently – particularly in children – during movie watching). The fact that we had a sensory task running is a confound; nevertheless, the connectivity we derived in adults bears a marked similarity to that from previous papers (e.g. 25) and we do see significant changes with age. We therefore believe this to be an important addition to the paper and we would prefer to keep it.

      References

      (1) Holmes, N., Bowtell, R., Brookes, M. J. & Taulu, S. An Iterative Implementation of the Signal Space Separation Method for Magnetoencephalography Systems with Low Channel Counts. Sensors 23, 6537 (2023).

      (2) Boto, E. et al. Moving magnetoencephalography towards real-world applications with a wearable system. Nature (2018) doi:10.1038/nature26147.

      (3) Holmes, M. et al. A bi-planar coil system for nulling background magnetic fields in scalp mounted magnetoencephalography. NeuroImage 181, 760–774 (2018).

      (4) Seymour, R. A. et al. Using OPMs to measure neural activity in standing, mobile participants. NeuroImage 244, 118604 (2021).

      (5) Rea, M. et al. A 90-channel triaxial magnetoencephalography system using optically pumped magnetometers. Annals of the New York Academy of Sciences 1517, https://doi.org/10.1111/nyas.14890 (2022).

      (6) Holmes, N. et al. Enabling ambulatory movement in wearable magnetoencephalography with matrix coil active magnetic shielding. NeuroImage 274, 120157 (2023).

      (7) Pakenham, D. O. et al. Post-stimulus beta responses are modulated by task duration. NeuroImage 206, 116288 (2020).

      (8) Fry, A. et al. Modulation of post-movement beta rebound by contraction force and rate of force development. Human Brain Mapping 37, 2493–2511 (2016).

      (9) Pfurtscheller, G. & Lopes da Silva, F. H. Event-related EEG/MEG synchronization and desynchronization: Basic principles. Clin Neurophysiol 110, 1842–1857 (1999).

      (10) Seedat, Z. A. et al. The role of transient spectral ‘bursts’ in functional connectivity: A magnetoencephalography study. NeuroImage 209, 116537 (2020).

      (11) Baker, A. P. et al. Fast transient networks in spontaneous human brain activity. eLife 2014, 1867 (2014).

      (12) Vidaurre, D. et al. Spectrally resolved fast transient brain states in electrophysiological data. NeuroImage 126, 81–95 (2016).

      (13) Gaetz, W. & Cheyne, D. Localization of sensorimotor cortical rhythms induced by tactile stimulation using spatially filtered MEG. NeuroImage 30, 899–908 (2006).

      (14) Cheyne, D. et al. Neuromagnetic imaging of cortical oscillations accompanying tactile stimulation. Cognitive Brain Research 17, 599–611 (2003).

      (15) van Ede, F., Jensen, O. & Maris, E. Tactile expectation modulates pre-stimulus β-band oscillations in human sensorimotor cortex. NeuroImage 51, 867–876 (2010).

      (16) Salenius, S., Schnitzler, A., Salmelin, R., Jousmäki, V. & Hari, R. Modulation of Human Cortical Rolandic Rhythms during Natural Sensorimotor Tasks. NeuroImage 5, 221–228 (1997).

      (17) Cheyne, D. O. MEG studies of sensorimotor rhythms: A review. Experimental Neurology 245, 27–39 (2013).

      (18) Kilavik, B. E., Zaepffel, M., Brovelli, A., MacKay, W. A. & Riehle, A. The ups and downs of beta oscillations in sensorimotor cortex. Experimental Neurology 245, 15–26 (2013).

      (19) Bauer, M., Oostenveld, R., Peeters, M. & Fries, P. Tactile Spatial Attention Enhances Gamma-Band Activity in Somatosensory Cortex and Reduces Low-Frequency Activity in Parieto-Occipital Areas. J. Neurosci. 26, 490–501 (2006).

      (20) Barone, J. & Rossiter, H. E. Understanding the Role of Sensorimotor Beta Oscillations. Frontiers in Systems Neuroscience 15, (2021).

      (21) Rayson, H. et al. Bursting with Potential: How Sensorimotor Beta Bursts Develop from Infancy to Adulthood. J Neurosci 43, 8487–8503 (2023).

      (22) Hill, R. M. et al. Optimising the Sensitivity of Optically-Pumped Magnetometer Magnetoencephalography to Gamma Band Electrophysiological Activity. Imaging Neuroscience (2024) doi:10.1162/imag_a_00112.

      (23) Stenroos, M., Hunold, A. & Haueisen, J. Comparison of three-shell and simplified volume conductor models in magnetoencephalography. NeuroImage 94, 337–348 (2014).

      (24) Barratt, E. L., Francis, S. T., Morris, P. G. & Brookes, M. J. Mapping the topological organisation of beta oscillations in motor cortex using MEG. NeuroImage 181, 831–844 (2018).

      (25) Rier, L. et al. Test-Retest Reliability of the Human Connectome: An OPM-MEG study. Imaging Neuroscience (2023) doi:10.1162/imag_a_00020.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Common comments

      (1) Significance of zero mutation rate

      Reviewers asked why we included mutation rate even though setting mutation rate to zero doesn’t change results. We think that including non-zero mutation rate makes our results more generalisable, and thus is a strength rather than weakness. To better motivate this choice, we have added a sentence to the beginning of Results:

      (2) Writing the mu=0 case first

      Reviewers suggested that we should first focus on the mu=0 case, and then generalize the result. The suggestions are certainly good. However, given the large amount of work involved in a re-organization, we have decided to adhere to our current narrative. That said, we now only include equations where mu=0 in the main text, and have moved the case of nonzero mutation rate to Supplementary Information.

      (3) Making equations more accessible

      We have taken three steps to make equations more readable.

      ● Equations in the main text correspond to the case of zero-mutation rate.

      ● The original section on equation derivation is now in a box in the main text so that readers have the choice of skipping it but interested readers can still get a gist of where equations came from.

      ● We have provided a much more detailed interpretation of the equation (see page 10).

      (4) Validity of the Gaussian approximation

      Reviewers raised concerns about the validity of the Gaussian approximation of the F frequency 𝑓(𝜏). The fact that our calculations closely match simulations suggests that this approximation is reasonable. Still, we added a discussion about the validity of this approximation in Box 1.

      We also added a figure to the SI covering various cases of initial S and F sizes. This figure shows that when either initial S or initial F is small, the distribution of 𝑓(𝜏) is not normal. However, if initial S and F are both on the order of hundreds, then the distribution of 𝑓(𝜏) is approximately Gaussian.
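
The dependence on initial sizes can be explored with a small simulation. The sketch below is not the authors' code. It assumes a pure-birth (Yule) process for each population, in which the lineage founded by one cell has a geometrically distributed size with parameter exp(-rate × τ). The growth-rate-times-τ values, the sample count and the function names are all hypothetical. Skewness of 𝑓(𝜏) is used as a rough indicator of departure from a Gaussian.

```python
import math
import random


def yule_size(n0, rate_times_tau, rng):
    """Size at time tau of a pure-birth population founded by n0 cells.

    Each founder's lineage size is geometric on {1, 2, ...} with success
    probability p = exp(-rate * tau); lineages are independent.
    """
    p = math.exp(-rate_times_tau)
    log_q = math.log1p(-p)
    # Inverse-CDF sampling of a geometric variate for every founder.
    return sum(1 + int(math.log(1.0 - rng.random()) / log_q) for _ in range(n0))


def f_frequency_skewness(s0, f0, n_samples=2000, seed=1):
    """Sample skewness of the F frequency f(tau) over many collectives."""
    rng = random.Random(seed)
    rs_tau, rf_tau = 2.0, 2.5  # ASSUMED growth rate x maturation time for S, F
    freqs = []
    for _ in range(n_samples):
        s = yule_size(s0, rs_tau, rng)
        f = yule_size(f0, rf_tau, rng)
        freqs.append(f / (s + f))
    mean = sum(freqs) / n_samples
    var = sum((x - mean) ** 2 for x in freqs) / n_samples
    third = sum((x - mean) ** 3 for x in freqs) / n_samples
    return third / var ** 1.5


skew_small_f = f_frequency_skewness(s0=200, f0=2)    # very few initial F cells
skew_large = f_frequency_skewness(s0=200, f0=200)    # both in the hundreds
print(skew_small_f, skew_large)
```

With only a couple of initial F cells the distribution of 𝑓(𝜏) is clearly skewed, whereas with both populations in the hundreds the skewness is close to zero, in line with the statement above.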

      Public Reviews:

      Summary:

      The authors demonstrate with a simple stochastic model that the initial composition of the community is important in achieving a target frequency during the artificial selection of a community.

      Strengths:

      To my knowledge, the intra-collective selection during artificial selection has not been seriously theoretically considered. However, in many cases, the species dynamics during the incubation of each selection cycle are important and relevant to the outcome of the artificial selection experiment. Stochasticity from birth and death (demographic stochasticity) plays a big role in these species' abundance dynamics. This work uses a simple framework to tackle this idea meticulously.

      This work may or may not be hysteresis (path dependency). If this is true, maybe it would be nice to have a discussion paragraph talking about how this may be the case. Then, this work would even attract the interest of people studying dynamic systems.

      We have added this clarification in the main text:

      “Note that here, selection outcome is path-dependent in the sense of being sensitive to initial conditions. This phenomenon is distinct from hysteresis where path-dependence results from whether a tuning parameter is increased or decreased.”

      Weaknesses:

      (1) Connecting structure and function

      In typical artificial selection literature, most of them select the community based on collective function. Here in this paper, the authors are selecting a target composition. Although there is a schematic cartoon illustrating the relationship between collective function (y-axis) and the community composition in the main Figure 1, there is no explicit explanation or justification of what may be the origin of this relationship. I think giving the readers a naïve idea about how this structure-function relationship arises in the introduction section would help. This is because the conclusion of this paper is that the intra-collective selection makes it hard to artificially select a community that has an intermediate frequency of f (or s). If there is really evidence or theoretical derivation from this framework that indeed the highest function comes from the intermediate frequency of f, then the impact of this paper would increase because the conclusions of this stochastic model could allude to the reasons for the prevalent failures of artificial selection in literature.

      We have added this to introduction: “This is a common quest: whenever a collective function depends on both populations, collective function is maximised, by definition, at an intermediate frequency (e.g. too little of either population will hamper function [23]).”

      (2) Explain intra-collective and inter-collective selection better for readers.

      The abstract, the introduction, and the results section use the terms intra-collective and inter-collective selection without much explanation. For the wide readership of eLife, a clear definition in the beginning would help the audience grasp the importance of this paper, because these concepts are at the core of this work.

      This is a great point. We have added in Abstract:

      “Such collective selection is dictated by two opposing forces: during collective maturation, intra-collective selection acts like a waterfall, relentlessly driving the S-frequency to lower values, while during collective reproduction, inter-collective selection resembles a rafter striving to reach the target frequency. Due to this model structure, maintaining a target frequency requires the continued action of inter-collective selection.”

      and in Introduction

      “A selection cycle consists of three stages (Fig. 1). During collective maturation, intra-collective selection favors fast-growing individuals within a collective. At the end of maturation, inter-collective selection acts on collectives and favors those achieving the target composition. Finally during collective reproduction, offspring collectives sample stochastically from the parents, a process dominated by genetic drift.”
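
The three stages above can be sketched as one function. This is a simplified illustration under stated assumptions, not the authors' simulation: mutation is set to zero, maturation is a pure-birth process, a single best Adult seeds all Newborns, and Newborns are drawn binomially from that parent. The names `run_cycle` and `yule_size` and all parameter values are hypothetical.

```python
import math
import random


def yule_size(n0, rate, tau, rng):
    """Pure-birth growth: size at time tau of a population founded by n0 cells."""
    log_q = math.log1p(-math.exp(-rate * tau))
    return sum(1 + int(math.log(1.0 - rng.random()) / log_q) for _ in range(n0))


def run_cycle(newborns, target_f, n0, r_s, r_f, tau, rng):
    """One selection cycle on a list of (S, F) Newborn collectives."""
    # 1. Maturation: F grows faster than S inside every collective
    #    (intra-collective selection).
    adults = [(yule_size(s, r_s, tau, rng), yule_size(f, r_f, tau, rng))
              for s, f in newborns]
    # 2. Inter-collective selection: keep the Adult whose F frequency is
    #    closest to the target (ASSUMPTION: only the single best is chosen).
    parent = min(adults, key=lambda sf: abs(sf[1] / (sf[0] + sf[1]) - target_f))
    f_parent = parent[1] / (parent[0] + parent[1])
    # 3. Reproduction: each Newborn samples n0 cells from the parent, so its
    #    composition fluctuates by drift.
    offspring = []
    for _ in range(len(newborns)):
        f_cells = sum(rng.random() < f_parent for _ in range(n0))
        offspring.append((n0 - f_cells, f_cells))
    return offspring, f_parent


rng = random.Random(0)
newborns = [(70, 30)] * 10  # ten Newborns, each with 70 S and 30 F cells
offspring, f_parent = run_cycle(newborns, target_f=0.3, n0=100,
                                r_s=0.4, r_f=0.5, tau=4.0, rng=rng)
print(f_parent, offspring[:3])
```

Calling `run_cycle` repeatedly on its own output gives a trajectory of the selected F frequency, which is one way to see whether a given target can be held against intra-collective selection.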

      (3) Achievable target frequency strongly depending on the degree of demographic stochasticity.

      I would expect that the experimentalists would find these results interesting and would want to consider these results during their artificial selection experiments. The main Figure 4 indicates that the Newborn size N0 is a very important factor to consider during the artificial selection experiment. This would be equivalent to how much bottleneck is imposed on the artificial selection process in every iteration step (i.e., the ratio of serial dilution experiment). However, with a low population size, all target frequencies can be achieved, and therefore in these regimes, the initial frequency now does not matter much. It would be great for the authors to provide what the N0 parameter actually means during the artificial selection experiments. Maybe relative to some other parameter in the model. I know this could be very hard. But without this, the main result of this paper (initial frequency matters) cannot be taken advantage of by the experimentalists.

      We have added to the SI an analytical approximation for N̆₀, the Newborn size below which all target frequencies can be achieved.

      Also, we have added lines indicating N0˘ in Fig4a.

      (4) Consideration of environmental stochasticity.

      The success (gold area of Figure 2d) in this framework mainly depends on the size of the demographic stochasticity (birth-only model) during the intra-collective selection. However, during experiments, a lot of environmental stochasticity appears to be occurring during artificial selection. This may be out of the scope of this study. But it would definitely be exciting to see how much environmental stochasticity relative to the demographic stochasticity (variation in the Gaussian distribution of F and S) matters in succeeding in achieving the target composition from artificial selection.

      You are correct that our work considers only demographic stochasticity.

      Indeed, considering other types of stochasticity will be an exciting future research direction. We added in the main text:

      “Overall our model considers mutational stochasticity, as well as demographic stochasticity in terms of stochastic birth and stochastic sampling of a parent collective by offspring collectives. Other types of stochasticity, such as environmental stochasticity and measurement noise, are not considered and require future research.”

      (5) Assumption about mutation rates

      If setting the mutation rates to zero does not change the result of the simulations and the conclusion, what is the purpose of having the mutation rates \mu? Also, is the unidirectional (S -> F -> FF) mutation realistic? I didn't quite understand how the mutations could fit into the story of this paper.

      This is a great point. We have added this to the beginning of Results to better motivate our study:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations. This scenario is encountered in biotechnology: an engineered pathway will slow down growth, and breaking the pathway (and thus faster growth) is much easier than the other way around. When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      See answer on common question 1.

      (6) Minor points

      In Figure 3b, it is not clear to me how the frequency difference for the Intra-collective and the Inter-collective selection is computed.

      We added a description to the caption of Figure 3b.

      In Figure 5b, the gold region (success) near the FF is not visible. Maybe increase the size of the figure or have an inset for zoom-in. Why is the region not as big as the bottom gold region?

      We increased the resolution of Fig 5b so that the gold region near FF is more visible.

      We have added Fig 5c and the following explanation to the main text:

      “From numerical simulations, we identified two accessible regions: a small region near FF and a band region spanning from S to F (gold in Fig. 5b i). Intuitively, the rate at which FF grows faster than S+F is greater than the rate at which F grows faster than S (see section VIII in Supplementary Information). Thus, the problem can initially be reduced to a two-population problem (i.e. FF versus F+S; Fig. 5c left), and then expanded to a three-population problem (Fig. 5c right).”

      Recommendations For The Authors

      Since the conclusion of the model greatly depends on the noise (variation) of F and S in the Gaussian distribution, it would be nice to have a plot where the y-axis is the variation in terms of frequency and the x-axis is the s_0 or f_0 (frequency). In the plot, I would love to see how the variation in the frequency depends on the initial frequency of S and F. Maybe this is just trivial.

      In the SI, we added Fig6a, as per your request. Previous Fig6 became Fig6b.

      Reviewer #2 (Public review):

      The authors provide an analytical framework to model the artificial selection of the composition of communities composed of strains growing at different rates. Their approach takes into account the competition between the targeted selection at the level of the meta-community and the selection that automatically favors fast-growing cells within each replicate community. Their main finding is a tipping point or path-dependence effect, whereby compositions dominated by slow-growing types can only be reached by community-level selection if the community does not start and never crosses into a range of compositions dominated by fast growers during the dynamics.

      These results seem to us both technically correct and interesting. We commend the authors on their efforts to make their work reproducible even when it comes to calculations via extensive appendices, though perhaps a table of contents and a short description of these appendices at the start of SI would help navigate them.

      Thank you for the suggestion. We have added a paragraph at the beginning of SI.

      The main limitation in the current form of the article is that it could clarify how its assumptions and findings differ from and improve upon the rest of the literature:

      -  Many studies discuss the interplay between community-level evolution and species- or strain-level evolution. But "evolution" can be a mix of various forces, including selection, drift/randomness, and mutation/innovation.

      - This work's specificity is that it focuses strictly on constant community-level selection versus constant strain-level selection, all other forces being negligible (neither stochasticity nor innovation/mutation matter at either level, as we try to clarify now).

      Note that intra-collective selection is not strictly “constant” in the sense that selection favoring F is the strongest at intermediate F frequency (Fig 3). However, we think that you mean that intra- and inter-collective selection are present in every cycle, and this is correct for our case, and for community selection in general.

      -  Regarding constant community-level selection, it is only briefly noted that "once a target frequency is achieved, inter-collective selection is always required to maintain that frequency due to the fitness difference between the two types" [pg. 3 §2]. In other words, action from the selector is required indefinitely to maintain the community in the desired state. This assumption is found in a fraction of the literature, but is still worth clarifying from the start as it can inform the practical applicability of the results.

      This is a good point. We have added to abstract:

      “Such collective selection is dictated by two opposing forces: during collective maturation, intra-collective selection acts like a waterfall, relentlessly driving the S-frequency to lower values, while during collective reproduction, inter-collective selection resembles a rafter striving to reach the target frequency. Due to this model structure, maintaining a target frequency requires the continued action of inter-collective selection.”

      - More importantly, strain-level evolution also boils down here to pure selection with a constant target, which is less usual in the relevant literature. Here, (1) drift from limited population sizes is very small, with no meaningful counterbalancing of selection, (2) pure exponential regime with constant fitness, no interactions, no density- or frequency-dependence, (3) there is no innovation in the sense that available types are unchanging through time (no evolution of traits such as growth rate or interactions) and (4) all the results presented seem unchanged when mutation rate mu = 0 (as noted in Appendix III), meaning that the conclusions are not "about" mutation in any meaningful way.

      With regard to point (1), Figure 4a (reproduced below) shows how Newborn size affects the region of achievable targets. Indeed at large Newborn size (e.g. 5000 and above), no target frequency is achievable (since drift is too small to generate sufficient inter-community variation and consequently all communities are dominated by fast-growing F). However at Newborn size of for example 1000, there are two regions of accessible target frequencies. At smaller Newborn size, all target frequencies become achievable due to drift becoming sufficiently strong.
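
The link between Newborn size and drift can be made concrete with a back-of-the-envelope calculation. It assumes that each Newborn is formed by binomially sampling $N_0$ cells from a parent with F frequency $f$; this sampling model and the symbol $f_{\mathrm{newborn}}$ are illustrative assumptions, and the variation that accumulates during maturation is ignored.

```latex
\operatorname{Var}\!\left(f_{\mathrm{newborn}}\right) = \frac{f(1-f)}{N_0},
\qquad
\sigma\!\left(f_{\mathrm{newborn}}\right)\Big|_{f=0.5} =
\begin{cases}
0.0071 & N_0 = 5000,\\
0.0158 & N_0 = 1000,\\
0.05   & N_0 = 100.
\end{cases}
```

Under this assumption, the spread among Newborns grows as $1/\sqrt{N_0}$, so small Newborns supply far more inter-collective variation for selection to act on than large ones.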

      With regard to points (2) and (3), we have added to Introduction

      “To enable the derivation of an analytical expression, we have made the following simplifications.

      First, growth is always exponential, without complications such as resource limitation, ecological interactions between the two populations, or density-dependent growth. Thus, the exponential growth equation can be used. Second, we consider only two populations (genotypes or species): the fast-growing F population with size F and the slow-growing S population with size S. We do not consider a spectrum of mutants or species, since with more than two populations, an analytical solution becomes very difficult.”

      With regard to point (4), we view this as a strength rather than a weakness. We have added the following to the beginning of Results and Discussions:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations.”

      “When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      See Point 1 of Common comments.

      - Furthermore, the choice of mutation mechanism is peculiar, as it happens only from slow to fast grower: more commonly, one assumes random non-directional mutations, rather than purely directional ones from less fit to fitter (which is more of a "Lamarckian" idea). Given that mutation does not seem to matter here, this choice might create unnecessary opposition from some readers or could be considered as just one possibility among others.

      We have added the following justification:

      “This scenario is encountered in biotechnology: an engineered pathway will slow down growth, and breaking the pathway (and thus faster growth) is much easier than the other way around.”

      It would be helpful to have all these points stated clearly so that it becomes easy to see where this article stands in an abundant literature and contributes to our understanding of multi-level evolution, and why it may have different conclusions or focus than others tackling very similar questions.

      Finally, a microbial context is given to the study, but the assumptions and results are in no way truly tied to that context, so it should be clear that this is just for flavor.

      We have deleted “microbial” from the title, and revised our abstract:

      Recommendations For The Authors

      (1) More details concerning our main remark above:

      - The paragraph discussing refs [24, 33] is not very clear in how they most importantly differ from this study. Our impression is that the resource aspect is not very important for instance, and the main difference is that these other works assume that strains can change in their traits.

      We are fairly sure that resource depletion is important in the Rainey group’s study, as the attractor only evolved after both strains grew fast enough to deplete resources by the end of maturation. Indeed, evolution occurred in interaction coefficients which dictate the competition between strains for resources.

      Regardless, you raised an excellent point. As discussed earlier, we have added the following:

      “To enable the derivation of an analytical expression, we have made the following simplifications.

      First, growth is always exponential, without complications such as resource limitation, ecological interactions between the two populations, or density-dependent growth. Thus, the exponential growth equation can be used. Second, we consider only two populations (genotypes or species): the fast-growing F population with size F and the slow-growing S population with size S. We do not consider a spectrum of mutants or species, since with more than two populations, an analytical solution becomes very difficult.”

      - We would advise the main text to focus on mu = 0, and only say in discussion that results can be generalized.

      Your suggestion is certainly good. However, given the large amount of work involved in a reorganisation, we have decided to adhere to our current narrative. As discussed earlier, we have nonetheless added this at the beginning of Results to help orient readers:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations.”

      “When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      (2) We think the material on pg. 5 "Intra-collective evolution is the fastest at intermediate F frequencies, creating the "waterfall" phenomenon", although interesting, could be presented in a different way. The mathematical details on how to find the probability distribution of the maximum of independent random variables (including Equation 1) will probably be skipped by most of the readers (for experienced theoreticians, it is standard content; for experimentalists, it is not the most relevant), as such I would recommend displacing them to SM and report only the important results.

      This is an excellent suggestion. We have put a sketch of our calculations in a box in the main text to help orient interested readers. As before, details are in SI.

      Similarly, Equations 2, 3, and 4 are hard to read given the large amount of parameters and the low amount of simplification. Although exploring the effect of the different parameters through Figures 3 and 4 is useful, I think the role of the equations should be reconsidered:

      i. Is it possible to rewrite them in terms of effective variables in a more concise way?

      See Point 3 of Common comments.

      ii. Is it possible to present extreme/particular cases in which they are easier to interpret?

      We have focused on the case where the mutation rate is zero. This makes the mathematical expressions much simpler (see above).

      (3) Is it possible to explain more in detail why the distribution of f_k+1 conditional to f_k^* is well approximated by a Gaussian? Also, have you explored to what extent the results would change if this were not true (in light of the few universal classes for the maximum of independent variables)?

      Despite the appeal to the CLT and the histograms in the Appendix suggesting that the distribution looks a bit like a Gaussian at a certain scale, fluctuations on that scale are not necessarily what is relevant for the results - a rapid (and maybe wrong) attempt at a characteristic function calculation suggests that in your case, one does not obtain convergence to Gaussians unless we renormalize by S(t=0) and F(t=0), so it seems there is a justification missing in the text as is for the validity of this approximation (or that it is simply assumed).

      See point 4 of Common comments.

      Reviewer #3 (Public Reviews):

      The authors address the process of community evolution under collective-level selection for a prescribed community composition. They mostly consider communities composed of two types that reproduce at different rates, and that can mutate one into the other. Due to such differences in 'fitness' and to the absence of density dependence, within-collective selection is expected to always favour the fastest grower, but the collective-level selection can oppose this tendency, to a certain extent at least. By approximating the stochastic within-generation dynamics and solving it analytically, the authors show that not only high frequencies of fast growers can be reproducibly achieved, aligned with their fitness advantage. Small target frequencies can also be maintained, provided that the initial proportion of fast growers is sufficiently small. In this regime, similar to the 'stochastic corrector' model, variation upon which selection acts is maintained by a combination of demographic stochasticity and of sampling at reproduction. These two regions of achievable target compositions are separated by a gap, encompassing intermediate frequencies that are only achievable when the bottleneck size is small enough or the number of communities is (disproportionately) larger.

      A similar conclusion, that stochastic fluctuations can maintain the system over evolutionary time far from the prevalence of the faster-growing type, is then confirmed by analyzing a three-species community, suggesting that the qualitative conclusions of this study are generalizable to more complex communities.

      I expect that these results will be of broad interest to the community of researchers who strive to improve community-level selection, but are often limited to numerical explorations, with prohibitive costs for a full characterization of the parameter space of such embedded populations. The realization that not all target collective functions can be as easily achieved and that they should be adapted to the initial conditions and the selection protocol is also a sobering message for designing concrete applications.

      A major strength of this work is that the qualitative behaviour of the system is captured by an analytically solvable approximation so that the extent of the 'forbidden region' can be directly and generically related to the parameters of the selection protocol.

      Thanks so much for these positive comments.

      I however found the description of the results too succinct and I think that more could be done to unpack the mathematical results in a way that is understandable to a broader audience. Moreover, the phenomenon the authors characterize is of purely ecological nature. Here, mutations of the growth rate are, in my understanding, neither necessary (non-trivial equilibria can be maintained also when \mu =0) nor sufficient (community-level selection is necessary to keep the system far from the absorbing state) for the phenomenon described. Calling this dynamics community evolution reflects a widespread ambiguity, and is not ascribable just to this work. I find that here the authors have the opportunity to make their message clearer by focusing on the case where the 'mutation' rate \mu vanishes (Equations 39 & 40 of the SI) - which is more easily interpretable, at least in some limits - while they may leave the more general equations 3 & 4 in the SI.

      See points 1-4 of Common comments.

      Combined with an analysis of the deterministic equations, that capture the possibility of maintaining high frequencies of fast growers, the authors could elucidate the dynamics that are induced by the presence of a second level of selection, and speculate on what would be the result of real open-ended evolution (not encompassed by the simple 'switch mutations' generally considered in evolutionary game theory), for instance discussing the invasibility (or not) of mutant types with slightly different growth rates.

      Indeed, evolution is not restricted to two types. However, our main goal here is to derive an analytical expression, and it was difficult for even two types. For three-type collectives, we had to resort to simulations. Investigating the case where fitness effects of mutations are continuously distributed is beyond the scope of this study.

      The single most important model hypothesis that I would have liked to be discussed further is that the two types do not interact. Species interactions are not only essential to achieve inheritance of composition in the course of evolution but are generally expected to play a key role even on ecological time scales. I hope the authors plan to look at this in future work.

      In our system, the S and F do interact in a competitive fashion: even though S and F are not competing for nutrients (which are always in excess), they are competing for space. This is because a fixed number of cells are transferred to the next cycle. Thus, the presence of F will for example reduce the chance of S being propagated. We have added this clarification to our main text:

      “Note that even though S and F do not compete for nutrients, they compete for space: because the total number of cells transferred to the next cycle is fixed, an overabundance of one population will reduce the likelihood of the other being propagated.”

      Recommendations For The Authors

      I felt the authors could put some additional effort into making their theoretical results meaningful for a population of readers who, though not as highly mathematically educated as they are, can nonetheless appreciate the implications of simple relations or scaling. Below, you find some suggestions:

      (1) In order to make it clear that there is a 'natural' high-frequency equilibrium that can be reached even in the absence of selection, the authors could examine first the dynamics of the deterministic system in the absence of mutations, and use its equilibria to elucidate the combined role of the 'fitness' difference \omega and of the generation duration \tau in setting its value. The fact that these parameters always occur in combination (when there are no mutations) is a general and notable feature of the stochastic model as well. Moreover, this model would justify why you only focus on decreasing the frequency in the new generation.

      Note that the ‘natural’ high-frequency equilibrium in the absence of collective selection is when fast grower F becomes fixed in the population. Following your suggestion, we have introduced two parameters R<sub>τ</sub> and W<sub>τ</sub> to reflect the coupling between ‘fitness’ and ‘generation duration’:

      (2) Since the phenomenon described in the paper is essentially ecological in nature (as the author states, it does not change significantly if the 'mutation rate' \mu is set to zero), I would put in the main text Equations 39 & 40 of the SI in order to improve intelligibility.

      See Point 2 at the beginning of this letter.

      These equations can be discussed in some detail, especially in the limit of small f^*_k, where I think it is worth discussing the different dependence of the mean and the variance of the frequency distribution on the system's parameters.

      This is a great suggestion. We have added the following:

      “In the limit of small f^*_k, Equation (3) becomes f while Equation (4) becomes . Thus, both Newborn size (N<sub>0</sub>) and fold-change in F/S during maturation (W<sub>τ</sub>) are important determinants of selection progress.”

      (3) I would have appreciated an explanation in words of what are the main conceptual steps involved in attaining Equation 2, the underlying hypotheses (notably on community size and distributions), and the expected limits of validity of the approximation.

      See points 3 and 4 at the beginning of this letter.

      (4) I think that some care needs to be put into explaining where extreme value statistics is used, and why is the median of the conditional distribution the most appropriate statistics to look at for characterizing the evolutionary trajectory (which seems to me mostly reliant on extreme values).

      Great point! We added an explanation of the use of the median value in Box 1, and also added Figure 7 to the SI to explain it.

      Showing in a figure the different distributions you are considering (for instance, plotting the conditional distribution for one generation in the trajectories displayed in Figure 2) would be useful to understand what information \bar f provides on a sequence of collective generations, where in principle there may be memory effects.

      Thanks for this suggestion. We have added panel d to Fig. 2 to illustrate the shape and position of the F frequency distributions at each step of the first two selection cycles.

      (5) Similarly, I do not understand why selecting the 5% best communities should push the system's evolution towards the high-frequency solution, instead of just slowing down the improvement (unless you are considering the average composition of the top best communities - which should be justified). I think that such sensitivity to the selection intensity should be appropriately referenced and discussed in the main text, as it is a parameter that experimenters are naturally led to manipulate.

      In the main text, we have added this explanation:

      “In contrast with findings from an earlier study [23], choosing top 1 is more effective than the less stringent “choosing top 5%”. In the earlier study, variation in the collective trait is partly due to nonheritable factors such as random fluctuations in Newborn biomass. In that context, a less stringent selection criterion proved more effective, as it helped retain collectives with favorable genotypes that might have exhibited suboptimal collective traits due to unfavorable nonheritable factors. However, since this study excludes nonheritable variations in collective traits, selecting the top 1 collective is more effective than selecting the top 5% (see Fig. 11 in Supplementary Information).”

      (6) Equation 1 could be explained in simpler terms as the product between the probability that one collective reaches the transmitted value times the probability that all others do worse than that. The current formulation is unclear, perhaps just a matter of English formulation.

      We have revised our description to state:

      “Equation (1) can be described as the product between two terms related to probability: (i) describes the probability density that any one of the g Adult collectives achieves f given , and (ii) describes the probability that all other g – 1 collectives achieve frequencies above f and are thus not selected.”
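
      The verbal description above corresponds to the standard density of the minimum of g independent draws: g times the density at f, times the probability that the other g – 1 draws exceed f. The sketch below illustrates this structure under two assumptions that are not taken from Equation (1) itself: the Adult frequencies are modelled as independent Gaussians with an illustrative mean and standard deviation, and the target lies below all Adult frequencies so that the lowest one is selected. Function names are hypothetical.

```python
import math
import numpy as np

def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return np.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, mean, sd):
    erf = np.vectorize(math.erf)
    return 0.5 * (1.0 + erf((x - mean) / (sd * math.sqrt(2.0))))

def selected_density(x, mean, sd, g):
    """Density of the lowest of g independent Gaussian frequencies.

    Term (i): g * pdf(x), any one of the g collectives achieves x.
    Term (ii): (1 - cdf(x)) ** (g - 1), all others lie above x.
    """
    return g * gaussian_pdf(x, mean, sd) * (1.0 - gaussian_cdf(x, mean, sd)) ** (g - 1)

if __name__ == "__main__":
    mean, sd, g = 0.3, 0.02, 10  # illustrative values only
    xs = np.linspace(mean - 6 * sd, mean + 6 * sd, 4001)
    dx = xs[1] - xs[0]
    dens = selected_density(xs, mean, sd, g)
    analytic_mean = np.sum(xs * dens) * dx

    # Monte Carlo check: draw g frequencies many times and keep the lowest.
    rng = np.random.default_rng(1)
    simulated_mean = rng.normal(mean, sd, size=(100000, g)).min(axis=1).mean()
    print(f"analytic mean of selected f:  {analytic_mean:.4f}")
    print(f"simulated mean of selected f: {simulated_mean:.4f}")
```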

      (7) I think that the discussion of the dependence of the boundaries of the 'waterfall' region with the difference in growth rate \omega is important and missing, especially if one wants to consider open-ended evolution of the growth rate - which can occur at steps of different magnitude.

      We added a new chapter and figure in supplementary information on the threshold values when \omega varies. As expected, smaller \omega enlarges the success area.

      We have also added a new figure panel to show how maturation time affects selection efficacy.

      (8) Notations are a bit confusing and could be improved. First of all, in most equations in the main text and SI, what is initially introduced as \omega appears as s. This is confusing because the letter s is also used for the frequency of the slow type.

      The letter S is used to denote an attribute of cells (S cells), the type of cells (Equations 1-3 of the SI) and the number of these cells in the population, sometimes with different meanings in the same sentence. This is confusing, and I suggest referring to slow cells or fast cells instead (or at least to S-cells and F-cells), and keeping S and F as variables for the number of cells of the two types.

      All typos related to the notation have been fixed. We use S and F as types, and S and F (italic) as population numbers.

      (9) On page 3, when introducing the sampling of newborns as ruled by a binomial distribution, the information that you are just transmitting one collective is needed, while it is conveyed later.

      We have added this emphasis:

      “At the end of a cycle, a single Adult with the highest function (with F frequency f closest to the target frequency ) is chosen to reproduce g Newborn collectives each with N<sub>0</sub> cells (‘Selection’ and ’Reproduction’ in Fig. 1).”
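
      The selection cycle described above can be sketched as a short simulation. This is a simplified illustration, not the model used in the manuscript: maturation is treated as deterministic exponential growth without mutation (the manuscript treats it stochastically and includes mutation), and the fold-change `w`, the number of Newborns `g`, the Newborn size `n0` and the function name are illustrative assumptions.

```python
import numpy as np

def run_selection(target, f_start, n_cycles, g=10, n0=1000, w=np.e, seed=0):
    """Track the F frequency of the selected Adult over successive cycles.

    target  : target F frequency
    f_start : F frequency of the initial parent Adult
    g       : number of Newborn collectives per cycle
    n0      : number of cells per Newborn
    w       : fold-change in F/S during one maturation (assumed constant)
    """
    rng = np.random.default_rng(seed)
    parent_f = f_start
    history = []
    for _ in range(n_cycles):
        # Reproduction: g Newborns, each with n0 cells sampled binomially.
        newborn_f = rng.binomial(n0, parent_f, size=g) / n0
        # Maturation: deterministic exponential growth, no mutation (assumption).
        adult_f = newborn_f * w / (1.0 - newborn_f + newborn_f * w)
        # Selection: the single Adult closest to the target reproduces.
        parent_f = adult_f[np.argmin(np.abs(adult_f - target))]
        history.append(parent_f)
    return np.array(history)

if __name__ == "__main__":
    trajectory = run_selection(target=0.1, f_start=0.1, n_cycles=50)
    print(trajectory[:5], trajectory[-5:])
```

      Varying `n0` and `g` in this sketch is one way to explore how sampling noise at reproduction interacts with selecting a single Adult per cycle.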

      (10) I found that the abstract talks too early about the 'waterfall' phenomenon. As this is a concept introduced here, I suggest the authors first explain what it is, then use the term. It is a useful metaphor, but it should not obscure the more formal achievements of the paper.

      We feel that the “waterfall” analogy offers a gentle helping hand to orient those who have not thought much about the phenomenon. We view the abstract as an opportunity to attract readership, and thus the more accessible the better.

      (11) In the SI there are numerous typos and English language issues. I suggest the authors read carefully through it, and add line numbers to the next version so that more detailed feedback is possible.

      Thank you for going through the SI. We have reviewed it and fixed the problems.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      This paper presents a computational model of the evolution of two different kinds of helping ("work," presumably denoting provisioning, and defense tasks) in a model inspired by cooperatively breeding vertebrates. The helpers in this model are a mix of previous offspring of the breeder and floaters that might have joined the group, and can either transition between the tasks as they age or not. The two types of help have differential costs: "work" reduces "dominance value," (DV), a measure of competitiveness for breeding spots, which otherwise goes up linearly with age, but defense reduces survival probability. Both eventually might preclude the helper from becoming a breeder and reproducing. How much the helpers help, and which tasks (and whether they transition or not), as well as their propensity to disperse, are all evolving quantities. The authors consider three main scenarios: one where relatedness emerges from the model, but there is no benefit to living in groups, one where there is no relatedness, but living in larger groups gives a survival benefit (group augmentation, GA), and one where both effects operate. The main claim is that evolving defensive help or division of labor requires the group augmentation; it doesn't evolve through kin selection alone in the authors' simulations.

      This is an interesting model, and there is much to like about the complexity that is built in. Individual-based simulations like this can be a valuable tool to explore the complex interaction of life history and social traits. Yet, models like this also have to take care of both being very clear on their construction and exploring how some of the ancillary but potentially consequential assumptions affect the results, including robust exploration of the parameter space. I think the current manuscript falls short in these areas, and therefore, I am not yet convinced of the results. Much of this is a matter of clearer and more complete writing: the Materials and Methods section in particular is incomplete or vague in some important junctions. However, there are also some issues with the assumptions that are described clearly.

      Below, I describe my main issues, mostly having to do with model features that are unclear, poorly motivated (as they stand), or potentially unrealistic or underexplored.

      We would like to thank the reviewer for the thoughtful comments that helped us to greatly improve the clarity of our paper.  

      One of the main issues I have is that there is almost no information on what happens to dispersers in the model. Line 369-67 states dispersers might join another group or remain as floaters, but gives no further information on how this is determined. Poring through the notation table also comes up empty as there is no apparent parameter affecting this consequential life history event. At some point, I convinced myself that dispersers remain floaters until they die or become breeders, but several points in the text contradict this directly (e.g., l 107). Clearly this is a hugely important model feature since it determines fitness cost and benefits of dispersal and group size (which also affects relatedness and/or fitness depending on the model). There just isn't enough information to understand this crucial component of the model, and without it, it is hard to make sense of the model output.

      We use the same dispersal gene β to represent the likelihood an individual will either leave or join a group, thereby quantifying both dispersal and immigration using the same parameter. Specifically, individuals with higher β are more likely to remain as floaters (i.e., disperse from their natal group to become a breeder elsewhere), whereas those with lower β are either more likely to remain in their natal group as subordinates (i.e., queue in a group for the breeding position) or join another group if they dispersed.  

      We added in the text “Dispersers may migrate to another group to become subordinates or remain as floaters waiting for breeding opportunities, which is also controlled by the same genetic dispersal propensity as subordinates” to clarify this issue. We also added in Table 1 that β is the “genetic predisposition to disperse versus remain in a group”, and to Figure 1 that “subordinates in the group (natal and immigrants) […]” after we already clarified that “Dispersers/floaters may join a random group to become subordinates.”

      Related to that, it seems to be implied (but never stated explicitly) that floaters do not work, and therefore their DV increases linearly with age (H_work in eq.2 is zero). That means any floaters that manage to stick around long enough would have higher success in competition for breeding spots relative to existing group members. How realistic is this? I think this might be driving the kin selection-only results that defense doesn't evolve without group augmentation (one of the two main ways). Any subordinates (which are mainly zero in the no GA, according to the SI tables; this assumes N=breeder+subordinates, but this isn't explicit anywhere) would be outcompeted by floaters after a short time (since they evolve high H and floaters don't), which in turn increases the benefit of dispersal, explaining why it is so high. Is this parameter regime reasonable? My understanding is that floaters often aren't usually high resource holding potential individuals (either b/c high RHP ones would get selected out of the floater population by establishing territories or b/c floating isn't typically a thriving strategy, given that many resources are tied to territories). In this case, the assumption seems to bias things towards the floaters and against subordinates to inherit territories. This should be explored either with a higher mortality rate for floaters and/or a lower DV increase, or both.

      When it comes to floaters replacing dead breeders, the authors say a bit more, but again, the actual equation for the scramble competition (which only appears as "scramble context" in the notation table) is not given. Is it simply proportional to R_i/\sum_j R_j ? Or is there some other function used? What are the actual numbers of floaters per breeding territory that emerge under different parameter values? These are all very important quantities that have to be described clearly.

      Although it is true that dispersers do not work when they are floaters, they may later help if they immigrate into a group as a subordinate. Consequently, immigrant subordinates have no inherent competitive advantage over natal subordinates (as step 2.2. “Join a group” is followed by step 3. “Help”, which occurs before step 5. “Become a breeder”). Nevertheless, floaters can potentially outcompete subordinates of the same age if they attempt to breed without first queuing as a subordinate (step 5) when subordinates are engaged in work tasks. We believe that this assumption is realistic and constitutes part of the costs associated with work tasks. However, floaters are at a disadvantage for becoming a breeder because: (1) floaters incur higher mortality than individuals within groups (Eq. 3); and (2) floaters may only attempt to become breeders in some breeding cycles (versus subordinate group members, who are automatically candidates for an open breeding position in the group in each cycle). Therefore, due to their higher mortality, floaters are rarely older than individuals within groups, which heavily influences their dominance value and competitiveness. Additionally, any competitive advantage that floaters might have over other subordinate group members is unlikely to drive the kin selection-only results because subordinates would preferably choose defense tasks instead of work tasks so as not to be at a competitive disadvantage compared to floaters.

      Regarding whether floaters are usually not high resource holding potential (RHP) individuals, which would make our assumptions unrealistic: empirical work in a number of species has shown that dispersers are not necessarily those of lower RHP or of lower quality. In fact, according to the ecological constraints hypothesis, one might predict that high quality individuals are the ones that disperse because only individuals in good condition (e.g., larger body size, better energy reserves) can afford the costs associated with dispersal (Cote et al., 2022). To allow differences in dispersal propensity depending on RHP, we extended our model in the Supplemental Materials by incorporating a reaction norm of dispersal based on rank (D = 1 / (1 + exp(β<sub>R</sub> * R − β<sub>0</sub>))) under the section “Dominance-dependent dispersal propensities”, now referenced in L195. This approach allows individuals to adjust their dispersal strategy to their competitiveness and to avoid kin competition by remaining as a subordinate in another group. Results show that adding the reaction norm of dispersal to rank did not qualitatively influence the results described in the main text.
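
      For readers unfamiliar with logistic reaction norms, the following minimal sketch shows how a dispersal probability of this form responds to rank. It assumes the exponent is β<sub>R</sub> * R minus β<sub>0</sub> (the operator between the two terms is an assumption here), and the function name and parameter values are purely illustrative rather than taken from the model code.

```python
import math

def dispersal_probability(rank, beta_r, beta_0):
    """Logistic reaction norm D = 1 / (1 + exp(beta_r * rank - beta_0)).

    Assumption: the slope term and the intercept term are combined by
    subtraction, as is usual for logistic reaction norms.
    """
    return 1.0 / (1.0 + math.exp(beta_r * rank - beta_0))

if __name__ == "__main__":
    # Illustrative values: with a positive slope, higher-ranked individuals
    # are less likely to disperse; a negative slope reverses the pattern.
    for rank in (0, 2, 4, 8):
        print(rank, round(dispersal_probability(rank, beta_r=0.5, beta_0=1.0), 3))
```

      Setting `beta_r` to zero recovers a rank-independent dispersal propensity governed by `beta_0` alone.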

      We also added “number of floaters” present in the whole population to the summary tables as requested.  

      As a side note, the “scramble context” we mention was an additional implementation in which we made rank independent of age. However, since the main conclusions remained unchanged, we decided to remove it for simplicity from the final manuscript, but we forgot to remove it from Table 1 before submission.  

      I also think the asexual reproduction with small mutations assumption is a fairly strong one that also seems to bias the model outcomes in a particular way. I appreciate that the authors actually measured relatedness within groups (though if most groups under KS have no subordinates, that relatedness becomes a bit moot), and also eliminated it with their ingenious swapping-out-subordinates procedure. The fact remains that unless they eliminate relatedness completely, average relatedness, by design, will be very high. (Again, this is also affected by how the fate of the dispersers is determined, but clearly there isn't a lot of joining happening, just judging from mean group sizes under KS only.) This is, of course, why there is so much helping evolving (even if it's not defensive) unless they completely cut out relatedness.

      As we showed in the Supplementary Tables and the section on relatedness in the SI (“Kin selection and the evolution of division of labor”), high relatedness does not appear to explain our results. In evolutionary biology generally and in game theory specifically (with the exception of models on sexual selection or sex-specific traits), asexual reproduction is often modelled because it reduces unnecessary complexity. To further study the effect of relatedness on kin structures more closely resembling those of vertebrates, however, we created an additional “relatedness structure level”, where we shuffled half of the philopatric offspring using the same method used to remove relatedness completely, effectively reducing within-group relatedness structure by half. As shown in the new Figure S3, the conclusions of the model remain unchanged.

      Finally, the "need for division of labor" section is also unclear, and its construction also would seem to bias things against division of labor evolving. For starters, I don't understand the rationale for the convoluted way the authors create an incentive for division of labor. Why not implement something much simpler, like a law of minimum (i.e., the total effect of helping is whatever the help amount for the lowest value task is) or more intuitively: the fecundity is simply a function of "work" help (draw Poisson number of offspring) and survival of offspring (draw binomial from the fecundity) is a function of the "defense" help. As it is, even though the authors say they require division of labor, in fact, they only make a single type of help marginally less beneficial (basically by half) if it is done more than the other. That's a fairly weak selection for division of labor, and to me it seems hard to justify. I suspect either of the alternative assumptions above would actually impose enough selection to make division of labor evolve even without group augmentation.

      In nature, multiple tasks are often necessary to successfully rear offspring. We simplify this principle in the model by maximizing reproductive output when both tasks are carried out to a similar extent, allowing for some flexibility from the mean. We added to the manuscript “For example, in many cooperatively breeding birds, the primary reasons that individuals fail to produce offspring are (1) starvation, which is mitigated by the feeding of offspring, and (2) nest depredation, which is countered by defensive behavior. Consequently, both types of tasks are necessary to successfully produce offspring, and focusing solely on one while neglecting the other is likely to result in lower reproductive success than if both tasks are performed by individuals within the group.”

      Regarding making fecundity a function of work tasks and offspring survival as a function of defensive tasks, these are actually equivalent in model terms, as it’s the same whether breeders produce three offspring and two die, or if they only produce one. This represents, of course, an oversimplification of the natural context, where breeding unsuccessfully is more costly (in terms of time and energy investment) than not breeding at all.
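
The equivalence claimed here can be checked, at least for offspring counts, with a standard thinning argument. The derivation below assumes the reviewer's suggested formulation (Poisson fecundity with mean \lambda set by work help, and independent offspring survival with probability p set by defense help); the symbols N, S, \lambda and p are introduced for illustration and are not the manuscript's notation.

```latex
\begin{align*}
N &\sim \mathrm{Poisson}(\lambda), \qquad S \mid N \sim \mathrm{Binomial}(N, p),\\
\Pr(S = s) &= \sum_{n \ge s} \frac{e^{-\lambda}\lambda^{n}}{n!}\binom{n}{s}p^{s}(1-p)^{n-s}
            = \frac{e^{-\lambda}(\lambda p)^{s}}{s!}\sum_{k \ge 0}\frac{[\lambda(1-p)]^{k}}{k!}
            = \frac{e^{-\lambda p}(\lambda p)^{s}}{s!}.
\end{align*}
```

So S follows a Poisson distribution with mean \lambda p. Under these assumptions the number of surviving offspring depends on the two kinds of help only through the product \lambda p, which is consistent with the statement that producing three offspring and losing two is equivalent, in model terms, to producing one.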

      Overall, this is an interesting model, but the simulation is not adequately described or explored to have confidence in the main conclusions yet. Better exposition and more exploration of alternative assumptions and parameter space are needed.

      We hope that our clarifications and extension of the model satisfy your concerns.  

      Reviewer #2 (Public review):

      Summary:

      This paper formulates an individual-based model to understand the evolution of division of labor in vertebrates. A main conclusion of the paper is that direct fitness benefits are the primary factor causing the evolution of vertebrate division of labor, rather than indirect fitness benefits.

      Strengths:

      The paper formulates an individual-based model that is inspired by vertebrate life history. The model incorporates numerous biologically realistic details, including the possibility to evolve age polyethism where individuals switch from work to defence tasks as they age or vice versa, as well as the possibility of comparing the action of group augmentation alone with that of kin selection alone.

      Weaknesses:

      The model makes assumptions that restrict the possibility that kin selection leads to the evolution of helping. In particular, the model assumes that in the absence of group augmentation, subordinates can only help breeders but cannot help non-breeders or increase the survival of breeders, whereas with group augmentation, subordinates can help both breeders and non-breeders and increase the survival of breeders. This is unrealistic as subordinates in real organisms can help other subordinates and increase the survival of non-breeders, even in the absence of group augmentation, for instance, with targeted helping to dominants or allies. This restriction artificially limits the ability of kin selection alone to lead to the evolution of helping, and potentially to division of labor. Hence, the conclusion that group augmentation is the primary factor driving vertebrate division of labor appears forced by the imposed restrictions on kin selection. The model used is also quite particular, and so the claimed generality across vertebrates is not warranted.

      We would like to thank the reviewer for the in-depth review. We respond to these and other comments below.  

      I describe some suggestions for improving the paper below, more or less in the paper's order.

      First, the introduction goes to great lengths trying to convince the reader that this model is the first in this or another way, particularly in being only for vertebrates, as illustrated in the abstract where it is stated that "we lack a theoretical framework to explore the conditions under which division of labor is likely to evolve" (line 13). However, this is a risky and unnecessary motivation. There are many models of division of labor and some of them are likely to be abstract enough to apply to vertebrates even if they are not tailored to vertebrates, so the claims for being first are not only likely to be wrong but will put many readers in an antagonistic position right from the start, which will make it harder to communicate the results. Instead of claiming to be the first or that there is a lack of theoretical frameworks for vertebrate division of labor, I think it is enough and sufficiently interesting to say that the paper formulates an individual-based model motivated by the life history of vertebrates to understand the evolution of vertebrate division of labor. You could then describe the life history properties that the model incorporates (subordinates can become reproductive, low relatedness, age polyethism, etc.) without saying this has never been done or that it is exclusive to vertebrates; indeed, the paper states that these features do not occur in eusocial insects, which is surprising as some "primitively" eusocial insects show them. So, in short, I think the introduction should be extensively revised to avoid claims of being the first and to make it focused on the question being addressed and how it is addressed. I think this could be done in 2-3 paragraphs without the rather extensive review of the literature in the current introduction.

      We have revised the novelty statements in the Introduction by more clearly emphasizing how our model addresses gaps in the existing literature. More details are provided in the comments below.

      Second, the description of the model and results should be clarified substantially. I will give specific suggestions later, but for now, I will just say that it is unclear what the figures show. First, it is unclear what the axes in Figure 2 show, particularly for the vertical one. According to the text in the figure axis, it presumably refers to T, but T is a function of age t, so it is unclear what is being plotted. The legend explaining the triangle and circle symbols is unintelligible (lines 227-230), so again it is unclear what is being plotted; part of the reason for this unintelligibility is that the procedure that presumably underlies it (section starting on line 493) is poorly explained and not understandable (I detail why below). Second, the axes in Figure 3 are similarly unclear. The text in the vertical axis in panel A suggests this is T, however, T is a function of t and gamma_t, so something else must be being done to plot this. Similarly, in panel B, the horizontal axis is presumably R, but R is a function of t and of the helping genotype, so again some explanation is lacking. In all figures, the symbol of what is being plotted should be included.

      We added the symbols of the variables to the Figure axes to increase clarity. In Figure 3A, we corrected the subindex t in the x-axis; it should be subindex R (reaction norm to dominance rank instead of age). As described in Table 1, all values of T, H and R are phenotypically expressed values. For instance, T values are the phenotypically expressed values from the individuals in the population according to their genetic gamma values and their current dominance rank at a given time point.  

      Third, the conclusions sound stronger than the results are. A main conclusion of the paper is that "kin selection alone is unlikely to select for the evolution of defensive tasks and division of labor in vertebrates" (lines 194-195). This conclusion is drawn from the left column in Figure 2, where only kin selection is at play, and the helping that evolves only involves work rather than defense tasks. This conclusion follows because the model assumes that without group augmentation (i.e., xn=0, the kin selection scenario), subordinates can only help breeders to reproduce but cannot help breeders or other subordinates to survive, so the only form of help that evolves is the least costly, not the most beneficial as there is no difference in the benefits given among forms of helping. This assumption is unrealistic, particularly for vertebrates where subordinates can help other group members survive even in the absence of group augmentation (e.g., with targeted help to certain group members, because of dominance hierarchies where the helping would go to the breeder, or because of alliances where the helping would go to other subordinates). I go into further details below, but in short, the model forces a narrow scope for the kin selection scenario, and then the paper concludes that kin selection alone is unlikely to be of relevance for the evolution of vertebrate division of labor. This conclusion is particular to the model used, and it is misleading to suggest that this is a general feature of such a particular model.

      The scope of this paper was to study division of labor in cooperatively breeding species with fertile workers (i.e., primarily vertebrates), in which help is exclusively directed towards breeders to enhance offspring production (i.e., alloparental care). Our focus is in line with previous work in most other social animals, including eusocial insects and humans, which emphasizes how division of labor maximizes group productivity. Other forms of “general” help are not considered in the paper, and such forms of help are rarely considered in cooperatively breeding vertebrates or in the division of labor literature, as they do not result in task partitioning to enhance productivity.

      Overall, I think the paper should be revised extensively to clarify its aims, model, results, and scope of its conclusions.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      I reserved this section for more minor comments, relating to clarity and a general admonition to give us more detail and exploration of some basic population genetic quantities.

      Another minor point, although depending on whether I assume right or wrong, it could be major: I am not entirely sure that dispersers help in the groups they join as helpers, because of line 399, which states specifically that individuals who do remain in natal territories do. But I assume dispersers help (elsewhere, the authors state helping is not conditional on relatedness to the breeder). Otherwise, this model becomes even weirder for me. Either way, please clarify.

      Apologies if this was not clear. Immigrants that join a group as subordinates (i.e., dispersers from another group) help and queue for a breeding position, as does any natal subordinate born into the group. We rephrased the sentence to “Subordinate group members, either natal or immigrants to the group, […]”

      More generally, in simulation studies like this, there can be interactions between the strength of selection (which affects overall genetic variation maintained in the population), population size, and mutation rate/size, which can affect, for example, relatedness values. None of these quantities is explored here (and their interactions are not quantified), so it is not possible to evaluate the robustness of any of these results.

      Thank you for your comments about the parameter landscape. It is important to point out that variations in the mutation rate do not qualitatively affect our results, as this is something we explored in previous versions of the model (not shown). Briefly, we find that variations in the mutation rates only alter the time required to reach equilibrium. Increasing the step size of mutation diminishes the strength of selection by adding stochasticity and reducing the genetic correlation between offspring and their parents. Population size could, in theory, affect our results, as small populations are more prone to extinction. Since this was not something we planned to explore in the paper directly, we specifically chose a large population size, or better said, a large number of territories (i.e. 5000) that can potentially host a large population.  

      The authors also never say how the amount of help is actually determined. There is the evolved helping variable, and there is also the evolved reaction norm. I assume that the actual amount of help of each type is given by the product of T (equation 1) and H (for defense) and (1-T) and H (for work), but this should be stated explicitly.

      Help provided is an interaction between H (total effort) and T (proportion of total effort invested in each type of task). To clarify the distinction between these two processes, we have now added “Hence, the gene α regulates the amount of help expressed, while the genes γ determine which specific helping tasks are performed at different time points in the breeding cycle”.  
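
The interaction between H and T can be sketched under the reviewer's stated assumption that defense help is T × H and work help is (1 − T) × H. The response confirms that the two variables interact but does not spell out the formula, so the product form, the function name `split_help`, and the reading of T as the share going to defense are assumptions made for illustration.

```python
def split_help(total_help, task_share):
    """Split total helping effort H between defense and work,
    where task_share (T) in [0, 1] is the proportion allocated to defense."""
    if not 0.0 <= task_share <= 1.0:
        raise ValueError("task_share (T) must lie in [0, 1]")
    defense = total_help * task_share          # assumed: T * H
    work = total_help * (1.0 - task_share)     # assumed: (1 - T) * H
    return defense, work

# Hypothetical helper with total effort H = 2.0 and T = 0.25.
defense, work = split_help(total_help=2.0, task_share=0.25)
```

In this reading, the gene α would set `total_help` and the γ genes would set `task_share` through the reaction norm.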

      It is also weird that after introducing the T variable as a function of age, Figure 3 actually depicts it as a function of dominance value.

      Thank you for pointing out an error in Eq. 1. This inequality was indeed written incorrectly in the paper (but is correct in the model code); it is dominance rank instead of age (see code in Individual.cpp lines 99-119). We corrected this mistake throughout the manuscript.

      What is "scramble context"?

      “Scramble context” was an additional implementation that we decided to remove from the final manuscript, but we forgot to remove from Table 1 before submission. We have now removed it from the table.

      Reviewer #2 (Recommendations for the authors):

      Some specific comments:

      (1) L 31: "All theoretical..." These absolute statements are risky and unnecessary.

      Rephrased to “To date, most theoretical and empirical work…”

      (2) L 46: I believe Tom Wenseleers has published on the evolution of division of labor with reproductive workers and high within-colony conflict.

      Tom Wenseleers has indeed produced some models on the evolution of cooperation in social insects where some workers may reproduce. However, these models focus on the relevance of relatedness and policing selecting for a reduction in within-group conflict and the evolution of reproductive division of labor. Our model focuses instead on division of labor among workers (helpers). We have rephrased this section to “task specialization is linked to sterility and where conflict of interest is generally low” to account for social insect species in which variation in relatedness between group members and higher levels of reproductive conflict may arise. We also cited one of his papers.

      (3) L 57: Again, unnecessary categorical statements.

      Rephrased to “Although a great deal of recent empirical work highlights the importance of direct benefits in the evolution of cooperative breeding behavior in vertebrates [21–24], we lack understanding on the joint influence of direct and indirect fitness benefits in the evolution of division of labor.”

      (4) L 67: This is said to be a key distinction, but in the paper, such a key role is not clearly shown. This and other tangential points are unnecessary to keep the introduction to the point.

      The different fitness costs of different tasks are the basis of our model on division of labor. Therefore, this is a key distinction and the basis from which to describe different tasks in the model. We have left this sentence unchanged.

      (5) L 61-73: "In vertebrates, however, helpers may obtain fitness benefits directly via reproduction..." Some social insects may do so as well. It seems unnecessary and incorrect to say that vertebrate sociality is fundamentally different from invertebrate one. I think it is sufficiently interesting to say this work aims to understand vertebrate division of labor, by explicitly modeling aspects of its life history, without saying this can't happen in invertebrates or that no other model has ever done anything like it.

      Our point is not that workers in some social insects cannot obtain direct fitness benefits, but that previous models focusing on the colony's reproductive outcome are only a good approximation to eusocial insects with sterile workers. However, to make this clearer we have added “In vertebrates and social insect with fertile workers, however, helpers may obtain fitness benefits directly via […]”.

      (6) L 74-86: By this point, the introduction reads like a series of disconnected comments without a clear point.

      In L60 we added: “Understanding how direct and indirect benefits interact is particularly important in systems where individuals may differentially bear the fitness costs of cooperation”. By adding this sentence, we emphasize our focus on the largely unexplored direct fitness benefits and costs, as well as their interaction with indirect fitness. We then proceed to explain why it is crucial to consider that tasks have varying direct fitness costs and how the fitness benefits derived from cooperation change with age and resource-holding potential. These elements are essential for studying the division of labour in species with totipotent workers.

      (7) L 87: This sentence gives a clear aim. It would be clearer if the introduction focused on this aim.

      With the new sentence added in L60 (see previous comment), we bring the focus to the main question that we are trying to address in this paper earlier in the Introduction.  

      (8) L 88: "stochastic model" should be changed to "individual-based model".

      Done.

      (9) L 104: "limited number" is unclear. Say a fixed finite number, or something specific.

      Done.

      (10) L 105: "unspecified number" is unclear. Say the number of subordinates emerges from the population dynamics.

      Changed to “variable number of subordinate helpers, the number of which is shaped by population dynamics, with all group members capable of reproducing during their lifetime”.

      (11) L 112: "Dispersers" is used, but in the previous lines 107-109, the three categories introduced used different terms. Those three terms introduced should be used consistently throughout the paper, without using two or more terms for one thing.

      We use the term “disperser” to describe individuals that disperse from their natal group.

      Dispersers can assume one of three roles: (1) they can join another group as "subordinates"; (2) they can join another group as "breeders" if they successfully outcompete others; or (3) they can remain as "floaters" if they fail to join a group. "Floaters" are individuals who persist in a transient state without access to a breeding territory, waiting for opportunities to join a group in an established territory. We rephrased the sentence to “Dispersers cannot reproduce without acquiring a territory (denoted here as floaters)”. This was also clarified in other instances where the term “dispersers” was used (e.g. L407). In other instances where this might not have been clear, we replaced “dispersers” with “floaters”.
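
The three outcomes for a disperser listed above can be written as a small classification. This is only a restatement of the terminology in code form; the names `Role` and `disperser_role` and the two boolean inputs are hypothetical and do not reflect how the authors' model decides who joins a group or wins a breeding position.

```python
from enum import Enum

class Role(Enum):
    SUBORDINATE = "subordinate"
    BREEDER = "breeder"
    FLOATER = "floater"

def disperser_role(joined_group, won_breeding_position):
    """Map the outcome of a dispersal event onto the three roles in the text."""
    if not joined_group:
        # Failed to join any group: persists without a territory.
        return Role.FLOATER
    # Joined a group: breeder only if it outcompeted the other candidates.
    return Role.BREEDER if won_breeding_position else Role.SUBORDINATE

role = disperser_role(joined_group=True, won_breeding_position=False)
```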

      (12) L 112: "(floaters)" Unclear parenthesis.

      See previous comment.  

      (13) L 115: There should be a reference to Methods around here.

      Added a reference to Figure 1.

      (14) L 117: To be clearer, say instead that dominance value is a linearly increasing function of age as a proxy of RHP and a linearly decreasing function of help provided due to the costs of working tasks. And refer to equation 2.

      Rephrased to “We use the term dominance value to designate the competitiveness of an individual compared to other candidates in becoming a breeder, regardless of group membership, that increases as a function of age, serving as a proxy for resource holding potential (RHP), and decreases as a function of help provided, reflecting costs to body condition from performing working tasks (Eq. 2).” We did not include “linearly” to keep it simpler, since it is clear from Eq. 2, which is now referenced here.  
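
A minimal sketch of a dominance value with these two properties is given below. It assumes the linear form the reviewer describes (increasing with age, decreasing with work help); the manuscript's Eq. 2 is not reproduced in this response, so the exact expression, the function name, and the parameter names `age_weight` and `work_cost` (and their default values) are hypothetical.

```python
def dominance_value(age, work_help, age_weight=1.0, work_cost=0.1):
    """Assumed linear dominance value: rises with age (a proxy for RHP)
    and falls with the amount of work help provided."""
    return age_weight * age - work_cost * work_help

# Hypothetical individual of age 4 providing 2.0 units of work help.
value = dominance_value(age=4, work_help=2.0, age_weight=1.0, work_cost=0.5)
```

With any positive `work_cost`, two individuals of the same age are ranked by how little work help they provide, which is the sense in which work tasks carry a cost to dominance.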

      (15) L 119: "Subordinate helpers". As all subordinates are helpers, the helper qualifier is confusing.

      Subordinates are not necessarily helpers, as they can evolve help values of 0, which is why we make this explicit here.

      (16) L 119: "choose". This terminology may be misleading. The way things are implemented in the model is that individuals are assigned a task depending on their genetic traits gamma. Perhaps it would be better to use a less intentional term, like perform one of two tasks.

      We changed “choose between two” to “engage in one of two”, which has fewer connotations of intentionality.

      (17) L 124: "Subordinates can [...] exhibit task specialization that [...] varies with their dominance value". It should be that it varies with age.

      Apologies. The equation was wrong; it does vary with dominance value. We corrected it accordingly.

      (18) L 133: "maximised" This is apparently important for the modelling procedure, but it is completely unclear what it means. Equation 4 comes out of nowhere, and it is said that such an equation is the maximum amount of help that can affect fecundity. Why? What does this mean? If there is something that is maximised, this should be proven. This value is then used for something (line 507), but it is unclear why or what it is used for (it says "we use the value of Hmax instead" without saying what for, no justification for the listed inequalities are given, and the claimed maximisation of an unspecified variable at those H values is not proven). Moreover, the notation in this section is also unclear: what are the sums over? Also, Hdefence and Hwork should vary over the index that is summed over, but the notation suggests that those quantities don't vary.

      We changed “maximized” to “greatest”, and we added a clarification of the rationale behind maximizing the impact of help on the breeder’s productivity: “For example, in many cooperatively breeding birds, the primary reasons that breeders fail to produce offspring are (1) starvation, which is mitigated by the feeding of offspring, here considered as a work task, and (2) nest depredation, which is countered by defensive behavior. Consequently, both types of tasks are often necessary for successful reproduction, and focusing solely on one while neglecting the other is likely to result in lower reproductive success than if both tasks are performed by helpers within the group.”

      We now also clarify that the sums are for help given within a group (L 507), and added indexes to the equations.

      (19) L 152: "habitat saturation" How is this implemented? How is density dependence implemented? Or can the population size keep increasing indefinitely? It would be good to plot the population size over time, the group size over time, and the variance in group size over time. This could substantiate later statements about enhancing group productivity and could all be shown in the SI.

      Habitat saturation emerges from population dynamics: territories are limited while the number of individuals fluctuates, so highly productive environments become saturated. Because the number of group members is not restricted in our model, the population could theoretically increase indefinitely. However, this is not observed in the results presented here, as we selected parameter landscapes that stabilize population numbers. Since we did not incorporate density-dependent mortality, for simplicity, we confined our parameters to those under which the population neither increased indefinitely nor collapsed. Consequently, the group sizes reported in the SI, which already include the standard deviation, closely represent group size at any given time during equilibrium.

      L 336: we changed “environments with habitat saturation” to “environments that lead to habitat saturation”, to increase clarity.

      (20) L 152: "lifecycle". Rather than the lifecycle, the figure describes the cycle of events in a single time step. The lifecycle (birth to death) goes over multiple time steps (as individuals live over multiple steps). So this figure shouldn't be called a life cycle.

      We changed “lifecycle” to “breeding cycle”.

      (21) L 156: "generation". This is not a generation but a time step.

      We changed “generation” to “breeding cycle”.

      (22) L 157: "previous life cycle" would mean that the productivity of a breeder depends on the number of helpers that its parents had, which is not what is meant.

      We changed “lifecycle” to “breeding cycle”.

      (23) L 158: "Maximum productivity is achieved when different helping tasks are performed to a similar extent." Again, unclear why that is the case.

      We added a clarification on this, see response to comment 18.  

      (24) L 160: "Dispersers/floaters". Use just one term for a single thing.

      See response to comment 11.   

      (25) L 162: "dispersal costs". I don't recall these being described in Methods.

      Individuals that disperse do not enjoy the protection of living in a territory and within a group of other individuals, so they have a higher mortality risk, as described in Eq. 3.3 (negative values in the exponential part of the equation increase survival). The cost of dispersal is the same as that borne by individuals that remain as floaters at a given time step.

      (26) L 164: "generation" -> time step.

      We changed this to “breeding cycle”.  

      (27) L 170: "Our results show that division of labor initially emerges because of direct fitness benefits..." This is a general statement, but the results are only particular to the model. So this statement and others in the manuscript should be particular to the model. Also, Figure 2 doesn't say anything about what evolves "initially" as it only plots evolutionary equilibria.

      We rephrased this statement to “Our results suggest that voluntary division of labor involving tasks with different fitness costs is more likely to emerge initially because of direct fitness benefits”, to more accurately represent the conditions under which we modeled the division of labor.  

      Our reference to “initially” is regarding group formation (family groups versus aggregations of unrelated individuals or a mix). This is shown in the comparison between the different graphs at equilibrium. The initial state of the simulation is that all individuals disperse and do not cooperate.  

      (28) L 171: "but a combination of direct and indirect fitness benefits leads to higher rates and more stable forms of division of labor". What do you mean by "higher rates and more stable forms of division of labor"? Say how division of labor is shown in the figure (with intermediate T?).

      Yes, intermediate values of T show division of labor if γR ≠ 0. This is described under the section “The role of dominance in task specialization”. We added “with intermediate values suggesting a division of labor” to the Figure 2 legend.  

      (29) L173-175: "as depicted in Figure 2, intermediate values of task specialization indicate in all cases age/dominance-mediated task specialization (γt ≠ 0; Table 1) and never a lack of specialization (γt = 0; Table 1)". This sentence is unclear and imprecise. Does this sentence want to say that in Figure 2, all plots with intermediate values of T involve gamma t different from zero? If so, just say that.

      Rephrased to: “In Figure 2, all plots depicting intermediate values of T exhibit non-zero γR values and, hence, division of labor”.

      (30) L179-180: "forms of help that impact survival never evolve under any environmental condition when only kin selection occurs". This is misleading because under the KS scenario, help cannot positively impact survival in this model, so they never evolve.

      Help cannot affect survival but could potentially affect group persistence. If helpers increase breeder productivity and offspring remain philopatric and queue for the breeding position, then they will receive help from related individuals.   

      (31) L 210: "initially". What do you mean by that?

      Help only evolves in our model in family groups, which may then open the door for the evolution of help in mixed-kin groups. Therefore, we use “initially” to refer to the ancestral group structure that likely led to cooperation under benign environmental conditions. We rephrased this section to “in more benign (and often highly productive) environments that lead to habitat saturation, help likely evolved initially in family groups, and defensive tasks are favored because competition for the breeding position is lower under kin selection.”

      (32) L 212: "kin selection is achieved". What does that mean?

      Rephrased to “kin selection acts not only by selecting subordinates in their natal group to increase the productivity of a related breeder […]”

      (33) L 216: "division of labor seems to be more likely to evolve in increasingly harsh environments". Say in parentheses where this is shown.

      Added.  

      (34) L 218: "help evolves in benign environments". I don't see where this is shown. Figure 2 doesn't show that H is higher with lower m (e.g., in KS+GA column).

      Help does not evolve in benign environments under only direct fitness benefits derived from group augmentation (shown in Figure 2).  

      (35) L 225: "y-axis" should be "vertical axis", as y has another meaning in the model.

      Done.

      (36) L 226: "likelihood". Here and throughout, "likelihood" should be changed to probability. Likelihood means something else.

      Thank you for the advice; we have corrected this throughout the manuscript.

      (37) L 236: "the slope of the reaction norm for the dominance value in task specialization". Unclear. Clearer to say: the rate at which individuals shift from defense to work as they age.

      The important part is not so much the rate but the direction, that is, from work task to defense (or vice versa) as their rank increases. Changed to “the direction and rate of change in task specialization with dominance”.

      (38) L 257: "(task = 0; cost to dominance value)," This seems out of place.

      This aims to clarify that work tasks have a cost to dominance, while defense tasks have a cost to survival. This is particularly relevant in this model since different helping tasks are defined by their fitness costs.

      (39) L 258: "increase"-> "increase with age".

      Added “with dominance”.

      (40) L 262: "division of labor equilibria" What is that?

      Changed to “at equilibrium when division of labor evolves”

      (41) L 268: "Our findings suggest that direct benefits of group living play a driving role in the evolution of division of labor via task specialization in species with totipotent workers". This is a very general statement, but the results are much more circumscribed. First, the model is quite specific by assuming that, in the absence of group augmentation (xn=0), indirect fitness benefits can only be given to breeders (Equation 5) but not to other subordinates (Equations 2, 3.1). This is unrealistic, particularly for vertebrates, and reduces the possibility that indirect fitness benefits play a role.  

      As previously discussed, the scope of this paper was to study division of labor in cooperatively breeding species with fertile workers in which help is exclusively directed towards breeders to enhance offspring production through alloparental care. Other forms of “general” help do not result in task partitioning to enhance productivity.

      Second, the difference in costs of work and defense are what drive the evolution of "division of labor" (understood as intermediate T in case this is what the authors mean) in the KS scenario, but the functional forms of those two costs are quite specific and not of the same form, so these functions may bias the results found. Specifically, R is an unbounded linear function of work and the effect of this function becomes weaker as the individual ages due to the weakening force of selection with age (Equation 2) whereas Sh is a particular bounded nonlinear function of defense (Equation 3.1). These differences may tend to make the effect of Sh stronger due to the particular functions chosen.  

      The difference in costs is inherent to the nature of the different tasks (work versus defense): while survival is naturally bounded, with death as the lower bound, dominance costs are potentially unbounded, as they are influenced by dynamic social contexts and potential competitors. Therefore, we believe that the model’s cost structure is not too different from that in nature.  

      Third, no parameter sweep is given to see to what extent these results hold across the many parameters involved. So, in summary, the discussion should at least reflect that the results are of a restricted nature rather than giving the impression that they are of the suggested level of generality.

      During the exploratory phase of the model development, various parameters and values were assessed. However, the manuscript only details the ranges of values and parameters where changes in the behaviors of interest were observed, enhancing clarity and conciseness. For instance, variation in yh (the cost of help on dominance when performing “work tasks”) led to behavioral changes similar to those caused by changes in xh (the cost of help on survival when performing “defensive tasks”), as the two are proportional to each other. Specifically, since an increase in defense costs raises the proportion of work relative to defense tasks, while an increase in the cost of work tasks has the opposite effect, only results for the variation of xh were included in the manuscript to avoid redundancy. Added to Table 1: “To maintain conciseness, further exploration of the parameter landscape was not included in the manuscript”.

      (42) L 270: "in eusocial insects often characterized by high relatedness and reproductive inhibition, sterile workers acquire fitness benefits only indirectly". This is misleading. Sterile workers of any taxa, be it insects or vertebrates, can only acquire fitness benefits indirectly as they are sterile, but eusocial insects involve not only sterile workers.

      Rephrased to “In contrast, in eusocial species characterized by high relatedness and permanent worker sterility, such as most eusocial insects, workers acquire fitness benefits only indirectly”. In any case, permanent sterility only occurs in eusocial invertebrates; in vertebrates with reproductive inhibition, sterility is only temporary and context-dependent. Therefore, in vertebrates, sterile workers may potentially obtain direct fitness benefits if the social context changes, as is the case in naked mole-rats.

      (43) L 273: "Group members in eusocial species are therefore predicted to maximize colony fitness due to the associated lower within-group conflict". Again, this is incorrect. Primitively eusocial insects have high conflict.

      We added “Group members in such eusocial species” to clarify that we are not referring here to primitively eusocial species but those with permanent sterile workers.  

      (44) L 277: "when the benefits of cooperation are evenly distributed among group members". In this model, the benefits of cooperation are not evenly distributed among group members: breeders reproduce, but subordinates don't.

      Subordinates may reproduce if they become breeders later in life. However, subordinates also benefit from cooperation as subordinates directly (greater survival in larger groups), and indirectly if they are related to the breeder. Here we refer to the first one, and we expand on that in the following sentence.  

      (45) L 280: "survival fitness benefits derived from living in larger groups seem to be key for the evolution of cooperative behavior in vertebrates [22, 63], and may also translate into low within-group conflict. This suggests that selection for division of labor in vertebrates is stronger in smaller groups". I don't see how the previous sentence suggests this. The paper does not present results to support this statement (i.e., no selection gradients in smaller vs larger groups are shown).

      The benefits of living in a larger group entail diminishing returns, so individuals living in smaller groups benefit more from an increase in productivity and group size than those in larger groups.
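
      The reasoning step can be made explicit with a small worked example. Assume, purely for illustration, that the survival benefit of living in a group of size N has the saturating form B(N) = N/(N + k) with k > 0; this form and the symbol k are assumptions, not the function used in the model. The gain from one additional group member is then:

```latex
\[
\Delta B(N) = B(N+1) - B(N)
            = \frac{N+1}{N+1+k} - \frac{N}{N+k}
            = \frac{k}{(N+k)(N+1+k)} .
\]
```

      Because the denominator grows with N, the gain is largest in small groups: with k = 1, it is 1/6 when going from 1 to 2 members but only 1/132 when going from 10 to 11.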

      (46) L 284: "Our model demonstrates that vertebrates evolve a more stable division of labor". Where is that shown? How is "more stable" measured?

      Rephrased to “vertebrates are more likely to evolve division of labor”. This is shown in Figure 2, which illustrates that division of labor evolves in a wider range of environmental conditions and to a higher degree (intermediate values of T).

      (47) L 287: "direct fitness benefits in the form of group augmentation select more strongly for defensive tasks". Where is that shown? Establishing this would entail comparing selection gradients with direct fitness benefits of group augmentation and without them.

      In Figure 2, when we compare the GA column to the KS+GA column, we see that at equilibrium, more helpers choose defense tasks, especially when they are free to choose their preferred task (circles).

      (48) L 288: "kin selection alone seems to select only for work tasks." Again, this may be an artifact of the model assuming that helpers cannot increase non-breeders' fitness components except via group augmentation, and that defense tasks are inherently more costly than work tasks.

      As stated previously, we are studying task specialization in cooperative breeders where help is in the form of alloparental care (from allofeeding and egg care to defense from predators). We also assume that the costs are different, but whether one or the other is more costly depends on the relative context (e.g., a task can be more costly if it affects competitiveness in a very competitive environment). It is important to note that we name these tasks “work” and “defense” for practical reasons, but the focus of the paper is on tasks with different fitness costs, some of which may not fit neatly under this terminology. While we acknowledge that most tasks have both kinds of fitness costs to a degree, here we focus on the main fitness costs of each kind of task (L430-436).

      (49) L 290: "are comparatively large". This sounds as if the tasks are large, which is presumably not what is meant.

      Rephrased to “costs to dominance value and to the probability of attaining a breeding position are comparatively larger than survival costs.”

      (50) L 298: "helpers are predicted to increase defensive tasks with age or rank, whereas in harsh environments, work tasks are predicted to increase with age or rank." Add parentheses referring to where this is shown.

      This is shown in Figure 3, but since this is described in the discussion, we did not add a reference to the figure. If the editor would like us to refer to figures here, we can (see also comments below relating to the same issue).

      (51) L 308: "the role of age and environmental harshness on the evolution of division of labor". What is the prediction? Simply, the role of age is an assumption, not a prediction.

      Rephrased to “the role of environmental harshness on the evolution of division of labor via age-dependent task specialization”.

      (52) L 315: "individuals shifting from work tasks such as foraging for food, digging, and maintaining the burrow system, to defensive tasks such as guarding and patrolling as individuals grow older and larger". Say in parentheses where this is predicted.

      This prediction comes from Figure 3; we do not reference it here since we are in the Discussion section.

      (53) L 320: "Under these conditions, our model predicts the highest levels of task partitioning and division of labor." Where is this predicted? Add parentheses referring to where this is shown. As it is, it is not possible to check the validity of the statement.

      This prediction comes from Figure 2, column KS+GA; we do not reference it here since we are in the Discussion section. The results with references to the figures are found under the Results section. In the discussion, we reiterate the results already described and add some examples from real data that seem to confirm our predictions.

      (54) L 322: "In line with our model predictions, larger and older helpers of this species invest relatively more in territory maintenance, whereas younger/smaller helpers defend the breeding shelter of the dominant pair to a greater extent against experimentally exposed egg predators". These predictions are neat, but are now very difficult to understand from the figures. Maybe at the bottom of 3A, you could add a diagram work->defense for negative gamma_t and defense>work for positive gamma_t (or whatever order it is).

      Done.

      (55) L 325: "Territory maintenance has been shown to greatly affect routine metabolic rates and, hence, growth rates [80], which directly translates into a decrease in the likelihood of becoming dominant and attaining breeding status, as predicted by our model." This seems to be an assumption, not a prediction.

      That is true. We removed: “as predicted by our model”.  

      (56) L 352: "controlled". This means something else.

      Changed to “addressed”.

      (57) L 356: "summary, our study represents the first theoretical model aimed at elucidating the potential mechanisms underlying division of labor between temporal non-reproductives via task specialization in taxa beyond eusocial organisms". Again, claiming to be the first is risky and unnecessary.

      Rephrased to “our study helps to elucidate”.

      (58) L 358: "Harsh environments, where individuals can obtain direct fitness benefits from group living, favor division of labor, thereby enhancing group productivity and, consequently, group size." I'm not sure about this conclusion as harsh environments (large m in Figure 2) also involve the evolution of no division of labor (from the triangles and circles that are zero in the right bottom panel) and perhaps more so than with less harsh environments (intermediate m). Incidentally, in the bottom right panel of Figure 2, do the two separate clusters of triangles and circles mean that there is some sort of evolutionary branching?

      Yes, there are two different equilibria for the same set of conditions. Although it is true that for m=0.3 less division of labor evolves when kin selection and group augmentation act together, this is not the case when only group augmentation takes place. In addition, we qualify m=0.2 as harsh, as opposed to the benign environment (m=0.1) in which we observe the rise of habitat saturation. m=0.3 is then an extremely harsh environment, in which several parameter combinations cause population collapse (see figures in the Supplemental Material).

      (59) L 360: "Variation in the relative fitness costs of different helping tasks with age favors temporal polyethism". I don't see that this has been shown. Temporal polyethism evolves here whenever gamma_t evolves non-zero values. Figure 3A shows that non-zero gamma_t evolves with harsher environments, but I don't see what the "variation in relative fitness costs of different helping tasks" refers to.

      The evolved reaction norms of the model are towards different fitness costs depending on the task performed, since this is how we define the different types of tasks in the model.  

      (60) L 382: "undefined". Say variable. Undefined is something else.

      Undefined is more accurate, since we did not define how many subordinates there were per group, while “variable” could have been defined within a range, which was not the case in this model.  

      (61) L 390: "each genetic locus". Say earlier that each genetic trait is controlled by a single locus.

      Added.  

      (62) L 395: "complete" and "consistent" -> "certain".

      We changed one to “certain” and another to “absolute” to avoid using the same adjective twice in a sentence.  

      (63) L 396: What determines whether dispersers become subordinates or floaters? A trait? Or a fixed probability?

      We added “which is also controlled by the same genetic dispersal predisposition as for subordinates”.

      (64) L 412-413: "cycle". This should be a breeding step.

      Changed to “season” instead.

      (65) L 418: Say negatively impacts (it could also be positively impacts, which I guess is not what you mean).

      Done.

      (66) L 425: "a sample of floaters". Chosen how?

      Added “randomly drawn”.

      (67) L 426-428. But the equation in Table 1 indicates that all floaters compete for breeding spots, not a sample of floaters. This is not clear.

      The number of floaters sampled to try to breed at a given group is N<sub>f,b</sub> = 𝑓∗𝑁<sub>𝑓</sub>/𝑁<sub>𝑏</sub> (Table 1).

      Therefore, N<sub>f,b</sub> is the sample size of floaters for a given open breeding position, and f is how many groups on average a floater attempts to access in each time step.  
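
      As a rough illustration of the expression above, the following sketch computes the number of floaters sampled per open breeding position. The function name, the argument names, and the numeric values are hypothetical and not taken from the model, and the interpretation of N_b as the number of breeding positions is an assumption. How the model handles non-integer results is not stated here, so the sketch returns the raw ratio.

```python
def floaters_per_breeding_position(f, n_f, n_b):
    """Return N_fb = f * N_f / N_b, the expression quoted from Table 1.

    f   : average number of groups a floater attempts to access per time step
    n_f : total number of floaters
    n_b : assumed here to be the number of breeding positions (groups)
    """
    if n_b <= 0:
        raise ValueError("n_b must be positive")
    return f * n_f / n_b


# Hypothetical values chosen only for illustration.
sample_size = floaters_per_breeding_position(f=2, n_f=1500, n_b=5000)
print(sample_size)
```

      With these made-up numbers, fewer than one floater on average contests each open position, which shows why only a sample of floaters, rather than all of them, competes for a given breeding spot.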

      (68) L 432. In the figure, the breeding cycle is called a step, but here it is called a cycle. There should be a single term used throughout. Breeding is not really a cycle here (it doesn't involve multiple steps that are repeated cyclically), so it seems more appropriate to call this breeding steps or breeding seasons.

      Taking into account previous comments, we changed the terms “generation” and “life cycle” to “breeding cycle”. We added “or seasons”.

      (69) L 439: "generations". What are generations here, as generations are overlapping? You probably mean time steps or something else.

      Changed to “breeding cycles”.

      (70) L 439: "equilibrium was reached". Presumably, equilibrium is reached only asymptotically, so some cutoff is implemented in practice. So maybe say explicitly what cutoff was implemented.

      As mentioned, we ran the model for 200’000 time steps, and if the phenotypic values had not reached equilibrium, we ran it for longer, with 400’000 time steps being the maximum needed for all simulations to reach equilibrium. In some cases, genetic values did not reach equilibrium in ranges where they had no impact on phenotypic values, so these were disregarded when assessing whether equilibrium was reached.

      (71) L 452: "Even though individuals are likely to change the total amount of help given throughout their lives". Do you mean in real organisms or in the model? Say which. If it is in the model, it is not clear how.

      We added “in nature” to clarify that this was not the case in the model.  

      (72) L 455: "For more details on how individuals may adapt their level of help with age and social and environmental conditions, see [63]." Do you mean real individuals or in the model? Again, if it is in the model, it is unclear how this is possible and should be explained in this paper at least briefly rather than citing another one.

      We rephrased it to “How individuals in the model may adapt their level of help with age and social and environmental conditions has been described elsewhere.” We do not go into detail here because it is not within the scope of the paper, and those results have been described elsewhere.  

      (73) L 475: "helpers". Make terminology consistent throughout.

      All helpers are subordinates, but not all subordinates are helpers, as they may evolve no help. Since here we are describing those subordinates that do help, we use that terminology. We added “subordinate helpers” to clarify this further.  

      (74) L 476: "proportional". The dependence in Equation 1 is not "proportional to". Say something like "a survival probability (not rate) that decreases with the amount of help provided".

      Done.

      (75) L 482: "environmental"-> baseline, as defined first.

      Done.

      (76) L 486: "benefits". Can you briefly say in parentheses what those benefits are in real organisms? As in line 475, where you reminded the reader of survival costs due to predator defense.

      Added “such as those offered by safety in numbers or increased resource defense potential”.

      (77) L 494. "we first outline a basic model in which individuals". It is not clear what this sentence says, and the remainder of this section does not clarify it.

      We made two models for comparison, one where individuals can choose freely which task they prefer to perform, and another in which there is an increase in productivity when both kinds of tasks are performed to a similar extent at group level. In the latter model, individuals may choose a non-preferred task at certain times during their lives to increase the effect of the help provided on the breeder’s (and group’s) productivity.

      We rephrased this section to “we first outline a basic model where individuals evolve their preferred helping task. Then we compare this to another model in which the breeder’s reproductive outcome is maximized when the group’s helping effort in each kind of tasks is performed to a roughly equal degree.”

      (78) L 496: "by performing both tasks". Sounds as if the breeder performs both tasks, not helpers.

      We changed to “when the group’s helping effort in each kind of tasks”.

      (79) L 497: "the maximum amount of cumulative help of each type (sigma Hmax) that can affect fecundity is given by Eq. 4:" This statement is imprecise. Presumably, what is meant is that this level of help maximises breeder productivity, as stated earlier in the paper. However, there is no proof that this level of help maximises breeder productivity, so this expression seems unjustified and it is unclear how it is used.

      This is a description of the model setup. As described later in the same section, the cumulative help of each type that influences the breeder’s fecundity is capped at Hmax. Therefore, it does represent the maximum amount of cumulative help of each type that can affect the breeder’s fecundity.

      (80) L 500: "reproduced" -> "reproduce".

      Done.  

      (81) L 503. Say here what K is so that the reader knows what equation 5 is showing.

      Added “K” to the “The quantity of offspring produced (K)”.

      (82) L 503: "diminishing returns" -> "diminishing returns as help increases".

      Done.  

      (83) L 507: Why these inequalities?

      These inequalities explain the use of Hmax (response to comment 79). We rephrased it to “the cumulative defense effort is larger than or the cumulative work effort is larger than ”.

      (84) L 526: "removing the influence of relatedness from the model". It would be helpful to plot relatedness in this and the other scenario to check that it is indeed low here and high in the other.

      The actual values of relatedness are provided in the Supplemental Material Table S1. We added this reference to Figure 2.  

      (85) L 528: "It is possible that direct and indirect fitness benefits could have an additive effect on the evolution of alloparental care". This is technically incorrect. It is also unclear what the point of this sentence is.

      We have removed this sentence.  

      (86) Table 1: Say what are the allowed values for these genotypic traits (can they take negative values, be greater than one, are they continuous or discrete?): e.g., alpha \in [0,1] or alpha \in (-infinity, infinity). For phenotypic traits, it would be helpful if the third column lists the equation where the trait is defined. As the variables in the first column are scalars, they should not be bold face. Survival "rate" should be survival "probability" throughout.

      All genetic traits can take any real number (-infinity, infinity), but the phenotypic values are either constrained by the equation like for logistic formulas, or manually constrained like for dispersal propensity or help (only positive numbers allowed). We added “Each genetic trait is controlled by a single locus, and may take any real number” (L403), and added the boundaries for help and dominance value in Table 1. We decided against including the equations in the table due to space constraints. We removed the bold face as suggested. We changed all instances of “survival rate” to “survival probability”.

      (87) Figures S1, S2: I don't recall seeing references to these figures in the main text, but there should be, as well as for Tables S1-S3.

      Table S1 is now referenced in Figure 2. The other figures are now referenced in the main text when we reference the different sections in the Supplemental Materials (L190 and L198). Other Tables are referenced in their respective Figures in the SI.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Kroll et al. conduct an in-depth behavioral analysis of F0 knockouts of 4 genes associated with late-onset Alzheimer's Disease (AD), together with 3 genes associated with early-onset AD. Kroll and colleagues developed a web application (ZOLTAR) to compare sleep-associated traits between genetic mutants with those obtained from a panel of small molecules to promote the identification of affected pathways and potential therapeutic interventions. The authors make a set of potentially important findings vis-à-vis the relationship between AD-associated genes and sleep. First, they find that loss-of-function in late-onset AD genes universally results in night-time sleep loss, consistent with the well supported hypothesis that sleep disruption contributes to Alzheimer's-related pathologies. psen-1, an early-onset associated AD gene, which the authors find is principally responsible for the generation of AB40 and AB42 in zebrafish, also shows a slight increase in activity at night and slight decreases in night-time sleep. Conversely, psen-2 mutations increase daytime sleep, while appa/appb mutations have no impact on sleep. Finally, using ZOLTAR, the authors identify serotonin receptor activity as potentially disrupted in sorl1 mutants, while betamethasone is identified as a potential therapeutic to promote reversal of psen2 knockout-associated phenotypes.

      This is a highly innovative and thorough study, yet a handful of key questions remain. First, are night-time sleep loss phenotypes observed in all knockouts for late-onset AD genes in the larval zebrafish a valid proxy for AD risk?

      We cannot say, but it is an interesting question. We selected the four late-onset Alzheimer’s risk genes (APOE, CD2AP, CLU, SORL1) based on human genetics data and brain expression in zebrafish larvae, not based on their likelihood to modify sleep behaviour, which we could have tried by searching for overlaps with GWAS of sleep phenotypes, for example. Consequently, we find it remarkable that all four of these genes caused a night-time sleep phenotype when mutated. We also find it reassuring that knockout of appa/appb and psen2 did not cause a night-time sleep phenotype, which largely excludes the possibility that the phenotype is a technical artefact (e.g. caused by the F0 knockout method) or a property of every gene expressed in the larval brain.

      Having said that, it could still be a coincidence, rather than a special property of genes associated with late-onset AD. In addition to testing additional late-onset Alzheimer’s risk genes, the ideal way to answer this question would be to test in parallel a random set of genes expressed in the brain at this stage of development. From this random set, one could estimate the proportion of genes that cause a night-time sleep phenotype when mutated. One could then use that information to test whether late-onset Alzheimer’s risk genes are indeed enriched for genes that cause a night-time sleep phenotype when mutated.
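
      The enrichment test described above can be sketched as follows. This is a minimal illustration, not an analysis performed in the study: the background counts are invented, and the function name is hypothetical. It asks how likely it would be to see a night-time sleep phenotype in all four late-onset genes if each gene had the same chance of producing it as a random brain-expressed gene.

```python
from math import comb


def binomial_tail(k, n, p):
    """Probability of observing at least k successes in n trials, X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical background: 10 of 100 random brain-expressed genes cause
# a night-time sleep phenotype when mutated (made-up numbers).
background_rate = 10 / 100

# Observed in the study: 4 of 4 late-onset risk genes show the phenotype.
p_value = binomial_tail(k=4, n=4, p=background_rate)
print(f"P(at least 4 of 4 by chance) = {p_value:.4f}")
```

      This sketch treats the background rate as known exactly. With a small random set, a test that accounts for uncertainty in both groups, such as Fisher's exact test on the 2×2 table of gene set versus phenotype, would likely be more appropriate.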

      For those mutants that cause night-time sleep disturbances, do these phenotypes share a common underlying pathway? e.g. Do 5-HT reuptake inhibitors promote sleep across all 4 late-onset genes in addition to psen1? Can 5-HT reuptake inhibitors reverse other AD-related pathologies in zebrafish? Can compounds be identified that have a common behavioral fingerprint across all or multiple AD risk genes? Do these modify sleep phenotypes?

      To attempt to answer these questions, we used ZOLTAR to generate predictions for all the knockout behavioural fingerprints presented in the study, in the same way as for sorl1 in Fig. 5 and Fig. 5–supplement 1. Here are the indications, targets, and KEGG pathways which are shared by the largest number of knockouts (Author response image 1):

      – One indication is shared by 4/7 knockouts: “opioid dependence” (significant for appa/appb, psen1, apoea/apoeb, cd2ap).

      – Four targets are shared by 4/7 knockouts: “strychnine-binding glycine receptor” (psen1, apoea/apoeb, clu, sorl1); “neuronal acetylcholine receptor beta-2” (psen1, apoea/apoeb, cd2ap, clu); “thyroid peroxidase” (psen1, apoea/apoeb, cd2ap, clu); and “carbonic anhydrase IV” (appa/appb, psen1, psen2, cd2ap).

      – Three KEGG pathways are shared by 5/7 knockouts: “cholinergic synapse” (psen1, apoea/apoeb, cd2ap, clu, sorl1); “tyrosine metabolism” (psen2, apoea/apoeb, cd2ap, clu, sorl1); and “nitrogen metabolism” (appa/appb, psen1, psen2, apoea/apoeb, cd2ap).
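
      The tally behind this list can be sketched in a few lines. The KEGG pathway memberships below are copied from the list above; the function names are hypothetical, and the default threshold of three knockouts follows the caption of Author response image 1. This is only an illustration of the counting step, not the code used by ZOLTAR.

```python
from collections import Counter

# Knockouts for which each KEGG pathway was significant (from the list above).
kegg_hits = {
    "cholinergic synapse": ["psen1", "apoea/apoeb", "cd2ap", "clu", "sorl1"],
    "tyrosine metabolism": ["psen2", "apoea/apoeb", "cd2ap", "clu", "sorl1"],
    "nitrogen metabolism": ["appa/appb", "psen1", "psen2", "apoea/apoeb", "cd2ap"],
}


def rank_shared(hits, min_knockouts=3):
    """Keep annotations significant in at least min_knockouts knockouts,
    ranked from most to least widely shared (ties broken alphabetically)."""
    counts = {annotation: len(set(kos)) for annotation, kos in hits.items()}
    kept = [(a, n) for a, n in counts.items() if n >= min_knockouts]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


def knockout_frequency(hits):
    """Count in how many of the listed annotations each knockout appears."""
    return Counter(ko for kos in hits.values() for ko in set(kos))


for annotation, n in rank_shared(kegg_hits):
    print(f"{annotation}: {n}/7 knockouts")
print(knockout_frequency(kegg_hits).most_common(2))
```

      Counting per knockout as well as per annotation shows which mutants drive the overlap; in this subset, apoea/apoeb and cd2ap appear in all three pathways.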

      As a reminder, we hypothesised that loss of Sorl1 affected serotonin signalling based on the following annotations being significant: indication “depression”, target “serotonin transporter”, and KEGG pathway “serotonergic synapse”. Indication “depression” is only significant for sorl1 knockouts; target “serotonin transporter” is also significant for appa/appb and psen2 knockouts; and KEGG pathway “serotonergic synapse” is also significant for psen2 knockouts. ZOLTAR therefore does not predict serotonin signalling to be a major theme common to all mutants with a night-time sleep loss phenotype.

      Particularly interesting is cholinergic signalling appearing in the most common targets and KEGG pathways. Acetylcholine signalling is a major theme in research on AD. For example, the first four drugs ever approved by the FDA to treat AD were acetylcholinesterase inhibitors, which increase acetylcholine signalling by preventing its breakdown by acetylcholinesterase. These drugs are generally considered only to treat symptoms and not modify disease course, but this view has been called into question (Munoz-Torrero, 2008; Relkin, 2007). If, as ZOLTAR suggests, mutations in several Alzheimer’s risk genes affect cholinergic signalling early in development, this would point to a potential causal role of cholinergic disruption in AD.

      Author response image 1.

      Common predictions from ZOLTAR for the seven Alzheimer’s risk genes tested. Predictions from ZOLTAR which are shared by multiple knockout behavioural fingerprints presented in the study. Only indications, targets, and KEGG pathways which are significant for at least three of the seven knockouts tested are shown, ranked from the annotations which are significant for the largest number of knockouts.

      Finally, the web-based platform presented could be expanded to facilitate comparison of other behavioral phenotypes, including stimulus-evoked behaviors.

      Yes, absolutely. The behavioural dataset we used (Rihel et al., 2010) did not include stimuli other than day/night light transitions, but the “SauronX” platform and dataset (Myers-Turnbull et al., 2022) seem particularly well suited for this. To provide some context, we and collaborators have occasionally used the dataset by Rihel et al. (2010) to generate hypotheses or find candidate drugs that reverse a behavioural phenotype measured in the sleep/wake assay (Ashlin et al., 2018; Hoffman et al., 2016). The present work was the occasion to enable a wider and more intuitive use of this dataset through the ZOLTAR app, which has already proven successful. Future versions of ZOLTAR may seek to incorporate larger drug datasets using more types of measurements.

      Finally, the authors propose but do not test the hypothesis that sorl1 might regulate localization/surface expression of 5-HT2 receptors. This could provide exciting / more convincing mechanistic support for the assertion that serotonin signaling is disrupted upon loss of AD-associated genes.

      While working on the Author Response, we made some changes to the analysis run by ZOLTAR to calculate enrichments (see Methods and github.com/francoiskroll/ZOLTAR, notes on v2). With the new version, 5-HT receptor type 2 is not a significantly enriched target for the sorl1 knockout fingerprint but type 4 is. 5-HT receptor type 4 was also shown to interact with sorting nexin 27, a subunit of retromer, so is a promising candidate (Joubert et al., 2004). Antibodies against human 5-HT receptor type 2 and 4a exist; whether they would work in zebrafish remains to be tested. In our experience, the availability of antibodies suitable for immunohistochemistry in the zebrafish is a serious experimental roadblock.

      Note that all the results presented in the “Version of Record” are from ZOLTAR v2.

      Despite these important considerations, this study provides a valuable platform for high-throughput analysis of sleep phenotypes and correlation with small-molecule-induced sleep phenotypes.

      Strengths:

      - Provides a useful platform for comparison of sleep phenotypes across genotypes/drug manipulations.

      - Presents convincing evidence that night-time sleep is disrupted in mutants for multiple late onset AD-related genes.

      - Provides potential mechanistic insights for how AD-related genes might impact sleep and identifies a few drugs that modify their identified phenotypes

      Weaknesses:

      - Exploration of potential mechanisms for serotonin disruption in sorl1 mutants is limited.

      - The pipeline developed can only be used to examine sleep-related / spontaneous movement phenotypes and stimulus-evoked behaviors are not examined.

      - Comparisons between mutants/exploration of commonly affected pathways are limited.

      Thank you for these excellent suggestions, please see our answers above.

      Reviewer #2 (Public Review):

      Summary:

      This work delineates the larval zebrafish behavioral phenotypes caused by the F0 knockout of several important genes that increase the risk for Alzheimer's disease. Using behavioral pharmacology, comparing the behavioral fingerprint of previously assayed molecules to the newly generated knockout data, compounds were discovered that impacted larval movement in ways that suggest interaction with or recovery of disrupted mechanisms.

      Strengths:

      This is a well-written manuscript that uses newly developed analysis methods to present the findings in a clear, high-quality way. The addition of an extensive behavioral analysis pipeline is of value to the field of zebrafish neuroscience and will be particularly helpful for researchers who prefer the R programming language. Even the behavioral profiling of these AD risk genes, regardless of the pharmacology aspect, is an important contribution. The recovery of most behavioral parameters in the psen2 knockout with betamethasone, predicted by comparing fingerprints, is an exciting demonstration of the approach. The hypotheses generated by this work are important stepping stones to future studies uncovering the molecular basis of the proposed gene-drug interactions and discovering novel therapeutics to treat AD or co-occurring conditions such as sleep disturbance.

      Weaknesses:

      - The overarching concept of the work is that comparing behavioral fingerprints can align genes and molecules with similarly disrupted molecular pathways. While the recovery of the psen2 phenotypes by one molecule with the opposite phenotype is interesting, as are previous studies that show similar behaviorally-based recoveries, the underlying assumption that normalizing the larval movement normalizes the mechanism still lacks substantial support. There are many ways that a reduction in movement bouts could be returned to baseline that are unrelated to the root cause of the genetically driven phenotype. An ideal experiment would be to thoroughly characterize a mutant, such as by identifying a missing population of neurons, and use this approach to find a small molecule that rescues both behavior and the cellular phenotype. If the connection to serotonin in the sorl1 was more complete, for example, the overarching idea would be more compelling.

      Thank you for this cogent criticism.

      On the first point, we were careful not to claim that betamethasone normalises the molecular/cellular mechanism that causes the psen2 behavioural phenotype. Having said that, yes, to a certain extent that would be the hope of the approach. As you say, not every compound which normalises the behavioural fingerprint will normalise the underlying mechanism, but the converse seems true: every compound that normalises the underlying mechanism should also normalise the behavioural fingerprint. We think this logic makes the “behaviour-first” approach innovative and interesting. The logic is to first discover compounds that normalise the behavioural phenotype, and only subsequently test whether they also normalise the molecular mechanism, akin to testing first whether a drug resolves the symptoms before testing whether it actually modifies disease course. While in practice testing thousands of drugs in sufficient sample sizes and replicates on a mutant line is challenging, the dataset queried through ZOLTAR provides a potential shortcut by shortlisting in silico compounds that have the opposite effect on behaviour.

      You mention a “reduction in movement bouts” but note here that the number of behavioural parameters tested is key to our argument. To take the two extremes, say the only behavioural parameter we measured in psen2 knockout larvae was time active during the day, then, yes, any stimulant used at the right concentration could probably normalise the phenotype. In this situation, claiming that the stimulant is likely to also normalise the underlying mechanism, or even that it is a genuine “phenotypic rescue”, would not be convincing. Conversely, say we were measuring thousands of behavioural parameters under various stimuli, such as swimming speed, position in the well, bout usage, tail movements, and eye angles, it seems almost impossible for a compound to rescue most parameters without also normalising the underlying mechanism. The present approach is somewhere in between: ZOLTAR uses six behavioural parameters for prediction (e.g. Fig 6a), but all 17 parameters calculated by FramebyFrame can be used to assess rescue during a subsequent experiment (Fig. 6c). For both, splitting each parameter into day and night increases the resolution of the approach, which partly answers your criticism. For example, betamethasone rescued the day-time hypoactivity without causing night-time hyperactivity, so we are not making the “straw man argument” explained above of using any broad stimulant to rescue the hypoactivity phenotype.

      Furthermore, for diseases where the behavioural defect is the primary concern, such as autism or bipolar disorder, perhaps this behaviour-first approach is all that is needed, and whether or not the compound precisely rescues the underlying mechanism is somewhat secondary. The use of lithium to prevent manic episodes in bipolar disorder is a good example. It was initially tested because mania was thought to be caused by excess uric acid and lithium can dissolve uric acid (Mitchell and Hadzi-Pavlovic, 2000). The theory is now discredited, but lithium continues to be used without a precise understanding of its mode of action. In this example, behavioural rescue alone, assuming the secondary effects are tolerable, is sufficient to be beneficial to patients, and whether it modulates the correct causal pathway is secondary.

      On the second point, we agree that first testing ZOLTAR on a mutant for which we have a fairly good understanding of the mechanism causing the behavioural phenotype could have been a productive approach. Note, however, that examples already exist in the literature (Ashlin et al., 2018; Hoffman et al., 2016). The example from Hoffman et al. (2016) is especially convincing. Drugs generating behavioural fingerprints that positively correlate with the cntnap2a/cntnap2b double knockout fingerprint were enriched with NMDA and GABA receptor antagonists. In experiments analogous to our citalopram and fluvoxamine treatments (Fig. 5c,d and Fig. 5–supplement 1c,d), cntnap2a/cntnap2b knockout larvae were overly sensitive to the NMDA receptor antagonist MK-801 and the GABAA receptor antagonist pentylenetetrazol (PTZ). Among other drugs tested, zolpidem, a GABAA receptor agonist, caused opposite effects on wild-type and cntnap2a/cntnap2b knockout larvae. Knockout larvae were found to have fewer GABAergic neurons in the forebrain. While these studies did not use precisely the same analysis that ZOLTAR runs, they used the same rationale and behavioural dataset to make these predictions (Rihel et al., 2010), which shows that approaches like ZOLTAR can point to causal processes.

      On your last point, we hope our experiment testing fluvoxamine, another selective serotonin reuptake inhibitor (SSRI), makes the connection between Sorl1 and serotonin signalling more convincing.

      - The behavioral difference between the sorl1 KO and scrambled at the higher dose of the citalopram is based on a small number of animals. The KO Euclidean distance measure is also more spread out than for the other datasets, and it looks like only five or so fish are driving the group difference. It also appears as though the numbers were also from two injection series. While there is nothing obviously wrong with the data, I would feel more comfortable if such a strong statement of a result from a relatively subtle phenotype were backed up by a higher N or a stable line. It is not impossible that the observed difference is an experimental fluke. If something obvious had emerged through the HCR, that would have also supported the conclusions. As it stands, if no more experiments are done to bolster the claim, the confidence in the strength of the link to serotonin should be reduced (possibly putting the entire section in the supplement and modifying the discussion). The discussion section about serotonin and AD is interesting, but I think that it is excessive without additional evidence.

      We mostly agree with this criticism. One could interpret the larger spread of the data for sorl1 KO larvae treated with 10 µM citalopram as evidence that the knockout larvae do indeed react differently to the drug at this dose, regardless of being driven by a subset of the animals. The result indeed does not survive removing the top 3 (p = 0.18) or top 5 (p = 0.87) sorl1 KO + 10 µM larvae, but this amounts to excluding about 20% (3/14) or 35% (5/14) of the datapoints as potential outliers, which is unreasonable. In fact, excluding the top 5 sorl1 KO + 10 µM is equivalent to calling any datapoint with z-score > 0.2 an outlier (z-scores of the top 5 datapoints are 0.2–1.8). Applying the same criterion consistently to the scrambled + 10 µM group would remove the top 6 datapoints (z-scores = 0.5–3.9). Comparing the resulting two distributions again gives the sorl1 KO + 10 µM distribution as significantly higher (p = 0.0015). We would also mention that Euclidean distance, as a summary metric for distance between behavioural fingerprints, has limitations. For example, the measure will be more sensitive to changes in some parameters than in others, depending on how much room there is for a given parameter to change. We included this metric to lend support to the observation one can draw from the fingerprint plot (Fig. 5c) that sorl1 mutants respond in an exaggerated way to citalopram across many parameters, while being agnostic to which parameter might matter most.
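      To make the outlier argument concrete, the following is a minimal sketch of applying one z-score cutoff to two groups. It assumes that z-scores are computed within each group using the sample standard deviation, which the response does not specify; the function name `drop_above_z` and the example values are hypothetical and are not data from the study.

```python
from statistics import mean, stdev

def drop_above_z(values, z_cutoff):
    """Keep only the values whose within-group z-score is at most z_cutoff.

    Assumption: z = (value - group mean) / group sample SD.
    """
    mu = mean(values)
    sd = stdev(values)
    return [v for v in values if (v - mu) / sd <= z_cutoff]

# Hypothetical Euclidean distances, NOT the measurements from the study.
knockout = [1.0, 1.2, 1.1, 0.9, 1.3, 3.0, 3.5]
scrambled = [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 2.8]

z_cutoff = 0.2  # the cutoff quoted in the response
kept_knockout = drop_above_z(knockout, z_cutoff)
kept_scrambled = drop_above_z(scrambled, z_cutoff)

# The same rule is applied to both groups before any comparison is made.
print(len(knockout) - len(kept_knockout), "knockout datapoints removed")
print(len(scrambled) - len(kept_scrambled), "scrambled datapoints removed")
```

      The point of the sketch is only that a criterion strict enough to remove the top knockout datapoints must be applied to the control group as well; the statistical test used to compare the trimmed groups is not reproduced here.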

      Given that the HCR did not reveal anything striking, we agree with you that too much of our argument relied on this result being robust. As you and Reviewer #3 suggested, we repeated this experiment with a different SSRI, fluvoxamine (Fig. 5–supplement 1). We cannot readily explain why the result was opposite to what we found with citalopram, but in both cases sorl1 knockout larvae reacted differently than their control siblings, which adds an argument to our claim that ZOLTAR correctly predicted serotonin signalling as a disrupted pathway from the behavioural fingerprint. Accordingly, we mostly kept the Discussion on Sorl1 the same, although we concede that we may not have identified the molecular mechanism.

      - The authors suggest two hypotheses for the behavioral difference between the sorl1 KO and scrambled at the higher dose of the citalopram. While the first is tested, and found to not be supported, the second is not tested at all ("Ruling out the first hypothesis, sorl1 knockouts may react excessively to a given spike in serotonin." and "Second, sorl1 knockouts may be overly sensitive to serotonin itself because post-synaptic neurons have higher levels of serotonin receptors."). Assuming that the finding is robust, there are probably other reasons why the mutants could have a different sensitivity to this molecule. However, if this particular one is going to be mentioned, it is surprising that it was not tested alongside the first hypothesis. This work could proceed without a complete explanation, but additional discussion of the possibilities would be helpful or why the second hypothesis was not tested.

      There are no strong scientific reasons why this hypothesis was not tested. The lead author (F Kroll) moved to a different lab and country so the project was finalised at that time. We do not plan on testing this hypothesis at this stage. However, we adapted the wording to make it clear this is one possible alternative hypothesis which could be tested in the future. The small differences found by HCR are actually more in line with the new results from the fluvoxamine experiment, so it may also be that both hypotheses (pre-synaptic neurons releasing less serotonin when reuptake is blocked; or post-synaptic neurons being less sensitive) contribute. The fluvoxamine experiment was performed in a different lab (ICM, Paris; all other experiments were done in UCL, London) in a different wild-type strain (TL in ICM, AB x Tup LF in UCL), which complicates how one interprets this discrepancy.

      - The authors claim that "all four genes produced a fairly consistent phenotype at night". While it is interesting that this result arose in the different lines, the second clutch for some genes did not replicate as well as others. I think the findings are compelling, regardless, but the sometimes missing replicability should be discussed. I wonder if the F0 strategy adds noise to the results and if clean null lines would yield stronger phenotypes. Please discuss this possibility, or others, in regard to the variability in some phenotypes.

      For the first part of this point, please see below our answer to Reviewer #3, point (2) c.

      Regarding the F0 strategy potentially adding variability, it is an interesting question which we tested in a larger dataset of behavioural recordings from F0 and stable knockouts for the same genes (unpublished). In summary, the F0 knockout method does not increase clutch-to-clutch or larva-to-larva variability in the assay. F0 knockout experiments found many more significant parameters and larger effect sizes than stable knockout experiments, but this difference could largely be explained by the larger sample sizes of F0 knockout experiments. In fact, larger sample sizes within individual clutches appear to be a major advantage of the F0 knockout approach over in-cross of heterozygous knockout animals as they increase sensitivity of the assay without causing substantial variability. We plan to report in more detail on this analysis in a separate paper as we think it would dilute the focus of the present work.

      - In this work, the knockout of appa/appb is included. While APP is a well-known risk gene, there is no clear justification for making a knockout model. It is well known that the upregulation of app is the driver of Alzheimer's, not downregulation. The authors even indicate an expectation that it could be similar to the other knockouts ("Moreover, the behavioural phenotypes of appa/appb and psen1 knockout larvae had little overlap while they presumably both resulted in the loss of Aβ." and "Comparing with early-onset genes, psen1 knockouts had similar night-time phenotypes, but loss of psen2 or appa/appb had no effect on night-time sleep."). There is no reason to expect similarity between appa/appb and psen1/2. I understand that the app knockouts could unveil interesting early neurodevelopmental roles, but the manuscript needs to be clarified that any findings could be the opposite of expectation in AD.

      On “there is no reason to expect similarity […]”, we disagree. Knockout of appa/appb and knockout of psen1 will both result in loss of Aβ (appa/appb encode Aβ and psen1 cleaves Appa/Appb to release Aβ, cf. Fig. 3e). Consequently, a phenotype caused by the loss of Aβ, or possibly other Appa/Appb cleavage products, should logically be found in both appa/appb and psen1 knockouts.

      On “it is well known that the upregulation of APP is the driver of Alzheimer’s, not downregulation”; we of course agree. Among others, the examples of Down syndrome, APP duplication (Sleegers et al., 2006), or mouse models overexpressing human APP show definitely that overexpression of APP is sufficient to cause AD. Having said that, we would not be so quick in dismissing APP knockout as potentially relevant to understanding of AD.

      Loss of soluble Aβ due to aggregation could contribute to pathology (Espay et al., 2023). Without getting too much into this intricate debate, links between levels of Aβ and risk of disease are often counter-intuitive too. For example, out of 138 PSEN1 mutations screened in vitro, 104 reduced total Aβ production and 11 even seemingly abolished the production of both Aβ40 and Aβ42 (Sun et al., 2017). In short, loss of soluble Aβ occurs in both AD and in our appa/appb knockout larvae.

      We added a sentence in Results (section psen2 knockouts […]) to briefly justify our appa/appb knockout approach. To be clear, we do not want to imply, for example, that the absence of a night-time sleep phenotype for appa/appb is contradictory to the body of literature showing links between Aβ and sleep, including in zebrafish (Özcan et al., 2020). As you say, our experiment tested loss of App, including Aβ, while the literature typically reports on overexpression of APP, as in APP/PSEN1-overexpressing mice (Jagirdar et al., 2021).

      Reviewer #3 (Public Review):

      In this manuscript by Kroll and colleagues, the authors describe combining behavioral pharmacology with sleep profiling to predict disease and potential treatment pathways at play in AD. AD is used here as a case study, but the approaches detailed can be used for other genetic screens related to normal or pathological states for which sleep/arousal is relevant. The data are for the most part convincing, although generally the phenotypes are relatively small and there are no major new mechanistic insights. Nonetheless, the approaches are certainly of broad interest and the data are comprehensive and detailed. A notable weakness is the introduction, which overly generalizes numerous concepts and fails to provide the necessary background to set the stage for the data.

      Major points

      (1) The authors should spend more time explaining what they see as the meaning of the large number of behavioral parameters assayed and specifically what they tell readers about the biology of the animal. Many are hard to understand--e.g. a "slope" parameter.

      We agree that some parameters do not tell something intuitive about the biology of the animal. It would be easy to speculate. For example, the “activity slope” parameter may indicate how quickly the animal becomes tired over the course of the day. On the other hand, fractal dimension describes the “roughness/smoothness” of the larva’s activity trace (Fig. 2–supplement 1a); but it is not obvious how to translate this into information about the physiology of the animal. We do not see this as an issue though. While some parameters do provide intuitive information about the animal’s behaviour (e.g. sleep duration or sunset startle as a measure of startle response), the benefit of having a large number of behavioural parameters is to compare behavioural fingerprints and assess rescue of the behavioural phenotype by small molecules (Fig. 6c). For this purpose, the more parameters the better. The “MoSeq” approach from Wiltschko et al., 2020 is a good example from literature that inspired our own Fig. 6c. While some of the “behavioural syllables” may be intuitive (e.g. running or grooming), it is probably pointless to try to explain the ‘meaning’ of the “small left turn in place with head motion” syllable (Wiltschko et al., 2020). Nonetheless, this syllable was useful to assess whether a drug specifically treats the behavioural phenotype under study without causing too many side effects. Unfortunately, ZOLTAR has to reduce the FramebyFrame fingerprint (17 parameters) to just six parameters to compare it to the behavioural dataset from Rihel et al., 2010, but here, more parameters would almost certainly translate into better predictions too, regardless of their intuitiveness.

      It is true however that we did not give much information on how some of the less intuitive parameters, such as activity slope or fractal dimension, are calculated or what they describe about the dataset (e.g. roughness/smoothness for fractal dimension). We added a few sentences in the legend of Fig. 2–supplement 1.

      (2) Because in the end the authors did not screen that many lines, it would increase confidence in the phenotypes to provide more validation of KO specificity. Some suggestions include:

      a. The authors cite a psen1 and psen2 germline mutant lines. Can these be tested in the FramebyFrame R analysis? Do they phenocopy F0 KO larvae?

      We unfortunately do not have those lines. We investigated the possibility of importing a psen2 knockout line from abroad, but the process of shipping live animals is becoming more and more cost- and time-prohibitive. However, we observed the same pigmentation phenotype for psen2 knockouts as reported by Jiang et al., 2018, which is at least a partial confirmation of phenocopying a stable loss-of-function mutant.

      b. psen2_KO is one of the larger centerpieces of the paper. The authors should present more compelling evidence that animals are truly functionally null. Without this, how do we interpret their phenotypes?

      We disagree that there should be significant doubt about these mutants being truly functionally null, given the high mutation rate and presence of the expected pigmentation phenotype (Jiang et al., 2018, Fig. 3f and Fig. 3–supplement 3a). The psen2 F0 knockouts were virtually 100% mutated at three exons across the gene (mutation rates were locus 1: 100 ± 0%; locus 2: 99.99 ± 0.06%; locus 3: 99.85 ± 0.24%). Additionally, two of the three mutated exons had particularly high rates of frameshift mutations (locus 1: 97 ± 5%; locus 2: 88 ± 17% frameshift mutation rate). It is virtually impossible that a functional protein is translated given this burden of frameshift mutations. Phenotypically, in addition to the pigmentation defect, double psen1/psen2 F0 knockout larvae had curved tails, the same phenotype as caused by a high dose of the γ-secretase inhibitor DAPT (Yang et al., 2008). These double F0 knockouts were lethal, while knockout of psen1 or psen2 alone did not cause obvious morphological defects. Evidently, most larvae must have been psen2 null mutants in this experiment, otherwise functional Psen2 would have prevented early lethality.

      Translation of zebrafish psen2 can start at downstream start codons if the first exon has a frameshift mutation, generating a seemingly functional Psen2 missing the N-terminus (Jiang et al., 2020). Zebrafish homozygous for this early frameshift mutation had normal pigmentation, showing it is a reliable marker of Psen2 function even when it is mutated. This mechanism is not a concern here as the alternative start codons are still upstream of two of the three mutated exons (the alternative start codons discovered by Jiang et al., 2020 are in exon 2 and 3, but we targeted exon 3, exon 4, and exon 6).

      We understand that the zebrafish community may be cautious about F0 phenotyping compared to stably generated mutants. As mentioned to Reviewer #2, we are planning to assemble a paper that expressly compares behavioural phenotypes measured in F0 vs. stable mutants to allay some of these concerns. Our current manuscript, which combines CRISPR-Cas9 rapid F0 screening with in silico pharmacological predictions, inevitably represents a first step in characterizing the functions of these genes.

      c. Related to the above, for cd2AP and sorl1 KO, some of the effect sizes seem to be driven by one clutch and not the other. In other words, great clutch-to-clutch variability. Should the authors increase the number of clutches assayed?

      Correct, there is substantial clutch-to-clutch variability in this behavioural assay. This is not specific to our experiments. Even within the same strain, wild-type larvae from different clutches (i.e. non-siblings) behave differently (Joo et al., 2021). This is why it is essential to compare behavioural phenotypes within individual clutches (i.e. from a single pair of parents, one male and one female), as we explain in Methods (section Behavioural video-tracking) and in the documentation of the FramebyFrame package. We often see two different experimental designs in literature: comparing non-sibling wild-type and mutant larvae, or pooling different clutches which include all genotypes (e.g. pooling multiple clutches from heterozygous in-crosses or pooling wild-type clutches before injecting them). The first experimental design causes false positive findings (Joo et al., 2021), as the clutch-to-clutch variability we and others observe gets interpreted as a behavioural phenotype. The second experimental design should not cause false positives but likely decreases the sensitivity of the assay by increasing the spread within genotypes. In both cases, the clutch-to-clutch variability is hidden, either by interpreting it as a phenotype (first case) or by adding it to animal-to-animal variability (second case). Our experimental design is technically more challenging as it requires obtaining large clutches from unique pairs of parents. However, this approach is better as it clearly separates the different sources of variability (clutch-to-clutch or animal-to-animal). As for every experiment, yes, a larger number of replicates would be better, but we do not plan to assay additional clutches at this time. Our work heavily focuses on the sorl1 and psen2 knockout behavioural phenotypes. The key aspects of these phenotypes were effectively tested in four experiments (five to six clutches) as sorl1 knockout larvae were also tracked in the citalopram and fluvoxamine experiments (Fig. 5 and Fig. 5–supplement 1), and psen2 knockout larvae were also tracked in the small molecule rescue experiment (Fig. 6 and Fig. 6–supplement 1).

      The psen2 behavioural phenotype replicated well across the six clutches tested (pairwise cosine similarities: 0.62 ± 0.15; Author response image 2a). 5/6 clutches were less active and initiating more sleep bouts during the day, as we claimed in Fig. 3.
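      For readers unfamiliar with the replication metric quoted here, the following is a minimal sketch of pairwise cosine similarity between behavioural fingerprints, treating each fingerprint as a vector of per-parameter z-scores. The clutch names and numbers are hypothetical placeholders, not values from the study, and the sketch is not taken from the FramebyFrame package.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (ranges from -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_similarities(fingerprints):
    """Cosine similarity for every pair of named fingerprints."""
    return {
        (name_a, name_b): cosine_similarity(fingerprints[name_a], fingerprints[name_b])
        for name_a, name_b in combinations(sorted(fingerprints), 2)
    }

# Hypothetical fingerprints: one z-score per behavioural parameter.
fingerprints = {
    "clutch_1": [-0.8, 0.5, 0.3, -0.2],
    "clutch_2": [-0.6, 0.7, 0.1, -0.4],
    "clutch_3": [0.2, -0.1, 0.4, 0.3],
}

for pair, similarity in pairwise_similarities(fingerprints).items():
    print(pair, round(similarity, 2))
```

      Because cosine similarity depends only on the direction of the vectors, two clutches whose parameters shift the same way score close to 1 even if one clutch shows larger effect sizes than the other.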

      In the citalopram experiment, the H<sub>2</sub>O-treated sorl1 knockout fingerprint replicated fairly well the baseline recordings in Fig. 4, despite the smaller sample size (cos = 0.30 and 0.78; Author response image 2b, see “KO Fig. 5”). 5/6 of the significant parameters presented in Fig. 4–supplement 4 moved in the same direction, and knockout larvae were also hypoactive during the day but hyperactive at night. Note that two clutches were tracked on the same 96-well plate in this experiment. We calculated each larva’s z-score using the average of its control siblings, then we averaged all the z-scores to generate the fingerprint. The H<sub>2</sub>O treated sorl1 knockout clutch from the fluvoxamine experiment did not replicate well the baseline recordings (cos = 0.08 and 0.11; Author response image 2b, see “KO Fig. 5–suppl. 1”). Knockout larvae were hypoactive during the day as expected, but behaviour at night was not as robustly affected. As mentioned above, knockouts were made in a different genetic background (TL, instead of AB x Tup LF used for all other experiments), which could explain the discrepancy.
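      The within-clutch normalisation described above can be sketched as follows for a single behavioural parameter. The sketch assumes that each knockout larva's z-score is its deviation from the mean of its control siblings divided by the sample standard deviation of those siblings; the response only states that the average of the control siblings is used, so the denominator is an assumption. Function names and numbers are hypothetical.

```python
from statistics import mean, stdev

def larva_z_scores(knockout_values, control_values):
    """z-score of each knockout larva relative to its own control siblings.

    Assumption: the control siblings' sample SD is the scaling factor.
    """
    mu = mean(control_values)
    sd = stdev(control_values)
    return [(v - mu) / sd for v in knockout_values]

def fingerprint_value(clutches):
    """Average z-score for one parameter across larvae from all clutches.

    `clutches` is a list of (knockout_values, control_values) pairs, one pair
    per clutch, so that each larva is only ever compared to its own siblings.
    """
    z_scores = []
    for knockout_values, control_values in clutches:
        z_scores.extend(larva_z_scores(knockout_values, control_values))
    return mean(z_scores)

# Hypothetical values for one parameter in two clutches on the same plate.
clutch_a = ([3.0, 5.0], [1.0, 2.0, 3.0])  # (knockouts, control siblings)
clutch_b = ([0.0], [1.0, 2.0, 3.0])

print(fingerprint_value([clutch_a, clutch_b]))
```

      Normalising before pooling keeps clutch-to-clutch differences in baseline behaviour from being mixed into the knockout effect.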

      We also took the opportunity to check whether our SSRI treatments replicated well the data from Rihel et al., 2010. For both citalopram (n = 3 fingerprints in the database) and fluvoxamine (n = 4 fingerprints in the database), replication was excellent (cos ≥ 0.67 for all comparisons of a fingerprint from this study vs. a fingerprint from Rihel et al. 2010; Author response image 2c,d). Note that the scrambled + 10 µM citalopram and + 10 µM fluvoxamine fingerprints correlate extremely well (cos = 0.92; can be seen in Author response image 2c,d), which was predicted by the small molecule screen dataset.

      Author response image 2.

      Replication of psen2 and sorl1 F0 knockout fingerprints and SSRI treatments from Rihel et al., 2010. a, (left) Every psen2 F0 knockout behavioural fingerprint generated in this study. Each dot represents the mean deviation from the same-clutch scrambled-injected mean for that parameter (z-score, mean ± SEM). From the experiments in Fig. 6, presented is the psen2 F0 knockout + H<sub>2</sub>O fingerprints. The fingerprints in grey (“not shown”) are from a preliminary drug treatment experiment we did not include in the final study. These fingerprints are from psen2 F0 knockout larvae treated with 0.2% DMSO, normalised to scrambled-injected siblings also treated with 0.2% DMSO. (right) Pairwise cosine similarities (−1.0–1.0) for the fingerprints presented. b, Every sorl1 F0 knockout behavioural fingerprint, as in a). c, The scrambled-injected + citalopram (10 µM) fingerprints (grey) in comparison to the citalopram (10–15 µM) fingerprints from the Rihel et al., 2010 database (green). d, The scrambled-injected + fluvoxamine (10 µM) fingerprint (grey) in comparison to the fluvoxamine fingerprints from the Rihel et al., 2010 database (pink). In c) and d), the scrambled-injected fingerprints are from the experiments in Fig. 5 and Fig. 5–suppl. 1, but were converted here into the behavioural parameters used by Rihel et al., 2010 for comparison. Parameters: 1, average activity (sec active/min); 2, average waking activity (sec active/min, excluding inactive minutes); 3, total sleep (hr); 4, number of sleep bouts; 5, sleep bout length (min); 6, sleep latency (min until first sleep bout).

      (3) The authors make the point that most of the AD risk genes are expressed in fish during development. Is there public data to comment on whether the genes of interest are expressed in mature/old fish as well? Just because the genes are expressed early does not at all mean that early-life dysfunction is related to future AD (though this could be the case, of course). Genes with exclusive developmental expression would be strong candidates for such an early-life role, however. I presume the case is made because sleep studies are mainly done in juvenile fish, but I think it is really a pretty minor point and such a strong claim does not even need to be made.

      This is a fair criticism but we do not make this claim (“early-life dysfunction is related to future AD”) from expression alone. The reviewer is probably referring to the following quote:

      “[…] most of these were expressed in the brain of 5–6-dpf zebrafish larvae, suggesting they play a role in early brain development or function,” which does not mention future risk of AD. We do suggest that these genes have a function in development. After all, every gene that plays a role in brain development must be expressed during development, so this wording seemed reasonable. Nevertheless, we adapted the wording to address this point and Reviewer #2’s complaint below. As noted, the primary goal was to check that the genes we selected were indeed expressed in zebrafish larvae before performing knockout experiments. Our discussion does raise the hypothesis that mutations in Alzheimer’s risk genes impact brain development and sleep early in life, but this argument primarily relies on our observation that knockout of late-onset Alzheimer’s risk genes causes sleep phenotypes in 7-day old zebrafish larvae and from previous work showing brain structural differences in children at high genetic risk of AD (Dean et al., 2014; Quiroz et al., 2015), not solely on gene expression early in life.

      Please also see our answer to a similar point raised by Reviewer #2 below (cf. Author response image 7).

      (4) A common quandary with defining sleep behaviorally is how to rectify sleep and activity changes that influence one another. With psen2 KOs, the authors describe reduced activity and increased sleep during the day. But how do we know if the reduced activity drives increased behavioral quiescence that is incorrectly defined as sleep? In instances where sleep is increased but activity during periods during wake are normal or elevated, this is not an issue. But here, the animals might very well be unhealthy, and less active, so naturally they stop moving more for prolonged periods, but the main conclusion is not sleep per se. This is an area where more experiments should be added if the authors do not wish to change/temper the conclusions they draw. Are psen2 KOs responsive to startling stimuli like controls when awake? Do they respond normally when quiescent? Great care must be taken in all models using inactivity as a proxy for sleep, and it can harm the field when there is no acknowledgment that overall health/activity changes could be a confound. Particularly worrisome is the betamethasone data in Figure 6, where activity and sleep are once again coordinately modified by the drug.

      This is a fair criticism. We agree it is a concern, especially in the case of psen2 as we claim that day-time sleep is increased while zebrafish are diurnal. We do not rely heavily on the day-time inactivity being sleep (the ZOLTAR predictions or the small molecule rescue do not change whether the parameter is called sleep or inactivity), but our choice of labelling can fairly be challenged.

      To address “are psen2 KO responsive to startling stimuli like controls when awake/when quiescent”, we looked at the larvae’s behaviour immediately after lights abruptly switched on in the mornings. Almost every larva, regardless of genotype, responded strongly to every lights-off transition during the experiment. Instead, we chose the lights-on transition for this analysis because it is a weaker startling stimulus for the larvae than the lights-off transition (Fig. 3–supplement 3), potentially exposing differences between genotypes or behavioural states (quiescent or awake). We defined a larva as having reacted to the lights switching on if it made a swimming bout during the second (25 frames) after the lights-on transition. Across two clutches and two lights-on transitions, an average of 65% (range 52–73%) of all larvae reacted to the stimulus. psen2 knockout larvae were similarly likely, if not more likely, to respond (on average 69% responded, range 60–76%) than controls (60% average, range 44–75%). When the lights switched on, about half of the larvae (39–51%) would have been classified as asleep according to the one-minute inactivity definition (i.e. the larva did not move in the minute preceding the lights transition). This allowed us to also compare behavioural states, as suggested by the reviewer. For three of the four light transitions, larvae which were awake when lights switched on were more likely to react than asleep larvae, but this difference was not striking (overall, awake larvae were only 1.1× more likely to react; Author response image 3). Awake psen2 knockout larvae were 1.1× (range 1.04–1.11×) more likely to react than awake control larvae, so, yes, psen2 knockout larvae respond normally when awake. Asleep psen2 knockout larvae were 1.4× (range 0.63–2.19×) more likely to react than asleep control larvae, so psen2 knockouts are also equally or more likely to react than control larvae when asleep. In summary, the overall health of psen2 knockouts did not seem to be a significant confound in the experiment. As the reviewer suggested, if psen2 knockout larvae were seriously unhealthy, they would not be as responsive as control larvae to a startling stimulus.

      Author response image 3.

      psen2 F0 knockouts react normally to lights switching on, indicating they are largely healthy. At each lights-on transition (9 AM), each larva was categorised as awake if it had moved in the preceding one minute or asleep if it had been inactive for at least one minute. Darker tiles represent larvae which performed a swimming bout during the second following lights-on; lighter tiles represent larvae which did not move during that second. The total count of each waffle plot was normalised to 25 so plots can be compared to each other. The real count is indicated in the corner of each plot. Data is from the baseline psen2 knockout trackings presented in Fig. 3 and Fig. 3–suppl. 2.
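As an illustration only, the classification described in this legend could be sketched in Python as follows. This is not the FramebyFrame implementation: the input format (one per-frame activity trace per larva, with 0 meaning no movement) and the function names `classify_larva` and `reaction_rates` are assumptions made for the example. Only the 25 frames-per-second rate, the one-minute inactivity window, and the one-second reaction window come from the text.

```python
FPS = 25  # frames per second, as stated in the text


def classify_larva(delta_px, lights_on_frame, fps=FPS):
    """Return (state, reacted) for one larva.

    delta_px: per-frame activity trace, 0 = no movement (hypothetical format).
    state: "asleep" if the larva did not move in the minute before lights-on,
           otherwise "awake".
    reacted: True if the larva moved during the second after lights-on.
    """
    # If the trace starts less than a minute before lights-on,
    # the window is simply shorter.
    before = delta_px[max(0, lights_on_frame - 60 * fps):lights_on_frame]
    after = delta_px[lights_on_frame:lights_on_frame + fps]
    state = "awake" if any(v > 0 for v in before) else "asleep"
    reacted = any(v > 0 for v in after)
    return state, reacted


def reaction_rates(larvae, lights_on_frame):
    """Fraction of larvae that reacted, split by behavioural state."""
    counts = {"awake": [0, 0], "asleep": [0, 0]}  # [reacted, total]
    for trace in larvae:
        state, reacted = classify_larva(trace, lights_on_frame)
        counts[state][0] += int(reacted)
        counts[state][1] += 1
    # None marks a state with no larvae, to avoid dividing by zero.
    return {s: (r / n if n else None) for s, (r, n) in counts.items()}
```

Dividing the knockout rate by the control rate within each state would then give ratios of the kind quoted in the response (for example 1.1× for awake larvae).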

      Next, we compared inactive period durations during the day between psen2 and control larvae. If psen2 knockout larvae indeed sleep more during the day compared to controls, we may predict inactive periods longer than one minute to increase disproportionately compared to the increase in shorter inactive periods. This broadly appeared to be the case, especially for one of the two clutches (Author response image 4). In clutch 1, inactive periods lasting 1–60 sec were equally frequent in both psen2 and control larvae (fold change 1.0× during both days), while inactive periods lasting 1–2 min were 1.5× (day 1) and 2.5× (day 2) more frequent in psen2 larvae compared to control larvae. In clutch 2, 1–60 sec inactive periods were also equally frequent in both psen2 and control larvae, while inactive periods lasting 1–2 min were 3.4× (day 1) and 1.5× (day 2) more frequent in psen2 larvae compared to control larvae. Therefore, psen2 knockouts disproportionately increased the frequency of inactive periods longer than one minute, suggesting they genuinely slept more during the day.

      Author response image 4.

      psen2 F0 knockouts preferentially increased the frequency of longer inactive bouts. For each day and clutch, we calculated the mean distribution of inactive bout lengths across larvae of the same genotype (psen2 F0 knockout or scrambled-injected), then compared the frequency of inactive bouts of different lengths between the two genotypes. For example, in clutch 1 during day 2, 0.01% of the average scrambled-injected larva’s inactive bouts lasted 111–120 seconds (X axis 120 sec) while 0.05% of the average psen2 F0 knockout larva’s inactive bouts lasted this long, so the fold change was 5×. Inactive bouts lasting < 1 sec were excluded from the analysis. In clutch 2, day 1 plot, two datapoints fall outside the Y axis limit: 140 sec, Y = 32×; 170 sec, Y = 16×. Data is from the baseline psen2 knockout trackings presented in Fig. 3 and Fig. 3–suppl. 2.
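The fold-change calculation in this legend can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the 10-second bin width is inferred from the "111–120 seconds (X axis 120 sec)" example, each larva is represented as a plain list of inactive bout lengths in seconds, and all function names are hypothetical.

```python
import math
from collections import Counter


def bout_distribution(bout_lengths_sec, bin_sec=10):
    """Fraction of one larva's inactive bouts per bin, keyed by bin upper edge."""
    kept = [b for b in bout_lengths_sec if b >= 1]  # bouts < 1 sec are excluded
    counts = Counter(math.ceil(b / bin_sec) * bin_sec for b in kept)
    return {edge: n / len(kept) for edge, n in counts.items()}


def mean_distribution(larvae):
    """Average the per-larva distributions across larvae of one genotype."""
    dists = [bout_distribution(bouts) for bouts in larvae]
    edges = set().union(*dists)
    return {e: sum(d.get(e, 0.0) for d in dists) / len(dists) for e in edges}


def fold_change(knockout_larvae, control_larvae):
    """Knockout / control frequency for each bin present in both genotypes."""
    ko = mean_distribution(knockout_larvae)
    ctrl = mean_distribution(control_larvae)
    return {e: ko[e] / ctrl[e] for e in sorted(ko) if ctrl.get(e, 0) > 0}
```

With the legend's example values, a bin holding 0.05% of knockout bouts and 0.01% of control bouts gives 0.05 / 0.01 = 5×.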

      Ultimately, this criticism seems challenging to definitively address experimentally. A possible approach could be to use a closed-loop system which, after one minute of inactivity, triggers a stimulus that is sufficient to startle an awake larva but not an asleep larva. If psen2 knockout larvae indeed sleep more during the day, the stimulus should usually not be sufficient to startle them. Nevertheless, we believe the two analyses presented here are consistent with psen2 knockout larvae genuinely sleeping more during the day, so we decided to keep this label. We agree with the reviewer that the one-minute inactivity definition has limitations, especially for day-time inactivity.

      (5) The conclusions for the serotonin section are overstated. Behavioural pharmacology purports to predict a signaling pathway disrupted with sorl1 KO. But is it not just possible that the drug acts in parallel to the true disrupted pathway in these fish? There is no direct evidence for serotonin dysfunction - that conclusion is based on response to the drug. Moreover, it is just one drug - is the same phenotype present with another SSRI? Likewise, language should be toned down in the discussion, as this hypothesis is not "confirmed" by the results (consider "supported"). The lack of measured serotonin differences further raises concern that this is not the true pathway. This is another major point that deserves further experimental evidence, because without it, the entire approach (behavioral pharm screen) seems more shaky as a way to identify mechanisms. There are any number of testable hypotheses to pursue such as a) Using transient transgenesis to visualize 5HT neuron morphology (is development perturbed: cell number, neurite morphology, synapse formation); b) Using transgenic Ca reporters to assay 5HT neuron activity.

      Regarding the comment, “is it not just possible that the drug acts in parallel to the true disrupted pathway”, we think no, assuming we understand correctly the question. Key to our argument is the fact that sorl1 knockout larvae react differently to the drug(s) than control larvae. As an example, take night-time sleep bout length, which was not affected by knockout of sorl1 (Fig. 4–supplement 4). For the sake of the argument, say only dopamine signalling (the “true disrupted pathway”) was affected in sorl1 knockouts and that serotonin signalling was intact. Assuming that citalopram specifically alters serotonin signalling, then treatment should cause the same increase in sleep bout length in both knockouts and controls as serotonin signalling is intact in both. This is not what we see, however. Citalopram caused a greater increase in sleep bout length in sorl1 knockouts than in scrambled-injected larvae. In other words, the effect is non-additive, in the sense that citalopram did not add the same number of z-scores to sorl1 knockouts or controls. We think this shows that serotonin signalling is somehow different in sorl1 knockouts. Nonetheless, we concede that the experiment does not necessarily say much about the importance of the serotonin disruption caused by loss of Sorl1. It could be, for example, that the most salient consequence of loss of Sorl1 is cholinergic disruption (see reply to Reviewer #1 above) and that serotonin signalling is a minor theme.

      Furthermore, we agree with the reviewer and Reviewer #2 that the conclusions were overly confident. As suggested, we decided to repeat this experiment with another SSRI, fluvoxamine. Please find the results of this experiment in Fig. 5–supplement 1. The suggestions to further test the serotonin system in the sorl1 knockouts are excellent as well, however we do not plan to pursue them at this stage.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Major Comments:

      - Data are presented in a variety of different ways, occasionally making comparisons across figures difficult. Perhaps at a minimum, behavioral fingerprints as in Figure 3 - Supplementary Figure 1 should be presented for all mutants in the main figures.

      We like this suggestion! Thank you. We brought the behavioural fingerprints figure (previously Fig. 4–supplement 5) as main Fig. 4, and put the figure focused on the sorl1 knockout behavioural phenotype in supplementary, with the other gene-by-gene figures.

      - It is not clear why some data were selected for supplemental rather than main figures. In many cases, detailed phenotypic data is provided for one example mutant in the main figures, and then additional mutants are described in detail in the supplement. Again, to facilitate comparisons between mutants, fingerprints could be provided for all mutants in a main figure, with detailed analyses moved to the supplements.

      The logic was to dedicate one main figure to psen2 (Fig. 3) as an example of an early-onset Alzheimer’s risk gene, and one to sorl1 (previously Fig. 4) as an example of a late-onset Alzheimer’s risk gene. We focused on them in main figures as they are both tested again later (Fig. 5 and Fig. 6). Having said that, we agree that the fingerprints may be a better use of main figure space than the parameters plots. In addition to the above (fingerprints of late-onset Alzheimer’s risk genes in main figure), we rearranged the figures in the early-onset AD section to have the psen2 F0 knockout fingerprint in main.

      - The explication of the utility of behavioral fingerprinting on page 35 is somewhat confusing. The authors describe drugs used to treat depression as enriched among small molecules anti-correlating with the sorl1 fingerprint. However, in Figure 5 - Supplementary Figure 1, drugs used to treat depression are biased toward positive cosines, which are indicated as having a more similar fingerprint to sorl1. These drugs should be described as more present among compounds positively correlating with the sorl1 fingerprint.

      Sorry, the confusion is about “(anti-)correlating”. Precisely, we meant “correlating and/or anti-correlating”, not just anti-correlating. We changed to that wording. In short, the analysis is by design agnostic to whether compounds with a given annotation are found more on the positive cosines side (left side in Fig. 5–supplement 1a) or the negative cosines side (right side). This is because the dataset often includes both agonists and antagonists to a given pathway but these are difficult to annotate. For example, say 10 compounds in the dataset target the dopamine D4 receptor, but these are an unknown mix of agonists and antagonists. In this case, we want ZOLTAR to generate a low p-value when all 10 compounds are found at extreme ends of the list, regardless of which end(s) that is (e.g. top 8 and bottom 2 should give an extremely low p-value). Initially, we were splitting the list, for each annotation, into positive-cosine fingerprints and negative-cosine fingerprints and testing enrichment on both separately, but we think the current approach is better as it reflects better the cases we want to detect and considers all available examples for a given annotation in one test. In sum, yes, in this case drugs used to treat depression were mostly in the positive-cosine side, but the other drugs on the negative-cosine side also contributed to what the p-value is, so it reflects better the analysis to say “correlating and/or anti-correlating”. You can read more about our logic for the analysis in Methods (section Behavioural pharmacology from sorl1 F0 knockout’s fingerprint).

      - The authors conclude the above-described section by stating: "sorl1 knockout larvae behaved similarly to larvae treated with small molecules targeting serotonin signaling, suggesting that the loss of Sorl1 disrupted serotonin signaling." Directionality here may be important. Are all of the drugs targeting the serotonin transporter SSRIs or similar? If so, then a correct statement would be that loss of Sorl1 causes similar phenotypes to drugs enhancing serotonin signaling. Finally, based on the correlation between serotonin transporter inhibitor trazodone and the sorl1 crispant phenotype, it is potentially surprising that the SSRI citalopram caused the opposite phenotype from sorl1, that is, increased sleep during the day and night. It is potentially interesting that this result was enhanced in mutants, and suggests dysfunction of serotonin signaling, but the statement that "our behavioral pharmacology approach correctly predicted from behaviour alone that serotonin signaling was disrupted" is too strong a conclusion.

      We understand “disrupt” as potentially going either way, but this may not be the common usage. We changed to “altered”.

      The point regarding directionality is excellent, however. We tested the proportion of serotonin transporter agonists and antagonists (SSRIs) on each side of the ranked list of small molecule fingerprints. We used the STITCH database for this analysis as it has more drug–target interactions than the Therapeutic Target Database, although it is likely less curated (Szklarczyk et al., 2016). As with the Therapeutic Target Database, most fingerprints of compounds interacting with the serotonin transporter SLC6A4 were found on the side of positive cosines (p ~ 0.005 using the custom permutation test), which replicates Fig. 5a with a different source for the drug–target annotations (Author response image 5). On the side of positive cosines (small molecules which generate behavioural fingerprints correlating with the sorl1 fingerprint), there were 2 agonists and 26 antagonists. On the side of negative cosines (small molecules which generate behavioural fingerprints anti-correlating with the sorl1 fingerprint), there were 3 agonists and 2 antagonists. Using a Chi-squared test, this suggests a significant (p = 0.002) over-representation of antagonists (SSRIs) on the positive side (expected count = 24, vs. 26 observed) and agonists on the negative side (expected count = 1, vs. 3 observed). If SLC6A4 antagonists, i.e. SSRIs, indeed tend to cause a similar behavioural phenotype to knockout of sorl1, this would point in the direction of our original interpretation of the citalopram experiment, which was that excessive serotonin signalling is what causes the sorl1 behavioural phenotype.
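For readers who want to check the arithmetic, the 2×2 comparison above can be reproduced with a short standard-library sketch. Two assumptions are made here that the response does not state: that the test was a Pearson chi-squared test with one degree of freedom, and that no continuity correction was applied. The uncorrected form gives a p-value close to the reported 0.002 and expected counts that round to the quoted 24 and 1. The function name `chi2_2x2` is hypothetical.

```python
import math


def chi2_2x2(table):
    """Pearson chi-squared test on a 2x2 table, without continuity correction.

    Returns (statistic, p_value, expected_counts).
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    expected = [[row_tot[i] * col_tot[j] / n for j in range(2)] for i in range(2)]
    stat = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected


# Rows: positive-cosine side, negative-cosine side.
# Columns: agonists, antagonists (counts taken from the text above).
observed = [[2, 26], [3, 2]]
stat, p_value, expected = chi2_2x2(observed)
```

Because some expected counts are below 5, an exact test such as Fisher's could be considered as a cross-check; that is a general statistical remark, not something the response reports.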

      Author response image 5.

      Using the STITCH database as source of annotations also predicts SLC6A4 as an enriched target for the sorl1 behavioural fingerprint. Same figures as Fig. 5a,b but using the STITCH database (Szklarczyk et al., 2016) as source for the drug targets. a, Compounds annotated by STITCH as interacting with the serotonin transporter SLC6A4 tend to generate behavioural phenotypes similar to the sorl1 F0 knockout fingerprint. 40,522 compound–target protein pairs (vertical bars; 1,592 unique compounds) are ranked from the fingerprint with the most positive cosine to the fingerprint with the most negative cosine in comparison with the mean sorl1 F0 knockout fingerprint. Fingerprints of drugs that interact with SLC6A4 are coloured in yellow. Simulated p-value = 0.005 for enrichment of drugs interacting with SLC6A4 at the top (positive cosine) and/or bottom (negative cosine) of the ranked list by a custom permutation test. b, Result of the permutation test for top and/or bottom enrichment of drugs interacting with SLC6A4 in the ranked list. The absolute cosines of the fingerprints of drugs interacting with SLC6A4 (n = 52, one fingerprint per compound) were summed, giving sum of cosines = 15.9. To simulate a null distribution, 52 fingerprints were randomly drawn 100,000 times, generating a distribution of 100,000 random sum of cosines. Here, only 499 random draws gave a larger sum of cosines, so the simulated p-value was p = 499/100,000 = 0.005 **.
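The permutation test described in panel b can be sketched as follows. This is an illustration under assumptions, not the ZOLTAR source: the legend does not say whether the random fingerprints are drawn with or without replacement, or whether annotated fingerprints are excluded from the pool, so this version draws without replacement from all fingerprints. The function name and argument layout are hypothetical.

```python
import random


def permutation_p_value(cosines, is_annotated, n_draws=100_000, seed=0):
    """Simulated p-value for enrichment at the top and/or bottom of a ranked list.

    cosines: one cosine per fingerprint (vs. the knockout fingerprint).
    is_annotated: parallel list of booleans, True if the drug has the annotation.
    """
    rng = random.Random(seed)
    abs_cos = [abs(c) for c in cosines]
    # Summing absolute cosines makes the statistic blind to which end of the
    # list the annotated fingerprints fall on.
    observed = sum(a for a, hit in zip(abs_cos, is_annotated) if hit)
    k = sum(is_annotated)
    # Count random draws of k fingerprints whose sum exceeds the observed sum.
    larger = sum(sum(rng.sample(abs_cos, k)) > observed for _ in range(n_draws))
    return observed, larger / n_draws
```

With the numbers in the legend, 499 larger draws out of 100,000 gives 499 / 100,000 = 0.00499, reported as p = 0.005.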

      If this were true, we would expect, as the reviewer suggested, SSRI treatment (citalopram or fluvoxamine) on control larvae to give a similar behavioural phenotype to knockout of sorl1. However, this generally did not appear to be the case (sorl1 knockout fingerprint vs. SSRI-treated control fingerprint, cosine = 0.08 ± 0.35; Author response image 6).

      Author response image 6.

      sorl1 F0 knockouts in comparison to controls treated with SSRIs. a, sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H₂O fingerprint from the citalopram experiment) in comparison with the scrambled-injected + citalopram (1 or 10 µM) fingerprints. Each dot represents the mean deviation from the same-clutch scrambled-injected H₂O-treated mean for that parameter (z-score, mean ± SEM). b, As in a), sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H₂O fingerprint from the fluvoxamine experiment) in comparison with the scrambled-injected + fluvoxamine (10 µM) fingerprint.

      The comparison with trazodone is an interesting observation, but it is only a weak serotonin reuptake inhibitor (Ki for SLC6A4 = 690 nM, vs. 8.9 nM for citalopram; Owens et al., 1997) and it has many other targets, both as agonist or antagonist, including serotonin, adrenergic, and histamine receptors (Mijur, 2011). In any case, the average trazodone fingerprint does not correlate particularly well with the sorl1 knockout fingerprint (cos = 0.3). Finally, the sorl1 knockout behavioural phenotype could be primarily caused by altered serotonin signalling in the hypothalamus, where we found both the biggest difference in tph1a/1b/2 HCR signal intensity (Fig. 5f) and the highest expression of sorl1 across scRNA-seq clusters (Fig. 1–supplement 2). In this case, it would be correct to expect sorl1 knockouts to react differently to SSRIs than controls, but it would be incorrect to expect SSRI treatment to cause the same behavioural phenotype, as it concurrently affects every other serotonergic neuron in the brain.

      Finally, we agree the quoted conclusion was too strong given the current evidence. We since tested another SSRI, fluvoxamine, on sorl1 knockouts.

      - Also in reference to Figure 5: in panel c, data are presented as deviation from vehicle treated. Because of this data presentation choice, it's no longer possible to determine whether, in this experiment, sorl1 crispants sleep less at night relative to their siblings. Does citalopram rescue / reverse sleep deficits in sorl1 mutants?

      On your first point, please see our response to Reviewer #3 (2)c and Author Response 2b above.

      On “does citalopram rescue/reverse sleep deficits in sorl1 mutants”: citalopram (and fluvoxamine) tends to reverse the key aspects of the sorl1 knockout behavioural phenotype by reducing night-time activity (% time active and total Δ pixels), increasing night-time sleep, and shortening sleep latency (Author response image 7). Extrapolating from the hypothesis presented in Discussion, this may be interpreted as a hint that sorl1 knockouts have reduced levels of 5-HT receptors, as increasing serotonin signalling using an SSRI tends to rescue the phenotype. However, we do not think that focusing on the significant behavioural parameters necessarily makes sense here. Rather, one should take all parameters into account to conclude whether knockouts react differently to the drug than wild types (also see answer to Reviewer #3, (7) on this). For example, citalopram increased the night-time sleep bout length of sorl1 knockouts more than that of controls (Fig. 5), but this parameter was not modified by knockout of sorl1 (Fig. 4). To explain the rationale more informally, citalopram is only used as a tool here to probe serotonin signalling in sorl1 knockouts. Whether it worsens or rescues the behavioural phenotype is somewhat secondary; the key question is whether knockouts react differently than controls.

      Author response image 7.

      Comparing untreated sorl1 F0 knockouts vs. treated with SSRIs. a, sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H₂O fingerprint from the citalopram experiment) in comparison with the sorl1 knockout + citalopram (1 or 10 µM) fingerprints. Each dot represents the mean deviation from the same-clutch scrambled-injected H₂O-treated mean for that parameter (z-score, mean ± SEM). b, As in a), sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H₂O fingerprint from the fluvoxamine experiment) in comparison with the sorl1 + fluvoxamine (10 µM) fingerprint.

      - Possible molecular pathways targeted by tinidazole, fenoprofen, and betamethasone are not described.

      Tinidazole is an antibiotic, fenoprofen is a non-steroidal anti-inflammatory drug (NSAID), and betamethasone is a steroidal anti-inflammatory drug. Interestingly, long-term use of NSAIDs reduces the risk of AD (in ’t Veld et al., 2001). Several mechanisms are possible (Weggen et al., 2007), including reduction of Aβ42 production by interacting with γ-secretase (Eriksen et al., 2003). However, we did not explore the mechanism of action of these drugs on psen2 knockouts so do not feel comfortable speculating. We do not know, for example, whether these findings apply to betamethasone.

      Minor Comments:

      - On page 25, panel "g" should be labeled as "f".

      Thank you!

      - On page 35, a reference should be provided for the statement "From genomic studies of AD, we know that mutations in genes such as SORL1 modify risk by disrupting some biological processes.".

      Thank you, this is now corrected. These were the same studies as mentioned in Introduction.

      - On page 43, the word "and" should be added - "in wild-type rats and mice, overexpressing mutated human APP and PSEN1, AND restricting sleep for 21 days...".

      Right, this sentence could be misread, we edited it. “overexpressing […]” only applied to the mice, not the rats (as they are wild-type); and both are sleep-deprived.

      - On page 45, a reference should be provided for the statement "SSRIs can generally be used continuously with no adverse effects" and this statement should potentially be softened.

      The reference is at the end of that sentence (Cirrito et al., 2011). You are correct though; we reformulated this statement to: “SSRIs can generally be used safely for many years”. SSRIs indeed have side effects.

      - On page 54, a 60-minute rolling average is described as 45k rows, but this seems to be a 30-minute rolling average.

      Thank you! We corrected this. It should have been 90k rows, as in: 25 frames-per-second × 60 seconds × 60 minutes.
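The row count follows directly from the frame rate, as this small check shows. The helper name `window_rows` is hypothetical; only the 25 frames-per-second rate comes from the text.

```python
FPS = 25  # frames per second of the recordings


def window_rows(minutes, fps=FPS):
    """Number of frame-level rows covered by a rolling window of `minutes`."""
    return fps * 60 * minutes


rows_60_min = window_rows(60)  # 25 x 60 x 60 = 90,000 rows
rows_30_min = window_rows(30)  # 25 x 60 x 30 = 45,000 rows
```

This matches the reviewer's observation that 45k rows corresponds to a 30-minute window rather than a 60-minute one.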

      Reviewer #2 (Recommendations For The Authors):

      "As we observed in the scRNA-seq data, most genes tested (appa, appb, psen1, psen2, apoea, cd2ap, sorl1) were broadly expressed throughout the 6-dpf brain (Fig. 1d and Fig. 1supplement 3 and 4)."

      - apoea and appb are actually not expressed highly in the scRNA-seq data, and the apoea in situ looks odd, as if it has no expression. The appb gene mysteriously does not look as though it has high expression in the Raj data, but it is clearly expressed based on the in situ. I had previously noticed the same discrepancy, and I attribute it to the transcriptome used to map the Raj data, as the new DanioCell data uses a new transcriptome and indicates high appb expression in the brain. Please point out the discrepancy and possible explanation, perhaps in the figure legend.

      All excellent points, thank you. We included them directly in Results text.

      "most of these were expressed in the brain of 5-6-dpf zebrafish larvae, suggesting they play a role in early brain development or function."

      - Evidence of expression does not suggest function, particularly not a function in brain development. As one example, almost half of the genome is expressed prior to the maternal-zygotic transition but does not have a function in those earliest stages of development. There are numerous other instances where expression does not equal function. Please change the sentence even as simply as "it is possible that they".

      We mostly agree and edited to “[…], so they could play a role […]”.

      Out of curiosity, we plotted, for each zebrafish developmental stage, the proportion of Alzheimer’s risk gene orthologues expressed in comparison to the proportion of all genes expressed (Author response image 8). We defined “all genes” as every gene that is expressed in at least one of the developmental stages (n = 24,856), not the complete transcriptome, to avoid including genes that are never expressed in the brain or whose expression is always below detection limit. We counted a gene as “expressed” if at least three cells had detectable transcripts. Using these definitions, 82 ± 7% of genes are expressed during development. For every developmental stage except 5 dpf (so 11/12), a larger proportion of Alzheimer’s risk genes than all genes are expressed (+5 ± 4%).

      Author response image 8.

      Proportion of Alzheimer’s risk genes orthologues expressed throughout zebrafish development. Proportion of Alzheimer’s risk genes orthologues (n = 42) and all genes (n = 24,856) expressed in the zebrafish brain at each developmental stage, from 12 hours post-fertilisation (hpf) to 15 days post-fertilisation (dpf). “All genes” corresponds to every gene expressed in the brain at any of the developmental stages, not the complete transcriptome. A gene is considered “expressed” (green) if at least three cells had detectable transcripts. Single-cell RNA-seq dataset from Raj et al., 2020.
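The "expressed" criterion in this legend (at least three cells with detectable transcripts) could be applied as in the sketch below. The data layout, a dictionary mapping each gene to its per-cell transcript counts for one developmental stage, and both function names are assumptions for illustration; they do not reflect how the Raj et al. dataset is actually stored or how the figure was generated.

```python
def expressed_genes(counts, min_cells=3):
    """Genes with detectable transcripts (count > 0) in at least `min_cells` cells.

    counts: dict of gene name -> list of per-cell transcript counts for one
    developmental stage (hypothetical format).
    """
    return {
        gene
        for gene, cells in counts.items()
        if sum(c > 0 for c in cells) >= min_cells
    }


def proportion_expressed(counts, gene_set, min_cells=3):
    """Proportion of `gene_set` that is expressed at this stage.

    Genes absent from `counts` are treated as not expressed.
    """
    expressed = expressed_genes(counts, min_cells)
    return len(set(gene_set) & expressed) / len(set(gene_set))
```

Calling `proportion_expressed` once with the risk-gene orthologues and once with the "all genes" set, for each stage, would give the two proportions compared in the figure.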

      "This frame-by-frame analysis has several advantages over previous methods that analysed activity data at the one-minute resolution."

      - Which methods are these? There are no citations. There are certainly existing methods in the zebrafish field that can produce similar data to the method developed for this project. This new package is useful, as most existing software is not written in R, so it would help scientists who prefer this programming language. However, I would be careful not to oversell its novelty, since many methods do exist that produce similar results.

      We added the references. They were referenced above after “we combined previous sleep/wake analysis methods”, but should have been referenced again here.

      We are not convinced by this criticism. We would obviously not claim that the FramebyFrame package is as sophisticated and versatile as video-tracking tools like SLEAP or DeepLabCut, but we do think it answers a genuine need that was not addressed by other methods. Specifically, we know of many labs recording pixel count data across multiple days using the Zebrabox or DanioVision (we added support for DanioVision data after submission), but there were no packages to extract behavioural parameters from these data. Other methods involved standalone scripts with no documentation or version tracking. We would concede the FramebyFrame package is mostly targeted at these labs, but we already know of six labs routinely using it and were recently contacted by a researcher tracking Daphnia in the Zebrabox.

      "F0 knockouts of both cutches" - "clutches"

      Thank you!

      Reviewer #3 (Recommendations For The Authors):

      I would suggest totally revamping the Introduction section, and being sure to provide readers with the context and background they need for the data that comes thereafter. Key areas to touch on, in no particular order, include:

      • Far more detail on the behavioral pharm screen upon which this paper builds, as a brief overview of that approach and the data generated are needed.

      Thank you for the suggestion, we added a sentence hinting at this work in the last Introduction paragraph.

      • Limitations of current zebrafish sleep/arousal assays that motivated the authors to develop a new, temporally high-resolution system.

      We think this is better explained in Results, as it currently is. For example, we need to point to Fig. 2–supplement 2a,b,c to explain that one-minute methods were missing sleep bouts and how FramebyFrame resolves this issue.

      • A paragraph about sleep and AD, that does a better job of citing work in humans, mammalian, and invertebrate models that motivate the interest in the connection pursued here.

      Sorry, we think this would place too much focus on sleep and AD. We want the main topic of the paper to be the behavioural pharmacology approach, not AD or sleep per se. As the Introduction states, we see Alzheimer’s risk genes as a case study for the behavioural pharmacology approach, rather than the reason why the approach was developed. Additionally, presenting sleep and AD in Introduction risks sounding like ZOLTAR is specifically designed for this context, while we conceived of it as much more generalisable and explicitly encourage its use to study genes associated with other diseases. Note that the paragraph you suggest is, we think, mostly present in Discussion (section Disrupted sleep and serotonin signalling […]).

      • I modestly suggest eliminating making such a strong case for a gene-first approach being the best way to understand disease. It is not a zero-sum game, and there is plenty to learn from proteomics, metabolomics, etc. I suspect nobody will argue with the authors saying they leveraged the strength of their system and focused on key AD genes of interest.

      From your point below, we understand the following quote is the source of the issue: “For finding causal processes, studying the genome, rather than the transcriptome or epigenome, is advantageous because the chronology from genomic variant to disease is unambiguous […]”. We did not want to suggest it is a zero-sum game, but we now understand how it can be read this way. We slightly adapted the wording. What we want to do is highlight the causality argument as the advantage of the genomics approach. We feel we do not read this argument often enough, while it remains a ‘magic power’ of genomics. One essentially does not have to worry about causality when studying a pathogenic germline variant, while it is a constant concern when studying the transcriptome or epigenome (i.e. did the change in this transcript’s level cause disease, or vice-versa?). To take an example in the context of AD, arguments based on genomics (e.g. Down syndrome or APP duplication) are often the definitive arbiters when debating the amyloid hypothesis, exactly because their causality cannot be doubted.

      Minor comments

      (1) The opening of the introduction is perhaps overly broad, spending an entire paragraph on genome vs transcriptome, etc and making the claim that a gene-first approach is the best path. It isn't zero-sum, and the authors could just get right into AD and study genes of interest. Similar issues occur throughout the manuscript, with sentences/paragraphs that are not necessarily needed.

      Please see our answer to your previous point. On the introduction being overly broad, we perfectly agree it is broad, but related to your point about presenting sleep and AD in the Introduction, we wish to talk about finding causal processes from genomics findings using behavioural pharmacology. We purposefully present research on AD as one instance of this broader goal, not the primary topic of the paper.

      Another example are these sentences, which could be totally removed as the following paragraph starts off making the same point much more succinctly. "From genomic studies of AD, we know that mutations in genes such as SORL1 modify risk by disrupting some biological processes. Presumably, the same processes are disrupted in zebrafish sorl1 knockouts, and some caused the behavioural alterations we observed. Can we now follow the thread backwards and predict some of the biological processes in which Sorl1 is involved based on the behavioural profile of sorl1 knockouts?"

      Thanks for the suggestion, but we think these sentences are useful to place this Results section back in the context of the Introduction. Think of the paper as mainly about the behavioural pharmacology approach, not about Alzheimer’s risk genes. The function of the paragraph here is not simply to explain the method by which we decided to study sorl1; it is to reiterate the rationale behind the behavioural pharmacology approach so that the reader understands where this Results section fits in the overall structure.

      (2) Related to the above, the authors use lecanemab as an example to support their approach, but there has been a great deal of controversy regarding this drug. I don't think such extensive justification is needed. This study uses AD risk genes as a case study in a newly developed behavioral pharm pipeline. A great deal of the rest of the intro seems to just fill space and could be more focused on the study at hand. Interestingly, after gene selection, the next step in their pipeline is sleep/wake analysis yet nothing is covered about AD and sleep in the intro. Some justification of that approach (why focus on sleep/wake as a starting point for behavioral pharm rather than learning and memory?) would be a better use of intro space.

      There has indeed been controversy about lecanemab, but even the harshest critiques of the amyloid hypothesis concede that it slows down cognitive decline (Espay et al., 2023). That is all that is needed to support our argument, which is that research on AD started primarily from genomics and thereby yielded a disease-modifying drug. The controversy seems mostly focused on whether this effect size is clinically significant, and we think we correctly represent this uncertainty (e.g. “antibodies against Aβ such as lecanemab show promise in slowing down disease progression” and “the beneficial effects from targeting Aβ aggregation currently remain modest”).

      Your next point is entirely fair. We mostly answered it above. To explain further, the primary reason why we measured sleep/wake behaviour is to match the behavioural dataset from Rihel et al., 2010 so we can use it to make predictions, not to study sleep in the context of AD per se. Sure, perhaps learning and memory would have been interesting, but we do not know of any study testing thousands of small molecules on zebrafish larvae during a memory task. We understand it can be slightly confusing though, as we then spend a paragraph of Discussion on sleep as a causal process in AD, but we obviously need to discuss this topic given the findings. However, to reiterate, we purposefully designed FramebyFrame and ZOLTAR to be useful beyond studying sleep/wake behaviour. For example, FramebyFrame would not calculate 17 behavioural parameters if the only goal was to measure sleep. We now mention the Rihel et al., 2010 study in the Introduction as you suggested above (“Far more detail on the behavioral pharm screen […]”), as that is the real reason why sleep/wake behaviour was measured in the first place.

      (3) Also related to the above, another more relevant point that could be talked about in the intro is the need for more refined approaches to analyze sleep in zebrafish, given the effort that went into the new analysis system described here. Again, I think the context for why the authors developed this system would be more meaningful than the current content.

      Thank you, we think we answered this point above (especially below Limitations of current zebrafish sleep/arousal assays […]).

      (4) GWAS can stand for Genome-wide association studies (plural) so I do not think the extra "s" is needed (GWASs).

      Indeed, that seems to be the common usage. Thank you.

      (5) AD candidate risk genes were determined from loci using "mainly statistic colocalization". Can the authors add a few more details about what was done and what the "mainly" caveat refers to?

      “Mainly” simply refers to the fact that other methods were used by Schwartzentruber et al. (2021) to annotate the GWAS loci with likely causal genes, but that most calls were ultimately made from statistic colocalisation. Readers can refer to this work to learn more about the methods used.

      (6) The authors write "The loss of psen1 only had mild effects on behaviour" but I think they mean "sleep behaviors" as there could be many other behaviors that are disrupted but were not assessed. The same issue a few sentences later with "Behaviour during the day was not affected" and at the end of the following paragraph.

      Yes, that would be more precise, thank you.

      (7) For the Sorl1 pharmacology data, it is very hard to understand what is being measured behaviorally. Are the authors measuring sleep +/- citalopram, or something else, and why the change to Euclidean distance rather than all the measures we were just introduced to earlier in the manuscript?

      We understand these plots (Fig. 5c,d) are less intuitive, but it is important that we show the difference in behaviour compared to H₂O-treated larvae of the same genotype. The claim is that citalopram has a larger effect on knockouts than on controls, so the reader needs to focus on the effect of the drug on each genotype, not on the effect of sorl1 knockout. We added the standard fingerprints (i.e. setting controls to z-score = 0) here in Author response figures.

      Euclidean distance takes as input all the measures we introduced. The point is precisely not to select a single measure. For example, say we were only plotting active bout number during the day, we would conclude that 10 µM citalopram has the same effect on knockouts and controls. Conversely, if we had taken sleep bout length at night, we would conclude 10 µM has a stronger effect on knockouts. What is the correct parameter to select? Using Euclidean distance resolves this by taking all parameters into account, rather than arbitrarily choosing one.
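
      The reasoning above can be illustrated with a minimal Python sketch. It assumes each fingerprint is a set of z-scores relative to H₂O-treated larvae of the same genotype, so that the untreated baseline sits at zero for every parameter. The parameter names and numbers are hypothetical and chosen only for illustration; this is not the FramebyFrame implementation and none of the values come from the study.

```python
import math

# Hypothetical z-scores of citalopram-treated larvae relative to H2O-treated
# larvae of the same genotype (illustrative values, not data from the study).
control_fingerprint = {
    "active_bout_number_day": -1.0,
    "sleep_bout_length_night": 0.5,
    "total_sleep_night": 0.5,
}
knockout_fingerprint = {
    "active_bout_number_day": -1.0,  # same single-parameter effect as controls
    "sleep_bout_length_night": 2.0,  # larger effect in knockouts
    "total_sleep_night": 1.0,
}


def euclidean_distance(a, b):
    """Euclidean distance between two fingerprints covering the same parameters."""
    if a.keys() != b.keys():
        raise ValueError("fingerprints must cover the same parameters")
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))


# Assumption: H2O-treated larvae define z-score = 0 for every parameter.
baseline = {name: 0.0 for name in control_fingerprint}

control_effect = euclidean_distance(control_fingerprint, baseline)
knockout_effect = euclidean_distance(knockout_fingerprint, baseline)

print(f"controls:  {control_effect:.2f}")
print(f"knockouts: {knockout_effect:.2f}")
```

      In this toy example, looking only at active bout number during the day would suggest identical drug effects in both genotypes, whereas the distance over all parameters is larger for knockouts.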

      And what exactly is a "given spike in serotonin"? and how is this hypothesis the conclusion based on the lack of evidence for the second hypothesis? As the authors say, there could be other ways sorl1 knockouts are more sensitive to citalopram, so the absence of evidence for one hypothesis certainly does not support the other hypothesis.

      We mean a given release of serotonin in the synaptic cleft. We have fixed this wording. 

      We tend to disagree on the second point. We can think of two ways that sorl1 knockouts are more sensitive to citalopram: 1) they produce more serotonin, so blocking reuptake causes a larger spike in knockouts; or 2) blocking reuptake causes the same increase in both knockouts and wild-types but knockouts react more strongly to serotonin. We cannot in fact think of another way to explain the citalopram results. Not finding overwhelming evidence for 1) surely supports 2) somewhat, even if we do not have direct evidence for it. As an analogy, if two diagnoses are possible for a patient, testing negative for the first one supports the other one, even before it is directly tested.

      (8) Again some language is used without enough care. Fish are referred to as "drowsier" under some drug conditions. How do the authors know the animal is drowsy? The phenotype is more specific - more sleep, less activity.

      Thank you, we switched to “Furthermore, fenoprofen worsened the day-time hypoactivity of psen2 knockout larvae […]”.

      (9) This sentence is misleading as it gives the impression that results in this manuscript suggest the conclusion: "Our observation that disruption of genes associated with AD diagnosis after 65 years reduces sleep in 7-day zebrafish larvae suggest that disrupted sleep may be a common mechanism through which these genes exert an effect on risk." That idea is widely held in the field, and numerous other previous manuscripts/reviews should be cited for clarity of where this hypothesis came from.

      This idea is not widely held in the field. You likely read this point as “disrupted sleep is a risk factor for AD”, which, yes, is widely discussed in the field, but is not precisely what we are saying. We hypothesise that mutations in some of the Alzheimer’s risk genes cause disrupted sleep, possibly from a very early age, which then causes AD decades later. Studies and reviews on sleep and AD rarely make this hypothesis, at least not explicitly. The closest we know of are a few recent human genetics studies, typically using Mendelian Randomisation, finding that higher genetic risk of AD correlates with some sleep phenotypes, such as sleep duration (Chen et al., 2022; Leng et al., 2021). The work of Muto et al. (2021) is particularly interesting as it found correlations between higher genetic risk of AD and some sleep phenotypes in men in their early twenties, which seems unlikely to be a consequence of early pathology. Note, however, that even these studies do not mention sleep possibly being disrupted early in development, which is what our findings in zebrafish larvae support. As we mention, we think a team should test whether sleep is different in infants at higher genetic risk of AD, essentially performing an experiment analogous to, but obviously much more difficult than, the one we performed in zebrafish larvae. We do not know of any study testing this or even raising this idea, so evidently it is not widely held. Having said that, the studies we mention here were not referenced in the Discussion paragraph. We have now corrected this.

      Ashlin TG, Blunsom NJ, Ghosh M, Cockcroft S, Rihel J. 2018. Pitpnc1a Regulates Zebrafish Sleep and Wake Behavior through Modulation of Insulin-like Growth Factor Signaling. Cell Rep 24:1389–1396. doi:10.1016/j.celrep.2018.07.012

      Chen D, Wang X, Huang T, Jia J. 2022. Sleep and Late-Onset Alzheimer’s Disease: Shared Genetic Risk Factors, Drug Targets, Molecular Mechanisms, and Causal Effects. Front Genet 13. doi:10.3389/fgene.2022.794202

      Cirrito JR, Disabato BM, Restivo JL, Verges DK, Goebel WD, Sathyan A, Hayreh D, D’Angelo G, Benzinger T, Yoon H, Kim J, Morris JC, Mintun MA, Sheline YI. 2011. Serotonin signaling is associated with lower amyloid-β levels and plaques in transgenic mice and humans. Proc Natl Acad Sci U S A 108:14968–14973. doi:10.1073/pnas.1107411108

      Dean DC, Jerskey BA, Chen K, Protas H, Thiyyagura P, Roontiva A, O’Muircheartaigh J, Dirks H, Waskiewicz N, Lehman K, Siniard AL, Turk MN, Hua X, Madsen SK, Thompson PM, Fleisher AS, Huentelman MJ, Deoni SCL, Reiman EM. 2014. Brain Differences in Infants at Differential Genetic Risk for Late-Onset Alzheimer Disease: A Cross-sectional Imaging Study. JAMA Neurol 71:11–22. doi:10.1001/jamaneurol.2013.4544

      Eriksen JL, Sagi SA, Smith TE, Weggen S, Das P, McLendon DC, Ozols VV, Jessing KW, Zavitz KH, Koo EH, Golde TE. 2003. NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo. J Clin Invest 112:440–449. doi:10.1172/JCI18162

      Espay AJ, Herrup K, Kepp KP, Daly T. 2023. The proteinopenia hypothesis: Loss of Aβ42 and the onset of Alzheimer’s Disease. Ageing Res Rev 92:102112. doi:10.1016/j.arr.2023.102112

      Hoffman EJ, Turner KJ, Fernandez JM, Cifuentes D, Ghosh M, Ijaz S, Jain RA, Kubo F, Bill BR, Baier H, Granato M, Barresi MJF, Wilson SW, Rihel J, State MW, Giraldez AJ. 2016. Estrogens Suppress a Behavioral Phenotype in Zebrafish Mutants of the Autism Risk Gene, CNTNAP2. Neuron 89:725–733. doi:10.1016/j.neuron.2015.12.039

      in ’t Veld Bas A, Ruitenberg A, Hofman A, Launer LJ, van Duijn CM, Stijnen T, Breteler MMB, Stricker BHC. 2001. Nonsteroidal Anti-inflammatory Drugs and the Risk of Alzheimer’s Disease. N Engl J Med 345:1515–1521. doi:10.1056/NEJMoa010178

      Jagirdar R, Fu C-H, Park J, Corbett BF, Seibt FM, Beierlein M, Chin J. 2021. Restoring activity in the thalamic reticular nucleus improves sleep architecture and reduces Aβ accumulation in mice. Sci Transl Med 13:eabh4284. doi:10.1126/scitranslmed.abh4284

      Jiang H, Newman M, Lardelli M. 2018. The zebrafish orthologue of familial Alzheimer’s disease gene PRESENILIN 2 is required for normal adult melanotic skin pigmentation. PLOS ONE 13:e0206155. doi:10.1371/journal.pone.0206155

      Jiang H, Pederson SM, Newman M, Dong Y, Barthelson K, Lardelli M. 2020. Transcriptome analysis indicates dominant effects on ribosome and mitochondrial function of a premature termination codon mutation in the zebrafish gene psen2. PLOS ONE 15:e0232559. doi:10.1371/journal.pone.0232559

      Joo W, Vivian MD, Graham BJ, Soucy ER, Thyme SB. 2021. A Customizable Low-Cost System for Massively Parallel Zebrafish Behavioral Phenotyping. Front Behav Neurosci 14.

      Joubert L, Hanson B, Barthet G, Sebben M, Claeysen S, Hong W, Marin P, Dumuis A, Bockaert J. 2004. New sorting nexin (SNX27) and NHERF specifically interact with the 5-HT4a receptor splice variant: roles in receptor targeting. J Cell Sci 117:5367–5379. doi:10.1242/jcs.01379

      Leng Y, Ackley SF, Glymour MM, Yaffe K, Brenowitz WD. 2021. Genetic Risk of Alzheimer’s Disease and Sleep Duration in Non-Demented Elders. Ann Neurol 89:177–181. doi:10.1002/ana.25910

      Mitchell PB, Hadzi-Pavlovic D. 2000. Lithium treatment for bipolar disorder. Bull World Health Organ 78:515–517.

      Mittur A. 2011. Trazodone: properties and utility in multiple disorders. Expert Rev Clin Pharmacol 4:181–196. doi:10.1586/ecp.10.138

      Munoz-Torrero D. 2008. Acetylcholinesterase Inhibitors as Disease-Modifying Therapies for Alzheimer’s Disease. Curr Med Chem 15:2433–2455. doi:10.2174/092986708785909067

      Muto V, Koshmanova E, Ghaemmaghami P, Jaspar M, Meyer C, Elansary M, Van Egroo M, Chylinski D, Berthomier C, Brandewinder M, Mouraux C, Schmidt C, Hammad G, Coppieters W, Ahariz N, Degueldre C, Luxen A, Salmon E, Phillips C, Archer SN, Yengo L, Byrne E, Collette F, Georges M, Dijk D-J, Maquet P, Visscher PM, Vandewalle G. 2021. Alzheimer’s disease genetic risk and sleep phenotypes in healthy young men: association with more slow waves and daytime sleepiness. Sleep 44. doi:10.1093/sleep/zsaa137

      Myers-Turnbull D, Taylor JC, Helsell C, McCarroll MN, Ki CS, Tummino TA, Ravikumar S, Kinser R, Gendelev L, Alexander R, Keiser MJ, Kokel D. 2022. Simultaneous analysis of neuroactive compounds in zebrafish. doi:10.1101/2020.01.01.891432

      Owens MJ, Morgan WN, Plott SJ, Nemeroff CB. 1997. Neurotransmitter receptor and transporter binding profile of antidepressants and their metabolites. J Pharmacol Exp Ther 283:1305–1322.

      Özcan GG, Lim S, Leighton PL, Allison WT, Rihel J. 2020. Sleep is bi-directionally modified by amyloid beta oligomers. eLife 9:e53995. doi:10.7554/eLife.53995

      Quiroz YT, Schultz AP, Chen K, Protas HD, Brickhouse M, Fleisher AS, Langbaum JB, Thiyyagura P, Fagan AM, Shah AR, Muniz M, Arboleda-Velasquez JF, Munoz C, Garcia G, Acosta-Baena N, Giraldo M, Tirado V, Ramírez DL, Tariot PN, Dickerson BC, Sperling RA, Lopera F, Reiman EM. 2015. Brain Imaging and Blood Biomarker Abnormalities in Children With Autosomal Dominant Alzheimer Disease: A Cross-Sectional Study. JAMA Neurol 72:912–919. doi:10.1001/jamaneurol.2015.1099

      Relkin NR. 2007. Beyond symptomatic therapy: a reexamination of acetylcholinesterase inhibitors in Alzheimer’s disease. Expert Rev Neurother 7:735–748. doi:10.1586/14737175.7.6.735

      Rihel J, Prober DA, Arvanites A, Lam K, Zimmerman S, Jang S, Haggarty SJ, Kokel D, Rubin LL, Peterson RT, Schier AF. 2010. Zebrafish Behavioral Profiling Links Drugs to Biological Targets and Rest/Wake Regulation. Science 327:348–351. doi:10.1126/science.1183090

      Sleegers K, Brouwers N, Gijselinck I, Theuns J, Goossens D, Wauters J, Del-Favero J, Cruts M, van Duijn CM, Van Broeckhoven C. 2006. APP duplication is sufficient to cause early onset Alzheimer’s dementia with cerebral amyloid angiopathy. Brain J Neurol 129:2977–2983. doi:10.1093/brain/awl203

      Sun L, Zhou R, Yang G, Shi Y. 2017. Analysis of 138 pathogenic mutations in presenilin-1 on the in vitro production of Aβ42 and Aβ40 peptides by γ-secretase. Proc Natl Acad Sci 114:E476–E485. doi:10.1073/pnas.1618657114

      Szklarczyk D, Santos A, von Mering C, Jensen LJ, Bork P, Kuhn M. 2016. STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data. Nucleic Acids Res 44:D380–D384. doi:10.1093/nar/gkv1277

      Weggen S, Rogers M, Eriksen J. 2007. NSAIDs: small molecules for prevention of Alzheimer’s disease or precursors for future drug development? Trends Pharmacol Sci 28:536–543. doi:10.1016/j.tips.2007.09.004

      Wiltschko AB, Tsukahara T, Zeine A, Anyoha R, Gillis WF, Markowitz JE, Peterson RE, Katon J, Johnson MJ, Daka SR. 2020. Revealing the structure of pharmacobehavioral space through motion sequencing. Nat Neurosci 23:1433–1443. doi:10.1038/s41593-020-00706-3

      Yang T, Arslanova D, Gu Y, Augelli-Szafran C, Xia W. 2008. Quantification of gamma-secretase modulation differentiates inhibitor compound selectivity between two substrates Notch and amyloid precursor protein. Mol Brain 1:15. doi:10.1186/1756-6606-1-15

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Yun et al. examined the molecular and neuronal underpinnings of changes in Drosophila female reproductive behaviors in response to social cues. Specifically, the authors measure the ejaculate-holding period, which is the amount of time females retain male ejaculate after mating (typically 90 min in flies). They find that female fruit flies, Drosophila melanogaster, display shorter holding periods in the presence of a native male or male-associated cues, including 2-Methyltetracosane (2MC) and 7-Tricosene (7-T). They further show that 2MC functions through Or47b olfactory receptor neurons (ORNs) and the Or47b channel, while 7-T functions through ppk23 expressing neurons. Interestingly, their data also indicates that two other olfactory ligands for Or47b (methyl laurate and palmitoleic acid) do not have the same effects on the ejaculate-holding period. By performing a series of behavioral and imaging experiments, the authors reveal that an increase in cAMP activity in pC1 neurons is required for this shortening of the ejaculate-holding period and may be involved in the likelihood of remating. This work lays the foundation for future studies on sexual plasticity in female Drosophila.

      The conclusions of this paper are mostly supported by the data, but aspects of the lines used for individual pC1 subtypes and visual contributions as well as the statistical analysis need to be clarified.

      (1) The pC1 subtypes (a - e) are delineated based on their morphology and connectivity. While the morphology of these neurons is distinct, they do share a resemblance that can be difficult to discern depending on the imaging performed. Additionally, genetic lines attempting to label individual neurons can easily be contaminated by low-level expression in off-target neurons in the brain or ventral nerve cord (VNC), which could contribute to behavioral changes following optogenetic manipulations. In Figures 5C - D, the authors generated and used new lines for labeling pC1a and pC1b+c. The line for pC1b+c was imaged as part of another recent study (https://doi.org/10.1073/pnas.2310841121). However, similar additional images of the pC1a line (i.e. 40x magnification and VNC expression) would be helpful in order to validate its specificity.

      We have included the high-resolution images of the expression of the pC1a-split-Gal4 driver in the brain and the VNC in the new figures S6A and S6B.

      (2) The authors' experiments examining olfactory and gustatory contributions to the holding period were well controlled and described. However, the experiments in Figure 1D examining visual contributions were not sufficiently convincing as the line used (w1118) has previously been shown to be visually impaired (Wehner et al., 1969; Kalmus 1948). Using another wild-type line would have improved the authors' claims.

      It is evident that w1118 flies are visually impaired and are able to receive a limited amount of visual information in dim red light. Nevertheless, they are able to exhibit MIES phenotypes, which further supports the dispensability of visual information in MIES. In a 2024 study, Doubovetzky et al. (1) found that MIES in ninaB mutant females, which have defects in visual sensation, was not altered. This further corroborates our assertion that vision is likely to be of lesser importance than olfaction in MIES.

      (3) When comparisons between more than 2 groups are shown as in Figures 1E, 3D, and 5E, the comparisons being made were not clear. Adding in the results of a nonparametric multiple comparisons test would help for the interpretation of these results.

      We have revised figures 1E, 3D, 5E and the accompanying legends as suggested.

      Reviewer #2 (Public Review):

      The work by Yun et al. explores an important question related to post-copulatory sexual selection and sperm competition: Can females actively influence the outcome of insemination by a particular male by modulating the storage and ejection of transferred sperm in response to contextual sensory stimuli? The present work is exemplary for how the Drosophila model can give detailed insight into the basic mechanism of sexual plasticity, addressing the underlying neuronal circuits on a genetic, molecular, and cellular level.

      Using the Drosophila model, the authors show that the presence of other males or mated females after mating shortens the ejaculate-holding period (EHP) of a female, i.e. the time she takes until she ejects the mating plug and unstored sperm. Through a series of thorough and systematic experiments involving the manipulation of olfactory and chemo-gustatory neurons and genes in combination with exposure to defined pheromones, they uncover two pheromones and their sensory cells for this behavior. Exposure to the male-specific pheromone 2MC shortens EHP via female Or47b olfactory neurons, and the contact pheromone 7-T, present in males and on mated females, does so via ppk23 expressing gustatory foreleg neurons. Both compounds increase cAMP levels in a specific subset of central brain receptivity circuit neurons, the pC1b,c neurons. By employing an optogenetically controlled adenyl cyclase, the authors show that increased cAMP levels in pC1b and c neurons increase their excitability upon male pheromone exposure, decrease female EHP, and increase the remating rate. This provides convincing evidence for the role of pC1b,c neurons in integrating information about the social environment and mediating not only virgin but also mated female post-copulatory mate choice.

      Understanding context and state-dependent sexual behavior is of fundamental interest. Mate behavior is highly context-dependent. In animals subjected to sperm competition, the complexities of optimal mate choice have attracted a long history of sophisticated modelling in the framework of game theory. These models are in stark contrast to how little we understand so far about the biological and neurophysiological mechanisms of how females implement post-copulatory or so-called "cryptic" mate choice and bias sperm usage when mating multiple times.

      The strength of the paper is decrypting "cryptic" mate choice, i.e. the clear identification of physiological mechanisms and proximal causes for female post-copulatory mate choice. The discovery of peripheral chemosensory nodes and neurophysiological mechanisms in central circuit nodes will provide a fruitful starting point to fully map the circuits for female receptivity and mate choice during the whole gamut of female life history.

      We appreciate the positive response to our work.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      While appreciating the quality of the work the reviewers had a few key concerns that would greatly improve the manuscript. These are:

      (1) In some cases the specific statistical analyses are not clear. Could the authors please clarify what comparisons were made and the specific tests used?

      We have clarified the comparisons made in the multiple comparison analysis and specified the tests used in figures 1E, 3D, 5E.

      (2) Could the authors please include data that verify the expression patterns of their new reagent for pC1a, which will be useful for the community?

      Figure S6 was revised to include the expression of the pC1a-split-Gal4 gene in the brain (Fig. S6A) and the VNC (Fig. S6B).

      (3) A figure summarising their findings in the context of known circuitry will be useful.

      A new Figure 7 has been prepared, which provides a summary of our findings.

      (4) The SAG data are interesting. Do the authors wish to consider moving it to the main text or removing it if too preliminary?

      The supplementary figure 10 and related discussions in the discussion section have been removed.

      In the revised version of this manuscript, we present new evidence that the Or47b gene is required for 2MC-induced cAMP elevation in pC1 neurons, but not for 7T-induced one (see Fig. 5F). This observation supports that Or47b is a receptor for 2MC.

      The following paragraph was inserted at line 248 to provide a detailed description of the new findings: “To further test the role of Or47b in 2MC detection, we generated Or47b-deficient females with pC1 neurons expressing the CRE-luciferase reporter. Females with one copy of the wild-type Or47b allele, which served as the control group, showed robust CRE-luciferase reporter activity in response to either 2MC or 7-T. In contrast, Or47b-deficient females showed robust CRE-luciferase activity in response to 7-T, but little activity in response to 2MC. This observation suggests that the odorant receptor Or47b plays an essential role in the selective detection of 2MC (Fig. 5F).”

      In addition, the following sentence was inserted at line 308 in the discussion section: “In this study, we provide compelling evidence that 2MC induces cAMP elevation in pC1 neurons and EHP shortening via both the Or47b receptor and Or47b ORNs, suggesting that 2MC functions as an odorant ligand for Or47b.”

      Relative CRE-luciferase reporter activity of pC1 neurons in females of the indicated genotypes, incubated with a piece of filter paper perfumed with solvent vehicle control or the indicated pheromones immediately after mating. The CRE-luciferase reporter activity of pC1 neurons of Or47b-deficient females (Or47b2/2 or Or47b3/3) was observed to increase in response to 7-T but not to 2MC. To calculate the relative luciferase activity, the average luminescence unit values of the female incubated with the vehicle are set to 100%. Mann-Whitney Test (n.s. p > 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001). Gray circles indicate the relative luciferase activity (%) of individual females, and the mean ± SEM of data is presented.
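
      The normalisation described in this legend can be sketched in a few lines of Python. This is a minimal illustration under the assumption that each female contributes one raw luminescence value and that the vehicle-group mean defines 100%; the function name and the numbers are hypothetical and are not taken from the study.

```python
from statistics import mean


def relative_activity(values, vehicle_values):
    """Express raw luminescence values as a percentage of the vehicle-group mean."""
    vehicle_mean = mean(vehicle_values)
    if vehicle_mean == 0:
        raise ValueError("vehicle mean must be non-zero")
    return [100.0 * v / vehicle_mean for v in values]


# Hypothetical raw luminescence units (illustrative only).
vehicle = [800.0, 1000.0, 1200.0]
pheromone = [1500.0, 2500.0]

vehicle_pct = relative_activity(vehicle, vehicle)
pheromone_pct = relative_activity(pheromone, vehicle)

print("vehicle (%):  ", vehicle_pct)
print("pheromone (%):", pheromone_pct)
```

      By construction the vehicle group averages 100%, so each plotted point for a treated female reads directly as a percentage of the vehicle response.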

      Reviewer #1 (Recommendations For The Authors):

      (1) There was a discrepancy between the text and the figures. Based on the asterisks above the data in Figure S5A, the data supports only 150 ng of 7-T shortening the ejaculation holding period. However, the text states that (line 190) "150 or 375 ng of 7-T significantly shortened EHP." It would be helpful if the authors clarified this discrepancy.

      The sentence has been revised and now reads as follows: ‘150 ng of 7-T significantly shortened EHP’.

      (2) Based on the current organization of the text, it was not clear how 2MC was identified and its concentrations were known to be physiologically relevant. It would be helpful if the authors could expand on this in lines 178 - 179.

      The following sentences were inserted into the revised version of the manuscript at line 178: The EHP was therefore measured in females incubated in a small mating chamber containing a piece of filter paper perfumed with male CHCs, including 2-methylhexacosane, 2-methyldocosane, 5-methyltricosane, 7-methyltricosane, 10Z-heneicosene, 9Z-heneicosene, and 2MC at various concentrations (not shown). Among these, 2MC at 750 ng was the only one that significantly reduced EHP (Fig. 3A; Fig. S4). 2MC was mainly found in males, but not in virgin females (30). Notably, it is present in D. melanogaster, D. simulans, D. sechellia, and D. erecta, but not in D. yakuba (30, 60).

      (3) The inset pie chart image illustrating MIES in Figure 1A was difficult to interpret. It would be helpful if the authors used a different method for representing this (i.e. a timeline).

      Figure 1A was revised as suggested.

      (4) In lines 121 - 122, the authors state that the females are exposed to "actively courting naive wild type Canton S males." This was difficult to understand and might be improved by removing "actively courting."

      Revised as suggested.

      Reviewer #2 (Recommendations For The Authors):

      (1) Summary figure

      The story is quite comprehensive and contains a lot of detail regarding the interaction of signaling pathways, internal state, and sensory stimuli. I believe a schematic summary figure bringing together all findings could be very helpful and would make it much easier to understand the discussion!

      Figure 7 has been prepared, which provides a summary of the findings and an explanation of the current working model.

      (2) Figure S10/effect on SAG activation of EHP

      At the moment, the quite interesting and relevant result that SAG activation shortens EHP shown in Figure S10 is only referred to in the discussion. Maybe move this to the results and give it a bit more attention? Actually, I believe this is a very exciting finding that could also be the basis for some more interesting speculations about physiological relevance. Since SAG is silenced upon seminal fluid/sex peptide exposure after mating, a mating with failed SAG silencing (i.e. unusually high post-mating SAG activity) could indicate to the female that there was low or failed sex peptide/seminal fluid transfer. In such a case it would be probably advantageous for the female to decrease EHP and quickly remate, as females need the "beneficial" effects of seminal fluid on ovulation and physiology adaptation. SAG could therefore represent another arm of sensing male quality- here not via external pheromones, but internally, via sensing male sex peptide levels.

      If this is a bit preliminary and rather suited to start a new study, Figure S10 could also be removed from the current manuscript.

      Figure S10 and associated text were removed in the revised version of the manuscript.

      (3) PhotoAC experiments in pC1b,c: the authors find that raising cAMP levels in pC1b,c leads to a decrease in EHP. They argue that increased cAMP levels lead to higher excitability of pC1b,c. This implies that the activity of pC1b,c promotes mating plug ejection. I assume the authors have also tried activating pC1b,c directly by optogenetic cation channels? What is the outcome of this? If different from elevating cAMP levels: why so?

      We employed CsChrimson, a red light-sensitive channelrhodopsin, to investigate the effect of optogenetic activation of each pC1 subset on EHP. Optogenetic activation of pC1a, pC1d, or pC1e had little effect on EHP; however, optogenetic activation of pC1b, c significantly increased EHP. This observation was puzzling because optogenetic silencing of the same neurons also increased EHP. In this experiment, females expressing CsChrimson were exposed to red light for the entire period of EHP measurement. Therefore, we suspect that prolonged activation of pC1b and pC1c neurons depleted their neurotransmitter pool, resulting in a silencing effect, but this requires further testing.

      Author response image 1.

      The prolonged optogenetic activation of pC1b, c neurons increases EHP, mimicking silencing of pC1b, c neurons. Females of the indicated genotypes were cultured on food with or without all-trans-retinal (ATR). The ΔEHP is calculated by subtracting the mean of the reference EHP of females cultured in control ATR- food from the EHP of individual females in comparison. The female genotypes are as follows: (A) 71G01-GAL4/UAS-CsChrimson, (B) pC1a-split-Gal4/UAS-CsChrimson, (C) pC1b,c-split-Gal4/UAS-CsChrimson, (D) pC1d-split-Gal4/UAS-CsChrimson, and (E) pC1e-split-Gal4/UAS-CsChrimson. Gray circles indicate the ΔEHP of individual females, and the mean ± SEM of data is presented. Mann-Whitney Test (n.s. p > 0.05; *p <0.05; ****p < 0.0001). Numbers below the horizontal bar represent the mean of the EHP differences between the indicated treatments.
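
      As a rough illustration of the ΔEHP calculation described in this legend, the following Python sketch subtracts the mean EHP of the reference group (females on ATR- food) from the EHP of each individual female. The function name and the EHP values, in minutes, are hypothetical and only show the arithmetic.

```python
from statistics import mean


def delta_ehp(ehp_values, reference_ehp_values):
    """Subtract the mean reference EHP from each individual EHP."""
    reference_mean = mean(reference_ehp_values)
    return [v - reference_mean for v in ehp_values]


# Hypothetical EHP values in minutes (illustrative only).
atr_minus = [85.0, 90.0, 95.0]  # reference group: food without ATR
atr_plus = [100.0, 120.0]       # test group: food with ATR

delta_reference = delta_ehp(atr_minus, atr_minus)
delta_test = delta_ehp(atr_plus, atr_minus)

# Mean EHP difference between the two treatments.
mean_difference = mean(delta_test) - mean(delta_reference)
print(delta_test, mean_difference)
```

      A positive mean difference in this sketch corresponds to a longer EHP in the ATR+ group than in the ATR- reference group.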

      (4) Text edits

      In general, the manuscript is very well-written, clear, and easy to follow. I recommend small edits of the text and correction of typos in some places:

      l.92: "Drosophila females seem to signal the social sexual context through sperm ejection." This sentence could give the impression that the main function of sperm ejection was to signal to conspecifics. I recommend reformulating to leave it open if ejected sperm is a signal or rather a simple cue, e.g.: "There is evidence that Drosophila females detect the social sexual context through sperm ejected by other females."

      Thanks for the good suggestion. It has been revised as suggested. In addition, we have also made additional changes to the text to correct typos.

      l.97: "transcriptional factor" > "transcription factor"

      Revised as suggested. See lines 77, 98, and 201.

      l.101: "There are Dsx positive 14 pC1 neurons in each brain hemisphere of the brain," > "There are 14 Dsx positive pC1 neurons in each brain hemisphere,"

      Revised as suggested; it now reads "There are 14 Dsx-positive pC1 neurons in each hemisphere of the brain, ...".

      l.160: ", even up to 1440 ng" > ", even when applied at concentrations as high as 1440 ng"

      Revised as suggested.

      l.168: "females with male oenocytes significantly shortens EHP" >"females with male oenocytes significantly shorten EHP"

      Revised as suggested.

      l.181: "it was restored when Orco expression is reinstated" >"it was restored when Orco expression was reinstated"

      Revised as suggested. See line 186.

      l.196: "MIES is almost completely abolished" >"MIES was almost completely abolished"

      Revised as suggested. See line 201.

      l.202: "a sexually dimorphic transcriptional factor gene" >"the sex determination transcription factor gene" or "the sex-specifically spliced transcription factor gene". The gene itself is not dimorphic!

      Revised as suggested, lines 208-210 now read "The same study found that Dh44 receptor neurons involved in EHP regulation also express doublesex (dsx), which encodes sexually dimorphic transcription factors."

      l.211: "to silenced" > "to silence"

      Revised as suggested. See line 216.

      l.229: "females that selectively produce the CRE-Luciferase reporter gene" >"females that selectively express CRE-Luciferase reporter"

      Revised as suggested. See line 234.

      l.271: "neurons. expedite" > delete dot

      Revised as suggested. See line 284.

      l.287: "Furthermore, our study has uncovered the conserved neural circuitry that processes male courtship cues and governs mating decisions play an important role in regulating this behavior." > grammar: "our study has uncovered that the conserved neural circuitry that processes male courtship cues and governs mating decisions plays an important role in regulating this behavior." Also: the meaning of "conserved" is not fully clear to me here: conserved in regards to other Drosophila species? Or do the authors mean: general functional similarity with mouse sexual circuitry?

      The sentence (lines 299-301) has been revised for clarity to read "In addition, our study has revealed that the neural circuit that processes male courtship cues and controls mating decisions plays an important role in regulating this behavior. This fly circuit has recently been proposed to be homologous to VMHvl in the mouse brain (45, 46)."

      l.311: "lipid drolet" > "lipid droplets"

      Revised as suggested. See line 325.

      l.316 and in several instances in the following, including Figure 5 caption (l.723) : "cAMP activity" > "cAMP levels" or "increased cAMP levels"

      Revised as suggested.

      l.323: "in hemibrain" > ", as seen in the hemibrain connectome dataset"

      Revised as suggested. See line 337.

      l.326: "increased cAMP levels causes pC1b,c neurons" > "increased cAMP levels cause pC1b,c neurons"

      Revised as suggested. See line 340.

      l.329: "removement" > "removal" or "ejection"

      Revised as suggested; it now reads "the removal of the mating plug". See line 343.

      l. 330: "This observation well aligns" > "The observation aligns well"

      Revised as suggested. See line 345.

      l. 398: Behavior assays: It would be good to describe how mating plug ejection was identified- by eye? Under the microscope/UV light?

      The following sentence has been added to the behavioral assays section at lines 425-426: "The sperm ejection scene, in which the female expels a white sac containing sperm and the mating plug through the vulva, has been directly observed by eye in recorded video footage."

      l.685, Figure legend 2: "thermal activation" > "thermogenetic activation"

      Revised as suggested. See line 430.

      Reference:

      (1) Doubovetzky, N., Kohlmeier, P., Bal, S., & Billeter, J. C. (2023). Cryptic female choice in response to male pheromones in Drosophila melanogaster. bioRxiv, 2023-12.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This manuscript represents a fundamental contribution demonstrating that fentanyl-induced respiratory depression can be reversed with a peripherally-restricted mu opioid receptor antagonist. The paper reports compelling and rigorous physiological, pharmacokinetic, and behavioral evidence supporting this major claim, and furthers mechanistic understanding of how peripheral opioid receptors contribute to respiratory depression. These findings reshape our understanding of opioid-related effects on respiration and have significant therapeutic implications given that medications currently used to reverse opioid overdose (such as naloxone) produce severe aversive and withdrawal effects via actions within the central nervous system.

      We thank the reviewers for their insightful comments and critiques, which we have incorporated into the manuscript. We believe these revisions have significantly improved the manuscript. Additionally, following discussions among the authors, we have revised the color scheme across all figures. For example, the color of the symbols in Figure 1B-D now matches the bars in Figure 1E-J, rather than the symbols. We feel that this change improves the clarity and visual consistency of the figures, making it easier to interpret the data across figures.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper shows that the synthetic opioid fentanyl induces respiratory depression in rodents. This effect is reversed by the opioid receptor antagonist naloxone, as expected. Unexpectedly, the peripherally restricted opioid receptor antagonist naloxone methiodide also blocks fentanyl-induced respiratory depression.

      Strengths:

      The paper reports compelling physiology data supporting the induction of respiratory distress in fentanyl-treated animals. Evidence suggesting that naloxone methiodide reverses this respiratory depression is compelling. This is further supported by pharmacokinetic data suggesting that naloxone methiodide does not penetrate into the brain, nor is it metabolized into brain-penetrant naloxone.

      Weaknesses:

      A weakness of the study is the fact that the functional significance of opioid-induced changes in neural activity in the nTS (as measured by cFos and GCaMP/photometry) is not established. Does the nTS regulate fentanyl-induced respiratory depression, and are changes in nTS activity induced by naloxone and naloxone methiodide relevant to their ability to reverse respiratory depression?

      Reviewer #2 (Public review):

      Summary:

      In this article, Ruyle and colleagues assessed the contribution of central and peripheral mu opioid receptors in mediating fentanyl-induced respiratory depression using both naloxone and naloxone methiodide, which does not cross the blood-brain barrier. Both compounds prevented and reversed fentanyl-induced respiratory depression to a comparable degree. The advantage of peripheral treatments is that they circumvent the withdrawal-like effects of naloxone. Moreover, neurons located in the nucleus of the solitary tract are no longer activated by fentanyl when naloxone methiodide is administered, suggesting that these responses are mediated by peripheral mu opioid receptors. The results delineate a role for peripheral mu opioid receptors in fentanyl-derived respiratory depression and identify a potentially advantageous approach to treating overdoses without inflicting withdrawal on the patients.

      Strengths:

      The strengths of the article include the intravenous delivery of all compounds, which increases the translational value of the article. The authors address both the prevention and reversal of fentanyl-derived respiratory depression. The experimental design and data interpretation are rigorous and appropriate controls were used in the study. Multiple doses were screened in the study and the approaches were multipronged. The authors demonstrated the activation of NTS cells using multiple techniques and the study links peripheral activation of mu opioid receptors to central activation of NTS cells. Both males and females were used in the experiments. The authors demonstrate the peripheral restriction of naloxone methiodide.

      Weaknesses:

      Naloxone is already broadly used to prevent overdoses from opioids, so in some respects the effects reported here are somewhat incremental.

      The reviewer is correct that naloxone is the standard antidote for reversing opioid-induced respiratory depression. However, its limitations, including the risk of precipitated withdrawal, are well-documented in both preclinical and clinical studies. The likelihood of withdrawal increases when multiple doses of naloxone are administered. Since naloxone-induced withdrawal is centrally mediated, this study aimed to evaluate a peripherally restricted MOR antagonist for its ability to prevent or reverse fentanyl-induced respiratory depression. A key finding is that NLXM reversed OIRD without inducing aversive behavior. This suggests that peripheral antagonists like NLXM may be integrated into intervention strategies that save lives while preventing the adverse behavioral and physiological effects that are observed after treatment with naloxone.

      Reviewer #3 (Public review):

      Summary:

      This manuscript outlines a series of very exciting and game-changing experiments examining the role of peripheral MORs in OIRD. The authors outline experiments that demonstrate a peripherally restricted MOR antagonist (NLX Methiodide) can rescue fentanyl-induced respiratory depression and this effect coincides with a lack of conditioned place aversion. This approach would be a massive boon to the OUD community, as there are a multitude of clinical reports showing that naloxone rescue post fentanyl over-intoxication is more aversive than the potential loss-of-life to the individuals involved. This important study reframes our understanding of successful overdose rescue with potential for reduced aversive withdrawal effects.

      Strengths:

      Strengths include the plethora of approaches arriving at the same general conclusion, the inclusion of both sexes and the result that a peripheral approach for OIRD rescue may side-step severe negative withdrawal symptoms of traditional NLX rescue.

      Weaknesses:

      The major weakness of this version relates to how the data analysis assessed sex-specific contributions to the results.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Some points for the authors to consider are:

      (1) In the Abstract, it is unclear why "high potency and lipophilicity" contribute to opioid-induced respiratory depression.

      The higher potency of fentanyl compared to other opioids significantly increases the risk of overdose and subsequent respiratory depression. Its high lipophilicity facilitates rapid absorption and central nervous system penetration, which contributes to the rapid onset of cardiorespiratory depression. The narrow therapeutic window of fentanyl further emphasizes the critical need for timely intervention when an overdose has occurred, and for effective antagonists to reverse respiratory depression and save lives. We have revised the abstract to clarify these points.

      (2) Are the doses of fentanyl used in the study (2, 20, or 50 µg/kg IV) relevant to those achieved by fentanyl-exposed human drug users?

      In these studies, we intravenously administered three doses of fentanyl. The human equivalent doses (HED) of 20 µg/kg and 50 µg/kg fentanyl are ~3 µg/kg and ~8 µg/kg, respectively. These doses have previously been shown to induce respiratory depression in humans (Dahan et al., 2005).
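
      The response does not state how the HED values were derived. As a hedged illustration, the sketch below assumes the body-surface-area scaling convention, with Km factors of 6 for rat and 37 for an adult human; these factors and the function name are assumptions introduced here, not details taken from the manuscript. Under that assumption, the rat doses convert to values close to the ~3 and ~8 µg/kg quoted above.

```python
# Assumed body-surface-area conversion factors (not stated in the source).
RAT_KM = 6.0
HUMAN_KM = 37.0


def human_equivalent_dose(animal_dose_ug_per_kg, animal_km=RAT_KM, human_km=HUMAN_KM):
    """Scale an animal dose (ug/kg) to a human equivalent dose (ug/kg)."""
    return animal_dose_ug_per_kg * animal_km / human_km


for dose in (20.0, 50.0):
    hed = human_equivalent_dose(dose)
    print(f"{dose:g} ug/kg in rat ~ {hed:.1f} ug/kg HED")
```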

      (3) In Figure 1, it appeared that only a small fraction of tyrosine hydroxylase-positive (TH+) neurons expressed cFos in response to fentanyl, and the degree of cFos expression was largely similar across all fentanyl doses tested. Thus, it is unclear whether TH+ neurons play a role in fentanyl-induced respiratory depression, and the value of these data is unclear (see point #6 below also).

      As shown in the mean data, the lowest dose of fentanyl, which was below the threshold for inducing OIRD, activated approximately 50% of tyrosine hydroxylase-positive (TH+) nTS neurons. In contrast, the highest dose of fentanyl resulted in a statistically significant increase, with ~75% of TH+ cells co-expressing Fos-IR.

      We included the assessment of catecholaminergic nTS cells for several reasons. The regions of the nTS evaluated in this study contain high expression of MOR and are the termination points of sensory afferent fibers transmitting cardiorespiratory information to the nTS (Aicher et al., 2000; Furdui et al., 2024). Catecholaminergic cells receive direct excitatory inputs from visceral afferents (Appleyard et al., 2007) and exhibit intensity-dependent increases in Fos-IR in rats exposed to hypoxic air (Kline et al., 2010; King et al., 2012). These neurons are essential for generating appropriate cardiorespiratory responses to hypoxic challenges (Bathina et al., 2013; King et al., 2015). As the reviewer notes, rats exposed to fentanyl exhibit a high degree of Fos-IR in the nTS, including catecholaminergic neurons. Despite the robust fentanyl-induced activation (increased Fos-IR) of nTS neurons, there appears to be a failure to initiate appropriate chemoreflex-mediated cardiorespiratory responses. Our photometry data further indicate that fentanyl-induced changes in neuronal activity are mediated, in part, by peripheral MOR. Collectively, these findings suggest that fentanyl impacts nTS activity through alterations in peripheral afferent signaling to the nTS, which may contribute to the severity and duration of OIRD.

      (4) It would help with the flow of the paper if the pharmacokinetic data shown in Figure 6 were presented earlier (as part of Figure 2).

      We have moved the biodistribution data earlier in the manuscript, now presenting it as Figure 2. The numbering of all subsequent figures has been adjusted accordingly.

      (5) In Figure 5, there appears to be a large number of GCaMP-expressing neurons located outside the nTS. To what degree can the changes in calcium signaling, attributed to alterations in neural activity in the nTS, be explained by altered activity of neurons located outside the nTS?

      The reviewer is correct that our viral spread extends beyond the boundaries of the nTS, raising the possibility that the responses observed in Figure 5 may be influenced by neural activity of cells outside the nTS. While some viral spread beyond the target region is unavoidable, calcium transients were measured at the tip of the fiber, which was positioned directly within the nTS.

      To address this concern further, we performed Fos immunohistochemistry in a subset of animals that received bilateral GCaMP virus injections into the nTS. Following fentanyl administration (50 µg/kg IV), brains were collected two hours later. As shown in the accompanying image, we observed Fos-IR co-expression with GCaMP exclusively within the nTS boundaries. No Fos-IR was detected outside the nTS, including in GCaMP cells. Taken together, these findings support our conclusion that the data depicted in our photometry figure (now Figure 6) accurately represent fentanyl-induced activity changes in nTS neurons.

      Author response image 1.

      Arrowheads: Fos-negative GCaMP cell; Arrows: Co-labeled Fos/GCaMP cell; Asterisk: Fos+ GCaMP-negative cell

      (6) Currently, the cFos and photometry data are descriptive in nature. Are opioid-induced changes in nTS neural activity relevant to respiratory depression? If so, one might expect DREADD-mediated stimulation of the nTS neural activity (or stimulating nTS activity by some other means) would reverse fentanyl-induced respiratory depression similar to naloxone and methyl-naloxone.

      The reviewer raises an interesting point regarding the relevance of the nTS in the context of OIRD. The nTS is a major site of integration of sensory afferent information and is involved in the initiation of reflex responses that facilitate a return to homeostasis. As described above, we characterized the collective response of nTS neurons to intravenous fentanyl using both Fos immunohistochemistry and fiber photometry. Our data indicate that fentanyl-induced changes in nTS activity are strongly mediated by peripheral MOR. While the suggestion to use global chemogenetic activation of nTS neurons to reverse fentanyl-induced respiratory depression is intriguing, results from these experiments may be difficult to interpret due to the extensive heterogeneity of the nTS. However, we are currently conducting similar experiments using a more selective approach that will allow us to isolate and evaluate specific nTS phenotypes to better understand their contributions to OIRD.

      (7) Are peripherally restricted mu opioid receptor (MOR) agonists available? If so, it would strengthen the paper if such compounds could be used to show that stimulation of peripheral MORs is sufficient to induce respiratory distress independent of actions on centrally located MORs.

      Peripherally acting Mu Opioid Receptor Antagonists (PAMORAs) are indeed available and currently being evaluated in our laboratory.

      Reviewer #2 (Recommendations for the authors):

      Consider having the figures/data numbered in the order that they appear in the manuscript. Right now, Figure 6 is mentioned between Figures 1 and 2 (minor).

      Thank you for this suggestion. We have reordered the figures so that the biodistribution figure appears before the MOR antagonist pretreatment and reversal figures.

      Reviewer #3 (Recommendations for the authors):

      This manuscript outlines a series of very exciting and game-changing experiments examining the role of peripheral MORs in OIRD. The authors outline experiments that demonstrate a peripherally restricted MOR antagonist (NLX Methiodide) can rescue fentanyl-induced respiratory depression and this effect coincides with a lack of conditioned place aversion. This approach would be a massive boon to the OUD community, as there are a multitude of clinical reports showing that naloxone rescue post fentanyl over-intoxication is more aversive than the potential loss-of-life to the individuals involved. This important study reframes our understanding of successful overdose rescue with potential for reduced aversive withdrawal effects.

      While this is an exciting and important study, there are a few minor to moderate critiques for the authors to consider. These are below.

      (1) Title: "devoid of aversive effects" - While CPA is a good, cumulative indicator of potential aversive effects, it is not an exhaustive one. Since no other withdrawal measures were included, this is an overstatement.

      The reviewer is correct in noting that our analysis of aversive effects is not exhaustive. Since we only assessed changes in aversive behavior between NLX and NLXM, we believe it is more accurate to modify the title accordingly. We have changed the title from “devoid of aversive effects” to “devoid of aversive behavior” to better reflect the scope of the experiments conducted.

      (2) Page 3, top line: MOR (mu opioid receptor) is highly expressed...

      An article should likely be included prior to MOR or make plural and adjust the sentence.

      Thank you for this suggestion. We have reworked this section in the manuscript.

      (3) Figure 6D: this figure is very important for the interpretation of every single figure. It should either be moved to figure 1 or 2 or combined with figure 1 or 2.

      Thank you for this suggestion. The biodistribution figure has been moved to Figure 2.

      (4) Page 5, line 164, Figure 21-D: remove the 1.

      Done.

      (5) Sex differences (or lack thereof):

      Throughout the manuscript, the authors report a lack of sex differences. However, while the data is not powered for the distinction of sex differences, there appears to be a bi-modal distribution of the individual data points that likely correspond to sex across most experiments. For example, in Figure 2E there are both color and clear dots, which this reviewer assumes indicates sex (however, this wasn't easily apparent if it was commented on at all in the paper). If you look at the saline oxygen saturation (nadir) levels (2e), there is wide variability with the red-filled circles, but not the clear ones. This may indicate a bimodal distribution (and may be related to the baseline HR sex differences highlighted). This is also the case in Figure 2L but is perhaps more obvious in the CPA score data (Figure 4d), where it seems the nlx negative CPA effects were likely driven primarily by one sex. While this reviewer does not expect a full powering of experiments for sex differences (and also is very appreciative of the inclusion of both sexes), full raw data with sex indicated included in the supplemental data would greatly aid the field in general and allow for those with a specific interest in this area to build upon this data. Additionally, further discussion regarding the potential role of sex differences in the translational value of these findings is also warranted.

      For all bar graphs, open symbols represent females and filled symbols represent males. This information can be found in the first paragraph of the Materials and Methods section. We have also added this information to each figure for increased visibility. We appreciate the acknowledgement of our inclusion of both sexes. For all experiments, we attempted to balance by sex. Unfortunately, we occasionally had to exclude animals for technical reasons (with clogged catheters being the most common reason for exclusion). This sometimes led to an imbalance in sex in some groups, as the reviewer has noted. In the graph of oxygen saturation nadir values in Fig 2E (now Fig 3E in the revised manuscript), all animals received intravenous fentanyl at a dose of 20 µg/kg. The reviewer is correct that there is greater variability in the males (filled symbols) compared to the females (open symbols) in this graph. However, this variability in the distribution was not observed in Fig 1E or Fig 4E, in which male and female rats received an identical dose of 20 µg/kg. Taking this into account, our overall interpretation of the data is that there is a relatively minor sex difference in the responses observed after intravenous fentanyl, and the variability in Fig 3E is primarily due to a lower n compared to Fig 1E.

      All raw data will be uploaded to a data repository.

      (6) Page 7, line 209: Figure 5D should be Figure 6D.

      We have incorporated this change.

      (7) Page 8, line 267: Cure should be Curve.

      We have incorporated this change.

      (8) Discussion: Page 10, line 322 states that "no detectable NLX ... was found in brain tissue". This is incorrect based on Figure 6.

      The sentence the reviewer highlighted refers to detection of NLX or NLXM in brain tissue from animals that received intravenous NLXM. As demonstrated in the biodistribution figure (now Figure 2 in the manuscript), our data demonstrate that an intravenous injection of NLXM did not result in NLX formation in the brain. We have reworked the sentence for clarity.

      (9) jGCaMP injections: Figure 5B/C shows the distribution of the GCaMP across animals. The optic fiber is placed directly over the NTS. However, how are we certain there isn't a nearby nucleus/structure outside the NTS that is contributing to the photometry data presented in D-G?

      See our response to Reviewer #1, point (5), above.

      (10) Fiber Photometry and Sex: These studies unfortunately may have had only 1 of a sex included in the fiber photometry data. While the inclusion is overall good, the single value for a sex suggests that there are differences, given the clustering of the data. While the anesthesia may be driving this potential sex effect, it is not clear based on the data presented. For reference: https://link.springer.com/article/10.1007/s12975-012-0229-y

      The reviewer is correct that there was an imbalance of sex in this dataset. While we made every attempt to balance for sex across all experiments, we unfortunately had to exclude some animals for technical reasons (clogged catheter, missed injection site, etc). This produced an imbalance in our photometry studies and did not allow us to thoroughly evaluate sex differences in fentanyl-induced changes in neural activity or in the responses to anesthesia. We have expanded on this limitation in the discussion.

      (11) Figure 5 - the bars are not the color indicated by the legend.

      We have corrected this in the figure. Thank you.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We would like to first thank the Editor as well as the two reviewers for their enthusiasm and careful evaluation of our manuscript. We also appreciate their thoughtful and constructive comments and suggestions. They did, however, have concerns regarding experimental design, data analysis, and over-interpretation of our findings. We endeavored to address these concerns through refinement of our framing, inclusion of additional new analyses, and rewriting some parts of our discussion section. We hope our response can better explain the rationale of our experimental design and data interpretation. In addition, we also acknowledge the limitations of our present study, so that it will benefit future investigations into this topic. Our detailed responses are provided below.

      Reviewer #1 (Public Review)

      This study examines whether the human brain uses a hexagonal grid-like representation to navigate in a non-spatial space constructed by competence and trustworthiness. To test this, the authors asked human participants to learn the levels of competence and trustworthiness for six faces by associating them with specific lengths of bar graphs that indicate their levels in each trait. After learning, participants were asked to extrapolate the location from the partially observed morphing bar graphs. Using fMRI, the authors identified brain areas where activity is modulated by the angles of morphing trajectories in six-fold symmetry. The strength of this paper lies in the question it attempts to address. Specifically, the question of whether and how the human brain uses grid-like representations not only for spatial navigation but also for navigating abstract concepts, such as social space, and guiding everyday decision-making. This question is of emerging importance.

      Thanks very much again for the evaluation and comments. Please find our revision plans for each comment below.

      The weak point of this paper is that its findings do not sufficiently support the authors' arguments, and there are several reasons for this:

      (1) Does the grid-like activity reflect 'navigation over the social space' or 'navigation in sensory feature space'? The grid-like representation in this study could simply reflect the transition between stimuli (the length of bar graphs). Participants in this study associated each face with a specific length of two bars, and the 'navigation' was only guided by the morphing of a bar graph image. Moreover, no social cognition was required to perform the task in which the grid-like activity was estimated. For the social decision-making task, which was conducted separately, we do not know whether participants needed to navigate between faces in a social space. Instead, they can recall bar graphs associated with faces and compute the decision values by comparing the length of bars. Notably, in the trust game in this study, competence and trustworthiness are not equally important to make a decision (Equation 1). The expected value is more sensitive to one over the other. This also suggests that the space might not reflect social values but perceptual differences.

      The Reviewer raises an interesting point. We apologize for not being clear enough to address this possibility in our original manuscript and we will improve the clarity in our revision. To address this issue, we would like to break it into two sub-questions and answer them separately: 1) Are participants merely memorizing the values associated with each avatar, or do they place the avatars on a two-dimensional map in their internal representation? 2) If so, are the two dimensions of this internal representation social dimensions relating to competence and trust or sensory dimensions relating to bar height (i.e., social space or sensory space)?

      For the first question, we hope our analysis of the distance effect on the reaction time in the comparison task can address this issue. Specifically, it came from the idea that distance is a measure of similarity between two avatars in the 2D social space. The closer two avatars are, the more similar they are, hence distinguishing them will be harder and result in longer reaction time. If participants are merely memorizing the avatars as six isolated instances without integrating them into a low-dimensional map, then avatars should be equidistant (as if they were lying on the vertices of a 5-simplex), and would not show a distance effect. Therefore, we interpreted the stronger distance effect as a behavioural index of having a better internal map-like representation. This approach is adopted from the work by Park et al. (2020), where they used the distance effect to demonstrate that human brains map abstract relationships among entities from piecemeal learning.
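
      The geometric intuition behind this argument can be sketched as follows. The avatar coordinates below are hypothetical and chosen only for illustration; they are not the trait values used in the experiment. In a 2D map, pairwise distances vary from pair to pair, whereas six isolated instances (coded here as one-hot vectors, i.e., vertices of a 5-simplex) are all equally far apart, so distance cannot predict reaction time.

```python
from itertools import combinations
from math import dist, isclose, sqrt

# Hypothetical (competence, trustworthiness) coordinates for six avatars.
avatars_2d = {
    "A": (1.0, 5.0), "B": (2.0, 2.0), "C": (3.0, 6.0),
    "D": (4.0, 1.0), "E": (5.0, 4.0), "F": (6.0, 3.0),
}


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of named points."""
    names = sorted(points)
    return {(a, b): dist(points[a], points[b]) for a, b in combinations(names, 2)}


# "Isolated instances": each avatar on its own axis (vertices of a 5-simplex).
names = sorted(avatars_2d)
one_hot = {
    name: tuple(1.0 if i == j else 0.0 for j in range(len(names)))
    for i, name in enumerate(names)
}

map_distances = pairwise_distances(avatars_2d)
simplex_distances = pairwise_distances(one_hot)

print(sorted(round(d, 2) for d in map_distances.values()))      # distances vary
print(sorted(round(d, 2) for d in simplex_distances.values()))  # all identical
```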

      For the second question of ‘social space’ vs. ‘sensory space’, our study adopted the paradigm developed by Constantinescu et al. (2016), who used a similar approach to construct a conceptual space and found that such a space can be represented with a grid-like code in the entorhinal and prefrontal cortex. We stayed close to their original design and hoped that our work could provide, to some extent, a close replication of their result but using non-spatial social concepts instead. Indeed, this led to the limitation of our study that participants are passively traversing the artificial space rather than actively navigating in the space to make decisions/inferences. We also did not find evidence as strong as that reported in previous grid-like coding fMRI studies. This may have to do with low signal quality in the medial temporal region, although we are not entirely sure. Nevertheless, we don’t think our findings contradict or disprove previous findings in any way. Here we would also like to point to the work by Park et al. (2021). Their task involved making novel inferences in a 2D social hierarchy space, and they found that grid-like codes in the entorhinal cortex and medial prefrontal cortex support such novel inferences. Hence, we argue that results from these studies and partial evidence from our study collectively support the idea that the entorhinal cortex is important for representing abstract knowledge (spatial and non-spatial).

      (2) Does the brain have a common representation of faces in a social space? In this study, participants don't need to have a map-like representation of six faces according to their levels of social traits. Instead, they can remember the values of each trait. The evidence of neural representations of the faces in a 2-dimensional social space is lacking. The authors argued that the relationship between the reaction times and the distances between faces provides evidence of the formation of internal representations. However, this can be found without the internal representation of the relationships between faces. If the authors seek internal representations of the faces in the brain, it would be important to show that this representation is not simply driven by perceptual differences between bar graphs that participants may recall in association with each face.

      Considering these caveats, it is hard for me to agree if the authors provide evidence to support their claims.

      With regard to the common representation of faces, this is a potential limitation of our paradigm because our current task design didn’t include a stage of face presentation to properly test this question. With regard to the asymmetry between the two dimensions in determining expected value, we think that the prerequisite for identifying six-fold grid-like coding is to have an abstract space formed by orthogonal dimensions, i.e., competence and trustworthiness in our task are not correlated. In addition, the scanner task does not require computation of expected value. However, we do think that it is worth investigating whether the extent to which each dimension contributes to decision-making and inference will distort the grid-like representation of the map. Our prediction is that the entorhinal cortex will maintain a representation of the map invariant to this aspect so that it can support inferences in different contexts where different weights may be assigned to different dimensions. But this will be an interesting hypothesis for future studies to test. We hope that our revision plans, together with the above considerations, will address the Reviewer’s comments.

      Reviewer #2 (Public Review)

      Summary:

      In this work, Liang et al. investigate whether an abstract social space is neurally represented by a grid-like code. They trained participants to 'navigate' around a two-dimensional space of social agents characterized by the traits of warmth and competence, then measured neural activity as participants imagined navigating through this space. The primary neural analysis consisted of three procedures: 1) identifying brain regions exhibiting the hexagonal modulation characteristic of a grid-like code, 2) estimating the orientation of each region's grid, and 3) testing whether the strength of the univariate neural signal increases when a participant is navigating in a direction aligned with the grid, compared to a direction that is misaligned with the grid.

      From these analyses, the authors find the clearest evidence of a grid-like code in the prefrontal cortex and weaker evidence in the entorhinal cortex.
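
      The three procedures summarized above can be sketched on simulated data. This is only a generic illustration of a hexadirectional (six-fold) analysis, not the authors' pipeline: the function names, the simulated signal, the noise level and the 20° orientation are all assumptions made for the example.

```python
import numpy as np

def estimate_grid_orientation(theta, signal):
    """Steps 1-2: fit signal ~ b0 + b1*cos(6*theta) + b2*sin(6*theta)
    and convert the two weights into a putative grid orientation in [0, 60 deg)."""
    X = np.column_stack([np.ones_like(theta), np.cos(6 * theta), np.sin(6 * theta)])
    beta, *_ = np.linalg.lstsq(X, signal, rcond=None)
    phi = np.arctan2(beta[2], beta[1]) / 6.0
    return phi % (np.pi / 3)

def alignment_effect(theta, signal, phi):
    """Step 3: regression weight on cos(6*(theta - phi)) in held-out data.
    Positive values mean a stronger signal for grid-aligned directions."""
    X = np.column_stack([np.ones_like(theta), np.cos(6 * (theta - phi))])
    beta, *_ = np.linalg.lstsq(X, signal, rcond=None)
    return beta[1]

# Hypothetical simulated data: a six-fold modulated signal plus Gaussian noise.
rng = np.random.default_rng(0)
true_phi = np.deg2rad(20.0)

def simulate(n):
    theta = rng.uniform(0.0, 2.0 * np.pi, n)  # movement directions
    signal = 0.5 * np.cos(6 * (theta - true_phi)) + rng.normal(0.0, 1.0, n)
    return theta, signal

theta_train, y_train = simulate(2000)  # data used to estimate the orientation
theta_test, y_test = simulate(2000)    # independent data used to test alignment

phi_hat = estimate_grid_orientation(theta_train, y_train)
effect = alignment_effect(theta_test, y_test, phi_hat)
print(f"estimated orientation: {np.rad2deg(phi_hat):.1f} deg, alignment effect: {effect:.2f}")
```

      Estimating the orientation on one part of the data and testing alignment on another keeps the alignment test from being circular.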

      Strengths:

      The work demonstrates the existence of a grid-like neural code for a socially-relevant task, providing evidence that such coding schemes may be relevant for a variety of two-dimensional task spaces.

      Thank you very much again for your careful evaluation and thoughtful comments. Please find our response to the comments below.

      Weaknesses:

      In various parts of this manuscript, the authors appear to use a variety of terms to refer to the (ostensibly) same neural regions: prefrontal cortex, frontal pole, ventromedial prefrontal cortex (vmPFC), and orbitofrontal cortex (OFC). It would be useful for the authors to use more consistent terminology to avoid confusing readers.

      Thanks for pointing out the use of terms; we will try to improve that in the revision of our manuscript.

      Claims about a grid code in the entorhinal cortex are not well-supported by the analyses presented. The whole-brain analysis does not suggest that the entorhinal cortex exhibits hexagonal modulation; the strength of the entorhinal BOLD signal does not track the putative alignment of the grid code there; multivariate analyses do not reveal any evidence of a grid-like representational geometry.

      On a conceptual level, it is not entirely clear how this work advances our understanding of gridlike encoding of two-dimensional abstract spaces, or of social cognition. The study design borrows heavily from Constantinescu et al. 2016, which is itself not an inherent weakness, but the Constantinescu et al. study already suggests that grid codes are likely to underlie two-dimensional spaces, no matter how abstract or arbitrary. If there were a hypothesis that there is something unique about how grid codes operate in the social domain, that would help motivate the search for social grid codes specifically, but no such theory is provided. The authors do note that warmth and competence likely have ecological importance as social traits, but other past studies have used slightly different social dimensions without any apparent loss of generality (e.g., Park et al. 2021). There are some (seemingly) exploratory analyses examining how individual difference measures like social anxiety and avoidance might affect the brain and behavior in this study, but a strong theoretical basis for examining these particular measures is lacking.

      We acknowledge that we used very similar dimensions to the work by Park et al. (2021). While Park and colleagues (2021) took a more innovative and rigorous approach, we tried to stay close to the original design by Constantinescu et al. (2016) with the hope that our work could provide, to some extent, a close replication of their result. Our data were collected before the 2021 paper came out and, as the comment points out, we did not find evidence as complete and convincing as in these previous grid-like coding fMRI papers. This may be due to low signal quality in the medial temporal region, but we are not entirely sure. But we don’t think our current findings contradict or disprove previous findings in any way.

      I found it difficult to understand the analyses examining whether behavior (i.e., reaction times) and individual difference measures (i.e., social anxiety and avoidance) can be predicted by the hexagonal modulation strength in some region X, conditional on region X having a similar estimated grid alignment with some other region Y. It is possible that I have misunderstood the authors' logic and/or methodology, but I do not feel comfortable commenting on the correctness or implications of this approach given the information provided in the current version of this manuscript.

      We apologize for not being clear enough in the manuscript and we will improve the clarity in our revision. This exploratory analysis aims to examine whether there is any correlation between the strength of the grid-like representation of the social value map and behavioral indicators of a map-like representation, and to test whether there is any correlation between the strength of the grid-like representation of this social value map and participants’ social traits. For the behavioral indicator, we used the distance effect in the reaction time of the comparison task outside the scanner. The closer a pair of avatars are, the more similar they are; hence distinguishing them will be harder, resulting in longer reaction times when making comparison judgements. If participants are merely memorizing the avatars as six isolated instances without integrating them into a map, all avatars should be equidistant and there wouldn’t be a distance effect. We interpreted stronger grid-like activity as a neural index of a better representation of the 2D social space, and we interpreted a stronger distance effect as a behavioral index of having a better internal map-like representation.

      It was puzzling to see passing references to multivariate analyses using representational similarity analysis (RSA) in the main text, given that RSA is only used in analyses presented in the supplementary material.

      We speculated that RSA in the entorhinal ROI would be more sensitive than the whole-brain univariate analysis in identifying a grid-like code, because a previous paper on grid-like codes in olfactory space (Bao et al., 2019) didn’t identify a grid-like representation with univariate analysis but did identify it with RSA. However, we failed to find evidence of a grid-like code in the entorhinal ROI aligned to its own putative grid orientation with the RSA approach. We reported this result in the main text to show that we carried out a relatively thorough investigation to test the hypothesis using various approaches, and decided to add references to the RSA approach in the main text as well.

      Reviewer #3 (Public Review)

      Liang and colleagues set out to test whether the human brain uses distance and grid-like codes in social knowledge using a design where participants had to navigate in a two-dimensional social space based on competence and warmth during an fMRI scan. They showed that participants were able to navigate the social space and found distance-based codes as well as grid-like codes in various brain regions, and the grid-like code correlated with behavior (reaction times).

      On the whole, the experiment is designed appropriately for testing for distance-based and grid-like codes and is relatively well-powered for this type of study, with a large amount of behavioral training per participant. They revealed that a number of brain regions correlated positively or negatively with distance in the social space, and found grid-like codes in the frontal polar cortex and posterior medial entorhinal cortex, the latter in line with prior findings on grid-like activity in the entorhinal cortex. The current paper seems quite similar conceptually and in design to previous work, most notably by Park et al., 2021, Nature Neuroscience.

      Thanks very much again for your careful evaluation and comments. Please find our response to the comments below.

      Below, I raise a few issues and questions on the evidence presented here for a grid-like code as the basis of navigating abstract social space or social knowledge.

      (1) The authors claim that this study provides evidence that humans use a spatial / grid code for abstract knowledge like social knowledge.

      These data do not specifically add anything new to this argument. As with almost all studies that test for a grid code in a similar "conceptual" space (not only the current study), the problem is that when the space is not a uniform, square/circular, 2-dimensional space, then there is no reason the code will be perfectly grid-like, i.e., show six-fold symmetry. In real-world scenarios of social space (as well as navigation, semantic concepts), it must be higher dimensional - or at least more than two-dimensional. It is unclear if this generalizes to larger spaces where not all parts of the space are relevant. Modelling work from Tim Behrens' lab (e.g., Whittington et al., 2020) and Bradley Love's lab (e.g., Mok & Love, 2019) has shown/argued this to be the case. In experimental work, like in mazes from the Mosers' labs (e.g., Derdikman et al., 2009), or trapezoid environments from the O'Keefe lab (Krupic et al., 2015), there are distortions in mEC cells, and these cells would not pass as grid cells in terms of the six-fold symmetry criterion.

      The authors briefly discuss the limitations of this at the very end but do not really say how this speaks to the goal of their study and the claim that social space or knowledge is organized as a grid code and if it is in fact used in the brain in their study and beyond. This issue deserves to be discussed in more depth, possibly referring to prior work that addressed this, and raising the issue for future work to address the problem - or if the authors think it is a problem at all.

      Thanks very much for the references to the papers that we haven’t considered enough in our discussion. We will endeavour to discuss the topic in more depth in our revision. In summary, we raise this discussion point because various research groups have found gridlike representations in 2D artificial conceptual space. We think that the next step for a stronger claim would be to find the representation of more spontaneous non-spatial maps.

      Data and analysis

      (2) Concerning the negative correlation of distance with activation in the fusiform gyrus and visual cortex: this is a slightly puzzling but potentially interesting finding. However, could this be related to reaction times? The larger the distance, the longer the reaction times, so the original finding might reflect larger activations with smaller distances.

      Thanks very much for the suggestion. However, we didn’t find a correlation between response time in the choice stage of the scanner task and the negative distance activation in the fusiform gyrus (Figures below). Moreover, because the morph period in each trial remains the same, the negative correlation of distance with activation in the fusiform gyrus could also be interpreted as a negative correlation of morphing speed with activation in the fusiform gyrus. Indeed, stronger negative activation indicates larger activation for smaller distances, but we are uncertain what it indicates concerning the functional role of the fusiform gyrus in our current task.

      Author response image 1.

      (3) Concerning the correlation of grid-like activity with behavior: is the correlation with reaction time just about how long people took (rather than a task-related neural signal)? The authors have only reported correlations with reaction time. The issue here is that the duration of reaction times also relates to the starting positions of each trial and where participants will navigate to. Considering the speed-accuracy tradeoff, could performance accuracy be negatively correlated with these grid consistency metrics? Or it could be positively correlated, which would suggest the grid signal reflects a good representation of the task.

      We apologize for not being clear enough in the manuscript and we will improve the clarity in our revision. The reaction time used to calculate the distance effect is from a task outside the scanner. The closer a pair of avatars are, the more similar they are; hence distinguishing them will be harder, resulting in longer reaction times when making comparison judgements. If participants are merely memorizing the avatars as six isolated instances without integrating them into a map, all avatars should be equidistant and there wouldn’t be a distance effect. We interpreted stronger grid-like activity as a neural index of a better representation of the 2D social space, and we interpreted a stronger distance effect as a behavioural index of having a better internal map-like representation. This was the motivation behind this analysis.

      References

      Bao, X., Gjorgieva, E., Shanahan, L. K., Howard, J. D., Kahnt, T., & Gottfried, J. A. (2019). Grid-like Neural Representations Support Olfactory Navigation of a Two-Dimensional Odor Space. Neuron, 102(5), 1066-1075 e1065. https://doi.org/10.1016/j.neuron.2019.03.034

      Constantinescu, A. O., O'Reilly, J. X., & Behrens, T. E. J. (2016). Organizing conceptual knowledge in humans with a gridlike code. Science, 352(6292), 1464-1468. https://doi.org/10.1126/science.aaf0941

      Park, S. A., Miller, D. S., & Boorman, E. D. (2021). Inferences on a multidimensional social hierarchy use a grid-like code. Nat Neurosci, 24(9), 1292-1301. https://doi.org/10.1038/s41593-021-00916-3

      Park, S. A., Miller, D. S., Nili, H., Ranganath, C., & Boorman, E. D. (2020). Map Making: Constructing, Combining, and Inferring on Abstract Cognitive Maps. Neuron, 107(6), 1226-1238 e1228. https://doi.org/10.1016/j.neuron.2020.06.030

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      There has been intense controversy over the generality of Hamilton's inclusive fitness rule for how evolution works on social behaviors. All generally agree that relatedness can be a game changer, for example allowing for otherwise unselectable altruistic behaviors when 𝑐 < 𝑟𝑏, where 𝑐 is the fitness cost to the altruism, 𝑏 is the fitness benefit to another, and 𝑟 their relatedness. Many complications have been successfully incorporated into the theory, including different reproductive values and viscous population structures.

      I agree, especially if by incorporating viscous population structures, the reviewer means the discovery of the cancellation effect (Wilson, Pollock, and Dugatkin, 1992, Taylor, 1992).

      The controversy has centered on another dimension; Hamilton's original model was for additive fitness, but how does his result hold when fitnesses are non-additive? One approach has been not to worry about a general result but just find results for particular cases. A consistent finding is that the results depend on the frequency of the social allele - nonadditivity causes frequency dependence that was absent in Hamilton's approach.

      Just to be extra precise: Hamilton’s (1964) original model did not use the Price equation nor the regression approach to define costs and benefits, and it did indeed simply presuppose fixed, additive fitness effects.

      Also for extra precision on terminology: many researchers will describe all fitnesses in social evolution as frequency dependent. The reason they do is that, with or without additivity, both the fitness of cooperators (with the social allele) and the fitness of defectors (without the social allele) typically increase in the frequency of cooperators in the population; the more cooperators there are, the more individuals run into them, which increases average fitness. I take the result depending on the frequency to mean that which of those two fitnesses is larger flips at a certain frequency, which automatically implies that the difference between them depends on the frequency of the social allele. This is indeed the result of non-additivity. We will return to this in more detail in the response to Reviewer #3. Also, at the end of Appendix B, I have added a bit to be extra precise regarding frequency dependence.
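
      The distinction between these two senses of frequency dependence can be made concrete with a toy calculation. The sketch below assumes pairwise interactions with randomly drawn partners (no assortment) and a hypothetical synergy parameter `d` that is only paid out when two cooperators meet; the function name and all parameter values are chosen purely for illustration.

```python
def fitnesses(p, b=2.0, c=1.0, d=0.0):
    """Expected fitness of cooperators and defectors when partners are drawn
    at random from a population with cooperator frequency p (toy model).

    b: benefit received from a cooperating partner
    c: cost paid by a cooperator
    d: non-additive (synergy) term, received only when both cooperate
    """
    w_cooperator = 1.0 - c + p * (b + d)
    w_defector = 1.0 + p * b
    return w_cooperator, w_defector

for d in (0.0, 2.5):
    print(f"d = {d}")
    for p in (0.1, 0.5, 0.9):
        w_c, w_d = fitnesses(p, d=d)
        print(f"  p = {p:.1f}: cooperator {w_c:.2f}, defector {w_d:.2f}, difference {w_c - w_d:+.2f}")
```

      In both cases each fitness increases with the frequency of cooperators. With `d = 0` the difference stays at `-c` for every frequency, whereas with `d = 2.5` the difference equals `-c + d*p` and changes sign at `p = c/d = 0.4`.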

      Two other approaches derive from Queller via the Price equation. Queller 1 is to find forms like Hamilton's rule, but with additional terms that deal with non-additive interaction, each with an r-like population structure variable multiplied by a b-like fitness effect (Queller, 1985). Queller 2 redefines the fitness effects c and b as partial regressions of the actor's and recipient's genes on fitness. This leaves Hamilton's rule intact, just with new definitions of c and b that depend on frequency (Queller, 1992a).

      Queller 2 is the version that has been most adopted by the inclusive fitness community along with assertions that Hamilton's rule is completely general. In this paper, van Veelen argues that Queller 1 is the correct approach. He derives a general form that Queller only hinted at. He does so within a more rigorous framework that puts both Price's equation and Hamilton's rule on firmer statistical ground. Within that framework, the Queller 2 approach is seen to be a statistical misspecification - it employs a model without interaction in cases that actually do have interaction. If we accept that this is a fatal flaw, the original version of Hamilton's rule is limited to linear fitness models, which might not be common.

      I totally agree.

      Strengths:

      While the approach is not entirely new, this paper provides a more rigorous approach and a more general result. It shows that both Queller 1 and Queller 2 are identities and give accurate results, because both are derived from the Price equation, which is an identity. So why prefer Queller 1? It identifies the misspecification issue with the Queller 2 approach and points out its consequences. For example, it will not give the minimum squared differences between the model and data. It does not separate the behavioral effects of the individuals from the population state (𝑏 and 𝑐 become dependent on 𝑟 and the population frequency).

      Just to be precise on a detail: in the data domain, as long as the number of parameters in a statistical model is lower than the number of data points, adding parameters typically (generically) lowers the sum of squared errors. That is to say, for an underspecified statistical model, the sum of squared errors goes down if a parameter is added, but for an already overspecified statistical model, the same is still true (although, typically, by how much the sum of squared errors is reduced will differ). The model specification task for a statistician includes knowing when to keep adding parameters, because the data suggest that the model is still underspecified, and when to stop adding parameters, because the model is well-specified, even if adding parameters still reduces the sum of squared errors.

      In a modeling context, on the other hand, one can say that sum of squared differences will stop decreasing at the point where the statistical model is well-specified, that is: when it matches the model we are considering.
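
      A small numerical sketch of this contrast, under assumptions made only for the example: fitness depends on own type `x`, partner type `y` and their interaction, the coefficients and noise level are arbitrary, and `z` is a regressor that has nothing to do with fitness.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.integers(0, 2, n).astype(float)  # own type (hypothetical)
y = rng.integers(0, 2, n).astype(float)  # partner type (hypothetical)
z = rng.normal(size=n)                   # regressor unrelated to fitness

def sse(columns, w):
    """Sum of squared errors of an OLS fit of w on a constant plus the given columns."""
    X = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(X, w, rcond=None)
    return float(np.sum((w - X @ beta) ** 2))

# Modeling domain: fitness is postulated exactly, including the interaction term.
model_w = 1.0 - 1.0 * x + 2.0 * y + 1.5 * x * y
# Data domain: the same process observed with noise.
data_w = model_w + rng.normal(0.0, 0.5, n)

specs = {
    "additive": [x, y],
    "with interaction": [x, y, x * y],
    "plus unrelated term": [x, y, x * y, z],
}
sse_data = {name: sse(cols, data_w) for name, cols in specs.items()}
sse_model = {name: sse(cols, model_w) for name, cols in specs.items()}

for name in specs:
    print(f"{name:>20}: data SSE = {sse_data[name]:8.3f}, model SSE = {sse_model[name]:.3e}")
```

      With noisy data the sum of squared errors keeps going down, even when the unrelated term is added; without noise it is numerically zero as soon as the interaction term is included, and adding further terms cannot lower it.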

      The paper also shows how the same problems can apply to non-social traits. Epistasis is the non-additivity of effects of two genes within the individual. (So one wonders why have we not had a similarly fierce controversy over how we should treat epistasis?)

      The paper is clearly written. Though somewhat repetitive, particularly in the long supplement, most of that repetition has the purpose of underscoring how the same points apply equally to a variety of different models.

      Finally, this may be a big step towards reconciliation in the inclusive fitness wars. Van Veelen has been one of the harshest critics of inclusive fitness, and now he is proposing a version of it.

      I am very happy to hear this, because I am indeed hopeful for reconciliation. I would like to add a comment, though. The debate on Hamilton’s rule/inclusive fitness is regularly thought of as a battle between two partisan camps, where both sides care at least as much about winning as they do about getting things right. This is totally understandable, because to some degree that is true. Also, I agree that it is fair to position me in the camp that is critical of the inclusive fitness literature. However, I would like to think that I have not been taking random shots at Hamilton’s rule. I have pointed to problems with the typical use of the Price equation and Hamilton’s rule, and I think I did so for very good reasons. I am obviously very happy that finding the Generalized Price equation, and the general version of Hamilton’s rule, allowed me to go beyond this, and (finally) offer a correct alternative, and I totally appreciate that this opens the door for reconciliation, as this reviewer points out. But I would not describe this as a road-to-Damascus moment. In order to illustrate the continuity in my work, I would like to point to three papers.

      In van Veelen (2007), I pointed to the missing link between the central result in Hamilton’s (1964) famous paper (which states that selection dynamics take the population to a state where mean inclusive fitness is maximized), and Hamilton’s actual rule (which states that selection will lead to individuals maximizing their individual inclusive fitness). My repair stated the additional assumptions that were necessary to make the latter follow from the former. I would say that this can hardly be characterized as an attack on Hamilton’s rule. Reading Hamilton (1964) with enough care to notice something is missing, and then repairing it, I think is a sign of respect, and not an attack.

      Van Veelen (2011) is about the replicator dynamics for n-player games, with the possibility of assortment. This puts the paper in a domain that does not assume weak selection, and that is typically not much oriented towards inclusive fitness. I included a theorem that implies that, under the condition of linearity, inclusive fitness not only gets the direction of selection right, but 𝑟𝑏 − 𝑐 becomes a parameter that also determines the speed of selection. This I think is representative, in the sense that in many of my papers, I carefully stake out when the classic version of Hamilton’s rule does work.

      In Akdeniz and van Veelen (2020), we moreover take a totally standard inclusive fitness approach in a model of the cancellation effect at the group level.

      I would say that this does not line up with the image of a harsh critic that takes random shots at Hamilton’s rule or inclusive fitness.

      Weaknesses:

      van Veelen argues that the field essentially abandoned the Queller 1 approach after its publication. I think this is putting it too strongly - there have been a number of theoretical studies that incorporate extra terms with higher-order relatednesses. It is probably accurate to say that there has been relative neglect. But perhaps this is partly due to a perception that this approach is difficult to apply.

      I can imagine that the perceived difficulty in application may have played a role in the neglect of the Queller 1 approach. What for sure has played a role, and I would think a much bigger one, is that the literature has been pretty outspoken that the Queller 1 approach is the wrong way to go. The main text cites a number of papers that hold this position very emphatically (The first one of those was a News and Views by Alan Grafen (1985) that accompanied the paper in which Queller presented his Queller 1 approach. I am very happy that Appendix B shows on how many levels this News and Views was wrong.). There is only a handful of papers that follow the Queller 1 example.

      The model in this paper is quite elegant and helps clarify conceptual issues, but I wonder how practical it will turn out to be. In terms of modeling complicated cases, I suspect most practitioners will continue doing what they have been doing, for example using population genetics or adaptive dynamics, without worrying about neatly separating out a series of terms multiplying fitness coefficients and population structure coefficients.

      I am not sure if I see what the reviewer envisions practitioners that use population genetics will keep on doing. I would think that the Generalized Price equation in regression form is a description of population genetic dynamics, and therefore, if practitioners will not make an effort to “neatly separate out a series of terms multiplying fitness coefficients and population structure coefficients”, then all I can say is that they should. I cannot do more than explain why, if they do not, they are at risk of mischaracterizing what gets selected and why.

      Regarding those that use adaptive dynamics, I would say that this is a whole different approach. Within this approach, one can also apply inclusive fitness; see Section 6 and Appendix D of van Veelen et al. (2017). Appendix D is full of deep technical results and was done by Benjamin Allen.

      For empirical studies, it is going to be hard to even try to estimate all those additional parameters. In reality, even the standard Hamilton's rule is rarely tested by trying to estimate all its parameters. Instead, it is commonly tested more indirectly, for example by comparative tests of the importance of relatedness. That of course would not distinguish between additive and non-additive models that both depend on relatedness, but it does test the core idea of kin selection. It will be interesting to see if van Veelen's approach stimulates new ways of exploring the real world.

      Regarding the impact on empirical studies, there are a few things that I would like to say. The first is that I would just like to repeat, maybe a bit more elaborately, what I wrote at the end of the main text. Given that the generalized version of Hamilton’s rule produces a host of Hamilton-like rules, and given the fact that all of them by construction indicate the direction of selection accurately, the question whether or not Hamilton’s rule holds turns out to be ill-posed. That means that we can stop doing empirical tests of Hamilton’s rule, which are predicated on the idea that Hamilton’s rule, with benefits and costs being determined by the regression method, could be violated – which it cannot (Side note: it is possible to violate Hamilton’s rule, if costs and benefits are defined according to the counterfactual method; see van Veelen et al. (2017) and van Veelen (2018). This way of defining costs and benefits is less common, although there are authors that find this definition natural enough to assume that this is the way in which everybody defines costs and benefits (Karlin and Matessi, 1983, Matessi and Karlin, 1984).). Instead, we should do empirical studies to find out which version of Hamilton’s rule applies to which behaviour in which species.

      I would like to not understate what a step forward this is. The size of the step forward is of course also due to the dismal point of departure. As theorists, we have failed our empiricists, because all 12 studies included in the review by Bourke (2014) of papers that explicitly test Hamilton’s rule are based on the misguided idea that the traditional Hamilton’s rule, with costs and benefits defined according to the regression method, can be violated. While the field does sometimes have disdain for mathematical nit-picking, this is a point where a little more attention to detail would have really helped. If the hypothesis is that Hamilton’s rule holds, and the null is that it does not, then trying to specify how the empirical quantity that reflects inclusive fitness would be distributed under the null hypothesis (in order to do the right statistical tests) would have forced researchers to do something with the information that this quantity is not distributed at all, because Hamilton’s rule is general (in the sense that it holds for any way in which the world works). If one would prefer to reverse the null and the alternative hypothesis, one would run into similar problems. Understanding that the question is ill-posed therefore is a big step forward from the terrible state of statistics and the waste of research time, attention and money on the empirical side of this field (see also Section 8 of van Veelen et al., 2017).

      I would agree that doing comparative statics may not be much affected by this. Section 5 of van Veelen et al. (2017) indicates that there can be a large set of circumstances under which the general idea “relatedness up → cooperation up” still applies. But that may be a bit unambitious, and Section 8 of van Veelen et al. (2017), and the final section of van Veelen (2018) contain some reflections on empirical testing that may allow us to go beyond that. As long as there is change happening in the Generalized Price equation, the population is not in equilibrium. For empirical tests, one can either aim to capture selection as it happens, or assume that what we observe reflects properties of an equilibrium. This leads to interesting reflections on how to do empirics, which may differ between traits that are continuous and traits that are discrete (again: see van Veelen et al. (2017), and van Veelen (2018)).

      Reviewer #2 (Public review):

      Summary:

      This manuscript reconsiders the "general form" of Hamilton's rule, in which "benefit" and "cost" are defined as regression coefficients. It points out that there is no reason to insist on Hamilton's rule of the form −𝑐 + 𝑏𝑟 > 0, and that, in fact, arbitrarily many terms (i.e. higher-order regression coefficients) can be added to Hamilton's rule to reflect nonlinear interactions. Furthermore, it argues that insisting on a rule of the form −𝑐 + 𝑏𝑟 > 0 can result in conditions that are true but meaningless and that statistical considerations should be employed to determine which form of Hamilton's rule is meaningful for a given dataset or model.

      Totally right. I cannot help wanting to be extra precise, though, by distinguishing between the data domain and the modelling domain. In the data domain, statistical considerations apply in order to avoid misspecification. In this domain, avoiding misspecification can be complicated, because we do not know the underlying data generating process, and we depend on noisy data to make a best guess. In the modelling domain, however, there is no excuse for misspecification, as the model is postulated by the modeller. I therefore would think that in this domain, it does not really require “statistical considerations” to minimize the probability of misspecification; we can get the probability of misspecification all the way down to 0 by just choosing not to do it.
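
      What misspecification in the modelling domain does to the regression coefficients can be illustrated with a toy model. Everything in this sketch is an assumption made for the example rather than the paper's own derivation: interactions are pairwise, `r` is an assortment probability, fitness is postulated as `1 - c*x + b*y + d*x*y` for own type `x` and partner type `y`, and the function names are hypothetical.

```python
import numpy as np

def pair_distribution(p, r):
    """Probability of each (own type, partner type) pair when cooperators have
    frequency p and partners are identical by assortment with probability r."""
    return {
        (1, 1): p * (r + (1 - r) * p),
        (1, 0): p * (1 - r) * (1 - p),
        (0, 1): (1 - p) * (1 - r) * p,
        (0, 0): (1 - p) * (r + (1 - r) * (1 - p)),
    }

def regression_coefficients(p, r, include_interaction, b=2.0, c=1.0, d=1.5):
    """Population-level least-squares fit of the postulated fitness on a constant,
    own type and partner type, with or without the interaction term."""
    rows, fitness, weights = [], [], []
    for (x, y), prob in pair_distribution(p, r).items():
        rows.append([1.0, x, y] + ([x * y] if include_interaction else []))
        fitness.append(1.0 - c * x + b * y + d * x * y)
        weights.append(prob)
    s = np.sqrt(np.array(weights))  # weighted least squares via sqrt-weights
    X = np.array(rows, dtype=float) * s[:, None]
    beta, *_ = np.linalg.lstsq(X, np.array(fitness) * s, rcond=None)
    return beta

for p in (0.2, 0.5, 0.8):
    additive = regression_coefficients(p, r=0.25, include_interaction=False)
    full = regression_coefficients(p, r=0.25, include_interaction=True)
    print(f"p = {p}: additive fit {np.round(additive, 3)}, full fit {np.round(full, 3)}")
```

      In the fit that includes the interaction term, the coefficients match the postulated parameters at every frequency. In the fit without it, the coefficients on own type and partner type change with the population state, even though the behaviour itself has not changed.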

      Strengths:

      The point is an important one. While it is not entirely novel, as the idea of adding extra terms to Hamilton's rule has arisen sporadically (Queller, 1985, 2011; Fletcher et al., 2006; van Veelen et al., 2017), it is very useful to have a systematic treatment of this point. I think the manuscript can make an important contribution by helping to clarify a number of debates in the literature. I particularly appreciate the heterozygote advantage example in the SI.

      Me too, and I really hope the readers make it this far! I have thought of putting it in the main text, but did not know where that would fit.

      Weaknesses:

      Although the mathematical analysis is rigorously done and I largely agree with the conclusions, I feel there are some issues regarding terminology, some regarding the state of the field, and the practice of statistics that need to be clarified if the manuscript is truly to resolve the outstanding issues of the field. Otherwise, I worry that it will in some ways add to the confusion.

      (1) The "generalized" Price equation: I agree that the equations labeled (PE.C) and (GPE.C) are different in a subtle yet meaningful way. But I do not see any way in which (GPE.C) is more general than (PE.C). That is, I cannot envision any circumstance in which (GPE.C) applies but (PE.C) does not. A term other than "generalized" should be used.

      This is a great point! Just to make sure that those who read the reports online understand this point, let me add some detail. The equation labeled (PE.C), which is short for Price equation in covariance form, is

      $$\bar{w}\Delta\bar{p} = Cov(w, p) + E(w\Delta p) \qquad \text{(PE.C)}$$

      The derivation in Appendix A then assumes that we have a statistical model that includes a constant and a linear term for the p-score. It then defines the model-estimated fitness of individual $i$ as $\hat{w}_i = w_i - \varepsilon_i$, where $w_i$ is the realized number of offspring of individual $i$, and $\varepsilon_i$ is the error term; it is the sum over all individuals of the squared error terms that is minimized. The vector of model-estimated fitnesses will typically be different for different choices of the statistical model. Appendix A then goes on to show that, whatever statistical model is used, $Cov(w, p) = Cov(\hat{w}, p)$, as long as the statistical model includes a constant and a linear term for the p-score. That means that we can rewrite (PE.C) as

      $$\bar{w}\Delta\bar{p} = Cov(\hat{w}, p) + E(w\Delta p) \qquad \text{(GPE.C)}$$

      The point that the reviewer is making is that this is not really a generalization. For a given dataset (or, more generally, for a given population transition, whether empirical or in a model), $Cov(w, p)$ is just a number, and it happens to be the case that $Cov(\hat{w}, p)$ returns the same number, whatever statistical model we use for determining what the model-estimated fitnesses $\hat{w}_i$ are (as long as the statistical model includes a constant and a linear term for the p-score). In other words, (PE.C) is not really nested in (GPE.C), so (GPE.C) is not a proper generalization of (PE.C).

      This is a totally correct point, and I had actually struggled a bit with the question of what terminology to use here. Equation (GPE.C) is definitely general, in the sense that we can change the statistical model, and thereby change the vector of model-estimated fitnesses $\hat{w}$, but as long as we keep the constant and the linear term in the statistical model, the equation still applies. But it is not a generalization of (PE.C).
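
      For readers of the online reports, the invariance described above can be checked numerically. The following is only a minimal sketch under stated assumptions: the p-scores, the offspring numbers, and the helper names (`cov`, `solve`, `fitted_values`) are hypothetical and are not taken from the manuscript. It fits two least-squares models, one with a constant and a linear term in the p-score and one that adds a quadratic term, and compares the Cov-term computed from realized fitnesses with the Cov-term computed from model-estimated fitnesses.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Cov-term as used in the Price equation: divides by N (no Bessel's correction)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fitted_values(columns, w):
    """Least-squares fit of w on the regressor columns; returns model-estimated fitnesses."""
    k = len(columns)
    xtx = [[sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xtw = [sum(a * b for a, b in zip(columns[i], w)) for i in range(k)]
    coef = solve(xtx, xtw)
    return [sum(c * col[i] for c, col in zip(coef, columns)) for i in range(len(w))]

# Hypothetical data: p-scores 0, 0.5, 1 and arbitrary offspring numbers.
random.seed(1)
N = 51
p = [(i % 3) / 2 for i in range(N)]
w = [float(random.randint(0, 4)) for _ in range(N)]

ones = [1.0] * N
w_hat_linear = fitted_values([ones, p], w)                          # constant + linear term
w_hat_quadratic = fitted_values([ones, p, [x * x for x in p]], w)   # adds a quadratic term

cov_realized = cov(w, p)
cov_linear = cov(w_hat_linear, p)
cov_quadratic = cov(w_hat_quadratic, p)
print(cov_realized, cov_linear, cov_quadratic)
```

      The three printed numbers agree up to rounding error. The reason is that least-squares residuals are orthogonal to every regressor in the model, so with a constant and a linear p-term included, the residuals sum to zero and have zero cross-product with the p-scores.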

      I do however have a hard time coming up with a better label. The General Price equation may be a bit better, but it still suggests generalization. The Statistical Model-based Price equation does not suggest or imply generalization, but it does not convey how general it is, and it suggests that it could be an alternative to the normal Price equation that one may or may not choose to use – while this version really is the one we should use. It may moreover create the impression that this is only for doing statistics, and one might use the traditional Price equation for anything that is not statistics. I cannot really think of other good alternatives, but I am of course open to suggestions.

      So, for lack of a better label, I called this the Generalized Price equation in covariance form. Though clearly imperfect, there are still a few good things about this label. The first is that, as mentioned above, this equation is general, in the sense that it holds, regardless of the statistical model. The second reason is that this is Step 1 in a sequence of three steps, the other two of which do produce proper generalizations. Step 2 goes from this equation in covariance form to the Generalized Price Equation in regression form, which is a proper generalization of the traditional Price equation in regression form. Step 3 goes from the Generalized Price Equation in regression form to the general version of Hamilton’s rule, which is also a proper generalization of the classical Hamilton’s rule. Since I would suggest that Step 1 on its own is kind of useless, and therefore Step 1 and Step 2 will typically come as a package, I would be tempted to think that this justifies the abuse of terminology for the Price Equation in covariance form. I did however add the observation made by the reviewer at the point where the Generalized Price equation (in both forms) is derived, so I hope this at least partly addresses this concern.

      (2) Regression vs covariance forms of the Price equation: I think the author uses "generalized" in reference to what Price called the "regression form" of his equation. But to almost everyone in the field, the "Price Equation" refers to the covariance form. For this reason, it is very confusing when the manuscript refers to the regression form as simply "the Price Equation".

      As an example, in the box on p. 15, the manuscript states "The Price equation can be generalized, in the sense that one can write a variety of Price-like equations for a variety of possible true models, that may have generated the data." But it is not the Price equation (covariance form) that is being generalized here. It is only the regression that Price used that is being generalized.

      To be consistent with the field, I suggest the term "Price Equation" be used only to refer to the covariance form unless it is otherwise specified as in "regression form of the Price equation".

      I am not sure about the level of confusion induced here, but I totally see that it can be helpful to avoid all ambiguity. I therefore went over everything, and whenever I wrote “Price equation”, I tried to make sure it comes either with “in covariance form” or with “in regression form”. At some places, it is a bit over the top to keep repeating “in regression form”, when it is abundantly clear which form is being discussed. Also, I added no qualifiers if a statement is true for both forms of the Price equation, or if the claim refers to the whole package of going through Step 1 and Step 2 mentioned above.

      (3) Sample covariance: The author refers to the covariance in the Price equation as “sample covariance”. This is not correct, since sample covariance has a denominator of N-1 rather than N (Bessel’s correction). The correct term, when summing over an entire population, is “population covariance”. Price (1972) was clear about this: “In this paper we will be concerned with population functions and make no use of sample functions”. This point is elaborated on by Frank (2012), in the subsection “Interpretation of Covariance”.

      I totally agree. On page 418 of van Veelen (2005), I wrote:

      “Another possibility is that we think of 𝑧<sub>i</sub> and 𝑞<sub>i</sub>, 𝑖 = 1,…,𝑁 as realizations of a jointly distributed random variable. […] In that case the expression between square brackets is a good approximation for what statisticians […] call a sample covariance. A sample covariance is defined as $\frac{1}{N-1}\sum_{i=1}^{N}(z_i - \bar{z})(q_i - \bar{q})$, but in large samples it is OK to replace 𝑁 − 1 by 𝑁, and then this formula reduces to Price’s 𝐶𝑜𝑣(𝑧, 𝑞).”

      In van Veelen et al. (2012), I slid a little, because in Box 1 on page 66, I wrote that the covariance term is the sample covariance, and only in footnote 1 on the same page did I include Bessel’s correction, when I wrote:

      “To be perfectly precise, the sample covariance is defined as

      In this manuscript, I slid a little further, and left Bessel’s correction out altogether. I am happy that the reviewer pointed this out, so I can make this maximally precise again.

      The reviewer also quotes Price (1972), page 485:

      “In this paper we will be concerned with population functions and make no use of sample functions”.

      Below, the reviewer will return to the issue of distinguishing between the sample covariance with Bessel’s correction, and the sample covariance without Bessel’s correction, where the latter is regularly also referred to as the population covariance. A natural interpretation of the quote from Price (1972), if we read a bit around this quote in the paper, is that the difference between his “population functions” and his “sample functions” is indeed Bessel’s correction.

      The reviewer also states that Frank (2012) elaborates on this in the subsection “Interpretation of Covariance”. What is interesting, though, is that, when Frank (2012) writes, on page 1017,

      “It is important to distinguish between population measures and sample measures”,

      the difference between those is not that one includes Bessel’s correction and the other does not. The difference between “population measures” and “sample measures” in Frank (2012), page 1017, is that

      “In many statistical applications, one only has data on a subset of the full population, that subset forming a sample.”

      The distinction between a population covariance and a sample covariance in Frank (2012) therefore is that they are “covariances” of different things (where the word covariances is in quotation marks, because, again, they are not really covariances). Besides just making sure that Price (1972) and Frank (2012) are not using these terms in the same way, this also perfectly illustrates the mix-up between statistical populations (or data generating processes) and biological populations that I discuss on pages 8 and 9 of Appendix A. I will return to this below, when I explain why I want to avoid using the word “population covariance” for the sample covariance without Bessel’s correction.

      Of course, the difference is negligible when the population is large. However, the author applies the covariance formula to populations as small as 𝑁 = 2, for which the correction factor is significant.

      Absolutely right.
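
      To make the size of the correction factor concrete, here is a minimal sketch for a population of 𝑁 = 2. The numbers and the helper name `cov_term` are hypothetical illustrations, not values from the manuscript.

```python
def cov_term(xs, ys, bessel=False):
    """Sum of cross-products of deviations, divided by N, or by N - 1 with Bessel's correction."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    total = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return total / (n - 1) if bessel else total / n

# Hypothetical population of two parents: offspring numbers w and p-scores p.
w = [2.0, 0.0]
p = [1.0, 0.0]

without_correction = cov_term(w, p)            # divides by N = 2
with_correction = cov_term(w, p, bessel=True)  # divides by N - 1 = 1
ratio = without_correction / with_correction   # equals (N - 1) / N
print(without_correction, with_correction, ratio)  # 0.5 1.0 0.5
```

      For 𝑁 = 2 the two versions differ by a factor of two, whereas for 𝑁 = 1000 the factor (𝑁 − 1)/𝑁 would be 0.999.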

      The author objects to using the term "population covariance" (SI, pp. 8-9) on the grounds that it might be misleading if the covariance, regression coefficients, etc. are used for inference because in this case, what is being inferred is not a population statistic but an underlying relationship. However, I am not convinced that statistical inference is or should be the primary use of the Price equation (see next point). At any rate, avoiding potential confusion is not a sufficient reason to use incorrect terminology.

      There are a few related, but separate issues. One is what to call the 𝐶𝑜𝑣(𝑤, 𝑝)-term. Another, somewhat broader, is to avoid mixing up statistical populations and biological populations. A third is what the primary use of the Price equation is. The third issue I will respond to below, where it reappears. Here I will focus on the first two, which can be discussed without addressing the third.

      In a data context, I now call the 𝐶𝑜𝑣(𝑤, 𝑝)-term “(𝑁 − 1)/𝑁 times the sample covariance, or, in other words, the sample covariance without Bessel’s correction”. This should be unambiguous. In a modeling context I refer to the 𝐶𝑜𝑣(𝑤, 𝑝)-term as “the 𝐶𝑜𝑣(𝑤, 𝑝)-term” and describe it as a summary statistic or a notational convention. There are two reasons for this choice.

      The first is that neither of these uses the word “population”. I like this, because there is a persistent scope for confusion between statistical populations and biological populations (as exemplified by Frank, 2012). This leads to an incorrect, but widespread intuition that if we “know the entire (biological) population” in a data context, there is nothing that can be estimated. This is what pages 8 and 9 of Appendix A are all about.

      The second reason is that by using two labels, I also differentiate between the data context and the modeling context. This is important for reasons I will return to later.

      Relatedly, I suggest avoiding using 𝐸 for the second term in the Price equation, since (as the ms points out), it is not the expectation of any random variable. It is a population mean. There is no reason not to use something like Avg or bar notation to indicate population mean. Price (1972) uses "ave" for average.

      I totally agree that the second term in the Price equation is not an expectation. I made this point in van Veelen (2005), and I repeated this in the manuscript. This remark by the reviewer prompted me to spell this out a bit more emphatically in Appendix A. That still leaves me with the choice of what notation to use.

      I therefore looked up all contributions to the Theme issue “Fifty years of the Price equation” in the Philosophical Transactions of the Royal Society B, and found that almost all contributions use 𝐸, sometimes saying that this refers to an expectation or an average. Of course, this is wrong. However (and this is another argument), it is just as wrong as using 𝐶𝑜𝑣 or 𝑉𝑎𝑟. The terms abbreviated as 𝐶𝑜𝑣 and 𝑉𝑎𝑟 are no more a covariance and a variance than the term abbreviated as 𝐸 is an expectation. So I would think that there are a few reasons for sticking with 𝐸 here: 1) consistency with the literature; 2) consistency with the treatment of other terms; and 3) the fact that this term is not really of any importance in this manuscript. I do however totally understand the reviewer’s reasons, which I suppose include that for using 𝐸, there are relatively unproblematic alternatives (ave or upper bar) that are not available for the other terms. I hope therefore that being a bit more emphatic in the manuscript about 𝐸 not being an expectation at least partly addresses this concern.

      I should add, however, that the distinction between population statistics vs sample statistics goes away for regression coefficients (e.g. b, c, and r in Hamilton's rule) since in this case, Bessel's correction cancels out.

      Totally correct.
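
      The cancellation can also be seen in a small numerical sketch. The data and the helper name `slope` are hypothetical, and the argument `ddof` (0 for dividing by 𝑁, 1 for dividing by 𝑁 − 1) is just an illustrative convention.

```python
def slope(xs, ys, ddof):
    """Regression slope of ys on xs, computed as covariance over variance.

    ddof=0 divides both by N; ddof=1 applies Bessel's correction to both.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - ddof)
    var = sum((x - mx) ** 2 for x in xs) / (n - ddof)
    return cov / var

# Hypothetical p-scores and offspring numbers for five individuals.
p = [0.0, 0.0, 1.0, 1.0, 1.0]
w = [1.0, 0.0, 2.0, 1.0, 3.0]

slope_without_correction = slope(p, w, ddof=0)
slope_with_correction = slope(p, w, ddof=1)
print(slope_without_correction, slope_with_correction)
```

      Both calls return the same slope, because the factor 1/(𝑁 − ddof) appears in the numerator and in the denominator and therefore drops out of the ratio.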

      (4) Descriptive vs. inferential statistics: When discussing the statistical quantities in the Price Equation, the author appears to treat them all as inferential statistics. That is, he takes the position that the population data are all generated by some probabilistic model and that the goal of computing the statistical quantities in the Price Equation is to correctly infer this model.

      Before I respond to this, I would like to point out that this literature has started going off the rails right from the very beginning. One of the initial construction errors was to use the ungeneralized Price equation in regression form. The other one is that the paper in which Price (1970) presented his equation is inconsistent, and suggests that the equation can be used for constructing hypotheses and for testing them at the same time (see van Veelen (2005), page 416). That, of course, is not possible; the first happens in the theory/modeling domain, and the second in the empirical testing/statistics domain, and they are separate exercises.

      These construction errors have warped the literature based on it, and have resulted in a lot of mental gymnastics and esoteric statements, which are needed if we are not willing to consider the possibility that there could be anything amiss with the original paper by Price (1970).

      In this paper, I undo both of these construction errors. Undoing the second one means exploring both domains separately. In Sections 2-4 of Appendix A I explore the possibility that the Price equation is applied to data. In Section 5 of Appendix A I explore the possibility that it is used in a modelling context. The primary effort here is just to do it right, and I have not read anything to suggest that I did not succeed in doing this. Secondarily, of course, I also want to contrast this to what happens in the existing literature. That is what this point by the reviewer is about. It is therefore important to be aware that seeing the contrast accurately is complicated by the apologetic warp in the existing literature.

      As a first effort to unwarp, I would like to point to the fact that I am not taking any position on what the Price equation should be used for. All I do here is explore (and find) possibilities, both in the statistical inference domain and in the modeling domain. I also find that there is scope for misspecification in both, and that, in both domains, we should want to avoid misspecification. The thing that I criticize in the existing literature therefore is not the choice of domain. The thing that I criticize is the insistence on, and celebration of, what is most accurately described as misspecification. This typically happens in the modeling domain.

      It is worth pointing out that those who argue in favor of the Price Equation do not see it this way: "it is a mistake to assume that it must be the evolutionary theorist, writing out covariances, who is performing the equivalent of a statistical analysis." (Gardner, West, and Wild, 2011); "Neither data nor inferences are considered here" (Rousset, 2015). From what I can tell, to the supporters of the Price equation and the regression form of Hamilton's rule, the statistical quantities involved are either population-level *descriptive* statistics (in an empirical context), or else are statistics of random variables (in a stochastic modeling context).

      Again, this description of the friction between my paper and the existing literature is predicated on the suggestion that I have only one domain in mind where the Price equation can be applied. That is not the case; I consider both.

      In the previous paragraph, the reviewer states that I “treat statistical quantities as inferential statistics”, and in this paragraph the reviewer contrasts that with the supporters of the (ungeneralized) Price equation that supposedly treat the same quantities as “descriptive statistics”. This is also beside the point, but it will take some effort to sort out the spaghetti of entangled arguments (where the spaghetti is the result of the history in this field, as indicated earlier).

      First of all, it is not unimportant to point out that the way most people use the terms “inferential statistics” and “descriptive statistics” is that the first refers to an activity, and the second to a function of a bunch of numbers, typically data. Inferential statistics is a combination of parameter estimation and model specification (those are activities). Descriptive statistics are for instance the average values of variables of interest (which makes them a function of a set of numbers). When doing inferential statistics (or statistical inference), looking at the descriptive statistics of the dataset is just a routine before the real work begins. It is important to remember that.

      Now I suppose that this reviewer uses these words a little differently. When he or she writes that I “treat statistical quantities as inferential statistics”, I assume that the reviewer means that I want to use a term like for doing statistical inference, or that, when I want to interpret such a term, I include considerations typical of statistical inference. Within the data domain, that is totally correct. In the paper I argue that there are very good reasons for this. We would like to know what the data can tell us about the actual fitness function, and if we do our statistical inference right, and choose our Price-like equation accordingly, then that means that we would be able to give a meaningful interpretation to a term like . It also means that we then have an equation that describes the genetic population dynamics accurately.

      When the reviewer states that other papers treat them as “population level descriptive statistics” in an empirical context, I have a hard time coming up with papers for which that is the case. Most papers apply the Price equation in the modeling domain (That is to say: this is true in evolution. In ecology the Price equation is often applied to data; see Pillai and Gouhier (2019) and Bourrat et al. (2023)). But even if there are researchers that apply the Price equation to data, then considering these statistical quantities as “descriptive statistics” would not make sense. Looking at the descriptive statistics alone is not an empirical exercise; it is just a routine that happens before the actual statistical inference starts. In a data context, saying that considerations that are standard in statistical inference do not apply, because one is just not doing statistical inference, is the equivalent of an admission of guilt. If you do not consider statistical significance, and never mention that sample size could matter, because you are using these terms as “descriptive statistics, not inferential statistics”, then you’re basically admitting to not doing a serious empirical study.

      Besides treating statistical quantities as descriptive statistics in a data context, the reviewer also states that, in a stochastic modeling context, other researchers treat the same statistical quantities as “statistics of random variables”. This is first of all very generous to the existing literature. I imagine that the reviewer is imagining a modeling exercise where for instance the covariance between two variables is postulated. A theory exercise would then take that as a starting point for the derivation of some theoretical result. This, however, is not what happens in most of the literature.

      There are two things that I would like to point out. First of all, postulating covariances and deriving results from assumptions regarding those covariances is not an activity that requires using the Price equation. There are many stochastic models that function perfectly fine without the Price equation. This is maybe a detail, but it is important to realize that what the reviewer probably thinks of as a legitimate theoretical exercise may be something that can very well be done without the Price equation.

      Secondly, I would like to repeat something that I have pointed out before, which is that the Price equation can be written for any transition, whether this transition is likely or unlikely, given a model, and even for transitions that are impossible. For all of those transitions, one can write the (ungeneralized) Price equation, and for all of those, the Price equation will be an identity, and it will contain the things that the reviewer refers to as “statistical quantities”. It is important to realize that these “statistical quantities”, therefore, are properties of a transition, and that every transition comes with its own “statistical quantity”. That implies that they are not properties of random variables; they reflect something regarding one transition. What one could imagine, though, is the following. To fix ideas, let’s take the Price equation in regression form, and focus on 𝛽. A meaningful modeling exercise starts with assumptions about the likelihood of all different transitions, and therefore the likelihood of different values of 𝛽 materializing, or it starts with assumptions that imply those probabilities. In a theoretical exercise, one could then derive statements about the expectation and variance of those “statistical quantities”. For instance, one can calculate the expected value 𝐸[𝛽] and the variance 𝑉𝑎𝑟[𝛽], where this expectation is a proper expectation (taken over the probabilities with which these transitions materialize) and this variance is a proper variance, for the same reason.

      This is what I do on page 416 of van Veelen (2005) and in Section 5 of Appendix A. I think something like this is what the reviewer may have in mind, but it is worth pointing out that this still does not mean that the 𝛽 from the Price equation for any given transition is now a property of a random variable. Much of the literature, however, is not at the level of sophistication that I imagine the reviewer has in mind, although there are papers that are; see the discussion below of Rousset and Billiard (2000) and Van Cleve (2015).
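
      The difference between the 𝛽 of one transition and a proper expectation over transitions can be illustrated with a small simulation. This is only a sketch under assumptions that are not taken from the manuscript: the model in `draw_transition` is hypothetical (every parent has one offspring for sure, and a second one with probability 0.2 + 0.5 times its p-score), and the expectation and variance are approximated by averaging over simulated transitions rather than derived analytically.

```python
import random

def beta(w, p):
    """Regression coefficient of w on p for ONE transition: Cov-term over Var-term."""
    n = len(p)
    mw = sum(w) / n
    mp = sum(p) / n
    cov = sum((a - mw) * (b - mp) for a, b in zip(w, p)) / n
    var = sum((b - mp) ** 2 for b in p) / n
    return cov / var

def draw_transition(p, rng):
    """Hypothetical model: offspring number is 1, plus 1 more with probability 0.2 + 0.5 * p_i."""
    return [1 + (1 if rng.random() < 0.2 + 0.5 * p_i else 0) for p_i in p]

rng = random.Random(42)
p = [0.0] * 10 + [1.0] * 10   # hypothetical parent population

# Every simulated transition comes with its own value of beta.
betas = [beta(draw_transition(p, rng), p) for _ in range(20000)]

# Monte Carlo approximations of the proper expectation and variance of beta,
# taken over the probabilities with which transitions materialize in this model.
mean_beta = sum(betas) / len(betas)
var_beta = sum((b - mean_beta) ** 2 for b in betas) / len(betas)
print(min(betas), max(betas), mean_beta, var_beta)
```

      In this hypothetical model the fitness function has slope 0.5 in the p-score, and the average of 𝛽 over transitions comes out close to that value, while the 𝛽 of any single transition can be far from it.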

      In the appendix to this reply, I will address the quotes from Gardner, West, and Wild (2011) and Rousset (2015). This takes up some space, so that is why it is at the end of this reply.

      In short, the manuscript seems to argue that Price equation users are performing statistical inference incorrectly, whereas the users insist that they are not doing statistical inference at all.

      That is not what the manuscript argues, but I am happy to clarify. The manuscript explores both the use of the Price equation when applied to data (and therefore for statistical inference) and when applied to transitions in a model. The criticism of the existing literature is not that it performs statistical inference incorrectly. The criticism is that the literature insists on misspecification, which typically happens in a modeling context.

      The problem (and here I think the author would agree with me) arises when users of the Price equation go on to make predictive or causal claims that would require the kind of statistical analysis they claim not to be doing. Claims of the form "Hamilton's rule predicts.." or use of terms like "benefit" and "cost" suggest that one has inferred a predictive or causal relationship in the given data, while somehow bypassing the entire theory of statistical inference.

      I do not really know how to interpret this paragraph. The use of the word “data” suggests that this pertains to a data context, but I do not know what would qualify as a “predictive claim” in that domain, or how any study would go from data to a claim of the form “Hamilton’s rule predicts …”. Again, I do not really know of papers that apply the Price equation to data. None of the empirical papers reviewed in Bourke (2014), for instance, do. I would however agree that it is close to obvious that an approach that does indeed bypass the entire theory of statistical inference cannot identify causal relations in datasets. I think the examples in Section 2 of Appendix A also clearly illustrate that a literature in which the term “sample size” is absent cannot be doing statistical inference.

      There is also a third way to use the Price equation which is entirely unobjectionable: as a way to express the relationship between individual-level fitness and population-level gene frequency change in a form that is convenient for further algebraic manipulation. I suspect that this is actually the most common use of the Price equation in practice.

      I am not sure if I understand what it means for the Price equation to “express the relationship between individual-level fitness and population-level gene frequency change”. That is a bit reminiscent of how John Maynard Smith saw the Price equation (Okasha, 2005), but he also emphasized that he was unable to follow George Price and his equation. For sure, it cannot be that one side of the Price equation reflects something at the individual level and the other something at the population level, because both sides of the Price equation are equally aggregated over the population. Just to be safe, and to avoid unwarranted associative thinking, I would therefore choose to be minimalistic, and say that the Price equation is an identity for a transition between a parent population and an offspring population.
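
      That the Price equation in covariance form is an identity for any transition can be verified directly. The following minimal sketch uses hypothetical numbers with no model behind them: parent p-scores and offspring p-scores are drawn arbitrarily, and offspring numbers are simply fixed by hand.

```python
import random

rng = random.Random(7)

# Hypothetical transition: offspring numbers are fixed by hand, p-scores are arbitrary.
offspring_counts = [0, 1, 2, 3, 1, 0, 2, 1]
N = len(offspring_counts)
parent_p = [rng.random() for _ in range(N)]
offspring_p = [[rng.random() for _ in range(k)] for k in offspring_counts]

w = [len(kids) for kids in offspring_p]
w_bar = sum(w) / N
p_bar = sum(parent_p) / N
all_offspring = [x for kids in offspring_p for x in kids]
p_bar_offspring = sum(all_offspring) / len(all_offspring)

# Left-hand side: mean fitness times the change in the average p-score.
lhs = w_bar * (p_bar_offspring - p_bar)

# First term: Cov-term, dividing by N.
cov_term = sum((w_i - w_bar) * (p_i - p_bar) for w_i, p_i in zip(w, parent_p)) / N

# Second term: average of w_i times (mean offspring p-score minus parent p-score).
# Written as sum(kids) - w_i * p_i, which is zero for parents without offspring.
transmission_term = sum(
    sum(kids) - w_i * p_i for kids, w_i, p_i in zip(offspring_p, w, parent_p)
) / N

rhs = cov_term + transmission_term
print(lhs, rhs)
```

      The two sides agree up to rounding error, whatever numbers are plugged in, and both are aggregated over the whole parent population.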

      Regardless of the words we choose, however, the question of how harmless or objectionable the use of the Price equation is in the literature is absolutely relevant. In earlier papers I have tried to cover a spectrum of examples of different ways to use (or misuse) the Price equation. In van Veelen (2005) I cover Grafen (1985a), Taylor (1989), Price (1972), and Sober and Wilson (2007). The main paper that is discussed in van Veelen et al. (2012) is Queller (1992b), but Section 7 of that paper also discusses the way the Price equation is used in Rousset and Billiard (2000), Taylor (1989), Queller (1985), and Page and Nowak (2002). These discussions also come with a description of how much it takes to repair them, and this varies all the way from nothing, or a bit of minor rewording, to being beyond repair.

      What is good to observe is that the papers in which the use of the Price equation is the least problematic are also the papers in which, if the reference to the Price equation were taken out, nothing would really change. These are papers that start with a model, or a collection of models, and that, at some point in the derivation of their results, point to a step that can, but does not have to, be described as using the Price equation. An example of this is Rousset and Billiard (2000); see the detailed description in Section 7 of van Veelen et al. (2012).

      I am happy to point to a few more papers on the no harm, no foul end of the spectrum here.

      Allen and Tarnita (2012) discuss properties of the dynamics in a well-defined set of models. Towards the end of the paper, a version of the Price equation more or less naturally appears. This is more of an interesting aside, though, and does not really play a role in the derivation of the core results of the paper. Van Cleve (2015) is similar to Rousset and Billiard (2000), in that the “application of the Price equation” there is a minor ingredient of the derivation of the results. (A detail that this reviewer may find worth mentioning, given earlier comments, is that Van Cleve (2015) writes the left-hand side of the Price equation as 𝐸(𝑤Δ𝑝|𝐩), instead of $\bar{w}\Delta\bar{p}$. First, two very unimportant things. Van Cleve (2015) uses 𝑤 for mean fitness, for which $\bar{w}$ is a more common symbol. Another detail of lesser importance is that it includes the vector of parent p-scores in the notation, which in that paper's notation is 𝐩. More important, however, is that Van Cleve (2015) writes 𝐸(Δ𝑝) for $\Delta\bar{p}$, which extends the (mis)use of the symbol 𝐸 for what really is just an average. This is consistent within the Price equation, in the sense that it now denotes the average with 𝐸, both on the right-hand side and on the left-hand side of the Price equation. It can however be a little bit confusing, because when Rousset and Billiard (2000) use 𝐸 in the corresponding expression, this is a proper expectation. In their case, this summarizes all possible transitions out of a given state, and weighs them by their probabilities of happening, given a state summarized by 𝑝.) I am also happy to extend the spectrum a bit here. Some papers on inclusive fitness do not use the Price equation at all, even though one could imagine places where it could be inserted. A nice example of such a paper is Taylor et al. (2007).

      In this paper, I hope I can be excused from taking a complete inventory of this literature, and I hope that I do not have to count how many papers fall into the different categories. This would help assess the veracity of the suspicion the reviewer has, which is that the most common use of the Price equation is entirely unobjectionable, but I just do not have the time. I would however not want to underestimate the aggregate damage done in this field. The spectrum spanned in my earlier papers does include a fair amount of nonsense results. This typically happens in papers that do not study a specific model or set of models, but that take the Price equation as their point of departure for their theorizing. Also there seems to be a positive correlation between how exalted and venerating the language is that is used when describing the wonders and depths of the Price equation, and how little sense the claims make that are “derived” with it.

      We also should not set the bar too low. This is a literature that, at the starting point, has a few construction errors in it, as described in the paper. That is reason for concern. Moreover, one of the main end products of this literature is what we send our empiricists to the field with. As Section 8 of van Veelen et al. (2017) indicates, what we have supplied to our empiricists to work with is nothing short of terrible. I would therefore want to maintain that the damage done is enormous, and if there are also a few papers around that may use the ungeneralized Price equation in an innocuous way, then that is not enough redemption for my taste. We are still facing a literature in which, at every instance where the Price equation is used, we still need to check in which category it falls.

      For a paper that aims to clarify these thorny concepts in the literature, I think it is worth pointing out these different interpretations of statistical quantities in the Price equation (descriptive statistics vs inferential statistics vs algebraic manipulation). One can then critique the conclusions that are inappropriately drawn from the Price equation, which would require rigorous statistical inference to draw. Without these clarifications, supporters of the Price equation will again argue that this manuscript has misunderstood the purpose of the equation and that they never claimed to do inference in the first place.

      I would like to return to the point that I made at the beginning of my response to point (4), which is that the “thorniness” of these concepts is the result of the warp in the literature, resulting from the construction errors in Price (1970). If people want to understand how to apply the Price equation right, I think that reading Appendix A and B would work just fine. Again, I have not read anything that suggests that there is anything incorrect in there, so if the literature contains “thorny” concepts, it might just be that this is the result of the mental gymnastics necessitated by the unwillingness to accept that there might be something not completely right with Price (1970). Moreover, given my experiences in the field, I am not sure that there is anything that I could say that would convince the supporters of the ungeneralized Price equation.

      (5) "True" models: Even if one accepts that the statistical quantities in the Price equation are inferential in nature, the author appears to go a step further by asserting that, even in empirical populations, there is a specific "true" model which it is our goal to infer. This assumption manifests at many points in the SI when the author refers to the "true model" or "true, underlying population structure" in the context of an empirical population.

      Again, in Appendix A I explore both a data context and a modeling context. In the modeling context none of this applies, because in such a context, there is only the model that we postulate. In the part in which I explore what the Price equation can do in a data context, I do indeed use words like “true model” or "true underlying population structure".  

      I do not think it is necessary or appropriate, in empirical contexts, to posit the existence of a Platonic "true" model that is generating the data. Real populations are not governed by mathematical models. Moreover, the goal of statistical inference is not to determine the "true model" for given data but to say whether a given statistical model is justified based on this data. Fitting a linear model, for example, does not rule out the possibility there may be higher-order interactions - it just means we do not have a statistical basis to infer these higher-order interactions from the data (say, because their p-values are not significant), and so we leave them out.

      This remark suggests that the statistical approach in Sections 2-4 of Appendix A is more naïve than it should be, and that I would overlook the possibility of, for instance, interaction effects that are really nonzero, but that are statistically not significant. Now first of all, at a superficial level, I would like to say that this strikes me as somewhat inconsistent. In the remarks further back, the reviewer seems to excuse those that use the Price equation on data without any statistical considerations whatsoever. The reason why the reviewer is giving them a pass, is that they are “just not doing statistical inference”. Instead, they are doing this whole other thing with, you know, descriptive statistics. As I indicated above, that is just a fancy way of saying that they are not doing serious statistics – or serious empirics, for that matter.

      In this comment, on the other hand, the reviewer also suggests that the statistics with which I replace the total absence of any statistical considerations are not quite up to snuff. Below, I will indicate why that is not the case at all, but I think it is also worth registering a touch of irony there.

      In order to address this issue, it is worth first observing that the whole of classical statistics is based on probability theory in the following sense. We are always asking ourselves the question: if the data generating process works like this, what would the likelihood be of certain outcomes (datasets); and if the data generating process works some other way (sometimes: the complement of whatever “this” is), what would the likelihood then be of the same outcomes. By comparing those, we draw inferences about the underlying data generating process (which is a word suggestive of a “Platonic” world view that the reviewer seems to reject). Therefore, if one would impose a ban on using Platonic words like “true data generating process”; “actual fitness function”; or “the population structure that is out there”, it would be impossible to teach any course in statistics, basic or advanced. Also it would be impossible to practice, and talk about, applied statistics.

      Now the reviewer claims that “Real populations are not governed by mathematical models”. I do not really know if I agree or disagree with that statement, but the example that the reviewer gives does not fit that claim. The reviewer suggests that if we find a higher order term not to be statistically significant (and therefore we do not reject the hypothesis that it is zero), then that would not necessarily mean that it is not there. That is totally true, and statisticians tend to be fully aware of that. But that does not imply that there is no true data-generating process; the whole premise of this example is that there is, but that the sample size is not large enough to determine it in a detailed enough way so as to include this interaction effect, which apparently is too small to detect at this sample size.

      The third thing to reflect on here, is that the reviewer seems to suggest that the Generalized Price equation in regression form, as presented in my paper, comes with a specific statistical approach, that he or she classifies as philosophically naïve or unsophisticated. That, however, is not the case, and I am very grateful that this remark by this reviewer allows me to make a point that I think shines a light on how the Generalized Price equation puts the train that started going off the rails in 1970 back on track, and reconnects it with the statistics it borrows its terminology from. To see that, it is good to be aware that statistics never gives certainty. The whole discipline is built around the awareness that it is possible to draw the wrong inference, and the aim is to determine, minimize, and balance, the likelihoods of making different wrong inferences. So, statistics produces statements about the confidence with which one can say that something works one way or the other. In some instances, the data are not enough to say anything with any confidence. In other cases, the data are rich enough so that it is really unlikely that we incorrectly infer that for instance a certain gene matters for fitness.

      The nice thing about the setup with the Generalized Price equation, is that those statistical considerations translate one-to-one to considerations regarding which Price-like equation to choose. If the data do not allow us to pick any model with confidence, then we should be equally agnostic about which Price-like equation describes the population genetic dynamics accurately. If the statistics gives us high confidence that a certain model matches the data, then we should pick the matching Price-like equation with the same confidence. This also carries over to higher level statistical considerations.

      If we think about terms that, if we would gather a gargantuan amount of data, might be statistically significant, but very small, then economists call those statistically significant, but economically insignificant. When rejecting the statistical significance on the basis of a not gargantuan dataset, statisticians are aware that terms that really have a zero effect, as well as terms, the effect of which is really small, are rejected with the same statistical test – and that we should be fine with that. All such considerations carry over to what we think of regarding the choice of a Price-like equation to describe the population genetic dynamics. Even if people disagree about whether or not to include a term that is statistically significant, but relatively small, such a disagreement can still happen within this setup, and just translates to a disagreement on which Price-like equation to choose.

      Similarly, people could also disagree about whether it is justified to use polynomials to characterize a fitness function. If we decide that we can, because of Taylor expansions, then the core result of the paper implies that the population genetic dynamics can be summarized by a generalized Hamilton’s rule (as long as the fitness function includes a constant and a linear term regarding the p-score). On the other hand, if we do not believe this is justified, and prefer to use an altogether different family of fitness functions, then we can no longer do this. All of this leaves space for all kinds of statistical considerations and disagreements, that just carry over to the choice for one or the other Price-like equation as an accurate description of the population genetic dynamics. Or, if one does not believe polynomials should be used, then this leads to not picking any Price-like equation at all.

      So, this is a long way of saying that the Generalized Price equation creates space for all statistical considerations to regain their place, and does not hinge on one approach to statistics or another.

      What we can say is that if we apply the statistical model to data generated by a probabilistic model, and if these models match, then as the number of observations grows to infinity, the estimators in the statistical model converge to the parameters of the data-generating one.

      But this is a mathematical statement, not a statement about real-world populations.

      Again, I do not know if I agree or disagree with the last sentence. However, that does not really matter, because either option only has implications for how we are to think of the relation between a Price-like equation describing a population genetic dynamics and real-world populations. It is not relevant for the question which Price-like equation to pick, or whether to pick one at all.
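
      The convergence statement quoted above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the linear data-generating model, its parameter values, the noise level, and all variable names are made up for illustration and do not come from the manuscript. The statistical model matches the data-generating one, and the largest estimation error is recorded for growing sample sizes.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary fixed seed
true_beta = np.array([1.0, 0.5, 0.3])     # hypothetical constant, p and q effects

def estimation_error(n):
    """Largest absolute gap between OLS estimates and the true parameters."""
    p = rng.random(n)                     # hypothetical focal p-scores
    q = rng.random(n)                     # hypothetical partner p-scores
    X = np.column_stack([np.ones(n), p, q])
    w = X @ true_beta + rng.normal(0.0, 1.0, size=n)   # matching model plus noise
    beta_hat, *_ = np.linalg.lstsq(X, w, rcond=None)
    return np.max(np.abs(beta_hat - true_beta))

errors = {n: estimation_error(n) for n in (100, 10_000, 1_000_000)}
for n, err in errors.items():
    print(n, err)
```

      The sketch says nothing about what happens when the two models do not match; in that case the estimates still converge, but not to parameters of the data-generating model.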

      A resolution I suggest to points 3, 4, and 5 above is:

      *A priori, the statistical quantities in the Price Equation are descriptive statistics, pertaining only to the specific population data given.

      *If one wishes to impute any predictive power, generalizability, or causal meaning to these statistics, all the standard considerations of inferential statistics apply. In particular, one must choose a statistical model that is justified based on the given data. In this case, one is not guaranteed to obtain the standard (linear) Hamilton's rule and may obtain any of an infinite family of rules.

      *If one uses a model that is not justified based on the given data, the results will still be correct for the given population data but will lack any meaning or generalizability beyond that.

      *In particular, if one considers data generated by a probabilistic model, and applies a statistical model that does not match the data-generating one, the results will be misleading, and will not generalize beyond the randomly generated realization one uses.

      Of course, the author may propose a different resolution to points 3-5, but they should be resolved somehow. Otherwise, the terminology in the manuscript will be incorrect and the ms will not resolve confusion in the field.

      I have outlined my solutions extensively above. I really appreciate that Reviewers #1 and #2 have spent time and attention on the manuscript and on the long appendices.  

      Appendix to the response to reviewer #2: Some remarks on Gardner, West & Wild (2011), Frank (2012), and Rousset (2015)

      An accurate response to the quote from Gardner, West, and Wild (2011) in the review report takes up space. I therefore wanted to put that in an appendix to the response to reviewer #2. I also include a few paragraphs regarding Frank (2012) and Rousset (2015), both of which are also mentioned by reviewer #2. All of this might also be of interest to people that are curious about how what I find in my paper relates to the existing literature.

      Gardner, West & Wild (2011). The quote I am responding to is “it is a mistake to assume that it must be the evolutionary theorist, writing out covariances, who is performing the equivalent of a statistical analysis”. I want to put that into context, so I will go over the whole paragraph that surrounds the quote. The paragraph is called Statistics and Evolutionary Theory and can be found on page 1038 of the paper. I think that it is worth pointing out that it is not easy to respond to their somewhat impressionistic collages of words and formulas. I will therefore cut the paragraph up in a few smaller bits and try to make sense of it bit by bit. The paragraph begins with:

      “Our account of the general theory of kin selection has been framed in statistical terms.” Based on what they write two sentences down, the best match between those words and what they do in the paper would be: “our account uses words like “covariance”, “variance” and “expectation” for things that are not what “covariance”, “variance” and “expectation” mean in probability theory and statistics.” I would be totally open to an argument why that is nonetheless OK to do, but the way Gardner, West, and Wild (2011) phrase it obscures the fact that this needs any justification or reflection at all. “Framing something in statistical terms” is unspecific enough to sound completely harmless.

      “The use of statistical methods in the mathematical development of Darwinian theory has itself been subjected to recent criticism (van Veelen, 2005; Nowak et al., 2010b), so we address this criticism here.”

      Also here, specifics would be helpful. The “use of statistical methods” sounds like it is more than just using terms from statistics, so this might refer to the minimizing of the sum of squared differences, which is also mentioned a sentence down in Gardner, West, and Wild (2011). If it does, then it is worth observing that in statistics, the minimizing of the sum of squared differences (or residuals, or errors) comes with theorems that point very clearly to what is being achieved by doing this. The Gauss–Markov theorem states that, when the errors have mean zero, equal variance, and are uncorrelated, the ordinary least squares (OLS) estimator has the lowest variance within the class of linear unbiased estimators. This implies that minimizing the sum of squared errors helps answer a well-defined question in statistics; under these conditions, an OLS estimator is our best shot at uncovering an unknown relation between variables. To also minimize a sum of squared differences, but now in the modeling domain, qualifies as “use of statistical methods” only in a very shallow way. It means that a similar minimization is performed. Without an equivalent of the Gauss–Markov theorem that would shine a light on what it is that is being achieved by doing so, that does not carry the same weight as it does in the statistics domain – in that it does not carry any weight at all.
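
      For reference, the textbook form of the Gauss–Markov theorem can be written out as follows. The matrix notation (𝑦, 𝑋, 𝛽, 𝜀, and the competing estimator 𝛽̃) is generic and assumed here; it is not the notation of the manuscript or of Gardner, West, and Wild (2011).

```latex
\begin{aligned}
&y = X\beta + \varepsilon, \qquad
  \mathbb{E}[\varepsilon \mid X] = 0, \qquad
  \operatorname{Var}(\varepsilon \mid X) = \sigma^{2} I, \qquad
  X \text{ of full column rank}, \\[4pt]
&\hat{\beta}_{\mathrm{OLS}} = (X^{\top}X)^{-1}X^{\top}y, \\[4pt]
&\operatorname{Var}(\tilde{\beta} \mid X) - \operatorname{Var}(\hat{\beta}_{\mathrm{OLS}} \mid X) \succeq 0
  \quad \text{for every linear unbiased estimator } \tilde{\beta} = Cy .
\end{aligned}
```

      Every ingredient of the statement (unbiasedness, the variance of an estimator, the unknown 𝛽) refers to repeated sampling from a data-generating process with an unknown parameter, which is what is absent when the model is postulated rather than estimated.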

      “The concern is that statistical terms – such as covariances and least-squares regressions – should properly be reserved for conventional statistical analyses, where hypotheses are tested against explicit data, and that they are out of place in the foundations of evolutionary theory (van Veelen, 2005; Nowak et al., 2010b).”

      Again, a few things are a bit vague. What are “explicit data”? Are there data that are not explicit? Why the generic “foundations of evolutionary theory”, instead of a more specific description of what these statistical terms are used for? But either way, this is a misrepresentation of what I wrote in van Veelen (2005). I did not suggest that we “reserve statistical terms for conventional statistical analysis” just because. As I do here in the current paper, what I did there was explore the possibilities for the Price equation to help with what I then called Type I and Type II questions. Type I questions find themselves in the modeling domain and Type II questions find themselves in the statistical domain. I was not arguing for a ban on applying statistical concepts outside of the domain of statistical inference. All that I said is that, as currently practiced, the Price equation does not really help answer questions of either type.

      “However, this concern is misplaced. First, natural selection is a statistical process, and it is therefore natural that this should be defined in terms of aggregate statistics, even if only strictly by analogy (Frank, 1997a, 1998).”

      This is a vague non-argument. Almost nothing is well-defined here. What does it mean for natural selection to be a statistical process? Is that just an unusual term for a random process? If so, then I suppose I agree, but that has nothing to do with what I state or claim. And what does it mean to be defined in terms of aggregate statistics? What is the alternative? I have no idea how any of this relates to anything that I claim or state in my papers.

      “Second, Fisher (1930, p198) coined the term ‘covariance’ in the context of his exposition of the genetical theory of natural selection, so the evolutionary usage of this term has precedent over the way the term is used in other fields.”

      This is what I would call a “historic fallacy”. The fact that Fisher coined the term “covariance” in a book on genetics and natural selection does not mean that any “evolutionary usage” of the term “covariance”, however nonsensical, now has precedent over the way the term is used in other fields. Irrespective of the path that the history of science, genetics, or statistics took, right now we are in a place where about every student at every university anywhere in the world that takes a course in probability theory and/or statistics, learns that covariance is a property of a pair of random variables (see also Wikipedia). And they do so for a very good reason; it is essential in recognizing the relation between probability theory on the one hand and statistics on the other. Being curious how this “evolutionary usage” of the term covariance works, if covariance turns out not to be a property of random variables, is therefore perfectly justified, and “Fisher coined the term” is not a safe word that exempts it from scrutiny.

      “Third, it is a mistake to assume that it must be the evolutionary theorist, writing out covariances, who is performing the equivalent of a statistical analysis.”

      Again, that is just not what anyone is saying. Nobody is suggesting that an evolutionary theorist should perform the equivalent of statistical analysis. All I did was point to how little is being achieved by transferring formulas from statistics to a modeling context.

      “A better analogy is to regard Mother Nature in the role of statistician, analysing fitness effects of genes by the method of least-squares, and driving genetic change according to the results of her analyses (cf. Crow, 2008).”

      I have no idea what any of this means. Mother Nature is a personification of something that is not a person, and that does not have cognition. Without sentience, “Mother Nature” cannot assume the role of statistician, and cannot analyse fitness effects.

      “More generally, analogy is the basis of all understanding, so when isomorphisms arise unexpectedly between different branches of mathematics (in this case, theoretical population genetics and statistical least-squares analysis) this represents an opportunity for advancing scientific progress and not an anomaly that is to be avoided.”

      This is a strawman argument, puffed up with platitudes. Nobody is arguing against analogies. But what is the analogy supposed to be here? Just taking least squares from statistical inference and performing it in a modeling context does not make it an analogy. The Gauss–Markov theorem, which is the basis for why least squares helps answer questions in statistics, just does not mean anything in a modeling context. OLS in modeling is just willful misspecification, and nothing that it does in statistics translates to anything meaningful in modeling. Again, declaring it an analogy, or an isomorphism, does not make it one.

      Frank (2012). Because the reviewer also mentions Frank (2012), I would like to include a small remark on this paper too. “Natural Selection. IV. The Price equation” by Frank (2012) is partly a response to my earlier criticism of the use of the Price equation. As with Gardner, West, and Wild (2011), I would describe this paper as what is called a “flight forwards” in Dutch. While the questions I ask are relatively prosaic (such as: how does the Price equation help derive a prediction from model assumptions?), Frank (2012) pivots to suggesting that there is a profound philosophy-of-science disagreement that I am on the wrong side of. It is close to impossible to respond to Frank (2012), because it is a labyrinth of arguments that sound deep and impressive, but that are just not specific enough to know how they relate to points that I made – or even just what they mean in general. Just to pick a random paragraph:

      “Is there some reorientation for the expression of natural selection that may provide subtle perspective, from which we can understand our subject more deeply and analyse our problems with greater ease and greater insight? My answer is, as I have mentioned, that the Price equation provides that sort of reorientation. To argue the point, I will have to keep at the distinction between the concrete and the abstract, and the relative roles of those two endpoints in mature theoretical understanding.”

      For many of those terms, I have no real idea what they mean, and also reading the rest of the paper does not help understanding what this has to do with the more prosaic questions that are waiting for an answer. What is “reorientation”? What does “concrete” versus “abstract” have to do with the question what is being achieved by doing least squares regressions in modeling? What would be an example of a mature and an immature theoretical understanding?

      Rousset (2015) is also mentioned by the reviewer. This paper is not esoteric. It states, as reviewer #2 points out, that "neither data nor inferences are considered". This paper therefore finds itself in the modeling domain, and not in the data domain. It does however still dodge the question what the benefits are of misspecification in the modeling domain. As a matter of fact, it denies that there is misspecification at all.

      “In the presence of synergies, the residuals have zero mean and are uncorrelated to the predictors. No further assumption is made about the distribution of the residuals. Thus, there is no sense in which the regression is misspecified.”

      This is a remarkable quote, and testament to the lasting impact of the construction errors in Price (1970). Misspecification is literally defined as getting the model wrong. In statistics, avoiding misspecification can be complicated, because of the noise in the data. The real data-generating process is unknown, and because of the noise, there is always the possibility that data that are generated by one model look like they could also have been generated by another. The challenge is to reduce the odds of getting the model wrong to acceptable proportions, which is what statistical tests are for. But in modeling, we know what the model is; it is postulated by the modeler. Therefore, misspecification can be avoided by just not replacing it with a different model.

      What is being discussed in this part of Rousset (2015) is replacing what in this manuscript is called Model 3 (𝑤<sub>𝑖</sub> = 𝛼 + 𝛽<sub>1,0</sub>𝑝<sub>𝑖</sub> + 𝛽<sub>0,1</sub>𝑞<sub>𝑖</sub> + 𝛽<sub>1,1</sub>𝑝<sub>𝑖</sub>𝑞<sub>𝑖</sub> + 𝜀<sub>𝑖</sub>) with Model 2 (𝑤<sub>𝑖</sub> = 𝛼 + 𝛽<sub>1,0</sub>𝑝<sub>𝑖</sub> + 𝛽<sub>0,1</sub>𝑞<sub>𝑖</sub> + 𝜀<sub>𝑖</sub>), and choosing the parameters in Model 2 so that it is as close as it can be to Model 3. This is just the definition of misspecification. That is to say: the misspecification part is the choosing of Model 2 as a reference model. The minimizing of the sum of squared residuals one could consider as minimizing the damage.

      While Rousset (2015) finds itself in the modeling domain, it does nonetheless point to the field of statistics here, by stating that “the residuals have zero mean and are uncorrelated to the predictors”. From this, the paper concludes that “there is no sense in which the regression is misspecified”. That is just plain wrong. Minimizing the sum of the squared residuals guarantees that the residuals are uncorrelated with the variables that are included in the reference model, with respect to which the squared sum of residuals is minimized. The criterion that Rousset (2015) uses is that the model is well-specified if there is no correlation between the residuals (here: 𝜀<sub>𝑖</sub>) and the variables included in the reference model (here: 𝑝<sub>𝑖</sub> and 𝑞<sub>𝑖</sub>). But according to this criterion, all models would always be well-specified, and no model could ever be misspecified. The correct criterion, however, also requires that the residuals are not correlated with variables not included in the reference model. And here, the residuals are in fact correlated with 𝑝<sub>𝑖</sub>𝑞<sub>𝑖</sub>, which is the variable that is included in Model 3, but not in Model 2. Therefore, according to the correct version of this criterion, this model is in fact misspecified – as it should be, because getting the model wrong is the definition of misspecification.
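
      The point can be checked numerically with a minimal sketch. Everything specific in it is an illustrative assumption rather than something taken from the manuscript or from Rousset (2015): the binary p-scores, the coefficient values, the sample size, and the variable names. Fitness is generated from a model with a 𝑝𝑞 interaction term (the structure of Model 3, without noise), and a reference model with only a constant, 𝑝 and 𝑞 (the structure of Model 2) is fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary fixed seed
n = 10_000

# Hypothetical p-scores of the focal individual (p) and of its partner (q).
p = rng.integers(0, 2, size=n).astype(float)
q = rng.integers(0, 2, size=n).astype(float)

# Fitness with an interaction term; the coefficients are made up.
w = 1.0 + 0.5 * p + 0.3 * q + 0.8 * p * q

# Least-squares fit of the reference model without the interaction term.
X = np.column_stack([np.ones(n), p, q])
beta, *_ = np.linalg.lstsq(X, w, rcond=None)
residuals = w - X @ beta

def sample_cov(a, b):
    """Population-style (divide by n) covariance of two arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))

cov_p = sample_cov(residuals, p)        # included variable
cov_q = sample_cov(residuals, q)        # included variable
cov_pq = sample_cov(residuals, p * q)   # omitted variable
print(cov_p, cov_q, cov_pq)
```

      The first two covariances vanish up to floating-point error whatever the data look like, which is why they cannot serve as a test of specification; only the covariance with the omitted 𝑝𝑞 term carries information here.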

      In order to make sure that there can be no misunderstanding, I have added subsections at the end of Section 2 and Section 4 of Appendix A, and at the end of Section 2 of Appendix B. These subsections show that the algebra of minimizing the sum of squared errors implies that there is no correlation between the errors, or the residuals, and the variables that are included in the model. This is by no means something new; it is the reason why we do OLS to begin with. For additional details about misspecification, I would refer to Section 1b (viii) in van Veelen (2020).
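
      In compact form, the algebra referred to here is the standard derivation of the normal equations. The matrix notation below is generic and assumed for illustration (𝑤 is the vector of fitnesses, 𝑋 the matrix whose columns are the variables included in the reference model, 𝑧 any variable not included); the appendices may use different symbols.

```latex
\begin{aligned}
\hat{\beta} &= \arg\min_{b}\; (w - Xb)^{\top}(w - Xb), \\[4pt]
0 &= \left.\frac{\partial}{\partial b}\,(w - Xb)^{\top}(w - Xb)\right|_{b=\hat{\beta}}
   = -2\,X^{\top}(w - X\hat{\beta}), \\[4pt]
\Rightarrow\quad X^{\top}\hat{\varepsilon} &= 0,
   \qquad \hat{\varepsilon} = w - X\hat{\beta}.
\end{aligned}
```

      If 𝑋 contains a column of ones, the first of these equations says that the residuals sum to zero, and the remaining ones then say that the sample covariance between the residuals and every included variable is zero. No equation constrains the product of 𝑧 and the residuals for a variable 𝑧 that is not a column of 𝑋.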

      Finally, there is a detail worth noticing. In the main text, as well as in Appendix B, I use an analogy (and, unlike what Gardner, West, and Wild, 2011, refer to as an analogy, this actually is one). This is an analogy between two choices. On the one hand, there is the choice between Price-like equation 1 (based on Model 1 as a reference model) and Price-like equation 2 (based on Model 2 as a reference model) both applied to Model 2. On the other hand, there is the choice between Price-like equation 2 (based on Model 2 as a reference model) and Price-like equation 3 (based on Model 3 as a reference model) both applied to Model 3. Model 1 is the non-social model, Model 2 is the social model without interaction term, and Model 3 is the social model with interaction term. That makes the first choice a choice between treating a social model as a social model, or as a non-social model. The second choice is between treating a social model with interaction term as a social model with interaction term, or as a social model without interaction term. The power of this analogy is that every argument against treating the social model as if it is a non-social model is also an argument against treating the social model with interaction term as if it is a social model without interaction term.

      This ties in with the incorrect criterion for when a model is well-specified from Rousset (2015) as follows. His criterion (that there should be no correlation between the residuals and the variables in the model) declares the social model without interaction term well-specified as a reference model, when we are considering a social model with interaction term. According to the same criterion, however, the non-social model would also have to be declared to be well-specified as a reference model, when the model we are considering is a social model. The reason is that also here, there is no correlation between the residuals and the variables that are included in this model. This is clearly not what anyone is advocating for, and for good reasons. The residuals here would, after all, be correlated with the p-score of the partner, which is a variable that is not included in the non-social model. This is a good indication that we should not use the non-social model for a social trait.

      Reviewer #3 (Public review):

      Before responding to this review, I would like to express that I appreciate the fact that the reviews and the responses are public at eLife. Besides just being useful in general, this also allows readers to get a behind the scenes glimpse into the state of the field, and the level of the reviewing. While the reports by Reviewers #1 and #2 show openness and an interest in getting things right, the report by Reviewer #3 is representative of the many review reports that I have received from the inclusive fitness community in the past. These reports tend to be rhetorically strong, and to those who do not have the time to dig deeper in the details, these reports are probably also convincing. I will therefore go through this review line by line to show how little there is behind the confident off-hand dismissal.

      There is an interesting mathematical connection - an "isomorphism" - between Price's equation and least-squares linear regression.

      This is esoteric and needlessly vague. Why is the word “isomorphism” used? In mathematics, an isomorphism is a structure-preserving mapping. The Price equation is an equation, or an identity, which makes it a bit difficult to imagine what the set of objects is on one end of the mapping. Least-squares linear regression can perhaps be seen as a function of a dataset, which would make it a single object (one function). This complicates things at the other end of the mapping too, if that set is a singleton set. The only isomorphism that I can think of is a trivial isomorphism where one equation is mapped onto one function and vice versa. It seems unlikely that this is what the reviewer means. The word isomorphism moreover is in quotes, so maybe this is supposed to be figurative. But what would it be that is being suggested here by this figure of speech? Just saying that there is, as the reviewer puts it, an “interesting mathematical connection”, does not make it so. It would already be a start to just specify what the mathematical connection is, because I have a hard time seeing what that would be. Is it just that, if you divide the Cov(𝑤, 𝑝)-term by the Var(𝑝)-term, then you get a regression coefficient? If that is what the reviewer has in mind, that would be a rather shallow observation.
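
      The observation at the end of the previous paragraph, that dividing the covariance of 𝑤 and 𝑝 by the variance of 𝑝 gives a least-squares slope, is easy to verify. The sketch below uses made-up data and hypothetical variable names, with a deliberately non-linear relation between 𝑤 and 𝑝 to show that the equality is purely algebraic and holds regardless of how fitness actually depends on the p-score.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary fixed seed
p = rng.random(200)              # hypothetical p-scores
# Hypothetical fitnesses, quadratic in p plus noise (not a linear relation).
w = 1.0 + 2.0 * p**2 + rng.normal(0.0, 0.1, size=200)

cov_wp = np.mean((w - w.mean()) * (p - p.mean()))
var_p = np.mean((p - p.mean()) ** 2)
ratio = cov_wp / var_p

# Slope of the least-squares line of w on p; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(p, w, deg=1)
print(ratio, slope)
```

      That the two numbers coincide says nothing about whether a straight line is an appropriate description of the data.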

      Some people have misinterpreted this connection as meaning that there is a generality-limiting assumption of linearity within Price's equation, and hence that Hamilton's rule - which is derived from Price's equation - provides only an approximation of the action of natural selection.

      Here, the reviewer pulls a switcheroo. The use of the word “general”, or “generality”, here refers to the fact that the classical Price equation is an identity for all possible transitions between a parent and an offspring population. This is the sense in which the inclusive fitness literature uses the word general, and so do I in the relevant places in the manuscript. When I do, I make sure to add phrases like “in the sense that whatever the true model is, it always gets the direction of selection right”. As a consequence, the classical Hamilton’s rule is also totally general, in the same sense.

      One of the core points of the paper is that this is not unique to the classical Price equation. As a matter of fact, there is a large set of Price-like equations and Hamilton-like rules that are equally much identities, and equally much general (in the sense that they get the direction of selection right for all possible transitions). The being an identity and being completely general (in this sense) therefore cannot be a decisive criterion in favour of the classical Price equation and the classical Hamilton’s rule.

      On the other hand, the way in which my Generalized Price equation and my generalized version of Hamilton’s rule are general, is that they do not restrict the statistical model with respect to which errors are squared, summed and minimized to one linear statistical model. This generalization generates the variety of Price-like equations and Hamilton-like rules mentioned above (all of which are general in the sense of always getting the direction of selection right) and it gives us the flexibility to pick one that separates terms that reflect the fitness function from terms that reflect the population state.

      In response to my generalizing the Price equation and Hamilton’s rule in this second sense, the criticism of the reviewer comes down to saying that the Price equation and Hamilton’s rule do not need generalizing, because they already are general – the switcheroo being that this refers to generality in the first sense. That makes it sound like this could be an honest mistake, confusing one way in which these can be described as general with another. However, I really hammered this point home in the manuscript. Even a cursory reading of the manuscript reveals that I am fully aware that the classical Price equation and the classical Hamilton’s rule are general in the first sense.

      It is also not helpful that, as a description of what I supposedly claim, this is impressionistic, and lacks specificity. The Price equation is an equation, or an identity. What does it mean for there to be an “assumption of linearity” within it? For the classical Price equation in covariance form (which Reviewer #2 argues is what most people think of as “the Price equation”) there is no way in which one can transform this into a meaningful statement. There is just nothing in there to which the adjective “linear” can be applied. Linearity only becomes a thing when we ask ourselves how we can interpret the regression coefficient in the classical Price equation in regression form. That would be the linearity of the statistical model the differences with which are squared, summed and minimized in the regression.

      This is in contrast to the majority view that Hamilton's rule is a fully general and exact result.

      Again, in this manuscript, I write, time and again, that the classical Hamilton’s rule is fully general (in the sense that it applies to any transition), and exact (if that means that it always gets the direction of selection right). So, this is clearly not where the contrast with the majority view lies. The contrast with the majority view is that the majority insist on misspecification, and I suggest not to do that.

      To briefly give some mathematical details: Price's equation defines the action of natural selection in relation to a trait of interest as the covariance between fitness 𝑤 and the genetic breeding value 𝑔 for the trait, i.e. Cov(𝑤, 𝑔);

      The Price equation is an identity, not a definition. When deciding on a definition, there is some freedom. We can choose to define ⊂ so that 𝐴 ⊂ 𝐵 means that 𝐴 is a strict subset of 𝐵; or we can choose to define ⊂ so that 𝐴 ⊂ 𝐵 means that 𝐴 is a (not necessarily strict) subset of 𝐵. The Price equation does not “define the action of natural selection”, because it is an identity. There is no freedom to “define” any other way.

      The more serious reason why this is conceptually also a little dangerous, is the following. Imagine a locus with two alleles. Both of them are non-coding bits of DNA. Selection therefore does not act on either of them. Now imagine a parent population with an average p-score of 0.5, or, in other words, the frequency of these alleles in the parent population is 50-50. That makes the expected value of the p-score in the offspring population 0.5 too. In finite populations, however, randomness can make the p-score grow a bit larger or a bit smaller than 0.5. If the parent population is small, the variance (the expected squared deviation from 0.5) can actually be sizeable. If the p-score in the offspring population lands above 0.5, then the Price equation has Δ𝑝̄ > 0 (where Δ𝑝̄ is the change in the average p-score) and Cov(𝑤, 𝑝) > 0. Describing the Price equation as “defining the action of natural selection” now suggests that higher p-scores have been selected for (or, in other words, that “the action of natural selection in relation to a trait of interest” is positive). With equal probability, however, Δ𝑝̄ < 0 and therefore also Cov(𝑤, 𝑝) < 0, and this would then make us draw the opposite conclusion, that natural selection has acted to lower the p-scores in the population. Both of those would be wrong, because in this situation, it would have been randomness that changed the average p-score.
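      The neutral-locus argument can be checked numerically. The sketch below is an illustration only: it assumes a Wright-Fisher style reproduction step (every offspring picks a parent uniformly at random and copies its p-score exactly), and the population size, the seed, and the function names are hypothetical choices, not taken from the manuscript.

```python
import random

def neutral_generation(n, rng):
    """Offspring counts when each of n offspring picks a parent uniformly at random."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return counts

def price_terms(p_scores, w):
    """Return (w_bar * delta_p_bar, Cov(w, p)) for one parent-to-offspring transition."""
    n = len(p_scores)
    p_bar = sum(p_scores) / n
    w_bar = sum(w) / n
    weighted = sum(wi * pi for wi, pi in zip(w, p_scores))
    p_bar_next = weighted / sum(w)          # offspring copy the parental p-score
    cov_wp = weighted / n - w_bar * p_bar   # population covariance
    return w_bar * (p_bar_next - p_bar), cov_wp

rng = random.Random(1)           # arbitrary seed
parents = [1] * 5 + [0] * 5      # small population, average p-score 0.5
signs = {"positive": 0, "negative": 0, "zero": 0}
for _ in range(2000):
    w = neutral_generation(len(parents), rng)
    lhs, cov_wp = price_terms(parents, w)
    assert abs(lhs - cov_wp) < 1e-12   # the identity holds in every replicate
    key = "positive" if cov_wp > 0 else "negative" if cov_wp < 0 else "zero"
    signs[key] += 1

print(signs)
```

      In this sketch the identity holds in every replicate, and the covariance comes out positive in some replicates and negative in others, even though no selection acts on the locus by construction.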

      this is a fully general result that applies exactly to any arbitrary set of (𝑔, 𝑤) data; without any loss of generality this covariance can be expressed as the product of genetic variance Var(𝑔) and a coefficient 𝑏(𝑔, 𝑤), the coefficient simply being defined as 𝑏(𝑔, 𝑤) = Cov(𝑤, 𝑔)/Var(𝑔) for all Var(𝑔) > 0; it happens that if one fits a straight line to the same (𝑔, 𝑤) data by means of least-squares regression then the slope of that line is equal to 𝑏(𝑔, 𝑤).

      Why this needs to be explained is a bit of a mystery. These “mathematical details” are in almost all Price equation papers, and they are the point of departure of my Appendix A (it is on page 7 of a set of appendices that is more than 90 pages long). Seeing the need to explain this suggests that the reviewer thinks that there is a chance that I or anyone reading this paper would have missed this. I have not, and, more importantly, none of this invalidates the point I make in the paper.
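      For readers who want to verify the textbook fact referred to here, the following minimal sketch computes the ratio of covariance to variance and checks that it minimizes the sum of squared errors of a straight line. The data are made up for illustration, and the helper names `cov` and `sse` are hypothetical.

```python
from statistics import fmean

# Made-up (g, w) data, purely illustrative
g = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
w = [1.0, 2.0, 3.0, 1.0, 2.0, 0.0]

def cov(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    mx, my = fmean(xs), fmean(ys)
    return fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def sse(slope):
    """Sum of squared errors of a line with this slope and the best intercept for it."""
    intercept = fmean(w) - slope * fmean(g)
    return sum((wi - intercept - slope * gi) ** 2 for gi, wi in zip(g, w))

b = cov(g, w) / cov(g, g)   # Cov(w, g) / Var(g)

# Moving the slope away from b in either direction increases the squared error
print(b, sse(b), sse(b - 0.1), sse(b + 0.1))
```

      The check says nothing about how the coefficient should be interpreted; it only confirms the numerical coincidence between the ratio and the least-squares slope.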

      All of this has already been discussed, repeatedly, in the literature.

      All of this has already been discussed, repeatedly, in the literature indeed. It is just that it does not engage with anything I write in the manuscript, or that I wrote in my other papers.

      Now turn to the present paper: the first sentence of the Abstract says "The generality of Hamilton's rule is much debated", and then the next sentence says "In this paper, I show that this debate can be resolved by constructing a general version of Hamilton's rule".

      This is correct.

      But immediately it's clear that this isn't really resolving the debate, what this paper is actually doing is asserting the correctness of the minority view (i.e. that Hamilton's rule as it currently stands is not a general result)

      It seems to me that the reason why this is “immediately clear” to this reviewer is that the reviewer has not processed the contents of the paper. I am not sure if I have to repeat this, but I am not saying that “Hamilton’s rule as it currently stands” is not general (in the sense that it always gets the direction of selection right). It is, and I say that it is a bunch of times. But so are other rules.

      and then attempting to build a more general form of Hamilton's rule upon that shaky foundation.

      I am not just “attempting to build a more general form of Hamilton's rule”. I did in fact build a more general form of Hamilton’s rule (where the generality refers to the richer set of reference statistical models).

      Predictably, the paper erroneously interprets the standard formulation of Hamilton's rule as a linear approximation and develops non-linear extensions to improve the goodness of fit for a result that is already exactly correct.

      Nowhere in the paper or the appendices do I describe the standard formulation of Hamilton’s rule (or, for that matter, any formulation of Hamilton’s rule) as an “approximation”. It is just not a word that has anything to do with this. If we are doing statistical inference, and the sum of squared errors that is minimized decreases by adding a variable in the statistical model with regard to which the sum of squared errors is minimized, then that will typically improve the goodness of fit. In statistics, this is not described as an improvement in how well the statistical model “approximates” the data, or whatever it is that the reviewer would suggest is being approximated here.

      This is not a convincing contribution. It will not change minds or improve understanding of the topic.

      There is indeed plenty of scope for this not to change minds or improve understanding of the topic. It will not change the minds or improve the understanding of those that are not really interested in getting this right. Obviously, it will also not convince those that do not read it.

      Nor is it particularly novel. Smith et al (2010, "A generalisation of Hamilton's rule for the evolution of microbial cooperation" Science 328, 1700-1703) similarly interpreted Hamilton's rule as a linear model and provided a corresponding polynomial expansion - usefully fitting the model to microbial data so as to learn something about the costs and benefits of cooperation in an empirical setting. It's odd that this paper isn't cited here.

      Let me begin by pointing to what I agree with. Given that Smith et al. (2010) and my manuscript are both in the business of generalizing Hamilton’s rule, it would be helpful to the reader if my paper included more information about how the two efforts relate. I will discuss the relation below, and I will also include that in Appendix B, and point to it in the main text. Before I do, however, I would like to point to two details in the review report that fit a pattern.

      The first is that the reviewer describes what Smith et al. (2010) do as “useful”, and seems to think of fitting polynomial expansions as a legitimate way to “learn something about the costs and benefits of cooperation in an empirical setting”. That sounds quite positive. My paper, in which I supposedly repeat this, however, is characterized as misguided. This fits a pattern; all of the reviews I received from the inclusive fitness community include a “done before”, and regularly the done before is described approvingly, while my paper is described as fundamentally flawed.

      Also customary is the lack of detail. What would be really useful here, is something like “equation A.14 in this manuscript is the same as equation 6 in Smith et al. (2010) if we choose […]”. This kind of statement would pin down the way in which what I do has been done before. That, however, would require going into detail, at the risk of finding out that what is done in my manuscript is actually quite different from what happens in Smith et al. (2010). That is also a recurrent thing. When I look up the done before, I typically find something that is not quite the same.

      Now on to the paper. What Smith et al. (2010) try to do is something that I wholeheartedly support. It is an empirical study that tries to capture non-linearity. A first point of order is that it is worth asking ourselves: linear or non-linear in what? For that, I would like to go back to the setup of my manuscript. Model 2 from the Main Text is

      In this fitness function, 𝑝<sub>i</sub> is the p-score of individual 𝑖 and 𝑞<sub>i</sub> is the p-score of the partner that individual 𝑖 is matched with. This is a standard model of social behaviour if 𝛽<sub>1,0</sub> < 0 and 𝛽<sub>0,1</sub> > 0. Such choices for 𝛽<sub>1,0</sub> and 𝛽<sub>0,1</sub> indicate that having a higher p-score decreases the fitness of individual 𝑖 and increases the fitness of its partner. Here we assume that 𝛼 = 1, 𝛽<sub>1,0</sub> = −1, and 𝛽<sub>0,1</sub> = 2. We assume that p-scores can only be 0 or 1, or, in other words, we assume that there are only cooperators and defectors in the population (or, in terms of Smith et al., 2010: cooperators and cheaters).

      For a well-mixed population, where the likelihood of being matched with a cooperator is the same for cooperators and defectors (it is equal to the frequency of cooperators for both), we can now plot the fitnesses of cooperators (red) and defectors (blue) as a function of the frequency of cooperators (Appendix 1-figure 6 left).

      We can do the same for a population with relatedness 𝑟, where the probability of being matched with a cooperator is 𝑟 + (1 − 𝑟)𝑓<sub>c</sub> for cooperators, and (1 − 𝑟)𝑓<sub>c</sub> for defectors, where 𝑓<sub>c</sub> is the frequency of cooperators (Appendix 1-figure 6 right). For relatedness 𝑟 = 0 and for the nonzero relatedness shown in the right panel, cooperation is selected against at every frequency.

      Increasing relatedness further, we would find that for 𝑟 = 1/2 the lines coincide, which implies that at every frequency, cooperation is neither selected for nor against. For 𝑟 > 1/2, cooperation will be selected for at every frequency. This pattern implies that, as we have seen in the manuscript, the classical Hamilton’s rule works perfectly fine for Model 2; with 𝑐 = −𝛽<sub>1,0</sub> = 1 and 𝑏 = 𝛽<sub>0,1</sub> = 2, cooperation is selected for if and only if 𝑟𝑏 > 𝑐. The fitnesses of cooperators and defectors as functions of the frequency of cooperators, moreover, are always parallel lines, regardless of relatedness.
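      The pattern described for Model 2 can be written out explicitly. The derivation below is a sketch under two assumptions that are not spelled out in the text above: that Model 2 has the linear form shown in the first line (with any noise term left out), and that a cooperator is matched with a cooperator with probability 𝑟 + (1 − 𝑟)𝑓<sub>c</sub> and a defector with probability (1 − 𝑟)𝑓<sub>c</sub>. The symbols 𝑤<sub>C</sub> and 𝑤<sub>D</sub> for the expected fitnesses of cooperators and defectors are introduced here for illustration.

```latex
% Assumed form of Model 2 (noise term, if any, omitted)
w_i = \alpha + \beta_{1,0}\, p_i + \beta_{0,1}\, q_i

% With \alpha = 1, \beta_{1,0} = -1, \beta_{0,1} = 2 and the assumed matching probabilities
\begin{aligned}
w_C(f_c) &= \alpha + \beta_{1,0} + \beta_{0,1}\,\bigl[r + (1-r) f_c\bigr] = 2r + 2(1-r) f_c \\
w_D(f_c) &= \alpha + \beta_{0,1}\,(1-r) f_c = 1 + 2(1-r) f_c \\
w_C(f_c) - w_D(f_c) &= \beta_{1,0} + r\,\beta_{0,1} = rb - c = 2r - 1
\end{aligned}
```

      Both lines have slope 2(1 − 𝑟), so they are parallel at every relatedness, and the gap between them does not depend on 𝑓<sub>c</sub>: it is negative for 𝑟 < 1/2, zero at 𝑟 = 1/2, and positive for 𝑟 > 1/2.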

      Model 3 in the main text extends Model 2 by adding an interaction term:

      Now we choose 𝛼 = 1, 𝛽<sub>1,0</sub> = −1, 𝛽<sub>0,1</sub> = 1, and 𝛽<sub>1,1</sub> = 1. We again draw the fitnesses of cooperators and defectors, both at relatedness 𝑟 = 0 (Appendix 1-figure 7 left) and at a nonzero relatedness (Appendix 1-figure 7 right). In the manuscript, I argue that the appropriate version of Hamilton’s rule here is Queller’s rule: 𝑟<sub>0,1</sub>𝑏<sub>0,1</sub> + 𝑟<sub>1,1</sub>𝑏<sub>1,1</sub> > 𝑐 with 𝑐 = −𝛽<sub>1,0</sub> = 1, 𝑏<sub>0,1</sub> = 𝛽<sub>0,1</sub> = 1, and 𝑏<sub>1,1</sub> = 𝛽<sub>1,1</sub> = 1. The fitnesses of cooperators and defectors as functions of the frequency of cooperators are still straight lines, but they are no longer parallel.

      The first thing to observe, therefore, is that a model with synergy, in which the classic version of Hamilton’s rule would be misspecified, and Queller’s rule would be well-specified, does not require the fitnesses as functions of the frequencies of cooperators to be non-linear. All that changes with the addition of the interaction term, is that they stop being parallel.
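      A small numerical sketch makes the contrast between the two models concrete. It assumes that the fitness function has the form 𝛼 + 𝛽<sub>1,0</sub>𝑝 + 𝛽<sub>0,1</sub>𝑞 + 𝛽<sub>1,1</sub>𝑝𝑞 (noise omitted) and that a focal individual with p-score 𝑝 meets a cooperator with probability 𝑟𝑝 + (1 − 𝑟)𝑓<sub>c</sub>; the function and variable names are hypothetical.

```python
def expected_fitness(p, f_c, r, alpha, b10, b01, b11=0.0):
    """Expected fitness of a focal individual with p-score p (0 or 1).

    Assumed matching rule: the partner is a cooperator with probability
    r + (1 - r) * f_c for cooperators and (1 - r) * f_c for defectors.
    """
    prob_coop_partner = r * p + (1 - r) * f_c
    return alpha + b10 * p + (b01 + b11 * p) * prob_coop_partner

def slope(p, r, **params):
    """Slope in f_c; fitness is linear in f_c, so two points suffice."""
    return expected_fitness(p, 1.0, r, **params) - expected_fitness(p, 0.0, r, **params)

MODEL_2 = dict(alpha=1.0, b10=-1.0, b01=2.0)            # no interaction term
MODEL_3 = dict(alpha=1.0, b10=-1.0, b01=1.0, b11=1.0)   # with interaction term

for name, params in (("Model 2", MODEL_2), ("Model 3", MODEL_3)):
    for r in (0.0, 0.5):
        print(name, "r =", r,
              "cooperator slope:", slope(1, r, **params),
              "defector slope:", slope(0, r, **params))
```

      Under these assumptions the cooperator and defector slopes are equal in Model 2 and differ in Model 3, while both fitnesses stay linear in the frequency of cooperators.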

      The paper by Smith et al. (2010) is an effort to capture non-linearities in the way fitnesses depend on the frequency of cooperators. That, therefore, goes beyond the step from Model 2 to Model 3. Whether it uses the right method to capture those non-linearities, we will come back to in a second, but it is important to realize that also without these non-linearities, the classic version of Hamilton’s rule can be too limiting to accurately describe selection. (Here, I should add that this implies that we were wrong in Wu et al. (2013), when we suggested that “for this experiment, it seems unnecessary to use the generalized Hamilton’s rule, if instead the Malthusian fitness is adopted. In other words, the Wrightian fitness approach calls for a generalization of Hamilton’s rule, whereas the Malthusian fitness approach does not (or at least not in a drastic way, as Malthusian fitnesses are almost linear in the frequency of cooperators).” Using Malthusian fitnesses, the functions were close to linear, but not close to parallel, and therefore also here, Hamilton’s rule needs generalizing - albeit in a different way than Smith et al. (2010) did.)

      The cooperation that is observed in the Myxococcus xanthus studied by Smith et al. (2010) is not a good match with a model where individuals are matched in pairs for an interaction that determines their fitnesses. These microbes cooperate in large groups, and a better match would therefore be the n-player public goods games studied in van Veelen (2018). There, we see that simple, straightforward ways to describe synergies (or anti-synergies) can easily lead to fitnesses not being linear in the frequency of cooperators.

      The way Smith et al. (2010) try to capture those non-linearities, however, is not free of complications. We addressed those in Wu et al. (2013), and I summarized them, briefly, in van Veelen (2018). One of the issues is that most of the non-linearity Smith et al. (2010) pick up is the result of considering Wrightian fitness rather than Malthusian fitness. In a continuous time model with a constant growth rate, the population size at time 𝑡 is 𝑁(𝑡) = 𝑒<sup>mt</sup>𝑁(0), where 𝑚 is the Malthusian fitness. In a discrete time model with a constant average number of offspring per individual, the population at time 𝑡 is 𝑁(𝑡) = 𝑤<sup>t</sup>𝑁(0), where 𝑤 is the Wrightian fitness. If we take 𝑚 = ln 𝑤, these are the same, and if 𝑤 is close to 1, then 𝑚 can be approximated by 𝑤 − 1. That also implies that if 𝑤 is close to 1 (or, equivalently, if 𝑚 is close to 0) one is locally linear if the other is too. However, in the experiment by Smith et al. (2010) the aggregate fitness effects are not small, and what is highly nonlinear in terms of Wrightian fitness is close to linear in Malthusian fitness.
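      The relation between the two fitness measures can be illustrated with a short numerical sketch. It assumes, purely for illustration, a Malthusian fitness that is exactly linear in the frequency of cooperators, 𝑚(𝑓) = 𝑠𝑓, and asks how far the corresponding Wrightian fitness 𝑤 = 𝑒<sup>m</sup> is from a straight line; the effect sizes and the function name `midpoint_gap` are hypothetical and not taken from the experiment.

```python
import math

def midpoint_gap(total_effect):
    """Gap at frequency 0.5 between the chord through w(0) and w(1) and w(0.5),
    where w(f) = exp(total_effect * f), i.e. Malthusian fitness is linear in f."""
    w0 = math.exp(0.0)
    w_half = math.exp(0.5 * total_effect)
    w1 = math.exp(total_effect)
    return (w0 + w1) / 2 - w_half

# m = ln w is close to w - 1 only when w is close to 1
for w in (1.01, 1.1, 2.0, 10.0):
    print(f"w = {w:5.2f}   ln w = {math.log(w):.4f}   w - 1 = {w - 1:.4f}")

# Small effects: Wrightian fitness is nearly linear too. Large effects: it is not.
for s in (0.02, 3.0):
    print(f"total effect s = {s}: midpoint gap = {midpoint_gap(s):.5f}")
```

      With a small total effect the gap is negligible, so both measures are close to linear together; with a large total effect a fitness that is exactly linear on the Malthusian scale is strongly curved on the Wrightian scale.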

      Another complication is that the Taylor coefficients that Smith et al. (2010) find are the result of a combination of the data and the functional form they choose to fit to their data first. That means that a different choice of functional form would have given different Taylor coefficients, while the in-between transformation can also be skipped. Also, the number of Taylor coefficients is larger than the dimensionality of the data, which are based on averages for 6 frequencies. For more details on these complications, I would like to refer to Wu et al. (2013) and van Veelen (2018). A nice detail is that if we consider the way the fitnesses of cooperators and defectors compare when using Malthusian fitnesses, then a comparison of the slopes actually suggests anti-synergies, which leads to a stable mix of cooperators and cheaters, already in the absence of population structure. This matches what is suggested by Archetti and Scheuring (2011, 2012) and Archetti (2018).

      Besides these technical complications, Smith et al. (2010) is also different, in the sense that it is an empirical paper. It does not contain the Generalized Price equation, it contains no insights regarding how to derive population genetic dynamics from the Generalized Price equation, or how to derive the appropriate rules from those, and it has a very different approach to separating fitness effects and population structure.

      To end on a positive note, I would like to quote a bit out of Wu et al. (2013):

      “While we criticise these mathematical issues, we are convinced that Smith et al. (2010) aim into the right direction: to incorporate the nonlinearities characteristic of biology into social evolution, we may have to extend and generalize the approach of inclusive fitness. It would be beautiful if such a generalization would ultimately include Hamilton’s original rule as a special case […].”

      I like to think that this is exactly what I have done in this paper.

      References

      Akdeniz, A., & van Veelen, M. (2020). The cancellation effect at the group level. Evolution, 74(7), 1246–1254. doi: 10.1111/evo.13995

      Allen, B., & Tarnita, C. E. (2012). Measures of success in a class of evolutionary models with fixed population size and structure. Journal of Mathematical Biology, 68, 109–143. doi: 10.1007/s00285-012-0622-x

      Archetti, M. (2018). How to Analyze Models of Nonlinear Public Goods. Games, 9(2), 17. doi: 10.3390/g9020017

      Archetti, M., & Scheuring, I. (2011). Coexistence of cooperation and defection in public goods games. Evolution, 65(4), 1140–1148. doi: 10.1111/j.1558-5646.2010.01185.x

      Archetti, M., & Scheuring, I. (2012). Review: Game theory of public goods in one-shot social dilemmas without assortment. Journal of Theoretical Biology, 299, 9–20. doi: 10.1016/j.jtbi.2011.06.018

      Bourke, A. F. G. (2014). Hamilton’s rule and the causes of social evolution. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1642), 20130362. doi: 10.1098/rstb.2013.0362

      Bourrat, P., Godsoe, W., Pillai, P., Gouhier, T. C., Ulrich, W., Gotelli, N. J., & van Veelen, M. (2023). What is the price of using the Price equation in ecology? Oikos, 2023(8). doi: 10.1111/oik.10024

      Crow, J. F. (2008). Commentary: Haldane and beanbag genetics. International Journal of Epidemiology, 37(3), 442–445. doi: 10.1093/ije/dyn048

      Fisher, R. (1930). The genetical theory of natural selection. Retrieved from https://www.cabidigitallibrary.org/doi/full/10.5555/19601600934

      Fletcher, J. A., & Zwick, M. (2006). Unifying the theories of inclusive fitness and reciprocal altruism. American Naturalist, 168(2), 252–262. doi: 10.1086/506529

      Frank, S. A. (1997). The Price equation, Fisher’s fundamental theorem, kin selection, and causal analysis. Evolution, 51(6), 1712–1729. doi: 10.1111/j.1558-5646.1997.tb05096.x

      Frank, S. A. (1998). Foundations of social evolution. Princeton: Princeton University Press.

      Frank, S. A. (2012). Natural selection. IV. The Price equation. Journal of Evolutionary Biology, 25(6), 1002–1019. doi: 10.1111/j.1420-9101.2012.02498.x

      Gardner, A., West, S. A., & Wild, G. (2011). The genetical theory of kin selection. Journal of Evolutionary Biology, 24(5), 1020–1043. doi: 10.1111/j.1420-9101.2011.02236.x

      Grafen, A. (1985a). A geometric view of relatedness. Oxford Surveys in Evolutionary Biology, 2(2), 28-89.

      Grafen, A. (1985b). News and Views. Evolutionary theory: Hamilton’s rule OK. Nature, 318(6044), 310–311. doi: 10.1038/318310a0

      Hamilton, W. D. (1964). The genetical evolution of social behaviour. I. Journal of Theoretical Biology, 7(1), 1–16. doi: 10.1016/0022-5193(64)90038-4

      Karlin, S., & Matessi, C. (1983). The eleventh R. A. Fisher Memorial Lecture - Kin selection and altruism. Proceedings of the Royal Society of London. Series B. Biological Sciences, 219(1216), 327–353. doi: 10.1098/rspb.1983.0077

      Matessi, C., & Karlin, S. (1984). On the evolution of altruism by kin selection. Proceedings of the National Academy of Sciences, 81(6), 1754–1758. doi: 10.1073/pnas.81.6.1754

      Nowak, M. A., Tarnita, C. E., & Wilson, E. O. (2010). The evolution of eusociality. Nature, 466(7310), 1057–1062. doi: 10.1038/nature09205

      Okasha, S. (2005). Maynard Smith on the levels of selection question. Biology and Philosophy, 20(5), 989–1010. doi: 10.1007/s10539-005-9019-1

      Page, K. M., & Nowak, M. A. (2002). Unifying evolutionary dynamics. Journal of Theoretical Biology, 219(1). doi: 10.1016/S0022-5193(02)93112-7

      Pillai, P., & Gouhier, T. C. (2019). Not even wrong: the spurious measurement of biodiversity’s effects on ecosystem functioning. Ecology, 100(7), e02645. doi: 10.1002/ecy.2645

      Price, G. R. (1970). Selection and Covariance. Nature, 227(5257), 520–521. doi: 10.1038/227520a0

      Price, G. R. (1972). Extension of covariance selection mathematics. Annals of Human Genetics, 35(4), 485-490.

      Queller, D. C. (1985). Kinship, reciprocity and synergism in the evolution of social behaviour. Nature, 318(6044), 366–367. doi: 10.1038/318366a0

      Queller, D. C. (1992a). A general model for kin selection. Evolution, 46(2), 376–380. doi: 10.1111/j.1558-5646.1992.tb02045.x

      Queller, D. C. (1992b). Quantitative Genetics, Inclusive Fitness, and Group Selection. The American Naturalist, 139(3), 540–558. doi: 10.1086/285343

      Queller, D. C. (2011). Expanded social fitness and Hamilton’s rule for kin, kith, and kind. Proceedings of the National Academy of Sciences, 108(supplement_2), 10792–10799. doi: 10.1073/pnas.1100298108

      Rousset, F., & Billiard, S. (2000). A theoretical basis for measures of kin selection in subdivided populations: Finite populations and localized dispersal. Journal of Evolutionary Biology, 13(5). doi: 10.1046/j.1420-9101.2000.00219.x

      Rousset, F. (2015). Regression, least squares, and the general version of inclusive fitness. Evolution, 69(11), 2963–2970. doi: 10.1111/evo.12791

      Smith, J., Van Dyken, J. D., & Zee, P. C. (2010). A generalization of Hamilton’s rule for the evolution of microbial cooperation. Science, 328(5986), 1700–1703. doi: 10.1126/science.1189675

      Sober, E., & Wilson, D. S. (2007). Unto others: The evolution and psychology of unselfish behavior. Retrieved from https://www.hup.harvard.edu/books/9780674930476

      Taylor, P. D. (1992). Altruism in viscous populations - an inclusive fitness model. Evolutionary Ecology, 6(4), 352–356. doi: 10.1007/bf02270971

      Taylor, P. D. (1989). Evolutionary stability in one-parameter models under weak selection. Theoretical Population Biology, 36(2), 125–143. doi: 10.1016/0040-5809(89)90025-7

      Taylor, P. D., Day, T., & Wild, G. (2007). Evolution of cooperation in a finite homogeneous graph. Nature, 447(7143), 469–472. doi: 10.1038/nature05784

      Van Cleve, J. (2015). Social evolution and genetic interactions in the short and long term. Theoretical Population Biology, 103. doi: 10.1016/j.tpb.2015.05.002

      van Veelen, M. (2005). On the use of the Price equation. Journal of Theoretical Biology, 237(4). doi: 10.1016/j.jtbi.2005.04.026

      van Veelen, M. (2007). Hamilton’s missing link. Journal of Theoretical Biology, 246(3). doi: 10.1016/j.jtbi.2007.01.001

      van Veelen, M. (2011). The replicator dynamics with n players and population structure. Journal of Theoretical Biology, 276(1). doi: 10.1016/j.jtbi.2011.01.044

      van Veelen, M. (2018). Can Hamilton’s rule be violated? eLife, 7. doi: 10.7554/eLife.41901

      van Veelen, M. (2020). The problem with the Price equation. Philosophical Transactions of the Royal Society B: Biological Sciences, 375(1797), 20190355. doi: 10.1098/rstb.2019.0355

      van Veelen, M., Allen, B., Hoffman, M., Simon, B., & Veller, C. (2017). Hamilton’s rule. Journal of Theoretical Biology, 414. doi: 10.1016/j.jtbi.2016.08.019

      van Veelen, M., García, J., Sabelis, M. W., & Egas, M. (2012). Group selection and inclusive fitness are not equivalent; the Price equation vs. models and statistics. Journal of Theoretical Biology, 299. doi: 10.1016/j.jtbi.2011.07.025

      Wilson, D. S., Pollock, G. B., & Dugatkin, L. A. (1992). Can altruism evolve in purely viscous populations? Evolutionary Ecology, 6(4), 331–341. doi: 10.1007/bf02270969

      Wu, B., Gokhale, C. S., van Veelen, M., Wang, L., & Traulsen, A. (2013). Interpretations arising from Wrightian and Malthusian fitness under strong frequency dependent selection. Ecology and Evolution, 3(5). doi: 10.1002/ece3.500

    1. Reviewer #2 (Public Review):

      The goal of the present study is to better understand the 'control objectives' that subjects adopt in a video-game-like virtual-balancing task. In this task, the hand must move in the opposite direction from a cursor. For example, if the cursor is 2 cm to the right, the subject must move their hand 2 cm to the left to 'balance' the cursor. Any imperfection in that opposition causes the cursor to move. E.g., if the subject were to move only 1.8 cm, that would be insufficient, and the cursor would continue to move to the right. If they were to move 2.2 cm, the cursor would move back toward the center of the screen. This return to center might actually be 'good' from the subject's perspective, depending on whether their objective is to keep the cursor still or keep it near the screen's center. Both are reasonable 'objectives' because the trial fails if the cursor moves too far from the screen's center during each six-second trial.

      This task was recently developed for use in monkeys (Quick et al., 2018), with the intention of being used for the study of the cortical control of movement, and also as a task that might be used to evaluate BMI control algorithms. The purpose of the present study is to better characterize how this task is performed. What sort of control policies are used? Perhaps more deeply, what kind of errors are those policies trying to minimize? To address these questions, the authors simulate control-theory style models and compare with behavior. They do so in both monkeys and humans.

      These goals make sense as a precursor to future recording or BMI experiments. The primate motor-control field has long been dominated by variants of reaching tasks, so introducing this new task will likely be beneficial. This is not the first non-reaching task, but it is an interesting one and it makes sense to expand the presently limited repertoire of tasks. The present task is very different from any prior task I know of. Thus, it makes sense to quantify behavior as thoroughly as possible in advance of recordings. Understanding how behavior is controlled is, as the authors note, likely to be critical to interpreting neural data.

      From this perspective - providing a basis for interpreting future neural results - the present study is fairly successful. Monkeys seem to understand the task properly, and to use control policies that are not dissimilar from humans. Also reassuring is the fact that behavior remains sensible even when task difficulty becomes high. By 'sensible' I simply mean that behavior can be understood as seeking to minimize error: position, velocity, or (possibly) both, and that this remains true across a broad range of task difficulties. The authors document why minimizing position and minimizing velocity are both reasonable objectives. Minimizing velocity is reasonable, because a near-stationary cursor can't move far in six seconds. Minimizing position error is reasonable, because the trial won't fail if the cursor doesn't stray far from the center. This is formally demonstrated by simulating control policies: both objectives lead to control policies that can perform the task and produce realistic single-trial behavior. The authors also demonstrate that, via verbal instruction, they can induce human subjects to favor one objective over the other. These all seem like things that are on the 'need to know' list, and it is commendable that this amount of care is being taken before recordings begin, as it will surely aid interpretation.

      Yet as a stand-alone study, the contribution to our understanding of motor control is more limited. The task allows two different objectives (minimize velocity, minimize position) to be equally compatible with the overall goal (don't fail the trial). Or more precisely, there exists a range of objectives with those two at the extremes. So it makes sense that different subjects might choose to favor different objectives, and also that they can do so when instructed. But has this taught us something about motor control, or simply that there is a natural ambiguity built into the task? If I ask you to play a game, but don't fully specify the rules, should I be surprised that different people think the rules are slightly different?

      The most interesting scientific claim of this study is not the subject-to-subject variability; the task design makes that quite likely and natural. Rather, the central scientific result is the claim that individual subjects are constantly switching objectives (and thus control policies), such that the policy guiding behavior differs dramatically even on a single-trial basis. This scientific claim is supported by a technical claim: that the authors' methods can distinguish which objective is in use, even on single trials. I am uncertain of both claims.

      Consider Figure 8B, which reprises a point made in Figures 1 and 3 and gives the best evidence for trial-to-trial variability in objective/policy. For every subject, there are two example trials. The top row of trials shows oscillations around the center, which could be consistent with position-error minimization. The bottom row shows tolerance of position errors so long as drift is slow, which could be consistent with velocity-error minimization. But is this really evidence that subjects were switching objectives (and thus control policies) from trial to trial? A simpler alternative would be a single control policy that does not switch, but still generates this range of behaviors. The authors don't really consider this possibility, and I'm not sure why. One can think of a variety of ways in which a unified policy could produce this variation, given noise and the natural instability of the system.

      Indeed, I found that it was remarkably easy to produce a range of reasonably realistic behaviors, including the patterns that the authors interpret as evidence for switching objectives, based on a simple fixed controller. To run the simulations, I assumed that subjects simply attempt to move their hand to oppose the cursor position. Because subjects cannot see their hand, I assumed modest variability in the gain, with a range from -1 to -1.05. I also assumed a small amount of motor noise in the outgoing motor command. The resulting (very simple) controller naturally displayed the basic range of behaviors observed across trials (see Image 1).

      Peer review image 1.

      Some trials had oscillations around the screen center (zero), which is the pattern the authors suggest reflects position control. In other trials the cursor was allowed to drift slowly away from the center, which is the pattern the authors suggest reflects velocity control. This is true even though the controller was the same on every trial. Trial-to-trial differences were driven both by motor noise and by the modest variability in gain. In an unstable system, small differences can lead to (seemingly) qualitatively different behavior on different trials.
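
      A fixed-gain controller of this kind can be sketched in a few lines. This is a minimal illustration and not the simulation behind Image 1. The plant dynamics (an unstable first-order system, dx/dt = lam * (x + hand)), the value of lam, the noise level, the time step, the starting position, and all function and parameter names are assumptions, since none of them is specified above. Only the gain range (-1 to -1.05) and the presence of motor noise come from the description.

```python
import random


def simulate_trial(seed, lam=2.0, dt=0.01, n_steps=600,
                   gain_range=(-1.05, -1.0), noise_sd=0.02, x0=0.05):
    """Simulate one 6 s trial (600 steps of 10 ms) with a fixed control rule.

    Assumed plant (hypothetical): dx/dt = lam * (x + hand), unstable when hand = 0.
    Controller: hand = gain * x + motor noise, with the same rule on every trial.
    """
    rng = random.Random(seed)
    # The gain is drawn once per trial: the subject cannot see the hand,
    # so the intended "equal and opposite" movement is slightly off.
    gain = rng.uniform(*gain_range)
    x = x0
    trace = [x]
    for _ in range(n_steps):
        hand = gain * x + rng.gauss(0.0, noise_sd)  # noisy motor command
        x += dt * lam * (x + hand)                  # Euler step of the assumed plant
        trace.append(x)
    return gain, trace


if __name__ == "__main__":
    for seed in range(5):
        gain, trace = simulate_trial(seed)
        peak = max(abs(v) for v in trace)
        print(f"seed={seed} gain={gain:.3f} max|x|={peak:.3f}")
```

      Under these assumptions the control rule is identical on every trial, and only the drawn gain and the noise sequence differ. The sketch omits any sensorimotor delay, so it should be read as a way to explore how far trials diverge under a single policy and not as a reproduction of the behavior described here.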

      This simple controller is also compatible with the ability of subjects to adapt their strategy when instructed. Anyone experienced with this task likely understands (or has learned) that moving the hand slightly more than 'one should' will tend to shepherd the cursor back to center, at the cost of briefly high velocity. Using this strategy more sparingly will tend to minimize velocity even if position errors persist. Thus, any subject using this control policy would be able to adapt their strategy via a modest change in gain (the gain linking visible cursor position to intended hand position).

      This model is simple, and there may be reasons to dislike it. But it is presumably a reasonable model. The nature of the task is that you should move your hand opposite where the cursor is. Because you can't see your hand, you will make small mistakes. Due to the instability of the system, those small mistakes have large and variable effects. This feature is likely common to other controllers as well; many may explicitly or implicitly blend position and velocity control, with different trials appearing more dominated by one versus the other. Given this, I think the study presents only weak evidence that individual subjects are switching their objective on individual trials. Indeed, the more parsimonious explanation may be that they aren't. While the study certainly does demonstrate that the control policy can be influenced by verbal instructions, this might be a small adjustment as noted above.

      I thus don't feel convinced that the authors can conclusively tell us the true control policy being used by human and monkey subjects, nor whether that policy is mostly fixed or constantly switching. The data are potentially compatible with any of these interpretations, depending on which control-style model one prefers.

      I see a few paths that the authors might take if they chose.
      --First, my reasoning above might be faulty, or there might be additional analyses that could rule out the possibility of a unified policy underlying variable behavior. If so, the authors may be able to reject the above concerns and retain the present conclusions. The main scientifically novel conclusion of the present study is that subjects are using a highly variable control policy, and switching on individual trials. If this is indeed the case, there may be additional analyses that could reveal that.
      --Second, additional trial types (e.g., with various perturbations) might be used as a probe of the control policy. As noted below, there is a long history of doing this in the pursuit system. That additional data might better disambiguate control policies both in general, and across trials.
      --Third, the authors might find that a unified controller is actually a good (and more parsimonious) explanation, which might be a good thing from the standpoint of future experiments. Interpretation of neural data is likely to be much easier if the control policy being instantiated isn't in constant flux.

      In any case, I would recommend altering the strength of some conclusions, particularly the conclusion that the presented methods can reliably discriminate amongst objectives/policies on individual trials. This is mentioned as a major motivation on multiple occasions, but in most of these instances, the subsequent analysis infers the objective only across trials (e.g., one must observe a scatterplot of many trials). By Figure 7, they do introduce a method for inferring the control policy on individual trials, and while this seems to work considerably better than chance, it hardly appears reliable.

      In this same vein I would suggest toning down aspects of the Introduction and Discussion. The Introduction in particular is overly long, and tries to position the present study as unique in ways that seem strained. Other studies have built links between human behavior, monkey behavior, and monkey neural data (for just one example, consider the corpus of work from the Scott lab that includes Pruszynski et al. 2008 and 2011). Other studies have used highly quantitative methods to infer the objective function used by subjects (e.g. Kording and Wolpert 2004). The very issue that is of interest in the present study - velocity-error-minimization versus position-error-minimization - has been extensively addressed in the smooth pursuit system. That field has long combined quantitative analyses of behavior in humans and monkeys, along with neural recordings. Many pursuit experiments used strategies that could be fruitfully employed to address the central questions of the present study. For example, error stabilization was important for dissecting the control policy used by the pursuit system. By artificially stabilizing the error (position or velocity) at zero, or at some other value, one can determine the system's response. The classic Rashbass step (1961) put position and velocity errors in opposition, to see which dominates the response. Step and sinusoidal perturbations were useful in distinguishing between models, as was the imposition of artificially imposed delays. The authors note the 'richness' of the behavior in the present task, and while one could say the same of pursuit, it was still the case that specific and well-thought through experimental manipulations were pretty critical. It would be better if the Introduction considered at least some of the above-mentioned work (or other work in a similar vein). 
While most would agree with the motivations outlined by the authors - they are logical and make sense - the present Introduction runs the risk of overselling the present conclusions while underselling prior work.

    1. While hot dog vendors have been part of the city’s gray market for decades, changes in state law in 2018 and 2022 removing illegal vending from the police code and streamlining health permits have led to a boom in their numbers. In response, the city started a campaign warning of foodborne illness risks and launched a vending task force, a multiagency enforcement team that issues fines and confiscates carts. But it’s a cat-and-mouse game. The workers are mostly undocumented immigrants from Central and South America, The Standard found through interviews with more than a dozen. Some have fled crime and violence. Many are seeking asylum and sending money home while they eke out an existence, one sale at a time. Others are victims of human trafficking: vulnerable people smuggled into the U.S. by groups to whom they are indebted.

      NUT GRAF -confirmed by Alex

    1. Rather than assume that game genres, platforms, or specific texts determine game play practice, we organize our description with different practices of play that emerged from our ethnographic material.

      How one plays a game says a lot more than the gameplay itself. I've seen my cousins and nephews exhibit unusual behavior with games.

    2. Our focus is not on the relation between individual kids and game content and representation, but rather on how game play practice and activity are situated within a broader set of cultural and social engagements and contexts.

      Game play has its effect on people depending on content, pace, and the subject as a whole.

    3. Gaming practices are extremely diverse in nature and form; gameplay is a complex and multilayered phenomenon.

      There are game genres that instill certain cognitive features when utilized depending on the genre.

    1. China has really begun to figure out how to take a leaf from the U.S. playbook and in a certain sense play that game better

      This quote captures the article’s theme: China is adopting the U.S.’s own economic tactics, perhaps more effectively.

    1. Le Prince d’Aquitaine à la tour abolie

      Eliot’s line “Le prince d’Aquitaine à la tour abolie,” which translates to “the prince of Aquitaine, his tower in ruins,” is a direct reference to the identical line in Gerard de Nerval’s poem El Desdichado. The “tour abolie” or “tower in ruins” references back to the “falling towers” from earlier in the section, and thus the “unreal city” referenced in “The Burial of the Dead” and in “A Game of Chess.” In these references, the city, which may seem at first glance to be bustling and full of life, is inverted upon further investigation. With “brown fog” and “tower in ruins,” the images that Eliot portrays of the urban environment are anything but inspiring. The “towers” that make up the city, falling or in ruins, are replicated in the structure of the poem, with the poem itself acting as an autonomous landscape, with the characters, Madame Sosostris, Tiresias, etc., going through the motions of life while surrounded by a world, or words, falling apart. Thus, at the end of The Waste Land, bringing this notion of “la tour abolie” to the forefront, the readers can end the poem, seeing the microcosm of references and urbanity crumbling beneath itself, with no hope of resurrection.

      In his annotation on the same line, Richard Lu compares the “tour abolie” and “seule étoile” (in the following line of El Desdichado) to the tarot deck. He states that “in tarot decks, The Star directly follows The Tower card. The Tower is often called the most dreadfall card as it often implies a sudden disaster which instates a change in your world. This change is extreme. However, after the tower falls, the Star appears, the tarot card of hope. However, Eliot ends with this reference. The Star is dead, there may not be a hope after the destruction of The Tower.” Concluding “The Waste Land,” this line, among others, works to support the desolation laid out since the very first line. Though approached with a dead and barren landscape, throughout the poem one can find a glimmer of hope, an ounce of inspiration between the clever minds and many characters that prolong the narrative. This ending defies all of that. Eliot, through all five parts of the poem, sets up an ending where divinity and faith have no place. There is no hope of The Star tarot card coming next; it is dead. There is no God, nothing matters. In the wake of destruction, among the “falling towers” in ruins, one finally becomes as desolate and depressing as one's surroundings.

    2. What you get married for if you don’t want children?

      Between his textual narrative of Lil and his reference to Ophelia, Eliot examines the contrast and connection between love, virginity, purity, and exploitation. Here, the speakers discuss how Lil is not taking care of her appearance and will not be appealing to her husband, then one says “What you get married for if you don’t want children?” This implies that the sole reasons for marriage are sexuality and having children, while the concept of love is not mentioned. The final line of “A Game of Chess” is “Good night, ladies, good night, sweet ladies, good night, / good night,” which is a reference to Ophelia’s farewell in Hamlet prior to her suicide. By ending the passage with Ophelia’s words of distress, it is implied that Ophelia’s situation is very significant to Eliot’s message. Ophelia says, in the excerpt from Hamlet, that “To-morrow is Saint Valentine's day, All in the morning betime, And I a maid at your window, To be your Valentine. Then up he rose, and donn'd his clothes, And dupp'd the chamber-door; Let in the maid, that out a maid Never departed more.” She mentions arriving at Hamlet’s window as a virgin (a maid – older description for a young unmarried virgin), looking to be his Valentine, or to find love. Ophelia leaves this meeting no longer a maid, or a virgin. Afterwards, Ophelia adds, “Before you tumbled me, You promised me to wed. So would I ha' done, by yonder sun, An thou hadst not come to my bed.” Hamlet promised to marry her, which she was ready for, but instead just slept with her and then shunned her. Ophelia seems to feel used and exploited for her body. Lil, having already lived a life of five children, chooses to resist the need to appease her husband with superficial changes to her appearance. It is almost as if Lil lived Ophelia’s life, but continued living with a different mindset, though she is still subject to the same expectations and judgement. 
The difference between the two women is that Lil experiences this while married, and Ophelia is a young unmarried woman. Considering the time period of the piece, having lost her virginity to a man who decides not to marry her after all, Ophelia is left “ruined” and “dishonored” in society and in future romantic relationships. Essentially, by taking her virginity without marrying her, Hamlet has sentenced Ophelia to a life without the authentic love she originally desired. Left without clear choices and grieving the loss of her father, Ophelia becomes mentally unstable and feels that she has no other option than suicide by drowning. This is significant, because water is most often viewed as spiritually pure, especially as the medium for baptism. At the start of one’s life, they are baptized, and at the end of Ophelia’s life, she drowns. So, at line 170, when the women in the bar say “goonight” to each other, they are just going home for the night. In the final line, however, when the farewells shift to Ophelia’s voice, she is saying goodbye to the “Game of Chess” – the “game” of a woman experiencing sexual exploitation and a loss of pure connection – and transitioning the reader to the next section where water (the River Thames) becomes polluted and “impure,” as well.

    3. nightingale

      Eliot’s “A Game of Chess” and its referenced sources characterize women (or the queen piece) as the real pawns of society, exploited by men (the king piece) despite their power. Eliot begins the section with “The Chair she sat in, like a burnished throne, / Glowed on the marble…” In older versions of chess, specifically the marble-like Lewis chessmen, the queen piece sits on an elaborate throne, cradling her head in her hand with a tired expression. So, Eliot's description aligns closely with the chess piece of the Queen. At the same time, this description is a direct reference to Antony and Cleopatra: “The barge she sat in, like a burnish'd throne, / Burn'd on the water.” So, A Game of Chess begins with Cleopatra, the queen of Egypt and one of the most well known women of immense power labeled a seductress. In fact, the six assigned sources all display women used as scapegoats, always described but never given a chance to stand up for themselves. They are used as pawns in literature, society, and history. Further, these women are almost all associated deeply with snakes, or a symbol for the devil in many works. The foundation of this comparison is shown in Paradise Lost, as Eve is tempted by a serpent, or the devil, and then is blamed alongside the serpent for eternity. Notably, Cleopatra kills herself with an asp, or a serpent, to escape a future of humiliation at the expense of being forever silenced. In Ovid, Philomela’s tongue is cut out because the king dislikes her words, and her severed tongue is compared to a snake. By taking away her tongue, or her voice, the king seems to believe he has stunted her ability to tempt and manipulate. 
In Baudelaire, he writes “The haunches slightly sharp, and the waist sinuous / As a snake poised to strike, / That she's still quite young!” Even as the woman is described in an undone state, she is still viewed as “a snake poised to strike.” Tying these references back to the text, Eliot argues through his characterization of “ the nightingale Filled all the desert with inviolable voice And still she cried, and still the world pursues,” that these women are labeled snakes, always poised to strike and poison others with their cunning manipulation, while they are truly nightingales, only afforded a grieving voice in the night. The thread is clear of women being exploited by men then blamed by those same men and the rest of society without a chance to share their voice.

    4. good night.

      “RON: Once I make my move, the queen will take me. Then you’re free to check the king. HARRY: No. Ron, no! HERMIONE: He’s going to sacrifice himself. RON: Do you want to stop Snape from getting that stone or not? Harry, it’s you that has to go on. I know it. Not me. Not Hermione. You. Knight to H3.” The scene feels like the final game, the sacrifice, the victory THE GAME. Pound’s The Game of Chess works that way too. The definite article locks the world in structure: “Red knights, brown bishops, bright queens.” Everything burns with precision, every piece belongs to the pattern. “The” implies consequence. Each move means something, each color holds.

      But Eliot’s A Game of Chess loosens that grip. A, not the. Suddenly the game isn’t singular or grand but one of many, maybe endless. The definite becomes indefinite, the sacrifice hollow. “‘My nerves are bad tonight. Yes, bad. Stay with me.’” The moves don’t land, “‘What shall I do now? What shall I do?’” There’s no check, no king, just exhaustion masquerading as strategy. The board gleams under purpose; a board flickers under repetition. Pound’s definite article closes the frame; Eliot’s indefinite article opens it until it collapses. THE Game demands sacrifice. A Game doesn’t even notice one’s been made. You don’t know A game is over, until THE game starts, and it's time to say goodnight.

    5. Unstoppered, lurked her strange synthetic perfumes, Unguent, powdered, or liquid—troubled, confused

      “Mirror Mirror on the wall, who is the fairest of them all?” (Snow White), the evil queen utters, jealous of the pure perfect young girl. While Eliot does not write of a pristine princess (rather a queen), there is much regality in A Game of Chess, seeming to draw from Baudelaire’s opulent descriptions in A Martyred Woman. What fascinated me the most in both pieces is their olfactory descriptions, more specifically their representation of perfume, and its reflection or rather refraction. These two lines seem to reflect one another visually, both beginning with Un–; however, that is really the extent of their mirroring. “Unstoppered, lurked her strange synthetic perfumes, Unguent, powdered, or liquid, troubled, confused.” The prefix un- does more than repeat; it undoes. To unstopper a bottle is to release what was meant to remain contained, while unguent evokes the oily, ceremonial luxury of queens and corpses alike. The language itself opens and unravels, performing the act of unsealing that the scene describes. Yet what spills out is not clarity but confusion. The perfumes, once symbols of beauty and refinement, have become dense, chemical, and suffocating. In Baudelaire’s A Martyred Woman, perfume occupies the same paradox. The room is decadent yet dying, filled with “perfume flasks” and “bouquets exhaling their final breath.” The air is both intoxicating and fatal, heavy with sweetness that edges toward rot. Eliot refracts this atmosphere into the modern world; his perfumes are “synthetic,” their allure artificial. Where Baudelaire’s fragrance veils decay in beauty, Eliot’s amplifies decay through imitation. Perfume, then, becomes a mirrored contradiction, both attraction and repulsion, luxury and poison. Its scent seduces even as it suffocates. In both poets, the air itself becomes a reflection of moral and physical decay, a beautiful corruption, a sweetness turned stale, lingering long after life has left the room.

    6. The river sweats Oil and tar

      The idea of a river sweating is peculiar. Sweating is the release of a liquid from the body (from the body’s sweat glands). It is a crucial bodily function for regulating temperature, and the cooling effect occurs when sweat evaporates from the skin, absorbing excess heat from the body to do so. If a river releases liquid—oil and tar—isn’t it just to mix with the rest of the water again? Actually—some light oils can evaporate in large bodies of water, if the water is warm and the surface area is large. But it is clearly heavy oils that are meant (like crude oil or motor oil), particularly with tar following. The river sweating here and human sweating remain the same in the most basic sense: both are a function to aid the source, a kind of homeostatic regulation. In contrast, the river's sweating functions to rid it of waste—explicitly material, but probably extending to moral, spiritual, emotional. Unfortunately, this effort is doomed from the start—in a horribly looping, muddled way in that it goes round and round with the oil and tar being supposedly somewhat separated from the rest of the water, or perhaps consolidated, just to mix in again. This process was doomed from the start, but the river wasn’t, until the arrival of the oil and tar. Their presence is obtrusively unnatural. This must be remembered as the following phrases wash over with their sense of predetermination/overarching control: “The barges drift / With the turning tide…sails…swing…The barges wash / Drifting logs…”

      What jumped out to me as mirroring this futile process of mixing was the lines “Weialala leia / Wallala leialala”—repeated twice, with the third iteration being just “la la.” In looking back at some past annotations, I noted that Jeannie ’25 made a comment on how “Weialala leia” can be read as a wail. In fact, when attempting to read it out loud, it sounded as though I was reciting a string of jumbled duplications of the word “wail,” and as one, it definitely sounded like a wail. This word is what the letters form, mixed around—but as more and more is added distortion grows, not clarity. In Götterdämmerung, this wailing call is for Siegfried, to return the ring, and thus the Rhine-gold. While initially appearing successful as Siegfried enters right after the first call, subsequent ones fall upon deaf ears—along with the rest of the words of the Rhine Daughters. It is almost as though between when the words leave their mouths and when they reach the ears of Siegfried—or the reader—there is a warping. The Rhine Daughters’ final “La! la!” feels like a dwindling final attempt with recognition of the futility—the “w” is gone—or perhaps marks an even greater warping. Eliot exaggerates this further, with his third repetition being just “la la.” All of this seems to me to very much connect to female voice in the poem—particularly the line “And still she cried, and still the world pursues, / ‘Jug Jug’ to dirty ears” in “A Game of Chess.”

    7. Her brain allows one half-formed thought to pass: 'Well now that’s done: and I’m glad it’s over.'

      I found this line interesting as it adds to the discourse on female voice, and it connects to the curious section of dialogue in “A Game of Chess.”

      In the woman’s encounter with the “young man carbuncular,” she is given no agency—or, rather, no action at all on her part is marked. Not a single action verb follows the pronoun “she.” “Is” is not an action verb (it is a linking verb), and it is used to describe her state as “bored and tired”—thus a mere projection by Tiresias (a man). In fact, lack of action is what is marked. The man’s “caresses” are “unreproved,” his “exploring hands” encounter “no defense,” and his “vanity…makes a welcome of indifference.” The woman’s entire functioning appears to be gone. The connection to John Donne’s Elegy XIX. To His Mistress Going to Bed furthers this.

      The one thing left is her voice—not outer (clearly) but inner. But this is operating on the lowest level. It is a singular (“one”), “half-formed” thought that comes. What is interesting is that this is all “her brain allows”—she is stopping herself, or rather a part of her or something inside of her is, as evidently there are two sides. But these two sides—her brain and from wherever thoughts issue (also the brain?)—are natural, integral parts of the self. What is going on here?

      But there is definitely a control running through the encounter—and perhaps extending beyond?—as if the whole thing had been laid out. The woman, afterwards, “smoothes her hair with automatic hand” and “puts a record on the gramophone”—set to go round and round and round. The reference to The Vicar of Wakefield appears to add to this, and so does the odd use of the colon, somehow urging an inevitability. I think it has all been laid out—with Tiresias, who can see the future, looming over the scene.

      In terms of connection to the dialogue in “A Game of Chess,” there seems to be some sort of comment on women’s agency and what is asked of them: “‘Speak to me. Why do you never speak. Speak. / ‘What are you thinking of? What thinking? What? / ‘I never know what you are thinking. Think.’” I would like to explore this further.

    8. HURRY UP PLEASE ITS TIME

      This line can be read in two ways (at least). On a first pass—at least this was very much the case for me—it reads with “please” as an adverb, as it is often used in requests or questions. There is a problem then with “its” which is read as “it’s” as in “it is time.” But this error feels fitting—the repetition of the line and its formatting in all caps create a sense of urgency, a rush from which this mistake could ensue. It feels that punctuation has been omitted in a similar fashion.

      But “its” could also be read as is, in which case it is a possessive pronoun—“time” belongs to “it.” “Please” is then an imperative verb. I am leaning towards this reading, as it feels slightly hidden (quite Eliot-like), and plays into the question of agency I have been exploring in a number of my annotations.

      So what is the “it”? I think “it” refers to some greater force, power, or overarching structure, and here it feels clear that this is the game of chess—which is often played with time constraints.

      Chess appears to be ruling this section of the poem, especially the parts pertaining to women. I find chess very interesting in that, in looking at its set-up, it is suggested that the queen, as the most powerful piece (being able to move as she does), should be the most secure. Yet the rules define winning as capturing the king. The queen’s role is one of sacrifice, to protect the king, and in doing so almost always meets her demise. The women referenced in “A Game of Chess” follow this arc. They hold the immense power of “love,” but somehow this is, in each case, twisted to serve men and then lead to their death. It seems that Lil will meet a similar end, with the last line on page 59 being a reference to (some of) Ophelia’s last words (where she is speaking about herself?). The ties to Middleton’s A Game at Chess and its sexual interpretations of the game link these two ideas more firmly.

      In Pound’s The Game of Chess there is a pattern of lines on the page that repeats four times. It is a sequence of one line and then the line below it being indented (a couple times?). The space, notably, forms a clear “angle” and an uppercase “l” if rotated 180°. This pattern/spacing, even exaggerated a bit, is replicated twice with lines 117-120 in “A Game of Chess.” This stands out against the formatting up to this point. Now the section is physically fitting into “the game.”

      Eliot made the title “A Game of Chess”—not “The Game of Chess” (Pound) or “A Game at Chess” (Middleton). “A Game of Chess” feels more open and less defining than “The Game of Chess.” There is some room. But “A Game at Chess” feels more action-oriented. As always with Eliot, I feel there is back-and-forth.

    1. Below is a fake pronunciation guide on youtube for “Hors d’oeuvres”: Note: you can find the real pronunciation guide here [g25], and for those who can’t listen to the video, there is an explanation in this footnote[1] In the youtube comments, some people played along and others celebrated or worried about who would get tricked

      This reminds me of the curious case of the popular YouTuber SIivaGunner. SIivaGunner has been on the internet since the early 2010s, and their content has focused on uploading high-quality songs from various video games. If you were to look at their channel, you'd see just that: videos of video game songs labeled accordingly. At least, that's what it seems. If you were to watch any of these videos, you may quickly realize that the songs are slightly, if not very, different from what you would expect. That is the crux of SIivaGunner: they upload songs that seem to be accurate riffs from the game they're from, but the songs have instead been altered and remixed to reference and sound like another song entirely. This is technically trolling, but in a harmless and fun way, with people loving the altered songs and memes. That is, until the channel got banned by YouTube for "false thumbnails". The channel actually got banned multiple times, and each time the team made a new channel with a similar name (e.g., SilvaGunner, GIlvaSunner). The YouTube channel is mostly safe as of now thanks to the workaround they came up with, where they give the song titles a seemingly true but made-up version label, such as "Beta Mix" or "JP Version".

  6. social-media-ethics-automation.github.io social-media-ethics-automation.github.io
    1. MMO Video Game News, Reviews & Games List. December 2023. URL: https://www.mmorpg.com/ (visited on 2023-12-05).

      The link leads to a website where people can discuss, review, and rate video games. They can also read the latest news, developments, and updates on current or future games they are interested in.

    1. If the trolls claim to be nihilists about ethics, or indeed if they are egoists, then they would argue that this doesn’t matter and that there’s no normative basis for objecting to the disruption and harm caused by their trolling. But on just about any other ethical approach, there are one or more reasons available for objecting to the disruptions and harm caused by these trolls! If the only way to get a moral pass on this type of trolling is to choose an ethical framework that tells you harming others doesn’t matter, then it looks like this nihilist viewpoint isn’t deployed in good faith[1]. Rather, with any serious (i.e., non-avoidant) moral framework, this type of trolling is ethically wrong for one or more reasons (though how we explain it is wrong depends on the specific framework).

      I think the section on Trolling and Nihilism raises an important point that some trolling communities don’t just push boundaries playfully but actually seem to treat ethics as irrelevant. What struck me is how this can lead to a kind of moral vacuum where the harm to people or groups is dismissed as part of the “game” of disruption.

    1. In the early Internet message boards that were centered around different subjects, experienced users would “troll for newbies” by posting naive questions that all the experienced users were already familiar with. The “newbies” who didn’t realize this was a troll would try to engage and answer, and experienced users would feel superior and more part of the group knowing they didn’t fall for the troll like the “newbies” did. These message boards are where the word “troll” with this meaning comes from.

      I think this kind of trolling of newbies has become easier on X now that verification is a paid feature. Previously, you could probably tell who was real and who was fake based on profile picture, profile info, etc., but now it's a whole new ball game. Anyone can pay $10 for a verification check, so there's no real distinction between what's real and what's fake.

    1. There’s a saying in Japan, “All of us are smarter than any one of us.” And I would say that all of us are better than any one of us, no matter what the game is, business or sports.

      This makes me think of my club lacrosse motto. It is We>Me. We had that written on all our shirts one year.

    1. Reviewer #1 (Public review):

      The authors conducted a series of experiments using two established decision-making tasks to clarify the relationship between internalizing psychopathology (anxiety and depression) and adaptive learning in uncertain and volatile environments. While prior literature has reported links between internalizing symptoms - particularly trait anxiety - and maladaptive increases in learning rates or impaired adjustment of learning rates, findings have been inconsistent. To address this, the authors designed a comprehensive set of eight experiments that systematically varied task conditions. They also employed a bifactor analysis approach to more precisely capture the variance associated with internalizing symptoms across anxiety and depression. Across these experiments, they found no consistent relationship between internalizing symptoms and learning rates or task performance, concluding that this purported hallmark feature may be more subtle than previously assumed.

      Strengths:

      (1) A major strength of the paper lies in its impressive collection of eight experiments, which systematically manipulated task conditions such as outcome type, variability, volatility, and training. These were conducted both online and in laboratory settings. Given that trial conditions can drive or obscure observed effects, this careful, systematic approach enables a robust assessment of behavior. The consistency of findings across online and lab samples further strengthens the conclusions.

      (2) The analyses are impressively thorough, combining model-agnostic measures, extensive computational modeling (e.g., Bayesian, Rescorla-Wagner, Volatile Kalman Filter), and assessments of reliability. This rigor contributes meaningfully to broader methodological discussions in computational psychiatry, particularly concerning measurement reliability.

      (3) The study also employed two well-established, validated computational tasks: a game-based predictive inference task and a binary probabilistic reversal learning task. This choice ensures comparability with prior work and provides a valuable cross-paradigm perspective for examining learning processes.

      (4) I also appreciate the open availability of the analysis code that will contribute substantially to the field using similar tasks.

      Weaknesses:

      (1) While the overall sample size (N = 820 across eight experiments) is commendable, the number of participants per experiment is relatively modest, especially in light of the inherent variability in online testing and the typically small effect sizes in correlations with mental health traits (e.g., r = 0.1-0.2). The authors briefly acknowledge that any true effects are likely small; however, the rationale behind the sample sizes selected for each experiment is unclear. This is especially important given that previous studies using the predictive inference task (e.g., Seow & Gillan, 2020, N > 400; Loosen et al., 2024, N > 200) have reported non-significant associations between trait anxiety symptoms and learning rates.

      (2) The motivation for focusing on the predictive inference task is also somewhat puzzling, given that no cited study has reported associations between trait anxiety and parameters of this task. While this is mitigated by the inclusion of a probabilistic reversal learning task, which has a stronger track record in detecting such effects, the study misses an opportunity to examine whether individual differences in learning-related measures correlate across the two tasks, which could clarify whether they tap into shared constructs.

      (3) The parameterization of the tasks, particularly the use of high standard deviations (SDs) of 20 and 30 for outcome distributions and hazard rates of 0.1 and 0.16, warrants further justification. Are these hazard rates sufficiently distinct? Might the wide SDs reduce sensitivity to volatility changes? Prior studies of the circle version of this predictive inference task (e.g., Vaghi et al., 2019; Seow & Gillan, 2020; Marzuki et al., 2022; Loosen et al., 2024; Hoven et al., 2024) typically used SDs around 12. Indeed, the Supplementary Materials suggest that variability manipulations did not seem to substantially affect learning rates (Figure S5), which calls into question whether the task manipulations achieved their intended cognitive effects.

      (4) Relatedly, while the predictive inference task showed good reliability, the reversal learning task exhibited only "poor-to-moderate" reliability in its learning-rate estimates. Given that previous findings linking anxiety to learning rates have often relied on this task, these reliability issues raise concerns about the robustness and generalizability of conclusions drawn from it.

      (5) As the authors note, the study relies on a subclinical sample. This limits the generalizability of the findings to individuals with diagnosed disorders. A growing body of research suggests that relationships between cognition and symptomatology can differ meaningfully between general population samples and clinical groups. For example, Hoven et al. (2024) found differing results in the predictive inference task when comparing OCD patients, healthy controls, and high- vs. low-symptom subgroups.

      (6) Finally, the operationalization of internalizing symptoms in this study appears to focus on anxiety and depression. However, obsessive-compulsive disorder is also generally considered an internalizing disorder, which presents a gap in the current cited literature of the paper, particularly when there have been numerous studies with the predictive inference task and OCD/compulsivity (e.g., Vaghi et al., 2019; Seow & Gillan, 2020; Marzuki et al., 2022; Loosen et al., 2024; Hoven et al., 2024), rather than trait anxiety per se.
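
      The sample-size concern in point (1) can be made concrete with a rough power calculation. The sketch below is an illustration rather than part of the review: it uses the standard Fisher z approximation for a Pearson correlation, and the two-sided alpha of 0.05 and target power of 0.80 are assumed conventions, not values taken from the paper. The function name `required_n` is hypothetical.

```python
import math
from statistics import NormalDist


def required_n(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate N needed to detect a true correlation r (two-sided test).

    Fisher z approximation: atanh(sample r) is roughly normal with
    standard error 1 / sqrt(N - 3). alpha and power are assumed defaults.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    n = ((z_alpha + z_power) / math.atanh(r)) ** 2 + 3
    return math.ceil(n)


# Effect sizes in the range the review calls typical (r = 0.1-0.2)
for r in (0.1, 0.2):
    print(f"r = {r}: N of about {required_n(r)}")
```

      Under these assumed settings, detecting r = 0.2 calls for a sample of roughly 200, and r = 0.1 for close to 800, which is why a total of N = 820 split across eight experiments leaves each experiment with limited power for effects of this size.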

      Overall:

      Despite the named limitations, the authors have done very impressive work in rigorously examining the relationship between anxiety/internalizing symptoms and learning rates in commonly used decision-making tasks under uncertainty. Their conclusion is well supported by the consistency of their null findings across diverse task conditions, though its generalizability may be limited by some features of the task design and its sample. This study provides strong evidence that will guide future research, whether by shifting the focus of examining dysfunctions of larger effect sizes or by extending investigations to clinical populations.

    1. Ethical guidelines govern AI integration, emphasizing data privacy and ethical implications. Transparent communication is crucial for addressing ethical concerns in AI’s contribution to game world creation.

      Although the author mentions and repeats the importance of transparency and keeping players safe, the author doesn't provide any examples.

    2. The integration of artificial intelligence (AI) in game development opens up new possibilities for the industry. The role of AI is crucial in enhancing personalized experiences for younger players.

      This section shows that the article mainly takes an industry perspective on AI and ethics. It focuses on how AI is improving the industry rather than on the ethical issues.

    3. AI systems and algorithms drive game content creation, optimizing difficulty levels to match player skills and improving game progression. These systems analyze player data, preferences, and behavior, allowing game designers to create personalized experiences. Furthermore, AI algorithms can be utilized to design realistic game characters, each with their unique behavior patterns, adding depth to gameplay. However, ethical considerations arise when using AI to influence game design, including the potential perpetuation of harmful stereotypes or privacy concerns related to player data.

      This section shows awareness of the ethical challenges AI raises, such as player privacy. It also mentions that transparent communication is essential when using AI in video games. The article was published in 2023, which means this issue had already been recognized by then.

    1. Game playing programs have done much to popularize AI

      Game-playing programs can play simple games and are now getting better at harder games.

    1. These children taught me that tables do not exist. That anything does. And they did it every day with a simple game over and over and over. Of course, it works with anything. And I finally called that game "Let's destroy a table." (Laughter) Or "Let's destroy anything,"

      for - language - game - let's destroy anything - adjacency - game - let's destroy anything - Buddhist teachings on interdependent origination - this game reminds me of Buddhist teachings on interdependent origination - nothing really has an essential nature - if you try to look for it in its parts, you won't find it

    1. Overview of Professions: Outlook and Requirements

      Summary

      This document provides a comprehensive overview of the information presented on a wide range of professions, from engineering to healthcare, including the arts and law.

      Analysis of the data reveals several cross-cutting themes. First, access to many specialized professions, particularly in healthcare and technology, requires a long and rigorous academic path, often at the Bac+5 level and sometimes beyond Bac+10. Second, professional success consistently rests on a dual skill set: sharp technical expertise ("hard skills") and solid interpersonal qualities ("soft skills") such as communication, stress management, and teamwork.

      Third, the notion of "vocation" or "passion" is an essential driver, particularly in demanding fields that require significant personal sacrifices.

      Finally, the job market is characterized by wide variation in pay, not only between sectors but also according to experience and status (salaried, self-employed, public, private), and by the ever-present need for continuing education to adapt to technological and regulatory change.

      --------------------------------------------------------------------------------

      Engineering and Technology

      This section covers professions at the heart of innovation, design, and technological development. They require strong scientific expertise and the ability to solve complex problems.

      Railway Engineer

      Role and Duties: Manage the safety of rail traffic by designing and maintaining robust systems. Collaborates with a variety of departments such as construction and public works (BTP), architecture, and environment.

      Education and Degrees: Bac+5 minimum, through an engineering school (e.g., École des Ponts ParisTech, Conservatoire national des arts et métiers).

      Required Skills and Qualities: Inventive, strategic, organized, curious, able to lead a team.

      Compensation: Net monthly salary starting at €2,250, rising to as much as €5,000 at the end of a career.

      Pros and Cons:

      ◦ Pros: Fast-growing sector that is hiring, versatility, good career progression, numerous bonuses.

      ◦ Cons: Long and demanding studies, constant need to keep up to date, working hours that are sometimes excessive.

      Automotive Engineer

      Role and Duties: Create, develop, and build vehicle parts to optimize current and future models. Works from a specification, carries out calculations, computer simulations, and prototype tests.

      Education and Degrees: Bac+5 from an engineering school, or a master's degree in mechanics/electronics.

      Required Skills and Qualities: Passionate, imaginative, rigorous, persevering. Also requires good eyesight, good physical condition, and dexterity.

      Working Conditions: Mainly office-based but may require travel. Weeks of 35 to 40 hours, or more depending on the project.

      Compensation: Around €2,000 net/month at the start of a career. Gross annual salary can rise from €37,300 to €98,000 by the end of a career.

      Career Prospects: Easy to find employment, although the first steps after graduation can be difficult.

      Artificial Intelligence (AI) Engineer

      Role and Duties: Design and develop algorithms capable of learning and making decisions autonomously. Programs AI models and analyzes large quantities of data.

      Education and Degrees: Bac+5 (engineering degree or specialized master's in AI). The NSI, Maths, and Physics specializations are recommended in high school. Continuing education is essential.

      Required Skills and Qualities: Programming (Python, R), command of applied mathematics, analytical mind, rigor, creativity, curiosity.

      Working Conditions: Teamwork with data scientists and developers, flexible hours, remote work is common.

      Compensation:

      ◦ Source 1: Annual salary of €45,000 to €55,000 at the start of a career, which can exceed €100,000 with experience.

      ◦ Source 2: Gross monthly salary of €3,500 to €4,500 for a beginner, which can reach €7,000 or more.

      Career Prospects: Booming sector with very strong demand and opportunities in many fields (healthcare, finance, automotive).

      Aeronautical and Aerospace Engineer

      Role and Duties: Design, develop, test, and improve aircraft (airplanes, helicopters, drones) and spacecraft (rockets, satellites). Works on components, engines, and navigation systems.

      Education and Degrees: Bac+5 minimum, through a specialized engineering school (ENAC, ESTACA, IPSA) or a generalist one.

      Required Skills and Qualities: Solid knowledge of fluid mechanics, aerodynamics, thermodynamics, and materials. Command of English, team spirit, ability to work under pressure.

      Working Conditions: Practiced in factories, design offices, or agencies (NASA, ESA). Frequent travel to work sites.

      Compensation:

      ◦ Aeronautical engineer: Starts at €3,400 gross/month and can reach €123,000/year with experience.

      ◦ Aerospace engineer: Annual salary of €40,000 to €50,000 straight out of school, up to €80,000 after 10 years, and can exceed €100,000.

      Career Prospects: Sector in high demand. Progression toward project manager, technical director, or expert roles.

      Chemical Engineer

      Role and Duties: Design new products, implement scientific approaches, carry out quality controls, and write safety data sheets (FDS).

      Education and Degrees: Bac+5 obtained at an engineering school.

      Required Skills and Qualities: Patience, rigor, sociability, good writing skills, command of English, and excellent skills in physics, chemistry, and mathematics.

      Working Conditions: Work in small groups (laboratory, factory), mobile in order to meet companies' needs. Working days of 8 hours maximum.

      Compensation: Around €2,000 in the public sector and €3,000 in the private sector. Countries such as Germany and Luxembourg offer higher salaries.

      Mechanical Design Engineer

      Role and Duties: Develop the technical objects of tomorrow (research and development). Designs the product and its mechanism, and carries out modeling and testing.

      Education and Degrees: Bac+5 (engineering degree or master's) with a specialization in mechanics or mechanical engineering.

      Required Skills and Qualities: Curiosity, perseverance, a taste for innovation, command of design software, solid theoretical knowledge (aerodynamics, strength of materials).

      Working Conditions: Work in an office or laboratory, in a multidisciplinary team. Average working time of 40 hours/week.

      Compensation: Gross monthly salary of €2,800 straight out of school, rising to €3,500 and potentially reaching €5,000.

      Designer-Developer / Software Engineer

      Role and Duties: Create, develop, and deploy applications, software, or websites according to a specification. Analyzes requirements, writes code, runs tests, and may train users.

      Education and Degrees: Bac+2 (DUT/BTS) to Bac+5 (engineering degree, Master MIAGE).

      Required Skills and Qualities: Technical command of programming languages (HTML, CSS, PHP, etc.), rigor, adaptability, organizational skills, attentiveness to the client, teamwork.

      Working Conditions: Sedentary but collaborative work. Deadlines are sometimes short, with about 9 hours of work per day.

      Compensation:

      ◦ Designer-Developer: Up to €2,000 gross/month at the start of a career, around €5,000 for an experienced profile.

      ◦ Software Engineer: €2,830 gross/month at the start of a career (€2,200 net), up to €4,500 gross (€3,600 net) with experience.

      Video Game Developer

      Role and Duties: Write and modify source code to ensure a game runs properly. Optimizes graphics, AI, and smoothness. Uses game engines (Unity, Unreal Engine) and languages (C++, Python).

      Education and Degrees: Bac+3 to Bac+5 in computer science, software development, or video games. Specialized schools (Isart, Supinfogame) are one possible route. Continuing education is indispensable.

      Required Skills and Qualities: Logic, rigor, patience, analytical mind, ability to solve technical problems.

      Working Conditions: Work in a studio or as a freelancer, mainly sedentary. Close collaboration with graphic artists and game designers. Regular hours, but intense "crunch time" periods at the end of a project.

      Compensation: Gross annual salary of €30,000 to €40,000 for a beginner, which can reach €60,000 or more with experience.

      Home Automation Technician

      Role and Duties: Install and program automated systems in homes (shutters, alarms, thermostats) to improve comfort, security, and energy efficiency.

      Education and Degrees: Bac Pro Systèmes Numériques, BTS Domotique, or BUT Génie Électrique et Informatique Industrielle. Regular training is necessary.

      Required Skills and Qualities: Knowledge of electronics and computing, logic, precision, analytical mind, patience.

      Working Conditions: Dynamic profession, split between work sites and design offices. Frequent travel, collaboration with other trades. Weeks of 35 to 40 hours, with possible overtime.

      Compensation: Gross monthly salary of €1,800 to €2,200 at the start of a career, which can reach €3,000 to €4,000 with experience.

      --------------------------------------------------------------------------------

      Health and Life Sciences

      This field covers professions dedicated to care, diagnosis, and the improvement of human health. They are characterized by a long course of study, a strong sense of responsibility, and a central role for human contact.

      Physician (Surgeon, Pediatrician, Emergency Physician, Forensic Pathologist)

      Surgeon

      Education: 11-12 years post-bac

      Role and Duties: Operate on patients to treat or repair.

      Working Conditions: Very intense, long hours, high pressure, on-call shifts. Teamwork.

      Compensation (early career): €4,000 - €5,000 / month

      Pediatrician

      Education: 10-11 years post-bac

      Role and Duties: Treat children from birth to age 18, medical follow-up, diagnosis.

      Working Conditions: In private practice, hospital, or PMI. Human and psychological contact is essential.

      Compensation (early career): €3,000 - €3,500 net / month

      Emergency Physician

      Education: 10 years post-bac minimum

      Role and Duties: Care for patients in emergency situations.

      Working Conditions: High pressure, rapid decisions, multidisciplinary teamwork.

      Compensation (early career): €4,500 - €5,000 gross / month

      Forensic Pathologist

      Education: 9-11 years post-bac + DES

      Role and Duties: Analyze bodies and injuries in a judicial context (autopsies, examinations).

      Working Conditions: High stress, emotional resilience required, variable hours and emergencies.

      Compensation (early career): €3,000 - €3,500 net / month

      Paramedical and Health Professionals

      Nurse

      Education: Bac + IFSI (3 years)

      Role and Duties: Provide care, monitor health status, support patients.

      Working Conditions: Highly varied (hospital, private practice, school, army). Shift work, stress.

      Compensation (early career): Varies widely: €1,860 to €3,075 gross / month

      Physiotherapist

      Education: 5 years post-bac

      Role and Duties: Treat and rehabilitate people with movement disorders.

      Working Conditions: In private practice, hospital, or rehabilitation center. Mobile profession, flexible but long hours.

      Compensation (early career): Not specified

      Pharmacist

      Education: 6-9 years post-bac

      Role and Duties: Manage and dispense medications, advise patients.

      Working Conditions: In a pharmacy, hospital, or industry. Variable hours, on-call shifts. High responsibility.

      Compensation (early career): €3,000 - €3,500 gross / month

      Dietitian

      Education: Bac+2 (BTS/BUT)

      Role and Duties: Advise and support people in managing their diet.

      Working Conditions: In private practice, hospital, or school settings. Fairly sedentary profession.

      Compensation (early career): €1,800 - €2,200 gross / month

      Dental Surgeons and Specialists

      Role and Duties: Prevention, conservative care (cavities, scaling), fitting of prostheses, and surgical procedures (extractions, implants).

      Education:

      ◦ Dental surgeon: 6 years post-bac (PASS/LAS + 5 years of study).

      ◦ Pediatric dentist: Specialization after the dental surgery curriculum.

      Required Skills and Qualities: Meticulousness, method, empathy, listening, manual dexterity. A pediatric dentist must know how to reassure children.

      Working Conditions: Sedentary profession, in private practice or hospital. Teamwork (assistant, secretary).

      Compensation:

      ◦ Dental surgeon: From €2,500 to €7,500 / month depending on experience.

      ◦ Pediatric dentist: €2,500 to €6,000 / month in private practice; €2,000 to €3,500 as a salaried employee.

      Medical Biologist

      Role and Duties: Analyze biological samples (blood, urine) to support diagnosis. Validates prescriptions, interprets results, takes part in research.

      Education and Degrees: Bac+9 minimum (Doctorate in Pharmacy or Medicine + DES in medical biology).

      Required Skills and Qualities: Solid scientific knowledge, management skills, initiative, a sense of dialogue.

      Working Conditions: Sedentary work in a laboratory (private or hospital), in a team. Weeks of 30 to 40 hours with possible on-call shifts.

      Compensation: From €4,500 gross/month (public) or €5,500 gross/month (private).

      --------------------------------------------------------------------------------

      Art, Creation, and Communication

      This sector brings together professions in which creativity, aesthetic sense, and the ability to convey a message are paramount. They are often exciting but can be demanding and competitive.

      Art Director

      Role and Duties: Give a project (advertising campaign, website, magazine) a strong visual identity. Develops concepts, chooses colors and typefaces, supervises graphic production.

      Education and Degrees: Training in graphic art, design, or visual communication (Beaux-Arts, Gobelins, etc.). Bachelor's degree (Bac+3) minimum, master's (Bac+5) recommended.

      Required Skills and Qualities: Creative, curious, alert to trends, technical skills (Photoshop, Illustrator), rigor, stress management.

      Working Conditions: Demanding, deadline pressure, sometimes irregular hours in an agency.

      Compensation: Gross annual salary of €30,000 to €40,000 for a beginner, €50,000 to €70,000 (or more) for an experienced profile.

      Journalist

      Role and Duties: Research, verify, and convey information to the public. May specialize (senior reporter, political journalist, etc.).

      Education and Degrees: Post-bac training at one of the 14 recognized journalism schools in France.

      Required Skills and Qualities: Rigor, curiosity, availability, command of multimedia tools.

      Working Conditions: Can be practiced in the field or in an office. Highly variable hours, dictated by the news (weekends, nights). Early career is often precarious (freelance "pigiste" work).

      Compensation: Very wide variation in salary, from €1,140 to €45,000 per month.

      Stage Director and Choreographer

      Role and Duties:

      ◦ Stage director: Creates shows, leads an artistic team, and handles a variety of tasks (writing proposals, budgets).

      ◦ Choreographer: Creates movements for dancers or actors. Often originally a dancer.

      Education and Degrees: University studies (theater studies) or direct experience as an artist (dancer).

      Required Skills and Qualities: Curiosity, hard work, varied know-how, ability to lead a team. Passion is described as a "necessity".

      Working Conditions: Difficult profession, often with precarious financial conditions.

      Compensation: Not specified, but precariousness is emphasized.

      Jeweler

      Role and Duties: Design, make, repair, and restore jewelry in precious metals. Combines craftsmanship, technical precision, and creativity.

      Education and Degrees: Craft tracks (CAP, BMA) or artistic tracks (DN MADE, École Boulle, Haute École de Joaillerie).

      Working Conditions: Sedentary workshop work, alone or in a team. Hours of 35-39 per week for employees, up to 50-60 for the self-employed.

      Compensation: Not specified.

      Interior Architect

      Role and Duties: Design and carry out the layout of interior spaces. Visits sites, draws plans, oversees work sites.

      Education and Degrees: BTS Étude et réalisation d'agencement (Bac+2), DN MADE (Bac+3), master's (Bac+5).

      Required Skills and Qualities: Creativity, innovation, drawing ability.

      Working Conditions: Juggles office work (sedentary) and site work (mobile). Working time is highly variable, and great availability is required.

      Compensation: Starts at around €2,300/month and can reach €3,800/month in a company.

      --------------------------------------------------------------------------------

      Law, Security, and Public Service

      These professions serve justice, the protection of citizens, and public order. They require a strong sense of ethics, rigor, and high resistance to stress.

      Lawyer

      Role and Duties: Advise, defend, and represent the interests of clients (individuals, companies) before the courts. May be a generalist or a specialist.

      Education and Degrees: Master's in law (Bac+4/5) + entrance exam for the École d'avocats (EDA) + 18 months of training to obtain the CAPA.

      Required Skills and Qualities: Patience, listening, organization, perseverance, in-depth knowledge of the law.

      Working Conditions: Mobile work (court) and sedentary work (firm). Predominantly a self-employed profession, with variable working time.

      Compensation:

      ◦ Source 1: €1,800 to €2,700 / month at the start of a career.

      ◦ Source 2: €30,000 to €40,000 / year at the start of a career, which can exceed €100,000 / year with experience.

      Professional Firefighter

      Role and Duties: Rescue people and protect property during fires, accidents, and other disasters.

      Education and Degrees: Diplôme national du brevet at minimum, followed by a competitive exam.

      Required Skills and Qualities: Excellent physical and mental condition, ability to work in a team, obedience to orders, emotional control in the face of shocking situations.

      Working Conditions: Difficult profession, physically and mentally. Strong hierarchy, with responsibility increasing with rank.

      Compensation: Varies by rank, supplemented by bonuses.

      Judicial Police Officer (OPJ)

      Role and Duties: Record offenses, receive complaints, conduct investigations, and place suspects in custody under the authority of the public prosecutor.

      Education and Degrees: Bachelor's degree (Bac+3) + competitive exam for the École Nationale Supérieure de la Police (ENSP) (18-month training).

      Required Skills and Qualities: Rigor, courage, integrity, team spirit.

      Working Conditions: Work in a police station, often on shifted hours (nights, weekends). Stressful profession that demands rapid decisions and strict adherence to procedure.

      Compensation: A lieutenant starts at €2,420 gross/month, and pay can reach €4,000 gross/month as a commissaire.

      --------------------------------------------------------------------------------

      Commerce, Management, and Construction

      This sector covers the design and construction of buildings as well as the management of commercial activities. It requires skills in organization, management, and communication.

      Architect

      Role and Duties: Design an architectural project (plans) and oversee its execution on site. Projects may be new builds or renovations.

      Training and Qualifications: DPLG architect's diploma, obtained after studies at a school of architecture. Continuing education is required.

      Working Conditions: Half sedentary (drawing, administrative work), half mobile (surveys, site meetings). Teamwork is essential. Long working hours (around 50 hours per week for a firm manager).

      Pay: Not specified.

      Construction Site Manager (Conducteur de Travaux)

      Role and Duties: Responsible for managing and coordinating a construction site. Ensures that deadlines, budget, and safety standards are met.

      Training and Qualifications: DUT in Civil Engineering or a degree from an engineering school specializing in construction.

      Skills and Qualities Required: Organized, good at management, rigorous, able to solve problems quickly, good physical endurance.

      Working Conditions: Combines work on site (outdoors) and in the office (administrative). Teamwork. Weeks of 35 to 40 hours, with possible overtime.

      Pay: Gross monthly salary of €2,000 to €2,500 at the start of a career, rising to €4,000 or more.

      Retail Department Manager (Manager Commercial)

      Role and Duties: Manage one or more departments of a store, including shelving, product display, and purchasing.

      Training and Qualifications: DUT in Management or a Master's degree in sales and commerce.

      Skills and Qualities Required: Organized, sociable, able to do calculations and manage stock.

      Working Conditions: Mobile work within the store, as part of a team. 35-hour weeks, but may include weekend work and early or late hours.

      Pay: Around €1,100 net per month for a work-study trainee, up to €2,800 net per month for an experienced manager.

      --------------------------------------------------------------------------------

      Science and Research

      These professions are devoted to advancing knowledge. They require a very high level of education, intellectual rigor, and great perseverance.

      Paleontologist

      Role and Duties: Study the fossil remains of living beings from the past. Extracts, preserves, studies, and reconstructs skeletons.

      Training and Qualifications: Long studies up to the doctorate (Bac+8).

      Skills and Qualities Required: Knowledge of biology and geology, mastery of excavation technologies, patience, and perseverance.

      Pay: Starts at €1,900 gross per month.

      Astrophysicist

      Role and Duties: Study the sky, celestial objects (stars, planets, galaxies), and their physical characteristics. Collects and analyzes data from telescopes and satellites.

      Training and Qualifications: A doctorate (thesis) is essential. Possible routes through university or an engineering school.

      Skills and Qualities Required: Great rigor, ability to picture abstract concepts, ability to work in a team.

      Working Conditions: Mainly an office job, but with travel for conferences. Flexible hours that can nonetheless reach 40 hours per week.

      Pay: Varies by agency, from €3,000 to €5,000 net per month.

      --------------------------------------------------------------------------------

      Other Specialized Professions

      Bookseller (Libraire)

      Role and Duties: Select, buy, and sell books. Advises customers, manages stock, and organizes cultural events.

      Training and Qualifications: Possible qualifications range from the CAP to the DUT and the Licence Professionnelle "Métiers du livre".

      Skills and Qualities Required: Good general knowledge, a love of reading, excellent memory, able to stand for long periods.

      Working Conditions: Sedentary work in a bookshop, retail hours (35 to 39 hours per week, including Saturdays).

      Pay: Gross monthly salary of €1,500 to €1,800 for a beginner.

      Mid-Mountain Guide (Accompagnateur en Moyenne Montagne)

      Role and Duties: Lead groups in mid-altitude mountain terrain (hiking, snowshoeing). May not walk on glaciers or use mountaineering techniques.

      Training and Qualifications: State diploma (Diplôme d'État), prepared at the Centre National de Ski Nordique et de Moyenne Montagne.

      Skills and Qualities Required: Excellent physical condition, love of nature, ability to work with a group.

      Working Conditions: Outdoor work in all weather. Often self-employed, which means having to build a reputation.

      Pay: Earns between €170 and €270 per outing.

    1. In this way, data in different places can be linked together by referencing the elements they have in common

      This is game-changing for Salem: using shared identities to link parish records, land deeds, family trees, and trial transcripts. The end outcome would be a Linked Salem Dataset, allowing users to move from a confession to a property border to a minister's sermon. It is an ethical rebuilding of context, reuniting the social fabric that panic once ripped apart.
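
      A minimal sketch of the linking idea, assuming each source already tags its records with a shared person identifier. The identifiers, field names, and the `link_by_shared_id` helper are all hypothetical and chosen only for illustration; no such Salem dataset is implied to exist in this form.

```python
from collections import defaultdict

# Hypothetical records: every identifier and field value here is invented.
parish_records = [{"person_id": "person:0001", "parish": "Salem Village"}]
land_deeds = [{"person_id": "person:0001", "parcel": "parcel:17"}]
trial_transcripts = [{"person_id": "person:0001", "document": "examination-01"}]


def link_by_shared_id(key, **sources):
    """Group records from several sources under the identifier they share."""
    linked = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            # Records carrying the same identifier land in the same bucket,
            # kept apart by the name of the source they came from.
            linked[record[key]].setdefault(source_name, []).append(record)
    return dict(linked)


linked = link_by_shared_id(
    "person_id",
    parish=parish_records,
    deeds=land_deeds,
    trials=trial_transcripts,
)
```

      Looking up `linked["person:0001"]` then returns that person's parish entry, deed, and transcript together, which is the kind of traversal the annotation describes.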

  7. social-media-ethics-automation.github.io social-media-ethics-automation.github.io
    1. Inauthenticity can be a calculated risk, like that taken when planning someone a surprise party and using a few judicious lies in the process, or it can be an artifact of how complicated it is to be ourselves in a many-faceted world.

      Inauthenticity can be both a mask and a mirror: something we wear, and something that reveals how complex we are. Sometimes, by reversing our assumptions, we grasp what others are trying to achieve and thus understand their true motives. It's like a psychological game. Reminds me of Hannibal.

    1. Resourcefulness

      I have never thought about STEM as something that required resourcefulness. I am curious about this and would love it if the author gave a specific example here.

    1. 19th-century American painting of Native American men hunting bison on the Great Plains. Big game hunting in the northeastern woodlands was similarly a male profession.

      I find this interesting and exciting because this would take a lot of courage and strength. Bison are a very tough animal, and they can tear things up.

    1. By contrast, real definitions aim not just to tell us about the way words are used, but also to find some attributes that are in some way essential to the object being defined. A chemist trying to find out the structure and properties of matter is trying to form a real definition of the thing studied. However, identifying the essential attributes can be difficult, and the whole idea of trying to find essential attributes can be considered problematic.

      But, and maybe this is a philosophical or even metaphysical thought, can things be essentially true without any social influence? Even math is based on theories: we say 1+1=2 because it fits, but it's a theory. So real definitions are also based on verbal agreements. (This is more a question about the definition of these definitions; I get the difference and how it applies to game study.)

    1. Wednesdays: Summary and In-Depth Analysis

      Executive Summary

      This document presents a detailed analysis of the video game Wednesdays, co-published and co-produced by Arte and released on March 26, 2025 on PC (Steam and Itch.io).

      Created by author and creative director Pierre and illustrator Exaeva, this narrative game tackles the complex and sensitive themes of incest, child sexual abuse, and domestic violence.

      Despite the harshness of its subjects, the game adopts a tone described as "luminous and kind".

      With an average length of two hours, Wednesdays stands out for its unique art direction, inspired by independent comics, in which the characters who are victims are drawn with cube-shaped heads.

      A central pillar of the project is its accessibility, designed both for non-gamers and for people with disabilities, with in-depth work on color legibility and simplified game mechanics.

      Development, carried out by a small team of seven working remotely, was marked by strong creative choices, notably the creation of the "Orcopark" decompression space and an immersive sound design that makes up for the absence of voice acting.

      Wednesdays positions itself as a work that seeks to help people speak out and to use the video game medium as a tool for awareness and listening.

      I. Presentation of the Game "Wednesdays"

      A. Concept and Themes

      Wednesdays is a narrative video game that immerses the player in the fragmented memories of Timothé, a character who is a victim of incest.

      The goal is to piece his story together by exploring different scenes from his life, from childhood to adulthood. The game deals head-on with difficult subjects such as child sexual abuse and domestic violence.

      Despite the gravity of these themes, the creators set out to offer a "luminous and kind" experience.

      The narrative and visual approach avoids any graphic depiction of violence, favoring suggestion, education, and emotion.

      Content warnings (trigger warnings) are built directly into the game so that players can protect themselves.

      B. Development Team and Publishing

      The game is the result of a collaboration between several talents from the independent scene, under the aegis of Arte, which has been co-producing and co-publishing video games for more than ten years.

      Member | Role | Notable Contributions
      Pierre | Author and Creative Director | Project design; writing of the script and dialogue.
      Exaeva | Illustrator | Creation of the entire art direction, characters, and backgrounds.
      Virginia | Sound Designer | Design of the soundscape, including the characters' sound gimmicks.
      Florent Maurin (The Pixel Hunt) | Publisher | Project support, administrative management, creative advice.
      Chris | Programmer | Technical development; personally affected by the game's subject.
      Nico Novac | Pixel Artist | Creation of the visuals for the "Orcopark" section.
      Dianne | Programmer (reinforcement) | Programming help on specific aspects of the game.

      The core team of seven worked mostly remotely via Discord, without formal meetings, which shows how autonomous each member was.

      C. Key Facts

      Characteristic | Detail
      Release date | March 26, 2025
      Platforms | PC (via Steam and Itch.io)
      Average playtime | About 2 hours to 2 hours 30 minutes
      Genre | Narrative game, interactive comic

      II. Art Direction and Visual Design

      A. An "Interactive Comic" Style

      The art direction of Wednesdays is one of its most striking aspects. It draws heavily on Franco-Belgian and American independent comics, with cited references including Frederik Peeters, Craig Thompson, and Tillie Walden.

      The creative process is traditional and meticulous:

      1. Drawing on paper: Exaeva produces all the drawings of backgrounds and characters on paper, her medium of choice. The characters are drawn on overlays of slightly transparent "layout" paper, the kind used in traditional animation.

      2. Scanning: All graphic elements are then scanned.

      3. Digital coloring: Colors are added digitally using a duotone technique, which relies mainly on two dominant hues per image to create specific colored and luminous moods.

      This approach gives the game a unique texture, with a very personal penciled look that runs counter to ultra-realistic 3D productions.

      B. The Symbolism of the "Cube Heads"

      A central visual choice in the game is to depict the characters who are victims of incest with cube-shaped heads. This idea, present from the project's inception, serves several purposes:

      Making the invisible visible: It makes victims, who are often invisible in society, immediately identifiable to the player.

      Easing projection: Drawing on the theories of Scott McCloud (L'Art invisible, the French edition of Understanding Comics), a less detailed, less realistic face lets the player project themselves into the character more easily.

      Artistic challenge: Contrary to the initial idea that it would simplify the work, the absence of facial expressions proved a major challenge. All of the characters' emotion has to be conveyed through the body, posture, and gesture, which required very thorough animation and drawing work.

      C. Creative Process and Influences

      The collaboration between Pierre and Exaeva was fundamental. Pierre would arrive with ideas for scenes, sometimes in the form of very simple placeholders (stand-in visuals), and Exaeva would turn them into complete scenes.

      Many staging decisions were made during working sessions in Brussels, over a drink. The game alternates between the scenes drawn by Exaeva and the pixel-art world of Orcopark, creating a strong visual contrast.

      III. Sound and Narrative Design

      A. Sound Design without Voice Acting

      The game contains no spoken dialogue (voice acting), a choice driven by budget but also by artistic intent.

      Sound designer Virginia created an immersive soundscape built on realistic sounds and a sound "gimmick" for each character, all tied to the world of paper and writing:

      Timothé: The sound of a typewriter.

      The children: The sound of Crayola crayons or felt-tip pens.

      Joël (the father): The sound of a fountain pen.

      Fatia (the schoolteacher): The sound of chalk on a blackboard.

      This approach not only lets the player identify by ear who is speaking, but also reinforces the idea that the story is being written or pieced back together.

      B. Speaking Out through Gameplay

      The narrative structure and game mechanics are designed to serve the main theme: the difficulty of speaking out and the stages it involves.

      Fragmented memories: The player can choose memories in a non-linear order, reflecting the non-chronological workings of traumatic memory.

      Dialogue mechanics: In certain scenes, such as the car scene with the character Yeram, the gameplay plays with the speech bubbles.

      The player selects an option, but the character struggles to say it: the sentence changes or is replaced by an ellipsis.

      This represents the inner struggle to put trauma into words. Pierre notes that nearly 4% of the game's speech bubbles are silences ("..."), underlining the importance of what goes unsaid.

      IV. Accessibility: A Pillar of the Project

      Accessibility was a priority from the start of development. The goal was twofold:

      1. Make the game playable by non-gamers, with simple controls and a clear interface.

      2. Include people with disabilities.

      To achieve this, the team worked with Game Accessibility Hub, a specialist company. Tests were run with players who have various disabilities.

      One striking example is that of a tester with achromatopsia (who sees no color at all).

      He found the game perfectly legible and even perceived a difference in the one scene designed in pure black and white, which validated the effectiveness of the contrasts and the art direction.

      The color palettes were systematically tested with tools that simulate different forms of color blindness. Arte supported this effort by allocating an additional budget dedicated to accessibility.

      V. Origins and Behind the Scenes of the Production

      A. From a "One-Man Show" to a Video Game

      The idea for Wednesdays came to Pierre after he saw L'Imposture, a puppet show by Lucie Arnodin.

      Fascinated by the show's ability to treat serious subjects with lightness and a fragmented narrative, he first tried to write a one-man show on the subject.

      After lukewarm feedback from a close friend, he dropped that idea and turned to a medium he had mastered, the video game, while keeping the tone and narrative approach of the original project.

      B. Orcopark: The Decompression Space

      The chapter-selection interface changed significantly over time. The initial concept was a desk on which the player clicked various objects to launch memories. Arte judged this idea "a bit of a drag", and it was replaced by Orcopark, a retro amusement park in pixel art.

      Orcopark serves as the central hub but also as a "safe" space for the player.

      Between emotionally intense scenes, the player can take time to relax, pick up litter, click on interactive elements, and decorate the park.

      This space was developed further than originally planned, with Arte's encouragement, to strengthen its role as a decompression chamber.

      C. Development Anecdotes

      Moustache the cat: Moustache the cat was added to a final scene at the request of Nil, Pierre's son.

      Joël, the aged alter ego: The design of the father character, Joël, is based on an aged version of the author, Pierre.

      Clay figurine: The show's mystery object was a clay statuette made by Pierre's grandmother, which also served as the basis for a puppet in another of his game projects, made with the Game Boy Camera.

      VI. Reflections on Impact and Reception

      A. The Video Game as a Medium for Listening

      The creators stress the unusual position of Wednesdays, an oddity (an "OVNI", literally a UFO) that sits at the intersection of video games and cultural works. This hybrid position creates challenges for its reception:

      • Specialist video game journalists may be thrown by a game that does not fit the usual review criteria (gameplay, playtime, etc.).

      • General culture journalists may be reluctant because of contempt for, or unfamiliarity with, the medium.

      Despite this, the game received good coverage in France and found its audience.

      B. A Tool for Raising Awareness

      The most rewarding feedback for the team comes from players.

      Many testimonials describe the game's positive impact, including from victims who felt understood or who came to new realizations about their own experience while playing.

      The game thus seems to achieve its aim: not only to let its character speak out, but potentially to help its players do the same, and to make those around them aware of the realities of incest.

    1. "generic non-playable characters (NPCs)" into "dynamic, interactive characters capable of striking up a conversation, or providing game knowledge to aid players in their quests."

      What are generic non-playable characters?

    1. The use of AI in dental education also includes educational games and quizzes to test students' knowledge and improve information retention.

      AI in dental education increasingly uses adaptive quizzes and serious games to enhance engagement and long-term retention. These tools apply machine learning to tailor question difficulty, provide instant feedback, and simulate clinical decision-making in a safe environment. Studies show that game-based and AI-assisted learning improves students’ diagnostic accuracy and motivation compared to traditional formats. Reviews further suggest that gamified, retrieval-based learning supports better knowledge retention when paired with timely feedback. Overall, AI-driven quizzes and educational games make dental education more interactive and personalized, strengthening learning outcomes.

    2. Artificial Intelligence (AI) has the potential to play a significant role in enhancing the quality of medical care and helping doctors to reflect and learn from their mistakes. There are several ways in which AI can be utilized for this purpose.

      Quality care is something anyone wants when entering a hospital in their most vulnerable moments. Having AI help healthcare providers learn from and reflect on their past mistakes could really change the healthcare game and help humans perform better at what they do.

    1. the world of our perception is just a projection of an incredibly high dimensional configuration space.

      Like video game code, and what we see is the projection of the code

    1. 10:55 "We bring welfare recipients into the country, Bürgergeld recipients, and that then leads to the burden increasing."

      This ultimately leads to the complete abolition of the welfare state, because at some point there is simply no money left, and the central bank stops granting loans too... and by then at the latest, civil war starts in Germany (and Europe), and the rest of the world just thinks "go woke, go broke!!" and life goes on, except in Europe, where the lights go out. Game over! Multicultural experiment failed!

    1. By reading a handful of good books

      Here is a list of books recommended by William Bernstein in If You Can:

      I Will Teach You To Be Rich by Ramit Sethi

      The Simple Path To Wealth by JL Collins

      The Little Book of Common Sense Investing by John Bogle

      Winning the Loser's Game by Charles Ellis

      The Bogleheads' Guide To Investing by Mel Lindauer, Taylor Larimore, and Michael LeBoeuf

      A Random Walk Down Wall Street by Burton Malkiel

      The Millionaire Next Door by Thomas Stanley

      The Psychology of Money by Morgan Housel

      The Index Card by Helaine Olen and Harold Pollack

      How A Second Grader Beats Wall Street by Allan Roth

      Just Keep Buying by Nick Maggiulli

      The White Coat Investor by James Dahle

      How To Make Your Money Last by Jane Bryant Quinn

      Retire Before Mom and Dad by Rob Berger

      The Five Years Before You Retire by Emily Guy Birken

      How To Plan for the Perfect Retirement by Dana Anspach

      Retirement Planning Guidebook by Wade Pfau

      Retirement Planning for Dummies by Matthew Krantz

      The New Retirement Savings Time Bomb by Ed Slott

    1. If it’s a game-changer that will be regularly used in future jobs, then students will need to know how to use it expertly; thus it may be premature and potentially a disservice to students to ban AI in the classroom. Without learning how to “wrangle” AI, you could be at a competitive disadvantage once you graduate and enter the workforce.

      I read this as a strong case for why banning AI in the classroom could actually harm students. If AI is going to be central to the workforce in the future, it is a skill that students need to learn to use responsibly, much like digital literacy.

  8. inst-fs-iad-prod.inscloudgate.net inst-fs-iad-prod.inscloudgate.net
    1. An honest attempt to secure a good education for poor children therefore leaves policymakers with two difficult choices. They can send them to schools with wealthier children, or they can, as a reasonable second best, seek to give them an education in their own neighborhood that has the features of schooling for well-off students. The former has proved so far to be too expensive politically, and the latter has often been too expensive financially. Americans want all children to have a real chance to learn, and they want all schools to foster democracy and promote the common good, but they do not want those things enough to make them actually happen.

      This sentence lays bare the real dilemma facing educational equity in the United States. Policymakers are left with two options: either send poor children to schools in wealthy areas, or replicate high-quality educational resources in impoverished communities. This amounts to asking society to choose: are you willing to sacrifice some of your own interests to help others, or to pay more taxes to improve education overall? The results show that while Americans pay lip service to educational equity, most fall silent when it comes to paying the price. Ironically, this contradiction exposes the core lie of the "American Dream": it claims equal opportunity for all, yet it tacitly allows the wealthy to secure better educational resources for their children by buying homes in good school districts and paying for private schools. As the author suggests, schools should be tools for breaking down class stratification, but in reality they have become machines for reproducing privilege. The problem isn't that there are no solutions, but that vested interests are simply unwilling to change the rules of the game.

    1. Actions are judged on the sum total of their consequences (utility calculus) The ends justify the means. Utilitarianism: “It is the greatest happiness of the greatest number that is the measure of right and wrong.” That is, What is moral is to do what makes the most people the most happy.

      I'd say that of all the frameworks provided, Consequentialism has the most direct application to, and parallels with, many of the ethical questions and debates we often have about social media. Often the game that gets played on social media is one of data and numbers, and we see developers measure value, success, and popularity largely through the numbers they are fed. And just as one could argue this mindset is flawed, you could say the same flaws exist in Consequentialism. As much as looking at final outcomes can be a rational way to make decisions, it ultimately strips some of the humanity and nuance away from those decisions in the short term. I found the parallels between these two mindsets very interesting.

  9. Sep 2025
    1. The player’s initial fear that they might need to act quickly to defend themselves from some lurking supernatural horror becomes transmuted, by the end of the story, into the inevitable realization that their character has already lost her chance to act,

      Reed shows how the walking game is able to tell a story. Critics said walking games lose the sense of agency, but that isn’t true: the decisions and paths the player chooses to follow give the player an understanding of the story. The misdirection of staging the game as horror entices the player to explore the house and wait for something to jump out at them. The game then shifts to what the story is actually about, Sam's feeling of abandonment and isolation from her family, and the player can only sift through the remnants of the home to piece the story together. This forces introspection onto the player.

    2. Contrary to expectations, these games are rarely just about exploration.

      I agree with this very much, having played many different types of games in the past. Games considered walking sims are much more than just exploring: they actually require you to pay attention to all of the details and bits of the story that you find. Only then will you be able to connect the dots and understand the big picture of the game.

    3. Walking simulators became the most visible examples of the tensions associated with indie gaming, which often involve limits to interactions and the removal of recognizable mechanisms of challenge and victory

      This builds on the idea that players bring expectations to anything they call a game. Walking simulators contrast with the fast and violent style that dominated gaming at the time. The consensus at the time was that removing the agency of survival takes away a vital part of what makes a game a gaming experience.

    4. Permalife games are difficult in an entirely different way than games requiring skill or strategy, requiring players to enact the motions of continuing existence, even in the face of survival under (or complicity with) the evils of that existence. “Perhaps the walking sim’s greatest power is how it makes players recognize and consider such decisions and the way they influence gaming outcomes and environments. A number of traditional big-budget titles don’t demand this kind of moral engagement, which makes sense—asking a player to stop and consider the horrible things they’re doing is antithetical to moving forward” (Clark 2017). Slowness is forefronted in a game of permalife: adrenaline is neither the goal nor the appeal.

      Walking simulators emerged at the height of first-person shooters and introduced the idea of “permalife,” which steered the focus away from death and the challenges that came with it and instead used those struggles to focus on characters and their internal and external lives, deepening the player's thinking.

    5. The term didn’t really take off and become weaponized, however, until the growing resentment of “outsiders” and indie games that would culminate in Gamergate, after which it was retroactively applied with vitriol to games released much earlier like Dear Esther (originally 2008) and To the Moon (2011) (Clark 2017).

      People have ideas of what a game should be ingrained in their minds, and this creates backlash against anything unique or different from the "social norm." Alternative ideas in games should be respected as art rather than viewed negatively.

    6. Sam’s story is revealed through letters, notes, music, and other minutia of daily living, which the player combs through: not only Sam’s, but other people in her life. The game asks the player to confront prejudice without any ability via in-game mechanics to resolve it.

      The story in Gone Home is told through the aftermath of the event with what Sam and her family left behind at the house. This way of experiencing the story indirectly challenges the player's agency, with the fact that the player can only see what is left of the family and cannot do anything but observe. This can create frustration or introspection in the player depending on how they are choosing to experience the game. The frustration can stem from being unable to interact with the story, but introspection can stem from really taking the time to learn the story fully and its emotional impact on the people in it.

    7. But the dominant genre during this period was the much more frenetic first-person shooter. With many shooter engines increasingly providing tools to build your own levels or otherwise modify game content, it’s no surprise that many gamemakers began using these tools for other purposes. Many of the earliest walking simulators (including Dear Esther, The Stanley Parable, and Mary Flanagan’s 2003 [domestic]) were originally mods for first-person shooters, and the first-person perspective has come to define the genre.

      I love the connection they make to how the walking simulators paved the way for a lot of modern gaming. Many of the root mechanics found in modern first-person games like Call of Duty are seen as deriving from walking simulators and fight back against the notion that walking simulators were useless and had no impact on the gaming industry.

    8. Some have suggested creating a contrast with “first-person shooters” by coining the term “first-person walkers,” defining these as games in which “a player’s perception of the game world may be refocused to that of an investigator or close observer, via a strict adherence to minimal interactivity and slow, limited pacing” (Muscat 2016).

      Changing the name of "walking simulators" to a "first person walker" game undermines the whole idea of a simulator. Simulators are purposefully meant to simulate an experience/world, and by walking, you truly explore a new world and fully indulge in its concepts. Most simulators also have an end goal in mind (granting the player agency), a drive to fully experience the world.

    9. Gone Home also plays with player agency by subverting expectations about danger and complicity.

      We've talked about this concept of subverting the genre of horror in Gone Home a lot in our previous discussions, but I want to emphasize again how important this mechanic was to the plot of the game. Instead of just a "walking simulator," Gone Home felt like a walking simulator with a pinch of cortisol. The subversion of genres made the storytelling much more effective, and the walking part of the game set the tone for a serious conversation by creating a space for the player to become curious about the story’s details by slowing the pace.

    10. in a game like Gone Home or Tacoma, the world is inert until you have an effect on it. The fact that you are opening a cabinet and picking up an object and rotating it isn’t important to the game world… What’s important is that through affecting the game world, you’re creating (p. 131) your understanding of it

      This really explores how, in Gone Home, everything you know about the story depends on how deeply you explore the game. Every new note or snippet reveals something new about the story, which produces a sense of appeal and satisfaction. Agency is related to this, as "through affecting the game world, you're creating your understanding". The player feels satisfaction through finding something new and completing the missing pieces of the puzzle of what has occurred while Katie was gone.

    1. I think group activities like this are the best way to engage a class, and I look forward to participating in similar tasks this quarter. As it happens, I have actually played this exact game before in a political science class; seeing the overlap of knowledge between the two subjects is very interesting.

    2. working

      I think group activities like this are the best way to engage a class, and I look forward to participating in similar tasks this quarter. As it happens, I have actually played this exact game before in a political science class; seeing the overlap of knowledge between the two subjects is very interesting.

    3. begin developing a sense of what they think would be a fair way of distributing resources and to critique the political and social institutions under which they live.

      This is a great way to teach college or high school kids about the inequality that comes from different governing systems, and how to make sure everyone gets what's owed to them. It is nice to see how such a simple game can get people this interested in their own systems of government.

    4. During the first round of this exercise, students inevitably take so many fish that there are none left in the lake. Students then discuss what has happened and what they ought to do differently in the next round. Some students have strong intuitions that everybody should take an equal amount, while others insist that all that matters is that in the end there are enough fish left to repopulate the lake. Not only is this exercise pedagogically engaging, but it leads students to develop proposals and to evaluate them critically. When successful, students use what they learned in this exercise to begin developing a sense of what they think would be a fair way of distributing resources and to critique the political and social institutions under which they live.

      I wonder what comparisons the students playing this game could draw to real life after having analyzed the outcome of this imaginary predicament?

    5. I divide students into groups and ask them to imagine that each group is a family subsisting by fishing from a lake. If a group catches two fish, most of their family will survive, although some among the weak, elderly, or very young in the family could die. If the group catches three fish, all of their family will survive. If they catch any more fish, the excess will rot. However, two fish have to be left in the lake in order for the fish population to be replenished the following year. If the groups over-fish, famine ensues and all of the families will die. There are only enough ‘fish’ (paper fish) in the ‘lake’ (a bag I pass around) to allow for most families to take just two fish, if there are to be two fish left in the lake in the end. During the first round of this exercise, students inevitably take so many fish that there are none left in the lake.

      I think group activities like this are the best way to engage a class, and I look forward to participating in similar tasks this quarter. As it happens, I have actually played this exact game before in a political science class; seeing the overlap of knowledge between the two subjects is very interesting.
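
      The rules of the fishing exercise quoted above can be sketched as a small simulation. This is only an illustrative sketch under stated assumptions: the function name `play_round`, the lake size of 11 fish, the four families, and the idea that families draw in turn from the bag are hypothetical and do not come from the source. Only the reserve of two fish and the two-versus-three-fish catches are taken from the quoted rules.

```python
def play_round(lake_fish, requests, reserve=2):
    """Simulate one round of the fishing exercise.

    lake_fish: number of fish in the lake at the start (hypothetical value).
    requests:  how many fish each family tries to take, in drawing order
               (assumes the bag is passed from family to family).
    reserve:   fish that must remain for the lake to repopulate
               (two, per the quoted rules).
    Returns (catches, fish_left, famine).
    """
    fish_left = lake_fish
    catches = []
    for wanted in requests:
        # A family cannot take more fish than remain in the lake.
        taken = min(wanted, fish_left)
        catches.append(taken)
        fish_left -= taken
    # Famine follows when fewer than `reserve` fish are left to repopulate.
    famine = fish_left < reserve
    return catches, fish_left, famine


if __name__ == "__main__":
    # Hypothetical setup: 4 families, 11 fish in the lake.
    print(play_round(11, [3, 3, 3, 3]))  # every family takes three
    print(play_round(11, [2, 2, 2, 3]))  # most families take two
```

      With these assumed numbers, every family taking three fish empties the lake, while most families taking two leaves exactly the two fish needed for repopulation. That is the tension the exercise is designed to expose.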

    1. Coaches’ leadership plays an important role in shaping how this performance pressure impacts athletes’ mental health, drawing an intricate relationship between coaches’ influence, performance and mental health (Mallett & Lara-Bercial, 2023). Coaching research has emphasised the importance of coaches’ knowledge and behaviours as main determinants of positive athlete experiences and sport outcomes (Côté & Gilbert, 2009).

      For elite athletes under intense pressure, the influence of coaching can shape an athlete's experience positively or negatively. Approval from these leaders can give the player satisfaction and confidence in their game.

    1. Archaeological evidence indicates that African gatherers and hunters adapted their tools and ways of life to three basic African environments: the tropical rainforests with hardwoods and small game; the more open savannas with a diversity of large game living in grasslands, woods, and gallery forests along the rivers; and riverbank and lakeside ecologies found along major water-cou

      Shows them adapting to certain situations like different biomes.

    2. Archaeological evidence indicates that African gatherers and hunters adapted their tools and ways of life to three basic African environments: the tropical rainforests with hardwoods and small game; the more open savannas with a diversity of large game living in grasslands, woods, and gallery forests along the rivers; and riverbank and lakeside ecologies found along major water-c

      This shows how many biomes are in Africa and how big the continent is

    3. Archaeological evidence indicates that African gatherers and hunters adapted their tools and ways of life to three basic African environments: the tropical rainforests with hardwoods and small game; the more open savannas with a diversity of large game living in grasslands, woods, and gallery forests along the rivers; and riverbank and lakeside ecologies found along major water-cour

      Discusses how African hunter-gatherers adjusted their tools and lifestyles to fit three main African environments: tropical rainforests, open savannas, and riverbank/lakeside areas, each with unique resources and conditions.

    4. A few isolated forest dwellers, even in the twenty-first century, still live in bands of thirty to fifty individuals. Their pursuit of game and harvesting of a variety of insect, stream, and plant foods keep them on the move in a rather fixed cycle as various foods come into season at different locations in their foraging areas.

      This part explains that their lifestyle revolves around hunting and gathering diverse food sources, causing them to move in a predictable pattern based on seasonal availability in different areas.

    5. As in most gathering and hunting societies, women’s economic functions, along with childbearing, are absolutely crucial. Women typically generate more food through gathering than the men who hunt animals or look for game that has already been killed. Gathering and hunting societies appear to have developed delicately balanced social relationships that permitted necessary group deci-

      This is different from what I learned about Africa because I was taught that women did not matter in that continent

    1. yet there the nightingale Filled all the desert with inviolable voice And still she cried, and still the world pursues, 'Jug Jug' to dirty ears.

      What is clear across this selection of readings, and their integration into the opening of “A Game of Chess” is the power of woman, or rather the power of “love” (a horribly ambiguous term as it is employed in the readings), which emanates from woman. Enobarbus and Mecaenas fear Cleopatra, in her relationship with Antony. Right after Enobarbus says, “I saw her once Hop forty paces through the public street; And having lost her breath, she spoke, and panted, That she did make defect perfection, And, breathless, power breathe forth” Mecaenas replies, “Now Antony must leave her utterly.” It is this “now” that marks the fear; it is directly following the detailing of Cleopatra’s everlasting, impossible perfection and power (explicit here) that it is decided Antony cannot be with her. “...secret splendor and fatal beauty That nature had bestowed on her” describes only the corpse of the woman in A Martyred Woman (mysteriously radiant, dangerously enchanting, almost like a force of nature—this connection could either be a testament to female power, or a putting of it onto something else/an attempt to remove it from the source). Philomela is said to immediately ignite “the flame of love” that “takes” Tereus, “as if one had set afire ripe grain, dry leaves, or a haystack.”

      And it is as though in utter fear of this power, of this hold of women over men, issuing so strongly before a woman even really does anything, that their voices are taken away. Philomela’s tongue is cut off, and then she is falsely proclaimed dead (an ultimate silencing). Octavius knows what his plan is with Cleopatra—what she says will not make a difference. Dido is twisted into suicide (a self-silencing). The entirety of A Martyred Woman is a man speaking on the image of a dead woman—his thoughts, his views, his opinions.

      Eliot adds to this. “Marie” follows “he said” (doubly so, actually, with Eliot writing the poem). Madame Sosostris is a reference to Crome Yellow’s Madame Sesostris—a man impersonating a woman. And is the spelling change a further mocking? Sybil’s voice is hidden in the epigraph under male layers: Trimalchio, and his friends from childhood, Eliot (macro). When Eliot says, “yet there the nightingale Filled all the desert with inviolable voice And still she cried, and still the world pursues ‘Jug Jug’ to dirty ears” it appears that the voice, the story, of Philomela persists in the song of the nightingale—despite the violence inflicted upon her (rape by King Tereus and the mutilation of her tongue)—but in nature (there might be more here, too, with another nature!), the female nightingale is mute; only the male of the species sings. If anything is being sung, then, it is Tereus’ voice, Tereus’ story. Or what is being “sung” by Philomela as a nightingale comes out as nothing (“‘Jug Jug’”)—can a desert be filled? It would no longer be a desert. Then, mockingly, “the world pursues…”

    2. The Chair she sat in, like a burnished throne,

      "The Chair she sat in, like a burnished throne..." as Eliot mentions in his notes, is a direct reference to Cleopatra from Shakespeare's Antony and Cleopatra. The image evoked through the inclusion of a throne sets a lavish and ornate tone for II. A Game of Chess. The throne also signifies a purposeful shift from the natural world and decaying land in The Burial of the Dead to artificial human creations. Eliot goes on to reference other lavish decor elements including "jewels", "vials of ivory and coloured glass," and "strange synthetic perfumes." The lines that follow are jagged and chaotic, similar to the space being described. In contrast, below is a pasted description of Cleopatra's throne from Act 2, scene 2:

      The barge she sat in, like a burnish'd throne, Burn'd on the water: the poop was beaten gold; Purple the sails, and so perfumed that The winds were love-sick with them;

      Here, her throne is aligned with elements of the natural world including water and wind. Cleopatra's power, represented in part through her throne, is able to captivate forces even as strong as the wind as they're "love-sick." Thus, while Eliot refers to Shakespeare's work in the opening line of A Game of Chess, the surrounding associations to the thrones differ greatly. It is ultimately through this divergence that Eliot comments on the decay of the wasteland as feminine power is reduced to a force against nature.

    3. Good night, ladies, good night, sweet ladies, good night, good night.

      This line is a reference to Ophelia in Shakespeare's play Hamlet. Ophelia states/sings "Good night, ladies; good night, sweet ladies; good night, good night," in conversation with King Claudius. This appears just before her stage exit and ultimate death by suicide/drowning. The inclusion of "sweet ladies" as a closing image from Ophelia is rather ironic due to her mental state at this point in the play. She has gone mad, evident in her repetitive speech preceding her good night call, specifically in relation to death. She repeats "He is dead and gone," in song, demonstrating a troubled mentality. Furthermore, her commentary on Saint Valentine's day is connected to feminine seduction and lust. She therefore displays the messy combination of trauma, hardship, and feminine desires. Thus, this context does not allude to an image of a "sweet lady," instead displaying the complexity of womanhood. Moving to Eliot's inclusion of this line as the closing thought to II. A Game of Chess, Ophelia's madness is echoed in the pub. This sets up for a similar good night call, signifying a tragic future for the women at the bar. Their time is running out, as the repeated "Hurry up please its time" signals, and their bodies look "antique." In closing, this borrowed goodnight line from Hamlet furthers Eliot's commentary of women going mad depicted through the chaos of the pub and connection to Ophelia's mental state.

    4. poor Albert, He’s been in the army four years, he wants a good time, And if you don’t give it him, there’s others will,

      "A Game of Chess" gives interesting insight into the place of women, marriage, and sex in Eliot's The Waste Land. In a conversation between a group of women, the narrator (keep in mind, this narrator is a different one from the first section of The Waste Land), says to her friend, "and think of poor Albert, / He's been in the army four years, he wants a good time, / And if you don't give it him, there's others will" (lines 147-9). In the waste land, marriage has become hollow and superficial; rather than being a bond rooted in love and mutual respect, it acts chiefly to 1) fulfill the husband's sexual needs, and 2) to produce children. Additionally, sex has become an emotionally sterile act, devoid of any intimacy or tenderness: it is a service to be completed by the wife for her husband, lest she lose her husband’s loyalty.

      Drawing back to an earlier text, Baudelaire’s poem “A Martyred Woman” takes the changed nature of sexual intimacy to an entirely new extreme: violence. The subject, a decapitated body of a young woman surrounded by perfumes and luxury possessions, becomes an object of fetishization to the narrator. He first admires the “secret splendor and fatal beauty” of her nude body, only to conclude that she is a sex worker who gave away her “inert, complacent flesh to fill / The immensity of his lust” (lines 23 and 48-9). While both texts depict women’s body and sexuality being denied their own agency, there is something all the more violent in the Baudelaire poem: the act of sex is not purely emotionally sterile and transactional, but grotesquely commodified, reduced to an object of lust even after death. The deceased woman is violated twice, first by her killer, and then the narrator, who aestheticizes her lifeless body.

    5. And if you don’t give it him, there’s others will,

      In Eliot's quote "And if you don't give it him, there's others will," the speaker strips Lil of her self-worth and reduces her to a replaceable object. Eliot's utilization of chess imagery, paired with his connections to sources that depict violence toward women, suggests that relationships between men and women in the wasteland are purely transactional, with each individual in pursuit of their own victory. In Pound's "Blocked light working in. Escapes. Renewing of contest," chess is a constantly evolving strategic game. This mirrors how Lil can easily be replaced—she is valued only for what she offers, not for her inherent worth as a person. Middleton's "We must not trust the policy of Europe Upon a woman's tongue" reinforces this pattern of silencing and dehumanizing women. Women are deemed unreliable, just as Lil is seen as disposable. The quote "Divided from herself and her fair judgment, Without the which we are pictures, or mere beasts" proves particularly revealing. Here, Ophelia's "fair judgment"—her ability to have rational thoughts—is presented as the quality that makes her human. Yet when men choose to ignore or suppress these very qualities in women, they reduce them to inhuman objects. This dehumanization allows men to justify their cruelty.

    6. 'My nerves are bad to-night. Yes, bad. Stay with me.

      This whole second section of “A Game of Chess” breaks formally from the first half. The beginning of the section is comprised of one large paragraph, unbroken, while the middle section, specifically lines 111-128, are broken up: indentations carry words across the page, whole sentences begin with one word on the line before, questions infect the poem. The form of the poem begins to break apart and descend here, mirroring the content and references to Hamlet’s Ophelia, who begins to break apart inside and descend into a state of madness. Interestingly, the line pulled directly from Ophelia, “Good night, ladies; good night, sweet ladies; good night, good night,” which ends this section of the poem, is not the last word spoken by Ophelia. This is surprising, as the crazed repetition of “good night” within Ophelia’s line seems like a final goodbye, and would not be shocking as the last words of a person contemplating suicide. Eliot is therefore both in accordance and disagreement with the original Shakespeare: his text breaks down like Ophelia, but his ending is taken from her middle. This reinterpretation reflects broader modernist themes of taking the old and adapting it to the new, which Eliot does repeatedly, and takes it to a deeper level within the form of the poem, mirroring the content of an older text within the form of his text.

    7. I remember Those are pearls that were his eyes.

      When I read the line “Those are pearls that were his eyes” in tonight’s reading, I was shocked. I was immediately taken back to our conversation about Ariel’s character in The Tempest, and the identical line in “The Burial of the Dead.” Furthermore, the difference between the uses of “Those are pearls that were his eyes” interested me a lot. In “The Burial of the Dead” the line is a parenthetical line in reference to the “drowned Phoenician Sailor” in the tarot card reading by “Madame Sosostris.” Furthermore the full line reads “Those are pearls that were his eyes. Look!” This is a direct reference to when Ariel sings to Ferdinand, whom they have shipwrecked on Prospero’s island. They sing: “Full fathom five thy father lies; / Of his bones are coral made; / Those are pearls that were his eyes: / Nothing of him that doth fade / But doth suffer a sea-change / Into something rich and strange,” lying to Ferdinand that his father died in the shipwreck, leaving him, as the heir to the throne, the new King. This line, full of rhymes and interesting imagery, acts as a spell, with Ariel using his magical rhetoric to convince Ferdinand of an untruth. Thus, the line’s use in “The Burial of the Dead” can be seen as a diversion, with the “Look!” making that all the more convincing. In “A Game of Chess,” the line reads “I remember/Those are pearls that were his eyes.” To which the (a?) speaker responds, “‘Are you alive, or not? Is there nothing in your head?’” The reintroduction of this line in this context perhaps suggests the memory of humanity’s connection to nature, a time when beautiful and rare pearls equated to the beauty and importance of one’s vision. The response, “‘Are you alive, or not?’” confused me, especially after the preceding line, but maybe that will be parsed through during our discussion, or in a further source. Additionally, the placement of this line, in response to the questions “Do/you know nothing? Do you see nothing? Do you remember/’Nothing?’” felt really intentional to me. It seemed as if it was testing the reader’s memory; in some way, this quoted speaker is Eliot, reaching out to the audience, asking us if we remember the first use of the line.

    8. The glitter of her jewels rose to meet it, From satin cases poured in rich profusion

      The lines “In the midst of perfume flasks, of sequined fabrics / And voluptuous furniture, / Of marble statues, pictures, and perfumed dresses…A headless cadaver pours out, like a river, / On the saturated pillow / Red, living blood, that the linen drinks up / As greedily as a meadow” stuck out to me when reading the references tonight. The inspiration taken from this source, “A Martyred Woman” in Les Fleurs du Mal by Baudelaire, lies not in a central theme, character, or a famous line, but instead the image of an ornate room. “A Martyred Woman” dances around an extravagant room, with “voluptuous furniture” and “marble statues,” finally settling on the subject, a nude, headless cadaver, posed on the bed: the product of an “unwholesome love.”

      The inspired lines of “A Game of Chess” – “glitter of her jewels rose to meet it,/From satin cases poured in rich profusion./In vials of ivory and coloured glass/Unstoppered, lurked her strange synthetic perfumes” etc. – are clearly harking back to the image of luxury provided in “A Martyred Woman.” The difference, however, lies in the subject, or lack thereof: the nude cadaver. Her presence and neck oozing “red, living blood,” is central to the themes of the poem, yet the possessives “her” and “she,” which are continually referenced in the stanza, are undefined. This gives us only one half of the story provided by Baudelaire. We, the readers, know, from reading “A Martyred Woman,” that the ornate room holds secrets and murder, yet, in “A Game of Chess,” we live and read about it with utter ignorance.

      Furthermore, the lines “fatal beauty/That nature had bestowed on her” from “A Martyred Woman” also really stood out to me. In a tale of a woman, brutally murdered, but worshiped by a lovesick lover, the idea of "nature" “bestow[ing]” this “fatal[ity]” is really interesting. This idea of “nature” “bestow[ing]” is twofold: it is as if, instead of her death being the fault of her murderer, it is her “nature”, her existence, that allowed it; or her murder was a “natur[al],” primitive reaction on the part of her obsessed killer. This image of luxury and wealth shrouding blood, obsession, and unhealthy love is really interesting.

    9. Unstoppered, lurked her strange synthetic perfumes

      Throughout the sources we read tonight, there is a recurring theme of violence toward women imposed through romantic relationships. In Eliot's "A Game of Chess," perfume serves as a symbol that shifts from natural purity to artificial corruption across the works he references. In Paradise Lost, perfume appears in the quote "Fanning thir odoriferous wings dispense Native perfumes, and whisper whence they stole Those balmie spoiles." Here, perfume is natural and pure, emanating from Eden's perfect landscape. This directly contradicts Eliot's description of perfume as "Unstoppered, lurked her strange synthetic perfumes, Unguent, powdered, or liquid—troubled, confused And drowned the sense in odours." Eliot explicitly calls perfume "synthetic," creating contrast with Milton's natural description. This same corrupted use of perfume appears in Shakespeare's Antony and Cleopatra: "The barge she sat in, like a burnish'd throne, Burn'd on the water: the poop was beaten gold; Purple the sails, and so perfumed that The winds were love-sick with them." Here, perfume affects nature itself, making the winds "love-sick," which aligns with Eliot's theme of grief in The Waste Land. Later in the same play, "From the barge A strange invisible perfume hits the sense Of the adjacent wharfs... Antony, Enthroned i' the market-place, did sit alone, Whistling to the air; which, but for vacancy, Had gone to gaze on Cleopatra too." Cleopatra's perfume seems to brainwash people when they smell it, making them obsessed with her. In Baudelaire's "A Martyred Woman," perfume again appears "In the midst of perfume flasks, of sequined fabrics And voluptuous furniture, Of marble statues, pictures, and perfumed dresses That trail in sumptuous folds." Perfume exists in an elegant and extravagant context, but it masks the horror of the dead woman in the room. Eliot makes perfume fake and "synthetic" because it is no longer pure like it was in Paradise Lost. In Cleopatra it makes people lose themselves and obsess over her, and in "A Martyred Woman" perfume conceals death and decay. Since perfume is something associated with attraction, and both Cleopatra and Baudelaire's poem center on the death of women, Eliot demonstrates that relationships have become as artificial and deceptive as the synthetic perfumes.

    10. From satin cases poured in rich profusion; In vials of ivory and coloured glass Unstoppered, lurked her strange synthetic perfumes

      The line “From satin cases poured in rich profusion; In vials of ivory and coloured glass / Unstoppered, lurked her strange synthetic perfumes” struck me as an interesting parallel to the line “Purple the sails, and so perfumed that the winds were love-sick with them;” from Antony and Cleopatra. While the clearest connection between these texts is the “burnish’d throne” line, an image that Eliot draws directly from Shakespeare, the description of lavish luxury that follows intrigued me. The lush, opulent imagery of “satin,” “rich,” and “ivory” mirrors similar imagery within Shakespeare, like the use of “purple,” a color which was costly to produce, and therefore associated with wealth and nobility. These images stand in stark contrast to the dark, deathly imagery of the previous chapter of the poem, like the “brown fog” and “dead sound.” However, underneath this rich imagery is a thread of suicide and violence more gruesome than that found in “The Burial of the Dead.”

      Almost all of the sources we read include a woman of noble standing conducting some act of violence, whether it be suicide or violence against another. The imagery highlighted above from Antony and Cleopatra is vastly different from the ending where Cleopatra kills herself by being bitten by a snake. This pattern continues in Ovid, where Philomela is both “dressed magnificently” and later goes on to “strike” and “hack” the throat of her nephew, brutally killing him before feeding him to his father. In the Aeneid, Dido lies on a “golden couch” before she gives “over to the flames,” thereby killing herself. All of these cases, which Eliot references over and over in this passage, infect the poem with a dark, gruesome meaning that would otherwise be covered by the richness of the imagery. This shows the nature of death and darkness, which can express themselves so clearly, as seen in “The Burial of the Dead,” or can seep into and hide themselves behind glamour and opulence, as seen in the beginning of “A Game of Chess.”

    11. I do not find The Hanged Man.

      “The cards, the cards, the cards will tell/ The past, the present, and the future as well/ The cards, the cards, just take three/ Take a little trip into your future with me” (Princess and The Frog). But can Tarot really tell the future? Or is it just a game? In From Ritual to Romance, Weston recounts the history of tarot cards, which are of unknown origin and are believed to have started as a game, like a standard deck of cards. But how did they become vessels of clairvoyance? The transformation is what I believe Eliot is most interested in, as he writes “I do not find the hanged man.” Just as the man hangs between two states, the card exists in limbo, as both a game and a powerful tool of divination. The mystery surrounding the cards’ origin only further complicates this story of the hanged man, existing as the string that suspends and bridges game and prophecy. Without their muddled history, the cards would have a clear purpose as either a game or a tool of divination, and they would not be suspended between both.

    1. if you search for "Hexagon game" with my name next to it, you will find plenty of examples and templates that you can reproduce [00:02:25] these are very simple little tools: you give students hexagons to which you have added images, words, and quotations, and so you create a collection [00:02:39] of hexagons; you display a topic on the board and tell them, well, you have these hexagons, use them to answer this question and organize them in whatever way seems [00:02:52] most appropriate to you; so you can draw arrows, you can add titles, you can stick them next to one another, you can make drawings beside the hexagons, but in the end I want the clearest [00:03:06] and most coherent poster possible, which you may then go and present orally to the rest of the class

      https://histoire-geographie.ac-dijon.fr/spip.php?article1092

    1. Given all of these skills, and the immense challenges of enacting them in ways that are just, inclusive, anti-sexist, anti-racist, and anti-ableist, how can one ever hope to learn to be a great designer? Ultimately, design requires practice. And specifically, deliberate practice [Ericsson, K. A., Krampe, R. T., & Tesch-Römer, C. (1993). The role of deliberate practice in the acquisition of expert performance. Psychological Review]. You must design a lot with many stakeholders, in many contexts, and get a lot of feedback throughout. The rest of this book will help you structure this practice, showing you the kinds of methods and skills that you might need to learn to be a great designer and design facilitator— but it will be up to you to do the practice, get the feedback, and learn.

      I agree with the sentiment that there are numerous challenges involved when it comes to designing things. We often take for granted the fact that certain designs may not be inclusive for particular groups of people. I’m very interested in this aspect of design, especially when it comes to making designs that are inclusive of disabled people, such as those who rely on screen readers or are color blind. I find this conclusion of the chapter useful because there are important things to keep in mind when designing something. This reminds me of my INFO 498 C class, where they address that as a game designer, you must take into consideration the different feedback that you will receive in order to improve upon your product.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This is a manuscript describing outbreaks of Pseudomonas aeruginosa ST 621 in a facility in the US using genomic data. The authors identified and analysed 254 P. aeruginosa ST 621 isolates collected from a facility from 2011 to 2020. The authors described the relatedness of the isolates across different locations, specimen types (sources), and sampling years. Two concurrently emerged subclones were identified from the 254 isolates. The authors predicted that the most recent common ancestor for the isolates can be dated back to approximately 1999 after the opening of the main building of the facility in 1996. Then the authors grouped the 254 isolates into two categories: 1) patient-to-patient; or 2) environment-to-patient using SNP thresholds and known epidemiological links. Finally, the authors described the changes in resistance gene profiles, virulence genes, cell wall biogenesis, and signaling pathway genes of the isolates over the sampling years.

      Strengths:

      The major strength of this study is the utilisation of genomic data to comprehensively describe the characteristics of a long-term Pseudomonas aeruginosa ST 621 outbreak in a facility. This fills the data gap of a clone that could be clinically important but easily missed from microbiology data alone.

      Weaknesses:

      The work would further benefit from a more detailed discussion on the limitations due to the lack of data on patient clinical information, ward movement, and swabs collected from healthcare workers to verify the transmission of Pseudomonas aeruginosa ST 621, including potential healthcare worker to patient transmission, patient-to-patient transmission, patient-to-environment transmission, and environment-to-patient transmission. For instance, the definition given in the manuscript for patient-to-patient transmission could not rule out the possibility of the existence of a shared contaminated environment. Equally, as patients were not routinely swabbed, unobserved carriers of Pseudomonas aeruginosa ST 621 could not be identified and the possibility of misclassifying the environment-to-patient transmissions could not be ruled out. Moreover, reporting of changes in rates of resistance to imipenem and cefepime could be improved by showing the exact p-values (perhaps with three decimal places) rather than dichotomising the value at 0.05. By doing so, readers could interpret the strength of the evidence of changes.

      Impact of the work:

      First, the work adds to the growing evidence implicating sinks as long-term reservoirs for important MDR pathogens, with direct infection control implications. Moreover, the work could potentially motivate investments in generating and integrating genomic data into routine surveillance. The comprehensive description of the Pseudomonas aeruginosa ST 621 clone outbreak is a great example of how genomic data can provide additional information about long-term outbreaks that otherwise could not be detected using microbiology data alone. Moreover, identifying the changes in resistance genes and virulence genes over time would not be possible without genomic data. Finally, this work provided additional evidence for the long-term persistence of Pseudomonas aeruginosa ST 621 clones, which likely occurs in other similar settings.

      We thank the reviewer for their thorough evaluation of our work, and for the suggested improvements. A main goal of this study was to show that integrating routine WGS in the clinic was a game changer for infection control efforts. We appreciate that this aspect was highlighted as a strength by this reviewer. While some of the weaknesses identified are inherent to the data (or lack thereof) available for this study, we have revised the manuscript to include a detailed discussion of limitations (sampling, thresholds of genetic relatedness, definitions and categories, etc.) that could influence the genomic inferences. We also provided exact p-values for the changes in rates of resistance, as requested. Finally, we have positively answered all the specific recommendations suggested by the reviewer and modified the manuscript accordingly.

      Reviewer #2 (Public Review):

      Summary:

      The authors present a report of a large Pseudomonas aeruginosa hospital outbreak affecting more than 80 patients with first sampling dates in 2011 that stretched over more than 10 years and was only identified through genomic surveillance in 2020. The outbreak strain was assigned to the sequence type 621, an ST that has been associated with carbapenem resistance across the globe. Ongoing transmission coincided with both increasing resistance without acquisition of carbapenemase genes as well as the convergence of mutations towards a host-adapted lifestyle.

      Strengths:

      The convincing genomic analyses indicate spread throughout the hospital since the beginning of the century and provide important benchmark findings for future comparison.

      The sampling was based on all organisms sent to the Multidrug-resistant Organism Repository and Surveillance Network across the U.S. Military Health System.

      Using sequencing data from patient and environmental samples for phylogenetic and transmission analyses as well as determining recurring mutations in outbreak isolates allows for insights into the evolution of potentially harmful pathogens with the ultimate aim of reducing their spread in hospitals.

      Weaknesses:

      The epidemiological information was limited and the sampling methodology was inconsistent, thus complicating the inference of exact transmission routes. Epidemiological data relevant to this analysis include information on the reason for sampling, patient admission and discharge data, and underlying frequency of sampling and sampling results in relation to patient turnover.

      We thank the reviewer for their thoughtful feedback on our manuscript and for highlighting the quality of the genomic analyses. We agree that the lack of patient epi data (e.g. date of admission and discharge) and the inconsistent sampling through the years are limitations of this study. We have revised the manuscript to acknowledge these limitations and discuss how not having this data complicates the inference of exact transmission routes. Finally, we have positively answered all the specific recommendations suggested by the reviewer and modified the manuscript accordingly.

      Reviewer #3 (Public Review):

      Summary:

      This paper by Stribling and colleagues sheds light on a decade-long P. aeruginosa outbreak of the high-risk lineage ST-621 in a US Military hospital. The origins of the outbreak date back to the late 90s and it was mainly caused by two distinct subclones SC1 and SC2. The data of this outbreak showed the emergence of antibiotic resistance to cephalosporin, carbapenems, and colistin over time highlighting the emerging risk of extensively resistant infections due to P. aeruginosa and the need for ongoing surveillance.

      Strengths:

      This study overall is well constructed and clearly written. Since detailed information on floor plans of the building and transfers between facilities was available, the authors were able to show that these two subclones emerged in two separate buildings of the hospital. The authors support their conclusions with prospective environmental sampling in 2021 and 2022 and link the role of persistent environmental contamination to sustaining nosocomial transmission. Information on resistance genes in repeat isolates for the same patients allowed the authors to detect the emergence of resistance within patients. The conclusions have broader implications for infection control at other facilities. In particular, the paper highlights the value of real-time surveillance and environmental sampling in slowing nosocomial transmission of P. aeruginosa.

      Weaknesses:

      My major concern is that the authors used fixed thresholds and definitions to classify the origin of an infection. As such, they were not able to give uncertainty measures around transmission routes nor quantify the relative contribution of persistent environmental contamination vs patient-to-patient transmission. The latter would allow the authors to quantify the impact of certain interventions. In addition, these results represent a specific US military facility and the transmission patterns might be specific to that facility. The study also lacked any data on antibiotic use that could have been used to relate to and discuss the temporal trends of antimicrobial resistance.

      We thank the reviewer for their evaluation of our work and for highlighting the broad implications of our findings regarding the application of real-time surveillance to suppress nosocomial transmission. We agree with the reviewer that fixed thresholds and definitions are imperfect to classify the origin of an infection. The design of this study (e.g. inconsistent sampling through time) was not conducive to providing a comprehensive/quantitative measurement of transmission routes. Thus, we decided to apply conservative thresholds of genetic relatedness and strict conditions (e.g. time between isolate collection, shared hospital location etc.) to favor specificity as our goal was simply to establish that cases of environment-to-patient transmission did happen. In the absence of a truth set, we have not performed sensitivity analysis, but we are conducting a follow-up study to compare inferences from MCMC models to our original fixed-thresholds predictions. This limitation is now discussed in the revised manuscript. Finally, we have positively answered all the specific recommendations suggested by the reviewer and modified the manuscript accordingly including the addition of Figure S3.

      Reviewer #1 (Recommendations For The Authors):

      The definitions used on lines 391-396 are necessarily somewhat arbitrary, but it would be helpful to have a little bit more justification for the choices made, particularly for the definition of environmental involving the "3x the number of years they were separated". It seems a little hard to square this with the more relaxed 10 SNP cutoff for a patient-to-patient designation. Are there reasons for thinking SNP differences associated with environmental transmission should be smaller than for patient-to-patient, or is the aim here just to set the bar higher for assuming an environmental source? Because these definitions are quite arbitrary, there could also be some value in exploring the sensitivity of the results to these assumptions.

      Thank you. We agree with the reviewers that SNP thresholds, albeit necessary, are arbitrary and that more discussion/justification was needed to put the genomic inferences in context. We have revised the manuscript to indicate that: 1/ the 10 SNP cutoff for a patient-to-patient designation was set to account for the known evolution rate of P. aeruginosa (inferred by BEAST at 2.987E-7 subs/site/year in this study and similar to previous estimates PMID: 24039595) and the observed within host variability (now displayed in revised Fig. 1E). We note that this SNP distance was not sufficient and that an epi link (patients on the same ward at the same time) needed to be established. 2/ the environment-to-patient definition was indeed set to be most conservative (nearly identical isolates in two patients from the same ward with no known temporal overlap for > 365 days). This was indeed done to favor high specificity as this inference relied solely on clinical isolates (i.e. the identical environmental strain in the patient-environment-patient chain was not sampled). For these clinical isolates to have acquired no/very little mutation in that much time, no/low replication is expected and, although unsampled, we propose this most likely happened on hospital surfaces.
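
      The two fixed-threshold rules described in this response can be sketched as a small pairwise classifier. This is only an illustrative sketch, not the authors' pipeline: the `Isolate` type, the function name, the 31-day window standing in for "same ward at the same time", and the reading of the reviewer's "3x the number of years they were separated" as an upper bound on the SNP distance are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Isolate:
    """Hypothetical record for one primary isolate (field names are assumptions)."""
    patient: str
    ward: str
    collected: date


def classify_pair(a, b, snps,
                  p2p_max_snps=10,        # 10-SNP cutoff quoted in the response
                  epi_window_days=31,     # assumption: proxy for "same ward at the same time"
                  env_min_gap_days=365,   # "> 365 days" without temporal overlap
                  env_snps_per_year=3):   # assumption: reading of "3x the number of years"
    """Label a pair of isolates from two patients under fixed thresholds."""
    # Both rules require two different patients sampled on the same ward.
    if a.patient == b.patient or a.ward != b.ward:
        return "undetermined"
    gap_days = abs((a.collected - b.collected).days)
    if gap_days > env_min_gap_days:
        # Nearly identical isolates separated by more than a year.
        max_env_snps = env_snps_per_year * gap_days / 365
        if snps <= max_env_snps:
            return "environment-to-patient"
        return "undetermined"
    if snps <= p2p_max_snps and gap_days <= epi_window_days:
        return "patient-to-patient"
    return "undetermined"


a = Isolate("P1", "W20", date(2015, 1, 10))
b = Isolate("P2", "W20", date(2015, 1, 25))
c = Isolate("P3", "W20", date(2017, 3, 1))
print(classify_pair(a, b, snps=4))
print(classify_pair(a, c, snps=1))
```

      Pairs that satisfy neither rule are left as "undetermined", mirroring the conservative stance of favoring specificity over sensitivity.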

      While the term "core genome" should be familiar to most readers, "shell genome" and "cloud genome" are less widely known, and an explanation of what these terms mean here would be helpful.

      Thank you. We have revised the manuscript to define the core, shell, and cloud genomes as gene sets found in ≥ 99%, ≥ 95% and ≥ 15% of isolates, respectively.

      In the first paragraph of the discussion, it could be added that in many cases for clinically important Gram negatives short read sequencing alone will fail to detect transmission events as outbreaks can be driven by plasmid spread with only very limited clonal spread (see, for example, https://www.nature.com/articles/s41564-021-00879-y )

      Thank you. We agree this is an important/emerging aspect of surveillance. However, the goal of this discussion point was to explain why such a large outbreak was missed prior to implementing WGS (short read) surveillance. We feel that discussing “plasmid outbreaks” (which are not at play here, and are relatively rare in P. aeruginosa compared to the Enterobacteriaceae) and the need for long reads would distract from the narrative.

      line 599 What does "Mock" mean here? Would it be more accurate to say it is a simplified floor plan?

      Thank you. “Mock” was changed to “simplified”

      IPAC abbreviation is only used once - spelling it out in full would increase readability.

      Revised manuscript was edited as suggested.

      MHS is only used twice.

      Revised manuscript was edited to spell out Military Health System

      Line 364: full stop missing.

      Revised manuscript was edited as suggested.

      Line 401: Bayesian rather than bayesian.

      Revised manuscript was edited as suggested.

      Reviewer #2 (Recommendations For The Authors):

      Thank you for giving me the opportunity to review this interesting manuscript.

      The conclusions of this paper are mostly well supported by the data presented, but epidemiological information was limited and the sampling methodology was inconsistent, thus complicating inference of exact transmission routes.

      Major issues:

      What was the baseline frequency of clinical and/or screening samples of Pseudomonas aeruginosa at the hospital? Neither Figure 1D nor Table S1 allows for differentiating between clinical and screening samples. Most isolates were cultured from clinical materials, and there is no information about the patients' length of stay and their respective sampling dates. Is there any possibility of finding out whether the samples were collected for clinical or screening purposes? Would it be possible to include the patients' admission data to determine whether the strains were imported into the hospital or related to a previous stay, e.g. among known carriers? Also, the issue of sampling dates vs. patient stay on the ward should be addressed, as there may be an overlap in patients' stay on the ward but no overlap in terms of sampling dates or even missing samples (missing links).

      We have revised the manuscript to address this important point: i) 16 isolates were from surveillance swabs and are labelled “Surveillance” in Table S1. The remaining 237 were clinical isolates; ii) unfortunately, because the sampling was done under a public health surveillance framework, we do not have access to historical patient data (admission/discharge date, wards, rooms, etc.) and we cannot calculate length of stay or better identify patient overlap. These limitations are now acknowledged in the discussion of the revised manuscript.

      In order to evaluate the extent of the outbreak, more epidemiological data would be useful. What is the size of the hospital, what is the average patient turnover, and what is the average length of stay in ICU and non-ICU? Is there any specialization besides the military label?

      We have revised the manuscript to indicate that Facility A is a 425-bed medical center and is the only Level 1 trauma center in the Military Health System. Unfortunately, the data to calculate length of stay, throughout the years, in ICU and non-ICU, was not available to us. This limitation is now also acknowledged in the discussion.

      Perhaps the authors could attempt to discuss the extent to which large outbreaks like these may be considered as part of unavoidable evolutionary processes within the hospital microbiome as opposed to accumulation and transmission of potentially harmful genes/clones, and differentiate between the putative community spread without any epidemiological links on the one hand, and hospital outbreaks that could be targeted by local infection prevention activities on the other hand.

      We respectfully disagree with the suggestion that this large outbreak “may be considered as part of unavoidable evolutionary processes within the hospital microbiome” and should be opposed to “transmission of potentially harmful genes/clones”. As a matter of fact, our data showed that infection control staff at Facility A responded with multiple interventions, including closing sinks, replacing tubing, and using foaming detergents. This resulted in slowing the spread of the ST621 outbreak with just 3 cases identified in 2022, 0 cases in 2023 and 1 case in 2024. This is now discussed in the revised manuscript.

      Page 5, lines 88-92 lines 101-104. It seems as if the outbreak was identified only by the means of genomic surveillance. This raises questions as to the rationale for sampling and sequencing, especially prior to 2020. Considering 11 cases per year between 2011 and 2016, one could assume such an outbreak would have been noticed without sequencing data.

      The MRSN was created in 2010, in response to the outbreak of MDR Acinetobacter baumannii in US military personnel returning from Iraq and Afghanistan. Between 2011 and 2017, the MRSN collected MDR isolates (mandate for all MDR ESKAPE but compliance varied between years and facilities) from across the Military Health System and, for select isolates (e.g. high-risk isolates carrying ESBLs or carbapenemases), performed molecular typing by PFGE. In 2017 the MRSN started to perform whole genome sequencing of its entire repository. In 2020, a routine prospective sequencing service was started and first detected the ST621 outbreak. A retrospective analysis of historical isolate genomes (2011-2019) identified additional cases. The first paragraph of the discussion lists possible factors to explain why the ST621 outbreak escaped detection by traditional approaches. We believe 11 cases per year is not a strong signal when stratified by month, wards, or both, especially for a clone lacking a carbapenemase and without a remarkable antibiotic susceptibility profile.

      Did the infection control personnel suspect transmission? If yes, was the sampling and submission of samples to the MRSN adapted based on the epidemiologic findings?

      The ST621 outbreak was unsuspected before the initial genomic detection in 2020. Until that point, MDR isolates only (Magiorakos et al PMID: 21793988) were collected but compliance was variable through time. Quickly thereafter (starting in 2021), complete sampling of all clinical P. aeruginosa (MDR or not) from Facility A was started. The manuscript was revised to clarify those details of the sampling strategy.

      Is there any information about how many environmental sites were sampled without evidence of ST621 / screening samples were cultured without evidence of Pseudomonas aeruginosa?

      For patient isolates, only 16 isolates were from surveillance swabs. The remaining 237 were clinical isolates. No denominator data was available to calculate P. aeruginosa and ST-621 positivity rate in surveillance swabs throughout the time period. For environmental isolates, a total of 159 swabs were taken from 55 distinct locations in 8 wards/units including the ER. This data is now included in the revised manuscript. However, a complete analysis of these swabs (positivity rate for ESKAPE pathogens, P. aeruginosa, per ward/floor/room, per swab type (sink drain, bed rail etc.) etc.) is beyond the scope of this study and is being performed as a follow up investigation.

      Page 5 line 89 and page 39 Figure S1B. Please describe how the allelic distance for the cluster threshold was selected.

      As indicated in the legend of Figure S1B, no thresholds were applied. All ST621 isolates ever sequenced by the MRSN were included. All except 3 isolates shared between 0 and 23 cgMLST allelic differences. The remaining 3 were distant by 88-89 allelic differences. The text was revised to clarify this point.

      Page 5 lines 99-100. Could the authors please provide some distribution measures (e.g. IQR).

      Done as requested. The revised manuscript now reads “…of just 38 single nucleotide polymorphisms (SNPs), and an IQR of 19 (Fig. 1A, Table S1).”

      Page 5 line 102. Could the authors please provide some distribution measures (e.g. IQR).

      Please see above. A chart was created and is now included as Fig. S2.

      Page 6 line 107 and page 34 figure 1c. In the text it is stated that isolates were collected in 27 wards, the figure 1C depicts 26 wards and n/a.

      Thank you for spotting this inconsistency. This has been fixed in the revised manuscript.

      Page 6 lines 117-118. Samples collected in the emergency room would imply samples collected on admission, already addressed previously. Did the authors investigate a potential import into the hospital from community reservoirs or were all these isolates collected among patients who had been previously admitted to the hospital and/or tested positive for the outbreak strain?

      We agree that samples collected in the ER imply samples collected on admission. Of the 29 ER isolates only 9 (31%) were primary isolates (first detection in a new patient) which suggests a majority were from returning patients at Facility A. Because the sampling was done under a public health surveillance framework, we do not have access to historical patient data (admission/discharge date, wards, rooms, etc.) to investigate/confirm that these 9 patients had previous visits at Facility A. This point is now discussed in the revised manuscript.

      Page 6 line 128. This could also represent increased selective pressure. However, according to Table S1, the 28 isolates collected in 2011 (the number does not match with Figure 1D) were from many different wards, thus indicating earlier spread throughout the hospital.

      Yes, we agree. Please note that Table S1 lists all isolates for 2011 whereas Figure 1D focuses on primary isolates (first isolate from each patient) only.

      Page 7 line 133. Both Figure 2 and the discussion section, page 13 line 296 suggest the year 2005 instead of 2004?

      Thank you for catching this typographical error. This was corrected to 2004 in the revised manuscript.

      Figure 1E. The figure should also depict intra-patient diversity for comparison.

      Thank you for this great suggestion. We have revised Figure 1E accordingly.

      Page 7, lines 146-147 Could the authors attempt explaining the upper part of the bimodal peaks?

      This is an all-vs-all SNP analysis for all inter-patient isolates. For each isolate, all distances to other isolates are reported, not only the smallest. The upper peaks represent comparisons to isolates from a different outbreak subclone (SC1 vs SC2).
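
      A toy example may help show why an all-vs-all comparison across two subclones produces a bimodal distribution. The sketch below uses four invented 10-site sequences (not real data) and a plain Hamming distance as a stand-in for core-SNP distance; the isolate names and the subclone assignments are hypothetical, and each isolate is assumed to come from a different patient.

```python
from itertools import combinations


def hamming(a, b):
    """Number of differing sites between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


# Invented toy alignment: isolate -> (subclone label, aligned variable sites).
isolates = {
    "iso1": ("SC1", "AAAAAAAAAA"),
    "iso2": ("SC1", "AAAAAAAAAT"),
    "iso3": ("SC2", "CCCCCAAAAA"),
    "iso4": ("SC2", "CCCCCAAAAG"),
}

within, between = [], []
# All-vs-all: every pair is reported, not only each isolate's nearest neighbor.
for (_, (sc1, seq1)), (_, (sc2, seq2)) in combinations(isolates.items(), 2):
    distance = hamming(seq1, seq2)
    (within if sc1 == sc2 else between).append(distance)

print("within-subclone distances:", sorted(within))
print("between-subclone distances:", sorted(between))
```

      In this toy set every between-subclone distance exceeds every within-subclone distance, so a histogram of all pairs would show a lower peak (same subclone) and an upper peak (SC1 vs SC2 comparisons).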

      Page 7, line 150 This is a very small number considering the extent of the outbreak and suggests a large number of missing links. Or does this rather imply continuous import and evolution over time that does not necessarily represent transmission within the hospital?

      We believe all cases were due to transmission happening within the hospital. Based on conservative thresholds (genetic relatedness and epi link, or lack thereof) the precise origin from another patient (n=10) or a contaminated surface (n=12) can be inferred. For the remaining 60 patients, with the available sampling, the conditions we chose are not met and we simply do not conclude whether a direct patient-to-patient or an environmental origin was more likely.

      Page 8 line 155. What does the temporal overlap refer to - sampling date versus patient's stay on the ward? Please specify.

      The temporal overlap was investigated from sampling dates, as dates of patient admission/discharge were not available.

      Page 8, line 157: What does primary/serial isolate mean - first and follow-up samples of ST621 per patient?

      Yes. Primary isolate is used to designate the first isolate from a patient. Serial isolates designate follow-up samples of ST621.

      Page 8 line 165: Table S3 and Figure 3 only refer to environmental samples from three wards. Ward 20 rooms 2 and 18 as well as ward 1 rooms 1 and 6 were hotspots - is there any information on the specific infection control/disinfection measures? Addressed in discussion page 12, lines 273-275, but no information on what was actually done.

      The manuscript was revised to indicate the precise disinfection measures that were taken. A follow-up study is ongoing to assess long-term efficacy and monitor possible retrograde growth from previously contaminated sinks.

      Page 8 line 175: Evaluation of change in resistance fraction over time - There may have been a selection bias with an inconsistent number of strains sequenced per year.

      Yes, incomplete sampling and possible selection bias are now listed with other limitations of this study in the discussion of the revised manuscript.

      Page 9 line 183: The referral to Table S1 is unclear, I could not find the number and the specific isolates selected for long-read sequencing.

      Thank you. This has been added to the revised Table S1.

      Page 10 lines 217-225 and Figure 4C: Perhaps it is possible to better align what is written in the text and the caption of the figure. The caption does not clarify that only one patient develops colistin resistance (what was the reason to include the other patients?).

      Thank you. We have revised the text and the caption of the figure to clarify that only isolates from one patient developed colistin resistance. The isolates from the other patients on Fig. 4C are shown to provide context and accurately map the emergence of the PhoQE77fs mutation.  

      Page 10, lines 228-229 and Table S5: How is it possible to identify those 64 genes in Table S5?

      We have revised Table S5 to facilitate the identification of the 64 genes with ≥ 2 independently acquired mutations (excluding SYN). Specifically, we have added column E labeled “Counts independent mutations per locus (excluding SYN)”. A total of 205 rows (in this table each row is a variant) have a value ≥ 2 and these represent 64 genes (upon deduplication of locus tags).  
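
      The filtering and deduplication step described here can be sketched in a few lines. The rows, locus tags, and key names below are invented placeholders (the real Table S5 layout is only known from the description above), so this shows the counting logic rather than the actual table.

```python
# Invented stand-in for Table S5: one row per variant; "independent_count"
# plays the role of column E ("Counts independent mutations per locus (excluding SYN)").
rows = [
    {"locus_tag": "LOCUS_0001", "variant": "v1", "independent_count": 3},
    {"locus_tag": "LOCUS_0001", "variant": "v2", "independent_count": 3},
    {"locus_tag": "LOCUS_0001", "variant": "v3", "independent_count": 3},
    {"locus_tag": "LOCUS_0002", "variant": "v4", "independent_count": 1},
    {"locus_tag": "LOCUS_0003", "variant": "v5", "independent_count": 2},
    {"locus_tag": "LOCUS_0003", "variant": "v6", "independent_count": 2},
]

# Keep variant rows whose locus carries at least 2 independent mutations.
flagged_rows = [row for row in rows if row["independent_count"] >= 2]

# Deduplicate locus tags to go from variant rows to distinct genes.
flagged_genes = sorted({row["locus_tag"] for row in flagged_rows})

print(len(flagged_rows), "variant rows ->", len(flagged_genes), "genes")
```

      Applied to the real table, the same two steps would be expected to reduce the 205 flagged rows to the 64 genes mentioned in the response.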

      Page 13, lines 280-281: Where is the information on chronic infection presented? Serial cultures would not necessarily mean chronic infection.

      Authors response: Yes, we agree this was not the appropriate characterization and this was revised to ‘long-term’ infections.

      Page 14 line 306: Emergence of colistin resistance in a single patient, correct?

      Yes. This was further clarified in the text.

      Page 14 lines 315-320: This should go to the results section. In particular disinfection, closing, and replacing of tubing should be mentioned in the results section in reference to the results presented in Table S3.

      Thank you. We have considered this suggestion and have decided to leave this discussion as the closing paragraph of this publication. A follow-up study is ongoing to assess long-term efficacy of these interventions on the ST-621 but also other outbreak clones at Facility A.

      Methods

      Page 15 lines 330-333: Perhaps it is possible to avoid redundancy.

      Thank you. We have revised the text accordingly.

      Page 15 lines 341: Information on which isolates were subjected to long-read sequencing is missing.

      Thank you. This has been added to the revised Table S1.

      Page 16 line 345: Was there a particular reason why Newbler was chosen?

      No. At the time Newbler was the default assembler built in the MRSN bacterial genome analysis pipeline and QC processes.

      Page 16, line 357-358: What was the rationale for selecting this isolate as reference genome?

      This isolate was chosen because it was collected early in the outbreak and phylogenetic analysis revealed it had low root to tip divergence.

      Page 16 line 361: Why 310 isolates, if only 253 were assigned to the outbreak clone and only a subset of those were collected in facility A?

      This was a typographical error that has been corrected (it now reads “…set of 253 isolates.”) in the revised manuscript.

      Page 17 lines 387-395: What is the reason that intra-patient diversity was not included in the set of criteria for SNP distances?

      The observed within host variability (now displayed in revised Fig. 1E) was taken into consideration when setting SNP thresholds for categorizing patient-to-patient transmission or environment-to-patient event. This is now clarified in the revised manuscript.

      Page 17 line 392: How was the threshold of <=10 SNPs determined?

      The 10 SNP cutoff to infer a patient-to-patient transmission event was set to account for the known evolution rate of P. aeruginosa (inferred by BEAST at 2.987E-7 subs/site/year in this study, and similar to previous estimates PMID: 24039595) and the observed within host variability (now displayed in revised Fig. 1E). We note that this SNP distance was not sufficient and that an epi link (patients on the same ward within the same month) needed to be established.

      Page 17 line 395 and Figure 2: What was the assumed average mutation rate per genome per year?

      Thank you. The mean substitution rate inferred by BEAST was 2.987E-7 subs/site/year, similar to estimates from previous studies on P. aeruginosa outbreaks (e.g. PMID: 24039595).
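
      Since the reviewer asked for a rate per genome per year, the per-site rate can be converted with a short calculation. The genome length used below (about 6.3 Mb, roughly the size of the P. aeruginosa PAO1 reference) is an assumption for illustration only; the response does not state how many sites the rate applies to, and the answer scales linearly with that number.

```python
RATE_PER_SITE_PER_YEAR = 2.987e-7  # substitution rate quoted in the response
ASSUMED_SITES = 6.3e6              # assumption: approximate genome length, not from the response

# Expected substitutions per genome per year along a single lineage.
per_genome_per_year = RATE_PER_SITE_PER_YEAR * ASSUMED_SITES


def expected_substitutions(years):
    """Expected substitutions accumulated along one lineage over `years`."""
    return per_genome_per_year * years


print(f"about {per_genome_per_year:.2f} substitutions per genome per year")
print(f"about {expected_substitutions(5):.1f} substitutions over 5 years")
```

      Under this assumed length the rate works out to roughly two substitutions per genome per year, which gives a sense of scale for the fixed SNP thresholds discussed above.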

      Reviewer #3 (Recommendations For The Authors):

      Please find (line-by-line comments) on each section of the manuscript below:

      Introduction

      Line 86: I am wondering why the authors state ">28 facilities" instead of the exact number of facilities from which these lineages were recovered.

      Thank you. Manuscript was revised to provide the exact number of facilities. It now reads “…recovered from 37 and 28 facilities, respectively.”

      Methods

      It's not clear to me which criteria were used for collecting these isolates (both prospective and retrospective). I understand that some of the data are described in more detail in Lebreton et al but I did not find the specific criteria for the collection of the isolates and I imagine that these might differ between facilities. Would it be possible to comment on that and add a short paragraph in the Methods section?

      Thank you. This lack of clarity was also raised by other reviewers, and we have revised the manuscript to indicate that: 1/ MDR isolates only (Magiorakos et al PMID: 21793988) were collected from 2011-2020 with the same criteria for all facilities although compliance was variable through time and between facilities; and 2/ starting in 2021 all P. aeruginosa isolates, irrespective of their susceptibility profile, were collected from Facility A.

      The data comes from a US Military hospital. Is this related to the US Veterans Affairs Healthcare system? Is there more detailed information about the demographics of the patient population?

      Facility A is part of the Military Health System (MHS) which provides care for active service members and their families. This is distinct from the US Veterans Affairs Healthcare system. Only limited patient data was accessible to us as this study was done as part of our public health surveillance activities. Patient age (avg. 57.2 +/- 21.0) and gender (ratio male/female 1.7) are provided in the revised manuscript. 

      Line 384ff: The origin of infection was inferred based on the SNP threshold and epidemiological links. However, recombination events can complicate the interpretation of SNP data. Have the authors attempted to account for this?

      Thank you. We agree that recombination events can complicate the interpretation of SNP data. We used Gubbins v2.3.1 to filter out recombination from the core SNP alignment, as indicated in the revised manuscript.

      The authors' definition of environment-to-patient transmission seems conservative (nearly identical strain and no known temporal overlap for > 365 days). Have the authors changed the threshold, performed sensitivity analyses, and tested how this would affect their results?

      Indeed, acknowledging that fixed thresholds have limitations in their ability to accurately predict the origin of infections, we took a conservative approach to favor specificity as our goal was simply to establish that cases of environment-to-patient transmission did happen. In the absence of a truth set, we have not performed sensitivity analysis, but we are conducting a follow-up study to compare inferences from MCMC models to our original predictions. This limitation is now discussed in the revised manuscript.

      The authors don't seem to incorporate the role of healthcare workers in the transmission process. Could they comment on this? I am assuming that environment-to-patient transmission could either be directly from the environment to the patient or via a healthcare worker. I think it's fine to make simplifying assumptions here but it would be great if this was explicitly described.

      Thank you for this suggestion. We have not sampled the hands of healthcare workers in this study. As a result, the reviewer is correct to say that we made the simplifying assumption that healthcare workers would be possible intermediates in either environment-to-patient or patient-to-patient transmissions, as previously described by others (PMID: 8452949). This limitation is now discussed in the revised manuscript.

      Page 5, line 100: What does "all vs all" mean? Based on the supplement, I assume it's the pairwise distance and then averaged across all of those. It would improve the readability of the manuscript if the authors could briefly define this term and then maybe refer to Table S1.

      Thank you. We have created Fig.S2 and revised the manuscript to state that ST-621 isolates from facility A belonged to the same outbreak clone with a distance (averaged all vs all pairwise comparison) of just 38 single nucleotide polymorphisms (SNPs), and an IQR of 19 (Fig. S2, Table S1).
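
      As an illustration of the "averaged all vs all pairwise comparison" mentioned above, here is a minimal Python sketch. It assumes a toy core-genome alignment of four made-up isolates (the names and sequences are hypothetical, not data from the study), counts the differing positions for every pair of isolates, and then reports the mean and the IQR of those distances. The authors' actual pipeline and tools may differ.

```python
from itertools import combinations
from statistics import mean, quantiles


def snp_distance(seq_a, seq_b):
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b))


def all_vs_all(alignment):
    """Return the SNP distance for every unordered pair of isolates."""
    names = sorted(alignment)
    return [
        snp_distance(alignment[p], alignment[q])
        for p, q in combinations(names, 2)
    ]


def summarize(distances):
    """Return (mean, interquartile range) of a list of distances."""
    q1, _, q3 = quantiles(distances, n=4, method="inclusive")
    return mean(distances), q3 - q1


# Hypothetical toy alignment, for illustration only.
toy_alignment = {
    "isolate_1": "ACGTACGTAC",
    "isolate_2": "ACGTACGTAT",
    "isolate_3": "ACGAACGTAT",
    "isolate_4": "TCGAACGTAT",
}

distances = all_vs_all(toy_alignment)
average, iqr = summarize(distances)
print(f"pairs={len(distances)} mean={average:.2f} IQR={iqr:.2f}")
```

      With n isolates there are n(n-1)/2 pairs, so the summary describes the spread within the whole clone rather than the distance to a single reference.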

      Figure 1D: It would be interesting to see additional figures in the supplement on the percentage of sequenced isolates per year and whether it varies across the different sources/sites. Is there any information on which isolates were chosen for sequencing?

      Lack of clarity in the sampling/sequencing scheme was raised by multiple reviewers and we have provided a thorough response to earlier comments. We also have revised the material and methods section accordingly. Finally, we have created Fig. S3 to show the percentage of sequenced isolates per year across different sources/sites, as suggested by the reviewer. No noticeable patterns were observed. 

      It seems like only a subset of all clinical isolates were sequenced. Would it be possible that SC2 was present already earlier but not picked up until a certain date?

      Although all isolates received by the MRSN were sequenced, compliance varied through time so it is true that not all clinical isolates were sequenced between 2011-2019. As such, we fully agree with this hypothesis and discuss this possibility as BEAST analysis placed the origin of SC2 in 2004 while the first detection of an SC2 isolate was in December 2012. This limitation is now discussed in the revised manuscript.

      Could the authors elaborate on whether the isolates resulted from single-colony picks? Is it possible that the different absence of a subclone is due to the fact that they picked only a colony?

      Yes, the isolates resulted from single-colony picks except when the presence of different colony morphologies was noted. In the latter case, representative isolates for each colony morphology were processed. We have revised the methods to make that clear.

      Figure 2: It is difficult to see which nodes belong to which patient due to the small font size. I wonder if it was possible to color the nodes for each patient, to make it more readable.

      We tried coloring the nodes but with > 60 distinct patients/colors we decided it did not improve clarity. We have revised figure 2 to increase the font size.  

      Page 7-8, lines 154-155: Did the authors check whether there were isolates of the same strain (that were found in the environment) present in other patients elsewhere in the ward?

      Yes. In rare cases, we observed virtually genetically identical isolates from two patients collected in different wards. Because we only have access to clinical isolate data (collected from patient X in ward Y) and do not have access to patient data (admission/discharge date, wards, rooms, etc.), we do not know, but cannot exclude, that the patients overlapped in a room prior to the sampling of their P. aeruginosa isolates. We designed our fixed thresholds to be conservative. As a result, in this analysis, these cases are labelled as “undetermined”.

      Page 8: Do the authors have any information on antibiotic use during this timeframe? From the discussion, it seems like there is no patient-level prescription data. Is there any data on overall trends? How were trends in antibiotic use correlated with trends in antibiotic resistance?

      Unfortunately, patient-level prescription data (or any other data not linked to the bacterial specimens) was not accessible to us as this study was done as part of our public health surveillance activities.

      To infer the origin of infection, the authors used a static method with fixed thresholds and definitions. This study does not provide any uncertainty with their estimates. Maybe the authors could add a sentence in the discussion section that MCMC methods to infer transmission trees incorporating WGS could provide these estimates. These methods have not been applied to PA a lot, but here are two examples where MCMC methods have been used without WGS (though the definition of environmental contamination may differ between these studies and this study).

      https://doi.org/10.1186/s13756-022-01095-x

      https://doi.org/10.1371/journal.pcbi.1006697

      Thank you for this great suggestion. We have revised the manuscript to include a discussion on the limitations of fixed thresholds to infer transmission chains/origins, and to discuss existing alternatives including MCMC methods. 

      Line 322-323: This sentence is a bit vague since not all of these HAI are due to P. aeruginosa. I would suggest citing a number that is specific to PA.

      Thank you. While our paper shows a particular example of protracted P. aeruginosa outbreak, the roll-out of routine WGS surveillance in the clinic will help prevent hospital-associated drug-resistant infections for more than this species. We believe that broadening the scope in the last sentence of the manuscript is important and we decline to revise as suggested.

    1. Egoism. Sources [b83] [b84]. “Rational Selfishness”: It is rational to seek your own self-interest above all else. Great feats of engineering happen when brilliant people ruthlessly follow their ambition. That is, Do whatever benefits yourself. Altruism is bad.

      I was just reading articles about game theory, and this reminds me of the Prisoner's Dilemma, where the best actions for individuals lead to worse outcomes for everyone. I am curious how egoism deals with these kinds of settings.
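
      To make the Prisoner's Dilemma point concrete, here is a small Python sketch. The prison sentences are illustrative assumptions (any payoffs with the same ordering would behave the same way). It checks each player's best response and then compares the outcome of mutual self-interest with mutual cooperation.

```python
# YEARS[(my_move, other_move)] = (my_sentence, other_sentence).
# Lower is better. The numbers are assumed for illustration only.
YEARS = {
    ("cooperate", "cooperate"): (1, 1),
    ("cooperate", "defect"): (3, 0),
    ("defect", "cooperate"): (0, 3),
    ("defect", "defect"): (2, 2),
}
MOVES = ("cooperate", "defect")


def best_response(other_move):
    """Return the move that minimizes my own sentence, given the other's move."""
    return min(MOVES, key=lambda mine: YEARS[(mine, other_move)][0])


# If the best response is the same whatever the other player does,
# that move is a dominant strategy for a purely self-interested player.
dominant_moves = {best_response(other) for other in MOVES}

egoist_outcome = YEARS[("defect", "defect")]
cooperative_outcome = YEARS[("cooperate", "cooperate")]

print("best responses:", dominant_moves)
print("both follow self-interest:", egoist_outcome)
print("both cooperate:", cooperative_outcome)
```

      In this toy setup defecting is each player's best response in every case, yet both players end up with a longer sentence than if both had cooperated, which is the tension the annotation raises for egoism.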

    1. this is workshop 2 on well-being at school, workshop 3, so I'm doing this from memory even 00:57:45 though I had notes, in any case it's workshop 3 and workshop 4, so workshops 2, 3 and 4 are cancelled; however, there are still places in the other workshops, so you can go to them spontaneously and 00:57:58 add yourselves to the sign-in sheets, so that leaves workshop 1, uh, Values of the Republic, with the team of colleagues Carole Janine and I've forgotten the name of the third person, 00:58:11 workshop 5, the escape game on inclusion, workshop 6, uh, 6, social media, thank you very much, on social media with the Clémi, 00:58:22 uh, workshop 7, uh, which is about interculturality and, precisely, multilingualism, and workshop 8 on the question of experiencing debate in the classroom
    1. You have to make up. So the ball, which is a physical object, only becomes meaningful as a football within the context of the rules of the game. The only way you play is to develop a game or a language game about football: “You can put it there; You can’t put it there. You can’t touch it; You can’t kick it, etc.” Within the rules, it becomes a football.

      This is even more evidence for the previous claims in this text. It dips into philosophical reasoning about why things are what they are. The football does not have purpose or meaning until you put it in the context of a football game and on a football field. We give things meanings to support our worldview.

    1. Game Design and Intelligent Interaction

      The book presents a collection of chapters that focus on the design, use, and evaluation of games and the application of gamification processes in serious learning scenarios. This is clearly the way of the future, as those technologies are currently being used to change the way we explore, learn, and share our knowledge with others. The field will evolve in the near future with the use of new delivery platforms, while various technologies will merge into more concrete media, including wearable multipurpose devices. This book presents a series of design and evaluation case studies enabling the reader to appreciate the complexity of the task in hand, sample different case studies, and appreciate how different requirements can be met using game design and evaluation theory, analysis, and implementation.

    1. Men hunted big game, defended the band from predatory animals, and fought; women gathered, fished, trapped small animals, and grew the "three sisters" of corn, beans, and squash in garden plots they shifted when soil fertility began to wane. Because they controlled the more dependable food sources, women had social power; they typically were responsible for distributing all the food and often chose the men who led councils and war parties.

      How the native cultures lived in the northeastern woodlands

    1. The 2000s have been a remarkable decade of transformation in American television

      TV became more accessible, streaming changed the game, there was more funding, and TVs got better. TV has definitely changed a lot.

    1. cast 24 spells and had visited the game on at least two unique days

      try to find: number of slots they've tried, how many spins per one, maybe complete two of the daily quests on two or three unique days?

    1. 17

      shapes are mutually defining and read as a single unit whether...

      19

      the figure-field relationship ceases to be as such, and all shapes interact with one another.

      33 consider when a pattern can be strongly directional — how does it read when turned across diff angles?

      34, on game C, Manipulations

      move from simple positions of the design unit to more complex by carefully controlling the repeat size, its dark-light balance, and its alignment with itself

      Figure 1-35 (p37)...!!!

    1. Grant and the Boone and Crockett Club advocated for game laws that dictated how many animals could be taken, when, and where.

      It seems like what these men really wanted was control based off their hunting regulations (not that hunting regulations are bad, but maybe they were creating them for the wrong reasons) and their background in eugenics.

    1. Iron, which would later become the basis of an entirely new economy, was already in use in the Near East and Anatolia, though its use was limited to small, decorative items. There are some small daggers of meteoric iron in Hittite graves and in the tomb of Tutankhamun (3,325 years ago),

      it is crazy to think that iron has held a place over 3000 years ago with it still being very prominent today and a total game changer when it was first able to be mass mined and produced.

    1. And even if I’m managing to keep kneejerk political takes to a minimum, even if I’ve not suffered the same miserable fate as Matt Taibbi, it’s still become easy for me to simply perform a kind of studied eclecticism instead of really getting into something. America invented Europe in the fifth century AD: sure. The Trial is an ethnographically complete account of the Poro society of West Africa: why not. George HW Bush was an elf: you know and I know that I can do this stuff in my sleep. That’s what happens when you’re the best in the game.

      this is very funny though reminiscent of someone in its ironic but present ego

    1. Some teachers use small whiteboards for this.

      I actually really like the idea of this, but I would rather have more open communication in the classroom. Instead, could this be used when you are doing more of a game in the classroom? That's how I envision it. You could even split them into small groups and see multiple responses, yet still not have to look at 25 different whiteboards.

    1. achieved

      seems like a complicated way to say that they are liars. What's the benefit of theory here? Are we also playing an obfuscating game that ensnares academics in little traps?

    1. “There he saw a sight so curious that he could not tear himself away. At one end of the green stood a group of a hundred and fifty youths, guarding one goal, all striving to prevent the ball of a single little boy, who was playing against the whole of them, from getting in; but for all that they could do, he won the game, and drove his ball home to the goal. “Then they changed sides, and the little lad defended his one goal against the hundred and fifty balls of the other youths, all sent at once across the ground. But though the youths played well, following up their balls, not one of them went into the hole, for the little boy caught them one after another just outside, driving them hither and thither, so that they could not make the goal. But when his turn came round to make the counter-stroke, he was as successful as before; nay, he would get the entire set of a hundred and fifty balls into their hole, for all that they could do. “Then they played a game of getting each other’s cloaks off without tearing them, and he would have their mantles off, one after the other, before they could, on their part, even unfasten the brooch that held his cloak.[35] When they wrestled with each other, it was the same thing: he would have them on the ground before all of them together could upset him, or make him budge a foot.

      The boy appears to be an unstoppable force.

    1. Cognitive Load

      When crafting items in Minecraft, the player has to remember many different recipes, often involving multiple steps. If the player is unfamiliar with the recipes, it can create high cognitive load, making the crafting process slow and frustrating. However, by using the crafting book or search bar, the game reduces cognitive load by presenting the recipes and options in an organized, easy-to-access way, allowing the player to focus on gameplay rather than memorizing every recipe.

    2. When setting up a new video game, the graphics menu might have dozens of options like textures, shadows, brightness, and resolution. With so many choices, players often take a long time adjusting everything before they even start playing. But if the game only offers a few simple presets like Low, Medium, and High, players usually make a decision much faster. This shows Hick’s Law: the more choices you have, the longer it takes to decide.
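
      The relationship described above is often written as the Hick-Hyman law, T = a + b * log2(n + 1), for n equally likely choices. The sketch below is a minimal Python illustration. The coefficients a and b are made-up values for demonstration, not measured ones, and the choice counts (3 presets versus 24 individual settings) are hypothetical.

```python
import math


def hick_decision_time(n_choices, a=0.2, b=0.15):
    """Estimate decision time in seconds for n equally likely choices.

    a (base time) and b (time per bit of choice) are assumed values;
    real ones have to be fitted to measurements.
    """
    if n_choices < 1:
        raise ValueError("need at least one choice")
    return a + b * math.log2(n_choices + 1)


presets_time = hick_decision_time(3)     # Low / Medium / High
full_menu_time = hick_decision_time(24)  # hypothetical detailed menu

print(f"3 presets:   {presets_time:.2f} s")
print(f"24 settings: {full_menu_time:.2f} s")
```

      Because the growth is logarithmic, each doubling of the number of options adds roughly the same amount of time rather than doubling it.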

    1. Progressive Disclosure

      I think Progressive Disclosure is powerful because it balances simplicity with depth. New users aren’t overwhelmed right away, but advanced users still discover more value over time. It’s like leveling up in a game: features unlock as you get comfortable, which keeps users engaged instead of confused or frustrated.

    2. Variable Reward

      Variable Rewards are powerful in UX design because people are naturally drawn to unexpected outcomes. Instead of getting the same reward every time, users stay more engaged when the result is unpredictable. This is the same principle that makes slot machines addictive: uncertainty keeps people coming back.

      One example in digital products is a feature like a loot box in a video game, where the reward is uncertain but could be very valuable.

    3. Sunk Cost Effect

      I think this is something we all fall victim to from time to time. For example, if I'm playing a game and I keep losing, I may continue to play until I win, even if it's frustrating for me, because I don't want to feel like I've invested my time in nothing. It's also super common in gamblers, it seems.

    4. Loss Aversion

      In my opinion, designers need to be taught to take care when implementing progress-based gamification, like achievements in a game or, in shadier cases, profit bonuses in gambling. Implementing loss aversion can be manipulative toward vulnerable users, as is evident in common scamming tactics as well.

    1. He was facing reality. A "skin game" is being played

      Malcolm X is pointing out that race shapes power and inequality in America. This shows the connection between history and present-day racism. I notice how he connects the past with what is still happening.

    1. Page Layout: The overall design layout of the page is well done. The black text with the red hyperlinks highly contrasts against the white background. The font style between the text and the headline is also different, helping one easily differentiate between them.

    2. This reportedly includes roughly 800 athletes in Gaza, with more than 400 soccer players killed, including Sulemain Al-Obeid, who was known as the “Palestinian Pele,” Mohammed Barakat, known as the “Legend of Khan Younis,” and Ahmad Abu al-Atta.

      Navigation Links: Like many media outlets, hyperlinks are standard. This feature is handy when links are directly integrated into the text, as it has been done here, without disrupting the flow of the text.

    3. Rising Global Pressure

      Direct Links to Article Section: This is a feature I don't recall seeing anywhere before, which has struck me as something beneficial. It offers the ability to directly share the link to the exact part of the article by clicking one button. It is an efficient and helpful feature.

    4. Zeteo makes use of several video elements across the page. While it offers a "Playback speed" option, it fails to present a transcription or a summary of the video, which would be greatly useful to individuals with a hearing impairment.

    5. Pro-Palestinian protesters in Rome call for Israel to be banned from sporting events on Sept. 8, 2025. Photo by Andrea Ronchini/NurPhoto via Getty Images

      Image Zoom-in: Zeteo presents an option to zoom in on pictures, a feature I rarely see across different platforms. This feature gives low-vision users the ability to view the image and its accompanying caption properly. Despite this, it lacks alt-text, which is a beneficial feature.

    1. The multimodality of digital art works challenges writers, users, and critics to bring together diverse expertise and interpretive traditions to understand fully the aesthetic strategies and possibilities of electronic literature.

      Katherine Hayles considers an electronic literary text as something independent, possessing its own materiality, rather than a new interpretation of a printed book. And this materiality is primarily created by the code that is used to create it. Such literature is a hybrid phenomenon that exists at the intersection of literature, game mechanics, visual art, and programming. And it is precisely this feature that requires the creation of fundamentally new approaches and tools for criticism.

    2. Readers come to digital work with expectations formed by print, including extensive and deep tacit knowledge of letter forms, print conventions, and print literary modes. Of necessity, electronic literature must build on these expectations even as it modifies and transforms them. At the same time, because electronic literature is normally created and performed within a context of networked and programmable media, it is also informed by the powerhouses of contemporary culture, particularly computer games, films, animations, digital arts, graphic design, and electronic visual culture. In this sense electronic literature is a "hopeful monster"

      With the rapid spread of technology in our daily lives through various gadgets, literature as a significant source of information has inevitably moved into the digital realm. Katherine Hayles accurately noticed that new forms of "modern culture" necessitated new forms of text, for example, characters' lines in a computer game. As stated in the definition, literature has abandoned its printed form in favor of a computer-based code shell. The author also covers how progress in the genres of electronic literature is bound to progress in technology itself. This is illustrated by the expansion of hypertext fiction forms and deeper immersion in interactive fiction. In general, we can conclude that electronic literature is a natural successor to printed literature.

    1. In another scenario, a writer trying to explain the complex concept of "Quantum Mechanics" might explore various analogies or metaphors, such as "a game of chance" or "waves in the ocean." With the assistance of generative AI, the writer can develop a more detailed and fleshed-out exposition of each analogy or metaphor, allowing them to gauge which resonates more intuitively and effectively with their intended audience

      example

    1. These assessed his ability to hold numbers, pictures and words in mind. One classic test measures how many numbers a person can repeat, both forwards and backwards, soon after hearing them. Most people manage about seven. ‘He was not exceptional on any of these standard tests,’ said Rissman. ‘We didn’t find anything other than playing chess that he seems to be supremely gifted at.’ But next came the brain scans. With Gareyev lying down in the machine, Rissman looked at how well connected the various regions of the chess player’s brain were. Though the results are tentative and as yet unpublished, the scans found much greater than average communication between parts of Gareyev’s brain that make up what is called the frontoparietal control network. Of 63 people scanned alongside the chess player, only one or two scored more highly on the measure. ‘You use this network in almost any complex task.

      talking about the test and the game

    2. The nature of the game is to run through possible moves in the mind to see how they play out. From this, regular players develop a memory for the patterns the pieces make, the defences and attacks. ‘You recreate it in your mind,’ said Gareyev. ‘A lot of players are capable of doing what I’m doing.’ The real mental challenge comes from playing multiple games at once in the head. Not only must the positions of each piece on every board be memorised, they must be recalled faithfully when needed, updated with each player’s moves, and then reliably stored again, so the brain can move on to the next board. First moves can be tough to remember because they are fairly uninteresting. But the ends of games are taxing too, as exhaustion sets in. When Gareyev is tired, his recall can get patchy.

      player movements

    3. But displays of the feat go back centuries. The first recorded game in Europe was played in 13th-century Florence.

      first recorded game in Europe was played in 13th-century Florence

    4. In the hope of understanding how he and others like him can perform such mental feats, researchers at the University of California in Los Angeles (UCLA) called him in for tests. They now have their first results. ‘The ability to play a game of chess with your eyes closed is not a far reach for most accomplished players,’ said Jesse Rissman, who runs a memory lab at UCLA

      UCLA called Gareyev in for tests - playing chess with your eyes closed - Rissman runs a memory lab

    1. delete any or all of your User Content from Company owned, controlled or used servers and from the Service, for any reason or no reason, whether intentional or unintentional, and without any liability of any kind to you or any other party.

      Hell no, again. That game needs to be replaced by an alternative that renders this obsolete!

    1. After thoroughly reading the assignment sheet, you might not have questions right away. However, after reading it again, either before or after you try to start the assignment, you might find that you have questions. Don’t play a guessing game when it comes to tackling assignment criteria–ask the right person for help: the instructor. Discuss any and all questions with the person who assigned the work, either in person or via email. Visit him or her during office hours or stay after class. Do not wait until the last minute, as doing so puts your grade at risk. Don’t be shy about asking your professors questions. Not only will you better your understanding and the outcome of your paper, but professors tend to enjoy and benefit from student inquiry, as questions help them rethink their assignments and improve the clarity of their expectations. You are probably not the only student with a question, so be the one who is assertive and responsible enough to find answers. In the worst case scenario, when you have completed all of these steps and a professor still fails to provide you with the clarity you are looking for, discuss your questions with fellow classmates.

      Ask questions. KEY: don't be shy; be the one who asks questions, because others may have the same questions.

    2. After thoroughly reading the assignment sheet, you might not have questions right away. However, after reading it again, either before or after you try to start the assignment, you might find that you have questions. Don’t play a guessing game when it comes to tackling assignment criteria–ask the right person for help: the instructor

      When I read any big assignments, I like doing Cornell notes so I can understand what I have read, and I put my questions down right away.

    1. Local school boards protested characterizations of Washington, Jefferson, and Madison as unpatriotic owners of “forced labor camps.”

      And yet notes that it was TAUGHT to kids in school history lessons -> seems to me to be cherry picking what is and isn't "history" here.

      Ah, but likens it to Conservatives' view that if THIS is the history being taught, just shouldn't teach history at all -> calls this "zero-sum game of heroes and villains" instead of exploring nuance, etc.

      "It was not an analysis of people's ideas in their own time, nor a process of change over time." Again -> main issue here is its link to the present.

    1. Sometimes, multiple news sources will post or broadcast the same story word-for-word. Just because a story is shared widely doesn’t mean that it is accurate, and it doesn’t tell you where the data came from. Keep searching to find a better source.

      The specific line reminds me of the concept of churnalism. When I was doing previous research for this class, I looked into sources like ScienceDaily, which was referred to as a site for churnalism. Churnalism is a low-quality form of journalism in which information is repackaged to create articles under increasing pressure of time and cost, without further research or fact-checking. It plays a huge game of telephone between news and research articles that, most of the time, offers nothing new for consumers and lengthens the time spent on research. Whether certain things are churnalism or simply articles that put information in plainer terms for audiences like children and the general public is up to wider debate and has to be judged case by case.

  10. learn-us-east-1-prod-fleet01-beaker-xythos.content.blackboardcdn.com learn-us-east-1-prod-fleet01-beaker-xythos.content.blackboardcdn.com
    1. which says that demand goes down as the price goes up. Has the law of demand been found to be false? Neither I nor any other economist would be willing to concede that the law of demand fails to hold. Instead, we would look for factors that might account for the Ursinus College application anomaly

      I can actually say that this anomaly has happened to me in real life. A video game called "R.U.S.E." was removed from Steam a while ago, and some of my friends and I really enjoyed it, so when we got computers we decided to go looking for digital keys for it. When we found them a few years back they were about $60, and now they are $200 - $250, even though the demand for the game has not grown since it is so old. I just wanted to show that this anomaly is more common than one might think.

    1. Markets are the ultimate infinite game.

      Zero mention of the obvious problems with market failure or dysfunction, Zero mention of how markets are greedy (literally) algorithms often misaligned with long-term incentives

    1. The believing game gives nuance to the idea of multiple perspectives during critical conversations. Rather than listening for holes in an argument or idea and biting our tongues to keep from arguing in reply, the believing game asks us to “try to believe things that we don’t believe—especially things we don’t want to believe” (“The Believing Game” 4). Elbow reiterates that the danger that lies in both games is imbalance: vehemently dismissing every idea beyond what we are ideologically attached to (the doubting game) and accepting whatever seems most evident from those we agree with or those with authority in a culture (the believing game). Rather, careful thinking requires both doubting and believing to see an idea from multiple angles or perspectives. One goal of critical listening is to “dwell-in” (“Bringing the Rhetoric” 395) or believe—even if only for a moment—another person’s perspective or experience. Elbow provides three teaching strategies to encourage the believing game when an outnumbered view surfaces in a discussion: ■  The three-minute or five-minute rule can be invoked for any member of the classroom community who feels they are not being listened to; when the rule is invoked, this person may speak for three or five minutes and no one else is allowed to talk or reply. ■  Allies only— no objections is a rule that permits only those who are able to believe the minority-held viewpoint to participate in the discussion, with no objections allowed. ■  Testimony is a practice where speakers are asked to share their stories or life experiences that informed their viewpoint on an issue and to share their experience with what it is like to live with this view. Other participants in the discussion must not respond or disagree, even after the speaker’s story has ended.

      The believing game gives nuance to the idea of multiple perspectives during critical conversations. Rather than listening for holes in an argument or idea and biting our tongues to keep from arguing in reply, the believing game asks us to “try to believe things that we don’t believe—especially things we don’t want to believe” (“The Believing Game” 4). Elbow reiterates that the danger that lies in both games is imbalance: vehemently dismissing every idea beyond what we are ideologically attached to (the doubting game) and accepting whatever seems most evident from those we agree with or those with authority in a culture (the believing game). Rather, careful thinking requires both doubting and believing to see an idea from multiple angles or perspectives. One goal of critical listening is to “dwell-in” (“Bringing the Rhetoric” 395) or believe—even if only for a moment—another person’s perspective or experience. Elbow provides three teaching strategies to encourage the believing game when an outnumbered view surfaces in a discussion:

      ■ The three-minute or five-minute rule can be invoked for any member of the classroom community who feels they are not being listened to; when the rule is invoked, this person may speak for three or five minutes and no one else is allowed to talk or reply.
      ■ Allies only—no objections is a rule that permits only those who are able to believe the minority-held viewpoint to participate in the discussion, with no objections allowed.
      ■ Testimony is a practice where speakers are asked to share their stories or life experiences that informed their viewpoint on an issue and to share their experience with what it is like to live with this view. Other participants in the discussion must not respond or disagree, even after the speaker’s story has ended.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Joint Public Review:

      Weaknesses:

      The lack of pleiotropy is an unconfirmable assumption of MR, and the addition of those models is therefore quite important, as this is a primary weakness of the MR approach. Given that concern, I read the sensitivity analyses using pleiotropy-robust models as the main result, and in that case, they can't test their hypotheses as these models do not show a BMI instrumental variable association. The other weakness, which might be remedied, is that the power of the tests here is not described. When a hypothesis is tested with an under-powered model, the apparent lack of association could be due to inadequate sample size rather than a true null. Typically, when a statistically significant association is reported, power concerns are discounted as long as the study is not so small as to create spurious findings. That is the case with their primary BMI instrumental variable model - they find an association so we can presume it was adequately powered. But the primary models they share are not the pleiotropy-robust methods MR-Egger, weighted median, and weighted mode. The tests for these models are null, and that could mean a couple of things: (1) the original primary significant association between the BMI genetic instrument was due to pleiotropy, and they therefore don't have a robust model to explore the effects of the tobacco genetic instrument. (2) The power for the sensitivity analysis models (the pleiotropy-robust methods) is inadequate, and the authors share no discussion about the relative power of the different MR approaches. If they do have adequate power, then again, there is no need to explore the tobacco instrument.

      Reviewing Editor Comments:

      We suggest that the authors add power estimates to assess whether the sample size is sufficient, given the strength and variability of the genetic instruments. It would also be helpful to present effect estimates for the tobacco instruments alone, to clarify their independent contribution and improve the interpretation of the joint models. In addition, the role of pleiotropy should be addressed more clearly, including which model is considered primary. Stratified analyses by smoking status are encouraged, as prior studies indicate that BMI-HNC associations may differ between smokers and non-smokers. Finally, the comparison with previous studies should be revised, as most reported null findings without accounting for tobacco instruments. If this study finds an association, it should not be framed as a replication.

      We would like to highlight that post-hoc power calculations are often considered redundant since the statistical power estimated for an observed association is directly related to its p-value[1]. In other words, the uncertainty of the association is already reflected in its 95% confidence interval. However, we understand power calculations may still be of interest to the reader, so we have incorporated them in the revised manuscript. We have edited the text as follows (lines 151-155): “Consequently, we used the total R² values to examine the statistical power in our study[42]. However, we acknowledge that the value of post-hoc power calculations is limited, since the statistical power estimated for an observed association is already reflected in the 95% confidence interval presented alongside the point estimate[43].” We have also added supplementary figures 1 and 2.
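
      The link between post-hoc power and the p-value can be made concrete with a small sketch. It assumes a two-sided z-test and treats the observed effect as if it were the true effect; the function name `observed_power` and the p-values looped over are illustrative choices, not taken from the manuscript.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def observed_power(p_value, alpha=0.05):
    """Post-hoc ("observed") power of a two-sided z-test, given only its p-value.

    Assumption: the observed effect is plugged in as the true effect, so the
    result is a deterministic function of the p-value and alpha.
    """
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    # Probability of landing in either rejection region if the true z were z_obs.
    return STD_NORMAL.cdf(z_obs - z_crit) + STD_NORMAL.cdf(-z_obs - z_crit)


# Illustrative p-values only.
for p in (0.001, 0.01, 0.05, 0.26, 0.69):
    print(f"p = {p:<5}  observed power = {observed_power(p):.2f}")
```

      Under these assumptions a result sitting exactly at p = 0.05 has an observed power of about 50%, and any non-significant result has less, so the calculation carries no information beyond the p-value and confidence interval already reported.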

      We can see that when using the latest HEADSpAcE data we were able to detect BMI-HNC ORs as small as 1.16 with 80% power, while the GAME-ON dataset only permitted the detection of ORs as small as 1.26 using the same BMI instruments (Figure B). We have explained these figures in the results section as follows (lines 257-263): “Using the BMI genetic instruments (total R² = 4.8%) and an α of 0.05, we had 80% statistical power to detect an OR as small as 1.16 for HNC risk (Supplementary Figure 1). For WHR (total R² = 3.1%) and WC (total R² = 4.4%), we could detect odds ratios (ORs) as small as 1.20 and 1.17, respectively. This is an improvement in terms of statistical power compared to the GAME-ON analysis published by Gormley et al.[28], for which there was 80% power to detect an OR as small as 1.26 using the same BMI genetic instruments (Supplementary Figure 2).”
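
      As a rough check on these figures, the sketch below uses a widely used normal approximation for MR power with a binary outcome, in which the test statistic is centred on sqrt(N · R² · K · (1 − K)) · |ln OR|, with K the proportion of cases. Whether the manuscript's reference [42] uses exactly this formula is an assumption here, and the function names are hypothetical. The sample size (N=31,523, including 12,264 cases) and R² values are those quoted in the response.

```python
from math import exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def mr_power_binary(n_total, n_cases, r2, odds_ratio, alpha=0.05):
    """Approximate power of an MR analysis with a binary outcome.

    Assumed formula (normal approximation), not confirmed as the authors' method:
    power = Phi( sqrt(N * R2 * K * (1 - K)) * |ln OR| - z_{1 - alpha/2} ).
    """
    k = n_cases / n_total                                   # proportion of cases
    ncp = sqrt(n_total * r2 * k * (1 - k)) * abs(log(odds_ratio))
    return STD_NORMAL.cdf(ncp - STD_NORMAL.inv_cdf(1 - alpha / 2))


def min_detectable_or(n_total, n_cases, r2, power=0.80, alpha=0.05):
    """Smallest OR above 1 detectable at the given power (inverse of the above)."""
    k = n_cases / n_total
    z_sum = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    return exp(z_sum / sqrt(n_total * r2 * k * (1 - k)))


# Sample size and total R² values as quoted in the response.
N_TOTAL, N_CASES = 31_523, 12_264
for trait, r2 in (("BMI", 0.048), ("WHR", 0.031), ("WC", 0.044)):
    print(f"{trait}: minimum detectable OR = "
          f"{min_detectable_or(N_TOTAL, N_CASES, r2):.2f}")
```

      Under this assumed formula the minimum detectable ORs round to 1.16, 1.20 and 1.17 for BMI, WHR and WC, matching the values quoted above. The GAME-ON comparison is not reproduced because its sample size is not given in this response.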

      The reason we use inverse variance weighted (IVW) Mendelian randomization (MR) to obtain our main results rather than the pleiotropy-robust methods mentioned by the reviewer/editors (i.e., MR-Egger, weighted median and weighted mode) is that the former has greater statistical power than the latter[2]. Hence, instead of focussing on the statistical significance of the pleiotropy-robust analyses, we consider it more valuable to compare the size and direction of the effect estimates across methods. Any evidence of consistency increases our confidence in our main findings, since each method relies on different assumptions. As we cannot be sure about the presence and nature of horizontal pleiotropy, it is useful to compare results across methods even though they are not equally powered. It is true that our results for the genetically predicted effects of body mass index (BMI) on the risk of head and neck cancer (HNC) differ across methods. This is precisely what led us to question the validity of our main finding (suggesting a positive effect of BMI on HNC risk). We have now clarified this in the methods section of the revised manuscript as advised. Lines 165-171:

      “Because the IVW method assumes all genetic variants are valid instruments[44], which is unlikely the case, three pleiotropy-robust two-sample MR methods (i.e., MR-Egger[45], weighted median[46] and weighted mode[47]) were used in sensitivity analyses. When the magnitude and direction of effect estimates are consistent across methods that rely on different assumptions, the main findings are more convincing. As we cannot be sure about the presence and nature of horizontal pleiotropy, it is useful to compare results across methods even if they are not equally powered.”
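
      A toy sketch may help show why comparing estimates across methods is informative. It implements the fixed-effect IVW estimate (a weighted regression of variant-outcome on variant-exposure effects through the origin) and the MR-Egger estimate (the same regression with an intercept). The five variant effects below are invented purely for illustration and are not data from the study; the function names are hypothetical.

```python
def ivw_estimate(bx, by, se_y):
    """Fixed-effect IVW estimate: weighted regression through the origin."""
    weights = [1 / s ** 2 for s in se_y]
    num = sum(w * x * y for w, x, y in zip(weights, bx, by))
    den = sum(w * x * x for w, x in zip(weights, bx))
    return num / den


def egger_estimate(bx, by, se_y):
    """MR-Egger: weighted regression with an intercept.

    Returns (intercept, slope). Assumes variants are already oriented so that
    every variant-exposure effect is positive.
    """
    weights = [1 / s ** 2 for s in se_y]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, bx)) / total
    mean_y = sum(w * y for w, y in zip(weights, by)) / total
    slope = (sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, bx, by))
             / sum(w * (x - mean_x) ** 2 for w, x in zip(weights, bx)))
    return mean_y - slope * mean_x, slope


# Hypothetical variant effects: true causal effect 0.5 on the log-odds scale.
bx = [0.02, 0.04, 0.06, 0.08, 0.10]
se_y = [0.01] * 5
by_valid = [0.5 * x for x in bx]                 # all instruments valid
by_pleiotropic = [0.5 * x + 0.02 for x in bx]    # constant directional pleiotropy

print("valid instruments:  IVW =", round(ivw_estimate(bx, by_valid, se_y), 3),
      " Egger slope =", round(egger_estimate(bx, by_valid, se_y)[1], 3))
print("with pleiotropy:    IVW =", round(ivw_estimate(bx, by_pleiotropic, se_y), 3),
      " Egger slope =", round(egger_estimate(bx, by_pleiotropic, se_y)[1], 3))
```

      In this noise-free toy setting the two methods agree when all instruments are valid, and they diverge once a constant pleiotropic effect is added: the IVW estimate is pushed away from 0.5 while the Egger slope is not. Real data add sampling noise and the lower power of MR-Egger, which is why agreement in size and direction, rather than significance, is the comparison being made.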

      We understand that the reviewer/editors are concerned that we do not have a robust model to explore the role of tobacco consumption in the link between BMI and HNC. However, we have a different perspective on the matter. If indeed, the main IVW finding for BMI and HNC is due to pleiotropy (since some of the pleiotropy-robust methods suggest conflicting results), then the IVW multivariable MR method is a way to explore the potential source of this bias[3]. We were particularly interested in exploring the role of smoking in the observed association because smoking and adiposity are known to influence each other [4-9] and share a genetic basis[10, 11].

      We agree that it would be useful to present the univariable MR effect estimates for smoking behaviour and HNC risk alongside those obtained using multivariable MR. We have now included the univariable MR estimates for both smoking behaviour variables as a note under Supplementary Table 11 and in the manuscript (lines 316-318): “In univariable IVW MR, both CSI and SI were linked to an increased risk of HNC (CSI OR=4.47 per 1-SD higher CSI, 95%CI 3.31–6.03, p<0.001; SI OR=2.07 per 1-SD higher SI, 95%CI 1.60–2.68, p<0.001) (Additional File 2: note in Supplementary Table 11).”
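
      Readers who want to check that a reported p-value agrees with its confidence interval can back-calculate the standard error on the log-odds scale. The sketch below assumes a symmetric Wald interval on the log scale and a two-sided normal test; the helper name is hypothetical, and the inputs are the ORs and intervals quoted in this response.

```python
from math import erfc, log, sqrt
from statistics import NormalDist


def z_and_p_from_or_ci(odds_ratio, ci_low, ci_high, level=0.95):
    """Recover SE, z and two-sided p from an OR and its confidence interval.

    Assumes a Wald interval that is symmetric on the log-odds scale.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)        # about 1.96 for 95%
    se = (log(ci_high) - log(ci_low)) / (2 * z_crit)      # SE of ln(OR)
    z = log(odds_ratio) / se
    p = erfc(abs(z) / sqrt(2))                            # two-sided p, stable for large |z|
    return se, z, p


reported = {
    "CSI (univariable IVW)": (4.47, 3.31, 6.03),
    "SI (univariable IVW)": (2.07, 1.60, 2.68),
    "BMI, Gormley et al.": (0.89, 0.72, 1.09),
}
for label, (or_, lo, hi) in reported.items():
    se, z, p = z_and_p_from_or_ci(or_, lo, hi)
    print(f"{label}: SE = {se:.3f}, z = {z:.2f}, p = {p:.2g}")
```

      Both smoking estimates give p-values well below 0.001, in line with the text. The Gormley et al. BMI interval implies a p-value of about 0.27, which is close to the reported 0.26 given that the inputs are rounded to two decimals.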

      We understand the appeal of conducting stratified MR analyses by smoking status. However, we anticipate such analyses would hinder the interpretation of our findings as they can induce collider bias which could spuriously lead to different effect estimates across strata[12, 13].

      We thank the reviewer/editors for their comment regarding the way we frame our findings. We have now edited the discussion section to highlight that our study results differ from those obtained in studies that do not account for smoking behaviour. Lines 398-401: “With a much larger sample (N=31,523, including 12,264 cases), our IVW MR analysis suggested BMI may play a role in HNC risk, in contrast to previous studies. However, our sensitivity analyses implied that causality was uncertain.”

      Reviewer #1 (Recommendations for the authors):

      The authors do share a table of the percent variance explained of the different genetic instruments, which vary widely, and that table is very welcome because we can get some sense of their utility. The problem is that they don't translate that into a power estimate for the case-control study size that they use. They say that it is the biggest to date, which is good, but without some formal power estimate, it is not particularly reassuring. A framework for MR study power estimates was reported in PMID: 19174578, but that was using very simple MR constructs in use in 2009, and it isn't clear to me if that framework can be used here. That power paper suggests that weak genetic instruments need very large sample sizes, far larger than what is used in the current manuscript. I am unable to estimate the true strength of the instruments used here, and so I am unsure of whether power is an issue or not.

      We have now included power calculations in our manuscript to address the reviewer’s concerns. Nevertheless, as mentioned above, post-hoc power calculations are of limited value, as statistical power is already reflected in the uncertainty around the point estimates (the 95% confidence intervals). Hence, it is important to avoid drawing conclusions regarding the likelihood of true effects or false negatives based on these calculations.

      Although the hypothesis here is that smoking accounts for the apparent BMI association previously reported for HNC, it would have been preferable to see the estimates for their 2 genetic instruments for tobacco alone. The current results only show the BMI instruments alone and then with the tobacco instruments. I would like to see what the risk estimates are for the tobacco instrument alone, so that I can judge for myself what happens in the joint models. As presented, one can only do that for the BMI instruments.

      We thank the reviewer for this comment. The univariable IVW MR estimate of smoking initiation was OR=2.07 (95%CI 1.60 to 2.68, p<0.001), while the one for comprehensive smoking index was OR=4.47 (95%CI 3.31 to 6.03, p<0.001). We have included this information in the manuscript as requested (please see response to reviewing editor above).

      On line 319, they write that "We did not find evidence against bias due to correlated pleiotropy..." I find this difficult to parse, but I think it means that they should believe that correlated pleiotropy remains a problem. So again, they seem to see their primary model as compromised, and so do I. This limitation is again stated by the authors on lines 351-352.

      We apologise if the wording of the sentence was not easy to understand. When using the CAUSE method, we did not find evidence to reject the null hypothesis that the sharing (correlated pleiotropy) model fits the data at least as well as the causal model. In other words, our CAUSE finding and the inconsistencies observed across our other sensitivity analyses led us to believe that our main IVW MR estimate for BMI-HNC was likely biased by correlated pleiotropy. We believe it is important to explore the source of this bias, which is why we used multivariable MR to investigate the direct effect of BMI on HNC risk while accounting for smoking behaviour.

      In the following paragraphs (lines 358-369), the authors state that their findings are consistent with prior reports, but that doesn't seem to be the case if we take their primary BMI instrument as representing the outcome of this manuscript. Here, they find an association between the BMI instrument and HNC risk, but in each of the other papers they present the primary finding was null without the extensive model changes or the aim of accounting for tobacco with another instrument. I don't see that as replication.

      This is a good point. We have now edited the discussion of our manuscript to avoid giving the impression that our findings replicate those from studies that do not account for smoking behaviour in their analyses. We have edited lines 384-401 as follows:

      “Previous MR studies suggest adiposity does not influence HNC risk[27-29]. Gormley et al.[28] did not find a genetically predicted effect of adiposity on combined oral and oropharyngeal cancer when investigating either BMI (OR=0.89 per 1-SD, 95% CI 0.72–1.09, p=0.26), WHR (OR=0.98 per 1-SD, 95% CI 0.74–1.29, p=0.88) or waist circumference (OR=0.73 per 1-SD, 95% CI 0.52–1.02, p=0.07) as risk factors. Similarly, a large two-sample MR study by Vithayathil et al.[29] including 367,561 UK Biobank participants (of which 1,983 were HNC cases) found no link between BMI and HNC risk (OR=0.98 per 1-SD higher BMI, 95% CI 0.93–1.02, p=0.35). Larsson et al.[27] meta-analysed Vithayathil et al.’s[29] findings with results obtained using FinnGen data to increase the sample size even further (N=586,353, including 2,109 cases), but still did not find a genetically predicted effect of BMI on HNC risk (OR=0.96 per 1-SD higher BMI, 95% CI 0.77–1.19, p=0.69). With a much larger sample (N=31,523, including 12,264 cases), our IVW MR analysis suggested BMI may play a role in HNC risk, in contrast to previous studies. However, our sensitivity analyses implied that causality was uncertain.”

      We also deleted part of a sentence in the discussion section, so lines 416-418 now look as follows: “An important strength of our study was that the HEADSpAcE consortium GWAS used had a large sample size which conferred more statistical power to detect effects of adiposity on HNC risk compared to previous MR analyses[27-29].”

      On lines 384-386 they note a strength is that this is the largest study to date, but I would reiterate that larger and more powerful does not equate to adequately powered.

      This is true. We have included power calculations in the manuscript as requested.

      It's well known that different HNC subsites have different etiologies, as they mention on lines 391-392, and it is implicit in their use of data on HPV positive and negative oropharyngeal cancer. They say that they did not find evidence for heterogeneity in this study, but that would only be true for the null BMI instrument. The effect sizes for their smoking instruments are strikingly different between the subsites.

      We agree and are sorry for the confusion we may have caused by the way we worded our findings. We have edited the text to clarify that the lack of subsite heterogeneity only applied to our results for BMI/WHR/WC-HNC risk. Lines 418-424 now read as follows:

      “Furthermore, the availability of data on more HNC subsites, including oropharyngeal cancers by HPV status, allowed us to investigate the relationship between adiposity and HNC risk in more detail than previous MR studies which limited their subsite analyses to oral cavity and overall oropharyngeal cancers[28, 68]. This is relevant because distinct HNC subsites are known to have different aetiologies[69], although we did not find evidence of heterogeneity across subsites in our analyses investigating the genetically predicted effects of BMI, WHR and WC on HNC risk.”

      Finally, the literature on mutational patterns gives us strong reason to believe that HNC caused by tobacco are biologically distinct from tumors not caused by tobacco. The authors report in the introduction that traditional observational studies of BMI and HNC have reported different findings in smokers versus never smokers, so I would assume there is a possibility that the BMI instrument could have different associations with tumors of the tobacco-induced phenotype and tumors with a non-tobacco induced phenotype. I would assume that authors have access to the data on self-reported tobacco use behavior, even if they can't separate these tumors by molecular types. Stratifying their analysis by tobacco users or not might reveal different results with the BMI instrument.

      We appreciate the reviewer’s comment. We agree that it would have been interesting to present stratified analyses by smoking status alongside our main findings. However, we decided against this because of the risk of inducing collider bias in our MR analyses, i.e., stratifying on smoking status may induce spurious associations between the adiposity instruments and confounding factors. Multivariable MR is considered a better way of investigating the direct effects of an exposure (adiposity) on an outcome (HNC) accounting for a third variable (smoking)[14], which is why we opted for this method instead.

      References:

      (1) Heinsberg LW, Weeks DE: Post hoc power is not informative. Genet Epidemiol 2022, 46(7):390-394.

      (2) Burgess S, Butterworth A, Thompson SG: Mendelian randomization analysis with multiple genetic variants using summarized data. Genet Epidemiol 2013, 37(7):658-665.

      (3) Burgess S, Davey Smith G, Davies NM, Dudbridge F, Gill D, Glymour MM, Hartwig FP, Kutalik Z, Holmes MV, Minelli C et al: Guidelines for performing Mendelian randomization investigations: update for summer 2023. Wellcome Open Res 2019, 4:186.

      (4) Morris RW, Taylor AE, Fluharty ME, Bjorngaard JH, Asvold BO, Elvestad Gabrielsen M, Campbell A, Marioni R, Kumari M, Korhonen T et al: Heavier smoking may lead to a relative increase in waist circumference: evidence for a causal relationship from a Mendelian randomisation meta-analysis. The CARTA consortium. BMJ Open 2015, 5(8):e008808.

      (5) Taylor AE, Morris RW, Fluharty ME, Bjorngaard JH, Asvold BO, Gabrielsen ME, Campbell A, Marioni R, Kumari M, Hallfors J et al: Stratification by smoking status reveals an association of CHRNA5-A3-B4 genotype with body mass index in never smokers. PLoS Genet 2014, 10(12):e1004799.

      (6) Taylor AE, Richmond RC, Palviainen T, Loukola A, Wootton RE, Kaprio J, Relton CL, Davey Smith G, Munafo MR: The effect of body mass index on smoking behaviour and nicotine metabolism: a Mendelian randomization study. Hum Mol Genet 2019, 28(8):1322-1330.

      (7) Asvold BO, Bjorngaard JH, Carslake D, Gabrielsen ME, Skorpen F, Smith GD, Romundstad PR: Causal associations of tobacco smoking with cardiovascular risk factors: a Mendelian randomization analysis of the HUNT Study in Norway. Int J Epidemiol 2014, 43(5):1458-1470.

      (8) Carreras-Torres R, Johansson M, Haycock PC, Relton CL, Davey Smith G, Brennan P, Martin RM: Role of obesity in smoking behaviour: Mendelian randomisation study in UK Biobank. BMJ 2018, 361:k1767.

      (9) Freathy RM, Kazeem GR, Morris RW, Johnson PC, Paternoster L, Ebrahim S, Hattersley AT, Hill A, Hingorani AD, Holst C et al: Genetic variation at CHRNA5-CHRNA3-CHRNB4 interacts with smoking status to influence body mass index. Int J Epidemiol 2011, 40(6):1617-1628.

      (10) Thorgeirsson TE, Gudbjartsson DF, Sulem P, Besenbacher S, Styrkarsdottir U, Thorleifsson G, Walters GB, Consortium TAG, Oxford GSKC, consortium E et al: A common biological basis of obesity and nicotine addiction. Transl Psychiatry 2013, 3(10):e308.

      (11) Wills AG, Hopfer C: Phenotypic and genetic relationship between BMI and cigarette smoking in a sample of UK adults. Addict Behav 2019, 89:98-103.

      (12) Coscia C, Gill D, Benitez R, Perez T, Malats N, Burgess S: Avoiding collider bias in Mendelian randomization when performing stratified analyses. Eur J Epidemiol 2022, 37(7):671-682.

      (13) Hamilton FW, Hughes DA, Lu T, Kutalik Z, Gkatzionis A, Tilling K, Hartwig FP, Davey Smith G: Non-linear Mendelian randomization: evaluation of effect modification in the residual and doubly-ranked methods with simulated and empirical examples. Eur J Epidemiol 2025.

      (14) Sanderson E, Davey Smith G, Windmeijer F, Bowden J: An examination of multivariable Mendelian randomization in the single-sample and two-sample summary data settings. Int J Epidemiol 2019, 48(3):713-727.

    1. “archaeology is an inherently uncanny subject” (p. 91) in his discussion of the spectacle of anatomical dissection and the archaeological gaze, as it “brings dead people, dead places and dead things into the world of the living”

      I did not know this was an opinion people had on archeology. The people I know who learn of my interest in archeology often mention the game Temple Run or Indiana Jones: archeology as the catalyst for an adventure.

    1. Linking rewards to performance measures. Should compensation systems be linked to balanced scorecard measures? Some companies, believing that tying financial compensation to performance is a powerful lever, have moved quickly to establish such a linkage.

      Linking compensation to Balanced Scorecard measures makes sense because it creates alignment—people are rewarded for achieving the same goals the organization cares about. I like that Pioneer Petroleum went beyond just financial outcomes to include customer, employee, and environmental indicators, since this encourages a more balanced focus. At the same time, I think the risks are very real. If the wrong measures are chosen, or if the data isn’t reliable, people could “game the system” or prioritize numbers over quality. To me, this shows that while tying rewards to strategy can be powerful, it requires careful design and ongoing review to make sure the incentives actually drive the right behaviors.

    1. We wanted to limit social media as much as possible. But when friends plan where to meet up via Instagram messenger or some other platform, and when the key information for every soccer game—where, when, which uniform—is communicated via group chat, there is no choice but to join.

      Yes! I do not want my children on social media for as long as I possibly can. However, many coaches and extracurriculars use apps and social media to communicate. That leaves little choice in keeping them away from it.

    1. One can imagine that a few curious 23rd-century simulators might focus on the early 21st century. Let’s suppose the simulators live in a world in which Hillary Clinton defeated Jeb Bush in the US presidential election of 2016. They might ask: How would history have been different if Clinton had lost? Varying a few parameters, the simulators might go so far as to simulate a world where the 2016 victor was Donald Trump. They might even simulate Brexit and a pandemic.

      I think that use of VR in this way would be very interesting in a game format but I think that bringing things like VR into such serious topics as politics could get messy. Politics are already such a controversial topic that adding a "what could've been" scenario could be harmful.

    2. These temporary limitations will pass. The physics engines that underpin VR are improving. In years to come, the headsets will get smaller, and we will transition to glasses, contact lenses, and eventually retinal or brain implants. The resolution will get better, until a virtual world looks exactly like a nonvirtual world. We will figure out how to handle touch, smell, and taste. We may spend much of our lives in these environments, whether for work, socializing, or entertainment.

      It's so crazy to me how much VR can and will change the world. I think that it's really cool to use as a fun game or activity, but I do not think that it should be incorporated into everyday life. I feel as though it's going to make the world into such a fake environment and ruin true socialness and connection.

    3. In the 2000s, people began spending vast amounts of time in multiplayer virtual worlds like Second Life and World of Warcraft. In the 2010s, there arrived the first rumblings of consumer-level virtual reality headsets, like the Oculus Rift. That decade also saw the first widespread use of augmented reality environments, which populate the physical world with virtual objects in games like Pokémon Go.

      I remember when Pokemon Go first came out. Finding Pokémon became part of everyone's routine, and players became too immersed in the game because it tied a simulation to the real world. It received some backlash because players were too focused on the game while walking to find these Pokémon; they would be lured into dangerous areas or be injured by vehicles, falls, etc. Much like VR, it was fun because you are in your own reality while playing, and it feels real to the players.

    1. But, is there an inflection point of consequence that changes the name of the “game” of life on earth for everybody and everything? It’s more than climate change; it’s also extraordinary burdens of toxic chemistry, mining, depletion of lakes and rivers under and above ground, ecosystem simplification, vast genocides of people and other critters, etc, etc, in systemically linked patterns that threaten major system collapse after major system collapse after major system collapse. Recursion can be a drag.

      Haraway begins with a strong, compelling opening that frames the argument she is trying to make. By posing the question of the “inflection point” that changes the “game of life on earth,” she immediately emphasizes the gravity of the ecological and systemic crisis. She expands beyond the understanding of climate change to include toxic chemistry, mining, water depletion, ecosystem simplification, and mass extinctions to establish the interconnected and recursive nature of the collapse of our planet.

    1. In both cases, the benefits are wildly exaggerated and the costs passed on to the people who never got a seat at the table

      People are just expected to allow A.I. to take over without a say in the game.

    1. In fact, it reminds me of a particular game my son William invented at about age five. At his own initiative he one day drew a large game board, assembled dice and playing pieces, and invited his father to join him in an inventively improvised game with ever-changing and ever more elaborate rules. After two hours of this surreal activity, my husband became restless and began asking every five minutes or so if the game was almost over. William responded by calmly walking into the kitchen, where I was sitting, and asking me to write his father the following note: DEAR DAD—THIS GAME WILL NEVER END. WILLIAM The rhizome has the same message.

      This is by far the clearest way to illustrate the idea of the rhizome story. It is a rather complex idea to comprehend and this makes it much easier to wrap your head around.

    2. But activity alone is not agency. For instance, in a tabletop game of chance, players may be kept very busy spinning dials, moving game pieces, and exchanging money, but they may not have any true agency

      I feel like this applies to "The boy in the book", in which we are also given multiple choices throughout the course of the story, but the effects of those choices are very limited. We get to choose where Nathan goes and who he meets, but we do not get to decide what he asks or what he looks for when he gets to the places where we want him to be (the Internet search is a major example of this). The question, then, should be: to what degree should we get to make choices, and how much should our choices affect the narrative, for it to be considered true agency?

    3. As I move forward, I feel a sense of powerfulness, of significant action, that is tied to my pleasure in the unfolding story.

      When I read this sentence, it reminded me of the human desire for accomplishment. I think that's why the adventure maze works well. Just like when you are playing an escape room game, you are actively trying to find a way out, and each step and subtle clue you find makes you feel closer to the goal. The adventure maze uses this sense of accomplishment to give the player a feeling of powerfulness.

    4. cruel things that happen to the hero are often treated as instances of a specific social injustice

      This speaks to how, in Hana Feels, her own feelings about cutting herself feel like they're being treated as a real-life scenario rather than a story when playing the game and making decisions, since you are making choices that will impact her mental health and her possibly continuing self-harm.

    5. The palace is full of informants, who speak in text bubbles and whom you reply to from menus, and you must negotiate with them carefully, offering them icons representing money or other valuables. A mysterious peddler on one of the lower levels holds a talisman needed to get into the highest chamber. You must have it with you while you stand on a special spot that is hidden in the patterning of the floor. If you forget to get it, you must retrace your steps through many perils. The game is like a treasure hunt in which a chain of discoveries acts as a kind of Ariadne’s thread to lead you through the maze to the treasure at the center. (11)

      Nigh all computations are based on an equal sign, and so all forms of interaction within games are forced to accept the logic of exchange, in the sense of transaction rather than potlatch. These transactional actions allow the player to advance, and so lead to the ultimate transaction with the author, in which you broker your actions throughout the game for the ending you hope for.

    1. distinct models, one appropriate to fiction, the other to non-fiction. Instead, one model seems to be enough, a model that is capable of inflection by fictional or non-fictional concerns.

      This book was initially published in 1982. Back then, perhaps non-fiction programs were exclusively news programs and TV live game shows; this is no longer true today. In recent times, reality TV shows have become extremely popular. In my opinion, now more than ever, the narration of TV needs to be divided.