10,000 Matching Annotations
  1. Mar 2026
    1. Below is a complete reading of this article, following the template you provided:


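      📌 Metadata Overview

      Title: "What You Don't Know About Claude Code: Architecture, Governance, and Engineering Practice". It focuses on how to use and govern Claude Code, an AI coding agent tool, in depth from an engineering perspective.

      Author: Tw93, an independent developer who maintains a personal technical blog and the Weekly newsletter, has shipped real engineering work such as the open-source terminal project Kaku (Rust + Lua), and has substantial hands-on engineering experience with cutting-edge AI tools.

      Link: tw93.fun

      Tags: #ClaudeCode #AICodingAgent #ContextEngineering #PromptCaching #MCP #Skills #Hooks #Subagents #CLAUDE.md #EngineeringPractice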


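      ✨ Core Ideas and Highlights

      Claim: The bottleneck in using Claude Code is not model capability but missing governance at the system-design level. The author breaks the system into a six-layer architecture (CLAUDE.md / Tools / Skills / Hooks / Subagents / Verifiers) and argues that every layer needs careful design; an imbalance in any one layer breaks the whole collaboration system.

      Highlights:
      - Breaks down the real cost of the 200K context precisely, revealing MCP tool definitions as the biggest "hidden context killer" (5 servers alone can consume 12.5% of the context);
      - Proposes the practical pattern of using HANDOFF.md to carry progress across sessions, addressing the pain point of the compaction algorithm losing architectural decisions;
      - Explains the prefix-matching mechanism of Prompt Caching and its deep effect on architecture design, including the counterintuitive conclusion that "switching models mid-session actually costs more";
      - Proposes an advanced Plan Mode technique of "having AI review AI" (one Claude writes the plan, and one Codex audits it in the role of a senior engineer);
      - Shares real cases from the evolution of Claude Code's internal tools, including the design philosophy behind making AskUserQuestion a standalone tool.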
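The context-cost figure above can be checked with a few lines of arithmetic. This is a minimal sketch: the 200K window and the "5 servers, 12.5%" figure come from the summary, while the even split across servers and the function name `mcp_overhead_tokens` are assumptions made only for illustration.

```python
# Figures from the summary: a 200K-token window, with 5 MCP servers
# consuming 12.5% of it. Splitting the cost evenly per server is an assumption.
CONTEXT_WINDOW_TOKENS = 200_000
MCP_SERVER_COUNT = 5
MCP_SHARE_OF_WINDOW = 0.125


def mcp_overhead_tokens(window: int = CONTEXT_WINDOW_TOKENS,
                        share: float = MCP_SHARE_OF_WINDOW) -> int:
    """Tokens occupied by MCP tool definitions for a given share of the window."""
    return int(window * share)


total = mcp_overhead_tokens()
per_server = total // MCP_SERVER_COUNT          # assumes an even split
remaining = CONTEXT_WINDOW_TOKENS - total

print(f"MCP tool definitions: {total} tokens")
print(f"Per server (even split assumed): {per_server} tokens")
print(f"Left for everything else: {remaining} tokens")
```

Under these assumptions the tool definitions occupy 25,000 tokens before any conversation starts, which is why the summary treats them as a fixed cost worth budgeting.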


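      🔍 Layer-by-Layer Reading

      Opening: The article opens with the author's half year of heavy use across two accounts at 40 US dollars a month, and points out the typical difficulties most people hit when they treat Claude Code as a chatbot: the context gets messier and messier, piling on more tools makes results worse, and the longer the rules get the less they are followed. The author's core judgment is that this is not a prompt problem but a system design problem.

      Development: The article uses the six-layer architecture as its skeleton and walks through the engineering practice layer by layer. The context engineering part breaks down what actually occupies the 200K window and gives a tiered loading strategy (always resident / by path / on demand / isolated / kept out of context). The Skills part distinguishes three design patterns (checklist, workflow, and domain expert) and stresses how much concise descriptors matter for saving context. The Hooks part explains the boundary between enforced logic and the model's free judgment. The Subagents part stresses that their core value is "isolation" rather than "parallelism". The Prompt Caching part analyzes in depth the design logic by which Claude Code's whole architecture is built around cache prefix matching, including defer_loading and how Compaction is implemented.

      Conclusion: The author summarizes three stages of using Claude Code (tool user → workflow optimizer → system designer) and offers one core criterion: if you cannot say clearly "what counts as done", the task is not suited to being handed directly to Claude to complete on its own. Verification criteria are a precondition for an agent to operate reliably, not something added afterwards.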
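The "what counts as done" criterion can be made concrete with the simplest kind of verifier the article mentions, a command exit code. The sketch below is illustrative only: the helper name `verify` and the two inline check commands are hypothetical stand-ins for a project's real test or lint command.

```python
import subprocess
import sys


def verify(command: list[str]) -> bool:
    """Return True only if the command exits with status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


# Hypothetical checks: in a real project these would be the test or lint command.
passing = verify([sys.executable, "-c", "assert 1 + 1 == 2"])
failing = verify([sys.executable, "-c", "assert 1 + 1 == 3"])

print("passing check:", passing)
print("failing check:", failing)
```

The point of the design is that the pass/fail decision comes from the exit status alone, so it does not depend on the model's own judgment of its work.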


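      📚 Key Terms and Concepts

      CLAUDE.md: A project-level persistent contract file holding the commands, boundaries, and prohibitions that must hold in every session. It should stay short, firm, and executable, and should not turn into a team knowledge base.

      MCP (Model Context Protocol): A protocol for connecting external capabilities, letting Claude access external systems such as GitHub, Sentry, and databases. Each server's tool definitions take up a large amount of fixed context, which makes MCP the largest source of hidden cost.

      Skills: Knowledge and workflows loaded on demand. Descriptors stay resident in the context, while the full content is fetched only when needed. The core design principle is "progressive disclosure".

      Hooks: An interception layer that forces deterministic logic to run before and after lifecycle points of Claude's operations. It does not rely on the model's judgment and suits formatting, file protection, push notifications, and similar scenarios.

      Subagents: Independent Claude instances dispatched from the main conversation, each with its own context window and a restricted tool set. Their core value is isolation, not parallelism.

      Prompt Caching: A caching mechanism that works by prefix matching. Static content (system instructions, tool definitions) should sit at the front of the prompt to maximize the hit rate. Common operations that break the cache include embedding timestamps in the system prompt and adding or removing tools mid-session.

      Verifiers: Closed-loop verification mechanisms, including command exit codes, lint, tests, screenshot comparison, and production logs. They are the foundation for making agent engineering trustworthy.

      HANDOFF.md: A progress handoff file that Claude generates before a new session is opened. It records current progress, the approaches tried, what succeeded or failed, and the next direction, and it replaces reliance on the quality of the compaction algorithm's summary.

      RTK (Rust Token Killer): An open-source tool that uses a Hook to filter command output automatically and keep only the core information needed for decisions, addressing tool-output noise systematically.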
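The prefix-matching behaviour described under Prompt Caching can be illustrated with a toy model. This is a sketch under stated assumptions, not the real caching implementation: a prompt is modelled as a list of text blocks, and `shared_prefix_len` (a hypothetical helper) counts how many leading blocks two consecutive requests have in common, which stands in for what a prefix cache could reuse.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count leading blocks that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


# Stable prefix: static blocks first, new turns only appended at the end.
stable_first = ["system: rules", "tools: read, edit", "user: fix bug"]
stable_second = stable_first + ["assistant: done", "user: add test"]

# Timestamp embedded in the system prompt: the very first block changes.
stamped_first = ["system: rules @ 10:00", "tools: read, edit", "user: fix bug"]
stamped_second = ["system: rules @ 10:05", "tools: read, edit", "user: fix bug",
                  "assistant: done", "user: add test"]

stable_hits = shared_prefix_len(stable_first, stable_second)
stamped_hits = shared_prefix_len(stamped_first, stamped_second)

print("reusable blocks with a stable prefix:", stable_hits)
print("reusable blocks with a timestamped system prompt:", stamped_hits)
```

In this toy model the stable layout reuses all three earlier blocks, while the timestamped one reuses none, even though everything after the first block is unchanged. That is the sense in which a change near the front of the prompt is more costly than one near the end.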


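      🚫 Irrelevant Content Filtered Out

      The comment-section exchanges at the end of the article (reader comments such as "amazing" and "learned a lot") and the "buy Tw93 an iced cola" tip link are unrelated to the article's core engineering content and can be skipped. The teaser at the end for another article, "How Can Someone Who Can't Even Install the Lobster Use the Lobster?", is an on-site recommendation and is not part of this article.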


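      🌟 Article Summary

      This article is an in-depth practical guide to Claude Code for engineers. Starting from a six-layer architecture (CLAUDE.md, Tools/MCP, Skills, Hooks, Subagents, Verifiers), the author explains systematically how to govern context pollution, design Skills and tools with care, use Hooks to enforce deterministic logic, isolate complex tasks with Subagents, and design the architecture around the Prompt Caching mechanism. The core argument is that the bottleneck in Claude Code's results lies not in the model but in system governance: only when all six layers are designed properly can the agent run autonomously and stably within its constraints.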


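      💡 Key Quotes

      "Where things get stuck is almost never that the model isn't smart enough. More often it was given the wrong context, or it produced something but there is no way to judge whether it is right, and no way to roll it back."

      "If you can't even say clearly what it means for Claude to have done a task right, then the task is probably not suited to being handed straight to Claude to complete automatically."

      "Cache Rules Everything Around Me. The same holds for agents: Claude Code's entire architecture is built around prompt caching."

      "This tool was added because the model wasn't strong enough, and once the model got stronger it became a shackle instead. It is worth coming back after a while to check whether the restrictions you added still hold."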


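      🔄 Summary

      The value of this article is that it repositions Claude Code from "a handy AI tool" to "an agent system that needs engineering governance". Drawing on half a year of real trial and error, the author distills a complete methodology covering context management, Skills design, Hooks interception, and Prompt Caching architecture. For engineers who already use Claude Code but feel that it "gets messier the more I use it", the article offers a clear diagnostic framework and a practical path to improvement. It is a rare engineering write-up in the Chinese-language community that combines depth with hands-on usefulness.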


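      ❓ Questions for Further Thought

      Question 1: The article argues that "verification criteria are a precondition for an agent to operate reliably", but in real engineering many tasks have acceptance criteria that are themselves vague (for example, "improve code readability"). In such cases, how can one design a Verifier that is trustworthy enough, rather than letting verification itself depend on the model's subjective judgment?

      Question 2: As model capability keeps improving, many of the constraint layers mentioned in the article (such as the TodoWrite tool eventually becoming a shackle) may gradually stop being useful. How, then, can one set up a dynamic "configuration health check" so that CLAUDE.md and Skills keep iterating as model capability evolves, instead of accumulating into an ever thicker historical burden?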

    1. How Cultured Chicken is Made Code

      At the top of this page, or maybe on another page, it would be nice to have some sort of mosaic graph with cost breakdowns for different scenarios, dividing the total cost per kilogram into its components. Each mosaic element could then link to a section explaining it.

    1. These recommendations, on Good Practice for Conference Abstracts and Presentations (GPCAP), focus on company-sponsored research (see the ‘Note on terminology’ section). However, they do not cover other company activities that may be linked to conferences (e.g. satellite symposia organized alongside scientific conferences, medical education and marketing activities) because these are governed by regional and national legislation or codes (e.g. EFPIA code of practice [8], FDA regulations [9]). As with the GPP guidelines, GPCAP focuses on the presentation of all types of company-sponsored research and the specific challenges surrounding this, rather than investigator-sponsored or investigator-initiated trials or research (where companies have no role in their presentation or publication), although many of the principles also apply to the presentation of other types of research at scientific meetings. The aim of GPCAP is therefore to provide guidance on good submission and presentation practice for scientific and medical congresses, specifically addressing certain aspects where current

      This paragraph describes the Good Practice for Conference Abstracts and Presentations (GPCAP) guidelines, which were developed to address gaps in publication standards for company-sponsored research presented at scientific meetings. They focus on quality, transparency, and ethical presentation. For example, the PRISMA guideline may not apply to conference abstracts.

    1. The government continued to require all women to adhere to “Islamic dress” standards in public, including covering their hair and fully covering their bodies in loose clothing – an overcoat and a hijab or, alternatively, a chador (a full body-length piece of fabric worn over both the head and clothes). “Un-Islamic dress” was punished with arrests, imprisonment, lashings, fines, mandatory psychiatric treatment, closure of businesses which did not enforce dress codes, and dismissal from employment. In an address to regime figures on April 4, Ayatollah Khamenei described the hijab as a requirement of both Islam and the law. Ignoring it, he added, was “forbidden both under Islam and politically.” According to press reports, judges also sentenced women convicted of not wearing the hijab to public service, including work in morgues or street cleaning, in lieu of prison time. Protests against the mandatory hijab under the slogan “Zan, Zendegi, Azadi” (“Woman, Life, Freedom”) continued into the year, originally prompted by Mahsa Amini’s death in custody in 2022. In April, VOA reported authorities closed 45 businesses after the businesses ignored warnings they were not enforcing the compulsory hijab rule among their customers. According to VOA, on April 15, the government launched a new domestic surveillance program for enforcing the mandatory hijab law, and the national police chief, Ahmad Reza Radan, said authorities would employ advanced surveillance capabilities, including street cameras, to identify women violating the law. VOA stated that despite the threat of punishment, videos posted on social media appeared to show many women in different parts of the country defying the rule. In July, Amnesty International reported that more than a million women had received SMS text warnings threatening to confiscate their cars if they were found traveling in a vehicle unveiled. 
The Amnesty report also stated that unveiled women had been denied access to banking, education, and public transit. The Amnesty posting also reported that on July 16, a police spokesman announced the return of police patrols to enforce compulsory veiling. Analysis published by the U.S. Institute of Peace (USIP) stated that as of September, regular citizens were being compelled to assist law enforcement efforts regarding the wearing of the hijab. USIP reported citizens were required to deny services to women who did not adhere to hijab regulations, blocking access to banks, shops, and restaurants, and that business owners who failed to comply risked fines or even closure of their establishments. On July 10, the Guardian newspaper reported that at least 60 women had been barred from university for noncompliance with the hijab law. In his August report, UN Special Rapporteur Rehman reported that “vigilante justice” by elements of the government resulted in violence against women, including arrests and arbitrary detentions, and that “hundreds of businesses” were closed or had received warnings for allowing customers or employees to wear an “improper” hijab. CHRI reported that on August 7, the Tehran municipal government hired 400 hijab guards to enforce hijab compliance in public areas. According to CHRI, the guards were separate from the morality police, who enforced codes of conduct for both men and women. On September 20, parliament approved the “Bill to Support the Family by Promoting the Culture of Chastity and Hijab,” that would set new and harsher penalties for noncompliance with the Islamic dress code as interpreted by authorities; the Council of Guardians had not approved the bill by year’s end. Amnesty International reported the bill equated unveiling to nudity. 
According to a statement from Ravina Shamdasani, spokesperson for the UN High Commissioner for Human Rights, the proposed changes increased the potential prison time for violating the compulsory dress code from two months to 10 years and increased the fine from 500,000 rials ($12) to up to 360 million rials ($8,600). Shamdasani said women also faced flogging, travel restrictions, and deprivation of online access. According to USIP and the Associated Press, the bill also called for more strict segregation of the sexes in schools, parks, hospitals, and other locations. The bill reportedly extended punishments to business owners who served women not wearing hijab and activists who organized against it. Celebrities not wearing the hijab properly could be banned from leaving the country and performing. On September 1, a panel of UN experts that included UN Special Rapporteur Rehman stated the legislation “could be described as a form of gender apartheid.” The panel expressed concern that the language of the draft law could lead to violent enforcement, adding that the bill “violates fundamental rights, including the right to take part in cultural life, the prohibition of gender discrimination, freedom of opinion and expression, the right to peaceful protest, and the right to access social, educational, and health services and freedom of movement.” On October 6, the Norwegian Nobel Committee awarded the 2023 Nobel Peace Prize to Narges Mohammadi, the imprisoned deputy head of the NGO Defenders of Human Rights Center, “for her fight against the oppression of women in Iran and her fight to promote human rights and freedom for all.” Nobel Committee chair Berit Reiss-Andersen said, “Her brave struggle has come with tremendous personal costs. 
Altogether, the regime has arrested her 13 times, convicted her five times, and sentenced her to a total of 31 years in prison and 154 lashes.” In addition to speaking out against the death penalty, torture, and solitary confinement, Mohammadi opposed the compulsory hijab. She remained in Evin Prison at year’s end, serving a 12-year sentence on multiple charges related to her advocacy, including “spreading propaganda against the regime.” Radio Farda reported authorities twice denied Mohammadi medical care and exams after her refusal to wear a hijab. In a letter to the Nobel Committee, Mohammadi wrote, “The ‘compulsory hijab’ is a means of control and repression imposed on the society and on which the continuation and survival of this authoritarian religious regime depends.” In a letter to CNN smuggled out of the prison, Mohammadi said that the government was using the hijab as a pretext “all to preserve the image of religious Islamic men and ensure the security and purity of women,” adding, “The ‘compulsory hijab’ was a deceitful scheme against women and a tool of pressure to strengthen the power of the religious government.” Following the Nobel committee’s announcement, numerous foreign governments and human rights organizations, including Amnesty International, called on the government to release Mohammadi. On February 9, international media reported the government released seven women from Evin Prison, including Saba Kord Afshari, who had been imprisoned since 2019 after she campaigned against the mandatory hijab. In 2022, authorities reduced her sentence from seven and a half years to five years. The government reportedly continued to suppress other public behavior it deemed counter to Islamic law, such as women dancing or singing in public. According to a UN experts panel statement issued in March, several young women who filmed themselves dancing on the street without covering their hair were chased down and forced to apologize on state television. 
In January, according to KHRN, the Revolutionary Court in Shahriar in Tehran Province sentenced a Kurdish woman, Mahsa Farhadi, to two years’ imprisonment for not wearing a hijab. Security officials arrested Farhadi in late 2022. On March 7, international media reported a court had sentenced an unnamed woman to two years in prison for removing her hijab. One report quoted prosecutor Abbas Jafari Dolatabadi as saying the woman was attempting to “encourage corruption through the removal of the hijab in public.” Radio Liberty/Radio Free Europe reported that earlier in the year, authorities announced they had detained 29 women who removed their head scarves as part of a campaign against the country’s mandatory Islamic dress code. On August 17, KHRN reported authorities arrested Firmesk Babaei after she removed her hijab and shouted antigovernment slogans near the governor’s office in the city of Paveh in Kermanshah Province. According to the report, officials beat Babaei during her arrest and took her to an unknown location.

      gender-based violence

    2. The joint report stated that in 2022, authorities increasingly prosecuted Christians under Article 500, which criminalizes “engaging in propaganda that educates in a deviant way contrary to the holy religion of Islam” and carries a punishment of up to five years’ imprisonment; or up to 10 years if the defendant received financial or organizational help from outside the country. The report said the increase of such prosecutions “indicates the prevalence of surveillance of Iranian citizens regarding their religious beliefs.” Christians convicted under this article were punished with imprisonment, internal exile, travel bans, community service, and deprivation of some social services. Authorities also often charged Christians under Article 499 of the penal code for membership in a group proscribed by Article 498, which criminalizes any group, society or branch that “aims to perturb the security of the country.” Punishment ranged from two to 10 years in prison. Christian NGOs and the online Christian media outlet Morningstar News reported authorities in February released from Evin Prison Pastor Yousef Nadarkhani, as well as Christian converts Hadi Rahimi and Zaman Fadaei, as part of the government’s amnesty marking the anniversary of the 1979 revolution. Nadarkhani had been serving a six-year sentence for acting against national security, propagating house churches, and promoting “Zionist Christianity.” Rahimi and Fadaei were serving four- and six-year sentences, respectively, for “acting against national security” and “spreading ‘Zionist’ Christianity” for attending house churches. In early July, NGOs reported that authorities rearrested Nadarkhani, as well as fellow Church of Iran pastor Matthias (Abdolreza Ali) Haghnejad, on charges of attempting to undermine national security after the government pressured a couple belonging to the church into incriminating the two men. 
According to CSW, Haghnejad never met the couple and Nadarkhani was only vaguely acquainted with them. Haghnejad already was in government custody awaiting a retrial on 2014 charges of undermining government security and promoting Zionist Christianity. Despite his initial acquittal of those charges, the Supreme Court overturned the verdict and the government rearrested him in late 2022, a short time after his release from prison on other charges. In mid-July, the government moved Haghnejad to a prison in the city of Minab in Hormozgan Province, a thousand miles from his home in the city of Bandar Azali. On January 3, the government arrested Anahita Khademi, Haghnejad’s wife. Authorities released her on bail on January 28, after charging her with “propaganda against the regime” and “disturbing public opinion.” In May, Article 18 reported that a judge in Branch 34 of the Appeals Court in Tehran overturned the conviction of Christian converts Homayoun Zhaveh and Sara Ahmadi, who had been convicted of participating in a house church. The judge broke with legal precedent by ruling that participation in a house church was not illegal. He said gathering with people of one’s own faith was “natural” and having books related to Christianity was “also an extension of their beliefs,” adding there was no evidence the couple had acted against the country’s security or had connections with opposition groups or organizations. Prisoners practicing a religion other than Twelver Shia Islam reported experiencing discrimination. Activists and NGOs reported the government continued to detain or disappear Yarsani activists and community leaders for raising awareness regarding government practices or discrimination against the Yarsani community, such as the requirement that Yarsanis identify themselves as Shia in order to access employment or higher education.

      Whereas the regime has a history of disproportionately cracking down on religious and ethnic minorities, including Christians, Baha’is, Zoroastrians, Jews, Sunnis, agnostics and Kurds;

    3. On January 19, the European Parliament adopted a resolution condemning “in the strongest possible terms the death sentences of peaceful protesters in Iran,” urging the country to abolish the death penalty, and calling on the government “to review its legal code and eliminate moharebeh (‘enmity against God’) and mofsed-e-filarz (‘corruption on earth’) as punishable offenses” and to release human rights defenders.

      Whereas the regime continues to persecute citizens who it disagrees with by using criminal statutes like “insulting the Prophet,” “insulting Islam,” “rebellion against God,” and “corruption on earth”;

    4. According to the Baha’i International Community (BIC) and multiple international news organizations, security forces in cities across the country continued to shut down Baha’i-owned businesses, conducted multiple raids of Baha’i homes, arrested Baha’is in their homes or workplaces on unsubstantiated charges, and confiscated money and personal belongings. BIC reported authorities closed 59 businesses from mid-July to mid-August. In a February report, UN Special Rapporteur on the situation of human rights in the Islamic Republic of Iran Javaid Rehman stated the Baha’i minority “remained most severely persecuted, with a marked increase in arrests, targeting, and victimization,” including being deprived of livelihoods, denied access to higher education, and denied the ability to bury their dead in accordance with Baha’i rites. BIC reported that between March and May, authorities sentenced four Baha’is to five years in prison for seeking to facilitate Baha’i burials. On September 20, parliament approved a bill that, if adopted, would increase penalties for noncompliance with the Islamic dress code as interpreted by authorities, raising prison time for violating the code from two months to 10 years and increasing the fine from 500,000 rials ($12) to up to 360 million rials ($8,600). In April, the government announced a domestic surveillance program, including using street cameras, to enforce the hijab law; authorities closed 45 businesses for allowing patrons to violate the law.

      Whereas the regime has a history of disproportionately cracking down on religious and ethnic minorities, including Christians, Baha’is, Zoroastrians, Jews, Sunnis, agnostics and Kurds;

    5. According to UN experts, numerous international human rights NGOs, and media reporting, the government convicted and executed peaceful protesters on charges of “enmity against God” and dissidents on charges of blasphemy and spreading anti-Islamic propaganda. A June report by the UN Secretary-General stated that ethnic and religious minorities were “significantly affected” in the context of the nationwide protests following the September 2022 death in custody of Mahsa (Zhina) Amini, a Sunni Kurdish citizen whom the “morality police” (Gasht-e Ershad, literally “guidance patrol”) detained for allegedly wearing her hijab improperly and thereby violating the country’s strict Islamic dress code. The report also stated the government disproportionately imposed death sentences on persons belonging to ethnic minorities, including members of the Baloch, Arab, and Kurdish minorities. According to the Human Rights Activists News Agency (HRANA), the government executed at least 746 individuals during the year, including for offenses classified as “corruption on earth” or “ideological-political-religious reasons,” and arrested 142 citizens for religious reasons. HRANA said the majority of human rights violations against religious minorities involved Baha’is (85 percent), but also impacted Sunnis (11 percent), Yarsans (2 percent), Gonabadi Dervishes, Christians, and other religious minorities.

      Both: Whereas the regime has a history of disproportionately cracking down on religious and ethnic minorities, including Christians, Baha’is, Zoroastrians, Jews, Sunnis, agnostics and Kurds; and Whereas the regime continues to persecute citizens who it disagrees with by using criminal statutes like “insulting the Prophet,” “insulting Islam,” “rebellion against God,” and “corruption on earth”;

    6. Prevailing fatwas prescribe the death penalty for apostasy. According to the penal code, the application of the death penalty varies depending on the religion of both the perpetrator and the victim. The penal code criminalizes insulting “divine religions or Islamic schools of thought” and committing “any deviant educational or proselytizing activity that contradicts or interferes with the sacred law of Islam.” Proselytization of religions other than Islam carries a punishment of up to 10 years in prison. Nongovernmental organizations (NGOs) said these provisions put religious minorities at a higher risk of persecution. The law, as typically interpreted, prohibits Muslim citizens from changing or renouncing their religious beliefs. The constitution states that Zoroastrians, Jews, and Christians are the only recognized religious minorities permitted to worship and form religious societies “within the limits of the law

      Whereas the regime has a history of disproportionately cracking down on religious and ethnic minorities, including Christians, Baha’is, Zoroastrians, Jews, Sunnis, agnostics and Kurds;

    1. https://www.reddit.com/r/typewriters/comments/1ktb7ty/what_are_the_rules_of_typewriter_club/?sort=new

      The Rules of Typewriter Club

      The first rule of Typewriter club is Do not oil the segment.

      The second rule of Typewriter club is DO NOT oil the segment.

      Do not ask the value of your typewriter: they are invaluable.

      Always talk about typewriter club. Every chance you get: to family, friends, complete strangers...

      If you only have one typewriter, you must refer to it as "my FIRST typewriter".

      If you're new to typewriter club, you have to type.

      A typewriter is not broken unless it is clean and broken.

      Parts of a typewriter should only be removed in order to repair another typewriter.

      Keychoppers shall have the extremities they used to chop keys chopped off.

      More than one machine is allowed to be your "favorite".

      The last typewriter you bought is the greatest one. Until the next one.

      Never leave a typewriter outside, in a barn, or in a damp basement to rust.

      Typewriters are to type with. They should not be "flipped".

      Any reason is a good reason to buy and use a typewriter.

      The hardest part of typewriter repair is believing you can do it. Everything else is just instructions plus a careful, thoughtful hand. —Rt. Rev. Theodore Munk

      If you see a typewriter, you should take photos and upload the details to the TypewriterDatabase.com.

      Typewriters are not mood setting decor, they are meant to be used.

      Always leave a typewriter in better condition than you found it.

      We form things; we do not "bend" them.

      The only acceptable way to dispose of a typewriter is to find it a new home. The only exception is in dire circumstances in time of war when one should follow the guidance of the Underwood manual and "Smash typewriters and components with a sledge or other heavy instrument; burn with kerosene, gasoline, fuel oil, flame thrower, or incendiary bomb; detonate with firearms, grenades, TNT, or other explosives."

      If anyone asks you about your typewriter, you must spend at least five minutes talking to them about it.

      Legitimate typewriter sellers never use the phrases "it works" or "it just needs a new ribbon."

      Remember that typewriters are dangerous and can be used for samizdat. As Woody Guthrie wrote: "This machine kills fascists."

      Blessed are those who give typewriters to children for theirs is the Kingdom of Heaven.

      "In death, they have a name." Lenore Fenton. Lenore Fenton. Lenore Fenton!

      The Typewriter Database does not list every single serial number, just ranges of numbers and years in which they were made. You are responsible for figuring out which year your number fits into.

      "Working but needs new ribbon" is seller's code for "I have no idea if it really works, but I'm going to try to sell you this machine for the price of a fully functioning machine that was just serviced by a professional shop despite the fact that I just took it out of grandpa's barn and I'm not sure if the mouse inside is dead or not. Also, I can't afford $10 to replace an old ribbon to truly participate in the charade of the price I'm going to try to fleece you with."

    1. Reviewer #1 (Public review):

      Summary:

      The authors test the hypothesis, using an effort-exertion task and an effort-based decision-making task while recording brain dynamics with EEG, that the brain processes reward outcomes for effort differentially when they are earned for oneself versus for others.

      Strengths:

      The strengths of this experiment include what appears to be a novel finding of opposite signed effects of effort on the processing of reward outcomes when the recipient is self versus others. Also, the experiment is well-designed, the study seems sufficiently powered, and the data and code are publicly available.

      Weaknesses:

      There is some concern about the fact that participants report feeling less subjective effort, but also more disliking of tasks, when they were earning rewards for others versus self. The concern is that participants worked with less vigor during other-benefiting trials than during self-benefiting trials, and this may partly account for a key two-way Recipient × Effort interaction on the size of the Reward Positivity EEG component. Of note, participants took longer to complete tasks when working for others. While it is true that, in all cases, participants met the requisite task demands (they pressed the required number of buttons), they did so more sluggishly when earning rewards for others. The authors argue that this reflects less motivation when working for others, which is a plausible explanation. The authors also try to rule out this diminished vigor as a confounding explanation by showing that the two-way interaction remains even when including reaction times (and also self-reported task liking) as covariates. Nevertheless, it is possible that the covariates do not fully account for the effects of differential motivation levels, which would otherwise explain the two-way interaction. As such, I think a caveat is warranted regarding this particular result.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The authors test the hypothesis, using an effort-exertion task and an effort-based decision-making task while recording brain dynamics with EEG, that the brain processes reward outcomes for effort differentially when they are earned for oneself versus for others.

      The strengths of this experiment include what appears to be a novel finding of opposite signed effects of effort on the processing of reward outcomes when the recipient is self versus others. Also, the experiment is well-designed, the study seems sufficiently powered, and the data and code are publicly available.

      We thank Reviewer #1 for the affirmative appraisal of our manuscript as well as the thoughtful and insightful comments, which have enabled us to significantly improve the manuscript.

      (1) Inferences rely heavily on the results of mixed effects models which may or may not be properly specified and are not supported by complementary analyses.

      We thank Reviewer #1 for raising this critical issue of model specification. We have re-fitted our mixed-effects models and performed complementary analyses to validate the robustness of our findings. Specifically, we adopted the maximal converging random-effects structure (including random slopes for Recipient, Effort, and Magnitude where feasible) while ensuring model stability (see Responses to Reviewer #1’s Recommendations point 2). Crucially, our primary findings, including the Recipient × Effort and Recipient × Effort × Magnitude interactions, remained robust. Furthermore, additional analyses confirmed that these results were not confounded by factors such as response speed and subjective effort rating (see Responses to Reviewer #1’s Recommendations point 5).

      (2) Also, not all results hang together in a sensible way. For example, participants report feeling less subjective effort, but also more disliking of tasks when they were earning rewards for others versus self. Given that participants took longer to complete tasks when earning rewards for others, it is conceivable that participants might have been working less hard for others versus themselves, and this may complicate the interpretation of results.

      We thank Reviewer #1 for this insightful point (which also relates to Reviewer #3’s point 5). In our study, participants were asked to rate three specific dimensions: Effort (“How much effort did you exert to complete each effort condition when earning rewards for yourself [or the other person]?”), Difficulty (“How much difficulty did you perceive in each effort condition when earning rewards for yourself [or the other person]?”), and Liking (“How much did you like each effort condition when earning rewards for yourself [or the other person]?”).

      We acknowledge Reviewer #1’s concern that the lower subjective effort ratings for others seem contradictory to the higher disliking and longer completion times. We propose that in this paradigm, subjective effort ratings are susceptible to demand characteristics and likely captured motivational engagement (e.g., “how hard I tried” or “how willing I was”) rather than perceived task demands. To disentangle these factors, we included a measure of perceived task difficulty, which is anchored in task properties and is less prone to social desirability biases (Harmon-Jones et al., 2020; Wright et al., 1990). We found no differences in perceived difficulty between self- and other-benefiting trials (Figure 2D), suggesting that the task demands were perceived as equivalent across conditions. To examine this interpretation more directly, we analyzed correlations among participants’ ratings of difficulty, effort, and liking. As illustrated in Figure S1, we found no correlation between difficulty and effort ratings. Crucially, liking ratings were negatively correlated with difficulty ratings.

      More importantly, our performance data contradict the interpretation that participants “worked less hard” for others in terms of task completion. While participants took longer to complete tasks for others, they maintained comparable, near-ceiling success rates for self (97%) and other (96%) recipients (b = -0.46, p = 0.632; Supplementary Table S1). This dissociation suggests that although participants were less motivated (e.g., lower subjective ratings, longer completion times, and greater disliking) to work for others, they ultimately exerted the necessary physical effort to achieve successful outcomes. Thus, the results consistently point to a decrease in prosocial motivation (consistent with prosocial apathy) rather than a failure of effort exertion.

      Wright, R. A., Shaw, L. L., & Jones, C. R. (1990). Task demand and cardiovascular response magnitude: Further evidence of the mediating role of success importance. Journal of Personality and Social Psychology, 59(6), 1250-1260. https://doi.org/10.1037/0022-3514.59.6.1250

      Harmon-Jones, E., Willoughby, C., Paul, K., & Harmon-Jones, C. (2020). The effect of perceived effort and perceived control on reward valuation: Using the reward positivity to test a dissonance theory prediction. Biological Psychology, 107910. https://doi.org/10.1016/j.biopsycho.2020.107910

      Reviewer #2 (Public review):

      Measurements of the reward positivity, an electrophysiological component elicited during reward evaluation, have previously been used to understand how self-benefitting effort expenditure influences the processing of rewards. The present study is the first to complement those measurements with electrophysiological reward after-effects of effort expenditure during prosocial acts. The results provide solid evidence that effort adds reward value when the recipient of the reward is the self but discounts reward value when the beneficiary is another individual.

      An important strength of the study is that the amount of effort, the prospective reward, the recipient of the reward, and whether the reward was actually gained or not were parametrically and orthogonally varied. In addition, the researchers examined whether the pattern of results generalized to decisions about future efforts. The sample size (N=40) and mixed-effects regression models are also appropriate for addressing the key research questions. Those conclusions are plausible and adequately supported by statistical analyses.

      We appreciate Reviewer #2’s positive appraisal of our manuscript. We are fortunate to receive your thoughtful and insightful suggestions and have revised the manuscript accordingly.

      (1) Although the obtained results are highly plausible, I am concerned whether the reward positivity (RewP) and P3 were adequately measured. The RewP and P3 were defined as the average voltage values in the time intervals 300-400 ms and 300-440 ms after feedback onset, respectively. So they largely overlapped in time. Although the RewP measure was based on frontocentral electrodes (FC3, FCz, and FC4) and the P3 on posterior electrodes (P3, Pz, and P4), the scalp topographies in Figure 3 show that the RewP effects were larger at the posterior electrodes used for the P3 than at frontocentral electrodes. So there is a concern that the RewP and P3 were not independently measured. This type of problem can often be resolved using a spatiotemporal principal component analysis. My faith in the conclusions drawn would be further strengthened if the researchers extracted separate principal components for the RewP and P3 and performed their statistical analyses on the corresponding factor scores.

      We thank Reviewer #2 for raising this issue. We would like to clarify that these two components were time-locked to different types of feedback and therefore reflect neural responses to distinct stages of the prosocial effort task. Specifically, the P3 was time-locked to performance feedback (the effort-completion cue; e.g., the tick shown in Figure 1B), whereas the RewP was time-locked to reward feedback (e.g., the display of “+0.6”). Thus, despite the numerical similarity in the post-stimulus windows, the components capture neural activity evoked by independent events separated in time, corresponding to the performance monitoring versus reward evaluation stages of the task. To avoid misunderstanding, we have made this distinction more explicit in the revised manuscript, which now reads, “Single-trial RewP amplitude was measured as mean voltage from 300 to 400 ms relative to reward feedback onset (i.e., reward delivery) over frontocentral channels (FC3, FCz, FC4). We also measured the parietal P3 (300–440 ms; averaged across P3, Pz, and P4) in response to performance feedback (i.e., effort completion), given its relationship with motivational salience (Bowyer et al., 2021; Ma et al., 2014)” (page 27, para. 1, lines 2–6).
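For illustration, the two measurements described above can be sketched as window-and-cluster averages computed on epochs time-locked to different events. This is a minimal sketch, not the authors' pipeline: the `mean_amplitude` helper, the epoch layout, the 100 ms sampling grid, and all voltage values are hypothetical; only the time windows and channel clusters are taken from the text.

```python
from statistics import mean

def mean_amplitude(epoch, times_ms, channels, window_ms):
    """Mean voltage over the given channels and an inclusive time window."""
    start, end = window_ms
    idx = [i for i, t in enumerate(times_ms) if start <= t <= end]
    return mean(epoch[ch][i] for ch in channels for i in idx)

# Hypothetical sampling grid (ms relative to the time-locking event).
times_ms = [0, 100, 200, 300, 400, 500]

# Hypothetical epoch time-locked to REWARD feedback (values in microvolts).
reward_epoch = {
    "FC3": [0.0, 1.0, 2.0, 4.0, 6.0, 3.0],
    "FCz": [0.0, 1.0, 3.0, 5.0, 7.0, 3.0],
    "FC4": [0.0, 1.0, 2.0, 3.0, 5.0, 2.0],
}

# Hypothetical epoch time-locked to PERFORMANCE feedback (a different event).
performance_epoch = {
    "P3": [0.0, 0.5, 1.0, 2.0, 4.0, 2.0],
    "Pz": [0.0, 0.5, 1.5, 3.0, 5.0, 2.5],
    "P4": [0.0, 0.5, 1.0, 1.0, 3.0, 1.5],
}

# RewP: 300-400 ms after reward feedback, frontocentral cluster.
rewp = mean_amplitude(reward_epoch, times_ms, ["FC3", "FCz", "FC4"], (300, 400))

# P3: 300-440 ms after performance feedback, parietal cluster.
p3 = mean_amplitude(performance_epoch, times_ms, ["P3", "Pz", "P4"], (300, 440))
```

Because the two calls read from different epochs, the overlapping post-stimulus windows do not imply that the same neural activity is measured twice.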

      Reviewer #3 (Public review):

      This study investigates how effort influences reward evaluation during prosocial behaviour using EEG and experimental tasks manipulating effort and rewards for self and others. Results reveal a dissociable effect: for self-benefitting effort, rewards are evaluated more positively as effort increases, while for other-benefitting effort, rewards are evaluated less positively with higher effort. This dissociation, driven by reward system activation and independent of performance, provides new insights into the neural mechanisms of effort and reward in prosocial contexts.

      This work makes a valuable contribution to the prosocial behaviour literature by addressing areas that previous research has largely overlooked. It highlights the paradoxical effect of effort on reward evaluation and opens new avenues for investigating the mechanisms underlying this phenomenon. The study employs well-established tasks with robust replication in the literature and innovatively incorporates ERPs to examine effort-based prosocial decision-making - an area insufficiently explored in prior work. Moreover, the analyses are rigorous and grounded in established methodologies, further enhancing the study's credibility. These elements collectively underscore the study's significance in advancing our understanding of effort-based decision-making.

      We thank Reviewer #3 for the positive assessment. We are particularly encouraged by the reviewer’s recognition of our novel integration of ERPs to uncover the distinct effects of effort on reward evaluation for self versus others. We have carefully addressed the specific recommendations raised in the subsequent comments to further strengthen the rigor and clarity of the manuscript.

      (1) Incomplete EEG Reporting: The methods indicate that EEG activity was recorded for both tasks; however, the manuscript reports EEG results only for the first task, omitting the decision-making task. If the authors claim a paradoxical effect of effort on self versus other rewards, as revealed by the RewP component, this should also be confirmed with results from the decision-making task. Omitting these findings weakens the overall argument.

      We thank Reviewer #3 for giving us the opportunity to clarify the specific roles of our two tasks. The primary aim of our study is to elucidate the neural after-effects of effort exertion on subsequent reward evaluation during prosocial acts. The prosocial effort task was specifically designed for this purpose, as it involves actual effort expenditure followed by reward outcomes. Furthermore, this task uses preset effort-reward combinations, ensuring balanced trial counts and adequate signal-to-noise ratios across conditions, a critical requirement for robust ERP analysis. In contrast, the prosocial decision-making task was included specifically to quantify behavioral preference (i.e., prosocial effort discounting) rather than neural reward processing. Specifically, this task involves choices without immediate effort execution and reward feedback, making it impossible to examine the neural after-effects of effort exertion. However, the decision-making task remains indispensable for our study structure: it provides an independent behavioral index of prosocial apathy, which allowed us to link individual differences in behavioral motivation to the neural dissociations observed in the prosocial effort task (as detailed in our Responses to Reviewer #3’s point 2). Thus, the two tasks provide complementary, rather than redundant, insights into the behavioral and neural mechanisms of prosocial effort.

      (2) Neural and Behavioural Integration: The neural results should be contrasted with behavioural data both within and between tasks. Specifically, the manuscript could examine whether neural responses predict performance within each task and whether neural and behavioural signals correlate across tasks. This integration would provide a more comprehensive understanding of the mechanisms at play.

      We thank Reviewer #3 for this insightful and helpful suggestion. We agree that linking neural signatures with behavioral patterns is crucial for establishing the functional significance of our ERP findings. Regarding within-task associations, it is important to note that the prosocial effort task was designed to require participants to exert fixed, preset levels of physical effort to earn uncertain rewards. This experimental control was necessary to standardize effort exertion across self-benefiting and other-benefiting trials, thereby minimizing confounds such as differences in physical or perceived effort prior to the feedback phase. Indeed, the neural after-effects remained after controlling for these behavioral measures (i.e., response speed and self-reported effort; as detailed in our responses to Reviewer #1’s Recommendations point 5). Furthermore, unlike the prosocial effort task, the decision-making task inherently precludes the examination of the neural after-effects of effort; therefore, a within-task association in this task was not possible.

      Given these considerations, we focused on the cross-task association. We examined whether the neural after-effects of effort (indexed by the RewP) in the prosocial effort task were modulated by individual differences in effort discounting. We used the K value estimated from the prosocial decision-making task as the index of effort discounting. We entered the K value (log-transformed and z-scored) as a continuous predictor into the mixed-effects models of RewP amplitudes. The full regression estimates for the model are presented in Table S1 (left).
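The preprocessing of the K value mentioned above (log transform, then z-score) can be sketched as follows. This is a minimal sketch under stated assumptions: the `log_zscore` helper and the K values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
from statistics import mean, stdev

def log_zscore(values):
    """Log-transform positive values, then standardise to mean 0 and SD 1."""
    logs = [math.log(v) for v in values]
    m = mean(logs)
    s = stdev(logs)  # sample SD (assumption; not stated in the text)
    return [(x - m) / s for x in logs]

# Hypothetical discounting parameters for three participants.
k_values = [0.01, 0.1, 1.0]
k_z = log_zscore(k_values)
```

The z-scored result does not depend on the base of the logarithm, because changing the base rescales every log value by the same constant, which standardisation removes.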

      We observed a significant four-way interaction among recipient, effort, magnitude, and K value (b = 0.58, p = 0.013). To decompose this complex interaction, we performed simple slopes analyses separately for self- and other-benefiting trials at high and low levels of reward magnitude and discounting rate (±1 SD). As shown in Figure S2, for self-benefiting trials, the effort-enhancement effect on the RewP was significant only for participants with high discounting rates at low reward magnitude (b = 1.02, 95% CI = [0.22, 1.82], p = 0.012). In contrast, participants with low discounting rates exhibited no significant effort effect (b = -0.37, 95% CI = [-0.89, 0.15], p = 0.159). At high reward magnitude, simple slopes analyses detected no significant effort effects for either high (b = 0.35, 95% CI = [-0.44, 1.14], p = 0.383) or low (b = 0.45, 95% CI = [-0.07, 0.97], p = 0.093) discounting individuals. These findings strongly support the cognitive dissonance account (Aronson & Mills, 1959): those who find effort most aversive are most compelled to inflate the value of small rewards to justify their exertion. For these individuals, the completion of a costly action for a small reward may trigger a stronger internal justification effect, resulting in an amplified neural reward response.

      For other-benefiting trials, participants with low discounting rates exhibited a significant effort-discounting effect at high reward magnitude (b = -0.97, 95% CI = [-1.74, -0.20], p = 0.014). In contrast, no significant effort effects were observed for participants with high discounting rates at either high (b = -0.45, 95% CI = [-0.97, 0.08], p = 0.098) or low (b = -0.16, 95% CI = [-0.69, 0.38], p = 0.564) reward magnitudes, nor for participants with low discounting rates at low reward magnitude (b = 0.14, 95% CI = [-0.64, 0.92], p = 0.729). These results suggest that the justification mechanism observed for self-benefiting effort appears absent for other-benefiting effort. Instead, we observed a persistent effort discounting before, during, and after effort expenditure, which was most pronounced in individuals with low effort sensitivity (low K) when reward magnitude was high. This seemingly paradoxical pattern might be interpreted through the lens of disadvantageous inequity aversion (Fehr & Schmidt, 1999). Specifically, the combination of high personal effort and high monetary reward for another person creates a salient disparity between the participant’s incurred cost and the recipient’s gain. Although low-K individuals are behaviorally willing to tolerate this cost, their neural valuation system may nonetheless track the “unfairness” of this asymmetry, thereby attenuating the neural reward signal (Tricomi et al., 2010). These insights suggest that facilitating prosocial behavior may require not just lowering costs, but potentially framing outcomes to trigger the effort justification mechanisms that drive the effort paradox observed in self-benefiting acts (Inzlicht & Campbell, 2022).

      To confirm this four-way interaction, we also replaced the K value with the high-effort choice proportions from the decision-making task and observed a similar four-way interaction among recipient, effort, magnitude, and high-effort choice proportions (b = -0.58, p = 0.014; see Table S1 for detailed regression estimates). Together, this cross-task analysis not only provides a more comprehensive understanding of the mechanisms at play but also justifies the inclusion of the prosocial decision-making task. We sincerely thank Reviewer #3 for this valuable suggestion, which has significantly strengthened our manuscript. We have included this analysis (page 16, para. 2; page 17, paras. 1–2) and discussed the results (page 20, para. 2, lines 10–15; page 20, para. 3; page 21, para. 1, lines 1–8) in the revised manuscript.

      Aronson, E., & Mills, J. (1959). The effect of severity of initiation on liking for a group. The Journal of Abnormal and Social Psychology, 59(2), 177-181. https://doi.org/10.1037/h0047195

      Fehr, E., & Schmidt, K. M. (1999). A theory of fairness, competition, and cooperation. The Quarterly Journal of Economics, 114(3), 817-868. http://www.jstor.org/stable/2586885

      Tricomi, E., Rangel, A., Camerer, C. F., & O'Doherty, J. P. (2010). Neural evidence for inequality-averse social preferences. Nature, 463(7284), 1089-1091. https://doi.org/10.1038/nature08785

      (3) Success Rate and Model Structure: The manuscript does not clearly report the success rate in the prosocial effort task. If success rates are low, risk aversion could confound the results. Additionally, it is unclear whether the models accounted for successful versus unsuccessful trials or whether success was included as a covariate. If this information is present, it needs to be explicitly clarified. The exclusion criteria for unsuccessful trials in both tasks should also be detailed. Moreover, the decision to exclude electrodes as independent variables in the models warrants an explanation.

      We appreciate the opportunity to clarify these points. We have now explicitly reported the descriptive statistics and the results of a mixed-effects logistic model on response success in the revised manuscript (page 8, para. 1, lines 2–4; Supplementary Table S1). Participants achieved similarly high success rates in both self (M = 97%) and other trials (M = 96%; Figure S3). As shown in Table S2, success rates decreased as effort increased (b = -4.77, p < 0.001). However, no other effects reached significance (ps > 0.245). These near-ceiling success rates indicate strong task engagement and effectively rule out risk aversion as a potential confound.

      Regarding model structure, we excluded unsuccessful trials from statistical analyses because they were rare and distributed equally across conditions. Given the near-ceiling performance, we did not include success rate as a covariate, as it offers limited variance.

      Finally, we did not include electrodes as an independent variable because our hypotheses focused on condition effects rather than topographic differences. Following established research (e.g., Krigolson, 2018; Proudfit, 2015), we averaged RewP amplitudes across a frontocentral cluster (FC3, FCz, and FC4) and P3 amplitudes across a parietal cluster (P3, Pz, and P4), where activity is typically maximal. Averaging across these theoretically grounded clusters improves the signal-to-noise ratio and provides more reliable estimates of the underlying components. We have explicitly included this rationale in the revised manuscript, which reads, “Data were averaged across the selected electrode clusters to improve signal-to-noise ratio and reliability” (page 27, para. 1, lines 9–10).

      Proudfit, G. H. (2015). The reward positivity: From basic research on reward to a biomarker for depression. Psychophysiology, 52(4), 449-459. https://doi.org/10.1111/psyp.12370

      Krigolson, O. E. (2018). Event-related brain potentials and the study of reward processing: Methodological considerations. Int J Psychophysiol, 132(Pt B), 175-183. https://doi.org/10.1016/j.ijpsycho.2017.11.007

      (4) Prosocial Decision Computational Modelling: The prosocial decision task largely replicates prior behavioural findings but misses the opportunity to directly test the hypotheses derived from neural data in the prosocial effort task. If the authors propose a paradoxical effect of effort on self-rewards and an inverse effect for prosocial effort, this could be formalised in a computational model. A model comparison could evaluate the proposed mechanism against alternative theories, incorporating the complex interplay of effort and reward for self and others. Furthermore, these parameters should be correlated with neural signals, adding a critical layer of evidence to the claims. As it is, the inclusion of the prosocial decision task seems irrelevant.

      We thank Reviewer #3 for this thoughtful suggestion regarding the value of computational modelling. We fully agree that formalizing mechanisms is crucial, but we would like to clarify why a computational model of decision-making cannot directly capture the paradoxical after-effects observed in our neural data. The paradoxical after-effect of effort exertion we report refers to experienced utility (i.e., how prior costs modulate the hedonic consumption of a reward), whereas the decision task measures decision utility (i.e., how prospective costs and benefits are integrated to guide choice). We included the prosocial decision task to establish a behavioral baseline and replicate the well-documented phenomenon of prosocial apathy. Consistent with prior work (e.g., Lockwood et al., 2017; Lockwood et al., 2022), our data show that at the decision stage (ex-ante), effort functions as a universal cost: participants discounted rewards for both self and others, differing only quantitatively (steeper discounting for others). It is only after effort is exerted (ex-post) that the pattern reverses: effort is valued for self but remains costly for others, representing a qualitative shift. Crucially, incorporating a "paradoxical valuation" parameter (i.e., effort as a reward) into our decision model would mathematically contradict the behavioral reality. Since participants actively avoided high-effort options, a model assuming effort adds value might fail to fit the choice data. The theoretical novelty of our study lies precisely in this temporal dissociation: whereas self-benefiting effort paradoxically enhances reward valuation, other-benefiting effort induces a persistent reward devaluation.
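The argument that a value-adding effort parameter would conflict with the choice data can be illustrated with a toy decision model. This is a minimal sketch under stated assumptions: the parabolic form `reward - k * effort**2` and the softmax choice rule are common in the effort-discounting literature but are assumed here, since the response does not state which form was fitted; the option values, `beta`, and both `k` settings are hypothetical.

```python
import math

def subjective_value(reward, effort, k):
    """Assumed parabolic form: value falls with squared effort when k > 0."""
    return reward - k * effort ** 2

def p_choose_offer(sv_offer, sv_baseline, beta):
    """Softmax (logistic) probability of choosing the high-effort offer."""
    return 1.0 / (1.0 + math.exp(-beta * (sv_offer - sv_baseline)))

# Hypothetical options: a high-effort, high-reward offer and a low-effort baseline.
offer = {"reward": 1.0, "effort": 0.9}
baseline = {"reward": 0.2, "effort": 0.1}
beta = 5.0  # hypothetical choice-consistency parameter

def p_offer(k):
    """Probability of choosing the offer for a given discounting parameter k."""
    return p_choose_offer(
        subjective_value(offer["reward"], offer["effort"], k),
        subjective_value(baseline["reward"], baseline["effort"], k),
        beta,
    )

p_effort_as_cost = p_offer(2.0)    # k > 0: effort discounts value
p_effort_as_bonus = p_offer(-2.0)  # k < 0: effort adds value
```

With effort treated as a bonus, the toy model predicts a preference for the high-effort offer, which is the opposite of the avoidance of high-effort options described in the response.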

      To address the reviewer’s interest in bridging these two domains, we examined whether these distinct stages are linked at the level of individual differences. We hypothesized that an individual’s sensitivity to prospective effort cost (discounting rate K) might modulate their susceptibility to the retrospective neural after-effect. As detailed in our Responses to Reviewer #3’s point 2, we found that for self-benefiting trials, high-discounting individuals showed an effort-enhancement effect on the RewP at low reward magnitude, while for other-benefiting trials, low-discounting individuals exhibited effort-discounting effects at high reward magnitude. We sincerely thank Reviewer #3 for this valuable suggestion, which has allowed us to link the two tasks and has advanced our understanding of the mechanisms at play.

      Lockwood, P. L., Hamonet, M., Zhang, S. H., Ratnavel, A., Salmony, F. U., Husain, M., & Apps, M. A. J. (2017). Prosocial apathy for helping others when effort is required. Nat Hum Behav, 1(7), 0131. https://doi.org/10.1038/s41562-017-0131

      Lockwood, P. L., Wittmann, M. K., Nili, H., Matsumoto-Ryan, M., Abdurahman, A., Cutler, J., Husain, M., & Apps, M. A. J. (2022). Distinct neural representations for prosocial and self-benefiting effort. Curr Biol, 32(19), 4172-4185 e4177. https://doi.org/10.1016/j.cub.2022.08.010

      (5) Contradiction Between Effort Perception and Neural Results: Participants reported effort as less effortful in the prosocial condition compared to the self condition, which seems contradictory to the neural findings and the authors' interpretation. If effort has a discounting effect on rewards for others, one might expect it to feel more effortful. How do the authors reconcile these results? Additionally, the relationship between behavioural data and neural responses should be examined to clarify these inconsistencies.

      This point aligns with the issues raised in Reviewer #1’s point 2. We acknowledge the apparent discrepancy between lower reported effort in the prosocial condition and the neural discounting effect. As detailed in our Responses to Reviewer #1’s point 2, we reconcile this by proposing that subjective effort ratings in this paradigm likely reflect motivational engagement (e.g., “how hard I tried” or “how willing I was”) rather than perceived task demands. Under this interpretation, the lower effort ratings for others reflect a withdrawal of engagement (consistent with prosocial apathy), which conceptually aligns with, rather than contradicts, the neural discounting effect. To validate this, we contrasted effort ratings with difficulty ratings (a more reliable index of objective demand). Our correlational analysis revealed no significant relationship between difficulty and effort ratings (r = -0.21, p = 0.196), suggesting that they capture distinct constructs. Furthermore, liking ratings were negatively correlated with difficulty ratings (r = -0.43, p = 0.011) but not with effort ratings (r = 0.32, p = 0.061), further dissociating the two measures. Crucially, as detailed in our Responses to Reviewer #1’s Recommendations point 5, our RewP effects remained significant even after controlling for individual effort ratings. This demonstrates that the neural effort-discounting effect for others is a physiological signature that operates independently of the subjective report bias.
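The correlational analysis described above relies on pairwise Pearson correlations between rating dimensions. The following is a minimal sketch of that computation: the `pearson_r` and `t_statistic` helpers are illustrative, and the ratings are hypothetical values for five participants, not the study data.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical per-participant ratings (not the study data).
difficulty = [1, 2, 3, 4, 5]
liking = [5, 4, 4, 2, 1]

r_difficulty_liking = pearson_r(difficulty, liking)
t_difficulty_liking = t_statistic(r_difficulty_liking, len(difficulty))
```

A negative `r_difficulty_liking` corresponds to the pattern reported for difficulty and liking: tasks perceived as harder are liked less.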

      (6) Necessary Revisions to Manuscript: If the authors address the issues above, corresponding updates to the introduction and discussion sections could strengthen the narrative and align the manuscript with the additional analyses.

      We thank Reviewer #3 for the above insightful and helpful comments. We have carefully addressed the issues raised above and have updated the manuscript accordingly, including the abstract, introduction, results, and discussion sections.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      Major comments:

      (1) The two biggest concerns I have are

      - Whether the mixed-effect models are properly specified, and

      - Whether the main interaction between the Recipient and effort on the reward positivity (RewP) reflects different levels of effort exertion when working for self versus others.

      We thank Reviewer #1 for identifying these two critical issues. We have carefully considered these points and conducted additional analyses to address them. Below, we provide a detailed response to each concern, explaining how we have improved the model specification and ruled out alternative interpretations regarding effort exertion.

      (2) On the first point, I noticed that the authors selectively excluded random effects for Effort and Magnitude when regressing RewP on Effort, Magnitude, Recipient, and Valence. This is important because the key result in the paper is a fixed effect two-way interaction between Recipient and Effort and a three-way interaction between Recipient, Effort, and Magnitude. It is not clear that these results will remain significant when Effort and Magnitude are included as random effects in the model. Thus the authors should justify their exclusion as random effects, and/or show that the results don't depend on including those random effects in the model. The same logic applies to the specification of other mixed effects models (e.g. the effect of Magnitude in the model predicting RTs).

      We thank Reviewer #1 for raising this important methodological point. We fully agree that including random slopes wherever possible reduces Type I error rates and yields more conservative tests of fixed effects. In our analyses, we determined the random effects structure for each model using singular value decomposition (SVD). Specifically, we began with a maximal model that included by-participant random slopes for all main effects and interactions as well as a participant-level random intercept. When the model failed to converge or yielded a singular fit, we applied SVD to identify redundant dimensions (i.e., components explaining zero variance) and iteratively removed these terms until convergence was achieved. This procedure allowed us to retain the maximal converging random-effects structure while ensuring model stability. We have clarified this procedure in the revised manuscript as follows, “For each model, we fitted the maximal random-effects structure and, when the model was overparameterized, used singular value decomposition to simplify the random-effects structure until the model converged” (page 28, para. 1, lines 5–8).

      Regarding the RewP model, including all variables (i.e., Recipient, Effort, Magnitude, and Valence) in the random-effects structure resulted in a boundary (singular) fit. Examination of the variance-covariance structure of the random effects revealed that the random slopes for Valence and Magnitude were perfectly negatively correlated (r = -1.00), indicating severe overparameterization. In our original submission, we removed the random slopes for Effort and Magnitude because the SVD analysis indicated redundant dimensions in the model structure.
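Why a perfect correlation between two random slopes signals a redundant dimension can be shown with a two-slope toy case. This is a minimal sketch, not the authors' procedure: the `cov_eigenvalues` helper and the standard deviations are hypothetical, and the closed-form eigenvalues of a 2x2 symmetric matrix stand in for a full decomposition (for a symmetric positive semidefinite covariance matrix, the singular values equal the eigenvalues).

```python
import math

def cov_eigenvalues(sd_a, sd_b, corr):
    """Eigenvalues of the 2x2 covariance matrix of two random slopes."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    cov_ab = corr * sd_a * sd_b
    centre = (var_a + var_b) / 2
    radius = math.sqrt(((var_a - var_b) / 2) ** 2 + cov_ab ** 2)
    return centre + radius, centre - radius

# Hypothetical random-slope SDs for two predictors.
sd_valence, sd_magnitude = 1.0, 0.5

# Perfectly negatively correlated slopes: one dimension explains zero variance.
large_singular, small_singular = cov_eigenvalues(sd_valence, sd_magnitude, -1.0)

# Moderately correlated slopes: both dimensions carry variance.
large_ok, small_ok = cov_eigenvalues(sd_valence, sd_magnitude, -0.3)
```

When `small_singular` is zero, the two slopes vary along a single line, so one of them can be dropped without losing any estimated variance.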

      However, we agree with the Reviewer that retaining slopes for variables involved in key interactions is crucial. Therefore, we re-evaluated the model strategy: instead of removing Effort and Magnitude, we removed the random slope for Valence (which was the primary source of the perfect correlation). This modification successfully resolved the singularity while allowing us to retain the random slopes for the critical variables (i.e., Effort and Magnitude).

      Critically, this updated model yielded the same pattern of results as our original submission: the two-way interaction between Recipient and Effort and the three-way interaction between Recipient, Effort, and Magnitude remained significant (see Table S3). As expected, including the random slopes for Effort and Magnitude yielded a more conservative test of the fixed effects. While the critical three-way interaction remained significant (p = 0.019), the simple slope for the Self condition at high reward magnitude shifted slightly from significant (p = 0.041) to marginally significant (p = 0.056). However, the effect size remained largely unchanged (b = 0.42 vs. original b = 0.43), and the dissociation pattern, where self-benefiting trials show a positive trend while other-benefiting trials show a significant negative slope, remains robust and is statistically supported by the significant interaction. We have adopted this updated model in the revised manuscript and updated the relevant sections accordingly. Finally, note that we have removed the RewP table from the Supplementary Materials because the RewP model results are now presented as a figure in the main text (as suggested by Reviewer #1’s Recommendations point 3).

      We have also carefully verified the random effects structures for other mixed-effects models, including the RT and Performance-P3 models in the prosocial effort task, as well as the decision time and decision choice models in the prosocial decision-making task. The updated information is detailed as follows:

      Regarding the RT model, we replaced it with a more reasonable model of response speed (button presses per second), as suggested by Reviewer #1 (see our responses to Reviewer #1’s Recommendations point 4 for details).

      Regarding the performance-P3 model, the random-effects structure could only support Effort, as in our original submission; thus, the results remain unchanged.

      Regarding the decision time model, we have updated our results to include the quadratic effort term, as suggested by Reviewer #1 (see our responses to Reviewer #1’s Recommendations point 6 for details).

      Regarding the decision choice model, we included Recipient, Effort, and Magnitude in the random-effects structure. As shown in Table S4, the results remain largely consistent with the original model, except for a newly significant interaction between effort and magnitude. Follow-up simple slopes analyses revealed that the discounting effect of effort was more pronounced at low reward magnitude (M − 1SD: b = -2.69, 95% CI = [-3.09, -2.29], p < 0.001) than at high reward magnitude (M + 1SD: b = -2.38, 95% CI = [-2.82, -1.94], p < 0.001).
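The logic of a simple slopes analysis can be made concrete with the two slopes reported above. This is a minimal sketch under a stated assumption: the effort slope is taken to vary linearly with standardised reward magnitude, so that the slope at a moderator value `z` is `b_effort + b_interaction * z`. The helper names are hypothetical, and the recovered coefficients are implied by the two reported slopes under that assumption, not values reported by the authors.

```python
def simple_slope(b_effort, b_interaction, moderator_z):
    """Effort slope at a given standardised moderator value."""
    return b_effort + b_interaction * moderator_z

def recover_coefficients(slope_low, slope_high):
    """Invert the simple slopes at -1 SD and +1 SD of the moderator."""
    b_effort = (slope_low + slope_high) / 2
    b_interaction = (slope_high - slope_low) / 2
    return b_effort, b_interaction

# Effort slopes at low (M - 1SD) and high (M + 1SD) reward magnitude, from the text.
b_effort, b_interaction = recover_coefficients(-2.69, -2.38)

# Round trip: the recovered coefficients reproduce the low-magnitude slope.
slope_at_low_magnitude = simple_slope(b_effort, b_interaction, -1.0)
```

A positive `b_interaction` means the effort slope becomes less negative as reward magnitude increases, which matches the statement that discounting was more pronounced at low reward magnitude.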

      In summary, we have improved the model specification following Reviewer #1’s suggestion. Crucially, the results remain qualitatively consistent with our original findings. We have updated the Results section, figures (Figures 2, 4, and 5), and OSF documents (including a new R Markdown file and an HTML output file detailing the final results) to reflect these analyses. Additionally, we have explicitly stated the method used for calculating p-values in the mixed-effects models (page 28, para. 1, lines 8–10), which was omitted in the original submission.

      (3) Regarding the mixed models, it would also be good to show a graphical depiction summarizing key effects (e.g. the Recipient by Effort interaction on RewP) rather than just showing the predictions of the fitted mixed effects models.

      This point is well-taken. Please see Figure S4, which visualizes the key effects and has now been included in the revised manuscript as Figure 4A.

      (4) Finally, regarding the mixed effect models of RTs - given the common finding that RTs are not normally distributed, the Authors might be better off regressing 1/RT (interpreted as speed rather than latency) since 1/RT will often make distributions less asymmetric and heavy-tailed.

      We thank Reviewer #1 for this helpful suggestion regarding data distribution. In our original analysis, the dependent variable was “completion time” (i.e., the latency to complete the required button presses within the 6-s window). We agree that these raw latency data exhibited characteristic non-normality (see Figure S5, Left). Based on Reviewer #1’s suggestion, we adopted “response speed” (calculated as button presses per second) as the dependent variable. As expected, this transformation substantially improved the normality of the distribution (see Figure S5, Right). We have refitted the mixed-effects model using this speed metric. Critically, the results largely replicated the patterns observed in our original model, with the exception that the main effect of reward magnitude did not reach significance in the speed model (see Table 5). Given the superior distributional properties of the speed metric, we have replaced the original latency analysis with the response speed model in the revised manuscript. We have updated the Results section (page 8, para. 1, lines 4–9) and Figures 2B–C accordingly.
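The conversion from completion time to response speed, and its effect on distribution shape, can be sketched as follows. This is a minimal sketch with hypothetical data: the `response_speed` and `skewness` helpers, the press count, and the completion times are illustrative, and the times were constructed so that the corresponding speeds are evenly spaced.

```python
from statistics import mean

def response_speed(n_presses, completion_time_s):
    """Button presses per second for one trial."""
    return n_presses / completion_time_s

def skewness(values):
    """Moment-based skewness (third central moment over SD cubed)."""
    m = mean(values)
    m2 = mean((v - m) ** 2 for v in values)
    m3 = mean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

# Hypothetical trials: 20 required presses, right-skewed completion times (s).
n_presses = 20
completion_times = [5.0, 4.0, 20 / 6, 20 / 7, 2.5]

speeds = [response_speed(n_presses, t) for t in completion_times]

skew_times = skewness(completion_times)   # positive: long right tail
skew_speeds = skewness(speeds)            # close to zero for this toy data
```

In this toy example the latencies have a long right tail while the speeds are symmetric, which is the kind of improvement the reciprocal transformation is meant to deliver.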

      (5) Regarding the level of effort exerted, there are two reasons to suspect that participants exerted less for others versus themselves. The first is that they were slower to complete the button pressing for others versus themselves. The second is that they reported paradoxically less subjective effort for others versus self (paradoxical because they also reported liking the task less for others versus self). The explanation for both may be that they exerted less effort for others versus self and this has important implications for interpreting the main effects. If they exerted less effort for others, this may partly account for the key Recipient:Effort and Recipient:Effort:Magnitude interactions in the mixed effects regression of RewP. Do either median effort durations or self-reported effort predict the magnitude of the Recipient:Effort and Recipient:Effort:Magnitude interactions (if these were included as random effects)? If so, that would provide evidence supporting this story. Alternatively, if median durations or self-reported effort were included as covariates, do these interactions still obtain? In any case, the Authors should include caveats regarding this potential explanation of the self-versus-other interactions with effort and magnitude on the RewP (or explain why this cannot explain the interactions).

      We thank Reviewer #1 for raising this important interpretational issue. We acknowledge the concern that differences in physical exertion or perceived effort could potentially confound the neural findings. However, we argue that the observed RewP effects are not driven by these factors for several reasons.

      First, the prosocial effort task enforced fixed effort thresholds (10%–90% of their maximum effort level) across self-benefiting and other-benefiting trials. Importantly, participants achieved ceiling-level success rates that were highly comparable between self-benefiting (97%) and other-benefiting (96%) trials, indicating that they successfully exerted the required effort across conditions.

      Second, regarding the slower response speed for others (we used response speed instead of completion time, as the former is more suitable for statistical analysis; see details in Responses to Reviewer #1’s Recommendations point 4), we interpret this as a reduction in motivation rather than a reduction in the amount of effort exerted. Similarly, as detailed in our Responses to Reviewer #1’s point 2, subjective effort ratings in this paradigm appear to be influenced by demand characteristics and do not reliably track physical exertion. For instance, liking ratings were associated with difficulty (r = -0.43, p = 0.011) instead of effort (r = 0.32, p = 0.061) ratings.

      To empirically rule out the possibility that these behavioral differences account for the neural effect, we followed the reviewer’s suggestion and re-ran the mixed-effects model predicting RewP amplitudes with trial-by-trial response speed and subjective effort rating included as covariates. These control analyses revealed that neither response speed (b = -0.07, p = 0.614) nor self-reported effort (b = 0.10, p = 0.186) significantly predicted RewP amplitudes (see Table S6). Most importantly, the key interactions of interest (Recipient × Effort and Recipient × Effort × Magnitude) remained significant and virtually unchanged. These findings suggest that the observed neural after-effects of prosocial effort are not driven by variations in motor execution or perceived effort.

      Minor comments:

      (6) In Figure 5A a quadratic effect (not a linear effect) seems fairly obvious in decision times as a function of effort level. This makes sense given that participants are close to indifference, on average, around the 50-70% effort level. I recommend fitting a model that has a quadratic predictor and not just a linear predictor when regressing decision times on effort levels.

      We thank Reviewer #1 for this insightful suggestion. We agree that decision times likely track decision conflict, which typically peaks near indifference points (e.g., moderate effort levels). Accordingly, we reanalyzed the decision time data using a mixed-effects model that included both linear and quadratic terms for effort. As detailed in Table S7, this analysis revealed a significant quadratic main effect of effort, which was further qualified by a significant interaction between the quadratic effort term and reward magnitude. Decomposition of this interaction (Figure S6) revealed that the quadratic effort effect was more pronounced at low reward magnitude (M − 1SD: b = -160.10, 95% CI = [-218.30, -101.90], p < 0.001) than at high reward magnitude (M + 1SD: b = -99.50, 95% CI = [-157.60, -41.40], p = 0.001). However, we found no significant interactions involving the quadratic effort term and recipient. We have updated the Results section (page 13, para. 2; page 14, para. 1) and Figures 5A–B (right panel) to reflect these findings.
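
      To make the role of the quadratic term concrete, the sketch below fits decision time as a function of a linear and a quadratic effort predictor. It is a deliberately simplified stand-in: it uses ordinary least squares on made-up data rather than the mixed-effects model reported in Table S7, and the five effort levels, the decision times, and the function name `fit_quadratic` are assumptions for illustration only.

```python
def fit_quadratic(x, y):
    """Ordinary least squares for y = b0 + b1*x + b2*x**2 via normal equations."""
    rows = [[1.0, xi, xi * xi] for xi in x]
    # Augmented matrix [X'X | X'y] for the three coefficients.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


# Assumed effort levels (% of maximum), centred at 50 and scaled by 20.
effort_pct = [10, 30, 50, 70, 90]
effort = [(e - 50) / 20 for e in effort_pct]

# Made-up mean decision times (ms) that peak at moderate effort.
decision_time = [760.0, 1080.0, 1200.0, 1120.0, 840.0]

b0, b1, b2 = fit_quadratic(effort, decision_time)
print(f"intercept = {b0:.1f}, linear = {b1:.1f}, quadratic = {b2:.1f}")
```

      A negative quadratic coefficient corresponds to the inverted-U pattern in which decisions are slowest near the indifference point, which is the sign of the quadratic effort estimates reported in the response above.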

      (7) The distinction between the effort and decision-making tasks wasn't super clear from the main text. A sentence early on in the results section could be useful for readers' understanding.

      This point is well taken. In the revised manuscript, we have clarified this distinction at the beginning of the Results section (page 6, para. 2, lines 1–10). In addition, we have explicitly indicated the corresponding task within each subsection heading in the Results:

      “2.1 Investing effort for others is less motivating than for self in the prosocial effort task” (page 7)

      “2.2 Effort adds reward value for self but discounts reward value for others in the prosocial effort task” (page 9)

      “2.3 Reward is devalued by effort to a higher degree for others than for self in the prosocial decision-making task” (page 13)

      (8) To what does "three trials" refer on lines 143-144?

      Thank you for raising this point. Participants completed three trials in which they were asked to press a button as rapidly as possible with their non-dominant pinky finger for 6000 ms. The maximum effort level was operationalized as the average button-press count across the three trials. To improve clarity, we have also provided a more detailed description in the Results section, which reads: “The mean maximum effort level (i.e., the average button-press count across three 6000-ms trials; see Procedure for details) ….” (page 7, para. 1, lines 1–2).

      (9) It is unclear how the authors select their time windows for ERP analyses.

      We thank Reviewer #1 for this comment. Measurement parameters (i.e., time windows and channel sites) were determined based on the grand-averaged ERP waveforms and topographic maps collapsed across all conditions. This procedure is orthogonal to the conditions of interest and prevents bias in the selection of measurement windows and channels, consistent with the “orthogonal selection approach” (Luck & Gaspelin, 2017). We have clarified this point in the revised manuscript, which now reads, “Measurement parameters (time windows and channel sites) were determined from the grand-averaged ERP waveforms and topographic maps collapsed across all conditions, which was thus orthogonal to the conditions of interest (Luck & Gaspelin, 2017)” (page 27, para. 1, lines 6–9).

      Luck, S., & Gaspelin, N. (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn't). Psychophysiology, 54(1), 146-157.

      (10) There are a few typos throughout. For example, Line 124 should read "other half benefitted...", Line 127 should read "interest at each effort level...", "following" on Line 369, and Supplemental table titles incorrectly spell the word "Results".

      We thank Reviewer #1 for catching these errors. We have corrected all the specific typos noted (page 6, para. 2, lines 11 and 15; page 22, para. 3, line 2; Supplementary Table S2). Furthermore, we have conducted a thorough proofreading of the entire text and supplementary materials to ensure linguistic accuracy and consistency throughout the manuscript.

      Reviewer #2 (Recommendations for the authors):

      Minor comments:

      (1) Lines 84-86. "The RewP ... has its neural sources in the anterior cingulate cortex (Gehring & Willoughby, 2002) and ventral striatum (Foti et al., 2011)." This is a better reference for the ACC source: https://pubmed.ncbi.nlm.nih.gov/23973408/. And perhaps remove the reference to the ventral striatum; most people would agree that activity in the ventral striatum cannot be measured with scalp EEG.

      We thank Reviewer #2 for providing the updated reference, which has been cited in the revised manuscript. We agree that activity in the VS cannot be reliably measured with scalp EEG and thus have removed the reference to the VS. The revised sentence now reads, “… has its neural sources in the anterior cingulate cortex (Gehring & Willoughby, 2002; Hauser et al., 2014)” (page 4, para. 2, lines 12–13).

      (2) Lines 152-153. What exactly is shown in Figure 2A? How did the authors average across subjects?

      We thank Reviewer #2 for raising this issue. Figure 2A depicts the distribution of the maximum effort level, defined as the average button-press count across three 6000-ms trials completed before the prosocial effort task. In these trials, participants were instructed to press the button as rapidly as possible with their non-dominant pinky fingers. To improve clarity, we have revised the figure caption as: “(A) Distribution of the maximum effort level (i.e., the average button-press count across three 6000-ms trials) across participants” (Figure 2).

      (3) Lines 160-164. "As expected (Figure 2D), participants perceived increased effort as more difficult ... and more disliking (b = -0.62, p < 0.001) when the beneficiary was others than themselves." Does this sentence describe the main effect of the beneficiary or the interaction between beneficiary and effort level, as the start of the sentence ("increased effort") suggests?

      We thank Reviewer #2 for pointing out this ambiguity. The sentence describes the main effect of beneficiary rather than the interaction between beneficiary and effort level. In the revised manuscript, we have rephrased the sentence as: “They felt less effort (b = -0.32, p = 0.019) and more disliking (b = -0.62, p = 0.001) for other-benefiting trials compared to self-benefiting trials” (page 9, para. 1, lines 4–6).

      (4) Lines 195-196. "..., we conducted post-hoc simple slopes analyses at -1 SD ("Low") and + SD ("High") reward magnitude." I did not understand what the authors meant with these reward magnitudes, given that the actual potential rewards were ¥0.2, ¥0.4, ¥0.6, ¥0.8, and ¥1.0.

      In our analyses, the actual reward magnitudes (¥0.2, ¥0.4, ¥0.6, ¥0.8, and ¥1.0) were z-scored and entered as a continuous regressor in the mixed-effects models. Post-hoc simple slopes analyses were then conducted at ±1 SD from the mean of the z-scored reward magnitude. To clarify, we have revised the sentence as “… we conducted post-hoc simple slopes analyses at 1 standard deviation (SD) below (“Low”) and above (“High”) the mean reward magnitude” (page 11, para. 2, lines 8–9). This standard method for testing simple effects for continuous predictors is recommended by Aiken and West (1991).

      Aiken, L. S., West, S. G., & Reno, R. R. (1991). Multiple regression: Testing and interpreting interactions. Sage.
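
      The ±1 SD logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: it assumes an equal number of trials at each reward level, uses the population SD, and the coefficients passed to the hypothetical `simple_slope` helper are invented values rather than estimates from the paper.

```python
from statistics import mean, pstdev

# Reward magnitudes in yuan, as listed in the text.
magnitudes = [0.2, 0.4, 0.6, 0.8, 1.0]

# z-score the predictor (assumes equal trial counts per level).
mu, sd = mean(magnitudes), pstdev(magnitudes)
z = [(m - mu) / sd for m in magnitudes]

# "Low" and "High" are z = -1 and z = +1, expressed here in raw units.
low_raw, high_raw = mu - sd, mu + sd


def simple_slope(b_effort, b_interaction, z_value):
    """Slope of effort at a given z-scored reward magnitude (hypothetical helper)."""
    return b_effort + b_interaction * z_value


# Invented coefficients, for illustration only.
b_effort, b_interaction = 0.5, -0.2
slope_low = simple_slope(b_effort, b_interaction, -1.0)
slope_high = simple_slope(b_effort, b_interaction, 1.0)

print(f"Low  = {low_raw:.2f} yuan, effort slope = {slope_low:.2f}")
print(f"High = {high_raw:.2f} yuan, effort slope = {slope_high:.2f}")
```

      Under these assumptions "Low" and "High" fall between the actual reward levels rather than on them, which is why they do not correspond to any single reward value offered in the task.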

      (5) Lines 253 and 275. I would not call this a computational model. The authors fit a curve to data, there is no model of the computations involved.

      This point is well taken. We have replaced “computational model” with “discounting” (Figure 5) and “parabolic discounting model” (page 15, para. 1, line 15).

      (6) Line 710. Figure S1 does not show topographic maps of the P3, as the figure caption suggests.

      We thank Reviewer #2 for identifying this oversight. We have now included topographic maps of the P3 in Figure S1.

      (7) Please check language in lines 33 (effect between), 38 (shape), 49 (highest cost form?), 74 (tunning), 90 (omit following), 127 (interest on at each effort level), 135 (press buttons >> rapidly press a button?), 142 (motivated), 219 (should low be high?), 265-266 (missing word), 275 (confirmed by following), 292 (an action can be effortful, a feeling cannot), 315 (when it comes into), 330-331 (data is plural; the aftereffect of prosocial effect), 387 (interest on at each effort level), 405 (should quickly be often?).

      We thank Reviewer #2 for the careful review and feedback about these language issues. We have revised all the phrasing you identified. The corrections are as follows:

      Line 33: “effect between” has been changed to “effects for” (page 2, para. 1, line 6).

      Line 38: “shape” has been updated to “shapes” (page 2, para. 1, line 13).

      Line 49: “highest cost form?” has been revised to “the most common cost type” (page 3, para. 1, lines 7–8).

      Line 74: “tunning” has been corrected to “tuning” (page 4, para. 2, line 1).

      Line 90: omit following. Done (page 5, para. 1, line 2).

      Line 127: “interest on at each effort level” has been corrected to “liking for each effort level” (page 6, para. 2, line 15).

      Line 135: “press buttons” has been updated to “rapidly press a button” (the caption of Figure 1).

      Line 142: “motivated” has been revised to “motivating” (page 7).

      Line 219: should low be high? Yes, we have corrected this (the caption of Figure 4).

      Lines 265–266: The missing word “with” has been inserted (page 15, para. 1, line 2).

      Line 275: “confirmed by following” has been revised as “corroborated by a parabolic …” (page 15, para. 1, line 15).

      Line 292: an action can be effortful, a feeling cannot. We have changed the word “effortful” to “effort” (page 18, para. 2, line 3).

      Line 315: “when it comes into” has been revised to “when it came to” (page 19, para. 1, line 10).

      Lines 330–331: These two expressions have been revised to “our data establish …” and “the after-effect of prosocial effort” (page 20, para. 1, lines 2–3).

      Line 387: “interest on at each effort level” has been corrected to “interest at each effort level” (page 23, para. 2, line 5).

      Line 405: should quickly be often? We agree that “quickly” might imply latency or speed of a single press, whereas the task required maximizing the frequency of presses within the time window. To capture this meaning accurately, we have revised the phrase to “pressed a button as rapidly as possible” (implying repetition rate) in the revised manuscript (page 24, para. 2, lines 3–4).

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This manuscript addresses an important methodological issue - the fragility of meta-analytic findings - by extending fragility concepts beyond trial-level analysis. The proposed EOIMETA framework provides a generalizable and analytically tractable approach that complements existing methods such as the traditional Fragility Index and Atal et al.'s algorithm. The findings are significant in showing that even large meta-analyses can be highly fragile, with results overturned by very small numbers of event recodings or additions. The evidence is clearly presented, supported by applications to vitamin D supplementation trials, and contributes meaningfully to ongoing debates about the robustness of meta-analytic evidence. Overall, the strength of evidence is moderate to strong, though some clarifications would further enhance interpretability.

      Strengths:

      (1) The manuscript tackles a highly relevant methodological question on the robustness of meta-analytic evidence.

      (2) EOIMETA represents an innovative extension of fragility concepts from single trials to meta-analyses.

      (3) The applications are clearly presented and highlight the potential importance of fragility considerations for evidence synthesis.

      Weaknesses:

      (1) The rationale and mathematical details behind the proposed EOI and ROAR methods are insufficiently explained. Readers are asked to rely on external sources (Grimes, 2022; 2024b) without adequate exposition here. At a minimum, the definitions, intuition, and key formulas should be summarized in the manuscript to ensure comprehensibility.

      (2) EOIMETA is described as being applicable when heterogeneity is low, but guidance is missing on how to interpret results when heterogeneity is high (e.g., large I²). Clarification in the Results/Discussion is needed, and ideally, a simulation or illustrative example could be added.

      (3) The manuscript would benefit from side-by-side comparisons between the traditional FI at the trial level and EOIMETA at the meta-analytic level. This would contextualize the proposed approach and underscore the added value of EOIMETA.

      (4) Scope of FI: The statement that FI applies only to binary outcomes is inaccurate. While originally developed for dichotomous endpoints, extensions exist (e.g., Continuous Fragility Index, CFI). The manuscript should clarify that EOIMETA focuses on binary outcomes, but FI, as a concept, has been generalized.

      Reviewer #2 (Public review):

      Summary:

      The study expands existing analytical tools originally developed for randomized controlled trials with dichotomous outcomes to assess the potential impact of missing data, adapting them for meta-analytical contexts. These tools evaluate how missing data may influence meta-analyses where p-value distributions cluster around significance thresholds, often leading to conflicting meta-analyses addressing the same research question. The approach quantifies the number of recodings (adding events to the experimental group and/or removing events from the control group) required for a meta-analysis to lose or gain statistical significance. The author developed an R package to perform fragility and redaction analyses and to compare these methods with a previously established approach by Atal et al. (2019), also integrated into the package. Overall, the study provides valuable insights by applying existing analytical tools from randomized controlled trials to meta-analytical contexts.

      Strengths:

      The author's results support his claims. Analyzing the fragility of a given meta-analysis could be a valuable approach for identifying early signs of fragility within a specific topic or body of evidence. If fragility is detected alongside results that hover around the significance threshold, adjusting the significance cutoff as a function of sample size should be considered before making any binary decision regarding statistical significance for that body of evidence. Although the primary goal of meta-analysis is effect estimation, conclusions often still rely on threshold-based interpretations, which is understandable. In some of the examples presented by Atal et al. (2019), the event recoding required to shift a meta-analysis from significant to non-significant (or vice versa) produced only minimal changes in the effect size estimation. Therefore, in bodies of evidence where meta-analyses are fragile or where results cluster near the null, it may be appropriate to adjust the cutoff. Conducting such analyses (identifying fragility early and adapting thresholds accordingly) could help flag fragile bodies of evidence and prevent future conflicting meta-analyses on the same question, thereby reducing research waste and improving reproducibility.

      Weaknesses:

      It would be valuable to include additional bodies of conflicting literature in which meta-analyses have demonstrated fragility. This would allow for a more thorough assessment of the consistency of these analytical tools, their differences, and whether this particular body of literature favored one methodology over another. The method proposed by Atal et al. was applied to numerous meta-analyses and demonstrated consistent performance. I believe there is room for improvement, as both the EOI and ROAR appear to be very promising tools for identifying fragility in meta-analytical contexts.

      I believe the manuscript should be improved in terms of reporting, with clearer statements of the study's and methods' limitations, and by incorporating additional bodies of evidence to strengthen its claims.

      Reviewer #3 (Public review):

      Summary and strengths:

      In this manuscript, Grimes presents an extension of the Ellipse of Insignificance (EOI) and Region of Attainable Redaction (ROAR) metrics to the meta-analysis setting as metrics for fragility and robustness evaluation of meta-analysis. The author applies these metrics to three meta-analyses of Vitamin D and cancer mortality, finding substantial fragility in their conclusions. Overall, I think this extension/adaptation is a conceptually valuable addition to meta-analysis evaluation, and the manuscript is generally well-written.

      Specific comments:

      (1) The manuscript would benefit from a clearer explanation of in what sense EOIMETA is generalizable. The author mentions this several times, but without a clear explanation of what they mean here.

      (2) The authors mentioned the proposed tools assume low between-study heterogeneity. Could the author illustrate mathematically in the paper how the between-study heterogeneity would influence the proposed measures? Moreover, the between-study heterogeneity is high in Zhang et al's 2022 study. It would be a good place to comment on the influence of such high heterogeneity on the results, and specifying a practical heterogeneity cutoff would better guide future users.

      (3) I think clarifying the concepts of "small effect", "fragile result", and "unreliable result" would be helpful for preventing misinterpretation by future users. I am concerned that the audience may be confusing these concepts. A small effect may be related to a fragile meta-analysis result. A fragile meta-analysis doesn't necessarily mean wrong/untrustworthy results. A fragile but precise estimate can still reflect a true effect, but whether that size of true effect is clinically meaningful is another question. Clarifying the effect magnitude, fragility, and reliability in the discussion would be helpful.

      I am very appreciative of the insightful comments you all shared, and in light of them have made several clarifications and revisions. Thank you again, I am grateful to have received such considered feedback and I hope I’ve addressed any outstanding issues. I have replied to each reviewer’s recommendations in this document sequentially for ease of scanning, and am most grateful for the summary strengths and weaknesses, which I have also incorporated into these replies. Thank you again!

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The manuscript makes the important argument that many meta-analyses are inherently fragile, which aligns with prior work (e.g., PMID: 40999337). Please add the reference to the statements.

      Excellent point, thank you – I’ve expanded the discussion of fragility analysis, and its application to meta-analysis, including this reference.

      (2) The rationale and mathematical underpinnings of the proposed EOI and ROAR methods are not sufficiently explained. While the authors cite Grimes (2022, 2024b), readers are expected to rely heavily on these external sources without adequate exposition in the current paper. This limits the ability to fully evaluate the reasonableness of the methods or to reproduce the approach. I strongly recommend expanding the description of EOI and ROAR within the manuscript.

      I agree fully – I was a little remiss in this scope, as I was worried about overwhelming the reader. However, I was too sparse with detail and have now extended the text to describe the methods as intuitively as possible (see Discussion, subsection “Ellipse of Insignificance and Region of Attainable Redaction”).

      (3) In the Methods, the authors note that EOIMETA is applicable when between-study heterogeneity is low. However, the manuscript provides little guidance on how to interpret results when heterogeneity is high (e.g., larger I² values). I recommend clarifying this issue in the Results or Discussion sections, emphasizing the limitations of EOIMETA under high heterogeneity. Ideally, the authors could include either a small simulation study or an illustrative example to demonstrate the performance of the method in such settings.

      This is an excellent question, and I was remiss for not considering it better in the manuscript. Originally, the simple idea was to just pool the results for EOI, in which case heterogeneity would be an issue. However, I subsequently added inverse-variance weighting methods to account for situations with increased heterogeneity, so my initial comment was not strictly correct. I’ve changed the text in several places, notably in the methods and in the discussion (see reply point 5).

      (4) While EOIMETA is introduced as a generalizable fragility metric for meta-analyses, the illustrative examples would benefit from clearer comparisons with the traditional Fragility Index (FI). Because FI is well established in the RCT literature and familiar to many readers, presenting side-by-side results (e.g., FI at the trial level versus EOIMETA at the meta-analytic level) would provide important context. Such comparisons would also highlight the added value of EOIMETA, underscoring that even when individual trials appear robust under FI, the pooled meta-analysis may remain fragile.

      This is an excellent idea! The new table is given below. Note that traditional FI are not defined for non-significant results, and EOI is ambiguous for counts <2.
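
      For readers less familiar with the trial-level metric, the traditional Fragility Index can be sketched as follows. This is a generic textbook-style implementation, not code from the author's package: it uses a two-sided Fisher exact test at an assumed threshold of 0.05, the function names `fisher_exact_p` and `fragility_index` are hypothetical, and the example counts are made up.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def fragility_index(e1, n1, e2, n2, alpha=0.05):
    """Event-status flips needed to lose significance; None if not significant."""
    if fisher_exact_p(e1, n1 - e1, e2, n2 - e2) >= alpha:
        return None  # undefined for non-significant results
    flips = 0
    # Recode non-events to events in the arm with fewer events.
    while True:
        if e1 <= e2:
            e1 += 1
        else:
            e2 += 1
        flips += 1
        if fisher_exact_p(e1, n1 - e1, e2, n2 - e2) >= alpha:
            return flips
        if e1 >= n1 or e2 >= n2:
            return flips  # safety stop: an arm has no non-events left


# Made-up trial: 0/10 events versus 5/10 events.
fi = fragility_index(0, 10, 5, 10)
print("Fragility Index:", fi)
```

      Returning `None` for a non-significant table mirrors the note above that the traditional FI is not defined for non-significant results.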

      (5) The Discussion currently states that the Fragility Index (FI) applies only to binary outcomes. This is not entirely accurate. While the original FI was indeed developed for dichotomous endpoints, subsequent methodological work has extended the concept to other data types, including continuous outcomes (continuous fragility index, CFI). The manuscript should acknowledge this distinction: EOIMETA presently focuses on binary outcomes at the meta-analytic level, but FI more broadly is not restricted to binary data. Adding this clarification, with appropriate citations, would improve accuracy and place EOIMETA more clearly within the broader fragility literature.

      Thank you for this catch – clarified now in the discussion:

      Reviewer #2 (Recommendations for the authors):

      (1) Typos/inconsistencies/writing clarifications: All table and figure legends and titles are missing a period at the end of each sentence. In the sentence "to be estimated by bootstrap methods. Initially, we ran...", there should be a space between "methods" and "Initially" (line 113).

      Apologies, these are now remedied.

      (2) In Table 2, the total number of patients in the meta-analysis of all 12 studies is reported as 133,262, whereas the text states 133,475 patients. Based on my calculations from Figure 2, the total appears to be 133,262. Could you please clarify this discrepancy?

      Certainly – your calculations are correct. The number given in the text was a typo from a very early draft in which the summation function was not run correctly and double-counted some cases. This was fixed for the figure but not the text. The text should now match, thank you for spotting this. There are some issues with Figure 2, which I will address in the next few points.

      (3) Regarding this point, the meta-analysis by Zhang et al. (2019) shows some inconsistencies in the reported number of patients in the paper. According to the data provided on GitHub the total number of patients is 37671. However, Table 1 of the paper lists 38538 patients, and the main text states "5 RCTs involving 39168 patients." Similarly, for Guo et al. (2023), the main text reports that the meta-analysis included 11 RCTs with 112165 patients, whereas the table lists 111952, which appears consistent with the data available on GitHub. There is also a discrepancy in Zhang et al. (2022), which cites 61853 patients in the introduction but 61223 patients in Table 1. These inconsistencies should be clarified, as even small discrepancies in reported sample sizes can undermine the credibility of the analyses presented.

      Well-spotted – the incorrect figures are artefacts of an early draft with a double-counting summation function, and I should have spotted them and removed them prior to submission. To clarify, the correct figures from each study (which agree with the GitHub data) are given in the corrected Table 1.

      Thus, there are 38,538 subjects in the Zhang et al. (2019) analysis, which matches the first sheet of the GitHub listing. The confusion comes from sheet 2, which was included only with this dataset and which breaks the counts down into events / non-events (hence the total non-events being 37,671) but keeps the old labels. This is needlessly confusing, and accordingly I have re-uploaded the data with correct headers for sheet 2. This summation problem was also apparent in the total of Figure 2, which has now been replaced with a correct version. Thank you for spotting this!

      (4) In line 158, who does "He" refer to? Please clarify this in more detail.

      Apologies, this was a typo and should have read “the” – now corrected.

      (5) The discrepant results of the RCT by Scragg et al. (2018) between the meta-analysis by Zhang et al. and that by Guo et al. could be presented in a table. This could be included as supplementary material or, preferably, in the main text (Results section).

      To avoid confusion, I will add a version of this to the github files for interested users to explore.

      (6) In the legend of Figure 2, a period is missing at the end of the sentence. Additionally, although it is generally understood, it would be helpful to specify that the numbers in parentheses represent the confidence intervals. Please confirm whether these are 95%, 89%, or 99% confidence intervals.

      Apologies, these are 95% CIs. Clarified now in updated legends.

      (7) The statement of "The more recent and robust methods for fragility analysis (EOI) and redaction (ROAR) have potential applications beyond fragile-by-design RCTs, extending to cohort studies, preclinical work, and even ecological studies, as stated by the author" in line 163. Could you please provide references supporting these claims? I believe the relevant references may be included in the EOI paper, but it would be helpful to cite them here as well.

      This has recently been used in a new analysis, now cited in the introduction with a fuller description of the method for context. Please see the response to Reviewer #1, point 2.

      (8) Since the study was previously published as a preprint (https://www.medrxiv.org/content/10.1101/2025.08.15.25333793v1.full-text), this should be mentioned in the manuscript.

      Added as a note now.

      (9) It would also be valuable to include a figure illustrating ROAR for the same meta-analyses presented in Figure 1 for EOI, possibly as supplementary material.

      See reply to point 10.

      (10) Finally, it would be interesting to provide plots of both EOI and ROAR for the meta-analyses of all 12 included studies. These graphs could be replicated using the code examples provided by the author in the original EOI and ROAR publications.

      These have now been added to the github repository as supplementary material.

      (11a) Replications of EOI fragility: eoicfunc.R (github): - In the code provided on GitHub, an error occurred in the "EllipseFromEquation" function within eoifunc. This was due to the PlaneGeometry package not being available for the latest version of R. I attempted several installation methods (using devtools, remotes, and GitHub, as well as direct installation from a URL). However, after adjusting the code, I was able to run the analyses. For the full cohort, including all 12 studies using the EOI approach, I obtained a Minimal Experimental Arm only recoding (xi) = 14 and a Minimal Control Arm only recoding (yi) = 15, whereas the authors reported that 5 recodings were sufficient. It appears that differences in code versions or functions might have slightly affected the results. After downgrading R and running the eoic function with PlaneGeometry successfully installed, the fragility index for the EOI approach was 15 rather than 5.

      Apologies for the issue with PlaneGeometry; I will try to fix this for future iterations. The difference you see is an artefact of running EOIFUNC on pooled data rather than the dedicated EOIMETA function, the chief difference being that EOIFUNC doesn’t apply the WIV correction. If we simply pool events, this is the output:

      Author response image 1.

      If the reviewer uses the EOIMETA function, which employs inverse weighting, each trial is defined by a vector of events and non-events in each arm. For all 12 studies, this would be (in R code syntax, or imported from the GitHub file):

      Author response image 2.

      Then they will obtain:

      Author response image 3.

      If the reviewer runs a simple pooled analysis with the weighted inverse correction turned off, they should obtain a similar answer to a simple eoifunc call, apart from the difference in zero-count correction. EOIMETA, however, weights the sample, and its output is what is reported in the main paper.
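The distinction drawn here between simply pooling events and weighting by study can be illustrated with a minimal R sketch. It is not the EOIMETA implementation and it does not compute a fragility index: it assumes that "inverse weighting" refers to standard fixed-effect inverse-variance weighting of per-trial log risk ratios, and the trial counts and all variable names are hypothetical values chosen for illustration only.

```r
# Hypothetical 2x2 counts for three trials (illustrative only, not from the paper)
trials <- data.frame(
  e_events = c(12, 30, 5),    # events in the experimental arm
  e_total  = c(100, 400, 50), # participants in the experimental arm
  c_events = c(20, 36, 9),    # events in the control arm
  c_total  = c(100, 400, 50)  # participants in the control arm
)

# Simple pooling: sum the counts across trials, then compute a single risk ratio
pooled_rr <- with(trials,
  (sum(e_events) / sum(e_total)) / (sum(c_events) / sum(c_total)))

# Inverse-variance weighting (assumed): per-trial log risk ratio and its variance
log_rr     <- with(trials, log((e_events / e_total) / (c_events / c_total)))
var_log_rr <- with(trials, 1 / e_events - 1 / e_total + 1 / c_events - 1 / c_total)
w          <- 1 / var_log_rr

# Weighted average on the log scale, transformed back to a risk ratio
weighted_rr <- exp(sum(w * log_rr) / sum(w))

print(c(pooled = pooled_rr, weighted = weighted_rr))
```

The two estimates generally differ because pooling treats every participant as if drawn from one trial, whereas weighting lets each trial contribute in proportion to the precision of its own estimate.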

      (12) I recalculated the eoic function for Zhang et al. (2019) and found a fragility index (dmin) of 1. FECKUP Vector Length: 0.5722. Minimal Experimental Arm Recoding (xi): 0.7738. Minimal Control Arm Recoding (yi): 0.8499.

      This again appears to be an artefact of using eoifunc rather than eoimeta; with eoimeta, which uses WIV to adjust the studies for heterogeneity effects, this is the reported output:

      Author response image 4.

      (13) Using the previous code (before downgrading R and loading PlaneGeometry), I recalculated the EOI for Zhang et al. (2022) and found Minimal Experimental Arm only recoding (xi) = 55 and Minimal Control Arm only recoding (yi) = 59-results slightly closer to those reported by the authors. After properly loading PlaneGeometry, I recalculated and obtained for Zhang et al. (2022): Fragility index (dmin) = 57; FECKUP Vector Length = 39.948; Minimal Experimental Arm Recoding (xi) = 54.5436; Minimal Control Arm Recoding (yi) = 58.635.

      Again, this appears to come down to whether eoifunc or eoimeta is called. I can replicate this result using EOIFUNC:

      Author response image 5.

      But adjusting for study weighing with eoimeta:

      Author response image 6.

      (14) For Guo et al. (2022), the EOI fragility index was 17 [dmin = 17]. FECKUP Vector Length: 11.3721. Minimal Experimental Arm Recoding (xi): -15.6825. Minimal Control Arm Recoding (yi): -16.5167. However, the authors report an EOI fragility of 38. Since I was able to load PlaneGeometry properly and run eoicfunc.R (from GitHub) without errors, the discrepancies likely reflect minor coding or version inconsistencies rather than software limitations.

      These again stem from using eoifunc on simple pooled data versus eoimeta, which adjusts by study.

      (15) Replications of ROAR fragility: roarfunc.R (github): - For Guo et al. (2022), the ROAR fragility calculated using roarfunc.R was 16 [rmin (Redaction Fragility Index) = 16]. FOCK Vector Length: 15.942. Minimal Experimental Arm Redaction (xc): 15.9442. Minimal Control Arm Redaction (yc): 978.8906. In the main text, the author reports a redaction fragility of 37. What might explain these discrepancies?

      Again, this stems from EOIMETA versus EOIFUNC (and roarfunc calls without weighted adjustment). As the reviewer has observed, the fragility increases when there is no study-level adjustment, a point we have now added to the discussion text.

      (16) In generic_run.R, line 6 contains a bug - it is missing a forward slash (/) between the directory path and the filename. The correct line of code should be: pathload = paste0(pathname, "/", filename, exname). The same issue occurs in generalcode.R.

      Apologies, I will correct this in the upload!
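The path issue raised in point 16 can be reproduced in a few lines of R. This is a sketch only: the values assigned to `pathname`, `filename`, and `exname` are hypothetical placeholders, and the "original" line is inferred from the reviewer's description rather than copied from generic_run.R.

```r
# Hypothetical values; the real ones are defined in the author's scripts
pathname <- "data"
filename <- "trials"
exname   <- ".csv"

# Assumed original behaviour: directory and filename run together
broken <- paste0(pathname, filename, exname)

# Reviewer's suggested fix: insert the separator explicitly
fixed <- paste0(pathname, "/", filename, exname)

# Alternative with base R's file.path(), which adds the separator itself
portable <- file.path(pathname, paste0(filename, exname))

print(c(broken = broken, fixed = fixed, portable = portable))
```

Using `file.path()` avoids relying on a hand-typed separator, so the same slip is less likely to recur in generalcode.R or other scripts.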

      (17) Theoretical framework: Is there any other method available for comparison besides the one proposed by Atal et al.? Could you include a brief literature review describing alternative approaches?

      To my knowledge, there is not – Xing et al (now referenced) covered this earlier in the year, and I have included an expanded background for this purpose. Please see reply to reviewer 1, point 1.

      (18a) There appears to be no heterogeneity in the meta-analysis in terms of effect sizes and I², likely because most values are quite large, yet the included studies address very different populations (e.g., patients with COPD, NSCLC survivors, older adults, women, and GI cancer survivors). This could have been explained more clearly, including how such diverse literature might influence fragility indices or whether there is a logical rationale for combining these studies. Could you perform a sensitivity analysis or provide a conceptual explanation of how the heterogeneity - or lack thereof - across these trials may affect the fragility indices? Although I² values are small, the conceptual heterogeneity among studies suggests that the pooled results may be comparing fundamentally different clinical contexts, which requires clarification.

      I think this is a very pertinent point. I am unsure why these authors combined such diverse populations without any consideration of whether they were comparable, but this is a common problem in meta-analysis. I have added the following to the discussion to address this problem:

      “The use of vitamin D meta-analyses in this work was chosen as illustrative rather than specific, but it is worth noting that there are methodological concerns with much vitamin D research (Grimes et al., 2024). The three studies cited in this work report relatively low heterogeneity in their meta-analysis in both effect sizes and I<sup>2</sup> values, but it is worth noting that the included studies addressed very different populations, including patients with chronic obstructive pulmonary disease, non-small cell lung cancer survivors, women-only cohorts, older adults, and gastrointestinal cancer survivors. These groups have presumably different risk factors for cancer deaths, and why the authors of these studies combined the cohorts with fundamentally different clinical contexts is unclear. Why the heterogeneity appeared so relatively low in different groups is also a curious feature. This goes beyond the scope of the current work, but serves as an example of the reality that meta-analysis is only as strong as its underlying data and methodological rigor in comparing like-with-like, and the conclusions drawn from them must always be seen in context.”

      Reviewer #3 (Recommendations for the authors):

      (1) Line 156, acronym FI not defined.

      Apologies, this is now defined at the outset as “fragility index”.

      (2) Line 158, typo "He"?

      Apologies again, this was a typo and was supposed to read “the”, fixed now.

      (3) Across the manuscript, I think the "re-coding" phrasing may confuse clinical readers. Maybe rephrasing to "flipping event classification" or "flipping group" would be better.

      Excellent point – this has now been modified at the outset.

    1. eLife Assessment

      This structural biology study provides insights into the assembly of the GID/CTLH E3 ligase complex. The multi-subunit complex forms unique, ring-shaped assemblies and the findings presented here describe a "specificity code" that regulates formation of subunit interfaces. The data supporting the conclusions are convincing, both in thoroughness and rigor. This study will be valuable to biochemists, structural biologists, and could lay foundation for novel designed protein assemblies.

    2. Reviewer #2 (Public review):

      Summary:

      This is a very interesting study focusing on a remarkable oligomerization domain, the LisH-CTLH-CRA module. The module is found in a diverse set of proteins across evolution. The present manuscript focuses on the extraordinary elaboration of this domain in GID/CTLH RING E3 ubiquitin ligases, which assemble into a gigantic, highly ordered, oval-shaped megadalton complex with strict subunit specificity. The arrangement of LisH-CTLH-CRA modules from several distinct subunits is required to form the oval on the outside of the assembly, allowing functional entities to recruit and modify substrates in the center. Although previous structures had revealed that CTLH-CRA dimerization interfaces share a conserved helical architecture, the molecular rules that govern subunit pairing have not been explored. This was a daunting task in protein biochemistry that was achieved in the present study, which defines this "assembly specificity code" at the structural and residue-specific level.<br /> The authors used X-ray crystallography to solve high-resolution structures of mammalian CTLH-CRA domains, including RANBP9, RANBP10, TWA1, MAEA, and the heterodimeric complex between RANBP9 and MKLN. They further examined and characterized assemblies by quantitative methods (ITC and SEC-MALS) and qualitatively using nondenaturing gels. Some of their ITC measurements were particularly clever, and involved competitive titrations, and titrations of varying partners depending on protein behavior. The experiments allowed the authors to discover that affinities for interactions between partners are exceptionally tight, in the pM-nM range, and to distill the basis for specificity while also inferring that additional interactions beyond the LisH-CTLH-CRA modules likely also contribute to stability. Beyond discovering how the native pairings are achieved, the authors were able to use this new structural knowledge to reengineer interfaces to achieve different preferred partnerings.

      Strengths:

      Nearly everything about this work is exceptionally strong.<br /> -The question is interesting for the native complexes, and even beyond that has potential implications for design of novel molecular machines.<br /> -The experimental data and analyses are quantitative, rigorous, and thorough.<br /> -The paper is a great read - scholarly and really interesting.<br /> -The figures are exceptional in every possible way. They present very complex and intricate interactions with exquisite clarity. The authors are to be commended for outstanding use of color and color-coding throughout the study, including in cartoons to help track what was studied in what experiments. And the figures are also outstanding aesthetically.

      Weaknesses:

      There are no major weaknesses of note, and in the revision the authors addressed my minor suggestions for the text.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      GID/CTLH-type RING ligases are huge multi-protein complexes that play an important role in protein ubiquitylation. The subunits of its core complex are distinct and form a defined structural arrangement, but there can be variations in subunit composition, such as exchange of RanBP9 and RanBP10. In this study, van gen Hassend and Schindelin provide new crystal structures of (parts of) key subunits and use those structures to elucidate the molecular details of the pairwise binding between those subunits. They identify key residues that mediate binding partner specificity. Using in vitro binding assays with purified protein, they show that altering those residues can switch specificity to a different binding partner.

      Strengths:

      This is a technically demanding study that sheds light on an interesting structural biology problem in residue-level detail. The combination of crystallization, structural modeling, and binding assays with purified mutant proteins is elegant and, in my eyes, convincing.

      Weaknesses:

      I mainly have some suggestions for further clarification, especially for a broad audience beyond the structural biology community.

      We thank the reviewer for the careful evaluation of our manuscript and for the positive and encouraging assessment of our work. We also thank the reviewer for the constructive suggestions to improve clarity for a broader audience and have revised the manuscript accordingly.

      (1) The authors establish what they call an 'engineering toolkit' for the controlled assembly of alternative compositions of the GID complex. The mutagenesis results are great for the specific questions asked in this manuscript. It would be great if they could elaborate on the more general significance of this 'toolkit' - is there anything from a technical point of view that can be generalized? Is there a biological interest in altering the ring composition for functional studies?

      We thank the reviewer for raising this important point. Beyond addressing the specific pairwise assembly mechanisms analyzed in this study, we agree that the broader significance of this engineering toolkit warrants further discussion. The residue-level understanding of CTLH-CRA interfaces not only explains assembly specificity but also enables rational manipulation of ring composition in a controlled manner. We have therefore expanded the end of the discussion section to outline generalizable strategies for CRA-interface disruption and to highlight potential biological applications of altering ring composition for functional studies.

      (2) Along the same lines, the mutagenesis required to rewire Twa1 binding was very complex (8 mutations). While this is impressive work, the 'big picture conclusion' from this part is not as clear as for the simpler RanBP9/10. It would be great if the authors could provide more context as to what this is useful for (e.g., potential for in vivo or in vitro functional studies, maybe even with clinical significance?)

      We thank the reviewer for this important comment and agree that the broader implications of the more complex Twa1 rewiring were not sufficiently emphasized in the original manuscript. Through the competition ITC experiments (Fig. 5), we aimed to demonstrate a concrete application of the Twa1 rewiring. At the same time, we recognize that additional use cases are conceivable. To address this point, we have expanded the discussion section to clarify the conceptual significance of Twa1 rewiring and briefly outline further potential applications of controlled interface manipulation. These additions aim to better contextualize the broader relevance of this approach beyond the specific mechanistic questions addressed in this study.

      (3) For many new crystal structures, the authors used truncated, fused, or otherwise modified versions of the proteins for technical reasons. It would be helpful if the authors could provide reasoning why those modifications are unlikely to change the conclusions of those experiments compared to the full-length proteins (which are challenging to work with for technical reasons). For instance, could the authors use folding prediction (AlphaFold) that incorporates information of their resolved structures and predicts the impact of the omitted parts of the proteins? The authors used AlphaFold for some aspects of the study, which could be expanded.

      We agree with the reviewer that the transferability of the domain constructs to the corresponding full-length proteins is an important consideration. In the original version of the manuscript, we addressed this point by fitting the experimentally determined CTLH-CRA domain structures of muskelin and RanBP9 into the cryo-EM maps of the full-length complexes (Fig. 5d), demonstrating that the applied truncations and fusion strategies are compatible with the architecture observed in the intact assembly. Following the reviewer’s suggestion, we have further strengthened this analysis by adding a new Supplementary Figure 1. In this figure, the experimentally determined CTLH-CRA domain structures are superposed with full-length AlphaFold predictions. This comparison shows that removal of flexible linker regions, such as those between the CTLH and CRA motifs or at terminal segments, does not alter the overall fold or the binding interfaces of the domains. Together, these analyses support the conclusion that the domain constructs faithfully represent the structural and interaction properties of the full-length proteins.

      Reviewer #2 (Public review):

      Summary:

      This is a very interesting study focusing on a remarkable oligomerization domain, the LisH-CTLH-CRA module. The module is found in a diverse set of proteins across evolution. The present manuscript focuses on the extraordinary elaboration of this domain in GID/CTLH RING E3 ubiquitin ligases, which assemble into a gigantic, highly ordered, oval-shaped megadalton complex with strict subunit specificity. The arrangement of LisH-CTLH-CRA modules from several distinct subunits is required to form the oval on the outside of the assembly, allowing functional entities to recruit and modify substrates in the center. Although previous structures had revealed that CTLH-CRA dimerization interfaces share a conserved helical architecture, the molecular rules that govern subunit pairing have not been explored. This was a daunting task in protein biochemistry that was achieved in the present study, which defines this "assembly specificity code" at the structural and residue-specific level.

      The authors used X-ray crystallography to solve high-resolution structures of mammalian CTLH-CRA domains, including RANBP9, RANBP10, TWA1, MAEA, and the heterodimeric complex between RANBP9 and MKLN. They further examined and characterized assemblies by quantitative methods (ITC and SEC-MALS) and qualitatively using nondenaturing gels. Some of their ITC measurements were particularly clever and involved competitive titrations and titrations of varying partners depending on protein behavior. The experiments allowed the authors to discover that affinities for interactions between partners are exceptionally tight, in the pM-nM range, and to distill the basis for specificity while also inferring that additional interactions beyond the LisH-CTLH-CRA modules likely also contribute to stability. Beyond discovering how the native pairings are achieved, the authors were able to use this new structural knowledge to reengineer interfaces to achieve different preferred partnerings.

      Strengths:

      Nearly everything about this work is exceptionally strong.

      (1) The question is interesting for the native complexes, and even beyond that, has potential implications for the design of novel molecular machines.

      (2) The experimental data and analyses are quantitative, rigorous, and thorough.

      (3) The paper is a great read - scholarly and really interesting.

      (4) The figures are exceptional in every possible way. They present very complex and intricate interactions with exquisite clarity. The authors are to be commended for outstanding use of color and color-coding throughout the study, including in cartoons to help track what was studied in what experiments. And the figures are also outstanding aesthetically.

      Weaknesses:

      There are no major weaknesses of note, but I can make a few recommendations for editing the text.

      We are very grateful to the reviewer for this exceptionally positive and thoughtful assessment of our work. We sincerely appreciate the recognition of both the conceptual scope and the technical depth of the study. We are particularly encouraged by the reviewer’s comments regarding the clarity and presentation of the figures. Considerable effort went into ensuring that the structural and biochemical complexity of the CTLH assemblies could be conveyed in a clear and accessible manner, and we are grateful that this was appreciated. We thank the reviewer for the constructive recommendations for textual improvements.

      Reviewer #3 (Public review):

      Summary:

      Protein complexes, like the GID/CTLH-type E3 ligase, adopt a complex three-dimensional structure, which is of functional importance. Several domains are known to be involved in shaping the complexes. Structural information based on cryo-EM is available, but its resolution does not always provide detailed information on protein-protein interactions. The work by van gen Hassend and Schindelin provides additional structural data based on crystal structures.

      Strengths:

      The work is solid and very carefully performed. It provides high-resolution insights into the domain architecture, which helps to understand the protein-protein interactions on a detailed molecular level. They also include mutant data and can thereby draw conclusions on the specificity of the domain interactions. These data are probably very helpful for others who work on a functional level with protein complexes containing these domains.

      Weaknesses:

      The manuscript contains a lot of useful, very detailed information. This information is likely very helpful to investigate functional and regulatory aspects of the protein complexes, whose assembly relies on the LisH-CTLH-CRA modules. However, this goes beyond the scope of this manuscript.

      We thank the reviewer for the detailed review of our manuscript and for the constructive and positive remarks. We greatly appreciate the recognition of the high-resolution structural insights and the value of combining crystallographic data with mutational analyses to elucidate domain-specific interactions. We are also grateful for the acknowledgment that these findings may serve as a useful resource for future functional and regulatory studies of LisH-CTLH-CRA-containing complexes. While such aspects extend beyond the immediate scope of the present study, we hope that the structural framework provided here will facilitate and inspire future investigations addressing these questions.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) For the ITC measurements that are less accurate, the authors may want to represent that in the figures with an approximate sign.

      We thank the reviewer for this helpful suggestion. After consideration, we decided not to introduce an approximate sign in the main figures, as this would be inconsistent with the graphical conventions used throughout the manuscript (there is also no equal sign). Since the associated errors are reported directly alongside each K<sub>D</sub> value, we believe that the precision of the measurements is sufficiently conveyed. However, we agree that explicitly marking estimated values can be appropriate in specific cases. We have therefore added approximate signs in Supplementary Fig. 5 for the K<sub>D</sub> estimation of self-association.

      (2) The names of the proteins are from mammals and should probably be capitalized.

      We agree that capitalization is generally appropriate for mammalian protein names. In particular, for proteins such as Rmnd5a, which is identical in sequence between mouse and human, the use of human-style nomenclature would indeed be fully justified. Originally, we chose the current nomenclature to distinguish the proteins studied here from strictly human versions, as most constructs are derived from mouse and one (muskelin) from rat. This approach also avoids inconsistencies between the mouse and rat proteins within the manuscript and maintains alignment with the nomenclature used in our previous publications. For the sake of consistency and continuity, we have therefore retained the original formatting throughout the manuscript.

      (3) For the sequence alignments, it would be good to specify in the legend which organisms these are from, and where the differences are in mouse and rat proteins used in the study, and the human proteins.

      We appreciate this constructive suggestion. We have revised the sequence alignment legends to clearly specify the organism of origin for each sequence included in the analysis. In addition, we have added a new Supplementary Figure 1 presenting the AlphaFold predictions of the mouse proteins and rat muskelin used in this study. Within these models, sequence differences relative to the human proteins are indicated, and variations within the CTLH-CRA domains are explicitly annotated. These additions clarify how the constructs analyzed here relate to their human counterparts.

      (4) A few points about the referencing:

      (a) It was reference 27 that first described the dual-sided interactions where the CRA domain weaves back and forth such that CTLH-CRA<sup>N</sup> and LisH-CRA<sup>C</sup> mediate the contacts on the two sides. This should be cited.

      We fully agree and added the reference accordingly.

      (b) To this reviewer's knowledge, it was references 13 and 9 that resolved the daisy-chain of helical LisH-CTLH-CRA interactions around the oval helical structures.

      We agree with the reviewer that references 13 and 9 resolved the helical LisH-CTLH-CRA daisy-chain arrangement around the oval structure. Reference 13 was already cited in the original manuscript, and we have now added reference 9 to appropriately acknowledge this contribution. We have retained reference 14, although it did not resolve the helical daisy-chain architecture, as it described a related oval assembly of CTLH complex components that remains relevant in the structural context discussed.

      (c) A cryo-EM map with RANBP10 was shown at low resolution in reference 8.

      We agree with the reviewer that a low-resolution cryo-EM map including RANBP10 was reported in reference 8. Our original wording was not sufficiently precise and may have given the impression that RANBP10 had not been characterized. Our intention was to convey that, although cryo-EM maps exist, detailed atomic-level information on subunit interfaces was lacking. We have revised the paragraph accordingly to clarify this point and now cite reference 8 explicitly in this context.

      (d) The Discussion requires referencing.

      We agree with the reviewer that additional referencing improves the clarity and contextualization of the Discussion. We have revised the Discussion section accordingly and added appropriate references to support the statements made.

    1. Abstract

      Vector-borne diseases pose a persistent and increasing challenge to human, animal, and agricultural systems globally. Mathematical modeling frameworks incorporating vector trait responses are powerful tools to assess risk and predict vector-borne disease impacts. Developing these frameworks and the reliability of their predictions hinge on the availability of experimentally derived vector trait data for model parameterization and inference of the biological mechanisms underpinning transmission. Trait experiments have generated data for many known and potential vector species, but the terminology used across studies is inconsistent, and accompanying publications may share data with insufficient detail for reuse or synthesis. The lack of data standardization can lead to information loss and prohibits analytical comprehensiveness. Here, we present MIReVTD, a Minimum Information standard for Reporting Vector Trait Data. Our reporting checklist balances completeness and labor-intensiveness with the goal of making these important experimental data easier to find and reuse, without onerous effort for scientists generating the data. To illustrate the standard, we provide an example reproducing results from an Aedes aegypti mosquito study.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag020), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1:

      The authors propose MIReVTD, a concise minimum-information checklist for reporting vector trait data, motivated by the lack of consistent terminology and metadata that impedes reuse and synthesis across studies. The scope and intent are clearly stated in the Abstract and Introduction, including the emphasis on FAIR principles and the illustrative Aedes aegypti example and VecTraits implementation. Overall, this is a timely, valuable contribution that complements MIReAD (arthropod abundance) and the vector competence minimum data standard, and it will be highly useful to both experimentalists and modellers.

      Major

      • It would be highly beneficial to demonstrate the compatibility and added value of MIReVTD and the VecTraits database relative to existing initiatives that aim to collect and structure similar information. The authors mention ETS, MIAPPE, and MIReAD, but an explicit mapping of how the minimum information fields align would help to place MIReVTD in context and facilitate adoption of this standard.
      • The "Axes of Variation" section is strong, but it could be clearer about what constitutes a stressor or condition. It would help to list common confounders such as humidity, photoperiod, diet or food ration and quality, larval density, and light cycle, and to encourage recording fixed or background conditions in separate fields rather than only gradients. This would help avoid ambiguity between variables that are experimentally varied and those that simply describe the environment.
In Figure 2, the second stressor appears to take a fixed value (0.1). This is somewhat confusing because it is not clear whether this field is meant for another gradient (e.g., temperature in the range of 20 to 40 °C in addition to food ration categories), or whether it lists fixed conditions under which the experiment was performed. If it is the latter, it might be more practical to include additional fields for stressors so that all relevant conditions, such as humidity and photoperiod, can be recorded. It would also help to clarify whether a third or further stressor can be added to the table, and how these would appear.
It might in fact be preferable not to distinguish gradients from fixed conditions at all, and instead to treat them uniformly as conditions, each defined with its corresponding unit and uncertainty. This would simplify the structure and prevent confusion about whether a variable was held constant or systematically varied.
      • It would greatly improve usability and adoption if the standard also recommended ORCIDs for contributors, DOIs for datasets, and an explicit data license (e.g., CC BY/CC0). If this extension is possible, I recommend that the authors add a short "Data licensing & citation" paragraph to the Results section.

      Minor

      • Line 248: Fig. 2?
      • Please update the citation of bayesTPC.
      • If possible, please provide a code snippet with the data used (in Zenodo or as Supplementary Material) for Fig. 1.
      • I believe the following are also relevant to this study and should be mentioned appropriately:
        - Adams B, Franz N, König-Ries B, et al. TraitBank: Practical semantics for organism attribute data. Semantic Web. 2015;7(6):577-588. doi:10.3233/SW-150190
        - Kattge, J., Ogle, K., Bönisch, G., Díaz, S., Lavorel, S., Madin, J., Nadrowski, K., Nöllert, S., Sartor, K. and Wirth, C. (2011), A generic structure for plant trait databases. Methods in Ecology and Evolution, 2: 202-213. https://doi.org/10.1111/j.2041-210X.2010.00067.x

    1. Reviewer #2 (Public Review):

      Summary:

      After manually labelling 144 human adult hemispheres in the lateral parieto-occipital junction (LPOJ), the authors 1) propose a nomenclature for 4 previously unnamed highly variable sulci located between the temporal and parietal or occipital lobes, 2) focus on one of these newly named sulci, namely the ventral supralateral occipital sulcus (slocs-v) and compare it to neighbouring sulci to demonstrate its specificity (in terms of depth, surface area, gray matter thickness, myelination, and connectivity), 3) relate the morphology of a subgroup of sulci from the region including the slocs-v to the performance in a spatial orientation task, demonstrating behavioural and morphological specificity. In addition to these results, the authors propose an extended reflection on the relationship between these newly named landmarks and previous anatomical studies, a reflection about the slocs-v related to functional and cytoarchitectonic parcellations as well as anatomic connectivity and an insight about potential anatomical mechanisms relating sulcation and behaviour.

      Strengths:

      - To my knowledge, this is the first study addressing the variable tertiary sulci located between the superior temporal sulcus (STS) and intra-parietal sulcus (IPS).

      - This is a very comprehensive study addressing altogether anatomical, architectural, functional and cognitive aspects.

      - The definition of highly variable yet highly reproducible sulci such as the slocs-v feeds the community with new anatomo-functional landmarks (which is emphasized by the provision of a probability map in supp. mat., which in my opinion should be proposed in the main body).

      - The comparison of different features between the slocs-v and similar sulci is useful to demonstrate their difference.

      - The detailed comparison of the present study with state of the art contextualises and strengthens the novel findings.

      - The functional study complements the anatomical description and points towards cognitive specificity related to a subset of sulci from the LPOJ.

      - The discussion offers a proposition of theoretical interpretation of the findings.

      - The data and code are mostly available online (raw data made available upon request).

    1. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #2

      Evidence, reproducibility and clarity

      The authors develop a method that extracts functional properties and clusters cells from single-cell RNA sequencing data using machine learning techniques.

      As it stands, there are several major shortcomings in the presentation of the work:

      • The motivation for the method is not well explained. Certainly analyses of single cell transcriptomics data do not capture the full state or trajectory, however there isn't any concrete example of the problem that this method intends to solve, nor how/why existing methods fail to capture this.
      • No motivation is given for any of the approaches that feature in the method, and therefore there is no consistent logical thread that a reader can follow to be convinced that the results are plausible.
      • Methods are not sufficiently explained.

      For example: There is no indication of how the original graph is obtained, nor what is diffusing in the 'graph diffusion' method. Statements like "The term 'sparse' indicates the process of sparsifying the matrix." do not explain anything about the actual process of making the matrix sparse. This could perhaps be understood in terms of a 'sparse' function in a particular linear algebra package, which would implicitly make this procedure concrete, but this is not mentioned.

      Section 2.2.2 mentions a reinforcement learning scheme, however none of the following explanation describes anything related to the commonly accepted reinforcement learning literature, and several quantities (such as the loss) remain undefined. Similarly, section 2.2.3 mentions the BERT pre-trained transformer without indicating how specifically it was modified or trained for this particular purpose, except perhaps in Figure 4 which itself is intensely confusing. Again in this section the authors mention a 'genetic algorithm' with no reference to any commonly accepted approaches used in the development of genetic algorithms for optimisation, and with no explanation of what exactly is optimised or how convergence is monitored.

      No code implementation is provided, and therefore it is impossible to use this to understand any of the methodology, and renders it impossible to reproduce.

      • Where mathematical notation is used it is incredibly confusing to read, using multiple symbols for different concepts, and not appearing to conform to any commonly accepted convention. In some cases, these are missing completely, for example on page 6, rendering it entirely impossible to follow.
      • The results do not support the assertions made about the method.

      No explanation of the alternative methods is given in section 3.1, nor why they are expected to perform the tasks chosen, nor what the configurations of these models are and whether these have been optimised. In Table 2 many alternative methods are listed, however there is no explanation why only a small subset was chosen for comparison, nor what information the authors base their conclusions on (whether these were actually executed for this purpose, or whether they were interpreted from the paper).

      Metrics such as 'accuracy' are not defined, and are the only numerical evaluation of the method, whereas one would expect considerably more detailed evaluation of the claims, such as in the CoSpar paper.

      Section 2.4 mentions 'Details of scRNA-seq data processing and experimental methods are shown in the Supplementary Appendix 6 - Animal Processing.', however I have no supplementary material titled 'Appendix 6', and nothing at all that documents the scRNA-Seq pipeline.

      Section 3.2 seems to describe a very manual procedure for identifying these clusters, and seems to bear no relation to the TOGGLE procedure defined previously, so it's not clear how good an indication this is of the performance of the algorithm. Furthermore, the subsequent results seem to rarely refer to the TOGGLE method at all, and lack any meaningful comparison to alternative methods or why TOGGLE is essential for obtaining these results.

      In many cases the plots in this section are so distorted by compression that making out the text and the points is essentially impossible, and so I cannot comment on any of these.

      I would strongly recommend that the methodology of the paper is greatly expanded to cover exactly what is done, such that it is possible to reproduce in its entirety. Asking a third party who is an expert in machine learning to read through the descriptions and the mathematics would also be highly beneficial to ensure their correctness. Furthermore, it is essential that all of the claims made in the introduction and throughout the paper be systematically and explicitly validated in the results. If this cannot be done on real data, where ground-truth labels and trajectories are hard to come by, some evidence for these claims could be acquired by evaluation on simulated data.

      Significance

      The inference of cell state and trajectories from single cell RNA sequencing is a timely and important task in computational biology, with many important downstream applications. The method described in this paper aims to distinguish functionally distinct cell populations that exhibit small differences in transcript counts. However, it is not precisely articulated why the complicated approach proposed here is advantageous over simpler more conventional approaches, such as graph clustering, random-walk based methods such as CoSpar, or trajectory inference based on ODE kinetics such as in scVelo.

      Furthermore, the methods described are exceptionally vague and hard to follow, with mathematical descriptions and naming schemes that are inconsistent with the commonly accepted literature of the techniques referenced. Therefore it is also difficult for even a well-prepared reader to come to their own conclusions as to the performance and applicability of the proposed approach. This issue is compounded by the fact that there is very little in the way of validation of the specific claims made of the method, let alone in relation to alternative methods.

      The study would be greatly improved by expanded methods sections, documenting in detail what is done at each stage. Where existing work is referenced without an exact description, how the implementation differs from the reference must be addressed. Most of the description is currently in text form only, which is wholly insufficient for the kind of complex mathematical operations described. Furthermore, many of the claims throughout the paper go unaddressed in the results, where there are only a few accuracy metrics and comparisons of results, and there is no attempt to rigorously demonstrate an advantage of any of the novel components presented (for example, in an ablation study). Expanding these numerical metrics and comparisons across all applications of the method is essential for demonstrating the assertions of the paper - for example, constructing a metric for the comparison of the TOGGLE and ground-truth UMAP comparisons. In cases where there is no real ground truth in the data, simulated datasets could be used to demonstrate the plausible performance of TOGGLE in ideal scenarios.

      My expertise is in computational biology and machine learning, with a background in physics.

    1. To reduce the need for code switching, organizations must invest in a diverse workforce and make inclusive actions and behaviors a systemic expectation. By providing training on cross-cultural competence and ensuring equity in processes, leaders can build more inclusive work environments.

      This sounds nice but does it actually work or have individuals been molded to code switch because of previous experiences so they don't trust when a workplace is more diverse?

    2. Another study of Hispanic and Latino professionals revealed that 40 percent regularly code switch

      Multicultural not just the black community

    3. 2 Types of Code Switching Strategic code switching happens when employees read the room and tailor their behaviors to the expectations of the workplace. Protective code switching happens when an employee code switches to downplay their identity and thus avoid being stereotyped.

      Realizing the reasons and differences in why we engage in some of these actions, and categorizing them, makes it more real and helps connect that it's not only you who does this.

    1. insiders discount! Members of techforword insiders save 20% on this training! You'll find your discount code in the insiders community. Not a member yet? Join now for tons of exclusive perks like this one!

      remove

    1. In the professional world, plagiarism results in loss of credibility and often compensation, including future opportunities. In a classroom setting, plagiarism results in a range of sanctions, from loss of a grade to expulsion from a school or university. CNM outlines the consequences for academic dishonesty in its Code of Conduct. In both professional and academic settings, the penalties are severe. If you use someone else’s work, cite it. Give credit where credit is due.

      If we do not give credit where it is due, it can lead to serious conflict and get people in trouble.

    1. Author response:

      The following is the authors’ response to the original reviews.

      General Response

      We are grateful for the constructive comments from reviewers and the editor.

      The main point converged on a potential alternative interpretation that top-down modulation to the visual cortex may be contributing to the NC connectivity we observed. For this revision, we address that point with new analysis in Fig. S8 and Fig. 6. These results indicate that top-down modulation does not account for the observed NC connectivity.

      We performed the following analyses.

      (1) In a subset of experiments, we recorded pupil dynamics while the mice were engaged in a passive visual stimulation experiment (Fig. S8A). We found that pupil dynamics, which indicate the arousal state of the animal, explained only 3% of the variance of neural dynamics. This is significantly smaller than the contribution of sensory stimuli and the activity of the surrounding neuronal population (Fig. S8B). In particular, the visual stimulus itself typically accounted for 10-fold more variance than pupil dynamics (Fig. S8C). This suggests that the population neural activity is highly stimulus-driven and that a large portion of functional connectivity is independent of top-down modulation. In addition, after subtracting the neural activity from the pupil-modulated portion, the cross-stimulus stability of the NC was preserved (Fig. S8D).

      We note that the contribution from pupil dynamics to neural activity in this study is smaller than what was observed in an earlier study (Stringer et al. 2019 Science). This may be because mice were in quiet wakefulness in the current study, while mice were in spontaneous locomotion in the earlier study. We discuss this discrepancy in the main text, in the subsection “Functional connectivity is not explained by the arousal state”.
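
      The variance-partitioning logic in point (1) can be illustrated with a minimal sketch. It is not the authors' analysis code: the synthetic traces, the 0.2 pupil weight, and the helper `variance_explained` are all assumptions chosen for illustration. The sketch regresses a pupil trace out of two simulated neurons, reports the fraction of variance the pupil explains, and compares the pairwise correlation before and after removing the pupil-related component.

```python
import numpy as np

def variance_explained(y, x):
    """Return (R^2, residual) of an ordinary least-squares fit of y on x."""
    X = np.column_stack([np.ones_like(x), x])      # intercept + regressor
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var(), resid

rng = np.random.default_rng(0)
n = 5000                                           # hypothetical number of samples
pupil = rng.standard_normal(n)                     # stand-in for a pupil trace
shared = rng.standard_normal(n)                    # shared fluctuation unrelated to pupil

# Two hypothetical neurons: weak pupil drive, strong shared drive, private noise.
a = 0.2 * pupil + shared + rng.standard_normal(n)
b = 0.2 * pupil + shared + rng.standard_normal(n)

r2_a, res_a = variance_explained(a, pupil)
r2_b, res_b = variance_explained(b, pupil)

nc_raw = np.corrcoef(a, b)[0, 1]                   # correlation before regression
nc_resid = np.corrcoef(res_a, res_b)[0, 1]         # correlation after removing pupil

print(f"pupil R^2: {r2_a:.3f}, {r2_b:.3f}")
print(f"correlation raw: {nc_raw:.3f}, pupil-regressed: {nc_resid:.3f}")
```

      In this toy setting the pupil explains only a few percent of the variance, so regressing it out barely changes the pairwise correlation. Whether the same holds for real recordings depends on the data, which is what Fig. S8 is reported to address.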

      (2) We performed network simulations with top-down input (Fig. 6F-H). With multidimensional top-down input comparable to the experimental data, recurrent connections within the network are necessary to generate cross-stimulus stable NC connectivity (Fig. 6G). Only after the contribution from the top-down input was increased (i.e., to more than 1/3 of the contribution from the stimulus) could the cross-stimulus NC connectivity be generated by the top-down modulation (Fig. 6H). Thus, this analysis provides further evidence that top-down modulation was not playing a major role in the NC connectivity we observed.

      These new results support our original conclusion that network connectivity is the principal mechanism underlying the stability of functional networks.

      Public Reviews:

      Reviewer #1 (Public Review):

      Using multi-region two-photon calcium imaging, the manuscript meticulously explores the structure of noise correlations (NCs) across the mouse visual cortex and uses this information to make inferences about the organization of communication channels between primary visual cortex (V1) and higher visual areas (HVAs). Using visual responses to grating stimuli, the manuscript identifies 6 tuning groups of visual cortex neurons and finds that NCs are highest among neurons belonging to the same tuning group whether or not they are found in the same cortical area. The NCs depend on the similarity of tuning of the neurons (their signal correlations) but are preserved across different stimulus sets - noise correlations recorded using drifting gratings are highly correlated with those measured using naturalistic videos. Based on these findings, the manuscript concludes that populations of neurons with high NCs constitute discrete communication channels that convey visual signals within and across cortical areas.

      Experiments and analyses are conducted to a high standard and the robustness of noise correlation measurements is carefully validated. However, the interpretation of noise correlation measurements as a proxy for network connectivity is fraught with challenges. While the data clearly indicates the existence of distributed functional ensembles, the notion of communication channels implies the existence of direct anatomical connections between them, which noise correlations cannot measure.

      The traditional view of noise correlations is that they reflect direct connectivity or shared inputs between neurons. While it is valid in a broad sense, noise correlations may reflect shared top-down input as well as local or feedforward connectivity. This is particularly important since mouse cortical neurons are strongly modulated by spontaneous behavior (e.g. Stringer et al, Science, 2019). Therefore, noise correlation between a pair of neurons may reflect whether they are similarly modulated by behavioral state and overt spontaneous behaviors. Consequently, noise correlation alone cannot determine whether neurons belong to discrete communication channels.

      Behavioral modulation can influence the gain of sensory-evoked responses (Niell and Stryker, Neuron, 2010). This can explain why signal correlation is one of the best predictors of noise correlations as reported in the manuscript. A pair of neurons that are similarly gain-modulated by spontaneous behavior (e.g. both active during whisking or locomotion) will have higher noise correlations if they respond to similar stimuli. Top-down modulation by the behavioral state is also consistent with the stability of noise correlations across stimuli. Therefore, it is important to determine to what extent noise correlations can be explained by shared behavioral modulation.
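
      The gain argument above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not an analysis from the manuscript or the review: the firing rates, the gamma-distributed gain with variance 0.09, and the helper names `simulate` and `noise_corr` are all hypothetical. Three Poisson neurons share a single multiplicative gain on every trial and have no connections between them; noise correlations are computed by z-scoring responses within each stimulus and correlating across trials.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials = 2000                                   # trials per stimulus (hypothetical)

# Mean responses (rows: neurons, columns: stimulus A, stimulus B).
rates = np.array([[20.0, 0.2],                    # neuron 0 prefers A
                  [20.0, 0.2],                    # neuron 1 prefers A (co-tuned with 0)
                  [0.2, 20.0]])                   # neuron 2 prefers B

def simulate(stim):
    """Poisson counts with one shared multiplicative gain per trial."""
    # Gamma gain with mean 1 and variance 0.09; shared by all neurons.
    gain = rng.gamma(shape=1 / 0.09, scale=0.09, size=n_trials)
    return rng.poisson(gain[:, None] * rates[:, stim][None, :])

def noise_corr(resp_by_stim, i, j):
    """Pearson correlation of within-stimulus z-scored responses."""
    z = []
    for r in resp_by_stim:
        r = r.astype(float)
        z.append((r - r.mean(axis=0)) / r.std(axis=0))
    z = np.concatenate(z)
    return np.corrcoef(z[:, i], z[:, j])[0, 1]

responses = [simulate(0), simulate(1)]
nc_same = noise_corr(responses, 0, 1)             # co-tuned pair
nc_diff = noise_corr(responses, 0, 2)             # differently tuned pair

print(f"NC co-tuned: {nc_same:.2f}, NC differently tuned: {nc_diff:.2f}")
```

      In this sketch the co-tuned pair shows a clearly higher noise correlation than the differently tuned pair, even though no neuron is connected to any other. It does not show that this mechanism operates in the recorded data, only that shared gain is sufficient in principle to produce the pattern.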

      We thank the reviewer for the constructive and positive feedback on our study.

      The reviewer acknowledged the quality of our experiments and analysis and stated a concern that the noise correlation can be explained by top-down modulation. We have addressed this concern carefully in the revision; please see the General Response above.

      Reviewer #2 (Public Review):

      Summary:

      This groundbreaking study characterizes the structure of activity correlations over a millimeter scale in the mouse cortex with the goal of identifying visual channels, specialized conduits of visual information that show preferential connectivity. Examining the statistical structure of the visual activity of L2/3 neurons, the study finds pairs of neurons located near each other or across distances of hundreds of micrometers with significantly correlated activity in response to visual stimulation. These highly correlated pairs have closely related visual tuning sharing orientation and/or spatial and/or temporal preference as would be expected from dedicated visual channels with specific connectivity.

      Strengths:

      The study presents best-in-class mesoscopic-scale 2-photon recordings from neuronal populations in pairs of visual areas (V1-LM, V1-PM, V1-AL, V1-LI). The study employs diverse visual stimuli that capture some of the specialization and heterogeneity of neuronal tuning in mouse visual areas. The rigorous data quantification takes into consideration functional cell groups as well as other variables that influence trial-to-trial correlations (similarity of tuning, neuronal distance, receptive field overlap). The paper convincingly demonstrates the robustness of the clustering analysis and of the activity correlation measurements. The calcium imaging results convincingly show that noise correlations are correlated across visual stimuli and are strongest within cell classes which could reflect distributed visual channels. A simple simulation is provided that suggests that recurrent connectivity is required for the stimulus invariance of the results. The paper is well-written and conceptually clear. The figures are beautiful and clear. The arguments are well laid out and the claims appear in large part supported by the data and analysis results (but see weaknesses).

      Weaknesses:

      An inherent limitation of the approach is that it cannot reveal which anatomical connectivity patterns are responsible for observed network structure. The modeling results presented, however, suggest interestingly that a simple feedforward architecture may not account for fundamental characteristics of the data. A limitation of the study is the lack of a behavioral task. The paper shows nicely that the correlation structure generalizes across visual stimuli. However, the correlation structure could differ widely when animals are actively responding to visual stimuli. I do think that, because of the complexity involved, a characterization of correlations during a visual task is beyond the scope of the current study.

      An important question that does not seem to be addressed (or is addressed only indirectly; I could be mistaken) is the extent to which it is possible to obtain reliable measurements of noise correlation from cell pairs that have widely distinct tuning. L2/3 activity in the visual cortex is quite sparse. The cell groups laid out in Figure S2 have very sharp tuning. Cells whose tuning does not overlap may not yield significant trial-to-trial correlations because they do not show significant responses to the same set of stimuli, if at all. Could this bias the noise correlation measurements or explain some of the dependence of the observed noise correlations on signal correlations/similarity of tuning? Could the variable overlap in the responses to visual stimuli explain the dependence of correlations on cell classes and groups?

      With electrophysiology, this issue is less of a problem because many if not most neurons will show some activity in response to suboptimal stimuli. For the present study which uses calcium imaging together with deconvolution, some of the activity may not be visible to the experimenters. The correlation measure is shown to be robust to changes in firing rates due to missing spikes. However, the degree of overlap of responses between cell pairs and their consequences for measures of noise correlations are not explored.

      Beyond that comment, the remaining issues are relatively minor issues related to manuscript text, figures, and statistical analyses. There are typos left in the manuscript. Some of the methodological details and results of statistical testing also seem to be missing. Some of the visuals and analyses chosen to examine the data (e.g., box plots) may not be the most effective in highlighting differences across groups. If addressed, this would make a very strong paper.

      We thank the reviewer for acknowledging the contributions of our study.

      We agree with the reviewer that future studies on behaviorally engaged animals are necessary. Although we also agree with the reviewer that behavior studies are outside the scope of the current manuscript, we have included additional analysis and discussion on whether and how top-down input would affect the NC connectivity in the revision. Please see the General Response above.

      Reviewer #3 (Public Review):

      Summary:

      Yu et al harness the capabilities of mesoscopic 2P imaging to record simultaneously from populations of neurons in several visual cortical areas and measure their correlated variability. They first divide neurons into 65 classes depending on their tuning to moving gratings. They find that pairs of neurons of the same tuning class show higher noise correlations (NCs) both within and across cortical areas. Based on these observations and a model they conclude that visual information is broadcast across areas through multiple, discrete channels with little mixing across them.

      NCs can reflect indirect or direct connectivity, or shared afferents between pairs of neurons, potentially providing insight on network organization. While NCs have been comprehensively studied in neuron pairs of the same area, the structure of these correlations across areas is much less known. Thus, the manuscript presents novel insights into the correlation structure of visual responses across multiple areas.

      Strengths:

      The study uses state-of-the-art mesoscopic two-photon imaging.

      The measurements of shared variability across multiple areas are novel.

      The results are mostly well presented and many thorough controls for some metrics are included.

      Weaknesses:

      I have concerns that the observed large intra-class/group NCs might not reflect connectivity but shared behaviorally driven multiplicative gain modulations of sensory-evoked responses. In this case, the NC structure might not be due to the presence of discrete, multiple channels broadcasting visual information as concluded. I also find that the claim of multiple discrete broadcasting channels needs more support before discarding the alternative hypothesis that a continuum of tuning similarity explains the large NCs observed in groups of neurons.

      Specifically:

      Major concerns:

      (1) Multiplicative gain modulation underlying correlated noise between similarly tuned neurons

      (1a) The conclusion that visual information is broadcast in discrete channels across visual areas relies on interpreting NC as reflecting direct or indirect connectivity between pairs, or common inputs. However, a large fraction of the activity in the mouse visual system is known to reflect spontaneous and instructed movements, including locomotion and face movements, among others. Running activity and face movements are some of the largest contributors to visual cortex activity and exert a multiplicative gain on sensory-evoked responses (Niell et al, Stringer et al, among others). Thus, trial-by-trial fluctuations of behavioral state would result in gain modulations that, due to their multiplicative nature, would result in more shared variability in cotuned neurons, as multiplication affects neurons that are responding to the stimulus over those that are not responding (see Lin et al, Neuron 2015 for a similar point).

      As behavioral modulations are not considered, this confound affects most of the conclusions of the manuscript, as it would result in larger NCs the more similar the tuning of the neurons is, independently of any connectivity feature. It seems that this alternative hypothesis can explain most of the results without the need for discrete broadcasting channels or any particular network architecture and should be addressed to support its main claims.

      (1b) In Figure 5 the observations are interpreted as evidence for NCs reflecting features of the network architecture, as NCs measured using gratings predicted NC to naturalistic videos. However, it seems from Figure 5 A that signal correlations (SCs) from gratings had non-zero correlations with SCs during naturalistic videos (is this the case?). Thus, neurons that are cotuned to gratings might also tend to be coactivated during the presentation of videos. In this case, they are also expected to be susceptible to shared behaviorally driven fluctuations, independently of any circuit architecture as explained before. This alternative interpretation should be addressed before concluding that these measurements reflect connectivity features.

      We thank the reviewer for acknowledging the contributions of our study.

      The reviewer suggested that gain modulation might be interfering with the interpretation of the NC connectivity. We have addressed this issue in the General Response above.

      Here, we will elaborate on one additional analysis we performed, in case it might be of interest. We carried out multiplicative gain modeling by implementing an established method (Goris et al. 2014 Nat Neurosci) on our dataset. We were able to perform the modeling work successfully. However, we found that it is not a suitable model for explaining the current dataset because the multiplicative gain induced a negative correlation. This seemed odd but can be explained. First, top-down input is not purely multiplicative but rather both additive and multiplicative. Second, the top-down modulation is high dimensional. Third, the firing rate of layer 2/3 mouse visual cortex neurons is lower than the firing rates for non-human primate recordings used in the development of the method (Goris et al. 2014 Nat Neurosci). Thus, we did not pursue the model further. We just mention it here in case the outcome might be of interest to fellow researchers.

      (2) Discrete vs continuous communication channels

      (2a) One of the authors' main claims is that the mouse cortical network consists of discrete communication channels. This discreteness is based on an unbiased clustering approach to the tuning of neurons, followed by a manual grouping into six categories in relation to the stimulus space. I believe there are several problems with this claim. First, this clustering approach is inherently trying to group neurons and discretise neural populations. To make the claim that there are 'discrete communication channels' the null hypothesis should be a continuous model. An explicit test in favor of a discrete model is lacking, i.e. are the results better explained using discrete groups vs. when considering only tuning similarity? Second, the fact that 65 classes are recovered (out of 72 conditions) and that manual clustering is necessary to arrive at the six categories is far from convincing that we need to think about categorically different subsets of neurons. That we should think of discrete communication channels is especially surprising in this context as the relevant stimulus parameter axes seem inherently continuous: spatial and temporal frequency. It is hard to motivate the biological need for a discretely organized cortical network to process these continuous input spaces.

      (2b) Consequently, I feel the support for discrete vs continuous selective communication is rather inconclusive. It seems that, following the authors' claims, it would be important to establish whether belonging to the same group, rather than tuning similarity, is the defining feature for showing large NCs.

      Thanks for pointing this out so that we can clarify.

      We did not mean to argue that the tuning of neurons is discrete. Our conclusions are not dependent on asserting a particular degree of discreteness. We performed GMM clustering to label neurons with an identity so that we could analyze the NC connectivity structure with a degree of granularity supported by the data. Our analysis suggested that communication happens within a class, rather than through mixed classes. We realized that using the term “discrete” may be confusing. In the revised text we used the term “unmixed” or “non-mixing” instead to emphasize that the communication happens between neurons belonging to the same tuning cluster, or class. 

      However, we do see how the question of discreteness among classes might be interesting to readers. To provide further information, we have included a new Fig. S2 to visualize the GMM classes using t-SNE embedding.

      Finally, as stated in point 1, the larger NCs observed within groups than across groups might be due to the multiplicative gain of state modulations, due to the larger tuning similarity of the neurons within a class or group.

      We have addressed this issue in the General Response above and the response to comment (1).

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      A general recommendation discussed with the reviewers is to make use of behavioural recording to assess whether shared behaviourally driven modulations can explain the observed relation between SC and NC, independently of the network architecture. Alternatively, a simulation or model might also address this point as well as the possibility that the relation of SC and NC might be also independent of network architecture given the sparseness of the sensory responses in L2/3.

      We have addressed this in the General Response above.

      Broadly speaking, inferring network architecture based on NCs is extremely challenging. Consequently, the study could also be substantially improved by reframing the results in terms of distributed co-active ensembles without insinuation of direct anatomical connectivity between them.

      We agree that inferring network architecture based on NCs is challenging. The current study has revealed some principles of functional networks measured by NCs, and we showed that cross-stimulus NC connectivity provides effective constraints to network modeling. We are explicit about the nature of NCs in the manuscript. For example, in the Abstract, we write “to measure correlated variability (i.e., noise correlations, NCs)”, and in the Introduction, we write “NCs are due to connectivity (direct or indirect connectivity between the neurons, and/or shared input)”. We are following conventions in the field (e.g., Sporns 2016; Cohen and Kohn 2011).

      Notice also that the abstract or title should make clear that the study was made in mice.

      Sorry for the confusion, we now clearly state the study was carried out in mice in the Abstract and Introduction.

      Reviewer #1 (Recommendations For The Authors):

      The manuscript presents a meticulous characterization of noise correlations in the visual cortical network. However, as I outline in the public review, I think the use of noise correlations to infer communication channels is problematic and I urge the authors to carefully consider this terminology. Language such as "strength of connections" (Figure 4D) should be avoided.

      We now state in the figure legend that the plot in Fig. 4D shows the average NC value.

      My general suggestion to the authors, which primarily concerns the interpretation of analyses in Figures 4-6, is to consider the possible impact of shared top-down modulation on noise correlations. If behavioral data was recorded simultaneously (e.g. using cameras to record face and body movements), behavioral modulation should be considered alongside signal correlation as a possible factor influencing NCs.

      We have addressed this issue in the General Response above.

      I may be misunderstanding the analysis in Figure 4C but it appears circular. If the fraction of neurons belonging to a particular tuning group is larger, then the number of in-group high NC pairs will be higher for that group even if high NC pairs are distributed randomly. Can you please clarify? I frankly do not understand the analysis in Figure 4D and it is unclear to me how the analyses in Figure 4C-D address the hypotheses depicted in the cartoons.

      Sorry for the confusion, we have clarified this in the Fig. 4 legend.

      Each HVA has a SFTF bias (Fig. 1E,F; Marshel et al., 2011; Andermann et al., 2011; Vries et al., 2020). Each red marker on the graph in Fig. 4C is a single V1-HVA pair (blue markers are within an area) for a particular SFTF group (Fig. 1). The x-axis indicates the number of high NC pairs in the SFTF group in the V1-HVA pair divided by the total number of high NC pairs per that V1-HVA pair (summed over all SFTF groups). The trend is that for HVAs with a bias towards a particular SFTF group, there are also more high NC pairs in that SFTF group, and thus it is consistent with the model on the right side. This is not circular because it is possible to have a SFTF bias in an HVA and have uniformly low NCs. The reviewer is correct that a random distribution of high NCs could give a similar effect, which is still consistent with the model: that the number of high NC pairs (and not their specific magnitudes) can account for SFTF biases in HVAs.

      To contrast with that model, we tested whether the average NC value for each tuning group varies. That is, can a small number of very high NCs account for SFTF biases in HVAs? That is what is examined in Fig. 4D. We found that the average NC value does not account for the SFTF biases. Thus, the SFTF biases were not related to the modulation in NC (i.e., functional connection strength). 
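      The x-axis quantity described above for Fig. 4C (the number of high NC pairs in one SFTF group divided by the total number of high NC pairs for that V1-HVA pair) can be sketched as follows. This is a minimal illustration only: the function name, the input format, and the toy labels are assumptions, not the analysis code or data from the study.

```python
from collections import Counter


def high_nc_fraction_by_group(pair_groups):
    """Fraction of high-NC pairs that fall in each tuning group.

    pair_groups: one tuning-group label per high-NC pair found in a single
    V1-HVA area pair (hypothetical input format).
    """
    counts = Counter(pair_groups)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}


# Toy labels for illustration; these are not measurements from the study.
toy_pairs = ["G1", "G1", "G1", "G2", "G3", "G3", "G1", "G2"]
fractions = high_nc_fraction_by_group(toy_pairs)
print(fractions)
```

      Because each fraction is normalised by the total for one area pair, the values for that pair sum to 1, so the measure reflects how high NC pairs are distributed over groups rather than how large the individual NC values are.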

      I found the discussion section quite odd and did not understand the relevance of the discussion of the coefficient of variation of various quantities to the present manuscript. It would be more useful to discuss the limitations and possible interpretations of noise correlation measurements in more detail.

      We have revised the discussion section to focus on interpreting the results of the current study and comparing them with those of previous studies.

      Figure 3B: please indicate what the different colors mean - I assume it is the same as Figure 3A but it is unclear.

      We added text to the legend for clarification.

      Typos: Page 7: "direct/indirection wiring", Page 11: "pooled over all texted areas"

      We have fixed the typos.

      Reviewer #2 (Recommendations For The Authors):

      The significance of the results feels like it could be articulated better. The main conclusion is that V1 to HVA connections avoid mixing channels and send distinctly tuned information along distinct channels - a more explicit description of what this functional network understanding adds would be useful to the reader.

      Thanks for the suggestion. We have edited the introduction section and the discussion section to make the take-home message more clear.

      Previous studies with anatomical data already indicate distinctly tuned channels - several of which the authors cite - although inconsistently:

      • Kim et al 2018 https://doi.org/10.1016/j.neuron.2018.10.023

      • Glickfeld et al., 2013 (cited)

      • Han et al., 2022 (cited)

      • Han and Bonin 2023 (cited)

      Thanks for the suggestion, we now cite the Kim et al. 2018 paper.

      I think the information you provide is valuable - but the value should be more clearly spelled out - This section from the end of the discussion for example feels like it abdicates that responsibility: "In summary, mesoscale two-photon imaging techniques open up the window of cellular-resolution functional connectivity at the system level. How to make use of the knowledge of functional connectivity remains unclear, given that functional connectivity provides important constraints on population neuron behavior."

      A discussion of how the results relate to previous studies and a section on the limitations of the study seems warranted.

      Thanks for the suggestion, we have extensively edited the discussion section to make the take-home message clear and discuss prior studies and limitations of the present study.

      Details:

      Analyses or simulations showing that the dependency of correlations on similarity of tuning is not an artifact of how the data was acquired is in my mind missing and if that is the case it is crucial that this be addressed.

      At each step of data analysis, we performed control analyses to assess the fidelity of the conclusions. For example, we ran controls on the spike train inference (Fig. S4), the GMM clustering (Fig. S1), and the noise correlation analysis (Figs. 2, S5).

      None of the statistical testing seems to use animals as experimental units (instead of neurons). This could over-inflate the significance of the results. Wherever applicable and possible, I would recommend using hierarchical bootstrap for testing or showing that the differences observed are reproducible across animals.

      We analyzed the tuning selectivity of HVAs (Fig. 1F) using experimental units, rather than neurons. It is very difficult to observe all tuning classes in each experiment, so pooling neurons across animals is necessary for much of the analysis. We do take care to avoid overstating statistical results, and we show the data points in most figures to give the reader an impression of the distributions.

      Page 2. "The number of neurons belonged to the six tuning groups combined: V1, 5373; LM, 1316; AL, 656; PM, 491; LI, 334." Yet the total recorded number of neurons is 17,990. How neurons were excluded is mentioned in Methods but it should be stated more explicitly in Results.

      We have added text in the Fig. 1 legend to direct the audience to the Methods section for information on the exclusion / inclusion criteria.

      Figure 1C, left. I don't understand how correlation is the best way to quantify the consistency of class center with a subset of data. Why not use for example as the mean square error. The logic underlying this analysis is not explained in Methods.

      Sorry for the confusion, we have clarified this in the Methods section.

      We measured the consistency of the centers of the Gaussian clusters, which are 45-dimensional vectors in the PC dimensions. We measured the Pearson correlation of Gaussian center vectors independently defined by GMM clustering on random subsets of neurons. We found the center of the Gaussian profile of each class was consistent (Fig. 1C). The same class of different GMMs was identified by matching the center of the class.
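      The consistency check described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the centres below are toy 3-dimensional vectors (the real centres are 45-dimensional), the function names are hypothetical, and matching each centre to its highest-correlation counterpart is one plausible reading of "matching the center of the class", not necessarily the exact procedure used.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def match_classes(centers_a, centers_b):
    """For each centre in fit A, return (best-matching index in fit B, r).

    Assumption: classes are matched by maximum correlation, independently
    for each centre (no one-to-one constraint is enforced).
    """
    matches = []
    for center_a in centers_a:
        scores = [pearson(center_a, center_b) for center_b in centers_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Toy centres from two hypothetical GMM fits on different random subsets.
fit_a = [[1.0, 2.0, 3.0], [3.0, 1.0, 0.0]]
fit_b = [[2.9, 1.2, 0.1], [1.1, 2.1, 2.9]]
matches = match_classes(fit_a, fit_b)
print(matches)
```

      A correlation close to 1 for every matched pair indicates that the same cluster centres are recovered regardless of which subset of neurons was used for fitting.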

      Figure 1E. There are statements in the text about cell groups being more represented in certain visual areas. These differences are not well represented in the box plots. Can't the individual data points be plotted? I have also not found the description and results of statistical testing for these data.

      We have replotted the figure (now Fig. 1F) with dot scatters which show all of the individual experiments.

      Figure 2A, right, since these are paired data, I am not quite sure why only marginal distributions are shown. It would be interesting to know the distributions of correlations that are significant.

      This is only for illustration showing that NCs are measurable and significantly different from zero or shuffled controls. The distribution of NCs is broad and has both positive and negative values. We are not using this for downstream analysis.

      Figure 4A, I wonder if it would not be better to concentrate on significant correlations.

      We focused on large correlation values rather than significant values because we wanted to examine the structure of “strongly connected” neuron pairs. Negative and small correlation values can be significant as well. Focusing on large values would allow us to generate a clear interpretation.  

      Figure 4B, 'Mean strength of connections' which I presume mean correlations is not defined anywhere that I can see.

      We believe the reviewer means Fig. 4D. "Mean strength of connections" refers to the average NC value. We have edited the figure legend to add clarity.

      Figure 4F, a few words explaining how to understand the correlation matrix in text or captions would be helpful.

      Sorry for the confusion; we have clarified this part in the figure legend for Fig. 4F.

      Page 5, right column: Incomplete sentence: "To determine whether it is the number of high NC pairs or the magnitude of the NCs,".

      We have edited this sentence.

      Page 5, right column: "Prior findings from studies of axonal projections from V1 to HVAs indicated that the number of SF-TF-specific boutons -rather than the strength of boutons- contribute to the SF-TF biases among HVAs (Glickfeld et al., 2013)." Glickfeld et al. also reported that boutons with tuning matched to the target area showed stronger peak dF/F responses.

      Thank you. We have revised this part accordingly.

      Page 9, the Discussion and Figure 7 which situates the study results in a broader context is welcome and interesting, but I have the feeling that more words should be spent explaining the figure and conceptual framework to a non-expert audience. I am a bit at a loss about how to read the information in the figure.

      Sorry for the confusion, we have added an explanation about this section (page 10, right column).

      As far as I can see, data availability is not addressed in the manuscript. The data, code to analyze the data and generate the figures, and simulation code should be made available in a permanent public repository. This includes data for visual area mapping, calcium imaging data, and any data accessory to the experiments.

      We have stated in the manuscript that code and data are available upon request. We regularly share data with no conditions (e.g., no entitlement to authorship), and we often do so even prior to publication.

      The sex of the mice should be indicated in Figure T1.

      The sex of the mice was mixed. This is stated in the Methods section.

      Methods:

      Section on statistical testing, computation of explained variance missing, etc. I feel many analyses are not thoroughly described.

      Sorry for the confusion; we have improved our Methods section.

      Signal correlation (similarity between two neurons' average responses to stimuli) and its relation to noise correlation is not formally defined.

      We have included the definition of signal correlation in the Methods.

      Number of visual stimulation trials is not stated in Methods. It is only stated in the figure caption.

      The number of visual stimulus trials is provided in the last paragraph of the Methods section (Visual Stimuli).

      Fix typos: incorrect spelling, punctuation, and missing symbols (e.g. closing parentheses).

      We have carefully examined the spelling, punctuation, and grammar. We have corrected errors and we hope that none remain.

      Why use intrinsic imaging to locate retinotopic boundaries in mice already expressing GCaMP6s?

      We agree with the reviewer that visual areas can be mapped using GCaMP signals. However, that is not our preferred approach. Using intrinsic imaging to define the boundary between V1 and HVAs has been a well-refined routine in our lab for over a decade. It is part of our standard protocol. One advantage is that the data (from intrinsic signals) is of the same nature every time. This enables us to use the same mapping procedure no matter what reporters the mice might be expressing, and whatever the expression pattern (e.g., patchy or restricted to certain cell types).

      Reviewer #3 (Recommendations For The Authors):

      The possibility that the larger intra-group NCs observed simply reflect a multiplicative gain on cotuned neurons could be addressed using pupil and/or face recordings: Does pupil size or facial motion predict NCs and if factored out, does signal correlation still predict NCs?

      Perhaps a variant of the network model presented in Figure 6 with multiplicative gain could also be tested to investigate these issues.

      We have addressed this issue in the General Response above.

      Here, we will elaborate on one additional analysis we performed, in case it might be of interest. We carried out multiplicative gain modeling by implementing an established method (Goris et al. 2014 Nat Neurosci) on our dataset. We were able to perform the modeling work successfully. However, we found that it is not a suitable model for explaining the current dataset because the multiplicative gain induced a negative correlation. This seemed odd but can be explained. First, top-down input is not purely multiplicative but rather both additive and multiplicative. Second, the top-down modulation is high dimensional. Third, the firing rate of layer 2/3 mouse visual cortex neurons is lower than the firing rates for non-human primate recordings used in the development of the method (Goris et al. 2014 Nat Neurosci). Thus, we did not pursue the model further. We just mention it here in case the outcome might be of interest to fellow researchers.

      Similarly further analyses can be done to strengthen support for the claims that the observed NCs reflect discrete communication channels. A direct test of continuous vs categorical channels would strengthen the conclusions. One possible analysis would be to compare pairs with similar tuning (same SC) belonging to the same or different groups.

      Thanks for pointing this out so that we can clarify.

      We did not mean to argue that the tuning of neurons is discrete. Our conclusions are not dependent on asserting a particular degree of discreteness. We performed GMM clustering to label neurons with an identity so that we could analyze the NC connectivity structure with a degree of granularity supported by the data. Our analysis suggested that communication happens within a class, rather than through mixed classes. We realized that using the term “discrete” may be confusing. In the revised text we used the term “unmixed” or “non-mixing” instead to emphasize that the communication happens between neurons belonging to the same tuning cluster, or class. 

      However, we do see how the question of discreteness among classes might be interesting to readers. To provide further information, we have included a new Fig. S2 to visualize the GMM classes using t-SNE embedding.

      I also found many places where the manuscript needs clarification and/or more methodological details:

      • How many times was each of the stimulus conditions repeated? And how many times for the two naturalistic videos? What was the total duration of the experiments?

      The number of visual stimulus trials is provided in the last paragraph of the Methods section entitled Visual Stimuli. About 15 trials were recorded for each drifting grating stimulus, and about 20 trials were recorded for each naturalistic video.

      • Typo: Suit2p should be Suite2p (section Calcium image processing - Methods).

      We have fixed the typo.

      • What do the error bars in Figure 1E represent? Differences in group representation across areas from Figure 1E are mentioned in the text without any statistical testing.

      We have revised Figure 1E (now Fig. 1F), and we now show all data points.

      • The manuscript would benefit from a comparison of the observed area-specific tuning biases across areas (Figure 1E and others) with the previous literature.

      We have included additional discussion on this in the last paragraph of the section entitled Visual cortical neurons form six tuning groups.

      • Why are inferred spike trains used to calculate NCs? Why can't dF/F be used? Do the results differ when using dF/F to calculate NC? Please clarify in the text.

      We believe inferred spike trains provide better resolution and make it easier to compare with quantitative values from electrical recordings. Notice that NC values computed using dF/F can be much larger than those computed by inferred spike trains. For example, see Smith & Hausser 2010 Nat Neurosci. Supplementary Figure S8.

      • The sentence seems incomplete or unclear: "That is, there are more high NC pairs that are in-group." Explicit vs what?

      We have revised this sentence.

      • Figure 1E is unclear to me. What is being plotted? Please add a color bar with the metric and the units for the matrix (left) and in the tuning curves (right panels). If the Y and X axes represent the different classes from the GMM, why are there more than 65 rows? Why is the matrix not full?

      We have revised this figure. Fig. 1D is the full 65 x 65 matrix. Fig. 1F has small 3x3 matrices mapping the responses to different TF and SF of gratings. We hope the new version is clearer.

      • How are receptive fields defined? How are their long and short axes calculated? How are their limits defined when calculating RF overlap?

      We have added further details in the Methods section entitled “Receptive field analysis”.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Thank you for the thoughtful and constructive comments on our manuscript. We have carefully addressed all points raised, and believe the manuscript is substantially improved as a result. In particular, we have performed:

      - Comprehensive spatial analysis of stable mutants. Following Recommendations for the authors comment #1, we performed spatial analysis by binning the anterior-posterior axis into 200 µm strata. This analysis validates our initial conclusions and reveals striking spatiotemporal dynamics, including profoundly blunted HFD responses in foxp1b mutants (68% reduction) and loss of spatial gradients in foxp1a mutants.

      - Substantially enhanced the statistical rigour of the screen analysis. We have implemented stratified Kolmogorov-Smirnov tests (within-experiment testing, then combined via Fisher's method) alongside linear mixed models to control for batch effects. In the revised manuscript, we now focus on three hypertrophy genes – foxp1b, txnipa and mmp14b – which are robustly validated by both methods.

      - Normalisation of adipose area to body size. To address concerns about developmental delay (Recommendations for the authors #2), we now normalise adipose area to standard length. With this normalisation, foxp1b single mutants show only a non-significant trend toward decreased adiposity (updated from our original analysis), while the hypertrophic LD morphology remains highly significant - demonstrating the phenotype is independent of body size and not a developmental delay.

      - Revised title. As suggested by Recommendations for the authors comment #6, we have changed the title to: "A quantitative in vivo CRISPR-imaging platform identifies regulators of hyperplastic and hypertrophic adipose morphology in zebrafish"

      - Extensive code and analysis availability. We now provide all code and extensive analysis pipelines in interactive HTML documents at https://github.com/jeminchin/zebrafish_adipose_morphology_screen
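      As a rough illustration of the combination step mentioned in the list above (per-experiment tests combined via Fisher's method), the sketch below combines a set of independent p-values. It is a minimal example under stated assumptions: the function name and the toy p-values are hypothetical, and the per-experiment Kolmogorov-Smirnov tests that would produce those p-values are assumed to have been run separately and are not shown.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null, X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom. For even degrees of freedom the survival
    function has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    term = 1.0   # (X/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Toy per-experiment p-values for illustration; not values from the study.
toy_p_values = [0.04, 0.20, 0.01]
print(fisher_combined_p(toy_p_values))
```

      Testing within each experiment first and combining afterwards keeps between-experiment (batch) differences out of each individual comparison, which is the motivation for stratifying.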

      Joint Public Review:

      We thank the reviewers for their thoughtful assessment of our work and their recognition of the rigorous experimental design, statistical approaches, and the utility of both the identified genes and screening pipeline for the field. We address their concerns below.

      Weakness:

      Distinguishing developmental patterning from adipose tissue plasticity

      We appreciate this important distinction and agree that separating developmental from adaptive effects is a key challenge in the field. We would like to make several points in response:

      First, we acknowledge this limitation in our discussion and have now expanded this section to more explicitly address the interpretive boundaries of our approach. Our screening platform was intentionally designed to capture the outcome of genetic perturbation across development and early adaptation, as these processes are inherently intertwined during the establishment of adipose tissue.

      Second, regarding the suggested analysis of lipid droplet size along the AP axis in response to HFD: we have now performed this analysis and include it as new Fig. 6 and new Supplemental Fig. 8 & 9. These data validate our initial conclusions and reveal striking spatiotemporal dynamics, including profoundly blunted HFD responses in foxp1b mutants (68% reduction) and loss of spatial gradients in foxp1a mutants. Further, these data provide additional resolution on regional responses to dietary challenge.

      Third, we note that our stable mutant validation experiments (Figure 6) do begin to disentangle these effects by examining both baseline and HFD-challenged conditions in animals with constitutive genetic loss. However, we agree that definitive separation would require temporally controlled genetic manipulation, which we now acknowledge as an important future direction.

      Lack of tissue-specific manipulations

      We agree that tissue-specific approaches would strengthen mechanistic conclusions and have acknowledged this limitation in our revised discussion. The current study was designed as a discovery-focused screen to identify candidate regulators, with the understanding that mechanistic dissection would require follow-up studies employing tissue-specific tools.

      We note that adipocyte-specific Cre/lox or Gal4-UAS approaches in zebrafish are feasible and represent an important next phase of investigation for the most promising candidates identified here, rather than a requirement for the current screening study. We have added text explicitly framing our findings as establishing genetic associations that warrant future tissue-autonomous investigation.

      Recommendations for the authors: 

      (1) Analysis: In Figure 6, the authors state that foxp1b mutants "fail to undergo further hypertrophic remodeling in response to a high-fat diet (HFD)." Foxp1b mutant juveniles are already hypertrophic before the high-fat diet. After a high-fat diet, these mutants reach mean lipid droplet diameters similar to WT, approximately 65 µm, which the authors state earlier in the manuscript are "a potential upper limit of LD growth at this developmental stage." The authors should perform additional analysis of their existing data. Specifically, determine lipid droplet size by binning the AP axis as shown in Figure 3. The rationale is that lipid droplet size differences in response to HFD may be more evident when not considering the anterior populations of lipid droplets that have already reached maximum steady state size for this juvenile stage. This would not require any new experiments, just reanalyzing data similar to how they did in Figure 3.

      We thank the reviewer for this excellent suggestion. We have performed the requested spatial analysis by binning the AP axis into 200 µm strata (Figure 3 approach). These data can be found in new Fig. 6H-M, and new Supplemental Figs 8 & 9. This new analysis verifies our initial conclusions, and also reveals several very interesting spatiotemporal dynamics:

      (i) Baseline hypertrophy in foxp1b mutants across AP strata

      In support of our initial conclusion that foxp1b mutants have larger LDs at baseline, the spatial analysis confirms that on a control diet (baseline), foxp1b mutants have significantly larger LDs than WT across strata 1-5 (new Fig. 6I), ranging from +22.2 µm larger in stratum 1 to +17.8 µm larger in stratum 5 (all FDR-adjusted p < 0.05, linear mixed effects model). Extended analysis across all 15 strata is shown in Supplemental Figs. 8 & 9. By contrast, and also in support of our initial conclusion, foxp1a mutants showed no baseline hypertrophy on control diet (all strata p > 0.10, Supplemental Fig. 8).

      (ii) foxp1b mutants show a profoundly blunted hypertrophic response to HFD

      Using paired analysis (same fish on both control diet and after 14 days of high-fat diet) with a linear mixed effects model, we quantified the effect of HFD across all strata:

      (A) Anterior/oldest strata (1-6): WT + HFD increases LD diameter by +25.1-28.1 µm (+52-58%, p < 0.0001), whereas foxp1b mutants + HFD only increase LD diameter by +7.5-11.7 µm (+12-19%, p < 0.003). Therefore, in the oldest/most anterior regions, containing the largest LDs, the hypertrophic response of foxp1b mutants to HFD is ~57% weaker than WTs.

      (B) Posterior/newer strata (7-15): WT + HFD undergo significant increases in LD diameter of +17.7-23.7 µm (p < 0.024). However, in foxp1b mutants there is no significant hypertrophic response at all (p > 0.068), and hypertrophic effect sizes decline from +6.8 µm (stratum 7) to +0.4 µm (stratum 15).

      (C) Overall effect: Averaged across all strata, WT + HFD LDs show +24.4 µm increase (p < 0.0001), whereas foxp1b mutant LDs only show a +7.7 µm increase with HFD (p = 0.020). Therefore, foxp1b mutants show a 68% reduction in hypertrophic growth in response to HFD compared to WT (Fig. 6K).
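      The 68% figure in (C) follows directly from the two averaged effect sizes reported there. A minimal worked check, with a hypothetical helper name:

```python
def percent_reduction(reference, observed):
    """Percentage by which `observed` falls short of `reference`."""
    return 100.0 * (1.0 - observed / reference)


# Averaged HFD-induced increases in LD diameter reported in (C), in µm.
wt_increase_um = 24.4
foxp1b_increase_um = 7.7

reduction = percent_reduction(wt_increase_um, foxp1b_increase_um)
print(round(reduction))  # 68
```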

      The consequence of these spatial dynamics is that WT SAT LDs - which start 22 µm smaller than foxp1b mutants on a control diet - undergo massive hypertrophy across all regions/strata in response to a HFD. Meanwhile, foxp1b mutants - starting larger than WTs - show only a modest, spatially restricted response. This results in a convergence in LD size in early/anterior strata, but WT LDs actually surpass foxp1b mutant sizes in late/posterior strata (strata 14-15: WT +14.7 µm larger on HFD, p = 0.028; Supplemental Figs. 8 & 9).

      By contrast, foxp1a mutants retain the capacity for HFD-induced hypertrophy but show a ~35% weaker response than WT (p = 0.023) – significantly less severe than the 68% reduction in foxp1b mutants. Interestingly, foxp1a mutants after HFD show a reduction in the AP gradation of LD size observed in WT and foxp1b mutants (uniform +14.4 µm across all strata versus WT range of +26.4 µm anteriorly to +16.6 µm posteriorly), suggesting that foxp1a may regulate spatial heterogeneity in adaptive responses to HFD (Fig. 6L-M).

      (iii) Developmental ceiling or impaired adaptive capacity?

      The reviewer raises an important question about whether anterior adipose LDs have reached a "developmental ceiling." After conducting the spatial analysis suggested by the Reviewer, we now believe several lines of evidence support an intrinsic defect in HFD-induced hypertrophy in foxp1b mutants, rather than reaching a developmentally determined limit:

      First, foxp1b mutants show reduced responses across ALL strata, not just anterior regions. The attenuation extends throughout the entire AP axis (57% reduction in strata 1-6, complete loss of response in strata 7-15). If anterior adipocytes had simply reached a size ceiling, we would expect normal responses in posterior regions where cells are smaller - but we don't observe this.

      Second, in posterior/newer regions of SAT (strata 14-15) the hypertrophic response to HFD in foxp1b mutants is so limited that WT LDs actually become larger than foxp1b mutant LDs (+14.7 µm larger, p = 0.028; Supplemental Fig. 9). This demonstrates that these LD sizes are not developmentally limiting and argues for intrinsic hypertrophic defects in response to HFD.

      Third, foxp1a mutants provide an important control. These mutants show no baseline hypertrophy (all strata p > 0.10) yet still exhibit blunted hypertrophic responses to HFD (~35% reduction, p = 0.023), proving that reduced HFD responses can occur independently of baseline hypertrophy.

      We have updated the Results and Discussion to reflect these new conclusions. Methods have been updated to include the spatial analysis approach.
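      For readers unfamiliar with the binning step, the sketch below shows one way to assign lipid droplets to 200 µm strata along the AP axis and average their diameters per stratum. It is an illustration under stated assumptions: the function names, the input format, the choice of the anterior edge as the origin, and the toy measurements are all hypothetical. Only the 200 µm stratum width comes from the response above.

```python
from collections import defaultdict

STRATUM_WIDTH_UM = 200.0  # stratum width stated in the response


def stratum_index(ap_position_um, width_um=STRATUM_WIDTH_UM):
    """1-based stratum index for a position measured from the anterior edge.

    Assumption: positions are non-negative distances in µm from the
    anterior-most point of the tissue.
    """
    if ap_position_um < 0:
        raise ValueError("AP position must be non-negative")
    return int(ap_position_um // width_um) + 1


def mean_diameter_by_stratum(droplets):
    """Average LD diameter per stratum.

    droplets: iterable of (ap_position_um, diameter_um) pairs.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for position_um, diameter_um in droplets:
        entry = sums[stratum_index(position_um)]
        entry[0] += diameter_um
        entry[1] += 1
    return {s: total / n for s, (total, n) in sorted(sums.items())}


# Toy measurements for illustration; not data from the study.
toy_droplets = [(50.0, 60.0), (150.0, 64.0), (250.0, 50.0), (410.0, 40.0)]
per_stratum = mean_diameter_by_stratum(toy_droplets)
print(per_stratum)
```

      Summarising per stratum in this way is what allows anterior (older) and posterior (newer) regions to be compared separately rather than through a single whole-tissue mean.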

      (2) Adipose morphogenesis in WT is a function of standard length, as shown by the authors. At juvenile stages, foxp1 mutants are both smaller and have reduced adipocyte coverage, while adults show normal body length and very subtle adipose phenotypes. Can the authors demonstrate that the observed defects in foxp1 mutant juveniles are bona fide phenotypes rather than a developmental delay?

      We thank the reviewer for this key point. We agree it is critical to distinguish true foxp1b-dependent phenotypes from potential developmental delay. Importantly, our data strongly argue against a simple developmental delay. We show that LD size scales with body size in Fig. 3G, with smaller zebrafish having smaller LDs and larger zebrafish having larger LDs. In contrast to a developmental delay, our data show that foxp1b single and foxp1a;foxp1b double mutants are smaller (reduced standard length) but have larger LDs (Fig. 6E,G). This dissociation between body size and LD size is the opposite of what would be expected from developmental delay.

      To account for the body size difference, we have now normalised adipose area to standard length (Fig. 6F). With this normalisation, foxp1b single mutants show only a non-significant trend toward decreased adiposity, whereas foxp1a;foxp1b double mutants remain significantly reduced. This represents a change from our original analysis and we have updated the text accordingly. Critically, despite normalised adipose area showing only a trend in foxp1b singles, the hypertrophic LD morphology remains highly significant (Fig. 6G), demonstrating that the morphological phenotype is robust and independent of overall body size.

      We have clarified this interpretation in the Results and Discussion.

      (3) What was the rationale for selecting one amongst paralogous genes for the screen? For example, why did the authors choose ptenb rather than ptena?

      (4) Point 3 is particularly relevant for the final six genes that resulted in adipose phenotypes. Why did the authors choose not to target both paralogs, given that multiplexed F0 CRISPR targeting is feasible in zebrafish (PMID: 29974860)?

      We answer Points 3 & 4 together here.

      We used DIOPT (DRSC Integrative Ortholog Prediction Tool) to identify the zebrafish paralogue with the highest orthology score to each human gene. This tool integrates predictions from 20 orthology databases to generate a composite score, and for each gene we selected the paralogue with the highest score. For example, we selected ptenb over ptena because it showed higher predicted orthology to human PTEN.

      This approach has important limitations. Orthology scores do not necessarily predict functional equivalence (i.e., the "most orthologous" paralogue may not be the one with the most relevant adipose tissue function in zebrafish), so by testing only one paralogue we may have missed genuine hits where the "less orthologous" paralogue carries the relevant adipose function.

      Our findings with Foxp1 paralogues both validate this approach and reveal its limitations. The higher-scoring paralogue foxp1b (DIOPT score = 13/19) showed the more severe phenotype, validating our prioritisation. However, the lower-scoring paralogue foxp1a (DIOPT score = 5/19), which we tested subsequently, showed a distinct but significant phenotype (altered spatial patterning) – a finding that would have been missed had we not pursued secondary validation.
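      The selection rule can be sketched as follows, using the two foxp1 DIOPT scores quoted above. The dictionary layout and the function name `pick_paralogue` are hypothetical; this is an illustration of "take the highest-scoring paralogue", not the authors' actual pipeline.

```python
# DIOPT scores for the foxp1 paralogues as quoted in the response (out of 19).
diopt_scores = {
    "FOXP1": {"foxp1a": 5, "foxp1b": 13},
}


def pick_paralogue(scores: dict[str, int]) -> str:
    """Return the paralogue with the highest score.

    On a tie, max() keeps the first key encountered, so a real screen
    would need an explicit tie-breaking rule (an open design choice).
    """
    if not scores:
        raise ValueError("no paralogues to choose from")
    return max(scores, key=scores.get)


selected = {gene: pick_paralogue(s) for gene, s in diopt_scores.items()}
# Paralogues left untested by this rule, which is the limitation discussed above.
untested = {
    gene: sorted(set(s) - {selected[gene]}) for gene, s in diopt_scores.items()
}
```

      Listing the `untested` paralogues alongside the selected ones makes the screen's blind spots explicit.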

      For future screens where comprehensive hit identification is the goal, multiplexed targeting of all paralogues would be valuable, though this may complicate interpretation of paralogue-specific phenotypes. We have discussed this in the Discussion.

      (5) General framework and limitations: The analysis platform presented in the manuscript cannot separate the developmental effects from adipose tissue plasticity/remodeling. Potential approaches that may help address this concern include: (a) establishing a baseline model to illustrate how WT fish respond to high-fat diet (HFD); (b) showing how mutants with hyperplasticity (opposite effects of foxp1 mutants) respond to HFD; (c) examining whether foxp1 gene expression level changes in response to HFD. However, these approaches (especially a and b) would require extensive experimental work and may be beyond the scope of this study. Without further evidence or data support of adipose tissue plasticity and remodeling, the author may want to emphasize in the background and discussion sections how adipose tissue development may affect plasticity and adaptation, and soften the tone of how genes may directly regulate adipose tissue plasticity and adaptation.

      We thank the reviewer for this comment about the relationship between adipose development and plasticity/remodelling. We agree this is an important issue as we are looking in juvenile fish that are still growing. Therefore, when we feed them HFD and see LDs get bigger – is this diet-induced remodelling or just accelerated normal development (ie, growth that would happen anyway, but occurring faster due to more nutrients)?

      To address the reviewer's specific suggestions:

      (A) Baseline model of WT HFD response: We have now performed detailed spatial analysis of WT responses to HFD (new Fig 6H-M, Supplemental Figs. 8 & 9). This analysis establishes a comprehensive baseline for hypertrophic responses to HFD in developing adipose tissue. In summary, WT fish show robust, statistically significant and spatially-graded hypertrophic responses to HFD across the entire AP axis, with responses ranging from +28.1 µm anteriorly to +17.7 µm posteriorly.

      We agree with the Reviewer that separating developmental from adaptive processes in growing juvenile fish is challenging. Importantly, we believe foxp1a mutants provide compelling genetic evidence that we are studying adaptive responses rather than purely developmental processes. foxp1a mutants have normal baseline LD sizes on control diet (demonstrating foxp1a is not required for developmental adipose expansion), yet when challenged with HFD show significantly reduced hypertrophic expansion and reduction of spatial gradient. This genetic dissociation strongly argues we are observing adaptive capacity rather than developmental growth rate.

      (B) Hyperplastic mutants:

      We agree that analysis of hyperplastic mutants would provide valuable complementary information about tissue remodelling capacity. However, as the reviewer anticipated, this would require: (1) generating stable lines of the appropriate hyperplastic mutants, (2) conducting paired HFD feeding studies, (3) performing spatial morphometric analysis comparable to our foxp1 studies, and (4) potentially distinguishing hyperplastic vs hypertrophic contributions to expansion. We agree this constitutes substantial additional experimental work beyond the scope of the current manuscript, though it represents an important direction for future studies.

      (C) foxp1 expression changes in HFD:

      Unfortunately, we do not have SAT samples from HFD-treated fish preserved for RNA analysis, and therefore cannot assess whether foxp1 expression levels change in response to dietary challenge. This would be valuable for future studies to determine whether foxp1 genes are dynamically regulated during metabolic adaptation or function as constitutive regulators of adaptive capacity.

      Following the Reviewer's guidance, we have revised throughout the manuscript to more carefully distinguish developmental patterning from metabolic adaptation.

      (6) Title: In the absence of experimental results that can distinguish between developmental effects from adipose tissue plasticity/remodeling, such as those mentioned above, the manuscript title is not accurate and should therefore be revised to be something like "hyperplastic and hypertrophic adipose morphology."

      We have now altered the title as the Reviewer suggested to “A quantitative in vivo CRISPR-imaging platform identifies regulators of hyperplastic and hypertrophic adipose morphology in zebrafish”

      Minor:

      (7) In mice studies, deleting foxp1b in adipose tissue protects mice from diet-induced obesity, while overexpressing foxp1b in adipose tissue promotes diet-induced obesity (Liu et al., Nature Communication, 2019). These overall phenotypes and foxp1b-mediated effects appear to be contradictory to what is observed in the zebrafish model. Can the authors also provide more evidence/discussion on why such a difference occurs comparing zebrafish and mice models?

      We thank the reviewer for this important comparison. We believe the apparent contradictions reflect (1) differences in adipose tissue thermogenic capacity, possibly between species but also between functionally distinct depots, and (2) whole-organism versus tissue-specific experimental approaches.

      (1) Different adipose tissue biology: browning-prone vs browning-resistant adipose

      Liu et al. (2019, PMID: 31699980) demonstrated that adipose-specific deletion of Foxp1 in mice increases thermogenesis and browning of SAT, with protection from diet-induced obesity (DIO) and improved insulin sensitivity. Conversely, Foxp1 overexpression impaired adaptive thermogenesis and promoted DIO. Mechanistically, Foxp1 directly represses β3-adrenergic receptor transcription, thereby inhibiting the thermogenic program. Strikingly, mouse Foxp1-deleted adipocytes displayed smaller, multilocular lipid droplets characteristic of brown/beige adipocytes.

      These morphological outcomes initially appear opposite to our zebrafish findings: mouse Foxp1 mutants have smaller adipocytes (due to browning), while zebrafish foxp1b mutants have larger lipid droplets (hypertrophy). We believe this fundamental difference may reflect the propensity of adipose tissue to undergo adaptive thermogenesis.

      While it was recently discovered that zebrafish possess thermogenic epicardial adipose tissue (PMID: 38507414), in general zebrafish adipose is not considered thermogenic, and zebrafish as ectotherms are thought to lack adaptive thermogenesis for thermoregulation. The exact thermogenic potential of zebrafish adipose remains to be fully characterised, but potential differences in thermogenic capacity between mouse and zebrafish adipose may help explain the distinct phenotypic outcomes.

      Importantly, Liu et al. studied mouse inguinal subcutaneous WAT - the depot most prone to browning in rodents. It remains unclear what role Foxp1 plays in browning-resistant mammalian WAT depots, where thermogenic conversion does not readily occur. In such depots, Foxp1 loss might produce phenotypes more similar to our zebrafish findings - dysregulated white adipose function without browning.

      The above hypothesis suggests that browning responses may mask other roles for Foxp1 in WAT. Interestingly, although not quantified in the paper, Liu et al.’s Foxp1 overexpression model (Ap2-Foxp1) appeared to reduce adipocyte size despite suppressing Ucp1 expression and reducing lipolysis. These data suggest more complex roles and indicate that Foxp1’s control of adipocyte size might extend beyond simply regulating thermogenesis and may involve coordinating the balance between hyperplastic versus hypertrophic expansion.

      Furthermore, human subcutaneous WAT is not as prone to browning as mouse inguinal WAT. Human browning occurs primarily in specialised depots (e.g. supraclavicular, deep neck), while the majority of human adipose tissue represents constitutive white adipose with limited thermogenic capacity. Therefore, it remains an open question whether FOXP1's primary physiological role in humans relates to thermogenesis regulation (in specialised depots) or white adipose metabolic control (in the majority of adipose tissue). Zebrafish findings examining constitutive WAT function (admittedly the lack of adaptive thermogenesis in zebrafish is presumed at this stage) may be more relevant to human adipose than they initially appear.

      (2) Whole-organism vs tissue-specific effects on metabolic health

      A second apparent contradiction concerns metabolic outcomes: mouse adipose-specific Foxp1 deletion improves metabolic health (Liu et al.), whereas our zebrafish whole-organism foxp1b mutants display metabolic dysfunction (baseline hypertrophy, impaired HFD response, hyperglycaemia and fatty liver). We believe this discrepancy reflects comparison of whole-animal mutants (zebrafish) to tissue-specific deletions (mouse), rather than opposite adipose tissue functions.

      Critically, Foxp1 has established roles in hepatic glucose metabolism. Zou et al. (PMID: 26504089) demonstrated that hepatic Foxp1 inhibits expression of gluconeogenesis genes and decreases hepatic glucose production and fasting blood glucose by competing with Foxo1 for binding of insulin responsive gluconeogenic genes. In line with these observations, we observe fatty liver and hyperglycaemia in foxp1a;foxp1b double mutant zebrafish (data not shown), suggesting that the metabolic dysfunction in our whole-animal mutants may be driven primarily by hepatic Foxp1 loss rather than adipose-specific effects.

      We have expanded on the points raised here in the Discussion.

      (8) Line 522-524: "The major phenotype in foxp1a mutants was impaired adipose expansion following HFD, suggesting failure to respond to diet-induced stress signals". In the presented Figure 6j, foxp1a mutant expands adipose LD size following HFD, similar to the control, which is contradictory to the statement above. Please clarify.

      We thank the reviewer for highlighting this apparent inconsistency and apologise for imprecise wording. These measurements are actually consistent but refer to different scales of analysis.

      Tissue level (Supplementary Fig. 7): foxp1a mutants show significantly reduced total adipose expansion (based on whole-animal Nile Red images) compared to wild-type fish on HFD—this is what we refer to as "impaired adipose expansion."

      Cellular level (Fig. 6L-M): At the individual adipocyte level, foxp1a mutants show statistically significant increases in LD diameter following HFD. However, the magnitude is reduced by ~35% compared to wild-type (mutants: +14.4 µm; WT: +22.2 µm; p = 0.023).
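      The ~35% figure follows directly from the two increases quoted above; a small check, with the function name `relative_reduction` chosen here purely for illustration:

```python
def relative_reduction(reference: float, observed: float) -> float:
    """Percentage by which `observed` falls short of `reference`."""
    return (reference - observed) / reference * 100.0


wt_increase_um = 22.2      # WT increase in LD diameter on HFD, from the text
mutant_increase_um = 14.4  # foxp1a mutant increase, from the text

reduction_pct = relative_reduction(wt_increase_um, mutant_increase_um)
print(f"{reduction_pct:.1f}%")  # prints "35.1%"
```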

      We have revised the text to more precisely state "reduced adipose expansion" rather than "impaired expansion" to avoid implying complete failure to respond.

    1. Sexual Deepfakes: Analysis of the Issues, Legal Framework, and Protection Mechanisms

      Summary

      This summary document analyses the critical issues surrounding sexual deepfakes, as presented during the Centre Hubertine Auclert webinar.

      Sexual deepfakes are a serious form of sexist and sexual cyberviolence, part of a continuum of domination and objectification of women.

      Key takeaways:

      • Targeted violence: 98% of deepfake videos online are pornographic.

      Victims are overwhelmingly women (82%) and minors (55%).

      • Technical accessibility: The emergence of "nudification" apps (nudify apps) makes it possible to virtually undress anyone from a simple selfie, with no technical skills.

      • Legislative change: France strengthened its legal arsenal in 2024 with Article 226-8-1 of the Penal Code (SREN law), which specifically criminalises non-consensual, algorithmically generated sexual montages.

      • Urgency of platform accountability: Associations denounce a system in which safety rests on the victim rather than on the platforms.

      Technologies such as "hashing" (the Disrupt system) are essential to preventing virality.

      --------------------------------------------------------------------------------

      1. Definitions, Origins and Mechanisms

      Origin of the Term and Nature of the Violence

      The term "deepfake", which appeared on Reddit in 2017, is a contraction of deep learning and fake.

      Its origin is intrinsically sexist: the term was popularised by a user who posted doctored pornographic videos of celebrities without their consent.

      • Definition: Inserting a person's image or voice into intimate, sexual or pornographic content without their consent.

      • Purpose: To harass, humiliate, discredit or blackmail.

      "Nudification" Apps

      The industrialisation of this violence is facilitated by nudify apps.

      These tools, often free or cheap, are trained specifically to generate nude bodies from ordinary photos.

      Their widespread availability on app stores (Apple, Google) helps normalise the production of non-consensual sexual images.

      --------------------------------------------------------------------------------

      2. Scale of the Phenomenon and Statistical Data

      Recent surveys (notably the 2025 survey conducted by Féministes contre le cyberharcèlement, Point de Contact and Stop Fisha) reveal a systemic pattern:

      | Victim / perpetrator profile | Key statistics |
      | --- | --- |
      | Women and girls | 82% of victims of sexist and sexual cyberviolence. |
      | Minors | 55% of victims. |
      | Minoritised groups | 85% of LGBTQA+ people and 71% of racialised people are affected. |
      | Known perpetrators | 85% are men. |
      | Psychological impact | Serious consequences for 24% of victims (depression; suicidal thoughts in nearly 50% of young victims). |

      The Continuum of Violence

      Cyberviolence does not arise ex nihilo; it extends existing relations of domination (sexism, racism, ableism).

      In 60% of cases, online violence is linked to offline violence.

      --------------------------------------------------------------------------------

      3. Legal Framework and Legal Issues

      Evolution of French Law

      Before 2024, legal remedies relied on invasion of privacy, identity theft or harassment.

      The SREN law (2024) introduced Article 226-8-1 of the Penal Code:

      • Offence: Publishing a sexual montage or algorithmically generated (AI) content without consent.

      • Penalties: 2 years' imprisonment and a €60,000 fine.

      • Aggravating circumstance: If the content is distributed via an online communication service (social networks), the penalties rise to 3 years' imprisonment and a €75,000 fine.

      European Regulations

      • Digital Services Act (DSA): Moderation and transparency obligations for platforms.

      • AI Regulation (AI Act): Obligation to label AI-generated or AI-modified content.

      • EU Directive (2024): Recognition of non-consensual sharing of intimate content as a form of gender-based violence.

      The Grok Case (X/Elon Musk)

      In late 2025, the "Grok" AI integrated into X generated millions of non-consensual sexual images within a few days.

      • Data: 53% of the images produced were sexualised, 80% depicted women, and 2% depicted minors.

      • Response: A criminal investigation was opened in France in January 2025, including a search of X's offices in February.

      --------------------------------------------------------------------------------

      4. Support Pathway and Technical Means

      Evidence Collection: Crucial Reflexes

      Despite the urge to delete the content immediately, the victim should first:

      • Take screenshots: Include the metadata (URL, date, time, username).

      • Download the content: Store it securely (USB drive).

      • Preserve the context: Keep any associated blackmail messages or insults.

      Reporting and Protection Mechanisms

      • Pharos: Government platform for manifestly illegal content.

      • Trusted Flaggers: Associations (such as Point de Contact) whose reports receive priority handling from platforms and from Pharos.

      • Disrupt system: "Hashing" technology (a unique digital signature) that identifies a piece of content in order to prevent its distribution or redistribution on partner platforms.

      • Helplines: 3018 (young people/cyberbullying), 3919 (violence against women).
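      The hashing idea behind the Disrupt entry in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the summary does not describe how Disrupt works internally, so SHA-256 from Python's standard library is swapped in here, and the `BlockList` class is hypothetical. A cryptographic hash only matches byte-identical copies; production systems of this kind often rely on perceptual hashing so that resized or re-encoded copies still match.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a fixed-length fingerprint of the content (SHA-256, hex)."""
    return hashlib.sha256(data).hexdigest()


class BlockList:
    """Hypothetical shared list of fingerprints of reported content."""

    def __init__(self) -> None:
        self._hashes: set[str] = set()

    def register(self, data: bytes) -> str:
        """Record the fingerprint of reported content; the content itself is not stored."""
        digest = content_hash(data)
        self._hashes.add(digest)
        return digest

    def is_blocked(self, data: bytes) -> bool:
        """Check an upload against the fingerprints already reported."""
        return content_hash(data) in self._hashes


blocklist = BlockList()
blocklist.register(b"reported image bytes")

exact_copy_blocked = blocklist.is_blocked(b"reported image bytes")
other_content_blocked = blocklist.is_blocked(b"unrelated image bytes")
```

      Only the fingerprint needs to be shared with partner platforms, which is why a victim does not have to circulate the image again in order to have it blocked.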

      Specialised Associations

      • Stop Fisha: Legal and psychological support.

      • Féministes contre le cyberharcèlement: Advocacy and training.

      • En Avant Toute(s): Anonymous chat for young victims.

      --------------------------------------------------------------------------------

      5. Recommendations for Lasting Protection

      The document stresses that digital safety must no longer rest solely on the vigilance of women users, but on the policy and technical choices made by platforms.

      Recommended areas for improvement:

      • Protective settings by default: Require platforms to adopt de-amplification and pre-emptive removal policies.

      • Interoperable reporting: Allow a victim to report content once and have it handled across all platforms.

      • Training for professionals: Strengthen initial and ongoing training for law enforcement, judges and prosecutors, and healthcare staff on the specifics of gender-based cyberviolence.

      • Education from an early age: Integrate the concepts of digital consent, respect for privacy and gender equality into school curricula.

      • Holistic platform: Create a one-stop shop offering legal, technical and psychological support for all victims.

    1. 21.1.3. Automation. We also covered a number of topics in automation, such as: History of Programming; Python Programming Language; JupyterHub and JupyterNotebooks; Variables; Data types (e.g., numbers, strings); Reddit Praw library (posting, searching, etc.); Other code libraries (e.g., time); For Loops; Conditionals (if/else); Lists; Dictionaries; Functions (calling, and writing our own); Sentiment Analysis; Recursion (for printing tweets and replies). We hope that by the end of this course, you have a familiarity of what programming is and some of what you can do with it. We particularly hope you have a familiarity with basic Python programming concepts, and an ability to interact with Reddit using computer programs.

      In section 21.1.3 on Automation, we learned about the basics of programming and how it can be used to automate tasks. This included key concepts such as variables, data types, loops, conditionals, lists, dictionaries, and functions. We also used Python and tools like Jupyter Notebook, and explored libraries such as PRAW for interacting with Reddit. In addition, we were introduced to more advanced ideas like sentiment analysis and recursion. Overall, this section helped us understand how programming works and how it can be used to build simple automated programs.

    2. We also covered a number of topics in automation, such as: History of Programming; Python Programming Language; JupyterHub and JupyterNotebooks; Variables; Data types (e.g., numbers, strings); Reddit Praw library (posting, searching, etc.); Other code libraries (e.g., time); For Loops; Conditionals (if/else); Lists; Dictionaries; Functions (calling, and writing our own); Sentiment Analysis; Recursion (for printing tweets and replies). We hope that by the end of this course, you have a familiarity of what programming is and some of what you can do with it. We particularly hope you have a familiarity with basic Python programming concepts, and an ability to interact with Reddit using computer programs.

      The most difficult part of this course, but I learned a lot!

    3. 21.1. What We Covered. We covered a lot of topics in this book, and we hope you learned something and found it valuable! 21.1.1. Social Media. We covered a number of topics in relation to social media: Bots; Data; History of Social Media; Authenticity; Trolling; Data Mining; Privacy and Security; Accessibility; Recommendation Algorithms; Virality; Mental Health; Content Moderation; Content Moderators; Crowdsourcing; Harassment; Public Shaming; Capitalism; Colonialism. We hope that by the end of this book you know a lot of social media terminology (e.g., context collapse, parasocial relationships, the network effect, etc.), that you have a good overview of how social media works and is used, and what design decisions are made in how social media works, and the consequences of those decisions. We also hope you are able to recognize how trends on internet-based social media tie to the whole of human history of being social and can apply lessons from that history to our current situations. 21.1.2. Ethics. We covered a number of ethics frameworks and you got practice applying them in different situations: Confucianism; Taoism; Virtue Ethics; Aztec Virtue Ethics; Natural Rights; Consequentialism; Deontology; Ethics of Care; Ubuntu; American Indigenous Ethics; Divine Command Theory; Egoism; Existentialism; Nihilism. We hope that by the end of this book, you have a familiarity with applying different ethics frameworks, and considering the ethical tradeoffs of uses of social media and the design of social media systems. Again, our goal has been not necessarily to come to the “right” answer, but to ask good questions and better understand the tradeoffs, unexpected side-effects, etc. 21.1.3. Automation. We also covered a number of topics in automation, such as: History of Programming; Python Programming Language; JupyterHub and JupyterNotebooks; Variables; Data types (e.g., numbers, strings); Reddit Praw library (posting, searching, etc.); Other code libraries (e.g., time); For Loops; Conditionals (if/else); Lists; Dictionaries; Functions (calling, and writing our own); Sentiment Analysis; Recursion (for printing tweets and replies). We hope that by the end of this course, you have a familiarity of what programming is and some of what you can do with it. We particularly hope you have a familiarity with basic Python programming concepts, and an ability to interact with Reddit using computer programs.

      One thing I found interesting is how this course connects social media, ethics, and programming together. It made me realize that technologies like algorithms and bots are not just technical tools, but also have ethical and social consequences. Understanding these connections helps us think more critically about how social media platforms are designed and used.

    1. The last code block does not need a break, the block breaks (ends) there anyway.

      The last code block does not need a break, because the block ends there anyway.

    2. This will stop the execution of code, and no more cases are tested.

      This will stop the execution of the code, and no further cases will be tested.

    1. Author response:

      Common responses:

      We thank the editors for considering our paper and the reviewers for their thoughtful and detailed feedback. Based on the comments, we will revise our manuscript to better describe how our approach differs from modeling strategies that are common in the field. We also aim to elaborate on the advantages of fastFMM and what scientific questions it is designed to answer. Finally, we will provide more background on our example analyses and the interpretation of the results.

      Within this response, “within-trial timepoints”, “time-varying predictors/behaviors”, and “signal magnitude” are used as specific examples of the general concepts of “functional domain”, “functional covariates”, and “functional outcome”, respectively. To make statements or examples more concrete, we may use the former neuroscience-specific terms when making general claims about functional models.

      - ncFLMM, cFLMM: non-concurrent or concurrent functional linear mixed models.

      - FUI: fast univariate inference. An approximation strategy to perform FLMM (Cui et al., 2022).

      - fastFMM: the R package that implements FUI.

      - CI: confidence interval.

      Before specific line-by-line responses, we provide a brief comparison between cFLMM and fixed effects encoding models. All three reviewers suggested that fixed effects models could be an existing alternative to cFLMM (Reviewer 1 (1B), Reviewer 2 (2C), Reviewer 3 (3A)). Their shared comments highlight that our revision should articulate the advantages and applications of cFLMM relative to existing analysis strategies.

      Functional regression methods like cFLMM produce functional coefficient estimates that quantify how the magnitude of predictor-signal associations evolves across an ordered functional domain such as within-trial timepoints. Standard scalar outcome regression methods, like the GLMs specified in Engelhard et al. (2019), model these associations and their corresponding coefficients as fixed across the functional domain. While GLM encoding models may include time-varying predictors, these analysis strategies do not model the predictor–signal association as changing over the functional domain.
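      The contrast drawn above can be written out schematically. The notation below is an assumption introduced for illustration (animal $i$, trial $j$, within-trial timepoint $s$, a single time-varying predictor $X_{ij}(s)$, and a random intercept $\gamma_{i}(s)$); it is a sketch of the general model forms, not the exact specification used in the manuscript.

```latex
% Fixed-coefficient model: the predictor varies over s, the association does not.
Y_{ij}(s) = \beta_0 + X_{ij}(s)\,\beta_1 + \varepsilon_{ij}(s)

% Concurrent functional mixed model: the association itself is a function of s.
Y_{ij}(s) = \beta_0(s) + X_{ij}(s)\,\beta_1(s) + \gamma_{i}(s) + \varepsilon_{ij}(s)
```

      In the first form a single number $\beta_1$ summarises the predictor–signal association; in the second, $\beta_1(s)$ is the functional coefficient whose shape across the trial is the target of inference.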

      Moreover, encoding models are less suited to hypothesis testing in clustered or longitudinal settings (e.g., repeated-measures datasets) and yield regression coefficient estimates that are only interpretable with respect to the units of the basis functions. In contrast, cFLMM provides time-varying coefficient estimates that are interpretable as statistical contrasts in terms of the original variables and produces hypothesis tests in clustered settings. cFLMM can be applied to datasets that define covariates in terms of the same flexible representations of covariates used in encoding models; this is a modeling choice rather than a methodological characteristic.

      The remainder of this provisional author response will respond to reviewers’ concerns line-by-line, approximately in the order they appear.

      Reviewer #1 (Public review):

      We thank Reviewer 1 for their comments, especially their efforts to provide first-hand experience with loading and applying fastFMM. We hope that recent improvements to fastFMM’s public release and vignettes address Reviewer 1’s concerns about ease-of-use.

      (1A) Overall, while they make a compelling case that this approach is less biased and more insightful, the implementation for many experimentalists remains challenging enough and may limit widespread adoption by the community.

      We believe the reviewer may have experimented with an old version of fastFMM, so their experience may not reflect recent rewrites and improvements. fastFMM v1.0.0+ is now stable, validated on CRAN, and contains new example data and step-by-step tutorials. We designed fastFMM’s model-fitting code to be similar to common GLM packages in R to reduce the learning curve for new users.

      (1B) …a clearer presentation of how common implementations in the field are performed (i.e. GLM) and how one could alternatively use the cFLMM approach would help.

      We will provide a clearer description of existing methods in the revised manuscript. Briefly, inference with fastFMM can accommodate large datasets that contain clustered data, repeated measures, or complex hierarchical effects, e.g., experiments with multiple animals and multiple trials per animal. When encoding models are fit to each cluster (e.g., animal, neuron) separately, we are not aware of a principled method to pool these cluster-specific models together to quantify uncertainty or yield an appropriate global hypothesis test.

      Reviewer #2 (Public review):

      Reviewer 2’s thoughtful feedback helped structure our points in the common response above, which we will refer to when applicable. In our response, we aim to clarify the problems that cFLMM solves and characterize the advantages in interpretability.

      (2A) The aim of incorporating variables that change within trial into this framework is interesting, and the technical implementation appears to be rigorous. However, I have some reservations as to whether the way in which variables that change within trial have been integrated into the analysis framework is likely to be widely useful, and hence how impactful the additional functionality of cFLMM relative to the previously published FLMM will be.

      We hope that the common response addresses these concerns. We were motivated to provide a concurrent extension of fastFMM based on our experience with statistical consulting in neuroscience research. Questions that benefit from a functional approach are common and often not adequately modeled with a non-concurrent approach, such as the variable trial length analysis we describe below.

      (2B) It is less clear that this approach makes sense for variables that change within trial…This partitioning of variance in the predictor into a between-trial component whose effect on the signal is modeled, and a within-trial component whose effect on the signal is not, is artificial in many experiment designs, and may yield hard to interpret results.

      We thank Reviewer 2 for highlighting a point that we did not adequately explain and that we will address further in the revision. The pointwise and joint CIs estimated by fastFMM account for uncertainty in the coefficient estimates due to variation in the predictors across within-trial timepoints. cFLMM targets a statistical quantity, or estimand, that is defined by trial timepoint specific effects, so the first step of our estimation strategy fits separate pointwise mixed models. However, models from every within-trial timepoint are then combined to calculate uncertainty and smooth the coefficient estimates. Thus, the widths of the pointwise and joint CIs depend on the estimated between-timepoint covariance and a smoothing penalty. Loewinger et al. (2025a) provides further details in Appendices 2 and 3, describing the covariance structure and detailing the power improvements of FUI compared to multiple-comparisons corrections.

      Other functional regression estimation strategies jointly fit the entire model with a single regression, e.g., functional generalized estimating equations (Loewinger et al., 2025b). However, these methods use basis expansions of the coefficients. In contrast, the encoding models mentioned in 2C below and Reviewer 3 (3A) apply basis expansions of the covariates, and the resulting model does not capture how signal–covariate associations evolve across some functional domain. Although the first stage in the fastFMM approach fits pointwise linear models, this is only one of three steps in the estimation strategy. fastFMM yields coefficient estimates comparable to those that would be obtained from functional regression estimation strategies that jointly estimate the functional coefficients in a single regression. We mention this to distinguish between the target statistical quantity (functional coefficients) and the estimation strategy (pointwise vs. joint).

      (2C) …an alternative approach would be to run a single regression analysis across all timepoints, and capture the extended temporal responses to discrete behavioural events by using temporal basis functions convolved with the event timeseries. This provides a very flexible framework for capturing covariation of neural activity both with variables that change continuously such as position, and discrete behavioural events such as choices or outcomes, while also handling variable event timing from trial-to-trial.

      Our understanding is that the suggested approach aims to quantify the association between the outcome and within-trial patterns in covariates. This is a great question and we will incorporate a discussion of this into the revision. However, temporal basis functions convolved with the covariate time series cannot directly characterize these relationships. Encoding models can detect the contribution of predictors to neural signals while remaining agnostic to the precise relationship, but this flexibility can come at the cost of interpretability. The coefficients of the convolutions may not be translatable into a clear statistical contrast in terms of the original covariates.

      In our paper, we provide examples of cFLMM models with simple signal-covariate relationships. The coefficient estimates quantify the expected change in signal given a one unit change in the original predictors. Let 𝑌(𝑠) be the outcome and 𝑋(𝑠) be some covariate at within-trial timepoint 𝑠. For brevity, we will suppress subject/trial indices and random effects in the following notation. The coefficient at time point 𝑠 can be captured by the generic mean model

      𝔼[𝑌(𝑠) ∣ 𝑋(𝑠) = 1] − 𝔼[𝑌(𝑠) ∣ 𝑋(𝑠) = 0].

      In contrast, the change in signal associated with patterns in within-trial covariates can be written as

      𝔼[𝑌 (𝑠<sub>1</sub>) ∣ 𝑋(𝑠<sub>2</sub>) = 1] − 𝔼[𝑌 (𝑠<sub>1</sub>) ∣ 𝑋(𝑠<sub>2</sub>) = 0]

      for all pairs of timepoints 𝑠<sub>1</sub>, 𝑠<sub>2</sub>. While simple lagged or offset outcome-predictor associations can be incorporated as covariates in cFLMM, the approach does not capture all within-trial timepoints 𝑠<sub>1</sub>, 𝑠<sub>2</sub>. Encoding models also do not target the above estimand. Instead, a full function-on-function regression could estimate the above. This topic can be incorporated into our revision and may be a future line of inquiry.
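The concurrent contrast 𝔼[𝑌(𝑠) ∣ 𝑋(𝑠) = 1] − 𝔼[𝑌(𝑠) ∣ 𝑋(𝑠) = 0] can be illustrated with a minimal sketch. It assumes a binary covariate and computes only the raw difference of conditional means at each within-trial timepoint. It omits random effects, smoothing, and confidence intervals, so it is not the fastFMM estimator; the function name and toy data are hypothetical.

```python
from statistics import mean


def pointwise_contrast(y, x):
    """Difference in mean signal between X(s)=1 and X(s)=0 at each timepoint s.

    y, x: lists of trials; each trial is a list over within-trial timepoints.
    Returns None at timepoints where one of the two groups is empty.
    """
    n_time = len(y[0])
    betas = []
    for s in range(n_time):
        on = [y_i[s] for y_i, x_i in zip(y, x) if x_i[s] == 1]
        off = [y_i[s] for y_i, x_i in zip(y, x) if x_i[s] == 0]
        betas.append(mean(on) - mean(off) if on and off else None)
    return betas


# Toy data: 4 trials, 3 timepoints. The covariate switches on at
# different timepoints in different trials (it varies within trial).
y = [[1.0, 2.0, 3.0], [1.0, 4.0, 3.0], [1.0, 2.0, 5.0], [1.0, 4.0, 5.0]]
x = [[0, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1]]

betas = pointwise_contrast(y, x)
print(betas)  # [None, 2.0, 2.0]
```

Each entry pairs the signal at timepoint 𝑠 only with the covariate at the same 𝑠. The cross-timepoint contrast over pairs 𝑠<sub>1</sub>, 𝑠<sub>2</sub> would need a two-dimensional coefficient surface, which is the function-on-function setting mentioned above.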

      (2D) In the Machen et al. data…From the resulting beta coefficient timeseries (Figure 3C) it is not straightforward to understand how neural activity changed as the subject approached and then received the reward. A simpler approach to quantify this, which I think would have yielded more interpretable coefficient timeseries would have been to align activity across trials on when the subject obtained the reward. More broadly, handling variable trial timing in analyses like FLMM which use trial aligned data, can be achieved either by separately aligning the data to different trial events of interest or by time warping the signal to align multiple important timepoints across trials.

      In this experiment, mice waited in a trigger zone, ran through a linear corridor, then received a reward of either water or strawberry milkshake in the reward delivery zone (Machen et al., 2026). Mice received different rewards between sessions but the same reward within all trials of a given session. This design complicated the analysis, as the reward type produced prominent differences in average latency (water: 3.3 seconds, milkshake: 2.0 seconds). The authors wanted to disentangle whether mean differences in the signal across reward types reflected differences in motivation to obtain the reward or differences in reaction to reward receipt.

      We agree that performing a reward-aligned analysis would be an intuitive approach to visualize the differences in average signal for mice that received milkshake compared to water. In fact, we provide an ncFLMM reward-aligned analysis in Figure S1 of Machen et al. (2025). We will add this analysis to the revision and thank the reviewer for the suggestion. We emphasize, however, that this method answers a different question. It does not identify how the signal change associated with receiving the milkshake evolves with respect to latency, especially if the relationship is non-linear. Time warping faces similar obstacles in this setting, especially since sufficiently flexible curve registration can induce similarity due purely to noise. Generally, time warping does not lend itself to hypothesis testing as it is unclear how to propagate uncertainty from the time warping model into final hypothesis tests.

      We believe cFLMM is an appropriate choice for the specific question, and we will revise the manuscript to better reflect its advantages. The functional coefficient estimates in Figures 3C-iii and 3C-iv provide insights that are not possible to derive from the proposed alternatives. For example, we can infer that for short latencies, we do not see a significant difference in signal magnitude for mice receiving water and mice receiving the milkshake. However, for latencies longer than around 2 seconds, receiving the milkshake is associated with an additional positive change in signal. We agree that we should make Figure 3C and the accompanying discussion more clear and thank Reviewer 2 for their feedback on interpretation.

      Reviewer 3 (Public review):

      (3A) …it is not clear what the conceptual or methodological advance of this work is. As it is written, the manuscript focuses on showing how concurrent regressors offer interpretation advantages over non-concurrent regressors. While the benefit of such time-varying regressors is supported by previous literature (e.g., Engelhard et al., 2020), it is not clear whether the examples provided in the current study clearly support the advantage of one over the other…

      We assume Reviewer 3 is referencing “Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons” (Engelhard et al., 2019). We hope that the Common response sufficiently contrasts the settings where each approach can be applied. Because these models have different goals and assumptions, they are appropriate for answering different questions.

      (3B) In this specific example, if the question is about speed and reward type, why variables such as latency to reward or a binary “reward zone vs corridor” (RZ) regressors are used instead of concurrent velocity (or peak velocity - in the case of the non-concurrent model)? Furthermore, if timing from trial start to reward collection is variable, why not align to reward collection, which would help in the interpretation of the signal and comparison between methods? Furthermore, while for the non-concurrent method, the regressors' coefficients are shown, for the concurrent one, what seems to be plotted are contrasts rather than the coefficients. The authors further acknowledge the interpretational difficulties of their analysis.

      Thank you for pointing out that we were not clear. This was mentioned by multiple reviewers and highlights the need to elaborate on our motivation in the revision. In this example, we wanted to investigate the change in signal-reward association as a function of within-trial timepoints, not the association between instantaneous velocity and the signal. “Slow” or “fast” means “mouse with above or below average latency”, respectively. Please refer to our response to Reviewer 2 (2D), where we discuss why event alignment is an insufficient correction.

      The functional coefficient estimates in Figure 3C are interpreted as contrasts because the fixed effect coefficients capture the difference in expected signal between strawberry milkshake and water along the functional domain. An advantage of cFLMM is that it is easy to specify models in which the coefficients correspond to interpretable contrasts of the signal across conditions. The coefficient estimate shown in Figure 3B-ii also corresponds to a contrast because the estimates capture the difference in mean signal from strawberry milkshake and water. Equations (7) and (8) in the section “Materials and methods” and sub-section “Variable trial length analysis” provide additional details on the fixed effect coefficients. Based on this confusion, we will convert the two 1 x 4 sub-plots of 3B and 3C into two 2 x 2 sub-plots to avoid unintended direct comparisons.

      To contextualize how we “acknowledge the interpretational difficulties of [our] analysis”, we stated that a non-concurrent FLMM attempting to control for a time-based covariate is difficult to interpret. The concurrent FLMM provides a straightforward interpretation directly related to the question of interest, which we discuss above in Reviewer 2 (2D).

      (3C) Because the relation between behavioral variables and neuronal signal is not instantaneous, previous literature using fixed effects uses, for example, different temporal lags, splines, and convolutional kernels; however, these are not discussed in the manuscript.

      Thank you for this suggestion. All three reviewers raised this topic (see Reviewer 1 (1B), Reviewer 2 (2C), and the Common responses), and we will incorporate our response in the revision.

      (3D) From the methods, it seems that in the concurrent version of fastFMM, both concurrent and non-concurrent regressors can be included, but this is not discussed in the manuscript.

      This is an important point that we mentioned implicitly. In our cFLMM specification of the Jeong et al. (2022) model, “we incorporated trial-specific covariates for trial number and session, modeling these as increasing numerical values rather than identical categorical variables”, which are also plotted in Appendix 3. In Box 1, “if the functional covariate of interest is a scalar constant across the domain, the models fit by the concurrent and non-concurrent procedure are identical”. We will explicitly point out that cFLMM can perform inference on combinations of functional and constant covariates.

      (3E) The methodological advance is not clearly stated, apart from inputting into fastFMM a 3D matrix of regressors x trial x timepoint, instead of a 2D matrix of regressors x trial.

      Prior to our work described in this Research Advance, it was not obvious that the existing approximation approach in fastFMM could be generalized to cFLMM. During the writing of the article, a fastFMM user reached out for help with producing pseudo-concurrent FLMMs by duplicating rows in a non-concurrent model, which underscores both the unmet need for cFLMMs and the difficulty of fitting them with available tools.

      The “under-the-hood” differences are described in Appendix 4. Concurrent FLMM with fast univariate inference was theoretically possible as early as Cui et al. (2022). The univariate step was straightforward, but guaranteeing “fast” and “inference” was not. We needed to verify, for example, that the method-of-moments estimation of the random effects covariance matrix generalized to cFLMM, which is not a trivial step. Characterizing whether the method achieved asymptotic coverage required extensive simulation studies (Figure 4, Appendix 2). Future work may focus on fully characterizing the asymptotic convergence in high noise or high complexity regimes.

      (3F) This manuscript is neither a clear demonstration of the need for concurrent variables, nor a 'tutorial' of how to use fastFMM with the added extension.

      We hope that the Common responses clarify how cFLMM compares to existing approaches and fills a gap in the data analysis landscape for neuroscience. The fastFMM R package vignettes contain example analyses, and we intend for these files to work in tandem with the manuscript. To provide more guidance for interested analysts, we can explicitly reference these tutorials within the revision.

      Planned revisions

      The following summary is not exhaustive.

      Writing additions:

      Per 1B, 2C and 3A, the Common responses will be incorporated in the revision.

      Per 2B, we will discuss function-on-function regression and explore how to estimate statistical contrasts for complex within-trial relationships. Relatedly, we will clarify that the CIs in fastFMM are constructed using an estimate of the within-trial covariance of the predictors, and clarify the definition of pointwise and joint CIs.

      Per 3D, we will explicitly state that concurrent FLMMs can include covariates that are constant over within-trial timepoints.

      Though we cannot prescribe a universally correct model selection procedure, we will mention that AIC, BIC, and other summary statistics can inform the specification of the random effects.

      Analysis modifications:

      Parts of Appendix 3 may be included in Figure 2 to directly address the question investigated by Jeong et al. (2022) and Loewinger et al (2024).

      When discussing Machen et al. (2025) data, the supplementary analysis with reward-aligned ncFLMM models might be added to clarify the ncFLMM/cFLMM difference.

      Per Reviewer 2's comment on encoding models, the additional analysis aimed at disentangling latency and reward in Machen et al.’s variable trial length data may be incorporated as an additional sub-figure in Figure 3.

      Aesthetic changes:

      Figure 3 will be reorganized to avoid unintended direct comparisons between the coefficients of the non-concurrent and concurrent model.

      Citations for Machen et al. (2026) will be updated to reflect publication of the preprint.

      The version number for fastFMM will be updated.

      References

      Cui E, Leroux A, Smirnova E, Crainiceanu CM. Fast Univariate Inference for Longitudinal Functional Models. Journal of Computational and Graphical Statistics. 2022; 31(1):219–230. https://doi.org/10.1080/10618600.2021.1950006, doi: 10.1080/10618600.2021.1950006, PMID: 35712524.

      Engelhard B, Finkelstein J, Cox J, Fleming W, Jang HJ, Ornelas S, Koay SA, Thiberge SY, Daw ND, Tank DW, Witten IB. Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons. Nature. 2019 Jun; 570(7762):509–513. https://www.nature.com/articles/s41586-019-1261-9, doi: 10.1038/s41586-019-1261-9.

      Jeong H, Taylor A, Floeder JR, Lohmann M, Mihalas S, Wu B, Zhou M, Burke DA, Namboodiri VMK. Mesolimbic dopamine release conveys causal associations. Science. 2022; 378(6626):eabq6740. https://www.science.org/doi/abs/10.1126/science.abq6740, doi: 10.1126/science.abq6740.

      Loewinger G, Cui E, Lovinger D, Pereira F. A statistical framework for analysis of trial-level temporal dynamics in fiber photometry experiments. eLife. 2025 Mar; 13:RP95802. doi: 10.7554/eLife.95802.

      Loewinger G, Levis AW, Cui E, Pereira F. Fast Penalized Generalized Estimating Equations for Large Longitudinal Functional Datasets. ArXiv. 2025 Jun; p. arXiv:2506.20437v1. https://pmc.ncbi.nlm.nih.gov/articles/PMC12306803/.

      Machen B, Miller SN, Xin A, Lampert C, Assaf L, Tucker J, Herrell S, Pereira F, Loewinger G, Beas S. The encoding of interoceptive-based predictions by the paraventricular nucleus of the thalamus D2R+ neurons. iScience. 2026 Jan; 29(1):114390. doi: 10.1016/j.isci.2025.114390.

    1. How I Dropped Our Production Database and Now Pay 10% More for AWS
      • The author accidentally dropped their production database while using an AI agent (Claude Code) to manage AWS infrastructure via Terraform.
      • The incident occurred because the author attempted to merge two separate projects into one, ignoring the AI’s advice to keep them separate to save on VPC costs.
      • The AI agent generated a Terraform plan that included deleting existing resources to recreate them under the new unified structure.
      • The author authorized a terraform apply and subsequently a terraform destroy without carefully reviewing the plan, mistakenly believing the agent was only cleaning up temporary resources.
      • Because the author had not set up external backups and the automated RDS snapshots were deleted along with the instance, all data was initially lost.
      • AWS Support was miraculously able to recover a snapshot, though the author now pays 10% more for AWS due to implementing more robust (and expensive) backup and security measures.
      • The "lesson learned" highlights the dangers of "vibe engineering"—relying on AI agents to execute destructive commands without human oversight or a deep understanding of the underlying tools.

      Hacker News Discussion

      • Negligence Over AI Risk: Many commenters argue that the issue wasn't the AI itself, but the author's decision to bypass standard safety procedures, such as reviewing terraform plan before execution.
      • Critique of "Vibe Engineering": Users criticized the trend of letting LLMs handle infrastructure (IaC) without the human operator understanding the deterministic tools they are using.
      • Infrastructure Over-engineering: Several participants pointed out that the project seemed over-engineered with AWS and Terraform when a simple VPS or SQLite database might have sufficed and been easier to manage.
      • AWS Data Recovery: Former AWS employees expressed surprise that support could recover the data, noting that AWS typically treats a user-initiated deletion as a final security command to wipe the data.
      • The Importance of Staging: A recurring theme was that major migrations should be tested in a staging environment first; running unverified AI-generated scripts directly against production was labeled as "insanity."
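The "review the plan before applying" advice can be partly automated. The sketch below assumes the plan was saved with terraform plan -out=tfplan and exported with terraform show -json tfplan. It lists every resource whose planned actions include a delete, which covers delete-and-recreate replacements. The resource addresses in the sample are hypothetical, and a check like this supplements a human reading of the plan rather than replacing it.

```python
import json


def destructive_changes(plan: dict) -> list[str]:
    """Return addresses of resources whose planned actions include 'delete'.

    `plan` is the parsed JSON plan representation. A replacement appears
    as a two-element action list containing both 'delete' and 'create'.
    """
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hypothetical plan excerpt; a real one would come from
# json.load() on the output of `terraform show -json tfplan`.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_instance.main",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.logs",
     "change": {"actions": ["no-op"]}},
    {"address": "aws_vpc.shared",
     "change": {"actions": ["create"]}}
  ]
}
""")

flagged = destructive_changes(sample_plan)
print(flagged)  # ['aws_db_instance.main']
```

A wrapper script could refuse to run the apply step whenever the returned list is non-empty and no explicit human confirmation has been given.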
    1. The cellquant pipeline is distributed as a single Python script alongside an environment specification (environment.yml)

      I'm not sure there is any benefit to organizing this software as a single script rather than as a modern python package with a modular structure and a single CLI entrypoint. Single large scripts (this one is 2300 lines) are hard to work with for both humans and agents (they are hard to navigate for humans, and it is impossible for two agents to concurrently work on unrelated parts of the code, as they would be trying to edit the same file).

    2. A complete command line interface (CLI) reference (CLI_REFERENCE.md) documents every argument with its default value, valid range, and interaction with cell-type presets. This reference is structured to be readable by both humans and chatbots, so that a user can paste it into a conversation and ask the AI to find the relevant parameter.

      This kind of documentation is a double-edged sword, as it creates two sources of truth and is hard to keep in sync with the code (this is true for both humans and agents, who each tend to forget to update the docs). I would suggest eliminating this document in favor of writing and maintaining thorough docstrings in the code itself, which humans can introspect in their IDE and which agents will "know" to read without any prompting.
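One related way to keep a single source of truth, sketched below, is to put each argument's description and default in the argparse definition and generate any reference text from the parser itself. This is an illustration, not the cellquant code: the flag names, defaults, and presets are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Define the CLI once; help text and defaults live next to the code."""
    parser = argparse.ArgumentParser(
        prog="cellquant",
        description="Quantify cells in microscopy images (illustrative only).",
        # Appends each argument's default value to its help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    # Hypothetical arguments, used only to show the pattern.
    parser.add_argument(
        "--min-area", type=int, default=50,
        help="smallest object area in pixels kept after segmentation",
    )
    parser.add_argument(
        "--preset", choices=["generic", "neuron"], default="generic",
        help="cell-type preset that sets related parameters",
    )
    return parser


def cli_reference() -> str:
    """Render the reference text from the parser instead of a separate file."""
    return build_parser().format_help()


if __name__ == "__main__":
    print(cli_reference())
```

Because the reference is rendered from the same objects that parse the arguments, a changed default or a new flag appears in the output without a second document to update. A user who wants to paste the reference into a chat can paste the --help output instead.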

    1. The introduction of the three farm laws and implementation of the labour code are examples of this agenda.

      Why? Not clear how these push the digitisation agenda

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      The authors show experimentally that, in 2D, bacteria swim up a chemotactic gradient much more effectively when they are in the presence of lateral walls. Systematic experiments identify an optimum for chemotaxis for a channel width of ~8µm, a value close to the average radius of the circle trajectories of the unconfined bacteria in 2D. These chiral circles impose that the bacteria swim preferentially along the right-side wall, which indeed yields chemotaxis in the presence of a chemotactic gradient. These observations are backed by numerical simulations and a geometrical analysis.

      Reviewer #3 (Public review):

      This paper addresses, through experiment and simulation, the combined effects of bacterial circular swimming near no-slip surfaces and chemotaxis in simple linear gradients. The authors have constructed a microfluidic device in which a gradient of L-aspartate is established, to which bacteria respond while swimming while confined in channels of different widths. There is a clear effect that the chemotactic drift velocity reaches a maximum in channel widths of about 8 microns, similar in size to the circular orbits that would prevail in the absence of side walls. Numerical studies of simplified models confirm this connection.

      The experimental aspects of this study are well executed. The design of the microfluidic system is clever in that it allows a kind of "multiplexing" in which all the different channel widths are available to a given sample of bacteria.

      The authors have included a useful intuitive explanation of their results via a geometric model of the trajectories. In future work it would be interesting to analyze further the voluminous data on the trajectories of cells by formulating the mathematical problem in terms of a suitable Fokker-Planck equation for the probability distribution of swimming directions. In particular, this might help understand how incipient circular trajectories are interrupted by collisions with the walls and how this relates to enhanced chemotaxis.

      The authors argue that these findings may have relevance to a number of physiological and ecological contexts. As these would be characterized by significant heterogeneity in pore sizes and geometries, further work will be necessary to translate the present results to those situations.

      Thanks to the referees' input and further work, we think our revised manuscript now meets the high standard of eLife.

      Recommendations for the authors:

      The importance of the circular swimming chirality for the observed phenomenon could be further emphasized by actually using the word "chiral" or "chirality" in the text. Also, indicating what would change if swimming were counterclockwise rather than clockwise would help the reader understand the key significance of chirality.

      We thank the reviewer for this insightful suggestion. We agree that the chirality of the surface interaction is central to the observed phenomenon and should be explicitly highlighted to improve the reader's understanding.

      In response, we have incorporated the terms "chiral" and "chirality" throughout the manuscript (Abstract, Introduction, Results, and Discussion) to emphasize this aspect. Furthermore, we have added a specific explanation in the Results section (the last paragraph of subsection “The cells in the right sidewall region dominated the chemotaxis of E. coli with lane confinements”) detailing the hypothetical scenario of counter-clockwise swimming. We clarify that in such a case, the hydrodynamic interaction would cause cells to veer left, resulting in up-gradient accumulation along the left sidewall rather than the right. We believe these additions significantly improve the clarity of the underlying physical mechanism.

      Reviewer #1 (Recommendations for the authors):

      I still have several comments that the authors may want to consider for the last version.

      - The run and tumble behavior of the cells at the surface remains puzzling and would need some more explanation in the text. Tumbles with no significant reorientation angle amount largely to smooth swimmers. How can a model based on run-and-tumbles be used to explain the difference between LSW and RSW?

      We apologize for the lack of clarity regarding the surface run-and-tumble behavior. While it is true that surface tumbles often result in smaller reorientation angles compared to bulk swimming, they are not negligible and play a critical role in the observed asymmetry. As shown in the tumble angle distributions (Fig. 2E and 2F), the probability of a tumble angle exceeding π/2 is approximately 9% for sidewall trajectories and 30% for the middle area. This tumbling behavior leads to differences between the left sidewall (LSW) and right sidewall (RSW) in two key ways:

      First, as detailed in our geometric analysis (Fig. 6), running cells following stable clockwise circular paths are geometrically favored to reach the RSW. Because cells moving up-gradient (towards the RSW) experience suppressed tumbling, they maintain these stable circular trajectories and accumulate effectively. Conversely, cells moving down-gradient (towards the LSW) experience enhanced tumbling. These frequent interruptions distort the circular trajectories required to reach the LSW, resulting in fewer bacteria entering the LSW compared to the RSW.

      Second, once at the wall, the difference in tumbling frequency dictates retention. The majority of LSW cells are swimming down-gradient (LSW-DG) and thus tumble more frequently, increasing their probability of escaping the wall. The majority of RSW cells are swimming up-gradient (RSW-UG), which suppresses tumbles and increases their residence time at the wall.

      The relevant clarifications have been included in the last paragraph of “Results” in the manuscript.

      - Figure 5B would need more explanation. I still don't understand the different behaviors for the right and left side walls at small widths. Is it noise really or a more complex behavior? Since most of these calculations are based precisely on the shape of these curves it would be useful to discuss them in more detail.

      We apologize for the lack of clarity. The behavior observed at small widths in Figure 5B is not noise; rather, it reflects the idealized nature of our simulation model.

      In the simulation, bacteria were modeled as active particles without explicit steric exclusion for the flagella and cell body. Consequently, simulated cells retain the ability to reorient and turn freely even in very narrow lanes (w ≤ 6 μm), allowing the geometric sorting mechanism (which favors the RSW) to function efficiently even at small widths. This is why the simulation shows a distinct difference between LSW and RSW proportions in this regime.

      In the experimental reality, however, the finite size of the bacterial body and flagella creates steric hindrance. In narrow channels, this physical constraint restricts the cells' ability to turn, thereby disrupting the circular swimming mechanism required to sort cells into the RSW. As a result, experimental data shows that the proportions of LSW and RSW cells tend to equalize in narrow channels (e.g., w = 6 μm in Fig. 4B), leading to a lower chemotactic drift velocity than predicted by the simulation.

      We have added a discussion regarding these steric effects and the deviation at narrow widths to the Results section (the penultimate paragraph of subsection "Simulation of E. coli chemotaxis within lane confinement") in the revised manuscript.

      - The importance of the chirality of the circular trajectories, although essential, remains insufficiently mentioned in the text.

      We have incorporated the terms "chiral" and "chirality" throughout the manuscript (Abstract, Introduction, Results, and Discussion) to emphasize this aspect. Furthermore, we have added a specific explanation in the Results section (the last paragraph of subsection “The cells in the right sidewall region dominated the chemotaxis of E. coli with lane confinements”) detailing the hypothetical scenario of counter-clockwise swimming.

      - It would be useful to color-code the trajectories of Figure 1B and alike with time.

      Thank you for the suggestion. Now the trajectories in Fig. 1B have been redrawn. Distinct colors denote individual trajectories, with color intensity darkening to indicate time progression.

    1. Reviewer #3 (Public review):

      In the manuscript entitled "A Minimal tooth Enhancer Regulates dlx2b Expression During Zebrafish Tooth Formation: Insights into Cis-Regulatory Logic in Organogenesis", the authors explore the cis-regulatory logic of a dlx2b minimal enhancer capable of directing dlx2b gene expression to the developing tooth germs. The study combines (1) CRISPR-mediated GFP knock-in to track endogenous gene expression; (2) a promoter-bashing approach to identify a minimal tooth enhancer (MTE); (3) site-directed mutagenesis coupled with transgenesis to assess the individual role of conserved TF binding sites; and (4) in vivo deletion of the MTE to examine the consequences for gene expression. Overall, this is a technically solid study that provides some novel insights into tooth development and extends previous observations by the authors (Jackman & Stock, 2006; PNAS). However, the added value of the manuscript is limited by both the narrow experimental scope and the relatively modest impact of the findings for the broader field of developmental biology.

      Main concerns:

      (1) My main concern is that the study restricts the search for cis-regulatory information to the 5' region 4kb upstream of the TSS of the gene, rather than encompassing the full genomic locus. This is particularly limiting given that a knock-in allele was generated, which in principle allows interrogation of regulatory elements across the entire locus, and that the authors acknowledge the availability of genome-wide regulatory datasets (e.g. DANIO-CODE) in the Discussion. Despite this, no systematic effort is made to test additional regulatory elements beyond the proximal promoter/enhancers. This has important implications for the interpretation of the current work, as: (a) dlx2b, like many developmental genes, resides in a gene desert enriched in open chromatin regions that may function as distal enhancers, and (b) the deletion of the MTE unmasked a cis-regulatory activity whose nature cannot be explained with the information provided, and that may be relevant for the expression of the gene in the dental mesenchyme.

      (2) A second concern is the absence of information on the functional consequences of deleting the gene or the MTE on tooth primordium development. From the description of the KI strategy, it is unclear whether the GFP insertion results in a functional fusion protein. The cytoplasmic GFP distribution and the schematic in Figure S1 instead suggest the presence of a terminal stop codon in the GFP sequence, which would result in a dlx2b loss-of-function allele. If this interpretation is correct, the manuscript does not describe the developmental consequences in homozygous embryos. Similar concerns apply to the MTE deletion: it remains unclear whether loss of this enhancer results in any detectable morphological or developmental defects.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      (1) Structure and Presentation of Results

      • I recommend reordering the visual-cue experiments to progress from simpler conditions (no cues) to more complex ones (cue-conflict). This would improve narrative logic and accessibility for non-specialist readers. The authors have chosen not to implement this suggestion, which I respect, but my recommendation stands.

      Thank you for this suggestion. We understand your point that presenting the experiments from simpler to more complex conditions may seem more intuitive. However, we have kept the original order because it better reflects the logic of the study itself. Our work first asked whether fall armyworms, like the Bogong moth, use a magnetic compass that is integrated with visual cues. Only after establishing this behavioral feature did we go on to test whether visual cues are required to maintain magnetic orientation. To make this reasoning clearer to readers, we have explicitly stated in the Introduction that magnetic orientation in the Bogong moth depends on the integration of visual cues, which provides clearer context for the experimental design.

      (2) Ecological Interpretation

      • The authors should expand their discussion on how the highly simplified, static cue setup translates to natural migratory conditions, where landmarks are dynamic, transient, or absent. Specifically, further consideration is needed on how the compass might function when landmarks shift position, become obscured, or are replaced by celestial cues. Additionally, the discussion would benefit from a more consolidated section with concrete suggestions for future experiments involving transient, multiple, or more naturalistic visual cues. This point was addressed partially in one paragraph of the Discussion, which reads as follows:

      "In nature, they are likely to encounter a range of luminance-gradient visual cues, including relatively stable celestial cues as well as transient or shifting local features encountered en route. Although such natural cues differ from our simplified laboratory stimulus, they may represent intermittently sampled visual inputs that can be optimally integrated with magnetic information, with the congruency between visual and magnetic cues likely playing a key role in maintaining a stable compass response. Whether the cues are static or changing, brief periods without them may still allow the subsequent recovery of a stable long-distance orientation strategy. Determining which types of natural visual cues support the magnetic-visual compass, and how they interact with magnetic information, including how their momentary alignment or angular relationship is integrated and how such visual cue-magnetic field interactions may require time to influence orientation, together with elucidating the genetic and ecological bases of multimodal orientation, will be important objectives for future research." While this paragraph is informative, the wording remains lengthy, somewhat unclear, and vague. Shorter, clearer statements would improve readability and impact. For example:

      • How could moths maintain direction during periods when only the magnetic field is present and visual landmarks are absent?

      • Could celestial cues (e.g., stars) compensate, and what happens if these are also obscured?

      • What role does saliency play when multiple visual landmarks are present simultaneously?

      • How might a complex skyline without salient landmarks affect orientation?

      Including simple, concise sentences that pose concrete open questions and suggest experimental designs would strengthen the discussion without creating space issues. In my view, a comprehensive discussion of how the simplified, static cue setup relates to natural migratory conditions, where landmarks are dynamic, transient, or absent, would add significant value to the paper.

      Thank you for this constructive and insightful comment. You correctly point out that our articulation of the ecological relevance of the simplified, static cue setup was not sufficiently clear. We also agree that the original wording in the Discussion remained overly general. In the revised Discussion, we updated the manuscript to incorporate recently published findings on the use of light–dark gradients for orientation in fall armyworms. However, we explicitly note that it remains unclear whether fall armyworms can exploit naturally occurring luminance gradients, such as those generated by the moon, for orientation under natural conditions. We further emphasize that during natural migration the visual environment is dynamic, with celestial cues available intermittently and local visual features changing continuously during flight. In this context, we outline several key unresolved questions, including whether celestial cues can compensate when local landmarks are absent; how multiple visual cues are weighted and integrated with geomagnetic information; how transient visual cues (like moving clouds or changing illumination) influence orientation; and how luminance gradients that are common in natural nocturnal environments interact with the geomagnetic field to support orientation. For each of these issues, we briefly suggest experimental approaches to guide future research.

      (3) Methodological Details and Reproducibility

      • The lack of luminance level measurements should be explicitly highlighted.

      Thank you for your helpful suggestion. You are right that luminance level is an important experimental parameter. We have stated this information in the Methods section under Behavioral apparatus: “The ambient light level in the experimental environment was measured to be below 1 lux using a Testo 540 lux meter (Testo SE & Co. KGaA, Titisee-Neustadt, Germany). Further work is still required to compare the illuminance used in this study with that under natural conditions, which are inherently variable.” This point is also clarified in the legend of Figure S3 in the supplementary material.

      • The authors chose not to adjust figure legends by replacing "magnetic South" with "magnetic North." While I believe this would be more conventional and preferable, this is ultimately a minor stylistic issue.

      Thank you very much for your suggestion. We understand your point and agree that using “magnetic North” would be more conventional. However, because our experiments focus on the orientation behavior of the autumn population, magnetic South is aligned with the landmark direction representing the potential migratory direction, which we believe makes the figures more intuitive for readers. We therefore consider this a minor stylistic issue.

      (4) Conceptual Framing and Discussion

      • Although the authors made a good attempt to explain the limitations of using an artificial visual cue, I believe there is room for a more explicit argument. For example, it could be stated clearly that this species is unlikely to encounter a situation in nature where a single, highly salient landmark coincides with its migratory direction. Therefore, how these findings translate to real migratory contexts remains an open question. A sentence or two making this point directly would strengthen the discussion.

      Thank you for your helpful suggestion. We now address this point explicitly in the Discussion, noting that fall armyworms are unlikely to experience a natural visual environment dominated by a single, static, and highly salient landmark coinciding with their migratory direction. Consequently, how these findings translate to real migratory contexts remains an open question.

      (5) Technical and Open-Science Points

      • Sharing the R code openly (e.g., via GitHub) should be seriously considered. The code does not need to be perfectly formatted, but making it available would be highly beneficial from an open-science perspective.

      Thank you for the suggestion. We agree that making code openly available is valuable from an open-science perspective. The MMRT script used in this study is Moore’s Modified Rayleigh Test, available from the original publication by Massy et al. (2021; https://doi.org/10.1098/rspb.2021.1805). In the previous version, we only cited this reference in the Materials and Methods section; we have now added a direct link to the script to improve clarity and accessibility. We have also provided a public link to the data-recording scripts used in the Flash Flight Simulator (https://doi.org/10.17632/6jkvpybswd.1). This repository additionally includes a map-based optical flow script that was not used in the present study but is shared for completeness.

      Reviewer #1 (Recommendations for the authors):

      • LL. 133-137 (end of paragraph starting with "The fall armyworm is a migratory crop pest native to the Americas"): Suggest splitting into shorter, clearer sentences. The limitations of this method could be better articulated here and elaborated in the Discussion.

      Thank you for this suggestion. We have revised this paragraph by splitting it into shorter, clearer sentences and by articulating the limitations of this method more explicitly. These limitations are further elaborated in the Discussion.

      • LL. 181-185 (end of paragraph starting with "To examine if fall armyworms integrate geomagnetic and visual cues for seasonal migratory orientation"): It would be helpful to state explicitly that season-specific headings have been confirmed in the lab using a flight simulator, but destination regions remain unknown without further tracking experiments.

      Thank you for this helpful suggestion. We have now clarified in the revised manuscript that season-specific orientation headings have been confirmed in the laboratory using a flight simulator, while the actual migratory destination regions remain unclear in the absence of tracking experiments.

      • LL. 230-234 (start of paragraph "Our previous research showed that fall armyworms reared under artificially simulated fall conditions…"): Clarify which migratory season is being referenced.

      Thank you for this helpful suggestion. We have clarified in the text that the migratory season referenced here is the autumn migratory season. In addition, we have added information in the Methods to specify the actual calendar season during which the insects were reared under the simulated conditions.

      • LL. 270-272 (middle of Fig. 2 caption): Suggest explicitly mentioning that for this population, the seasonally appropriate direction is southbound in autumn and northbound in spring, as this may not be clear to non-specialists.

      Thank you for this helpful suggestion. We have now explicitly stated the seasonally appropriate migratory directions for this population, indicating southbound migration in autumn and northbound migration in spring, to improve clarity for non-specialist readers.

      • LL. 421 (middle of paragraph starting with "We also considered the limitations of the Rayleigh test…"): Add that the groups lacking visual cues exhibited "lower directedness as per lower vector length (r)" in addition to lower flight stability.

      Thank you for this helpful suggestion. We further note that the conclusions drawn from the flight stability analysis are consistent with those based on individual r-value analyses.
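
      For readers unfamiliar with the directedness measure discussed here, the mean vector length r of a set of flight headings can be computed as sketched below. This is only the standard circular-statistics definition, not the authors' MMRT script; the function name and the example headings are hypothetical.

```python
import math

def mean_vector_length(headings_deg):
    """Return (r, mean_direction_deg) for a list of headings in degrees.

    r is the length of the mean resultant vector: 1 means all headings
    are identical, 0 means they cancel out completely.
    """
    n = len(headings_deg)
    c = sum(math.cos(math.radians(h)) for h in headings_deg) / n
    s = sum(math.sin(math.radians(h)) for h in headings_deg) / n
    r = math.hypot(c, s)
    mean_dir = math.degrees(math.atan2(s, c)) % 360.0
    return r, mean_dir

# Hypothetical headings: one tightly clustered group, one evenly scattered group.
clustered = [170, 175, 180, 185, 190]
scattered = [0, 72, 144, 216, 288]

r_clustered, dir_clustered = mean_vector_length(clustered)
r_scattered, _ = mean_vector_length(scattered)
print(f"clustered: r = {r_clustered:.3f}, mean direction = {dir_clustered:.1f} deg")
print(f"scattered: r = {r_scattered:.3f}")
```

      A group with lower r is less directed, which is the sense in which the reviewer asks for "lower directedness" to be reported alongside flight stability.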

      • LL. 499-501 ("unlike some vertebrates that can rely solely on magnetic information (Mouritsen, 2018)"): This point is slightly downplayed. It should be emphasized that nearly all tested vertebrates and invertebrates (e.g., birds, mole rats, fish, frogs, and other insects) demonstrate a magnetic compass without requiring visual landmarks. Moths are the only tested invertebrates so far that show landmark-magnetic field dependency for their magnetic compass to be manifested in a behavioural orientation response in a flight simulator.

      Thank you for this important comment. We agree that this point represents a key synthesis in the Discussion, as it concerns how our findings relate to, and differ from, magnetic orientation demonstrated in other animal groups. We have therefore expanded the Discussion to note that studies have shown that some animals can exhibit directional preferences in simplified visual environments solely in response to changes in the magnetic field, and we now cite representative examples from birds and mole rats. At the same time, we also acknowledge important methodological and phenotypic differences among taxa. In particular, moths’ magnetic orientation has been assessed using a flight simulator, a setup in which stable directional behavior must be actively maintained during continuous movement. This is an important difference from orientation assays in birds during take-off or in terrestrial mammals such as mole rats. Moreover, whether birds and other animals rely on visual input to detect or calibrate magnetic information under certain conditions remains an open question. We therefore emphasize here both the phenotypic differences observed across experimental systems and the methodological considerations.

      • LL. 560-565 (paragraph starting with "Our flight simulator system (Dreyer et al., 2021) …"): Suggest clarifying what the Flash flight simulator system is and how it differs from the Mouritsen-Frost flight simulator.

      Thank you for this suggestion. We have added a brief clarification of the Flash flight simulator and how it differs from the Mouritsen–Frost system.

      • LL. 605-608 ("Spectral measurements …"): Explicitly mention that total illuminance was not measured and that further work is required to compare the illuminance used with natural conditions which of course vary.

      Thank you for this helpful suggestion. We agree that total illuminance is an important factor. We have now added a statement noting that the ambient light level in the experimental environment was measured to be below 1 lux using a Testo 540 lux meter, and we further acknowledge that additional work is required to compare the illuminance used in this study with that under naturally variable conditions.

      • LL. 628-641 (end of paragraph starting with "Electromagnetic noise at the experimental site ... "): Explain why this matters for interpreting behavioural responses. Highlight that although conditions were somewhat magnetically noisy, which based on past work may disrupt the magnetic compass, as shown in birds (e.g., Engels et al. 2014 Nature), the observed magnetic response under certain conditions indicates that the magnetic sense remained functional when landmark and magnetic field were aligned. This way you can pre-empt the criticism that your magnetic conditions were not ideal, given the noise measured on the left-hand side of the spectrum (which is not uncommon).

      Thank you for this helpful suggestion. We have now cited Engels et al. (2014, Nature) in this section and expanded the text to explain why electromagnetic noise at the experimental site is relevant for interpreting the behavioural responses. We also clarify the rationale for measuring electromagnetic noise and discuss the observed low-frequency (“left-hand side”) noise in the spectrum.

      • Fig. 51: Suggest adapting Y-axes and using violin or box plots (e.g., panels A/B starting from 30 up to 50, etc.).

      Thank you for this helpful suggestion. We have revised Fig. 5 accordingly by adapting the Y-axis scaling and replacing the original plots with box plots, as suggested.

    1. , asset pedagogies were making schools more welcoming of minoritized cultures, yet only as a bridge to a White curriculum that remained the only arrival point, as shown in Figure 1, Option B.

      This last sentence brings it all home. As I said before, I'm able to communicate through the white curriculum because it was the only thing available. There was no class on code meshing that I could take. Standard English is the standard which is fine but it almost feels like that is the only option available no matter your background.

    2. At the root of this representational impoverishment was a one‐to‐one identification between race and culture that failed to do justice to the evolving nature of minoritized youth practices, especially to their novel combinations of languages, modalities, and literacies

      What I personally understand from this passage is that it describes someone fitting in even with their background. I personally had to adapt to the norm, and it was easier to simply adopt the entire package of being an English speaker than to code mesh.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The "number sense" refers to an imprecise and noisy representation of number. Many researchers propose that the number sense confers a fixed (exogenous) subjective representation of number that adheres to scalar variability, whereby the variance of the representation of number is linear in the number.

      This manuscript investigates whether the representation of number is fixed, as usually assumed in the literature, or whether it is endogenous. The two dimensions on which the authors investigate this endogeneity are the subject's prior beliefs about stimuli values and the task objective. Using two experimental tasks, the authors collect data that are shown to violate scalar variability and are instead consistent with a model of optimal encoding and decoding, where the encoding phase depends endogenously on prior and task objectives. I believe the paper asks a critically important question. The literature in cognitive science, psychology, and increasingly in economics, has provided growing empirical evidence of decision-making consistent with efficient coding. However, the precise model mechanics can differ substantially across studies. This point was made forcefully in a paper by Ma and Woodford (2020, Behavioral & Brain Sciences), who argue that different researchers make different assumptions about the objective function and resource constraints across efficient coding models, leading to a proliferation of different models with ad-hoc assumptions. Thus, the possibility that optimal coding depends endogenously on the prior and the objective of the task opens the door to a more parsimonious framework in which assumptions of the model can be constrained by environmental features. Along these lines, one of the authors' conclusions is that the degree of variability in subjective responses increases sublinearly in the width of the prior. And importantly, the degree of this sublinearity differs across the two tasks, in a manner that is consistent with a unified efficient coding model.

      We thank Reviewer #1 for her/his comments and for placing our work in a broader context.

      Comments:

      (1) Modeling and implementation of estimation task

      The biggest concern I have with the paper is about the experimental implementation and theoretical account of the estimation task. The salient features of the experimental data (Figure 1C) are that the standard deviations of subjects' estimated quantities are hump-shaped in the true stimulus x and that the standard deviation, conditional on the true stimulus x, is increasing in prior width. The authors attribute these features to a Bayesian encoding and decoding model in which the internal representation of the quantity is noisy, and the degree of noise depends on the prior - as in models of efficient coding (Wei and Stocker 2015 Nature Neuro; Bhui and Gershman 2018 Psych Review; Hahn and Wei 2024 Nature Neuro).

      The concern I have is about the final "step" in the model, where the authors assume there is an additional layer of motor noise in selecting the response. The authors posit that the subject's selection of the response is drawn from a Gaussian with a mean set to the optimally decoded estimate x*(r), and variance set to a free parameter sigma_0^2. However, the authors also assume that the Gaussian distribution is "truncated to the prior range." This truncation is a nontrivial assumption, and I believe that on its own, it can explain many features of the data.

      To see this, assume that there is no noise in the internal representation of x, there is only motor noise. This corresponds to a special case of the authors' model in which ν is set to 0. The model then reduces to a simple account in which responses are drawn from a Gaussian distribution centered at the true value of x, but with asymmetric noise due to the truncation. I simulated such a model with sigma_0=7. The resulting standard deviations of responses for each value of x (based on 1000 draws for each value of x), across the three different priors, reproduce the salient patterns of the standard deviation in Figure 1C: i) within each condition, the standard deviation is hump-shaped and peaks at x=60 and ii) conditional on x, standard deviation increases in prior width. The takeaway is that this simple model with only truncated motor noise - and without any noisy or efficient coding of internal representations - provides an alternative channel through which the prior affects behavior.
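
      The reviewer's motor-noise-only argument can be illustrated with a short calculation. The sketch below is not the reviewer's simulation: it uses the closed-form standard deviation of a truncated Gaussian instead of 1000 random draws, so the numbers are deterministic. The prior ranges [50, 70], [40, 80], and [30, 90] are an assumption, inferred from the stated widths (20, 40, 60) and the peak at x=60; the function names are hypothetical.

```python
import math

def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_sd(mu, sigma, lo, hi):
    """SD of a Gaussian N(mu, sigma^2) truncated to the interval [lo, hi]."""
    a = (lo - mu) / sigma
    b = (hi - mu) / sigma
    mass = _cdf(b) - _cdf(a)
    var = (1.0
           + (a * _pdf(a) - b * _pdf(b)) / mass
           - ((_pdf(a) - _pdf(b)) / mass) ** 2)
    return sigma * math.sqrt(var)

SIGMA_0 = 7.0  # motor-noise SD used in the reviewer's simulation
# Assumed prior ranges (widths 20, 40, 60, all centered on 60).
PRIORS = {"narrow": (50, 70), "medium": (40, 80), "wide": (30, 90)}

# Response SD for each true stimulus x, with no cognitive noise (nu = 0).
sd_table = {
    name: {x: truncated_sd(x, SIGMA_0, lo, hi) for x in range(lo, hi + 1, 5)}
    for name, (lo, hi) in PRIORS.items()
}

for name, row in sd_table.items():
    print(name, {x: round(sd, 2) for x, sd in row.items()})
```

      Under these assumptions the response SD is lowest at the bounds of each prior and highest at x=60, and at x=60 it grows with prior width, which is the qualitative pattern the reviewer describes.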

      Of course, this does not imply that subjects' coding is not described by the efficient encoding and decoding model posited by the authors. However, it does suggest an important alternative mechanism for the authors' theoretical results in the estimation task. Moreover, some of the quantitative conclusions about the differences in behavior with the discrimination task would be greatly affected by the assumption of truncated motor noise.

      Turning to the experiment, a basic question is whether such a truncation was actually implemented in the design. That is, was the range of the slider bar set to the range of the prior? (The methods section states that the size on the screen of the slider was proportional to the prior width, but it was unclear whether the bounds of the slider bar changed with the prior). If the slider bar range did depend on the prior, then it becomes difficult to interpret the data. If not, then perhaps one can perform analyses to understand how much the motor noise is responsible for the dependence of the standard deviation on both x and the prior width. Indeed, the authors emphasize that their model is best fit at α=0.48, which would seem to imply that the best fitting value of ν is strictly positive. However, it would be important to clarify whether the estimation procedure allowed for ν=0, or whether this noise parameter was constrained to be positive (i.e., clarify whether the estimation assumed noisy and efficient coding of internal representations).

      We thank Reviewer #1 for her/his close attention to the motor-noise component of our model, in particular its truncation at the border of the prior. We agree that the truncated motor noise should be examined more closely as it affects the variance of responses. We address here the questions raised by the reviewer, and we detail the new analyses we have conducted.

      First, regarding the experimental paradigm, we note that this truncation was indeed implemented in the design, i.e., the range of the slider bar corresponded to the range of the prior (we now indicate this more clearly in the manuscript). Subjects thus were not able to select an estimate that was not in the support of the prior, and it is precisely for this reason that we model the selection step with a truncated distribution, so that the model is consistent with the experimental setup. This truncation naturally decreases the response variability near the bounds, and this may affect the overall variability differently for the different priors, as noted by the reviewer in her/his simulations. We have conducted a series of analyses to investigate this question.

      First, we consider a model in which there is no cognitive noise, but only motor noise. To answer one of the reviewer’s questions, the model-fitting procedure did allow for a vanishing cognitive noise (𝜈 = 0), i.e., it allowed for such a “motor-noise-only” mechanism to be the main account of the data. This value (𝜈 = 0), however, does not maximize the likelihood of the model, and thus this hypothesis is not the best account of the data. Nevertheless, we fit a model that enforces the absence of cognitive noise (i.e., with 𝜈 = 0). The BIC of this “motor-noise-only” model is higher than that of our best-fitting model by more than 1100, indicating very strong support for the best-fitting model, which features a positive cognitive noise (𝜈 > 0), and 𝛼 = 1/2, as in our theoretical proposal.
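
      For reference, the model comparison above relies on the Bayesian Information Criterion, BIC = k ln(n) - 2 ln(L), where lower values indicate a better account of the data. The sketch below only illustrates how a BIC difference is formed; the log-likelihoods, parameter counts, and number of observations are hypothetical placeholders, not the values from the study.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical placeholder values, chosen only to illustrate the arithmetic.
N_OBS = 400
bic_best = bic(log_likelihood=-1200.0, n_params=2, n_obs=N_OBS)
bic_motor_only = bic(log_likelihood=-1250.0, n_params=1, n_obs=N_OBS)

# A positive difference favors the model listed second (here, the richer one),
# even after the penalty for its extra parameter.
delta_bic = bic_motor_only - bic_best
print(f"BIC difference (motor-noise-only minus best-fitting): {delta_bic:.1f}")
```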

      Furthermore, the standard deviation of responses predicted by the motor-noise-only model overestimates substantially the variability of subjects' responses in the Narrow and Medium conditions (Figure 4, panel b), while the predictions of the best-fitting model are much closer to the behavioral data (panel a). Finally, the variances predicted by this model do not increase linearly with the prior width (contrary to the behavioral data). Instead, the variance increases more between the Narrow and the Medium priors than between the Medium and the Wide priors, as the effects of the bounds attenuate with the wider prior (panel c, solid green line).

      To further this analysis we fit in addition a model with no cognitive noise (𝜈 = 0), but in which we now allow the degree of motor noise, 𝜎₀, to depend on the prior. Our reasoning is that if the truncated motor noise were the sole explanation for the increase in subjects' variance with the prior width, then we would expect the noise levels for the three priors to be roughly equal. We find instead that they are different (with values of 5.9, 8.3, and 9.8, for the prior widths 20, 40, and 60, respectively, when pooling subjects; and when fitting subjects individually the distributions of parameter values exhibit a clear increase; see panels c and d above). This model moreover yields a BIC higher by more than 590 than our best-fitting model. We note in addition that these parameter values differ in such a way that they result in response variances that are a linear function of the prior width, as found in the behavioral data, although they overestimate the subjects' variances (panel c, dotted green line). This linear increase is directly predicted by our best-fitting model, which has one fewer parameter (2 vs. 3), and which moreover accurately predicts the variability of subjects across priors (panel c, pink line). Hence the data do not support a model with no cognitive noise and with only a constant, truncated motor noise.
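
      As a rough arithmetic check on the fitted values quoted above, one can ask how the pooled motor-noise levels (5.9, 8.3, 9.8) scale with the prior widths (20, 40, 60). The sketch below fits a power law by least squares on log-log axes. It is only an illustration: these values are untruncated noise scales rather than response standard deviations, and the function name is hypothetical.

```python
import math

widths = [20.0, 40.0, 60.0]
sigma_0 = [5.9, 8.3, 9.8]  # pooled fits quoted in the authors' response

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

slope = loglog_slope(widths, sigma_0)
print(f"fitted exponent: {slope:.2f}")
```

      An exponent well below 1 means the fitted noise grows sublinearly with the prior width, in line with the authors' point that a single constant motor-noise level cannot account for the data.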

      We also consider another possibility, that in addition to truncated motor noise there is in fact a degree of cognitive noise, but one that is insensitive to the width of the prior. In other words, there is cognitive imprecision, but it does not efficiently adapt to the prior range, as in our proposal. This corresponds to setting 𝛼 = 0, in our model; but this specification of the model results in a poor fit, with a BIC higher by more than 300 than that of the best-fitting model, whose cognitive noise scales with the exponent 𝛼 = 1/2, consistent with our theory. Thus our data do not support the hypothesis of a cognitive noise that does not scale with the prior range; instead, subjects' responses support a model in which the variance of the cognitive noise increases linearly with the prior range.

      We note in addition that there is inter-subject variability: different subjects have different degrees of imprecision. But if the source of the imprecision was the truncated motor noise, then different degrees of truncated noise should result in different relationships between the behavioral variance and the prior widths: subjects with smaller noise should be relatively insensitive to the width of the prior, while subjects with greater noise should be more sensitive. In that case, when fitting the subjects with the model in which the imprecision scales as a power of the width, we should expect subjects to exhibit a diversity of best-fitting parameter values 𝛼. Instead, as noted, we find that the data is best captured by a single exponent 𝛼 = 1/2, equal for all the subjects. This suggests that although the “baseline level” of the imprecision may differ per subject, the way that their imprecision increases as a function of the prior width is the same for all the subjects, a behavior that is not explained by truncated noise alone.

      Furthermore, Prat-Carrabin, Harl, and Gershman 2025 present behavioral results obtained in a similar numerosity-estimation task, with the same prior ranges, but with the experimental difference that the slider was not limited to the range of the current prior: instead it had the same width in all three conditions, and covered in all trials a range wider than that of the Wide prior (from 25 to 95). The behavioral variance observed in this study increases linearly with the prior range, as in our results. Thus we conclude that the linear increase in subjects' variability does not originate in the bounds of the experimental slider.

      Finally, Prat-Carrabin et al. 2025 present an fMRI study involving a similar numerosity-estimation experiment. This study shows that numerosity-sensitive neural populations in human parietal cortex adapt their tuning properties to the current numerical range, resulting in less precise neural encoding when the range is wider. This substantiates the notion that the degree of imprecision in cognitive noise adapts to the prior range, as in our proposal.

      Overall, we conclude that the linear increase of behavioral variability that we document originates in the endogenous adaptation, across conditions, of the amount of imprecision in the internal encoding of numerosities.

      We now include these analyses in a new section of the Methods (p. 24-27), which we summarize in the main text (p. 7-8). The Figure above is now included (as Figure 4). We also now cite the references mentioned by Reviewer #1 and which we had not already cited (Bhui and Gershman 2018 Psych Review; Hahn and Wei 2024 Nature Neuro).

      References:

      Prat-Carrabin, A., Harl, M. V., & Gershman, S. J. (2025). Fast efficient coding and sensory adaptation in gain-adaptive recurrent networks (p. 2025.07.11.664261). bioRxiv. https://doi.org/10.1101/2025.07.11.664261

      Prat-Carrabin, A., de Hollander, G., Bedi, S., Gershman, S. J., & Ruff, C. C. (2025). Distributed range adaptation in human parietal encoding of numbers (p. 2025.09.25.675916). bioRxiv. https://doi.org/10.1101/2025.09.25.675916

      (2) Differences across tasks

      A main takeaway from the paper is that optimal coding depends on the expected reward function in each task. This is the explanation for why the degree of sublinearity between standard deviation and prior width changes across the estimation and discrimination task. But besides the two different reward functions, there are also other differences across the two tasks. For example, the estimation task involves a single array of dots, whereas the discrimination task involves a pair of sequences of Arabic numerals. Related to the discussion above, in the estimation task the response scale is continuous whereas in the discrimination task, responses are binary. Is it possible that these other differences in the task could contribute to the observed different degrees of sublinearity? It is likely beyond the scope of the paper to incorporate these differences into the model, but such differences across the two tasks should be discussed as potential drivers of differences in observed behavior.

      If it becomes too difficult to interpret the data from the estimation task due to the slider bar varying with the prior range, then which of the paper's conclusions would still follow when restricting the analysis to the discrimination task?

      There are indeed several differences between the estimation and discrimination tasks that could, in principle, contribute to the quantitative differences observed between them. The fact that the estimation task requires a continuous numerical report whereas the discrimination task involves a binary choice is captured in our model by incorporating distinct loss functions for the two tasks (Eq. 4). This distinction is a key element of the theoretical framework, as it determines the optimal allocation of representational precision. We agree with Reviewer #1 that another important difference is that the estimation task involves non-symbolic dot arrays while the discrimination task uses short sequences of Arabic numerals, which could also affect performance through distinct perceptual or cognitive processes. Although we cannot exclude this possibility, it is unclear why such a difference in stimulus format would produce the specific quantitative patterns that we observe — and that are predicted by our proposal, namely, the sublinear scalings with task-dependent exponents. Each experiment, taken independently, supports the model's central prediction that the precision of internal representations scales sublinearly with the width of the prior distribution. Taken together, the two tasks show that this dependence itself varies with the observer's objective, confirming that perceptual precision is endogenously determined by both the statistical context and the task goal.

      We agree with Reviewer #1 that this point should be mentioned; we now do so in the Discussion (p. 17-18).

      (3) Placement literature

      One closely related experiment to the discrimination task in the current paper can be found in Frydman and Jin (2022 Quarterly Journal of Economics). Those authors also experimentally vary the width of a uniform prior in a discrimination task using Arabic numerals, in order to test principles of efficient coding. Consistent with the current findings, Frydman and Jin find that subjects exhibit greater precision when making judgments about numbers drawn from a narrower distribution. However, the current manuscript goes beyond Frydman and Jin by modeling and experimentally varying task objectives to understand and test the effects on optimal coding. This contribution should be highlighted and contrasted against the earlier experimental work of Frydman and Jin to better articulate the novelty of the current manuscript.

      We thank Reviewer #1 and we agree that the work of Frydman and Jin is highly relevant to our study. Instead of comparing our contributions to theirs, we have decided to have a close look at their data, in light of our theoretical proposal. This enables us to test the predictions of our theory against human choices made in a rather different decision situation than that of our discrimination task.

      Thus we looked, in their data, at the participants' probability of choosing the risky lottery instead of the certain amount, as a function of the difference between the lottery's expected value (pX) and the certain amount (C; we also added a small bias term to the certain option; such bias was not necessary with our discrimination data, presumably because of the inherent symmetry of our task).

      We find, as did Frydman and Jin, and similarly to our discrimination task, that the participants are more precise when the proposed amounts are sampled from a Narrow prior, in comparison to a Wide prior (see figure above, first panel). But we also find, as in our discrimination task, that when normalizing the value difference by the prior width participants are more sensitive to this normalized difference in the Wide condition than in the Narrow one, suggesting that their imprecision scales across conditions by a smaller factor than the prior width (last panel). And we find, consistent with our discrimination data and with our theory, that choice probabilities in the two conditions match very well when normalizing the difference by the prior width raised to the exponent 3/4 (third panel).
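
      The normalization argument above can be illustrated with a minimal sketch. It assumes a plain probit choice rule whose noise scale is ν·w^α, which is an illustrative stand-in and not necessarily the manuscript's Eq. 3; the lapse and bias terms are omitted, and the two prior widths are hypothetical. If α = 3/4, choice probabilities at a fixed normalized difference coincide across widths only when the difference is divided by w^(3/4); dividing by w makes the Wide condition look more sensitive.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_choose(diff, width, alpha, nu=1.0):
    """Probability of choosing the higher-valued option when the noise
    scale is nu * width**alpha (illustrative probit form, an assumption)."""
    return phi(diff / (nu * width ** alpha))

def p_at_normalized(z, width, beta, alpha):
    """Choice probability at a fixed normalized difference z = diff / width**beta."""
    return p_choose(z * width ** beta, width, alpha)

ALPHA = 0.75                 # assumed scaling exponent of the imprecision
NARROW, WIDE = 20.0, 80.0    # hypothetical prior widths

for beta in (1.0, 0.75):
    p_narrow = p_at_normalized(0.5, NARROW, beta, ALPHA)
    p_wide = p_at_normalized(0.5, WIDE, beta, ALPHA)
    print(f"beta={beta}: narrow={p_narrow:.3f}, wide={p_wide:.3f}")
```

      With beta = 1 the Wide probability exceeds the Narrow one, and with beta = 0.75 the two are equal, which mirrors the qualitative pattern described for the last and third panels.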

      Model fitting supports this observation. We fit the data to our model (described by Eq. 3), with the addition of a lapse probability and of a bias, and with different values of the exponent 𝛼. The best-fitting model is the one with 𝛼 = 3/4. Its BIC (35,419) is lower than those of the models with 𝛼 = 1, ½, and 0 (by 142, 39, and 514, respectively). It is also lower by 2.14 than a model in which 𝛼 is left as a free parameter (in which case the best-fitting 𝛼 is 0.68, a value not far from 3/4). We emphasize that these BIC values indicate that the hypotheses 𝛼 = 0 and 𝛼 = 1 are clearly rejected, i.e., the participants' imprecision increases with the prior width (𝛼 > 0), but sublinearly (𝛼 < 1). In other words, the responses collected by Frydman and Jin in a risky-choice task are quantitatively consistent with our results obtained in a number-discrimination task, and they further substantiate our model of endogenous precision.
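
      For readers less familiar with the criterion, the sketch below shows how a model with a fixed exponent can obtain a lower BIC than one in which the exponent is free, even though the latter never fits worse. The log-likelihoods, parameter counts, and number of observations are hypothetical placeholders, not values from either dataset.

```python
from math import log

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better model."""
    return n_params * log(n_obs) - 2.0 * log_likelihood

# Hypothetical numbers, for illustration only.
N_OBS = 10_000
bic_fixed = bic(log_likelihood=-5000.0, n_params=3, n_obs=N_OBS)  # exponent fixed
bic_free = bic(log_likelihood=-4998.0, n_params=4, n_obs=N_OBS)   # exponent free

# The free-exponent model fits slightly better, but it pays log(N_OBS)
# for its extra parameter, so its BIC can still be the higher of the two.
print(bic_free - bic_fixed)
```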

      We moreover note that their proposed model is similar to ours, in that the decision-maker is allowed to optimize a noisy encoding scheme to the prior, subject to a ‘capacity constraint’ on the number 𝑛 of encoding signals that can be obtained. Crucially, this capacity constraint is assumed to be a property of the decision-maker that does not change across priors, and thus 𝑛 is fixed across prior widths. Therefore, their model predicts that the participants' imprecision should scale linearly with the prior width (this is also what we obtain in our model if we don’t optimize a similar parameter; see the revised presentation of the model on p. 12-13). We note that when they fit this parameter, 𝑛, separately across conditions, they find that it is larger with the wider prior. This is precisely what our model of endogenous precision predicts. In turn this predicts a sublinear scaling of the imprecision, instead of the linear one that would result from a fixed 𝑛, and indeed we find a sublinear scaling in both their dataset and ours. What is more, in both datasets the sublinear scaling is best captured by the exponent 𝛼 = 3/4, as we predict.

      This analysis of another independent dataset obtained with a different experimental paradigm significantly strengthens our conclusions. Thus we added to the Results section a new subsection discussing this analysis, and the figure above now appears as Figure 3. We also mention it in the Introduction (l. 87-89) and in the Discussion (l. 556-557).

      Reviewer #2 (Public review):

      Summary:

      This paper provides an ingenious experimental test of an efficient coding objective based on optimization as a task success. The key idea is that different tasks (estimation vs discrimination) will, under the proposed model, lead to a different scaling between the encoding precision and the width of the prior distribution. Empirical evidence in two tasks involving number perception supports this idea.

      Strengths:

      The paper provides an elegant test of a prediction made by a certain class of efficient coding models previously investigated theoretically by the authors.

      The results in experiments and modeling suggest that competing efficient coding models, optimizing mutual information alone, may be incomplete by missing the role of the task.

      We thank Reviewer #2 for her/his positive comments on our work.

      Weaknesses:

      The claims would be more strongly validated if data were present at more than two widths in the discrimination experiment.

      We agree that including additional prior widths would allow for a more detailed validation of the predicted scaling law, in particular in the discrimination task. Our design choices across the two experiments reflect a trade-off between the number of prior widths and the number of trials per condition. In the estimation task, we include three widths because this is necessary to identify all three parameters of the model: the variance of the motor noise, the baseline variance of internal imprecision (𝜈<sup>2</sup>), and the scaling exponent (𝛼). Extending both tasks to include additional prior widths would indeed provide a more robust test of the predicted scaling law. We now note this point in the revised Discussion (p. 17).

      A very strong prediction of the model -- which determines encoding entirely from prior and task -- is that Fisher Information is uniform throughout the range, strongly at odds with the traditional assumption of imprecision increasing with the numerosity (Weber/Fechner law). This prediction should be checked against the data collected. It may not be trivial to determine this in the Estimation experiment, but should be feasible in the Discrimination experiment in the Wide condition: Is there really no difference in discriminability at numbers close to 10 vs numbers close to 90? Figure 2 collapses over those, so it's not evident whether such a difference holds or not. I'd have loved to look into this in reviewing, but the authors have not yet made their data publicly available - I strongly encourage them to do so.

      Importantly, the inverse u-shaped pattern in Figure 1 is itself compatible with a Weber's-law-based encoding, as shown by simulation in Figure 5d in Hahn & Wei [1]. This suggests a potential competing variant account, in apparent qualitative agreement with the findings reported: the encoding is compatible with Fisher's law, and only a single scalar, the magnitude of sensory noise, is optimized for the task for the loss function (3). As this account would be substantially more in line with traditional accounts of numerosity perception - while still exhibiting task-dependence of encoding as proposed by the authors - it would be worth investigating if it can be ruled out based on the data gathered for this paper.

      References:

      [1] Hahn & Wei, A unifying theory explains seemingly contradictory biases in perceptual estimation, Nature Neuroscience 2024

      Indeed, our efficient-coding model predicts that a uniform prior should result in a constant Fisher-information function, and we agree with Reviewer #2 that this is at odds with the common assumption that the imprecision increases with the magnitude. To investigate this possibility, we now consider, in the revised manuscript, a more general model of Gaussian encoding, in which the internal representation, 𝑟, is normally distributed around an increasing transformation of the number, 𝜇(𝑥), as

      𝑟|𝑥~𝑁(𝜇(𝑥), 𝜈<sup>2</sup>𝑤<sup>2 𝛼</sup>),

      where the encoding function, 𝜇(𝑥), can be either linear (𝜇(𝑥) = 𝑥) or logarithmic (𝜇(𝑥) = log (𝑥)). This allows us to test whether the data are better captured by a uniform Fisher information (as predicted by the linear encoding under a uniform prior) or by a compressed, Weber-like representation.
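
      As a short worked step, the Fisher information of each variant follows from the standard result for a Gaussian whose mean depends on 𝑥 and whose variance does not. The symbol σ is introduced here only as shorthand for the noise scale 𝜈𝑤<sup>𝛼</sup>.

```latex
\begin{align*}
r \mid x &\sim \mathcal{N}\!\left(\mu(x),\, \sigma^{2}\right),
  \qquad \sigma = \nu\, w^{\alpha}
  \quad\Longrightarrow\quad
  I(x) = \frac{\mu'(x)^{2}}{\sigma^{2}}, \\[4pt]
\mu(x) = x &:\quad I(x) = \frac{1}{\left(\nu\, w^{\alpha}\right)^{2}}
  \quad \text{(constant in } x\text{)}, \\[4pt]
\mu(x) = \log x &:\quad I(x) = \frac{1}{\left(x\, \nu\, w^{\alpha}\right)^{2}}
  \quad \text{(falls off as } 1/x^{2}\text{)}.
\end{align*}
```

      In both cases the imprecision, 1/√𝐼(𝑥), scales with the prior width as 𝑤<sup>𝛼</sup>, so the choice of encoding function changes how precision varies across numbers without changing the dependence on the width.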

      We note, first, that in both tasks our conclusions regarding the dependence of the imprecision on the prior width remain unchanged, whether we choose the linear encoding or the logarithmic encoding. With both choices of encoding, the estimation task is best fit by a model with 𝛼 = 1/2, and the discrimination task by a model with 𝛼 = 3/4, implying a sublinear scaling of the variance with the width of the prior, in quantitative agreement with our theory.

      In the estimation task, the logarithmic encoding yields a significantly lower BIC than the linear one, by more than 380 (see Table 1). The results are less clear in the discrimination task, where the BIC with the logarithmic encoding is lower by 2.1 when pooling together the responses of all the subjects, but it is larger by 2.6 when fitting each subject individually. We conduct in addition a “Bayesian model selection” procedure, to estimate the relative prevalence of each encoding among subjects. The resulting estimate of the fraction of the population that is best fit by the logarithmic encoding is 87.6% in the estimation task, and 45.9% in the discrimination task (vs. 12.4% and 54.1% for the linear encoding).

      To further investigate the behavior of subjects in the Discrimination task, we look at their proportion of correct choices in the Wide and Narrow conditions, for the trials in which both averages are below the middle value of the prior, and for those in which both are above the middle value. We find no significant difference in the Narrow condition (see Figure below). In the Wide condition, the proportion of correct responses appears larger when the averages are small (with a significant difference when binning together the trials in which the absolute difference between the averages is between 4 and 12; Fisher's exact test p-value: 0.030).

      To complement this analysis, we fit a probit model with lapses, which is equivalent to our Gaussian model with linear encoding, but allowing the noise scale parameter to differ when both averages are above, or below, the middle value of the prior. We fit this model separately in each condition, only on the trials in which both averages are either above or below the middle value; and we test a more constrained model in which the scale parameter is equal for both small and large averages. In the Narrow condition, a likelihood-ratio test does not reject the null hypothesis that the scale parameter is constant (𝜒<sup>2</sup>(1) = 0.026, 𝑝 = 0.87), but in the Wide condition this hypothesis is rejected (𝜒<sup>2</sup> (1) = 7.6, 𝑝 = 0.006). In this condition the best-fitting scale parameter is 29% larger (9.4 vs. 6.3) with the large averages than with the small averages, pointing to a larger imprecision with the larger numbers.
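
      The p-values of the likelihood-ratio tests above can be recovered from the reported statistics with a few lines of code. This is a minimal sketch: it uses the closed form of the chi-square survival function with one degree of freedom, the two statistics are the ones quoted in the text, and the function names are illustrative.

```python
from math import erfc, sqrt

def chi2_sf_1dof(stat):
    """P(X > stat) for a chi-square variable with one degree of freedom.
    Uses P(X > s) = P(|Z| > sqrt(s)) = erfc(sqrt(s / 2)) for standard normal Z."""
    return erfc(sqrt(stat / 2.0))

def likelihood_ratio_test(ll_constrained, ll_full):
    """Statistic and p-value when the full model has one extra parameter."""
    stat = 2.0 * (ll_full - ll_constrained)
    return stat, chi2_sf_1dof(stat)

# Statistics reported for the Narrow and Wide conditions.
for label, stat in (("Narrow", 0.026), ("Wide", 7.6)):
    print(label, stat, chi2_sf_1dof(stat))
```

      The constrained model here is the one with a single noise scale, and the full model allows different scales for small and large averages, hence the single degree of freedom.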

      These results and the prevalence of the Weber/Fechner encoding prompt us to consider, in our efficient-coding model, the hypothesis that a logarithmic compression is an additional constraint on the possible encoding schemes. In our model, the internal representation (𝑟) could take any form as long as its Fisher information satisfied the constraint in Eq. 5 on the integral of its square root. We now consider a strong, additional constraint: that over the support of the prior, the Fisher information of the signal must be of the form that one would obtain with a logarithmic encoding, i.e., 𝐼(𝑥) ∝ 1/𝑥<sup>2</sup>. (For the sake of generality we choose this specification instead of directly assuming a logarithmic encoding, because other types of encoding schemes yield a Fisher information of this form, e.g., one with “multiplicative noise” (Zhou et al., 2024); we do not seek, here, to distinguish between these different possibilities.) We solve the same efficient-coding optimization problem (Eq. 6), but now with this additional constraint. We find that the resulting optimal Fisher information is approximately:

      , for the estimation task,

      and , for the discrimination task,

      for any 𝑥 on the support of the prior, and where 𝑥<sub>mid</sub> is the middle of the prior and 𝜃 is a constant. These Fisher-information functions differ from the one previously obtained without the additional constraint (Eq. 9), in that they fall off as 1/𝑥<sup>2</sup>, consistent with our additional constraint. However, we note that the dependence on the prior width, 𝑤, is identical: here also, the imprecision is proportional to 𝑤<sup>1/2</sup>, in the estimation task, and to 𝑤<sup>3/4</sup>, in the discrimination task.

      In its logarithmic variant (𝜇(𝑥) = log (𝑥)), the Fisher information of the model of Gaussian representations that we have considered throughout is 1/(𝑥 𝜈 𝑤<sup>𝛼</sup>)<sup>2</sup>. It is thus consistent with the predictions just presented, if 𝛼 = 1/2 for the estimation task, and 𝛼 = 3/4 for the discrimination task, i.e., the two values that best fit the data.

      This is precisely the model suggested by Reviewer #2. Overall, we conclude that with both linear and logarithmic encoding schemes, our efficient-coding model — wherein the degree of imprecision is endogenously determined — accounts for the task-dependent sublinear scaling of the imprecision that we observe in behavioral data. As for the imprecision across numbers, a sizable fraction of subjects, particularly in the estimation task, are best fit by the logarithmic encoding, consistent with previous reports that numbers are often represented on a compressed, approximately logarithmic scale. This encoding may itself reflect an efficient adaptation to a long-term environmental prior that is skewed, with smaller numbers occurring more frequently, leading to greater representational precision. This pattern is less clear in the discrimination task. It is possible that the rate at which the precision decreases across numbers itself depends on the task, such that not only the overall level of imprecision, but also its variation across numbers, may be modulated by the task's demands. In this study we have focused on the endogenous choice of the overall precision, but an avenue for future research would be to examine how this adaptation interacts with the detailed shape of the encoding across numbers.

      In the revised manuscript, we have modified the presentation of the model to include the transformation 𝜇(𝑥) (p. 6-7 and 10-11). We have updated accordingly Table 1 (shown above; p. 24), which reports the BICs of all the models for the estimation task (and which now includes the models with logarithmic encoding). There is now a section in the Results dedicated to the question of the logarithmic compression, which includes the efficient-coding model constrained by the logarithmic encoding (p. 15-16). The results on the performance of subjects with larger numbers are presented in Methods (p. 29-31), and mentioned in the main text (p. 14-15). The Methods also provides details about the efficient-coding model with logarithmic encoding (p. 32-33). These results are further commented on in the Discussion (p. 18). Finally, the data and code are now available online at this address: https://osf.io/d6k3m/ , which we note on p. 33.

      Reference

      Zhou, J., Duong, L. R., & Simoncelli, E. P. (2024). A unified framework for perceived magnitude and discriminability of sensory stimuli. Proceedings of the National Academy of Sciences, 121(25), e2312293121. https://doi.org/10.1073/pnas.2312293121

      Reviewer #3 (Public review):

      Summary:

      This work demonstrates that people's imprecision in numeric perception varies with the stimulus context and task goal. By measuring imprecision across different widths of uniform prior distributions in estimation and discrimination tasks, the authors find that imprecision changes sublinearly with prior width, challenging previous range normalization models. They further show that these changes align with the efficient encoding model, where decision-makers balance expected rewards and encoding costs optimally.

      Strengths:

      The experimental design is straightforward, controlling the mean of the number distribution while varying the prior width. By assessing estimation errors and discrimination accuracy, the authors effectively highlight how imprecision adjusts across conditions.

      The model's predictions align well with the data, with the exponential terms (1/2 and 3/4) of imprecision changes matching the empirical results impressively.

      We thank Reviewer #3 for his/her positive comments on our work.

      Weaknesses:

      Some details in the model section are unclear. Specifically, I'm puzzled by the Wiener process assumption where r∣x∼N(m(x)T,s^2T). Does this imply that both the representation of number x and the noise are nearly zero at the beginning, increasing as observation time progresses? This seems counterintuitive, and a clearer explanation would be helpful.

      In the original formulation of the model, indeed both the mean of the representation and its variance are nearly zero when T is also near zero, but in such a way that the Fisher information, 𝑇(𝑚′(𝑥)/𝑠)<sup>2</sup>, is proportional to 𝑇. We note that a different specification, with a mean 𝑚(𝑥) (instead of 𝑚(𝑥)𝑇) and a variance 𝑠<sup>2</sup>/𝑇 (instead of 𝑠<sup>2</sup>𝑇), i.e., 𝑟|𝑥~𝑁(𝑚(𝑥), 𝑠<sup>2</sup>/𝑇), for 𝑇 > 0, would result in the same Fisher information.
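
      The equivalence of the two specifications can be checked directly from the Fisher information of a Gaussian whose mean depends on 𝑥 and whose variance does not, namely the squared derivative of the mean divided by the variance:

```latex
\begin{align*}
r \mid x \sim \mathcal{N}\!\left(m(x)\,T,\; s^{2} T\right)
  &\;\Longrightarrow\;
  I(x) = \frac{\left(m'(x)\,T\right)^{2}}{s^{2}\,T}
       = T \left(\frac{m'(x)}{s}\right)^{2}, \\[4pt]
r \mid x \sim \mathcal{N}\!\left(m(x),\; \frac{s^{2}}{T}\right)
  &\;\Longrightarrow\;
  I(x) = \frac{m'(x)^{2}}{s^{2}/T}
       = T \left(\frac{m'(x)}{s}\right)^{2}.
\end{align*}
```

      In the first form both the mean and the variance vanish as 𝑇 approaches zero, and in the second the mean is fixed while the variance diverges; the information carried about 𝑥 is the same in both and grows linearly with 𝑇.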

      In any event, in the revised manuscript, we now formulate the model differently. Specifically, we assume that the encoding results from an accumulation of independent, identically-distributed signals, but the precision of each signal is limited, and each of them entails a cost. Formally, we posit, first, that the Fisher information of one signal, 𝐼<sub>1</sub>(𝑥), is subject to a constraint on the integral of its square root: ∫ √𝐼<sub>1</sub>(𝑥) d𝑥 ≤ 𝐾.

      This constraint appears in many other efficient-coding models in the literature (Wei & Stocker 2015, 2016; Wang et al. 2016; Morais & Pillow, 2018; etc.), and it arises naturally for unidimensional encoding channels (Prat-Carrabin & Woodford, 2001; e.g., for a neuron with a sigmoidal tuning curve, it is equivalent to assuming that the range of possible firing rates is bounded). Second, we assume that the observer incurs a cost each time a signal is emitted (e.g., the energy resources consumed by action potentials). The total cost is thus proportional to the number of signals, which we denote by 𝑛. More signals, however, allow for a better precision: specifically, under the assumption of independent signals, the total Fisher information resulting from 𝑛 signals is the sum of the Fisher information of each signal, i.e., 𝐼(𝑥) = 𝑛𝐼<sub>1</sub>(𝑥).

      A tradeoff ensues between the increased precision brought by accumulating more signals, and the cost of these signals. We assume that the observer chooses the function 𝐼<sub>1</sub>(.) and the number 𝑛 of signals that solve the minimization problem subject to ,

      where 𝜆 > 0. We can first solve this problem for the Fisher information of one signal, 𝐼<sub>1</sub>(𝑥). In the case of a uniform prior of width 𝑤, we find that it is zero outside of the support of the prior, and

      for any 𝑥 on the support of the prior. This intermediate result corresponds to the optimal Fisher information of an observer who is not allowed to choose the number of signals, 𝑛 (and who receives instead 𝑛 = 1 signal). It is the solution predicted by the efficient-coding models mentioned above, which include the constraint on 𝐼<sub>1</sub>(𝑥) but do not allow the observer to choose the number of signals, 𝑛. With this solution, the scale of the observer's imprecision is proportional to 𝑤, and it does not depend on the task, contrary to our experimental results.

      Solving the optimization problem for 𝑛, in addition to 𝐼<sub>1</sub>(𝑥), we find that with a uniform prior the optimal number is proportional to 𝑤 in the estimation task, and to 𝑤<sup>1/2</sup> in the discrimination task (specifically, treating 𝑛 as continuous, we obtain ). In other words, the observer chooses to obtain more signals when the prior is wider, and in a way that depends on the task. We give the general solution for the total Fisher information, 𝐼(𝑥) = 𝑛𝐼<sub>1</sub>(𝑥), in the case of a prior 𝜋(𝑥) that is not necessarily uniform:

      where 𝜃 = 𝜆/𝐾. This is of course the same solution that we obtained in the original manuscript.
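
      The link between the number of signals and the scaling exponent of the imprecision can be checked numerically. The sketch below is illustrative and rests on stated assumptions: the single-signal Fisher information is taken to be flat over a uniform prior of width w, with √I₁·w equal to a constant K (so that the imprecision with one signal is proportional to w, as stated above); all proportionality constants are set to 1; and the rule n ∝ √w for the discrimination task is the one implied by the 3/4 exponent. Function names and test widths are hypothetical.

```python
from math import log, sqrt

K = 1.0  # assumed bound on the integral of sqrt(I_1) over the prior's support

def fisher_one_signal(width):
    """Flat single-signal Fisher information on a prior of the given width."""
    return (K / width) ** 2

def imprecision(width, n_signals):
    """1 / sqrt(total Fisher information), with I = n * I_1 for independent signals."""
    return 1.0 / sqrt(n_signals * fisher_one_signal(width))

def loglog_slope(n_of_width, w1=10.0, w2=1000.0):
    """Exponent relating imprecision to width, estimated from two widths."""
    s1 = imprecision(w1, n_of_width(w1))
    s2 = imprecision(w2, n_of_width(w2))
    return log(s2 / s1) / log(w2 / w1)

print(loglog_slope(lambda w: 1.0))      # fixed n: linear scaling
print(loglog_slope(lambda w: w))        # n proportional to w (estimation)
print(loglog_slope(lambda w: sqrt(w)))  # n proportional to sqrt(w) (discrimination)
```

      The three slopes come out as 1, 1/2 and 3/4 up to floating-point error: a fixed number of signals gives the linear scaling of the earlier models, whereas letting the number of signals grow with the width gives the sublinear, task-dependent exponents.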

      We hope that this new formulation of the efficient-coding model will seem more intuitive to the reader (p. 12-13 in the revised manuscript).

      The authors explore range normalization models with Gaussian representation, but another common approach is the logarithmic representation (Barretto-García et al., 2023; Khaw et al., 2021). Could the logarithmic representation similarly lead to sublinearity in noise and distribution width?

      We agree with Reviewer #3 that a common approach when modeling the perception of numbers is to consider a logarithmic encoding. We have conducted several analyses that examine this proposal. These are presented in detail in our response to a comment of Reviewer #2, above (p. 11-14 of this document). We briefly summarize our findings here:

      (i) A model with a logarithmic encoding better fits a majority of subjects in the estimation task, but a bit less than half the subjects in the discrimination task.

      (ii) The examination of the performance of subjects in the discrimination task, however, suggests that in the Wide condition they discriminate the small numbers slightly better than the larger numbers.

      (iii) We consider a constrained version of our efficient-coding model, in which the Fisher information must be consistent with that of a logarithmic encoding (i.e., decreasing as 1/𝑥<sup>2</sup>); we find that the resulting optimal Fisher information depends on the prior width in the same way as without the constraint, i.e., a scaling of the imprecision with 𝑤<sup>1/2</sup>, in the estimation task, and with 𝑤<sup>3/4</sup>, in the discrimination task.

      (iv) When considering the model with logarithmic encoding, we find that it best fits the data when its imprecision scales with the width with the same exponents, i.e., 𝑤<sup>1/2</sup>, in the estimation task (𝛼 = 1/2), and 𝑤<sup>3/4</sup>, in the discrimination task (𝛼 = 3/4). In other words, the data support the predictions of our theoretical model.

      In the revised manuscript, we have accordingly modified the presentation of the model (p. 6-7 and 10-11) and Tables 1 (p. 24) and 2 (p. 30), which report the BICs. There is now a section in the Results dedicated to the question of the logarithmic compression, including the efficient-coding model constrained by the logarithmic encoding (p. 15-16). The results on the performance of subjects with larger numbers are presented in Methods (p. 29-31), and mentioned in the main text (p. 15-16). The Methods also provides details about the efficient-coding model with logarithmic encoding (p. 32-33). These results are further commented on in the Discussion (p. 18). Finally, we now cite the articles mentioned by Reviewer #3 (Barretto-García et al., 2023; Khaw et al., 2021).

      Additionally, Heng et al. (2020) found that subjects did not alter their encoding strategy across different task goals, which seems inconsistent with the fully adaptive representation proposed here. I didn't find the analysis of participants' temporal dynamics of adaptation. The behavioral results in the manuscript seem to imply that the subjects adopted different coding schemes in a very short period of time. Yet in previous studies of adaptation, experimental results seem to be more supportive of a partial adaptive behavior (Bujold et al., 2021; Heng et al., 2020), which might balance experimental and real-world prior distributions. Analyzing temporal dynamics might provide more insight. Noting that the authors informed subjects about the shape of the prior distribution before the experiment, do the results in this manuscript suggest a top-down rapid modulation of number representation?

      We thank Reviewer #3 for his/her comment and for pointing to these articles. The Reviewer raises several points — that of the dynamics of adaptation, that of the adaptation to the prior, and that of the adaptation to the task. We address each of them.

      To investigate the dynamics of the subjects’ adaptation, we examined separately, in each task, the responses obtained in the trials in the first and second halves of each condition. In the estimation task, the standard deviations of responses, as a function of the presented number and of the prior width, are very similar in the two halves (see Figure 8, panel a). The Bonferroni-Holm-corrected p-values of Levene's tests of equality of the variances across the two halves are all above 0.13, and thus we do not reject the hypothesis that the variance in the first half of the trials is equal to the variance in the second half. Moreover, the variance in both halves appears to be a linear function of the width, rather than the squared width (panel b). We conclude that the behavior of subjects in the estimation task is stable across each experimental condition, including the sublinear scaling of their imprecision.

      In the discrimination task, the subjects' choice probabilities, as a function of the difference between the averages of the red and blue numbers, are similar in the first and second halves of trials (panel c). The Bonferroni-Holm-corrected p-values of Fisher exact tests of equality of proportions (in bins of the average difference that contain about 500 trials each) are all above 0.9, and thus we do not reject the hypothesis that the choice probabilities are equal, in the first and second halves of the trials. Furthermore, the choice probabilities as a function of the absolute average difference normalized by the prior width raised to the exponent 3/4 are all similar, across session halves and across prior widths, suggesting that the sublinear scaling that we find is a stable behavior of subjects (panel d).
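
      For reference, the Bonferroni-Holm correction applied to the families of tests above can be sketched as follows. This is a generic implementation of the standard step-down procedure, not the authors' code, and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from three tests.
print(holm_adjust([0.01, 0.04, 0.03]))
```

      A hypothesis is rejected at level 0.05 when its adjusted p-value is below 0.05; stating that all corrected p-values exceed a threshold such as 0.13 or 0.9 therefore means that no test in the family is rejected.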

      Overall, we conclude that the behavior we observe in both tasks is stable over the course of each experimental condition. We note that in both experiments, subjects were explicitly informed of the prior distribution at the beginning of each condition, and each condition included two preliminary training phases that familiarized them with the prior (the specifics for each task are detailed in the Methods section).

      As pointed out by Reviewer #3, Heng et al. (2020) and Bujold et al. (2021) report a partial adaptation of encoding to recently experienced distributions. We note that in our study, a sizable fraction of subjects, particularly in the estimation task, are best fit by the logarithmic encoding. This suggests that, while subjects adapt to the experimental prior, they retain a residual logarithmic compression — an encoding that itself would be efficient under a long-term, skewed prior in which smaller numbers are more frequent. In that sense our findings are thus consistent with the partial adaptation of Heng et al. (2020) and Bujold et al. (2021). At the same time, the same sublinear scaling of imprecision that we find in our study has been obtained in a numerosity-estimation task in which the prior was changed on every trial (Prat-Carrabin et al., 2025), indicating that the adaptation to the prior can occur quickly (on the order of a second) — possibly through a fast top-down modulation of the encoding, as suggested by Reviewer #3. These findings suggest that on a short timescale the encoding adapts efficiently to the prior (as evidenced by the scaling in imprecision), but within structural constraints (the logarithmic encoding).

      Regarding the adaptation to the task, Heng et al. (2020) indeed do not find subjects to be adapting their encoding, across two discrimination tasks (one in which the subject is rewarded for making the correct choice, and one in which the subject is rewarded with the chosen option). A difference with our paradigm is that their task involves simultaneous presentation of two dot arrays, while our discrimination task uses two interleaved sequences of Arabic numerals. More importantly, we do not directly compare the encoding between the estimation and discrimination tasks. Instead, we show that within each task, the adaptation to the prior is quantitatively consistent with the optimal coding predicted for that task's objective, as reflected in the task-specific sublinear scaling exponents. Directly contrasting the encoding across tasks would be a very interesting direction for future work.

      In the revised manuscript, we present the analysis on the stability of subjects’ behavior in the Methods section (p. 29), and we mention it in the main text when presenting the results of the estimation task (p. 5) and of the discrimination task (p. 8-10). In the Discussion, we cite Heng et al. (2020) and Bujold et al. (2021) and comment on the adaptation to the prior and to the task (p. 18).

      Barretto-García, M., De Hollander, G., Grueschow, M., Polanía, R., Woodford, M., & Ruff, C. C. (2023). Individual risk attitudes arise from noise in neurocognitive magnitude representations. Nature Human Behaviour, 7(9), 1551-1567. https://doi.org/10.1038/s41562-023-01643-4

      Bujold, P. M., Ferrari-Toniolo, S., & Schultz, W. (2021). Adaptation of utility functions to reward distribution in rhesus monkeys. Cognition, 214, 104764. https://doi.org/10.1016/j.cognition.2021.104764

      Heng, J. A., Woodford, M., & Polania, R. (2020). Efficient sampling and noisy decisions. eLife, 9, e54962. https://doi.org/10.7554/eLife.54962

      Khaw, M. W., Li, Z., & Woodford, M. (2021). Cognitive Imprecision and Small-Stakes Risk Aversion. The Review of Economic Studies, 88(4), 1979-2013. https://doi.org/10.1093/restud/rdaa044

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) As mentioned above, the result of inverse u-shaped variability is in strong qualitative agreement with the predictions of a generic Bayesian encoding-decoding model of a flat prior, even under a standard encoding respecting Weber's law, as shown in Figure 5d in: Hahn & Wei, A unifying theory explains seemingly contradictory biases in perceptual estimation, Nature Neuroscience 2024. This paper should probably be cited.

      We now cite Hahn & Wei, 2024. We comment above on our analyses regarding the logarithmic encoding.

      (2) "Requests for the data can be sent via email to the corresponding author" Why are the data not made openly available? Barring ethical or legal concerns (which are not apparent for this type of data), there is no reason not to make data and code open.

      "Requests for the code used for all analyses can be sent via email to the corresponding author." Same: why not make them open?

      We agree that it is good practice to make the data and code publicly available. They are now available here: https://osf.io/d6k3m/

      Reviewer #3 (Recommendations for the authors):

      The orange dot in Figure 1C does not appear to be described in the figure caption, although an explanation of it is mentioned in the main text.

      We thank Reviewer #3 for pointing out this omission. We now include explanations in the caption.

      I hope the authors will consider making their data publicly available on OSF or another platform.

      The data and code are now publicly available on OSF: https://osf.io/d6k3m/

    1. Reviewer #2 (Public review):

      Summary:

      Neurons have varied responses to external stimuli that cannot be explained by naive Poisson models. Previous work has quantified and partitioned higher-than-Poisson variability in the brain into different components. The authors improve on these methods to infer how both the stimulus drive and internal gain dynamics impact neuronal variability continuously in time. The clean and well-reasoned model is rigorously developed and then applied to neural data across the visual hierarchy. This lends new insights into how variability is partitioned, agreeing with and extending previous work on how that variability changes from early visual areas (LGN, V1) through to higher, motion-sensitive areas (area MT). Another key contribution is that this partitioning can be fully addressed as a continuous-time process, which allows for the dissection of how the timescale of fluctuations in these two components changes across the brain's processing arc.
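
      To make "higher-than-Poisson variability" concrete, the following is a minimal sketch of the trial-wise modulated Poisson idea associated with Goris et al. (2014), not the authors' continuous-time CMP model. It assumes a gamma-distributed multiplicative gain with mean 1; the function names and the parameter values (mean count 10, gain variance 0.25) are illustrative assumptions. By the law of total variance, the spike-count variance is mu + gain_variance * mu**2, so the Fano factor exceeds 1 whenever the gain fluctuates.

```python
import math
import random


def predicted_fano(mean_count: float, gain_variance: float) -> float:
    """Fano factor implied by the law of total variance.

    Var[N] = E[Var[N | g]] + Var[E[N | g]] = mu + gain_variance * mu**2,
    hence Var[N] / E[N] = 1 + gain_variance * mu.
    """
    return 1.0 + gain_variance * mean_count


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Draw one Poisson count with Knuth's multiplication method (adequate for modest rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulated_fano(mean_count: float, gain_variance: float,
                   n_trials: int, seed: int = 0) -> float:
    """Simulate gain-modulated Poisson counts and return variance / mean."""
    rng = random.Random(seed)
    shape = 1.0 / gain_variance  # gamma gain with mean 1 and variance gain_variance
    scale = gain_variance
    counts = []
    for _ in range(n_trials):
        gain = rng.gammavariate(shape, scale)  # slow multiplicative modulation
        counts.append(sample_poisson(mean_count * gain, rng))
    mean = sum(counts) / n_trials
    variance = sum((c - mean) ** 2 for c in counts) / (n_trials - 1)
    return variance / mean


if __name__ == "__main__":
    print("predicted Fano factor:", predicted_fano(10.0, 0.25))
    print("simulated Fano factor:", simulated_fano(10.0, 0.25, 20000))
```

      Setting the gain variance close to zero brings the simulated Fano factor back toward 1, the plain Poisson case; this construction cannot produce values below 1, which is why sub-Poisson variability needs separate treatment.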

      Strengths:

      (1) The model is cleanly derived and thoroughly documented, including usable code shared in a GitHub repo. This makes the method immediately portable to other neural systems.

      (2) This is a clear and well-presented piece of work. The figures and writing are clear and understandable, and all pieces of the derivations are included in the main text and supplementary information.

      (3) Comparisons to other models, particularly the one from Goris et al. (2014), show how this Continuous Modulated Poisson (CMP) model outperforms previous work.

      (4) New insights about how variability partitioning changes across the visual stream from LGN to MT are revealed, including how the gain fluctuates on longer timescales in higher visual areas. Another key result about the anticorrelation between the variance in stimulus drive and gain fluctuations comports with theories about how neurons maintain efficient, reliable encoding.

      (5) In addition to the results reported here, this work will serve as an excellent tutorial for students and postdocs first delving into the sources of variability in the brain.

      Weaknesses:

      The work is somewhat incremental, building on previous studies of the partitioning of variability in the brain, but it provides important new extensions, as noted above.

      The only major gap I would suggest addressing in the Discussion is the observation of sub-Poisson variability in the brain. It seems clear that this model can extend to sub-Poisson variability and its partitioning and perhaps even show how that varies in real time, with an animal's attentional state. That is, of course, beyond the scope of the current work, but could be mentioned in the Discussion.

    1. Reviewer #1 (Public review):

      Summary:

      This manuscript examines whether retrieval practice protects memory-based inference from acute stress and proposes rapid neural reactivation of a bridging memory element as the underlying mechanism. Using a two-day associative inference paradigm combined with EEG decoding, the authors report that stress impairs inference accuracy and speed, while retrieval practice eliminates these deficits and restores neural signatures associated with bridge-element reactivation. The study addresses an important and timely question by integrating research on retrieval-based learning, stress effects on memory, and neural dynamics of inference. While the work provides promising multi-level evidence linking behavioral and neural findings, limitations in experimental design, causal interpretation, and decoding specificity weaken the strength of the mechanistic claims and suggest that further work is needed to disentangle strengthened associative memory from inference-specific protection effects.

      Strengths:

      (1) Strong theoretical integration

      The study integrates three influential frameworks: memory integration through associative inference, stress-induced retrieval impairment, and the testing effect. The authors present a clear theoretical narrative linking these domains and derive testable hypotheses that retrieval practice protects inference by strengthening neural reactivation of a bridge element. The conceptual framing is well-grounded in prior literature and addresses an important gap regarding neural dynamics during inference.

      (2) Multi-level evidence

      The study provides converging behavioral and neural evidence. The authors demonstrate that stress reduces inference accuracy and speed, while retrieval practice eliminates these deficits. EEG decoding further suggests that bridge element reactivation predicts successful inference. The combination of behavioral performance and neural decoding strengthens the overall argument.

      (3) Transparent experimental implementation

      The procedures are described in substantial detail, including stimulus construction, stress manipulation, and decoding pipelines. Data and code availability are also strengths, facilitating reproducibility.

      Weaknesses:

      (1) Insufficient evidence that retrieval practice specifically protects inference rather than strengthening associative memories

      A central claim of the manuscript is that retrieval practice specifically protects inference ability rather than simply strengthening underlying associative memories. However, the current data do not convincingly distinguish between these possibilities. Although the authors limited analyses to trials in which AB and BC pairs were correctly retrieved in the subsequent memory test, this procedure does not fully rule out the possibility that improved inference performance reflects stronger base associative memories rather than enhanced integrative processes.

      Importantly, the direct memory retrieval test used a two-alternative forced-choice (2AFC) format, which inherently allows a substantial proportion of correct responses to arise from guessing. Consequently, trials classified as "successfully retrieved" may still include weak associative memory traces, making it difficult to conclude that failures in inference reflect deficits in integration rather than incomplete associative learning.

      The authors further argue that retrieval practice does not improve inference in the absence of stress, suggesting independence between inference and associative memory strength. However, this null effect does not sufficiently rule out mediation through strengthened premise memory. A factorial design and/or mediation analysis would be necessary to determine whether inference resilience emerges independently of premise memory strength.
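
      The 2AFC concern can be put in numbers with a simple guessing correction. The sketch below assumes an all-or-none (high-threshold) model, in which a premise is either truly remembered or guessed at a rate of 1/2, and it assumes that AB and BC guesses are independent. The 0.9 accuracy is an illustrative value, not a figure from the manuscript, and the function names are hypothetical.

```python
def true_memory_rate(observed_accuracy: float, n_alternatives: int = 2) -> float:
    """Invert observed = p + (1 - p) / n to recover p, the rate of genuine memories."""
    guess = 1.0 / n_alternatives
    return (observed_accuracy - guess) / (1.0 - guess)


def guessed_share_of_correct(observed_accuracy: float, n_alternatives: int = 2) -> float:
    """Fraction of correct responses that were lucky guesses rather than memories."""
    guess = 1.0 / n_alternatives
    p = true_memory_rate(observed_accuracy, n_alternatives)
    return (1.0 - p) * guess / observed_accuracy


def both_premises_known(observed_accuracy: float, n_alternatives: int = 2) -> float:
    """P(AB and BC both truly remembered | both answered correctly), assuming independence."""
    known_share = 1.0 - guessed_share_of_correct(observed_accuracy, n_alternatives)
    return known_share ** 2


if __name__ == "__main__":
    accuracy = 0.9  # illustrative 2AFC accuracy
    print("true memory rate:", round(true_memory_rate(accuracy), 3))
    print("guessed share of correct responses:", round(guessed_share_of_correct(accuracy), 3))
    print("both premises truly known:", round(both_premises_known(accuracy), 3))
```

      Under these assumptions, even at 90% observed accuracy roughly one in five trials classified as "both premises retrieved" would contain at least one guessed premise, which is the ambiguity the reviewer describes.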

      (2) Apparent below-chance inference performance raises interpretational concerns

      One surprising aspect of the results is that inference performance across experiments and groups appears to fall below the theoretical chance level (0.33) in Figure 4A. This is particularly unexpected because analyses were restricted to trials in which participants correctly retrieved both AB and BC associations.

      If performance is indeed below chance, this raises concerns regarding whether participants fully understood the task instructions or whether other methodological factors influenced performance. Additionally, below-chance performance complicates the interpretation of subsequent behavioral and neural analyses. It is possible that this reflects my misunderstanding of the figure; therefore, clarification from the authors regarding how inference accuracy is calculated and presented would be helpful.

      (3) Between-experiment implementation of retrieval practice weakens causal inference

      The retrieval practice manipulation was implemented as a separate experiment rather than as part of a factorial design. Experiment 2 was conducted after results from Experiment 1 were known, and the authors acknowledge this post hoc decision. This design introduces several potential confounds, including cohort differences between experiments, possible differences in participant motivation or task familiarity, and reduced ability to rigorously test interaction effects.

      Although the authors combined data across experiments to test interactions between stress and retrieval practice, such post hoc aggregation cannot fully substitute for a factorial design. A within-experiment 2 × 2 design (Stress × Retrieval Practice) would provide substantially stronger causal evidence and reduce confounding influences.

      (4) Lack of an appropriate comparison condition for retrieval practice limits the interpretation of the mechanism

      Although acknowledged briefly in the discussion, the absence of an appropriate comparison condition for retrieval practice represents a critical limitation. Without a matched re-exposure or restudy control condition, it remains unclear whether observed benefits are attributable specifically to retrieval practice or to additional exposure to AB and BC associations.

      Furthermore, it is unclear whether retrieval practice operates at the trial level or the participant level. Retrieval practice could enhance memory representations for specific practiced items, making those trials more resistant to stress, or it could induce a more global change in cognitive strategy or stress resilience across participants. One way to address this issue would be to analyze inference performance separately for trials that were successfully retrieved during the retrieval practice phase versus those that were not.

      (5) Interpretation of EEG decoding as bridge-element reactivation may be overstated

      The neural decoding results form the mechanistic foundation of the manuscript; however, the interpretation that decoding reflects reactivation of specific bridging memories may be overstated. The classifier distinguishes between face and building categories, and because the bridging element belongs to one of these categories, successful decoding may reflect category-level semantic activation rather than reinstatement of item-specific episodic representations.

      Alternative explanations include category-level retrieval, strategic task differences, or even attentional biases. Because only two categories were used, the decoding analysis lacks the specificity necessary to distinguish between category-level and item-level reactivation. As such, conclusions regarding the reinstatement of specific bridging memories should be tempered or supported with additional analyses.

    1. Reviewer #1 (Public review):

      Summary:

      The authors introduce EMUsort, an open-source algorithm for the automatic decomposition of high-resolution intramuscular EMG recordings. The method builds upon the Kilosort4 framework and incorporates modifications designed to better handle the spatial and temporal characteristics of intramuscular signals. The performance of EMUsort is evaluated on openly available datasets and compared against KS4 and MUEdit, demonstrating improved motor unit accuracy.

      Strengths:

      (1) The manuscript is clearly written, technically detailed, and well structured.

      (2) The open-source software is thoroughly documented, both within the manuscript and in the accompanying repository README, facilitating adoption by the community.

      (3) The availability of both code and datasets is a major strength, enabling reproducibility and independent validation.

      (4) The authors provide quantitative comparisons with existing decomposition algorithms, which is essential for contextualizing the proposed method.

      (5) The methodological details are sufficiently described to allow replication and further development by other researchers.

      Weaknesses:

      While the manuscript is strong overall, I have several suggestions that could further strengthen its impact and clarity.

      (1) Benchmarking and community integration

      A recent work has proposed standardized datasets and benchmarking pipelines for high-density surface EMG decomposition ("MUniverse: A Simulation and Benchmarking Suite for Motor Unit Decomposition", Mamidanna*, Klotz*, Halatsis* et al, NeurIPS 2025). A similar effort for intramuscular EMG would be highly valuable to the field. The authors may consider discussing how their dataset and algorithm could be integrated into broader benchmarking initiatives (e.g., platforms such as MUniverse), enabling systematic comparisons across multiple datasets and decomposition methods.

      (2) Comparison with additional decomposition algorithms

      Since the manuscript compares EMUsort with MUEdit, it would be appropriate to also include a comparison with Swarm-Contrastive Decomposition (SCD), which has been proposed for both surface and intramuscular EMG signals. Including this comparison, or explicitly discussing why it was not feasible, would strengthen the positioning of EMUsort relative to the current state of the art.

      (3) Manual editing and post-processing

      In practical EMG decomposition workflows, manual inspection and editing of motor units are often required after automatic decomposition. It would be useful for readers to know whether EMUsort provides (or is compatible with) a graphical interface or workflow for manual refinement, or how the authors envision this step being handled.

      (4) Ablation analysis of algorithmic modifications

      EMUsort is described as an extension of Kilosort4. An ablation analysis examining the impact of the main modifications introduced relative to KS4 would help clarify which changes contribute most to the observed performance improvements and under which conditions.

      (5) Failure modes and limitations

      A more explicit discussion of when EMUsort is likely to fail or degrade in performance would be valuable. For example, sensitivity to the number of channels, recording duration, signal quality, or motor unit density could be discussed to guide users.

      (6) Generalisability to surface EMG

      Given the shared methodological foundations between surface and intramuscular EMG decomposition, it would be helpful to know whether EMUsort has been tested on high-density surface EMG datasets or whether the authors expect limitations when applied outside the intramuscular domain.

      (7) Applicability to human intramuscular recordings

      The authors could clarify whether EMUsort has been tested on human intramuscular EMG, and discuss any expected differences in performance due to anatomical or physiological factors.

      (8) Parameter sensitivity

      Clustering-based methods can be sensitive to parameter choices. Reporting a parameter sensitivity analysis, or at least discussing the robustness of EMUsort to parameter variations, would increase confidence in the method's reliability and ease of use.

      (9) Differences between template matching and BSS methods

      Since the manuscript proposes a new template-matching algorithm but compares its performance with a BSS one (MUEdit), BSS algorithms should be described in the introduction. The differences between the methodologies should be highlighted, and the pros and cons of each described.

      Conclusion:

      The authors largely achieve their stated aims, and the results mostly support the main conclusions. EMUsort represents a meaningful contribution to the EMG decomposition literature, particularly for researchers working with high-resolution intramuscular recordings. With additional clarification regarding benchmarking, algorithmic ablations, and limitations, the manuscript would be further strengthened and likely to have a substantial impact on the field.

    2. Reviewer #2 (Public review):

      Summary:

      This work presents a new spike sorter, EMUsort, to target the challenging task of spike sorting Motor Unit Action Potentials (MUAP). EMUsort is essentially a modified version of Kilosort, with some key extensions to target EMG data: correct for large delays due to propagation across channels, spike detection of highly overlapping and large units via multiple thresholds, an increased number of waveform templates for spike detection, and an extended representation of waveforms to grasp complex MUAP spike shapes. The results on simulated data show solid evidence that the applied modifications make a difference for EMG recordings. All in all, I believe that EMUsort will greatly improve spike sorting performance for high-density EMG data.

      Strengths:

      The manuscript is well written, and the methods and modifications to the Kilosort pipeline are well-motivated, well-explained, and clear. The simulation results provide strong evidence that the presented modifications make spike sorting of high-density EMG data more accurate.

      Weaknesses:

      Overall, the method is validated on only 15 simulated motor units. The monkey dataset, in particular, seems too "easy" and not challenging enough to highlight weaknesses of any of the spike sorters. A second weakness is the distribution of the code: it ships with submodules for Kilosort and SpikeInterface, which pins these key dependencies to old versions and makes the package hard to maintain long-term.

    1. Reviewer #1 (Public review):

      Summary:

      Freas and Wystrach present a computational model of steering in insects. In this model, the central complex provides an error signal indicating the animal should turn left or right; this error signal biases the function of an oscillator composed of two mutually inhibiting self-exciting units. The output of these units generates a "steering signal" that is used both to set the direction and speed of the ant. Additionally, a separate module induces pauses, and an inverse relation between forward speed and turning speed is externally imposed. Statistics of the trajectories generated by the model are compared to the measured behaviors of ants.

      Strengths:

      While the model is very simple compared to state-of-the-art models, that simplicity makes it a potentially useful guide to researchers studying insect navigation. Some predictions that emerge from the model appear to be experimentally testable, although a more complete description of the model and its parameters, as well as an analysis of how this model's predictions differ from previous models' predictions, would be required to design these experiments.

      Weaknesses:

      I found it difficult to identify evidence in the paper supporting central elements of the abstract. Hopefully, these difficulties can be resolved with a clearer presentation and the addition of supporting detail, especially in the methods.

      (1) The model is not clearly described

      In the Materials and Methods, there is no description of the model, just "The computational model is presented in Figure 1." (This is probably a typo and may refer to Figure 2A-C), and a link to Matlab source code. It is inappropriate to ask readers or reviewers to examine source code in lieu of providing a method, but I attempted to do so anyway. To my eye, the source code does not match the model presented in 2A-C. For instance, in 2C, "Steering signal" inhibits "Freeze", but I couldn't find this in the source. "Freeze" is shown to inhibit "steering signal," but as "steering signal" is a signed quantity, it's not clear what this means. Literally, since "ang_speed_raw = L-R," it would seem to indicate the "freeze" would bias towards right turns. In the code, "freeze" appears to be implemented through the boolean variable "speed_inhibition_time." The logic controlled by this variable doesn't appear to inhibit the "steering signal" but instead (depending on control parameters) either reduces the movement speed and amplifies the turning rate, or it turns the angular speed output into a temporal integral of the control signal.

      There are a number of parameters in the source code that aren't described at all in the paper, including the internal oscillator parameters.

      Together, these limitations make it difficult to understand what is being simulated, what parts of the model are tied to biology, and where the model improves on or departs from previous work.

      It is absolutely essential that authors fully describe the computational model, that they explain the meaning of all parameters of the model, and that they explain how the particular values of these parameters were chosen.

      (2) The biological inspiration is unclear

      A central claim of the paper is that the model is "biologically grounded." But some elements, for instance, using a signed quantity to represent left-right steering drive, are not biologically possible; at best, these are shorthand for biologically possible implementations, e.g., opposing groups of left-right driving neurons.

      The mechanism that produces fixations and saccades - the "freeze" module - is not tied to any particular anatomy of the insect brain. Initiation of a freeze occurs at a specific time coded into the model by the authors; it is not generated by an internal model signal. Release of a freeze is by drawing a random variable; there is no neural mechanism proposed to generate this signal.

      In some versions of the model, instead of directly controlling the signal, during fixations, the angular drive signal is integrated into a variable "cumul_drive." No neural substrate is proposed for this integrator. In the code, if cumul_drive passes a threshold, the angular heading of the ant changes (saccades), but only if this threshold is passed before the Poisson process ends the fixation. No neural substrate is proposed for any of this logic.

      The model steps forward in time by a fixed increment - the actual duration (in seconds) of this time step is not specified. From Figure 4F, G, it appears a simulation time step is meant to be about 10ms. This would imply an oscillator frequency of about 2 Hz (Fig 2B), that the heading oscillates at a similar frequency (2G), and that a forward crawling ant stops moving every 500 ms (2I). Are these plausible? Can they be compared to an experiment?
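
      The unit conversion behind these numbers can be written out explicitly. The sketch below takes the 10 ms step and the roughly 2 Hz oscillation quoted above; the 50-step cycle length is back-computed from those two values rather than read from the source code, and the 1 ms alternative is purely hypothetical. The function names are illustrative.

```python
def steps_per_cycle(frequency_hz: float, dt_s: float) -> float:
    """Number of simulation steps in one oscillation period."""
    return 1.0 / (frequency_hz * dt_s)


def frequency_from_steps(n_steps: float, dt_s: float) -> float:
    """Oscillation frequency in Hz for a cycle lasting n_steps simulation steps."""
    return 1.0 / (n_steps * dt_s)


def interval_ms(n_steps: float, dt_s: float) -> float:
    """Wall-clock duration in milliseconds of n_steps simulation steps."""
    return n_steps * dt_s * 1000.0


if __name__ == "__main__":
    dt = 0.010  # seconds per step, as inferred from Figure 4F, G
    cycle = steps_per_cycle(2.0, dt)
    print("steps per oscillator cycle:", round(cycle))
    print("cycle duration (ms):", round(interval_ms(cycle, dt)))
    # The same cycle length in steps implies a very different behavior
    # if the step were, hypothetically, 1 ms instead of 10 ms.
    print("frequency at 1 ms steps (Hz):", round(frequency_from_steps(cycle, 0.001)))
```

      Because every frequency and pause interval scales with the step duration, leaving that duration unspecified makes it impossible to compare the simulated dynamics with measured behavior.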

      Model parameters, including the ones that control the frequency of the oscillator, are non-dimensionalized. It is not possible to evaluate whether these parameters are biologically plausible or match experimental results.

      (3) Claims that behaviors emerge from the model may be overstated

      The abstract claims that steering correction and fixations/saccades emerge naturally from the same model. But it appears to me that fixations/saccades are externally imposed by the specification of specific times for a "freeze." Faster angular rotation during saccades than during course correction is imposed and does not emerge naturally from neural simulations.

      (4) Citations to previous literature are difficult to follow, and modeling results are presented as though they are experimental data

      I would ask the authors to be much clearer in their description and citation of previous work. It should be clear whether the cited work was experimental or computational. To the extent possible, the actual measurement should be described succinctly. Instead of grouping references together to support a sentence with multiple claims, references should be cited for each claim. Studies of computational models should not be presented as proving a biological result.

      For example:

      a) Lines 141-146:<br /> "Previous studies have established many key components of insect navigation, including .... the intrinsic oscillatory dynamics in the lateral accessory lobes (LALs) that support continuous zigzagging locomotion (Clément et al., 2023; Kanzaki, 2005; Namiki and Kanzaki, 2016; Steinbeck et al., 2020)."

      The first reference is to one author's previous modeling work - it hypothesizes that oscillations in the LAL support zigzagging but includes no data that would "establish" the fact. Kanzaki et al. 2005 describes numerical modeling and simulation with a physical robot. Namiki and Kanzaki, 2016 is a review article that links the LAL to zigzagging behavior. It describes the LAL as a winner-take-all bistable network but does not describe or hypothesize that the LAL has intrinsic oscillatory dynamics. Steinbeck et al. 2020 is a more comprehensive review; it reinforces that the LAL is a winner-take-all bistable network that drives left-right steering, including during zig-zagging behavior. But in my reading, I could not find a statement that the LAL has intrinsic oscillatory dynamics (the closest is Steinbeck et al. saying the activity pattern switches regularly, as does the behavior; this doesn't imply that the LAL is intrinsically oscillatory.)

      b) Lines 701-703:<br /> "In plume-tracking moths, CX output has been shown to modulate LAL flip-flop neurons driving zigzagging (Adden et al., 2022)."

      This reads as though an experimental measurement was made, but in fact, this is modeling work.

      c) Lines 703-706:<br /> "In ants, strong goal signals in the CX - whether elicited by the path integrator or visual familiarity (Wehner et al., 2016; Wystrach et al., 2020b, 2015) do not only sharpen directional accuracy but also increase oscillation frequency (Clément et al., 2023)."

      Here again, modeling results are presented as though they were experimental data.

    1. Für E-Commerce-Unternehmen ist der zentrale Vorteil die Anbindung an 200+ Integrationen mit Shopsystemen, Marktplätzen und Versanddienstleistern, ergänzt durch Xentral Connect, eine No-Code-Middleware-Schicht zur Automatisierung von Workflows ohne Entwicklerbeteiligung.

      Change this sentence to: Für E-Commerce-Unternehmen ist der zentrale Vorteil die Anbindung an über 200 Integrationen mit Shopsystemen, Marktplätzen und Versanddienstleistern, ergänzt durch Xentral Connect, eine No-Code-Middleware-Schicht zur Automatisierung von Workflows ohne Entwicklerbeteiligung.

    1. Depending on our knowledge related to the species and its system, we might subjectively filter individuals.

      Wrong place - Should be just above the next code chunk

    1. Hence, code-switching research needs to develop representative experimental methods for the usage in large-scale studies. The assumption that experimental studies are not suitable for the study of code-switching was challenged by Gullberg et al. (2009), who proposed a range of experimental methods suited for code-switching research.

      This further drives home the point that code-switching is complex and somewhat misunderstood. It must be studied in many different ways to even begin to understand its impact on humans.

    2. Care-givers want to choose language policies supporting proficiency in the home language/heritage language, but of even greater concern is often development in the majority language associated with upward social mobility. A common concern of care-givers of bilingual children is that code-switching could delay or negatively affect the children’s proficiency in each language. Anecdotally, the first author of this editorial had a first-hand experience of these types of negative assumptions about code-switching recently: in an annual review of their bilingual child’s performance, professional nursery staff praised the child’s general linguistic development, but also expressed their concern about the child’s continued code-switching, describing it in a derogatory fashion as “Misch-Masch” (German for “hotchpotch”).

      There are still negative connotations surrounding code-switching. The professional nursery staff clearly have concerns about this child's continued code-switching.

    3. One marker of cross-linguistic similarity are cognates, i.e., lexical items from separate languages that reassemble each other formally (Levenshtein 1966)

      This suggests code-switching is increased when similarities are seen by someone trying to speak a different language.

    4. Overall, their study confirms the assumption of a cognitive effort associated with code-switching,

      Code-switching studies clearly show an increase in brain power needed to code-switch.

    5. Treffers-Daller et al. (2022) investigate the diversity of code-switching in Malay–English bilingual speech. They identify numerous incidences of code-switches involving function words, which is highly restricted in most language pairs (Lehtinen 1966). This confirms previous observations by Muysken (2000), who suggested that highly dense forms of code-switching involve linguistic co-activation at the grammatical level, i.e., involving function words. Whilst Muysken proposed that such dense forms of code-switching (labelled as “congruent lexicalization” by Muysken) are more likely in typologically similar languages, the novelty of the contribution of Treffers-Daller et al. is that they attest such dense code-switches of function words in a typologically distant language combination, i.e., Malay–English. Hence, their study further challenges the assumption of “universal constraints”. It also proposes a more cautious version of this assumption, namely that constraints are less pronounced in dense code-switching. The authors provide an insightful discussion of how typological distance and structural (dis)similarity of the languages under study may shape the diversity of code-switching patterns.

      These studies seem to prove that there are no "universal constraints" when it comes to code-switching. It is also not a free-for-all and has nuances depending on many factors of the spoken languages.

    6. Given that code-switching remains a stigmatized bilingual practice, in particular in European and North American communities (Garrett 2012; Jaworska and Themistocleous 2018), parents of bilingual children frequently raise concerns about their children’s code-switching.

      Parents of bilingual children still fear that their children will be treated differently or thought of as less proficient because of their code-switching.

    7. According to Deuchar (2020), we still know little about the relative role of external sociolinguistic and internal psycholinguistic and linguistic factors in shaping code-switching. We suggest that it is only possible to make progress in our understanding of the variability of code-switching patterns if we bring together interdisciplinary insights from a range of research areas, such as linguistics, sociolinguistics, clinical linguistics, psycholinguistics, and neuroscience, and investigate code-switching variability in different sociocultural constellations, with typologically different languages and with different types of multilingualism and proficiency levels, and also across the lifespan

      Code-switching is a complex topic that we are still trying to understand. A myriad of factors have given us, and continue to give us, a better understanding of it.

    8. Accounting for this behavior is challenging but is of great interest to linguistics and cognitive science. Code-switching can provide a window into the mind to study the ways in which the grammars and lexicons of two languages interact (Muysken 2000).

      Code-switching and language will forever be a constantly changing organism, so to speak. They will never cease to change.

    Annotators

    1. I had effectively been censored, as had many of my classmates. Fortunately, I had another dialect with which to communicate in the school setting. Code-switching, therefore, allowed me to express myself in less-accommodating environments.

      Her classmates were silenced and lost their ability to express themselves because they were forced into unaccommodating environments; the writer, however, could code-switch and so express herself effectively.

    2. Later, of course, as I pursued my graduate studies in speech-language pathology, I realized that a number of AAE-speakers are unfamiliar with MAE and code-switching.

      proof to her seemingly easy solution.

    3. Her refrain, clearly articulated in AAE, threatened that I bet' not talk like dat at school. Her reminder echoed in my head for years. My parents and grandparents, all adept code-switchers, had raised me to speak both AAE and Mainstream American English (MAE). Daily interactions and family get-togethers were marinated in AAE. MAE was reserved for interactions in mixed company, addressing elders, more serious discussions among family members, and education.

      Her family taught her code-switching, including where and how to use it.

    1. Reviewer #1 (Public review):

      Summary:

      Participants learned a graph-based representation, but, contrary to the hypotheses, failed to show neural replay shortly after. This prompted a critical inquiry into temporally delayed linear modeling (TDLM)--the algorithm used to find replay. First, it was found that TDLM detects replay only at implausible numbers of replay events per second. Second, it detects replay-to-cognition correlations only at implausible densities. Third, there are concerning baseline shifts in sequenceness across participants. Fourth, spurious sequences arise in control conditions without a ground truth signal. Fifth, the revised manuscript adapts a previously published synthetic simulation to show that previous validations/support of TDLM may have overestimated TDLM sensitivity because synthetic assumptions can produce unrealistically high pattern separability and reduced baseline confounds.

      Strengths:

      - This work is meticulous and meets a high standard of transparency and open science, with preregistration, code and data sharing, external resources such as a GUI with the task and material for the public.

      - The writing is clear, balanced, and matter-of-fact.

      - By injecting visually evoked empirical data into the simulation, many surface-level problems are avoided, such as biological plausibility and questions of signal-to-noise ratio.

      - The investigation of sequenceness-to-cognition correlations is an especially useful add-on because much of the previous work uses this to make key claims about replay as a mechanism.

      - In the revised version, the authors foreshadow ways to improve sequenceness detection by introducing a sign-flipping analysis.

      Weaknesses:

      Many of the weaknesses are not so much flaws in the analyses, but shortcomings when it comes to interpretation and a lack of making these findings as useful as they could be. Furthermore, as I will explain below, some weaknesses have been partially improved in the last round of revisions.

      - I found the bigger picture analysis to be lacking, though improved in the latest version. Let us take stock: in other work during active cognition, including at least one study from the Authors, TDLM shows significant sequenceness. But the evidence provided here suggests that even very strong localizer patterns injected into the data cannot be detected as replay except at implausible speeds. How can both of these things be true? Assuming these analyses are cogent, do these findings not imply something more destructive about all studies that found positive results with TDLM? In the revisions, the manuscript concentrates a bit more on criteria that influence detection of sequences, though it is still not entirely clear what consequences there are for previous work.

      - All things considered, TDLM seems like a fairly vanilla and low assumption algorithm for finding event sequences. Although the authors have improved their discussion of "boundary conditions" or factors for why TDLM might fail, it remains not fully clear to what extent the core problem is TDLM on an algorithmic/mathematical level (intrinsic factor), vs data quality, power, window size (extrinsic factors).

      - The new sign-flip analysis underscores the authors' goal of being solution-oriented, though it is important to emphasize that a comprehensive way forward is not yet provided. This is fine, but the manuscript could be improved further through a concrete alternative or a revised version of the original approach.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      (1) I found the bigger picture analysis to be lacking. Let us take stock: in other work, during active cognition, including at least one study from the Authors, TDLM shows significant sequenceness. But the evidence provided here suggests that even very strong localizer patterns injected into the data cannot be detected as replay except at implausible speeds. How can both of these things be true? Assuming these analyses are cogent, do these findings not imply something more destructive about all studies that found positive results with TDLM?

      Our focus here is on advancing methodology. Given the diversity of tasks and cognitive states in the TDLM literature, replay could exceed detection thresholds under specific conditions—especially when true event durations align with short analysis windows. While a comprehensive re-analysis of prior datasets is beyond our scope, we agree a concise synthesis can strengthen the paper.

      The previous TDLM literature uses a diverse set of tasks and addresses a broad spectrum of cognitive constructs/processes. As we acknowledge, it is perfectly possible that replay bursts in short time windows are well detectable by TDLM. However, we acknowledge that some commentary on this is warranted and have added the following paragraph to the discussion that addresses “improving TDLMs sensitivity”:

      “Finally, what do our simulations imply for the broader MEG replay literature? Our implementation successfully detects replay when boundary conditions are met, as shown in the simulation. But sensitivity depends critically on high fidelity between the analysis window and the density of replay events. A systematic evaluation of these conditions as they apply to prior studies remains beyond the scope of the current paper. Instead, our focus is on delineating boundary conditions that we hope will motivate conduct of power analyses in future work as well as inclusion of simulations that approximate realistic experimental conditions.”

      (2) All things considered, TDLM seems like a fairly 'vanilla' and low-assumption algorithm for finding event sequences. It is hard to see intuitively what the breaking factor might be; why do the authors think ground truth patterns cannot be detected by this GLM-based framework at reasonable densities?

      We agree with the overall sentiment of the referee. Our intuition is that one of the principal shortcomings of the method relates to spurious sequenceness induced by unknown factors at baseline, and poor transfer of the decoder to other modalities. While we have a rough understanding of how they occur, we are currently not in a position to identify their nature. Note that we believe that these confounders are not exclusive to TDLM but are potentially threatening to all kinds of sequenceness analysis of longer time series that rely on decoders. Indeed, we suspect that classifier training is another bottleneck, as we don’t know the exact nature of the representations that are replayed, including the degree of overlap there is with a commonly used visual localizer. That said, this is not of relevance for the simulation in so far as we insert patterns that exceed the pattern strength in the localizer.

      Finally, a potential major drawback is the permutation test for significance testing. As the original authors of TDLM have noted, the current test which permutes states is overly conservative. It measures fixed effects and as it only considers the group level mean it is accordingly easily biased by individual outliers. This we have tried to account for by z-scoring sequenceness scores. We have also conferred on this with some of the authors of TDLM and discussed a yet unpublished method that aims to address this exact issue. The proposed new method uses a sign-flip permutation test at a group level and therefore implements a random-effects model of the data. This significance test has markedly increased power while still controlling for FWER. However, while we show in our power analysis that the new method is indeed more sensitive, it does not materially change the interpretation of the data. We have included this novel method in the paper and added it into the main analysis and most of the simulations.
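To make the idea of a group-level sign-flip permutation test concrete, here is a minimal sketch. It is a generic max-statistic sign-flip test, not necessarily the procedure used in the manuscript or the unpublished method mentioned above. The function name `sign_flip_max_test`, the array layout (participants by time lags) and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sign_flip_max_test(seq, n_perm=2000, alpha=0.05, seed=0):
    """Group-level sign-flip permutation test with max-statistic correction.

    seq: array (n_participants, n_lags) of per-participant sequenceness scores.
    Returns (group mean per lag, FWER threshold, boolean mask of significant lags).
    """
    rng = np.random.default_rng(seed)
    seq = np.asarray(seq, dtype=float)
    n_sub = seq.shape[0]
    observed = seq.mean(axis=0)

    null_max = np.empty(n_perm)
    for p in range(n_perm):
        # Under H0 (zero mean, symmetric), each participant's sign is exchangeable.
        signs = rng.choice([-1.0, 1.0], size=(n_sub, 1))
        # Taking the maximum over lags controls the family-wise error rate.
        null_max[p] = np.abs((signs * seq).mean(axis=0)).max()

    threshold = np.quantile(null_max, 1.0 - alpha)
    return observed, threshold, np.abs(observed) > threshold


# Hypothetical toy data: 21 participants, 30 lags, an effect injected at lag index 7.
rng = np.random.default_rng(1)
toy = rng.normal(0.0, 1.0, size=(21, 30))
toy[:, 7] += 3.0
observed, threshold, significant = sign_flip_max_test(toy)
```

Because each participant's whole row is flipped together, between-participant variance enters the null distribution, which is what makes this a random-effects style test. The test assumes a zero-mean null, so a non-zero baseline would need to be subtracted before permuting.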

      (3) Can the authors sketch any directions for alternative methods? It seems we need an algorithm that outperforms TDLM, but not many clues or speculations are given as to what that might look like. Relatedly, no technical or "internal" critique is provided. What is it about TDLM that causes it to be so weak?

      We believe there are several shortcomings and bottlenecks within TDLM that need to be evaluated and improved. While we highlight these issues in the discussion section titled “Improving TDLMs sensitivity,” we agree that we should provide a clearer outline of its current shortcomings. We have now added to the discussion to expand on what we think needs improvement (‘fixed time lag’) and also add a summary statement at the end of the relevant paragraph to recap the main issues needed for an improved successor method. The new paragraphs read:

      “Lastly, there are certain assumptions that TDLM makes that might not hold (see Methods Study II): Current implementations look for a fixed time lag that is the same across all participants and between all reactivation events. If time lags differ across participants, TDLM will fail to find them. Similarly, TDLM assumes a fixed sequence order and is not robust against slight within-sequence permutations or in-sequence missing reactivation events. However, from other data sources, such as hippocampal place cell recordings, it is known that such permutations can occur where some states are skipped or fail to decode during replay. Similarly, it is assumed that each reactivation event lasts between 10-30 milliseconds, but the true temporal evolution of reactivation measured by TDLM is currently unknown. Future method development might focus on improving invariance to these assumptions.

      […]

      In summary, there are several areas where TDLM might be improved, including a restriction in its search space, improvement in classifiers, a validation of localizer representation transfer to other domains (e.g. memory representations), and the extension of TDLM to render it more robust against violations of its core assumptions.”
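The fixed-lag assumption can be illustrated with a heavily simplified sketch of a lagged-regression sequenceness measure. This loosely follows the two-level regression idea behind TDLM but omits the decoding step and the second-level control regressors, so it should be read as an illustration under stated assumptions rather than the authors' implementation. The function `sequenceness`, the toy state time courses and the injected lag of 8 samples are all hypothetical.

```python
import numpy as np

def sequenceness(X, T, max_lag):
    """Forward-minus-backward sequenceness for lags 1..max_lag.

    X: (n_samples, n_states) decoded state time courses.
    T: (n_states, n_states) hypothesized transitions, T[i, j] = 1 for i -> j.
    """
    scores = np.empty(max_lag)
    for lag in range(1, max_lag + 1):
        # First level: regress every state at t + lag on all states at t.
        # beta[i, j] is the weight of state i at t for state j at t + lag.
        beta, *_ = np.linalg.lstsq(X[:-lag], X[lag:], rcond=None)
        # Second level (simplified): project onto forward and backward templates.
        scores[lag - 1] = (beta * T).sum() - (beta * T.T).sum()
    return scores


# Toy data: noise plus pairwise reactivations s -> s + 1 at one fixed lag.
rng = np.random.default_rng(0)
n_samples, n_states, true_lag = 5000, 4, 8
X = rng.normal(size=(n_samples, n_states))
T = np.zeros((n_states, n_states))
for i in range(n_states - 1):
    T[i, i + 1] = 1.0

onsets = rng.integers(0, n_samples - true_lag, size=200)
starts = rng.integers(0, n_states - 1, size=200)
for t, s in zip(onsets, starts):
    X[t, s] += 3.0
    X[t + true_lag, s + 1] += 3.0

scores = sequenceness(X, T, max_lag=20)
best_lag = int(np.argmax(scores)) + 1
```

Each lag is evaluated separately, so evidence only accumulates when events share the same lag. If the lag varied from event to event or between participants, one would expect the contributions to spread over several lags rather than add up at one.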

      Reviewer #2 (Public review):

      Weaknesses:

      The sample size is small (n=21, after exclusions), even for TDLM studies (which typically have somewhere between 25-40 participants). The authors address this somewhat through a power analysis of the relationship between replay and behavioural performance in their simulations, but this is very dependent on the assumptions of the simulation. Further, according to their own power analysis, the replay-behaviour correlations are seriously underpowered (~10% power according to Figure 7C), and so if this is to be taken at face value, their own null findings on this point (Figure 3C) could therefore just reflect under sampling as opposed to methodological failure. I think this point needs to be made more clearly earlier in the manuscript.

      We agree with the referee that our sample is smaller than previous studies due to participant exclusion criteria. However, the take-away message from our behavioural simulation and bootstrapping is that even with larger sample sizes, it is difficult to overcome baseline fluctuations of sequenceness, even if very strong replay patterns were detectable and sample sizes were of similar size to that of previous studies. Therefore, we are not convinced that our null findings are fully explained by the smaller sample size compared to that of previous studies. Additionally, we show that even within the range of other studies, similar power would have been expected (Supplement Figure 11). However, it is true that in general null findings can be explained by under-sampling, under the assumption that an effect is present. To amplify this point, we have added the following to the Figure 3C:

      “[…]. NB, however, as our simulation shows, correlations of sequenceness with behavioural markers are likely to be underpowered and occur only with very high replay rates or much higher sample size. See our simulation discussion for a more detailed explanation on how correlations may be inherently biased, where fluctuations in baseline sequenceness overshadow individual scaling with behavioural markers.”

      Furthermore, we have added the following paragraph to the discussion to highlight this point and refer to a power analysis we have now added to the supplement (see next answer):

      “Sample sizes in previous TDLM literature usually range between 20 to 40 participants. A bootstrap power analysis shows that even at those sample sizes, power would remain low unless unrealistically high replay rates are assumed (Supplement Figure 11). Our bootstrap simulation shows that a correlation analysis between sequenceness and behaviour would in these cases be drastically underpowered, even under an assumption of high replay densities.”
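A bootstrap power analysis of this kind can be sketched as follows. This is a minimal illustration, not the analysis behind Supplement Figure 11: the function `bootstrap_power`, the Fisher z significance criterion and the toy data are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def bootstrap_power(seq, behaviour, n_target, n_boot=2000, seed=0):
    """Fraction of bootstrap samples of size n_target with a 'significant' correlation.

    Significance uses the Fisher z approximation at a two-sided 5% level.
    """
    rng = np.random.default_rng(seed)
    seq = np.asarray(seq, dtype=float)
    behaviour = np.asarray(behaviour, dtype=float)
    hits = 0
    for _ in range(n_boot):
        # Resample participants with replacement up to the target sample size.
        idx = rng.integers(0, len(seq), size=n_target)
        r = np.corrcoef(seq[idx], behaviour[idx])[0, 1]
        if not np.isfinite(r):
            continue  # degenerate resample without variance; counted as a miss
        r = np.clip(r, -0.999999, 0.999999)
        z = np.arctanh(r) * np.sqrt(n_target - 3)
        hits += int(abs(z) > 1.96)
    return hits / n_boot


# Hypothetical toy pool of 21 participants with a strong built-in correlation.
rng = np.random.default_rng(2)
behaviour = rng.normal(size=21)
seq_strong = behaviour + 0.3 * rng.normal(size=21)
power_strong = bootstrap_power(seq_strong, behaviour, n_target=40)
```

Resampling beyond the original pool size duplicates participants, so estimates for larger target sizes describe what the observed pool implies, not what a fresh sample would show.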

      Finally, we have added a remark about the sample size to the limitations section, as naturally, an increase in sample size would yield higher power:

      “Finally, while initially planning for thirty participants, due to exclusion criteria, our study featured fewer participants than most previous studies using TDLM (i.e. usually 25-40, but 21 in our study). While we are confident that our simulation results hold under these sample sizes, as sample sizes of other studies show comparable power to ours, we cannot fully rule out a possibility that our null-findings are explained by a lack in power alone.”

      Relatedly, it would be very useful if one of the recommendations that come out of the simulations in this paper was a power analysis for detecting sequenceness in general, as I suspect that the small sample size impacts this as well, given that sequenceness effects reported in other work are often small with larger sample sizes. Further, I believe that the authors' simulations of basic sequenceness effects would themselves still suffer from having a small number of subjects, thereby impacting statistical power. Perhaps the authors can perform a similar sort of bootstrapping analysis as they perform for the correlation between replay and performance, but over sequenceness itself?

      We agree with the referee that this, in principle, is a great idea. However, the way that significance thresholds are calculated poses a conceptual problem for such an analysis: as the significance threshold, we define the maximum sequenceness value across all participants, all time lags and all permutations. This sequenceness value is compared against the mean of all participants, disregarding the standard deviation. This maximum threshold would not change if we bootstrapped some of our samples. Additionally, the 95th-percentile threshold would also not change significantly. To illustrate this point, we have added this analysis to the supplement, as Supplement Figure 10. However, the new sign-flip permutation test we now include allows for such a comparison, as it takes variance between participants into account as well! We have included all three variants of the power analysis and the figure description now reads:

      “Supplement Figure 11 Power analysis of sequenceness significance for bootstrapped sample sizes. A) Powermap for state-permutation thresholds. However, here the bootstrap approach suffers from a conceptual problem: significance thresholds are defined by the permutation maximum and/or 95-percentile of the maximums across all sequence-permutations across participants. If we resample bootstrap-participants from our existing pool, the maximum thresholds computed will remain relatively stable across resampled participants, as it only compares against the mean and disregards the standard deviation. B) The newly presented statistical approach is significantly more sensitive at higher sample sizes. Note that even then, 80% power is only reached with replay density of higher than 50 min⁻¹ at a sample size of 60 participants. Additionally, the sign-flip permutation test assumes that the mean is at zero. As we observed a non-zero mean due to spurious oscillations, we subtracted the mean sequenceness of the baseline condition from each participant before permuting to achieve a null distribution with mean zero, as otherwise, we would have found significant replay effects in the baseline condition at increasing sample size. Nevertheless, due to the higher sensitivity, the new sign-flip test is recommended over the previous sequence-permutation-based test. Colours indicate the power from 0 to 1 for different bootstrapped sample sizes and densities. 80% power thresholds are outlined in black.”

      The task paradigm may introduce issues in detecting replay that are separate from TDLM. First, the localizer task involves a match/mismatch judgment and a button press during the stimulus presentation, which could add noise to classifier training separate from the semantic/visual processing of the stimulus. This localizer is similar to others that have been used in TDLM studies, but notably in other studies (e.g., Liu, Mattar et al., 2021), the stimulus is presented prior to the match/mismatch judgment. A discussion of variations in different localizers and what seems to work best for decoding would be useful to include in the recommendations section of the discussion.

      We agree and thank the referee for raising this issue. Note, we acknowledge we forgot to mention that these trials were excluded from classifier training. Our rationale of presenting the oddball during stimulus presentation, and not thereafter, was an assumption that by first presenting the audio and then the visual cue we would create more generalized representations that would be less modality-dependent. However, importantly, we excluded all trials that were oddballs from localizer training. Therefore we assume that this particular design choice will not greatly affect the decoder training. If some motor-preparation activity is present during the stimulus presentation, then it should be present equally across all trials and hence be ignored by the classifier as we balanced the transitions between images. We now added this information to the main text:

      “In each trial, a word describing the stimulus was played auditorily, after which the corresponding stimulus was shown. In ~11% of cases, there was a mismatch between word and image (oddball trials), and these trials were excluded from the localizer training.” Additionally in the methods section: “These oddball-trials were excluded from all further analysis and decoder training.”

      Nevertheless, we agree that the extant variety in localizer designs is underdiscussed where many assumptions of classifier training are not, as yet, fully validated. We have added a sentence highlighting different oddball paradigms to the section on the discussion of localizers and also add a summary statement with recommendations. The passage now reads:

      “Additionally, a wide variety of oddballs has been used (e.g. upside-down, scrambled, or mismatched images, cues presented visually, as words, auditorily, etc), and at this time it is unclear if these affect the representations that the classifier learns [...] In summary, we would expect a multimodal categorical localizer, and a classifier that isn’t trained on a specific timepoint, to generalize best.”

      Second, and more seriously, I believe that the task design for training participants about the expected sequences may complicate sequence decoding. Specifically, this is because two images (a "tuple") are shown together and used for prediction, which may encourage participants to develop a single bound representation of the tuple that then predicts a third image (AB -> C rather than A -> B, B -> C). This would obviously make it difficult to i) use a classifier trained on individual images to detect sequences and ii) find evidence for the intended transition matrix using TDLM. Can the authors rule out this possibility?

      We thank the reviewer for raising a possibility we had not considered! While there is some evidence that a single bound representation would have overlap with its constituents (especially before long-term consolidation) and therefore be detectable by the classifiers, we acknowledge the possibility that individual classifiers would fail to be sensitive to such a compound representation. In fact, we find in the retrieval data some evidence for a combined replay of representations (where representations are replayed seemingly at the same time, see Kern 2024). We have added such a possibility to the interim discussion of Study 1 as a qualification. However, this does not change the results or interpretation of our simulation, which we consider a key message of the paper.

      The relevant segment in the discussion section now reads:

      “Additionally, given that the stimuli were presented in combined triplets, participants may have formed a singular representation of associated items and subsequently replayed these (e.g., AB→C), instead of replaying item-by-item transitions (A→B→C). Under such a scenario, a classifier trained on individual items may fail to detect these newly formed bound representations, particularly if they diverge strongly from the single-item patterns. In our previous study where we address retrieval (Kern et al., 2024) we found that states were to varying extent co-reactivated, yet classifiers trained on single items retained sensitivity to detect these combined reactivation events. Consistent with this, prior work suggests that unified representations retain overlap with their constituent item representations (Dennis et al., 2024; Liang et al., 2020), however, there’s also evidence that different brain regions are involved if representational unitization occurs (Staresina & Davachi, 2010), potentially confusing classifiers. Therefore, we cannot exclude that rest-related consolidation replays engendered unitized representations that were insufficiently captured by our single-item classifiers.”

      Participants only modestly improved (from 76-82% accuracy) following the rest period (which the authors refer to as a consolidation period). If the authors assume that replay leads to improved performance, then this suggests there is little reason to see much task-related replay during rest in the first place. This limitation is touched on (lines 228-229), but I think it makes the lack of replay finding here less surprising. However, note that in the supplement, it is shown that the amount of forward sequenceness is marginally related to the performance difference between the last block of training and retrieval, and this is the effect I would probably predict would be most likely to appear. Obviously, my sample size concerns still hold, and this is not a significant effect based on the null hypothesis testing framework the authors employ, but I think this set of results should at least be reported in the main text.

      We disagree that an absence or presence of replay might be inferred from an absolute memory enhancement. While consolidation can lead to absolute improvement of performance in, for example, motor memory domains, one formulation is that in declarative learning tasks replay stabilizes latent memory traces and, in such a scenario, would not necessarily lead to boosted performance. While many declarative consolidation studies report an increase of performance compared to a control condition (i.e. without a consolidation window), this does not necessarily entail an absolute performance increase, as replay might just act to protect against loss of memory traces. Therefore, the modest increase we observe does not permit inference as to the presence or absence of replay absent a proper control condition.

      We did expect to find a correlation between replay and individual behavioural performance. Indeed, a weak correlation between performance and sequenceness can be detected. However, as we also show, any such correlation is overshadowed by baseline fluctuations in sequenceness such that its overall validity is questionable, even under very high replay rates. We would therefore be circumspect about this correlation even if it were significant. Therefore, in the discussion, we chose to refrain from putting much focus on this correlation. Nevertheless, we do add a short statement to the corresponding figure label, discussing this precise issue. The segment now reads:

      “While we found a non-significant relation between a memory performance enhancement and post-learning forward sequenceness we are cautious not to overinterpret these results. As in the section “Correlation with behaviour only present at high replay speeds” the noted correlational measure oscillates heavily with baseline sequenceness fluctuations, and any true replay effect is likely to be overshadowed by such fluctuations.”

      I was also wondering whether the authors could clarify how the criterion over six blocks was 80% but then the performance baseline they use from the last block is 76%? Is it just that participants must reach 80% within the six blocks *at some point* during training, but that they could dip below that again later?

      We thank the reviewer for highlighting this point: The first block wherein participants reached >80% ended the learning blocks. After a maximum of six blocks the learning session was ended regardless of performance. Therefore, some participants’ learning blocks were ended after six blocks and without them reaching a performance of 80%. While we described this in the Methods section, it was missing from the Results Study I section, which now contains:

      “[...] Participants then learned triplets of associated items according to a graph structure. Within the learning session, participants performed a maximum of six learning blocks, but the session was stopped if participants reached 80% memory performance (criterion learning, up to a memory performance criterion of 80%; see Methods for details)”

      The Figure 2 description now contains

      “[...] Participants completed up to six blocks of learning trials. After reaching 80% in any block, no more learning blocks were performed (criterion learning) [...]”

      Lastly, there was a mistake in the Behavioural results section, which stated “All thirty participants, except one, [..] to criterion of 80%.” This is an error. In our preregistration, we specified that we would include only participants who learned at all above chance. Here, we meant that only one participant failed to reach the criterion that we defined as “successful learning”. We fixed it and it now reads:

      “with an accuracy above 50% (which we preregistered beforehand as an exclusion criterion for “successful learning above chance”).”

      Additionally, we have noted this for clarity in the methods section and apologize for this mistake:

      “Additionally, as successful above-chance learning was necessary for the paradigm, we ensured all remaining participants had a retrieval performance of at least 50% (one participant had to be excluded, but was already excluded due to low decoding performance).”

      Because most of the conclusions come from the simulation study, there are a few decisions about the simulations that I would like the authors to expand upon before I can fully support their interpretations. First, the authors use a state-to-state lag of 80ms and do not appear to vary this throughout the simulations - can the authors provide context for this choice? Does varying this lag matter at all for the results (i.e., does the noise structure of the data interact with this lag in any way?)

      This was a deliberate choice, but we acknowledge that the reasoning behind it was not detailed in our initial submission. We chose a lag of 80 milliseconds for three reasons: first, it is distant from the 9-11 Hz alpha oscillations we observed in our participants and does not share a harmonic with the alpha rhythm; second, we wanted to get a clear picture of the effect of simulated replay that is as isolated as possible from spurious sequenceness confounders present in the baseline condition. Thus, we chose a lag in which the sequenceness score was close to zero in the baseline condition; third, in this revision, we subtracted the mean sequenceness value of the baseline such that any simulation effects would start, on average, at zero sequenceness. In this way, we could attribute any increase in sequenceness to the experimentally inserted replay, which was independent of spurious oscillations. Finally (but less importantly), as we observed that the correlation of sequenceness with behaviour fluctuated strongly, for the reason detailed above, we chose a lag at which this correlation was as close as possible to zero. If we had not chosen a lag that adhered to these conditions, we would have risked measuring simulated replay plus spurious sequenceness confounders.

      We have added a sentence to the main text detailing this justification:

      “We chose this timepoint (80 msec state to state lag) as its sequenceness value was close to zero in the baseline condition as well as being distant to the observed alpha rhythms of the participants (which varied between ~9-11 Hz). Additionally, we subtracted the mean sequenceness value of the baseline at 80 milliseconds lag such that any simulation effects would, on average, start at zero sequenceness.”

      Additionally, we now add a more detailed explanation to the methods section.

      “This time lag (80 msec) was chosen in order to isolate precisely an effect of the experimentally inserted sequenceness. Thus, we chose a lag at which the mean baseline sequenceness was close to zero and where the correlation with behaviour was low. Additionally, we subtracted the mean sequenceness value (at 80 milliseconds) at baseline from the specific lag recorded for each participant, such that simulation effects would be initialized at zero sequenceness on average enabling any effects to be attributed purely to inserted replay. Additionally, we excluded time lags too close to the alpha rhythms of participants (which varied between ~9-11 Hz) or lags which would have a harmonic with the rhythm.”
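The arithmetic behind the lag choice can be checked with a short sketch. An alpha rhythm of 9-11 Hz has a period of roughly 91-111 ms, so an 80 ms lag lies outside one period and outside its small integer multiples and fractions. The helper `near_alpha_harmonic` and its definition of "sharing a harmonic" (lag within k periods or 1/k of a period, for k up to 3) are assumptions for illustration; the source does not specify the exact criterion used.

```python
def near_alpha_harmonic(lag_ms, alpha_hz=(9.0, 11.0), max_multiple=3):
    """Return True if lag_ms coincides with the alpha period or a simple harmonic."""
    low_hz, high_hz = alpha_hz
    # Alpha period range in milliseconds (higher frequency gives shorter period).
    shortest, longest = 1000.0 / high_hz, 1000.0 / low_hz
    for k in range(1, max_multiple + 1):
        # Integer multiples of the period (k full cycles).
        if shortest * k <= lag_ms <= longest * k:
            return True
        # Integer fractions of the period (1/k of a cycle).
        if shortest / k <= lag_ms <= longest / k:
            return True
    return False


lag_80_conflicts = near_alpha_harmonic(80.0)
lag_100_conflicts = near_alpha_harmonic(100.0)
lag_50_conflicts = near_alpha_harmonic(50.0)
```

Under this definition, lags of 100 ms (one alpha cycle) and 50 ms (half a cycle) are flagged, while 80 ms is not.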

      Second, it seems that the approach to scaling simulated replays with performance is rather coarse. I think a more sensitive measure would be to scale sequence replays based on the participants' responses to *that* specific sequence rather than altering the frequency of all replays by overall memory performance. I think this would help to deliver on the authors' goal of simulating an "increase of replay for less stable memories" (line 246).

      The referee makes an excellent point and our simulations could be rendered more realistic by inserting the actual tuples that participants answered correctly. If we understand the point correctly, there are two different ways replay might be impacted by performance: First, we can conjecture that there is greater replay if memory performance is not saturated. Second, replay only occurs for content that has actually been encoded!

      The main reason we chose to simulate the entire sequence being replayed for each participant is as follows. TDLM is implemented such that the amount of replay alone is relevant, and the actual transitions do not affect the results beyond noise. Under the assumption that class-specific classifiers perform equally well, simulating A->B, B->C or simulating A->B, A->B yields equivalent results. However, results can differ if this assumption is violated. By drawing from the entire space of classes we insert, we minimize the risk of some classifiers being worse than others for some participants. For example, if we simulated only A->B for some participant instead of the whole sequence, and by chance classifier A performed suboptimally, we would introduce additional unwanted variance into our results.

      Secondly, from our reading of the literature we infer that replay is increased generally (i.e. density of learning-specific replay is increased) for less stable memories. However, we do not have indicators of memory strength, but only a binary “remembered or not”. As TDLM is invariant to the actual transitions being replayed and only indexes the number of transitions, we chose to ignore which transitions we insert and only scaled the amount of replay.

      We have added an analysis to the Appendix that discusses this specific aspect of our study where we show that results are equivalent if we simulate replay of “A->B B->C C->D” or only “A->B A->B A->B A->B”. As we do not know how replay density interacts with memory trace stability, we opted to leave the current simulation as is. The corresponding paragraph and figure description now read:

      “From literature we know that replay is increased after learning and that less stable memories are replayed more often. We simulated this effect by scaling our replay density inversely with performance. However, for simplicity, in our simulation, we inserted sampled transitions from all valid transitions given by the graph structure, i.e., the following transitions were valid: However, this meant that some participants would have transitions inserted that they didn’t actually remember. To show that this would not change results, we simulated two scenarios: In the full sequence scenario, all valid graph transitions are inserted (i.e. all participants’ replay is sampled from 'A->B, B->C, C->D, D->E, E->F, F->G, G->E, E->H, H->I, I->B, B->J, J->A'). In the second scenario (memorized transitions) we only replayed transitions that the participant actually retrieved correctly during the post-resting state testing sessions (i.e. a participant’s replay would have been sampled from ‘A->B, B->C, G->E, E->H, H->I’, if those were the ones he remembered). In both scenarios, the number of events is kept constant. The results are equivalent as can be seen in Appendix A Figure 3. NB this only holds under the assumptions that classifiers are equally good at decoding each class.”

      […]

      “TDLM is insensitive towards which transitions are replayed and only sensitive to how many transitions are detected in total. Here we simulate transitions either sampled from the full graph (light orange/green) or participant-specific transitions of trials that participants correctly remembered (dark orange/green). Shaded areas denote the standard error across participants.”

      On the other hand, I was also wondering whether it is actually necessary to use the real memory performance for each participant in these simulations - couldn't similar goals (with a better/more full sampling of the space of performance) be achieved with simulated memory performance as well, taking only the MEG data from the participant?

      The decision to use real memory performance is indeed arbitrary. We could have also used randomly sampled values. However, as we wanted to understand our null results better, we opted to use real performance to adhere as closely as possible to the findings we previously reported. Using uniformly sampled memory performance would be less explanatory w.r.t. our actual results of the resting state data that are reported in the first study of the manuscript (Study I).

      Nevertheless, our current implementation already presents an approach that samples the entire performance range for the sub-analysis focusing on the correlation with behaviour. Here, in the section on “best-case”-scenario, we implement this such that it spans factors from 1 to 0 (i.e., a participant with 100% performance gets a replay scale factor of 0 and hence no replay simulated, and the worst performing participant with 50% performance has a replay rate multiplied by 1). We scale the amount of replay with this factor. As a correlation is invariant to linear scaling, statistically this is equivalent to stretching the performance distribution from 0 to 100%. We have added a sentence to the methods to provide further focus on this point:

      “To assess how performance might affect replay in our specific dataset, we chose to use the original participants’ performance values instead of uniformly sampling the performance space (which ranged from 50 to 100%). However, for the correlation analysis, we additionally added a “best-case” scenario, in which we scale replay from 0 to 1, an approach that is statistically equivalent to scaling values to the full space of possible performance (0 to 100%) (see Results Study II: Simulation).”
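
      The invariance invoked in this response can be illustrated with a minimal sketch. It is not the analysis code of the study: the performance and sequenceness values are made up, and the linear mapping from performance to a replay scale factor (100% maps to 0, 50% maps to 1) is an assumption based on the description above. The sketch shows that Pearson's r is unchanged when performance is stretched from 50-100% to 0-100%, and only flips sign under the inverse scale factor.

```python
import numpy as np

# Hypothetical values, for illustration only (not data from the study).
performance = np.array([50.0, 62.5, 75.0, 87.5, 100.0, 70.0, 95.0])  # percent correct
sequenceness = np.array([0.90, 0.70, 0.60, 0.20, 0.10, 0.50, 0.15])

# Assumed linear mapping: 100% performance -> factor 0, 50% performance -> factor 1.
scale_factor = (100.0 - performance) / 50.0

# Stretching the observed 50..100% range to the full 0..100% range.
stretched = (performance - 50.0) * 2.0


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


r_raw = pearson_r(performance, sequenceness)
r_stretched = pearson_r(stretched, sequenceness)
r_factor = pearson_r(scale_factor, sequenceness)

print(f"raw: {r_raw:.4f}, stretched: {r_stretched:.4f}, scale factor: {r_factor:.4f}")
```

      Because both transformations are linear, the magnitude of the correlation is identical in all three cases; the scale factor merely reverses its sign, since it decreases as performance increases.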

      Finally, Figure 7D shows that 70ms was used on the y-axis. Why was this the case, or is this a typo?

      Thanks, this is indeed a typo, we fixed it.

      Because this is a re-analysis of a previous dataset combined with a new simulation study on that data aimed at making recommendations about how to best employ TDLM, I think the usefulness of the paper to the field could be improved in a few places. Specifically, in the discussion/recommendation section, the authors state that "yet unknown confounders" (line 295) lead to non-random fluctuations in the simulated correlations between replay detection and performance at different time lags. Because it is a particularly strong claim that there is the potential to detect sequenceness in the baseline condition where there are no ground-truth sequences, the manuscript could benefit from a more thorough exploration of the cause(s) of this bias in addition to the speculation provided in the current version.

      We are currently working on a theoretical basis to explain these spurious sequenceness confounders in the baseline condition. Indeed, in our preliminary work, in certain contexts we can induce significant sequenceness in the absence of any replay signal during baseline. However, this work is at an early stage and we still have some conceptual problems to solve before we are confident enough with these data. We believe at present it would be premature to add these data to the current manuscript. Nevertheless, we now mention these spurious sequenceness confounders to raise awareness for the field and also add greater context to the discussion, highlighting one of the issues that we think is of importance:

      “[…] For example, if two classifiers’ probabilities oscillate at 10 Hz but at a different phase, a spurious time lag can be found reflecting this phase shift. We speculate that more complex interactions between classifiers oscillating at different phases are also conceivable.”
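
      The phase-shift example in this quote can be reproduced with a noise-free toy sketch. It makes several simplifying assumptions that are not taken from the manuscript: the two "classifier probabilities" are pure 10 Hz sinusoids, the sampling rate is 1000 Hz, the phase offset corresponds to 30 ms, and a plain lagged correlation stands in for the TDLM regression. The variable names are hypothetical.

```python
import numpy as np

fs = 1000          # assumed sampling rate in Hz, so one sample equals 1 ms
freq = 10.0        # shared oscillation frequency in Hz
shift_ms = 30      # assumed phase offset of the second signal, expressed in ms

t = np.arange(10 * fs) / fs                             # 10 s of samples
a = np.sin(2 * np.pi * freq * t)                        # "classifier A"
b = np.sin(2 * np.pi * freq * (t - shift_ms / 1000))    # "classifier B", delayed copy

# Correlate a(t) with b(t + lag) for every lag within one oscillation cycle.
lags_ms = np.arange(1, 100)
corr = np.array([np.corrcoef(a[:-lag], b[lag:])[0, 1] for lag in lags_ms])

peak_lag_ms = int(lags_ms[np.argmax(corr)])
print(f"strongest apparent A->B lag: {peak_lag_ms} ms")
```

      No sequential reactivation was inserted, yet the lagged correlation peaks at the lag that matches the phase offset, which is the kind of spurious time lag described in the quoted passage.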

      In addition, to really provide that a realistic simulation is necessary (one of the primary conclusions of the paper), it would be useful to provide a comparison to a fully synthetic simulation performed on this exact task and transition structure (in addition to the recreation of the original simulation code from the TDLM methods paper).

      Thank you for this suggestion! We have now added a synthetic simulation, trying to keep as close as possible to the original simulation code in Liu et al. (2021), while also incorporating our current means of simulating the data (i.e. scaling by performance). We think this synthetic simulation greatly improves the paper and gives weight to our suggestion about the superiority of a hybrid approach. Additionally, it prompted us to look closer at patterns that are inserted in the synthetic simulation and perform a comparative analysis. We have now added the simulation to the main text, together with a methodological explanation of how we simulated the data in the methods section. We also added a discussion on the results and why we think a hybrid approach is currently superior to a synthetic approach. The whole new section is too long to paste here; it is found after the main simulation section in the manuscript. We have also added another sentence to the abstract referring to this new inclusion.

      Finally, I think the authors could do further work to determine whether some of their recommendations for improving the sensitivity of TDLM pan out in the current data - for example, they could report focusing not just on the peak decoding timepoint but incorporating other moments into classifier training.

      While we do understand the desire to test further refinements to TDLM on the data directly, we intentionally do not include such analyses in the current paper. Our experience informs us that there is an enormous branching factor of parameters when applying TDLM, with implications for the significance of results in one or other direction. However, there are currently only limited ways to know whether parameter changes actually improve the sensitivity to replay or instead exacerbate potential underlying confounders that induce spurious sequenceness (e.g., we can get significant replay in the control condition with some parameter changes). To exclude such false-positive findings, we opt for a relatively strict adherence to previously published approaches. Thus, in the current paper, we limit ourselves to assessing the reliability and robustness of previous approaches.

      Furthermore, while training on a later timepoint might increase sensitivity for a classifier when transferring between different modalities (e.g. visual to memory representation), this approach does not transfer well in our simulations, as the inserted patterns are from the same modality. We consider that other, more bespoke studies are better suited to improving classifier training. NB also see our recently started Kaggle challenge to tackle this problem: https://www.kaggle.com/competitions/the-imagine-decoding-challenge

      However, we have added a note about this dilemma to the improvement section. The section now includes:

      “Nevertheless, as the considerable branching factor poses a threat of increased false-positive findings we opt to focus the current simulations on previously published pipelines and parameters. Future studies should systematically evaluate parameter choices on TDLM under different conditions, something that is beyond the remit of the current study.”

      Lastly, I would like the authors to address a point that was raised in a separate public forum by an author of the TDLM method, which is that when replays "happen during rest, they are not uniform or close." Because the simulations in this work assume regularly occurring replay events, I agree that this is an important limitation that should be incorporated into alternative simulations to ensure the lack of findings is not because of this assumption.

      The temporal distribution of replay throughout the resting state should not matter, as TDLM is invariant w.r.t. how replay events are distributed within the analysis window. Specifically, it does not matter if replay events occur in bursts or are uniformly distributed. Only the number of transitions is relevant; where they occur, or whether they are close to each other, is not relevant to the numerical results (as long as the refractory window is kept; too short distances will lead to interactions between events and reduce sensitivity). To emphasize this point, we have added another simulation, which is shown in Appendix A.1 and Appendix A Figure 1. We have referenced it in the text and added the following paragraph in the Methods section:

      Additionally, the timepoints of inserting replay within the resting state are sampled from a uniform distribution. Even though TDLM tracks reactivation events over time, at a macro-scale the algorithm is invariant to the temporal distribution. At each time step, the GLM regresses onto a future time step up to the maximum time lag of interest, yielding a predictor per lag. However, these predictors within the GLM are independently assessed, and hence, TDLM is, outside of the time lag window, relatively invariant to the temporal distribution of replay. To demonstrate our claim, we simulated uniform replay vs “bursty” replay that only occurs in some parts of the resting state, both yield equivalent sequenceness results (see Appendix A.1).
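
      The claim can be made concrete with a heavily simplified sketch. It is not the TDLM implementation and not the simulation reported in Appendix A.1: the state time courses are white noise rather than classifier outputs on MEG data, only a single A->B transition is inserted, and only the first-level lagged regression at one lag is computed. All names, sizes and amplitudes are hypothetical.

```python
import numpy as np


def insert_events(noise, onsets, lag, src=0, dst=1, amp=3.0):
    """Return a copy of `noise` with src->dst reactivations inserted at `onsets`."""
    x = noise.copy()
    x[onsets, src] += amp          # source state reactivates at the onset
    x[onsets + lag, dst] += amp    # target state reactivates `lag` samples later
    return x


def lagged_beta(x, lag, src=0, dst=1):
    """Regress x(t + lag) on x(t) for all states jointly; return beta[src, dst]."""
    beta, *_ = np.linalg.lstsq(x[:-lag], x[lag:], rcond=None)
    return float(beta[src, dst])


rng = np.random.default_rng(0)
n_samples, n_states, lag, n_events = 20_000, 4, 8, 200
noise = rng.standard_normal((n_samples, n_states))

# Candidate onsets on a coarse grid so that inserted events never overlap.
slots = np.arange(0, n_samples - 2 * lag, 4 * lag)
n_burst_slots = len(slots) * 2 // 5   # bursty case: only the first 40% of the recording

uniform_onsets = rng.choice(slots, size=n_events, replace=False)
bursty_onsets = rng.choice(slots[:n_burst_slots], size=n_events, replace=False)

beta_baseline = lagged_beta(noise, lag)
beta_uniform = lagged_beta(insert_events(noise, uniform_onsets, lag), lag)
beta_bursty = lagged_beta(insert_events(noise, bursty_onsets, lag), lag)

print(f"baseline {beta_baseline:.3f}, uniform {beta_uniform:.3f}, bursty {beta_bursty:.3f}")
```

      With the same number of inserted events, the regression weight at the inserted lag is close to identical for the uniform and the bursty placement, because the estimate depends on sums over the whole window and not on where within the window the events fall.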

      Reviewer #3 (Public review):

      (1) I am still left wondering why other studies were able to detect replay using this method. My takeaway from this paper is that large time windows lead to high significance thresholds/required replay density, making it extremely challenging to detect replay at physiological levels during resting periods. While it is true that some previous studies applying TDLM used smaller time windows (e.g., Kern's previous paper detected replay in 1500ms windows), others, including Liu et al. (2019), successfully detected replay during a 5-minute resting period. Why do the authors believe others have nevertheless been able to detect replay during multi-minute time windows?

      (Due to similarity, we combined our responses with the first question of Reviewer 1)

      We are reluctant to make sweeping judgments in relation to previous literature, as we wanted to prioritize advancing methodology instead. The previous TDLM literature uses a diverse set of tasks and cognitive processes. As we state ourselves, it is possible that replay bursts in short time windows are well detectable by TDLM. We were intentionally cautious about directly critiquing previous studies without detailed re-analysis of their work and wanted to leave such a conclusion up to the reader. However, we realize that such a “thought-starter” might be warranted and improve the paper. Therefore, we have added the following paragraph to the discussion about “improving TDLM's sensitivity”:

      “Finally, what do our simulations imply for the broader MEG replay literature? Our implementation successfully detects replay when boundary conditions are met, as shown in the simulation. But sensitivity depends critically on high fidelity between the analysis window and the amount of replay events. A systematic evaluation of these conditions across prior studies is beyond the scope of this paper, so we do not want to adjudicate earlier findings and leave this assessment up to the reader. Instead, we delineate the boundary conditions and urge future work to conduct power analyses where possible and include simulations that approximate realistic experimental conditions.”

      For example, some studies using TDLM report evidence of sequenceness as a contrast between evidence of forwards (f) versus backwards (b) sequenceness; sequenceness was defined as ZfΔt - ZbΔt (where Z refers to the sequence alignment coefficient for a transition matrix at a specific time lag). This use case is not discussed in the present paper, despite its prevalence in the literature. If the same logic were applied to the data in this study, would significant sequenceness have been uncovered? Whether it would or not, I believe this point is important for understanding methodological differences between this paper and others.

      This approach was first introduced as part of a TDLM predecessor that utilized cross-correlations (Kurth-Nelson 2016), where this step is a necessity to extract any sequenceness signal at all by subtracting signals that are present in both (akin to an EEG reference). However, its validity is less clear when fwd and bkw are estimated separately, as is the case in the GLM. The rationale behind subtracting here is the same as for autocorrelations: there are oscillatory confounds present in the data that introduce spurious sequenceness in both directions alike, i.e. at the same time lag, that can simply be removed by subtracting. However, this assumption only holds if the sole confounder is auto-correlations caused by a global signal that oscillates at all sensors at the same phase. In our own experience, and as mentioned in the discussion, we do not think this assumption holds. Arguably, there are more complex interactions at play that cannot be removed by such a subtraction, such as an increase in false positives if confounders are in an opposite direction at a specific time lag. This assumption-violation can be seen in our baseline condition, where spurious sequenceness diverges in opposite directions for some time lags (e.g. at ~90 ms where forward sequenceness is negative and backward sequenceness is positive). We reasoned that oscillatory confounds are more stable when comparing pre vs post for the same direction than when comparing forward minus backward within a session.

      Finally, we note issues introduced by the various ways that sequenceness has been analysed in previous papers: normalization of sequenceness (z-scoring across time lags or across participants or not at all), normalization of probabilities (taking raw decision scores, z-scoring, soft-max, dividing by mean, subtracting mean), taking a windowed approach and summing sequenceness scores, not to mention the various classifier choices that can be made. All of this can be applied before or after subtracting conditions from each other. In our experience there is insufficient regard for controlling for multiple comparisons when running all these analyses, which risks selective reporting.

      Nevertheless, subtracting backward from forward sequenceness is probably as valid as post minus pre. Therefore, we have added fwd-bkw plots to the supplement and explained some of the reasoning for not reporting them in the main text in the figure label. The figure label and reference now read:

      “Finally, we report forward minus backward sequenceness and our motivation for using an across-session post-pre comparison instead of within-session forward-backward in Supplement Figure 10.”

      […]

      “Forward minus backward sequenceness within each resting state session. Previous papers often report subtraction of backward from forward sequenceness (fwd-bkw) as a means to remove oscillatory confounds that impact both sequenceness directions in synchrony. While required in early cross-correlation approaches (Kurth-Nelson et al., 2016), its validity in GLM-based frameworks depends on an assumption that confounds are global and in-phase across sensors. We observed this assumption is violated in our baseline data, where spurious sequenceness occasionally diverges in opposite directions at specific time lags (e.g., ~90 ms). In such instances, subtraction would increase the false-positive rate rather than suppress noise. In Figure 3B, we prioritized the comparison of pre-task versus post-task sequenceness within the same direction, as oscillatory confounds appeared more stable across time within a single direction, as opposed to across directions within a single session. However, we consider both approaches are valid. We now provide the fwd-bkw plots for completeness and comparison with previous literature. A) forward minus backwards sequenceness for Control (left) and Post-Learning resting-state (right). B) T-value distribution of the sign-flip permutation test for Control (left) and Post-Learning resting-state (right)”
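
      The arithmetic behind this argument can be sketched in a few lines. The numbers are made up and the additive model (measured sequenceness equals true replay plus a direction-specific confound) is an assumption introduced purely for illustration; it is not a model stated in the manuscript.

```python
def fwd_minus_bkw(replay_fwd, confound_fwd, confound_bkw):
    """Forward minus backward sequenceness under a simple additive confound model."""
    fwd = replay_fwd + confound_fwd   # measured forward sequenceness
    bkw = confound_bkw                # measured backward sequenceness (no true replay)
    return fwd - bkw


# In-phase confound: identical in both directions, so subtraction removes it.
in_phase = fwd_minus_bkw(replay_fwd=0.5, confound_fwd=0.25, confound_bkw=0.25)

# Opposite-sign confound with no replay at all (forward negative, backward positive):
# subtraction doubles the confound instead of cancelling it.
opposite = fwd_minus_bkw(replay_fwd=0.0, confound_fwd=-0.25, confound_bkw=0.25)

print(in_phase, opposite)
```

      In the first case the difference recovers the inserted effect, whereas in the second case a purely spurious difference appears that is twice as large as the confound in either direction.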

      (2) Relatedly, while the authors note that smaller time windows are necessary for TDLM to succeed, a more precise description of the appropriate window size would greatly improve the utility of this paper. As it stands, the discussion feels incomplete without this information, as providing explicit guidance on optimal window sizes would help future researchers apply TDLM effectively. Under what window size range can physiological levels of replay actually be detected using TDLM? Or, is there some scaling factor that should be considered, in terms of window size and significance threshold/replay density? If the authors are unable to provide a concrete recommendation, they could add information about time windows used in previous studies (perhaps, is 1500ms as used in their previous paper a good recommendation?).

      We currently do not have an empirical estimate of which window sizes are appropriate. While we used 1500ms in our previous paper, this was solely given by the experiment design which had a 1.5s wait period before the next stimulus. Our recommendation for best guidance on this matter would be to investigate related intracranial literature for SWR rate increases under similar experimental conditions. We have added the following paragraph to the discussion:

      “At this stage we cannot offer a general recommendation for window sizes as they are likely to depend on details of the research paradigm. However, intracranial recordings can be used as a proxy to estimate the duration of replay bursts, for example as reported in (Norman et al., 2019) where increased SWRs were seen up to 1500 ms after retrieval cue onset.”

      (3) In their simulation, the authors define a replay event as a single transition from one item to another (example: A to B). However, in rodents, replay often traverses more than a single transition (example: A to B to C, even to D and E). Observing multistep sequences increases confidence that true replay is present. How does sequence length impact the authors' conclusions? Similarly, can the authors comment on how the length of the inserted events impacts TDLM sensitivity, if at all?

      Good point! So far, most papers do not seem to include multi-step TDLM, and in our experience rightfully so: it is conceptually difficult to define clear significance thresholds, because shorter sub-sequences are contained within a longer sequence (e.g. ABC contains both AB and BC as well as a longer dependency AC), which makes it difficult to define the correct null distribution for the permutation test. Therefore, we tried to stay as close as possible to previous approaches and only looked for single-step transitions. Nevertheless, we have added an analysis to the supplement comparing how TDLM behaves if we simulate A->B->C or A->B and a separate B->C. It shows that TDLM is only sensitive to the number of transitions present in the data, and it does not matter if they are chained or chunked. The segment reads:

      “We intentionally designed our study to encourage replay of triplets. However, this begs the question as to whether it matters if triplets or individual chunks of a sequence are replayed at different time points? Here, we simulated two scenarios. In one, we inserted replay of single transitions alone with a refractory period, e.g. A->B and separate B->C transitions. In a second scenario, we simulate replay of chained triplets, e.g. A->B->C, with a distance of 80 milliseconds each. Importantly, we kept the number of transitions constant (i.e., A->B … B->C and A->B->C would both have 2 transitions). This creates a context wherein a four-minute resting state would have ~100 events of A->B->C inserted and ~200 events of A->B or B->C, such that in both cases this results in the same number of single step transitions. We found both are equivalent, with TDLM agnostic to the length of sequence trains, i.e., it does not matter if replay is chunked or chained under the assumption that the number of transitions remains fixed, as can be seen in Appendix A Figure 2.”

      And the reference Figure description reads:

      “TDLM is invariant to the length of sequence replay trains under an assumption that the number of target transitions (e.g. single steps) is fixed. We simulated replay either as two temporally separate A->B, B->C events (light orange/green) or as a single A->B->C event (dark orange/green), both yielding equivalent sequenceness. Shaded areas denote the standard error across participants”
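
      The bookkeeping behind the chained versus chunked comparison can be written out as a short sketch. It only counts single-step transitions and does not simulate any signal; the event encoding as strings and the helper names are hypothetical.

```python
def single_step_transitions(event):
    """Split a replay event such as 'ABC' into its single-step transitions."""
    return [(event[i], event[i + 1]) for i in range(len(event) - 1)]


def count_transitions(events):
    """Total number of single-step transitions contained in a list of events."""
    return sum(len(single_step_transitions(e)) for e in events)


chained = ["ABC"] * 100          # ~100 chained events of A->B->C
chunked = ["AB", "BC"] * 100     # ~200 separate single-transition events

print(len(chained), count_transitions(chained))   # number of events, number of transitions
print(len(chunked), count_transitions(chunked))

# A longer chained event contributes one transition per step, e.g. A->B->C->D->E has four.
long_events = ["ABCDE"] * 20
print(count_transitions(long_events))
```

      Both scenarios contain the same number of single-step transitions even though the number of events differs by a factor of two, which is why event density and transition density need to be kept apart when comparing simulations.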

      For example, regarding sequence length, is it possible that TDLM would detect multiple parts of a longer sequence independently, meaning that the high density needed to detect replay is actually not quite so dense? (example: if 20 four-step sequences (A to B to C to D to E) were sampled by TDLM such that it recorded each transition separately, that would lead to a density of 80 events/min).

      Indeed, this is an interesting proposal. We intentionally kept our simulation close to the way previous simulations were set up (i.e. Liu & Dolan et al 2021, Liu & Mattar 2021) by simulating one-step transitions and simulated them such that there is no overlap between separate events (e.g. by defining a refractory period). If the duration of replay is increased, then we would also need to increase the length of the refractory period, resulting in a reduced upper limit of how much replay can occur in a 1-minute time window. This in turn would approximate roughly the same number of transitions that can be inserted into the resting state and, as detailed above, would yield the same results. Nevertheless, as we chose to use replay density and not transition density as a marker, the density would be reduced, even if the number of transitions stays the same. We have added an analysis using multi-step replay to the supplement and discuss its implications and caveats. In the main discussion we have added the following segment:

      “Similarly, in our simulation, for simplicity and to keep consistency with previous simulations, we restricted replay events to span two reactivation events. While the characteristics of replay as measured by TDLM are unknown, it is conceivable that several steps can be replayed within one replay event. We show that the vanilla version of TDLM is fundamentally sensitive to the number of single-step transitions alone, and disregards if these are replayed chained or chunked (Appendix A.2 and Appendix A Figure 2). Nevertheless, if the number of reactivation events chained within a replay event increases, TDLM's sensitivity is increased relative to the replay density and thresholds are reached earlier (see Appendix A Figure 4). See Appendix A.4 for a simulation of multi-step replay events and our discussion of the caveats.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Please label the various significance thresholds in the legend of Figure 3.

      We have labelled all the thresholds in the figure legends.

      Reviewer #2 (Recommendations for the authors):

      I think that some of the clarity is hampered because there is a bit too much reliance on explanations from the previous paper using this task, which hampers clarity in the paper. For example, Figure 1 is not particularly useful for understanding the study in its current form; I found myself relying almost exclusively on Supplementary Figure 1 (which is from the previous paper). I'd recommend presenting some version of SF1 in the main text instead. Another example of this overreliance on the previous paper is that, as far as I can tell, the present paper never explicitly states which transitions are being tested in TDLM. In the prior work, it states "all allowable graph transitions", and so I assumed this was the same here, but the paper should standalone without having to go back to the other study. I'd recommend that the authors revise the paper in these and other places where the previous paper is mentioned.

      Thanks for raising this point! We were uncertain ourselves how to deal with the overlap in content and did not want to bloat the paper or plagiarize ourselves too much. On the advice of the referee, we have implemented the following to improve the manuscript and reduce its reliance on the previous paper:

      Supplement Figure 1 is indeed crucial to understanding the experiment. We have moved it to the methods section under Methods: Procedure

      Added more stimulus description to the Methods: Localizer section

      Included more details about the localizer and graph learning that were missing before

      We have added the note about which transitions we were looking for in the Methods section. Additionally, we have added this information to the Results section of Study 1.

      There are also a few typos I noticed:

      (1) Line 73: "during in the context of."

      (2) Line 287: " to exploring the."

      We fixed the typos.

      Reviewer #3 (Recommendations for the authors):

      (1) Why did the authors choose an 80ms state-to-state time lag for their simulation? I believe they should make the reason for this decision clear in the main text.

      Indeed, this point was also raised by the other reviewer. We have added a sentence to the main text about the rationale behind this decision:

      “We chose this timepoint (80 millisecond state-to-state lag) as its sequenceness value was close to zero in the baseline condition as well as being distant to the observed alpha rhythms of the participants (which varied between ~9-11 Hz). Additionally, we subtracted the mean sequenceness value of the baseline at 80 millisecond lag such that any simulation effects would, on average, start at zero sequenceness.”

      Additionally, we have added some further explanation to the Methods section.

      “This time lag (80 msec) was chosen in order to isolate precisely an effect of the experimentally inserted sequenceness. Thus, we chose a lag at which the mean baseline sequenceness was close to zero and where the correlation with behaviour was low. Additionally, we subtracted the mean sequenceness value (at 80 milliseconds) at baseline from the specific lag recorded for each participant, such that simulation effects would be initialized at zero sequenceness on average enabling any effects to be attributed purely to inserted replay. Additionally, we excluded time lags too close to the alpha rhythms of participants (which varied between ~9-11 Hz) or lags which would have a harmonic with the rhythm.”

      (2) Line 168: Can the authors define what these conservative and liberal criteria are in the text?

      We have added definitions of the criteria in the text. The text now reads:

      “[..] significance thresholds (conservative, i.e. the maximum sequenceness across all permutations and timepoints or liberal criteria, i.e. the 95% percentile of aforementioned sequenceness).”
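
      One plausible reading of these two criteria can be sketched as follows. The interpretation is an assumption and not a description of the authors' code: the sketch takes the absolute peak over time lags within each permutation, uses the maximum of those peaks as the conservative threshold and their 95th percentile as the liberal threshold. The function name and array layout are hypothetical.

```python
import numpy as np


def permutation_thresholds(perm_sequenceness):
    """Compute (conservative, liberal) thresholds from permuted sequenceness.

    perm_sequenceness: array of shape (n_permutations, n_lags) holding the
    sequenceness obtained under shuffled transition matrices.
    """
    peak_per_perm = np.abs(perm_sequenceness).max(axis=1)   # peak over lags, per permutation
    conservative = float(peak_per_perm.max())               # maximum over all permutations
    liberal = float(np.percentile(peak_per_perm, 95))       # 95th percentile of the peaks
    return conservative, liberal


# Deterministic toy input: 100 permutations, one lag, values 0..99.
toy = np.arange(100, dtype=float).reshape(100, 1)
conservative, liberal = permutation_thresholds(toy)
print(conservative, liberal)
```

      By construction the conservative threshold can never be lower than the liberal one, so an effect that passes the conservative criterion also passes the liberal criterion.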

      (3) Line 478: "calculate" instead of "calculated".

      (4) Figure 7 D: y-axis is labeled "70 ms" I believe it should be labeled 80 ms.

      Thanks, we fixed the two typos.

      (5) With replay defined as sequential reactivation at a compressed temporal timescale, many of the iEEG citations (lines 54-55) do not demonstrate replay (they show stimulus reinstatement or ripple activity, but not sequential replay). Replay studies in humans using intracranial methods have been mostly limited to those measuring single-unit activity, a good example being Vaz et al., 2020 (https://www.science.org/doi/10.1126/science.aba0672).

      We agree that, under a strict definition articulated by Genzel et al. that defines replay as sequential reactivation, many prior human iEEG studies are better described as stimulus reinstatement or ripple-related activity rather than true sequence replay. We have revised the text accordingly and now highlight the few intracranial microelectrode studies that demonstrate replay of firing sequences at the cellular/ensemble level in humans (Eichenlaub et al., 2020; Vaz et al., 2020), distinguishing these from macro-scale iEEG work providing indirect evidence alone.

      The revised paragraph now reads:

      “Replay has been shown using cellular recordings across a variety of mammalian model organisms (Hoffman & McNaughton, 2002; Lee & Wilson, 2002; Pavlides & Winson, 1989). Replay studies in humans using intracranial recordings are few, but include work demonstrating compressed replay of firing-pattern sequences in motor cortex during rest (Eichenlaub et al., 2020) as well as single-unit replay of trial-specific cortical spiking sequences during episodic retrieval (Vaz et al., 2020). By contrast, most iEEG studies report stimulus-specific reinstatement or ripple-locked activity changes without explicit demonstration of temporally compressed sequential replay (Axmacher et al., 2008; Staresina et al., 2015). As these methods are only applied under restricted clinical circumstances, such as during pre-operative neurosurgical assessments, this limits opportunities to investigate human replay. Therefore, this gives urgency to efforts aimed at developing novel methods to investigate human replay non-invasively.”

      (6) The expectations about replay frequency are grounded in literature on hippocampal replay sequences. However, MEG captures signals from across the entire brain, and the hippocampal contribution is likely relatively weak compared to all other signals. This raises an important question: is TDLM genuinely unable to detect replay at physiological (i.e., hippocampal) levels, or is it instead detecting a different form of sequential reactivation - possibly involving cortex or other regions - that may occur more frequently? More broadly, when we have evidence of replay from TDLM, do we believe it is the same thing as replay of CA1 place cell spiking sequences, as detected in rodents? Commenting on this distinction would help further develop theories of replay and what TDLM is measuring.

      This is indeed an important point that has garnered relatively little attention. While there is some evidence of a relation to hippocampal replay in form of high-frequency power increase in the hippocampus, ultimately it is not possible to know without intracranial recordings, as signal strength from those regions is rather poor in MEG.

      We have added the following segment to the manuscript that discusses these issues:

      “However, while we are using indices of SWRs as a proxy for replay density estimation, the relationship between hippocampal replay and replay detected by TDLM remains uncertain. While current decoding approaches measure replay-like phenomena on cortical sites, previous papers have reported a power increase in hippocampal areas coinciding with replay episodes as detected by TDLM. Nevertheless, it is conceivable that cortical replay found by TDLM could occur independently of hippocampal replay and SWRs and be generated by different mechanisms. Some TDLM-studies find a replay state-to-state time lag of above 100 ms, much slower than e.g. previously reported place cell replay. Future studies should employ simultaneous intracranial and cortical surface recordings to establish the relationship between hippocampal replay and replay found by TDLM.”

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors aimed to elucidate the recruitment order and assembly of the Cdv proteins during Sulfolobus acidocaldarius archaeal cell division using a bottom-up reconstitution approach. They employed liposome-binding assays, EM, and fluorescence microscopy with in vitro reconstitution in dumbbell-shaped liposomes to explore how CdvA, CdvB, and the homologues of ESCRT-III proteins (CdvB, CdvB1, and CdvB2) interact to form membrane remodeling complexes.

      The study sought to reconstitute the Cdv machinery by first analyzing their assembly as two subcomplexes: CdvA:CdvB and CdvB1:CdvB2ΔC. The authors report that CdvA binds lipid membranes only in the presence of CdvB and localizes preferentially to membrane necks. Similarly, the findings on CdvB1:CdvB2ΔC indicate that truncation of CdvB2 facilitates filament formation and enhances curvature sensitivity in interaction with CdvB1. Finally, while the authors reconstitute a quaternary CdvA:CdvB:CdvB1:CdvB2 complex and demonstrate its enrichment at membrane necks, the mechanistic details of how these complexes drive membrane remodeling by subcomplexes removal by the proteasome and/or CdvC remain speculative.

      Although the work highlights intriguing similarities with eukaryotic ESCRT-III systems and explores unique archaeal adaptations, the conclusions drawn would benefit from stronger experimental validation and a more comprehensive mechanistic framework.

      Strengths:

      The study of machinery assembly and its involvement in membrane remodeling, particularly using bottom-up reconstituted in vitro systems, presents significant challenges. This is particularly true for systems like the ESCRT-III complex, which localizes uniquely at the lumen of membrane necks prior to scission. The use of dumbbell-shaped liposomes in this study provides a promising experimental model to investigate ESCRT-III and ESCRT-III-like protein activity at membrane necks.

      The authors present intriguing evidence regarding the sequential recruitment of ESCRT-III proteins in crenarchaea, a close relative of eukaryotes. This finding suggests that the hierarchical recruitment characteristic of eukaryotic systems may predate eukaryogenesis, which is a significant and exciting contribution. However, the broader implications of these findings for membrane remodeling mechanisms remain speculative, and the study would benefit from stronger experimental validation and expanded contextualization within the field.

      We thank the Referee for his/her appreciation of our work.

      Weaknesses:

      This manuscript presents several methodological inconsistencies and lacks key controls to validate its claims. Additionally, there is insufficient information about the number of experimental repetitions, statistical analyses, and a broader discussion of the major findings in the context of open questions in the field.

      We have now added more controls, information about repetitions, and discussion.

      Reviewer #2 (Public review):

      Summary:

      The Crenarchaeal Cdv division system represents a reduced form of the universal and ubiquitous ESCRT membrane reverse-topology scission machinery, and therefore a prime candidate for synthetic and reconstitution studies. The work here represents a solid extension of previous work in the field, clarifying the order of recruitment of Cdv proteins to curved membranes.

      Strengths:

      The use of a recently developed approach to produce dumbbell-shaped liposomes (De Franceschi et al. 2022), which allowed the authors to assess recruitment of various Cdv assemblies to curved membranes or membrane necks; reconstitution of a quaternary Cdv complex at a membrane neck.

      We thank the Referee for his/her appreciation of the work.

      Weaknesses:

      The manuscript is a bit light on quantitative detail, across the various figures, and several key controls are missing (CdvA, B alone to better interpret the co-polymerisation phenotypes and establish the true order of recruitment, for example) - addressing this would make the paper much stronger. The authors could also include in the discussion a short paragraph on implications for our understanding of ESCRT function in other contexts and/or in archaeal evolution, as well as a brief exploration of the possible reasons for the discrepancy between the foci observed in their liposome assays and the large rings observed in cells - to better serve the interests of a broad audience.

      We have now added more controls, information about repetitions, and discussion.

      Reviewer #3 (Public review):

      Summary:

      In this report, De Franceschi et al. purify components of the Cdv machinery in archaeon M. sedula and probe their interactions with membrane and with one-another in vitro using two main assays - liposome flotation and fluorescent imaging of encapsulated proteins. This has the potential to add to the field by showing how the order of protein recruitment seen in cells is related to the differential capacity of individual proteins to bind membranes when alone or when combined.

      Strengths:

      Using the flotation assay, they demonstrate that CdvA and CdvB bind liposomes when combined. While CdvB1 also binds liposomes under these conditions, in the flotation assay, CdvB2 lacking its C-terminus is not efficiently recruited to membranes unless CdvAB or CdvB1 are present. The authors then employ a clever liposome assay that generates chained spherical liposomes connected by thin membrane necks, which allows them to accurately control the buffer composition inside and outside of the liposome. With this, they show that all four proteins accumulate in necks of dumbbell-shaped liposomes that mimic the shape of constricting necks in cell division. Taken altogether, these data lead them to propose that Cdv proteins are sequentially recruited to the membrane as has also been suggested by in vivo studies of ESCRT-III dependent cell division in crenarchaea.

      We thank the Referee for his/her appreciation of the work.

      Weaknesses:

      These experiments provide a good starting point for the in vitro study of the interaction of Cdv system components with the membrane and their consecutive recruitment. However, several experimental controls are missing, which complicates the authors' ability to draw strong conclusions. Moreover, some results are inconsistent across the two main assays, which makes the findings difficult to interpret:

      (1) Missing controls.

      Various protein mixtures are assessed for their membrane-binding properties in different ways. However, it is difficult to interpret the effect of any specific protein combination, when the same experiment is not presented in a way that includes separate tests for all individual components. In this sense, the paper lacks important controls. For example, Fig 1C is missing the CdvB-only control. The authors remark that CdvB did not polymerise (data not shown) but do not comment on whether it binds membrane in their assays. In the introduction, Samson et al., 2011 is cited as a reference to show that CdvB does not bind membrane. However, here the authors are working with protein from a different organism in a different buffer, using a different membrane composition and a different assay. Given that so many variables are changing, it would be good to present how M. sedula CdvB behaves under these conditions.

      We thank the referee for raising this point. We have now added these data in Figure 1C. Indeed it turns out that CdvB from M. sedula exhibits clear membrane binding on its own in a flotation assay.

      Similarly, there is no data showing how CdvB alone or CdvA alone behave in the dumbbell liposome assay.

      Without these controls, it's impossible to say whether CdvA recruits CdvB or the other way around. The manuscript would be much stronger if such data could be added.

      We have now added these data in Figure 1E, 1F and 1G. Overall, we can confirm that CdvA binds the membrane better in the presence of CdvB (although both proteins can bind the membrane on their own). Both proteins appear to recognize the curved region of the membrane neck.

      (2) Some of the discrepancies in the data generated using different assays are not discussed.

      The authors show that CdvB2∆C binds membrane and localizes to membrane necks in the dumbbell liposome assay, but no membrane binding is detected in the flotation assay. The discrepancy between these results further highlights the need for CdvB-only and CdvA-only controls.

      We have now added these controls in Figure 1. In addition, we would like to clarify that the flotation assay and the SMS dumbbell assay serve different purposes and are not directly comparable in quantitative terms. In the flotation assay, all the protein present as input is eventually recovered and visualized. Thus, quantitative information on the fraction of the total protein bound to lipids can be inferred from this assay. The SMS assay, in contrast, provides a very different kind of information. Because of the particular protocol required to generate dumbbells (De Franceschi, 2022), the total amount of protein in the inner buffer in dumbbells is not accurately defined: protein that is not correctly reconstituted (e.g. protein that aggregates while still in the droplet phase) will interfere with vesicle generation, with the result that dumbbells containing such aggregates are generally not formed in the first place. This makes it impossible to draw any quantitative conclusions about the proportion of the sample bound to lipids. The SMS assay is therefore not directly comparable to the flotation assay, but rather complementary to it. Indeed, the purpose of the SMS assay is to provide information about the curvature selectivity of the protein.

      (3) Validation of the liposome assay.

      The experimental setup to create dumbbell-shaped liposomes seems great and is a clever novel approach pioneered by the team. Not only can the authors manipulate liposome shape, they also state that this allows them to accurately control the species present on the inside and outside of the liposome. Interpreting the results of the liposome assay, however, depends on the geometry being correct. To make this clearer, it would seem important to include controls to prove that all the protein imaged at membrane necks lies on the inside of liposomes. In the images in SFig3 there appears to be protein outside of the liposome. It would also be helpful to present data to test whether the necks are open, as suggested in the paper, by using FRAP or some other related technique.

      We thank the Referee for his/her appreciation. The proteins are encapsulated inside the liposomes, not outside of them. While Figure S3 might give the appearance that there is some protein outside, this is actually just an imaging artifact. Author response image 1 (below) explains this: When the membrane and protein channel are shown separately, it is clear that the protein cluster that appeared to be ‘outside’ actually colocalizes with an extra small dumbbell lobe (yellow arrowhead). The protein appeared to be outside of it because (1) the protein fluorescent signal is stronger than the signal from the membrane, and (2) there is a certain time delay in the acquisition of the two channels (0.5-1 second), thus the membrane may have slightly shifted out of focus when the fluorescence was being acquired. We are confident that the protein is inside in these dumbbells because the procedure for preparing the dumbbells requires extensive emulsification by pipetting, which requires ≈ 1 minute. This time is more than sufficient for proteins with high affinity for the membrane, like ESCRT and Cdv, to bind the membrane. For an example of how fast binding under confinement can be, please see movie 2 from this paper: De Franceschi N, Alqabandi M, Miguet N, Caillat C, Mangenot S, Weissenhorn W, Bassereau P. The ESCRT protein CHMP2B acts as a diffusion barrier on reconstituted membrane necks. J Cell Sci. 2018 Aug 3;132(4):jcs217968.

      Moreover, in many instances, we observed that the protein is inside because, by increasing the gain in the images post-acquisition, a clear protein signal appears in the lumen (see Author response image 2).

      Author response image 1.

      Separate channels showing colocalization of protein and lipids (adapted from Figure S3). The zoom-in shows separate channels, highlighting that the CdvB2 cluster that seems to be ‘outside the dumbbell’ actually colocalizes with the small terminal lobe of the dumbbell, indicating that the protein is encapsulated within that lobe.

      Author response image 2.

      Residual protein present inside lumen of dumbbells as visualized by increasing the brightness post-acquisition.

      We are not sure what the referee means by “test whether the necks are open, as suggested in the paper”. We are confident that the lobes of dumbbells originated from a single floppy vesicle, and were therefore mutually connected with an open neck (at least at the onset of the experiment). We have performed extensive FRAP assays on dumbbells in previous papers (De Franceschi et al., ACS Nano 2022 and De Franceschi et al., Nature Nanotech 2024) which unequivocally proved that these chains of dumbbells are connected with open necks. We have now also performed a few FRAP assays with reconstituted Cdv proteins, which confirmed this point. We have added a movie of such an experiment to the manuscript (Movie 1).

      Investigating whether the necks are open or closed after Cdv reconstitution is indeed a very relevant question, that could be rephrased as “verify whether Cdv proteins or their combination can induce membrane scission”. This is however beyond the scope of this manuscript, as the current work merely addressed the question of hierarchical recruitment of Cdv proteins at the membrane. We plan to examine this in future work.

      (4) Quantification of results from the liposome assay.

      The paper would be strengthened by the inclusion of more quantitative data relating to the liposome assay. Firstly, only a single field of view is shown for each condition. Because of this, the reader cannot know whether this is a representative image, or an outlier? Can the authors do some quantification of the data to demonstrate this? The line scan profiles in the supplemental figures would be an example of this, but again in these Figures only a single image is analyzed.

      The images that we showed are indeed representative. The dumbbells that are generated by the SMS approach contain an “internal control”: in each dumbbell, the protein has the option of localizing at the neck or localizing elsewhere in the region of flat membrane. We see consistently that Cdv proteins have a strong preference for localizing at the neck.

      We would recommend that the authors present quantitative data to show the extent of co-localization at the necks in each case. They also need a metric to report instances in which protein is not seen at the neck, e.g. CdvB2 but not CdvB1 in Fig2I, which rules out a simple curvature preference for CdvB2 as stated in line 182.

      While the request for better quantitation is reasonable, this would require carrying out very significant new experiments at the microscope, which is rendered near-impossible since both first authors have left the lab for new positions.

      Secondly, the authors state that they see CdvB2∆C recruited to the membrane by CdvB1 (lines 184-187, Fig 2I). However, this simple conclusion is not borne out in the data. Inspecting the CdvB2∆C panels of Fig 2I, Fig3C, and Fig3D, CdvB2∆C signal can be seen at positions which don't colocalize with other proteins. The authors also observe CdvB2∆C localizing to membrane necks by itself (Fig 2E). Therefore, while CdvB1 and CdvB2∆C colocalize in the flotation assay, there is no strong evidence for CdvB2∆C recruitment by CdvB1 in dumbbells. This is further underscored by the observation that in the presented data, all Cdv proteins always appear to localize at dumbbell necks, irrespective of what other components are present inside the liposome. Although one nice control is presented (ZipA), this suggests that more work is required to be sure that the proteins are behaving properly in this assay. For example, if membrane binding surfaces of Cdv proteins are mutated, does this lead to the accumulation of proteins in the bulk of the liposome as expected?

      In the particular example of Figure 2I, it indeed appears that there are some clusters of CdvB2ΔC that do not contain CdvB1 (we indicated them in Author response image 3 by red arrowheads), while the yellow arrowheads indicate clusters that contain both proteins. It can be clearly seen that the clusters that do contain both proteins (yellow arrowheads) are localized at necks, while those that appear to contain only CdvB2ΔC (red arrowheads) are not localized at necks. This is no coincidence. The clusters indicated by the red arrowheads do contain CdvB1. However, these clusters rapidly diffuse on the membrane plane because they are not fixed at the neck: therefore, they constantly shift in and out of focus. Because there is a time delay in the acquisition of each channel (between 0.5 and 1 second), these clusters were in focus when the CdvB2ΔC signal was being acquired, but shifted out of focus when the CdvB1 signal was being acquired. This implies that the clusters indicated by the yellow arrowheads are stably localized at necks, which is precisely the point we wished to make with this experiment: because Cdv proteins have an affinity for curved geometry, they preferentially and stably localize at necks. Why don’t all the clusters localize at necks then? We estimate that the simple answer is that, in this particular case, there are more clusters than there are necks, so some of the clusters must necessarily localize somewhere else.

      Author response image 3.

      Current Figure 2H, where clusters that are double-positive for both CdvB1 and CdvB2ΔC are indicated by yellow arrowheads, while clusters that apparently only contain CdvB2ΔC are indicated by red arrowheads. It is observed that all the double-positive clusters are localized at necks.

      (5) Rings.

      The authors should comment on why they never observe large Cdv rings in their experiments. In crenarchaeal cell division, CdvA and CdvB have been observed to form large rings in the middle of the 1 micron cell, before constriction. Only in the later stages of division are the ESCRTs localized to the constricting neck, at a time when CdvA is no longer present in the ring. Therefore, if the in vitro assay used by the authors really recapitulated the biology, one would expect to see large CdvAB rings in Figs 1EF. This is ignored in the model. In the proposed model of ring assembly (line 252), CdvAB ring formation is mentioned, but authors do not discuss the fact that they do not observe CdvAB rings - only foci at membrane necks. The discussion section would benefit from the authors commenting on this.

      The referee is correct: it is intriguing that we don’t see micron-sized rings for CdvA and CdvB. We do note that our EM data (Fig. S1) show that CdvA on its own can form rings of about 100-200 nm diameter, well below the diffraction limit, that could well correspond to the foci that we optically resolve in Figure 1. We have now added a brief comment on this to the manuscript on lines 256-264.

      (6) Stoichiometry

      It is not clear why 100% of the visible CdvA and 100% of the visible CdvB are shifted to the lipid fraction in 1C. Perhaps this is a matter of quantification. Can the authors comment on the stoichiometry here?

      We agree that this was unclear. Since that particular gel was stained with Coomassie, the quantitative signals might be unreliable, and hence we have repeated this experiment using fluorescently labelled proteins, which indeed show a less extreme distribution. This was also done to make the data more uniform, as requested by the referees.

      (7) Significance of quantification of MBP-tagged filaments.

      Authors use tagging and removal of MBP as a convenient, controllable system to trigger polymerisation of various Cdv proteins. However, it is unclear what is the value and significance of reporting the width and length of the short linear filaments that are formed by the MBP-tagged proteins. Presumably they are artefactual assemblies generated by the presence of the tag?

      Providing a measure of the changes induced by MBP removal, in fact, validates that this actually has an effect. But perhaps this places too much emphasis on the short filaments. We now opted for a compromise, removing the quantification of the width and length of short filaments formed by MBP-tagged protein from the text, but keeping the supplementary figure showing their distribution as compared to the other filaments (Figure S2E, SF).

      Similarly, Figure 2C doesn't seem a useful addition to the paper.

      We removed panel 2C, and now merely report these values in the text.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I would suggest the authors perform a deeper discussion about their findings, such as what are the evolutionary implications, how they think lipids from these archaea may affect the recruitment process,...

      Because there is no exact homology between archaeal Cdv proteins and eukaryotic ESCRT-III proteins, we do not feel our work brings new evolutionary implications beyond what we already state in the manuscript. We also did not perform experiments using archaeal lipids, thus we would rather not speculate on how they may potentially affect the recruitment of Cdv proteins.

      In general, the manuscript lacks information regarding some scale bars, number of experimental repetitions (n or N), statistical analysis when needed, information about protein concentrations used in their assays.

      We have now added this information in the manuscript.

      Below, I provide a list of comments that I think the authors should address to improve the manuscript:

      (1) Line 113-114: The authors test protein-membrane interactions using flotation assays with positively curved SUV membranes but encapsulate proteins in dumbbell-shaped liposomes with negative curvature at the connecting necks. Might the use of membranes with opposite curvatures affect the recruitment process? Since the proteins are fluorescently labeled, I suggest testing recruitment using flat giant unilamellar vesicles or supported lipid bilayers (with zero curvature) to validate their findings.

      We thank the referee for this suggestion. Please do note that we are not claiming in our paper that Cdv proteins recognize negative curvature. We merely observe that they localize at necks. The neck of a dumbbell exhibits the so-called “catenoid” geometry, which is characterized by having both positive and negative curvature.

      Experimentally, regarding the SUVs, we now realize there was a mistake in the methods section: in the flotation assay we in fact used multilamellar vesicles, not SUVs, precisely for the reason mentioned by the referee. We apologize for the oversight and have now corrected this in the methods. Multilamellar vesicles are not characterized by a strong positive curvature as SUVs are, but we do agree that they likely don’t have negative curvature either. Because of the heterogeneous nature of the multilamellar vesicles, they provide a binding assay that is rather independent of the curvature. Complementary to the flotation assay, the SMS approach was employed to reveal the curvature preference of proteins.

      Finally, we performed the experiment on large GUVs suggested by the referee using CdvB as an example, but this turned out to be inconclusive because the protein forms clusters: these clusters may be creating local curvature at the nanometer scale, which cannot be resolved by optical microscopy (Author response image 4). This is quite typical for proteins that recognize curvature (cf. for instance: De Franceschi N, Alqabandi M, Miguet N, Caillat C, Mangenot S, Weissenhorn W, Bassereau P. The ESCRT protein CHMP2B acts as a diffusion barrier on reconstituted membrane necks. J Cell Sci. 2018 Aug 3;132(4):jcs217968.)

      Author response image 4.

      Fluorescently labelled CdvB bound to giant unilamellar vesicle. The protein was added in the outer buffer. CdvB forms distinct clusters, which may generate a local region of high membrane curvature.

      (2) Line 138-139: How is His-ZipA binding the membrane? Wouldn't Ni<sup>2+</sup>-NTA lipids be required? If not, how is the binding achieved?

      Indeed, NTA-lipids were present. This is now stated both in the legend and in the methods.

      (3) In the encapsulated protein assays, why does the luminal fluorescence intensity of the encapsulated protein sometimes appear similar to the bulk fluorescence signal? Since only a small fraction of the protein assembles at membrane necks, shouldn't the luminal pool of unbound protein show higher fluorescence intensity inside the liposomes?

      We thank the referee for raising this point and giving us the opportunity to explain this. The reason is that Cdv proteins have a very high affinity for the neck, and when they cluster at the neck the fluorescence intensity of the cluster is many times higher than the background fluorescence. Because we were interested in imaging the clusters and avoiding overexposing them, we adjusted the imaging conditions accordingly, with the result that the fluorescence from both the lumen and the bulk is at a very low level.

      By choosing different imaging conditions, however, it can actually be seen that the signal inside the lumen is clearly higher than the bulk: this can be seen for instance in Author response image 2, where the brightness has been properly adjusted.

      (4) Line 184-185: In Fig. 2I, some CdvB2ΔC puncta seem independent of CdvB1 and are not localized at membrane necks. How many such puncta exist? For example, in the provided micrograph, 2 out of 5 clusters are independent of CdvB1. This proportion is significant. Could the authors quantify the prevalence of these structures and discuss why they form?

      We thank the referee for giving us the opportunity to explain this apparent discrepancy. We would like to stress the fact that CdvB2ΔC and CdvB1 form an obligate heterodimer: in all our experiments, without exception, we find that they form a strong complex when we mix the two proteins. This is true both in dumbbells and in flotation assays.

      In the particular example of Figure 2I, it indeed appears that there are some clusters of CdvB2ΔC that do not contain CdvB1 (we indicated them in Author response image 3 by red arrowheads), while the yellow arrowheads indicate clusters that contain both proteins. It can be clearly seen that the clusters that do contain both proteins (yellow arrowheads) are localized at necks, while those that appear to contain only CdvB2ΔC (red arrowheads) are not localized at necks. This is no coincidence. The clusters indicated by the red arrowheads do contain CdvB1. However, these clusters rapidly diffuse on the membrane plane because they are not fixed at the neck: therefore, they constantly shift in and out of focus. Because there is a time delay in the acquisition of each channel (between 0.5 and 1 second), these clusters were in focus when the CdvB2ΔC signal was being acquired, but shifted out of focus when the CdvB1 signal was being acquired. This implies that the clusters indicated by the yellow arrowheads are stably localized at necks, which is precisely the point we wished to make with this experiment: because Cdv proteins have an affinity for curved geometry, they preferentially and stably localize at necks. Why don’t all the clusters localize at necks then?

      (5) Figure 1E and 1F: Why do lipids accumulate and colocalize with the proteins? How can the authors confirm lumen connectivity between vesicles? Performing FRAP assays could validate protein localization and enrichment at the lumen of the membrane necks.

      At first sight, indeed some lipid enrichment seems to be observed at the neck between lobes of dumbbells.

      This is, however, an imaging artifact due to the fact that the neck is diffraction limited. As shown in the Author response image 5, we are acquiring the membrane signal from both lobes at the neck region, and therefore the signal is roughly double, hence the apparent lipid enrichment.

      Author response image 5.

      Schematic illustrating that the neck between two lobes is smaller than the diffraction limit of optical microscopy (the size of a typical pixel is indicated by the green square). Because of this technical limitation, the fluorescence intensity of the membrane at the neck is twice that of a single membrane.

      The referee is correct in pointing out that these images do not prove that the lobes are connected, and that FRAP assays are the only way to prove this point. However, in previous papers we have confirmed extensively that in chains of dumbbells the lobes are connected:

      - De Franceschi N, Pezeshkian W, Fragasso A, Bruininks BMH, Tsai S, Marrink SJ, Dekker C. Synthetic Membrane Shaper for Controlled Liposome Deformation. ACS Nano. 2022 Nov 28;17(2):966–78. doi: 10.1021/acsnano.2c06125.

      - De Franceschi N, Barth R, Meindlhumer S, Fragasso A, Dekker C. Dynamin A as a one-component division machinery for synthetic cells. Nat Nanotechnol. 2024 Jan;19(1):70-76. doi: 10.1038/s41565-023-01510-3.

      Random sticking of liposomes would also generate clusters of vesicles, not linear chains. We now also provide a movie (Movie 1) supporting this point.

      Investigating whether the necks are open or closed after Cdv reconstitution is indeed a very relevant question, that could be rephrased as “verify whether Cdv proteins or their combination can induce membrane scission”. This is however beyond the scope of this manuscript, as the current work merely addressed the question of hierarchical recruitment of Cdv proteins at the membrane. We plan to examine this in future work.

      (6) Why didn't the authors use the same lipid composition, particularly the same proportion of negatively charged lipids, on the SUVs of the flotation assays and on the dumbbell-shaped liposomes?

      In flotation assays, it is typical to use a relatively large proportion of negatively charged lipids, to promote protein binding. This is because the aim is to maximize membrane coverage by the protein. The SMS procedure to generate dumbbell-shaped GUVs is completely different, however. Rather than covering the membrane with protein, the idea is to reduce the amount of protein to a minimum, so that any curvature preference can be best visualized. This is e.g. routinely done in tube pulling experiments, for the same reason (See for instance Prévost C, Zhao H, Manzi J, Lemichez E, Lappalainen P, Callan-Jones A, Bassereau P. IRSp53 senses negative membrane curvature and phase separates along membrane tubules. Nat Commun. 2015 Oct 15;6:8529. doi: 10.1038/ncomms9529).

      (7) Line 117-119: The suggestion that polymer formation between CdvA and CdvB facilitates membrane recruitment is intriguing. However, fluorescence microscopy experiments could better elucidate whether there is sequential recruitment of CdvB followed by CdvA, or if these proteins form a heteropolymer composite for membrane binding. Can CdvB bind membranes independently, or does this require synergy between CdvA and CdvB?

      We thank the referee for prompting us to perform this experiment. As we now show in Figure 1C, CdvB indeed is able to bind the membrane independently of CdvA. Whether this happens sequentially or simultaneously is an interesting question, but one that is impossible to address with either the SMS or the flotation assay, because in both cases we can only observe the endpoint of the recruitment.

      We would also like to clarify one specific experimental detail. Perhaps unsurprisingly, the results from the flotation assay are dependent on the way the assay is performed. In particular, we observed that the same protein can exhibit a different binding profile depending on whether it is being loaded either at the top or at the bottom of the gradient. This can be seen in Author response image 6. This is counterintuitive, since once the equilibrium is reached, the result should only depend on the density of the sample. We performed an overnight centrifugation (> 16 hours) on a short tube (< 3 cm tall), thus equilibrium is being reached (which is corroborated by the fact that CdvB1 and CdvB2 can float to the top of the gradient within this timespan, as shown in Figure 2C, 2E, 2G). We ascribe the difference between top and bottom loading to the fact that, when the sample is loaded at the bottom, it has to be mixed with a concentrated sucrose solution, while in the case of loading from the top, this is not done.

      In the literature, both loading from the top and from the bottom have been used:

      - Lata S, Schoehn G, Jain A, Pires R, Piehler J, Gottlinger HG, Weissenhorn W. Helical structures of ESCRT-III are disassembled by VPS4. Science. 2008 Sep 5;321(5894):1354-7. doi: 10.1126/science.1161070.

      - Moriscot C, Gribaldo S, Jault JM, Krupovic M, Arnaud J, Jamin M, Schoehn G, Forterre P, Weissenhorn W, Renesto P. Crenarchaeal CdvA forms double-helical filaments containing DNA and interacts with ESCRT-III-like CdvB. PLoS One. 2011;6(7):e21921. doi: 10.1371/journal.pone.0021921.

      - Senju Y, Lappalainen P, Zhao H. Liposome Co-sedimentation and Co-flotation Assays to Study Lipid-Protein Interactions. Methods Mol Biol. 2021;2251:195-204. doi: 10.1007/978-1-0716-1142-5_14.

      In performing the flotation assay for CdvB1 and CdvB2ΔC, or when using all four proteins together, we loaded the sample at the bottom, and we could detect reproducible binding to liposomes (Figures 2D, 2F, 2H, 3A). However, CdvB does not bind the membrane when loaded at the bottom. Thus, for the experiments shown in Figure 1C, we loaded the proteins at the top. This experimental setup allowed us to show that CdvB indeed induces a stronger interaction between CdvA and the membrane.

      Author response image 6.

      CdvB binding to multilamellar vesicles in a flotation assay. In the left panel, the sample was loaded at the top of the sucrose gradient; in the right panel it was loaded at the bottom.

      (8) Line 165-173: The authors claim that filament curvature differs between CdvB2ΔC alone and the CdvB1:CdvB2ΔC complex. Are these differences statistically significant? What is the sample size (N)? Furthermore, how do the authors confirm interactions between these proteins in the absence of membranes based solely on EM micrographs?

      We can confirm that the filaments are composed of both proteins, because the filaments have a different curvature when both proteins are present. However, as requested by Referee 3, point (7), we removed the quantification of curvature from panel 2C. We report the N number in the text.

      (9) Line 121-123: Are the authors referring to positive or negative membrane curvatures? The cited literature suggests ESCRT-III proteins either lack curvature preferences (e.g., Snf7, CHMP4B) or prefer high positive curvature (e.g., late ESCRT-III subunits). This is confusing since the authors later test recruitment to negatively curved necks.

      We do not claim that Cdv proteins prefer positive or negative curvature, because the necks present in dumbbells have a catenoid geometry, which includes both positive and negative curvature. We have now clarified this in the discussion.

      (10) Since the conclusions rely on the oligomeric state of the proteins, providing SEC-MALS spectra to show the protein oligomeric state right after the purification would strengthen the claims.

      While such SEC-MALS experiments may be interesting, their practical implementation is not possible, since both first authors have left the lab for new positions.

      (11) Line 157-160: Suppl. Fig. 2 shows only a single EM micrograph of a small filament. Could the authors provide lower magnification images showing more filaments?

      As requested by Referee 3, point (7), we have toned down the importance of these short filaments.

      Also, why are the sample sizes for filament length (N=161) and width (N=129) different?

      Protein filaments formed by Cdv tend to stick to each other side by side, so that for some filaments the width could not be accurately assessed, and accordingly those were removed from the analysis.

      (12) The introduction states that CdvA binds membranes while CdvB does not. However, the results suggest CdvB facilitates membrane binding, helping CdvA attach. This discrepancy needs further explanation.

      We thank the referee for raising this point. We have now performed additional experiments (both SMS assay and flotation assays) showing that indeed CdvB from M. sedula is (unlike CdvB from Sulfolobus) able to bind the membrane on its own (Figure 1C, 1F).

      Reviewer #2 (Recommendations for the authors):

      Best practice would be to show single fluorescence channels in grayscale or inverted grayscale, retaining pseudocolouring only for the merged multichannel image.

      We decided to retain and standardize the colors, both for gels and for microscopy images, in order to have the same color-code for each protein. We believe this improves readability, and this was also a request from Referee 3. Thus, throughout the manuscript, CdvA is in grayscale, CdvB in yellow, CdvB1 in green, CdvB2ΔC in cyan and the membrane in magenta.

      It would be great to include a quantification of liposome curvature vs focal intensity of the various Cdv components - across figures.

      Quantification of liposome curvature at the neck can be done (De Franceschi et al., Nature Nanotech. 2024). However, in practice, this requires transferring of the sample post-preparation into a new chamber in order to increase the signal-to-noise ratio of the encapsulated dye, a procedure that drastically reduces the yield of dumbbells. The very sizeable amount of work required to obtain reliable measurements, especially considering all the proteins and protein combinations used in this study, indicates that this represents a project in itself, which goes well beyond the scope of this manuscript.

      Reviewer #3 (Recommendations for the authors):

      (1) We would encourage the authors to consider including the length of the scale bar next to the scale bar in each image and not in the figure description. This would greatly aid in clarity and interpretation of figures.

      We have now written the length of the scale bar in the figures.

      (2) In a similar vein, could the authors consider labeling panels throughout the manuscript, writing which sample is being presented? This goes mainly for the negative stain and the dumbbell fluorescence images, as having to continuously consult the figure legend hinders clarity.

      We have now labelled the EM images as requested by the referee.

      (3) Lines 254-256: would the statement hold not only for CdvB2∆C, but for all imaged proteins? They all seem to localize to membrane necks, presumably favoring membrane binding to a specific membrane topology.

      We agree with the referee, and changed the phrasing accordingly.

      (4) CdvB2∆C construct - presumably this was a truncation of helix 5 of the ESCRT-III domain? Figure 1A shows that the ESCRT-III domain spans residues 34-170 and therefore implies that all five ESCRT-III helices (which make up the ESCRT-III domain) are present in the C-terminal truncation. Could the authors clarify?

      Indeed, the truncation was done at residue 170.

      (5) Results of the liposome flotation assays are presented inconsistently across the three figures (Figs 1C, 2DFH, and 3A). This makes it more difficult than it needs to be to interpret and compare results. Could the authors consider presenting the three gels in a more similar, standardized way across the three figures?

      To improve readability, we now standardized the colors, both for gels and for microscopy images, in order to have the same color-code for each protein. Thus, throughout the manuscript, CdvA is in grayscale, CdvB in yellow, CdvB1 in green, CdvB2ΔC in cyan and the membrane in magenta.

      (6) From the data presented in Fig 1EF, it cannot be concluded whether CdvB and CdvA colocalize, as only one protein is labelled. Is there a technical reason for this?

      We have now repeated the same experiment by having both proteins labelled, confirming that there is co-localization at the neck (Figure 1G).

      (7) Fig 2C: is the difference between the two samples significant?

      As requested by Referee 3, we have removed Figure 2C.

      (8) Fig 2I is missing a 'merged' panel.

      We have now added the merged panel.

      (9) The fluorescence intensity plots in Supp Figs 1C and 3C would be easier to interpret if the lipid and protein signal would be plotted on the same plot (say, with normalized fluorescence intensity)

      It is not immediately obvious to us what the signal should be normalized to. What we wished to convey with these plots was that the intensity of proteins spikes at the neck region. In an attempt to improve clarity, we have now aligned the plots vertically, and highlighted the position of the neck.

      (10) CdvA should have a capital "A" in Figure 3A, panel 3.

      We have now corrected this.

      (11) The discussion doesn't comment on the need to truncate CdvB2.

      This is explained in the Results section.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Kroeg et al. describe a novel method for 2D culture of human induced pluripotent stem cells (hiPSCs) to form cortical tissue in a multiwell format. The method claims to offer a significant advancement over existing developmental models. Their approach allows them to generate cultures with precise, reproducible dimensions and structure with a single rosette; consistent geometry; multiple neuronal and glial cell types (cellular diversity); and no necrotic core (often seen in free-floating models due to limited nutrient and oxygen diffusion). The researchers demonstrate the method's capacity for long-term culture, exceeding ten months, and show the formation of mature dendritic spines and considerable neuronal activity. The method aims to tackle multiple key problems of in vitro neural cultures: reproducibility, diversity, topological consistency, and electrophysiological activity. The authors suggest their potential in high-throughput screening and neurotoxicological studies.

      Strengths:

      The main advances in the paper seem to be: The culture developed by the authors appears to have optimal conditions for neural differentiation, lineage diversification, and long-term culture beyond 300 days. These seem to me as a major strength of the paper and an important contribution to the field. The authors present solid evidence about the high cell type diversity present in their cultures. It is a major point and therefore it could be better compared to the state of the art. I commend the authors for using three different IPS lines, this is a very important part of their proof. The staining and imaging quality of the manuscript is of excellent quality.

      We thank the reviewer for the positive comments on the potential of our novel platform to address key problems of in vitro neural culture, highlighting the longevity and reproducibility of the method across multiple cell lines.

      Weaknesses:

      (1) The title is misleading: The presented cultures appear not to be organoids, but 2D neural cultures, with an insufficiently described intermediate EB stage. For nomenclature, see: doi: 10.1038/s41586-022-05219-6. Should the tissue develop considerable 3D depth, it would suffer from the same limited nutrient supply as 3D models - as the authors point out in their introduction.

      We appreciate the opportunity to clarify this point. We respectfully disagree that the cultures do not meet the consensus definition of an organoid. In fact, a direct quote from the seminal nomenclature paper referenced by the reviewer states: “We define organoids as in vitro-generated cellular systems that emerge by self-organization, include multiple cell types, and exhibit some cytoarchitectural and functional features reminiscent of an organ or organ region. Organoids can be generated as 3D cultures or by a combination of 3D and 2D approaches (also known as 2.5D) that can develop and mature over long periods of time (months to years).” (Pasca et al 2022, doi: 10.1038/s41586-022-05219-6). Therefore, while many organoid types indeed have a more spherical or globular 3D shape, the term organoid also applies to semi-3D or non-globular adherent organoids, such as renal (Czerniecki et al 2018, doi.org/10.1016/j.stem.2018.04.022) and gastrointestinal organoids (Kakni et al 2022, doi.org/10.1016/j.tibtech.2022.01.006). Accordingly, the adherent cortical organoids described in the manuscript exhibit self-organization to single radial structures consisting of multiple cell layers in the z-axis, reaching ~200 µm thickness (therefore remaining within the limits for sufficient nutrient supply), with consistent cytoarchitectural topology and electrophysiological activity, and therefore meet the consensus definition of an organoid.

      (2) The method therefore should be compared to state-of-the-art (well-based or not) 2D cultures, which seems to be somewhat overlooked in the paper, therefore making it hard to assess what the advance is that is presented by this work.

      It was not our intention to benchmark this model quantitatively against other culture systems. Rather, we have attempted to characterize the opportunities and limitations of this approach, with a qualitative contrast to other culture methods. Compared to state-of-the-art 2D neural network cultures, adherent cortical organoids provide distinct advantages in:

      (1) Higher order self-organized structure formation, including segregation of deeper and upper cortical layers.

      (2) Longevity: adherent cortical organoids can be successfully kept in culture for at least 1 year, whereas 2D cultures typically deteriorate after 8-12 weeks.

      (3) Maturity, including the formation of dendritic mushroom spines and robust electrophysiological activity.

      (4) Cell type diversity including a more physiological ratio of inhibitory and excitatory neurons (10% GAD67+/NeuN+ neurons in adherent cortical organoids, vs 1% in 2D neural networks), and the emergence of oligodendrocyte lineage cells.

      On the other hand, limitations of adherent cortical organoids compared to 2D neural network cultures include:

      (1) Culture times for organoids are much longer than for 2D cultures and the method can therefore be more laborious and more expensive.

      (2) Whole cell patch clamping is not easily feasible in adherent cortical organoids because of the restrictive geometry of 384-well plates.

      (3) Reproducibility is prominently claimed throughout the manuscript. However, it is challenging to assess this claim based on the data presented, which mostly contain single frames of unquantified, high-resolution images. There are almost no systematic quantifications presented. The ones present (Figure S1D, Figure 4) show very large variability. However, the authors show sets of images across wells (Figure S1B, Figure S3) which hint that in some important aspects, the culture seems reproducible and robust.

      We made considerable efforts to establish quantitative metrics to assess reproducibility. We applied a quantitative scoring system of single radial structures at different time points for multiple batches of all three lines as indicated in Figure S1C. This figure represents a comprehensive dataset in which each dot represents the average of a different batch of organoids containing 10-40 organoids per batch. To emphasize this, we have adapted the graph to better reflect the breadth of the dataset. Additional quantifications are given in Figure S2 for progenitor and layer markers for Line 1 and in Figure 2 for interneurons across all three lines, showing relatively low variability. That being said, we acknowledge the reviewer’s concerns and have modified the text to reduce the emphasis of this point, pending more extensive data addressing reproducibility across an even broader range of parameters.

      (4) What is in the middle? All images show markers in cells present around the center. The center however seems to be a dense lump of cells based on DAPI staining. What is the identity of these cells? Do these cells persist throughout the protocol? Do they divide? Until when? Addressing this prominent cell population is currently lacking.

      A more comprehensive characterization of the cells in the center remains a significant challenge due to the high cell density hindering antibody penetration. However, dye-based staining methods such as DAPI and the LIVE/DEAD panel confirm a predominance of intact nuclei with very minimal cell death. The limited available data suggest that a substantial proportion of the cells in the center are proliferative neural progenitors, indicated by immunolabeling for SOX2 (Figure 2A,D; Figure S4C). Furthermore, we are currently optimizing the conditions to perform single cell / nuclear RNA sequencing to further characterize the cellular composition of the organoids.

      (5) This manuscript proposes a new method of 2D neural culture. However, the description and representation of the method are currently insufficient. (a) The results section would benefit from a clear and concise, but step-by-step overview of the protocol. The current description refers to an earlier paper and appears to skip over some key steps. This section would benefit from being completely rewritten. This is not a replacement for a clear methods section, but a section that allows readers to clearly interpret results presented later.

      We have revised the manuscript to include a more detailed step-by-step overview of the protocol.

      (b) Along the same lines, the graphical abstract should be much more detailed. It should contain the time frames and the media used at the different stages of the protocol, seeding numbers, etc.

      As suggested, we have adapted the graphical abstract to include more detail.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, van der Kroeg et al have developed a method for creating 3D cortical organoids using iPSC-derived neural progenitor cells in 384-well plates, thus scaling down the neural organoids to adherent culture and a smaller format that is amenable to high throughput cultivation. These adherent cortical organoids, measuring 3 x 3 x 0.2 mm, self-organize over eight weeks and include multiple neuronal subtypes, astrocytes, and oligodendrocyte lineage cells.

      Strengths:

      (1) The organoids can be cultured for up to 10 months, exhibiting mature dendritic spines, axonal myelination, and robust neuronal activity.

      (2) Unlike free-floating organoids, these do not develop necrotic cores, making them ideal for high-throughput drug discovery, neurotoxicological screening, and brain disorder studies.

      (3) The method addresses the technical challenge of achieving higher-order neural complexity with reduced heterogeneity and the issue of necrosis in larger organoids. The method presents a technical advance in organoid culture.

      (4) The method has been demonstrated with multiple cell lines which is a strength.

      (5) The manuscript provides high-quality immunostaining for multiple markers.

      We appreciate the reviewer’s acknowledgement of the strengths of this novel platform as a technical advance in organoid culture that reduces heterogeneity and shows potential for higher throughput experiments.

      Weaknesses:

      (1) Direct head-to-head comparison with standard organoid culture seems to be missing and may be valuable for benchmarking, ie what can be done with the new method that cannot be done with standard culture and vice versa, ie what are the aspects in which new method could be inferior to the standard.

      In our opinion, it would be extremely difficult to directly compare methods. Most notably, whole brain organoids grow to large and irregular globular shapes, while adherent cortical organoids have a more standardized shape confined by the geometry of a 384-well plate. Moreover, it was not our intention to benchmark this model quantitatively against other culture systems. Rather, we have attempted to characterize the opportunities and limitations of this approach, with a qualitative contrast to other culture methods, as addressed in response to comment 2 of Reviewer 1 above.

      (2) It would be important to further benchmark the throughput, ie what is the success rate in filling and successfully growing the organoids in the entire 384 well plate?

      Figure S1 shows the success rate of organoid formation and stability of the organoid structures over time. In addition, we have added the number of wells that were filled per plate.

      (3) For each NPC line an optimal seeding density was estimated based on the proliferation rate of that NPC line and via visual observation after 6 weeks of culture. It would be important to delineate this protocol in more robust terms, in order to enable reproducibility with different cell lines and amongst the labs.

      Figure S1 provides the relationship between proliferation rate and seeding density, allowing estimation of seeding densities based on the proliferation rate of the NPCs. However, we appreciate the reviewers' feedback and have modified the methods to provide more detail.

      Reviewer #3 (Public review):

      Summary:

      Kroeg et al. have introduced a novel method to produce 3D cortical layer formation in hiPSC-derived models, revealing a remarkably consistent topography within compact dimensions. This technique involves seeding frontal cortex-patterned iPSC-derived neural progenitor cells in 384-well plates, triggering the spontaneous assembly of adherent cortical organoids consisting of various neuronal subtypes, astrocytes, and oligodendrocyte lineage cells.

      Strengths:

      Compared to existing brain organoid models, these adherent cortical organoids demonstrate enhanced reproducibility and cell viability during prolonged culture, thereby providing versatile opportunities for high-throughput drug discovery, neurotoxicological screening, and the investigation of brain disorder pathophysiology. This is an important and timely issue that needs to be addressed to improve the current brain organoid systems.

      We thank the reviewer for highlighting the strengths of our novel platform. We appreciate that all three reviewers agree that the adherent cortical organoids presented in this manuscript reliably demonstrate increased reproducibility and longevity. They also commend its potential for higher throughput drug discovery and neurotoxicological/phenotype screening purposes.

      Weaknesses:

      While the authors have provided significant data supporting this claim, several aspects necessitate further characterization and clarification. Mainly, highlighting the consistency of differentiation across different cell lines and standardizing functional outputs are crucial elements to emphasize the future broad potential of this new organoid system for large-scale pharmacological screening.

      We appreciate the feedback and have added more detail on consistency and standardization of functional outputs.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Minor points

      (1) As the preprint is officially part of the eLife review, I have to remark that the preprint, which is made available on bioRxiv, suffers from a serious compatibility or format problem: one cannot highlight sentences as in a regular PDF, and when trying to copy-paste sentences from it, jumbled characters are copied to the clipboard.

      The updated version of the paper on bioRxiv should not suffer from these compatibility issues.

      (2) Since the paper is presenting a new method it should briefly describe how each step, including the hiPSC culture was done, the reference to an earlier publication in this case is not sufficient, and this practice is generally best to avoid for methods papers.

      Each step in the culturing process has now been described in the methods.

      (3) The EB stage is insufficiently described. The "2D - 3D - 2D" transitions should be clearly explained.

      The methods section has been rewritten and expanded to include these processes in more detail.

      (4) Is there one FACS sorting in the protocol, or multiple (additional at IPS culture)? What markers each? What is the motivation for sorting and purifying the neural progenitors? Was the culture impure? What was purity? What cell types are expected after sorting, and what is removed?

      Only one FACS sorting step is performed, at the NPC stage. This was added as an improvement to our original neural network protocol (Günhanlar et al 2018) to ensure consistency over different hiPSC source cell lines that can yield variable amounts of frontal cortical patterned NPCs. Positive sorting for the neural lineage markers CD184 and CD24, and negative sorting for the mesenchymal/neural crest marker CD271 and the glial progenitor marker CD44, according to Yuan et al 2011, ensures frontal-patterned cortical NPCs, as confirmed for all batches by immunohistochemistry for SOX2, Nestin and FOXG1. We have added new text to the Methods section to clarify this more explicitly.

      (5) Seeding protocol and parameters are insufficiently described, and from what I read they are poorly defined: "Specifically, the optimal seeding density was determined by visual inspection of the organoids between 28 to 42 days after seeding a range of cell densities in the 384-well plate wells." For a new method, precise, actionable instructions are needed. I may have overlooked those elsewhere, in this case, please clarify these sections.

      The Methods section was rewritten and expanded to describe the methodology in greater detail with more actionable instructions.

      (6) The timeline in Figure 1 is not clearly delineated; I found it hard to understand which figure corresponds to which stage (e.g. FACS sorting is not mentioned in the first part of the results but it is part of Figure 1A, neural rosette formation can happen both before and after FACS sorting, simply referring to rosettes is not clear). Later parts of the manuscript clearly introduce the terms sorting and seeding in the context of this method, and how ages (days) refer to these time points.

      Figure 1 was adapted to clarify the generation of Neural Progenitor Cells (NPCs) and subsequent seeding of NPCs to generate Adherent Cortical Organoids (ACOs).

      (7) The authors define: "cortical organized defined as a single radial structure." This is not a commonly used definition of organoids, for nomenclature, please see: doi: 10.1038/s41586-022-05219-6 (Pasca et al 2022).

      To clarify, the statement is not meant to reflect a definition of organoids in general, but rather the scoring of proper structure formation for Figure S1C. For discussion on nomenclature, see our response to point 1 of Reviewer 1 in the public review. We changed the wording to be more accurate.

      (8) In Figure S1d, the authors write: "the fraction of structurally intact cultures decreased to 50%", but looking at that graph there seems to be no notable decrease, only huge variability. The authors should quantify claims of decrease by linear regression and an R-squared value. Variation within and across cell lines seems to be large. Also, it is unclear whether the dots correspond to the same wells/plates; in other words: is this a longitudinal experiment? What is the overall success rate? How is success determined? Are there clear criteria?

      We agree with the reviewer that the claim on fraction of intact cultures decreasing over time to 50% is an overinterpretation due to the large variability. We changed the wording in the manuscript to: While some later batches show moderately reduced success rates compared with the earliest batches, properly formed single-structure organoids were still obtained at 40–90% success across all examined time points (Figure S1C), indicating that long-term culture is feasible albeit with variable efficiency. The data are not longitudinal as each dot represents an endpoint of a different batch of organoids, totaling 18 independent batches across the three lines. We have clarified this in the figure legend. Success was defined at the well level as the presence of a single, continuous radial structure occupying the well, without obvious fragmentation or fusion events, as assessed by LIVE/DEAD that also confirmed viability. Wells were scored as successful only when the radial structure showed predominantly live signal with no large necrotic areas. Wells containing multiple radial structures, fused aggregates, or predominantly dead tissue were scored as unsuccessful.

      (9) Figure s1c: the numbering to this panel should be swapped, because it is referenced after other panels in the text. The reference is confusing: "Plotting the interaction between proliferation and the amount of NPCs required to be seeded for the successful generation of adherent cortical organoids" - success is not present in this graph at all? How is that measured?

      Figures S1C and S1D have been adapted to clarify the measure of ‘successful organoid formation’.

      (a) The description of this plot is confusing: "The doubling time of the NPCs explains more than half the variation (r2 = 0.67) of the required seeding density." What else is there? I thought that this was the formula the authors suggested to determine seeding density, but it seems not. Or is "manual inspection" the determinant, and that seems to correlate with this metric?

      Even though the rate of proliferation, measured as doubling time, is the main determinant of the seeding density, it is not the only determinant of the seeding density. For instance, intrinsic differences in differentiation potential could also play a role. Therefore, NPC lines with similar doubling times might still have slightly different optimal seeding densities. We have added clarification of this conclusion to the Results section.

      (b) Seeding density is a key parameter in many in vitro differentiation and culture protocols. This importance however does not mean that this density is attributable to differences in cell proliferation rate. Alternatively, the amount of cells determines the amount of secreted molecules and cell-to-cell contacts.

      Here, when we refer to the cell density, we specifically refer to the cell density needed to generate the ACO. We show that the most important contributor to the variation in ACO formation is the proliferation, measured here as the doubling time. We agree that there are other factors involved such as the secreted molecules, cell-to-cell contacts as well as the ability of a given NPC line to differentiate into a post-mitotic cell.

      (c) Is it mentioned which cell line this experiment corresponds to?

      The data in Figure S1D is from the 3 reported cell lines, as well as 2 clones from a fourth IPS cell line. This is detailed in the Methods section of the proliferation assay.

      (d) Without a more detailed explanation, seeding density and doubling time could be independent variables.

      These two variables are highly correlated as shown in Figure S1D, but it is true that there can be other variables that account for the observed variance, as discussed above in Point 9b.

      (e) In this figure the success rate is not visible at all so I have no idea how the authors arrive at a conclusion about success rate.

      We have adapted the figure legend to reflect which cell lines the dots in Fig. S1D represent. NPC lines can have substantial variation in proliferation rates. The figure reflects data from NPCs of 5 clones of 4 different hiPSC lines (as indicated in the Methods) with different proliferation rates. The ACO success rate, defined operationally in the same way as for the data shown in Fig. S1C, was also included.
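
      The r2 = 0.67 discussed in point 9(a) is the fraction of variance in the required seeding density that a straight-line fit on doubling time accounts for; the remaining third is what the other factors named by the authors would have to explain. A minimal sketch of that calculation follows. The function name, variable names, units, and all numeric values are hypothetical placeholders chosen for illustration; they are not data from the study and imply nothing about the direction or size of the real relationship.

```python
def r_squared(x, y):
    """Coefficient of determination for a least-squares line y ~ a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Least-squares slope and intercept.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variation (not explained by the line) vs. total variation.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical placeholder values, NOT measurements from the manuscript.
doubling_time_h = [20.0, 24.0, 30.0, 36.0, 42.0]
seeding_density = [3000.0, 3500.0, 5200.0, 5000.0, 7000.0]

r2 = r_squared(doubling_time_h, seeding_density)
print(f"R^2 = {r2:.2f}; fraction left unexplained = {1 - r2:.2f}")
```

      The same function could be applied to success rate against culture age, which is the regression the reviewer asks for in point 8.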

      (10) Figure 2: Clean spatial segregation seems to be a strength of the system and therefore I would recommend putting more of the relevant microscopy images to the main figure, which are now currently in Figure S4.

      We have adapted Figure 2 accordingly, and included additional representative cortical layering images in Figure S4.

      (11) The variability in interneuron content seems to be significant, as currently presented in the figure. However, this may be due to spatial organization. I would first quantify, in consecutive rings around the centers, whether interneurons tend to be enriched towards the center or the edge of the culture. Maybe this explains the variability that is currently present in Figure s5b.

      We agree that spatial organization of interneurons could, in principle, contribute to variability. In our analysis, however, images were acquired from positions selected by a random sampling grid across the entire culture, rather than from specific central or peripheral regions. Each field contained on average 130.6 ± 16.1 NeuN+ nuclei, which provided a relatively large sampling volume per position. If interneurons were strongly enriched at the center or edge, we would expect systematic differences in interneuron fraction between fields assigned to central versus peripheral grid positions. We did not observe such a pattern in our dataset, suggesting that spatial organization is not the main driver of the observed variability.

      (12) Because in previous figures it seems like there is considerable variability across individual cultures and images here are coming from separate cultures, please use different shapes of the points coming from different cultures/wells, to see if maybe there is a culture-to-culture difference that explains the variability present in the figure.

      We have added different symbols per organoid for the interneuron quantifications and moved this quantification to main Figure 2.

      (13) I believe it is currently the standard error of the mean which is displayed in the figure, which is not an appropriate representation for variability, or the reproducibility across individual data points. SEM quantifies the reproducibility of the mean, not the reproducibility of the individual data points, which matters here. Mean refers to the mean of this quantification experiment and therefore it's not a biological entity. A box plot showing the interquartile range besides the individual data points would be an accurate representation of the spread of the data.

      We agree and have adapted the data, now in Figure 5, accordingly.

      (14) Again, in general, the main figures should contain much more of the quantification, as opposed to just raw images.

      Quantifications have been added in Figure 2 for the GAD67/NeuN for all cell lines as well as a time course quantification of GAD67/NeuN for 1 of the cell lines. In Figure 4, we have added excitatory and inhibitory synaptic quantifications.

      (15) Figure 2F-I the location of the center of the rosette should be marked with a star so that the conclusion about the direction of processes can be established.

      The suggested addition of a marker at the center of each rosette was evaluated but not implemented, because it reduced rather than improved figure clarity.

      (16) Figure 3 b and c:

      High-magnification images of single cells can't show changes in cell type morphology, and one cannot conclude that these cells are present in significant numbers across time. Zoomed-out images or quantification would be necessary for such a claim. The authors already have such images, as presented in the next panels, so quantification should be possible without new experiments. I am uncertain about the T3 supplement here - do these images correspond to the same conditions?

      (a) It is unclear to me why different markers are used in the different panels, namely why NG2 is not used in any of the other images.

      NG2 was used at early developmental time points to show the presence of Oligodendrocyte Precursor Cells (OPCs). At later time points, the focus switched to MBP staining to indicate more mature oligodendrocyte lineage cells. Although NG2 and MBP are not in the same panels, the staining was performed for both antibodies at the same developmental time point (Day 119) as seen in Figure 3C and 3D.

      (b) Color coding in Figure 3G is ambiguous; the use of two blues should be avoided, and the Sub-sub panels should be individually labeled for the color code.

      We agree, and have now used different colors.

      (c) It is unclear if the presence of the T3 molecule is part of the standard procedure or if it was a side experiment to enhance the survival of oligodendrocytes. Are there no oligodendrocytes without it? How does T3 affect other cell types, and the general health and differentiation of the cultures?

      Indeed, T3 is essential for oligodendrocyte formation. We did not observe obvious effects on the general health or differentiation potential of the cultures.

      (d) Is the 2 ng/ml T3 present from day one to the final day?

      Indeed, in the organoids cultured to study oligodendrocyte formation, T3 was added from Day 1. These details have now been clarified in the Methods and Results sections.

      (17) Figure 4:

      (a) Microscopy in this figure is high quality and very convincing about neural maturity.

      (b) The term "cluster" should be avoided. Unclear what it means here, but my best guess is "cells in a frame of view." Cluster is used with a different meaning in electrophysiology.

      This was adapted to ‘neurons in a field of view (FOV)’.

      (c) Panel J: I assume each row corresponds to a single cell? Could this be clarified? Are these selected cells from each frame, or are all active cells represented?

      Indeed, each row corresponds to a single cell, showing all active cells in the frame. This is now clarified in the legend.

      (d) How many wells do these data correspond to, and in which line were they measured?

      As reported in the legend for Figure 5, these data correspond to 2 wells at Day 61 to which we have now added calcium imaging data from 3 wells from a different batch at Day 100. We have included in the legend that these recordings were from Line 1.

      (e) Panels G to I, again, the use of standard error of the mean is inappropriate and misleading: looking at the error bar one must conclude that there is minimal variation, which is the exact opposite of the conclusions, when one would look at the variability of the raw data points.

      As suggested, the graphs have been adapted as boxplots with interquartile ranges to highlight the distribution of data points.

      (f) It is unclear how many neurons and how many total actively firing neurons are present in the videos analyzed

      All neurons that were active in the field of view and showed at least one calcium event during the ~10 minute recording were included in the analysis. Using this method, we cannot comment on the proportion of neurons that were active from the total amount of neurons present, since the AAV virus we used does not transduce all neurons.

      (g) This figure shows the strength of the method in achieving neural maturity and function. There seems to be considerable activity in the neuronal cultures analyzed. To conclude how reliably the method leads to such mature cultures one would need to measure at least a dozen wells (even if with some simpler and low-resolution method). Concluding reproducibility from one or two hand-picked examples is not possible.

      We agree with the reviewer that the number of wells used for calcium imaging analysis was limited. We are currently working on more advanced methods to increase the throughput of this analysis. However, we’ve now added another timepoint to the calcium imaging data in Figure 5 from an independent batch of 3 adherent cortical organoids, which demonstrates continued robust activity at Day 100, as well as Day 61.

      Methods:

      (1) Stem cell culture. The authors describe that line 3 is grown on MEFs. Is this true for the other two lines? Furthermore, were they cultured in identical conditions?

      Line 2 and 3 were not grown on MEFs. We specifically chose different sources of NPCs to reflect the robust nature of the differentiation protocol. We have recently also adapted the protocol from Line 3 NPCs to confirm that the protocol also works starting from hiPSCs grown in feeder-free conditions in StemFlex medium, by adapting NPC differentiation according to our recent publication in Frontiers in Cellular Neuroscience (Eigenhuis et al 2023).

      (2) "NPCs were differentiated to adherent cortical organoids between passages 3 and 7 after sorting." Please clarify this sentence. I assume it refers to the first FACS sorting of the protocol, but the section is not sufficiently detailed.

      We have adapted the methods to clarify that the FACS purification step occurs at the NPC stage.

      (3) I didn't fully understand: it seems that there are two steps of FACS sorting involved, one after passage 3 and one after week 4. This should be represented in the graphical abstract of Figure 1.

      As outlined above, there is only 1 FACS sorting step at NPC stage. We have adapted this in the Methods and in the graphical abstract.

      (4) Neural differentiation: The authors write that optimal seeding density was determined by visual inspection of the organoids - this is.

      We have clarified the Methods section to better explain the process of optimizing the seeding density for each NPC line to generate the ACOs.

      (5) What does the following sentence mean: "Cells were refreshed every 2-3 days." Does it mean in replacement of the complete media? How much Media was added to the Wells?

      This is a very good point that we have now clarified in the Methods, as full replenishment of media is neither feasible, nor desirable. From the total volume of 110 µl per well, 80 µl is taken out and replaced with 85 µl to compensate for evaporation.

      (6) Calcium imaging: can the authors explain the decision to move the cultures one day before imaging into brainphys neural differentiation medium? In 3D organoid protocols, brainphys is gradually introduced to avoid culture shock (very different composition), and used for multiple months to enhance neural differentiation. For recording electrophysiological activity, artificial CSF is the most common choice.

      Indeed, for whole cell recordings of 2D neural networks as performed in Günhanlar et al 2018, we used gradual transition to aCSF. For the current ACOs, we found that using BrainPhys from the start of organoid differentiation prevents structure formation, probably because of increased speed of maturation disrupting proliferation and organization of radial glia differentiation. However, by changing the media to BrainPhys just one day before recording (reflecting a gradual change as not all medium is fully replenished and easier than switching to aCSF during recording), we saw greatly improved neuronal activity.

      (7) Statistical analysis : As I pointed out before, the standard error of the mean is not an appropriate metric to represent the variability of the data. It is meant to represent the variability of the estimated average. The following thought experiment should make it clear: I measured the expression of a gene in my system. 50 times I measured 0 and 50 times I measured 100. The average is 50, but of course it is a very bad representation of the data because no such data points exist with that value. Yet the standard error of the mean would be plus minus 5.

      We have revised Figures 5C–5D to boxplots displaying the interquartile range with all individual data points overlaid, which more accurately represents the variability in the dataset.
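
      The reviewer's thought experiment in point 7 can be checked numerically. The sketch below is illustrative only: the helper name `summarize` is hypothetical, and it assumes the population standard deviation, which reproduces the reviewer's "plus minus 5" exactly (the sample standard deviation would give roughly 5.03). It contrasts the SEM with the interquartile range that the revised boxplots display.

```python
import math
import statistics


def summarize(values):
    """Return the mean, SD, SEM, and quartiles of a list of measurements."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # population SD (an assumption, see lead-in)
    sem = sd / math.sqrt(n)          # uncertainty of the estimated mean
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": mean,
        "sd": sd,
        "sem": sem,
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,              # spread of the individual data points
    }


# The reviewer's example: 50 measurements of 0 and 50 measurements of 100.
values = [0] * 50 + [100] * 50
summary = summarize(values)
print(summary)
```

      The SEM shrinks with the square root of the number of measurements while the interquartile range does not, which is why an error bar of 5 can sit on top of data that span the full range from 0 to 100.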

      Discussion

      (1) The discussion focuses on human cortical development, however, the methods presented by the authors entail dissociation and replating through multiple stages not part of brain development. I see the approach as more valuable as a possibly reliable method that generates both diverse and mature neural cultures.

      We have revised the Discussion to avoid explicitly invoking an in vitro recapitulation of human cortical development. Nevertheless, given that the NPCs from which the organoids originate exhibit frontal cortical identity, coupled with the timely emergence of cortical neuronal markers and rudimentary cortical layering, we are increasingly confident that the development of these cultures most likely mirrors that of the frontal cortex. To further substantiate this hypothesis, single-cell RNA sequencing experiments will be conducted in the future to provide additional insights.

      (2) One of the major claims of the authors is that the method is very reproducible. However, there is almost no data on reproducibility throughout the paper. Mostly single, high magnification images are presented, which therefore represent a small region of a single well of a single batch of a single cell line. Based on the data presented it is not possible to evaluate the reproducibility of the method.

      We agree that the original version did not sufficiently document reproducibility. To address this, we have refined and expanded our presentation of reproducibility data. The previous success-rate panel (original Figure S1D) has been moved and adapted as the new Figure S1C. In this updated version, each dot still represents the endpoint success rate of an independent batch, but dot size now scales with batch size (10–40 organoids), and the legend specifies the total numbers of organoids analyzed per line (line 1: n=248; line 2: n=70; line 3: n=70). Together with the distribution of success rates between ~40–90% across multiple time points and three iPSC lines, this more detailed representation allows readers to directly assess the robustness of line-to-line and batch-to-batch performance. In addition, new time course quantifications of interneuron proportion (Figure 2G,H), synaptic marker densities (Figure 4H, I), and late-stage calcium imaging (Figure 5C,D,E) further demonstrate that key structural and functional read-outs show overlapping ranges across lines and independent differentiations, reinforcing that the method yields reproducible core phenotypes despite some biological variability.

      (3) The data presented is very promising, and it suggests that the authors derived optimal conditions for neural differentiation and neural culture diversification. I am confident that the authors can show that reproducibility, at least in a practical sense (e.g. in wells that form a culture) is high.

      Overall, this is a very promising and exciting work, that I am looking forward to reading in a mature manuscript.

      Reviewer #2 (Recommendations for the authors):

      (1) Direct head-to-head comparison with standard organoid culture seems to be missing and may be valuable for benchmarking, ie what can be done with the new method that cannot be done with standard culture and vice versa, ie what are the aspects in which new method could be inferior to the standard.

      We have now more clearly elaborated the differences with other methods. As addressed in our response to point 2 of Reviewer 1 in the public reviews, there are several limitations and advantages to the adherent cortical organoids model listed as follows:

      Advantages of adherent cortical organoids:

      (1) Higher order self-organized structure formation, including segregation of deeper and upper cortical layers.

      (2) Longevity: adherent cortical organoids can be successfully kept in culture for at least 1 year, whereas 2D cultures typically deteriorate after 8-12 weeks.

      (3) Maturity, including the formation of dendritic mushroom spines and robust electrophysiological activity.

      (4) Cell type diversity including a more physiological ratio of inhibitory and excitatory neurons (10% GAD67+/NeuN+ neurons in adherent cortical organoids, vs 1% in 2D neural networks), and the emergence of oligodendrocyte lineage cells.

      On the other hand, limitations of adherent cortical organoids compared to 2D neural network cultures include:

      (1) Culture times for organoids are much longer than for 2D cultures and the method can therefore be more laborious and more expensive.

      (2) Whole cell patch clamping is not easily feasible in adherent cortical organoids because of the restrictive geometry of 384-well plates.

      (2) It would be important to further benchmark the throughput, ie what is the success rate in filling and successfully growing the organoids in the entire 384 well plate?

      We have addressed this question in the current version of Fig. S1C, in which multiple batches of organoids of all three lines were scored for their success rate. The graph reflects the proportion of properly formed organoids out of approximately 400 seeded wells scored at different timepoints, in which each timepoint is a different batch. As mentioned in the response to Reviewer 1, we have also added data on the number of organoids seeded per line in the figure legend.

      (3) For each NPC line an optimal seeding density was estimated based on the proliferation rate of that NPC line and via visual observation after 6 weeks of culture. It would be important to delineate this protocol in more robust terms, in order to enable reproducibility with different cell lines and amongst the labs.

      As outlined in the response to Reviewer 1, we have clarified the Methods and Discussion sections on seeding density and proliferation rate.

      Reviewer #3 (Recommendations for the authors):

      Kroeg et al. have introduced a novel method to produce 3D cortical layer formation in hiPSC-derived models, revealing a remarkably consistent topography within compact dimensions. This technique involves seeding frontal cortex-patterned iPSC-derived neural progenitor cells in 384-well plates, triggering the spontaneous assembly of adherent cortical organoids consisting of various neuronal subtypes, astrocytes, and oligodendrocyte lineage cells. Compared to existing brain organoid models, these adherent cortical organoids demonstrate enhanced reproducibility and cell viability during prolonged culture, thereby providing versatile opportunities for high-throughput drug discovery, neurotoxicological screening, and the investigation of brain disorder pathophysiology. This is an important and timely issue that needs to be addressed to improve the current brain organoid systems. While the authors have provided significant data supporting this claim, several aspects necessitate further characterization and clarification. Particularly, highlighting the consistency of differentiation across different cell lines and standardizing functional outputs are crucial elements to emphasize the future broad potential of this new organoid system for large-scale pharmacological screening.

      (1) Considering the emergence of astrocyte markers (GFAP, S100b) and upper layer neuron marker (CUX1) around Day 60, the overall differentiation speed is significantly faster compared to other forebrain organoid protocols. Are these accelerated sequences of neurodevelopment consistent across different hiPSC lines?

      As shown in Fig. S5, astrocytes are present around Day 60 for all three lines. For comparison with other organoid protocols, an important consideration is that the timeline for these organoids starts at NPC plating, while for other protocols timing often starts from the hiPSC stage. We have clarified the timeline in the graphical abstract in Figure 1A and in the Methods.

      (2) The calcium imaging results in Figure 4G were recorded at a single time point, Day 61, a relatively early time window compared to other forebrain organoid protocols (more than 100 days, PMID: 31257131; PMID: 36120104). Are the neurons in adherent cortical organoids functionally mature enough around Day 61? How consistent is this functional activity across different cell lines and independent differentiation batches?

      As discussed above in Point 1, it is important to consider that the specified timeline starts from NPC plating. In analogy to 2D neural networks, robust neuronal activity can be observed after ~8 weeks in culture. In addition, we have now added calcium imaging data for an additional batch of organoids at Day 100 in Figure 5, which exhibit comparable levels of neuronal activity as observed on Day 61.

      (3) Along the same line, various cell types, such as oligodendrocytes and astrocytes, are believed to influence neuronal maturation. Therefore, longitudinal studies until the late stage are necessary to observe changes in electrophysiological activity based on the degree of neuronal maturation (at least two more later time points, such as 100 days and 150 days).

      As described in the previous points, we have now included a Day 100 time point in the calcium imaging data, in addition to the recordings at Day 61 (Figure 5C-E).

      (4) The authors assert that heterogeneity among organoids has been diminished using the human adherent cortical organoids protocol. However, there is inadequate quantitative data to prove the consistency of neuronal activities between different wells. Therefore, experiments quantifying the degree of heterogeneity between organoids, such as through methods like calcium imaging, are necessary to determine if neuron activity occurs consistently across each organoid well.

      We agree with the reviewer and have added several quantitative experiments: a) we’ve added another timepoint to the calcium imaging data in Figure 5 from an independent batch of 3 adherent cortical organoids, which demonstrates continued robust activity at day 100, as well as day 61; b) we added synapse quantification in Figure 4, and c) interneuron quantification in Figure 2. We are currently also pursuing high throughput measures of activity to assess the longitudinal activity of ACOs in a larger number of wells. This way we can more definitively quantify the time-dependent variance in organoid activity.

      (5) Is this platform applicable to other functional measurements for neuronal activity, such as the MEA system? When observing the morphology of neurons formed in organoids, they appear to extend axons and dendrites in a consistent direction, suggesting a radial structure that demonstrates high reproducibility across wells. A culture system where neurons are arranged with such consistency in directionality could be highly beneficial for experiments utilizing the MEA system to assess parameters such as the speed of electrical activity transmission and stimulus-response. Therefore, there seems to be a need for a more detailed explanation of the utility of the structural characteristics of the culture system.

      The ACO platform is indeed suitable for MEA recordings. We are in the process of engineering the required geometry using HD-MEA systems through specialized inserts to generate ACOs on MEA systems.

      (6) In Figure 2E-I, the authors suggest morphological diversity of GFAP+/S100b+ astrocytes, but the imaging data presented in Figure 2F-I are based only on GFAP immunoreactivity.

      Since GFAP is also expressed in radial glial cells at this stage (Figure 2I), many fibrous astrocytes and interlaminar astrocytes are likely radial glial neural progenitor cells instead of astrocytes. It appears necessary to perform additional staining using astrocyte markers such as S100B or outer radial glia markers such as HOPX to demonstrate that the figure depicts subtype-specific morphologies of astrocytes.

      In Figure 2M, we stained for GFAP and PAX6 to mark radial glia that look different than the astrocyte morphologies we describe in Figure 2J-L. We see a large overlap in GFAP and S100B staining in Figure 2I, in which most GFAP+ cells are double positive for S100B (yellow) that is more consistent with astrocyte maturation than radial glia. Furthermore, we have not seen PAX6 staining outside the dense edges of the center of the ACO.

      (7) In Figure 4D, the axon appears to exhibit directionality. Additional explanation regarding the organization of the axon is necessary. Further research utilizing sparse staining to examine the morphology of single neurons seems warranted.

      The polarized directionality of the axons is something we indeed have also noticed. We are looking into options to further investigate this intriguing property of the ACOs.

      (8) Figure 1E-F only showed cell viability in the early stages around Day 40-50. To demonstrate the superior long-term viability of ACO culture, it appears necessary to illustrate the ratio of dead cells to live cells over the course of a time course.

      Figure S1B shows LIVE/DEAD staining for ACOs of all three lines, revealing minimal DEAD staining at Day 56. A longitudinal time course experiment was not performed, however the line- and batch-specific quantifications over developmental timepoints in Figure S1C provide an indication of the robust long-term viability of the ACOs.

    1. As someone who has been taking CS classes alongside this course, this final reminder about embedding ethics into tech design is super relevant to me. It’s so easy to get completely caught up in just making the code work or hitting a deadline, but realizing that those small design choices can have massive real-world impacts is a very sobering thought. I really hope the tech industry starts prioritizing ethical awareness just as much as pure technical skills.

    1. How have your views on automation and programming changed (or been reinforced)?

      My view of automation and programming has changed because I now see them as ethical as well as technical. Code does not just solve problems; it also reflects choices about whose needs matter, what behaviors get rewarded, and what harms may be scaled.

    1. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This study provides useful insights into addressing the question of whether the prevalence of autoimmune disease could be driven by sex differences in the T cell receptor (TCR) repertoire, correlating with higher rates of autoimmune disease in females. The authors compare male and female TCR repertoires using bulk RNA sequencing, from sorted thymocyte subpopulations in pediatric and adult human thymuses; however, the results do not provide sufficient analytical rigor and incompletely support the central claims.

      The statement in the editorial assessment that our study “does not provide sufficient analytical rigor” surprised us. TCR repertoire analysis is indeed a highly complex domain, both experimentally and computationally. We consider ourselves to be leading experts in this field and have invested a great deal of effort to ensure the rigor and reproducibility of every analytical step.

      Specifically, our group has previously benchmarked and published validated methodologies for the following areas: (i) TCR repertoire generation (Barennes et al., Nat Biotechnol 2021), (ii) repertoire analysis (Six et al., Frontiers in Immunol, 2013; Chaara et al., Frontiers in Immunol, 2018; Ritvo et al., PNAS, 2018; Mhanna et al., Diabetes, 2021; Trück et al., eLife, 2021; Quiniou et al., eLife, 2023; Mhanna et al., Cell Rep Methods, 2024; Mhanna et al., Nat Rev Primers Methods, 2024), and (iii) the curation and quality control of public TCR databases (Jouannet et al., NAR Genomics and Bioinformatics 2025). The current study applies these optimized and peer-reviewed pipelines, along with additional internal quality controls that we have implemented over the years, ensuring the highest possible analytical standards for TCR repertoire studies.

      We therefore respectfully feel that the phrase “insufficient analytical rigor” does not accurately reflect the methodological robustness of our work. This perception is also in contrast to the comment made by one of the reviewers, who explicitly noted that “overall, the methodologies appear to be sound.”

      We would therefore be grateful if, upon reviewing our detailed point-by-point responses, the editors could reconsider this statement and tone it down in the final editorial summary.

      With regard to the comment that our results “incompletely support the central claims”, we will leave it to the reader’s judgement. We believe that our work provides a robust and transparent basis for future research into TCR repertoire, autoimmunity, and women’s health.

      Reviewer 1 (Public reviews):

      Summary

      The goal of this paper was to determine whether the T cell receptor (TCR) repertoire differs between a male and a female human. To address this, this group sequenced TCRs from double-positive and single-positive thymocytes in male and female humans of various ages. Such an analysis on sorted thymocyte subsets has not been performed in the past. The only comparable dataset is a pediatric thymocyte dataset where total thymocytes were sorted.

      They report on participant ages and sexes, but not on ethnicity, race, nor provide information about HLA typing of individuals. Though the experiments themselves are heroic, they do represent a relatively small sampling of diverse humans. They observed no differences in TCRbeta or TCRalpha usage, combinational diversity, or differences in the length of the CDR3 region, or amino acid usage in the CD3aa region between males or females. Though they observed some TCRbeta CD3aa sequence motifs that differed between males and females, these findings could not be replicated using an external dataset and therefore were not generalizable to the human population.

      They also compared TCRbeta sequences against those identified in the past using computational approaches to recognize cancer-, bacterial-, viral-, or autoimmune-antigens. They found very little overlap of their sequences with these annotated sequences (depending on the individual, ranging from 0.82-3.58% of sequences). Within the sequences that were in overlap, they found that certain sequences against autoimmune or bacterial antigens were significantly over-represented in female versus male CD8 SP cells. Since no other comparable dataset is available, they could not conclude whether this is a finding that is generalizable to the human population.

      Strengths:

      This is a novel dataset. Overall, the methodologies appear to be sound. There was an attempt to replicate their findings in cases where an appropriate dataset was available. I agree that there are no gross differences in TCR diversity between males and females.

      We appreciate the positive feedback from the reviewer regarding these points.

      Weaknesses:

      Overall, the sample size is small given that it is an outbred population. The cleaner experiment would have been to study the impact of sex in a number of inbred MHC I/II identical mouse strains or in humans with HLA-identical backgrounds.

      We respectfully disagree with the reviewer’s statement. We firmly believe that the issue we are dealing with, namely sex-based differences in thymic TCR selection relevant to autoimmunity, should be investigated more thoroughly in the general human population than in inbred mouse models.

      While inbred mouse strains, being MHC I/II identical, eliminate the complexity of MHC variation, this comes at the cost of biological relevance. Firstly, a discrepancy in TCR generation or selection may only become apparent under specific MHC contexts, which could easily be overlooked when studying a single inbred strain. Secondly, inbred strains frequently contain fixed genetic variants that may influence thymic selection or immune regulation. This can introduce confounding effects rather than reduce them, and it does not solve the generalization issue.

      We are in full agreement that an HLA-matched human cohort would reduce inter-individual variability. However, such sampling is impossible in practice, as our thymic tissues were obtained from deceased organ donors, a collection effort that was, as the reviewer rightly noted, “heroic”. Despite these inherent limitations, the patterns we observed were consistent across multiple analytical approaches, lending robustness to our findings.

      We now explicitly acknowledge this limitation in the Discussion of the revised manuscript and explain why, despite this constraint, our study provides meaningful and biologically relevant insights into human TCR selection and sex-related immune differences.

      It is unclear whether there was consensus between the three databases they used regarding the antigens recognized by the TCR sequences. Given the very low overlap between the TCR sequences identified in these databases and their dataset, and the lack of replication, they should tone down their excitement about the CD8 T cell sequences recognizing autoimmune and bacterial antigens being over-represented in females.

      The three databases used in this study - McPAS-TCR, IEDB, and VDJdb - provide complementary and partially non-overlapping specificity landscapes. McPAS-TCR is enriched for pathology-associated TCRs, while IEDB and VDJdb contain a higher proportion of viral specificities. Combining them therefore broadens the antigenic spectrum accessible for analysis and represents the most comprehensive approach currently possible to capture the diversity of TCR–antigen annotations.

      With regard to the limited overlap between our dataset and these databases, this observation should be interpreted with caution. While the overlap may appear minimal at first glance, it is a biologically significant phenomenon. The public databases collectively contain only a minute fraction of the total universe of TCR specificities, estimated to exceed 10^15-10^21 possible receptors in humans. In this context, the observation of any overlap at all, particularly with coherent biological patterns such as the overrepresentation of autoimmune- and bacterial-associated TCRs in females, is noteworthy.

      We have included a short clarification in the Discussion of the revised manuscript to make this point explicit and to further temper the language describing this finding.

      The dataset could be valuable to the community.

      We thank the reviewer for highlighting the potential value of this dataset to the community. It will be made publicly available on the NCBI website. We would like to clarify that our intention has always been to make this dataset publicly available; therefore, we take back any incorrect suggestions made in the original submission.

      Reviewer #1 (Recommendations for the authors):

      I would just recommend toning down the excitement about autoimmune TCRs being overrepresented in females. Then the conclusions will be in alignment with their results.

      We thank the reviewer for this constructive recommendation. We would like to express our full support for the editorial transparency policies of eLife, which allow readers to access both the reviewers’ comments and our detailed responses, enabling them to form their own informed opinions regarding our conclusions.

      Nevertheless, we have moderated some of our wording.

      Reviewer #2 (Public review):

      Summary:

      This study addresses the hypothesis that the strikingly higher prevalence of autoimmune diseases in women could be the result of biased thymic generation or selection of TCR repertoires. The biological question is important, and the hypothesis is valuable. Although the topic is conceptually interesting and the dataset is rich, the study has a number of major issues that require substantial improvement. In several instances, the authors conclude that there are no sex-associated differences for specific parameters, yet inspection of the data suggests visible trends that are not properly quantified. The authors should either apply more appropriate statistical approaches to test these trends or provide stronger evidence that the observed differences are not significant. In other analyses, the authors report the differences between sexes based on a pooled analysis of TCR sequences from all the donors, which could result in differences driven by one or two single donors (e.g., having particular HLA variants) rather than reflect sex-related differences.

      Strengths:

      The key strength of this work is the newly generated dataset of TCR repertoires from sorted thymocyte subsets (DP and SP populations). This approach enables the authors to distinguish between biases in TCR generation (DP) and thymic selection (SP). Bulk TCR sequencing allows deeper repertoire coverage than single-cell approaches, which is valuable here, although the absence of TRA-TRB pairing and HLA context limits the interpretability of antigen specificity analyses. Importantly, this dataset represents a valuable community resource and should be openly deposited rather than being "available upon request."

      We thank the reviewer for highlighting the potential value of this dataset to the community. It will be made publicly available on the NCBI website. We would like to clarify that our intention has always been to make this dataset publicly available; therefore, we take back any incorrect suggestions made in the original submission.

      Weaknesses:

      Major:

      The authors state that there is "no clear separation in PCA for both TRA and TRB across all subsets." However, Figure 2 shows a visible separation for DP thymocytes (especially TRA, and to a lesser degree TRB) and also for TRA of Tregs. This apparent structure should be acknowledged and discussed rather than dismissed.

      We thank the reviewer for this careful observation. Discussing apparent “trends” rather than statistically significant results is indeed a nuanced issue, as over-interpretation of visual patterns is usually discouraged. We agree that, within the specific context of TCR repertoire analyses, visual structures in multivariate projections such as PCA can provide useful contextual information.

      However, we have not identified a striking trend in our representation. We therefore chose to avoid overemphasizing these visual impressions in the text.

      Supplementary Figures 2-5 involve many comparisons, yet no correction for multiple testing appears to be applied. After appropriate correction, all the reported differences would likely lose significance. These analyses must be re-evaluated with proper multiple-testing correction, and apparent differences should be tested for reproducibility in an external dataset (for example, the pediatric thymus and peripheral blood repertoires later used for motif validation).

      As is standard in exploratory immunogenomic studies, including TCR repertoire analyses, our objective was to uncover broad biological patterns rather than to establish definitive statistical associations. In analyses that are discovery-oriented, correction for multiple testing, while essential in confirmatory contexts, is not mandatory and may even obscure meaningful trends by inflating type II error rates. Our objective was therefore to highlight consistent directional patterns across analytical layers, to guide future confirmatory work rather than to make categorical claims.

      We also note that this comment somewhat contrasts with the earlier suggestion to discuss trends that are not statistically significant.

      With regard to the proposal to verify our observations using an external dataset, we are in full agreement that independent confirmation would be beneficial. However, as reviewer 1 rightly emphasized, the generation of such datasets from sorted human thymocyte subsets is “heroic” and has rarely, if ever, been achieved. We are aware of no existing dataset that provides comparable material or analytical depth.

      The available single-cell thymic dataset (Park et al., Science 2020) includes only a few hundred sequences per donor, which is significantly less than the number of sequences in our study. This limited dataset is not adequate for cross-validation or for representing the full complexity of thymic TCR repertoires.

      As for the pediatric thymus dataset, its small number of female subjects (only three) leaves it without the statistical power needed to evaluate sex-related differences in V/J usage.

      Finally, the peripheral blood dataset is not appropriate for validating thymic generation or selection processes, as it reflects post-thymic selection and antigen-driven remodeling, making it impossible to distinguish peripheral effects from thymic influences.

      For these reasons, none of the currently available datasets provides a sufficiently clean or powerful framework to test the reproducibility of subtle sex-associated effects on thymic TCR repertoires. Nevertheless, we fully agree that confirmation in an independent and larger cohort will be an important next step to refine these exploratory findings and assess their generalizability to a broader human population.

      Supplementary Figure 6 suggests that women consistently show higher Rényi entropies across all subsets. Although individual p-values are borderline, the consistent direction of change is notable. The authors should apply an integrated statistical test across subsets (for example, a mixed-effects model) to determine whether there is an overall significant trend toward higher diversity in females.

      We agree that Rényi entropies tend to show a consistent direction of change across subsets, with slightly higher values observed in females. In this section, our objective was to provide a descriptive overview of diversity patterns for each thymic subset. This is because these subsets are biologically distinct and therefore require individual analysis, as we previously demonstrated using the same dataset (Isacchini et al, PRX Life. 2024). Therefore, while a mixed-effects approach could in principle be applied to test for an overall trend, such an analysis would rely on the assumption of a common sex effect across heterogeneous cell types.
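
      For readers less familiar with the diversity measure discussed here, the following is a minimal sketch of the standard Rényi entropy of order alpha computed from clonotype counts. It is not the authors' pipeline: the function name, the use of the natural logarithm, and the input format (a plain list of counts) are assumptions made for illustration.

```python
import math

def renyi_entropy(clonotype_counts, alpha):
    """Rényi entropy of order alpha (natural log) for a list of clonotype counts."""
    total = sum(clonotype_counts)
    freqs = [c / total for c in clonotype_counts if c > 0]
    if alpha == 1:
        # The limit alpha -> 1 is the Shannon entropy.
        return -sum(p * math.log(p) for p in freqs)
    if math.isinf(alpha):
        # The limit alpha -> infinity depends only on the most frequent clonotype.
        return -math.log(max(freqs))
    return math.log(sum(p ** alpha for p in freqs)) / (1 - alpha)
```

      Low orders of alpha weight rare clonotypes heavily (alpha = 0 reduces to the log of richness), whereas high orders are dominated by expanded clonotypes, which is why such entropies are usually reported as a profile over several orders.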

      It is important to note that the complete dataset has now been made publicly available, enabling interested researchers to perform additional integrative or model-based analyses to further explore these diversity trends.

      Figures 4B and S8 clearly indicate enrichment of hydrophobic residues in female CDR3s for both TRA and TRB (excluding alanine, which is not strongly hydrophobic). Because CDR3 hydrophobicity has been linked to increased cross-reactivity and self-reactivity (see, e.g., Stadinski et al., Nat Immunol 2016), this observation is biologically meaningful and consistent with higher autoimmune susceptibility in females.

      We thank the reviewer for this insightful comment.

      As correctly noted, increased hydrophobicity at specific CDR3β positions has been linked to enhanced cross-reactivity and self-reactivity, as described by Stadinski et al. (Nat Immunol 2016), and we reference this work in the manuscript.

      In our analysis corresponding to Figure 4B (TRB), hydrophobicity was quantified at the sequence level by computing, for each unique CDR3β sequence, the overall proportion of hydrophobic amino acids across the CDR3 loop. This approach aligns with that of Lagattuta et al. (Nat Immunol 2022), whose code we adapted to accommodate longer CDR3s. This global hydrophobicity metric captures overall composition, but, by its construction, does not account for positional context, the key mechanism implicated by Stadinski et al.
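
      The sequence-level metric described above can be sketched as follows. This is an illustration only, not the code adapted from Lagattuta et al.: the function names are hypothetical, and the residue set is an assumed placeholder, since the exact hydrophobic set used in the manuscript is not listed in this response.

```python
# Illustrative residue set only (assumption); the manuscript's exact set is not given here.
HYDROPHOBIC = frozenset("FILMVWY")

def hydrophobic_fraction(cdr3, hydrophobic=HYDROPHOBIC):
    """Proportion of hydrophobic residues in one CDR3 amino acid sequence."""
    if not cdr3:
        raise ValueError("empty CDR3 sequence")
    return sum(aa in hydrophobic for aa in cdr3) / len(cdr3)

def repertoire_hydrophobicity(cdr3_sequences, hydrophobic=HYDROPHOBIC):
    """Score each unique CDR3 once, ignoring clonotype frequency."""
    return {seq: hydrophobic_fraction(seq, hydrophobic) for seq in set(cdr3_sequences)}
```

      Because the score is a single proportion per sequence, two CDR3s with the same composition but different residue order receive the same value, which is the positional blindness noted above.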

      As outlined in our original Figure 4C, the results were obtained through a position-based amino acid analysis. For each CDR3β sequence, we extracted the amino acid at every IMGT-defined CDR3 position (p104–p118) and quantified, at each position, the percentage of unique sequences containing each amino acid. Positions p109 and p110 correspond to the p6–p7 sites highlighted by Stadinski et al. as functionally relevant for self-reactivity. This analysis evaluates positional composition independently of clonotype frequency, focusing specifically on hydrophobic amino acid classes.
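
      A minimal sketch of such a positional tally is given below. It assumes, for illustration, that each CDR3β has already been aligned to the 15 positions p104 to p118 as a fixed-length string with "." marking gaps; the alignment step itself, the handling of longer CDR3s, and the function name are not taken from the manuscript.

```python
from collections import Counter

POSITIONS = [f"p{n}" for n in range(104, 119)]  # p104 ... p118

def positional_usage(aligned_cdr3s):
    """Percentage of unique sequences carrying each amino acid at each position."""
    unique = set(aligned_cdr3s)
    for seq in unique:
        if len(seq) != len(POSITIONS):
            raise ValueError(f"sequence not aligned to {len(POSITIONS)} positions: {seq}")
    usage = {}
    for index, position in enumerate(POSITIONS):
        # Gap characters are skipped, but gapped sequences still count in the denominator.
        counts = Counter(seq[index] for seq in unique if seq[index] != ".")
        usage[position] = {aa: 100.0 * n / len(unique) for aa, n in counts.items()}
    return usage
```

      Summing the percentages of the residues in a chosen hydrophobic class at p109 or p110 then gives a per-sample value that can be compared between sexes.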

      Following the recommendation of the reviewer, in the revised manuscript alanine (which is only weakly hydrophobic) has been excluded from the hydrophobic residue set. With this refined definition, we observe a significant enrichment of hydrophobic amino acids at p109 in CD8 T cell repertoires from females, with similar but non-significant trends at p109 in DP and CD4 Teff cells and at p110 in CD8 cells (see new Figure 4C).

      As outlined in the revised Methods, Results, and Discussion sections, Figure 4C focuses exclusively on positional hydrophobic amino acid usage. This was previously implicit, although it was noted in the legend and visually represented in the plots.

      The majority of "hundreds of sex-specific motifs" are probably donor-specific motifs confounded by HLA restriction. This interpretation is supported by the failure to validate motifs in external datasets (pediatric thymus, peripheral blood). The authors should restrict analysis to public motifs (shared across multiple donors) and report the number of donors contributing to each motif.

      We fully agree that donor-specific and HLA-restricted motifs represent a major potential confounder in repertoire-level comparisons. To minimize this potential bias, our analysis was explicitly restricted to public motifs, as clearly stated in the Materials and Methods section:

      “Additional filters were applied so that: (i) a motif includes public CDR3aa sequences (shared by at least two individuals); (ii) a significant enrichment is detected (Fisher’s exact test, p < 0.01); and (iii) a usage difference between groups of at least twofold (Wilcoxon test, p < 0.05).”

      Accordingly, every motif reported in the manuscript is supported by at least two independent donors, ensuring that no motif reflects an individual- or HLA-specific effect (see Supplementary Figures 10-13 [previously Supplementary Figure 9]). We have now added a more explicit mention of the number of donors contributing to each motif in the figure legend and have clarified this point in the revised Methods and Results sections to make this criterion more visible to readers.
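
      The publicity criterion (i) quoted above can be sketched as follows. Only this first filter is illustrated; the Fisher and Wilcoxon steps are not reproduced, and the function name and the input format (a mapping from donor identifier to CDR3aa sequences) are assumptions.

```python
from collections import defaultdict

def public_sequences(donor_repertoires, min_donors=2):
    """Return {CDR3aa: number of donors} for sequences found in at least min_donors donors."""
    donors_per_seq = defaultdict(set)
    for donor, sequences in donor_repertoires.items():
        for seq in sequences:
            donors_per_seq[seq].add(donor)
    return {seq: len(donors) for seq, donors in donors_per_seq.items()
            if len(donors) >= min_donors}
```

      Returning the donor count alongside each sequence makes it straightforward to report how many individuals contribute to a given motif.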

      When comparing TCRs to VDJdb or other databases, it is critical to consider HLA restriction. Only database matches corresponding to epitopes that can be presented by the donor's HLA should be counted. The authors must either perform HLA typing or explicitly discuss this limitation and how it affects their conclusions.

      We respectfully disagree with the assertion that HLA typing is necessary for the type of comparative analysis we have conducted. While it is true that HLA molecules present peptides to TCRs and thereby contribute to the tripartite interaction determining T cell activation, extensive evidence indicates that the CDR3 region, particularly CDR3β, is the dominant determinant of antigen specificity. This finding is supported by structural and computational studies (Madi et al., eLife, 2017; Huang et al., Nat. Biotech., 2020; Mayer-Blackwell et al., Methods Mol. Biol., 2022) showing that CDR3β residues are responsible for the majority of peptide contacts, whereas CDR1 and CDR2 primarily interact with the MHC framework.

      As emphasized in several recent benchmarking studies (e.g., Springer et al., Front Immunol, 2021), CDR3β sequence composition alone captures most of the information required for specificity inference. Consequently, widely used and validated computational tools such as GIANA (Zhang et al. Nat. Commun. 2021), iSMART (Zhang et al. Clin. Cancer Res. 2020), and ATMTCR (Cai et al. Front. Immunol. 2022) rely exclusively on CDR3β amino acid sequences and still achieve high predictive performance.

      Our analysis aligns with this well-established paradigm. While we agree that integrating donor HLA typing would refine epitope-level annotation and reduce potential noise, the absence of HLA data does not invalidate the comparative framework we used, which focuses on relative representation of annotated specificities across groups rather than on individual TCR–HLA–peptide triads.

      Although the age distributions of male and female donors are similar, the key question is whether HLA alleles are similarly distributed. If women in the cohort happen to carry autoimmune-associated alleles more often, this alone could explain observed repertoire differences. HLA typing and HLA comparison between sexes are therefore essential.

      To address the issue of any potential differences in HLA background, we examined the subset of adult donors for whom HLA typing information was available (HLA-A, HLA-B, HLA-DR, and HLA-DQB; n = 16). Within this subset, the distribution of HLA alleles was relatively balanced between males and females (as illustrated by the heatmap showing HLA class II expression patterns and HLA class I family grouping in Author response image 1). This analysis suggests that the sex-associated differences in the repertoire observed in our study are unlikely to be driven solely by unequal representation of autoimmune-associated HLA alleles.

      We acknowledge, however, that complete HLA information was not available for all donors, which remains a limitation of the dataset.

      Author response image 1.

      In some analyses (e.g., Figures 8C-D) data are shown per donor, while others (e.g., Fig. 8A-B) pool all sequences. This inconsistency is concerning. The apparent enrichment of autoimmune or bacterial specificities in females could be driven by one or two donors with particular HLAs. All analyses should display donor-level values, not pooled data.

      While Figures 8A–B present pooled data to summarize global trends, the corresponding donor-level analyses were provided in Supplementary Figures 15B and 16 (previously Supplementary Figures 11B and 12). In these, each individual is shown separately, with each point representing an individual. It is important to note that these donor-resolved plots do not reveal any sample-specific driver: the patterns observed in the pooled data remain consistent across donors, without any single individual accounting for the apparent enrichments. In the revised manuscript, readers are now directed to the relevant supplementary figures for further clarification.

      The reported enrichment of matches to certain specificities relative to the database composition is conceptually problematic. Because the reference database has an arbitrary distribution of epitopes, enrichment relative to it lacks biological meaning. HLA distribution in the studied patients and HLA restrictions of antigens in the database could be completely different, which could alone explain enrichment and depletions for particular specificities. Moreover, differences in Pgen distributions across epitopes can produce apparent enrichment artifacts. Exact matches typically correspond to high-Pgen "public" sequences; thus, the enrichment analysis may simply reflect variation in Pgen of specific TCRs (i.e., fraction of high-Pgen TCRs) across epitopes rather than true selection. Consequently, statements such as "We observed a significant enrichment of unique TRB CDR3aa sequences specific to self-antigens" should be removed.

      We respectfully disagree with the conclusion that our enrichment analysis lacks biological meaning. Our approach involves a direct comparison of the same set of observed TCR sequences between males and females. Consequently, any potential biases related to generation probability (Pgen), which affect all sequences equally, cannot account for the observed sex-specific differences. To summarize, because the comparison is performed on the same set of sequences, changes in the probability of generation across epitopes cannot explain the differences seen between the sexes.

      We do agree, however, that the composition of the reference databases may influence apparent enrichment patterns, as these resources contain uneven distributions of epitope categories and often incomplete information regarding HLA restriction. It should be noted that this limitation is inherent to all database-based annotation approaches, a fact which is explicitly acknowledged in the revised Discussion.

      The overrepresentation of self-specific TCRs in females is the manuscript's most interesting finding, yet it is not described in detail. The authors should list the corresponding self-antigens, indicate which autoimmune diseases they relate to, and show per-donor distributions of these matches.

      We thank the reviewer for this constructive suggestion.

      As recommended, we have expanded the description of the self-specific TCRs identified in our dataset and now provide this information in Supplementary Table 2 of the revised manuscript. Specifically, the table lists the corresponding self-antigens and the autoimmune diseases with which they are associated. In our curated database, these annotations primarily correspond to celiac disease and type 1 diabetes, which were the two autoimmune contexts explicitly defined in the manually curated reference datasets.

      For the “cancer” specificity group, we have clarified that antigen assignments were established based on (i) annotations available in the original databases (IEDB, VDJdb, McPAS-TCR) and (ii) cross-referencing with additional resources, including the Human Protein Atlas, the Cancer Antigenic Peptide Database (de Duve Institute), and the Cancer Antigen Atlas (Yi et al., iScience 2021), to ensure consistency in the classification of cancer and neoantigen specificities. Please refer to the Materials and Methods section for a full description of the procedure for this specific assignment.

      Donor-level distributions of these self-specific matches are now shown in Supplementary Figures 15B and 16 (previously Supplemental Figures 11B and 12), allowing direct visualization of inter-donor variability. Importantly, these plots confirm that the observed enrichment in females is not driven by a single individual, further supporting the robustness of the finding.

      The concept of poly-specificity is controversial. The authors should clearly explain how polyspecific TCRs were defined in this study and highlight that the experimental evidence supporting true polyspecificity is very limited (e.g., just a single TCR from Figure 5 from Quiniou et al.).

      We certainly agree (and regret) that the concept of TCR polyspecificity remains a subject of debate and is often underappreciated in the field of immunology. As Don Mason famously discussed in his seminal essay “A very high cross-reactivity is an essential feature of the TCR” (doi: 10.1016/S0167-5699(98)01299-7) published over 25 years ago, both theoretical and experimental evidence indicates that each TCR can, in principle, recognize millions of distinct peptides, albeit with variable avidity.

      Although this principle is widely accepted, it is frequently overlooked in the field of experimental immunology. In this area, anything that deviates from strict monospecificity is often disregarded as noise.

      In our own analyses of large-scale TCR repertoires, we have repeatedly observed that many CDR3 sequences are annotated with multiple specificities across different databases, often corresponding to peptides from unrelated organisms. As demonstrated in Quiniou et al. (eLife 2023), such polyreactive TCRs exhibit distinctive features, including biased physicochemical composition, and tend to be enriched in various biological contexts. In our preliminary study of such TCRs, which have the capacity to be specific for multiple viral- and self- epitopes, we hypothesized that they may serve as a first line of defense against pathogens and also be involved in triggering autoimmunity. We therefore consider it important to report this phenomenon rather than omit it, especially given its potential relevance to both protective immunity and autoimmunity.

      In the present study, polyspecific TCRs were defined operationally as TRB CDR3aa sequences associated with a minimum of two distinct specificity groups, corresponding either to different microbial species or to multiple antigen categories within the curated database. Therefore, our definition captures broader antigenic groupings rather than epitope-level binding events.
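
      This operational definition amounts to a simple grouping rule, sketched below. The function name and the input format (pairs of CDR3aa sequence and specificity group drawn from an annotated database) are assumptions made for illustration.

```python
from collections import defaultdict

def polyspecific_sequences(annotations, min_groups=2):
    """Map each CDR3aa annotated with at least min_groups distinct groups to those groups.

    annotations: iterable of (cdr3aa, specificity_group) pairs.
    """
    groups = defaultdict(set)
    for cdr3, group in annotations:
        groups[cdr3].add(group)
    return {cdr3: sorted(found) for cdr3, found in groups.items()
            if len(found) >= min_groups}
```

      Repeated annotations of the same sequence with the same group do not count twice, so a sequence reported many times for a single specificity group is not labelled polyspecific.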

      We fully acknowledge that direct experimental evidence for true molecular-level polyspecificity remains limited. Indeed, as the reviewer notes, only a single TCR with multiepitope reactivity has been rigorously demonstrated to date (Quiniou et al., 2023). Consequently, our analysis does not make claims about structural promiscuity; instead, it uses database-annotated cross-reactivity as a proxy to explore broader repertoire-level patterns.

      This definition has now been clarified in the Methods section and is discussed at greater length in the Discussion, which explicitly addresses these conceptual and methodological nuances.

      Minor:

      Clarify why the Pgen model was used only for DP and CD8 subsets and not for others.

      As noted, computing Pgen values involves two steps: (i) training a generative model of V(D)J recombination using IGoR, and (ii) estimating generation probabilities with OLGA based on that model. Both steps require a significant amount of computing power, especially when applied to large repertoires across multiple subsets. For this reason, we focused the analysis on DP thymocytes, which represent the repertoire prior to thymic selection, and CD8 T cells after CD8 selection.

      The Methods section should define what a "high sequence reliability score" is and describe precisely how the "harmonized" database was constructed.

      Briefly, the annotated database used in this study was constructed in accordance with the procedure established in our previously published work (Jouannet et al., NAR Genomics and Bioinformatics, 2025). The study integrates three publicly available resources, IEDB, VDJdb, and McPAS-TCR, which were collected as of October 2023. These three datasets were then merged into a single harmonized compendium, undergoing extensive standardization. When entries shared identical information across databases (same V–CDR3–J for both TRA and TRB, same epitope, organism, PubMed ID, and cell subset), only one representative was kept; discrepant or incomplete entries were retained to preserve information. We then assigned a sequence reliability score, the Verified Score (VS), following the verification strategy used by IEDB. The scale ranges from 0 to 2 and reflects the concordance between calculated and curated TRA/TRB CDR3 sequences (2 = both TRA and TRB present are verified, 1.1 = only TRA verified, 1.2 = only TRB verified, 0 = no verified chain). A second score, the Antigen Identification Score (AIS), is used to rank antigen-identification methods on a scale of 0 to 5, according to the strength of the experimental evidence supporting them.

      In the present study, “high reliability” refers to sequences with a verified TRB CDR3aa chain (VS ≥ 1.2) and an AIS score corresponding to T cells in vitro stimulation with a pathogen, protein or peptide, or pMHC X-mer sorting (> 3.2, excluding categories 4.1 and 4.2), ensuring that downstream analyses were performed on a rigorously curated and biologically trustworthy dataset. The Methods section now explicitly details these criteria.
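
      The "high reliability" criteria stated above can be expressed as a small filter. This is a sketch under assumptions: the record layout, field names, and function name are hypothetical, and the scores are treated as plain numeric category labels exactly as written in the text.

```python
from dataclasses import dataclass

EXCLUDED_AIS = {4.1, 4.2}  # categories explicitly excluded in the text

@dataclass(frozen=True)
class Entry:
    cdr3_beta: str
    vs: float   # Verified Score: 0, 1.1 (TRA only), 1.2 (TRB only) or 2 (both)
    ais: float  # Antigen Identification Score category, on a 0 to 5 scale

def is_high_reliability(entry):
    """True when the TRB chain is verified (VS >= 1.2) and the AIS criterion is met."""
    return entry.vs >= 1.2 and entry.ais > 3.2 and entry.ais not in EXCLUDED_AIS
```

      Note that the VS threshold works because the label 1.1 (only TRA verified) sorts below 1.2 (only TRB verified), so the comparison keeps exactly the entries with a verified TRB chain.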

      The statement "we generated 20,000 permuted mixed-sex groups" is unclear. It is not evident how this permutation corrects for individual variation or sex bias. A more appropriate approach would be to train the Pgen model separately for each individual's nonproductive sequences (if the number of sequences is large enough).

      The objective of this analysis was to determine whether the enrichment of TRBV06-5 in females was due to random grouping of individuals or whether it was attributable to sex itself. To do so, we generated all possible perfectly mixed groups of donors (i.e., groups containing an equal number of male and female donors) for the concerned thymocyte subset, and then performed 20,000 random pairwise comparisons between such mixed groups. For each comparison, we tested the TRBV06-5 usage between the two mixed groups. This procedure directly evaluates whether group composition (independent of sex) could spuriously generate differences in TRBV usage. Notably, none of these 20,000 comparisons between the two mixed groups yielded a statistically significant difference in TRBV06-5 usage. In contrast, when comparing the true male and female groups, a significant difference was identified. This demonstrates that the signal we observe is not driven by random donor grouping or individual-level variation, but is specifically associated with sex. It is important to note that this analysis, which is designed to exclude spurious group effects, is rarely performed in published repertoire studies, yet it provides an important internal control for robustness.
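
      A minimal sketch of this control is given below. Several choices are assumptions made for illustration: the statistical test (a Mann-Whitney U test with a normal approximation, since the test is not named here), the direct random draw of two disjoint balanced groups instead of enumerating all possible groups, and the per-donor usage values, which are invented.

```python
import random
from statistics import NormalDist


def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U p-value (normal approximation, no tie correction)."""
    values = sorted(a + b)
    rank_of = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # midrank of positions i+1..j
        i = j
    n_a, n_b = len(a), len(b)
    u = sum(rank_of[v] for v in a) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = (n_a * n_b * (n_a + n_b + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


def mixed_group_hit_rate(male, female, n_comparisons=20_000, alpha=0.05, seed=0):
    """Fraction of comparisons between two balanced mixed-sex groups that reach p < alpha."""
    rng = random.Random(seed)
    k = min(len(male), len(female)) // 2  # donors of each sex per mixed group
    hits = 0
    for _ in range(n_comparisons):
        m = rng.sample(male, 2 * k)
        f = rng.sample(female, 2 * k)
        group_a = m[:k] + f[:k]  # k males + k females
        group_b = m[k:] + f[k:]  # disjoint from group_a
        if mann_whitney_p(group_a, group_b) < alpha:
            hits += 1
    return hits / n_comparisons


# Hypothetical per-donor TRBV06-5 usage with a built-in sex difference
male_usage = [0.040, 0.041, 0.042, 0.043, 0.044, 0.045]
female_usage = [0.050, 0.051, 0.052, 0.053, 0.054, 0.055]

p_sex = mann_whitney_p(male_usage, female_usage)
mixed_rate = mixed_group_hit_rate(male_usage, female_usage, n_comparisons=2000)
```

      In this toy setting the true male-versus-female comparison is significant, whereas balanced mixed groups cannot reach significance, which is the pattern the control is designed to reveal.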

      Reviewer #2 (Recommendations for the authors):

      (1) Data availability "upon request" is unacceptable. All raw and processed data, as well as scripts used for analysis and figure generation, must be publicly deposited before publication.

      We would like to clarify that our intention has always been to make this dataset publicly available. It was a mistake to suggest otherwise in the original submission.

      (2) At the beginning of the Results section, include a brief description of the dataset: number of donors, sex ratio, age range, number of samples per subset, and sorting strategy. Although Figure 1 shows this, the information should also be mentioned in the main text.

      In line with the recommendation, we have now added a summary of the cohort characteristics at the beginning of the Results section. This includes the number of donors, sex ratio, age range, number of samples per subset, and the sorting strategy used. While this information was already included in Figure 1, we concur that including it directly in the main text enhances readability.

      (3) Report the number of cells and unique clonotypes analyzed per individual. Rank-frequency plots (in log-log coordinates) would be helpful.

      We have now added, for each donor and each subset, the number of cells, and additionally for each chain, the number of total and unique clonotypes analyzed. This information is provided in the revised manuscript in a new supplementary table (Supplemental Table 1).

      The requested rank-frequency plots (in log-log coordinates) have been integrated into the revised manuscript as Supplementary Figure 2.

      (4) For analysis in Figure 4B, the total fraction of hydrophobic amino acids should be calculated for each patient separately, and values for men and women should be compared (analogously to Figure 4C, but for the whole CDR3 and excluding alanine).

      Please note that the TRB CDR3aa composition in Figure 4B has already been quantified at the individual level. For each unique TRB CDR3aa sequence, we computed the proportion of each of the 20 amino acids across the CDR3β loop, then summarized these values per donor (mean per individual). The log2 fold change displayed in Figure 4B (and supplemental Figure 9 for TRA) is calculated from the median donor-level values for females versus males, rather than from pooled CDR3s. It is intended as a descriptive, “global” view of amino acid usage within the central CDR3 region. Hydrophobicity was not used directly in the computation; it is indicated only by bar color, based on the Kyte-Doolittle-derived IMGT classification. This provides an observational overview of amino acid composition in the central CDR3 region.
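
      The donor-level computation described above can be sketched as follows. This is an illustration under assumptions: proportions are taken over the full sequence string supplied (any trimming to the central CDR3 region is omitted), the function names are hypothetical, and the short sequences are invented.

```python
import math
from collections import Counter
from statistics import mean, median

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_proportions(cdr3):
    """Proportion of each of the 20 amino acids within one CDR3 sequence."""
    counts = Counter(cdr3)
    return {aa: counts[aa] / len(cdr3) for aa in AMINO_ACIDS}


def donor_profile(unique_cdr3s):
    """Mean per-sequence proportion of each amino acid for one donor."""
    per_seq = [aa_proportions(s) for s in unique_cdr3s]
    return {aa: mean([p[aa] for p in per_seq]) for aa in AMINO_ACIDS}


def log2_fold_change(female_profiles, male_profiles, aa):
    """log2 of median female over median male donor-level proportion.

    Undefined (raises) if either median is zero; real data would need a guard.
    """
    f = median([p[aa] for p in female_profiles])
    m = median([p[aa] for p in male_profiles])
    return math.log2(f / m)


# Invented toy repertoires: one list of unique CDR3s per donor
females = [donor_profile(["CASSLLF", "CASSLGF"]), donor_profile(["CASSLLF"])]
males = [donor_profile(["CASSLGF"]), donor_profile(["CASSGGF", "CASSLGF"])]

lfc_leucine = log2_fold_change(females, males, "L")
```
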

      As the mechanistic link between hydrophobicity and self-reactivity described by Stadinski et al. is explicitly position-dependent, we consider positional analyses to be the most appropriate method for formally interrogating this hypothesis, as we did in Figure 4C. Here, our primary focus was on the position-specific usage of hydrophobic amino acids at IMGT positions p109-p110. These positions correspond to the central p6-p7 positions described by Stadinski et al. For each individual, we computed the proportion of unique TRB CDR3aa sequences carrying a hydrophobic amino acid at a given position.

      Accordingly, in the revised manuscript we refined Figure 4C by excluding alanine because of its weak hydrophobicity (as recommended by the reviewer). This positional composition analysis now reveals a statistically significant increase in hydrophobic usage at p109 in female CD8 repertoires, with similar, though non-significant, trends at p109 in DP and CD4Teff and at p110 in CD8 cells. Figure 4B is therefore retained as an exploratory overview of amino acid usage along the CDR3 loop, while Figure 4C is used for the more specific question of hydrophobicity and potential cross-reactivity.
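
      A minimal sketch of the positional measure follows. It assumes that each CDR3 has already been mapped to IMGT positions (represented here as a hypothetical dictionary from position to residue), that the hydrophobic class is the IMGT set A, C, F, I, L, M, V, W with alanine removed, and that the denominator is the number of sequences in which the position is occupied. None of these details is specified in the response itself.

```python
# Assumed IMGT hydrophobic class (A, C, F, I, L, M, V, W) with alanine excluded
HYDROPHOBIC_NO_ALA = frozenset("CFILMVW")


def hydrophobic_fraction(cdr3s_by_position, position):
    """Share of a donor's unique CDR3s carrying a hydrophobic residue at `position`.

    `cdr3s_by_position` is a list of dicts mapping IMGT position -> residue.
    Sequences in which the position is unoccupied are ignored (an assumption).
    """
    residues = [seq[position] for seq in cdr3s_by_position if position in seq]
    if not residues:
        return float("nan")
    return sum(r in HYDROPHOBIC_NO_ALA for r in residues) / len(residues)


# Invented donor with three unique CDR3s, showing only positions 109 and 110
donor = [
    {109: "L", 110: "G"},
    {109: "A", 110: "V"},  # alanine at p109 no longer counts as hydrophobic
    {109: "G"},            # shorter loop: p110 is a gap
]

p109 = hydrophobic_fraction(donor, 109)
p110 = hydrophobic_fraction(donor, 110)
```

      Per-donor values of this kind can then be compared between females and males, position by position.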

      The Methods section has been expanded to provide clearer descriptions of these computations, and the Results and Discussion sections corresponding to Figures 4B-C (and supplemental Figure 9) have been revised to make the rationale, implementation, and interpretation of these hydrophobicity analyses more explicit.

      (5) Figure 6 shows a trend toward higher clustering of Treg TCRs in males, which could relate to the lower incidence of autoimmunity in men. The authors could test whether specific Treg clusters are male-specific and shared among male donors.

      As shown in Figure 6, a clear trend towards higher similarity among Treg CDR3aa sequences in males is evident, as indicated by the proportion of sequences included in clusters and in the overall similarity density. However, identifying “male-specific clusters” shared across donors is not straightforward in our analytical framework.

      In our approach, for each cell subset, CDR3aa sequences were downsampled 100 times to the smallest sample size, and clustering was repeated at each iteration. Therefore, the clusters’ identities are not consistent across iterations. The clusters depend on the specific subset of sequences selected at each downsampling step, as well as on their underlying Pgen distribution. Therefore, it is not possible to reliably assess whether specific clusters are systematically “male-shared”. This is because cluster composition is a function of stochastic resampling rather than of biological structure. For this reason, a comparison of cluster identities across donors would not produce interpretable results.
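
      The resampling scheme can be sketched as below. The clustering rule is an assumption for illustration (two sequences are linked when they have equal length and differ at no more than one position); the actual clustering method is not restated in this response, and the sequences are invented.

```python
import random
from statistics import mean


def hamming_le_1(a, b):
    """True if two equal-length sequences differ at no more than one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def clustered_fraction(seqs):
    """Fraction of sequences that have at least one neighbour in the sample."""
    in_cluster = [False] * len(seqs)
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming_le_1(seqs[i], seqs[j]):
                in_cluster[i] = in_cluster[j] = True
    return sum(in_cluster) / len(seqs)


def downsampled_clustered_fraction(repertoires, n_iter=100, seed=0):
    """Mean clustered fraction per donor over repeated downsampling to equal depth."""
    rng = random.Random(seed)
    depth = min(len(seqs) for seqs in repertoires.values())  # smallest sample size
    return {
        donor: mean([clustered_fraction(rng.sample(seqs, depth)) for _ in range(n_iter)])
        for donor, seqs in repertoires.items()
    }


# Invented repertoires: d1 is fully connected, d2 has no similar sequences
repertoires = {
    "d1": ["CASSLF", "CASSVF", "CASSIF", "CASSMF"],
    "d2": ["CAAAAF", "CSSGGF", "CTTLLF"],
}
fractions = downsampled_clustered_fraction(repertoires)
```

      Because each iteration draws a different subset, the summary statistic is stable across iterations while the identity of individual clusters is not, which is why only the former is compared between groups.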

    1. Author response:

      The following is the authors’ response to the original reviews

      We thank the reviewers for their constructive feedback, which has helped us prepare a substantially improved manuscript. In response to concerns about the conceptual distinction between prediction and stimulus dependency, we have fundamentally restructured the paper around the notion of passive control systems. This involved rewriting the Abstract, Introduction, and large portions of the Results (~60% of text revised).

      Key changes:

      - New analyses on Goldstein et al. (2022) data. We demonstrate that our findings—including the insufficiency of proposed corrections—generalise to the original dataset (Figures S2B, S3B, S5C, S6B).

      - Clarified novel contribution. We now make explicit that prior control analyses (residualisation, bigram removal) do not address the concern, because hallmarks persist in passive systems that cannot predict.

      - Proposed criterion for future work. Pre-onset neural encoding can only count as evidence for prediction if it exceeds a passive baseline (e.g., acoustics).

      We believe the revision offers a clearer, more rigorous contribution and provides a constructive framework for evaluating claims of neural prediction.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper tackles an important question: What drives the predictability of pre-stimulus brain activity? The authors challenge the claim that "pre-onset" encoding effects in naturalistic language data have to reflect the brain predicting the upcoming word. They lay out an alternative explanation: because language has statistical structure and dependencies, the "pre-onset" effect might arise from these dependencies, instead of active prediction. The authors analyze two MEG datasets with naturalistic data.

      Strengths:

      The paper proposes a very reasonable alternative hypothesis for claims in prior work. Two independent datasets are analyzed. The analyses with the most and least predictive words are clever, and nicely complement the more naturalistic analyses.

      Weaknesses:

      I have to admit that I have a hard time understanding one conceptual aspect of the work, and a few technical aspects of the analyses are unclear to me. Conceptually, I am not clear on why stimulus dependencies need to be different from those of prediction. Yes, it is true that actively predicting an upcoming word is different from just letting the regression model pick up on stimulus dependencies, but given that humans are statistical learners, we also just pick up on stimulus dependencies, and is that different from prediction? Isn't that in some way, the definition of prediction (sensitivity to stimulus dependencies, and anticipating the most likely upcoming input(s))?

      We thank the reviewer for this comment, which highlights that the previous version wasn’t sufficiently clear. Conceptually, the difference is critical: it is the difference between passively encoding or representing the stimulus (like e.g., a spectrogram of the stimulus would), and actively generating predictions.

      We have substantially changed the framing of the paper to put the notion of control systems centre-stage. One such control system is the speech acoustics: they encode the stimulus (and thus its dependencies) but cannot predict. When we observe the "hallmarks of prediction" in acoustics, this demonstrates the hallmarks can arise without any prediction.

      This brings me to some of the technical points: If the encoding regression model is learning one set of regression weights, how can those reflect stimulus dependencies (or am I misunderstanding which weights are learned)? Would it help to fit regression models on for instance, every second word or something (that should get rid of stimulus dependencies, but still allow to test whether the model predicts brain activity associated with words)? Or does that miss the point? I am a bit unclear as to what the actual "problem" with the encoding model analyses is, and how the stimulus dependency bias would be evident. It would be very helpful if the authors could spell out, more explicitly, the precise predictions of how the bias would be present in the encoding model.

      Different weights are estimated per time point in the time-resolved regression. This allows the model to learn how the response to words unfolds, but also to learn different stimulus dependencies at each timepoint. Fitting on every second word would reduce but not eliminate the problem. Our control system approach provides a more principled test. We have clarified the mechanism in the Introduction (lines 82-90), explaining how correlations between neighbouring words allow the regression model to predict prior neural activity without assuming pre-activation.
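
      The mechanism can be illustrated with a toy simulation, which is not the analysis used in the paper. It assumes a single scalar word feature that is autocorrelated across words (strength `rho`) and a passive signal that only encodes the current word plus noise. With one feature, the per-lag regression reduces to a correlation, computed separately for each lag. All names and parameter values are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(rho, n=5000, noise_sd=0.5, seed=0):
    """Autocorrelated word feature and a passive signal that encodes only the current word."""
    rng = random.Random(seed)
    feature = [rng.gauss(0, 1)]
    for _ in range(n - 1):
        feature.append(rho * feature[-1] + rng.gauss(0, 1))
    signal = [v + rng.gauss(0, noise_sd) for v in feature]  # no prediction anywhere
    return feature, signal


def encoding_at_lag(feature, signal, lag):
    """Fit between the feature of word t and the signal at word t + lag."""
    if lag < 0:
        return pearson(feature[-lag:], signal[:lag])
    if lag > 0:
        return pearson(feature[:-lag], signal[lag:])
    return pearson(feature, signal)


feature, signal = simulate(rho=0.6)
pre_onset = encoding_at_lag(feature, signal, -1)  # signal one word BEFORE onset
at_onset = encoding_at_lag(feature, signal, 0)

feature0, signal0 = simulate(rho=0.0)             # no stimulus dependencies
pre_onset_independent = encoding_at_lag(feature0, signal0, -1)
```

      In this sketch the passive signal shows clear "pre-onset" encoding whenever neighbouring words are correlated, and none when they are independent, even though nothing in the simulation predicts.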

      Reviewer #2 (Public Review):

      Summary:

      At a high level, the authors demonstrate that there is an explanation for pre-word-onset predictivity in neural responses that does not invoke a theory of predictive coding or processing. The paper does this by demonstrating that this predictivity can be explained solely as a property of the local mutual information statistics of natural language. That is, the reason that pre-word onset predictivity exists could simply boil down to the common prevalence of redundant bigram or skip-gram information in natural language.

      Strengths:

      The paper addresses a problem of significance and uses methods from modern NeuroAI encoding model literature to do so. The arguments, both around stimulus dependencies and the problems of residualization, are compellingly motivated and point out major holes in the reasoning behind several influential papers in the field, most notably Goldstein et al. This result, together with other papers that have pointed out other serious problems in this body of work, should provoke a reconsideration of papers from encoding model literature that have promoted predictive coding. The paper also brings to the forefront issues in extremely common methods like residualization that are good to raise for those who might be tempted to use or interpret these methods incorrectly.

      Weaknesses:

      The authors don't completely settle the problem of whether pre-word onset predictivity is entirely explainable by stimulus dependencies, instead opting to show why naive attempts at resolving this problem (like residualization) don't work. The paper could certainly be better if the authors had managed to fully punch a hole in this.

      We thank the reviewer for their assessment.

      We believe our paper does punch the hole that can be punched, which is a hole in the method. Our control demonstrates that adjusting the features (X matrix) cannot address dependencies that persist in the signal itself (Y matrix). Because the hallmarks emerge in a system that cannot predict (even after linearly removing the previous stimulus), attributing pre-onset encoding performance to neural prediction (rather than stimulus structure) is fundamentally ambiguous, and different approaches (e.g. variance partitioning) would suffer from the same ambiguity. We have reframed the manuscript to make this argument more clearly.

      Reviewer #3 (Public Review):

      Summary:

      The study by Schönmann et al. presents compelling analyses based on two MEG datasets, offering strong evidence that the pre-onset response observed in a highly influential study (Goldstein et al., 2022) can be attributed to stimulus dependencies, specifically, the auto-correlation in the stimuli—rather than to predictive processing in the brain. Given that both the pre-onset response and the encoding model are central to the landmark study, and that similar approaches have been adopted in several influential works, this manuscript is likely to be of high interest to the field. Overall, this study encourages more cautious interpretation of pre-onset responses in neural data, and the paper is well written and clearly structured.

      Strengths:

      (1) The authors provide clear and convincing evidence that inherent dependencies in word embeddings can lead to pre-activation of upcoming words, previously interpreted as neural predictive processing in many influential studies.

      (2) They demonstrate that dependencies across representational domains (word embeddings and acoustic features) can explain the pre-onset response, and that these effects are not eliminated by regressing out neighboring word embeddings - an approach used in prior work.

      (3) The study is based on two large MEG datasets, showing that results previously observed in ECoG data can be replicated in MEG. Moreover, the stimulus dependencies appear to be consistent across the two datasets.

      We’d like to thank the reviewer for their comments on our preprint.

      Weaknesses:

      (1) To allow a more direct comparison with Goldstein et al., the authors could consider using their publicly available dataset.

      We thank the reviewer for this suggestion. The Goldstein dataset was not publicly available when we conducted this research. However, we have now applied our control analyses to their stimulus material, and found that the exact same problem applies to their dataset, too.

      We have added analyses of the Goldstein et al. (2022) podcast stimulus throughout the paper. Results are shown in Figures S2B, S3B, S5C, and S6B. Critically, we observe the same pattern: both hallmarks emerge in the acoustic control system, and residualisation fails to eliminate them. This demonstrates that our findings generalise to the very dataset used to establish pre-onset encoding as evidence for neural prediction.

      (2) Goldstein et al. already addressed embedding dependencies and showed that their main results hold after regressing out the embedding dependencies. This may lessen the impact of the concerns about self-dependency raised here.

      We thank the reviewer for raising this point, as it reveals we failed to convey a central argument in the previous version. Goldstein et al.'s control analysis did not address the concern. We show that even after the control analyses that Goldstein et al. perform (removing bigrams, regressing out embedding dependencies) the "hallmarks of prediction" still emerge when applying the analysis to a passive control system that by definition does not predict: the speech acoustics. We now also show this in their data.

      To better convey this critical point, we have restructured the paper around the concept of "passive control systems". We now first establish that the hallmarks appear in acoustics (Figure 3), then show that residualisation fails to remove them (Figure 4). This makes explicit that any claim about "controlling for dependencies" must be validated against a system that cannot predict.

      (3) While this study shows that stimulus dependency can account for pre-onset responses, it remains unclear whether this fully explains them, or whether predictive processing still plays a role. The more important question is whether pre-activation remains after accounting for these confounds.

      We thank the reviewer for this question, and we agree that the question of whether pre-activation occurs is an important and interesting one. However, we ask a different question in our study: our goal is not to definitively establish whether the brain predicts during language processing; it is to scrutinise what counts as evidence for prediction, and to correct some highly influential claims made in the literature. The reviewer asks whether pre-activation remains "after accounting for these confounds." But the point we are trying to make is that, in this analytical framework, one cannot analytically account for these confounds: corrections to the X matrix leave dependencies in the data itself intact, as the acoustic control demonstrates.

      We do offer recommendations for future work. The passive control systems approach can serve as a benchmark: pre-onset neural encoding (or decoding) can only count as evidence for prediction if it exceeds what is observed in a passive control system like acoustics (which is not what we observe). Additionally, the field could move toward less naturalistic stimuli with tighter experimental controls, reducing the correlations that make this attribution so difficult. Developing a new definitive test is beyond the scope of our paper, but we believe applying this benchmark is a necessary first step.

      To make this clearer, we have rewritten the Discussion to explicitly state this criterion (lines 331-340) and to outline these recommendations for future work (lines 337-340). We have also added a paragraph extending our argument to decoding approaches (lines 343-354), noting that the same ambiguity applies regardless of analytical direction.

      Recommendations for Authors:

      Reviewer #1 (Recommendations for Authors):

      As per my "Weakness" point, I would appreciate engagement with the conceptual point related to the difference between prediction and stimulus correlations. Most importantly, I hope the authors will spell out more explicitly which predictions their proposal makes, and how exactly those would be present in an encoding model.

      Our proposal makes a clear prediction: if pre-onset encoding can be explained by stimulus dependencies (essentially a confound in the analysis), the same hallmarks should emerge in passive control systems that encode the stimulus but do not predict. We test this with word embeddings and speech acoustics, and both show the hallmarks despite not doing any prediction.

      Reviewer #2 (Recommendations for Authors):

      I greatly enjoyed reading the paper and only have minor quibbles. The work is overdue and will no doubt be a valuable addition to the literature to push back on over-hyped claims about the implications of pre-word predictivity in neural response. I have few issues with the methods that the paper uses, they seem sensible and in line with previous work that has investigated these questions, and I did not find typos.

      One point I would like to raise is whether or not there is a more effective solution to resolving the issues behind residualization that the paper demonstrates. The authors show that removing next-word information does not effectively resolve the problem that local relationships in the stimulus dataset pose. The challenge to me here seems to be that it is difficult to get a model to "not learn" a relationship that is learnable. I wonder if a better solution to this is to not try to get a model to exclude a set of information but instead to do some sort of variance partitioning where you train a model to predict the next-word representation from the current-word representation (as in the self-predictivity analysis) and then build an encoding model out of the predicted representation. Then, compare the pre-word-onset encoding performance of the prediction with the pre-word-onset encoding performance of the original representation. If the performance of the two models roughly matches, that would be strong evidence that most of what these models are capturing before word onset is just explainable by the stimulus dependencies, no?

      We would like to thank the reviewer for their kind words and positive appraisal!

      The proposed analysis is that if a linear proxy representation, w_hat_t – predicted linearly from w_{t-1} – yields pre-onset predictivity comparable to the actual w_t vector, this would support that the effect can be explained by stimulus dependencies. While this is an interesting alternative analysis, we would be cautious about the inverse conclusion: that if w_t outperforms the linear proxy w_hat_t, the residual variance must reflect true neural prediction.

      This is because of our control system results. We show that even when we remove the "predictable" shared variance – which is similar to computing the difference between w_t and w_hat_t – the unique information still yields pre-onset predictivity, albeit reduced, in the passive acoustics that by definition cannot predict. Therefore, instead of developing an ever-more-clever way to "correct" for the problem by adjusting the X matrix, we focus on showing that the problem lies in the stimulus itself. For the revision, we focused on reframing the problem and hope we have punched a fuller hole in the logic by breaking down the fundamental issue more clearly and showing it applies to the stimulus material of Goldstein et al. (2022) as well.

      Additionally, I would say that I was a bit confused about what was going on in the methods figures, to the point where I do not see the value in having them, but thankfully, the text was clear enough to resolve that confusion.

      We are sorry the methods illustrations weren’t helpful. In presentations, we have found the illustrations generally helpful for conveying the analysis, e.g. the aspect of keeping the analysis identical but simply replacing the brain data with either word vectors (current Figure 2) or acoustics (current Figure 3). In the revision we have reorganised the schematics slightly: we introduce the acoustics as a control system earlier, so that residualisation and its insufficiency can be introduced separately (Figure 4). We hope this helps.

      Reviewer #3 (Recommendations for Authors):

      (1) My major concern is the extent to which this study offers new insights beyond what was already demonstrated in Goldstein's work. First, the embedding dependency highlighted by the authors seems somewhat expected, given how these embeddings are constructed: GloVe embeddings are based on word co-occurrence statistics, and GPT embeddings are combinations of embeddings of preceding words. More importantly, Goldstein et al. addressed this issue by regressing out neighboring word embeddings. This control was effective, as also confirmed by the current manuscript, and their main results remain. Therefore, the embedding dependency appears to have been properly accounted for in the earlier study.

      Building on the previous point, I appreciate the analysis of dependencies across representational domains, which I see as the main novel contribution of this manuscript. I would encourage the authors to explore this aspect more deeply. If I understand correctly, stimulus dependencies may persist even after regressing out neighboring word embeddings due to two potential factors:

      (a) Temporal dependencies in embeddings: since the regression of neighbor words is performed at the word level rather than over time, temporal dependency may remain.

      (b) Cross-feature dependencies - specifically, correlations between embeddings and acoustic features.

      Regarding the first factor, it is not entirely clear to me whether this is a real problem—i.e., whether word-level regression fails to remove temporal dependencies. A simulation could help clarify this and support the argument. While it's not essential, it would be valuable if the authors could propose a method to address this issue, or at least outline it as a direction for future work.

      For the second point, it would be helpful for the authors to explicitly explain the potential relationship between word embeddings and acoustic features. Additionally, while correlations between features are a common problem in speech research, they are typically addressed by regressing out acoustic features early in the analysis (Gwilliams et al., 2022). It would strengthen the current findings if the authors could test whether the self-predictability persists even after controlling for neighboring embeddings and acoustic features.

      We appreciate the extensive and detailed engagement with our work, which has been very useful in highlighting key unclarities and gaps we had to address.

      We do believe our study goes well beyond what was shown by Goldstein, by identifying a fundamental limitation in their analysis, and showing that their purported control analyses do not in fact control for the problem. We’ll address the reviewers' sub-questions in turn.

      (i) Why this offers crucial insights beyond Goldstein et al.

      While Goldstein et al. indeed addressed embedding dependencies via residualization (or in their case projection), their conclusion relied on the assumption that any neural encoding surviving this "fix" must reflect genuine predictive pre-activation. Our study invalidates this assumption. By applying the residualization fix, we show that the "hallmarks of prediction" persist just as robustly in a passive control system that cannot predict (the speech acoustics) as in the neural data. (We also show this for bigram removal.)

      This provides a key new insight: persistent pre-onset predictivity after “correction” is not evidence that the dependency issue was solved. Instead, because the same effect persists in a system that cannot predict (acoustics), the persistence of the hallmarks cannot be attributed to prediction. It demonstrates that the standard "fix" is mathematically insufficient to remove the confound, rendering the original evidence for neural prediction fundamentally ambiguous.

      (ii) Why do dependencies/hallmarks persist after residualization?

      Residualization successfully removes the linear dependency between the current embedding (w_t) and the previous embedding (w_{t-1}) within the feature space. However, it does not (and cannot) remove the dependency from language itself, and therefore from the brain which (in some format) encodes the linguistic stimulus. Language is massively redundant. Knowing the current word tells you something about what came before – acoustically, syntactically, semantically. As long as the embedding identifies the word, the regression model will re-learn this relationship. For instance, in the case of acoustics, even when using the corrected embedding, the regression will re-learn that certain words (e.g., "Holmes") tend to follow certain acoustic patterns (e.g., the acoustics of "Sherlock"). This shows that correcting the embeddings is insufficient: the dependencies exist in language itself, and the model will re-learn them from any signal that encodes that language.
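
      The following toy example illustrates the logic and is not the analysis from the paper. It assumes a three-word vocabulary with a scalar "embedding" and a scalar "acoustic" value per word, where the acoustic value is deliberately not a linear function of the embedding, and a hypothetical transition table in which the next word depends on the previous one. Residualising w_t on w_{t-1} removes the linear dependency within the feature space, yet the residual still relates to the previous word's acoustics.

```python
import random

# Hypothetical per-word values: acoustics are NOT a linear function of the embedding
EMBEDDING = {"A": -1.0, "B": 0.0, "C": 1.0}
ACOUSTICS = {"A": 1.0, "B": 0.0, "C": 1.0}
# Hypothetical transition probabilities: the next word depends on the previous word
TRANSITIONS = {
    "A": (("C", "B", "A"), (0.8, 0.1, 0.1)),
    "C": (("C", "B", "A"), (0.8, 0.1, 0.1)),
    "B": (("A", "B", "C"), (0.8, 0.1, 0.1)),
}


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residualise(y, x):
    """Residuals of an ordinary least-squares fit of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


def simulate_words(n, seed=0):
    """Draw a word sequence from the transition table above."""
    rng = random.Random(seed)
    words = ["B"]
    for _ in range(n - 1):
        options, weights = TRANSITIONS[words[-1]]
        words.append(rng.choices(options, weights=weights)[0])
    return words


words = simulate_words(20_000)
emb = [EMBEDDING[w] for w in words]
aco = [ACOUSTICS[w] for w in words]

resid = residualise(emb[1:], emb[:-1])        # w_t with w_{t-1} regressed out
leak_in_features = pearson(resid, emb[:-1])   # removed by construction
leak_in_acoustics = pearson(resid, aco[:-1])  # previous word's passive signal
```

      The residualised feature is uncorrelated with the previous embedding but remains clearly correlated with the previous acoustics, so a regression from the "corrected" feature to the passive signal still shows pre-onset predictivity.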

      (iii) Why not regress out the acoustics?

      This is also why "regressing out acoustics" (as the reviewer suggests) would miss the point. We do not claim that acoustic features leak into the neural signal or that acoustics are a specific confound to be removed. Rather, we use acoustics as a “passive baseline”: a system that encodes the stimulus but cannot predict. That the method yields "hallmarks of prediction" in this baseline demonstrates these hallmarks are not valid evidence for prediction, regardless of what additional features one regresses out. This motivates our proposed criterion: future studies seeking evidence for neural pre-activation should not rest their case on finding pre-onset encoding per se, since passive systems show this too. Rather, they should demonstrate that the brain signal contains more information about the upcoming word than the passive stimulus baseline.

      As these aspects are fundamental to the interpretation of our study, we have fundamentally re-organised and re-written large parts of the paper. We hope it is much clearer now.

      (2) To better compare to Goldstein's work, the author may consider performing the same analyses using their publicly available dataset.

      This is a good suggestion. When we initially conducted this research, the Goldstein dataset was not yet publicly available. It now is, and we have applied our analyses to their stimulus material. The same problem emerges: the hallmarks of prediction appear in the acoustics of their podcast stimuli. Even after applying the control analyses, pre-onset predictivity is robust in their acoustics (indeed, in correlation terms, higher than reported for the neural data, so there is not more predictivity in the brain than in the stimulus material), confirming that the issue we identify applies to the original dataset. Results are shown in Figures S2B, S3B, S5C, and S6B.

      (3) It would also be interesting to show the predictability effect after word onsets for the self-predictability analyses, for example, in Figure 2C. The predictability effect is reflected not only in pre-onset responses but also in post-onset responses, i.e., larger responses for unpredicted words. Do the stimulus dependencies mirror this effect?

      Our paper focuses specifically on temporal dependencies – the capacity of the current word to predict the previous stimulus signal (e.g., previous acoustics, previous embeddings) – and how this mimics neural pre-activation. Post-onset analyses, by contrast, concern the mapping between the current word and its concurrent signal, which involves fundamentally different mechanisms (e.g., mapping fidelity, frequency effects, acoustic clarity, word length) and would require covariates for the post-onset attributes of the word to be meaningfully interpreted. Post-onset, there can be differences between predictable and unpredictable words – e.g. sometimes unpredictable words are pronounced with more emphasis – which is why surprisal studies include a large range of covariates. However, this is not about stimulus dependencies or pre-activation, so we consider it beyond the scope of our study.

      (4) The authors might consider reporting the encoding performance for the residual word embeddings, similar to Figure S6B in Goldstein's paper. This would allow us to determine whether pre-activation persists in the MEG responses and compare its pattern with the predictability of pre-onset acoustics.

      We do report this analysis; in the revised supplement, it is shown in Figure S7. We placed it in the supplement precisely because residualized embeddings are not the "fix" they appear to be: as we show, they still yield strong pre-onset predictivity in the passive acoustic baseline (Figure 4, S6), undermining their use as a control.

      (5) The series of previous pre-activation analyses proposed fruitful findings, e.g., the difference between brain regions (Fig. S4, (Goldstein et al., 2022)) and the difference between listeners and speakers (Figure 2, (Zada et al., 2024)). Whether these observed differences can be explained by the stimulus dependency?

      We appreciate this question. Our goal is to address the general logic of using pre-onset encoding as evidence for prediction, rather than to critique every finding in specific papers, especially as it pertains to a specific author. But briefly:

      Speaker vs. Listener differences (Zada et al., 2024): Zada et al. report distinct temporal profiles: speaker encoding peaks pre-onset (planning?), whereas listener encoding peaks post-onset but shows a pre-onset "ramp." Our critique applies to interpreting this ramp as "prediction." However, this interpretation is not central to their paper, which focuses on speaker-listener coupling via shared embedding spaces. We leave the implications (which are clear enough) to the reader.

      Regional differences (Goldstein et al., 2022): Encoding timecourses do vary across electrodes, as we also observe across MEG sources (and participants). But our point is logical: because pre-onset encoding does not necessarily reflect prediction, finding a channel with stronger pre-onset encoding does not mean that channel performs “more prediction”. For instance, one subject in the Armeni dataset showed higher pre-onset than post-onset encoding (and indeed activity) overall – but it would be implausible to conclude this subject "only predicts" and does not “process” or “listen”. More likely, this reflects differences in signal-to-noise, integration windows, or source contributions. The exact sources of these morphological differences are interesting but unclear, and speculating on them is beyond our scope.

      (6) I appreciate that the authors have shared their code; however, some parts appear to be missing. For example, the script encoding_analysis.py only includes package-loading code.

      Thank you for noticing this. We have updated our code database.

      (7) What do the error bars in the figures represent - for example, in Figure 1C? How many samples were included in the significance tests? The difference between the two curves appears small, yet it is reported as significant. Additionally, Figure S1 shows large differences between subjects and between the two MEG datasets. Do the authors have any explanation for these differences?

      The shaded areas in our previous Figure 1C show 95% confidence intervals computed over the 100 MEG sources identified to be part of the bilateral language system and the 10 cross-validation splits.
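A minimal sketch of how such a band could be computed at a single time lag, assuming one encoding score per source and cross-validation split (100 × 10 = 1,000 values). The response does not say how the interval was obtained, so the normal approximation used here (mean ± 1.96 standard errors) is an assumption, as are the simulated scores and the function name `ci95`. Treating sources and splits as independent samples is a further simplification.

```python
import random
import statistics


def ci95(values):
    """Return (lower, mean, upper) of a normal-approximation 95% confidence interval."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5  # standard error of the mean
    half_width = 1.96 * sem
    return mean - half_width, mean, mean + half_width


# Hypothetical encoding scores (correlations) for 100 sources x 10 CV splits.
rng = random.Random(0)
n_sources, n_splits = 100, 10
scores = [rng.gauss(0.05, 0.02) for _ in range(n_sources * n_splits)]

lower, mean, upper = ci95(scores)
print(f"mean = {mean:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

With 1,000 values the standard error is small, which is one reason two curves can look close yet differ significantly.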

      We do not have an elaborate explanation for the differences in encoding performance across the three subjects in the few-subject dataset. Instead, we interpret these differences as a likely consequence of substantial inter-individual variability in evoked responses, even at the source level, arising from differences in cortical folding and the orientation of underlying current dipoles. We deem this a likely explanation since different electrodes in Goldstein’s ECoG data also showed very different encoding profiles.

      With respect to the multi-subject dataset, we suspect that the large differences most likely stem from two factors. First, the acoustics were purposefully manipulated by the experimenters to reduce temporal dependence. This made it harder for listeners to concentrate on the stories and may thereby have led to lower-quality neural data. It also reduced one form of stimulus dependency, namely the acoustic temporal dependencies, which the encoding model could otherwise exploit to reach higher encoding accuracies. Second, MEG has a notoriously poor signal-to-noise ratio, and the amount of data per participant (7,745 words as opposed to 85,719 in the few-subject dataset) might not have been enough to produce reliably high encoding results.

      Finally, the current study is clear and convincing, and my suggestions are not intended to question its novelty or robustness. Rather, I believe the authors are in a strong position to address a critical question in language processing: whether pre-activation occurs. The authors have thoughtfully considered important confounds related to pre-onset responses. Adding some approaches to regressing out these confounds could be particularly helpful for determining whether a true pre-onset response remains.

      We thank the reviewer again for their constructive feedback, suggestions and questions. To clarify, however, our goal is *not* to definitively attest to whether pre-activation occurs. Our goal is simply to scrutinise a specific method to test for linguistic prediction. This method purports to be an improvement on conventional post-onset (e.g. surprisal-based) methods, as it can directly investigate effects occurring prior to word onset. We have demonstrated fundamental limitations in the underlying logic of this method. We propose passive control systems as baselines against which claims of prediction should be evaluated. Against this baseline, the current evidence does not show unequivocal support for prediction: pre-onset encoding in the brain does not exceed that in the passive control. However, we do not conclude from this that pre-activation does not exist — that would require a different study entirely. Our aim is more methodological: to establish what should count as evidence for prediction, not to settle whether prediction occurs.

      We would like to thank the reviewers and editors for their thoughtful feedback, which has been tremendously helpful in improving the paper.

    1. You want to review like a junior. When AI generates code, don't review it like a senior checking for bugs. Review it like a junior trying to learn. There are three questions. Can I explain every line to a teammate? Do I understand why this approach was chosen? And could I modify the code confidently without reprompting? If the answer to any of these is no, you don't merge it. You sit with it. You understand it or you rewrite the parts that you don't.

      Review AI-generated code as if you are a junior developer trying to learn

    2. Code review wasn't about catching bugs. It was a primary mechanism for knowledge transfer. Junior engineers learned architecture by reading senior engineers' PRs, seniors maintained context by reviewing everything, and code review was the learning loop. It was how shared theory was built and maintained across the entire team. But AI bypasses that entire loop. The code shows up, it works, it gets merged, but nobody learned anything. Nobody built context. Nobody can debug at 3 AM without reprompting the AI. And code review was never about catching bugs. It was always about building shared understanding. AI skipped the most important step.

      Code Review to maintain shared context

    1. Reviewer #2 (Public review):

      Summary:

      This manuscript introduces Miniscope Processing Suite (MPS), a novel no-code GUI-based pipeline built to easily process long-duration one-photon calcium imaging data from head-mounted Miniscopes. MPS aims to address two large problems that persist despite the rapid proliferation of Miniscope use across the field. The first issue is concerned with the high technical barrier to using existing pipelines (e.g., CaImAn, MIN1PIPE, Minian, CaliAli) that require users to have coding skills to analyze data. The second problem addressed is the intense memory limitations of these pipelines, which can prevent analysis of long-duration (multi-hour) recordings without state-of-the-art hardware. The MPS toolbox takes inspiration from what existing pipelines do well, innovates new modules like Window Cropping, NNDSVD initialization, Watershed-based segmentation, and improves the user experience to improve access to calcium imaging analysis without the need for new training in new coding languages. In many ways, MPS achieves this aim, and thus will be of interest to a growing, broad audience of new calcium imagers.

      There are, however, some concerns with the current manuscript and pipeline that, if addressed, would greatly improve the impact of this work. Currently, the manuscript provides insufficient evidence that MPS can generate good results efficiently on various data sets, and it is not properly benchmarked against other established packages. Additionally, considering the goal of MPS is to attract novices to attempt Miniscope analysis, better tutorials, documentation, and walkthroughs of expected vs inaccurate results should be provided so that it is clear when the user can trust the output. Otherwise, this simplified approach may end up leading new users to erroneous results.

      Strengths:

      The manuscript itself is well-organized, clear, and easy to follow. MPS is clearly designed to remove the computational barrier for entry for a broad neuroscience community to record and analyze calcium data. The development of several well-detailed algorithmic innovations merits recognition. Firstly, MPS is extremely easy to install, keep updated, and step through. Having each step save every output automatically is a well-thought-out feature that will allow users to enter back into the pipeline at any step and compare results.

      The implementation of an erroneous frame identifier and remover during preprocessing is an important new feature that is typically done offline with custom-built code. Interactive ROI cropping early in the pipeline is an efficient way to lower pixel load, and NNDSVD initialization is a new way to provide nonnegative, biologically interpretable starting spatial and temporal factors for later CNMF iterations. Parallel temporal-first update ordering cuts down dramatically on later computational load. Together, all these features, neatly packaged into a no-code GUI like the Data Explorer for manual curation, are practical additions that will benefit end users.

      Weaknesses:

      A major limitation of this manuscript is that the authors don't validate the accuracy of their source extraction using ground-truth data or any benchmark against existing pipelines. The paper uses their own analysis of processing speeds, component counts, signal-to-noise ratio improvements, and morphological characteristics of detected cells, but it needs to be reworked to include some combination of validation against manually annotated ground truth data sets, simulated data with known cell locations and activity patterns, or cross-validation with established pipelines on identical datasets. Without this kind of validation, it is impossible to truly determine whether MPS produces biologically acceptable results that help distinguish it from what is currently already available. For example, line 57 refers to the CaImAn pipeline having near-human efficiency (Figures 3-5 and Tables 1 and 2 of the CaImAn paper), but no specific examples for MPS performance benchmarks are made. Figure 15 of the Minian paper provides other examples of how to show this.

      Considering one of the main benefits of MPS is its low memory demand and ability to run on unsophisticated hardware, the authors should include a figure that shows how processing times and memory usage scale with dataset sizes (FOV, number of frames and/or neurons, sparsity of cells) and differing pipelines. Figure 8 of the CaImAn paper and Figure 18 of the Minian paper show this quite nicely. Table 1 currently references how "traditional approaches" differ methodologically from MPS innovations, but runtime comparisons on identical datasets processed through MPS, CaImAn, Minian, or CaliAli would be necessary to substantiate performance claims of MPS being "10-20X faster". Additionally, while the paper does mention the type of hardware used by the experimenters, a table with a full breakdown of components, as well as the minimum requirements for smooth processing, may be useful for reproducibility.

      The current datasets used for validating MPS are not described in the manuscript. The manuscript appears to have 28 sessions of calcium imaging, but it is unclear if this is a single cohort or even animal, or whether these data are all from the same brain region. Importantly, the generalizability of parameter choices and performance could vary for others based on brain region differences, use of alternative calcium indicators (anything other than GCaMP8f used in the paper), etc. This leads to another limitation of the paper in its current form. While MPS is aimed at eliminating the need to code, users should not be expected to blindly trust default or suggested parameter selections. Instead, users need guidance on what each modifiable parameter does to their data and how each step analysis output should be interpreted. Perhaps including a tutorial with sample test data for parameter investigation and exploration, like many other existing pipelines do, is warranted. This would also increase the transparency and reproducibility of this work.

      Currently, the documentation and FAQ website linked to MPS installation does not do an adequate job of describing parameters or their optimization. The main GitHub repository does contain better stepwise explanations, but there needs to be a centralized location for all this information. Additionally, a lack of documentation on the graphs created by each analysis step makes it hard for a true novice to interpret whether their own data is appropriately optimized for the pipeline. Greater detail on this would greatly improve the quality and impact of MPS.

    2. Author response:

      (1) Claim regarding NNDSVD initialization

      Reviewer #1:

      The authors state that "MPS is the first implementation of Constrained Non-negative Matrix Factorization (CNMF) with Nonnegative Double Singular Value Decomposition (NNDSVD) initialization." However, NNDSVD initialization is the default method in scikit-learn's NMF implementation and is also used in CaImAn. I recommend rephrasing this claim in the abstract to more accurately reflect MPS's novelty, which appears to lie in the specific combination of constrained NMF with NNDSVD initialization, rather than being the first use of NNDSVD initialization itself.

      We agree that our original phrasing was too broad. NNDSVD-family initialization is widely used in NMF implementations (e.g., scikit-learn) and is available within some pipeline components. We revised the abstract and main text to clarify our intended contribution: MPS seeds CNMF directly with NNDSVD-derived nonnegative factors as the primary initialization strategy, rather than relying on heuristic or greedy ROI-based seeding, integrated within a memory-efficient, end-to-end workflow for long-duration miniscope recordings.

      (2) Installation issue on macOS

      Reviewer #1:

      At present, there are practical issues that limit the usability of the software. The link to the macOS installer on the documentation website is not functional. Furthermore, installation on a MacBook Pro was unsuccessful, producing the following error: "rsync(95755): error: ... Permission denied ...unexpected end of file."

      We thank the reviewer for identifying the broken installer link and the macOS installation error. We fixed the macOS installer link on the documentation website and updated installation instructions to explicitly address common macOS permission-related failures (including rsync "Permission denied" errors that arise when attempting to write into protected directories without appropriate privileges). We re-tested installation on clean macOS systems and confirmed successful installation under the revised instructions.

      (3) Validation, benchmarking, and cross-pipeline comparison

      Reviewer #2:

      A major limitation of this manuscript is that the authors don't validate the accuracy of their source extraction using ground-truth data or any benchmark against existing pipelines... Without this kind of validation, it is impossible to truly determine whether MPS produces biologically acceptable results... Considering one of the main benefits of MPS is its low memory demand and ability to run on unsophisticated hardware, the authors should include a figure that shows how processing times and memory usage scale with dataset sizes and differing pipelines... runtime comparisons on identical datasets processed through MPS, CaImAn, Minian, or CaliAli would be necessary to substantiate performance claims of MPS being "10-20X faster".

      We thank the reviewers for their careful reading and for raising the question of biological validity, which we agree is central to any calcium imaging analysis tool. We would like to clarify, however, that MPS does not introduce a novel source extraction algorithm, and therefore the question of biological validity is not one that MPS alone can answer - nor should it be expected to. MPS is built on CNMF, the same mathematical framework underlying CaImAn and Minian. The contribution of MPS lies in its initialization strategy and parallelization architecture, which allow this proven framework to operate in the multi-hour recording regime.

      To address the reviewers' request for a direct qualitative comparison, we will run MPS, CaImAn, Minian, and MIN1PIPE on a representative 10-minute real recording with clearly visible neurons. The figure will show the spatial components (ROI footprints) and representative temporal traces (ΔF/F) for all four pipelines on identical data. We anticipate that the spatial layouts and temporal dynamics will be highly concordant across pipelines, demonstrating that MPS produces biologically consistent output. We believe this side-by-side comparison will provide a clear demonstration that MPS output is comparable in quality to established tools on tractable recordings.

      Regarding runtime comparison across pipelines, we will provide a table showing approximate processing times at three recording durations (5, 20, and 180 minutes). On short recordings, all pipelines are expected to complete successfully at different rates, whereas on long-duration recordings, this pipeline behavior is expected to diverge. We acknowledge that any single runtime benchmark reflects specific hardware and dataset characteristics and may not generalize to all configurations. We will therefore present these data as illustrative rather than definitive and will direct readers to the MPS documentation for guidance on hardware-specific tuning.

      (4) Dataset description and scope of generalizability

      Reviewer #2:

      The current datasets used for validating MPS are not described in the manuscript. The manuscript appears to have 28 sessions of calcium imaging, but it is unclear if this is a single cohort or even animal, or whether these data are all from the same brain region. Importantly, the generalizability of parameter choices and performance could vary for others based on brain region differences, use of alternative calcium indicators...

      We agree that the dataset description should be centralized and unambiguous. We added a dedicated Methods subsection stating that all results are based on a single, controlled experimental dataset consisting of 28 long-duration miniscope sessions acquired under consistent conditions (same brain region, calcium indicator, optical configuration, and acquisition parameters). This section explicitly specifies the number of animals, brain region, frame rate, field of view, session duration, and total data volume. We also clarified that conclusions are intended to evaluate MPS performance in this controlled long-duration setting rather than to claim universal parameter generalizability across brain regions, indicators, or optical systems.

      (5) Parameter guidance and documentation

      Reviewer #2:

      ...users should not be expected to blindly trust default or suggested parameter selections. Instead, users need guidance on what each modifiable parameter does to their data and how each step analysis output should be interpreted. Currently, the documentation and FAQ website linked to MPS installation does not do an adequate job of describing parameters or their optimization...

      We agree that users should not blindly trust default or suggested parameters. We substantially expanded and centralized documentation by adding a parameter-selection walkthrough that explains what each modifiable parameter does, how it affects intermediate and final outputs, and how diagnostic plots generated at each stage should be interpreted. Rather than prescribing dataset-specific parameter values, we explicitly framed parameter selection as an iterative, hypothesis-driven process informed by experimental factors such as calcium indicator kinetics, lens size and numerical aperture, field of view, recording duration, and expected neuronal density. We consolidated previously dispersed explanations from the GitHub repository into a single documentation site and expanded figure descriptions to guide interpretation by less experienced users. A representative sample dataset and accompanying analysis code were made publicly available at https://github.com/ariasarch/MPS_Sample_Code to support parameter exploration on tractable data.

      (6) Packaging and distribution

      Reviewer #1:

      ...current best practices in software development increasingly rely on continuous integration and continuous deployment (CI/CD) pipelines to ensure reproducibility, testing, and long-term maintenance. In this context, it has become standard for Python packages to be distributed via PyPI or Conda. Without dismissing the value of standalone installers, the overall quality and sustainability of MPS would be greatly enhanced by also supporting conventional environment-based installations.

      Regarding distribution more broadly: while our one-click installers are intended to reduce setup burden for non-programmers, we recognize the value of conventional environment-based distribution for long-term sustainability. We are exploring the feasibility of adding a standard PyPI and/or Conda installation pathway alongside the standalone installers. To ensure reproducibility across environments, all package dependencies are now explicitly version-pinned at installation time, eliminating environment drift as a source of irreproducibility.

      We would note, however, that PyPI distribution alone does not fully resolve the reproducibility challenges inherent to scientific Python software. Even with version-pinned dependencies, downstream changes in the Python interpreter itself, compiled extension modules, and platform-specific build toolchains can silently alter numerical behavior in ways that are difficult to anticipate or control. Our standalone installers address this by shipping a complete, fixed execution environment, and we believe this remains a meaningful architectural advantage for ensuring long-term reproducibility - particularly for non-developer users who may not be in a position to diagnose subtle environment-related failures. We see PyPI/Conda support and standalone installers as complementary rather than equivalent approaches, and will pursue both where feasible.

    1. Reviewer #1 (Public review):

      Summary:

      The authors present a compelling case for the necessity of age-specific templates in functional hyperalignment. Given that the brain undergoes substantial developmental, structural, and functional changes across the lifespan, a 'one-size-fits-all' canonical template is often insufficient. This study effectively demonstrates that incorporating age-congruent features significantly enhances the performance and sensitivity of hyperalignment models. By validating these findings across two independent datasets (Cam-CAN and DLBS), the paper provides robust evidence that accounting for age-related functional organization is a critical prerequisite for accurate functional alignment in lifespan research.

      Strengths:

      (1) The authors used three metrics to evaluate performance. Across all metrics, they found that age-congruent templates outperformed age-incongruent templates, suggesting that age-specific templates can improve alignment.

      (2) These findings highlight the superiority of age-congruent templates for hyperalignment. This work underscores the importance of age-matching in cross-subject functional mapping and represents a vital step forward for the methodology.

      Weaknesses:

      (1) Participant Demographics and Group Separation:

      The study defines the 'older' cohort as 65-90 years and the 'younger' cohort as 18-45 years. While this 20-year gap (ages 46-64) effectively maximizes the contrast between groups, the results in Figure 4a suggest that the predicted individualized connectomes follow a continuous distribution. Given this continuity, could the authors provide the average median trends for Figures 2a and 2b to illustrate how the model behaves across the missing age range?

      (2) Request for Implementation:

      I have been unable to locate the source code associated with this publication. Could the authors please provide a link to the repository or clarify if the implementation is available for reproduction?

      (3) Analysis of Prediction Performance and Distribution:

      While Figures 3b and 5b clearly demonstrate that the congruent template improves correlation, Figure 4a shows a distinct shift in the scatter distribution. Could the authors provide a detailed explanation of the prediction performance metrics used? Specifically, I would like to understand how the underlying method accounts for the distribution differences observed when applying the congruent template.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The electrocardiogram (ECG) is routinely used to diagnose and assess cardiovascular risk. However, its interpretation can be complicated by sex-based and anatomical variations in heart and torso structure. To quantify these relationships, Dr. Smith and colleagues developed computational tools to automatically reconstruct 3D heart and torso anatomies from UK Biobank data. Their regression analysis identified key sex differences in anatomical parameters and their associations with ECG features, particularly post-myocardial infarction (MI). This work provides valuable quantitative insights into how sex and anatomy influence ECG metrics, potentially improving future ECG interpretation protocols by accounting for these factors.

      Strengths:

      (1) The study introduces an automated pipeline to reconstruct heart and torso anatomies from a large cohort (1,476 subjects, including healthy and post-MI individuals).

      (2) The 3-stage reconstruction achieved high accuracy (validated via Dice coefficient and error distances).

      (3) Extracted anatomical features enabled novel analyses of disease-dependent relationships between sex, anatomy, and ECG metrics.

      (4) Open-source code for the pipeline and analyses enhances reproducibility.

      Weaknesses:

      (1) The linear regression approach, while useful, may not fully address collinearity among parameters (e.g., cardiac size, torso volume, heart position). Although left ventricular mass or cavity volume was selected to mitigate collinearity, other parameters (e.g., heart center coordinates) could still introduce bias.

      (2) The study attributes residual ECG differences to sex/MI status after controlling for anatomical variables. However, regression model errors could distort these estimates. A rigorous evaluation of potential deviations (e.g., variance inflation factors or alternative methods like ridge regression) would strengthen the conclusions.

      (3) The manuscript's highly quantitative presentation may hinder readability. Simplifying technical descriptions and improving figure clarity (e.g., separating superimposed bar plots in Figures 2-4) would aid comprehension.

      (4) Given established sex differences in QTc intervals, applying the same analytical framework to explore QTc's dependence on sex and anatomy could have provided additional clinically relevant insights.

      We thank Reviewer 1 for their kind and constructive comments. While we have thoroughly addressed all specific recommendations below, in brief, we have added new analysis of the variance inflation factor in Supplementary Tables 2 and 3 to reassure readers that the chosen parameter sets exhibit low levels of collinearity, and provided more explanation for why the relative positional parameters were chosen to avoid this issue. We have added explanatory figures for all positional and orientational parameters to improve understanding of the technical details, and improved clarity of existing figures as detailed below. We welcome the suggestion to add QT interval to the manuscript – whilst this was only available in the UK Biobank for a single lead, we have included an analysis of both QT and QTc intervals in this lead to Page 10, and added some discussion of this to the second full paragraph of Page 14.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Comment 1: “Collinearity and Regression Analysis: It would be valuable to assess the collinearity among the regressed parameters (e.g., cardiac size, torso volume, heart center positions [x, y, z], and cardiac orientation angles) and evaluate whether alternative regression methods (e.g., ridge regression) might improve robustness. Additionally, cardiac digital twinning with electrophysiological models could help isolate the exact contribution of electrophysiology while enabling sensitivity analysis. Nonlinear regression or machine learning approaches might also enhance the predictive power of the analysis.”

      We thank the reviewer for drawing attention to the important issue of collinearity in the parameter sets used in the regression analysis. To address this, we have added Supplementary Tables 2 and 3, which detail the variance inflation factors for each of the parameter sets used. This was considered in the selection of anatomical parameters – e.g. using relative position not absolute distances between landmarks, which would be more collinear. As these are all below a value of 3.4, we believe that the effect of collinearity is limited, and thus to reduce subjectivity of parameter selection in more complex methods, and encourage interpretability, we have retained our linear regression analysis. In addition, we have added an explanation to the second full paragraph on Page 6 of how we calculated the relative, rather than absolute position of the cardiac centre partially to avoid the problem of collinearity when using multiple absolute distances. We concur that modelling and simulation techniques are well suited to explore the electrophysiological component further – as this is out of the scope of this work, we have addressed the role of these methods in future work in the final paragraph of Page 16.
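For readers unfamiliar with the variance inflation factor (VIF) mentioned above, the sketch below shows one standard way to compute it: the VIF of predictor j equals the j-th diagonal element of the inverse of the predictors' correlation matrix, which is equivalent to 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the others. This is not the authors' code. The two example predictors are made-up values rather than UK Biobank data, and the function names are hypothetical.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length predictor columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [sum(v * v for v in col) ** 0.5 for col in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Made-up predictors with a pairwise correlation of 0.8.
predictor_a = [1.0, 2.0, 3.0, 4.0, 5.0]
predictor_b = [2.0, 1.0, 4.0, 3.0, 5.0]
vifs = variance_inflation_factors([predictor_a, predictor_b])
print([round(v, 3) for v in vifs])
```

With only two predictors, both VIFs reduce to 1 / (1 − r²), which gives a quick sanity check on the implementation.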

      Comment 2: “Figure Clarity (Bar Plots): The superimposed bar plots in Figures 2-4 are difficult to interpret; separating the bars for each coefficient would improve readability.”

      We accept that the stacked bar plots could be improved in their clarity. Whilst plotting each anatomical parameter separately multiplies the number of plots by a factor of nine, and makes comparison between parameters more difficult, we have added clear horizontal grid lines in order to make values easier to read and interpret.

      Comment 3: “Feature Extraction Visualization: A schematic figure illustrating the steps for measuring heart positional parameters (e.g., with example annotations) would help readers better understand the feature extraction methodology.”

      We agree with the reviewer that the calculation of positional and orientational parameters is crucial to illustrate clearly. We have included additional Supplementary Figures 2 and 3 to better convey these parameters.

      Reviewer #2 (Public review):

      Summary:

      Missed diagnosis of myocardial ischemia (MI) is more common in women, and treatment is typically less aggressive. This diagnosis stems from the fact that women's ECGs commonly exhibit 12 lead ECG biomarkers that are less likely to fall within the traditional diagnostic criteria. Namely, women have shorter QRS durations and lower ST junction and T wave amplitudes, but longer QT intervals, than men. To study the impact, this study aims to quantify sex differences in heart-torso anatomy and ECG biomarkers, as well as their relative associations, in both pre- and post-MI populations. A novel computational pipeline was constructed to generate torso-ventricular geometries from cardiac magnetic resonance imaging. The pipeline was used to build models for 425 post-myocardial infarction subjects and 1051 healthy controls from UK Biobank clinical images to generate the population.

      Strengths:

      This study has a strength in that it utilizes a large patient population from the UK Biobank (425 post-MI and 1051 healthy controls) to analyze sex-based differences. The computational pipeline is state-of-the-art for constructing torso-ventricular geometries from cardiac MR and is clinically viable. It draws on novel machine learning techniques for segmentation, contour extraction, and shape modeling. This pipeline is publicly available and can help in the large-scale generation of anatomies for other studies. This allows computation of various anatomical factors (torso volume, cavity volume, etc), and subsequent regression analysis on how these factors are altered before and after MI from the 12-lead ECG.

      Weaknesses:

      Major weaknesses stem from the fact that, while electrophysiological factors appear to play a role across many leads, both post-MI and healthy, the electrophysiological factors are not stated or discussed. The computational modeling pipeline is validated for reconstructing torso contours; however, potential registration errors stemming from ventricular-torso construction are not addressed within the context of anatomical factors, such as the tilt and rotation of the heart. This should be discussed as the paper's claims are based on these results. Further analysis and explanation are needed to understand how these sex-specific results impact the ECG-based diagnosis of MI in men and women, as stated as the primary reason for the study at the beginning of the paper. This would provide a broader impact within the clinical community. Claims about demographics do not appear to be supported within the main manuscript but are provided in the supplements. Reformatting the paper's structure is required to efficiently and effectively present and support the findings and outcomes of this work.

      We thank Reviewer 2 for their considered and detailed feedback. We greatly appreciate the invitation to elaborate on the electrophysiological factors, and we have added discussion of this matter to the second and third full paragraphs on Page 14, extending to Page 15 and first full paragraph on Page 15, and highlighted the role of modelling and simulation in future work on the third full paragraph of Page 16. We agree that registration errors are one reason behind remaining reconstruction errors and feel a strength of our study is that the large number of subjects used aided in reducing the effect of this noise, and have updated the second full paragraph of Page 16 to reflect this. We are wary of moving too many supplemental figures and tables describing demographic trends to the main manuscript for fear of diluting the specific answers to our research questions. We have however actioned the suggestions as detailed below to reformat the paper, including redressing the balance of supplemental versus main methodological sections, and thank the reviewer for their guidance in increasing our clarity.

      Reviewer #2 (Recommendations for the authors):

      (1) Please detail what "chosen to be representative of the underlying dataset" means in terms of a validation dataset.

      We thank the reviewer for addressing the lack of clarity in this matter. We have added a reference in the third full paragraph on Page 6 to Supplementary Appendix 1.1, where we have included full details of the selection criteria.

      (2) “Current guidelines ... further research [16]." The paragraph should begin with a broader statement that is relevant to the fact that the entire body of work focuses on ECG-based diagnosis differences in women, rather than LVEF through echocardiography.

      We have revised the introduction to Paragraph 3 on Page 3 to clarify our motivation for focusing on the ECG in order to shape proposals for novel ECG-based risk stratification tools.

      (3) The last paragraph of the introduction should more clearly state what was performed and how you aim to prove your hypothesis. There is no mention of the data, the regression model, or other key aspects important to the reader.

      We have added methodological details to Paragraph 5 on Page 3 in order to clarify our approach in testing our hypothesis.

      (4) An overview paragraph should be included in the Methods at the beginning.

      We thank the reviewer for this valuable suggestion – we have added an overview paragraph to the start of the methodology section on Page 5.

      (5) The computational pipeline portion of the methods should be written in full paragraphs instead of almost a bulleted list. In general, more details from the supplement should be provided in the methods.

      We thank the reviewer for raising important points concerning the balance of methodological description in the main manuscript and the supplementary materials. We have added detailed description of the reconstruction pipeline to Pages 5 and 6. We feel that the ordered format of the methods section adds to the reproducibility and transparency of our methodology.

      (6) The torso reconstruction method was already validated in Smith et al. [29]. What value does your additional validation bring to this methodology? Furthermore, how does the construction of the ventricular-torso reconstructions using the cardiac axes (not just the torso contours) influence ECG metrics?

      We apologise that this was not clear – we have clarified in Paragraph 4 on Page 5 that while Smith et al. 2022 provided a detailed validation to the contour extraction networks, it did not validate the torso reconstruction pipeline, as it only presents the reconstruction of two cases as a proof of concept. We have also expanded the second full paragraph on Page 6 to explain that the sparse (but not dense) cardiac anatomies were constructed in order to calculate the cardiac size, which we found was a key factor moderating many ECG biomarkers. We also specified that the cardiac position and orientation were necessary in order to relate these to the torso axes and positions of the ECG electrodes.

      (7) Include the details of the regression analysis in the main body of the methods for the readers. This is crucial to the claims and outcomes of the paper. Only a sentence is included in the results and one in the figure: "Each factor's contribution is calculated from the product of the regression coefficients and anatomical sex differences (Supplementary Appendix 1.5)." What specific contributions can I expect to see in the results figures? The results are filled with methodological aspects that should be in the methods.

      We thank the reviewer again for this important comment regarding the balance of the main text methodology and supplementary methodology sections. We have added detail to the statistical analysis section of the main text on Pages 7 and 8 in order for the reader to understand the following results section without consulting the supplemental methods. We have also removed these details from the results section.

      (8) What is "the remaining estimated effect of electrophysiology". Did you do simulations on the electrophysiology, or how is this computed from the clinical data of patients? More explanation is needed, as without this, the paper is just focusing on anatomy.

      We have clarified this important point by moving the explanation of the methodology underpinning our estimation of the electrophysiological contributions using the clinical ECGs from the supplementary methods to the main manuscript, in the second full paragraph on Page 7 and continuing to Page 8. We have also specified the role of simulation studies in future work in the final paragraph on Page 16.
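One possible reading of the decomposition discussed in comments (7) and (8) can be sketched as follows: each anatomical factor's contribution is the product of its regression coefficient and the between-sex difference in that factor, and whatever part of the total sex difference is left over is treated as the estimated non-anatomical (electrophysiological) remainder. This is an illustrative sketch under stated assumptions, not the authors' implementation: the sex indicator in the regression, the function name, and the synthetic data are all hypothetical.

```python
import numpy as np

def decompose_sex_difference(X, y, is_female):
    """Split the female-minus-male difference in y into per-factor
    anatomical contributions and an unexplained remainder."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    is_female = np.asarray(is_female, dtype=bool)
    # Assumed model: intercept + anatomical factors + sex indicator.
    A = np.column_stack([np.ones(len(y)), X, is_female.astype(float)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    betas = coef[1:-1]
    # Between-sex difference in each anatomical factor.
    anat_diff = X[is_female].mean(axis=0) - X[~is_female].mean(axis=0)
    contributions = betas * anat_diff
    total_diff = y[is_female].mean() - y[~is_female].mean()
    remainder = total_diff - contributions.sum()
    return contributions, remainder

# Synthetic illustration only: x1 differs between groups, x2 does not.
rng = np.random.default_rng(1)
n = 2000
is_female = rng.random(n) < 0.5
x1 = rng.normal(size=n) - 1.0 * is_female
x2 = rng.normal(size=n)
y = 2.0 * x1 + 5.0 * is_female + rng.normal(size=n)

contributions, remainder = decompose_sex_difference(
    np.column_stack([x1, x2]), y, is_female
)
print(contributions, remainder)
```

With an intercept and a sex indicator in an ordinary least-squares fit, the remainder coincides with the fitted sex coefficient, which is why the sketch includes that term rather than fitting anatomy alone.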

      (9) Include an overview paragraph of the methods to create more structure.

      We thank the reviewer again for the further attention to this issue – as previously, we have added an overview paragraph to the methodology section on Page 5.

      (10) Only 19.8% of the patients were female, which is probably due to females having a more severe presentation of the disease. How does this impact, bias, or skew your results?

      This comment raises a very interesting point. The origin of this imbalance is of course multifactorial: women likely do have lower rates of MI events due to the cardioprotective role of estrogen and different health-promoting behaviours, and our sex imbalance was reflective of wider trends in MI diagnosis. However, as mentioned in Paragraph 2 on Page 3 of the text, there are more missed MI diagnoses in women, and we agree that this may lead to a more severe presentation of female MI pathophysiology. We have expanded the first full paragraph on Page 16 to specify the ECG and demographic impacts that this has on our results, and that it is a strength of this work that we may contribute to future adjustment of the diagnostic criteria, such that future investigations do not have this bias, and that clinical outcomes are improved.

      (11) A lot of extra information is provided in Tables 1 and 2. Include additional information in the supplements that is not directly relevant to your findings.

      We agree that Table 2 is supplementary, rather than critical information, and have moved it accordingly to the Supplementary Materials on Page 38. We do believe that Table 1 is central for understanding the extracted dataset.

      (12) Combine paragraphs 3 and 4 into a single paragraph. "Current guidelines..." and "T wave amplitude...". They are part of a single coherent concept.

      We have removed the paragraph break on Page 3 Paragraph 3.

      (13) Check all acronyms throughout the paper. The abbreviation for sudden cardiac death (SCD) is only used once in the same paragraph. Remove the acronym and type it out. T-wave amplitude (TWA) is introduced twice in a Figure caption and not introduced until the methods.

      Many thanks for this suggestion – we have reviewed all acronyms in the manuscript.

      (14) "Figure 1B showcases the capability of the computational pipeline to extract torso contours and reconstruct them into 3D meshes". Isn't this Figure 1A?

      We apologise that this was unclear, and have updated the sentence on the first full paragraph of Page 8 to clarify the purpose of Figure 1B.

      (15) No need to state: "Female y-axis limits have been adjusted by the difference in healthy QRS duration between sexes for ease of comparison" in the Figure 2 caption.

      We have removed this statement on all relevant captions.

      (16) The paragraph "For lead V6, 15.9% of healthy subjects..." can be combined with the previous section.

      We have removed this paragraph break on Page 9 to improve readability.

      (17) The only demographics I could find were age and BMI. State which demographics you used explicitly. This is especially true when the discussion makes claims like "Our findings suggest that corrected QRS duration taking into consideration demographics...". How did you take them into account?

      We accept that our previous description of the demographic adjustment to QRS duration in the discussion did not adequately reflect the comprehensiveness of our approach, and have adjusted the second paragraph on Page 14 to rectify this.

      (18) The results section is also almost a bulleted list that should be written and reformatted into paragraphs.

      The ordered style of our results section was designed to compare how our obtained data answers our research question differently for ECG intervals, amplitudes, and axis angles. Whilst we have adjusted paragraph breaks and moved methodological details to more appropriate sections, we have retained this stylistic choice.

      (19) The following sentence should be in the introduction: "Alterations to the polarity and amplitude of the T wave are used in the diagnosis of acute MI [42] and TWA affects proposed risk stratification tools, particularly markers of repolarization abnormalities [9, 43]."

      We thank the reviewer for this suggestion. We have included the discussion of how TWA is separately used in proposed risk stratification and current diagnostic tools in Paragraph 3 of Page 3.

    2. Reviewer #1 (Public review):

      Summary:

      The electrocardiogram (ECG) is routinely used to diagnose and assess cardiovascular risk. However, its interpretation can be complicated by sex-based and anatomical variations in heart and torso structure. To quantify these relationships, Dr. Smith and colleagues developed computational tools to automatically reconstruct 3D heart and torso anatomies from UK Biobank data. Their regression analysis identified key sex differences in anatomical parameters and their associations with ECG features, particularly post-myocardial infarction (MI). This work provides valuable quantitative insights into how sex and anatomy influence ECG metrics, potentially improving future ECG interpretation protocols by accounting for these factors.

      Strengths:

      • The study introduces an automated pipeline to reconstruct heart and torso anatomies from a large cohort (1,476 subjects, including healthy and post-MI individuals).

      • The 3-stage reconstruction achieved high accuracy (validated via Dice coefficient and error distances).

      • Extracted anatomical features enabled novel analyses of disease-dependent relationships between sex, anatomy, and ECG metrics.

      • Open-source code for the pipeline and analyses enhances reproducibility.

      Weaknesses:

      • The study attributes residual ECG differences to sex/MI status after controlling for anatomical variables. However, regression model errors could distort these estimates. A rigorous evaluation of potential deviations (e.g., variance inflation factors or alternative methods like ridge regression) would strengthen the conclusions.

    3. Reviewer #1 (Public review):

      Summary:

      The electrocardiogram (ECG) is routinely used to diagnose and assess cardiovascular risk. However, its interpretation can be complicated by sex-based and anatomical variations in heart and torso structure. To quantify these relationships, Dr. Smith and colleagues developed computational tools to automatically reconstruct 3D heart and torso anatomies from UK Biobank data. Their regression analysis identified key sex differences in anatomical parameters and their associations with ECG features, particularly post-myocardial infarction (MI). This work provides valuable quantitative insights into how sex and anatomy influence ECG metrics, potentially improving future ECG interpretation protocols by accounting for these factors.

      Strengths:

      (1) The study introduces an automated pipeline to reconstruct heart and torso anatomies from a large cohort (1,476 subjects, including healthy and post-MI individuals).

      (2) The 3-stage reconstruction achieved high accuracy (validated via Dice coefficient and error distances).

      (3) Extracted anatomical features enabled novel analyses of disease-dependent relationships between sex, anatomy, and ECG metrics.

      (4) Open-source code for the pipeline and analyses enhances reproducibility.

      Weaknesses:

      (1) The linear regression approach, while useful, may not fully address collinearity among parameters (e.g., cardiac size, torso volume, heart position). Although left ventricular mass or cavity volume was selected to mitigate collinearity, other parameters (e.g., heart center coordinates) could still introduce bias.

      (2) The study attributes residual ECG differences to sex/MI status after controlling for anatomical variables. However, regression model errors could distort these estimates. A rigorous evaluation of potential deviations (e.g., variance inflation factors or alternative methods like ridge regression) would strengthen the conclusions.

      (3) The manuscript's highly quantitative presentation may hinder readability. Simplifying technical descriptions and improving figure clarity (e.g., separating superimposed bar plots in Figures 2-4) would aid comprehension.

      (4) Given established sex differences in QTc intervals, applying the same analytical framework to explore QTc's dependence on sex and anatomy could have provided additional clinically relevant insights.

    1. Reviewer #1 (Public review):

      The manuscript "Heterozygote advantage cannot explain MHC diversity, but MHC diversity can explain heterozygote advantage" explores two topics. First, it is claimed that the conclusion recently published by Mattias Siljestam and Claus Rueffler (in the following referred to as [SR] for brevity), that heterozygote advantage explains MHC diversity, does not withstand even a very slight change in ecological parameters. Second, a modified model that allows an expansion of the MHC gene family shows that homozygotes outperform heterozygotes. This is an important topic and could be of potential interest to the readership of eLife if the conclusions are valid and non-trivial.

      The resubmitted manuscript addresses several questions from my previous review. In particular, there is a more detailed description of how the code of Siljestam and Rueffler ([SR]) was used for the simulations and the calculation of the factor 2.7 x 10^43 that is the key to the alleged breakdown of the numerical reasoning presented in [SR].

      Yet I think that important aspects of my critique of the first statement of the manuscript about the flaws of [SR] model remain unanswered. I guess the discussion becomes rather general about the universality and robustness of various types of models to parameter changes. My point is that none of the models is totally universal. The model in [SR] is not phenomenological as none of the parameters or functional forms were derived empirically. Instead, it is a proof of principle demonstration that inevitably grossly simplifies the actual immune response. The choice of constants and functions used in Eqs. (1-5) is dictated by mathematical convenience and works in a limited range of parameter values. It is shown in [SR] that for 3 pathogens and reasonable "virulence" \nu, the alleles branch. These conclusions are supported by the analytically derived Adaptive Dynamics branching criteria (7), which, contrary to the statement in the cover letter ("It is clear from Fig. 4 of Siljestam and Rueffler that the branching condition is far from sufficient for high MHC diversity.") is perfectly confirmed by the simulation data shown in Fig. 4.

      The mathematical simplicity of the [SR] model generates various artifacts, such as the reduction of the "condition" by an enormous factor 2.7 x 10^43 mentioned by the Author, and the resulting decrease in the "survival" induced by the addition of a new pathogen. This occurs at the very large value of \nu=20, whose effect is enormous due to the Gaussian form of (1), which, once again, was chosen for mathematical convenience. In reality, a new pathogen cannot reduce the "survival" by such a factor as it would wipe out any resident population. So to compensate for such an artifact, the additional factor c_max was introduced to buffer such an excess. There is no reason to fix c_max once for an arbitrary number of pathogens, because varying c_max basically reflects the observation that a well-adapted individual must have a reasonable survival probability. At the same time, there are many ways in which the numerical simulation may break down when the survival rates become of the order of 10^(-43) instead of one, so it comes as no surprise that the diversification, predicted by the adaptive dynamics, does not readily occur in the scenario with an addition or removal of the 8th pathogen with a very high virulence \nu=20.

      I have doubts that the reported breakdown of the [SR] model with fixed c_max remains observable with less extreme values of m and \nu (say, for \nu=7 and m=3 plus or minus 1 used in Fig. 3 in the manuscript).

      So I still find the claim that "the phenomenon that leads to high diversity in the simulations of Siljestam and Rueffler depends on finely tuned parameter values" is not well substantiated.

    2. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1:

      Yet I think that important aspects of my critique of the first statement of the manuscript about the flaws of [SR] model remain unanswered.

      I believe that I have fully addressed the points in the earlier review. The reviewer had doubted that my results were correct, attributing them to “a poor setup of the model” on my part. The reviewer stated that if I were correct about the factor of >10<sup>43</sup> change in cmax, this would “naturally break down all the estimates and conclusions made in Siljestam and Rueffler” (S&R).

      It appears that the reviewer is now convinced that my results represent a faithful analysis of the models on which S&R based their claims. The reviewer now contends that these results, including the factor of >10<sup>43</sup>, present no difficulties for the claims of S&R after all. In fact, this enormous factor of >10<sup>43</sup> is now claimed to support the conclusions of S&R by invalidating my conclusions. I respond to these new and very different arguments in what follows.

      As I stated in the first round of review, the issue is not the enormity of this factor per se, but the fact that the compensatory adjustment of cmax conceals the true effects of changes in other parameters. These effects are large; small changes to the parameter values mostly eliminate the diversity that the model is claimed to explain.

      The model in [SR] is not phenomenological as none of the parameters or functional forms were derived empirically. Instead, it is a proof of principle demonstration that inevitably grossly simplifies the actual immune response.

      The hidden sensitivity of the results of S&R to parameter values is sufficient to invalidate them as a proof of principle. The manuscript goes further and explains how the problem "is not specific to the details of the models of Siljestam and Rueffler, but is inherent in the phenomenon invoked to allow high diversity" because "any change that affects condition by as much as the difference between MHC heterozygotes and homozygotes will eliminate high equilibrium diversity". This general principle addresses all of the reviewer's points.

      In reality, a new pathogen cannot reduce the "survival" by such a factor as it would wipe out any resident population. So to compensate for such an artifact, the additional factor cmax was introduced to buffer such an excess. There is no reason to fix cmax once for an arbitrary number of pathogens, because varying cmax basically reflects the observation that a well-adapted individual must have a reasonable survival probability.

      This is not a legitimate reason for making compensatory, diversity-promoting adjustments to cmax when evaluating sensitivity to other parameters. If the number of pathogens or their virulence changes, cmax obviously does not automatically change along with it. If the population or species consequently goes extinct, then it goes extinct. If it persists, it does so with the same value of cmax.

      The possibility of extinction arguably puts a minimum value on cmax, but it does not restrict it to a range of values that conveniently leads to high MHC diversity. In the examples that I analyzed, slightly decreasing the number of pathogens or their virulence, which increases survivability, eliminates diversity. This phenomenon obviously cannot be dismissed on the grounds that survivability would be too low for the species to exist.

      S&R in effect assume that the condition of the most fit homozygote remains fixed, regardless of the number of pathogens, their virulence, and myriad other differences between species. It is this assumption that is without justification.

      At the same time, there are many ways in which the numerical simulation may break down when the survival rates become of the order of 10^(-43) instead of one

      I am not sure what is meant by “the numerical simulation may break down”. Numerical error is not a tenable explanation of the lack of diversity observed in that simulation. The outcome is exactly what is expected from purely theoretical considerations: conditions of all genotypes fall on the steep part of the curve, making the mechanism proposed by S&R largely inoperative, so a pair of alleles forming a fit heterozygote comes to predominate. The numerical simulation is actually superfluous.

      Low survival rates are completely irrelevant to the effect of decreasing the number of pathogens or their virulence, which does not lower survival rates, but does eliminate diversity.

      so it comes as no surprise that the diversification, predicted by the adaptive dynamics, does not readily occur in the scenario with an addition or removal of the 8th pathogen with a very high virulence \nu=20.

      Whether or not it is surprising, the lack of diversity is a problem for the claims of S&R, as there is no reason to expect the number of pathogens to have just the right value to produce high diversity. Furthermore, for many combinations of values of the other parameters (e.g., my v=19.5 and 20.5 examples), no number of pathogens leads to high diversity.

      Again, the general principle mentioned above makes the details that the reviewer refers to irrelevant. Nonetheless, some additional remarks are in order:

      (1) This comment ignores the fact that removal of a pathogen, or a slight decrease in “virulence”, eliminates diversity without lowering survival rates.

      (2) Small increases or decreases in v (virulence) eliminate diversity without having such large effects on condition.

      (3) In the example emphasized by the reviewer, mean survival rates are nowhere near as low as 10<sup>-43</sup>. Only homozygotes have such low fitness.

      (4) The adaptive dynamics predict the low diversity seen in the simulations, contrary to what the reviewer seems to suggest. Elimination of diversity is not an artifact of the simulation.

      (5) v=20 was chosen because it is most favorable to the model of S&R in that it yields the highest diversity. Indeed, S&R only observed realistically high diversity with the narrow gaussians that the reviewer objects to. With lower values of v, diversity is much lower, but even this meager diversity is eliminated by small changes in parameter values (see below). If narrow gaussians and large effects of pathogens somehow invalidate results, then they invalidate the high-diversity results of S&R.

      I have doubts that the reported breakdown of the [SR] model with fixed cmax remains observable with less extreme values of m and \nu (say, for \nu=7 and m=3 plus or minus 1 used in Fig. 3 in the manuscript).

      These doubts are unwarranted. With the suggested parameter values, for example, increasing or decreasing m by 1 reduces the effective number of alleles to around 1 or 2. This can easily be checked using the simulation code of S&R, as detailed in my initial response and now in a Supplementary Text. Even without this result, the general principle mentioned above tells us that considering other regions of parameter space cannot rescue the conclusions of S&R.
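For readers unfamiliar with the term, the effective number of alleles can be illustrated with a short sketch. It assumes the standard population-genetic definition, the inverse of the sum of squared allele frequencies; whether the simulation code of S&R uses exactly this definition is an assumption here, and the function name is hypothetical.

```python
def effective_number_of_alleles(freqs):
    """Inverse of the sum of squared allele frequencies (assumed definition)."""
    total = sum(freqs)
    # Normalise so the input may be given as counts or as frequencies.
    return 1.0 / sum((f / total) ** 2 for f in freqs)

# Four equally common alleles versus one dominant allele with two rare ones.
even = effective_number_of_alleles([0.25, 0.25, 0.25, 0.25])
skewed = effective_number_of_alleles([0.98, 0.01, 0.01])
print(even, skewed)
```

Under this definition, an effective number "around 1 or 2" means that one allele or one pair of alleles predominates, even if further rare alleles are present.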

      So I still find the claim that "the phenomenon that leads to high diversity in the simulations of Siljestam and Rueffler depends on finely tuned parameter values" is not well substantiated.

      What is unsubstantiated is the claim of S&R that “For a large part of the parameter space, more than 100 and up to over 200 alleles can emerge and coexist”. As my manuscript illustrates, this is an illusion created by the adjustment of one parameter to compensate for changes in others.

      The reviewer even acknowledges that “the choice of constants and functions...works in a limited range of parameter values”. Furthermore, the manuscript explains why this problem is inherent to the general phenomenon, not specific to the details of the model or parameter values.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      It appears obvious that with no or a little fitness penalty, it becomes beneficial to have MHC-coding genes specific to each pathogen. A more thorough study that takes into account a realistic (most probably non-linear in gene number) fitness penalty, various numbers of pathogens that could grossly exceed the self-consistent fitness limit on the number of MHC genes, etc, could be more informative.

      The reviewer seems to be referring to the cost of excessively high presentation breadth. Such a cost is irrelevant to the inferior fitness of a polymorphic population with heterozygote advantage compared to a monomorphic population with merely doubled gene copy number. It is relevant to the possibility of a fitness valley separating these two states, but this issue is addressed explicitly in the manuscript.

      An addition or removal of one of the pathogens is reported to affect "the maximum condition", a key ecological characteristic of the model, by an enormous factor 10^43, naturally breaking down all the estimates and conclusions made in [SR]. This observation is not substantiated by any formulas, recipes for how to compute this number numerically, or other details, and is presented just as a self-standing number in the text.

      It is encouraging that the reviewer agrees that this observation, if correct, would cast doubt on the conclusions of Siljestam and Rueffler. I would add that it is not the enormity of this factor per se that invalidates those conclusions, but the fact that the automatic compensatory adjustment of c<sub>max</sub> conceals the true effects of removing a pathogen, which are quite large.

      I am not sure why the reviewer doubts that this observation is correct. The factor of 2.7∙10<sup>43</sup> was determined in a straightforward manner in the course of simulating the symmetric Gaussian model of Siljestam and Rueffler with the specified parameter values. A simple way to determine this number is to have the simulation code print the value to which c<sub>max</sub> is set, or would be set, by the procedure of Siljestam and Rueffler for different parameter values. I have in this way confirmed this factor using the simulation code written and used by Siljestam and Rueffler. A procedure for doing so is described in the new Supplementary Text S1. In addition, I now give a theoretical derivation of this factor in Supplementary Text S2.

      This begs the conclusion that the branching remains robust to changes in cmax that span 4 decades as well.

      That shows at most that the results are not extremely sensitive to c<sub>max</sub> or K. They are, nonetheless, exquisitely sensitive to m and v. This difference in sensitivities is the reason that a relatively small change to m leads to such a large compensatory change in c<sub>max</sub>. It is evident from Fig. 4 of Siljestam and Rueffler that the level of diversity is not robust to these very large changes in c<sub>max</sub>, which include, as noted above, a change of over 43 orders of magnitude.

      As I wrote above, there is no explanation behind this number, so I can only guess that such a number is created by the removal or addition of a pathogen that is very far away from the other pathogens. Very far in this context means being separated in the x-space by a much greater distance than 1/\nu, the width of the pathogens' gaussians. Once again, I am not totally sure if this was the case, but if it were, some basic notions of how models are set up were broken. It appears very strange that nothing is said in the manuscript about the spatial distribution of the pathogens, which is crucial to their effects on the condition c.

      I did not explicitly describe the distribution of pathogens in antigenic space because it is exactly the same as in Siljestam and Rueffler, Fig. 4: the vertices of a regular simplex, centered at the origin, with unity edge length.

      The number in question (2.7∙10<sup>43</sup>) pertains to the Gaussian model with v = 20. As specified by Siljestam and Rueffler, each pathogen lies at a distance of 1 from every other pathogen, so the distance of any pathogen from the others is indeed much greater than 1/v. This condition holds, however, for most of the parameter space explored by Siljestam and Rueffler (their Fig. 4), and for all of the parameter space that seemingly supports their conclusions. Thus, if this condition indicates that “basic notions of how models are set up were broken”, they must have been broken by Siljestam and Rueffler.
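
      The geometry described above can be checked numerically. The following is a minimal sketch, not the simulation code of Siljestam and Rueffler: the function names are hypothetical, the pathogen count m = 8 is purely illustrative, and the vertices are embedded in m dimensions (scaled basis vectors shifted to the centroid) as one convenient construction among several.

```python
import math
from itertools import combinations


def regular_simplex(m, edge=1.0):
    """Return m vertices of a regular simplex centered at the origin.

    Assumption: vertices are embedded in R^m, where they span an
    (m - 1)-dimensional hyperplane. Scaled basis vectors are `edge` apart.
    """
    scale = edge / math.sqrt(2.0)
    centroid = scale / m  # every coordinate of the centroid before centering
    return [
        [(scale if i == j else 0.0) - centroid for j in range(m)]
        for i in range(m)
    ]


def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


m, v = 8, 20  # m is illustrative; v = 20 is the value discussed above
pathogens = regular_simplex(m)
pairwise = [distance(p, q) for p, q in combinations(pathogens, 2)]
print(min(pairwise), max(pairwise))  # both approximately 1 (unit edge length)
print(1.0 / v)                       # Gaussian width scale, far smaller than 1
```

      Every pairwise distance equals the edge length, so with v = 20 each pathogen sits about 20 widths away from all the others.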

      ...the branching condition appears to be pretty robust with respect to reasonable changes in parameters.

      It is clear from Fig. 4 of Siljestam and Rueffler that the branching condition is far from sufficient for high MHC diversity.

      Overall, I strongly suspect that an unfortunately poor setup of the model reported in the manuscript has led to the conclusions that dispute the much better-substantiated claims made in [SD].

      The reviewer seems to be suggesting that my simulations are somehow flawed and my conclusions unreliable. I have addressed the reasons for this suggestion above. Furthermore, I have confirmed the main conclusion, the extreme sensitivity of the results of Siljestam and Rueffler to parameter values, using the code that they used for their simulations, indicating that my conclusions are not consequences of my having done a “poor setup of the model”. I now describe, in Supplementary Text S1, how anybody can verify my conclusions in this way.

      Reviewer #2 (Public review):

      (1) The statement that the model outcome of Siljestam and Rueffler is very sensitive to parameter values is, in this form, not correct. The sensitivity is only visible once a strong assumption by Siljestam and Rueffler is removed. This assumption is questionable, and it is well explained in the manuscript by J. Cherry why it should not be used. This may be seen as a subtle difference, but I think it is important to pin down the exact nature of the problem (see, for example, the abstract, where this is presented in a misleading way).

      I appreciate the distinction, and the importance of clearly specifying the nature of the problem. However, as I understand it, Siljestam and Rueffler do not invoke the implausible assumption that changes to the number of pathogens or their virulence will be accompanied by compensatory changes to c<sub>max</sub>. Rather, they describe the adjustment of c<sub>max</sub> (Appendix 7) as a “helpful” standardization that applies “without loss of generality”. Indeed, my low-diversity results could be obtained, despite such adjustment, by combining the small change to m or v with a very large change to K (e.g., a factor of 2.7∙10<sup>43</sup>). In this sense there is no loss of generality, but the automatic adjustment of c<sub>max</sub> obscures the extreme sensitivity of the results to m and v.

      (2) The title of the study is very catchy, but it needs to be explained better in the text.

      I have expanded the end of the Discussion in the hope of clarifying the point expressed by the title.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I would like to suggest to the author that they provide essential details about their simulations that would justify their claims, and to communicate with Mattias Siljestam and Claus Rueffler whether claims of the lack of robustness could be confirmed.

      The models simulated were modified versions of those of Siljestam and Rueffler. Thus, only the modifications were described in my manuscript. I have added a more detailed description of how c<sub>max</sub> was set in the simulations concerned with sensitivity to parameter values. In addition, the new Supplementary Text S1, which describes confirmation of the lack of robustness using the code of Siljestam and Rueffler, should remove any doubt about this conclusion.

      Reviewer #2 (Recommendations for the authors):

      I have no further recommendations. The manuscript is well written and clear.

      Thank you.

      Reviewer #3 (Recommendations for the authors):

      (1) Since this is a full report and not just a letter to the editor, it would benefit from a bit more introduction of what the MHC actually is and what the current understanding of its evolution is. Currently, it assumes a lot of knowledge about these genes that might not be available to every reader of eLife.

      I have added some more information to the opening paragraph. I would also note that this report was submitted as a “Research Advance”, which may only need “minimal introductory material”.

      (2) Some more recent literature on MHC evolution should be added, e.g., the review by Radwan et al. 2020 TiG, a concrete case of MHC heterozygote advantage by Arora et al. 2020 MolBiolEvol, and a simulation of MHC CNV evolution by Bentkowski et al. 2019 PLOSCompBiol.

      I have cited some additional literature.

      (3) Since much of the criticism hinges on the cmax parameter, its biological meaning or role (or the lack thereof) could be discussed more.

      I am not sure what I can add to what is in the first paragraph of the Discussion.

      (4) I find it difficult to grasp how the v parameter, which is intended to define pathogen virulence, if I understand it correctly, can be used to amend the breadth of peptide presentation. Maybe this could be illustrated better.

      I have attempted to make this clearer. The parameter v actually controls the breadth of peptide detection conferred by an allele, which, if not identical to the breadth of presentation, is certainly affected by it. The basis of the “virulence” interpretation seems to be that narrower detection breadth can, according to the model, only decrease peptide detection probability, which increases the damage done by pathogens.

      (5) Please check sentences in lines 279ff on peptide detection and cost of . There seem to be words missing.

      There was an extraneous word, which I have removed. Thank you for pointing this out.

    1. From the formal technicalities of parameters to the warm glow of a sunset across a limpid pool, virtual water is both technology and story, code and emotion. Most of all, in a gallery setting you can appreciate the water itself, without the distraction of having somewhere else to be or someone else to kill. You may find your understanding of real, wet water changing too. When I returned to my hometown of Wellington, New Zealand for a visit after making the game, I sat by its harbor and stared into the water. I watched caustics I only knew the name of thanks to making a video game; I blinked into the dazzling reflections of the sun; I thought about the “processing power” of physics itself, capable of such refractive brilliance.

      I feel this is a you thing. Your thoughts as you created this... which, is a perpetuation of yourself. Your hegemony. Your visibility. One thing would be for this book to be anonymous and out there, but as it is, it's pretentious, self-confirming, an attempt to make the redundant heroic.

    2. Computation is a key part of how a game means what it means.

      It's a language that automates electrons. It's, in a way, instructions that perpetuate a set of inanimate inorganic workers in the form of small pulses of light. It's morse code, but for a complex transcription system that turns it into orders. It's yet a human-made way to organise stuff...

    Annotators

    1. Reviewer #2 (Public review):

      Summary:

      This paper presents a new approach for explicitly transforming B-cell receptor affinity into evolutionary fitness in the germinal center. It demonstrates the feasibility of using likelihood-free inference to study this problem and demonstrates how effective birth rates appear to vary with affinity in real-world data.

      Strengths:

      (1) The authors leverage the unique data they have generated for a separate project to provide novel insights into a fundamental question.

      (2) The paper is clearly written, with accessible methods and a straightforward discussion of the limits of this model.

      (3) Code and data are publicly available and well-documented.

      Weaknesses (minor):

      (1) Lines 444-446: I think that "affinity ceiling" and "fitness ceiling" should be considered independent concepts. The former, as the authors ably explain, is a physical limitation. This wouldn't necessarily correspond to a fitness ceiling, though, as Figure 7 shows. Conversely, the model developed here would allow for a fitness ceiling even if the physical limit doesn't exist.

      (2) Lines 566-569: I would like to see this caveat fleshed out more and perhaps mentioned earlier in the paper. While relative affinity is far more important, it is not at all clear to me that absolute affinity can be totally ignored in modeling GC behavior.

      (3) One other limitation that is worth mentioning, though beyond the scope of the current work to fully address: the evolution of the repertoire is also strongly shaped by competition from circulating antibodies. (Eg: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3600904/, http://www.sciencedirect.com/science/article/pii/S1931312820303978). This is irrelevant for the replay experiment modeled here, but still an important factor in general repertoires.

    2. Reviewer #2 (Public review):

      Summary:

      This paper presents a new approach for explicitly transforming B cell receptor affinity into evolutionary fitness in the germinal center. It demonstrates the feasibility of using likelihood-free inference to study this problem and demonstrates how effective birth rates appear to vary with affinity in real-world data.

      Strengths:

      • The authors leverage the unique data they have generated for a separate project to provide novel insights to a fundamental question.
      • The paper is clearly written, with accessible methods and straightforward discussion of the limits of this model.
      • Code and data are publicly available and well-documented.

      Weaknesses:

      • No substantial weaknesses noted.
    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper aims to characterize the relationship between affinity and fitness in the process of affinity maturation. To this end, the authors develop a model of germinal center reaction and a tailored statistical approach, building on recent advances in simulation-based inference. The potential impact of this work is hindered by the poor organization of the manuscript. In crucial sections, the writing style and notations are unclear and difficult to follow.

      We thank the reviewer for their kind words, and have endeavored to address all of their concerns as to the structure and style of the manuscript.

      Strengths:

      The model provides a framework for linking affinity measurements and sequence evolution and does so while accounting for the stochasticity inherent to the germinal center reaction. The model's sophistication comes at the cost of numerous parameters and leads to intractable likelihood, which are the primary challenges addressed by the authors. The approach to inference is innovative and relies on training a neural network on extensive simulations of trajectories from the model.

      Weaknesses:

      The text is challenging to follow. The descriptions of the model and the inference procedure are fragmented and repetitive. In the introduction and the methods section, the same information is often provided multiple times, at different levels of detail.

      Thank you for pointing this out. We have rearranged the methods in order to make the presentation more linear, and to reduce duplication with the introduction.

      Specifically, we moved the affinity definition to the start, removed the redundant bullet point list, and moved the parameter value table to the end.

      This organization sometimes requires the reader to move back and forth between subsections (there are multiple non-specific references to "above" and "below" in the text).

      This is a great point. We have either removed all references to "above" and "below" or replaced them with more specific citations.

      The choice of some parameter values in simulations appears arbitrary and would benefit from more extensive justification. It remains unclear how the "significant uncertainty" associated with these parameters affects the results of inference.

      We have clarified where various parameter values come from:

      “In addition to the four sigmoid parameters, which we infer directly, there are other parameters in Table 1 about which we have incomplete information. The carrying capacity method and the choice of sigmoid for the response function represent fundamental model assumptions. We also fix the death rate for nonfunctional (stop) sequences, which would be very difficult to infer with the present experiment. For others, we know precise values from the replay experiment for each GC (time to sampling, # sampled cells/GC), but use a somewhat wider range for the sake of generalizability. The mutability multiplier is a heuristic factor used to match the SHM distributions to data. The naive birth rate is determined by the sigmoid parameters, but has its own range in order to facilitate efficient simulation.

      For two of the three remaining parameters (carrying capacity and initial population), we can ostensibly choose values based on the replay experiment. These values carry significant uncertainty, however, partly due to inherent experimental uncertainty, but also because they may represent different biological quantities to those in simulation. For instance, an experimental measurement of the number of B cells in a germinal center might appear to correspond closely to simulation carrying capacity. However if germinal centers are not well mixed, such that competition occurs only among nearby cells, the "effective" carrying capacity that each cell experiences could be much smaller.

      Fortunately, in addition to the neural network inference of sigmoid parameters, we have another source of information that we can use to infer non-sigmoid parameters: summary statistic distributions. We can use the matching of these distributions to effectively fit values for these additional unknown parameters. We also include the final parameter, the functional death rate, in these non-sigmoid inferred parameters, although it is unconstrained by the replay experiment, and it is unclear whether it is uniquely identifiable.”

      In addition, the performance of the inference scheme on simulated data is difficult to evaluate, as the reported distributions of loss function values are not very informative.

      We thought of two different interpretations for this comment, so have worked to address both.

      First, the comment could have been that the distribution of loss functions on the training sample does not appear to be informative of performance on data-like samples. This is true, and in our revision we have emphasized the distinction between the two types of simulation sample: those for training, where each simulated GC has different (sampled) parameter values; vs the "data mimic" samples where all GCs have identical parameters. Since the former have different values for each GC, we can only plot many inferred curves together on the latter. We also would like to emphasize that the inference problem for one GC will have much more uncertainty than will that for an ensemble of GCs (as in the full replay experiment).

      “After building and training our neural network, we evaluate its performance on subsets of the training sample. While this evaluation provides an important baseline and sanity check, it is important to note that the training sample differs dramatically from real data (and the “data mimic” simulation sample that mimics real data). While real data consists of 119 GCs with identical parameters and thus response functions, we need the GCs in our training sample to span the space of all plausible parameter values. This means that while we must evaluate performance on individual GCs in the training and testing samples, in real data (and data mimic simulation) we combine results from 119 curves into a central (medoid) curve. Inference on the training sample will thus appear vastly noisier than on real data and data mimic simulation, and also cannot be plotted with all true and inferred curves together.”
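
      To make the notion of a central (medoid) curve concrete, here is a minimal sketch under stated assumptions: each curve is sampled on a shared grid, and the distance between two curves is the summed absolute difference. The distance metric and the function names are illustrative choices, not taken from the manuscript.

```python
def l1_distance(curve_a, curve_b):
    """Summed absolute difference between two curves on a shared grid (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b))


def medoid_index(curves):
    """Index of the curve with the smallest total distance to all other curves."""
    totals = [
        sum(l1_distance(curve, other) for other in curves)
        for curve in curves
    ]
    return min(range(len(curves)), key=totals.__getitem__)


# Three toy "inferred curves"; the middle one is the most central.
curves = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
print(medoid_index(curves))  # 1
```

      Unlike a pointwise mean, the medoid is always one of the actual inferred curves, so an outlier cannot pull it into a shape that no single inference produced.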

      A second interpretation was that the reviewer did not have an intuitive sense of what a loss function value of, say, 1.0 actually means. To address this second interpretation, we have also added a supplement to Figure 2 with several example true and inferred response functions from the training sample, with representative loss values spanning 0.17 to 2.18. We have also added the following clarification to the caption of Figure 1-figure supplement 2:

      “The loss value is thus the fraction of the area under the true curve represented by the area between the true and inferred curves.”
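
      The definition in this caption can be sketched numerically. The code below is an illustration only: it assumes both curves are sampled on the same grid, uses the absolute difference for the area between curves, and integrates with the trapezoid rule. The function names are hypothetical.

```python
def trapezoid(y, x):
    """Trapezoid-rule integral of samples y over grid x."""
    return sum(
        (y[i] + y[i + 1]) / 2.0 * (x[i + 1] - x[i])
        for i in range(len(x) - 1)
    )


def curve_loss(x, true_y, inferred_y):
    """Area between the two curves as a fraction of the area under the true curve.

    Assumes the area under the true curve is positive.
    """
    gap = [abs(t - f) for t, f in zip(true_y, inferred_y)]
    return trapezoid(gap, x) / trapezoid(true_y, x)


x = [0.0, 1.0, 2.0]
true_y = [1.0, 1.0, 1.0]
print(curve_loss(x, true_y, true_y))            # 0.0: perfect agreement
print(curve_loss(x, true_y, [1.5, 1.5, 1.5]))   # 0.5: gap is half the true area
```

      On this reading, a loss of 1.0 means the area separating the curves is as large as the area under the true curve itself.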

      Finally, the discussion of the similarities and differences with an alternative approach to this inference problem, presented in Dewitt et al. (2025), is incomplete.

      We have expanded this section of the manuscript, and added a new plot directly comparing the methods.

      “In order to compare more directly to DeWitt et al. 2025, we remade their Fig.S6D, truncating to values at which affinities are actually observed in the bulk data, and using only three of the seven timepoints (11, 20, and 70, Figure 8, left). We then simulated 25 GCs with central data mimic parameters out to 70 days. For each such GC, we found the time point with mean affinity over living cells closest to each of three specific “target” affinity values (0.1, 1.0, 2.0) corresponding to the mean affinity of the bulk data at timepoints 11, 20, and 70. We then plot the effective birth rates of all living cells vs relative affinity (subtracting mean affinity) at the resulting GC-specific timepoints for all 25 GCs together (Figure 8, right). Note that because each GC evolves at very different and time-dependent rates, we could not simply use the timepoints from the bulk data, since each GC slice from our simulation would then have very different mean affinity. The mean over GCs of these GC-specific chosen times is 10.9, 24.5, 44.4 (compared to the original bulk data time points 11, 20, 70). It is important to note that while the first two target affinities (0.1 and 1.0) are within the affinity ranges encountered in the extracted GC data, the third value (2.0) is far beyond them, and thus represents extrapolation to an affinity regime informed more by our underlying model than by the real data on which we fit it.”
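
      The per-GC time point selection described in this passage can be sketched as follows. This is an illustration, not the authors' code: the function name and the toy trajectory are hypothetical, and only the three target affinities (0.1, 1.0, 2.0) come from the text.

```python
def closest_timepoint(times, mean_affinities, target):
    """Return the time at which the mean affinity is closest to the target value."""
    best = min(
        range(len(times)),
        key=lambda i: abs(mean_affinities[i] - target),
    )
    return times[best]


# Toy trajectory for one simulated GC (made-up values, for illustration only).
times = [0, 10, 20, 30]
mean_affinities = [0.0, 0.12, 0.9, 2.1]

for target in (0.1, 1.0, 2.0):  # target affinities quoted in the text
    print(target, closest_timepoint(times, mean_affinities, target))
```

      Because each GC follows its own trajectory, the selected times differ between GCs, which is why the passage reports their mean rather than a single shared time point.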

      Reviewer #2 (Public review):

      Summary:

      This paper presents a new approach for explicitly transforming B-cell receptor affinity into evolutionary fitness in the germinal center. It demonstrates the feasibility of using likelihood-free inference to study this problem and demonstrates how effective birth rates appear to vary with affinity in real-world data.

      Strengths:

      (1) The authors leverage the unique data they have generated for a separate project to provide novel insights into a fundamental question. (2) The paper is clearly written, with accessible methods and a straightforward discussion of the limits of this model. (3) Code and data are publicly available and well documented.

      Weaknesses (minor):

      (1) Lines 444-446: I think that "affinity ceiling" and "fitness ceiling" should be considered independent concepts. The former, as the authors ably explain, is a physical limitation. This wouldn't necessarily correspond to a fitness ceiling, though, as Figure 7 shows. Conversely, the model developed here would allow for a fitness ceiling even if the physical limit doesn't exist.

      Right, whoops, good point. We've rearranged the discussion to separate the concepts, for instance:

      “While affinity and fitness ceilings are separate concepts, they are closely related. An affinity ceiling is a limit to affinity for a given antigen: there are no mutations that can improve affinity beyond this level. This would result in a truncated response function, undefined beyond the affinity ceiling. A fitness ceiling, on the other hand, is an upper asymptote on the response function. Such a ceiling would result in a limit on affinity for a germinal center reaction, since once cells are well into the upper asymptote of fitness they are no longer subject to selective pressure.”

      (2) Lines 566-569: I would like to see this caveat fleshed out more and perhaps mentioned earlier in the paper. While relative affinity is far more important, it is not at all clear to me that absolute affinity can be totally ignored in modeling GC behavior.

      This is a great point, we've added a mention of this where we introduce the replay experiment in the Methods:

      “It is important to note that this is a much lower level than typical BCR repertoires, which average roughly 5-10% nucleotide shm.”

      And expanded on the explanation in the Discussion:

      “Some aspects of behavior in the low-shm/early times regime of the extracted GC data are also potentially different to those at the higher shm levels and longer times found in typical repertoires. This is especially relevant to affinity or fitness ceilings, to which we likely have little sensitivity with the current data.”

      (3) One other limitation that is worth mentioning, though beyond the scope of the current work to fully address: the evolution of the repertoire is also strongly shaped by competition from circulating antibodies. (Eg: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3600904/, http://www.sciencedirect.com/science/article/pii/S1931312820303978). This is irrelevant for the replay experiment modeled here, but still an important factor in general repertoires.

      Yes good point, we've added these citations in a new paragraph on between-lineage competition:

      “We also neglect competition among lineages stemming from different rearrangement events (different clonal families), instead assuming that each GC is seeded with instances of only a single naive sequence, and that neither cells nor antibodies migrate between different GCs. More realistically for the polyclonal GC case, we would allow lineages stemming from different naive sequences to compete with each other both within and between GCs (Zhang et al. 2013; McNamara et al. 2020; Barbulescu et al. 2025). Implementing competition among several clonal families within a single GC would be conceptually simple and computationally practical in our current software framework. Competition among many GCs, however, would be computationally prohibitive because our time required is primarily determined by the total population size, since at each step we must iterate over every node and every event type in order to find the shortest waiting time. For the monoclonal replay experiment specifically, however, all naive sequences are the same and so the current modeling framework is sufficient.”
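
      The cost argument in this passage follows from the shape of the event selection step. The sketch below is a generic illustration and not the authors' implementation: it assumes exponentially distributed waiting times, and the function name, data layout, and event names are hypothetical.

```python
import random


def next_event(rates_per_cell, rng):
    """Pick the next event by drawing a waiting time for every (cell, event type) pair.

    rates_per_cell: list of dicts mapping event name -> rate per unit time.
    Returns (waiting_time, cell_index, event_name), or None if no event can occur.
    """
    best = None
    for cell, rates in enumerate(rates_per_cell):      # every node
        for event, rate in rates.items():              # every event type
            if rate <= 0:
                continue  # this event cannot happen for this cell
            wait = rng.expovariate(rate)
            if best is None or wait < best[0]:
                best = (wait, cell, event)
    return best


rng = random.Random(0)  # seeded for reproducibility
population = [
    {"birth": 1.0, "death": 0.1, "mutation": 0.5},
    {"birth": 2.0, "death": 0.1, "mutation": 0.5},
]
print(next_event(population, rng))
```

      Each step visits every cell and every event type, so the work per step grows linearly with the total population, which is the scaling the passage cites as prohibitive for many competing GCs.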

      Recommendations for the authors:

      Reviewing Editor Comments:

      The authors are encouraged to follow the suggestions of manuscript re-organization by Reviewer 1, in order to improve readability. We would also like to suggest improving the discussion of the traveling wave model to explain it in a more self-contained way. In passing, please clarify what is meant by 'steady-state' in that model. A superficial understanding would suggest that the only steady state in that model would be a homogeneous population of antibodies with maximum affinity/fitness.

      These are great suggestions. We have substantially rearranged the text according to Reviewer 1's suggestions, especially the Methods, and expanded on and rearranged the traveling wave discussion. We've also clarified throughout that the traveling wave model is assuming steady state with respect to population. In the public response to reviewer 1 above we describe these changes in more detail.

      Reviewer #1 (Recommendations for the authors):

      I suggest that the organization of the paper be reconsidered. The current methods section is long and at times repetitive, making it impossible to parse in a single reading. Moving some technical details from the main text to an appendix could improve readability. Despite the length of the methods section, many important points, such as justification of choices in model specification or values of parameters, are treated only briefly.

      We have rearranged the methods section, particularly the discussion of our model, and have more clearly justified choices of parameter values as described in the public response.

      Discussion of similarities and differences with reference to Dewitt et al. 2025 should be revised, as it's currently unclear whether the method presented here has any advantages.

      We have expanded this comparison, and emphasized the main disadvantage of the traveling wave approach: there is no way of knowing whether by abstracting away so much biological detail it misses important effects. We have also emphasized that the two approaches use different types of data (time series vs endpoint) which are typically not simultaneously available:

      “The clear advantage of the traveling wave model is its simplicity: if its high level view is accurate enough to effectively model the relevant GC dynamics, it is far more tractable. But reproducing low-level biological detail, and making high-dimensional real data comparisons (e.g. Figure 5) to iteratively improve model fidelity, are also useful, providing direct evidence that we are correctly modeling the underlying biological processes. The two approaches also utilize different types of data: we use a single time point, and thus must reconstruct evolutionary history; whereas the traveling wave requires a series of timepoints. The availability of both types of data is a unique feature of the replay experiment, and provides us with the opportunity to directly compare the approaches.”

      The results obtained from the same data should be directly compared (can the response function be directly compared to the result in Figure S6D in Dewitt et al., 2025? If yes, it should be re-plotted here and compared/superimposed with Figures 6 and 7). The text mentions the results differ, but it remains ambiguous whether the differences are significant and what their implications are.

      We've added a new Figure 8, comparing a modified version of the traveling wave Fig S6D to a new plot derived from our results using the data mimic parameters. While the two plots represent fundamentally different quantities, they do put the results of the two methods on an approximately equal footing and we see nice concordance between them in regions with significant data (they disagree substantially for larger negative affinities). We have also added emphasis to the point that the traveling wave model uses an entirely separate dataset to what we use here.

      Other comments:

      (1) l. 80: "[in] around 10 days"?

      Text rearranged so this phrase no longer appears.

      (2) l. 96: "an intrinsic rate [given by?] the response function above".

      Text rearranged so this phrase no longer appears.

      (3) Figure 1: The “specific model” part could be expanded and improved to help make sense of model parameters and the order of different processes in the population model. Example values of parameters can be plotted rather than loosely described (e.g., y_h+y_c, the upper asymptotes, can be plotted in place of the “yscale determines upper asymptotes” label).

      Great suggestion, we've changed the labels.

      (4) The cartoons in the other parts are somewhat cryptic or illegible due to small sizes.

      We have added text in the caption linking to the figures that are, in the figure, intended to be in schematic form only.

      “Plots from elsewhere in the manuscript are rendered in schematic form: those in “infer on data” refer to Figure 4-figure supplement 1, and those in “simulate with inferred parameters” to Figure 5.”

      (5) L. 137: It's not helpful to give numerical values before the definition of affinity. (and these numbers are repeated later).

      Good point, we've moved the affinity definition to the previous section, and removed the duplicate range information.

      (6): Table 1: A number of notations are unclear, such as “#seqs/GC” or “mutability multiplier”. The double notation for crucial parameters doesn't help. At the moment the table is introduced, the columns make little sense to the reader, and it's not well specified what dictates the choice or changes of parameter values or ranges.

      We've moved the table further down until after the parameters have been introduced, and clarified the indicated names.

      (7) l. 147: Choices of model are not justified and appear arbitrary (e.g., why death events happen at one of two rates).

      We have clarified the reasoning behind having two death rates.

      (8) l.151: “happened on the edges of developing phylogenetic tree” - ambiguous: do they accumulate at cell divisions? What is a “developing tree”?

      We have removed this ambiguous phrasing.

      (9) l.161: This paragraph is particularly dense.

      We have rearranged this section of the methods, and split up this paragraph.

      (10) l. 164: All the different response functions for different event types? Or only the one for birth, as stated before?

      Yes. This has been clarified.

      (11) l.167: Does the statement in the bracket refer to a unit?

      This has been clarified.

      (12) l. 169: Discussion of the implementation seems too detailed.

      Hopefully the rearranged description is clearer, but we worry that removing the details of events selection would leave some readers confused.

      (13) l. 186: Why describe the methods that, in the end, were not used? Similarly, a mention of “variety of response functions” seems out of place if only one choice is used throughout the paper. eq. (2): that's m^{-1} from eq. (1). Having the two equations using the same notation is confusing.

      We've moved the mention of alternatives to the Discussion, where it is an important source of uncontrolled systematic uncertainty, and removed the extra equation.

      (14) l. 206: Unclear what “thus” refers to.

      Removed.

      (15) l.211: What does “neglecting y_h” mean?

      This has been clarified.

      (16) l. 242: Unclear what “this” refers to.

      Clarified.

      (17) l. 261: What does “model independence” refer to in this context?

      From the sigmoid model. Clarified.

      (18) l. 306: What values for which parameters? References?

      We have clarified and updated this statement - it was out of date, corresponding to the analysis before we started fitting non-sigmoid parameters.

      “In addition to the four sigmoid parameters, which we infer directly, there are other parameters in Table 1 about which we have incomplete information. The carrying capacity method and the choice of sigmoid for the response function represent fundamental model assumptions. We also fix the death rate for nonfunctional (stop) sequences, which would be very difficult to infer with the present experiment. For others, we know precise values from the replay experiment for each GC (time to sampling, # sampled cells/GC), but use a somewhat wider range for the sake of generalizability. The mutability multiplier is a heuristic factor used to match the SHM distributions to data. The naive birth rate is determined by the sigmoid parameters, but has its own range in order to facilitate efficient simulation.

      For two of the three remaining parameters (carrying capacity and initial population), we can ostensibly choose values based on the replay experiment. These values carry significant uncertainty, however, partly due to inherent experimental uncertainty, but also because they may represent different biological quantities to those in simulation. For instance, an experimental measurement of the number of B cells in a germinal center might appear to correspond closely to simulation carrying capacity. However if germinal centers are not well mixed, such that competition occurs only among nearby cells, the "effective" carrying capacity that each cell experiences could be much smaller.

      Fortunately, in addition to the neural network inference of sigmoid parameters, we have another source of information that we can use to infer non-sigmoid parameters: summary statistic distributions. We can use the matching of these distributions to effectively fit values for these additional unknown parameters. We also include the final parameter, the functional death rate, in these non-sigmoid inferred parameters, although it is unconstrained by the replay experiment, and it is unclear whether it is uniquely identifiable.”

      (19) l. 326: "is interpreted as having" or "corresponds to"?

      Changed.

      (20) l. 340: Not sure what "encompassing" means in this context.

      Clarified.

      (21) l. 341: "We do this..." -- I think this sentence is not grammatical.

      Fixed.

      (22) l. 348: "on simulation" -- "from simulated data"?

      Indeed.

      (23) l. 351: "top rows", the figures only have one row.

      Fixed.

      (24) Figure 2: It's difficult to tell from the loss function itself whether inference on simulated data works well. Why not report the simulated and inferred response functions? The equivalent plots in Figure 5 would also be informative. Has inference been tested for different "sigmoid parameters" values?

      This is an important point that was not clear, thanks for bringing it up. We have expanded on and emphasized the differences between these samples and the reasoning behind their different evaluation choices. Briefly, we can't display true vs inferred response functions on the training samples since the curves for each GC are different -- the plot would be entirely filled in with very different response function shapes. This is why we do actual performance evaluation on the "data mimic" samples, where all GCs have the same parameters. Summary stats (like Fig 5) for the training sample are in Fig 5 Supplement 2.

      (25) l. 354: Unclear what "this" refers to.

      Removed.

      (26) l. 355: We assume the parameters are the same?

      Yes, we assume all data GCs have the same parameters. We have added emphasis of this point.

      (27) Figure 4: Is "lambda" the fitness? Should be typeset as \lambda_i?

      Our convention is to add the subscript when evaluating fitness on individual cells, but to omit it, as here, when plotting the response function as a whole.

      (28) l. 412: "[a] carrying capacity constraint".

      Fixed.

      Reviewer #2 (Recommendations for the authors):

      (1) In 2 places, you state that observed affinity ranged from -37 to 3, but I assume that the lower bound should be -3.7.

      The -37 was actually correct, but we had mistakenly missed updating it when we switched to the latest (current) version of the affinity model. We have updated the values, although these don't really have any effect on the model since we only infer within bounds in which we have a lot of points:

      “Affinity is 0 for the initial unmutated sequence, and ranges from -12.2 to 3.5 in observed sequences, with a mean (median) of -0.3 (0.3).”

      (2) I had to look up the Voznica paper to understand the tree encoding: It would be nice to spend another sentence or two on it here for those who aren't familiar.

      Great point, we have added the following:

      “We encode each tree with an approach similar to Lambert et al. (2023) and Thompson et al. (2024), most closely following the compact bijective ladderized vector (CBLV) approach from Voznica et al. (2022). The CBLV method first ladderizes the tree by rotating each subtree such that, roughly speaking, longer branches end up toward the left. This does not modify the tree, but rather allows iteration over nodes in a defined, repeatable way, called inorder iteration. To generate the matrix, we traverse the ladderized tree in order, calculating a distance to associate with each node. For internal nodes, this is the distance to root, whereas for leaf nodes it is the distance to the most-recently-visited internal node (Voznica et al., 2022, Fig. 2). Distances corresponding to leaf nodes are arranged in the first row of the matrix, while those from internal nodes form the second row.”
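      The encoding described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Node` class and the `height`, `ladderize`, and `encode` functions are hypothetical names, the tree is assumed to be strictly binary, "longer" is taken to mean the greatest distance to a tip, and the first leaf (which has no previously visited internal node) is measured from the root. The zero-padding that would turn the two rows into a fixed-size matrix is left out.

      ```python
      from dataclasses import dataclass, field


      @dataclass
      class Node:
          """Hypothetical tree node: branch length to parent plus child nodes."""
          length: float = 0.0
          children: list["Node"] = field(default_factory=list)


      def height(node):
          """Longest distance from this node's parent down to any tip below it."""
          return node.length + max((height(c) for c in node.children), default=0.0)


      def ladderize(node):
          """Rotate every subtree so the deeper child comes first (to the left)."""
          for child in node.children:
              ladderize(child)
          node.children.sort(key=height, reverse=True)


      def encode(root):
          """Return [leaf_row, internal_row] from an inorder traversal."""
          ladderize(root)
          leaf_row, internal_row = [], []
          last_internal_depth = 0.0  # assumption: first leaf is measured from the root

          def visit(node, depth):
              nonlocal last_internal_depth
              if not node.children:
                  # Leaf: distance to the most recently visited internal node,
                  # which in an inorder traversal is always one of its ancestors.
                  leaf_row.append(depth - last_internal_depth)
                  return
              left, right = node.children  # assumes a strictly binary tree
              visit(left, depth + left.length)
              # Internal node: distance to the root.
              internal_row.append(depth)
              last_internal_depth = depth
              visit(right, depth + right.length)

          visit(root, 0.0)
          return [leaf_row, internal_row]


      # Toy tree ((A:1,B:2):1,C:0.5)
      toy = Node(children=[Node(1.0, [Node(1.0), Node(2.0)]), Node(0.5)])
      print(encode(toy))  # [[3.0, 1.0, 0.5], [1.0, 0.0]]
      ```

      Because ladderizing fixes the child order, two trees with the same topology and branch lengths yield the same pair of rows regardless of how their children were originally ordered.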

      (3) On line 351, you refer to the "top rows of Figure 2 and Figure 3," but each only has one row in the current version. I think it should now be "left panel."

      Fixed.

      (4) How many vertical dashed lines are in the left panel of the bottom row of Figure 7? I think it's more than one, but can't tell if it is two or three...

      Nice catch! There were actually three. We've shortened them and added a white outline to clarify overlapping lines.

      (5) Would the model be applicable to GCs with multiple naive founders of different affinities? Or would more/different parameters be needed to account for that?

      The model would be applicable, but since the time required for our simulation scales roughly with the total simulated population size, we could probably only handle competition among at most a couple of GCs. Some sort of "migration strength" parameter would be required for competition among GCs (or within one GC if we don't want to assume it's well-mixed), but that doesn't seem a terrible impediment. We've added the following:

      “We also neglect competition among lineages stemming from different rearrangement events (different clonal families), instead assuming that each GC is seeded with instances of only a single naive sequence, and that neither cells nor antibodies migrate between different GCs. More realistically for the polyclonal GC case, we would allow lineages stemming from different naive sequences to compete with each other both within and between GCs (Zhang et al. 2013; McNamara et al. 2020; Barbulescu et al. 2025). Implementing competition among several clonal families within a single GC would be conceptually simple and computationally practical in our current software framework. Competition among many GCs, however, would be computationally prohibitive because our time required is primarily determined by the total population size, since at each step we must iterate over every node and every event type in order to find the shortest waiting time. For the monoclonal replay experiment specifically, however, all naive sequences are the same and so the current modeling framework is sufficient.”
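      The scaling argument in the passage above (cost per step proportional to population size times the number of event types) can be illustrated with a small sketch. It assumes that each (cell, event type) pair has an exponentially distributed waiting time and that the pair with the shortest draw fires next; the function name `next_event` and the rate callables are hypothetical and are not taken from the authors' software.

      ```python
      import random


      def next_event(cells, rates, rng):
          """Pick the next event by scanning every cell and every event type.

          cells: iterable of cell identifiers.
          rates: dict mapping event name -> function(cell) -> rate (events per unit time).
          Returns (waiting_time, cell, event), or None if nothing can happen.
          """
          best = None
          for cell in cells:                      # cost grows with population size
              for event, rate_fn in rates.items():  # and with the number of event types
                  rate = rate_fn(cell)
                  if rate <= 0:
                      continue                    # this event cannot occur for this cell
                  wait = rng.expovariate(rate)    # exponential waiting time
                  if best is None or wait < best[0]:
                      best = (wait, cell, event)
          return best


      rng = random.Random(0)
      example_rates = {"birth": lambda cell: 2.0, "death": lambda cell: 0.5}
      print(next_event(["cell_1", "cell_2", "cell_3"], example_rates, rng))
      ```

      Each call touches every cell once per event type, which is why simulating many germinal centers at once would multiply the per-step cost by the total number of cells.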

    1. 19.6. Programming, Gender, Status, and Money

       While we’ve been talking about capitalism and social media platforms, we also want to look at the world of programming as well. In particular, we want to highlight how the profession of programming went from being a disrespected, low-pay job for women, to being a highly respected and high paying job for men.

       19.6.1. Programming as Women’s Work

       As you may have noticed in chapter 2 of the book, the first programmers were almost all women. When computers were being invented, men put themselves in charge of building the physical devices (hardware), and then gave the work of programming the computer (software) to their assistants, often women. So, for example, you can see this at various stages of computer development, such as:

       - 1800s: Charles Babbage describes the first full computer (Analytical Engine), and Ada Lovelace writes down the first computer program for it.
       - 1945: The first general-purpose electrical computer was created by men and programmed by women
       - 1950s: Grace Hopper invents the compiler to help with programming the computers (built by men)

       As women were advancing the field of computer programming, some of them became frustrated with how they were viewed, such as Margaret Hamilton:

       Fig. 19.2 Photo of Margaret Hamilton next to the computer program source code (which she was in charge of) for the Apollo missions to the moon.

       When Margaret Hamilton was in charge of creating the software to run on the Apollo rockets, the men around her considered programming to be easy and less serious than the “engineering” they were doing in building the rocket. So, she began calling the programming she was doing “software engineering” to convey the complexity and rigor of the work she and her team were doing. She was able to convince her colleagues and the term “software engineering” became common.

       Still, up through the 1960s and 1970s, most computer companies made their money by selling the physical hardware. They happened to include some software to go with it, and people who bought the hardware would sometimes hire people to make more software. So software was still considered secondary, and up until the early 1980s, women were getting around 37% of computer science degrees.

       Fig. 19.3 Graph showing percentage of women receiving degrees in different fields from NPR.

       19.6.2. Programming for Boys and Men

       In the early 1980s, a number of things changed which ended up with programming seen as a male profession, and a highly profitable and respected one.

       One of the changes was that some men in the computer business figured out how to make money selling software. This was particularly the case for Bill Gates who convinced companies like IBM to license his software, so he could continue making money as more people used it.

       Another change was that as computers became small enough for people to buy them for their homes, they became seen as toys for boys and not girls. The same transition is seen in video game consoles from being for the whole family to being for boys only (e.g., the Nintendo Game Boy).

       In the end, computer programming became profitable and male-dominated. As many are trying to get women into programming, so that they aren’t cut out of profitable and important fields, Amy Nguyen warns that men might just decide that programming is low status again (as has happened before in many fields):

       The history of women in the workplace always tells the same story: women enter a male-dominated profession, only to find that it’s no longer a respectable field. Because they’re a part of it, so men leave in droves. Because women do it, and therefore it must not be important. Because society would rather discredit an entire profession than acknowledge that a female-dominated field might be doing something that actually matters.

      I found this reading very interesting because it shows how the social perception of programming has changed over time. In the early days, programming was considered less important work and was often done by women, even though many of them made significant contributions to the field. As software became more profitable and respected, the field gradually became more male-dominated. This made me realize that the status of a profession is not always based on the difficulty or importance of the work, but can also be shaped by social and economic factors. It also highlights why it is important to make technology fields more inclusive so that opportunities in influential and well-paid careers are accessible to everyone.

    1. Why do social media platforms make decisions that harm users? And why do social media platforms sometimes go down paths of self-destruction and alienating their users? Sometimes these questions can be answered by looking at the economic forces that drive decision-making on social media platforms, in particular with capitalism. So let’s start by defining capitalism.

       19.1.1. Definition of Capitalism:

       Capitalism is: “an economic system characterized by private or corporate ownership of capital goods, by investments that are determined by private decision, and by prices, production, and the distribution of goods that are determined mainly by competition in a free market” (Merriam-Webster Dictionary)

       In other words, capitalism is a system where:

       - Individuals or corporations own businesses
       - These business owners make what they want and set their own prices. They compete with other businesses to convince customers to buy their products.
       - These business owners then hire wage laborers at predetermined rates for their work, while the owners get the excess business profits or losses.

       Related Terms

       Here are a few more terms that are relevant to capitalism that we need to understand in order to get to the details of decision-making and strategies employed by social media companies.

       Shares / Stocks

       - Shares or stocks are ownership of a percentage of a business, normally coming with getting a percentage of the profits and a percentage of power in making business decisions.
       - Companies then have a board of directors who represent these shareholders. The board is in charge of choosing who runs the company (the CEO). They have the power to hire and fire CEOs
         - For example: in 1985, the board of directors for Apple Computers denied Steve Jobs (co-founded Apple) the position of CEO and then they fired him completely
       - CEOs of companies (like Mark Zuckerberg of Meta) are often both wage-laborers (they get a salary, Zuckerberg gets a tiny symbolic $1/year) and shareholders (they get a share of the profits, Zuckerberg owns 16.8%)

       Free Market

       - Businesses set their own prices and customers decide what they are willing to pay, so prices go up or down as each side decides what they are willing to charge/spend (no government intervention)
         - See supply and demand
       - What gets made is theoretically determined by what customers want to spend their money on, with businesses competing for customers by offering better products and better prices
         - Especially the people with the most money, both business owners and customers

       Monopoly

       - “a situation where a specific person or enterprise is the only supplier of a particular thing”
       - Monopolies are considered anti-competitive (though not necessarily anti-capitalist). Businesses can lower quality and raise prices, and customers will have to accept those prices since there are no alternatives.
       - Cornering a market is being close enough to a monopoly to mostly set the rules (e.g., Amazon and online shopping)

       19.1.2. Socialism

       Let’s contrast capitalism with socialism: Socialism, in contrast is a system where:

       - A government owns the businesses (sometimes called “government services”)
       - A government decides what to make and what the price is
         - the price might be free, like with public schools, public streets and highways, public playgrounds, etc.
       - A government then may hire wage laborers at predetermined rates for their work, and the excess business profits or losses are handled by the government
         - For example, losses are covered by taxes, and excess may pay for other government services or go directly to the people (e.g., Alaska uses its oil profits to pay people to live there).

       As an example, there is one Seattle City Sewer system, which is run by the Seattle government. Having many competing sewer systems could actually make a big mess of the underground pipe system.

       19.1.3. Accountability in Capitalism and other systems

       Let’s look at who the leaders of businesses (or services) are accountable for in capitalism and other systems.

       Democratic Socialism (i.e., “Socialists”)

       With socialism in a representative democracy (i.e., “democratic socialism”), the government leaders are chosen by the people through voting. And so, while the governmental leaders are in charge of what gets made, how much it costs, and who gets it, those leaders are accountable to the voters. So, in a democratic socialist government, theoretically, every voter has an equal say in business (or government service) decisions.

       Note, that there are limitations to the government leaders being accountable to the people their decisions affect, such as government leaders ignoring voters’ wishes, or people who can’t vote (e.g., the young, non-citizens, oppressed minorities) and therefore don’t get a say.

      I thought this assignment was interesting because it connected programming with a real-world scenario. It helped me understand how the way we design an algorithm can affect fairness and outcomes for different people. I also liked that it made us think not only about writing correct code, but also about the social impact of algorithms.

    1. The American Yawp Reader: Frederick Douglass on Remembering the Civil War, 1877

      The publisher, author, and title.

    1. AI-Generated Content (March 2025): This page was created through iterative prompting of Claude Code (Opus 4.5) and GPT-5.2 Pro, feeding in workshop discussion content and focal papers for our Pivotal Questions initiative (Benjamin et al. 2023, UK Green Book Wellbeing Guidance, Bond & Lang 2019, Haushofer & Shapiro 2016/2018, Kaiser & Oswald 2022). While grounded in these sources, this content requires further human verification. Specific claims, citations, and numerical details should be checked against the original literature before relying on them.

      Make this a folding box #implement

    1. Standardized Testing is Still Failing Students Educators have long known that standardized tests are an inaccurate and unfair measure of student progress. There's a better way to assess students. By: Cindy Long, Senior Writer Published: March 30, 2023 Share Key Takeaways Standardized tests don’t accurately measure student learning and growth. Unlike standardized tests, performance-based assessment allows students to choose how they show learning. Performance-based assessment is equitable, accurate, and engaging for students and teachers. Break out your No. 2 pencil and answer this multiple choice question: How do standardized tests measure student learning? A. In a single snapshot. B. With biased test questions. C. Without determining learning growth. D. All of the above. If you didn’t answer “all of the above,” then: A. You haven’t been paying attention, or B. You work for a testing software company. Most of us know that standardized tests are inaccurate, inequitable, and often ineffective at gauging what students actually know. The good news is, there’s a better way:  Performance-based assessment provides an essential piece of the puzzle in measuring student growth. What is Performance-Based Assessment (PBA)? This system of learning and assessment allows students to demonstrate knowledge and skill through critical-thinking, problem-solving, collaboration, and the application of knowledge to real-world situations. In other words, it helps students prepare for college, career, and life. PBA also allows educators to create more engaging instruction and address learning gaps by observing over time. And it helps them gather well-rounded information to better support their students’ success—a far cry from the “drill and kill” of state and federal standardized tests. 
PBA can mean asking students to compose a few sentences in an open-ended short response; develop a thorough analysis in an essay; conduct a laboratory investigation; curate a portfolio of student work; or complete an original research paper. Younger students may design experiments, write poems, or create art that demonstrate concepts. “PBA allows students more choice in how they can show what they’ve learned, and it allows differentiated instruction for different learning styles,” says Molly Malinowski, a first-grade teacher at Lynch Elementary School, in the Winchester school district, in Massachusetts. “Standardized tests don’t allow choice because it’s one-size-fits-all. Students may have the knowledge, but may not be able to show what they know and understand on the test. They can demonstrate that in PBA.” Built-in Differentiation One of Malinowski’s favorite math assessments focuses on addition, using numbers from 0 through 20. She introduces the concept and, with inquiry-based instruction, presents lessons and talks about strategies. As students pose questions or problems, the process ignites their curiosity. At the end of the unit, they have an “addition celebration.” She starts by building fluency and then asks students to share their problem-solving strategies with the class. “The best part is that they get to show their learning in a lot of different ways,” Malinowski says. “For this math standard, they make a video, a poster, a voice recording, a mini-play, or a piece of writing. Students have a voice in how they demonstrate their learning.” 
And when they share with the class, that helps reinforce everyone’s learning, she explains. None of this is possible in an individual test. “With this kind of instruction and assessment, we can … offer targeted support,” she says. “Some students will want to create their own piece of work, while others need more scaffolding, so I can really see who is learning what.” Each academic area in her school, from math to language arts to science and social studies, has a PBA. “We’ve been encouraged to take time to develop PBAs for our grade level, developing and tweaking them, and trying them out and making edits as we go,” she adds. Sequences, Not Snapshots Malinowski teaches in one of eight public school districts participating in the Massachusetts Consortium for Innovative Education Assessment (MCIEA). MCIEA is a partnership between these school districts and their local teachers unions, who are working together to create a fair and effective accountability system, offering a more dynamic picture of student quality and learning than a single standardized test. “The first objective of MCIEA is to measure student learning in a way that relies on teacher-created, classroom-embedded, performance assessments rather than externally created standardized assessments,” says MCIEA co-founder Jack Schneider. “The second objective is to measure school quality in a way that is more holistic, valid, and democratic than standardized tests.” In a science and sound assessment, Molly Malinowski’s students use what they learned to create instruments from recycled materials. Credit: Courtesy of Molly Malinowski The consortium’s framework goes beyond test scores and graduation and absentee rates, instead focusing on multiple measures of student engagement and achievement, as well as school environment. “It’s a more equitable system that doesn’t punish marginalized students,” Schneider explains. 
The current data used to measure student success, adds Schneider, closely correlates to race, income, and family educational attainment. It really only tells us about advantage and disadvantage outside school, not what schools are actually doing. “We want to capture all of the things that show student learning and achievement, not just this tortuously narrow vision that is embedded in the current system,” he says. “Where in the data is student engagement, their sense of belonging, performing arts, or development of civic competencies?” he asks. “There are massive gaps. So many of the current variables are just demographics in disguise.” On a standardized test, there is only one right answer. Every other answer is wrong. But educators know when students are engaged in authentic and meaningful tasks, they can arrive at answers that are not entirely wrong or entirely right. “We’d never give someone a standardized test to see if they can fly an airplane,” Schneider says. “PBA can show us much more effectively what a student knows that actually reflects the complexity of knowing and doing something.” Leveling the Playing Field Alissa Holland is an instructional coach in the Milford Public Schools district, in Massachusetts, one of the districts participating in the MCIEA. “Standardized tests create test anxiety and some kids even have test phobia because they have just this one chance at getting it right,” says Holland, who works at Stacy Middle School. “PBA allows students to ask more questions and bounce ideas around, which reduces anxiety without reducing rigor." PBA also removes bias and creates more equity. 
For example, Holland recalls a state standardized test question about decomposition for eighth-grade science: “The question asked about grass clippings after mowing the lawn, and students got hung up on ‘grass clippings,’” she says. “Some had no idea what they were because they either lived in apartments and had no lawns, or they had lawns but didn’t mow them because they had landscapers.” A PBA on decomposition, on the other hand, would allow students to show their learning by demonstrating it with something they’re familiar with—grass clippings, an apple core, or leaves in the park—in a diagram, an essay or poem, or  artwork. Students can show much more of their knowledge over time, in a form they choose, rather than on one day, on one test question, where only one answer is right, Holland explains. “Working on a PBA shows growth while inspiring a growth mindset,” Holland says. Her colleague Dan Cote, a social studies teacher in the same district, agrees. Nobody likes to regurgitate everything they know on a single test,  he says. “The PBA process acknowledges that learning happens in different stages, where we can give feedback and build skills during the assessment,” Cote says. “It also increases engagement because students walk in the shoes of a geographer or historian, with tasks that mimic real world activities.”   In Dan Cote’s social studies class, students demonstrate learning by studying artifacts like real-world archaeologists and explaining how they would lead an ancient civilization. Credit: Jason Grow Stronger Relationships with Students Standards and rubrics don’t always call for collaboration, and there’s certainly no teamwork on a test, but in PBA, students often work in  groups that build critical social skills. When Cote’s students learn about earlier civilizations, they form teams of four and decide together what time period and landscape their civilization would inhabit.  
“They have tough decisions to make and problems to solve, like laws for farmers, how to respond to a natural disaster, what they should invest in as a civilization,” Cote explains. “Then each student writes their version of what happened, like a historian would, which demonstrates how easily bias occurs when the students read the other students’ varying versions of the same events.” Then the students create artifacts from their civilizations, and their classmates try to guess which civilization each student inhabited. At the end of every PBA, Cote debriefs with the class. “Students always say that working together was the best part,” he says. That’s the best part for the teachers, too, because it helps them build stronger relationships with students. “We gain so much more when we can hear their thoughts,” Cote says. “It’s not about a gotcha on a test score.” In this special NEA Today series, we examine the current state of student assessment, what's not working, and propose better ways of equitably and effectively supporting student learning.

      I don't feel that standardized testing is beneficial for our students. We are teaching our students to take tests, not to learn information. I have seen schools that spend the months before testing teaching the kids how to take the test and what they need to do.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We sincerely thank the reviewer for the thorough and constructive evaluation of our manuscript. We greatly appreciate the recognition of our work's strengths, particularly the integration of experiments and mathematical modeling, the stochastic framework for describing sloughing events, and the insights into pressure-driven detachment dynamics.

      We have carefully considered each point raised and provide detailed responses below. In response to the reviewer's comments, we have revised the Methods section to better clarify our approach to three-dimensional assessment. We believe these revisions have improved the clarity of the manuscript.

      Below, we address each of the specific concerns raised by the reviewer:

      Public Reviews:

      Reviewer #1 (Public review):

      Weaknesses: The study achieves its primary goal of integrating experiments and modeling to understand the coupling between flow and biofilm growth and detachment in a microfluidic channel, but it should have highlighted the weaknesses of the methods. I list the ones that, in my opinion, are the main ones:

      The study does not consider biofilm porosity, which could significantly affect the flow and forces exerted on the biofilm. Porosity could impact the boundary conditions, such as the no-slip condition, which should be validated experimentally.

      Porosity is indeed a key component of biofilm structures, resulting from the polymeric nature of the EPS matrix, mechanical forces, and biological processes such as cell death or predation. When considering flow-biofilm interactions, this porosity may allow fluid flow through the biofilm, with reported permeability values spanning an extremely broad range from 10⁻¹⁵ to 10⁻⁷ m² (Kurz et al., 2023).

      However, we argue that biofilm permeability is not the primary driver in our system:

      (1) In microscopy visualization, our biofilms form dense structures where flow around the biofilm through narrow channels dominates over flow through the porous biofilm matrix.

      (2) We performed microrheology experiments in these biofilms by imaging the Brownian motion of nanoparticles in the biofilm. Their trajectories indicate that, in our conditions, the viscoelastic flow of the biofilm itself largely dominates over the flow of culture medium through the biofilm matrix.

      (3) We argue that the extreme variability in reported permeability values (spanning several orders of magnitude, Kurz et al., 2023) reflects not only differences in experimental systems, but also fundamental challenges in defining and measuring permeability for viscoelastoplastic biofilms (the biofilm itself is actually flowing). Given this uncertainty, incorporating permeability into our model would introduce parameters that cannot be reliably constrained from literature or independently measured in our setup. Our approach (i.e. treating the biofilm as impermeable and focusing on flow obstruction) avoids this parametrization complexity while successfully capturing the observed dynamics.

      (4) Our model successfully predicts the observed scaling laws (φmax ∝ Q^(1/2), Fig. 7f) and hydraulic resistance dynamics (Fig. 3) without invoking permeability, suggesting that flow obstruction rather than flow penetration is the dominant mechanism.

      Reference: Kurz, D. L.; Secchi, E.; Stocker, R.; Jimenez-Martinez, J. Morphogenesis of biofilms in porous media and control on hydrodynamics. Environ. Sci. Technol. 2023, 57 (14), 5666−5677.
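The scaling check in point (4) can be illustrated with a log-log regression: if φmax ∝ Q^(1/2), the slope of log(φmax) against log(Q) should be close to 0.5. The sketch below is only an illustration under stated assumptions. The function name `fit_power_law_exponent` is hypothetical, and the φmax values are synthetic numbers generated from an exact square-root law, not measurements from the manuscript.

```python
import math


def fit_power_law_exponent(flow_rates, phi_max_values):
    """Return the least-squares slope of log(phi_max) versus log(Q)."""
    xs = [math.log(q) for q in flow_rates]
    ys = [math.log(p) for p in phi_max_values]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic, illustrative values only (NOT data from the manuscript).
flow_rates = [0.2, 2.0, 20.0]  # flow rates in uL/min, as quoted in the response
phi_max_synthetic = [0.05 * q ** 0.5 for q in flow_rates]  # assumed prefactor 0.05
exponent = fit_power_law_exponent(flow_rates, phi_max_synthetic)
```

With real data the fitted slope would carry an uncertainty, so a confidence interval on the exponent would be more informative than the point estimate alone.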

      The research suggests EPS development as a stage in biofilm growth but does not probe it using lectin staining. This makes it impossible to accurately assess the role of EPS in biofilm development and detachment processes.

      We respectfully disagree that lectin staining is necessary to assess the role of EPS in our system, and we argue that our approach using genetic mutants is superior for the following reasons. Lectin staining has significant limitations. While widely used, lectin staining (e.g., concanavalin A) is non-specific (binding not only to EPS polysaccharides but also to bacterial cell surfaces) and is non-quantitative. It can confirm the presence of polysaccharides but cannot establish causal relationships between specific EPS components and mechanical properties or detachment dynamics. We performed preliminary experiments with ConA-rhodamine (data not shown), which showed widespread presence of polysaccharides. However, this provided limited insight beyond confirming EPS production, which is well-established for P. aeruginosa PAO1 biofilms. We employed a more rigorous genetic approach to directly assess the role of EPS composition. We used Δpel and Δpsl mutants (strains lacking key exopolysaccharides that are the primary structural components of the PAO1 matrix). Our results demonstrate that both mutants show significantly reduced maximum clogging compared to wild-type. The Δpsl mutant is particularly affected, with near-complete detachment at certain flow rates. These differences directly link EPS composition to mechanical stability and detachment dynamics. This genetic approach provides causal, quantitative evidence for the role of specific EPS components in biofilm development and detachment, information that lectin staining cannot provide. We believe this addresses the reviewer's concern more rigorously than lectin staining would.

      While the force and flow are three-dimensional, the images are taken in two dimensions. The paper does not clearly explain how the 2D images are extrapolated to make 3D assessments, which could lead to inaccuracies.

      We thank the reviewer for this important observation. We would like to clarify our methodological approach. Our primary three-dimensional measurement is the hydraulic resistance R(t), obtained from pressure drop measurements across the biofilm-containing channel section. This pressure-based measurement inherently captures the three-dimensional flow obstruction caused by the biofilm. We then employ a geometric model (uniform biofilm layer on all channel walls) to convert R(t) into volume fraction φ(t).

      The two-dimensional fluorescence imaging serves to validate this model-based approach rather than being the basis for three-dimensional extrapolation. The uniform layer assumption is supported by three independent lines of evidence: (i) the excellent quantitative agreement between predicted and measured scaling laws (φmax ∝ Q^(1/2), Fig. 7f), obtained without adjustable parameters; (ii) the high reproducibility of φmax values across different flow rates and replicates; and (iii) the strong correlation between model-derived φ(t) from pressure measurements and integrated fluorescence intensity (Fig. 3b-d).

      We have added clarifying text in the Methods section (subsection "Data analysis for the calculation of the hydraulic resistance and volume fraction") to better explain this approach and emphasize that pressure measurements provide the three-dimensional information, with the geometric model serving as the link to volume fraction.
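The conversion from hydraulic resistance R(t) to volume fraction φ(t) described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common thin-rectangle approximation R ≈ 12 μ L / (w h³ (1 − 0.63 h/w)) for a duct with h ≤ w, models the biofilm as a uniform layer of thickness δ on all four walls, and inverts R for δ by bisection. All function names and the channel dimensions in the example are hypothetical.

```python
def rect_resistance(mu, length, width, height):
    """Approximate hydraulic resistance of a rectangular duct (height <= width)."""
    return 12.0 * mu * length / (width * height ** 3 * (1.0 - 0.63 * height / width))


def resistance_with_layer(delta, mu, length, width, height):
    """Resistance when a uniform layer of thickness delta coats all four walls."""
    return rect_resistance(mu, length, width - 2.0 * delta, height - 2.0 * delta)


def volume_fraction(delta, width, height):
    """Fraction of the channel cross-section occupied by the layer."""
    return 1.0 - (width - 2.0 * delta) * (height - 2.0 * delta) / (width * height)


def phi_from_resistance(r_measured, mu, length, width, height):
    """Invert the measured resistance for delta by bisection, then return phi."""
    lo, hi = 0.0, 0.5 * height * (1.0 - 1e-9)  # delta cannot exceed half the height
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        # Resistance increases monotonically with delta, so bisection applies.
        if resistance_with_layer(mid, mu, length, width, height) < r_measured:
            lo = mid
        else:
            hi = mid
    return volume_fraction(0.5 * (lo + hi), width, height)


# Hypothetical channel: water-like viscosity, 1 cm long, 1 mm wide, 100 um high.
MU, LENGTH, WIDTH, HEIGHT = 1e-3, 1e-2, 1e-3, 1e-4
delta_true = 2e-5
r_example = resistance_with_layer(delta_true, MU, LENGTH, WIDTH, HEIGHT)
phi_example = phi_from_resistance(r_example, MU, LENGTH, WIDTH, HEIGHT)
```

The round trip (choose δ, compute R, recover φ) is a quick sanity check that the inversion is consistent with the forward model. It says nothing about whether the uniform-layer assumption holds in a given experiment.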

      Although the findings are tested using polysaccharide-deficient mutants, the results could have been analyzed in greater detail. A more thorough analysis would help to better understand the role of matrix composition on the stochastic model of detachment.

      We thank the reviewer for this suggestion. Our mutant analysis demonstrates that Δpsl and Δpel strains have significantly reduced φmax and altered detachment dynamics compared to wild-type (Fig. 8), directly linking EPS composition to mechanical stability as predicted by our model. A rigorous quantitative connection between matrix composition and the stochastic parameters (interevent times, jump amplitudes) would require: (i) substantially more sloughing events for statistical power, (ii) independent mechanical characterization of each mutant, and (iii) a mechanistic model linking EPS composition to detachment parameters. We are currently developing microrheology approaches to characterize mutant mechanical properties, which could enable such refinement in future work.

      However, this represents a substantial study beyond the scope of the current manuscript, which establishes the self-sustained sloughing-regrowth cycle and its stochastic nature. The mutant results serve their intended purpose: demonstrating that EPS composition affects detachment, consistent with our model's framework.

      Reviewer #2 (Public review):

      This manuscript develops well-controlled microfluidic experiments and mathematical modelling to resolve how the temporal development of P. aeruginosa biofilms is shaped by ambient flow. The experiment considers a simple rectangular channel on which a constant flow rate is applied and UV LEDs are used to confine the biofilm to a relatively small length of the device. While there is often considerable geometrical complexity in confined environments and feedback between biofilm/flow (e.g. in porous media), these simplified conditions are much more amenable to analysis. A non-dimensional mathematical model that considers nutrient transport, biofilm growth and detachment is developed and used to interpret experimental data. Regimes with both gradual detachment and catastrophic sloughing are considered. The concentration of nutrients in the media is altered to resolve the effect of nutrient limitation. In addition, the role of a couple of major polysaccharide EPS components is explored with mutants, which leads to results in line with previous studies.

      There has been a vast amount of experimental and modelling work done on biofilms, but relatively rarely are the two linked together so tightly as in this paper. Predictions on the influence of the non-dimensional Damkohler number on the longitudinal distribution of biofilm and the functional dependence of flow on the maximum amount of biofilm (𝜙max) are demonstrated. The study reconfirms a number of previous works that showed the gradual detachment rate of biofilms scales with the square root of the shear stress. More challenging are the rapid biofilm detachment events where a large amount of biofilm is detached at once. These events are identified experimentally using an automated analysis pipeline and are fitted with probability distributions. The time between detachment events was fitted with a Gamma distribution and the amplitude of the detachment events was fitted with a log-normal distribution, however, it is not clear how good these fits are. Experimental data was then used as an input for a stochastic differential equation, but the output of this model is compared only qualitatively to that of the experiments. Overall, this paper does an admirable job of developing well-constrained experiments and a tightly integrated mathematical framework through which to interpret them. However, the new insights this provides the underlying physical/biological mechanisms are relatively limited.

      We thank the reviewer for the thorough evaluation of our work and for highlighting the tight integration between experiments and modeling. We appreciate the constructive feedback regarding the goodness-of-fit for the probability distributions.

      To address the concern that "it is not clear how good these fits are," we have added quantile-quantile (Q-Q) plots for the Gamma distribution fits of inter-event times to the Supplementary Materials (Supplementary Figure S20). These plots demonstrate that the sample quantiles track the theoretical Gamma quantiles across all flow rates (0.2, 2, and 20 μL/min), indicating that the Gamma distribution provides a reasonable approximation of the overall distributional behavior. For detachment amplitudes, we selected the lognormal distribution based on the observed high skewness and kurtosis in the data, which are characteristic signatures of lognormal processes.

      Formal goodness-of-fit tests (chi-square, Kolmogorov-Smirnov) yielded mixed results across datasets, passing for some while failing for others. This variability reflects inherent noise from measurements, discrete temporal sampling, automated detection thresholds, and intrinsic biological variability. Importantly, our goal is to capture essential distributional characteristics for input into the stochastic model, not to achieve perfect statistical fit across all individual datasets. The Q-Q plots confirm that these distributions provide reasonable approximations, and the qualitative agreement between model predictions and experimental observations validates this modeling approach. We have revised the Methods section to clarify this rationale.
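A Q-Q comparison of the kind described above can be sketched with the Python standard library for the lognormal amplitude case: fit a normal distribution to the log-amplitudes, then pair each sorted sample value with the fitted quantile at the same plotting position. This is an illustration only. The function name `lognormal_qq`, the plotting position (i + 0.5)/n, and the example amplitudes are assumptions, and the Gamma case used for inter-event times would additionally need an inverse Gamma CDF, which the standard library does not provide.

```python
import math
from statistics import NormalDist, fmean, stdev


def lognormal_qq(amplitudes):
    """Return (theoretical, sample) quantile pairs for a fitted lognormal."""
    logs = sorted(math.log(a) for a in amplitudes)
    mu = fmean(logs)      # mean of the log-amplitudes
    sigma = stdev(logs)   # sample standard deviation of the log-amplitudes
    fitted = NormalDist(mu, sigma)
    n = len(logs)
    pairs = []
    for i, log_value in enumerate(logs):
        p = (i + 0.5) / n  # assumed plotting position
        theoretical = math.exp(fitted.inv_cdf(p))
        pairs.append((theoretical, math.exp(log_value)))
    return pairs


# Hypothetical amplitudes for illustration (NOT data from the manuscript).
example_pairs = lognormal_qq([1.0, 2.0, 4.0])
```

Points lying close to the diagonal (theoretical ≈ sample) indicate that the fitted distribution tracks the data. Systematic curvature in the tails is the usual sign of a poor fit.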

      We respectfully disagree that “new insights this provides the underlying physical/biological mechanisms are relatively limited.” Beyond confirming previous findings (e.g., scaling for gradual detachment), we believe our work provides several novel mechanistic insights. First, the Pe/Da criterion enables quantitative prediction of nutrient limitation regimes, allowing systematic decoupling of nutrient effects from other phenomena in biofilm studies. Second, we demonstrate that pressure, not shear, drives sloughing detachment events, a mechanism overlooked in previous studies where the notion of “shear-induced detachment” clearly dominates. Third, we show that sloughing-regrowth cycles occur even in single channels, establishing pressure-driven fluctuations as a signature of confined biofilm growth, independent of geometric complexity. Finally, the stochastic description of sloughing demonstrates that, while instantaneous biofilm states are irreproducible, the underlying randomness is predictable, therefore addressing a fundamental challenge in biofilm research.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) In the abstract, I suggest clarifying the term "bacteria development." It is unclear if it refers to bacterial growth, biofilm formation, or biofilm detachment. The concept is expressed more clearly at the end of the Introduction.

      We have modified the entire abstract to make it clearer. The abstract now explicitly establishes the key processes - growth ('nutrients necessary for growth', 'growing bacteria obstruct flow paths') and detachment ('mechanical stresses that cause detachment', 'flow-induced detachment', 'sloughing') - before using 'bacterial development' as a collective term to refer to these coupled spatiotemporal dynamics. We believe the abstract is now clear as written.

      (2) Findings from Sanfilippo et al. (2019) were slightly questioned by Padron et al. (PNAS, 2023), who discovered that H2O2 transport is responsible for fro operon upregulation.

      Thanks for the clarification, which is indeed significant. The new sentence now reads: Pseudomonas aeruginosa has been found to regulate the fro operon in response to flow-modulated H2O2 concentrations (Sanfilippo et al. 2019, Padron et al. 2023).

      (3) Additionally, Kurz et al. (2022) account for pressure buildup as the mechanism controlling sloughing.

      We respectfully disagree and note that Kurz et al. (2022) identify shear stress, not pressure buildup, as the primary mechanism controlling sloughing. Besides the title, key sentences include “opening was driven by a physical process and specifically by the shear forces associated with flow through the biofilm”, “The opening of the PFPs is driven by flow-induced shear stress, which increases as a PFP becomes narrower due to microbial growth, causing biofilm compression and rupture.” While pressure differences are measured as indicators of system state and do contribute to normal compression stresses, their mechanistic explanation emphasizes that narrowing PFPs experience increased shear rates that eventually exceed the biofilm's yield stress, triggering viscoplastic deformation and detachment. The pressure buildup is a hydraulic consequence of narrowing rather than the direct cause of sloughing. In contrast, our work demonstrates that in confined geometries, pressure differences generate tangential stresses at the biofilm-solid interface that directly drive detachment.

      (4) The flow control strategy represented in Fig. 1 is not explained and should be detailed in the Methods section.

      The methods section reads as follows. Inoculation and flow experiments: BHI suspensions were adjusted to an optical density of OD640nm = 0.2 (10⁸ CFU/mL) and inoculated inside the microchannels from the outlet, up to approximately ¾ of the channel length in order to keep a clean inlet. The system was left at room temperature (25°C) for 3 h under static conditions. Flow experiments were then performed at 0.02, 0.2, 2, 20 and 200 μL/min constant flow rates for 72 h in the microchannels at room temperature. For the experiments at 0.2, 2, 20 and 200 μL/min, the fluidic system was based on a sterile culture medium reservoir pressurized by a pressure controller (Fluigent FlowEZ) and connected with a flow rate controller (Fluigent Flow unit). The flow rate was maintained constant by using a controller with a feedback loop adjusting the pressure in the liquid reservoir. The reservoir was connected to the chip using Tygon tubing (Saint Gobain Life Sciences Tygon™ ND 100-80) of 0.52 mm internal diameter and 1.52 mm external diameter, along with PEEK tubing (Cytiva Akta pure) with 0.25 mm inner diameter adapters for the flow rate controller. The waste container was also pressurized by another independent pressure controller to reduce air bubble formation in the inlet part. For the experiments at 0.02 μL/min, we used a Harvard Phd2000 syringe pump for the flow.

      (5) Including images of the actual biofilms formed in a portion of the channel would aid in understanding the analysis presented in Fig. 2.

      Images are introduced later on (e.g., Figure 5). The supplementary material also includes videos.

      (6) The boundary conditions used to calculate the stress in the developed model should be discussed. The authors should specify why biofilm porosity is neglected.

      We have added a detailed discussion in the supplementary (Section I.2).

      (7) In the first section of the Results, the authors hypothesize that heterogeneity in biofilm development could be due to oxygen limitation. However, given the high oxygen permeability of PDMS, this hypothesis is later denied by their data. It would be prudent to avoid this hypothesis initially to streamline the presentation. Additionally, the authors should specify how oxygen levels at the inlet and outlet are measured.

      We appreciate this comment and agree that streamlining would simplify the presentation. However, after careful consideration, we have chosen to retain the oxygen limitation hypothesis for the following reasons: (1) oxygen limitation is a frequently invoked mechanism in biofilm systems and deserves explicit consideration, (2) it is not immediately obvious that oxygen remains non-limiting in larger microchannels where transverse gradients could develop, and (3) systematically eliminating this plausible alternative hypothesis strengthens our mechanistic conclusion that BHI drives the observed heterogeneity. Regarding oxygen measurements: we did not directly measure dissolved oxygen concentrations. Our approach is only indirect.

      (8) What is the standard deviation of the doubling time measured at different flows (page 9)?

      We have indicated the standard deviation in the text. Note that the graph shows the SEM.
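For readers comparing the value quoted in the text with the error bars in the graph, the relation between the two quantities can be sketched as follows. This is a minimal illustration; the doubling times below are hypothetical placeholders, not measurements from the study.

```python
import math
import statistics


def sd_and_sem(values):
    """Return (sample standard deviation, standard error of the mean)."""
    sd = statistics.stdev(values)       # sample SD, n - 1 in the denominator
    sem = sd / math.sqrt(len(values))   # SEM = SD / sqrt(n)
    return sd, sem


# Hypothetical doubling times in minutes, for illustration only.
doubling_times = [100.0, 110.0, 120.0]
sd, sem = sd_and_sem(doubling_times)
print(f"SD = {sd:.2f} min, SEM = {sem:.2f} min")
```

Because the SEM divides the SD by the square root of the number of measurements, error bars drawn as SEM are always narrower than the SD reported in the text.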

      (9) What is the "zone of interest" in the channel mentioned on page 9?

      We have added the following sentence to clarify: To further understand this effect, let us consider the mass balance of biofilm in the zone of interest -- the zone where biofilm grows in between the two UVC irradiation zones -- in the channel.

      (10) Minor and major detachment events should be classified based on a defined threshold or criteria, and their frequency should be measured.

      We appreciate the reviewer's concern about quantitative rigor. However, we respectfully disagree that imposing arbitrary thresholds to classify 'minor' vs. 'major' events would improve our analysis. Detachment events in our system span a continuum of magnitudes, and any threshold would be artificial and potentially misleading. Our quantitative characterization of detachment dynamics is provided through the statistical analysis of interevent times, which we show follow a gamma distribution. This stochastic framework captures the full spectrum of detachment behavior without requiring arbitrary binning. The terms 'minor' and 'major' in our manuscript are used qualitatively to illustrate the range of observed phenomena, not as formal classifications.

      (11) Have the authors identified a reason for the peaks in the volume fraction in the Δpsl mutants at the highest flow rate?

      The biofilm thickness following these sloughing events is below our detection limit, consistent with a residual layer of cells. However, these cells grow, leading to a time window where the fraction is measurable, before a new detachment event occurs. Our understanding is that the psl mutant forms a weaker matrix with a much lower threshold for sloughing.

      (12) The fit of the probability density function for the relative density function does not match the data well. The authors should comment on this.

      We have added quantile-quantile (Q-Q) plots for the Gamma distribution fits of inter-event times to the Supplementary Materials (Supplementary Figure S20). These plots demonstrate that the sample quantiles track the theoretical Gamma quantiles across all flow rates (0.2, 2, and 20 μL/min), indicating that the Gamma distribution provides a reasonable approximation of the overall distributional behavior. For detachment amplitudes, we selected the lognormal distribution based on the observed high skewness and kurtosis in the data, which are characteristic signatures of lognormal processes. Formal goodness-of-fit tests (chi-square, Kolmogorov-Smirnov) yielded mixed results across datasets, passing for some while failing for others. This variability reflects inherent noise from measurements, discrete temporal sampling, automated detection thresholds, and intrinsic biological variability. Importantly, our goal is to capture essential distributional characteristics for input into the stochastic model, not to achieve perfect statistical fit across all individual datasets. The Q-Q plots confirm that these distributions provide reasonable approximations, and the qualitative agreement between model predictions and experimental observations validates this modeling approach. We have revised the Methods section to clarify this rationale.
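As an illustration of the kind of fit discussed above, the sketch below estimates Gamma parameters for inter-event times and lognormal parameters for detachment amplitudes. It is a sketch under stated assumptions: the estimator is a simple method of moments chosen for brevity (the fitting procedure used in the manuscript is not specified here), the function names are hypothetical, and the samples are synthetic rather than experimental.

```python
import math
import random
import statistics


def fit_gamma_moments(samples):
    """Method-of-moments estimate of the Gamma shape k and scale theta."""
    mean = statistics.fmean(samples)
    var = statistics.variance(samples)
    return mean * mean / var, var / mean


def fit_lognormal(samples):
    """Estimate lognormal (mu, sigma) from the logarithms of positive samples."""
    logs = [math.log(x) for x in samples]
    return statistics.fmean(logs), statistics.stdev(logs)


rng = random.Random(0)
# Synthetic data standing in for inter-event times and relative amplitudes.
inter_event = [rng.gammavariate(2.0, 3.0) for _ in range(5000)]
amplitudes = [rng.lognormvariate(-2.0, 0.5) for _ in range(5000)]

k, theta = fit_gamma_moments(inter_event)
mu, sigma = fit_lognormal(amplitudes)
print(f"Gamma: k = {k:.2f}, theta = {theta:.2f}")
print(f"Lognormal: mu = {mu:.2f}, sigma = {sigma:.2f}")
```

With real data, the fitted parameters would then feed a Q-Q comparison of sample quantiles against the theoretical quantiles of the fitted distribution.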

      (13) Additionally, the simulated fraction appears very flat, with limited detachments compared to experiments. Why?

      The model captures the essential dynamics of growth-detachment cycles, including the characteristic timescales and volume fraction ranges. Some event-to-event variability in the experimental data likely reflects biological stochasticity not captured by our current approach—for example, variations in local biofilm mechanical properties or matrix composition that affect the precise stress at which sloughing occurs. While incorporating such biological variability as a stochastic parameter would improve detailed agreement, it would require extensive additional characterization beyond the scope of this study. The current model successfully reproduces the key qualitative and semi-quantitative features of the system.

      (14) The methods section should include a more detailed explanation of how the model was validated against experimental data.

      Model validation was performed by comparing predicted biofilm volume fraction time series and sloughing event statistics against experimental observations across multiple flow rates. The model reproduces the characteristic growth-sloughing cycles, timescales, and steady-state volume fractions without additional parameter fitting beyond the experimentally measured distributions.

      (15) It would be useful to include information on the reproducibility of the experiments and any variations observed between replicates.

      Experiments were performed in N=3 biological replicates. Individual time series for all replicates are shown in Supplementary Figures, demonstrating consistent behavior across replicates.

      (16) A discussion of the limitations of the study, particularly regarding the assumptions made in the modeling and their potential impact on the results, would strengthen the paper.

      We have added a discussion on why we chose to neglect the porosity of the biofilm, and strengthened parts on the uniform biofilm layer assumption.

      Reviewer #2 (Recommendations For The Authors):

      Page 2: "A vast" —> "The vast"

      Changed.

      The text and line widths on many of the figures are far too small. I printed it out at normal size, but had to look at a PDF and magnify to actually see what the graphs are showing. Fig. 9c is particularly illegible.

      Changed.

      Fig. 1 caption "photonic" —> "optical"?

      Changed.

      Can you spell out the actual mathematical definition of 𝜙 on page 5 when it is introduced? Currently it just says the "cross section volume fraction of the biofilm", but that seems potentially ambiguous. It is valid to say that this is "fraction of the cross section occupied by the biofilm"?

      Changed.

      Bottom of page 5: can you state the physical interpretation of the assumption that M is bounded between 0 and 1. i.e. that growth is larger than detachment?

      There is a comment on this in the paper, which reads: “In assuming that M ∈ ]0, 1] and eliminating cases where M > 1, we have not considered situations of systematic detachment 𝜙_equ = 0 for any value of the concentration, since this is not a situation that we encountered experimentally.” This comes just after the expression for the only non-trivial steady state is presented, as it becomes easier to explain the consequences of the initial choice at that point.

      Currently the choice of detachment initially used in the model is a bit confusing. You say that you are going to assume a (1-𝜙)-1 model for simplicity (bottom of page 5), but then later you find that the (1-𝜙)3/4 model is more accurate (page 16). Since the latter has already been confirmed in numerous other studies, why not start with that one from the beginning?

      We thank the reviewer for this important question, which highlights an area where our presentation could be clearer. We did not find that the (1-φ)^(-3/4) model is "more accurate." Rather, we deliberately chose the (1-φ)^(-1) scaling because it captures pressure-induced detachment, which we hypothesized would dominate in confined flows where biofilms clog a large portion of the channel. The (1-φ)^(-3/4) scaling, widely used in previous studies, describes shear stress at the biofilm/fluid interface and was developed primarily for reactor systems where pressure effects are negligible. Our analysis on page 16 validates this choice by demonstrating that pressure stress indeed exceeds shear stress when the volume fraction is large, which corresponds to late Stage I and all of Stage II, precisely where our model is applied. The excellent quantitative agreement between predicted and measured φ_max values across flow rates (Fig. 7f, Table 1) further supports the (1-φ)^(-1) scaling. We recognize that our initial presentation may have suggested the (1-φ)^(-1) choice was merely for "simplicity." We have revised this section to emphasize that this scaling was chosen specifically to capture pressure-driven detachment in confined geometries, with the physical justification provided by the stress analysis that follows. We have also clarified the text on page 16 to state clearly that (1-φ)^(-3/4) is never used. We could alternatively use a multi-modal detachment function combining both scalings, but the data do not require this additional complexity.

      In general, the models you derived in this study could be better contrasted with that from previous works. e.g. can you compare your Eqn (4) with the steady-state solutions obtained by other previous studies? Is this consistent with previous works or different? (aside from framing the biofilm thickness in terms of 𝜙)

      We are currently working on a paper dedicated to modeling biofilm development in confined flows, which will do a better job at comparing approaches.

      Top of page 6 - you assume K* = 0.1 - Does this assume that cells grow at half the rate in 0.1X BHI as they do in 1X BHI? Has this been confirmed experimentally or is this just a guess?

      This was estimated rather than measured directly. Model predictions were far more sensitive to the Damköhler number than to the value of K.

      "radial" is used widely in this paper, but you are using a square geometry. Is "transverse" a better choice?

      Yes, it clearly is. It has been changed.

      Fig 3. Are panels (a) and (b) showing different bioreps of the same condition? If so, please spell that out in the caption.

      There was an error in the caption of panel (a), which has been corrected. The correspondence is between panels (a) and (c); these are exactly the same, not bioreps.

      In multiple places it noted that the change in hydraulic resistance is correlated with the "change in biofilm colonization." Why not demonstrate this directly using a cross correlation analysis? How is the latter connected to the 𝜙 parameter? (e.g. is this d(𝜙)/dt?)

      We thank the reviewer for this suggestion. To clarify: φ(t) represents the volume fraction of biofilm in the channel. We measure this in two independent ways: (1) φ(t) from hydraulic resistance (black line in Fig. 3), i.e., calculated from pressure measurements using φ = 1 - √(R₀/R(t)), assuming uniform layer growth (see Methods section "Data analysis for the calculation of hydraulic resistance and volume fraction"), and (2) φ(t) from fluorescence (green squares in Fig. 3), i.e., estimated from integrated GFP intensity or image segmentation of the glass/liquid interface. The reviewer is correct that we should quantify this relationship directly. We have now added a correlation analysis between these two independent measurements of φ (new Supplementary Figure S21). The analysis shows a strong positive correlation, with r-values ranging from 0.68 to 0.77 across all flow rates. This validates two key aspects of our approach: (1) the uniform layer assumption used to convert R(t) to φ(t) is reasonable, and (2) the pressure-based measurements accurately capture the dynamics visible in fluorescence imaging, including both growth phases and sloughing events. The strong agreement is particularly notable given that these measurements probe different aspects of the biofilm: hydraulic resistance is sensitive to the three-dimensional obstruction of flow, while fluorescence captures primarily the biofilm attached to the glass surface within our focal plane. Their correlation supports the model assumptions. We have revised the manuscript to clarify this relationship and present the correlation analysis.
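The conversion and comparison described above can be sketched in a few lines. This is only an illustration: the relation φ = 1 - √(R₀/R(t)) is taken from the response, while the function names, the resistance values, and the fluorescence-based fractions are hypothetical and do not come from the experiments.

```python
import math


def phi_from_resistance(r0, r_t):
    """Volume fraction from hydraulic resistance, phi = 1 - sqrt(R0 / R(t))."""
    return 1.0 - math.sqrt(r0 / r_t)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical time series: resistance relative to the clean channel (R0 = 1)
# and an independent fluorescence-based estimate of the volume fraction.
r0 = 1.0
resistance = [1.0, 1.5, 2.5, 4.0, 2.0, 3.0]
phi_pressure = [phi_from_resistance(r0, r) for r in resistance]
phi_fluorescence = [0.02, 0.15, 0.40, 0.45, 0.35, 0.38]

r = pearson_r(phi_pressure, phi_fluorescence)
print(f"Pearson r = {r:.2f}")
```

A fourfold increase in resistance corresponds to φ = 0.5 under this relation, which shows how steeply resistance rises as the channel fills.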

      Top of page 9 - a doubling time of 110 mins is reported in liquid culture - is this in shaken or static conditions? Can you provide some data on how this was calculated? (e.g. on a plate reader?) Do you think your measurements in the microfluidics could be affected by attachment/detachment of cells, rather than being solely driven by division? It is curious that your apparent growth rate varies by a factor of two across the different flow rates and there is not a monotonic dependency. Both attachment and detachment would depend on the flow rate (with some non-trivial dependencies), e.g. https://www.pnas.org/doi/10.1073/pnas.2307718120 and https://doi.org/10.1016/j.bpj.2010.11.078

      Given that your doubling time in the microfluidics is solely based on changes in cell number (rather than directly tracking cell divisions), it seems possible your results here are measuring the combined effect of growth, attachment and detachment, rather than just growth.

      We agree with those comments regarding the doubling time measurement. We have added a description of how we performed the doubling time measurement in the Methods section.

      Page 9 - you discuss the role of EPS here, but the effect of EPS is not demonstrated here and this is muddled with a discussion about the non-linearity of the putative dependency. Maybe this would be on a firmer footing if you save the discussion of EPS for the section on the Psl and Pel mutants?

      Changed.

      Middle of page 9: Please define what "smooth detachment" means and contrast it with catastrophic sloughing. Also, please define what you mean by "flow, seeding, and erosion" detachment are and how these three things differ from one another.

      We have clearly defined each term in the revised version.

      The results from wavelet scalograms seem to be underutilised and not well described. Can you clearly say in the caption what time series this analysis has been calculated on, e.g. hydraulic resistance? Other than simply pointing out the "blue stripes", what can be gained from this analysis that could not be obtained with another method? It would be great if the basic features of this plot could be more fully discussed (e.g. is the curved envelope at the bottom caused by edge effects?)

      We have improved the text, captions and Methods section following the reviewer’s comment.

      Fig. 5 a and b - please list the time at which each of these images were taken. Do these have the same dt between the two sets of images?

      Yes, the dt is the same (30 minutes). This is now indicated in the caption.

      Fig. 6: you have significant 2D variation in the biofilm width along the length of the channel. The relative contribution of pressure and shear based detachment will be different at different positions along the length. However, this variation is ignored in your model. Can you please comment on this in your manuscript and how it might affect the interpretation of your results? e.g. would the longitudinally averaged description yield the same result as one that takes the geometry into account (on average)?

      Our model indeed assumes longitudinally averaged properties. A more detailed spatially resolved model would be valuable for capturing heterogeneities and will be explored in future work.

      Bottom of page 11: you say standard deviations are in the range of 10-3. How does this jibe with the error bars on the middle flow rate in Fig. 7e?

      This extremely low standard deviation applies only to the maximum value of 𝜙 and is a completely different measurement from the box-and-whisker plots presented in Fig. 7e.

      Fig. 7: You are calculating the "Fraction" here. Is this "𝜙"? If so, can you put that on the y-axis instead? You calculate the volume fraction two different ways e.g. with hydraulic resistance and with imaging. Is only one of these shown in (e)? Is the same powerlaw dependence shown in (f) conserved when the other measurement of the "fraction" is used? Can you include both in Fig. 7e?

      We have modified the axis and indicated 𝜙.

      (e) is calculated only from hydraulic resistance. This is the most precise measurement to evaluate 𝜙 quantitatively.

      Related to the previous comment: Some of the estimates of 𝜙max in Table 1 are obtained by fitting the model to integrated fluorescence data (Fig. 2b), while others are estimated from measurements of the hydraulic resistance. The former yields non-unique sets of parameters. Can the biofilm fraction instead actually be estimated directly from fluorescent imaging by segmenting biofilm and directly calculating how much of the cross section is occupied by cells on average across the length? This seems like a more direct measure of this quantity. Given there are multiple ways of estimating the same parameter, it would be better consistency checking to make sure that different methods actually yield the same result.

      We have now added a direct comparison of these two measurement methods in Fig. S21. The two are strongly correlated. Microscopy is more direct but only provides 2D pictures. Hydraulic resistance provides a 3D measurement, but relies on a model of biofilm distribution. Both are imperfect, but they correlate well. In particular, we see that the 2D measurement does capture sloughing.

      You cite a large number of supplemental figures (e.g. Fig. S21 on page 12), but the figures in your SI only go up to 11.

      We have revised references to supplementary figures.

      Bottom of page 11: Your data from liquid culture suggests that your psl mutant grows at half the rate of WT cells. Is that consistent with your microfluidic data (e.g. Fig. 8)? If not, might this be a sign that your growth rate analyses from the microfluidics might be affected by attachment/detachment? (see comment above) Psl cells should detach much more easily.

      The approach taken to measure doubling times in the microfluidic system does not rely on the macroscopic measurements presented in Fig. 8, but rather on the approach presented in Fig. 4. These measurements require specific imaging (different magnification and time stepping) and we did not perform such experiments for the mutants.

      In analyses of sloughing, you fit the times between the jumps and the relative amplitude. Are these two random variables correlated with one another? Might that influence your results? Your methods say that "jumps were identified through the selection of local maxima" of the derivative. Do you mean to say "minima" here? Did you keep all local maxima/minima or did you have a threshold?

      We treat these as two random variables that are not correlated with one another. This is an assumption, and it would be interesting to analyze whether they are in fact correlated. To perform this analysis, we believe that we would first need to acquire more data and more replicates to improve the statistics.

      Yes, it was minima (in the code we make everything positive, hence the confusion).

      Yes, there is a threshold on the value of the jump itself. This value is extremely low and essentially filters out noise.
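The detection logic described in these answers can be sketched as follows. This is a hedged illustration rather than the code used in the study: the function name `detect_jumps`, the finite-difference derivative, the threshold value, and the synthetic volume-fraction series are all assumptions made for the example.

```python
def detect_jumps(phi, dt, threshold):
    """Return indices i where phi drops sharply between samples i and i + 1.

    The time derivative is negated so that a detachment appears as a positive
    peak (a local minimum of the raw derivative). A peak is kept only if the
    corresponding drop in phi exceeds `threshold`, which filters out noise.
    """
    neg_rate = [-(phi[i + 1] - phi[i]) / dt for i in range(len(phi) - 1)]
    jumps = []
    for i, rate in enumerate(neg_rate):
        left = neg_rate[i - 1] if i > 0 else float("-inf")
        right = neg_rate[i + 1] if i < len(neg_rate) - 1 else float("-inf")
        is_peak = rate > left and rate >= right
        if is_peak and rate * dt > threshold:
            jumps.append(i)
    return jumps


# Synthetic volume fraction: steady growth, two large drops, one small dip.
phi = [0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.34, 0.44, 0.20, 0.30]
jumps = detect_jumps(phi, dt=0.5, threshold=0.05)
print(jumps)
```

In this synthetic series the two large drops are retained, while the small dip of 0.01 falls below the threshold and is discarded as noise.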

      Fig. 9 - can you make it clearer in the caption what timeseries you are analysing here? I understand from the methods that this is the "volume fraction." The data/fits are difficult to see in Fig. 9b and impossible to see in Fig. 9c because the green bars get in the way of the other two data sets. Can this visualisation be improved? It is not clear to me how good of a job the Gamma and log-normal fits are actually doing.

      We have clarified that histograms are calculated from all experiments/replicates.

      We have slightly modified the graph to make it clearer. This comparison is intrinsically hard, partly because it compares discrete data with continuous PDFs.

      The results from the stochastic sloughing model are described as 'strikingly similar to experimental data', but this seems to be based on a qualitative analysis of the lines in Fig. 7 d, e, and f. The experimental data are not plotted in the same graph, nor are the experimental data that we should be comparing this to cited in the text/caption.

      We have added a note in the caption to indicate which figure it can be compared to.

    1. Reviewer #2 (Public review):

      Summary:

      This paper presents a novel transformer-based neural network model, termed the epistatic transformer, designed to isolate and quantify higher-order epistasis in protein sequence-function relationships. By modifying the multi-head attention architecture, the authors claim they can precisely control the order of specific epistatic interactions captured by the model. The approach is applied to both simulated data and ten diverse experimental deep mutational scanning (DMS) datasets, including full-length proteins. The authors argue that higher-order epistasis, although often modest in global contribution, plays critical roles in extrapolation and capturing distant genotypic effects, especially in multi-peak fitness landscapes.

      Strengths:

      (1) The study tackles a long-standing question in molecular evolution and protein engineering: "how significant are epistatic interactions beyond pairwise effects?" The question is relevant given the growing availability of large-scale DMS datasets and increasing reliance on machine learning in protein design.

      (2) The manuscript includes both simulation and real-data experiments, as well as extrapolation tasks (e.g., predicting distant genotypes, cross-ortholog transfer). These well-rounded evaluations demonstrate robustness and applicability.

      (3) The code is made available for reproducibility.

      Weaknesses:

      (1) The paper mainly compares its transformer models to additive models and occasionally to linear pairwise interaction models. However, other strong baselines exist. For example, the authors should compare baseline methods such as "DANGO: Predicting higher-order genetic interactions". There are many works related to pairwise interaction detection, such as: "Detecting statistical interactions from neural network weights", "shapiq: Shapley interactions for machine learning", and "Error-controlled non-additive interaction discovery in machine learning models".

      (2) While the transformer architecture is cleverly adapted, the claim that it allows for "explicit control" and "interpretability" over interaction order may be overstated. Although the 2^M scaling with MHA layers is shown empirically, the actual biological interactions captured by the attention mechanism remain opaque. A deeper analysis of learned attention maps or embedding similarities (e.g., visualizations, site-specific interaction clusters) could substantiate claims about interpretability.

      (3) The distinction between nonspecific (global) and specific epistasis is central to the modeling framework, yet it remains conceptually underdeveloped. While a sigmoid function is used to model global effects, it's unclear to what extent this functional form suffices. The authors should justify this choice more rigorously or at least acknowledge its limitations and potential implications.

      (4) The manuscript refers to "pairwise", "3-4-way", and ">4-way" interactions without always clearly defining the boundaries of these groupings or how exactly the order is inferred from transformer layer depth. This can be confusing to readers unfamiliar with the architecture or with statistical definitions of interaction order. The authors should clarify terminology consistently. Including a visual mapping or table linking a number of layers to the maximum modeled interaction order could be helpful.
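A mapping of the kind requested in point (4) could look like the sketch below. It assumes, based only on the "2^M scaling" mentioned in point (2), that a model with M multi-head attention (MHA) layers captures specific interactions up to order 2^M; the function name is hypothetical and the mapping should be checked against the manuscript.

```python
def max_interaction_order(num_mha_layers):
    """Highest interaction order assumed reachable with M MHA layers (2**M)."""
    if num_mha_layers < 0:
        raise ValueError("number of layers must be non-negative")
    return 2 ** num_mha_layers


# Print a small lookup table: layers -> maximum modeled interaction order.
for layers in range(1, 4):
    order = max_interaction_order(layers)
    print(f"{layers} MHA layer(s): interactions up to order {order}")
```

Under this assumption, one layer corresponds to pairwise interactions, two layers to interactions of up to four sites, and three layers to interactions of up to eight sites.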

      Comments for the revision:

      I want to thank the authors for their efforts in revising the manuscript. Most of the concerns raised in the initial review have been adequately addressed.

      However, one important issue remains. I previously asked the authors to benchmark their method against stronger baselines. The authors declined, arguing that these alternatives are "not directly applicable to the types of analyses." I am not persuaded by this rationale. In my view, these baseline methods target essentially the same underlying problem, and at least some, if not all, should be included in a comparative evaluation (or the manuscript should provide a clearer, more technically grounded explanation of why such comparisons are not feasible or not meaningful).

    2. Reviewer #3 (Public review):

      Summary:

      Sethi and Zou present a new neural network to study the importance of epistatic interactions in pairs and groups of amino acids to the function of proteins. Their new model is validated on a small simulated data set, and then applied to 10 empirical data sets. Results show that epistatic interactions in groups of amino acids can be important to predict the phenotype of a protein, especially for sequences that are not very similar to the training data.

      Strengths:

      The manuscript relies on a novel neural network architecture that makes it easy to study specifically the contribution of interactions between 2, 3, 4 or more amino acids. The novel network architecture achieves such a level of interpretability without noticeable performance penalty. The study of 10 different protein families shows that there is variation among protein families in the importance of these interactions, and that higher order interactions are particularly important to predict the phenotypes of distant proteins.

      Weaknesses:

      The Github repository provides a README file to run a standard pipeline, but a user will need to go through the code to actually know what that pipeline is doing.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The authors present an approach that uses the transformer architecture to model epistasis in deep mutational scanning datasets. This is an original and very interesting idea. Applying the approach to 10 datasets, they quantify the contribution of higher-order epistasis, showing that it varies quite extensively.

      Suggestions:

      (1) The approach taken is very interesting, but it is not particularly well placed in the context of recent related work. MAVE-NN, LANTERN, and MoCHI are all approaches that different labs have developed for inferring and fitting global epistasis functions to DMS datasets. MoCHI can also be used to infer multidimensional global epistasis (for example, folding and binding energies) and also pairwise (and higher order) specific interaction terms (see 10.1186/s13059-024-03444-y and 10.1371/journal.pcbi.1012132). It would not detract from the current work to better introduce these recent approaches in the introduction. A comparison of the different capabilities of the methods may also be helpful. It may also be interesting to compare the contributions to variance of 1st, 2nd, and higher-order interaction terms estimated by the Epistatic transformer and MoCHI.

      We thank the reviewer for the very thoughtful suggestion.

      Although these methods are conceptually related to our method, none of them can be realistically used to perform the type of inference we have done in the paper on most of the datasets we used, as they all require explicitly enumerating the large number of interaction terms.

      We have included new text (Lines 65-74) in the introduction to discuss the advantages and disadvantages of these models. We believe this places our contribution better in the broader context of the field.

      (2) https://doi.org/10.1371/journal.pcbi.1004771 is another useful reference that relates different metrics of epistasis, including the useful distinction between biochemical/background-relative and background-averaged epistasis.

      We have included this very relevant reference in the introduction. We also pointed out that a limitation of this class of methods is that they typically require near combinatorially complete datasets and often have to rely on regularized regression to infer the parameters, making the inferred model parameters disconnected from their theoretical expectations (Lines 49-56).

      (3) Which higher-order interactions are more important? Are there any mechanistic/structural insights?

      We thank the reviewer for pointing out this potential improvement. We have now included a detailed analysis of the GRB2-SH3 abundance landscape in the final section of the results. In particular, we estimated the contribution of individual amino acid sites to different orders (pairwise, 3-4th order, 4-8th order) of epistasis and discussed our findings in the context of the 3D structure of this domain. We also analyzed the sparsity of specific interactions among subsets of sites.

      Please see Results section “Architecture of specific epistasis for GRB2-SH3 abundance.”

      Reviewer #2 (Public review):

      Summary:

      This paper presents a novel transformer-based neural network model, termed the epistatic transformer, designed to isolate and quantify higher-order epistasis in protein sequence-function relationships. By modifying the multi-head attention architecture, the authors claim they can precisely control the order of specific epistatic interactions captured by the model. The approach is applied to both simulated data and ten diverse experimental deep mutational scanning (DMS) datasets, including full-length proteins. The authors argue that higher-order epistasis, although often modest in global contribution, plays critical roles in extrapolation and capturing distant genotypic effects, especially in multi-peak fitness landscapes.

      Strengths:

      (1) The study tackles a long-standing question in molecular evolution and protein engineering: "how significant are epistatic interactions beyond pairwise effects?" The question is relevant given the growing availability of large-scale DMS datasets and increasing reliance on machine learning in protein design.

      (2) The manuscript includes both simulation and real-data experiments, as well as extrapolation tasks (e.g., predicting distant genotypes, cross-ortholog transfer). These well-rounded evaluations demonstrate robustness and applicability.

      (3) The code is made available for reproducibility.

      We thank the reviewer for the positive feedback.

      Weaknesses:

      (1) The paper mainly compares its transformer models to additive models and occasionally to linear pairwise interaction models. However, other strong baselines exist. For example, the authors should compare baseline methods such as "DANGO: Predicting higher-order genetic interactions." There are many works related to pairwise interaction detection, such as: "Detecting statistical interactions from neural network weights", "shapiq: Shapley interactions for machine learning", and "Error-controlled non-additive interaction discovery in machine learning models."

      We thank the reviewer for this very helpful comment. These references are indeed conceptually quite similar to our framework. Although they are not directly applicable to the types of analyses we performed in this paper (partitioning contribution of epistasis into different interaction orders in terms of variance components), we have included a discussion of these methods in the introduction (Line 70-74). We believe this helps better situate our method within the broader conceptual context of interpreting machine learning models for epistatic interactions.

      (2) While the transformer architecture is cleverly adapted, the claim that it allows for "explicit control" and "interpretability" over interaction order may be overstated. Although the 2^M scaling with MHA layers is shown empirically, the actual biological interactions captured by the attention mechanism remain opaque. A deeper analysis of learned attention maps or embedding similarities (e.g., visualizations, site-specific interaction clusters) could substantiate claims about interpretability.

      Again, we thank the reviewer for the thoughtful comment. We have addressed this comment together with a related comment by Reviewer1 by including a detailed analysis of the GRB2-SH3 landscape using a marginal epistasis framework, where we quantified the contribution of individual sites to different orders of epistasis as well as the sparsity of epistatic interactions. We also present these results in the context of the structure of this protein. Please see Results section “Architecture of specific epistasis for GRB2-SH3 abundance.”

      (3) The distinction between nonspecific (global) and specific epistasis is central to the modeling framework, yet it remains conceptually underdeveloped. While a sigmoid function is used to model global effects, it's unclear to what extent this functional form suffices. The authors should justify this choice more rigorously or at least acknowledge its limitations and potential implications.

      We agree that the underparameterization of the simple sigmoid function could potentially be confounding. We did compare different choices of functional forms for modeling global epistasis. Overall, we found no difference between a simple sigmoid function with four trainable parameters and the more complex version (a sum of multiple sigmoid functions, as used by popular methods such as MAVENN). Therefore, all results we presented in the paper were based on the model with a single scalable sigmoid function.

      We have added relevant text; line 153-158. We have also included side-by-side comparisons of the model performance for the GRB-abundance and the AAV2 dataset to corroborate this claim (Supplemental Figure 1).
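      As a rough illustration of the comparison described in this response, the sketch below places a single scalable sigmoid with four trainable parameters next to a sum-of-sigmoids alternative. The parameterization and the names `lower`, `scale`, `slope`, and `shift` are assumptions made for illustration; they are not taken from the authors' code or from MAVENN.

```python
import numpy as np

def scaled_sigmoid(phi, lower, scale, slope, shift):
    """Single sigmoid with four trainable parameters (assumed form):
    lower + scale / (1 + exp(-(slope * phi + shift)))."""
    phi = np.asarray(phi, dtype=float)
    return lower + scale / (1.0 + np.exp(-(slope * phi + shift)))

def sum_of_sigmoids(phi, lower, scales, slopes, shifts):
    """More flexible alternative: an offset plus a sum of K sigmoids."""
    phi = np.asarray(phi, dtype=float)[..., None]  # broadcast over K terms
    scales, slopes, shifts = (np.asarray(a, dtype=float) for a in (scales, slopes, shifts))
    terms = scales / (1.0 + np.exp(-(slopes * phi + shifts)))
    return lower + terms.sum(axis=-1)

# Toy latent phenotypes (hypothetical values, not data from the paper).
phi = np.linspace(-5.0, 5.0, 11)
single = scaled_sigmoid(phi, lower=0.0, scale=1.0, slope=1.0, shift=0.0)
multi = sum_of_sigmoids(phi, 0.0, scales=[1.0], slopes=[1.0], shifts=[0.0])
```

      With a single term, the sum-of-sigmoids form reduces to the four-parameter sigmoid, which is one way to see the simple model as a special case of the richer one.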

      (4) The manuscript refers to "pairwise", "3-4-way", and ">4-way" interactions without always clearly defining the boundaries of these groupings or how exactly the order is inferred from transformer layer depth. This can be confusing to readers unfamiliar with the architecture or with statistical definitions of interaction order. The authors should clarify terminology consistently. Including a visual mapping or table linking a number of layers to the maximum modeled interaction order could be helpful.

      We thank the reviewer for the thoughtful suggestion. We have rewritten the description of our metrics for measuring the importance of "pairwise", "3-4-way", and ">4-way" interactions; Line 232-239.

      We have also added a table to improve clarity, as suggested; Table 2.
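      The layer-to-order relationship the reviewer asks about can be sketched in a few lines, assuming the 2^M scaling mentioned in the review (M multi-head attention layers model interactions up to order 2^M). The helper names are hypothetical, and the output is an illustration rather than a reproduction of the authors' Table 2.

```python
def max_interaction_order(num_mha_layers: int) -> int:
    """Highest interaction order reachable with M attention layers, assuming 2**M scaling."""
    if num_mha_layers < 1:
        raise ValueError("need at least one attention layer")
    return 2 ** num_mha_layers

def order_group(order: int) -> str:
    """Assign an interaction order to the groupings used in the review."""
    if order < 2:
        raise ValueError("interaction order must be at least 2")
    if order == 2:
        return "pairwise"
    if order <= 4:
        return "3-4-way"
    return ">4-way"

for m in (1, 2, 3):
    k = max_interaction_order(m)
    print(f"{m} layer(s): orders up to {k} ({order_group(k)})")
```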

      Reviewer #3 (Public review):

      Summary:

      Sethi and Zou present a new neural network to study the importance of epistatic interactions in pairs and groups of amino acids to the function of proteins. Their new model is validated on a small simulated data set and then applied to 10 empirical data sets. Results show that epistatic interactions in groups of amino acids can be important to predict the function of a protein, especially for sequences that are not very similar to the training data.

      Strengths:

      The manuscript relies on a novel neural network architecture that makes it easy to study specifically the contribution of interactions between 2, 3, 4, or more amino acids. The study of 10 different protein families shows that there is variation among protein families.

      Weaknesses:

      The manuscript is good overall, but could have gone a bit deeper by comparing the new architecture to standard transformers, and by investigating whether differences between protein families explain some of the differences in the importance of interactions between amino acids. Finally, the GitHub repository needs some more information to be usable.

      We thank the reviewer for the thoughtful comments. We have listed our response below in the “Recommendations for the authors” section.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Some of the dataset labels are confusing. For example, GRB is actually the protein GRB2 and more specifically just one of the two SH3 domains from GRB2 (called GRB2-SH3 in Faure et al.).

      We thank the reviewer for catching this. Our original naming of the datasets followed the library numbering in the Faure et al. paper (which constructed 3 variant libraries and performed different assays on them). To avoid confusion (and also to save space in the figure titles), we have now renamed the datasets using this mapping:

      Author response table 1.

      Reviewer #3 (Recommendations for the authors):

      (1) What is the cost of the interpretability of the model? It would be interesting to evaluate how a standard transformer, complete with its many non-linearities, performs on the simulated 13-position data, using the r2 metric. This is important as the last sentence of the discussion seems to suggest that the model proposed by the authors could be used in other contexts, where perhaps interpretability would be less important.

      We thank the reviewer for this suggestion. We have run a generic transformer model on the GRB-abundance and AAV2 datasets. Overall, we found minimal difference between the generic model and our interpretable model, suggesting that fitting the interpretable transformer does not incur a significant cost in performance.

      We have included a side-by-side comparison of the performance of the generic transformer and our three-layer model in Supplemental Figure 5 and a discussion of this finding in Line 256-259.

      (2) The 10 data sets analyzed by the authors differ in their behaviour. I was wondering whether the proteins have different characteristics, beyond the number and distribution of mutants in the data sets. For instance, do high-order interactions play a bigger role in longer proteins, in proteins with more secondary structures, in more hydrophobic proteins?

      We fully agree that this is a highly relevant question. Unfortunately, the paucity of datasets suitable for the type of analyses we performed in the paper limits our ability to draw general conclusions. Furthermore, the differences in genotype distribution among the 10 datasets may be the main driving factor in the behaviors of the models.

      We included our thoughts on this issue in the discussion (Line 477-481).

      We will definitely revisit this question if this type of high-order combinatorial DMS data becomes more available in the (hopefully) near future.

      (3) Although the code appears to be available in the repository, there is no information about the content of the different folders, about what the different scripts do, or about how to reproduce the article's results. More work should be done to clarify it all.

      Thank you for pointing this out. We have substantially improved our github repository and included many annotations for reproducibility.

      (4) Typos and minor comments:

      (a) p3 "a multi-peak fitness landscapes": landscape.

      (b) p3 "Here instead of directly fitting the the regression coefficients in Eq. 2": remove 'the'.

      (c) p3 "neural network architectures do not allow us to control the highest order of specific epistasis": a word is missing.

      (d) p6 "up to 1,926, 3,014, and 4,102 parameters, respectively-all smaller than the size of the training dataset": it's not very clear what size of the dataset means: number of example sequences?

      (e) p6 "This results confirm": This result confirms.

      (f) p6 "to the convergence of of the variance components of the model landscape to the ground truth.": remove 'of'.

      (g) p7 "to characterize the importance higher-order interactions": the importance of.

      (h) p7 "The improvement varies across datasets and range": and ranges.

      (i) p9 "over the pairwise model is due to the its ability": remove 'the'.

      (j) p13 "This results suggest that pairwise": result suggests.

      (k) p13 "although the role assessed by prediction for randomly sampled genotypes seems moderate": sampled. Also, I'm not sure I understand this part of the sentence: what results are used to support this claim? It's not 6b, which is only based on the mutational model.

      This is in Supplemental Figure 7.

      (l) p13 "potentially by modeling how the these local effects": remove the.

      (m) p13 "We first note that the the higher-order models": remove the.

      (n) p15 "M layers of MHA leads to a models that strictly": lead to a model.

      (o) Supp Figure 1: "Solid lines shows the inverse": show.

      (p) Supp p 10 "on 90% of randomly sample data": sampled.

      (q) Supp p11 "Next, assume that Eq. 5 is true for m > 0. We need to show that Eq. 5 is also true for m + 1.": shouldn't it be m>=0 ? It seems important to start the recursive argument.

      Good catch.

      (r) Supp p11 "Since the sum in line 9 run through subsets": runs.

      (s) Supp p11 "we can further simplify Eq. 11 it to": remove it.

      We have fixed all these problems. We very much appreciate the reviewer’s attention.

    1. Nobody Gets Promoted for Simplicity
      • The Complexity Bias: Engineering cultures often reward over-engineered solutions because they provide a more "impressive" narrative for promotion packets and design reviews compared to simple, effective ones.
      • Hiring Pitfalls: The problem begins at the interview stage, where candidates are often pushed toward complex distributed systems even when a simple solution is more appropriate, teaching them that "simple isn't enough."
      • Unearned Complexity: While complexity is necessary for massive scale or multi-team coordination, many teams adopt "unearned complexity"—sharding databases or adding abstractions for problems that don't yet exist.
      • Invisible Virtues: Simplicity is harder to see and appreciate than complexity. Engineers who avoid problems by choosing simpler paths often find their contributions overlooked because there is no "big system" to point to.
      • Shifting the Default: To fix this, teams should change the questions they ask in reviews from "how do we scale this?" to "what is the simplest version we can ship?" and reward the deletion of code or the decision to defer unnecessary features.

      Hacker News Discussion

      • The "Google Sheets" Interview Trap: A popular anecdote in the thread involved a candidate suggesting Google Sheets to solve a spreadsheet-sharing problem. While pragmatic, it led to an awkward "failure" because the interviewer wanted to see software design skills, highlighting the disconnect between business value and interview expectations.
      • Interviewers vs. Candidates: Many users argued that if a candidate gives a simple, non-technical answer, a skilled interviewer should acknowledge it as correct and then pivot to a hypothetical scenario ("If we had to build it, how would you design it?") rather than penalizing the candidate.
      • Professionalism vs. Compliance: Some commenters believe developers should be "difficult" by pushing back against unnecessary builds to save company resources, while others argued that refusing to engage in a technical interview exercise is a red flag for future collaboration.
      • The Cost of "Acting Smart": There was a consensus that "acting smart" (building complex things) is often more financially rewarding than "being smart" (solving the problem simply), leading to an industry-wide "cargo culting" of complex architectures.
      • Visible Deletion: Several users noted that the most valuable engineers are often those who delete thousands of lines of code, but that organizations lack the metrics or cultural frameworks to celebrate and promote based on reduction rather than addition.
    1. While coding/agentic benchmarks [1] have proven useful for understanding AI capabilities, they typically sacrifice realism for scale and efficiency—the tasks are self-contained, don’t require prior context to understand, and use algorithmic evaluation that doesn’t capture many important capabilities. These properties may lead benchmarks to overestimate AI capabilities

      Main Claim: Using A.I. alongside coding leads to decreased efficiency and a weaker grasp of the actual goal or project.

    1. The volume of software increases significantly with AI coding, but higher volume could mean more bugs and more rework. Engineers have a lot more AI code to review and test.

      Because A.I. can produce such a large amount of code, that code may be more prone to bugs and errors.

    2. Developers are increasingly acting as curators, reviewers, integrators and problem-solvers—making them more strategic and valuable.

      A.I. is shifting the standards of software development, not stealing jobs. Developers will use A.I. both for assistance and for producing code.

    1. But now, the spread of A.I. programming tools, which can quickly generate thousands of lines of computer code — combined with layoffs at companies like Amazon, Intel, Meta and Microsoft — is dimming prospects in a field that tech leaders promoted for years as a golden career ticket.

      A.I. is taking jobs from computer science students, leaving the tech industry's promise null and void.

    1. AI detects bugs, vulnerabilities and inefficiencies early in the development cycle. AI-driven testing tools can generate test cases, prioritize critical tests and even run tests autonomously.

      Evidence: A.I. could improve the quality of software by running, testing, and debugging code.

    2. This automation allows developers to focus on higher-level tasks such as problem-solving and architectural design rather than code generation, bug detection and testing.

      Evidence: A.I. could take care of the "grunt" work while developers focus on high-end tasks.

    3. Engineers and developers now manage AI’s integration into the development process. They collaborate closely with AI systems and use their expertise to refine AI-generated outputs and make sure they meet technical requirements.

      Evidence: software engineers could use A.I. as a tool to assist and refine code from both humans and A.I.

    4. Tools such as generative AI, code completion systems and automated testing platforms reduce the need for engineers, developers and programmers to manually write code, debug or conduct time-consuming tests.

      Evidence: A.I. is removing the need for manual computer programming.

    5. AI is fundamentally redefining the role of software engineers and developers, moving them from code implementers to orchestrators of technology.

      Claim: A.I. is changing the role of software engineers into a much easier discipline.

    6. Gen AI-powered tools help developers focus on complex problems, while AI-driven autocompletion and real-time suggestions improve speed and accuracy.

      Evidence: A.I. could help write and generate efficient code.

    7. How AI is used in software development

      Evidence List:
      * Generates code, decreasing development time.
      * Fixes and catches bugs.
      * Tests code and ensures industry-level quality.
      * Could handle project management.
      * Can create documentation for a project.
      * Could improve code to increase optimization.
      * Increases the security of data-sensitive code.

    8. Tools such as IBM watsonx Code Assistant™, GitHub Autopilot and GitHub Copilot help developers write code faster and with fewer errors and can generate suggestions and autocomplete code.

      Evidence: A.I. tools assist programmers by creating code that's a lot more efficient and less time consuming.

    9. AI tools adapt and evolve by using machine learning models and deep learning techniques, which leads to more efficient coding practices and project outcomes.

      Gives a reason why A.I. writes good code and why its output will only improve.

    10. This capability accelerates coding, reduces human error and allows developers to focus on more complex and creative tasks rather than boilerplate code.

      Due to the nature of A.I. and how it's structured, A.I. helps produce clean, efficient code with few errors.

    1. Missing of region-based editing.

      Extend the augmentation code with region masking so that edits apply only to the masked region rather than to the whole image.

    1. run npm install. You may need to run this again after some changes made to the Web2Cit core source code.

      Am I sure I need this? The linked package is overwritten if I do this...

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Reviewer #1

      Evidence, reproducibility and clarity

      Summary of findings and key conclusions

      This manuscript asks how pharmacologic targeting of the outer mitochondrial membrane protein MIRO1 (RHOT1) with a MIRO1-binding compound (MR3) reshapes immunosuppressive programs in the glioma tumor microenvironment (TME). The core of the paper is a cross-species transcriptomic comparison that combines an in vivo mouse dataset with an ex vivo human perturbation dataset.

      Model systems and approach (as described):

      • Mouse in vivo: GL261-Luc intracranial glioma in C57BL/6J mice; MR3 is administered intracranially at the implantation site (10 µM in 5 µL DMSO) on days 11 and 18, and tumors are harvested on day 22 for single-nucleus RNA-seq (snRNA-seq).
      • Mouse snRNA-seq: NeuN-based nuclei sorting, 10x Genomics v3.1; alignment to mm10; Seurat-based integration and annotation. Tumor-cell calling is supported by CNV inference (SCEVAN/CopyKAT). One MR3-treated sample is excluded after QC, leaving 3 control vs 2 MR3-treated samples (11,940 NeuN− nuclei).
      • Human ex vivo: freshly resected glioma cores from 3 patients are cultured with 10 µM MR3 or DMSO for 24 h, followed by bulk RNA-seq (STAR alignment to hg19; DESeq2 for differential expression).
      • Cross-species integration: the analysis is restricted to 1:1 orthologs and protein-coding genes shared across datasets; inferred cell-cell signaling is explored with CellChat.

      Main findings (as presented):

      • MR3 shifts expression of a subset of glioma-associated genes toward a non-tumor-like direction ("rescued genes") and is associated with large changes in inferred cell-type composition in the mouse snRNA-seq dataset (including a marked drop in the fraction of nuclei annotated as tumor: 44.5% to 4.3%; Fig. 1E).
      • Across TCGA-vs-GTEx (glioma-upregulated genes) and three MR3 response analyses (mouse snRNA-seq, mouse pseudo-bulk, and human bulk RNA-seq), PARP11/Parp11 is reported as the only gene that is consistently upregulated in glioma and consistently downregulated by MR3 (Fig. 2B).
      • Within the mouse myeloid compartment, Parp11 is most enriched in MAC4 and MAC1, while MAC1 shows high Cd274 (Pdl1/PD-L1). MR3 reduces Parp11 in MAC4/MAC1 and reduces Cd274 in MAC1 (Fig. 2H).
      • CellChat analysis suggests that in controls MAC1 is the dominant sender of PD-L1/PD-1 signaling to CD8+ T cells (Fig. 3C), and that this PD-L1/PD-1 interaction is strongly diminished after MR3 (Fig. 3E).
      • The authors propose a paracrine model in which MAC4-derived PGE2 (via Ptges3) sustains Parp11 expression in MAC1 through cAMP/PKA/CREB, promoting PD-L1-mediated T-cell suppression; MR3 disrupts this circuitry (Fig. 4).

      Major comments

      1. Strength of the conclusions

      Two parts of the story felt well supported by the data as shown. First, the cross-species convergence on PARP11/Parp11 is a clear and potentially useful result (Fig. 2B). Second, the myeloid subclustering plus CellChat analysis makes a coherent case that PD-L1/PD-1 signaling in this model is dominated by a specific macrophage subset (MAC1) and changes after MR3 (Fig. 2H, Fig. 3).

      Where I was less convinced is when the manuscript moves from "transcriptomic and modeling evidence" to causal statements such as "MIRO1-mediated axis driving immunosuppression" and "MR3 reduces tumor burden by reactivating immunity." At the moment, several central inferences remain indirect:

      • Causality is inferred primarily from transcriptomic shifts and ligand-receptor inference rather than functional immune readouts.

      -We thank the Reviewer for the constructive evaluation. We have toned down the claims throughout the manuscript with tracking.

      • On-target attribution to MIRO1 hinges on MR3 being a MIRO1 binder; the study does not include a genetic MIRO1 perturbation or a target-engagement/epistasis test in the relevant immune compartments (and the authors acknowledge this limitation in the Discussion).

      -We have examined on-target activity of MR3 in our other papers. For example, by depleting Miro1 with CRISPRi in glioma cells (Miro1 KD cells), we found that it phenocopied the effect of MR3. We also expressed Miro1-7A, a drug-resistant mutant of Miro1 predicted to be unable to bind MR3 (1) in Miro1 KD glioma cells, which rendered glioma cells insensitive to MR3 treatment. These data demonstrate that in cellular glioma models, Miro1 is the target of MR3 and MR3 exerts its functions via directly binding to Miro1.

      We have also excluded off-target effect of MR3 by examining other mitochondrial GTPases (1, 2) including Miro2.

      We agree these data were not done specifically in immune compartments, and have acknowledged it in Discussion and added more explanation in Introduction citing our published papers.

      • The very large reduction in "tumor cell proportion" (Fig. 1E) is striking but is still a composition measure of recovered nuclei; it is not, on its own, a direct measurement of tumor size/burden and could be sensitive to differential nuclei recovery or cell loss during processing.

      -We agree that the "tumor cell proportion" in Fig. 1E represents the composition of recovered nuclei and is not, by itself, a direct measurement of tumor size or burden. We have removed "tumor burden" throughout the manuscript to avoid confusion.

      To determine whether the observed reduction might reflect technical bias, we examined the quality control metrics across all samples. Of the six initial samples (three control and three treated), one treated sample (TN1) showed clear quality concerns and was therefore excluded from downstream analysis.

      For the remaining samples, the distributions of detected genes per nucleus and total RNA counts per nucleus were similar between groups. The percentage of mitochondrial reads was consistently low, and only a small fraction of nuclei was removed during filtering, indicating overall comparable nuclei quality. Notably, the treated samples yielded similar or even higher total numbers of recovered nuclei, despite showing a lower tumor cell proportion. Please refer to new Fig. S1A for these results.

      Together, these observations suggest that the decrease in tumor cell proportion is unlikely to be explained simply by differential nuclei recovery, sequencing depth, or filtering effects. That said, we recognize that compositional differences in single-nucleus RNA sequencing data do not provide a direct measurement of tumor burden. We have revised the manuscript to clarify this point and to indicate that independent future approaches would be required for definitive assessment.

      I think the paper can go forward in its current scope, but the strength of the claims should match the level of evidence. If the authors want to keep strong, causal language in the title/abstract ("driving immunosuppression," "reduces tumor burden"), then I consider one or two targeted validation experiments essential (see below). Alternatively, the authors can temper the language and position the mechanistic model more explicitly as a hypothesis generated from the transcriptomic analysis.

      -We thank the Reviewer! We have toned down the claims throughout the manuscript to make the data consistent with the conclusion.

      Statements that should be labeled as preliminary/speculative (unless additional validation is added)

      • MAC4-derived PGE2 as the upstream driver of MAC1 Parp11/PD-L1: plausible and nicely consistent with Ptges3 being MAC4-high in controls and reduced with MR3 (Fig. 4A), but not demonstrated.

      -We have changed the conclusion of this part to:

      Together, these bioinformatic findings suggest that MAC4 may produce PGE₂, which could act on nearby MAC1 cells in a paracrine manner to increase Parp11 expression, although this model needs to be functionally validated.

      • MIRO1 → mtDNA → cGAS/STING → Ptges3 as a mechanistic chain: interesting, but currently framed largely by pathway knowledge plus modest expression changes (Supplementary Fig. S5).

      -We have added: "which requires future functional investigation."

      • "MR3 reactivates anti-tumor immunity to reduce tumor burden": the gene set enrichment and CellChat shifts are consistent with immune activation, but immune-mediated tumor control is not directly tested.

      -We have toned down these claims on tumor burden and only conclude as: MR3 may enhance anti-tumor immune responses.

      Replication and statistics

      Mouse snRNA-seq replication is limited after QC (3 control vs 2 MR3-treated animals). With n=2 treated, it is hard to know whether some of the biggest composition and cluster-level changes are robust to animal-to-animal variability.

      -As also explained to Rev 2, we originally planned 3 mice per group. Despite losing one after QC, sample-level pseudobulk PCA analysis (treating each mouse as one replicate) of the mice shows clear separation of treated from untreated groups (new Fig. S2C), supporting technical reproducibility despite a small n. The two MR3-treated samples clustered together and were clearly separated from controls, indicating that the transcriptional effect of MR3 exceeds inter-animal variability (new Fig. S2C). The reduction in tumor cell proportion was also observed in both treated animals (new Fig. S2F). We have added this description to the Results (Page 5, lines 116-118) and included a new figure showing the tumor cell proportion for each animal (new Fig. S2F).

      We acknowledge this is a limitation, but, as the Reviewer also pointed out, our paper's significance is to transcriptomically link Miro1 to well-known immune suppression factors in the glioma TME and to integrate 3 glioma databases, which will help researchers in the field advance their own research. Thus, our methods and resource should still be valid and useful to the community.

      Relatedly, the snRNA-seq differential expression is performed with Seurat FindMarkers (Wilcoxon rank-sum). Per-cell testing can inflate significance if biological replicate structure is not accounted for (pseudoreplication). I suggest the authors clarify exactly how they handled sample-level replication for the key DE results and, where possible, re-run the main DE comparisons using a sample-aware approach (e.g., pseudo-bulk within cell types/subclusters).

      -We thank the reviewer for raising this important point. In the original analysis, differential expression was performed using Seurat's FindMarkers function which performs per-cell testing. We acknowledge that this approach can overestimate significance if biological replicate structure is not explicitly accounted for.

      To address this, we re-ran the key differential expression analyses using a pseudo-bulk approach: counts were aggregated per cell type/subcluster per sample, and DE testing was performed across samples rather than individual cells. The main results and conclusions remain consistent with the original analysis, while this approach ensures that statistical significance properly reflects biological replication (new Fig. S3D-F).
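      The aggregation step described in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `pseudobulk` helper, the `cell_type` and `sample` column names, the sample labels, and all counts are hypothetical toy values.

```python
import pandas as pd

def pseudobulk(counts: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Sum per-cell counts (cells x genes) into one profile per (cell_type, sample)."""
    cell_type = meta.loc[counts.index, "cell_type"]
    sample = meta.loc[counts.index, "sample"]
    return counts.groupby([cell_type, sample]).sum()

# Toy per-cell counts: rows are cells, columns are genes (made-up numbers).
counts = pd.DataFrame(
    {"Parp11": [3, 2, 1, 4, 0, 1], "Cd274": [5, 1, 0, 0, 2, 2]},
    index=["c1", "c2", "c3", "c4", "c5", "c6"],
)
meta = pd.DataFrame(
    {
        "cell_type": ["MAC1", "MAC1", "MAC1", "MAC4", "MAC4", "MAC4"],
        "sample": ["ctrl_1", "ctrl_1", "mr3_1", "ctrl_1", "mr3_1", "mr3_1"],
    },
    index=counts.index,
)

pb = pseudobulk(counts, meta)
print(pb)
```

      Each row of `pb` then stands for one biological replicate within one cell type, so a downstream test compares samples rather than individual cells.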

      For the human bulk RNA-seq, the methods indicate 3 patient tissues split across MR3 vs DMSO for 24 h. In DESeq2, a paired design (including patient as a blocking factor) would be important to avoid patient-to-patient variability dominating the treatment signal; the manuscript should confirm whether the design formula accounted for this.

      -In the revised manuscript, we re-ran the DESeq2 analysis using a paired design with patient as a blocking factor and compared DMSO and MR3 within each patient (P1-P3). The results are consistent with our previous analysis. PARP11 remains significantly downregulated (raw p-value

      Finally, several places in the Methods define significance using p-value cutoffs (e.g., GEPIA3 TCGA/GTEx analysis uses p 1; human DE uses p = 1). Because multiple testing is substantial in all of these analyses, I recommend reporting FDR-adjusted values consistently (and being explicit about whether figures/tables show raw or adjusted p-values).

      -We have now used FDR-adjusted values for the TCGA/GTEx analysis and have updated Fig. 1C (top left), Results, and Methods accordingly. PARP11 remains significant after FDR correction.

      For the human bulk RNA-seq, very few genes pass an adjusted p 2FC| > 1 across all four differential expression analyses and updated the corresponding description in Methods.
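      For readers unfamiliar with FDR adjustment, the sketch below implements the Benjamini-Hochberg procedure in NumPy. The response does not say which FDR method was used, so Benjamini-Hochberg is an assumption here, and the p-values are made-up toy numbers rather than results from the manuscript.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity: each value is the minimum of itself and all larger ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n, dtype=float)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # toy raw p-values
adj = benjamini_hochberg(raw)
print(adj)
```

      A gene is then called significant when its adjusted value, not its raw p-value, falls below the chosen threshold.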

      Do the data support the macrophage-to-CD8 suppression claim?

      The CellChat PD-L1/PD-1 network figures are suggestive (Fig. 3C/E), but ligand-receptor inference is not the same as demonstrating functional T-cell inhibition. At minimum, I would like to see one orthogonal readout (flow or immunostaining) showing that PD-L1 protein on myeloid cells and PD-1 on CD8 T cells change in the expected directions after MR3, and that CD8 T cells show an activation/effector signature at the protein level.

      -We agree this would clearly be the next step in functional studies, but the current manuscript is focused on transcriptomic analysis and method building, so we have toned down any claims at the functional level.

      In addition, we have observed that T cells after MR3 treatment show upregulation of cytotoxicity- and IFN-response-related genes, consistent with enhanced effector function at the transcriptional level. We have added new Fig. S6A and an explanation in the Results.

      PARP11: mediator vs marker

      The cross-species PARP11 result is the most convincing and potentially generalizable finding in the manuscript (Fig. 2B). However, in the specific context of this study, PARP11 is still best supported as a conserved MR3-responsive candidate rather than a demonstrated causal driver of PD-L1-mediated suppression. If the authors want to argue PARP11 is an effector of the pathway (rather than a marker), they should either soften the language or add a minimal functional linkage experiment within the existing scope (see "Optional" experiments below).

      -We have softened the overall language throughout the manuscript to emphasize the correlation and PARP11 as a marker and to reflect the bioinformatic nature of the study. As this paper's main goal is method development and resource building, with already 11 figures, we think functional experiments could be done in another paper.

      Reproducibility and clarity of methods

      I appreciate that the authors provide a code/data portal (MiroScape) and a GitHub link. To make the study as reproducible as possible, I recommend:

      • Deposit raw sequencing reads for both mouse and human datasets (GEO/SRA) and include accession numbers in the manuscript.

      -We have just deposited all raw data. Accession numbers will be provided once the data are public.

      • Provide a short, consolidated "computational reproducibility" note with software versions and key parameters (Seurat, CellChat, STAR, DESeq2, etc.).

      -Added

      • Clarify pseudo-bulk construction (what is aggregated, at what level, and how many biological replicates contribute to each pseudo-bulk comparison).

      -Added

      • Add a brief summary of MR3 provenance/validation and what "MIRO1-binding" means operationally in the context of these experiments (especially for readers outside the MIRO1 field).

      -We have added this in Introduction.

      Experiments requested (kept within the existing claims) I am intentionally not suggesting new lines of experimentation. The experiments below are aimed only at supporting the paper's current central claims. I separate them into items I consider essential vs optional, depending on how strongly the authors want to phrase mechanistic conclusions.

      -We thank the Reviewer. We have toned down the claims to reflect the bioinformatic nature of the paper. We will perform the experiments suggested below in a separate paper.

      Essential if the title/abstract continue to use strong causal language

      • Protein-level validation of the PD-L1/PD-1 axis and CD8 activation in the GL261 model. A focused flow cytometry panel (myeloid PD-L1; CD8 PD-1 plus one or two effector markers such as GZMB/IFNG/Ki67) or multiplex IF/IHC on tumor sections would substantially strengthen the central MAC1 → CD8 claim.

      • An orthogonal measure of tumor burden in the same treatment paradigm. The manuscript currently treats the drop in the fraction of nuclei annotated as tumor (Fig. 1E) as a reduction in tumor burden; I recommend including IVIS longitudinal data and/or histologic tumor area/volume at harvest to support this statement.

      • If feasible, modestly increase in vivo biological replication (the snRNA-seq analysis currently has n=2 treated after QC). Even adding one additional treated animal that passes QC would help.

      Feasibility (rough guidance only; core pricing varies widely by institution): a repeat GL261 cohort to harvest tumors for flow and/or histology typically takes ~3-6 weeks end-to-end. A small flow panel plus core time is often on the order of a few thousand USD (antibodies and cytometry), while basic histology/IF quantification might be in the hundreds to low-thousands. If the authors already have stored tissue from the existing cohort, some of this could be faster/cheaper.

      Optional (only if the authors want the MAC4 → PGE2 → Parp11 mechanism to be more than a model)

      • Measure PGE2 (ELISA or targeted lipidomics) in tumor lysates/conditioned media from control vs MR3-treated samples, or provide a closer proxy for PGE2 pathway engagement in the relevant clusters.

      Optional (only if the authors want to argue PARP11 is an effector)

      • A minimal functional linkage experiment (in vitro) testing whether PARP11 perturbation phenocopies the relevant aspect of MR3 in macrophages (e.g., PD-L1 levels and/or the ability to suppress CD8 activation in a co-culture). This could be done with a PARP11 inhibitor or knockdown. I do not think in vivo genetics are required for this manuscript, but some functional tie would prevent overinterpretation.

      __ Minor comments A. Analysis/experimental clarifications that seem straightforward • Human DESeq2: please clarify whether the DESeq2 design was paired by patient (i.e., patient as a blocking factor).__

      -See above. We re-ran the human differential expression analysis using a paired design with patient as a blocking factor and explained this in the Methods.

      • __ snRNA-seq DE: please clarify whether any sample-aware method was used for the key DE conclusions (especially Parp11/Cd274 changes) rather than per-cell statistics alone.__ -See above. The key DE results are based on sample-level pseudobulk (each mouse as one replicate). The two MR3-treated samples cluster together in pseudobulk PCA (new Fig. S2C), and the tumor reduction is seen in both animals (new Fig. S2F), supporting robustness to animal variability.

      • __ CellChat: because min.cells filtering is used (min.cells = 20), please note this explicitly in figure legends where subclusters appear only in one condition, so readers understand why certain labels are missing.__ -We have edited the Fig 3 legend accordingly.
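
      The sample-level pseudobulk and the minimum-cell filter discussed in the responses above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the toy sample and cluster labels, and the counts are hypothetical, and reusing the CellChat threshold of 20 as a cutoff for pseudobulk groups is an assumption made only to show why a subcluster label can disappear from one condition.

```python
from collections import defaultdict

# Assumption: reuse the min.cells = 20 value quoted for CellChat as a group-size cutoff.
MIN_CELLS = 20


def pseudobulk(cells, min_cells=MIN_CELLS):
    """Sum per-nucleus counts into one profile per (sample, cluster).

    `cells` is an iterable of (sample_id, cluster, {gene: count}) tuples.
    Groups with fewer than `min_cells` nuclei are dropped and reported,
    so each retained profile represents one animal, not one nucleus.
    """
    sums = defaultdict(lambda: defaultdict(int))
    n_cells = defaultdict(int)
    for sample_id, cluster, counts in cells:
        key = (sample_id, cluster)
        n_cells[key] += 1
        profile = sums[key]
        for gene, count in counts.items():
            profile[gene] += count
    kept = {k: dict(v) for k, v in sums.items() if n_cells[k] >= min_cells}
    dropped = sorted(k for k in n_cells if n_cells[k] < min_cells)
    return kept, dropped


# Hypothetical toy input: 25 control MAC1 nuclei, 22 treated MAC1 nuclei,
# and only 5 treated MAC4 nuclei (below the cutoff).
toy = (
    [("ctrl_1", "MAC1", {"Parp11": 2, "Cd274": 1})] * 25
    + [("mr3_1", "MAC1", {"Parp11": 1})] * 22
    + [("mr3_1", "MAC4", {"Parp11": 1})] * 5
)
profiles, dropped = pseudobulk(toy)
print(profiles)
print("dropped:", dropped)
```

      In this toy input the treated MAC4 group falls below the cutoff, so it is absent from the output for that condition, which is the situation the reviewer asked to have flagged in the figure legends.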

      __ Figure and text consistency issues I noticed several figure/legend/citation issues that look like simple fixes: • Fig. 3 legend panel labeling: the legend text refers to the PD-L1/PD-1 chord plot as (C) MR3− and (D) MR3+, but (D) is the heatmap panel; the chord plots are (C) and (E). This should likely read (C) MR3− and (E) MR3+.__

      -Yes, and corrected.

      • __ Fig. 5 panel reference: the Results text refers to the Cross Species module as Fig. 5F, but the Fig. 5 legend defines panels (A-E) and labels (E) as "Cross Species module." Please reconcile (either change the text to Fig. 5E or add a panel F).__ -Changed to "E".

      • __ Discussion figure citation: the Discussion cites Ptges3/PGE2 evidence as "(Figure 3)," but Ptges3 is shown in Fig. 4A and the model is in Fig. 4B.__ -Added "Figure 4A-B" there.

      • __ Fig. 1D numbers: the Results text states 509/1,602 (mouse) and 15/106 (human) "rescued" genes (Fig. 1D), but the Fig. 1D pie charts are labeled with different totals (mouse total 3490; human total 104). Please reconcile the denominators and ensure the figure matches the text and analysis choice (bulk vs snRNA vs filtered gene sets).__ -For the cross-species analysis, we only counted genes with human-mouse orthologs so that the two datasets were compared in the same gene space. This avoids inflation from species-specific genes. We have added a clarification in the figure legend.

      • __ Fig. 2 legend: there is a stray quote in "lymphoid subclusters" (appears as subclusters").__ -removed.
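
      The ortholog-restricted counting described in the Fig. 1D response above changes both the numerator and the denominator of a "rescued genes" fraction. The sketch below shows the idea under stated assumptions: the function name, the small ortholog map, and the gene lists (including the placeholder mouse-specific gene `Gm12345`) are hypothetical and are not taken from the manuscript.

```python
def rescued_fraction(glioma_genes, rescued_genes, ortholog_map):
    """Count rescued genes within the genes that have a 1:1 ortholog.

    Returns (number rescued, number comparable). Genes without an entry
    in `ortholog_map` are excluded from both counts.
    """
    comparable = {g for g in glioma_genes if g in ortholog_map}
    rescued = comparable & set(rescued_genes)
    return len(rescued), len(comparable)


# Hypothetical inputs for illustration only.
mouse_to_human = {"Parp11": "PARP11", "Cd274": "CD274", "Ptges3": "PTGES3"}
glioma_up = ["Parp11", "Cd274", "Ptges3", "Gm12345"]  # Gm12345: placeholder, no ortholog
rescued = ["Parp11", "Gm12345"]

n_rescued, n_total = rescued_fraction(glioma_up, rescued, mouse_to_human)
unrestricted = (len(set(rescued) & set(glioma_up)), len(glioma_up))
print("ortholog-restricted:", n_rescued, "/", n_total)
print("unrestricted:", unrestricted[0], "/", unrestricted[1])
```

      Because the two counts use different gene spaces, a text value and a figure value can disagree unless both state which space they use.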

      __ Presentation and framing • Tone down or carefully qualify statements equating snRNA-seq composition shifts with reduced tumor burden (or add an orthogonal tumor-burden measurement as suggested above).__

      -We have removed "tumor burden" throughout the manuscript.

      • __ Where possible, tie mechanistic language explicitly to the level of evidence ("consistent with," "suggests," "model proposes") so readers do not over-interpret the transcriptomic inference.__ -done.

      • __ Consider adding a small schematic in the Results or a short "interpretation" sentence in the figure legends explaining what the CellChat plots do and do not show, since non-specialists can misread these as direct interaction measurements.__ -We have added explanations in Fig 3 legends for CellChat and emphasized the transcriptomic nature of the data.

      __ Prior literature The PARP11 immunotherapy literature is cited appropriately. For the PGE2 angle, it may help readers if the authors add one or two glioma-focused references on PGE2-mediated myeloid/T-cell suppression (if not already in the full reference list).__

      -We have added two more papers showing that PGE2 may induce MDSCs and immunosuppression in glioma (3, 4).

      Significance

      Nature and significance of the advance The advance here is primarily conceptual and resource-oriented. Conceptually, the work connects a mitochondrial regulator (MIRO1) to a specific, testable immunosuppressive circuit in the glioma TME. Technically, the cross-species perturbation framework and the accompanying MiroScape portal should be useful to groups looking for conserved, drug-responsive immune programs.

      Context within the existing literature Immunosuppression in glioma and the importance of tumor-associated myeloid populations are well established, as is the limited success of checkpoint blockade in GBM. The manuscript's proposed MAC4/MAC1 paracrine model and its emphasis on PD-L1/PD-1 signaling adds a focused, hypothesis-generating view of how particular macrophage states might sustain CD8 dysfunction. The identification of PARP11 as a conserved MR3-responsive gene also fits with emerging work implicating PARP11 in immunoregulatory programs and response to immunotherapy.

      Audience

      • Neuro-oncology and glioma TME researchers (myeloid heterogeneity, immune suppression).
      • Tumor immunology groups interested in myeloid-driven checkpoint resistance.
      • Researchers working on mitochondrial stress signaling and immunometabolism.
      • Computational biologists building cross-species or multi-modal integration frameworks.

      Reviewer expertise and limitations

      Keywords: glioma microenvironment; macrophage/microglia biology; tumor immunology; single-cell/nucleus transcriptomics; computational ligand-receptor inference.

      Limitations: I am not a medicinal chemist, so I cannot deeply evaluate MR3 chemistry, PK/PD, or specificity beyond what is presented. I also did not evaluate the full web-portal implementation beyond the manuscript description.

      Reviewer #2

      Evidence, reproducibility and clarity

      The authors study responses to MIRO1 inhibition in a mouse model of GL261 GBM and in human tissue pieces treated ex vivo. They provide an interesting link between mitochondrial function and potential therapeutic outcomes in a tumor type that is typically challenging to treat. The manuscript is written clearly, in correct English language and figures are well structured and easy to interpret.

      -We thank the Reviewer for the positive comments. We want to clarify that the compound binds to Miro1 but does not inhibit Miro1's GTPase activity (1). We have now added an explanation to the Introduction.

      __ Major critique: 1. However, I need to stress that the study is based on a few experiments with low robustness. The predominant experiment is single-nuclei RNA-seq analysis of GL261 tumors implanted into mice, constituting 3 CTRL and 2 treated mice, due to removal of the 3rd animal following sequencing (low recovery of high-quality nuclei). Therefore, the sample group is small. This is understandable for an snRNA-seq experiment (although 3 animals in the treated group is somewhat necessary), but the efficiency of treatment with MR3 should be better documented in a larger cohort of animals. Crucial changes in distribution of cell types or polarisation of myeloid cells should be confirmed with flow cytometry, which is more feasible on a larger cohort.__

      -We agree. As explained to Rev 3, the current paper is focused on conceptual and methodological advances and on providing a resource to the community, and it is already large with 11 figures. As Rev 1 mentioned, our paper's significance is to transcriptomically link Miro1 to well-known immune suppression factors in the glioma TME and to integrate 3 glioma databases, which will help researchers in the field advance their own research. Importantly, PCA of the mice at the animal level showed clear separation of treated from untreated groups, and the reduction in tumor cell proportion was observed in both treated animals (new Fig. S2C, F), supporting technical reproducibility despite a small n. Thus, our methods and resource should still be valid and useful to the community. Exploring the tumor-reducing efficacy of MR3 or combined treatments (e.g. with anti-PD-L1 or a PARP11 inhibitor) in larger cohorts is an exciting next step.

      __ Human model does not seem robust (also, only 3 patients). Very few genes are affected by treatment (incomparably less than in mice), which poses a question if the model is sufficient to study the effect of the treatment. This should be at least discussed and arguments should be stated why such model is suitable.__

      -We agree. The observed variability in treatment response is expected and consistent with the well-established molecular and phenotypic heterogeneity of human glioma. Importantly, despite this diversity, we identified one gene (PARP11) that was consistently altered across all patients' samples and the mouse model. This cross-species reproducibility supports the biological and translational relevance of the PARP11 finding. We have now added this to the Discussion.

      In addition, we reanalyzed the human bulk RNA-seq using a paired design with patient as a blocking factor as suggested by another reviewer, which increased the number of DE genes (new Fig. 1C).
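
      Why a paired design can recover more signal when patients differ strongly at baseline can be shown with a small numerical sketch. This is not DESeq2's negative binomial model and not the authors' analysis; in DESeq2 a blocking factor is typically expressed with a design such as `~ patient + treatment`, and the exact formula used is not stated here. The expression values below are hypothetical log2 values for one gene in three patients.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(treated, control):
    """t statistic on within-patient differences (patient acts as a block)."""
    diffs = [t - c for t, c in zip(treated, control)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def unpaired_t(treated, control):
    """Welch t statistic, which ignores which samples share a patient."""
    var_t, var_c = stdev(treated) ** 2, stdev(control) ** 2
    se = sqrt(var_t / len(treated) + var_c / len(control))
    return (mean(treated) - mean(control)) / se


# Hypothetical log2 expression: large patient-to-patient spread,
# small but consistent decrease after treatment in every patient.
dmso = [5.0, 8.0, 11.0]
mr3 = [4.0, 7.2, 9.9]

t_paired = paired_t(mr3, dmso)
t_unpaired = unpaired_t(mr3, dmso)
print("paired t:", round(t_paired, 2))
print("unpaired t:", round(t_unpaired, 2))
```

      In this toy case the within-patient differences are consistent, so the paired statistic is much larger in magnitude than the unpaired one, which is dominated by between-patient variance.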

      __ Fig. S1E shows that actually few genes are commonly affected between human and mouse experiments. So conclusion about "conserved" modulation by MR3 seem an overstatement.__

      -We meant that "Parp11" is conserved. To avoid confusion, we have deleted "conserved" throughout the manuscript wherever it did not refer specifically to Parp11.

      __ Mechanistic conclusions about PARP11, PGE, PD-L1 etc are not documented by any wet lab experiments, just by bioinformatic modelling.__ -We have scrutinized the Main Text to emphasize this.

      Minor: 1. Authors should discuss choice of GL261 model. It is immunogenic and does not resemble human GBM ideally, so the choice should be explained.

      -Although the GL261 model is more immunogenic than human GBM, this feature enables evaluation of immune-modulating therapies and mechanisms in an immune-competent setting. The model still preserves critical aspects of glioma biology, including an immunosuppressive TME, invasive behavior, and intracranial growth (5). Thus, it provides a suitable platform for investigating immune-cell mechanisms in the TME. We have now added this to the Methods.

      __ In clustering of mouse snRNAseq data, T cells seem underclustered, e.g. Treg cluster clearly constitutes half of Il2ra-positive and negative cells, the latter probably being conventional CD4+ T cells (usually CD4+ T cells in GL261 are 50:50 Treg and conventional). This can affect further conclusions on cell:cell interactions.__

      -We thank the reviewer for this important observation. We agree that in the former annotation, it was improper to annotate all the CD4+ T cells as Treg cells, given the limited expression of Foxp3, Il2ra and other Treg marker genes. Consequently, the previously annotated "Treg cluster" likely includes both regulatory-like and conventional CD4+ T cells.

      We further clustered the CD4+ T cell population and found that dividing it into conventional CD4+ T cells and Treg cells yielded too few Treg cells for downstream analysis (~50). This would compromise the robustness and reliability of the subsequent analyses (CellChat, DEA, etc.).

      To address this, we have revised our annotation and now refer to this population more conservatively as "regulatory-like CD4+ T cells" rather than bona fide Tregs. Importantly, this subset still exhibits elevated expression of immunoregulatory molecules and is associated with CD8+ T cell dysfunction, preserving the main conclusions regarding immune suppression within the tumor microenvironment. We have updated the Results, Figures, and Discussion accordingly to clarify this revised annotation and its implications for cell-cell interactions.

      Please refer to the following new figures for the updated annotation and associated results: Fig. 2G-H, Fig. 3A-G, Fig. S4C-D,G, Fig. S5B-G, Fig. S6A.

      Significance

      The study provides an interesting conclusion and potentially relevant discovery. However, in the opinion of this reviewer, the performed experiments do not strengthen this sufficiently, especially in terms of mechanistic insights and weak data on human samples. In the line of general literature on new treatments of GBM and testing thereof in mouse models, this study lacks mechanistic insights and solid data on therapeutic efficiency.

      -As mentioned above, the goal of this paper is to provide novel methods to integrate datasets, resource building, and identify markers in the glioma TME. It will serve as useful resources to the community and form the foundation for future therapeutic validation in larger cohorts. We have acknowledged the limitations in the revised manuscript.

      __Reviewer #3 (Evidence, reproducibility and clarity (Required)):__

      __The authors of "Cross-Species Transcriptomic Integration Reveals a Conserved, MIRO1-Mediated Macrophage-to-T-Cell Signaling Axis Driving Immunosuppression in Glioma" present transcriptomic data, both bulk RNA-seq and single-nucleus RNA-seq, from GL261 murine gliomas treated with the Miro1-targeting compound MR3. RNA-seq data from human tumor explants treated with MR3 are also presented. The authors compared DEGs from their treated tissues with publicly available RNA-seq data sets comparing DEGs from normal tissue and glioma tumors, with the goal of identifying genes modulated by MR3 that may underlie glioma growth, TME changes, and immunosuppression. There is a significant amount of data presented, with in-depth analysis conducted on the sequencing data sets. The manuscript is lacking in mechanistic depth and this reviewer feels that the results are over-interpreted, especially without any additional confirmatory assays run to confirm the interpretation of the sequencing data. There were many bold statements made (lines 109-110, 117, 130-131, 142-144, 163-165) that I felt did not have enough evidence to back up their claims.__

      -We have toned down the statements at the lines mentioned above:

      Line 109-110: Deleted now

      Line 117: Deleted now

      Line 130-131: Deleted now

      Line 142-144: Deleted: "highly differentially expressed", the rest of the sentence is supported by our data.

      Line 163-165: Deleted now

      As explained later, our paper is focused on bioinformatic analysis and resource and method building. In-depth functional studies will be performed in another paper.

      __A significant concern is the lack of confirmation that MR3 is targeting Miro1 in these models.__

      -We have done this in another manuscript where we show that in cellular glioma models, Miro1 is the target of MR3 and MR3 exerts its functions via directly binding to Miro1.

      __Previous publications from the authors have shown evidence that MR3 reduces Miro1 expression in cell and fly models. Sometimes this requires the co application of FCCP or antimycin A. Thus, the results attributed within cannot be attributed to Miro1 changes but rather any on or off-target effect of MR3. __

      -We originally discovered MR3 by ligand-based in silico modeling and a thermal shift direct binding assay (1, 2). Thus, MR3 is a Miro1 binder (stated in the Abstract and Introduction; we have now added more background to the Introduction). Indeed, we sometimes saw MR3 reduce Miro1 protein levels under certain conditions, for example, in vivo in flies after days of feeding (1, 2), or in PD cells upon Antimycin A or CCCP treatment (1, 2, 6, 7). MR3 most likely exerts its function by altering Miro1 protein-protein interactions (8), and Miro1 protein is subsequently degraded in proteasomes following complex dissociation or after posttranslational modifications (1, 2, 8). We have stated this hypothesis in the Results section (page 10, possible model).

      In our other papers we have excluded off-target effects of MR3 by examining other mitochondrial GTPases (1, 2), including Miro2, and by showing that Miro1 KD glioma cells phenocopied the effects of MR3 and that a drug-resistant Miro1 mutant rendered glioma cells insensitive to MR3. These data show that Miro1 is the main target of MR3.

      We have added more explanations to the Introduction.

      __Understanding that mouse studies are expensive and time-consuming, and the acquisition of human tissue is not trivial, the sample sets are still small. Further confirmation of findings in cell models, organoids etc. would strengthen the findings and justify the smaller sample size of mice and human tissue. __

      -We agree, and we have another in-depth study under way. However, the current paper is focused on conceptual and methodological advances and on providing a resource to the community, and it is already large with 11 figures. As Rev 1 mentioned, our paper's significance is to transcriptomically link Miro1 to well-known immune suppression factors in the glioma TME and to integrate 3 glioma databases, which will help researchers in the field advance their own research. Thoroughly understanding Miro1's role in the glioma TME is our next goal, as stated in the Discussion, and is beyond the scope of the current study.

      __The website MiroScape will be a very useful tool in the proper hands. ____

      1. Confirm activity of MR3 on Miro1 in relevant samples. Direct downregulation? Modulation of other targets known to be altered by MR3? __

      -As mentioned above, we have shown that in tumor cells MR3 disrupts pathogenic Miro1-protein interactions without the need to reduce Miro1 protein. No other target is currently known to be altered by MR3, not even Miro2, as demonstrated previously (1, 2). We have added more explanations to the Main Text.

      __ Conduct further mechanistic work to validate claims inferred by differentially expressed genes.__

      -As mentioned above, our current paper is focused on bioinformatic methods and resource building. Further mechanistic work will be performed in another paper.

      __ Significantly temper claims related to cell targeting, direct communication between cells and overarching responses inferred from sequencing data.__ -Done. See above and Main Text.

      __Reviewer #3 (Significance (Required)):__

      __My laboratory's expertise lies in signaling related to mitochondrial structure and function. We have investigated the Miro1 protein and effects on cellular responses related to Miro1 expression. We have tested the MR3 compound in our own systems with limited success. Therefore my major concerns lie in validating the on-target activity of the compound in their models.__ -As explained above, in our other papers we have thoroughly examined the on-target activity of MR3 by counter-screening other Miro1-related or similar proteins (1, 2, 6, 7) and by using Miro1 KD cells. We have now added more explanations to the Main Text.

      __ With additional mechanistic validation this could be a very significant study. Using advanced model systems as the authors do allows for a comprehensive understanding of tissue responses. This is far advanced from simple single cell line culture studies but also adds significant complexity to the interpretation of the data. I am a strong believer that sequencing data must be validated with functional assays.__

      -We agree and are actively conducting those studies. However, bioinformatic analysis and method and resource building are sometimes too comprehensive to combine with functional data which may take years to obtain. We think our paper's method, markers identified in TME, and resources will be very useful to the community.

      References

      1. Hsieh CH, Li L, Vanhauwaert R, Nguyen KT, Davis MD, Bu G, Wszolek ZK, Wang X. Miro1 Marks Parkinson's Disease Subset and Miro1 Reducer Rescues Neuron Loss in Parkinson's Models. Cell metabolism. 2019;30(6):1131-40 e7. Epub 2019/10/01. doi: 10.1016/j.cmet.2019.08.023. PubMed PMID: 31564441; PMCID: PMC6893131.
      2. Li L, Conradson DM, Bharat V, Kim MJ, Hsieh CH, Minhas PS, Papakyrikos AM, Durairaj AS, Ludlam A, Andreasson KI, Partridge L, Cianfrocco MA, Wang X. A mitochondrial membrane-bridging machinery mediates signal transduction of intramitochondrial oxidation. Nat Metab. 2021. Epub 2021/09/11. doi: 10.1038/s42255-021-00443-2. PubMed PMID: 34504353.
      3. Mi Y, Guo N, Luan J, Cheng J, Hu Z, Jiang P, Jin W, Gao X. The Emerging Role of Myeloid-Derived Suppressor Cells in the Glioma Immune Suppressive Microenvironment. Front Immunol. 2020;11:737. Epub 2020/05/12. doi: 10.3389/fimmu.2020.00737. PubMed PMID: 32391020; PMCID: PMC7193311.
      4. Dean PT, Hooks SB. Pleiotropic effects of the COX-2/PGE2 axis in the glioblastoma tumor microenvironment. Front Oncol. 2022;12:1116014. Epub 2023/01/26. doi: 10.3389/fonc.2022.1116014. PubMed PMID: 36776369; PMCID: PMC9909545.
      5. Mathios D, Kim JE, Mangraviti A, Phallen J, Park CK, Jackson CM, Garzon-Muvdi T, Kim E, Theodros D, Polanczyk M, Martin AM, Suk I, Ye X, Tyler B, Bettegowda C, Brem H, Pardoll DM, Lim M. Anti-PD-1 antitumor immunity is enhanced by local and abrogated by systemic chemotherapy in GBM. Science translational medicine. 2016;8(370):370ra180. Epub 2016/12/23. doi: 10.1126/scitranslmed.aag2942. PubMed PMID: 28003545; PMCID: PMC5724383.
      6. Bharat V, Durairaj AS, Vanhauwaert R, Li L, Muir CM, Chandra S, Kwak CS, Le Guen Y, Nandakishore P, Hsieh CH, Rensi SE, Altman RB, Greicius MD, Feng L, Wang X. A mitochondrial inside-out iron-calcium signal reveals drug targets for Parkinson's disease. Cell Rep. 2023;42(12):113544. Epub 2023/12/07. doi: 10.1016/j.celrep.2023.113544. PubMed PMID: 38060381.
      7. Bharat V, Hsieh CH, Wang X. Mitochondrial Defects in Fibroblasts of Pathogenic MAPT Patients. Front Cell Dev Biol. 2021;9:765408. Epub 2021/11/23. doi: 10.3389/fcell.2021.765408. PubMed PMID: 34805172; PMCID: PMC8595217.
      8. Kwak CS, Du Z, Creery JS, Wilkerson EM, Major MB, Elias JE, Wang X. Optogenetic Proximity Labeling Maps Spatially Resolved Mitochondrial Surface Proteomes and a Locally Regulated Ribosome Pool. bioRxiv. 2025. Epub 2026/01/07. doi: 10.64898/2025.12.21.693523. PubMed PMID: 41497653; PMCID: PMC12767525.
    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



      Referee #1

      Evidence, reproducibility and clarity

      Summary of findings and key conclusions

      This manuscript asks how pharmacologic targeting of the outer mitochondrial membrane protein MIRO1 (RHOT1) with a MIRO1-binding compound (MR3) reshapes immunosuppressive programs in the glioma tumor microenvironment (TME). The core of the paper is a cross-species transcriptomic comparison that combines an in vivo mouse dataset with an ex vivo human perturbation dataset.

      Model systems and approach (as described):

      • Mouse in vivo: GL261-Luc intracranial glioma in C57BL/6J mice; MR3 is administered intracranially at the implantation site (10 µM in 5 µL DMSO) on days 11 and 18, and tumors are harvested on day 22 for single-nucleus RNA-seq (snRNA-seq).
      • Mouse snRNA-seq: NeuN-based nuclei sorting, 10x Genomics v3.1; alignment to mm10; Seurat-based integration and annotation. Tumor-cell calling is supported by CNV inference (SCEVAN/CopyKAT). One MR3-treated sample is excluded after QC, leaving 3 control vs 2 MR3-treated samples (11,940 NeuN− nuclei).
      • Human ex vivo: freshly resected glioma cores from 3 patients are cultured with 10 µM MR3 or DMSO for 24 h, followed by bulk RNA-seq (STAR alignment to hg19; DESeq2 for differential expression).
      • Cross-species integration: the analysis is restricted to 1:1 orthologs and protein-coding genes shared across datasets; inferred cell-cell signaling is explored with CellChat.

      Main findings (as presented):

      • MR3 shifts expression of a subset of glioma-associated genes toward a non-tumor-like direction ("rescued genes") and is associated with large changes in inferred cell-type composition in the mouse snRNA-seq dataset (including a marked drop in the fraction of nuclei annotated as tumor: 44.5% to 4.3%; Fig. 1E).
      • Across TCGA-vs-GTEx (glioma-upregulated genes) and three MR3 response analyses (mouse snRNA-seq, mouse pseudo-bulk, and human bulk RNA-seq), PARP11/Parp11 is reported as the only gene that is consistently upregulated in glioma and consistently downregulated by MR3 (Fig. 2B).
      • Within the mouse myeloid compartment, Parp11 is most enriched in MAC4 and MAC1, while MAC1 shows high Cd274 (Pdl1/PD-L1). MR3 reduces Parp11 in MAC4/MAC1 and reduces Cd274 in MAC1 (Fig. 2H).
      • CellChat analysis suggests that in controls MAC1 is the dominant sender of PD-L1/PD-1 signaling to CD8+ T cells (Fig. 3C), and that this PD-L1/PD-1 interaction is strongly diminished after MR3 (Fig. 3E).
      • The authors propose a paracrine model in which MAC4-derived PGE2 (via Ptges3) sustains Parp11 expression in MAC1 through cAMP/PKA/CREB, promoting PD-L1-mediated T-cell suppression; MR3 disrupts this circuitry (Fig. 4).

      Major comments

      1. Strength of the conclusions Two parts of the story felt well supported by the data as shown. First, the cross-species convergence on PARP11/Parp11 is a clear and potentially useful result (Fig. 2B). Second, the myeloid subclustering plus CellChat analysis makes a coherent case that PD-L1/PD-1 signaling in this model is dominated by a specific macrophage subset (MAC1) and changes after MR3 (Fig. 2H, Fig. 3). Where I was less convinced is when the manuscript moves from "transcriptomic and modeling evidence" to causal statements such as "MIRO1-mediated axis driving immunosuppression" and "MR3 reduces tumor burden by reactivating immunity." At the moment, several central inferences remain indirect:
        • Causality is inferred primarily from transcriptomic shifts and ligand-receptor inference rather than functional immune readouts.
        • On-target attribution to MIRO1 hinges on MR3 being a MIRO1 binder; the study does not include a genetic MIRO1 perturbation or a target-engagement/epistasis test in the relevant immune compartments (and the authors acknowledge this limitation in the Discussion).
        • The very large reduction in "tumor cell proportion" (Fig. 1E) is striking but is still a composition measure of recovered nuclei; it is not, on its own, a direct measurement of tumor size/burden and could be sensitive to differential nuclei recovery or cell loss during processing.

         I think the paper can go forward in its current scope, but the strength of the claims should match the level of evidence. If the authors want to keep strong, causal language in the title/abstract ("driving immunosuppression," "reduces tumor burden"), then I consider one or two targeted validation experiments essential (see below). Alternatively, the authors can temper the language and position the mechanistic model more explicitly as a hypothesis generated from the transcriptomic analysis.
      2. Statements that should be labeled as preliminary/speculative (unless additional validation is added)
        • MAC4-derived PGE2 as the upstream driver of MAC1 Parp11/PD-L1: plausible and nicely consistent with Ptges3 being MAC4-high in controls and reduced with MR3 (Fig. 4A), but not demonstrated.
        • MIRO1 → mtDNA → cGAS/STING → Ptges3 as a mechanistic chain: interesting, but currently framed largely by pathway knowledge plus modest expression changes (Supplementary Fig. S5).
        • "MR3 reactivates anti-tumor immunity to reduce tumor burden": the gene set enrichment and CellChat shifts are consistent with immune activation, but immune-mediated tumor control is not directly tested.
      3. Replication and statistics Mouse snRNA-seq replication is limited after QC (3 control vs 2 MR3-treated animals). With n=2 treated, it is hard to know whether some of the biggest composition and cluster-level changes are robust to animal-to-animal variability. Relatedly, the snRNA-seq differential expression is performed with Seurat FindMarkers (Wilcoxon rank-sum). Per-cell testing can inflate significance if biological replicate structure is not accounted for (pseudoreplication). I suggest the authors clarify exactly how they handled sample-level replication for the key DE results and, where possible, re-run the main DE comparisons using a sample-aware approach (e.g., pseudo-bulk within cell types/subclusters). For the human bulk RNA-seq, the methods indicate 3 patient tissues split across MR3 vs DMSO for 24 h. In DESeq2, a paired design (including patient as a blocking factor) would be important to avoid patient-to-patient variability dominating the treatment signal; the manuscript should confirm whether the design formula accounted for this. Finally, several places in the Methods define significance using p-value cutoffs (e.g., GEPIA3 TCGA/GTEx analysis uses p < 0.05 and |log2FC| > 1; human DE uses p < 0.05 and log2FC >= 1). Because multiple testing is substantial in all of these analyses, I recommend reporting FDR-adjusted values consistently (and being explicit about whether figures/tables show raw or adjusted p-values).
      4. Do the data support the macrophage-to-CD8 suppression claim? The CellChat PD-L1/PD-1 network figures are suggestive (Fig. 3C/E), but ligand-receptor inference is not the same as demonstrating functional T-cell inhibition. At minimum, I would like to see one orthogonal readout (flow or immunostaining) showing that PD-L1 protein on myeloid cells and PD-1 on CD8 T cells change in the expected directions after MR3, and that CD8 T cells show an activation/effector signature at the protein level.
      5. PARP11: mediator vs marker The cross-species PARP11 result is the most convincing and potentially generalizable finding in the manuscript (Fig. 2B). However, in the specific context of this study, PARP11 is still best supported as a conserved MR3-responsive candidate rather than a demonstrated causal driver of PD-L1-mediated suppression. If the authors want to argue PARP11 is an effector of the pathway (rather than a marker), they should either soften the language or add a minimal functional linkage experiment within the existing scope (see "Optional" experiments below).
      6. Reproducibility and clarity of methods I appreciate that the authors provide a code/data portal (MiroScape) and a GitHub link. To make the study as reproducible as possible, I recommend:
        • Deposit raw sequencing reads for both mouse and human datasets (GEO/SRA) and include accession numbers in the manuscript.
        • Provide a short, consolidated "computational reproducibility" note with software versions and key parameters (Seurat, CellChat, STAR, DESeq2, etc.).
        • Clarify pseudo-bulk construction (what is aggregated, at what level, and how many biological replicates contribute to each pseudo-bulk comparison).
        • Add a brief summary of MR3 provenance/validation and what "MIRO1-binding" means operationally in the context of these experiments (especially for readers outside the MIRO1 field).

      Experiments requested (kept within the existing claims)

      I am intentionally not suggesting new lines of experimentation. The experiments below are aimed only at supporting the paper's current central claims. I separate them into items I consider essential vs optional, depending on how strongly the authors want to phrase mechanistic conclusions.

      Essential if the title/abstract continue to use strong causal language
        • Protein-level validation of the PD-L1/PD-1 axis and CD8 activation in the GL261 model. A focused flow cytometry panel (myeloid PD-L1; CD8 PD-1 plus one or two effector markers such as GZMB/IFNG/Ki67) or multiplex IF/IHC on tumor sections would substantially strengthen the central MAC1 → CD8 claim.
        • An orthogonal measure of tumor burden in the same treatment paradigm. The manuscript currently treats the drop in the fraction of nuclei annotated as tumor (Fig. 1E) as a reduction in tumor burden; I recommend including IVIS longitudinal data and/or histologic tumor area/volume at harvest to support this statement.
        • If feasible, modestly increase in vivo biological replication (the snRNA-seq analysis currently has n=2 treated after QC). Even adding one additional treated animal that passes QC would help.

      Feasibility (rough guidance only; core pricing varies widely by institution): a repeat GL261 cohort to harvest tumors for flow and/or histology typically takes ~3-6 weeks end-to-end. A small flow panel plus core time is often on the order of a few thousand USD (antibodies and cytometry), while basic histology/IF quantification might be in the hundreds to low-thousands. If the authors already have stored tissue from the existing cohort, some of this could be faster/cheaper.

      Optional (only if the authors want the MAC4 → PGE2 → Parp11 mechanism to be more than a model)
        • Measure PGE2 (ELISA or targeted lipidomics) in tumor lysates/conditioned media from control vs MR3-treated samples, or provide a closer proxy for PGE2 pathway engagement in the relevant clusters.

      Optional (only if the authors want to argue PARP11 is an effector)
        • A minimal functional linkage experiment (in vitro) testing whether PARP11 perturbation phenocopies the relevant aspect of MR3 in macrophages (e.g., PD-L1 levels and/or the ability to suppress CD8 activation in a co-culture). This could be done with a PARP11 inhibitor or knockdown. I do not think in vivo genetics are required for this manuscript, but some functional tie would prevent overinterpretation.

      Minor comments

      A. Analysis/experimental clarifications that seem straightforward

      • Human DESeq2: please clarify whether the DESeq2 design was paired by patient (i.e., patient as a blocking factor).
      • snRNA-seq DE: please clarify whether any sample-aware method was used for the key DE conclusions (especially Parp11/Cd274 changes) rather than per-cell statistics alone.
      • CellChat: because min.cells filtering is used (min.cells = 20), please note this explicitly in figure legends where subclusters appear only in one condition, so readers understand why certain labels are missing.
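
The sample-aware analysis and FDR reporting requested above can be illustrated with a minimal sketch. It is not the authors' pipeline and does not reproduce DESeq2 or Seurat; it only shows the two ideas in plain Python: summing per-cell counts into one profile per animal and cell type (pseudo-bulk), and Benjamini-Hochberg adjustment of a list of p-values. The sample IDs, counts, and p-values are hypothetical; only the gene and cluster names come from the manuscript.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum per-cell counts into one profile per (sample, cell_type).

    `cells` is an iterable of (sample_id, cell_type, {gene: count}) tuples.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for sample_id, cell_type, counts in cells:
        for gene, count in counts.items():
            totals[(sample_id, cell_type)][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical toy input: two control cells and one treated cell.
cells = [
    ("ctrl_1", "MAC1", {"Cd274": 3, "Parp11": 1}),
    ("ctrl_1", "MAC1", {"Cd274": 2}),
    ("mr3_1", "MAC1", {"Cd274": 1, "Parp11": 0}),
]
profiles = pseudobulk(cells)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

After aggregation, each animal contributes one observation per cell type, so the effective sample size is the number of animals rather than the number of cells. Any downstream test would then be run on these profiles, with patient or animal included in the design where samples are paired.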

      B. Figure and text consistency issues

      I noticed several figure/legend/citation issues that look like simple fixes:

      • Fig. 3 legend panel labeling: the legend text refers to the PD-L1/PD-1 chord plot as (C) MR3− and (D) MR3+, but (D) is the heatmap panel; the chord plots are (C) and (E). This should likely read (C) MR3− and (E) MR3+.
      • Fig. 5 panel reference: the Results text refers to the Cross Species module as Fig. 5F, but the Fig. 5 legend defines panels (A-E) and labels (E) as "Cross Species module." Please reconcile (either change the text to Fig. 5E or add a panel F).
      • Discussion figure citation: the Discussion cites Ptges3/PGE2 evidence as "(Figure 3)," but Ptges3 is shown in Fig. 4A and the model is in Fig. 4B.
      • Fig. 1D numbers: the Results text states 509/1,602 (mouse) and 15/106 (human) "rescued" genes (Fig. 1D), but the Fig. 1D pie charts are labeled with different totals (mouse total 3490; human total 104). Please reconcile the denominators and ensure the figure matches the text and analysis choice (bulk vs snRNA vs filtered gene sets).
      • Fig. 2 legend: there is a stray quote in "lymphoid subclusters" (appears as subclusters").

      C. Presentation and framing

      • Tone down or carefully qualify statements equating snRNA-seq composition shifts with reduced tumor burden (or add an orthogonal tumor-burden measurement as suggested above).
      • Where possible, tie mechanistic language explicitly to the level of evidence ("consistent with," "suggests," "model proposes") so readers do not over-interpret the transcriptomic inference.
      • Consider adding a small schematic in the Results or a short "interpretation" sentence in the figure legends explaining what the CellChat plots do and do not show, since non-specialists can misread these as direct interaction measurements.

      D. Prior literature The PARP11 immunotherapy literature is cited appropriately. For the PGE2 angle, it may help readers if the authors add one or two glioma-focused references on PGE2-mediated myeloid/T-cell suppression (if not already in the full reference list).

      Significance

      Nature and significance of the advance

      The advance here is primarily conceptual and resource-oriented. Conceptually, the work connects a mitochondrial regulator (MIRO1) to a specific, testable immunosuppressive circuit in the glioma TME. Technically, the cross-species perturbation framework and the accompanying MiroScape portal should be useful to groups looking for conserved, drug-responsive immune programs.

      Context within the existing literature

      Immunosuppression in glioma and the importance of tumor-associated myeloid populations are well established, as is the limited success of checkpoint blockade in GBM. The manuscript's proposed MAC4/MAC1 paracrine model and its emphasis on PD-L1/PD-1 signaling adds a focused, hypothesis-generating view of how particular macrophage states might sustain CD8 dysfunction. The identification of PARP11 as a conserved MR3-responsive gene also fits with emerging work implicating PARP11 in immunoregulatory programs and response to immunotherapy.

      Audience

      • Neuro-oncology and glioma TME researchers (myeloid heterogeneity, immune suppression).
      • Tumor immunology groups interested in myeloid-driven checkpoint resistance.
      • Researchers working on mitochondrial stress signaling and immunometabolism.
      • Computational biologists building cross-species or multi-modal integration frameworks.

      Reviewer expertise and limitations

      Keywords: glioma microenvironment; macrophage/microglia biology; tumor immunology; single-cell/nucleus transcriptomics; computational ligand-receptor inference. Limitations: I am not a medicinal chemist, so I cannot deeply evaluate MR3 chemistry, PK/PD, or specificity beyond what is presented. I also did not evaluate the full web-portal implementation beyond the manuscript description.

    1. ORBIS database access (400M+ companies); Operating margin analysis; Total cost markup calculations; Custom search criteria; NACE code filtering; Independence verification; Statistical analysis included; Interquartile range calculation

      This should also state, in parentheses, that the price does not include the write-up.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In the current study, Huang et al. examined ACC response during a novel discrimination-avoid task. The authors concluded that ACC neurons primarily encode post-action variables over extended periods, reflecting the animal's preceding actions rather than the outcomes or values of those actions. Specifically, they identified two subgroups of ACC neurons that responded to different aspects of the actions. This work represents admirable efforts to investigate the role of ACC in task-performing mice. However, in my opinion, alternative explanations of the data were not sufficiently explored, and some key findings were not well supported.

      Strengths:

      The development of the new discrimination-avoid task is applauded. Single-unit electrophysiology in task-performing animals represents admirable efforts and the datasets are valuable. The identification of different groups of encoding neurons in ACC can be potentially important.

      Weaknesses:

      One major conclusion is that ACC primarily encodes the so-called post-action variables (specifically shuttle crossing). However, only a single example session was included in Figure 2, while in Supplementary Figure 2 a considerable fraction of ACC neurons appears to respond to either the onset of movement or ramp up their activity prior to movement onset. How did the authors reach the conclusion that ACC preferentially respond to shuttle crossing?

      We now include more example sessions and the main results from individual animals (Fig. 3; Figs. S2–S3; Fig. 8). Overall, the results are consistent across recording sessions and animals.

      While shuttle crossings were the primary reference for most analysis, using shuttle initiation as a reference led to similar conclusions (Fig.4). Namely, we found that most ACC neurons exhibit either robust (22%; Types 1a & 2a) or moderate (51%; Types 1b & 2b) post-shuttle activity changes (Fig.4), while only a subset exhibits ramping pre-shuttle activity (16%; Types 3b & 3c). Therefore, our conclusion was intended to highlight the role of post-shuttle activity in learning. While we do not exclude the possibility that pre-shuttle ACC activity contributes to learning, its involvement is likely more limited.

      In Figure 4, it was concluded that ACC neurons respond to action independent of outcome. Since these neurons are active on both correct and incorrect shuttle but not stay trials, they seem to primarily respond to overt movement. If so, the rationale for linking ACC activity and adaptive behavior/ associative learning is not very clear to me. Further analyses are needed to test whether their firing rates correlated with locomotion speed or acceleration/deceleration. On a similar note, to what extent are the action state neurons actually responding to locomotion-related signals? And can ACC activity actually differentiate correct vs. incorrect stays?

      In this study, we highlight two distinct groups of ACC neurons: action-state and action-content neurons. Both groups of neurons tend to show sustained activity even when the animals remain immobile after completing shuttle behaviors, suggesting that their activity is not directly driven by locomotion. Furthermore, action-content neurons are selectively engaged in only one of the two shuttle categories, either rooms A→B or B→A shuttles. Therefore, differences in neuronal activity are unlikely to reflect locomotor differences, given that both shuttle types involve similar movement patterns. Finally, we analyzed ACC neuronal activity in relation to locomotion speed. Our results indicate that only a small fraction of neurons (<15%) show speed-correlated activity (Fig.5), suggesting that most ACC neurons do not encode movement-related information. Taken together, these findings support the distinction between ACC activity and locomotion encoding.

      As for the small subset of speed-related neurons, it remains unclear whether these speed-related neurons represent a distinct subpopulation within the ACC or reflect recordings from the nearby motor cortex. Postmortem examination of the recording sites suggests that most neurons were recorded from the ACC, while a small subset may be located at the border between the ACC and motor cortex (Fig. S2). Therefore, it is possible that the small fraction of speed-related neurons originated from the motor cortex.

      Lastly, given that the ACC neurons display no or limited activity during stay trials, their activity generally does not differentiate correct vs. incorrect stays (Fig.S7). However, ACC activity does show moderate differentiation between room-A vs. room-B stays (Fig.S7).

      Given that a considerable amount of ACC neurons encode 'action content', it is not surprising that by including all neurons the model is able to make accurate predictions in Figure 6. How would the model performance change by removing the content neurons?

      We thank the reviewer for this thoughtful analysis idea. Excluding action-content neurons drastically reduces decoding accuracy (Fig.8), suggesting that they are the main drivers for differentiating rooms AB vs. BA shuttles.

      Moving on to Figure 7. Since Figure 4 showed that ACC neurons respond to movement regardless of outcome, it is somewhat puzzling how ACC activity can be linked to future performance.

      As discussed earlier (point #2), ACC activity does not simply reflect locomotion itself. We interpret the post-shuttle ACC activity as encoding both the preceding shuttle state (shuttle or stay) and shuttle content (rooms AB or BA). Regardless of the outcome (safety or shock), such encoding is essential for cue–action–outcome associative learning, because both positive and negative feedback can drive learning. The level of post-shuttle ACC activity may reflect task engagement, with greater engagement facilitating learning and improving future performance.

      Two mice contributed about 50% of all the recorded cells. How robust are the results when analyzing mouse by mouse?

      We have added further analyses highlighting the results for each mouse. Although the total number of recorded neurons varied across mice, the major findings were consistent. In every mouse, we observed sustained post-shuttle ACC activity (Fig.S2), and population-level ACC activity reliably decoded shuttle contents (rooms AB vs. BA; Fig.8).

      Lastly, the development of the new discrimination-avoid task is applauded. However, a major missing piece here is to show the importance of ACC in this task and what aspects of this behavior require ACC.

      We appreciate this feedback. We are currently conducting additional experiments to determine whether inhibiting ACC activity during distinct time windows disrupts task learning. We hope to publish a follow-up paper on these findings in the near future.

      Reviewer #2 (Public review):

      Summary:

      The current dataset utilized a 2x2 factorial shuttle-escape task in combination with extracellular single-unit recording in the anterior cingulate cortex (ACC) of mice to determine ACC action coding. The contributions of neocortical signaling to action-outcome learning as assessed by behavioral tasks outside of the prototypical reward versus non-reward or punished vs non-punished is an important and relevant research topic, given that ACC plays a clear role in several human neurological and psychiatric conditions. The authors present useful findings regarding the role of ACC in action monitoring and learning. The core methods themselves - electrophysiology and behavior - are adequate; however, the analyses are incomplete since ruling out alternative explanations for neural activity, such as movement itself, requires substantial control analyses, and details on statistical methods are not clear.

      Strengths:

      (1) The factorial design nicely controls for sensory coding and value coding, since the same stimulus can signal different actions and values.

      (2) The figures are mostly well-presented, labeled, and easy to read.

      (3) Additional analyses, such as the 2.5/7.5s windows and place-field analysis, are nice to see and indicate that the authors were careful in their neural analyses.

      (4) The n-trial + 1 analysis where ACC activity was higher on trials that preceded correct responses is a nice addition, since it shows that ACC activity predicts future behavior, well before it happens.

      (5) The authors identified ACC neurons that fire to shuttle crossings in one direction or to crossings in both directions. This is very clear in the spike rasters and population-scaled color images. While other factors such as place fields, sensory input, and their integration can account for this activity, the authors discuss this and provide additional supplemental analyses.

      Weaknesses:

      (1) The behavioral data could use slightly more characterization, such as separating stay versus shuttle trials.

      We appreciate this feedback. In the revised manuscript, we present data separating stay versus shuttle trials (Fig.1). Additionally, we provide new data from extended training sessions (Fig.S2).

      (2) Some of the neural analyses could use the necessary and sufficient comparisons to strengthen the authors' claims.

      We have now used the necessary and sufficient comparisons where applicable. In the SVM decoding analysis, we show that population ACC activity is sufficient to decode AB or BA shuttles. We also show that excluding action-content, but not other ACC neurons, drastically reduces decoding accuracy, suggesting that these neurons are necessary for the decoding (Fig.8).
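
The exclusion logic described in this response can be sketched in a few lines. This is an illustration only, not the authors' analysis: it swaps the SVM for a nearest-centroid classifier with leave-one-trial-out validation so that it runs with the standard library alone, and the firing rates are hypothetical toy values in which neuron 0 plays the role of an action-content neuron.

```python
def centroid(rows):
    """Mean firing-rate vector across a list of trials."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two rate vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(trials, keep):
    """Leave-one-trial-out accuracy using only the neuron indices in `keep`."""
    correct = 0
    for held_out, (rates, label) in enumerate(trials):
        train = [t for i, t in enumerate(trials) if i != held_out]
        centroids = {}
        for cls in {lab for _, lab in train}:
            rows = [[r[k] for k in keep] for r, lab in train if lab == cls]
            centroids[cls] = centroid(rows)
        probe = [rates[k] for k in keep]
        predicted = min(centroids, key=lambda c: sq_dist(probe, centroids[c]))
        correct += predicted == label
    return correct / len(trials)


# Hypothetical trials: (rates of neurons 0..2, shuttle direction).
# Neuron 0 separates AB from BA; neurons 1 and 2 carry no direction signal.
trials = [
    ([9, 5, 4], "AB"), ([8, 4, 5], "AB"), ([10, 5, 5], "AB"),
    ([2, 5, 4], "BA"), ([1, 4, 5], "BA"), ([3, 5, 5], "BA"),
]
full_acc = loo_accuracy(trials, keep=[0, 1, 2])
without_content = loo_accuracy(trials, keep=[1, 2])
without_other = loo_accuracy(trials, keep=[0, 2])
```

Comparing `full_acc` with `without_content` and `without_other` mirrors the sufficiency and necessity comparisons: decoding from the full population tests sufficiency, while the drop after removing one group (and not the other) tests necessity.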

      (3) Many of the neural analyses seem to utilize long time windows, not leveraging the very real strength of recording spike times. Specifics on the exact neural activity binning/averaging, tests, classifier validation, and methods for quantification are difficult to find.

      We chose to perform our neural analyses on a longer time scale, given the sustained activity we see in the data. To further justify that decision, we now provide additional results highlighting the sustained activity of ACC neurons in our task (Fig.2; Fig.S2). Additionally, we now provide more specifics of the neural analyses in the Methods section.

      (4) The neural analyses seem to suggest that ACC neurons encode one variable or the other, but are there any that multiplex? Given the overwhelming evidence of multiplexing in the ACC a bit more discussion of its presence or absence is warranted.

      This is an interesting point of discussion, and we thank the reviewer for pointing this out. Overall, our results suggest that individual ACC neurons preferentially engage in only one of the proposed functions, rather than multiplexing across them. For example, action-state and action-content ACC neurons primarily engage in action monitoring, but not in decision-making, planning, or outcome tracking. Nevertheless, we cannot rule out the possibility that other ACC neurons, through their distinct connectivity or location in different ACC subregions, engage in other proposed functions. Thus, when considering the ACC as a whole, its function may still be multiplexed.

      Another possible reason we do not see clear multiplexing of neurons may be due to the dynamic nature of our task. Unlike established tasks that often assign fixed positive or negative values to cues, the cues in our task are not inherently associated with valence. Instead, their meaning is dynamically determined by the animal’s location (context) at the time of cue presentation. Since values are not fixed and change based on context, value-related responses may not be reflected in the ACC in our tasks.

      We have now incorporated the above discussions into our revised manuscript.

      Reviewer #3 (Public review):

      Summary:

      The authors record from the ACC during a task in which animals must switch contexts to avoid shock as instructed by a cue. As expected, they find neurons that encode context, with some encoding of actions prior to the context, and encoding of neurons post-action. The primary novelty of the task seems to be dynamically encoding action-outcome in a discrimination-avoidance domain, while this is traditionally done using operant methods. While I'm not sure that this task is all that novel, I can't recall this being applied to the frontal cortex before, and this extends the well-known action/context/post-context encoding of ACC to the discrimination-avoidance domain.

      While the analysis is well done, there are several points that I believe should be elaborated upon. First, I had questions about several details (see point 3 below). Second, I wonder why the authors downplayed the clear action coding of ACC ensembles. Third, I wonder if the purported 'novelty' of the task (which I'm not sure of) and pseudo-debate on ACC's role undermines the real novelty - action/context/outcome encoding of ACC in discrimination-avoidance and early learning.

      Strengths:

      Recording frontal cortical ensembles during this task is particularly novel, and the analyses are sophisticated. The task has the potential to generate elegant comparisons of action and outcome.

      Weaknesses:

      I had some questions that might help me understand this work better.

      (1) I wonder if the field would agree that there is a true 'debate' and 'controversy' about the ACC and conflict monitoring, or if this is a pseudodebate (Line 34). They cite 2 very old papers to support this point. I might reframe this in terms of the frontal cortex studying action-outcome associations in discrimination-avoidance, as the bulk of evidence in rodents comes from overtrained operant behavior, and in humans comes from high-level tasks, and humans are unlikely to get aversive stimuli such as shocks.

      We appreciate this feedback. We have revised the Introduction and Discussion.

      (2) Does the purported novelty of the task undermine the argument? While I don't have an exhaustive knowledge of this behavior, the novelty involves applying this to the ACC. There are many paradigms where a shock triggers some action that could be antecedents to this task.

      We argue our newly designed discrimination–avoidance task is unique for several reasons. First, it requires animals to discriminate both sensory cues and environment contexts. Unlike established tasks that often assign fixed positive or negative values to cues, the cues in our task are not inherently associated with valence. Instead, their meaning is dynamically determined by the animal’s location (context) at the time of cue presentation, which reflects a conceptual advance over previous techniques. Furthermore, by removing valence from the cues, this design helps disentangle the ACC’s potential role in value encoding from other cognitive functions.

      Second, this task involves robust, ethologically relevant actions (i.e., shuttles), unlike many established paradigms that rely on less naturalistic behaviors such as saccades or lever presses. We view this as a key distinction from prior approaches, as even previous paradigms that utilize shuttling responses or other naturalistic responses fail to incorporate dynamic integration of cues and contexts.

      Finally, the clear temporal separation between actions and outcomes further helps disentangle the ACC’s roles in action monitoring vs. outcome tracking.

      (3) The lack of details was confusing to me:

      (a) How many total mice? Are the same mice in all analyses? Are the same neurons? Which training day? Is it 4 mice in Figure 3? Five mice in line 382? An accounting of mice should be in the methods. All data points and figures should have the number of neurons and mice clearly indicated, along with a table. Without these details, it is challenging to interpret the findings.

      We are sorry for the confusion. We now provide additional details and clear N numbers for each analysis to improve clarity.

      (b) How many neurons are from which stage of training? In some figures, I see 325, in some ~350, and in S5/S2B, 370. The number of neurons should be clearly indicated in each figure, and perhaps a table.

      All data were obtained from well-trained mice. For some analyses, the N is smaller because certain task sessions contained very few incorrect trials (≤3), which prevented us from examining ACC activity during those trials. We have modified the figure legends so that the neuron count is clear.

      (c) Were the tetrodes driven deeper each day? The depth should be used as a regressor in all analyses?

      Yes, the tetrodes were driven slightly deeper across task sessions (~80 µm per step; 2–4 depths per mouse). Given limited depth changes, preliminary analyses indicate no clear differences in ACC activity across these recording depths. However, we cannot rule out potential dorsal–ventral subregion differences if recordings were to span larger depth ranges.

      (d) Was it really ACC (Figure 2A)? Some shanks are in M2? All electrodes from all mice need to be plotted as a main figure with the drive length indicated.

      We have now included a supplementary figure showing all recording sites (Fig.S2). It is likely that a small subset of neurons was recorded at the ACC/M2 border area. Unfortunately, we are unable to separate them out due to the blind recording design of our tetrode arrays.

      (e) It's not clear which sessions and how many go into which analysis

      We have now specified the number of task sessions for each analysis (see Methods).

      (f) How many correct and incorrect trials (<7?) are there per session?

      We have now specified the number of correct and incorrect trials per session (see Methods).

      (g) Why 'up to 10 shocks' on line 358? What amplitudes were tried? What does scrambled mean?

      We decided to use up to 10 mild shocks per trial because mice do not necessarily shuttle to the safe room after one or even a few shocks during the early stages of training. This design allows mice to efficiently learn the concept of the task (i.e., one room is safe while the other delivers shocks). Each shock was specified in the Methods section as 0.5 mA, 0.1 s. A “scrambled shock” refers to an electric shock delivered through multiple floor bars in a randomized pattern, effectively preventing the animal from avoiding the stimulus.

      (4) Why do the authors downplay pre-action encoding? It is clearly evident in the PETHs, and the classifiers are above chance. It's not surprising that post-shuttle classification is so high because the behavior has occurred. This is most evident in Figure S2B, which likely should be a main figure.

      We did not intend to downplay pre-action encoding. Our analysis shows that most ACC neurons exhibit either robust (22%; Types 1a & 2a) or moderate (51%; Types 1b & 2b) post-shuttle activity changes (Fig.4). Although a subset of ACC neurons exhibits ramping pre-shuttle activity, they represent a much smaller fraction (16%; Types 3b & 3c). Therefore, our conclusion was intended to highlight the role of post-shuttle activity in learning. While we do not exclude the possibility that pre-shuttle ACC activity contributes to learning, its involvement is likely more limited.

      (5) The statistics seem inappropriate. A linear mixed effects model accounting for between-mouse variance seems most appropriate. Statistical power or effect size is needed to interpret these results. This is important in analyses like Figure 7C or 6B.

      We appreciate this feedback. We now use appropriate statistics and report effect size.
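
For readers unfamiliar with the request in point (5), the following is a minimal sketch of one way to respect between-mouse structure and report an effect size. It is an assumption-laden illustration, not the authors' statistics and not the linear mixed-effects model the reviewer asks for: it simply averages observations within each mouse and then computes Cohen's d with a pooled standard deviation. The mouse labels and values are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def per_mouse_means(observations):
    """Collapse (mouse_id, value) pairs to one mean value per mouse."""
    by_mouse = defaultdict(list)
    for mouse_id, value in observations:
        by_mouse[mouse_id].append(value)
    return [mean(values) for _, values in sorted(by_mouse.items())]


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical post-shuttle activity (arbitrary units) in two conditions.
condition_a = [("m1", 3.0), ("m1", 5.0), ("m2", 5.0), ("m3", 6.0), ("m3", 6.0)]
condition_b = [("m1", 2.0), ("m2", 3.0), ("m2", 3.0), ("m3", 4.0)]

d = cohens_d(per_mouse_means(condition_a), per_mouse_means(condition_b))
```

Averaging within mouse discards within-animal information, which a mixed-effects model with mouse as a random effect would retain; the sketch only makes the unit of replication explicit.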

      (6) Better behavioral details might help readers understand the task. These can be pulled from Figures S2 and S5. This is particularly important in a 'novel' task.

      We now provide more details to help better understand the task and have added new figures (Fig.1; Figs. S1&S2).

      (7) Can the authors put post-action encoding on the same classification accuracy axes as Figure 6B? It'd be useful to compare.

      We appreciate the comment, but we are unsure what clarification is being requested.

      (8) What limitations are there? I can think of several - number of animals, lack of causal manipulations, ACC in rodents and humans.

      We now include a discussion of the limitations of our study. One caveat of our study is that the discrimination–avoidance task requires weeks of training in mice. By the time they master the task, ACC activity may reflect modified neural circuits. Investigating ACC activity during the early phase of learning, such as by introducing a new pair of cues or contexts, could provide further insights into ACC’s role in learning and cognitive processes. Additionally, a limitation of the current study is the lack of evidence for the causal role of post-action ACC activity in complex associative learning. Future investigations using closed-loop strategies to selectively disrupt ACC activity during the post-action phase could help address this question.

      Minor:

      (1) Each PCA analysis needs a scree plot to understand the variance explained.

      We have added a scree plot for each PCA analysis.

      (2) Figure 4C - y and x-axes have the same label?

      We have corrected the y-axis label.

      (3) What bin size do the authors use for machine learning (Not clear from line 416)?

      The bin sizes used were 2.5, 5, 7.5, or 10 s, which are now described in the Methods section.

      (4) Why not just use PCA instead of 'dimension reduction' (of which there are many?)

      We have adjusted the phrasing where appropriate.

      (5) Would a video enhance understanding of the behavior?

      We appreciate this feedback. We now include a few videos to accompany our paper.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Is Figure 1C sufficiently powered?

      We have now included data from additional mice and updated the figure accordingly.

      (2) Task performance was not plateaued after 10 sessions in Figure 1B. How variable is task performance in the datasets with ephys recordings (session to session, mouse to mouse).

      We have now included additional data from extended training (15 sessions; Fig.S2). Moderate variations across both sessions and mice are observed. Specifically, the total number of correct/incorrect shuttles used for ephys analysis are 19/5, 19/4, 21/5, 20/4 (mouse #1; 4 sessions); 20/7, 23/7, 20/7 (mouse #2; 3 sessions); 19/4, 16/2 (mouse #3; 2 sessions); 26/4, 23/4, 17/6, 25/5 (mouse #4; 4 sessions); 20/5, and 17/4 (mouse #5; 2 sessions), respectively.

      (3) Please quantify the results in Figure 3, for both within individual mice and across mice.

      We have calculated maximum trajectory length within the 3-D space (Fig. 3C).

      (4) What is the effect size in Figure 7C?

      We now report the effect size.

      (5) Please provide more details for spike sorting.

      We have now included more details in the Methods section.

      (6) More detailed cell type or correlation analysis in Figures 4 and 5 may be helpful. For example, if putative regular and fast-spiking neurons were simultaneously recorded, did the FS directly inhibit the RS to give rise to the apparent encoding properties?

      We recorded a small number of putative interneurons (n = 13) from only three mice, which precludes drawing meaningful conclusions, particularly given their heterogeneous responses during discrimination–avoidance tasks. Accordingly, we include only an example interneuron demonstrating discrimination between AB vs. BA shuttles (Fig. S5). Nevertheless, there is evidence of reciprocal monosynaptic connections between putative interneurons and certain pyramidal neurons, as indicated by short-latency (~2 ms) excitatory or inhibitory interactions (Fig. S5). That said, follow-up studies with larger sample sizes are needed to work out these details.

      Reviewer #2 (Recommendations for the authors):

      (1) While I appreciate displaying the success rate for the sake of simplifying behavioral data in Figure 1B, it would be nice to also see these data broken out as correct vs incorrect for stay vs shuttle trials, since it is difficult to determine whether the performance increases are primarily driven by mice improving at stay vs shuttle responses

      We appreciate this feedback. In the revised manuscript, we present data separating stay versus shuttle trials (Fig.1; Fig.S2).

      (2) In Figure 2 the comparison between shuttle and stay is not particularly convincing, since the comparison is also essentially movement vs no movement and place1-->place2 vs place1-->place1. A more appropriate comparison might be action state neurons vs action content neurons during A-->B, B-->A, or both crossings. If it is true that these populations contain this information, then action state neurons should traverse a large component space in both directions, action content neurons only one direction, and so on.

      We agree that the comparison is not ideal due to differences in locomotion. However, it provides valuable information suggesting that the ACC plays a limited role during stay trials, even though these trials involve mental and cognitive processes comparable to those in shuttle trials. While we appreciate the reviewer’s suggestion, the proposed analysis would not be reliable given the relatively small number of simultaneously recorded action-state or action-content neurons.

      (3) I would say the above point applies to Figure 3 as well. I would also note that this reviewer greatly appreciates the rigor of showing ensemble activity in each subject.

      We appreciate this comment. See our response above.

      (4) In Figure 5 do these neurons show the same A-->B vs B-->A firing patterns during correct vs incorrect shuttles? The text describing the data in Figure 4 suggests this should be the case but even from a quick glance it sort of seems like the population dynamics during correct vs incorrect shuttles are not the same. My concern is that averaging neural activity over 5s windows washes out all these dynamics

      Preliminary analysis suggests that these firing patterns apply to both correct and incorrect shuttles. However, the main reason we did not compare correct and incorrect trials is the limited amount of data. In many sessions, there are only a few (≤5) incorrect shuttles, which include both AB and BA shuttles (Fig. 1C; Fig. S2), so the data lack the statistical power for a meaningful comparison.

      (5) Some information on classifier validation is required - was this leave-out validation and if so how many trials were left-out vs tested? K-fold, and if so, how many folds? Was the trial order shuffled for each simulation? Classifiers will pick up within-session temporal information. In addition to this classifier accuracy during the different time points should be compared by a non-parametric test, and compared to the 95th percentile of the label-shuffled distribution.

      Yes, we use standard 10-fold cross-validation. We appreciate the suggestion on trial-order shuffling; implementing this procedure did not change our original conclusion. Additionally, we have applied a non-parametric test.
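
The validation scheme discussed in this exchange (10-fold cross-validation, shuffled trial order, and comparison against the 95th percentile of a label-shuffled distribution) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the nearest-centroid classifier, the synthetic data, and all function names are assumptions introduced here, since the response does not state which classifier was used.

```python
import random


def nearest_centroid_predict(train_x, train_y, test_x):
    """Assign each test trial to the class with the closest mean vector."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    predictions = []
    for x in test_x:
        predictions.append(min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        ))
    return predictions


def kfold_accuracy(x, y, k=10, rng=None):
    """k-fold cross-validated accuracy with shuffled trial order."""
    rng = rng or random.Random(0)
    order = list(range(len(x)))
    rng.shuffle(order)  # folds no longer follow within-session time
    folds = [order[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in order if i not in held_out]
        predictions = nearest_centroid_predict(
            [x[i] for i in train_idx], [y[i] for i in train_idx],
            [x[i] for i in fold],
        )
        correct += sum(p == y[i] for p, i in zip(predictions, fold))
    return correct / len(x)


def shuffled_label_threshold(x, y, n_shuffles=200, percentile=95, k=10, seed=1):
    """Approximate percentile of accuracies obtained with permuted labels."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_shuffles):
        permuted = y[:]
        rng.shuffle(permuted)
        null.append(kfold_accuracy(x, permuted, k=k, rng=rng))
    null.sort()
    return null[min(len(null) - 1, int(len(null) * percentile / 100))]


# Synthetic example: 40 trials, 5 hypothetical neurons, two shuttle types.
data_rng = random.Random(42)
labels = ["AB"] * 20 + ["BA"] * 20
features = [
    [data_rng.gauss(2.0 if lab == "AB" else 0.0, 1.0) for _ in range(5)]
    for lab in labels
]
accuracy = kfold_accuracy(features, labels, k=10, rng=random.Random(0))
threshold = shuffled_label_threshold(features, labels)
print(accuracy, threshold)
```

Decoding would be called significant in this sketch when `accuracy` exceeds `threshold`; the same null distribution could also feed a permutation p-value.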

      (6) How exactly were neurons classified as content vs state? Was it the average activity during the 5s following the shuttle? If this is stated I could not really find it easily so I might suggest clarifying.

      We now use a new method for classification of the two neuron types (Fig.7). We have included detailed methods in the revised manuscript.

      (7) Movement drives cortical neuron activity more than anything else I have ever seen. Really, more than anything else, it would be nice to demonstrate that it is not movement alone or movement multiplexed with place/sensory information/direction driving these responses.

      We have analyzed ACC neuronal activity in relation to locomotion speed. Our results indicate that only a small fraction of ACC neurons (<15%) show speed-correlated activity (Fig.5). It remains unclear whether these speed-related neurons represent a distinct subpopulation within the ACC or reflect recordings from nearby motor cortex. Postmortem examination of the recording sites suggests that most neurons were recorded from the ACC, while a small subset may be located at the border between the ACC and motor cortex. Therefore, it is possible that the small fraction of speed-related neurons originated from the motor cortex.

      Furthermore, we identify two distinct groups of ACC neurons: action-state and action-content neurons, both of which tend to show sustained activity even when the animals remain immobile after completing shuttle behaviors. This prolonged activation in the absence of movement suggests that their activity is not directly driven by locomotion. Moreover, action-content neurons are selectively engaged in only one of the two shuttle categories, either AB or BA shuttles. Therefore, differences in neuronal activity are unlikely to reflect locomotor differences, given that both shuttle types involve similar movement patterns.

      (8) In addition to the above, the place-field analysis in Supplemental Figure 5 only shows 4 neurons. Was the whole population analyzed? Is it possible to decode place from the population during the ITI? The data in this figure sort of look exactly like place fields - many cortical neurons and also some hippocampal neurons have more than 1 place field

      We have now provided additional place-field analysis. A comparison with hippocampal CA1 neurons (recorded during the same task) suggests that ACC neurons encode limited spatial information.

      (9) "a simple Pavlovian association strategy is unlikely to be sufficient for learning the task" ... is Pavlovian occasion setting not a simple association? Tones and contexts both readily act as Pavlovian occasion setters. Similarly positive/negative patterning might also explain how the task is learned.

      We appreciate this comment and have revised the sentence accordingly. It is possible that animals use multiple strategies to learn and perform the task effectively. In the early stages, animals may rely more heavily on sensory–spatial integration, whereas in later stages, sensory- or location-related Pavlovian associative strategies may contribute to performance, particularly when animals begin to show place preferences during inter-trial intervals.

      (10) I might suggest softening this language and others like it. For example, 2x2 factorial designs are not really novel.

      We have revised the language used to describe the task.

      (11) Some of the color-scale bars and figures do not have labels. For example, Supplementary Figure 3, Supplementary Figure 5. Please add labels.

      We have added the missing labels to all color bars.

      Reviewer #3 (Recommendations for the authors):

      (1) Some relevant papers that should be cited:

      https://doi.org/10.1523/JNEUROSCI.4450-08.2008

      10.1016/j.neuron.2018.11.016

      https://doi.org/10.1016/j.jphysparis.2014.12.001

      We appreciate these suggestions.

      (2) Where can we download the data and code?

      We will upload the essential data and MATLAB code to GitHub to accompany the publication of the final version of this paper.

  2. Feb 2026
    1. Here’s the kicker: none of this costs anything at runtime. Mojo's compile-time metaprogramming means the abstractions exist only in the source code. The GPU assembly the compiler produces is identical to what you'd get from a hand-written, monolithic kernel

      Unique kernels for each program ?

    1. I played a lot when I was younger. That’s when it all started. But when I started to code seriously at the age of 15 or 16, I played games less. I became interested in games in a different way. It was more about dismantling games and learning how they worked. Of course, I still try a lot of games, but I don’t necessarily play them for their entertainment value. It’s more about research work, has been for the past twenty years. It’s a bit of a different starting point compared to actual ‘gamers’ (Hugo, CEO).

      But to give a purpose to games, to make them personally meaningful, to reflect on them thereafter, or to "read" them "attentively", needn't be a productivist activity. Eudaimonic games, thinky games, provocative games... these encourage a different kind of approach, one focused on learning, not immediate (competitive) pleasures.

    1. 3.4.1. Implementation with R Shiny

      Since this book is aimed at advanced R users, I wonder whether this chapter is really needed at this level of detail. Alternatively, could this be a point at which the reader's level of proficiency is 'tested'? Simply provide the complete code and, below it, comprehension questions about the code. If the questions are answered correctly, both the reader and we know that the basic knowledge is there. If the person does not understand the code, the required background is missing and we can point to suitable learning resources!

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Using single-unit recording in 4 regions of non-human primate brains, the authors tested whether these regions encode computational variables related to model-based and model-free reinforcement learning strategies. While some of the variables seem to be encoded by all regions, there is clear evidence for stronger encoding of model-based information in the anterior cingulate cortex and caudate.

      Strengths:

      The analyses are thorough, the writing is clear, and the work is well-motivated by prior theory and empirical studies.

      Weaknesses:

      My comments here are quite minor.

      The correlation between transition and reward coefficients is interesting, but I'm a little worried that this might be an artifact. I suspect that reward probability is higher after common transitions, due to the fact that animals are choosing actions they think will lead to higher reward. This suggests that the coefficients might be inevitably correlated by virtue of the task design and the fact that all regions are sensitive to reward. Can the authors rule out this possibility (e.g., by simulation)?

      We fully agree with the reviewer that the task design has in-built correlations between transition and reward, and thus the correlation between neural selectivity for feedback and transition (Figure 3E) may be due to the different reward expectation after common or rare transitions. We did try to make this point in the manuscript:

      This suggests that the brain treats being diverted away from your current objective equivalent to losing reward, which is sensible as the subject would normally expect lower rewards on rare trials if their reward-seeking behaviour was efficient.

      We’ve now updated the wording of this statement to try and better make this point and avoid confusion that any non-reward-related encoding is involved:

      “As the reward expectation will be higher on common compared to rare trials, this demonstrates that the brain encodes being diverted to an area with a lower reward expectation equivalent to actually receiving a low reward (and vice versa).”

      We have also adjusted the significance test of this correlation to use a circular permutation test that accounts for correlations between the regressors. This test still found there to be significant correlation in all areas.

      We have described this new permutation test in Methods:

      “For comparing correlations between weights for different features (i.e., between transition and reward coding, Figure 3E), the null distribution of correlations observed in circularly shifted data was compared to the correlation seen in the actual data. This accounts for any correlations between features that existed in the task by preserving the structure of the design matrices.”
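
A circular permutation test of this kind can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: each hypothetical neuron gets one slope per regressor from a simple regression (the manuscript uses GLMs), the neural data are circularly shifted across trials while the regressors stay fixed, and the data, shift limits, and function names are all invented for the example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(y, x):
    """Least-squares slope of y regressed on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def pearson(u, v):
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den


def coefficient_correlation(rates_t, rates_f, transition, reward):
    """Across-neuron correlation of transition and reward coefficients."""
    beta_t = [slope(r, transition) for r in rates_t]
    beta_r = [slope(r, reward) for r in rates_f]
    return pearson(beta_t, beta_r)


def circular_shift(values, k):
    k %= len(values)
    return values[-k:] + values[:-k]


def circular_permutation_test(rates_t, rates_f, transition, reward,
                              n_perm=200, min_shift=10, seed=0):
    """Shift neural data in time; regressors (and their correlation) stay intact."""
    rng = random.Random(seed)
    n_trials = len(transition)
    observed = coefficient_correlation(rates_t, rates_f, transition, reward)
    null = []
    for _ in range(n_perm):
        shifted_t, shifted_f = [], []
        for rt, rf in zip(rates_t, rates_f):
            k = rng.randrange(min_shift, n_trials - min_shift)
            shifted_t.append(circular_shift(rt, k))
            shifted_f.append(circular_shift(rf, k))
        null.append(coefficient_correlation(shifted_t, shifted_f, transition, reward))
    p = (1 + sum(abs(r) >= abs(observed) for r in null)) / (n_perm + 1)
    return observed, p


# Synthetic data: reward is more likely after common (+1) than rare (-1) transitions.
rng = random.Random(7)
n_trials, n_neurons = 120, 20
transition = [1.0 if rng.random() < 0.7 else -1.0 for _ in range(n_trials)]
reward = [1.0 if rng.random() < (0.7 if t > 0 else 0.3) else -1.0 for t in transition]
gains = [rng.gauss(0, 1) for _ in range(n_neurons)]
rates_transition = [[g * t + rng.gauss(0, 1) for t in transition] for g in gains]
rates_feedback = [[g * r + rng.gauss(0, 1) for r in reward] for g in gains]
observed_r, p_value = circular_permutation_test(
    rates_transition, rates_feedback, transition, reward)
print(round(observed_r, 3), round(p_value, 4))
```

Because the regressors are never permuted, any correlation between transition and reward built into the task is present in every null sample, which is the property the quoted Methods text emphasizes.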

      And updated the text in Results accordingly:

      “All regions, but particularly ACC, encoded a common transition (at the time of transition) similar to a high reward (at the time of feedback), as there was a positive correlation between the coefficients for reward and transition (the transition parameter was signed such that common and rare transitions were equivalent to high and low rewards, respectively) (ACC r=0.4963, DLPFC r=0.3273, caudate r=0.4712, putamen, r=0.5052; all p<0.002 except DLPFC where p=0.006, circular permutation test; Figure 3E, S5).”

      The explore/exploit section seems somewhat randomly tacked on. Is this really relevant? If yes, then I think it needs to be integrated more coherently.

      We thank the reviewer for this comment. We agree that the motivation for the explore/exploit analysis was not sufficiently clear in the original version.

      Our aim was not to introduce this as a separate or tangential effect, but rather to highlight how the task’s reward structure (with outcome levels stable for 5–9 trials) naturally created alternating periods favoring exploitation of a known high-value option versus exploration when outcomes changed. This feature of the task is tightly linked to MB-RL computations, as it requires integration of state-transition knowledge and updating across trials.

      Importantly, we show previously in the manuscript that ACC encoded state-transition structure (i.e., common versus rare transition) and MB-value estimates (at choice epoch). However, here we aimed to highlight that the same region also modulated choice encoding as a function of whether the subject was in an exploratory or exploitative regime – by knowing another feature of the task that relies on state-transition and outcome. We have revised this section to better integrate it into the main logic of the paper:

      “In our task, the outcome level (high, medium, low) of each second-stage stimulus remained the same for 5-9 trials before potentially changing. This design naturally created periods where subjects could ‘exploit’ the same Choice 1 to maximize reward for several trials; and other periods where they had to ‘explore’ different second-stage stimuli to optimize reward (as contingencies shifted). In classical MB-RL, the transition between reward states can be learned by keeping counts of observed transitions from a current state-action pair to a subsequent state, yielding a maximum-likelihood estimate of the environment’s dynamics [42]. In fact, knowledge about the reward contingency schedule could support decision-making in both exploitation – by enabling efficient choice when rewards are stable; and exploration – by guiding alternative behaviour most likely to yield improved outcomes (this is different from MF learning, where exploration is more random since the agent lacks explicit state-transition knowledge).

      We thus repeated our decoding analysis of choice 1 stimulus identity, but this time limited trials to those where they had not received a high reward for the previous two trials (‘explore’ trials), and those where the previous two rewards had been the highest level (‘exploit’ trials). All regions encoded choice 1 for some duration of the choice epoch for both explore (p<0.002 in all cases, permutation test; Figure 7A) and exploit (p<0.002 in all cases; Figure 7B) conditions, but decoding accuracy was strongest in ACC. Choice 1 was less strongly decoded – particularly in ACC – in the former condition compared to the latter (p<0.002 for at least 140 ms in all cases, permutation test on differences observed; Figure 7C); and, also during exploitation, the ACC encoded choice 1 before the choice was even presented to the subject (Figure S8). This pre-choice ACC encoding in exploit trials may reflect the need to allocate cognitive (or attentive) resources to features – i.e., choice 1 stimulus identity – that are most certain predictors of important outcomes. As a control, we also decoded the direction of the Choice 1 (where choice was indicated via joystick movement), which was randomised each trial and therefore orthogonal to the stimulus that was chosen. Again, all four regions encoded its direction in both explore (p<0.002 in all cases; Figure 7D) and exploit (p<0.002 in all cases; Figure 7E). However, there were minimal differences in the strength of the representation between explore and exploit conditions (ACC, p=0.088, cluster-based permutation test; DLPFC p=0.016; caudate p=0.32; putamen p=1; Figure 7F). Therefore, exploit behaviour specifically upregulated relevant task parameters that were worth remembering across trials.”
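
The count-based transition learning mentioned in the quoted passage (keeping counts of observed transitions to obtain a maximum-likelihood estimate of the environment's dynamics) can be sketched as follows. The class name, state labels, and the 7-to-3 split of observations are illustrative assumptions; the task's actual transition probabilities are not given in this response.

```python
from collections import defaultdict


class TransitionModel:
    """Maximum-likelihood transition estimates from observed counts."""

    def __init__(self):
        # counts[(state, action)][next_state] -> number of observations
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, state, action, next_state):
        self.counts[(state, action)][next_state] += 1

    def probability(self, state, action, next_state):
        """Estimated P(next_state | state, action); None if never observed."""
        outcomes = self.counts.get((state, action))
        if not outcomes:
            return None
        total = sum(outcomes.values())
        return outcomes.get(next_state, 0) / total


model = TransitionModel()
# Hypothetical experience: choice "A" led to state "S1" 7 times and "S2" 3 times.
for next_state in ["S1"] * 7 + ["S2"] * 3:
    model.observe("stage1", "A", next_state)

print(model.probability("stage1", "A", "S1"))
print(model.probability("stage1", "A", "S2"))
```

Each estimate is simply the count for an outcome divided by the total count for that state-action pair, so the estimates sharpen as more transitions are observed.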

      Reviewer #2 (Public review):

      Summary:

      The authors investigate single-neuron activity in rhesus macaques during model-based (MB) and model-free (MF) reinforcement learning (RL). Using a well-established two-step choice task, they analyze neural correlates of MB and MF learning across four brain regions: the anterior cingulate cortex (ACC), dorsolateral PFC (DLPFC), caudate, and putamen. The study provides strong evidence that these regions encode distinct RL-related signals, with ACC playing a dominant role in MB learning and caudate updating value representations after rare transitions. The authors apply rigorous statistical analyses to characterize neural encoding at both population and single-neuron levels.

      Strengths:

      (1) The research fills a gap in the literature, which has been limited in directly dissociating MB vs. MF learning at the single unit level and across brain areas known to be involved in reinforcement learning. This study advances our understanding of how different brain regions are involved in RL computations.

      (2) The study used a two-step choice task (Miranda et al., 2020), which was previously established for distinguishing MB and MF reinforcement learning strategies.

      (3) The use of multiple brain regions (ACC, DLPFC, caudate, and putamen) in the study enabled comparisons across cortical and subcortical structures.

      (4) The study used multiple GLMs, population-level encoding analyses, and decoding approaches. With each analysis, they conducted the appropriate controls for multiple comparisons and described their methods clearly.

      (5) They implemented control regressors to account for neural drift and temporal autocorrelation.

      (6) The authors showed evidence for three main findings:

      (a) ACC as the strongest encoder of MB variables from the four areas, which emphasizes its role in tracking transition structures and reward-based learning. The ACC also showed sustained representation of feedback that went into the next trial.

      (b) ACC was the only area to represent both MB and MF value representations.

      (c) The caudate selectively updates value representations when rare transitions occur, supporting its role in MB updating.

      (7) The findings support the idea that MB and MF reinforcement learning operate in parallel rather than strictly competing.

      (8) The paper also discusses how MB computations could be an extension of sophisticated MF strategies.

      Weaknesses:

      (1) There is limited evidence for a causal relationship between neural activity and behavior. The authors cite previous lesion studies, but causality between neural encoding in ACC, caudate, and putamen and behavioral reliance on MB or MF learning is not established.

      We agree with the reviewer that the present study does not establish causal relationships, and we do not claim otherwise in the manuscript. Our work was designed as a comprehensive characterization of neural activity across ACC, DLPFC, caudate, and putamen during reward-seeking decision-making. By systematically comparing MB- and MF- RL signals across these regions, we provide new insights into the division of labor and cooperative interactions within cortico-striatal networks.

      While causal manipulations (e.g., lesions, inactivations, stimulation) are indeed required to directly establish necessity or sufficiency, correlational studies such as ours play a crucial role in identifying where and how computationally relevant signals are represented. Importantly, our findings align with and extend prior causal work, for example showing that ACC and striatal lesions disrupt MB control. Thus, our study contributes a detailed functional mapping of MB and MF RL encoding across multiple nodes of this circuit, which serves as an important foundation for future causal investigations (e.g., using transcranial ultrasound stimulation).

      (2) There is a heavy emphasis on ACC versus other areas, but it is unclear how much of this signal drives behavior relative to the caudate.

      We appreciate the reviewer's observation regarding this matter. Our intention was not to place a heavy emphasis on ACC, rather this came naturally from the data. The ACC demonstrated considerably more robust and enduring neural activity compared to other brain regions – for instance, reward-related signals in the ACC continued well beyond individual trials (Fig. 2A-B), and encoding of state transitions remained active from the initial transition through to the feedback phase (Fig. 3A-B). By comparison, distinctions among other regions were less pronounced, which naturally resulted in the ACC receiving greater attention in our analytical findings.

      We acknowledge that the caudate plays an essential and complementary role in driving behavior, and we believe that this is emphasized in the two key subsections of our “Results”. First, caudate neurons encoded model-based choice values (Fig. 4A, 4C) and uniquely remapped these values following rare transitions (Fig. 5), reflecting flexible adjustment of action values. Second, decoding analyses showed that both ACC and caudate populations predicted first-stage choices (Fig. 6C), linking their activity directly to behavioral decisions. In the Discussion section, we also highlight that “the distinctive caudate signal of updating (flipping) the value estimates of the currently experienced option on rare trials” goes beyond a “general temporal-difference RPE” and rather supports “the role of caudate in MB valuation”.

      (3) The role of the putamen is somewhat underexplored here.

      Our analyses were conducted in an identical manner across all four recorded regions (ACC, DLPFC, caudate, and putamen), and we consistently reported the results for putamen alongside the others. For example, in the Results section we describe how “both caudate and putamen encoded the reward from the previous trial negatively during the feedback period of the current trial” (Fig. 2F-G), and that “all regions had a significant population of neurons that encoded MB-, but not MF-, derived value” including putamen (Fig. 4F). Similarly, we show that putamen, like caudate, encoded a dopamine-like RPE signal at feedback (“both caudate and putamen neurons clearly responded at feedback with the parametric features of a dopamine-like RPE”; Discussion). These findings align with previous work linking the putamen to MF learning and are discussed explicitly in the context of MF-MB dissociations. We therefore believe that the putamen was not underexplored, but rather that its contribution was more circumscribed relative to ACC and caudate because the signals observed were quantitatively weaker and less distinctive for MB computations.

      (4) The authors mention the monkeys were overtrained before recording, which might have led to a bias in the MB versus MF strategy.

      We agree that extensive training can influence the balance between MB and MF in choice behaviour and neuronal responses.

      In a previous comprehensive behavioral analysis of the same dataset (Miranda et al., 2020, PLoS Computational Biology - ref. 36, Figure S6B) we showed that both MB and MF strategies contributed to behavior, with MB dominance stable across weeks of testing – supporting that overtraining did not eliminate MF influences (but rather stabilized a mixed strategy with robust MB contributions).

      In the same manuscript, we have also: i) cautioned the readers when comparing our results to data from the original human studies; ii) acknowledged that our extensive training cannot address earlier phases of learning in which sensitivity to the task structure is first acquired; and iii) also provided task-related reasons for such MB dominance – as training made the transition structure well learned (making MB computationally less costly and faster to implement) and the non-stationary outcomes favored the flexibility of MB strategies.

      In the present manuscript, we also have acknowledged that overtraining may have shifted neural signals toward stronger MB representations, or alternatively enabled more sophisticated task representations:

      “On the other hand, MF-based estimates were neither as striking nor as specific to striatal regions as expected and observed in previous studies [18]. The monkeys were extensively trained on the task before recordings commenced, which may have caused a shift towards both MB behaviour and MB value representation within the striatum. Alternatively, this training may have allowed more sophisticated representations to occur, such as using latent states to expand the task space [54].”

      Importantly, we strongly believe that this possibility does not detract from our main finding that both MB and MF signals were present across regions, with ACC showing the strongest multiplexing of the two.

      (5) The GLM3 model combines MB and MF value estimates but does not clearly mention how hyperparameters were optimized to prevent overfitting. While the hybrid model explains behavior well, it does not clarify whether MB/MF weighting changes dynamically over time.

      We appreciate this comment and would like to note that, for completeness, we have on several occasions directed the reader to our prior behavioural analysis of the same dataset (Miranda et al., 2020, PLoS Computational Biology, ref 36). In that work, we provide a full and detailed description of both the task and the computational modeling approach (see particularly the “Model fitting procedures” section). Furthermore, our model-fitting was grounded in the MF/MB RL framework used in the original human two-step study (Daw et al., 2011); and the fitting procedures also followed previous studies (Huys et al., 2011).

      Hyperparameters – including the MB/MF weighting parameter (ω) – were estimated using maximum likelihood under two complementary approaches, with priors providing regularization across sessions. First, we performed a fixed-effects analysis, in which parameters were estimated independently for each session by maximizing the likelihood separately; second, we conducted a mixed-effects analysis, treating parameters as random effects across sessions within each subject. The prior reduces the risk of overfitting by constraining parameters based on their empirical distributions, rather than allowing unconstrained session-by-session estimates. Finally, all model-fitting procedures were verified on surrogate generated data.

      With regard to dynamic weighting, our approach – consistent with most two-step studies – assumed ω to be constant across trials within each session. This was a deliberate choice, both for comparability with prior work and because our subjects were extensively trained, making session-level stability of strategy weights a reasonable assumption. Indeed, our analyses showed no systematic drift in ω across sessions, suggesting that MB/MF balance was stable over sessions. While approaches that allow dynamic ω estimation are possible, we believe such extensions would likely have minimal impact in the current dataset.
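
To make the role of a constant ω concrete, the sketch below combines MB and MF values as a weighted sum, as in the hybrid model of Daw et al. (2011), passes them through a softmax, and recovers ω from simulated choices by maximum likelihood over a grid. It is a toy illustration only: the value inputs are random numbers rather than learned estimates, the inverse temperature is fixed, and the priors and mixed-effects fitting described above are not reproduced. All names are assumptions.

```python
import math
import random


def hybrid_values(q_mb, q_mf, omega):
    """Weighted mixture: omega * Q_MB + (1 - omega) * Q_MF for each option."""
    return [omega * mb + (1.0 - omega) * mf for mb, mf in zip(q_mb, q_mf)]


def softmax(values, beta):
    """Choice probabilities with inverse temperature beta."""
    top = max(values)
    exps = [math.exp(beta * (v - top)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(trials, omega, beta):
    """Sum of log choice probabilities under a session-constant omega."""
    total = 0.0
    for q_mb, q_mf, choice in trials:
        probs = softmax(hybrid_values(q_mb, q_mf, omega), beta)
        total += math.log(probs[choice])
    return total


def fit_omega(trials, beta, grid_steps=20):
    """Grid-search maximum-likelihood estimate of omega in [0, 1]."""
    grid = [i / grid_steps for i in range(grid_steps + 1)]
    return max(grid, key=lambda w: log_likelihood(trials, w, beta))


def simulate(n_trials, omega, beta, seed=0):
    """Generate choices between two options with random MB and MF values."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        q_mb = [rng.random(), rng.random()]
        q_mf = [rng.random(), rng.random()]
        probs = softmax(hybrid_values(q_mb, q_mf, omega), beta)
        choice = 0 if rng.random() < probs[0] else 1
        trials.append((q_mb, q_mf, choice))
    return trials


trials = simulate(n_trials=2000, omega=0.8, beta=5.0)
estimated_omega = fit_omega(trials, beta=5.0)
print(estimated_omega)
```

Checking that a known ω is recovered from simulated choices is the same kind of sanity check as verifying a fitting procedure on surrogate data.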

      (6) It was unclear from the task description whether the images used changed periodically or how the transition effect (e.g., in Figure 3) could be disambiguated from a visual response to the pair of cues.

      All images were kept constant across sessions. Common/Rare transitions themselves were not explicitly cued, but rather each second-stage state was associated with a specific background colour, followed ~1s later by the presentation of two specific second-stage choice cues (Figure 1B). Hence the subject could infer whether they were transitioned down a Rare or Common path by the background colour, which can be disambiguated in time from the visual responses to the second-stage cues. We’ve updated the Results text to make this clearer:

      “Tracking the state-transition structure of the task is imperative for solving the task as a MB-learner. All four regions encoded whether the current trial’s first-stage choice transitioned to the common or rare second-stage state (which could be inferred by a change in background colour immediately after choice indicating which second stage state they had just entered, Figure 1A).”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Figure 7 appears to be missing.

      We thank the reviewer for pointing this out. Figure 7 was inadvertently omitted in the previous version and has now been included in the revised manuscript.

      (2) No stats reported in the section on explore/exploit.

      We apologise for this oversight. This section now also reports the relevant statistics:

      “We thus repeated our decoding analysis of choice 1 stimulus identity, but this time limited trials to those where they had not received a high reward for the previous two trials (‘explore’ trials), and those where the previous two rewards had been the highest level (‘exploit’ trials). All regions encoded choice 1 for some duration of the choice epoch for both explore (p<0.002 in all cases, permutation test; Figure 7A) and exploit (p<0.002 in all cases; Figure 7B) conditions, but decoding accuracy was strongest in ACC. Choice 1 was less strongly decoded – particularly in ACC – in the former condition compared to the latter (p<0.002 for at least 140 ms in all cases, permutation test on differences observed; Figure 7C); and, also during exploitation, the ACC encoded choice 1 before the choice was even presented to the subject (Figure S8). This pre-choice ACC encoding in exploit trials may reflect the need to allocate cognitive (or attentive) resources to features – i.e., choice 1 stimulus identity – that are most certain predictors of important outcomes. As a control, we also decoded the direction of the Choice 1 (where choice was indicated via joystick movement), which was randomised each trial and therefore orthogonal to the stimulus that was chosen. Again, all four regions encoded its direction in both explore (p<0.002 in all cases; Figure 7D) and exploit (p<0.002 in all cases; Figure 7E). However, there were minimal differences in the strength of the representation between explore and exploit conditions (ACC, p=0.088, cluster-based permutation test; DLPFC p=0.016; caudate p=0.32; putamen p=1; Figure 7F).”
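
The trial selection described in this passage can be written as a small labelling rule: a trial counts as 'exploit' when both of the previous two rewards were at the highest level, and as 'explore' when neither of them was. The sketch below assumes outcome levels are coded 0 (low), 1 (medium), and 2 (high); that coding, the function name, and the handling of mixed histories (left unlabelled) are assumptions made for illustration.

```python
HIGH = 2  # assumed coding: 0 = low, 1 = medium, 2 = high


def label_trials(rewards, high=HIGH):
    """Label each trial from the rewards of the two preceding trials."""
    labels = []
    for t in range(len(rewards)):
        if t < 2:
            labels.append(None)  # not enough history
            continue
        previous_two = rewards[t - 2:t]
        if all(r == high for r in previous_two):
            labels.append("exploit")
        elif all(r != high for r in previous_two):
            labels.append("explore")
        else:
            labels.append(None)  # mixed history: excluded from both sets
    return labels


rewards = [2, 2, 2, 1, 0, 2, 1]
labels = label_trials(rewards)
print(labels)
```

Decoding would then be run separately on the trials labelled 'explore' and those labelled 'exploit'.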

      (3) Make sure that error bars are explained in all figure captions where appropriate.

      We apologise that this information was absent. Error bars always represent the standard error of the mean. This has now been added to all relevant figure legends.

      Reviewer #2 (Recommendations for the authors):

      Overall, I think this is a great manuscript and was presented clearly and succinctly. I have some minor suggestions:

      (1) Typo: Abstract "ACC, DLPFC, caudate and striatum" I think should be "caudate and putamen".

      We have amended this incorrect reference in the introduction:

      “One such task that does enable the dissociation of MB and MF computations is Daw et al. (2011)’s ‘two-step’ task [18]. It contains a probabilistic transition between task states to uncouple MF learners (who would assign credit to which state was rewarded regardless of the transition) from MB learners (who would appropriately assign credit based on the reward and transition that occurred). Rodents [19], monkeys [36], and humans [18] all use MB-like behaviour to solve the task. Evidence in rodents suggests dorsal anterior cingulate cortex (ACC) tracks rewards, states, and the probabilistic transition structure, and that ACC is essential in implementing a MB-strategy [37]. Here, we compare primate single neuron activity of 4 different subregions implicated in reward-based learning and choice (ACC, dorsolateral PFC (DLPFC), caudate, and putamen) during performance of the classic two-step task, and demonstrate signatures of MB-RL primarily in ACC, and MF-RL signatures most notably in putamen.”

      (2) Could the authors provide a rationale for why they did the single-level encoding the way they did, instead of running an ANOVA?

      We thank the reviewer for this point. We are not entirely certain which specific ANOVA approach is being suggested, but our rationale for using a GLM-based encoding analysis is that such an approach allows us to model continuous, trial-by-trial variables (e.g., value signals, prediction errors, transitions) while simultaneously controlling for multiple correlated predictors. This approach is widely used in systems neuroscience (particularly in decision-making research), offering analytical flexibility and comparability with prior approaches.

      (3) How were the 20 iterations for decoding decided? That seems low.

      We do not agree that 20 repetitions of 5-fold cross validation is low. The error bars in panels 6C-E demonstrate how low the variance was across these 20 repetitions. It is the average of these low-variance repetitions against which we performed statistics, using a permutation test in which these 20 repetitions were repeated a further 500 times.

      (4) It was unclear to me how the authors reached the conclusion "Thus, caudate activity appeared to represent the value of the state the subject was currently in." when the state value wasn't computed directly. I don't see how encoding the chosen and unchosen option is the same as the state the animal is in, which should also incorporate where the animal is in a block of trials or session, and the knowledge regarding the chosen and unchosen option.

      We agree with this point and have tempered this statement:

      “Thus, caudate’s encoding of an option’s value also reflected the availability of the option.”

      (5) Figures 1C, D, and E were not legible to me even at 200% zoom.

      We apologise for this oversight. We’ve now updated panels 1C-E to a more readable size:

      (6) There is a Figure 2H in the figure legend, but the panel appears to be missing from Figure 2.

      This text has been removed.

      (7) Figure 2: It would've been nice to see F and G for all areas.

      We have now added this data as additional panels in Figure 2.

      (8) Figure 3: How is the transition disambiguated from a visual response to the set of images?

      This was indicated by the background changing colour to that of the learned second stage state before the actual choices were presented. We’ve updated the Results text to make this clearer:

      “Tracking the state-transition structure of the task is imperative for solving the task as a MB-learner. All four regions encoded whether the current trial’s first-stage choice transitioned to the common or rare second-stage state (which was indicated by a change in background colour before the second stage choices were presented, Figure 1A).”

      (9) Figure 4F: Is this collapsed across time points? So neurons that were significant at any time? I'm confused how Figure 4A relates to 4F, as 4A shows much lower percentages of significant neurons.

      Figure 4F counts the total number of neurons that had a significant period of encoding at any timepoint over the epoch (as assessed with a length-based permutation test), whereas 4A shows the number of significantly encoding neurons at any one time point. Investigating this further, we found that the encoding was dynamic, with different neurons encoding different parts of the epoch. We have now added a new supplementary figure to highlight this and refer to it in Results:

      “Examination of the strongest signal observed, ACC’s encoding of MB Q-values, showed a dynamic pattern with different neurons encoding the signal at different parts of the epoch (Figure S6). When aggregating the number of significant coders throughout the epoch, and examining the specificity of MB versus MF coding, we found that all regions had a significant population of neurons that encoded MB-, but not MF-, derived value (30, 18.72, 23 and 24% of neurons in ACC, DLPFC, caudate and putamen respectively; all p<0.0014 binomial test against 10% (as the strongest response to either of the two options was used); Figure 4F).”
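The distinction between the per-timepoint count (as in Figure 4A) and the any-timepoint count (as in Figure 4F) can be illustrated with a minimal sketch. This is not the authors' analysis code: the significance matrix and the neuron counts are hypothetical, the length-based permutation test that defines a "significant period" is replaced by a plain boolean flag, and the one-sided form of the binomial test is an assumption.

```python
from math import comb


def fraction_per_timepoint(sig):
    """Fraction of neurons significant at each time bin (Figure 4A-style view)."""
    n_neurons = len(sig)
    n_times = len(sig[0])
    return [sum(row[t] for row in sig) / n_neurons for t in range(n_times)]


def fraction_any_time(sig):
    """Fraction of neurons significant in at least one time bin (Figure 4F-style view)."""
    return sum(any(row) for row in sig) / len(sig)


def binomial_p_greater(k, n, p0):
    """One-sided binomial test: P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# Hypothetical data: 4 neurons x 3 time bins, True = significant encoding.
# Each neuron encodes a different part of the epoch (dynamic coding).
sig = [
    [True, False, False],
    [False, True, False],
    [False, False, True],
    [False, False, False],
]

per_time = fraction_per_timepoint(sig)  # 0.25 in every bin
any_time = fraction_any_time(sig)       # 0.75 over the whole epoch

# Hypothetical count: 30 of 100 neurons significant, tested against a 10% chance level.
p_value = binomial_p_greater(30, 100, 0.10)
```

With dynamic coding, the aggregated fraction can be much larger than the fraction at any single time point, which is the relationship between the two panels described above.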

      (10) Data/ code could be made publicly available instead of upon request.

      All data and code to reproduce figures are now available at https://github.com/jamesbutler01/TwoStepExperiment. The manuscript has been updated to reflect this:

      Data and materials availability:

      All data and code to reproduce figures are available at https://github.com/jamesbutler01/TwoStepExperiment.

    1. License model and deployment scenarios Dash and Shiny are both open-source applications, so these frameworks can be used free of charge. [@dabbas_interactive_2021] In contrast, Power BI and Tableau rely on commercial license models for their full versions. Both Power BI and Tableau additionally offer a free version, which does not provide all the functions of the paid version; for small and medium-sized enterprises in particular, however, the free option offers sufficient functions and features. [@krishnan_research_nodate] Programming language, target audience, usability, extensibility R Shiny is built entirely on the programming language R and is particularly suited to users from science and data analysis. [@khedr_interactive_2021] This close tie to the R ecosystem is a major advantage for Shiny, because it makes Shiny easy to extend; using these extensions, however, requires prior knowledge of R. [@walker_tools_2016] Python Dash, by contrast, is based on the programming language Python, is frequently used in data science and analytics, and is aimed at users who are already familiar with Python-based data analysis tools such as pandas or scikit-learn. [@odonnell_interaktive_2020] The advantage of Dash lies in its use of the modern frontend technology React; in addition, the user is given complete freedom over the design through code. As with R Shiny, the tie to the Python ecosystem is an advantage for Dash, since here too the application is easy to extend. [@dabbas_interactive_2021] Because Power BI and Tableau were developed for use in business intelligence and the enterprise sector, they largely dispense with conventional programming, although Tableau can be connected to R in order to better adapt the visualization to the user's expectations. Instead of programming, both tools use GUIs with drag-and-drop functionality. [@odonnell_interaktive_2020] [@vasundhara_data_nodate] In terms of extensibility, Tableau and Power BI offer rather poor options, as the range of plugins is small.

      What is the purpose of this paragraph other than repeating the table and adding slightly to it? I mention this because I see a lot of potential to develop it into a paragraph on advantages, disadvantages and recommendations. For that, though, summarizing sentences are enough, such as

      "Because Python and R, as programming languages, are open to extension, there is an extensive ecosystem of so-called libraries with which functionality can be extended ... etc. etc. The prerequisite is knowledge of the respective language and prior programming experience, as well as familiarity with the concepts of object-oriented programming" (or something like that).

    2. Medium (experience in R) Medium (experience in Python) Very easy (drag & drop) Very easy (drag & drop)

      The category "medium" is not defined anywhere. Drop it and fill the cells with actual content instead: line-by-line coding in R; line-by-line coding in Python, etc.; drag and drop, etc.

    1. from jupyterquiz import display_quiz
       quiz_file = 'questions/COLDFISH.json'
       display_quiz(quiz_file, num=4, colors='fdsp')
       [Rendered page residue: the jupyterquiz stylesheet, i.e. theme variables scoped to #xJkhFuYZQYwh and rules for .Quiz, .QuizCode, .Answer, .Feedback, .Input, .MCButton, the question-type containers, and the correct/incorrect button animations.]

      Is this an error, or does it need to stay here?

    1. A generative AI like ChatGPT Data Analyst can take on the role of the evaluation software. It is expected that this manner of use will make the students' work easier, as less emphasis needs to be placed on the programming itself. Instead, teachers can incorporate exercises that encourage students to code more efficiently and accurately with the assistance of AI. This shifts the focus from finding the right command or function to examining and understanding the data more closely. As a consequence, students are better enabled to interpret the results of statistical evaluation software correctly, thus fulfilling goal 8 of the GAISE report.

      rhetoric: Schwarz uses a statement of transition to contrast the old education model (rote memorization of commands) with a new required model (critical examination).

      inference: This supports the argument that education and labor must start to pivot away from the "Generalist" process-oriented tasks. If the machine assistants handle the 'How' (the commands and functions), then the human must focus more on the 'Why' and the 'what does it mean (understanding/wisdom)'. This helps to validate the work of the assistants and helps to make it useful and valuable in the real world.

    2. The results show that generative AI can facilitate data analysis for individuals with minimal knowledge of statistics, mainly by generating appropriate code, but only partly by following standard procedures.

      rhetoric: author uses comparative, objective statement (logos) to establish the main boundary of the technology's capability/capacity -- it excels at technical generation (things like coding) but fails at standard procedures (methodological adherence to SOPs).

      inference: this proves the 'Raising the Floor' concept. AI completely automates the entry-level syntax (the "Word"), meaning that the Generalist coder is obsolete! However, because it fails at standard procedures, it requires a human architect to guide it to outputs that are valuable in the real world.

    1. If a competitor performs these functions "better" but uses a different syntax or API, the customer must rewrite thousands of lines of application code, dashboards (Grafana/Tableau), and internal scripts

      All of these integrations, connectors, etc are super hard to just one shot port over

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      We appreciate the time and effort the reviewers have invested in providing constructive feedback on our manuscript. Below, we’ve detailed additional work, corrections, and improvements that we will complete during the revision process.


      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary

      Folding is a major morphogenetic process that shapes tissues and organs in three dimensions. The mechanisms underlying tissue folding have been extensively explored and are often driven by actomyosin-based apical constriction. Here, the authors describe changes in cell geometry and mechanics during mouse neural tube formation. They build on quantitative fixed imaging and live junction ablation to extract cell geometry and junctional tension. These analyses are performed at different developmental stages and in both male and female embryos to propose a mechanical mechanism for neural tube elevation in the brain.

      Major comments

      The authors report quantitative data on cell geometry and junctional tension inferred from laser ablation. Overall, there are numerous statements that require stronger support from the experimental data. To substantiate several of their claims, the authors need to provide a larger number of data points (or at least comparable numbers across experimental conditions) for the tension measurements. Additional statistical analyses are required throughout to support the conclusions.

      Figure 1

      1. Does the projection algorithm account for tissue curvature when computing cell geometrical parameters such as area and anisotropy?

      At present, our projection algorithm does not correct for tissue curvature. Curvature in the tissue can make larger cells appear smaller in projections, skew the angle of cell orientations, and change aspect ratios. The largest curvature in the midbrain neural tube samples that we analyze is found in the transition region between the midline and lateral regions (~10-30% of tissue width) of 5 ss and 8 ss embryos. The regions at the midline and more laterally are relatively flat. Therefore, distortion from curvature will not dramatically alter our key conclusions. We will apply a curvature correction using existing tools (Herbert S., et al (2021) BMC Biology) to sample images and determine if there are substantial differences in curvature-sensitive cell shape metrics. These will be included in a supplement to Figure 1. If there is a significant difference, we will expand the correction to all images that we analyze and update our analysis.
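To make the size of the projection effect concrete, here is a minimal geometric sketch. It is not the correction tool cited above: it assumes each cell's apical surface is a locally flat patch tilted by a known angle from the imaging plane, in which case the projected area is the true area multiplied by the cosine of the tilt. The function names and the example values are hypothetical.

```python
import math


def projected_area(true_area, tilt_deg):
    """Apparent area of a flat patch after projection onto the imaging plane."""
    return true_area * math.cos(math.radians(tilt_deg))


def corrected_area(apparent_area, tilt_deg):
    """Invert the projection: recover the true area from the apparent area."""
    return apparent_area / math.cos(math.radians(tilt_deg))


# Hypothetical cell with a true apical area of 50 um^2.
for tilt in (0, 15, 30, 60):
    apparent = projected_area(50.0, tilt)
    loss_pct = 100.0 * (1.0 - apparent / 50.0)
    print(f"tilt {tilt:2d} deg: apparent area {apparent:5.1f} um^2 ({loss_pct:4.1f}% underestimate)")
```

Under this assumption the underestimate stays small for shallow tilts and grows quickly for steep ones, which is why the transition region is where a correction would matter most.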

      The authors should provide information on the accuracy and reliability of the cell segmentation.

      We can provide a supplement to Figure 1 to demonstrate the accuracy of the segmentation. We have used F-Actin to segment cells in our images, which is enriched along the cell junctions but can also form medial cables that cross the cell surface. Junctional actomyosin is notably brighter than medial cables, and segmentation with our trained CellPose model is consistently able to distinguish the junctions. We also checked segmentation and performed manual corrections to ensure accuracy. To demonstrate this for our readers, we will prepare samples stained with both F-actin and ZO-1, a tight junction component that is localized to cell junctions. We will then segment the image twice in CellPose, once using the F-actin signal and once using the ZO-1 signal. The resulting cell outlines will then be digitally superimposed to show how much the signals overlap, and we will plot out the cell frequency as a function of area to determine if F-actin segmentations can segment with the same fidelity as ZO-1. Recent work by a co-author has shown excellent corroboration of neuroepithelial apical cell areas segmented using F-Actin and ZO-1 (Ampartzidis I., et al. eLife 2026). We are confident that our data will show a similar result.

      The authors indicate that the rate of apical constriction differs between male and female embryos. However, apical sizes differ only at specific positions along the ML axis (Fig. 2H, I).

      In Figure 2H, we show that at 5 ss males have larger apical areas than females at the midline, adjacent lateral cells, and at the surface ectoderm-neural epithelium border. By 8 ss (Figure 2I), cells at the midline are smaller in males than females, while cells in more lateral regions are now equivalent between sexes. This change in apical area over time suggests that males have faster rates of constriction than females at the midline and adjacent lateral region where male cells become smaller or equivalent in size to female cells, respectively. We will perform statistical analysis (see comment #4) to determine if there are regions with significant differences in rate and amend our language to clarify that these differences are region specific as appropriate.

      The authors should provide statistical analyses for the rates shown in Fig. 2J. Are these rates significantly different between males and females, and between medial and lateral regions?

      Currently we calculate our rates using the difference in population averages of apical area at each stage shown in Figure 2H and 2I for each sex, and dividing by the number of somite stages, 3. As a result, there is only one rate value at each midline-lateral bin for each sex which is not amenable to statistical analysis. To correct this, we will calculate rates by subtracting the average apical area of each embryo at 8 ss from the population average of embryos at 5 ss. This will create 5 rates for both females and males at each 10% midline-lateral bin. We plan to perform a two-way ANOVA to determine if there are statistical differences in rates between males and females at each bin position and between medial and lateral regions. We will also add a section describing these calculations to the “Statistical Analysis” portion of the methods.
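The difference between the current and the planned rate calculation can be sketched as follows for a single sex and a single midline-lateral bin. The area values are hypothetical, and dividing the per-embryo differences by the same 3 somite stages is an assumption carried over from the current calculation.

```python
from statistics import mean

SOMITE_STAGES_ELAPSED = 3  # 5 ss -> 8 ss, as stated in the response


def population_rate(areas_5ss, areas_8ss):
    """Current approach: a single rate from the two population means."""
    return (mean(areas_5ss) - mean(areas_8ss)) / SOMITE_STAGES_ELAPSED


def per_embryo_rates(areas_5ss, areas_8ss):
    """Planned approach: one rate per 8 ss embryo, relative to the 5 ss population mean."""
    baseline = mean(areas_5ss)
    return [(baseline - area) / SOMITE_STAGES_ELAPSED for area in areas_8ss]


# Hypothetical mean apical areas (um^2) per embryo in one 10% bin, one sex.
areas_5ss = [40, 42, 38, 41, 39]
areas_8ss = [31, 28, 34, 25, 32]

single_rate = population_rate(areas_5ss, areas_8ss)  # one value, no variance estimate
rates = per_embryo_rates(areas_5ss, areas_8ss)       # five values, usable in an ANOVA
```

Positive values indicate constriction (area decreasing with stage). The mean of the per-embryo rates equals the single population rate, so the planned calculation keeps the same point estimate while adding the spread needed for a statistical test.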

      Please clearly state the main novelty of this study relative to the work published by Brooks et al.

      Our study builds on the work of Brooks ER, et al. (2020) eLife. Brooks demonstrates that cells in a region of the lateral neural folds undergo apical constriction (Figure 1) and that cells at the midline do not (Figure 2). We expand and improve upon this work in the following ways:

      1. As required by our funding sources at the NIH (NOT-OD-15-102) we have collected, analyzed, and reported on sex as a biological variable of interest. In doing so, we have shown that there are clear sex differences in apical area in the neural tube that were not previously shown. We also show that there is apical constriction within the neural tube midline in a sex dependent manner. Brooks et al do not address sex in their work.
      2. We have provided more complete and spatially precise information on midline-lateral patterns of apical area and apical constriction. To show changes in apical area of lateral cells, Brooks selects a 100 x 100 µm region of interest in the midbrain (Figure 1E-F, Figure 2A) but does not specify the midline-lateral or rostral-caudal location of this region of interest or standardize it between embryos of different ages and dimensions. In our study, we’ve standardized our measurements to a 100 µm wide band across the midbrain adjacent to the midbrain/hindbrain boundary (Figure 2A-C). We also standardize positions as a percent distance from midline to account for differences in width between embryos and ages. This allows us to consistently compare similar populations of cells along the midline-lateral axis and determine changes in apical area over time.
      3. We connect patterns of apical area and constriction to F-actin and Myosin-IIB density. Though Brooks et al report some analysis of F-actin in lateral cells (Figure 6), they do not analyze the midline cells or explore the relationship between cell shape and actomyosin.
      4. Finally, we tested the mechanical properties of the tissue through laser ablation in living mouse embryos. From these ablations we’ve found that tension at the midline is less than in more lateral regions. Work in the neural tubes of frog (Haigo S., et al. (2003) Current Biology, Baldwin AT., et al. (2022) eLife, Matsuda M., et al. (2023) Nature Communications) and chicken (Kinoshita N., (2008) Cell, Nishimura T., et al. (2012) Cell) embryos has conclusively shown that enriched midline actomyosin promotes apical contractility and drives hinge formation. It was therefore largely believed that a similar contractile hinge was employed in mammals (Copp AJ. and Greene NDE. (2010) J. Pathol, Nikolopoulou E., et al. (2017) Development). Collectively, our work is the first to demonstrate that such a contractile hinge is not present in the mammalian brain neural tube.

      Figure 3

      The authors need to provide statistical support for the claim that large midline cells exhibit reduced F-actin and Myosin IIB levels.

      We will conduct a two-way ANOVA to determine if there are statistical differences in F-actin and Myosin IIB density at the midline and more lateral regions in both males and females. We will update our language in the text and plots as appropriate from these results.

      F-actin and Myosin IIB intensities should be plotted as a function of cell area to support the proposed anticorrelation between apical area and actomyosin levels.

      We will make plots of cell areas vs. F-actin or Myosin IIB density for cells in each embryo. We will then fit a line to determine the R value for each embryo to determine if there is a negative correlation between cell area and actomyosin intensity. We will also adjust our language in the text as appropriate based on the results of these tests.
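As a minimal sketch of the per-embryo correlation described here, the snippet below computes a Pearson correlation coefficient between apical area and actomyosin intensity. Interpreting the "R value" as Pearson's r is an assumption, the function name is hypothetical, and the measurements are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical cells from one embryo: apical area (um^2) and F-actin density (a.u.).
areas = [12.0, 18.5, 25.0, 31.0, 44.0, 52.5]
f_actin = [9.1, 8.4, 7.9, 6.2, 5.0, 4.8]

r = pearson_r(areas, f_actin)
print(f"r = {r:.2f}")  # a negative r indicates larger cells have lower F-actin density
```

One r value per embryo then gives a set of coefficients whose sign and consistency can be summarized across embryos.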

      Statistical analyses are missing to substantiate the increase in F-actin levels between stages ss5 and ss8.

      We will perform an F-test to determine homogeneity of variance between F-actin at 5 ss and 8 ss followed by the appropriate t-test to determine if there is a statistical increase in F-actin over time. We will also amend our language in the text to reflect the results of this test.

      Figure S3 should be supported by plots showing Myosin II and F-actin intensity as a function of position along the ML axis, together with appropriate statistics.

      In Figure 3A-D, we show representative images of F-actin and Myosin IIB density in female embryos. These are plotted as the purple lines in Figure 3 E-H. Figure 3 Supplement 1 shows representative images of F-actin and Myosin IIB density in male embryos. These are plotted as the green lines in Figure 3 E-H. We will add a line in the caption of Figure 3 Supplement 1 indicating that these samples are represented and plotted in Figure 3. We also noted typos in the respective captions, which incorrectly state whether males or females are shown in the figure. We will correct these typos as well. Additionally, we will perform the statistical tests indicated under comment #6.

      Figure 4

      The authors state that lateral tension in male embryos is not different from midline tension, yet the number of data points is much lower than in females. To support this claim, the number of ablations should be comparable across sexes.

      As part of this study we performed 270 ablations in the neural tubes of 83 mouse embryos: an exceptional scale of ablations that is the first of its kind in early embryos. We conducted our initial recoil velocity analysis blinded to information on sex. Male embryos were statistically underrepresented in our data set because male embryos develop faster than their female littermates (Seller MJ. and Perkins-Cole KJ. (1987) J. Reprod. Fert.). As such, the neural folds of male embryos were too elevated to ablate. At present we do not have the resources or justification to perform laser ablations on additional animals to obtain the number of male embryos needed to supplement the already exceptionally large data set. We will instead perform a power analysis to determine if: 1) we have a sample size large enough to detect a biologically-meaningful difference with suitable power, 2) the sample size required to detect the observed difference is so large that the difference would not be biologically meaningful, or 3) we do not have a sample size large enough to detect a difference confidently. With the results of this analysis, we will amend our language in the text to reflect the most accurate claims that can be made.
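The three outcomes of the planned power analysis can be illustrated with a minimal sketch based on the normal approximation for a two-sided, two-sample comparison of means. This is not the authors' analysis: the effect sizes are hypothetical standardized differences (Cohen's d), and an exact t-based calculation would give slightly larger sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a standardized difference d."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


def approx_power(effect_size_d, n, alpha=0.05):
    """Approximate power with n observations per group (far tail ignored)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size_d * sqrt(n / 2) - z_alpha)


# Hypothetical standardized effect sizes.
for d in (0.2, 0.5, 1.0):
    print(f"d = {d}: about {n_per_group(d)} ablations per group for 80% power")
```

A small observed effect that would need hundreds of ablations per group corresponds to outcome 2, whereas a large effect paired with low achieved power at the available male sample size corresponds to outcome 3.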

      Is lateral tension different between males and females?

      In Figure 4G we show that females have statistically different tension between the lateral and midline regions, while males do not. However, we do not test if the lateral or midline tension is different between females and males. We will perform an F-test and t-test to determine if there are statistical differences between males and females in this region.

      Similarly, the data in Fig. S4 used to claim no change in tension over time are not supported by sufficient data points.

      As discussed in comment #10, the scale of ablations is already substantial, and the initial recoil velocities were analyzed blinded to information on embryo age. We will calculate a best fit line for these plots to demonstrate if there is a trend in recoil velocity over time. We will then adjust our language in the text as appropriate with this added information.

      Would the medial and lateral tensions reported in Fig. 4G remain unchanged if the authors perform statistical analyses on 10-15 ablations per condition?

      We do not have a justification for removal or exclusion of any of the laser ablations analyzed in this study. We will instead perform a power analysis, as indicated in comment # 10, and adjust the language in the text as appropriate given the results of that analysis.

      Figure 5

      The number of data points in Fig. 5J and L is insufficient to support claims of no difference. The only detectable difference arises in the comparison with much higher sample size (Fig. 5L, ML vs RC).

      In Figure 5J we disaggregate ablations performed at the midline by directionality (midline-lateral or rostral-caudal). We were unable to detect a statistically significant difference based on the direction of initial recoil velocity in either sex, though N’s for all categories are comparable. As discussed in comments #10 and #12, the scale of ablations conducted in this study is uniquely substantial. We will perform a power analysis for our anisotropy measurements in the lateral region of the tissue to determine if we have a sample size large enough to detect a biologically-relevant difference with high confidence or if the sample size required to detect the observed difference is so large that the difference would not be biologically meaningful. Given the results of this analysis, we will amend our language in the text to reflect the most accurate claims that can be made.

      The authors conclude that males have higher ML tension than RC tension, but given the limited data this conclusion should be amended to "no detectable difference."

      In Figure 5L, we disaggregate ablations performed in the lateral regions, by directionality (midline-lateral or rostral-caudal). We find a statistical difference in the directionality of initial recoil velocity in females. In males, though we can observe a difference in the initial recoil velocity means, we are unable to detect a statistical difference, likely due to the smaller male sample size. As discussed in comments #10 and #12, the scale of ablations conducted in this study is uniquely substantial and was conducted blinded to embryo sex. Given that males develop faster than their female littermates (Seller MJ. and Perkins-Cole KJ. (1987) J. Reprod. Fert.) we were unable to obtain more males in our data set. We will perform a power analysis for our anisotropy measurements in the lateral region of the tissue to determine if: 1) we have a sample size large enough to detect a biologically-meaningful difference with suitable power, 2) the sample size required to detect the observed difference is so large that the difference would not be biologically meaningful, or 3) we do not have a sample size large enough to detect a difference confidently. With the results of this analysis, we will amend our language in the text to reflect the most accurate claims that can be made.

      Code availability

      The authors should provide access to the code used to generate the projections.

      We are committed to ensuring open access to all code used as part of this study, including components of the projection workflow, data analysis, and figure creation. We are in the process of assembling a GitHub repository containing these files as well as documentation to allow for use by other members of the research community and public. We will make this repository public upon its completion or at the time of publication, whichever comes first.

      Reviewer #1 (Significance (Required)):

      The authors propose a mechanical model for neural tube elevation based on analyses of cell geometry and tension at two developmental stages. The reported differences in cell geometry or actomyosin levels do not appear to explain the differences in geometry or tension suggested between male and female embryos. This raises questions about the relationship between these measurements and their relevance for understanding the mechanisms of neural tube elevation.

      If the major concerns outlined above are rigorously addressed, the manuscript will offer a valuable descriptive characterization of neural tube cell geometry and mechanical stress during morphogenesis. Such datasets could form a foundation for future studies investigating the mechanisms driving neural tube elevation.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      The manuscript investigates the role of apical constriction and actomyosin organization in shaping the mouse brain neural epithelium during neural tube elevation, with particular emphasis on sex-specific differences. The authors develop an imaging and analysis pipeline to reconstruct the apical surface of the neural plate in three dimensions and perform quantitative measurements of apical cell area, actin, and myosin IIB distributions. Targeted laser ablation experiments are used to infer regional tissue tension.

      The main findings can be summarized as follows. First, the authors identify a mediolateral gradient in apical cell area, with larger cells at the midline and smaller cells on the lateral neural folds, which inversely correlates with actomyosin density. Laser ablation experiments suggest that apical tension is lower and isotropic at the midline, whereas it is higher and anisotropic on the lateral folds, particularly in females. Second, sex-dependent differences in apical cell area, constriction rates, and actomyosin levels are reported at early somite stages, preceding previously described sex biases in neural tube defects.

      The experimental work is technically solid, and the imaging and quantification pipeline represents a useful advance for analyzing large, curved epithelial surfaces. However, the study feels incomplete in its current form. Despite addressing neural tube elevation, the manuscript does not provide a comprehensive analysis of the folding process itself. Key aspects such as three-dimensional tissue morphology, curvature evolution, or global shape changes of the neural folds are not quantified. In addition, other potentially relevant cellular behaviors, such as proliferation, cell rearrangements, or contributions from neighboring tissues, are not examined, nor are they compared systematically between sexes.

      Conceptually, the study focuses narrowly on correlations between apical cell area, actomyosin density, and inferred tension. While these measurements are carefully performed, the relationship between differential actomyosin contractility and three-dimensional tissue folding remains largely descriptive. No mechanical model or simulation framework is provided to link changes in actomyosin organization and cell shape to the emergence of neural folds and hinge formation. As a result, it is difficult to assess whether the measured differences in tension (on the order of ~40%) are sufficient to account for the proposed mechanical behavior of the tissue.

      The central hypothesis advanced by the authors is that a relatively "soft" midline, flanked by stiffer, tension-bearing lateral folds, facilitates hinge formation during brain neurulation. However, this hypothesis is not directly tested by perturbation. For example, experimentally increasing contractility or stiffness at the midline (e.g., via optogenetic activation of apical constriction machinery) would provide a more direct test of causality. As it stands, the data demonstrate correlation rather than necessity or sufficiency.

      Relatedly, alternative interpretations are not fully addressed. Large apical cell areas and low actomyosin levels at the midline could arise as a consequence of tissue geometry, contact with underlying structures such as the notochord, or extrinsic mechanical constraints, rather than being the primary cause of hinge formation. Similarly, anisotropic stresses generated at the tissue or embryo scale could align cells and actomyosin cables, producing the observed patterns without requiring locally specified apical tension differences as the initiating mechanism. The manuscript does not clearly distinguish whether apical tension asymmetries are a driver of folding or an emergent outcome of folding dynamics.

      Finally, while the identification of sex differences is intriguing, it remains unclear what mechanistic insight is gained beyond establishing that such differences exist. The functional consequences of these differences for neural tube closure, robustness, or failure are not explored, nor is it clear how they integrate into the proposed lateral tension model.

      In summary, this study provides high-quality measurements of apical cell geometry, actomyosin organization, and inferred tension in the mouse neural epithelium. However, the lack of direct perturbations, mechanical modeling, and quantitative analysis of three-dimensional tissue deformation limits the strength of the mechanistic conclusions. Addressing these gaps would substantially strengthen the manuscript and clarify the causal role of apical tension patterns in neural fold formation.

      The reviewer makes an excellent point, that direct perturbation of the system would enable us to test our hypothesis and inform whether the reduced contractility at the midline is essential for neural tube elevation. However, at present the technology needed to conduct an optogenetic experiment like that described by the reviewer does not exist. As with the laser ablations, an optogenetic experiment requires access to live and healthy embryos. Currently, mouse embryos can be cultured for several days in roller culture, where they are continuously rotated, or for several hours in static culture (Aguilera-Castrejon A. and Hanna JH. (2021) J. Vis. Exp.). Both techniques require that the yolk and amniotic sacs remain intact around the embryo. To access the apical surface of the brain neural tube for imaging, both sacs must be breached, after which the embryo has about 30 minutes before it begins to exhibit altered cellular morphology and tissue integrity, ultimately leading to embryo death.

      The neural tube elevates over several hours and closes fully after more than a day (Jacobson AG. and Tam PPL. (1982) The Anatomical Record). Even if we did acquire mice expressing photoactivatable constructs, the support membranes of the embryos would need to be breached to activate protein interactions. The embryos would die before any meaningful progress in neural tube elevation could be evaluated. Conducting an experiment like this would greatly advance our understanding of the system, and we hope that the needed technologies are developed to enable future work of this nature. The Galea lab previously purchased a photo-activatable Cre line, but was unable to induce deletion of a protein of interest using this allele before closure of the neural tube was completed (and the blue light needed to activate the Cre was photo-toxic).

      At present, there is some experimental evidence to suggest that lack of apical constriction at the midline is important for proper neural tube closure. Brooks ER, et al. (2020) eLife shows that a truncated Ift122 mutant leads to abnormal constriction of the midline cells but does not disrupt lateral cell apical constriction, leading to a failure in brain neural tube closure in these embryos. Ift122 regulates trafficking and signaling proteins in cilia, which in turn regulates Sonic hedgehog signaling, which Brooks ER, et al. also demonstrate regulates apical constriction. While this disruption is clearly multifaceted and nuanced, it provides some genetic support for the lack of apical constriction at the midline being important for neural tube closure.



      Major Comments

      Figure quality. Figure 1 contains very low-resolution images, which makes it difficult to evaluate the segmentation quality and tissue morphology. Higher-resolution versions should be provided.

      In Figure 1, we outline the conceptual strategy and approach used to create and analyze shell projections of the curved neural tube. As much of our analysis builds from segmentation of cells in the projections, being able to assess segmentation quality from high resolution images is critical to evaluating the quality of the data shown. As discussed in comment #2, we will create a supplement to Figure 1 to demonstrate the accuracy of the segmentation. This will include high resolution images of both the label used to segment and the resulting segmentation, with corresponding overlays.

      Cell segmentation strategy and validation. The authors segment cell areas using Myosin II and F-actin signals. This approach may introduce inaccuracies, as actomyosin cables can traverse the apical surface of individual cells and do not always coincide with cell boundaries. Segmentation based on junctional markers such as ZO-1 may be more appropriate. At minimum, the authors should provide a quantitative validation of segmentation accuracy, for example by overlaying segmentation results on raw images together with a nuclear marker (e.g., DAPI or H2B-GFP), to demonstrate that the number of segmented cells corresponds to the number of nuclei.

      We will provide a supplement to Figure 1 to demonstrate the accuracy of the segmentation. We have used F-Actin to segment cells in our images. F-actin is enriched along junctions but cells can also have medial pools and F-actin cables, which might lead to errors. Though we understand the reviewer’s logic in asking to align segmentations with marked nuclei, the morphology of the neural epithelium makes this approach infeasible. The neural epithelium is pseudostratified, and nuclear position varies along the apical-basal axis depending on the cell cycle phase of each cell. As a result, an apical shell projection of nuclei would not capture all nuclei and a maximum intensity projection in Z of all nuclei would be uninterpretable as there would be substantial XY overlap between nuclei. Instead, we will create a supplement to Figure 1 to demonstrate the accuracy of the segmentation as discussed in comment #2. We will segment samples stained for both F-Actin and junctional markers like ZO-1. We will then create overlays of the resulting cell outlines and a cell area frequency plot for both segmentations to evaluate if F-actin based segmentation deviates from tight junction-based segmentation.

      Lack of cross-sectional views of neural tube morphology. The manuscript would benefit from the inclusion of cross-sectional images of the neural tissue at different developmental stages. This would serve two purposes: (i) to demonstrate that the authors have a comprehensive understanding of the full three-dimensional folding process during neural tube closure, including medial and lateral hinge formation, and (ii) to allow readers to visualize the tissue geometry corresponding to the analyzed projection datasets (e.g., at 5 ss and 8 ss).

      A key component of our model states that the changes in cell-level morphology and features correspond to changes in tissue level morphology (Figure 6). Specifically, the model states that lateral apical constriction coincides with the flattening and elevation of the dorsal bulges on the lateral neural folds. We agree that it is beneficial to include additional visuals of tissue morphology. We plan to add an additional figure at the start of the manuscript that details both the dorsal and relevant cross-sectional views of the somite stages analyzed. These visuals will take the form of graphical illustrations along with 3D confocal microscopy images and optical reconstructions of samples.

      Sex-specific differences in overall neural plate morphology. The authors report that at 5 ss, males consistently have larger apical cell areas than females. It is unclear whether this difference reflects a global difference in neural plate morphology. Showing representative images of female and male neural plates would help readers directly assess whether there are overt morphological differences beyond those revealed by quantitative analysis.

      If one sex has larger cells than the other, it would be reasonable to expect that the neural folds may be wider as well. In Figure 2B-C, we show representative images of male embryos at 5 and 8 ss. As part of the additions we indicated in comment #19, we will also include dorsal and cross-sectional views of both male and female embryos at the stages analyzed. If there is a difference in tissue morphology between sexes, we will also quantify these differences in tissue size, curvature, etc.

      Cell number analysis. The authors state, based on prior literature, that cell numbers do not change between 5 and 8 ss. Given that the tissue is already segmented in the current study, this claim should be directly verified using the authors' own data. This analysis should be straightforward and would strengthen the conclusions.

      We agree and will determine the number of cells analyzed for each embryo to test if there are changes in cell numbers at different stages and between sexes, along with appropriate statistical tests.

      Relation between tissue curvature and cellular properties. It would be highly informative to extract the three-dimensional morphology of the neural plate, in particular its curvature, and examine how curvature correlates with two-dimensional cell anisotropy, apical area, and F-actin/Myosin intensity. For example, at 8 ss the authors report a U-shaped dependence of cell area along the mediolateral axis. How does this pattern relate to local tissue curvature?

      We agree with this assessment and will create optical reslices in the midbrain adjacent to but excluding the midbrain hindbrain boundary. We will then divide the apical surface into 10% bins and fit a circle to the apical surface of the neural epithelium in each to calculate the local radius of curvature, which is the reciprocal of curvature for the surface. We can then correlate these values with two-dimensional cell shape and actomyosin density metrics.
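
      As an illustration of the circle-fitting step described above, the sketch below fits a circle to a set of (x, y) points sampled along the apical surface within one bin and returns the local radius of curvature. It uses an algebraic least-squares (Kasa-style) fit, which is one common choice and not necessarily the method the authors will use; the function name and the synthetic test arc are assumptions made for the example.

```python
from math import cos, pi, sin, sqrt


def fit_circle(points):
    """Algebraic least-squares circle fit.

    points: sequence of (x, y) coordinates traced along the apical surface
    within one bin of an optical reslice.
    Returns (center_x, center_y, radius); curvature is 1 / radius."""
    n = len(points)
    if n < 3:
        raise ValueError("at least three points are required")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Work in centered coordinates so the first-order sums vanish.
    u = [x - mean_x for x, _ in points]
    v = [y - mean_y for _, y in points]
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    suv = sum(a * b for a, b in zip(u, v))
    suuu = sum(a ** 3 for a in u)
    svvv = sum(b ** 3 for b in v)
    suvv = sum(a * b * b for a, b in zip(u, v))
    svuu = sum(b * a * a for a, b in zip(u, v))
    det = suu * svv - suv * suv
    if abs(det) < 1e-12:
        raise ValueError("points are collinear; curvature is effectively zero")
    rhs_u = (suuu + suvv) / 2
    rhs_v = (svvv + svuu) / 2
    # Solve the 2x2 normal equations for the center with Cramer's rule.
    center_u = (rhs_u * svv - rhs_v * suv) / det
    center_v = (suu * rhs_v - suv * rhs_u) / det
    radius = sqrt(center_u ** 2 + center_v ** 2 + (suu + svv) / n)
    return center_u + mean_x, center_v + mean_y, radius


if __name__ == "__main__":
    # Synthetic arc (hypothetical values, arbitrary units): radius 50,
    # centered at (10, -5), spanning 200 to 340 degrees.
    angles = [(200 + 10 * i) * pi / 180 for i in range(15)]
    arc = [(10 + 50 * cos(t), -5 + 50 * sin(t)) for t in angles]
    cx, cy, r = fit_circle(arc)
    print(f"radius = {r:.2f}, curvature = {1 / r:.4f}")
```

      Nearly flat bins give a very large radius or trigger the collinearity check, so reporting curvature (1 / radius) rather than radius keeps such bins on a bounded scale.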

      Visualization of sex differences in medial actin levels (Figure 3). In Figure 3, the reported female-male difference in medial actin levels would benefit from visualization of the raw data. A zoomed-in inset of the midline region, shown separately for females and males, would help substantiate this claim.

      In Figure 3, we demonstrate patterns of the whole-cell apical F-actin (Fig. 3A, B) and Myosin IIB (Fig. 3C, D) density. We find that there is no difference in F-Actin density between males and females (Fig. 3E, F), but a significant difference in midline Myosin IIB density at 5 ss that is mostly absent by 8 ss (Fig. 3G, H). We currently provide representative images for female and male myosin IIB expression across the midline-lateral axis in Figure 3C, D, and Figure 3-Supplement 1C and D. We can provide a close-up image of Myosin IIB in the midline region for both sexes as part of Figure 3, with additional annotations on existing representative images to indicate their origin.

      Typographical error. Line 143: please correct "cell are" to "cell area".

      We thank the reviewer for pointing out this error and will correct this typo and perform additional editing to correct any other typos present in the manuscript.

      Quantitative correlation analysis between cell area and actomyosin. The authors qualitatively discuss the relationship between cell area dynamics and actomyosin levels. It would strengthen the analysis to directly compute and report correlations between these variables, and to explicitly test whether actin and myosin levels are anti-correlated with apical cell area.

      As discussed in comment #6, we will plot cell area vs. F-actin or Myosin IIB density for each embryo and fit a line to calculate their correlation coefficient. From there, we will determine if there is a negative correlation between cell area and actomyosin intensity.
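
      A minimal sketch of this per-embryo calculation is shown below: a least-squares line fit together with the Pearson correlation coefficient between apical cell area and actomyosin density. The function names and the example numbers are hypothetical and are not taken from the data set.

```python
from math import sqrt


def _centered_sums(xs, ys):
    """Return the means and the centered sums of squares and cross-products."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy


def pearson_r(xs, ys):
    """Pearson correlation coefficient; negative values mean anti-correlation."""
    _, _, sxx, syy, sxy = _centered_sums(xs, ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y, sxx, _, sxy = _centered_sums(xs, ys)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical per-bin values for one embryo (arbitrary units).
    cell_area = [12.0, 15.0, 19.0, 24.0, 30.0]
    myosin_density = [0.92, 0.81, 0.70, 0.55, 0.41]
    slope, intercept = fit_line(cell_area, myosin_density)
    print(f"r = {pearson_r(cell_area, myosin_density):.3f}, slope = {slope:.4f}")
```

      Computing one coefficient per embryo, rather than pooling all cells, keeps the embryo as the unit of replication when the coefficients are later compared between stages or sexes.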

      Interpretation of anti-correlation and contractile hinge mechanism. In lines 143-157, the authors state that the observed anti-correlation between actomyosin and cell area argues against a contractile hinge mechanism. However, this anti-correlation could also suggest that apical cell area is determined by local mechanical or geometric constraints rather than by local actomyosin contractility. The authors should clarify and discuss this alternative interpretation.

      Within the neural epithelium of mice and other vertebrates, F-actin and myosin-IIB are enriched on the apical surface relative to other regions of the cell (Sadler TW, et al. (1982) Science, Matsuda M., et al. (2023), Nat. Commun., Röper, K. (2013) BioArchitecture). This poises the actomyosin network to be able to selectively constrict the apical surface relative to the basal side of the cells. Apical constriction is observed to actively facilitate the formation of hinges in folding tissues (Chanet S, et al. (2017), Nature Communications, Nishimura T., et al. (2012) Cell, Christodoulou N., et al. (2015) Cell Reports) in what we term the contractile hinge model of tissue folding. Tissues that employ this model of folding are expected to have small apical areas and apical enrichment of contractile actomyosin at the hinge point during folding. We observe large apical areas, low apical actomyosin density, and low apical tension at the midline hinge of the mouse midbrain neural tube, which are all inconsistent with a contractile hinge mechanism being employed in this tissue folding process. We agree with the reviewer that “cell shape does not always match [acto]myosin contractility levels, because cell shape depends on extrinsic, as well as intrinsic forces” (Line 147-149). We also agree that anticorrelation of actomyosin density and apical cell area does not per se argue against the contractile hinge model and will amend our language to be clearer. We will also further elaborate on potential extrinsic factors that may lead to the observed cell behaviors at the midline in the discussion.

      Statistical robustness of laser ablation results (Figure 4). The differences in recoil velocity between regions appear small, with substantial overlap between the distributions. In addition, the sample sizes for lateral versus midline ablations appear unequal (with visibly more data points in the lateral condition). These factors raise concerns about the robustness and statistical significance of the reported differences, which should be addressed more carefully.

      In Figure 4E, we show initial recoil velocities binned only by region (lateral vs. midline) and report 3.03 μm/s vs. 2.40 μm/s, a 26% difference between the two regions. We then show in Figure 4G that by considering another relevant variable, sex, we find initial recoil to be 3.15 μm/s vs. 2.30 μm/s, a 37% difference, in females and 2.68 μm/s vs. 2.57 μm/s, a 4% difference, in males. We go on to show in Figure 5L that within the lateral region recoils also vary by direction, with a 38% difference. Ultimately the final conclusions that we draw regarding tissue tension that we present in our model are derived from the most finely disaggregated data in Figure 5. Our goal in presenting a stepwise disaggregation of the data was to demonstrate which variables had the greatest impact on the variance within our data set. We agree with the reviewer that a more precise statistical analysis of this data set is warranted that accounts for the complexity and multitude of variables that can influence our conclusions. In addition to the power analysis described in comment #10, we plan to conduct a mixed-effect model analysis of our data that considers factors including sex, age, cut direction, cut region, cut number, and embryos to determine which factors explain the most variance in the population. We will add this analysis as a supplement to Figure 4 alongside a description of the tests performed in the Statistical Analysis section of the methods. We will also adjust our language in the text to clearly state the limitations of the data as presented and qualify conclusions as appropriate.
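
      The percentage differences quoted for Figures 4E and 4G can be reproduced from the mean recoil velocities given above. The sketch below assumes each difference is expressed relative to the lower of the two means, an assumption that matches the reported 26%, 37%, and 4% after rounding; the function name and dictionary labels are illustrative.

```python
def percent_difference(higher, lower):
    """Difference between two means, expressed as a percentage of the lower one."""
    if lower <= 0:
        raise ValueError("lower must be positive")
    return 100.0 * (higher - lower) / lower


# Mean initial recoil velocities in um/s as quoted in the response
# (first value vs. second value of each reported pair).
reported_means = {
    "Fig. 4E, all embryos": (3.03, 2.40),
    "Fig. 4G, females": (3.15, 2.30),
    "Fig. 4G, males": (2.68, 2.57),
}

if __name__ == "__main__":
    for label, (first, second) in reported_means.items():
        print(f"{label}: {percent_difference(first, second):.1f}% difference")
```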

      Speculative statement regarding anisotropic tension in males. Line 278: "We believe that both sexes demonstrate anisotropic tension, given that males have cell aspect ratios and orientations in the lateral neural folds similar to females." This statement is speculative. Either anisotropic tension in males should be directly measured and reported, or this statement should be removed.

      As discussed in comment #15, in Figure 5L, though we can observe a difference in the initial recoil velocity means, we are unable to detect a statistical difference. Ablations were conducted blinded to embryo sex, but fewer male embryos were suitable for ablation because males develop faster than their female littermates (Seller MJ. and Perkins-Cole KJ. (1987) J. Reprod. Fert.). We were therefore unable to obtain more males in our data set. At present we do not have the resources to perform additional laser ablations to supplement the existing data set. We will instead perform a power analysis for our anisotropy measurements in the lateral region of the tissue to determine if: 1) we have a sample size large enough to detect a biologically-meaningful difference with suitable power, 2) the sample size required to detect the observed difference is so large that the difference would not be biologically meaningful, or 3) we do not have a sample size large enough to detect a difference confidently. With the results of this analysis, we will amend our language in the text to reflect the most accurate claims that can be made.

      Reviewer #2 (Significance (Required)):

      This study provides high-quality measurements of apical cell geometry, actomyosin organization, and inferred tension in the mouse neural epithelium. However, the lack of direct perturbations, mechanical modeling, and quantitative analysis of three-dimensional tissue deformation limits the strength of the mechanistic conclusions. Addressing these gaps would substantially strengthen the manuscript and clarify the causal role of apical tension patterns in neural fold formation. At the end of the day, the authors suggest a hypothesis that is not well supported by their data, which is of high quality.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary

      This manuscript by De La O et al addresses a long-standing question of how actomyosin contributes mechanically to cranial neural tube elevation in the mouse, a system in which classical midline contractile hinge models appear insufficient. The authors develop an image-processing and analysis pipeline that enables reconstruction and quantitative analysis of the apical actomyosin network across the large, curved dorsal surface of the mouse brain neuroepithelium. Using this approach, combined with laser ablation-based tension measurements in live embryos, they report a medio-lateral gradient of apical cell area and an inverse gradient of actomyosin density. Contrary to contractile hinge models described in frog, chick and invertebrate systems, they find that the midline exhibits low, isotropic tension, while the lateral neural folds show higher anisotropic apical tension, consistent with their proposal of a "lateral tension" mechanism for neural tube elevation.

      The work provides an important reframing of actomyosin function in mammalian cranial neurulation, supported by extensive quantitative imaging and mechanical measurements. The finding that lateral, rather than midline, actomyosin networks dominate tissue tension is compelling and helps reconcile previous observations that midline hinge formation in mouse can proceed despite actomyosin perturbation. The study is technically sophisticated and addresses a biologically important process with clear relevance to neural tube defect etiology. However, several aspects of the statistical treatment, interpretation of laser ablation data, and mechanistic framing require clarification or tempering to fully support the authors' conclusions.

      Major comments

      Statistical unit and pseudo-replication in cell-based analyses (Figures 2-3)

      In Figures 2 and 3, it is unclear whether statistical comparisons were performed at the level of individual cells or embryos. Because cells are nested within embryos, treating cells as independent observations raises concerns about pseudo-replication and inflated statistical significance, particularly for sex-dependent effects. While the color-coded maps are visually compelling, they may overstate confidence in differences between conditions if embryo-to-embryo variability is not explicitly accounted for.

      Clarification is needed as to whether statistical testing was performed on embryo-level summary values (e.g., one value per embryo per positional bin), or whether hierarchical or mixed-effects models were used with embryo treated as a random effect. Providing embryo-level summary plots would also help readers assess inter-embryo variability. Addressing this point is important for confidence in both the reported medio-lateral gradients and the sex differences.

      We agree with the reviewer that it is inappropriate to calculate statistics based on measurements of individual cells. As indicated under the ‘Statistical Analysis and Figure Assembly’ section of our methods “For fixed images, cell shape and protein intensity analysis (Figure 2H-J, Figure 3E-H, Figure 5E-H), N = 5 embryos for all conditions and n, or the number of cells in each 10% bin, is ≥ 150 cells for each embryo” (Line 556 – 558). The average and SD between embryos are shown in these plots and are calculated at the embryo level, not the cell level. We chose to consolidate this information in the methods section as the same data set is used across the three figures. We will add a line to the figure captions that N values for all experiments can be found in this section of the methods. We will also provide supplementary plots showing the bin averages for each individual embryo, color coded by embryo to show the distribution of the data set.
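
      The difference between embryo-level and cell-level statistics can be illustrated with the short sketch below, which averages cells within each embryo and positional bin first and only then computes the mean and SD across embryos. The record layout (embryo ID, bin index, measured value) and the function name are assumptions made for the example, not the structure of the authors' pipeline.

```python
from collections import defaultdict
from statistics import mean, stdev


def embryo_level_summary(cells):
    """Summarize per-cell measurements with the embryo as the statistical unit.

    cells: iterable of (embryo_id, bin_index, value) records.
    Returns {bin_index: (mean of embryo means, SD of embryo means, N embryos)}."""
    per_embryo_bin = defaultdict(list)
    for embryo_id, bin_index, value in cells:
        per_embryo_bin[(embryo_id, bin_index)].append(value)

    # Collapse each embryo to a single value per bin before any statistics.
    embryo_means_by_bin = defaultdict(list)
    for (_, bin_index), values in per_embryo_bin.items():
        embryo_means_by_bin[bin_index].append(mean(values))

    return {
        bin_index: (
            mean(embryo_means),
            stdev(embryo_means) if len(embryo_means) > 1 else 0.0,
            len(embryo_means),
        )
        for bin_index, embryo_means in embryo_means_by_bin.items()
    }


if __name__ == "__main__":
    # Hypothetical apical areas (arbitrary units) for two embryos, one bin.
    # Embryo "e2" contributes more cells, which would bias a pooled
    # cell-level mean toward its value.
    records = [("e1", 0, 10.0), ("e1", 0, 20.0)] + [("e2", 0, 30.0)] * 4
    for bin_index, (avg, sd, n) in embryo_level_summary(records).items():
        print(f"bin {bin_index}: mean = {avg:.1f}, SD = {sd:.2f}, N = {n} embryos")
```

      In this toy example the embryo-level mean is the average of the two embryo means, whereas pooling all six cells would weight the second embryo twice as heavily.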

      Interpretation of actomyosin density as a proxy for contractility (Figure 3)

      The descriptive correlation between apical cell area and actomyosin density is clear and consistent. However, actomyosin abundance alone does not necessarily equate to force generation, particularly in the absence of measurements of myosin activation state (e.g., pMLC), actomyosin dynamics, or direct perturbations linking actomyosin levels to mechanical output. Although the authors appropriately note that cell shape does not always reflect intrinsic contractility, actomyosin density is nevertheless used to argue against a contractile hinge mechanism.

      While the subsequent laser ablation experiments address tissue tension more directly, the mechanistic conclusions drawn from actomyosin density measurements alone would benefit from more careful qualification. Tempering language that equates actomyosin enrichment with contractile output, or explicitly acknowledging these limitations, would strengthen the interpretation.

      It is largely believed that apical pools of actomyosin are active and that apical localization of actomyosin is dependent on activation. Shroom3, an actin-binding protein, is localized to the apical adherens junctions in the neural tube (Haigo SL., et al. (2003) Curr. Biol., Hildebrand JD. and Soriano P. (1999) Cell), where it can recruit Rho kinases (ROCKs) that in turn phosphorylate and activate Myosin IIB (Nishimura T. and Takeichi M. (2008) Dev.). Mutations in Shroom3 lead to neural tube closure defects, and its overexpression in the neural tube can induce apical constriction and increased apical accumulation of Myosin II (Haigo SL., et al. (2003) Curr. Biol., Hildebrand JD. (2005) J. Cell Sci.). In the mouse neural tube, Myosin IIB intensity is greater in cells that can apically constrict than in those that cannot constrict (Galea GL., et al. (2021) Nat Commun). Additionally, inhibition of ROCK reduces apical tension, presumably by reduction of activated Myosin II (Butler MB., et al. (2019) J. Cell Sci.). We agree with the reviewer’s assessment that to definitively state that the apical pools of Myosin IIB and F-actin are promoting apical contractility, a demonstration of the phosphorylation state of the Myosin II regulatory light chain (pMLC) or observations/perturbations in live embryos is necessary. We will adjust our language to reflect this limitation. We will also provide information on the relationship between apically localized actomyosin and contractility.

      Statistical and biological independence of laser ablation measurements (Figures 4-5)

      The Methods indicate that 155 laser ablations were analyzed across 71 embryos, implying that multiple ablations were performed per embryo. It would be helpful to clarify how this hierarchical data structure was handled statistically. Specifically, were recoil velocities averaged per embryo, paired with embryos for ML vs. RC comparisons, or analyzed using hierarchical/mixed-effects models?

      Our laser ablation data set captures variables including embryo sex, age, cut location, cut direction, and cut number. Therefore, we did not feel it appropriate to average recoils within the same embryo as these cuts were intentionally in different regions (lateral vs midline) or in different orientations (i.e. a rostral-caudal cut and midline-lateral cut on opposite lateral folds), which our analysis has shown would lead to averaging out potential differences. Ablations were far apart from each other, and we had checked that ablation order did not predict changes in recoil. However, we agree with the reviewer that a more precise statistical analysis of this data set is warranted that accounts for the complexity of variables potentially influencing initial recoil velocities. As discussed in comment #27, we plan to conduct a mixed-effect model analysis of our data that considers the above and add this analysis as a supplement to Figure 4. We will include a description of this in the methods and adjust our language in the text to clearly state the limitations of the data as presented and qualify conclusions as appropriate.

      In addition, embryos were subjected to up to 5 ablations within a short time window. Because laser ablation disrupts tissue integrity and can induce rapid cytoskeletal remodeling, it is unclear whether later ablations represent independent measurements of the native tension state. Clarification is needed regarding whether the authors tested for effects of ablation order (e.g., first vs. later cuts), ensured sufficient spatial separation between ablation sites, or verified that repeated ablations did not systematically alter recoil measurements. Demonstrating that initial recoil velocity is independent of cut number would substantially strengthen confidence in the mechanical conclusions.

      We agree with the reviewer that laser ablations disrupt tissue, and that these disruptions can affect the results of additional ablations performed near the site of prior ablations. The average embryo in our data set has three ablations: one on either neural fold and one at the midline, separated from each other by hundreds of µm. In embryos that had more than 3 ablations, the additional cuts were also made far away from each other (they were performed in the hindbrain rhombomeres, rhombomere boundaries, at the neuroepithelium and surface ectoderm boundary, or at the zipper point, but n numbers of these are insufficient for analysis). We will supplement the methods text describing the laser ablations to clarify this for readers. Additionally, after an ablation, displacement is not detectable further than 3-5 cell lengths away from the cut, even several seconds post ablation. We will provide visual examples of these cuts after ablation to demonstrate this phenomenon. As discussed in comments #27 and #32, we will also perform mixed-effect modeling to determine if cut number impacts observed initial recoil velocities. We will also provide plots demonstrating relevant examples of these comparisons (e.g. sequential lateral cuts made in the same direction).

      Interpretation of sex-dependent tension differences (Figures 4-5)

      Figure 4 shows a clear lateral-greater-than-midline tension difference in females, whereas this pattern is not detected in males under initial analysis. Later, Figure 5 reveals directional anisotropy in the lateral neural folds of both sexes. As currently framed, this creates some ambiguity regarding whether the proposed lateral tension mechanism is sex-specific, sex-biased in magnitude, or sex-general but masked by directional averaging in males.

      Clarifying this distinction, both in figure presentation and in the text, would strengthen the mechanistic interpretation and prevent confusion. In particular, it would be helpful to more clearly explain how directional anisotropy reconciles the apparent absence of regional tension differences in males in Figure 4.

      We appreciate the reviewer taking the time to indicate this point of confusion. We ultimately conclude that the lateral tension model of neural tube elevation is agnostic of sex. Though there are nuanced differences in some of the details regarding Myosin IIB density, midline apical constriction and tension anisotropy, we do not believe these differences would fundamentally change the mechanical model used between sexes. With specific regard to masking of the lateral neural fold tension in males, we briefly address this in the discussion: “The averaging of [Rostral-Caudal] and [Midline-Lateral] [Initial Recoil Velocities] likely masked tension differences between the midline hinge and lateral neural folds, creating the false impression that males did not have high tension on the lateral neural folds” (Line 280-282). We will adjust the text in the results and discussion sections to clearly indicate that our lateral tension model applies to both sexes, though some differences in specific details exist, and that averaging may have led to the result in Figure 4G.

      Causal overreach in mechanistic interpretation of anisotropic tension

      While the laser ablation data convincingly demonstrates spatial and directional differences in recoil consistent with patterned mechanical anisotropy, the manuscript frequently treats anisotropic apical tension as a mechanistic explanation for neural tube elevation. The presented experiments do not directly test whether anisotropic apical tension is necessary or sufficient for tissue bending, nor whether isotropic tension at the midline plays a causal role. Initial recoil velocity reflects not only pre-existing tension but also tissue geometry and viscoelastic properties, which may differ between midline and lateral regions.

      As such, statements suggesting that anisotropic lateral tension "explains" neural fold elevation should be tempered or reframed. The data strongly support spatial patterning of mechanical properties but do not yet establish causal primacy. Recasting the model as a mechanically consistent framework rather than a definitive mechanism would better align conclusions with the data.

      Our lateral tension model proposes that a regionalized difference in tension, with high tension in the lateral neural folds and low tension at the midline, is needed to enable neural tube elevation and ultimate closure. We agree with the reviewer: our work demonstrates that the results of our laser ablation experiments, along with measurements of cell shapes and protein density, are consistent with the lateral tension model that we propose. Our model is also supported by past work showing that perturbations that disrupt actomyosin contractility lead to defects in brain neural tube elevation and closure but not midline hinge formation. For example, chemical perturbation of actin polymerization with Cytochalasin D (Ybot-Gonzalez P. and Copp AJ. (1999) Dev. Dyn), and genetic perturbations of Shroom 3, which apically localizes actomyosin (Hildebrand JD. and Soriano P. (1999) Cell) or Fhod3, which promotes actin polymerization (Sulistomo HW, et al. (2019) J. Biol. Chem.) all have brain neural tube closure defects but form a midline hinge. However, since we do not directly perturb tension, we have only demonstrated consistency rather than causality or sufficiency. We will adjust and temper our language accordingly in the relevant sections of the results and discussion.

      Minor comments

      Manuscript length and clarity

      The manuscript is longer and more complex than necessary for its central message. Several sections of the Results, particularly methodological validation and somite-stage stratification, could be streamlined.

      We agree with the reviewer and will continue editing the manuscript, prioritizing clarity, brevity, and precision of language so that readers are able to quickly understand the key points of the manuscript.

      Sex differences section

      The section on sex differences is interesting but somewhat tangential. Clarifying whether these findings are intended as mechanistic insight or observational motivation for future work could improve focus.

      We intended this section to offer perspectives that inform and motivate future work to continue to track, analyze, and report on sex differences during development. We will edit this section of the discussion to improve clarity and brevity so that the reader can easily acquire this takeaway. Sex differences in the penetrance of exencephaly are an active area of research, and our manuscript provides the first cell-level measurements, which will guide the field in disaggregating future analyses by embryo sex.
      
    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



      Referee #1

      Evidence, reproducibility and clarity

      Summary

      Folding is a major morphogenetic process that shapes tissues and organs in three dimensions. The mechanisms underlying tissue folding have been extensively explored and are often driven by actomyosin-based apical constriction. Here, the authors describe changes in cell geometry and mechanics during mouse neural tube formation. They build on quantitative fixed imaging and live junction ablation to extract cell geometry and junctional tension. These analyses are performed at different developmental stages and in both male and female embryos to propose a mechanical mechanism for neural tube elevation in the brain.

      Major comments

      The authors report quantitative data on cell geometry and junctional tension inferred from laser ablation. Overall, there are numerous statements that require stronger support from the experimental data. To substantiate several of their claims, the authors need to provide a larger number of data points (or at least comparable numbers across experimental conditions) for the tension measurements. Additional statistical analyses are required throughout to support the conclusions.

      Figure 1

      • Does the projection algorithm account for tissue curvature when computing cell geometrical parameters such as area and anisotropy?
      • The authors should provide information on the accuracy and reliability of the cell segmentation.

      Figure 2

      • The authors indicate that the rate of apical constriction differs between male and female embryos. However, apical sizes differ only at specific positions along the ML axis (Fig. 2H, I). The authors should provide statistical analyses for the rates shown in Fig. 2J. Are these rates significantly different between males and females, and between medial and lateral regions?
      • Please clearly state the main novelty of this study relative to the work published by Brooks et al.

      Figure 3

      • The authors need to provide statistical support for the claim that large midline cells exhibit reduced F-actin and Myosin IIB levels.
      • F-actin and Myosin IIB intensities should be plotted as a function of cell area to support the proposed anticorrelation between apical area and actomyosin levels.
      • Statistical analyses are missing to substantiate the increase in F-actin levels between stages ss5 and ss8.
      • Figure S3 should be supported by plots showing Myosin II and F-actin intensity as a function of position along the ML axis, together with appropriate statistics.

      Figure 4

      • The authors state that lateral tension in male embryos is not different from midline tension, yet the number of data points is much lower than in females. To support this claim, the number of ablations should be comparable across sexes. Is lateral tension different between males and females?
      • Similarly, the data in Fig. S4 used to claim no change in tension over time are not supported by sufficient data points. Would the medial and lateral tensions reported in Fig. 4G remain unchanged if the authors perform statistical analyses on 10-15 ablations per condition?

      Figure 5

      • The number of data points in Fig. 5J and L is insufficient to support claims of no difference. The only detectable difference arises in the comparison with much higher sample size (Fig. 5L, ML vs RC). The authors conclude that males have higher ML tension than RC tension, but given the limited data this conclusion should be amended to "no detectable difference."

      Code availability

      The authors should provide access to the code used to generate the projections.

      Significance

      The authors propose a mechanical model for neural tube elevation based on analyses of cell geometry and tension at two developmental stages. The reported differences in cell geometry or actomyosin levels do not appear to explain the differences in geometry or tension suggested between male and female embryos. This raises questions about the relationship between these measurements and their relevance for understanding the mechanisms of neural tube elevation.

      If the major concerns outlined above are rigorously addressed, the manuscript will offer a valuable descriptive characterization of neural tube cell geometry and mechanical stress during morphogenesis. Such datasets could form a foundation for future studies investigating the mechanisms driving neural tube elevation.

    1. Amazon Mechanical Turk [p16]: A site where you can pay for crowdsourcing small tasks (e.g., pay a small amount for each task, and then let a crowd of people choose to do the tasks and get paid).

      This site is a fascinating artifact of modern labor and of how the processes of the internet abstract humanity away from code. So many of the algorithms of Amazon, Google, and YouTube work partly on the basis of human labor, yet because the internet allows asynchronous, atomized labor, we forget this fact and think of the algorithm as a magical bit of technology. I wonder if there is a way to organize the internet that presents the labor that goes into its creation in a way that is unobtrusive to the user.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      General assessment of the work:

      In this manuscript, Mohr and Kelly show that the C1 component of the human VEP is correlated with binary choices in a contrast discrimination task, even when the stimulus is kept constant and confounding variables are considered in the analysis. They interpret this as evidence for the role V1 plays during perceptual decision formation. Choice-related signals in single sensory cells are enlightening because they speak to the spatial (and temporal) scale of the brain computations underlying perceptual decision-making. However, similar signals in aggregate measures of neural activity offer a less direct window and thus less insight into these computations. For example, although I am not a VEP specialist, it seems doubtful that the measurements are exclusively picking up (an unbiased selection of) V1 spikes. Moreover, although this is not widely known, there is in fact a long history to this line of work. In 1972, Campbell and Kulikowski ("The Visual Evoked Potential as a function of contrast of a grating pattern" - Journal of Physiology) already showed a similar effect in a contrast detection task (this finding inspired the original Choice Probability analyses in the monkey physiology studies conducted in the early 1990's). Finally, it is not clear to me that there is an interesting alternative hypothesis that is somehow ruled out by these results. Should we really consider that simple visual signals such as spatial contrast are *not* mediated by V1? This seems to fly in the face of well-established anatomy and function of visual circuits. Or should we be open to the idea that VEP measurements are almost completely divorced from task-relevant neural signals? Why would this be an interesting technique then? 
      In sum, while this work reports results in line with several single-cell and VEP studies and perhaps is technically superior in its domain, I find it hard to see how these findings would meaningfully impact our thinking about the neural and computational basis of spatial contrast discrimination.

      We agree that single cell measurements allow for a spatially more detailed analysis, but they are not feasible in humans. Assuming we value insights into the relationship between neural activity and decision making in the human as well as non-human brain, we are restricted to non-invasive measurements such as EEG, which inevitably showcase the neural underpinnings of decision making at a coarser level of analysis. This was the challenge we met with our paradigm design. For example, we chose contrast as the task-relevant stimulus feature in this study because monotonic contrast response functions exist for sensory neurons throughout the visual system, and the aggregated measures that we could attain with EEG would reflect that contrast-sensitivity and hence provide a window onto the encoding of the main decision-relevant quantity. We were specifically interested in initial afferent, contrast-dependent V1 activity reflected in the C1 component (80-90 ms). As we point out in the Introduction, the C1 is unusual among EEG signals in the extent to which it is dominated by a single visual area, V1 (Jeffreys & Axford, 1972; Clark et al., 1994; Di Russo et al., 2002; Ales et al., 2010; Mohr et al., 2024), and even if other downstream areas also make a minor contribution in the C1 time period, it still represents a very low-level sensory response early in the sensory analysis pipeline, appropriate for addressing our primary question of whether such a low-level signal is used in the formation of perceptual decisions. The alternative hypothesis, that early responses are passed over in decision readout, relates to a fundamental debate about whether early sensory responses are separated from cognition. 
      The possibility that late, but not early, representations are correlated with choices does not imply that the later sensory representations are divorced from the earlier ones, only that there is a noise component that is not shared between the two, such as that produced by the ensuing computations that generate the later representations. Instead, a lack of choice probability in early representations would imply that decision readout is selective in where it sources sensory evidence from, with some possible reasons being to maintain high quality standards for sensory evidence or to impose a layer of separation between cognition and sensation.

      As the reviewer points out, the animal literature is highly mixed on the topic of choice probability in V1. Even for orientation discrimination tasks where V1 is ostensibly highly suited given the existence of orientation columns in V1, and even when measurements are taken from V1 neurons with good neurometric performance and/or aggregated across a V1 population (Jasper et al 2019), some studies have reported little to no V1 choice probability. If our alternative hypothesis of no EEG-indexed V1 choice probability flies in the face of well-established anatomy and function of visual circuits, then so also do these empirical findings in the animal neurophysiology literature. 

      Although there are important aspects of choice probability that are accessible in single cell studies but not in EEG (e.g. noise correlations, details of circuit physiology), our EEG measurements tap into the same phenomenon, just at a different level of analysis, i.e. the neural population level. At this level, we have been able to address whether the full body of sensory responses at a particular stage of visual analysis is systematically related to perceptual decision outcomes. Very similar questions are in fact sometimes addressed in the animal neurophysiology literature; for example, Kang and Maunsell (2020) aggregated single-cell choice probability measurements within visual areas to investigate whether choice probability strength at the level of an entire visual area was sensitive to task demands. The global vantage point of EEG comes with the additional benefit of picking up signatures of other potentially mediating processes such as attention and being able to control for them in our analysis. Our human study thus provides a valuable complementary viewpoint alongside animal neurophysiology work in this area.
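      For readers less familiar with the measure, the sketch below shows the standard ROC-based definition of choice probability: at a fixed stimulus, it is the probability that a randomly drawn trial ending in one choice has a larger neural response than a randomly drawn trial ending in the other choice. This is only a minimal illustration; the function name and the amplitude values are hypothetical, and the authors' own analysis of the C1 may use a different estimator.

```python
def choice_probability(amps_choice_a, amps_choice_b):
    """ROC area: probability that a random trial ending in choice A has a
    larger amplitude than a random trial ending in choice B.
    Ties count as one half."""
    wins = 0.0
    for a in amps_choice_a:
        for b in amps_choice_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(amps_choice_a) * len(amps_choice_b))

# Hypothetical single-trial amplitudes recorded at one fixed stimulus contrast,
# split by the choice the observer made on that trial.
chose_upper = [1.2, 0.9, 1.5, 1.1, 1.4]
chose_lower = [1.0, 0.8, 1.3, 0.7, 1.1]
cp = choice_probability(chose_upper, chose_lower)
print(f"choice probability: {cp:.2f}")
```

      A value of 0.5 means the response carries no information about the upcoming choice, while values above 0.5 mean larger responses tend to precede choice A.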

      Summary of substantive concerns:

      (1) The study of choice probability in V1 cells is more extensive than portrayed in the paper's introduction. In recent years, choice-related activity in V1 has also been studied by Nienborg & Cumming (2014), Goris et al (2017), Jasper et al (2019), Lange et al (2023), and Boundy-Singer et al (2025). These studies paint a complex picture (a mixture of positive, absent, and negative results), but should be mentioned in the paper's introduction.

      We thank the reviewer for highlighting these papers bearing on choice-related activity in V1, only two of which we had cited. The three additional studies do indeed lend further support to our description of the complex picture around V1-CP effects in the literature and we have now included them.

      (2) The very first study to conduct an analysis of stimulus-conditioned neural activity during a perceptual decision-making task was, in fact, a VEP study: Campbell and Kulikowski (1972). This study never gained the fame it perhaps deserves. But it would be appropriate to weave it into the introduction and motivation of this paper.

      We are aware of this paper, and indeed we ourselves have shown steady-state VEP (SSVEP) correlations with timing and selection of decision reports (O'Connell et al 2012; Grogan et al 2023), but SSVEPs do not provide an index of initial afferent V1 activity in the way that the C1 of the transient VEP does. SSVEPs are evoked by a rapid sequence of stimulus onsets, so that activity cannot be attributed to a particular stimulus onset nor its bottom-up latency resolved, and, being a response to an ongoing stimulus, it combines top-down and bottom-up influences from striate and extra striate areas (Di Russo et al 2007). Indeed, in Campbell and Kulikowski (1972) the SSVEP was almost entirely eliminated when the stimulus was undetected. This is in keeping with robust modulations of the SSVEP by spatial attention (Muller and Hillyard 2000). Cognitive influences of this magnitude are never observed in the C1, and in fact are often not observed at all even when later VEP components show robust modulations (Luck et al 2000), which motivated a recent meta-analysis to address the issue (Qin et al 2022). This highlights the important distinction between the earliest transient VEP activity reflecting mainly the initial afferent response in V1, and steady-state sensory activity reflecting a mix of bottom-up and top-down influences across visual cortex. Because of the importance of this distinction, we have added a reference to the above SSVEP papers to the 3rd paragraph of the introduction along with a statement about the distinction.

      (3) What are interesting alternative hypotheses to be considered here? I don't understand the (somewhat implicit) suggestion here that contrast representations late in the system can somehow be divorced from early representations. If they were, they would not be correlated with stimulus contrast.

      This same conundrum applies to single-cell studies of choice probability. Do studies showing choice probability in V4 but not V1 for example demonstrate that V4 is divorced from V1? In such studies, measurements are typically taken from large representative samples of neurons from both areas with good neurometric performance in both cases and the task often (though not always) involves a target stimulus feature that is encoded in V1 such as orientation. Why then should V4 but not V1 show choice probability when we know the vast majority of input to the visual cortex passes through V1? It must be that feature representation and choice formation are different things, with one not implying the other. This is true for an EEG study as much as it is for a single-cell study.

      The alternative hypothesis in our study is that the early sensory responses indexed by the C1 are not directly used in the formation of the perceptual decision at hand. As outlined in our comments above, this does not imply that those early responses are divorced from later responses. Of course, both are correlated with stimulus contrast and so would correlate with each other across changing contrast but this does not necessitate that their noise is correlated when contrast is held constant because new instantiations of noise can be generated by the computations performed at each stage of visual processing. Thus, the interesting alternative hypothesis is that information contained in the sensory representation generated during initial afferent V1 activity is not used directly to form decisions, and instead, decisions are read out from the outputs of computations performed further downstream. Such an outcome, if it had arisen in our data, would have been consistent with a separation between cognition and early visual processing. Instead, our results suggest a certain level of cognitive interfacing at the lowest and earliest cortical levels of visual processing. We have now added text to the Introduction to highlight the distinction between sensory representation and decision readout in order to make the alternative hypothesis clearer.
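      The distinction between correlation across contrasts and noise correlation at a fixed contrast can be made concrete with a toy simulation. The sketch below assumes the limiting case in which each stage adds fully independent noise to a shared contrast signal; the function names, noise level, and contrast values are hypothetical and are not taken from the study.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate_stage_responses(contrasts, n_trials, noise_sd, rng):
    """Early and late responses both track contrast, each with its own noise."""
    early, late = [], []
    for contrast in contrasts:
        for _ in range(n_trials):
            early.append(contrast + rng.gauss(0.0, noise_sd))
            late.append(contrast + rng.gauss(0.0, noise_sd))  # fresh noise draw
    return early, late

rng = random.Random(1)
# Contrast varies across trials: both stages follow it, so they correlate.
early_all, late_all = simulate_stage_responses([0.2, 0.4, 0.6, 0.8], 2000, 0.1, rng)
# Contrast held constant: only the (independent) noise remains.
early_fixed, late_fixed = simulate_stage_responses([0.5], 2000, 0.1, rng)

r_across = pearson(early_all, late_all)
r_fixed = pearson(early_fixed, late_fixed)
print(f"across contrasts: r = {r_across:.2f}; fixed contrast: r = {r_fixed:.2f}")
```

      In this toy case a decision read out only from the late stage would show choice probability there but not in the early stage, even though both stages encode contrast. Real visual areas presumably share some noise, so the independence assumed here is an idealization.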

      (4) I find the arguments about the timing of the VEP signals somewhat complex and not very compelling, to be honest. It might help if you added a simulation of a process model that illustrated the temporal flow of the neural computations involved in the task. When are sensory signals manifested in V1 activity informing the decision-making process, in your view? And how is your measure of neural activity related to this latent variable? Can you show in a simulation that the combination of this process and linking hypothesis gives rise to inverted U-shaped relationships, as is the case for your data?

      We thank the reviewer for this suggestion of a simulation, which we carried out using the Matlab code. We have also included new Figure 1-Figure Supplement 1 in the revised manuscript.

      In our view, sensory signals in V1 are informing the decision-making process in this task from at least as early as the initial afferent response. The main point about C1 latency in relation to the response-time contingency of the choice probability effect is that the more time that elapses without a decision made (and therefore the more additional sensory processing that contributes to the decision), the more diluted is the contribution of the C1 to the decision by contributions from later representations, and thus choice probability reduces. Likewise, when response times are too quick for C1 evidence to contribute, choice probability is also absent, hence the inverted-U-shaped curve. Moreover, if the C1-choice correlation is mediated by a top-down factor such as attention rather than readout, the inverted-U-shaped curve is not expected because in such a case the relative timing of the C1 and choice commitment would not be relevant.
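      The dilution argument can be sketched with a toy readout model. This is not the authors' Matlab simulation: it is a minimal illustration that assumes the decision is the sign of a sum of independent, equally weighted evidence samples at a constant stimulus, with the C1 treated as one sample arriving at a fixed step. All names and parameter values are hypothetical.

```python
import random

def roc_area(group_a, group_b):
    """Rank-based ROC area, P(random a > random b); assumes no tied values."""
    labelled = sorted([(v, 1) for v in group_a] + [(v, 0) for v in group_b])
    rank_sum_a = sum(rank for rank, (_, is_a) in enumerate(labelled, start=1) if is_a)
    n_a, n_b = len(group_a), len(group_b)
    return (rank_sum_a - n_a * (n_a + 1) / 2) / (n_a * n_b)

def simulate_cp(n_integrated, c1_step, n_trials, rng):
    """Choice probability of the C1 sample when the decision integrates the
    first n_integrated evidence samples (more samples = later response)."""
    c1_when_a, c1_when_b = [], []
    n_steps = max(n_integrated, c1_step + 1)
    for _ in range(n_trials):
        # Constant stimulus: every sample is zero-mean noise.
        samples = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
        decision_variable = sum(samples[:n_integrated])
        c1 = samples[c1_step]
        (c1_when_a if decision_variable > 0 else c1_when_b).append(c1)
    return roc_area(c1_when_a, c1_when_b)

rng = random.Random(7)
C1_STEP = 3  # the C1 is the fourth evidence sample in this toy timeline
cp_fast = simulate_cp(2, C1_STEP, 6000, rng)   # commits before the C1 arrives
cp_mid = simulate_cp(4, C1_STEP, 6000, rng)    # C1 is one of four samples
cp_slow = simulate_cp(16, C1_STEP, 6000, rng)  # C1 diluted among sixteen samples
print(f"fast: {cp_fast:.2f}, intermediate: {cp_mid:.2f}, slow: {cp_slow:.2f}")
```

      Under these assumptions, choice probability sits near chance for the fastest decisions, peaks when the C1 has just entered the readout, and declines as more samples are integrated, which gives the inverted-U shape. A top-down account in which a common factor drives both the C1 and the choice would not depend on this timing, and so is not captured by this sketch.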

      Reviewer #2 (Public review):

      Summary:

      Mohr and Kelly report a high-density EEG study in healthy human volunteers in which they test whether correlations between neural activity in the primary visual cortex and choice behavior can be measured non-invasively. Participants performed a contrast discrimination task on large arrays of Gabor gratings presented in the upper left and lower right quadrants of the visual field. The results indicate that single-trial amplitudes of C1, the earliest cortical component of the visual evoked potential in humans, predict forced-choice behavior over and beyond other behavioral and electrophysiological choice-related signals. These results constitute an important advance for our understanding of the nature and flexibility of early visual processing.

      Strengths:

      (1) The findings suggest a previously unsuspected role for aggregate early visual cortex activity in shaping behavioral choices.

      (2) The authors extend well-established methods for assessing covariation between neural signals and behavioral output to non-invasive EEG recordings.

      (3) The effects of initial afferent information in the primary visual cortex on choice behavior are carefully assessed by accounting for a wide range of potential behavioral and electrophysiological confounds.

      (4) Caveats and limitations are transparently addressed and discussed.

      We would like to thank the reviewer for these positive remarks.

      Weaknesses:

      (1) It is not clear whether integration of contrast information across relatively large arrays is a good test case for decision-related information in C1. The authors raise this issue in the Discussion, and I agree that it is all the more striking that they do find C1 choice probability. Nevertheless, I think the choice of task and stimuli should be explained in more detail.

      We thank the reviewer for raising this point about the large stimulus arrays. As we said in our Discussion, it would seem that aggregation across a large stimulus region would be better suited to a downstream visual area with larger receptive fields, yet our setting of a strict deadline would put the emphasis back on earlier sensory representations. We now elaborate on this matter in the discussion, to say that although the small receptive fields and short, slow horizontal connections in V1 mean that the aggregation necessary for performing the task is unlikely to happen within V1 during the C1 timeframe, the aggregation would be readily achieved simply by convergence of the outputs of all relevant V1 neurons for a given stimulus array on the same decision process. In this sense, the design of our paradigm was such that the globally-measured C1 component on the scalp reflected the same aggregated evidence input as the summed V1 readout that we suppose would be entering the decision process.  

      We have also added further rationale in the Methods section on the practical benefits of the stimulus design, as the reviewer anticipates in their subsequent point, of yielding robust C1 signals. This concern was paramount in the design of this study because we expected the C1 difference metric that was of interest to be very small. We also needed a robust C1 to be measured in both the upper and lower visual field in as many individuals as possible and, in our experience, this is true less often when using smaller stimuli, even with a pre-mapping procedure.

      It also helped to homogenize C1 topography across individuals and ensure that topographies from the upper and lower visual field had sufficient overlap that there were electrodes with strong loading from both topographies where the C1 difference as a function of which array was brighter would be maximal.

      We have updated the methods section to provide these rationales while we describe the stimulus design.

      (2) In a similar vein, while C1 has canonical topographical properties at the grand-average level, these may differ substantially depending on individual anatomy (which the authors did not assess). This means that task-relevant information will be represented to different degrees in individuals' single-trial data. My guess is that this confound was mitigated precisely by choosing relatively extended stimulus arrays. But given the authors' impressive track record on C1 mapping and modeling, I was surprised that the underlying rationale is only roughly outlined. For example, given the topographies shown and the electrode selection procedure employed, I assume that the differences between upper and lower targets are mainly driven by stimulus arms on the main diagonal. Did the authors run pilot experiments with more restricted stimulus arrays? I do not mean to imply that such additional information needs to be detailed in the main article, but it would be worth mentioning.

      We thank the reviewer for their thoughtful consideration of this issue about individual variability in C1 retinotopy. Indeed, as the reviewer anticipated, we expected the large stimulus coverage to mitigate this issue, and we think that our response to the point above, and the changes we made to the manuscript in response, address this point also. Although we did not show this in the manuscript, we did in fact find that C1 topography was much more similar across individuals than it has been in previous C1 experiments we have carried out with smaller stimuli.

      However, we acknowledge the reviewer’s point that the signal measured at a specific electrode likely has a variable loading strength from the various gratings in the stimulus array and that the gratings of maximal loading may indeed vary from subject to subject. Such inter-subject variability cannot confound the choice probability effects because the latter are measured within-subject. Nevertheless, it could be a source of noise. We believe the impact of this is unlikely to be substantial for the following reasons:

      i) We designed the spatial spread of contrasts in such a way as to encourage participants to aggregate across the full array. In essence, to match the property of the C1 as an aggregate measure of V1 activity, we designed a task that involved aggregating across stimulus elements. Therefore, the decision weighting applied to any particular grating should be representative of the weighting applied to all gratings and, as such, the specific gratings that contribute most to the C1 signal for a particular participant should be relatively inconsequential.

      ii) By avoiding the horizontal and vertical meridians we avoided the regions of space where the shifts in C1 topography are largest.

      (3) Also, the stimulus arrangement disregards known differences in conduction velocity between the upper and lower visual fields. While no such differences are evident from the maximal-electrode averages shown in Figure 1B, it is difficult to assess this issue without single-stimulus VEPs and/or a dedicated latency analysis. The authors touch upon this issue when discussing potential pre-C1 signals emanating from the magnocellular pathway.

      Indeed, there are important differences in V1 properties between the upper and lower visual fields, visual acuity being another example in addition to conduction velocity as the reviewer points out. However, these differences appeared to be quite minimal in this case (Figure 1B does in fact include a single-stimulus VEP – the “1-stim” entry in the legend). Perhaps this is also due to the large stimulus array which may include a range of conduction velocities within it and thereby blur overall differences between the upper and lower visual field. The variability of contrast within each array was also quite high (+/-20% from the midpoint), which would have further increased within-array conduction velocity variability and blurred differences between arrays.

      Our staircasing procedure may have also helped in this regard to some extent as it included a bias parameter between the arrays to account for any behavioural response biases. Although the small contrast changes it usually incurred are likely much too small to change conduction velocities, it corrected for any effect on behaviour they may have.

      (4) I suspect that most of these issues are at least partly related to a lack of clarity regarding levels of description: the authors often refer to 'information' contained in C1 or, apparently interchangeably, to 'visual representations' before, during, or following C1. However, if I understand correctly, the signal predicting (or predicted by) behavioral choice is much cruder than what an RSA-primed readership may expect, and also cruder than the other choice-predictive signals entered as control variables: namely, a univariate difference score on single-trial data integrated over a 10 ms window determined on the basis of grand-averaged data. I think it is worth clarifying and emphasizing the nature of this signal as the difference of aggregate contrast responses that *can* only be read out at higher levels of the visual system due to the limited extent of horizontal connectivity in V1. I do not think that this diminishes the importance of the findings - if anything, it makes them more remarkable.

      It is true that a univariate measure may stick out in a field increasingly favouring multivariate analyses with the spread of machine learning, and so we have added a short qualifier in the methods section where we describe the C1 measurement to explicitly state that it is a scalar variable. What we have done in using this univariate measure is leverage the rich prior knowledge about V1 anatomy and neurophysiology, rather than trust in data-driven classifiers; interestingly, we found that such a classifier trained on all electrodes discriminates choices less well than our informed univariate measure during the C1 time-frame.

      We also thank the reviewer for raising an interesting point about the nature of aggregation and readout in the context of our stimulus. We agree that it is not feasible that V1 activity would be aggregated locally in V1 across such large regions of space prior to being read out within the C1 time period. As we say above, the aggregation may instead be carried out through convergent transmission of the parallel, spatially-local V1 information to the decision process.

      (5) Arguably even more remarkable is the finding that C1 amplitudes themselves appear to be influenced by choice history. The authors address this issue in the Discussion; however, I'm afraid I could not follow their argument regarding preparatory (and differential?) weighting of read-outs across the visual hierarchy. I believe this point is worth developing further, as it bears on the issue of whether C1 modulations are present and ecologically relevant when looking (before and) beyond stimulus-locked averages.

      We thank the reviewer for their positive appraisal of this additional finding, which we also found remarkable. We agree that our description of our interpretation was too brief and lacked clarity. We have reworded it and expressed it in terms of the speed accuracy trade-off, with the new explanation given below. However, it is important to remember that this account is speculative and serves only to explain the response-time contingency of the bias. That the bias was present and constitutes a modulation of the C1 does not rest on this argument:

      […] “to explain the RT contingency for the C1 bias, we speculate that the speed-accuracy trade-off could fluctuate from trial to trial and that the corresponding decision bound fluctuations (Heitz and Schall 2012) could be implemented by pre-determining decision weights across visual areas. For example, to achieve faster decisions, the sensory evidence requirement could be reduced by placing greater emphasis on initial afferent V1 evidence. In such a case, the RT contingency of the above choice history bias could be explained if the C1 bias is exerted in proportion with the planned emphasis of C1 evidence for the upcoming decision.”

      Recommendations to the Authors:

      Reviewer #2 (Recommendations for the authors):

      (1) As someone whose first language is not English, I am somewhat hesitant to bring this up, but I found the use of 'readout' as both noun and verb somewhat confusing. I thought read-out was defined as 'that which is read out'.

      We agree that this dual use of the word readout may cause confusion. To avoid this, we have edited the manuscript to replace verbal forms of the word “readout” with “read out”.

      (2) I found it difficult to follow the reasoning for why intermediate RTs should be the ones most affected by C1-related information. Perhaps this could be described in more detail for the uninitiated reader.

      We appreciate that our reasoning for why intermediate RTs should be the ones most affected by C1-related information was difficult to follow. We have now added a simulation to showcase this rationale more clearly - see response to reviewer 1, and new figure supplement to figure 1. 

      (3) It would be interesting to compare the effect sizes observed here to those seen in single-cell studies and to discuss this comparison with regard to differences in the nature of EEG signals and single-cell firing rates.

      While we agree that such a comparison would be interesting if feasible, it would have to be for the same task settings, which have not been used in a single-cell study. In addition, the very different nature and extent of noise in the two recording modalities (e.g., background noise in EEG from ongoing processes unrelated to the task) would make such a comparison difficult to interpret.

      (4) Figure 1: It may be worth mentioning in the legend that only parts of the peripheral stimulus grid are shown for better visibility, as the Methods speak of 9 x 9 grids. Also, in panel B, it should be mentioned that waveshapes are calculated using individually selected maximal-difference electrodes.

      We thank the reviewer for spotting these. We have updated the caption for this figure to reflect these two observations.

      (5) Figure 4: The different shades of green may be difficult to distinguish when printed.

      Although this may be true, we chose shades of green that differ in luminance so they should still be distinguishable. Different colours may in fact be less distinguishable if they had the same luminance and the print was black-and-white. We chose different shades of the same colour to reflect the fact that we were plotting the same signals at different difficulty levels. In our opinion, this takes precedence since eLife is an online journal so the majority of readers will likely read it digitally.

      (6) Methods/Task: While the ITI of 780 ms is substantial, I was wondering why the authors decided against jittering this interval? It would be helpful to briefly discuss whether contrast adaptation for slow periodic stimulation may have affected the findings.

      We opted against jittering the ITI to avoid an additional source of inter-trial variability. While a fixed ITI may allow adaptation effects to arise, these would be approximately constant across trials and are therefore less of a concern for our design. We have added text to the methods section to state this rationale.

      (7) Methods/Stimuli: The authors convincingly argue that focusing on single arms of the stimuli is an unlikely strategy, but did they ask for participants' strategies during debriefing?

      We are glad that the reviewer found our argument about whether or not participants may have focused on a single arm of the stimuli convincing. We did not ask participants about their strategies, but even with such a debriefing, there would still remain a possibility that a participant may have used that strategy but was unaware of doing so. In any case, if participants were doing this, it would have dampened the strength of our choice probability result.

      (8) Methods/Procedure, Difficulty Titration: Why did the authors opt for manually adapting the difficulty level in a separate session rather than constantly and automatically titrating difficulty?

      We did this because calculating choice probability requires a comparison of trials with different choice outcomes but the same stimulus so continuously staircasing difficulty level during the experiment would have created a confound. Although this could have been corrected for in our regression, this would have entailed greater noise that we could avoid by staircasing in advance.
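
      The constraint described above (same stimulus, different choice outcomes) can be illustrated with a minimal sketch. It assumes the common ROC-based definition of choice probability, i.e. the probability that the single-trial signal on a randomly drawn "choice A" trial exceeds that on a randomly drawn "choice B" trial. The function name and inputs are hypothetical, and the response does not state the authors' exact computation (which involved a regression).

```python
import numpy as np


def choice_probability(signal_choice_a, signal_choice_b):
    """ROC-area choice probability for trials that share the same stimulus.

    Returns the fraction of (A, B) trial pairs in which the choice-A signal
    is larger than the choice-B signal; ties count as half a pair.
    0.5 means the signal carries no information about the choice.
    """
    a = np.asarray(signal_choice_a, dtype=float)
    b = np.asarray(signal_choice_b, dtype=float)
    if a.size == 0 or b.size == 0:
        raise ValueError("both choice groups need at least one trial")
    # Compare every choice-A trial with every choice-B trial.
    greater = (a[:, None] > b[None, :]).sum()
    ties = (a[:, None] == b[None, :]).sum()
    return float((greater + 0.5 * ties) / (a.size * b.size))
```

      If stimulus strength changed from trial to trial, as it would under continuous staircasing, both the signal and the choice would covary with the stimulus, and a pairwise comparison of this kind would no longer isolate choice-related variability.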

    1. Reviewer #1 (Public review):

      Summary

      The manuscript by Ma et al. provides robust and novel evidence that the noctuid moth Spodoptera frugiperda (Fall Armyworm) possesses a complex compass mechanism for seasonal migration that integrates visual horizon cues with Earth's magnetic field (likely its horizontal component). This is an important and timely study: apart from the Bogong moth, no other nocturnal Lepidoptera has yet been shown to rely on such a dual-compass system. The research therefore expands our understanding of magnetic orientation in insects with both theoretical (evolution and sensory biology) and applied (agricultural pest management, a new model of magnetoreception) significance.

      The study uses state-of-the-art methods and presents convincing behavioural evidence for a multimodal compass. It also establishes the Fall Armyworm as a tractable new insect model for exploring the sensory mechanisms of magnetoreception, given the experimental challenges of working with migratory birds. Overall, the experiments are well designed, the analyses are appropriate, and the conclusions are generally well supported by the data.

      Strengths

      • Novelty and significance: First strong demonstration of a magnetic-visual compass in a globally relevant migratory moth species, extending previous findings from the Bogong moth and opening new research avenues in comparative magnetoreception.

      • Methodological robustness: Use of validated and sophisticated behavioural paradigms and magnetic manipulations consistent with best practices in the field. The use of 5-min bins to study the dynamic nature of the magnetic compass, which is anchored to a visual cue but updated with a latency of several minutes, is an important finding and a new methodological aspect in insect orientation studies.

      • Clarity of experimental logic: The cue-conflict and visual cue manipulations are conceptually sound and capable of addressing clear mechanistic questions.

      • Ecological and applied relevance: Results have implications for understanding migration in an invasive agricultural pest with expanding global range.

      • Potential model system: Provides a new, experimentally accessible species for dissecting the sensory and neural bases of magnetic orientation.

      Weaknesses

      Overall, this is a strong study, and the authors have completed an excellent major revision that has undoubtedly addressed most major and minor issues. The remaining points below are minor recommendations, and I acknowledge that differences in opinion are always possible:

      (1) Structure and Presentation of Results

      • I recommend reordering the visual-cue experiments to progress from simpler conditions (no cues) to more complex ones (cue-conflict). This would improve narrative logic and accessibility for non-specialist readers. The authors have chosen not to implement this suggestion, which I respect, but my recommendation stands.

      (2) Ecological Interpretation

      • The authors should expand their discussion on how the highly simplified, static cue setup translates to natural migratory conditions, where landmarks are dynamic, transient, or absent. Specifically, further consideration is needed on how the compass might function when landmarks shift position, become obscured, or are replaced by celestial cues. Additionally, the discussion would benefit from a more consolidated section with concrete suggestions for future experiments involving transient, multiple, or more naturalistic visual cues.

      This point was addressed partially in one paragraph of the Discussion, which reads as follows:

      "In nature, they are likely to encounter a range of luminance-gradient visual cues, including relatively stable celestial cues as well as transient or shifting local features encountered en route. Although such natural cues differ from our simplified laboratory stimulus, they may represent intermittently sampled visual inputs that can be optimally integrated with magnetic information, with the congruency between visual and magnetic cues likely playing a key role in maintaining a stable compass response. Whether the cues are static or changing, brief periods without them may still allow the subsequent recovery of a stable long-distance orientation strategy. Determining which types of natural visual cues support the magnetic-visual compass, and how they interact with magnetic information, including how their momentary alignment or angular relationship is integrated and how such visual cue-magnetic field interactions may require time to influence orientation, together with elucidating the genetic and ecological bases of multimodal orientation, will be important objectives for future research."

      While this paragraph is informative, the wording remains lengthy, somewhat unclear, and vague. Shorter, clearer statements would improve readability and impact. For example:

      • How could moths maintain direction during periods when only the magnetic field is present and visual landmarks are absent?

      • Could celestial cues (e.g., stars) compensate, and what happens if these are also obscured?

      • What role does saliency play when multiple visual landmarks are present simultaneously?

      • How might a complex skyline without salient landmarks affect orientation?

      Including simple, concise sentences that pose concrete open questions and suggest experimental designs would strengthen the discussion without creating space issues. In my view, a comprehensive discussion of how the simplified, static cue setup relates to natural migratory conditions, where landmarks are dynamic, transient, or absent, would add significant value to the paper.

      (3) Methodological Details and Reproducibility

      • The lack of luminance level measurements should be explicitly highlighted.

      • The authors chose not to adjust figure legends by replacing "magnetic South" with "magnetic North." While I believe this would be more conventional and preferable, this is ultimately a minor stylistic issue.

      (4) Conceptual Framing and Discussion

      • Although the authors made a good attempt to explain the limitations of using an artificial visual cue, I believe there is room for a more explicit argument. For example, it could be stated clearly that this species is unlikely to encounter a situation in nature where a single, highly salient landmark coincides with its migratory direction. Therefore, how these findings translate to real migratory contexts remains an open question. A sentence or two making this point directly would strengthen the discussion.

      (5) Technical and Open-Science Points

      • Sharing the R code openly (e.g., via GitHub) should be seriously considered. The code does not need to be perfectly formatted, but making it available would be highly beneficial from an open-science perspective.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      The manuscript by Ma et al. provides robust and novel evidence that the noctuid moth Spodoptera frugiperda (Fall Armyworm) possesses a complex compass mechanism for seasonal migration that integrates visual horizon cues with Earth's magnetic field (likely its horizontal component). This is an important and timely study: apart from the Bogong moth, no other nocturnal Lepidoptera has yet been shown to rely on such a dual-compass system. The research therefore expands our understanding of magnetic orientation in insects with both theoretical (evolution and sensory biology) and applied (agricultural pest management, a new model of magnetoreception) significance.

      The study uses state-of-the-art methods and presents convincing behavioural evidence for a multimodal compass. It also establishes the Fall Armyworm as a tractable new insect model for exploring the sensory mechanisms of magnetoreception, given the experimental challenges of working with migratory birds. Overall, the experiments are well-designed, the analyses are appropriate, and the conclusions are generally well supported by the data.

      Strengths

      (1) Novelty and significance: First strong demonstration of a magnetic-visual compass in a globally relevant migratory moth species, extending previous findings from the Bogong moth and opening new research avenues in comparative magnetoreception.

      (2) Methodological robustness: Use of validated and sophisticated behavioural paradigms and magnetic manipulations consistent with best practices in the field. The use of 5-minute bins to study the dynamic nature of the magnetic compass, which is anchored to a visual cue but updated with a latency of several minutes, is an important finding and a new methodological aspect in insect orientation studies.

      (3) Clarity of experimental logic: The cue-conflict and visual cue manipulations are conceptually sound and capable of addressing clear mechanistic questions.

      (4) Ecological and applied relevance: Results have implications for understanding migration in an invasive agricultural pest with an expanding global range.

      (5) Potential model system: Provides a new, experimentally accessible species for dissecting the sensory and neural bases of magnetic orientation.

      Weaknesses

      While the study is strong overall, several recommendations should be addressed to improve clarity, contextualisation, and reproducibility:

      We thank Reviewer #1 for the positive and encouraging evaluation of our study. We appreciate the recognition of our work’s strengths and are grateful for the constructive feedback on the remaining weaknesses, which will guide and strengthen our revisions.

      Structure and presentation of results

      Requires reordering the visual-cue experiments to move from simpler (no cues) to more complex (cue-conflict) conditions, improving narrative logic and accessibility for non-specialists.

      Thank you for this thoughtful suggestion. While we appreciate the rationale for presenting results from simpler to more complex conditions, we kept the original sequence because it aligns with the logic of our study. Our initial aim was to determine whether fall armyworms use a magnetic compass integrated with visual cues, as shown in the Bogong moth. After establishing this phenotype, we then examined whether visual cues are required for maintaining magnetic orientation. We have also clarified in the Introduction that magnetic orientation in the Bogong moth relies on integration with visual cues, which provides readers with clearer context and improves the overall narrative flow.

      Ecological interpretation

      (a) The authors should discuss how their highly simplified, static cue setup translates to natural migratory conditions where landmarks are dynamic, transient or absent.

      Thank you for raising this important point. We agree that natural migratory environments provide visual information that is often dynamic, transient, or intermittently absent, in contrast to the simplified and static cue used in our indoor experiments. Our intention in using a minimal, static cue was to isolate and test the fundamental presence of magnetic–visual integration in fall armyworms under fully controlled conditions. To address the reviewer’s concern, we have added a brief note in the Discussion indicating that fall armyworms may encounter both static and dynamic luminance-based visual cues in nature, such as light–dark gradients created by terrain features or more stable celestial patterns. Although these natural cues differ from our simplified laboratory stimulus, they may similarly provide asymmetric visual structure that can be integrated with magnetic information. We also note that determining which natural visual cues support the magnetic–visual compass will be an important direction for future work.

      (b) Further consideration is required regarding how the compass might function when landmarks shift position, are obscured, or are replaced by celestial cues. Also, more consolidated (one section) and concrete suggestions for future experiments are needed, with transient, multiple, or more naturalistic visual cues to address this.

      Thank you for this constructive suggestion. We appreciate the reviewer’s point that additional consideration of how the compass might function under shifting, obscured, or celestial visual cues would strengthen the manuscript. Given the limited evidence currently available for this species, we have incorporated a concise and appropriately cautious discussion addressing these possibilities.

      Methodological details and reproducibility

      (a) It would be better to move critical information (e.g., electromagnetic noise measurements) from the supplementary material into the main Methods.

      Thank you for this helpful suggestion. In the revised manuscript, we have added the key electromagnetic noise measurements information to the main Methods section.

      (b) Specifying luminance levels and spectral composition at the moth's eye is required for all visual treatments.

      Thank you for this helpful comment. We have clarified in the Methods as well as the legend of Fig. S3 that both luminance levels and spectral composition were measured at the position corresponding to the moth’s head.

      (c) Details are needed on the sex ratio/reproductive status of tested moths, and a map of the experimental site and migratory routes (spring vs. fall) should be included.

      Thanks. We have added the reproductive status of the tested moths in the Methods, specifying that all individuals used were unmated 2-day-old adults.

      (d) Expanding on activity-level analyses is required, replacing "fatigue" with "reduced flight activity," and clarifying if such analyses were performed.

      Thank you for this comment. In this context, the term “fatigue” referred to the possibility that moths might gradually lose motivation or attention to orient when flying for an extended period in a simplified, artificial environment with limited sensory cues. Such a decrease in orientation motivation over time could, in theory, lead to a loss of individual orientation and consequently to the observed loss of group orientation. To test this possibility, we analyzed the orientation performance of each individual moth across different phases using the Rayleigh test. The r-value was used as a measure of individual directedness (higher r-values indicate stronger orientation). Our results showed that mean r-values did not differ significantly among the experimental phases (multiple comparisons, Table S2). This indicates that the 25-min measurement itself was not responsible for the loss of orientation. We did not perform a quantitative activity-level analysis in this study. However, as mentioned in the Methods, flight activity was continuously monitored during the experiments by observing fluctuations in the pointer values on the experimental software, which corresponded to the moth’s rotational movements. If the pointer values remained unchanged for more than 10 seconds, the experimenter checked for wing vibrations by sound; if the moth had stopped flying, gentle tapping on the arena wall was used to stimulate renewed flight. Only individuals that maintained active flight throughout the experiment, with fewer than four instances of wingbeat cessation, were included in the analysis. We have also stated in the revised manuscript that an activity-level analysis was not performed because of technical difficulties.
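
      For readers unfamiliar with the r-value, the following is a minimal sketch, assuming that r denotes the standard mean resultant vector length of a set of headings and using the first-order large-sample approximation to the Rayleigh test p-value. The function names and the heading input are hypothetical, and the authors' own R code is not reproduced here.

```python
import numpy as np


def mean_resultant_length(headings_deg):
    """Mean resultant vector length r (0 = no directedness, 1 = all headings identical)."""
    theta = np.deg2rad(np.asarray(headings_deg, dtype=float))
    # Treat each heading as a unit vector and average the components.
    c = np.cos(theta).mean()
    s = np.sin(theta).mean()
    return float(np.hypot(c, s))


def rayleigh_p_approx(headings_deg):
    """First-order approximation p ~ exp(-n * r**2) for the Rayleigh test of uniformity.

    This is only the leading term; small samples call for a corrected formula
    or tabulated critical values.
    """
    n = len(headings_deg)
    r = mean_resultant_length(headings_deg)
    return float(np.exp(-n * r ** 2))
```

      A tightly clustered set of headings gives r close to 1 and a small p-value, whereas headings spread evenly around the circle give r close to 0.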

      Figures and data presentation

      (a) The font sizes on circular plots should be increased; compass labels (magnetic North), sample sizes, and p-values should be included.

      Thank you for this helpful suggestion. Regarding the compass labels and statistical reporting, our analysis provides significance levels as ranges rather than exact p-values; therefore, we clarified in the figure legends that the two dashed circles correspond to the significance thresholds of p = 0.05 and p = 0.01, respectively. Sample sizes are already indicated within each panel. To avoid visual clutter caused by displaying both magnetic North and South, we show only the magnetic South direction (mS) consistently across panels, which improves readability.

      (b) More clarity is required on what "no visual cue" conditions entail, and schematics or photos should be provided.

      Thank you for this comment. In our study, the “no visual cue” condition refers to the absence of the black triangular landmark inside the flight simulator. To improve clarity, we have updated the legend of Fig. 4 to explicitly state this and have referred readers to the schematic in Fig. 1, which illustrates the structure of the flight simulator. These additions clarify what the “no visual cue” condition entails without requiring additional schematics.

      (c) The figure legends should be adjusted for readability and consistency (e.g., replace "magnetic South" with magnetic North, and for box plots better to use asterisks for significance, report confidence intervals).

      Thank you. Regarding the choice of compass labeling, we intentionally used magnetic South (mS) rather than magnetic North (mN) because the main population tested in our experiments represents the autumn migratory generation. During autumn, fall armyworms orient southward when visual and magnetic cues are aligned. Using magnetic South in the plots therefore provides a clearer representation of cue alignment in this season and avoids potential confusion when interpreting the combined visual–magnetic information.

      Conceptual framing and discussion

      (a) Generalisations across species should be toned down, given the small number of systems tested by overlapping author groups.

      Thank you for this valuable comment. In the revised manuscript, we have softened such statements in both the Abstract and the main text.

      (b) It requires highlighting that, unlike some vertebrates, moths require both magnetic and visual cues for orientation.

      Thank you for this helpful suggestion. We have added a sentence to the Discussion explicitly highlighting that, unlike some vertebrates capable of using magnetic information in the absence of visual cues, moths require the integration of both magnetic and visual cues for accurate orientation. This clarification emphasizes the distinct multimodal nature of compass use in migratory moths.

      (c) It should be emphasised that this study addresses direction finding rather than full navigation.

      Thank you for this important clarification. We have now made it explicit in the manuscript that our experiments address direction finding (i.e., orientation) rather than full navigation. This distinction is stated in both the Introduction and Discussion to clearly define the scope of the study.

      (d) Future Directions should be integrated and consolidated into one coherent subsection proposing realistic next steps (e.g., more complex visual environments, temporal adaptation to cue-field relationships).

      Thank you for this constructive suggestion. We agree that outlining realistic next steps is valuable. However, given the limited scope of the current data, we have only slightly expanded the existing forward-looking statements in the Discussion.

      (e) The limitations should be better discussed, due to the artificiality of the visual cue earlier in the Discussion.

      Thank you for this comment. We agree that the artificiality of the visual cue is an important limitation of the present study. Rather than extending speculative discussion, we have clarified this limitation in the revised Discussion and highlighted the key questions that future work must address.

      Technical and open-science points

      Appropriate circular statistics should be used instead of t-tests for angular data shown in the supplementary material.

      Thank you for this comment. We have addressed this point (Fig. S1) in the revised supplementary material.
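
      The reviewer's concern about linear statistics on angular data can be illustrated with a minimal sketch. The function name and the example headings are hypothetical, and the response does not specify which circular tests the authors adopted. For headings of 340° and 30°, the arithmetic mean is 185°, which points almost opposite to both headings, whereas the circular mean is 5°.

```python
import numpy as np


def circular_mean_deg(angles_deg):
    """Mean direction in degrees, in the range [0, 360).

    Each angle is treated as a unit vector; the mean direction is the angle
    of the averaged vector, so values on either side of 0/360 do not cancel
    out the way they do in an arithmetic mean.
    """
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
    mean_angle = np.arctan2(np.sin(theta).mean(), np.cos(theta).mean())
    return float(np.rad2deg(mean_angle) % 360.0)


headings = [340.0, 30.0]
arithmetic_mean = float(np.mean(headings))    # misleading for angles
circular_mean = circular_mean_deg(headings)   # direction of the mean vector
```

      Tests that assume a linear scale, such as the t-test, inherit the same wrap-around problem, which is why circular alternatives are preferred for orientation data.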

      Details should be provided on light intensities, power supplies, and improvements to the apparatus.

      Thank you. Light intensities are reported as spectral irradiance measurements in Supplementary Materials, which provide full wavelength-resolved information for the illumination used, although a separate measurement of total illuminance (lux) was not performed. We have also added the requested information on the power supplies.

      The derivation of individual r-values should be clarified.

      Thanks. We have clarified this in the revised manuscript.

      Share R code openly (e.g., GitHub).

      Thanks. We are in the process of organizing the relevant R code, but have not been able to upload it to GitHub before the current revision deadline. The code is available from the corresponding author upon request.

      Some highly relevant - yet missing - recent and relevant citations should be added, and some less relevant ones removed.

      Thanks. We added one recent relevant reference to the revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      This work provides experimental evidence on how geomagnetic and visual cues are integrated, and shows that visual cues are indispensable for magnetic orientation in the nocturnal fall armyworm.

      Strengths:

      Although it has been demonstrated previously that the Australian Bogong moth could integrate global stellar cues with the geomagnetic field for long-distance navigation, the study presented in this manuscript is still fundamentally important to the field of magnetoreception and sensory biology. It clearly shows that the integration of geomagnetic and visual cues may represent a conserved navigational mechanism broadly employed across migratory insects. I find the research very important, and the results are presented very well.

      We thank Reviewer #2 for the positive and encouraging evaluation of our study. We appreciate the recognition of our work’s strengths.

      Weaknesses:

      The authors developed an indoor experimental system to study the influence of magnetic fields and visual cues on insect orientation, which is certainly a valuable approach for this field. However, the ecological relevance of the visual cue may be limited or unclear based on the current version. The visual cues were provided "by a black isosceles triangle (10 cm high, 10 cm base) made from black wallpaper and fixed to the horizon at the bottom of the arena". It is difficult to conceive how such a stimulus (intended to represent a landmark like a mountain) could provide directional information for LONG-DISTANCE navigation in nocturnal fall armyworms, particularly given that these insects would have no prior memory of this specific landmark. It might be a good idea to make a more detailed explanation of this question.

      We appreciate the constructive feedback on the weaknesses, which will guide and strengthen our revisions. To address the reviewer’s concern, we have added a brief note in the Discussion indicating that fall armyworms may encounter both static and dynamic luminance-based visual cues in nature, such as light–dark gradients created by terrain features or more stable celestial patterns. Although such natural cues differ from our simplified laboratory stimulus, they may represent intermittently sampled visual inputs that can be optimally integrated with magnetic information, whether the cues are static or changing, and brief periods without them may still allow the subsequent recovery of a stable long-distance orientation strategy.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Major to Medium Suggestions

      (a) Reordering of Visual Cue Tests

      The manuscript currently presents cue-conflict experiments before the simpler "no visual cue" tests. For non-specialist readers, it would be more logical to start with the basic condition (no visual cues) and then move to progressively more complex ones. This provides a clearer and more logically sound narrative.

      For example, the results could first demonstrate that without visual cues, the moths fail to orient (both in darkness and uniform light), and then show that introducing a single salient cue (a triangle on the horizon) restores directed behaviour. This would help readers understand the logic of the progression and should be better integrated throughout the Results and Discussion.

      Thanks. We have responded to this comment in the Public Reviews.

      (b) Translating Key Findings to Realistic Scenarios (LL 333-344 or where suitable in Discussion, and mentioning that we utilised a reductionist principle first in Intro, but clearly articulated that it is very simplified)

      The main text (eg Discussion) should address how these findings translate to real-world conditions. The experimental design used a single, highly salient, and static cue, always aligned with the migratory direction. In nature, such a consistent landmark is unlikely-mountains or other features would shift position relative to the moth's trajectory as it flies.

      Key questions arise which need to be addressed:

      - How would the compass system adapt to changing landmark positions as the moth moves?

      - What happens when no landmarks are visible (e.g. over flat plains or cloudy nights)?

      - Would stellar or other cues take over in such cases? Your hypotheses, please.

      Addressing these points - and proposing specific future experiments (e.g. with transient or multiple visual cues)-would strengthen the ecological relevance of the findings and show a clear way forward.

      Thanks for your kind comments. We now explicitly state in the Introduction that our study employs a reductionist approach using a simplified visual environment to isolate magnetic-visual interactions. As the ecological questions raised by the reviewer cannot be addressed with the current dataset, we avoid extended speculation but have added brief clarification in the Discussion and addressed these points in the Public Reviews response. We also indicate that future work will need to examine the types of visual cues that can support magnetic orientation and how such cues couple with geomagnetic information.

      Technical and Methodological Points

      (a) Incomplete Methods Section

      Critical technical information (e.g. electromagnetic noise measurements) currently appears only in supplementary figure legends. All such details should be included in the main Methods section if the word count allows (or include a short section in the main text with reference to more details in the supplementary material).

      Thanks for your kind comments. We have addressed this as suggested in the Public Reviews.

      (b) Lighting Conditions

      Specify luminance levels (the amount of light emitted and passing through in quanta per unit of surface, eg m2) at the moth's eye and indicate whether spectral composition was consistent between treatments (with and without the visual cue).

      Thanks for your comments. We have responded to this point in the Public Reviews.

      (c) Figures

      - Increase font sizes on circular histograms.

      - Add compass labels (ideally magnetic North, mN, not south, etc, as it is usual in pertinent literature), sample sizes, and p-values on each panel.

      - Replace "magnetic South" (mS) indicators with magnetic North (mN) to align with convention.

      Thanks for your comments. We have responded to this point in the Public Reviews.

      (d) Migratory Expectations

      Include expected compass bearings for spring and autumn migrations (with citations) to relevant figures (Figure 2, 4, S2).

      Thanks for your comments. We have added the information that “We recently found that fall armyworms from the year-round range in Southwest China (Yunnan) exhibit seasonally appropriate migratory headings when flown outdoors in virtual flight simulators, heading northward in the spring and southward in the fall, and this seasonal reversal is controlled by photoperiod (Chen et al., 2023).” in Introduction. Thus, we didn’t offer expected seasonal compass bearings in Results section.

      (e) Add a map showing the experimental site and known migratory routes, clearly labelling spring vs fall routes. It would help justify expected headings.

      Thank you for this suggestion. At present, there are no experimentally validated migratory routes (e.g., through mark-release-recapture or tracking approaches) for the specific fall armyworm population used in our study. Because these routes have not been biologically confirmed, we didn’t offer a presumed migratory map that may imply unwarranted certainty.

      (f) Composition of Test Groups

      Indicate sex ratios and reproductive status (mated/unmated) of tested moths, if known or comment if unknown, as both can affect migratory motivation and behaviour.

      Thank you for this suggestion. We have responded to this point in the Public Reviews.

      (g) Role and Nature of Visual Cues

      While the results clearly show that orientation disappears without visual cues, the triangle cue is highly artificial. Well-studied Bogong moths are known to rely on views of Australian mountain ranges during their nocturnal migrations, but there is no evidence that armyworms use a similar strategy. Even for bogongs, it is not just one salient mountain always in front of them on migration. Discuss whether Fall Armyworm would encounter comparable natural cues in the field along their migratory route, or whether the triangle might simply provide a frame of reference rather than a true landmark.

      Thank you for this comment. We have responded to this point in the Public Reviews.

      (h) Future work could test:

      - More naturalistic sky cues (moonlight, star fields).

      - Varying the landmark's position relative to the magnetic field - slowly moving along - transient landmarks. Also, less salient landmarks and a more complex skyline, as it is usually more complex than just a single salient peak.

      Thank you for this comment. We have responded to this point in the Public Reviews. A brief discussion, as suggested, has been added to the revised manuscript.

      Minor Comments and Line-by-Line Suggestions

      L70 - Check citation (possibly Mouritsen 2018). Missing in the list of references.

      Thanks. This point has been addressed.

      L75 - Consider citing the new and highly relevant preprint:

      Pakhomov, A., Shapoval, A., Shapoval, N., & Kishkinev, D. (2025). Not All Butterflies Are Monarchs: Compass Systems in the Red Admiral (Vanessa atalanta). bioRxiv.

      Thanks. We have cited this reference.

      LL81-82 - Clarify vague phrasing; specify criteria for "good" vs "poor" orientation ability. Or reword/leave out.

      Thanks for your comments.

      L85 - "but one," not "bar one." 

      Thanks. Corrected.

      L124 - The 2 genetic citations are weakly linked to magnetoreception. We do not have a clear understanding of the insect magnetoreceptor and its underlying mechanism, so we simply cannot interpret genetic associations very well to underpin them to magnetoreception. For example, does noctuid's magnetic sense require a magnetised-based receptor and genes involved in biomineralization? Consider removing or softening claims. 

      Thanks. Addressed.

      LL123-126 - Define what for YOU constitutes "strong evidence" for magnetoreception (e.g. adaptive directional behaviour consistent with migratory orientation?). Is there such a thing as strong evidence at all?

      Thanks for your comments. We agree that terms such as “confirmed” or “strong evidence” can overstate the certainty of magnetoreception findings, given the ongoing debates in the field. In the revised manuscript, we have toned down this wording.

      L153 - Indicate whether coils in NMF condition were powered or inactive.

      Thanks for your comments. Addressed.

      L163 - Justify use of multiple 5-min phases (e.g. temporal resolution of behaviour). It is confusing at the start, where first mentioned, and becomes clearer only towards the end, but it should be clearer at the start.

      Thanks for your comments. The assay was divided into these 5-min segments to provide the temporal resolution needed to detect changes in flight orientation as the relative alignment of magnetic and visual cues was systematically altered. We now clarify this earlier in the Results.

      LL167-171 - This is a good place where you can provide a map (main or supplementary with referencing) showing the study site and migration routes.

      Thanks for your suggestion. We have responded to this point in the Public Reviews.

      L174 - Avoid repetition of "expected."

      Thanks. Addressed.

      LL176-177 - Report 95% confidence intervals or equivalent and clarify which test (e.g. Moore's paired test) each p-value refers to.

      Thanks for your suggestion.

      LL189-191 - explain what fatigue means. I would remove fatigue and substitute it with "lowered flight activity". Also, the same statement comes later, so avoid repetitiveness and remove it in one place. The analysis of directedness is good throughout, but what about the analysis of activity level? Could you explain whether you did it or not, and if not, why, or if angular changes can serve as an activity proxy? Replace "fatigue" with "reduced flight activity." Avoid repetition. Clarify if activity level analysis was performed or if it was not, e.g. due to technical difficulties.

      Thanks for your comments. We have responded to this point in the Public Reviews.

      L196 - Note whether 95% CI overlaps with the expected direction. This is a crucial outcome.

      Thanks for your comments.

      LL203-205 - unclear, better to stick to "congruency", especially "initial congruency for the relationship between mN and visual cue" throughout.

      Thanks for your suggestions.

      L206 - Better to introduce a new subheading: "Laboratory-Reared Animals.".

      Thanks for your suggestion. A new subheading has been added in the revised manuscript.

      LL207-208 - Clarify which cues were available in Chen et al. (2023) and how they differ here.

      Thanks for your comments. In Chen et al. (2023), the moths oriented under an artificial starry sky together with optic flow cues. In contrast, our experiments intentionally removed both the starry-sky pattern and optic flow to avoid introducing additional visual information when testing magnetic-visual integration for orientation. We have added further clarification regarding the conditions used in Chen et al. (2023) in the revised manuscript.

      L228 - Use "lab-reared" consistently throughout the entire MS. Do not mix with lab-raised.

      Thanks. Addressed by consistently using “lab-raised”.

      Figure 2 - Confusing in parts, especially for people coming from birds and other vertebrates orientation background. At 12 o'clock, you usually expect either mN / gN (magnetic or geographic North) or the animal's own initial directional response used as control to compare the same animal's direction post-treatment. Here, your 6 o'clock is magnetic South in the first place - non-conventional. At 12 o'clock, better use mN or gN. Avoid using non-conventional references such as magnetic south. Remind readers of seasonally appropriate headings and refer to the map.

      Thanks. We have responded to this point in the Public Reviews.

      LL232-234 - Emphasize that cue-magnetic congruency is key. Highlight the most important point that the congruency between the seasonal migratory direction and visual cues is key, not that in spring/fall, visual cues must be towards or opposite to the migratory goal. But the visual cue could be in the migratory direction or opposite, or at an angle - this is for future direction.

      Thanks. We have responded to this point in the Public Reviews.

      Figure 2 and associated main text - highlight that you only tested the designs when in all seasons the salient and single visual cue was in the migratory direction (in spring it coincided with mN but in fall it was towards the magnetic south). Other directions of visual cues have not been tested, but for simplicity and consistency, you chose to do these ones as the first step, perhaps.

      Thank you for this insightful comment. Yes, our experiments tested only the conditions in which the salient and single visual cue was aligned with the migratory direction. Other angular relationships between visual cues and the magnetic field were not examined in this study. For simplicity and consistency, we focused on this alignment as a first step toward understanding magnetic-visual cue integration in migratory orientation. We now highlight this in the Fig. 2 legend.

      Figure captions/legends - hard to tell apart from the main text now; better to italicize figure caption text and visually space it from the main text.

      Thanks for your suggestions.

      LL 250-251 - mention to people more familiar with r - lowercase - what is the expected range for R uppercase. It is not bound 0-1 as r. Could it be negative? How large can it be?

      Thanks for the comment. After revisiting Moore (1980), we think that R* cannot take negative values. However, since R* = R/N^(3/2), it is not bounded between 0 and 1. We didn’t find any concept of an upper bound in the paper (https://doi.org/10.2307/2335330).
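
The behaviour of R* described in this response can be illustrated with a small sketch. It assumes the usual form of Moore's modified Rayleigh statistic, in which each individual's mean vector is ranked by its length, the unit vectors are weighted by those ranks, and the resultant length R is divided by N^(3/2). The function name `moore_r_star` and the input layout are hypothetical, ties in vector length are not handled, and this is not the authors' R code.

```python
import math


def moore_r_star(vectors):
    """Sketch of Moore's R* for a list of (angle_deg, length) pairs.

    Assumption: ranks 1..N are assigned by increasing vector length and
    ties are ignored. Hypothetical helper, not the authors' implementation.
    """
    n = len(vectors)
    if n == 0:
        raise ValueError("need at least one vector")
    # Rank individuals by the length of their mean vector (shortest = rank 1).
    ranked = sorted(vectors, key=lambda v: v[1])
    # Rank-weighted sums of the cosine and sine components.
    x = sum(rank * math.cos(math.radians(angle))
            for rank, (angle, _) in enumerate(ranked, start=1))
    y = sum(rank * math.sin(math.radians(angle))
            for rank, (angle, _) in enumerate(ranked, start=1))
    r = math.hypot(x, y)      # resultant length R, never negative
    return r / n ** 1.5       # R* = R / N^(3/2)


if __name__ == "__main__":
    same_heading = [(90.0, 0.2), (90.0, 0.4), (90.0, 0.6), (90.0, 0.8)]
    print(moore_r_star(same_heading))
```

Under these assumptions, four individuals with identical headings give R = 1 + 2 + 3 + 4 = 10 and R* = 10 / 8 = 1.25, which shows concretely that R* is non-negative but can exceed 1, unlike the mean vector length r.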

      Figure 3 - Consider adding a horizontal line indicating the 5% significance threshold.

      Thanks for your suggestions.

      L 261 - need to have some narrative after the subheading before you insert Figure 3.

      Thanks. Addressed.

      LL274-275 - highlight that the timeline of this congruency between mN and a landmark and the effect of this on directedness is not explored here, but worth doing in future. How long does a new congruency or a relationship between mN and a visual cue need to be exposed to the animal to regain its directional response? Clearly, it is just a question of time of exposure so that a new association is established. Suggest future work on time-dependent adaptation to new cue-field relationships.

      Thanks for your suggestion. We have now included this point as a future direction in the revised Discussion.

      Figure 4 & S4 - Replace letters with asterisks/brackets for significance. The use of the letter is confusing and unconventional.

      Thanks for your suggestion.

      Figure 4 caption - Clarify the main takeaway.

      Thanks for your suggestion.

      Figure 4 - bare minimum is confusing. I understand that you wanted to avoid "no visual cues" because, as long as the animal sees things, there are things to be used as visual cues, even if this is not the intention of the experimenter. However, it needs clarification and rewording. Better to be more specific, like "no black triangle and horizon were used, just the uniformly white cylinder", or something like that.

      Thanks for your comments. In our setup it accurately describes the intentional removal of both the black triangle and the horizon, leaving only the uniformly white cylinder as the visual environment. This wording was chosen to reflect the practical limitations of producing a perfectly symmetrical flight simulator under laboratory conditions, and we therefore prefer to retain the original phrasing.

      L328 - Remove Xu et al. (2021) citation (not relevant). This is an in vitro study with a protein which may not work exactly as it is claimed in the paper in vivo.

      Thanks. Citation removed.

      L349-350 - Clarify what "no visual cue" means (e.g., uniformly white cylinder, no horizon line). Include a photo or a schematic of the inner surface of the cylinder for this condition in the Supplementary Materials.

      Thanks. We have responded to this point in the Public Reviews.

      L380 & throughout - Replace "barely minimum visual cues" (BMVC) with "no visual cues", clarifying limitations in Methods, meaning that you can explain that absolutely no visual cues is practically impossible because, as long as there is light, animals can use some asymmetries as cues even if this is not the intention of the experimenter.

      Thank you for this comment. We have decided to retain the term “barely minimum visual cues (BMVC)” because it accurately describes our experimental condition, which is distinct from a true “no visual cues” environment. In the revised Figure legend, we now clarify that BMVC refers to conditions in which obvious visual cues (i.e., features such as the black triangle in Fig. 1) were removed, while acknowledging that complete elimination of all visual information is not possible under illuminated conditions.

      L396 - Be cautious when generalizing from two species tested by a research group that is not absolutely independent (some authors in bogong and armyworm works overlap). We saw examples in diurnal migratory butterflies (Monarchs), a more studied species than the armyworm, that the findings do not entirely translate to Red Admirals (Pakhomov et al. 2025 preprint mentioned). Suggestion to tone down any claims of broad generalisation throughout the manuscript.

      Thank you for this comment. We have responded to this point in the Public Reviews.

      LL402-407 - Note that, unlike birds (e.g. European robins), moths appear to require both magnetic and visual cues for orientation, whereas birds, mole rats and some other animals can use magnetic cues alone.

      Thank you for this comment. We have responded to this point in the Public Reviews.

      L410 - Specify that this is correct only in the Northern Hemisphere.

      Thank you for this comment. Addressed.

      LL415-416 - Acknowledge artificiality of single-cue setup (see the major comments above); integrate earlier in the Discussion.

      Thank you for this comment. We have responded to this point in the Public Reviews.

      LL420-425 - Consolidate Future Directions into a single subsection; include more concrete experimental ideas, for example, using more naturalistic, numerous transient landmarks (could be done in a virtual maze with LEDs on the wall of the cylinder with cues moving with time). Multiple visual cues. Manipulating with salience of cues - less simplistic, less salient.

      Thank you for this comment. We have responded to this point in the Public Reviews.

      L431 - Does this paper support this statement? I think it just tested the use of stellar cues in a zero magnetic field. It also dealt with direction finding, not navigation, which is a position-finding ability - a much more complex feat and might not be the ability of moths (requires further studies like with geographic and magnetic displacements, etc). Reword and check this. Show the distinction between direction finding and navigation.

      Thank you for this comment. We have reworded the relevant sentence to use “orientation” instead of “navigation”.

      L436-437 - Specify "global visual cues" (stellar, lunar, etc.) and merge all future directions into one coherent section.

      Thank you for this comment. Addressed.

      LL443-446 - A bit early to plan such studies because migratory direction could well be a complex multigenetic trait, so that you cannot approach it simply with the knock out of a single gene. The genetic basis of magnetic direction needs to be first demonstrated, which leads you to the Future Directions section.

      Thank you for this helpful comment. We fully agree that migratory direction is likely a complex multigenic trait, and our intention was not to imply that knocking out a single gene would be sufficient to explain magnetic or migratory orientation. Our statement aimed only to highlight that identifying candidate genes is an important first step toward understanding the genetic basis of magnetic orientation.

      Line 496 - Clarify whether optic flow was used (unlike previous studies).

      Thank you for pointing this out. Clarified.

      LL499-511 - Clarify the improvements done in Chen's system and their relevance.

      Thank you for pointing this out. We reworded this sentence “The Flash flight simulator system was developed based on the early design of the Mouritsen-Frost flight simulator and adapted for our experiments in Yuanjiang”.

      Line 531 - Report and compare light intensities between indoor and outdoor experiments.

      Thanks for this comment. Unfortunately, due to the sensitivity limits of our current equipment, we were unable to reliably measure outdoor light intensities at night. However, we did not perform any open-top outdoor flight-simulator experiments; instead, we used field-captured moths but conducted all behavioral tests indoors.

      L549 - Add make/model of power supplies.

      Thanks. Addressed.

      LL582-585 - Specify whether R code will be shared; recommend open access (e.g., GitHub, other open repositories). Reiterate the importance of open science and sharing all scripts. Also here, add citations to some studies where MMRT has been used recently.

      Thank you for this comment. We have responded to this point in the Public Reviews.

      Line 592 - Explain how individual r-values were derived from optical encoder data.

      Thank you for this comment. Addressed.
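
For readers less familiar with circular statistics, a minimal sketch of how a mean vector length r is commonly computed from a series of headings is given here. It assumes the encoder output has already been converted to headings in degrees; the function name `mean_vector` is hypothetical, and the revised manuscript remains the reference for the authors' actual procedure.

```python
import math


def mean_vector(headings_deg):
    """Return (mean direction in degrees, mean vector length r).

    Assumption: headings_deg is a non-empty sequence of headings in degrees,
    e.g. one sample per encoder reading. Hypothetical helper for illustration.
    """
    n = len(headings_deg)
    if n == 0:
        raise ValueError("need at least one heading")
    # Average the unit vectors pointing along each heading.
    c = sum(math.cos(math.radians(h)) for h in headings_deg) / n
    s = sum(math.sin(math.radians(h)) for h in headings_deg) / n
    r = math.hypot(c, s)                           # bounded between 0 and 1
    direction = math.degrees(math.atan2(s, c)) % 360.0
    return direction, r


if __name__ == "__main__":
    direction, r = mean_vector([80.0, 100.0])
    print(direction, r)
```

An r close to 1 indicates tightly clustered headings, while an r close to 0 indicates headings spread around the circle, which is why r is bounded between 0 and 1.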

      L842-843 - t-tests are inappropriate for angular data; use circular tests (Watson-Williams, Mardia-Watson-Wheeler, etc.).

      Thank you for this comment. Addressed.

      L865 - Reword to avoid repetition of "fall." Example: "In field captured armyworms during fall migration".

      Thank you for this comment. Addressed.

      LL882-885 - Improve phrasing and language here. Confirming that - no colon after. "Both the acrylic plate and diffusion paper." Confirm relevance of spectra to moth visual sensitivity - add relevant citation to original studies showing that.

      Thank you for this comment. Addressed.

      L886 - Reword "uniform" - does not look uniform to me.

      Thank you for this comment. Addressed.

      Reviewer #2 (Recommendations for the authors):

      The first two sentences of the abstract ("The navigational mechanisms employed by nocturnal insect migrants remain to be elucidated in most species. Nocturnal insect migrants are often considered to use the Earth's geomagnetic field for navigation, yet the underlying mechanisms of magnetoreception in insects remain elusive") are somewhat redundant. The authors may consider rewriting them.

      Thank you for pointing this out. We have rewritten this opening to provide a more concise and non-repetitive introduction.

    1. 59% of managers saying that they had to invest additional time to correct or redo work created by AI. Similarly, 53% said their direct reports had to take on extra work, while 45% said they had to bring in co-workers to help fix the mistake.

      Extra time and money are spent repairing errors that were made by AI and not caught by the human in the loop. 59% of managers (roughly three in five) had to correct or redo work created by AI without a human auditing it. 53% said extra work was needed to repair the AI mistakes, and 45% also needed to bring in a (perhaps more senior) co-worker to help fix the mistake. I can imagine workers having to deal with a mistake that hits production code, plus the thousands (or more) of follow-on errors that would later need to be repaired and rolled back. Very costly.

    2. The term “workslop” was coined in a Harvard Business Review article last year

      'workslop' is creeping into businesses. This section mentions AI-generated content, slide decks, and even lengthy reports or 'random code' being passed off as polished work by employees!

    1. So there's a locked cupboard and you'll be informed of the code you need to open that.

      8 cupboard: the answer is in this sentence, but I could not make out the specific answer when listening.

    1. Briefing: Education in Emotional, Relational and Sexual Life (EVARS) in schools

      This document offers a comprehensive summary of the interview with Aurélie Gourmelon, educational adviser and author of the book Aborder l'éducation sexuelle à l'école.

      It analyzes the foundations, objectives and practical implementation of EVARS programs, while dismantling the misconceptions that surround the subject.

      --------------------------------------------------------------------------------

      Summary of the issue

      Education in Emotional, Relational and Sexual Life (EVARS) is a legal obligation in France, often misunderstood by the public and dreaded by some teachers.

      Far from being limited to a technical approach to sexuality, it aims to equip pupils to build healthy relationships, prevent violence and develop psychosocial skills.

      The challenge is to move from an education based on "gut feeling" to an institutional approach grounded in scientific knowledge and respect for children's rights.

      --------------------------------------------------------------------------------

      I. Definition and regulatory framework

      Semantic distinction and scope

      It is crucial to distinguish "sex education" (often perceived in a reductive or technical way) from Education in Emotional, Relational and Sexual Life (EVARS).

      Content: EVARS covers questions of intimacy, modesty, consent, equality between girls and boys, and family structure.

      Approach: It does not rest on the teacher's personal opinion, but on passing on scientific, validated information.

      Legal obligations

      Frequency: The law requires at least three sessions per year, from the first year of nursery school through to the final year of secondary school.

      History: Made mandatory in the French Education Code in 2001, it had already existed as a recommendation since the 1960s and 1970s, evolving with social developments (abortion, AIDS).

      Opposition: Parents cannot legally object to EVARS sessions, because they are an integral part of the school curriculum and of international recommendations for the protection of children's rights.

      --------------------------------------------------------------------------------

      II. Dismantling myths and stating the facts

      The interview identifies several common misconceptions that fuel media debates and parents' fears:

      | Myth | Scientific and educational reality |
      | --- | --- |
      | Trauma | EVARS does not traumatize children; it adapts its content to the questions pupils ask, according to their age. |
      | Pornography | EVARS does not expose children to pornography. On the contrary, it is an educational response to children's early and unwanted exposure to such content on the Internet. |
      | Early incitement | Talking about sexuality does not encourage early sexual activity, just as talking about suicide does not push people to act. It frees people to speak and helps those who need it. |
      | Gender theory | Gender is presented as a scientific continuum (including intersex people, i.e. 17 per 1,000 births) rather than as a strict cis-heteronormative mold. |
      | Family exclusivity | While the family has its place, school guarantees equal access to scientific knowledge and protects children from backgrounds where these subjects are taboo or where abuse occurs. |

      --------------------------------------------------------------------------------

      III. Major issues of protection and society

      Prevention of sexual violence

      One of the most compelling arguments in favor of EVARS is child safety:

      Critical statistic: An estimated 2 to 3 children per class are victims of sexual or sexist violence.

      Nature of the risk: In 95% of cases, the abuser is a family member or someone close to the child.

      Objective: By teaching children to tell what is "normal" from what is not, school opens a space for speaking out that makes it possible to uncover situations of abuse (incest, coercive control) and to save children.

      Development of psychosocial skills (CPS)

      EVARS draws on psychosocial skills to shape free and autonomous adults:

      Consent: Taught from nursery school onward (e.g. not taking another child's toy without agreement, respecting a "no").

      Intimacy and modesty: The school itself needs to be consistent (e.g. respecting privacy in school toilets).

      Equality between girls and boys: Dismantling stereotypes that take root from the earliest age (toys, advertising, attitudes in mathematics or French).

      --------------------------------------------------------------------------------

      IV. Classroom implementation and the teacher's stance

      The questioning stance

      The key to a successful EVARS session lies in the teacher's stance. Rather than giving a direct and potentially ill-suited answer, the teacher should:

      1. Identify the "question behind the question".

      2. Answer with another question in order to put the child's need in context.

      3. Provide only the information needed at this stage of the pupil's development.

      Integration into everyday school life

      Sexuality education is not disconnected from the rest of learning; it is cross-curricular:

      Children's literature: Using picture books to talk about emotions, family structures (same-sex parents, single parents, etc.) and ways of life.

      Classroom life: Conflict management, philosophical discussions about authority and domination.

      Mathematics and French: Analysis of gender bias in exercise wording or in classroom interactions.

      Teacher training

      The urgent need is to reassure professionals:

      Self-training: Possible through reading and through resources such as the podcast « C'est quoi l'amour maîtresse ? » (sessions recorded in a CE1 class).

      Institutional training: Magistère training pathways matter for building knowledge, and in-person training matters for working on stance and on exchange between peers.

      --------------------------------------------------------------------------------

      Conclusion

      EVARS is a societal project aimed at building the ability to "live together".

      By replacing beliefs with scientific knowledge and developing critical thinking, school fulfills its republican mission.

      Supporting the child's psychosexual development is presented as a necessity for preventing relational and traumatic harm in adolescence and adulthood.

    1. Reviewer #2 (Public review):

      Summary

      This manuscript focuses on the role of social responsibility and guilt in social decision making by integrating neuroimaging and computational modeling methods. Across two studies, participants completed a lottery task in which they made decisions for themselves or for a social partner. By measuring momentary happiness throughout the task, the authors show that being responsible for a partner's bad lottery outcome leads to decreased happiness compared to trials in which the participant was not responsible for their partner's bad outcome. At the neural level, this guilt effect was reflected in increased neural activity in the anterior insula, and altered functional connectivity between the insula and the inferior frontal gyrus. Using computational modeling, the authors show that trial by trial fluctuations in happiness were successfully captured by a model including participant and partner rewards and prediction errors (a 'responsibility' model), and model-based neuroimaging analyses suggested that prediction errors for the partner were tracked by the superior temporal sulcus. Taken together, these findings suggest that responsibility and interpersonal guilt influence social decision making.

      Strengths

      This manuscript investigates the concept of guilt in social decision making through both statistical and computational modeling. It integrates behavioral and neural data, providing a more comprehensive understanding of the psychological mechanisms. For the behavioral results, data from two different studies is included, and although minor differences are found between the two studies, the main findings remain consistent. The authors share all their code and materials, leading to transparency and reproducibility of their methods.

      The manuscript is well-grounded in prior work. The task design is inspired by a large body of previous work on social decision making, and includes the necessary conditions to support their claims (i.e., Solo, Social, and Partner conditions). The computational models used in this study are inspired by previous work, and build on well-established economic theories of decision making. The research question and hypotheses clearly extend previous findings, and the more traditional univariate results align with prior work.

      The authors conducted extensive analyses, as supported by the inclusion of different linear models and computational models described in the supplemental materials. Psychological concepts like risk preferences are defined and tested in different ways, and different types of analyses (e.g., univariate and multivariate neuroimaging analyses) are used to try to answer the research questions. The inclusion and comparison of different computational models provides compelling support for the claim that partner prediction errors indeed influence task behavior, as illustrated by the multiple model comparison metrics and the good model recovery.

      The authors did a good job acknowledging other factors that could differ between the conditions, including the role of other emotions (like empathy) or agency in the decision making process. These additional analyses and nuances strengthen the manuscript and the interpretability of the findings.

      Weaknesses

      As the authors already note, they did not directly ask participants to report their feelings of guilt. The authors clearly describe this limitation, and also note that in addition to guilt, other emotions like empathy could also be at play in interpersonal decisions. Despite this limitation, this study provides insights into the neural and behavioral mechanisms of responsibility and guilt in social decision making, and how they influence behavior.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      The authors aimed to characterize neurocomputational signals underlying interpersonal guilt and responsibility. Across two studies, one behavioral and one fMRI, participants made risky economic decisions for themselves or for themselves and a partner; they also experienced a condition in which the partners made decisions for themselves and the participant. The authors also assessed momentary happiness intermittently between choices in the task. Briefly, results demonstrated that participants' self-reported happiness decreased after disadvantageous outcomes for themselves and when both they and their partner were affected; this effect was exacerbated when participants were responsible for their partner's low outcome, rather than the opposite, reflecting experienced guilt. Consistent with previous work, BOLD signals in the insula correlated with experienced guilt, and insula-right IFG connectivity was enhanced when participants made risky choices for themselves and safe choices for themselves and a partner.

      Strengths:

      This study implements an interesting approach to investigating guilt and responsibility; the paradigm in particular is well-suited to approach this question, offering participants the chance to make risky v. safe choices that affect both themselves and others. I appreciate the assessment of happiness as a metric for assessing guilt across the different task/outcome conditions, as well as the implementation of both computational models and fMRI.

      We thank Reviewer 1 for their positive assessment of our manuscript.

      Weaknesses:

      In spite of the overall strengths of the study, I think there are a few areas in which the paper fell a bit short and could be improved.

      We thank Reviewer 1 for their comments, which we have used to improve our manuscript. We hope that these changes address the issues raised by the Reviewer.

      (1) While the framing and goal of this study was to investigate guilt and felt responsibility, the task implemented - a risky choice task with social conditions - has been conducted in similar ways in past research that were not addressed here. The novelty of this study would appear to be the additional happiness assessments, but it would be helpful to consider the changes noted in risk-taking behavior in the context of additional studies that have investigated changes in risky economic choice in social contexts (e.g., Arioli et al., 2023 Cerebral Cortex; Fareri et al., 2022 Scientific Reports).

      We certainly agree that several previously published studies have relied on risky choice tasks with social conditions. In this revised version, we now mention these two studies in the substantially revised Introduction.

      (2) The authors note they assessed changes in risk preferences between social and solo conditions in two ways - by calculating a 'risk premium' and then by estimating rho from an expected utility model. I am curious why the authors took both approaches (this did not seem clearly justified, though I apologize if I missed it). Relatedly, in the expected utility approach, the authors report that since 'the number of these types of trials varied across participants', they 'only obtained reliable estimates for [gain and loss] trials in some participants' - in study 1, 22 participants had unreliable estimates and in study 2, 28 participants had unreliable estimates. Because of this, and because the task itself only had 20 gains, 20 losses, and 20 mixed gambles per condition, I wonder if the authors can comment on how interpretable these findings are in the Discussion. Other work investigating loss aversion has implemented larger numbers of trials to mitigate the potential for unreliable estimates (e.g., Sokol-Hessner et al., 2009).

      We agree that we have not clearly justified why we have taken two approaches to assess risk preferences. In short, while the expected utility approach is a more comprehensive method to model a participant’s choices, we had not sufficiently considered the need for the large number of trials required to fit such models when designing our experiment. Calculating the risk premium was the less comprehensive, simpler alternative that we could calculate for all participants. We have now mentioned this fact in the Results section. As the only difference in risk aversion across conditions was found in Study 1 using the expected utility method, which could only be successfully applied in a minority of participants, we believe that this difference should not be taken as a strong finding. We have now mentioned this fact in the revised Discussion.
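To make the contrast between the two approaches concrete, here is a minimal sketch using textbook definitions: a power utility u(x) = x^rho for gains, and a risk premium defined as the gamble's expected value minus its certainty equivalent. These definitions, the function names, and the example gamble are assumptions for illustration; the manuscript's exact operationalization of the risk premium is not given in this excerpt and may differ.

```python
def expected_value(outcomes, probs):
    """Probability-weighted mean payoff of a gamble."""
    return sum(p * x for x, p in zip(outcomes, probs))


def certainty_equivalent(outcomes, probs, rho):
    """Sure amount with the same utility as the gamble under u(x) = x**rho.

    Assumes non-negative outcomes (gain trials) and rho > 0.
    """
    expected_utility = sum(p * x ** rho for x, p in zip(outcomes, probs))
    return expected_utility ** (1.0 / rho)


def risk_premium(outcomes, probs, rho):
    """Expected value minus certainty equivalent; positive means risk-averse."""
    return expected_value(outcomes, probs) - certainty_equivalent(outcomes, probs, rho)


# Hypothetical 50/50 gamble between 0 and 100 points.
gamble = ([0.0, 100.0], [0.5, 0.5])
premium_averse = risk_premium(*gamble, rho=0.5)   # concave utility
premium_neutral = risk_premium(*gamble, rho=1.0)  # linear utility
```

In this sketch rho < 1 yields a positive premium and rho = 1 yields zero. Estimating rho from choices requires fitting a model to many trials, whereas a premium-style summary can be computed from far fewer, which is the trade-off the authors describe.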

      (3) One thing seemingly not addressed in the Discussion is the fact that the behavioral effect did not replicate significantly in study 2.

      We agree that we had not sufficiently discussed the fact that there were (slight but significant) differences in risk preferences between the Solo and Social conditions in Study 1 but not in Study 2. We now do so in the revised Discussion, and write the following:

      “Participants made slightly more risk-seeking choices when deciding for themselves than for both themselves and the partner in Study 1, but this difference disappeared in Study 2. The ρ parameter on which this finding in Study 1 is based could only be estimated in a minority of participants due to a relatively low number of trials, which suggests that this finding may not be very reliable. The simpler and more robust method (evaluation of a risk premium) showed no difference in risk aversion across conditions in either study. Overall, we believe that we do not have strong evidence of differences in risk preferences across conditions.”

      (4) Regarding the computational models, the authors suggest that the Responsibility and Responsibility Redux models provided the best fit, but they are claiming this based on separate metrics (e.g., in study 1, the redux model had the lowest AIC, but the responsibility only model had the highest R^2; additionally, the basic model had the lowest BIC). I am wondering if the authors considered conducting a direct model comparison to statistically compare model fits.
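The reviewer's point that different metrics can favor different models follows from how the criteria penalize complexity. The sketch below uses the standard formulas AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L; the model names, log-likelihoods, parameter counts, and sample size are hypothetical and are not taken from the manuscript.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2 * loglik


# Hypothetical fits: (log-likelihood, number of free parameters).
models = {"basic": (-510.0, 4), "richer": (-505.0, 6)}
n_obs = 300  # hypothetical number of happiness ratings

best_by_aic = min(models, key=lambda name: aic(*models[name]))
best_by_bic = min(models, key=lambda name: bic(*models[name], n_obs))
```

With these made-up numbers AIC prefers the richer model while BIC prefers the basic one, because BIC's per-parameter penalty ln(n) exceeds AIC's penalty of 2 once n is larger than about 7.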

      We agree that we should run formal, direct model comparison tests. We have now run likelihood-ratio tests, which showed that the Responsibility model fitted best. We now report this in the Results section, just below Table 1:

      “A likelihood ratio test (Equation 9) revealed that the Responsibility model fitted better than all the other models, including the Responsibility Redux model (Study 1: all LR ≥ 47.36, p < 0.0001; Study 2: all LR ≥ 77.83, p < 0.0001).”
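For readers unfamiliar with the test, a minimal sketch of a likelihood-ratio comparison between two nested models follows. It uses the standard statistic LR = 2 (ln L_full - ln L_nested), referred to a chi-square distribution with degrees of freedom equal to the number of extra parameters. Whether this matches the manuscript's Equation 9 is an assumption, since that equation is not reproduced here, and the log-likelihood values are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Start from the closed forms for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # ... then apply Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_nested, loglik_full, extra_params):
    """Return the LR statistic and its chi-square p value."""
    lr = 2.0 * (loglik_full - loglik_nested)
    return lr, chi2_sf(lr, extra_params)


# Hypothetical log-likelihoods; the fuller model has two extra parameters.
lr_stat, p_value = likelihood_ratio_test(-100.0, -95.0, extra_params=2)
```

The chi-square reference distribution is only valid when the simpler model is a special case of the fuller one, so each pairwise comparison needs that nesting relationship to hold.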

      (5) In the reporting of imaging results, the authors report in a univariate analysis that a small cluster in the left anterior insula showed a stronger response to low outcomes for the partner as a result of participant choice rather than from partner choice. It then seems as though the authors performed small volume correction on this cluster to see whether it survived. If that is accurate, then I would suggest that this result be removed because it is not recommended to perform SVC where the volume is defined based on a result from the same whole-brain analysis (i.e., it should be done a priori).

      As indicated in the manuscript, the small insula cluster centered at [-28 24 -4] and shown in Figure 4F survived corrections for multiple tests within the anatomically-defined anterior insula (based on the anatomical maximum probability map described in Faillenot et al., 2017), which is independent of the result of our analysis. Functionally defining the small volume based on the same data would indeed be circular and misleading “double-dipping”. We have most certainly NOT done this. The reason why we selected the anterior insula is because it is one of the regions most frequently associated with guilt (see the explanations in our Introduction, which refers for example to Bastin et al., 2016; Lamm & Singer, 2010; Piretti et al., 2023). Thus we feel that performing small-volume correction within the anatomically-defined anterior insula is a valid analysis. We fully acknowledge that, independently of any correction, the effect and the cluster are small. We now write:

      “We found a weak response in a small cluster within the left anterior insula (peak T = 3.95, d = 0.59, 22 voxels, peak intensity at [-28 24 -4]; Figure 4F). Given the documented association between anterior insula and guilt (see Introduction), we proceeded to test whether this result survived correction for family-wise errors due to multiple comparisons restricted to the left anterior insula gray matter [defined anatomically and thus independently from our findings, as the anterior short gyrus, middle short gyrus, and anterior inferior cortex in an anatomical maximum probability map (Faillenot et al., 2017)]. This correction resulted in a p value of 0.024. This result, although it is only a small effect in a small cluster, is consistent with the mixed model analysis reported earlier.”

      Reviewer #2 (Public review):

      Summary

      This manuscript focuses on the role of social responsibility and guilt in social decision-making by integrating neuroimaging and computational modeling methods. Across two studies, participants completed a lottery task in which they made decisions for themselves or for a social partner. By measuring momentary happiness throughout the task, the authors show that being responsible for a partner's bad lottery outcome leads to decreased happiness compared to trials in which the participant was not responsible for their partner's bad outcome. At the neural level, this guilt effect was reflected in increased neural activity in the anterior insula, and altered functional connectivity between the insula and the inferior frontal gyrus. Using computational modeling, the authors show that trial-by-trial fluctuations in happiness were successfully captured by a model including participant and partner rewards and prediction errors (a 'responsibility' model), and model-based neuroimaging analyses suggested that prediction errors for the partner were tracked by the superior temporal sulcus. Taken together, these findings suggest that responsibility and interpersonal guilt influence social decision-making.

      Strengths

      This manuscript investigates the concept of guilt in social decision-making through both statistical and computational modeling. It integrates behavioral and neural data, providing a more comprehensive understanding of the psychological mechanisms. For the behavioral results, data from two different studies is included, and although minor differences are found between the two studies, the main findings remain consistent. The authors share all their code and materials, leading to transparency and reproducibility of their methods.

      The manuscript is well-grounded in prior work. The task design is inspired by a large body of previous work on social decision-making and includes the necessary conditions to support their claims (i.e., Solo, Social, and Partner conditions). The computational models used in this study are inspired by previous work and build on well-established economic theories of decision-making. The research question and hypotheses clearly extend previous findings, and the more traditional univariate results align with prior work.

      The authors conducted extensive analyses, as supported by the inclusion of different linear models and computational models described in the supplemental materials. Psychological concepts like risk preferences are defined and tested in different ways, and different types of analyses (e.g., univariate and multivariate neuroimaging analyses) are used to try to answer the research questions. The inclusion and comparison of different computational models provide compelling support for the claim that partner prediction errors indeed influence task behavior, as illustrated by the multiple model comparison metrics and the good model recovery.

      We thank the reviewer very much for their comprehensive description of our study and the positive assessment of our study and approach.

      Weaknesses

      As the authors already note, they did not directly ask participants to report their feelings of guilt. The decrease in happiness reported after a bad choice for a partner might thus be something else than guilt, for example, empathy or feelings of failure (not necessarily related to guilt towards the other person). Although the patterns of neural activity evoked during the task match with previously found patterns of guilt, there is no direct measure of guilt included in the task. This warrants caution in the interpretation of these findings as guilt per se.

      We fully agree that not directly asking participants about feelings of guilt is a clear limitation of our study. While we already mention this in our Discussion, we have expanded our discussion of the consequences on the interpretation of our results along the lines described by the reviewer in the revised manuscript. We would like to thank the reviewer for proposing these lines of thought, and have now made the following changes to the text:

      In the first paragraph of the discussion, we now write: “Being responsible for choosing a lottery that yielded a low outcome for a partner made our participants feel worse than witnessing the same outcome resulting from their partner’s choice, which we interpret as interpersonal guilt; although we note that we have not asked participants specifically about which emotion they felt in these situations.”

      Later on, in the third paragraph focusing on the anterior insula, we now write: “This replicates a large body of evidence associating aIns with feelings of guilt evoked during social decisions (see Introduction). Because we have neither asked our participants specifically what they felt in these situations, nor specifically whether they experienced guilt, we cannot exclude the possibility that they have instead or in addition felt empathy for their partner, a feeling of failure or bad luck, or some other emotion.”

      As most comparisons contrast the social condition (making the decision for your partner) against either the partner condition (watching your partner make their decision) or the solo condition (making your own decision), an open question remains of how agency influences momentary happiness, independent of potential guilt. Other open questions relate to individual differences in interpersonal guilt, and how those might influence behavior.

      How agency influences momentary happiness, or variations thereof, during the course of an experiment such as ours is an interesting question in itself. We have now run linear mixed models assessing agency (i.e., we compared happiness in the Solo and Social conditions vs. the Partner condition), which revealed lower happiness in the Solo and Social conditions (i.e., when it was the participant’s turn to decide) in both studies. This may reflect the drive behind the responsibility aversion reported by Edelson et al. (2018): being assigned the role of the decider in a social setting may make people slightly unhappy, perhaps due to the “weight of the responsibility”. We now report these findings in the Results section, including this proposed explanation; because we were not specifically interested in responsibility aversion, we do not discuss this further in the Discussion. The edited text is under the new subsection entitled ‘Momentary happiness: effects of agency, responsibility and guilt’, on page 12:

      “Next, we assessed whether happiness varied depending on the participant’s agency (Social + Solo vs. Partner), and found happiness to be lower when the participant chose, independent of the outcome (Study 1: t(3600) = -3.92, p = 0.00009, β = -0.14, 95% CI = [-0.20 -0.07]; Study 2: t(2870) = -6.07, p = 0.000000001, β = -0.24, 95% CI = [-0.31 -0.16]). This is interesting in itself and may reflect the drive behind responsibility aversion reported by Edelson et al.’s 2018 study: being assigned the role of the decider in a social setting may make people slightly unhappy, perhaps due to “weight of the responsibility”. To specifically search for a sign of interpersonal guilt, [...]”

      Regarding individual differences: this is a very interesting topic that we have not addressed here due to the (relatively) small number of participants in our studies, but we might consider this for future follow-up studies, which we mention in the Discussion paragraph regarding open questions.

      This manuscript is an impressive combination of multiple approaches, but how these different approaches relate to each other and how they can aid in answering slightly different questions is not very clearly described. The authors could improve this by more clearly describing the different methods and their added value in the introduction, and/or by including a paragraph on implications, open questions, and future work in the discussion.

      We thank the reviewer for their appreciation of our complementary approach, and agree that we had not sufficiently explained the reasons why we used several methods. We have now added a paragraph explaining this at the end of the Introduction (page 5):

      “We analysed our behavioural data using several complementary methods: choices were modelled with mixed-effects regressions serving as manipulation checks; risk preferences expressed in choices were assessed using a comprehensive expected utility model as well as with a simpler, more robust “risk premium” approach; and happiness data were fitted, in addition to the computational models, with several linear mixed models to assess the impact of both the participant’s and their partner’s rewards, the impact of agency and their interactions. Inspired by findings reported in previous neuroimaging of social emotions, we also used several methods to analyse our fMRI data, including conventional methods (both region-of-interest and mass univariate); mixed-effects regression models; computational model-based analyses (inspired by e.g. Konovalov et al., 2021; Rutledge et al., 2014); and functional connectivity (e.g. Edelson et al., 2018; Konovalov et al., 2021). The behavioural modelling is thus complemented by neuroimaging analyses that offer insight about both the activity in regions associated with guilt as well as their place in a wider network, providing an in-depth comprehensive analysis of the mechanisms behind guilt evoked by social responsibility.”

      In addition, as suggested we added the following paragraph on open questions and future work in the Discussion:

      “Several open questions remain at the end of this study. As discussed above, asking participants directly about which emotions they have felt during the different stages of this task would allow us to link subjective experience with our analytical measures. Testing more participants would allow us to assess the impact of inter-individual variations in personality traits on the experience as well as the behavioural and neural correlates of guilt and responsibility. Using more trials in the experiment would allow separate modelling of risk preferences in gain and loss trials in each experimental condition using expected utility models, and could allow testing whether changes in momentary happiness affect subsequent choices. Varying partner identities (friends, strangers, artificial agent) could reveal the impact of social discounting on guilt and responsibility. In sum, we believe that this experimental approach lends itself very well to the study of several aspects of social emotions.”

      However, taken together, this study provides useful insights into the neural and behavioral mechanisms of responsibility and guilt in social decision-making and how they influence behavior. 

      We thank the reviewer again for their appreciation of our work and hope that our revisions improved the manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The majority of my suggestions are in the public review, so I will not repeat them here. But in general, I like the paper, and in addition to my other comments, I think that there should be more discussion of the potential limitations of the study and conclusions that can be drawn. I also thought parts of the results were a little hard to follow, particularly in the 'momentary happiness' section. Perhaps an additional subsection here might help with flow.

      We agree that we could have discussed further the limitations of our study and the conclusions that can be drawn from it, which we have now done in the last paragraphs of the Discussion in this revised version.

      To improve the structure of the section on ‘momentary happiness’, we separated this section into two, entitled: ‘Momentary happiness: links to reward‘ and ‘Momentary happiness: effects of agency, responsibility and guilt’, which should facilitate the reading of this long section. We proceeded in a similar manner for the Choices section, which is now subdivided into ‘Choices: manipulation check’ and ‘Choices: risk preferences’. We believe that these changes have indeed improved the readability of our manuscript.

      Reviewer #2 (Recommendations for the authors):

      Overall, I believe this manuscript was well-designed, consists of extensive analyses, and provides interesting new insights into the mechanisms underlying social decision-making. I mostly have some clarifying questions and minor comments, which are described below. 

      (1) Integration of prior findings in the first paragraphs of the Introduction. Although all the previous work described in the 2nd-5th paragraph introduction is interesting, it felt a bit like an enumeration of findings rather than an integrated introduction leading to the current research question. At the end of paragraph 5, it becomes clear how these findings relate to the current research question, but I believe it will improve the flow and readability of the introduction if this becomes clear earlier on.

      We agree that we could have integrated the cited previous work into the Introduction so that the text builds up to the research question. We have now extensively reworked several paragraphs in the Introduction (pages 3-5) and hope that these changes have made it easier to follow.

      (2) For the risk attitudes (Choices), you describe pooling the gains and losses and then comparing the social and solo conditions. I was wondering whether you also looked at potential differences between gains and losses (delta measure) for social versus the solo condition (so a comparison of the delta). Based on prior work, I can imagine that the difference in risk attitudes for gains and losses might differ when making decisions for yourself versus when you're doing it for a partner. In general, I was wondering how you explain these findings, as there is also a lot of work showing differences in risk-taking patterns for gains and losses.

      We agree that we could have compared delta measures between solo and social conditions. However, as we describe in the Results section and comment on in the Discussion, the relatively low number of trials made separate fitting of gain and loss trials across conditions difficult. While this question could thus be addressed in subsequent versions of our experiment with more trials, such a fine-grained analysis of the decisions was not the focus of our current study.

      (3) On page 11, you state: "in particular the partner's reward prediction errors resulting from the participants' decisions, i.e. those pRPE for which participants were responsible." From the results described in the paragraph above, this doesn't become clear (e.g., there's no distinction made between social_pRPE and partner_pRPE in the text), as it only discusses differences in weights between pRPE and sRPE. I would recommend including some more information in the main text on these main modeling findings, so one doesn't have to go to the Supplemental Materials to understand them.

      We did indeed fail to report these findings in the text! We thank the reviewer for pointing this out. We have now edited this passage as follows:

      “Crucially, we find here that the partner’s reward prediction errors (social_pRPE and partner_pRPE) contributed to explaining changes in participants’ momentary happiness: the Responsibility and ResponsibilityRedux models explained the data better than the models without these parameters (see Table 1). In particular, the partner’s reward prediction errors resulting from the participants’ decisions (social_pRPE), i.e. those pRPE for which participants were responsible, contributed to explaining our data (weights for social_pRPE were greater than 0: Responsibility model: Study 1: Z = 2.85, p = 0.004, Study 2: Z = 3.26, p = 0.001; Responsibility Redux model: Study 1: Z = 2.93, p = 0.003, Study 2: Z = 3.30, p = 0.001; weights for social_pRPE tended to be higher than weights for partner_pRPE: Responsibility model: Study 1: Z = 2.14, p = 0.033; Study 2: Z = 1.41, p = 0.16).”
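As a rough illustration of what a "weight" on a prediction-error term means here, the sketch below computes a predicted happiness rating as a baseline plus an exponentially discounted, weighted sum of recent trial events, in the spirit of the momentary-happiness models of Rutledge et al. (2014) that the authors cite. The functional form, the decay parameter, and all numeric values are assumptions for illustration; only the term names sRPE, social_pRPE and partner_pRPE come from the text, and the manuscript's actual model equations are not reproduced in this excerpt.

```python
def predicted_happiness(trials, weights, baseline, decay):
    """Baseline plus a discounted, weighted sum of per-trial terms.

    trials: list of dicts mapping a term name to its value, oldest first.
    weights: dict mapping each term name to its fitted weight.
    decay: forgetting factor in (0, 1]; older trials count less.
    """
    total = baseline
    last = len(trials) - 1
    for j, trial in enumerate(trials):
        discount = decay ** (last - j)  # most recent trial has discount 1
        total += discount * sum(
            weights[name] * trial.get(name, 0.0) for name in weights
        )
    return total


# Hypothetical weights and a two-trial history.
weights = {"sRPE": 0.5, "social_pRPE": 0.25, "partner_pRPE": 0.1}
history = [
    {"sRPE": 1.0},          # participant's own outcome beat expectation
    {"social_pRPE": -2.0},  # partner did worse after the participant's choice
]
rating = predicted_happiness(history, weights, baseline=0.0, decay=0.5)
```

In a model of this shape, a social_pRPE weight reliably above zero means that outcomes for the partner that were worse than expected, and that followed from the participant's own choice, pull the predicted rating down, which is the pattern the quoted passage reports.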

      (4) The functional connectivity findings seem to come out of nowhere and are not introduced or described anywhere prior in the manuscript. It is therefore not completely clear why you conducted these analyses, or what they add above and beyond previous analyses. Already introducing this method earlier on would fix that.

      We agree that we could have introduced functional connectivity analyses earlier in the text, particularly given the many previous studies in our field using this technique. We have now done this at the end of a new last paragraph of the Introduction:

      “Inspired by findings reported in previous neuroimaging of social emotions, we also used several methods to analyse our fMRI data, including conventional methods (both region-of-interest and mass univariate); mixed-effects regression models; computational model-based analyses (inspired by e.g. Konovalov et al., 2021; Rutledge et al., 2014); and functional connectivity (e.g. Edelson et al., 2018; Konovalov et al., 2021). The behavioural modelling is thus complemented by neuroimaging analyses that offer insight about both the activity in regions associated with guilt as well as their place in a wider network, providing an in-depth comprehensive analysis of the mechanisms behind guilt evoked by social responsibility.”

      (5) For the functional connectivity findings: I was wondering why you only looked at the choice phase, and not at the feedback phase. I understand that previous work focused on the choice phase, but for the purpose of this study (focus on guilt), I can imagine it is also interesting to see what happens with feedback. In the discussion, you also state "How we feel when we witness our decisions' consequences on others is an important signal to consider when attempting to make good social decisions." (p. 19), which is more focused on the feedback rather than choice, and also supports the idea that looking at the feedback moment might be relevant.

      We agree that we could also have looked at the functional connectivity during the feedback phase. The main reason why we had originally not done so was time constraints. At the current time we would in addition point out that the manuscript is already very long and contains many analyses of behavioural and fMRI data. Adding this analysis would cost additional time and would further delay the publication of our manuscript, which we would prefer to avoid. However, one could of course look at these effects in subsequent analyses of the same data or in subsequent versions of this experiment. We have now mentioned this in the Discussion, in the paragraphs on open questions.

      Minor comments:

      (1) For some of the Figures, it would be helpful if the subtitles were more informative. For Figure 2 and Figure 3 for example, it would be nice if Study 1 and Study 2 were not only mentioned in the figure description but also in the actual figure. For Figures 3 and 4, it would be helpful to have significance stars for the bar plots as well.

      We agree that these changes make the figures more easily understandable and have implemented them all, except for adding stars on Figure 4, because all bar plots in panels C and E would have been labeled with two or more stars, which would have made the figure difficult to read. We have now mentioned the fact that all these coefficients were significant in the figure legend.

      (2) For some of the Supplementary Results, it would be very helpful if there was a legend or description. This is already the case for most of the SR, but not for all.

      We have now added a legend to all elements of the Supplementary Results.

      Some questions that came to mind while going through them:

      - Supplementary Table 1: which p-values correspond to the significance stars? This information is included for Supplementary Table 2, but not for ST1. 

      We have now added the missing information in ST1.

      - Supplementary Figure 1: do the colors correspond to different participants? 

      We have now specified that the colors do indeed correspond to different participants.

      - Supplementary Table 5 (final table): what do the - represent? As in, why is there no value for "run" for the MPFC? At first, I thought you only included the significant values, but then I noticed a few non-significant values as well, so it wasn't completely clear to me why some of the values were missing. This also applies to Supplementary Table 6.

      We had indeed forgotten to explain this. The ‘-’ in Supplementary Tables 4 and 6 indicates that the linear mixed model without the factor ‘run’ was the better-fitting one. We have now added the following explanation in the text accompanying Supplementary Table 4:

      “We tested these models both with and without the factor Run and associated interaction, and we report the best-fitting model in the table below: a dash (‘-’) in the row displaying parameters for the run and socialVsSolo:run regressors indicates that the model without factor run was better-fitting for this ROI.”

      (3) I came across a few minor typos or sentences that were not completely clear to me.

      - On page 3: "Patients with damage to ventromedial prefrontal cortex (vmPFC) seem insensitive to guilt when playing social economic games (Krajbich et al., 2009)." This sentence felt a bit out of nowhere and doesn't logically follow from the previous sentences. 

      We have now revised the descriptions of this previous study as well as several others and how they fit into the research question.

      - On page 3: "In another study, participant errors in a difficult perception task lead to a partner feeling pain and evoked activations in left aIns and dlPFC (Koban et al., 2013)." This sentence doesn't really flow, and from the wording, it is not completely clear whether it's the errors or the partner pain that led to the aIns and dlPFC activation.

      We have now revised the description of this study as well, as follows:

      “In another study, partners received painful stimuli when participants made errors during a difficult perception task. These errors evoked activations in the left aIns and dlPFC in the participants (Koban et al., 2013).”

      - Supplementary Figure 1: there is a missing period after the sentence "We then compared these new estimated parameters to the actual parameters from which the synthetic data were generated"

      We have now added the missing period after “generated”.

      - On page 5: "We ran two experiments, Study 1 outside fMRI and Study 2 during fMRI, with separate groups of participants." I would change "outside fMRI" to outside the MRI scanner or something like that, as it's not completely correct to say "outside fMRI".

      We have changed the sentence to “outside the MRI scanner”.

      - On page 6: for the first result, there are currently two p-values reported (p < 2.5e-20 and p < 2e-16). I believe this is an error?

      This was indeed an error! We have re-run this analysis, noticed that the degrees of freedom were also miscalculated, and have updated this result and the effect of condition (solo vs social). The results are almost identical to the previous ones and all conclusions hold. We have also checked the other analyses reported in this paragraph: all results replicate exactly.

      - On page 6: "Supplemental Table 1" should be "Supplementary Table 1" (for consistency).

      Done.

      On page 8: "participants in both conditions of both studies", I would change "of both studies" to "for both studies".

      Done.

      On page 8: for the "Momentary Happiness" paragraph, it would be helpful if you could briefly describe the Rutledge method here, for people who are unfamiliar with the approach.

      We now write the following at the beginning of this paragraph:

      “Following Rutledge and colleagues’ methodology, which considers that changes in momentary happiness in response to outcomes of a probabilistic reward task are explained by the combined influence of recent reward expectations and prediction errors arising from those expectations, we fitted computational models to each participant’s happiness data.”
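
The verbal description above can be made concrete with a short sketch. It assumes the general form of the momentary happiness model from Rutledge et al. (2014), in which certain rewards, expected values and reward prediction errors from past trials each contribute with an exponentially decaying weight. The function and parameter names are illustrative, and the additional partner-related terms of the Responsibility models are not shown.

```python
def momentary_happiness(cr, ev, rpe, w0, w_cr, w_ev, w_rpe, gamma):
    """Predicted happiness after each trial.

    cr, ev, rpe: per-trial certain rewards, expected values and
    reward prediction errors (equal-length sequences).
    w0: baseline; w_cr, w_ev, w_rpe: weights; gamma: forgetting factor (0..1).
    """
    predictions = []
    for t in range(len(cr)):
        total = w0
        for j in range(t + 1):
            # Events further in the past are discounted by gamma ** (t - j).
            decay = gamma ** (t - j)
            total += decay * (w_cr * cr[j] + w_ev * ev[j] + w_rpe * rpe[j])
        predictions.append(total)
    return predictions
```

Fitting such a model would mean choosing the weights and the forgetting factor that minimise the difference between these predictions and each participant's happiness ratings.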

      On page 10: "Wilkoxon sign-rank tests", should be "Wilcoxon".

      Done.

      We thank the reviewer for their careful reading of our manuscript. We believe that these changes have indeed improved our manuscript.

    1. Abstract: Over the past decade, central sequence repositories have expanded significantly in size. This vast accumulation of data holds value and enables further studies, provided that the data entries are well annotated. However, the submitter-provided metadata of sequencing records can be of heterogeneous quality, presenting significant challenges for re-use. Here, we test to what extent large language models (LLMs) can be used to cost-effectively automate the re-annotation of sequencing records against a simplified classification scheme of broad ecological environments with relevance to microbiome studies, without retraining. We focused on sequencing samples taken from the environment, for which metadata is important. We employed OpenAI Generative Pretrained Transformer (GPT) models, and assessed scalability, time and cost-effectiveness, as well as performance against a diverse, hand-curated ground-truth benchmark with 1000 examples, that span a wide range of complexity in metadata interpretation. We observed that annotation performance markedly outperforms that of a baseline, manually curated, non-machine-learning keyword-based approach. Changing models (or model parameters) has only minor effects on performance, but prompts need to be carefully designed to match the task. We applied the optimized pipeline to more than 3.8 million sequencing records from the environment, providing coarse-grained yet standardized sampling site annotations covering the globe. Our work demonstrates the effective use of LLMs to simplify and standardize annotation from complex biological metadata.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag015), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2:

      Reproducibility report for: Enhanced semantic classification of microbiome sample origins using Large Language Models

      Journal: GigaScience

      ID number/DOI: GIGA-D-25-00316

      Reviewer(s): Laura Caquelin, Department of Clinical Neuroscience, Karolinska Institutet, Sweden [Wrote the report and reproduced the results]


      1. Summary of the Study

      This study evaluates whether Large Language Models (LLMs) can help re-annotate sequencing records. Using GPT models, the authors tested scalability, time, cost, and performance against a benchmark of 1,000 hand-curated examples. They then applied this approach to more than 3.8 million environmental sequencing records, producing standardized annotations.


      2. Scope of reproducibility

      According to our assessment, the primary objective is to evaluate how closely GPT's annotation performance approached that of a human expert when classifying environmental sequencing samples into biomes and sub-biomes.

      • Outcome: Accuracy of biome and sub-biome classification compared against a hand-curated benchmark dataset.

      • Analysis method outcome: As described to validate biome classifications: "For paired comparisons of repeated sample IDs, we use the McNemar test, which is appropriate for paired binary outcomes (True/False)", while "for comparisons across different sample sets, we employ the t-test for independent samples. In both scenarios, a Bonferroni correction is applied to adjust for multiple comparisons".

      For sub-biomes, comparisons across different sets were performed with independent t-tests, while "for runs involving the same sample IDs, comparisons are performed using the paired t-test". Section "Validation statistics" page 13-14.

      • Main result: "The improvement in accuracy between GPT's initial classification and the human's performance with the improved prompt was statistically significant (adj p-value=0.031), but also between the human's attempt first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). No significantly different performances were detected for the sub-biome classification, neither between GPT and the human, nor between prompt versions (adjp-value=1)." Section "Human versus GPT classification accuracy" pages 18-19.
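
The validation statistics quoted above combine a paired test on binary outcomes with a multiple-comparison correction. The following is a minimal sketch of that combination, assuming the exact (binomial) form of the McNemar test; the manuscript does not state which variant was used, and the function names are illustrative rather than taken from the authors' scripts.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired True/False outcomes."""
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, discordant pairs split 50/50 (binomial).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Because the adjusted p-value depends on how many comparisons enter the correction, reproducing a reported value requires knowing the exact set of comparisons that was corrected together.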

      3. Availability of Materials

      a. Data

      • Data availability: Open
      • Data completeness: Complete = all data necessary to reproduce main results are available
      • Access Method: Repository
      • Repository: https://zenodo.org/records/16100607
      • Data quality: Complete but no metadata associated with the file

      b. Code


      4. Computational environment of reproduction analysis

      • Operating system for reproduction: macOS 15.6.1
      • Programming Language(s): Python
      • Code implementation approach: Using shared code
      • Version environment for reproduction: Python 3.13.7

      5. Results

      5.1 Original study results

      • Results:

      "The improvement in accuracy between GPT's initial classification and the human's performance with the improved prompt was statistically significant (adj p-value=0.031), but also between the human's attempt first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). No significantly different performances were detected for the sub-biome classification, neither between GPT and the human, nor between prompt versions (adjp-value=1)."

      (The authors identified an error in the manuscript text during the review. Therefore, this part of the manuscript needs to be updated; see the email exchange with the authors below.)

      5.2 Steps for reproduction

      -> Run the first two scripts of container 4 in the GitHub repository: validate_biomes_subbiomes.py and overall_analysis.py

      • Issue 1: The README instructions for setting up the ~/MicrobeAtlasProject directory can lead to a nested folder structure (~/MicrobeAtlasProject/MicrobeAtlasProject) if followed literally. This causes the Docker container to fail when attempting to access required files like gpt_file_label_map.tsv, since they are not found at the expected path /MicrobeAtlasProject/.

      -- Resolved: The issue was resolved by manually renaming and flattening the directory structure after extraction, ensuring that the contents of MicrobeAtlasProject_Zenodo are directly placed inside ~/MicrobeAtlasProject/. However, the current instructions can mislead users, so a clarification in the README would be helpful.
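
The manual fix described above could also be scripted. The sketch below is only an illustration of the flattening step, assuming the nested folder has the same name as its parent; the helper name is hypothetical and is not part of the authors' repository.

```python
import shutil
from pathlib import Path


def flatten_nested_dir(root: Path) -> bool:
    """Move the contents of root/<root.name>/ up into root.

    Returns True if a nested folder was found and flattened,
    False if there was nothing to do.
    """
    nested = root / root.name
    if not nested.is_dir():
        return False
    for item in list(nested.iterdir()):
        target = root / item.name
        if target.exists():
            # Refuse to overwrite anything that is already in place.
            raise FileExistsError(f"{target} already exists")
        shutil.move(str(item), str(target))
    nested.rmdir()  # only succeeds once the nested folder is empty
    return True
```

For example, `flatten_nested_dir(Path.home() / "MicrobeAtlasProject")` would leave files such as gpt_file_label_map.tsv directly inside ~/MicrobeAtlasProject/.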

      • Issue 2: During the execution of the overall_analysis.py script, multiple files with the same label were found, requiring manual selection of the file to use for the analysis.

      -- Resolved: The manuscript does not specify which file should be selected to reproduce the results, leading to potential ambiguity. By default, I chose the most recent file among the options, assuming it reflects the final data version used in the manuscript. It would be helpful if the documentation or manuscript explicitly stated this to ensure exact reproducibility.

      -> Compare the results reproduced to the results presented in the manuscript

      • Issue 3: The results obtained by running validate_biomes_subbiomes.py are two files: biome_subbiome_results.csv and biome_subbiome_stats.csv, which contain a large amount of output (1,284 and 48,197 rows respectively). The script overall_analysis.py provides overall performance metrics in the terminal output, but does not produce the adjusted p-values relevant to the scope of this review.

      -- Unresolved: It was difficult to identify where to find the results presented in the manuscript, so an email was sent to the authors.

      Message sent by the authors

      Dear reviewer, 

      By running validate_biomes_subbiomes.py as described (using gpt_file_label_map.tsv as --map_file), the output will be two .csv files named biome_subbiome_results.csv and biome_subbiome_stats.csv. 

      The latter file will contain the stats (hence the adjusted p-values). Were you able to reproduce such files? 

      We did notice there are a few mistakes. 

      Mistake 1. 

      In the manuscript it says: 

      " A trained molecular biologist, with no prior exposure to the project, was given the same prompt instructions as GPT and was asked to classify sample biomes and sub-biomes. While against the benchmark dataset, GPT achieved an accuracy of 79.76% (n=499; SD=40.0), the human annotator reached 78.0% (n=250; SD=33.0). "

      The second standard deviation should be replaced with SD=42.0. 

      Mistake 2. 

      In the manuscript it says: 

      " The improvement in accuracy between GPT's initial classification and the human's performance with the improved prompt was statistically significant (adj p-value=0.031), but also between the human's attempt first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). "

      The first adjusted p-value should not be 0.031 but 0.134 hence not significant so this sentence should be adjusted to: 

      " There was an improvement in accuracy between the human's first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). "

      Mistake 3. 

      We built on biome_subbiome_results.csv and biome_subbiome_stats.csv further than necessary so the two files on Zenodo should be "cut" earlier to avoid confusion. This was a problem of the script validate_biomes_subbiomes.py which concatenates on existing files (e.g.: biome_subbiome_results.csv and biome_subbiome_stats.csv) instead of creating new ones. We should probably proceed by replacing these two files with the files without repetitions. 

      We thank you for your work and please do let us know if everything works out now. 

      Thank you and kind regards, ####

      The authors confirm that the results presented in the manuscript can be found in the biome_subbiome_stats.csv file. This file contains 48,197 rows. According to the authors, the file includes data from both existing files and new data. As a result, it is difficult to determine which data have been reproduced. Even when using a Ctrl+F search for the reported p-value (e.g., p-value = 0.134) in the Excel file, this value appears in several rows labeled under different configurations (label1/label2), such as: chunk_size3000/sync_chunkN_presp0.0; chunk_size3000/sync_chunkN_temp1.5; chunk_size5000/gpt4-0613; machine/sync_chunkY_topp0.0; etc.

      5.3 Statistical comparison Original vs Reproduced results

      • Results: The biome_subbiome_stats.csv file was reproduced, but it is difficult to distinguish between the newly reproduced data and the existing data already present in the file. Additionally, the data presented in the manuscript are also hard to identify due to the size of the file. No comparison was performed.
      • Comments: -
      • Errors detected: During the review, the authors identified an error in the manuscript text: the first adjusted p-value is 0.134, not 0.031.
      • Statistical Consistency: No comparison was performed.

      6. Conclusion

      • Summary of the computational reproducibility review

      The main scripts to reproduce the results were successfully executed and the output files were generated. However, due to the size of the output files and the lack of precise references in the manuscript, it was difficult to identify which parts of the output correspond to the results presented in the paper. Moreover, the authors mentioned that the script adds data to existing output files rather than generating new ones, making it hard to distinguish between old and new data. This led to confusion when trying to compare the reproduced results with those in the manuscript. As a result, a comparison of statistical values was not possible.

      • Recommendations for authors

      To improve the reproducibility of the manuscript, we recommend that the authors:

      -- Clarify instructions in the README about the MicrobeAtlasProject folder.

      -- Ensure scripts generate new outputs, or clarify which data is new vs. existing in the files with, for example, a column indicating the origin (e.g., "new" or "existing").

      -- Link results in the manuscript to specific rows/sections in the output files to easily locate the exact data used. Another solution could be to include a smaller or filtered version of the output files with only the rows used for key results, figures or tables, to make checking results easier and avoid errors.

      -- Metadata: For the data used or generated by the scripts, it would be helpful to include accompanying metadata files that explain:

      --- The definition of each variable name.

      --- The origin of each dataset (raw, processed, etc.).

      --- Any preprocessing steps applied before analysis.
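
One way to act on the recommendation about new versus existing outputs is sketched below. It is an illustration only, not the authors' script: each run writes to its own file and tags every row with a run identifier, so results from different runs cannot be mixed. The function name, the `run_id` column and the file naming scheme are assumptions.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def write_run_output(rows, fieldnames, out_dir, stem, run_id=None):
    """Write rows to a fresh CSV file named <stem>_<run_id>.csv.

    Every row gets a 'run_id' column so its origin stays traceable.
    """
    run_id = run_id or datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{stem}_{run_id}.csv"
    # Mode "x" raises FileExistsError instead of appending or overwriting.
    with path.open("x", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["run_id", *fieldnames])
        writer.writeheader()
        for row in rows:
            writer.writerow({"run_id": run_id, **row})
    return path
```

With this pattern, a reproduced file can be compared directly against the deposited one, since neither contains rows from earlier runs.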

    2. Abstract: Over the past decade, central sequence repositories have expanded significantly in size. This vast accumulation of data holds value and enables further studies, provided that the data entries are well annotated. However, the submitter-provided metadata of sequencing records can be of heterogeneous quality, presenting significant challenges for re-use. Here, we test to what extent large language models (LLMs) can be used to cost-effectively automate the re-annotation of sequencing records against a simplified classification scheme of broad ecological environments with relevance to microbiome studies, without retraining. We focused on sequencing samples taken from the environment, for which metadata is important. We employed OpenAI Generative Pretrained Transformer (GPT) models, and assessed scalability, time and cost-effectiveness, as well as performance against a diverse, hand-curated ground-truth benchmark with 1000 examples, that span a wide range of complexity in metadata interpretation. We observed that annotation performance markedly outperforms that of a baseline, manually curated, non-machine-learning keyword-based approach. Changing models (or model parameters) has only minor effects on performance, but prompts need to be carefully designed to match the task. We applied the optimized pipeline to more than 3.8 million sequencing records from the environment, providing coarse-grained yet standardized sampling site annotations covering the globe. Our work demonstrates the effective use of LLMs to simplify and standardize annotation from complex biological metadata.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag015), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1:

      The manuscript presents a carefully executed study using non-finetuned GPT models to classify microbiome sample metadata. It is very well written, and both the analyses and the interpretations are generally sound. I found the evaluation thorough and the presentation clear.

      1. The study provides a detailed evaluation of LLM-based metadata curation, clearly advancing over keyword-based approaches. However, it is surprising that recent related studies using LLMs for metadata curation are not cited. For completeness, I suggest including references such as:
         - https://doi.org/10.1093/gigascience/giaf070 (disclaimer: I am an author of the paper. You might like the table 3),
         - https://doi.org/10.1093/bib/bbad535,
         - https://doi.org/10.3897/phytokeys.261.158396,
         - https://pmc.ncbi.nlm.nih.gov/articles/PMC12099408/.

      2. The scale of the processed data is impressive. However, there appears to be a discrepancy: the Zenodo repository file metadata.out contains 2,254,619 accession IDs (presumably the input), while the GPT output files (gpt_clean_output) include only around 1,000 samples (presumably the benchmark dataset), whereas the manuscript states that 3.8 million samples were processed. It would be helpful to clarify these numbers and, if applicable, explain why fewer outputs are provided. I also recommend reorganizing the Zenodo repository so that readers can download individual files rather than the entire large archive.

      3. The data processing pipeline on GitHub is very useful. The repository currently indicates a CC0 (public domain) license. Since CC0 is typically intended for datasets rather than source code, please clarify whether this was intentional or if a software-specific license (e.g., MIT, Apache 2.0) would be more appropriate.

      4. A different typeface appears in some paragraphs (e.g., pp. 19 and 21). Please check whether this was intentional.

      5. The finding that grouping 5-17 samples per request does not substantially affect accuracy is interesting. Given that GPT models often fail with counting or item listing, the observed quality decline with larger chunk sizes seems reasonable and aligns with expectations.

      6. On p. 26, the observed variability in field usage may be linked to the BioSample package system used for submission (see: https://www.ncbi.nlm.nih.gov/biosample/docs/packages/). Some fields, such as env_biome and env_feature, were once mandatory for environmental samples but are currently optional, I suppose. Such historical changes may partly explain biases in field usage.

      7. The manuscript appropriately highlights the presence of ambiguous or unresolvable sample descriptions. We reached a similar conclusion in our own work with local LLMs: in many cases, even expert curators cannot determine a "correct" label, and the right answer may depend on context or application.

      8. The observation that JSON output significantly improves sub-biome classification accuracy is intriguing and consistent with our internal experience with local LLMs. Since output format may also affect processing speed, it would be useful to report whether response times differed between JSON and inline formats.

      9. One major limitation of the study is the dependence on proprietary GPT models accessible only via OpenAI's API. This constrains reproducibility and long-term availability. Indeed, the recent release of GPT-5 already renders some of the reported results outdated. While the present study remains highly valuable, it would be worthwhile to also evaluate local or open-source LLMs to ensure future reproducibility.

    1. Abstract. Background: Recent advancements in single-cell omics technologies have enabled detailed characterization of cellular processes. However, coassay sequencing technologies remain limited, resulting in un-paired single-cell omics datasets with differing feature dimensions. Findings: We present GROTIA (Graph-Regularized Optimal Transport Framework for Diagonal Single-Cell Integrative Analysis), a computational method to align multi-omics datasets without requiring any prior correspondence information. GROTIA achieves global alignment through optimal transport while preserving local relationships via graph regularization. Additionally, our approach provides interpretability by deriving domain-specific feature importance from partial derivatives, highlighting key biological markers. Moreover, the transport plan between modalities can be leveraged for post-integration clustering, enabling a data-driven approach to discover novel cell subpopulations. Conclusions: We demonstrate GROTIA’s superior performance on four simulated and four real-world datasets, surpassing state-of-the-art unsupervised alignment methods and confirming the biological significance of the top features identified in each domain. The software is available at https://github.com/PennShenLab/GROTIA.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag012), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1

      The manuscript presents a well-motivated and technically elegant approach to diagonal single-cell data integration, combining optimal transport with graph-based regularization to achieve a balance between global and local structure alignment. The method addresses an important challenge in single-cell data integration, where existing approaches still leave room for improvement. Its embedding design offers the potential for interpretable feature-level insights, a particularly desirable quality in single-cell multi-omics integration where biological interpretability is especially important.

      That said, the manuscript would be substantially strengthened by deeper validation and a clearer demonstration of reproducibility. Some claims would benefit from stronger empirical support in the presented results, and a more thorough evaluation of the method's added value relative to unimodal alternatives, particularly in the context of marker gene discovery and the identification of cell types or subpopulations, could further enhance the manuscript. Additionally, the impact of key parameter choices, such as kernel bandwidth selection, the number of nearest neighbors (k), and sensitivity to hyperparameters (λ, ρ), should be more fully explored, reported, or justified. Reproducibility could be improved by providing scripts and a computational environment or container to replicate all analyses and figures presented in the manuscript. Usability would also be improved by providing the method as an installable Python package, rather than limiting implementation to a Jupyter Notebook.

      Overall, the manuscript introduces a compelling methodological framework with meaningful potential for applications in single-cell integration. The suggestions that follow are intended to help the authors strengthen their contribution in alignment with GigaScience's emphasis on openness, reproducibility, and FAIR principles. I hope these suggestions will help strengthen the support for the authors' conclusions, clarify the reasoning behind key arguments, and improve the clarity and interpretability of the figures and descriptions.

      1. Reproducibility

      Reproducibility is impeded by the absence of clearly organized scripts or workflow files to regenerate the results, figures, and tables presented in the manuscript. While some outputs are shown or alluded to in the Jupyter Notebook found in the linked GitHub repository, they are not clearly cross-referenced with the paper's results, making it difficult to confirm how specific figures or tables were produced. Furthermore, no computational environment specification is provided, which makes replication with confidence impossible. Certain aspects of the manuscript fall short of best practices for transparent and reproducible research. Analysis scripts are incomplete or undocumented, and key portions of the software pipeline are either insufficiently described or lack proper attribution. These limitations hinder reproducibility and reduce reusability. Figures would also benefit from clearer annotation. Collectively, these shortcomings detract from alignment with the FAIR principles emphasized by GigaScience. Reproducibility would be significantly improved by packaging the software, versioning the code, defining and documenting the computational environment, and depositing all components of the analysis pipeline, including preprocessing scripts, evaluation code, and figure generation, in a publicly accessible repository.

      Additionally, while it is generally clear how the data were collected and curated, the rationale for using preprocessed datasets, particularly those sourced from external repositories, could be more clearly explained. The data are shared via a Google Drive link provided in the GitHub repository, which is convenient, though it may benefit from a more transparent and persistent form of distribution. The manuscript states that "All data used in this manuscript is publicly available and can be found at Liu et al. [11], Cheow et al. [16], Demetci et al. [12], Chen et al. [17], Cao et al. [14], and Samaran et al. [13].", but it appears that preprocessed versions of these datasets were used, rather than the original raw data. Clarifying this point would help improve transparency and reproducibility.

      The manuscript also describes custom preprocessing procedures for scRNA-seq and scATAC-seq data, including PCA, TF-IDF normalization, and gene filtering, that appear inconsistent with the properties of the datasets used. Without access to preprocessing scripts or further clarification, it is unclear whether these procedures were performed as described. Clarifying these discrepancies would strengthen transparency and ensure fair benchmarking comparisons. In addition, to improve transparency and reproducibility, it would be helpful to provide the scripts or commands used to run these baseline methods, along with the evaluation code for computing the reported metrics.

      Finally, several methodological details underlying downstream analyses are insufficiently described to allow confident reproduction or interpretation. For instance, it is unclear which dataset was used to obtain the results in "GROTIA Reveals Gene-Specific Contributions and Key Biological Processes in the RNA Embedding" and Figures 4 and 5. Additionally, the description of the motif discovery step using GimmeMotifs should be expanded, since it is currently not entirely clear how motifs were matched to known transcription factors, and the process described in the text does not fully align with what is shown in Figure 5A. Clarifying these points would help improve the reproducibility and interpretability of the manuscript's key biological findings.

      2. Usability

      The code repository is easy to find on GitHub, available under the MIT license, following the link presented in the manuscript. However, the currently presented implementation is provided as a Jupyter Notebook that demonstrates the basic usage of the method, and technically allows users to replicate the process using their own data. Usability is currently limited by sparse documentation and could benefit from guidance on input requirements, parameter configuration, and expected output formats. To improve usability, the authors should supplement the notebook with detailed explanations, comments, and a README or user guide that explains how to prepare input data, adjust key parameters, interpret outputs, and run the method on other datasets. Wrapping core functionality into a small, importable Python module or script would further reduce friction for adoption and integration into pipelines.

      3. Attribution and Software Transparency

      The GitHub repository includes an evals.py script originally authored by the creators of SCOT (Pinar Demetci, Rebecca Santorella, and Ritambhara Singh), with attribution preserved within the file. However, the manuscript itself does not mention that components of the evaluation pipeline were adapted from this prior work. Given that this script supports benchmarking comparisons central to the paper's conclusions, explicit acknowledgment in the text would improve transparency and ensure appropriate credit is given.

      4. Support for Claims and Biological Interpretation

      Several key claims would benefit from additional evidence or clarification. I divide this into subsections "4a. Methodological Claims," "4b. Biological Interpretation," and "4c. Clustering Evaluation" for extra clarity and readability.

      4a. Methodological Claims

      - The claim "we selected the latent dimension to be either 5 or 8 and observed that GROTIA remained robust to this choice" is not substantiated by any reported results or sensitivity analysis.

      - The claim that GROTIA is computationally efficient would be more compelling if runtime comparisons included system specifications, analysis on larger (potentially synthetic) datasets, memory usage, and scalability assessments across CPU and GPU modes. Directly referencing Table A1 for the current runtime evaluation and adding the additional metrics mentioned above would provide a more comprehensive evaluation.

      - The manuscript asserts "Notably, unlike methods that require shared features across modalities, GROTIA only assumes that cells (rather than individual genes or peaks) follow a similar distribution if they belong to the same type or lineage—thus broadening its applicability to complex datasets." This claim would be more convincing if supported by analyses on more complex datasets, such as those with technical variability across origin sites, donors, or protocols; mosaic structures with missing observations; nested batch effects; or significant differences in data quality. Additionally, this statement may appear in tension with the claim that GROTIA depends on the presence of a shared underlying biology, which would not hold in many complex or heterogeneous settings. Clarifying how "complexity" is defined in the context of GROTIA's assumptions, and empirically substantiating the method's generalizability to such settings would improve both the precision and credibility of this claim.

      - While the manuscript assesses alignment quality using Fraction of Samples Closer Than the True Match (FOSCTTM) and Label Transfer Accuracy (LTA), capturing local alignment and biological label concordance, these metrics do not directly evaluate preservation of global structure. Since GROTIA is designed to balance both global and local alignment, it would be helpful to include an explicit global alignment metric to confirm that this objective is being met. Some of the provided figures (e.g., Fig. 2c, right panel, and Fig. 3b after alignment) suggest global structure is preserved, but incorporating a dedicated metric or discussion would strengthen the evidence and provide a more complete evaluation of alignment quality.

      - Likewise, the manuscript states that GROTIA employs orthogonality constraints within the Reproducing Kernel Hilbert Space (RKHS) to enhance interpretability and stability. The use of these constraints for interpretability is illustrated through feature importance analyses; however, there is no direct comparison showing that this approach yields improved interpretability relative to unimodal analyses. Additionally, the effect of orthogonality constraints on embedding stability is not clearly assessed. Providing empirical evidence that these constraints improve the consistency of the embeddings or the quality of feature discovery, particularly in relation to single-modality methods, would help confirm the added value of this design choice and support several of the broader claims made regarding marker gene discovery and cell population characterization.

      - The decision to exclude scConfluence from the scGEM and SNARE evaluations due to prior dimensionality reduction could be better substantiated. Since raw data for both datasets are publicly available (e.g., SNARE-seq on GEO, scGEM on SRA), it would be helpful to explain why reprocessing the data was not feasible or appropriate.
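For readers unfamiliar with the FOSCTTM metric named above, the following is a minimal sketch of how it is commonly computed, not the manuscript's or SCOT's actual evaluation code. It assumes that row i of one embedding is the true match of row i of the other, averages both directions, and normalizes by n - 1; the function name `foscttm` and that normalization are illustrative assumptions, and implementations differ on such details.

```python
import math


def foscttm(x, y):
    """Mean fraction of samples closer than the true match (lower is better).

    x and y are equal-length sequences of embedding vectors.
    Assumption: x[i] and y[i] are the two views of the same cell.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally sized sets with at least 2 samples")

    def one_direction(a, b):
        total = 0.0
        for i in range(n):
            d_true = math.dist(a[i], b[i])
            # Count samples in the other modality that lie strictly closer
            # to a[i] than its true match b[i].
            closer = sum(
                1 for j in range(n)
                if j != i and math.dist(a[i], b[j]) < d_true
            )
            total += closer / (n - 1)  # assumed normalization
        return total / n

    # Average the two directions so the score is symmetric in x and y.
    return 0.5 * (one_direction(x, y) + one_direction(y, x))
```

A score of 0 means every cell's true match is its nearest neighbour in the other modality. Because the score depends only on each cell's neighbourhood ranking, it says little about global structure, which is the gap the comment above points to.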

      4b. Biological Interpretation

      - The reasoning in the statement "Notably, GROTIA requires no a priori matching of features across modalities, so these dimension-specific drivers offer an unbiased method to uncover potential marker genes" is somewhat unclear. While the method's ability to operate without explicit feature matching is a strength, it would be helpful to clarify how this property directly leads to unbiased marker discovery. In particular, elaborating on how the dimension-specific drivers compare to features identified through unimodal or matched-feature approaches would strengthen the interpretation.

      - Several statements related to cell-type-specific gene expression, such as "LYZ, ZEB2, PLXDC2 are highly expressed in monocytes…", would benefit from appropriate citations. This applies to other claims throughout the manuscript regarding gene specificity for particular lineages or subtypes.

      4c. Clustering Evaluation

      - The claim that GROTIA achieves "comparable or better performance" than Louvain clustering is not fully supported. While ARI/NMI scores of 0.75-0.8 indicate reasonable alignment with reference annotations, clarity on how ground truth (reference) labels were defined, whether Louvain resolution parameters were tuned, and which dataset(s) were used would strengthen this comparison. Additionally, specifying which co-clustering algorithm was used from the cited Python package, along with its parameter settings, would improve reproducibility and interpretability.

      - The claims that GROTIA can uncover finer structures and novel cellular states, as well as identify refined subpopulations aligned with major cell types, are intriguing but would benefit from additional support. As currently presented, the results do not highlight specific novel cell populations or provide examples of newly discovered subclusters.

      5. Writing, organization, tables, figures, and minor notes

      2. There is a typo in the heading "GROTIA integrated simulated datasets in both semi and unsuperviseed setting" where unsuperviseed should be unsupervised.

      3. Under this heading, the section describing Figure 2a in paragraph two and paragraph three largely overlap.
      4. The results and interpretation of Figure 2 panel b and c are not described to the reader. The same is true for Figure 3 panels b and c.
      5. In Figure 3, the method is still labeled as GROT instead of GROTIA; this should be updated for consistency.
      6. In Figure 3, the abbreviations Semi Acc and Un Acc are not defined in the legend and should be clearly explained.
      7. In Figure 3, the visual layout in panel b differs between datasets and may be confusing: for scGEM and SNARE-seq, the left and right columns represent cell types from each modality, whereas for PBMC, they reflect cell type and modality origin from a single, combined dataset. The PBMC-style presentation is more effective for visually assessing global alignment and should either be used consistently or more clearly explained.
      8. In Figure 3, legends are also missing descriptions of the color schemes used to denote modality.
      9. In panel c of both Figures 2 and 3, it should be specified whether the results correspond to semi-supervised or unsupervised alignment.
      10. In statements such as "Figure 4b presents UMAP visualizations of the top gene expression patterns for Dimensions 1 and 3", the wording could be clarified to avoid confusion. Specifically, it would help to state that gene expression patterns are overlaid on a UMAP projection of the scRNA-seq data, and that the genes visualized were selected based on their importance in Dimensions 1 and 3 of the RBF kernel embeddings (not UMAP axes).
      11. In Figure 4 panel a, it appears that several genes from D1-4 have higher importance in D5. Is this due to scaling, or does it have some biological interpretation?
      12. In Figure 4, panel c, the colorbar should be labeled.
      13. In Figure 5, panel a, only chromosome identifiers are shown, making the peak information incomplete and difficult to interpret. Including specific peak coordinates would improve clarity.
      14. In Figure 5d, it is not clear how accessibility is quantified for a specific gene; this should be described in the Methods section and reiterated in the results description.
      15. While the context makes it clear, explicitly noting that SPI1 is also known as PU.1 could improve clarity for readers less familiar with the nomenclature.
      16. The explanation of the proposed regulatory relationship between CEBPB and KLF4 could be strengthened. The manuscript notes that both factors cooperate with PU.1, but no direct link between CEBPB and KLF4 is established, aside from their shared involvement in monocyte development and differentiation.
      17. The statement that "co-expression networks further link CLEC7A with an IRF8-centered module" would be more convincing with a supporting citation or additional methodological detail on how this link was established.
      18. The description of Figure 5c could be expanded. The current phrasing, "validated through literature. For instance, FOS are implicated as potential regulators of KLF4 in Dimension 1 and CEBPB of FCR1G in Dimension 2" would be better placed in the main text, supported by citations, and more clearly connected to the results.
      19. It would strengthen the interpretation if claims about the cell-type specificity of TF-target pairs were explicitly linked to the expression patterns shown in Figure 5, panel d.
      20. Figure 5 panel d is missing a label on the color bar.
      21. "gene" in "as potential regulators of Gene KFL4" in the legend of Figure 5 should not be capitalized.
      22. The section identifier is missing from the statement "For further details, please refer to Section ."
      23. Figure 6, panel b is missing a label for RNA on the y-axis.
      24. The referencing in a few instances could be strengthened for clarity and accuracy. For example, the statement "Lots of computational methods have recently been developed to integrate data across multiple modalities [4, 5]" cites only two methods, which may not sufficiently support the breadth implied. Either citing additional representative methods or rephrasing the sentence to more accurately reflect the scope would improve the credibility of the claim.
      25. To further support the claim that "GROTIA delivers comparable or superior performance," the authors might consider including comparisons to other recent diagonal integration methods such as Pamona and the updated version of SCOT: SCOTv2.
    1. Summary: The chromatin accessibility landscape is the basis of cell-specific gene expression. We generated a multiorgan, single-nucleus chromatin accessibility landscape from the model organism Rattus norvegicus. For this single-cell atlas, we constructed 25 libraries via snATAC-seq from nine organs in the rat, with a total of over 110,000 cells. Cell classification integrating gene activity scores with known marker genes identified 77 cell types, which were strongly correlated with those in published mouse single-cell transcriptome atlases. We further investigated the enrichment of cell type- and organ-specific transcription factors (TFs), the dynamics of T-cell developmental trajectories across organs, and the conservation and specificity of gene expression patterns across species. These findings provide a foundation for further investigations of the cell composition and gene regulatory networks throughout the rat body. Highlights: Generation of a single-cell atlas of chromatin accessibility in nine organs of the rat; Characterization of cell type- and organ-specific transcription factors (TFs); Dynamics of chromatin accessibility in developing T cells revealed by cross-organ analysis; Conservation and specificity of gene expression patterns among humans, mice, and rats revealed by cross-species analysis. Competing Interest Statement: The authors have declared no competing interest.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag013), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 3:

      In this study, Ronghai Li and colleagues constructed an extensive multi-organ single-nucleus chromatin accessibility atlas of the model organism Rattus norvegicus. The authors generated a comprehensive dataset encompassing 115,723 single-nucleus chromatin accessibility profiles across nine organs (thyroid, thymus, heart, lung, liver, spleen, kidney, pancreas, and ovary). Each organ was profiled in duplicate or triplicate, thereby ensuring reproducibility and robustness of the dataset. The authors also performed rigorous preprocessing and filtering steps, which provided a high-quality foundation for downstream analyses.

      The downstream analyses were multifaceted and thoughtfully executed in the following steps: 1) Low-dimensional visualization of all cells within the atlas, represented by organs, which led to the identification of six major cell types; 2) A census of the distribution of major cell types across the analyzed organs; 3) Integration with existing mouse single-cell RNA-seq atlases to refine cell type annotations (77 cell subtypes) and ensure cross-species comparability; 4) Inference of transcription factor activities from open chromatin profiles, providing important insights into gene regulatory mechanisms.

      Building upon these analyses, the authors focused on shared and organ-specific features of endothelial and stromal cells, thereby highlighting both conserved and divergent regulatory programs. Finally, through integration of human, rat, and mouse scRNA- and scATAC-seq atlases for the heart and kidney, they investigated cross-species similarities and differences in gene expression and regulatory patterns, further strengthening the relevance and translational potential of this resource.

      In my opinion, the manuscript by Ronghai Li et al. is well written, the data are of very high quality, and the study represents a significant data resource. The rat (Rattus norvegicus) is a widely used and indispensable model organism in biomedical research, particularly for studies related to disease onset, progression, and therapeutic development. The generation of this single-nucleus chromatin accessibility atlas, together with the comprehensive analyses provided, constitutes a valuable resource for dissecting organ- and tissue-specific regulatory landscapes. This work not only enhances our understanding of gene regulation across organs but also facilitates cross-species comparisons that will be of great importance for translational research. I therefore strongly support acceptance of the manuscript by Ronghai Li et al. for publication in GigaScience, contingent only upon minor revisions as outlined below:

      Minor revisions:

      1) While the data preprocessing and analysis steps are clearly described in the Methods section, the code used for data analysis is currently available only upon request. I believe that future readers and, in particular, potential users of this valuable resource would greatly benefit if the analysis code were made publicly accessible. Open availability of the code would not only enhance transparency and reproducibility but also facilitate broader adoption of the dataset. This is especially important as new single-cell ATAC-seq and RNA-seq datasets become available, since ready access to the analysis pipeline will accelerate and streamline future studies that build upon the provided atlas.

      2) Line 131: "Stromal cells, immune cells, and endothelial cells from different organs tend to be clustered together (i.e., by cell type) rather than clustered according to the organ of origin or sample batch (Figure 1E)." and Line 137: "However, these cells tended to cluster by organ rather than by cell type (Figure 1E)."

      This observation can be clearly seen in the UMAPs (Figure 1C-D), but not in the census plot (Figure 1E). I therefore recommend revising the figure reference to (Figure 1C-D) to improve accuracy and clarity for the reader.

      3) Throughout the manuscript, the authors often assess whether cells cluster in a cell type-specific or organ-specific manner. For immune and epithelial cells, however, the conclusions can be confounded, as they vary depending on the analytical method used (e.g., scATAC-seq alone versus integration with scRNA-seq data). Could the authors elaborate on the robustness of these observations with respect to the choice of UMAP parameters, particularly the number of features included and the dimensionality of the LSI applied?

      4) The authors used the CIS-BP database of transcription factor motifs to assess cell type-specific transcription factor activities. However, JASPAR, which is a curated database, is more commonly used in scATAC-seq studies. Could the authors clarify the rationale for choosing CIS-BP over JASPAR?

    2. Summary: The chromatin accessibility landscape is the basis of cell-specific gene expression. We generated a multiorgan, single-nucleus chromatin accessibility landscape from the model organism Rattus norvegicus. For this single-cell atlas, we constructed 25 libraries via snATAC-seq from nine organs in the rat, with a total of over 110,000 cells. Cell classification integrating gene activity scores with known marker genes identified 77 cell types, which were strongly correlated with those in published mouse single-cell transcriptome atlases. We further investigated the enrichment of cell type- and organ-specific transcription factors (TFs), the dynamics of T-cell developmental trajectories across organs, and the conservation and specificity of gene expression patterns across species. These findings provide a foundation for further investigations of the cell composition and gene regulatory networks throughout the rat body. Highlights: Generation of a single-cell atlas of chromatin accessibility in nine organs of the rat; Characterization of cell type- and organ-specific transcription factors (TFs); Dynamics of chromatin accessibility in developing T cells revealed by cross-organ analysis; Conservation and specificity of gene expression patterns among humans, mice, and rats revealed by cross-species analysis. Competing Interest Statement: The authors have declared no competing interest.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag013), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1:

      Li et al. present a manuscript in which they generated a snATAC-seq atlas of 9 major organs in the adult rat and integrated the atlas with mouse and human scRNA-seq and scATAC-seq data, revealing that chromatin accessibility is largely conserved between cell types across species and that there is also tissue-specific regulation in some cell types even when they are common across several tissues. Overall, this looks like a great, carefully analysed and annotated resource that would be useful for the community. I appreciate the amount of work that went into curating and analysing this dataset, and I thought that the manuscript was very well written and clear.

      I think the most interesting finding is in Figure 3, where the authors found unique TFs regulating the same cell types in different organs. However, the analysis ends abruptly after listing these TFs. Can the authors comment on the functional consequences or associations of these tissue-specific TFs, perhaps in the discussion?

      The raw data is deposited in a database and can be openly downloaded, but I find that the lack of processed data (e.g., processed and labelled expression matrices or objects) may prevent the adoption of this data by the community, as it is a lengthy process to reach the authors' conclusions. The authors might also want to consider incorporating an interactive platform for users to explore and navigate this dataset.

      While I appreciate that the authors have detailed in their manuscript how they performed the data analysis, I would still encourage the authors to upload their scripts/notebooks to an open code repository; otherwise, again, adoption by the community would be prohibitively difficult as it stands.

    1. As films like Bonnie and Clyde and Who’s Afraid of Virginia Woolf? (1966) tested the limits on violence and language, it became clear that the Production Code was in need of replacement. In 1968, the MPAA adopted a ratings system to identify films in terms of potentially objectionable content. By providing officially designated categories for films that would not have passed Production Code standards of the past, the MPAA opened a way for films to deal openly with mature content. The ratings system originally included four categories: G (suitable for general audiences), M (equivalent to the PG rating of today), R (restricted to adults over age 16), and X (equivalent to today’s NC-17). The MPAA rating system, with some modifications, is still in place today. Before release in theaters, films are submitted to the MPAA board for a screening, during which advisers decide on the most appropriate rating based on the film’s content. However, studios are not required to have the MPAA screen releases ahead of time; some studios release films without the MPAA rating at all. Commercially, less restrictive ratings are generally more beneficial, particularly in the case of adult-themed films that have the potential to earn the most restrictive rating, the NC-17. Some movie theaters will not screen a movie that is rated NC-17. When filmmakers get a more restrictive rating than they were hoping for, they may resubmit the film for review after editing out objectionable scenes. Source: Kirby Dick, interview by Terry Gross, Fresh Air, NPR, September 13, 2006, http://www.npr.org/templates/story/story.php?storyId=6068009.

      The MPAA replaced the Production Code: instead of censoring movies, it started classifying them into different categories.

    1. OK: I went to the University of Washington and [then] I got hired by this company called Geoworks, doing assembly-language programming, and I did it for five years. To us, the Geoworkers, we wrote a whole operating system, the libraries, drivers, apps, you know: a desktop operating system in assembly. 8086 assembly! It wasn't even good assembly! We had four registers! [Plus the] si [register] if you counted, you know, if you counted 386, right? It was horrible. I mean, actually we kind of liked it. It was Object-Oriented Assembly. It's amazing what you can talk yourself into liking, which is the real irony of all this. And to us, C++ was the ultimate in Roman decadence. I mean, it was equivalent to going and vomiting so you could eat more. They had IF! We had jump CX zero! Right? They had "Objects". Well we did too, but I mean they had syntax for it, right? I mean it was all just such weeniness. And we knew that we could outperform any compiler out there because at the time, we could! So what happened? Well, they went bankrupt. Why? Now I'm probably disagreeing – I know for a fact that I'm disagreeing with every Geoworker out there. I'm the only one that holds this belief. But it's because we wrote fifteen million lines of 8086 assembly language. We had really good tools, world class tools: trust me, you need 'em. But at some point, man... The problem is, picture an ant walking across your garage floor, trying to make a straight line of it. It ain't gonna make a straight line. And you know this because you have perspective. You can see the ant walking around, going hee hee hee, look at him locally optimize for that rock, and now he's going off this way, right? This is what we were, when we were writing this giant assembly-language system. Because what happened was, Microsoft eventually released a platform for mobile devices that was much faster than ours. OK? And I started going in with my debugger, going, what? What is up with this? This rendering is just really slow, it's like sluggish, you know. And I went in and found out that some title bar was getting rendered 140 times every time you refreshed the screen. It wasn't just the title bar. Everything was getting called multiple times. Because we couldn't see how the system worked anymore!

      This is a well-known passage by Yegge explaining why GeoWorks didn't succeed, and why he holds their object-oriented assembly responsible for that failure.

      A pithier analogy for the bigger picture that Yegge is trying to communicate here (something like what Engelbart struggled to get people to understand, which led to his metaphor of a pencil taped to a brick) would be: "There's a reason why programmers aren't writing code on punch cards anymore (and it's not just because the market for the equipment and, accordingly, the equipment itself, all but vanished and is no longer available)."

    1. While this page's design is highly irritating with respect to readability, it asks a good question about the basic layout of feed readers. And the app looks very nice. Cf. [[Mijn ideale feedreader 20180703063626]] and Fraidycat with its sparklines. I'd like heatmaps across communities etc.

      Cf. [[Claude code workshop Frank]] last Friday, where I started implementing some things

      n:: phantom obligation as a design choice that gives you chores (inbox zero etc) but really is not an obligation

    1. Briefing: Analysis of a Femicide and Its Repercussions for the Family

      This briefing analyzes the facts, the course of the trial, and the human consequences of a femicide committed within a family, based on the children's testimony and the evidence presented at the hearing before the assize court (cour d'assises).

      Executive Summary

      The case concerns the murder of a mother by her husband, committed in front of their children.

      The trial, held three years after the events, brought to light a long-standing climate of domestic violence, marked by police reports from 2012 that were never followed up.

      The verdict sentenced the accused to 20 years' imprisonment (réclusion criminelle), together with the complete withdrawal of parental authority over his minor son.

      The case highlights the deep trauma suffered by children orphaned by femicide, their sudden precarity, and the systemic failures in the handling of domestic violence.

      --------------------------------------------------------------------------------

      Chronology and Key Facts

      The following table summarizes the key events mentioned in the case file:

      | Period | Event | Details |
      | --- | --- | --- |
      | 2012 | Initial violence | Police call-outs for disputes; report of violence during pregnancy (attempted forced abortion). |
      | Day of the killing | The murder | The victim is stabbed to death in the family kitchen in front of her 9-year-old son and her 19-year-old daughter. |
      | Aftermath | Care arrangements | The apartment is kept under seal for 2.5 years; the children are placed with their maternal aunt. |
      | 3 years later | The trial | The children testify in court; confrontation with the father. |
      | Verdict | Sentence | 20 years' imprisonment; withdrawal of parental authority; damages. |

      --------------------------------------------------------------------------------

      Analysis of the Climate of Domestic Violence

      The testimony of the children and of those close to the family shows that the murder was not an isolated act but the culmination of a pattern of control and chronic violence.

      The 2012 Precedents

      Police log entries (mains courantes) filed ten years before the killing show that warning signs existed:

      Physical violence: The victim had reported being struck in the abdomen while she was pregnant.

      Confinement and threats: She had reported being held against her will by her in-laws to force her to have an abortion, and receiving death threats against herself and her children.

      Institutional response: At the time, mains courantes did not systematically lead to an investigation if the victim did not file a formal complaint, a practice that is now, in theory, prohibited in domestic violence cases.

      Daily Life under Coercive Control

      The eldest daughter describes a system of constant surveillance:

      Obsessive control: The husband demanded to see shopping receipts and made video calls to check his wife's location.

      Psychological and sexual violence: Rape and constant belittling were raised at the hearing.

      Explicit death threats: The accused allegedly said: "If you divorce me, I'll kill you."

      --------------------------------------------------------------------------------

      Collateral Victims: The Orphans

      The document highlights the devastating impact on the couple's three children (two sons and one daughter).

      Psychological Trauma

      Direct witnesses: The younger son saw his mother on the floor in a pool of blood while the father fled with his keys.

      Early responsibility: During the call to the emergency services (which lasted 9 minutes), the eldest daughter had to attempt emergency care on her dying mother in front of her brother.

      Sense of loss: The children describe an "immense void" and persistent anger. The younger son says he lost both of his attachment figures at once.

      Precarity and Disrupted Lives

      Housing: With the family apartment under seal for more than two years, the children lost access to their personal belongings.

      Financial situation: The eldest daughter, a student at the time, had to give up her studies to work and came close to being evicted from her home.

      Break-up of the siblings: While the younger son lives with his aunt, one of the brothers moved to Australia to distance himself from the tragedy, and the eldest daughter now lives alone.

      --------------------------------------------------------------------------------

      The Course of the Hearing

      The Accused's Line of Defense

      The accused maintained a version of events centered on a spontaneous loss of control:

      The "blackout": He claims not to remember the precise moment he struck with the knife, speaking of a "black space".

      Reversal of blame: He suggested that the victim had provoked him with unbearable words, attempting to present the act as the result of an emotional "short circuit".

      Denial of violence: He denied the earlier violence, despite the family's testimony and the mains courantes.

      Confrontation on the Stand

      The younger son, aged 12 at the time of the trial, showed great determination:

      • He asked to speak several times to contradict his father's statements, accusing him of lying about the circumstances of the killing.

      • He expressed his wish to remove his father's name from his identity and bear only his mother's.

      --------------------------------------------------------------------------------

      Verdict and Legal Implications

      The cour d'assises handed down a decision that, although below the prosecution's request, includes significant civil measures.

      1. Criminal Sentence: 20 years' imprisonment (réclusion criminelle); the prosecutor (avocat général) had requested 30.

      2. Civil Measures:

      ◦ Full withdrawal of parental authority over the minor son, under Article 378 of the Civil Code.

      ◦ An interim award of 20,000 euros in damages for each child, pending a psychiatric assessment to determine the final harm.

      3. Reaction of the Civil Parties: The family expressed some disappointment at the length of the sentence, considered insufficient given the suspected premeditation (the child had seen his father sharpening knives shortly before the events).

      --------------------------------------------------------------------------------

      National Perspective

      The case reflects a systemic problem in France. Between 2011 and 2024, about 1,500 children were orphaned following a killing within the couple.

      The document stresses the need for the "protocole féminicide" (femicide protocol), which aims to protect and support these minors whose lives are upended in an instant by one parent's act against the other.

    1. School Rules (Règlement Intérieur): Analysis of the Framework, the Stakes, and Enforcement

      Executive Summary

      The règlement intérieur (internal rules) is the cornerstone of how a school operates, whether public or private, from primary school to higher education.

      Far more than an administrative document, it defines the rights and duties of the whole educational community: pupils, parents, teachers, and staff.

      Approved by the governing board (conseil d'administration) and the recteur d'académie, it acts as law within the school.

      Its effectiveness, however, depends on rigorous enforcement and the cooperation of everyone involved.

      With a constantly changing school population, the main challenge is moving from passive acceptance to a pedagogical understanding of the rules, in order to guarantee a safe and neutral learning environment.

      --------------------------------------------------------------------------------

      1. Nature and Legal Status of the Règlement Intérieur

      Internal rules are not specific to schools; they belong to a wider legal framework, notably the PACTE law, under which, since January 2020, they are mandatory in any company with at least 50 employees.

      An Establishment-Level "Law"

      Although the règlement intérieur is not a rule that applies across the national territory, it is binding internally:

      Legal definition: According to the Larousse dictionary, a law is a "rule established by the sovereign authority of the state, applicable to all and defining the rights and duties of each person."

      Authority: Once adopted by the governing board (Conseil d'Administration, CA) and administratively approved by the recteur d'académie, the document acts as law within the institution.

      Acceptance: Unlike a contract, it is not negotiable.

      Enrolling a pupil in the school amounts to automatic acceptance of the rules.

      The signature requested from families is a formality confirming that they have read the rules and undertake to respect them.

      Scope

      The document defines the rights and duties of each member:

      Pupils and families: They are the first to be affected by the rules of school life and the sanctions.

      Staff and teachers: They are required to represent the school's values and enforce the rules, in accordance with their employment contracts.

      --------------------------------------------------------------------------------

      2. Drafting and Accessibility of the Framework

      The règlement intérieur is drawn up through a rigorous democratic and administrative process intended to reflect the organization of the whole school.

      | Process Step | Actors Involved |
      | --- | --- |
      | Consultation | Leadership team and educational community. |
      | Review and Vote | Governing board (including representatives of each department and elected pupils). |
      | Final approval | Recteur d'académie. |

      Accessibility of Information

      To be enforceable, the rules must be easy to consult. They are usually found:

      • In the pupil's correspondence booklet (carnet de correspondance), mainly in collège.

      • Posted at key points of passage (staff room, high-traffic areas).

      • In the software used to manage absences and lateness.

      Special case: Documentation and information centers (CDI) and boarding facilities often have their own specific rules, adapted to their particular context.

      --------------------------------------------------------------------------------

      3. Content and Educational Aims

      The règlement intérieur aims to be exhaustive so as to avoid any "legal vacuum" that could lead to disputes or disagreements.

      Areas Covered

      Safety and Discipline: Disciplinary sanctions, illicit substances, use of violence.

      School Life: Dress (appropriate to the context), protection of personal belongings against theft.

      Digital Tools: Mobile phone use is prohibited in primary school and collège.

      In lycée, it is allowed unless the rules state otherwise. The restriction is intended to protect pupils' ability to concentrate.

      Physical Safety: Certain jewelry and piercings are prohibited, particularly in physical education (EPS), to prevent injuries.

      Purposes of the Institution

      The regulatory framework aims to keep the school neutral ground. Its purpose is to:

      1. Educate young people and give them a sense of responsibility.

      2. Prepare them for working life through formal and informal learning.

      3. Set clear limits where parental vigilance may be lacking.

      --------------------------------------------------------------------------------

      4. The Challenges of Practical Enforcement

      There is a significant gap between theory (the written document) and daily practice in schools.

      Resistance from Pupils and Parents

      The school population (pupils and parents) is changing, voicing new demands and sometimes showing a lack of limits.

      Rules on tattoos, piercings, or dress are often perceived as abusive or sexist, even though they respond to safety requirements or the need to fit the setting.

      The Risk of Laxity

      The effectiveness of the rules depends on the solidarity of the educational community:

      Team cohesion: If the leadership does not apply the sanctions set out in rules it put to the vote itself, it is perceived as lax.

      Consequences: A lack of firmness or of support for staff causes suffering for those left unprotected within the institution.

      --------------------------------------------------------------------------------

      5. Outlook: Toward Greater Accountability

      So that the rules are not seen as a mere "piece of paper" to be broken, several pedagogical avenues are being considered.

      Pedagogical Dialogue: Open a discussion with pupils about why shared rules are needed.

      The aim is not to rewrite the text, but to guide pupils toward understanding the rule by exercising their objectivity.

      Collective Involvement: Involve pupils in thinking about how the rules are applied from primary school onward, to foster long-term commitment.

      Stability of the Framework: Despite outside pressures (lack of resources, a decline in common sense), maintaining a defined framework remains essential for working in good conditions and guaranteeing neutrality and tolerance.

      In conclusion, the règlement intérieur is a tool for protection and learning that works only through the active cooperation of everyone involved.

      Its legitimacy rests on its ability to evolve with its public while remaining a firm and protective foundation.

    1. Failures and Challenges of the Périscolaire (Out-of-School-Hours Care) System: A State of Play

      Operational Summary

      The périscolaire sector in France is going through a deep crisis, brought to light by recent journalistic investigations, notably that of Cash Investigation.

      Long regarded as the "blind spot" of the school system, the sector is marked by serious systemic failures: sexual violence, brutal supervision methods, and precarious recruitment.

      The finding is unequivocal: the social and political disregard for the job of animateur (activity leader) has produced a "dumping-ground profession" in which children's safety and well-being are sometimes compromised.

      A shift toward genuine professionalization, through qualifying diplomas rather than the BAFA alone, and better working conditions (pay, preparation time, stable teams) appear to be the essential levers for restoring the "sanctuary" of the school.

      --------------------------------------------------------------------------------

      1. A Sector Under High Tension: Violence and Dysfunction

      Recent press revelations have broken the omertà surrounding sexual violence (molestation, rape) committed inside schools during périscolaire time (lunchtime, evenings, Wednesdays).

      The Nature of the Failures

      Haphazard staff management: The investigation highlights cases where dangerous individuals were not removed but simply "moved" from one facility to another.

      Ordinary Educational Violence (VEO): Beyond sexual crimes, hidden-camera footage reveals daily bullying: shouting, bans on talking, withholding meals, or switching off the lights to obtain quiet.

      Denial and reporting failures: There is a systemic blockage in passing information up the chain. Precarity and turnover prevent team cohesion, making reporting harder because of fear of consequences or a perceived lack of legitimacy.

      Media and Political Impact

      The Cash Investigation report had immediate consequences:

      • Referral to the public prosecutor by the Ministry of National Education.

      • Suspension of animateurs by the City of Paris.

      • Opening of administrative investigations.

      --------------------------------------------------------------------------------

      2. The Structural Precarity of the Animateur's Job

      The sector suffers from a lack of financial resources and from social devaluation, both of which directly affect the quality of care.

      "Catastrophic" Working Conditions

      | Precarity Factor | Description and Consequences |
      | --- | --- |
      | Pay | Described as "poverty wages"; many animateurs live below the poverty line. |
      | Involuntary Part-Time Work | Split shifts (morning, lunchtime, evening): a 12-hour span for only 4 to 6 paid hours. |
      | Preparation Time | Often unpaid or limited (sometimes 15% of working time). Without preparation, activities turn into mere "childminding." |
      | High Turnover | The difficulty of the job drives staff out of the sector, preventing educational projects from taking root. |

      Recruitment "on the Fly"

      Staff shortages force municipalities to recruit in a hurry. A journalist working undercover was given an induction of only 6 minutes 30 seconds before being put in charge of children, with no presentation of the code of ethics and no prior training.

      --------------------------------------------------------------------------------

      3. The Debate over Professionalization and Training

      A major point of tension is the distinction between animation as a student "side job" and animation as a professional career.

      BAFA vs. Professional Diplomas

      The BAFA (Brevet d'Aptitude aux Fonctions d'Animateur): Designed for temporary involvement (16-year-old lycée pupils, students).

      It costs municipalities about €1,000. Although formative on a human level, it is considered insufficient for day-to-day, year-round professional care.

      Professional Diplomas: Ranging from the CAP to the Master's degree, they cost between €3,000 and €9,000.

      They guarantee skills in child psychology, conflict management, and educational design.

      Regulatory Tolerance: The Code de l'action sociale allows 20% of staff to hold no diploma and 50% to hold the BAFA.

      Experts are calling for a law requiring a majority of qualified professionals.
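The staffing shares quoted above lend themselves to a small worked check. The sketch below is only an illustration: it assumes the 20% and 50% figures are maximum shares of a team, and the `Team` class and both function names are hypothetical, not taken from any regulation or tool.

```python
from dataclasses import dataclass

# Caps quoted in the text, in percent of the team.
# Assumption: both figures are read as maximum shares.
MAX_UNQUALIFIED_PCT = 20
MAX_BAFA_PCT = 50


@dataclass(frozen=True)
class Team:
    """Hypothetical head count of one team, by qualification."""
    unqualified: int
    bafa: int
    professional: int

    @property
    def size(self) -> int:
        return self.unqualified + self.bafa + self.professional


def within_tolerance(team: Team) -> bool:
    """True if the team respects both caps quoted in the text."""
    if team.size == 0:
        raise ValueError("team must have at least one member")
    # Integer arithmetic avoids rounding issues at the exact thresholds.
    return (
        team.unqualified * 100 <= MAX_UNQUALIFIED_PCT * team.size
        and team.bafa * 100 <= MAX_BAFA_PCT * team.size
    )


def has_professional_majority(team: Team) -> bool:
    """True if professional-diploma holders are more than half of the team."""
    if team.size == 0:
        raise ValueError("team must have at least one member")
    return team.professional * 2 > team.size


team = Team(unqualified=2, bafa=5, professional=3)
print(within_tolerance(team))           # True: exactly at both caps
print(has_professional_majority(team))  # False: 3 of 10 are professionals
```

Under that reading, a team sitting exactly at both caps has only 30% professionally qualified staff, which is the gap the call for a "majority of qualified professionals" targets.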

      --------------------------------------------------------------------------------

      4. A Conflict-Prone Institutional Duality

      Périscolaire services share the same premises as the school, but without any real dialogue.

      The "fifth wheel": Animateurs often feel looked down on by the teaching staff. They are seen as less trained and less legitimate, despite a complementary educational mission.

      Lack of educational continuity: Although schemes such as the Projets Éducatifs de Territoire (PEDT) exist, their implementation depends on local political will.

      Without joint meetings, children experience a break between school time (seated, silent) and périscolaire time (often noisy and disorganized).

      --------------------------------------------------------------------------------

      5. Prospects for Reform: The Political-Will Model

      The investigation shows that improvement is possible with a moderate financial investment.

      The Example of the commune d'HMO: By increasing its périscolaire budget by 8%, this municipality was able to make full-time positions the norm for animateurs (topping up their hours with other municipal duties) and to pay for preparation time.

      Toward a mandatory responsibility: Unlike schooling, périscolaire provision is not a mandatory responsibility for municipalities. Making it mandatory would allow national standards of quality and supervision to be imposed.

      The role of "co-educator": The animateur should be recognized as a reference person for children and families, able to develop skills (autonomy, self-confidence) that the strict school setting does not always allow pupils to explore.

      Key Quotes:

      • "The périscolaire is clearly the blind spot of the school."

      • "It is the sanctuary being shattered." (On violence at school)

      • "If it upsets you [that a child is crying], this job isn't for you." (Words of an employee caught on hidden camera)

      • "What is our children's future worth?"

    1. Almost all of the sources of the gpcp tools are written in Component Pascal. It is not possible to use gpcp until you have executable code of the tools. And you cannot build the executables from this source repository unless you already have a Component Pascal compiler. Most users should start by downloading a recent release.
    1. AI fatigue is real and nobody talks about it

      Summary of "AI Fatigue is Real"

      • The Productivity Paradox: AI significantly speeds up individual tasks (e.g., turning a 3-hour task into 45 minutes), but this doesn't lead to more free time. Instead, the baseline for "normal" output shifts, and the work expands to fill the new capacity, leading to a relentless pace.
      • From Creator to Reviewer: Engineering work is shifting from "generative" (energizing, flow-state tasks) to "evaluative" (draining, decision-fatigue tasks). Developers now spend their days as "quality inspectors" on an unending assembly line of AI-generated code.
      • The Cost of Nondeterminism: Engineers are trained for determinism (same input = same output). AI’s probabilistic nature creates a constant cognitive load because the output is always "suspect," requiring more rigorous review than code written by a trusted human colleague.
      • Context-Switching Exhaustion: Because tasks are "faster," engineers now touch 6–8 different problems a day instead of focusing on one. The mental cost of switching contexts so frequently is "brutally expensive" for the human brain.
      • Skill Atrophy: Much like GPS has weakened our innate sense of direction, over-reliance on AI coding tools can cause core technical reasoning and mental mapping of codebases to atrophy.
      • Strategies for Sustainability:
        • Time-boxing: Setting strict timers for AI sessions to avoid "prompt spirals."
        • Separating Phases: Dedicating mornings to deep thinking and afternoons to AI-assisted execution.
        • Accepting "Good Enough": Setting the bar at 70% usable output and fixing the rest manually to reduce frustration.
        • Strategic Hype Management: Ignoring every new tool launch and focusing on mastering one primary assistant.
    1. Briefing: Sexual Violence and Institutional Failures within the Éducation Nationale

      Executive Summary

      This document summarizes the revelations in the Pascal V. case at the lycée Bayen in Châlons-en-Champagne, along with the systemic failures of the Éducation nationale in dealing with sexual predators in its ranks.

      The analysis exposes a culture of omertà and "pas de vague" (don't make waves) that allowed a teacher to abuse dozens of pupils for more than 25 years.

      Despite repeated warnings from 1997 onward, the hierarchy (head teachers, inspectors, and recteurs) systematically failed to protect minors, putting protection of the institution and the presumption of innocence ahead of pupils' safety.

      The document also highlights the punitive treatment of whistleblowers, in contrast with the promotion of officials who failed in their professional and legal obligations (Article 40 of the Code of Criminal Procedure).

      --------------------------------------------------------------------------------

      1. The Pascal V. Case: A Predator Protected by the Institution

      Profile and Modus Operandi

      Pascal V., an agrégé teacher of literature and circus theory at the lycée Bayen, held sway over his pupils from 1997 to 2023.

      His intellectual, charismatic profile served as a shield:

      Intellectual manipulation: He used literature (notably Céline) to introduce crude sexual themes, and praised boys for their looks while disparaging the girls' femininity.

      Control and isolation: He invited pupils to his home on the pretext of discussing their work, forced them to accept him on social media, and demanded suggestive photos.

      Violence and abuse: Testimonies report rape, forced fellatio, confinement, and physical torture (paper clips under the fingernails, cut skin) under the guise of learning about pain.

      Use of drugs: Several victims suspect they were drugged during evenings at the teacher's home before suffering abuse of which they have only fragmentary memories.

      The Longevity of the Omertà

      The first reports date back to the late 1990s:

      1997: A conseiller principal d'éducation (CPE) and a mathematics teacher are told of a relationship between Pascal V. and a 16-year-old pupil. The teacher describes the act as "consensual."

      2000: During a trip to Italy, an outside contributor (Bernard Namura) comes across a pupil in distress and reports the facts to the colleagues present, who decide to "keep an eye on" the teacher with no further action.

      2016: The head teacher Catherine Corvélec receives an anonymous letter reporting Pascal V.'s suspicious behavior. Her superior advises mere "vigilance" for lack of tangible evidence.

      --------------------------------------------------------------------------------

      2. Failures in the Chain of Command (2021-2023)

      The fight led by Marie Jacard, a circus teacher and whistleblower, reveals the inability of the rectorat de Reims to react, even when faced with mounting evidence.

      The Role of Local Officials

      The report by the general inspectorate (IGÉSR), although confidential, points to serious failings:

      Sabine Bonet (head teacher): She is accused of not passing information up to her hierarchy between 2021 and 2023, refusing to hear former pupils' testimony, and playing down the case.

      The academic inspectors: They kept the alert within a "circle of three" with the head teacher, failing to inform the recteur's office and discrediting the whistleblower.

      Failure to apply Article 40: None of the officials informed made an immediate report to the prosecutor, a legal obligation for any public official who learns of a crime or offense in the course of their duties.

      Inaction by the Rectorat

      The recteur at the time, Olivier Brandouy, described the situation as a "local squabble" and invoked the presumption of innocence to justify the absence of an internal administrative investigation, even though such an investigation does not depend on a complaint being filed.

      --------------------------------------------------------------------------------

      3. Consequences: Impunity for Officials and Punishment of the Whistleblower

      An analysis of career paths after the scandal broke shows a reversal of responsibility:

      | Individual | Role in the case | Professional outcome |
      | --- | --- | --- |
      | Pascal V. | Alleged predator | Suicide in December 2023 before his arrest. |
      | Sabine Bonet | Head teacher (covered up the facts) | Promoted to a prestigious lycée in Évian. |
      | Olivier Brandouy | Recteur (inaction) | Promoted to deputy chief of staff to the Minister, then adviser to the Prime Minister. |
      | Inspectors | Withholding information | Awarded the Palmes académiques and honorary appointments. |
      | Marie Jacard | Whistleblower | Discredited, pushed into a transfer, lost 50% of her salary. |

      Marie Jacard suffered institutional harassment: disparagement of her teaching, accusations of personal rivalry with Pascal V., and a forced transfer with no post matching her skills.

      --------------------------------------------------------------------------------

      4. A Systemic Phenomenon: From Villefontaine to Châlons

      The Bayen case is not isolated. The Éducation nationale appears to follow a recurring pattern of protecting the institution at children's expense:

      The Romain Farina Case (2015): A primary school teacher raped more than 40 pupils over 15 years.

      The rectorat de l'Isère had known of a complaint since 2001, but the teacher was transferred six times and even rose to school principal. Farina also died by suicide, in prison.

      The Parliamentary Report (2025): Deputies Violette Spillebout and Paul Vannier denounce a "major failure of the State" in oversight and prevention.

      Their 330-page report, with 50 recommendations, calls for a structural overhaul of the ministry.

      --------------------------------------------------------------------------------

      5. Conclusions and Judicial Outlook

      Pascal V.'s suicide deprived the victims of a criminal trial and of judicial recognition of their status.

      In response:

      1. Administrative Action: The victims and their families are suing the State before the administrative court for gross negligence (faute lourde), targeting the administration's silence and organized inaction.

      2. Investigation for Failure to Report: Former recteur Olivier Brandouy is under investigation for failure to report a crime.

      3. Demand for Sanctions: Victims' groups are demanding that officials who failed to comply with Article 40 be dismissed rather than promoted, in order to break the cycle of institutional omertà.

      The institution is compared to the Church in its determination to place authority above truth, turning whistleblowers into "black sheep" to preserve an illusory image of stability.

    1. Guidance and the Pupil's Pathway: Analysis of a Paradigm Shift

      This briefing analyzes the talks and debates from the IH2EF's thematic program on guidance (orientation) and the pupil's pathway (parcours).

      It explores the profound changes in the French education system, from collège to higher education, with a focus on the "Bac -3 / Bac +3" continuum.

      --------------------------------------------------------------------------------

      Executive Summary

      School guidance should no longer be seen as a series of urgent administrative milestones, but as a smooth, continuous pathway that equips individuals with the skills to find their way throughout life.

      The shift from a logic of "regulating flows" to one of "building the individual" is at the heart of the current reforms (baccalauréat, vocational track, 2018 law).

      Key points to remember:

      Paradigm shift: Guidance gives way to the "pathway," a cross-cutting notion that concerns all pupils and removes barriers of time and place.

      Personalized support: The actors' roles are changing; the teacher becomes a central contributor to pupils' acquisition of self-guidance skills, although teachers say they need training.

      The Bac -3 / Bac +3 continuum: Success depends on better coordination between lycée (choice of specialties) and university (major/minor system, individual interviews).

      Legal certainty: Guidance is a strict administrative process in which dialogue with families and the statement of reasons for decisions are legally binding.

      --------------------------------------------------------------------------------

      I. Conceptual Evolution: From Guidance to Pathway

      The French education system is undergoing a major transformation in its philosophy of guidance, moving from a static model to a dynamic approach.

      1. Moving beyond the traditional model

      Historically, guidance relied on a "connexionniste" model that sought to mechanically match an individual's characteristics to those of a job.

      That model has shown its limits and has given way to a probabilistic approach in which a straight line from training to occupation is increasingly rare.

      2. The notion of pathway

      The pathway is defined as an organized route for acquiring knowledge and skills, including ethical and social dimensions.

      Universality: Unlike the old "crisis" guidance aimed at certain pupils, the pathway now concerns all students.

      Continuity: It unfolds over time, starting in 4e (the third year of collège) and continuing into higher education, underpinned by the idea of lifelong learning.

      Structuring reforms: This notion runs through the reform of the baccalauréat, the transformation of the vocational track, and the 2018 law on the freedom to choose one's professional future.

      --------------------------------------------------------------------------------

      II. Legal and Procedural Framework for Guidance

      Guidance is governed by the Code de l'éducation (Articles D331-23 and L331-7), which sets out a rigorous administrative framework.

      | Process Step | Responsibilities and Obligations |
      | --- | --- |
      | Dialogue | Results from consultation between the family and the administrative authority. |
      | Proposal | The class council (conseil de classe) makes proposals after reviewing the families' wishes. |
      | Decision | Taken exclusively by the head of the school (or a delegated deputy). |
      | Statement of reasons | Mandatory if the decision differs from the family's wish. A vague statement (e.g., "insufficient level") is grounds for annulment. |
      | Appeal | 3 working days (jours ouvrables) to refer the matter to the academic appeals commission (a mandatory prior administrative appeal). |

      Legal implications: The administrative court strictly reviews the authority's competence, the composition of the appeals commission, and respect for parents' right to be heard.

      Flawed guidance can engage the State's liability and lead to compensation.

      --------------------------------------------------------------------------------

      III. The Bac -3 / Bac +3 Continuum: Stakes and Schemes

      The transition between lycée and higher education is identified as a weak point requiring closer coordination among those involved.

      1. How the University Is Adapting (Example of La Rochelle)

      The university must respond to a new intake coming from a lycée without fixed tracks, where pupils choose specialties.

      Major/Minor System: Allows students to broaden their range of skills (e.g., a minor in entrepreneurship) or to secure a multidisciplinary pathway.

      Support from the first year (L1): Individual interviews at the start of the year to assess how students see their future and to offer subject-specific or methodological reinforcement.

      Right to make mistakes: Changing course is now seen as a sign that a student's plans are maturing rather than as a failure.

      2. Structuring the Pathway in Lycée

      In lycée (e.g., Lycée Guy Chauvet in Loudun), the challenge is to structure the pathway from the first year (seconde) around three axes:

      Knowing oneself: Helping adolescents identify their tastes and skills during a phase of rapid change.

      Knowing the world of work: Meetings with professionals and former pupils, and company visits.

      Knowing the mechanisms: Understanding Parcoursup and how student life works (budgeting, health, libraries).

      --------------------------------------------------------------------------------

      IV. Academic Steering and Levers for Action

      The success of the pupil's pathway depends on training design and on coordinating all available actors across the territory.

      1. Strategies on the ground

      Networks and Pooling: A single school is considered too small to handle guidance alone; a networked approach is needed.

      Comités Locaux École-Entreprise (CLEE): Key bodies for discovering occupations from collège onward, aiming to broaden pupils' horizons (from 15 to 60 known occupations).

      Ambassadors and Tutoring: Young people (BTS or CPGE students) speak to lycée pupils, which fosters pedagogical and emotional closeness.

      2. Specific schemes mentioned

      Ambition BTS: Methodological support for pupils from the vocational track during their first three months in BTS, to prevent dropout.

      Moi étudiant dans 3 ans: Immersion of collège and lycée pupils at university to break down misconceptions.

      Campus Connectés: Sites where students can follow higher education remotely in rural areas far from university centers (e.g., Saintes).

      "Seconde Chance" procedure: Makes it easier for pupils who wish to rejoin the vocational track during the year to switch.

      --------------------------------------------------------------------------------

      V. Major Obstacles and Challenges

      The analysis identifies several obstacles to implementing these pathways well:

      1. Rigid historical design: Regulating administrative flows takes precedence over developing students' ability to make their own guidance choices.

      2. Teacher engagement: Many teachers feel incompetent or untrained to support guidance and fear that it distorts their primary role.

      3. Gap between rhetoric and reality: A persistent gap separates the ideal of personalised pathways from structural or territorial constraints (social determinism, geographic remoteness).

      --------------------------------------------------------------------------------

      VI. Reference Resources and Tools

      To support teaching teams, the IH2EF recommends several resources:

      Vade-mecum on guidance support: Three specific guides (Collège, general and technological lycée, vocational track) available on Éduscol.

      ONISEP: Framework of guidance skills, resource catalogues, and training modules for teams.

      IGÉSR report (2020): "L'orientation de la quatrième au Master", led by Michel Lunier, which provides a national analytical framework.

      Capsule platform: A tool that facilitates student immersion in higher education.

    1. Briefing: Access to Housing for Homeless Families with Children

      Executive Summary

      The situation of homeless children in France is reaching critical levels, with a projected 80,000 children in precarious housing in 2025.

      Despite the legal obligation to provide unconditional emergency accommodation (Article L345-2-2 of the Code de l'action sociale et des familles), the saturation of services leads to increasingly harsh triage of applicants, which now excludes infants and pregnant women.

      The heavy reliance on hotel accommodation, although a stopgap, proves harmful to child development because of unsanitary conditions, residential instability, and a glaring lack of social support.

      Association representatives denounce "budgetary insincerity" and a dilution of responsibility between the State and the départements.

      The solution identified is coordinated national steering, modelled on the handling of the Ukrainian crisis, and a structural shift from emergency accommodation to permanent social housing.

      --------------------------------------------------------------------------------

      I. Current Situation: An Invisible Humanitarian Crisis

      1. Alarming statistical indicators

      The picture drawn by the Fédération des acteurs de la solidarité and UNICEF shows a steady deterioration:

      Overall figures: 80,000 children homeless or in precarious housing projected for 2025; 70,000 are growing up in hotels or emergency facilities.

      Rise in rough sleeping: The number of children on the street has increased by 30% in three years. In October 2024, about 2,500 children were counted on the street.

      Saturation of the 115 hotline: About 69% of accommodation requests go unmet each day. 79% of families who called the 115 say they slept on the street the night before their call.

      2. The failure of unconditional access

      The law provides that any person in distress must have access to accommodation at any time. Yet the shortage of places forces a "waltz of precarities":

      Tighter criteria: The priority once given to children under 3 and to pregnant women is now narrower (only women more than 6 months pregnant, with children over one year old sometimes excluded).

      Break in continuity: Families are accommodated for very short periods (3 to 7 days) before being put back on the street to make room for profiles judged "more vulnerable".

      --------------------------------------------------------------------------------

      II. Hotel Accommodation: An Unsuitable Environment

      Hotels house nearly 30,000 children, nearly 10,000 of whom are under three years old.

      This form of accommodation has major shortcomings:

      1. Degraded material and sanitary conditions

      Dilapidation and unsanitary conditions: Frequent presence of pests (bedbugs, cockroaches).

      Basic needs unmet: Lack of space, no kitchen (making it impossible to prepare balanced meals), and lack of privacy for families.

      Remoteness: Locations are often on the outskirts, which complicates access to care and schools (sometimes hours of travel to keep children in school).

      2. Gaps in social support

      Social desert: Unlike emergency accommodation centres (CHU), hotels have no social workers on site.

      Territorial inequality: In Île-de-France, only 45% of families in hotels receive support through mobile platforms (PASH).

      Loss of meaning: Lack of resources reduces support to a budgetary adjustment variable, sacrificing the human side to the payment of nightly rates.

      --------------------------------------------------------------------------------

      III. Impacts on Children's Development and Health

      1. Mental and physical health

      Mental disorders: The prevalence of mental health disorders is 19% among accommodated children, against 8% in the general population.

      Permanent insecurity: Constant changes of environment create deep "psychological insecurity".

      Health risks: More births by caesarean section (one third of deliveries among homeless women) and increased risks of gestational diabetes or respiratory infections.

      2. Schooling and socialisation

      Residential instability hinders school enrolment and attendance.

      Children report being unable to form normal social ties (not allowed to invite friends, stigma attached to where they live).

      3. Child protection

      Fear of placement: Parents persistently fear that precarious housing will be interpreted as a parenting failure and lead to their children being placed in care.

      Separation of families: Some facilities accept only mothers and children, which excludes the father and breaks up the family.

      --------------------------------------------------------------------------------

      IV. Institutional and Budgetary Obstacles

      1. The administrative "ping-pong game"

      Confusion persists over responsibilities: the State is responsible for general emergency accommodation, while the départements are responsible for accommodating mothers with children under 3.

      This division creates "grey areas" in which families end up with no support at all.

      2. Budgetary insincerity

      Stakeholders stress that the appropriations allocated (785 million euros for programme 177 in 2024) are insufficient to cover the 203,000 theoretical places.

      This structural underfunding prevents the creation of permanent, good-quality places.

      3. Administrative delays

      Saturation is made worse by slow processing of residence permits (up to 18 months of waiting).

      People entitled to work or to social housing remain stuck in emergency accommodation for lack of up-to-date documents, which clogs the system.

      --------------------------------------------------------------------------------

      V. Recommendations and Outlook

      | Area of intervention | Recommended actions |
      | --- | --- |
      | Steering | Create a national interministerial coordination body (Health, Education, Housing) modelled on the response to the Ukrainian crisis. |
      | Housing | Invest massively in social housing (PLAI and PLUS) to free up emergency accommodation. |
      | Quality of reception | Set binding specifications for facilities that host families (kitchen access, play areas, privacy). |
      | Support | Raise daily rates to fund multidisciplinary teams (early-childhood educators, social workers). |
      | Early childhood | Guarantee access to childcare (crèches) for families in precarious situations, to support reintegration. |
      | SAS schemes | Ensure that the SAS facilities (structures d'accueil et de desserrement) provide a genuine assessment and continuity of pathway, with no return to the street. |

      Speakers' conclusion: Emergency accommodation must no longer be a long-term solution.

      The absolute priority must be direct access to housing ("Logement d'abord", Housing First), the only guarantee of the stability needed for children's development and families' dignity.

    1. Reviewer #1 (Public review):

      In this study, the noncanonical amino acid acridon-2-ylalanine (Acd) was inserted at various positions within the human Hv1 protein using a genetic code expansion approach. The purified mutants with incorporated fluorophore were shown to be functional using a proton flux assay in proteoliposomes. FRET between native tryptophan and tyrosine residues and Acd was quantified using spectral FRET analysis. Predicted FRET efficiencies calculated from an AlphaFold model of the Hv1 dimer were compared to the corresponding experimental values. Spectral FRET analysis was also used to test whether structural rearrangements caused by Zn2+, a well-known Hv1 inhibitor, could be detected. The experimental data provide a good validation of the approach, but further expansion of the analysis will be necessary to differentiate between intra- and intersubunit structural features.

      Interestingly, the observed rearrangements induced by Zn2+ were not limited to the protein region proximal to the extracellular binding site but extended to the intracellular side of the channel. This finding agrees with previous studies showing that some extracellular Hv1 inhibitors, such as Zn2+ or AGAP/W38F, can cause long-range structural changes propagating to the intracellular vestibule of the channel (De La Rosa et al. J. Gen. Physiol. 2018, and Tang et al. Brit J. Pharm 2020). The authors should consider adding these references.

      Since one of the main goals of this work was to validate Acd incorporation and the spectral FRET analysis approach to detect conformational changes in hHv1 in preparation for future studies, the authors should consider removing one subunit from their dimer model, recalculating FRET efficiencies for the monomer, and comparing the predicted values to the experimental FRET data. This comparison could support the idea that the reported FRET measurements can inform not only on intrasubunit structural features but also on subunit organization.
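
      The monomer-versus-dimer comparison suggested above can be sketched with the Förster equation. This is a minimal illustration, not the authors' calculation: the Förster radius and both distances are made-up values, and the sketch assumes each donor sees one Acd acceptor per subunit, so that the transfer rates to the two acceptors add.

```python
# Hypothetical sketch: Foerster-equation FRET efficiencies for a donor near
# one or more acceptors. R0 and all distances below are made-up illustration
# values, not taken from the study or from any Hv1 structural model.

def transfer_ratio(r, r0):
    """Ratio of the transfer rate to the donor decay rate: (R0 / r)^6."""
    return (r0 / r) ** 6


def fret_efficiency(distances, r0):
    """Efficiency for one donor with acceptors at the given distances.

    For a single acceptor this reduces to E = 1 / (1 + (r / R0)^6).
    With several acceptors the transfer rates add.
    """
    total = sum(transfer_ratio(r, r0) for r in distances)
    return total / (1.0 + total)


R0 = 20.0     # assumed Foerster radius in angstroms (illustrative only)
intra = 18.0  # assumed donor-acceptor distance within one subunit
inter = 25.0  # assumed distance to the acceptor on the other subunit

e_monomer = fret_efficiency([intra], R0)
e_dimer = fret_efficiency([intra, inter], R0)
print(f"monomer model: E = {e_monomer:.3f}")
print(f"dimer model:   E = {e_dimer:.3f}")
```

      If the experimental efficiencies sit closer to the dimer prediction than to the monomer one at positions near the interface, that would point to an intersubunit contribution, which is the distinction the comparison is meant to expose.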

    1. Reviewer #1 (Public review):

      Summary:

      This computational modelling study addresses the important question of how neurons can learn non-linear functions using biologically realistic plasticity mechanisms. The study extends the previous related work on metaplasticity by Khodadadi et al. (2025), using the same detailed biophysical model and basic study design, while significantly simplifying the synaptic plasticity rule by removing non-linearities, reducing the number of free parameters, and limiting plasticity to only excitatory synapses. The rule itself is supervised by the presence or absence of a binary dopamine reward signal, and gated by separate calcium-sensitive thresholds for potentiation and depression. The author shows that, when paired with a strong form of dendritic non-linearity called a "plateau potential" and appropriate pre-existing dendritic clustering of features, this simpler learning mechanism can solve a non-linear classification task similar to the classic XOR logic operator, with equal or better performance than the previous publication. The primary claims of this publication are that metaplasticity is required for learning non-linear feature classification, and that simultaneous dynamics in two separate thresholds (for potentiation and depression) are critical in this process. By systematically studying the properties of a biophysically plausible supervised learning rule, this paper adds interesting insights into the mechanics of learning complex computations in single neurons.

      Strengths:

      The simplified form of the learning rule makes it easier to understand and study than previous metaplasticity rules, and makes the conclusions more generalizable, while preserving biological realism. Since similar biophysical mechanisms and dynamics exist in many different cell types across the whole brain, the proposed rule could easily be integrated into a wide range of computational models specializing in brain regions beyond the striatum (which is the focus of this study), making it of broad interest to computational neuroscientists. The general approach of systematically fixing or modifying each variable while observing the effects and interactions with other variables is sound and brings great clarity to understanding the dynamic properties and mechanics of the proposed learning rule.

      Weaknesses:

      General notes

      (1) The credibility of the main claims is mainly limited by the very narrow range of model parameters that was explored, including several seemingly arbitrary choices that were not adequately justified or explored.

      (2) The choice to use a morphologically detailed biophysical model, rather than a simpler multi-compartment model, adds a great deal of complexity that further increases uncertainty as to whether the conclusions can generalize beyond the specific choices of model and morphology studied in this paper.

      (3) The requirement for pre-existing synaptic clustering, while not implausible, greatly limits the flexibility of this rule to solve non-linear problems more generally.

      (4) In order to claim that two thresholds are truly necessary, the author would have to show that other well-known rules with a single threshold (e.g., BCM) cannot solve this problem. No such direct head-to-head comparisons are made, raising the question of whether the same task could be achieved without having two separate plasticity thresholds.

      Specific notes

      (1) Regarding the limited hyperparameter search:

      (a) On page 5, the author introduces the upper LTP threshold Theta_LTP. It is not clear why this upper threshold is necessary when the weights are already bounded by w_max. Since w_max is just another hyperparameter, why not set it to a lower value if the goal is to avoid excessively strong synapses? The values of w_max and Theta_LTP appear to have been chosen arbitrarily, but this question could be resolved by doing a proper hyperparameter search over w_max in the absence of an upper Theta_LTP.

      (b) The author does not explore the effect of having separate learning rates for theta_LTP and theta_LTD, which could also improve learning performance in the NFBP. A more comprehensive exploration of these parameters would make the inclusion of theta_max (and the specific value chosen) a lot less arbitrary.

      (c) Figure 4 Supplements 3-4: The author shows results for a hyperparameter search of the learning rule parameters, which is important to see. However, the parameter search is very limited: only 3 parameter values were tried, and there is no explanation or rationale for choosing these specific parameters. In particular, the metaplasticity learning rates do not even span one order of magnitude. If the author wants to claim that the learning rule is insensitive to this parameter, it should be explored over a much broader range of values (e.g., something like the range [0.1-10]).

      (2) Regarding the similarity to BCM, the author would ideally directly implement the BCM learning rule in their model, but at the least the author could have shown whether a slight variant of their rule presented here can be effective: for example having a single (plastic, not fixed) Ca-dependent threshold that applies to both LTP and LTD, with a single learning rate parameter.

      (3) This paper is extremely similar (and essentially an extension) to the work of Khodadadi et al. (2025). Yet this paper is not mentioned at all in the introduction, and the relation between these papers is not made clear until the discussion, leaving me initially puzzled as to what problems this paper addresses that have not already been extensively solved. The introduction could be reworked to make this connection clearer while pointing out the main differences in approach (e.g., the important distinction between "boosting" nonlinearities and plateau potentials).

      (4) The introduction is missing some citations of other recent work that has addressed single-neuron non-linear computation and learning, such as Gidon et al (2020); Jones & Kording (2021).

      (5) Figure 1: The figure prominently features mGluR next to the CaV channel, but there is no mention of mGluR in the introduction. The introduction should be updated to include this.

      (6) Could the author explain why there is a non-monotonic increase/decrease in the [Ca]_L in Figure 2B_4? Perhaps my confusion comes from not understanding what a single line represents. Does each line represent the [Ca] in a single spine (and if so, which spine), or is each line an average of all the spines in a given stim condition?

      (7) Row 124 (page 4): L-type Ca microdomains (in which ions don't diffuse and therefore don't interact with Ca_NMDA) is a critical assumption of this model. The references for this appear only in the discussion, so when reading this paper, I found myself a bit confused about why the same ion is treated as two completely independent variables with separate dynamics. Highlighting the assumption (with citations) a bit more clearly in the results section when describing the rule would help with understanding.

      (8) Row 149 (page 5): The current formulation of the update rule is not actually multiplicative. The fact that the update is weight-dependent alone does not make it a multiplicative rule, and judging by equation (1) it appears to simply be an additive rule with a weight regularization term that guarantees weight bounds. For example, a similar weight-dependent update is also a core component of BTSP (Milstein et al. 2021; Galloni et al. 2025), which is another well-known *additive* rule. An actual multiplicative rule implies that the update itself is applied via a multiplication, i.e. w_new = w_old * delta_w.

      For an example of a genuinely multiplicative rule, see Cornford et al. 2024, "Brain-like learning with exponentiated gradients". Multiplicative rules have very different properties to additive rules, since larger weights tend to grow quickly while small weights shrink towards 0.

      (9) Equation 1 (page 5): Shouldn't the depression term be written as: (w_min - w)? This term would be negative if w is larger than w_min, leading to LTD. As it is written now, a large w and small w_min would just cause further potentiation instead of depression.

      (10) In the introduction, the teaching signal is described in binary terms (DA peak, or DA pause), but in Equation 1, it actually appears to take on 3 different values. Could the author clarify what the difference is between a "DA pause" and the "no DA" condition? The way I read it, pause = absence of DA = no DA.

      (11) Figure 3: In these experimental simulations, DA feedback comes in 400ms after the stimulus. The author could motivate this choice a bit better and explain the significance of this delay. Clearly, the equations have a delta_t term, but as far as the learning algorithm is concerned, it seems like learning would be more effective at delta_t=0. Is the choice of 400ms mainly motivated by experimental observations? On a related note, is it meaningful that the 200ms delta_t before the next stimulus is shorter than the 400ms pause from the first stimulus? Wouldn't the DA that arrives shortly before a stimulus also have an effect on the learning rule?

      (12) Figure 4C: How is it possible that the theta_LTP value goes higher than the upper threshold (dashed line)? Equation 3 implies that it should always be lower.

      (13) Row 429 (page 11): The statement that "without metaplasticity the NFBP cannot be solved" is overly general and not supported by the evidence presented. There exist many papers in which people solve similar non-linear feature learning problems with Hebbian or other bio-plausible rules that don't have metaplasticity. A more accurate statement that can be made here is that the specific rule presented in this paper requires metaplasticity.

      (14) The methods section does not make any mention of publicly available code or a GitHub repository. The author should add a link to the code and put some effort into improving the documentation so that others can more easily assess the code and reproduce the simulations.
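
      Notes (1c), (8), and (9) above can be illustrated with a minimal sketch. All names and values here (W_MIN, W_MAX, eta, the gradient) are hypothetical and are not taken from the manuscript. The exponentiated form is only one example of a multiplicative update, used for contrast with an additive soft-bounded one.

```python
import math

# Hypothetical sketch. The parameter names and values are illustrative
# assumptions and are not taken from the manuscript under review.
W_MIN, W_MAX = 0.0, 1.0


def additive_soft_bound(w, eta, potentiate):
    """Additive update whose step size depends on the distance to a bound."""
    if potentiate:
        return w + eta * (W_MAX - w)   # positive step, shrinks near W_MAX
    return w + eta * (W_MIN - w)       # negative step whenever w > W_MIN


def multiplicative(w, eta, gradient):
    """Exponentiated-gradient style update: the weight itself is rescaled."""
    return w * math.exp(-eta * gradient)


def log_grid(low, high, n):
    """n values spaced evenly on a log scale between low and high."""
    step = (math.log10(high) - math.log10(low)) / (n - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(n)]


w = 0.8
print(additive_soft_bound(w, 0.1, potentiate=False))  # moves toward W_MIN
print(w + 0.1 * (w - W_MIN))   # term written as (w - W_MIN): weight grows
print(multiplicative(w, 0.1, gradient=1.0))
print(log_grid(0.1, 10.0, 5))  # a range spanning two orders of magnitude
```

      The additive rule adds a bounded step to the weight, whereas the multiplicative rule scales the weight, so a weight at zero stays at zero and larger weights change by larger absolute amounts. The log-spaced grid shows one way to cover a range such as [0.1-10] evenly across orders of magnitude.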

    1. HTML, which is not a programming language

      Sure it is:

      It's not really a debate, HTML is a markup language [1], not a programming language

      This is a false dichotomy, and untrue on two points: (a) that HTML is "not a programming language", and (b) that the stance you have taken is not up for debate.

      You're using the term "programming language" as shorthand to refer to the subset of languages more rigorously referred to as Turing-complete general-purpose programming languages.* There are programming languages outside this subset that nonetheless belong to the larger "programming language" superset.

      A programming language is a programming language in the sense that they have come to exist with the advent of the stored program computer. Though HTML and other markup languages are not** Turing-complete general-purpose programming languages, they are used every bit as much to describe a stored program as a language you might encounter when opening a file containing G-code "instructions" for controlling a CNC machine.

      * We had a similar (but not the same) silly recurring argument in the 90s where people, at a loss for words to accurately capture their thoughts and grasping at whatever came to mind, argued that there were two distinct classes: "programming languages" and "scripting languages"—a category error arising from the same phenomenon of mistaking loose, off-the-cuff shorthand for The Real Idea Behind the Thing That's Involved Here

      ** Not to overlook the fact that HTML is a proper superset of CSS, a provably Turing-complete (though not general-purpose) programming language, and of JS, which is, of course, a true general-purpose programming language

      (Originally drafted in response to https://news.ycombinator.com/item?id=46747778)

    1. Second, we examined how the amino acid substitutions affect the tertiary structure of the proteins. The tertiary structure of the inactive CHK2 homodimer (PDB code: 3i6w) is shown in Figure 4B. p.R474 is located away fro

      [Paragraph-level] PMCID: PMC5111006 Section: RESULTS PassageIndex: 11

      Evidence Type(s): Functional, Oncogenic

      Justification: Functional: The passage discusses how the variant p.R474C alters the salt bridge formation and is likely to make the protein unstable, indicating an alteration in molecular function. Oncogenic: The variant p.R474C is described as "disease causing" and likely interferes with protein function, suggesting a role in tumor development or progression.

      Gene→Variant (gene-first): 11200:p.R474 11200:p.R474C

      Genes: 11200

      Variants: p.R474 p.R474C

    1. Yes! It is useful to you as the developer and to anyone who reads your page's source code

      In the end it's a bit like this "Hypothesis" extension, without needing to install an extension.

    2. spaces

      Use _ instead (or, at a pinch, hyphens -)

      Dots work but are best avoided: they are used heavily in CSS, which can cause confusion.

      #ancre

      #Exemple_oiseau_migrateur
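
      The naming advice above can be sketched as a small helper. This is only an illustration: the function name `anchor_id` is hypothetical, and replacing dots with underscores is one possible way to follow the advice, not a rule from the course.

```python
import re


def anchor_id(title):
    """Build an anchor id from a title: spaces become underscores.

    Dots are also replaced, since '.' introduces a class selector in CSS.
    """
    cleaned = title.strip().replace(".", "_")
    return re.sub(r"\s+", "_", cleaned)


print("#" + anchor_id("Exemple oiseau migrateur"))
```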

    1. Performance was compared against direct LLM implementations (GPT-5, Claude 4, DeepSeek V3.2, Gemini-2.5 Pro) and OpentronsAI, a commercial LLM–based solution (Table 1).

      I'd be interested to see this comparison done with other agentic approaches like Claude Code or Cursor if they were given the same kind of context upfront

    1. Young People's Mental Health: Issues, Current State, and Steering in Schools

      Executive Summary

      Young people's mental health has become a major government and public health priority in France.

      Far from being a peripheral mission, it is now recognised as an essential condition for academic success and student well-being.

      Recent data show a worrying deterioration in young people's mental state, particularly among adolescent girls, with no notable improvement after the COVID-19 period.

      The national strategy rests on a paradigm shift: moving from purely medical management of disorders to a comprehensive "École promotrice de santé" (health-promoting school) approach.

      This means mobilising the entire educational community, not only health professionals, to create supportive environments.

      Steering relies on clear protocols (from identification to care), rigorous use of statistical data, and more staff training (mental health first aiders).

      --------------------------------------------------------------------------------

      1. Statistical Overview of Young People's Mental Health

      Data from national surveys (Enabee for primary level and EnCLASS for secondary level) point to growing vulnerability.

      Data by School Stage

      | School level | Prevalence of probable disorders | Key observations |
      | --- | --- | --- |
      | Nursery school (3-6 years) | 8% (i.e., 1 student in 12) | Boys are twice as affected as girls (oppositional disorders, hyperactivity). |
      | Primary (CP-CM2) | 13% (i.e., more than 3 per class) | Differences by sex: emotional disorders (anxiety, depression) for girls; behavioural disorders (ADHD) for boys. |
      | Collège and lycée | ~14% at risk of depression | Continuous deterioration between sixième and terminale. More than 50% of students have frequent physical or psychological symptoms. |

      Focus on Serious Risks and Trends

      Suicide at lycée: 13% of lycée students say they have already attempted suicide; 3% made an attempt that required hospitalisation (about one student per class).

      Change over time: All indicators worsened between 2018 and 2022. Girls' vulnerability is the main current warning sign.

      Overall context: Mental health is affected by a pile-up of crises (economic, social, geopolitical, and climate) and by the influence of social media.
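
      The "one in N" and "per class" figures quoted above follow from simple arithmetic, sketched here. The class size of 30 is an assumption; the source gives the percentages but does not state the class size it used.

```python
# Illustrative arithmetic only. CLASS_SIZE is an assumption; the source
# gives the percentages but does not state the class size it used.
CLASS_SIZE = 30


def one_in(prevalence):
    """Express a prevalence such as 0.08 as 'one student in N'."""
    return 1.0 / prevalence


def per_class(prevalence, class_size=CLASS_SIZE):
    """Expected number of affected students in a class of the given size."""
    return prevalence * class_size


rows = [
    ("nursery school", 0.08),
    ("primary school", 0.13),
    ("lycee, attempt requiring hospitalisation", 0.03),
]
for label, p in rows:
    print(f"{label}: 1 in {one_in(p):.1f}, "
          f"{per_class(p):.1f} per class of {CLASS_SIZE}")
```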

      --------------------------------------------------------------------------------

      2. Conceptual and Institutional Framework

      A Three-Part Definition

      Mental health is not merely the absence of pathology. It has three essential components:

      1. Well-being (positive mental health).

      2. Mental disorders (psychological distress).

      3. Mental illnesses (clinical diagnoses).

      The Health-Promoting School (École promotrice de santé)

      This scheme, led by the ministry since 2020, aims to unite the educational community around promoting practices that support physical, mental, and social well-being.

      Objective: Integrate mental health into all daily teaching and educational activities.

      Political priority: Since 2022, the start-of-year circulars have placed well-being on the same level as core learning.

      --------------------------------------------------------------------------------

      3. Legal Framework: Medical Confidentiality and Accommodations

      Attention to mental health must be balanced against students' fundamental rights.

      Medical confidentiality: Defined by Article 226-13 of the Code pénal, it is a fundamental right of the patient that guarantees trust in care staff.

      Breaching it is a criminal offence.

      The Projet d'Accueil Individualisé (PAI, individual care plan): This legal tool organises schooling for students with health problems or a disability.

      On medical prescription, it allows adjustments to diets, timetables, or substitute activities, while keeping diagnoses confidential.

      --------------------------------------------------------------------------------

      4. Steering Strategies and Operational Levers

      Steering mental health requires an approach that is both vertical (institutional) and horizontal (territorial).

      Actions at School Level

      The head teacher must act as a pilot, relying on several levers:

      Local diagnosis: Use school climate indicators (nursing software, social surveys, school evaluations).

      Mental health protocol: Formalise a document, "from identification to care", that specifies each player's role.

      Bodies: Keep the topic alive within the CESCE, the pedagogical council, and the board of governors (conseil d'administration).

      Physical arrangements: Build well-being into the design of school buildings, playgrounds, and catering.

      National Schemes and Tools

      Mental health first aiders: Training of two staff members per collège to spot signs of crisis (especially suicidal crisis) and refer students.

      3114: The national suicide prevention number, now printed in students' correspondence booklets (carnets de correspondance).

      EPSA newsletter: A publication on Éduscol that provides data and references for steering.

      Psychosocial skills (compétences psychosociales, CPS): A major preventive lever for strengthening students' resilience.

      --------------------------------------------------------------------------------

      5. Roles and Responsibilities of the Actors

      Mental health is not solely a matter for specialists; it rests on a chain of shared responsibilities.

      School leadership: Pilots of health policy and school climate.

      Health and social staff (doctors, nurses, social workers, psychologists): Expert advisers and technical counsellors. They carry out assessment and referral to outside care.

      Teaching and educational staff: Front-line actors for identification and for listening to what students say.

      Territorial partners: Local authorities, Agences Régionales de Santé (ARS), and local health contracts (contrats locaux de santé), to ensure continuity of care outside school.

      Families: Recognized as the foremost experts on their children, they are indispensable partners in follow-up.

      Conclusion

      The school system is undergoing a profound shift by treating mental health as a pillar of academic success on a par with academic knowledge.

      Although the statistical indicators remain worrying, collective mobilization, marked by the destigmatization of disorders and the training of staff, is the main lever for stabilizing and improving the well-being of younger generations.

      School does not treat, but it identifies, protects, and refers.

    1. Briefing Doc: "Le parcours de l'élève au périscope" Source: Excerpts from the radio show "Le parcours de l'élève au périscope"

      Broadcast date: 11 March 2025

      Participants:

      • Jean-Marc Moulet: Inspector general of education, sport and research (inspecteur général de l'éducation, du sport et de la recherche)
      • Philippe Montoya: IEN for the schooling of students with disabilities, technical adviser on inclusive schooling to the recteur of the académie de Toulouse
      • Patrick Avogadro: School leadership staff, Lycée professionnel Les Grippeaux (académie de Poitiers)
      • Noémie Olympio: Lecturer and researcher on student trajectories (LEST CNRS, Université Aix-Marseille)
      • Raphaël Mata Duvignot: Presenter of the Minute Juris segment

      • Main theme: The student pathway, what is at stake, support schemes, inequalities, and outlook.

      Structure of the programme (based on the excerpt):

      • Issues around the student pathway: Definition, key stages, support from the education system, goals beyond entry into employment, influence of the local area, and the position of research.

      • Minute Juris: Presentation of the legal and administrative frameworks for the student pathway (common core, repeating a year, guidance, specific classes and groups).

      • Testimonies from the field (round table): Experiences in primary and secondary education, strengths of the system, territorial issues, discrimination, discovering occupations, inclusion of students with disabilities, training of teachers and partners.

      • Minute Bibli: Presentation of bibliographic resources.

      Main themes and key ideas:

      1. Definition and complexity of the student pathway:

      • The student pathway covers "everything a student will experience inside and outside school in order to develop, succeed in their guidance choices, and reach the best possible entry into working life." (Jean-Marc Moulet)

      • It is distinct from the "parcours éducatifs" (2008 reform), which focus more on cross-curricular education and include the "parcours avenir", central to guidance in collège.

      • The pathway differs by level (primary, collège, lycée), with specific schemes to address difficulties (personalized plans in primary school, SEGPA in collège, specializations in lycée).

      • In lycée (especially vocational lycée), the range of pathways widens, with themed seconde classes and options for deeper study or workplace immersion in terminale.

      The general and technological lycée offers a growing number of subject choices and combinations.

      • The education system provides support through hours dedicated to guidance from 4ème onwards, personalized support in lycée, and the role of educational teams and psychologues de l'Éducation nationale.

      2. Multiple goals of the pathway:

      • The goal is not only entry into employment, but also "producing citizens who are happy" and "diversifying what is possible". (Jean-Marc Moulet)

      • The aim is to counter social determinism and gender pressures by widening the "range of possibilities" so that students can flourish in what suits them best.

      • The common core (socle commun) ensures that all citizens acquire the skills needed for guidance choices by the end of compulsory schooling.

      3. Personalization and choice:

      • The idea is to offer a pathway that is "as personalized and as close to young people's wishes as possible". (Jean-Marc Moulet)
      • The range of choices is now much wider than before, matching a greater diversity of profiles.
      • Raising the standing of the vocational lycée is a major educational and economic issue, linked to labour market needs.

      4. Influence of the local area and mobility:

      • Proximity to a given economic sector influences guidance choices.

      Information on guidance has been delegated to the regions (2018 law) to take account of local economic issues and encourage mobility within the region.

      • Student mobility is central, and informing students about regional opportunities can lead some of them to pursue those routes.

      • Projets éducatifs de territoire (PEDT) are important levers for tackling cultural inequalities and encouraging mobility from primary school onwards.

      • Initiatives such as the cordées de la réussite and the internats d'excellence aim to offset territorial inequalities and raise students' ambitions.

      5. The research perspective: inequalities and determinism:

      • The notion of a pathway points to "pivotal periods" (early adjustments, first guidance stages in 3ème and seconde). (Noémie Olympio)

      • Despite the aim of uniformity (common curriculum), the system is marked by "elements of inequality" and a strong "academic and social determinism of trajectories".

      • Academic performance at the end of primary school is a good predictor of future options. Guidance is socially marked (at equal performance, a child whose parents hold higher-education degrees is more likely to take a general baccalauréat).

      • Data from the DEP (student panel) show the importance of families' "informational capital", families' "level of aspiration", and the "maintenance of aspirations" (a "cooling of aspirations" that is sometimes unrelated to academic performance).

      • The "perceived usefulness of qualifications" is also unevenly distributed and correlates with academic resilience.

      • The current system, with early adjustments (such as SEGPA), can make trajectories "hard to reverse" and socially marked.

      • "Informational capital" is shaped by socio-professional category, worldview, relationship to mobility, and "parents' educational strategies" (which are more or less "effective").

      6. Legal and administrative framework (Minute Juris):

      • Article L 111-1 of the Code de l'éducation guarantees that pathways are organized according to the students.

      • The common core of knowledge, skills and culture (socle commun, defined by the 2005 law and overhauled in 2013) is at the heart of the system and is assessed at the end of each cycle.

      • Repeating a year is a rare and exceptional measure (codified in article L 31-7); prevention and support strategies are preferred.

      It is prohibited in maternelle. The decision is discussed with families and can be contested.

      • School guidance (articles L331-7 and D331-31) is framed by routes defined by ministerial order and involves dialogue between families and teaching teams. An appeal procedure exists in case of disagreement.

      • Specific classes and groups (article D 332-5), such as SEGPA (article D 332-7) and ULIS (article L12-1), allow a differentiated pathway that meets students' needs, including those of students with disabilities.

      Differential treatment based on needs is not considered a breach of equality.

      • The principle of adaptability (mutabilité) of the public education service implies continuous innovation and adjustment of teaching practices.

      7. Testimonies from the field and solutions:

      • Pathways already differ in primary school depending on the local area and the projects carried out (including out-of-school time). Distance to the collège and lycée also affects pathways.

      • Projets éducatifs de territoire (PEDT) and groupings of local authorities are essential for offering cultural and mobility opportunities.

      • The cordées de la réussite link collèges with grandes écoles to foster ambition. The internats d'excellence remove the obstacle of distance.

      • The campus des métiers d'excellence (CMQ) encourage mobility and ties with the local economy, producing resources for collèges (games, digital platforms, hosting visits).

      • It is crucial to work on widening the range of possibilities and on informational capital without paternalism, relying on reliable data (employability rates, occupational mobility).

      • Discovering occupations from 5ème onwards (or even earlier, as in English-speaking countries) is essential to counter determinism.

      Meeting professionals has a decisive impact (for example, a one-hour talk by a woman scientist affecting girls' guidance choices).

      • Local actions (careers fairs, company visits, mini-placements, seconde work placements) let students discover the diversity of occupations.

      • Pathway support in vocational lycées (CV workshops, meetings with professionals and former students) aims to ease entry into work and further study.

      The differentiated end-of-year pathway allows for professionalization placements or workshops preparing students for student life.

      • Locally initiated training (formation d'initiatives locales, FIL) brings teachers from different levels together to align expectations.

      8. Inclusion of students with disabilities:

      • Mobility is a crucial issue and requires specific tools (travel-assistance applications).

      • Ambition for these students must be high (their share in general and technological lycées is low). There is a "glass ceiling" to break through.

      • Universities and grandes écoles are developing a strong inclusive dynamic (disability officers, accommodations). The "A tout pour tous" agreement in Toulouse and initiatives for students with TSA (autism spectrum disorder) are examples.

      • Platforms supporting inclusion in employment are being set up.

      • Teachers' knowledge of the available schemes is fundamental. The "Enseignement supérieur et handicap, c'est possible" campaign aims to inform.

      9. Training of teachers and partners:

      • The psychologues de l'Éducation nationale and the CIO play a central role.

      • It is important to involve médecins de l'Éducation nationale in order to anticipate medical contraindications for certain vocational tracks.

      • The region (information, mobile guidance services), ONISEP (guidance skills, "Avenir" platform), form teachers (professeurs principaux), and the DDFPT (in vocational lycées) are key partners.

      • Teamwork and networking (campus des métiers et des qualifications) are essential.

      • Partnerships between collège and lycée, and with higher education, need to be strengthened (Bac-3 / Bac+3 continuum).

      • Algorithmic management of guidance can fuel self-censorship, which calls for clearer explanation of strategies and of the support available to the least advantaged students.

      • The bureau des entreprises in vocational lycées strengthens ties with the world of work. The voluntary sector can contribute complementary expertise.

      • It is important to connect the seconde work placement to students' experiences in collège.

      Conclusion and outlook (Jean-Marc Moulet):

      • The education system is evolving to ease and better support pathways, helping the most vulnerable families.

      • The goal should not be entry into employment alone, but also preparing students for occupational mobility and adaptability as the labour market changes.

      • School dropout, often linked to guidance difficulties, could be the subject of a future round table.

      Bibliographic resources (Minute Bibli):

      Report of the Inspection générale on discovering occupations in collège (May 2024).

      Articles by Noémie Olympio on guidance in vocational lycées, aspirations, and social and cultural capital.

      ONISEP website and "Avenir" platform.

    1. 13.5.4. Search through news submissions and only display good news

      Now we will make a different version of the code that computes the sentiment of each submission title and only displays the ones with positive sentiment.

          # Look up the subreddit "news", then find the "hot" list, getting up to 10 submissions
          submissions = reddit.subreddit("news").hot(limit=10)

          # Turn the submission results into a Python list
          submissions_list = list(submissions)

          # Go through each reddit submission
          for submission in submissions_list:
              # Calculate sentiment
              title_sentiment = sia.polarity_scores(submission.title)["compound"]
              if title_sentiment > 0:
                  print(submission.title)
                  print()

          Fake praw is pretending to select the subreddit: news
          Breaking news: A lovely cat took a nice long nap today!
          Breaking news: Some grandparents made some yummy cookies for all the kids to share!

      13.5.5. Try it out on real Reddit

      If you want, you can skip the fake_praw step and try it out on real Reddit, from whatever subreddit you want.

      Did it work like you expected?

      You can also only show negative sentiment submissions (sentiment < 0) if you want to see only bad news.

      This demo was interesting because it shows how algorithms can intentionally shape what users see. Filtering for only positive news might seem helpful for improving mental health, but it also raises questions about whether hiding negative information creates a distorted view of reality. I also noticed that sentiment analysis is a simplified way to judge content, since tone and context can be more complex than just positive or negative. Overall, this example shows how small design decisions in algorithms can significantly influence users’ emotional experiences online.
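The point about sentiment analysis being a simplified judgment can be illustrated with a small, self-contained sketch. It does not use praw or the `sia` analyzer from the excerpt; instead it assumes a hypothetical toy word list (`POSITIVE`, `NEGATIVE`) and a hypothetical scoring function `toy_sentiment`, which are stand-ins invented for illustration only. The filter follows the same pattern as the quoted code (keep titles with score > 0, or < 0 for bad news), and the third sample title shows how a word-counting scorer can misread negation.

```python
# Hypothetical toy lexicon: a stand-in for a real sentiment analyzer,
# not the one used in the quoted tutorial.
POSITIVE = {"lovely", "nice", "yummy", "good", "great"}
NEGATIVE = {"bad", "terrible", "crash", "disaster", "sad"}


def toy_sentiment(title: str) -> int:
    """Return (number of positive words) - (number of negative words)."""
    words = [w.strip(".,!?:;").lower() for w in title.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def filter_titles(titles, keep_positive=True):
    """Keep only titles scored above zero (good news) or below zero (bad news)."""
    if keep_positive:
        return [t for t in titles if toy_sentiment(t) > 0]
    return [t for t in titles if toy_sentiment(t) < 0]


titles = [
    "Breaking news: A lovely cat took a nice long nap today!",
    "Breaking news: Terrible storm causes disaster downtown",
    # Negation is ignored by word counting, so this is scored as positive.
    "Breaking news: The rescue was not good",
]

good_news = filter_titles(titles)
bad_news = filter_titles(titles, keep_positive=False)

for title in good_news:
    print(title)
```

Here the "good news" list includes the third title even though a human reader would call it bad news, which is one concrete way a filter like this can distort what users see.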

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Abdelmageed et al. investigate age-related changes in the subcellular localization of DNA polymerase kappa (POLK) in the brains of mice. POLK has been actively investigated for its role in translesion DNA synthesis and involvement in other DNA repair pathways in proliferating cells, very little is known about POLK in a tissue-specific context, let alone in post-mitotic cells. The authors investigated POLK subcellular distribution in the brains of young, middle-aged, and old mice via immunoblotting of fractioned tissue extracts and immunofluorescence (IF). Immunoblotting revealed a progressive decrease in the abundance of nuclear POLK, while cytoplasmic POLK levels concomitantly increased. Similar findings were present when IF was performed on brain sections. Further, IF studies of the cingulate cortex (Cg1), the motor cortex (M1, M2), and the somatosensory (S1) cortical regions all showed an age-related decline in nuclear POLK. Nuclear speckles of POLK decrease in each region, meanwhile, the number of cytoplasmic POLK granules decreases in all four regions, but granule size is increasing. The authors report similar findings for REV1, another Y-family DNA polymerase.

      The authors then investigate the colocalization of POLK with other DNA damage response (DDR) proteins in either pyramidal neurons or inhibitory interneurons. At 18 months of age, DNA damage marker gH2AX demonstrated colocalization with nuclear POLK, while strong colocalization of POLK and 8-oxo-dG was present in geriatric mice. The authors find that cytoplasmic POLK granules colocalize with stress granule marker G3BP1, suggesting that the accumulated POLK ends up in the lysosome.

      Brain regions were further stained to identify POLK patterns in NeuN+ neurons, GABAergic neurons, and other non-neuronal cell types present in the cortex. Microglia associated with pyramidal neurons or inhibitory interneurons were found to have a higher abundance of cytoplasmic POLK. The authors also report that POLK localization can be regulated by neuronal activity induced by Kainic acid treatment. Lastly, the authors suggest that POLK could serve as an aging clock for brain tissue, but POLK deserves further characterization and correlation to functional changes before being considered as a biomarker.

      Strengths:

      Investigation of TLS polymerases in specific tissues and in post-mitotic cells is largely understudied. The potential changes in sub-cellular localization of POLK and potentially other TLS polymerases open up many questions about DNA repair and damage tolerance in the brain and how it can change with age.

      Weaknesses:

      The work is quite novel and interesting, and the authors do suggest some potentially interesting roles for POLK in the brain, but these are in and of themselves a bit speculative. The majority of the findings of this paper draw upon findings from POLK antibody and its presumed specificity for POLK. However, this antibody has not been fully validated and needs further work. Further validation experiments using Polk-deficient or knocked-down cells to investigate antibody specificity for both immunoblotting and immunofluorescence should be performed. More mechanistic investigation is needed before POLK could be considered as a brain aging clock.

      We are thankful for the overall enthusiasm and positive comments.

      (a) Concern over POLK antibody characterization in mouse:

      We performed siRNA and shRNA knockdowns in mouse primary cortical neurons as well as in efficiently transfectable murine lines such as 4T1 and Neuro-2A, showing knockdown of the 99 kDa and 120 kDa bands recognized by the sc-166667 anti-POLK antibody (Figure 1 and Figure S1). We show that in IF, sc-166667 and A12052 (Figure S1G) give similar immunostaining patterns, and we used sc-166667 in all reported figures and western blots.

      (b) More mechanistic investigation is needed before POLK could be considered as a brain aging clock:

      We sincerely appreciate the valuable suggestion. We agree that, as a terminal assay, POLK nucleo-cytoplasmic status is not practical for longitudinal studies. However, we believe it may serve as an investigative/correlative endogenous signal for determining tissue age that may be useful to "date" brain sections, since not many such cell biological markers exist. We have added clarifying text to address this.

      Reviewer #2 (Public review):

      Summary:

      Abdelmageed et al., demonstrate POLK expression in nervous tissue and focus mainly on neurons. Here they describe an exciting age-dependent change in POLK subcellular localization, from the nucleus in young tissue to the cytoplasm in old tissue. They argue that the cytosolic POLK is associated with stress granules. They also investigate the cell-type specific expression of POLK, and quantitate expression changes induced by cell-autonomous (activity) and cell nonautonomous (microglia) factors.

      I think it is an interesting report but requires a few more experiments to support their findings in the latter half of the paper. Additionally, a more mechanistic understanding of the pathways regulating POLK dynamics between the nucleus and cytosol, what is POLK doing in the cytosol, and what is it interacting with; would greatly increase the impact of this report. However, additional mechanistic experiments are mostly not needed to support much of the currently presented results, again, it would simply increase the impact.

      (a) Concern on more mechanistic understanding of the pathways regulating POLK dynamics between the nucleus and cytosol:

      We sincerely appreciate the reviewer’s enthusiasm and valuable guidance in helping us better understand the mechanism of nuclear-cytoplasmic POLK dynamics. Previously, we developed a modified aniPOND (accelerated native isolation of proteins on nascent DNA) protocol, which we termed iPoKD-MS (isolation of proteins on Pol kappa synthesized DNA followed by mass spectrometry), to capture proteins bound to nascent DNA synthesized by POLK in human cell lines (bioRxiv https://www.biorxiv.org/content/10.1101/2022.10.27.513845v3). In this dataset, we identified potential candidates that may regulate nuclear/cytoplasmic POLK dynamics. These candidates are currently undergoing validation in human cell lines, and we are preparing a manuscript on these findings. Among these, some candidates, including previously identified proteins such as exportin and importin (Temprine et al., 2020, PMID: 32345725), are being explored further as potential POLK nuclear/cytoplasmic shuttles. We are also conducting tests on these candidates in mouse cortical primary neurons to assess their role in POLK dynamics. In the revised version of the manuscript, we have included a discussion of our current understanding.

      (b) Question on “… what is POLK doing in the cytosol, and what is it interacting with …”: Our data so far indicate that POLK accumulates in stress granules and lysosomes. We are very grateful for the reviewer’s insightful suggestions and will make every effort to incorporate them in the revised manuscript. We characterized POLK accumulation in the cytoplasm using six additional endo-lysosomal markers, as recommended by the reviewer. These data are now part of an entirely new Figure 3.

      Reviewer #3 (Public review):

      Summary:

      In this study, the authors show that DNA polymerase kappa POLK relocalizes in the cytoplasm as granules with age in mice. The reduction of nuclear POLK in old brains is congruent with an increase in DNA damage markers. The cytoplasmic granules colocalize with stress granules and endo-lysosome. The study proposes that protein localization of POLK could be used to determine the biological age of brain tissue sections.

      Strengths:

      Very few studies focus on the POLK protein in the peripheral nervous system (PNS). The microscopy approach used here is also very relevant: it allows the authors to highlight a radical change in POLK localization (nuclear versus cytoplasmic) depending on the age of the neurons. 

      The conclusions of the study are strong. Several types of neurons are compared, the colocalization with several proteins from the NHEJ and BER repair pathways is tested, and microscopy images are systematically quantified.

      Weaknesses:

      The authors do not discuss the physical nature of POLK granules. There is a large field of research dedicated to the nature and function of condensates: in particular numerous studies have shown that some condensates but not all exhibit liquid-like properties (https://www.nature.com/articles/nrm.2017.7, https://pubmed.ncbi.nlm.nih.gov/33510441/ https://www.mdpi.com/2073-4425/13/10/1846). The change of physical properties of condensates is particularly important in cells undergoing stress and during aging. The authors should discuss this literature.

      We highly appreciate the reviewer bringing up the context of biomolecular condensates. Our iPoKD-MS data referenced above suggest candidates from various biomolecular condensates that we are currently investigating. We appreciate the reviewer providing important literature; we have cited these articles in the text, and potential biomolecular condensates are discussed in the revised version.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The work is quite novel and interesting, and the authors do suggest some potentially interesting roles for POLK in the brain, but these are in and of themselves a bit speculative. The majority of the findings of this paper rely upon the POLK antibody and its specificity for POLK, which is not fully characterized and needs further work (validation of antibodies using immunoblots of Polk KO cells or siRNA KD of POLK in murine cells) to provide confidence in the authors' findings.

      Points

      siRNA knockdown of Polk in primary neurons showed a dramatic reduction in signal by IF even though qPCR analysis showed a reduction of only ~35% at the transcript level. Typically many DNA repair genes need to be knocked down by 80% or more to see discernable differences at the protein level. siRNA knockdown in a murine cell line (MEFs, neurons, or some other easily transfectable cell type) needs to be performed with immunoblotting with whole cell and fractionated (nuclear/cytoplasmic) lysates in order to better validate the anti-POLK antibodies and which bands that are visualized during immunoblotting are specific to POLK.

      We performed siRNA and shRNA knockdowns in mouse primary cortical neurons as well as in efficiently transfectable murine lines such as 4T1 and Neuro-2A, showing knockdown of the 99 kDa and 120 kDa bands recognized by the sc-166667 anti-POLK antibody (Figure 1 and Figure S1). We show that in IF, sc-166667 and A12052 (Figure S1G) give similar immunostaining patterns, and we used sc-166667 in all reported figures and western blots.

      Figure 1B and C, it is not clear which antibody(ies) are used for the immunoblotting of nuclear and cytoplasmic fractions and for a blot with whole tissue lysates. Please place the antibody vendor or clone next to the corresponding blot or describe it in the figure legend. Bands of varying sizes are present in 1B (and Figure S1) but only a band at 99 kDa was shown in 1C. Because there are no bands of equivalent size present in the nuclear and cytoplasmic fractions in Figure 1B, please describe or denote which bands were used for quantification purposes for nuclear and cytoplasmic POLK.

      This has been clarified by using only one antibody, sc-166667, throughout the manuscript. In whole-cell lysate we observed an intense ~99 kDa band and a faint ~120 kDa band; the latter becomes intense in the nuclear fraction and is absent from the cytoplasmic fraction. We have noted this in multiple human cell lines and hiPSC-derived neurons, which is part of our ongoing work. We do not yet know whether the ~120 kDa band is a modification or an isoform of POLK. Our proteomics data hint that it may be SUMOylated, ubiquitinylated, or carry other post-translational modifications. We added this to the discussion section.

      Figure 1I, is there a quantification beyond just the representative image? There is no green staining pattern outside the cytoplasm in the 1-month-old M1 images that is present in all the other images in the panel.

      Fig 1I is now Fig S1G in the revised manuscript. Since REV1 and POLH were not central to the study, which focused on POLK, these were meant to be exploratory data panels, and as such we did not quantify beyond the qualitative evaluation, which broadly resembled POLK’s disposition with age. We have noted that there is some sample-to-sample variability in the background signal. In general, the signal outside the cytoplasm, as subcellularly segmented by fluorescent Nissl expression, tends to vary by brain area and to be higher in older brains.

      "Association with PRKDC further suggests POLK's role in the "gap-filling" step in the NHEJ repair pathway in neurons." There is no strong evidence in the literature for mammalian POLK playing a role in NHEJ. Some description of a role in HR has been described, however. The reference regarding the iPoKD-MS data set that provides evidence of POLK associating with BER and NHEJ factors is listed as Paul, 2022 but is in the reference list as Shilpi Paul 2022.

      We removed this speculative statement and fixed the citation.

      Figure 4A, what is the age of the mouse for the representative images?

      19 months; this is now mentioned in the figure legend.

      Figure 4C, Could the data from the different ages be plotted side by side to better evaluate the differences for each cell type/region?

      The data are now plotted side by side.

      Why was the one-month time point chosen as this could still represent the developing and not mature murine brain? 

      The reviewer correctly notes that a 1-month-old brain is still developing, but mostly from the standpoint of behavioral and circuit maturation. From the perspective of cell division and neurogenesis, however, development is considered complete by the first postnatal month, with neuron production thereafter largely restricted to specialized adult niches in the dentate gyrus and the subventricular zone–olfactory bulb pathway; these adult neurogenic stem cells are embryonically derived and are regulated in ways that are distinct from the early, expansionary developmental waves of neurogenesis. In our study we performed our measurements in cortical areas only (Caviness et al., 1995, PMID: 7482802; Ansorg et al., 2012, PMID: 22564330; Ming & Song, 2011, PMID: 21609825; Bond et al., 2015, PMID: 26431181; Bond et al., 2021, PMID: 33706926; Bartkowska et al., 2022, PMID: 36078144). Also, Figure 6A incorrectly stated that the young group was just 1 month old; we rechecked our metadata and found that the young group comprised 1- and 2-month-old brains, and this has now been corrected.

      Furthermore, can the authors describe which sex of mice was used in these experiments and the justification if a single sex was used? If both sexes were used, were there any dimorphic differences in POLK localization patterns?

      This is an important aspect, but at the outset, to keep mouse numbers within manageable limits, we focused on the age component. Both male and female brains were assayed, but because of the uneven sample distribution between sexes we could not estimate whether there were any statistically significant sexually dimorphic differences in INs, PNs and NNs. Future studies will investigate the sex component as a function of age.

      The suggestion of POLK as a brain aging clock may be a bit premature as the functional and behavioral consequences of cytoplasmic POLK sequestration are not fully known. Furthermore, investigation of POLK levels in other genetic models of neurodegeneration or with gerotherapeutics would be needed to establish if the POLK brain clock is responsive to changes that shift brain aging. Lastly, this clock may be impractical and not useful for longitudinal studies due to the terminal nature of assessing POLK levels.

      We agree that, as a terminal assay, POLK nucleo-cytoplasmic status is not practical for longitudinal studies. However, we believe it may serve as an investigative/correlative endogenous signal for determining tissue age that may be useful to "date" brain sections, since few such cell biological markers exist. We have added clarifying text.

      Some discussion of the Polk-null mice is warranted, as they only have a slightly shortened lifespan, and any disease phenotypes were not reported. This stands in contrast to other DNA repair-deficient mice that mimic premature aging and show behavioral and motor deficits. This calls into question the role of POLK in brain aging.

      Discussion statements on Polk-null mice have been added.

      Please correct the catalog number for the SCBT anti-POLK antibody to sc-166667

      The typographical error has been corrected.

      Reviewer #2 (Recommendations for the authors):

      Results:

      Figure by figure 

      (1) A progressive age-associated shift in subcellular localization of POLK The authors state that POLK has not been studied in nervous tissue before and they want to see if it is expressed, and if it changes subcellular location as a function of age. The authors argue age = stress like that seen in previous models using genotoxic agents and cancer cells. Indeed, POLK seems to convincingly change subcellular location from the nucleus to larger cytosolic puncta. 

      (2) Nuclear POLK co-localizes with DNA damage response and repair proteins This was a difficult dataset for me to decipher. To me, it appears as though POLK colocalizes with these examined proteins in the CYTOSOL, not the nucleus. Especially, in the oldest mice.

      We have added to the discussion that DNA repair proteins have been observed in the cytoplasm and in biomolecular condensates, citing relevant reviews and primary references.

      (3) POLK in the cytoplasm is associated with stress granules and lysosomes in old brains LAMP1 has some issues as a lysosome marker. The authors even state it can be on endosomes. It would be nice to use a marker for mature lysosomes, some fluorescent reporter that is activated only by lysosomal proteases or pH. It is also of interest if POLK is localized to the membrane or the inside of these structures. The authors have access to an airyscan which is sufficient to examine luminal vs membrane localization on larger organelles like lysosomes.

      We thank the reviewer for pushing us to investigate the nature of cytoplasmic POLK in endo-lysosomal compartments. We have now added a full-page figure (Figure 3) with the cell biological results from six different markers, a subset of which (Cathepsin B and D) are known to be present in the lumens of endo-lysosomes. Higher-resolution analysis of membrane vs. lumen localization was not pursued, as it is perhaps better suited to cultured neurons than to thick fixed tissues.

      (4) Differentially altered POLK subcellular expression amongst excitatory, inhibitory, and nonneuronal cells in the cortex.

      This seems fine. I don't see anything wrong with the author's statement that there is more POLK in neurons vs non-neuronal cells. 

      (5) Microglia associated with IN and PN have significantly higher levels of cytoplasmic POLK I don't see really any convincing evidence of the author's claim here. They find a difference at early-old age, but not at old-old, or other ages. This is explained by "However, this effect is lost in late-old age (Figure 5D), likely due to the MG-mediated removal of the INs.". But no trend being observed, no experiment to show sufficiency, and no experiment to uncover a directional relationship; this is a tough claim to stand by.

      Changes have been made in the text to reflect the speculative nature of this observation.

      (6) Subcellular localization of POLK is regulated by neuronal activity

      Interesting and fairly difficult experiment. Can the authors talk more about what these values mean? I am confused as to why there is a decline in nuclear puncta at 80 min. Also, why are POLK counts in 6c similar at baseline between young and early-old? In Figures 5 and 6 I also worry about statistical analysis. Are all assumptions checked to use t-tests? Why not always use a test that has fewer assumptions?

      We have explained in the text that acute slice preparations lasting a few hours are artificial and inherently stressful, especially for old brains, and are therefore very different from the vascular-perfused, PFA-fixed brain tissues compared between young and old ages.

      We do not have a proper explanation for the initial dip in nuclear puncta at 80 min, which is of very similar magnitude in young and old brains. It could be a separate biological phenomenon that occurs at much shorter time scales and would not otherwise be captured in a fixed-tissue assay; it needs careful investigation using live-tissue fluorescence imaging, which is beyond the scope of this manuscript.

      We apologize for the typographical error in the figure legend. We rechecked our R code and the tests were all Wilcoxon rank-sum (Mann–Whitney U) two-sided nonparametric.

      Figure 6B & E had absurdly small p values due to large sample numbers. We therefore implemented random sampling of 100 cells, repeated 200 times, presented the distributions of p values and Cohen’s d in the supplement, and reported the median p value and median Cohen’s d in the main plot.
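      The repeated-subsampling procedure described in this response can be sketched as follows. This is a minimal, standard-library Python illustration and not the authors' R code: the function names, sampling without replacement, the fixed seed, and the normal approximation to the rank-sum test (midranks for ties, no tie correction of the variance) are all assumptions made for the example.

```python
import math
import random
import statistics


def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum p value via the normal approximation.

    Assumption: midranks are used for ties and the variance is not
    tie-corrected, so this is only an approximation for heavily tied data.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    i, n = 0, len(pooled)
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    na, nb = len(a), len(b)
    u = rank_sum_a - na * (na + 1) / 2
    mu = na * nb / 2
    sigma = math.sqrt(na * nb * (na + nb + 1) / 12)
    z = (u - mu) / sigma
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))


def subsample_summary(a, b, n_cells=100, n_repeats=200, seed=0):
    """Median p value and median Cohen's d over repeated random subsamples."""
    rng = random.Random(seed)
    p_values, effect_sizes = [], []
    for _ in range(n_repeats):
        sub_a = rng.sample(a, n_cells)  # assumption: sampling without replacement
        sub_b = rng.sample(b, n_cells)
        p_values.append(rank_sum_p(sub_a, sub_b))
        effect_sizes.append(cohens_d(sub_a, sub_b))
    return statistics.median(p_values), statistics.median(effect_sizes)
```

      Fixing the subsample size decouples the reported p value from the total number of cells imaged, while the median Cohen's d gives an effect size that does not shrink or grow with sample number.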

      (7) POLK as an endogenous "aging clock" for brain tissue

      Trainable model. What are the criteria for the model, and how does it work? The cutoffs it uses to classify each age group might be interesting in that the model may have identified a trait the researchers were unaware of. Otherwise, it is not especially useful. Maybe as an independent 'blind' analysis of the data?

      We have added a better description of the models, assumptions and how two different unsupervised approaches converge on the same set of features with high AUROCs.

      Minor questions:

      The cartoons (1a, 2a-b, 5a, 6a) help a lot. However, I still had to work a bit to understand some of the graphs (e.g., 5d, 6b-e, fig 7). Is there a simpler way to present them? Maybe simply additional labelling? I'm not sure.

      A more thorough discussion of statistical tests is warranted I think. I am not very clear why some were chosen (t-test vs nonparametric with fewer assumptions). Infinitesimally small p values also make me think maybe incorrect tests were done or no power analysis was performed beforehand. A fix for this is just discussing what went into the testing methods and why they were chosen.

      The statistical analyses for Fig. 2 (using Generalized Estimating Equations) and Fig. 6 (with random repeated subsampling; the method is explained in the text, the figure legend has been updated, and supplementary data on the distributions of p values and Cohen’s d have been added) have been revised to address the very small p values. The descriptions in the relevant text have been rewritten.

      In the absence of further mechanistic experiments, it would still be interesting to hear what the authors think is going on and what the significance of this altered subcellular location means. How do the authors think this is occurring? I think they are arguing that cytosolic localization of POLK is 100% detrimental to the neuron. ("The reduction of nuclear POLK in old brains is congruent with an increase in DNA damage markers") Do they have any idea what the 'bug' is in the POLK system then?

      Statements have been added to the discussion.

      Reviewer #3 (Recommendations for the authors):

      POLK is detected as small "speckles" inside the nucleus at a young age (1-2 months) and larger "granules" can be seen in the cytoplasm at progressively older time points (>9 months). In the nucleus, is POLK bound to DNA? In the cytoplasm, how are the POLK molecules organized: are they bound to a substrate or are they just organized as a protein condensate without DNA?

      In the human U2OS cell line, DNase I treatment leads to loss of POLK from the nucleus as well as loss of its activity, as reported in Fig. 5 of Paul, S. et al. 2023 bioRxiv. While we have not reproduced these results in mouse primary neurons, we anticipate a similar situation, which will be tested in the future. We have addressed limited aspects of cytoplasmic POLK in the entirely new Fig. 3 with six endo-lysosomal markers, and have added text.

      When POLK proteins accumulate in the cytoplasm in aging cells, do they also repair condensates in the cytoplasm? What is the function of cytoplasmic POLK granules? More generally, is it known if other granules or foci, such as repair foci are found in the cytoplasms in aging cells, or in cells under stress?

      Six markers for endo-lysosomes were tested to characterize the cytoplasmic granules now shown in Fig3.

      While the authors quantify the number and sizes of the POLK signal, they don't discuss their physical nature. Some membrane-less condensates exhibit liquid-like properties, such as stress granules, P-bodies, or in the nucleus some repair condensates. In some diseased tissues, some condensates lose their liquid properties and become solid-like. Is it known if POLK condensates behave like liquid condensates or they are simply formed by bound molecules on DNA? Since they are larger and fewer in the cytoplasm, is it because several small puncta fused together to form a larger one? It would be worthwhile to discuss these points.

      Discussion statements on the nature of condensates in the context of the POLK cytoplasmic signal have been added.

    1. eLife Assessment

      This structural biology study provides insights into the assembly of the GID/CTLH E3 ligase complex. The multi-subunit complex forms unique, ring-shaped assemblies, and the findings presented here describe a "specificity code" that regulates formation of subunit interfaces. The data supporting the conclusions are convincing, both in thoroughness and rigor. This study will be valuable to biochemists and structural biologists, and could lay the foundation for novel designed protein assemblies.

    2. Reviewer #2 (Public review):

      Summary:

      This is a very interesting study focusing on a remarkable oligomerization domain, the LisH-CTLH-CRA module. The module is found in a diverse set of proteins across evolution. The present manuscript focuses on the extraordinary elaboration of this domain in GID/CTLH RING E3 ubiquitin ligases, which assemble into a gigantic, highly ordered, oval-shaped megadalton complex with strict subunit specificity. The arrangement of LisH-CTLH-CRA modules from several distinct subunits is required to form the oval on the outside of the assembly, allowing functional entities to recruit and modify substrates in the center. Although previous structures had revealed that CTLH-CRA dimerization interfaces share a conserved helical architecture, the molecular rules that govern subunit pairing have not been explored. This was a daunting task in protein biochemistry that was achieved in the present study, which defines this "assembly specificity code" at the structural and residue-specific level.

      The authors used X-ray crystallography to solve high-resolution structures of mammalian CTLH-CRA domains, including RANBP9, RANBP10, TWA1, MAEA, and the heterodimeric complex between RANBP9 and MKLN. They further examined and characterized assemblies by quantitative methods (ITC and SEC-MALS) and qualitatively using nondenaturing gels. Some of their ITC measurements were particularly clever and involved competitive titrations and titrations of varying partners depending on protein behavior. The experiments allowed the authors to discover that affinities for interactions between partners are exceptionally tight, in the pM-nM range, and to distill the basis for specificity while also inferring that additional interactions beyond the LisH-CTLH-CRA modules likely also contribute to stability. Beyond discovering how the native pairings are achieved, the authors were able to use this new structural knowledge to reengineer interfaces to achieve different preferred partnerings.

      Strengths:

      Nearly everything about this work is exceptionally strong.

      (1) The question is interesting for the native complexes, and even beyond that, has potential implications for the design of novel molecular machines.

      (2) The experimental data and analyses are quantitative, rigorous, and thorough.

      (3) The paper is a great read - scholarly and really interesting.

      (4) The figures are exceptional in every possible way. They present very complex and intricate interactions with exquisite clarity. The authors are to be commended for outstanding use of color and color-coding throughout the study, including in cartoons to help track what was studied in what experiments. And the figures are also outstanding aesthetically.

      Weaknesses:

      There are no major weaknesses of note, but I can make a few recommendations for editing the text.

    1. Assessment in the Education System: Stakes, Mechanisms, and Prospects for Change

      Summary of the talk

      This summary document analyzes a teacher-researcher's reflections on the nature and evolution of assessment within the French education system.

      The analysis highlights the persistent unease surrounding traditional grading and proposes a transition toward "positive assessment".

      The central premise is that assessment should no longer be merely a certification tool belonging to the system, but should become a driver of learning that the student gradually takes ownership of.

      The ultimate goal is to turn the act of assessing into a lever for success and autonomy, moving beyond the mere "misunderstanding" of the grade to establish a genuine culture of reflection on action.

      --------------------------------------------------------------------------------

      1. Historical Perspective and Paradoxes of Grading

      Numerical assessment in France is not a natural given but a historical construct tied to the functions of selection and certification.

      The roots of the grade: Grading out of 10 was introduced under Jules Ferry for the certificat d'études primaires, following a logic of rationalization inherited from the French Revolution.

      Grading out of 20, for its part, appeared with the creation of the baccalauréat in 1808 by Napoleon, marking a symbolic hierarchy between secondary and primary education.

      The evolution of the social stakes: In 1900, only 1% of an age cohort obtained the baccalauréat, compared with more than 60% at the end of the 20th century.

      This change of scale makes academic failure (the 7% who leave without a diploma) socially "fatal", whereas it used to be the norm.

      The "constante macabre" (macabre constant): A concept from André Antibi, cited to illustrate teachers' tendency to reproduce a Gaussian curve (a distribution of grades between good and poor students) regardless of what students have actually learned, for fear of lacking credibility or selectivity.

      --------------------------------------------------------------------------------

      2. Deconstructing the Assessment Process

      Assessment is defined as a three-step cognitive process, often invisible, that is distinct from simply communicating a result.

      The pillars of the process

      The referent (le référent): What one refers to (the model, the criteria, the ideal objective).

      The author stresses the importance of building this referent concretely, or even co-constructing it with the students.

      The referred (le référé): The student's actual performance, the object observed (written work, oral presentation, technical gesture).

      Measuring the gap: The estimate of the distance between the referred and the referent. The author points out that one never truly "measures" in education (there is no standard meter); one "cobbles together" an estimate.

      The difference between assessing and communicating

      There is a major distinction between producing the assessment (the teacher's internal analysis) and communicating it (the grade or comment).

      The current unease often stems from a failure of communication or an inadequate encoding of this gap.

      --------------------------------------------------------------------------------

      3. Typology of Assessment Codes

      The system uses various codes to express assessment, each with its own limitations:

      | Assessment code | Characteristics and limitations |
      | --- | --- |
      | Grades (0-10 / 0-20) | Dominant system in France (decimal system). Perceived as rational but often used to rank rather than to foster learning. |
      | Open comments | Intended as advice, they are often redundant ("Very good" for a 16) or too specialized to be understood without feedback. |
      | Letters (A, B, C, D, E) | Often a failure in France because they are modeled on the average (A = above, E = below), losing their purpose of creating homogeneous groups. |
      | Smileys and color codes | Useful for communication internal to the class; less stigmatizing and centered on the psychological function. |
      | Assessment rubrics | The most complete tool and the closest to competencies (like a pilot's checklist), but extremely cumbersome to manage day to day. |

      --------------------------------------------------------------------------------

      4. Assessment as a Driver of Learning

      Moving toward positive assessment requires an epistemological break.

      Formative vs. summative assessment: The author refuses to choose between the two ("les deux mon colonel", i.e. both at once).

      Assessment must be formative (providing information to adjust teaching) during training, and summative (certifying a level) at exam time.

      The loop of reflective action: Drawing on Philippe Perrenoud and Marguerite Altet, the author proposes a cycle: Action -> Reflection -> Theorization -> Practice -> Return to action. Assessment is the reflective activity at the heart of this cycle.

      "Dispossession": The point is that the teacher should no longer be the sole holder of the assessment. Students must learn to self-assess in order to become autonomous. "There is no student autonomy as long as they are not capable of self-assessment."

      --------------------------------------------------------------------------------

      5. Institutional and Professional Dimensions

      Assessment is presented as a "first professional act" for which teachers are, paradoxically, given little training.

      The lack of training: Teacher training is often split between subject knowledge and didactics, neglecting cross-cutting professional skills such as assessment and guidance.

      The role of the school: An isolated innovation in assessment (such as a class without grades) is fragile.

      To move the system, action must be carried by the school's team, together with its leadership, to create "leverage".

      The reflective stance: Assessment should bear not only on students but also on teaching practices themselves.

      It is necessary to assess the assessment mechanisms (meta-assessment) through analyses of educational situations.

      --------------------------------------------------------------------------------

      Key Quotations

      "The paradox of the teaching profession is that one is not always trained in its first professional act: assessing and guiding."

      "One cannot not assess. We are condemned to assess."

      "Assessment must be formative during training and summative during certification. I would not board an Airbus whose pilot had only ever used a flight simulator."

      "Making assessment the driver of learning is the best path toward knowledge and the ability to act."

      --------------------------------------------------------------------------------

      Conclusion

      Assessment in the French education system stands at a crossroads between a selective legacy from the 19th century and the social needs of the 21st century.

      Moving from assessment that is endured to assessment as a "driver" requires clarifying the communication contract with the student, co-constructing the success criteria, and bringing assessment back to the heart of the reflective practice of teachers and school heads.

      The learner's autonomy, the ultimate purpose of school, necessarily depends on their ability to assess their own progress toward knowledge.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study by Wu et al. uses endogenous bruchpilot expression in a cell-type-specific manner to assess synaptic heterogeneity in adult Drosophila melanogaster mushroom body output neurons. The authors performed genomic on locus tagging of the presynaptic scaffold protein bruchpilot (BRP) with one part of splitGFP (GFP11) using the CRISPR/Cas9 methodology and co-expressed the other part of splitGFP (GFP1-10) using the GAL4/UAS system. Upon expression of both parts of splitGFP, fluorescent GFP is assembled at the N-terminus of BRP, exactly where BRP is endogenously expressed in active zones. For manageable analysis, a high-throughput pipeline was developed. This analysis evaluated parameters like location of BRP clusters, volume of clusters, and cluster intensity as a direct measure of the relative amount of BRP expression levels on site, using publicly available 3D analysis tools that are integrated in Fiji. Analysis was conducted for different mushroom body cell types in different mushroom body lobes using various specific GAL4 drivers. To test this new method of synapse assessment, Wu et al. performed an associative learning experiment in which an odor was paired with an aversive stimulus and found that, in a specific time frame after conditioning, the new analysis solidly revealed changes in BRP levels at specific synapses that are associated with aversive learning.

      Strengths:

      Expression of splitGFP bound to BRP enables intensity analysis of BRP expression levels as exactly one GFP molecule is expressed per BRP. This is a great tool for synapse assessment. This tool can be widely used for any synapse as long as driver lines are available to co-express the other part of splitGFP in a cell-type-specific manner. As neuropils and thus the BRP label can be extremely dense, the analysis pipeline developed here is very useful and important. The authors have chosen an exceptionally dense neuropil - the mushroom bodies - for their analysis and convincingly show that BRP assessment can be achieved with such densely packed active zones. The result that BRP levels change upon associative learning in an experiment with odor presentation paired with punishment is likewise convincing, and strongly suggests that the tool and pipeline developed here can be used in an in vivo context.

      Weaknesses:

      Although BRP is an important scaffold protein and its expression levels were associated with function and plasticity, I am still somewhat reluctant to accept that synapse structure profiling can be inferred from only assessing BRP expression levels and BRP cluster volume. Also, is it guaranteed that synaptic plasticity is not impaired by the large GFP fluorophore? Could the GFP10 construct that is tagged to BRP in all BRP-expressing cells, independent of GAL4, possibly hamper neuronal function? Is it certain that only active zones are labeled? I do see that plastic changes are made visible in this study after an associative learning experiment with BRP intensity and cluster volume as read-out, but I would be reassured by direct measurement of synaptic plasticity with splitGFP directly connected to BRP, maybe at a different synapse that is more accessible.

      We appreciate the reviewer’s comments. In the revised manuscript, we have clarified that Brp is an important, but not the only, player in the active zone. We have included new data to demonstrate that split-GFP tagging does not severely affect the localization and plasticity of Brp or the function of synapses, by showing: (1) nanoscopic localization of Brp::rGFP using STED imaging; (2) colocalization between Brp::rGFP and anti-Brp signals/VGCCs; (3) activity-dependent Brp remodeling in R8 photoreceptors; and (4) no defect in memory performance when labeling Brp::rGFP in KCs. These four lines of additional evidence further corroborate our approach of characterizing endogenous Brp as a proxy of active zone structure.

      Reviewer #2 (Public review):

      Summary:

      The authors developed a cell-type specific fluorescence-tagging approach using a CRISPR/Cas9 induced split-GFP reconstitution system to visualize endogenous Bruchpilot (BRP) clusters as presynaptic active zones (AZ) in specific cell types of the mushroom body (MB) in the adult Drosophila brain. This AZ profiling approach was implemented in a high-throughput quantification process, allowing for the comparison of synapse profiles within single cells, cell types, MB compartments, and between different individuals. The aim is to analyse in more detail neuronal connectivity and circuits in this centre of associative learning. These are notoriously difficult to investigate due to the density of cells and structures within a cell. The authors detect and characterize cell-type-specific differences in BRP-dependent profiling of presynapses in different compartments of the MB, while intracellular AZ distribution was found to be stereotyped. Next to the descriptive part characterizing various AZ profiles in the MB, the authors apply an associative learning assay and detect consequent AZ re-organisation.

      Strengths:

      The strength of this study lies in the outstanding resolution of synapse profiling in the extremely dense compartments of the MB. This detailed analysis will be the entry point for many future analyses of synapse diversity in connection with functional specificity to uncover the molecular mechanisms underlying learning and memory formation and neuronal network logics. Therefore, this approach is of high importance for the scientific community and a valuable tool to investigate and correlate AZ architecture and synapse function in the CNS.

      Weaknesses:

      The results and conclusions presented in this study are, in many aspects, well-supported by the data presented. To further support the key findings of the manuscript, additional controls, comments, and possibly broader functional analysis would be helpful. In particular:

      (1) All experiments in the study are based on split-GFP lines (BRP:GFP11 and UAS-GFP1-10). The Materials and Methods section does not contain any cloning strategy (gRNA, primer, PCR/sequencing validation, exact position of tag insertion, etc.) and only refers to a bioRxiv publication. It might be helpful to add a Materials and Methods section (at least for the BRP:GFP11 line). Additionally, as this is an on-locus insertion in the BRP ORF, it needs a general validation of this line, including controls (Western Blot and correlative antibody staining against BRP) showing that overall BRP expression is not compromised due to the GFP insertion and localizes as BRP in wild type flies, that flies are viable, have no defects in locomotion and learning and memory formation, and that MB morphology is not affected compared to wild type animals.

      We thank the reviewer for suggesting these important validations. We included details of the design of the construct and insertion site to the Methods section, performed several new experiments to validate the split-GFP tagging of Brp, and present the data in the revision.

      First, to examine whether the transcription of the brp gene is unaffected by the insertion of GFP<sub>11</sub>, we conducted qRT-PCR to compare the brp mRNA levels between brp::GFP<sub>11</sub>, UAS-GFP1-10 flies and UAS-GFP1-10 flies, and found no difference (Figure 1 - figure supplement 1A).

      To further verify the effect of GFP<sub>11</sub> tagging at the protein level, we performed anti-Brp (nc82) immunohistochemistry of brains where GFP is reconstituted pan-neuronally. We found unaltered neuropil localization of nc82 signals (Figure 1 - figure supplement 1C). In presynaptic terminals of the mushroom body calyx, we found that Brp::rGFP was integrated into nc82 accumulations (Figure 1D). We performed super-resolution microscopy to verify the configuration of Brp::rGFP and confirmed the donut-shaped arrangement of Brp::rGFP in the terminals of motor neurons (see Wu, Eno et al., 2025 PLOS Biology), corroborating the nanoscopic assembly of Brp::rGFP at active zones (Kittel et al., 2006 Science).

      Furthermore, co-expression of RFP-tagged voltage-gated calcium channel alpha subunit Cacophony (Cac) and Brp::rGFP in PAM-γ5 dopaminergic neurons revealed strong presynaptic colocalization of their punctate clusters (Figure 1E), suggesting that rGFP tagging of Brp did not damage key protein assembly at active zones (Kawasaki et al., 2004 J Neuroscience; Kittel et al., 2006 Science).

      These lines of evidence suggest that the localization of endogenous Brp is barely affected by the C-terminal GFP<sub>11</sub> insertion or by the GFP reconstitution it enables. This is in line with a large body of studies confirming that the N-terminal region and coiled-coil domains of Brp, but not its C-terminal region, are necessary and sufficient for active zone localization (Fouquet et al., 2009 J Cell Biol; Oswald et al., 2010 J Cell Biol; Mosca and Luo, 2014 eLife; Kiragasi et al., 2017 Cell Rep; Akbergenova et al., 2018 eLife; Nieratschker et al., 2009 PLoS Genet; Johnson et al., 2009 PLoS Biol; Hallermann et al., 2010 J Neurosci). We nevertheless report homozygous lethality and found decreased immunoreactive signals in flies carrying the GFP<sub>11</sub> insertion (Figure 1 - figure supplement 1B).

      For these reasons, we used heterozygotes in all experiments; accordingly, there is no conspicuous locomotion defect of the kind reported in the original study (Wagh et al., 2005 Neuron). To functionally validate the heterozygotes, we measured the aversive olfactory memory performance of flies in which GFP reconstitution was induced in Kenyon cells using R13F02-GAL4. We found that none of these transgenes altered mushroom body morphology (Figure 7 - figure supplement 1) or memory performance compared to wild-type flies (Figure 7 - figure supplement 2), suggesting that the synapse function required for short-term memory formation is not affected by split-GFP tagging of Brp.

      (2) Several aspects of image acquisition and high-throughput quantification data analysis would benefit from a more detailed clarification.

      (a) For BRP cluster segmentation, the Materials and Methods state that intensity threshold and noise tolerance were "set" - this setting has a large effect on the quantification, and it should be specified and the setting criteria named and justified (if set manually (how and why) or automatically (to what)). Additionally, if Python was used for "Nearest Neighbor" analysis, the code should be made available within this manuscript; otherwise, it is difficult to judge the quality of this quantification step.

      (b) To better evaluate the quality of both the imaging analysis and image presentation, it would be important to state, if presented and analysed images are deconvolved and if so, at least one proof of principle example of a comparison of original and deconvoluted file should be shown and quantified to show the impact of deconvolution on the output quality as this is central to this study.

      We thank the reviewer for suggesting these clarifications. We have added more description to the revised manuscript to clarify the segmentation settings, which were manually adjusted to optimize the F-score (previous Figure 1D, now moved to Figure 1 - figure supplement 5). We have included the code used for analyzing nearest neighbor distance, AZ density and local Brp density in the revised manuscript (Supplementary file 1), together with a pre-processed sample data sheet (Supplementary file 2).
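      For readers unfamiliar with the metric, a nearest neighbor distance calculation can be sketched as follows. This is only an illustration and is not the code in Supplementary file 1: the function name, the brute-force search and the example centroids are all assumptions made here, and the coordinates are hypothetical.

```python
import math

def nearest_neighbor_distances(points):
    """For each 3D point, return the Euclidean distance to its closest other point.

    Brute-force O(n^2) search; fine for a sketch, slow for very large point sets.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    distances = []
    for i, p in enumerate(points):
        best = math.inf
        for j, q in enumerate(points):
            if i != j:
                best = min(best, math.dist(p, q))
        distances.append(best)
    return distances

# Hypothetical cluster centroids (x, y, z), in arbitrary length units.
centroids = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (5.0, 5.0, 5.0)]
nnd = nearest_neighbor_distances(centroids)
```

      For anisotropic image stacks, the voxel coordinates would first need to be converted to physical units so that XY and Z distances are comparable.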

      Regarding image deconvolution, we have clarified the differential use of deconvolved and not-deconvolved images in the revised manuscript. We have also included a quantitative evaluation of Richardson-Lucy iterative deconvolution (Figure 1 - figure supplement 4). We used 20 iterations due to only marginal FWHM improvement beyond this point (Figure 1 - figure supplement 4).

      (3) The major part of this study focuses on the description and comparison of the divergent synapse parameters across cell-types in MB compartments, which is highly relevant and interesting. Yet it would be very interesting to connect this new method with functional aspects of the heterogeneous synapses. This is done in Figure 7 with an associative learning approach, which is, in part, not trivial to follow for the reader and would profit from a more comprehensive analysis.

      (a) It would be important for the understanding and validation of the learning induced changes, if not (only) a ratio (of AZ density/local intensity) would be presented, but both values on their own, especially to allow a comparison to the quoted, previous AZ remodelling analysis quantifying BRP intensities (ref. 17, 18). It should be elucidated in more detail why only the ratio was presented here.

      We thank the reviewer for the suggestion on the presentation of learning-induced Brp remodeling. The reported values in Figure 7C are the correlation coefficients between AZ density and local intensity in each compartment, not the ratio. These results suggest that subcompartment-sized clusters of AZs with high Brp accumulation (Figure 6) undergo local structural remodeling upon associative learning (Figure 7). For clarity, we have added a schematic of this correlation and an example scatter plot to Figure 6. Unlike the previous studies (refs 17 and 18), we did not observe robust learning-dependent changes in the Brp intensity, possibly due to some confounding factors such as overall expression levels and conditioning protocols as described in the previous and following points, respectively.
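      To make the distinction between a correlation coefficient and a ratio concrete, a Pearson correlation between two per-region measurements can be sketched as below. This is an assumption-laden illustration, not the authors' analysis: the response does not state which correlation measure was used, and the function name and the example values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical local AZ densities and local Brp intensities in four regions.
az_density = [1.0, 2.0, 3.0, 4.0]
local_intensity = [2.0, 4.0, 6.0, 8.0]
r_positive = pearson_r(az_density, local_intensity)
r_negative = pearson_r(az_density, local_intensity[::-1])
```

      Unlike a ratio, the coefficient is insensitive to the overall scale of either variable, so a uniform change in expression level would leave it unchanged.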

      (b) The reason why a single instead of a dual odour conditioning was performed could be clarified and discussed (would that have the same effects?).

      (c) Additionally, "controls" for the unpaired values - that is, in flies receiving neither shock nor odour - it would help to evaluate the unpaired control values in the different MB compartments.

      We use single odor conditioning because it is the simplest way to examine the effect of odor-shock association by comparing the paired and unpaired group. Standard differential conditioning with two odors contains unpaired odor presentation (CS-) even in the ‘paired’ group. We now show that single-odor conditioning induces memory that lasts one day as in differential conditioning (Figure 7B; Tully and Quinn, J Comp Phys A 1985).

      (d) The temporal resolution of the effect is very interesting (Figure 7D), and at more time points, especially between 90 and 270 min, this might raise interesting results.

      The sampling time points after training were chosen at approximately logarithmic intervals, as the memory decay is roughly exponential (Figure 7B). This transient remodeling is consistent with the previous studies reporting that the Brp plasticity was short-lived (Zhang et al., 2018 Neuron; Turrel et al., 2022 Current Biol).

      (e) Additionally, it would be very interesting and rewarding to have at least one additional assay, relating structure and function, e.g. on a molecular level by a correlative analysis of BRP and synaptic vesicles (by staining or co-expression of SV-protein markers) or calcium activity imaging or on a functional level by additional learning assays.

      We thank the reviewer for raising this important point. We have performed calcium imaging of KC presynaptic terminals to correlate the structure and function in another study (see Figure 2 in Wu, Eno et al., 2025 PLOS Biology for more detail). The basal presynaptic calcium pattern along the γ compartments is strikingly similar to the compartmental heterogeneity of Brp accumulation (see also Figure 2 in this study). Considering colocalization of other active-zone components, such as Cac (Figure 1E), we propose that the learning-induced remodeling of local Brp clusters should transiently modulate synaptic properties.

      In response to the other reviewers’ interest, we used Brp::rGFP to measure different forms of Brp-based structural plasticity upon constant light exposure in the photoreceptors and upon silencing rab3 in KCs. Since these experiments nicely reproduced the results of previous studies (Sugie et al., 2015 Neuron; Graf et al., 2009 Neuron), we believe the learning-induced plasticity of Brp clustering in KCs has a transient nature.

      Reviewer #3 (Public review):

      Summary:

      The authors develop a tool for marking presynaptic active zones in Drosophila brains, dependent on the GAL4 construct used to express a fragment of GFP, which will incorporate with a genome-engineered partial GFP attached to the active zone protein bruchpilot - signal will be specific to the GAL4-expressing neuronal compartment. They then use various GAL4s to examine innervation onto the mushroom bodies to dissect compartment-specific differences in the size and intensity of active zones. After a description of these differences, they induce learning in flies with classic odour/electric shock pairing and observe changes after conditioning that are specific to the paired conditioning/learning paradigm.

      Strengths:

      The imaging and analysis appear strong. The tool is novel and exciting.

      Weaknesses:

      I feel that the tool could do with a little more characterisation. It is assumed that the puncta observed are AZs with no further definition or characterisation.

      We performed additional validation on the tool, including (1) nanoscopic localization of Brp::rGFP using STED imaging; (2) colocalization between Brp::rGFP and anti-Brp signals/VGCCs (Figure 1D-E); (3) activity-dependent active zone remodeling in R8 photoreceptors (Figure 1F). These will be detailed in our point-by-point response below.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The authors keep stating, they profile or assess synaptic structure by analyzing BRP localization, cluster volume, and intensity. However, I do not think that BRP cluster volume and intensity warrant an educated statement about presynaptic structure as a whole. I do not challenge the usefulness of BRP cluster analysis for synapse evaluation, but as there are so many more players involved in synaptic function, BRP analysis certainly cannot explain it all. This should at least be discussed.

      It is correct that Brp is not the only player in the active zone. We have included more discussion on the specific role of Brp (line 84 to 89) and other synaptic markers (line 250) and edited potentially misleading text.

      (2) I do see that changes in BRP expression were observed following associative learning, but is it certain, that synaptic plasticity is generally unaffected by the large GFP fluorophore? BRP is grabbing onto other proteins, both with its C- and N-termini. As the GFP is right before the stop codon, it should be at the N-terminus. How far could BRP function be hampered by this? Is there still enough space for other proteins to interact?

      We thank the reviewer for sharing these concerns. Here we provide three lines of evidence to demonstrate that the Brp assembly at active zones required for synaptic plasticity is unaffected by split-GFP tagging.

      First, we assessed olfactory memory of flies that have Brp::rGFP labeled in Kenyon cells and found the performance comparable to wild-type (Figure 7 - figure supplement 2), suggesting the Brp function required for olfactory memory (Knapek et al., J Neurosci 2011) is unaffected by split-GFP tagging.

      Second, we measured Brp remodeling in photoreceptors induced by constant light exposure (LL; Sugie et al., 2015 Neuron). Consistent with the previous study, we found that LL decreased the numbers of Brp::rGFP clusters in R8 terminals in the medulla, as compared to constant dark condition (DD). This result validates the synaptic plasticity involving dynamic Brp rearrangement in the photoreceptors. We have included this result into the revised manuscript (Figure 1F).

      To further validate protein interaction of Brp::rGFP, we focused on Rab3, as it was previously shown to control Brp allocation at active zones (Graf et al., 2009 Neuron). To this end, we silenced rab3 expression in Kenyon cells using RNAi and measured the intensity of Brp::rGFP clusters in γ Kenyon cells. As previously reported in the neuromuscular junction, we found that rab3 knock-down increased Brp::rGFP accumulation at the active zones, suggesting that Brp::rGFP reflects the interaction with Rab3. We have included all the new data in the revised manuscript (Figure 1 - figure supplement 3).

      (3) It may well be that not only active-zone-associated BRP is labeled but possibly also BRP molecules elsewhere in the neuron. I would like to see more validation, e.g., the percentage of tagged endogenous BRP associated with other presynaptic proteins.

      To determine to what extent Brp::rGFP clusters represent active zones, we double-labelled Brp::rGFP and Cac::tdTomato (Cacophony, the alpha subunit of the voltage-gated calcium channels). We found that 97% of Brp::rGFP clusters showed co-localization with Cac::tdTomato in PAM-γ5 dopamine neuron terminals (Figure 1E), suggesting most Brp::rGFP clusters represent functional AZs.

      (4) Z-size is ~200 nm, while x/y pixel size is ~75 nm during acquisition. How far down does the resolution go after deconvolution?

      The Z-step was 370 nm and XY pixel size was 79 nm for image acquisition. We performed 20 iterations of Richardson-Lucy deconvolution using an empirical point spread function (PSF). We found that the effect of deconvolution on the full-width at half maximum (FWHM) of Brp::rGFP clusters improves only marginally beyond 20 iterations, when the XY FWHM is around 200 nm and the XZ FWHM is around 450 nm (Figure 1 - figure supplement 4).
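      The relationship between Richardson-Lucy iterations and FWHM can be illustrated with a one-dimensional toy example. The sketch below is not the authors' pipeline: it assumes a noise-free synthetic Gaussian profile, a Gaussian stand-in for the empirical PSF, zero-padded convolution, and an estimate initialized with the observed data. All function names and numeric values are hypothetical.

```python
import math

def convolve_same(signal, kernel):
    """Zero-padded convolution returning len(signal) samples (odd-length kernel)."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + half - k
            if 0 <= j < n:
                acc += signal[j] * weight
        out.append(acc)
    return out

def richardson_lucy(observed, psf, iterations):
    """Multiplicative Richardson-Lucy update, started from the observed profile."""
    psf_flipped = psf[::-1]
    estimate = list(observed)
    eps = 1e-12  # guards against division by zero in empty regions
    for _ in range(iterations):
        blurred = convolve_same(estimate, psf)
        ratio = [o / (b + eps) for o, b in zip(observed, blurred)]
        correction = convolve_same(ratio, psf_flipped)
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate

def fwhm(profile):
    """FWHM of a single-peaked profile, with linear interpolation at half maximum.

    Assumes the profile falls below half maximum on both sides of the peak.
    """
    peak = max(profile)
    half = peak / 2.0
    left = right = profile.index(peak)
    while left > 0 and profile[left] >= half:
        left -= 1
    while right < len(profile) - 1 and profile[right] >= half:
        right += 1
    x_left = left + (half - profile[left]) / (profile[left + 1] - profile[left])
    x_right = right - (half - profile[right]) / (profile[right - 1] - profile[right])
    return x_right - x_left

# Synthetic blurred profile (sigma = 3 samples) and normalized PSF (sigma = 2 samples).
observed = [math.exp(-((i - 30) ** 2) / (2 * 3.0 ** 2)) for i in range(61)]
raw_psf = [math.exp(-(j ** 2) / (2 * 2.0 ** 2)) for j in range(-8, 9)]
psf = [w / sum(raw_psf) for w in raw_psf]

restored = richardson_lucy(observed, psf, iterations=20)
fwhm_observed = fwhm(observed)
fwhm_restored = fwhm(restored)
```

      Evaluating `fwhm` after each iteration count is one way to see where the improvement levels off. Real image data contain noise, which this toy example ignores.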

      (5) Figure Legend 7: What is a "cytoplasm membrane marker"? Does this mean membrane-bound tdTom is sticking into the cytoplasm?

      We apologize for the typo and have corrected it to “plasma membrane marker”.

      (6) At the end of the introduction: "characterizing multiple structural parameters..." - which were these parameters? I was under the assumption that BRP localization, cluster volume, and intensity were assessed. I do not see how these are structural parameters. Please define what exactly is meant by "structural parameters".

      We apologize for the confusion. By “structural parameters”, we indeed referred to the volume, intensity and molecular density of Brp::rGFP clusters. We have revised the sentence to “Characterizing the distinct parameters and localization of Brp::rGFP cluster.”

      (7) Next to last sentence of the introduction: "Characterizing multiple structural parameters revealed a significant synaptic heterogeneity within single neurons and AZ distribution stereotypy across individuals." What do the authors mean by "significant synaptic heterogeneity"?

      By “synaptic heterogeneity”, we refer to the intracellular variability of active zone cytomatrices reported by Brp clusters. For instance, the intensities of Brp::rGFP clusters within Kenyon cell subtypes were variable among compartments (Figure 2). Intracellular variability of the Brp concentration of individual active zones was higher in DPM and APL neurons than Kenyon cells (Figure 3). These variabilities demonstrate intracellular synaptic heterogeneity. We have revised the sentence to be more specific to the different characters of Brp clusters.

      (8) I do not understand the last sentence of the introduction. "These cell-type-specific synapse profiles suggest that AZs are organized at multiple scales, ranging from neighboring synapses to across individuals." What do the authors mean by "ranging from neighboring synapses to across individuals"? Does this mean that even neighboring synapses in the same cell can be different?

      We have revised the sentence to “These cell-type-specific synapse profiles suggest that AZs are spatially organized at multiple scales, ranging from interindividual stereotypy to neighboring synapses in the same cells.”

      By “neighboring synapses”, we refer to the nearest neighbor similarity in Brp levels in some cell-types (Figure 6A-C), and also the sub-compartmental dense AZ clusters with high Brp level in Kenyon cells (Figure 6D-H). By “across individuals”, we refer to the individually conserved active zone distribution patterns in some neurons (Figure 5).

      (9) The title talks about cell-type-specific spatial configurations. I do not understand what is meant by "spatial configurations"? Do you mean BRP cluster volume? I think the title is a little misleading.

      By “spatial configuration”, we refer to the arrangement of Brp clusters within individual mushroom body neurons. This statement is based on our findings on the intracellular synaptic heterogeneity (see also response to comment #7). We have streamlined the text description in the revised manuscript for clarity.

      Reviewer #2 (Recommendations for the authors):

      (1) For Figure 3A: exemplary two AZs are compared here, a histogram comparing more AZs would aid in making the point that in general, AZ of similar size have different BRP level (intensities) and how much variation exists.

      We have added histograms of Brp::rGFP intensity and cluster volumes to Figure 3 in the revised manuscript.

      (2) Line 52: "endogenous synapses" is a confusing term; it's probably meant that the protein levels within the synapse are endogenous and not overexpressed. 

      We apologize for the confusion and have revised the term to “endogenous synaptic proteins.”

      (3) It is not clear from the Materials and Methods section, whether and where deconvolved or not-deconvolved images were used for the quantification pipeline. Please comment on this. 

      We have now revised the Methods section to clarify how deconvolved and non-deconvolved images were used differently in the pipeline.

      (4) Line 664 (C) not bold.

      We have corrected the error.

      (5) 725 "Files" should be Flies.

      We have corrected the error.

      (6) 727 two times "first".

      We have corrected the error.

      (7) Figure 7. All (A) etc., not bold - there should be consistent annotation. 

      We thank the reviewer for the detailed proofreading and have corrected all the errors spotted.

      Reviewer #3 (Recommendations for the authors):

      (1) Has there been an expression of the construct in a non-neuronal cell? Astrocyte-like cell? Any glia? As some sort of control for background and activity?

      As the reviewer suggested, we verified the neuronal expression specificity of Brp::rGFP. Using R86E01-GAL4 and Amon-GAL4, we compared Brp::rGFP in astrocyte-like glia and neuropeptide-releasing neurons. In contrast to neurons, we found no Brp::rGFP puncta in the neuropils in astrocyte-like glia, suggesting Brp::rGFP is specific to neurons. We have included this new dataset in the revised manuscript (Figure 1 - figure supplement 2).

      (2) Similarly, expression of the construct co-expressed with a channelrhodopsin, and induction of a 'learning'-like regime of activity, similarly in a control type of experiment, expression of an inwardly rectifying channel (e.g. Kir2.1) to show that increases in size of the BRP puncta are truly activity dependent? The NMJ may be an optimal neuron to use to see the 'donut' structures of the AZs and their increase with activity. Also, are these truly AZs we are seeing here? Perhaps try co-expressing cacophony-dsRed? If the GFP Puncta are active zones, then they should be surrounded by cacophony.

      We would like to clarify that we did not find Brp::rGFP size increase upon learning. Instead, we demonstrated that associative training transiently remodelled sub-compartment-sized AZ “hot spots” in Kenyon cells, indicated by the correlation of local intensity and AZ density (Figure 6-7).

      To demonstrate split-GFP tagging does not affect activity-dependent plasticity associated with Brp, we measured Brp remodeling in photoreceptors induced by constant light exposure (LL; Sugie et al., 2015 Neuron). Consistent with the previous study, we found that LL decreased the numbers of Brp::rGFP clusters in R8 terminals in the medulla, as compared to constant dark condition (DD). This result validates the synaptic plasticity involving dynamic Brp rearrangement in the photoreceptors (Figure 1F).

      As the reviewer suggested, we performed STED microscopy on larval motor neurons and confirmed the donut-shaped arrangement of Brp::rGFP (Wu, Eno et al., PLOS Biol 2025).

      Also following the reviewer’s suggestion, we double-labelled Brp::rGFP and Cac::tdTomato (Cacophony, the alpha subunit of the voltage-gated calcium channels). We found that 97% of Brp::rGFP clusters showed co-localization with Cac::tdTomato in PAM-γ5 dopamine neuron terminals (Figure 1E), suggesting most Brp::rGFP clusters represent functional AZs.

      (3) In the introduction: Intro, a sentence about BRP - central organiser of the active zone, so a key regulator of activity.

      We have added a few more sentences about the role of Brp in the active zones to the revised manuscript.

      (4) Figure 1 E, line 650 'cite the resource here'. 

      We thank the reviewer for pointing out the error and we have corrected it.

      (5) Many readers may not be MB aficionados, and to make the data more accessible, perhaps use a cartoon of an MB with the cell bodies of the neurons around the MB expressing the constructs highlighted so that the reader can have a wider idea of the anatomy in relation to the MB.

      We appreciate these comments and have appended cartoons of the MB to figures to help readers understand the anatomy.

    1. that are located inside other tags

      <div style="color: blue;">

      This text will be blue.

      </div>

      This is style inheritance: the blue color is applied to the div and to everything it contains, that is, to its nested tags (but not all properties are inheritable).

    1. Reviewer #2 (Public review):

      Summary:

      This manuscript uses single-molecule run-off experiments and TASEP/HMM models to estimate biophysical parameters, i.e., ribosomal initiation and elongation rates. Combining inferred initiation and elongation rates, the authors quantify ribosomal density. TASEP modeling was used to simulate the mechanistic dynamics of ribosomal translation, and the HMM is used to link ribosomal dynamics to microscope intensity measurements. The authors' main conclusions and findings are:

      - Ribosomal elongation rates and initiation rates are strongly coordinated.

      - Elongation rates were estimated between 1 and 4.5 aa/sec. Initiation rates were estimated between 1 and 2 ribosomes/min. These values agree with previously reported ones.

      - Ribosomal density was determined to be below 12% for all constructs and conditions.

      - eIF5A-perturbations (GC7 inhibition) resulted in non-significant changes in translational bursting and ribosome density.

      - eIF5A perturbations affected both elongation and initiation rates.

      Strengths:

      This manuscript presents an interesting scientific hypothesis to study ribosome initiation and elongation concurrently. This topic is relevant for the field. The manuscript presents a novel quantitative methodology to estimate ribosomal initiation rates from Harringtonine run-off assays. This is relevant because run-off assays have been used to estimate, exclusively, elongation rates.

      Comments on revisions:

      The authors have addressed my concerns. Specifically, they have expanded the discussion on unexpected eIF5A perturbation results, calculated CAI values for all constructs, and made code and data publicly available via GitHub and Zenodo. The mathematical notation is now consistent, and all variables are properly defined.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary:

      In this study, Lamberti et al. investigate how translation initiation and elongation are coordinated at the single-mRNA level in mammalian cells. The authors aim to uncover whether and how cells dynamically adjust initiation rates in response to elongation dynamics, with the overarching goal of understanding how translational homeostasis is maintained. To this end, the study combines single-molecule live-cell imaging using the SunTag system with a kinetic modeling framework grounded in the Totally Asymmetric Simple Exclusion Process (TASEP). By applying this approach to custom reporter constructs with different coding sequences, and under perturbations of the initiation/elongation factor eIF5A, the authors infer initiation and elongation rates from individual mRNAs and examine how these rates covary.

      The central finding is that initiation and elongation rates are strongly correlated across a range of coding sequences, resulting in consistently low ribosome density (≤12% of the coding sequence occupied). This coupling is preserved under partial pharmacological inhibition of eIF5A, which slows elongation but is matched by a proportional decrease in initiation, thereby maintaining ribosome density. However, a complete genetic knockout of eIF5A disrupts this coordination, leading to reduced ribosome density, potentially due to changes in ribosome stalling resolution or degradation.
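      A back-of-the-envelope calculation shows how initiation and elongation rates jointly set ribosome density. The sketch below is not the authors' TASEP-HMM inference. It assumes the first-order low-density limit, in which ribosomes rarely interfere and the density is roughly the initiation rate divided by the elongation rate. The 10-codon ribosome footprint and the example rates are assumptions chosen for illustration.

```python
def coding_sequence_coverage(initiation_per_min, elongation_aa_per_s, footprint_codons=10):
    """Approximate fraction of codons covered by ribosomes at low density.

    density (ribosomes per codon) ~ initiation rate / elongation rate,
    coverage ~ density * ribosome footprint. Exclusion effects are ignored.
    """
    if initiation_per_min < 0 or elongation_aa_per_s <= 0 or footprint_codons <= 0:
        raise ValueError("rates and footprint must be positive")
    initiation_per_s = initiation_per_min / 60.0
    ribosomes_per_codon = initiation_per_s / elongation_aa_per_s
    return ribosomes_per_codon * footprint_codons

# Illustrative values only: 2 initiations per minute, 3 codons per second.
example_coverage = coding_sequence_coverage(2.0, 3.0)
```

      In this approximation, a slowdown in elongation leaves coverage unchanged only if initiation falls by the same factor, which is the kind of proportional adjustment described above.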

      Strengths:

      A key strength of this work is its methodological innovation. The authors develop and validate a TASEP-based Hidden Markov Model (HMM) to infer translation kinetics at single-mRNA resolution. This approach provides a substantial advance over previous population-level or averaged models and enables dynamic reconstruction of ribosome behavior from experimental traces. The model is carefully benchmarked against simulated data and appropriately applied. The experimental design is also strong. The authors construct matched SunTag reporters differing only in codon composition in a defined region of the coding sequence, allowing them to isolate the effects of elongation-related features while controlling for other regulatory elements. The use of both pharmacological and genetic perturbations of eIF5A adds robustness and depth to the biological conclusions. The results are compelling: across all constructs and conditions, ribosome density remains low, and initiation and elongation appear tightly coordinated, suggesting an intrinsic feedback mechanism in translational regulation. These findings challenge the classical view of translation initiation as the sole rate-limiting step and provide new insights into how cells may dynamically maintain translation efficiency and avoid ribosome collisions.

      We thank the reviewer for their constructive assessment of our work, and for recognizing the methodological innovation and experimental rigor of our study.

      Weaknesses:

      A limitation of the study is its reliance on exogenous reporter mRNAs in HeLa cells, which may not fully capture the complexity of endogenous translation regulation. While the authors acknowledge this, it remains unclear how generalizable the observed coupling is to native mRNAs or in different cellular contexts.

      We agree that the use of exogenous reporters is a limitation inherent to the SunTag system, for which there is currently no simple alternative for single-mRNA translation imaging. However, we believe our findings are likely generalizable for several reasons.

      As discussed in our introduction and discussion, there is growing mechanistic evidence in the literature for coupling between elongation (ribosome collisions) and initiation via pathways such as the GIGYF2-4EHP axis (Amaya et al. 2018, Hickey et al. 2020, Juszkiewicz et al. 2020), which might operate on both exogenous and endogenous mRNAs.

      As already acknowledged in our limitations section, our exogenous reporters may not fully recapitulate certain aspects of endogenous translation (e.g., ER-coupled collagen processing), yet the observed initiation-elongation coupling was robust across all tested constructs and conditions.

      We have now expanded the Discussion (L393-395) to cite complementary evidence from Dufourt et al. (2021), who used a CRISPR-based approach in Drosophila embryos to measure translation of endogenous genes. We also added a reference to Choi et al. 2025, who use an ER-specific SunTag reporter to visualize translation at the ER (L395-397).

      Additionally, the model assumes homogeneous elongation rates and does not explicitly account for ribosome pausing or collisions, which could affect inference accuracy, particularly in constructs designed to induce stalling. While the model is validated under low-density assumptions, more work may be needed to understand how deviations from these assumptions affect parameter estimates in real data.

      We agree with the reviewer that the assumption of homogeneous elongation rates is a simplification, and that our work represents a first step towards rigorous single-trace analysis of translation dynamics. We have explicitly tested the robustness of our model to violations of the low-density assumption through simulations (Figure 2 - figure supplement 2). These show that while parameter inference remains accurate at low ribosome densities, accuracy slightly deteriorates at higher densities, as expected. In fact, our experimental data do provide evidence for heterogeneous elongation: the waiting times between termination events deviate significantly from an exponential distribution (Figure 3 - figure supplement 2C), indicating the presence of ribosome stalling and/or bursting, consistent with the reviewer's concern. We acknowledge in the Limitations section (L402-406) that extending the model to explicitly capture transcript-dependent elongation rates and ribosome interactions remains challenging. The TASEP is difficult to solve analytically under these conditions, but we note that simulation-based inference approaches, such as particle filters to replace HMMs, could provide a path forward for future work to capture this complexity at the single-trace level.
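      For orientation, the homogeneous TASEP discussed here can be simulated in a few lines. The sketch below is an illustrative Gillespie simulation and not the authors' TASEP-HMM: it assumes a single elongation rate for all codons, termination at the same rate as elongation, and a hypothetical 10-codon ribosome footprint. The parameter values are made up for the example.

```python
import random

def simulate_tasep(length, footprint, alpha, k, t_end, rng):
    """Gillespie simulation of a homogeneous TASEP with extended particles.

    length: codons; footprint: codons per ribosome; alpha: initiation rate;
    k: elongation (and termination) rate; all rates per second.
    Returns (time-averaged ribosome count, list of termination times).
    """
    ribosomes = []  # leading-edge positions, ribosome nearest the stop codon first
    t = 0.0
    occupancy_time = 0.0
    terminations = []
    while True:
        # A ribosome can step if the one ahead is more than a footprint away.
        movable = [i for i, p in enumerate(ribosomes)
                   if i == 0 or ribosomes[i - 1] - p > footprint]
        can_initiate = not ribosomes or ribosomes[-1] >= footprint
        total = k * len(movable) + (alpha if can_initiate else 0.0)
        dt = rng.expovariate(total)
        if t + dt > t_end:
            occupancy_time += len(ribosomes) * (t_end - t)
            break
        occupancy_time += len(ribosomes) * dt
        t += dt
        if rng.random() * total < k * len(movable):
            i = rng.choice(movable)
            if ribosomes[i] == length - 1:
                ribosomes.pop(i)  # only the front ribosome can be at the last codon
                terminations.append(t)
            else:
                ribosomes[i] += 1
        else:
            ribosomes.append(0)
    return occupancy_time / t_end, terminations

rng = random.Random(0)
mean_count, terminations = simulate_tasep(
    length=300, footprint=10, alpha=0.03, k=3.0, t_end=20000.0, rng=rng)
coverage = mean_count * 10 / 300
```

      In this homogeneous model, gaps between successive entries in `terminations` stay close to exponentially distributed at low density. Strong departures from that in real traces are the kind of signature of stalling or bursting that the response describes.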

      Furthermore, although the study observes translation "bursting" behavior, this is not explicitly modeled. Given the growing recognition of translational bursting as a regulatory feature, incorporating or quantifying this behavior more rigorously could strengthen the work's impact.

      While we do not explicitly model the bursting dynamics in the HMM framework, we have quantified bursting behavior directly from the data. Specifically, we measure the duration of translated (ON) and untranslated (OFF) periods across all reporters and conditions (Figure 1G for control conditions and Figure 4G-H for perturbed conditions), finding that active translation typically lasts 10-15 minutes interspersed with shorter silent periods of 5-10 minutes. This empirical characterization demonstrates that bursting is a consistent feature of translation across our experimental conditions. The average duration of silent periods is similar to what was inferred by Livingston et al. 2023 for a similar SunTag reporter, whereas the average duration of active periods is substantially shorter (~15 min instead of ~40 min), which is consistent with the shorter trace duration in our system compared to theirs (~15 min compared to ~80 min, on average). Incorporating an explicit two-state or multi-state bursting model into the TASEP-HMM framework would indeed be computationally intensive and represents an important direction for future work, as it would enable inference of switching rates alongside initiation and elongation parameters. We have added this point to the Discussion (L415-417).
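      Measuring ON and OFF period durations from a trace amounts to run-length encoding a binarized signal. The sketch below is a hypothetical illustration and not the authors' procedure: it assumes the trace has already been thresholded into translating (1) and silent (0) frames, and the frame interval and example trace are invented.

```python
from itertools import groupby

def on_off_durations(states, frame_interval_min):
    """Split a binarized trace into ON and OFF run durations (in minutes).

    Runs touching the first or last frame are truncated by the observation
    window, so they underestimate the true period length.
    """
    on_durations, off_durations = [], []
    for state, run in groupby(states):
        duration = sum(1 for _ in run) * frame_interval_min
        if state:
            on_durations.append(duration)
        else:
            off_durations.append(duration)
    return on_durations, off_durations

# Hypothetical binarized trace sampled every 0.5 min.
trace = [1, 1, 1, 0, 0, 1, 0, 0, 0]
on_durations, off_durations = on_off_durations(trace, 0.5)
```

      The truncation noted in the docstring is one reason why short traces would yield shorter apparent ON periods, in line with the comparison of trace durations made above.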

      Assessment of Goals and Conclusions:

      The authors successfully achieve their stated aims: they quantify translation initiation and elongation at the single-mRNA level and show that these processes are dynamically coupled to maintain low ribosome density. The modeling framework is well suited to this task, and the conclusions are supported by multiple lines of evidence, including inferred kinetic parameters, independent ribosome counts, and consistent behavior under perturbation.

      Impact and Utility:

      This work makes a significant conceptual and technical contribution to the field of translation biology. The modeling framework developed here opens the door to more detailed and quantitative studies of ribosome dynamics on single mRNAs and could be adapted to other imaging systems or perturbations. The discovery of initiation-elongation coupling as a general feature of translation in mammalian cells will likely influence how researchers think about translational regulation under homeostatic and stress conditions.

      The data, models, and tools developed in this study will be of broad utility to the community, particularly for researchers studying translation dynamics, ribosome behavior, or the effects of codon usage and mRNA structure on protein synthesis.

      Context and Interpretation:

      This study contributes to a growing body of evidence that translation is not merely controlled at initiation but involves feedback between elongation and initiation. It supports the emerging view that ribosome collisions, stalling, and quality control pathways play active roles in regulating initiation rates in cis. The findings are consistent with recent studies in yeast and metazoans showing translation initiation repression following stalling events. However, the mechanistic details of this feedback remain incompletely understood and merit further investigation, particularly in physiological or stress contexts. 

      In summary, this is a thoughtfully executed and timely study that provides valuable insights into the dynamic regulation of translation and introduces a modeling framework with broad applicability. It will be of interest to a wide audience in molecular biology, systems biology, and quantitative imaging.

      We appreciate the reviewer's thorough and positive assessment of our work, and that they recognize both the technical innovation of our modeling framework and its potential broad utility to the translation biology community. We agree that further mechanistic investigation of initiation-elongation feedback under various physiological contexts represents an important direction for future research.

      Reviewer #2 (Public review):

      Summary:

      This manuscript uses single-molecule run-off experiments and TASEP/HMM models to estimate biophysical parameters, i.e., ribosomal initiation and elongation rates. Combining inferred initiation and elongation rates, the authors quantify ribosomal density. TASEP modeling was used to simulate the mechanistic dynamics of ribosomal translation, and the HMM is used to link ribosomal dynamics to microscope intensity measurements. The authors' main conclusions and findings are:

      (1) Ribosomal elongation rates and initiation rates are strongly coordinated.

      (2) Elongation rates were estimated between 1-4.5 aa/sec. Initiation rates were estimated between 0.5-2.5 events/min. These values agree with previously reported values.

      (3) Ribosomal density was determined below 12% for all constructs and conditions.

      (4) eIF5A-perturbations (KO and GC7 inhibition) resulted in non-significant changes in translational bursting and ribosome density.

      (5) eIF5A perturbations resulted in increases in elongation and decreases in initiation rates.
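As a rough consistency check on findings (2) and (3), the low-density limit of a TASEP relates the three quantities: ribosomes initiating at rate α and moving at rate λ are spaced λ/α codons apart on average, so the covered fraction of the coding sequence is about αℓ/λ, where ℓ is the ribosome footprint. The sketch below is not the authors' inference code. The function name and the footprint of 10 codons are assumptions made for illustration.

```python
def ribosome_density(init_per_min: float, elong_aa_per_s: float,
                     footprint_codons: float = 10.0) -> float:
    """Approximate fraction of the CDS covered by ribosomes at low density.

    Assumes independent ribosomes (no collisions), so the mean spacing between
    ribosomes is elong / init codons. footprint_codons is an assumed value,
    not one taken from the manuscript.
    """
    init_per_s = init_per_min / 60.0
    return init_per_s * footprint_codons / elong_aa_per_s


# Slow/slow and fast/fast corners of the ranges quoted in the summary.
slow = ribosome_density(0.5, 1.0)
fast = ribosome_density(2.5, 4.5)
print(f"{slow:.3f} {fast:.3f}")
```

Under these assumptions both corners fall below 12%, whereas pairing the fastest initiation rate with the slowest elongation rate would not. In this approximation, a uniformly low density therefore requires the two rates to co-vary.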

      Strengths:

      This manuscript presents an interesting scientific hypothesis to study ribosome initiation and elongation concurrently. This topic is highly relevant for the field. The manuscript presents a novel quantitative methodology to estimate ribosomal initiation rates from Harringtonine run-off assays. This is relevant because run-off assays have been used to estimate, exclusively, elongation rates.

      We thank the reviewer for their careful evaluation of our work and for recognizing the novelty of our quantitative methodology to extract both initiation and elongation rates from harringtonine run-off assays, extending beyond the traditional use of these experiments.

      Weaknesses:

      The conclusion of the strong coordination between initiation and elongation rates is interesting, but some results are unexpected, and further experimental validation is needed to ensure this coordination is valid. 

      We agree that some of our findings need further experimental investigation in future studies. However, we believe that the coordination between initiation and elongation is supported by multiple results in our current work: (1) the strong correlation observed across all reporters and conditions (Figure 3E), and (2) the consistent maintenance of low ribosome density despite varying elongation rates. While additional experimental validation would be valuable, we note that directly manipulating initiation or elongation independently in mammalian cells remains technically challenging. Nevertheless, our findings are consistent with emerging mechanistic understanding of collision-sensing pathways (GIGYF2-4EHP) that could mediate such coupling, as discussed in our manuscript.

      (1) eIF5a perturbations resulted in a non-significant effect on the fraction of translating mRNA, translation duration, and bursting periods. Given the central role of eIF5a, I would have expected a different outcome. I would recommend that the authors expand the discussion and review more literature to justify these findings.

      We appreciate this comment. This finding is indeed discussed in detail in our manuscript (Discussion, paragraphs 6-7). As we note there, while eIF5A plays a critical role in elongation, the maintenance of bursting dynamics and ribosome density upon perturbation can be explained by compensatory feedback mechanisms. Specifically, a coordinated decrease in initiation rates counterbalances slower elongation to maintain homeostatic ribosome density. We also discuss several factors that complicate interpretation: (1) potential RQC-mediated degradation masking stronger effects in proline-rich constructs, (2) differences between GC7 treatment and genetic knockout suggesting altered stalling resolution kinetics, and (3) the limitations of using exogenous reporters that lack ER-coupled processing, which may be critical for eIF5A function in endogenous collagen translation (as suggested by Rossi et al., 2014; Mandal et al., 2016; Barba-Aliaga et al., 2021). The mechanistic complexity and tissue-specific nature of eIF5A function in mammals, which differs substantially from the better-characterized yeast system, likely contribute to the nuanced phenotype we observe. We believe our Discussion adequately addresses these points.

      (2) The AAG construct leading to slow elongation is very surprising. It is the opposite of the field consensus, where codon-optimized gene sequences are expected to elongate faster. More information about each construct should be provided. I would recommend more bioinformatic analysis on this, for example, calculating CAI for all constructs, or predicting the structures of the proteins.

      We agree that the slow elongation of the AAG construct is counterintuitive and indeed surprising. Following the reviewer's suggestion, we have now calculated the Codon Adaptation Index (CAI) for all constructs (Renilla 0.89, Col1a1 0.78, Col1a1 mutated 0.74). It is therefore unlikely that codon bias explains the slow translation, particularly since we designed the mutated Col1a1 construct with alanine codons selected to respect human codon usage bias, thereby minimizing changes in codon optimality. As we discuss in the manuscript, we hypothesize that the proline-to-alanine substitutions disrupted co-translational folding of the collagen-derived sequence. Prolines are critical for collagen triple-helix formation (Shoulders and Raines, 2009), and their replacement with alanines likely generates misfolded intermediates that cause ribosome stalling (Barba-Aliaga et al., 2021; Komar et al., 2024). This interpretation is supported by the high frequency (>30%) of incomplete run-off traces for AAG, suggesting persistent stalling events. Our findings thus illustrate an important potential caveat: "optimizing" a sequence based solely on codon usage can be detrimental when it disrupts functionally important structural features or co-translational folding pathways.

      This highlights that elongation rates depend not only on codon optimality but also on the interplay between nascent chain properties and ribosome progression.
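For readers unfamiliar with the metric, the Codon Adaptation Index mentioned above is the geometric mean, over the codons of a sequence, of each codon's relative adaptiveness w (its usage frequency divided by that of the most used synonymous codon). The sketch below illustrates the calculation only. The usage table is a small illustrative stand-in covering two amino acids, and the function names are hypothetical. It is not the procedure or reference set the authors used.

```python
import math

# Illustrative codon frequencies for two amino acids only. Replace with a full
# usage table for the organism of interest before drawing any conclusion.
USAGE = {
    "A": {"GCC": 0.40, "GCT": 0.26, "GCA": 0.23, "GCG": 0.11},
    "P": {"CCC": 0.33, "CCT": 0.28, "CCA": 0.27, "CCG": 0.12},
}


def relative_adaptiveness(usage):
    """Map each codon to w = frequency / max frequency among its synonyms."""
    weights = {}
    for codons in usage.values():
        best = max(codons.values())
        for codon, freq in codons.items():
            weights[codon] = freq / best
    return weights


def cai(seq, weights):
    """Geometric mean of w over the codons of seq that have a weight."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


W = relative_adaptiveness(USAGE)
print(round(cai("CCCCCC", W), 3))  # two optimal proline codons
print(round(cai("GCCGCG", W), 3))  # one optimal and one rare alanine codon
```

Because the index depends only on codon identity, it cannot capture effects of the encoded amino acids on nascent-chain folding, which is the distinction the response draws.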

      (3) The authors should consider using their methodology to study the effects of modifying the 5'UTR, resulting in changes in initiation rate and bursting, such as previously shown in reference Livingston et al., 2023. This may be outside of the scope of this project, but the authors could add this as a future direction and discuss if this may corroborate their conclusions. 

      We thank the reviewer for this excellent suggestion. We agree that applying our methodology to 5'-UTR variants would provide a complementary test of initiation-elongation coupling, and we have now added this as a future direction in the Discussion (L417-420).

      (4) The mathematical model and parameter inference routines are central to the conclusions of this manuscript. In order to support reproducibility, the computational code should be made available and well-documented, with a requirements file indicating the dependencies and their versions. 

      We have added the Github link in the manuscript (https://github.com/naef-lab/suntag-analysis) and have also deposited the data (.ome.tif) on Zenodo (https://zenodo.org/records/17669332).

      Reviewer #3 (Public review):

      Disclaimer:

      My expertise is in live single-molecule imaging of RNA and transcription, as well as associated data analysis and modeling. While this aligns well with the technical aspects of the manuscript, my background in translation is more limited, and I am not best positioned to assess the novelty of the biological conclusions.

      Summary:

      This study combines live-cell imaging of nascent proteins on single mRNAs with time-series analysis to investigate the kinetics of mRNA translation.

      The authors (i) used a calibration method for estimating absolute ribosome counts, and (ii) developed a new Bayesian approach to infer ribosome counts over time from run-off experiments, enabling estimation of elongation rates and ribosome density across conditions.

      They report (i) translational bursting at the single-mRNA level, (ii) low ribosome density (~10% occupancy ± a few percent), (iii) that ribosome density is minimally affected by perturbations of elongation (using a drug and/or different coding sequences in the reporter), suggesting a homeostatic mechanism potentially involving a feedback of elongation onto initiation, although (iv) this coupling breaks down upon knockout of elongation factor eIF5A.

      Strengths:

      (1) The manuscript is well written, and the conclusions are, in general, appropriately cautious (besides the few improvements I suggest below).

      (2) The time-series inference method is interesting and promising for broader applications. 

      (3) Simulations provide convincing support for the modeling (though some improvements are possible). 

      (4) The reported homeostatic effect on ribosome density is surprising and carefully validated with multiple perturbations.

      (5) Imaging quality and corrections (e.g., flat-fielding, laser power measurements) are robust.

      (6) Mathematical modeling is clearly described and precise; a few clarifications could improve it further.

      We thank the reviewer for recognizing the novelty of the approach and its rigour, and for providing suggestions to improve it further.

      Weaknesses:

      (1) The absolute quantification of ribosome numbers (via the measurement of $i_{MP}$) should be improved. This only affects the finding that ribosome density is low, not that it appears to be under homeostatic control. However, if $i_{MP}$ turns out to be substantially overestimated (hence ribosome density underestimated), then "ribosomes queuing up to the initiation site and physically blocking initiation" could become a relevant hypothesis. In my detailed recommendations to the authors, I list points that need clarification in their quantifications and suggest an independent validation experiment (measuring the intensity of an object with a known number of GFP molecules, e.g., MS2-GFP-labeled RNAs or individual GEMs).

      We agree with the reviewer that the estimation of the number of ribosomes is central to our finding that translation happens at low density on our reporters. This result derives from our measurement of the intensity of one mature protein (i<sub>MP</sub>), which we obtained using a SunTag reporter with an RH1 domain at the C terminus of the mature protein, allowing us to stabilise mature proteins via actin-tethering. In addition, as suggested by the reviewer, we already validated this result with an independent estimate of the mature protein intensity (Figure 5 - figure supplement 2B), which was obtained by adding the mature protein intensity directly as a free parameter of the HMM. The inferred value of mature protein intensity for each construct (10-15 a.u.) was remarkably close to the experimental calibration result (14 ± 2 a.u.). Therefore, we have confidence that our absolute quantification of ribosome numbers is accurate.

      (2) The proposed initiation-elongation coupling is plausible, but alternative explanations, such as changes in abortive elongation frequency, should be considered more carefully. The authors mention this possibility, but should test or rule it out quantitatively. 

      We thank the reviewer for the comment, but we consider that ruling out alternative explanations through new perturbation experiments is beyond the scope of the present work.

      (3) The observation of translational bursting is presented as novel, but similar findings were reported by Livingston et al. (2023) using a similar SunTag-MS2 system. This prior work should be acknowledged, and the added value of the current approach clarified.

      We did cite Livingston et al. (2023) in several places, but we recognize that we could add a few citations in key places to make clear that the observation of bursting is not novel but agrees with previous results. We have now done so in the Results and Discussion sections.

      (4) It is unclear what the single-mRNA nature of the inference method is bringing since it is only used here to report _average_ ribosome elongation rate and density (averaged across mRNAs and across time during the run-off experiments - although the method, in principle, has the power to resolve these two aspects).

      While decoding individual traces, our model infers shared (population-level) rates. Inferring transcript-specific parameters would be more informative, but it is highly challenging due to the uncertainty in the initial ribosome distribution on single transcripts. Pooling multiple transcripts together allows us to use some assumptions on the initial distribution and infer average elongation and initiation-rate parameters, while revealing substantial mRNA-to-mRNA variability in the posterior decoding (e.g. Figure 3 - figure supplement 2C). Indeed, the inference still informs on the single-trace run-off time distribution (Figure 3A) and the waiting time between termination events (Figure 3 - figure supplement 2C), suggesting the presence of stalling and bursting. In addition, the transcript-to-transcript heterogeneity is likely accounted for by our model better than by previous methods (linear fit of the average run-off intensity), as suggested by their comparison (Figure 3 - figure supplement 2A). In the future, the model could be refined by introducing transcript-specific parameters, possibly in a hierarchical way, alongside shared parameters.

      (5) I did not find any statement about data availability. The data should be made available. Their absence limits the ability to fully assess and reproduce the findings.

      We have added the Github link in the manuscript (https://github.com/naef-lab/suntag-analysis) and have also deposited the data (.ome.tif) on Zenodo (https://zenodo.org/records/17669332).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      Major Comments:

      (1) Lack of Explicit Bursting Model

      Although translation "bursts" are observed, the current framework does not explicitly model initiation as a stochastic ON/OFF process. This limits insight into regulatory mechanisms controlling burst frequency or duration. The authors should either incorporate a two-state/more-state (bursting) model of initiation or perform statistical analysis (e.g., dwell-time distributions) to quantify bursting dynamics. They should clarify how bursting influences the interpretation of initiation rate estimates.

      We agree with the reviewer that an explicit bursting model (e.g., a two-state telegraph model) would be the ideal theoretical framework. However, integrating such a model into the TASEP-HMM inference framework is computationally intensive and complex. As a robust first step, we have opted to quantify bursting empirically based on the decoded single-mRNA traces. As shown in Figure 1G (control) and Figure 4G (perturbed conditions), we explicitly measured the duration of "ON" (translated) and "OFF" (untranslated) periods. This statistical analysis provides a quantitative description of the bursting dynamics without relying on the specific assumptions of a telegraph model. We have clarified this in the text (L123-125) and, as suggested, added a discussion (L415-417) on the potential extensions of the model to include explicit switching kinetics in the Outlook section.
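The empirical ON/OFF quantification described above amounts to measuring run lengths in a per-frame translating/not-translating classification. The sketch below shows one way to do that. The function name, the 20-second frame interval, and the decision to drop runs that touch the start or end of a trace (whose true duration is unknown) are assumptions for illustration, not the authors' stated procedure.

```python
from itertools import groupby

FRAME_INTERVAL_S = 20  # assumed here; use the acquisition interval of the movie


def dwell_times(states, frame_interval_s=FRAME_INTERVAL_S, drop_censored=True):
    """Return {"ON": [...], "OFF": [...]} durations in seconds.

    states is a per-frame sequence of booleans (True = translating).
    Runs touching the first or last frame are censored (their true length is
    unknown) and are dropped when drop_censored is True.
    """
    runs = [(state, len(list(group))) for state, group in groupby(states)]
    out = {"ON": [], "OFF": []}
    for idx, (state, n_frames) in enumerate(runs):
        if drop_censored and idx in (0, len(runs) - 1):
            continue
        out["ON" if state else "OFF"].append(n_frames * frame_interval_s)
    return out


trace = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
print(dwell_times([bool(x) for x in trace]))
```

Pooling these durations over many mRNAs gives the dwell-time distributions the reviewer asked for. Keeping the censored runs instead would bias the durations downward.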

      (2) Assumption of Uniform Elongation Rates

      The model assumes homogeneous elongation across coding sequences, which may not hold for stalling-prone inserts (e.g., PPG). This simplification could bias inference, particularly in cases of sequence-specific pausing. The authors should add simulations or a sensitivity analysis to assess how non-uniform elongation affects the accuracy of inferred parameters. They should also explicitly discuss how ribosome stalling, collisions, or heterogeneity might skew model outputs (see point 4).

      A strong stalling sequence that affects all ribosomes equally should not deteriorate the inference of the initiation rate, provided that the low-density assumption holds. The scenario where stalling events lead to higher density, and thus increased ribosome-ribosome interactions, is comparable to the conditions explored in Figure 2E. In those simulations, we tested the inference on data generated with varying initiation and elongation rates, resulting in ribosome densities ranging from low to high. We demonstrated that the inference remains robust at low ribosome densities (<10%). At higher densities, the accuracy of the initiation rate estimate decreases, whereas the elongation rate estimate remains comparatively robust. Additionally, the model tends to overestimate ribosome density under high-density conditions, likely because it neglects ribosome interference at the initiation site (Figure 2 - figure supplement 2C). We agree that a deeper investigation into the consequences of stochastic stalling and bursting would be beneficial, and we have explicitly acknowledged this in the Limitations section.
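To make the low- versus high-density regimes concrete, the following is a minimal Gillespie-style simulation of a TASEP with extended particles and a homogeneous elongation rate. It is a sketch under stated assumptions, not the authors' simulation code. The function name, the footprint of 10 codons, the CDS length of 300 codons, and the rate values are all chosen for illustration.

```python
import random


def simulate_tasep(alpha, lam, length, footprint=10, t_end=20000.0, seed=0):
    """Return the time-averaged fraction of the CDS covered by ribosomes.

    alpha: initiation rate (1/s); lam: elongation rate (codons/s);
    length: CDS length in codons; footprint: codons covered per ribosome
    (assumed value). Ribosomes must stay at least `footprint` codons apart.
    """
    rng = random.Random(seed)
    pos = []                 # ribosome positions in codons, leader first
    t = 0.0
    occupancy_time = 0.0     # integral of the ribosome count over time
    while True:
        can_init = not pos or pos[-1] >= footprint
        movable = [i for i, x in enumerate(pos)
                   if i == 0 or pos[i - 1] - x > footprint]
        total = alpha * can_init + lam * len(movable)
        dt = rng.expovariate(total)
        if t + dt >= t_end:
            occupancy_time += len(pos) * (t_end - t)
            break
        occupancy_time += len(pos) * dt
        t += dt
        if rng.random() * total < alpha * can_init:
            pos.append(0)                    # initiation at the start codon
        else:
            i = rng.choice(movable)
            pos[i] += 1                      # one elongation step
            if pos[i] >= length:
                pos.pop(i)                   # termination (leader only)
    return footprint * occupancy_time / (t_end * length)


low = simulate_tasep(alpha=0.02, lam=3.0, length=300)
print(f"coverage at low initiation rate: {low:.3f}")
```

At low initiation rates the coverage stays close to the collision-free estimate αℓ/λ. Raising `alpha` toward `lam` fills the lattice, and blocking at the start codon becomes frequent, which is the regime where the response reports degraded initiation-rate estimates.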

      (3) Interpretation of eIF5A Knockout Phenotype

      The observation that eIF5A KO reduces initiation more than elongation, leading to decreased ribosome density, is biologically intriguing. However, the explanation invoking altered RQC kinetics is speculative and not directly tested. The authors should consider validating the RQC hypothesis by monitoring reporter mRNA stability, ribosome collision markers, or translation termination intermediates.

      We thank the reviewer for the comment, but we consider that ruling out alternative explanations through new experiments is beyond the scope of the present work.

      (4) To strengthen the manuscript, the authors should incorporate insights from three studies.

      - Livingston et al. (PMC10330622) found that translation occurs in bursts, influenced by mRNA features and initiation factors, supporting the coupling of initiation and elongation.

      - Madern et al. (PMID: 39892379) demonstrated that ribosome cooperativity enhances translational efficiency, highlighting coordinated ribosome behavior.

      - Dufourt et al. (PMID: 33927056) observed that high initiation rates correlate with high elongation rates, suggesting a conserved mechanism across cell cultures and organisms.

      Integrating these studies could enrich the manuscript's interpretation and stimulate new avenues of thought.

      We thank the reviewer for the valuable comment. We added citations of Livingston et al. in the context of translational bursting. We already cited Madern et al. in multiple places and, although its observations of ribosome cooperativity are very compelling, they cannot be linked with our observations of a feedback between initiation and elongation, and it would be very challenging to see a similar effect on our reporters. This is why we did not expressly discuss cooperativity. We also integrated Dufourt et al. in the Discussion about the possibility of designing genetically encoded reporters. We also added a sentence about the possibility of using an ER-specific SunTag reporter, as done recently in Choi et al., Nature (2025) (https://doi.org/10.1038/s41586-025-09718-0).

      Minor Comments:

      (1) Use consistent naming for SunTag reporters (e.g., "PPG" vs "proline-rich") throughout.

      Thank you for the comment. However, the term proline-rich always appears together with PPG, so we believe that the naming is clear and consistent.

      (2) Consider a schematic overview of the experimental design and modeling pipeline for accessibility.

      Thank you for the suggestion. We consider that the experimental design and modeling are now described clearly enough that an additional schematic is not needed.

      (3) Clarify how incomplete run-off traces are handled in the HMM inference.

      Incomplete run-off traces are treated identically to complete traces in our HMM inference. This is possible because our model relies on the probability of transitions occurring per time step to infer rates. It does not require observing the final "empty" state to estimate the kinetic parameters α and λ. The loss of signal (e.g., mRNA moving out of the focal volume or photobleaching) does not invalidate the kinetic information contained in the portion of the trace that was observed. We have clarified this in the Methods section.
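The reason truncation is harmless can be seen in a much simpler setting than the TASEP-HMM: when the likelihood is a product over frame-to-frame transitions, a trace that ends early simply contributes fewer factors. The toy example below assumes a fully observed ribosome count and a constant per-frame termination probability. Both are simplifications introduced here, and the function name is hypothetical.

```python
def termination_probability(traces):
    """Pooled maximum-likelihood estimate of a per-frame termination probability.

    Each trace is a list of ribosome counts, one per frame. Every observed
    frame-to-frame step that starts with at least one ribosome is one trial;
    a decrease by exactly one counts as a termination.
    """
    trials = 0
    terminations = 0
    for trace in traces:
        for before, after in zip(trace, trace[1:]):
            if before > 0:
                trials += 1
                terminations += int(after == before - 1)
    if trials == 0:
        raise ValueError("no informative steps in the traces")
    return terminations / trials


complete = [3, 3, 2, 2, 1, 0]   # run-off observed until the mRNA is empty
truncated = [3, 3, 2]           # signal lost before the last ribosome left
print(termination_probability([complete]))
print(termination_probability([complete, truncated]))
```

The truncated trace adds two trials and one termination to the pooled counts and needs no special treatment. This holds as long as signal loss is unrelated to the ribosome count.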

      Reviewer #2 (Recommendations for the authors):

      (1) Reproducibility:

      (1.1) The authors should use a GitHub repository with a timestamp for the release version.

      The code is available on GitHub (https://github.com/naef-lab/suntag-analysis).

      (1.2) Make raw images and data available in a figure repository like Figshare.

      The raw images (.ome.tif) are now available on Zenodo (https://zenodo.org/records/17669332).

      (2) Paper reorganization and expansion of the intensity and ribosome quantification:

      (2.1) Given the relevance of the initiation and elongation rates for the conclusions of this study, and the fact that the authors inferred these rates from the spot intensities, I recommend that the authors move Figure 1 Supplement 2 to the main text and expand the description of the process to relate spot intensity and number of ribosomes. Please also expand the figure caption for this image.

      We agree with the importance of this validation. We have expanded the description of the calibration experiment in the main text and in the figure caption.

      (2.2) I suggest the authors explicitly mention the use of HMM in the abstract.

      We have now explicitly mentioned the TASEP-based HMM in the abstract.

      (2.3) In line 492, please add the frame rate used to acquire the images for the run-off assays.

      We have added the specific frame rate (one frame every 20 seconds) to the relevant section.

      (3) Figures and captions:

      (3.1) Figure 1, Supplement 2. Please add a description of the colors used in plots B, C. 

      We have expanded the caption and added the color description.

      (3.2) In the Figure 2 caption. It is not clear what the authors mean by "traceseLife". Please ensure it is not a typo.

      Thank you for spotting this. We have corrected the typo.

      (3.3) Figure 1 A, in the cartoon N(alpha)->N-1, shouldn't the transition also depend on lambda?

      The transition probability was explicitly derived in the “Bayesian modeling of run-off traces” section (Eqs. 17-18), and does not depend on λ, but only on the initiation rate under the low-density assumption.

      (3.4) Figure 3, Supplement 2. "presence of bursting and stalling.." has a typo.

      Corrected.

      (3.5) Figure 5, panel C, the y-axis label should be "run-off time (min)."

      Corrected.

      (3.6) For most figures, add significance bars.

      (3.7) In the figure captions, please add the total number of cells used for each condition.

      We have systematically indicated the number of traces (n<sub>t</sub>) and the number of independent experiments (n<sub>e</sub>) in the captions in this format (n<sub>t</sub>, n<sub>e</sub>).

      (4) Mathematical Methods:

      We greatly thank the reviewer for their detailed attention to the mathematical notation. We have addressed all points below.

      (4.1) In line 555, Materials and Methods, subsection Quantification of Intensity Traces, multiple equations are not numbered. For example, after Equation (4), no numbers are provided for the rest of the equations. Please keep consistency throughout the whole document.

      We have ensured that all equations are now consistently numbered throughout the document.

      (4.2) In line 588, the authors mention "$X$ is a standard normal random variable with mean $\mu$ and standard deviation $s_0$". Please ensure this is correct. A standard normal random variable has a 0 mean and std 1. 

      Thank you for the suggestion. We have corrected the text (L678).

      (4.3) Line 546, Equation 2. The authors use mu(x,y) to describe a 2d Gaussian function. But later in line 587, the authors reuse the same variable name in equation 5 to redefine the intensity as mu = b_0 + I.

      We have renamed the 2D Gaussian function to $\mu_{2D}(x,y)$ in the spot tracking section.

      (4.4) For the complete document, it could be beneficial to the reader if the authors expand the definition of the relationship between the signal "y" and the spot intensity "I". Please note how the paragraph in lines 582-587 does not properly introduce "y".

      We have added an explicit definition of y and its relationship to the underlying spot intensity I in the text to improve readability and clarity.

      (4.5) Please ensure consistency in variable names. For example, "I" is used in line 587 for the experimental spot intensity, then line 763 redefines I(t) as the total intensity obtained from the TASEP model; please use "I_sim(t)" for simulated intensities. Please note that reusing the variable "I" for different contexts makes it hard for the reader to follow the text. 

      We agree that this was confusing. We have implemented the suggestion and now distinguish simulated intensities using the notation I<sub>S</sub>.

      (4.6) Line 555 "The prior on the total intensity I is an "uninformative" prior" I ~ half_normal(1000). Please ensure it is not "I_0 ~ half_normal(1000)."? 

      We confirm that “I” is the correct variable representing the total intensity in this context; we do not use an “I<sub>0</sub>” variable here.

      (4.7) In lines 595, equation 6. Ensure that the equation is correct. Shouldn't it be: s_0^2 = ln ( 1 + (sigma_meas^2 / ⟨y⟩^2) )? Please ensure that this is correct and it is not affecting the calculated values given in lines 598.

      Thank you for catching this typo. We have corrected the equation in the manuscript. We confirm that the calculations performed in the code used the correct formula, so the reported values remain unchanged.
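For reference, the form proposed by the reviewer is what moment matching gives if the measured signal $y$ is modeled as log-normal. That noise model is an assumption made here from the shape of the formula, and $m$ is a symbol introduced only for this derivation (the location parameter on the log scale):

```latex
\begin{align}
  \ln y \sim \mathcal{N}(m, s_0^2)
  \;&\Rightarrow\;
  \langle y \rangle = e^{m + s_0^2/2}, \qquad
  \sigma_{\mathrm{meas}}^2 = \left(e^{s_0^2} - 1\right) e^{2m + s_0^2}
  \\
  \frac{\sigma_{\mathrm{meas}}^2}{\langle y \rangle^2} = e^{s_0^2} - 1
  \;&\Rightarrow\;
  s_0^2 = \ln\!\left(1 + \frac{\sigma_{\mathrm{meas}}^2}{\langle y \rangle^2}\right)
\end{align}
```

The location parameter cancels in the ratio, so $s_0$ depends only on the coefficient of variation of the measured signal.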

      (4.8) In line 597, "the mean intensity square ^2". Please ensure it is not "the square of the temporal mean intensity."

      We have corrected the text to "the square of the temporal mean intensity."

      (4.9) In lines 602-619, Bayesian modeling of run-off traces, please ensure to introduce the constant "\ell". Used to define the ribosomal footprint?

      We have added the explicit definition of 𝓁 as the ribosome footprint size (length of transcript occupied by one ribosome) in the "Bayesian modeling of run-off traces" section.

      (4.10) Line 687 has a minor typo "[...] ribosome distribution.. Then, [...]"

      We have corrected the punctuation.

      (4.11) In line 678, Equation 19 introduces the constant "L_S", Please ensure that it is defined in the text.

      We have added the explicit definition of L<sub>S</sub> (the length of the SunTag) to the text surrounding Equation 19.

      (4.12) In line 695, Equation 22, please consider using a subscript to differentiate the variance due to ribosome configuration. For example, instead of "sigma (...)^2" use something like "sigma_c ^2 (...)". Ensure that this change is correctly applied to Equation 24 and all other affected equations.

      Thank you, we have implemented the suggestions.

      (4.13) In line 696, please double-check equations 26 and 27. Specifically, the denominator ^2. Given the previous text, it is hard to follow the meaning of this variable. 

      We have revised the notation in Equations 26 and 27 to ensure the denominator is consistent with the definitions provided in the text.

      (4.14) In line 726, the authors mention "[...], but for the purposes of this dissertation [...]", it should be "[...], but for the purposes of this study [...]"

      Thank you for spotting this. We have replaced "dissertation" with "study."

      (4.15) Equations 5, 28, 37, and the unnumbered equation between Equations 16 and 17 are similar, but in some, "y" does not explicitly depend on time. Please ensure this is correct. 

      We have verified these equations and believe they are correct.

      (4.16) Please review the complete document and ensure that variables and constants used in the equations are defined in the text. Please ensure that the same variable names are not reused for different concepts. To improve readability and flow in the text, please review the complete Materials and Methods sections and evaluate if the modeling section can be written more clearly and concisely. For example, Equation 28 is repeated in the text.

      We have performed a comprehensive review of the Materials and Methods section. To improve conciseness and flow, we have merged the subsection “Observation model and estimation of observation parameters” with the “Bayesian modeling of run-off traces” section. This allowed us to remove redundant definitions and repeated equations (such as the previous Equation 28). We have also checked that all variables and constants are defined upon first use and that variable names remain consistent throughout the manuscript.

      Reviewer #3 (Recommendations for the authors):

      (1) Data Presentation

      (1.1) In main Figures 1D and 4E, the traces appear to show frequent on-off-on transitions ("bursting"), but in supplementary figures (1-S1A and 4-S1A), this behavior is seen in only ~8 of 54 traces. Are the main figure examples truly representative?

      We acknowledge the reviewer's point. In Figure 1D, we selected some of the longest and most illustrative traces to highlight the bursting dynamics. We agree that the term "representative" might be misleading if interpreted as "average." We have updated the text to state "we show bursting traces" to more accurately reflect the selection.

      (1.2) There are 8 videos, but I could not identify which is which.

      Thank you for pointing this out. We have renamed the video files to clearly correspond to the figures and conditions they represent.

      (2) Data Availability:

      As noted above, the data should be shared. This is in accordance with eLife's policy: "Authors must make all original data used to support the claims of the paper, or that are required to reproduce them, available in the manuscript text, tables, figures or supplementary materials, or at a trusted digital repository (the latter is recommended). [...] eLife considers works to be published when they are posted as preprints, and expects preprints we review to meet the standards outlined here." Access to the time traces would have been helpful for reviewers.

      We have now added the Github link for the code (https://github.com/naef-lab/suntag-analysis) and deposited the raw data (.ome.tif files) on Zenodo (10.5281/zenodo.17669332).

      (3) Model Assumptions:

      (3.1) The broad range of run-off times (Figure 3A) suggests stalling, which may be incompatible with the 'low-density' assumption used in the TASEP model, which essentially assumes that ribosomes do not bump into each other. This could impact the validity of the assumptions that ribosomes behave independently, elongate at constant speed (necessary for the continuum-limit approximation), and that the rate-limiting step is initiation. How robust are the inferences to this assumption?

      We agree that the deviation of waiting times from an exponential distribution (Figure 3 - figure supplement 2C) suggests the presence of stalling, which challenges the strict low-density assumption and constant elongation speed. We explicitly explored the robustness of our model to higher ribosome densities in simulations. As shown in Figure 2 - figure supplement 2, while the model accuracy for single parameters deteriorates at very high densities (overestimating density due to neglected interference), it remains robust for estimating global rates in the regime relevant to our data. We have expanded the discussion on the limitations of the low density and homogeneous elongation rate assumptions in the text (L404-408).

      (3.2) Since all constructs share the same SunTag region, elongation rates should be identical there and diverge only in the variable region. This would affect $\gamma (t)$ and hence possibly affect the results. A brief discussion would be helpful.

      This is a valid point. Currently, our model infers a single average elongation rate that effectively averages the behavior over the SunTag and the variable CDS regions. Modeling distinct rates for these regions would be a valuable extension but adds significant complexity. While our current "effective rate" approach might underestimate the magnitude of differences between reporters, it captures the global kinetic trend. We have added a brief discussion acknowledging this simplification (L408-412).

      (3.3) A similar point applies to the Gillespie simulations: modeling the SunTag region with a shared elongation rate would be more accurate.

      We agree. Simulating distinct rates for the SunTag and CDS would increase realism, though our current homogeneous simulations serve primarily to benchmark the inference framework itself. We have noted this as a potential future improvement (L413-414).
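
      The extension discussed in (3.2) and (3.3) can be sketched as a Gillespie simulation of a TASEP in which the elongation rate depends on whether a ribosome sits in the shared SunTag region or in the variable CDS. This is a minimal illustration, not the code from the authors' repository: the function name `simulate_tasep`, its parameters, and all numerical values (lengths in codons, rates per second, footprint) are assumptions chosen for the example.

```python
import random


def simulate_tasep(length, suntag_end, k_init, k_suntag, k_cds,
                   footprint, t_max, seed=0):
    """Gillespie simulation of a TASEP with region-specific elongation rates.

    Returns the ribosome positions (codon indices, leading ribosome first)
    present on the transcript at time t_max. All arguments are hypothetical.
    """
    rng = random.Random(seed)
    positions = []  # sorted in descending order: positions[0] is the leader
    t = 0.0
    while True:
        # Enumerate every move that steric exclusion currently allows.
        moves = []
        if not positions or positions[-1] >= footprint:
            moves.append(("init", -1, k_init))
        for i, x in enumerate(positions):
            free = i == 0 or positions[i - 1] - x > footprint
            if free:
                rate = k_suntag if x < suntag_end else k_cds
                moves.append(("step", i, rate))

        total = sum(rate for _, _, rate in moves)
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_max:
            return positions

        # Pick one move with probability proportional to its rate.
        threshold = rng.random() * total
        cumulative = 0.0
        for kind, index, rate in moves:
            cumulative += rate
            if threshold <= cumulative:
                break

        if kind == "init":
            positions.append(0)
        else:
            positions[index] += 1
            if positions[index] >= length:
                positions.pop(index)  # termination: the leader leaves


# Illustrative values only: a 600-codon SunTag followed by a 200-codon CDS.
occupancy = simulate_tasep(length=800, suntag_end=600, k_init=0.05,
                           k_suntag=3.0, k_cds=1.5, footprint=10,
                           t_max=600.0)
print(len(occupancy), "ribosomes on the transcript at t_max")
```

      Setting `k_suntag` equal to `k_cds` recovers the homogeneous case, so the same routine could in principle be used to check how much a single "effective rate" hides differences confined to the variable region.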

      (3.4) Equation (13) assumes that switching between bursting and non-bursting states is much slower than the elongation time. First, this should be made explicit. Second, this is not quite true (~5 min elongation time on Figure 3-s2A vs ~5-15min switching times on Figure 1). It would be useful to show the intensity distribution at t=0 and compare it to the expected mixture distribution (i.e., a Poisson distribution + some extra 'N=0' cells). 

      We thank the reviewer for this insightful comment. We have added a sentence to the text explicitly stating the assumption that switching dynamics are slower than the translation time. While the timescales are indeed closer than ideal (5 min vs. 5-15 min), this assumption allows for a tractable approximation of the initial conditions for the run-off inference. Comparing the intensity distribution at t=0 to a zero-inflated Poisson distribution is an excellent suggestion for validation, which we will consider for future iterations of the model.
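
      The mixture suggested by the reviewer (a Poisson distribution plus extra 'N=0' transcripts) can be written down compactly. The sketch below is only an illustration of that zero-inflated Poisson form; the function name and the values of `mean_ribosomes` and `p_off` are assumptions, not quantities taken from the manuscript.

```python
import math


def zero_inflated_poisson_pmf(n, mean_ribosomes, p_off):
    """P(N = n) when a fraction p_off of transcripts is silent (N = 0) and
    the remaining transcripts carry a Poisson-distributed ribosome number."""
    poisson = math.exp(-mean_ribosomes) * mean_ribosomes ** n / math.factorial(n)
    extra_zero = p_off if n == 0 else 0.0
    return extra_zero + (1.0 - p_off) * poisson


# Hypothetical parameters: 40% silent transcripts, 3 ribosomes on average.
for n in range(6):
    print(n, round(zero_inflated_poisson_pmf(n, mean_ribosomes=3.0, p_off=0.4), 4))
```

      Comparing such a distribution with the measured intensities at t=0 would additionally require converting ribosome numbers into fluorescence, which this sketch does not attempt.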

      (4) Microscopy Quantifications:

      (4.1) Figure 1-S2A shows variable scFv-GFP expression across cells. Were cells selected for uniform expression in the analysis? Or is the SunTag assumed to be saturated, which would then need to be demonstrated?

      All cell lines used are monoclonal, and cells were selected via FACS for consistent average cytoplasmic GFP signal. We assume the SunTag is saturated based on the established characterization of the system by Tanenbaum et al. (2014), where the high affinity of the scFv-GFP ensures saturation at expression levels similar to ours.

      (4.2) As translation proceeds, free scFv-GFP may become limiting due to the accumulation of mature SunTag-containing proteins. This would be difficult to detect (since mature proteins stay in the cytoplasm) and could affect intensity measurements (newly synthesized SunTag proteins getting dimmer over time).

      This effect can occur with very long induction times. To mitigate this, we optimized the Doxycycline (Dox) incubation time for our harringtonine experiments to prevent excessive accumulation of mature protein. We also monitor the cytoplasmic background for granularity, which would indicate aggregation or accumulation.

      (4.3) The statements "for some traces, the mRNA signal was lost before the run-off completion" (line 195) and "we observed relatively consistent fractions of translated transcripts and trace duration distributions across reporters" (line 340) should be supported by a supplementary figure.

      The first statement is supported by Figure 2 - figure supplement 1, which shows representative run-off traces for all constructs, including incomplete ones.

      The second statement regarding consistency is supported by the quantitative data in Figure 1E and G, which summarize the fraction of translated transcripts and trace durations across conditions.

      (4.4) Measurements of single mature protein intensity $i_{MP}$:

      (4.4.1) Since puromycin is used to disassemble elongating ribosomes, calibration may be biased by incomplete translation products (likely a substantial fraction, since the Dox induction is only 20min and RNAs need several minutes to be transcribed, exported, and then fully translated).

      As mentioned in the “Live-cell imaging” paragraph, the imaging takes place 40 min after the end of Dox incubation. This provides ample time for mRNA export and full translation of the synthesized proteins. Consequently, the fraction of incomplete products generated by the final puromycin addition is negligible compared to the pool of fully synthesized mature proteins accumulated during the preceding hour.

      (4.4.2) Line 519: "The intensity of each spot is averaged over the 100 frames". Do I understand correctly that you are looking at immobile proteins? What immobilizes these proteins? Are these small aggregates? It would be surprising that these aggregates have really only 1, 2, or 3 proteins, as suggested by Figure 1-S2A.

      We are visualizing mature proteins that are specifically tethered to the actin cytoskeleton. This is achieved using a reporter where the RH1 domain is fused directly to the C-terminus of the Renilla protein (SunTag-Renilla-RH1). The RH1 domain recruits the endogenous Myosin Va motor, which anchors the protein to actin filaments, rendering it immobile. Since each Myosin Va motor interacts with one RH1 domain (and thus one mature protein), the resulting spots represent individual immobilized proteins rather than aggregates. We have now revised the text and Methods section to make this calibration strategy and the construct design clearer (L130-140).

      (4.4.3) The estimate of the average intensity $i_{MP}$ of single proteins rests entirely on seeing discrete modes in the histogram of Figure 1-S2B, which is not very convincing. A complementary experiment, measuring *on the same microscope* the intensity of an object with a known number of GFP molecules (e.g., MS2-GFP labeled RNAs, or individual GEMs https://doi.org/10.1016/j.cell.2018.05.042 (only requiring a single transfection)), would reassure the reader that we are not off by an order of magnitude.

      While a complementary calibration experiment would be valuable, we believe our current estimate is robust because it is independently validated by our model. When we inferred $i_{MP}$ as a free parameter in the HMM (Figure 5 - figure supplement 2B), the resulting value (10-15 a.u.) was remarkably consistent with our experimental calibration (14 ± 2 a.u.). We have clarified this independent validation in the text to strengthen the confidence in our quantification (L264-272).

      (4.4.4) Further on the histogram in Figure 1-S2B:

      - The gap between the first two modes is unexpectedly sharp. Can you double-check? It means that we have a completely empty bin between two of the most populated bins.

      We have double-checked the data; the plot is correct, though the sharp gap is likely due to the small sample size (n=29).

      - I am surprised not to see 3 modes or more, given that panel A shows three levels of intensity (the three colors of the arrows).

      As noted below, brighter foci exist but fall outside the displayed range of the histogram.

      - It is unclear what the statistical test is and what it is supposed to demonstrate.

      The Student's t-test compares the means of the two identified populations to confirm they are statistically distinct intensity groups.

      - I count n = 29, not 31. (The sample is small enough that the bars of the histogram show clear discrete heights, proportional to 1, 2, 3, 4, and 5 --adding up all the counts, I get 29). Is there a mistake somewhere? Or are some points falling outside of the displayed x-range?

      You are correct. Two brighter data points fell outside the displayed range. The total number of foci in the histogram is 29. We have corrected the figure caption and the text accordingly.

      (5) Miscellaneous Points: 

      (5.1) Panel B in Figure 2-s1 appears to be missing.

      The figure contains only one panel.

      (5.2) In Equation (7), $l$ is not defined (presumably ribosome footprint length?). Instead, $J$ is defined right after eq (7), as if it were used in this equation.

      Thank you for pointing this out; we have corrected it.

      (5.3) Line 703, did you mean to write something else than "Equation 26" (since equation 26 is defined after)?

      Yes, this was a typo. We have corrected the cross-reference.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Here, the authors aim to investigate the potential improvements of ANNs when used to explain brain data using top-down feedback connections found in the neocortex. To do so, they use a retinotopic and tonotopic organization to model each subregion of the ventral visual (V1, V2, V4, and IT) and ventral auditory (A1, Belt, A4) regions using Convolutional Gated Recurrent Units. The top-down feedback connections are inspired by the apical tree of pyramidal neurons, modeled either with a multiplicative effect (change of gain of the activation function) or a composite effect (change of gain and threshold of the activation function).

      To assess the functional impact of the top-down connections, the authors compare three architectures: a brain-like architecture derived directly from brain data analysis, a reversed architecture where all feedforward connections become feedback connections and vice versa, and a random connectivity architecture. More specifically, in the brain-like model the visual regions provide feedforward input to all auditory areas, whereas auditory areas provide feedback to visual regions.

      First, the authors found that top-down feedback influences audiovisual processing and that the brain-like model exhibits a visual bias in multimodal visual and auditory tasks. Second, they discovered that in the brain-like model, the composite integration of top-down feedback, similar to that found in the neocortex, leads to an inductive bias toward visual stimuli, which is not observed in the feedforward-only model. Furthermore, the authors found that the brain-like model learns to utilize relevant stimuli more quickly while ignoring distractors. Finally, by analyzing the activations of all hidden layers (brain regions), they found that the feedforward and feedback connectivity of a region could determine its functional specializations during the given tasks.

      Strengths:

      The study introduces a novel methodology for designing connectivity between regions in deep learning models. The authors also employ several tasks based on audiovisual stimuli to support their conclusions. Additionally, the model utilizes backpropagation of error as a learning algorithm, making it applicable across a range of tasks, from various supervised learning scenarios to reinforcement learning agents. Conversely, the presented framework offers a valuable tool for studying top-down feedback connections in cortical models. Thus, it is a very nice study that also can give inspiration to other fields (machine learning) to start exploring new architectures.

      We thank the reviewer for their accurate summary of our work and their kind assessment of its strengths.

      Weaknesses:

      Although the study explores some novel ideas on how to study the feedback connections of the neocortex, the data presented here are not complete enough to propose a concrete theory of the role of top-down feedback inputs in such models of the brain.

      (1) The gap in the literature that the paper tries to fill is the ability of DL algorithms to predict behavior: "However, there are still significant gaps in most deep neural networks' ability to predict behavior, particularly when presented with ambiguous, challenging stimuli." and "[...] to accurately model the brain."

      It is unclear to me how the presented work addresses this gap, as the only facts provided are derived from a simple categorization task that could also be solved by the feedforward-only model (see Figures 4 and 5). In my opinion, this statement is somewhat far-fetched, and there is insufficient data throughout the manuscript to support this claim.

      We can see now that the way the introduction was initially written led to some confusion about our goal in this study. Our goal here was not to demonstrate that top-down feedback can enable superior matches to human behaviour. Rather, our goal was to determine if top-down feedback had any real implications for processing ambiguous stimuli. The sentence that the reviewer has highlighted was intended as an explanation for why top-down feedback, and its impact on ambiguous stimuli, might be something one would want to examine for deep neural networks. But, here, we simply wanted to (1) provide an overview of the code base we have created, (2) demonstrate that top-down feedback does impact the processing of ambiguous stimuli.

      We agree with the reviewer that if our goal was to improve our ability to predict behaviour, then there was a big gap in the evidence we provided here. But, this was not our goal, and we believe that the data we provide here does convincingly show that top-down feedback has an impact on processing of ambiguous stimuli. We have updated the text in the introduction to make our goals more clear for the reader and avoid this misunderstanding of what we were trying to accomplish here. Specifically, the end of the introduction is changed to:

      “To study the effect of top-down feedback on such tasks, we built a freely available code base for creating deep neural networks with an algorithmic approximation of top-down feedback. Specifically, top-down feedback was designed to modulate ongoing activity in recurrent, convolutional neural networks. We explored different architectural configurations of connectivity, including a configuration based on the human brain, where all visual areas send feedforward inputs to, and receive top-down feedback from, the auditory areas. The human brain-based model performed well on all audiovisual tasks, but displayed a unique and persistent visual bias compared to models with only driving connectivity and models with different hierarchies. This qualitatively matches the reported visual bias of humans engaged in audio-visual tasks. Our results confirm that distinct configurations of feedforward/feedback connectivity have an important functional impact on a model's behavior. Therefore, top-down feedback captures behaviors and perceptual preferences that do not manifest reliably in feedforward-only networks. Further experiments are needed to clarify whether top-down feedback helps an ANN fit better to neural data, but the results show that top-down feedback affects the processing of stimuli and is thus a relevant feature that should be considered for deep ANN models in computational neuroscience more broadly.”

      (2) It is not clear what the advantages are between the brain-like model and a feedforward-only model in terms of performance in solving the task. Given Figures 4 and 5, it is evident that the feedforward-only model reaches almost the same performance as the brain-like model (when the latter uses the modulatory feedback with the composite function) on almost all tasks tested. The speed of learning is nearly the same: for some tested tasks the brain-like model learns faster, while for others it learns slower. Thus, it is hard to attribute a functional implication to the feedback connections given the presented figures and therefore the strong claims in the Discussion should be rephrased or toned down.

      Again, we believe that there has been a misunderstanding regarding the goals of this study, as we are not trying to claim here that there are performance advantages conferred by top-down feedback in this case. Indeed, we share the reviewer’s assessment that the feedforward-only model seems to be capable of solving this task well. To reiterate: our goal here was to demonstrate that top-down feedback alters the computations in the network and, thus, has distinct effects on behaviour that need to be considered by researchers who use deep networks to model the brain. But we make no claims of “superiority” of the brain-like model.

      In line with this, we’re not completely sure which claims in the discussion the reviewer is referring to. We note that we were quite careful in our claims. For example, in the first section of the discussion we say:

      “Altogether, our results demonstrate that the distinction between feedforward and feedback inputs has clear computational implications, and that ANN models of the brain should therefore consider top-down feedback as an important biological feature.”

      And later on:

      “In summary, our study shows that modulatory top-down feedback and the architectural diversity enabled by it can have important functional implications for computational models of the brain. We believe that future work examining brain function with deep neural networks should therefore consider incorporating top-down modulatory feedback into model architectures when appropriate.”

      If we have missed a claim in the discussion that implies superiority of the brain-like model in terms of task performance we would be happy to change it.

      (3) The Methods section lacks sufficient detail. There is no explanation provided for the choice of hyperparameters nor for the structure of the networks (number of trainable parameters, number of nodes per layer, etc). Clarifying the rationale behind these decisions would enhance understanding. Moreover, since the authors draw conclusions based on the performance of the networks on specific tasks, it is unclear whether the comparisons are fair, particularly concerning the number of trainable parameters. Furthermore, it is not clear if the visual bias observed in the brain-like model is an emerging property of the network or has been created because of the asymmetries in the visual vs. auditory pathway (size of the layer, number of layers, etc).

      We thank the reviewer for raising this issue, and want to provide some clarifications. First, the number of trainable parameters is roughly equal across models, since we were only switching the direction of connectivity (top-down versus bottom-up), not the number of connections. We confirmed that the biggest difference in size is between models with composite and multiplicative feedback; models with composite feedback have roughly 1K more parameters, and all models are within the 280K parameter range. We now state this in the methods.

      Second, because superior performance was not the goal of this study, as stated above, we conducted limited hyperparameter tuning. Given the reviewer’s comment, we wondered whether this may have impacted our results. Therefore, we explored different hyperparameters for the model during the multimodal auditory tasks, which show the clearest example of the visual dominance in the brain-like model (Figure 3).

      We explored different hidden state sizes, learning rates, and processing times, and examined whether the core results changed. We found that extremely high learning rates (0.1) destabilize all models and that some models perform poorly under different processing times. Overall, however, the core results, namely the different behaviors of models with different connectivities and the visual dominance observed in the brain-like model, are evident across all hyperparameters for which the models learn. We now provide these results in supplementary figures (Fig. S2, showing larger models trained with different learning rates, and Fig. S3, showing the effect of processing time on AS task performance).

      Reviewer #2 (Public review):

      Summary:

      This work addresses the question of whether artificial deep neural network models of the brain could be improved by incorporating top-down feedback, inspired by the architecture of the neocortex.

      In line with known biological features of cortical top-down feedback, the authors model such feedback connections with both a typical driving effect and a purely modulatory effect on the activation of units in the network.

      To assess the functional impact of these top-down connections, they compare different architectures of feedforward and feedback connections in a model that mimics the ventral visual and auditory pathways in the cortex on an audiovisual integration task.

      Notably, one architecture is inspired by human anatomical data, where higher visual and auditory layers possess modulatory top-down connections to all lower-level layers of the same modality, and visual areas provide feedforward input to auditory layers, whereas auditory areas provide modulatory feedback to visual areas.

      First, the authors find that this brain-like architecture imparts the models with a light visual bias similar to what is seen in human data, which is the opposite in a reversed architecture, where auditory areas provide a feedforward drive to the visual areas.

      Second, they find that, in their model, modulatory feedback should be complemented by a driving component to enable effective audiovisual integration, similar to what is observed in neural data.

      Last, they find that the brain-like architecture with modulatory feedback learns a bit faster in some audiovisual switching tasks compared to a feedforward-only model.

      Overall, the study shows some possible functional implications when adding feedback connections in a deep artificial neural network that mimics some functional aspects of visual perception in humans.

      Strengths:

      The study contains innovative ideas, such as incorporating an anatomically inspired architecture into a deep ANN, and comparing its impact on a relevant task to alternative architectures.

      Moreover, the simplicity of the model allows one to draw conclusions on how features of the architecture and functional aspects of the top-down feedback affect the performance of the network.

      This could be a helpful resource for future studies of the impact of top-down connections in deep artificial neural network models of the neocortex.

      We thank the reviewer for their summary and their recognition of the innovative components and helpful resources therein.

      Weaknesses:

      Overall, the study appears to be a bit premature, as several parts need to be worked out more to support the claims of the paper and to increase its impact.

      First, the functional implication of modulatory feedback is not really clear. The "only feedforward" model (is a drive-only model meant?) attains the same performance as the composite model (with modulatory feedback) on virtually all tasks tested, it just takes a bit longer to learn for some tasks, but then is also faster at others. It even reproduces the visual bias on the audiovisual switching task. Therefore, the claims "Altogether, our results demonstrate that the distinction between feedforward and feedback inputs has clear computational implications, and that ANN models of the brain should therefore consider top-down feedback as an important biological feature." and "More broadly, our work supports the conclusion that both the cellular neurophysiology and structure of feed-back inputs have critical functional implications that need to be considered by computational models of brain function" are not sufficiently supported by the results of the study. Moreover, the latter points would require showing that this model describes neural data better, e.g., by comparing representations in the model with and without top-down feedback to recorded neural activity.

      To emphasize again our specific claims, we believe that our data shows that top-down feedback has functional implications for deep neural network behaviour, not increased performance or neural alignment. Indeed, our results demonstrate that top-down feedback alters the behaviour of the networks, as shown by the differences in responses to various combinations of ambiguous stimuli. We agree with the reviewer that if our goal was to claim either superior performance on these tasks, or better fit to neural data, we would need to actually provide data supporting that claim.

      Given the comments from the reviewer, we have tried to provide more clarity in the introduction and discussion regarding our claims. In particular, we now highlight that we are not trying to demonstrate that the models with top-down feedback exhibit superior performance or better fit to neural data.

      As one final note, yes, the reviewer understood correctly that the “only feedforward” model is a model with only driving inputs. We have renamed the feedforward-only models to drive-only models and added additional emphasis in the text to ensure that the distinction is clear for all readers.

      Second, the analyses are not supported by supplementary material, hence it is difficult to evaluate parts of the claims. For example, it would be helpful to investigate the impact of the process time after which the output is taken for evaluation of the model. This is especially important because in recurrent and feedback models the convergence should be checked, and if the network does not converge, then it should be discussed why and at which point in time the network is evaluated.

      This is an excellent point, and we thank the reviewer for raising it. We allowed the network to process the stimuli for seven time-steps, which was enough for information from any one region to be transmitted to any other. We found in some initial investigations that if we shortened the processing time some seeds would fail to solve the task. But, based on the reviewer’s comment, we have now also run additional tests with longer processing times for the auditory tasks where we see the clearest visual bias (Figure 3). We find that different process times do not change the behavioral biases observed in our models, but may introduce difficulties ignoring visual stimuli for some models. Thus, while process time is an important hyperparameter for optimal performance of the model, the central claim of the paper remains. We include this new data in a supplementary figure S3.
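
      The statement that seven time-steps suffice for information to travel between any two regions can be checked with a breadth-first search over the connectivity graph. The sketch below assumes that a signal crosses one connection per time-step and uses a simplified connectivity assembled from the reviewers' descriptions (feedforward to the next stage, feedback to all lower stages within a modality, and all-to-all connections between visual and auditory areas); the exact edge list of the published models may differ, and the helper names are hypothetical.

```python
from collections import deque

VISUAL = ["V1", "V2", "V4", "IT"]
AUDITORY = ["A1", "Belt", "A4"]


def build_connectivity():
    """Directed edges of an assumed brain-like configuration."""
    edges = {region: set() for region in VISUAL + AUDITORY}
    for pathway in (VISUAL, AUDITORY):
        for i, region in enumerate(pathway):
            if i + 1 < len(pathway):
                edges[region].add(pathway[i + 1])  # feedforward to next stage
            edges[region].update(pathway[:i])      # feedback to lower stages
    for v in VISUAL:
        for a in AUDITORY:
            edges[v].add(a)  # visual -> auditory (feedforward)
            edges[a].add(v)  # auditory -> visual (feedback)
    return edges


def longest_shortest_path(edges):
    """Largest number of hops needed between any ordered pair of regions,
    or None if some region cannot be reached from another."""
    longest = 0
    for source in edges:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in edges[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        if len(distance) < len(edges):
            return None
        longest = max(longest, max(distance.values()))
    return longest


hops = longest_shortest_path(build_connectivity())
print("time-steps needed under these assumptions:", hops)
```

      If the returned value is at most the number of processing steps, every region can in principle influence every other before the output is read out; this says nothing about whether the dynamics have converged by then.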

      Third, the descriptions of the models in the methods are hard to understand, i.e., parameters are not described and equations are explained by referring to multiple other studies. Since the implications of the results heavily rely on the model, a more detailed description of the model seems necessary.

      We agree with the reviewer that the methods could have been more thorough. Therefore, we have greatly expanded the methods section. We hope the model details are now more clear.

      Lastly, the discussion and testable predictions are not very well worked out and need more details. For example, the point "This represents another testable prediction flowing from our study, which could be studied in humans by examining the optical flow (Pines et al., 2023) between auditory and visual regions during an audiovisual task" needs to be made more precise to be useful as a prediction. What did the model predict in terms of "optic flow", how can modulatory effects be distinguished from simple driving effects, etc.?

      We see that the original wording of this prediction was ambiguous; thank you for pointing this out. In the study highlighted (Pines et al., 2023), the authors use an analysis technique for measuring information flow between brain regions, which is related to the analysis of optical flow in images but applied to fMRI scans. The term is confusing in the context of the current study, though. Therefore, we have changed this sentence to make clear that we are speaking of information flow here.

      Reviewer #3 (Public review):

      Summary:

      This study investigates the computational role of top-down feedback in artificial neural networks (ANNs), a feature that is prevalent in the brain but largely absent in standard ANN architectures. The authors construct hierarchical recurrent ANN models that incorporate key properties of top-down feedback in the neocortex. Using these models in an audiovisual integration task, they find that hierarchical structures introduce a mild visual bias, akin to that observed in human perception, not always compromising task performance.

      Strengths:

      The study investigates a relevant and current topic of considering top-down feedback in deep neural networks. In designing their brain-like model, they use neurophysiological data, such as externopyramidisation and hierarchical connectivity. Their brain-like model exhibits a visual bias that qualitatively matches human perception.

      We thank the reviewer for their summary and evaluation of our paper’s strengths.

      Weaknesses:

      While the model is brain-inspired, it has limited bioplausibility. The model assumes a simplified and fixed hierarchy. In the brain with additional neuromodulation, the hierarchy could be more flexible and more task-dependent.

      We agree, there are still many facets of top-down feedback that we have not captured here, and the modulation of hierarchy is an interesting example. We have added some consideration of this point to the limitations section of the discussion.

      While the brain-like model showed an advantage in ignoring distracting auditory inputs, it struggled when visual information had to be ignored. This suggests that its rigid bias toward visual processing could make it less adaptive in tasks requiring flexible multimodal integration. It hence does not necessarily constitute an improvement over existing ANNs. It is unclear whether this aspect of the model also matches human data. In general, there is no direct comparison to human data. The study does not evaluate whether the top-down feedback architecture scales well to more complex problems or larger datasets. The model is not well enough specified in the methods and some definitions are missing.

      We agree with the reviewer that we have not demonstrated anything like superior performance (since the brain-like network is quite rigid, as noted) nor have we shown better match to human data with the brain-like network. This was not our intended claim. Rather, we demonstrated here simply that top-down feedback impacts behavior of the networks in response to ambiguous stimuli. We have now added statements to the introduction and discussion to make our specific claims (which are supported by our data, we believe) clear.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I believe that the work is very nice but not so mature at this stage. Below, you can find some comments that eventually could improve your manuscript.

      (1) Intro, last sentence: "Therefore, top-down feedback is a relevant feature that should be considered for deep ANN models in computational neuroscience more broadly." I don't understand what the authors refer to with this sentence. There are numerous models (deep ANNs) that have been used to model the neural activity and are much simpler than the one proposed here which contains very complex models and connectivity. Although I do agree that the top-down connections are very important there is no data to support their importance for modeling the brain.

      Respectfully, we disagree with the reviewer that we don’t provide data to demonstrate the importance of top-down feedback for modelling. Indeed, we provided a great deal of data to show that top-down feedback in the networks has real functional implications for behaviour, e.g., it can induce a human-like visual bias. Thus, top-down feedback is a factor that one should care about when modelling the brain. But, we agree with the reviewer that more demonstration of the utility of using top-down feedback for achieving better fits to neural data would be an important next step. 

      (2) I suggest adding some extra supplementary simulations where, for example, the number of data for visual and auditory pathways is equal in size (i.e., the same number of examples), the number of layers is identical (3 per pathway), and also the number of parameters. Doing this would help strengthen the claims presented in the paper.

      In fact, all of the hyperparameters the reviewer mentions here were identical for the different networks, so the experiments the reviewer is requesting here were already part of the paper. We now clarify this in the text.

      (3) Results: I suggest adding Tables with quantifications of the presented results. For example, best performance, epochs to converge, etc. As it is now, it is very hard to follow the evidence shown in Figures.

      This is a good suggestion, we have now added this table to the start of the supplemental figures.

      (4) Figure 2e, 3e: Although VS3, and AS3 have been used only for testing, the plot shows alignments with respect to training epochs. The authors should clarify in the Methods if they tested the network with all intermediate weights during VS1/VS2 or AS1/AS2 training.

      Testing scenarios in this context meant that the model was never shown the scenario/task during training, but the models were indeed evaluated on the VS3 and AS3 after each training epoch. We have added clarifications to the figure legends.

      (5) Methods: It would be beneficial to discuss how specific hyperparameters were selected based on prior research, empirical testing, or theoretical considerations. Also, it is not clear how the alignment (visual or audio) is calculated. Do the authors use the examples that have been classified correctly for both stimuli or do they exclude those from the analysis (maybe I have missed it).

      As noted above, because superior performance was not the goal of this study, we conducted limited hyperparameter tuning. But we have extended the results with additional hyperparameter tuning in a supplementary figure, and describe the hyperparameter choices more thoroughly in the methods. As well, all data includes all model responses, regardless of whether they were correct or not. We now clarify this in the methods.
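
      The statement that alignment is computed over all responses can be illustrated with a short sketch. The exact definition is given in the revised methods and is not reproduced in this response, so the code below rests on an assumption: alignment with a modality is taken to be the fraction of all model responses that match the label carried by that modality's stimulus. The function and argument names are hypothetical.

```python
import numpy as np

def alignment(predictions, visual_labels, audio_labels):
    """Fraction of ALL responses that match each modality's label.

    Assumed definition for illustration only. No trial is excluded,
    whether or not the response was scored as correct.
    """
    predictions = np.asarray(predictions)
    visual = np.mean(predictions == np.asarray(visual_labels))
    audio = np.mean(predictions == np.asarray(audio_labels))
    return visual, audio
```

      Under this assumed definition, a network with a strong visual bias would show high visual alignment and low auditory alignment on trials where the two modalities disagree.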

      (6) Code: The code repository lacks straightforward examples demonstrating how to utilize the modeling approach. Given that it is referred to as a "framework", one would expect it to facilitate easy integration into various models and tasks. Including detailed instructions or clear examples would significantly improve usability and help users effectively apply the proposed methodology.

      We agree with the reviewer, this would be beneficial. We have revised the README of the codebase to explain the model and its usage more clearly and included an interactive jupyter notebook with example training on MNIST.

      Some minor comments are given below. Generally speaking, the Figures need to be more carefully checked for consistent labels, colors, etc.

      (1) Page 4, 1st paragraph - grammar correction: "a larger infragranular layer" or "larger infragranular layers"

      Thank you for catching this, we have fixed the text.

      (2) Page 4, 2nd para - rephrase: "In three additional control ANNs" → "In the third additional control ANN"

      In fact, we did mean three additional control ANNs, each one representing a different randomized connectivity profile. We now clarify this in the text and provide the connectivity of the two other random graphs in the supplemental figures.

      (3) Page 4, VAE acronym needs to be defined before its first use

      The variational autoencoder is introduced by its full name in the text now.

      (4) Page 4: Fig. 2c reference should be Fig. 2b, Fig. 2d should be Fig. 2c, Fig. 2b should be Fig. 2d, VS4; Fig. 2b, bottom should be VS4; Fig. 2f, Fig. 2f to Fig. 2g. Double check the Figure references in the text. Here is very confusing for the reader.

      We have now fixed this, thank you for catching it.

      (5) Page 5, 1st para: "Altogether, our results demonstrated both" → "Altogether, our results demonstrated that both"

      This has been updated.

      (6) Figure 2: In the e and g panels the x label is missing.

      This was actually because the x-axes were the same across the panels, but we see how this was unclear, so we have updated the figure.

      (7) Figure 3: There is no panel g (the title is missing); In panels b, c, e, and g the y label is missing, and in panels e and g the x label is missing. Also, the Feedforward model is shown in panel g but it is introduced later in the text. Please remove it from Figure 3. Also in legend: "AV Reverse graph" → "Reverse graph". Also, "Accuracy" and "Alignment" should be presented as percentages (as in Figure 2).

      This has been corrected.

      (8) Figure 4; x labels are missing.

      As with point (6), this was actually because the x-axes were the same across the panels, but we see how this was unclear, so we have updated the figure.

      (9) Page 7; I can’t find the cited Figure S1.

      Apologies, we have added the supplemental figure (now as S4). It shows the results of models with multiplicative feedback on the task in Fig 5 (as opposed to models with composite feedback shown in the main figure).

      Reviewer #2 (Recommendations for the authors):

      (1) Discussion Section 3.1 is only a literature review, and does not really add any value.

      Respectfully, we think it is important to relate our work to other computational work on the role of top-down feedback, and to make clear what our specific contribution is. But, we have updated the text to try to place additional emphasis on our study’s contribution, so that this section is more than just a literature review.

      “Our study adds to this previous work by incorporating modulatory top-down feedback into deep, convolutional, recurrent networks that can be matched to real brain anatomy. Importantly, using this framework we could demonstrate that the specific architecture of top-down feedback in a neural network has important computational implications, endowing networks with different inductive biases.”

      (2) Including ipython notebooks and some examples would be great to make it easier to use the code.

      We now provide a demo of how to use the code base in a jupyter notebook.

      (3) The description of the model is hard to comprehend. Please name and describe all parameters. Also, a figure would be great to understand the different model equations.

      We have added definitions of all model terms and parameters.

      (4) The terminology is not really clear to me. For example "The results further suggest that different configurations of top-down feedback make otherwise identically connected models functionally distinct from each other and from traditional feedforward only recurrent models." The feedforward and only recurrent seem to contradict each other. Would maybe driving and modulatory be a better term here? I also saw in the code that you differentiate between three types of inputs, modulatory, threshold offset and basal (like feedforward). How about you only classify connections based on these three type? I was also confused about the feedforward only model, because I was unsure whether it is still feedback connections but with "basal" quality, or whether feedback connections between modalities and higher-to-lower level layers were omitted altogether.

      We take the reviewer’s point here. To clarify this, we have updated the text to refer to “driving only” rather than “feedforward only”, to make it obvious that what we change in these models is simply whether the connection has any modulatory impact on the activity. 

      (5) "incorporating it into ANNs can affect their behavior and help determine the solutions that the network can discover." -> Do you mean constrain? Overall, I did not really get this point.

      Yes, we mean that it constrains the solutions that the network is likely to discover.

      (6) "ignore the auditory inputs when they visual inputs were unambiguous" -> the not they

      This has been fixed. Thank you for catching it.

      (7) xlabel in Figure 4 is missing.

      This has been fixed, thank you for catching it.

      Reviewer #3 (Recommendations for the authors):

      Major:

      (1) How alignment is computed is not defined. In addition to a proper definition in the methods section, it would be nice to briefly define it when it first appears in the results section.

      We’ve added an explicit definition of how alignment is calculated in the methods and emphasized the calculation when it is first explained in the results.

      (2) A connectivity matrix for the feedforward-only model is missing and could be added.

      We have added this to Figure 1.

      (3) The connectivity matrix for each random model should also be shown.

      We’ve shown each of the random model configurations in the new supplemental figure S1.

      (4) Initial parameters are not defined, such as W, b etc. A table with all model parameters would be great.

      We have added a table to the methods listing all of the parameters.

      (5) Would be nice to show the t-sne plots (not just the NH score) for each model and each task in the appendix.

      We can provide these figures on request. They massively increase the file size of the paper pdf, as there are 49 of them for each task and each model, 980 in total. An example t-SNE plot is provided in Figure 6.

      Minor:

      (1) Page 4:

      "we refer to this as Visual-dominant Stimulus case 1, or VS1; Fig. 1a, top)." This should be Fig. 2a.

      (2) "In stimulus condition VS1, all of the models were able to learn to use the auditory clues to disambiguate the images (Fig. 2c)."

      This should be Fig. 2b.

      (3) "In comparison, in VS2, we found that the brainlike model learned to ignore distracting audio inputs quickly and consistently compared to the random models, and a bit more rapidly than the auditory information (Fig 2d)."

      This should be Fig. 2c.

      (4) "VS3; Fig. 2b, top"

      This should be Fig. 2d

      (5) "while all other models had to learn to do so further along in training (Fig. 2e)."

      It is not stated explicitly, but this suggests that the image-aligned target was considered correct, and that weight updates were happening.

      (6) "VS4; Fig. 2b, bottom"

      This should be Fig. 2f

      (7) "adept at learning (Fig. 2f)."

      This should be Fig. 2g

      (8) Figure 3:b,c,e y-labels are missing

      3f: both x and y labels are missing

      (9) Figure labeling in the text is not consistent (Fig. 1A versus Fig. 2a)

      (10) Doubled "the" in ""This shows that the inductive bias towards vision in the brainlike model depended on the presence of the multiplicative component of the the feedback"

      (11) Page 9 Figure 6: The caption says b shows the latent spaces for the VS2 task, whereas the main text refers to 6b as showing the latent space for the AS2 task. Please correct which task it is.

      (12) Methods 4.1 page 13

      "which is derived from the feedback input (h_{l−1})"

      This should be h_{l+1}

      (13) r_l, u_l, u and c are not defined to which aspects of the model they refer to

      Even though this is based on a previous model, the methods section should completely describe the model.

      Equations 1,2,3: the notation [x;y] is unclear and should be defined.

      Equation 5: u should probably be u_l.

      (14) Page 14 typo: externopyrmidisation.

      (15) It is confusing to use different names for the same thing: the all-feedforward model, the all feedforward network, the feedforward network, and the feedforward-only model are probably all the same? Consistent naming would help here.

      Thank you for the detailed comments! We’ve fixed the minor errors and renamed the feedforward models to drive-only models.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public reviews:

      Reviewer #1 (Public review):

      Summary:

      Schafer et al. tested whether the hippocampus tracks social interactions as sequences of neural states within an abstract social space defined by dimensions of affiliation and power, using a task in which participants engaged in narrative-based social interactions. The findings of this study revealed that individual social relationships are represented by unique sequences of hippocampal activity patterns. These neural trajectories corresponded to the history of trial-to-trial affiliation and power dynamics between participants and each character, suggesting an extended role of the hippocampus in encoding sequences of events beyond spatial relationships.

      The current version provides limited detail on the decoding and clustering analyses, which could be improved in a future revision.

      Strengths:

      (1) Robust Analysis: The research combined representational similarity analysis with manifold analyses, enhancing the robustness of the findings and the interpretation of the hippocampus's role in social cognition.

      (2) Replicability: The study included two independent samples, which strengthens the generalizability and reliability of the results.

      Weaknesses:

      I appreciate the authors for utilizing contemporary machine-learning techniques to analyze neuroimaging data and examine the intricacies of human cognition. However, the manuscript would benefit from a more detailed explanation of the rationale behind the selection of each method and a thorough description of the validation procedures. Such clarifications are essential to understand the true impact of the research. Moreover, refining these areas will broaden the manuscript's accessibility to a diverse audience.

      We thank the reviewer for these comments and have addressed them in various ways.

      First, we removed the spline-based decoding and spectral clustering analyses. As we detail in our response to the recommendations, these approaches were complex and raised legitimate interpretational concerns, making it unclear how they supported our core claims. The revised manuscript now focuses on a set of representational similarity analyses to show representations consistent with social dimension similarity (affiliation vs. power decision trials) and social location similarity (trajectory/map-like coding based on participant choices).

      Second, we expanded the Methods and Results to more clearly explain the analyses, the questions they address, and associated controls and robustness tests. The dimension similarity analysis tests whether hippocampal patterns differentiate affiliation and power decisions in a way consistent with an abstract dimension representation. The location similarity RSAs test whether within-character neural pattern distances scale with Euclidean distance in social space (relationship-specific trajectories), and whether pattern distances across all characters scale with location distances when distances are globally standardized, consistent with a shared map-like coordinate system.

      Third, we emphasize new controls. For the dimension similarity RSA, we test for potential confounds such as word count, text sentiment, and reaction time differences between affiliation and power trials. For the location similarity RSA, we control for temporal distance between trials and show (in the Supplement) that the reported effects cannot be explained by temporal autocorrelation in the fMRI data or by the relationship between temporal distance and behavioral location distance.

      We believe that these changes address the reviewer’s request for clearer rationale and validation.

      Reviewer #2 (Public review):

      Summary:

      Using an innovative task design and analysis approach, the authors set out to show that the activity patterns in the hippocampus related to the development of social relationships with multiple partners in a virtual game. While I found the paper highly interesting (and would be thrilled if the claims made in the paper turned out to be true), I found many of the analyses presented either unconvincing or slightly unconnected to the claims that they were supposed to support. I very much hope the authors can alleviate these concerns in a revision of the paper.

      Strengths & Weaknesses:

      (1) The innovative task design and analyses, and the two independent samples of participants are clear strengths of the paper.

      We thank the reviewer for this comment.

      (2) The RSA analysis is not what I expected after I read the abstract and title of the result section "The hippocampus represents abstract dimensions of affiliation and power". To me, the title suggests that the hippocampus has voxel patterns, which could be read out by a downstream area to infer the affiliation and power value, independent of the exact identity of the character in the current trial. The presented RSA analysis however presents something entirely different - namely that the affiliation trials and power trials elicit different activity patterns in the area indicated in Figure 3. What is the meaning of this analysis? It is not clear to me what is being "decoded" here and alternative explanations have not been considered. How do affiliation and power trials differ in terms of the length of sentences, complexity of the statements, and reaction time? Can the subsequent decision be decoded from these areas? I hope in the revision the authors can test these ideas - and also explain how the current RSA analysis relates to a representation of the "dimensions of affiliation and power".

      We agree that this analysis needed to be better justified and explained. We have revised the text to clarify that by “represents the interaction decision trials along abstract social dimensions” we mean that hippocampal multivoxel patterns differentiate affiliation and power decisions in a way consistent with the conceptual framework of underlying latent dimensions. The analysis tests one simple prediction of this view – that on average these trial types are separable in the neural patterns. We have added details to the Methods, showing how the affiliation and power trials do not differ in word count or in sentiment, but do differ in their semantics, as assessed by a Large Language Model, as we expect from our task assumptions. Thanks to the reviewer’s comment, we also tested for and found a reaction time difference between affiliation and power trials, that we now control for.

      (3) Overall, I found that the paper was missing some more fundamental and simpler RSA analyses that would provide a necessary backdrop for the more complicated analyses that followed. Can you decode character identity from the regions in question? If you trained a simple decoder for power and affiliation values (using the LLE, but without consideration of the sequential position as used in the spline analysis), could you predict left-out trials? Are affiliation and power represented in a way that is consistent across participants - i.e. could you train a model that predicts affiliation and power from N-1 subjects and then predict the Nth subject? Even if the answer to these questions is "no", I believe that they are important to report for the reader to get a full understanding of the nature of the neural representations in these areas. If the claim is that the hippocampus represents an "abstract" relationship space, then I think it is important to show that these representations hold across relationships. Otherwise, the claim needs to be adjusted to say that it is a representation of a relationship-specific trajectory, but not an abstract social space.

      We appreciate this comment and agree on the value of clear, conceptually simple analyses. To address this concern, we have simplified our main analysis significantly by removing the spline-based analysis and substituting it with a multiple regression representational similarity analysis approach. We test whether within-character neural pattern distances scale with distance in social space (relationship-specific trajectories), and whether pattern distances across all characters scale with location distances when distances are globally standardized. We find evidence for both, consistent with a shared map-like coordinate system.

      We agree that decoding character identity and an across-participant decoding approach could be informative. However, our current task is not well designed for such analyses and as such would complicate the paper. Although we agree that these questions are interesting, they would test questions that are outside the scope of this paper. 

      (4) To determine that the location of a specific character can be decoded from the hippocampal activity patterns, the authors use a sequential analysis in a low-dimensional space (using local linear embedding). In essence, each trial is decoded by finding the pair of two temporally sequential trials that is closest to this pattern, and then interpolating the power/affiliation values linearly between these two points. The obvious problem with this analysis is that fMRI patterns will have temporal autocorrelation and the power and affiliation values have temporal autocorrelation. Successful decoding could just reflect this smoothness in both time series. The authors present a series of control analyses, but I found most of them to not be incisive or convincing and I believe that they (and their explanation of their rationale) need to be improved. For example, the circular shifting of the patterns preserves some of the autocorrelation of the time series - but not entirely. In the shifted patterns, the first and last items are considered to be neighboring and used in the evaluation, which alone could explain the poor performance. The simplest way that I can see is to also connect the first and last item in a circular fashion, even when evaluating the veridical ordering. The only really convincing control condition I found was the generation of new sequences for every character by shuffling the sequence of choices and re-creating new artificial trajectories with the same start and endpoint. This analysis performs much better than chance (circular shuffling), suggesting to me that a lot of the observed decoding accuracy is indeed simply caused by the temporal smoothness of both time series.

      We thank the reviewer for emphasizing this important concern; we agree that we did not sufficiently address this in the initial submission. This concern is one main reason we removed the spline-based analysis and now use regression-based representational similarity analyses in its place. In the revision, we report autocorrelation-related analyses in the supplement, and via controls and additional analysis show that temporal distance (or its square) cannot explain the location-like effects. This substantially improves our ability to interpret the findings.
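
      The concern about shared temporal smoothness can be illustrated with a small simulation. This is a minimal sketch and not an analysis from the manuscript: it assumes two series generated independently of each other, and compares the typical size of their correlation when both are white noise versus when both are random walks (a simple stand-in for temporally smooth signals). The function name and parameters are hypothetical.

```python
import numpy as np

def mean_abs_corr(n_trials=60, n_sims=500, smooth=True, seed=0):
    """Average |r| between two series that are generated independently."""
    rng = np.random.default_rng(seed)
    r = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.standard_normal(n_trials)
        y = rng.standard_normal(n_trials)  # drawn independently of x
        if smooth:
            # Cumulative sums give random walks with strong autocorrelation.
            x, y = np.cumsum(x), np.cumsum(y)
        r[s] = np.corrcoef(x, y)[0, 1]
    return np.abs(r).mean()

smooth_r = mean_abs_corr(smooth=True)
white_r = mean_abs_corr(smooth=False)
print(f"mean |r|, smooth series: {smooth_r:.2f}; white noise: {white_r:.2f}")
```

      Smooth series tend to produce larger spurious correlations than white noise even though no true relationship exists, which is why temporal distance needs to be entered as an explicit control.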

      (5) Overall, I found the analysis of the brain-behavior correlation presented in Figure 5 unconvincing. First, the correlation is mostly driven by one individual with a large network size and a 6.5 cluster. I suspect that the exclusion of this individual would lead to the correlation losing significance. Secondly, the neural measure used for this analysis (determining the number of optimal clusters that maximize the overlap between neural clustering and behavioral clustering) is new, non-validated, and disconnected from all the analyses that had been reported previously. The authors need to forgive me for saying so, but at this point of the paper, would it not be much more obvious to use the decoding accuracy for power and affiliation from the main model used in the paper thus far? Does this correlate? Another obvious candidate would be the decoding accuracy for character identity or the size of the region that encodes affiliation and power. Given the plethora of candidate neural measures, I would appreciate if the authors reported the other neural measures that were tried (and that did not correlate). One way to address this would have been to select the method on the initial sample and then test it on the validation sample - unfortunately, the measure was not pre-registered before the validation sample was collected. It seems that the correlation was only found and reported on the validation sample?

      We agree that this analysis was too complicated and underconstrained, and thus not convincing. We think that removing this cluster-based analysis is the most conservative response to the reviewer’s concerns and have removed it from the revised paper.

      Recommendations to the authors:

      Reviewer #1 (Recommendations for the authors):

      The manuscript's description of the shuffling analysis performed during decoding is currently ambiguous, particularly concerning the control variables. This ambiguity is present only in the Figure 4 legends and requires a more detailed explanation within the methods section. It is essential to clarify whether the permutation process was conducted within each character's data set or across multiple characters' data sets. If permutations were confined to within-character data, the conclusion would be that the hippocampus encodes context-specific information rather than providing a two-dimensional common space.

      We thank the reviewer for this comment. We have now removed the spline analysis due to these and other problems and have replaced it with representational similarity analyses that are both more rigorous and easier to interpret. We think these analyses allow us to make the claim that the characters are represented in a common space. 

      In the methods, we explain the analyses (page 23-24, lines 475-500):

      “We also expected the hippocampus to represent the different characters’ changing social locations, which are implicit in the participant’s choices. We used multiple regression searchlight RSA to test whether hippocampal pattern dissimilarity increases with social location distance, based on participant-specific trial-wise beta images where boxcar regressors spanned each trial’s reaction time.”

      “We ran two complementary regression analyses to address two related questions. First, we asked whether the hippocampus represents how a specific relationship changes over time. For this analysis, for each participant and each searchlight, we computed character-specific (i.e., only for same character trial pairs) correlation distances between trial-wise beta patterns and Euclidean distances between the social location behavioral coordinates. Distances were z-scored within character trial pairs to isolate character-specific changes. The second analysis asked whether there is a common map-like representation, where all trials, regardless of relationship, are represented in a shared coordinate system. Here, we included all trial pairs and z-scored the distances globally. For both regression analyses, we included control distances to control for possible confounds. To account for generic time-related changes, we controlled for absolute scan-time difference, as this correlated with location distance across participants (see Temporal autocorrelation of hippocampal beta patterns in the supplement). Although the square of this temporal distance did not explain any additional variance in behavioral distances, we ran a robustness analysis including both temporal distance and its square and saw qualitatively the same clusters with similar effect sizes. As such, we report the main analysis only. We included binary dimension difference (0 = trial pairs of different dimension, 1 = trial pairs of the same dimension), to ensure effects could not be explained by dimension-related effects. In the group-level model, we controlled for sample and the average reaction time between affiliation and power decisions.”
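
      The two regressions described above can be sketched for a single searchlight as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `location_rsa_beta`, its argument names, and the choice to z-score the temporal control are assumptions made for the example, and the group-level covariates (sample, reaction time) are omitted.

```python
import numpy as np

def zscore(x):
    """Standardize a 1-D array to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

def location_rsa_beta(patterns, coords, times, dims, chars, within_character=True):
    """Regress neural pattern distance on social-location distance.

    patterns: (n_trials, n_voxels) trial-wise beta patterns in one searchlight
    coords:   (n_trials, 2) affiliation/power coordinates
    times:    (n_trials,) scan time of each trial
    dims:     (n_trials,) dimension label per trial (affiliation or power)
    chars:    (n_trials,) character label per trial
    Returns the regression weight on location distance.
    """
    i, j = np.triu_indices(len(patterns), k=1)        # all unique trial pairs
    neural = 1.0 - np.corrcoef(patterns)[i, j]        # correlation distance
    location = np.linalg.norm(coords[i] - coords[j], axis=1)  # Euclidean distance
    time_diff = np.abs(times[i] - times[j])           # control: scan-time difference
    same_dim = (dims[i] == dims[j]).astype(float)     # control: 1 = same dimension

    if within_character:
        keep = chars[i] == chars[j]                   # same-character pairs only
        pair_char = chars[i][keep]
        neural, location = neural[keep], location[keep]
        time_diff, same_dim = time_diff[keep], same_dim[keep]
        for c in np.unique(pair_char):                # z-score within each character
            m = pair_char == c
            neural[m], location[m] = zscore(neural[m]), zscore(location[m])
            time_diff[m] = zscore(time_diff[m])       # assumption: control is z-scored too
    else:
        # Shared-map analysis: all pairs, distances z-scored globally.
        neural, location, time_diff = zscore(neural), zscore(location), zscore(time_diff)

    X = np.column_stack([np.ones_like(location), location, time_diff, same_dim])
    betas, *_ = np.linalg.lstsq(X, neural, rcond=None)
    return betas[1]
```

      Calling the function with `within_character=True` corresponds to the relationship-specific analysis, and `within_character=False` to the shared-map analysis. A positive returned weight means that pattern dissimilarity increases with social location distance after accounting for the controls.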

      In the results, we describe the results and our interpretation (pages 11-12, lines 185-208):

      “We have shown that the left hippocampus represents the affiliation and power trials differently, consistent with an abstract dimensional representation. Does it also represent the changing social coordinates of each character? To test this, we used a multiple-regression RSA searchlight to test whether left hippocampus patterns represent the characters’ changing social locations across interactions (see Figure 3). We restricted the distances to those from trial pairs from the same character and standardized the distances within character (see Figure 3BD). We controlled for temporal distance to ensure the effect was not explainable by the time between trials, and for whether the trials shared the same underlying dimension (affiliation or power; see Location similarity searchlight analyses for more details). At the group level, we controlled for sample and the average reaction time difference between affiliation and power trials. Using the same testing logic as the dimensionality similarity analysis, we first tested our hypothesis in the bilateral hippocampus and found widespread effects in both the left (peak voxel MNI x/y/z = -35/-22/-15, cluster extent = 1470 voxels) and right (peak voxel MNI x/y/z = 37/-19/-14, cluster extent = 1953 voxels) hemispheres. The whole-brain searchlight analysis revealed additional clusters in the left putamen (-27/-3/14, cluster extent = 131 voxels) and left posterior cingulate cortex (-10/-28/41, cluster extent = 304 voxels).”

      “We then asked a second, complementary question: does the hippocampus represent all interactions, across characters, within a shared map? To test for this map-like structure, we repeated the analysis but now included all trial pairs, z-scoring distances globally rather than within character (Figure 3E-F). The remainder of the procedure followed the same logic as the preceding analysis. The hippocampus analysis revealed an extensive right hippocampal cluster (27/27/-14, cluster extent = 1667 voxels). The whole-brain analysis did not show any significant clusters.”

      We also describe the results in the discussion (page 12, lines 220-226): 

      “Then, we show that the hippocampus tracks the changing social locations (affiliation and power coordinates), above and beyond the effects of dimension or time; the hippocampus seemed to reflect both the changing within-character locations, tracking their locations over time, and locations across characters, as if in a shared map. Thus, these results suggest that the hippocampus does not just encode static character-related representations but rather tracks relationship changes in terms of underlying affiliation and power.”

      The manuscript's description of the decoding analysis is unclear regarding the variability of the decoded positions. The authors appear to decode the position of a character along a spline, which raises the question of whether this position correlates with time, since characters are more likely to be located further from the center in later trials. There is a concern that the decoded position may not solely reflect the hippocampal encoding of spatial location, but could also be influenced by an inherent temporal association. Given that a character's position at time t is likely to be similar to its positions at t−1 and t+1, it is crucial that the authors clearly articulate their approach to separating spatial representation from temporal autocorrelation. While this issue may have been addressed in the construction of the test set, the manuscript does not seem to adequately explain how such biases were mitigated in the training set.

      We agree that temporal confounding needs to be better accounted for, as our claims depend on space-like signals being separable from time-like ones. We address this in several ways in the revised manuscript.

      First, we emphasize that this is a narrative-based task, where temporal structure is relevant. As such, our analyses aim to demonstrate that effects go beyond simple temporal confounds, like trial order or time elapsed.

      Despite the temporal structure to the task, the decisions for the same character are spaced in time, and interleaved with other characters’ decisions, reducing the chance that a simple temporal confound could explain trajectory-related effects. We now describe the task better in the revised methods (page 16, lines 314-318):

      “All six characters’ decision trials are interleaved with one another and with narrative slides. On average, after a decision trial for a given character, participants view ~11 narrative slides and complete ~3 decisions for other characters before returning to that same character, such that each character’s choices are separated by an average of ~20 seconds (range 12 seconds to 10 min).”

      To address temporal autocorrelation in the fMRI time series, we used SPM’s FAST algorithm. Briefly, FAST models temporal autocorrelation as a weighted combination of candidate correlation functions, using the best estimate to remove autocorrelated signal.

      We also now report the temporal autocorrelation profile of the hippocampal beta series in the supplement (pages 29-31, lines 593-656):

      “The Social Navigation Task is a narrative-based task, where the relationships with characters evolve over time; trial pairs that are close in time may have more similar fMRI patterns for reasons unrelated to social mapping (e.g., slow drift). It is important to account for the role of time in our analyses, to ensure effects go beyond simple temporal confounds, like the time between decision trials. To aid in this, we quantified how fMRI signals change over time using a pattern autocorrelation function across decision trial lags. We defined the left and right hippocampus and the left and right intracalcarine cortex using the Harvard-Oxford atlas and thresholded them at 50% probability. We chose intracalcarine cortex as an early visual control region that largely corresponds to primary visual cortex (V1), as it is likely to be driven by the visually presented narrative. We used the same trial-wise beta images as in the location similarity RSA (boxcar regressors spanning each decision trial’s reaction time). For each participant and region-of-interest (ROI), we extracted the decision trial-by-voxel beta matrix and quantified three kinds of temporal dependence: beta autocorrelation, multivoxel pattern correlation and multivoxel pattern correlation after regressing out temporal distance.”

      “To estimate the temporal autocorrelation of the trial-wise beta values, we treated each voxel’s beta values as a time series across trials and measured how much a voxel’s response on one trial correlated (Pearson) with its response on previous trials. We averaged these voxel wise autocorrelations within each ROI. At one trial apart (lag 1), both the hippocampus and V1 showed small positive autocorrelations, indicating modest trial-to-trial carryover in response amplitude (see Supplemental figure 1) that by three trials apart was approximately 0.”

      “Because our representational similarity analyses depend on trial-by-trial pattern similarity, we also estimated how multivoxel patterns were autocorrelated over time. For each lag, we computed the Pearson correlation between each trial’s voxelwise pattern and the pattern from the trial that many trials earlier, then averaged those correlations to obtain a single autocorrelation value for that lag. At one trial apart, both regions showed positive autocorrelation, with V1 having greater autocorrelation than the hippocampus; pattern correlations between trials 3 or 4 trials apart reduced across participants, settling into low but positive values. Then, for each participant and ROI, we regressed out the effect of absolute trial onset differences from all pairwise pattern correlations, to mirror the effects of controlling for these temporal distances in regressions. After removing this temporal distance component, the short lag pattern autocorrelation dropped substantially in both regions. The similarity in autocorrelation profiles between the two regions suggests that significant similarity effects in the hippocampus are unlikely to be driven by generic temporal autocorrelation.”
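The lag-wise pattern autocorrelation described in this paragraph can be illustrated with a minimal pure-Python sketch. It assumes `betas` is a trial-by-voxel list of lists for one participant and ROI. The function names `pearson` and `pattern_autocorrelation` are hypothetical and are not taken from the authors' code.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def pattern_autocorrelation(betas, lag):
    """Average correlation between each trial's voxel pattern and the
    pattern from the trial `lag` trials earlier."""
    rs = [pearson(betas[t], betas[t - lag]) for t in range(lag, len(betas))]
    return sum(rs) / len(rs)

# Toy data (made up): 4 trials x 3 voxels.
toy_betas = [[1, 2, 3], [2, 4, 6], [3, 2, 1], [6, 4, 2]]
lag1 = pattern_autocorrelation(toy_betas, 1)
lag2 = pattern_autocorrelation(toy_betas, 2)
```

Evaluating the function over a range of lags gives one autocorrelation profile per participant and ROI, which could then be averaged across participants.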

      “Relationship between behavioral location distance and temporal distance”

      “We also quantified how temporal distances between trials relate to their behavioral location distances, participant by participant. Our dimension similarity analysis controls for temporal distance between trials by design (see Social dimension similarity searchlight analysis), but our location similarity analysis does not. To decide on covariates to include in the analysis, we tested whether temporal distances can explain behavioral location distances. For each participant, we computed the correlations between trial pairs’ Euclidean distances in social locations and their linear temporal distances (“linear”) and the temporal distances squared (“quadratic”), to test for nonlinear effects. We then summarized the correlations using one-sample t-tests. The linear relationship was statistically significant (t<sub>49</sub> = 12.24, p < 0.001), whereas the quadratic relationship was not (t<sub>49</sub> = -0.55, p = 0.586). Similarly, in participant-specific regressions with both linear and quadratic temporal distances, the linear effect was significant (t<sub>49</sub> = 5.69, p < 0.001) whereas the quadratic effect was not (t<sub>49</sub> = 0.20, p = 0.84). Based on this, we included linear temporal distances as a covariate in our location similarity analyses (see Location similarity searchlight analyses), and verified that adding a quadratic temporal distance covariate does not alter the results. Thus, the reported location-related pattern similarity effects go beyond what can be explained by temporal distance alone.”

      How was the free parameter of spectral clustering determined, if there is one?

      The interpretation of the number of hippocampal activity clusters is ambiguous. It is suggested that this number could fluctuate due to unique activity patterns or the fit to behaviorally defined trajectories. A lower number of clusters might indicate either a noisier or less distinct representation, raising the question of the necessity and interpretability of such a complex analysis. This concern is compounded by the potential sensitivity of the clustering to the variance in Euclidean distances of each trial's position relative to the center. If a character's position is consistently near the center, this could artificially reduce the perceived number of clusters. Furthermore, the manuscript should address whether there is any correlation between the number of clusters and behavioral performance. Specifically, what are the implications if participants are able to perform the task adequately with a smaller number of distinct hippocampal representation states?

      The rationale for conducting both cluster analysis and position decoding as separate analyses remains unclear. While cluster analysis can corroborate the findings of position decoding, it is not apparent why the authors chose to include trials across characters for cluster analysis but not for decoding analysis. An explanation of the reasoning behind this methodological divergence would help in understanding the distinct contributions of each analysis to the study's findings.

      The paper by Cohen et al. (1997), which provides the questionnaire for measuring the social network index, is not cited in the references. Upon reviewing the questionnaire that the author may have used, it appears that the term "social network size" does not refer to the actual size but to a score or index derived from the questionnaire responses. It may be more appropriate to replace the term "size" with a different term to more accurately reflect this distinction.

      Thank you for seeking these clarifications. Given the complexity of this analysis, we have decided to drop it to focus instead on our dimension and location representational similarity analysis results.

      Reviewer #2 (Recommendations for the authors):

      How did the participants' decisions on previous trials influence the future trials that the subjects saw? If the different participants were faced with different decision trials, then how did you compare their decision? If two participants made the same decisions, would they have seen exactly the same sequence of trials (see point X on how the trial sequence was randomized).

      All participants experience the same narrative, with the same decisions (i.e., the same available options); their choices (i.e., the options they select) are what implicitly shape each character’s affiliation and power locations, and thus each character’s trajectory. In other words, the narrative is fixed; what changes is the social coordinates assigned to each trial’s outcome depending on the participant’s choice of how to interact from the two narrative options. This means that we can meaningfully compare participants' neural patterns, given that every participant received the same text and images throughout.
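To make the idea concrete, here is a minimal sketch of how a fixed sequence of decisions could map onto a character's trajectory. It assumes that each choice moves the character by +1 or -1 on the current dimension and that coordinates accumulate across that character's decisions. The accumulation rule and the function name `trajectory` are illustrative assumptions, not the authors' implementation.

```python
def trajectory(decisions):
    """Turn one character's decisions into a path of (affiliation, power) points.

    `decisions` is a list of (dimension, direction) pairs, where dimension is
    "affiliation" or "power" and direction is +1 or -1 (the chosen option).
    Assumption: coordinates are the running sum of choices on each dimension.
    """
    affiliation, power = 0, 0
    path = []
    for dimension, direction in decisions:
        if dimension == "affiliation":
            affiliation += direction
        elif dimension == "power":
            power += direction
        else:
            raise ValueError(f"unknown dimension: {dimension!r}")
        path.append((affiliation, power))
    return path

# Same trials, different choices: two hypothetical participants.
participant_a = trajectory([("affiliation", 1), ("power", -1), ("affiliation", 1)])
participant_b = trajectory([("affiliation", -1), ("power", -1), ("affiliation", 1)])
```

Both participants see the same three trials in the same order; only the selected options, and therefore the resulting coordinates, differ.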

      We have now added details on the narrative structure, replacing more ambiguous statements with a clearer description (page 16, lines 309-318):

      “The sequence of trials, including both narrative and decision trials, was fixed across participants; all that differs are the choices that the participants make. Narrative trials varied in duration, depending on the content (range 2-10 seconds), but were identical across participants. Decision trials always lasted 12 seconds, with two options presented until the participant made a choice, after which a blank screen was presented for the remainder of the duration. All six characters’ decision trials are interleaved with one another, and with the narrative slides. On average, after a decision trial for a given character, participants view ~11 narrative slides and complete ~3 decisions for other characters before returning to another decision with the same character, such that each character’s choices are separated by an average of ~20 seconds (ranging from 12 seconds to 10 min).”

      Figure 2B: I assume that "count" is "count of participants"? It would be good to indicate this on the axis/caption.

      Thank you for noting this. We have now removed this figure to improve the clarity of our figures. 

      "We have shown that the hippocampus represents the interaction decision trials along abstract social dimensions, but does it track each relationship's unique sequence of abstract social coordinates?" Please clarify what you mean by "represents the interaction decision trials".

      By “represents the interaction decision trials along abstract social dimensions”, we mean that when the participant makes a choice during the social interactions the hippocampal patterns represent the current social dimension of the choice (affiliation vs power). In other words, the hippocampal BOLD patterns differentiate affiliation and power decisions, consistent with our hypothesis of abstract social dimension representation in the hippocampus. We have clarified this (page 11, lines 185-187):

      “We have shown that the left hippocampus represents the affiliation and power trials differently, consistent with an abstract dimensional representation.”

      Page 8: "Hippocampal sequences are ordered like trajectories": It is not entirely clear to me what is meant by the split midpoint. Is this the midpoint of the piece-wise linear interpolation between two points, or simply the mean of all piecewise splines from one character? If the latter, is the null model the same as simply predicting the mean affiliation and power value for this character? If yes, please clarify and simplify this for the reader.

      Page 8: "Hippocampal sequences track relationship-specific paths". First, I was misled by the "relationship-specific". I first understood this to mean that you wanted to test whether two relationships (i.e. the identity of the partner) had different representations in Hippocampus, even if the power/affiliation trajectories are the same. I suggest changing the title of this section.

      The analysis in this section also breaks any temporal autocorrelation of measured patterns - so I am not sure if this is a strong analysis that should be interpreted at all. This analysis seems to not address the claim and conclusion that is drawn from it. I assume that the random trajectories have different choices and different affiliation/power values than the true trajectories. So the fact that the true trajectories can be better decoded simply shows that either choices or affiliation and power (or both) are represented in the neural code - but not necessarily anything beyond this.

      Page 9: "Neural trajectories reflect social locations, not just choices". The motivation of this analysis is not clear to me. As I understand this analysis, both social location and choices are changed from the real trajectories. How can it then show that it reflects social locations, not just the choices?

      Figure 4 caption: "on the -based approximation" Is there a missing "point"-[based] here?

      We agree with the reviewer that this analysis is hard to interpret and does not adequately address concerns regarding temporal autocorrelation, and as such we have removed it from the manuscript. We describe the new results that include controlling for temporal distance between trials (pages 11-12, lines 185-208):

      “We have shown that the left hippocampus represents the affiliation and power trials differently, consistent with an abstract dimensional representation. Does it also represent the changing social coordinates of each character? To test this, we used a multiple-regression RSA searchlight to test whether left hippocampus patterns represent the characters’ changing social locations across interactions (see Figure 3). We restricted the distances to those from trial pairs from the same character and standardized the distances within character (see Figure 3BD). We controlled for temporal distance to ensure the effect was not explainable by the time between trials, and for whether the trials shared the same underlying dimension (affiliation or power; see Location similarity searchlight analyses for more details). At the group level, we controlled for sample and the average reaction time difference between affiliation and power trials. Using the same testing logic as the dimensionality similarity analysis, we first tested our hypothesis in the bilateral hippocampus and found widespread effects in both the left (peak voxel MNI x/y/z = -35/-22/-15, cluster extent = 1470 voxels) and right (peak voxel MNI x/y/z = 37/-19/-14, cluster extent = 1953 voxels) hemispheres. The whole-brain searchlight analysis revealed additional clusters in the left putamen (-27/-3/14, cluster extent = 131 voxels) and left posterior cingulate cortex (-10/-28/41, cluster extent = 304 voxels).”

      “We then asked a second, complementary question: does the hippocampus represent all interactions, across characters, within a shared map? To test for this map-like structure, we repeated the analysis but now included all trial pairs, z-scoring distances globally rather than within character (Figure 3E-F). The remainder of the procedure followed the same logic as the preceding analysis. The hippocampus analysis revealed an extensive right hippocampal cluster (27/27/-14, cluster extent = 1667 voxels). The whole-brain analysis did not show any significant clusters.”

      We emphasize that the results are robust to the inclusion of temporal distance squared, in the methods (pages 23-24, lines 493-496):

      “Although the square of this temporal distance did not explain any additional variance in behavioral distances, we ran a robustness analysis including both temporal distance and its square and saw qualitatively the same clusters with similar effect sizes.”

      Page 8, last paragraph: The text sounds like you have already shown that you can decode character identity from the patterns, but I do not believe you have at this point. I think this would be an interesting addition to the paper, though.

      This section has been removed, and we have been careful to not imply this in the current version of the manuscript. While we agree a character identity decoding would enrich our argument, we do not believe our task is well-suited to capture a character identity effect. Each character only has 12 decision trials, and these trials are partially clustered in time - this is one problem of temporal autocorrelation that we thank the reviewers for pushing us to consider in more detail. Dimension and location patterns, on the other hand, are more natural to analyze in our task, especially in representational similarity analyses that test whether the relevant differences scale with neural distances.

      Page 14ff: Why is "Analysis section" not part of "Materials and Methods"? I believe adding the analysis after a careful description of the methods would improve the clarity of this section.

      We agree with the reviewer and have now consolidated these two sections.

      Two or three examples of Affiliation and Power decision trials should be provided, so the reader can form a more thorough understanding of how these dimensions were operationalized. For the RSA analysis, it is important to consider other differences between these two types of trials.

      We agree that adding examples will clarify the operationalization of these dimensions. We now include example affiliation and power trials in a table (pages 17-18).

      We thank the reviewer for noting the need to rule out alternative hypotheses; we have added several such tests. Affiliation and power trials were not different in word count (page 17, lines 329-332):

      “To ensure that any observed neural or behavioral differences were not confounded by trivial features of the text, we tested for differences between the affiliation and power trials (where the two options are concatenated). There were no differences in word count (affiliation average = 26.6, power average = 25.6; t-test p = 0.56).”

      They were also not different in their sentiment, as assessed by a Large Language Model (LLM) analysis (page 17, lines 332-335): 

      “The text’s sentiment also did not differ between these trial types (t-test p = 0.72), as quantified by comparing sentiment compound scores (from most negative, −1, to most positive, +1), using a Large Language Model (LLM) specialized for sentiment analysis [26].”

      The affiliation and power trials were different in terms of semantic content, consistent with our assumptions (page 17, lines 337-347):

      “Our framework assumes that affiliation and power trials differ in their semantic content–that is, in the conceptual meaning of the text, beyond word count or sentiment. To test this assumption, we used an LLM-based semantic embedding analysis. Each decision trial was embedded into a semantic vector. We then measured the cosine similarity between pairs of trials and calculated the difference between average within-dimension similarity (affiliation-affiliation and power-power comparisons) and average between-dimension similarity (affiliation-power comparisons) and assessed its statistical significance with permutation testing (1,000 shuffles of trial labels). As expected, decision trials of the same dimension were more similar to each other than trials of a different dimension, across multiple LLMs (OpenAI’s text-embedding-3-small [27]: similarity difference = 0.041, p < 0.001; all-MiniLM-L12-v2 [28]: similarity difference = 0.032, p < 0.001).”
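The within- versus between-dimension comparison and its label-shuffling test can be sketched as follows. This is a minimal stdlib-only illustration, assuming the embeddings are already available as lists of floats. The one-sided comparison and the `(count + 1) / (n + 1)` p-value correction are assumptions of this sketch rather than details stated in the manuscript, and all function names are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_difference(embeddings, labels):
    """Mean within-label similarity minus mean between-label similarity."""
    within, between = [], []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            s = cosine(embeddings[i], embeddings[j])
            (within if labels[i] == labels[j] else between).append(s)
    return sum(within) / len(within) - sum(between) / len(between)

def permutation_test(embeddings, labels, n_shuffles=1000, seed=0):
    """Shuffle trial labels and count how often the shuffled difference
    is at least as large as the observed one (one-sided)."""
    rng = random.Random(seed)
    observed = similarity_difference(embeddings, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if similarity_difference(embeddings, shuffled) >= observed:
            count += 1
    return observed, (count + 1) / (n_shuffles + 1)

# Toy 2-D "embeddings" (made up) for two affiliation and two power trials.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["affiliation", "affiliation", "power", "power"]
observed, p_value = permutation_test(toy_embeddings, toy_labels)
```

With only four toy trials the p-value cannot become small. The point of the sketch is the structure of the test, not the numbers.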

      The affiliation and power trials were different in average reaction time. To control for this difference in the dimension RSA analysis, we added each participant’s absolute value reaction time difference between the trial types as a covariate. The results were nearly identical to what they were before. We updated the text to reflect this new control (page 23, lines 471-474):

      “However, there was a significant difference in the average reaction time between affiliation and power decisions across participants (t<sub>49</sub> = 6.92, p < 0.001; affiliation mean = 4.92 seconds (s), power mean = 4.51 s), so we controlled for this in the group-level analysis.”

      The exact implementation and timing of the behavioral tasks should be described better. How many narrative trials were intermixed with the decision trials? Which characters were they assigned to? How was the sequence of trials determined? Was it fixed across participants, or randomized?

      We agree that additional details are helpful. In the Methods, we now describe this with more detail (page 16, lines 301-318):

      “There are two types of trials: “narrative” trials where background information is provided or characters talk or take actions (a total of 154 trials), and “decision” trials where the participant makes decisions in one-on-one interactions with a character that can change the relationship with that character (a total of 63 trials). On each decision, participants used a button response box to select between the two options. The options (1 or 2, assigned to the index and middle fingers) and choice directions (+/-1 arbitrary unit on the current dimension) were counterbalanced.”

      “The sequence of trials, including both narrative and decision trials, was fixed across participants; all that differs are the choices that the participants make. Narrative trials varied in duration, depending on the content (range 2-10 seconds), but were identical across participants. Decision trials always lasted 12 seconds, with two options presented until the participant made a choice, after which a blank screen was presented for the remainder of the duration. All six characters’ decision trials are interleaved with one another, and with the narrative slides. On average, after a decision trial for a given character, participants view ~11 narrative slides and complete ~3 decisions for other characters before returning to another decision with the same character, such that each character’s choices are separated by an average of ~20 seconds (ranging from 12 seconds to 10 min).”

      What is the exact timing of trials during fMRI acquisition - i.e. how long were the trials, what was the ITI, were there long phases of rest to determine the resting baseline? These are all important factors that will determine the covariance between regressors and should be reported carefully. Ideally, I would like to see the trial-by-trial temporal auto-correlation structure across beta-weights to be reported.

      We thank the reviewer for asking for this clarification. We have added the following text to clarify the trial timing (page 16, lines 314-318):

      “All six characters’ decision trials are interleaved with one another and with narrative slides. On average, after a decision trial for a given character, participants view ~11 narrative slides and complete ~3 decisions for other characters before returning to that same character, such that each character’s choices are separated by an average of ~20 seconds (range 12 seconds to 10 min).”

      We now describe the temporal autocorrelation patterns in the supplement, including how we decided on how to control for temporal distance in representational similarity analyses (pages 29-31, lines 593-656):

      “The Social Navigation Task is a narrative-based task, where the relationships with characters evolve over time; trial pairs that are close in time may have more similar fMRI patterns for reasons unrelated to social mapping (e.g., slow drift). It is important to account for the role of time in our analyses, to ensure effects go beyond simple temporal confounds, like the time between decision trials. To aid in this, we quantified how fMRI signals change over time using a pattern autocorrelation function across decision trial lags. We defined the left and right hippocampus and the left and right intracalcarine cortex using the Harvard-Oxford atlas and thresholded them at 50% probability. We chose intracalcarine cortex as an early visual control region that largely corresponds to primary visual cortex (V1), as it is likely to be driven by the visually presented narrative. We used the same trial-wise beta images as in the location similarity RSA (boxcar regressors spanning each decision trial’s reaction time). For each participant and region-of-interest (ROI), we extracted the decision trial-by-voxel beta matrix and quantified three kinds of temporal dependence: beta autocorrelation, multivoxel pattern correlation and multivoxel pattern correlation after regressing out temporal distance.”

      “To estimate the temporal autocorrelation of the trial-wise beta values, we treated each voxel’s beta values as a time series across trials and measured how much a voxel’s response on one trial correlated (Pearson) with its response on previous trials. We averaged these voxel wise autocorrelations within each ROI. At one trial apart (lag 1), both the hippocampus and V1 showed small positive autocorrelations, indicating modest trial-to-trial carryover in response amplitude (see Supplemental figure 1) that by three trials apart was approximately 0.”

      “Because our representational similarity analyses depend on trial-by-trial pattern similarity, we also estimated how multivoxel patterns were autocorrelated over time. For each lag, we computed the Pearson correlation between each trial’s voxelwise pattern and the pattern from the trial that many trials earlier, then averaged those correlations to obtain a single autocorrelation value for that lag. At one trial apart, both regions showed positive autocorrelation, with V1 having greater autocorrelation than the hippocampus; pattern correlations between trials 3 or 4 trials apart reduced across participants, settling into low but positive values. Then, for each participant and ROI, we regressed out the effect of absolute trial onset differences from all pairwise pattern correlations, to mirror the effects of controlling for these temporal distances in regressions. After removing this temporal distance component, the short lag pattern autocorrelation dropped substantially in both regions. The similarity in autocorrelation profiles between the two regions suggests that significant similarity effects in the hippocampus are unlikely to be driven by generic temporal autocorrelation.”

      “Relationship between behavioral location distance and temporal distance”

      “We also quantified how temporal distances between trials relate to their behavioral location distances, participant by participant. Our dimension similarity analysis controls for temporal distance between trials by design (see Social dimension similarity searchlight analysis), but our location similarity analysis does not. To decide on covariates to include in the analysis, we tested whether temporal distances can explain behavioral location distances. For each participant, we computed the correlations between trial pairs’ Euclidean distances in social locations and their linear temporal distances (“linear”) and the temporal distances squared (“quadratic”), to test for nonlinear effects. We then summarized the correlations using one-sample t-tests. The linear relationship was statistically significant (t<sub>49</sub> = 12.24, p < 0.001), whereas the quadratic relationship was not (t<sub>49</sub> = -0.55, p = 0.586). Similarly, in participant-specific regressions with both linear and quadratic temporal distances, the linear effect was significant (t<sub>49</sub> = 5.69, p < 0.001) whereas the quadratic effect was not (t<sub>49</sub> = 0.20, p = 0.84). Based on this, we included linear temporal distances as a covariate in our location similarity analyses (see Location similarity searchlight analyses), and verified that adding a quadratic temporal distance covariate does not alter the results. Thus, the reported location-related pattern similarity effects go beyond what can be explained by temporal distance alone.”

    1. So, if passwords are impossible to protect on their own, what do we do? That's where two-factor authentication comes in. Two-factor authentication, or 2FA, adds a second method of identity verification to secure your accounts. First, the thing you know, your password; then something unique that you have, like your phone or fingerprint. By combining your password with one of these factors, attackers can't access your account, even if they have your password. The most common 2FA systems use a unique one-time code with every login attempt. This code is tied to your account and generated by a token or smartphone, or sent to you by text message. The more modern and most secure form of 2FA uses a mobile app to send an approval notification to your smartphone or smartwatch for the least hassle possible. With 95% of breaches involving account takeover, two-factor authentication is the most effective method of prevention. It's time for everybody (businesses, governments, and you) to take the easy and effective step of enabling two-factor authentication on all accounts. If it uses a password, it needs to use two-factor authentication.
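The transcript does not say how the one-time codes are produced. One widely used scheme is the time-based one-time password (TOTP, RFC 6238), built on HOTP (RFC 4226). The following is a minimal sketch under that assumption, using only the Python standard library. The function names are illustrative, and real deployments would also need secret provisioning, clock-drift windows, and rate limiting.

```python
import hashlib
import hmac
import struct
import time

def hotp(secret, counter, digits=6):
    """HMAC-based one-time password (RFC 4226) for a given counter value."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation offset from the last nibble
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(secret, now=None, step=30, digits=6):
    """Time-based one-time password (RFC 6238): HOTP over the current time step."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now // step), digits)

# Shared secret from the RFC test vectors (ASCII "12345678901234567890").
rfc_secret = b"12345678901234567890"
code_at_59s = totp(rfc_secret, now=59)
```

The server and the user's device share the secret and each compute the code for the current 30-second window; the login succeeds only if the submitted code matches.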

    1. There are a myriad of organisations that help support women to have careers in the music industry, from those who gave evidence to us, including the F-List, Black Lives in Music, Cactus City, and Women in CTRL, to more local schemes such as Girls Rock London, Yorkshire Sound Women Network and Manchester-based Brighter Sound to name just a few.71 Initiatives such as UK Music’s Five P’s action plan;72 the global Keychange pledge;73 a joint code of practice by the Musicians’ Union and the ISM74 and the best practice framework for the industry being developed by the BPI (British Phonographic Industry) and others are just some that seek to promote a change of culture in the industry. But as Lady of the House, a platform championing women in electronic music, told us: The fact we have so many initiatives, communities, programs etc is amazing but on the other hand shows the desperate need to straighten the music industry out so it can protect and give women an equal opportunity in regards to equity.75

      Support systems for women in the music industry.

    1. Example of person working with Obsidian and Claude Code. Note that it does not use the Obsidian CLI access, but its API.

      Same author some months back mentioned running Claude Code on a small Hetzner VPS (but just Claude Code, no models) so he could access it from anywhere (except offline obviously).

    1. Comparison video of Claude Code using Anthropic's cloud models vs local models on an M4 128GB. Still a heavy lift, fans spinning, memory usage almost at full capacity. But it works. Means that for my M1 16GB a smaller model is all that works, and you need to leave room for context loading too. For one-offs like code generation and for interactive work in shifting contexts, there are different needs.

    1. The Commission completed the review of the functioning of the EECC on 21 January 2026 with the adoption and publication of a Report to the European Parliament and the Council. After highlighting several challenges, the Digital Networks Act (DNA) proposal aims to replace the Code. In turn, the DNA will create a modern, simplified and more harmonised legal framework, that boosts innovation and investment in resilient and advanced digital infrastructure, that is critical for enabling the adoption of AI, cloud, space and other innovative technologies.

      The EECC was reviewed and is now to be replaced by [[The Digital Networks Act]]. Another example of moving from a directive to a regulation, and therefore a stronger move towards the single market. Cf. the PSI Directive moving into the DA.

    1. Reference framework on control measures in schools: Summary note

      https://www.youtube.com/watch?v=D43t0L_G7-Y

      Executive summary

      This reference document, the product of a collaboration between the ministère de l’Éducation (MEQ) and the Fédération des centres de services scolaires du Québec (FCSSQ), sets out national guidelines on the use of control measures (restraint and seclusion) in educational institutions.

      The fundamental premise is that these measures should be considered only as a last resort, exclusively in emergency situations where the safety of the student or of others is under imminent threat.

      The framework favors a preventive and educational approach, structured around the multi-tiered system of supports (Système de soutien à paliers multiples, SSPM), with the aim of keeping the use of force or coercion to a minimum.

      It clarifies legal and professional responsibilities, particularly since the regulatory amendments of October 2023 that authorize certain professionals (psychologists and psychoeducators) to decide on the use of restraint measures.

      Implementation rests on a rigorous five-step process, which includes drawing up specific protocols (school or student) and applying post-incident procedures to ensure well-being and the constant reassessment of practices.

      1. Foundations and guiding principles

      The use of control measures is strictly governed by legal references (Charter of Rights and Freedoms, Civil Code, Education Act) and must respect the principles of the student's dignity, integrity, and safety.

      Fundamental principles of intervention:

      Last resort: Used only when preventive interventions and alternative measures have failed.

      Imminent danger: The threat must be characterized by its foreseeability, its immediacy, and the severity of its consequences.

      Minimal restriction: The measure must be the least restrictive possible and last for the shortest possible time (ending as soon as the danger has passed).

      Respect and dignity: The intervention must be carried out with kindness and human warmth, under constant supervision.

      Mandatory follow-up: Every use must be followed by a post-incident review to evaluate its effectiveness and adjust future interventions.

      2. Definitions of control measures

      The framework distinguishes several types of intervention to ensure a shared understanding across the school network.

      | Type of measure | Description | Examples |
      | --- | --- | --- |
      | Physical restraint | Use of human force to immobilize or direct a student against their will. | Holding the arm of a student who is resisting, or holding the student back if they are hitting. |
      | Mechanical restraint | Use of equipment or material to limit movement. | Safety mitts, restraint vests in school transportation. |
      | Removal of equipment | Confiscation of a device that normally compensates for a disability. | Removing the brakes from a wheelchair or confiscating a walker. |
      | Seclusion | Confinement of the student in a place they cannot freely leave. | Holding the handle of a closed door or physically blocking access. |

      Note: The administration of chemical substances for control purposes requires a medical prescription and is not covered in this document.

      3. Operational framework: Planned vs. unplanned intervention

      The framework distinguishes two contexts of application, which directly affect professional responsibilities.

      | Characteristic | Unplanned intervention | Planned intervention |
      | --- | --- | --- |
      | Context | Unusual and unpredictable behavior. | Known behavior that is likely to recur. |
      | Management tool | School protocol (universal). | Student protocol (individualized, tied to the intervention plan). |
      | Decision (restraint) | Non-reserved activity (emergency). | Activity reserved for authorized professionals. |
      | Decision (seclusion) | Non-reserved activity. | Non-reserved activity (but regulated). |
      | Application | Non-reserved activity. | Non-reserved activity. |

      4. The five-step intervention process

      To ensure safety and respect for rights, a systematic structure is proposed:

      1. Drawing up the protocol: Guidelines are put in place preventively (school committee for the school protocol; school team and parents for the student protocol).

      2. Applying preventive and alternative interventions: Educational strategies are used to avoid a crisis (diversion, securing the environment).

      3. Assessing the danger: Rigorous analysis of the situation against the criteria of foreseeability, immediacy, and severity.

      4. Applying the control measure: Carried out according to the protocol's guidelines and professional recommendations.

      5. Post-incident procedures: Debriefing the event, establishing the facts, supporting witnesses (students and adults), and revising the protocol.

      5. Prevention and school climate

      Prevention is the "first course of action". The document stresses the importance of the multi-tiered system of supports (SSPM):

      Tier 1 (Universal): Proactive support for all students (healthy climate, clear rules, positive relationships).

      Tier 2 (Targeted): Additional support for at-risk students (self-regulation, social skills).

      Tier 3 (Intensive): Individualized interventions for serious or persistent difficulties.

      The CSSMB's "3 x 3" model is cited as an example; it crosses the intensity of the intervention with the individual, school, and family spheres.

      6. Key roles and responsibilities

      The success of this framework rests on shared responsibility:

      School administration: Coordinates the drafting of protocols, ensures staff training, and looks after the physical and psychological well-being of everyone.

      Authorized professional staff (occupational therapists, nurses, physicians, physiotherapists, psychoeducators, psychologists): Carries out the clinical assessment, decides on the measure in a planned context, and issues recommendations.

      School staff involved in interventions: Contribute to the analysis of behaviors, apply the measures in line with the protocols, and inform the administration.

      Parents and students: Must be actively involved in drawing up the student protocol. Free and informed consent is required for any planned measure.

      Quotations and critical information

      "A control measure [...] is a last-resort intervention that should be carried out exclusively in an emergency, that is, when the safety of staff or students is threatened." (Bernard Drainville, Minister of Education)

      "The use of a control measure is not advocated in schools. [...] It must never be used as an educational or punitive measure, or to make supervising the student easier." (Source context, Section 1.1)

      "The use of control measures is liable to cause physical and psychological injuries that may have long-term implications." (Source context, Section 1)