the robustness of these reasoning behaviors remains underexplored
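The line "the robustness of these reasoning behaviors remains underexplored" amounts to a statement of a collective blind spot across reasoning-model research. Over the past two years, test-time compute, long chain-of-thought (CoT), and o1/R1-style reasoning models have attracted enormous attention, yet almost all evaluation has been done on isolated problems. In real agent deployments, the most basic reliability question, whether a model can sustain its reasoning depth, was not studied systematically until this paper.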
we seek a posteriori error estimators whose constants do not blow up as 𝜀→0.
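The requirement that the constants "do not blow up as ε→0" exposes the fatal weakness of traditional approaches: when convection dominates (the diffusion coefficient ε tends to zero), the constants in most energy-based error estimates diverge like ε⁻¹ or faster, which makes the estimator useless on practical problems. The paper's key contribution is to construct an estimator that is uniformly valid across the whole convection-diffusion spectrum (from parabolic to hyperbolic), a long-standing open problem in the numerical analysis of partial differential equations.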
In computing, the robustness principle is a design guideline for software that states: "be conservative in what you do, be liberal in what you accept from others". It is often reworded as: "be conservative in what you send, be liberal in what you accept". The principle is also known as Postel's law, after Jon Postel, who used the wording in an early specification of TCP.
https://en.wikipedia.org/wiki/Robustness_principle
Robustness principle: be conservative in what you do, be liberal in what you accept from others.
This is an insightful application of Postel's law (en.wikipedia.org/wiki/Robustness_principle). It remains wrong to write software that assumes local parts of email addresses are case-insensitive, but yes, given that there is plenty of wrong software out there, it is also less than robust to require case sensitivity if you are the one accepting the mail.
Solution: store emails with case sensitivity; send emails with case sensitivity; perform internal searches with case insensitivity.
Robustness principle suggests that we accept case sensitive emails
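The store/send/search rule above can be sketched in a few lines. This is a minimal illustration, not anyone's production code: the `AddressBook` class and its method names are hypothetical, and using `str.casefold()` as the lookup key is an assumption about what "case-insensitive search" should mean here.

```python
class AddressBook:
    """Hypothetical store: keeps each address exactly as entered,
    but looks addresses up case-insensitively."""

    def __init__(self):
        # casefolded address -> address exactly as the user typed it
        self._by_key = {}

    def add(self, address: str) -> None:
        # setdefault keeps the first spelling seen for a given key,
        # so a later lookup never rewrites the stored form
        self._by_key.setdefault(address.casefold(), address)

    def find(self, query: str):
        # search ignores case; the result preserves the original case,
        # and that original form is what should be used when sending
        return self._by_key.get(query.casefold())
```

With this sketch, adding `John.Smith@Example.com` and later searching for `john.smith@example.com` returns the address in its original spelling, which is the form to put on outgoing mail.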
A flaw can become entrenched as a de facto standard. Any implementation of the protocol is required to replicate the aberrant behavior, or it is not interoperable. This is both a consequence of applying the robustness principle, and a product of a natural reluctance to avoid fatal error conditions. Ensuring interoperability in this environment is often referred to as aiming to be "bug for bug compatible".
Deepti Gurdasani. (2022, January 29). Going to say this again because it’s important. Case-control studies to determine prevalence of long COVID are completely flawed science, but are often presented as being scientifically robust. This is not how we can define clinical syndromes or their prevalence! A thread. [Tweet]. @dgurdasani1. https://twitter.com/dgurdasani1/status/1487366920508694529
You should default to the most permissive option imo, and there really is no reason to check anything until you really need to. If it were left to me I'd just use optional chaining, as it also eliminates the need for no-ops.
(lazy checking)
In other words, programs that send messages to other machines (or to other programs on the same machine) should conform completely to the specifications, but programs that receive messages should accept non-conformant input as long as the meaning is clear.
be conservative in what you do, be liberal in what you accept from others
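As a toy illustration of the two halves of the principle, here is a sketch of a sender and a receiver for a made-up `Name: value` header line. The function names and the line format are assumptions chosen for the example and are not taken from any real protocol specification; real protocols often reject some of the input this receiver tolerates.

```python
def format_header(name: str, value: str) -> str:
    """Sender side (conservative): always emit one canonical form,
    'Name: value' with capitalized name parts and a CRLF terminator."""
    canonical = "-".join(part.capitalize() for part in name.strip().split("-"))
    return f"{canonical}: {value.strip()}\r\n"


def parse_header(line: str):
    """Receiver side (liberal): tolerate any name case, stray whitespace,
    and bare LF endings, as long as the meaning is clear."""
    name, sep, value = line.partition(":")
    if not sep or not name.strip():
        # no separator or no name: the meaning is not clear, so refuse
        raise ValueError(f"cannot interpret header line: {line!r}")
    return name.strip().lower(), value.strip()
```

The receiver still rejects input whose meaning cannot be recovered, which matches the "as long as the meaning is clear" qualifier in the quoted passage.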
Grimm, V., Johnston, A. S. A., Thulke, H.-H., Forbes, V. E., & Thorbek, P. (2020). Three questions to ask before using model outputs for decision support. Nature Communications, 11(1), 4959. https://doi.org/10.1038/s41467-020-17785-2
Kekecs, Z., Szaszi, B., & Aczel, B. (2020). ECO, an expert consensus procedure for developing robust scientific outputs [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/9gqru
Lab to field. (n.d.). Behaviourally Informed Organizations. Retrieved June 21, 2020, from https://www.biorgpartnership.com/lab-to-field
Ray, E. L., Wattanachit, N., Niemi, J., Kanji, A. H., House, K., Cramer, E. Y., Bracher, J., Zheng, A., Yamana, T. K., Xiong, X., Woody, S., Wang, Y., Wang, L., Walraven, R. L., Tomar, V., Sherratt, K., Sheldon, D., Reiner, R. C., Prakash, B. A., … Consortium, C.-19 F. H. (2020). Ensemble Forecasts of Coronavirus Disease 2019 (COVID-19) in the U.S. MedRxiv, 2020.08.19.20177493. https://doi.org/10.1101/2020.08.19.20177493
Air Pollution Exposure and COVID-19. COVID-19 and the Labor Market. (n.d.). IZA – Institute of Labor Economics. Retrieved August 7, 2020, from https://covid-19.iza.org/publications/air-pollution-exposure-and-covid-19/
Partial Lockdown and the Spread of COVID-19: Lessons from the Italian Case. COVID-19 and the Labor Market. (n.d.). IZA – Institute of Labor Economics. Retrieved August 4, 2020, from https://covid-19.iza.org/publications/dp13375/
Mohseni-Kabir, A., Pant, M., Towsley, D., Guha, S., & Swami, A. (2020). Percolation Thresholds for Robust Network Connectivity. ArXiv:2006.14496 [Cond-Mat, Physics:Physics]. http://arxiv.org/abs/2006.14496
Are All Layers Created Equal?
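This Google paper rests on two simple ideas. The first is to reset the parameters of each layer of a trained network, one layer at a time, to their initial pre-training values and observe how robust each layer is. The second builds on the first: instead of restoring the original initial values, the parameters are re-sampled from some distribution and the effect is observed. The rigorous experimental validation is the part of this paper most worth learning from. [Not actually simple.]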
Using Pre-Training Can Improve Model Robustness and Uncertainty
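This paper responds to and supplements Kaiming He's paper from last year, which argued that pre-training does little for performance (Rethinking ImageNet Pre-training). How does it respond? A glance at this paper's title tells you.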
Is Robustness the Cost of Accuracy? -- A Comprehensive Study on the Robustness of 18 Deep Image Classification Models
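A great paper: rich in information, with a large number of figures. A real eye-opener.
It examines the robustness and accuracy of 18 models. There are many conclusions, for example: model architecture is an important factor in both robustness and accuracy (which seems obvious); adding depth on top of a similar architecture improves robustness only marginally; and some models (the VGG family) produce adversarial examples with strong transferability.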