the robustness of these reasoning behaviors remains underexplored
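"The robustness of these reasoning behaviors remains underexplored": this sentence amounts to a declaration of a collective blind spot across the whole field of reasoning-model research. Over the past two years, test-time compute, long chain-of-thought (CoT), and o1/R1-style reasoning models have attracted enormous attention, yet almost all evaluations have been run in "isolated problem" settings. In real-world Agent deployment scenarios, the most basic reliability question, "can the model maintain its reasoning depth?", was not studied systematically until this paper.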
why is there so little correlation between students’ performance in their physics courses and their ability to do physics research?
Evaluating Omicron and Other COVID Variants to Ensure Test Effectiveness. (n.d.). Abbott. Retrieved December 3, 2021, from https://www.abbott.com/corpnewsroom/diagnostics-testing/monitoring-covid-variants-to-ensure-test-effectiveness.html
Racine, N., Madigan, S., Cardinal, S., Hartwick, C., Leslie, M., Motz, M., & Pepler, D. (2021). Community-Based Research: Perspectives of Psychology Researchers and Community Partners. PsyArXiv. https://doi.org/10.31234/osf.io/cxrmt
Haber, N. A., Wieten, S. E., Rohrer, J. M., Arah, O. A., Tennant, P. W. G., Stuart, E. A., Murray, E. J., Pilleron, S., Lam, S. T., Riederer, E., Howcutt, S. J., Simmons, A. E., Leyrat, C., Schoenegger, P., Booman, A., Dufour, M.-S. K., O’Donoghue, A. L., Baglini, R., Do, S., … Fox, M. P. (2021). Causal and Associational Linking Language From Observational Research and Health Evaluation Literature in Practice: A systematic language evaluation [Preprint]. Epidemiology. https://doi.org/10.1101/2021.08.25.21262631
Sulik, J., & McKay, R. (2021). Studying science denial with a complex problem-solving task [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/huxm7
Policy Opportunities for the Remote Economy | Upwork. (n.d.). Retrieved July 2, 2021, from https://www.upwork.com/press/releases/policy-opportunities-for-the-remote-economy
Tiokhin, L. (2021, April 21). Why indirect contributions matter for science and scientists. Medium. https://leonidtiokhin.medium.com/why-indirect-contributions-matter-for-science-and-scientists-6c9bf827bc7d
Taquet, M. (2021, April 15). COVID-19 and cerebral venous thrombosis: a retrospective cohort study of 513,284 confirmed COVID-19 cases. https://doi.org/10.17605/OSF.IO/H2MT7
Susan Athey, July 22, 2020. (2020, August 2). https://www.youtube.com/watch?v=hqTOPrUxDzM
Barton, C. M., Alberti, M., Ames, D., Atkinson, J.-A., Bales, J., Burke, E., Chen, M., Diallo, S. Y., Earn, D. J. D., Fath, B., Feng, Z., Gibbons, C., Hammond, R., Heffernan, J., Houser, H., Hovmand, P. S., Kopainsky, B., Mabry, P. L., Mair, C., … Tucker, G. (2020). Call for transparency of COVID-19 models. Science, 368(6490), 482.2-483. https://doi.org/10.1126/science.abb8637
Eisen, M. B., & Tibshirani, R. (2020, July 20). Opinion | How to Identify Flawed Research Before It Becomes Dangerous. The New York Times. https://www.nytimes.com/2020/07/20/opinion/coronavirus-preprints.html
Mulligan, M. J., Lyke, K. E., Kitchin, N., Absalon, J., Gurtman, A., Lockhart, S. P., Neuzil, K., Raabe, V., Bailey, R., Swanson, K. A., Li, P., Koury, K., Kalina, W., Cooper, D., Fonter-Garfias, C., Shi, P.-Y., Tuereci, O., Tompkins, K. R., Walsh, E. E., … Jansen, K. U. (2020). Phase 1/2 Study to Describe the Safety and Immunogenicity of a COVID-19 RNA Vaccine Candidate (BNT162b1) in Adults 18 to 55 Years of Age: Interim Report. MedRxiv, 2020.06.30.20142570. https://doi.org/10.1101/2020.06.30.20142570
Project background – Critical Analysis Project. (n.d.). Retrieved July 10, 2020, from http://critical-analysis.org/project-background/
Infurna, F. J., & Luthar, S. S. (2018). Re-evaluating the notion that resilience is commonplace: A review and distillation of directions for future research, practice, and policy. Clinical Psychology Review, 65, 43–56. https://doi.org/10.1016/j.cpr.2018.07.003
Bell, K., & Green, J. (2020). Premature evaluation? Some cautionary thoughts on global pandemics and scholarly publishing. Critical Public Health, 0(0), 1–5. https://doi.org/10.1080/09581596.2020.1769406