10 Matching Annotations
  1. Last 7 days
    1. _Self-reported score with custom Anthropic scaffold._ SWEPro were evaluated with the mini-swe-agent scaffold. However, we use the scores reported by Anthropic for Opus with the max thinking efforts due to frequent timeouts during our evaluation trials.

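      Footnote 2 reveals an important data point: the 53.4 score for Opus 4.6 is self-reported by Anthropic, because the authors hit frequent timeouts during their own evaluation and could not verify it themselves. This points to a data-reliability problem in the performance comparison: the Opus result in particular rests on vendor-reported numbers, which may be biased.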

  2. Jan 2022
  3. Sep 2021
  4. Dec 2020
    1. I haven't met anyone who makes this argument who then says that a one stop convenient, reliable, private and secure online learning environment can’t be achieved using common every day online systems

      Reliable: As a simple example, I'd trust Google to maintain data reliability over my institutional IT support.

      And you'd also need to make the argument for why learning needs to be "private", etc.

  5. Aug 2020
  6. Jun 2020
  7. May 2020
  8. Apr 2020