Hypothesis

2 Matching Annotations

Apr 2026
developers.openai.com

https://developers.openai.com/blog/eval-skills

1
1. fxp007 24 Apr 2026
  
  in Public
  
  The most reliable way to improve a skill over time is to evaluate it the same way you would any other prompt for LLM applications.
  
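  Most people might assume that AI agent skills need special, dedicated evaluation methods, but the author argues that they should be evaluated as part of ordinary LLM prompt applications. This view challenges the consensus that agent evaluation requires a special framework and advocates a unified methodology.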
  
  non-consensus skill-evaluation llm-approach

Tags

non-consensus

llm-approach

skill-evaluation

Annotators

fxp007

URL

developers.openai.com/blog/eval-skills
github.com

https://github.com/saffron-health/libretto

1
1. fxp007 16 Apr 2026
  
  in Public
  
  feat(benchmarks): add screenshot-based evaluator, screenshot collector, and --parallelize flag
  - Add screenshot-based LLM judge evaluator (evaluator.ts)
  - Add ScreenshotCollector for capturing browser screenshots during runs
  
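  Surprisingly, this project includes a screenshot-based evaluation system that uses an LLM as a judge to assess the results of automation tasks. It captures browser screenshots and collects this visual data during runs, which offers a new way to evaluate web automation tasks that goes beyond traditional text comparison.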
  
  surprising llm-evaluation visual-testing

Tags

visual-testing

llm-evaluation

surprising

Annotators

fxp007

URL

github.com/saffron-health/libretto
