29 Matching Annotations
  1. Last 7 days
    1. These gains come on top of already-optimized baselines in kernels that were considered "done" by their authors. The improvements are the direct result of CompileIQ discovering compiler configurations that the default heuristics would never select.

      大多数人认为一旦开发者完成优化工作,就没有更多性能提升空间。但作者表明,即使是"完成"的优化代码仍可能通过编译器级别的调整获得显著提升(高达15%),这挑战了开发者对优化极限的认知。

    2. Most auto-tuning tools optimize for a single metric, typically runtime. CompileIQ goes further, supporting multi-objective optimization, simultaneously exploring trade-offs across competing objectives like runtime, compile time, and power consumption.

      大多数人认为性能优化应以运行时间为唯一目标,但作者提出,真正的优化需要考虑多个相互竞争的目标(运行时间、编译时间和功耗)。这与传统的单一目标优化理念相悖,暗示开发者需要更全面的优化策略。

    3. In attention inference kernels, GEMMs in the linear layers of FFN/MLP blocks plus the Q, K, V, and output projections account for approximately 70% of total FLOPs. Scaled dot-product attention, fused and flash attention variants account for another 25%. Together, these two kernel families represent more than 90% of end-to-end inference compute.

      大多数人认为优化整个应用程序或算法才能获得显著性能提升,但作者指出,仅仅优化占计算量90%的两个关键内核类型就能带来最大收益。这与广泛应用的"全面优化"策略相悖,暗示开发者应该将资源集中在最关键的代码路径上。

    4. NVIDIA GPU compilers apply the same default heuristics (register allocation strategies, instruction scheduling decisions, loop unrolling thresholds, etc.) to every kernel they compile. These heuristics are engineered to produce good results across a vast range of workloads. But "good across the board" and "optimal for your workload" are two very different things.

      大多数人认为编译器已经提供了足够的优化,开发者只需关注算法和代码实现即可。但作者认为,即使是最先进的GPU编译器也使用通用的启发式方法,这些方法无法针对特定工作负载进行优化,导致性能损失。这挑战了开发者社区对编译器优化能力的普遍认知。

  2. May 2026
    1. increase the ability of our trained Graph Neural Network (GNN) model to find feasible solutions for the problem from 14% to over 88%

      这是一个惊人的性能提升,从14%到88%的可行解发现能力增加了约6倍。这表明AlphaEvolve在电网优化问题上有突破性进展,显著减少了电网后处理步骤的需求,可能带来巨大的能源效率提升。

    1. We approached this through caching, eliminating unnecessary network hops, improving our safety stack to quickly flag issues, and—most importantly—building a way to create a persistent connection to the Responses API, instead of having to make a series of synchronous API calls.

      最佳实践建议:通过缓存、减少网络跳数、改进安全栈和建立持久连接来优化性能。

  3. Apr 2026
    1. The variance is also worth noting: baseline+FA TG has ±19 t/s of noise, while optimized+FA has ±0.59 t/s on x86. The fusions eliminate intermediate writes that pollute the cache, making the hot paths more predictable.

      这一数据揭示了优化的一个意外但重要的好处:不仅提高了性能,还显著降低了结果变异性。这表明通过减少缓存污染和内存访问模式的不确定性,优化可以使系统行为更加可预测。这一发现对构建可靠的高性能系统具有重要意义,强调了优化的一致性而不仅仅是峰值性能。

    2. Coding agents working from code alone generate shallow hypotheses. Adding a research phase — arxiv papers, competing forks, other backends — produced 5 kernel fusions that made llama.cpp CPU inference 15% faster.

      这一声明揭示了AI代理在代码优化中的关键局限:仅基于代码的优化会产生浅显的假设。通过引入研究阶段,包括阅读学术论文、研究竞争项目和后端实现,代理能够发现更深层次的优化机会,实现了显著的性能提升。这表明AI代理需要更广泛的上下文信息才能做出有意义的创新。

    3. The variance is also worth noting: baseline+FA TG has ±19 t/s of noise, while optimized+FA has ±0.59 t/s on x86.

      令人惊讶的是:优化后的代码不仅提高了性能,还显著减少了结果方差(从±19 t/s降至±0.59 t/s)。这表明AI代理的优化不仅关注速度,还考虑了内存访问模式的可预测性,这种全面性思维令人印象深刻。

    1. GLM-5.1 pushes this frontier further, delivering 3.6× speedup and continuing to make progress well into the run. While its rate of improvement also slows over time, it sustains useful optimization for substantially longer than GLM-5.

      令人惊讶的是:在机器学习工作负载优化任务中,GLM-5.1能够实现3.6倍的速度提升,并且在长时间运行中持续改进,而其他模型很快就会达到性能瓶颈。这种持续优化的能力对于实际应用中的复杂问题解决具有重要意义。

    1. Artificial Analysis has also positioned Gemini 3.1 Flash TTS within its 'most attractive quadrant' for its ideal blend of high-quality speech generation and low cost.

      令人惊讶的是:这个模型不仅质量高,而且成本效益也非常出色,在'最具吸引力象限'中占据一席之地。这表明Google在平衡AI性能和商业可行性方面取得了显著突破,这对大多数用户来说是意想不到的。

    1. Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak chip specifications.

      大多数人认为AI性能主要由芯片规格决定,但作者强调硬件、软件和模型的协同设计才是关键,这挑战了以芯片为中心的行业认知,暗示了全栈优化比单纯追求芯片性能更重要。

  4. Apr 2023
  5. Jan 2022
  6. scattered-thoughts.net scattered-thoughts.net
    1. The latest SQLite 3.8.7 alpha version is 50% faster than the 3.7.17 release from 16 months ago. [...] This is 50% faster at the low-level grunt work of moving bits on and off disk and search b-trees. We have achieved this by incorporating hundreds of micro-optimizations. Each micro-optimization might improve the performance by as little as 0.05%. If we get one that improves performance by 0.25%, that is considered a huge win. Each of these optimizations is unmeasurable on a real-world system (we have to use cachegrind to get repeatable run-times) but if you do enough of them, they add up.
  7. Mar 2021
    1. If a UTF8-encoded Ruby string contains unicode characters, then indexing into that string becomes O(N). This can lead to very bad performance in string_end_with_semicolon?, as it would have to scan through the whole buffer for every single file. This commit fixes it to use UTF32 if there are any non-ascii characters in the files.
    1. What is the point of avoiding the semicolon in concat_javascript_sources

      For how detailed and insightful his analysis was -- which didn't elaborate or even touch on his not understanding the reason for adding the semicolon -- it sure appeared like he knew what it was for. Otherwise, the whole issue would/should have been about how he didn't understand that, not on how to keep adding the semicolon but do so in a faster way!

      Then again, this comment from 3 months afterwards, indicates he may not think they are even necessary: https://github.com/rails/sprockets/issues/388#issuecomment-252417741

      Anyway, just in case he really didn't know, the comment shortly below partly answers the question:

      Since the common problem with concatenating JavaScript files is the lack of semicolons, automatically adding one (that, like Sam said, will then be removed by the minifier if it's unnecessary) seems on the surface to be a perfectly fine speed optimization.

      This also alludes to the problem: https://github.com/rails/sprockets/issues/388#issuecomment-257312994

      But the explicit answer/explanation to this question still remains unspoken: because if you don't add them between concatenated files -- as I discovered just to day -- you will run into this error:

         (intermediate value)(...) is not a function
             at something.source.js:1
      

      , apparently because when it concatenated those 2 files together, it tried to evaluate it as:

         ({
           // other.js
         })()
         (function() {
           // something.js
         })();
      

      It makes sense that a ; is needed.

  8. Dec 2020
  9. Nov 2020
  10. Oct 2020
  11. Jul 2020