1 Matching Annotations
  1. Last 7 days
    1. In attention inference kernels, GEMMs in the linear layers of FFN/MLP blocks plus the Q, K, V, and output projections account for approximately 70% of total FLOPs. Scaled dot-product attention, fused and flash attention variants account for another 25%. Together, these two kernel families represent more than 90% of end-to-end inference compute.

      大多数人认为优化整个应用程序或算法才能获得显著性能提升,但作者指出,仅仅优化占计算量90%的两个关键内核类型就能带来最大收益。这与广泛应用的"全面优化"策略相悖,暗示开发者应该将资源集中在最关键的代码路径上。