Hypothesis

In attention inference kernels, GEMMs in the linear layers of FFN/MLP blocks plus the Q, K, V, and output projections account for approximately 70% of total FLOPs. Scaled dot-product attention, fused and flash attention variants account for another 25%. Together, these two kernel families represent more than 90% of end-to-end inference compute.

大多数人认为优化整个应用程序或算法才能获得显著性能提升，但作者指出，仅仅优化占计算量90%的两个关键内核类型就能带来最大收益。这与广泛应用的"全面优化"策略相悖，暗示开发者应该将资源集中在最关键的代码路径上。

non-consensus performance-optimization kernel-hotspots

Tags

Annotators

URL