NVIDIA GPU compilers apply the same default heuristics (register allocation strategies, instruction scheduling decisions, loop unrolling thresholds, etc.) to every kernel they compile. These heuristics are engineered to produce good results across a vast range of workloads. But "good across the board" and "optimal for your workload" are two very different things.
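Most people assume the compiler already optimizes well enough, so developers only need to focus on algorithms and implementation. The author argues otherwise: even the most advanced GPU compilers rely on general-purpose heuristics that cannot be tuned to a specific workload, and the result is lost performance. This challenges a widely held view in the developer community about how much compiler optimization can deliver.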