Hypothesis

Code is hit harder than unique prose (1.29–1.39x vs 1.20x). Code has more repeated high-frequency strings (keywords, imports, identifiers), exactly the patterns that a Byte-Pair Encoding (BPE) tokenizer trained on code would collapse into long merges.

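This finding challenges our common-sense understanding of code tokenization. We usually assume that code, with its many repeated patterns, should tokenize more efficiently, but the opposite is true. This suggests that the semantic complexity of code goes beyond simple repetition and calls for finer-grained handling. The counterintuitive result also prompts new thinking about how to optimize models for code generation and code understanding.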
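The "long merges" mechanism mentioned in the highlighted passage can be illustrated with a toy character-level BPE trainer. This is only a sketch under stated assumptions: the function names `train_bpe` and `encode`, the tiny corpus, and the merge budget are all hypothetical, and it does not reproduce the tokenizer or the ratios discussed in the source.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy, character-level)."""
    # Each distinct word is stored as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by replaying the learned merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical code-like corpus: a few keywords repeated many times.
corpus = ["import"] * 50 + ["return"] * 50 + ["def"] * 50
merges = train_bpe(corpus, num_merges=20)

print(encode("import", merges))  # a frequent keyword
print(encode("xyzzy", merges))   # a string never seen in training
```

In this toy setting the repeated keyword ends up as a single token, while the unseen string stays split into individual characters. That is the sense in which a vocabulary trained on code rewards repetition, and why text loses that advantage when the vocabulary lacks those merges.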

code-tokenization semantic-complexity
