7 Matching Annotations
  1. Apr 2026
    1. Chars-per-token on English dropped from 4.33 to 3.60. TypeScript dropped from 3.66 to 2.69. The vocabulary is representing the same text in smaller pieces.

      这一发现挑战了人们对tokenizer效率的直觉认知。通常我们假设更高效的tokenizer应该能用更少的token表示相同内容,但Claude 4.7的tokenizer实际上产生了更多token。这种反直觉的变化表明,Anthropic可能故意牺牲token效率换取更细粒度的语言处理能力,这违背了传统NLP中'更少token=更高效'的常识。

    1. Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.

      tokenizer的更新虽然增加了token使用量,但提高了文本处理效率,这反映了AI模型在基础架构上的持续优化,这种优化虽然带来短期成本增加,但长期将提升AI的处理能力和准确性。

  2. Dec 2023
    1. Python is both a compiled and interpreted language

      The CPython interpreter really is an interpreter. But it also is a compiler. Python must go through a few stages before ever running the first line of code:

      1. scanning
      2. parsing

      Older versions of Python added an additional stage:

      1. scanning
      2. parsing
      3. checking for valid assignment targets

      Let’s compare this to the stages of compiling a C program:

      1. ~~preprocessing~~
      2. lexical analysis (another term for “scanning”)
      3. syntactic analysis (another term for “parsing”)
      4. ~~semantic analysis~~
      5. ~~linking~~
    1. The tokenizer takes your source code and chunks it into “tokens”. Tokens are just small pieces of source code that you can identify in isolation. As examples, there will be tokens for numbers, mathematical operators, variable names, and keywords (like if or for). The parser will take that linear sequence of tokens and essentially reshape them into a tree structure (that's what the T in AST stands for: Tree). This tree is what gives meaning to your tokens, providing a nice structure that is easier to reason about and work on. As soon as we have that tree structure, our compiler can go over the tree and figure out what bytecode instructions represent the code in the tree. For example, if part of the tree represents a function, we may need a bytecode for the return statement of that function. Finally, the interpreter takes those bytecode instructions and executes them, producing the results of our original program.
    2. Recap

      In this article you started implementing your own version of Python. To do so, you needed to create four main components:

      A tokenizer: * accepts strings as input (supposedly, source code); * chunks the input into atomic pieces called tokens; * produces tokens regardless of their sequence making sense or not.

      A parser: * accepts tokens as input; * consumes the tokens one at a time, while making sense they come in an order that makes sense; * produces a tree that represents the syntax of the original code.

      A compiler: * accepts a tree as input; * traverses the tree to produce bytecode operations.

      An interpreter: * accepts bytecode as input; * traverses the bytecode and performs the operation that each one represents; * uses a stack to help with the computations.