8 Matching Annotations
  1. Last 7 days
    1. Through training, the models eventually learn that tokens 2339 and 588 can be conceptually identical, or they can have distinct meanings when “like” is a subword (e.g., unlike vs. alike vs. businesslike).

      That's cool how the same word can map to different tokens depending on the context it's used in (see the first sketch after this list)

    2. Here’s the thing: ChatGPT doesn’t know how many “r”s are in strawberry, because what we see as the word “ strawberry” GPT sees as token 41236. The letter “r” is token 81.

      That's interesting; I thought ChatGPT was actually reading and processing the words we type (see the second sketch after this list)

    3. But on the other hand, English is by far the most well-represented language on the web, which is mostly what LLMs are trained on. There is simply more English in the training data than any other language. This means that tokenizers have a lot more data to find optimal chunking patterns in English compared to any other language.

      I feel like they should try to build LLMs in other languages so people around the world can use them (see the third sketch after this list)
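
    A minimal sketch of the “like” example above, using OpenAI's tiktoken library. Assumption: the quoted IDs (2339, 588) come from a GPT-2-style vocabulary; exact IDs vary by tokenizer.

    ```python
    # Sketch: one surface word can map to different token IDs by context.
    # Assumption: GPT-2 vocabulary; the quoted IDs 2339/588 may not match
    # other tokenizers. Requires: pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    for text in ["like", " like", "unlike", "alike", "businesslike"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{text!r:16} -> ids={ids} pieces={pieces}")
    ```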
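
    And a sketch of the “strawberry” point: the model receives token IDs, not letters, so it can't just read the “r”s off its input. Assumption: cl100k_base vocabulary; the quoted IDs 41236 and 81 may come from a different tokenizer.

    ```python
    # Sketch: the model sees token IDs, not characters.
    # Assumption: cl100k_base; the quoted IDs may differ elsewhere.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode(" strawberry")

    print("token ids the model sees:", ids)
    print("letters it never sees:", list("strawberry"))
    print("count of 'r':", "strawberry".count("r"))
    ```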
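
    The last annotation is easy to probe: a tokenizer trained mostly on English tends to chunk English more efficiently, so a comparable sentence often costs more tokens in other languages. The sample sentences below are illustrative assumptions; run it to see real counts.

    ```python
    # Sketch: compare token cost of roughly equivalent sentences.
    # Assumption: cl100k_base; the translations are illustrative.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    samples = {
        "English": "The weather is very nice today.",
        "German": "Das Wetter ist heute sehr schön.",
        "Thai": "วันนี้อากาศดีมาก",
    }
    for lang, text in samples.items():
        n = len(enc.encode(text))
        print(f"{lang}: {n} tokens for {len(text)} characters")
    ```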

  2. Jan 2025
    1. If a bit of fun is had along the way, so much the better. Time is short; this is a genre in a hurry.

      Media is becoming more advanced; it's like every day there's something new