5 Matching Annotations
  1. Last 7 days
    1. Distillation is a means of extracting understanding from another model; you can send inputs to the teacher model and record the outputs, and use that to train the student model. This is how you get models like GPT-4 Turbo from GPT-4. Distillation is easier for a company to do on its own models, because they have full access, but you can still do distillation in a somewhat more unwieldy way via API, or even, if you get creative, via chat clients.

      Distillation

      Using the outputs of a "teacher model" to train a "student model".
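The teacher/student loop described above can be sketched in a few lines. This is a toy illustration only: the `teacher` here is a stand-in function (in practice it would be API calls to a large model), and the "student" is a one-variable least-squares fit rather than a smaller neural network.

```python
def teacher(x):
    """Stand-in for the teacher model: maps an input to an output."""
    return 2.0 * x + 1.0

# Step 1: send inputs to the teacher and record its outputs.
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
dataset = [(x, teacher(x)) for x in inputs]

# Step 2: train the student on the recorded (input, output) pairs.
# Here training is ordinary least squares on a linear model; a real
# student would be a smaller model trained by gradient descent.
n = len(dataset)
sx = sum(x for x, _ in dataset)
sy = sum(y for _, y in dataset)
sxx = sum(x * x for x, _ in dataset)
sxy = sum(x * y for x, y in dataset)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

def student(x):
    """The distilled student: mimics the teacher from recorded pairs."""
    return slope * x + intercept
```

The key point is that the student never sees the teacher's weights, only its input/output behavior, which is why distillation works even through an API.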

    2. DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you both need to load the model into memory and also load the entire context window. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.

      Multi-head Latent Attention

      Compresses the key-value store of tokens, which decreases memory usage during inference.
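The compression idea can be made concrete with a small sketch: instead of caching a full key and value vector per token, cache one small latent vector per token and up-project it to keys and values at attention time. The matrices below are random stand-ins, not trained weights, and the dimensions are arbitrary assumptions chosen to show the memory ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq_len = 64, 8, 100   # d_latent << d_model

hidden = rng.standard_normal((seq_len, d_model))

# Standard KV cache: one key and one value vector per token.
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))
full_cache = (hidden @ W_k, hidden @ W_v)   # 2 * seq_len * d_model floats

# Latent cache: store only a compressed vector per token, and
# reconstruct keys/values on the fly when attention is computed.
W_down = rng.standard_normal((d_model, d_latent))
W_up_k = rng.standard_normal((d_latent, d_model))
W_up_v = rng.standard_normal((d_latent, d_model))
latent_cache = hidden @ W_down              # seq_len * d_latent floats
k = latent_cache @ W_up_k                   # reconstructed keys
v = latent_cache @ W_up_v                   # reconstructed values

full_floats = sum(a.size for a in full_cache)
latent_floats = latent_cache.size
ratio = full_floats // latent_floats        # cache-size reduction factor
```

With these toy dimensions the latent cache is 16x smaller than the full key-value cache, which is the sense in which the context window becomes much cheaper to hold in memory.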

    3. The “MoE” in DeepSeekMoE refers to “mixture of experts”. Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple “experts” and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each. DeepSeekMoE, as implemented in V2, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek’s approach made training more efficient as well.

      Mixture-of-Experts

      Splits an LLM into components ("experts") with specialized knowledge, then activates only the experts required to address a prompt.
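The routing mechanism can be sketched as top-k gating: a router scores every expert for a given token, and only the highest-scoring few are run. Everything here is a random stand-in (untrained experts and router, arbitrary sizes); the point is only that most experts stay inactive on any given input.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2   # activate only 2 of 8 experts per token

# Each "expert" is a stand-in weight matrix; the router scores experts.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    scores = x @ router                      # one score per expert
    chosen = np.argsort(scores)[-top_k:]     # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                 # softmax over the chosen only
    # Only the chosen experts execute; the other 6 do no work at all.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

x = rng.standard_normal(d)
y = moe_forward(x)
```

DeepSeekMoE's refinements (finer-grained specialized experts plus shared generalist experts, and better load-balancing during training) sit on top of this basic pattern; they are not shown here.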

  2. May 2024
  3. Jul 2023
    1. AI-generated content may also feed future generative models, creating a self-referential aesthetic flywheel that could perpetuate AI-driven cultural norms. This flywheel may in turn reinforce generative AI’s aesthetics, as well as the biases these models exhibit.

      AI bias becomes self-reinforcing

      Does this point to a need for more diversity in AI companies? Different aesthetic/training choices lead to opportunities for more diverse output. To say nothing of the need to identify AI-generated output and keep it out of the training data of subsequent models.