5 Matching Annotations
- Last 7 days
-
stratechery.com
-
Distillation is a means of extracting understanding from another model; you can send inputs to the teacher model and record the outputs, and use that to train the student model. This is how you get models like GPT-4 Turbo from GPT-4. Distillation is easier for a company to do on its own models, because they have full access, but you can still do distillation in a somewhat more unwieldy way via API, or even, if you get creative, via chat clients.
Distillation
Using the outputs of a "teacher model" to train a "student model".
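A minimal sketch of what API-based distillation looks like in practice. The function names and trainer hook here are hypothetical stand-ins, not any specific vendor's API: the point is that the teacher is observed only through its outputs, and those input/output pairs become supervised training data for the student.
```python
def query_teacher(prompt: str) -> str:
    """Stand-in for an API call to the large teacher model."""
    raise NotImplementedError("replace with a real API call")

def build_distillation_set(prompts: list[str]) -> list[tuple[str, str]]:
    # Record (input, teacher output) pairs -- the "recorded outputs"
    # described above become the student's training targets.
    return [(p, query_teacher(p)) for p in prompts]

def train_student(student, dataset: list[tuple[str, str]]) -> None:
    # Ordinary supervised fine-tuning: the student learns to imitate
    # the teacher's outputs rather than the original training corpus.
    # `fit_step` is a hypothetical trainer hook, not a real library call.
    for prompt, target in dataset:
        student.fit_step(prompt, target)
```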
-
DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you both need to load the model into memory and also load the entire context window. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.
Multi-head Latent Attention
Compress the key-value store of tokens, decreasing memory usage during inference.
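A toy numerical sketch of the memory idea, assuming generic dimensions and projection names (not DeepSeek's actual architecture): instead of caching a full key and value per head for every token, only a small shared latent vector is cached, and per-head keys and values are re-expanded when attention is computed.
```python
import numpy as np

# Illustrative sizes only; real models differ.
n_heads, head_dim, d_model, d_latent = 16, 64, 1024, 128

# Learned projections (random here): compress to a latent, then expand
# back to per-head keys and values at attention time.
W_down = np.random.randn(d_model, d_latent) * 0.02
W_up_k = np.random.randn(d_latent, n_heads * head_dim) * 0.02
W_up_v = np.random.randn(d_latent, n_heads * head_dim) * 0.02

def cache_token(hidden: np.ndarray) -> np.ndarray:
    # Standard attention caches both K and V per token:
    #   2 * n_heads * head_dim = 2048 floats.
    # This scheme caches only the shared latent: d_latent = 128 floats.
    return hidden @ W_down

def expand_for_attention(latents: np.ndarray):
    # Reconstruct per-head keys and values from the cached latents.
    k = (latents @ W_up_k).reshape(-1, n_heads, head_dim)
    v = (latents @ W_up_v).reshape(-1, n_heads, head_dim)
    return k, v
```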
-
The “MoE” in DeepSeekMoE refers to “mixture of experts”. Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple “experts” and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each. DeepSeekMoE, as implemented in V2, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek’s approach made training more efficient as well.
Mixture-of-Experts
Split an LLM into components with specialized knowledge, then activate only the modules required to address a prompt.
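A toy sketch of top-k expert routing, using generic sizes and a plain softmax router; this illustrates the general MoE idea, not DeepSeek's fine-grained experts or load-balancing scheme.
```python
import numpy as np

# Illustrative sizes only.
n_experts, top_k, d_model = 8, 2, 512

experts = [np.random.randn(d_model, d_model) * 0.02 for _ in range(n_experts)]
router = np.random.randn(d_model, n_experts) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x of shape (d_model,)."""
    # The router scores every expert, but only the top-k are actually
    # run, so most of the layer's parameters stay idle for this token.
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))
```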
-
- May 2024
-
media.dltj.org
-
why training artificial intelligence in a research context is and should continue to be a fair use
Examination of AI training relative to the four factors of fair use
-
- Jul 2023
-
arxiv.org
-
AI-generated content may also feed future generative models, creating a self-referential aesthetic flywheel that could perpetuate AI-driven cultural norms. This flywheel may in turn reinforce generative AI's aesthetics, as well as the biases these models exhibit.
AI bias becomes self-reinforcing
Does this point to a need for more diversity in AI companies? Different aesthetic/training choices lead to opportunities for more diverse output. To say nothing of identifying and segregating AI-generated output from being used in the training data of subsequent models.