5 Matching Annotations
  1. Last 7 days
    1. Distillation is a means of extracting understanding from another model; you can send inputs to the teacher model and record the outputs, and use that to train the student model. This is how you get models like GPT-4 Turbo from GPT-4. Distillation is easier for a company to do on its own models, because they have full access, but you can still do distillation in a somewhat more unwieldy way via API, or even, if you get creative, via chat clients.

      Distillation

      Using the outputs of a "teacher model" to train a "student model".
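The teacher/student loop described above can be sketched in a few lines. This is a toy illustration only: the `teacher` here is a stand-in function (in practice it would be API calls to a large model), and the "student" is a one-variable least-squares fit rather than a smaller neural network.

```python
def teacher(x):
    """Stand-in for the teacher model: maps an input to an output."""
    return 2.0 * x + 1.0

# Step 1: send inputs to the teacher and record its outputs.
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
dataset = [(x, teacher(x)) for x in inputs]

# Step 2: train the student on the recorded (input, output) pairs.
# Here training is ordinary least squares on a linear model; a real
# student would be a smaller model trained by gradient descent.
n = len(dataset)
sx = sum(x for x, _ in dataset)
sy = sum(y for _, y in dataset)
sxx = sum(x * x for x, _ in dataset)
sxy = sum(x * y for x, y in dataset)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

def student(x):
    """The distilled student: mimics the teacher from recorded pairs."""
    return slope * x + intercept
```

The key point is that the student never sees the teacher's weights, only its input/output behavior, which is why distillation works even through an API.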

    2. DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you both need to load the model into memory and also load the entire context window. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.

      Multi-head Latent Attention

      Compresses the key-value store of tokens, which decreases memory usage during inference.
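The compression idea can be made concrete with a small sketch: instead of caching a full key and value vector per token, cache one small latent vector per token and up-project it to keys and values at attention time. The matrices below are random stand-ins, not trained weights, and the dimensions are arbitrary assumptions chosen to show the memory ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq_len = 64, 8, 100   # d_latent << d_model

hidden = rng.standard_normal((seq_len, d_model))

# Standard KV cache: one key and one value vector per token.
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))
full_cache = (hidden @ W_k, hidden @ W_v)   # 2 * seq_len * d_model floats

# Latent cache: store only a compressed vector per token, and
# reconstruct keys/values on the fly when attention is computed.
W_down = rng.standard_normal((d_model, d_latent))
W_up_k = rng.standard_normal((d_latent, d_model))
W_up_v = rng.standard_normal((d_latent, d_model))
latent_cache = hidden @ W_down              # seq_len * d_latent floats
k = latent_cache @ W_up_k                   # reconstructed keys
v = latent_cache @ W_up_v                   # reconstructed values

full_floats = sum(a.size for a in full_cache)
latent_floats = latent_cache.size
ratio = full_floats // latent_floats        # cache-size reduction factor
```

With these toy dimensions the latent cache is 16x smaller than the full key-value cache, which is the sense in which the context window becomes much cheaper to hold in memory.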

    3. The “MoE” in DeepSeekMoE refers to “mixture of experts”. Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple “experts” and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each. DeepSeekMoE, as implemented in V2, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek’s approach made training more efficient as well.

      Mixture-of-Experts

      Splits an LLM into components ("experts") with specialized knowledge, then activates only the experts required to address a prompt.
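The routing mechanism can be sketched as top-k gating: a router scores every expert for a given token, and only the highest-scoring few are run. Everything here is a random stand-in (untrained experts and router, arbitrary sizes); the point is only that most experts stay inactive on any given input.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2   # activate only 2 of 8 experts per token

# Each "expert" is a stand-in weight matrix; the router scores experts.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    scores = x @ router                      # one score per expert
    chosen = np.argsort(scores)[-top_k:]     # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                 # softmax over the chosen only
    # Only the chosen experts execute; the other 6 do no work at all.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

x = rng.standard_normal(d)
y = moe_forward(x)
```

DeepSeekMoE's refinements (finer-grained specialized experts plus shared generalist experts, and better load-balancing during training) sit on top of this basic pattern; they are not shown here.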

  2. May 2024
  3. Jul 2023
    1. AI-generated content may also feed future generative models, creating a self-referential aesthetic flywheel that could perpetuate AI-driven cultural norms. This flywheel may in turn reinforce generative AI’s aesthetics, as well as the biases these models exhibit.

      AI bias becomes self-reinforcing

      Does this point to a need for more diversity in AI companies? Different aesthetic/training choices lead to opportunities for more diverse output. To say nothing of the need to identify AI-generated output and keep it out of the training data of subsequent models.