Overly lossy tokenization risks losing valuable signals, while too granular a sequence can exceed practical limits on processing time and memory.
This is a really important step
YaRN
A quick overview is available here: https://medium.com/@rcrajatchawla/understanding-yarn-extending-context-window-of-llms-3f21e3522465
we first linearly increase it from 0 to 2.2 × 10⁻⁴ during the first 2K steps. Then, we keep a constant learning rate of 2.2 × 10⁻⁴ until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to 2.2 × 10⁻⁵ in 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of 2.2 × 10⁻⁵ in the first 333B tokens, and switch to another constant learning rate of 7.3 × 10⁻⁶ in the remaining 167B tokens.
Whether this setup was arrived at through experiments and experience, or through some other insight, is not clear
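The quoted schedule is piecewise and easy to sketch. One caveat: the excerpt gives the warmup span only in steps (2K), so the token count used for warmup below is my assumption, exposed as a parameter:

```python
import math

PEAK_LR = 2.2e-4   # constant phase
DECAY_LR = 2.2e-5  # after cosine decay
LAST_LR = 7.3e-6   # final 167B tokens
T = 1e12           # one trillion tokens


def learning_rate(tokens: float, warmup_tokens: float = 10e9) -> float:
    """LR as a function of tokens consumed, per the quoted passage.

    `warmup_tokens` is an assumed stand-in for the 2K warmup steps,
    whose exact token count the excerpt does not give.
    """
    if tokens < warmup_tokens:                     # linear warmup from 0
        return PEAK_LR * tokens / warmup_tokens
    if tokens < 10 * T:                            # constant until 10T tokens
        return PEAK_LR
    if tokens < 14.3 * T:                          # cosine decay over 4.3T tokens
        progress = (tokens - 10 * T) / (4.3 * T)
        return DECAY_LR + 0.5 * (PEAK_LR - DECAY_LR) * (1 + math.cos(math.pi * progress))
    if tokens < 14.3 * T + 333e9:                  # first 333B of the last 500B
        return DECAY_LR
    return LAST_LR                                 # remaining 167B tokens
```

The phases sum to 10 + 4.3 + 0.5 = 14.8T tokens, which matches the reported total training budget.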
The KV compression dimension d_c is set to 512,
interestingly, this brings it back down to the original levels
Transformer layers to 61 and the hidden dimension to 7168
talk about scale!!!
Fill-in-Middle (FIM) strategy does not compromise the next-token prediction capability while enabling the model to accurately predict middle text based on contextual cues
This is an interesting pattern, more details in this article: https://medium.com/@SymeCloud/what-is-fim-and-why-does-it-matter-in-llm-based-ai-53f33385585b
The original paper by OpenAI is here: https://arxiv.org/pdf/2207.14255.pdf
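The core of FIM is a data transformation: cut a document into prefix/middle/suffix, then reorder it so the middle comes last and can be predicted left-to-right. A minimal sketch, where the sentinel token names are illustrative placeholders (not the model's actual vocabulary), applied at a low rate so plain next-token data still dominates:

```python
import random


def to_fim(document: str, rate: float = 0.1, rng=None) -> str:
    """Rewrite a training document into prefix-suffix-middle (PSM) order.

    With probability 1 - rate the document is left untouched. The
    <FIM_*> sentinels are placeholder names for illustration only.
    """
    rng = rng or random.Random()
    if rng.random() > rate or len(document) < 3:
        return document
    # pick two cut points splitting the text into prefix / middle / suffix
    i, j = sorted(rng.sample(range(1, len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"<FIM_PREFIX>{prefix}<FIM_SUFFIX>{suffix}<FIM_MIDDLE>{middle}"
```

Because the reordered sample is still trained with plain next-token prediction, the model learns to fill the middle conditioned on both sides without any architectural change.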
corresponding expert is overloaded,
IMO this should also be a hyper-parameter of sorts?
The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score
nice trick
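The trick, as I read it: the load-balancing bias shifts which experts get picked, but never leaks into the output weighting. A small numpy sketch of that reading (my interpretation, not the paper's code):

```python
import numpy as np


def route(affinity: np.ndarray, bias: np.ndarray, k: int):
    """Select k experts using biased scores; gate with the original scores.

    `affinity` are the token-to-expert affinity scores, `bias` is a
    per-expert load-balancing term adjusted during training.
    """
    # top-k selection uses the biased scores (affects WHICH experts fire)
    chosen = np.argsort(affinity + bias)[-k:]
    # gating values come from the un-biased affinities, normalized over
    # the chosen set, so the bias never scales the FFN outputs
    gates = affinity[chosen] / affinity[chosen].sum()
    return chosen, gates
```

This way, overloaded experts can be steered away from without distorting the magnitude of the combined expert outputs.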
Next, we conduct a two-stage context length extension for DeepSeek-V3
this is a nice reference overview of context extension techniques: https://medium.com/thedeephub/overview-e3dd94bc74c4
In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize
This whole paragraph clearly outlines how the LLM landscape is not just about data science but requires hardcore engineering as well
diverse and high-quality tokens,
high-quality datasets seem to be more and more of a focus point
we update them at day-level.
this is similar to our setup
In addition, we also performed a live experiment with real serving traffic to further demonstrate the importance of online training in real-world application.
This really raises the bar. Kudos!
Models with collisionless HashTable consistently outperforms those with collision.
cool
The training PS periodically synchronizes its parameters to the serving PS, which will take effect on the user side immediately.
This is interesting as this refers to a setup where pods technically do not need to be restarted once an updated model is available
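The no-restart update could look roughly like this: the training parameter server periodically pushes only the rows it has touched into the serving side's live table. A toy in-memory sketch; all names and structure here are illustrative, not the paper's implementation:

```python
class ParameterServer:
    """Toy in-memory parameter server: a table mapping ID -> parameters."""

    def __init__(self):
        self.table = {}


def sync_to_serving(training_ps, serving_ps, touched_ids):
    """Push only rows updated since the last sync.

    The serving process keeps reading from its own table, so fresh
    parameters take effect immediately, without any restart.
    """
    for key in touched_ids:
        serving_ps.table[key] = training_ps.table[key]
```

Syncing only the touched rows is what makes minute- or day-level updates cheap for sparse embedding tables, where most rows are untouched between syncs.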
Slightly different from other common deep learning tasks, we only train our dataset for one pass.
Why? Is it because the dataset is pretty large and one pass is enough?
chances of hash key collision increases
I guess we are a bit away from this happening, but it is definitely a possibility
probabilistic filter which helps further reduce memory usage
not sure how this would help?
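My guess at how it helps: instead of keeping one exact counter per ID, a fixed-size counting filter (count-min style) tracks approximate occurrence counts, and an ID only gets an embedding row once its estimated count crosses a threshold. Rare IDs are thus filtered out with constant memory. A sketch under that assumption, with illustrative sizes:

```python
import hashlib


class CountingBloomFilter:
    """Approximate occurrence filter with a fixed-size counter array.

    An ID is "admitted" (would get an embedding row) only once it has
    been observed at least `threshold` times. Counts are upper-bound
    estimates, so collisions can only admit an ID early, never late.
    """

    def __init__(self, size=4096, hashes=3, threshold=5):
        self.counters = [0] * size
        self.size = size
        self.hashes = hashes
        self.threshold = threshold

    def _slots(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def observe(self, key: str) -> bool:
        """Record one occurrence; return True once the ID should be admitted."""
        slots = list(self._slots(key))
        for s in slots:
            self.counters[s] += 1
        # the minimum slot value is the tightest count estimate
        return min(self.counters[s] for s in slots) >= self.threshold
```

The memory saving comes from the counter array being fixed-size regardless of how many distinct (mostly long-tail) IDs stream past.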
An important observation is that IDs are long-tail distributed, where popular IDs may occur millions of times while the unpopular ones appear no more than ten times. Embeddings corresponding to these infrequent IDs are underfit due to lack of training data and the model will not be able to make a good estimation based on them.
this is a valid insight and could be a potential optimization strategy on its own
Cuckoo Hashing achieves worst-case O(1) time complexity for lookups and deletions, and an expected amortized O(1) time for insertions.
Maybe we should explore this separately for our online feature store as well, and not just for embeddings
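For reference, a toy two-table cuckoo map: every key has exactly one candidate slot per table, so a lookup probes at most two slots (the worst-case O(1)); inserts may evict residents and re-place them, which is where the amortized O(1) comes from. Integer keys only, and the hash functions are deliberately simple illustrations:

```python
class CuckooHashTable:
    """Minimal two-table cuckoo hash map (integer keys, non-None values)."""

    def __init__(self, size=16):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slot(self, which: int, key: int) -> int:
        # two cheap, distinct hash functions, for illustration only
        if which == 0:
            return key % self.size
        return (key * 2654435761 >> 4) % self.size

    def get(self, key):
        # at most two probes, hence worst-case O(1) lookup
        for t in range(2):
            entry = self.tables[t][self._slot(t, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value, max_kicks=32):
        if self.get(key) is not None:  # overwrite an existing key in place
            for t in range(2):
                slot = self._slot(t, key)
                entry = self.tables[t][slot]
                if entry is not None and entry[0] == key:
                    self.tables[t][slot] = (key, value)
                    return
        item, t = (key, value), 0
        for _ in range(max_kicks):
            slot = self._slot(t, item[0])
            if self.tables[t][slot] is None:
                self.tables[t][slot] = item
                return
            # evict the resident and try to place it in the other table
            item, self.tables[t][slot] = self.tables[t][slot], item
            t = 1 - t
        raise RuntimeError("eviction cycle; a real implementation would rehash")
```

A production version would grow/rehash instead of raising on eviction cycles, but the two-probe lookup is the property that matters for serving-side feature stores.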
Intuitively, information from a more recent history can more effectively contribute to predicting the change in a user's behavior. To mitigate the effect of Concept Drift, serving models need to be updated from new user feedback as close to real-time as possible to reflect the latest interest of a user.
this is the case for us as well, although not at real-time basis
The features are mostly sparse, categorical and dynamically changing; (2) The underlying distribution of training data is non-stationary, a.k.a. Concept Drift
Point 1 is very relevant for our case of food delivery as well, but concept drift may not be that apparent due to the high dependency on past orders (is it?)
For example, the model trained on symmetric tasks underperformed on the STS task compared to the model trained with a mixture of all data. Additionally, the performance of the merged model deteriorated significantly, rendering it unusable. This may be attributed to a substantial gap or conflict between the two types of tasks, making direct parameter averaging infeasible [Yadav et al., 2023].
I am not sure I understood the utility of this. Need to read through the cited work
whereas the representation of long texts is better managed through the use of multiple vectors
This is another important insight for use-cases that require checking queries against large documents, hence chunking
Rotary Position Embedding (RoPE)
Rotary embeddings have a particular positive impact when it comes to longer contexts
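The reason for the long-context benefit: RoPE rotates consecutive channel pairs by an angle proportional to position, so the query-key dot product depends only on the relative offset between tokens. A minimal, unfused sketch:

```python
import numpy as np


def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to `x` of shape (seq, dim).

    Each consecutive pair of channels is rotated by position * freq,
    with per-pair frequencies decaying geometrically (base 10000).
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / dim)   # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # even / odd channels
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because rotations preserve norms and compose by angle addition, q·k after RoPE is a function of the position *difference* only, which is exactly the property context-extension methods like YaRN manipulate by rescaling the frequencies.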
whereas it positively influences retrieval and reranking tasks
this points to task-specific insights. Recommendation and search-based tasks can benefit from this
Ranking consistency filtering was particularly effective for our previous models, but its impact was relatively minor on this latest model, possibly due to the more comprehensive data coverage in this version
Could this mean that when we have models fine-tuned with less diverse datasets, the impact would be larger? Something to be aware of, and a potentially useful concept to exploit
The top-k threshold for ranking consistency filtering is set to 50.
this is quite a high value for k but good to know
The maximum input text length is capped at 512 tokens
this could be a limiting factor for a number of scenarios
Consequently, when deploying the model in practical applications, it is advisable to tailortask instructions to specific scenarios and requirements.
this is an important insight.
we prepended instruction prefixes to the queries of open-source data to differentiate between various tasks, employing a similar setup during testing.
this actually confirms the similarity to soft-prompting
Task Instruction
IMHO this is very similar to soft-prompting and the advantages also seem aligned
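The mechanic itself is just string templating before embedding; the template below is illustrative, not the paper's exact format:

```python
def with_instruction(task_instruction: str, query: str) -> str:
    """Prepend a task instruction to a query before embedding it.

    Matching the same instruction at training and test time is what
    lets one embedding model serve several tasks, soft-prompt style.
    """
    return f"Instruct: {task_instruction}\nQuery: {query}"
```

Usage: `with_instruction("Given a web search query, retrieve relevant passages", user_query)` at both fine-tuning and inference time, with a different instruction per task.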
Ranking Consistency Filtering
we can leverage this for other domains as well
Despite our fine-tuning data being primarily in Chinese and English, with only a small portion of multilingual data, the performance of our model in other languages remains satisfactory, showing that the multi-lingual advantage of pre-trained LLMs can carry over to embedding models
A very powerful insight
represents that our models achieve state-of-the-art multilingual performance under 0.5B model size
this is an amazing feat and something that would be useful in a number of scenarios
We collect over 20 categories of data for pre-training and 70 categories of data for fine-tuning.
The categories for pre-training are smaller in number but probably broader in scope than the fine-tuning categories