Overly lossy tokenization risks losing valuable signals, while too granular a sequence can exceed practical limits on processing time and memory.
This is a really important step
YaRN
A quick overview is available here: https://medium.com/@rcrajatchawla/understanding-yarn-extending-context-window-of-llms-3f21e3522465
we first linearly increase it from 0 to 2.2 × 10⁻⁴ during the first 2K steps. Then, we keep a constant learning rate of 2.2 × 10⁻⁴ until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to 2.2 × 10⁻⁵ in 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of 2.2 × 10⁻⁵ in the first 333B tokens, and switch to another constant learning rate of 7.3 × 10⁻⁶ in the remaining 167B tokens.
Whether this setup was arrived at through experiments and experience, or through some other insight, is not clear
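The quoted schedule is piecewise and easy to sketch. One caveat: the excerpt gives the warmup span only in steps (2K), so the token count used for warmup below is my assumption, exposed as a parameter:

```python
import math

PEAK_LR = 2.2e-4   # constant phase
DECAY_LR = 2.2e-5  # after cosine decay
LAST_LR = 7.3e-6   # final 167B tokens
T = 1e12           # one trillion tokens


def learning_rate(tokens: float, warmup_tokens: float = 10e9) -> float:
    """LR as a function of tokens consumed, per the quoted passage.

    `warmup_tokens` is an assumed stand-in for the 2K warmup steps,
    whose exact token count the excerpt does not give.
    """
    if tokens < warmup_tokens:                     # linear warmup from 0
        return PEAK_LR * tokens / warmup_tokens
    if tokens < 10 * T:                            # constant until 10T tokens
        return PEAK_LR
    if tokens < 14.3 * T:                          # cosine decay over 4.3T tokens
        progress = (tokens - 10 * T) / (4.3 * T)
        return DECAY_LR + 0.5 * (PEAK_LR - DECAY_LR) * (1 + math.cos(math.pi * progress))
    if tokens < 14.3 * T + 333e9:                  # first 333B of the last 500B
        return DECAY_LR
    return LAST_LR                                 # remaining 167B tokens
```

The phases sum to 10 + 4.3 + 0.5 = 14.8T tokens, which matches the reported total training budget.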
The KV compression dimension d_c is set to 512,
interestingly, this brings it back down to the original levels
Transformer layers to 61 and the hidden dimension to 7168
talk about scale!!!
Fill-in-Middle (FIM) strategy does not compromise the next-token prediction capability while enabling the model to accurately predict middle text based on contextual cues
This is an interesting pattern, more details in this article: https://medium.com/@SymeCloud/what-is-fim-and-why-does-it-matter-in-llm-based-ai-53f33385585b
The original paper by OpenAI is here: https://arxiv.org/pdf/2207.14255.pdf
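The core of FIM is a data transformation: cut a document into prefix/middle/suffix, then reorder it so the middle comes last and can be predicted left-to-right. A minimal sketch, where the sentinel token names are illustrative placeholders (not the model's actual vocabulary), applied at a low rate so plain next-token data still dominates:

```python
import random


def to_fim(document: str, rate: float = 0.1, rng=None) -> str:
    """Rewrite a training document into prefix-suffix-middle (PSM) order.

    With probability 1 - rate the document is left untouched. The
    <FIM_*> sentinels are placeholder names for illustration only.
    """
    rng = rng or random.Random()
    if rng.random() > rate or len(document) < 3:
        return document
    # pick two cut points splitting the text into prefix / middle / suffix
    i, j = sorted(rng.sample(range(1, len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"<FIM_PREFIX>{prefix}<FIM_SUFFIX>{suffix}<FIM_MIDDLE>{middle}"
```

Because the reordered sample is still trained with plain next-token prediction, the model learns to fill the middle conditioned on both sides without any architectural change.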
corresponding expert is overloaded,
IMO this should also be a hyper-parameter of sorts?
The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score
nice trick
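The trick, as I read it: the load-balancing bias shifts which experts get picked, but never leaks into the output weighting. A small numpy sketch of that reading (my interpretation, not the paper's code):

```python
import numpy as np


def route(affinity: np.ndarray, bias: np.ndarray, k: int):
    """Select k experts using biased scores; gate with the original scores.

    `affinity` are the token-to-expert affinity scores, `bias` is a
    per-expert load-balancing term adjusted during training.
    """
    # top-k selection uses the biased scores (affects WHICH experts fire)
    chosen = np.argsort(affinity + bias)[-k:]
    # gating values come from the un-biased affinities, normalized over
    # the chosen set, so the bias never scales the FFN outputs
    gates = affinity[chosen] / affinity[chosen].sum()
    return chosen, gates
```

This way, overloaded experts can be steered away from without distorting the magnitude of the combined expert outputs.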
Next, we conduct a two-stage context length extension for DeepSeek-V3
this is a nice reference overview of context extension techniques: https://medium.com/thedeephub/overview-e3dd94bc74c4
In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize
This whole paragraph clearly outlines how the LLM landscape is not just about data science but requires hardcore engineering as well
diverse and high-quality tokens,
high-quality datasets seem to be more and more of a focus point
we update them at day-level.
this is similar to our setup
In addition, we also performed a live experiment with real serving traffic to further demonstrate the importance of online training in real-world application.
This really raises the bar. Kudos!
Models with collisionless HashTable consistently outperforms those with collision.
cool
The training PS periodically synchronizes its parameters to the serving PS, which will take effect on the user side immediately.
This is interesting as this refers to a setup where pods technically do not need to be restarted once an updated model is available
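The no-restart update could look roughly like this: the training parameter server periodically pushes only the rows it has touched into the serving side's live table. A toy in-memory sketch; all names and structure here are illustrative, not the paper's implementation:

```python
class ParameterServer:
    """Toy in-memory parameter server: a table mapping ID -> parameters."""

    def __init__(self):
        self.table = {}


def sync_to_serving(training_ps, serving_ps, touched_ids):
    """Push only rows updated since the last sync.

    The serving process keeps reading from its own table, so fresh
    parameters take effect immediately, without any restart.
    """
    for key in touched_ids:
        serving_ps.table[key] = training_ps.table[key]
```

Syncing only the touched rows is what makes minute- or day-level updates cheap for sparse embedding tables, where most rows are untouched between syncs.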
Slightly different from other common deep learning tasks, we only train our dataset for one pass.
Why? Is it because the dataset is pretty large and one pass is enough?
chances of hash key collision increases
I guess we are a bit away from this happening, but it is definitely a possibility
probabilistic filter which helps further reduce memory usage
not sure how this would help?
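My guess at how it helps: instead of keeping one exact counter per ID, a fixed-size counting filter (count-min style) tracks approximate occurrence counts, and an ID only gets an embedding row once its estimated count crosses a threshold. Rare IDs are thus filtered out with constant memory. A sketch under that assumption, with illustrative sizes:

```python
import hashlib


class CountingBloomFilter:
    """Approximate occurrence filter with a fixed-size counter array.

    An ID is "admitted" (would get an embedding row) only once it has
    been observed at least `threshold` times. Counts are upper-bound
    estimates, so collisions can only admit an ID early, never late.
    """

    def __init__(self, size=4096, hashes=3, threshold=5):
        self.counters = [0] * size
        self.size = size
        self.hashes = hashes
        self.threshold = threshold

    def _slots(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def observe(self, key: str) -> bool:
        """Record one occurrence; return True once the ID should be admitted."""
        slots = list(self._slots(key))
        for s in slots:
            self.counters[s] += 1
        # the minimum slot value is the tightest count estimate
        return min(self.counters[s] for s in slots) >= self.threshold
```

The memory saving comes from the counter array being fixed-size regardless of how many distinct (mostly long-tail) IDs stream past.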
An important observation is that IDs are long-tail distributed, where popular IDs may occur millions of times while the unpopular ones appear no more than ten times. Embeddings corresponding to these infrequent IDs are underfit due to lack of training data and the model will not be able to make a good estimation based on them.
this is a valid insight and could be a potential optimization strategy on its own
Cuckoo Hashing achieves worst-case O(1) time complexity for lookups and deletions, and an expected amortized O(1) time for insertions.
Maybe we should explore this separately for our online feature store as well, and not just for embeddings
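For reference, a toy two-table cuckoo map: every key has exactly one candidate slot per table, so a lookup probes at most two slots (the worst-case O(1)); inserts may evict residents and re-place them, which is where the amortized O(1) comes from. Integer keys only, and the hash functions are deliberately simple illustrations:

```python
class CuckooHashTable:
    """Minimal two-table cuckoo hash map (integer keys, non-None values)."""

    def __init__(self, size=16):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slot(self, which: int, key: int) -> int:
        # two cheap, distinct hash functions, for illustration only
        if which == 0:
            return key % self.size
        return (key * 2654435761 >> 4) % self.size

    def get(self, key):
        # at most two probes, hence worst-case O(1) lookup
        for t in range(2):
            entry = self.tables[t][self._slot(t, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value, max_kicks=32):
        if self.get(key) is not None:  # overwrite an existing key in place
            for t in range(2):
                slot = self._slot(t, key)
                entry = self.tables[t][slot]
                if entry is not None and entry[0] == key:
                    self.tables[t][slot] = (key, value)
                    return
        item, t = (key, value), 0
        for _ in range(max_kicks):
            slot = self._slot(t, item[0])
            if self.tables[t][slot] is None:
                self.tables[t][slot] = item
                return
            # evict the resident and try to place it in the other table
            item, self.tables[t][slot] = self.tables[t][slot], item
            t = 1 - t
        raise RuntimeError("eviction cycle; a real implementation would rehash")
```

A production version would grow/rehash instead of raising on eviction cycles, but the two-probe lookup is the property that matters for serving-side feature stores.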
Intuitively, information from a more recent history can more effectively contribute to predicting the change in a user's behavior. To mitigate the effect of Concept Drift, serving models need to be updated from new user feedback as close to real-time as possible to reflect the latest interest of a user.
this is the case for us as well, although not at real-time basis
The features are mostly sparse, categorical and dynamically changing; (2) The underlying distribution of training data is non-stationary, a.k.a. Concept Drift
Point 1 is very relevant for our case of food delivery as well, but concept drift may not be that apparent due to the high dependency on past orders (is it?)
For example, the model trained on symmetric tasks underperformed on the STS task compared to the model trained with a mixture of all data. Additionally, the performance of the merged model deteriorated significantly, rendering it unusable. This may be attributed to a substantial gap or conflict between the two types of tasks, making direct parameter averaging infeasible [Yadav et al., 2023].
I am not sure I understood the utility of this. Need to read through the cited work
whereas the representation of long texts is better managed through the use of multiple vectors
This is another important insight for use-cases that require checking queries against large documents, hence chunking
Rotary Position Embedding (RoPE)
Rotary embeddings have a particular positive impact when it comes to longer contexts
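The reason for the long-context benefit: RoPE rotates consecutive channel pairs by an angle proportional to position, so the query-key dot product depends only on the relative offset between tokens. A minimal, unfused sketch:

```python
import numpy as np


def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to `x` of shape (seq, dim).

    Each consecutive pair of channels is rotated by position * freq,
    with per-pair frequencies decaying geometrically (base 10000).
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / dim)   # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # even / odd channels
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because rotations preserve norms and compose by angle addition, q·k after RoPE is a function of the position *difference* only, which is exactly the property context-extension methods like YaRN manipulate by rescaling the frequencies.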
whereas it positively influences retrieval and reranking tasks
this points to task-specific insights. Recommendation and search-based tasks can benefit from this
Ranking consistency filtering was particularly effective for our previous models, but its impact was relatively minor on this latest model, possibly due to the more comprehensive data coverage in this version
Could this mean that when we have models fine-tuned with less diverse datasets, the impact would be larger? Something to be aware of, and a potentially useful concept to exploit
The top-k threshold for ranking consistency filtering is set to 50.
this is quite a high value for k but good to know
The maximum input text length is capped at 512 tokens
this could be a limiting factor for a number of scenarios
Consequently, when deploying the model in practical applications, it is advisable to tailortask instructions to specific scenarios and requirements.
this is an important insight.
we prepended instruction prefixes to the queries of open-source data to differentiate between various tasks, employing a similar setup during testing.
this actually confirms the similarity to soft-prompting
Task Instruction
IMHO this is very similar to soft-prompting and the advantages also seem aligned
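The mechanic itself is just string templating before embedding; the template below is illustrative, not the paper's exact format:

```python
def with_instruction(task_instruction: str, query: str) -> str:
    """Prepend a task instruction to a query before embedding it.

    Matching the same instruction at training and test time is what
    lets one embedding model serve several tasks, soft-prompt style.
    """
    return f"Instruct: {task_instruction}\nQuery: {query}"
```

Usage: `with_instruction("Given a web search query, retrieve relevant passages", user_query)` at both fine-tuning and inference time, with a different instruction per task.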
Ranking Consistency Filtering
we can leverage this for other domains as well
Despite our fine-tuning data being primarily in Chinese and English, with only a small portion of multilingual data, the performance of our model in other languages remains satisfactory, showing that the multi-lingual advantage of pre-trained LLMs can carry over to embedding models
A very powerful insight
represents that our models achieve state-of-the-art multilingual performance under 0.5B model size
this is an amazing feat and something that would be useful in a number of scenarios
We collect over 20 categories of data for pre-training and 70 categories of data for fine-tuning.
The categories for pre-training are smaller in number but probably broader in scope than the fine-tuning categories