36 Matching Annotations
  1. Apr 2025
    1. Overly lossy tokenization risks losing valuable signals, while too granular a sequence can exceed practical limits on processing time and memory.

      This is a really important step

  2. Jan 2025
    1. we first linearly increase it from 0 to 2.2 × 10⁻⁴ during the first 2K steps. Then, we keep a constant learning rate of 2.2 × 10⁻⁴ until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to 2.2 × 10⁻⁵ in 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of 2.2 × 10⁻⁵ in the first 333B tokens, and switch to another constant learning rate of 7.3 × 10⁻⁶ in the remaining 167B tokens.

      Whether this setup was arrived at through experiments and experience or through some other insight is not clear
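      The quoted schedule can be sketched as a piecewise function of tokens consumed. A minimal sketch, assuming tokens are measured in trillions and omitting the step-based 2K-step warmup (which doesn't map cleanly onto a token axis); the rates and phase boundaries are taken directly from the quote:

      ```python
      import math

      def learning_rate(tokens_consumed: float) -> float:
          """LR as a function of tokens consumed, in trillions (warmup omitted)."""
          PEAK, LOW, FINAL = 2.2e-4, 2.2e-5, 7.3e-6
          if tokens_consumed <= 10.0:            # constant phase up to 10T tokens
              return PEAK
          if tokens_consumed <= 14.3:            # cosine decay over 4.3T tokens
              progress = (tokens_consumed - 10.0) / 4.3
              return LOW + 0.5 * (PEAK - LOW) * (1 + math.cos(math.pi * progress))
          if tokens_consumed <= 14.633:          # first 333B of the final 500B
              return LOW
          return FINAL                           # remaining 167B tokens
      ```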

    2. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize

      This whole paragraph clearly shows that the LLM landscape is not just about data science but also requires hardcore engineering

    1. In addition, we also performed a live experiment with real serving traffic to further demonstrate the importance of online training in real-world application.

      This really raises the bar. Kudos

    2. The training PS periodically synchronizes its parameters to the serving PS, which will take effect on the user side immediately.

      This is interesting: it implies a setup where serving pods do not need to be restarted once an updated model is available
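      The training-PS-to-serving-PS handoff can be sketched as an atomic parameter swap on the serving side. A minimal sketch under assumed names (`ServingPS`, `TrainingPS`, `sync_once` are illustrative, not the paper's API); the point is that serving requests pick up the new weights without any restart:

      ```python
      import threading

      class ServingPS:
          """Serving-side parameter store; swaps in new weights atomically."""
          def __init__(self):
              self._params = {}
              self._lock = threading.Lock()

          def swap_in(self, new_params: dict) -> None:
              # Requests arriving after this call see the new model;
              # no pod restart is required.
              with self._lock:
                  self._params = new_params

          def lookup(self, key: str):
              with self._lock:
                  return self._params.get(key)

      class TrainingPS:
          """Training-side store that periodically pushes snapshots."""
          def __init__(self, serving: ServingPS):
              self.params = {}
              self.serving = serving

          def sync_once(self) -> None:
              # Push a snapshot of the freshest weights to the serving side.
              self.serving.swap_in(dict(self.params))
      ```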

    3. Slightly different from other common deep learning tasks, we only train our dataset for one pass.

      Why? Is it because the dataset is large enough that one pass suffices?

    4. An important observation is that IDs are long-tail distributed, where popular IDs may occur millions of times while the unpopular ones appear no more than ten times. Embeddings corresponding to these infrequent IDs are underfit due to lack of training data and the model will not be able to make a good estimation based on them.

      This is a valid insight and could be a potential optimization strategy in its own right
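      One way such an optimization could look is frequency-gated embedding admission. This is a sketch of the general idea, not the paper's exact mechanism: an ID only gets its own trainable embedding after it has been seen `threshold` times, and rare or unseen IDs fall back to a shared default vector. The class and method names are assumptions:

      ```python
      from collections import Counter

      class FrequencyGatedEmbeddings:
          """Admit an ID into the embedding table only once it is frequent
          enough to be trainable; rare IDs share a default vector."""
          DIM = 8  # illustrative embedding size

          def __init__(self, threshold: int = 10):
              self.threshold = threshold
              self.counts = Counter()
              self.table = {}  # id -> embedding (placeholder zero vectors here)

          def observe(self, item_id: str) -> None:
              self.counts[item_id] += 1
              if self.counts[item_id] == self.threshold:
                  # Admission: allocate a fresh embedding for this ID.
                  self.table[item_id] = [0.0] * self.DIM

          def lookup(self, item_id: str):
              # Unadmitted IDs fall back to a shared default embedding.
              return self.table.get(item_id, [0.0] * self.DIM)
      ```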

    5. Cuckoo Hashing achieves worst-case O(1) time complexity for lookups and deletions, and an expected amortized O(1) time for insertions.

      Maybe we should explore this separately for our online feature store as well, not just for embeddings
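      For reference, this is where cuckoo hashing's guarantees come from: every key lives in one of exactly two candidate slots, so lookups and deletions probe at most two places (worst-case O(1)), while insertions may evict and relocate residents (expected amortized O(1)). A minimal illustrative two-table sketch, not production code:

      ```python
      class CuckooHash:
          """Two-table cuckoo hash: each key resides at h1(key) in table 1
          or h2(key) in table 2, so get/delete probe at most two slots."""
          def __init__(self, size: int = 16, max_kicks: int = 32):
              self.size = size
              self.max_kicks = max_kicks
              self.t1 = [None] * size
              self.t2 = [None] * size

          def _h1(self, key) -> int:
              return hash(key) % self.size

          def _h2(self, key) -> int:
              return (hash(key) // self.size) % self.size

          def get(self, key):
              for table, h in ((self.t1, self._h1), (self.t2, self._h2)):
                  slot = table[h(key)]
                  if slot is not None and slot[0] == key:
                      return slot[1]
              return None

          def delete(self, key) -> None:
              for table, h in ((self.t1, self._h1), (self.t2, self._h2)):
                  i = h(key)
                  if table[i] is not None and table[i][0] == key:
                      table[i] = None
                      return

          def put(self, key, value) -> None:
              entry = (key, value)
              for _ in range(self.max_kicks):
                  for table, h in ((self.t1, self._h1), (self.t2, self._h2)):
                      i = h(entry[0])
                      if table[i] is None or table[i][0] == entry[0]:
                          table[i] = entry
                          return
                      table[i], entry = entry, table[i]  # evict the resident
              self._rehash()  # too many kicks: grow and retry
              self.put(*entry)

          def _rehash(self) -> None:
              old = [e for e in self.t1 + self.t2 if e is not None]
              self.size *= 2
              self.t1 = [None] * self.size
              self.t2 = [None] * self.size
              for k, v in old:
                  self.put(k, v)
      ```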

    6. Intuitively, information from a more recent history can more effectively contribute to predicting the change in a user's behavior. To mitigate the effect of Concept Drift, serving models need to be updated from new user feedback as close to real-time as possible to reflect the latest interest of a user.

      This is the case for us as well, although not on a real-time basis

    7. The features are mostly sparse, categorical and dynamically changing; (2) The underlying distribution of training data is non-stationary, a.k.a. Concept Drift

      Point 1 is very relevant for our food-delivery case as well, but concept drift may not be as apparent given the high dependency on past orders (is it?)

    1. For example, the model trained on symmetric tasks underperformed on the STS task compared to the model trained with a mixture of all data. Additionally, the performance of the merged model deteriorated significantly, rendering it unusable. This may be attributed to a substantial gap or conflict between the two types of tasks, making direct parameter averaging infeasible [Yadav et al., 2023].

      I am not sure I understood the utility of this. Need to read through the cited work

    2. whereas the representation of long texts is better managed through the use of multiple vectors

      This is another important insight for use cases that require checking queries against large documents, hence chunking

    3. whereas it positively influences retrieval and reranking tasks

      This points to task-specific insights. Recommendation and search tasks can benefit from this

    4. Ranking consistency filtering was particularly effective for our previous models, but its impact was relatively minor on this latest model, possibly due to the more comprehensive data coverage in this version

      Could this mean that for models fine-tuned on less diverse datasets, the impact would be larger? Something to be aware of and a potentially useful concept to exploit

    5. Consequently, when deploying the model in practical applications, it is advisable to tailor task instructions to specific scenarios and requirements.

      This is an important insight.

    6. we prepended instruction prefixes to the queries of open-source data to differentiate between various tasks, employing a similar setup during testing.

      This is effectively analogous to soft prompting
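      The mechanism is simple: a fixed task instruction is prepended to every query at both training and test time, so the embedding model can condition on the task. A minimal sketch with hypothetical task names and prefix wordings (not the paper's exact instructions):

      ```python
      # Illustrative task-to-instruction mapping; the wordings are assumptions.
      TASK_INSTRUCTIONS = {
          "retrieval": "Given a web search query, retrieve relevant passages:",
          "sts": "Measure the semantic similarity between two sentences:",
      }

      def build_query(task: str, query: str) -> str:
          """Prepend the task's instruction prefix; the same prefix must be
          used at training and test time for the conditioning to hold."""
          return f"{TASK_INSTRUCTIONS[task]} {query}"
      ```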

    7. Despite our fine-tuning data being primarily in Chinese and English, with only a small portion of multilingual data, the performance of our model in other languages remains satisfactory, showing that the multi-lingual advantage of pre-trained LLMs can carry over to embedding models

      A very powerful insight

    8. represents that our models achieve state-of-the-art multilingual performance under 0.5B model size

      This is an amazing feat and something that would be useful in a number of scenarios

    9. We collect over 20 categories of data for pre-training and 70 categories of data for fine-tuning.

      The number of categories for pre-training is smaller, but each is probably broader in scope than the fine-tuning categories