6 Matching Annotations
  1. Oct 2025
    1. parameters and train with a wider range of multimodal data for more comprehensive learning

      3-phase training: 1) First stage: train only the vision encoder and audio encoder. 2) Second stage: unfreeze all the parameters and train on the multimodal dataset. 3) Third stage: train on complex long-sequence data to improve the model's long-context understanding.
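The three-phase schedule above can be sketched as a freeze/unfreeze policy over parameter groups. This is a toy illustration with hypothetical module names (`vision_encoder`, `audio_encoder`, `llm_backbone`), not the paper's actual training code:

```python
# Minimal sketch of the 3-phase training schedule. Module names and
# group sizes are illustrative assumptions, not the real architecture.

class Param:
    def __init__(self):
        self.trainable = False

class OmniModel:
    def __init__(self):
        self.groups = {
            "vision_encoder": [Param() for _ in range(2)],
            "audio_encoder": [Param() for _ in range(2)],
            "llm_backbone": [Param() for _ in range(4)],
        }

    def set_trainable(self, names, flag=True):
        for name in names:
            for p in self.groups[name]:
                p.trainable = flag

def configure_phase(model, phase):
    # Phase 1: train only the vision and audio encoders (backbone frozen).
    # Phases 2 and 3: all parameters unfrozen; phase 3 differs only in
    # feeding long-sequence multimodal data, not in which weights train.
    model.set_trainable(model.groups.keys(), flag=False)
    if phase == 1:
        model.set_trainable(["vision_encoder", "audio_encoder"])
    else:
        model.set_trainable(model.groups.keys())
    return sorted(n for n, ps in model.groups.items() if ps[0].trainable)
```

In a real framework the same idea is usually expressed by toggling `requires_grad` on parameter groups between phases.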

    2. Second, it is essential to manage potential interference among outputs from different modalities, ensuring that the training processes for outputs such as text and voice tokens do not disrupt each other.

      Ah, so they both work in tandem rather than interfering with each other

    3. First, it is crucial to implement a systematic method for the joint training of various modalities, including text, images, videos, and audio, to foster mutual enhancement among them.

      TMRoPE (Time-aligned Multimodal RoPE) solves this
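The core idea behind TMRoPE's time alignment can be illustrated with a toy sketch: tokens from different modalities get temporal position IDs derived from their real timestamps, so an audio token and a video frame occurring at the same moment share the same temporal position. The 40 ms step and the frame timings here are illustrative assumptions, not the paper's exact values:

```python
# Toy sketch of time-aligned temporal position IDs (the "T" axis that
# a multimodal RoPE like TMRoPE rotates on). Step size is an assumption.

def temporal_ids(timestamps_ms, step_ms=40):
    """Map each token's timestamp (in ms) to a temporal position index."""
    return [int(t // step_ms) for t in timestamps_ms]

# An audio token every 40 ms and a video frame every 80 ms:
audio_pos = temporal_ids([0, 40, 80, 120])  # [0, 1, 2, 3]
video_pos = temporal_ids([0, 80])           # [0, 2]
# The video frame at 80 ms shares temporal position 2 with the audio
# token at 80 ms, which keeps the two modalities time-aligned.
```

This is only the alignment step; the actual TMRoPE also decomposes positions into temporal/height/width components for the rotary embedding itself.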