844 Matching Annotations
  1. Aug 2023
  2. huggingface.co
    1. pretraining_tp (int, optional, defaults to 1) — Experimental feature. Tensor parallelism rank used during pretraining. Please refer to this document to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to this issue.

      [!NOTE] What does a model's pretraining_tp refer to?

      flashcard

      The tensor parallelism degree (rank) used during pretraining.

    2. max_position_embeddings (int, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

      [!NOTE] What determines a Transformer model's input-length limit?

      flashcard

      The number of (absolute) position embeddings (e.g. 4096). Conversely, relative position encodings impose no hard input-length limit?

    1. On the first GPU, the prompts will be ["a dog", "a cat"], and on the second GPU it will be ["a chicken", "a chicken"]. Make sure to drop the final sample, as it will be a duplicate of the previous one.

      [!NOTE] How does 🤗 Accelerate's split_between_processes(..., apply_padding=True) align the number of samples across processes? What should you watch out for?

      flashcard

      It duplicates the last sample onto the final process; remember to drop the duplicated result from the gathered output.

    2. what if we then wanted to do something with the results of all the GPUs? (Say gather them all and perform some kind of post processing) You can pass in apply_padding=True to ensure that the lists of prompts are padded to the same length, with extra data being taken from the last sample. This way all GPUs will have the same number of prompts, and you can then gather the results.

      [!NOTE] In 🤗 Accelerate's split_between_processes(), what option aligns the number of samples across distributed processes?

      flashcard

      apply_padding=True

    3. This is only needed when trying to perform an action such as gathering the results, where the data on each device needs to be the same length. Basic inference does not require this.

      [!NOTE] When gathering data from distributed processes/devices, what requirement must the data satisfy?

      flashcard

      The data on each device must have the same length/shape (which may require padding across processes).

    4. With 🤗 Accelerate, we can simplify this process by using the Accelerator.split_between_processes() context manager (which also exists in PartialState and AcceleratorState). This function will automatically split whatever data you pass to it (be it a prompt, a set of tensors, a dictionary of the prior data, etc.) across all the processes (with a potential to be padded) for you to use right away.

      [!NOTE] In 🤗 Accelerate, what can you use to distribute data across the processes?

      flashcard

      with accelerator.split_between_processes(...): — also available on PartialState and AcceleratorState (see the sketch below).
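
      A minimal sketch of how this looks in practice, assuming a three-prompt list split over two processes; run_model is a hypothetical helper standing in for your inference call:

      ```python
      from accelerate import Accelerator

      accelerator = Accelerator()
      prompts = ["a dog", "a cat", "a chicken"]

      # Each process receives its own slice; apply_padding=True repeats the last
      # sample so every process holds the same number of items (drop the duplicate
      # after gathering).
      with accelerator.split_between_processes(prompts, apply_padding=True) as subset:
          results = [run_model(p) for p in subset]  # run_model is a placeholder
      ```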

    5. torch.distributed.get_rank()

      [!NOTE] In PyTorch distributed computing, how do you get the rank of the current process?

      flashcard

      torch.distributed.get_rank()

    1. Once you’ve finished training, make sure to run Accelerator.end_training() so that all the trackers can run their finish functionalities if they have any. accelerator.end_training()

      [!NOTE] In 🤗 Accelerate, if you use trackers, what must you do after training finishes?

      flashcard

      Call accelerator.end_training()

    2. When you are ready to log any data, Accelerator.log() should be used. A step can also be passed in to correlate the data with a particular step in the training loop. accelerator.log({"train_loss": 1.12, "valid_loss": 0.8}, step=1)

      [!NOTE] In 🤗 Accelerate, how do you use log()?

      flashcard

      Pass a dict of metrics plus a step, e.g. accelerator.log({"train_loss": 1.12, "valid_loss": 0.8}, step=1)

    3. At the start of your experiment Accelerator.init_trackers() should be used to setup your project

      [!NOTE] In 🤗 Accelerate, how do you initialize trackers?

      flashcard

      1. accelerator = Accelerator(log_with="<tracker>")
      2. accelerator.init_trackers("my_project", config=hps)
    4. potentially add any experiment hyperparameters to be logged: hps = {"num_iterations": 5, "learning_rate": 1e-2} accelerator.init_trackers("my_project", config=hps)

      [!NOTE] In 🤗 Accelerate, how do you add experiment hyperparameters to be logged?

      flashcard

      accelerator.init_trackers("my_project", config=hps)

    5. accelerator.init_trackers("my_project"

      [!NOTE] In 🤗 Accelerate, how do you set the tracker's project name?

      flashcard

      accelerator.init_trackers("my_project") (see the sketch below)

    1. For printing statements you only want executed once per machine, you can just replace the print function by accelerator.print.

      [!NOTE] In 🤗 Accelerate, what is the convenient way to print only once per machine?

      flashcard

      accelerator.print()

    2. The local means per machine: if you are running your training on two servers with several GPUs, the instruction will be executed once on each of those servers. If you need to execute something only once for all processes (and not per machine) for instance, uploading the final model to the 🤗 model hub, wrap it in a test like this: if accelerator.is_main_process:

      [!NOTE] In 🤗 Accelerate, what is the difference between is_main_process with and without local?

      flashcard

      is_local_main_process is the main process on each machine; is_main_process (without local) is the single main process across all machines (see the sketch below).
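
      A short sketch of the per-machine vs. global guards; the download helper and the model are placeholders:

      ```python
      from accelerate import Accelerator

      accelerator = Accelerator()

      accelerator.print("starting training")   # printed once per machine

      if accelerator.is_local_main_process:    # run once per machine
          download_dataset("/tmp/data")        # hypothetical helper

      if accelerator.is_main_process:          # run exactly once across all machines
          model.push_to_hub("my-model")        # assumes a 🤗 Transformers model
      ```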

    3. progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)

      [!NOTE] How do you make tqdm display in only one process?

      flashcard

      Use the disable= argument with a process check, e.g. disable=not accelerator.is_local_main_process

    4. Some of your instructions only need to run for one process on a given server: for instance a data download or a log statement.

      [!NOTE] When using parallel libraries such as 🤗 Accelerate, what should you watch out for with file operations (downloads, logging)?

      flashcard

      They should be executed in only one process per machine.

    5. You should only pass the learning rate scheduler to prepare() when the scheduler needs to be stepped at each optimizer step.

      Why? What happens if you don't pass it? And what if the scheduler isn't stepped at every optimizer step?

    6. use the option split_batches=True when creating and initializing your Accelerator, in which case the batch size will always stay the same, whether you run your script on 1, 2, 4, or 64 GPUs.

      How is it kept the same, and what exactly stays the same?

    1. The report includes the number of training steps, number of skipped optimizer updates (likely due to overflows in mixed-precision training), current learning rate, and current momentum.

      [!NOTE] What does the DeepSpeed report contain?

      flashcard

      • number of training steps
      • number of skipped optimizer updates (likely due to overflow in mixed-precision training)
      • current learning rate and momentum
    1. While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from source to best match your hardware and also if you need to enable certain features, like 1-bit Adam, which aren’t available in the pypi distribution.

      [!NOTE] Why does it matter where DeepSpeed is installed from?

      flashcard

      • building from source best matches your hardware
      • certain features (e.g. 1-bit Adam) are not available in the PyPI distribution
    2. While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to the models hub or pass it to someone else you most likely will want to get the fp32 weights.

      [!NOTE] What dtype is best for model weights that will be uploaded publicly?

      flashcard

      Full precision, e.g. fp32.

    3. This ideally shouldn’t be done during training since this is a process that requires a lot of memory, and therefore best to be performed offline after the training is complete.

      [!NOTE] In 🤗 Transformers + DeepSpeed, how do you convert the saved weights to fp32?

      flashcard

      See the documentation below this annotation (e.g. DeepSpeed's zero_to_fp32.py script); ideally run it offline after training, since it requires a lot of memory.

    4. DeepSpeed stores fp32 master weights in its custom checkpoint optimizer files, which are global_step*/*optim_states.pt (this is glob pattern), and are saved under the normal checkpoint.

      [!NOTE] How does DeepSpeed store model weights?

      flashcard

      • Data type: fp32 master weights in its custom optimizer checkpoint files (global_step*/*optim_states.pt), with fp16 weights separately (see below)?
    5. If you use gradient accumulation with bf16-enabled, you need to be aware that it’ll accumulate gradients in bf16, which may not be what you want due to this format’s low precision, as it may lead to a lossy accumulation. A work is being done to fix that and provide an option to use a higher precision dtype (fp16 or fp32).

      [!NOTE] What dtype requirements does gradient accumulation have?

      flashcard

      Accumulation needs a reasonably high-precision dtype, otherwise accumulation becomes lossy; hence fp16/fp32 are preferable to bf16 here.

    6. bf16 has the same dynamic range as fp32 and thus doesn’t require loss scaling.

      [!NOTE] When using fp16, how does the loss usually need to be handled?

      flashcard

      Use loss scaling to avoid numerical overflow/underflow.

    7. With the 🤗 Trainer you can use --tf32 to enable it, or disable it with --tf32 0 or --no_tf32. By default the PyTorch default is used.

      [!NOTE] In 🤗 Transformers, how do you enable TF32, and what is the default?

      flashcard

      --tf32 to enable, --tf32 0 or --no_tf32 to disable; by default the PyTorch default is used.

    8. If you’re using the Ampere-architecture based GPU, pytorch version 1.7 and higher will automatically switch to using the much more efficient tf32 format for some operations, but the results will still be in fp32.

      [!NOTE] What are the requirements for using TF32, and how is it enabled?

      flashcard

      TF32 is used automatically for some operations (with results still in fp32) when: - the GPU is Ampere-architecture - PyTorch is version 1.7 or higher

    9. the only time you will want to not use it is when the model you’re using doesn’t behave well under this training mode. Typically this happens when the model wasn’t pretrained in the fp16 mixed precision (e.g. often this happens with bf16-pretrained models).

      [!NOTE] What should you watch out for when using low precision?

      flashcard

      • The precision the model was pretrained in: a model not pretrained in fp16 mixed precision (e.g. an fp32- or bf16-pretrained model) tends to overflow when fine-tuned or run in fp16 mixed precision.
    10. You can also take the HF Transformers modeling code and replace torch.utils.checkpoint with the DeepSpeed’s API. The latter is more flexible since it allows you to offload the forward activations to the CPU memory instead of recalculating them.

      [!NOTE] For activation checkpointing, how do the DeepSpeed and PyTorch APIs differ?

      flashcard

      The DeepSpeed API additionally allows offloading the forward activations to CPU memory instead of recomputing them.

    11. HF Transformers models don’t know anything about DeepSpeed’s activation checkpointing, so if you try to enable that feature in the DeepSpeed config file, nothing will happen. Therefore you have two ways to take advantage of this very beneficial feature: If you want to use a HF Transformers models you can do model.gradient_checkpointing_enable() or use --gradient_checkpointing in the HF Trainer, which will automatically enable this for you. torch.utils.checkpoint is used there.

      [!NOTE] How do you use activation checkpointing with 🤗 Transformers models?

      flashcard

      • model.gradient_checkpointing_enable()
      • use --gradient_checkpointing in the HF Trainer

      Implementation: torch.utils.checkpoint (see the sketch below)
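
      A minimal sketch of enabling it on a 🤗 Transformers model (the GPT-2 checkpoint is only illustrative):

      ```python
      from transformers import AutoModelForCausalLM

      model = AutoModelForCausalLM.from_pretrained("gpt2")
      model.gradient_checkpointing_enable()  # backed by torch.utils.checkpoint
      # or, with the HF Trainer, pass --gradient_checkpointing on the command line
      ```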

    12. Activation checkpointing and gradient checkpointing are two distinct terms that refer to the same methodology. It’s very confusing but this is how it is.

      [!NOTE] How do activation checkpointing and gradient checkpointing differ?

      flashcard

      They are the same technique under two different names...

    13. Before beginning to train BLOOM-176B I spent 2 days on this process and was able to increase throughput from 90 to 150 TFLOPs! This effort saved us more than one month of training time.

      [!QUESTION] How do you measure compute-throughput metrics such as TFLOPs?

      flashcard

    14. Here is a full ZeRO-3 all-enabled manually set configuration file. It is here mainly for you to see what the typical values look like, but we highly recommend using the one with multiple auto settings in it.

      [!NOTE] What are the ZeRO-3 configuration fields and their typical values?

      flashcard

      See the example below:

    15. Here is a full ZeRO-2 all-enabled manually set configuration file. It is here mainly for you to see what the typical values look like, but we highly recommend using the one with multiple auto settings in it.

      [!NOTE] What are the ZeRO-2 configuration fields and their typical values?

      flashcard

      See the example below:

    16. It’s possible to adjust ZeRO-3 configuration to make it perform closer to ZeRO-2: set stage3_param_persistence_threshold to a very large number - larger than the largest parameter, e.g., 6 * hidden_size * hidden_size. This will keep the parameters on the GPUs. turn off offload_params since ZeRO-2 doesn’t have that option.

      [!NOTE] How can you make ZeRO-3 perform closer to ZeRO-2?

      flashcard

      • turn off offload_params
      • set stage3_param_persistence_threshold to a very large value (larger than the largest parameter, e.g. 6 * hidden_size * hidden_size)
    17. modern NVMe transfer speeds in mind (as of this writing one can have ~3.5GB/s read, ~3GB/s write peak speeds)

      [!NOTE] Roughly what are NVMe read and write speeds?

      flashcard

      ~3.5GB/s read ~3GB/s write

    18. Make sure that your nvme_path is actually an NVMe, since it will work with the normal hard drive or SSD, but it’ll be much much slower.

      [!NOTE] In what form is the NVMe device exposed (in the DeepSpeed config)?

      flashcard

      As a filesystem path, e.g. /local_nvme

    19. sub_group_size controls the granularity in which parameters are updated during optimizer steps. Parameters are grouped into buckets of sub_group_size and each buckets is updated one at a time. When used with NVMe offload in ZeRO-Infinity, sub_group_size therefore controls the granularity in which model states are moved in and out of CPU memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models.

      [!NOTE] In a ZeRO-3 config, what does sub_group_size do?

      flashcard

      It sets the size of the buckets into which the parameters are grouped; each bucket is updated (and, with NVMe offload, moved in and out of CPU memory) as a unit during the optimizer step.

    20. ZeRO-Infinity allows for training incredibly large models by extending GPU and CPU memory with NVMe memory.

      [!NOTE] What does ZeRO-Infinity do?

      flashcard

      It extends GPU and CPU memory with NVMe storage, allowing incredibly large models to be trained.

    21. When used with NVMe offload in ZeRO-Infinity, sub_group_size therefore controls the granularity in which model states are moved in and out of CPU memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models. You can leave sub_group_size to its default value of 1e9 when not using NVMe offload.

      [!QUESTION] In a ZeRO-3 config, how does sub_group_size affect NVMe offload?

      flashcard

    22. stage3_gather_16bit_weights_on_model_save enables model fp16 weights consolidation when model gets saved. With large models and multiple GPUs this is an expensive operation both in terms of memory and speed. It’s currently required if you plan to resume the training.

      [!NOTE] In a ZeRO-3 config, what does stage3_gather_16bit_weights_on_model_save do?

      flashcard

      It consolidates the fp16 weights when the model is saved (expensive in memory and speed for large models on multiple GPUs, but currently required if you plan to resume training).

    23. The following configuration values depend on the model’s hidden size: reduce_bucket_size: hidden_size*hidden_size stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size stage3_param_persistence_threshold: 10 * hidden_size therefore set these values to auto and the Trainer will automatically assign the recommended values. But, of course, feel free to set these explicitly as well.

      [!NOTE] Which ZeRO-3 configuration values depend on the model's hidden size?

      flashcard

      Three: reduce_bucket_size, stage3_prefetch_bucket_size, and stage3_param_persistence_threshold (set them to auto and the Trainer assigns the recommended values).

    24. “reuse distance” is a metric we are using to figure out when will a parameter be used again in the future, and we use the stage3_max_reuse_distance to decide whether to throw away the parameter or to keep it. If a parameter is going to be used again in near future (less than stage3_max_reuse_distance) then we keep it to reduce communication overhead.

      [!NOTE] In ZeRO-3, what does "reuse distance" mean?

      flashcard

      A metric for how soon a parameter will be used again; it determines whether the parameter is kept (if it will be reused within stage3_max_reuse_distance) or thrown away.

    25. stage3_max_live_parameters is the upper limit on how many full parameters you want to keep on the GPU at any given time.

      [!NOTE] In ZeRO-3, what are "live parameters"?

      flashcard

      The full parameters kept on the GPU at any given time; stage3_max_live_parameters is the upper limit on how many.

    26. 1e9 would consume ~2GB. The memory is shared by stage3_max_live_parameters and stage3_max_reuse_distance, so it’s not additive, it’s just 2GB total.

      [!NOTE] How much memory do stage3_max_live_parameters and stage3_max_reuse_distance consume?

      flashcard

      They share the same memory rather than adding up: e.g. a value of 1e9 consumes ~2 GB in total.

    27. stage3_max_live_parameters and stage3_max_reuse_distance. They should have minimal impact on performance unless you are doing activation checkpointing.

      [!QUESTION] Why do stage3_max_live_parameters and stage3_max_reuse_distance have minimal impact on performance unless activation checkpointing is used?

      flashcard

    28. This feature can improve the throughput at the cost of making less memory available to other processes. Pinned memory is set aside to the specific process that requested it and its typically accessed much faster than normal CPU memory.

      [!NOTE]- What are the pros and cons of pinned memory?

      flashcard

      • Pro: improves throughput (it is typically accessed much faster than normal CPU memory)
      • Con: it is set aside for the requesting process, so less memory is available to other processes
    29. The following is an example of configuration for ZeRO stage 3:

      [!NOTE]- What does a ZeRO stage 3 config look like?

      flashcard

      See the example below

    30. gradient accumulation steps (more copying between optimizer steps)

      [!NOTE]- How does gradient accumulation interact with CPU offloading?

      flashcard

      More gradient accumulation steps mean more copying between optimizer steps (gradients moved to and from the CPU)?

    31. enabling offload_optimizer should reduce GPU RAM usage (it requires "stage": 2)

      [!NOTE]- What does ZeRO require to enable offloading?

      flashcard

      At least stage 2 (enabling offload_optimizer requires "stage": 2).

    32. an example of configuration for ZeRO stage 2:

      [!NOTE]- What does a ZeRO stage 2 config look like?

      flashcard

      An example follows (see also the sketch below):
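
      A minimal ZeRO stage 2 sketch written as a Python dict you could dump to ds_config.json; the bucket sizes are illustrative, and "auto" lets the 🤗 Trainer fill in recommended values:

      ```python
      import json

      ds_config = {
          "zero_optimization": {
              "stage": 2,
              "offload_optimizer": {"device": "cpu", "pin_memory": True},
              "allgather_partitions": True,
              "allgather_bucket_size": 2e8,
              "overlap_comm": True,
              "reduce_scatter": True,
              "reduce_bucket_size": 2e8,
              "contiguous_gradients": True,
          },
          "gradient_accumulation_steps": "auto",
          "train_micro_batch_size_per_gpu": "auto",
      }

      with open("ds_config.json", "w") as f:
          json.dump(ds_config, f, indent=2)
      ```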

    33. currently DeepSpeed doesn’t validate parameter names, so if you misspell any, it’ll use the default setting for the parameter that got misspelled.

      [!NOTE]- After configuring DeepSpeed and launching, what should you do?

      flashcard

      Check the DeepSpeed engine start-up log messages and confirm the values match your configuration — as of 2023-08-13, DeepSpeed does not validate parameter names, so a misspelled parameter silently falls back to its default.

    34. This section has to be configured exclusively via DeepSpeed configuration - the Trainer provides no equivalent command line arguments.

      [!NOTE] In a DeepSpeed configuration, how must the zero_optimization section be configured?

      flashcard

      Exclusively via the DeepSpeed configuration file; the Trainer provides no equivalent command-line arguments.

    35. The zero_optimization section of the configuration file is the most important part (docs), since that is where you define which ZeRO stages you want to enable and how to configure them.

      The most important part of a DeepSpeed configuration is probably zero_optimization.

    36. The first one is not quite interesting for scalability purposes

      ZeRO-1 is not very interesting for scalability purposes ::: ?

    37. In your own programs, you can also use the following approach if you’d like to modify the DeepSpeed config as a master and configure TrainingArguments based on that. The steps are: Create or load the DeepSpeed configuration to be used as a master configuration Create the TrainingArguments object based on these values Do note that some values, such as scheduler.params.total_num_steps are calculated by Trainer during train, but you can of course do the math yourself.

      How to configure DeepSpeed and the Trainer yourself: use the DeepSpeed config as the master configuration and build the TrainingArguments from it.

    1. even though we use the mixed precision training, using full precision checkpoint is the best practice.

      [!QUESTION] What precision should be used when saving model checkpoints?

      flashcard

      Saving in full precision is generally considered best practice, even when training with mixed precision?

    1. One more important thing: mixing. I've listened to many AI song covers, and so far, apart from me and Eternity丨L, no other uploader seems to do mixing — it's all dry vocals plus the backing track.

      [!NOTE] What shortcomings do current AI song covers have?

      flashcard

      1. Very few creators do mixing.
    1. model_input_names (List[string], optional) — The list of inputs accepted by the forward pass of the model (like "token_type_ids" or "attention_mask"). Default value is picked from the class attribute of the same name.

      [!NOTE]- In Hugging Face, where does the tokenizer record which kinds of inputs the model's forward pass accepts?

      flashcard

      The model_input_names attribute.

    1. Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs to the model. See glue and ner for example of how it’s useful.

      [!NOTE]- How does DefaultDataCollator handle input samples?

      flashcard

      It does no additional preprocessing: each property of the input objects is passed to the model input of the same name.

    2. Data collators are objects that will form a batch by using a list of dataset elements as input.

      [!NOTE]- What are a data collator's input and output?

      flashcard

      Input: a list of dataset elements. Output: a batch of samples.

    1. ⚠️ Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels.

      [!NOTE]- In 🤗 Transformers, at which stage are input_ids and labels shifted to align them?

      flashcard

      Inside the model (after the data collator, which only copies the inputs to create the labels)?

    2. DataCollatorForLanguageModeling supports both masked language modeling (MLM) and causal language modeling (CLM). By default it prepares data for MLM, but we can switch to CLM by setting the argument mlm=False:

      [!NOTE]- Which language-modeling objective does DataCollatorForLanguageModeling prepare data for by default?

      flashcard

      MLM; pass mlm=False to switch to CLM.

    3. DataCollatorForLanguageModeling collator, which is designed specifically for language modeling (as the name subtly suggests). Besides stacking and padding batches, it also takes care of creating the language model labels

      [!NOTE]- What extra functionality does DataCollatorForLanguageModeling provide?

      flashcard

      Besides stacking and padding batches, it also creates the language-model labels automatically.

    1. Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

      [!NOTE]- In 🤗 Transformers, how do you build labels from the input_ids alone?

      flashcard

      1. Use the end-of-sequence token as the padding token: tokenizer.pad_token = tokenizer.eos_token
      2. Set mlm=False: data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) (see the sketch below)
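
      A small sketch of this CLM setup; the GPT-2 checkpoint and the two example texts are illustrative:

      ```python
      from transformers import AutoTokenizer, DataCollatorForLanguageModeling

      tokenizer = AutoTokenizer.from_pretrained("gpt2")
      tokenizer.pad_token = tokenizer.eos_token   # reuse EOS as the padding token

      collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
      batch = collator([tokenizer("hello world"), tokenizer("hi")])

      # labels are a copy of input_ids (padding masked out); the model itself
      # shifts them to align inputs and targets.
      print(batch["input_ids"].shape, batch["labels"].shape)
      ```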
    1. This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.

      [!NOTE]- When using a Hugging Face tokenizer inside a training loop, what should you avoid for efficiency?

      flashcard

      Don't call num_special_tokens_to_add there: it encodes a dummy input and counts the added tokens, which is inefficient.

    2. Converts a string in a sequence of tokens, using the tokenizer. Split in words for word-based vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces). Takes care of added tokens.

      [!NOTE]- With a Hugging Face tokenizer, what converts a string into a sequence of tokens?

      flashcard

      tokenizer.tokenize(text)

    3. Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary. Same as doing self.convert_tokens_to_ids(self.tokenize(text)).

      [!NOTE]- With a Hugging Face tokenizer, what converts a string into token ids?

      flashcard

      tokenizer.encode(), equivalent to self.convert_tokens_to_ids(self.tokenize(text))

    4. convert_ids_to_tokens

      [!NOTE]- With a Hugging Face tokenizer, how do you convert ids to tokens (or the reverse)?

      flashcard

      tokenizer.convert_ids_to_tokens (and tokenizer.convert_tokens_to_ids for the reverse; see the sketch below)
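
      A short sketch tying these methods together (the BERT checkpoint is illustrative):

      ```python
      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

      tokens = tokenizer.tokenize("Hello world")                        # string -> tokens
      ids = tokenizer.convert_tokens_to_ids(tokens)                     # tokens -> ids
      same = tokenizer.encode("Hello world", add_special_tokens=False)  # string -> ids
      back = tokenizer.convert_ids_to_tokens(ids)                       # ids -> tokens
      assert ids == same and back == tokens
      ```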

    1. In the decoding phase, let the representation of the current generated token at the $i$-th transformer layer be $t^{i}\in \mathbb{R}^{b\times 1\times h}$. Inference then has two parts: updating the KV cache and computing the output of the $i$-th transformer layer.

      [!NOTE]- How is the KV cache computed?

      flashcard

      Each time a new token id is to be generated: 1. compute its key and value and use them to update the KV cache, 2. compute its query and combine it with the KV cache to compute the output (see the sketch below).
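
      A toy PyTorch sketch of one decoding step with a KV cache (single-head attention, illustrative shapes only; not how any particular library implements it):

      ```python
      import torch

      def decode_step(x_t, W_q, W_k, W_v, cache):
          """x_t: [b, 1, h] hidden state of the newly generated token."""
          q, k, v = x_t @ W_q, x_t @ W_k, x_t @ W_v          # each [b, 1, h]
          # 1) update the KV cache with this step's key/value
          cache["k"] = torch.cat([cache["k"], k], dim=1)      # [b, t, h]
          cache["v"] = torch.cat([cache["v"], v], dim=1)
          # 2) attend the new query over all cached keys/values
          scores = q @ cache["k"].transpose(1, 2) / cache["k"].shape[-1] ** 0.5
          out = torch.softmax(scores, dim=-1) @ cache["v"]    # [b, 1, h]
          return out, cache
      ```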

    1. The embedding layer needs no intermediate activations

      Why does the embedding layer need no intermediate activations?

    2. For the $\mathrm{softmax}()$ function, its input $QK^{T}$ must be stored, occupying $2bs^{2}a$ bytes of GPU memory, where $a$ is the number of attention heads.

      In the transformer softmax, the input $QK^{T}$ is divided elementwise by $\sqrt{h}$. Why don't $Q$ and $K$ themselves need $a$ separate groups of parameters, given that there are $a$ different $Q, K$ per head?

    3. The word embedding matrix also has a large number of parameters: the word-vector dimension usually equals the hidden dimension $h$, so the embedding matrix has $Vh$ parameters.

      The word embedding matrix has shape $[V, e]$; it should map a length-$V$ one-hot word vector to the corresponding embedding.

    4. The weight matrix of the final output layer is usually parameter-shared with the word embedding matrix.

      How is it shared?

    5. The MLP block

      Parameter breakdown of the FFN block.

    6. The self-attention block

      Parameter breakdown of the self-attention block.

    1. synced_gpus (bool, optional) — Whether to continue running the while loop until max_length. Unless overridden this flag will be set to True under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished generating before other GPUs. Otherwise it’ll be set to False.

      What does this mean?

    1. modifying your script to run on GPT-J with FP16 on an 3090, with input_ids.shape[1]=16 and max_new_tokens=256, we get: 14071MB of GPU usage with use_cache=False 13233MB of GPU usage with use_cache=True The difference becomes more visible with large models and large sequence lengths

      The VRAM difference between using and not using the KV cache becomes visible with larger models and longer sequences.

  3. Jul 2023
    1. Our new technologies for optimizing inference cost and latency include: Inference-adapted parallelism allows users to efficiently serve large models by adapting to the best parallelism strategies for multi-GPU inference, accounting for both inference latency and cost. Inference-optimized CUDA kernels boost per-GPU efficiency by fully utilizing the GPU resources through deep fusion and novel kernel scheduling. Effective quantize-aware training allows users to easily quantize models that can efficiently execute with low-precision, such as 8-bit integer (INT8) instead of 32-bit floating point (FP32), leading to both memory savings and latency reduction without hurting accuracy.

      The main techniques used by DeepSpeed Inference.

    1. I wonder if the current approach of statically-compiled CUDA kernels is sustainable. Perhaps there is value to considering JIT compilation, e.g. with Triton or NVRTC?

      For large projects, statically compiled CUDA kernels may no longer be sustainable; perhaps JIT compilation (e.g. with Triton or NVRTC) should be adopted instead?

    1. In 🤗 Accelerate this conversion happens automatically when calling prepare() and passing in your model.

      These trigger points are added to the model when prepare() is called.

    1. For best performance, this data collator should be used with a dataset having items that are dictionaries or BatchEncoding, with the "special_tokens_mask" key, as returned by a PreTrainedTokenizer or a PreTrainedTokenizerFast with the argument return_special_tokens_mask=True.

      DataCollatorForLanguageModeling

    2. Data collator that will dynamically pad the inputs received, as well as the labels.

      DataCollatorForSeq2Seq

    1. batched (bool, defaults to False) — Provide batch of examples to function.

      This requires the function to be able to handle a batch of examples.

    1. when tokenizing each element into chunks of the specified context size, we create many samples from each document. We just need to make sure to delete the existing columns, since they have a conflicting size. If we wanted to keep them, we could repeat them appropriately and return them within the Dataset.map() call:

      When building training samples, delete all the other columns and keep only the training data, since their sizes conflict with the new samples (see the sketch below).
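
      A sketch of such a map call; the dataset, tokenizer, and context size are illustrative:

      ```python
      from datasets import load_dataset
      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("gpt2")
      raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

      def tokenize_in_chunks(batch):
          # One document can yield many fixed-size samples.
          out = tokenizer(batch["text"], truncation=True, max_length=128,
                          return_overflowing_tokens=True)
          return {"input_ids": out["input_ids"]}

      # Drop the original columns: their length no longer matches the new samples.
      tokenized = raw.map(tokenize_in_chunks, batched=True,
                          remove_columns=raw.column_names)
      ```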

    1. Answer grading is difficult in general. This grading logic is designed to be conservative and will sometimes reject correct answers, though it does so less frequently than the normalization logic from MATH. Our logic might sometimes admit incorrect answers, though we've put effort into minimizing this.

      The answer-grading logic still has room for improvement.

    2. // Total time in milliseconds spent on labeling this solution. "total_time": 278270,

      Labeling a single solution takes roughly 5 minutes (278,270 ms here).

    1. Enlarge output entropy; add noises onto parameters

      Some ways to increase the actor's randomness during RL exploration: enlarge the output entropy, add noise to the parameters, ...

    2. In this way, we do not have to collect data after each update

      In RL, when the actor that interacts with the environment is different from the actor being trained (the latter learns from the former), data need not be collected iteratively; a large amount of data can be collected in one go.

    3. Data collection is in the “for loop” of training iterations

      In the ideal (on-policy) RL setting, different policies yield different observations and actions, so data collection should be iterative: each time the policy is updated, a fresh batch of data must be collected.

    4. Minus by a baseline $b$: make $G_t'$ have positive and negative values

      In defining the RL reward, raw rewards are usually positive, so good and bad are only relative; subtracting a baseline $b$ and using the difference as the final reward signal makes it reflect absolute goodness.

    5. $G_1' = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$

      In defining the RL reward, to reflect that an action's influence on later timesteps decays with temporal distance, a discount factor $\gamma < 1$ is applied once per step: $G_{t}^{\prime}=\sum_{n=t}^{N} \gamma^{n-t} r_{n}$ (see the sketch below).
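
      A small helper computing this discounted return, as a sketch:

      ```python
      def discounted_returns(rewards, gamma=0.99):
          """G'_t = sum_{n >= t} gamma**(n - t) * r_n, via one backward pass."""
          g, out = 0.0, []
          for r in reversed(rewards):
              g = r + gamma * g
              out.append(g)
          return out[::-1]

      # discounted_returns([1.0, 0.0, 2.0], gamma=0.9) -> [2.62, 1.8, 2.0]
      ```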

    6. Make it take (or not take) a specific action $\hat{a}$ given a specific observation

      The basic way RL makes an actor take (or avoid) a specific action: represent the desired choice as a vector; to take / not take it, make the cross-entropy between the actual action distribution and that choice small / large — in practice, compute the cross-entropy and multiply it by a positive / negative coefficient.

  4. Jun 2023
    1. The pop-up menu at the lower left of the interface allows you to choose among the following models:

      The menu at the lower left selects the specific linear regression model.

    2. Predictor values are displayed in the text boxes on the horizontal axis and are marked by vertical dashed blue lines in the plots.

      Predictor values: shown numerically in the text boxes and marked by vertical dashed lines.

    3. rstool plots a 95% simultaneous confidence band for the fitted response surface as two red curves.

      Red curves: the bounds of the 95% simultaneous confidence band.

  5. May 2023
    1. If spec.loader.create_module does not return None, then any pre-existing attributes will not be reset. Also, no AttributeError will be raised if triggered while accessing spec or setting an attribute on the module.
  6. Apr 2023
    1. List of gradio.components to use as inputs.

      Both the inputs and outputs of a Gradio event are Components or lists of Components.

  7. Mar 2023
    1. In practice, very low pass rates are difficult or impossible to estimate, so we restrict to problems P and models M such that given some large sample budget, every problem is solved at least once by every model.

      In practice, to avoid performance metrics too poor to estimate, one can restrict the test set and models so that, given a large enough sample budget, every test problem can be ::: solved at least once by every model? But if every problem can be solved by every model, how do we then compare model performance?

    2. the Inverse Scaling Prize [44] proposed several tasks for which model performance decreases as a function of scale. Similarly to a recent result by Wei et al. [45], we find that GPT-4 reverses this trend, as shown on one of the tasks called Hindsight Neglect [46] in Figure 3.

      On the Hindsight Neglect task, GPT-4 reverses the trend, proposed by the Inverse Scaling Prize, of model performance decreasing with scale.

    3. Predictions on the other five buckets performed almost as well, the main exception being GPT-4 underperforming our predictions on the easiest bucket.

      Predictions of GPT-4's performance based on smaller models overestimated its actual performance on the easiest bucket.

    4. We chose to look at loss because it tends to be less noisy than other measures across different amounts of training compute.

      GPT-4 used loss to judge model performance, because it tends to be less noisy than other measures across different amounts of training compute.

    5. we predicted GPT-4’s final loss on our internal codebase (not part of the training set) by fitting a scaling law with an irreducible loss term (as in Henighan et al. [15]): L(C) = aC^b + c

      To verify the scalability of its optimization infrastructure, the GPT-4 team predicted final performance by fitting a scaling law with an irreducible loss term.

    6. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted line; this fit accurately predicts GPT-4’s performance.

      GPT-4 used the loss-vs-training-compute curve of smaller models to predict the loss-vs-training-compute behavior of the large model.

    1. The core idea of the fast Fourier transform is to quickly transform the coefficient vector into a point-value vector, and then quickly recover the coefficient vector from the point-value vector; the recovery operation is called the IDFT.

      The core idea of the FFT is to ::: quickly transform the coefficient vector into a point-value vector, then quickly recover the coefficient vector from it? (see the sketch below)
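
      A NumPy sketch of the round trip (coefficient vector → point values → coefficients):

      ```python
      import numpy as np

      coeffs = np.array([1.0, 2.0, 3.0, 0.0])   # polynomial 1 + 2x + 3x^2
      points = np.fft.fft(coeffs)               # evaluate at the 4th roots of unity
      recovered = np.fft.ifft(points).real      # IDFT back to the coefficients
      assert np.allclose(recovered, coeffs)
      ```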

    1. tol, rtol : float, optional — Iteration stops when error between last two iterates is less than tol OR the relative change is less than rtol.

      The absolute and relative tolerances of quadrature() refer to the error between two successive iterates.

    1. Why do we use the n + 1 order derivative of f to describe the error of a polynomial of degree at most n? The intuition is that the n + 1 order derivative of a degree n polynomial is identically 0, so that the difference between the actual function and the polynomial interpolant can be encapsulated by this derivative.

      The intuition for using the (n+1)-th derivative to describe the error of a degree-n interpolating polynomial: the (n+1)-th and higher derivatives of a degree-n polynomial are identically 0 (while its derivatives up to order n can match the original function?), so the error lives entirely "at order n+1 and beyond".

  8. Feb 2023
    1. Let me summarize it in one paragraph: the senior did a summer research stint with Neubig through the CMU joint program, enrolled in CMU MLT in 2018, and after graduating in 2020 joined Hudson River Trading as an Algorithm Engineer. He now has an H1B, lives in California, and his work has nothing to do with NLP. As for MLT, when he enrolled it had only 30 students, and about 1/3 could get an advisor's funding to cover tuition and living costs, though demand still exceeded supply. Without the joint program with CMU, he believes he would not have gotten into CMU MLT. Also, quite a few MLT graduates transfer into CMU's own PhD program; that path is basically 2 + (3+) years, rarely 2 + 5…

      When talking with someone, periodically summarizing both your own and the other person's points can be quite valuable.

    1. If one module imports a second module, which in turn imports a third, then the first module's sys.path is where the interpreter searches for the second import statement.

      sys.path is inherited across the different scripts executed by the same interpreter (presumably it is stored in the interpreter itself?)

    2. sys.path does not depend on the program's current working directory, os.getcwd(); it depends only on the directory of the first script:

      sys.path is unrelated to the shell's current directory os.getcwd(); it depends only on the directory containing the first script that was executed.

    3. In an interactive interpreter session, sys.path[0] is the path the interpreter was started from.

      In an interactive session, sys.path[0] is the directory the interpreter was started in (see the sketch below).
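
      A quick way to check this on your own setup (the exact value of sys.path[0] in an interactive session varies across Python versions):

      ```python
      import os
      import sys

      print(os.getcwd())     # the shell's working directory: not what imports use
      print(sys.path[0])     # directory of the first script run; may be '' or the
                             # startup directory in an interactive session
      print(sys.path[:3])    # the first few import search locations
      ```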

    1. (str, Parameter) – Tuple containing the name and parameter

      Module.named_parameters() yields tuples of a name string and a PyTorch Parameter object: (str, Parameter) (see the sketch below).
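
      A tiny example of iterating named parameters (the module is illustrative):

      ```python
      import torch.nn as nn

      model = nn.Linear(4, 2)
      for name, param in model.named_parameters():
          print(name, tuple(param.shape))   # "weight" (2, 4), then "bias" (2,)
      ```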

    1. torch.nn.init.constant_(tensor, val) — Fills the input Tensor with the value val.

      torch.nn.init.constant_(tensor, val) fills the input tensor with a constant value.

    2. torch.nn.init.normal_(tensor, mean=0.0, std=1.0) — Fills the input Tensor with values drawn from the normal distribution $\mathcal{N}(\text{mean}, \text{std}^2)$.

      torch.nn.init.normal_(tensor, mean=0.0, std=1.0) fills the input tensor with values drawn from a normal distribution.

    3. All the functions in this module are intended to be used to initialize neural network parameters, so they all run in torch.no_grad() mode and will not be taken into account by autograd.

      All functions in torch.nn.init are meant for initializing neural-network parameters, so they run in torch.no_grad() mode and are ignored by autograd (see the sketch below).
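
      A short sketch using both initializers (layer sizes are illustrative):

      ```python
      import torch.nn as nn

      layer = nn.Linear(4, 2)
      nn.init.normal_(layer.weight, mean=0.0, std=0.02)  # runs under torch.no_grad()
      nn.init.constant_(layer.bias, 0.0)
      ```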

    1. Why is warm-up so important during training? This has not yet been fully explained; from intuition and a few existing papers [1,2,3] we can only conjecture: it helps mitigate the model's early overfitting to mini-batches at the initial stage, and keeping the distributions stable helps keep the deeper layers of the model stable.

      The effect of learning-rate warm-up.

    2. At the start of training a large network, we need to train for n steps with a smaller learning rate first.

      What learning-rate warm-up means.

    1. Open-Source Vizier implements a variety of sophisticated algorithms for tuning ML models, including Bayesian Optimization algorithms.

      A hyperparameter-tuning tool implementing a variety of sophisticated algorithms (including Bayesian optimization): Open-Source Vizier.

    2. Summary: Bayesian optimization tools are a compelling option once we’re done exploring for good search spaces and have decided what hyperparameters even should be tuned at all.

      Bayesian optimization tools are a compelling choice once you have finished exploring for good search spaces and decided which hyperparameters to tune.

    3. However, we should only adopt changes that produce improvements that outweigh any complexity they add.

      Any change should only be adopted if the improvement it produces outweighs the ::: complexity it adds.

    4. Usually, we can get away with only recharacterizing the trial variance after major changes to the pipeline

      Usually the run-to-run trial variance only needs to be re-characterized after major changes to the pipeline.

    5. before adopting a candidate change, consider running the best trial N times to characterize the run-to-run trial variance.

      Before adopting a candidate change, consider re-running the best trial N times to characterize run-to-run variance.

    6. It is all well and good to make comparisons of validation error rates estimated on a finite validation set using fastidious statistical tests, but often the trial variance alone can produce statistically significant differences between two different trained models that use the same hyperparameter settings.

      Trial variance alone can produce statistically significant differences?

    7. the most important sources of variation that might cause such an inconsistent result

      The most important sources of inconsistent results fall into these categories: - run-to-run training variance - hyperparameter-search variance - data collection and sampling variance

    8. different random seeds. For example, different random initializations, training data shuffles, dropout masks, patterns of data augmentation operations, and orderings of parallel arithmetic operations, are all potential sources of trial variance.

      Random factors (seeds, initializations, data shuffles, dropout masks, etc.) can make model performance unstable.

    9. Summary: Examining the training curves is an easy way to identify common failure modes and can help us prioritize what actions to take next.

      Examining the training curves can: - identify common failure modes - help prioritize which actions to take next

    10. In general, it can be very difficult to know if the search space has been sampled densely enough. 🤖

      In general it is very hard to know whether the search space has been sampled ::: densely enough 🤖

    11. basic hyperparameter axis plots where we plot the validation objective value versus one of the hyperparameters (e.g. learning rate). Each point on the plot corresponds to a single trial.

      A basic hyperparameter axis plot plots the validation objective against one hyperparameter, with one point per trial.

    12. If all trials are infeasible for learning rates greater than some threshold value, and if the best performing trials have learning rates at the edge of that region, the model may suffer from stability issues preventing it from accessing higher learning rates.

      If all trials in a certain range of a hyperparameter are infeasible, there may be ::: stability issues preventing the model from using values in that range.

    13. A search space is suspicious if the best point sampled from it is close to its boundary. We might find an even better point if we expanded the search range in that direction.

      If the best sampled point lies close to the search-space boundary in one or more dimensions, the true optimum may lie outside the space, and the search range should be expanded in that direction.

    14. For example, do the best trials have training curves consistent with problematic overfitting?

      For the best trials' training curves, check whether they are ::: consistent with problematic overfitting?

    15. In some cases, a large number of infeasible points can indicate a bug in the training code.

      Sometimes a large number of infeasible points indicates a bug in the training code.

    16. reparameterizing the search space

      Reparameterizing the search space means ::: ?

    17. infeasible (i.e. trials that diverge, get really bad loss values, or fail to run at all because they violate some implicit constraint)

      A trial is infeasible when it: - diverges - gets a really bad loss value - fails to run at all (because it violates some implicit constraint)

    18. Before analyzing a given set of experiments to make progress toward their original goal, we should ask ourselves the following additional questions

      Before analyzing experiment results, check: - whether the search space is large enough - whether enough points were sampled - the causes of infeasible trials - whether the model exhibits optimization issues - what can be learned from the training curves of the best trials

    19. Since running experiments can be expensive, we also want to take the opportunity to extract other useful insights from each group of experiments, even if these insights are not immediately relevant to the current goal

      Because running experiments can be expensive, extract as much insight as possible from each group of experiments, even when it is not directly relevant to the current goal.

    20. For example, if our goal is to select the best optimizer out of Nesterov momentum and Adam, we could create one study in which optimizer="Nesterov_momentum" and the nuisance hyperparameters are {learning_rate, momentum}, and another study in which optimizer="Adam" and the nuisance hyperparameters are {learning_rate, beta1, beta2, epsilon}.

      An example pairing of scientific and nuisance hyperparameters ::: the choice of optimizer and that optimizer's own parameters.

    21. it ensures that we obtain a relatively uniform sampling of values of the scientific hyperparameters

      Quasi-random search is typically used because it gives a relatively uniform sampling of the scientific hyperparameter values.

    22. searches the scientific parameters uniformly

      The scientific hyperparameters should be searched as uniformly as possible.

    23. conditional hyperparameters can cause problems since it is hard to specify a search space unless the set of nuisance hyperparameters is the same for all values of the scientific hyperparameters.

      When searching scientific and nuisance hyperparameters together, conditional relationships between them (different scientific values implying different nuisance hyperparameters) make the search space hard to specify. - Solution ::: ?

    24. include the scientific parameters in the same search space as the nuisance hyperparameters and use a search algorithm to sample values of both the scientific and nuisance hyperparameters in a single study.

      When there are many scientific hyperparameters, they can be put into the same search space and searched together with the nuisance hyperparameters.

    25. We can use any gradient-free optimization algorithm, including methods such as Bayesian optimization or evolutionary algorithms, to optimize over the nuisance hyperparameters

      Hyperparameter search algorithms: any gradient-free optimization algorithm (e.g. Bayesian optimization or evolutionary algorithms).

    26. A study specifies a set of hyperparameter configurations to be run for subsequent analysis. Each configuration is called a "trial".

      The term "trial" refers to a single hyperparameter configuration that a run is executed with.

    27. the more nuisance hyperparameters we attempt to tune, the greater the risk we fail to tune them sufficiently well for each setting of the scientific hyperparameters and end up reaching the wrong conclusions from our experiments.

      The more nuisance hyperparameters there are, the harder it is to tune them well for every setting of the scientific hyperparameters, which risks reaching wrong conclusions.

    28. With limitless resources, we would leave all non-scientific hyperparameters as nuisance hyperparameters so that the conclusions we draw from our experiments are free from caveats about fixed hyperparameter values.

      With limitless resources, all hyperparameters other than the scientific ones should be treated as ::: nuisance hyperparameters?

    29. When designing a new round of experiments, we first identify the scientific hyperparameters for our experimental goal. At this stage, we consider all other hyperparameters to be nuisance hyperparameters.

      When identifying the scientific hyperparameters, all other hyperparameters should initially be treated as ::: nuisance hyperparameters?

    30. The activation function could be a fixed hyperparameter if we have determined in prior experiments that the best choice of activation function is not sensitive to model depth, or if we are willing to limit our conclusions about the number of hidden layers to only cover this specific choice of activation function

      A hyperparameter can be fixed if either: - it does not interact with the scientific hyperparameters, or - we decide in advance to limit our conclusions to that specific value.

    31. The learning rate is a nuisance hyperparameter because we can only fairly compare models with different numbers of hidden layers if the learning rate is tuned separately for each number of layers (the optimal learning rate generally depends on the model architecture).

      Nuisance hyperparameters are the ones that must be tuned separately for each value of the scientific hyperparameters so that the comparison is fair.

    32. When we are eventually ready to be greedy, we can focus purely on the validation error even if the experiments aren't maximally informative about the structure of the tuning problem.

      Even if the experiments aren't maximally informative about the structure of the tuning problem, we can still ::: focus purely on the validation error?

    33. Identify which hyperparameters the validation error is most sensitive to, which hyperparameters interact the most and therefore need to be re-tuned together, and which hyperparameters are relatively insensitive to other changes and can therefore be fixed in future experiments.

      Insights about hyperparameters: - which ones the validation error is most sensitive to - which ones interact strongly and must be re-tuned together - which ones are relatively insensitive and can be fixed in future experiments

    34. Although one might think we would spend most of our time trying to maximize performance on the validation set, in practice we spend the majority of our time trying to gain insight into the problem, and comparatively little time greedily focused on the validation error.

      Gaining insight into the problem > greedily improving validation performance.

    35. We call it a launch when we update our best configuration (which may or may not correspond to an actual launch of a production model).

      A "launch" means updating the best configuration (not necessarily an actual production launch).

    36. if an unnecessarily large step budget is chosen initially, it might be hard to change it down the road, e.g. once the learning rate schedule is tuned for that number of steps.

      If an unnecessarily large step budget is chosen at the start, it can be ::: hard to change later (e.g. once the learning-rate schedule has been tuned for that number of steps)?

    37. training for fewer steps means that each training run is faster and uses fewer resources, boosting tuning efficiency by reducing the time between cycles and allowing more experiments to be run in parallel

      Training for fewer steps means each run is faster (takes less time?) and consumes ::: fewer resources? (reducing cycle time and allowing more experiments to run in parallel)

    38. training for more steps can improve performance and makes hyperparameter tuning easier (see Shallue et al. 2018).

      Training for more steps can improve performance and makes hyperparameter tuning ::: easier?

    39. at minimum means that the trained model performs much better than random chance on the validation set

      The minimum bar for model performance: much better than random chance on the validation set.

    40. For example, start with a constant learning rate before adding fancy decay schedules.

      A decay schedule is ::: ? Fancier than a constant learning rate?

    41. "Simple" means avoiding bells and whistles wherever possible; these can always be added later

      Avoid bells and whistles at the start wherever possible; even useful ones can always be added later.

    42. Before beginning hyperparameter tuning we must determine the starting point. This includes specifying (1) the model configuration (e.g. number of layers), (2) the optimizer hyperparameters (e.g. learning rate), and (3) the number of training steps.

      The starting configuration includes: 1. the model configuration (e.g. number of layers) 2. the optimizer hyperparameters (e.g. learning rate) 3. the number of training steps

    43. Batch norm is complicated and, in general, should use a different batch size than the gradient computation to compute statistics.

      Batch-norm statistics should in general be computed with ::: a different batch size than the gradient computation?

    44. The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.

      The hyperparameters that interact most strongly with the batch size are the optimizer hyperparameters and the regularization hyperparameters.

    45. The optimal values of most hyperparameters are sensitive to the batch size.

      The optimal values of most hyperparameters are sensitive to the batch size.

    46. Choosing the batch size to minimize resource consumption

      Choosing the batch size to minimize resource consumption ::: TODO

    47. Choosing the batch size to minimize training time

      Choosing the batch size to minimize training time ::: TODO

    48. shouldn't be used to directly tune the validation set performance

      The batch size shouldn't be used to directly tune validation-set performance? ::: I don't fully understand this.

    49. This is particularly relevant in the beginning stages of a project when we are trying to find the best values of various other hyperparameters (e.g. architecture hyperparameters) while treating optimizer hyperparameters as nuisance parameters.

      Optimizer hyperparameters all matter, especially early in a project when we look for the best values of other hyperparameters (e.g. architecture) while treating the optimizer hyperparameters as nuisance parameters? ::: I don't fully understand this.

    50. If the training throughput increases only up to some maximum batch size, then we should only consider batch sizes up to that maximum batch size, even if a larger batch size is supported by the hardware.

      Generally use the smallest batch size that just reaches the maximum training throughput.

    51. If this is not the case then the training pipeline has a bottleneck such as I/O or synchronization between compute nodes.

      If the accelerators aren't saturated yet but the per-step time grows, there is likely another bottleneck such as I/O or synchronization between compute nodes.

    52. When the accelerators aren't yet saturated, if the batch size doubles, the training throughput should also double (or at least nearly double).

      While the accelerators aren't saturated, the time per step stays roughly constant, so doubling the batch size (nearly) doubles sample throughput.

    53. larger batch sizes can be more prone to overfitting and may require stronger regularization and/or additional regularization techniques

      Larger batch sizes can be more prone to overfitting ::: why? - they may require stronger and/or additional regularization techniques

    54. the difference in validation set performance between two batch sizes typically goes away if the training pipeline is optimized independently for each batch size.

      If the training pipeline is optimized independently for each batch size, the difference in validation performance between batch sizes typically ::: goes away?

    55. Increasing the batch size may either decrease, increase, or not change the resource consumption.

      Increasing the batch size may decrease, increase, or not change resource consumption; the effect is not fixed.

    56. Summary: When starting a new project, try to reuse a model that already works.

      When starting a new project, prefer reusing a model that has already been shown to work.

    57. When possible, try to find a paper that tackles something as close as possible to the problem at hand and reproduce that model as a starting point.

      A good starting point: find a paper tackling a problem as close as possible to yours and reproduce that model.

    58. There is already a pipeline set up that does training and evaluation, and it is easy to execute training and prediction jobs for various models of interest

      What counts as a complete pipeline?

    1. Language modeling (LM) loss to generate captions given images

      The LM loss is used to generate captions given images.

    2. Unless otherwise specified, all results reported in this paper as “BLIP” use ViT-B

      By default, the ViT variant used by BLIP is ViT-B.

    3. shares the same cross-attention layers

      The criterion by which BLIP shares parameters is ::: what?

    4. image-text matching (ITM) loss to distinguish between positive and negative image-text pairs

      The ITM loss distinguishes matched (positive) from unmatched (negative) image-text pairs?

    5. an image-text contrastive (ITC) loss to align the vision and language representations.

      The ITC loss aligns the vision and language representations.

    6. layers except for SA leads to better performance compared to not sharing

      In BLIP, sharing all layers (parameters) except the SA layers between the text encoder and decoder performs better than not sharing.

    7. If the SA layers are shared, the model’s performance would degrade due to the conflict between the encoding task and the decoding task

      In BLIP, sharing the SA layers between the decoder and encoder degrades performance because of the conflict between the two tasks. - What role do the SA layers play in the encoder vs. the decoder?

    8. During pre-training, the text encoder and decoder share all parameters except for the self-attention layers

      During BLIP pre-training, the text encoder and decoder share all parameters except the self-attention layers.

    9. We use the same pre-training dataset as Li et al. (2021a) with 14M images in total, including two human-annotated datasets (COCO and Visual Genome (Krishna et al., 2017)), and three web datasets (Conceptual Captions (Changpinyo et al., 2021), Conceptual 12M (Changpinyo et al., 2021), SBU captions (Ordonez et al., 2011)). We also experimented with an additional web dataset, LAION (Schuhmann et al., 2021), which contains 115M images with more noisy texts.

      The datasets used for BLIP pre-training.

    10. Increase the image resolution to 384 × 384 during finetuning

      BLIP increases the image resolution to 384×384 during finetuning, in order to ::: achieve what?

    11. Take random image crops of resolution 224 × 224 during pre-training

      BLIP takes random 224×224 image crops during pre-training, in order to ::: achieve what?

    12. AdamW (Loshchilov & Hutter, 2017)optimizer with a weight decay of 0.05. The learning rateis warmed-up to 3e-4 (ViT-B) / 2e-4 (ViT-L) and decayedlinearly with a rate of 0.85.

      BLIP 使用 AdamW 优化器 - weight decay = 0.05 - lr - warm-up 3e-4 - decay rate = 0.85

    13. a [CLS] token is appended to the beginning of the text input to summarize the sentence

      A [CLS] token is added to the beginning of the text input to summarize the whole sentence?

    14. has been adopted by the more recent methods (Li et al., 2021a; Kim et al., 2021).

      ViT was gradually adopted for visual feature extraction around 2021.

    15. ViT is more computation-friendly

      ViT is more computation-friendly than a pre-trained object detector.

    16. using pre-trained object detectorsfor visual feature extraction (Chen et al., 2020)

      Using pre-trained object detectors for visual feature extraction (2020).

    17. divides an input image into patches and encodes them as a sequence of embeddings, with an additional [CLS] token to represent the global image feature

      ViT's role in MED: 1. divide the input image into patches 2. encode the patches as a sequence of embeddings 3. use an additional [CLS] token to represent the global image feature.

    18. 3.1. Model Architecture

      The MED model architecture.

    19. Captioning and Filtering (CapFilt): a new dataset bootstrapping method for learning from noisy image-text pairs.

      CapFilt is a dataset bootstrapping method for learning from noisy image-text pairs.

    20. An MED can operate either as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder

      The three operating modes of an MED.

    21. Multimodal mixture of Encoder-Decoder (MED): a new model architecture for effective multi-task pre-training and flexible transfer learning

      MED enables effective multi-task pre-training and flexible transfer learning.

    22. encoder-decoder models have not been successfully adopted for image-text retrieval tasks.

      Encoder-decoder models have not been successfully adopted for image-text retrieval tasks?

    23. Encoder-based models are less straightforward to directly transfer to text generation tasks (e.g. image captioning),

      Encoder-only models are hard to transfer directly to text-generation tasks.

    1. Demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.

      BLIP's generalization ability transfers directly to video-language tasks in a zero-shot manner? ::: How is that achieved?

    2. using a stochastic decoding method (nucleus sampling) is better than using beam search for caption generation, due to the higher level of diversity in the synthetic captions.

      Using a stochastic decoding method (nucleus sampling) works better than beam search because the synthetic captions it produces are more diverse?