Hypothesis

844 Matching Annotations

Aug 2023
huggingface.co huggingface.co

Llama2

2
1. kidrain61 15 Aug 2023
  
  in Public
  
  pretraining_tp (int, optional, defaults to 1) — Experimental feature. Tensor parallelism rank used during pretraining. Please refer to this document to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to this issue.
  
  [!NOTE] What does a model's pretraining_tp refer to?
  
  flashcard
  
  The tensor parallelism degree used during pretraining
2. kidrain61 15 Aug 2023
  
  in Public
  
  max_position_embeddings (int, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
  
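  [!NOTE] What determines the input length limit of a Transformer model?
  
  flashcard
  
  The number of (absolute) position embeddings (e.g., 4096). Conversely, would relative position encodings impose no hard limit on input length?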
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/transformers/model_doc/llama2
huggingface.co huggingface.co

Distributed Inference with 🤗 Accelerate

5
1. kidrain61 15 Aug 2023
  
  in Public
  
  On the first GPU, the prompts will be ["a dog", "a cat"], and on the second GPU it will be ["a chicken", "a chicken"]. Make sure to drop the final sample, as it will be a duplicate of the previous one.
  
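  [!NOTE] How does 🤗 Accelerate's split_between_processes(..., apply_padding=True) equalize the number of samples? What should you watch out for?
  
  flashcard
  
  It repeats the last sample on the last process? Drop the duplicated results from the returned values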
2. kidrain61 15 Aug 2023
  
  in Public
  
  what if we then wanted to do something with the results of all the GPUs? (Say gather them all and perform some kind of post processing) You can pass in apply_padding=True to ensure that the lists of prompts are padded to the same length, with extra data being taken from the last sample. This way all GPUs will have the same number of prompts, and you can then gather the results.
  
  [!NOTE] In 🤗 Accelerate's split_between_processes(), what can you use to equalize the number of samples across distributed processes?
  
  flashcard
  
  apply_padding=True
3. kidrain61 15 Aug 2023
  
  in Public
  
  This is only needed when trying to perform an action such as gathering the results, where the data on each device needs to be the same length. Basic inference does not require this.
  
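  [!NOTE] What is required of the data when gathering it from distributed processes/devices?
  
  flashcard
  
  The same shape on every process (which may require padding across processes)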
4. kidrain61 15 Aug 2023
  
  in Public
  
  With 🤗 Accelerate, we can simplify this process by using the Accelerator.split_between_processes() context manager (which also exists in PartialState and AcceleratorState). This function will automatically split whatever data you pass to it (be it a prompt, a set of tensors, a dictionary of the prior data, etc.) across all the processes (with a potential to be padded) for you to use right away.
  
  [!NOTE] In 🤗 Accelerate, what can you use to distribute data to each process?
  
  flashcard
  
  with accelerator.split_between_processes(): available on Accelerator, PartialState, and AcceleratorState
5. kidrain61 15 Aug 2023
  
  in Public
  
  torch.distributed.get_rank()
  
  [!NOTE] In PyTorch distributed computing, how do you get a process's rank?
  
  flashcard
  
  torch.distributed.get_rank()
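The padding behaviour described in the annotations above can be modelled in a few lines of plain Python. This is only a sketch of the idea and not Accelerate's implementation: the function below is a hypothetical stand-in that takes the process count and index as explicit arguments, and it assumes the earlier processes receive the extra samples when the input does not divide evenly.

```python
def split_between_processes(items, num_processes, process_index, apply_padding=False):
    """Return the slice of `items` for one process (illustrative model only)."""
    base, extras = divmod(len(items), num_processes)
    # Assumption: the first `extras` processes each take one extra sample.
    start = process_index * base + min(process_index, extras)
    end = start + base + (1 if process_index < extras else 0)
    chunk = list(items[start:end])
    if apply_padding and extras > 0:
        target = base + 1  # length of the longest chunk
        # Pad short chunks by repeating the final sample of the whole input.
        chunk += [items[-1]] * (target - len(chunk))
    return chunk


prompts = ["a dog", "a cat", "a chicken"]
shards = [
    split_between_processes(prompts, 2, rank, apply_padding=True)
    for rank in range(2)
]
# After gathering, drop the padded duplicates at the end.
gathered = [prompt for shard in shards for prompt in shard]
deduplicated = gathered[: len(prompts)]
print(shards)
print(deduplicated)
```

With three prompts and two processes, both shards end up with two prompts, and truncating the gathered list to the original length removes the repeated sample.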
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/accelerate/usage_guides/distributed_inference
huggingface.co huggingface.co

Tracking

5
1. kidrain61 15 Aug 2023
  
  in Public
  
  Once you’ve finished training, make sure to run Accelerator.end_training() so that all the trackers can run their finish functionalities if they have any: accelerator.end_training()
  
  [!NOTE] In 🤗 Accelerate, if trackers are used, what must be done after training ends?
  
  flashcard
  
  Call accelerator.end_training()
2. kidrain61 15 Aug 2023
  
  in Public
  
  When you are ready to log any data, Accelerator.log() should be used. A step can also be passed in to correlate the data with a particular step in the training loop: accelerator.log({"train_loss": 1.12, "valid_loss": 0.8}, step=1)
  
  [!NOTE] In 🤗 Accelerate, how do you use log()?
  
  flashcard
  
  A dictionary of metrics plus a step, e.g. accelerator.log({"train_loss": 1.12, "valid_loss": 0.8}, step=1)
3. kidrain61 15 Aug 2023
  
  in Public
  
  At the start of your experiment Accelerator.init_trackers() should be used to setup your project
  
  [!NOTE] In 🤗 Accelerate, how do you initialize the trackers?
  
  flashcard
  
  accelerator = Accelerator(log_with="<tracker>")
  
  accelerator.init_trackers("my_project", config=hps)
4. kidrain61 15 Aug 2023
  
  in Public
  
  potentially add any experiment hyperparameters to be logged: hps = {"num_iterations": 5, "learning_rate": 1e-2}; accelerator.init_trackers("my_project", config=hps)
  
  [!NOTE] In 🤗 Accelerate, how do you add hyperparameters to be logged?
  
  flashcard
  
  accelerator.init_trackers("my_project", config=hps)
5. kidrain61 15 Aug 2023
  
  in Public
  
  accelerator.init_trackers("my_project"
  
  [!NOTE] In 🤗 Accelerate, how do you set the trackers' project name?
  
  flashcard
  
  accelerator.init_trackers("my_project")
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/accelerate/usage_guides/tracking
huggingface.co huggingface.co

Accelerate

6
1. kidrain61 15 Aug 2023
  
  in Public
  
  For printing statements you only want executed once per machine, you can just replace the print function by accelerator.print.
  
  [!NOTE] In 🤗 Accelerate, what can you use to conveniently print only once per machine?
  
  flashcard
  
  accelerator.print()
2. kidrain61 15 Aug 2023
  
  in Public
  
  The local means per machine: if you are running your training on two servers with several GPUs, the instruction will be executed once on each of those servers. If you need to execute something only once for all processes (and not per machine) for instance, uploading the final model to the 🤗 model hub, wrap it in a test like this: if accelerator.is_main_process:
  
  [!NOTE] In 🤗 Accelerate, what is the difference between is_main_process with and without local?
  
  flashcard
  
  With local: the main process on each machine. Without local: the single main process among all processes on all machines
3. kidrain61 15 Aug 2023
  
  in Public
  
  progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
  
  [!NOTE] How do you show a tqdm progress bar in only one process?
  
  flashcard
  
  The disable=<process_test> argument
4. kidrain61 15 Aug 2023
  
  in Public
  
  Some of your instructions only need to run for one process on a given server: for instance a data download or a log statement.
  
  [!NOTE] When using a parallelism library such as 🤗 Accelerate, what should you watch out for with file operations?
  
  flashcard
  
  They should run in only one process
5. kidrain61 02 Aug 2023
  
  in Public
  
  You should only pass the learning rate scheduler to prepare() when the scheduler needs to be stepped at each optimizer step.
  
  Why? What if you don't pass it? What if the scheduler should not be stepped at every optimizer step?
6. kidrain61 02 Aug 2023
  
  in Public
  
  use the option split_batches=True when creating and initializing your Accelerator, in which case the batch size will always stay the same, whether you run your script on 1, 2, 4, or 64 GPUs.
  
  How is it kept the same? What exactly stays the same?
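The question in the last annotation (what stays the same under split_batches=True) comes down to simple arithmetic, sketched below. The helper is hypothetical and only models the batch-size bookkeeping; it assumes that without split_batches every process draws its own full batch from the dataloader, and that with split_batches the dataloader batch is divided evenly across processes.

```python
def batch_sizes(dataloader_batch_size, num_processes, split_batches):
    """Return (per_process_batch, effective_global_batch) for one step."""
    if split_batches:
        # Assumption: the batch must divide evenly across processes.
        if dataloader_batch_size % num_processes != 0:
            raise ValueError("batch size must be divisible by the number of processes")
        return dataloader_batch_size // num_processes, dataloader_batch_size
    # Each process consumes a full batch, so the global batch grows.
    return dataloader_batch_size, dataloader_batch_size * num_processes


for gpus in (1, 2, 4):
    print(gpus, batch_sizes(64, gpus, split_batches=False), batch_sizes(64, gpus, split_batches=True))
```

Under these assumptions, the quantity that stays fixed with split_batches=True is the effective global batch size per step, while the per-process share shrinks as GPUs are added.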
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/accelerate/quicktour
www.deepspeed.ai www.deepspeed.ai

DeepSpeed Configuration JSON

1
1. kidrain61 15 Aug 2023
  
  in Public
  
  The report includes the number of training steps, number of skipped optimizer updates (likely due to overflows in mixed-precision training), current learning rate, and current momentum.
  
  [!NOTE] What does the DeepSpeed report contain?
  
  flashcard
  
  Skipped optimizer updates (possibly due to overflow)
  
  Current momentum and learning rate
  
  ...
Visit annotations in context

Annotators

kidrain61

URL

deepspeed.ai/docs/config-json/
huggingface.co huggingface.co

DeepSpeed Integration

37
1. kidrain61 15 Aug 2023
  
  in Public
  
  While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from source to best match your hardware and also if you need to enable certain features, like 1-bit Adam, which aren’t available in the pypi distribution.
  
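  [!NOTE] What should you consider when choosing where to install DeepSpeed from?
  
  flashcard
  
  Building from source is more likely to match your hardware best
  
  Certain features are not available in the PyPI distribution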
2. kidrain61 15 Aug 2023
  
  in Public
  
  While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to the models hub or pass it to someone else you most likely will want to get the fp32 weights.
  
  [!NOTE] Which data type is best for publicly uploaded model weights?
  
  flashcard
  
  High precision, e.g., fp32
3. kidrain61 15 Aug 2023
  
  in Public
  
  This ideally shouldn’t be done during training since this is a process that requires a lot of memory, and therefore best to be performed offline after the training is complete.
  
  [!NOTE] How do you convert model weights to fp32 with 🤗 Transformers & DeepSpeed?
  
  flashcard
  
  See the documentation below the annotated passage
4. kidrain61 15 Aug 2023
  
  in Public
  
  DeepSpeed stores fp32 master weights in its custom checkpoint optimizer files, which are global_step*/*optim_states.pt (this is glob pattern), and are saved under the normal checkpoint.
  
  [!NOTE] How does DeepSpeed save model weights?
  
  flashcard
  
  Data type: fp32/fp16 (see below)?
5. kidrain61 15 Aug 2023
  
  in Public
  
  If you use gradient accumulation with bf16-enabled, you need to be aware that it’ll accumulate gradients in bf16, which may not be what you want due to this format’s low precision, as it may lead to a lossy accumulation. A work is being done to fix that and provide an option to use a higher precision dtype (fp16 or fp32).
  
  [!NOTE] What does gradient accumulation require of the data type?
  
  flashcard
  
  High precision; otherwise the accumulated error becomes large. fp16/fp32 are therefore preferable to bf16
6. kidrain61 15 Aug 2023
  
  in Public
  
  bf16 has the same dynamic range as fp32 and thus doesn’t require loss scaling.
  
  [!NOTE] How is the loss usually handled when using fp16?
  
  flashcard
  
  Use loss scaling so that small gradients do not underflow in fp16's narrow dynamic range
7. kidrain61 15 Aug 2023
  
  in Public
  
  With the 🤗 Trainer you can use --tf32 to enable it, or disable it with --tf32 0 or --no_tf32. By default the PyTorch default is used.
  
  [!NOTE] In 🤗 Transformers, how do you enable TF32? What is the default?
  
  flashcard
  
  --tf32 / --no_tf32; by default, PyTorch's default is inherited
8. kidrain61 15 Aug 2023
  
  in Public
  
  If you’re using the Ampere-architecture based GPU, pytorch version 1.7 and higher will automatically switch to using the much more efficient tf32 format for some operations, but the results will still be in fp32.
  
  [!NOTE] What are the conditions and method for using tf32?
  
  flashcard
  
  With the following setup, tf32 is used automatically by default: - Ampere-architecture based GPU - pytorch version 1.7 and higher
9. kidrain61 15 Aug 2023
  
  in Public
  
  the only time you will want to not use it is when the model you’re using doesn’t behave well under this training mode. Typically this happens when the model wasn’t pretrained in the fp16 mixed precision (e.g. often this happens with bf16-pretrained models).
  
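  [!NOTE] What should you watch out for when using low precision?
  
  flashcard
  
  The precision the model was trained in: a model trained without fp16 mixed precision (fp32/bf16) is prone to overflow when fine-tuned or run for inference under fp16 mixed precision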
10. kidrain61 15 Aug 2023
  
  in Public
  
  You can also take the HF Transformers modeling code and replace torch.utils.checkpoint with the DeepSpeed’s API. The latter is more flexible since it allows you to offload the forward activations to the CPU memory instead of recalculating them.
  
  [!NOTE] For activation checkpointing, how do the DeepSpeed and PyTorch APIs differ?
  
  flashcard
  
  The DeepSpeed API additionally allows offloading forward activations to CPU memory (instead of recomputing them)
11. kidrain61 15 Aug 2023
  
  in Public
  
  HF Transformers models don’t know anything about DeepSpeed’s activation checkpointing, so if you try to enable that feature in the DeepSpeed config file, nothing will happen. Therefore you have two ways to take advantage of this very beneficial feature: If you want to use a HF Transformers models you can do model.gradient_checkpointing_enable() or use --gradient_checkpointing in the HF Trainer, which will automatically enable this for you. torch.utils.checkpoint is used there.
  
  [!NOTE] How do you use activation checkpointing in 🤗 Transformers?
  
  flashcard
  
  model.gradient_checkpointing_enable()
  
  use --gradient_checkpointing in the HF Trainer
  
  Implementation: torch.utils.checkpoint
12. kidrain61 15 Aug 2023
  
  in Public
  
  Activation checkpointing and gradient checkpointing are two distinct terms that refer to the same methodology. It’s very confusing but this is how it is.
  
  [!NOTE] What are the similarities and differences between activation checkpointing and gradient checkpointing?
  
  flashcard
  
  They are the same thing...
13. kidrain61 15 Aug 2023
  
  in Public
  
  Before beginning to train BLOOM-176B I spent 2 days on this process and was able to increase throughput from 90 to 150 TFLOPs! This effort saved us more than one month of training time.
  
  [!QUESTION] How can you check compute-throughput metrics (e.g., TFLOPS)?
  
  flashcard
  
  ?
14. kidrain61 15 Aug 2023
  
  in Public
  
  Here is a full ZeRO-3 all-enabled manually set configuration file. It is here mainly for you to see what the typical values look like, but we highly recommend using the one with multiple auto settings in it.
  
  [!NOTE] What do ZeRO-3's configuration fields and their typical values look like?
  
  flashcard
  
  An example follows:
15. kidrain61 15 Aug 2023
  
  in Public
  
  Here is a full ZeRO-2 all-enabled manually set configuration file. It is here mainly for you to see what the typical values look like, but we highly recommend using the one with multiple auto settings in it.
  
  [!NOTE] What do ZeRO-2's configuration fields and their typical values look like?
  
  flashcard
  
  An example follows:
16. kidrain61 15 Aug 2023
  
  in Public
  
  It’s possible to adjust ZeRO-3 configuration to make it perform closer to ZeRO-2: set stage3_param_persistence_threshold to a very large number - larger than the largest parameter, e.g., 6 * hidden_size * hidden_size. This will keep the parameters on the GPUs. turn off offload_params since ZeRO-2 doesn’t have that option.
  
  [!NOTE] How can you improve ZeRO-3 performance (to approach ZeRO-2's level)?
  
  flashcard
  
  Turn off offload_params
  
  Increase stage3_param_persistence_threshold
17. kidrain61 15 Aug 2023
  
  in Public
  
  modern NVMe transfer speeds in mind (as of this writing one can have ~3.5GB/s read, ~3GB/s write peak speeds)
  
  [!NOTE] Roughly what are NVMe read and write speeds?
  
  flashcard
  
  ~3.5GB/s read ~3GB/s write
18. kidrain61 15 Aug 2023
  
  in Public
  
  Make sure that your nvme_path is actually an NVMe, since it will work with the normal hard drive or SSD, but it’ll be much much slower.
  
  [!NOTE] In what form is NVMe exposed?
  
  flashcard
  
  As a filesystem path, e.g., /local_nvme
19. kidrain61 15 Aug 2023
  
  in Public
  
  sub_group_size controls the granularity in which parameters are updated during optimizer steps. Parameters are grouped into buckets of sub_group_size and each buckets is updated one at a time. When used with NVMe offload in ZeRO-Infinity, sub_group_size therefore controls the granularity in which model states are moved in and out of CPU memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models.
  
  [!NOTE] In the ZeRO-3 config, what does sub_group_size do?
  
  flashcard
  
  It sets the size of the groups into which model parameters are divided; each group is updated or offloaded as a unit
20. kidrain61 15 Aug 2023
  
  in Public
  
  ZeRO-Infinity allows for training incredibly large models by extending GPU and CPU memory with NVMe memory.
  
  [!NOTE] What is ZeRO-Infinity for?
  
  flashcard
  
  Extending GPU & CPU memory with NVMe memory?
21. kidrain61 15 Aug 2023
  
  in Public
  
  When used with NVMe offload in ZeRO-Infinity, sub_group_size therefore controls the granularity in which model states are moved in and out of CPU memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models. You can leave sub_group_size to its default value of 1e9 when not using NVMe offload.
  
  [!QUESTION] In the ZeRO-3 config, how does sub_group_size affect NVMe offload?
  
  flashcard
  
  ?
22. kidrain61 15 Aug 2023
  
  in Public
  
  stage3_gather_16bit_weights_on_model_save enables model fp16 weights consolidation when model gets saved. With large models and multiple GPUs this is an expensive operation both in terms of memory and speed. It’s currently required if you plan to resume the training.
  
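  [!NOTE] In the ZeRO-3 config, what does stage3_gather_16bit_weights_on_model_save do?
  
  flashcard
  
  It consolidates the model's fp16 weights when the model is saved, which makes saving the model possible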
23. kidrain61 15 Aug 2023
  
  in Public
  
  The following configuration values depend on the model’s hidden size: reduce_bucket_size: hidden_size*hidden_size stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size stage3_param_persistence_threshold: 10 * hidden_size therefore set these values to auto and the Trainer will automatically assign the recommended values. But, of course, feel free to set these explicitly as well.
  
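  [!NOTE] Which ZeRO-3 configuration parameters depend on the model's hidden size?
  
  flashcard
  
  Three: reduce_bucket_size, stage3_prefetch_bucket_size, and stage3_param_persistence_threshold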
24. kidrain61 15 Aug 2023
  
  in Public
  
  “reuse distance” is a metric we are using to figure out when will a parameter be used again in the future, and we use the stage3_max_reuse_distance to decide whether to throw away the parameter or to keep it. If a parameter is going to be used again in near future (less than stage3_max_reuse_distance) then we keep it to reduce communication overhead.
  
  [!NOTE] In ZeRO-3, what does reuse distance mean?
  
  flashcard
  
  How long until a parameter is used again? It determines whether the parameter is discarded
25. kidrain61 15 Aug 2023
  
  in Public
  
  stage3_max_live_parameters is the upper limit on how many full parameters you want to keep on the GPU at any given time.
  
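  [!NOTE] In ZeRO-3, what are live parameters?
  
  flashcard
  
  Full parameters currently kept on the GPU; stage3_max_live_parameters caps how many are kept at any given time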
26. kidrain61 15 Aug 2023
  
  in Public
  
  1e9 would consume ~2GB. The memory is shared by stage3_max_live_parameters and stage3_max_reuse_distance, so it’s not additive, it’s just 2GB total.
  
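  [!NOTE] How much memory do stage3_max_live_parameters and stage3_max_reuse_distance use?
  
  flashcard
  
  They hold the same number of values and share the memory rather than adding up: at 1e9, about 2GB in total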
27. kidrain61 15 Aug 2023
  
  in Public
  
  stage3_max_live_parameters and stage3_max_reuse_distance. They should have minimal impact on performance unless you are doing activation checkpointing.
  
  [!QUESTION] Why do stage3_max_live_parameters and stage3_max_reuse_distance have minimal impact on performance unless activation checkpointing is in use?
  
  flashcard
28. kidrain61 14 Aug 2023
  
  in Public
  
  This feature can improve the throughput at the cost of making less memory available to other processes. Pinned memory is set aside to the specific process that requested it and its typically accessed much faster than normal CPU memory.
  
  [!NOTE]- What are the pros and cons of pinned memory?
  
  flashcard
  
  Pro: higher throughput
  
  Con: less memory left for other processes
29. kidrain61 14 Aug 2023
  
  in Public
  
  The following is an example of configuration for ZeRO stage 3:
  
  [!NOTE]- What does a ZeRO-3 config look like?
  
  flashcard
  
  An example follows
30. kidrain61 13 Aug 2023
  
  in Public
  
  gradient accumulation steps (more copying between optimizer steps)
  
  [!NOTE]- How does gradient accumulation relate to CPU offloading?
  
  flashcard
  
  Gradient accumulation offloads gradients to the CPU between optimizer steps?
31. kidrain61 13 Aug 2023
  
  in Public
  
  enabling offload_optimizer should reduce GPU RAM usage (it requires "stage": 2)
  
  [!NOTE]- What is required to enable offload in ZeRO?
  
  flashcard
  
  At least stage 2
32. kidrain61 13 Aug 2023
  
  in Public
  
  an example of configuration for ZeRO stage 2:
  
  [!NOTE]- What does a ZeRO-2 configuration file look like?
  
  flashcard
  
  An example follows:
33. kidrain61 13 Aug 2023
  
  in Public
  
  currently DeepSpeed doesn’t validate parameter names, so if you misspell any, it’ll use the default setting for the parameter that got misspelled.
  
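  [!NOTE]- What should you do after configuring and launching DeepSpeed?
  
  flashcard
  
  Check the "DeepSpeed engine start up log messages" to confirm that the configuration values match what you set. As of 2023-08-13, DeepSpeed does not validate parameter names and cannot correct misspellings automatically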
34. kidrain61 13 Aug 2023
  
  in Public
  
  This section has to be configured exclusively via DeepSpeed configuration - the Trainer provides no equivalent command line arguments.
  
  [!NOTE] In the DeepSpeed config, how must zero_optimization be configured?
  
  flashcard
  
  It must be configured through the DeepSpeed configuration; there are no equivalent Trainer arguments
35. kidrain61 13 Aug 2023
  
  in Public
  
  The zero_optimization section of the configuration file is the most important part (docs), since that is where you define which ZeRO stages you want to enable and how to configure them.
  
  Probably the most important part of the DeepSpeed configuration: zero_optimization
36. kidrain61 13 Aug 2023
  
  in Public
  
  The first one is not quite interesting for scalability purposes
  
  ZeRO-1 is not very helpful for scalability?
37. kidrain61 13 Aug 2023
  
  in Public
  
  In your own programs, you can also use the following approach if you’d like to modify the DeepSpeed config as a master and configure TrainingArguments based on that. The steps are: Create or load the DeepSpeed configuration to be used as a master configuration Create the TrainingArguments object based on these values Do note that some values, such as scheduler.params.total_num_steps are calculated by Trainer during train, but you can of course do the math yourself.
  
  How to set up the DeepSpeed & Trainer configuration yourself
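When building a master DeepSpeed configuration by hand, the hidden-size-dependent ZeRO-3 values quoted in the annotations above can be computed rather than left to auto. The sketch below uses only the three formulas from the quoted passage; the helper name and the hidden size of 4096 are hypothetical, and the last line assumes 2 bytes per fp16 value to reproduce the "1e9 would consume ~2GB" figure.

```python
import json


def zero3_hidden_size_values(hidden_size):
    """Compute the ZeRO-3 settings that the quoted docs tie to hidden_size."""
    return {
        "reduce_bucket_size": hidden_size * hidden_size,
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }


hidden_size = 4096  # hypothetical model width, for illustration only
values = zero3_hidden_size_values(hidden_size)
zero_optimization = {"stage": 3, **values}
print(json.dumps({"zero_optimization": zero_optimization}, indent=2))

# Assumption: fp16 values take 2 bytes each, so 1e9 live parameters is about 2 GB.
live_param_gigabytes = int(1e9) * 2 / 1e9
print(live_param_gigabytes)
```

A dictionary built this way is only a fragment; the remaining fields would still come from the full configuration examples the annotated page provides.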
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/transformers/main_classes/deepspeed
github.com github.com

Question about the precision of checkpoint · Issue #216 · facebookresearch/llama

1
1. kidrain61 15 Aug 2023
  
  in Public
  
  even though we use the mixed precision training, using full precision checkpoint is the best practice.
  
  [!QUESTION] Which precision should be used when saving model checkpoints?
  
  flashcard
  
  Saving in full precision is generally considered best?
Visit annotations in context

Annotators

kidrain61

URL

github.com/facebookresearch/llama/issues/216
www.bilibili.com www.bilibili.com

"AI Stefanie Sun" "Hair Like Snow" remastered full version: probably the best-sounding AI cover of "Hair Like Snow" - bilibili

1
1. kidrain61 15 Aug 2023
  
  in Public
  
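  Another very important thing is mixing. I have listened to many AI covers, and so far, apart from me and Eternity丨L, no other uploaders seem to do any mixing; it is all just dry vocals plus a backing track.
  
  [!NOTE] What is a shortcoming of current AI covers?
  
  flashcard
  
  Very few people do mixing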
Visit annotations in context

Annotators

kidrain61

URL

bilibili.com/video/BV1tP411q7p7/
huggingface.co huggingface.co

Utilities for Tokenizers

1
1. kidrain61 14 Aug 2023
  
  in Public
  
  model_input_names (List[string], optional) — The list of inputs accepted by the forward pass of the model (like "token_type_ids" or "attention_mask"). Default value is picked from the class attribute of the same name.
  
  [!NOTE]- In Hugging Face, where does the tokenizer store which kinds of input the model accepts?
  
  flashcard
  
  The model_input_names attribute
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/transformers/internal/tokenization_utils
huggingface.co huggingface.co

Data Collator

2
1. kidrain61 14 Aug 2023
  
  in Public
  
  Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs to the model. See glue and ner for example of how it’s useful.
  
  [!NOTE]- How does DefaultDataCollator process input samples?
  
  flashcard
  
  It passes each property of an input sample to the model input of the same name
2. kidrain61 14 Aug 2023
  
  in Public
  
  Data collators are objects that will form a batch by using a list of dataset elements as input.
  
  [!NOTE]- What are the input and output of a data collator?
  
  flashcard
  
  Input: a list of dataset elements. Output: a batch built from those samples
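The "list of dataset elements in, batch out" contract noted above can be illustrated with a minimal pure-Python collator. This is a hypothetical sketch rather than the library's DefaultDataCollator: it keeps values as Python lists, whereas the real collators also convert them to tensors, and the token ids are made up.

```python
def collate(features):
    """Turn a list of per-example dicts into one dict of batched lists."""
    keys = features[0].keys()
    # Property names are kept as-is, so they line up with model input names.
    return {key: [example[key] for example in features] for key in keys}


features = [
    {"input_ids": [101, 7592, 102], "label": 1},
    {"input_ids": [101, 2088, 102], "label": 0},
]
batch = collate(features)
print(batch)
```

The only transformation is the change from "one dict per example" to "one list per field", which is why the collator is described as doing no additional preprocessing.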
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/transformers/main_classes/data_collator
huggingface.co huggingface.co

Training a causal language model from scratch - Hugging Face NLP Course

3
1. kidrain61 14 Aug 2023
  
  in Public
  
  ⚠️ Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels.
  
  [!NOTE]- In 🤗 Transformers, at which stage are input_ids and labels shifted?
  
  flashcard
  
  Inside the model? After the data collator?
2. kidrain61 14 Aug 2023
  
  in Public
  
  DataCollatorForLanguageModeling supports both masked language modeling (MLM) and causal language modeling (CLM). By default it prepares data for MLM, but we can switch to CLM by setting the argument mlm=False:
  
  [!NOTE]- Which kind of language modeling does DataCollatorForLanguageModeling target by default?
  
  flashcard
  
  MLM
3. kidrain61 14 Aug 2023
  
  in Public
  
  DataCollatorForLanguageModeling collator, which is designed specifically for language modeling (as the name subtly suggests). Besides stacking and padding batches, it also takes care of creating the language model labels
  
  [!NOTE]- What special functionality does DataCollatorForLanguageModeling have?
  
  flashcard
  
  It creates the labels automatically
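The division of labour described in these annotations (the collator copies inputs into labels, the model does the shifting) can be sketched without any library. Both functions below are hypothetical stand-ins operating on a single sequence of made-up token ids; they only show which position is paired with which target.

```python
def collate_for_clm(input_ids):
    """Collator side: labels are a plain copy of the inputs."""
    return {"input_ids": list(input_ids), "labels": list(input_ids)}


def shifted_pairs(batch):
    """Model side: the token at position t is trained to predict position t + 1."""
    contexts = batch["input_ids"][:-1]  # the last token has nothing to predict
    targets = batch["labels"][1:]       # the first token is never a target
    return list(zip(contexts, targets))


batch = collate_for_clm([10, 11, 12, 13])
pairs = shifted_pairs(batch)
print(pairs)
```

Because the shift happens at loss time, the collator never needs to know about it, which matches the warning quoted from the course.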
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/learn/nlp-course/chapter7/6
huggingface.co huggingface.co

Causal language modeling

1
1. kidrain61 14 Aug 2023
  
  in Public
  
  Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:
  
  [!NOTE]- In 🤗 Transformers, how can labels be constructed from input_ids alone?
  
  flashcard
  
  eos => pad: tokenizer.pad_token = tokenizer.eos_token
  
  mlm=False: data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/transformers/tasks/language_modeling
huggingface.co huggingface.co

Tokenizer

4
1. kidrain61 14 Aug 2023
  
  in Public
  
  This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
  
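  [!NOTE]- When using a Hugging Face tokenizer inside a training loop, what should you watch out for to stay efficient?
  
  flashcard
  
  Do not call num_special_tokens_to_add: it simply encodes a dummy input and counts the added tokens each time, which is inefficient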
2. kidrain61 14 Aug 2023
  
  in Public
  
  Converts a string in a sequence of tokens, using the tokenizer. Split in words for word-based vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces). Takes care of added tokens.
  
  [!NOTE]- Which Hugging Face tokenizer method converts a string into a sequence of tokens?
  
  flashcard
  
  tokenizer.tokenize(text)
3. kidrain61 14 Aug 2023
  
  in Public
  
  Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary. Same as doing self.convert_tokens_to_ids(self.tokenize(text)).
  
  [!NOTE]- Which Hugging Face tokenizer method converts a string into token ids?
  
  flashcard
  
  tokenizer.encode(), equivalent to self.convert_tokens_to_ids(self.tokenize(text))
4. kidrain61 13 Aug 2023
  
  in Public
  
  convert_ids_to_tokens
  
  [!NOTE]- With a Hugging Face tokenizer, how do you convert ids to tokens (or vice versa)?
  
  flashcard
  
  tokenizer.convert_ids_to_tokens (and tokenizer.convert_tokens_to_ids for the reverse)
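The relationship between the four methods in these annotations can be seen in a toy tokenizer. The class below is entirely hypothetical (whitespace splitting, a three-entry vocabulary) and borrows only the method names from the quoted docs; real tokenizers use sub-word vocabularies, and their encode() may also add special tokens.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer, for illustration only."""

    def __init__(self, vocab):
        self.token_to_id = {token: index for index, token in enumerate(vocab)}
        self.id_to_token = {index: token for token, index in self.token_to_id.items()}
        self.unk_id = self.token_to_id["<unk>"]

    def tokenize(self, text):
        # string -> list of tokens
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        # list of tokens -> list of ids, unknown tokens map to <unk>
        return [self.token_to_id.get(token, self.unk_id) for token in tokens]

    def convert_ids_to_tokens(self, ids):
        # list of ids -> list of tokens
        return [self.id_to_token[index] for index in ids]

    def encode(self, text):
        # string -> list of ids, composed from the two steps above
        return self.convert_tokens_to_ids(self.tokenize(text))


tok = ToyTokenizer(["<unk>", "hello", "world"])
ids = tok.encode("Hello world again")
tokens = tok.convert_ids_to_tokens(ids)
print(ids, tokens)
```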
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/transformers/main_classes/tokenizer
zhuanlan.zhihu.com zhuanlan.zhihu.com

Analyzing transformer models: parameter count, computation, intermediate activations, and KV cache

1
1. kidrain61 14 Aug 2023
  
  in Public
  
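  In the decoding phase, let the vector representation of the currently generated token at the $i$-th transformer layer be $t^{i}\in R^{b\times 1\times h}$. Inference consists of two parts: updating the KV cache and computing the output of the $i$-th transformer layer.
  
  [!NOTE]- How is the KV cache computed?
  
  flashcard
  
  Each time a new token id is to be generated: 1. first compute the key and value to update the KV cache; 2. then compute the query and combine it with the KV cache to compute the output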
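The two-step decoding procedure in the flashcard above can be sketched for a single attention head in plain Python. This is a simplified model under stated assumptions: identity projections stand in for the learned W_Q, W_K, W_V matrices, there is no batching or layer stacking, and the function names are hypothetical.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def decode_step(token_vec, cache):
    """One decoding step: update the KV cache, then attend over it."""
    # Assumption: identity projections; a real layer applies W_Q, W_K, W_V here.
    key, value, query = token_vec, token_vec, token_vec
    cache["keys"].append(key)      # step 1: extend the cache with the new key/value
    cache["values"].append(value)
    return attend(query, cache["keys"], cache["values"])  # step 2: query the cache


cache = {"keys": [], "values": []}
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = [decode_step(token, cache) for token in tokens]

# Recomputing attention for the last token over all tokens gives the same result.
full_recompute = attend(tokens[-1], tokens, tokens)
print(outputs[-1], full_recompute)
```

Each step computes a key and value for the new token only; the keys and values of earlier tokens are read from the cache instead of being recomputed.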
Visit annotations in context

Annotators

kidrain61

URL

zhuanlan.zhihu.com/p/624740065
zhuanlan.zhihu.com zhuanlan.zhihu.com

Analyzing transformer models: parameter count, computation, intermediate activations, and KV cache

6
1. kidrain61 12 Aug 2023
  
  in Public
  
  The embedding layer needs no intermediate activations
  
  Why does the embedding layer need no intermediate activations?
2. kidrain61 12 Aug 2023
  
  in Public
  
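  For the $softmax()$ function, its input $QK^T$ must be saved, occupying $2bs^2a$ of GPU memory, where $a$ is the number of attention heads.
  
  In a transformer's softmax, the input $QK^{T}$ must be divided element-wise by $\sqrt{h}$. Why don't the individual $Q,K$ need $a$ sets of parameters, when there are clearly $a$ different sets of $Q,K$?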
3. kidrain61 12 Aug 2023
  
  in Public
  
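  The word embedding matrix also has many parameters. The word vector dimension usually equals the hidden dimension $h$, so the word embedding matrix has $Vh$ parameters
  
  The word embedding matrix has shape $[V,e]$; it should map a length-$V$ one-hot word vector to the corresponding embedding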
4. kidrain61 12 Aug 2023
  
  in Public
  
  The weight matrix of the final output layer usually shares parameters with the word embedding matrix
  
  How are they shared?
5. kidrain61 12 Aug 2023
  
  in Public
  
  MLP block
  
  Parameter makeup of the FFN block
6. kidrain61 12 Aug 2023
  
  in Public
  
  self-attention block
  
  Parameter makeup of the self-attention block
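The quantities annotated above can be turned into a small calculator. The embedding count $Vh$ and the softmax-input memory $2bs^2a$ come from the quoted passages; the per-block counts are an assumption here, based on a standard GPT-style layer with biases and a $4h$-wide MLP, and the function names and example sizes are hypothetical. The attention count also bears on the question about $a$ sets of $Q,K$: under that standard design the $a$ heads split the $h$ output columns of each projection, so the total stays at $h \times h$ weights per projection.

```python
def attention_params(h):
    """W_Q, W_K, W_V, W_O: four h x h weight matrices plus four bias vectors."""
    return 4 * h * h + 4 * h


def mlp_params(h):
    """Assumed 4h-wide MLP: h -> 4h with bias, then 4h -> h with bias."""
    return (h * 4 * h + 4 * h) + (4 * h * h + h)


def embedding_params(vocab_size, h):
    """Word embedding matrix of shape [V, h]."""
    return vocab_size * h


def softmax_input_bytes(b, s, a, bytes_per_value=2):
    """Memory for the saved QK^T scores: b * a * s * s values, 2 bytes each in fp16."""
    return bytes_per_value * b * s * s * a


h = 4096  # example sizes, chosen only for illustration
print(attention_params(h), mlp_params(h))
print(embedding_params(32000, h))
print(softmax_input_bytes(b=1, s=2048, a=32))
```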
Visit annotations in context

Annotators

kidrain61

URL

zhuanlan.zhihu.com/p/624740065
huggingface.co huggingface.co

Generation

1
1. kidrain61 02 Aug 2023
  
  in Public
  
  synced_gpus (bool, optional) — Whether to continue running the while loop until max_length. Unless overridden this flag will be set to True under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished generating before other GPUs. Otherwise it’ll be set to False.
  
  What does this mean?
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/transformers/main_classes/text_generation
discuss.huggingface.co discuss.huggingface.co

Generate: using k-v cache is faster but no difference to memory usage - 🤗Transformers - Hugging Face Forums

1
1. kidrain61 01 Aug 2023
  
  in Public
  
  modifying your script to run on GPT-J with FP16 on an 3090, with input_ids.shape[1]=16 and max_new_tokens=256, we get: 14071MB of GPU usage with use_cache=False 13233MB of GPU usage with use_cache=True The difference becomes more visible with large models and large sequence lengths
  
  With larger VRAM usage, the difference between enabling and disabling the KV cache becomes visible
Visit annotations in context

Annotators

kidrain61

URL

discuss.huggingface.co/t/generate-using-k-v-cache-is-faster-but-no-difference-to-memory-usage/31272
Jul 2023
www.microsoft.com www.microsoft.com

DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression - Microsoft Research

1
1. kidrain61 28 Jul 2023
  
  in Public
  
  Our new technologies for optimizing inference cost and latency include: Inference-adapted parallelism allows users to efficiently serve large models by adapting to the best parallelism strategies for multi-GPU inference, accounting for both inference latency and cost.Inference-optimized CUDA kernels boost per-GPU efficiency by fully utilizing the GPU resources through deep fusion and novel kernel scheduling.Effective quantize-aware training allows users to easily quantize models that can efficiently execute with low-precision, such as 8-bit integer (INT8) instead of 32-bit floating point (FP32), leading to both memory savings and latency reduction without hurting accuracy.
  
  The main techniques used by DeepSpeed Inference
Visit annotations in context

Annotators

kidrain61

URL

microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/
github.com github.com

Not able to install 2.0 · Issue #358 · Dao-AILab/flash-attention

1
1. kidrain61 28 Jul 2023
  
  in Public
  
  I wonder if the current approach of statically-compiled CUDA kernels is sustainable. Perhaps there is value to considering JIT compilation, e.g. with Triton or NVRTC?
  
  For large projects, static compilation may no longer be a good fit, and JIT compilation may be preferable?
Visit annotations in context

Annotators

kidrain61

URL

github.com/Dao-AILab/flash-attention/issues/358
huggingface.co huggingface.co

Gradient Synchronization

1
1. kidrain61 27 Jul 2023
  
  in Public
  
  In 🤗 Accelerate this conversion happens automatically when calling prepare() and passing in your model.
  
  These triggerpoints are added to the model
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/accelerate/concept_guides/gradient_synchronization
huggingface.co huggingface.co

Data Collator

2
1. kidrain61 27 Jul 2023
  
  in Public
  
  For best performance, this data collator should be used with a dataset having items that are dictionaries or BatchEncoding, with the "special_tokens_mask" key, as returned by a PreTrainedTokenizer or a PreTrainedTokenizerFast with the argument return_special_tokens_mask=True.
  
  DataCollatorForLanguageModeling
2. kidrain61 27 Jul 2023
  
  in Public
  
  Data collator that will dynamically pad the inputs received, as well as the labels.
  
  DataCollatorForSeq2Seq
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/transformers/v4.31.0/en/main_classes/data_collator
huggingface.co huggingface.co

Main classes

1
1. kidrain61 27 Jul 2023
  
  in Public
  
  batched (bool, defaults to False) — Provide batch of examples to function.
  
  Requires that function can handle a batch
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/docs/datasets/v2.14.0/en/package_reference/main_classes
huggingface.co huggingface.co

Training a causal language model from scratch - Hugging Face NLP Course

1
1. kidrain61 24 Jul 2023
  
  in Public
  
  when tokenizing each element into chunks of the specified context size, we create many samples from each document. We just need to make sure to delete the existing columns, since they have a conflicting size. If we wanted to keep them, we could repeat them appropriately and return them within the Dataset.map() call:
  
  When building training samples, all other columns must be removed, keeping only the training data
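Why the existing columns conflict can be seen with a batched map-style function written in plain Python. This is a hypothetical sketch, not the course's code: it chunks already-tokenized ids instead of calling a tokenizer, uses a made-up context length of 4, and assumes that only full-length chunks are kept.

```python
def chunk_batch(batch, context_length=4):
    """Batched map function: may return more rows than it receives."""
    chunks = []
    for ids in batch["input_ids"]:
        for start in range(0, len(ids), context_length):
            piece = ids[start:start + context_length]
            if len(piece) == context_length:  # keep only full-size chunks
                chunks.append(piece)
    return {"input_ids": chunks}


batch = {"doc_id": ["a", "b"], "input_ids": [list(range(9)), list(range(4))]}
result = chunk_batch(batch)
print(len(batch["doc_id"]), len(result["input_ids"]))
```

Two input documents produce three output rows, so a leftover two-row doc_id column could not sit next to the new three-row input_ids column unless it were dropped or repeated to match.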
Visit annotations in context

Annotators

kidrain61

URL

huggingface.co/learn/nlp-course/chapter7/6
github.com github.com

openai/prm800k: 800,000 step-level correctness labels on LLM solutions to MATH problems

2
1. kidrain61 21 Jul 2023
  
  in Public
  
  Answer grading is difficult in general. This grading logic is designed to be conservative and will sometimes reject correct answers, though it does so less frequently than the normalization logic from MATH. Our logic might sometimes admit incorrect answers, though we've put effort into minimizing this.
  
  The answer-grading logic still has room for improvement
2. kidrain61 21 Jul 2023
  
  in Public
  
  // Total time in milliseconds spent on labeling this solution. "total_time": 278270,
  
  Labeling one solution takes about 5 min (278270 ms ≈ 4.6 min)
Visit annotations in context

Annotators

kidrain61

URL

github.com/openai/prm800k
speech.ee.ntu.edu.tw speech.ee.ntu.edu.tw

An Overview of Deep Reinforcement Learning

6
1. kidrain61 14 Jul 2023
  
  in Public
  
  Enlarge output entropy; add noises onto parameters
  
  Ways to increase the actor's randomness for exploration in RL: - enlarge the entropy of the output - add noise to the parameters - ...
2. kidrain61 14 Jul 2023
  
  in Public
  
  In this way, we do not have to collection data after each upd
  
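  In RL, when the actor that interacts with the environment differs from the actor being trained (the latter learns from the former), data need not be collected iteratively; a large amount can be collected in one go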
3. kidrain61 14 Jul 2023
  
  in Public
  
  Data collection is in the “forloop” of training iteratio
  
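  In ideal RL, different policies lead to different observations and actions, so data collection should be iterative: every time the policy is updated, a new batch of data has to be collected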
4. kidrain61 14 Jul 2023
  
  in Public
  
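  Minus by a baseline $b$. Make $G_t'$ have positive and negative values
  
  In defining the RL reward, raw rewards are usually all positive, so good and bad show up only relatively. A baseline $b$ is therefore subtracted, and the difference between the reward and the baseline is used as the final reward, so that its sign reflects good or bad in absolute terms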
5. kidrain61 14 Jul 2023
  
  in Public
  
  $G_1' = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$
  
  In defining the RL reward, to reflect that an action's influence on later steps decays with temporal distance, a discount factor $\gamma < 1$ can be applied once per step: $G_{t}^{\prime}=\sum_{n=t}^{N} \gamma^{n-t} r_{n}$
6. kidrain61 14 Jul 2023
  
  in Public
  
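  Make it take (or don't take) a specific action $\hat{a}$ given specific observation
  
  The basic way to make an RL actor take / not take a given action: represent the specific choice as a vector; to make the actor take / avoid that choice, require the cross-entropy between the actual output and the choice to be as small / large as possible. In practice this can be implemented by computing the cross-entropy and multiplying it by a positive / negative coefficient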
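
The discounted return and the baseline from the notes above can be sketched as follows. This is a toy illustration: the function names are hypothetical, and using the mean return as the baseline $b$ is only one possible choice, not something the slides prescribe.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G'_t = sum_{n=t}^{N} gamma^(n-t) * r_n for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def subtract_baseline(returns):
    """Subtract a baseline b; here b is the mean return (an assumption)."""
    b = sum(returns) / len(returns)
    return [g - b for g in returns]


rewards = [1.0, 1.0, 1.0]                     # all rewards are positive
g = discounted_returns(rewards, gamma=0.5)    # [1.75, 1.5, 1.0]
advantages = subtract_baseline(g)             # now has both signs
```

Even though every raw reward is positive, the baseline-subtracted values are positive for the early steps and negative for the last one, which is what lets them weight the cross-entropy with either sign.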
Visit annotations in context

Annotators

kidrain61

URL

speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/drl_v5.pdf
Jun 2023
ww2.mathworks.cn ww2.mathworks.cn

Interactive response surface modeling - MATLAB rstool - MathWorks 中国

3
1. kidrain61 06 Jun 2023
  
  in Public
  
  The pop-up menu at the lower left of the interface allows you to choose among the following models:
  
  The menu at the lower left selects the specific linear regression model
2. kidrain61 06 Jun 2023
  
  in Public
  
  Predictor values are displayed in the text boxes on the horizontal axis and are marked by vertical dashed blue lines in the plots.
  
  Predictor values: shown numerically in text boxes and marked by vertical dashed lines
3. kidrain61 06 Jun 2023
  
  in Public
  
  rstool plots a 95% simultaneous confidence band for the fitted response surface as two red curves.
  
  Red curves: boundaries of the 95% simultaneous confidence band
Visit annotations in context

Annotators

kidrain61

URL

ww2.mathworks.cn/help/stats/rstool.html
May 2023
docs.python.org docs.python.org

importlib — The implementation of import — Python 3.8.0 documentation

1
1. kidrain61 06 May 2023
  
  in Public
  
  If spec.loader.create_module does not return None, then any pre-existing attributes will not be reset. Also, no AttributeError will be raised if triggered while accessing spec or setting an attribute on the module.
  
  #question
Visit annotations in context

Tags

#question

Annotators

kidrain61

URL

docs.python.org/3/library/importlib.html
Apr 2023
gradio.app gradio.app

Gradio Docs

1
1. kidrain61 19 Apr 2023
  
  in Public
  
  List of gradio.components to use as inputs.
  
  In Gradio, the inputs and outputs of an event are each a Component or a list of Components
Visit annotations in context

Annotators

kidrain61

URL

gradio.app/docs/
Mar 2023
cdn.openai.com cdn.openai.com

gpt-4.pdf

6
1. kidrain61 27 Mar 2023
  
  in Public
  
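  In practice, very low pass rates are difficult or impossible to estimate, so we restrict to problems P and models M such that given some large sample budget, every problem is solved at least once by every model.
  
  In practice, to avoid metrics that are too poor to estimate, one can restrict the test set and the models so that, given a sufficiently large sample budget, every test problem is:::solved at least once by every model? If every problem can be solved once by every model, how are the models' performances compared?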
2. kidrain61 27 Mar 2023
  
  in Public
  
  the Inverse Scaling Prize [44] proposed several tasks for which model performance decreases as a function of scale. Similarly to a recent result by Wei et al. [45], we find that GPT-4 reverses this trend, as shown on one of the tasks called Hindsight Neglect [46] in Figure 3.
  
  On the Hindsight Neglect task, GPT-4 reverses the trend, identified by the Inverse Scaling Prize, of model performance decreasing with scale
3. kidrain61 27 Mar 2023
  
  in Public
  
  Predictions on the other five buckets performed almost as well, the main exception being GPT-4 underperforming our predictions on the easiest bucket.
  
  The prediction of GPT-4's performance from smaller models was higher than the actual performance on the easiest bucket of problems
4. kidrain61 27 Mar 2023
  
  in Public
  
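  We chose to look at loss because it tends to be less noisy than other measures across different amounts of training compute.
  
  The GPT-4 report uses loss to assess model performance, on the grounds that it is less noisy than other measures across different amounts of training compute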
5. kidrain61 27 Mar 2023
  
  in Public
  
  we predicted GPT-4's final loss on our internal codebase (not part of the training set) by fitting a scaling law with an irreducible loss term (as in Henighan et al. [15]): $L(C) = aC^b + c$
  
  To verify the scalability of the optimization infrastructure, the GPT-4 team predicted performance by fitting a scaling law with an irreducible loss term
6. kidrain61 27 Mar 2023
  
  in Public
  
  A power law fit to the smaller models (excluding GPT-4) is shown as the dotted line; this fit accurately predicts GPT-4's performance.
  
  The GPT-4 report uses the loss vs. training-compute curve of smaller models to predict that of the large model
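
A rough sketch of fitting $L(C) = aC^b + c$ on small runs and extrapolating. Everything here is assumed for illustration: the data are synthetic, the function names are hypothetical, and the irreducible loss $c$ is treated as known so that the fit reduces to a straight line in log space. The report does not describe its fitting procedure at this level of detail.

```python
import math


def fit_power_law(compute, loss, irreducible):
    """Least-squares fit of log(L - c) = log(a) + b * log(C), with c given."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l - irreducible) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(c_value, a, b, irreducible):
    """Evaluate the fitted scaling law at a given amount of compute."""
    return a * c_value ** b + irreducible


# Synthetic data generated from a = 2.0, b = -0.1, c = 1.5 (not real measurements).
small_runs = [1e-4, 1e-3, 1e-2, 1e-1]   # compute, normalised so the big run is 1.0
losses = [2.0 * c ** -0.1 + 1.5 for c in small_runs]
a, b = fit_power_law(small_runs, losses, irreducible=1.5)
extrapolated = predict(1.0, a, b, irreducible=1.5)
```

If $c$ were unknown as well, a nonlinear least-squares routine would be needed instead of this log-linear shortcut.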
Visit annotations in context

Annotators

kidrain61

URL

cdn.openai.com/papers/gpt-4.pdf
www.cnblogs.com www.cnblogs.com

A brief discussion of how to invert a Vandermonde matrix and the principle of the IDFT in the fast Fourier transform (FFT) - Deadecho - cnblogs (博客园)

1
1. kidrain61 25 Mar 2023
  
  in Public
  
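  The core idea of the fast Fourier transform is likewise to transform a coefficient vector quickly into a point-value vector, and then quickly restore the point-value vector to a coefficient vector; the restoring operation is called the IDFT.
  
  The core idea of the fast Fourier transform is likewise to take a coefficient vector and:::quickly transform it into a point-value vector, then quickly restore the point-value vector to a coefficient vector?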
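
The coefficient-to-point-value round trip can be checked with a naive transform. This sketch deliberately uses the $O(n^2)$ DFT rather than the FFT itself, since only the mapping matters here; the function name `dft` is hypothetical, and the sign convention (forward transform evaluates the polynomial at powers of $e^{2\pi i/n}$) is chosen to match the polynomial-evaluation view and may differ from library conventions.

```python
import cmath


def dft(coeffs, inverse=False):
    """Naive O(n^2) DFT: evaluate the polynomial at the n-th roots of unity."""
    n = len(coeffs)
    sign = -1 if inverse else 1
    out = []
    for k in range(n):
        total = sum(
            c * cmath.exp(sign * 2j * cmath.pi * j * k / n)
            for j, c in enumerate(coeffs)
        )
        # The inverse transform uses the conjugate roots and divides by n.
        out.append(total / n if inverse else total)
    return out


coeffs = [1, 2, 3, 4]                    # 1 + 2x + 3x^2 + 4x^3
values = dft(coeffs)                     # point values at the roots of unity
recovered = dft(values, inverse=True)    # IDFT: back to the coefficients
```

The forward step multiplies by a Vandermonde matrix built from the roots of unity, and the IDFT multiplies by its inverse, which is why the inverse has the same shape up to conjugation and a factor of $1/n$.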
Visit annotations in context

Annotators

kidrain61

URL

cnblogs.com/gzy-cjoier/p/9741950.html
docs.scipy.org docs.scipy.org

scipy.integrate.quadrature — SciPy v1.10.1 Manual

1
1. kidrain61 11 Mar 2023
  
  in Public
  
  tol, rtol (float, optional): Iteration stops when error between last two iterates is less than tol OR the relative change is less than rtol.
  
  The absolute and relative errors in quadrature() refer to the error between two successive iterates
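
The stopping rule can be illustrated with a toy integrator. As a stated substitution, this sketch refines a composite trapezoid rule by doubling the number of subintervals, whereas scipy.integrate.quadrature raises the order of a Gaussian rule; only the tol / rtol test between successive estimates is the point. The names `trapezoid` and `integrate` are hypothetical.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n subintervals."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))


def integrate(f, a, b, tol=1e-8, rtol=1e-8, max_doublings=20):
    """Refine until two successive estimates agree to tol OR to rtol."""
    n = 1
    previous = trapezoid(f, a, b, n)
    current, err = previous, float("inf")
    for _ in range(max_doublings):
        n *= 2
        current = trapezoid(f, a, b, n)
        err = abs(current - previous)          # change between iterates
        if err < tol or err < rtol * abs(current):
            break
        previous = current
    return current, err


value, err = integrate(math.sin, 0.0, math.pi)   # exact integral is 2
```

The returned `err` is the difference between the last two estimates, not the distance to the true integral, which is the distinction the note draws.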
Visit annotations in context

Annotators

kidrain61

URL

docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.quadrature.html
pages.cs.wisc.edu pages.cs.wisc.edu

lecture09.pdf

1
1. kidrain61 11 Mar 2023
  
  in Public
  
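  Why do we use the n + 1 order derivative of f to describe the error of a polynomial of degree at most n? The intuition is that the n + 1 order derivative of a degree n polynomial is identically 0, so that the difference between the actual function and the polynomial interpolant can be encapsulated by this derivative.
  
  The (n+1)-th derivative is used to describe the error of a degree-n interpolating polynomial. The intuition is that the derivatives of order n+1 and above of a degree-n interpolating polynomial are 0 (and its derivatives of order n and below match those of the original function?), so the error appears, and only appears, "from order n+1 onward"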
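
The standard interpolation error theorem makes this intuition precise. It is stated here from general numerical analysis rather than quoted from the lecture notes, under the assumptions that $f$ has $n+1$ continuous derivatives on $[a, b]$, that $x_0, \dots, x_n$ are distinct nodes in $[a, b]$, and that $p_n$ is the interpolant of degree at most $n$:

```latex
\[
  f(x) - p_n(x) \;=\; \frac{f^{(n+1)}(\xi_x)}{(n+1)!} \prod_{i=0}^{n} (x - x_i),
  \qquad \text{for some } \xi_x \in (a, b).
\]
```

If $f$ is itself a polynomial of degree at most $n$, then $f^{(n+1)} \equiv 0$ and the error vanishes. Note that $p_n$ is only guaranteed to agree with $f$ at the nodes; its lower-order derivatives need not equal those of $f$.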
Visit annotations in context

Annotators

kidrain61

URL

pages.cs.wisc.edu/~amos/412/lecture-notes/lecture09.pdf
Feb 2023
www.notion.so www.notion.so

Notion – The all-in-one workspace for your notes, tasks, wikis, and databases.

1
1. kidrain61 22 Feb 2023
  
  in Public
  
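  I then summarized it in one paragraph: the senior student did summer research with Neubig through the CMU joint program, enrolled in CMU MLT in 2018, and after graduating in 2020 joined Hudson River Trading as an Algorithm Engineer. He now has an H1B, lives in California, and does work that has nothing to do with NLP. As for MLT, when he enrolled the program had only 30 students, and 1/3 of them could get advisor funding that covered tuition and living costs, although demand still far exceeded supply. Without the joint program with CMU, he believes he could not have gotten into CMU MLT. Also, quite a few MLT graduates move on to a PhD at CMU. That route is basically 2 + (3+) years, and rarely takes 2 + 5 years…
  
  When talking with someone, periodically summarizing what you and the other person mean may work well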
Visit annotations in context

Annotators

kidrain61

URL

notion.so/zhaochen20/df2807daf0f5477489795480a5eb1c46
www.51cto.com www.51cto.com

How does Python's import work? - 51CTO.COM

3
1. kidrain61 18 Feb 2023
  
  in Public
  
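  If a module imports another module, and the latter in turn imports yet another module, then the first module's sys.path is where the interpreter searches for the second import statement.
  
  sys.path is inherited across the different scripts executed by the same interpreter (presumably it is stored in the interpreter?)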
2. kidrain61 18 Feb 2023
  
  in Public
  
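  sys.path does not depend on the program's current working directory, os.getcwd(); it depends only on the directory containing the first script:
  
  sys.path is independent of the shell's current directory os.getcwd(); it depends only on the directory of the first script executed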
3. kidrain61 18 Feb 2023
  
  in Public
  
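  In the interactive interpreter, sys.path[0] is the directory in which the interpreter was started
  
  In the interpreter environment, sys.path[0] is the directory the interpreter was in when it started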
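
A small check of the claim that sys.path does not follow the working directory. This is a sketch: the variable names are hypothetical, and it only shows that changing directory at runtime leaves the list untouched. One caveat is that when sys.path[0] is the empty string, as in an interactive session, that entry is resolved against the current directory at import time.

```python
import os
import sys
import tempfile

path_before = list(sys.path)
cwd_before = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)                          # change the working directory
    cwd_changed = os.getcwd() != cwd_before
    path_after = list(sys.path)            # sys.path is not recomputed
    os.chdir(cwd_before)                   # restore before the directory is removed

# For a script run as `python path/to/script.py`, this is normally the
# script's directory rather than the shell's working directory.
first_entry = sys.path[0] if sys.path else None
```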
Visit annotations in context

Annotators

kidrain61

URL

51cto.com/article/705454.html
pytorch.org pytorch.org

Module — PyTorch 1.6.0 documentation

1
1. kidrain61 17 Feb 2023
  
  in Public
  
  (str, Parameter) – Tuple containing the name and parameter
  
  torch.nn.Module.named_parameters() yields tuples (str, Parameter) consisting of a name string and a PyTorch parameter object
Visit annotations in context

Annotators

kidrain61

URL

pytorch.org/docs/stable/generated/torch.nn.Module.html
pytorch.org pytorch.org

torch.nn.init — PyTorch master documentation

3
1. kidrain61 17 Feb 2023
  
  in Public
  
  torch.nn.init.constant_(tensor, val)[source] Fills the input Tensor with the value $\text{val}$.
  
  torch.nn.init.constant_(tensor, val) fills the input tensor with a constant value
2. kidrain61 17 Feb 2023
  
  in Public
  
  torch.nn.init.normal_(tensor, mean=0.0, std=1.0)[source] Fills the input Tensor with values drawn from the normal distribution $\mathcal{N}(\text{mean}, \text{std}^2)$.
  
  torch.nn.init.normal_(tensor, mean=0.0, std=1.0) fills the input tensor with values drawn from a normal distribution (the standard normal with the default arguments)
3. kidrain61 17 Feb 2023
  
  in Public
  
  All the functions in this module are intended to be used to initialize neural network parameters, so they all run in torch.no_grad() mode and will not be taken into account by autograd.
  
  All functions in torch.nn.init are meant for initializing neural network parameters, so they all run in no-grad mode and are not tracked by autograd
Visit annotations in context

Annotators

kidrain61

URL

pytorch.org/docs/stable/nn.init.html
zhuanlan.zhihu.com zhuanlan.zhihu.com

Why is warm-up so important during training? Understanding the principle of warm-up in one article

2
1. kidrain61 17 Feb 2023
  
  in Public
  
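  Why is warm-up so important during training? This question has not yet been fully settled; we can only conjecture from intuition and from some existing papers [1,2,3]: it helps reduce the model's premature overfitting to mini-batches in the initial phase and keeps the distribution stable; it helps keep the deep layers of the model stable
  
  The effect of training warm-up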
2. kidrain61 17 Feb 2023
  
  in Public
  
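  In the early stage of training a large network, we need to train for n steps with a smaller learning rate first
  
  The meaning of training warm-up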
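
"Train with a smaller learning rate for the first n steps" can be sketched as a schedule function. The linear ramp, the function name `warmup_lr`, and the default values are assumptions for illustration; the article does not fix a particular shape.

```python
def warmup_lr(step, base_lr=3e-4, warmup_steps=1000):
    """Linear warm-up: ramp up to base_lr over warmup_steps, then hold it."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# Tiny example so the ramp is easy to read: 4 warm-up steps, then constant.
schedule = [warmup_lr(s, base_lr=1.0, warmup_steps=4) for s in range(6)]
```

In practice the constant part is usually replaced by a decay schedule, but the warm-up portion keeps the same role.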
Visit annotations in context

Annotators

kidrain61

URL

zhuanlan.zhihu.com/p/424373231
github.com github.com

google-research/tuning_playbook: A playbook for systematically maximizing the performance of deep learning models.

58
1. kidrain61 17 Feb 2023
  
  in Public
  
  Open-Source Vizier implements a variety of sophisticated algorithms for tuning ML models, including Bayesian Optimization algorithms.
  
  A tool with algorithms for tuning models: Open-Source Vizier
2. kidrain61 17 Feb 2023
  
  in Public
  
  Summary: Bayesian optimization tools are a compelling option once we’re done exploring for good search spaces and have decided what hyperparameters even should be tuned at all.
  
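  Bayesian optimization tools are a very useful option once good search spaces have been explored and the hyperparameters to tune have been decided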
3. kidrain61 17 Feb 2023
  
  in Public
  
  However, we should only adopt changes that produce improvements that outweigh any complexity they add.
  
  Any change should bring improvements that outweigh:::the complexity it introduces
4. kidrain61 17 Feb 2023
  
  in Public
  
  Usually, we can get away with only recharacterizing the trial variance after major changes to the pipeline
  
  Trial variance usually needs to be recharacterized only after major changes to the pipeline
5. kidrain61 17 Feb 2023
  
  in Public
  
  before adopting a candidate change, consider running the best trial N times to characterize the run-to-run trial variance.
  
  Before adopting a change, consider running the best trial several more times to learn the run-to-run variance between trials
6. kidrain61 17 Feb 2023
  
  in Public
  
  It is all well and good to make comparisons of validation error rates estimated on a finite validation set using fastidious statistical tests, but often the trial variance alone can produce statistically significant differences between two different trained models that use the same hyperparameter settings.
  
  Can differences caused by randomness alone be statistically significant?
7. kidrain61 17 Feb 2023
  
  in Public
  
  the most important sources of variation that might cause such an inconsistent result
  
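  The main causes of inconsistent results fall into these categories: - variance between training runs - hyperparameter search variance - data collection and sampling variance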
8. kidrain61 17 Feb 2023
  
  in Public
  
  different random seeds. For example, different random initializations, training data shuffles, dropout masks, patterns of data augmentation operations, and orderings of parallel arithmetic operations, are all potential sources of trial variance.
  
  Sources of randomness can make model performance unstable
9. kidrain61 16 Feb 2023
  
  in Public
  
  Summary: Examining the training curves is an easy way to identify common failure modes and can help us prioritize what actions to take next.
  
  Examining the training curves can: - identify common failure modes - help us prioritize which actions to take next
10. kidrain61 16 Feb 2023
  
  in Public
  
  In general, it can be very difficult to know if the search space has been sampled densely enough. 🤖
  
  It is generally hard to confirm whether the search space has been sampled:::densely enough 🤖
11. kidrain61 16 Feb 2023
  
  in Public
  
  basic hyperparameter axis plots where we plot the validation objective value versus one of the hyperparameters (e.g. learning rate). Each point on the plot corresponds to a single trial.
  
  A basic hyperparameter axis plot is a plot of the validation objective value against one hyperparameter
12. kidrain61 16 Feb 2023
  
  in Public
  
  If all trials are infeasible for learning rates greater than some threshold value, and if the best performing trials have learning rates at the edge of that region, the model may suffer from stability issues preventing it from accessing higher learning rates.
  
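  If all trials with a hyperparameter in a certain range are infeasible, there may be:::stability issues that prevent the model from using hyperparameters in that range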
13. kidrain61 16 Feb 2023
  
  in Public
  
  A search space is suspicious if the best point sampled from it is close to its boundary. We might find an even better point if we expanded the search range in that direction.
  
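  If the best point in the current search space lies close to the boundary in one or more dimensions, the true optimum may lie outside the search space, and the search space should be widened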
14. kidrain61 16 Feb 2023
  
  in Public
  
  For example, do the best trials have training curves consistent with problematic overfitting?
  
  For the training curves of the best trials, check whether they are:::consistent with problematic overfitting?
15. kidrain61 16 Feb 2023
  
  in Public
  
  In some cases, a large number of infeasible points can indicate a bug in the training code.
  
  Sometimes a large number of infeasible points means there is a bug in the training code
16. kidrain61 16 Feb 2023
  
  in Public
  
  reparameterizing the search space
  
  Reparameterizing the search space means:::?
17. kidrain61 16 Feb 2023
  
  in Public
  
  infeasible (i.e. trials that diverge, get really bad loss values, or fail to run at all because they violate some implicit constraint)
  
  A trial is infeasible when it: - diverges - gets very bad loss values - fails to run (because it violates some implicit constraint)
18. kidrain61 16 Feb 2023
  
  in Public
  
  Before analyzing a given set of experiments to make progress toward their original goal, we should ask ourselves the following additional questions
  
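  Before analyzing experimental data, check the following: - Is the search space wide enough? - Have enough points been sampled? - Why are some trials infeasible? - Does the model exhibit optimization issues? - What can we learn from the training curves of the best trials?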
19. kidrain61 16 Feb 2023
  
  in Public
  
  Since running experiments can be expensive, we also want to take the opportunity to extract other useful insights from each group of experiments, even if these insights are not immediately relevant to the current goal
  
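  Because running experiments can be expensive, extract as much information as possible from each group of experiments, even if it is not directly related to the original goal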
20. kidrain61 16 Feb 2023
  
  in Public
  
  For example, if our goal is to select the best optimizer out of Nesterov momentum and Adam, we could create one study in which optimizer="Nesterov_momentum" and the nuisance hyperparameters are {learning_rate, momentum}, and another study in which optimizer="Adam" and the nuisance hyperparameters are {learning_rate, beta1, beta2, epsilon}.
  
  An example of scientific vs. nuisance hyperparameters:::the choice of optimizer and that optimizer's own parameters
21. kidrain61 16 Feb 2023
  
  in Public
  
  it ensures that we obtain a relatively uniform sampling of values of the scientific hyperparameters
  
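  When searching over scientific hyperparameters, quasi-random search is generally used because it gives a relatively uniform sampling of their values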
22. kidrain61 16 Feb 2023
  
  in Public
  
  searches the scientific parameters uniformly
  
  The scientific hyperparameters should be searched as uniformly as possible
23. kidrain61 16 Feb 2023
  
  in Public
  
  conditional hyperparameters can cause problems since it is hard to specify a search space unless the set of nuisance hyperparameters is the same for all values of the scientific hyperparameters.
  
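  When scientific and nuisance hyperparameters are searched together, the conditional relationship between them (different scientific values imply different nuisance hyperparameters) makes the search space hard to specify. - Solution:::?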
24. kidrain61 16 Feb 2023
  
  in Public
  
  include the scientific parameters in the same search space as the nuisance hyperparameters and use a search algorithm to sample values of both the scientific and nuisance hyperparameters in a single study.
  
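  When there are too many scientific hyperparameters, they can be added to the search space and searched together with the nuisance hyperparameters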
25. kidrain61 16 Feb 2023
  
  in Public
  
  We can use any gradient-free optimization algorithm, including methods such as Bayesian optimization or evolutionary algorithms, to optimize over the nuisance hyperparameters
  
  Hyperparameter search algorithms: gradient-free optimization algorithms
26. kidrain61 16 Feb 2023
  
  in Public
  
  A study specifies a set of hyperparameter configurations to be run for subsequent analysis. Each configuration is called a "trial".
  
  The term "trial" refers to one hyperparameter configuration that a model run depends on
27. kidrain61 15 Feb 2023
  
  in Public
  
  the more nuisance hyperparameters we attempt to tune, the greater the risk we fail to tune them sufficiently well for each setting of the scientific hyperparameters and end up reaching the wrong conclusions from our experiments.
  
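  Too many nuisance hyperparameters make it hard, or even impossible, to find the best configuration for each setting of the scientific hyperparameters, which leads to wrong conclusions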
28. kidrain61 15 Feb 2023
  
  in Public
  
  With limitless resources, we would leave all non-scientific hyperparameters as nuisance hyperparameters so that the conclusions we draw from our experiments are free from caveats about fixed hyperparameter values.
  
  Ideally, all hyperparameters other than the scientific ones should be treated as:::nuisance hyperparameters?
29. kidrain61 15 Feb 2023
  
  in Public
  
  When designing a new round of experiments, we first identify the scientific hyperparameters for our experimental goal. At this stage, we consider all other hyperparameters to be nuisance hyperparameters.
  
  When identifying the scientific hyperparameters, all other hyperparameters should be treated as:::nuisance hyperparameters?
30. kidrain61 15 Feb 2023
  
  in Public
  
  The activation function could be a fixed hyperparameter if we have determined in prior experiments that the best choice of activation function is not sensitive to model depth, or if we are willing to limit our conclusions about the number of hidden layers to only cover this specific choice of activation function
  
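  A fixed hyperparameter satisfies one of these conditions: - its best value does not depend on the scientific hyperparameter - we have decided in advance to study model behavior only under that particular value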
31. kidrain61 15 Feb 2023
  
  in Public
  
  The learning rate is a nuisance hyperparameter because we can only fairly compare models with different numbers of hidden layers if the learning rate is tuned separately for each number of layers (the optimal learning rate generally depends on the model architecture).
  
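  Nuisance hyperparameters are those that must be tuned separately in order to compare values of the scientific hyperparameters fairly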
32. kidrain61 15 Feb 2023
  
  in Public
  
  When we are eventually ready to be greedy, we can focus purely on the validation error even if the experiments aren't maximally informative about the structure of the tuning problem.
  
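  Even if the experiments are not maximally informative about the structure of the tuning problem, we can still:::focus on the validation error?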
33. kidrain61 15 Feb 2023
  
  in Public
  
  Identify which hyperparameters the validation error is most sensitive to, which hyperparameters interact the most and therefore need to be re-tuned together, and which hyperparameters are relatively insensitive to other changes and can therefore be fixed in future experiments.
  
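  Insights about hyperparameters: - which hyperparameters affect the validation error the most - which hyperparameters are strongly coupled - which hyperparameters are relatively independent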
34. kidrain61 15 Feb 2023
  
  in Public
  
  Although one might think we would spend most of our time trying to maximize performance on the validation set, in practice we spend the majority of our time trying to gain insight into the problem, and comparatively little time greedily focused on the validation error.
  
  Gaining insight into the problem > improving validation-set performance
35. kidrain61 15 Feb 2023
  
  in Public
  
  We call it a launch when we update our best configuration (which may or may not correspond to an actual launch of a production model).
  
  A "launch" means an update to the best configuration (not necessarily in production)
36. kidrain61 15 Feb 2023
  
  in Public
  
  if an unnecessarily large step budget is chosen initially, it might be hard to change it down the road, e.g. once the learning rate schedule is tuned for that number of steps.
  
  If an overly large step budget(?) is chosen at the start, it will be:::hard to change later?
37. kidrain61 15 Feb 2023
  
  in Public
  
  training for fewer steps means that each training run is faster and uses fewer resources, boosting tuning efficiency by reducing the time between cycles and allowing more experiments to be run in parallel
  
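  Training for fewer steps means - each training run is faster (takes less time?) and uses:::fewer resources? (which allows more experiments to run in parallel)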
38. kidrain61 15 Feb 2023
  
  in Public
  
  training for more steps can improve performance and makes hyperparameter tuning easier (see Shallue et al. 2018).
  
  Training for more steps can improve model performance and make hyperparameter tuning:::easier?
39. kidrain61 15 Feb 2023
  
  in Public
  
  at minimum means that the trained model performs much better than random chance on the validation set
  
  The minimum bar for model performance: better than random chance
40. kidrain61 15 Feb 2023
  
  in Public
  
  For example, start with a constant learning rate before adding fancy decay schedules.
  
  A decay schedule is:::? Something fancier than a constant learning rate?
41. kidrain61 15 Feb 2023
  
  in Public
  
  "Simple" means avoiding bells and whistles wherever possible; these can always be added later
  
  Avoid bells and whistles at the start wherever possible; even if they are useful, they can be added later
42. kidrain61 15 Feb 2023
  
  in Public
  
  Before beginning hyperparameter tuning we must determine the starting point. This includes specifying (1) the model configuration (e.g. number of layers), (2) the optimizer hyperparameters (e.g. learning rate), and (3) the number of training steps.
  
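  The hyperparameters of the starting configuration include: 1. model configuration (e.g. number of layers) 2. optimizer hyperparameters (e.g. learning rate) 3. number of training steps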
43. kidrain61 15 Feb 2023
  
  in Public
  
  Batch norm is complicated and, in general, should use a different batch size than the gradient computation to compute statistics.
  
  Batch norm should compute its statistics using:::a batch size different from the one used for gradient computation?
44. kidrain61 15 Feb 2023
  
  in Public
  
  The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.
  
  The hyperparameters most affected by the batch size are the optimizer hyperparameters and the regularization hyperparameters
45. kidrain61 15 Feb 2023
  
  in Public
  
  The optimal values of most hyperparameters are sensitive to the batch size.
  
  The optimal values of most hyperparameters are sensitive to the batch size
46. kidrain61 15 Feb 2023
  
  in Public
  
  Choosing the batch size to minimize resource consumption
  
  Choosing the batch size to minimize resource consumption:::TODO
47. kidrain61 15 Feb 2023
  
  in Public
  
  Choosing the batch size to minimize training time
  
  Choosing the batch size to minimize training time:::TODO
48. kidrain61 14 Feb 2023
  
  in Public
  
  shouldn't be used to directly tune the validation set performance
  
  The batch size should not be used to directly tune validation-set performance?:::did not understand this
49. kidrain61 14 Feb 2023
  
  in Public
  
  This is particularly relevant in the beginning stages of a project when we are trying to find the best values of various other hyperparameters (e.g. architecture hyperparameters) while treating optimizer hyperparameters as nuisance parameters.
  
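  Optimizer hyperparameters all matter, especially in the early stages of a project, when we are trying to find the best values of various other hyperparameters while treating the optimizer hyperparameters as nuisance parameters?:::did not understand this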
50. kidrain61 14 Feb 2023
  
  in Public
  
  If the training throughput increases only up to some maximum batch size, then we should only consider batch sizes up to that maximum batch size, even if a larger batch size is supported by the hardware.
  
  In general, use the batch size that just reaches the maximum training throughput
51. kidrain61 14 Feb 2023
  
  in Public
  
  If this is not the case then the training pipeline has a bottleneck such as I/O or synchronization between compute nodes.
  
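  If the accelerators are not saturated but the step time grows, there may be another bottleneck, such as I/O or synchronization between nodes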
52. kidrain61 14 Feb 2023
  
  in Public
  
  When the accelerators aren't yet saturated, if the batch size doubles, the training throughput should also double (or at least nearly double).
  
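  While the accelerators are not saturated, the time per step stays the same, so a larger batch means a higher sample-processing rate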
53. kidrain61 14 Feb 2023
  
  in Public
  
  larger batch sizes can be more prone to overfitting and may require stronger regularization and/or additional regularization techniques
  
  Larger batch sizes may be more prone to overfitting:::why? - may require stronger regularization and/or additional regularization techniques
54. kidrain61 14 Feb 2023
  
  in Public
  
  the difference in validation set performance between two batch sizes typically goes away if the training pipeline is optimized independently for each batch size.
  
  If the training pipeline is optimized independently for each batch size, the difference in validation performance between batch sizes:::goes away?
55. kidrain61 14 Feb 2023
  
  in Public
  
  Increasing the batch size may either decrease, increase, or not change the resource consumption.
  
  The effect of the batch size on resource consumption is indeterminate?
56. kidrain61 13 Feb 2023
  
  in Public
  
  Summary: When starting a new project, try to reuse a model that already works.
  
  Prefer a model that has already been shown to work
57. kidrain61 13 Feb 2023
  
  in Public
  
  When possible, try to find a paper that tackles something as close as possible to the problem at hand and reproduce that model as a starting point.
  
  A good starting point: find a model from a paper that tackles a problem close to the one at hand, and reproduce it
58. kidrain61 13 Feb 2023
  
  in Public
  
  There is already a pipeline set up that does training and evaluation, and it is easy to execute training and prediction jobs for various models of interest
  
  What counts as a complete pipeline?
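
Two ideas from the notes above, one study per optimizer with its own nuisance hyperparameters and the check for a best point near the search boundary, can be sketched together. All names and ranges are hypothetical, the Adam space lists only a subset of the nuisance hyperparameters mentioned in the playbook, and plain random sampling stands in for the quasi-random search the playbook recommends.

```python
import math
import random

# One study per value of the scientific hyperparameter (the optimizer); each
# study has its own nuisance hyperparameters. Ranges are hypothetical.
SEARCH_SPACES = {
    "Nesterov_momentum": {"learning_rate": (1e-4, 1e-1), "momentum": (0.8, 0.99)},
    "Adam": {"learning_rate": (1e-5, 1e-2), "beta1": (0.8, 0.99)},
}


def sample_trial(optimizer, rng):
    """Draw one trial; the learning rate is sampled log-uniformly."""
    trial = {"optimizer": optimizer}
    for name, (low, high) in SEARCH_SPACES[optimizer].items():
        if name == "learning_rate":
            trial[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            trial[name] = rng.uniform(low, high)
    return trial


def near_boundary(value, low, high, fraction=0.1):
    """Flag a value lying in the outer `fraction` of a log-scaled range."""
    position = (math.log(value) - math.log(low)) / (math.log(high) - math.log(low))
    return position < fraction or position > 1 - fraction


rng = random.Random(0)
studies = {opt: [sample_trial(opt, rng) for _ in range(5)] for opt in SEARCH_SPACES}
```

If the best trial's learning rate makes `near_boundary` return True, that is the "suspicious search space" case, and widening the range in that direction is the suggested next step.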
Visit annotations in context

Annotators

kidrain61

URL

github.com/google-research/tuning_playbook
arxiv.org arxiv.org

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

23
1. kidrain61 10 Feb 2023
  
  in Public
  
  language modeling (LM) loss to generate captions given images
  
  The LM loss is used to generate captions from images
2. kidrain61 10 Feb 2023
  
  in Public
  
  Unless otherwise specified, all results reported in this paper as "BLIP" uses ViT-B
  
  The ViT variant used by BLIP is ViT-B
3. kidrain61 10 Feb 2023
  
  in Public
  
  shares the same cross-attention layers
  
  BLIP's criterion for sharing parameters is:::what?
4. kidrain61 10 Feb 2023
  
  in Public
  
  image-text matching (ITM) loss to distinguish between positive and negative image-text pairs
  
  The ITM loss distinguishes how well image-text pairs match?
5. kidrain61 10 Feb 2023
  
  in Public
  
  an image-text contrastive (ITC) loss to align the vision and language representations.
  
  ITC aligns the vision and language representations
6. kidrain61 10 Feb 2023
  
  in Public
  
  layers except for SA leads to better performance compared to not sharing
  
  In BLIP, sharing all layers (parameters) except SA between the text decoder and encoder performs better than not sharing
7. kidrain61 10 Feb 2023
  
  in Public
  
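  If the SA layers are shared, the model's performance would degrade due to the conflict between the encoding task and the decoding task
  
  If BLIP shared the SA layers between the decoder and the encoder, the conflict between the tasks would degrade model performance - what role do the SA layers play in the decoder and in the encoder?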
8. kidrain61 10 Feb 2023
  
  in Public
  
  During pre-training, the text encoder and decoder share all parameters except for the self-attention layers
  
  During BLIP pre-training, the text encoder and decoder share all parameters except the self-attention layers
9. kidrain61 10 Feb 2023
  
  in Public
  
  We use the same pre-training dataset as Li et al. (2021a) with 14M images in total, including two human-annotated datasets (COCO and Visual Genome (Krishna et al., 2017)), and three web datasets (Conceptual Captions (Changpinyo et al., 2021), Conceptual 12M (Changpinyo et al., 2021), SBU captions (Ordonez et al., 2011)). We also experimented with an additional web dataset, LAION (Schuhmann et al., 2021), which contains 115M images with more noisy texts
  
  Datasets used by BLIP
10. kidrain61 10 Feb 2023
  
  in Public
  
  increase the image resolution to 384 × 384 during finetuning
  
  BLIP increases the image resolution to 384 × 384 during fine-tuning, in order to:::what purpose?
11. kidrain61 10 Feb 2023
  
  in Public
  
  take random image crops of resolution 224 × 224 during pre-training
  
  During pre-training BLIP takes random 224 × 224 image crops, in order to:::what purpose?
12. kidrain61 10 Feb 2023
  
  in Public
  
  AdamW (Loshchilov & Hutter, 2017) optimizer with a weight decay of 0.05. The learning rate is warmed-up to 3e-4 (ViT-B) / 2e-4 (ViT-L) and decayed linearly with a rate of 0.85.
  
  BLIP uses the AdamW optimizer - weight decay = 0.05 - lr - warm-up to 3e-4 (ViT-B) / 2e-4 (ViT-L) - decay rate = 0.85
13. kidrain61 10 Feb 2023
  
  in Public
  
  a [CLS] token is appended to the beginning of the text input to summarize the sentence
  
  A [CLS] token is added at the beginning of the text input to summarize the whole sentence?
14. kidrain61 10 Feb 2023
  
  in Public
  
  has been adopted by the more recent methods (Li et al., 2021a; Kim et al., 2021).
  
  In 2021, ViT gradually came into use for visual feature extraction
15. kidrain61 10 Feb 2023
  
  in Public
  
  ViT is more computation-friendly
  
  ViT is more computation-friendly than pre-trained object detectors
16. kidrain61 10 Feb 2023
  
  in Public
  
  using pre-trained object detectors for visual feature extraction (Chen et al., 2020)
  
  Using pre-trained object detectors to extract visual features (2020)
17. kidrain61 10 Feb 2023
  
  in Public
  
  divides an input image into patches and encodes them as a sequence of embeddings, with an additional [CLS] token to represent the global image feature
  
  How ViT is used in MED: 1. divide the input image into patches? 2. encode the patches as a sequence of embeddings 3. use a [CLS] token to represent the global image feature?
18. kidrain61 10 Feb 2023
  
  in Public
  
  3.1. Model Architecture
  
  MED model architecture
19. kidrain61 10 Feb 2023
  
  in Public
  
  Captioning and Filtering (CapFilt): a new dataset bootstrapping method for learning from noisy image-text pairs.
  
  CapFilt is a dataset bootstrapping method for learning from noisy (image-text pair) data
20. kidrain61 10 Feb 2023
  
  in Public
  
  An MED can operate either as a unimodal encoder, or an image-grounded text encoder, or an image-grounded text decoder
  
  The three main components of MED
21. kidrain61 10 Feb 2023
  
  in Public
  
  Multimodal mixture of Encoder-Decoder (MED): a new model architecture for effective multi-task pre-training and flexible transfer learning
  
  MED enables effective multi-task pre-training and flexible transfer learning
22. kidrain61 10 Feb 2023
  
  in Public
  
  encoder-decoder models have not been successfully adopted for image-text retrieval tasks.
  
  Encoder-decoder models are hard to apply to image-text retrieval tasks?
23. kidrain61 10 Feb 2023
  
  in Public
  
  encoder-based models are less straightforward to directly transfer to text generation tasks (e.g. image captioning),
  
  Models based only on an encoder are hard to transfer directly to text generation tasks
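
The effect of the two image resolutions noted above on the ViT input sequence can be worked out with a few lines of arithmetic. The patch size of 16 is an assumption (it is not stated in the highlighted passages), and the function name is hypothetical.

```python
def vit_sequence_length(image_size, patch_size=16):
    """Number of patch embeddings plus one [CLS] token for a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


pretrain_tokens = vit_sequence_length(224)   # 14 * 14 + 1 = 197
finetune_tokens = vit_sequence_length(384)   # 24 * 24 + 1 = 577
```

Under this assumption, fine-tuning at the higher resolution roughly triples the number of visual tokens compared with pre-training.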
Visit annotations in context

Annotators

kidrain61

URL

arxiv.org/pdf/2201.12086
blog.salesforceairesearch.com blog.salesforceairesearch.com

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

2
1. kidrain61 10 Feb 2023
  
  in Public
  
  Demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
  
  BLIP generalizes well enough to transfer zero-shot directly to video-language tasks?:::How is that achieved?
2. kidrain61 10 Feb 2023
  
  in Public
  
  using a stochastic decoding method (nucleus sampling) is better than using beam search for caption generation, due to the higher level of diversity in the synthetic captions.
  
  A stochastic decoding method works better than beam search because the synthetic captions are more diverse?
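
Nucleus (top-p) sampling, the stochastic decoding method mentioned above, can be sketched over a single next-token distribution. The function name, the toy probabilities, and the value top_p=0.7 are hypothetical; a real captioner would apply this step repeatedly to model outputs.

```python
import random


def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample a token id from the smallest set whose total probability >= top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    nucleus, total = [], 0.0
    for token_id, p in ranked:
        nucleus.append((token_id, p))
        total += p
        if total >= top_p:
            break
    # Draw within the nucleus, which implicitly renormalises its probabilities.
    threshold = rng.random() * total
    running = 0.0
    for token_id, p in nucleus:
        running += p
        if threshold < running:
            return token_id
    return nucleus[-1][0]


probs = [0.5, 0.3, 0.15, 0.05]   # toy next-token distribution
rng = random.Random(0)
samples = {nucleus_sample(probs, top_p=0.7, rng=rng) for _ in range(200)}
```

Unlike beam search, which keeps returning the highest-scoring continuations, repeated draws here vary between the tokens inside the nucleus while never picking the low-probability tail.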
Visit annotations in context

Annotators

kidrain61

URL

blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/
