Hypothesis

27 Matching Annotations

Feb 2025
docs.nvidia.com docs.nvidia.com

Programming Guide :: CUDA Toolkit Documentation

27
1. MaverickCoder 26 Feb 2025
  
  in Public
  
  __shared__ cuda::pipeline_shared_state< cuda::thread_scope::thread_scope_block, stages_count > shared_state;
  
  封装了允许管道处理多达 count 个并发阶段的有限资源。
2. MaverickCoder 26 Feb 2025
  
  in Public
  
  producer_acquire
  
  获取流水线内部队列中一个可用阶段，使当前线程等待，知道流水线的生产者阶段可用。如果所有资源都在使用中，则 pipeline.producer_acquire() 会阻塞生产者线程。
3. MaverickCoder 26 Feb 2025
  
  in Public
  
  consumer_release
  
  将流水线中最旧的阶段释放回流水线对象来重复使用。释放的阶段随后可以被生产者重新获取。释放消费者阶段的资源，表示当前阶段的操作已经完成。循环处理下一个批次。
4. MaverickCoder 26 Feb 2025
  
  in Public
  
  consumer_wait
  
  等待流水线中最旧阶段的所有异步操作完成，使当前线程等待，知道流水线的消费者阶段完成前一个阶段的操作
5. MaverickCoder 26 Feb 2025
  
  in Public
  
  producer_commit
  
  提交在当前已获取的流水线阶段上，producer_acquire 调用之后发出的异步操作。提交生产者阶段的操作，表示数据已经准备好给消费者使用。
6. MaverickCoder 26 Feb 2025
  
  in Public
  
  7.27.6.6. Keep Commit and Arrive-On Operations Converged
  
  建议在进行 commit 和 arrive-on 操作时，所有线程应该是收敛的。这样可以避免不必要的等待，并且减少对屏障对象的更新操作。
7. MaverickCoder 26 Feb 2025
  
  in Public
  
  7.27.6.5. Warp Entanglement - Arrive-On
  
  Arrive-On 操作用于更新某个同步屏障（barrier）
8. MaverickCoder 26 Feb 2025
  
  in Public
  
  7.27.6.4. Warp Entanglement - Wait
  
  Wait 操作描述了线程如何等待批次的完成。pipeline_consumer_wait_prior<N>() 或 pipeline::consumer_wait() 用于等待某个批次完成，等待的批次序列号是根据 warp 的收敛程度而不同的。
9. MaverickCoder 26 Feb 2025
  
  in Public
  
  7.27.6.3. Warp Entanglement - Commit
  
  Warp 是一组并行执行的线程。这里描述了 warp-shared pipeline（warp共享流水线）中批次序列如何协同工作。
10. MaverickCoder 26 Feb 2025
  
  in Public
  
  Trivially copyable
  
  如果传递给 memcpy_async 的指针所指向的数据类型不是“Trivially Copyable（可以简单复制的类型）”，那么复制每个元素时需要调用其拷贝构造函数。这种情况下，无法利用 cp.async 系列指令来加速 memcpy_async 操作。
11. MaverickCoder 26 Feb 2025
  
  in Public
  
  On devices with compute capability 8.0, the cp.async family of instructions allows copying data from global to shared memory asynchronously. These instructions support copying 4, 8, and 16 bytes at a time. If the size provided to memcpy_async is a multiple of 4, 8, or 16, and both pointers passed to memcpy_async are aligned to a 4, 8, or 16 alignment boundary, then memcpy_async can be implemented using exclusively asynchronous memory operations.
  
  在设备拥有8.0计算能力时，cp.async系列指令允许从设备内部高效地复制数据。这些异步内存操作指令通过支持对 4、8、16 字节的数据块进行传输，可以显著提高数据复制的效率。在使用 memcpy_async() 时，如果提供的大小是 4、8 或 16 字节的倍数，且源指针和目标指针都按 4、8 或 16 字节对齐，那么 memcpy_async() 可以通过完全异步的方式来完成数据的复制。
12. MaverickCoder 26 Feb 2025
  
  in Public
  
  memcpy_async
  
  使用cudaMalloc 分配内存：使用cudaMalloc分配全局内存的时候，返回的指针通常已经满足128字节对齐要求。
13. MaverickCoder 26 Feb 2025
  
  in Public
  
  Additionally for achieving best performance when using memcpy_async API, an alignment of 128 Bytes for both shared memory and global memory is required.
  
  为了得到更好的新能，共享内存和全局内存最好是以 128 字节对齐使用 align 关键字，在定义共享内存和全局内存的时候，可以使用这个关键字来指定对齐方式： align(128) int sharedMemory[SIZE];
14. MaverickCoder 26 Feb 2025
  
  in Public
  
  the pipeline mechanism is shared among CUDA threads in the same CUDA warp. This sharing causes batches of memcpy_async to be entangled within a warp
  
  流水线机制被同一warp中的所有线程所共享。这种共享会导致 memcpy_async 被纠缠在warp中，这在某种情况下会影响性能。
15. MaverickCoder 26 Feb 2025
  
  in Public
  
  barrier.arrive_and_wait(); // Waits for all copies to complete
  
  等待操作变成 arrive and wait
16. MaverickCoder 26 Feb 2025
  
  in Public
  
  __shared__ cuda::barrier<cuda::thread_scope::thread_scope_block> barrier; if (block.thread_rank() == 0) { init(&barrier, block.size()); // Friend function initializes barrier } block.sync();
  
  声明一个块级别的屏障同步对象barrier，并在线程块的第一个线程中初始它。然后，调用 block.sync() 确保所有线程在继续执行之前都已经到达此点。
17. MaverickCoder 26 Feb 2025
  
  in Public
  
  incrementing the expected count of the current phase on creation, and decrementing it on completion of the copy operation, such that the phase of the barrier will only advance when all threads participating in the barrier have arrived, and all memcpy_async bound to the current phase of the barrier have completed.
  
  在创建时增加当前阶段的预期计数，并在复制操作完成时减少该计数，从而确保只有当所有参与屏障的线程都到达并且所有与当前屏障阶段绑定的 memcpy_async 都完成时，屏障的阶段才会前进。
18. MaverickCoder 26 Feb 2025
  
  in Public
  
  cooperative_groups::memcpy_async
  
  在执行 memcpy_async 异步拷贝操作时，拷贝过程跟前序的指令是异步执行的，不会阻塞当前线程的执行。为了确保数据一致性和避免数据竞争，需要在拷贝完成之后 cooperative_groups::wait 进行同步
19. MaverickCoder 26 Feb 2025
  
  in Public
  
  On devices with compute capability 8.0 or higher, memcpy_async transfers from global to shared memory can benefit from hardware acceleration, which avoids transfering the data through an intermediate register.
  
  硬件加速，避免了使用中间寄存器传输数据
20. MaverickCoder 26 Feb 2025
  
  in Public
  
  Until the copy operation completes, modifying the global data or reading or writing the shared data introduces a data race.
  
  在拷贝数据完成之气那，修改global mem 或者是写 shared mem 会造成数据竞争。
21. MaverickCoder 26 Feb 2025
  
  in Public
  
  block.sync(); // Wait for compute using shared memory to finish
  
  第二次同步
22. MaverickCoder 26 Feb 2025
  
  in Public
  
  block.sync(); // Wait for all copies to complete
  
  第一次同步
23. MaverickCoder 26 Feb 2025
  
  in Public
  
  each thread block needs to synchronize after the shared[local_idx] = global[global_idx] assignment, to ensure all writes to shared memory have completed before the compute phase can begin. The thread block also needs to synchronize again after the compute phase, to prevent overwriting shared memory before all threads have completed their computations. This pattern is illustrated in the following code snippet.
  
  为了确保在计算阶段开始之前完成对共享内存的写入，需要同步一次。然后，为了防止在所有的线程完成他们的计算之前覆盖共享内存，在计算阶段之后也需要同步一次。
24. MaverickCoder 26 Feb 2025
  
  in Public
  
  With memcpy_async improves the previous example by introducing the memcpy_async and the cuda::memcpy_async APIs to directly copy data from global to shared memory without using intermediate registers.
  
  不需要使用中间寄存器，从global 到 shared memory
25. MaverickCoder 26 Feb 2025
  
  in Public
  
  Without memcpy_async introduces an example that does not overlap computation with data movement and uses an intermediate register to copy data.
  
  不会重叠计算，并且需要使用中间寄存器来copy数据
26. MaverickCoder 26 Feb 2025
  
  in Public
  
  memcpy_async API
  
  主要用于在设备内核中执行异步的数据传输操作，特别是从全局内存到共享内存的传输。它允许在内核执行期间重叠计算和数据传输
27. MaverickCoder 26 Feb 2025
  
  in Public
  
  Asynchronous
  
  操作不需要等待前一个操作完成即可开始
Visit annotations in context

Annotators

MaverickCoder

URL

docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Annotators

URL