31 Matching Annotations
  1. Last 7 days
    1. The advantage of using a jump table over a long sequence of if-else statements is that the time taken to perform the switch is independent of the number of switch cases.

      What is the advantage of a jump table over if-else?
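A minimal C++ sketch of the idea, assuming a hypothetical dispatcher with three cases plus a default. An array of function pointers indexed by the case value plays the role of the jump table, so the lookup cost does not grow with the number of cases. All names are illustrative, and a real compiler decides on its own whether a `switch` becomes a jump table.

```cpp
#include <array>
#include <cstddef>

// Hypothetical handlers, one per switch case.
long case_add(long x) { return x + 10; }
long case_mul(long x) { return x * 13; }
long case_sq(long x) { return x * x; }
long case_default(long) { return 0; }

// if-else chain: the number of comparisons grows with the number of cases.
long dispatch_chain(std::size_t n, long x) {
    if (n == 0) return case_add(x);
    else if (n == 1) return case_mul(x);
    else if (n == 2) return case_sq(x);
    else return case_default(x);
}

// Jump-table style: one bounds check, then one indexed indirect call.
long dispatch_table(std::size_t n, long x) {
    static const std::array<long (*)(long), 3> table = {case_add, case_mul, case_sq};
    if (n >= table.size()) return case_default(x);
    return table[n](x);
}
```

Both functions return the same results; only the way the case is located differs.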



  2. Jan 2022
    1. If one of those two expressions could possibly generate an error condition or a side effect, this could lead to invalid behavior. Such is the case for our earlier example

      In which cases must branching be used instead of a conditional move?
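The contrast can be sketched in C++ with two hypothetical functions. A conditional move evaluates both candidate values before selecting one, so it is only safe when neither evaluation can fault or cause a side effect. The function names below are assumptions for illustration.

```cpp
// Needs a branch: *xp may only be evaluated when xp is non-null.
// Evaluating both arms up front would dereference a null pointer.
long read_or_zero(const long* xp) {
    return xp ? *xp : 0;
}

// Candidate for a conditional move: y - x and x - y have no side effects,
// so computing both and then selecting one is harmless.
long absdiff(long x, long y) {
    return x < y ? y - x : x - y;
}
```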

    2. The test instructions behave in the same manner as the and instructions, except that they set the condition codes without altering their destinations.

      What does the test instruction do?

    3. The cmp instructions set the condition codes according to the differences of their two operands. They behave in the same way as the sub instructions, except that they set the condition codes without updating their destinations.

      What do the cmp instructions do?

    4. By using a PC-relative encoding of the jump targets, the instructions can be compactly encoded (requiring just 2 bytes), and the object code can be shifted to different positions in memory without alteration.

      How is a PC-relative encoding computed, and what are its advantages?
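As a hedged illustration, the target of a jump with a 1-byte PC-relative offset is the address of the instruction following the jump plus the signed offset. The helper name `jump_target` and the addresses in the usage note are assumptions made up for this sketch.

```cpp
#include <cstdint>

// Target = address of the instruction after the jump + signed 8-bit offset.
std::uint64_t jump_target(std::uint64_t next_instr_addr, std::uint8_t encoded_offset) {
    // Interpret the encoded byte as a two's-complement value.
    auto offset = static_cast<std::int8_t>(encoded_offset);
    return next_instr_addr + static_cast<std::int64_t>(offset);
}
```

For example, a 2-byte jump located at 0x3 with offset byte 0x03 targets 0x5 + 3 = 0x8, and one located at 0xb with offset byte 0xf8 (that is, -8) targets 0xd - 8 = 0x5. Because only the distance is encoded, moving the whole code block leaves the bytes unchanged.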

    5. It is important to recognize that the suffixes for these instructions denote different conditions and not different operand sizes. For example, instructions setl and setb denote “set less” and “set below,” not “set long word” or “set byte.”

      What do the suffixes of the set instructions mean?
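A small C++ sketch of the difference between the two conditions named above: "less" is a signed comparison, "below" an unsigned one, applied to the same bit patterns. The function names are hypothetical and only mirror the instruction names.

```cpp
#include <cstdint>

// Signed comparison: corresponds to the "less" condition (setl).
bool set_less(std::int64_t a, std::int64_t b) { return a < b; }

// Unsigned comparison of the same bit patterns: the "below" condition (setb).
bool set_below(std::int64_t a, std::int64_t b) {
    return static_cast<std::uint64_t>(a) < static_cast<std::uint64_t>(b);
}
```

With a = -1 and b = 1, the signed test succeeds, while the unsigned test fails because -1 has the bit pattern of the largest unsigned value.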

  3. Dec 2021
    1. one for unsigned (mulq) and one for two’s-complement (imulq) multiplication. For both of these instructions, one argument must be in register %rax, and the other is given as the instruction source operand.

      What do mulq and imulq each denote, and what are the requirements on their operands?
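The one-operand multiply produces a full 128-bit product, with the high half in %rdx and the low half in %rax. A sketch of the same computation in C++, assuming a compiler that supports `unsigned __int128` (a GCC/Clang extension, not standard C++); the struct and function names are hypothetical.

```cpp
#include <cstdint>

struct Product128 {
    std::uint64_t high;  // the half mulq leaves in %rdx
    std::uint64_t low;   // the half mulq leaves in %rax
};

// Full 64 x 64 -> 128-bit unsigned product.
Product128 full_multiply(std::uint64_t x, std::uint64_t y) {
    unsigned __int128 p = static_cast<unsigned __int128>(x) * y;
    return {static_cast<std::uint64_t>(p >> 64), static_cast<std::uint64_t>(p)};
}
```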

    2. The different shift instructions can specify the shift amount either as an immediate value or with the single-byte register %cl.

      Which operands can the shift instructions accept?

  4. Nov 2021
    1. As with the mov instructions, the two operands cannot both be memory locations.

      Can both operands of a binary operation be memory locations?

    2. This operand can be either a register or a memory location.

      What can the operand of a unary operation be?

    3. The destination operand must be a register.

      What must the destination of load effective address be?

    4. The ability of the leaq instruction to perform addition and limited forms of multiplication proves useful when compiling simple arithmetic expressions such as this example.

      In which situations is leaq useful?
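A sketch of how such an expression decomposes into address-style computations of the form base + index * {1, 2, 4, 8}. The expression `x + 4*y + 12*z` and the function name are assumptions chosen for illustration; the comments describe one plausible instruction selection, not what any particular compiler is guaranteed to emit.

```cpp
// Computes x + 4*y + 12*z using only steps that fit a single lea each.
long scale(long x, long y, long z) {
    long t1 = x + 4 * y;   // one lea: base x, index y, scale 4
    long t2 = z + 2 * z;   // one lea: 3*z written as z + z*2
    return t1 + 4 * t2;    // one lea: t1 + t2*4 = x + 4*y + 12*z
}
```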

    5. local variables such as x are often kept in registers rather than stored in memory locations. Register access is much faster than memory access.

      Where are local variables usually kept, and why?

    6. we see that what we call “pointers” in C are simply addresses. Dereferencing a pointer involves copying that pointer into a register, and then using this register in a memory reference.

      How is dereferencing a pointer implemented in assembly code?

    7. One important feature is that memory references in x86-64 are always given with quad word registers, such as %rax, even if the operand is a byte, single word, or double word.

      Which register type do memory references use?

    8. logically be named movzlq, but this instruction does not exist. Instead, this type of data movement can be implemented using a movl instruction having a register as the destination. This technique takes advantage of the property that an instruction generating a 4-byte value with a register as the destination will fill the upper 4 bytes with zeros.

      Why is there no movzlq among the movz instructions?

    9. in memory, to a register destination. Instructions in the movz class fill out the remaining bytes of the destination with zeros, while those in the movs class fill them out by sign extension, replicating copies of the most significant bit of the source operand.

      Which two classes of move instructions copy a smaller source to a larger destination, and how does each fill the remaining bytes?
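The two extension behaviours can be sketched with integer conversions in C++, here for a 1-byte source widened to 8 bytes. The function names are hypothetical; the casts model what the movz and movs classes do to the upper bytes.

```cpp
#include <cstdint>

// Zero extension (movz-style): the upper bytes become 0.
std::uint64_t zero_extend_byte(std::uint8_t b) { return b; }

// Sign extension (movs-style): the upper bytes copy the source's top bit.
std::uint64_t sign_extend_byte(std::uint8_t b) {
    auto as_signed = static_cast<std::int8_t>(b);
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(as_signed));
}
```

For the byte 0xAA, whose top bit is set, zero extension gives 0x00000000000000AA and sign extension gives 0xFFFFFFFFFFFFFFAA.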

    10. The source operand designates a value that is immediate, stored in a register, or stored in memory. The destination operand designates a location that is either a register or a memory address. x86-64 imposes the restriction that a move instruction cannot have both operands refer to memory locations. Copying a value from one memory location to another requires two instructions: the first to load the source value into a register, and the second to write this register value to the destination.

      Which kinds of operand can the source and the destination of a move be?

    11. The most general form is shown at the bottom of the table with syntax Imm(rb,ri,s). Such a reference has four components: an immediate offset Imm, a base register rb, an index register ri, and a scale factor s, where s must be 1, 2, 4, or 8. Both the base and index must be 64-bit registers. The effective address is computed as Imm + R[rb] + R[ri] · s.

      How is the memory address for $$Imm(r_b, r_i, s)$$ computed, and what restrictions apply?
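The formula can be written out as a small C++ helper. This is a sketch only: the function name, the out-parameter convention, and the register values in the usage note are assumptions for illustration.

```cpp
#include <cstdint>

// Effective address of Imm(rb, ri, s) = Imm + R[rb] + R[ri] * s.
// Returns false when the scale factor is not one of 1, 2, 4, 8.
bool effective_address(std::int64_t imm, std::uint64_t rb, std::uint64_t ri,
                       unsigned s, std::uint64_t& out) {
    if (s != 1 && s != 2 && s != 4 && s != 8) return false;
    out = static_cast<std::uint64_t>(imm) + rb + ri * s;
    return true;
}
```

With Imm = 9, R[rb] = 0x100, R[ri] = 0x3 and s = 4, the result is 0x9 + 0x100 + 0xC = 0x115.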

    12. C declaration | Intel data type | Assembly-code suffix | Size (bytes)

      What are the sizes of the different data types, and what are their suffixes in assembly?

    13. A final difference is that we see two additional lines of code (lines 8–9). These instructions will have no effect on the program, since they occur after the return instruction (line 7). They have been inserted to grow the code for the function to 16 bytes, enabling a better placement of the next block of code in terms of memory system performance.

      Why does disassembled code sometimes show nop padding after ret?

    14. Its main feature is that it is in a more readable textual format, as compared to the binary format of machine code.

      What is the main difference between assembly code and machine code?

    1. The reinterpret_cast operator does not change the value of the operand in parentheses; it reinterprets the object's bit pattern.

      How should reinterpret_cast be understood in C++?
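A minimal sketch of "same bits, different interpretation": the cast below leaves the integer untouched and only changes the type through which its bytes are viewed. Inspecting an object through `unsigned char*` is one of the uses the aliasing rules allow; the helper name `byte_of` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Returns byte i of the object representation of value.
// The value itself is not modified or converted by the cast.
unsigned char byte_of(const std::uint32_t& value, std::size_t i) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&value);
    return bytes[i];  // caller must keep i < sizeof(value)
}
```

Which byte appears at index 0 depends on the machine's byte order, so the sketch makes no claim about it.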

    1. A namespace is a scope. C++ provides namespaces to prevent name conflicts.

      What is a namespace for?

    1. But the other effect of unnamed namespaces is that all identifiers inside an unnamed namespace are treated as if they had internal linkage, which means that the content of an unnamed namespace can’t be seen outside of the file in which the unnamed namespace is defined.

      What is an unnamed namespace for?
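Both points can be sketched in a few lines. The namespace and function names below are made up for illustration: two named namespaces hold functions with the same name without conflict, and a helper in an unnamed namespace gets internal linkage, so it cannot be referenced from another file.

```cpp
namespace metric   { double to_base(double km)    { return km * 1000.0; } }     // metres
namespace imperial { double to_base(double miles) { return miles * 5280.0; } }  // feet

namespace {
// Internal linkage: this helper is invisible to other translation units.
double twice(double v) { return 2.0 * v; }
}

double demo() {
    // Same function name, no conflict: the namespace qualifies each call.
    return twice(metric::to_base(1.0)) + imperial::to_base(1.0);
}
```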

    1. One of the best things about classes is that they contain destructors that automatically get executed when an object of the class goes out of scope. So if you allocate (or acquire) memory in your constructor, you can deallocate it in your destructor, and be guaranteed that the memory will be deallocated when the class object is destroyed (regardless of whether it goes out of scope, gets explicitly deleted, etc…).

      What is the principle behind smart pointers?
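The mechanism can be sketched with a minimal owning class, a simplified stand-in for `std::unique_ptr<int>` rather than a replacement for it. The class name and the `destroyed` counter are hypothetical; the counter exists only so that the destructor call can be observed.

```cpp
// Acquires memory in the constructor and releases it in the destructor.
class IntOwner {
public:
    explicit IntOwner(int value) : ptr_(new int(value)) {}
    ~IntOwner() { delete ptr_; ++destroyed; }  // runs automatically at scope exit

    IntOwner(const IntOwner&) = delete;             // single owner: no copies
    IntOwner& operator=(const IntOwner&) = delete;

    int& operator*() const { return *ptr_; }

    static int destroyed;  // counts destructor calls, for demonstration only
private:
    int* ptr_;
};
int IntOwner::destroyed = 0;

int use_and_release() {
    IntOwner p(41);
    *p += 1;
    return *p;  // the int is freed by p's destructor once the result is copied out
}
```

No explicit `delete` appears in `use_and_release`; leaving the scope is what frees the memory.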

    1. Three techniques to avoid losing critical information at half-precision:

      1. Full-precision master copy of weights. Maintain a full precision (FP32) copy of model weights that accumulates gradients. The numbers are rounded up to half-precision for forward & backward passes. The motivation is that each gradient update (i.e. gradient times the learning rate) might be too small to be fully contained within the FP16 range (i.e. $$2^{-24}$$ becomes zero in FP16).

      2. Loss scaling. Scale up the loss to better handle gradients with small magnitudes (See Fig. 16). Scaling up the gradients helps shift them to occupy a larger section towards the right section (containing larger values) of the representable range, preserving values that are otherwise lost.

      3. Arithmetic precision. For common network arithmetic (e.g. vector dot-product, reduction by summing up vector elements), we can accumulate the partial results in FP32 and then save the final output as FP16 before saving into memory. Point-wise operations can be executed in either FP16 or FP32.
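The effect of loss scaling can be sketched without real FP16 hardware. The code below is a deliberately crude model, not an actual half-precision conversion: it only imitates underflow by flushing magnitudes below half of the smallest positive FP16 value to zero, and it ignores mantissa rounding and overflow. The function names and the scale factor in the usage note are assumptions.

```cpp
#include <cmath>

// Crude stand-in for FP32 -> FP16 conversion that models underflow only.
// Magnitudes below 2^-25 are treated as rounding to zero.
float fp16_underflow_model(float x) {
    const float half_min_subnormal = std::ldexp(1.0f, -25);
    return std::fabs(x) < half_min_subnormal ? 0.0f : x;
}

// Scale before the half-precision step, unscale in FP32 afterwards.
float scaled_gradient(float grad, float loss_scale) {
    float stored = fp16_underflow_model(grad * loss_scale);  // what FP16 would keep
    return stored / loss_scale;                               // recovered in FP32
}
```

In this model a gradient of 1e-8 is lost with a scale of 1 but survives with a scale of 1024, because the scaled value lands inside the representable range.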


    2. Two major sources of memory consumption in large model training: The majority is occupied by model states, including optimizer states (e.g. Adam momentums and variances), gradients and parameters. Mixed-precision training demands a lot of memory since the optimizer needs to keep a copy of FP32 parameters and other optimizer states, besides the FP16 version. The remainder is consumed by activations, temporary buffers and unusable fragmented memory (named residual states in the paper).


    3. It partitions optimizer state, gradients and parameters across multiple data parallel processes via a dynamic communication schedule to minimize the communication volume.

      What is the principle behind ZeRO-DP?
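A back-of-the-envelope sketch of what partitioning buys per device. It assumes mixed-precision Adam with 2 bytes per parameter for FP16 parameters, 2 for FP16 gradients, and 12 for the FP32 parameter copy, momentum, and variance; that accounting and the stage numbering are assumptions of this sketch, and communication buffers and residual states are ignored. The function name is hypothetical.

```cpp
// Per-device bytes of model states for psi parameters across n data-parallel
// processes. stage 0: nothing partitioned; 1: optimizer states partitioned;
// 2: gradients as well; 3: parameters as well.
double model_state_bytes(double psi, int n, int stage) {
    double params = 2.0 * psi;   // FP16 parameters
    double grads = 2.0 * psi;    // FP16 gradients
    double optim = 12.0 * psi;   // FP32 parameters, momentum, variance
    if (stage >= 1) optim /= n;
    if (stage >= 2) grads /= n;
    if (stage >= 3) params /= n;
    return params + grads + optim;
}
```

Under these assumptions, a model with 1.5 billion parameters needs 24 GB of model states per device without partitioning and 6 GB when all three components are split across 4 processes.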

    4. Asynchronous parallel (ASP): Every GPU worker processes the data asynchronously, no waiting or stalling. However, it can easily lead to stale weights being used and thus lower the statistical learning efficiency. Even though it increases the computation time, it may not speed up training time to convergence.

      What is ASP, and what are its advantages and disadvantages?

    5. Bulk synchronous parallel (BSP): Workers sync data at the end of every minibatch. It prevents stale model weights and gives good learning efficiency, but each machine has to halt and wait for others to send gradients.

      What is BSP, and what are its advantages and disadvantages?