8 Matching Annotations
  1. Apr 2024
    1. Reducing atomic reference counting

      The final optimization we will discuss in this article is how the new scheduler reduces the amount of atomic reference counts needed. There are many outstanding references to the task structure: the scheduler and each waker hold a handle. A common way to manage this memory is to use atomic reference counting. This strategy requires an atomic operation each time a reference is cloned and an atomic operation each time a reference is dropped. When the final reference goes out of scope, the memory is freed.

      In the old Tokio scheduler, each waker held a counted reference to the task handle, roughly:

      ```rust
      struct Waker {
          task: Arc<Task>,
      }

      impl Waker {
          fn wake(&self) {
              let task = self.task.clone();
              task.scheduler.schedule(task);
          }
      }
      ```

      When the task is woken, the reference is cloned (atomic increment). The reference is then pushed into the run queue. When the processor receives the task and is done executing it, it drops the reference resulting in an atomic decrement. These atomic operations add up.

      This problem has previously been identified by the designers of the std::future task system. It was observed that when Waker::wake is called, often times the original waker reference is no longer needed. This allows for reusing the atomic reference count when pushing the task into the run queue. The std::future task system now includes two "wake" APIs:

      • wake which takes self
      • wake_by_ref which takes &self.

      This API design pushes the caller to use wake which avoids the atomic increment. The implementation now becomes:

      ```rust
      impl Waker {
          fn wake(self) {
              task.scheduler.schedule(self.task);
          }

          fn wake_by_ref(&self) {
              let task = self.task.clone();
              task.scheduler.schedule(task);
          }
      }
      ```

      This avoids the overhead of additional reference counting only if it is possible to take ownership of the waker in order to wake. In my experience, it is almost always desirable to wake with &self instead. Waking with self prevents reusing the waker (useful in cases where the resource sends many values, i.e. channels, sockets, ...) it also is more difficult to implement thread-safe waking when self is required (the details of this will be left to another article).

      The new scheduler side steps the entire "wake by self" issue by avoiding the atomic increment in wake_by_ref, making it as efficient as wake(self). This is done by having the scheduler maintain a list of all tasks currently active (have not yet completed). This list represents the reference count needed to push a task into the run queue.

      The difficulty with this optimization is to ensure that the scheduler will not drop any tasks from its list until it can be guaranteed that the task will not be pushed into the run queue again. The specifics of how this is managed are beyond the scope of this article, but I urge you to further investigate this in the source.

      To reduce the use of atomic reference counting, the new scheduler makes the following optimization. Normally, the scheduler and every waker each hold a handle to the task structure, and the common way to manage that memory is atomic reference counting. This strategy requires one atomic operation each time a reference is cloned and another each time a reference is dropped. When the final reference goes out of scope, the memory is freed.

      In the old Tokio scheduler, each waker held a counted reference to the task handle, roughly:

      ```rust
      struct Waker {
          task: Arc<Task>,
      }

      impl Waker {
          fn wake(&self) {
              let task = self.task.clone();
              task.scheduler.schedule(task);
          }
      }
      ```

      When the task is woken, the reference is cloned (an atomic increment). The reference is then pushed into the run queue. When the processor receives the task and finishes executing it, it drops the reference, resulting in an atomic decrement. These atomic operations add up.

      The designers of the std::future task system identified this problem earlier. They observed that when Waker::wake is called, the original waker reference is often no longer needed, which allows the existing atomic reference count to be reused when pushing the task into the run queue. The std::future task system now includes two "wake" APIs:

      • wake, which takes self
      • wake_by_ref, which takes &self

      This API design pushes callers toward wake, which avoids the atomic increment. The implementation now becomes:

      ```rust
      impl Waker {
          fn wake(self) {
              self.task.scheduler.schedule(self.task);
          }

          fn wake_by_ref(&self) {
              let task = self.task.clone();
              task.scheduler.schedule(task);
          }
      }
      ```

      This avoids the extra reference-counting overhead only when ownership of the waker can be taken in order to wake. In my experience, it is almost always preferable to wake with &self instead. Waking with self prevents reusing the waker (useful when a resource sends many values, e.g. channels, sockets, ...), and thread-safe waking is also harder to implement when self is required.

      The new scheduler sidesteps the whole "wake by self" issue by avoiding the atomic increment in wake_by_ref, making it as efficient as wake(self). It does this by having the scheduler maintain a list of all currently active (not yet completed) tasks. This list represents the reference count needed to push a task into the run queue.

      The difficulty with this optimization is ensuring that the scheduler does not drop a task from its list until it can guarantee that the task will never be pushed into the run queue again. The specifics of how this is managed are beyond the scope of this article, but I encourage you to dig into the source.
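
      To make the idea concrete, here is a minimal sketch, not Tokio's actual implementation, of a scheduler-owned task list: the waker carries only a task id, and the owned list is what keeps the task alive, so wake_by_ref needs no reference-count increment. The `Scheduler`, `schedule_by_id`, and `release` names are hypothetical.

      ```rust
      use std::collections::{HashMap, VecDeque};
      use std::sync::{Arc, Mutex};

      struct Task {
          // future, state machine, ... elided in this sketch
      }

      struct Scheduler {
          // The scheduler's list of live tasks *is* the reference that keeps
          // each task alive; wakers do not need their own counted handle.
          owned: Mutex<HashMap<u64, Box<Task>>>,
          run_queue: Mutex<VecDeque<u64>>,
      }

      impl Scheduler {
          // Pushing into the run queue only copies a u64 task id, so waking
          // by reference costs no atomic reference-count increment.
          fn schedule_by_id(&self, id: u64) {
              self.run_queue.lock().unwrap().push_back(id);
          }

          // A task may only be removed from `owned` once the scheduler can
          // guarantee it will never be pushed into the run queue again.
          fn release(&self, id: u64) {
              self.owned.lock().unwrap().remove(&id);
          }
      }

      struct Waker {
          scheduler: Arc<Scheduler>,
          task_id: u64,
      }

      impl Waker {
          // As cheap as `wake(self)`: no `Arc<Task>` clone is needed.
          fn wake_by_ref(&self) {
              self.scheduler.schedule_by_id(self.task_id);
          }
      }
      ```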

    2. Reducing allocations

      The new Tokio scheduler requires only a single allocation per spawned task while the old one required two. Previously, the Task struct looked something like:

      ```rust
      struct Task {
          /// All state needed to manage the task
          state: TaskState,

          /// The logic to run is represented as a future trait object.
          future: Box<dyn Future<Output = ()>>,
      }
      ```

      The Task struct would then be allocated in a Box as well. This has always been a wart that I have wanted to fix for a long time (I first attempted to fix this in 2014).

      Since the old Tokio scheduler, two things have changed. First, std::alloc stabilized. Second, the Future task system switched to an explicit vtable strategy. These were the two missing pieces needed to finally get rid of the double allocation per task inefficiency. Now, the Task structure is represented as:

      ```rust
      struct Task<T> {
          header: Header,
          future: T,
          trailer: Trailer,
      }
      ```

      Both Header and Trailer are state needed to power the task, however they are split between "hot" data (header) and "cold" data (trailer), i.e. roughly data that is accessed often and data that is rarely used. The hot data is placed at the head of the struct and is kept as small as possible. When the CPU dereferences the task pointer, it will load a cache line sized amount of data at once (between 64 and 128 bytes). We want that data to be as relevant as possible.

      The new Tokio scheduler requires only a single allocation per spawned task, while the old one required two. Previously, the Task struct looked something like this:

      ```rust
      struct Task {
          /// All state needed to manage the task
          state: TaskState,

          /// The logic to run is represented as a future trait object.
          future: Box<dyn Future<Output = ()>>,
      }
      ```

      The Task struct would then be allocated in a Box as well. This has always been a wart that I wanted to fix for a long time (I first attempted to fix it in 2014). Two things have changed since the old Tokio scheduler. First, `std::alloc` stabilized. Second, the Future task system switched to an explicit vtable strategy. These were the two missing pieces needed to finally eliminate the double-allocation-per-task inefficiency.

      Now, the task structure is represented as:

      ```rust
      struct Task<T> {
          header: Header,
          future: T,
          trailer: Trailer,
      }
      ```

      Both Header and Trailer hold state needed to power the task, but they are split into "hot" data (the header) and "cold" data (the trailer), i.e. roughly the data that is accessed often versus the data that is rarely used. The hot data is placed at the head of the struct and kept as small as possible. When the CPU dereferences the task pointer, it loads a cache line's worth of data at once (between 64 and 128 bytes), and we want that data to be as relevant as possible.

      This design reduces the number of allocations while also improving data locality and cache efficiency, which in turn improves task-processing performance. The structure is also more flexible, allowing more freedom in how task storage and management are defined without sacrificing performance.
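
      As a rough sketch of the single-allocation layout, assuming illustrative contents for `Header` and `Trailer` (the real implementation additionally erases the concrete future type behind a raw pointer and a hand-built vtable, which is omitted here):

      ```rust
      use std::future::Future;

      // Hot data: touched on every poll/wake; kept small and placed first.
      struct Header {
          state: u64, // illustrative; the real state encoding is more involved
      }

      // Cold data: bookkeeping that is rarely accessed.
      struct Trailer {
          // e.g. data only needed at shutdown or for joining; elided
      }

      // `repr(C)` keeps the declared field order, so the hot Header sits at
      // the start of the allocation, in the first cache line loaded when the
      // task pointer is dereferenced.
      #[repr(C)]
      struct Task<T> {
          header: Header,
          future: T, // stored inline, not behind its own Box
          trailer: Trailer,
      }

      // One allocation per spawned task: the future lives in the same block
      // as the header and trailer.
      fn allocate_task<T>(future: T) -> Box<Task<T>>
      where
          T: Future<Output = ()>,
      {
          Box::new(Task {
              header: Header { state: 0 },
              future,
              trailer: Trailer {},
          })
      }
      ```

      The key point is that `future: T` lives inline in the same block, so spawning allocates once instead of boxing the future separately.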

    3. Reducing cross thread synchronization

      The other critical piece of a work-stealing scheduler is sibling notification. This is where a processor notifies a sibling when it observes new tasks. If the sibling is sleeping, it wakes up and steals tasks.

      The notification action has another critical responsibility. Recall the queue algorithm used weak atomic orderings (Acquire / Release). Because of how atomic memory ordering work, without additional synchronization, there is no guarantee that a sibling processor will ever see tasks in the queue to steal. The notification action also is responsible for establishing the necessary synchronization for the sibling processor to see the tasks in order to steal them. These requirements make the notification action expensive. The goal is to perform the notification action as little as possible without resulting in CPU under utilization, i.e. a processor has tasks and a sibling is unable to steal them. Overeager notification results in a thundering herd problem.

      The original Tokio scheduler took a naive approach to notification. Whenever a new task was pushed on to the run queue, a processor was notified. Whenever a processor was notified and found a task upon waking, it would then notify yet another processor. This logic very quickly resulted in all processor getting woken up and searching for work (causing contention). Very often, most of these processors failed to find work and went back to sleep.

      The new scheduler significantly improves on this by borrowing the same technique used in the Go scheduler. Notification is attempted at the same points as the previous scheduler, however, notification only happens if there are no workers in the searching state (see previous section). When a worker is notified, it is immediately transitioned to the searching state. When a processor in the searching state finds new tasks, it will first transition out of the searching state, then notify another processor.

      This logic has the effect of throttling the rate at which processors wake up. If a batch of tasks is scheduled at once (for example, when epoll is polled for socket readiness), the first one will result in notifying a processor. That processor is now in the searching state. The rest of the scheduled tasks in the batch will not notify a processor as there is at least one in the searching state. That notified processor will steal half the tasks in the batch, and in turn notify another processor. This third processor will wake up, find tasks from one of the first two processors and steal half of those. This results in a smooth ramp up of processors as well as rapid load balancing of tasks.

      The other critical piece of a work-stealing scheduler is sibling notification. When a processor observes new tasks, it notifies a sibling. If the sibling is sleeping, it wakes up and steals tasks. The notification action has another critical responsibility. Recall that the queue algorithm uses weak atomic orderings (Acquire/Release). Because of how atomic memory ordering works, without additional synchronization there is no guarantee that a sibling processor will ever see the tasks in the queue to steal. The notification action is therefore also responsible for establishing the synchronization needed for the sibling processor to see and steal those tasks. These requirements make notification expensive. The goal is to perform the notification action as rarely as possible without causing CPU under-utilization, i.e. a processor has tasks while a sibling is unable to steal them. Overeager notification leads to a thundering herd problem.

      The original Tokio scheduler took a naive approach to notification. Whenever a new task was pushed onto the run queue, a processor was notified. Whenever a notified processor found a task upon waking, it would then notify yet another processor. This logic quickly resulted in all processors being woken up and searching for work (causing contention). Much of the time, most of these processors failed to find work and went back to sleep.

      The new scheduler improves on this significantly by borrowing the technique used in the Go scheduler. Notification is attempted at the same points as in the previous scheduler, but it only happens if no worker is in the searching state (see the previous section). When a worker is notified, it immediately transitions to the searching state. When a processor in the searching state finds new tasks, it first transitions out of the searching state and then notifies another processor.

      This logic has the effect of throttling the rate at which processors wake up. If a batch of tasks is scheduled at once (for example, when epoll is polled for socket readiness), the first one notifies a processor, which is now in the searching state. The remaining tasks in the batch do not notify a processor, since at least one is already searching. The notified processor steals half the tasks in the batch and in turn notifies another processor. That third processor wakes up, finds tasks on one of the first two processors, and steals half of those. The result is a smooth ramp-up of processors and rapid load balancing of tasks.
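
      A simplified sketch of this throttling rule is below; `num_searching`, `notify_one`, and `transition_from_searching` are assumed names, and the real implementation resolves the check-then-increment race more carefully.

      ```rust
      use std::sync::atomic::{AtomicUsize, Ordering};

      struct Shared {
          // Number of workers currently in the "searching" state.
          num_searching: AtomicUsize,
          // Parked worker handles, run queues, ... elided in this sketch.
      }

      impl Shared {
          // Called whenever a task is pushed onto a run queue.
          fn notify_one(&self) {
              // Only wake a sibling if nobody is already searching for work.
              // (Best-effort check; the real code closes the race here.)
              if self.num_searching.load(Ordering::SeqCst) != 0 {
                  return;
              }
              // The woken worker starts out in the searching state.
              self.num_searching.fetch_add(1, Ordering::SeqCst);
              self.unpark_one();
          }

          // Called by a searching worker once it has actually found tasks.
          fn transition_from_searching(&self) {
              if self.num_searching.fetch_sub(1, Ordering::SeqCst) == 1 {
                  // This was the last searcher: notify another worker so it
                  // can ramp up and steal part of what was just found.
                  self.notify_one();
              }
          }

          fn unpark_one(&self) {
              // Wake one parked worker thread; details elided in this sketch.
          }
      }
      ```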

    4. To address this, the new Tokio scheduler implements an optimization (also found in Go's and Kotlin's schedulers). When a task transitions to the runnable state, instead of pushing it to the back of the run queue, it is stored in a special "next task" slot. The processor will always check this slot before checking the run queue. When inserting a task into this slot, if a task is already stored in it, the old task is removed from the slot and pushed to the back of the run queue. In the message passing case, this will result in the receiver of the message to be scheduled to run next.

      A priority slot: only tasks scheduled by the current thread are stored here; tasks submitted from other threads only go into their own local queues or the inject queue.
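
      A minimal single-threaded sketch of the slot behavior described above; `Core`, `schedule_local`, and `next_task` are illustrative names, not Tokio's actual API.

      ```rust
      use std::collections::VecDeque;

      // Per-worker state; accessed only from its own thread in this sketch.
      struct Core<T> {
          lifo_slot: Option<T>,
          run_queue: VecDeque<T>,
      }

      impl<T> Core<T> {
          // Schedule a task that was just woken by this worker.
          fn schedule_local(&mut self, task: T) {
              // If the slot is already occupied, demote the previous task to
              // the back of the run queue; the most recently woken task runs
              // next.
              if let Some(prev) = self.lifo_slot.replace(task) {
                  self.run_queue.push_back(prev);
              }
          }

          // The worker always drains the slot before looking at the run queue.
          fn next_task(&mut self) -> Option<T> {
              self.lifo_slot.take().or_else(|| self.run_queue.pop_front())
          }
      }
      ```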

    5. The last missing piece is consuming the global queue. This queue is used to handle overflow from processor local queues as well as to submit tasks to the scheduler from non-processor threads. If a processor is under load, i.e. the local run queue has tasks. The processor will attempt to pop from the global after executing ~60 tasks from the local queue. It also checks the global queue when in the "searching" state, described below.

      If the processor is under load, i.e. its local run queue has tasks, it will attempt to pop from the global queue after executing roughly 60 tasks from the local queue. It also checks the global queue when in the "searching" state described below.
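
      Sketched as code, the policy might look roughly like this; the interval constant and the `Worker` type here are assumptions rather than Tokio's actual values.

      ```rust
      use std::collections::VecDeque;

      // How often to prefer the global queue; the real interval differs.
      const GLOBAL_CHECK_INTERVAL: u32 = 61;

      struct Worker<T> {
          tick: u32,
          local: VecDeque<T>,
          global: VecDeque<T>, // stand-in for the shared injection queue
      }

      impl<T> Worker<T> {
          fn next_task(&mut self) -> Option<T> {
              self.tick = self.tick.wrapping_add(1);
              if self.tick % GLOBAL_CHECK_INTERVAL == 0 {
                  // Periodically look at the global queue first so tasks
                  // submitted from outside the scheduler are never starved
                  // by a busy local queue.
                  self.global.pop_front().or_else(|| self.local.pop_front())
              } else {
                  self.local.pop_front().or_else(|| self.global.pop_front())
              }
          }
      }
      ```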

    6. The steal function is similar to the pop function but the load from self.tail must be atomic. Also, similar to push_overflow, the steal operation will attempt to claim half of the queue instead of a single task. This has some nice characteristics which we will cover later.

      Since synchronization is expensive, do more work each time you do synchronize, and cut down on how often synchronization is triggered.
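
      A simplified, lock-based sketch of that batching idea; the real implementation claims the batch with atomic head/tail updates rather than operating on `VecDeque`s.

      ```rust
      use std::collections::VecDeque;

      // Move roughly half of the victim's tasks in one go, then return one
      // of them for the thief to run immediately.
      fn steal_half<T>(victim: &mut VecDeque<T>, thief: &mut VecDeque<T>) -> Option<T> {
          let n = victim.len() - victim.len() / 2; // half, rounded up
          for _ in 0..n {
              if let Some(task) = victim.pop_front() {
                  thief.push_back(task);
              }
          }
          // One synchronization event (here: one call) moves a whole batch,
          // so far fewer steals are needed overall.
          thief.pop_front()
      }
      ```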

    7. The queue implementation is a circular buffer, using an array to store values. Atomic integers are used to track the head and tail positions.

      The mask has actually been removed in the current code, because cap is a fixed constant, so the ring buffer's wrap-around offset can be computed directly from cap. In other words, mask = cap - 1 = 255.

      Masks are very common in ring-buffer implementations. When we are about to write the last element of the array, the next write should start back at the front of the array. Checking this with a conditional on every access would be inefficient. With a mask, each push increments the tail index by 1 and each pop increments the head index by 1, and the actual array access reduces the current index modulo the capacity to get a valid position. Using a bitwise AND with the mask instead of a modulo is also common practice, because the mask's binary representation is all 1s (mask = 255).
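
      A single-threaded sketch of the masking trick; the real queue uses atomic head and tail indices, and `RingBuffer` here is illustrative.

      ```rust
      // With a power-of-two capacity, `idx & MASK` equals `idx % CAP`, so
      // the wrap-around needs no branch and no division (CAP = 256, MASK = 255).
      const CAP: usize = 256;
      const MASK: usize = CAP - 1;

      struct RingBuffer<T> {
          buffer: Vec<Option<T>>,
          head: usize, // grows without bound; wrapped only when indexing
          tail: usize,
      }

      impl<T> RingBuffer<T> {
          fn new() -> Self {
              Self {
                  buffer: (0..CAP).map(|_| None).collect(),
                  head: 0,
                  tail: 0,
              }
          }

          fn push(&mut self, value: T) -> Result<(), T> {
              if self.tail.wrapping_sub(self.head) == CAP {
                  return Err(value); // full
              }
              self.buffer[self.tail & MASK] = Some(value);
              self.tail = self.tail.wrapping_add(1);
              Ok(())
          }

          fn pop(&mut self) -> Option<T> {
              if self.head == self.tail {
                  return None; // empty
              }
              let value = self.buffer[self.head & MASK].take();
              self.head = self.head.wrapping_add(1);
              value
          }
      }
      ```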

    8. If you are familiar with the details of atomic memory orderings, you might notice a potential "issue" with the push function as shown above. An atomic load with Acquire ordering is pretty weak. It may return stale values, i.e. a concurrent steal operation may have already incremented the value of self.head but the thread performing the push had an old value in the cache so it didn't notice the steal operation. This is not a problem for the correctness of the algorithm. In the fast-path of push, we only care if the local run queue is full or not. Given that the current thread is the only thread that can push into the run queue, a stale load will result in seeing the run queue as more full than it actually is. It may incorrectly determine that the queue is full and enter the push_overflow function, but this function includes a stronger atomic operation. If push_overflow determines the queue is not actually full, it returns w/ Err and the push operation tries again. This is another reason why push_overflow will move half of the run queue to the global queue. By moving half of the queue, the "run queue is empty" false positive is hit a lot less often.

      Very clever.
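
      To spell out why the stale Acquire load is harmless, here is an illustrative sketch of the producer-side fast path; the field names and the stubbed `push_overflow` are assumptions, not Tokio's exact code.

      ```rust
      use std::sync::atomic::{AtomicUsize, Ordering};

      const CAP: usize = 256;
      const MASK: usize = CAP - 1;

      // Producer side of the local queue. Only the owning worker writes
      // `tail`; concurrent stealers advance `head`.
      struct LocalQueue {
          head: AtomicUsize,
          tail: AtomicUsize,
          buffer: Vec<AtomicUsize>, // task ids stand in for real task pointers
      }

      impl LocalQueue {
          fn push(&self, task_id: usize) {
              loop {
                  // This load may be stale: a concurrent steal might already
                  // have advanced `head`. The error is one-sided: the queue
                  // can only look *fuller* than it really is, never emptier.
                  let head = self.head.load(Ordering::Acquire);
                  let tail = self.tail.load(Ordering::Relaxed); // only we write it

                  if tail.wrapping_sub(head) < CAP {
                      // Fast path: there is definitely room. Write the slot,
                      // then publish it to stealers with a Release store.
                      self.buffer[tail & MASK].store(task_id, Ordering::Relaxed);
                      self.tail.store(tail.wrapping_add(1), Ordering::Release);
                      return;
                  }

                  // Possibly a stale "full" verdict. The slow path re-checks
                  // with a stronger atomic operation; if the queue was not
                  // actually full it reports an error and we try again.
                  if self.push_overflow(task_id).is_ok() {
                      return;
                  }
              }
          }

          fn push_overflow(&self, _task_id: usize) -> Result<(), ()> {
              // Would claim half of the queue with a stronger atomic operation
              // and move it to the global queue; stubbed out in this sketch.
              Ok(())
          }
      }
      ```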