4 Matching Annotations
  1. Dec 2024
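    1. The collect operator carries two performance risks: one is the network overhead introduced while pulling the data, the other is an OOM (Out of Memory) error on the Driver

      Collecting data consumes memory on the Driver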
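The memory risk noted above can be illustrated with a toy model in plain Python. This is only a sketch, not Spark itself: `make_partitions`, `collect`, and the two `peak_rows_*` helpers are hypothetical names invented for the illustration, and "rows held" stands in for Driver memory. It contrasts materializing every partition at once with buffering one partition at a time, which is roughly the trade-off between Spark's `collect()` and `toLocalIterator()`.

```python
from typing import List


def make_partitions(num_partitions: int, rows_per_partition: int) -> List[range]:
    """Hypothetical stand-in for a distributed dataset: one range per partition."""
    return [
        range(p * rows_per_partition, (p + 1) * rows_per_partition)
        for p in range(num_partitions)
    ]


def collect(partitions: List[range]) -> List[int]:
    """Collect-style: every row of every partition ends up in one driver-side list."""
    rows: List[int] = []
    for part in partitions:
        rows.extend(part)
    return rows


def peak_rows_collect(partitions: List[range]) -> int:
    """Rows held on the 'driver' at once when everything is collected."""
    return len(collect(partitions))


def peak_rows_streaming(partitions: List[range]) -> int:
    """Rows held at once when only one partition is buffered at a time."""
    peak = 0
    for part in partitions:
        buffered = list(part)  # the previous partition's buffer is released here
        peak = max(peak, len(buffered))
    return peak


partitions = make_partitions(num_partitions=4, rows_per_partition=1000)
print("collect holds:", peak_rows_collect(partitions), "rows")
print("streaming holds at most:", peak_rows_streaming(partitions), "rows")
```

In the toy model the collected footprint grows with the whole dataset, while the streamed footprint is bounded by the largest partition. In a real job, pulling only a bounded sample (for example with `take(n)`) limits both the network transfer and the Driver's memory use.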

  2. Jun 2024
    1. suppose that GPT-4 training took 3 months. In 2027 a leading AI lab will be able to train a GPT-4-level model in a minute

      for - stat - AI evolution - prediction 2027 - training time - 6 OOM decrease

      stat - AI evolution - prediction 2027 - training time - 6 OOM decrease - today it takes 3 months to train GPT-4 - in 2027, it will take 1 minute - That is, 131,400 minutes vs 1 minute, a ratio of about 1.3 × 10^5, which is closer to 5 OOM (orders of magnitude) than to the 6 OOM stated
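The arithmetic in the note can be checked in a few lines. This is a minimal sketch that assumes a "month" is one twelfth of a 365-day year (which reproduces the 131,400-minute figure); `orders_of_magnitude` is a helper name introduced here for illustration.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def orders_of_magnitude(before: float, after: float) -> float:
    """Base-10 logarithm of the speed-up ratio before/after."""
    return math.log10(before / after)


# Three months expressed in minutes: 525,600 * 3 / 12 = 131,400
minutes_today = MINUTES_PER_YEAR * 3 / 12
minutes_2027 = 1.0

ooms = orders_of_magnitude(minutes_today, minutes_2027)
print(f"{minutes_today:,.0f} minutes vs {minutes_2027:.0f} minute")
print(f"speed-up of about {ooms:.2f} orders of magnitude")  # about 5.12
```

Under this assumption, a reduction from 3 months to 1 minute comes out just above 5 orders of magnitude. A full 6 OOM would require a starting point of roughly 1,000,000 minutes, which is close to two years.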

  3. Jul 2023
  4. Oct 2020
    1. Linux Memory Management at Scale

      "we had to build a complete and compliant operating system in order to perform resource control reliably"

      epic real-talk. the only people on the planet who seemed to have tamed linux for workloads. controlling memory. taming io. being on the bleeding edge, it turns out, is almost entirely about forward-progress. what can we reclaim?

      • oomd for memory protection
      • fbtax2
      • psi monitoring for io regulation
      • cgroups v2

      https://facebookmicrosites.github.io/cgroup2/docs/fbtax-results.html
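To make the PSI item in the list above more concrete, here is a small sketch of reading pressure-stall information as the Linux kernel exposes it under `/proc/pressure/` (`cpu`, `memory`, `io`). The file format (`some`/`full` lines with `avg10`, `avg60`, `avg300`, `total` fields) is the kernel's; the sample values, the `io_pressure_high` helper, and its threshold are assumptions made for illustration, not anything taken from oomd or fbtax2.

```python
from pathlib import Path
from typing import Dict

PsiData = Dict[str, Dict[str, float]]


def parse_psi(text: str) -> PsiData:
    """Parse PSI text into {"some": {"avg10": ..., ...}, "full": {...}}."""
    result: PsiData = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        metrics: Dict[str, float] = {}
        for field in fields:
            key, value = field.split("=", 1)
            # "total" is cumulative stall time in microseconds; avg* are percentages
            metrics[key] = int(value) if key == "total" else float(value)
        result[kind] = metrics
    return result


def read_psi(resource: str = "io") -> PsiData:
    """Read live data; needs a Linux kernel built with PSI support."""
    return parse_psi(Path(f"/proc/pressure/{resource}").read_text())


def io_pressure_high(psi: PsiData, threshold: float = 10.0) -> bool:
    """Hypothetical policy: flag when the 10-second 'some' average exceeds threshold."""
    return psi["some"]["avg10"] > threshold


# Made-up sample in the kernel's format, so the sketch runs on any machine.
SAMPLE = """some avg10=1.50 avg60=0.75 avg300=0.10 total=123456
full avg10=0.50 avg60=0.25 avg300=0.05 total=65432"""

psi = parse_psi(SAMPLE)
print(psi["some"]["avg10"], io_pressure_high(psi))
```

On a suitable Linux host, `read_psi("io")` or `read_psi("memory")` returns the live values in the same shape. With cgroups v2, per-cgroup pressure files (`io.pressure`, `memory.pressure`) use the same format.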