14 Matching Annotations
  1. Oct 2024
    1. matching the ownership tracking * granularities between memcg and writeback in either direction

      Algorithmic policy: the memcg and writeback ownerships have to match. This mechanism is implemented via the mem_cgroup_track_foreign_dirty_slowpath and mem_cgroup_flush_foreign functions.

    2. #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100 #define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2

      Maximum numbers of loops used when reclaiming memory from a cgroup during soft-limit reclaim: 100 and 2, respectively. These values are fixed at compile time via #define.

    3. if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)) return ep; if (parent_effective > siblings_protected && parent_usage > siblings_protected && usage > protected) { unsigned long unclaimed; unclaimed = parent_effective - siblings_protected; unclaimed *= usage - protected; unclaimed /= parent_usage - siblings_protected; ep += unclaimed; } return ep;

      Heuristic algorithm for calculating the effective protection of an individual cgroup. When the parent has effective protection that its children have not fully claimed, the cgroup receives a share of the unclaimed amount proportional to its own unprotected usage. The check references the cgroup itself, its parent, its siblings, and the rest of the tree.
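
      The proportional share can be sketched as a userspace model (the function name, parameter layout, and sample numbers here are illustrative, not the kernel's):

      ```c
      #include <assert.h>

      /* Userspace model of the quoted heuristic: a child may claim a share of
       * the parent's unclaimed effective protection, proportional to its own
       * unprotected usage relative to all unprotected usage under the parent. */
      static unsigned long effective_protection(unsigned long usage,
                                                unsigned long protected,
                                                unsigned long ep,
                                                unsigned long parent_effective,
                                                unsigned long parent_usage,
                                                unsigned long siblings_protected)
      {
          if (parent_effective > siblings_protected &&
              parent_usage > siblings_protected &&
              usage > protected) {
              unsigned long unclaimed;

              unclaimed = parent_effective - siblings_protected;
              unclaimed *= usage - protected;                 /* this cgroup's share... */
              unclaimed /= parent_usage - siblings_protected; /* ...of unprotected usage */
              ep += unclaimed;
          }
          return ep;
      }

      int main(void)
      {
          /* Parent has 100 pages of effective protection, children claim 40,
           * so 60 are unclaimed. This cgroup's unprotected usage (30 pages) is
           * half of the parent-wide unprotected usage (60), so it picks up 30
           * extra pages: 20 + 30 = 50. */
          assert(effective_protection(50, 20, 20, 100, 100, 40) == 50);
          return 0;
      }
      ```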

    4. atomic_add(nr_bytes, &old->nr_charged_bytes);

      Configuration policy for the tradeoff made when flushing a per-memcg object stock: the leftover charged bytes from the old stock are written back to a centralized per-memcg counter. This preserves limit-enforcement accuracy at the cost of potential CPU contention on that shared value.
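
      A minimal userspace sketch of this drain step, using C11 atomics in place of the kernel's atomic_t (the struct and function names are illustrative assumptions):

      ```c
      #include <assert.h>
      #include <stdatomic.h>

      /* Instead of keeping leftover charged bytes in a per-CPU stock (no
       * contention, but invisible to limit enforcement), the drain path
       * returns them to a centralized counter with one atomic add. */
      struct obj_cgroup_model {
          atomic_ulong nr_charged_bytes; /* centralized leftover-byte pool */
      };

      static void drain_stock_bytes(struct obj_cgroup_model *old,
                                    unsigned long nr_bytes)
      {
          if (nr_bytes)
              atomic_fetch_add(&old->nr_charged_bytes, nr_bytes);
      }

      int main(void)
      {
          struct obj_cgroup_model objcg = { .nr_charged_bytes = 0 };

          drain_stock_bytes(&objcg, 512);
          drain_stock_bytes(&objcg, 0); /* nothing left over: no atomic op */
          assert(atomic_load(&objcg.nr_charged_bytes) == 512);
          return 0;
      }
      ```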

    5. #define THRESHOLDS_EVENTS_TARGET 128 #define SOFTLIMIT_EVENTS_TARGET 1024

      These #define statements set the target number of events before the system triggers action for threshold events and soft-limit events, respectively. Both count memory-pressure events. THRESHOLDS_EVENTS_TARGET is 128, so threshold events are checked more frequently (finer grain) than soft-limit events, which are checked only every 1024 events.
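
      The ratelimit check can be modeled in userspace; this sketch is simplified from the kernel's per-CPU bookkeeping, and the helper name is an assumption:

      ```c
      #include <assert.h>
      #include <stdbool.h>

      #define THRESHOLDS_EVENTS_TARGET 128
      #define SOFTLIMIT_EVENTS_TARGET 1024

      /* An action fires once the running event count has reached the last
       * recorded target; the target then advances by the per-type step. */
      static bool event_ratelimit(unsigned long events, unsigned long *next,
                                  unsigned long target)
      {
          if ((long)(events - *next) >= 0) {
              *next = events + target;
              return true; /* time to act */
          }
          return false;
      }

      int main(void)
      {
          unsigned long next_thresh = THRESHOLDS_EVENTS_TARGET;
          unsigned long next_soft = SOFTLIMIT_EVENTS_TARGET;
          unsigned long events, thresh_fires = 0, soft_fires = 0;

          for (events = 0; events <= 2048; events++) {
              if (event_ratelimit(events, &next_thresh, THRESHOLDS_EVENTS_TARGET))
                  thresh_fires++;
              if (event_ratelimit(events, &next_soft, SOFTLIMIT_EVENTS_TARGET))
                  soft_fires++;
          }
          /* Threshold checks fire 8x more often than soft-limit checks. */
          assert(thresh_fires == 8 * soft_fires);
          return 0;
      }
      ```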

    6. if (total >= (excess >> 2) || (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) break;

      For soft-limit reclaim of cgroup memory, we loop over the hierarchy looking for a victim to shrink. We stop either after a bounded number of attempts (#define MEM_CGROUP_MAX_RECLAIM_LOOPS) or once the total reclaimed reaches a quarter of the original excess: excess >> 2 right-shifts the excess by two bits, i.e. divides it by four.
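
      A userspace sketch of the termination rule (reclaim_pass is a hypothetical stand-in for one reclaim attempt):

      ```c
      #include <assert.h>

      #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100

      /* Stand-in for one reclaim attempt: the first passes find nothing,
       * later ones reclaim 50 pages each. */
      static unsigned long reclaim_pass(int loop)
      {
          return loop < 3 ? 0 : 50;
      }

      /* Loop until we've reclaimed a quarter of the original excess
       * (excess >> 2) or hit the loop bound; return the loop count. */
      static int soft_limit_reclaim(unsigned long excess)
      {
          unsigned long total = 0;
          int loop = 0;

          for (;;) {
              loop++;
              total += reclaim_pass(loop);
              if (total >= (excess >> 2) ||
                  loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)
                  break;
          }
          return loop;
      }

      int main(void)
      {
          /* Excess of 400 pages: stop once total >= 100, on the 4th loop
           * (passes 3 and 4 each reclaim 50 pages). */
          assert(soft_limit_reclaim(400) == 4);
          return 0;
      }
      ```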

    7. #define FLUSH_TIME (2UL*HZ)

      This is a configuration policy that sets the interval for periodic flushing of memory statistics. The flushing is performed every 2 seconds (2UL * HZ), allowing the system to balance between the cost of frequent stat updates and keeping the statistics reasonably fresh.

    8. #define MEMCG_DELAY_PRECISION_SHIFT 20 #define MEMCG_DELAY_SCALING_SHIFT 14

      These two #define statements are part of the configuration policy controlling the escalating penalty applied to memory.high overage. They ensure the system doesn't penalize minor overages too harshly while still increasing the delay sharply for excessive usage.
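
      A userspace sketch of how these shifts combine, modeled on the kernel's overage and delay calculation; the HZ value and helper names are assumptions for the example:

      ```c
      #include <assert.h>
      #include <stdint.h>

      #define HZ 100 /* assume a 100 Hz tick for the example */
      #define MEMCG_DELAY_PRECISION_SHIFT 20
      #define MEMCG_DELAY_SCALING_SHIFT 14
      #define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL * HZ)

      /* Fixed-point overage ratio: (usage - high) / high, scaled by 2^20. */
      static uint64_t calculate_overage(unsigned long usage, unsigned long high)
      {
          uint64_t overage;

          if (usage <= high)
              return 0;
          overage = (uint64_t)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT;
          return overage / high;
      }

      /* Quadratic penalty: lenient on small overages, harsh on large ones,
       * capped at the 2-second maximum. */
      static unsigned long calculate_delay(uint64_t overage)
      {
          uint64_t penalty_jiffies = overage * overage * HZ;

          penalty_jiffies >>= MEMCG_DELAY_PRECISION_SHIFT;
          penalty_jiffies >>= MEMCG_DELAY_SCALING_SHIFT;
          return penalty_jiffies < MEMCG_MAX_HIGH_DELAY_JIFFIES ?
                 (unsigned long)penalty_jiffies : MEMCG_MAX_HIGH_DELAY_JIFFIES;
      }

      int main(void)
      {
          /* At or below memory.high: no penalty. */
          assert(calculate_delay(calculate_overage(1000, 1000)) == 0);
          /* Grossly over the limit: penalty saturates at 2 seconds. */
          assert(calculate_delay(calculate_overage(4000, 1000)) ==
                 MEMCG_MAX_HIGH_DELAY_JIFFIES);
          return 0;
      }
      ```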

    9. /* * Don't sleep if the amount of jiffies this memcg owes us is so low * that it's not even worth doing, in an attempt to be nice to those who * go only a small amount over their memory.high value and maybe haven't * been aggressively reclaimed enough yet. */ if (penalty_jiffies <= HZ / 100) goto out;

      Both the predicate and action of an algorithmic policy are defined here. If the calculated penalty_jiffies (the delay for an over-allocated cgroup) is less than 1/100th of a second, don't throttle the cgroup.

    10. #define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL*HZ)

      This line defines the maximum sleep time (delay) for a memory cgroup that has breached its memory.high limit. This is a configuration policy because it sets a fixed upper limit (2 seconds).

    1. if (oc->chosen && oc->chosen != (void *)-1UL) oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : "Memory cgroup out of memory");

      After select_bad_process has found the process with the highest "badness" score, that process is terminated here. This is the action side of the previously mentioned predicate (maximal "badness").

    2. #define MAX_OOM_REAP_RETRIES 10 static void oom_reap_task(struct task_struct *tsk) { int attempts = 0; struct mm_struct *mm = tsk->signal->oom_mm; /* Retry the mmap_read_trylock(mm) a few times */ while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm)) schedule_timeout_idle(HZ/10); if (attempts <= MAX_OOM_REAP_RETRIES || test_bit(MMF_OOM_SKIP, &mm->flags)) goto done;

      The reaper will try up to 10 times (sleeping HZ/10 after each attempt) to reap the memory of the process; the retry count is a value configuration defined via a #define statement. Reaping fails if the reaper can never take the victim's mmap read lock.
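
      The bounded retry loop can be sketched in userspace (try_reap is a hypothetical stand-in for oom_reap_task_mm, and the sleep between attempts is elided):

      ```c
      #include <assert.h>
      #include <stdbool.h>

      #define MAX_OOM_REAP_RETRIES 10

      /* Stand-in for one reap attempt: succeeds only on a given attempt. */
      static bool try_reap(int attempt, int succeeds_on)
      {
          return attempt == succeeds_on;
      }

      /* Retry up to MAX_OOM_REAP_RETRIES times; report whether we ever
       * succeeded. Mirrors the while-loop structure of the quoted code. */
      static bool oom_reap_with_retries(int succeeds_on)
      {
          int attempts = 0;

          while (attempts++ < MAX_OOM_REAP_RETRIES &&
                 !try_reap(attempts, succeeds_on))
              ; /* the kernel sleeps HZ/10 here via schedule_timeout_idle() */

          return attempts <= MAX_OOM_REAP_RETRIES;
      }

      int main(void)
      {
          assert(oom_reap_with_retries(3));   /* lock obtained on 3rd try */
          assert(!oom_reap_with_retries(99)); /* never obtained: give up  */
          return 0;
      }
      ```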

    3. #define OOM_REAPER_DELAY (2*HZ) static void queue_oom_reaper(struct task_struct *tsk) { /* mm is already queued? */ if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) return; get_task_struct(tsk); timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0); tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY; add_timer(&tsk->oom_reaper_timer); }

      By the time queue_oom_reaper runs, the selected victim has already been signaled to die. Rather than invoking the reaper immediately, a timer gives the task OOM_REAPER_DELAY to exit and free its memory on its own; only after that delay is the reaper thread woken to tear down the victim's address space. The delay is currently 2*HZ, meaning we wait 2 seconds. This time-period policy is defined as configuration via #define.

    4. /* * The baseline for the badness score is the proportion of RAM that each * task's rss, pagetable and swap space use. */ points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) + mm_pgtables_bytes(p->mm) / PAGE_SIZE; task_unlock(p); /* Normalize to oom_score_adj units */ adj *= totalpages / 1000; points += adj;

      The kernel declares out-of-memory when available RAM is exhausted. This code computes the predicate for choosing the maximally "bad" process to kill to resolve the out-of-memory condition. The predicate is a heuristic calculation: "badness" is the proportion of RAM used by the sum of the task's RSS, page tables, and swap entries, adjusted by oom_score_adj normalized so each unit is worth 0.1% of total pages.
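
      A userspace model of the baseline badness calculation (the struct fields and sample figures are illustrative, not the kernel's types):

      ```c
      #include <assert.h>

      /* Points are the pages used by RSS, swap entries, and page tables,
       * plus oom_score_adj normalized to per-mille of total pages. */
      struct mm_model {
          long rss_pages;
          long swap_entries;
          long pgtable_pages;
      };

      static long oom_badness(const struct mm_model *mm, long oom_score_adj,
                              long totalpages)
      {
          long points = mm->rss_pages + mm->swap_entries + mm->pgtable_pages;

          /* Normalize adj: each oom_score_adj unit is worth 0.1% of RAM. */
          points += oom_score_adj * (totalpages / 1000);
          return points;
      }

      int main(void)
      {
          struct mm_model mm = { .rss_pages = 5000, .swap_entries = 1000,
                                 .pgtable_pages = 100 };

          /* Neutral adj: points are just the memory footprint. */
          assert(oom_badness(&mm, 0, 1000000) == 6100);
          /* adj = 100 adds 10% of a 1,000,000-page machine: +100,000. */
          assert(oom_badness(&mm, 100, 1000000) == 106100);
          return 0;
      }
      ```

      A negative oom_score_adj subtracts points the same way, which is how administrators shield important tasks from the OOM killer.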