126 Matching Annotations
  1. Oct 2024
    1. #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100 #define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2

      Maximum number of loops when reclaiming memory from a cgroup during soft-limit reclaim. These values are fixed at compile time via #define.

    2. atomic_add(nr_bytes, &old->nr_charged_bytes);

      Configuration policy for the tradeoff when flushing per-memcg bytes from an old object stock: leftover bytes are written back to a centralized atomic counter, keeping limit enforcement accurate at the cost of possible CPU contention on that shared value.

    3. #define THRESHOLDS_EVENTS_TARGET 128 #define SOFTLIMIT_EVENTS_TARGET 1024

      These #define statements are for the target number of events before the system triggers action for threshold events and soft limit events, respectively. These are memory pressure events. THRESHOLDS_EVENTS_TARGET is set to 128, meaning threshold events are processed more frequently (finer grain) than soft limit events, which are only triggered every 1024 events.
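
      A minimal userspace sketch of this "advance the next target" ratelimiting pattern (the names and state layout here are illustrative, not the kernel's exact code): an action fires only once the event counter passes its per-target threshold, and the target is then pushed forward by the configured step.

        #include <stdbool.h>
        #include <stdio.h>

        #define THRESHOLDS_EVENTS_TARGET 128
        #define SOFTLIMIT_EVENTS_TARGET  1024

        /* Illustrative state: one event counter and the next firing target. */
        static unsigned long nr_events;
        static unsigned long next_thresh_target = THRESHOLDS_EVENTS_TARGET;

        static bool thresh_event_ratelimit(void)
        {
                if ((long)(next_thresh_target - nr_events) < 0) {
                        next_thresh_target = nr_events + THRESHOLDS_EVENTS_TARGET;
                        return true;    /* time to process threshold events */
                }
                return false;
        }

        int main(void)
        {
                unsigned long fired = 0;

                for (nr_events = 0; nr_events < 10000; nr_events++)
                        if (thresh_event_ratelimit())
                                fired++;

                /* Fires roughly once every THRESHOLDS_EVENTS_TARGET events. */
                printf("fired %lu times\n", fired);
                return 0;
        }

      A soft-limit target would work the same way but fire only every SOFTLIMIT_EVENTS_TARGET (1024) events.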

    4. if (total >= (excess >> 2) || (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) break;

      For soft-limit reclaim of cgroup memory, the kernel repeatedly picks a victim memcg in the hierarchy and tries to shrink it. It gives up after MEM_CGROUP_MAX_RECLAIM_LOOPS iterations, and it also stops early once the total reclaimed reaches a quarter of the original excess (excess >> 2 is a fixed right shift, not a per-loop decrease).

    5. #define FLUSH_TIME (2UL*HZ)

      This is a configuration policy that sets the interval for periodic flushing of memory statistics. The flushing is performed every 2 seconds (2UL * HZ), allowing the system to balance between the cost of frequent stat updates and keeping the statistics reasonably fresh.

    6. #define MEMCG_DELAY_PRECISION_SHIFT 20 #define MEMCG_DELAY_SCALING_SHIFT 14

      These two #define statements are part of the configuration policy for controlling the increasing penalty applied to memory overage. They ensure the system doesn’t penalize too harshly in minor cases but still exponentially increases delay for excessive usage.
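
      A rough userspace sketch of how these two shifts shape the penalty (the exact kernel computation in this area differs in details; HZ = 250 and the usage numbers are assumed): overage is expressed as a 2^20 fixed-point fraction of the limit, squared so the delay grows steeply, then scaled back down by both shifts.

        #include <stdio.h>

        #define MEMCG_DELAY_PRECISION_SHIFT 20
        #define MEMCG_DELAY_SCALING_SHIFT   14
        #define HZ 250   /* assumed config */

        /* Illustrative only: quadratic penalty on overage above memory.high. */
        static unsigned long high_delay_jiffies(unsigned long usage, unsigned long high)
        {
                unsigned long long overage, penalty;

                if (usage <= high)
                        return 0;

                /* overage as a fixed-point fraction of the limit */
                overage = ((unsigned long long)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT) / high;

                penalty = overage * overage * HZ;
                penalty >>= MEMCG_DELAY_PRECISION_SHIFT;
                penalty >>= MEMCG_DELAY_SCALING_SHIFT;
                return (unsigned long)penalty;
        }

        int main(void)
        {
                unsigned long high = 1000000;   /* pages, example limit */
                unsigned long usages[] = { 1010000, 1100000, 1500000, 2000000 };

                for (int i = 0; i < 4; i++)
                        printf("usage %lu -> delay %lu jiffies\n",
                               usages[i], high_delay_jiffies(usages[i], high));
                return 0;
        }

      Small overages produce almost no delay while large ones grow quickly; the kernel additionally clamps the result to MEMCG_MAX_HIGH_DELAY_JIFFIES (next annotation).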

    7. #define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL*HZ)

      This line defines the maximum sleep time (delay) for a memory cgroup that has breached its memory.high limit. This is a configuration policy because it sets a fixed upper limit (2 seconds).

    1. static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc) { /* * If reclaim is making progress greater than 12% efficiency then * wake all the NOPROGRESS throttled tasks. */ if (sc->nr_reclaimed > (sc->nr_scanned >> 3)) { wait_queue_head_t *wqh; wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS]; if (waitqueue_active(wqh)) wake_up(wqh); return; } /* * Do not throttle kswapd or cgroup reclaim on NOPROGRESS as it will * throttle on VMSCAN_THROTTLE_WRITEBACK if there are too many pages * under writeback and marked for immediate reclaim at the tail of the * LRU. */ if (current_is_kswapd() || cgroup_reclaim(sc)) return; /* Throttle if making no progress at high prioities. */ if (sc->priority == 1 && !sc->nr_reclaimed) reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS); }

      Manages throttling: wakes all NOPROGRESS-throttled tasks when reclaim is making progress above roughly 12% efficiency, and throttles reclaim when no progress is being made at the highest priority (kswapd and cgroup reclaim are excluded here, and reclaim_throttle() may also skip throttling under some conditions).

    2. scan_balance = SCAN_FRACT; /* * Calculate the pressure balance between anon and file pages. * * The amount of pressure we put on each LRU is inversely * proportional to the cost of reclaiming each list, as * determined by the share of pages that are refaulting, times * the relative IO cost of bringing back a swapped out * anonymous page vs reloading a filesystem page (swappiness). * * Although we limit that influence to ensure no list gets * left behind completely: at least a third of the pressure is * applied, before swappiness. * * With swappiness at 100, anon and file have equal IO cost. */ total_cost = sc->anon_cost + sc->file_cost; anon_cost = total_cost + sc->anon_cost; file_cost = total_cost + sc->file_cost; total_cost = anon_cost + file_cost; ap = swappiness * (total_cost + 1); ap /= anon_cost + 1; fp = (200 - swappiness) * (total_cost + 1); fp /= file_cost + 1; fraction[0] = ap; fraction[1] = fp; denominator = ap + fp;

      Calculates the proportion of anon and file pages to be scanned based on swappiness and the relative reclaim cost of each list. The whole function determines the number of folios to be scanned for each type.
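
      A runnable sketch of the quoted arithmetic with made-up costs, showing how swappiness and the relative reclaim cost of each list turn into scan fractions:

        #include <stdio.h>

        /* Reproduces the quoted fraction calculation with example inputs. */
        int main(void)
        {
                unsigned long swappiness = 60;       /* default vm.swappiness */
                unsigned long anon_cost = 100;       /* stands in for sc->anon_cost */
                unsigned long file_cost = 300;       /* stands in for sc->file_cost */
                unsigned long total_cost, ap, fp;

                total_cost = anon_cost + file_cost;
                anon_cost  = total_cost + anon_cost; /* ensures at least 1/3 pressure per list */
                file_cost  = total_cost + file_cost;
                total_cost = anon_cost + file_cost;

                ap = swappiness * (total_cost + 1);
                ap /= anon_cost + 1;
                fp = (200 - swappiness) * (total_cost + 1);
                fp /= file_cost + 1;

                printf("anon fraction: %lu / %lu\n", ap, ap + fp);
                printf("file fraction: %lu / %lu\n", fp, ap + fp);
                return 0;
        }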

    3. if (young * MIN_NR_GENS > total) return true; if (old * (MIN_NR_GENS + 2) < total) return true;

      Decides whether aging is needed: aging is needed if the lruvec is short on cold folios. If this returns true, scanning of this lruvec is skipped.

    4. if (!swappiness) type = LRU_GEN_FILE; else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE]) type = LRU_GEN_ANON; else if (swappiness == 1) type = LRU_GEN_FILE; else if (swappiness == 200) type = LRU_GEN_ANON; else type = get_type_to_scan(lruvec, swappiness, &tier);

      Decides which type (anon or file) to scan based on swappiness for the multi-gen LRU lruvecs.

    5. nr = xchg_nr_deferred(shrinker, shrinkctl); if (shrinker->seeks) { delta = freeable >> priority; delta *= 4; do_div(delta, shrinker->seeks); } else { /* * These objects don't require any IO to create. Trim * them aggressively under memory pressure to keep * them from causing refetches in the IO caches. */ delta = freeable / 2; } total_scan = nr >> priority; total_scan += delta; total_scan = min(total_scan, (2 * freeable)); trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, freeable, delta, total_scan, priority); /* * Normally, we should not scan less than batch_size objects in one * pass to avoid too frequent shrinker calls, but if the slab has less * than batch_size objects in total and we are really tight on memory, * we will try to reclaim all available objects, otherwise we can end * up failing allocations although there are plenty of reclaimable * objects spread over several slabs with usage less than the * batch_size. * * We detect the "tight on memory" situations by looking at the total * number of objects we want to scan (total_scan). If it is greater * than the total number of objects on slab (freeable), we must be * scanning at high prio and therefore should try to reclaim as much as * possible. */ while (total_scan >= batch_size || total_scan >= freeable) {

      Determine the number of slab objects to be scanned (total_scan) based on 'freeable' and 'priority'. Scans are done in batch-size chunks unless memory is tight. The batch size is set by the shrinker, or defaults to SHRINK_BATCH at line 832.
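
      A small userspace reproduction of the quoted delta/total_scan arithmetic for example inputs (priority 12 is DEF_PRIORITY and seeks 2 is DEFAULT_SEEKS; no work is deferred from a previous pass):

        #include <stdio.h>

        int main(void)
        {
                unsigned long freeable = 100000;  /* objects the shrinker reports freeable */
                unsigned long nr = 0;             /* deferred work from xchg_nr_deferred() */
                int priority = 12;
                unsigned int seeks = 2;
                unsigned long delta, total_scan;

                delta = freeable >> priority;     /* 100000 >> 12 = 24 */
                delta *= 4;
                delta /= seeks;                   /* do_div() equivalent: 48 */

                total_scan = nr >> priority;
                total_scan += delta;
                if (total_scan > 2 * freeable)    /* cap, as in the quoted code */
                        total_scan = 2 * freeable;

                printf("delta = %lu, total_scan = %lu\n", delta, total_scan);
                return 0;
        }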

    1. #define ZSWAP_NR_ZPOOLS 32

      More configuration policy; per the in-source comment, this is an empirical number.

    2. static unsigned int zswap_max_pool_percent = 20;

      Maximum percentage of memory that the compressed pool can occupy.

    3. static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON);

      Since zswap participates in decisions on the swap path, this config is policy.

    1. #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* * Watermark failed for this zone, but see if we can * grow this zone if it contains deferred pages. */ if (deferred_pages_enabled()) { if (_deferred_grow_zone(zone, order)) goto try_this_zone; } #endif

      This is a configuration policy. The allocator retries this zone if the zone can be grown by initializing its deferred pages.

    2. #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* Try again if zone has deferred pages */ if (deferred_pages_enabled()) { if (_deferred_grow_zone(zone, order)) goto try_this_zone; } #endif

      This is a configuration policy. The allocator retries this zone if the zone can be grown by initializing its deferred pages.

    3. if (!should_skip_kasan_unpoison(gfp_flags) && kasan_unpoison_pages(page, order, init)) { /* Take note that memory was initialized by KASAN. */ if (kasan_has_integrated_init()) init = false; } else { /* * If memory tags have not been set by KASAN, reset the page * tags to ensure page_address() dereferencing does not fault. */ for (i = 0; i != 1 << order; ++i) page_kasan_tag_reset(page + i); }

      This decision is made by configuration (KASAN's mode).

    4. if (can_direct_reclaim && can_compact && (costly_order || (order > 0 && ac->migratetype != MIGRATE_MOVABLE)) && !gfp_pfmemalloc_allowed(gfp_mask))

      const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;

      The value of PAGE_ALLOC_COSTLY_ORDER is defined as 3, heuristically determining whether to try direct compaction or not. However, can_direct_reclaim is determined by the caller, so it is also a configuration policy.

    5. if (!can_direct_reclaim)

      bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;

      gfp_mask is a parameter passed by the caller; it specifies whether the caller is willing to reclaim pages if the allocation fails.

    6. if (gfp_mask & __GFP_NORETRY)

      gfp_mask is provided by the caller. This configuration determines whether the function retries the allocation or not.

    7. int ratio = sysctl_lowmem_reserve_ratio[i]; bool clear = !ratio || !zone_managed_pages(zone); unsigned long managed_pages = 0; for (j = i + 1; j < MAX_NR_ZONES; j++) { struct zone *upper_zone = &pgdat->node_zones[j]; managed_pages += zone_managed_pages(upper_zone); if (clear) zone->lowmem_reserve[j] = 0; else zone->lowmem_reserve[j] = managed_pages / ratio; }

      This is a configuration policy since it revolves around sysctl_lowmem_reserve_ratio, a configurable setting that can be changed dynamically at runtime. The for loop applies the logic, but the parameter controls how much memory to reserve: the calculation of memory reserves based on this ratio is a configuration policy.

    1. #define MAX_OOM_REAP_RETRIES 10 static void oom_reap_task(struct task_struct *tsk) { int attempts = 0; struct mm_struct *mm = tsk->signal->oom_mm; /* Retry the mmap_read_trylock(mm) a few times */ while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm)) schedule_timeout_idle(HZ/10); if (attempts <= MAX_OOM_REAP_RETRIES || test_bit(MMF_OOM_SKIP, &mm->flags)) goto done;

      The reaper will try up to 10 times (with a delay after each attempt) to reap the memory of the process. This value is configured via a #define statement. Reaping fails if the process never gives up its mmap lock.

    2. #define OOM_REAPER_DELAY (2*HZ) static void queue_oom_reaper(struct task_struct *tsk) { /* mm is already queued? */ if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) return; get_task_struct(tsk); timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0); tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY; add_timer(&tsk->oom_reaper_timer); }

      After the OOM killer has sent SIGKILL to the selected victim, this queues the OOM reaper with a delay of OOM_REAPER_DELAY before the reaper thread forcibly reclaims the victim's memory. The delay is currently 2*HZ, i.e., a 2-second wait. This time-period policy is defined as configuration via #define.

    1. if (s->flags & __CMPXCHG_DOUBLE) { ret = __update_freelist_fast(slab, freelist_old, counters_old, freelist_new, counters_new); } else { ret = __update_freelist_slow(slab, freelist_old, counters_old, freelist_new, counters_new); }

      This policy is very similar to annotated code below. The description is reproduced here:

      This policy determines if the system has support for compare and exchange. If so, it will use the "__update_freelist_fast()" function, which uses a compare and exchange internally. Otherwise, it will use "__update_freelist_slow()", which uses a lock (specifically a bit-based spinlock) internally.

    2. #ifdef CONFIG_SLAB_FREELIST_HARDENED encoded = (unsigned long)ptr ^ s->random ^ swab(ptr_addr); #else encoded = (unsigned long)ptr; #endif

      This policy uses the configuration setting "CONFIG_SLAB_FREELIST_HARDENED" to determine whether to obfuscate a SLUB free list pointer or not for increased security at the cost of some performance.

    3. #ifdef CONFIG_SLAB_FREELIST_HARDENED decoded = (void *)(ptr.v ^ s->random ^ swab(ptr_addr)); #else decoded = (void *)ptr.v; #endif

      This policy is based on the configuration setting indicated by "CONFIG_SLAB_FREELIST_HARDENED", which hardens the slab free list. If the free list is hardened, then free list pointers will be obfuscated; this policy just undoes the obfuscation in that case. In the case where free list pointers are not obfuscated, this function just returns the unmodified pointer value.
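
      A userspace sketch of this XOR obfuscation, showing that applying the same transform twice recovers the original pointer; __builtin_bswap64 stands in for swab(), and the key and addresses are made up rather than taken from a real kmem_cache:

        #include <stdint.h>
        #include <stdio.h>

        /* Encode a freelist pointer with a per-cache key and the (byte-swapped)
         * address it is stored at; decoding is the identical operation. */
        static uint64_t obfuscate(uint64_t ptr, uint64_t key, uint64_t ptr_addr)
        {
                return ptr ^ key ^ __builtin_bswap64(ptr_addr);
        }

        int main(void)
        {
                uint64_t key      = 0x5a5a5a5a12345678ULL; /* stands in for s->random */
                uint64_t ptr      = 0xffff888001234560ULL; /* next free object */
                uint64_t ptr_addr = 0xffff888001234000ULL; /* where the pointer lives */

                uint64_t stored  = obfuscate(ptr, key, ptr_addr);
                uint64_t decoded = obfuscate(stored, key, ptr_addr);

                printf("stored=%#llx decoded=%#llx ok=%d\n",
                       (unsigned long long)stored, (unsigned long long)decoded,
                       decoded == ptr);
                return 0;
        }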

    4. #ifdef CONFIG_SLAB_FREELIST_RANDOM /* Pre-initialize the random sequence cache */ static int init_cache_random_seq(struct kmem_cache *s) { unsigned int count = oo_objects(s->oo); int err; /* Bailout if already initialised */ if (s->random_seq) return 0; err = cache_random_seq_create(s, count, GFP_KERNEL); if (err) { pr_err("SLUB: Unable to initialize free list for %s\n", s->name); return err; } /* Transform to an offset on the set of pages */ if (s->random_seq) { unsigned int i; for (i = 0; i < count; i++) s->random_seq[i] *= s->size; } return 0; } /* Initialize each random sequence freelist per cache */ static void __init init_freelist_randomization(void) { struct kmem_cache *s; mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) init_cache_random_seq(s); mutex_unlock(&slab_mutex); } /* Get the next entry on the pre-computed freelist randomized */ static void *next_freelist_entry(struct kmem_cache *s, struct slab *slab, unsigned long *pos, void *start, unsigned long page_limit, unsigned long freelist_count) { unsigned int idx; /* * If the target page allocation failed, the number of objects on the * page might be smaller than the usual size defined by the cache. */ do { idx = s->random_seq[*pos]; *pos += 1; if (*pos >= freelist_count) *pos = 0; } while (unlikely(idx >= page_limit)); return (char *)start + idx; } /* Shuffle the single linked freelist based on a random pre-computed sequence */ static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab) { void *start; void *cur; void *next; unsigned long idx, pos, page_limit, freelist_count; if (slab->objects < 2 || !s->random_seq) return false; freelist_count = oo_objects(s->oo); pos = get_random_u32_below(freelist_count); page_limit = slab->objects * s->size; start = fixup_red_left(s, slab_address(slab)); /* First entry is used as the base of the freelist */ cur = next_freelist_entry(s, slab, &pos, start, page_limit, freelist_count); cur = setup_object(s, cur); slab->freelist = cur; for (idx = 1; idx < slab->objects; idx++) { next = next_freelist_entry(s, slab, &pos, start, page_limit, freelist_count); next = setup_object(s, next); set_freepointer(s, cur, next); cur = next; } set_freepointer(s, cur, NULL); return true; } #else static inline int init_cache_random_seq(struct kmem_cache *s) { return 0; } static inline void init_freelist_randomization(void) { } static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab) { return false; } #endif /* CONFIG_SLAB_FREELIST_RANDOM */

      This policy looks at the "CONFIG_SLAB_FREELIST_RANDOM" config setting. If it is set, the policy defines functions to randomize the free list order when creating new pages (for security purposes). If it is not set, the functions involved in randomizing the free list are empty, effectively turning off free list randomization.

    5. if (s->flags & __CMPXCHG_DOUBLE) { ret = __update_freelist_fast(slab, freelist_old, counters_old, freelist_new, counters_new); } else { unsigned long flags; local_irq_save(flags); ret = __update_freelist_slow(slab, freelist_old, counters_old, freelist_new, counters_new); local_irq_restore(flags); }

      This policy determines if the system has support for compare and exchange. If so, it will use the "__update_freelist_fast()" function, which uses a compare and exchange internally. Otherwise, it will use "__update_freelist_slow()", which uses a lock (specifically a bit-based spinlock) internally.

    1. #ifdef CONFIG_VM_EVENT_COUNTERS DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}}; EXPORT_PER_CPU_SYMBOL(vm_event_states); static void sum_vm_events(unsigned long *ret) { int cpu; int i; memset(ret, 0, NR_VM_EVENT_ITEMS * sizeof(unsigned long)); for_each_online_cpu(cpu) { struct vm_event_state *this = &per_cpu(vm_event_states, cpu); for (i = 0; i < NR_VM_EVENT_ITEMS; i++) ret[i] += this->event[i]; } } /* * Accumulate the vm event counters across all CPUs. * The result is unavoidably approximate - it can change * during and after execution of this function. */ void all_vm_events(unsigned long *ret) { cpus_read_lock(); sum_vm_events(ret); cpus_read_unlock(); } EXPORT_SYMBOL_GPL(all_vm_events); /* * Fold the foreign cpu events into our own. * * This is adding to the events on one processor * but keeps the global counts constant. */ void vm_events_fold_cpu(int cpu) { struct vm_event_state *fold_state = &per_cpu(vm_event_states, cpu); int i; for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { count_vm_events(i, fold_state->event[i]); fold_state->event[i] = 0; } } #endif /* CONFIG_VM_EVENT_COUNTERS */

      When CONFIG_VM_EVENT_COUNTERS is enabled, the kernel compiles the code that collects and manages virtual memory (vm) event counters. These counters track events like page faults, page allocations, and swap operations. The functions sum_vm_events, all_vm_events, and vm_events_fold_cpu are responsible for accumulating these statistics across all CPUs.

      If CONFIG_VM_EVENT_COUNTERS is not enabled, this code is excluded from the build. This means the kernel won't collect these detailed VM statistics, reducing memory usage and avoiding the overhead associated with tracking them.

    2. #ifdef CONFIG_NUMA int sysctl_vm_numa_stat = ENABLE_NUMA_STAT; /* zero numa counters within a zone */ static void zero_zone_numa_counters(struct zone *zone) { int item, cpu; for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) { atomic_long_set(&zone->vm_numa_event[item], 0); for_each_online_cpu(cpu) { per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_event[item] = 0; } } } /* zero numa counters of all the populated zones */ static void zero_zones_numa_counters(void) { struct zone *zone; for_each_populated_zone(zone) zero_zone_numa_counters(zone); } /* zero global numa counters */ static void zero_global_numa_counters(void) { int item; for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) atomic_long_set(&vm_numa_event[item], 0); } static void invalid_numa_statistics(void) { zero_zones_numa_counters(); zero_global_numa_counters(); } static DEFINE_MUTEX(vm_numa_stat_lock); int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos) { int ret, oldval; mutex_lock(&vm_numa_stat_lock); if (write) oldval = sysctl_vm_numa_stat; ret = proc_dointvec_minmax(table, write, buffer, length, ppos); if (ret || !write) goto out; if (oldval == sysctl_vm_numa_stat) goto out; else if (sysctl_vm_numa_stat == ENABLE_NUMA_STAT) { static_branch_enable(&vm_numa_stat_key); pr_info("enable numa statistics\n"); } else { static_branch_disable(&vm_numa_stat_key); invalid_numa_statistics(); pr_info("disable numa statistics, and clear numa counters\n"); } out: mutex_unlock(&vm_numa_stat_lock); return ret; } #endif

      This conditionally includes the NUMA-specific code based on whether the CONFIG_NUMA option is enabled during kernel compilation. Within this conditional block, the sysctl_vm_numa_stat_handler function provides a runtime configuration mechanism through the sysctl_vm_numa_stat variable. This function allows system administrators to enable or disable the collection of NUMA statistics at runtime. When the value of sysctl_vm_numa_stat changes, the function changes the vm_numa_stat_key to start or stop the statistics collection and clears the NUMA counters if disabled.

    1. if (dtc->wb_thresh < 2 * wb_stat_error()) { wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK); } else { wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE); dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK); }

      This is a configuration policy that does a more accurate calculation on the number of reclaimable pages and dirty pages when the threshold for the dirty pages in the writeback context is lower than 2 times the maximal error of a stat counter.

    2. if (thresh > dirty) return 1UL << (ilog2(thresh - dirty) >> 1);

      This implements a configuration policy that determines how many more pages a task may dirty before the kernel re-checks the dirty limits; the allowance scales roughly as the square root of the remaining headroom below the dirty threshold.
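
      Plugging example numbers into the quoted expression (with ilog2 implemented via a compiler builtin) shows the allowance growing roughly like the square root of the remaining headroom:

        #include <stdio.h>

        /* ilog2(): index of the highest set bit (assumes 64-bit unsigned long). */
        static unsigned int ilog2_ul(unsigned long x)
        {
                return 63 - __builtin_clzl(x);
        }

        int main(void)
        {
                unsigned long thresh = 100000, dirty;

                for (dirty = 0; dirty < thresh; dirty += 20000) {
                        unsigned long headroom = thresh - dirty;
                        /* 1 << (ilog2(headroom) >> 1) ~ sqrt(headroom), power of two */
                        printf("headroom %6lu -> poll interval %lu pages\n",
                               headroom, 1UL << (ilog2_ul(headroom) >> 1));
                }
                return 0;
        }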

    3. limit -= (limit - thresh) >> 5;

      This is a configuration policy that determines by how much the limit should be updated. The limit controls the amount of dirty memory allowed in the system.
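
      A tiny sketch iterating the quoted update with made-up numbers: each update moves the limit 1/32 of the way toward the threshold, so the limit decays geometrically instead of dropping all at once.

        #include <stdio.h>

        int main(void)
        {
                unsigned long limit = 200000, thresh = 100000; /* example values */

                for (int i = 0; i < 10; i++) {
                        limit -= (limit - thresh) >> 5;   /* the quoted update */
                        printf("update %2d: limit = %lu\n", i, limit);
                }
                return 0;
        }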

    4. shift = dirty_ratelimit / (2 * step + 1); if (shift < BITS_PER_LONG) step = DIV_ROUND_UP(step >> shift, 8); else step = 0; if (dirty_ratelimit < balanced_dirty_ratelimit) dirty_ratelimit += step; else dirty_ratelimit -= step;

      This is a configuration policy that determines how much we should increase/decrease the dirty_ratelimit, which controls the rate that processors write dirty pages back to storage.
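
      Reproducing just the quoted step-damping arithmetic with example values (the surrounding code that derives the raw step is omitted, and the starting numbers are made up):

        #include <stdio.h>

        #define BITS_PER_LONG 64
        #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

        int main(void)
        {
                unsigned long dirty_ratelimit = 4096;          /* pages/s, example */
                unsigned long balanced_dirty_ratelimit = 1024; /* example target */
                unsigned long step = 3072;                     /* raw gap, example */
                unsigned long shift;

                /* Damping from the quoted code: a ratelimit that is large
                 * relative to the step shrinks the step sharply, and the
                 * step is always divided by 8, so changes are gradual. */
                shift = dirty_ratelimit / (2 * step + 1);
                if (shift < BITS_PER_LONG)
                        step = DIV_ROUND_UP(step >> shift, 8);
                else
                        step = 0;

                if (dirty_ratelimit < balanced_dirty_ratelimit)
                        dirty_ratelimit += step;
                else
                        dirty_ratelimit -= step;

                printf("step = %lu, new dirty_ratelimit = %lu\n", step, dirty_ratelimit);
                return 0;
        }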

    5. ratelimit_pages = dirty_thresh / (num_online_cpus() * 32); if (ratelimit_pages < 16) ratelimit_pages = 16;

      This is a configuration policy that dynamically determines the rate that kernel can write dirty pages back to storage in a single writeback cycle.
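
      Plugging example numbers into the quoted calculation (8 online CPUs and a dirty threshold of 262144 pages, both assumed):

        #include <stdio.h>

        int main(void)
        {
                unsigned long dirty_thresh = 262144;  /* ~1 GiB of 4 KiB pages */
                int ncpus = 8;
                unsigned long ratelimit_pages;

                ratelimit_pages = dirty_thresh / (ncpus * 32);  /* 262144 / 256 = 1024 */
                if (ratelimit_pages < 16)
                        ratelimit_pages = 16;            /* floor from the quoted code */

                printf("ratelimit_pages = %lu\n", ratelimit_pages);
                return 0;
        }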

    6. if (elapsed > WB_BANDWIDTH_IDLE_JIF && !atomic_read(&wb->writeback_inodes)) {

      Idle interval config

    7. if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {

      This is a configuration policy to force strictlimit.

    8. if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) && mdtc) {

      This is a configuration policy that controls whether to update the limit in the control group. The config enables support for controlling the writeback of dirty pages on a per-cgroup basis in the Linux kernel. This allows for better resource management and improved performance.

    1. static long madvise_cold(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start_addr, unsigned long end_addr) { struct mm_struct *mm = vma->vm_mm; struct mmu_gather tlb; *prev = vma; if (!can_madv_lru_vma(vma)) return -EINVAL; lru_add_drain(); tlb_gather_mmu(&tlb, mm); madvise_cold_page_range(&tlb, vma, start_addr, end_addr); tlb_finish_mmu(&tlb); return 0; }

      This is a configuration policy. Cold page ranges cannot be advised to the kernel if the VM_LOCKED, VM_PFNMAP, or VM_HUGETLB flags are set.

    1. if (lru_gen_enabled() && pte_young(ptep_get(pvmw.pte))) { lru_gen_look_around(&pvmw); referenced++; }

      These lines show configuration policy. The lru_gen_enabled() function checks whether the multi-gen LRU is enabled, which is set with the CONFIG_LRU_GEN_ENABLED kernel configuration. This is a determining factor in how the folio referenced bit is updated, so it is a configuration policy.

    2. if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { if (pmdp_clear_flush_young_notify(vma, address, pvmw.pmd)) referenced++;

      This is configuration policy. The CONFIG_TRANSPARENT_HUGEPAGE kernel configuration determines how and if the page reference bit is incremented.

    1. #if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA) static ssize_t compact_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { int nid = dev->id; if (nid >= 0 && nid < nr_node_ids && node_online(nid)) { /* Flush pending updates to the LRU lists */ lru_add_drain_all(); compact_node(nid); } return count; } static DEVICE_ATTR_WO(compact); int compaction_register_node(struct node *node) { return device_create_file(&node->dev, &dev_attr_compact); } void compaction_unregister_node(struct node *node) { device_remove_file(&node->dev, &dev_attr_compact); } #endif /* CONFIG_SYSFS && CONFIG_NUMA */

      When these options are enabled, the functions compact_store, compaction_register_node, and compaction_unregister_node are included in the build. The compact_store function allows users to trigger memory compaction on specific NUMA nodes by writing to the sysfs interface. Inside this function, it checks if the node ID is valid and online, drains pending updates to the LRU lists with lru_add_drain_all(), and then calls compact_node(nid) to perform compaction on that node.

      Including these functions provides fine-grained control over memory management in NUMA systems. If either flag is not enabled, these functions are excluded from the build, and the per-node compaction interface is unavailable.

    2. #ifdef CONFIG_CMA /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */ if (migratetype == MIGRATE_MOVABLE && !free_area_empty(area, MIGRATE_CMA)) return COMPACT_SUCCESS; #endif

      This block allows the kernel's memory compaction algorithm to consider MIGRATE_CMA pages as fallback options for MIGRATE_MOVABLE allocations when the Contiguous Memory Allocator (CMA) is enabled (CONFIG_CMA is defined).

      When CONFIG_CMA is enabled, the compaction process becomes more flexible by allowing the use of free CMA pages if standard movable pages are unavailable, potentially improving allocation success rates.

      If CONFIG_CMA is disabled, this fallback mechanism is omitted, ensuring that CMA regions remain dedicated to their primary purpose of providing contiguous memory blocks for devices that require them.

      https://elixir.bootlin.com/linux/v6.6.42/source/mm/Kconfig#L895

    3. #ifdef CONFIG_SPARSEMEM

      #ifdef CONFIG_SPARSEMEM conditionally includes or excludes the implementation of the skip_offline_sections and skip_offline_sections_reverse functions based on whether the CONFIG_SPARSEMEM option is enabled in the kernel configuration. CONFIG_SPARSEMEM enables the kernel to support sparse memory systems, where memory sections can be brought online or offline dynamically. These functions provide logic to skip over offline memory sections during operations like memory compaction, ensuring that the system does not attempt to access or manipulate memory that is not currently available.

      If CONFIG_SPARSEMEM is not defined, these functions return 0, effectively disabling the handling of offline memory sections.

      https://elixir.bootlin.com/linux/v6.6.42/source/mm/Kconfig#L461

    4. #ifdef CONFIG_COMPACTION

      This option acts as a policy control mechanism that determines whether the memory compaction feature is included in the kernel build. By wrapping the compaction-related code within #ifdef CONFIG_COMPACTION and #endif [Line 522], the code conditionally compiles these sections based on the configuration setting. This affects the kernel's behavior regarding memory management and fragmentation handling. CONFIG_COMPACTION is defined in https://elixir.bootlin.com/linux/v6.6.42/source/mm/Kconfig#L637 and defaults to yes.

    5. static int compaction_proactiveness_sysctl_handler(struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos) { int rc, nid; rc = proc_dointvec_minmax(table, write, buffer, length, ppos); if (rc) return rc; if (write && sysctl_compaction_proactiveness) { for_each_online_node(nid) { pg_data_t *pgdat = NODE_DATA(nid); if (pgdat->proactive_compact_trigger) continue; pgdat->proactive_compact_trigger = true; trace_mm_compaction_wakeup_kcompactd(pgdat->node_id, -1, pgdat->nr_zones - 1); wake_up_interruptible(&pgdat->kcompactd_wait); } } return 0; }

      This function implements configuration and policy logic for proactive memory compaction. It supports the sysctl handler, and provides runtime configuration of compaction proactiveness. When the proactiveness setting is enabled, the function applies a policy to trigger proactive compaction across all online memory nodes. It does this by setting a trigger flag and waking up the kcompactd daemon for each node that hasn't already been triggered.

    6. #ifdef CONFIG_COMPACTION static bool suitable_migration_source(struct compact_control *cc, struct page *page) { int block_mt; if (pageblock_skip_persistent(page)) return false; if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction) return true; block_mt = get_pageblock_migratetype(page); if (cc->migratetype == MIGRATE_MOVABLE) return is_migrate_movable(block_mt); else return block_mt == cc->migratetype; }

      This code snippet from the Linux kernel's memory compaction system implements policy and configuration logic for determining suitable migration sources during compaction. It's conditionally compiled based on the CONFIG_COMPACTION option, demonstrating configuration-dependent behavior. The suitable_migration_source function encapsulates policy decisions by considering factors such as persistent skip flags, compaction mode, and migration types. It applies different criteria for async direct compaction versus other modes, and implements specific rules for matching migration types, with special handling for movable pages.

    7. if (!sysctl_compaction_proactiveness) timeout = MAX_SCHEDULE_TIMEOUT;

      If proactive compaction is disabled (sysctl_compaction_proactiveness is 0), the daemon sets its timeout to MAX_SCHEDULE_TIMEOUT, effectively sleeping until explicitly woken up. If proactive compaction is enabled, it uses the default timeout to periodically wake up and check for work.

    1. if (khugepaged_has_work()) { const unsigned long scan_sleep_jiffies = msecs_to_jiffies(khugepaged_scan_sleep_millisecs); if (!scan_sleep_jiffies) return; khugepaged_sleep_expire = jiffies + scan_sleep_jiffies; wait_event_freezable_timeout(khugepaged_wait, khugepaged_should_wakeup(), scan_sleep_jiffies); return; } if (hugepage_flags_enabled()) wait_event_freezable(khugepaged_wait, khugepaged_wait_event());

      The khugepaged thread sleeps when khugepaged has no work (the khugepaged_scan list is empty); when it does have work, it sleeps for khugepaged_scan_sleep_millisecs between scan passes.

    2. if (hugepage_flags_enabled()) { if (!khugepaged_thread) khugepaged_thread = kthread_run(khugepaged, NULL,

      hugepage_flags_enabled() reports whether transparent huge pages are enabled for all mappings. transparent_hugepage_flags is disabled by default; if it is enabled, the khugepaged thread is started.

    1. scan_base = offset = si->lowest_bit; last_in_cluster = offset + SWAPFILE_CLUSTER - 1; /* Locate the first empty (unaligned) cluster */ for (; last_in_cluster <= si->highest_bit; offset++) { if (si->swap_map[offset]) last_in_cluster = offset + SWAPFILE_CLUSTER; else if (offset == last_in_cluster) { spin_lock(&si->lock); offset -= SWAPFILE_CLUSTER - 1; si->cluster_next = offset; si->cluster_nr = SWAPFILE_CLUSTER - 1; goto checks; } if (unlikely(--latency_ration < 0)) { cond_resched(); latency_ration = LATENCY_LIMIT; } }

      Here, (when using HDDs), a policy is implemented that places the swapped page in the first available slot. This is supposed to reduce seek time in spinning drives, as it encourages having swap entries near each other.

    2. /* * Even if there's no free clusters available (fragmented), * try to scan a little more quickly with lock held unless we * have scanned too many slots already. */ if (!scanned_many) { unsigned long scan_limit; if (offset < scan_base) scan_limit = scan_base; else scan_limit = si->highest_bit; for (; offset <= scan_limit && --latency_ration > 0; offset++) { if (!si->swap_map[offset]) goto checks; } }

      Here we have a configuration policy where we do another smaller scan as long as we haven't exhausted our latency_ration. Another alternative could be yielding early in anticipation that we aren't going to find a free slot.

    3. cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);

      A policy decision is made to add a cluster to the discard list on a first-come, first-served basis. However, this approach could be enhanced by prioritizing certain clusters higher on the list based on their 'importance.' This 'importance' could be defined by how closely a cluster is related to other pages in the swap. By doing so, the system can reduce seek time as mentioned on line 817.

    4. if (unlikely(--latency_ration < 0)) { cond_resched(); latency_ration = LATENCY_LIMIT; scanned_many = true; } if (swap_offset_available_and_locked(si, offset)) goto checks; } offset = si->lowest_bit; while (offset < scan_base) { if (unlikely(--latency_ration < 0)) { cond_resched(); latency_ration = LATENCY_LIMIT; scanned_many = true; }

      Here, a policy decision is made to fully replenish the latency_ration with the LATENCY_LIMIT and then yield back to the scheduler if we've exhausted it. This makes it so that when scheduled again, we have the full LATENCY_LIMIT to do a scan. Alternative policies could grow/shrink this to find a better heuristic instead of fully replenishing each time.

      Marked config/value, as we're replacing latency_ration with a compile-time-defined limit.
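
      A userspace sketch of this replenish-and-yield pattern, with sched_yield() standing in for cond_resched() and a fabricated swap_map (illustrative only, not the kernel's scan loop):

        #include <sched.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define LATENCY_LIMIT 256
        #define NR_SLOTS 100000

        static unsigned char swap_map[NR_SLOTS];   /* nonzero = slot in use */

        int main(void)
        {
                long latency_ration = LATENCY_LIMIT;
                bool scanned_many = false;
                unsigned long offset, yields = 0;

                for (offset = 0; offset < 10000; offset++)
                        swap_map[offset] = 1;          /* pretend the first slots are taken */

                for (offset = 0; offset < NR_SLOTS; offset++) {
                        if (--latency_ration < 0) {
                                sched_yield();                   /* cond_resched() stand-in */
                                latency_ration = LATENCY_LIMIT;  /* full replenish */
                                scanned_many = true;
                                yields++;
                        }
                        if (!swap_map[offset])
                                break;                           /* found a free slot */
                }

                printf("free slot at %lu, yielded %lu times, scanned_many=%d\n",
                       offset, yields, scanned_many);
                return 0;
        }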

    5. while (scan_swap_map_ssd_cluster_conflict(si, offset)) { /* take a break if we already got some slots */ if (n_ret) goto done; if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) goto scan;

      Here, a policy decision is made to stop scanning if some slots were already found. Other policy decisions could be made to keep scanning or take into account how long the scan took or how many pages were found.

    6. #ifdef CONFIG_THP_SWAP

      This is a build-time flag that configures how hugepages are handled when swapped. When defined, it swaps them in one piece, while without it splits them into smaller units and swaps those units.

    7. if (swap_flags & SWAP_FLAG_DISCARD_ONCE) p->flags &= ~SWP_PAGE_DISCARD; else if (swap_flags & SWAP_FLAG_DISCARD_PAGES) p->flags &= ~SWP_AREA_DISCARD

      This is a configuration policy decision where a sysadmin can pass flags to sys_swapon() to control the behavior discards are handled. If DISCARD_ONCE is set, a flag which "discard[s] swap area at swapon-time" is unset, and if DISCARD_PAGES is set, a flag which "discard[s] page-clusters after use" is unset.

    8. /* * Should not even be attempting cluster allocations when huge * page swap is disabled. Warn and fail the allocation. */ if (!IS_ENABLED(CONFIG_THP_SWAP)) { VM_WARN_ON_ONCE(1); return 0; }

      The policy dictates that if huge page swapping is disabled at compile time (via the CONFIG_THP_SWAP configuration option), the kernel should not attempt to allocate swap clusters for huge pages and should fail the operation with a warning.

    9. #ifdef CONFIG_HIBERNATION

      If CONFIG_HIBERNATION is defined, the kernel includes code to write the entire system memory state to the swapfile before powering down the system. This involves allocating swap slots for the entire memory state and ensuring that the data is properly stored.

    10. #define SWAPFILE_CLUSTER 256

      This configuration defines the size of a swap cluster, which is the number of swap pages that the system tries to allocate as a unit. It is set to 256.

    11. #define LATENCY_LIMIT 256

      LATENCY_LIMIT is an upper bound on how many slots scan_swap_map_slots() checks before it yields the CPU. It's set to 256. This policy prevents the function from monopolizing the CPU.

    12. n_goal = min3((long)n_goal, (long)SWAP_BATCH, avail_pgs);

      SWAP_BATCH: This is a configuration policy (macro) that caps the maximum number of entries that can be allocated in a single operation. It's set to 64. This is done to limit the amount of swap allocation work that happens per call.

    1. static const struct file_operations proc_page_owner_operations = { .read = read_page_owner, .llseek = lseek_page_owner, };

      Configuration policy that defines an interface for user space to access kernel page ownership data. Configuring how users can read and seek through the page information.

    1. if (vma->vm_flags & VM_PFNMAP) { int err = 1; if (ops->pte_hole) err = ops->pte_hole(start, end, -1, walk);

    1. unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len, bool need_rmap_locks)

      Simple, but optimized by page-table level: move_page_tables() can move entries at PUD or PMD granularity, gated by CONFIG_HAVE_MOVE_PUD and CONFIG_HAVE_MOVE_PMD.

    1. enum get_ksm_page_flags { GET_KSM_PAGE_NOLOCK, GET_KSM_PAGE_LOCK, GET_KSM_PAGE_TRYLOCK };

      This defines the locking behavior when accessing a KSM page. The caller can choose between no locking, locking, or trying to lock without blocking, affecting how the page is accessed and modified.

    2. static int ksmd_should_run(void) { return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.slot.mm_node); }

      This function checks whether the ksmd should run based on the KSM_RUN_MERGE flag and whether there are any memory regions to process. The decision to enable or disable memory merging is controlled by the ksm_run configuration.

    3. static unsigned int ksm_thread_sleep_millisecs = 20;

      This variable sets the delay (in milliseconds) for the KSM daemon to sleep between scanning batches. It controls the scanning frequency, balancing KSM's memory merging process with system performance.

    4. static unsigned int ksm_thread_pages_to_scan = 100;

      This variable determines the number of pages the KSM daemon will scan in one batch. It controls the performance and efficiency of KSM's scanning process, balancing between thoroughness and system load.

    5. static int ksm_max_page_sharing = 256;

      This variable controls the maximum number of page slots that can share a stable node in the stable tree for KSM. It defines a limit for page sharing, affecting how KSM consolidates memory pages.

    6. static unsigned int ksm_stable_node_chains_prune_millisecs = 2000;

      This variable sets the delay (in milliseconds) for pruning stale stable_node_dups in the stable_node_chains. It controls how often KSM will perform cleanup operations on stale nodes.

    1. randomize_stack_top

      This function uses a configuration policy to enable Address Space Layout Randomization (ASLR) for a specific process if the PF_RANDOMIZE flag is set. It randomly offsets the position of the process's stack to help defend against certain attacks by making memory addresses unpredictable.
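
      A hedged userspace illustration of the idea rather than the kernel's implementation: when randomization is requested, subtract a random, page-aligned offset (bounded by an assumed STACK_RND_MASK) from the nominal stack top.

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define PAGE_SHIFT 12
        #define PAGE_SIZE  (1ULL << PAGE_SHIFT)
        #define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
        #define STACK_RND_MASK 0x3fffffULL   /* assumed: 22 bits of page-granular entropy */

        /* Illustrative stand-in: the kernel keys the "randomize" decision off
         * the PF_RANDOMIZE flag of the current task. */
        static uint64_t randomize_stack_top(uint64_t stack_top, int randomize)
        {
                uint64_t random_variable = 0;

                if (randomize) {
                        random_variable = (uint64_t)rand();
                        random_variable &= STACK_RND_MASK;
                        random_variable <<= PAGE_SHIFT;
                }
                return PAGE_ALIGN(stack_top) - random_variable;
        }

        int main(void)
        {
                srand((unsigned)time(NULL));
                uint64_t top = 0x7ffffffff000ULL;

                for (int i = 0; i < 3; i++)
                        printf("randomized stack top: %#llx\n",
                               (unsigned long long)randomize_stack_top(top, 1));
                return 0;
        }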

    1. /* Soft offline could migrate non-LRU movable pages */ if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page)) return true;

      The code includes a policy for soft offline pages (MF_SOFT_OFFLINE). This is a feature where the kernel attempts to migrate pages to avoid using faulty memory areas. The policy allows non-LRU movable pages (pages that aren’t in the Least Recently Used list) to be migrated.

    1. if (movable_node_is_enabled()) {

      This policy, as the comment states, will ignore the kernelcore and movablecore options if movable nodes are enabled (skipping the logic below this if statement and jumping to "out2" instead). The logic for the "movable_node_is_enabled()" function is in "memory_hotplug.h".

    2. if (page_poisoning_enabled() || (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) && debug_pagealloc_enabled())) {

      This check looks at flags and configuration settings to determine if page poisoning should be enabled.

    3. if (descending)

      This decides whether to iterate forward or backward through the zones passed into the function. The variable "descending" is determined by the "arch_has_descending_max_zone_pfns()" function on line 1811, which is determined from configuration options.

    4. if (mminit_loglevel < MMINIT_VERIFY)

      This policy controls whether the function will print information about the zonelist. This decision is determined by the value of the "mminit_level" enum in "mm/internal.h".

    5. if (overcommit_policy == OVERCOMMIT_NEVER)

      This policy controls the memory batch size based on the overcommit policy, choosing a smaller batch size when the policy is OVERCOMMIT_NEVER.

    1. 1

      This is a configuration policy that sets the timeout between retries if vmap_pages_range() fails. This could be a tunable variable.

    2. 100U

      This is a configuration policy that sets 100 pages as the upper limit for the bulk allocator. However, alloc_pages_bulk_array_mempolicy does not explicitly enforce this limit internally, so I believe it is an algorithmic policy related to some sort of optimization.

    3. VMAP_PURGE_THRESHOLD

      The threshold VMAP_PURGE_THRESHOLD is a configuration policy that could be tuned by machine learning. Setting this threshold lower reduces purging activity, while setting it higher reduces fragmentation.

    4. resched_threshold = lazy_max_pages() << 1;

      The assignment of resched_threshold (and lines 1776-1777) is a configuration policy that determines the number of lazily-freed pages below which the purge loop temporarily yields the CPU to higher-priority tasks.

    5. log = fls(num_online_cpus());

      This heuristic scales lazy_max_pages logarithmically, which is a configuration policy. Alternatively, machine learning could determine the optimal scaling function—whether linear, logarithmic, square-root, or another approach.

    6. 32UL * 1024 * 1024

      This is a configuration policy that always returns multiples of 32 MB worth of pages. This could be a configurable variable rather than a fixed magic number.
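
      Combining this with the fls(num_online_cpus()) annotation above, a sketch of how the resulting cap plausibly scales with CPU count (fls() is replaced by a builtin and the exact kernel function may differ):

        #include <stdio.h>

        #define PAGE_SIZE 4096UL   /* assumed page size */

        /* fls(): position of the most significant set bit (1-based). */
        static int fls_uint(unsigned int x)
        {
                return x ? 32 - __builtin_clz(x) : 0;
        }

        int main(void)
        {
                for (unsigned int cpus = 1; cpus <= 64; cpus *= 2) {
                        int log = fls_uint(cpus);
                        unsigned long pages = log * (32UL * 1024 * 1024 / PAGE_SIZE);
                        printf("%2u CPUs -> lazy purge threshold %lu pages (%lu MB)\n",
                               cpus, pages, pages * PAGE_SIZE >> 20);
                }
                return 0;
        }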

    1. /* * If the user wants hardware cache aligned objects then follow that * suggestion if the object is sufficiently large. * * The hardware cache alignment cannot override the specified * alignment though. If that is greater then use it. */ if (flags & SLAB_HWCACHE_ALIGN) { unsigned int ralign; ralign = cache_line_size(); while (size <= ralign / 2) ralign /= 2; align = max(align, ralign); }

      SLAB_HWCACHE_ALIGN is supplied by the user creating the cache, letting users decide whether objects should be aligned to the hardware cache line or not.

    2. if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_NORMAL)) flags |= SLAB_NO_MERGE;

      This is a configuration policy. If CONFIG_MEMCG_KMEM is enabled, disable cache merging for KMALLOC_NORMAL caches.

    1. unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE

      This chunks the read-ahead into 2 megabyte units to avoid pinning too much memory at once. LDOS could replace this with a dynamically-sized chunk to better optimize for other use cases.

    1. static bool should_skip_region(struct memblock_type *type, struct memblock_region *m, int nid, int flags) { int m_nid = memblock_get_region_node(m); /* we never skip regions when iterating memblock.reserved or physmem */ if (type != memblock_memory) return false; /* only memory regions are associated with nodes, check it */ if (nid != NUMA_NO_NODE && nid != m_nid) return true; /* skip hotpluggable memory regions if needed */ if (movable_node_is_enabled() && memblock_is_hotpluggable(m) && !(flags & MEMBLOCK_HOTPLUG)) return true; /* if we want mirror memory skip non-mirror memory regions */ if ((flags & MEMBLOCK_MIRROR) && !memblock_is_mirror(m)) return true; /* skip nomap memory unless we were asked for it explicitly */ if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m)) return true; /* skip driver-managed memory unless we were asked for it explicitly */ if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m)) return true; return false; }

      This policy determines whether a memblock region should be skipped, based on several checks that incorporate various flags. You can see this policy being used in other functions on lines 1080 and 1184 in this file; these other functions appear to be sub-functions for iterators on the memblock regions.

    2. if (memblock_bottom_up())

      This policy controls whether the memblock allocator should allocate memory from the bottom up or from the top down.

  2. Aug 2024
    1. Flag indicating whether KSM should run.

    2. flag indicating if KSM should merge same-pages across NUMA nodes

    3. KSM_ATTR(advisor_target_scan_time);

      target scan time -- used by the EWA?

    4. KSM_ATTR(advisor_max_pages_to_scan);

      max number of pages to scan per iteration

    5. KSM_ATTR(advisor_min_pages_to_scan);

      min pages to scan per iteration

    6. KSM_ATTR(advisor_max_cpu);

      max amount of cpu per iteration of the ksmd?

    7. KSM_ATTR(advisor_mode);

      the mode that ksm runs in -- only two for now.

    8. KSM_ATTR(smart_scan);

      smart scan -- not sure what this does

    9. KSM_ATTR(stable_node_chains_prune_millisecs);

      millis before a page is removed from the stable tree??

    10. KSM_ATTR(max_page_sharing);

      what is the max number of page sharings -- I think this is how many times a single page can be shared (to limit the length of the reverse map?)

    11. KSM_ATTR(use_zero_pages);

      should KSM use special zero page handling

    12. KSM_ATTR(merge_across_nodes);

      should ksm merge across NUMA nodes

    13. KSM_ATTR(run);

      flag indicating if ksm should run (0 or 1)

    14. KSM_ATTR(pages_to_scan);

      number of pages to scan per loop iteration

    15. KSM_ATTR(sleep_millisecs);

      sleep time between ksmd loop iterations

    16. configuration for targeted scan time

    17. max pages to scan during a single pass of the KSM loop

    18. min number of pages to scan during a KSM loop

    19. config for how much cpu to consume.

    20. choose the mode to run -- either using EWA or just some fixed values.

    21. configuration for how long before a page is pruned from the stable tree

    22. configuration for max page sharing -- not quite sure what this is doing

    23. Configuration to decide if KSM should use special handling for pages filled with zeros.

    24. Configuration for the number of pages to scan with each run

    25. Configuration for how often the thread should run

  3. Jun 2024
  4. May 2024
    1. if (IS_ENABLED(CONFIG_BALLOON_COMPACTION) && PageIsolated(page)) { /* raced with isolation */ unlock_page(page); continue; }

      Config enabled code

    1. bdi->dev = NULL; kref_init(&bdi->refcnt); bdi->min_ratio = 0; bdi->max_ratio = 100 * BDI_RATIO_SCALE; bdi->max_prop_frac = FPROP_FRAC_BASE; INIT_LIST_HEAD(&bdi->bdi_list); INIT_LIST_HEAD(&bdi->wb_list); init_waitqueue_head(&bdi->wb_waitq); bdi->last_bdp_sleep = jiffies;

      Initial config settings

    2. static struct attribute *bdi_dev_attrs[] = { &dev_attr_read_ahead_kb.attr, &dev_attr_min_ratio.attr, &dev_attr_min_ratio_fine.attr, &dev_attr_max_ratio.attr, &dev_attr_max_ratio_fine.attr, &dev_attr_min_bytes.attr, &dev_attr_max_bytes.attr, &dev_attr_stable_pages_required.attr, &dev_attr_strict_limit.attr, NULL, };

      Policy parameter data structure

    3. static ssize_t strict_limit_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned int strict_limit; ssize_t ret; ret = kstrtouint(buf, 10, &strict_limit); if (ret < 0) return ret; ret = bdi_set_strict_limit(bdi, strict_limit); if (!ret) ret = count; return ret; }

      Policy function to set parameters

    4. static ssize_t max_bytes_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); u64 bytes; ssize_t ret; ret = kstrtoull(buf, 10, &bytes); if (ret < 0) return ret; ret = bdi_set_max_bytes(bdi, bytes); if (!ret) ret = count; return ret; }

      Policy function to set parameters

    5. static ssize_t min_bytes_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); u64 bytes; ssize_t ret; ret = kstrtoull(buf, 10, &bytes); if (ret < 0) return ret; ret = bdi_set_min_bytes(bdi, bytes); if (!ret) ret = count;

      Policy function to set parameters

    6. static ssize_t max_ratio_fine_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned int ratio; ssize_t ret; ret = kstrtouint(buf, 10, &ratio); if (ret < 0) return ret; ret = bdi_set_max_ratio_no_scale(bdi, ratio); if (!ret) ret = count; return ret; }

      Policy function to set parameters

    7. static ssize_t max_ratio_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned int ratio; ssize_t ret; ret = kstrtouint(buf, 10, &ratio); if (ret < 0) return ret; ret = bdi_set_max_ratio(bdi, ratio); if (!ret) ret = count; return ret; }

      Policy function to set parameters

    8. static ssize_t min_ratio_fine_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { struct backing_dev_info *bdi = dev_get_drvdata(dev); unsigned int ratio; ssize_t ret; ret = kstrtouint(buf, 10, &ratio); if (ret < 0) return ret; ret = bdi_set_min_ratio_no_scale(bdi, ratio); if (!ret) ret = count; return ret; }

      Policy function to set parameters