277 Matching Annotations
  1. Oct 2024
    1. if ((gup_flags & (FOLL_PIN | FOLL_LONGTERM)) != (FOLL_PIN | FOLL_LONGTERM)) return true;

      Most likely part of policy code that makes flag-based decisions: the check passes (returns true) unless both FOLL_PIN and FOLL_LONGTERM are set together.

    2. /* user gate pages are read-only */ if (gup_flags & FOLL_WRITE) return -EFAULT; if (address > TASK_SIZE) pgd = pgd_offset_k(address); else pgd = pgd_offset_gate(mm, address); if (pgd_none(*pgd)) return -EFAULT; p4d = p4d_offset(pgd, address); if (p4d_none(*p4d)) return -EFAULT; pud = pud_offset(p4d, address); if (pud_none(*pud)) return -EFAULT; pmd = pmd_offset(pud, address); if (!pmd_present(*pmd)) return -EFAULT; pte = pte_offset_map(pmd, address); if (!pte) return -EFAULT; entry = ptep_get(pte); if (pte_none(entry)) goto unmap; *vma = get_gate_vma(mm); if (!page) goto out; *page = vm_normal_page(*vma, address, entry); if (!*page) { if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(entry))) goto unmap; *page = pte_page(entry); } ret = try_grab_page(*page, gup_flags); if (unlikely(ret)) goto unmap;

      Most of these seem like sanity checks right up until line 897, i.e., 'if (!page)', after which the code either resolves the page or jumps to the unmap label.

    1. matching the ownership tracking granularities between memcg and writeback in either direction

      Algorithmic policy: the memcg and writeback ownerships have to match. This mechanism is implemented via the mem_cgroup_track_foreign_dirty_slowpath and mem_cgroup_flush_foreign functions.

    2. #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100 #define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2

      Maximum number of loops when reclaiming memory from a cgroup that has exceeded a soft limit. These values are fixed at compile time via #define.

    3. if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)) return ep; if (parent_effective > siblings_protected && parent_usage > siblings_protected && usage > protected) { unsigned long unclaimed; unclaimed = parent_effective - siblings_protected; unclaimed *= usage - protected; unclaimed /= parent_usage - siblings_protected; ep += unclaimed; } return ep;

      Heuristic algorithm for calculating the effective protection of an individual cgroup. It checks whether the cgroup tree's memory consumption is in the normal range by referencing the cgroup, its parent, its siblings, and the rest of the tree; a sketch of the arithmetic follows.
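
      A minimal standalone sketch of the unclaimed-protection arithmetic, assuming illustrative names and sample numbers (none of these values come from the kernel):

        #include <stdio.h>

        /* Hypothetical sketch of the unclaimed-protection math from the
         * quoted code; parameter names and sample numbers are invented. */
        static unsigned long add_unclaimed(unsigned long ep, unsigned long usage,
                                           unsigned long protected_,
                                           unsigned long parent_effective,
                                           unsigned long parent_usage,
                                           unsigned long siblings_protected)
        {
            if (parent_effective > siblings_protected &&
                parent_usage > siblings_protected &&
                usage > protected_) {
                unsigned long unclaimed = parent_effective - siblings_protected;
                unclaimed *= usage - protected_;               /* share by usage */
                unclaimed /= parent_usage - siblings_protected;
                ep += unclaimed;
            }
            return ep;
        }

        int main(void)
        {
            /* parent has 1000 pages effective protection, siblings claim 400,
             * this cgroup uses 500 pages of which 200 are protected */
            printf("ep = %lu\n", add_unclaimed(200, 500, 200, 1000, 900, 400));
            return 0; /* prints ep = 560: 200 + 600 * 300 / 500 */
        }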

    4. atomic_add(nr_bytes, &old->nr_charged_bytes);

      Configuration policy for the tradeoff made when flushing per-memcg byte charges from an old object stock. The current choice gives up some limit-enforcement accuracy in exchange for avoiding CPU contention, by accumulating the leftover bytes into one centralized atomic value.

    5. #define THRESHOLDS_EVENTS_TARGET 128 #define SOFTLIMIT_EVENTS_TARGET 1024

      These #define statements are for the target number of events before the system triggers action for threshold events and soft limit events, respectively. These are memory pressure events. THRESHOLDS_EVENTS_TARGET is set to 128, meaning threshold events are processed more frequently (finer grain) than soft limit events, which are only triggered every 1024 events.
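
      As a hedged illustration of how a moving event target can gate action, the following hypothetical sketch (event_due() and all numbers are invented, not the kernel's code) fires roughly once per THRESHOLDS_EVENTS_TARGET events:

        #include <stdio.h>

        #define THRESHOLDS_EVENTS_TARGET 128

        /* Fire only once the running event count passes a moving target,
         * then advance the target by the configured interval. */
        static int event_due(unsigned long nr_events, unsigned long *next_target,
                             unsigned long interval)
        {
            if ((long)(nr_events - *next_target) > 0) {
                *next_target = nr_events + interval;
                return 1;
            }
            return 0;
        }

        int main(void)
        {
            unsigned long target = THRESHOLDS_EVENTS_TARGET, nr_events;
            int fired = 0;

            for (nr_events = 0; nr_events <= 1024; nr_events++)
                fired += event_due(nr_events, &target, THRESHOLDS_EVENTS_TARGET);
            printf("threshold checks fired %d times in 1024 events\n", fired);
            return 0; /* prints 7: roughly one firing per 129 events */
        }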

    6. if (total >= (excess >> 2) || (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) break;

      For the soft limit reclaim of cgroup memory, we continuously loop through the cgroup tree and find a victim to shrink. If we didn't reclaim enough, we try again up to a fixed cap (#define MEM_CGROUP_MAX_RECLAIM_LOOPS), and we quit early once we have reclaimed at least a quarter of the original excess (the code right-shifts the excess by 2); a sketch follows.
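
      A minimal sketch of the exit condition, with reclaim_pass() as an invented stand-in for one reclaim iteration:

        #include <stdio.h>

        #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100

        /* Stand-in: no progress for three passes, then 40 pages per pass. */
        static unsigned long reclaim_pass(int loop) { return loop < 3 ? 0 : 40; }

        int main(void)
        {
            unsigned long excess = 400, total = 0;
            int loop = 0;

            do {
                total += reclaim_pass(loop);
                loop++;
                /* stop after reclaiming a quarter of the excess, or at the cap */
                if (total >= (excess >> 2) || loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)
                    break;
            } while (1);
            printf("reclaimed %lu of %lu excess in %d loops\n", total, excess, loop);
            return 0; /* prints: reclaimed 120 of 400 excess in 6 loops */
        }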

    7. #define FLUSH_TIME (2UL*HZ)

      This is a configuration policy that sets the interval for periodic flushing of memory statistics. The flushing is performed every 2 seconds (2UL * HZ), allowing the system to balance between the cost of frequent stat updates and keeping the statistics reasonably fresh.

    8. #define MEMCG_DELAY_PRECISION_SHIFT 20 #define MEMCG_DELAY_SCALING_SHIFT 14

      These two #define statements are part of the configuration policy controlling the escalating penalty applied to memory overage. They ensure the system doesn't penalize too harshly for minor overages but still increases the delay steeply for excessive usage.

    9. /* * Don't sleep if the amount of jiffies this memcg owes us is so low * that it's not even worth doing, in an attempt to be nice to those who * go only a small amount over their memory.high value and maybe haven't * been aggressively reclaimed enough yet. */ if (penalty_jiffies <= HZ / 100) goto out;

      Both the predicate and action of an algorithmic policy are defined here. If the calculated penalty_jiffies (the delay imposed on an over-allocated cgroup) is at most 1/100th of a second, don't throttle the cgroup.

    10. #define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL*HZ)

      This line defines the maximum sleep time (delay) for a memory cgroup that has breached its memory.high limit. This is a configuration policy because it sets a fixed upper limit (2 seconds).

    1. static inline bool should_continue_reclaim(struct pglist_data *pgdat,

      This function decides whether there should be another round of reclamation: stop reclaiming once enough space is free for compaction or for the pending page allocation.

    2. static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc) { /* * If reclaim is making progress greater than 12% efficiency then * wake all the NOPROGRESS throttled tasks. */ if (sc->nr_reclaimed > (sc->nr_scanned >> 3)) { wait_queue_head_t *wqh; wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS]; if (waitqueue_active(wqh)) wake_up(wqh); return; } /* * Do not throttle kswapd or cgroup reclaim on NOPROGRESS as it will * throttle on VMSCAN_THROTTLE_WRITEBACK if there are too many pages * under writeback and marked for immediate reclaim at the tail of the * LRU. */ if (current_is_kswapd() || cgroup_reclaim(sc)) return; /* Throttle if making no progress at high prioities. */ if (sc->priority == 1 && !sc->nr_reclaimed) reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS); }

      Manages throttling: wakes NOPROGRESS-throttled tasks once reclaim is making progress above roughly 12% efficiency, and throttles the reclaimer when it makes no progress at the highest priority (inside reclaim_throttle(), throttling on no progress can also be skipped under some conditions).

    3. /* * We make inactive:active ratio decisions based on the node's * composition of memory, but a restrictive reclaim_idx or a * memory.low cgroup setting can exempt large amounts of * memory from reclaim. Neither of which are very common, so * instead of doing costly eligibility calculations of the * entire cgroup subtree up front, we assume the estimates are * good, and retry with forcible deactivation if that fails. */ if (sc->skipped_deactivate) { sc->priority = initial_priority; sc->force_deactivate = 1; sc->skipped_deactivate = 0; goto retry; } /* Untapped cgroup reserves? Don't OOM, retry. */ if (sc->memcg_low_skipped) { sc->priority = initial_priority; sc->force_deactivate = 0; sc->memcg_low_reclaim = 1; sc->memcg_low_skipped = 0; goto retry; }

      Deactivation may be skipped if the inactive ratio is not low or because of a memory.low setting. In that case, simply retry with the forcible deactivation/reclaim flags set.

    4. scan_balance = SCAN_FRACT; /* * Calculate the pressure balance between anon and file pages. * * The amount of pressure we put on each LRU is inversely * proportional to the cost of reclaiming each list, as * determined by the share of pages that are refaulting, times * the relative IO cost of bringing back a swapped out * anonymous page vs reloading a filesystem page (swappiness). * * Although we limit that influence to ensure no list gets * left behind completely: at least a third of the pressure is * applied, before swappiness. * * With swappiness at 100, anon and file have equal IO cost. */ total_cost = sc->anon_cost + sc->file_cost; anon_cost = total_cost + sc->anon_cost; file_cost = total_cost + sc->file_cost; total_cost = anon_cost + file_cost; ap = swappiness * (total_cost + 1); ap /= anon_cost + 1; fp = (200 - swappiness) * (total_cost + 1); fp /= file_cost + 1; fraction[0] = ap; fraction[1] = fp; denominator = ap + fp;

      Calculates the proportion of anon and file pages to be scanned based on swappiness. The whole function determines the number of folios to be scanned for each type; a worked example follows.
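
      A standalone sketch reproducing the cost-weighting arithmetic with sample costs (the sc-> fields become local variables; all numbers are illustrative):

        #include <stdio.h>

        int main(void)
        {
            unsigned long anon_cost = 300, file_cost = 100; /* sample refault costs */
            int swappiness = 60;
            unsigned long total_cost, ap, fp;

            total_cost = anon_cost + file_cost;
            anon_cost = total_cost + anon_cost;  /* shared baseline + own cost */
            file_cost = total_cost + file_cost;
            total_cost = anon_cost + file_cost;

            ap = (unsigned long)swappiness * (total_cost + 1) / (anon_cost + 1);
            fp = (unsigned long)(200 - swappiness) * (total_cost + 1) / (file_cost + 1);

            /* the costlier anon list receives the smaller scan fraction */
            printf("anon fraction %lu/%lu, file fraction %lu/%lu\n",
                   ap, ap + fp, fp, ap + fp);
            return 0; /* prints: anon fraction 102/437, file fraction 335/437 */
        }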

    5. if (young * MIN_NR_GENS > total) return true; if (old * (MIN_NR_GENS + 2) < total) return true;

      Decides whether aging is needed: aging is needed if the lruvec is short on cold folios. If this returns true, the scanning of this lruvec is skipped.

    6. if (!swappiness) type = LRU_GEN_FILE; else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE]) type = LRU_GEN_ANON; else if (swappiness == 1) type = LRU_GEN_FILE; else if (swappiness == 200) type = LRU_GEN_ANON; else type = get_type_to_scan(lruvec, swappiness, &tier);

      Makes a decision on which type (anon or file) to scan based on swappiness, for the generational lruvecs.

    7. */ if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested) { if (cgroup_reclaim(sc) && writeback_throttling_sane(sc)) set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags); if (current_is_kswapd()) set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags); } /* * Stall direct reclaim for IO completions if the lruvec is * node is congested. Allow kswapd to continue until it * starts encountering unqueued dirty pages or cycling through * the LRU too quickly. */ if (!current_is_kswapd() && current_may_throttle() && !sc->hibernation_mode && (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) || test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags))) reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED); if (should_continue_reclaim(pgdat, nr_node_reclaimed, sc)) goto again;

      Stall reclaim if the lruvec is congested, which is determined by whether all dirty pages in the cgroup or node were marked for writeback.

    8. static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru) { enum lru_list active_lru = inactive_lru + LRU_ACTIVE; unsigned long inactive, active; unsigned long inactive_ratio; unsigned long gb; inactive = lruvec_page_state(lruvec, NR_LRU_BASE + inactive_lru); active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru); gb = (inactive + active) >> (30 - PAGE_SHIFT); if (gb) inactive_ratio = int_sqrt(10 * gb); else inactive_ratio = 1; return inactive * inactive_ratio < active; }

      Defines/calculates a target inactive:active ratio. This function is called in various places to decide whether the active list should be shrunk and whether the inactive list could be expanded; a worked example follows.
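
      A worked example of the ratio, assuming 4 KiB pages (PAGE_SHIFT 12) and a toy integer square root in place of the kernel's int_sqrt():

        #include <stdio.h>

        static unsigned long int_sqrt(unsigned long x)
        {
            unsigned long r = 0;
            while ((r + 1) * (r + 1) <= x)
                r++;
            return r;
        }

        int main(void)
        {
            unsigned long inactive = 1UL << 18, active = 3UL << 18; /* 1 GiB : 3 GiB */
            unsigned long gb = (inactive + active) >> (30 - 12);
            unsigned long ratio = gb ? int_sqrt(10 * gb) : 1;

            /* 4 GiB total -> ratio int_sqrt(40) == 6:
             * the inactive list counts as "low" only if active > 6x inactive */
            printf("ratio %lu, inactive_is_low %d\n", ratio,
                   inactive * ratio < active);
            return 0; /* prints: ratio 6, inactive_is_low 0 */
        }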

    9. nr = xchg_nr_deferred(shrinker, shrinkctl); if (shrinker->seeks) { delta = freeable >> priority; delta *= 4; do_div(delta, shrinker->seeks); } else { /* * These objects don't require any IO to create. Trim * them aggressively under memory pressure to keep * them from causing refetches in the IO caches. */ delta = freeable / 2; } total_scan = nr >> priority; total_scan += delta; total_scan = min(total_scan, (2 * freeable)); trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, freeable, delta, total_scan, priority); /* * Normally, we should not scan less than batch_size objects in one * pass to avoid too frequent shrinker calls, but if the slab has less * than batch_size objects in total and we are really tight on memory, * we will try to reclaim all available objects, otherwise we can end * up failing allocations although there are plenty of reclaimable * objects spread over several slabs with usage less than the * batch_size. * * We detect the "tight on memory" situations by looking at the total * number of objects we want to scan (total_scan). If it is greater * than the total number of objects on slab (freeable), we must be * scanning at high prio and therefore should try to reclaim as much as * possible. */ while (total_scan >= batch_size || total_scan >= freeable) {

      Determine the number of slab objects to be scanned (total_scan) based on 'freeable' and 'priority'. Scans are done in batch sizes unless memory is tight. Batch size is set by shrinker, or defaulted to SHRINK_BATCH at line 832

    10. if ((vm_flags & VM_EXEC) && folio_is_file_lru(folio)) { nr_rotated += folio_nr_pages(folio); list_add(&folio->lru, &l_active); continue; }

      The goal is to identify executable code regions and keep them in memory under moderate memory pressure. The heuristic is to check VM_EXEC flag and whether folio is file-backed, and add identified folio back to active list.

    1. #define ZSWAP_NR_ZPOOLS 32

      More configuration policy; per the kernel comment, this is an empirical number.

    2. static unsigned int zswap_max_pool_percent = 20;

      Maximum percentage of memory that the compressed pool can occupy.

    3. static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON);

      Since zswap takes part in decisions on the swap flow, this config is a policy.

    1. if (!chunk->isolated && chunk->free_bytes == pcpu_unit_size) { struct pcpu_chunk *pos; list_for_each_entry(pos, &pcpu_chunk_lists[pcpu_free_slot], list) if (pos != chunk) { need_balance = true; break; } } else if (pcpu_should_reclaim_chunk(chunk)) { pcpu_isolate_chunk(chunk); need_balance = true; }

      When freeing memory, determine whether rebalancing is needed.

    2. static void pcpu_balance_workfn(struct work_struct *work)

      Manages free chunks and populated pages. It does so by first reclaiming all fully free chunks regardless of the number of populated pages, then rebalancing, and finally cleaning up.

    3. if (freed_page_start < freed_page_end) {

      Batching TLB flushes to amortize their cost. The threshold is also a heuristic.

    4. int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn)

      A static allocation policy is used here for the per-CPU first chunk.

    5. if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH) { reintegrate = true; break; }

      Heuristic to stop reclaiming if the number of empty populated pages falls below the threshold that would risk atomic allocation failures.

    6. static void pcpu_balance_populated(void)

      Maintain a certain amount of populated pages to satisfy atomic allocations.

    7. if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)

      Heuristic to determine when to rebalance during allocation: if the number of empty populated pages falls below the PCPU_EMPTY_POP_PAGES_LOW threshold, the rebalancer is called.

    8. if (!pcpu_check_block_hint(chunk_md, alloc_bits, align))

      As the kernel comment says, this is a heuristic: when the block hint check fails, searching this chunk is skipped.

    9. static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off, int bits)

      Updates metadata hints (used by the allocation policy).

    10. if (contig == block->contig_hint) { if (block->contig_hint_start && (!start || __ffs(start) > __ffs(block->contig_hint_start))) { /* start has a better alignment so use it */ block->contig_hint_start = start; if (start < block->scan_hint_start && block->contig_hint > block->scan_hint) block->scan_hint = 0; } else if (start > block->scan_hint_start || block->contig_hint > block->scan_hint) { /* * Knowing contig == contig_hint, update the scan_hint * if it is farther than or larger than the current * scan_hint. */ block->scan_hint_start = start; block->scan_hint = contig; }

      Heuristic for choosing the "best" starting offset when contig hints are equal.

    1. #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* * Watermark failed for this zone, but see if we can * grow this zone if it contains deferred pages. */ if (deferred_pages_enabled()) { if (_deferred_grow_zone(zone, order)) goto try_this_zone; } #endif

      This is a configuration policy. It retries the zone if the zone has deferred pages (only when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled).

    2. #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* Try again if zone has deferred pages */ if (deferred_pages_enabled()) { if (_deferred_grow_zone(zone, order)) goto try_this_zone; } #endif

      This is a configuration policy. It retries the zone if the zone has deferred pages (only when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled).

    3. batch = min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE); batch /= 4; /* We effectively *= 4 below */ if (batch < 1) batch = 1; /* * Clamp the batch to a 2^n - 1 value. Having a power * of 2 value was found to be more likely to have * suboptimal cache aliasing properties in some cases. * * For example if 2 tasks are alternately allocating * batches of pages, one task can end up with a lot * of pages of one half of the possible page colors * and the other with pages of the other colors. */ batch = rounddown_pow_of_two(batch + batch/2) - 1;

      Determines the number of pages for batch allocation based on a heuristic, clamping to a (2^n - 1) value to minimize cache aliasing issues; a worked example of the arithmetic follows below.

      but I think it may also be categorized as a configuration policy because the code execution depends on CONFIG_MMU.
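
      A worked sketch of the arithmetic for a hypothetical 4 GiB zone (rounddown_pow_of_two() is reimplemented naively for the example):

        #include <stdio.h>

        #define PAGE_SIZE 4096UL
        #define SZ_1M (1024UL * 1024UL)

        static unsigned long rounddown_pow_of_two(unsigned long n)
        {
            unsigned long p = 1;
            while (p * 2 <= n)
                p *= 2;
            return p;
        }

        int main(void)
        {
            unsigned long managed = 1UL << 20; /* 4 GiB of 4 KiB pages */
            unsigned long batch = managed >> 10;

            if (batch > SZ_1M / PAGE_SIZE)      /* min(zone >> 10, SZ_1M / PAGE_SIZE) */
                batch = SZ_1M / PAGE_SIZE;
            batch /= 4;                          /* effectively *= 4 again below */
            if (batch < 1)
                batch = 1;
            batch = rounddown_pow_of_two(batch + batch / 2) - 1;

            printf("pcp batch = %lu pages\n", batch);
            return 0; /* prints 63, a 2^n - 1 value */
        }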

    4. if (!should_skip_kasan_unpoison(gfp_flags) && kasan_unpoison_pages(page, order, init)) { /* Take note that memory was initialized by KASAN. */ if (kasan_has_integrated_init()) init = false; } else { /* * If memory tags have not been set by KASAN, reset the page * tags to ensure page_address() dereferencing does not fault. */ for (i = 0; i != 1 << order; ++i) page_kasan_tag_reset(page + i); }

      This decision is made by configuration (KASAN's mode).

    5. if (can_direct_reclaim && can_compact && (costly_order || (order > 0 && ac->migratetype != MIGRATE_MOVABLE)) && !gfp_pfmemalloc_allowed(gfp_mask))

      const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;

      The value of PAGE_ALLOC_COSTLY_ORDER is defined as 3, heuristically determining whether to try direct compaction. However, can_direct_reclaim is determined by the caller, so this is also a configuration policy.

    6. if (!can_direct_reclaim)

      bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;

      gfp_mask is a parameter passed by the caller; it specifies whether the caller is willing to reclaim pages when an allocation fails.

    7. if (!order || order > PAGE_ALLOC_COSTLY_ORDER)

      PAGE_ALLOC_COSTLY_ORDER is a heuristic threshold determining whether compaction should be retried.

    8. if (!node_isset(node, *used_node_mask)) { node_set(node, *used_node_mask); return node; } for_each_node_state(n, N_MEMORY) { /* Don't want a node to appear more than once */ if (node_isset(n, *used_node_mask)) continue; /* Use the distance array to find the distance */ val = node_distance(node, n); /* Penalize nodes under us ("prefer the next node") */ val += (n < node); /* Give preference to headless and unused nodes */ if (!cpumask_empty(cpumask_of_node(n))) val += PENALTY_FOR_NODE_WITH_CPUS; /* Slight preference for less loaded node */ val *= MAX_NUMNODES; val += node_load[n]; if (val < min_val) { min_val = val; best_node = n; } } if (best_node >= 0) node_set(best_node, *used_node_mask);

      Selects the best node based on a heuristic that takes into account node distance, CPU availability, and load. The code prefers unused nodes, penalizes closer nodes, and gives preference to less-loaded nodes. It then updates the used node mask to prevent reuse of the same node.

    9. /* * !costly requests are much more important than * __GFP_RETRY_MAYFAIL costly ones because they are de * facto nofail and invoke OOM killer to move on while * costly can fail and users are ready to cope with * that. 1/4 retries is rather arbitrary but we would * need much more detailed feedback from compaction to * make a better decision. */ if (order > PAGE_ALLOC_COSTLY_ORDER) max_retries /= 4; if (++(*compaction_retries) <= max_retries) { ret = true; goto out; }

      Adjusts the maximum number of retries for compaction based on allocation order. When handling high-order allocations, the retries are reduced to 1/4 to avoid overcommitting resources. This is a heuristic decision.

    10. if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) *no_progress_loops = 0; else (*no_progress_loops)++;

      Decision on whether to reset or increment no_progress_loops based on memory reclaim progress and allocation order. The progress is determined at runtime, with PAGE_ALLOC_COSTLY_ORDER serving as a threshold, representing a heuristic decision.

    11. if (gfp_mask & __GFP_NORETRY)

      gfp_mask is provided by the caller. This configuration determines whether the function will retry the page allocation.

    12. /* * Do not steal pages from freelists belonging to other pageblocks * i.e. orders < pageblock_order. If there are no local zones free, * the zonelists will be reiterated without ALLOC_NOFRAGMENT. */ if (order < pageblock_order && alloc_flags & ALLOC_NOFRAGMENT) min_order = pageblock_order;

      This prevents fragmentation by not letting small allocations steal from other pageblocks' freelists. When the ALLOC_NOFRAGMENT flag is set, the kernel will only steal whole pageblocks (orders >= pageblock_order).

    13. /* s390's use of memset() could override KASAN redzones. */ kasan_disable_current(); for (i = 0; i < numpages; i++) clear_highpage_kasan_tagged(page + i); kasan_enable_current();

      The function clears memory pages while avoiding KASAN redzones, which are used to detect memory corruption. KASAN is disabled around the clearing and re-enabled afterwards.

    14. /* * Check tail pages before head page information is cleared to * avoid checking PageCompound for order-0 pages. */ if (unlikely(order)) { bool compound = PageCompound(page); int i; VM_BUG_ON_PAGE(compound && compound_order(page) != order, page); if (compound) page[1].flags &= ~PAGE_FLAGS_SECOND;

      This policy handles compound/large pages. As the comment mentions, tail pages are checked before the head page's information is cleared, ensuring the order matches and each tail page is handled accordingly. This algorithmic policy prevents possible issues when freeing large memory blocks.

    15. if (order > PAGE_ALLOC_COSTLY_ORDER) { VM_BUG_ON(order != pageblock_order); movable = migratetype == MIGRATE_MOVABLE; return NR_LOWORDER_PCP_LISTS + movable; }

      Manages memory pages efficiently in the buddy allocator by organizing them into per-CPU lists by migratetype and order. Organizing by type and size lets pages be found easily during allocation and deallocation.

    16. /* * Temporary debugging check for pages not lying within a given zone. */ static int __maybe_unused bad_range(struct zone *zone, struct page *page) { if (page_outside_zone_boundaries(zone, page)) return 1; if (zone != page_zone(page)) return 1; return 0; }

      Policy that ensures pages are assigned to the correct zones within the kernel's memory management system. It checks for potential corruption or mismanagement of pages and enforces zone constraints.

    17. int ratio = sysctl_lowmem_reserve_ratio[i]; bool clear = !ratio || !zone_managed_pages(zone); unsigned long managed_pages = 0; for (j = i + 1; j < MAX_NR_ZONES; j++) { struct zone *upper_zone = &pgdat->node_zones[j]; managed_pages += zone_managed_pages(upper_zone); if (clear) zone->lowmem_reserve[j] = 0; else zone->lowmem_reserve[j] = managed_pages / ratio; }

      This is a configuration policy since it revolves around sysctl_lowmem_reserve_ratio, a configurable setting that can be changed dynamically at runtime. The for loop applies the logic, but the parameter controls how much memory to reserve; the calculation of memory reserves based on this ratio is a configuration policy.

    18. if (likely(!is_migrate_isolate(migratetype))) __mod_zone_freepage_state(zone, 1 << order, migratetype);

      The decision policy checks whether the free page count should be updated to reflect the allocation or deallocation of memory blocks of a given migratetype (whether the page is movable, unmovable, or reclaimable).

    19. static inline bool free_page_is_bad(struct page *page) { if (likely(page_expected_state(page, PAGE_FLAGS_CHECK_AT_FREE))) return false; /* Something has gone sideways, find it */ free_page_is_bad_report(page); return true; }

      This policy checks the state of a page to verify it is in the expected state. If so, it returns false and the page can be freed normally. The policy decides whether to report and handle a bad page based on its state.

    20. if (unlikely(!order && (alloc_flags & ALLOC_MIN_RESERVE) && z->watermark_boost && ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) { mark = z->_watermark[WMARK_MIN]; return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags, free_pages); }

      This is a watermark policy that determines when to ignore the watermark boost for certain types of memory allocations. Boosting temporarily raises the free-memory threshold, but this policy decides when the boost should be ignored.

    21. bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, int highest_zoneidx, unsigned int alloc_flags, long free_pages) { long min = mark; int o; /* free_pages may go negative - that's OK */ free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); if (unlikely(alloc_flags & ALLOC_RESERVES)) { /* * __GFP_HIGH allows access to 50% of the min reserve as well * as OOM. */ if (alloc_flags & ALLOC_MIN_RESERVE) { min -= min / 2; /* * Non-blocking allocations (e.g. GFP_ATOMIC) can * access more reserves than just __GFP_HIGH. Other * non-blocking allocations requests such as GFP_NOWAIT * or (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) do not get * access to the min reserve. */ if (alloc_flags & ALLOC_NON_BLOCK) min -= min / 4; } /* * OOM victims can try even harder than the normal reserve * users on the grounds that it's definitely going to be in * the exit path shortly and free memory. Any allocation it * makes during the free path will be small and short-lived. */ if (alloc_flags & ALLOC_OOM) min -= min / 2; } /* * Check watermarks for an order-0 allocation request. If these * are not met, then a high-order request also cannot go ahead * even if a suitable page happened to be free. */ if (free_pages <= min + z->lowmem_reserve[highest_zoneidx]) return false; /* If this is an order-0 request then the watermark is fine */ if (!order) return true; /* For a high-order request, check at least one suitable page is free */ for (o = order; o < NR_PAGE_ORDERS; o++) { struct free_area *area = &z->free_area[o]; int mt; if (!area->nr_free) continue; for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) { if (!free_area_empty(area, mt)) return true; } #ifdef CONFIG_CMA if ((alloc_flags & ALLOC_CMA) && !free_area_empty(area, MIGRATE_CMA)) { return true; } #endif if ((alloc_flags & (ALLOC_HIGHATOMIC|ALLOC_OOM)) && !free_area_empty(area, MIGRATE_HIGHATOMIC)) { return true; } } return false; }

      Algorithmic policy for memory allocation. Checks whether there are enough free pages to satisfy an allocation request under watermark-based criteria. This is done by comparing the number of free pages in the zone against the threshold "mark".

    1. if (oc->chosen && oc->chosen != (void *)-1UL) oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : "Memory cgroup out of memory");

      select_bad_process() found the process with the highest "badness" score; here it is terminated. This is the action side of the previously mentioned predicate (maximal "badness").

    2. #define MAX_OOM_REAP_RETRIES 10 static void oom_reap_task(struct task_struct *tsk) { int attempts = 0; struct mm_struct *mm = tsk->signal->oom_mm; /* Retry the mmap_read_trylock(mm) a few times */ while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm)) schedule_timeout_idle(HZ/10); if (attempts <= MAX_OOM_REAP_RETRIES || test_bit(MMF_OOM_SKIP, &mm->flags)) goto done;

      The reaper will try 10 times (with a delay after each attempt) to reap the memory from the process. This value configuration is defined via a #define statement. This will fail if the process doesn't give up its mmap lock.

    3. #define OOM_REAPER_DELAY (2*HZ) static void queue_oom_reaper(struct task_struct *tsk) { /* mm is already queued? */ if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) return; get_task_struct(tsk); timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0); tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY; add_timer(&tsk->oom_reaper_timer); }

      The code here queues the reaper for the selected process that will be killed to resolve the out-of-memory condition. It waits for a configured OOM_REAPER_DELAY before waking the reaper thread to reclaim the process's memory. The delay is currently set to 2*HZ, meaning we wait 2 seconds. This time period policy is defined as configuration via #define.

    4. /* * The baseline for the badness score is the proportion of RAM that each * task's rss, pagetable and swap space use. */ points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) + mm_pgtables_bytes(p->mm) / PAGE_SIZE; task_unlock(p); /* Normalize to oom_score_adj units */ adj *= totalpages / 1000; points += adj;

      The kernel declares out-of-memory when RAM is fully utilized. This code computes the predicate for choosing the maximally "bad" process to reap to fix the out-of-memory issue. The predicate is a heuristic calculation: "badness" is the proportion of RAM used by the sum of the task's RSS, page tables, and swap space, biased by oom_score_adj; a sketch follows.
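
      A standalone sketch of the badness arithmetic with made-up numbers (rss and swap counts in pages, page-table bytes converted to pages):

        #include <stdio.h>

        #define PAGE_SIZE 4096L

        int main(void)
        {
            long rss = 50000, swapents = 2000, pgtables_bytes = 8L << 20;
            long totalpages = 4L << 20;   /* 16 GiB of 4 KiB pages */
            long adj = 100;               /* task's oom_score_adj */
            long points;

            /* baseline: proportion of RAM the task ties up */
            points = rss + swapents + pgtables_bytes / PAGE_SIZE;
            /* normalize the adj bias to oom_score_adj units */
            points += adj * (totalpages / 1000);

            printf("badness points = %ld\n", points);
            return 0; /* prints: badness points = 473448 */
        }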

    1. if (s->flags & __CMPXCHG_DOUBLE) { ret = __update_freelist_fast(slab, freelist_old, counters_old, freelist_new, counters_new); } else { ret = __update_freelist_slow(slab, freelist_old, counters_old, freelist_new, counters_new); }

      This policy is very similar to annotated code below. The description is reproduced here:

      This policy determines if the system has support for compare and exchange. If so, it will use the "__update_freelist_fast()" function, which uses a compare and exchange internally. Otherwise, it will use "__update_freelist_slow()", which uses a lock (specifically a bit-based spinlock) internally.

    2. #ifdef CONFIG_SLAB_FREELIST_HARDENED encoded = (unsigned long)ptr ^ s->random ^ swab(ptr_addr); #else encoded = (unsigned long)ptr; #endif

      This policy uses the configuration setting "CONFIG_SLAB_FREELIST_HARDENED" to determine whether to obfuscate a SLUB free list pointer or not for increased security at the cost of some performance.

    3. #ifdef CONFIG_SLAB_FREELIST_HARDENED decoded = (void *)(ptr.v ^ s->random ^ swab(ptr_addr)); #else decoded = (void *)ptr.v; #endif

      This policy is based on the configuration setting indicated by "CONFIG_SLAB_FREELIST_HARDENED", which hardens the slab free list. If the free list is hardened, then free list pointers will be obfuscated; this policy just undoes the obfuscation in that case. In the case where free list pointers are not obfuscated, this function just returns the unmodified pointer value.
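
      Since XOR is its own inverse, the encode and decode steps are symmetric. A minimal userspace sketch of the transform, assuming 64-bit pointers and using the compiler's byte-swap builtin in place of the kernel's swab():

        #include <stdint.h>
        #include <stdio.h>

        /* Sketch of the hardened-freelist obfuscation: the stored value is
         * ptr ^ per-cache random ^ swab(address where the pointer lives). */
        static uint64_t encode(uint64_t ptr, uint64_t random, uint64_t ptr_addr)
        {
            return ptr ^ random ^ __builtin_bswap64(ptr_addr);
        }

        static uint64_t decode(uint64_t enc, uint64_t random, uint64_t ptr_addr)
        {
            return enc ^ random ^ __builtin_bswap64(ptr_addr); /* XOR undoes XOR */
        }

        int main(void)
        {
            uint64_t random = 0xdeadbeefcafebabeULL;
            uint64_t ptr = 0xffff888012345678ULL, addr = 0xffff888012345000ULL;
            uint64_t enc = encode(ptr, random, addr);

            printf("encoded %#llx, decoded %#llx\n",
                   (unsigned long long)enc,
                   (unsigned long long)decode(enc, random, addr));
            return 0; /* decoded value matches the original pointer */
        }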

    4. #ifdef CONFIG_SLAB_FREELIST_RANDOM /* Pre-initialize the random sequence cache */ static int init_cache_random_seq(struct kmem_cache *s) { unsigned int count = oo_objects(s->oo); int err; /* Bailout if already initialised */ if (s->random_seq) return 0; err = cache_random_seq_create(s, count, GFP_KERNEL); if (err) { pr_err("SLUB: Unable to initialize free list for %s\n", s->name); return err; } /* Transform to an offset on the set of pages */ if (s->random_seq) { unsigned int i; for (i = 0; i < count; i++) s->random_seq[i] *= s->size; } return 0; } /* Initialize each random sequence freelist per cache */ static void __init init_freelist_randomization(void) { struct kmem_cache *s; mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) init_cache_random_seq(s); mutex_unlock(&slab_mutex); } /* Get the next entry on the pre-computed freelist randomized */ static void *next_freelist_entry(struct kmem_cache *s, struct slab *slab, unsigned long *pos, void *start, unsigned long page_limit, unsigned long freelist_count) { unsigned int idx; /* * If the target page allocation failed, the number of objects on the * page might be smaller than the usual size defined by the cache. */ do { idx = s->random_seq[*pos]; *pos += 1; if (*pos >= freelist_count) *pos = 0; } while (unlikely(idx >= page_limit)); return (char *)start + idx; } /* Shuffle the single linked freelist based on a random pre-computed sequence */ static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab) { void *start; void *cur; void *next; unsigned long idx, pos, page_limit, freelist_count; if (slab->objects < 2 || !s->random_seq) return false; freelist_count = oo_objects(s->oo); pos = get_random_u32_below(freelist_count); page_limit = slab->objects * s->size; start = fixup_red_left(s, slab_address(slab)); /* First entry is used as the base of the freelist */ cur = next_freelist_entry(s, slab, &pos, start, page_limit, freelist_count); cur = setup_object(s, cur); slab->freelist = cur; for (idx = 1; idx < slab->objects; idx++) { next = next_freelist_entry(s, slab, &pos, start, page_limit, freelist_count); next = setup_object(s, next); set_freepointer(s, cur, next); cur = next; } set_freepointer(s, cur, NULL); return true; } #else static inline int init_cache_random_seq(struct kmem_cache *s) { return 0; } static inline void init_freelist_randomization(void) { } static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab) { return false; } #endif /* CONFIG_SLAB_FREELIST_RANDOM */

      This policy looks at the "CONFIG_SLAB_FREELIST_RANDOM" config setting. If it is set, the policy defines functions to randomize the free list order when creating new pages (for security purposes). If it is not set, the functions involved in randomizing the free list are empty, effectively turning off free list randomization.

    5. if (s->flags & __CMPXCHG_DOUBLE) { ret = __update_freelist_fast(slab, freelist_old, counters_old, freelist_new, counters_new); } else { unsigned long flags; local_irq_save(flags); ret = __update_freelist_slow(slab, freelist_old, counters_old, freelist_new, counters_new); local_irq_restore(flags); }

      This policy determines if the system has support for compare and exchange. If so, it will use the "__update_freelist_fast()" function, which uses a compare and exchange internally. Otherwise, it will use "__update_freelist_slow()", which uses a lock (specifically a bit-based spinlock) internally.

    1. #ifdef CONFIG_VM_EVENT_COUNTERS DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}}; EXPORT_PER_CPU_SYMBOL(vm_event_states); static void sum_vm_events(unsigned long *ret) { int cpu; int i; memset(ret, 0, NR_VM_EVENT_ITEMS * sizeof(unsigned long)); for_each_online_cpu(cpu) { struct vm_event_state *this = &per_cpu(vm_event_states, cpu); for (i = 0; i < NR_VM_EVENT_ITEMS; i++) ret[i] += this->event[i]; } } /* * Accumulate the vm event counters across all CPUs. * The result is unavoidably approximate - it can change * during and after execution of this function. */ void all_vm_events(unsigned long *ret) { cpus_read_lock(); sum_vm_events(ret); cpus_read_unlock(); } EXPORT_SYMBOL_GPL(all_vm_events); /* * Fold the foreign cpu events into our own. * * This is adding to the events on one processor * but keeps the global counts constant. */ void vm_events_fold_cpu(int cpu) { struct vm_event_state *fold_state = &per_cpu(vm_event_states, cpu); int i; for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { count_vm_events(i, fold_state->event[i]); fold_state->event[i] = 0; } } #endif /* CONFIG_VM_EVENT_COUNTERS */

      When CONFIG_VM_EVENT_COUNTERS is enabled, the kernel compiles the code that collects and manages virtual memory (vm) event counters. These counters track events like page faults, page allocations, and swap operations. The functions sum_vm_events, all_vm_events, and vm_events_fold_cpu are responsible for accumulating these statistics across all CPUs.

      If CONFIG_VM_EVENT_COUNTERS is not enabled, this code is excluded from the build. This means the kernel won't collect these detailed VM statistics, reducing memory usage and avoiding the overhead associated with tracking them.

    2. #ifdef CONFIG_NUMA int sysctl_vm_numa_stat = ENABLE_NUMA_STAT; /* zero numa counters within a zone */ static void zero_zone_numa_counters(struct zone *zone) { int item, cpu; for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) { atomic_long_set(&zone->vm_numa_event[item], 0); for_each_online_cpu(cpu) { per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_event[item] = 0; } } } /* zero numa counters of all the populated zones */ static void zero_zones_numa_counters(void) { struct zone *zone; for_each_populated_zone(zone) zero_zone_numa_counters(zone); } /* zero global numa counters */ static void zero_global_numa_counters(void) { int item; for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) atomic_long_set(&vm_numa_event[item], 0); } static void invalid_numa_statistics(void) { zero_zones_numa_counters(); zero_global_numa_counters(); } static DEFINE_MUTEX(vm_numa_stat_lock); int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos) { int ret, oldval; mutex_lock(&vm_numa_stat_lock); if (write) oldval = sysctl_vm_numa_stat; ret = proc_dointvec_minmax(table, write, buffer, length, ppos); if (ret || !write) goto out; if (oldval == sysctl_vm_numa_stat) goto out; else if (sysctl_vm_numa_stat == ENABLE_NUMA_STAT) { static_branch_enable(&vm_numa_stat_key); pr_info("enable numa statistics\n"); } else { static_branch_disable(&vm_numa_stat_key); invalid_numa_statistics(); pr_info("disable numa statistics, and clear numa counters\n"); } out: mutex_unlock(&vm_numa_stat_lock); return ret; } #endif

      This conditionally includes the NUMA-specific code based on whether the CONFIG_NUMA option is enabled during kernel compilation. Within this conditional block, the sysctl_vm_numa_stat_handler function provides a runtime configuration mechanism through the sysctl_vm_numa_stat variable. This function allows system administrators to enable or disable the collection of NUMA statistics at runtime. When the value of sysctl_vm_numa_stat changes, the function toggles the vm_numa_stat_key static branch to start or stop statistics collection and clears the NUMA counters when disabling.

    1. if (start != vma->vm_start) { error = split_vma(&vmi, vma, start, 1); if (error) return error; } if (end != vma->vm_end) { error = split_vma(&vmi, vma, end, 0); if (error) return error; }

      This is an algorithmic policy. The VMA is split at the start address if the specified start differs from the VMA's current beginning, and split at the end address if the specified end differs from the VMA's current end.

    2. if (behavior == MADV_SOFT_OFFLINE) { pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n", pfn, start); ret = soft_offline_page(pfn, MF_COUNT_INCREASED); }

      This is algorithmic policy. If this particular madvise behavior is MADV_SOFT_OFFLINE, the code performs soft offlining of a page, marking it as unavailable for future use without reclaiming it.

    3. if (folio_test_large(folio)) { int err; if (folio_estimated_sharers(folio) != 1) break; if (!folio_trylock(folio)) break; folio_get(folio); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(start_pte, ptl); start_pte = NULL; err = split_folio(folio); folio_unlock(folio); folio_put(folio); if (err) break; start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); if (!start_pte) break; arch_enter_lazy_mmu_mode(); pte--; addr -= PAGE_SIZE; continue; }

      This is algorithmic policy. This code handles the splitting of large folio pages. It checks if the folio can be safely split, attempts to split it into smaller folios, and then appropriately updates the relevant PTEs and MMU state.

    4. if (!pte_present(ptent)) { swp_entry_t entry; entry = pte_to_swp_entry(ptent); if (!non_swap_entry(entry)) { nr_swap--; free_swap_and_cache(entry); pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); } continue; }

      This is an algorithmic policy. It checks what kind of entry the non-present PTE holds and manages resources accordingly: for an ordinary swap entry, the swap slot and cache are freed and the PTE is cleared; for a poisoned entry, the PTE is cleared without freeing associated resources.

    5. static long madvise_cold(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start_addr, unsigned long end_addr) { struct mm_struct *mm = vma->vm_mm; struct mmu_gather tlb; *prev = vma; if (!can_madv_lru_vma(vma)) return -EINVAL; lru_add_drain(); tlb_gather_mmu(&tlb, mm); madvise_cold_page_range(&tlb, vma, start_addr, end_addr); tlb_finish_mmu(&tlb); return 0; }

      This is a configuration policy. Cold page ranges cannot be advised to the kernel if the VM_LOCKED, VM_PFNMAP, or VM_HUGETLB flags are set.

    1. if (dtc->wb_thresh < 2 * wb_stat_error()) { wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK); } else { wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE); dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK); }

      This is a configuration policy that performs a more accurate (but more expensive) calculation of the numbers of reclaimable and dirty pages when the writeback context's dirty threshold is below twice the maximal error of a stat counter.

    2. static long wb_min_pause(struct bdi_writeback *wb, long max_pause, unsigned long task_ratelimit, unsigned long dirty_ratelimit, int *nr_dirtied_pause

      This function is an algorithmic policy that determines the minimum throttle time for a process between consecutive writeback operations for dirty pages based on heuristics. It is used for balancing the load of the I/O subsystems so that there will not be excessive I/O operations that impact the performance of the system.

    3. if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh && !writeback_in_progress(wb)) wb_start_background_writeback(wb);

      This is a configuration policy that determines whether to start background writeout. The code here indicates that if laptop_mode, which will reduce disk activity for power saving, is not set, then when the number of dirty pages reaches the bg_thresh threshold, the system starts writing back pages.

    4. if (thresh > dirty) return 1UL << (ilog2(thresh - dirty) >> 1);

      This implements a configuration policy determining how many more pages a task may dirty before the kernel re-checks the dirty state and writes dirty pages back to disk; a worked example follows.
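
      A worked example, assuming a headroom of 10,000 pages (ilog2() is reimplemented here with a shift loop):

        #include <stdio.h>

        /* The task may dirty roughly the square root of its headroom
         * (approximated as a power of two) before the next balance check. */
        static unsigned int ilog2(unsigned long x)
        {
            unsigned int r = 0;
            while (x >>= 1)
                r++;
            return r;
        }

        int main(void)
        {
            unsigned long thresh = 100000, dirty = 90000; /* pages */

            if (thresh > dirty)
                printf("poll again after %lu dirtied pages\n",
                       1UL << (ilog2(thresh - dirty) >> 1));
            return 0; /* headroom 10000 -> ilog2 13 -> 1 << 6 = 64 pages */
        }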

    5. limit -= (limit - thresh) >> 5;

      This is a configuration policy that determines how much should the limit be updated. The limit controls the amount of dirty memory allowed in the system.

    6. if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && (!mdtc || m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { unsigned long intv; unsigned long m_intv; free_running: intv = dirty_poll_interval(dirty, thresh); m_intv = ULONG_MAX; current->dirty_paused_when = now; current->nr_dirtied = 0; if (mdtc) m_intv = dirty_poll_interval(m_dirty, m_thresh); current->nr_dirtied_pause = min(intv, m_intv); break; } /* Start writeback even when in laptop mode */ if (unlikely(!writeback_in_progress(wb))) wb_start_background_writeback(wb); mem_cgroup_flush_foreign(wb); /* * Calculate global domain's pos_ratio and select the * global dtc by default. */ if (!strictlimit) { wb_dirty_limits(gdtc); if ((current->flags & PF_LOCAL_THROTTLE) && gdtc->wb_dirty < dirty_freerun_ceiling(gdtc->wb_thresh, gdtc->wb_bg_thresh)) /* * LOCAL_THROTTLE tasks must not be throttled * when below the per-wb freerun ceiling. */ goto free_running; } dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && ((gdtc->dirty > gdtc->thresh) || strictlimit); wb_position_ratio(gdtc); sdtc = gdtc; if (mdtc) { /* * If memcg domain is in effect, calculate its * pos_ratio. @wb should satisfy constraints from * both global and memcg domains. Choose the one * w/ lower pos_ratio. */ if (!strictlimit) { wb_dirty_limits(mdtc); if ((current->flags & PF_LOCAL_THROTTLE) && mdtc->wb_dirty < dirty_freerun_ceiling(mdtc->wb_thresh, mdtc->wb_bg_thresh)) /* * LOCAL_THROTTLE tasks must not be * throttled when below the per-wb * freerun ceiling. */ goto free_running; } dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) && ((mdtc->dirty > mdtc->thresh) || strictlimit); wb_position_ratio(mdtc); if (mdtc->pos_ratio < gdtc->pos_ratio) sdtc = mdtc; }

      This is an algorithmic policy that determines whether the process can run freely or a throttle is needed to control the writeback rate, by checking whether the number of dirty pages exceeds the average of the global threshold and the background threshold.

    7. shift = dirty_ratelimit / (2 * step + 1); if (shift < BITS_PER_LONG) step = DIV_ROUND_UP(step >> shift, 8); else step = 0; if (dirty_ratelimit < balanced_dirty_ratelimit) dirty_ratelimit += step; else dirty_ratelimit -= step;

      This is a configuration policy that determines how much we should increase/decrease the dirty_ratelimit, which controls the rate that processors write dirty pages back to storage.

    8. ratelimit_pages = dirty_thresh / (num_online_cpus() * 32); if (ratelimit_pages < 16) ratelimit_pages = 16;

      This is a configuration policy that dynamically determines the rate that kernel can write dirty pages back to storage in a single writeback cycle.
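
      A worked sketch of the sizing with assumed numbers (8 CPUs, a 100,000-page dirty threshold):

        #include <stdio.h>

        int main(void)
        {
            unsigned long dirty_thresh = 100000; /* pages */
            int num_online_cpus = 8;
            unsigned long ratelimit_pages = dirty_thresh / (num_online_cpus * 32);

            if (ratelimit_pages < 16) /* floor of 16 pages per cycle */
                ratelimit_pages = 16;
            printf("ratelimit_pages = %lu\n", ratelimit_pages);
            return 0; /* prints: ratelimit_pages = 390 */
        }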

    9. t = wb_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8));

      This implements a configuration policy that determines the maximum time that the kernel should wait between writeback operations for dirty pages. This ensures that dirty pages are flushed to disk within a reasonable time frame and control the risk of data loss in case of a system crash.

    10. int balance_dirty_pages_ratelimited_flags(struct address_space *mapping, unsigned int flags)

      Line 2064 calls balance_dirty_pages() depending on the ratelimit.

    11. if (elapsed > WB_BANDWIDTH_IDLE_JIF && !atomic_read(&wb->writeback_inodes)) {

      Configuration policy: WB_BANDWIDTH_IDLE_JIF sets the idle interval used to detect that the writeback device has gone idle.

    12. static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, unsigned long dirtied, unsigned long elapsed)

      Algorithmic policy: the dynamic calculation of dirty_ratelimit, with step-size adjustments, filtering of singular points to avoid spikes, and special handling for strict limits.

    13. static void wb_update_write_bandwidth(struct bdi_writeback *wb, unsigned long elapsed, unsigned long written)

      Does bandwidth calculation (lines 1221-1229) and smoothing (lines 1234-1240).

    14. static void domain_dirty_limits(struct dirty_throttle_control *dtc)

      Heuristic-based algorithm for calculating thresh and bg_thresh. Lines 419-422 add an additional allowance if the task is determined to be a real-time task.

    15. void writeback_set_ratelimit(void)

      Dynamically calculates the ratelimit_pages based on heuristics. Thresholds are set according to configuration policies.

    16. bool wb_over_bg_thresh(struct bdi_writeback *wb)

      Although the thresholds are set by configuration policies, this function uses a series of comparisons to determine whether writeback should start.

    17. static void wb_position_ratio(struct dirty_throttle_control *dtc)

      Dynamically calculates pos_ratio based on the state of global and per-writeback (wb) dirty limits, write bandwidth, and thresholds.

    18. if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {

      This is a configuration policy to force strictlimit.

    19. /* throttle according to the chosen dtc */ dirty_ratelimit = READ_ONCE(wb->dirty_ratelimit); task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >> RATELIMIT_CALC_SHIFT; max_pause = wb_max_pause(wb, sdtc->wb_dirty); min_pause = wb_min_pause(wb, max_pause, task_ratelimit, dirty_ratelimit, &nr_dirtied_pause); if (unlikely(task_ratelimit == 0)) { period = max_pause; pause = max_pause; goto pause; } period = HZ * pages_dirtied / task_ratelimit; pause = period; if (current->dirty_paused_when) pause -= now - current->dirty_paused_when; /* * For less than 1s think time (ext3/4 may block the dirtier * for up to 800ms from time to time on 1-HDD; so does xfs, * however at much less frequency), try to compensate it in * future periods by updating the virtual time; otherwise just * do a reset, as it may be a light dirtier. */ if (pause < min_pause) { trace_balance_dirty_pages(wb, sdtc->thresh, sdtc->bg_thresh, sdtc->dirty, sdtc->wb_thresh, sdtc->wb_dirty, dirty_ratelimit, task_ratelimit, pages_dirtied, period, min(pause, 0L), start_time); if (pause < -HZ) { current->dirty_paused_when = now; current->nr_dirtied = 0; } else if (period) { current->dirty_paused_when += period; current->nr_dirtied = 0; } else if (current->nr_dirtied_pause <= pages_dirtied) current->nr_dirtied_pause += pages_dirtied; break; } if (unlikely(pause > max_pause)) { /* for occasional dropped task_ratelimit */ now += min(pause - max_pause, max_pause); pause = max_pause; } pause: trace_balance_dirty_pages(wb, sdtc->thresh, sdtc->bg_thresh, sdtc->dirty, sdtc->wb_thresh, sdtc->wb_dirty, dirty_ratelimit, task_ratelimit, pages_dirtied, period, pause, start_time); if (flags & BDP_ASYNC) { ret = -EAGAIN; break; } __set_current_state(TASK_KILLABLE); bdi->last_bdp_sleep = jiffies; io_schedule_timeout(pause); current->dirty_paused_when = now + pause; current->nr_dirtied = 0; current->nr_dirtied_pause = nr_dirtied_pause;

      This part of the code makes an algorithmic decision on how long a task should be throttled based on the rate at which it dirties memory pages; a worked example of the period computation follows.
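
      A simplified sketch of the period/pause arithmetic with invented numbers (HZ assumed to be 1000; the real code also clamps the result between min and max pause):

        #include <stdio.h>

        #define HZ 1000

        int main(void)
        {
            unsigned long pages_dirtied = 64;
            unsigned long task_ratelimit = 256;  /* allowed pages per second */
            long already_elapsed = 50;           /* jiffies since last pause */
            long period, pause;

            /* how long dirtying these pages "should" have taken */
            period = HZ * pages_dirtied / task_ratelimit;
            /* subtract think time the task already spent */
            pause = period - already_elapsed;

            printf("period %ld jiffies, sleep %ld jiffies\n", period, pause);
            return 0; /* prints: period 250 jiffies, sleep 200 jiffies */
        }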

    20. if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) && mdtc) {

      This is a configuration policy that controls whether to update the limit in the control group. The config enables support for controlling the writeback of dirty pages on a per-cgroup basis in the Linux kernel. This allows for better resource management and improved performance.

    1. if (lru_gen_enabled() && pte_young(ptep_get(pvmw.pte))) { lru_gen_look_around(&pvmw); referenced++; }

      These lines show configuration policy. The lru_gen_enabled() function checks if the multi-gen LRU is enabled, which is set with the CONFIG_LRU_GEN_ENABLED kernel configuration. This is a determining factor in how the folio referenced bit is updated, so it is a configuration policy.

    2. pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);

      This is an algorithmic policy. Before it was called, a pte was cleared. A call to this function puts PTE_MARKER_UFFD_WP on a page, if needed. Here the predicate is that the pte was cleared, and the action is the updating of the marker. (The function should not be called if the pte was not cleared.) There are more policies within the function to determine if the marker is needed.

    3. if (pending != flushed) { arch_flush_tlb_batched_pending(mm);

      This is an algorithmic policy. This code decides whether to perform a TLB flush based on whether there are pending flushes that haven't been completed.

    4. if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { if (pmdp_clear_flush_young_notify(vma, address, pvmw.pmd)) referenced++;

      This is configuration policy. The CONFIG_TRANSPARENT_HUGEPAGE kernel configuration determines how and if the page reference bit is incremented.

    5. if (!dst->anon_vma && src->anon_vma && anon_vma->num_children < 2 && anon_vma->num_active_vmas == 0) dst->anon_vma = anon_vma;

      This is an algorithmic policy. This code checks if dst currently does not have an anonymous VMA and if src does. It checks that the anon VMA being considered has fewer than two children and no active VMAs. If all conditions are met, it reuses the existing anonymous VMA rather than allocating a new one.

    1. if (should_proactive_compact_node(pgdat)) { unsigned int prev_score, score; prev_score = fragmentation_score_node(pgdat); proactive_compact_node(pgdat); score = fragmentation_score_node(pgdat); /* * Defer proactive compaction if the fragmentation * score did not go down i.e. no progress made. */ if (unlikely(score >= prev_score)) timeout = default_timeout << COMPACT_MAX_DEFER_SHIFT; }

      This part of the code checks whether proactive compaction should be triggered using the should_proactive_compact_node(pgdat) predicate. If proactive compaction is necessary, it stores the node's fragmentation score (prev_score), performs proactive compaction (proactive_compact_node(pgdat)), and then checks the new fragmentation score (score). If the score does not decrease after compaction (indicating no progress), the system defers further proactive compaction by increasing the timeout (timeout = default_timeout << COMPACT_MAX_DEFER_SHIFT). This helps avoid repeated compaction attempts without progress.

    2. #if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA) static ssize_t compact_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { int nid = dev->id; if (nid >= 0 && nid < nr_node_ids && node_online(nid)) { /* Flush pending updates to the LRU lists */ lru_add_drain_all(); compact_node(nid); } return count; } static DEVICE_ATTR_WO(compact); int compaction_register_node(struct node *node) { return device_create_file(&node->dev, &dev_attr_compact); } void compaction_unregister_node(struct node *node) { device_remove_file(&node->dev, &dev_attr_compact); } #endif /* CONFIG_SYSFS && CONFIG_NUMA */

      When these options are enabled, the functions compact_store, compaction_register_node, and compaction_unregister_node are included in the build. The compact_store function allows users to trigger memory compaction on specific NUMA nodes by writing to the sysfs interface. Inside this function, it checks if the node ID is valid and online, drains pending updates to the LRU lists with lru_add_drain_all(), and then calls compact_node(nid) to perform compaction on that node.

      Including these functions provides fine-grained control over memory management in NUMA systems. If either flag is not enabled, these functions are excluded from the build, and the per-node compaction interface is unavailable.

    3. if (prio > MIN_COMPACT_PRIORITY && compaction_deferred(zone, order)) { rc = max_t(enum compact_result, COMPACT_DEFERRED, rc); continue; } status = compact_zone_order(zone, order, gfp_mask, prio, alloc_flags, ac->highest_zoneidx, capture); rc = max(status, rc); /* The allocation should succeed, stop compacting */ if (status == COMPACT_SUCCESS) { /* * We think the allocation will succeed in this zone, * but it is not certain, hence the false. The caller * will repeat this with true if allocation indeed * succeeds in this zone. */ compaction_defer_reset(zone, order, false); break; } if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE || status == COMPACT_PARTIAL_SKIPPED)) /* * We think that allocation won't succeed in this zone * so we defer compaction there. If it ends up * succeeding after all, it will be reset. */ defer_compaction(zone, order); /* * We might have stopped compacting due to need_resched() in * async compaction, or due to a fatal signal detected. In that * case do not try further zones */ if ((prio == COMPACT_PRIO_ASYNC && need_resched()) || fatal_signal_pending(current)) break;

      The kernel attempts to compact memory across different zones, guided by several key predicates. First, if compaction for a zone has been deferred and the priority is above MIN_COMPACT_PRIORITY, the function skips the zone, marking the result as COMPACT_DEFERRED. For each zone, it calls compact_zone_order and updates the overall status. If compaction succeeds (COMPACT_SUCCESS), the kernel resets the deferred state for the zone and stops further compaction attempts, expecting allocation to succeed. If compaction is either complete or partially skipped in non-async mode, the zone is marked for deferred compaction, preventing further unnecessary attempts. If compaction is interrupted due to rescheduling or a fatal signal, the function exits early to avoid system overload.

    4. if (cc->nr_freepages > 0) { unsigned long free_pfn = release_freepages(&cc->freepages); cc->nr_freepages = 0; VM_BUG_ON(free_pfn == 0); /* The cached pfn is always the first in a pageblock */ free_pfn = pageblock_start_pfn(free_pfn); /* * Only go back, not forward. The cached pfn might have been * already reset to zone end in compact_finished() */ if (free_pfn > cc->zone->compact_cached_free_pfn) cc->zone->compact_cached_free_pfn = free_pfn; }

      This part ensures efficient release of free pages back into the system. If there are free pages to be released (cc->nr_freepages > 0), the function release_freepages is called, and the first page frame number (PFN) in the pageblock is recorded. The cached PFN is then updated to point to the start of the newly freed pageblock. A critical check (VM_BUG_ON) ensures that the process does not operate on invalid or uninitialized page ranges.

      This ensures that the cached free PFN is always updated in a conservative manner i.e. only moving backward, not forward. This avoids skipping any potential free pages. This helps optimize subsequent compaction attempts by ensuring that future operations start from the correct position.

    5. if (err) { putback_movable_pages(&cc->migratepages); /* * migrate_pages() may return -ENOMEM when scanners meet * and we want compact_finished() to detect it */ if (err == -ENOMEM && !compact_scanners_met(cc)) { ret = COMPACT_CONTENDED; goto out; } /* * If an ASYNC or SYNC_LIGHT fails to migrate a page * within the pageblock_order-aligned block and * fast_find_migrateblock may be used then scan the * remainder of the pageblock. This will mark the * pageblock "skip" to avoid rescanning in the near * future. This will isolate more pages than necessary * for the request but avoid loops due to * fast_find_migrateblock revisiting blocks that were * recently partially scanned. */ if (!pageblock_aligned(cc->migrate_pfn) && !cc->ignore_skip_hint && !cc->finish_pageblock && (cc->mode < MIGRATE_SYNC)) { cc->finish_pageblock = true; /* * Draining pcplists does not help THP if * any page failed to migrate. Even after * drain, the pageblock will not be free. */ if (cc->order == COMPACTION_HPAGE_ORDER) last_migrated_pfn = 0; goto rescan; } }

      This block of code handles errors during page migration by first resetting the state to prevent inconsistencies. If a migration error occurs, the kernel puts the movable pages back. If the error is due to memory shortage (-ENOMEM), the kernel checks whether the migration and free page scanners have met; if not, compaction is terminated early with COMPACT_CONTENDED. For ASYNC or SYNC_LIGHT modes, if migration fails within an unaligned pageblock, the kernel finishes scanning the block and marks it for skipping, which avoids rescanning in the near future; this improves efficiency by isolating more pages than strictly required while preventing fast_find_migrateblock from revisiting partially scanned blocks. For transparent huge pages (THP), it additionally skips the pcplist drain, since a pageblock with a failed migration will not become free anyway.

    6. #ifdef CONFIG_CMA /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */ if (migratetype == MIGRATE_MOVABLE && !free_area_empty(area, MIGRATE_CMA)) return COMPACT_SUCCESS; #endif

      This block allows the kernel's memory compaction algorithm to consider MIGRATE_CMA pages as fallback options for MIGRATE_MOVABLE allocations when the Contiguous Memory Allocator (CMA) is enabled (CONFIG_CMA is defined).

      When CONFIG_CMA is enabled, the compaction process becomes more flexible by allowing the use of free CMA pages if standard movable pages are unavailable, potentially improving allocation success rates.

      If CONFIG_CMA is disabled, this fallback mechanism is omitted, ensuring that CMA regions remain dedicated to their primary purpose of providing contiguous memory blocks for devices that require them.

      https://elixir.bootlin.com/linux/v6.6.42/source/mm/Kconfig#L895

    7. /* * compaction_suitable: Is this suitable to run compaction on this zone now? */ bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx) { enum compact_result compact_result; bool suitable; suitable = __compaction_suitable(zone, order, highest_zoneidx, zone_page_state(zone, NR_FREE_PAGES)); /* * fragmentation index determines if allocation failures are due to * low memory or external fragmentation * * index of -1000 would imply allocations might succeed depending on * watermarks, but we already failed the high-order watermark check * index towards 0 implies failure is due to lack of memory * index towards 1000 implies failure is due to fragmentation * * Only compact if a failure would be due to fragmentation. Also * ignore fragindex for non-costly orders where the alternative to * a successful reclaim/compaction is OOM. Fragindex and the * vm.extfrag_threshold sysctl is meant as a heuristic to prevent * excessive compaction for costly orders, but it should not be at the * expense of system stability. */ if (suitable) { compact_result = COMPACT_CONTINUE; if (order > PAGE_ALLOC_COSTLY_ORDER) { int fragindex = fragmentation_index(zone, order); if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold) { suitable = false; compact_result = COMPACT_NOT_SUITABLE_ZONE; } } } else { compact_result = COMPACT_SKIPPED; } trace_mm_compaction_suitable(zone, order, compact_result); return suitable; }

      This function decides whether to initiate memory compaction on a particular zone based on system conditions like the fragmentation index and the external fragmentation threshold (sysctl_extfrag_threshold). It determines whether compaction would be effective or should be skipped. For high-order allocations, if the fragmentation index is below the threshold, the function opts to avoid unnecessary compaction, conserving system resources and preventing potential performance degradation.

      This enhances Linux's memory management efficiency by preventing futile compaction attempts, such as in low-memory situations. Initiating compaction only when fragmentation is the primary issue helps the system optimize CPU utilization and should improve memory allocation success rates.
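
      A minimal sketch of this gate, assuming fragmentation_index() returns the [-1000, 1000] index described in the comment above; should_compact is a hypothetical wrapper name, not the kernel's function:

        /* Only compact when an allocation failure stems from external
         * fragmentation rather than from a plain lack of memory. */
        static bool should_compact(struct zone *zone, int order)
        {
                int fragindex;

                if (order <= PAGE_ALLOC_COSTLY_ORDER)
                        return true;    /* never skip cheap orders */

                fragindex = fragmentation_index(zone, order);
                /* index near 0 => low memory; near 1000 => fragmentation */
                if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
                        return false;   /* reclaim instead, don't compact */

                return true;
        }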

    8. static unsigned long fast_find_migrateblock(struct compact_control *cc) { unsigned int limit = freelist_scan_limit(cc); unsigned int nr_scanned = 0; unsigned long distance; unsigned long pfn = cc->migrate_pfn; unsigned long high_pfn; int order; bool found_block = false; /* Skip hints are relied on to avoid repeats on the fast search */ if (cc->ignore_skip_hint) return pfn; /* * If the pageblock should be finished then do not select a different * pageblock. */ if (cc->finish_pageblock) return pfn; /* * If the migrate_pfn is not at the start of a zone or the start * of a pageblock then assume this is a continuation of a previous * scan restarted due to COMPACT_CLUSTER_MAX. */ if (pfn != cc->zone->zone_start_pfn && pfn != pageblock_start_pfn(pfn)) return pfn; /* * For smaller orders, just linearly scan as the number of pages * to migrate should be relatively small and does not necessarily * justify freeing up a large block for a small allocation. */ if (cc->order <= PAGE_ALLOC_COSTLY_ORDER) return pfn; /* * Only allow kcompactd and direct requests for movable pages to * quickly clear out a MOVABLE pageblock for allocation. This * reduces the risk that a large movable pageblock is freed for * an unmovable/reclaimable small allocation. */ if (cc->direct_compaction && cc->migratetype != MIGRATE_MOVABLE) return pfn; /* * When starting the migration scanner, pick any pageblock within the * first half of the search space. Otherwise try and pick a pageblock * within the first eighth to reduce the chances that a migration * target later becomes a source. */ distance = (cc->free_pfn - cc->migrate_pfn) >> 1; if (cc->migrate_pfn != cc->zone->zone_start_pfn) distance >>= 2; high_pfn = pageblock_start_pfn(cc->migrate_pfn + distance); for (order = cc->order - 1; order >= PAGE_ALLOC_COSTLY_ORDER && !found_block && nr_scanned < limit; order--) { struct free_area *area = &cc->zone->free_area[order]; struct list_head *freelist; unsigned long flags; struct page *freepage; if (!area->nr_free) continue; spin_lock_irqsave(&cc->zone->lock, flags); freelist = &area->free_list[MIGRATE_MOVABLE]; list_for_each_entry(freepage, freelist, buddy_list) { unsigned long free_pfn; if (nr_scanned++ >= limit) { move_freelist_tail(freelist, freepage); break; } free_pfn = page_to_pfn(freepage); if (free_pfn < high_pfn) { /* * Avoid if skipped recently. Ideally it would * move to the tail but even safe iteration of * the list assumes an entry is deleted, not * reordered. */ if (get_pageblock_skip(freepage)) continue; /* Reorder to so a future search skips recent pages */ move_freelist_tail(freelist, freepage); update_fast_start_pfn(cc, free_pfn); pfn = pageblock_start_pfn(free_pfn); if (pfn < cc->zone->zone_start_pfn) pfn = cc->zone->zone_start_pfn; cc->fast_search_fail = 0; found_block = true; break; } } spin_unlock_irqrestore(&cc->zone->lock, flags); } cc->total_migrate_scanned += nr_scanned; /* * If fast scanning failed then use a cached entry for a page block * that had free pages as the basis for starting a linear scan. */ if (!found_block) { cc->fast_search_fail++; pfn = reinit_migrate_pfn(cc); } return pfn; }

      This function decides when and how to perform a fast search for a suitable migration block during memory compaction. It bails out early under several predicates: when skip hints are ignored (if (cc->ignore_skip_hint) return pfn;), when the current pageblock should be finished first (if (cc->finish_pageblock) return pfn;), when the migration pointer isn't at the start of a zone or pageblock (if (pfn != cc->zone->zone_start_pfn && pfn != pageblock_start_pfn(pfn)) return pfn;), and when the allocation order is small (if (cc->order <= PAGE_ALLOC_COSTLY_ORDER) return pfn;). It also restricts fast searches to movable pages during direct compaction for higher-order allocations (if (cc->direct_compaction && cc->migratetype != MIGRATE_MOVABLE) return pfn;). When the fast search does run, the function limits the number of pages scanned to conserve CPU resources (if (nr_scanned++ >= limit) { move_freelist_tail(freelist, freepage); break; }) and adapts to previous failures by falling back to a cached linear-scan start point (if (!found_block) { cc->fast_search_fail++; pfn = reinit_migrate_pfn(cc); }).

    9. #ifdef CONFIG_SPARSEMEM

      #ifdef CONFIG_SPARSEMEM conditionally includes or excludes the implementation of the skip_offline_sections and skip_offline_sections_reverse functions based on whether the CONFIG_SPARSEMEM option is enabled in the kernel configuration. CONFIG_SPARSEMEM enables the kernel to support sparse memory systems, where memory sections can be dynamically brought online or taken offline. These functions provide logic to skip over offline memory sections during operations like memory compaction, ensuring that the system does not attempt to access or manipulate memory that is not currently available.

      If CONFIG_SPARSEMEM is not defined, these functions return 0, effectively disabling the handling of offline memory sections.

      https://elixir.bootlin.com/linux/v6.6.42/source/mm/Kconfig#L461

    10. #ifdef CONFIG_COMPACTION

      This option acts as a policy control mechanism that determines whether the memory compaction feature is included in the kernel build. By wrapping the compaction-related code in #ifdef CONFIG_COMPACTION and #endif [Line 522], the code conditionally compiles these sections based on the configuration setting. This affects the kernel's behavior regarding memory management and fragmentation handling. CONFIG_COMPACTION is defined in https://elixir.bootlin.com/linux/v6.6.42/source/mm/Kconfig#L637 and defaults to y.

    11. static bool kcompactd_node_suitable(pg_data_t *pgdat) { int zoneid; struct zone *zone; enum zone_type highest_zoneidx = pgdat->kcompactd_highest_zoneidx; for (zoneid = 0; zoneid <= highest_zoneidx; zoneid++) { zone = &pgdat->node_zones[zoneid]; if (!populated_zone(zone)) continue; /* Allocation can already succeed, check other zones */ if (zone_watermark_ok(zone, pgdat->kcompactd_max_order, min_wmark_pages(zone), highest_zoneidx, 0)) continue; if (compaction_suitable(zone, pgdat->kcompactd_max_order, highest_zoneidx)) return true; } return false; }

      This function determines whether a memory node is suitable for compaction by the kcompactd daemon. It iterates through memory zones up to a configured highest index, checking if compaction is necessary and beneficial. The policy skips unpopulated zones and those where allocation can already succeed without compaction.

    12. static int compaction_proactiveness_sysctl_handler(struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos) { int rc, nid; rc = proc_dointvec_minmax(table, write, buffer, length, ppos); if (rc) return rc; if (write && sysctl_compaction_proactiveness) { for_each_online_node(nid) { pg_data_t *pgdat = NODE_DATA(nid); if (pgdat->proactive_compact_trigger) continue; pgdat->proactive_compact_trigger = true; trace_mm_compaction_wakeup_kcompactd(pgdat->node_id, -1, pgdat->nr_zones - 1); wake_up_interruptible(&pgdat->kcompactd_wait); } } return 0; }

      This function implements configuration and policy logic for proactive memory compaction. It supports the sysctl handler, and provides runtime configuration of compaction proactiveness. When the proactiveness setting is enabled, the function applies a policy to trigger proactive compaction across all online memory nodes. It does this by setting a trigger flag and waking up the kcompactd daemon for each node that hasn't already been triggered.

    13. if (suitable) {

      This policy is based on the allocation order, zone characteristics, and memory fragmentation level to decide whether compaction should happen. For high-order allocations, it uses a fragmentation index and a configurable threshold to decide if compaction would be beneficial. The policy aims to balance the need for contiguous memory with system performance, avoiding unnecessary compaction that could impact system stability.

    14. watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ? low_wmark_pages(zone) : min_wmark_pages(zone); watermark += compact_gap(order); return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx, ALLOC_CMA, wmark_target);

      The __compaction_suitable function determines if memory compaction is viable for a given zone based on its current watermark levels and the requested allocation order. It calculates a watermark threshold, which varies depending on the allocation size, and includes a compaction gap for efficiency. The function then checks if the zone meets this watermark using __zone_watermark_ok, considering factors like CMA pages and zone indices.

    15. static enum compact_result __compact_finished(struct compact_control *cc) { unsigned int order; const int migratetype = cc->migratetype; int ret; /* Compaction run completes if the migrate and free scanner meet */ if (compact_scanners_met(cc)) { /* Let the next compaction start anew. */ reset_cached_positions(cc->zone); /* * Mark that the PG_migrate_skip information should be cleared * by kswapd when it goes to sleep. kcompactd does not set the * flag itself as the decision to be clear should be directly * based on an allocation request. */ if (cc->direct_compaction) cc->zone->compact_blockskip_flush = true; if (cc->whole_zone) return COMPACT_COMPLETE; else return COMPACT_PARTIAL_SKIPPED; } if (cc->proactive_compaction) { int score, wmark_low; pg_data_t *pgdat; pgdat = cc->zone->zone_pgdat; if (kswapd_is_running(pgdat)) return COMPACT_PARTIAL_SKIPPED; score = fragmentation_score_zone(cc->zone); wmark_low = fragmentation_score_wmark(true); if (score > wmark_low) ret = COMPACT_CONTINUE; else ret = COMPACT_SUCCESS; goto out; } if (is_via_compact_memory(cc->order)) return COMPACT_CONTINUE; /* * Always finish scanning a pageblock to reduce the possibility of * fallbacks in the future. This is particularly important when * migration source is unmovable/reclaimable but it's not worth * special casing. */ if (!pageblock_aligned(cc->migrate_pfn)) return COMPACT_CONTINUE; /* Direct compactor: Is a suitable page free? */ ret = COMPACT_NO_SUITABLE_PAGE; for (order = cc->order; order < NR_PAGE_ORDERS; order++) { struct free_area *area = &cc->zone->free_area[order]; bool can_steal; /* Job done if page is free of the right migratetype */ if (!free_area_empty(area, migratetype)) return COMPACT_SUCCESS;

      This function implements the policy logic for when memory compaction should complete, considering whether the migrate and free scanners have met, proactive compaction status, fragmentation scores, and the availability of free pages. The function returns different COMPACT_* status codes based on these criteria, effectively deciding whether to continue compaction, consider it successful, or terminate the process. This policy balances thorough memory reorganization with efficient use of system resources during compaction.

    16. #ifdef CONFIG_COMPACTION static bool suitable_migration_source(struct compact_control *cc, struct page *page) { int block_mt; if (pageblock_skip_persistent(page)) return false; if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction) return true; block_mt = get_pageblock_migratetype(page); if (cc->migratetype == MIGRATE_MOVABLE) return is_migrate_movable(block_mt); else return block_mt == cc->migratetype; }

      This code snippet from the Linux kernel's memory compaction system implements policy and configuration logic for determining suitable migration sources during compaction. It's conditionally compiled based on the CONFIG_COMPACTION option, demonstrating configuration-dependent behavior. The suitable_migration_source function encapsulates policy decisions by considering factors such as persistent skip flags, compaction mode, and migration types. It applies different criteria for async direct compaction versus other modes, and implements specific rules for matching migration types, with special handling for movable pages.

    17. /* Similar to reclaim, but different enough that they don't share logic */ static bool too_many_isolated(struct compact_control *cc) { pg_data_t *pgdat = cc->zone->zone_pgdat; bool too_many; unsigned long active, inactive, isolated; inactive = node_page_state(pgdat, NR_INACTIVE_FILE) + node_page_state(pgdat, NR_INACTIVE_ANON); active = node_page_state(pgdat, NR_ACTIVE_FILE) + node_page_state(pgdat, NR_ACTIVE_ANON); isolated = node_page_state(pgdat, NR_ISOLATED_FILE) + node_page_state(pgdat, NR_ISOLATED_ANON); /* * Allow GFP_NOFS to isolate past the limit set for regular * compaction runs. This prevents an ABBA deadlock when other * compactors have already isolated to the limit, but are * blocked on filesystem locks held by the GFP_NOFS thread. */ if (cc->gfp_mask & __GFP_FS) { inactive >>= 3; active >>= 3; } too_many = isolated > (inactive + active) / 2; if (!too_many) wake_throttle_isolated(pgdat); return too_many; }

      This policy determines when too many pages have been isolated during kernel compaction, calculating a threshold based on the number of active and inactive pages relative to isolated pages. It handles GFP_NOFS allocations differently, allowing them to isolate past the regular limit so as to prevent an ABBA deadlock on filesystem locks.

    18. /* * Compound pages of >= pageblock_order should consistently be skipped until * released. It is always pointless to compact pages of such order (if they are * migratable), and the pageblocks they occupy cannot contain any free pages. */ static bool pageblock_skip_persistent(struct page *page) { if (!PageCompound(page)) return false; page = compound_head(page); if (compound_order(page) >= pageblock_order) return true; return false; }

      This policy optimizes memory compaction by persistently skipping large compound pages: if a compound page's order is greater than or equal to pageblock_order, it is skipped, since compacting it is pointless and the pageblocks it occupies cannot contain free pages. For non-compound pages, it returns false, meaning they shouldn't be skipped.

    19. /* Returns true if compaction should be skipped this time */ static bool compaction_deferred(struct zone *zone, int order) { unsigned long defer_limit = 1UL << zone->compact_defer_shift; if (order < zone->compact_order_failed) return false; /* Avoid possible overflow */ if (++zone->compact_considered >= defer_limit) { zone->compact_considered = defer_limit; return false; } trace_mm_compaction_deferred(zone, order); return true; }

      This policy defers memory compaction in Linux. Compaction is skipped when the requested order is at least the order that previously failed and the zone has been considered fewer than defer_limit (1UL << compact_defer_shift) times, so that repeatedly unsuccessful compaction attempts back off instead of being retried immediately.
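
      For context, a simplified sketch of the companion defer_compaction() that drives the backoff (close to the v6.6 source, with tracing omitted; COMPACT_MAX_DEFER_SHIFT is 6, so the defer window caps at 64 considerations):

        static void defer_compaction(struct zone *zone, int order)
        {
                zone->compact_considered = 0;
                zone->compact_defer_shift++;    /* double the defer window */

                if (order < zone->compact_order_failed)
                        zone->compact_order_failed = order;

                if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
                        zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
        }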

    20. if (!sysctl_compaction_proactiveness) timeout = MAX_SCHEDULE_TIMEOUT;

      If proactive compaction is disabled (sysctl_compaction_proactiveness is 0), the daemon sets its timeout to MAX_SCHEDULE_TIMEOUT, effectively sleeping until explicitly woken up. If proactive compaction is enabled, it uses the default timeout to periodically wake up and check for work.
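
      A hedged sketch of the surrounding sleep logic in kcompactd() (simplified from the v6.6 source; the daemon wakes either on the timeout or when work is explicitly requested):

        long timeout = msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC);

        if (!sysctl_compaction_proactiveness)
                timeout = MAX_SCHEDULE_TIMEOUT; /* sleep until woken */

        wait_event_freezable_timeout(pgdat->kcompactd_wait,
                                     kcompactd_work_requested(pgdat),
                                     timeout);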

    1. if (khugepaged_has_work()) { const unsigned long scan_sleep_jiffies = msecs_to_jiffies(khugepaged_scan_sleep_millisecs); if (!scan_sleep_jiffies) return; khugepaged_sleep_expire = jiffies + scan_sleep_jiffies; wait_event_freezable_timeout(khugepaged_wait, khugepaged_should_wakeup(), scan_sleep_jiffies); return; } if (hugepage_flags_enabled()) wait_event_freezable(khugepaged_wait, khugepaged_wait_event());

      The khugepaged thread sleeps when khugepaged has no work (the khugepaged_scan list is empty); if it has work, it sleeps for khugepaged_scan_sleep_millisecs between scan batches.

    2. if (hugepage_flags_enabled()) { if (!khugepaged_thread) khugepaged_thread = kthread_run(khugepaged, NULL,

      The hugepage_flags_enabled() flag turns on huge pages for all mappings. transparent_hugepage_flags is disabled by default; if enabled, the khugepaged thread will start running.

    1. if (si->cluster_info) { if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) goto scan; } else if (unlikely(!si->cluster_nr--)) {

      Algorithmic policy decision: if the device is an SSD, use an SSD wear-leveling-friendly algorithm; otherwise, use the HDD algorithm, which minimizes seek times. Potentially, certain access patterns may make one of these algorithms less effective (e.g., in the case of wear leveling, when the swap is constantly full).

    2. scan_base = offset = si->lowest_bit; last_in_cluster = offset + SWAPFILE_CLUSTER - 1; /* Locate the first empty (unaligned) cluster */ for (; last_in_cluster <= si->highest_bit; offset++) { if (si->swap_map[offset]) last_in_cluster = offset + SWAPFILE_CLUSTER; else if (offset == last_in_cluster) { spin_lock(&si->lock); offset -= SWAPFILE_CLUSTER - 1; si->cluster_next = offset; si->cluster_nr = SWAPFILE_CLUSTER - 1; goto checks; } if (unlikely(--latency_ration < 0)) { cond_resched(); latency_ration = LATENCY_LIMIT; } }

      Here, (when using HDDs), a policy is implemented that places the swapped page in the first available slot. This is supposed to reduce seek time in spinning drives, as it encourages having swap entries near each other.

    3. /* * Even if there's no free clusters available (fragmented), * try to scan a little more quickly with lock held unless we * have scanned too many slots already. */ if (!scanned_many) { unsigned long scan_limit; if (offset < scan_base) scan_limit = scan_base; else scan_limit = si->highest_bit; for (; offset <= scan_limit && --latency_ration > 0; offset++) { if (!si->swap_map[offset]) goto checks; } }

      Here we have a configuration policy where we do another smaller scan as long as we haven't exhausted our latency_ration. Another alternative could be yielding early in anticipation that we aren't going to find a free slot.

    4. /* * select a random position to start with to help wear leveling * SSD */ for_each_possible_cpu(cpu) { per_cpu(*p->cluster_next_cpu, cpu) = get_random_u32_inclusive(1, p->highest_bit); }

      An algorithmic (random) policy is used here to spread swap pages around an SSD to help with wear leveling instead of writing to the same area of an SSD often.

    5. cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);

      A policy decision is made to add a cluster to the discard list on a first-come, first-served basis. However, this approach could be enhanced by prioritizing certain clusters higher on the list based on their 'importance.' This 'importance' could be defined by how closely a cluster is related to other pages in the swap. By doing so, the system can reduce seek time as mentioned on line 817.

    6. if (unlikely(--latency_ration < 0)) { cond_resched(); latency_ration = LATENCY_LIMIT; scanned_many = true; } if (swap_offset_available_and_locked(si, offset)) goto checks; } offset = si->lowest_bit; while (offset < scan_base) { if (unlikely(--latency_ration < 0)) { cond_resched(); latency_ration = LATENCY_LIMIT; scanned_many = true; }

      Here, a policy decision is made to fully replenish the latency_ration with the LATENCY_LIMIT and then yield back to the scheduler if we've exhausted it. This makes it so that when scheduled again, we have the full LATENCY_LIMIT to do a scan. Alternative policies could grow/shrink this to find a better heuristic instead of fully replenishing each time.

      Marked config/value as we're replacing latency_ration with a compile-time-defined limit.

    7. while (scan_swap_map_ssd_cluster_conflict(si, offset)) { /* take a break if we already got some slots */ if (n_ret) goto done; if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) goto scan;

      Here, a policy decision is made to stop scanning if some slots were already found. Other policy decisions could be made to keep scanning or take into account how long the scan took or how many pages were found.

    8. else if (!cluster_list_empty(&si->discard_clusters)) { /* * we don't have free cluster but have some clusters in * discarding, do discard now and reclaim them, then * reread cluster_next_cpu since we dropped si->lock */ swap_do_scheduled_discard(si); *scan_base = this_cpu_read(*si->cluster_next_cpu); *offset = *scan_base; goto new_cluster;

      This algorithmic policy discards + reclaims pages as-needed whenever there is no free cluster. Other policies could be explored that do this preemptively in order to avoid the cost of doing it here.

    9. #ifdef CONFIG_THP_SWAP

      This is a build-time flag that configures how hugepages are handled when swapped. When defined, they are swapped out in one piece; without it, the kernel splits them into smaller units and swaps those units.

    10. if (swap_flags & SWAP_FLAG_DISCARD_ONCE) p->flags &= ~SWP_PAGE_DISCARD; else if (swap_flags & SWAP_FLAG_DISCARD_PAGES) p->flags &= ~SWP_AREA_DISCARD

      This is a configuration policy decision where a sysadmin can pass flags to sys_swapon() to control the behavior discards are handled. If DISCARD_ONCE is set, a flag which "discard[s] swap area at swapon-time" is unset, and if DISCARD_PAGES is set, a flag which "discard[s] page-clusters after use" is unset.
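
      A hedged userspace sketch of how such flags reach sys_swapon(); the flag values mirror the kernel's definitions, the device path is a placeholder, and the fallback #defines are only for libcs that lack them:

        #include <stdio.h>
        #include <sys/swap.h>

        #ifndef SWAP_FLAG_DISCARD
        #define SWAP_FLAG_DISCARD       0x10000 /* mirrors kernel value */
        #endif
        #ifndef SWAP_FLAG_DISCARD_ONCE
        #define SWAP_FLAG_DISCARD_ONCE  0x20000
        #endif
        #ifndef SWAP_FLAG_DISCARD_PAGES
        #define SWAP_FLAG_DISCARD_PAGES 0x40000
        #endif

        int main(void)
        {
                /* Illustrative: request per-page-cluster discards only
                 * ("/dev/sdb2" is a placeholder swap device). */
                if (swapon("/dev/sdb2",
                           SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_PAGES))
                        perror("swapon");
                return 0;
        }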

    11. /* * Should not even be attempting cluster allocations when huge * page swap is disabled. Warn and fail the allocation. */ if (!IS_ENABLED(CONFIG_THP_SWAP)) { VM_WARN_ON_ONCE(1); return 0; }

      The policy dictates that if huge page swapping is disabled at compile time (via the CONFIG_THP_SWAP configuration option), the kernel should not attempt to allocate swap clusters for huge pages and should fail the operation with a warning.

    12. #ifdef CONFIG_HIBERNATION

      If CONFIG_HIBERNATION is defined, the kernel includes code to write the entire system memory state to the swapfile before powering down the system. This involves allocating swap slots for the entire memory state and ensuring that the data is properly stored.

    13. if (count == SWAP_HAS_CACHE && !swap_page_trans_huge_swapped(p, entry)) __try_to_reclaim_swap(p, swp_offset(entry), TTRS_UNMAPPED | TTRS_FULL);

      This policy optimizes memory usage by attempting to free both the swap entry and the page cache when they are no longer needed. The system avoids keeping cached pages in memory unnecessarily, thus freeing up resources for other processes.

    14. #define SWAPFILE_CLUSTER 256

      This configuration defines the size of the swap cluster, which is the number of swap pages that the system tries to allocate as a unit. It is set to 256.

    15. /* * Use percpu scan base for SSD to reduce lock contention on * cluster and swap cache. For HDD, sequential access is more * important. */ if (si->flags & SWP_SOLIDSTATE) scan_base = this_cpu_read(*si->cluster_next_cpu); else scan_base = si->cluster_next; offset = scan_base;

      Algorithmic Policy: Check if the device is HDD or SSD. If SSD, allow different cpus to scan different parts of the swap map. If HDD, check for slots sequentially.

    16. /* reuse swap entry of cache-only swap if not busy. */ if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { int swap_was_freed; unlock_cluster(ci); spin_unlock(&si->lock); swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); spin_lock(&si->lock); /* entry was freed successfully, try to use this again */ if (swap_was_freed) goto checks; goto scan; /* check next one */ }

      This is an algorithmic policy: if the system is running out of swap space, try to reclaim entries in the cache-only swap that are not busy.

    17. #define LATENCY_LIMIT 256

      LATENCY_LIMIT is an upper bound on how many slots scan_swap_map_slots can check before it yields the CPU back. It is set to 256. This policy prevents the function from monopolizing the CPU.
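
      A minimal sketch of the cooperative-yield pattern this limit enforces in scan_swap_map_slots() (illustrative, not verbatim):

        int latency_ration = LATENCY_LIMIT;

        for (; offset <= si->highest_bit; offset++) {
                if (unlikely(--latency_ration < 0)) {
                        cond_resched();                 /* give up the CPU */
                        latency_ration = LATENCY_LIMIT; /* fresh budget */
                }
                /* ... examine si->swap_map[offset] ... */
        }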

    18. node = numa_node_id();

      This algorithmic policy is choosing which node to look at for swap devices.

      In particular, it only tries swapping pages with devices that are at the same NUMA node as the core, and not any other.

    19. plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) { /* requeue si to after same-priority siblings */ plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);

      This is an algorithmic policy that decides which of the swap devices should be used to get swap pages from, within that NUMA node. What's interesting here is that once a device is seen, it is requeued, so that the burden is spread evenly over all the devices with the same priority on that node.

    20. n_goal = min3((long)n_goal, (long)SWAP_BATCH, avail_pgs);

      SWAP_BATCH: This is a configuration policy (macro) that caps the maximum number of entries that can be swapped in a single operation. It's set to 64. This is done to limit the amount of swapping that happens at once.

    1. static const struct file_operations proc_page_owner_operations = { .read = read_page_owner, .llseek = lseek_page_owner, };

      Configuration policy that defines an interface for user space to access kernel page-ownership data, configuring how users can read and seek through the page information.

    2. /* * Some pages could be missed by concurrent allocation or free, * because we don't hold the zone lock. */ if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags)) goto ext_put_continue; /* * Although we do have the info about past allocation of free * pages, it's not relevant for current memory usage. */ if (!test_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags)) goto ext_put_continue; page_owner = get_page_owner(page_ext); /* * Don't print "tail" pages of high-order allocations as that * would inflate the stats. */ if (!IS_ALIGNED(pfn, 1 << page_owner->order)) goto ext_put_continue;

      Policy that ensures pages have ownership info and are valid and allocated.

      The first if statement skips pages without owners; the second skips pages that are not currently allocated, since information about past allocations of free pages is not relevant to current memory usage; and the third skips tail pages of high-order allocations so they don't inflate the stats.

    3. /* Find a valid PFN or the start of a MAX_ORDER_NR_PAGES area */ while (!pfn_valid(pfn) && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) pfn++;

      This policy ensures efficient traversal of large memory areas, speeding up memory management: it avoids redundant checks and ensures only valid memory pages are processed.

    1. static unsigned long count_shadow_nodes(struct shrinker *shrinker, struct shrink_control *sc)

      The policy in this function governs the management of shadow nodes in the Linux kernel's page cache. It dynamically calculates a maximum allowable number of shadow nodes based on the available memory, aiming to maintain an optimal balance between memory usage and the ability to detect page refaults. The policy sets this limit at approximately 1/8th of the total possible pages, adapting to different memory configurations and pressure scenarios. When the current number of shadow nodes exceeds this calculated maximum, the policy determines how many nodes should be reclaimed.
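
      A hedged sketch of the sizing rule (simplified from count_shadow_nodes() in mm/workingset.c; with the default XA_CHUNK_SHIFT of 6, pages >> (XA_CHUNK_SHIFT - 3) equals pages/8, matching the ~1/8 figure above):

        /* Each xarray node packs up to 64 shadow entries, so capping the
         * node count at pages >> 3 bounds shadow overhead to ~1/8 of pages. */
        max_nodes = pages >> (XA_CHUNK_SHIFT - 3);

        if (nodes <= max_nodes)
                return 0;               /* under the limit: nothing to shrink */
        return nodes - max_nodes;       /* ask the shrinker to reclaim excess */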

    2. if (lru_gen_enabled()) return lru_gen_eviction(folio);

      When enabled via lru_gen_enabled(), it alters how the kernel tracks page access patterns and manages evictions. This policy introduces the concept of generation-based tracking, where pages are grouped into generations based on their access time. It uses functions like lru_gen_eviction() and lru_gen_refault() to handle page evictions and refaults respectively. This approach allows for more nuanced decisions in page reclaim, potentially improving memory utilization and reducing I/O operations by better identifying truly cold pages and protecting hot pages from premature eviction.

    1. if (vma->vm_flags & VM_PFNMAP) { int err = 1; if (ops->pte_hole) err = ops->pte_hole(start, end, -1, walk);
    1. unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len, bool need_rmap_locks)

      Simple in concept, but optimized by the size of the page-table entries being moved; the larger PUD- and PMD-level moves are enabled by checking CONFIG_HAVE_MOVE_PUD and CONFIG_HAVE_MOVE_PMD.

    1. static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype) { int ret; bool found_slot = false; struct memory_tier *memtier, *new_memtier; int adistance = memtype->adistance; unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE; lockdep_assert_held_once(&memory_tier_lock); adistance = round_down(adistance, memtier_adistance_chunk_size); /* * If the memtype is already part of a memory tier, * just return that. */ if (!list_empty(&memtype->tier_sibiling)) { list_for_each_entry(memtier, &memory_tiers, list) { if (adistance == memtier->adistance_start) return memtier; } WARN_ON(1); return ERR_PTR(-EINVAL); } list_for_each_entry(memtier, &memory_tiers, list) { if (adistance == memtier->adistance_start) { goto link_memtype; } else if (adistance < memtier->adistance_start) { found_slot = true; break; } } new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); if (!new_memtier) return ERR_PTR(-ENOMEM); new_memtier->adistance_start = adistance; INIT_LIST_HEAD(&new_memtier->list); INIT_LIST_HEAD(&new_memtier->memory_types); if (found_slot) list_add_tail(&new_memtier->list, &memtier->list); else list_add_tail(&new_memtier->list, &memory_tiers); new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS; new_memtier->dev.bus = &memory_tier_subsys; new_memtier->dev.release = memory_tier_device_release; new_memtier->dev.groups = memtier_dev_groups; ret = device_register(&new_memtier->dev); if (ret) { list_del(&new_memtier->list); put_device(&new_memtier->dev); return ERR_PTR(ret); } memtier = new_memtier; link_memtype: list_add(&memtype->tier_sibiling, &memtier->memory_types); return memtier; }

      This function decides whether to assign a memory device to an existing memory tier or create a new one. The decision is based on the memory device's distance and the structure of existing memory tiers.

    2. static void establish_demotion_targets(void) { struct memory_tier *memtier; struct demotion_nodes *nd; int target = NUMA_NO_NODE, node; int distance, best_distance; nodemask_t tier_nodes, lower_tier; lockdep_assert_held_once(&memory_tier_lock); if (!node_demotion) return; disable_all_demotion_targets(); for_each_node_state(node, N_MEMORY) { best_distance = -1; nd = &node_demotion[node]; memtier = __node_get_memory_tier(node); if (!memtier || list_is_last(&memtier->list, &memory_tiers)) continue; /* * Get the lower memtier to find the demotion node list. */ memtier = list_next_entry(memtier, list); tier_nodes = get_memtier_nodemask(memtier); /* * find_next_best_node, use 'used' nodemask as a skip list. * Add all memory nodes except the selected memory tier * nodelist to skip list so that we find the best node from the * memtier nodelist. */ nodes_andnot(tier_nodes, node_states[N_MEMORY], tier_nodes); /* * Find all the nodes in the memory tier node list of same best distance. * add them to the preferred mask. We randomly select between nodes * in the preferred mask when allocating pages during demotion. */ do { target = find_next_best_node(node, &tier_nodes); if (target == NUMA_NO_NODE) break; distance = node_distance(node, target); if (distance == best_distance || best_distance == -1) { best_distance = distance; node_set(target, nd->preferred); } else { break; } } while (1); } /* * Promotion is allowed from a memory tier to higher * memory tier only if the memory tier doesn't include * compute. We want to skip promotion from a memory tier, * if any node that is part of the memory tier have CPUs. * Once we detect such a memory tier, we consider that tier * as top tiper from which promotion is not allowed. */ list_for_each_entry_reverse(memtier, &memory_tiers, list) { tier_nodes = get_memtier_nodemask(memtier); nodes_and(tier_nodes, node_states[N_CPU], tier_nodes); if (!nodes_empty(tier_nodes)) { /* * abstract distance below the max value of this memtier * is considered toptier. */ top_tier_adistance = memtier->adistance_start + MEMTIER_CHUNK_SIZE - 1; break; } } /* * Now build the lower_tier mask for each node collecting node mask from * all memory tier below it. This allows us to fallback demotion page * allocation to a set of nodes that is closer the above selected * perferred node. */ lower_tier = node_states[N_MEMORY]; list_for_each_entry(memtier, &memory_tiers, list) { /* * Keep removing current tier from lower_tier nodes, * This will remove all nodes in current and above * memory tier from the lower_tier mask. */ tier_nodes = get_memtier_nodemask(memtier); nodes_andnot(lower_tier, lower_tier, tier_nodes); memtier->lower_tier_mask = lower_tier; } }

      This function contains algorithmic policies that select demotion targets based on node distance, memory tier relationships, and CPU presence in memory nodes, and adjusts the demotion strategy based on runtime system conditions.

    1. enum get_ksm_page_flags { GET_KSM_PAGE_NOLOCK, GET_KSM_PAGE_LOCK, GET_KSM_PAGE_TRYLOCK };

      This defines the locking behavior when accessing a KSM page. The caller can choose between no locking, locking, or trying to lock without blocking, affecting how the page is accessed and modified.

    2. static struct page *get_ksm_page(struct ksm_stable_node *stable_node, enum get_ksm_page_flags flags) { struct page *page; void *expected_mapping; unsigned long kpfn; expected_mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM); again: kpfn = READ_ONCE(stable_node->kpfn); /* Address dependency. */ page = pfn_to_page(kpfn); if (READ_ONCE(page->mapping) != expected_mapping) goto stale; /* * We cannot do anything with the page while its refcount is 0. * Usually 0 means free, or tail of a higher-order page: in which * case this node is no longer referenced, and should be freed; * however, it might mean that the page is under page_ref_freeze(). * The __remove_mapping() case is easy, again the node is now stale; * the same is in reuse_ksm_page() case; but if page is swapcache * in folio_migrate_mapping(), it might still be our page, * in which case it's essential to keep the node. */ while (!get_page_unless_zero(page)) { /* * Another check for page->mapping != expected_mapping would * work here too. We have chosen the !PageSwapCache test to * optimize the common case, when the page is or is about to * be freed: PageSwapCache is cleared (under spin_lock_irq) * in the ref_freeze section of __remove_mapping(); but Anon * page->mapping reset to NULL later, in free_pages_prepare(). */ if (!PageSwapCache(page)) goto stale; cpu_relax(); } if (READ_ONCE(page->mapping) != expected_mapping) { put_page(page); goto stale; } if (flags == GET_KSM_PAGE_TRYLOCK) { if (!trylock_page(page)) { put_page(page); return ERR_PTR(-EBUSY); } } else if (flags == GET_KSM_PAGE_LOCK) lock_page(page); if (flags != GET_KSM_PAGE_NOLOCK) { if (READ_ONCE(page->mapping) != expected_mapping) { unlock_page(page); put_page(page); goto stale; } } return page; stale: /* * We come here from above when page->mapping or !PageSwapCache * suggests that the node is stale; but it might be under migration. * We need smp_rmb(), matching the smp_wmb() in folio_migrate_ksm(), * before checking whether node->kpfn has been changed. */ smp_rmb(); if (READ_ONCE(stable_node->kpfn) != kpfn) goto again; remove_node_from_stable_tree(stable_node); return NULL; }

      This function checks whether a KSM page is still valid by verifying its mapping, reference count, and handling page migration. It dynamically decides whether to return the page, retry, or mark the stable node as stale.

    3. struct page *ksm_might_need_to_copy(struct page *page, struct vm_area_struct *vma, unsigned long address) { struct folio *folio = page_folio(page); struct anon_vma *anon_vma = folio_anon_vma(folio); struct page *new_page; if (PageKsm(page)) { if (page_stable_node(page) && !(ksm_run & KSM_RUN_UNMERGE)) return page; /* no need to copy it */ } else if (!anon_vma) { return page; /* no need to copy it */ } else if (page->index == linear_page_index(vma, address) && anon_vma->root == vma->anon_vma->root) { return page; /* still no need to copy it */ } if (PageHWPoison(page)) return ERR_PTR(-EHWPOISON); if (!PageUptodate(page)) return page; /* let do_swap_page report the error */ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (new_page && mem_cgroup_charge(page_folio(new_page), vma->vm_mm, GFP_KERNEL)) { put_page(new_page); new_page = NULL; } if (new_page) { if (copy_mc_user_highpage(new_page, page, address, vma)) { put_page(new_page); memory_failure_queue(page_to_pfn(page), 0); return ERR_PTR(-EHWPOISON); } SetPageDirty(new_page); __SetPageUptodate(new_page); __SetPageLocked(new_page); #ifdef CONFIG_SWAP count_vm_event(KSM_SWPIN_COPY); #endif } return new_page; }

      This function decides whether a page needs to be copied based on its status as a KSM page, its association with an anonymous VMA, and its hardware-poisoned or up-to-date status.

    4. static struct page *try_to_merge_two_pages(struct ksm_rmap_item *rmap_item, struct page *page, struct ksm_rmap_item *tree_rmap_item, struct page *tree_page) { int err; err = try_to_merge_with_ksm_page(rmap_item, page, NULL); if (!err) { err = try_to_merge_with_ksm_page(tree_rmap_item, tree_page, page); /* * If that fails, we have a ksm page with only one pte * pointing to it: so break it. */ if (err) break_cow(rmap_item); } return err ? NULL : page; }

      This function dynamically decides when two pages should be merged into a KSM page. If the merge fails, it applies a fallback policy by breaking the COW relationship, ensuring that memory management remains consistent.

    5. static int try_to_merge_with_ksm_page(struct ksm_rmap_item *rmap_item, struct page *page, struct page *kpage) { struct mm_struct *mm = rmap_item->mm; struct vm_area_struct *vma; int err = -EFAULT; mmap_read_lock(mm); vma = find_mergeable_vma(mm, rmap_item->address); if (!vma) goto out; err = try_to_merge_one_page(vma, page, kpage); if (err) goto out; /* Unstable nid is in union with stable anon_vma: remove first */ remove_rmap_item_from_tree(rmap_item); /* Must get reference to anon_vma while still holding mmap_lock */ rmap_item->anon_vma = vma->anon_vma; get_anon_vma(vma->anon_vma); out: mmap_read_unlock(mm); trace_ksm_merge_with_ksm_page(kpage, page_to_pfn(kpage ? kpage : page), rmap_item, mm, err); return err; }

      This function checks whether the VMA is mergeable and proceeds based on that condition, and it dynamically removes the reverse-mapping item from the unstable tree after a successful merge.

    6. static int ksmd_should_run(void) { return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.slot.mm_node); }

      This function checks whether the ksmd should run based on the KSM_RUN_MERGE flag and whether there are any memory regions to process. The decision to enable or disable memory merging is controlled by the ksm_run configuration.

    7. static unsigned int ksm_thread_sleep_millisecs = 20;

      This variable sets the delay (in milliseconds) for the KSM daemon to sleep between scanning batches. It controls the scanning frequency, balancing KSM's memory merging process with system performance.

    8. static unsigned int ksm_thread_pages_to_scan = 100;

      This variable determines the number of pages the KSM daemon will scan in one batch. It controls the performance and efficiency of KSM's scanning process, balancing between thoroughness and system load.

    9. static int ksm_max_page_sharing = 256;

      This variable controls the maximum number of page slots that can share a stable node in the stable tree for KSM. It defines a limit for page sharing, affecting how KSM consolidates memory pages.

    10. static unsigned int ksm_stable_node_chains_prune_millisecs = 2000;

      This variable sets the delay (in milliseconds) for pruning stale stable_node_dups in the stable_node_chains. It controls how often KSM will perform cleanup operations on stale nodes.

    1. randomize_stack_top

      This function uses a configuration policy to enable Address Space Layout Randomization (ASLR) for a specific process if the PF_RANDOMIZE flag is set. It randomizes the position of the process's stack to help defend against certain attacks by making memory addresses unpredictable.
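
      A simplified sketch of the function (close to mm/util.c in recent kernels): randomization applies only when PF_RANDOMIZE is set, and the random offset is masked and page-aligned before being applied above or below the stack top depending on the stack growth direction:

        unsigned long randomize_stack_top(unsigned long stack_top)
        {
                unsigned long random_variable = 0;

                if (current->flags & PF_RANDOMIZE) {
                        random_variable = get_random_long();
                        random_variable &= STACK_RND_MASK;
                        random_variable <<= PAGE_SHIFT;
                }
        #ifdef CONFIG_STACK_GROWSUP
                return PAGE_ALIGN(stack_top) + random_variable;
        #else
                return PAGE_ALIGN(stack_top) - random_variable;
        #endif
        }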

    1. /* Soft offline could migrate non-LRU movable pages */ if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page)) return true;

      The code includes a policy for soft offline pages (MF_SOFT_OFFLINE). This is a feature where the kernel attempts to migrate pages to avoid using faulty memory areas. The policy allows non-LRU movable pages (pages that aren’t in the Least Recently Used list) to be migrated.

    2. if (TestSetPageHWPoison(p)) { pr_err("%#lx: already hardware poisoned\n", pfn); res = -EHWPOISON; if (flags & MF_ACTION_REQUIRED) res = kill_accessing_process(current, pfn, flags); if (flags & MF_COUNT_INCREASED) put_page(p); goto unlock_mutex; }

      The page isolation policy ensures that faulty pages are marked as hardware poisoned (using TestSetPageHWPoison). The kernel checks if a page has already been isolated (marked as poisoned) before taking any further action. This prevents faulty pages from being reused by the system and isolates them to avoid further corruption.

    1. if (movable_node_is_enabled()) {

      This policy, as the comment states, will ignore the kernelcore and movablecore options if movable nodes are enabled (skipping the logic below this if statement and jumping to "out2" instead). The logic for the "movable_node_is_enabled()" function is in "memory_hotplug.h".

    2. if (page_poisoning_enabled() || (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) && debug_pagealloc_enabled())) {

      This check looks at flags and configuration settings to determine if page poisoning should be enabled.

    3. if (descending)

      This decides whether to iterate forward or backward through the zones passed into the function. The variable "descending" is determined by the "arch_has_descending_max_zone_pfns()" function on line 1811, which is determined from configuration options.

    4. static bool __meminit defer_init(int nid, unsigned long pfn, unsigned long end_pfn) { static unsigned long prev_end_pfn, nr_initialised; if (early_page_ext_enabled()) return false; /* * prev_end_pfn static that contains the end of previous zone * No need to protect because called very early in boot before smp_init. */ if (prev_end_pfn != end_pfn) { prev_end_pfn = end_pfn; nr_initialised = 0; } /* Always populate low zones for address-constrained allocations */ if (end_pfn < pgdat_end_pfn(NODE_DATA(nid))) return false; if (NODE_DATA(nid)->first_deferred_pfn != ULONG_MAX) return true; /* * We start only with one section of pages, more pages are added as * needed until the rest of deferred pages are initialized. */ nr_initialised++; if ((nr_initialised > PAGES_PER_SECTION) && (pfn & (PAGES_PER_SECTION - 1)) == 0) { NODE_DATA(nid)->first_deferred_pfn = pfn; return true; } return false; }

      This policy decides whether to defer the remaining initialization; it's used on line 885 of this page. Note that this function definition relies on CONFIG_DEFERRED_STRUCT_PAGE_INIT being defined.

    5. if (mminit_loglevel < MMINIT_VERIFY)

      This policy controls whether the function will print information about the zonelist. This decision is determined by the value of the "mminit_level" enum in "mm/internal.h".

    6. if (overcommit_policy == OVERCOMMIT_NEVER)

      This policy controls the memory batch size based on the overcommit policy, choosing a smaller batch size when the policy is OVERCOMMIT_NEVER.
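
      A hedged sketch of the batch computation (simplified from mm_compute_batch(); the /256 vs /64 divisors follow the v6.6 source, roughly 0.4% vs 1.6% of memory per CPU):

        static void mm_compute_batch(int overcommit_policy)
        {
                u64 memsized_batch;
                s32 nr = num_present_cpus();
                s32 batch = max_t(s32, nr * 2, 32);
                unsigned long ram_pages = totalram_pages();

                /* Under OVERCOMMIT_NEVER the global counter must stay
                 * accurate, so use a smaller per-CPU batch than in the
                 * default overcommit modes. */
                if (overcommit_policy == OVERCOMMIT_NEVER)
                        memsized_batch = min_t(u64, ram_pages / nr / 256,
                                               0x7fffffff);
                else
                        memsized_batch = min_t(u64, ram_pages / nr / 64,
                                               0x7fffffff);

                vm_committed_as_batch = max_t(s32, memsized_batch, batch);
        }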

    1. if (prev_class) { if (can_merge(prev_class, pages_per_zspage, objs_per_zspage)) { pool->size_class[i] = prev_class; continue; } }

      This is an algorithmic policy. A zs_pool maintains zspages of different size_classes. However, some size_classes share exactly the same characteristics, namely pages_per_zspage and objs_per_zspage. Recall from another annotation that free zspages are looked up by size_class (zspage = find_get_zspage(class)); grouping such identical classes therefore improves memory utilization.
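
      The merge test itself is small; a sketch close to the kernel's can_merge() in mm/zsmalloc.c:

        /* Two size classes are interchangeable iff they carve their
         * zspages identically. */
        static bool can_merge(struct size_class *prev, int pages_per_zspage,
                              int objs_per_zspage)
        {
                if (prev->pages_per_zspage == pages_per_zspage &&
                    prev->objs_per_zspage == objs_per_zspage)
                        return true;

                return false;
        }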

    2. zspage = find_get_zspage(class); if (likely(zspage)) { obj = obj_malloc(pool, zspage, handle); /* Now move the zspage to another fullness group, if required */ fix_fullness_group(class, zspage); record_obj(handle, obj); class_stat_inc(class, ZS_OBJS_INUSE, 1); goto out; }

      This is an algorithmic policy. Instead of immediately allocating new zspages for each memory request, the algorithm first attempts to find and reuse existing partially filled zspages from a given size class by invoking the find_get_zspage(class) function. It also updates the corresponding fullness groups.

    1. 1

      This is a configuration policy that sets the timeout between retries if vmap_pages_range() fails. This could be a tunable variable.

    2. if (!(flags & VM_ALLOC)) area->addr = kasan_unpoison_vmalloc(area->addr, requested_size, KASAN_VMALLOC_PROT_NORMAL);

      This is an algorithmic policy. This is an optimization that prevents duplicate marking of accessibility: only pages allocated without VM_ALLOC (e.g. ioremap) were not already set accessible (unpoisoned), thus requiring the explicit setting here.

    3. 100U

      This is a configuration policy that sets 100 pages as the upper limit for the bulk allocator. However, alloc_pages_bulk_array_mempolicy does not itself enforce such a limit, so I believe this is an algorithmic policy related to some sort of optimization.

    4. if (!order) {

      This is an algorithmic policy that determines that the bulk allocator is only used for order-0 pages (non-super pages). Perhaps the bulk allocator could also be applied to super pages to speed up allocation; so far I haven't seen a reason why it cannot be.

    5. if (likely(count <= VMAP_MAX_ALLOC)) { mem = vb_alloc(size, GFP_KERNEL); if (IS_ERR(mem)) return NULL; addr = (unsigned long)mem; } else { struct vmap_area *va; va = alloc_vmap_area(size, PAGE_SIZE, VMALLOC_START, VMALLOC_END, node, GFP_KERNEL, VMAP_RAM); if (IS_ERR(va)) return NULL; addr = va->va_start; mem = (void *)addr; }

      This is an algorithmic policy that determines whether to use the more efficient vb_alloc, which does not involve complex virtual-to-physical mappings. Unlike the alternative alloc_vmap_area, vb_alloc does not need to traverse the rb-tree of free vmap areas; it simply finds a large enough block from vmap_block_queue.

    6. VMAP_PURGE_THRESHOLD

      The threshold VMAP_PURGE_THRESHOLD is a configuration policy that could be tuned by machine learning: setting it lower reduces purging activity, while setting it higher reduces fragmentation.

    7. if (!(force_purge

      This is an algorithmic policy that prevents purging blocks with considerable amount of "usable memory" unless requested with force_purge.

    8. resched_threshold = lazy_max_pages() << 1;

      The assignment of resched_threshold, together with lines 1776-1777, is a configuration policy that determines the number of lazily-freed pages below which the CPU should be yielded temporarily to higher-priority tasks.

    9. log = fls(num_online_cpus());

      This heuristic scales lazy_max_pages logarithmically, which is a configuration policy. Alternatively, machine learning could determine the optimal scaling function—whether linear, logarithmic, square-root, or another approach.

    10. 32UL * 1024 * 1024

      This is a configuration policy that always returns multiples of 32 MB worth of pages. This could be a configurable variable rather than a fixed magic number.
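
      Putting the two preceding annotations together, a sketch of lazy_max_pages() close to mm/vmalloc.c: the CPU-count logarithm scales a 32 MB unit:

        unsigned long lazy_max_pages(void)
        {
                /* Scale the lazy-purge budget logarithmically with the
                 * number of online CPUs, in 32 MB units. */
                unsigned int log = fls(num_online_cpus());

                return log * (32UL * 1024 * 1024 / PAGE_SIZE);
        }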

    1. /* * If the user wants hardware cache aligned objects then follow that * suggestion if the object is sufficiently large. * * The hardware cache alignment cannot override the specified * alignment though. If that is greater then use it. */ if (flags & SLAB_HWCACHE_ALIGN) { unsigned int ralign; ralign = cache_line_size(); while (size <= ralign / 2) ralign /= 2; align = max(align, ralign); }

      SLAB_HWCACHE_ALIGN is supplied by the user, letting users decide whether objects should be aligned to the hardware cache line or not.

    2. if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_NORMAL)) flags |= SLAB_NO_MERGE;

      This is a configuration policy. If CONFIG_MEMCG_KMEM is enabled, disable cache merging for KMALLOC_NORMAL caches.

    1. size = count_history_pages(mapping, index, max);

      This uses the size of contiguously cached pages in the cache to get a "conservative" estimation of the "length of a sequential read sequence" or the "thrashing threshold in memory tight systems". LDOS could use different policies to predict the optimal read-ahead size.

    2. unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE

      This chunks the read-ahead into 2 megabyte units to avoid pinning too much memory at once. LDOS could replace this with a dynamically-sized chunk to better optimize for other use cases.
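
      A hedged sketch of the chunking loop (simplified from force_page_cache_ra() in mm/readahead.c):

        while (nr_to_read) {
                unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;

                if (this_chunk > nr_to_read)
                        this_chunk = nr_to_read;
                do_page_cache_ra(ractl, this_chunk, 0); /* read one chunk */

                nr_to_read -= this_chunk;
        }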

    1. static bool should_skip_region(struct memblock_type *type, struct memblock_region *m, int nid, int flags) { int m_nid = memblock_get_region_node(m); /* we never skip regions when iterating memblock.reserved or physmem */ if (type != memblock_memory) return false; /* only memory regions are associated with nodes, check it */ if (nid != NUMA_NO_NODE && nid != m_nid) return true; /* skip hotpluggable memory regions if needed */ if (movable_node_is_enabled() && memblock_is_hotpluggable(m) && !(flags & MEMBLOCK_HOTPLUG)) return true; /* if we want mirror memory skip non-mirror memory regions */ if ((flags & MEMBLOCK_MIRROR) && !memblock_is_mirror(m)) return true; /* skip nomap memory unless we were asked for it explicitly */ if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m)) return true; /* skip driver-managed memory unless we were asked for it explicitly */ if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m)) return true; return false; }

      This policy determines whether a memblock region should be skipped, based on several checks that incorporate various flags. You can see this policy being used in other functions on lines 1080 and 1184 in this file; these other functions appear to be sub-functions for iterators on the memblock regions.

    2. if (memblock_bottom_up())

      This policy controls whether the memblock allocator should allocate memory from the bottom up or from the top down.

  2. Sep 2024
    1. if (lruvec->file_cost + lruvec->anon_cost > lrusize / 4) { lruvec->file_cost /= 2; lruvec->anon_cost /= 2; }

      It is a configuration policy. The policy here adjusts the cost of the current page, where cost means the overhead for the kernel to operate on the page if it is swapped out. The kernel adopts a decay policy: when the accumulated cost exceeds lrusize/4, both costs are halved. Without this policy, after long-term running the cost of each page would grow very high, and the kernel might mistakenly retain pages that were frequently visited long ago but are inactive now, causing performance degradation. The value (lrusize/4) embodies a trade-off between performance and sensitivity: if it is too small, the kernel frequently adjusts the priority of each page, degrading process performance; if it is too large, the kernel may be misled by historical data, wrongly swapping out currently popular pages and again degrading performance.

    2. if (megs < 16) page_cluster = 2; else page_cluster = 3;

      It is a configuration policy. The policy here determines the size of the page cluster; the values "2" and "3" are determined externally to the kernel. The page cluster is the actual execution unit when swapping: on a small-memory machine, the kernel executes swap operations on 4 (2^2) pages at a time, otherwise on 8 (2^3). The rationale is to avoid heavy memory pressure on small-memory systems and to improve performance on large-memory systems.
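
      A minimal sketch of how page_cluster is consumed, in the spirit of swapin_nr_pages() (illustrative, not verbatim):

        /* The swap readahead window is capped at 2^page_cluster pages;
         * page_cluster == 0 disables readahead entirely. */
        unsigned long max_pages = 1 << READ_ONCE(page_cluster);

        if (max_pages <= 1)
                return 1;       /* read just the faulting page */
        /* ... grow or shrink the heuristic window, clamped to max_pages ... */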

    1. mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);

      It is a configuration policy. It is used to count physical pages charged to the memory control group. The policy is that one physical page corresponds to one count against the group's MEMCG_VMALLOC statistic. If huge pages were used, the increment might not be 1.

    2. schedule_timeout_uninterruptible(1);

      It is a configuration policy. If the kernel cannot allocate a sufficiently large, aligned virtual address range, while nofail is specified to disallow failure, the kernel puts the process to sleep for one timer tick; during this period the process cannot be interrupted and leaves the CPU's run queue. The "1" is a configuration parameter set externally to the kernel: short enough not to add much latency for latency-sensitive processes, yet long enough to avoid retrying in a tight loop.
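
      A hedged sketch of the retry loop this sleep sits in (simplified from __vmalloc_node_range(); names are approximate):

          /* Sketch: under __GFP_NOFAIL, back off one tick and retry forever. */
          do {
                  area = __get_vm_area_node(real_size, align, shift, vm_flags,
                                            start, end, node, gfp_mask, caller);
                  if (area)
                          break;                          /* got the space */
                  if (!(gfp_mask & __GFP_NOFAIL))
                          return NULL;                    /* failure is allowed */
                  schedule_timeout_uninterruptible(1);    /* sleep one tick */
          } while (1);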

    3. if (array_size > PAGE_SIZE) {
               area->pages = __vmalloc_node(array_size, 1, nested_gfp,
                                            node, area->caller);
       } else {
               area->pages = kmalloc_node(array_size, nested_gfp, node);
       }

      It is an algorithmic policy and also a configuration policy. The algorithmic policy is to maximize the locality of the allocation by keeping it in contiguous virtual pages: if the demanded size is larger than one page, the kernel allocates multiple contiguous virtual pages via __vmalloc_node(); otherwise it calls kmalloc_node() to place it within a single page. To allocate those contiguous pages, __vmalloc_node() is invoked with align = 1, which is the configuration policy.

    4. if (!counters) return; if (v->flags & VM_UNINITIALIZED) return;

      It is an algorithmic policy. show_numa_info() prints how the virtual memory area v (vm_struct) distributes its pages across NUMA nodes. The two if conditions filter out cases that cannot be reported: the first means the per-node counter buffer was never allocated, and the second means the area is still flagged VM_UNINITIALIZED, i.e. not yet fully set up.

    5. if (page) copied = copy_page_to_iter_nofault(page, offset, length, iter); else copied = zero_iter(iter, length);

      It is an algorithmic policy. The policy avoids lock/unlock overhead by using copy_page_to_iter_nofault() together with an if-else branch. Because copy_page_to_iter_nofault() must not trigger a page fault, the kernel first checks that the physical page is still present. If the page is not valid, the kernel fills the iterator with zeroes instead of returning an error; this makes the call more convenient, since the caller needs no extra verification of the return value and no further error handling.

    6. if (base + last_end < vmalloc_start + last_end)
               goto overflow;

       /*
        * Fitting base has not been found.
        */
       if (va == NULL)
               goto overflow;

      It is an algorithmic policy. The two if conditions check the feasibility of the allocation. The first detects that the allocation's address range would extend beyond the permitted address space, i.e. overflow; since the base address is searched from high addresses downward, lowering it further cannot help (note that the base address depends on the va). The second means no va with a suitable base address was found. In both cases the policy jumps to the overflow branch to redo the allocation of the va list.

    7. if (base + start < va->va_start) {
               va = node_to_va(rb_prev(&va->rb_node));
               base = pvm_determine_end_from_reverse(&va, align) - end;
               term_area = area;
               continue;
       }

      It is an algorithmic policy. It checks whether the start address of the current area still lies within va. The policy here is to switch to a new va, unlike the handling of the end-address check (explained in my other annotation). Because the base address is searched from high addresses to low, continuing to lower the base can never make this condition pass; switching to a va with a lower start address is the only option.

    8. if (base + end > va->va_end) {
               base = pvm_determine_end_from_reverse(&va, align) - end;
               term_area = area;
               continue;
       }

      It is an algorithmic policy. The outer loop runs until it finds a valid virtual memory range satisfying all requirements. The if condition here guarantees the area fits below the end of va. If not, the scan continues from high addresses to low while preserving the alignment requirement: the base address is tuned downward, and the check is retried. Putting this and the previous two annotations together gives the scan sketched below.
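
      A hedged sketch of that scan, heavily simplified from pcpu_get_vm_areas() (only the control flow is kept; surrounding bookkeeping is omitted):

          /* Sketch: scan candidate bases from high to low until all areas fit. */
          while (true) {
                  /* feasibility: address overflow, or no fitting va -> restart */
                  if (base + last_end < vmalloc_start + last_end || va == NULL)
                          goto overflow;
                  /* area sticks out past va's end -> slide the base downward */
                  if (base + end > va->va_end) {
                          base = pvm_determine_end_from_reverse(&va, align) - end;
                          continue;
                  }
                  /* base fell below va's start -> step to the previous va */
                  if (base + start < va->va_start) {
                          va = node_to_va(rb_prev(&va->rb_node));
                          base = pvm_determine_end_from_reverse(&va, align) - end;
                          continue;
                  }
                  break;  /* this area fits; go on to verify the next one */
          }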

    9. va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); if (WARN_ON_ONCE(!va)) continue;

      It is an algorithmic policy. The for loop migrates virtual memory management from the early vm_struct linked list to the vmap_area red-black tree. Inside the loop, if a new vmap_area cannot be allocated from vmap_area_cachep (va == NULL), the code simply prints a warning and skips the current entry, rather than retrying or otherwise guaranteeing integrity.

    10. if (!spin_trylock(&vmap_area_lock)) return false;

      It is an algorithmic policy. vmalloc_dump_obj() prints information about a given virtual memory object. Because of concurrency, the lock must be held before reading the shared state. The policy here is spin_trylock: it tries only once, and on failure the function simply returns false. The point is to avoid long waits in a best-effort debugging path.
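
      The same best-effort pattern, sketched as a whole (structure approximated from vmalloc_dump_obj(); the function name below is hypothetical):

          /* Sketch: diagnostics must never block, so contention means "no dump". */
          static bool dump_obj_sketch(void *object)
          {
                  if (!spin_trylock(&vmap_area_lock))
                          return false;           /* contended: give up quietly */
                  /* ... look up the vmap_area covering @object and print it ... */
                  spin_unlock(&vmap_area_lock);
                  return true;
          }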

    1. static inline unsigned int calc_slab_order(unsigned int size,
                                                  unsigned int min_order,
                                                  unsigned int max_order,
                                                  unsigned int fract_leftover)
       {
               unsigned int order;

               for (order = min_order; order <= max_order; order++) {
                       unsigned int slab_size = (unsigned int)PAGE_SIZE << order;
                       unsigned int rem;

                       rem = slab_size % size;
                       if (rem <= slab_size / fract_leftover)
                               break;
               }
               return order;
       }

      Algorithmic policy: chooses the page order (how many pages) to allocate for a new slab so that the space wasted by the remainder after packing objects stays below a 1/fract_leftover fraction of the slab size. A worked example follows.
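
      A standalone userspace demo of the same heuristic (PAGE_SIZE and the example object size are assumptions, not taken from a real system):

          #include <stdio.h>

          #define PAGE_SIZE 4096u

          static unsigned int calc_order(unsigned int size, unsigned int min_order,
                                         unsigned int max_order,
                                         unsigned int fract_leftover)
          {
                  unsigned int order;

                  for (order = min_order; order <= max_order; order++) {
                          unsigned int slab_size = PAGE_SIZE << order;

                          /* accept the first order wasting <= slab_size/fract */
                          if (slab_size % size <= slab_size / fract_leftover)
                                  break;
                  }
                  return order;
          }

          int main(void)
          {
                  /* 700-byte objects: order 0 wastes 596 of 4096 bytes (> 1/16),
                   * while order 1 wastes 492 of 8192 (<= 1/16), so order 1 wins. */
                  printf("order = %u\n", calc_order(700, 0, 3, 16));
                  return 0;
          }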

    1. if (folio) {
               pgoff_t start;

               rcu_read_lock();
               start = page_cache_next_miss(ractl->mapping, index + 1, max_pages);
               rcu_read_unlock();

               if (!start || start - index > max_pages)
                       return;

               ra->start = start;
               ra->size = start - index;       /* old async_size */
               ra->size += req_size;
               ra->size = get_next_ra_size(ra, max_pages);
               ra->async_size = ra->size;
               goto readit;

      Read-ahead policy for when a folio marked for asynchronous read-ahead is hit: locate the next cache miss after the current index, treat the cached run up to it as the old window, add the request size, and ramp the new window up with get_next_ra_size().

  3. Aug 2024
    1. if (hugetlb_cgroup_disabled())

      This is probably not an interesting policy decision for LDOS: it is a feature flag for the running OS. But if cgroup placement were itself decided by policy, this flag would be controlled by that cgroup decision.