From a724fb1d1134a5b4f79558a6b8658a93ec80673d Mon Sep 17 00:00:00 2001
From: Lance Yang
Date: Mon, 29 Jan 2024 13:45:51 +0800
Subject: [PATCH 01/28] mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check

ANBZ: #30191

commit 879c6000e191b61b97e17bce44c4564ee42eb612 upstream.

khugepaged scans the entire address space in the background for each
given mm, looking for opportunities to merge sequences of basic pages
into huge pages. However, when an mm is inserted into the mm_slots list
and the MMF_DISABLE_THP flag is set later, this scanning becomes
unnecessary for that mm and can be skipped to avoid redundant work,
especially in scenarios with a large address space.

On an Intel Core i5 CPU, the time taken by khugepaged to scan the
address space of a process that had the MMF_DISABLE_THP flag set after
being added to the mm_slots list is as follows (shorter is better):

VMA Count |  Old  |  New  | Change
---------------------------------------
    50    | 23us  |  9us  | -60.9%
   100    | 32us  |  9us  | -71.9%
   200    | 44us  |  9us  | -79.5%
   400    | 75us  |  9us  | -88.0%
   800    | 98us  |  9us  | -90.8%

Once the count of VMAs for the process exceeds pages_to_scan, khugepaged
needs to wait for scan_sleep_millisecs ms before scanning the next
process. IMO, unnecessary scans could actually be skipped with a very
inexpensive mm->flags check in this case.

This commit introduces a check before each scan to test the
MMF_DISABLE_THP flag for the given mm; if the flag is set, the scan is
bypassed, improving the efficiency of khugepaged.

This optimization is not a correctness fix but rather an enhancement
that saves expensive per-VMA checks when userspace cannot prctl itself
before spawning into the new process.

On some servers within our company, we deploy a daemon responsible for
monitoring and updating local applications. Some applications prefer
not to use THP, so the daemon calls prctl to disable THP before
fork/exec. Conversely, for other applications, the daemon calls prctl
to enable THP before fork/exec.

Ideally, the daemon should invoke prctl after the fork, but its current
implementation follows the described approach. In the Go standard
library, there is no direct encapsulation of the fork system call;
instead, fork and execve are combined into one through syscall.ForkExec.
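For illustration, a minimal userspace sketch of the prctl-before-fork
pattern described above; the target binary path is a placeholder and the
program is not part of this patch:

/*
 * Hedged sketch of the daemon behaviour described above: PR_SET_THP_DISABLE
 * is inherited across fork() and preserved across execve(), which is how the
 * child mm ends up with MMF_DISABLE_THP set. The binary path is a placeholder.
 */
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		return EXIT_FAILURE;

	pid_t pid = fork();
	if (pid == 0) {
		/* Child runs with THP disabled for its whole address space. */
		execl("/usr/local/bin/app", "app", (char *)NULL);
		_exit(127);
	} else if (pid > 0) {
		waitpid(pid, NULL, 0);
	}
	return 0;
}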
Link: https://lkml.kernel.org/r/20240129054551.57728-1-ioworker0@gmail.com Signed-off-by: Lance Yang Acked-by: David Hildenbrand Cc: Michal Hocko Cc: Minchan Kim Cc: Muchun Song Cc: Peter Xu Cc: Zach O'Keefe Signed-off-by: Andrew Morton Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 9c7818b8118a..cc04a4025d8d 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -434,6 +434,12 @@ static bool hugepage_pmd_enabled(void) return false; } +static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm) +{ + return hpage_collapse_test_exit(mm) || + test_bit(MMF_DISABLE_THP, &mm->flags); +} + void __khugepaged_enter(struct mm_struct *mm) { struct khugepaged_mm_slot *mm_slot; @@ -1440,7 +1446,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) lockdep_assert_held(&khugepaged_mm_lock); - if (hpage_collapse_test_exit(mm)) { + if (hpage_collapse_test_exit_or_disable(mm)) { /* free mm_slot */ hash_del(&slot->hash); list_del(&slot->mm_node); @@ -2401,7 +2407,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, goto breakouterloop_mmap_lock; progress++; - if (unlikely(hpage_collapse_test_exit(mm))) + if (unlikely(hpage_collapse_test_exit_or_disable(mm))) goto breakouterloop; vma_iter_init(&vmi, mm, khugepaged_scan.address); @@ -2409,7 +2415,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, unsigned long hstart, hend; cond_resched(); - if (unlikely(hpage_collapse_test_exit(mm))) { + if (unlikely(hpage_collapse_test_exit_or_disable(mm))) { progress++; break; } @@ -2431,7 +2437,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, bool mmap_locked = true; cond_resched(); - if (unlikely(hpage_collapse_test_exit(mm))) + if (unlikely(hpage_collapse_test_exit_or_disable(mm))) goto breakouterloop; VM_BUG_ON(khugepaged_scan.address < hstart || @@ -2449,7 +2455,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, fput(file); if (*result == SCAN_PTE_MAPPED_HUGEPAGE) { mmap_read_lock(mm); - if (hpage_collapse_test_exit(mm)) + if (hpage_collapse_test_exit_or_disable(mm)) goto breakouterloop; *result = collapse_pte_mapped_thp(mm, khugepaged_scan.address, false); @@ -2491,7 +2497,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, * Release the current mm_slot if this mm is about to die, or * if we scanned all vmas of this mm. */ - if (hpage_collapse_test_exit(mm) || !vma) { + if (hpage_collapse_test_exit_or_disable(mm) || !vma) { /* * Make sure that if mm_users is reaching zero while * khugepaged runs here, khugepaged_exit will find -- Gitee From 5ba6b41139280749e0295e3af13acd347a4ac25f Mon Sep 17 00:00:00 2001 From: Lance Yang Date: Tue, 27 Feb 2024 11:51:35 +0800 Subject: [PATCH 02/28] mm/khugepaged: keep mm in mm_slot without MMF_DISABLE_THP check ANBZ: #30191 commit 5dad604809c5acc546ec74057498db1623f1c408 upstream. Previously, we removed the mm from mm_slot and dropped mm_count if the MMF_THP_DISABLE flag was set. However, we didn't re-add the mm back after clearing the MMF_THP_DISABLE flag. Additionally, We add a check for the MMF_THP_DISABLE flag in hugepage_vma_revalidate(). 
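The sequence that exposed the problem can be sketched from userspace; the
snippet below only shows the prctl toggling order and is not part of the
patch:

/*
 * Userspace sequence behind this fix (illustrative only): the THP-disable
 * flag can be cleared again at runtime, so khugepaged must keep the mm on
 * the mm_slot list while the flag is set, ready to resume scanning later.
 */
#include <sys/prctl.h>

int main(void)
{
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);	/* sets MMF_DISABLE_THP */
	/* ... khugepaged skips this mm while the flag is set ... */
	prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0);	/* clears the flag again */
	/* ... khugepaged must still find the mm in mm_slot and resume ... */
	return 0;
}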
Link: https://lkml.kernel.org/r/20240227035135.54593-1-ioworker0@gmail.com Fixes: 879c6000e191 ("mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check") Signed-off-by: Lance Yang Suggested-by: Yang Shi Reviewed-by: Yang Shi Reviewed-by: David Hildenbrand Cc: Michal Hocko Cc: Minchan Kim Cc: Muchun Song Cc: Peter Xu Cc: Zach O'Keefe Signed-off-by: Andrew Morton Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index cc04a4025d8d..2870061f8df4 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -928,7 +928,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, { struct vm_area_struct *vma; - if (unlikely(hpage_collapse_test_exit(mm))) + if (unlikely(hpage_collapse_test_exit_or_disable(mm))) return SCAN_ANY_PROCESS; *vmap = vma = find_vma(mm, address); @@ -1446,7 +1446,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) lockdep_assert_held(&khugepaged_mm_lock); - if (hpage_collapse_test_exit_or_disable(mm)) { + if (hpage_collapse_test_exit(mm)) { /* free mm_slot */ hash_del(&slot->hash); list_del(&slot->mm_node); @@ -2497,7 +2497,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, * Release the current mm_slot if this mm is about to die, or * if we scanned all vmas of this mm. */ - if (hpage_collapse_test_exit_or_disable(mm) || !vma) { + if (hpage_collapse_test_exit(mm) || !vma) { /* * Make sure that if mm_users is reaching zero while * khugepaged runs here, khugepaged_exit will find -- Gitee From 28519f4d53b7a605cb4d81192c38c0ed6b2bd469 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Thu, 25 Apr 2024 05:00:55 +0100 Subject: [PATCH 03/28] mm: simplify thp_vma_allowable_order ANBZ: #30191 commit e0ffb29bc54d86b9ab10ebafc66eb1b7229e0cd7 upstream. Combine the three boolean arguments into one flags argument for readability. 
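As a readability illustration, a self-contained sketch of the conversion
(only the TVA_* bit names mirror the patch; the demo function stands in for
__thp_vma_allowable_orders() and is not kernel code):

/*
 * Self-contained sketch of the bool-to-flags conversion: three positional
 * booleans become named bits, so a call site reads as intent rather than a
 * row of true/false.
 */
#include <stdbool.h>
#include <stdio.h>

#define TVA_SMAPS		(1 << 0)	/* will be used for procfs */
#define TVA_IN_PF		(1 << 1)	/* page fault handler */
#define TVA_ENFORCE_SYSFS	(1 << 2)	/* obey sysfs configuration */

static void allowable_orders(unsigned long tva_flags)
{
	bool smaps = tva_flags & TVA_SMAPS;
	bool in_pf = tva_flags & TVA_IN_PF;
	bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;

	printf("smaps=%d in_pf=%d enforce_sysfs=%d\n", smaps, in_pf, enforce_sysfs);
}

int main(void)
{
	/* Previously (false, true, true); now the intent is visible: */
	allowable_orders(TVA_IN_PF | TVA_ENFORCE_SYSFS);
	return 0;
}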
Signed-off-by: Matthew Wilcox (Oracle) Cc: David Hildenbrand Cc: Kefeng Wang Cc: Ryan Roberts Signed-off-by: Andrew Morton Signed-off-by: Yuanhe Shu --- fs/proc/task_mmu.c | 4 ++-- include/linux/huge_mm.h | 29 +++++++++++++++-------------- mm/huge_memory.c | 7 +++++-- mm/khugepaged.c | 16 +++++++--------- mm/memory.c | 14 ++++++++------ 5 files changed, 37 insertions(+), 33 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index d6786bf83ed2..e75e2153357b 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -879,8 +879,8 @@ static int show_smap(struct seq_file *m, void *v) #endif seq_printf(m, "THPeligible: %8u\n", - !!thp_vma_allowable_orders(vma, vma->vm_flags, true, false, - true, THP_ORDERS_ALL)); + !!thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL)); if (arch_pkeys_enabled()) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 895aeeb2043c..20e91feef801 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -92,8 +92,12 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; #define THP_ORDERS_ALL \ (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT) -#define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \ - (!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, BIT(order))) +#define TVA_SMAPS (1 << 0) /* Will be used for procfs */ +#define TVA_IN_PF (1 << 1) /* Page fault handler */ +#define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */ + +#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \ + (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order))) static inline int lowest_order(unsigned long orders) { @@ -275,17 +279,15 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) } unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, - unsigned long vm_flags, bool smaps, - bool in_pf, bool enforce_sysfs, + unsigned long vm_flags, + unsigned long tva_flags, unsigned long orders); /** * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma * @vma: the vm area to check * @vm_flags: use these vm_flags instead of vma->vm_flags - * @smaps: whether answer will be used for smaps file - * @in_pf: whether answer will be used by page fault handler - * @enforce_sysfs: whether sysfs config should be taken into account + * @tva_flags: Which TVA flags to honour * @orders: bitfield of all orders to consider * * Calculates the intersection of the requested hugepage orders and the allowed @@ -298,12 +300,12 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, */ static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, - unsigned long vm_flags, bool smaps, - bool in_pf, bool enforce_sysfs, + unsigned long vm_flags, + unsigned long tva_flags, unsigned long orders) { /* Optimization to check if required orders are enabled early. 
*/ - if (enforce_sysfs && vma_is_anonymous(vma)) { + if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) { unsigned long mask = READ_ONCE(huge_anon_orders_always); if (vm_flags & VM_HUGEPAGE) @@ -317,8 +319,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, return 0; } - return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, - enforce_sysfs, orders); + return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders); } struct thpsize { @@ -494,8 +495,8 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma, } static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, - unsigned long vm_flags, bool smaps, - bool in_pf, bool enforce_sysfs, + unsigned long vm_flags, + unsigned long tva_flags, unsigned long orders) { return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ac49e32e7238..7ba5cac3ea53 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -89,10 +89,13 @@ static bool anon_orders_configured __initdata; static bool file_orders_configured; unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, - unsigned long vm_flags, bool smaps, - bool in_pf, bool enforce_sysfs, + unsigned long vm_flags, + unsigned long tva_flags, unsigned long orders) { + bool smaps = tva_flags & TVA_SMAPS; + bool in_pf = tva_flags & TVA_IN_PF; + bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS; unsigned long supported_orders; /* Check the intersection of requested and supported orders. */ diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 2870061f8df4..2e0e26b5859e 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -477,7 +477,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, { if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) && hugepage_pmd_enabled()) { - if (thp_vma_allowable_order(vma, vm_flags, false, false, true, + if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS, PMD_ORDER)) __khugepaged_enter(vma->vm_mm); } @@ -927,6 +927,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, struct collapse_control *cc) { struct vm_area_struct *vma; + unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0; if (unlikely(hpage_collapse_test_exit_or_disable(mm))) return SCAN_ANY_PROCESS; @@ -937,8 +938,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, if (!thp_vma_suitable_order(vma, address, PMD_ORDER)) return SCAN_ADDRESS_RANGE; - if (!thp_vma_allowable_order(vma, vma->vm_flags, false, false, - cc->is_khugepaged, PMD_ORDER)) + if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER)) return SCAN_VMA_CHECK; /* * Anon VMA expected, the address may be unmapped then @@ -1529,8 +1529,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, * and map it by a PMD, regardless of sysfs THP settings. As such, let's * analogously elide sysfs THP settings here. 
*/ - if (!thp_vma_allowable_order(vma, vma->vm_flags, false, false, false, - PMD_ORDER)) + if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER)) return SCAN_VMA_CHECK; /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ @@ -2419,8 +2418,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, progress++; break; } - if (!thp_vma_allowable_order(vma, vma->vm_flags, false, false, - true, PMD_ORDER)) { + if (!thp_vma_allowable_order(vma, vma->vm_flags, + TVA_ENFORCE_SYSFS, PMD_ORDER)) { skip: progress++; continue; @@ -2775,8 +2774,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, *prev = vma; - if (!thp_vma_allowable_order(vma, vma->vm_flags, false, false, false, - PMD_ORDER)) + if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER)) return -EINVAL; cc = kmalloc(sizeof(*cc), GFP_KERNEL); diff --git a/mm/memory.c b/mm/memory.c index 2484b5143e3f..5c11b0423c8d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4466,8 +4466,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) * Get a list of all the (large) orders below PMD_ORDER that are enabled * and suitable for swapping THP. */ - orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true, - BIT(PMD_ORDER) - 1); + orders = thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); @@ -4996,8 +4996,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) * for this vma. Then filter out the orders that can't be allocated over * the faulting address and still be fully contained in the vma. */ - orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true, - BIT(PMD_ORDER) - 1); + orders = thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); if (!orders) @@ -6303,7 +6303,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, return VM_FAULT_OOM; retry_pud: if (pud_none(*vmf.pud) && - thp_vma_allowable_order(vma, vm_flags, false, true, true, PUD_ORDER)) { + thp_vma_allowable_order(vma, vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) { ret = create_huge_pud(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; @@ -6337,7 +6338,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, goto retry_pud; if (pmd_none(*vmf.pmd) && - thp_vma_allowable_order(vma, vm_flags, false, true, true, PMD_ORDER)) { + thp_vma_allowable_order(vma, vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER)) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; -- Gitee From e55d7bce1241cb90562332932c74f6a3c5e99134 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Tue, 5 Aug 2025 11:54:47 +0800 Subject: [PATCH 04/28] mm: fix the race between collapse and PT_RECLAIM under per-vma lock ANBZ: #30191 commit 366a4532d96fc357998465133db34d34edb79e4c upstream. The check_pmd_still_valid() call during collapse is currently only protected by the mmap_lock in write mode, which was sufficient when pt_reclaim always ran under mmap_lock in read mode. However, since madvise_dontneed can now execute under a per-VMA lock, this assumption is no longer valid. As a result, a race condition can occur between collapse and PT_RECLAIM, potentially leading to a kernel panic. 
[ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
[ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
[ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
[ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
[ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
[ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
[ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
[ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
[ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
[ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
[ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
[ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
[ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
[ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
[ 38.168812] PKRU: 55555554
[ 38.169275] Call Trace:
[ 38.169647]
[ 38.169975] ? __kasan_check_byte+0x19/0x50
[ 38.170581] lock_acquire+0xea/0x310
[ 38.171083] ? rcu_is_watching+0x19/0xc0
[ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 38.173130] _raw_spin_lock+0x38/0x50
[ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
[ 38.175724] ? __pfx_pud_val+0x10/0x10
[ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 38.177183] unmap_page_range+0xb60/0x43e0
[ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
[ 38.178485] ? mas_next_slot+0x133a/0x1a50
[ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
[ 38.179830] unmap_vmas+0x1fa/0x460
[ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
[ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.181877] exit_mmap+0x1a2/0xb40
[ 38.182396] ? lock_release+0x14f/0x2c0
[ 38.182929] ? __pfx_exit_mmap+0x10/0x10
[ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
[ 38.184188] ? mutex_unlock+0x16/0x20
[ 38.184704] mmput+0x132/0x370
[ 38.185208] do_exit+0x7e7/0x28c0
[ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
[ 38.186328] ? do_group_exit+0x1d8/0x2c0
[ 38.186873] ? __pfx_do_exit+0x10/0x10
[ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
[ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
[ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
[ 38.189313] do_group_exit+0xe4/0x2c0
[ 38.189831] __x64_sys_exit_group+0x4d/0x60
[ 38.190413] x64_sys_call+0x2174/0x2180
[ 38.190935] do_syscall_64+0x6d/0x2e0
[ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e

This patch moves the vma_start_write() call to precede
check_pmd_still_valid(), ensuring that the check is also properly
protected by the per-VMA lock.
Link: https://lkml.kernel.org/r/20250805035447.7958-1-21cnbao@gmail.com Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED") Signed-off-by: Barry Song Tested-by: "Lai, Yi" Reported-by: "Lai, Yi" Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/ Reviewed-by: Lorenzo Stoakes Cc: David Hildenbrand Cc: Lorenzo Stoakes Cc: Qi Zheng Cc: Vlastimil Babka Cc: Jann Horn Cc: Suren Baghdasaryan Cc: Lokesh Gidra Cc: Tangquan Zheng Cc: Lance Yang Cc: Zi Yan Cc: Baolin Wang Cc: Liam R. Howlett Cc: Nico Pache Cc: Ryan Roberts Cc: Dev Jain Signed-off-by: Andrew Morton Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 2e0e26b5859e..9f39376cd6d2 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, if (result != SCAN_SUCCEED) goto out_up_write; /* check if the pmd is still valid */ + vma_start_write(vma); result = check_pmd_still_valid(mm, address, pmd); if (result != SCAN_SUCCEED || is_async_fork_mm(mm)) goto out_up_write; - vma_start_write(vma); anon_vma_lock_write(vma->anon_vma); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, -- Gitee From 1e85544bd7ff8d54ac2070f0d5b6be12c4d6fad2 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 15 Aug 2025 14:54:54 +0100 Subject: [PATCH 05/28] mm/huge_memory: convert "tva_flags" to "enum tva_type" ANBZ: #30191 commit 1f1c061089dcd274befa0c76fb9f6e253a8368c0 upstream. When determining which THP orders are eligible for a VMA mapping, we have previously specified tva_flags, however it turns out it is really not necessary to treat these as flags. Rather, we distinguish between distinct modes. The only case where we previously combined flags was with TVA_ENFORCE_SYSFS, but we can avoid this by observing that this is the default, except for MADV_COLLAPSE or an edge cases in collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and adding a mode specifically for this case - TVA_FORCED_COLLAPSE. We have: * smaps handling for showing "THPeligible" * Pagefault handling * khugepaged handling * Forced collapse handling: primarily MADV_COLLAPSE, but also for an edge case in collapse_pte_mapped_thp() Disregarding the edge cases, we only want to ignore sysfs settings only when we are forcing a collapse through MADV_COLLAPSE, otherwise we want to enforce it, hence this patch does the following flag to enum conversions: * TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS * TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT * TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED * 0 -> TVA_FORCED_COLLAPSE With this change, we immediately know if we are in the forced collapse case, which will be valuable next. 
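A self-contained sketch of the resulting shape (only the enum values and the
"forced collapse ignores sysfs" rule mirror the patch; the helper and main()
are illustrative, not kernel code):

/*
 * Each caller is in exactly one mode, so an enum expresses it directly and
 * the sysfs-enforcement rule becomes a single comparison.
 */
#include <stdbool.h>
#include <stdio.h>

enum tva_type {
	TVA_SMAPS,		/* exposing "THPeligible:" in smaps */
	TVA_PAGEFAULT,		/* serving a page fault */
	TVA_KHUGEPAGED,		/* khugepaged collapse */
	TVA_FORCED_COLLAPSE,	/* forced collapse, e.g. MADV_COLLAPSE */
};

static bool enforce_sysfs(enum tva_type type)
{
	/* Only a forced collapse ignores the sysfs configuration. */
	return type != TVA_FORCED_COLLAPSE;
}

int main(void)
{
	printf("pagefault enforces sysfs: %d\n", enforce_sysfs(TVA_PAGEFAULT));
	printf("MADV_COLLAPSE enforces sysfs: %d\n", enforce_sysfs(TVA_FORCED_COLLAPSE));
	return 0;
}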
Link: https://lkml.kernel.org/r/20250815135549.130506-3-usamaarif642@gmail.com Signed-off-by: David Hildenbrand Signed-off-by: Usama Arif Acked-by: Usama Arif Reviewed-by: Baolin Wang Reviewed-by: Lorenzo Stoakes Reviewed-by: Zi Yan Cc: Arnd Bergmann Cc: Barry Song Cc: Dev Jain Cc: Jann Horn Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Liam Howlett Cc: Mariano Pache Cc: Michal Hocko Cc: Mike Rapoport Cc: Rik van Riel Cc: Ryan Roberts Cc: SeongJae Park Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Yafang Signed-off-by: Andrew Morton Signed-off-by: Yuanhe Shu --- fs/proc/task_mmu.c | 4 ++-- include/linux/huge_mm.h | 30 ++++++++++++++++++------------ mm/huge_memory.c | 8 ++++---- mm/khugepaged.c | 15 +++++++-------- mm/memory.c | 14 ++++++-------- 5 files changed, 37 insertions(+), 34 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index e75e2153357b..1232ab6e297e 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -879,8 +879,8 @@ static int show_smap(struct seq_file *m, void *v) #endif seq_printf(m, "THPeligible: %8u\n", - !!thp_vma_allowable_orders(vma, vma->vm_flags, - TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL)); + !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS, + THP_ORDERS_ALL)); if (arch_pkeys_enabled()) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 20e91feef801..b42d9ea84a98 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -92,12 +92,15 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; #define THP_ORDERS_ALL \ (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT) -#define TVA_SMAPS (1 << 0) /* Will be used for procfs */ -#define TVA_IN_PF (1 << 1) /* Page fault handler */ -#define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */ +enum tva_type { + TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */ + TVA_PAGEFAULT, /* Serving a page fault. */ + TVA_KHUGEPAGED, /* Khugepaged collapse. */ + TVA_FORCED_COLLAPSE, /* Forced collapse (e.g. MADV_COLLAPSE). */ +}; -#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \ - (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order))) +#define thp_vma_allowable_order(vma, vm_flags, type, order) \ + (!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order))) static inline int lowest_order(unsigned long orders) { @@ -280,14 +283,14 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, unsigned long vm_flags, - unsigned long tva_flags, + enum tva_type type, unsigned long orders); /** * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma * @vma: the vm area to check * @vm_flags: use these vm_flags instead of vma->vm_flags - * @tva_flags: Which TVA flags to honour + * @type: TVA type * @orders: bitfield of all orders to consider * * Calculates the intersection of the requested hugepage orders and the allowed @@ -301,11 +304,14 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, unsigned long vm_flags, - unsigned long tva_flags, + enum tva_type type, unsigned long orders) { - /* Optimization to check if required orders are enabled early. */ - if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) { + /* + * Optimization to check if required orders are enabled early. Only + * forced collapse ignores sysfs configs. 
+ */ + if (type != TVA_FORCED_COLLAPSE && vma_is_anonymous(vma)) { unsigned long mask = READ_ONCE(huge_anon_orders_always); if (vm_flags & VM_HUGEPAGE) @@ -319,7 +325,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, return 0; } - return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders); + return __thp_vma_allowable_orders(vma, vm_flags, type, orders); } struct thpsize { @@ -496,7 +502,7 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma, static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, unsigned long vm_flags, - unsigned long tva_flags, + enum tva_type type, unsigned long orders) { return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7ba5cac3ea53..23097d7e5b70 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -90,12 +90,12 @@ static bool file_orders_configured; unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, unsigned long vm_flags, - unsigned long tva_flags, + enum tva_type type, unsigned long orders) { - bool smaps = tva_flags & TVA_SMAPS; - bool in_pf = tva_flags & TVA_IN_PF; - bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS; + const bool smaps = type == TVA_SMAPS; + const bool in_pf = type == TVA_PAGEFAULT; + const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE; unsigned long supported_orders; /* Check the intersection of requested and supported orders. */ diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 9f39376cd6d2..242890836c58 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -477,8 +477,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, { if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) && hugepage_pmd_enabled()) { - if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS, - PMD_ORDER)) + if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) __khugepaged_enter(vma->vm_mm); } } @@ -927,7 +926,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, struct collapse_control *cc) { struct vm_area_struct *vma; - unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0; + enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED : + TVA_FORCED_COLLAPSE; if (unlikely(hpage_collapse_test_exit_or_disable(mm))) return SCAN_ANY_PROCESS; @@ -938,7 +938,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, if (!thp_vma_suitable_order(vma, address, PMD_ORDER)) return SCAN_ADDRESS_RANGE; - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER)) + if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER)) return SCAN_VMA_CHECK; /* * Anon VMA expected, the address may be unmapped then @@ -1527,9 +1527,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, * in the page cache with a single hugepage. If a mm were to fault-in * this memory (mapped by a suitably aligned VMA), we'd get the hugepage * and map it by a PMD, regardless of sysfs THP settings. As such, let's - * analogously elide sysfs THP settings here. + * analogously elide sysfs THP settings here and force collapse. 
*/ - if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER)) + if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER)) return SCAN_VMA_CHECK; /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ @@ -2418,8 +2418,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, progress++; break; } - if (!thp_vma_allowable_order(vma, vma->vm_flags, - TVA_ENFORCE_SYSFS, PMD_ORDER)) { + if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) { skip: progress++; continue; diff --git a/mm/memory.c b/mm/memory.c index 5c11b0423c8d..7dd53805ac82 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4466,8 +4466,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) * Get a list of all the (large) orders below PMD_ORDER that are enabled * and suitable for swapping THP. */ - orders = thp_vma_allowable_orders(vma, vma->vm_flags, - TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT, + BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders); @@ -4996,8 +4996,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) * for this vma. Then filter out the orders that can't be allocated over * the faulting address and still be fully contained in the vma. */ - orders = thp_vma_allowable_orders(vma, vma->vm_flags, - TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT, + BIT(PMD_ORDER) - 1); orders = thp_vma_suitable_orders(vma, vmf->address, orders); if (!orders) @@ -6303,8 +6303,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, return VM_FAULT_OOM; retry_pud: if (pud_none(*vmf.pud) && - thp_vma_allowable_order(vma, vm_flags, - TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) { + thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PUD_ORDER)) { ret = create_huge_pud(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; @@ -6338,8 +6337,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, goto retry_pud; if (pmd_none(*vmf.pmd) && - thp_vma_allowable_order(vma, vm_flags, - TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER)) { + thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; -- Gitee From c6a0bc039d1fd1865e8dadfb3b3ceeadab309447 Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Sun, 18 Jan 2026 19:09:40 +0000 Subject: [PATCH 06/28] mm/khugepaged: map dirty/writeback pages failures to EAGAIN ANBZ: #30191 cherry-picked from https://lore.kernel.org/lkml/20260118190939.8986-4-shivankg@amd.com/ When collapse_file encounters dirty or writeback pages in file-backed mappings, it currently returns SCAN_FAIL which maps to -EINVAL. This is misleading as EINVAL suggests invalid arguments, whereas dirty/writeback pages represent transient conditions that may resolve on retry. Introduce SCAN_PAGE_DIRTY_OR_WRITEBACK to cover both dirty and writeback states, mapping it to -EAGAIN. For MADV_COLLAPSE, this provides userspace with a clear signal that retry may succeed after writeback completes. For khugepaged, this is harmless as it will naturally revisit the range during periodic scans after async writeback completes. 
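From the MADV_COLLAPSE caller's point of view the change looks roughly like
the sketch below; the file path, mapping size, and retry policy are
placeholders and not part of the patch:

/*
 * Illustrative userspace view: MADV_COLLAPSE on a still-dirty file mapping
 * now fails with EAGAIN rather than EINVAL, so the caller can retry after
 * writeback has had a chance to run.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

int main(void)
{
	size_t len = 2UL << 20;			/* one PMD-sized range */
	int fd = open("/tmp/example-bin", O_RDONLY);
	void *p;

	if (fd < 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	for (int attempt = 0; attempt < 3; attempt++) {
		if (!madvise(p, len, MADV_COLLAPSE))
			return 0;		/* collapsed */
		if (errno != EAGAIN)
			break;			/* not a transient failure */
		usleep(100 * 1000);		/* give writeback a moment */
	}
	perror("madvise(MADV_COLLAPSE)");
	return 1;
}

Before this change, the same dirty/writeback condition surfaced as -EINVAL,
which gave the caller no indication that a retry could succeed.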
Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") Reported-by: Branden Moore Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com Reviewed-by: Dev Jain Reviewed-by: Lance Yang Reviewed-by: Baolin Wang Reviewed-by: wang lian Acked-by: David Hildenbrand (Red Hat) Signed-off-by: Shivank Garg Signed-off-by: Yuanhe Shu --- include/trace/events/huge_memory.h | 3 ++- mm/khugepaged.c | 8 +++++--- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index 9277524e84eb..5dc636add309 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -39,7 +39,8 @@ EM( SCAN_PAGE_HAS_PRIVATE, "page_has_private") \ EM( SCAN_STORE_FAILED, "store_failed") \ EM( SCAN_COPY_MC, "copy_poisoned_page") \ - EMe(SCAN_PAGE_FILLED, "page_filled") + EM( SCAN_PAGE_FILLED, "page_filled") \ + EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") #undef EM #undef EMe diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 242890836c58..5e8fed461bd3 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -60,6 +60,7 @@ enum scan_result { SCAN_STORE_FAILED, SCAN_COPY_MC, SCAN_PAGE_FILLED, + SCAN_PAGE_DIRTY_OR_WRITEBACK, }; #define CREATE_TRACE_POINTS @@ -1934,11 +1935,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, */ xas_unlock_irq(&xas); filemap_flush(mapping); - result = SCAN_FAIL; + result = SCAN_PAGE_DIRTY_OR_WRITEBACK; goto xa_unlocked; } else if (folio_test_writeback(folio)) { xas_unlock_irq(&xas); - result = SCAN_FAIL; + result = SCAN_PAGE_DIRTY_OR_WRITEBACK; goto xa_unlocked; } else if (folio_trylock(folio)) { folio_get(folio); @@ -1985,7 +1986,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, * folio is dirty because it hasn't been flushed * since first write. */ - result = SCAN_FAIL; + result = SCAN_PAGE_DIRTY_OR_WRITEBACK; goto out_unlock; } @@ -2748,6 +2749,7 @@ static int madvise_collapse_errno(enum scan_result r) case SCAN_PAGE_LRU: case SCAN_DEL_PAGE_LRU: case SCAN_PAGE_FILLED: + case SCAN_PAGE_DIRTY_OR_WRITEBACK: return -EAGAIN; /* * Other: Trying again likely not to succeed / error intrinsic to -- Gitee From 9c73f8dc845ea2bf28f047d27344987c7c0bb040 Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Sun, 18 Jan 2026 19:09:43 +0000 Subject: [PATCH 07/28] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE ANBZ: #30191 cherry-picked from https://lore.kernel.org/lkml/20260118190939.8986-7-shivankg@amd.com/ [backport note] Drop '*lock_dropped = true;' as we don't have it. When MADV_COLLAPSE is called on file-backed mappings (e.g., executable text sections), the pages may still be dirty from recent writes. collapse_file() will trigger async writeback and fail with SCAN_PAGE_DIRTY_OR_WRITEBACK (-EAGAIN). MADV_COLLAPSE is a synchronous operation where userspace expects immediate results. If the collapse fails due to dirty pages, perform synchronous writeback on the specific range and retry once. This avoids spurious failures for freshly written executables while avoiding unnecessary synchronous I/O for mappings that are already clean. 
Reported-by: Branden Moore Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") Suggested-by: David Hildenbrand Tested-by: Lance Yang Signed-off-by: Shivank Garg Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 5e8fed461bd3..0085307c4c78 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -2791,7 +2792,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) { int result = SCAN_FAIL; + bool triggered_wb = false; +retry: if (!mmap_locked) { cond_resched(); mmap_read_lock(mm); @@ -2816,6 +2819,17 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, mmap_locked = false; result = hpage_collapse_scan_file(mm, addr, file, pgoff, cc); + + if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb && + mapping_can_writeback(file->f_mapping)) { + loff_t lstart = (loff_t)pgoff << PAGE_SHIFT; + loff_t lend = lstart + HPAGE_PMD_SIZE - 1; + + filemap_write_and_wait_range(file->f_mapping, lstart, lend); + triggered_wb = true; + fput(file); + goto retry; + } fput(file); } else { result = hpage_collapse_scan_pmd(mm, vma, addr, -- Gitee From 381abfe3549dff1898b76098b004b8c477ff59de Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Sun, 18 Jan 2026 19:22:53 +0000 Subject: [PATCH 08/28] mm/khugepaged: remove unnecessary goto 'skip' label ANBZ: #30191 cherry-picked from https://lore.kernel.org/lkml/20260118192253.9263-6-shivankg@amd.com/ Replace goto skip with actual logic for better code readability. No functional change. Reviewed-by: Liam R. Howlett Reviewed-by: Zi Yan Acked-by: David Hildenbrand (Red Hat) Reviewed-by: Lance Yang Signed-off-by: Shivank Garg Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 0085307c4c78..bc612ce6c1f1 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2421,14 +2421,15 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, break; } if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) { -skip: progress++; continue; } hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE); hend = round_down(vma->vm_end, HPAGE_PMD_SIZE); - if (khugepaged_scan.address > hend) - goto skip; + if (khugepaged_scan.address > hend) { + progress++; + continue; + } if (khugepaged_scan.address < hstart) khugepaged_scan.address = hstart; VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK); -- Gitee From 028295f8bed97f887c1a32bacf702b6226287783 Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Sun, 18 Jan 2026 19:22:55 +0000 Subject: [PATCH 09/28] mm/khugepaged: count small VMAs towards scan limit ANBZ: #30191 cherry-picked from https://lore.kernel.org/lkml/20260118192253.9263-8-shivankg@amd.com/ The khugepaged_scan_mm_slot() uses a 'progress' counter to limit the amount of work performed and consists of three components: 1. Transitioning to a new mm (+1). 2. Skipping an unsuitable VMA (+1). 3. Scanning a PMD-sized range (+HPAGE_PMD_NR). 
Consider a 1MB VMA sitting between two 2MB alignment boundaries: vma1 vma2 vma3 +----------+------+----------+ |2M |1M |2M | +----------+------+----------+ ^ ^ start end ^ hstart,hend In this case, for vma2: hstart = round_up(start, HPAGE_PMD_SIZE) -> Next 2MB alignment hend = round_down(end, HPAGE_PMD_SIZE) -> Prev 2MB alignment Currently, since `hend <= hstart`, VMAs that are too small or unaligned to contain a hugepage are skipped without incrementing 'progress'. A process containing a large number of such small VMAs will unfairly consume more CPU cycles before yielding compared to a process with fewer, larger, or aligned VMAs. Fix this by incrementing progress when the `hend <= hstart` condition is met. Additionally, change 'progress' type to `unsigned int` to match both the 'pages' type and the function return value. Suggested-by: Wei Yang Reviewed-by: Wei Yang Reviewed-by: Lance Yang Signed-off-by: Shivank Garg Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index bc612ce6c1f1..3d98c14f35d6 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2373,7 +2373,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, struct mm_slot *slot; struct mm_struct *mm; struct vm_area_struct *vma; - int progress = 0; + unsigned int progress = 0; VM_BUG_ON(!pages); lockdep_assert_held(&khugepaged_mm_lock); @@ -2426,7 +2426,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, } hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE); hend = round_down(vma->vm_end, HPAGE_PMD_SIZE); - if (khugepaged_scan.address > hend) { + if (khugepaged_scan.address > hend || hend <= hstart) { + /* VMA already scanned or too small/unaligned for hugepage. */ progress++; continue; } -- Gitee From abf9f2f5af2ba56015907d76baa710aeb90342fe Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Sun, 18 Jan 2026 19:22:57 +0000 Subject: [PATCH 10/28] mm/khugepaged: change collapse_pte_mapped_thp() to return void ANBZ: #30191 cherry-picked from https://lore.kernel.org/lkml/20260118192253.9263-10-shivankg@amd.com/ The only external caller of collapse_pte_mapped_thp() is uprobe, which ignores the return value. Change the external API to return void to simplify the interface. Introduce try_collapse_pte_mapped_thp() for internal use that preserves the return value. This prepares for future patch that will convert the return type to use enum scan_result. 
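The interface split follows a common pattern, sketched below in
self-contained form (everything except the try_ naming convention is a
stand-in, not kernel code):

/*
 * The exported entry point drops the return value its only external caller
 * ignores, while a try_* variant keeps the detailed status for internal
 * callers. The status values stand in for enum scan_result.
 */
#include <stdio.h>

enum status { STATUS_SUCCEED, STATUS_FAIL };

static enum status try_collapse(unsigned long addr)
{
	/* ... the real work would go here ... */
	return addr ? STATUS_SUCCEED : STATUS_FAIL;
}

/* External API: callers such as uprobes ignore the result. */
static void collapse(unsigned long addr)
{
	try_collapse(addr);
}

int main(void)
{
	collapse(0x200000UL);			/* fire and forget */
	printf("internal status: %d\n", try_collapse(0x200000UL));
	return 0;
}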
Suggested-by: David Hildenbrand (Red Hat) Acked-by: Lance Yang Acked-by: David Hildenbrand (Red Hat) Reviewed-by: Zi Yan Signed-off-by: Shivank Garg Signed-off-by: Yuanhe Shu --- include/linux/khugepaged.h | 9 ++++----- mm/khugepaged.c | 40 ++++++++++++++++++++++---------------- 2 files changed, 27 insertions(+), 22 deletions(-) diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h index 30baae91b225..9b2b8ef8f7cf 100644 --- a/include/linux/khugepaged.h +++ b/include/linux/khugepaged.h @@ -18,13 +18,12 @@ extern void khugepaged_enter_vma(struct vm_area_struct *vma, extern void khugepaged_min_free_kbytes_update(void); extern bool current_is_khugepaged(void); #ifdef CONFIG_SHMEM -extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, - bool install_pmd); +void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, + bool install_pmd); #else -static inline int collapse_pte_mapped_thp(struct mm_struct *mm, - unsigned long addr, bool install_pmd) +static inline void collapse_pte_mapped_thp(struct mm_struct *mm, + unsigned long addr, bool install_pmd) { - return 0; } #endif diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 3d98c14f35d6..0de7525c4227 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1486,20 +1486,8 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, return SCAN_SUCCEED; } -/** - * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at - * address haddr. - * - * @mm: process address space where collapse happens - * @addr: THP collapse address - * @install_pmd: If a huge PMD should be installed - * - * This function checks whether all the PTEs in the PMD are pointing to the - * right THP. If so, retract the page table so the THP can refault in with - * as pmd-mapped. Possibly install a huge PMD mapping the THP. - */ -int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, - bool install_pmd) +static int try_collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, + bool install_pmd) { struct mmu_notifier_range range; bool notified = false; @@ -1703,6 +1691,24 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, return result; } +/** + * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at + * address haddr. + * + * @mm: process address space where collapse happens + * @addr: THP collapse address + * @install_pmd: If a huge PMD should be installed + * + * This function checks whether all the PTEs in the PMD are pointing to the + * right THP. If so, retract the page table so the THP can refault in with + * as pmd-mapped. Possibly install a huge PMD mapping the THP. + */ +void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, + bool install_pmd) +{ + try_collapse_pte_mapped_thp(mm, addr, install_pmd); +} + static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) { struct vm_area_struct *vma; @@ -2200,7 +2206,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, /* * Remove pte page tables, so we can re-fault the page as huge. - * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp(). + * If MADV_COLLAPSE, adjust result to call try_collapse_pte_mapped_thp(). 
*/ retract_page_tables(mapping, start); if (cc && !cc->is_khugepaged) @@ -2459,7 +2465,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, mmap_read_lock(mm); if (hpage_collapse_test_exit_or_disable(mm)) goto breakouterloop; - *result = collapse_pte_mapped_thp(mm, + *result = try_collapse_pte_mapped_thp(mm, khugepaged_scan.address, false); if (*result == SCAN_PMD_MAPPED) *result = SCAN_SUCCEED; @@ -2850,7 +2856,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, BUG_ON(mmap_locked); BUG_ON(*prev); mmap_read_lock(mm); - result = collapse_pte_mapped_thp(mm, addr, true); + result = try_collapse_pte_mapped_thp(mm, addr, true); mmap_read_unlock(mm); goto handle_result; /* Whitelisted set of results where continuing OK */ -- Gitee From 286b897fb2513e3c860125bb3890fae59a6c2a89 Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Sun, 18 Jan 2026 19:22:59 +0000 Subject: [PATCH 11/28] mm/khugepaged: use enum scan_result for result variables and return types ANBZ: #30191 cherry-picked from https://lore.kernel.org/lkml/20260118192253.9263-12-shivankg@amd.com/ Convert result variables and return types from int to enum scan_result throughout khugepaged code. This improves type safety and code clarity by making the intent explicit. No functional change. Reviewed-by: Zi Yan Signed-off-by: Shivank Garg Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 100 +++++++++++++++++++++++------------------------- 1 file changed, 47 insertions(+), 53 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 0de7525c4227..b60616f5f0cf 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -565,16 +565,15 @@ static bool is_refcount_suitable(struct folio *folio) return folio_ref_count(folio) == expected_refcount; } -static int __collapse_huge_page_isolate(struct vm_area_struct *vma, - unsigned long address, - pte_t *pte, - struct collapse_control *cc, - struct list_head *compound_pagelist) +static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, + unsigned long address, pte_t *pte, struct collapse_control *cc, + struct list_head *compound_pagelist) { struct page *page = NULL; struct folio *folio = NULL; pte_t *_pte; - int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0; + int none_or_zero = 0, shared = 0, referenced = 0; + enum scan_result result = SCAN_FAIL; bool writable = false; for (_pte = pte; _pte < pte + HPAGE_PMD_NR; @@ -804,13 +803,13 @@ static void __collapse_huge_page_copy_failed(pte_t *pte, * @ptl: lock on raw pages' PTEs * @compound_pagelist: list that stores compound pages */ -static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio, +static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio, pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma, unsigned long address, spinlock_t *ptl, struct list_head *compound_pagelist) { unsigned int i; - int result = SCAN_SUCCEED; + enum scan_result result = SCAN_SUCCEED; /* * Copying pages' contents is subject to memory poison at any iteration. @@ -922,10 +921,8 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc) * Returns enum scan_result value. 
*/ -static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, - bool expect_anon, - struct vm_area_struct **vmap, - struct collapse_control *cc) +static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, + bool expect_anon, struct vm_area_struct **vmap, struct collapse_control *cc) { struct vm_area_struct *vma; enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED : @@ -954,7 +951,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, return SCAN_SUCCEED; } -static inline int check_pmd_state(pmd_t *pmd) +static inline enum scan_result check_pmd_state(pmd_t *pmd) { pmd_t pmde = pmdp_get_lockless(pmd); @@ -971,9 +968,8 @@ static inline int check_pmd_state(pmd_t *pmd) return SCAN_SUCCEED; } -static int find_pmd_or_thp_or_none(struct mm_struct *mm, - unsigned long address, - pmd_t **pmd) +static enum scan_result find_pmd_or_thp_or_none(struct mm_struct *mm, + unsigned long address, pmd_t **pmd) { *pmd = mm_find_pmd(mm, address); if (!*pmd) @@ -982,12 +978,11 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm, return check_pmd_state(*pmd); } -static int check_pmd_still_valid(struct mm_struct *mm, - unsigned long address, - pmd_t *pmd) +static enum scan_result check_pmd_still_valid(struct mm_struct *mm, + unsigned long address, pmd_t *pmd) { pmd_t *new_pmd; - int result = find_pmd_or_thp_or_none(mm, address, &new_pmd); + enum scan_result result = find_pmd_or_thp_or_none(mm, address, &new_pmd); if (result != SCAN_SUCCEED) return result; @@ -1003,15 +998,14 @@ static int check_pmd_still_valid(struct mm_struct *mm, * Called and returns without pte mapped or spinlocks held. * Returns result: if not SCAN_SUCCEED, mmap_lock has been released. */ -static int __collapse_huge_page_swapin(struct mm_struct *mm, - struct vm_area_struct *vma, - unsigned long haddr, pmd_t *pmd, - int referenced) +static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd, + int referenced) { int swapped_in = 0; vm_fault_t ret = 0; unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE); - int result; + enum scan_result result; pte_t *pte = NULL; spinlock_t *ptl; @@ -1075,8 +1069,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm, return result; } -static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm, - struct collapse_control *cc) +static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm, + struct collapse_control *cc) { gfp_t gfp = (cc->is_khugepaged ? 
alloc_hugepage_khugepaged_gfpmask() : GFP_TRANSHUGE); @@ -1103,9 +1097,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm, return SCAN_SUCCEED; } -static int collapse_huge_page(struct mm_struct *mm, unsigned long address, - int referenced, int unmapped, - struct collapse_control *cc) +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address, + int referenced, int unmapped, struct collapse_control *cc) { LIST_HEAD(compound_pagelist); pmd_t *pmd, _pmd; @@ -1113,7 +1106,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, pgtable_t pgtable; struct folio *folio; spinlock_t *pmd_ptl, *pte_ptl; - int result = SCAN_FAIL; + enum scan_result result = SCAN_FAIL; struct vm_area_struct *vma; struct mmu_notifier_range range; @@ -1270,15 +1263,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, return result; } -static int hpage_collapse_scan_pmd(struct mm_struct *mm, - struct vm_area_struct *vma, - unsigned long address, bool *mmap_locked, - struct collapse_control *cc) +static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm, + struct vm_area_struct *vma, unsigned long address, bool *mmap_locked, + struct collapse_control *cc) { pmd_t *pmd; pte_t *pte, *_pte; - int result = SCAN_FAIL, referenced = 0; - int none_or_zero = 0, shared = 0; + int none_or_zero = 0, shared = 0, referenced = 0; + enum scan_result result = SCAN_FAIL; struct page *page = NULL; struct folio *folio = NULL; unsigned long _address; @@ -1467,8 +1459,8 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) #ifdef CONFIG_SHMEM /* folio must be locked, and mmap_lock must be held */ -static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmdp, struct folio *folio, struct page *page) +static enum scan_result set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmdp, struct folio *folio, struct page *page) { struct vm_fault vmf = { .vma = vma, @@ -1486,7 +1478,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, return SCAN_SUCCEED; } -static int try_collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, +static enum scan_result try_collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, bool install_pmd) { struct mmu_notifier_range range; @@ -1497,7 +1489,9 @@ static int try_collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, pte_t *start_pte, *pte; pmd_t *pmd, pgt_pmd; spinlock_t *pml = NULL, *ptl; - int nr_ptes = 0, result = SCAN_FAIL; + enum scan_result result = SCAN_FAIL; + int nr_ptes = 0; + int i; mmap_assert_locked(mm); @@ -1831,9 +1825,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) * + unlock old pages * + unlock and free huge page; */ -static int collapse_file(struct mm_struct *mm, unsigned long addr, - struct file *file, pgoff_t start, - struct collapse_control *cc) +static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, + struct file *file, pgoff_t start, struct collapse_control *cc) { struct address_space *mapping = file->f_mapping; struct page *dst; @@ -1841,7 +1834,8 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, pgoff_t index = 0, end = start + HPAGE_PMD_NR; LIST_HEAD(pagelist); XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER); - int nr_none = 0, result = SCAN_SUCCEED; + enum scan_result result = SCAN_SUCCEED; + int nr_none = 0; bool is_shmem = shmem_file(file); 
VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem); @@ -2270,16 +2264,15 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, return result; } -static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, - struct file *file, pgoff_t start, - struct collapse_control *cc) +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, + struct file *file, pgoff_t start, struct collapse_control *cc) { struct folio *folio = NULL; struct address_space *mapping = file->f_mapping; XA_STATE(xas, &mapping->i_pages, start); int present, swap; int node = NUMA_NO_NODE; - int result = SCAN_SUCCEED; + enum scan_result result = SCAN_SUCCEED; present = 0; swap = 0; @@ -2369,7 +2362,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, } #endif -static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, +static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result *result, struct collapse_control *cc) __releases(&khugepaged_mm_lock) __acquires(&khugepaged_mm_lock) @@ -2544,7 +2537,7 @@ static void khugepaged_do_scan(struct collapse_control *cc) unsigned int progress = 0, pass_through_head = 0; unsigned int pages = READ_ONCE(khugepaged_pages_to_scan); bool wait = true; - int result = SCAN_SUCCEED; + enum scan_result result = SCAN_SUCCEED; lru_add_drain_all(); @@ -2776,7 +2769,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, struct collapse_control *cc; struct mm_struct *mm = vma->vm_mm; unsigned long hstart, hend, addr; - int thps = 0, last_fail = SCAN_FAIL; + enum scan_result last_fail = SCAN_FAIL; + int thps = 0; bool mmap_locked = true; BUG_ON(vma->vm_start > start); @@ -2799,7 +2793,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, hend = end & HPAGE_PMD_MASK; for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) { - int result = SCAN_FAIL; + enum scan_result result = SCAN_FAIL; bool triggered_wb = false; retry: -- Gitee From bbd0a16deac0ee6c0518c3d4cda5d3d80bc27586 Mon Sep 17 00:00:00 2001 From: Shivank Garg Date: Sun, 18 Jan 2026 19:23:01 +0000 Subject: [PATCH 12/28] mm/khugepaged: make khugepaged_collapse_control static ANBZ: #30191 cherry-picked from https://lore.kernel.org/lkml/20260118192253.9263-14-shivankg@amd.com/ The global variable 'khugepaged_collapse_control' is not used outside of mm/khugepaged.c. Make it static to limit its scope. Reviewed-by: Wei Yang Reviewed-by: Zi Yan Acked-by: David Hildenbrand (Red Hat) Reviewed-by: Anshuman Khandual Signed-off-by: Shivank Garg Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index b60616f5f0cf..c36984c3384c 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -851,7 +851,7 @@ static void khugepaged_alloc_sleep(void) remove_wait_queue(&khugepaged_wait, &wait); } -struct collapse_control khugepaged_collapse_control = { +static struct collapse_control khugepaged_collapse_control = { .is_khugepaged = true, }; -- Gitee From 4244efc6cdf00a628dd6ba93475a43f30b28c9a7 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:26 -0700 Subject: [PATCH 13/28] mm: introduce is_pmd_order helper ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-2-npache@redhat.com/ In order to add mTHP support to khugepaged, we will often be checking if a given order is (or is not) a PMD order. 
Some places in the kernel already use this check, so lets create a simple helper function to keep the code clean and readable. Reviewed-by: Lorenzo Stoakes Suggested-by: Lorenzo Stoakes Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- include/linux/huge_mm.h | 5 +++++ mm/huge_memory.c | 2 +- mm/khugepaged.c | 4 ++-- mm/mempolicy.c | 2 +- mm/page_alloc.c | 2 +- 5 files changed, 10 insertions(+), 5 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index b42d9ea84a98..95d2aa68125a 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -627,6 +627,11 @@ static inline int file_exec_order(void) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +static inline bool is_pmd_order(unsigned int order) +{ + return order == HPAGE_PMD_ORDER; +} + static inline int split_folio_to_list(struct folio *folio, struct list_head *list) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 23097d7e5b70..6e3bf6e86da1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3462,7 +3462,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) i_mmap_unlock_read(mapping); out: xas_destroy(&xas); - if (order == HPAGE_PMD_ORDER) + if (is_pmd_order(order)) count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED); count_mthp_stat(order, !ret ? MTHP_STAT_SPLIT : MTHP_STAT_SPLIT_FAILED); return ret; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index c36984c3384c..48c4d551c5f6 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1968,7 +1968,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, * we locked the first folio, then a THP might be there already. * This will be discovered on the first iteration. */ - if (folio_order(folio) == HPAGE_PMD_ORDER && + if (is_pmd_order(folio_order(folio)) && folio->index == start) { /* Maybe PMD-mapped */ result = SCAN_PTE_MAPPED_HUGEPAGE; @@ -2294,7 +2294,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned continue; } - if (folio_order(folio) == HPAGE_PMD_ORDER && + if (is_pmd_order(folio_order(folio)) && folio->index == start) { /* Maybe PMD-mapped */ result = SCAN_PTE_MAPPED_HUGEPAGE; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index f863c3690b09..910a96c61383 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2271,7 +2271,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && /* filter "hugepage" allocation, unless from alloc_pages() */ - order == HPAGE_PMD_ORDER && ilx != NO_INTERLEAVE_INDEX) { + is_pmd_order(order) && ilx != NO_INTERLEAVE_INDEX) { /* * For hugepage allocation and non-interleave policy which * allows the current node (or other explicitly preferred diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 41bf25fe0610..381630c95fd2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -549,7 +549,7 @@ static inline bool pcp_allowed_order(unsigned int order) if (order <= PAGE_ALLOC_COSTLY_ORDER) return true; #ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (order == HPAGE_PMD_ORDER) + if (is_pmd_order(order)) return true; #endif return false; -- Gitee From 08b541b4f351309da143403cda42bcd3c3765722 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:27 -0700 Subject: [PATCH 14/28] khugepaged: rename hpage_collapse_* to collapse_* ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-3-npache@redhat.com/ The hpage_collapse functions describe functions used by madvise_collapse and khugepaged. 
remove the unnecessary hpage prefix to shorten the function name. Reviewed-by: Wei Yang Reviewed-by: Lance Yang Reviewed-by: Liam R. Howlett Reviewed-by: Zi Yan Reviewed-by: Baolin Wang Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 70 ++++++++++++++++++++++++------------------------- 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 48c4d551c5f6..9dfe7d1e3b58 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -407,7 +407,7 @@ void __init khugepaged_destroy(void) kmem_cache_destroy(mm_slot_cache); } -static inline int hpage_collapse_test_exit(struct mm_struct *mm) +static inline int collapse_test_exit(struct mm_struct *mm) { return atomic_read(&mm->mm_users) == 0; } @@ -436,9 +436,9 @@ static bool hugepage_pmd_enabled(void) return false; } -static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm) +static inline int collapse_test_exit_or_disable(struct mm_struct *mm) { - return hpage_collapse_test_exit(mm) || + return collapse_test_exit(mm) || test_bit(MMF_DISABLE_THP, &mm->flags); } @@ -449,7 +449,7 @@ void __khugepaged_enter(struct mm_struct *mm) int wakeup; /* __khugepaged_exit() must not run from under us */ - VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm); + VM_BUG_ON_MM(collapse_test_exit(mm), mm); if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) return; @@ -507,7 +507,7 @@ void __khugepaged_exit(struct mm_struct *mm) } else if (mm_slot) { /* * This is required to serialize against - * hpage_collapse_test_exit() (which is guaranteed to run + * collapse_test_exit() (which is guaranteed to run * under mmap sem read mode). Stop here (after we return all * pagetables will be destroyed) until khugepaged has finished * working on the pagetables under the mmap_lock. @@ -855,7 +855,7 @@ static struct collapse_control khugepaged_collapse_control = { .is_khugepaged = true, }; -static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc) +static bool collapse_scan_abort(int nid, struct collapse_control *cc) { int i; @@ -890,7 +890,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void) } #ifdef CONFIG_NUMA -static int hpage_collapse_find_target_node(struct collapse_control *cc) +static int collapse_find_target_node(struct collapse_control *cc) { int nid, target_node = 0, max_value = 0; @@ -909,7 +909,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc) return target_node; } #else -static int hpage_collapse_find_target_node(struct collapse_control *cc) +static int collapse_find_target_node(struct collapse_control *cc) { return 0; } @@ -928,7 +928,7 @@ static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned l enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE; - if (unlikely(hpage_collapse_test_exit_or_disable(mm))) + if (unlikely(collapse_test_exit_or_disable(mm))) return SCAN_ANY_PROCESS; *vmap = vma = find_vma(mm, address); @@ -993,7 +993,7 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm, /* * Bring missing pages in from swap, to complete THP collapse. - * Only done if hpage_collapse_scan_pmd believes it is worthwhile. + * Only done if khugepaged_scan_pmd believes it is worthwhile. * * Called and returns without pte mapped or spinlocks held. * Returns result: if not SCAN_SUCCEED, mmap_lock has been released. 
@@ -1074,7 +1074,7 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru { gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() : GFP_TRANSHUGE); - int node = hpage_collapse_find_target_node(cc); + int node = collapse_find_target_node(cc); struct folio *folio; folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask); @@ -1263,9 +1263,10 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a return result; } -static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long address, bool *mmap_locked, - struct collapse_control *cc) +static enum scan_result collapse_scan_pmd(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, bool *mmap_locked, + struct collapse_control *cc) { pmd_t *pmd; pte_t *pte, *_pte; @@ -1367,7 +1368,7 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm, * hit record. */ node = folio_nid(folio); - if (hpage_collapse_scan_abort(node, cc)) { + if (collapse_scan_abort(node, cc)) { result = SCAN_SCAN_ABORT; goto out_unmap; } @@ -1440,7 +1441,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot) lockdep_assert_held(&khugepaged_mm_lock); - if (hpage_collapse_test_exit(mm)) { + if (collapse_test_exit(mm)) { /* free mm_slot */ hash_del(&slot->hash); list_del(&slot->mm_node); @@ -1734,7 +1735,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED) continue; - if (hpage_collapse_test_exit(mm)) + if (collapse_test_exit(mm)) continue; /* * When a vma is registered with uffd-wp, we cannot recycle @@ -2264,8 +2265,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, return result; } -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, - struct file *file, pgoff_t start, struct collapse_control *cc) +static enum scan_result collapse_scan_file(struct mm_struct *mm, unsigned long addr, + struct file *file, pgoff_t start, + struct collapse_control *cc) { struct folio *folio = NULL; struct address_space *mapping = file->f_mapping; @@ -2308,7 +2310,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned } node = folio_nid(folio); - if (hpage_collapse_scan_abort(node, cc)) { + if (collapse_scan_abort(node, cc)) { result = SCAN_SCAN_ABORT; break; } @@ -2354,7 +2356,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned return result; } #else -static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, +static int collapse_scan_file(struct mm_struct *mm, unsigned long addr, struct file *file, pgoff_t start, struct collapse_control *cc) { @@ -2362,7 +2364,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, } #endif -static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result *result, +static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *result, struct collapse_control *cc) __releases(&khugepaged_mm_lock) __acquires(&khugepaged_mm_lock) @@ -2407,7 +2409,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result goto breakouterloop_mmap_lock; progress++; - if (unlikely(hpage_collapse_test_exit_or_disable(mm))) + if (unlikely(collapse_test_exit_or_disable(mm))) goto breakouterloop; vma_iter_init(&vmi, mm, khugepaged_scan.address); @@ -2415,7 +2417,7 @@ static unsigned 
int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result unsigned long hstart, hend; cond_resched(); - if (unlikely(hpage_collapse_test_exit_or_disable(mm))) { + if (unlikely(collapse_test_exit_or_disable(mm))) { progress++; break; } @@ -2438,7 +2440,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result bool mmap_locked = true; cond_resched(); - if (unlikely(hpage_collapse_test_exit_or_disable(mm))) + if (unlikely(collapse_test_exit_or_disable(mm))) goto breakouterloop; VM_BUG_ON(khugepaged_scan.address < hstart || @@ -2451,12 +2453,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result mmap_read_unlock(mm); mmap_locked = false; - *result = hpage_collapse_scan_file(mm, + *result = collapse_scan_file(mm, khugepaged_scan.address, file, pgoff, cc); fput(file); if (*result == SCAN_PTE_MAPPED_HUGEPAGE) { mmap_read_lock(mm); - if (hpage_collapse_test_exit_or_disable(mm)) + if (collapse_test_exit_or_disable(mm)) goto breakouterloop; *result = try_collapse_pte_mapped_thp(mm, khugepaged_scan.address, false); @@ -2465,7 +2467,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result mmap_read_unlock(mm); } } else { - *result = hpage_collapse_scan_pmd(mm, vma, + *result = collapse_scan_pmd(mm, vma, khugepaged_scan.address, &mmap_locked, cc); } @@ -2498,7 +2500,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result * Release the current mm_slot if this mm is about to die, or * if we scanned all vmas of this mm. */ - if (hpage_collapse_test_exit(mm) || !vma) { + if (collapse_test_exit(mm) || !vma) { /* * Make sure that if mm_users is reaching zero while * khugepaged runs here, khugepaged_exit will find @@ -2552,8 +2554,8 @@ static void khugepaged_do_scan(struct collapse_control *cc) pass_through_head++; if (khugepaged_has_work() && pass_through_head < 2) - progress += khugepaged_scan_mm_slot(pages - progress, - &result, cc); + progress += collapse_scan_mm_slot(pages - progress, + &result, cc); else progress = pages; spin_unlock(&khugepaged_mm_lock); @@ -2819,8 +2821,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, mmap_read_unlock(mm); mmap_locked = false; - result = hpage_collapse_scan_file(mm, addr, file, pgoff, - cc); + result = collapse_scan_file(mm, addr, file, pgoff, cc); if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb && mapping_can_writeback(file->f_mapping)) { @@ -2834,8 +2835,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, } fput(file); } else { - result = hpage_collapse_scan_pmd(mm, vma, addr, - &mmap_locked, cc); + result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc); } if (!mmap_locked) *prev = NULL; /* Tell caller we dropped mmap_lock */ -- Gitee From 45d2a5ed660459bd2196db0046c48168e7a29bf8 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Mon, 1 Dec 2025 10:46:13 -0700 Subject: [PATCH 15/28] introduce collapse_single_pmd to unify khugepaged and madvise_collapse ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-4-npache@redhat.com/ [Backport Note] Merge fix: https://lore.kernel.org/all/b824f131-3e51-422c-9e98-044b0a2928a6@redhat.com/ The khugepaged daemon and madvise_collapse have two different implementations that do almost the same thing. Create collapse_single_pmd to increase code reuse and create an entry point to these two users. Refactor madvise_collapse and collapse_scan_mm_slot to use the new collapse_single_pmd function. 
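For illustration, the resulting call shape at the two sites is roughly the following (a sketch; the exact lines appear in the diff below):

  /* khugepaged, per scan address in the current mm_slot */
  *result = collapse_single_pmd(khugepaged_scan.address, vma, &mmap_locked, cc);

  /* madvise_collapse, per PMD-aligned addr in the requested range */
  result = collapse_single_pmd(addr, vma, &mmap_locked, cc);

Both callers hand a PMD-aligned address to collapse_single_pmd and let it choose between the anonymous and file-backed paths.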
This introduces a minor behavioral change that is most likely an undiscovered bug. The current implementation of khugepaged tests collapse_test_exit_or_disable before calling collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse case. By unifying these two callers madvise_collapse now also performs this check. We also modify the return value to be SCAN_ANY_PROCESS which properly indicates that this process is no longer valid to operate on. We also guard the khugepaged_pages_collapsed variable to ensure its only incremented for khugepaged. Reviewed-by: Wei Yang Reviewed-by: Lance Yang Reviewed-by: Lorenzo Stoakes Reviewed-by: Baolin Wang Acked-by: David Hildenbrand Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 121 +++++++++++++++++++++++++----------------------- 1 file changed, 64 insertions(+), 57 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 9dfe7d1e3b58..4db573ea2310 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2364,6 +2364,62 @@ static int collapse_scan_file(struct mm_struct *mm, unsigned long addr, } #endif +/* + * Try to collapse a single PMD starting at a PMD aligned addr, and return + * the results. + */ +static enum scan_result collapse_single_pmd(unsigned long addr, + struct vm_area_struct *vma, bool *mmap_locked, + struct collapse_control *cc) +{ + struct mm_struct *mm = vma->vm_mm; + enum scan_result result; + struct file *file; + pgoff_t pgoff; + + if (vma_is_anonymous(vma)) { + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc); + goto end; + } + + file = get_file(vma->vm_file); + pgoff = linear_page_index(vma, addr); + + mmap_read_unlock(mm); + *mmap_locked = false; + result = collapse_scan_file(mm, addr, file, pgoff, cc); + + if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK && + mapping_can_writeback(file->f_mapping)) { + const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT; + const loff_t lend = lstart + HPAGE_PMD_SIZE - 1; + + filemap_write_and_wait_range(file->f_mapping, lstart, lend); + } + fput(file); + + if (result != SCAN_PTE_MAPPED_HUGEPAGE) + goto end; + + mmap_read_lock(mm); + *mmap_locked = true; + if (collapse_test_exit_or_disable(mm)) { + mmap_read_unlock(mm); + *mmap_locked = false; + return SCAN_ANY_PROCESS; + } + result = try_collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged); + if (result == SCAN_PMD_MAPPED) + result = SCAN_SUCCEED; + mmap_read_unlock(mm); + *mmap_locked = false; + +end: + if (cc->is_khugepaged && result == SCAN_SUCCEED) + ++khugepaged_pages_collapsed; + return result; +} + static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *result, struct collapse_control *cc) __releases(&khugepaged_mm_lock) @@ -2446,33 +2502,8 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result * VM_BUG_ON(khugepaged_scan.address < hstart || khugepaged_scan.address + HPAGE_PMD_SIZE > hend); - if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) { - struct file *file = get_file(vma->vm_file); - pgoff_t pgoff = linear_page_index(vma, - khugepaged_scan.address); - - mmap_read_unlock(mm); - mmap_locked = false; - *result = collapse_scan_file(mm, - khugepaged_scan.address, file, pgoff, cc); - fput(file); - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) { - mmap_read_lock(mm); - if (collapse_test_exit_or_disable(mm)) - goto breakouterloop; - *result = try_collapse_pte_mapped_thp(mm, - khugepaged_scan.address, false); - if (*result == SCAN_PMD_MAPPED) - *result = SCAN_SUCCEED; - mmap_read_unlock(mm); - } - } else { - *result = 
collapse_scan_pmd(mm, vma, - khugepaged_scan.address, &mmap_locked, cc); - } - - if (*result == SCAN_SUCCEED) - ++khugepaged_pages_collapsed; + *result = collapse_single_pmd(khugepaged_scan.address, + vma, &mmap_locked, cc); /* move to next address */ khugepaged_scan.address += HPAGE_PMD_SIZE; @@ -2815,46 +2846,22 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, mmap_assert_locked(mm); memset(cc->node_load, 0, sizeof(cc->node_load)); nodes_clear(cc->alloc_nmask); - if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) { - struct file *file = get_file(vma->vm_file); - pgoff_t pgoff = linear_page_index(vma, addr); - mmap_read_unlock(mm); - mmap_locked = false; - result = collapse_scan_file(mm, addr, file, pgoff, cc); - - if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb && - mapping_can_writeback(file->f_mapping)) { - loff_t lstart = (loff_t)pgoff << PAGE_SHIFT; - loff_t lend = lstart + HPAGE_PMD_SIZE - 1; - - filemap_write_and_wait_range(file->f_mapping, lstart, lend); - triggered_wb = true; - fput(file); - goto retry; - } - fput(file); - } else { - result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc); + result = collapse_single_pmd(addr, vma, &mmap_locked, cc); + + if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) { + triggered_wb = true; + goto retry; } - if (!mmap_locked) - *prev = NULL; /* Tell caller we dropped mmap_lock */ -handle_result: switch (result) { case SCAN_SUCCEED: case SCAN_PMD_MAPPED: ++thps; break; - case SCAN_PTE_MAPPED_HUGEPAGE: - BUG_ON(mmap_locked); - BUG_ON(*prev); - mmap_read_lock(mm); - result = try_collapse_pte_mapped_thp(mm, addr, true); - mmap_read_unlock(mm); - goto handle_result; /* Whitelisted set of results where continuing OK */ case SCAN_PMD_NULL: + case SCAN_PTE_MAPPED_HUGEPAGE: case SCAN_PTE_NON_PRESENT: case SCAN_PTE_UFFD_WP: case SCAN_PAGE_RO: -- Gitee From 296b40fb55bdf723f895023279663b61b5cf92e7 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:29 -0700 Subject: [PATCH 16/28] khugepaged: generalize hugepage_vma_revalidate for mTHP support ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-5-npache@redhat.com/ For khugepaged to support different mTHP orders, we must generalize this to check if the PMD is not shared by another VMA and that the order is enabled. No functional change in this patch. Also correct a comment about the functionality of the revalidation. Reviewed-by: Wei Yang Reviewed-by: Lance Yang Reviewed-by: Baolin Wang Reviewed-by: Lorenzo Stoakes Reviewed-by: Zi Yan Acked-by: David Hildenbrand Co-developed-by: Dev Jain Signed-off-by: Dev Jain Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 4db573ea2310..d64d782a9762 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -917,12 +917,13 @@ static int collapse_find_target_node(struct collapse_control *cc) /* * If mmap_lock temporarily dropped, revalidate vma - * before taking mmap_lock. + * after taking the mmap_lock again. * Returns enum scan_result value. */ static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, - bool expect_anon, struct vm_area_struct **vmap, struct collapse_control *cc) + bool expect_anon, struct vm_area_struct **vmap, + struct collapse_control *cc, unsigned int order) { struct vm_area_struct *vma; enum tva_type type = cc->is_khugepaged ? 
TVA_KHUGEPAGED : @@ -935,15 +936,16 @@ static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned l if (!vma) return SCAN_VMA_NULL; + /* Always check the PMD order to ensure its not shared by another VMA */ if (!thp_vma_suitable_order(vma, address, PMD_ORDER)) return SCAN_ADDRESS_RANGE; - if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER)) + if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, BIT(order))) return SCAN_VMA_CHECK; /* * Anon VMA expected, the address may be unmapped then * remapped to file after khugepaged reaquired the mmap_lock. * - * thp_vma_allowable_order may return true for qualified file + * thp_vma_allowable_orders may return true for qualified file * vmas. */ if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap))) @@ -1132,7 +1134,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a * in case the special case has happened. */ mmap_read_lock(mm); - result = hugepage_vma_revalidate(mm, address, true, &vma, cc); + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, + HPAGE_PMD_ORDER); if (result != SCAN_SUCCEED) { mmap_read_unlock(mm); goto out_nolock; @@ -1163,7 +1166,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a * handled by the anon_vma lock + PG_lock. */ mmap_write_lock(mm); - result = hugepage_vma_revalidate(mm, address, true, &vma, cc); + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, + HPAGE_PMD_ORDER); if (result != SCAN_SUCCEED) goto out_up_write; /* check if the pmd is still valid */ @@ -2835,7 +2839,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, mmap_read_lock(mm); mmap_locked = true; result = hugepage_vma_revalidate(mm, addr, false, &vma, - cc); + cc, HPAGE_PMD_ORDER); if (result != SCAN_SUCCEED) { last_fail = result; goto out_nolock; -- Gitee From ecc6dc5899c6e55d8c95d3b9ea173b806aef3118 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Thu, 22 Jan 2026 12:28:30 -0700 Subject: [PATCH 17/28] khugepaged: generalize alloc_charge_folio() ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-6-npache@redhat.com/ Pass order to alloc_charge_folio() and update mTHP statistics. Reviewed-by: Wei Yang Reviewed-by: Lance Yang Reviewed-by: Baolin Wang Reviewed-by: Lorenzo Stoakes Reviewed-by: Zi Yan Acked-by: David Hildenbrand Co-developed-by: Nico Pache Signed-off-by: Nico Pache Signed-off-by: Dev Jain Signed-off-by: Yuanhe Shu --- Documentation/admin-guide/mm/transhuge.rst | 8 ++++++++ include/linux/huge_mm.h | 2 ++ mm/huge_memory.c | 4 ++++ mm/khugepaged.c | 17 +++++++++++------ 4 files changed, 25 insertions(+), 6 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 0f0b3ff0cb88..50e3b1fbca89 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -626,6 +626,14 @@ anon_fault_fallback_charge instead falls back to using huge pages with lower orders or small pages even though the allocation was successful. +collapse_alloc + is incremented every time a huge page is successfully allocated for a + khugepaged collapse. + +collapse_alloc_failed + is incremented every time a huge page allocation fails during a + khugepaged collapse. + swpout is incremented every time a huge page is swapped out in one piece without splitting. 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 95d2aa68125a..b6898110b3d9 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -124,6 +124,8 @@ enum mthp_stat_item { MTHP_STAT_ANON_FAULT_ALLOC, MTHP_STAT_ANON_FAULT_FALLBACK, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE, + MTHP_STAT_COLLAPSE_ALLOC, + MTHP_STAT_COLLAPSE_ALLOC_FAILED, MTHP_STAT_SWPOUT, MTHP_STAT_SWPOUT_FALLBACK, MTHP_STAT_SHMEM_ALLOC, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6e3bf6e86da1..5775d1b3a802 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -654,6 +654,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name) DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); +DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC); +DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED); DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT); DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK); #ifdef CONFIG_SHMEM @@ -711,6 +713,8 @@ static struct attribute *any_stats_attrs[] = { #endif &split_attr.attr, &split_failed_attr.attr, + &collapse_alloc_attr.attr, + &collapse_alloc_failed_attr.attr, NULL, }; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index d64d782a9762..c8ed08e526ae 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1072,21 +1072,26 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm, } static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm, - struct collapse_control *cc) + struct collapse_control *cc, unsigned int order) { gfp_t gfp = (cc->is_khugepaged ? 
alloc_hugepage_khugepaged_gfpmask() : GFP_TRANSHUGE); int node = collapse_find_target_node(cc); struct folio *folio; - folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask); + folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask); if (!folio) { *foliop = NULL; - count_vm_event(THP_COLLAPSE_ALLOC_FAILED); + if (is_pmd_order(order)) + count_vm_event(THP_COLLAPSE_ALLOC_FAILED); + count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED); return SCAN_ALLOC_HUGE_PAGE_FAIL; } - count_vm_event(THP_COLLAPSE_ALLOC); + if (is_pmd_order(order)) + count_vm_event(THP_COLLAPSE_ALLOC); + count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC); + if (unlikely(mem_cgroup_charge(folio, mm, gfp))) { folio_put(folio); *foliop = NULL; @@ -1122,7 +1127,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a */ mmap_read_unlock(mm); - result = alloc_charge_folio(&folio, mm, cc); + result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER); if (result != SCAN_SUCCEED) goto out_nolock; @@ -1846,7 +1851,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem); VM_BUG_ON(start & (HPAGE_PMD_NR - 1)); - result = alloc_charge_folio(&new_folio, mm, cc); + result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER); if (result != SCAN_SUCCEED) goto out; -- Gitee From 174dc2ccfa32848817ebb1df278ff53f2234cef9 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:31 -0700 Subject: [PATCH 18/28] khugepaged: generalize __collapse_huge_page_* for mTHP support ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-7-npache@redhat.com/ generalize the order of the __collapse_huge_page_* functions to support future mTHP collapse. mTHP collapse will not honor the khugepaged_max_ptes_shared or khugepaged_max_ptes_swap parameters, and will fail if it encounters a shared or swapped entry. No functional changes in this patch. 
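As a worked example of the per-order scaling introduced here (assuming 4K base pages, so HPAGE_PMD_ORDER is 9 and HPAGE_PMD_NR is 512, with the default khugepaged_max_ptes_none of 511):

  /* scaled limit used by __collapse_huge_page_isolate() in this patch */
  int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
  /* order 9 (2M PMD):   511 >> 0 = 511 none PTEs allowed out of 512 */
  /* order 4 (64K mTHP): 511 >> 5 =  15 none PTEs allowed out of  16 */

A later patch in this series replaces this shift with the collapse_max_ptes_none() helper to avoid the "creep" problem described there.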
Reviewed-by: Wei Yang Reviewed-by: Lance Yang Reviewed-by: Lorenzo Stoakes Reviewed-by: Baolin Wang Acked-by: David Hildenbrand Co-developed-by: Dev Jain Signed-off-by: Dev Jain Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 71 ++++++++++++++++++++++++++++++++----------------- 1 file changed, 46 insertions(+), 25 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index c8ed08e526ae..3c599479626f 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -567,16 +567,18 @@ static bool is_refcount_suitable(struct folio *folio) static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, unsigned long address, pte_t *pte, struct collapse_control *cc, - struct list_head *compound_pagelist) + unsigned int order, struct list_head *compound_pagelist) { struct page *page = NULL; struct folio *folio = NULL; pte_t *_pte; int none_or_zero = 0, shared = 0, referenced = 0; enum scan_result result = SCAN_FAIL; + const unsigned long nr_pages = 1UL << order; + int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order); bool writable = false; - for (_pte = pte; _pte < pte + HPAGE_PMD_NR; + for (_pte = pte; _pte < pte + nr_pages; _pte++, address += PAGE_SIZE) { pte_t pteval = ptep_get(_pte); if (pte_none(pteval) || (pte_present(pteval) && @@ -584,7 +586,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, ++none_or_zero; if (!userfaultfd_armed(vma) && (!cc->is_khugepaged || - none_or_zero <= khugepaged_max_ptes_none)) { + none_or_zero <= max_ptes_none)) { continue; } else { result = SCAN_EXCEED_NONE_PTE; @@ -611,8 +613,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, if (page_mapcount(page) > 1) { ++shared; - if (cc->is_khugepaged && - shared > khugepaged_max_ptes_shared) { + /* + * TODO: Support shared pages without leading to further + * mTHP collapses. Currently bringing in new pages via + * shared may cause a future higher order collapse on a + * rescan of the same range. + */ + if (!is_pmd_order(order) || (cc->is_khugepaged && + shared > khugepaged_max_ptes_shared)) { result = SCAN_EXCEED_SHARED_PTE; count_vm_event(THP_SCAN_EXCEED_SHARED_PTE); goto out; @@ -710,18 +718,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, } static void __collapse_huge_page_copy_succeeded(pte_t *pte, - struct vm_area_struct *vma, - unsigned long address, - spinlock_t *ptl, - struct list_head *compound_pagelist) + struct vm_area_struct *vma, unsigned long address, + spinlock_t *ptl, unsigned int order, + struct list_head *compound_pagelist) { struct folio *src_folio; struct page *src_page; struct page *tmp; pte_t *_pte; pte_t pteval; + const unsigned long nr_pages = 1UL << order; - for (_pte = pte; _pte < pte + HPAGE_PMD_NR; + for (_pte = pte; _pte < pte + nr_pages; _pte++, address += PAGE_SIZE) { pteval = ptep_get(_pte); if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { @@ -765,13 +773,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte, } static void __collapse_huge_page_copy_failed(pte_t *pte, - pmd_t *pmd, - pmd_t orig_pmd, - struct vm_area_struct *vma, - struct list_head *compound_pagelist) + pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma, + unsigned int order, struct list_head *compound_pagelist) { spinlock_t *pmd_ptl; - + const unsigned long nr_pages = 1UL << order; /* * Re-establish the PMD to point to the original page table * entry. 
Restoring PMD needs to be done prior to releasing @@ -785,7 +791,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte, * Release both raw and compound pages isolated * in __collapse_huge_page_isolate. */ - release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist); + release_pte_pages(pte, pte + nr_pages, compound_pagelist); } /* @@ -805,16 +811,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte, */ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio, pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma, - unsigned long address, spinlock_t *ptl, + unsigned long address, spinlock_t *ptl, unsigned int order, struct list_head *compound_pagelist) { unsigned int i; enum scan_result result = SCAN_SUCCEED; - + const unsigned long nr_pages = 1UL << order; /* * Copying pages' contents is subject to memory poison at any iteration. */ - for (i = 0; i < HPAGE_PMD_NR; i++) { + for (i = 0; i < nr_pages; i++) { pte_t pteval = ptep_get(pte + i); struct page *page = folio_page(folio, i); unsigned long src_addr = address + i * PAGE_SIZE; @@ -833,10 +839,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli if (likely(result == SCAN_SUCCEED)) __collapse_huge_page_copy_succeeded(pte, vma, address, ptl, - compound_pagelist); + order, compound_pagelist); else __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma, - compound_pagelist); + order, compound_pagelist); return result; } @@ -1001,12 +1007,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm, * Returns result: if not SCAN_SUCCEED, mmap_lock has been released. */ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd, - int referenced) + struct vm_area_struct *vma, unsigned long haddr, + pmd_t *pmd, int referenced, unsigned int order) { int swapped_in = 0; vm_fault_t ret = 0; - unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE); + unsigned long address, end = haddr + (PAGE_SIZE << order); enum scan_result result; pte_t *pte = NULL; spinlock_t *ptl; @@ -1033,6 +1039,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm, if (!is_swap_pte(vmf.orig_pte)) continue; + /* + * TODO: Support swapin without leading to further mTHP + * collapses. Currently bringing in new pages via swapin may + * cause a future higher order collapse on a rescan of the same + * range. + */ + if (!is_pmd_order(order)) { + pte_unmap(pte); + mmap_read_unlock(mm); + result = SCAN_EXCEED_SWAP_PTE; + goto out; + } + vmf.pte = pte; vmf.ptl = ptl; ret = do_swap_page(&vmf); @@ -1159,7 +1178,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a * that case. Continuing to collapse causes inconsistency. 
*/ result = __collapse_huge_page_swapin(mm, vma, address, pmd, - referenced); + referenced, HPAGE_PMD_ORDER); if (result != SCAN_SUCCEED) goto out_nolock; } @@ -1204,6 +1223,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); if (pte) { result = __collapse_huge_page_isolate(vma, address, pte, cc, + HPAGE_PMD_ORDER, &compound_pagelist); spin_unlock(pte_ptl); } else { @@ -1234,6 +1254,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a result = __collapse_huge_page_copy(pte, folio, pmd, _pmd, vma, address, pte_ptl, + HPAGE_PMD_ORDER, &compound_pagelist); pte_unmap(pte); if (unlikely(result != SCAN_SUCCEED)) -- Gitee From ba05e641e9a169ac964876ef3ccdf67371f41111 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:32 -0700 Subject: [PATCH 19/28] khugepaged: introduce collapse_max_ptes_none helper function ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-8-npache@redhat.com/ The current mechanism for determining mTHP collapse scales the khugepaged_max_ptes_none value based on the target order. This introduces an undesirable feedback loop, or "creep", when max_ptes_none is set to a value greater than HPAGE_PMD_NR / 2. With this configuration, a successful collapse to order N will populate enough pages to satisfy the collapse condition on order N+1 on the next scan. This leads to unnecessary work and memory churn. To fix this issue introduce a helper function that will limit mTHP collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1. This effectively supports two modes: - max_ptes_none=0: never introduce new none-pages for mTHP collapse. - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest available mTHP order. This removes the possiblilty of "creep", while not modifying any uAPI expectations. A warning will be emitted if any non-supported max_ptes_none value is configured with mTHP enabled. The limits can be ignored by passing full_scan=true, this is useful for madvise_collapse (which ignores limits), or in the case of collapse_scan_pmd(), allows the full PMD to be scanned when mTHP collapse is available. Reviewed-by: Baolin Wang Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 43 ++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 42 insertions(+), 1 deletion(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 3c599479626f..cb0452813258 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -474,6 +474,44 @@ void __khugepaged_enter(struct mm_struct *mm) wake_up_interruptible(&khugepaged_wait); } +/** + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse + * @order: The folio order being collapsed to + * @full_scan: Whether this is a full scan (ignore limits) + * + * For madvise-triggered collapses (full_scan=true), all limits are bypassed + * and allow up to HPAGE_PMD_NR - 1 empty PTEs. + * + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured + * khugepaged_max_ptes_none value. + * + * For mTHP collapses, we currently only support khugepaged_max_pte_none values + * of 0 or (HPAGE_PMD_NR - 1). 
Any other value will emit a warning and no mTHP + * collapse will be attempted + * + * Return: Maximum number of empty PTEs allowed for the collapse operation + */ +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan) +{ + /* ignore max_ptes_none limits */ + if (full_scan) + return HPAGE_PMD_NR - 1; + + if (is_pmd_order(order)) + return khugepaged_max_ptes_none; + + /* Zero/non-present collapse disabled. */ + if (!khugepaged_max_ptes_none) + return 0; + + if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1) + return (1 << order) - 1; + + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %d\n", + HPAGE_PMD_NR - 1); + return -EINVAL; +} + void khugepaged_enter_vma(struct vm_area_struct *vma, unsigned long vm_flags) { @@ -575,9 +613,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, int none_or_zero = 0, shared = 0, referenced = 0; enum scan_result result = SCAN_FAIL; const unsigned long nr_pages = 1UL << order; - int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order); + int max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged); bool writable = false; + if (max_ptes_none == -EINVAL) + return result; + for (_pte = pte; _pte < pte + nr_pages; _pte++, address += PAGE_SIZE) { pte_t pteval = ptep_get(_pte); -- Gitee From 3f03832361ea31cae328c14cab4c5ae1dd8ed552 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:33 -0700 Subject: [PATCH 20/28] khugepaged: generalize collapse_huge_page for mTHP collapse ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-9-npache@redhat.com/ [backport note] Adapted for older kernel lacking map_anon_folio_pmd_nopf() and newer THP infrastructure. The PMD collapse path uses original mk_huge_pmd(), set_pmd_at(), and update_mmu_cache_pmd() instead. The mTHP collapse logic is retained as set_ptes() and update_mmu_cache_range() are available. Address variables adjusted to match legacy naming (pmd_address, start_addr) and folio APIs aligned with backported context. Pass an order and offset to collapse_huge_page to support collapsing anon memory to arbitrary orders within a PMD. order indicates what mTHP size we are attempting to collapse to, and offset indicates were in the PMD to start the collapse attempt. For non-PMD collapse we must leave the anon VMA write locked until after we collapse the mTHP-- in the PMD case all the pages are isolated, but in the mTHP case this is not true, and we must keep the lock to prevent changes to the VMA from occurring. 
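To make the order/offset wording concrete, a small sketch of the address math (assuming 4K base pages; pmd_address and start_addr follow the naming used in the diff below):

  /* start_addr selects the offset within the PMD range; order selects the size */
  const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK;
  /* e.g. an order-4 (64K) attempt starting 128 PTEs into the PMD:      */
  /*   start_addr      = pmd_address + (128UL << PAGE_SHIFT)            */
  /*   range collapsed = [start_addr, start_addr + (PAGE_SIZE << 4))    */
  /*   i.e. PTEs 128..143 of the 512 covered by this PMD                */

For a PMD-order collapse start_addr equals pmd_address and the range covers the whole PMD.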
Reviewed-by: Baolin Wang Tested-by: Baolin Wang Signed-off-by: Nico Pache --- mm/khugepaged.c | 125 +++++++++++++++++++++++++++++------------------- 1 file changed, 77 insertions(+), 48 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index cb0452813258..28ab15a9ed70 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1164,30 +1164,38 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru return SCAN_SUCCEED; } -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address, - int referenced, int unmapped, struct collapse_control *cc) +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr, + int referenced, int unmapped, struct collapse_control *cc, + bool *mmap_locked, unsigned int order) { LIST_HEAD(compound_pagelist); pmd_t *pmd, _pmd; - pte_t *pte; + pte_t *pte = NULL; pgtable_t pgtable; struct folio *folio; spinlock_t *pmd_ptl, *pte_ptl; enum scan_result result = SCAN_FAIL; struct vm_area_struct *vma; struct mmu_notifier_range range; + bool anon_vma_locked = false; + const unsigned long nr_pages = 1UL << order; + const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK; - VM_BUG_ON(address & ~HPAGE_PMD_MASK); + VM_WARN_ON_ONCE(pmd_address & ~HPAGE_PMD_MASK); /* * Before allocating the hugepage, release the mmap_lock read lock. * The allocation can take potentially a long time if it involves * sync compaction, and we do not need to hold the mmap_lock during * that. We will recheck the vma after taking it again in write mode. + * If collapsing mTHPs we may have already released the read_lock. */ - mmap_read_unlock(mm); + if (*mmap_locked) { + mmap_read_unlock(mm); + *mmap_locked = false; + } - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER); + result = alloc_charge_folio(&folio, mm, cc, order); if (result != SCAN_SUCCEED) goto out_nolock; @@ -1199,14 +1207,15 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a * in case the special case has happened. */ mmap_read_lock(mm); - result = hugepage_vma_revalidate(mm, address, true, &vma, cc, - HPAGE_PMD_ORDER); + *mmap_locked = true; + result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order); if (result != SCAN_SUCCEED) { mmap_read_unlock(mm); + *mmap_locked = false; goto out_nolock; } - result = find_pmd_or_thp_or_none(mm, address, &pmd); + result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd); if (result != SCAN_SUCCEED || is_async_fork_mm(mm)) { mmap_read_unlock(mm); goto out_nolock; @@ -1218,33 +1227,36 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a * released when it fails. So we jump out_nolock directly in * that case. Continuing to collapse causes inconsistency. */ - result = __collapse_huge_page_swapin(mm, vma, address, pmd, - referenced, HPAGE_PMD_ORDER); - if (result != SCAN_SUCCEED) + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd, + referenced, order); + if (result != SCAN_SUCCEED) { + *mmap_locked = false; goto out_nolock; + } } mmap_read_unlock(mm); + *mmap_locked = false; /* * Prevent all access to pagetables with the exception of * gup_fast later handled by the ptep_clear_flush and the VM * handled by the anon_vma lock + PG_lock. 
*/ mmap_write_lock(mm); - result = hugepage_vma_revalidate(mm, address, true, &vma, cc, - HPAGE_PMD_ORDER); + result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order); if (result != SCAN_SUCCEED) goto out_up_write; /* check if the pmd is still valid */ vma_start_write(vma); - result = check_pmd_still_valid(mm, address, pmd); + result = check_pmd_still_valid(mm, pmd_address, pmd); if (result != SCAN_SUCCEED || is_async_fork_mm(mm)) goto out_up_write; anon_vma_lock_write(vma->anon_vma); + anon_vma_locked = true; - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, - address + HPAGE_PMD_SIZE); + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr, + start_addr + (PAGE_SIZE << order)); mmu_notifier_invalidate_range_start(&range); pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ @@ -1256,24 +1268,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a * Parallel fast GUP is fine since fast GUP will back off when * it detects PMD is changed. */ - _pmd = pmdp_collapse_flush(vma, address, pmd); + _pmd = pmdp_collapse_flush(vma, pmd_address, pmd); spin_unlock(pmd_ptl); mmu_notifier_invalidate_range_end(&range); tlb_remove_table_sync_one(); - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl); if (pte) { - result = __collapse_huge_page_isolate(vma, address, pte, cc, - HPAGE_PMD_ORDER, - &compound_pagelist); + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc, + order, &compound_pagelist); spin_unlock(pte_ptl); } else { result = SCAN_PMD_NULL; } if (unlikely(result != SCAN_SUCCEED)) { - if (pte) - pte_unmap(pte); spin_lock(pmd_ptl); BUG_ON(!pmd_none(*pmd)); /* @@ -1283,21 +1292,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a */ pmd_populate(mm, pmd, pmd_pgtable(_pmd)); spin_unlock(pmd_ptl); - anon_vma_unlock_write(vma->anon_vma); goto out_up_write; } /* - * All pages are isolated and locked so anon_vma rmap - * can't run anymore. + * For PMD collapse all pages are isolated and locked so anon_vma + * rmap can't run anymore. For mTHP collapse we must hold the lock */ - anon_vma_unlock_write(vma->anon_vma); + if (is_pmd_order(order)) { + anon_vma_unlock_write(vma->anon_vma); + anon_vma_locked = false; + } result = __collapse_huge_page_copy(pte, folio, pmd, _pmd, - vma, address, pte_ptl, - HPAGE_PMD_ORDER, - &compound_pagelist); - pte_unmap(pte); + vma, start_addr, pte_ptl, + order, &compound_pagelist); if (unlikely(result != SCAN_SUCCEED)) goto out_up_write; @@ -1307,27 +1316,48 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a * write. 
*/ __folio_mark_uptodate(folio); - pgtable = pmd_pgtable(_pmd); - - _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot); - _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma); - - spin_lock(pmd_ptl); - BUG_ON(!pmd_none(*pmd)); - folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); - folio_add_lru_vma(folio, vma); - pgtable_trans_huge_deposit(mm, pmd, pgtable); - set_pmd_at(mm, address, pmd, _pmd); - update_mmu_cache_pmd(vma, address, pmd); - deferred_split_folio(folio, false); + if (is_pmd_order(order)) { /* PMD collapse */ + pgtable = pmd_pgtable(_pmd); + _pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot); + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma); + + spin_lock(pmd_ptl); + BUG_ON(!pmd_none(*pmd)); + folio_add_new_anon_rmap(folio, vma, pmd_address, RMAP_EXCLUSIVE); + folio_add_lru_vma(folio, vma); + pgtable_trans_huge_deposit(mm, pmd, pgtable); + set_pmd_at(mm, pmd_address, pmd, _pmd); + update_mmu_cache_pmd(vma, pmd_address, pmd); + deferred_split_folio(folio, false); + } else { /* mTHP collapse */ + pte_t mthp_pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot); + + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma); + spin_lock(pmd_ptl); + WARN_ON_ONCE(!pmd_none(*pmd)); + folio_ref_add(folio, nr_pages - 1); + folio_add_new_anon_rmap(folio, vma, start_addr, RMAP_EXCLUSIVE); + folio_add_lru_vma(folio, vma); + set_ptes(vma->vm_mm, start_addr, pte, mthp_pte, nr_pages); + update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages); + + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */ + pmd_populate(mm, pmd, pmd_pgtable(_pmd)); + } spin_unlock(pmd_ptl); folio = NULL; result = SCAN_SUCCEED; out_up_write: + if (anon_vma_locked) + anon_vma_unlock_write(vma->anon_vma); + if (pte) + pte_unmap(pte); mmap_write_unlock(mm); + *mmap_locked = false; out_nolock: + WARN_ON_ONCE(*mmap_locked); if (folio) folio_put(folio); trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result); @@ -1495,9 +1525,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm, pte_unmap_unlock(pte, ptl); if (result == SCAN_SUCCEED) { result = collapse_huge_page(mm, address, referenced, - unmapped, cc); - /* collapse_huge_page will return with the mmap_lock released */ - *mmap_locked = false; + unmapped, cc, mmap_locked, + HPAGE_PMD_ORDER); } out: trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced, -- Gitee From 81172a273737c17d248c0305d3797e8c7cc63a01 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:34 -0700 Subject: [PATCH 21/28] khugepaged: skip collapsing mTHP to smaller orders ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-10-npache@redhat.com/ khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in some pages being unmapped. Skip these cases until we have a way to check if its ok to collapse to a smaller mTHP size (like in the case of a partially mapped folio). This patch is inspired by Dev Jain's work on khugepaged mTHP support [1]. 
[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/ Reviewed-by: Lorenzo Stoakes Reviewed-by: Baolin Wang Acked-by: David Hildenbrand Co-developed-by: Dev Jain Signed-off-by: Dev Jain Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 28ab15a9ed70..be3255c6dfa5 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -667,6 +667,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, goto out; } } + /* + * TODO: In some cases of partially-mapped folios, we'd actually + * want to collapse. + */ + if (!is_pmd_order(order) && folio_order(folio) >= order) { + result = SCAN_PTE_MAPPED_HUGEPAGE; + goto out; + } if (folio_test_large(folio)) { struct folio *f; -- Gitee From 98e918cf3c8b763a14bd4fb5ecc00f4dbe45b417 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:35 -0700 Subject: [PATCH 22/28] khugepaged: add per-order mTHP collapse failure statistics ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-11-npache@redhat.com/ Add three new mTHP statistics to track collapse failures for different orders when encountering swap PTEs, excessive none PTEs, and shared PTEs: - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap PTEs - collapse_exceed_none_pte: Counts when mTHP collapse fails due to exceeding the none PTE threshold for the given order - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared PTEs These statistics complement the existing THP_SCAN_EXCEED_* events by providing per-order granularity for mTHP collapse attempts. The stats are exposed via sysfs under `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each supported hugepage size. As we currently dont support collapsing mTHPs that contain a swap or shared entry, those statistics keep track of how often we are encountering failed mTHP collapses due to these restrictions. Reviewed-by: Baolin Wang Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++ include/linux/huge_mm.h | 3 +++ mm/huge_memory.c | 6 ++++++ mm/khugepaged.c | 16 ++++++++++++--- 4 files changed, 46 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 50e3b1fbca89..e07decb28304 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -683,6 +683,30 @@ nr_anon_partially_mapped an anonymous THP as "partially mapped" and count it here, even though it is not actually partially mapped anymore. +collapse_exceed_none_pte + The number of collapse attempts that failed due to exceeding the + max_ptes_none threshold. For mTHP collapse, Currently only max_ptes_none + values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will + emit a warning and no mTHP collapse will be attempted. khugepaged will + try to collapse to the largest enabled (m)THP size; if it fails, it will + try the next lower enabled mTHP size. This counter records the number of + times a collapse attempt was skipped for exceeding the max_ptes_none + threshold, and khugepaged will move on to the next available mTHP size. + +collapse_exceed_swap_pte + The number of anonymous mTHP PTE ranges which were unable to collapse due + to containing at least one swap PTE. 
Currently khugepaged does not + support collapsing mTHP regions that contain a swap PTE. This counter can + be used to monitor the number of khugepaged mTHP collapses that failed + due to the presence of a swap PTE. + +collapse_exceed_shared_pte + The number of anonymous mTHP PTE ranges which were unable to collapse due + to containing at least one shared PTE. Currently khugepaged does not + support collapsing mTHP PTE ranges that contain a shared PTE. This + counter can be used to monitor the number of khugepaged mTHP collapses + that failed due to the presence of a shared PTE. + file_alloc is incremented every time a file huge page is successfully allocated. diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index b6898110b3d9..4ea638655124 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -136,6 +136,9 @@ enum mthp_stat_item { MTHP_STAT_SPLIT_DEFERRED, MTHP_STAT_NR_ANON, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, + MTHP_STAT_COLLAPSE_EXCEED_SWAP, + MTHP_STAT_COLLAPSE_EXCEED_NONE, + MTHP_STAT_COLLAPSE_EXCEED_SHARED, MTHP_STAT_FILE_ALLOC, MTHP_STAT_FILE_FALLBACK, __MTHP_STAT_COUNT diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5775d1b3a802..f40d8d5da2aa 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -668,6 +668,9 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED); DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED); DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON); DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED); +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP); +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE); +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED); DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC); DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK); @@ -682,6 +685,9 @@ static struct attribute *anon_stats_attrs[] = { &split_deferred_attr.attr, &nr_anon_attr.attr, &nr_anon_partially_mapped_attr.attr, + &collapse_exceed_swap_pte_attr.attr, + &collapse_exceed_none_pte_attr.attr, + &collapse_exceed_shared_pte_attr.attr, NULL, }; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index be3255c6dfa5..3c053d378378 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -631,7 +631,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, continue; } else { result = SCAN_EXCEED_NONE_PTE; - count_vm_event(THP_SCAN_EXCEED_NONE_PTE); + if (is_pmd_order(order)) + count_vm_event(THP_SCAN_EXCEED_NONE_PTE); + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE); goto out; } } @@ -660,10 +662,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, * shared may cause a future higher order collapse on a * rescan of the same range. */ - if (!is_pmd_order(order) || (cc->is_khugepaged && - shared > khugepaged_max_ptes_shared)) { + if (!is_pmd_order(order)) { + result = SCAN_EXCEED_SHARED_PTE; + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED); + goto out; + } + + if (cc->is_khugepaged && + shared > khugepaged_max_ptes_shared) { result = SCAN_EXCEED_SHARED_PTE; count_vm_event(THP_SCAN_EXCEED_SHARED_PTE); + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED); goto out; } } @@ -1095,6 +1104,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm, * range. 
*/ if (!is_pmd_order(order)) { + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP); pte_unmap(pte); mmap_read_unlock(mm); result = SCAN_EXCEED_SWAP_PTE; -- Gitee From 9c136f758c204696bf6be7de9d3efa22487329fd Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:36 -0700 Subject: [PATCH 23/28] khugepaged: improve tracepoints for mTHP orders ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-12-npache@redhat.com/ Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to give better insight into what order is being operated at for. Acked-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Reviewed-by: Baolin Wang Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- include/trace/events/huge_memory.h | 34 +++++++++++++++++++----------- mm/khugepaged.c | 9 ++++---- 2 files changed, 27 insertions(+), 16 deletions(-) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index 5dc636add309..db9179cbbd37 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -93,34 +93,37 @@ TRACE_EVENT(mm_khugepaged_scan_pmd, TRACE_EVENT(mm_collapse_huge_page, - TP_PROTO(struct mm_struct *mm, int isolated, int status), + TP_PROTO(struct mm_struct *mm, int isolated, int status, unsigned int order), - TP_ARGS(mm, isolated, status), + TP_ARGS(mm, isolated, status, order), TP_STRUCT__entry( __field(struct mm_struct *, mm) __field(int, isolated) __field(int, status) + __field(unsigned int, order) ), TP_fast_assign( __entry->mm = mm; __entry->isolated = isolated; __entry->status = status; + __entry->order = order; ), - TP_printk("mm=%p, isolated=%d, status=%s", + TP_printk("mm=%p, isolated=%d, status=%s order=%u", __entry->mm, __entry->isolated, - __print_symbolic(__entry->status, SCAN_STATUS)) + __print_symbolic(__entry->status, SCAN_STATUS), + __entry->order) ); TRACE_EVENT(mm_collapse_huge_page_isolate, TP_PROTO(struct page *page, int none_or_zero, - int referenced, bool writable, int status), + int referenced, bool writable, int status, unsigned int order), - TP_ARGS(page, none_or_zero, referenced, writable, status), + TP_ARGS(page, none_or_zero, referenced, writable, status, order), TP_STRUCT__entry( __field(unsigned long, pfn) @@ -128,6 +131,7 @@ TRACE_EVENT(mm_collapse_huge_page_isolate, __field(int, referenced) __field(bool, writable) __field(int, status) + __field(unsigned int, order) ), TP_fast_assign( @@ -136,27 +140,31 @@ TRACE_EVENT(mm_collapse_huge_page_isolate, __entry->referenced = referenced; __entry->writable = writable; __entry->status = status; + __entry->order = order; ), - TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s", + TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s, order=%u", __entry->pfn, __entry->none_or_zero, __entry->referenced, __entry->writable, - __print_symbolic(__entry->status, SCAN_STATUS)) + __print_symbolic(__entry->status, SCAN_STATUS), + __entry->order) ); TRACE_EVENT(mm_collapse_huge_page_swapin, - TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret), + TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret, + unsigned int order), - TP_ARGS(mm, swapped_in, referenced, ret), + TP_ARGS(mm, swapped_in, referenced, ret, order), TP_STRUCT__entry( __field(struct mm_struct *, mm) __field(int, swapped_in) __field(int, referenced) __field(int, ret) + __field(unsigned int, order) ), TP_fast_assign( @@ -164,13 +172,15 @@ 
TRACE_EVENT(mm_collapse_huge_page_swapin, __entry->swapped_in = swapped_in; __entry->referenced = referenced; __entry->ret = ret; + __entry->order = order; ), - TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d", + TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%u", __entry->mm, __entry->swapped_in, __entry->referenced, - __entry->ret) + __entry->ret, + __entry->order) ); TRACE_EVENT(mm_khugepaged_scan_file, diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 3c053d378378..0f9f1daa887e 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -765,13 +765,13 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma, } else { result = SCAN_SUCCEED; trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero, - referenced, writable, result); + referenced, writable, result, order); return result; } out: release_pte_pages(pte, _pte, compound_pagelist); trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero, - referenced, writable, result); + referenced, writable, result, order); return result; } @@ -1145,7 +1145,8 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm, result = SCAN_SUCCEED; out: - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result); + trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result, + order); return result; } @@ -1378,7 +1379,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s WARN_ON_ONCE(*mmap_locked); if (folio) folio_put(folio); - trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result); + trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order); return result; } -- Gitee From 60ac21fc2bc048ed5c23a3e591edf6e356d4dcbe Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:37 -0700 Subject: [PATCH 24/28] khugepaged: introduce collapse_allowable_orders helper function ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-13-npache@redhat.com/ Add collapse_allowable_orders() to generalize THP order eligibility. The function determines which THP orders are permitted based on collapse context (khugepaged vs madv_collapse). This consolidates collapse configuration logic and provides a clean interface for future mTHP collapse support where the orders may be different. Reviewed-by: Baolin Wang Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 0f9f1daa887e..49999861dba3 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -512,12 +512,22 @@ static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan) return -EINVAL; } +/* Check what orders are allowed based on the vma and collapse type */ +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma, + vm_flags_t vm_flags, bool is_khugepaged) +{ + enum tva_type tva_flags = is_khugepaged ? 
TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE; + unsigned long orders = BIT(HPAGE_PMD_ORDER); + + return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders); +} + void khugepaged_enter_vma(struct vm_area_struct *vma, unsigned long vm_flags) { if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) && hugepage_pmd_enabled()) { - if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) + if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true)) __khugepaged_enter(vma->vm_mm); } } @@ -2596,7 +2606,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result * progress++; break; } - if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) { + if (!collapse_allowable_orders(vma, vma->vm_flags, /*is_khugepaged=*/true)) { progress++; continue; } @@ -2930,7 +2940,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, *prev = vma; - if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER)) + if (!collapse_allowable_orders(vma, vma->vm_flags, /*is_khugepaged=*/false)) return -EINVAL; cc = kmalloc(sizeof(*cc), GFP_KERNEL); -- Gitee From 4936e0ff156ff959c209adeccdd0f8c8fae23ecf Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:38 -0700 Subject: [PATCH 25/28] khugepaged: Introduce mTHP collapse support ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-14-npache@redhat.com/ Enable khugepaged to collapse to mTHP orders. This patch implements the main scanning logic using a bitmap to track occupied pages and a stack structure that allows us to find optimal collapse sizes. Prior to this patch, PMD collapse had three main phases: a lightweight scanning phase (mmap_read_lock) that determines a potential PMD collapse, an allocation phase (mmap unlocked), and finally a heavier collapse phase (mmap_write_lock). To enable mTHP collapse we make the following changes: During the PMD scan phase, track occupied pages in a bitmap. When mTHP orders are enabled, we remove the restriction of max_ptes_none during the scan phase to avoid missing potential mTHP collapse candidates. Once we have scanned the full PMD range and updated the bitmap to track occupied pages, we use the bitmap to find the optimal mTHP size. Implement mthp_collapse() to perform binary recursion on the bitmap and determine the best eligible order for the collapse. A stack structure is used instead of traditional recursion to manage the search. The algorithm recursively splits the bitmap into smaller chunks to find the highest order mTHPs that satisfy the collapse criteria. We start by attempting the PMD order, then move on to consecutively lower orders (mTHP collapse). The stack maintains a pair of variables (offset, order), indicating the number of PTEs from the start of the PMD, and the order of the potential collapse candidate. The algorithm for consuming the bitmap works as follows (a standalone sketch is given right after this list): 1) push (0, HPAGE_PMD_ORDER) onto the stack 2) pop the stack 3) check if the number of set bits in that (offset, order) pair satisfies the max_ptes_none threshold for that order 4) if yes, attempt collapse 5) if no (or collapse fails), push two new stack items representing the left and right halves of the current bitmap range, at the next lower order 6) repeat at step (2) until stack is empty.
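For illustration, the walk over the bitmap can be written as a short standalone userspace program (a simplified sketch only: it hardcodes a 512-entry PMD, skips the enabled-orders check, and pretends every eligible collapse succeeds -- none of the names below are the kernel's):

#include <stdbool.h>
#include <stdio.h>

#define PMD_ORDER      9                 /* 512 PTEs per PMD with 4K pages */
#define NR_PTES        (1 << PMD_ORDER)
#define MIN_MTHP_ORDER 2                 /* lowest anon mTHP order */

struct range { int offset, order; };

/* Count occupied (present) PTEs in [offset, offset + (1 << order)). */
static int occupied(const bool *bitmap, int offset, int order)
{
	int i, n = 0;

	for (i = 0; i < (1 << order); i++)
		n += bitmap[offset + i];
	return n;
}

/* Pretend-collapse: always succeeds once the caller found the range eligible. */
static bool try_collapse(int offset, int order)
{
	printf("collapse order %d at offset %d\n", order, offset);
	return true;
}

static void walk(const bool *bitmap, int max_ptes_none)
{
	struct range stack[1 << (PMD_ORDER - MIN_MTHP_ORDER)];
	int top = 0;

	stack[top++] = (struct range){ 0, PMD_ORDER };
	while (top > 0) {
		struct range r = stack[--top];
		int nr = 1 << r.order;

		if (occupied(bitmap, r.offset, r.order) >= nr - max_ptes_none &&
		    try_collapse(r.offset, r.order))
			continue;
		if (r.order > MIN_MTHP_ORDER) {
			/* Push the right half first so the left half is tried first. */
			stack[top++] = (struct range){ r.offset + nr / 2, r.order - 1 };
			stack[top++] = (struct range){ r.offset, r.order - 1 };
		}
	}
}

int main(void)
{
	bool bitmap[NR_PTES] = { false };
	int i;

	for (i = 0; i < 64; i++)	/* first 64 PTEs are populated */
		bitmap[i] = true;
	walk(bitmap, 0);		/* max_ptes_none == 0: never add empty pages */
	return 0;
}

With the first 64 PTEs populated and max_ptes_none == 0, the sketch collapses a single order-6 range at offset 0 and leaves the empty remainder of the PMD alone, which mirrors the behavior described above.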
Below is a diagram representing the algorithm and stack items: offset mid_offset | | | | v v ____________________________________ | PTE Page Table | -------------------------------------- <-------><-------> order-1 order-1 We currently only support mTHP collapse for max_ptes_none values of 0 and HPAGE_PMD_NR - 1, resulting in the following behavior: - max_ptes_none=0: Never introduce new empty pages during collapse - max_ptes_none=HPAGE_PMD_NR-1: Always try to collapse to the highest available mTHP order Any other max_ptes_none value will emit a warning and skip mTHP collapse attempts. There should be no behavior change for PMD collapse. Once we determine which mTHP sizes fit best in that PMD range, a collapse is attempted. A minimum collapse order of 2 is used as this is the lowest order supported by anon memory as defined by THP_ORDERS_ALL_ANON. mTHP collapses reject regions containing swapped out or shared pages. This is because adding new entries can lead to new none pages, and these may lead to constant promotion into a higher order (m)THP. A similar issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" since a collapse introduces at least 2x the number of present pages, which on a future scan would satisfy the promotion condition once again. This issue is prevented via the collapse_max_ptes_none() function which imposes the max_ptes_none restrictions above. Currently madv_collapse is not supported and will only attempt PMD collapse. We can also remove the check for is_khugepaged inside the PMD scan as the collapse_max_ptes_none() function handles this logic now. Reviewed-by: Baolin Wang Tested-by: Baolin Wang Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 184 +++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 176 insertions(+), 8 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 49999861dba3..f45908c263c7 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -96,6 +96,32 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); static struct kmem_cache *mm_slot_cache __read_mostly; +#define KHUGEPAGED_MIN_MTHP_ORDER 2 +/* + * The maximum number of mTHP ranges that can be stored on the stack. + * This is calculated based on the number of PTE entries in a PTE page table + * and the minimum mTHP order. + * + * ilog2(MAX_PTRS_PER_PTE) is log2 of the maximum number of PTE entries. + * This gives you the PMD_ORDER, and is needed in place of HPAGE_PMD_ORDER due + * to restrictions of some architectures (ie ppc64le). + * + * At most there will be 1 << (PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER) mTHP ranges + */ +#define MTHP_STACK_SIZE (1UL << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER)) + +/* + * Defines a range of PTE entries in a PTE page table which are being + * considered for (m)THP collapse. + * + * @offset: the offset of the first PTE entry in a PMD range. + * @order: the order of the PTE entries being considered for collapse.
+ */ +struct mthp_range { + u16 offset; + u8 order; +}; + struct collapse_control { bool is_khugepaged; @@ -104,6 +130,11 @@ struct collapse_control { /* nodemask for allocation fallback */ nodemask_t alloc_nmask; + + /* bitmap used for mTHP collapse */ + DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE); + DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE); + struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE]; }; /** @@ -1393,6 +1424,121 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s return result; } +static void mthp_stack_push(struct collapse_control *cc, int *stack_size, + u16 offset, u8 order) +{ + const int size = *stack_size; + struct mthp_range *stack = &cc->mthp_bitmap_stack[size]; + + VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE); + stack->order = order; + stack->offset = offset; + (*stack_size)++; +} + +static struct mthp_range mthp_stack_pop(struct collapse_control *cc, int *stack_size) +{ + const int size = *stack_size; + + VM_WARN_ON_ONCE(size <= 0); + (*stack_size)--; + return cc->mthp_bitmap_stack[size - 1]; +} + +static unsigned int mthp_nr_occupied_pte_entries(struct collapse_control *cc, + u16 offset, unsigned long nr_pte_entries) +{ + bitmap_zero(cc->mthp_bitmap_mask, HPAGE_PMD_NR); + bitmap_set(cc->mthp_bitmap_mask, offset, nr_pte_entries); + return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, HPAGE_PMD_NR); +} + +/* + * mthp_collapse() consumes the bitmap that is generated during + * collapse_scan_pmd() to determine what regions and mTHP orders fit best. + * + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page. + * A stack structure cc->mthp_bitmap_stack is used to check different regions + * of the bitmap for collapse eligibility. The stack maintains a pair of + * variables (offset, order), indicating the number of PTEs from the start of + * the PMD, and the order of the potential collapse candidate respectively. We + * start at the PMD order and check if it is eligible for collapse; if not, we + * add two entries to the stack at a lower order to represent the left and right + * halves of the PTE page table we are examining. + * + * offset mid_offset + * | | + * | | + * v v + * -------------------------------------- + * | cc->mthp_bitmap | + * -------------------------------------- + * <-------><-------> + * order-1 order-1 + * + * For each of these, we determine how many PTE entries are occupied in the + * range of PTE entries we propose to collapse, then we compare this to a + * threshold number of PTE entries which would need to be occupied for a + * collapse to be permitted at that order (accounting for max_ptes_none). + + * If a collapse is permitted, we attempt to collapse the PTE range into a + * mTHP. 
+ */ +static int mthp_collapse(struct mm_struct *mm, unsigned long address, + int referenced, int unmapped, struct collapse_control *cc, + bool *mmap_locked, unsigned long enabled_orders) +{ + unsigned int max_ptes_none, nr_occupied_ptes; + struct mthp_range range; + unsigned long collapse_address; + int collapsed = 0, stack_size = 0; + unsigned long nr_pte_entries; + u16 offset; + u8 order; + + mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER); + + while (stack_size > 0) { + range = mthp_stack_pop(cc, &stack_size); + order = range.order; + offset = range.offset; + nr_pte_entries = 1UL << order; + + if (!test_bit(order, &enabled_orders)) + goto next_order; + + max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged); + + if (max_ptes_none == -EINVAL) + return collapsed; + + nr_occupied_ptes = mthp_nr_occupied_pte_entries(cc, offset, nr_pte_entries); + + if (nr_occupied_ptes >= nr_pte_entries - max_ptes_none) { + int ret; + + collapse_address = address + offset * PAGE_SIZE; + ret = collapse_huge_page(mm, collapse_address, referenced, + unmapped, cc, mmap_locked, + order); + if (ret == SCAN_SUCCEED) { + collapsed += nr_pte_entries; + continue; + } + } + +next_order: + if (order > KHUGEPAGED_MIN_MTHP_ORDER) { + const u8 next_order = order - 1; + const u16 mid_offset = offset + (nr_pte_entries / 2); + + mthp_stack_push(cc, &stack_size, mid_offset, next_order); + mthp_stack_push(cc, &stack_size, offset, next_order); + } + } + return collapsed; +} + static enum scan_result collapse_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, bool *mmap_locked, @@ -1400,11 +1546,15 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm, { pmd_t *pmd; pte_t *pte, *_pte; - int none_or_zero = 0, shared = 0, referenced = 0; + int i; + int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0; enum scan_result result = SCAN_FAIL; struct page *page = NULL; + unsigned int max_ptes_none; struct folio *folio = NULL; unsigned long _address; + unsigned long enabled_orders; + bool full_scan = true; spinlock_t *ptl; int node = NUMA_NO_NODE, unmapped = 0; bool writable = false; @@ -1415,16 +1565,29 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm, if (result != SCAN_SUCCEED) goto out; + bitmap_zero(cc->mthp_bitmap, HPAGE_PMD_NR); memset(cc->node_load, 0, sizeof(cc->node_load)); nodes_clear(cc->alloc_nmask); + + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, cc->is_khugepaged); + + /* + * If PMD is the only enabled order, enforce max_ptes_none, otherwise + * scan all pages to populate the bitmap for mTHP collapse. 
+ */ + if (cc->is_khugepaged && enabled_orders == BIT(HPAGE_PMD_ORDER)) + full_scan = false; + max_ptes_none = collapse_max_ptes_none(HPAGE_PMD_ORDER, full_scan); + pte = pte_offset_map_lock(mm, pmd, address, &ptl); if (!pte) { result = SCAN_PMD_NULL; goto out; } - for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR; - _pte++, _address += PAGE_SIZE) { + for (i = 0; i < HPAGE_PMD_NR; i++) { + _pte = pte + i; + _address = address + i * PAGE_SIZE; pte_t pteval = ptep_get(_pte); if (is_swap_pte(pteval)) { ++unmapped; @@ -1449,8 +1612,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm, if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { ++none_or_zero; if (!userfaultfd_armed(vma) && - (!cc->is_khugepaged || - none_or_zero <= khugepaged_max_ptes_none)) { + none_or_zero <= max_ptes_none) { continue; } else { result = SCAN_EXCEED_NONE_PTE; @@ -1491,6 +1653,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm, } folio = page_folio(page); + + /* Set bit for occupied pages */ + bitmap_set(cc->mthp_bitmap, i, 1); /* * Record which node the original page is from and save this * information to cc->node_load[]. @@ -1553,9 +1718,12 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm, out_unmap: pte_unmap_unlock(pte, ptl); if (result == SCAN_SUCCEED) { - result = collapse_huge_page(mm, address, referenced, - unmapped, cc, mmap_locked, - HPAGE_PMD_ORDER); + nr_collapsed = mthp_collapse(mm, address, referenced, unmapped, + cc, mmap_locked, enabled_orders); + if (nr_collapsed > 0) + result = SCAN_SUCCEED; + else + result = SCAN_FAIL; } out: trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced, -- Gitee From 64643e3caaa8e88e707c5ff9664b9102758fcfd7 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:39 -0700 Subject: [PATCH 26/28] khugepaged: avoid unnecessary mTHP collapse attempts ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-15-npache@redhat.com/ [backport note] Drop 'case SCAN_NO_PTE_TABLE' as we don't have. There are cases where, if an attempted collapse fails, all subsequent orders are guaranteed to also fail. Avoid these collapse attempts by bailing out early. 
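The idea can be pictured with a tiny standalone sketch (illustrative only; the toy result codes below merely stand in for the kernel's scan_result values -- the authoritative mapping is the switch statement added by the diff below):

/*
 * Toy illustration of the early bail-out: some failures only disqualify
 * the current order (retry lower orders), others disqualify the whole
 * PMD range (stop immediately). Toy result codes, not the kernel enum.
 */
#include <stdio.h>

enum toy_result { TOY_COLLAPSED, TOY_RANGE_UNSUITABLE, TOY_MM_GONE };

static const char *next_step(enum toy_result r)
{
	switch (r) {
	case TOY_COLLAPSED:        /* done here, move on to the next range */
		return "next range";
	case TOY_RANGE_UNSUITABLE: /* e.g. too many empty/shared PTEs here */
		return "retry at a lower order";
	case TOY_MM_GONE:          /* e.g. the vma/mm went away: nothing left to try */
	default:
		return "give up on this PMD";
	}
}

int main(void)
{
	printf("%s\n", next_step(TOY_COLLAPSED));
	printf("%s\n", next_step(TOY_RANGE_UNSUITABLE));
	printf("%s\n", next_step(TOY_MM_GONE));
	return 0;
}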
Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 34 +++++++++++++++++++++++++++++++++- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index f45908c263c7..3b5426f62fc8 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1521,9 +1521,41 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address, ret = collapse_huge_page(mm, collapse_address, referenced, unmapped, cc, mmap_locked, order); - if (ret == SCAN_SUCCEED) { + + switch (ret) { + /* Cases were we continue to next collapse candidate */ + case SCAN_SUCCEED: collapsed += nr_pte_entries; + fallthrough; + case SCAN_PTE_MAPPED_HUGEPAGE: continue; + /* Cases were lower orders might still succeed */ + case SCAN_LACK_REFERENCED_PAGE: + case SCAN_EXCEED_NONE_PTE: + case SCAN_EXCEED_SWAP_PTE: + case SCAN_EXCEED_SHARED_PTE: + case SCAN_PAGE_LOCK: + case SCAN_PAGE_COUNT: + case SCAN_PAGE_LRU: + case SCAN_PAGE_NULL: + case SCAN_DEL_PAGE_LRU: + case SCAN_PTE_NON_PRESENT: + case SCAN_PTE_UFFD_WP: + case SCAN_ALLOC_HUGE_PAGE_FAIL: + goto next_order; + /* Cases were no further collapse is possible */ + case SCAN_CGROUP_CHARGE_FAIL: + case SCAN_COPY_MC: + case SCAN_ADDRESS_RANGE: + case SCAN_ANY_PROCESS: + case SCAN_VMA_NULL: + case SCAN_VMA_CHECK: + case SCAN_SCAN_ABORT: + case SCAN_PAGE_ANON: + case SCAN_PMD_MAPPED: + case SCAN_FAIL: + default: + return collapsed; } } -- Gitee From f28599978e34b1c5e6c0e4620c4dd07feb6066a4 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Thu, 22 Jan 2026 12:28:40 -0700 Subject: [PATCH 27/28] khugepaged: run khugepaged for all orders ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-16-npache@redhat.com/ If any order (m)THP is enabled we should allow running khugepaged to attempt scanning and collapsing mTHPs. In order for khugepaged to operate when only mTHP sizes are specified in sysfs, we must modify the predicate function that determines whether it ought to run to do so. This function is currently called hugepage_pmd_enabled(), this patch renames it to hugepage_enabled() and updates the logic to check to determine whether any valid orders may exist which would justify khugepaged running. We must also update collapse_allowable_orders() to check all orders if the vma is anonymous and the collapse is khugepaged. After this patch khugepaged mTHP collapse is fully enabled. Signed-off-by: Baolin Wang Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- mm/khugepaged.c | 30 ++++++++++++++++++------------ 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 3b5426f62fc8..c2811f2fe84d 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -443,23 +443,23 @@ static inline int collapse_test_exit(struct mm_struct *mm) return atomic_read(&mm->mm_users) == 0; } -static bool hugepage_pmd_enabled(void) +static bool hugepage_enabled(void) { /* * We cover the anon, shmem and the file-backed case here; file-backed * hugepages, when configured in, are determined by the global control. - * Anon pmd-sized hugepages are determined by the pmd-size control. + * Anon hugepages are determined by its per-size mTHP control. * Shmem pmd-sized hugepages are also determined by its pmd-size control, * except when the global shmem_huge is set to SHMEM_HUGE_DENY. 
*/ if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && hugepage_global_enabled()) return true; - if (test_bit(PMD_ORDER, &huge_anon_orders_always)) + if (READ_ONCE(huge_anon_orders_always)) return true; - if (test_bit(PMD_ORDER, &huge_anon_orders_madvise)) + if (READ_ONCE(huge_anon_orders_madvise)) return true; - if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) && + if (READ_ONCE(huge_anon_orders_inherit) && hugepage_global_enabled()) return true; if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled()) @@ -547,8 +547,14 @@ static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan) static unsigned long collapse_allowable_orders(struct vm_area_struct *vma, vm_flags_t vm_flags, bool is_khugepaged) { + unsigned long orders; enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE; - unsigned long orders = BIT(HPAGE_PMD_ORDER); + + /* If khugepaged is scanning an anonymous vma, allow mTHP collapse */ + if (is_khugepaged && vma_is_anonymous(vma)) + orders = THP_ORDERS_ALL_ANON; + else + orders = BIT(HPAGE_PMD_ORDER); return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders); } @@ -557,7 +563,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, unsigned long vm_flags) { if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) && - hugepage_pmd_enabled()) { + hugepage_enabled()) { if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true)) __khugepaged_enter(vma->vm_mm); } @@ -2885,7 +2891,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result * static int khugepaged_has_work(void) { - return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled(); + return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled(); } static int khugepaged_wait_event(void) @@ -2958,7 +2964,7 @@ static void khugepaged_wait_work(void) return; } - if (hugepage_pmd_enabled()) + if (hugepage_enabled()) wait_event_freezable(khugepaged_wait, khugepaged_wait_event()); } @@ -3008,7 +3014,7 @@ static void set_recommended_min_free_kbytes(void) int nr_zones = 0; unsigned long recommended_min; - if (!hugepage_pmd_enabled()) { + if (!hugepage_enabled()) { calculate_min_free_kbytes(); goto update_wmarks; } @@ -3058,7 +3064,7 @@ int start_stop_khugepaged(void) int err = 0; mutex_lock(&khugepaged_mutex); - if (hugepage_pmd_enabled()) { + if (hugepage_enabled()) { if (!khugepaged_thread) khugepaged_thread = kthread_run(khugepaged, NULL, "khugepaged"); @@ -3084,7 +3090,7 @@ int start_stop_khugepaged(void) void khugepaged_min_free_kbytes_update(void) { mutex_lock(&khugepaged_mutex); - if (hugepage_pmd_enabled() && khugepaged_thread) + if (hugepage_enabled() && khugepaged_thread) set_recommended_min_free_kbytes(); mutex_unlock(&khugepaged_mutex); } -- Gitee From c76b471c9a9eb7370fe416c05129db03abeb1e18 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Thu, 22 Jan 2026 12:28:41 -0700 Subject: [PATCH 28/28] Documentation: mm: update the admin guide for mTHP collapse ANBZ: #30191 cherry-picked from https://lore.kernel.org/all/20260122192841.128719-17-npache@redhat.com/ Now that we can collapse to mTHPs lets update the admin guide to reflect these changes and provide proper guidance on how to utilize it. 
Reviewed-by: Bagas Sanjaya Signed-off-by: Nico Pache Signed-off-by: Yuanhe Shu --- Documentation/admin-guide/mm/transhuge.rst | 48 +++++++++++++--------- 1 file changed, 28 insertions(+), 20 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index e07decb28304..993d4b1e5aa6 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -63,7 +63,8 @@ often. THP can be enabled system wide or restricted to certain tasks or even memory ranges inside task's address space. Unless THP is completely disabled, there is ``khugepaged`` daemon that scans memory and -collapses sequences of basic pages into PMD-sized huge pages. +collapses sequences of basic pages into huge pages of either PMD size +or mTHP sizes, if the system is configured to do so. The THP behaviour is controlled via :ref:`sysfs ` interface and using madvise(2) and prctl(2) system calls. @@ -212,20 +213,15 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused -khugepaged will be automatically started when PMD-sized THP is enabled +khugepaged will be automatically started when any THP size is enabled (either of the per-size anon control or the top-level control are set to "always" or "madvise"), and it'll be automatically shutdown when -PMD-sized THP is disabled (when both the per-size anon control and the +all THP sizes are disabled (when both the per-size anon control and the top-level control are "never") Khugepaged controls ------------------- -.. note:: - khugepaged currently only searches for opportunities to collapse to - PMD-sized THP and no attempt is made to collapse to other THP - sizes. - khugepaged runs usually at low frequency so while one may not want to invoke defrag algorithms synchronously during the page faults, it should be worth invoking defrag at least in khugepaged. However it's @@ -253,11 +249,11 @@ allocation failure to throttle the next allocation attempt:: The khugepaged progress can be seen in the number of pages collapsed (note that this counter may not be an exact count of the number of pages collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping -being replaced by a PMD mapping, or (2) All 4K physical pages replaced by -one 2M hugepage. Each may happen independently, or together, depending on -the type of memory and the failures that occur. As such, this value should -be interpreted roughly as a sign of progress, and counters in /proc/vmstat -consulted for more accurate accounting):: +being replaced by a PMD mapping, or (2) physical pages replaced by one +hugepage of various sizes (PMD-sized or mTHP). Each may happen independently, +or together, depending on the type of memory and the failures that occur. 
+As such, this value should be interpreted roughly as a sign of progress, +and counters in /proc/vmstat consulted for more accurate accounting):: /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed @@ -265,16 +261,19 @@ for each pass:: /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans -``max_ptes_none`` specifies how many extra small pages (that are -not already mapped) can be allocated when collapsing a group -of small pages into one large page:: +``max_ptes_none`` specifies how many empty (none/zero) pages are allowed +when collapsing a group of small pages into one large page:: /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none -A higher value leads to use additional memory for programs. -A lower value leads to gain less thp performance. Value of -max_ptes_none can waste cpu time very little, you can -ignore it. +For PMD-sized THP collapse, this directly limits the number of empty pages +allowed in the 2MB region. For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) +are supported. Any other value will emit a warning and no mTHP collapse +will be attempted. + +A higher value allows more empty pages, potentially leading to more memory +usage but better THP performance. A lower value is more conservative and +may result in fewer THP collapses. ``max_ptes_swap`` specifies how many pages can be brought in from swap when collapsing a group of pages into a transparent huge page:: @@ -293,6 +292,15 @@ processes. Exceeding the number would block the collapse:: A higher value may increase memory footprint for some workloads. +.. note:: + For mTHP collapse, khugepaged does not support collapsing regions that + contain shared or swapped out pages, as this could lead to continuous + promotion to higher orders. The collapse will fail if any shared or + swapped PTEs are encountered during the scan. + + Currently, madvise_collapse only supports collapsing to PMD-sized THPs + and does not attempt mTHP collapses. + File-Backed Hugepages --------------------- -- Gitee
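Once an mTHP size is enabled, the collapse_exceed_* counters added earlier in this series should appear under each per-size stats directory in sysfs, next to the existing anon stats. A minimal reader is sketched below (the hugepages-64kB path is just an example -- the available sizes depend on architecture and configuration):

/*
 * Minimal sketch: dump the khugepaged mTHP collapse counters for one
 * example mTHP size. Adjust hugepages-64kB to a size enabled on your
 * system; the path layout is assumed, not guaranteed here.
 */
#include <stdio.h>

int main(void)
{
	static const char *counters[] = {
		"collapse_exceed_none_pte",
		"collapse_exceed_swap_pte",
		"collapse_exceed_shared_pte",
	};
	char path[256], buf[64];
	size_t i;

	for (i = 0; i < sizeof(counters) / sizeof(counters[0]); i++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/%s",
			 counters[i]);
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			continue;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("%s: %s", counters[i], buf);
		fclose(f);
	}
	return 0;
}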