All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Zach O'Keefe" <zokeefe@google.com>
To: Alex Shi <alex.shi@linux.alibaba.com>,
	David Hildenbrand <david@redhat.com>,
	 David Rientjes <rientjes@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Michal Hocko <mhocko@suse.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	 SeongJae Park <sj@kernel.org>, Song Liu <songliubraving@fb.com>,
	Vlastimil Babka <vbabka@suse.cz>,  Yang Shi <shy828301@gmail.com>,
	Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Arnd Bergmann <arnd@arndb.de>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 Chris Kennelly <ckennelly@google.com>,
	Chris Zankel <chris@zankel.net>, Helge Deller <deller@gmx.de>,
	 Hugh Dickins <hughd@google.com>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	 "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>,
	Jens Axboe <axboe@kernel.dk>,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matt Turner <mattst88@gmail.com>,
	 Max Filippov <jcmvbkbc@gmail.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	 Minchan Kim <minchan@kernel.org>,
	Patrick Xia <patrickx@google.com>,
	 Pavel Begunkov <asml.silence@gmail.com>,
	Peter Xu <peterx@redhat.com>,
	 Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	"Zach O'Keefe" <zokeefe@google.com>
Subject: [PATCH 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
Date: Sun, 10 Apr 2022 06:54:38 -0700	[thread overview]
Message-ID: <20220410135445.3897054-6-zokeefe@google.com> (raw)
In-Reply-To: <20220410135445.3897054-1-zokeefe@google.com>

This idea was introduced by David Rientjes[1], and the semantics and
implementation were introduced and discussed in a previous PATCH RFC[2].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
synchronous collapse of memory at their own expense.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the
  THP
* avoid unpredictable timing of khugepaged collapse

Immediate users of this new functionality include:

* immediately back executable text by hugepages.  Current support
  provided by CONFIG_READ_ONLY_THP_FOR_FS may take too long on a large
  system.
* malloc implementations that manage memory in hugepage-sized chunks,
  but sometimes subrelease memory back to the system in native-sized
  chunks via MADV_DONTNEED; zapping the pmd.  Later, when the memory
  is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the
  memory by THP to regain TLB performance.

Allocation semantics are the same as khugepaged, and depend on (1) the
active sysfs settings /sys/kernel/mm/transparent_hugepage/enabled and
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and (2) the VMA
flags of the memory range being collapsed.

Only privately-mapped anon memory is supported for now.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://lore.kernel.org/linux-mm/20220308213417.1407042-1-zokeefe@google.com/

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 include/linux/huge_mm.h                |  12 ++
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/khugepaged.c                        | 151 ++++++++++++++++++++++---
 mm/madvise.c                           |   5 +
 4 files changed, 157 insertions(+), 13 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 816a9937f30e..ddad7c7af44e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -236,6 +236,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -392,6 +395,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 	BUG();
 	return 0;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	BUG();
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ed025dbbd7e6..c5c484b7e394 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -846,7 +846,6 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 	return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
 }
 
-#ifdef CONFIG_NUMA
 static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
@@ -872,6 +871,24 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
 	return target_node;
 }
 
+static struct page *alloc_hpage(struct collapse_control *cc, gfp_t gfp,
+				int node)
+{
+	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
+
+	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+	if (unlikely(!cc->hpage)) {
+		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		cc->hpage = ERR_PTR(-ENOMEM);
+		return NULL;
+	}
+
+	prep_transhuge_page(cc->hpage);
+	count_vm_event(THP_COLLAPSE_ALLOC);
+	return cc->hpage;
+}
+
+#ifdef CONFIG_NUMA
 static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 {
 	if (IS_ERR(*hpage)) {
@@ -892,18 +909,7 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 static struct page *khugepaged_alloc_page(struct collapse_control *cc,
 					  gfp_t gfp, int node)
 {
-	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
-
-	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
-	if (unlikely(!cc->hpage)) {
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-		cc->hpage = ERR_PTR(-ENOMEM);
-		return NULL;
-	}
-
-	prep_transhuge_page(cc->hpage);
-	count_vm_event(THP_COLLAPSE_ALLOC);
-	return cc->hpage;
+	return alloc_hpage(cc, gfp, node);
 }
 #else
 static int khugepaged_find_target_node(struct collapse_control *cc)
@@ -2456,3 +2462,122 @@ void khugepaged_min_free_kbytes_update(void)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+static void madvise_collapse_cleanup_page(struct page **hpage)
+{
+	if (!IS_ERR(*hpage) && *hpage)
+		put_page(*hpage);
+	*hpage = NULL;
+}
+
+int madvise_collapse_errno(enum scan_result r)
+{
+	switch (r) {
+	case SCAN_PMD_NULL:
+	case SCAN_ADDRESS_RANGE:
+	case SCAN_VMA_NULL:
+	case SCAN_PTE_NON_PRESENT:
+	case SCAN_PAGE_NULL:
+		/*
+		 * Addresses in the specified range are not currently mapped,
+		 * or are outside the AS of the process.
+		 */
+		return -ENOMEM;
+	case SCAN_ALLOC_HUGE_PAGE_FAIL:
+	case SCAN_CGROUP_CHARGE_FAIL:
+		/* A kernel resource was temporarily unavailable. */
+		return -EAGAIN;
+	default:
+		return -EINVAL;
+	}
+}
+
+int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end)
+{
+	struct collapse_control cc = {
+		.last_target_node = NUMA_NO_NODE,
+		.hpage = NULL,
+		.alloc_hpage = &alloc_hpage,
+	};
+	struct mm_struct *mm = vma->vm_mm;
+	struct collapse_result cr;
+	unsigned long hstart, hend, addr;
+	int thps = 0, nr_hpages = 0;
+
+	BUG_ON(vma->vm_start > start);
+	BUG_ON(vma->vm_end < end);
+
+	*prev = vma;
+
+	if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file)
+		return -EINVAL;
+
+	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = end & HPAGE_PMD_MASK;
+	nr_hpages = (hend - hstart) >> HPAGE_PMD_SHIFT;
+
+	if (hstart >= hend || !transparent_hugepage_active(vma))
+		return -EINVAL;
+
+	mmgrab(mm);
+	lru_add_drain();
+
+	for (addr = hstart; ; ) {
+		mmap_assert_locked(mm);
+		cond_resched();
+		memset(&cr, 0, sizeof(cr));
+
+		if (unlikely(khugepaged_test_exit(mm)))
+			break;
+
+		memset(cc.node_load, 0, sizeof(cc.node_load));
+		khugepaged_scan_pmd(mm, vma, addr, &cc, &cr);
+		if (cr.dropped_mmap_lock)
+			*prev = NULL;  /* tell madvise we dropped mmap_lock */
+
+		switch (cr.result) {
+		/* Whitelisted set of results where continuing OK */
+		case SCAN_SUCCEED:
+		case SCAN_PMD_MAPPED:
+			++thps;
+		case SCAN_PMD_NULL:
+		case SCAN_PTE_NON_PRESENT:
+		case SCAN_PTE_UFFD_WP:
+		case SCAN_PAGE_RO:
+		case SCAN_LACK_REFERENCED_PAGE:
+		case SCAN_PAGE_NULL:
+		case SCAN_PAGE_COUNT:
+		case SCAN_PAGE_LOCK:
+		case SCAN_PAGE_COMPOUND:
+			break;
+		case SCAN_PAGE_LRU:
+			lru_add_drain_all();
+			goto retry;
+		default:
+			/* Other error, exit */
+			goto break_loop;
+		}
+		addr += HPAGE_PMD_SIZE;
+		if (addr >= hend)
+			break;
+retry:
+		if (cr.dropped_mmap_lock) {
+			mmap_read_lock(mm);
+			if (hugepage_vma_revalidate(mm, addr, &vma))
+				goto out;
+		}
+		madvise_collapse_cleanup_page(&cc.hpage);
+	}
+
+break_loop:
+	/* madvise_walk_vmas() expects us to hold mmap_lock on return */
+	if (cr.dropped_mmap_lock)
+		mmap_read_lock(mm);
+out:
+	mmap_assert_locked(mm);
+	madvise_collapse_cleanup_page(&cc.hpage);
+	mmdrop(mm);
+
+	return thps == nr_hpages ? 0 : madvise_collapse_errno(cr.result);
+}
diff --git a/mm/madvise.c b/mm/madvise.c
index ec03a76244b7..7ad53e5311cf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1051,6 +1052,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1144,6 +1147,7 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1333,6 +1337,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
-- 
2.35.1.1178.g4f1659d476-goog



  parent reply	other threads:[~2022-04-10 13:55 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-10 13:54 [PATCH 00/12] mm: userspace hugepage collapse Zach O'Keefe
2022-04-10 13:54 ` [PATCH 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
2022-04-10 13:54 ` [PATCH 02/12] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-04-10 13:54 ` [PATCH 03/12] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
2022-04-10 17:47   ` kernel test robot
2022-04-10 17:47   ` kernel test robot
2022-04-11 17:28     ` Zach O'Keefe
2022-04-11 17:28       ` Zach O'Keefe
2022-04-10 13:54 ` [PATCH 04/12] mm/khugepaged: add struct collapse_result Zach O'Keefe
2022-04-10 13:54 ` Zach O'Keefe [this message]
2022-04-10 16:04   ` [PATCH 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse kernel test robot
2022-04-10 16:14   ` kernel test robot
2022-04-11 17:18     ` Zach O'Keefe
2022-04-11 17:18       ` Zach O'Keefe
2022-04-10 13:54 ` [PATCH 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions Zach O'Keefe
2022-04-10 17:06   ` kernel test robot
2022-04-11 17:42     ` Zach O'Keefe
2022-04-11 17:42       ` Zach O'Keefe
2022-04-10 13:54 ` [PATCH 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
2022-04-10 13:54 ` [PATCH 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
2022-04-10 13:54 ` [PATCH 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
2022-04-10 13:54 ` [PATCH 10/12] selftests/vm: modularize collapse selftests Zach O'Keefe
2022-04-10 13:54 ` [PATCH 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
2022-04-10 13:54 ` [PATCH 12/12] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220410135445.3897054-6-zokeefe@google.com \
    --to=zokeefe@google.com \
    --cc=James.Bottomley@HansenPartnership.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.shi@linux.alibaba.com \
    --cc=arnd@arndb.de \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=axelrasmussen@google.com \
    --cc=chris@zankel.net \
    --cc=ckennelly@google.com \
    --cc=david@redhat.com \
    --cc=deller@gmx.de \
    --cc=hughd@google.com \
    --cc=ink@jurassic.park.msu.ru \
    --cc=jcmvbkbc@gmail.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-mm@kvack.org \
    --cc=mattst88@gmail.com \
    --cc=mhocko@suse.com \
    --cc=minchan@kernel.org \
    --cc=pasha.tatashin@soleen.com \
    --cc=patrickx@google.com \
    --cc=peterx@redhat.com \
    --cc=rientjes@google.com \
    --cc=shy828301@gmail.com \
    --cc=sj@kernel.org \
    --cc=songliubraving@fb.com \
    --cc=tsbogend@alpha.franken.de \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.