linux-kernel.vger.kernel.org archive mirror
* [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault
@ 2021-04-13 21:24 Yang Shi
  2021-04-13 21:24 ` [v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault Yang Shi
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Yang Shi @ 2021-04-13 21:24 UTC (permalink / raw)
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel


Changelog:
v1 --> v2:
    * Adopted the suggestion from Gerald Schaefer to skip huge PMD for S390
      for now.
    * Used PageTransHuge to distinguish base pages from THPs instead of adding
      a new parameter to migrate_misplaced_page(), per Huang Ying.
    * Restored PMD lazily to avoid unnecessary TLB shootdown per Huang Ying.
    * Skipped shared THP.
    * Updated counters correctly.
    * Rebased to linux-next (next-20210412).

When THP NUMA fault support was added, THP migration was not supported yet,
so ad hoc THP migration was implemented in the NUMA fault handling path.  THP
migration has been supported since v4.14, so it doesn't make much sense to
keep a separate THP migration implementation rather than using the generic
migration code.  Keeping two THP migration implementations for different code
paths is definitely a maintenance burden and more error prone.  Using the
generic THP migration implementation allows us to remove the duplicate code
and some hacks needed by the old ad hoc implementation.

A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support both
THP and NUMA balancing.  All of them support THP migration except for S390.
Zi Yan tried to add THP migration support for S390 before, but it was not
accepted due to the design of the S390 PMD.  For the discussion, please see:
https://lkml.org/lkml/2018/4/27/953.

Per the discussion with Gerald Schaefer in v1, it is acceptable to skip huge
PMD for S390 for now.

I saw there were some hacks around gup in the git history, but I couldn't
figure out whether they have been removed or not, since I just found the
FOLL_NUMA code in the current gup implementation and it seems useful.

I'm trying to keep the behavior as consistent as possible between before and
after, but there is still some minor disparity.  For example, file THP won't
get migrated at all in the old implementation due to the anon_vma check, but
the new implementation doesn't need to acquire the anon_vma lock anymore, so
file THP might get migrated.  Not sure whether this behavior needs to be
kept.

Patches #1 ~ #2 are preparation patches.
Patch #3 is the real meat.
Patches #4 ~ #6 keep the counters and behavior consistent with before.
Patch #7 skips changing huge PMD to prot_none if THP migration is not supported.

Yang Shi (7):
      mm: memory: add orig_pmd to struct vm_fault
      mm: memory: make numa_migrate_prep() non-static
      mm: thp: refactor NUMA fault handling
      mm: migrate: account THP NUMA migration counters correctly
      mm: migrate: don't split THP for misplaced NUMA page
      mm: migrate: check mapcount for THP instead of ref count
      mm: thp: skip make PMD PROT_NONE if THP migration is not supported

 include/linux/huge_mm.h |   9 ++---
 include/linux/migrate.h |  23 -----------
 include/linux/mm.h      |   3 ++
 mm/huge_memory.c        | 156 +++++++++++++++++++++++++-----------------------------------------------
 mm/internal.h           |  21 ++--------
 mm/memory.c             |  31 +++++++--------
 mm/migrate.c            | 204 +++++++++++++++++++++--------------------------------------------------------------------------
 7 files changed, 123 insertions(+), 324 deletions(-)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault
  2021-04-13 21:24 [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
@ 2021-04-13 21:24 ` Yang Shi
  2021-05-17 15:09   ` Mel Gorman
  2021-04-13 21:24 ` [v2 PATCH 2/7] mm: memory: make numa_migrate_prep() non-static Yang Shi
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Yang Shi @ 2021-04-13 21:24 UTC (permalink / raw)
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by the huge
page fault paths can be removed, just like its PTE counterpart.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/huge_mm.h |  9 ++++-----
 include/linux/mm.h      |  3 +++
 mm/huge_memory.c        |  9 ++++++---
 mm/memory.c             | 26 +++++++++++++-------------
 4 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9626fda5efce..3b38070f4337 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,7 +11,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma);
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
+void huge_pmd_set_accessed(struct vm_fault *vmf);
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma);
@@ -24,7 +24,7 @@ static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 }
 #endif
 
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmd,
 				   unsigned int flags);
@@ -283,7 +283,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
+vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
 extern struct page *huge_zero_page;
 
@@ -429,8 +429,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 	return NULL;
 }
 
-static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf,
-		pmd_t orig_pmd)
+static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
 	return 0;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25b9041f9925..9c5856f8cc81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -547,6 +547,9 @@ struct vm_fault {
 					 * the 'address'
 					 */
 	pte_t orig_pte;			/* Value of PTE at the time of fault */
+	pmd_t orig_pmd;			/* Value of PMD at the time of fault,
+					 * used by PMD fault only.
+					 */
 
 	struct page *cow_page;		/* Page handler may use for COW fault */
 	struct page *page;		/* ->fault handlers should return a
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 63ed6b25deaa..35cac4aeaf68 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1251,11 +1251,12 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 }
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
+void huge_pmd_set_accessed(struct vm_fault *vmf)
 {
 	pmd_t entry;
 	unsigned long haddr;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
+	pmd_t orig_pmd = vmf->orig_pmd;
 
 	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
@@ -1272,11 +1273,12 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
 	spin_unlock(vmf->ptl);
 }
 
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	pmd_t orig_pmd = vmf->orig_pmd;
 
 	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1412,9 +1414,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
+vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	pmd_t pmd = vmf->orig_pmd;
 	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
diff --git a/mm/memory.c b/mm/memory.c
index 4e358601c5d6..e0cbffeceb0b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4232,12 +4232,12 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 }
 
 /* `inline' is required to avoid gcc 4.1.2 build error */
-static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
+static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 {
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
+		if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
-		return do_huge_pmd_wp_page(vmf, orig_pmd);
+		return do_huge_pmd_wp_page(vmf);
 	}
 	if (vmf->vma->vm_ops->huge_fault) {
 		vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
@@ -4464,26 +4464,26 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 	} else {
-		pmd_t orig_pmd = *vmf.pmd;
+		vmf.orig_pmd = *vmf.pmd;
 
 		barrier();
-		if (unlikely(is_swap_pmd(orig_pmd))) {
+		if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
 			VM_BUG_ON(thp_migration_supported() &&
-					  !is_pmd_migration_entry(orig_pmd));
-			if (is_pmd_migration_entry(orig_pmd))
+					  !is_pmd_migration_entry(vmf.orig_pmd));
+			if (is_pmd_migration_entry(vmf.orig_pmd))
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
 		}
-		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
-			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
-				return do_huge_pmd_numa_page(&vmf, orig_pmd);
+		if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
+			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
+				return do_huge_pmd_numa_page(&vmf);
 
-			if (dirty && !pmd_write(orig_pmd)) {
-				ret = wp_huge_pmd(&vmf, orig_pmd);
+			if (dirty && !pmd_write(vmf.orig_pmd)) {
+				ret = wp_huge_pmd(&vmf);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
 			} else {
-				huge_pmd_set_accessed(&vmf, orig_pmd);
+				huge_pmd_set_accessed(&vmf);
 				return 0;
 			}
 		}
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [v2 PATCH 2/7] mm: memory: make numa_migrate_prep() non-static
  2021-04-13 21:24 [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
  2021-04-13 21:24 ` [v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault Yang Shi
@ 2021-04-13 21:24 ` Yang Shi
  2021-05-17 15:11   ` Mel Gorman
  2021-04-13 21:24 ` [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling Yang Shi
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Yang Shi @ 2021-04-13 21:24 UTC (permalink / raw)
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

numa_migrate_prep() will be used by the huge PMD NUMA fault handler as well
in the following patch, so make it non-static.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/internal.h | 3 +++
 mm/memory.c   | 5 ++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index f469f69309de..b6f889d9c22d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -659,4 +659,7 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 
 void vunmap_range_noflush(unsigned long start, unsigned long end);
 
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+		      unsigned long addr, int page_nid, int *flags);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index e0cbffeceb0b..bbb38066021a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4119,9 +4119,8 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	return ret;
 }
 
-static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int page_nid,
-				int *flags)
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+		      unsigned long addr, int page_nid, int *flags)
 {
 	get_page(page);
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling
  2021-04-13 21:24 [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
  2021-04-13 21:24 ` [v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault Yang Shi
  2021-04-13 21:24 ` [v2 PATCH 2/7] mm: memory: make numa_migrate_prep() non-static Yang Shi
@ 2021-04-13 21:24 ` Yang Shi
  2021-04-14  2:43   ` Huang, Ying
  2021-05-17 15:27   ` Mel Gorman
  2021-04-13 21:24 ` [v2 PATCH 4/7] mm: migrate: account THP NUMA migration counters correctly Yang Shi
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Yang Shi @ 2021-04-13 21:24 UTC (permalink / raw)
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

When THP NUMA fault support was added, THP migration was not supported yet,
so ad hoc THP migration was implemented in the NUMA fault handling path.  THP
migration has been supported since v4.14, so it doesn't make much sense to
keep a separate THP migration implementation rather than using the generic
migration code.

This patch reworks the NUMA fault handling to use the generic migration
implementation to migrate the misplaced page.  There is no functional change.

After the refactor, the flow of NUMA fault handling looks just like its
PTE counterpart:
  Acquire ptl
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP

If migration fails, just restore the old normal PMD.
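
A simplified sketch of the refactored handler, with error handling and
details omitted (the actual code is in the diff below):

	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
	/* bail out here if the PMD changed under us */
	page = vm_normal_page_pmd(vma, haddr,
				  pmd_modify(pmd, vma->vm_page_prot));
	target_nid = numa_migrate_prep(page, vma, haddr, page_nid, &flags);
	spin_unlock(vmf->ptl);

	/* isolate the page from the LRU and migrate via the generic path */
	migrated = migrate_misplaced_page(page, vma, target_nid);
	if (!migrated) {
		/* relock, recheck pmd_same(), then restore the old PMD */
	}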

In the old code the anon_vma lock was needed to serialize THP migration
against THP split, but the THP code has been reworked a lot since then, and
it seems the anon_vma lock is no longer required to avoid the race.

The page refcount elevation while holding the ptl should prevent the THP
from being split.

Use migrate_misplaced_page() for both base page and THP NUMA hinting faults
and remove all the dead and duplicate code.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/migrate.h |  23 ------
 mm/huge_memory.c        | 143 ++++++++++----------------------
 mm/internal.h           |  18 ----
 mm/migrate.c            | 177 ++++++++--------------------------------
 4 files changed, 77 insertions(+), 284 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4bb4e519e3f5..163d6f2b03d1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -95,14 +95,9 @@ static inline void __ClearPageMovable(struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
 				  struct vm_area_struct *vma, int node);
 #else
-static inline bool pmd_trans_migrating(pmd_t pmd)
-{
-	return false;
-}
 static inline int migrate_misplaced_page(struct page *page,
 					 struct vm_area_struct *vma, int node)
 {
@@ -110,24 +105,6 @@ static inline int migrate_misplaced_page(struct page *page,
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-			struct vm_area_struct *vma,
-			pmd_t *pmd, pmd_t entry,
-			unsigned long address,
-			struct page *page, int node);
-#else
-static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-			struct vm_area_struct *vma,
-			pmd_t *pmd, pmd_t entry,
-			unsigned long address,
-			struct page *page, int node)
-{
-	return -EAGAIN;
-}
-#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
-
-
 #ifdef CONFIG_MIGRATION
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 35cac4aeaf68..94981907fd4c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1418,93 +1418,21 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	pmd_t pmd = vmf->orig_pmd;
-	struct anon_vma *anon_vma = NULL;
+	pmd_t oldpmd;
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
-	int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
+	int page_nid = NUMA_NO_NODE;
 	int target_nid, last_cpupid = -1;
-	bool page_locked;
 	bool migrated = false;
-	bool was_writable;
+	bool was_writable = pmd_savedwrite(pmd);
 	int flags = 0;
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-	if (unlikely(!pmd_same(pmd, *vmf->pmd)))
-		goto out_unlock;
-
-	/*
-	 * If there are potential migrations, wait for completion and retry
-	 * without disrupting NUMA hinting information. Do not relock and
-	 * check_same as the page may no longer be mapped.
-	 */
-	if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
-		page = pmd_page(*vmf->pmd);
-		if (!get_page_unless_zero(page))
-			goto out_unlock;
+	if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
 		spin_unlock(vmf->ptl);
-		put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
 		goto out;
 	}
 
-	page = pmd_page(pmd);
-	BUG_ON(is_huge_zero_page(page));
-	page_nid = page_to_nid(page);
-	last_cpupid = page_cpupid_last(page);
-	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (page_nid == this_nid) {
-		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-		flags |= TNF_FAULT_LOCAL;
-	}
-
-	/* See similar comment in do_numa_page for explanation */
-	if (!pmd_savedwrite(pmd))
-		flags |= TNF_NO_GROUP;
-
-	/*
-	 * Acquire the page lock to serialise THP migrations but avoid dropping
-	 * page_table_lock if at all possible
-	 */
-	page_locked = trylock_page(page);
-	target_nid = mpol_misplaced(page, vma, haddr);
-	/* Migration could have started since the pmd_trans_migrating check */
-	if (!page_locked) {
-		page_nid = NUMA_NO_NODE;
-		if (!get_page_unless_zero(page))
-			goto out_unlock;
-		spin_unlock(vmf->ptl);
-		put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
-		goto out;
-	} else if (target_nid == NUMA_NO_NODE) {
-		/* There are no parallel migrations and page is in the right
-		 * node. Clear the numa hinting info in this pmd.
-		 */
-		goto clear_pmdnuma;
-	}
-
-	/*
-	 * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
-	 * to serialises splits
-	 */
-	get_page(page);
-	spin_unlock(vmf->ptl);
-	anon_vma = page_lock_anon_vma_read(page);
-
-	/* Confirm the PMD did not change while page_table_lock was released */
-	spin_lock(vmf->ptl);
-	if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
-		unlock_page(page);
-		put_page(page);
-		page_nid = NUMA_NO_NODE;
-		goto out_unlock;
-	}
-
-	/* Bail if we fail to protect against THP splits for any reason */
-	if (unlikely(!anon_vma)) {
-		put_page(page);
-		page_nid = NUMA_NO_NODE;
-		goto clear_pmdnuma;
-	}
-
 	/*
 	 * Since we took the NUMA fault, we must have observed the !accessible
 	 * bit. Make sure all other CPUs agree with that, to avoid them
@@ -1531,43 +1459,60 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 					      haddr + HPAGE_PMD_SIZE);
 	}
 
-	/*
-	 * Migrate the THP to the requested node, returns with page unlocked
-	 * and access rights restored.
-	 */
+	oldpmd = pmd_modify(pmd, vma->vm_page_prot);
+	page = vm_normal_page_pmd(vma, haddr, oldpmd);
+	if (!page) {
+		spin_unlock(vmf->ptl);
+		goto out_map;
+	}
+
+	/* See similar comment in do_numa_page for explanation */
+	if (!was_writable)
+		flags |= TNF_NO_GROUP;
+
+	page_nid = page_to_nid(page);
+	last_cpupid = page_cpupid_last(page);
+	target_nid = numa_migrate_prep(page, vma, haddr, page_nid,
+				       &flags);
+
+	if (target_nid == NUMA_NO_NODE) {
+		put_page(page);
+		goto out_map;
+	}
+
 	spin_unlock(vmf->ptl);
 
-	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
-				vmf->pmd, pmd, vmf->address, page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated) {
 		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
-	} else
+	} else {
 		flags |= TNF_MIGRATE_FAIL;
+		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+		if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
+			spin_unlock(vmf->ptl);
+			goto out;
+		}
+		goto out_map;
+	}
 
-	goto out;
-clear_pmdnuma:
-	BUG_ON(!PageLocked(page));
-	was_writable = pmd_savedwrite(pmd);
+out:
+	if (page_nid != NUMA_NO_NODE)
+		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
+				flags);
+
+	return 0;
+
+out_map:
+	/* Restore the PMD */
 	pmd = pmd_modify(pmd, vma->vm_page_prot);
 	pmd = pmd_mkyoung(pmd);
 	if (was_writable)
 		pmd = pmd_mkwrite(pmd);
 	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
 	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
-	unlock_page(page);
-out_unlock:
 	spin_unlock(vmf->ptl);
-
-out:
-	if (anon_vma)
-		page_unlock_anon_vma_read(anon_vma);
-
-	if (page_nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
-				flags);
-
-	return 0;
+	goto out;
 }
 
 /*
diff --git a/mm/internal.h b/mm/internal.h
index b6f889d9c22d..c060551d6ebf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -381,23 +381,6 @@ extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
  */
 extern void clear_page_mlock(struct page *page);
 
-/*
- * mlock_migrate_page - called only from migrate_misplaced_transhuge_page()
- * (because that does not go through the full procedure of migration ptes):
- * to migrate the Mlocked page flag; update statistics.
- */
-static inline void mlock_migrate_page(struct page *newpage, struct page *page)
-{
-	if (TestClearPageMlocked(page)) {
-		int nr_pages = thp_nr_pages(page);
-
-		/* Holding pmd lock, no change in irq context: __mod is safe */
-		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-		SetPageMlocked(newpage);
-		__mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
-	}
-}
-
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
 /*
@@ -448,7 +431,6 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
 #else /* !CONFIG_MMU */
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
-static inline void mlock_migrate_page(struct page *new, struct page *old) { }
 static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
 {
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index b234c3f3acb7..333448aa53f1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2042,6 +2042,23 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 	return newpage;
 }
 
+static struct page *alloc_misplaced_dst_page_thp(struct page *page,
+						 unsigned long data)
+{
+	int nid = (int) data;
+	struct page *newpage;
+
+	newpage = alloc_pages_node(nid, (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
+				   HPAGE_PMD_ORDER);
+	if (!newpage)
+		goto out;
+
+	prep_transhuge_page(newpage);
+
+out:
+	return newpage;
+}
+
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
 	int page_lru;
@@ -2080,12 +2097,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	return 1;
 }
 
-bool pmd_trans_migrating(pmd_t pmd)
-{
-	struct page *page = pmd_page(pmd);
-	return PageLocked(page);
-}
-
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
@@ -2098,6 +2109,20 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	int isolated;
 	int nr_remaining;
 	LIST_HEAD(migratepages);
+	new_page_t *new;
+	bool compound;
+
+	/*
+	 * PTE mapped THP or HugeTLB page can't reach here so the page could
+	 * be either base page or THP.  And it must be head page if it is
+	 * THP.
+	 */
+	compound = PageTransHuge(page);
+
+	if (compound)
+		new = alloc_misplaced_dst_page_thp;
+	else
+		new = alloc_misplaced_dst_page;
 
 	/*
 	 * Don't migrate file pages that are mapped in multiple processes
@@ -2119,9 +2144,8 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 		goto out;
 
 	list_add(&page->lru, &migratepages);
-	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
-				     NULL, node, MIGRATE_ASYNC,
-				     MR_NUMA_MISPLACED);
+	nr_remaining = migrate_pages(&migratepages, *new, NULL, node,
+				     MIGRATE_ASYNC, MR_NUMA_MISPLACED);
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
@@ -2140,141 +2164,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	return 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
-
-#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * Migrates a THP to a given target node. page must be locked and is unlocked
- * before returning.
- */
-int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-				struct vm_area_struct *vma,
-				pmd_t *pmd, pmd_t entry,
-				unsigned long address,
-				struct page *page, int node)
-{
-	spinlock_t *ptl;
-	pg_data_t *pgdat = NODE_DATA(node);
-	int isolated = 0;
-	struct page *new_page = NULL;
-	int page_lru = page_is_file_lru(page);
-	unsigned long start = address & HPAGE_PMD_MASK;
-
-	new_page = alloc_pages_node(node,
-		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
-		HPAGE_PMD_ORDER);
-	if (!new_page)
-		goto out_fail;
-	prep_transhuge_page(new_page);
-
-	isolated = numamigrate_isolate_page(pgdat, page);
-	if (!isolated) {
-		put_page(new_page);
-		goto out_fail;
-	}
-
-	/* Prepare a page as a migration target */
-	__SetPageLocked(new_page);
-	if (PageSwapBacked(page))
-		__SetPageSwapBacked(new_page);
-
-	/* anon mapping, we can simply copy page->mapping to the new page: */
-	new_page->mapping = page->mapping;
-	new_page->index = page->index;
-	/* flush the cache before copying using the kernel virtual address */
-	flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
-	migrate_page_copy(new_page, page);
-	WARN_ON(PageLRU(new_page));
-
-	/* Recheck the target PMD */
-	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
-		spin_unlock(ptl);
-
-		/* Reverse changes made by migrate_page_copy() */
-		if (TestClearPageActive(new_page))
-			SetPageActive(page);
-		if (TestClearPageUnevictable(new_page))
-			SetPageUnevictable(page);
-
-		unlock_page(new_page);
-		put_page(new_page);		/* Free it */
-
-		/* Retake the callers reference and putback on LRU */
-		get_page(page);
-		putback_lru_page(page);
-		mod_node_page_state(page_pgdat(page),
-			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
-
-		goto out_unlock;
-	}
-
-	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-
-	/*
-	 * Overwrite the old entry under pagetable lock and establish
-	 * the new PTE. Any parallel GUP will either observe the old
-	 * page blocking on the page lock, block on the page table
-	 * lock or observe the new page. The SetPageUptodate on the
-	 * new page and page_add_new_anon_rmap guarantee the copy is
-	 * visible before the pagetable update.
-	 */
-	page_add_anon_rmap(new_page, vma, start, true);
-	/*
-	 * At this point the pmd is numa/protnone (i.e. non present) and the TLB
-	 * has already been flushed globally.  So no TLB can be currently
-	 * caching this non present pmd mapping.  There's no need to clear the
-	 * pmd before doing set_pmd_at(), nor to flush the TLB after
-	 * set_pmd_at().  Clearing the pmd here would introduce a race
-	 * condition against MADV_DONTNEED, because MADV_DONTNEED only holds the
-	 * mmap_lock for reading.  If the pmd is set to NULL at any given time,
-	 * MADV_DONTNEED won't wait on the pmd lock and it'll skip clearing this
-	 * pmd.
-	 */
-	set_pmd_at(mm, start, pmd, entry);
-	update_mmu_cache_pmd(vma, address, &entry);
-
-	page_ref_unfreeze(page, 2);
-	mlock_migrate_page(new_page, page);
-	page_remove_rmap(page, true);
-	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
-
-	spin_unlock(ptl);
-
-	/* Take an "isolate" reference and put new page on the LRU. */
-	get_page(new_page);
-	putback_lru_page(new_page);
-
-	unlock_page(new_page);
-	unlock_page(page);
-	put_page(page);			/* Drop the rmap reference */
-	put_page(page);			/* Drop the LRU isolation reference */
-
-	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
-	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
-
-	mod_node_page_state(page_pgdat(page),
-			NR_ISOLATED_ANON + page_lru,
-			-HPAGE_PMD_NR);
-	return isolated;
-
-out_fail:
-	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-	ptl = pmd_lock(mm, pmd);
-	if (pmd_same(*pmd, entry)) {
-		entry = pmd_modify(entry, vma->vm_page_prot);
-		set_pmd_at(mm, start, pmd, entry);
-		update_mmu_cache_pmd(vma, address, &entry);
-	}
-	spin_unlock(ptl);
-
-out_unlock:
-	unlock_page(page);
-	put_page(page);
-	return 0;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_DEVICE_PRIVATE
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [v2 PATCH 4/7] mm: migrate: account THP NUMA migration counters correctly
  2021-04-13 21:24 [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
                   ` (2 preceding siblings ...)
  2021-04-13 21:24 ` [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling Yang Shi
@ 2021-04-13 21:24 ` Yang Shi
  2021-05-17 15:28   ` Mel Gorman
  2021-04-13 21:24 ` [v2 PATCH 5/7] mm: migrate: don't split THP for misplaced NUMA page Yang Shi
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Yang Shi @ 2021-04-13 21:24 UTC (permalink / raw)
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

Now that both base page and THP NUMA migration are done via
migrate_misplaced_page(), keep the counters correct for THP.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/migrate.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 333448aa53f1..a473f25fbd01 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2111,6 +2111,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	LIST_HEAD(migratepages);
 	new_page_t *new;
 	bool compound;
+	unsigned int nr_pages = thp_nr_pages(page);
 
 	/*
 	 * PTE mapped THP or HugeTLB page can't reach here so the page could
@@ -2149,13 +2150,13 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
-			dec_node_page_state(page, NR_ISOLATED_ANON +
-					page_is_file_lru(page));
+			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+					page_is_file_lru(page), -nr_pages);
 			putback_lru_page(page);
 		}
 		isolated = 0;
 	} else
-		count_vm_numa_event(NUMA_PAGE_MIGRATE);
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_pages);
 	BUG_ON(!list_empty(&migratepages));
 	return isolated;
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [v2 PATCH 5/7] mm: migrate: don't split THP for misplaced NUMA page
  2021-04-13 21:24 [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
                   ` (3 preceding siblings ...)
  2021-04-13 21:24 ` [v2 PATCH 4/7] mm: migrate: account THP NUMA migration counters correctly Yang Shi
@ 2021-04-13 21:24 ` Yang Shi
  2021-05-17 15:29   ` Mel Gorman
  2021-04-13 21:24 ` [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count Yang Shi
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Yang Shi @ 2021-04-13 21:24 UTC (permalink / raw)
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

The old behavior didn't split the THP if migration failed due to lack of
memory on the target node, but the generic THP migration path does split it
in that case.  Keep the old behavior for misplaced NUMA page migration by
not splitting the THP.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/migrate.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index a473f25fbd01..a72994c68ec6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1417,6 +1417,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	int swapwrite = current->flags & PF_SWAPWRITE;
 	int rc, nr_subpages;
 	LIST_HEAD(ret_pages);
+	bool nosplit = (reason == MR_NUMA_MISPLACED);
 
 	trace_mm_migrate_pages_start(mode, reason);
 
@@ -1488,8 +1489,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 				/*
 				 * When memory is low, don't bother to try to migrate
 				 * other pages, just exit.
+				 * THP NUMA faulting doesn't split THP to retry.
 				 */
-				if (is_thp) {
+				if (is_thp && !nosplit) {
 					if (!try_split_thp(page, &page2, from)) {
 						nr_thp_split++;
 						goto retry;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count
  2021-04-13 21:24 [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
                   ` (4 preceding siblings ...)
  2021-04-13 21:24 ` [v2 PATCH 5/7] mm: migrate: don't split THP for misplaced NUMA page Yang Shi
@ 2021-04-13 21:24 ` Yang Shi
  2021-04-14  3:00   ` Huang, Ying
  2021-04-13 21:24 ` [v2 PATCH 7/7] mm: thp: skip make PMD PROT_NONE if THP migration is not supported Yang Shi
  2021-05-03 21:58 ` [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
  7 siblings, 1 reply; 26+ messages in thread
From: Yang Shi @ 2021-04-13 21:24 UTC (permalink / raw)
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

The generic migration path will check the refcount, so there is no need to
check it here.  But the old code actually prevented migrating shared THP
(mapped by multiple processes), so bail out early if the mapcount is > 1 to
keep the behavior.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/migrate.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index a72994c68ec6..dc7cc7f3a124 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2067,6 +2067,10 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 
 	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
 
+	/* Do not migrate THP mapped by multiple processes */
+	if (PageTransHuge(page) && page_mapcount(page) > 1)
+		return 0;
+
 	/* Avoid migrating to a node that is nearly full */
 	if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
 		return 0;
@@ -2074,18 +2078,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	if (isolate_lru_page(page))
 		return 0;
 
-	/*
-	 * migrate_misplaced_transhuge_page() skips page migration's usual
-	 * check on page_count(), so we must do it here, now that the page
-	 * has been isolated: a GUP pin, or any other pin, prevents migration.
-	 * The expected page count is 3: 1 for page's mapcount and 1 for the
-	 * caller's pin and 1 for the reference taken by isolate_lru_page().
-	 */
-	if (PageTransHuge(page) && page_count(page) != 3) {
-		putback_lru_page(page);
-		return 0;
-	}
-
 	page_lru = page_is_file_lru(page);
 	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
 				thp_nr_pages(page));
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [v2 PATCH 7/7] mm: thp: skip make PMD PROT_NONE if THP migration is not supported
  2021-04-13 21:24 [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
                   ` (5 preceding siblings ...)
  2021-04-13 21:24 ` [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count Yang Shi
@ 2021-04-13 21:24 ` Yang Shi
  2021-05-17 15:30   ` Mel Gorman
  2021-05-03 21:58 ` [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
  7 siblings, 1 reply; 26+ messages in thread
From: Yang Shi @ 2021-04-13 21:24 UTC (permalink / raw)
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support both
NUMA balancing and THP, but S390 doesn't support THP migration, so NUMA
balancing actually can't migrate any misplaced pages there.

Skip making the PMD PROT_NONE in that case, otherwise CPU cycles may be
wasted on pointless NUMA hinting faults on S390.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/huge_memory.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94981907fd4c..f63445f3a17d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1741,6 +1741,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  * Returns
  *  - 0 if PMD could not be locked
  *  - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
+ *      or if prot_numa but THP migration is not supported
  *  - HPAGE_PMD_NR if protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -1755,6 +1756,9 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
+	if (prot_numa && !thp_migration_supported())
+		return 1;
+
 	ptl = __pmd_trans_huge_lock(pmd, vma);
 	if (!ptl)
 		return 0;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling
  2021-04-13 21:24 ` [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling Yang Shi
@ 2021-04-14  2:43   ` Huang, Ying
  2021-04-14 17:15     ` Yang Shi
  2021-05-17 15:27   ` Mel Gorman
  1 sibling, 1 reply; 26+ messages in thread
From: Huang, Ying @ 2021-04-14  2:43 UTC (permalink / raw)
  To: Yang Shi
  Cc: mgorman, kirill.shutemov, ziy, mhocko, hughd, gerald.schaefer,
	hca, gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel

Yang Shi <shy828301@gmail.com> writes:

> When THP NUMA fault support was added, THP migration was not supported yet,
> so ad hoc THP migration was implemented in the NUMA fault handling path.
> THP migration has been supported since v4.14, so it doesn't make much sense
> to keep a separate THP migration implementation rather than using the
> generic migration code.
>
> This patch reworks the NUMA fault handling to use the generic migration
> implementation to migrate the misplaced page.  There is no functional change.
>
> After the refactor, the flow of NUMA fault handling looks just like its
> PTE counterpart:
>   Acquire ptl
>   Prepare for migration (elevate page refcount)
>   Release ptl
>   Isolate page from lru and elevate page refcount
>   Migrate the misplaced THP
>
> If migration fails, just restore the old normal PMD.
>
> In the old code the anon_vma lock was needed to serialize THP migration
> against THP split, but the THP code has been reworked a lot since then, and
> it seems the anon_vma lock is no longer required to avoid the race.
>
> The page refcount elevation while holding the ptl should prevent the THP
> from being split.
>
> Use migrate_misplaced_page() for both base page and THP NUMA hinting faults
> and remove all the dead and duplicate code.
>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/migrate.h |  23 ------
>  mm/huge_memory.c        | 143 ++++++++++----------------------
>  mm/internal.h           |  18 ----
>  mm/migrate.c            | 177 ++++++++--------------------------------
>  4 files changed, 77 insertions(+), 284 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 4bb4e519e3f5..163d6f2b03d1 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -95,14 +95,9 @@ static inline void __ClearPageMovable(struct page *page)
>  #endif
>  
>  #ifdef CONFIG_NUMA_BALANCING
> -extern bool pmd_trans_migrating(pmd_t pmd);
>  extern int migrate_misplaced_page(struct page *page,
>  				  struct vm_area_struct *vma, int node);
>  #else
> -static inline bool pmd_trans_migrating(pmd_t pmd)
> -{
> -	return false;
> -}
>  static inline int migrate_misplaced_page(struct page *page,
>  					 struct vm_area_struct *vma, int node)
>  {
> @@ -110,24 +105,6 @@ static inline int migrate_misplaced_page(struct page *page,
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> -extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> -			struct vm_area_struct *vma,
> -			pmd_t *pmd, pmd_t entry,
> -			unsigned long address,
> -			struct page *page, int node);
> -#else
> -static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> -			struct vm_area_struct *vma,
> -			pmd_t *pmd, pmd_t entry,
> -			unsigned long address,
> -			struct page *page, int node)
> -{
> -	return -EAGAIN;
> -}
> -#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
> -
> -
>  #ifdef CONFIG_MIGRATION
>  
>  /*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 35cac4aeaf68..94981907fd4c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1418,93 +1418,21 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	pmd_t pmd = vmf->orig_pmd;
> -	struct anon_vma *anon_vma = NULL;
> +	pmd_t oldpmd;

nit: the usage of oldpmd and pmd in the function appears not very
consistent.  How about making oldpmd == vmf->orig_pmd always, while making
pmd the changed one?
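
For illustration, a hypothetical sketch of what the suggested naming could
look like (not code from the patch):

	pmd_t oldpmd = vmf->orig_pmd;	/* value observed at fault time */
	pmd_t pmd;			/* new value to be installed */

	/* ... on the restore path ... */
	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
	pmd = pmd_mkyoung(pmd);
	if (was_writable)
		pmd = pmd_mkwrite(pmd);
	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);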

Best Regards,
Huang, Ying

>  	struct page *page;
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> -	int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
> +	int page_nid = NUMA_NO_NODE;
>  	int target_nid, last_cpupid = -1;
> -	bool page_locked;
>  	bool migrated = false;
> -	bool was_writable;
> +	bool was_writable = pmd_savedwrite(pmd);
>  	int flags = 0;
>  
>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> -	if (unlikely(!pmd_same(pmd, *vmf->pmd)))
> -		goto out_unlock;
> -
> -	/*
> -	 * If there are potential migrations, wait for completion and retry
> -	 * without disrupting NUMA hinting information. Do not relock and
> -	 * check_same as the page may no longer be mapped.
> -	 */
> -	if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
> -		page = pmd_page(*vmf->pmd);
> -		if (!get_page_unless_zero(page))
> -			goto out_unlock;
> +	if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
>  		spin_unlock(vmf->ptl);
> -		put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
>  		goto out;
>  	}
>  
> -	page = pmd_page(pmd);
> -	BUG_ON(is_huge_zero_page(page));
> -	page_nid = page_to_nid(page);
> -	last_cpupid = page_cpupid_last(page);
> -	count_vm_numa_event(NUMA_HINT_FAULTS);
> -	if (page_nid == this_nid) {
> -		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> -		flags |= TNF_FAULT_LOCAL;
> -	}
> -
> -	/* See similar comment in do_numa_page for explanation */
> -	if (!pmd_savedwrite(pmd))
> -		flags |= TNF_NO_GROUP;
> -
> -	/*
> -	 * Acquire the page lock to serialise THP migrations but avoid dropping
> -	 * page_table_lock if at all possible
> -	 */
> -	page_locked = trylock_page(page);
> -	target_nid = mpol_misplaced(page, vma, haddr);
> -	/* Migration could have started since the pmd_trans_migrating check */
> -	if (!page_locked) {
> -		page_nid = NUMA_NO_NODE;
> -		if (!get_page_unless_zero(page))
> -			goto out_unlock;
> -		spin_unlock(vmf->ptl);
> -		put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
> -		goto out;
> -	} else if (target_nid == NUMA_NO_NODE) {
> -		/* There are no parallel migrations and page is in the right
> -		 * node. Clear the numa hinting info in this pmd.
> -		 */
> -		goto clear_pmdnuma;
> -	}
> -
> -	/*
> -	 * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
> -	 * to serialises splits
> -	 */
> -	get_page(page);
> -	spin_unlock(vmf->ptl);
> -	anon_vma = page_lock_anon_vma_read(page);
> -
> -	/* Confirm the PMD did not change while page_table_lock was released */
> -	spin_lock(vmf->ptl);
> -	if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> -		unlock_page(page);
> -		put_page(page);
> -		page_nid = NUMA_NO_NODE;
> -		goto out_unlock;
> -	}
> -
> -	/* Bail if we fail to protect against THP splits for any reason */
> -	if (unlikely(!anon_vma)) {
> -		put_page(page);
> -		page_nid = NUMA_NO_NODE;
> -		goto clear_pmdnuma;
> -	}
> -
>  	/*
>  	 * Since we took the NUMA fault, we must have observed the !accessible
>  	 * bit. Make sure all other CPUs agree with that, to avoid them
> @@ -1531,43 +1459,60 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  					      haddr + HPAGE_PMD_SIZE);
>  	}
>  
> -	/*
> -	 * Migrate the THP to the requested node, returns with page unlocked
> -	 * and access rights restored.
> -	 */
> +	oldpmd = pmd_modify(pmd, vma->vm_page_prot);
> +	page = vm_normal_page_pmd(vma, haddr, oldpmd);
> +	if (!page) {
> +		spin_unlock(vmf->ptl);
> +		goto out_map;
> +	}
> +
> +	/* See similar comment in do_numa_page for explanation */
> +	if (!was_writable)
> +		flags |= TNF_NO_GROUP;
> +
> +	page_nid = page_to_nid(page);
> +	last_cpupid = page_cpupid_last(page);
> +	target_nid = numa_migrate_prep(page, vma, haddr, page_nid,
> +				       &flags);
> +
> +	if (target_nid == NUMA_NO_NODE) {
> +		put_page(page);
> +		goto out_map;
> +	}
> +
>  	spin_unlock(vmf->ptl);
>  
> -	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
> -				vmf->pmd, pmd, vmf->address, page, target_nid);
> +	migrated = migrate_misplaced_page(page, vma, target_nid);
>  	if (migrated) {
>  		flags |= TNF_MIGRATED;
>  		page_nid = target_nid;
> -	} else
> +	} else {
>  		flags |= TNF_MIGRATE_FAIL;
> +		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> +		if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> +			spin_unlock(vmf->ptl);
> +			goto out;
> +		}
> +		goto out_map;
> +	}
>  
> -	goto out;
> -clear_pmdnuma:
> -	BUG_ON(!PageLocked(page));
> -	was_writable = pmd_savedwrite(pmd);
> +out:
> +	if (page_nid != NUMA_NO_NODE)
> +		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
> +				flags);
> +
> +	return 0;
> +
> +out_map:
> +	/* Restore the PMD */
>  	pmd = pmd_modify(pmd, vma->vm_page_prot);
>  	pmd = pmd_mkyoung(pmd);
>  	if (was_writable)
>  		pmd = pmd_mkwrite(pmd);
>  	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
>  	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
> -	unlock_page(page);
> -out_unlock:
>  	spin_unlock(vmf->ptl);
> -
> -out:
> -	if (anon_vma)
> -		page_unlock_anon_vma_read(anon_vma);
> -
> -	if (page_nid != NUMA_NO_NODE)
> -		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
> -				flags);
> -
> -	return 0;
> +	goto out;
>  }
>  

[snip]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count
  2021-04-13 21:24 ` [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count Yang Shi
@ 2021-04-14  3:00   ` Huang, Ying
  2021-04-14 15:02     ` Zi Yan
  2021-04-14 17:23     ` Yang Shi
  0 siblings, 2 replies; 26+ messages in thread
From: Huang, Ying @ 2021-04-14  3:00 UTC (permalink / raw)
  To: Yang Shi
  Cc: mgorman, kirill.shutemov, ziy, mhocko, hughd, gerald.schaefer,
	hca, gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel

Yang Shi <shy828301@gmail.com> writes:

> The generic migration path will check the refcount, so there is no need to
> check it here.  But the old code actually prevented migrating shared THP
> (mapped by multiple processes), so bail out early if the mapcount is > 1 to
> keep the behavior.

What prevents us from migrating shared THP?  If nothing does, why not just
remove the old refcount checking?

Best Regards,
Huang, Ying

> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/migrate.c | 16 ++++------------
>  1 file changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a72994c68ec6..dc7cc7f3a124 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2067,6 +2067,10 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>  
>  	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
>  
> +	/* Do not migrate THP mapped by multiple processes */
> +	if (PageTransHuge(page) && page_mapcount(page) > 1)
> +		return 0;
> +
>  	/* Avoid migrating to a node that is nearly full */
>  	if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
>  		return 0;
> @@ -2074,18 +2078,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>  	if (isolate_lru_page(page))
>  		return 0;
>  
> -	/*
> -	 * migrate_misplaced_transhuge_page() skips page migration's usual
> -	 * check on page_count(), so we must do it here, now that the page
> -	 * has been isolated: a GUP pin, or any other pin, prevents migration.
> -	 * The expected page count is 3: 1 for page's mapcount and 1 for the
> -	 * caller's pin and 1 for the reference taken by isolate_lru_page().
> -	 */
> -	if (PageTransHuge(page) && page_count(page) != 3) {
> -		putback_lru_page(page);
> -		return 0;
> -	}
> -
>  	page_lru = page_is_file_lru(page);
>  	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
>  				thp_nr_pages(page));

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count
  2021-04-14  3:00   ` Huang, Ying
@ 2021-04-14 15:02     ` Zi Yan
  2021-04-15  6:45       ` Huang, Ying
  2021-04-14 17:23     ` Yang Shi
  1 sibling, 1 reply; 26+ messages in thread
From: Zi Yan @ 2021-04-14 15:02 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yang Shi, mgorman, kirill.shutemov, mhocko, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm, linux-mm,
	linux-s390, linux-kernel


On 13 Apr 2021, at 23:00, Huang, Ying wrote:

> Yang Shi <shy828301@gmail.com> writes:
>
>> The generic migration path will check the refcount, so there is no need to
>> check it here.  But the old code actually prevented migrating shared THP
>> (mapped by multiple processes), so bail out early if the mapcount is > 1 to
>> keep the behavior.
>
> What prevents us from migrating shared THP?  If nothing does, why not just
> remove the old refcount checking?

If two or more processes are on different NUMA nodes, a THP shared by them can
be migrated back and forth between NUMA nodes, which is quite costly.  Unless
we have a better way of figuring out a good location for such pages to reduce
the number of migrations, it might be better not to move them, right?

>
> Best Regards,
> Huang, Ying
>
>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>> ---
>>  mm/migrate.c | 16 ++++------------
>>  1 file changed, 4 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index a72994c68ec6..dc7cc7f3a124 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -2067,6 +2067,10 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>>
>>  	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
>>
>> +	/* Do not migrate THP mapped by multiple processes */
>> +	if (PageTransHuge(page) && page_mapcount(page) > 1)
>> +		return 0;
>> +
>>  	/* Avoid migrating to a node that is nearly full */
>>  	if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
>>  		return 0;
>> @@ -2074,18 +2078,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>>  	if (isolate_lru_page(page))
>>  		return 0;
>>
>> -	/*
>> -	 * migrate_misplaced_transhuge_page() skips page migration's usual
>> -	 * check on page_count(), so we must do it here, now that the page
>> -	 * has been isolated: a GUP pin, or any other pin, prevents migration.
>> -	 * The expected page count is 3: 1 for page's mapcount and 1 for the
>> -	 * caller's pin and 1 for the reference taken by isolate_lru_page().
>> -	 */
>> -	if (PageTransHuge(page) && page_count(page) != 3) {
>> -		putback_lru_page(page);
>> -		return 0;
>> -	}
>> -
>>  	page_lru = page_is_file_lru(page);
>>  	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
>>  				thp_nr_pages(page));


—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling
  2021-04-14  2:43   ` Huang, Ying
@ 2021-04-14 17:15     ` Yang Shi
  0 siblings, 0 replies; 26+ messages in thread
From: Yang Shi @ 2021-04-14 17:15 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko,
	Hugh Dickins, Gerald Schaefer, hca, gor, borntraeger,
	Andrew Morton, Linux MM, linux-s390, Linux Kernel Mailing List

On Tue, Apr 13, 2021 at 7:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > When THP NUMA fault support was added, THP migration was not supported yet,
> > so ad hoc THP migration was implemented in the NUMA fault handling path.
> > THP migration has been supported since v4.14, so it doesn't make much sense
> > to keep a separate THP migration implementation rather than using the
> > generic migration code.
> >
> > This patch reworks the NUMA fault handling to use the generic migration
> > implementation to migrate the misplaced page.  There is no functional change.
> >
> > After the refactor, the flow of NUMA fault handling looks just like its
> > PTE counterpart:
> >   Acquire ptl
> >   Prepare for migration (elevate page refcount)
> >   Release ptl
> >   Isolate page from lru and elevate page refcount
> >   Migrate the misplaced THP
> >
> > If migration fails, just restore the old normal PMD.
> >
> > In the old code the anon_vma lock was needed to serialize THP migration
> > against THP split, but the THP code has been reworked a lot since then, and
> > it seems the anon_vma lock is no longer required to avoid the race.
> >
> > The page refcount elevation while holding the ptl should prevent the THP
> > from being split.
> >
> > Use migrate_misplaced_page() for both base page and THP NUMA hinting faults
> > and remove all the dead and duplicate code.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/migrate.h |  23 ------
> >  mm/huge_memory.c        | 143 ++++++++++----------------------
> >  mm/internal.h           |  18 ----
> >  mm/migrate.c            | 177 ++++++++--------------------------------
> >  4 files changed, 77 insertions(+), 284 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 4bb4e519e3f5..163d6f2b03d1 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -95,14 +95,9 @@ static inline void __ClearPageMovable(struct page *page)
> >  #endif
> >
> >  #ifdef CONFIG_NUMA_BALANCING
> > -extern bool pmd_trans_migrating(pmd_t pmd);
> >  extern int migrate_misplaced_page(struct page *page,
> >                                 struct vm_area_struct *vma, int node);
> >  #else
> > -static inline bool pmd_trans_migrating(pmd_t pmd)
> > -{
> > -     return false;
> > -}
> >  static inline int migrate_misplaced_page(struct page *page,
> >                                        struct vm_area_struct *vma, int node)
> >  {
> > @@ -110,24 +105,6 @@ static inline int migrate_misplaced_page(struct page *page,
> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> >
> > -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > -extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > -                     struct vm_area_struct *vma,
> > -                     pmd_t *pmd, pmd_t entry,
> > -                     unsigned long address,
> > -                     struct page *page, int node);
> > -#else
> > -static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > -                     struct vm_area_struct *vma,
> > -                     pmd_t *pmd, pmd_t entry,
> > -                     unsigned long address,
> > -                     struct page *page, int node)
> > -{
> > -     return -EAGAIN;
> > -}
> > -#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
> > -
> > -
> >  #ifdef CONFIG_MIGRATION
> >
> >  /*
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 35cac4aeaf68..94981907fd4c 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1418,93 +1418,21 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >  {
> >       struct vm_area_struct *vma = vmf->vma;
> >       pmd_t pmd = vmf->orig_pmd;
> > -     struct anon_vma *anon_vma = NULL;
> > +     pmd_t oldpmd;
>
> nit: the usage of oldpmd and pmd in the function appears rather
> inconsistent.  How about making oldpmd == vmf->orig_pmd always, while
> making pmd the changed one?

Thanks for the suggestion. Yes, that seems neater. Will fix it in the
next version.
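
For the record, a rough sketch of how the renamed flow might look,
based on the hunks above (illustrative only, not the actual next
version):

        pmd_t oldpmd = vmf->orig_pmd;   /* value observed at fault time */
        pmd_t pmd;                      /* the entry we construct/install */
        ...
        vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
        if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
                spin_unlock(vmf->ptl);
                goto out;
        }
        pmd = pmd_modify(oldpmd, vma->vm_page_prot);
        page = vm_normal_page_pmd(vma, haddr, pmd);
        ...
out_map:
        /* Restore the PMD, starting from the value seen at fault time. */
        pmd = pmd_modify(oldpmd, vma->vm_page_prot);
        pmd = pmd_mkyoung(pmd);
        if (was_writable)
                pmd = pmd_mkwrite(pmd);
        set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);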

>
> Best Regards,
> Huang, Ying
>
> >       struct page *page;
> >       unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> > -     int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
> > +     int page_nid = NUMA_NO_NODE;
> >       int target_nid, last_cpupid = -1;
> > -     bool page_locked;
> >       bool migrated = false;
> > -     bool was_writable;
> > +     bool was_writable = pmd_savedwrite(pmd);
> >       int flags = 0;
> >
> >       vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > -     if (unlikely(!pmd_same(pmd, *vmf->pmd)))
> > -             goto out_unlock;
> > -
> > -     /*
> > -      * If there are potential migrations, wait for completion and retry
> > -      * without disrupting NUMA hinting information. Do not relock and
> > -      * check_same as the page may no longer be mapped.
> > -      */
> > -     if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
> > -             page = pmd_page(*vmf->pmd);
> > -             if (!get_page_unless_zero(page))
> > -                     goto out_unlock;
> > +     if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> >               spin_unlock(vmf->ptl);
> > -             put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
> >               goto out;
> >       }
> >
> > -     page = pmd_page(pmd);
> > -     BUG_ON(is_huge_zero_page(page));
> > -     page_nid = page_to_nid(page);
> > -     last_cpupid = page_cpupid_last(page);
> > -     count_vm_numa_event(NUMA_HINT_FAULTS);
> > -     if (page_nid == this_nid) {
> > -             count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> > -             flags |= TNF_FAULT_LOCAL;
> > -     }
> > -
> > -     /* See similar comment in do_numa_page for explanation */
> > -     if (!pmd_savedwrite(pmd))
> > -             flags |= TNF_NO_GROUP;
> > -
> > -     /*
> > -      * Acquire the page lock to serialise THP migrations but avoid dropping
> > -      * page_table_lock if at all possible
> > -      */
> > -     page_locked = trylock_page(page);
> > -     target_nid = mpol_misplaced(page, vma, haddr);
> > -     /* Migration could have started since the pmd_trans_migrating check */
> > -     if (!page_locked) {
> > -             page_nid = NUMA_NO_NODE;
> > -             if (!get_page_unless_zero(page))
> > -                     goto out_unlock;
> > -             spin_unlock(vmf->ptl);
> > -             put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
> > -             goto out;
> > -     } else if (target_nid == NUMA_NO_NODE) {
> > -             /* There are no parallel migrations and page is in the right
> > -              * node. Clear the numa hinting info in this pmd.
> > -              */
> > -             goto clear_pmdnuma;
> > -     }
> > -
> > -     /*
> > -      * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
> > -      * to serialises splits
> > -      */
> > -     get_page(page);
> > -     spin_unlock(vmf->ptl);
> > -     anon_vma = page_lock_anon_vma_read(page);
> > -
> > -     /* Confirm the PMD did not change while page_table_lock was released */
> > -     spin_lock(vmf->ptl);
> > -     if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> > -             unlock_page(page);
> > -             put_page(page);
> > -             page_nid = NUMA_NO_NODE;
> > -             goto out_unlock;
> > -     }
> > -
> > -     /* Bail if we fail to protect against THP splits for any reason */
> > -     if (unlikely(!anon_vma)) {
> > -             put_page(page);
> > -             page_nid = NUMA_NO_NODE;
> > -             goto clear_pmdnuma;
> > -     }
> > -
> >       /*
> >        * Since we took the NUMA fault, we must have observed the !accessible
> >        * bit. Make sure all other CPUs agree with that, to avoid them
> > @@ -1531,43 +1459,60 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >                                             haddr + HPAGE_PMD_SIZE);
> >       }
> >
> > -     /*
> > -      * Migrate the THP to the requested node, returns with page unlocked
> > -      * and access rights restored.
> > -      */
> > +     oldpmd = pmd_modify(pmd, vma->vm_page_prot);
> > +     page = vm_normal_page_pmd(vma, haddr, oldpmd);
> > +     if (!page) {
> > +             spin_unlock(vmf->ptl);
> > +             goto out_map;
> > +     }
> > +
> > +     /* See similar comment in do_numa_page for explanation */
> > +     if (!was_writable)
> > +             flags |= TNF_NO_GROUP;
> > +
> > +     page_nid = page_to_nid(page);
> > +     last_cpupid = page_cpupid_last(page);
> > +     target_nid = numa_migrate_prep(page, vma, haddr, page_nid,
> > +                                    &flags);
> > +
> > +     if (target_nid == NUMA_NO_NODE) {
> > +             put_page(page);
> > +             goto out_map;
> > +     }
> > +
> >       spin_unlock(vmf->ptl);
> >
> > -     migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
> > -                             vmf->pmd, pmd, vmf->address, page, target_nid);
> > +     migrated = migrate_misplaced_page(page, vma, target_nid);
> >       if (migrated) {
> >               flags |= TNF_MIGRATED;
> >               page_nid = target_nid;
> > -     } else
> > +     } else {
> >               flags |= TNF_MIGRATE_FAIL;
> > +             vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > +             if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> > +                     spin_unlock(vmf->ptl);
> > +                     goto out;
> > +             }
> > +             goto out_map;
> > +     }
> >
> > -     goto out;
> > -clear_pmdnuma:
> > -     BUG_ON(!PageLocked(page));
> > -     was_writable = pmd_savedwrite(pmd);
> > +out:
> > +     if (page_nid != NUMA_NO_NODE)
> > +             task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
> > +                             flags);
> > +
> > +     return 0;
> > +
> > +out_map:
> > +     /* Restore the PMD */
> >       pmd = pmd_modify(pmd, vma->vm_page_prot);
> >       pmd = pmd_mkyoung(pmd);
> >       if (was_writable)
> >               pmd = pmd_mkwrite(pmd);
> >       set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
> >       update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
> > -     unlock_page(page);
> > -out_unlock:
> >       spin_unlock(vmf->ptl);
> > -
> > -out:
> > -     if (anon_vma)
> > -             page_unlock_anon_vma_read(anon_vma);
> > -
> > -     if (page_nid != NUMA_NO_NODE)
> > -             task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
> > -                             flags);
> > -
> > -     return 0;
> > +     goto out;
> >  }
> >
>
> [snip]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count
  2021-04-14  3:00   ` Huang, Ying
  2021-04-14 15:02     ` Zi Yan
@ 2021-04-14 17:23     ` Yang Shi
  1 sibling, 0 replies; 26+ messages in thread
From: Yang Shi @ 2021-04-14 17:23 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko,
	Hugh Dickins, Gerald Schaefer, hca, gor, borntraeger,
	Andrew Morton, Linux MM, linux-s390, Linux Kernel Mailing List

On Tue, Apr 13, 2021 at 8:00 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > The generic migration path will check the refcount, so there is no need to
> > check it here.  But the old code actually prevents migrating shared THP
> > (mapped by multiple processes), so bail out early if the mapcount is > 1 to
> > keep that behavior.
>
> What prevents us from migrating shared THP?  If nothing does, why not just
> remove the old refcount check?

We could migrate shared THP if we didn't care about the bouncing back
and forth between nodes that Zi Yan described. The other reason is that,
as I mentioned in the cover letter, I'd like to keep the behavior as
consistent as possible between before and after for now. The old behavior
does prevent migrating shared THP, so I preserved that in this series.
We could definitely optimize the behavior later on.

>
> Best Regards,
> Huang, Ying
>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/migrate.c | 16 ++++------------
> >  1 file changed, 4 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index a72994c68ec6..dc7cc7f3a124 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2067,6 +2067,10 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
> >
> >       VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
> >
> > +     /* Do not migrate THP mapped by multiple processes */
> > +     if (PageTransHuge(page) && page_mapcount(page) > 1)
> > +             return 0;
> > +
> >       /* Avoid migrating to a node that is nearly full */
> >       if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
> >               return 0;
> > @@ -2074,18 +2078,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
> >       if (isolate_lru_page(page))
> >               return 0;
> >
> > -     /*
> > -      * migrate_misplaced_transhuge_page() skips page migration's usual
> > -      * check on page_count(), so we must do it here, now that the page
> > -      * has been isolated: a GUP pin, or any other pin, prevents migration.
> > -      * The expected page count is 3: 1 for page's mapcount and 1 for the
> > -      * caller's pin and 1 for the reference taken by isolate_lru_page().
> > -      */
> > -     if (PageTransHuge(page) && page_count(page) != 3) {
> > -             putback_lru_page(page);
> > -             return 0;
> > -     }
> > -
> >       page_lru = page_is_file_lru(page);
> >       mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
> >                               thp_nr_pages(page));

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count
  2021-04-14 15:02     ` Zi Yan
@ 2021-04-15  6:45       ` Huang, Ying
  2021-04-15 18:57         ` Zi Yan
  0 siblings, 1 reply; 26+ messages in thread
From: Huang, Ying @ 2021-04-15  6:45 UTC (permalink / raw)
  To: Zi Yan
  Cc: Yang Shi, mgorman, kirill.shutemov, mhocko, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm, linux-mm,
	linux-s390, linux-kernel

"Zi Yan" <ziy@nvidia.com> writes:

> On 13 Apr 2021, at 23:00, Huang, Ying wrote:
>
>> Yang Shi <shy828301@gmail.com> writes:
>>
>>> The generic migration path will check the refcount, so there is no need to
>>> check it here.  But the old code actually prevents migrating shared THP
>>> (mapped by multiple processes), so bail out early if the mapcount is > 1 to
>>> keep that behavior.
>>
>> What prevents us from migrating shared THP?  If nothing does, why not just
>> remove the old refcount check?
>
> If two or more processes run on different NUMA nodes, a THP shared by them can be
> migrated back and forth between NUMA nodes, which is quite costly. Unless we have
> a better way of figuring out a good location for such pages to reduce the number
> of migrations, it might be better not to move them, right?
>

A mechanism is already provided in should_numa_migrate_memory() to
distinguish shared pages from private pages.  Have you found that it
doesn't work well in some situations?
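
Roughly, the part of that mechanism I have in mind (heavily simplified
from should_numa_migrate_memory() in kernel/sched/fair.c; see the real
source for the full logic):

        int dst_nid = cpu_to_node(dst_cpu);

        /* Record who faults now; fetch who touched the page last time. */
        this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
        last_cpupid = page_cpupid_xchg_last(page, this_cpupid);

        /*
         * Multi-stage filter: only migrate if the previous access also
         * came from the destination node, so pages bouncing between
         * nodes tend to stay where they are.
         */
        if (!cpupid_pid_unset(last_cpupid) &&
            cpupid_to_nid(last_cpupid) != dst_nid)
                return false;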

Multiple threads in one process running on different NUMA nodes may
share pages too, so excluding pages shared by multiple processes isn't
a good solution.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count
  2021-04-15  6:45       ` Huang, Ying
@ 2021-04-15 18:57         ` Zi Yan
  0 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2021-04-15 18:57 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yang Shi, mgorman, kirill.shutemov, mhocko, hughd,
	gerald.schaefer, hca, gor, borntraeger, akpm, linux-mm,
	linux-s390, linux-kernel

On 15 Apr 2021, at 2:45, Huang, Ying wrote:

> "Zi Yan" <ziy@nvidia.com> writes:
>
>> On 13 Apr 2021, at 23:00, Huang, Ying wrote:
>>
>>> Yang Shi <shy828301@gmail.com> writes:
>>>
>>>> The generic migration path will check the refcount, so there is no need to
>>>> check it here.  But the old code actually prevents migrating shared THP
>>>> (mapped by multiple processes), so bail out early if the mapcount is > 1 to
>>>> keep that behavior.
>>>
>>> What prevents us from migrating shared THP?  If nothing does, why not just
>>> remove the old refcount check?
>>
>> If two or more processes run on different NUMA nodes, a THP shared by them can be
>> migrated back and forth between NUMA nodes, which is quite costly. Unless we have
>> a better way of figuring out a good location for such pages to reduce the number
>> of migrations, it might be better not to move them, right?
>>
>
> A mechanism is already provided in should_numa_migrate_memory() to
> distinguish shared pages from private pages.  Have you found that it
> doesn't work well in some situations?
>
> Multiple threads in one process running on different NUMA nodes may
> share pages too, so excluding pages shared by multiple processes isn't
> a good solution.

After rechecking the patch, it seems that skipping shared THP migration here
is a side effect of the original page_count check, which might not have been
intended and could be worth fixing. But Yang just wants to solve one problem
at a time, namely simplifying THP NUMA migration. Maybe a separate patch
would be better for both discussing and fixing this problem.


—
Best Regards,
Yan Zi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault
  2021-04-13 21:24 [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
                   ` (6 preceding siblings ...)
  2021-04-13 21:24 ` [v2 PATCH 7/7] mm: thp: skip make PMD PROT_NONE if THP migration is not supported Yang Shi
@ 2021-05-03 21:58 ` Yang Shi
  7 siblings, 0 replies; 26+ messages in thread
From: Yang Shi @ 2021-05-03 21:58 UTC (permalink / raw)
  To: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, Gerald Schaefer, hca, gor, borntraeger,
	Andrew Morton
  Cc: Linux MM, linux-s390, Linux Kernel Mailing List

Gentle ping.

I also did some tests to measure the latency of do_huge_pmd_numa_page.
The test VM has 80 vcpus and 64G memory. The test creates 2 processes
that together consume 128G of memory, which incurs memory pressure and
causes THP splits. It also creates 80 processes to hog the CPUs, and
the memory consumer processes are bound to different nodes periodically
in order to increase NUMA faults.

The below test script is used:

echo 3 > /proc/sys/vm/drop_caches

# Run stress-ng for 24 hours
./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
PID=$!

./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &

# Wait for vm stressors forked
sleep 5

PID_1=`pgrep -P $PID | awk 'NR == 1'`
PID_2=`pgrep -P $PID | awk 'NR == 2'`

JOB1=`pgrep -P $PID_1`
JOB2=`pgrep -P $PID_2`

# Bind load jobs to different nodes periodically to force generate
# cross node memory access
while [ -d "/proc/$PID" ]
do
        taskset -apc 8 $JOB1
        taskset -apc 8 $JOB2
        sleep 300
        taskset -apc 58 $JOB1
        taskset -apc 58 $JOB2
        sleep 300
done

With the above test the latency histogram of do_huge_pmd_numa_page is
as shown below. Since the number of do_huge_pmd_numa_page calls varies
drastically from run to run (presumably due to the scheduler), I
converted the raw numbers to percentages.

                    patched        base
@us[stress-ng]:
[0]                   3.57%       0.16%
[1]                  55.68%      18.36%
[2, 4)               10.46%      40.44%
[4, 8)                7.26%      17.82%
[8, 16)              21.12%      13.41%
[16, 32)              1.06%       4.27%
[32, 64)              0.56%       4.07%
[64, 128)             0.16%       0.35%
[128, 256)          < 0.1%      < 0.1%
[256, 512)          < 0.1%      < 0.1%
[512, 1K)           < 0.1%      < 0.1%
[1K, 2K)            < 0.1%      < 0.1%
[2K, 4K)            < 0.1%      < 0.1%
[4K, 8K)            < 0.1%      < 0.1%
[8K, 16K)           < 0.1%      < 0.1%
[16K, 32K)          < 0.1%      < 0.1%
[32K, 64K)          < 0.1%      < 0.1%
Per the result, the patched kernel is even slightly better than the base
kernel. I think this is because lock contention against THP split is
lower than in the base kernel due to the refactor.


To exclude the effect of THP splits, I also tested without memory
pressure. No obvious regression was spotted. Below is the test result
*without* memory pressure.
                    patched        base
@us[stress-ng]:
[0]                   7.97%      18.4%
[1]                  69.63%      58.24%
[2, 4)                4.18%       2.63%
[4, 8)                0.22%       0.17%
[8, 16)               1.03%       0.92%
[16, 32)              0.14%      < 0.1%
[32, 64)            < 0.1%      < 0.1%
[64, 128)           < 0.1%      < 0.1%
[128, 256)          < 0.1%      < 0.1%
[256, 512)            0.45%       1.19%
[512, 1K)            15.45%      17.27%
[1K, 2K)            < 0.1%      < 0.1%
[2K, 4K)            < 0.1%      < 0.1%
[4K, 8K)            < 0.1%      < 0.1%
[8K, 16K)             0.86%       0.88%
[16K, 32K)          < 0.1%       0.15%
[32K, 64K)          < 0.1%      < 0.1%
[64K, 128K)         < 0.1%      < 0.1%
[128K, 256K)        < 0.1%      < 0.1%


On Tue, Apr 13, 2021 at 2:24 PM Yang Shi <shy828301@gmail.com> wrote:
>
>
> Changelog:
> v1 --> v2:
>     * Adopted the suggestion from Gerald Schaefer to skip huge PMD for S390
>       for now.
>     * Used PageTransHuge to distinguish base page or THP instead of a new
>       parameter for migrate_misplaced_page() per Huang Ying.
>     * Restored PMD lazily to avoid unnecessary TLB shootdown per Huang Ying.
>     * Skipped shared THP.
>     * Updated counters correctly.
>     * Rebased to linux-next (next-20210412).
>
> When the THP NUMA fault support was added THP migration was not supported yet.
> So the ad hoc THP migration was implemented in NUMA fault handling.  Since v4.14
> THP migration has been supported so it doesn't make too much sense to still keep
> another THP migration implementation rather than using the generic migration
> code.  It is definitely a maintenance burden to keep two THP migration
> implementation for different code paths and it is more error prone.  Using the
> generic THP migration implementation allows us remove the duplicate code and
> some hacks needed by the old ad hoc implementation.
>
> A quick grep shows x86_64, PowerPC (book3s), ARM64 ans S390 support both THP
> and NUMA balancing.  The most of them support THP migration except for S390.
> Zi Yan tried to add THP migration support for S390 before but it was not
> accepted due to the design of S390 PMD.  For the discussion, please see:
> https://lkml.org/lkml/2018/4/27/953.
>
> Per the discussion with Gerald Schaefer in v1 it is acceptible to skip huge
> PMD for S390 for now.
>
> I saw there were some hacks about gup from git history, but I didn't figure out
> if they have been removed or not since I just found FOLL_NUMA code in the current
> gup implementation and they seems useful.
>
> I'm trying to keep the behavior as consistent as possible between before and after.
> But there is still some minor disparity.  For example, file THP won't
> get migrated at all in old implementation due to the anon_vma check, but
> the new implementation doesn't need acquire anon_vma lock anymore, so
> file THP might get migrated.  Not sure if this behavior needs to be
> kept.
>
> Patch #1 ~ #2 are preparation patches.
> Patch #3 is the real meat.
> Patch #4 ~ #6 keep consistent counters and behaviors with before.
> Patch #7 skips change huge PMD to prot_none if thp migration is not supported.
>
> Yang Shi (7):
>       mm: memory: add orig_pmd to struct vm_fault
>       mm: memory: make numa_migrate_prep() non-static
>       mm: thp: refactor NUMA fault handling
>       mm: migrate: account THP NUMA migration counters correctly
>       mm: migrate: don't split THP for misplaced NUMA page
>       mm: migrate: check mapcount for THP instead of ref count
>       mm: thp: skip make PMD PROT_NONE if THP migration is not supported
>
>  include/linux/huge_mm.h |   9 ++---
>  include/linux/migrate.h |  23 -----------
>  include/linux/mm.h      |   3 ++
>  mm/huge_memory.c        | 156 +++++++++++++++++++++++++-----------------------------------------------
>  mm/internal.h           |  21 ++--------
>  mm/memory.c             |  31 +++++++--------
>  mm/migrate.c            | 204 +++++++++++++++++++++--------------------------------------------------------------------------
>  7 files changed, 123 insertions(+), 324 deletions(-)
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault
  2021-04-13 21:24 ` [v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault Yang Shi
@ 2021-05-17 15:09   ` Mel Gorman
  2021-05-17 19:39     ` Yang Shi
  0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-05-17 15:09 UTC (permalink / raw)
  To: Yang Shi
  Cc: kirill.shutemov, ziy, mhocko, ying.huang, hughd, gerald.schaefer,
	hca, gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel

On Tue, Apr 13, 2021 at 02:24:10PM -0700, Yang Shi wrote:
> Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by the
> huge page fault path can be removed, just like its PTE counterpart.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> <SNIP>
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 25b9041f9925..9c5856f8cc81 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -547,6 +547,9 @@ struct vm_fault {
>  					 * the 'address'
>  					 */
>  	pte_t orig_pte;			/* Value of PTE at the time of fault */
> +	pmd_t orig_pmd;			/* Value of PMD at the time of fault,
> +					 * used by PMD fault only.
> +					 */
>  
>  	struct page *cow_page;		/* Page handler may use for COW fault */
>  	struct page *page;		/* ->fault handlers should return a

Could this be a union?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 2/7] mm: memory: make numa_migrate_prep() non-static
  2021-04-13 21:24 ` [v2 PATCH 2/7] mm: memory: make numa_migrate_prep() non-static Yang Shi
@ 2021-05-17 15:11   ` Mel Gorman
  0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-05-17 15:11 UTC (permalink / raw)
  To: Yang Shi
  Cc: kirill.shutemov, ziy, mhocko, ying.huang, hughd, gerald.schaefer,
	hca, gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel

On Tue, Apr 13, 2021 at 02:24:11PM -0700, Yang Shi wrote:
> numa_migrate_prep() will be used by the huge NUMA fault path as well in
> the following patch, so make it non-static.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling
  2021-04-13 21:24 ` [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling Yang Shi
  2021-04-14  2:43   ` Huang, Ying
@ 2021-05-17 15:27   ` Mel Gorman
  2021-05-17 19:41     ` Yang Shi
  1 sibling, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-05-17 15:27 UTC (permalink / raw)
  To: Yang Shi
  Cc: kirill.shutemov, ziy, mhocko, ying.huang, hughd, gerald.schaefer,
	hca, gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel

On Tue, Apr 13, 2021 at 02:24:12PM -0700, Yang Shi wrote:
> When the THP NUMA fault support was added THP migration was not supported yet.
> So the ad hoc THP migration was implemented in NUMA fault handling.  Since v4.14
> THP migration has been supported so it doesn't make too much sense to still keep
> another THP migration implementation rather than using the generic migration
> code.
> 
> This patch reworked the NUMA fault handling to use the generic migration
> implementation to migrate misplaced pages.  There is no functional change.
> 
> After the refactor the flow of NUMA fault handling looks just like its
> PTE counterpart:
>   Acquire ptl
>   Prepare for migration (elevate page refcount)
>   Release ptl
>   Isolate page from lru and elevate page refcount
>   Migrate the misplaced THP
> 
> If migration fails, just restore the old normal PMD.
> 
> In the old code the anon_vma lock was needed to serialize THP migration
> against THP split, but the THP code has been reworked a lot since then,
> and it seems the anon_vma lock is no longer required to avoid the race.
> 
> Elevating the page refcount while holding the ptl should prevent THP
> split.
> 
> Use migrate_misplaced_page() for both base page and THP NUMA hinting
> faults and remove all the dead and duplicate code.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>

I did not spot any big problems and FWIW, the series overall passed a
series of tests that exercise NUMA balancing migrations so...

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 4/7] mm: migrate: account THP NUMA migration counters correctly
  2021-04-13 21:24 ` [v2 PATCH 4/7] mm: migrate: account THP NUMA migration counters correctly Yang Shi
@ 2021-05-17 15:28   ` Mel Gorman
  0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-05-17 15:28 UTC (permalink / raw)
  To: Yang Shi
  Cc: kirill.shutemov, ziy, mhocko, ying.huang, hughd, gerald.schaefer,
	hca, gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel

On Tue, Apr 13, 2021 at 02:24:13PM -0700, Yang Shi wrote:
> Now that both base page and THP NUMA migration are done via
> migrate_misplaced_page(), keep the counters correct for THP.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 5/7] mm: migrate: don't split THP for misplaced NUMA page
  2021-04-13 21:24 ` [v2 PATCH 5/7] mm: migrate: don't split THP for misplaced NUMA page Yang Shi
@ 2021-05-17 15:29   ` Mel Gorman
  0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-05-17 15:29 UTC (permalink / raw)
  To: Yang Shi
  Cc: kirill.shutemov, ziy, mhocko, ying.huang, hughd, gerald.schaefer,
	hca, gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel

On Tue, Apr 13, 2021 at 02:24:14PM -0700, Yang Shi wrote:
> The old behavior didn't split THP if migration failed due to lack of
> memory on the target node.  But generic THP migration does split THP, so
> keep the old behavior for misplaced NUMA page migration.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 7/7] mm: thp: skip make PMD PROT_NONE if THP migration is not supported
  2021-04-13 21:24 ` [v2 PATCH 7/7] mm: thp: skip make PMD PROT_NONE if THP migration is not supported Yang Shi
@ 2021-05-17 15:30   ` Mel Gorman
  0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2021-05-17 15:30 UTC (permalink / raw)
  To: Yang Shi
  Cc: kirill.shutemov, ziy, mhocko, ying.huang, hughd, gerald.schaefer,
	hca, gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel

On Tue, Apr 13, 2021 at 02:24:16PM -0700, Yang Shi wrote:
> A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both
> NUMA balancing and THP.  But S390 doesn't support THP migration so NUMA
> balancing actually can't migrate any misplaced pages.
> 
> Skip making the PMD PROT_NONE in that case, otherwise CPU cycles may be
> wasted by pointless NUMA hinting faults on S390.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault
  2021-05-17 15:09   ` Mel Gorman
@ 2021-05-17 19:39     ` Yang Shi
  2021-05-18  7:36       ` Mel Gorman
  0 siblings, 1 reply; 26+ messages in thread
From: Yang Shi @ 2021-05-17 19:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, Gerald Schaefer, hca, gor, borntraeger,
	Andrew Morton, Linux MM, linux-s390, Linux Kernel Mailing List

On Mon, May 17, 2021 at 8:09 AM Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Apr 13, 2021 at 02:24:10PM -0700, Yang Shi wrote:
> > Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by the
> > huge page fault path can be removed, just like its PTE counterpart.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> >
> > <SNIP>
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 25b9041f9925..9c5856f8cc81 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -547,6 +547,9 @@ struct vm_fault {
> >                                        * the 'address'
> >                                        */
> >       pte_t orig_pte;                 /* Value of PTE at the time of fault */
> > +     pmd_t orig_pmd;                 /* Value of PMD at the time of fault,
> > +                                      * used by PMD fault only.
> > +                                      */
> >
> >       struct page *cow_page;          /* Page handler may use for COW fault */
> >       struct page *page;              /* ->fault handlers should return a
>
> Could this be a union?

Do you mean orig_pte and orig_pmd, or cow_page and page? I think
orig_pte and orig_pmd could be a union; I don't see them being used
at the same time in vm_fault.

But cow_page and page can't be, since do_cow_fault() needs to copy data
from page to cow_page.
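
For reference, this is the step I mean, roughly as it appears in
do_cow_fault() in mm/memory.c:

        /*
         * Both pages are live at the same time here, so they cannot
         * share a slot in struct vm_fault.
         */
        copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
        __SetPageUptodate(vmf->cow_page);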

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling
  2021-05-17 15:27   ` Mel Gorman
@ 2021-05-17 19:41     ` Yang Shi
  0 siblings, 0 replies; 26+ messages in thread
From: Yang Shi @ 2021-05-17 19:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, Gerald Schaefer, hca, gor, borntraeger,
	Andrew Morton, Linux MM, linux-s390, Linux Kernel Mailing List

On Mon, May 17, 2021 at 8:27 AM Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Apr 13, 2021 at 02:24:12PM -0700, Yang Shi wrote:
> > When the THP NUMA fault support was added THP migration was not supported yet.
> > So the ad hoc THP migration was implemented in NUMA fault handling.  Since v4.14
> > THP migration has been supported so it doesn't make too much sense to still keep
> > another THP migration implementation rather than using the generic migration
> > code.
> >
> > This patch reworked the NUMA fault handling to use the generic migration
> > implementation to migrate misplaced pages.  There is no functional change.
> >
> > After the refactor the flow of NUMA fault handling looks just like its
> > PTE counterpart:
> >   Acquire ptl
> >   Prepare for migration (elevate page refcount)
> >   Release ptl
> >   Isolate page from lru and elevate page refcount
> >   Migrate the misplaced THP
> >
> > If migration fails, just restore the old normal PMD.
> >
> > In the old code the anon_vma lock was needed to serialize THP migration
> > against THP split, but the THP code has been reworked a lot since then,
> > and it seems the anon_vma lock is no longer required to avoid the race.
> >
> > Elevating the page refcount while holding the ptl should prevent THP
> > split.
> >
> > Use migrate_misplaced_page() for both base page and THP NUMA hinting
> > faults and remove all the dead and duplicate code.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> I did not spot any big problems and FWIW, the series overall passed a
> series of tests that exercise NUMA balancing migrations so...
>
> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks a lot for testing and reviewing the series.

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault
  2021-05-17 19:39     ` Yang Shi
@ 2021-05-18  7:36       ` Mel Gorman
  2021-05-18 17:03         ` Yang Shi
  0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2021-05-18  7:36 UTC (permalink / raw)
  To: Yang Shi
  Cc: Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, Gerald Schaefer, hca, gor, borntraeger,
	Andrew Morton, Linux MM, linux-s390, Linux Kernel Mailing List

On Mon, May 17, 2021 at 12:39:49PM -0700, Yang Shi wrote:
> On Mon, May 17, 2021 at 8:09 AM Mel Gorman <mgorman@suse.de> wrote:
> >
> > On Tue, Apr 13, 2021 at 02:24:10PM -0700, Yang Shi wrote:
> > > Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by the
> > > huge page fault path can be removed, just like its PTE counterpart.
> > >
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > >
> > > <SNIP>
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 25b9041f9925..9c5856f8cc81 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -547,6 +547,9 @@ struct vm_fault {
> > >                                        * the 'address'
> > >                                        */
> > >       pte_t orig_pte;                 /* Value of PTE at the time of fault */
> > > +     pmd_t orig_pmd;                 /* Value of PMD at the time of fault,
> > > +                                      * used by PMD fault only.
> > > +                                      */
> > >
> > >       struct page *cow_page;          /* Page handler may use for COW fault */
> > >       struct page *page;              /* ->fault handlers should return a
> >
> > Could this be a union?
> 
> Do you mean orig_pte and orig_pmd, or cow_page and page?

orig_pte and orig_pmd, given that one is for PTE faults and one is for
PMD faults, and it's very unlikely they would both need to be considered
during a single fault.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault
  2021-05-18  7:36       ` Mel Gorman
@ 2021-05-18 17:03         ` Yang Shi
  0 siblings, 0 replies; 26+ messages in thread
From: Yang Shi @ 2021-05-18 17:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, Gerald Schaefer, hca, gor, borntraeger,
	Andrew Morton, Linux MM, linux-s390, Linux Kernel Mailing List

On Tue, May 18, 2021 at 12:36 AM Mel Gorman <mgorman@suse.de> wrote:
>
> On Mon, May 17, 2021 at 12:39:49PM -0700, Yang Shi wrote:
> > On Mon, May 17, 2021 at 8:09 AM Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > On Tue, Apr 13, 2021 at 02:24:10PM -0700, Yang Shi wrote:
> > > > Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by the
> > > > huge page fault path can be removed, just like its PTE counterpart.
> > > >
> > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > >
> > > > <SNIP>
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 25b9041f9925..9c5856f8cc81 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -547,6 +547,9 @@ struct vm_fault {
> > > >                                        * the 'address'
> > > >                                        */
> > > >       pte_t orig_pte;                 /* Value of PTE at the time of fault */
> > > > +     pmd_t orig_pmd;                 /* Value of PMD at the time of fault,
> > > > +                                      * used by PMD fault only.
> > > > +                                      */
> > > >
> > > >       struct page *cow_page;          /* Page handler may use for COW fault */
> > > >       struct page *page;              /* ->fault handlers should return a
> > >
> > > Could this be a union?
> >
> > Do you mean orig_pte and orig_pmd, or cow_page and page?
>
> orig_pte and orig_pmd given that one for PTE faults and one is for PMD
> faults and it's very unlikely they would both need to be considered during
> a single fault.

Yes, agreed.
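
For the record, a rough sketch of what that union might look like in
struct vm_fault (illustrative only, not an actual patch):

        union {
                pte_t orig_pte; /* Value of PTE at the time of fault */
                pmd_t orig_pmd; /* Value of PMD at the time of fault,
                                 * used by PMD fault only.
                                 */
        };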

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-05-18 17:03 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-13 21:24 [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
2021-04-13 21:24 ` [v2 PATCH 1/7] mm: memory: add orig_pmd to struct vm_fault Yang Shi
2021-05-17 15:09   ` Mel Gorman
2021-05-17 19:39     ` Yang Shi
2021-05-18  7:36       ` Mel Gorman
2021-05-18 17:03         ` Yang Shi
2021-04-13 21:24 ` [v2 PATCH 2/7] mm: memory: make numa_migrate_prep() non-static Yang Shi
2021-05-17 15:11   ` Mel Gorman
2021-04-13 21:24 ` [v2 PATCH 3/7] mm: thp: refactor NUMA fault handling Yang Shi
2021-04-14  2:43   ` Huang, Ying
2021-04-14 17:15     ` Yang Shi
2021-05-17 15:27   ` Mel Gorman
2021-05-17 19:41     ` Yang Shi
2021-04-13 21:24 ` [v2 PATCH 4/7] mm: migrate: account THP NUMA migration counters correctly Yang Shi
2021-05-17 15:28   ` Mel Gorman
2021-04-13 21:24 ` [v2 PATCH 5/7] mm: migrate: don't split THP for misplaced NUMA page Yang Shi
2021-05-17 15:29   ` Mel Gorman
2021-04-13 21:24 ` [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count Yang Shi
2021-04-14  3:00   ` Huang, Ying
2021-04-14 15:02     ` Zi Yan
2021-04-15  6:45       ` Huang, Ying
2021-04-15 18:57         ` Zi Yan
2021-04-14 17:23     ` Yang Shi
2021-04-13 21:24 ` [v2 PATCH 7/7] mm: thp: skip make PMD PROT_NONE if THP migration is not supported Yang Shi
2021-05-17 15:30   ` Mel Gorman
2021-05-03 21:58 ` [v2 RFC PATCH 0/7] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
