* [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
From: Yang Shi @ 2021-03-29 18:33 UTC
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd, hca,
	gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel


When the THP NUMA fault support was added, THP migration was not supported
yet, so ad hoc THP migration was implemented in the NUMA fault handling.
THP migration has been supported since v4.14, so it doesn't make much sense
to keep another THP migration implementation rather than using the generic
migration code.  It is definitely a maintenance burden to keep two THP
migration implementations for different code paths, and it is more error
prone.  Using the generic THP migration implementation allows us to remove
the duplicate code and some hacks needed by the old ad hoc implementation.

A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both THP
and NUMA balancing.  Most of them support THP migration except for S390.
Zi Yan tried to add THP migration support for S390 before, but it was not
accepted due to the design of the S390 PMD.  For the discussion, please see:
https://lkml.org/lkml/2018/4/27/953.

I'm no expert on S390 so I'm not sure whether it is feasible to support THP
migration for S390.  If it is not feasible then this patchset may leave THP
NUMA balancing non-functional on S390.  I'm not sure whether that is a show
stopper, although the patchset does simplify the code a lot.  Anyway, it
seems worth posting the series to the mailing list to get some feedback.

Patches #1 ~ #3 are preparation and cleanup patches.
Patch #4 is the real meat.
Patch #5 keeps the THP from being split if migration fails for NUMA hinting.
Patch #6 removes a hack around the page refcount.

I saw some hacks around gup in the git history, but I couldn't figure out
whether they have all been removed, since I did find FOLL_NUMA code in the
current gup implementation and it seems useful.
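
For reference, this is roughly the FOLL_NUMA handling I mean (condensed from
mm/gup.c in the current tree):

	/*
	 * In __get_user_pages(): FOLL_NUMA is set by default, so gup will
	 * not follow protnone (NUMA hinting) entries unless the caller
	 * passes FOLL_FORCE.
	 */
	if (!(gup_flags & FOLL_FORCE))
		gup_flags |= FOLL_NUMA;

	/*
	 * In follow_page_pte(): a protnone pte is treated as "no page" when
	 * FOLL_NUMA is set, so the caller falls back to the fault path and
	 * takes the NUMA hinting fault instead.
	 */
	if ((flags & FOLL_NUMA) && pte_protnone(pte))
		goto no_page;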


Yang Shi (6):
      mm: memory: add orig_pmd to struct vm_fault
      mm: memory: make numa_migrate_prep() non-static
      mm: migrate: teach migrate_misplaced_page() about THP
      mm: thp: refactor NUMA fault handling
      mm: migrate: don't split THP for misplaced NUMA page
      mm: migrate: remove redundant page count check for THP

 include/linux/huge_mm.h |   9 ++---
 include/linux/migrate.h |  29 ++-------------
 include/linux/mm.h      |   1 +
 mm/huge_memory.c        | 141 +++++++++++++++++++---------------------------------------------------
 mm/internal.h           |   3 ++
 mm/memory.c             |  33 ++++++++---------
 mm/migrate.c            | 190 ++++++++++++++---------------------------------------------------------------------------------
 7 files changed, 94 insertions(+), 312 deletions(-)



* [PATCH 1/6] mm: memory: add orig_pmd to struct vm_fault
From: Yang Shi @ 2021-03-29 18:33 UTC
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd, hca,
	gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

Add orig_pmd to struct vm_fault so that the "orig_pmd" parameter used by the
huge page fault handlers can be removed, just like its PTE counterpart.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/huge_mm.h |  9 ++++-----
 include/linux/mm.h      |  1 +
 mm/huge_memory.c        |  9 ++++++---
 mm/memory.c             | 26 +++++++++++++-------------
 4 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ba973efcd369..5650db25a49d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,7 +11,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma);
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
+void huge_pmd_set_accessed(struct vm_fault *vmf);
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma);
@@ -24,7 +24,7 @@ static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 }
 #endif
 
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmd,
 				   unsigned int flags);
@@ -286,7 +286,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
+vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
 extern struct page *huge_zero_page;
 
@@ -432,8 +432,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 	return NULL;
 }
 
-static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf,
-		pmd_t orig_pmd)
+static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
 	return 0;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..899f55d46fba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -528,6 +528,7 @@ struct vm_fault {
 					 * the 'address'
 					 */
 	pte_t orig_pte;			/* Value of PTE at the time of fault */
+	pmd_t orig_pmd;			/* Value of PMD at the time of fault */
 
 	struct page *cow_page;		/* Page handler may use for COW fault */
 	struct page *page;		/* ->fault handlers should return a
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae907a9c2050..53f3843ce72a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1252,11 +1252,12 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 }
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
+void huge_pmd_set_accessed(struct vm_fault *vmf)
 {
 	pmd_t entry;
 	unsigned long haddr;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
+	pmd_t orig_pmd = vmf->orig_pmd;
 
 	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
@@ -1273,11 +1274,12 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
 	spin_unlock(vmf->ptl);
 }
 
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	pmd_t orig_pmd = vmf->orig_pmd;
 
 	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1413,9 +1415,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
+vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	pmd_t pmd = vmf->orig_pmd;
 	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
diff --git a/mm/memory.c b/mm/memory.c
index 5efa07fb6cdc..33be5811ac65 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4193,12 +4193,12 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 }
 
 /* `inline' is required to avoid gcc 4.1.2 build error */
-static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
+static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 {
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
+		if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
-		return do_huge_pmd_wp_page(vmf, orig_pmd);
+		return do_huge_pmd_wp_page(vmf);
 	}
 	if (vmf->vma->vm_ops->huge_fault) {
 		vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
@@ -4425,26 +4425,26 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 	} else {
-		pmd_t orig_pmd = *vmf.pmd;
+		vmf.orig_pmd = *vmf.pmd;
 
 		barrier();
-		if (unlikely(is_swap_pmd(orig_pmd))) {
+		if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
 			VM_BUG_ON(thp_migration_supported() &&
-					  !is_pmd_migration_entry(orig_pmd));
-			if (is_pmd_migration_entry(orig_pmd))
+					  !is_pmd_migration_entry(vmf.orig_pmd));
+			if (is_pmd_migration_entry(vmf.orig_pmd))
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
 		}
-		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
-			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
-				return do_huge_pmd_numa_page(&vmf, orig_pmd);
+		if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
+			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
+				return do_huge_pmd_numa_page(&vmf);
 
-			if (dirty && !pmd_write(orig_pmd)) {
-				ret = wp_huge_pmd(&vmf, orig_pmd);
+			if (dirty && !pmd_write(vmf.orig_pmd)) {
+				ret = wp_huge_pmd(&vmf);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
 			} else {
-				huge_pmd_set_accessed(&vmf, orig_pmd);
+				huge_pmd_set_accessed(&vmf);
 				return 0;
 			}
 		}
-- 
2.26.2



* [PATCH 2/6] mm: memory: make numa_migrate_prep() non-static
From: Yang Shi @ 2021-03-29 18:33 UTC
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd, hca,
	gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

numa_migrate_prep() will be used by the huge page NUMA fault handler as well
in a following patch, so make it non-static.
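
For reference, the helper body itself is unchanged by this patch; it looks
roughly like this in mm/memory.c:

	int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
			      unsigned long addr, int page_nid, int *flags)
	{
		/* Pin the page so it cannot be freed while we decide */
		get_page(page);

		count_vm_numa_event(NUMA_HINT_FAULTS);
		if (page_nid == numa_node_id()) {
			count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
			*flags |= TNF_FAULT_LOCAL;
		}

		/* Returns the target node, or NUMA_NO_NODE if no migration is needed */
		return mpol_misplaced(page, vma, addr);
	}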

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/internal.h | 3 +++
 mm/memory.c   | 5 ++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 1432feec62df..5ac525f364e6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -618,4 +618,7 @@ struct migration_target_control {
 	gfp_t gfp_mask;
 };
 
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+		      unsigned long addr, int page_nid, int *flags);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 33be5811ac65..003bbf3187d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4078,9 +4078,8 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	return ret;
 }
 
-static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int page_nid,
-				int *flags)
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+		      unsigned long addr, int page_nid, int *flags)
 {
 	get_page(page);
 
-- 
2.26.2



* [PATCH 3/6] mm: migrate: teach migrate_misplaced_page() about THP
From: Yang Shi @ 2021-03-29 18:33 UTC
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd, hca,
	gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

In the following patch migrate_misplaced_page() will be used to migrate THP
for NUMA fault too.  Prepare it to deal with THP.
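
The PTE path passes compound == false (see the mm/memory.c hunk below); the
THP path added in patch #4 will call it as:

	/* From do_huge_pmd_numa_page() after the refactor in patch #4 */
	migrated = migrate_misplaced_page(page, vma, target_nid, true);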

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/migrate.h | 6 ++++--
 mm/memory.c             | 2 +-
 mm/migrate.c            | 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a389633b68f..6abd34986cc5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -102,14 +102,16 @@ static inline void __ClearPageMovable(struct page *page)
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
-				  struct vm_area_struct *vma, int node);
+				  struct vm_area_struct *vma, int node,
+				  bool compound);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
 {
 	return false;
 }
 static inline int migrate_misplaced_page(struct page *page,
-					 struct vm_area_struct *vma, int node)
+					 struct vm_area_struct *vma, int node,
+					 bool compound)
 {
 	return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index 003bbf3187d4..7fed578bdc31 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4169,7 +4169,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, vma, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid, false);
 	if (migrated) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
diff --git a/mm/migrate.c b/mm/migrate.c
index 62b81d5257aa..9c4ae5132919 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2127,7 +2127,7 @@ static inline bool is_shared_exec_page(struct vm_area_struct *vma,
  * the page that will be dropped by this function before returning.
  */
 int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
-			   int node)
+			   int node, bool compound)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
-- 
2.26.2



* [PATCH 4/6] mm: thp: refactor NUMA fault handling
From: Yang Shi @ 2021-03-29 18:33 UTC
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd, hca,
	gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

When the THP NUMA fault support was added, THP migration was not supported
yet, so ad hoc THP migration was implemented in the NUMA fault handling.
THP migration has been supported since v4.14, so it doesn't make much sense
to keep another THP migration implementation rather than using the generic
migration code.

This patch reworks the NUMA fault handling to use the generic migration
implementation to migrate misplaced pages.  There is no functional change.

After the refactor the flow of NUMA fault handling looks just like its
PTE counterpart (condensed sketch below):
  Acquire ptl
  Restore PMD
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP
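
A condensed sketch of the refactored do_huge_pmd_numa_page(), assembled from
the hunks below with error handling, NUMA flags, and task_numa_fault()
accounting trimmed:

	vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;
		pmd_t pmd = vmf->orig_pmd;
		unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
		bool was_writable = pmd_savedwrite(pmd);
		struct page *page;
		int page_nid, target_nid, flags = 0;

		/* Acquire ptl; bail out if the PMD changed under us */
		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
		if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
			spin_unlock(vmf->ptl);
			return 0;
		}

		/* Restore the PMD, just like the PTE counterpart does */
		pmd = pmd_modify(pmd, vma->vm_page_prot);
		pmd = pmd_mkyoung(pmd);
		if (was_writable)
			pmd = pmd_mkwrite(pmd);
		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);

		/* Prepare for migration: numa_migrate_prep() elevates the refcount */
		page = vm_normal_page_pmd(vma, haddr, pmd);
		page_nid = page_to_nid(page);
		target_nid = numa_migrate_prep(page, vma, haddr, page_nid, &flags);

		/* Release ptl; then isolate the page from the LRU and migrate it */
		spin_unlock(vmf->ptl);
		if (target_nid != NUMA_NO_NODE)
			migrate_misplaced_page(page, vma, target_nid, true);

		return 0;
	}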

In the old code the anon_vma lock was needed to serialize THP migration
against THP split, but since then the THP code has been reworked a lot and
it seems the anon_vma lock is not required anymore to avoid the race.

The page refcount elevation while holding the ptl should prevent the THP
from being split.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/migrate.h |  23 ------
 mm/huge_memory.c        | 132 ++++++++----------------------
 mm/migrate.c            | 173 ++++++----------------------------------
 3 files changed, 57 insertions(+), 271 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 6abd34986cc5..6c8640e9af4f 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -100,15 +100,10 @@ static inline void __ClearPageMovable(struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
 				  struct vm_area_struct *vma, int node,
 				  bool compound);
 #else
-static inline bool pmd_trans_migrating(pmd_t pmd)
-{
-	return false;
-}
 static inline int migrate_misplaced_page(struct page *page,
 					 struct vm_area_struct *vma, int node,
 					 bool compound)
@@ -117,24 +112,6 @@ static inline int migrate_misplaced_page(struct page *page,
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-			struct vm_area_struct *vma,
-			pmd_t *pmd, pmd_t entry,
-			unsigned long address,
-			struct page *page, int node);
-#else
-static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-			struct vm_area_struct *vma,
-			pmd_t *pmd, pmd_t entry,
-			unsigned long address,
-			struct page *page, int node)
-{
-	return -EAGAIN;
-}
-#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
-
-
 #ifdef CONFIG_MIGRATION
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 53f3843ce72a..157c63b0fd95 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1419,94 +1419,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	pmd_t pmd = vmf->orig_pmd;
-	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
-	int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
+	int page_nid = NUMA_NO_NODE;
 	int target_nid, last_cpupid = -1;
-	bool page_locked;
 	bool migrated = false;
-	bool was_writable;
+	bool was_writable = pmd_savedwrite(pmd);
 	int flags = 0;
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-	if (unlikely(!pmd_same(pmd, *vmf->pmd)))
-		goto out_unlock;
-
-	/*
-	 * If there are potential migrations, wait for completion and retry
-	 * without disrupting NUMA hinting information. Do not relock and
-	 * check_same as the page may no longer be mapped.
-	 */
-	if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
-		page = pmd_page(*vmf->pmd);
-		if (!get_page_unless_zero(page))
-			goto out_unlock;
-		spin_unlock(vmf->ptl);
-		put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
-		goto out;
-	}
-
-	page = pmd_page(pmd);
-	BUG_ON(is_huge_zero_page(page));
-	page_nid = page_to_nid(page);
-	last_cpupid = page_cpupid_last(page);
-	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (page_nid == this_nid) {
-		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-		flags |= TNF_FAULT_LOCAL;
-	}
-
-	/* See similar comment in do_numa_page for explanation */
-	if (!pmd_savedwrite(pmd))
-		flags |= TNF_NO_GROUP;
-
-	/*
-	 * Acquire the page lock to serialise THP migrations but avoid dropping
-	 * page_table_lock if at all possible
-	 */
-	page_locked = trylock_page(page);
-	target_nid = mpol_misplaced(page, vma, haddr);
-	if (target_nid == NUMA_NO_NODE) {
-		/* If the page was locked, there are no parallel migrations */
-		if (page_locked)
-			goto clear_pmdnuma;
-	}
-
-	/* Migration could have started since the pmd_trans_migrating check */
-	if (!page_locked) {
-		page_nid = NUMA_NO_NODE;
-		if (!get_page_unless_zero(page))
-			goto out_unlock;
+	if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
 		spin_unlock(vmf->ptl);
-		put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
 		goto out;
 	}
 
-	/*
-	 * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
-	 * to serialises splits
-	 */
-	get_page(page);
-	spin_unlock(vmf->ptl);
-	anon_vma = page_lock_anon_vma_read(page);
-
-	/* Confirm the PMD did not change while page_table_lock was released */
-	spin_lock(vmf->ptl);
-	if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
-		unlock_page(page);
-		put_page(page);
-		page_nid = NUMA_NO_NODE;
-		goto out_unlock;
-	}
-
-	/* Bail if we fail to protect against THP splits for any reason */
-	if (unlikely(!anon_vma)) {
-		put_page(page);
-		page_nid = NUMA_NO_NODE;
-		goto clear_pmdnuma;
-	}
-
 	/*
 	 * Since we took the NUMA fault, we must have observed the !accessible
 	 * bit. Make sure all other CPUs agree with that, to avoid them
@@ -1533,38 +1459,44 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 					      haddr + HPAGE_PMD_SIZE);
 	}
 
-	/*
-	 * Migrate the THP to the requested node, returns with page unlocked
-	 * and access rights restored.
-	 */
-	spin_unlock(vmf->ptl);
-
-	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
-				vmf->pmd, pmd, vmf->address, page, target_nid);
-	if (migrated) {
-		flags |= TNF_MIGRATED;
-		page_nid = target_nid;
-	} else
-		flags |= TNF_MIGRATE_FAIL;
-
-	goto out;
-clear_pmdnuma:
-	BUG_ON(!PageLocked(page));
-	was_writable = pmd_savedwrite(pmd);
+	/* Restore the PMD */
 	pmd = pmd_modify(pmd, vma->vm_page_prot);
 	pmd = pmd_mkyoung(pmd);
 	if (was_writable)
 		pmd = pmd_mkwrite(pmd);
 	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
 	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
-	unlock_page(page);
-out_unlock:
+
+	page = vm_normal_page_pmd(vma, haddr, pmd);
+	if (!page) {
+		spin_unlock(vmf->ptl);
+		goto out;
+	}
+
+	/* See similar comment in do_numa_page for explanation */
+	if (!was_writable)
+		flags |= TNF_NO_GROUP;
+
+	page_nid = page_to_nid(page);
+	last_cpupid = page_cpupid_last(page);
+	target_nid = numa_migrate_prep(page, vma, haddr, page_nid,
+				       &flags);
+
 	spin_unlock(vmf->ptl);
 
-out:
-	if (anon_vma)
-		page_unlock_anon_vma_read(anon_vma);
+	if (target_nid == NUMA_NO_NODE) {
+		put_page(page);
+		goto out;
+	}
+
+	migrated = migrate_misplaced_page(page, vma, target_nid, true);
+	if (migrated) {
+		flags |= TNF_MIGRATED;
+		page_nid = target_nid;
+	} else
+		flags |= TNF_MIGRATE_FAIL;
 
+out:
 	if (page_nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
 				flags);
diff --git a/mm/migrate.c b/mm/migrate.c
index 9c4ae5132919..86325c750c14 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2066,6 +2066,23 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 	return newpage;
 }
 
+static struct page *alloc_misplaced_dst_page_thp(struct page *page,
+						 unsigned long data)
+{
+	int nid = (int) data;
+	struct page *newpage;
+
+	newpage = alloc_pages_node(nid, (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
+				   HPAGE_PMD_ORDER);
+	if (!newpage)
+		goto out;
+
+	prep_transhuge_page(newpage);
+
+out:
+	return newpage;
+}
+
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
 	int page_lru;
@@ -2104,12 +2121,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	return 1;
 }
 
-bool pmd_trans_migrating(pmd_t pmd)
-{
-	struct page *page = pmd_page(pmd);
-	return PageLocked(page);
-}
-
 static inline bool is_shared_exec_page(struct vm_area_struct *vma,
 				       struct page *page)
 {
@@ -2133,6 +2144,12 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	int isolated;
 	int nr_remaining;
 	LIST_HEAD(migratepages);
+	new_page_t *new;
+
+	if (compound)
+		new = alloc_misplaced_dst_page_thp;
+	else
+		new = alloc_misplaced_dst_page;
 
 	/*
 	 * Don't migrate file pages that are mapped in multiple processes
@@ -2153,9 +2170,8 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 		goto out;
 
 	list_add(&page->lru, &migratepages);
-	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
-				     NULL, node, MIGRATE_ASYNC,
-				     MR_NUMA_MISPLACED);
+	nr_remaining = migrate_pages(&migratepages, *new, NULL, node,
+				     MIGRATE_ASYNC, MR_NUMA_MISPLACED);
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
@@ -2174,145 +2190,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	return 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
-
-#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * Migrates a THP to a given target node. page must be locked and is unlocked
- * before returning.
- */
-int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-				struct vm_area_struct *vma,
-				pmd_t *pmd, pmd_t entry,
-				unsigned long address,
-				struct page *page, int node)
-{
-	spinlock_t *ptl;
-	pg_data_t *pgdat = NODE_DATA(node);
-	int isolated = 0;
-	struct page *new_page = NULL;
-	int page_lru = page_is_file_lru(page);
-	unsigned long start = address & HPAGE_PMD_MASK;
-
-	if (is_shared_exec_page(vma, page))
-		goto out;
-
-	new_page = alloc_pages_node(node,
-		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
-		HPAGE_PMD_ORDER);
-	if (!new_page)
-		goto out_fail;
-	prep_transhuge_page(new_page);
-
-	isolated = numamigrate_isolate_page(pgdat, page);
-	if (!isolated) {
-		put_page(new_page);
-		goto out_fail;
-	}
-
-	/* Prepare a page as a migration target */
-	__SetPageLocked(new_page);
-	if (PageSwapBacked(page))
-		__SetPageSwapBacked(new_page);
-
-	/* anon mapping, we can simply copy page->mapping to the new page: */
-	new_page->mapping = page->mapping;
-	new_page->index = page->index;
-	/* flush the cache before copying using the kernel virtual address */
-	flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
-	migrate_page_copy(new_page, page);
-	WARN_ON(PageLRU(new_page));
-
-	/* Recheck the target PMD */
-	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
-		spin_unlock(ptl);
-
-		/* Reverse changes made by migrate_page_copy() */
-		if (TestClearPageActive(new_page))
-			SetPageActive(page);
-		if (TestClearPageUnevictable(new_page))
-			SetPageUnevictable(page);
-
-		unlock_page(new_page);
-		put_page(new_page);		/* Free it */
-
-		/* Retake the callers reference and putback on LRU */
-		get_page(page);
-		putback_lru_page(page);
-		mod_node_page_state(page_pgdat(page),
-			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
-
-		goto out_unlock;
-	}
-
-	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-
-	/*
-	 * Overwrite the old entry under pagetable lock and establish
-	 * the new PTE. Any parallel GUP will either observe the old
-	 * page blocking on the page lock, block on the page table
-	 * lock or observe the new page. The SetPageUptodate on the
-	 * new page and page_add_new_anon_rmap guarantee the copy is
-	 * visible before the pagetable update.
-	 */
-	page_add_anon_rmap(new_page, vma, start, true);
-	/*
-	 * At this point the pmd is numa/protnone (i.e. non present) and the TLB
-	 * has already been flushed globally.  So no TLB can be currently
-	 * caching this non present pmd mapping.  There's no need to clear the
-	 * pmd before doing set_pmd_at(), nor to flush the TLB after
-	 * set_pmd_at().  Clearing the pmd here would introduce a race
-	 * condition against MADV_DONTNEED, because MADV_DONTNEED only holds the
-	 * mmap_lock for reading.  If the pmd is set to NULL at any given time,
-	 * MADV_DONTNEED won't wait on the pmd lock and it'll skip clearing this
-	 * pmd.
-	 */
-	set_pmd_at(mm, start, pmd, entry);
-	update_mmu_cache_pmd(vma, address, &entry);
-
-	page_ref_unfreeze(page, 2);
-	mlock_migrate_page(new_page, page);
-	page_remove_rmap(page, true);
-	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
-
-	spin_unlock(ptl);
-
-	/* Take an "isolate" reference and put new page on the LRU. */
-	get_page(new_page);
-	putback_lru_page(new_page);
-
-	unlock_page(new_page);
-	unlock_page(page);
-	put_page(page);			/* Drop the rmap reference */
-	put_page(page);			/* Drop the LRU isolation reference */
-
-	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
-	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
-
-	mod_node_page_state(page_pgdat(page),
-			NR_ISOLATED_ANON + page_lru,
-			-HPAGE_PMD_NR);
-	return isolated;
-
-out_fail:
-	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-	ptl = pmd_lock(mm, pmd);
-	if (pmd_same(*pmd, entry)) {
-		entry = pmd_modify(entry, vma->vm_page_prot);
-		set_pmd_at(mm, start, pmd, entry);
-		update_mmu_cache_pmd(vma, address, &entry);
-	}
-	spin_unlock(ptl);
-
-out_unlock:
-	unlock_page(page);
-out:
-	put_page(page);
-	return 0;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_DEVICE_PRIVATE
-- 
2.26.2



* [PATCH 5/6] mm: migrate: don't split THP for misplaced NUMA page
From: Yang Shi @ 2021-03-29 18:33 UTC
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd, hca,
	gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

The old behavior didn't split the THP if migration failed due to lack of
memory on the target node, but the generic THP migration code does split it,
so keep the old behavior for misplaced NUMA page migration.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/migrate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 86325c750c14..1c0c873375ab 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1444,6 +1444,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	int swapwrite = current->flags & PF_SWAPWRITE;
 	int rc, nr_subpages;
 	LIST_HEAD(ret_pages);
+	bool nosplit = (reason == MR_NUMA_MISPLACED);
 
 	if (!swapwrite)
 		current->flags |= PF_SWAPWRITE;
@@ -1495,7 +1496,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 			 */
 			case -ENOSYS:
 				/* THP migration is unsupported */
-				if (is_thp) {
+				if (is_thp && !nosplit) {
 					if (!try_split_thp(page, &page2, from)) {
 						nr_thp_split++;
 						goto retry;
-- 
2.26.2



* [PATCH 6/6] mm: migrate: remove redundant page count check for THP
From: Yang Shi @ 2021-03-29 18:33 UTC
  To: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd, hca,
	gor, borntraeger, akpm
  Cc: shy828301, linux-mm, linux-s390, linux-kernel

We no longer have to keep the redundant page count check for THP after
switching to the generic migration code.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/migrate.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 1c0c873375ab..328f76848d6c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2097,18 +2097,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	if (isolate_lru_page(page))
 		return 0;
 
-	/*
-	 * migrate_misplaced_transhuge_page() skips page migration's usual
-	 * check on page_count(), so we must do it here, now that the page
-	 * has been isolated: a GUP pin, or any other pin, prevents migration.
-	 * The expected page count is 3: 1 for page's mapcount and 1 for the
-	 * caller's pin and 1 for the reference taken by isolate_lru_page().
-	 */
-	if (PageTransHuge(page) && page_count(page) != 3) {
-		putback_lru_page(page);
-		return 0;
-	}
-
 	page_lru = page_is_file_lru(page);
 	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
 				thp_nr_pages(page));
-- 
2.26.2



* Re: [PATCH 3/6] mm: migrate: teach migrate_misplaced_page() about THP
From: Huang, Ying @ 2021-03-30  0:21 UTC
  To: Yang Shi
  Cc: mgorman, kirill.shutemov, ziy, mhocko, hughd, hca, gor,
	borntraeger, akpm, linux-mm, linux-s390, linux-kernel

Yang Shi <shy828301@gmail.com> writes:

> In the following patch migrate_misplaced_page() will be used to migrate THP
> for NUMA fault too.  Prepare it to deal with THP.
>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/migrate.h | 6 ++++--
>  mm/memory.c             | 2 +-
>  mm/migrate.c            | 2 +-
>  3 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 3a389633b68f..6abd34986cc5 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -102,14 +102,16 @@ static inline void __ClearPageMovable(struct page *page)
>  #ifdef CONFIG_NUMA_BALANCING
>  extern bool pmd_trans_migrating(pmd_t pmd);
>  extern int migrate_misplaced_page(struct page *page,
> -				  struct vm_area_struct *vma, int node);
> +				  struct vm_area_struct *vma, int node,
> +				  bool compound);
>  #else
>  static inline bool pmd_trans_migrating(pmd_t pmd)
>  {
>  	return false;
>  }
>  static inline int migrate_misplaced_page(struct page *page,
> -					 struct vm_area_struct *vma, int node)
> +					 struct vm_area_struct *vma, int node,
> +					 bool compound)
>  {
>  	return -EAGAIN; /* can't migrate now */
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index 003bbf3187d4..7fed578bdc31 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4169,7 +4169,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	}
>  
>  	/* Migrate to the requested node */
> -	migrated = migrate_misplaced_page(page, vma, target_nid);
> +	migrated = migrate_misplaced_page(page, vma, target_nid, false);
>  	if (migrated) {
>  		page_nid = target_nid;
>  		flags |= TNF_MIGRATED;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 62b81d5257aa..9c4ae5132919 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2127,7 +2127,7 @@ static inline bool is_shared_exec_page(struct vm_area_struct *vma,
>   * the page that will be dropped by this function before returning.
>   */
>  int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> -			   int node)
> +			   int node, bool compound)

Can we just use PageCompound(page) instead?

Best Regards,
Huang, Ying


* Re: [PATCH 4/6] mm: thp: refactor NUMA fault handling
From: Huang, Ying @ 2021-03-30  0:41 UTC
  To: Yang Shi
  Cc: mgorman, kirill.shutemov, ziy, mhocko, hughd, hca, gor,
	borntraeger, akpm, linux-mm, linux-s390, linux-kernel

Yang Shi <shy828301@gmail.com> writes:

> When the THP NUMA fault support was added, THP migration was not supported
> yet, so ad hoc THP migration was implemented in the NUMA fault handling.
> THP migration has been supported since v4.14, so it doesn't make much sense
> to keep another THP migration implementation rather than using the generic
> migration code.
>
> This patch reworks the NUMA fault handling to use the generic migration
> implementation to migrate misplaced pages.  There is no functional change.
>
> After the refactor the flow of NUMA fault handling looks just like its
> PTE counterpart:
>   Acquire ptl
>   Restore PMD
>   Prepare for migration (elevate page refcount)
>   Release ptl
>   Isolate page from lru and elevate page refcount
>   Migrate the misplaced THP

There's a behavior change between the original implementation and your
new implementation.  Originally, the PMD is restored after trying to
migrate the misplaced THP.  I think this can reduce TLB shoot-down IPIs.

Best Regards,
Huang, Ying


* Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
From: Gerald Schaefer @ 2021-03-30 14:42 UTC
  To: Yang Shi
  Cc: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd, hca,
	gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel,
	Alexander Gordeev

On Mon, 29 Mar 2021 11:33:06 -0700
Yang Shi <shy828301@gmail.com> wrote:

> 
> When the THP NUMA fault support was added, THP migration was not supported
> yet, so ad hoc THP migration was implemented in the NUMA fault handling.
> THP migration has been supported since v4.14, so it doesn't make much sense
> to keep another THP migration implementation rather than using the generic
> migration code.  It is definitely a maintenance burden to keep two THP
> migration implementations for different code paths, and it is more error
> prone.  Using the generic THP migration implementation allows us to remove
> the duplicate code and some hacks needed by the old ad hoc implementation.
> 
> A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both THP
> and NUMA balancing.  Most of them support THP migration except for S390.
> Zi Yan tried to add THP migration support for S390 before, but it was not
> accepted due to the design of the S390 PMD.  For the discussion, please see:
> https://lkml.org/lkml/2018/4/27/953.
> 
> I'm no expert on S390 so I'm not sure whether it is feasible to support THP
> migration for S390.  If it is not feasible then this patchset may leave THP
> NUMA balancing non-functional on S390.  I'm not sure whether that is a show
> stopper, although the patchset does simplify the code a lot.  Anyway, it
> seems worth posting the series to the mailing list to get some feedback.

The reason why THP migration cannot work on s390 is that the migration
code will establish swap ptes in a pmd.  The pmd layout is very different
from the pte layout on s390, so you cannot simply write a swap pte into a
pmd.  There are no separate swp primitives for swap/migration pmds, IIRC.
And even if there were, we'd still need to find some space for a present
bit in the s390 pmd, and/or possibly move around some other bits.
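
For illustration, this is roughly what the generic THP migration code does
today (condensed from set_pmd_migration_entry() in mm/huge_memory.c, which
is only built with CONFIG_ARCH_ENABLE_THP_MIGRATION):

	/* Invalidate the pmd, then encode the page as a migration entry ... */
	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
	entry = make_migration_entry(page, pmd_write(pmdval));
	pmdswp = swp_entry_to_pmd(entry);
	if (pmd_soft_dirty(pmdval))
		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
	/*
	 * ... and write the swap-format entry directly into the pmd.  This
	 * swap-format pmd is exactly what the s390 pmd layout cannot
	 * accommodate.
	 */
	set_pmd_at(mm, address, pvmw->pmd, pmdswp);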

A lot of things can go wrong here, even if it could be possible in theory,
by introducing separate swp primitives in common code for pmd entries, along
with separate offset, type, shift, etc. I don't see that happening in the
near future.

Not sure if this is a show stopper, but I am not familiar enough with
NUMA and migration code to judge. E.g., I do not see any swp entry action
in your patches, but I assume this is implicitly triggered by the switch
to generic THP migration code.

Could there be a work-around by splitting THP pages instead of marking them
as migrate pmds (via pte swap entries), at least when THP migration is not
supported? I guess it could also be acceptable if THP pages were simply not
migrated for NUMA balancing on s390, but then we might need some extra config
option to make that behavior explicit.

See also my comment on patch #5 of this series.

Regards,
Gerald


* Re: [PATCH 5/6] mm: migrate: don't split THP for misplaced NUMA page
From: Gerald Schaefer @ 2021-03-30 14:42 UTC
  To: Yang Shi
  Cc: mgorman, kirill.shutemov, ziy, mhocko, ying.huang, hughd, hca,
	gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel,
	Alexander Gordeev

On Mon, 29 Mar 2021 11:33:11 -0700
Yang Shi <shy828301@gmail.com> wrote:

> The old behavior didn't split the THP if migration failed due to lack of
> memory on the target node, but the generic THP migration code does split it,
> so keep the old behavior for misplaced NUMA page migration.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/migrate.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 86325c750c14..1c0c873375ab 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1444,6 +1444,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  	int swapwrite = current->flags & PF_SWAPWRITE;
>  	int rc, nr_subpages;
>  	LIST_HEAD(ret_pages);
> +	bool nosplit = (reason == MR_NUMA_MISPLACED);
>  
>  	if (!swapwrite)
>  		current->flags |= PF_SWAPWRITE;
> @@ -1495,7 +1496,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  			 */
>  			case -ENOSYS:
>  				/* THP migration is unsupported */
> -				if (is_thp) {
> +				if (is_thp && !nosplit) {

This is the "THP migration is unsupported" case, but according to your
description you rather want to change the -ENOMEM case?

Could this be the correct place to trigger THP split for NUMA balancing,
for architectures not supporting THP migration, like s390?

Do I understand it correctly that this change (for -ENOSYS) would
result in always failed THP migrations during NUMA balancing, if THP
migration was not supported?


* Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
  2021-03-30 14:42 ` [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault Gerald Schaefer
@ 2021-03-30 16:51   ` Yang Shi
  2021-03-31 11:47     ` Gerald Schaefer
  2021-03-31 13:20   ` Mel Gorman
  1 sibling, 1 reply; 23+ messages in thread
From: Yang Shi @ 2021-03-30 16:51 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, hca, gor, borntraeger, Andrew Morton, Linux MM,
	linux-s390, Linux Kernel Mailing List, Alexander Gordeev

On Tue, Mar 30, 2021 at 7:42 AM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> On Mon, 29 Mar 2021 11:33:06 -0700
> Yang Shi <shy828301@gmail.com> wrote:
>
> >
> > When the THP NUMA fault support was added THP migration was not supported yet.
> > So the ad hoc THP migration was implemented in NUMA fault handling.  Since v4.14
> > THP migration has been supported so it doesn't make too much sense to still keep
> > another THP migration implementation rather than using the generic migration
> > code.  It is definitely a maintenance burden to keep two THP migration
> > implementation for different code paths and it is more error prone.  Using the
> > generic THP migration implementation allows us remove the duplicate code and
> > some hacks needed by the old ad hoc implementation.
> >
> > A quick grep shows x86_64, PowerPC (book3s), ARM64 ans S390 support both THP
> > and NUMA balancing.  The most of them support THP migration except for S390.
> > Zi Yan tried to add THP migration support for S390 before but it was not
> > accepted due to the design of S390 PMD.  For the discussion, please see:
> > https://lkml.org/lkml/2018/4/27/953.
> >
> > I'm not expert on S390 so not sure if it is feasible to support THP migration
> > for S390 or not.  If it is not feasible then the patchset may make THP NUMA
> > balancing not be functional on S390.  Not sure if this is a show stopper although
> > the patchset does simplify the code a lot.  Anyway it seems worth posting the
> > series to the mailing list to get some feedback.
>
> The reason why THP migration cannot work on s390 is because the migration
> code will establish swap ptes in a pmd. The pmd layout is very different from
> the pte layout on s390, so you cannot simply write a swap pte into a pmd.
> There are no separate swp primitives for swap/migration pmds, IIRC. And even
> if there were, we'd still need to find some space for a present bit in the
> s390 pmd, and/or possibly move around some other bits.
>
> A lot of things can go wrong here, even if it could be possible in theory,
> by introducing separate swp primitives in common code for pmd entries, along
> with separate offset, type, shift, etc. I don't see that happening in the
> near future.

Thanks a lot for the elaboration. IIUC, implementing a migration PMD
entry is *not* prevented by the hardware, it may just be very tricky to
implement, right?

>
> Not sure if this is a show stopper, but I am not familiar enough with
> NUMA and migration code to judge. E.g., I do not see any swp entry action
> in your patches, but I assume this is implicitly triggered by the switch
> to generic THP migration code.

Yes, exactly. The migrate_pages() called by migrate_misplaced_page()
takes care of everything.

>
> Could there be a work-around by splitting THP pages instead of marking them
> as migrate pmds (via pte swap entries), at least when THP migration is not
> supported? I guess it could also be acceptable if THP pages were simply not
> migrated for NUMA balancing on s390, but then we might need some extra config
> option to make that behavior explicit.

Yes, it could be. The old behavior of migration was to return -ENOMEM
if THP migration was not supported, then split the THP. That behavior
was not very friendly to some use cases, for example, memory policy and
the upcoming migration in lieu of reclaim. But I don't mean we should
restore the old behavior. We could split the THP if it returns -ENOSYS
and the page is THP.
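
To make that concrete, the split-and-retry path in migrate_pages() that
patch #5 touches looks roughly like this (simplified and from memory,
not a verbatim quote of the tree):

        case -ENOSYS:
                /* THP migration is unsupported: split the THP and
                 * retry its base pages (the real code also re-fixes
                 * the list cursor after the split). */
                if (is_thp) {
                        lock_page(page);
                        rc = split_huge_page_to_list(page, from);
                        unlock_page(page);
                        if (!rc) {
                                nr_thp_split++;
                                goto retry;
                        }
                        nr_thp_failed++;
                        nr_failed += nr_subpages;
                        break;
                }
                nr_failed++;
                break;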

>
> See also my comment on patch #5 of this series.
>
> Regards,
> Gerald

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/6] mm: migrate: don't split THP for misplaced NUMA page
  2021-03-30 14:42   ` Gerald Schaefer
@ 2021-03-30 16:53     ` Yang Shi
  0 siblings, 0 replies; 23+ messages in thread
From: Yang Shi @ 2021-03-30 16:53 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, hca, gor, borntraeger, Andrew Morton, Linux MM,
	linux-s390, Linux Kernel Mailing List, Alexander Gordeev

On Tue, Mar 30, 2021 at 7:42 AM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> On Mon, 29 Mar 2021 11:33:11 -0700
> Yang Shi <shy828301@gmail.com> wrote:
>
> > The old behavior didn't split the THP if migration failed due to lack of
> > memory on the target node.  But the generic THP migration code does split
> > the THP, so keep the old behavior for misplaced NUMA page migration.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/migrate.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 86325c750c14..1c0c873375ab 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1444,6 +1444,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >       int swapwrite = current->flags & PF_SWAPWRITE;
> >       int rc, nr_subpages;
> >       LIST_HEAD(ret_pages);
> > +     bool nosplit = (reason == MR_NUMA_MISPLACED);
> >
> >       if (!swapwrite)
> >               current->flags |= PF_SWAPWRITE;
> > @@ -1495,7 +1496,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >                        */
> >                       case -ENOSYS:
> >                               /* THP migration is unsupported */
> > -                             if (is_thp) {
> > +                             if (is_thp && !nosplit) {
>
> This is the "THP migration is unsupported" case, but according to your
> description you rather want to change the -ENOMEM case?
>
> Could this be the correct place to trigger THP split for NUMA balancing,
> for architectures not supporting THP migration, like s390?

Yes, I think it could be as I mentioned in the previous email.

>
> Do I understand it correctly that this change (for -ENOSYS) would
> result in always failed THP migrations during NUMA balancing, if THP
> migration was not supported?

Yes.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] mm: migrate: teach migrate_misplaced_page() about THP
  2021-03-30  0:21   ` Huang, Ying
@ 2021-03-30 16:57     ` Yang Shi
  0 siblings, 0 replies; 23+ messages in thread
From: Yang Shi @ 2021-03-30 16:57 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko,
	Hugh Dickins, hca, gor, borntraeger, Andrew Morton, Linux MM,
	linux-s390, Linux Kernel Mailing List

On Mon, Mar 29, 2021 at 5:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > In the following patch migrate_misplaced_page() will be used to migrate THP
> > for NUMA fault too.  Prepare to deal with THP.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/migrate.h | 6 ++++--
> >  mm/memory.c             | 2 +-
> >  mm/migrate.c            | 2 +-
> >  3 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 3a389633b68f..6abd34986cc5 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -102,14 +102,16 @@ static inline void __ClearPageMovable(struct page *page)
> >  #ifdef CONFIG_NUMA_BALANCING
> >  extern bool pmd_trans_migrating(pmd_t pmd);
> >  extern int migrate_misplaced_page(struct page *page,
> > -                               struct vm_area_struct *vma, int node);
> > +                               struct vm_area_struct *vma, int node,
> > +                               bool compound);
> >  #else
> >  static inline bool pmd_trans_migrating(pmd_t pmd)
> >  {
> >       return false;
> >  }
> >  static inline int migrate_misplaced_page(struct page *page,
> > -                                      struct vm_area_struct *vma, int node)
> > +                                      struct vm_area_struct *vma, int node,
> > +                                      bool compound)
> >  {
> >       return -EAGAIN; /* can't migrate now */
> >  }
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 003bbf3187d4..7fed578bdc31 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4169,7 +4169,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> >       }
> >
> >       /* Migrate to the requested node */
> > -     migrated = migrate_misplaced_page(page, vma, target_nid);
> > +     migrated = migrate_misplaced_page(page, vma, target_nid, false);
> >       if (migrated) {
> >               page_nid = target_nid;
> >               flags |= TNF_MIGRATED;
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 62b81d5257aa..9c4ae5132919 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2127,7 +2127,7 @@ static inline bool is_shared_exec_page(struct vm_area_struct *vma,
> >   * the page that will be dropped by this function before returning.
> >   */
> >  int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> > -                        int node)
> > +                        int node, bool compound)
>
> Can we just use PageCompound(page) instead?

Yes, I think so. The current base page NUMA hinting does bail out
early if a PTE-mapped THP is met, so a THP that reaches here can only
be PMD-mapped.
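
So the "compound" parameter could probably be dropped in favor of
something like this (sketch only):

        /* A PTE-mapped THP bails out before getting here, so
         * PageCompound() implies a PMD-mapped THP. */
        new_page_t *new = PageCompound(page) ?
                                alloc_misplaced_dst_page_thp :
                                alloc_misplaced_dst_page;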

>
> Best Regards,
> Huang, Ying
>
> >  {
> >       pg_data_t *pgdat = NODE_DATA(node);
> >       int isolated;
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/6] mm: thp: refactor NUMA fault handling
  2021-03-30  0:41   ` Huang, Ying
@ 2021-03-30 17:02     ` Yang Shi
  0 siblings, 0 replies; 23+ messages in thread
From: Yang Shi @ 2021-03-30 17:02 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko,
	Hugh Dickins, hca, gor, borntraeger, Andrew Morton, Linux MM,
	linux-s390, Linux Kernel Mailing List

On Mon, Mar 29, 2021 at 5:41 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > When the THP NUMA fault support was added THP migration was not supported yet.
> > So the ad hoc THP migration was implemented in NUMA fault handling.  Since v4.14
> > THP migration has been supported so it doesn't make too much sense to still keep
> > another THP migration implementation rather than using the generic migration
> > code.
> >
> > This patch reworked the NUMA fault handling to use the generic migration
> > implementation to migrate misplaced pages.  There is no functional change.
> >
> > After the refactor the flow of NUMA fault handling looks just like its
> > PTE counterpart:
> >   Acquire ptl
> >   Restore PMD
> >   Prepare for migration (elevate page refcount)
> >   Release ptl
> >   Isolate page from lru and elevate page refcount
> >   Migrate the misplaced THP
>
> There's some behavior change between the original implementation and
> your new implementation.  Originally, the PMD is restored after trying
> to migrate the misplaced THP.  I think this can reduce TLB shoot-down
> IPIs.

In theory, yes. However, I'm not sure if it is really measurable for
real-life workloads. I would like to make the huge PMD NUMA hinting
behave like its PTE counterpart for now. If your PTE NUMA hinting patch
proves the TLB shoot-down optimization is really worth it, we could
easily do the same on the PMD side too.

>
> Best Regards,
> Huang, Ying
>
> > In the old code the anon_vma lock was needed to serialize THP migration
> > against THP split, but since then the THP code has been reworked a lot,
> > and it seems the anon_vma lock is not required anymore to avoid the race.
> >
> > The page refcount elevation while holding the ptl should prevent the
> > THP from being split.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/migrate.h |  23 ------
> >  mm/huge_memory.c        | 132 ++++++++----------------------
> >  mm/migrate.c            | 173 ++++++----------------------------------
> >  3 files changed, 57 insertions(+), 271 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 6abd34986cc5..6c8640e9af4f 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -100,15 +100,10 @@ static inline void __ClearPageMovable(struct page *page)
> >  #endif
> >
> >  #ifdef CONFIG_NUMA_BALANCING
> > -extern bool pmd_trans_migrating(pmd_t pmd);
> >  extern int migrate_misplaced_page(struct page *page,
> >                                 struct vm_area_struct *vma, int node,
> >                                 bool compound);
> >  #else
> > -static inline bool pmd_trans_migrating(pmd_t pmd)
> > -{
> > -     return false;
> > -}
> >  static inline int migrate_misplaced_page(struct page *page,
> >                                        struct vm_area_struct *vma, int node,
> >                                        bool compound)
> > @@ -117,24 +112,6 @@ static inline int migrate_misplaced_page(struct page *page,
> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> >
> > -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > -extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > -                     struct vm_area_struct *vma,
> > -                     pmd_t *pmd, pmd_t entry,
> > -                     unsigned long address,
> > -                     struct page *page, int node);
> > -#else
> > -static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > -                     struct vm_area_struct *vma,
> > -                     pmd_t *pmd, pmd_t entry,
> > -                     unsigned long address,
> > -                     struct page *page, int node)
> > -{
> > -     return -EAGAIN;
> > -}
> > -#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
> > -
> > -
> >  #ifdef CONFIG_MIGRATION
> >
> >  /*
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 53f3843ce72a..157c63b0fd95 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1419,94 +1419,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >  {
> >       struct vm_area_struct *vma = vmf->vma;
> >       pmd_t pmd = vmf->orig_pmd;
> > -     struct anon_vma *anon_vma = NULL;
> >       struct page *page;
> >       unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> > -     int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
> > +     int page_nid = NUMA_NO_NODE;
> >       int target_nid, last_cpupid = -1;
> > -     bool page_locked;
> >       bool migrated = false;
> > -     bool was_writable;
> > +     bool was_writable = pmd_savedwrite(pmd);
> >       int flags = 0;
> >
> >       vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > -     if (unlikely(!pmd_same(pmd, *vmf->pmd)))
> > -             goto out_unlock;
> > -
> > -     /*
> > -      * If there are potential migrations, wait for completion and retry
> > -      * without disrupting NUMA hinting information. Do not relock and
> > -      * check_same as the page may no longer be mapped.
> > -      */
> > -     if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
> > -             page = pmd_page(*vmf->pmd);
> > -             if (!get_page_unless_zero(page))
> > -                     goto out_unlock;
> > -             spin_unlock(vmf->ptl);
> > -             put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
> > -             goto out;
> > -     }
> > -
> > -     page = pmd_page(pmd);
> > -     BUG_ON(is_huge_zero_page(page));
> > -     page_nid = page_to_nid(page);
> > -     last_cpupid = page_cpupid_last(page);
> > -     count_vm_numa_event(NUMA_HINT_FAULTS);
> > -     if (page_nid == this_nid) {
> > -             count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> > -             flags |= TNF_FAULT_LOCAL;
> > -     }
> > -
> > -     /* See similar comment in do_numa_page for explanation */
> > -     if (!pmd_savedwrite(pmd))
> > -             flags |= TNF_NO_GROUP;
> > -
> > -     /*
> > -      * Acquire the page lock to serialise THP migrations but avoid dropping
> > -      * page_table_lock if at all possible
> > -      */
> > -     page_locked = trylock_page(page);
> > -     target_nid = mpol_misplaced(page, vma, haddr);
> > -     if (target_nid == NUMA_NO_NODE) {
> > -             /* If the page was locked, there are no parallel migrations */
> > -             if (page_locked)
> > -                     goto clear_pmdnuma;
> > -     }
> > -
> > -     /* Migration could have started since the pmd_trans_migrating check */
> > -     if (!page_locked) {
> > -             page_nid = NUMA_NO_NODE;
> > -             if (!get_page_unless_zero(page))
> > -                     goto out_unlock;
> > +     if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> >               spin_unlock(vmf->ptl);
> > -             put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
> >               goto out;
> >       }
> >
> > -     /*
> > -      * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
> > -      * to serialises splits
> > -      */
> > -     get_page(page);
> > -     spin_unlock(vmf->ptl);
> > -     anon_vma = page_lock_anon_vma_read(page);
> > -
> > -     /* Confirm the PMD did not change while page_table_lock was released */
> > -     spin_lock(vmf->ptl);
> > -     if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
> > -             unlock_page(page);
> > -             put_page(page);
> > -             page_nid = NUMA_NO_NODE;
> > -             goto out_unlock;
> > -     }
> > -
> > -     /* Bail if we fail to protect against THP splits for any reason */
> > -     if (unlikely(!anon_vma)) {
> > -             put_page(page);
> > -             page_nid = NUMA_NO_NODE;
> > -             goto clear_pmdnuma;
> > -     }
> > -
> >       /*
> >        * Since we took the NUMA fault, we must have observed the !accessible
> >        * bit. Make sure all other CPUs agree with that, to avoid them
> > @@ -1533,38 +1459,44 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> >                                             haddr + HPAGE_PMD_SIZE);
> >       }
> >
> > -     /*
> > -      * Migrate the THP to the requested node, returns with page unlocked
> > -      * and access rights restored.
> > -      */
> > -     spin_unlock(vmf->ptl);
> > -
> > -     migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
> > -                             vmf->pmd, pmd, vmf->address, page, target_nid);
> > -     if (migrated) {
> > -             flags |= TNF_MIGRATED;
> > -             page_nid = target_nid;
> > -     } else
> > -             flags |= TNF_MIGRATE_FAIL;
> > -
> > -     goto out;
> > -clear_pmdnuma:
> > -     BUG_ON(!PageLocked(page));
> > -     was_writable = pmd_savedwrite(pmd);
> > +     /* Restore the PMD */
> >       pmd = pmd_modify(pmd, vma->vm_page_prot);
> >       pmd = pmd_mkyoung(pmd);
> >       if (was_writable)
> >               pmd = pmd_mkwrite(pmd);
> >       set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
> >       update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
> > -     unlock_page(page);
> > -out_unlock:
> > +
> > +     page = vm_normal_page_pmd(vma, haddr, pmd);
> > +     if (!page) {
> > +             spin_unlock(vmf->ptl);
> > +             goto out;
> > +     }
> > +
> > +     /* See similar comment in do_numa_page for explanation */
> > +     if (!was_writable)
> > +             flags |= TNF_NO_GROUP;
> > +
> > +     page_nid = page_to_nid(page);
> > +     last_cpupid = page_cpupid_last(page);
> > +     target_nid = numa_migrate_prep(page, vma, haddr, page_nid,
> > +                                    &flags);
> > +
> >       spin_unlock(vmf->ptl);
> >
> > -out:
> > -     if (anon_vma)
> > -             page_unlock_anon_vma_read(anon_vma);
> > +     if (target_nid == NUMA_NO_NODE) {
> > +             put_page(page);
> > +             goto out;
> > +     }
> > +
> > +     migrated = migrate_misplaced_page(page, vma, target_nid, true);
> > +     if (migrated) {
> > +             flags |= TNF_MIGRATED;
> > +             page_nid = target_nid;
> > +     } else
> > +             flags |= TNF_MIGRATE_FAIL;
> >
> > +out:
> >       if (page_nid != NUMA_NO_NODE)
> >               task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
> >                               flags);
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 9c4ae5132919..86325c750c14 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2066,6 +2066,23 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
> >       return newpage;
> >  }
> >
> > +static struct page *alloc_misplaced_dst_page_thp(struct page *page,
> > +                                              unsigned long data)
> > +{
> > +     int nid = (int) data;
> > +     struct page *newpage;
> > +
> > +     newpage = alloc_pages_node(nid, (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
> > +                                HPAGE_PMD_ORDER);
> > +     if (!newpage)
> > +             goto out;
> > +
> > +     prep_transhuge_page(newpage);
> > +
> > +out:
> > +     return newpage;
> > +}
> > +
> >  static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
> >  {
> >       int page_lru;
> > @@ -2104,12 +2121,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
> >       return 1;
> >  }
> >
> > -bool pmd_trans_migrating(pmd_t pmd)
> > -{
> > -     struct page *page = pmd_page(pmd);
> > -     return PageLocked(page);
> > -}
> > -
> >  static inline bool is_shared_exec_page(struct vm_area_struct *vma,
> >                                      struct page *page)
> >  {
> > @@ -2133,6 +2144,12 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> >       int isolated;
> >       int nr_remaining;
> >       LIST_HEAD(migratepages);
> > +     new_page_t *new;
> > +
> > +     if (compound)
> > +             new = alloc_misplaced_dst_page_thp;
> > +     else
> > +             new = alloc_misplaced_dst_page;
> >
> >       /*
> >        * Don't migrate file pages that are mapped in multiple processes
> > @@ -2153,9 +2170,8 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> >               goto out;
> >
> >       list_add(&page->lru, &migratepages);
> > -     nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
> > -                                  NULL, node, MIGRATE_ASYNC,
> > -                                  MR_NUMA_MISPLACED);
> > +     nr_remaining = migrate_pages(&migratepages, *new, NULL, node,
> > +                                  MIGRATE_ASYNC, MR_NUMA_MISPLACED);
> >       if (nr_remaining) {
> >               if (!list_empty(&migratepages)) {
> >                       list_del(&page->lru);
> > @@ -2174,145 +2190,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> >       return 0;
> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> > -
> > -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > -/*
> > - * Migrates a THP to a given target node. page must be locked and is unlocked
> > - * before returning.
> > - */
> > -int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > -                             struct vm_area_struct *vma,
> > -                             pmd_t *pmd, pmd_t entry,
> > -                             unsigned long address,
> > -                             struct page *page, int node)
> > -{
> > -     spinlock_t *ptl;
> > -     pg_data_t *pgdat = NODE_DATA(node);
> > -     int isolated = 0;
> > -     struct page *new_page = NULL;
> > -     int page_lru = page_is_file_lru(page);
> > -     unsigned long start = address & HPAGE_PMD_MASK;
> > -
> > -     if (is_shared_exec_page(vma, page))
> > -             goto out;
> > -
> > -     new_page = alloc_pages_node(node,
> > -             (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
> > -             HPAGE_PMD_ORDER);
> > -     if (!new_page)
> > -             goto out_fail;
> > -     prep_transhuge_page(new_page);
> > -
> > -     isolated = numamigrate_isolate_page(pgdat, page);
> > -     if (!isolated) {
> > -             put_page(new_page);
> > -             goto out_fail;
> > -     }
> > -
> > -     /* Prepare a page as a migration target */
> > -     __SetPageLocked(new_page);
> > -     if (PageSwapBacked(page))
> > -             __SetPageSwapBacked(new_page);
> > -
> > -     /* anon mapping, we can simply copy page->mapping to the new page: */
> > -     new_page->mapping = page->mapping;
> > -     new_page->index = page->index;
> > -     /* flush the cache before copying using the kernel virtual address */
> > -     flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
> > -     migrate_page_copy(new_page, page);
> > -     WARN_ON(PageLRU(new_page));
> > -
> > -     /* Recheck the target PMD */
> > -     ptl = pmd_lock(mm, pmd);
> > -     if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
> > -             spin_unlock(ptl);
> > -
> > -             /* Reverse changes made by migrate_page_copy() */
> > -             if (TestClearPageActive(new_page))
> > -                     SetPageActive(page);
> > -             if (TestClearPageUnevictable(new_page))
> > -                     SetPageUnevictable(page);
> > -
> > -             unlock_page(new_page);
> > -             put_page(new_page);             /* Free it */
> > -
> > -             /* Retake the callers reference and putback on LRU */
> > -             get_page(page);
> > -             putback_lru_page(page);
> > -             mod_node_page_state(page_pgdat(page),
> > -                      NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
> > -
> > -             goto out_unlock;
> > -     }
> > -
> > -     entry = mk_huge_pmd(new_page, vma->vm_page_prot);
> > -     entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> > -
> > -     /*
> > -      * Overwrite the old entry under pagetable lock and establish
> > -      * the new PTE. Any parallel GUP will either observe the old
> > -      * page blocking on the page lock, block on the page table
> > -      * lock or observe the new page. The SetPageUptodate on the
> > -      * new page and page_add_new_anon_rmap guarantee the copy is
> > -      * visible before the pagetable update.
> > -      */
> > -     page_add_anon_rmap(new_page, vma, start, true);
> > -     /*
> > -      * At this point the pmd is numa/protnone (i.e. non present) and the TLB
> > -      * has already been flushed globally.  So no TLB can be currently
> > -      * caching this non present pmd mapping.  There's no need to clear the
> > -      * pmd before doing set_pmd_at(), nor to flush the TLB after
> > -      * set_pmd_at().  Clearing the pmd here would introduce a race
> > -      * condition against MADV_DONTNEED, because MADV_DONTNEED only holds the
> > -      * mmap_lock for reading.  If the pmd is set to NULL at any given time,
> > -      * MADV_DONTNEED won't wait on the pmd lock and it'll skip clearing this
> > -      * pmd.
> > -      */
> > -     set_pmd_at(mm, start, pmd, entry);
> > -     update_mmu_cache_pmd(vma, address, &entry);
> > -
> > -     page_ref_unfreeze(page, 2);
> > -     mlock_migrate_page(new_page, page);
> > -     page_remove_rmap(page, true);
> > -     set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
> > -
> > -     spin_unlock(ptl);
> > -
> > -     /* Take an "isolate" reference and put new page on the LRU. */
> > -     get_page(new_page);
> > -     putback_lru_page(new_page);
> > -
> > -     unlock_page(new_page);
> > -     unlock_page(page);
> > -     put_page(page);                 /* Drop the rmap reference */
> > -     put_page(page);                 /* Drop the LRU isolation reference */
> > -
> > -     count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
> > -     count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
> > -
> > -     mod_node_page_state(page_pgdat(page),
> > -                     NR_ISOLATED_ANON + page_lru,
> > -                     -HPAGE_PMD_NR);
> > -     return isolated;
> > -
> > -out_fail:
> > -     count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
> > -     ptl = pmd_lock(mm, pmd);
> > -     if (pmd_same(*pmd, entry)) {
> > -             entry = pmd_modify(entry, vma->vm_page_prot);
> > -             set_pmd_at(mm, start, pmd, entry);
> > -             update_mmu_cache_pmd(vma, address, &entry);
> > -     }
> > -     spin_unlock(ptl);
> > -
> > -out_unlock:
> > -     unlock_page(page);
> > -out:
> > -     put_page(page);
> > -     return 0;
> > -}
> > -#endif /* CONFIG_NUMA_BALANCING */
> > -
> >  #endif /* CONFIG_NUMA */
> >
> >  #ifdef CONFIG_DEVICE_PRIVATE

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
  2021-03-30 16:51   ` Yang Shi
@ 2021-03-31 11:47     ` Gerald Schaefer
  2021-04-01 20:10       ` Yang Shi
  0 siblings, 1 reply; 23+ messages in thread
From: Gerald Schaefer @ 2021-03-31 11:47 UTC (permalink / raw)
  To: Yang Shi
  Cc: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, hca, gor, borntraeger, Andrew Morton, Linux MM,
	linux-s390, Linux Kernel Mailing List, Alexander Gordeev

On Tue, 30 Mar 2021 09:51:46 -0700
Yang Shi <shy828301@gmail.com> wrote:

> On Tue, Mar 30, 2021 at 7:42 AM Gerald Schaefer
> <gerald.schaefer@linux.ibm.com> wrote:
> >
> > On Mon, 29 Mar 2021 11:33:06 -0700
> > Yang Shi <shy828301@gmail.com> wrote:
> >  
> > >
> > > When the THP NUMA fault support was added THP migration was not supported yet.
> > > So the ad hoc THP migration was implemented in NUMA fault handling.  Since v4.14
> > > THP migration has been supported so it doesn't make too much sense to still keep
> > > another THP migration implementation rather than using the generic migration
> > > code.  It is definitely a maintenance burden to keep two THP migration
> > > implementation for different code paths and it is more error prone.  Using the
> > > generic THP migration implementation allows us remove the duplicate code and
> > > some hacks needed by the old ad hoc implementation.
> > >
> > > A quick grep shows x86_64, PowerPC (book3s), ARM64 ans S390 support both THP
> > > and NUMA balancing.  The most of them support THP migration except for S390.
> > > Zi Yan tried to add THP migration support for S390 before but it was not
> > > accepted due to the design of S390 PMD.  For the discussion, please see:
> > > https://lkml.org/lkml/2018/4/27/953.
> > >
> > > I'm not expert on S390 so not sure if it is feasible to support THP migration
> > > for S390 or not.  If it is not feasible then the patchset may make THP NUMA
> > > balancing not be functional on S390.  Not sure if this is a show stopper although
> > > the patchset does simplify the code a lot.  Anyway it seems worth posting the
> > > series to the mailing list to get some feedback.  
> >
> > The reason why THP migration cannot work on s390 is because the migration
> > code will establish swap ptes in a pmd. The pmd layout is very different from
> > the pte layout on s390, so you cannot simply write a swap pte into a pmd.
> > There are no separate swp primitives for swap/migration pmds, IIRC. And even
> > if there were, we'd still need to find some space for a present bit in the
> > s390 pmd, and/or possibly move around some other bits.
> >
> > A lot of things can go wrong here, even if it could be possible in theory,
> > by introducing separate swp primitives in common code for pmd entries, along
> > with separate offset, type, shift, etc. I don't see that happening in the
> > near future.  
> 
> Thanks a lot for the elaboration. IIUC, implementing a migration PMD
> entry is *not* prevented by the hardware, it may just be very tricky to
> implement, right?

Well, it depends. The HW is preventing proper full-blown swap + migration
support for PMD, similar to what we have for PTE, because we simply don't
have enough OS-defined bits in the PMD. A 5-bit swap type for example,
similar to a PTE, plus the PFN would not be possible.

The HW would not prevent a similar mechanism in principle, i.e. we could
mark it as invalid to trigger a fault, and have some magic bits that tell
the fault handler or migration code what it is about.

For handling migration aspects only, w/o any swap device or other support, a
single type bit could already be enough, to indicate read/write migration,
plus a "present" bit similar to PTE. But even those 2 bits would be hard to
find, though I would not entirely rule that out. That would be the tricky
part.

Then of course, common code would need some changes, to reflect the
different swap/migration (type) capabilities of PTE and PMD entries.
Not sure if such an approach would be acceptable for common code.

But this is just some very abstract and optimistic view, I have not
really properly looked into the details. So it might be even more
tricky, or not possible at all.
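
Purely as a thought experiment, the minimal encoding could look like
this (all names and bit positions are made up; finding two genuinely
free bits in the segment table entry is exactly the hard part):

        /* hypothetical s390 software bits, NOT a real layout */
        #define _SEGMENT_SW_MIGRATION   0x0001UL /* migration entry */
        #define _SEGMENT_SW_MIG_WRITE   0x0002UL /* writable entry  */

        static inline int pmd_is_migration_entry(pmd_t pmd)
        {
                /* invalid to the HW, but flagged for the fault path */
                return (pmd_val(pmd) & _SEGMENT_ENTRY_INVALID) &&
                       (pmd_val(pmd) & _SEGMENT_SW_MIGRATION);
        }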

> 
> >
> > Not sure if this is a show stopper, but I am not familiar enough with
> > NUMA and migration code to judge. E.g., I do not see any swp entry action
> > in your patches, but I assume this is implicitly triggered by the switch
> > to generic THP migration code.  
> 
> Yes, exactly. The migrate_pages() called by migrate_misplaced_page()
> takes care of everything.
> 
> >
> > Could there be a work-around by splitting THP pages instead of marking them
> > as migrate pmds (via pte swap entries), at least when THP migration is not
> > supported? I guess it could also be acceptable if THP pages were simply not
> > migrated for NUMA balancing on s390, but then we might need some extra config
> > option to make that behavior explicit.  
> 
> Yes, it could be. The old behavior of migration was to return -ENOMEM
> if THP migration was not supported, then split the THP. That behavior
> was not very friendly to some use cases, for example, memory policy and
> the upcoming migration in lieu of reclaim. But I don't mean we should
> restore the old behavior. We could split the THP if it returns -ENOSYS
> and the page is THP.

OK, as long as we don't get any broken PMD migration entries established
for s390, some extra THP splitting would be acceptable I guess.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
  2021-03-30 14:42 ` [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault Gerald Schaefer
  2021-03-30 16:51   ` Yang Shi
@ 2021-03-31 13:20   ` Mel Gorman
  2021-04-01 20:12     ` Yang Shi
  1 sibling, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2021-03-31 13:20 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Yang Shi, kirill.shutemov, ziy, mhocko, ying.huang, hughd, hca,
	gor, borntraeger, akpm, linux-mm, linux-s390, linux-kernel,
	Alexander Gordeev

On Tue, Mar 30, 2021 at 04:42:00PM +0200, Gerald Schaefer wrote:
> Could there be a work-around by splitting THP pages instead of marking them
> as migrate pmds (via pte swap entries), at least when THP migration is not
> supported? I guess it could also be acceptable if THP pages were simply not
> migrated for NUMA balancing on s390, but then we might need some extra config
> option to make that behavior explicit.
> 

The split is not done on other architectures simply because the loss
from splitting exceeded the gain of improved locality in too many cases.
However, it might be ok as an s390-specific workaround.

(Note, I haven't read the rest of the series due to lack of time but this
query caught my eye).

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
  2021-03-31 11:47     ` Gerald Schaefer
@ 2021-04-01 20:10       ` Yang Shi
  2021-04-06 12:02         ` Gerald Schaefer
  0 siblings, 1 reply; 23+ messages in thread
From: Yang Shi @ 2021-04-01 20:10 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, hca, gor, borntraeger, Andrew Morton, Linux MM,
	linux-s390, Linux Kernel Mailing List, Alexander Gordeev

On Wed, Mar 31, 2021 at 4:47 AM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> On Tue, 30 Mar 2021 09:51:46 -0700
> Yang Shi <shy828301@gmail.com> wrote:
>
> > On Tue, Mar 30, 2021 at 7:42 AM Gerald Schaefer
> > <gerald.schaefer@linux.ibm.com> wrote:
> > >
> > > On Mon, 29 Mar 2021 11:33:06 -0700
> > > Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > >
> > > > When the THP NUMA fault support was added THP migration was not supported yet.
> > > > So the ad hoc THP migration was implemented in NUMA fault handling.  Since v4.14
> > > > THP migration has been supported so it doesn't make too much sense to still keep
> > > > another THP migration implementation rather than using the generic migration
> > > > code.  It is definitely a maintenance burden to keep two THP migration
> > > > implementation for different code paths and it is more error prone.  Using the
> > > > generic THP migration implementation allows us remove the duplicate code and
> > > > some hacks needed by the old ad hoc implementation.
> > > >
> > > > A quick grep shows x86_64, PowerPC (book3s), ARM64 ans S390 support both THP
> > > > and NUMA balancing.  The most of them support THP migration except for S390.
> > > > Zi Yan tried to add THP migration support for S390 before but it was not
> > > > accepted due to the design of S390 PMD.  For the discussion, please see:
> > > > https://lkml.org/lkml/2018/4/27/953.
> > > >
> > > > I'm not expert on S390 so not sure if it is feasible to support THP migration
> > > > for S390 or not.  If it is not feasible then the patchset may make THP NUMA
> > > > balancing not be functional on S390.  Not sure if this is a show stopper although
> > > > the patchset does simplify the code a lot.  Anyway it seems worth posting the
> > > > series to the mailing list to get some feedback.
> > >
> > > The reason why THP migration cannot work on s390 is because the migration
> > > code will establish swap ptes in a pmd. The pmd layout is very different from
> > > the pte layout on s390, so you cannot simply write a swap pte into a pmd.
> > > There are no separate swp primitives for swap/migration pmds, IIRC. And even
> > > if there were, we'd still need to find some space for a present bit in the
> > > s390 pmd, and/or possibly move around some other bits.
> > >
> > > A lot of things can go wrong here, even if it could be possible in theory,
> > > by introducing separate swp primitives in common code for pmd entries, along
> > > with separate offset, type, shift, etc. I don't see that happening in the
> > > near future.
> >
> > Thanks a lot for the elaboration. IIUC, implementing a migration PMD
> > entry is *not* prevented by the hardware, it may just be very tricky to
> > implement, right?
>
> Well, it depends. The HW is preventing proper full-blown swap + migration
> support for PMD, similar to what we have for PTE, because we simply don't
> have enough OS-defined bits in the PMD. A 5-bit swap type for example,
> similar to a PTE, plus the PFN would not be possible.
>
> The HW would not prevent a similar mechanism in principle, i.e. we could
> mark it as invalid to trigger a fault, and have some magic bits that tell
> the fault handler or migration code what it is about.
>
> For handling migration aspects only, w/o any swap device or other support, a
> single type bit could already be enough, to indicate read/write migration,
> plus a "present" bit similar to PTE. But even those 2 bits would be hard to
> find, though I would not entirely rule that out. That would be the tricky
> part.
>
> Then of course, common code would need some changes, to reflect the
> different swap/migration (type) capabilities of PTE and PMD entries.
> Not sure if such an approach would be acceptable for common code.
>
> But this is just some very abstract and optimistic view, I have not
> really properly looked into the details. So it might be even more
> tricky, or not possible at all.

Thanks a lot for the elaboration.

>
> >
> > >
> > > Not sure if this is a show stopper, but I am not familiar enough with
> > > NUMA and migration code to judge. E.g., I do not see any swp entry action
> > > in your patches, but I assume this is implicitly triggered by the switch
> > > to generic THP migration code.
> >
> > Yes, exactly. The migrate_pages() called by migrate_misplaced_page()
> > takes care of everything.
> >
> > >
> > > Could there be a work-around by splitting THP pages instead of marking them
> > > as migrate pmds (via pte swap entries), at least when THP migration is not
> > > supported? I guess it could also be acceptable if THP pages were simply not
> > > migrated for NUMA balancing on s390, but then we might need some extra config
> > > option to make that behavior explicit.
> >
> > Yes, it could be. The old behavior of migration was to return -ENOMEM
> > if THP migration was not supported, then split the THP. That behavior
> > was not very friendly to some use cases, for example, memory policy and
> > the upcoming migration in lieu of reclaim. But I don't mean we should
> > restore the old behavior. We could split the THP if it returns -ENOSYS
> > and the page is THP.
>
> OK, as long as we don't get any broken PMD migration entries established
> for s390, some extra THP splitting would be acceptable I guess.

There will be no migration PMD installed. The current behavior is a
no-op if THP migration is not supported.
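
The no-op comes from the check at the top of unmap_and_move(), roughly
(from memory):

        if (!thp_migration_supported() && PageTransHuge(page))
                return -ENOSYS; /* bail out before any unmapping */

so the huge PMD is never replaced by a migration entry in the first
place.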

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
  2021-03-31 13:20   ` Mel Gorman
@ 2021-04-01 20:12     ` Yang Shi
  0 siblings, 0 replies; 23+ messages in thread
From: Yang Shi @ 2021-04-01 20:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Gerald Schaefer, Kirill A. Shutemov, Zi Yan, Michal Hocko,
	Huang Ying, Hugh Dickins, hca, gor, borntraeger, Andrew Morton,
	Linux MM, linux-s390, Linux Kernel Mailing List,
	Alexander Gordeev

On Wed, Mar 31, 2021 at 6:20 AM Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Mar 30, 2021 at 04:42:00PM +0200, Gerald Schaefer wrote:
> > Could there be a work-around by splitting THP pages instead of marking them
> > as migrate pmds (via pte swap entries), at least when THP migration is not
> > supported? I guess it could also be acceptable if THP pages were simply not
> > migrated for NUMA balancing on s390, but then we might need some extra config
> > option to make that behavior explicit.
> >
>
> The split is not done on other architectures simply because the loss
> from splitting exceeded the gain of improved locality in too many cases.
> However, it might be ok as an s390-specific workaround.
>
> (Note, I haven't read the rest of the series due to lack of time but this
> query caught my eye).

Will wait for your comments before I post v2. Thanks.

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
  2021-04-01 20:10       ` Yang Shi
@ 2021-04-06 12:02         ` Gerald Schaefer
  2021-04-06 16:42           ` Yang Shi
  0 siblings, 1 reply; 23+ messages in thread
From: Gerald Schaefer @ 2021-04-06 12:02 UTC (permalink / raw)
  To: Yang Shi
  Cc: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, hca, gor, borntraeger, Andrew Morton, Linux MM,
	linux-s390, Linux Kernel Mailing List, Alexander Gordeev

On Thu, 1 Apr 2021 13:10:49 -0700
Yang Shi <shy828301@gmail.com> wrote:

[...]
> > >
> > > Yes, it could be. The old behavior of migration was to return -ENOMEM
> > > if THP migration was not supported, then split the THP. That behavior
> > > was not very friendly to some use cases, for example, memory policy and
> > > the upcoming migration in lieu of reclaim. But I don't mean we should
> > > restore the old behavior. We could split the THP if it returns -ENOSYS
> > > and the page is THP.
> >
> > OK, as long as we don't get any broken PMD migration entries established
> > for s390, some extra THP splitting would be acceptable I guess.
> 
> There will be no migration PMD installed. The current behavior is a
> no-op if THP migration is not supported.

Ok, just for completeness, since Mel also replied that the split
was not done on other architectures "because the loss from splitting
exceeded the gain of improved locality":

I did not mean to request extra splitting functionality for s390,
simply skipping / ignoring large PMDs would also be fine for s390,
no need to add extra complexity.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
  2021-04-06 12:02         ` Gerald Schaefer
@ 2021-04-06 16:42           ` Yang Shi
  2021-04-07  8:32             ` Mel Gorman
  0 siblings, 1 reply; 23+ messages in thread
From: Yang Shi @ 2021-04-06 16:42 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Mel Gorman, Kirill A. Shutemov, Zi Yan, Michal Hocko, Huang Ying,
	Hugh Dickins, hca, gor, borntraeger, Andrew Morton, Linux MM,
	linux-s390, Linux Kernel Mailing List, Alexander Gordeev

On Tue, Apr 6, 2021 at 5:03 AM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> On Thu, 1 Apr 2021 13:10:49 -0700
> Yang Shi <shy828301@gmail.com> wrote:
>
> [...]
> > > >
> > > > Yes, it could be. The old behavior of migration was to return -ENOMEM
> > > > if THP migration was not supported, then split the THP. That behavior
> > > > was not very friendly to some use cases, for example, memory policy and
> > > > the upcoming migration in lieu of reclaim. But I don't mean we should
> > > > restore the old behavior. We could split the THP if it returns -ENOSYS
> > > > and the page is THP.
> > >
> > > OK, as long as we don't get any broken PMD migration entries established
> > > for s390, some extra THP splitting would be acceptable I guess.
> >
> > There will be no migration PMD installed. The current behavior is a
> > no-op if THP migration is not supported.
>
> Ok, just for completeness, since Mel also replied that the split
> was not done on other architectures "because the loss from splitting
> exceeded the gain of improved locality":
>
> I did not mean to request extra splitting functionality for s390,
> simply skipping / ignoring large PMDs would also be fine for s390,
> no need to add extra complexity.

Thank you. It could make life easier. The current code still converts
huge PMDs to PROT_NONE even though THP migration is not supported. It
is easy to skip such PMDs, hence cycles are saved on pointless NUMA
hinting page faults.
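
Something as simple as this in change_huge_pmd() should do (sketch):

        /* NUMA hinting on a huge PMD is pointless if the THP cannot
         * be migrated anyway, so don't make it PROT_NONE at all. */
        if (prot_numa && !thp_migration_supported())
                goto unlock;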

Will do so in v2 if no objection from Mel as well.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
  2021-04-06 16:42           ` Yang Shi
@ 2021-04-07  8:32             ` Mel Gorman
  2021-04-07 16:04               ` Yang Shi
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2021-04-07  8:32 UTC (permalink / raw)
  To: Yang Shi
  Cc: Gerald Schaefer, Kirill A. Shutemov, Zi Yan, Michal Hocko,
	Huang Ying, Hugh Dickins, hca, gor, borntraeger, Andrew Morton,
	Linux MM, linux-s390, Linux Kernel Mailing List,
	Alexander Gordeev

On Tue, Apr 06, 2021 at 09:42:07AM -0700, Yang Shi wrote:
> On Tue, Apr 6, 2021 at 5:03 AM Gerald Schaefer
> <gerald.schaefer@linux.ibm.com> wrote:
> >
> > On Thu, 1 Apr 2021 13:10:49 -0700
> > Yang Shi <shy828301@gmail.com> wrote:
> >
> > [...]
> > > > >
> > > > > Yes, it could be. The old behavior of migration was to return -ENOMEM
> > > > > if THP migration was not supported, then split the THP. That behavior
> > > > > was not very friendly to some use cases, for example, memory policy and
> > > > > the upcoming migration in lieu of reclaim. But I don't mean we should
> > > > > restore the old behavior. We could split the THP if it returns -ENOSYS
> > > > > and the page is THP.
> > > >
> > > > OK, as long as we don't get any broken PMD migration entries established
> > > > for s390, some extra THP splitting would be acceptable I guess.
> > >
> > > There will be no migration PMD installed. The current behavior is a
> > > no-op if THP migration is not supported.
> >
> > Ok, just for completeness, since Mel also replied that the split
> > was not done on other architectures "because the loss from splitting
> > exceeded the gain of improved locality":
> >
> > I did not mean to request extra splitting functionality for s390,
> > simply skipping / ignoring large PMDs would also be fine for s390,
> > no need to add extra complexity.
> 
> Thank you. It could make life easier. The current code still converts
> huge PMDs to PROT_NONE even though THP migration is not supported. It
> is easy to skip such PMDs, hence cycles are saved on pointless NUMA
> hinting page faults.
> 
> Will do so in v2 if no objection from Mel as well.

I did not get a chance to review this in time but if a v2 shows up,
I'll at least run it through a battery of tests to measure the impact
and hopefully find the time to do a proper review. Superficially I'm not
opposed to using generic code for migration because even if it shows up a
problem, it would be better to optimise the generic implementation than
carry two similar implementations. I'm undecided on whether s390 should
split+migrate rather than skip because I do not have a good overview of
"typical workloads on s390 that benefit from NUMA balancing".

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault
  2021-04-07  8:32             ` Mel Gorman
@ 2021-04-07 16:04               ` Yang Shi
  0 siblings, 0 replies; 23+ messages in thread
From: Yang Shi @ 2021-04-07 16:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Gerald Schaefer, Kirill A. Shutemov, Zi Yan, Michal Hocko,
	Huang Ying, Hugh Dickins, hca, gor, borntraeger, Andrew Morton,
	Linux MM, linux-s390, Linux Kernel Mailing List,
	Alexander Gordeev

On Wed, Apr 7, 2021 at 1:32 AM Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Apr 06, 2021 at 09:42:07AM -0700, Yang Shi wrote:
> > On Tue, Apr 6, 2021 at 5:03 AM Gerald Schaefer
> > <gerald.schaefer@linux.ibm.com> wrote:
> > >
> > > On Thu, 1 Apr 2021 13:10:49 -0700
> > > Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > [...]
> > > > > >
> > > > > > Yes, it could be. The old behavior of migration was to return -ENOMEM
> > > > > > if THP migration was not supported, then split the THP. That behavior
> > > > > > was not very friendly to some use cases, for example, memory policy and
> > > > > > the upcoming migration in lieu of reclaim. But I don't mean we should
> > > > > > restore the old behavior. We could split the THP if it returns -ENOSYS
> > > > > > and the page is THP.
> > > > >
> > > > > OK, as long as we don't get any broken PMD migration entries established
> > > > > for s390, some extra THP splitting would be acceptable I guess.
> > > >
> > > > There will be no migration PMD installed. The current behavior is a
> > > > no-op if THP migration is not supported.
> > >
> > > Ok, just for completeness, since Mel also replied that the split
> > > was not done on other architectures "because the loss from splitting
> > > exceeded the gain of improved locality":
> > >
> > > I did not mean to request extra splitting functionality for s390,
> > > simply skipping / ignoring large PMDs would also be fine for s390,
> > > no need to add extra complexity.
> >
> > Thank you. It could make life easier. The current code still converts
> > huge PMDs to PROT_NONE even though THP migration is not supported. It
> > is easy to skip such PMDs, hence cycles are saved on pointless NUMA
> > hinting page faults.
> >
> > Will do so in v2 if no objection from Mel as well.
>
> I did not get a chance to review this in time but if a v2 shows up,
> I'll at least run it through a battery of tests to measure the impact
> and hopefully find the time to do a proper review. Superficially I'm not
> opposed to using generic code for migration because even if it shows up a
> problem, it would be better to optimise the generic implementation than
> carry two similar implementations. I'm undecided on whether s390 should
> split+migrate rather than skip because I do not have a good overview of
> "typical workloads on s390 that benefit from NUMA balancing".

Thanks, Mel. I don't have a good sense of s390 workloads either. I will
just skip huge PMDs for s390 for now, as Gerald suggested.

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-04-07 16:04 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-29 18:33 [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault Yang Shi
2021-03-29 18:33 ` [PATCH 1/6] mm: memory: add orig_pmd to struct vm_fault Yang Shi
2021-03-29 18:33 ` [PATCH 2/6] mm: memory: make numa_migrate_prep() non-static Yang Shi
2021-03-29 18:33 ` [PATCH 3/6] mm: migrate: teach migrate_misplaced_page() about THP Yang Shi
2021-03-30  0:21   ` Huang, Ying
2021-03-30 16:57     ` Yang Shi
2021-03-29 18:33 ` [PATCH 4/6] mm: thp: refactor NUMA fault handling Yang Shi
2021-03-30  0:41   ` Huang, Ying
2021-03-30 17:02     ` Yang Shi
2021-03-29 18:33 ` [PATCH 5/6] mm: migrate: don't split THP for misplaced NUMA page Yang Shi
2021-03-30 14:42   ` Gerald Schaefer
2021-03-30 16:53     ` Yang Shi
2021-03-29 18:33 ` [PATCH 6/6] mm: migrate: remove redundant page count check for THP Yang Shi
2021-03-30 14:42 ` [RFC PATCH 0/6] mm: thp: use generic THP migration for NUMA hinting fault Gerald Schaefer
2021-03-30 16:51   ` Yang Shi
2021-03-31 11:47     ` Gerald Schaefer
2021-04-01 20:10       ` Yang Shi
2021-04-06 12:02         ` Gerald Schaefer
2021-04-06 16:42           ` Yang Shi
2021-04-07  8:32             ` Mel Gorman
2021-04-07 16:04               ` Yang Shi
2021-03-31 13:20   ` Mel Gorman
2021-04-01 20:12     ` Yang Shi
