* [PATCH v3 00/14] mm: page migration enhancement for thp
@ 2017-02-05 16:12 ` Zi Yan
  0 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi all,

The patches are rebased on mmotm-2017-02-01-15-35 and incorporate the feedback
on Naoya Horiguchi's v2 patches.

I fix a bug in zap_pmd_range() and include the fixes in Patches 1-3.
The racy check in zap_pmd_range() can miss pmd_protnone and pmd_migration_entry
values, which leads to the PTE page table not being freed.

In Patch 4, I move _PAGE_SWP_SOFT_DIRTY to bit 1, because bit 6 (used in v2)
can be set by some CPUs by mistake and the new swap entry format does not use
bits 1-4.

I also adjust two core migration functions, set_pmd_migration_entry() and
remove_migration_pmd(), to use Kirill A. Shutemov's page_vma_mapped_walk()
function. Patch 8 needs Kirill's comments, since I also add pmd_migration_entry
handling to his page_vma_mapped_walk() function.

In Patch 8, I replace pmdp_huge_get_and_clear() with pmdp_huge_clear_flush()
in set_pmd_migration_entry() to avoid data corruption after page migration.

In Patch 9, I include is_pmd_migration_entry() in pmd_none_or_trans_huge_or_clear_bad().
Otherwise, a pmd_migration_entry is treated as pmd_bad and cleared, which
leads to the deposited PTE page table not being freed.

I personally use this patchset with my customized kernel to test frequent
page migrations by replacing page reclaim with page migration.
The bugs fixed in Patches 1-3 and 8 were discovered while testing that kernel.
I ran a 16-hour stress test with ~7 billion total page migrations;
no errors or data corruption were found.


General description 
===========================================

This patchset enhances the page migration functionality to handle thp migration
for the various callers of page migration:
 - mbind(2)
 - move_pages(2)
 - migrate_pages(2)
 - cgroup/cpuset migration
 - memory hotremove
 - soft offline

The main benefit is that we can avoid unnecessary thp splits, which helps
avoid a performance decrease when applications handle NUMA optimization on
their own.

The implementation is similar to that of normal page migration; the key point
is that we convert a pmd into a pmd migration entry in a swap-entry-like format.
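
To make this concrete, here is a minimal sketch of the check on the fault side
(my illustration only; the helpers are the ones added in Patch 8, and the real
call sites live in the patches themselves):

    /* A non-present huge pmd may encode a migration entry; if so, wait
     * for the migration to finish before retrying the fault. */
    if (is_pmd_migration_entry(*pmd))
            pmd_migration_entry_wait(mm, pmd);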


Any comments or advice are welcome.

Best Regards,
Yan Zi

Naoya Horiguchi (11):
  mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
  mm: mempolicy: add queue_pages_node_check()
  mm: thp: introduce separate TTU flag for thp freezing
  mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  mm: thp: enable thp migration in generic path
  mm: thp: check pmd migration entry in common path
  mm: soft-dirty: keep soft-dirty bits over thp migration
  mm: hwpoison: soft offline supports thp migration
  mm: mempolicy: mbind and migrate_pages support thp migration
  mm: migrate: move_pages() supports thp migration
  mm: memory_hotplug: memory hotremove supports thp migration

Zi Yan (3):
  mm: thp: make __split_huge_pmd_locked visible.
  mm: thp: create new __zap_huge_pmd_locked function.
  mm: use pmd lock instead of racy checks in zap_pmd_range()

 arch/x86/Kconfig                     |   4 +
 arch/x86/include/asm/pgtable.h       |  17 ++
 arch/x86/include/asm/pgtable_64.h    |   2 +
 arch/x86/include/asm/pgtable_types.h |  10 +-
 arch/x86/mm/gup.c                    |   4 +-
 fs/proc/task_mmu.c                   |  37 +++--
 include/asm-generic/pgtable.h        | 105 ++++--------
 include/linux/huge_mm.h              |  36 ++++-
 include/linux/rmap.h                 |   1 +
 include/linux/swapops.h              | 146 ++++++++++++++++-
 mm/Kconfig                           |   3 +
 mm/gup.c                             |  20 ++-
 mm/huge_memory.c                     | 302 +++++++++++++++++++++++++++++------
 mm/madvise.c                         |   2 +
 mm/memcontrol.c                      |   2 +
 mm/memory-failure.c                  |  31 ++--
 mm/memory.c                          |  33 ++--
 mm/memory_hotplug.c                  |  17 +-
 mm/mempolicy.c                       | 124 ++++++++++----
 mm/migrate.c                         |  66 ++++++--
 mm/mprotect.c                        |   6 +-
 mm/mremap.c                          |   2 +-
 mm/page_vma_mapped.c                 |  13 +-
 mm/pagewalk.c                        |   2 +
 mm/pgtable-generic.c                 |   3 +-
 mm/rmap.c                            |  21 ++-
 26 files changed, 770 insertions(+), 239 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 01/14] mm: thp: make __split_huge_pmd_locked visible.
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan, Zi Yan

From: Zi Yan <ziy@nvidia.com>

It allows splitting a huge pmd while holding the pmd lock.
It is preparation for the later zap_pmd_range() change.
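
As an illustration (not part of this patch), a caller that already holds the
pmd lock can now split in place, roughly the way the later zap_pmd_range()
change does:

    ptl = pmd_lock(vma->vm_mm, pmd);
    if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
            __split_huge_pmd_locked(vma, pmd, addr, false);
    spin_unlock(ptl);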

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 include/linux/huge_mm.h |  2 ++
 mm/huge_memory.c        | 22 ++++++++++++----------
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a3762d49ba39..2036f69c8284 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -120,6 +120,8 @@ static inline int split_huge_page(struct page *page)
 }
 void deferred_split_huge_page(struct page *page);
 
+void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long haddr, bool freeze);
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze, struct page *page);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 03e4566fc226..cd66532ef667 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1877,8 +1877,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_populate(mm, pmd, pgtable);
 }
 
-static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long haddr, bool freeze)
+void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address, bool freeze)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
@@ -1887,6 +1887,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	bool young, write, dirty, soft_dirty;
 	unsigned long addr;
 	int i;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
 
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
@@ -1895,6 +1896,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	count_vm_event(THP_SPLIT_PMD);
 
+	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
+
 	if (!vma_is_anonymous(vma)) {
 		_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
 		/*
@@ -1904,16 +1907,17 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(mm, pmd);
 		if (vma_is_dax(vma))
-			return;
+			goto out;
 		page = pmd_page(_pmd);
 		if (!PageReferenced(page) && pmd_young(_pmd))
 			SetPageReferenced(page);
 		page_remove_rmap(page, true);
 		put_page(page);
 		add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
-		return;
+		goto out;
 	} else if (is_huge_zero_pmd(*pmd)) {
-		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+		__split_huge_zero_page_pmd(vma, haddr, pmd);
+		goto out;
 	}
 
 	page = pmd_page(*pmd);
@@ -2010,6 +2014,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			put_page(page + i);
 		}
 	}
+out:
+	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -2017,11 +2023,8 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	spinlock_t *ptl;
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
 
-	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
 	ptl = pmd_lock(mm, pmd);
-
 	/*
 	 * If caller asks to setup a migration entries, we need a page to check
 	 * pmd against. Otherwise we can end up replacing wrong page.
@@ -2036,10 +2039,9 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			clear_page_mlock(page);
 	} else if (!pmd_devmap(*pmd))
 		goto out;
-	__split_huge_pmd_locked(vma, pmd, haddr, freeze);
+	__split_huge_pmd_locked(vma, pmd, address, freeze);
 out:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
 }
 
 void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 02/14] mm: thp: create new __zap_huge_pmd_locked function.
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan, Zi Yan

From: Zi Yan <ziy@nvidia.com>

It allows removing a huge pmd while holding the pmd lock.
It is preparation for the later zap_pmd_range() change.
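
As an illustration (not part of this patch), a caller that already holds the
pmd lock can zap a huge pmd directly, as the later zap_pmd_range() change does:

    ptl = pmd_lock(vma->vm_mm, pmd);
    if (pmd_trans_huge(*pmd) && next - addr == HPAGE_PMD_SIZE)
            __zap_huge_pmd_locked(tlb, vma, pmd, addr);
    spin_unlock(ptl);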

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 include/linux/huge_mm.h |  3 +++
 mm/huge_memory.c        | 27 ++++++++++++++++++---------
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2036f69c8284..44ee130c7207 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -26,6 +26,9 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 extern bool madvise_free_huge_pmd(struct mmu_gather *tlb,
 			struct vm_area_struct *vma,
 			pmd_t *pmd, unsigned long addr, unsigned long next);
+extern int __zap_huge_pmd_locked(struct mmu_gather *tlb,
+			struct vm_area_struct *vma,
+			pmd_t *pmd, unsigned long addr);
 extern int zap_huge_pmd(struct mmu_gather *tlb,
 			struct vm_area_struct *vma,
 			pmd_t *pmd, unsigned long addr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cd66532ef667..d8e15fd817b0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1590,17 +1590,12 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 	atomic_long_dec(&mm->nr_ptes);
 }
 
-int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+int __zap_huge_pmd_locked(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
 	pmd_t orig_pmd;
-	spinlock_t *ptl;
 
 	tlb_remove_check_page_size_change(tlb, HPAGE_PMD_SIZE);
-
-	ptl = __pmd_trans_huge_lock(pmd, vma);
-	if (!ptl)
-		return 0;
 	/*
 	 * For architectures like ppc64 we look at deposited pgtable
 	 * when calling pmdp_huge_get_and_clear. So do the
@@ -1611,13 +1606,11 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			tlb->fullmm);
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 	if (vma_is_dax(vma)) {
-		spin_unlock(ptl);
 		if (is_huge_zero_pmd(orig_pmd))
 			tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
 	} else if (is_huge_zero_pmd(orig_pmd)) {
 		pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
 		atomic_long_dec(&tlb->mm->nr_ptes);
-		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
 	} else {
 		struct page *page = pmd_page(orig_pmd);
@@ -1635,9 +1628,25 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				zap_deposited_table(tlb->mm, pmd);
 			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
 		}
-		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
 	}
+
+	return 1;
+}
+
+int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		 pmd_t *pmd, unsigned long addr)
+{
+	spinlock_t *ptl;
+
+
+	ptl = __pmd_trans_huge_lock(pmd, vma);
+	if (!ptl)
+		return 0;
+
+	__zap_huge_pmd_locked(tlb, vma, pmd, addr);
+
+	spin_unlock(ptl);
 	return 1;
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Originally, zap_pmd_range() checks the pmd value without taking the pmd lock.
This can cause a pmd_protnone entry not to be freed.

Changing a pmd entry into a pmd_protnone entry takes two steps: first, the
pmd entry is cleared to a pmd_none entry; then, the pmd_none entry is changed
into a pmd_protnone entry. The racy check, even with a barrier, might only
see the pmd_none entry in zap_pmd_range(), so the mapping is neither split
nor zapped.

Later, in free_pmd_range(), pmd_none_or_clear_bad() will see the pmd_protnone
entry and clear it as a pmd_bad entry. Furthermore, since the pmd_protnone
entry is not properly freed, the corresponding deposited pte page table is
not freed either.

This causes a memory leak, or a kernel crash if VM_BUG_ON() is enabled.
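
For reference, the two-step conversion looks roughly like the following
(simplified from change_huge_pmd() as I read it; the exact code may differ):

    entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
                                    /* window: *pmd is pmd_none here */
    entry = pmd_modify(entry, newprot);
    set_pmd_at(mm, addr, pmd, entry);
                                    /* *pmd is now a pmd_protnone entry */

A zap_pmd_range() racing within the window sees pmd_none, skips the entry,
and leaves the protnone pmd and its deposited page table behind.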

This patch relies on __split_huge_pmd_locked() and
__zap_huge_pmd_locked().

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 mm/memory.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3929b015faf7..7cfdd5208ef5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1233,33 +1233,31 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 				struct zap_details *details)
 {
 	pmd_t *pmd;
+	spinlock_t *ptl;
 	unsigned long next;
 
 	pmd = pmd_offset(pud, addr);
+	ptl = pmd_lock(vma->vm_mm, pmd);
 	do {
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
 				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
-				__split_huge_pmd(vma, pmd, addr, false, NULL);
-			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
-				goto next;
+				__split_huge_pmd_locked(vma, pmd, addr, false);
+			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
+				continue;
 			/* fall through */
 		}
-		/*
-		 * Here there can be other concurrent MADV_DONTNEED or
-		 * trans huge page faults running, and if the pmd is
-		 * none or trans huge it can change under us. This is
-		 * because MADV_DONTNEED holds the mmap_sem in read
-		 * mode.
-		 */
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-			goto next;
+
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		spin_unlock(ptl);
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
-next:
 		cond_resched();
+		spin_lock(ptl);
 	} while (pmd++, addr = next, addr != end);
+	spin_unlock(ptl);
 
 	return addr;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 04/14] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid a
false negative return when it races with a thp split
(during which _PAGE_PRESENT is temporarily cleared). I don't think that
dropping the _PAGE_PSE check in pmd_present() works well, because it can
hurt the tlb-handling optimization in thp split.
In the current kernel, bits 1-4 are not used in the non-present format
since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
Bit 7 is used as reserved (always clear), so please don't use it for
other purposes.
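
For illustration, the soft-dirty bit is applied to non-present entries with
the existing pte-side pattern (sketch, as in try_to_unmap_one()):

    swp_pte = swp_entry_to_pte(entry);
    if (pte_soft_dirty(pteval))
            swp_pte = pte_swp_mksoft_dirty(swp_pte);
    set_pte_at(mm, address, pte, swp_pte);

With this patch, pte_swp_mksoft_dirty() sets bit 1 (_PAGE_RW) instead of
bit 7 (_PAGE_PSE), so the soft-dirty mark no longer touches the bit that
pmd_present() relies on.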

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

ChangeLog v3:
- Move _PAGE_SWP_SOFT_DIRTY to bit 1; it was placed at bit 6 in v2, but
some CPUs might accidentally set bit 5 or 6.

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 arch/x86/include/asm/pgtable_types.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 8b4de22d6429..3695abd58ef6 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -97,15 +97,15 @@
 /*
  * Tracking soft dirty bit when a page goes to a swap is tricky.
  * We need a bit which can be stored in pte _and_ not conflict
- * with swap entry format. On x86 bits 6 and 7 are *not* involved
- * into swap entry computation, but bit 6 is used for nonlinear
- * file mapping, so we borrow bit 7 for soft dirty tracking.
+ * with swap entry format. On x86 bits 1-4 are *not* involved
+ * into swap entry computation, but bit 7 is used for thp migration,
+ * so we borrow bit 1 for soft dirty tracking.
  *
  * Please note that this bit must be treated as swap dirty page
- * mark if and only if the PTE has present bit clear!
+ * mark if and only if the PTE/PMD has present bit clear!
  */
 #ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY	_PAGE_PSE
+#define _PAGE_SWP_SOFT_DIRTY	_PAGE_RW
 #else
 #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 05/14] mm: mempolicy: add queue_pages_node_check()
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Introduce a separate check routine related to the MPOL_MF_INVERT flag.
This patch is just a cleanup; there is no behavioral change.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mempolicy.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index caa752516b67..5cc6a99918ab 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -477,6 +477,15 @@ struct queue_pages {
 	struct vm_area_struct *prev;
 };
 
+static inline bool queue_pages_node_check(struct page *page,
+					struct queue_pages *qp)
+{
+	int nid = page_to_nid(page);
+	unsigned long flags = qp->flags;
+
+	return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
+}
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
@@ -530,8 +539,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		 */
 		if (PageReserved(page))
 			continue;
-		nid = page_to_nid(page);
-		if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		if (queue_pages_node_check(page, qp))
 			continue;
 		if (PageTransCompound(page)) {
 			get_page(page);
@@ -563,7 +571,6 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 #ifdef CONFIG_HUGETLB_PAGE
 	struct queue_pages *qp = walk->private;
 	unsigned long flags = qp->flags;
-	int nid;
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t entry;
@@ -573,8 +580,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 	if (!pte_present(entry))
 		goto unlock;
 	page = pte_page(entry);
-	nid = page_to_nid(page);
-	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+	if (queue_pages_node_check(page, qp))
 		goto unlock;
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 06/14] mm: thp: introduce separate TTU flag for thp freezing
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

TTU_MIGRATION is used to convert a pte into a migration entry until a thp
split completes. This behavior conflicts with the thp migration added in
later patches, so let's introduce a new TTU flag specifically for freezing.

try_to_unmap() is used both for thp split (via freeze_page()) and page
migration (via __unmap_and_move()). In freeze_page(), the ttu_flags value
given for the head page is as below (assuming an anonymous thp):

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
     TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)

and the ttu_flags value given for tail pages is:

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
     TTU_MIGRATION)

__unmap_and_move() calls try_to_unmap() with the ttu_flags value:

    (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)

Now I'm trying to insert a branch for thp migration at the top of
try_to_unmap_one() like below

static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                       unsigned long address, void *arg)
  {
          ...
          if (flags & TTU_MIGRATION) {
                  if (!PageHuge(page) && PageTransCompound(page)) {
                          set_pmd_migration_entry(page, vma, address);
                          goto out;
                  }
          }

, so try_to_unmap() for tail pages called by the thp split path can go into
the thp migration code path (which converts the *pmd* into a migration entry),
while the expectation is to freeze the thp (which converts each *pte* into a
migration entry).

I detected this failure as a "bad page state" error in a testcase where
split_huge_page() is called from queue_pages_pte_range().
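
With the new flag, freeze_page() requests freezing explicitly; a sketch of
the result (flag set taken from the description above):

    enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
                               TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;

    if (PageAnon(page))
            ttu_flags |= TTU_SPLIT_FREEZE;

    ret = try_to_unmap(page, ttu_flags);

so only genuine migration callers pass TTU_MIGRATION and reach the pmd
migration branch.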

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/rmap.h | 1 +
 mm/huge_memory.c     | 2 +-
 mm/rmap.c            | 7 ++++---
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8c89e902df3e..97d8b7127bd2 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -88,6 +88,7 @@ enum ttu_flags {
 	TTU_MUNLOCK = 4,		/* munlock mode */
 	TTU_LZFREE = 8,			/* lazy free mode */
 	TTU_SPLIT_HUGE_PMD = 16,	/* split huge PMD if any */
+	TTU_SPLIT_FREEZE = 32,		/* freeze pte under splitting thp */
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8e15fd817b0..6893c47428b6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2123,7 +2123,7 @@ static void freeze_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
 	if (PageAnon(page))
-		ttu_flags |= TTU_MIGRATION;
+		ttu_flags |= TTU_SPLIT_FREEZE;
 
 	ret = try_to_unmap(page, ttu_flags);
 	VM_BUG_ON_PAGE(ret, page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 8774791e2809..16789b936e3a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1310,7 +1310,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
-				flags & TTU_MIGRATION, page);
+				flags & TTU_SPLIT_FREEZE, page);
 	}
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1395,7 +1395,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 */
 			dec_mm_counter(mm, mm_counter(page));
 		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
-				(flags & TTU_MIGRATION)) {
+				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
 			swp_entry_t entry;
 			pte_t swp_pte;
 			/*
@@ -1514,7 +1514,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 	 * locking requirements of exec(), migration skips
 	 * temporary VMAs until after exec() completes.
 	 */
-	if ((flags & TTU_MIGRATION) && !PageKsm(page) && PageAnon(page))
+	if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
+	    && !PageKsm(page) && PageAnon(page))
 		rwc.invalid_vma = invalid_migration_vma;
 
 	if (flags & TTU_RMAP_LOCKED)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 07/14] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
functionality to x86_64, which should be safer as a first step.
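
A hypothetical caller sketch (my illustration; later patches add the real
checks in their get_new_page() callbacks): an allocator for the migration
target can gate thp allocation on the new helper:

    if (thp_migration_supported() && PageTransHuge(page)) {
            struct page *thp = alloc_pages_node(node, GFP_TRANSHUGE,
                                                HPAGE_PMD_ORDER);

            if (thp)
                    prep_transhuge_page(thp);
            return thp;
    }
    return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE, 0);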

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
v1 -> v2:
- fixed config name in subject and patch description
---
 arch/x86/Kconfig        |  4 ++++
 include/linux/huge_mm.h | 10 ++++++++++
 mm/Kconfig              |  3 +++
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3d7cd097e827..d9683ca904e0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2247,6 +2247,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	def_bool y
 	depends on X86_64 && HUGETLB_PAGE && MIGRATION
 
+config ARCH_ENABLE_THP_MIGRATION
+	def_bool y
+	depends on X86_64 && TRANSPARENT_HUGEPAGE && MIGRATION
+
 menu "Power management and ACPI options"
 
 config ARCH_HIBERNATION_HEADER
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 44ee130c7207..83a8d42f9d55 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -217,6 +217,11 @@ void mm_put_huge_zero_page(struct mm_struct *mm);
 
 #define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
 
+static inline bool thp_migration_supported(void)
+{
+	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
+}
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -311,6 +316,11 @@ static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
 {
 	return NULL;
 }
+
+static inline bool thp_migration_supported(void)
+{
+	return false;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 85c74e0efc52..165cd7f2e9a4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -289,6 +289,9 @@ config MIGRATION
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
 
+config ARCH_ENABLE_THP_MIGRATION
+	bool
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 08/14] mm: thp: enable thp migration in generic path
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch adds thp migration's core code, including conversions
between a PMD entry and a swap entry, setting a PMD migration entry,
removing a PMD migration entry, and waiting on PMD migration entries.

This patch makes it possible to support thp migration.
If allocating a destination page as a thp fails, the source thp is simply
split as it is today, and normal page migration follows.
If a destination thp is allocated successfully, thp migration is used.
Subsequent patches actually enable thp migration for each caller of
page migration by allowing its get_new_page() callback to
allocate thps.
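
For orientation, the core of set_pmd_migration_entry() does roughly the
following (my condensed sketch; see the mm/huge_memory.c hunk for the real
code, which also handles the pte-mapped case):

    pmdval = *pvmw->pmd;
    pmdp_huge_clear_flush(vma, address, pvmw->pmd);
    entry = make_migration_entry(page, pmd_write(pmdval));
    pmdswp = swp_entry_to_pmd(entry);
    set_pmd_at(mm, address, pvmw->pmd, pmdswp);
    page_remove_rmap(page, true);
    put_page(page);

remove_migration_pmd() performs the reverse conversion via pmd_to_swp_entry()
once migration completes.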

ChangeLog v1 -> v2:
- support pte-mapped thp, doubly-mapped thp

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

ChangeLog v2 -> v3:
- use page_vma_mapped_walk()

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 arch/x86/include/asm/pgtable_64.h |   2 +
 include/linux/swapops.h           |  70 +++++++++++++++++-
 mm/huge_memory.c                  | 151 ++++++++++++++++++++++++++++++++++----
 mm/migrate.c                      |  29 +++++++-
 mm/page_vma_mapped.c              |  13 +++-
 mm/pgtable-generic.c              |   3 +-
 mm/rmap.c                         |  14 +++-
 7 files changed, 259 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 768eccc85553..0277f7755f3a 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -182,7 +182,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 					 ((type) << (SWP_TYPE_FIRST_BIT)) \
 					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
+#define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
+#define __swp_entry_to_pmd(x)		((pmd_t) { .pmd = (x).val })
 
 extern int kern_addr_valid(unsigned long addr);
 extern void cleanup_highmap(void);
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3e7eec..6625bea13869 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -103,7 +103,8 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
-	BUG_ON(!PageLocked(page));
+	BUG_ON(!PageLocked(compound_head(page)));
+
 	return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
 			page_to_pfn(page));
 }
@@ -126,7 +127,7 @@ static inline struct page *migration_entry_to_page(swp_entry_t entry)
 	 * Any use of migration entries may only occur while the
 	 * corresponding page is locked
 	 */
-	BUG_ON(!PageLocked(p));
+	BUG_ON(!PageLocked(compound_head(p)));
 	return p;
 }
 
@@ -163,6 +164,71 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 
 #endif
 
+struct page_vma_mapped_walk;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+		struct page *page);
+
+extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+		struct page *new);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+	swp_entry_t arch_entry;
+
+	arch_entry = __pmd_to_swp_entry(pmd);
+	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+	swp_entry_t arch_entry;
+
+	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
+	return __swp_entry_to_pmd(arch_entry);
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
+}
+#else
+static inline void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+		struct page *page)
+{
+	BUILD_BUG();
+}
+
+static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+		struct page *new)
+{
+	BUILD_BUG();
+}
+
+static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+	BUILD_BUG();
+	return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+	BUILD_BUG();
+	return (pmd_t){ 0 };
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_MEMORY_FAILURE
 
 extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6893c47428b6..fd54bbdc16cf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1613,20 +1613,51 @@ int __zap_huge_pmd_locked(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		atomic_long_dec(&tlb->mm->nr_ptes);
 		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
 	} else {
-		struct page *page = pmd_page(orig_pmd);
-		page_remove_rmap(page, true);
-		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-		VM_BUG_ON_PAGE(!PageHead(page), page);
-		if (PageAnon(page)) {
-			pgtable_t pgtable;
-			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
-			pte_free(tlb->mm, pgtable);
-			atomic_long_dec(&tlb->mm->nr_ptes);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		struct page *page;
+		int migration = 0;
+
+		if (!is_pmd_migration_entry(orig_pmd)) {
+			page = pmd_page(orig_pmd);
+			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+			VM_BUG_ON_PAGE(!PageHead(page), page);
+			page_remove_rmap(page, true);
+			if (PageAnon(page)) {
+				pgtable_t pgtable;
+
+				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
+								      pmd);
+				pte_free(tlb->mm, pgtable);
+				atomic_long_dec(&tlb->mm->nr_ptes);
+				add_mm_counter(tlb->mm, MM_ANONPAGES,
+					       -HPAGE_PMD_NR);
+			} else {
+				if (arch_needs_pgtable_deposit())
+					zap_deposited_table(tlb->mm, pmd);
+				add_mm_counter(tlb->mm, MM_FILEPAGES,
+					       -HPAGE_PMD_NR);
+			}
 		} else {
-			if (arch_needs_pgtable_deposit())
-				zap_deposited_table(tlb->mm, pmd);
-			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+			swp_entry_t entry;
+
+			entry = pmd_to_swp_entry(orig_pmd);
+			page = pfn_to_page(swp_offset(entry));
+			if (PageAnon(page)) {
+				pgtable_t pgtable;
+
+				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
+								      pmd);
+				pte_free(tlb->mm, pgtable);
+				atomic_long_dec(&tlb->mm->nr_ptes);
+				add_mm_counter(tlb->mm, MM_ANONPAGES,
+					       -HPAGE_PMD_NR);
+			} else {
+				if (arch_needs_pgtable_deposit())
+					zap_deposited_table(tlb->mm, pmd);
+				add_mm_counter(tlb->mm, MM_FILEPAGES,
+					       -HPAGE_PMD_NR);
+			}
+			free_swap_and_cache(entry); /* warn on failure? */
+			migration = 1;
 		}
 		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
 	}
@@ -2634,3 +2665,97 @@ static int __init split_huge_pages_debugfs(void)
 }
 late_initcall(split_huge_pages_debugfs);
 #endif
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+		struct page *page)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	pmd_t pmdval;
+	swp_entry_t entry;
+
+	if (pvmw->pmd && !pvmw->pte) {
+		pmd_t pmdswp;
+
+		mmu_notifier_invalidate_range_start(mm, address,
+				address + HPAGE_PMD_SIZE);
+
+		flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
+		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
+		if (pmd_dirty(pmdval))
+			set_page_dirty(page);
+		entry = make_migration_entry(page, pmd_write(pmdval));
+		pmdswp = swp_entry_to_pmd(entry);
+		set_pmd_at(mm, address, pvmw->pmd, pmdswp);
+		page_remove_rmap(page, true);
+		put_page(page);
+
+		mmu_notifier_invalidate_range_end(mm, address,
+				address + HPAGE_PMD_SIZE);
+	} else { /* pte-mapped thp */
+		pte_t pteval;
+		struct page *subpage = page - page_to_pfn(page) + pte_pfn(*pvmw->pte);
+		pte_t swp_pte;
+
+		pteval = ptep_clear_flush(vma, address, pvmw->pte);
+		if (pte_dirty(pteval))
+			set_page_dirty(subpage);
+		entry = make_migration_entry(subpage, pte_write(pteval));
+		swp_pte = swp_entry_to_pte(entry);
+		set_pte_at(mm, address, pvmw->pte, swp_pte);
+		page_remove_rmap(subpage, false);
+		put_page(subpage);
+	}
+}
+
+void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	swp_entry_t entry;
+
+	/* PMD-mapped THP  */
+	if (pvmw->pmd && !pvmw->pte) {
+		unsigned long mmun_start = address & HPAGE_PMD_MASK;
+		unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+		pmd_t pmde;
+
+		entry = pmd_to_swp_entry(*pvmw->pmd);
+		get_page(new);
+		pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+		if (is_write_migration_entry(entry))
+			pmde = maybe_pmd_mkwrite(pmde, vma);
+
+		flush_cache_range(vma, mmun_start, mmun_end);
+		page_add_anon_rmap(new, vma, mmun_start, true);
+		pmdp_huge_clear_flush_notify(vma, mmun_start, pvmw->pmd);
+		set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
+		flush_tlb_range(vma, mmun_start, mmun_end);
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_vma_page(new);
+		update_mmu_cache_pmd(vma, address, pvmw->pmd);
+
+	} else { /* pte-mapped thp */
+		pte_t pte;
+		pte_t *ptep = pvmw->pte;
+
+		entry = pte_to_swp_entry(*pvmw->pte);
+		get_page(new);
+		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		if (pte_swp_soft_dirty(*pvmw->pte))
+			pte = pte_mksoft_dirty(pte);
+		if (is_write_migration_entry(entry))
+			pte = maybe_mkwrite(pte, vma);
+		flush_dcache_page(new);
+		set_pte_at(mm, address, ptep, pte);
+		if (PageAnon(new))
+			page_add_anon_rmap(new, vma, address, false);
+		else
+			page_add_file_rmap(new, false);
+		update_mmu_cache(vma, address, ptep);
+	}
+}
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 95e8580dc902..84181a3668c6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -214,6 +214,12 @@ static int remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		new = page - pvmw.page->index +
 			linear_page_index(vma, pvmw.address);
 
+		/* PMD-mapped THP migration entry */
+		if (!PageHuge(page) && PageTransCompound(page)) {
+			remove_migration_pmd(&pvmw, new);
+			continue;
+		}
+
 		get_page(new);
 		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
 		if (pte_swp_soft_dirty(*pvmw.pte))
@@ -327,6 +333,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
 	__migration_entry_wait(mm, pte, ptl);
 }
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *ptl;
+	struct page *page;
+
+	ptl = pmd_lock(mm, pmd);
+	if (!is_pmd_migration_entry(*pmd))
+		goto unlock;
+	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+	if (!get_page_unless_zero(page))
+		goto unlock;
+	spin_unlock(ptl);
+	wait_on_page_locked(page);
+	put_page(page);
+	return;
+unlock:
+	spin_unlock(ptl);
+}
+#endif
+
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
 static bool buffer_migrate_lock_buffers(struct buffer_head *head,
@@ -1085,7 +1112,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		goto out;
 	}
 
-	if (unlikely(PageTransHuge(page))) {
+	if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
 		lock_page(page);
 		rc = split_huge_page(page);
 		unlock_page(page);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index a23001a22c15..0ed3aee62d50 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -137,16 +137,23 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	if (!pud_present(*pud))
 		return false;
 	pvmw->pmd = pmd_offset(pud, pvmw->address);
-	if (pmd_trans_huge(*pvmw->pmd)) {
+	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		if (!pmd_present(*pvmw->pmd))
-			return not_found(pvmw);
 		if (likely(pmd_trans_huge(*pvmw->pmd))) {
 			if (pvmw->flags & PVMW_MIGRATION)
 				return not_found(pvmw);
 			if (pmd_page(*pvmw->pmd) != page)
 				return not_found(pvmw);
 			return true;
+		} else if (!pmd_present(*pvmw->pmd)) {
+			if (unlikely(is_migration_entry(pmd_to_swp_entry(*pvmw->pmd)))) {
+				swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+
+				if (migration_entry_to_page(entry) != page)
+					return not_found(pvmw);
+				return true;
+			}
+			return not_found(pvmw);
 		} else {
 			/* THP pmd was split under us: handle on pte level */
 			spin_unlock(pvmw->ptl);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4ed5908c65b0..9d550a8a0c71 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t pmd;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+		  !pmd_devmap(*pmdp));
 	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return pmd;
diff --git a/mm/rmap.c b/mm/rmap.c
index 16789b936e3a..b33216668fa4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1304,6 +1304,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	struct rmap_private *rp = arg;
 	enum ttu_flags flags = rp->flags;
 
+
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		return SWAP_AGAIN;
@@ -1314,12 +1315,19 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	while (page_vma_mapped_walk(&pvmw)) {
+		/* THP migration */
+		if (flags & TTU_MIGRATION) {
+			if (!PageHuge(page) && PageTransCompound(page)) {
+				set_pmd_migration_entry(&pvmw, page);
+				continue;
+			}
+		}
+		/* Unexpected PMD-mapped THP */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
-		/* Unexpected PMD-mapped THP? */
-		VM_BUG_ON_PAGE(!pvmw.pte, page);
-
 		/*
 		 * If the page is mlock()d, we cannot swap it out.
 		 * If it's recently referenced (perhaps page_referenced
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 09/14] mm: thp: check pmd migration entry in common path
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

If one of the callers of page migration starts to handle thp,
memory management code will start to see pmd migration entries, so
we need to prepare for that before enabling it. This patch changes
the various code points which check the status of given pmds, in
order to prevent races between thp migration and other pmd-related
work.
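
As a rough illustration of the state space these changes must handle, a
pmd value read under pmd_lock() can now be pmd_none(), a non-present
migration entry, a huge or devmap pmd, or a pointer to a pte page table.
The sketch below is illustrative only and not part of the patch
(classify_pmd() is a made-up name); it uses the helpers added in this
series:

	static void classify_pmd(pmd_t pmd)
	{
		if (pmd_none(pmd))
			return;		/* nothing mapped */
		if (is_swap_pmd(pmd)) {
			/* non-present: currently only thp migration entries */
			VM_BUG_ON(!is_pmd_migration_entry(pmd));
			return;
		}
		if (pmd_trans_huge(pmd) || pmd_devmap(pmd))
			return;		/* huge page or devmap mapping */
		/* otherwise the pmd points to a pte page table */
	}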

ChangeLog v1 -> v2:
- introduce pmd_related() (I know the naming is not good, but can't
  think up a better name. Any suggestion is welcome.)

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

ChangeLog v2 -> v3:
- add is_swap_pmd()
- a pmd entry should be is_swap_pmd(), pmd_trans_huge(), pmd_devmap(),
  or pmd_none()
- use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear()
- flush_cache_range() while set_pmd_migration_entry()
- pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
  true on pmd_migration_entry, so that migration entries are not
  treated as pmd page table entries.

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 arch/x86/mm/gup.c             |  4 +--
 fs/proc/task_mmu.c            | 22 ++++++++-----
 include/asm-generic/pgtable.h | 71 ----------------------------------------
 include/linux/huge_mm.h       | 21 ++++++++++--
 include/linux/swapops.h       | 74 +++++++++++++++++++++++++++++++++++++++++
 mm/gup.c                      | 20 ++++++++++--
 mm/huge_memory.c              | 76 ++++++++++++++++++++++++++++++++++++-------
 mm/madvise.c                  |  2 ++
 mm/memcontrol.c               |  2 ++
 mm/memory.c                   |  9 +++--
 mm/memory_hotplug.c           | 13 +++++++-
 mm/mempolicy.c                |  1 +
 mm/mprotect.c                 |  6 ++--
 mm/mremap.c                   |  2 +-
 mm/pagewalk.c                 |  2 ++
 15 files changed, 221 insertions(+), 104 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 0d4fb3ebbbac..78a153d90064 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		if (!pmd_present(pmd))
 			return 0;
-		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
+		if (unlikely(pmd_large(pmd))) {
 			/*
 			 * NUMA hinting faults need to be handled in the GUP
 			 * slowpath for accounting purposes and so that they
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6c07c7813b26..1e64d6898c68 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		smaps_pmd_entry(pmd, addr, walk);
+		if (pmd_present(*pmd))
+			smaps_pmd_entry(pmd, addr, walk);
 		spin_unlock(ptl);
 		return 0;
 	}
@@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 			goto out;
 		}
 
+		if (!pmd_present(*pmd))
+			goto out;
+
 		page = pmd_page(*pmd);
 
 		/* Clear accessed and referenced bits. */
@@ -1208,19 +1212,19 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	if (ptl) {
 		u64 flags = 0, frame = 0;
 		pmd_t pmd = *pmdp;
+		struct page *page;
 
 		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
 			flags |= PM_SOFT_DIRTY;
 
-		/*
-		 * Currently pmd for thp is always present because thp
-		 * can not be swapped-out, migrated, or HWPOISONed
-		 * (split in such cases instead.)
-		 * This if-check is just to prepare for future implementation.
-		 */
-		if (pmd_present(pmd)) {
-			struct page *page = pmd_page(pmd);
+		if (is_pmd_migration_entry(pmd)) {
+			swp_entry_t entry = pmd_to_swp_entry(pmd);
 
+			frame = swp_type(entry) |
+				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+			page = migration_entry_to_page(entry);
+		} else if (pmd_present(pmd)) {
+			page = pmd_page(pmd);
 			if (page_mapcount(page) == 1)
 				flags |= PM_MMAP_EXCLUSIVE;
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b71a431ed649..6cf9e9b5a7be 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -726,77 +726,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 #ifndef arch_needs_pgtable_deposit
 #define arch_needs_pgtable_deposit() (false)
 #endif
-/*
- * This function is meant to be used by sites walking pagetables with
- * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
- * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
- * into a null pmd and the transhuge page fault can convert a null pmd
- * into an hugepmd or into a regular pmd (if the hugepage allocation
- * fails). While holding the mmap_sem in read mode the pmd becomes
- * stable and stops changing under us only if it's not null and not a
- * transhuge pmd. When those races occurs and this function makes a
- * difference vs the standard pmd_none_or_clear_bad, the result is
- * undefined so behaving like if the pmd was none is safe (because it
- * can return none anyway). The compiler level barrier() is critically
- * important to compute the two checks atomically on the same pmdval.
- *
- * For 32bit kernels with a 64bit large pmd_t this automatically takes
- * care of reading the pmd atomically to avoid SMP race conditions
- * against pmd_populate() when the mmap_sem is hold for reading by the
- * caller (a special atomic read not done by "gcc" as in the generic
- * version above, is also needed when THP is disabled because the page
- * fault can populate the pmd from under us).
- */
-static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
-{
-	pmd_t pmdval = pmd_read_atomic(pmd);
-	/*
-	 * The barrier will stabilize the pmdval in a register or on
-	 * the stack so that it will stop changing under the code.
-	 *
-	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
-	 * pmd_read_atomic is allowed to return a not atomic pmdval
-	 * (for example pointing to an hugepage that has never been
-	 * mapped in the pmd). The below checks will only care about
-	 * the low part of the pmd with 32bit PAE x86 anyway, with the
-	 * exception of pmd_none(). So the important thing is that if
-	 * the low part of the pmd is found null, the high part will
-	 * be also null or the pmd_none() check below would be
-	 * confused.
-	 */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	barrier();
-#endif
-	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
-		return 1;
-	if (unlikely(pmd_bad(pmdval))) {
-		pmd_clear_bad(pmd);
-		return 1;
-	}
-	return 0;
-}
-
-/*
- * This is a noop if Transparent Hugepage Support is not built into
- * the kernel. Otherwise it is equivalent to
- * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
- * places that already verified the pmd is not none and they want to
- * walk ptes while holding the mmap sem in read mode (write mode don't
- * need this). If THP is not enabled, the pmd can't go away under the
- * code even if MADV_DONTNEED runs, but if THP is enabled we need to
- * run a pmd_trans_unstable before walking the ptes after
- * split_huge_page_pmd returns (because it may have run when the pmd
- * become null, but then a page fault can map in a THP and not a
- * regular page).
- */
-static inline int pmd_trans_unstable(pmd_t *pmd)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	return pmd_none_or_trans_huge_or_clear_bad(pmd);
-#else
-	return 0;
-#endif
-}
 
 #ifndef CONFIG_NUMA_BALANCING
 /*
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 83a8d42f9d55..c2e5a4eab84a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -131,7 +131,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 #define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		if (pmd_trans_huge(*____pmd)				\
+		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
 					|| pmd_devmap(*____pmd))	\
 			__split_huge_pmd(__vma, __pmd, __address,	\
 						false, NULL);		\
@@ -162,12 +162,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma);
 extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
 		struct vm_area_struct *vma);
+
+static inline int is_swap_pmd(pmd_t pmd)
+{
+	return !pmd_none(pmd) && !pmd_present(pmd);
+}
+
 /* mmap_sem must be held on entry */
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
 	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
 		return NULL;
@@ -192,6 +198,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags);
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, int flags);
+static inline int hpage_order(struct page *page)
+{
+	if (unlikely(PageTransHuge(page)))
+		return HPAGE_PMD_ORDER;
+	return 0;
+}
 
 extern int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
@@ -232,6 +244,7 @@ static inline bool thp_migration_supported(void)
 #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
 
 #define hpage_nr_pages(x) 1
+#define hpage_order(x) 0
 
 #define transparent_hugepage_enabled(__vma) 0
 
@@ -274,6 +287,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
+static inline int is_swap_pmd(pmd_t pmd)
+{
+	return 0;
+}
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 6625bea13869..50e4aa7e7ff9 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -229,6 +229,80 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 }
 #endif
 
+/*
+ * This function is meant to be used by sites walking pagetables with
+ * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
+ * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
+ * into a null pmd and the transhuge page fault can convert a null pmd
+ * into an hugepmd or into a regular pmd (if the hugepage allocation
+ * fails). While holding the mmap_sem in read mode the pmd becomes
+ * stable and stops changing under us only if it's not null and not a
+ * transhuge pmd. When those races occurs and this function makes a
+ * difference vs the standard pmd_none_or_clear_bad, the result is
+ * undefined so behaving like if the pmd was none is safe (because it
+ * can return none anyway). The compiler level barrier() is critically
+ * important to compute the two checks atomically on the same pmdval.
+ *
+ * For 32bit kernels with a 64bit large pmd_t this automatically takes
+ * care of reading the pmd atomically to avoid SMP race conditions
+ * against pmd_populate() when the mmap_sem is hold for reading by the
+ * caller (a special atomic read not done by "gcc" as in the generic
+ * version above, is also needed when THP is disabled because the page
+ * fault can populate the pmd from under us).
+ */
+static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
+{
+	pmd_t pmdval = pmd_read_atomic(pmd);
+	/*
+	 * The barrier will stabilize the pmdval in a register or on
+	 * the stack so that it will stop changing under the code.
+	 *
+	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
+	 * pmd_read_atomic is allowed to return a not atomic pmdval
+	 * (for example pointing to an hugepage that has never been
+	 * mapped in the pmd). The below checks will only care about
+	 * the low part of the pmd with 32bit PAE x86 anyway, with the
+	 * exception of pmd_none(). So the important thing is that if
+	 * the low part of the pmd is found null, the high part will
+	 * be also null or the pmd_none() check below would be
+	 * confused.
+	 */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	barrier();
+#endif
+	if (pmd_none(pmdval) || pmd_trans_huge(pmdval)
+			|| is_pmd_migration_entry(pmdval))
+		return 1;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * This is a noop if Transparent Hugepage Support is not built into
+ * the kernel. Otherwise it is equivalent to
+ * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
+ * places that already verified the pmd is not none and they want to
+ * walk ptes while holding the mmap sem in read mode (write mode don't
+ * need this). If THP is not enabled, the pmd can't go away under the
+ * code even if MADV_DONTNEED runs, but if THP is enabled we need to
+ * run a pmd_trans_unstable before walking the ptes after
+ * split_huge_page_pmd returns (because it may have run when the pmd
+ * become null, but then a page fault can map in a THP and not a
+ * regular page).
+ */
+static inline int pmd_trans_unstable(pmd_t *pmd)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	return pmd_none_or_trans_huge_or_clear_bad(pmd);
+#else
+	return 0;
+#endif
+}
+
+
 #ifdef CONFIG_MEMORY_FAILURE
 
 extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git a/mm/gup.c b/mm/gup.c
index 1e67461b2733..82e0304e5d29 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -274,6 +274,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		return no_page_table(vma, flags);
+	if (!pmd_present(*pmd)) {
+retry:
+		if (likely(!(flags & FOLL_MIGRATION)))
+			return no_page_table(vma, flags);
+		pmd_migration_entry_wait(mm, pmd);
+		goto retry;
+	}
 	if (pmd_devmap(*pmd)) {
 		ptl = pmd_lock(mm, pmd);
 		page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -285,6 +292,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		return follow_page_pte(vma, address, pmd, flags);
 
 	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_present(*pmd))) {
+retry_locked:
+		if (likely(!(flags & FOLL_MIGRATION))) {
+			spin_unlock(ptl);
+			return no_page_table(vma, flags);
+		}
+		pmd_migration_entry_wait(mm, pmd);
+		goto retry_locked;
+	}
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags);
@@ -340,7 +356,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	pud = pud_offset(pgd, address);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		return -EFAULT;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	pte = pte_offset_map(pmd, address);
@@ -1368,7 +1384,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = READ_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		if (!pmd_present(pmd))
 			return 0;
 
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd54bbdc16cf..4ac923539372 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -897,6 +897,21 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = -EAGAIN;
 	pmd = *src_pmd;
+
+	if (unlikely(is_pmd_migration_entry(pmd))) {
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		if (is_write_migration_entry(entry)) {
+			make_migration_entry_read(&entry);
+			pmd = swp_entry_to_pmd(entry);
+			set_pmd_at(src_mm, addr, src_pmd, pmd);
+		}
+		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+		ret = 0;
+		goto out_unlock;
+	}
+	WARN_ONCE(!pmd_present(pmd), "Unknown non-present format on pmd.\n");
+
 	if (unlikely(!pmd_trans_huge(pmd))) {
 		pte_free(dst_mm, pgtable);
 		goto out_unlock;
@@ -1203,6 +1218,9 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
 		goto out_unlock;
 
+	if (unlikely(!pmd_present(orig_pmd)))
+		goto out_unlock;
+
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
 	/*
@@ -1337,7 +1355,15 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		goto out;
 
-	page = pmd_page(*pmd);
+	if (is_pmd_migration_entry(*pmd)) {
+		swp_entry_t entry;
+
+		entry = pmd_to_swp_entry(*pmd);
+		page = pfn_to_page(swp_offset(entry));
+		if (!is_migration_entry(entry))
+			goto out;
+	} else
+		page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd);
@@ -1533,6 +1559,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	if (is_huge_zero_pmd(orig_pmd))
 		goto out;
 
+	if (unlikely(!pmd_present(orig_pmd)))
+		goto out;
+
 	page = pmd_page(orig_pmd);
 	/*
 	 * If other processes are mapping this page, we couldn't discard
@@ -1659,7 +1688,8 @@ int __zap_huge_pmd_locked(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			free_swap_and_cache(entry); /* warn on failure? */
 			migration = 1;
 		}
-		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
+		if (!migration)
+			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
 	}
 
 	return 1;
@@ -1775,10 +1805,22 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		 * data is likely to be read-cached on the local CPU and
 		 * local/remote hits to the zero page are not interesting.
 		 */
-		if (prot_numa && is_huge_zero_pmd(*pmd)) {
-			spin_unlock(ptl);
-			return ret;
-		}
+		if (prot_numa && is_huge_zero_pmd(*pmd))
+			goto unlock;
+
+		if (is_pmd_migration_entry(*pmd)) {
+			swp_entry_t entry = pmd_to_swp_entry(*pmd);
+
+			if (is_write_migration_entry(entry)) {
+				pmd_t newpmd;
+
+				make_migration_entry_read(&entry);
+				newpmd = swp_entry_to_pmd(entry);
+				set_pmd_at(mm, addr, pmd, newpmd);
+			}
+			goto unlock;
+		} else if (!pmd_present(*pmd))
+			WARN_ONCE(1, "Unknown non-present format on pmd.\n");
 
 		if (!prot_numa || !pmd_protnone(*pmd)) {
 			entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
@@ -1790,6 +1832,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
 					pmd_write(entry));
 		}
+unlock:
 		spin_unlock(ptl);
 	}
 
@@ -1806,7 +1849,8 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+			pmd_devmap(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
@@ -1924,7 +1968,7 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t _pmd;
-	bool young, write, dirty, soft_dirty;
+	bool young, write, dirty, soft_dirty, pmd_migration;
 	unsigned long addr;
 	int i;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
@@ -1932,7 +1976,8 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-	VM_BUG_ON(!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd));
+	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+				&& !pmd_devmap(*pmd));
 
 	count_vm_event(THP_SPLIT_PMD);
 
@@ -1960,7 +2005,14 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		goto out;
 	}
 
-	page = pmd_page(*pmd);
+	pmd_migration = is_pmd_migration_entry(*pmd);
+	if (pmd_migration) {
+		swp_entry_t entry;
+
+		entry = pmd_to_swp_entry(*pmd);
+		page = pfn_to_page(swp_offset(entry));
+	} else
+		page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	page_ref_add(page, HPAGE_PMD_NR - 1);
 	write = pmd_write(*pmd);
@@ -1979,7 +2031,7 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * transferred to avoid any possibility of altering
 		 * permissions across VMAs.
 		 */
-		if (freeze) {
+		if (freeze || pmd_migration) {
 			swp_entry_t swp_entry;
 			swp_entry = make_migration_entry(page + i, write);
 			entry = swp_entry_to_pte(swp_entry);
@@ -2077,7 +2129,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		page = pmd_page(*pmd);
 		if (PageMlocked(page))
 			clear_page_mlock(page);
-	} else if (!pmd_devmap(*pmd))
+	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
 		goto out;
 	__split_huge_pmd_locked(vma, pmd, address, freeze);
 out:
diff --git a/mm/madvise.c b/mm/madvise.c
index e424a06e9f2b..0497a502351f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -310,6 +310,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	unsigned long next;
 
 	next = pmd_addr_end(addr, end);
+	if (!pmd_present(*pmd))
+		return 0;
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 44fb1e80701a..09bce3f0d622 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4633,6 +4633,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
 	struct page *page = NULL;
 	enum mc_target_type ret = MC_TARGET_NONE;
 
+	if (unlikely(!pmd_present(pmd)))
+		return ret;
 	page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
 	if (!(mc.flags & MOVE_ANON))
diff --git a/mm/memory.c b/mm/memory.c
index 7cfdd5208ef5..bf10b19e02d3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -999,7 +999,8 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
+			|| pmd_devmap(*src_pmd)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
 			err = copy_huge_pmd(dst_mm, src_mm,
@@ -1240,7 +1241,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	ptl = pmd_lock(vma->vm_mm, pmd);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
 				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
@@ -3697,6 +3698,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		pmd_t orig_pmd = *vmf.pmd;
 
 		barrier();
+		if (unlikely(is_pmd_migration_entry(orig_pmd))) {
+			pmd_migration_entry_wait(mm, vmf.pmd);
+			return 0;
+		}
 		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
 			vmf.flags |= FAULT_FLAG_SIZE_PMD;
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 19b460acb5e1..9cb4c83151a8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1570,6 +1570,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 	int nid = page_to_nid(page);
 	nodemask_t nmask = node_states[N_MEMORY];
 	struct page *new_page = NULL;
+	unsigned int order = 0;
 
 	/*
 	 * TODO: allocate a destination hugepage from a nearest neighbor node,
@@ -1580,6 +1581,11 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
 					next_node_in(nid, nmask));
 
+	if (thp_migration_supported() && PageTransHuge(page)) {
+		order = hpage_order(page);
+		gfp_mask |= GFP_TRANSHUGE;
+	}
+
 	node_clear(nid, nmask);
 
 	if (PageHighMem(page)
@@ -1593,6 +1599,9 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		new_page = __alloc_pages(gfp_mask, 0,
 					node_zonelist(nid, gfp_mask));
 
+	if (new_page && order == hpage_order(page))
+		prep_transhuge_page(new_page);
+
 	return new_page;
 }
 
@@ -1622,7 +1631,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			if (isolate_huge_page(page, &source))
 				move_pages -= 1 << compound_order(head);
 			continue;
-		}
+		} else if (thp_migration_supported() && PageTransHuge(page))
+			pfn = page_to_pfn(compound_head(page))
+				+ hpage_nr_pages(page) - 1;
 
 		if (!get_page_unless_zero(page))
 			continue;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5cc6a99918ab..021ff13b9a7a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -94,6 +94,7 @@
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
+#include <linux/swapops.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 98acf7d5cef2..bfbe66798a7a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -150,7 +150,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
-		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
+		if (!pmd_present(*pmd))
+			continue;
+		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
 				&& pmd_none_or_clear_bad(pmd))
 			continue;
 
@@ -160,7 +162,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range_start(mm, mni_start, end);
 		}
 
-		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 				if (pmd_trans_unstable(pmd))
diff --git a/mm/mremap.c b/mm/mremap.c
index 8233b0105c82..5d537ce12adc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -213,7 +213,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-		if (pmd_trans_huge(*old_pmd)) {
+		if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
 			if (extent == HPAGE_PMD_SIZE) {
 				bool moved;
 				/* See comment in move_ptes() */
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 03761577ae86..114fc2b5a370 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -2,6 +2,8 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 09/14] mm: thp: check pmd migration entry in common path
@ 2017-02-05 16:12   ` Zi Yan
  0 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

If one of the callers of page migration starts to handle thp,
memory management code will start to see pmd migration entries, so
we need to prepare for that before enabling it. This patch changes
the various code points which check the status of given pmds, in
order to prevent races between thp migration and other pmd-related
work.

ChangeLog v1 -> v2:
- introduce pmd_related() (I know the naming is not good, but can't
  think up a better name. Any suggestion is welcome.)

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

ChangeLog v2 -> v3:
- add is_swap_pmd()
- a pmd entry should be is_swap_pmd(), pmd_trans_huge(), pmd_devmap(),
  or pmd_none()
- use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear()
- flush_cache_range() while set_pmd_migration_entry()
- pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
  true on pmd_migration_entry, so that migration entries are not
  treated as pmd page table entries.

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 arch/x86/mm/gup.c             |  4 +--
 fs/proc/task_mmu.c            | 22 ++++++++-----
 include/asm-generic/pgtable.h | 71 ----------------------------------------
 include/linux/huge_mm.h       | 21 ++++++++++--
 include/linux/swapops.h       | 74 +++++++++++++++++++++++++++++++++++++++++
 mm/gup.c                      | 20 ++++++++++--
 mm/huge_memory.c              | 76 ++++++++++++++++++++++++++++++++++++-------
 mm/madvise.c                  |  2 ++
 mm/memcontrol.c               |  2 ++
 mm/memory.c                   |  9 +++--
 mm/memory_hotplug.c           | 13 +++++++-
 mm/mempolicy.c                |  1 +
 mm/mprotect.c                 |  6 ++--
 mm/mremap.c                   |  2 +-
 mm/pagewalk.c                 |  2 ++
 15 files changed, 221 insertions(+), 104 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 0d4fb3ebbbac..78a153d90064 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		if (!pmd_present(pmd))
 			return 0;
-		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
+		if (unlikely(pmd_large(pmd))) {
 			/*
 			 * NUMA hinting faults need to be handled in the GUP
 			 * slowpath for accounting purposes and so that they
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6c07c7813b26..1e64d6898c68 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		smaps_pmd_entry(pmd, addr, walk);
+		if (pmd_present(*pmd))
+			smaps_pmd_entry(pmd, addr, walk);
 		spin_unlock(ptl);
 		return 0;
 	}
@@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 			goto out;
 		}
 
+		if (!pmd_present(*pmd))
+			goto out;
+
 		page = pmd_page(*pmd);
 
 		/* Clear accessed and referenced bits. */
@@ -1208,19 +1212,19 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	if (ptl) {
 		u64 flags = 0, frame = 0;
 		pmd_t pmd = *pmdp;
+		struct page *page;
 
 		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
 			flags |= PM_SOFT_DIRTY;
 
-		/*
-		 * Currently pmd for thp is always present because thp
-		 * can not be swapped-out, migrated, or HWPOISONed
-		 * (split in such cases instead.)
-		 * This if-check is just to prepare for future implementation.
-		 */
-		if (pmd_present(pmd)) {
-			struct page *page = pmd_page(pmd);
+		if (is_pmd_migration_entry(pmd)) {
+			swp_entry_t entry = pmd_to_swp_entry(pmd);
 
+			frame = swp_type(entry) |
+				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+			page = migration_entry_to_page(entry);
+		} else if (pmd_present(pmd)) {
+			page = pmd_page(pmd);
 			if (page_mapcount(page) == 1)
 				flags |= PM_MMAP_EXCLUSIVE;
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b71a431ed649..6cf9e9b5a7be 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -726,77 +726,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 #ifndef arch_needs_pgtable_deposit
 #define arch_needs_pgtable_deposit() (false)
 #endif
-/*
- * This function is meant to be used by sites walking pagetables with
- * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
- * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
- * into a null pmd and the transhuge page fault can convert a null pmd
- * into an hugepmd or into a regular pmd (if the hugepage allocation
- * fails). While holding the mmap_sem in read mode the pmd becomes
- * stable and stops changing under us only if it's not null and not a
- * transhuge pmd. When those races occurs and this function makes a
- * difference vs the standard pmd_none_or_clear_bad, the result is
- * undefined so behaving like if the pmd was none is safe (because it
- * can return none anyway). The compiler level barrier() is critically
- * important to compute the two checks atomically on the same pmdval.
- *
- * For 32bit kernels with a 64bit large pmd_t this automatically takes
- * care of reading the pmd atomically to avoid SMP race conditions
- * against pmd_populate() when the mmap_sem is hold for reading by the
- * caller (a special atomic read not done by "gcc" as in the generic
- * version above, is also needed when THP is disabled because the page
- * fault can populate the pmd from under us).
- */
-static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
-{
-	pmd_t pmdval = pmd_read_atomic(pmd);
-	/*
-	 * The barrier will stabilize the pmdval in a register or on
-	 * the stack so that it will stop changing under the code.
-	 *
-	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
-	 * pmd_read_atomic is allowed to return a not atomic pmdval
-	 * (for example pointing to an hugepage that has never been
-	 * mapped in the pmd). The below checks will only care about
-	 * the low part of the pmd with 32bit PAE x86 anyway, with the
-	 * exception of pmd_none(). So the important thing is that if
-	 * the low part of the pmd is found null, the high part will
-	 * be also null or the pmd_none() check below would be
-	 * confused.
-	 */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	barrier();
-#endif
-	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
-		return 1;
-	if (unlikely(pmd_bad(pmdval))) {
-		pmd_clear_bad(pmd);
-		return 1;
-	}
-	return 0;
-}
-
-/*
- * This is a noop if Transparent Hugepage Support is not built into
- * the kernel. Otherwise it is equivalent to
- * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
- * places that already verified the pmd is not none and they want to
- * walk ptes while holding the mmap sem in read mode (write mode don't
- * need this). If THP is not enabled, the pmd can't go away under the
- * code even if MADV_DONTNEED runs, but if THP is enabled we need to
- * run a pmd_trans_unstable before walking the ptes after
- * split_huge_page_pmd returns (because it may have run when the pmd
- * become null, but then a page fault can map in a THP and not a
- * regular page).
- */
-static inline int pmd_trans_unstable(pmd_t *pmd)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	return pmd_none_or_trans_huge_or_clear_bad(pmd);
-#else
-	return 0;
-#endif
-}
 
 #ifndef CONFIG_NUMA_BALANCING
 /*
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 83a8d42f9d55..c2e5a4eab84a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -131,7 +131,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 #define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		if (pmd_trans_huge(*____pmd)				\
+		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
 					|| pmd_devmap(*____pmd))	\
 			__split_huge_pmd(__vma, __pmd, __address,	\
 						false, NULL);		\
@@ -162,12 +162,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma);
 extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
 		struct vm_area_struct *vma);
+
+static inline int is_swap_pmd(pmd_t pmd)
+{
+	return !pmd_none(pmd) && !pmd_present(pmd);
+}
+
 /* mmap_sem must be held on entry */
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
 	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
 		return NULL;
@@ -192,6 +198,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags);
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, int flags);
+static inline int hpage_order(struct page *page)
+{
+	if (unlikely(PageTransHuge(page)))
+		return HPAGE_PMD_ORDER;
+	return 0;
+}
 
 extern int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
@@ -232,6 +244,7 @@ static inline bool thp_migration_supported(void)
 #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
 
 #define hpage_nr_pages(x) 1
+#define hpage_order(x) 0
 
 #define transparent_hugepage_enabled(__vma) 0
 
@@ -274,6 +287,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
+static inline int is_swap_pmd(pmd_t pmd)
+{
+	return 0;
+}
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 6625bea13869..50e4aa7e7ff9 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -229,6 +229,80 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 }
 #endif
 
+/*
+ * This function is meant to be used by sites walking pagetables with
+ * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
+ * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
+ * into a null pmd and the transhuge page fault can convert a null pmd
+ * into an hugepmd or into a regular pmd (if the hugepage allocation
+ * fails). While holding the mmap_sem in read mode the pmd becomes
+ * stable and stops changing under us only if it's not null and not a
+ * transhuge pmd. When those races occurs and this function makes a
+ * difference vs the standard pmd_none_or_clear_bad, the result is
+ * undefined so behaving like if the pmd was none is safe (because it
+ * can return none anyway). The compiler level barrier() is critically
+ * important to compute the two checks atomically on the same pmdval.
+ *
+ * For 32bit kernels with a 64bit large pmd_t this automatically takes
+ * care of reading the pmd atomically to avoid SMP race conditions
+ * against pmd_populate() when the mmap_sem is hold for reading by the
+ * caller (a special atomic read not done by "gcc" as in the generic
+ * version above, is also needed when THP is disabled because the page
+ * fault can populate the pmd from under us).
+ */
+static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
+{
+	pmd_t pmdval = pmd_read_atomic(pmd);
+	/*
+	 * The barrier will stabilize the pmdval in a register or on
+	 * the stack so that it will stop changing under the code.
+	 *
+	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
+	 * pmd_read_atomic is allowed to return a not atomic pmdval
+	 * (for example pointing to an hugepage that has never been
+	 * mapped in the pmd). The below checks will only care about
+	 * the low part of the pmd with 32bit PAE x86 anyway, with the
+	 * exception of pmd_none(). So the important thing is that if
+	 * the low part of the pmd is found null, the high part will
+	 * be also null or the pmd_none() check below would be
+	 * confused.
+	 */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	barrier();
+#endif
+	if (pmd_none(pmdval) || pmd_trans_huge(pmdval)
+			|| is_pmd_migration_entry(pmdval))
+		return 1;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * This is a noop if Transparent Hugepage Support is not built into
+ * the kernel. Otherwise it is equivalent to
+ * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
+ * places that already verified the pmd is not none and they want to
+ * walk ptes while holding the mmap sem in read mode (write mode don't
+ * need this). If THP is not enabled, the pmd can't go away under the
+ * code even if MADV_DONTNEED runs, but if THP is enabled we need to
+ * run a pmd_trans_unstable before walking the ptes after
+ * split_huge_page_pmd returns (because it may have run when the pmd
+ * become null, but then a page fault can map in a THP and not a
+ * regular page).
+ */
+static inline int pmd_trans_unstable(pmd_t *pmd)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	return pmd_none_or_trans_huge_or_clear_bad(pmd);
+#else
+	return 0;
+#endif
+}
+
+
 #ifdef CONFIG_MEMORY_FAILURE
 
 extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git a/mm/gup.c b/mm/gup.c
index 1e67461b2733..82e0304e5d29 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -274,6 +274,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		return no_page_table(vma, flags);
+	if (!pmd_present(*pmd)) {
+retry:
+		if (likely(!(flags & FOLL_MIGRATION)))
+			return no_page_table(vma, flags);
+		pmd_migration_entry_wait(mm, pmd);
+		goto retry;
+	}
 	if (pmd_devmap(*pmd)) {
 		ptl = pmd_lock(mm, pmd);
 		page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -285,6 +292,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		return follow_page_pte(vma, address, pmd, flags);
 
 	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_present(*pmd))) {
+retry_locked:
+		if (likely(!(flags & FOLL_MIGRATION))) {
+			spin_unlock(ptl);
+			return no_page_table(vma, flags);
+		}
+		pmd_migration_entry_wait(mm, pmd);
+		goto retry_locked;
+	}
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags);
@@ -340,7 +356,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	pud = pud_offset(pgd, address);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		return -EFAULT;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	pte = pte_offset_map(pmd, address);
@@ -1368,7 +1384,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = READ_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		if (!pmd_present(pmd))
 			return 0;
 
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd54bbdc16cf..4ac923539372 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -897,6 +897,21 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = -EAGAIN;
 	pmd = *src_pmd;
+
+	if (unlikely(is_pmd_migration_entry(pmd))) {
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		if (is_write_migration_entry(entry)) {
+			make_migration_entry_read(&entry);
+			pmd = swp_entry_to_pmd(entry);
+			set_pmd_at(src_mm, addr, src_pmd, pmd);
+		}
+		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+		ret = 0;
+		goto out_unlock;
+	}
+	WARN_ONCE(!pmd_present(pmd), "Unknown non-present format on pmd.\n");
+
 	if (unlikely(!pmd_trans_huge(pmd))) {
 		pte_free(dst_mm, pgtable);
 		goto out_unlock;
@@ -1203,6 +1218,9 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
 		goto out_unlock;
 
+	if (unlikely(!pmd_present(orig_pmd)))
+		goto out_unlock;
+
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
 	/*
@@ -1337,7 +1355,15 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		goto out;
 
-	page = pmd_page(*pmd);
+	if (is_pmd_migration_entry(*pmd)) {
+		swp_entry_t entry;
+
+		entry = pmd_to_swp_entry(*pmd);
+		page = pfn_to_page(swp_offset(entry));
+		if (!is_migration_entry(entry))
+			goto out;
+	} else
+		page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd);
@@ -1533,6 +1559,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	if (is_huge_zero_pmd(orig_pmd))
 		goto out;
 
+	if (unlikely(!pmd_present(orig_pmd)))
+		goto out;
+
 	page = pmd_page(orig_pmd);
 	/*
 	 * If other processes are mapping this page, we couldn't discard
@@ -1659,7 +1688,8 @@ int __zap_huge_pmd_locked(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			free_swap_and_cache(entry); /* warn on failure? */
 			migration = 1;
 		}
-		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
+		if (!migration)
+			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
 	}
 
 	return 1;
@@ -1775,10 +1805,22 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		 * data is likely to be read-cached on the local CPU and
 		 * local/remote hits to the zero page are not interesting.
 		 */
-		if (prot_numa && is_huge_zero_pmd(*pmd)) {
-			spin_unlock(ptl);
-			return ret;
-		}
+		if (prot_numa && is_huge_zero_pmd(*pmd))
+			goto unlock;
+
+		if (is_pmd_migration_entry(*pmd)) {
+			swp_entry_t entry = pmd_to_swp_entry(*pmd);
+
+			if (is_write_migration_entry(entry)) {
+				pmd_t newpmd;
+
+				make_migration_entry_read(&entry);
+				newpmd = swp_entry_to_pmd(entry);
+				set_pmd_at(mm, addr, pmd, newpmd);
+			}
+			goto unlock;
+		} else if (!pmd_present(*pmd))
+			WARN_ONCE(1, "Unknown non-present format on pmd.\n");
 
 		if (!prot_numa || !pmd_protnone(*pmd)) {
 			entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
@@ -1790,6 +1832,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
 					pmd_write(entry));
 		}
+unlock:
 		spin_unlock(ptl);
 	}
 
@@ -1806,7 +1849,8 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+			pmd_devmap(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
@@ -1924,7 +1968,7 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t _pmd;
-	bool young, write, dirty, soft_dirty;
+	bool young, write, dirty, soft_dirty, pmd_migration;
 	unsigned long addr;
 	int i;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
@@ -1932,7 +1976,8 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-	VM_BUG_ON(!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd));
+	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+				&& !pmd_devmap(*pmd));
 
 	count_vm_event(THP_SPLIT_PMD);
 
@@ -1960,7 +2005,14 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		goto out;
 	}
 
-	page = pmd_page(*pmd);
+	pmd_migration = is_pmd_migration_entry(*pmd);
+	if (pmd_migration) {
+		swp_entry_t entry;
+
+		entry = pmd_to_swp_entry(*pmd);
+		page = pfn_to_page(swp_offset(entry));
+	} else
+		page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	page_ref_add(page, HPAGE_PMD_NR - 1);
 	write = pmd_write(*pmd);
@@ -1979,7 +2031,7 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * transferred to avoid any possibility of altering
 		 * permissions across VMAs.
 		 */
-		if (freeze) {
+		if (freeze || pmd_migration) {
 			swp_entry_t swp_entry;
 			swp_entry = make_migration_entry(page + i, write);
 			entry = swp_entry_to_pte(swp_entry);
@@ -2077,7 +2129,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		page = pmd_page(*pmd);
 		if (PageMlocked(page))
 			clear_page_mlock(page);
-	} else if (!pmd_devmap(*pmd))
+	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
 		goto out;
 	__split_huge_pmd_locked(vma, pmd, address, freeze);
 out:
diff --git a/mm/madvise.c b/mm/madvise.c
index e424a06e9f2b..0497a502351f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -310,6 +310,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	unsigned long next;
 
 	next = pmd_addr_end(addr, end);
+	if (!pmd_present(*pmd))
+		return 0;
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 44fb1e80701a..09bce3f0d622 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4633,6 +4633,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
 	struct page *page = NULL;
 	enum mc_target_type ret = MC_TARGET_NONE;
 
+	if (unlikely(!pmd_present(pmd)))
+		return ret;
 	page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
 	if (!(mc.flags & MOVE_ANON))
diff --git a/mm/memory.c b/mm/memory.c
index 7cfdd5208ef5..bf10b19e02d3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -999,7 +999,8 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
+			|| pmd_devmap(*src_pmd)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
 			err = copy_huge_pmd(dst_mm, src_mm,
@@ -1240,7 +1241,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	ptl = pmd_lock(vma->vm_mm, pmd);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
 				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
@@ -3697,6 +3698,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		pmd_t orig_pmd = *vmf.pmd;
 
 		barrier();
+		if (unlikely(is_pmd_migration_entry(orig_pmd))) {
+			pmd_migration_entry_wait(mm, vmf.pmd);
+			return 0;
+		}
 		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
 			vmf.flags |= FAULT_FLAG_SIZE_PMD;
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 19b460acb5e1..9cb4c83151a8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1570,6 +1570,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 	int nid = page_to_nid(page);
 	nodemask_t nmask = node_states[N_MEMORY];
 	struct page *new_page = NULL;
+	unsigned int order = 0;
 
 	/*
 	 * TODO: allocate a destination hugepage from a nearest neighbor node,
@@ -1580,6 +1581,11 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
 					next_node_in(nid, nmask));
 
+	if (thp_migration_supported() && PageTransHuge(page)) {
+		order = hpage_order(page);
+		gfp_mask |= GFP_TRANSHUGE;
+	}
+
 	node_clear(nid, nmask);
 
 	if (PageHighMem(page)
@@ -1593,6 +1599,9 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		new_page = __alloc_pages(gfp_mask, 0,
 					node_zonelist(nid, gfp_mask));
 
+	if (new_page && order == hpage_order(page))
+		prep_transhuge_page(new_page);
+
 	return new_page;
 }
 
@@ -1622,7 +1631,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			if (isolate_huge_page(page, &source))
 				move_pages -= 1 << compound_order(head);
 			continue;
-		}
+		} else if (thp_migration_supported() && PageTransHuge(page))
+			pfn = page_to_pfn(compound_head(page))
+				+ hpage_nr_pages(page) - 1;
 
 		if (!get_page_unless_zero(page))
 			continue;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5cc6a99918ab..021ff13b9a7a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -94,6 +94,7 @@
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
+#include <linux/swapops.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 98acf7d5cef2..bfbe66798a7a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -150,7 +150,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
-		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
+		if (!pmd_present(*pmd))
+			continue;
+		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
 				&& pmd_none_or_clear_bad(pmd))
 			continue;
 
@@ -160,7 +162,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range_start(mm, mni_start, end);
 		}
 
-		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 				if (pmd_trans_unstable(pmd))
diff --git a/mm/mremap.c b/mm/mremap.c
index 8233b0105c82..5d537ce12adc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -213,7 +213,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-		if (pmd_trans_huge(*old_pmd)) {
+		if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
 			if (extent == HPAGE_PMD_SIZE) {
 				bool moved;
 				/* See comment in move_ptes() */
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 03761577ae86..114fc2b5a370 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -2,6 +2,8 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 10/14] mm: soft-dirty: keep soft-dirty bits over thp migration
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

The soft-dirty bit is designed to be preserved across page migration. This
patch makes it work in the same manner for thp migration too.
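
For illustration only (not part of this patch): a hypothetical pair of
helpers, written as they would read inside mm/huge_memory.c, sketching how
the new pmd_swp_* functions are meant to be paired so that the soft-dirty
bit survives the round trip through the migration-entry encoding. They
mirror the hunks in set_pmd_migration_entry() and remove_migration_pmd()
below; the example_* names are made up.

/*
 * Hypothetical helpers, for illustration only: carry the soft-dirty bit
 * into the migration-entry encoding of a pmd, and restore it on the new
 * mapping once migration has completed.
 */
static pmd_t example_pmd_to_migration_entry(struct page *page, pmd_t pmdval)
{
	swp_entry_t entry = make_migration_entry(page, pmd_write(pmdval));
	pmd_t pmdswp = swp_entry_to_pmd(entry);

	if (pmd_soft_dirty(pmdval))		/* carry the bit over */
		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
	return pmdswp;
}

static pmd_t example_migration_entry_to_pmd(struct page *new,
					    struct vm_area_struct *vma,
					    pmd_t pmdswp)
{
	pmd_t pmde = mk_huge_pmd(new, vma->vm_page_prot);

	if (pmd_swp_soft_dirty(pmdswp))		/* and restore it afterwards */
		pmde = pmd_mksoft_dirty(pmde);
	return pmde;
}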

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1 -> v2:
- separate diff moving _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
- clear_soft_dirty_pmd can handle migration entry
---
 arch/x86/include/asm/pgtable.h | 17 +++++++++++++++++
 fs/proc/task_mmu.c             | 17 +++++++++++------
 include/asm-generic/pgtable.h  | 34 +++++++++++++++++++++++++++++++++-
 include/linux/swapops.h        |  2 ++
 mm/huge_memory.c               | 24 +++++++++++++++++++++++-
 5 files changed, 86 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1cfb36b8c024..e57abf8e926c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1088,6 +1088,23 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+#endif
 #endif
 
 #define PKRU_AD_BIT 0x1
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1e64d6898c68..e367dc3afea3 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -900,12 +900,17 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
-
-	pmd = pmd_wrprotect(pmd);
-	pmd = pmd_clear_soft_dirty(pmd);
-
-	set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	pmd_t pmd = *pmdp;
+
+	if (pmd_present(pmd)) {
+		pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
+		pmd = pmd_wrprotect(pmd);
+		pmd = pmd_clear_soft_dirty(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+		pmd = pmd_swp_clear_soft_dirty(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	}
 }
 #else
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 6cf9e9b5a7be..f4c4ee5bce2b 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -550,7 +550,24 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
-#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+#else /* !CONFIG_HAVE_ARCH_SOFT_DIRTY */
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return 0;
@@ -595,6 +612,21 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 {
 	return pte;
 }
+
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
 #endif
 
 #ifndef __HAVE_PFNMAP_TRACKING
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 50e4aa7e7ff9..c22f30a88959 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -179,6 +179,8 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
 {
 	swp_entry_t arch_entry;
 
+	if (pmd_swp_soft_dirty(pmd))
+		pmd = pmd_swp_clear_soft_dirty(pmd);
 	arch_entry = __pmd_to_swp_entry(pmd);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4ac923539372..283c27dd3f36 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -904,6 +904,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (is_write_migration_entry(entry)) {
 			make_migration_entry_read(&entry);
 			pmd = swp_entry_to_pmd(entry);
+			if (pmd_swp_soft_dirty(pmd))
+				pmd = pmd_swp_mksoft_dirty(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
 		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1726,6 +1728,17 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
 }
 #endif
 
+static pmd_t move_soft_dirty_pmd(pmd_t pmd)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	if (unlikely(is_pmd_migration_entry(pmd)))
+		pmd = pmd_swp_mksoft_dirty(pmd);
+	else if (pmd_present(pmd))
+		pmd = pmd_mksoft_dirty(pmd);
+#endif
+	return pmd;
+}
+
 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		  unsigned long new_addr, unsigned long old_end,
 		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
@@ -1768,7 +1781,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
 		}
-		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
+		pmd = move_soft_dirty_pmd(pmd);
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		if (new_ptl != old_ptl)
 			spin_unlock(new_ptl);
 		if (force_flush)
@@ -1816,6 +1830,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 				make_migration_entry_read(&entry);
 				newpmd = swp_entry_to_pmd(entry);
+				if (pmd_swp_soft_dirty(newpmd))
+					newpmd = pmd_swp_mksoft_dirty(newpmd);
 				set_pmd_at(mm, addr, pmd, newpmd);
 			}
 			goto unlock;
@@ -2740,6 +2756,8 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 			set_page_dirty(page);
 		entry = make_migration_entry(page, pmd_write(pmdval));
 		pmdswp = swp_entry_to_pmd(entry);
+		if (pmd_soft_dirty(pmdval))
+			pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 		set_pmd_at(mm, address, pvmw->pmd, pmdswp);
 		page_remove_rmap(page, true);
 		put_page(page);
@@ -2756,6 +2774,8 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 			set_page_dirty(subpage);
 		entry = make_migration_entry(subpage, pte_write(pteval));
 		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
 		set_pte_at(mm, address, pvmw->pte, swp_pte);
 		page_remove_rmap(subpage, false);
 		put_page(subpage);
@@ -2778,6 +2798,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 		entry = pmd_to_swp_entry(*pvmw->pmd);
 		get_page(new);
 		pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+		if (pmd_swp_soft_dirty(*pvmw->pmd))
+			pmde = pmd_mksoft_dirty(pmde);
 		if (is_write_migration_entry(entry))
 			pmde = maybe_pmd_mkwrite(pmde, vma);
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 10/14] mm: soft-dirty: keep soft-dirty bits over thp migration
@ 2017-02-05 16:12   ` Zi Yan
  0 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

The soft-dirty bit is designed to be preserved across page migration. This
patch makes it work in the same manner for thp migration too.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1 -> v2:
- separate diff moving _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
- clear_soft_dirty_pmd can handle migration entry
---
 arch/x86/include/asm/pgtable.h | 17 +++++++++++++++++
 fs/proc/task_mmu.c             | 17 +++++++++++------
 include/asm-generic/pgtable.h  | 34 +++++++++++++++++++++++++++++++++-
 include/linux/swapops.h        |  2 ++
 mm/huge_memory.c               | 24 +++++++++++++++++++++++-
 5 files changed, 86 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1cfb36b8c024..e57abf8e926c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1088,6 +1088,23 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+#endif
 #endif
 
 #define PKRU_AD_BIT 0x1
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1e64d6898c68..e367dc3afea3 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -900,12 +900,17 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
-
-	pmd = pmd_wrprotect(pmd);
-	pmd = pmd_clear_soft_dirty(pmd);
-
-	set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	pmd_t pmd = *pmdp;
+
+	if (pmd_present(pmd)) {
+		pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
+		pmd = pmd_wrprotect(pmd);
+		pmd = pmd_clear_soft_dirty(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+		pmd = pmd_swp_clear_soft_dirty(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	}
 }
 #else
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 6cf9e9b5a7be..f4c4ee5bce2b 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -550,7 +550,24 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
-#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+#else /* !CONFIG_HAVE_ARCH_SOFT_DIRTY */
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return 0;
@@ -595,6 +612,21 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 {
 	return pte;
 }
+
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
 #endif
 
 #ifndef __HAVE_PFNMAP_TRACKING
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 50e4aa7e7ff9..c22f30a88959 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -179,6 +179,8 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
 {
 	swp_entry_t arch_entry;
 
+	if (pmd_swp_soft_dirty(pmd))
+		pmd = pmd_swp_clear_soft_dirty(pmd);
 	arch_entry = __pmd_to_swp_entry(pmd);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4ac923539372..283c27dd3f36 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -904,6 +904,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (is_write_migration_entry(entry)) {
 			make_migration_entry_read(&entry);
 			pmd = swp_entry_to_pmd(entry);
+			if (pmd_swp_soft_dirty(pmd))
+				pmd = pmd_swp_mksoft_dirty(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
 		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1726,6 +1728,17 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
 }
 #endif
 
+static pmd_t move_soft_dirty_pmd(pmd_t pmd)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	if (unlikely(is_pmd_migration_entry(pmd)))
+		pmd = pmd_swp_mksoft_dirty(pmd);
+	else if (pmd_present(pmd))
+		pmd = pmd_mksoft_dirty(pmd);
+#endif
+	return pmd;
+}
+
 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		  unsigned long new_addr, unsigned long old_end,
 		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
@@ -1768,7 +1781,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
 		}
-		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
+		pmd = move_soft_dirty_pmd(pmd);
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		if (new_ptl != old_ptl)
 			spin_unlock(new_ptl);
 		if (force_flush)
@@ -1816,6 +1830,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 				make_migration_entry_read(&entry);
 				newpmd = swp_entry_to_pmd(entry);
+				if (pmd_swp_soft_dirty(newpmd))
+					newpmd = pmd_swp_mksoft_dirty(newpmd);
 				set_pmd_at(mm, addr, pmd, newpmd);
 			}
 			goto unlock;
@@ -2740,6 +2756,8 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 			set_page_dirty(page);
 		entry = make_migration_entry(page, pmd_write(pmdval));
 		pmdswp = swp_entry_to_pmd(entry);
+		if (pmd_soft_dirty(pmdval))
+			pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 		set_pmd_at(mm, address, pvmw->pmd, pmdswp);
 		page_remove_rmap(page, true);
 		put_page(page);
@@ -2756,6 +2774,8 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 			set_page_dirty(subpage);
 		entry = make_migration_entry(subpage, pte_write(pteval));
 		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
 		set_pte_at(mm, address, pvmw->pte, swp_pte);
 		page_remove_rmap(subpage, false);
 		put_page(subpage);
@@ -2778,6 +2798,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 		entry = pmd_to_swp_entry(*pvmw->pmd);
 		get_page(new);
 		pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+		if (pmd_swp_soft_dirty(*pvmw->pmd))
+			pmde = pmd_mksoft_dirty(pmde);
 		if (is_write_migration_entry(entry))
 			pmde = maybe_pmd_mkwrite(pmde, vma);
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 11/14] mm: hwpoison: soft offline supports thp migration
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch enables thp migration for soft offline.
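
For illustration only (not part of this patch): a minimal userspace sketch
that soft-offlines one page inside a (hopefully) THP-backed mapping via
madvise(MADV_SOFT_OFFLINE); with this patch the backing huge page can be
migrated as a whole instead of being split first. The 2MB size assumes
x86_64 THPs, and the call needs CAP_SYS_ADMIN plus CONFIG_MEMORY_FAILURE.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101	/* value from asm-generic/mman-common.h */
#endif

int main(void)
{
	size_t thp_size = 2UL * 1024 * 1024;	/* assumes 2MB THPs (x86_64) */
	void *buf;

	if (posix_memalign(&buf, thp_size, thp_size))
		return 1;
	madvise(buf, thp_size, MADV_HUGEPAGE);	/* hint: back this with a THP */
	memset(buf, 1, thp_size);		/* fault the range in */

	/* soft-offline the first page; the kernel migrates its contents away */
	if (madvise(buf, getpagesize(), MADV_SOFT_OFFLINE))
		perror("madvise(MADV_SOFT_OFFLINE)");
	return 0;
}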

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory-failure.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3f3cfd4e1901..95db94207d01 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1483,7 +1483,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
 	if (PageHuge(p))
 		return alloc_huge_page_node(page_hstate(compound_head(p)),
 						   nid);
-	else
+	else if (thp_migration_supported() && PageTransHuge(p)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(nid,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
 }
 
@@ -1691,28 +1701,11 @@ static int __soft_offline_page(struct page *page, int flags)
 static int soft_offline_in_use_page(struct page *page, int flags)
 {
 	int ret;
-	struct page *hpage = compound_head(page);
-
-	if (!PageHuge(page) && PageTransHuge(hpage)) {
-		lock_page(hpage);
-		if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
-			unlock_page(hpage);
-			if (!PageAnon(hpage))
-				pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
-			else
-				pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
-			put_hwpoison_page(hpage);
-			return -EBUSY;
-		}
-		unlock_page(hpage);
-		get_hwpoison_page(page);
-		put_hwpoison_page(hpage);
-	}
 
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page, flags);
 	else
-		ret = __soft_offline_page(page, flags);
+		ret = __soft_offline_page(compound_head(page), flags);
 
 	return ret;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 11/14] mm: hwpoison: soft offline supports thp migration
@ 2017-02-05 16:12   ` Zi Yan
  0 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch enables thp migration for soft offline.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory-failure.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3f3cfd4e1901..95db94207d01 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1483,7 +1483,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
 	if (PageHuge(p))
 		return alloc_huge_page_node(page_hstate(compound_head(p)),
 						   nid);
-	else
+	else if (thp_migration_supported() && PageTransHuge(p)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(nid,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
 }
 
@@ -1691,28 +1701,11 @@ static int __soft_offline_page(struct page *page, int flags)
 static int soft_offline_in_use_page(struct page *page, int flags)
 {
 	int ret;
-	struct page *hpage = compound_head(page);
-
-	if (!PageHuge(page) && PageTransHuge(hpage)) {
-		lock_page(hpage);
-		if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
-			unlock_page(hpage);
-			if (!PageAnon(hpage))
-				pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
-			else
-				pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
-			put_hwpoison_page(hpage);
-			return -EBUSY;
-		}
-		unlock_page(hpage);
-		get_hwpoison_page(page);
-		put_hwpoison_page(hpage);
-	}
 
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page, flags);
 	else
-		ret = __soft_offline_page(page, flags);
+		ret = __soft_offline_page(compound_head(page), flags);
 
 	return ret;
 }
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 12/14] mm: mempolicy: mbind and migrate_pages support thp migration
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch enables thp migration for mbind(2) and migrate_pages(2).
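
For illustration only (not part of this patch): a minimal userspace sketch
that binds a (hopefully) THP-backed buffer to a NUMA node with mbind(2) and
MPOL_MF_MOVE; with this patch the huge page is migrated without being split.
The target node 1 and the 2MB THP size are assumptions; mbind() is declared
in libnuma's numaif.h.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numaif.h>		/* mbind(), MPOL_*; link with -lnuma */
#include <sys/mman.h>

int main(void)
{
	size_t thp_size = 2UL * 1024 * 1024;	/* assumes 2MB THPs */
	unsigned long nodemask = 1UL << 1;	/* target node 1 (assumption) */
	void *buf;

	if (posix_memalign(&buf, thp_size, thp_size))
		return 1;
	madvise(buf, thp_size, MADV_HUGEPAGE);	/* hint: back this with a THP */
	memset(buf, 1, thp_size);		/* fault the range in */

	/* bind to node 1 and move the pages that already exist in the range */
	if (mbind(buf, thp_size, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, MPOL_MF_MOVE | MPOL_MF_STRICT))
		perror("mbind");
	return 0;
}

Build with "gcc thp_mbind.c -o thp_mbind -lnuma" (file name is made up);
whether the range really gets a THP depends on the system's
transparent_hugepage settings.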

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1 -> v2:
- support pte-mapped and doubly-mapped thp
---
 mm/mempolicy.c | 107 +++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 78 insertions(+), 29 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 021ff13b9a7a..435bb7bec0a5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -487,6 +487,49 @@ static inline bool queue_pages_node_check(struct page *page,
 	return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
 }
 
+static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	int ret = 0;
+	struct page *page;
+	struct queue_pages *qp = walk->private;
+	unsigned long flags;
+
+	if (unlikely(is_pmd_migration_entry(*pmd))) {
+		ret = 1;
+		goto unlock;
+	}
+	page = pmd_page(*pmd);
+	if (is_huge_zero_page(page)) {
+		spin_unlock(ptl);
+		__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
+		goto out;
+	}
+	if (!thp_migration_supported()) {
+		get_page(page);
+		spin_unlock(ptl);
+		lock_page(page);
+		ret = split_huge_page(page);
+		unlock_page(page);
+		put_page(page);
+		goto out;
+	}
+	if (queue_pages_node_check(page, qp)) {
+		ret = 1;
+		goto unlock;
+	}
+
+	ret = 1;
+	flags = qp->flags;
+	/* go to thp migration */
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_page_add(page, qp->pagelist, flags);
+unlock:
+	spin_unlock(ptl);
+out:
+	return ret;
+}
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
@@ -498,30 +541,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	struct queue_pages *qp = walk->private;
 	unsigned long flags = qp->flags;
-	int nid, ret;
+	int ret;
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge(*pmd)) {
-		ptl = pmd_lock(walk->mm, pmd);
-		if (pmd_trans_huge(*pmd)) {
-			page = pmd_page(*pmd);
-			if (is_huge_zero_page(page)) {
-				spin_unlock(ptl);
-				__split_huge_pmd(vma, pmd, addr, false, NULL);
-			} else {
-				get_page(page);
-				spin_unlock(ptl);
-				lock_page(page);
-				ret = split_huge_page(page);
-				unlock_page(page);
-				put_page(page);
-				if (ret)
-					return 0;
-			}
-		} else {
-			spin_unlock(ptl);
-		}
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
+		if (ret)
+			return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
@@ -542,7 +570,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		if (queue_pages_node_check(page, qp))
 			continue;
-		if (PageTransCompound(page)) {
+		if (PageTransCompound(page) && !thp_migration_supported()) {
 			get_page(page);
 			pte_unmap_unlock(pte, ptl);
 			lock_page(page);
@@ -960,19 +988,21 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 
 #ifdef CONFIG_MIGRATION
 /*
- * page migration
+ * page migration, thp tail pages can be passed.
  */
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags)
 {
+	struct page *head = compound_head(page);
 	/*
 	 * Avoid migrating a page that is shared with others.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
-		if (!isolate_lru_page(page)) {
-			list_add_tail(&page->lru, pagelist);
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-					    page_is_file_cache(page));
+	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
+		if (!isolate_lru_page(head)) {
+			list_add_tail(&head->lru, pagelist);
+			mod_node_page_state(page_pgdat(head),
+				NR_ISOLATED_ANON + page_is_file_cache(head),
+				hpage_nr_pages(head));
 		}
 	}
 }
@@ -982,7 +1012,17 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
 	if (PageHuge(page))
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
 					node);
-	else
+	else if (thp_migration_supported() && PageTransHuge(page)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(node,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
 						    __GFP_THISNODE, 0);
 }
@@ -1148,6 +1188,15 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
 	if (PageHuge(page)) {
 		BUG_ON(!vma);
 		return alloc_huge_page_noerr(vma, address, 1);
+	} else if (thp_migration_supported() && PageTransHuge(page)) {
+		struct page *thp;
+
+		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+					 HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
 	}
 	/*
 	 * if !vma, alloc_page_vma() will use task or system default policy
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 12/14] mm: mempolicy: mbind and migrate_pages support thp migration
@ 2017-02-05 16:12   ` Zi Yan
  0 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch enables thp migration for mbind(2) and migrate_pages(2).

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1 -> v2:
- support pte-mapped and doubly-mapped thp
---
 mm/mempolicy.c | 107 +++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 78 insertions(+), 29 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 021ff13b9a7a..435bb7bec0a5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -487,6 +487,49 @@ static inline bool queue_pages_node_check(struct page *page,
 	return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
 }
 
+static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	int ret = 0;
+	struct page *page;
+	struct queue_pages *qp = walk->private;
+	unsigned long flags;
+
+	if (unlikely(is_pmd_migration_entry(*pmd))) {
+		ret = 1;
+		goto unlock;
+	}
+	page = pmd_page(*pmd);
+	if (is_huge_zero_page(page)) {
+		spin_unlock(ptl);
+		__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
+		goto out;
+	}
+	if (!thp_migration_supported()) {
+		get_page(page);
+		spin_unlock(ptl);
+		lock_page(page);
+		ret = split_huge_page(page);
+		unlock_page(page);
+		put_page(page);
+		goto out;
+	}
+	if (queue_pages_node_check(page, qp)) {
+		ret = 1;
+		goto unlock;
+	}
+
+	ret = 1;
+	flags = qp->flags;
+	/* go to thp migration */
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_page_add(page, qp->pagelist, flags);
+unlock:
+	spin_unlock(ptl);
+out:
+	return ret;
+}
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
@@ -498,30 +541,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	struct queue_pages *qp = walk->private;
 	unsigned long flags = qp->flags;
-	int nid, ret;
+	int ret;
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge(*pmd)) {
-		ptl = pmd_lock(walk->mm, pmd);
-		if (pmd_trans_huge(*pmd)) {
-			page = pmd_page(*pmd);
-			if (is_huge_zero_page(page)) {
-				spin_unlock(ptl);
-				__split_huge_pmd(vma, pmd, addr, false, NULL);
-			} else {
-				get_page(page);
-				spin_unlock(ptl);
-				lock_page(page);
-				ret = split_huge_page(page);
-				unlock_page(page);
-				put_page(page);
-				if (ret)
-					return 0;
-			}
-		} else {
-			spin_unlock(ptl);
-		}
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
+		if (ret)
+			return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
@@ -542,7 +570,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		if (queue_pages_node_check(page, qp))
 			continue;
-		if (PageTransCompound(page)) {
+		if (PageTransCompound(page) && !thp_migration_supported()) {
 			get_page(page);
 			pte_unmap_unlock(pte, ptl);
 			lock_page(page);
@@ -960,19 +988,21 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 
 #ifdef CONFIG_MIGRATION
 /*
- * page migration
+ * page migration, thp tail pages can be passed.
  */
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags)
 {
+	struct page *head = compound_head(page);
 	/*
 	 * Avoid migrating a page that is shared with others.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
-		if (!isolate_lru_page(page)) {
-			list_add_tail(&page->lru, pagelist);
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-					    page_is_file_cache(page));
+	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
+		if (!isolate_lru_page(head)) {
+			list_add_tail(&head->lru, pagelist);
+			mod_node_page_state(page_pgdat(head),
+				NR_ISOLATED_ANON + page_is_file_cache(head),
+				hpage_nr_pages(head));
 		}
 	}
 }
@@ -982,7 +1012,17 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
 	if (PageHuge(page))
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
 					node);
-	else
+	else if (thp_migration_supported() && PageTransHuge(page)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(node,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
 						    __GFP_THISNODE, 0);
 }
@@ -1148,6 +1188,15 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
 	if (PageHuge(page)) {
 		BUG_ON(!vma);
 		return alloc_huge_page_noerr(vma, address, 1);
+	} else if (thp_migration_supported() && PageTransHuge(page)) {
+		struct page *thp;
+
+		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+					 HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
 	}
 	/*
 	 * if !vma, alloc_page_vma() will use task or system default policy
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 13/14] mm: migrate: move_pages() supports thp migration
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch enables thp migration for move_pages(2).
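
For illustration only (not part of this patch): a minimal userspace sketch
that moves the head page of a (hopefully) THP-backed buffer with
move_pages(2); with this patch the whole huge page is migrated rather than
split. The target node 1 and the 2MB THP size are assumptions; move_pages()
is declared in libnuma's numaif.h.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numaif.h>		/* move_pages(); link with -lnuma */
#include <sys/mman.h>

int main(void)
{
	size_t thp_size = 2UL * 1024 * 1024;	/* assumes 2MB THPs */
	void *pages[1];
	int nodes[1] = { 1 };			/* target node 1 (assumption) */
	int status[1] = { -1 };
	void *buf;

	if (posix_memalign(&buf, thp_size, thp_size))
		return 1;
	madvise(buf, thp_size, MADV_HUGEPAGE);	/* hint: back this with a THP */
	memset(buf, 1, thp_size);		/* fault the range in */

	pages[0] = buf;				/* head of the would-be THP */
	/* pid 0 == current process; status[] reports the resulting node */
	if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE))
		perror("move_pages");
	else
		printf("page is now on node %d\n", status[0]);
	return 0;
}

Build with "gcc thp_move_pages.c -o thp_move_pages -lnuma" (file name is
made up).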

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/migrate.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 84181a3668c6..9bcaccb481ac 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1413,7 +1413,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
 	if (PageHuge(p))
 		return alloc_huge_page_node(page_hstate(compound_head(p)),
 					pm->node);
-	else
+	else if (thp_migration_supported() && PageTransHuge(p)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(pm->node,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(pm->node,
 				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
 }
@@ -1440,6 +1450,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 	for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
 		struct vm_area_struct *vma;
 		struct page *page;
+		struct page *head;
+		unsigned int follflags;
 
 		err = -EFAULT;
 		vma = find_vma(mm, pp->addr);
@@ -1447,8 +1459,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 			goto set_status;
 
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, pp->addr,
-				FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
+		follflags = FOLL_GET | FOLL_DUMP;
+		if (!thp_migration_supported())
+			follflags |= FOLL_SPLIT;
+		page = follow_page(vma, pp->addr, follflags);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1458,7 +1472,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		if (!page)
 			goto set_status;
 
-		pp->page = page;
 		err = page_to_nid(page);
 
 		if (err == pp->node)
@@ -1473,16 +1486,22 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 			goto put_and_set;
 
 		if (PageHuge(page)) {
-			if (PageHead(page))
+			if (PageHead(page)) {
 				isolate_huge_page(page, &pagelist);
+				err = 0;
+				pp->page = page;
+			}
 			goto put_and_set;
 		}
 
-		err = isolate_lru_page(page);
+		pp->page = compound_head(page);
+		head = compound_head(page);
+		err = isolate_lru_page(head);
 		if (!err) {
-			list_add_tail(&page->lru, &pagelist);
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-					    page_is_file_cache(page));
+			list_add_tail(&head->lru, &pagelist);
+			mod_node_page_state(page_pgdat(head),
+				NR_ISOLATED_ANON + page_is_file_cache(head),
+				hpage_nr_pages(head));
 		}
 put_and_set:
 		/*
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 14/14] mm: memory_hotplug: memory hotremove supports thp migration
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-05 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-05 16:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch enables thp migration for memory hotremove.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1->v2:
- base code switched from alloc_migrate_target to new_node_page()
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9cb4c83151a8..e988ea15cc99 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1593,10 +1593,10 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		gfp_mask |= __GFP_HIGHMEM;
 
 	if (!nodes_empty(nmask))
-		new_page = __alloc_pages_nodemask(gfp_mask, 0,
+		new_page = __alloc_pages_nodemask(gfp_mask, order,
 					node_zonelist(nid, gfp_mask), &nmask);
 	if (!new_page)
-		new_page = __alloc_pages(gfp_mask, 0,
+		new_page = __alloc_pages(gfp_mask, order,
 					node_zonelist(nid, gfp_mask));
 
 	if (new_page && order == hpage_order(page))
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-06  4:02     ` Hillf Danton
  -1 siblings, 0 replies; 87+ messages in thread
From: Hillf Danton @ 2017-02-06  4:02 UTC (permalink / raw)
  To: 'Zi Yan', linux-kernel, linux-mm, kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan,
	'Zi Yan'


On February 06, 2017 12:13 AM Zi Yan wrote: 
> 
> @@ -1233,33 +1233,31 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>  				struct zap_details *details)
>  {
>  	pmd_t *pmd;
> +	spinlock_t *ptl;
>  	unsigned long next;
> 
>  	pmd = pmd_offset(pud, addr);
> +	ptl = pmd_lock(vma->vm_mm, pmd);
>  	do {
>  		next = pmd_addr_end(addr, end);
>  		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>  			if (next - addr != HPAGE_PMD_SIZE) {
>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
> -				__split_huge_pmd(vma, pmd, addr, false, NULL);
> -			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
> -				goto next;
> +				__split_huge_pmd_locked(vma, pmd, addr, false);
> +			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
> +				continue;
>  			/* fall through */
>  		}
> -		/*
> -		 * Here there can be other concurrent MADV_DONTNEED or
> -		 * trans huge page faults running, and if the pmd is
> -		 * none or trans huge it can change under us. This is
> -		 * because MADV_DONTNEED holds the mmap_sem in read
> -		 * mode.
> -		 */
> -		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> -			goto next;
> +
> +		if (pmd_none_or_clear_bad(pmd))
> +			continue;
> +		spin_unlock(ptl);
>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> -next:
>  		cond_resched();
> +		spin_lock(ptl);
>  	} while (pmd++, addr = next, addr != end);

The spin_lock() here is being used in place of pmd_lock().

> +	spin_unlock(ptl);
> 
>  	return addr;
>  }

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-06  4:02     ` Hillf Danton
@ 2017-02-06  4:14       ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-06  4:14 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, n-horiguchi, khandual, Zi Yan

[-- Attachment #1: Type: text/plain, Size: 1986 bytes --]

On 5 Feb 2017, at 22:02, Hillf Danton wrote:

> On February 06, 2017 12:13 AM Zi Yan wrote:
>>
>> @@ -1233,33 +1233,31 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>>  				struct zap_details *details)
>>  {
>>  	pmd_t *pmd;
>> +	spinlock_t *ptl;
>>  	unsigned long next;
>>
>>  	pmd = pmd_offset(pud, addr);
>> +	ptl = pmd_lock(vma->vm_mm, pmd);
>>  	do {
>>  		next = pmd_addr_end(addr, end);
>>  		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>>  			if (next - addr != HPAGE_PMD_SIZE) {
>>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
>>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
>> -				__split_huge_pmd(vma, pmd, addr, false, NULL);
>> -			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
>> -				goto next;
>> +				__split_huge_pmd_locked(vma, pmd, addr, false);
>> +			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
>> +				continue;
>>  			/* fall through */
>>  		}
>> -		/*
>> -		 * Here there can be other concurrent MADV_DONTNEED or
>> -		 * trans huge page faults running, and if the pmd is
>> -		 * none or trans huge it can change under us. This is
>> -		 * because MADV_DONTNEED holds the mmap_sem in read
>> -		 * mode.
>> -		 */
>> -		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
>> -			goto next;
>> +
>> +		if (pmd_none_or_clear_bad(pmd))
>> +			continue;
>> +		spin_unlock(ptl);
>>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
>> -next:
>>  		cond_resched();
>> +		spin_lock(ptl);
>>  	} while (pmd++, addr = next, addr != end);
>
> spin_lock() is appointed to the bench of pmd_lock().

Any problem with this?

The code is trying to lock this PMD page to avoid other concurrent changes,
and it only unlocks it when we want to go deeper into the PTE range.

Locking the PMD page while handling at most 512 entries should be
acceptable, since zap_pte_range() does similar work for 512 PTEs.

>
>> +	spin_unlock(ptl);
>>
>>  	return addr;
>>  }


--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 01/14] mm: thp: make __split_huge_pmd_locked visible.
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-06  6:12     ` Naoya Horiguchi
  -1 siblings, 0 replies; 87+ messages in thread
From: Naoya Horiguchi @ 2017-02-06  6:12 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual, zi.yan, Zi Yan

On Sun, Feb 05, 2017 at 11:12:39AM -0500, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> It allows splitting huge pmd while you are holding the pmd lock.
> It is prepared for future zap_pmd_range() use.
> 
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> ---
>  include/linux/huge_mm.h |  2 ++
>  mm/huge_memory.c        | 22 ++++++++++++----------
>  2 files changed, 14 insertions(+), 10 deletions(-)
> 
...
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 03e4566fc226..cd66532ef667 100644
...
> @@ -2036,10 +2039,9 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  			clear_page_mlock(page);
>  	} else if (!pmd_devmap(*pmd))
>  		goto out;
> -	__split_huge_pmd_locked(vma, pmd, haddr, freeze);
> +	__split_huge_pmd_locked(vma, pmd, address, freeze);

Could you explain what is intended by this change?
If some caller (e.g. wp_huge_pmd?) calls __split_huge_pmd() with an
address that is not aligned to the pmd boundary, __split_huge_pmd_locked() ends
up triggering VM_BUG_ON(haddr & ~HPAGE_PMD_MASK).

Thanks,
Naoya Horiguchi

>  out:
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
>  }
>  
>  void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-06  7:43     ` Naoya Horiguchi
  -1 siblings, 0 replies; 87+ messages in thread
From: Naoya Horiguchi @ 2017-02-06  7:43 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual, zi.yan, Zi Yan

On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Originally, zap_pmd_range() checks pmd value without taking pmd lock.
> This can cause pmd_protnone entry not being freed.
> 
> Because there are two steps in changing a pmd entry to a pmd_protnone
> entry. First, the pmd entry is cleared to a pmd_none entry, then,
> the pmd_none entry is changed into a pmd_protnone entry.
> The racy check, even with barrier, might only see the pmd_none entry
> in zap_pmd_range(), thus, the mapping is neither split nor zapped.
> 
> Later, in free_pmd_range(), pmd_none_or_clear() will see the
> pmd_protnone entry and clear it as a pmd_bad entry. Furthermore,
> since the pmd_protnone entry is not properly freed, the corresponding
> deposited pte page table is not freed either.
> 
> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
> 
> This patch relies on __split_huge_pmd_locked() and
> __zap_huge_pmd_locked().
> 
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> ---
>  mm/memory.c | 24 +++++++++++-------------
>  1 file changed, 11 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 3929b015faf7..7cfdd5208ef5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1233,33 +1233,31 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>  				struct zap_details *details)
>  {
>  	pmd_t *pmd;
> +	spinlock_t *ptl;
>  	unsigned long next;
>  
>  	pmd = pmd_offset(pud, addr);
> +	ptl = pmd_lock(vma->vm_mm, pmd);

If USE_SPLIT_PMD_PTLOCKS is true, pmd_lock() returns a different ptl for
each pmd. The following code runs over the pmds within [addr, end) with
a single ptl (that of the first pmd), so I doubt this locking really works.
Maybe pmd_lock() should be called inside the while loop?

Thanks,
Naoya Horiguchi

>  	do {
>  		next = pmd_addr_end(addr, end);
>  		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>  			if (next - addr != HPAGE_PMD_SIZE) {
>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
> -				__split_huge_pmd(vma, pmd, addr, false, NULL);
> -			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
> -				goto next;
> +				__split_huge_pmd_locked(vma, pmd, addr, false);
> +			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
> +				continue;
>  			/* fall through */
>  		}
> -		/*
> -		 * Here there can be other concurrent MADV_DONTNEED or
> -		 * trans huge page faults running, and if the pmd is
> -		 * none or trans huge it can change under us. This is
> -		 * because MADV_DONTNEED holds the mmap_sem in read
> -		 * mode.
> -		 */
> -		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> -			goto next;
> +
> +		if (pmd_none_or_clear_bad(pmd))
> +			continue;
> +		spin_unlock(ptl);
>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> -next:
>  		cond_resched();
> +		spin_lock(ptl);
>  	} while (pmd++, addr = next, addr != end);
> +	spin_unlock(ptl);
>  
>  	return addr;
>  }
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 01/14] mm: thp: make __split_huge_pmd_locked visible.
  2017-02-06  6:12     ` Naoya Horiguchi
  (?)
@ 2017-02-06 12:10     ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-06 12:10 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual, Zi Yan

[-- Attachment #1: Type: text/plain, Size: 1728 bytes --]

On 6 Feb 2017, at 0:12, Naoya Horiguchi wrote:

> On Sun, Feb 05, 2017 at 11:12:39AM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> It allows splitting huge pmd while you are holding the pmd lock.
>> It is prepared for future zap_pmd_range() use.
>>
>> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
>> ---
>>  include/linux/huge_mm.h |  2 ++
>>  mm/huge_memory.c        | 22 ++++++++++++----------
>>  2 files changed, 14 insertions(+), 10 deletions(-)
>>
> ...
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 03e4566fc226..cd66532ef667 100644
> ...
>> @@ -2036,10 +2039,9 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>  			clear_page_mlock(page);
>>  	} else if (!pmd_devmap(*pmd))
>>  		goto out;
>> -	__split_huge_pmd_locked(vma, pmd, haddr, freeze);
>> +	__split_huge_pmd_locked(vma, pmd, address, freeze);
>
> Could you explain what is intended on this change?
> If some caller (f.e. wp_huge_pmd?) could call __split_huge_pmd() with
> address not aligned with pmd border, __split_huge_pmd_locked() results in
> triggering VM_BUG_ON(haddr & ~HPAGE_PMD_MASK).

This change is intended for any caller that already holds the pmd lock. For now,
it is used by this call site only.

In Patch 2, I moved unsigned long haddr = address & HPAGE_PMD_MASK;
from __split_huge_pmd() to __split_huge_pmd_locked(), so VM_BUG_ON(haddr & ~HPAGE_PMD_MASK)
will not be triggered.
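
A minimal sketch of what the Patch 2 change amounts to (not the exact hunk,
just the idea):

void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
		unsigned long address, bool freeze)
{
	/* align inside the locked helper, so callers may pass any address */
	unsigned long haddr = address & HPAGE_PMD_MASK;

	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);	/* now trivially satisfied */
	/* ... the rest of the split logic keeps operating on haddr ... */
}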



>
> Thanks,
> Naoya Horiguchi
>
>>  out:
>>  	spin_unlock(ptl);
>> -	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
>>  }
>>
>>  void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
>> -- 
>> 2.11.0
>>


--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-06  7:43     ` Naoya Horiguchi
  (?)
@ 2017-02-06 13:02     ` Zi Yan
  2017-02-06 23:22         ` Naoya Horiguchi
  -1 siblings, 1 reply; 87+ messages in thread
From: Zi Yan @ 2017-02-06 13:02 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual, Zi Yan

[-- Attachment #1: Type: text/plain, Size: 5666 bytes --]

On 6 Feb 2017, at 1:43, Naoya Horiguchi wrote:

> On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Originally, zap_pmd_range() checks pmd value without taking pmd lock.
>> This can cause pmd_protnone entry not being freed.
>>
>> Because there are two steps in changing a pmd entry to a pmd_protnone
>> entry. First, the pmd entry is cleared to a pmd_none entry, then,
>> the pmd_none entry is changed into a pmd_protnone entry.
>> The racy check, even with barrier, might only see the pmd_none entry
>> in zap_pmd_range(), thus, the mapping is neither split nor zapped.
>>
>> Later, in free_pmd_range(), pmd_none_or_clear() will see the
>> pmd_protnone entry and clear it as a pmd_bad entry. Furthermore,
>> since the pmd_protnone entry is not properly freed, the corresponding
>> deposited pte page table is not freed either.
>>
>> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
>>
>> This patch relies on __split_huge_pmd_locked() and
>> __zap_huge_pmd_locked().
>>
>> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
>> ---
>>  mm/memory.c | 24 +++++++++++-------------
>>  1 file changed, 11 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 3929b015faf7..7cfdd5208ef5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1233,33 +1233,31 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>>  				struct zap_details *details)
>>  {
>>  	pmd_t *pmd;
>> +	spinlock_t *ptl;
>>  	unsigned long next;
>>
>>  	pmd = pmd_offset(pud, addr);
>> +	ptl = pmd_lock(vma->vm_mm, pmd);
>
> If USE_SPLIT_PMD_PTLOCKS is true, pmd_lock() returns different ptl for
> each pmd. The following code runs over pmds within [addr, end) with
> a single ptl (of the first pmd,) so I suspect this locking really works.
> Maybe pmd_lock() should be called inside while loop?

According to include/linux/mm.h, pmd_lockptr() first gets the page the pmd is in,
using mask = ~(PTRS_PER_PMD * sizeof(pmd_t) -1) = 0xfffffffffffff000 and virt_to_page().
Then, ptlock_ptr() gets spinlock_t either from page->ptl (split case) or
mm->page_table_lock (not split case).

It seems to me that all PMDs in one page table page share a single spinlock. Let me know
if I misunderstand any code.
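
A condensed sketch of those helpers (paraphrasing include/linux/mm.h for the
split PMD lock case; not a literal copy):

static struct page *pmd_to_page(pmd_t *pmd)
{
	/* round down to the start of the pmd page, then find its struct page */
	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);

	return virt_to_page((void *)((unsigned long)pmd & mask));
}

static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	/* split case: one ptl per pmd page; otherwise it is mm->page_table_lock */
	return ptlock_ptr(pmd_to_page(pmd));
}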

But your suggestion can avoid holding the pmd lock for long stretches without
cond_resched(), so I can move the spinlock inside the loop.

Thanks.

diff --git a/mm/memory.c b/mm/memory.c
index 5299b261c4b4..ff61d45eaea7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1260,31 +1260,34 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
                                struct zap_details *details)
 {
        pmd_t *pmd;
-       spinlock_t *ptl;
+       spinlock_t *ptl = NULL;
        unsigned long next;

        pmd = pmd_offset(pud, addr);
-       ptl = pmd_lock(vma->vm_mm, pmd);
        do {
+               ptl = pmd_lock(vma->vm_mm, pmd);
                next = pmd_addr_end(addr, end);
                if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
                        if (next - addr != HPAGE_PMD_SIZE) {
                                VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
                                    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
                                __split_huge_pmd_locked(vma, pmd, addr, false);
-                       } else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
-                               continue;
+                       } else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr)) {
+                               spin_unlock(ptl);
+                               goto next;
+                       }
                        /* fall through */
                }

-               if (pmd_none_or_clear_bad(pmd))
-                       continue;
+               if (pmd_none_or_clear_bad(pmd)) {
+                       spin_unlock(ptl);
+                       goto next;
+               }
                spin_unlock(ptl);
                next = zap_pte_range(tlb, vma, pmd, addr, next, details);
+next:
                cond_resched();
-               spin_lock(ptl);
        } while (pmd++, addr = next, addr != end);
-       spin_unlock(ptl);

        return addr;
 }


>
> Thanks,
> Naoya Horiguchi
>
>>  	do {
>>  		next = pmd_addr_end(addr, end);
>>  		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>>  			if (next - addr != HPAGE_PMD_SIZE) {
>>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
>>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
>> -				__split_huge_pmd(vma, pmd, addr, false, NULL);
>> -			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
>> -				goto next;
>> +				__split_huge_pmd_locked(vma, pmd, addr, false);
>> +			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
>> +				continue;
>>  			/* fall through */
>>  		}
>> -		/*
>> -		 * Here there can be other concurrent MADV_DONTNEED or
>> -		 * trans huge page faults running, and if the pmd is
>> -		 * none or trans huge it can change under us. This is
>> -		 * because MADV_DONTNEED holds the mmap_sem in read
>> -		 * mode.
>> -		 */
>> -		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
>> -			goto next;
>> +
>> +		if (pmd_none_or_clear_bad(pmd))
>> +			continue;
>> +		spin_unlock(ptl);
>>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
>> -next:
>>  		cond_resched();
>> +		spin_lock(ptl);
>>  	} while (pmd++, addr = next, addr != end);
>> +	spin_unlock(ptl);
>>
>>  	return addr;
>>  }
>> -- 
>> 2.11.0
>>


--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 01/14] mm: thp: make __split_huge_pmd_locked visible.
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-06 15:02     ` Matthew Wilcox
  -1 siblings, 0 replies; 87+ messages in thread
From: Matthew Wilcox @ 2017-02-06 15:02 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, n-horiguchi, khandual, zi.yan, Zi Yan

On Sun, Feb 05, 2017 at 11:12:39AM -0500, Zi Yan wrote:
> +++ b/include/linux/huge_mm.h
> @@ -120,6 +120,8 @@ static inline int split_huge_page(struct page *page)
>  }
>  void deferred_split_huge_page(struct page *page);
>  
> +void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> +		unsigned long haddr, bool freeze);

Could you change that from 'haddr' to 'address', so that callers who only
read the header rather than the implementation aren't left expecting to
align it themselves?

> +void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> +		unsigned long address, bool freeze)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct page *page;

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 01/14] mm: thp: make __split_huge_pmd_locked visible.
  2017-02-06 15:02     ` Matthew Wilcox
  (?)
@ 2017-02-06 15:03     ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-06 15:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, n-horiguchi, khandual, Zi Yan

[-- Attachment #1: Type: text/plain, Size: 839 bytes --]

On 6 Feb 2017, at 9:02, Matthew Wilcox wrote:

> On Sun, Feb 05, 2017 at 11:12:39AM -0500, Zi Yan wrote:
>> +++ b/include/linux/huge_mm.h
>> @@ -120,6 +120,8 @@ static inline int split_huge_page(struct page *page)
>>  }
>>  void deferred_split_huge_page(struct page *page);
>>
>> +void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> +		unsigned long haddr, bool freeze);
>
> Could you change that from 'haddr' to 'address' so callers who only
> read the header instead of the implementation aren't expecting to align
> it themselves?

Sure. I will do that to avoid confusion.

Thanks for pointing it out.


>
>> +void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> +		unsigned long address, bool freeze)
>>  {
>>  	struct mm_struct *mm = vma->vm_mm;
>>  	struct page *page;


--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-06 16:07     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 87+ messages in thread
From: Kirill A. Shutemov @ 2017-02-06 16:07 UTC (permalink / raw)
  To: Zi Yan, mgorman, riel
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	n-horiguchi, khandual, zi.yan, Zi Yan

On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Originally, zap_pmd_range() checks pmd value without taking pmd lock.
> This can cause pmd_protnone entry not being freed.
> 
> Because there are two steps in changing a pmd entry to a pmd_protnone
> entry. First, the pmd entry is cleared to a pmd_none entry, then,
> the pmd_none entry is changed into a pmd_protnone entry.
> The racy check, even with barrier, might only see the pmd_none entry
> in zap_pmd_range(), thus, the mapping is neither split nor zapped.

That's definitely a good catch.

But I don't agree with the solution. Taking the pmd lock on each
zap_pmd_range() is a significant hit to the scalability of that code path.
Yes, the split ptl lock helps, but it would be nice to avoid the lock in the
first place.

Can we fix change_huge_pmd() instead? Is there a reason why we cannot
set up the pmd_protnone() entry atomically?

Mel? Rik?

> 
> Later, in free_pmd_range(), pmd_none_or_clear() will see the
> pmd_protnone entry and clear it as a pmd_bad entry. Furthermore,
> since the pmd_protnone entry is not properly freed, the corresponding
> deposited pte page table is not freed either.
> 
> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
> 
> This patch relies on __split_huge_pmd_locked() and
> __zap_huge_pmd_locked().
> 
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> ---
>  mm/memory.c | 24 +++++++++++-------------
>  1 file changed, 11 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 3929b015faf7..7cfdd5208ef5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1233,33 +1233,31 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>  				struct zap_details *details)
>  {
>  	pmd_t *pmd;
> +	spinlock_t *ptl;
>  	unsigned long next;
>  
>  	pmd = pmd_offset(pud, addr);
> +	ptl = pmd_lock(vma->vm_mm, pmd);
>  	do {
>  		next = pmd_addr_end(addr, end);
>  		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>  			if (next - addr != HPAGE_PMD_SIZE) {
>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
> -				__split_huge_pmd(vma, pmd, addr, false, NULL);
> -			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
> -				goto next;
> +				__split_huge_pmd_locked(vma, pmd, addr, false);
> +			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
> +				continue;
>  			/* fall through */
>  		}
> -		/*
> -		 * Here there can be other concurrent MADV_DONTNEED or
> -		 * trans huge page faults running, and if the pmd is
> -		 * none or trans huge it can change under us. This is
> -		 * because MADV_DONTNEED holds the mmap_sem in read
> -		 * mode.
> -		 */
> -		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> -			goto next;
> +
> +		if (pmd_none_or_clear_bad(pmd))
> +			continue;
> +		spin_unlock(ptl);
>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> -next:
>  		cond_resched();
> +		spin_lock(ptl);
>  	} while (pmd++, addr = next, addr != end);
> +	spin_unlock(ptl);
>  
>  	return addr;
>  }
> -- 
> 2.11.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-06 16:07     ` Kirill A. Shutemov
  (?)
@ 2017-02-06 16:32     ` Zi Yan
  2017-02-06 17:35         ` Kirill A. Shutemov
  -1 siblings, 1 reply; 87+ messages in thread
From: Zi Yan @ 2017-02-06 16:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: mgorman, riel, linux-kernel, linux-mm, kirill.shutemov, akpm,
	minchan, vbabka, n-horiguchi, khandual, Zi Yan

[-- Attachment #1: Type: text/plain, Size: 4660 bytes --]

On 6 Feb 2017, at 10:07, Kirill A. Shutemov wrote:

> On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Originally, zap_pmd_range() checks pmd value without taking pmd lock.
>> This can cause pmd_protnone entry not being freed.
>>
>> Because there are two steps in changing a pmd entry to a pmd_protnone
>> entry. First, the pmd entry is cleared to a pmd_none entry, then,
>> the pmd_none entry is changed into a pmd_protnone entry.
>> The racy check, even with barrier, might only see the pmd_none entry
>> in zap_pmd_range(), thus, the mapping is neither split nor zapped.
>
> That's definately a good catch.
>
> But I don't agree with the solution. Taking pmd lock on each
> zap_pmd_range() is a significant hit by scalability of the code path.
> Yes, split ptl lock helps, but it would be nice to avoid the lock in first
> place.
>
> Can we fix change_huge_pmd() instead? Is there a reason why we cannot
> setup the pmd_protnone() atomically?

If you want to set up the pmd_protnone() entry atomically, we need a new way of
changing pmds, something like a pmdp_huge_cmp_exchange_and_clear(). Otherwise,
because of the racy check of the pmd in zap_pmd_range(), it is impossible to
eliminate the chance of hitting this bug as long as pmd_protnone() is set up
in two steps: first clear the pmd, then set the new entry.
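
For reference, the two-step update in question looks roughly like this (a
simplified sketch of the change_huge_pmd() path, not an exact quote):

	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd); /* pmd is pmd_none here */
	entry = pmd_modify(entry, newprot);
	set_pmd_at(mm, addr, pmd, entry);	/* protnone only becomes visible now */

The racy check in zap_pmd_range() can run in the window between the clear and
the set, and then it sees nothing but pmd_none.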

However, if we use pmdp_huge_cmp_exchange_and_clear() to change pmds from now on,
instead of the current two-step approach, it will rule out the batched TLB
shootdown optimization (introduced by Mel Gorman for base page swapping)
once THP becomes swappable in the future. Maybe other optimizations as well?

Why do you think holding the pmd lock is bad? In zap_pte_range(), the pte lock
is also held while each PTE is zapped.

BTW, I am following Naoya's suggestion and going to take the pmd lock inside
the loop. So the pmd lock is held while each pmd is being checked, and it is
released once the pmd entry has been zapped, split, or found to point to a page table.
Does it still hurt performance much?

Thanks.



>
> Mel? Rik?
>
>>
>> Later, in free_pmd_range(), pmd_none_or_clear() will see the
>> pmd_protnone entry and clear it as a pmd_bad entry. Furthermore,
>> since the pmd_protnone entry is not properly freed, the corresponding
>> deposited pte page table is not freed either.
>>
>> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
>>
>> This patch relies on __split_huge_pmd_locked() and
>> __zap_huge_pmd_locked().
>>
>> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
>> ---
>>  mm/memory.c | 24 +++++++++++-------------
>>  1 file changed, 11 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 3929b015faf7..7cfdd5208ef5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1233,33 +1233,31 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>>  				struct zap_details *details)
>>  {
>>  	pmd_t *pmd;
>> +	spinlock_t *ptl;
>>  	unsigned long next;
>>
>>  	pmd = pmd_offset(pud, addr);
>> +	ptl = pmd_lock(vma->vm_mm, pmd);
>>  	do {
>>  		next = pmd_addr_end(addr, end);
>>  		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>>  			if (next - addr != HPAGE_PMD_SIZE) {
>>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
>>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
>> -				__split_huge_pmd(vma, pmd, addr, false, NULL);
>> -			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
>> -				goto next;
>> +				__split_huge_pmd_locked(vma, pmd, addr, false);
>> +			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
>> +				continue;
>>  			/* fall through */
>>  		}
>> -		/*
>> -		 * Here there can be other concurrent MADV_DONTNEED or
>> -		 * trans huge page faults running, and if the pmd is
>> -		 * none or trans huge it can change under us. This is
>> -		 * because MADV_DONTNEED holds the mmap_sem in read
>> -		 * mode.
>> -		 */
>> -		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
>> -			goto next;
>> +
>> +		if (pmd_none_or_clear_bad(pmd))
>> +			continue;
>> +		spin_unlock(ptl);
>>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
>> -next:
>>  		cond_resched();
>> +		spin_lock(ptl);
>>  	} while (pmd++, addr = next, addr != end);
>> +	spin_unlock(ptl);
>>
>>  	return addr;
>>  }
>> -- 
>> 2.11.0
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
> -- 
>  Kirill A. Shutemov


--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-06 16:32     ` Zi Yan
@ 2017-02-06 17:35         ` Kirill A. Shutemov
  0 siblings, 0 replies; 87+ messages in thread
From: Kirill A. Shutemov @ 2017-02-06 17:35 UTC (permalink / raw)
  To: Zi Yan
  Cc: mgorman, riel, linux-kernel, linux-mm, kirill.shutemov, akpm,
	minchan, vbabka, n-horiguchi, khandual, Zi Yan

On Mon, Feb 06, 2017 at 10:32:10AM -0600, Zi Yan wrote:
> On 6 Feb 2017, at 10:07, Kirill A. Shutemov wrote:
> 
> > On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
> >> From: Zi Yan <ziy@nvidia.com>
> >>
> >> Originally, zap_pmd_range() checks pmd value without taking pmd lock.
> >> This can cause pmd_protnone entry not being freed.
> >>
> >> Because there are two steps in changing a pmd entry to a pmd_protnone
> >> entry. First, the pmd entry is cleared to a pmd_none entry, then,
> >> the pmd_none entry is changed into a pmd_protnone entry.
> >> The racy check, even with barrier, might only see the pmd_none entry
> >> in zap_pmd_range(), thus, the mapping is neither split nor zapped.
> >
> > That's definately a good catch.
> >
> > But I don't agree with the solution. Taking pmd lock on each
> > zap_pmd_range() is a significant hit by scalability of the code path.
> > Yes, split ptl lock helps, but it would be nice to avoid the lock in first
> > place.
> >
> > Can we fix change_huge_pmd() instead? Is there a reason why we cannot
> > setup the pmd_protnone() atomically?
> 
> If you want to setup the pmd_protnone() atomically, we need a new way of
> changing pmds, like pmdp_huge_cmp_exchange_and_clear(). Otherwise, due to
> the nature of racy check of pmd in zap_pmd_range(), it is impossible to
> eliminate the chance of catching this bug if pmd_protnone() is setup
> in two steps: first, clear it, second, set it.
> 
> However, if we use pmdp_huge_cmp_exchange_and_clear() to change pmds from now on,
> instead of current two-step approach, it will eliminate the possibility of
> using batched TLB shootdown optimization (introduced by Mel Gorman for base page swapping)
> when THP is swappable in the future. Maybe other optimizations?

I'll think about this more.

> Why do you think holding pmd lock is bad?

Each additional atomic operation in the fast path hurts scalability.
The cost of atomic operations rises quickly as the machine gets bigger.

> In zap_pte_range(), pte lock is also held when each PTE is zapped.

It's a necessary evil for the pte. Not so much for the pmd, so far.

> BTW, I am following Naoya's suggestion and going to take pmd lock inside
> the loop. So pmd lock is held when each pmd is being checked and it will be released
> when the pmd entry is zapped, split, or pointed to a page table.
> Does it still hurt much on performance?

Naoya's suggestion is not correct: pmd_lock() can differ per pmd table,
not per pmd entry, and all the pmds walked here live in one table. So taking
it outside of the loop is correct.


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-06 13:02     ` Zi Yan
@ 2017-02-06 23:22         ` Naoya Horiguchi
  0 siblings, 0 replies; 87+ messages in thread
From: Naoya Horiguchi @ 2017-02-06 23:22 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual, Zi Yan

On Mon, Feb 06, 2017 at 07:02:41AM -0600, Zi Yan wrote:
> On 6 Feb 2017, at 1:43, Naoya Horiguchi wrote:
> 
> > On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
> >> From: Zi Yan <ziy@nvidia.com>
> >>
> >> Originally, zap_pmd_range() checks pmd value without taking pmd lock.
> >> This can cause pmd_protnone entry not being freed.
> >>
> >> Because there are two steps in changing a pmd entry to a pmd_protnone
> >> entry. First, the pmd entry is cleared to a pmd_none entry, then,
> >> the pmd_none entry is changed into a pmd_protnone entry.
> >> The racy check, even with barrier, might only see the pmd_none entry
> >> in zap_pmd_range(), thus, the mapping is neither split nor zapped.
> >>
> >> Later, in free_pmd_range(), pmd_none_or_clear() will see the
> >> pmd_protnone entry and clear it as a pmd_bad entry. Furthermore,
> >> since the pmd_protnone entry is not properly freed, the corresponding
> >> deposited pte page table is not freed either.
> >>
> >> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
> >>
> >> This patch relies on __split_huge_pmd_locked() and
> >> __zap_huge_pmd_locked().
> >>
> >> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> >> ---
> >>  mm/memory.c | 24 +++++++++++-------------
> >>  1 file changed, 11 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 3929b015faf7..7cfdd5208ef5 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -1233,33 +1233,31 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> >>  				struct zap_details *details)
> >>  {
> >>  	pmd_t *pmd;
> >> +	spinlock_t *ptl;
> >>  	unsigned long next;
> >>
> >>  	pmd = pmd_offset(pud, addr);
> >> +	ptl = pmd_lock(vma->vm_mm, pmd);
> >
> > If USE_SPLIT_PMD_PTLOCKS is true, pmd_lock() returns different ptl for
> > each pmd. The following code runs over pmds within [addr, end) with
> > a single ptl (of the first pmd,) so I suspect this locking really works.
> > Maybe pmd_lock() should be called inside while loop?
> 
> According to include/linux/mm.h, pmd_lockptr() first gets the page the pmd is in,
> using mask = ~(PTRS_PER_PMD * sizeof(pmd_t) -1) = 0xfffffffffffff000 and virt_to_page().
> Then, ptlock_ptr() gets spinlock_t either from page->ptl (split case) or
> mm->page_table_lock (not split case).
> 
> It seems to me that all PMDs in one page table page share a single spinlock. Let me know
> if I misunderstand any code.

Thanks for clarification, it was my misunderstanding.

Naoya

> 
> But your suggestion can avoid holding the pmd lock for long without cond_resched(),
> I can move the spinlock inside the loop.
> 
> Thanks.
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5299b261c4b4..ff61d45eaea7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1260,31 +1260,34 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>                                 struct zap_details *details)
>  {
>         pmd_t *pmd;
> -       spinlock_t *ptl;
> +       spinlock_t *ptl = NULL;
>         unsigned long next;
> 
>         pmd = pmd_offset(pud, addr);
> -       ptl = pmd_lock(vma->vm_mm, pmd);
>         do {
> +               ptl = pmd_lock(vma->vm_mm, pmd);
>                 next = pmd_addr_end(addr, end);
>                 if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>                         if (next - addr != HPAGE_PMD_SIZE) {
>                                 VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
>                                     !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
>                                 __split_huge_pmd_locked(vma, pmd, addr, false);
> -                       } else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
> -                               continue;
> +                       } else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr)) {
> +                               spin_unlock(ptl);
> +                               goto next;
> +                       }
>                         /* fall through */
>                 }
> 
> -               if (pmd_none_or_clear_bad(pmd))
> -                       continue;
> +               if (pmd_none_or_clear_bad(pmd)) {
> +                       spin_unlock(ptl);
> +                       goto next;
> +               }
>                 spin_unlock(ptl);
>                 next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> +next:
>                 cond_resched();
> -               spin_lock(ptl);
>         } while (pmd++, addr = next, addr != end);
> -       spin_unlock(ptl);
> 
>         return addr;
>  }
> 
> 
> >
> > Thanks,
> > Naoya Horiguchi
> >
> >>  	do {
> >>  		next = pmd_addr_end(addr, end);
> >>  		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> >>  			if (next - addr != HPAGE_PMD_SIZE) {
> >>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
> >>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
> >> -				__split_huge_pmd(vma, pmd, addr, false, NULL);
> >> -			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
> >> -				goto next;
> >> +				__split_huge_pmd_locked(vma, pmd, addr, false);
> >> +			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
> >> +				continue;
> >>  			/* fall through */
> >>  		}
> >> -		/*
> >> -		 * Here there can be other concurrent MADV_DONTNEED or
> >> -		 * trans huge page faults running, and if the pmd is
> >> -		 * none or trans huge it can change under us. This is
> >> -		 * because MADV_DONTNEED holds the mmap_sem in read
> >> -		 * mode.
> >> -		 */
> >> -		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> >> -			goto next;
> >> +
> >> +		if (pmd_none_or_clear_bad(pmd))
> >> +			continue;
> >> +		spin_unlock(ptl);
> >>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> >> -next:
> >>  		cond_resched();
> >> +		spin_lock(ptl);
> >>  	} while (pmd++, addr = next, addr != end);
> >> +	spin_unlock(ptl);
> >>
> >>  	return addr;
> >>  }
> >> -- 
> >> 2.11.0
> >>
> 
> 
> --
> Best Regards
> Yan Zi

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-06 16:07     ` Kirill A. Shutemov
@ 2017-02-07 13:55       ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 87+ messages in thread
From: Aneesh Kumar K.V @ 2017-02-07 13:55 UTC (permalink / raw)
  To: Kirill A. Shutemov, Zi Yan, mgorman, riel
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	n-horiguchi, khandual, zi.yan, Zi Yan

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>> 
>> Originally, zap_pmd_range() checks pmd value without taking pmd lock.
>> This can cause pmd_protnone entry not being freed.
>> 
>> Because there are two steps in changing a pmd entry to a pmd_protnone
>> entry. First, the pmd entry is cleared to a pmd_none entry, then,
>> the pmd_none entry is changed into a pmd_protnone entry.
>> The racy check, even with barrier, might only see the pmd_none entry
>> in zap_pmd_range(), thus, the mapping is neither split nor zapped.
>
> That's definitely a good catch.
>
> But I don't agree with the solution. Taking pmd lock on each
> zap_pmd_range() is a significant hit by scalability of the code path.
> Yes, split ptl lock helps, but it would be nice to avoid the lock in first
> place.
>
> Can we fix change_huge_pmd() instead? Is there a reason why we cannot
> setup the pmd_protnone() atomically?
>
> Mel? Rik?
>

I am also trying to fix up the usage of set_pte_at() on ptes that are
valid/present (that is, autonuma ptes). I guess what we are missing is a
variant of the pte update routines that can atomically update a pte without
clearing it and that also doesn't do a TLB flush?

-aneesh

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-07 13:55       ` Aneesh Kumar K.V
  (?)
@ 2017-02-07 14:12       ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-07 14:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, mgorman, riel, linux-kernel, linux-mm,
	kirill.shutemov, akpm, minchan, vbabka, n-horiguchi, khandual,
	Zi Yan


On 7 Feb 2017, at 7:55, Aneesh Kumar K.V wrote:

> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>
>> On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> Originally, zap_pmd_range() checks pmd value without taking pmd lock.
>>> This can cause pmd_protnone entry not being freed.
>>>
>>> Because there are two steps in changing a pmd entry to a pmd_protnone
>>> entry. First, the pmd entry is cleared to a pmd_none entry, then,
>>> the pmd_none entry is changed into a pmd_protnone entry.
>>> The racy check, even with barrier, might only see the pmd_none entry
>>> in zap_pmd_range(), thus, the mapping is neither split nor zapped.
>>
>> That's definitely a good catch.
>>
>> But I don't agree with the solution. Taking pmd lock on each
>> zap_pmd_range() is a significant hit by scalability of the code path.
>> Yes, split ptl lock helps, but it would be nice to avoid the lock in first
>> place.
>>
>> Can we fix change_huge_pmd() instead? Is there a reason why we cannot
>> setup the pmd_protnone() atomically?
>>
>> Mel? Rik?
>>
>
> I am also trying to fixup the usage of set_pte_at on ptes that are
> valid/present (that this autonuma ptes). I guess what we are missing is a
> variant of pte update routines that can atomically update a pte without
> clearing it and that also doesn't do a tlb flush ?

I think so. The key point is to have an atomic PTE update function instead
of the current two-step pte/pmd_get_clear() then set_pte/pmd_at() sequence.
We can always add a wrapper that includes the TLB flush once we have this
atomic update function.

I used xchg() to replace the xxx_get_clear() & set_xxx_at() pairs in the
pmd_protnone() setup, set_pmd_migration_entry(), and remove_pmd_migration(),
then ran my test overnight. I did not see any kernel crashes or data
corruption. So I think an atomic PTE/PMD update function works without
taking locks in zap_pmd_range().
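
A minimal sketch of that xchg()-based in-place update (x86-style layout where
pmd_t wraps a single word; the helper name is made up and TLB flushing is
left to the caller, so this is an illustration rather than the exact code
that was tested):

/*
 * Replace *pmdp with newpmd in one atomic step, so an unlocked reader such
 * as the racy check in zap_pmd_range() sees either the old or the new
 * entry, never a transient pmd_none().
 */
static inline pmd_t pmdp_xchg_sketch(pmd_t *pmdp, pmd_t newpmd)
{
        pmd_t old = __pmd(xchg(&pmdp->pmd, pmd_val(newpmd)));

        /* The caller is still responsible for TLB flushing, e.g.
         * flush_tlb_range(), and for any mmu notifier calls. */
        return old;
}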

Aneesh, in your patch fixing PowerPC's autonuma pte problem, why didn't you
use atomic operations? Is there any limitation on PowerPC?

My question is why the current kernel uses xxx_get_clear() and set_xxx_at()
in the first place. Is there any limitation I am not aware of?


Thanks.

--
Best Regards
Yan Zi


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-07 14:19     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 87+ messages in thread
From: Kirill A. Shutemov @ 2017-02-07 14:19 UTC (permalink / raw)
  To: Zi Yan, Andrea Arcangeli, Minchan Kim
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, n-horiguchi, khandual, zi.yan, Zi Yan

On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Originally, zap_pmd_range() checks pmd value without taking pmd lock.
> This can cause pmd_protnone entry not being freed.
> 
> Because there are two steps in changing a pmd entry to a pmd_protnone
> entry. First, the pmd entry is cleared to a pmd_none entry, then,
> the pmd_none entry is changed into a pmd_protnone entry.
> The racy check, even with barrier, might only see the pmd_none entry
> in zap_pmd_range(), thus, the mapping is neither split nor zapped.

Okay, this can only happen with MADV_DONTNEED, as we hold
down_write(mmap_sem) for the rest of the zap_pmd_range() callers, and
whoever modifies page tables has to hold at least down_read(mmap_sem) or
exclude parallel modification in other ways.

See 1a5a9906d4e8 ("mm: thp: fix pmd_bad() triggering in code paths holding
mmap_sem read mode") for more details.

+Andrea.

> Later, in free_pmd_range(), pmd_none_or_clear() will see the
> pmd_protnone entry and clear it as a pmd_bad entry. Furthermore,
> since the pmd_protnone entry is not properly freed, the corresponding
> deposited pte page table is not freed either.

free_pmd_range() should be fine: we only free page tables after the vmas are
gone (under down_write(mmap_sem) in exit_mmap() and unmap_region()) or after
the page tables have been moved (under down_write(mmap_sem) in
shift_arg_pages()).

> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.

The problem is that numabalancing calls change_huge_pmd() under
down_read(mmap_sem), not down_write(mmap_sem) as the rest of users do.
It makes numabalancing the only code path beyond page fault that can turn
pmd_none() into pmd_trans_huge() under down_read(mmap_sem).

This can lead to a race in which MADV_DONTNEED misses a THP. That's not
critical for the pagefault vs. MADV_DONTNEED race, as we end up with a clear
page in that case. Not so much for change_huge_pmd().
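
For readers following the thread, the two-step update being discussed is
roughly the following pattern from change_huge_pmd() (a paraphrased sketch of
the mm/huge_memory.c of that era; locking and the preserve_write/dirty
handling are omitted):

static void change_huge_pmd_two_step_sketch(struct mm_struct *mm, pmd_t *pmd,
                                            unsigned long addr, pgprot_t newprot)
{
        pmd_t entry;

        /* Step 1: the entry transiently becomes pmd_none() here ... */
        entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
        entry = pmd_modify(entry, newprot);     /* e.g. PROT_NONE for NUMA hinting */

        /*
         * ... and an unlocked reader in zap_pmd_range() (MADV_DONTNEED runs
         * under down_read(mmap_sem)) can observe pmd_none() in this window
         * and skip the entry without withdrawing the deposited PTE table.
         */

        /* Step 2: only now does the entry become pmd_protnone(). */
        set_pmd_at(mm, addr, pmd, entry);
}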

Looks like we need pmdp_modify() or something to modify protection bits
in place, without clearing the pmd.

Not sure how to get a crash scenario.

BTW, Zi, have you observed the crash? Or is it based on code inspection?
Any backtraces?

Ouch! madvise_free_huge_pmd() is broken too. We shouldn't clear pmd in the
middle of it as we only hold down_read(mmap_sem). I guess we need a helper
to clear both access and dirty bits.
Minchan, could you look into it?
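
A rough sketch of the kind of helper meant here, for the x86-style case where
pmd_t wraps a single word (the name and the cmpxchg loop are assumptions, not
an existing API):

/*
 * Atomically drop the accessed and dirty bits from a huge pmd without ever
 * making the entry pmd_none(), so an unlocked reader never sees a cleared
 * pmd half-way through the update.
 */
static pmd_t pmdp_clear_young_dirty_sketch(pmd_t *pmdp)
{
        pmdval_t old, new;

        do {
                old = READ_ONCE(pmdp->pmd);
                new = pmd_val(pmd_mkclean(pmd_mkold(__pmd(old))));
        } while (cmpxchg(&pmdp->pmd, old, new) != old);

        /* TLB flushing is still up to the caller. */
        return __pmd(old);
}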

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-07 14:19     ` Kirill A. Shutemov
@ 2017-02-07 15:11       ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-07 15:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Zi Yan, Andrea Arcangeli, Minchan Kim, linux-kernel, linux-mm,
	kirill.shutemov, akpm, vbabka, mgorman, n-horiguchi, khandual,
	Zi Yan


Hi Kirill,

Kirill A. Shutemov wrote:
> On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Originally, zap_pmd_range() checks pmd value without taking pmd lock.
>> This can cause pmd_protnone entry not being freed.
>>
>> Because there are two steps in changing a pmd entry to a pmd_protnone
>> entry. First, the pmd entry is cleared to a pmd_none entry, then,
>> the pmd_none entry is changed into a pmd_protnone entry.
>> The racy check, even with barrier, might only see the pmd_none entry
>> in zap_pmd_range(), thus, the mapping is neither split nor zapped.
> 
> Okay, this only can happen to MADV_DONTNEED as we hold
> down_write(mmap_sem) for the rest of zap_pmd_range() and whoever modifies
> page tables has to hold at least down_read(mmap_sem) or exclude parallel
> modification in other ways.
> 
> See 1a5a9906d4e8 ("mm: thp: fix pmd_bad() triggering in code paths holding
> mmap_sem read mode") for more details.
> 
> +Andrea.
> 
>> Later, in free_pmd_range(), pmd_none_or_clear() will see the
>> pmd_protnone entry and clear it as a pmd_bad entry. Furthermore,
>> since the pmd_protnone entry is not properly freed, the corresponding
>> deposited pte page table is not freed either.
> 
> free_pmd_range() should be fine: we only free page tables after vmas gone
> (under down_write(mmap_sem() in exit_mmap() and unmap_region()) or after
> pagetables moved (under down_write(mmap_sem) in shift_arg_pages()).

The leaked page is not a pmd page table, but a PTE page table stored in
pmd_page_table_page->pmd_huge_pte. If a pmd_protnone entry goes through
neither __split_huge_pmd() nor zap_huge_pmd(), the corresponding deposited
PTE page table, put on the list via pgtable_trans_huge_deposit(), is never
taken back out via pgtable_trans_huge_withdraw().

Then, when the kernel calls pgtable_pmd_page_dtor() from pmd_free_tlb() at
the end of free_pmd_range(), it hits VM_BUG_ON_PAGE(page->pmd_huge_pte, page).
Either the kernel crashes or it leaks a page.
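
For context, a condensed sketch of the deposit/withdraw pairing that gets
skipped in this scenario (loosely modelled on the tail of zap_huge_pmd() for
anonymous THP; rmap teardown and page freeing are omitted, and the nr_ptes
accounting matches ~4.10-era kernels):

static void withdraw_deposited_pte_table_sketch(struct mm_struct *mm, pmd_t *pmd)
{
        /* Pairs with the pgtable_trans_huge_deposit() done at THP fault
         * time. If this never runs for a zapped huge pmd, the PTE table
         * stays chained on the pmd table page and ___pmd_free_tlb() later
         * trips VM_BUG_ON_PAGE(page->pmd_huge_pte, page). */
        pgtable_t pgtable = pgtable_trans_huge_withdraw(mm, pmd);

        pte_free(mm, pgtable);
        atomic_long_dec(&mm->nr_ptes);  /* field as of ~4.10; newer kernels differ */
}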


> 
>> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
> 
> The problem is that numabalancing calls change_huge_pmd() under
> down_read(mmap_sem), not down_write(mmap_sem) as the rest of users do.
> It makes numabalancing the only code path beyond page fault that can turn
> pmd_none() into pmd_trans_huge() under down_read(mmap_sem).
> 
> This can lead to race when MADV_DONTNEED miss THP. That's not critical for
> pagefault vs. MADV_DONTNEED race as we will end up with clear page in that
> case. Not so much for change_huge_pmd().
> 
> Looks like we need pmdp_modify() or something to modify protection bits
> inplace, without clearing pmd.
> 
> Not sure how to get crash scenario.
> 
> BTW, Zi, have you observed the crash? Or is it based on code inspection?
> Any backtraces?

The problem should be very rare in the upstream kernel. I discovered it in
my customized kernel, which does very frequent page migration and uses
numa_protnone.

My guess at the crash scenario is:
1. A huge page pmd entry is in the middle of being changed into either a
pmd_protnone or a pmd_migration_entry; it has just been cleared to pmd_none.

2. At the same time, the application frees the vma this page belongs to.

3. zap_pmd_range() only sees pmd_none at
"if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))".
It might later catch the pmd_protnone at
"if (pmd_none_or_trans_huge_or_clear_bad(pmd))", but nothing is done for it
there, so the deposited PTE page table page associated with the huge pmd
entry is not withdrawn.

4. free_pmd_range() calls pmd_free_tlb(), and in pgtable_pmd_page_dtor()
VM_BUG_ON_PAGE(page->pmd_huge_pte, page) is triggered.

The crash log (you will see a pmd_migration_entry being treated as a bad
pmd, which should not happen; I have also seen a pmd_protnone treated this
way before):

[ 1945.978677] mm/pgtable-generic.c:33: bad pmd
ffff8f07b13c1b90(0000004fed803c00)
                 ^^^^^^^^^^^^^^^^ a pmd migration entry

[ 1946.964974] page:fffffd1dd0c4f040 count:1 mapcount:-511 mapping:
     (null) index:0x0
[ 1946.974265] flags: 0x6ffff0000000000()
[ 1946.978486] raw: 06ffff0000000000 0000000000000000 0000000000000000
00000001fffffe00
[ 1946.987202] raw: dead000000000100 fffffd1dd0c45c80 ffff8f07aa38e340
ffff8efdca466678
[ 1946.995927] page dumped because: VM_BUG_ON_PAGE(page->pmd_huge_pte)
[ 1947.002984] page->mem_cgroup:ffff8efdca466678
[ 1947.007927] ------------[ cut here ]------------
[ 1947.013123] kernel BUG at ./include/linux/mm.h:1733!
[ 1947.018706] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 1947.024774] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4
iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_conntrack nf_conntrack intel_rapl sb_edac edac_corei
[ 1947.077814] CPU: 19 PID: 3303 Comm: python Not tainted
4.10.0-rc5-page-migration+ #283
[ 1947.086721] Hardware name: Dell Inc. PowerEdge R530/0HFG24, BIOS
1.5.4 10/05/2015
[ 1947.095140] task: ffff8f07a5870040 task.stack: ffffc37d64adc000
[ 1947.101796] RIP: 0010:___pmd_free_tlb+0x83/0x90
[ 1947.106890] RSP: 0018:ffffc37d64adfce8 EFLAGS: 00010282
[ 1947.112762] RAX: 0000000000000021 RBX: ffffc37d64adfe10 RCX:
0000000000000000
[ 1947.120770] RDX: 0000000000000000 RSI: ffff8f07c224dea8 RDI:
ffff8f07c224dea8
[ 1947.128809] RBP: ffffc37d64adfcf8 R08: 0000000000000001 R09:
0000000000000000
[ 1947.136818] R10: 000000000000000f R11: 0000000000000001 R12:
fffffd1dd0c4f040
[ 1947.144825] R13: 00007fae2d7fd000 R14: ffff8f07b13c1b60 R15:
ffffc37d64adfe10
[ 1947.152832] FS:  00007fafbcce6700(0000) GS:ffff8f07c2240000(0000)
knlGS:0000000000000000
[ 1947.161934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1947.168380] CR2: 00007fafeb851188 CR3: 0000001432184000 CR4:
00000000001406e0
[ 1947.176393] Call Trace:
[ 1947.179160]  free_pgd_range+0x487/0x5d0
[ 1947.183476]  free_pgtables+0xc4/0x120
[ 1947.187593]  unmap_region+0xe1/0x130
[ 1947.191620]  do_munmap+0x273/0x400
[ 1947.195452]  SyS_munmap+0x53/0x70
[ 1947.199190]  entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1947.204382] RIP: 0033:0x7fafeb59d387
[ 1947.208406] RSP: 002b:00007fafbcce5358 EFLAGS: 00000207 ORIG_RAX:
000000000000000b
[ 1947.216924] RAX: ffffffffffffffda RBX: 00007faf2c0b6780 RCX:
00007fafeb59d387
[ 1947.224933] RDX: 00007fae247fc030 RSI: 0000000009001000 RDI:
00007fae247fc000
[ 1947.232940] RBP: 00007fafbcce5390 R08: 00007faf48f9ed00 R09:
0000000000000100
[ 1947.240947] R10: 0000000000000020 R11: 0000000000000207 R12:
0000000002cf1fc0
[ 1947.248955] R13: 0000000002cf1fc0 R14: 000000000343d530 R15:
00007fafbcce5810
[ 1947.256965] Code: 4c 89 e6 48 89 df e8 0d b5 1a 00 84 c0 74 08 48 89
df e8 91 b4 1a 00 5b 41 5c 5d c3 48 c7 c6 d8 73 c7 b8 4c 89 e7 e8 dd 7b
1a 00 <0f> 0b 48 8b 3d 34 80 d9 00 eb 99 66 90
[ 1947.278200] RIP: ___pmd_free_tlb+0x83/0x90 RSP: ffffc37d64adfce8
[ 1947.285688] ---[ end trace 7864a23976d71e0a ]---

-- 
Best Regards,
Yan Zi



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-07 15:11       ` Zi Yan
@ 2017-02-07 16:37         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 87+ messages in thread
From: Kirill A. Shutemov @ 2017-02-07 16:37 UTC (permalink / raw)
  To: Zi Yan
  Cc: Zi Yan, Andrea Arcangeli, Minchan Kim, linux-kernel, linux-mm,
	kirill.shutemov, akpm, vbabka, mgorman, n-horiguchi, khandual,
	Zi Yan

On Tue, Feb 07, 2017 at 09:11:05AM -0600, Zi Yan wrote:
> >> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
> > 
> > The problem is that numabalancing calls change_huge_pmd() under
> > down_read(mmap_sem), not down_write(mmap_sem) as the rest of users do.
> > It makes numabalancing the only code path beyond page fault that can turn
> > pmd_none() into pmd_trans_huge() under down_read(mmap_sem).
> > 
> > This can lead to race when MADV_DONTNEED miss THP. That's not critical for
> > pagefault vs. MADV_DONTNEED race as we will end up with clear page in that
> > case. Not so much for change_huge_pmd().
> > 
> > Looks like we need pmdp_modify() or something to modify protection bits
> > inplace, without clearing pmd.
> > 
> > Not sure how to get crash scenario.
> > 
> > BTW, Zi, have you observed the crash? Or is it based on code inspection?
> > Any backtraces?
> 
> The problem should be very rare in the upstream kernel. I discover the
> problem in my customized kernel which does very frequent page migration
> and uses numa_protnone.
> 
> The crash scenario I guess is like:
> 1. A huge page pmd entry is in the middle of being changed into either a
> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
> 
> 2. At the same time, the application frees the vma this page belongs to.

Em... no.

This shouldn't be possible: your step 1 must be done under
down_read(mmap_sem), and we are only able to remove a vma under
down_write(mmap_sem), so the scenario should be excluded.

What am I missing?

> 3. zap_pmd_range() only see pmd_none in
> "if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))",
> it might catch pmd_protnone in
> "if (pmd_none_or_trans_huge_or_clear_bad(pmd))". But nothing is done for
> it. So the deposited PTE page table page associated with the huge pmd
> entry is not withdrawn.
> 
> 4. free_pmd_range() calls pmd_free_tlb() and in pgtable_pmd_page_dtor(),
> VM_BUG_ON_PAGE(page->pmd_huge_pte, page) is triggered.
> 
> The crash log (you will see a pmd_migration_entry is regarded as bad
> pmd, which should not be. I also saw pmd_protnone before.):
> 
> [ 1945.978677] mm/pgtable-generic.c:33: bad pmd
> ffff8f07b13c1b90(0000004fed803c00)
>                  ^^^^^^^^^^^^^^^^ a pmd migration entry
> 
> [ 1946.964974] page:fffffd1dd0c4f040 count:1 mapcount:-511 mapping:
>      (null) index:0x0
> [ 1946.974265] flags: 0x6ffff0000000000()
> [ 1946.978486] raw: 06ffff0000000000 0000000000000000 0000000000000000
> 00000001fffffe00
> [ 1946.987202] raw: dead000000000100 fffffd1dd0c45c80 ffff8f07aa38e340
> ffff8efdca466678
> [ 1946.995927] page dumped because: VM_BUG_ON_PAGE(page->pmd_huge_pte)
> [ 1947.002984] page->mem_cgroup:ffff8efdca466678
> [ 1947.007927] ------------[ cut here ]------------
> [ 1947.013123] kernel BUG at ./include/linux/mm.h:1733!
> [ 1947.018706] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 1947.024774] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4
> iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> xt_conntrack nf_conntrack intel_rapl sb_edac edac_corei
> [ 1947.077814] CPU: 19 PID: 3303 Comm: python Not tainted
> 4.10.0-rc5-page-migration+ #283
> [ 1947.086721] Hardware name: Dell Inc. PowerEdge R530/0HFG24, BIOS
> 1.5.4 10/05/2015
> [ 1947.095140] task: ffff8f07a5870040 task.stack: ffffc37d64adc000
> [ 1947.101796] RIP: 0010:___pmd_free_tlb+0x83/0x90
> [ 1947.106890] RSP: 0018:ffffc37d64adfce8 EFLAGS: 00010282
> [ 1947.112762] RAX: 0000000000000021 RBX: ffffc37d64adfe10 RCX:
> 0000000000000000
> [ 1947.120770] RDX: 0000000000000000 RSI: ffff8f07c224dea8 RDI:
> ffff8f07c224dea8
> [ 1947.128809] RBP: ffffc37d64adfcf8 R08: 0000000000000001 R09:
> 0000000000000000
> [ 1947.136818] R10: 000000000000000f R11: 0000000000000001 R12:
> fffffd1dd0c4f040
> [ 1947.144825] R13: 00007fae2d7fd000 R14: ffff8f07b13c1b60 R15:
> ffffc37d64adfe10
> [ 1947.152832] FS:  00007fafbcce6700(0000) GS:ffff8f07c2240000(0000)
> knlGS:0000000000000000
> [ 1947.161934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1947.168380] CR2: 00007fafeb851188 CR3: 0000001432184000 CR4:
> 00000000001406e0
> [ 1947.176393] Call Trace:
> [ 1947.179160]  free_pgd_range+0x487/0x5d0
> [ 1947.183476]  free_pgtables+0xc4/0x120
> [ 1947.187593]  unmap_region+0xe1/0x130
> [ 1947.191620]  do_munmap+0x273/0x400
> [ 1947.195452]  SyS_munmap+0x53/0x70
> [ 1947.199190]  entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1947.204382] RIP: 0033:0x7fafeb59d387
> [ 1947.208406] RSP: 002b:00007fafbcce5358 EFLAGS: 00000207 ORIG_RAX:
> 000000000000000b
> [ 1947.216924] RAX: ffffffffffffffda RBX: 00007faf2c0b6780 RCX:
> 00007fafeb59d387
> [ 1947.224933] RDX: 00007fae247fc030 RSI: 0000000009001000 RDI:
> 00007fae247fc000
> [ 1947.232940] RBP: 00007fafbcce5390 R08: 00007faf48f9ed00 R09:
> 0000000000000100
> [ 1947.240947] R10: 0000000000000020 R11: 0000000000000207 R12:
> 0000000002cf1fc0
> [ 1947.248955] R13: 0000000002cf1fc0 R14: 000000000343d530 R15:
> 00007fafbcce5810
> [ 1947.256965] Code: 4c 89 e6 48 89 df e8 0d b5 1a 00 84 c0 74 08 48 89
> df e8 91 b4 1a 00 5b 41 5c 5d c3 48 c7 c6 d8 73 c7 b8 4c 89 e7 e8 dd 7b
> 1a 00 <0f> 0b 48 8b 3d 34 80 d9 00 eb 99 66 90
> [ 1947.278200] RIP: ___pmd_free_tlb+0x83/0x90 RSP: ffffc37d64adfce8
> [ 1947.285688] ---[ end trace 7864a23976d71e0a ]---
> 
> -- 
> Best Regards,
> Yan Zi
> 



-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-07 16:37         ` Kirill A. Shutemov
@ 2017-02-07 17:14           ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-07 17:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Zi Yan, Andrea Arcangeli, Minchan Kim, linux-kernel, linux-mm,
	kirill.shutemov, akpm, vbabka, mgorman, n-horiguchi, khandual,
	Zi Yan




Kirill A. Shutemov wrote:
> On Tue, Feb 07, 2017 at 09:11:05AM -0600, Zi Yan wrote:
>>>> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
>>> The problem is that numabalancing calls change_huge_pmd() under
>>> down_read(mmap_sem), not down_write(mmap_sem) as the rest of users do.
>>> It makes numabalancing the only code path beyond page fault that can turn
>>> pmd_none() into pmd_trans_huge() under down_read(mmap_sem).
>>>
>>> This can lead to race when MADV_DONTNEED miss THP. That's not critical for
>>> pagefault vs. MADV_DONTNEED race as we will end up with clear page in that
>>> case. Not so much for change_huge_pmd().
>>>
>>> Looks like we need pmdp_modify() or something to modify protection bits
>>> inplace, without clearing pmd.
>>>
>>> Not sure how to get crash scenario.
>>>
>>> BTW, Zi, have you observed the crash? Or is it based on code inspection?
>>> Any backtraces?
>> The problem should be very rare in the upstream kernel. I discover the
>> problem in my customized kernel which does very frequent page migration
>> and uses numa_protnone.
>>
>> The crash scenario I guess is like:
>> 1. A huge page pmd entry is in the middle of being changed into either a
>> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
>>
>> 2. At the same time, the application frees the vma this page belongs to.
> 
> Em... no.
> 
> This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
> And we only be able to remove vma under down_write(mmap_sem), so the
> scenario should be excluded.
> 
> What do I miss?

You are right. This problem will not happen in the upstream kernel.

The problem comes from my customized kernel, where I migrate pages away
instead of reclaiming them when memory is under pressure. I did not take
mmap_sem when migrating pages, so I got this error.

It is a false alarm. Sorry about that. Thanks for clarifying the problem.


-- 
Best Regards,
Yan Zi



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-07 17:14           ` Zi Yan
@ 2017-02-07 17:45             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 87+ messages in thread
From: Kirill A. Shutemov @ 2017-02-07 17:45 UTC (permalink / raw)
  To: Zi Yan
  Cc: Zi Yan, Andrea Arcangeli, Minchan Kim, linux-kernel, linux-mm,
	kirill.shutemov, akpm, vbabka, mgorman, n-horiguchi, khandual,
	Zi Yan

On Tue, Feb 07, 2017 at 11:14:56AM -0600, Zi Yan wrote:
> 
> 
> Kirill A. Shutemov wrote:
> > On Tue, Feb 07, 2017 at 09:11:05AM -0600, Zi Yan wrote:
> >>>> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
> >>> The problem is that numabalancing calls change_huge_pmd() under
> >>> down_read(mmap_sem), not down_write(mmap_sem) as the rest of users do.
> >>> It makes numabalancing the only code path beyond page fault that can turn
> >>> pmd_none() into pmd_trans_huge() under down_read(mmap_sem).
> >>>
> >>> This can lead to race when MADV_DONTNEED miss THP. That's not critical for
> >>> pagefault vs. MADV_DONTNEED race as we will end up with clear page in that
> >>> case. Not so much for change_huge_pmd().
> >>>
> >>> Looks like we need pmdp_modify() or something to modify protection bits
> >>> inplace, without clearing pmd.
> >>>
> >>> Not sure how to get crash scenario.
> >>>
> >>> BTW, Zi, have you observed the crash? Or is it based on code inspection?
> >>> Any backtraces?
> >> The problem should be very rare in the upstream kernel. I discover the
> >> problem in my customized kernel which does very frequent page migration
> >> and uses numa_protnone.
> >>
> >> The crash scenario I guess is like:
> >> 1. A huge page pmd entry is in the middle of being changed into either a
> >> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
> >>
> >> 2. At the same time, the application frees the vma this page belongs to.
> > 
> > Em... no.
> > 
> > This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
> > And we only be able to remove vma under down_write(mmap_sem), so the
> > scenario should be excluded.
> > 
> > What do I miss?
> 
> You are right. This problem will not happen in the upstream kernel.
> 
> The problem comes from my customized kernel, where I migrate pages away
> instead of reclaiming them when memory is under pressure. I did not take
> any mmap_sem when I migrate pages. So I got this error.
> 
> It is a false alarm. Sorry about that. Thanks for clarifying the problem.

I think there's still a race between MADV_DONTNEED and
change_huge_pmd(.prot_numa=1) that results in zap_pmd_range() skipping
the THP. It needs to be addressed.

And MADV_FREE requires a fix.

So, minus one non-bug, plus two bugs. 
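
To illustrate, a rough sketch of the pre-patch zap_pmd_range() fast path
where the skip can happen (paraphrased and simplified, not an exact quote
of mm/memory.c or of this patchset):

	pmd = pmd_offset(pud, addr);
	do {
		next = pmd_addr_end(addr, end);
		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
			/* split, or zap_huge_pmd() and goto next ... */
		}
		/*
		 * If change_huge_pmd(.prot_numa=1) on another CPU has just
		 * cleared *pmd (it only holds down_read(mmap_sem)), the pmd
		 * reads as none here, so the THP is skipped and stays mapped
		 * after MADV_DONTNEED returns.
		 */
		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
			goto next;
		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
next:
		cond_resched();
	} while (pmd++, addr = next, addr != end);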

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 04/14] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-09  9:14     ` Naoya Horiguchi
  -1 siblings, 0 replies; 87+ messages in thread
From: Naoya Horiguchi @ 2017-02-09  9:14 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual, zi.yan

On Sun, Feb 05, 2017 at 11:12:42AM -0500, Zi Yan wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid
> false negative return when it races with thp spilt
> (during which _PAGE_PRESENT is temporary cleared.) I don't think that
> dropping _PAGE_PSE check in pmd_present() works well because it can
> hurt optimization of tlb handling in thp split.
> In the current kernel, bits 1-4 are not used in non-present format
> since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
> work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
> Bit 7 is used as reserved (always clear), so please don't use it for
> other purpose.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> ChangeLog v3:
> - Move _PAGE_SWP_SOFT_DIRTY to bit 1, it was placed at bit 6. Because
> some CPUs might accidentally set bit 5 or 6.
> 
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> ---

More documentation would be helpful. Could you do something like the following?

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Sun, 5 Feb 2017 11:12:42 -0500
Subject: [PATCH] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1

pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid
a false negative return when it races with thp split
(during which _PAGE_PRESENT is temporarily cleared.) I don't think that
dropping _PAGE_PSE check in pmd_present() works well because it can
hurt optimization of tlb handling in thp split.
In the current kernel, bits 1-4 are not used in non-present format
since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
Bit 7 is used as reserved (always clear), so please don't use it for
other purposes.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 arch/x86/include/asm/pgtable_64.h    | 12 +++++++++---
 arch/x86/include/asm/pgtable_types.h | 10 +++++-----
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 73c7ccc38912..07c98c85cc96 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -157,15 +157,21 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 /*
  * Encode and de-code a swap entry
  *
- * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0| <- bit number
- * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
- * | OFFSET (14->63) | TYPE (9-13)  |0|X|X|X| X| X|X|X|0| <- swp entry
+ * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
+ * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
+ * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
  * there.  We also need to avoid using A and D because of an
  * erratum where they can be incorrectly set by hardware on
  * non-present PTEs.
+ *
+ * SD (1) in swp entry is used to store the soft dirty bit, which helps us
+ * remember soft dirty state over page migration.
+ *
+ * Bit 7 in swp entry should be 0 because pmd_present() checks not only P,
+ * but also G.
  */
 #define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
 #define SWP_TYPE_BITS 5
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 8b4de22d6429..3695abd58ef6 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -97,15 +97,15 @@
 /*
  * Tracking soft dirty bit when a page goes to a swap is tricky.
  * We need a bit which can be stored in pte _and_ not conflict
- * with swap entry format. On x86 bits 6 and 7 are *not* involved
- * into swap entry computation, but bit 6 is used for nonlinear
- * file mapping, so we borrow bit 7 for soft dirty tracking.
+ * with swap entry format. On x86 bits 1-4 are *not* involved
+ * into swap entry computation, but bit 7 is used for thp migration,
+ * so we borrow bit 1 for soft dirty tracking.
  *
  * Please note that this bit must be treated as swap dirty page
- * mark if and only if the PTE has present bit clear!
+ * mark if and only if the PTE/PMD has present bit clear!
  */
 #ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY	_PAGE_PSE
+#define _PAGE_SWP_SOFT_DIRTY	_PAGE_RW
 #else
 #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
-- 
2.7.4
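
As a quick sanity check of the new layout (an illustrative sketch only, not
part of the patch; it assumes x86-64 with CONFIG_MEM_SOFT_DIRTY, where
_PAGE_SWP_SOFT_DIRTY becomes _PAGE_RW):

static inline void swp_soft_dirty_layout_check(void)
{
	/* SD (bit 1) must not alias _PAGE_PSE (bit 7), which pmd_present()
	 * looks at and which must stay clear in swp entries. */
	BUILD_BUG_ON(_PAGE_SWP_SOFT_DIRTY & _PAGE_PSE);
	/* SD must also sit below the swap type field, which starts at bit 9
	 * (SWP_TYPE_FIRST_BIT = _PAGE_BIT_PROTNONE + 1). */
	BUILD_BUG_ON(_PAGE_SWP_SOFT_DIRTY >= (1UL << SWP_TYPE_FIRST_BIT));
}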

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 08/14] mm: thp: enable thp migration in generic path
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-09  9:15     ` Naoya Horiguchi
  -1 siblings, 0 replies; 87+ messages in thread
From: Naoya Horiguchi @ 2017-02-09  9:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual, zi.yan

On Sun, Feb 05, 2017 at 11:12:46AM -0500, Zi Yan wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> This patch adds thp migration's core code, including conversions
> between a PMD entry and a swap entry, setting PMD migration entry,
> removing PMD migration entry, and waiting on PMD migration entries.
> 
> This patch makes it possible to support thp migration.
> If you fail to allocate a destination page as a thp, you just split
> the source thp as we do now, and then enter the normal page migration.
> If you succeed to allocate destination thp, you enter thp migration.
> Subsequent patches actually enable thp migration for each caller of
> page migration by allowing its get_new_page() callback to
> allocate thps.
> 
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> ChangeLog v2 -> v3:
> - use page_vma_mapped_walk()
> 
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> ---
>  arch/x86/include/asm/pgtable_64.h |   2 +
>  include/linux/swapops.h           |  70 +++++++++++++++++-
>  mm/huge_memory.c                  | 151 ++++++++++++++++++++++++++++++++++----
>  mm/migrate.c                      |  29 +++++++-
>  mm/page_vma_mapped.c              |  13 +++-
>  mm/pgtable-generic.c              |   3 +-
>  mm/rmap.c                         |  14 +++-
>  7 files changed, 259 insertions(+), 23 deletions(-)
> 
...
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6893c47428b6..fd54bbdc16cf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1613,20 +1613,51 @@ int __zap_huge_pmd_locked(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		atomic_long_dec(&tlb->mm->nr_ptes);
>  		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
>  	} else {
> -		struct page *page = pmd_page(orig_pmd);
> -		page_remove_rmap(page, true);
> -		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> -		VM_BUG_ON_PAGE(!PageHead(page), page);
> -		if (PageAnon(page)) {
> -			pgtable_t pgtable;
> -			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> -			pte_free(tlb->mm, pgtable);
> -			atomic_long_dec(&tlb->mm->nr_ptes);
> -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +		struct page *page;
> +		int migration = 0;
> +
> +		if (!is_pmd_migration_entry(orig_pmd)) {
> +			page = pmd_page(orig_pmd);
> +			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> +			VM_BUG_ON_PAGE(!PageHead(page), page);
> +			page_remove_rmap(page, true);

> +			if (PageAnon(page)) {
> +				pgtable_t pgtable;
> +
> +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> +								      pmd);
> +				pte_free(tlb->mm, pgtable);
> +				atomic_long_dec(&tlb->mm->nr_ptes);
> +				add_mm_counter(tlb->mm, MM_ANONPAGES,
> +					       -HPAGE_PMD_NR);
> +			} else {
> +				if (arch_needs_pgtable_deposit())
> +					zap_deposited_table(tlb->mm, pmd);
> +				add_mm_counter(tlb->mm, MM_FILEPAGES,
> +					       -HPAGE_PMD_NR);
> +			}

This block is exactly the same as the one in the else block below,
so you could factor it out into a helper function.
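
For example, a minimal sketch of such a helper (the name here is made up):

static void zap_huge_pmd_counters(struct mmu_gather *tlb, pmd_t *pmd,
				  struct page *page)
{
	if (PageAnon(page)) {
		pgtable_t pgtable;

		/* withdraw the deposited PTE table and drop the counters */
		pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
		pte_free(tlb->mm, pgtable);
		atomic_long_dec(&tlb->mm->nr_ptes);
		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
	} else {
		if (arch_needs_pgtable_deposit())
			zap_deposited_table(tlb->mm, pmd);
		add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
	}
}

Both the present and the migration-entry branches could then just call
zap_huge_pmd_counters(tlb, pmd, page).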

>  		} else {
> -			if (arch_needs_pgtable_deposit())
> -				zap_deposited_table(tlb->mm, pmd);
> -			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> +			swp_entry_t entry;
> +
> +			entry = pmd_to_swp_entry(orig_pmd);
> +			page = pfn_to_page(swp_offset(entry));

> +			if (PageAnon(page)) {
> +				pgtable_t pgtable;
> +
> +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> +								      pmd);
> +				pte_free(tlb->mm, pgtable);
> +				atomic_long_dec(&tlb->mm->nr_ptes);
> +				add_mm_counter(tlb->mm, MM_ANONPAGES,
> +					       -HPAGE_PMD_NR);
> +			} else {
> +				if (arch_needs_pgtable_deposit())
> +					zap_deposited_table(tlb->mm, pmd);
> +				add_mm_counter(tlb->mm, MM_FILEPAGES,
> +					       -HPAGE_PMD_NR);
> +			}

> +			free_swap_and_cache(entry); /* waring in failure? */
> +			migration = 1;
>  		}
>  		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
>  	}
> @@ -2634,3 +2665,97 @@ static int __init split_huge_pages_debugfs(void)
>  }
>  late_initcall(split_huge_pages_debugfs);
>  #endif
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> +		struct page *page)
> +{
> +	struct vm_area_struct *vma = pvmw->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long address = pvmw->address;
> +	pmd_t pmdval;
> +	swp_entry_t entry;
> +
> +	if (pvmw->pmd && !pvmw->pte) {
> +		pmd_t pmdswp;
> +
> +		mmu_notifier_invalidate_range_start(mm, address,
> +				address + HPAGE_PMD_SIZE);

Don't you have to put mmu_notifier_invalidate_range_* outside this if block?
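
Something like this sketch, i.e. bracketing both cases (the pte-mapped side
could arguably use a PAGE_SIZE-wide range instead):

	mmu_notifier_invalidate_range_start(mm, address,
			address + HPAGE_PMD_SIZE);
	if (pvmw->pmd && !pvmw->pte) {
		/* PMD-mapped thp: pmdp_huge_clear_flush() + set_pmd_at() */
	} else {
		/* pte-mapped thp: ptep_clear_flush() + set_pte_at() */
	}
	mmu_notifier_invalidate_range_end(mm, address,
			address + HPAGE_PMD_SIZE);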

Thanks,
Naoya Horiguchi

> +
> +		flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> +		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
> +		if (pmd_dirty(pmdval))
> +			set_page_dirty(page);
> +		entry = make_migration_entry(page, pmd_write(pmdval));
> +		pmdswp = swp_entry_to_pmd(entry);
> +		set_pmd_at(mm, address, pvmw->pmd, pmdswp);
> +		page_remove_rmap(page, true);
> +		put_page(page);
> +
> +		mmu_notifier_invalidate_range_end(mm, address,
> +				address + HPAGE_PMD_SIZE);
> +	} else { /* pte-mapped thp */
> +		pte_t pteval;
> +		struct page *subpage = page - page_to_pfn(page) + pte_pfn(*pvmw->pte);
> +		pte_t swp_pte;
> +
> +		pteval = ptep_clear_flush(vma, address, pvmw->pte);
> +		if (pte_dirty(pteval))
> +			set_page_dirty(subpage);
> +		entry = make_migration_entry(subpage, pte_write(pteval));
> +		swp_pte = swp_entry_to_pte(entry);
> +		set_pte_at(mm, address, pvmw->pte, swp_pte);
> +		page_remove_rmap(subpage, false);
> +		put_page(subpage);
> +	}
> +}
> +
> +void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> +{
> +	struct vm_area_struct *vma = pvmw->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long address = pvmw->address;
> +	swp_entry_t entry;
> +
> +	/* PMD-mapped THP  */
> +	if (pvmw->pmd && !pvmw->pte) {
> +		unsigned long mmun_start = address & HPAGE_PMD_MASK;
> +		unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> +		pmd_t pmde;
> +
> +		entry = pmd_to_swp_entry(*pvmw->pmd);
> +		get_page(new);
> +		pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
> +		if (is_write_migration_entry(entry))
> +			pmde = maybe_pmd_mkwrite(pmde, vma);
> +
> +		flush_cache_range(vma, mmun_start, mmun_end);
> +		page_add_anon_rmap(new, vma, mmun_start, true);
> +		pmdp_huge_clear_flush_notify(vma, mmun_start, pvmw->pmd);
> +		set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
> +		flush_tlb_range(vma, mmun_start, mmun_end);
> +		if (vma->vm_flags & VM_LOCKED)
> +			mlock_vma_page(new);
> +		update_mmu_cache_pmd(vma, address, pvmw->pmd);
> +
> +	} else { /* pte-mapped thp */
> +		pte_t pte;
> +		pte_t *ptep = pvmw->pte;
> +
> +		entry = pte_to_swp_entry(*pvmw->pte);
> +		get_page(new);
> +		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
> +		if (pte_swp_soft_dirty(*pvmw->pte))
> +			pte = pte_mksoft_dirty(pte);
> +		if (is_write_migration_entry(entry))
> +			pte = maybe_mkwrite(pte, vma);
> +		flush_dcache_page(new);
> +		set_pte_at(mm, address, ptep, pte);
> +		if (PageAnon(new))
> +			page_add_anon_rmap(new, vma, address, false);
> +		else
> +			page_add_file_rmap(new, false);
> +		update_mmu_cache(vma, address, ptep);
> +	}
> +}
> +#endif
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 95e8580dc902..84181a3668c6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -214,6 +214,12 @@ static int remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  		new = page - pvmw.page->index +
>  			linear_page_index(vma, pvmw.address);
>  
> +		/* PMD-mapped THP migration entry */
> +		if (!PageHuge(page) && PageTransCompound(page)) {
> +			remove_migration_pmd(&pvmw, new);
> +			continue;
> +		}
> +
>  		get_page(new);
>  		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
>  		if (pte_swp_soft_dirty(*pvmw.pte))
> @@ -327,6 +333,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
>  	__migration_entry_wait(mm, pte, ptl);
>  }
>  
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
> +{
> +	spinlock_t *ptl;
> +	struct page *page;
> +
> +	ptl = pmd_lock(mm, pmd);
> +	if (!is_pmd_migration_entry(*pmd))
> +		goto unlock;
> +	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
> +	if (!get_page_unless_zero(page))
> +		goto unlock;
> +	spin_unlock(ptl);
> +	wait_on_page_locked(page);
> +	put_page(page);
> +	return;
> +unlock:
> +	spin_unlock(ptl);
> +}
> +#endif
> +
>  #ifdef CONFIG_BLOCK
>  /* Returns true if all buffers are successfully locked */
>  static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> @@ -1085,7 +1112,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>  		goto out;
>  	}
>  
> -	if (unlikely(PageTransHuge(page))) {
> +	if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
>  		lock_page(page);
>  		rc = split_huge_page(page);
>  		unlock_page(page);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index a23001a22c15..0ed3aee62d50 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -137,16 +137,23 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	if (!pud_present(*pud))
>  		return false;
>  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> -	if (pmd_trans_huge(*pvmw->pmd)) {
> +	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
>  		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> -		if (!pmd_present(*pvmw->pmd))
> -			return not_found(pvmw);
>  		if (likely(pmd_trans_huge(*pvmw->pmd))) {
>  			if (pvmw->flags & PVMW_MIGRATION)
>  				return not_found(pvmw);
>  			if (pmd_page(*pvmw->pmd) != page)
>  				return not_found(pvmw);
>  			return true;
> +		} else if (!pmd_present(*pvmw->pmd)) {
> +			if (unlikely(is_migration_entry(pmd_to_swp_entry(*pvmw->pmd)))) {
> +				swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
> +
> +				if (migration_entry_to_page(entry) != page)
> +					return not_found(pvmw);
> +				return true;
> +			}
> +			return not_found(pvmw);
>  		} else {
>  			/* THP pmd was split under us: handle on pte level */
>  			spin_unlock(pvmw->ptl);
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4ed5908c65b0..9d550a8a0c71 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
>  {
>  	pmd_t pmd;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> +	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> +		  !pmd_devmap(*pmdp));
>  	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
>  	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>  	return pmd;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 16789b936e3a..b33216668fa4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1304,6 +1304,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	struct rmap_private *rp = arg;
>  	enum ttu_flags flags = rp->flags;
>  
> +
>  	/* munlock has nothing to gain from examining un-locked vmas */
>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>  		return SWAP_AGAIN;
> @@ -1314,12 +1315,19 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	}
>  
>  	while (page_vma_mapped_walk(&pvmw)) {
> +		/* THP migration */
> +		if (flags & TTU_MIGRATION) {
> +			if (!PageHuge(page) && PageTransCompound(page)) {
> +				set_pmd_migration_entry(&pvmw, page);
> +				continue;
> +			}
> +		}
> +		/* Unexpected PMD-mapped THP */
> +		VM_BUG_ON_PAGE(!pvmw.pte, page);
> +
>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>  		address = pvmw.address;
>  
> -		/* Unexpected PMD-mapped THP? */
> -		VM_BUG_ON_PAGE(!pvmw.pte, page);
> -
>  		/*
>  		 * If the page is mlock()d, we cannot swap it out.
>  		 * If it's recently referenced (perhaps page_referenced
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 09/14] mm: thp: check pmd migration entry in common path
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-09  9:16     ` Naoya Horiguchi
  -1 siblings, 0 replies; 87+ messages in thread
From: Naoya Horiguchi @ 2017-02-09  9:16 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual, zi.yan

On Sun, Feb 05, 2017 at 11:12:47AM -0500, Zi Yan wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> If one of callers of page migration starts to handle thp,
> memory management code start to see pmd migration entry, so we need
> to prepare for it before enabling. This patch changes various code
> point which checks the status of given pmds in order to prevent race
> between thp migration and the pmd-related works.
> 
> ChangeLog v1 -> v2:
> - introduce pmd_related() (I know the naming is not good, but can't
>   think up no better name. Any suggesntion is welcomed.)
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> ChangeLog v2 -> v3:
> - add is_swap_pmd()
> - a pmd entry should be is_swap_pmd(), pmd_trans_huge(), pmd_devmap(),
>   or pmd_none()

(nitpick) ... or a normal pmd pointing to pte pages?

> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear()
> - flush_cache_range() while set_pmd_migration_entry()
> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
>   true on pmd_migration_entry, so that migration entries are not
>   treated as pmd page table entries.
> 
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> ---
>  arch/x86/mm/gup.c             |  4 +--
>  fs/proc/task_mmu.c            | 22 ++++++++-----
>  include/asm-generic/pgtable.h | 71 ----------------------------------------
>  include/linux/huge_mm.h       | 21 ++++++++++--
>  include/linux/swapops.h       | 74 +++++++++++++++++++++++++++++++++++++++++
>  mm/gup.c                      | 20 ++++++++++--
>  mm/huge_memory.c              | 76 ++++++++++++++++++++++++++++++++++++-------
>  mm/madvise.c                  |  2 ++
>  mm/memcontrol.c               |  2 ++
>  mm/memory.c                   |  9 +++--
>  mm/memory_hotplug.c           | 13 +++++++-
>  mm/mempolicy.c                |  1 +
>  mm/mprotect.c                 |  6 ++--
>  mm/mremap.c                   |  2 +-
>  mm/pagewalk.c                 |  2 ++
>  15 files changed, 221 insertions(+), 104 deletions(-)
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 0d4fb3ebbbac..78a153d90064 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  		pmd_t pmd = *pmdp;
>  
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_none(pmd))
> +		if (!pmd_present(pmd))
>  			return 0;
> -		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
> +		if (unlikely(pmd_large(pmd))) {
>  			/*
>  			 * NUMA hinting faults need to be handled in the GUP
>  			 * slowpath for accounting purposes and so that they
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6c07c7813b26..1e64d6898c68 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  
>  	ptl = pmd_trans_huge_lock(pmd, vma);
>  	if (ptl) {
> -		smaps_pmd_entry(pmd, addr, walk);
> +		if (pmd_present(*pmd))
> +			smaps_pmd_entry(pmd, addr, walk);
>  		spin_unlock(ptl);
>  		return 0;
>  	}
> @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  			goto out;
>  		}
>  
> +		if (!pmd_present(*pmd))
> +			goto out;
> +
>  		page = pmd_page(*pmd);
>  
>  		/* Clear accessed and referenced bits. */
> @@ -1208,19 +1212,19 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>  	if (ptl) {
>  		u64 flags = 0, frame = 0;
>  		pmd_t pmd = *pmdp;
> +		struct page *page;
>  
>  		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
>  			flags |= PM_SOFT_DIRTY;
>  
> -		/*
> -		 * Currently pmd for thp is always present because thp
> -		 * can not be swapped-out, migrated, or HWPOISONed
> -		 * (split in such cases instead.)
> -		 * This if-check is just to prepare for future implementation.
> -		 */
> -		if (pmd_present(pmd)) {
> -			struct page *page = pmd_page(pmd);
> +		if (is_pmd_migration_entry(pmd)) {
> +			swp_entry_t entry = pmd_to_swp_entry(pmd);
>  
> +			frame = swp_type(entry) |
> +				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> +			page = migration_entry_to_page(entry);
> +		} else if (pmd_present(pmd)) {
> +			page = pmd_page(pmd);
>  			if (page_mapcount(page) == 1)
>  				flags |= PM_MMAP_EXCLUSIVE;
>  
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index b71a431ed649..6cf9e9b5a7be 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -726,77 +726,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>  #ifndef arch_needs_pgtable_deposit
>  #define arch_needs_pgtable_deposit() (false)
>  #endif
> -/*
> - * This function is meant to be used by sites walking pagetables with
> - * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
> - * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
> - * into a null pmd and the transhuge page fault can convert a null pmd
> - * into an hugepmd or into a regular pmd (if the hugepage allocation
> - * fails). While holding the mmap_sem in read mode the pmd becomes
> - * stable and stops changing under us only if it's not null and not a
> - * transhuge pmd. When those races occurs and this function makes a
> - * difference vs the standard pmd_none_or_clear_bad, the result is
> - * undefined so behaving like if the pmd was none is safe (because it
> - * can return none anyway). The compiler level barrier() is critically
> - * important to compute the two checks atomically on the same pmdval.
> - *
> - * For 32bit kernels with a 64bit large pmd_t this automatically takes
> - * care of reading the pmd atomically to avoid SMP race conditions
> - * against pmd_populate() when the mmap_sem is hold for reading by the
> - * caller (a special atomic read not done by "gcc" as in the generic
> - * version above, is also needed when THP is disabled because the page
> - * fault can populate the pmd from under us).
> - */
> -static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> -{
> -	pmd_t pmdval = pmd_read_atomic(pmd);
> -	/*
> -	 * The barrier will stabilize the pmdval in a register or on
> -	 * the stack so that it will stop changing under the code.
> -	 *
> -	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
> -	 * pmd_read_atomic is allowed to return a not atomic pmdval
> -	 * (for example pointing to an hugepage that has never been
> -	 * mapped in the pmd). The below checks will only care about
> -	 * the low part of the pmd with 32bit PAE x86 anyway, with the
> -	 * exception of pmd_none(). So the important thing is that if
> -	 * the low part of the pmd is found null, the high part will
> -	 * be also null or the pmd_none() check below would be
> -	 * confused.
> -	 */
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	barrier();
> -#endif
> -	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
> -		return 1;
> -	if (unlikely(pmd_bad(pmdval))) {
> -		pmd_clear_bad(pmd);
> -		return 1;
> -	}
> -	return 0;
> -}
> -
> -/*
> - * This is a noop if Transparent Hugepage Support is not built into
> - * the kernel. Otherwise it is equivalent to
> - * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
> - * places that already verified the pmd is not none and they want to
> - * walk ptes while holding the mmap sem in read mode (write mode don't
> - * need this). If THP is not enabled, the pmd can't go away under the
> - * code even if MADV_DONTNEED runs, but if THP is enabled we need to
> - * run a pmd_trans_unstable before walking the ptes after
> - * split_huge_page_pmd returns (because it may have run when the pmd
> - * become null, but then a page fault can map in a THP and not a
> - * regular page).
> - */
> -static inline int pmd_trans_unstable(pmd_t *pmd)
> -{
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	return pmd_none_or_trans_huge_or_clear_bad(pmd);
> -#else
> -	return 0;
> -#endif
> -}
>  
>  #ifndef CONFIG_NUMA_BALANCING
>  /*
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 83a8d42f9d55..c2e5a4eab84a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -131,7 +131,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  #define split_huge_pmd(__vma, __pmd, __address)				\
>  	do {								\
>  		pmd_t *____pmd = (__pmd);				\
> -		if (pmd_trans_huge(*____pmd)				\
> +		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
>  					|| pmd_devmap(*____pmd))	\
>  			__split_huge_pmd(__vma, __pmd, __address,	\
>  						false, NULL);		\
> @@ -162,12 +162,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
>  		struct vm_area_struct *vma);
>  extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
>  		struct vm_area_struct *vma);
> +
> +static inline int is_swap_pmd(pmd_t pmd)
> +{
> +	return !pmd_none(pmd) && !pmd_present(pmd);
> +}
> +
>  /* mmap_sem must be held on entry */
>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>  		struct vm_area_struct *vma)
>  {
>  	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> +	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>  		return __pmd_trans_huge_lock(pmd, vma);
>  	else
>  		return NULL;
> @@ -192,6 +198,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		pmd_t *pmd, int flags);
>  struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>  		pud_t *pud, int flags);
> +static inline int hpage_order(struct page *page)
> +{
> +	if (unlikely(PageTransHuge(page)))
> +		return HPAGE_PMD_ORDER;
> +	return 0;
> +}
>  
>  extern int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
>  
> @@ -232,6 +244,7 @@ static inline bool thp_migration_supported(void)
>  #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
>  
>  #define hpage_nr_pages(x) 1
> +#define hpage_order(x) 0
>  
>  #define transparent_hugepage_enabled(__vma) 0
>  
> @@ -274,6 +287,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>  					 long adjust_next)
>  {
>  }
> +static inline int is_swap_pmd(pmd_t pmd)
> +{
> +	return 0;
> +}
>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>  		struct vm_area_struct *vma)
>  {
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 6625bea13869..50e4aa7e7ff9 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -229,6 +229,80 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>  }
>  #endif
>  
> +/*
> + * This function is meant to be used by sites walking pagetables with
> + * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
> + * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
> + * into a null pmd and the transhuge page fault can convert a null pmd
> + * into an hugepmd or into a regular pmd (if the hugepage allocation
> + * fails). While holding the mmap_sem in read mode the pmd becomes
> + * stable and stops changing under us only if it's not null and not a
> + * transhuge pmd. When those races occurs and this function makes a
> + * difference vs the standard pmd_none_or_clear_bad, the result is
> + * undefined so behaving like if the pmd was none is safe (because it
> + * can return none anyway). The compiler level barrier() is critically
> + * important to compute the two checks atomically on the same pmdval.
> + *
> + * For 32bit kernels with a 64bit large pmd_t this automatically takes
> + * care of reading the pmd atomically to avoid SMP race conditions
> + * against pmd_populate() when the mmap_sem is hold for reading by the
> + * caller (a special atomic read not done by "gcc" as in the generic
> + * version above, is also needed when THP is disabled because the page
> + * fault can populate the pmd from under us).
> + */
> +static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> +{
> +	pmd_t pmdval = pmd_read_atomic(pmd);
> +	/*
> +	 * The barrier will stabilize the pmdval in a register or on
> +	 * the stack so that it will stop changing under the code.
> +	 *
> +	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
> +	 * pmd_read_atomic is allowed to return a not atomic pmdval
> +	 * (for example pointing to an hugepage that has never been
> +	 * mapped in the pmd). The below checks will only care about
> +	 * the low part of the pmd with 32bit PAE x86 anyway, with the
> +	 * exception of pmd_none(). So the important thing is that if
> +	 * the low part of the pmd is found null, the high part will
> +	 * be also null or the pmd_none() check below would be
> +	 * confused.
> +	 */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	barrier();
> +#endif
> +	if (pmd_none(pmdval) || pmd_trans_huge(pmdval)
> +			|| is_pmd_migration_entry(pmdval))
> +		return 1;
> +	if (unlikely(pmd_bad(pmdval))) {
> +		pmd_clear_bad(pmd);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * This is a noop if Transparent Hugepage Support is not built into
> + * the kernel. Otherwise it is equivalent to
> + * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
> + * places that already verified the pmd is not none and they want to
> + * walk ptes while holding the mmap sem in read mode (write mode don't
> + * need this). If THP is not enabled, the pmd can't go away under the
> + * code even if MADV_DONTNEED runs, but if THP is enabled we need to
> + * run a pmd_trans_unstable before walking the ptes after
> + * split_huge_page_pmd returns (because it may have run when the pmd
> + * become null, but then a page fault can map in a THP and not a
> + * regular page).
> + */
> +static inline int pmd_trans_unstable(pmd_t *pmd)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	return pmd_none_or_trans_huge_or_clear_bad(pmd);
> +#else
> +	return 0;
> +#endif
> +}
> +
> +

These functions are a page table / thp matter, so putting them in swapops.h
looks weird to me. Maybe you can avoid moving this code by using !pmd_present
instead of is_pmd_migration_entry?
We also have to consider renaming pmd_none_or_trans_huge_or_clear_bad();
I like a simple name like __pmd_trans_unstable(), but if you have a better idea,
that's great.
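
A minimal sketch of how the two suggestions could combine (checking !pmd_present()
instead of is_pmd_migration_entry(), and renaming the helper) might look like the
following; names and placement are only illustrative, not from the posted series:

    static inline int __pmd_trans_unstable(pmd_t *pmd)
    {
            pmd_t pmdval = pmd_read_atomic(pmd);

            /* same as above: stabilize pmdval against concurrent updates */
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
            barrier();
    #endif
            /* a non-present, non-none pmd (e.g. a migration entry) is unstable */
            if (pmd_none(pmdval) || pmd_trans_huge(pmdval) || !pmd_present(pmdval))
                    return 1;
            if (unlikely(pmd_bad(pmdval))) {
                    pmd_clear_bad(pmd);
                    return 1;
            }
            return 0;
    }

    static inline int pmd_trans_unstable(pmd_t *pmd)
    {
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
            return __pmd_trans_unstable(pmd);
    #else
            return 0;
    #endif
    }

With !pmd_present() doing the work, these helpers would not need anything from
swapops.h and could stay in include/asm-generic/pgtable.h.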

>  #ifdef CONFIG_MEMORY_FAILURE
>  
>  extern atomic_long_t num_poisoned_pages __read_mostly;
> diff --git a/mm/gup.c b/mm/gup.c
> index 1e67461b2733..82e0304e5d29 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -274,6 +274,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  	}
>  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>  		return no_page_table(vma, flags);
> +	if (!pmd_present(*pmd)) {
> +retry:
> +		if (likely(!(flags & FOLL_MIGRATION)))
> +			return no_page_table(vma, flags);
> +		pmd_migration_entry_wait(mm, pmd);
> +		goto retry;
> +	}
>  	if (pmd_devmap(*pmd)) {
>  		ptl = pmd_lock(mm, pmd);
>  		page = follow_devmap_pmd(vma, address, pmd, flags);
> @@ -285,6 +292,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  		return follow_page_pte(vma, address, pmd, flags);
>  
>  	ptl = pmd_lock(mm, pmd);
> +	if (unlikely(!pmd_present(*pmd))) {
> +retry_locked:
> +		if (likely(!(flags & FOLL_MIGRATION))) {
> +			spin_unlock(ptl);
> +			return no_page_table(vma, flags);
> +		}
> +		pmd_migration_entry_wait(mm, pmd);
> +		goto retry_locked;
> +	}
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>  		spin_unlock(ptl);
>  		return follow_page_pte(vma, address, pmd, flags);
> @@ -340,7 +356,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
>  	pud = pud_offset(pgd, address);
>  	BUG_ON(pud_none(*pud));
>  	pmd = pmd_offset(pud, address);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		return -EFAULT;
>  	VM_BUG_ON(pmd_trans_huge(*pmd));
>  	pte = pte_offset_map(pmd, address);
> @@ -1368,7 +1384,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  		pmd_t pmd = READ_ONCE(*pmdp);
>  
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_none(pmd))
> +		if (!pmd_present(pmd))
>  			return 0;
>  
>  		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fd54bbdc16cf..4ac923539372 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -897,6 +897,21 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  
>  	ret = -EAGAIN;
>  	pmd = *src_pmd;
> +
> +	if (unlikely(is_pmd_migration_entry(pmd))) {
> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +		if (is_write_migration_entry(entry)) {
> +			make_migration_entry_read(&entry);
> +			pmd = swp_entry_to_pmd(entry);
> +			set_pmd_at(src_mm, addr, src_pmd, pmd);
> +		}
> +		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> +		ret = 0;
> +		goto out_unlock;
> +	}
> +	WARN_ONCE(!pmd_present(pmd), "Unknown non-present format on pmd.\n");
> +
>  	if (unlikely(!pmd_trans_huge(pmd))) {
>  		pte_free(dst_mm, pgtable);
>  		goto out_unlock;
> @@ -1203,6 +1218,9 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
>  		goto out_unlock;
>  
> +	if (unlikely(!pmd_present(orig_pmd)))
> +		goto out_unlock;
> +
>  	page = pmd_page(orig_pmd);
>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>  	/*
> @@ -1337,7 +1355,15 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>  		goto out;
>  
> -	page = pmd_page(*pmd);
> +	if (is_pmd_migration_entry(*pmd)) {
> +		swp_entry_t entry;
> +
> +		entry = pmd_to_swp_entry(*pmd);
> +		page = pfn_to_page(swp_offset(entry));
> +		if (!is_migration_entry(entry))
> +			goto out;
> +	} else
> +		page = pmd_page(*pmd);
>  	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>  	if (flags & FOLL_TOUCH)
>  		touch_pmd(vma, addr, pmd);
> @@ -1533,6 +1559,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	if (is_huge_zero_pmd(orig_pmd))
>  		goto out;
>  
> +	if (unlikely(!pmd_present(orig_pmd)))
> +		goto out;
> +
>  	page = pmd_page(orig_pmd);
>  	/*
>  	 * If other processes are mapping this page, we couldn't discard
> @@ -1659,7 +1688,8 @@ int __zap_huge_pmd_locked(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			free_swap_and_cache(entry); /* warning on failure? */
>  			migration = 1;
>  		}
> -		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> +		if (!migration)
> +			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
>  	}
>  
>  	return 1;
> @@ -1775,10 +1805,22 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		 * data is likely to be read-cached on the local CPU and
>  		 * local/remote hits to the zero page are not interesting.
>  		 */
> -		if (prot_numa && is_huge_zero_pmd(*pmd)) {
> -			spin_unlock(ptl);
> -			return ret;
> -		}
> +		if (prot_numa && is_huge_zero_pmd(*pmd))
> +			goto unlock;
> +
> +		if (is_pmd_migration_entry(*pmd)) {
> +			swp_entry_t entry = pmd_to_swp_entry(*pmd);
> +
> +			if (is_write_migration_entry(entry)) {
> +				pmd_t newpmd;
> +
> +				make_migration_entry_read(&entry);
> +				newpmd = swp_entry_to_pmd(entry);
> +				set_pmd_at(mm, addr, pmd, newpmd);
> +			}
> +			goto unlock;
> +		} else if (!pmd_present(*pmd))
> +			WARN_ONCE(1, "Unknown non-present format on pmd.\n");
>  
>  		if (!prot_numa || !pmd_protnone(*pmd)) {
>  			entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> @@ -1790,6 +1832,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  			BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
>  					pmd_write(entry));
>  		}
> +unlock:
>  		spin_unlock(ptl);
>  	}
>  
> @@ -1806,7 +1849,8 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
>  {
>  	spinlock_t *ptl;
>  	ptl = pmd_lock(vma->vm_mm, pmd);
> -	if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> +	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
> +			pmd_devmap(*pmd)))
>  		return ptl;
>  	spin_unlock(ptl);
>  	return NULL;
> @@ -1924,7 +1968,7 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	struct page *page;
>  	pgtable_t pgtable;
>  	pmd_t _pmd;
> -	bool young, write, dirty, soft_dirty;
> +	bool young, write, dirty, soft_dirty, pmd_migration;
>  	unsigned long addr;
>  	int i;
>  	unsigned long haddr = address & HPAGE_PMD_MASK;
> @@ -1932,7 +1976,8 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
>  	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>  	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> -	VM_BUG_ON(!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd));
> +	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
> +				&& !pmd_devmap(*pmd));
>  
>  	count_vm_event(THP_SPLIT_PMD);
>  
> @@ -1960,7 +2005,14 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		goto out;
>  	}
>  
> -	page = pmd_page(*pmd);
> +	pmd_migration = is_pmd_migration_entry(*pmd);
> +	if (pmd_migration) {
> +		swp_entry_t entry;
> +
> +		entry = pmd_to_swp_entry(*pmd);
> +		page = pfn_to_page(swp_offset(entry));
> +	} else
> +		page = pmd_page(*pmd);
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>  	page_ref_add(page, HPAGE_PMD_NR - 1);
>  	write = pmd_write(*pmd);
> @@ -1979,7 +2031,7 @@ void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		 * transferred to avoid any possibility of altering
>  		 * permissions across VMAs.
>  		 */
> -		if (freeze) {
> +		if (freeze || pmd_migration) {
>  			swp_entry_t swp_entry;
>  			swp_entry = make_migration_entry(page + i, write);
>  			entry = swp_entry_to_pte(swp_entry);
> @@ -2077,7 +2129,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		page = pmd_page(*pmd);
>  		if (PageMlocked(page))
>  			clear_page_mlock(page);
> -	} else if (!pmd_devmap(*pmd))
> +	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
>  		goto out;
>  	__split_huge_pmd_locked(vma, pmd, address, freeze);
>  out:
> diff --git a/mm/madvise.c b/mm/madvise.c
> index e424a06e9f2b..0497a502351f 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -310,6 +310,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	unsigned long next;
>  
>  	next = pmd_addr_end(addr, end);
> +	if (!pmd_present(*pmd))
> +		return 0;
>  	if (pmd_trans_huge(*pmd))
>  		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
>  			goto next;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 44fb1e80701a..09bce3f0d622 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4633,6 +4633,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
>  	struct page *page = NULL;
>  	enum mc_target_type ret = MC_TARGET_NONE;
>  
> +	if (unlikely(!pmd_present(pmd)))
> +		return ret;
>  	page = pmd_page(pmd);
>  	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
>  	if (!(mc.flags & MOVE_ANON))
> diff --git a/mm/memory.c b/mm/memory.c
> index 7cfdd5208ef5..bf10b19e02d3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -999,7 +999,8 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
>  	src_pmd = pmd_offset(src_pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
> +		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
> +			|| pmd_devmap(*src_pmd)) {
>  			int err;
>  			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
>  			err = copy_huge_pmd(dst_mm, src_mm,
> @@ -1240,7 +1241,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>  	ptl = pmd_lock(vma->vm_mm, pmd);
>  	do {
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> +		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>  			if (next - addr != HPAGE_PMD_SIZE) {
>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
> @@ -3697,6 +3698,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  		pmd_t orig_pmd = *vmf.pmd;
>  
>  		barrier();
> +		if (unlikely(is_pmd_migration_entry(orig_pmd))) {
> +			pmd_migration_entry_wait(mm, vmf.pmd);
> +			return 0;
> +		}
>  		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
>  			vmf.flags |= FAULT_FLAG_SIZE_PMD;
>  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 19b460acb5e1..9cb4c83151a8 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c

Shouldn't the changes to mm/memory_hotplug.c go with patch 14/14?
# If so, the definition of hpage_order() should also move to 14/14.

Thanks,
Naoya Horiguchi
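
For reference, the hpage_order() helper being discussed is added by this patch to
include/linux/huge_mm.h (with a "#define hpage_order(x) 0" fallback when THP is
not configured):

    static inline int hpage_order(struct page *page)
    {
            if (unlikely(PageTransHuge(page)))
                    return HPAGE_PMD_ORDER;
            return 0;
    }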

> @@ -1570,6 +1570,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
>  	int nid = page_to_nid(page);
>  	nodemask_t nmask = node_states[N_MEMORY];
>  	struct page *new_page = NULL;
> +	unsigned int order = 0;
>  
>  	/*
>  	 * TODO: allocate a destination hugepage from a nearest neighbor node,
> @@ -1580,6 +1581,11 @@ static struct page *new_node_page(struct page *page, unsigned long private,
>  		return alloc_huge_page_node(page_hstate(compound_head(page)),
>  					next_node_in(nid, nmask));
>  
> +	if (thp_migration_supported() && PageTransHuge(page)) {
> +		order = hpage_order(page);
> +		gfp_mask |= GFP_TRANSHUGE;
> +	}
> +
>  	node_clear(nid, nmask);
>  
>  	if (PageHighMem(page)
> @@ -1593,6 +1599,9 @@ static struct page *new_node_page(struct page *page, unsigned long private,
>  		new_page = __alloc_pages(gfp_mask, 0,
>  					node_zonelist(nid, gfp_mask));
>  
> +	if (new_page && order == hpage_order(page))
> +		prep_transhuge_page(new_page);
> +
>  	return new_page;
>  }
>  
> @@ -1622,7 +1631,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  			if (isolate_huge_page(page, &source))
>  				move_pages -= 1 << compound_order(head);
>  			continue;
> -		}
> +		} else if (thp_migration_supported() && PageTransHuge(page))
> +			pfn = page_to_pfn(compound_head(page))
> +				+ hpage_nr_pages(page) - 1;
>  
>  		if (!get_page_unless_zero(page))
>  			continue;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 5cc6a99918ab..021ff13b9a7a 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -94,6 +94,7 @@
>  #include <linux/mm_inline.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/printk.h>
> +#include <linux/swapops.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/uaccess.h>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 98acf7d5cef2..bfbe66798a7a 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -150,7 +150,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		unsigned long this_pages;
>  
>  		next = pmd_addr_end(addr, end);
> -		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> +		if (!pmd_present(*pmd))
> +			continue;
> +		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
>  				&& pmd_none_or_clear_bad(pmd))
>  			continue;
>  
> @@ -160,7 +162,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  			mmu_notifier_invalidate_range_start(mm, mni_start, end);
>  		}
>  
> -		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> +		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>  			if (next - addr != HPAGE_PMD_SIZE) {
>  				__split_huge_pmd(vma, pmd, addr, false, NULL);
>  				if (pmd_trans_unstable(pmd))
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 8233b0105c82..5d537ce12adc 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -213,7 +213,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
>  		if (!new_pmd)
>  			break;
> -		if (pmd_trans_huge(*old_pmd)) {
> +		if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
>  			if (extent == HPAGE_PMD_SIZE) {
>  				bool moved;
>  				/* See comment in move_ptes() */
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 03761577ae86..114fc2b5a370 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -2,6 +2,8 @@
>  #include <linux/highmem.h>
>  #include <linux/sched.h>
>  #include <linux/hugetlb.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  
>  static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  			  struct mm_walk *walk)
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 13/14] mm: migrate: move_pages() supports thp migration
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-09  9:16     ` Naoya Horiguchi
  -1 siblings, 0 replies; 87+ messages in thread
From: Naoya Horiguchi @ 2017-02-09  9:16 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual, zi.yan

On Sun, Feb 05, 2017 at 11:12:51AM -0500, Zi Yan wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> This patch enables thp migration for move_pages(2).
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  mm/migrate.c | 37 ++++++++++++++++++++++++++++---------
>  1 file changed, 28 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 84181a3668c6..9bcaccb481ac 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1413,7 +1413,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
>  	if (PageHuge(p))
>  		return alloc_huge_page_node(page_hstate(compound_head(p)),
>  					pm->node);
> -	else
> +	else if (thp_migration_supported() && PageTransHuge(p)) {
> +		struct page *thp;
> +
> +		thp = alloc_pages_node(pm->node,
> +			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> +			HPAGE_PMD_ORDER);
> +		if (!thp)
> +			return NULL;
> +		prep_transhuge_page(thp);
> +		return thp;
> +	} else
>  		return __alloc_pages_node(pm->node,
>  				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
>  }
> @@ -1440,6 +1450,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>  	for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
>  		struct vm_area_struct *vma;
>  		struct page *page;
> +		struct page *head;
> +		unsigned int follflags;
>  
>  		err = -EFAULT;
>  		vma = find_vma(mm, pp->addr);
> @@ -1447,8 +1459,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>  			goto set_status;
>  
>  		/* FOLL_DUMP to ignore special (like zero) pages */
> -		page = follow_page(vma, pp->addr,
> -				FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
> +		follflags = FOLL_GET | FOLL_SPLIT | FOLL_DUMP;

FOLL_SPLIT should be added depending on thp_migration_supported().

Thanks,
Naoya Horiguchi
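
One way to read this suggestion: drop FOLL_SPLIT from the initial assignment, so a
thp is split only when thp migration is not available. Roughly (illustrative only,
not the posted hunk):

    follflags = FOLL_GET | FOLL_DUMP;
    if (!thp_migration_supported())
            follflags |= FOLL_SPLIT;
    page = follow_page(vma, pp->addr, follflags);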

> +		if (!thp_migration_supported())
> +			follflags |= FOLL_SPLIT;
> +		page = follow_page(vma, pp->addr, follflags);
>  
>  		err = PTR_ERR(page);
>  		if (IS_ERR(page))
> @@ -1458,7 +1472,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>  		if (!page)
>  			goto set_status;
>  
> -		pp->page = page;
>  		err = page_to_nid(page);
>  
>  		if (err == pp->node)
> @@ -1473,16 +1486,22 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>  			goto put_and_set;
>  
>  		if (PageHuge(page)) {
> -			if (PageHead(page))
> +			if (PageHead(page)) {
>  				isolate_huge_page(page, &pagelist);
> +				err = 0;
> +				pp->page = page;
> +			}
>  			goto put_and_set;
>  		}
>  
> -		err = isolate_lru_page(page);
> +		pp->page = compound_head(page);
> +		head = compound_head(page);
> +		err = isolate_lru_page(head);
>  		if (!err) {
> -			list_add_tail(&page->lru, &pagelist);
> -			inc_node_page_state(page, NR_ISOLATED_ANON +
> -					    page_is_file_cache(page));
> +			list_add_tail(&head->lru, &pagelist);
> +			mod_node_page_state(page_pgdat(head),
> +				NR_ISOLATED_ANON + page_is_file_cache(head),
> +				hpage_nr_pages(head));
>  		}
>  put_and_set:
>  		/*
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 04/14] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
  2017-02-09  9:14     ` Naoya Horiguchi
  (?)
@ 2017-02-09 15:07     ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-09 15:07 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual

[-- Attachment #1: Type: text/plain, Size: 4935 bytes --]

On 9 Feb 2017, at 3:14, Naoya Horiguchi wrote:

> On Sun, Feb 05, 2017 at 11:12:42AM -0500, Zi Yan wrote:
>> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid
>> false negative return when it races with thp split
>> (during which _PAGE_PRESENT is temporarily cleared.) I don't think that
>> dropping _PAGE_PSE check in pmd_present() works well because it can
>> hurt optimization of tlb handling in thp split.
>> In the current kernel, bits 1-4 are not used in non-present format
>> since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
>> work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
>> Bit 7 is used as reserved (always clear), so please don't use it for
>> other purpose.
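
For context, the x86 pmd_present() check referred to here looks roughly like this
(paraphrased from arch/x86/include/asm/pgtable.h of that era; treat the exact form
as illustrative):

    static inline int pmd_present(pmd_t pmd)
    {
            /*
             * _PAGE_PSE is checked together with _PAGE_PRESENT so that a pmd
             * whose present bit is temporarily cleared during thp split is
             * still reported as present.
             */
            return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
    }

This is why bit 7 (_PAGE_PSE) has to stay clear in the non-present (swap entry)
format.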
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> ChangeLog v3:
>> - Move _PAGE_SWP_SOFT_DIRTY to bit 1 (it was placed at bit 6), because
>> some CPUs might accidentally set bit 5 or 6.
>>
>> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
>> ---
>
> More documentation would be helpful; could you do something like the following?

Sure. Thanks for helping.

>
> Thanks,
> Naoya Horiguchi
> ---
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Sun, 5 Feb 2017 11:12:42 -0500
> Subject: [PATCH] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
>
> pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid
> false negative return when it races with thp split
> (during which _PAGE_PRESENT is temporarily cleared.) I don't think that
> dropping _PAGE_PSE check in pmd_present() works well because it can
> hurt optimization of tlb handling in thp split.
> In the current kernel, bits 1-4 are not used in non-present format
> since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
> work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
> Bit 7 is used as reserved (always clear), so please don't use it for
> other purpose.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> ---
>  arch/x86/include/asm/pgtable_64.h    | 12 +++++++++---
>  arch/x86/include/asm/pgtable_types.h | 10 +++++-----
>  2 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 73c7ccc38912..07c98c85cc96 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -157,15 +157,21 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>  /*
>   * Encode and de-code a swap entry
>   *
> - * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0| <- bit number
> - * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
> - * | OFFSET (14->63) | TYPE (9-13)  |0|X|X|X| X| X|X|X|0| <- swp entry
> + * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
> + * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
> + * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
>   *
>   * G (8) is aliased and used as a PROT_NONE indicator for
>   * !present ptes.  We need to start storing swap entries above
>   * there.  We also need to avoid using A and D because of an
>   * erratum where they can be incorrectly set by hardware on
>   * non-present PTEs.
> + *
> + * SD (1) in swp entry is used to store soft dirty bit, which helps us
> + * remember soft dirty over page migration.
> + *
> + * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
> + * but G.
>   */
>  #define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
>  #define SWP_TYPE_BITS 5
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 8b4de22d6429..3695abd58ef6 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -97,15 +97,15 @@
>  /*
>   * Tracking soft dirty bit when a page goes to a swap is tricky.
>   * We need a bit which can be stored in pte _and_ not conflict
> - * with swap entry format. On x86 bits 6 and 7 are *not* involved
> - * into swap entry computation, but bit 6 is used for nonlinear
> - * file mapping, so we borrow bit 7 for soft dirty tracking.
> + * with swap entry format. On x86 bits 1-4 are *not* involved
> + * into swap entry computation, but bit 7 is used for thp migration,
> + * so we borrow bit 1 for soft dirty tracking.
>   *
>   * Please note that this bit must be treated as swap dirty page
> - * mark if and only if the PTE has present bit clear!
> + * mark if and only if the PTE/PMD has present bit clear!
>   */
>  #ifdef CONFIG_MEM_SOFT_DIRTY
> -#define _PAGE_SWP_SOFT_DIRTY	_PAGE_PSE
> +#define _PAGE_SWP_SOFT_DIRTY	_PAGE_RW
>  #else
>  #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
>  #endif
> -- 
> 2.7.4


--
Best Regards
Yan Zi

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 08/14] mm: thp: enable thp migration in generic path
  2017-02-09  9:15     ` Naoya Horiguchi
  (?)
@ 2017-02-09 15:17     ` Zi Yan
  2017-02-09 23:04         ` Naoya Horiguchi
  -1 siblings, 1 reply; 87+ messages in thread
From: Zi Yan @ 2017-02-09 15:17 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual

[-- Attachment #1: Type: text/plain, Size: 5540 bytes --]

On 9 Feb 2017, at 3:15, Naoya Horiguchi wrote:

> On Sun, Feb 05, 2017 at 11:12:46AM -0500, Zi Yan wrote:
>> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> This patch adds thp migration's core code, including conversions
>> between a PMD entry and a swap entry, setting PMD migration entry,
>> removing PMD migration entry, and waiting on PMD migration entries.
>>
>> This patch makes it possible to support thp migration.
>> If you fail to allocate a destination page as a thp, you just split
>> the source thp as we do now, and then enter the normal page migration.
>> If you succeed to allocate destination thp, you enter thp migration.
>> Subsequent patches actually enable thp migration for each caller of
>> page migration by allowing its get_new_page() callback to
>> allocate thps.
>>
>> ChangeLog v1 -> v2:
>> - support pte-mapped thp, doubly-mapped thp
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> ChangeLog v2 -> v3:
>> - use page_vma_mapped_walk()
>>
>> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
>> ---
>>  arch/x86/include/asm/pgtable_64.h |   2 +
>>  include/linux/swapops.h           |  70 +++++++++++++++++-
>>  mm/huge_memory.c                  | 151 ++++++++++++++++++++++++++++++++++----
>>  mm/migrate.c                      |  29 +++++++-
>>  mm/page_vma_mapped.c              |  13 +++-
>>  mm/pgtable-generic.c              |   3 +-
>>  mm/rmap.c                         |  14 +++-
>>  7 files changed, 259 insertions(+), 23 deletions(-)
>>
> ...
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 6893c47428b6..fd54bbdc16cf 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1613,20 +1613,51 @@ int __zap_huge_pmd_locked(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>  		atomic_long_dec(&tlb->mm->nr_ptes);
>>  		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
>>  	} else {
>> -		struct page *page = pmd_page(orig_pmd);
>> -		page_remove_rmap(page, true);
>> -		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>> -		VM_BUG_ON_PAGE(!PageHead(page), page);
>> -		if (PageAnon(page)) {
>> -			pgtable_t pgtable;
>> -			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
>> -			pte_free(tlb->mm, pgtable);
>> -			atomic_long_dec(&tlb->mm->nr_ptes);
>> -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>> +		struct page *page;
>> +		int migration = 0;
>> +
>> +		if (!is_pmd_migration_entry(orig_pmd)) {
>> +			page = pmd_page(orig_pmd);
>> +			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>> +			VM_BUG_ON_PAGE(!PageHead(page), page);
>> +			page_remove_rmap(page, true);
>
>> +			if (PageAnon(page)) {
>> +				pgtable_t pgtable;
>> +
>> +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
>> +								      pmd);
>> +				pte_free(tlb->mm, pgtable);
>> +				atomic_long_dec(&tlb->mm->nr_ptes);
>> +				add_mm_counter(tlb->mm, MM_ANONPAGES,
>> +					       -HPAGE_PMD_NR);
>> +			} else {
>> +				if (arch_needs_pgtable_deposit())
>> +					zap_deposited_table(tlb->mm, pmd);
>> +				add_mm_counter(tlb->mm, MM_FILEPAGES,
>> +					       -HPAGE_PMD_NR);
>> +			}
>
> This block is exactly equal to the one in else block below,
> So you can factor out into some function.

Of course, I will do that.
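
Something along these lines, with the duplicated block pulled into a
helper (the helper name below is just a placeholder):

static void zap_thp_pgtable(struct mmu_gather *tlb, pmd_t *pmd,
			    struct page *page)
{
	if (PageAnon(page)) {
		pgtable_t pgtable;

		pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
		pte_free(tlb->mm, pgtable);
		atomic_long_dec(&tlb->mm->nr_ptes);
		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
	} else {
		if (arch_needs_pgtable_deposit())
			zap_deposited_table(tlb->mm, pmd);
		add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
	}
}

Then both branches of __zap_huge_pmd_locked() can simply call it once
they have computed the page.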

>
>>  		} else {
>> -			if (arch_needs_pgtable_deposit())
>> -				zap_deposited_table(tlb->mm, pmd);
>> -			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
>> +			swp_entry_t entry;
>> +
>> +			entry = pmd_to_swp_entry(orig_pmd);
>> +			page = pfn_to_page(swp_offset(entry));
>
>> +			if (PageAnon(page)) {
>> +				pgtable_t pgtable;
>> +
>> +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
>> +								      pmd);
>> +				pte_free(tlb->mm, pgtable);
>> +				atomic_long_dec(&tlb->mm->nr_ptes);
>> +				add_mm_counter(tlb->mm, MM_ANONPAGES,
>> +					       -HPAGE_PMD_NR);
>> +			} else {
>> +				if (arch_needs_pgtable_deposit())
>> +					zap_deposited_table(tlb->mm, pmd);
>> +				add_mm_counter(tlb->mm, MM_FILEPAGES,
>> +					       -HPAGE_PMD_NR);
>> +			}
>
>> +			free_swap_and_cache(entry); /* waring in failure? */
>> +			migration = 1;
>>  		}
>>  		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
>>  	}
>> @@ -2634,3 +2665,97 @@ static int __init split_huge_pages_debugfs(void)
>>  }
>>  late_initcall(split_huge_pages_debugfs);
>>  #endif
>> +
>> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>> +void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>> +		struct page *page)
>> +{
>> +	struct vm_area_struct *vma = pvmw->vma;
>> +	struct mm_struct *mm = vma->vm_mm;
>> +	unsigned long address = pvmw->address;
>> +	pmd_t pmdval;
>> +	swp_entry_t entry;
>> +
>> +	if (pvmw->pmd && !pvmw->pte) {
>> +		pmd_t pmdswp;
>> +
>> +		mmu_notifier_invalidate_range_start(mm, address,
>> +				address + HPAGE_PMD_SIZE);
>
> Don't you have to put mmu_notifier_invalidate_range_* outside this if block?

I think I need to add mmu_notifier_invalidate_page() in the else block.

Kirill's page_vma_mapped_walk() iterates over each PMD or PTE.
In set_pmd_migration_entry(), if the page is PMD-mapped, the function is
called once with the PMD, so mmu_notifier_invalidate_range_* can be used.
On the other hand, if the page is PTE-mapped, the function will be called
1~512 times depending on how many PTEs are present, so
mmu_notifier_invalidate_range_* is not suitable there.
mmu_notifier_invalidate_page() on the corresponding subpage should work.
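
Roughly what I have in mind (just a sketch of the split, not the final
patch; it reuses the pvmw fields from the hunk above):

	if (pvmw->pmd && !pvmw->pte) {
		/* PMD-mapped thp: one call covers the whole huge page */
		mmu_notifier_invalidate_range_start(mm, address,
				address + HPAGE_PMD_SIZE);
		/* ... install the PMD migration entry ... */
		mmu_notifier_invalidate_range_end(mm, address,
				address + HPAGE_PMD_SIZE);
	} else {
		/* PTE-mapped thp: called once per mapped subpage */
		/* ... install the PTE migration entry ... */
		mmu_notifier_invalidate_page(mm, address);
	}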



--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 09/14] mm: thp: check pmd migration entry in common path
  2017-02-09  9:16     ` Naoya Horiguchi
  (?)
@ 2017-02-09 17:36     ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-09 17:36 UTC (permalink / raw)
  To: Naoya Horiguchi, kirill.shutemov, Andrea Arcangeli
  Cc: linux-kernel, linux-mm, akpm, minchan, vbabka, mgorman, khandual

[-- Attachment #1: Type: text/plain, Size: 15417 bytes --]

On 9 Feb 2017, at 3:16, Naoya Horiguchi wrote:

> On Sun, Feb 05, 2017 at 11:12:47AM -0500, Zi Yan wrote:
>> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> If one of callers of page migration starts to handle thp,
>> memory management code start to see pmd migration entry, so we need
>> to prepare for it before enabling. This patch changes various code
>> point which checks the status of given pmds in order to prevent race
>> between thp migration and the pmd-related works.
>>
>> ChangeLog v1 -> v2:
>> - introduce pmd_related() (I know the naming is not good, but can't
>>   think up no better name. Any suggesntion is welcomed.)
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> ChangeLog v2 -> v3:
>> - add is_swap_pmd()
>> - a pmd entry should be is_swap_pmd(), pmd_trans_huge(), pmd_devmap(),
>>   or pmd_none()
>
> (nitpick) ... or normal pmd pointing to pte pages?

Sure, I will add it.

>
>> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear()
>> - flush_cache_range() while set_pmd_migration_entry()
>> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
>>   true on pmd_migration_entry, so that migration entries are not
>>   treated as pmd page table entries.
>>
>> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
>> ---
>>  arch/x86/mm/gup.c             |  4 +--
>>  fs/proc/task_mmu.c            | 22 ++++++++-----
>>  include/asm-generic/pgtable.h | 71 ----------------------------------------
>>  include/linux/huge_mm.h       | 21 ++++++++++--
>>  include/linux/swapops.h       | 74 +++++++++++++++++++++++++++++++++++++++++
>>  mm/gup.c                      | 20 ++++++++++--
>>  mm/huge_memory.c              | 76 ++++++++++++++++++++++++++++++++++++-------
>>  mm/madvise.c                  |  2 ++
>>  mm/memcontrol.c               |  2 ++
>>  mm/memory.c                   |  9 +++--
>>  mm/memory_hotplug.c           | 13 +++++++-
>>  mm/mempolicy.c                |  1 +
>>  mm/mprotect.c                 |  6 ++--
>>  mm/mremap.c                   |  2 +-
>>  mm/pagewalk.c                 |  2 ++
>>  15 files changed, 221 insertions(+), 104 deletions(-)
>>
>> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
>> index 0d4fb3ebbbac..78a153d90064 100644
>> --- a/arch/x86/mm/gup.c
>> +++ b/arch/x86/mm/gup.c
>> @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>>  		pmd_t pmd = *pmdp;
>>
>>  		next = pmd_addr_end(addr, end);
>> -		if (pmd_none(pmd))
>> +		if (!pmd_present(pmd))
>>  			return 0;
>> -		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
>> +		if (unlikely(pmd_large(pmd))) {
>>  			/*
>>  			 * NUMA hinting faults need to be handled in the GUP
>>  			 * slowpath for accounting purposes and so that they
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 6c07c7813b26..1e64d6898c68 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>
>>  	ptl = pmd_trans_huge_lock(pmd, vma);
>>  	if (ptl) {
>> -		smaps_pmd_entry(pmd, addr, walk);
>> +		if (pmd_present(*pmd))
>> +			smaps_pmd_entry(pmd, addr, walk);
>>  		spin_unlock(ptl);
>>  		return 0;
>>  	}
>> @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>>  			goto out;
>>  		}
>>
>> +		if (!pmd_present(*pmd))
>> +			goto out;
>> +
>>  		page = pmd_page(*pmd);
>>
>>  		/* Clear accessed and referenced bits. */
>> @@ -1208,19 +1212,19 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>>  	if (ptl) {
>>  		u64 flags = 0, frame = 0;
>>  		pmd_t pmd = *pmdp;
>> +		struct page *page;
>>
>>  		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
>>  			flags |= PM_SOFT_DIRTY;
>>
>> -		/*
>> -		 * Currently pmd for thp is always present because thp
>> -		 * can not be swapped-out, migrated, or HWPOISONed
>> -		 * (split in such cases instead.)
>> -		 * This if-check is just to prepare for future implementation.
>> -		 */
>> -		if (pmd_present(pmd)) {
>> -			struct page *page = pmd_page(pmd);
>> +		if (is_pmd_migration_entry(pmd)) {
>> +			swp_entry_t entry = pmd_to_swp_entry(pmd);
>>
>> +			frame = swp_type(entry) |
>> +				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
>> +			page = migration_entry_to_page(entry);
>> +		} else if (pmd_present(pmd)) {
>> +			page = pmd_page(pmd);
>>  			if (page_mapcount(page) == 1)
>>  				flags |= PM_MMAP_EXCLUSIVE;
>>
>> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>> index b71a431ed649..6cf9e9b5a7be 100644
>> --- a/include/asm-generic/pgtable.h
>> +++ b/include/asm-generic/pgtable.h
>> @@ -726,77 +726,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>>  #ifndef arch_needs_pgtable_deposit
>>  #define arch_needs_pgtable_deposit() (false)
>>  #endif
>> -/*
>> - * This function is meant to be used by sites walking pagetables with
>> - * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
>> - * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
>> - * into a null pmd and the transhuge page fault can convert a null pmd
>> - * into an hugepmd or into a regular pmd (if the hugepage allocation
>> - * fails). While holding the mmap_sem in read mode the pmd becomes
>> - * stable and stops changing under us only if it's not null and not a
>> - * transhuge pmd. When those races occurs and this function makes a
>> - * difference vs the standard pmd_none_or_clear_bad, the result is
>> - * undefined so behaving like if the pmd was none is safe (because it
>> - * can return none anyway). The compiler level barrier() is critically
>> - * important to compute the two checks atomically on the same pmdval.
>> - *
>> - * For 32bit kernels with a 64bit large pmd_t this automatically takes
>> - * care of reading the pmd atomically to avoid SMP race conditions
>> - * against pmd_populate() when the mmap_sem is hold for reading by the
>> - * caller (a special atomic read not done by "gcc" as in the generic
>> - * version above, is also needed when THP is disabled because the page
>> - * fault can populate the pmd from under us).
>> - */
>> -static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>> -{
>> -	pmd_t pmdval = pmd_read_atomic(pmd);
>> -	/*
>> -	 * The barrier will stabilize the pmdval in a register or on
>> -	 * the stack so that it will stop changing under the code.
>> -	 *
>> -	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
>> -	 * pmd_read_atomic is allowed to return a not atomic pmdval
>> -	 * (for example pointing to an hugepage that has never been
>> -	 * mapped in the pmd). The below checks will only care about
>> -	 * the low part of the pmd with 32bit PAE x86 anyway, with the
>> -	 * exception of pmd_none(). So the important thing is that if
>> -	 * the low part of the pmd is found null, the high part will
>> -	 * be also null or the pmd_none() check below would be
>> -	 * confused.
>> -	 */
>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -	barrier();
>> -#endif
>> -	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
>> -		return 1;
>> -	if (unlikely(pmd_bad(pmdval))) {
>> -		pmd_clear_bad(pmd);
>> -		return 1;
>> -	}
>> -	return 0;
>> -}
>> -
>> -/*
>> - * This is a noop if Transparent Hugepage Support is not built into
>> - * the kernel. Otherwise it is equivalent to
>> - * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
>> - * places that already verified the pmd is not none and they want to
>> - * walk ptes while holding the mmap sem in read mode (write mode don't
>> - * need this). If THP is not enabled, the pmd can't go away under the
>> - * code even if MADV_DONTNEED runs, but if THP is enabled we need to
>> - * run a pmd_trans_unstable before walking the ptes after
>> - * split_huge_page_pmd returns (because it may have run when the pmd
>> - * become null, but then a page fault can map in a THP and not a
>> - * regular page).
>> - */
>> -static inline int pmd_trans_unstable(pmd_t *pmd)
>> -{
>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -	return pmd_none_or_trans_huge_or_clear_bad(pmd);
>> -#else
>> -	return 0;
>> -#endif
>> -}
>>
>>  #ifndef CONFIG_NUMA_BALANCING
>>  /*
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 83a8d42f9d55..c2e5a4eab84a 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -131,7 +131,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>  #define split_huge_pmd(__vma, __pmd, __address)				\
>>  	do {								\
>>  		pmd_t *____pmd = (__pmd);				\
>> -		if (pmd_trans_huge(*____pmd)				\
>> +		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
>>  					|| pmd_devmap(*____pmd))	\
>>  			__split_huge_pmd(__vma, __pmd, __address,	\
>>  						false, NULL);		\
>> @@ -162,12 +162,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
>>  		struct vm_area_struct *vma);
>>  extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
>>  		struct vm_area_struct *vma);
>> +
>> +static inline int is_swap_pmd(pmd_t pmd)
>> +{
>> +	return !pmd_none(pmd) && !pmd_present(pmd);
>> +}
>> +
>>  /* mmap_sem must be held on entry */
>>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>>  		struct vm_area_struct *vma)
>>  {
>>  	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
>> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>> +	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>>  		return __pmd_trans_huge_lock(pmd, vma);
>>  	else
>>  		return NULL;
>> @@ -192,6 +198,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>>  		pmd_t *pmd, int flags);
>>  struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>>  		pud_t *pud, int flags);
>> +static inline int hpage_order(struct page *page)
>> +{
>> +	if (unlikely(PageTransHuge(page)))
>> +		return HPAGE_PMD_ORDER;
>> +	return 0;
>> +}
>>
>>  extern int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
>>
>> @@ -232,6 +244,7 @@ static inline bool thp_migration_supported(void)
>>  #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
>>
>>  #define hpage_nr_pages(x) 1
>> +#define hpage_order(x) 0
>>
>>  #define transparent_hugepage_enabled(__vma) 0
>>
>> @@ -274,6 +287,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>>  					 long adjust_next)
>>  {
>>  }
>> +static inline int is_swap_pmd(pmd_t pmd)
>> +{
>> +	return 0;
>> +}
>>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>>  		struct vm_area_struct *vma)
>>  {
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 6625bea13869..50e4aa7e7ff9 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -229,6 +229,80 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>>  }
>>  #endif
>>
>> +/*
>> + * This function is meant to be used by sites walking pagetables with
>> + * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
>> + * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
>> + * into a null pmd and the transhuge page fault can convert a null pmd
>> + * into an hugepmd or into a regular pmd (if the hugepage allocation
>> + * fails). While holding the mmap_sem in read mode the pmd becomes
>> + * stable and stops changing under us only if it's not null and not a
>> + * transhuge pmd. When those races occurs and this function makes a
>> + * difference vs the standard pmd_none_or_clear_bad, the result is
>> + * undefined so behaving like if the pmd was none is safe (because it
>> + * can return none anyway). The compiler level barrier() is critically
>> + * important to compute the two checks atomically on the same pmdval.
>> + *
>> + * For 32bit kernels with a 64bit large pmd_t this automatically takes
>> + * care of reading the pmd atomically to avoid SMP race conditions
>> + * against pmd_populate() when the mmap_sem is hold for reading by the
>> + * caller (a special atomic read not done by "gcc" as in the generic
>> + * version above, is also needed when THP is disabled because the page
>> + * fault can populate the pmd from under us).
>> + */
>> +static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>> +{
>> +	pmd_t pmdval = pmd_read_atomic(pmd);
>> +	/*
>> +	 * The barrier will stabilize the pmdval in a register or on
>> +	 * the stack so that it will stop changing under the code.
>> +	 *
>> +	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
>> +	 * pmd_read_atomic is allowed to return a not atomic pmdval
>> +	 * (for example pointing to an hugepage that has never been
>> +	 * mapped in the pmd). The below checks will only care about
>> +	 * the low part of the pmd with 32bit PAE x86 anyway, with the
>> +	 * exception of pmd_none(). So the important thing is that if
>> +	 * the low part of the pmd is found null, the high part will
>> +	 * be also null or the pmd_none() check below would be
>> +	 * confused.
>> +	 */
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	barrier();
>> +#endif
>> +	if (pmd_none(pmdval) || pmd_trans_huge(pmdval)
>> +			|| is_pmd_migration_entry(pmdval))
>> +		return 1;
>> +	if (unlikely(pmd_bad(pmdval))) {
>> +		pmd_clear_bad(pmd);
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
>> +
>> +/*
>> + * This is a noop if Transparent Hugepage Support is not built into
>> + * the kernel. Otherwise it is equivalent to
>> + * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
>> + * places that already verified the pmd is not none and they want to
>> + * walk ptes while holding the mmap sem in read mode (write mode don't
>> + * need this). If THP is not enabled, the pmd can't go away under the
>> + * code even if MADV_DONTNEED runs, but if THP is enabled we need to
>> + * run a pmd_trans_unstable before walking the ptes after
>> + * split_huge_page_pmd returns (because it may have run when the pmd
>> + * become null, but then a page fault can map in a THP and not a
>> + * regular page).
>> + */
>> +static inline int pmd_trans_unstable(pmd_t *pmd)
>> +{
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	return pmd_none_or_trans_huge_or_clear_bad(pmd);
>> +#else
>> +	return 0;
>> +#endif
>> +}
>> +
>> +
>
> These functions are page table or thp matter, so putting them in swapops.h
> looks weird to me. Maybe you can avoid this code transfer by using !pmd_present
> instead of is_pmd_migration_entry?
> And we have to consider renaming pmd_none_or_trans_huge_or_clear_bad(),
> I like a simple name like __pmd_trans_unstable(), but if you have an idea,
> that's great.

Yes. I will move it back.

I am not sure it is OK to use only !pmd_present(); we may miss some
pmd_bad() cases.
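
To make the concern concrete, this is how I read the suggestion applied
to pmd_none_or_trans_huge_or_clear_bad() (sketch only):

	if (pmd_none(pmdval) || pmd_trans_huge(pmdval)
			|| !pmd_present(pmdval))
		return 1;
	/*
	 * A non-present pmd that is bad rather than a migration entry
	 * returns above and never reaches pmd_clear_bad() below.
	 */
	if (unlikely(pmd_bad(pmdval))) {
		pmd_clear_bad(pmd);
		return 1;
	}
	return 0;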

Kirill and Andrea, can you give some insight on this?


>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 19b460acb5e1..9cb4c83151a8 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>
> Changes on mm/memory_hotplug.c should be with patch 14/14?
> # If that's right, definition of hpage_order() also should go to 14/14.

Got it. I will move it.


--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 13/14] mm: migrate: move_pages() supports thp migration
  2017-02-09  9:16     ` Naoya Horiguchi
  (?)
@ 2017-02-09 17:37     ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-09 17:37 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual

[-- Attachment #1: Type: text/plain, Size: 1961 bytes --]

On 9 Feb 2017, at 3:16, Naoya Horiguchi wrote:

> On Sun, Feb 05, 2017 at 11:12:51AM -0500, Zi Yan wrote:
>> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> This patch enables thp migration for move_pages(2).
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> ---
>>  mm/migrate.c | 37 ++++++++++++++++++++++++++++---------
>>  1 file changed, 28 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 84181a3668c6..9bcaccb481ac 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1413,7 +1413,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
>>  	if (PageHuge(p))
>>  		return alloc_huge_page_node(page_hstate(compound_head(p)),
>>  					pm->node);
>> -	else
>> +	else if (thp_migration_supported() && PageTransHuge(p)) {
>> +		struct page *thp;
>> +
>> +		thp = alloc_pages_node(pm->node,
>> +			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
>> +			HPAGE_PMD_ORDER);
>> +		if (!thp)
>> +			return NULL;
>> +		prep_transhuge_page(thp);
>> +		return thp;
>> +	} else
>>  		return __alloc_pages_node(pm->node,
>>  				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
>>  }
>> @@ -1440,6 +1450,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>>  	for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
>>  		struct vm_area_struct *vma;
>>  		struct page *page;
>> +		struct page *head;
>> +		unsigned int follflags;
>>
>>  		err = -EFAULT;
>>  		vma = find_vma(mm, pp->addr);
>> @@ -1447,8 +1459,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>>  			goto set_status;
>>
>>  		/* FOLL_DUMP to ignore special (like zero) pages */
>> -		page = follow_page(vma, pp->addr,
>> -				FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
>> +		follflags = FOLL_GET | FOLL_SPLIT | FOLL_DUMP;
>
> FOLL_SPLIT should be added depending on thp_migration_supported().

Sure. I will fix it.
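
Probably something like this (a sketch, not the final patch):

		/* FOLL_DUMP to ignore special (like zero) pages */
		follflags = FOLL_GET | FOLL_DUMP;
		if (!thp_migration_supported())
			follflags |= FOLL_SPLIT;
		page = follow_page(vma, pp->addr, follflags);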


--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 08/14] mm: thp: enable thp migration in generic path
  2017-02-09 15:17     ` Zi Yan
@ 2017-02-09 23:04         ` Naoya Horiguchi
  0 siblings, 0 replies; 87+ messages in thread
From: Naoya Horiguchi @ 2017-02-09 23:04 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, khandual

On Thu, Feb 09, 2017 at 09:17:01AM -0600, Zi Yan wrote:
> On 9 Feb 2017, at 3:15, Naoya Horiguchi wrote:
> 
> > On Sun, Feb 05, 2017 at 11:12:46AM -0500, Zi Yan wrote:
> >> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >>
> >> This patch adds thp migration's core code, including conversions
> >> between a PMD entry and a swap entry, setting PMD migration entry,
> >> removing PMD migration entry, and waiting on PMD migration entries.
> >>
> >> This patch makes it possible to support thp migration.
> >> If you fail to allocate a destination page as a thp, you just split
> >> the source thp as we do now, and then enter the normal page migration.
> >> If you succeed to allocate destination thp, you enter thp migration.
> >> Subsequent patches actually enable thp migration for each caller of
> >> page migration by allowing its get_new_page() callback to
> >> allocate thps.
> >>
> >> ChangeLog v1 -> v2:
> >> - support pte-mapped thp, doubly-mapped thp
> >>
> >> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >>
> >> ChangeLog v2 -> v3:
> >> - use page_vma_mapped_walk()
> >>
> >> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> >> ---
> >>  arch/x86/include/asm/pgtable_64.h |   2 +
> >>  include/linux/swapops.h           |  70 +++++++++++++++++-
> >>  mm/huge_memory.c                  | 151 ++++++++++++++++++++++++++++++++++----
> >>  mm/migrate.c                      |  29 +++++++-
> >>  mm/page_vma_mapped.c              |  13 +++-
> >>  mm/pgtable-generic.c              |   3 +-
> >>  mm/rmap.c                         |  14 +++-
> >>  7 files changed, 259 insertions(+), 23 deletions(-)
> >>
> > ...
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 6893c47428b6..fd54bbdc16cf 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -1613,20 +1613,51 @@ int __zap_huge_pmd_locked(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>  		atomic_long_dec(&tlb->mm->nr_ptes);
> >>  		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> >>  	} else {
> >> -		struct page *page = pmd_page(orig_pmd);
> >> -		page_remove_rmap(page, true);
> >> -		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> >> -		VM_BUG_ON_PAGE(!PageHead(page), page);
> >> -		if (PageAnon(page)) {
> >> -			pgtable_t pgtable;
> >> -			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> >> -			pte_free(tlb->mm, pgtable);
> >> -			atomic_long_dec(&tlb->mm->nr_ptes);
> >> -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> >> +		struct page *page;
> >> +		int migration = 0;
> >> +
> >> +		if (!is_pmd_migration_entry(orig_pmd)) {
> >> +			page = pmd_page(orig_pmd);
> >> +			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> >> +			VM_BUG_ON_PAGE(!PageHead(page), page);
> >> +			page_remove_rmap(page, true);
> >
> >> +			if (PageAnon(page)) {
> >> +				pgtable_t pgtable;
> >> +
> >> +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> >> +								      pmd);
> >> +				pte_free(tlb->mm, pgtable);
> >> +				atomic_long_dec(&tlb->mm->nr_ptes);
> >> +				add_mm_counter(tlb->mm, MM_ANONPAGES,
> >> +					       -HPAGE_PMD_NR);
> >> +			} else {
> >> +				if (arch_needs_pgtable_deposit())
> >> +					zap_deposited_table(tlb->mm, pmd);
> >> +				add_mm_counter(tlb->mm, MM_FILEPAGES,
> >> +					       -HPAGE_PMD_NR);
> >> +			}
> >
> > This block is exactly equal to the one in else block below,
> > So you can factor out into some function.
> 
> Of course, I will do that.
> 
> >
> >>  		} else {
> >> -			if (arch_needs_pgtable_deposit())
> >> -				zap_deposited_table(tlb->mm, pmd);
> >> -			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> >> +			swp_entry_t entry;
> >> +
> >> +			entry = pmd_to_swp_entry(orig_pmd);
> >> +			page = pfn_to_page(swp_offset(entry));
> >
> >> +			if (PageAnon(page)) {
> >> +				pgtable_t pgtable;
> >> +
> >> +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> >> +								      pmd);
> >> +				pte_free(tlb->mm, pgtable);
> >> +				atomic_long_dec(&tlb->mm->nr_ptes);
> >> +				add_mm_counter(tlb->mm, MM_ANONPAGES,
> >> +					       -HPAGE_PMD_NR);
> >> +			} else {
> >> +				if (arch_needs_pgtable_deposit())
> >> +					zap_deposited_table(tlb->mm, pmd);
> >> +				add_mm_counter(tlb->mm, MM_FILEPAGES,
> >> +					       -HPAGE_PMD_NR);
> >> +			}
> >
> >> +			free_swap_and_cache(entry); /* waring in failure? */
> >> +			migration = 1;
> >>  		}
> >>  		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> >>  	}
> >> @@ -2634,3 +2665,97 @@ static int __init split_huge_pages_debugfs(void)
> >>  }
> >>  late_initcall(split_huge_pages_debugfs);
> >>  #endif
> >> +
> >> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> >> +void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> >> +		struct page *page)
> >> +{
> >> +	struct vm_area_struct *vma = pvmw->vma;
> >> +	struct mm_struct *mm = vma->vm_mm;
> >> +	unsigned long address = pvmw->address;
> >> +	pmd_t pmdval;
> >> +	swp_entry_t entry;
> >> +
> >> +	if (pvmw->pmd && !pvmw->pte) {
> >> +		pmd_t pmdswp;
> >> +
> >> +		mmu_notifier_invalidate_range_start(mm, address,
> >> +				address + HPAGE_PMD_SIZE);
> >
> > Don't you have to put mmu_notifier_invalidate_range_* outside this if block?
> 
> I think I need to add mmu_notifier_invalidate_page() in else block.
> 
> Because Kirill's page_vma_mapped_walk() iterates each PMD or PTE.
> In set_pmd_migration_entry(), if the page is PMD-mapped, it will be called once
> with PMD, then mmu_notifier_invalidate_range_* can be used. On the other hand,
> if the page is PTE-mapped, the function will be called 1~512 times depending
> on how many PTEs are present. mmu_notifier_invalidate_range_* is not suitable.
> mmu_notifier_invalidate_page() on the corresponding subpage should work.
> 

Ah right, mmu_notifier_invalidate_page() is better for PTE-mapped thp.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-07 17:45             ` Kirill A. Shutemov
@ 2017-02-13  0:25               ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-13  0:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Minchan Kim, linux-kernel, linux-mm,
	kirill.shutemov, akpm, vbabka, mgorman, n-horiguchi, khandual,
	Zi Yan

[-- Attachment #1: Type: text/plain, Size: 1702 bytes --]

Hi Kirill,

>>>> The crash scenario I guess is like:
>>>> 1. A huge page pmd entry is in the middle of being changed into either a
>>>> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
>>>>
>>>> 2. At the same time, the application frees the vma this page belongs to.
>>>
>>> Em... no.
>>>
>>> This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
>>> And we only be able to remove vma under down_write(mmap_sem), so the
>>> scenario should be excluded.
>>>
>>> What do I miss?
>>
>> You are right. This problem will not happen in the upstream kernel.
>>
>> The problem comes from my customized kernel, where I migrate pages away
>> instead of reclaiming them when memory is under pressure. I did not take
>> any mmap_sem when I migrate pages. So I got this error.
>>
>> It is a false alarm. Sorry about that. Thanks for clarifying the problem.
>
> I think there's still a race between MADV_DONTNEED and
> change_huge_pmd(.prot_numa=1) resulting in skipping THP by
> zap_pmd_range(). It need to be addressed.
>
> And MADV_FREE requires a fix.
>
> So, minus one non-bug, plus two bugs.
>

You said a huge page pmd entry needs to be changed under down_read(mmap_sem).
Is that only true for huge pages?

In mm/compaction.c, the kernel does not take down_read(mmap_sem) during
memory compaction. In other words, base page migrations do not hold
down_read(mmap_sem), so in zap_pte_range() the kernel needs to hold the
PTE page table locks. Am I right about this?

If so, IMHO, when we eventually need to compact 2MB pages to form 1GB
pages, pmd locks will have to be taken in zap_pmd_range() to make that
kind of compaction possible.

Do you agree?


--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-13  0:25               ` Zi Yan
@ 2017-02-13 10:59                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 87+ messages in thread
From: Kirill A. Shutemov @ 2017-02-13 10:59 UTC (permalink / raw)
  To: Zi Yan, Andrea Arcangeli
  Cc: Minchan Kim, linux-kernel, linux-mm, kirill.shutemov, akpm,
	vbabka, mgorman, n-horiguchi, khandual, Zi Yan

On Sun, Feb 12, 2017 at 06:25:09PM -0600, Zi Yan wrote:
> Hi Kirill,
> 
> >>>> The crash scenario I guess is like:
> >>>> 1. A huge page pmd entry is in the middle of being changed into either a
> >>>> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
> >>>>
> >>>> 2. At the same time, the application frees the vma this page belongs to.
> >>>
> >>> Em... no.
> >>>
> >>> This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
> >>> And we only be able to remove vma under down_write(mmap_sem), so the
> >>> scenario should be excluded.
> >>>
> >>> What do I miss?
> >>
> >> You are right. This problem will not happen in the upstream kernel.
> >>
> >> The problem comes from my customized kernel, where I migrate pages away
> >> instead of reclaiming them when memory is under pressure. I did not take
> >> any mmap_sem when I migrate pages. So I got this error.
> >>
> >> It is a false alarm. Sorry about that. Thanks for clarifying the problem.
> >
> > I think there's still a race between MADV_DONTNEED and
> > change_huge_pmd(.prot_numa=1) resulting in skipping THP by
> > zap_pmd_range(). It need to be addressed.
> >
> > And MADV_FREE requires a fix.
> >
> > So, minus one non-bug, plus two bugs.
> >
> 
> You said a huge page pmd entry needs to be changed under down_read(mmap_sem).
> It is only true for huge pages, right?

mmap_sem is a way to make sure that the VMA will not go away under you.
Besides mmap_sem, anon_vma_lock/i_mmap_lock can be used for this.

> Since in mm/compaction.c, the kernel does not down_read(mmap_sem) during memory
> compaction. Namely, base page migrations do not hold down_read(mmap_sem),
> so in zap_pte_range(), the kernel needs to hold PTE page table locks.
> Am I right about this?
> 
> If yes. IMHO, ultimately, when we need to compact 2MB pages to form 1GB pages,
> in zap_pmd_range(), pmd locks have to be taken to make that kind of compactions
> possible.
> 
> Do you agree?

I *think* we can get away with a speculative (without ptl) check in
zap_pmd_range() if we make the page fault path the only place that can
turn pmd_none() into something else.

It means all other sites that change a pmd must not clear it
transiently while modifying it, unless they run under
down_write(mmap_sem).

I found two such problematic places in the kernel:

 - change_huge_pmd(.prot_numa=1);

 - madvise_free_huge_pmd();

Both clear the pmd before setting up a modified version, and both run
under down_read(mmap_sem).
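
The pattern in question, roughly (simplified, not the exact kernel
code):

	entry = pmdp_huge_get_and_clear(mm, addr, pmd);
	/*
	 * *pmd is transiently pmd_none() here; a racing zap_pmd_range()
	 * under down_read(mmap_sem) can observe that and skip the THP.
	 */
	entry = pmd_modify(entry, newprot);
	set_pmd_at(mm, addr, pmd, entry);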

The migration path would also need to establish the migration pmd
atomically to make this work.

Once all these cases are fixed, zap_pmd_range() would only be able to
race with a page fault when it is called from MADV_DONTNEED.
That case is not a problem.

Andrea, does it sound reasonable to you?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
  2017-02-13 10:59                 ` Kirill A. Shutemov
@ 2017-02-13 14:40                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 87+ messages in thread
From: Andrea Arcangeli @ 2017-02-13 14:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Zi Yan, Minchan Kim, linux-kernel, linux-mm, kirill.shutemov,
	akpm, vbabka, mgorman, n-horiguchi, khandual, Zi Yan

Hello!

On Mon, Feb 13, 2017 at 01:59:06PM +0300, Kirill A. Shutemov wrote:
> On Sun, Feb 12, 2017 at 06:25:09PM -0600, Zi Yan wrote:
> > Since in mm/compaction.c, the kernel does not down_read(mmap_sem) during memory
> > compaction. Namely, base page migrations do not hold down_read(mmap_sem),
> > so in zap_pte_range(), the kernel needs to hold PTE page table locks.
> > Am I right about this?
> > 
> > If yes. IMHO, ultimately, when we need to compact 2MB pages to form 1GB pages,
> > in zap_pmd_range(), pmd locks have to be taken to make that kind of compactions
> > possible.
> > 
> > Do you agree?

Compaction skips compound pages in the LRU entirely because they're
guaranteed to be HPAGE_PMD_ORDER in size by design, so yes, compaction
is not effective in helping compound page allocations of order >
HPAGE_PMD_ORDER; you need to use the CMA allocation APIs for that
instead of plain alloc_pages for orders higher than HPAGE_PMD_ORDER.

That only leaves order 10 not covered, which happens to match
HPAGE_PMD_ORDER on x86 32bit non-PAE. In fact MAX_ORDER should be
trimmed to 10 for x86-64 and x86 32bit PAE mode; we're probably
wasting a bit of CPU maintaining order 10 for no good reason on
x86-64, but that's another issue, not related to THP pmd updates.

There's no issue with x86 pmd updates in compaction because we don't
compact those in the first place, and 1GB pages in THP are not
feasible regardless of MAX_ORDER being 10 or 11.

> I *think* we can get away with speculative (without ptl) check in
> zap_pmd_range() if we make page fault the only place that can turn
> pmd_none() into something else.
> 
> It means all other sides that change pmd must not clear it intermittently
> during pmd change, unless run under down_write(mmap_sem).
> 
> I found two such problematic places in kernel:
> 
>  - change_huge_pmd(.prot_numa=1);
> 
>  - madvise_free_huge_pmd();
> 
> Both clear pmd before setting up a modified version. Both under
> down_read(mmap_sem).
> 
> The migration path also would need to establish migration pmd atomically
> to make this work.

Page table updates are always atomic; the issue here is not atomicity
nor the lock prefix, but just "not temporarily showing zero pmds" when
only holding the pmd_lock and mmap_sem for reading (i.e. not holding
it for writing).

> 
> Once all these cases will be fixed, zap_pmd_range() would only be able to
> race with page fault if it called from MADV_DONTNEED.
> This case is not a problem.

Yes, this case is handled fine by pmd_trans_unstable and
pmd_none_or_trans_huge_or_clear_bad; it's a controlled race with an
undefined result for userland when it triggers. We just have to be
consistent with the invariants and not let the kernel get confused
about it, so no problem there.

> Andrea, does it sound reasonable to you?

Yes, pmdp_invalidate does exactly that; it never shows a zero pmd,
instead it does:

	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));

Calling change_huge_pmd under mmap_sem for reading is a relatively
recent introduction; in the older THP code that could never happen,
because it always ran under the mmap_sem for writing, and the code
wasn't updated to cover this race condition when the mmap_sem started
to be taken for reading to arm NUMA hinting faults in task work.

madvise_free_huge_pmd, introduced with MADV_FREE, is also a newer
addition not present in the older THP code.

Whenever the mmap_sem is taken only for reading, the pmd should never
be zeroed out, even transiently; instead we should do what
split_huge_page->pmdp_invalidate does. It's not hard to atomically
update the pmd with a new value:

#ifndef __HAVE_ARCH_PMDP_INVALIDATE
void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
		     pmd_t *pmdp)
{
	pmd_t entry = *pmdp;
	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
}
#endif

set_pmd_at will do (it's atomic as far as C is concerned; it's not
guaranteed by the C standard, but we always depend on gcc being smart
enough to emit an atomic store for all pgtable updates, nothing
special here).

Simply removing pmdp_huge_get_and_clear_full and calling set_pmd_at
directly should do the trick. The pmd can't change while the pmd_lock
is held, so we just have to read it, update it, and overwrite it
atomically with set_pmd_at. It will actually speed up the code by
removing an unnecessary clear.

The only reason for doing pmdp_huge_get_and_clear_full in the pte
cases is to avoid losing the dirty bit updated by hardware. That is a
non-issue for anon memory, where all THP anon memory is always dirty,
as it can't be swapped natively and it can never be freed and dropped
unless it's split first (at which point it's no longer a huge pmd).

The problem with the dirty bit happens in this sequence:

    pmd_t pmd = *pmdp; // dirty bit is not set in pmd
    // dirty bit is set by hardware in *pmdp
    set_pmd_at(..., pmd); // dirty bit lost

pmdp_huge_get_and_clear_full prevents the above, so it's preferable,
but it's unusable outside of mmap_sem held for writing.

tmpfs THP should do the same, as the swap API isn't capable of
creating large chunks natively.

If we were to track the dirty bit, what we could do is introduce an
xchg-based pmdp_invalidate variant that, instead of setting the pmd to
zero, sets it to non-present or to a migration entry.

In short, either we drop pmdp_huge_get_and_clear_full entirely and only
use set_pmd_at (because if we don't use it consistently, like the old
THP code did, there's no point in wasting CPU on those xchg, as there
would be code paths that eventually lose dirty bits anyway), or we
replace it with a variant that can be called with only the mmap_sem for
reading and the pmd_lock held, one that doesn't clear the pmd but still
prevents the hardware from setting the dirty bit while we update it.
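
For instance, something along these lines (an x86-64-centric sketch
assuming the native single-word pmd layout, ignoring paravirt and PAE;
the name is made up):

static inline pmd_t pmdp_invalidate_xchg(struct vm_area_struct *vma,
					 unsigned long address, pmd_t *pmdp,
					 pmd_t newpmd)
{
	/*
	 * Single atomic read-modify-write: the pmd is never seen as
	 * pmd_none() and a hardware dirty bit update cannot be lost in
	 * between; 'old' carries whatever A/D bits were already set.
	 */
	pmd_t old = __pmd(xchg(&pmdp->pmd, pmd_val(newpmd)));

	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return old;
}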

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
@ 2017-02-13 14:40                   ` Andrea Arcangeli
  0 siblings, 0 replies; 87+ messages in thread
From: Andrea Arcangeli @ 2017-02-13 14:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Zi Yan, Minchan Kim, linux-kernel, linux-mm, kirill.shutemov,
	akpm, vbabka, mgorman, n-horiguchi, khandual, Zi Yan

Hello!

On Mon, Feb 13, 2017 at 01:59:06PM +0300, Kirill A. Shutemov wrote:
> On Sun, Feb 12, 2017 at 06:25:09PM -0600, Zi Yan wrote:
> > Since in mm/compaction.c, the kernel does not down_read(mmap_sem) during memory
> > compaction. Namely, base page migrations do not hold down_read(mmap_sem),
> > so in zap_pte_range(), the kernel needs to hold PTE page table locks.
> > Am I right about this?
> > 
> > If yes. IMHO, ultimately, when we need to compact 2MB pages to form 1GB pages,
> > in zap_pmd_range(), pmd locks have to be taken to make that kind of compactions
> > possible.
> > 
> > Do you agree?

compaction skips compound pages in the LRU entirely because they're
guaranteed to be HPAGE_PMD_ORDER in size by design, so yes compaction
is not effective in helping compound page allocations of order >
HPAGE_PMD_ORDER, you need to use CMA allocation APIs for that instead
of plain alloc_pages for orders higher than HPAGE_PMD_ORDER.

That only leaves order 10 not covered, which happens to match
HPAGE_PMD_ORDER on x86 32bit non-PAE. In fact MAX_ORDER should be
trimmed to 10 for x86-64 and x86 32bit PAE mode, we're probably
wasting a bit of CPU to maintain order 10 for no good on x86-64 but
that's another issue not related to THP pmd updates.

There's no issue with x86 pmd updates in compaction because we don't
compact those in the first place and 1GB pages in THP are not feasible
regardless of MAX_ORDER being 10 or 11.

> I *think* we can get away with speculative (without ptl) check in
> zap_pmd_range() if we make page fault the only place that can turn
> pmd_none() into something else.
> 
> It means all other sides that change pmd must not clear it intermittently
> during pmd change, unless run under down_write(mmap_sem).
> 
> I found two such problematic places in kernel:
> 
>  - change_huge_pmd(.prot_numa=1);
> 
>  - madvise_free_huge_pmd();
> 
> Both clear pmd before setting up a modified version. Both under
> down_read(mmap_sem).
> 
> The migration path also would need to establish migration pmd atomically
> to make this work.

Pagetables updates are always atomic, the issue here is not the
atomicity nor the lock prefix, but just "not temporarily showing zero
pmds" if only holding the pmd_lock and mmap_sem for reading (i.e. not
hodling it for writing)

> 
> Once all these cases will be fixed, zap_pmd_range() would only be able to
> race with page fault if it called from MADV_DONTNEED.
> This case is not a problem.

Yes, this case is handled fine by pmd_trans_unstable and
pmd_none_or_trans_huge_or_clear_bad; it's a controlled race with an
undefined result for userland when it triggers. We just have to be
consistent with the invariants and not let the kernel get confused
about it, so no problem there.

> Andrea, does it sound reasonable to you?

Yes, pmdp_invalidate does exactly that. It never shows a zero pmd;
instead it does:

	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));

Running change_huge_pmd under mmap_sem for reading is a relatively
recent introduction; in the older THP code that could never happen,
since it always ran under mmap_sem for writing, and the code wasn't
updated to cover this race condition when mmap_sem started to be taken
for reading to arm NUMA hinting faults in task work.

madvise_free_huge_pmd, introduced with MADV_FREE, is also a newer
addition not present in the older THP code.

Whenever the mmap_sem is taken only for reading, the pmd shouldn't be
zeroed out at any point; instead the code should do what
split_huge_page->pmdp_invalidate does. It's not hard to atomically
update the pmd with a new value:

#ifndef __HAVE_ARCH_PMDP_INVALIDATE
void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
		     pmd_t *pmdp)
{
	pmd_t entry = *pmdp;
	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
}
#endif

set_pmd_at will do: the store is atomic as far as C is concerned (not
guaranteed by the C standard, but we always depend on gcc being smart
enough to emit an atomic store for all pgtable updates, nothing special
here).

Simply removing pmdp_huge_get_and_clear_full and calling set_pmd_at
directly should do the trick. The pmd can't change with the pmd_lock
held, so we just have to read it, update it, and overwrite it
atomically with set_pmd_at. It will actually speed up the code by
removing an unnecessary clear.
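
For illustration only, here is a minimal sketch of how the PMD-mapped
branch of set_pmd_migration_entry() could be written that way; the name
set_pmd_migration_entry_sketch is made up, and the cache flush and mmu
notifier calls of the posted patch are omitted for brevity:

static void set_pmd_migration_entry_sketch(struct page_vma_mapped_walk *pvmw,
					   struct page *page)
{
	struct vm_area_struct *vma = pvmw->vma;
	struct mm_struct *mm = vma->vm_mm;
	unsigned long address = pvmw->address;
	pmd_t pmdval = *pvmw->pmd;	/* stable: pmd_lock is held */
	swp_entry_t entry = make_migration_entry(page, pmd_write(pmdval));

	if (pmd_dirty(pmdval))
		set_page_dirty(page);
	/* one store replaces the pmd, no transient zero pmd is visible */
	set_pmd_at(mm, address, pvmw->pmd, swp_entry_to_pmd(entry));
	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	page_remove_rmap(page, true);
	put_page(page);
}

The dirty bit caveat below still applies to the window between reading
*pvmw->pmd and the set_pmd_at.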

The only reason for doing pmdp_huge_get_and_clear_full in the pte
cases is to avoid losing a dirty bit updated by the hardware. That is a
non-issue for anon memory, where all THP anon memory is always dirty,
as it can't be swapped natively and can never be freed and dropped
unless it's split first (at which point it is no longer a huge pmd).

The problem with the dirty bit happens in this sequence:

    pmd_t pmd = *pmdp; // dirty bit is not set in pmd
    // dirty bit is set in hardware in *pmdp
    set_pmd_at(..., pmd); // hardware update overwritten, dirty bit lost

pmdp_huge_get_and_clear_full prevents the above, so it's preferable,
but it's unusable without the mmap_sem held for writing.

tmpfs with THP should do the same, as the swap API isn't capable of
natively creating large chunks.

If we were to track the dirty bit, what we could do is introduce an
xchg-based pmdp_invalidate-like helper that, instead of setting the pmd
to zero, sets it to non-present or to a migration entry.
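
A rough sketch of such a helper, assuming x86-64 where pmd_t wraps a
single pmdval_t; pmdp_invalidate_xchg is a made-up name, not an
existing kernel interface:

static inline pmd_t pmdp_invalidate_xchg(struct vm_area_struct *vma,
					 unsigned long address, pmd_t *pmdp,
					 pmd_t newpmd)
{
	/*
	 * Single atomic exchange: no zero pmd is ever visible, and any
	 * dirty/accessed bit the hardware sets in the meantime comes
	 * back in the old value instead of being lost.
	 */
	pmd_t old = __pmd(xchg(&pmdp->pmd, pmd_val(newpmd)));

	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return old;
}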

In short, either we drop pmdp_huge_get_and_clear_full entirely and only
use set_pmd_at (unless we use it everywhere, as the old THP code did,
there's no point wasting CPU on those xchg operations, since other code
paths would eventually lose the dirty bit anyway), or we replace it
with a variant that can be called with only the mmap_sem for reading
plus the pmd_lock: one that doesn't clear the pmd but still prevents
the hardware from setting the dirty bit while we update it.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 08/14] mm: thp: enable thp migration in generic path
  2017-02-05 16:12   ` Zi Yan
@ 2017-02-14 20:13     ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-14 20:13 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: akpm, minchan, vbabka, mgorman, n-horiguchi, khandual, zi.yan,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 15843 bytes --]

Hi Kirill,

I just wonder whether you have had time to take a look at this
patch, since it is based on your page_vma_mapped_walk()
function and I also changed your page_vma_mapped_walk()
code to handle pmd migration entries.

Thanks.


On 5 Feb 2017, at 10:12, Zi Yan wrote:

> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>
> This patch adds thp migration's core code, including conversions
> between a PMD entry and a swap entry, setting PMD migration entry,
> removing PMD migration entry, and waiting on PMD migration entries.
>
> This patch makes it possible to support thp migration.
> If you fail to allocate a destination page as a thp, you just split
> the source thp as we do now, and then enter the normal page migration.
> If you succeed to allocate destination thp, you enter thp migration.
> Subsequent patches actually enable thp migration for each caller of
> page migration by allowing its get_new_page() callback to
> allocate thps.
>
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>
> ChangeLog v2 -> v3:
> - use page_vma_mapped_walk()
>
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> ---
>  arch/x86/include/asm/pgtable_64.h |   2 +
>  include/linux/swapops.h           |  70 +++++++++++++++++-
>  mm/huge_memory.c                  | 151 ++++++++++++++++++++++++++++++++++----
>  mm/migrate.c                      |  29 +++++++-
>  mm/page_vma_mapped.c              |  13 +++-
>  mm/pgtable-generic.c              |   3 +-
>  mm/rmap.c                         |  14 +++-
>  7 files changed, 259 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 768eccc85553..0277f7755f3a 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -182,7 +182,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>  					 ((type) << (SWP_TYPE_FIRST_BIT)) \
>  					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
> +#define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
>  #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
> +#define __swp_entry_to_pmd(x)		((pmd_t) { .pmd = (x).val })
>
>  extern int kern_addr_valid(unsigned long addr);
>  extern void cleanup_highmap(void);
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 5c3a5f3e7eec..6625bea13869 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -103,7 +103,8 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
>  #ifdef CONFIG_MIGRATION
>  static inline swp_entry_t make_migration_entry(struct page *page, int write)
>  {
> -	BUG_ON(!PageLocked(page));
> +	BUG_ON(!PageLocked(compound_head(page)));
> +
>  	return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
>  			page_to_pfn(page));
>  }
> @@ -126,7 +127,7 @@ static inline struct page *migration_entry_to_page(swp_entry_t entry)
>  	 * Any use of migration entries may only occur while the
>  	 * corresponding page is locked
>  	 */
> -	BUG_ON(!PageLocked(p));
> +	BUG_ON(!PageLocked(compound_head(p)));
>  	return p;
>  }
>
> @@ -163,6 +164,71 @@ static inline int is_write_migration_entry(swp_entry_t entry)
>
>  #endif
>
> +struct page_vma_mapped_walk;
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> +		struct page *page);
> +
> +extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
> +		struct page *new);
> +
> +extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> +	swp_entry_t arch_entry;
> +
> +	arch_entry = __pmd_to_swp_entry(pmd);
> +	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> +	swp_entry_t arch_entry;
> +
> +	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
> +	return __swp_entry_to_pmd(arch_entry);
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> +	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> +}
> +#else
> +static inline void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> +		struct page *page)
> +{
> +	BUILD_BUG();
> +}
> +
> +static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
> +		struct page *new)
> +{
> +	BUILD_BUG();
> +	return 0;
> +}
> +
> +static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> +	BUILD_BUG();
> +	return swp_entry(0, 0);
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> +	BUILD_BUG();
> +	return (pmd_t){ 0 };
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #ifdef CONFIG_MEMORY_FAILURE
>
>  extern atomic_long_t num_poisoned_pages __read_mostly;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6893c47428b6..fd54bbdc16cf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1613,20 +1613,51 @@ int __zap_huge_pmd_locked(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		atomic_long_dec(&tlb->mm->nr_ptes);
>  		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
>  	} else {
> -		struct page *page = pmd_page(orig_pmd);
> -		page_remove_rmap(page, true);
> -		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> -		VM_BUG_ON_PAGE(!PageHead(page), page);
> -		if (PageAnon(page)) {
> -			pgtable_t pgtable;
> -			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> -			pte_free(tlb->mm, pgtable);
> -			atomic_long_dec(&tlb->mm->nr_ptes);
> -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +		struct page *page;
> +		int migration = 0;
> +
> +		if (!is_pmd_migration_entry(orig_pmd)) {
> +			page = pmd_page(orig_pmd);
> +			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> +			VM_BUG_ON_PAGE(!PageHead(page), page);
> +			page_remove_rmap(page, true);
> +			if (PageAnon(page)) {
> +				pgtable_t pgtable;
> +
> +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> +								      pmd);
> +				pte_free(tlb->mm, pgtable);
> +				atomic_long_dec(&tlb->mm->nr_ptes);
> +				add_mm_counter(tlb->mm, MM_ANONPAGES,
> +					       -HPAGE_PMD_NR);
> +			} else {
> +				if (arch_needs_pgtable_deposit())
> +					zap_deposited_table(tlb->mm, pmd);
> +				add_mm_counter(tlb->mm, MM_FILEPAGES,
> +					       -HPAGE_PMD_NR);
> +			}
>  		} else {
> -			if (arch_needs_pgtable_deposit())
> -				zap_deposited_table(tlb->mm, pmd);
> -			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> +			swp_entry_t entry;
> +
> +			entry = pmd_to_swp_entry(orig_pmd);
> +			page = pfn_to_page(swp_offset(entry));
> +			if (PageAnon(page)) {
> +				pgtable_t pgtable;
> +
> +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> +								      pmd);
> +				pte_free(tlb->mm, pgtable);
> +				atomic_long_dec(&tlb->mm->nr_ptes);
> +				add_mm_counter(tlb->mm, MM_ANONPAGES,
> +					       -HPAGE_PMD_NR);
> +			} else {
> +				if (arch_needs_pgtable_deposit())
> +					zap_deposited_table(tlb->mm, pmd);
> +				add_mm_counter(tlb->mm, MM_FILEPAGES,
> +					       -HPAGE_PMD_NR);
> +			}
> +			free_swap_and_cache(entry); /* warning on failure? */
> +			migration = 1;
>  		}
>  		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
>  	}
> @@ -2634,3 +2665,97 @@ static int __init split_huge_pages_debugfs(void)
>  }
>  late_initcall(split_huge_pages_debugfs);
>  #endif
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> +		struct page *page)
> +{
> +	struct vm_area_struct *vma = pvmw->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long address = pvmw->address;
> +	pmd_t pmdval;
> +	swp_entry_t entry;
> +
> +	if (pvmw->pmd && !pvmw->pte) {
> +		pmd_t pmdswp;
> +
> +		mmu_notifier_invalidate_range_start(mm, address,
> +				address + HPAGE_PMD_SIZE);
> +
> +		flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> +		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
> +		if (pmd_dirty(pmdval))
> +			set_page_dirty(page);
> +		entry = make_migration_entry(page, pmd_write(pmdval));
> +		pmdswp = swp_entry_to_pmd(entry);
> +		set_pmd_at(mm, address, pvmw->pmd, pmdswp);
> +		page_remove_rmap(page, true);
> +		put_page(page);
> +
> +		mmu_notifier_invalidate_range_end(mm, address,
> +				address + HPAGE_PMD_SIZE);
> +	} else { /* pte-mapped thp */
> +		pte_t pteval;
> +		struct page *subpage = page - page_to_pfn(page) + pte_pfn(*pvmw->pte);
> +		pte_t swp_pte;
> +
> +		pteval = ptep_clear_flush(vma, address, pvmw->pte);
> +		if (pte_dirty(pteval))
> +			set_page_dirty(subpage);
> +		entry = make_migration_entry(subpage, pte_write(pteval));
> +		swp_pte = swp_entry_to_pte(entry);
> +		set_pte_at(mm, address, pvmw->pte, swp_pte);
> +		page_remove_rmap(subpage, false);
> +		put_page(subpage);
> +	}
> +}
> +
> +void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> +{
> +	struct vm_area_struct *vma = pvmw->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long address = pvmw->address;
> +	swp_entry_t entry;
> +
> +	/* PMD-mapped THP  */
> +	if (pvmw->pmd && !pvmw->pte) {
> +		unsigned long mmun_start = address & HPAGE_PMD_MASK;
> +		unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> +		pmd_t pmde;
> +
> +		entry = pmd_to_swp_entry(*pvmw->pmd);
> +		get_page(new);
> +		pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
> +		if (is_write_migration_entry(entry))
> +			pmde = maybe_pmd_mkwrite(pmde, vma);
> +
> +		flush_cache_range(vma, mmun_start, mmun_end);
> +		page_add_anon_rmap(new, vma, mmun_start, true);
> +		pmdp_huge_clear_flush_notify(vma, mmun_start, pvmw->pmd);
> +		set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
> +		flush_tlb_range(vma, mmun_start, mmun_end);
> +		if (vma->vm_flags & VM_LOCKED)
> +			mlock_vma_page(new);
> +		update_mmu_cache_pmd(vma, address, pvmw->pmd);
> +
> +	} else { /* pte-mapped thp */
> +		pte_t pte;
> +		pte_t *ptep = pvmw->pte;
> +
> +		entry = pte_to_swp_entry(*pvmw->pte);
> +		get_page(new);
> +		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
> +		if (pte_swp_soft_dirty(*pvmw->pte))
> +			pte = pte_mksoft_dirty(pte);
> +		if (is_write_migration_entry(entry))
> +			pte = maybe_mkwrite(pte, vma);
> +		flush_dcache_page(new);
> +		set_pte_at(mm, address, ptep, pte);
> +		if (PageAnon(new))
> +			page_add_anon_rmap(new, vma, address, false);
> +		else
> +			page_add_file_rmap(new, false);
> +		update_mmu_cache(vma, address, ptep);
> +	}
> +}
> +#endif
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 95e8580dc902..84181a3668c6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -214,6 +214,12 @@ static int remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  		new = page - pvmw.page->index +
>  			linear_page_index(vma, pvmw.address);
>
> +		/* PMD-mapped THP migration entry */
> +		if (!PageHuge(page) && PageTransCompound(page)) {
> +			remove_migration_pmd(&pvmw, new);
> +			continue;
> +		}
> +
>  		get_page(new);
>  		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
>  		if (pte_swp_soft_dirty(*pvmw.pte))
> @@ -327,6 +333,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
>  	__migration_entry_wait(mm, pte, ptl);
>  }
>
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
> +{
> +	spinlock_t *ptl;
> +	struct page *page;
> +
> +	ptl = pmd_lock(mm, pmd);
> +	if (!is_pmd_migration_entry(*pmd))
> +		goto unlock;
> +	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
> +	if (!get_page_unless_zero(page))
> +		goto unlock;
> +	spin_unlock(ptl);
> +	wait_on_page_locked(page);
> +	put_page(page);
> +	return;
> +unlock:
> +	spin_unlock(ptl);
> +}
> +#endif
> +
>  #ifdef CONFIG_BLOCK
>  /* Returns true if all buffers are successfully locked */
>  static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> @@ -1085,7 +1112,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>  		goto out;
>  	}
>
> -	if (unlikely(PageTransHuge(page))) {
> +	if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
>  		lock_page(page);
>  		rc = split_huge_page(page);
>  		unlock_page(page);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index a23001a22c15..0ed3aee62d50 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -137,16 +137,23 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	if (!pud_present(*pud))
>  		return false;
>  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> -	if (pmd_trans_huge(*pvmw->pmd)) {
> +	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
>  		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> -		if (!pmd_present(*pvmw->pmd))
> -			return not_found(pvmw);
>  		if (likely(pmd_trans_huge(*pvmw->pmd))) {
>  			if (pvmw->flags & PVMW_MIGRATION)
>  				return not_found(pvmw);
>  			if (pmd_page(*pvmw->pmd) != page)
>  				return not_found(pvmw);
>  			return true;
> +		} else if (!pmd_present(*pvmw->pmd)) {
> +			if (unlikely(is_migration_entry(pmd_to_swp_entry(*pvmw->pmd)))) {
> +				swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
> +
> +				if (migration_entry_to_page(entry) != page)
> +					return not_found(pvmw);
> +				return true;
> +			}
> +			return not_found(pvmw);
>  		} else {
>  			/* THP pmd was split under us: handle on pte level */
>  			spin_unlock(pvmw->ptl);
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4ed5908c65b0..9d550a8a0c71 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
>  {
>  	pmd_t pmd;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> +	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> +		  !pmd_devmap(*pmdp));
>  	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
>  	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>  	return pmd;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 16789b936e3a..b33216668fa4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1304,6 +1304,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	struct rmap_private *rp = arg;
>  	enum ttu_flags flags = rp->flags;
>
> +
>  	/* munlock has nothing to gain from examining un-locked vmas */
>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>  		return SWAP_AGAIN;
> @@ -1314,12 +1315,19 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	}
>
>  	while (page_vma_mapped_walk(&pvmw)) {
> +		/* THP migration */
> +		if (flags & TTU_MIGRATION) {
> +			if (!PageHuge(page) && PageTransCompound(page)) {
> +				set_pmd_migration_entry(&pvmw, page);
> +				continue;
> +			}
> +		}
> +		/* Unexpected PMD-mapped THP */
> +		VM_BUG_ON_PAGE(!pvmw.pte, page);
> +
>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>  		address = pvmw.address;
>
> -		/* Unexpected PMD-mapped THP? */
> -		VM_BUG_ON_PAGE(!pvmw.pte, page);
> -
>  		/*
>  		 * If the page is mlock()d, we cannot swap it out.
>  		 * If it's recently referenced (perhaps page_referenced
> -- 
> 2.11.0


--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 00/14] mm: page migration enhancement for thp
  2017-02-05 16:12 ` Zi Yan
@ 2017-02-23 16:12   ` Zi Yan
  -1 siblings, 0 replies; 87+ messages in thread
From: Zi Yan @ 2017-02-23 16:12 UTC (permalink / raw)
  To: kirill.shutemov, Andrea Arcangeli, Michal Hocko
  Cc: linux-kernel, linux-mm, minchan, vbabka, mgorman, n-horiguchi,
	khandual, akpm

[-- Attachment #1: Type: text/plain, Size: 4876 bytes --]

Ping.

I just want to get comments on the THP migration part (Patches 4-14).
If they look OK, I can rebase the THP migration part on
mmotm-2017-02-22-16-28 and send them out for merging.

Thanks.

Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> The patches are rebased on mmotm-2017-02-01-15-35 with feedbacks from 
> Naoya Horiguchi's v2 patches.
> 
> I fix a bug in zap_pmd_range() and include the fixes in Patches 1-3.
> The racy check in zap_pmd_range() can miss pmd_protnone and pmd_migration_entry,
> which leads to PTE page table not freed.
> 
> In Patch 4, I move _PAGE_SWP_SOFT_DIRTY to bit 1. Because bit 6 (used in v2)
> can be set by some CPUs by mistake and the new swap entry format does not use
> bit 1-4.
> 
> I also adjust two core migration functions, set_pmd_migration_entry() and
> remove_migration_pmd(), to use Kirill A. Shutemov's page_vma_mapped_walk()
> function. Patch 8 needs Kirill's comments, since I also add changes
> to his page_vma_mapped_walk() function with pmd_migration_entry handling.
> 
> In Patch 8, I replace pmdp_huge_get_and_clear() with pmdp_huge_clear_flush()
> in set_pmd_migration_entry() to avoid data corruption after page migration.
> 
> In Patch 9, I include is_pmd_migration_entry() in pmd_none_or_trans_huge_or_clear_bad().
> Otherwise, a pmd_migration_entry is treated as pmd_bad and cleared, which
> leads to deposited PTE page table not freed.
> 
> I personally use this patchset with my customized kernel to test frequent
> page migrations by replacing page reclaim with page migration.
> The bugs fixed in Patches 1-3 and 8 was discovered while I am testing my kernel.
> I did a 16-hour stress test that has ~7 billion total page migrations.
> No error or data corruption was found. 
> 
> 
> General description 
> ===========================================
> 
> This patchset enhances page migration functionality to handle thp migration
> for various page migration's callers:
>  - mbind(2)
>  - move_pages(2)
>  - migrate_pages(2)
>  - cgroup/cpuset migration
>  - memory hotremove
>  - soft offline
> 
> The main benefit is that we can avoid unnecessary thp splits, which helps us
> avoid performance decrease when your applications handles NUMA optimization on
> their own.
> 
> The implementation is similar to that of normal page migration, the key point
> is that we modify a pmd to a pmd migration entry in swap-entry like format.
> 
> 
> Any comments or advices are welcomed.
> 
> Best Regards,
> Yan Zi
> 
> Naoya Horiguchi (11):
>   mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
>   mm: mempolicy: add queue_pages_node_check()
>   mm: thp: introduce separate TTU flag for thp freezing
>   mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
>   mm: thp: enable thp migration in generic path
>   mm: thp: check pmd migration entry in common path
>   mm: soft-dirty: keep soft-dirty bits over thp migration
>   mm: hwpoison: soft offline supports thp migration
>   mm: mempolicy: mbind and migrate_pages support thp migration
>   mm: migrate: move_pages() supports thp migration
>   mm: memory_hotplug: memory hotremove supports thp migration
> 
> Zi Yan (3):
>   mm: thp: make __split_huge_pmd_locked visible.
>   mm: thp: create new __zap_huge_pmd_locked function.
>   mm: use pmd lock instead of racy checks in zap_pmd_range()
> 
>  arch/x86/Kconfig                     |   4 +
>  arch/x86/include/asm/pgtable.h       |  17 ++
>  arch/x86/include/asm/pgtable_64.h    |   2 +
>  arch/x86/include/asm/pgtable_types.h |  10 +-
>  arch/x86/mm/gup.c                    |   4 +-
>  fs/proc/task_mmu.c                   |  37 +++--
>  include/asm-generic/pgtable.h        | 105 ++++--------
>  include/linux/huge_mm.h              |  36 ++++-
>  include/linux/rmap.h                 |   1 +
>  include/linux/swapops.h              | 146 ++++++++++++++++-
>  mm/Kconfig                           |   3 +
>  mm/gup.c                             |  20 ++-
>  mm/huge_memory.c                     | 302 +++++++++++++++++++++++++++++------
>  mm/madvise.c                         |   2 +
>  mm/memcontrol.c                      |   2 +
>  mm/memory-failure.c                  |  31 ++--
>  mm/memory.c                          |  33 ++--
>  mm/memory_hotplug.c                  |  17 +-
>  mm/mempolicy.c                       | 124 ++++++++++----
>  mm/migrate.c                         |  66 ++++++--
>  mm/mprotect.c                        |   6 +-
>  mm/mremap.c                          |   2 +-
>  mm/page_vma_mapped.c                 |  13 +-
>  mm/pagewalk.c                        |   2 +
>  mm/pgtable-generic.c                 |   3 +-
>  mm/rmap.c                            |  21 ++-
>  26 files changed, 770 insertions(+), 239 deletions(-)
> 

-- 
Best Regards,
Yan Zi


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 537 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2017-02-23 16:12 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-05 16:12 [PATCH v3 00/14] mm: page migration enhancement for thp Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 01/14] mm: thp: make __split_huge_pmd_locked visible Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-06  6:12   ` Naoya Horiguchi
2017-02-06  6:12     ` Naoya Horiguchi
2017-02-06 12:10     ` Zi Yan
2017-02-06 15:02   ` Matthew Wilcox
2017-02-06 15:02     ` Matthew Wilcox
2017-02-06 15:03     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 02/14] mm: thp: create new __zap_huge_pmd_locked function Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range() Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-06  4:02   ` Hillf Danton
2017-02-06  4:02     ` Hillf Danton
2017-02-06  4:14     ` Zi Yan
2017-02-06  4:14       ` Zi Yan
2017-02-06  7:43   ` Naoya Horiguchi
2017-02-06  7:43     ` Naoya Horiguchi
2017-02-06 13:02     ` Zi Yan
2017-02-06 23:22       ` Naoya Horiguchi
2017-02-06 23:22         ` Naoya Horiguchi
2017-02-06 16:07   ` Kirill A. Shutemov
2017-02-06 16:07     ` Kirill A. Shutemov
2017-02-06 16:32     ` Zi Yan
2017-02-06 17:35       ` Kirill A. Shutemov
2017-02-06 17:35         ` Kirill A. Shutemov
2017-02-07 13:55     ` Aneesh Kumar K.V
2017-02-07 13:55       ` Aneesh Kumar K.V
2017-02-07 14:12       ` Zi Yan
2017-02-07 14:19   ` Kirill A. Shutemov
2017-02-07 14:19     ` Kirill A. Shutemov
2017-02-07 15:11     ` Zi Yan
2017-02-07 15:11       ` Zi Yan
2017-02-07 16:37       ` Kirill A. Shutemov
2017-02-07 16:37         ` Kirill A. Shutemov
2017-02-07 17:14         ` Zi Yan
2017-02-07 17:14           ` Zi Yan
2017-02-07 17:45           ` Kirill A. Shutemov
2017-02-07 17:45             ` Kirill A. Shutemov
2017-02-13  0:25             ` Zi Yan
2017-02-13  0:25               ` Zi Yan
2017-02-13 10:59               ` Kirill A. Shutemov
2017-02-13 10:59                 ` Kirill A. Shutemov
2017-02-13 14:40                 ` Andrea Arcangeli
2017-02-13 14:40                   ` Andrea Arcangeli
2017-02-05 16:12 ` [PATCH v3 04/14] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-09  9:14   ` Naoya Horiguchi
2017-02-09  9:14     ` Naoya Horiguchi
2017-02-09 15:07     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 05/14] mm: mempolicy: add queue_pages_node_check() Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 06/14] mm: thp: introduce separate TTU flag for thp freezing Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 07/14] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 08/14] mm: thp: enable thp migration in generic path Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-09  9:15   ` Naoya Horiguchi
2017-02-09  9:15     ` Naoya Horiguchi
2017-02-09 15:17     ` Zi Yan
2017-02-09 23:04       ` Naoya Horiguchi
2017-02-09 23:04         ` Naoya Horiguchi
2017-02-14 20:13   ` Zi Yan
2017-02-14 20:13     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 09/14] mm: thp: check pmd migration entry in common path Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-09  9:16   ` Naoya Horiguchi
2017-02-09  9:16     ` Naoya Horiguchi
2017-02-09 17:36     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 10/14] mm: soft-dirty: keep soft-dirty bits over thp migration Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 11/14] mm: hwpoison: soft offline supports " Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 12/14] mm: mempolicy: mbind and migrate_pages support " Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 13/14] mm: migrate: move_pages() supports " Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-09  9:16   ` Naoya Horiguchi
2017-02-09  9:16     ` Naoya Horiguchi
2017-02-09 17:37     ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 14/14] mm: memory_hotplug: memory hotremove " Zi Yan
2017-02-05 16:12   ` Zi Yan
2017-02-23 16:12 ` [PATCH v3 00/14] mm: page migration enhancement for thp Zi Yan
2017-02-23 16:12   ` Zi Yan
