linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/12] mm: page migration enhancement for thp
@ 2016-11-07 23:31 Naoya Horiguchi
  2016-11-07 23:31 ` [PATCH v2 01/12] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6 Naoya Horiguchi
                   ` (14 more replies)
  0 siblings, 15 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

Hi everyone,

I've updated the thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
based on the feedback on v1.

General description (no change since ver.1)
===========================================

This patchset enhances the page migration functionality to handle thp
migration for the various callers of page migration:
 - mbind(2)
 - move_pages(2)
 - migrate_pages(2)
 - cgroup/cpuset migration
 - memory hotremove
 - soft offline

The main benefit is that we can avoid unnecessary thp splits, which helps
avoid a performance decrease when applications handle NUMA optimization on
their own.

The implementation is similar to that of normal page migration; the key point
is that we convert a pmd into a pmd migration entry, which uses a
swap-entry-like format.
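
For reference, below is a minimal user-space sketch (not part of this series)
that exercises one of these paths, move_pages(2), on a thp-backed region. It
assumes a 2-node NUMA machine with libnuma installed and an x86_64 2MB thp
size; node 1 as the destination is an arbitrary choice.

#include <numaif.h>		/* move_pages(2), MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)	/* 2MB pmd-mapped huge page */

int main(void)
{
	/* 2MB-aligned anonymous buffer, hinted to be backed by a thp */
	char *p = aligned_alloc(HPAGE_SIZE, HPAGE_SIZE);
	void *pages[1] = { p };
	int nodes[1] = { 1 };		/* assumed destination node */
	int status[1] = { -1 };

	if (!p)
		return 1;
	madvise(p, HPAGE_SIZE, MADV_HUGEPAGE);
	memset(p, 1, HPAGE_SIZE);	/* fault it in */

	/*
	 * Without thp migration the kernel splits the thp before moving the
	 * base pages; with this series the 2MB mapping can be migrated whole.
	 */
	if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE))
		perror("move_pages");
	else
		printf("status[0] = %d\n", status[0]);
	return 0;
}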

Changes / Notes
===============

- pmd_present() in x86 checks the _PAGE_PRESENT, _PAGE_PROTNONE and _PAGE_PSE
  bits together, which makes implementing thp migration a bit hard because
  the _PAGE_PSE bit is currently used by soft-dirty in the swap-entry format.
  I was advised to drop the _PAGE_PSE check from pmd_present(), but I can't
  justify that change, so I keep the check in this version. Instead, my
  approach is to move _PAGE_SWP_SOFT_DIRTY to bit 6 (currently unused) and to
  reserve bit 7 for pmd non-present cases.

- this patchset still covers only x86_64. Zi Yan posted a patch for ppc64,
  and it seems to have been favorably received, so that's fine. However, a
  minor suggestion from Aneesh is still unresolved, so I don't include the
  ppc64 patch in this set, expecting that it will be updated/reposted.

- pte-mapped thp and doubly-mapped thp were not supported in v1, but this
  version should work for those kinds of thp.

- thp page cache is not tested yet; it's at the head of my todo list for a
  future version.

Any comments or advice are welcome.

Thanks,
Naoya Horiguchi


* [PATCH v2 01/12] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-10 23:29   ` Dave Hansen
  2016-11-07 23:31 ` [PATCH v2 02/12] mm: mempolicy: add queue_pages_node_check() Naoya Horiguchi
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid a false
negative return when it races with thp split (during which _PAGE_PRESENT is
temporarily cleared). I don't think that dropping the _PAGE_PSE check from
pmd_present() works well, because it can hurt the TLB-handling optimization
in thp split.
In the current kernel, bit 6 is unused in the non-present format because
nonlinear file mapping is obsolete, so let's move _PAGE_SWP_SOFT_DIRTY to
that bit. Bit 7 is kept reserved (always clear), so please don't use it for
any other purpose.
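
For illustration, my reading of the resulting bit usage in the non-present
(swap) pte/pmd format is sketched below; the bit numbers come from the
existing x86 definitions (_PAGE_BIT_DIRTY == 6, _PAGE_BIT_PSE == 7):

/*
 * Sketch of the x86_64 swap-format bit usage after this patch:
 *
 *   bit 6 (_PAGE_DIRTY): _PAGE_SWP_SOFT_DIRTY -- soft-dirty mark for a
 *                        non-present (swapped-out or migrated) entry
 *   bit 7 (_PAGE_PSE):   reserved, kept clear so that pmd_present(),
 *                        which also tests _PAGE_PSE, never sees a
 *                        non-present pmd (e.g. a pmd migration entry)
 *                        as present
 */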

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/x86/include/asm/pgtable_types.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_types.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_types.h
index 8b4de22..89e88f1 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_types.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_types.h
@@ -98,14 +98,14 @@
  * Tracking soft dirty bit when a page goes to a swap is tricky.
  * We need a bit which can be stored in pte _and_ not conflict
  * with swap entry format. On x86 bits 6 and 7 are *not* involved
- * into swap entry computation, but bit 6 is used for nonlinear
- * file mapping, so we borrow bit 7 for soft dirty tracking.
+ * into swap entry computation, but bit 7 is used for thp migration,
+ * so we borrow bit 6 for soft dirty tracking.
  *
  * Please note that this bit must be treated as swap dirty page
- * mark if and only if the PTE has present bit clear!
+ * mark if and only if the PTE/PMD has present bit clear!
  */
 #ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY	_PAGE_PSE
+#define _PAGE_SWP_SOFT_DIRTY	_PAGE_DIRTY
 #else
 #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
-- 
2.7.0


* [PATCH v2 02/12] mm: mempolicy: add queue_pages_node_check()
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
  2016-11-07 23:31 ` [PATCH v2 01/12] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6 Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-07 23:31 ` [PATCH v2 03/12] mm: thp: introduce separate TTU flag for thp freezing Naoya Horiguchi
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

Introduce a separate check routine for the MPOL_MF_INVERT flag. This patch
is just a cleanup; there is no behavioral change.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mempolicy.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
index 6d3639e..77d0668 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
@@ -477,6 +477,15 @@ struct queue_pages {
 	struct vm_area_struct *prev;
 };
 
+static inline bool queue_pages_node_check(struct page *page,
+					struct queue_pages *qp)
+{
+	int nid = page_to_nid(page);
+	unsigned long flags = qp->flags;
+
+	return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
+}
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
@@ -530,8 +539,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		 */
 		if (PageReserved(page))
 			continue;
-		nid = page_to_nid(page);
-		if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		if (queue_pages_node_check(page, qp))
 			continue;
 		if (PageTransCompound(page)) {
 			get_page(page);
@@ -563,7 +571,6 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 #ifdef CONFIG_HUGETLB_PAGE
 	struct queue_pages *qp = walk->private;
 	unsigned long flags = qp->flags;
-	int nid;
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t entry;
@@ -573,8 +580,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 	if (!pte_present(entry))
 		goto unlock;
 	page = pte_page(entry);
-	nid = page_to_nid(page);
-	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+	if (queue_pages_node_check(page, qp))
 		goto unlock;
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
-- 
2.7.0


* [PATCH v2 03/12] mm: thp: introduce separate TTU flag for thp freezing
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
  2016-11-07 23:31 ` [PATCH v2 01/12] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6 Naoya Horiguchi
  2016-11-07 23:31 ` [PATCH v2 02/12] mm: mempolicy: add queue_pages_node_check() Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-10  8:31   ` Anshuman Khandual
  2016-11-11 11:16   ` Kirill A. Shutemov
  2016-11-07 23:31 ` [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Naoya Horiguchi
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

TTU_MIGRATION is used to convert ptes into migration entries until thp split
completes. This behavior conflicts with the thp migration added in later
patches, so let's introduce a new TTU flag specifically for freezing.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/rmap.h | 1 +
 mm/huge_memory.c     | 2 +-
 mm/rmap.c            | 8 +++++---
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/rmap.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/rmap.h
index b46bb56..a2fa425 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/rmap.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/rmap.h
@@ -87,6 +87,7 @@ enum ttu_flags {
 	TTU_MUNLOCK = 4,		/* munlock mode */
 	TTU_LZFREE = 8,			/* lazy free mode */
 	TTU_SPLIT_HUGE_PMD = 16,	/* split huge PMD if any */
+	TTU_SPLIT_FREEZE = 32,		/* freeze pte under splitting thp */
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
index 2d1d6bb..0509d17 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
@@ -1794,7 +1794,7 @@ static void freeze_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
 	if (PageAnon(page))
-		ttu_flags |= TTU_MIGRATION;
+		ttu_flags |= TTU_SPLIT_FREEZE;
 
 	/* We only need TTU_SPLIT_HUGE_PMD once */
 	ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/rmap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/rmap.c
index 1ef3640..a4be307 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/rmap.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/rmap.c
@@ -1449,7 +1449,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
-				flags & TTU_MIGRATION, page);
+				flags & TTU_SPLIT_FREEZE, page);
 		/* check if we have anything to do after split */
 		if (page_mapcount(page) == 0)
 			goto out;
@@ -1527,7 +1527,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 * will take care of the rest.
 		 */
 		dec_mm_counter(mm, mm_counter(page));
-	} else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION)) {
+	} else if (IS_ENABLED(CONFIG_MIGRATION) &&
+		   (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
 		swp_entry_t entry;
 		pte_t swp_pte;
 		/*
@@ -1649,7 +1650,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 	 * locking requirements of exec(), migration skips
 	 * temporary VMAs until after exec() completes.
 	 */
-	if ((flags & TTU_MIGRATION) && !PageKsm(page) && PageAnon(page))
+	if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
+	    && !PageKsm(page) && PageAnon(page))
 		rwc.invalid_vma = invalid_migration_vma;
 
 	if (flags & TTU_RMAP_LOCKED)
-- 
2.7.0


* [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2016-11-07 23:31 ` [PATCH v2 03/12] mm: thp: introduce separate TTU flag for thp freezing Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-11 11:18   ` Kirill A. Shutemov
  2016-11-28 14:21   ` Michal Hocko
  2016-11-07 23:31 ` [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration Naoya Horiguchi
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit the thp migration
functionality to x86_64, which should be safer as a first step.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
v1 -> v2:
- fixed config name in subject and patch description
---
 arch/x86/Kconfig        |  4 ++++
 include/linux/huge_mm.h | 14 ++++++++++++++
 mm/Kconfig              |  3 +++
 3 files changed, 21 insertions(+)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/Kconfig v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/Kconfig
index 19d237b..63b2f24 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/Kconfig
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/Kconfig
@@ -2246,6 +2246,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	def_bool y
 	depends on X86_64 && HUGETLB_PAGE && MIGRATION
 
+config ARCH_ENABLE_THP_MIGRATION
+	def_bool y
+	depends on X86_64 && TRANSPARENT_HUGEPAGE && MIGRATION
+
 menu "Power management and ACPI options"
 
 config ARCH_HIBERNATION_HEADER
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
index 31f2c32..fcbca51 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
@@ -161,6 +161,15 @@ void mm_put_huge_zero_page(struct mm_struct *mm);
 
 #define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
 
+static inline bool thp_migration_supported(void)
+{
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+	return true;
+#else
+	return false;
+#endif
+}
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -232,6 +241,11 @@ static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
 {
 	return NULL;
 }
+
+static inline bool thp_migration_supported(void)
+{
+	return false;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/Kconfig v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/Kconfig
index be0ee11..1965310 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/Kconfig
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/Kconfig
@@ -289,6 +289,9 @@ config MIGRATION
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
 
+config ARCH_ENABLE_THP_MIGRATION
+	bool
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
-- 
2.7.0


* [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2016-11-07 23:31 ` [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-08  3:05   ` kbuild test robot
                     ` (4 more replies)
  2016-11-07 23:31 ` [PATCH v2 06/12] mm: thp: enable thp migration in generic path Naoya Horiguchi
                   ` (9 subsequent siblings)
  14 siblings, 5 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

This patch prepares thp migration's core code. This code is not exercised
until unmap_and_move() stops unconditionally splitting thp and get_new_page()
starts to allocate destination thps.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1 -> v2:
- support pte-mapped thp, doubly-mapped thp
---
 arch/x86/include/asm/pgtable_64.h |   2 +
 include/linux/swapops.h           |  61 +++++++++++++++
 mm/huge_memory.c                  | 154 ++++++++++++++++++++++++++++++++++++++
 mm/migrate.c                      |  44 ++++++++++-
 mm/pgtable-generic.c              |   3 +-
 5 files changed, 262 insertions(+), 2 deletions(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
index 1cc82ec..3a1b48e 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
@@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 					 ((type) << (SWP_TYPE_FIRST_BIT)) \
 					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
+#define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
+#define __swp_entry_to_pmd(x)		((pmd_t) { .pmd = (x).val })
 
 extern int kern_addr_valid(unsigned long addr);
 extern void cleanup_highmap(void);
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
index 5c3a5f3..b6b22a2 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
@@ -163,6 +163,67 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 
 #endif
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page *page,
+		struct vm_area_struct *vma, unsigned long address);
+
+extern int remove_migration_pmd(struct page *new, pmd_t *pmd,
+		struct vm_area_struct *vma, unsigned long addr, void *old);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+	swp_entry_t arch_entry;
+
+	arch_entry = __pmd_to_swp_entry(pmd);
+	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+	swp_entry_t arch_entry;
+
+	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
+	return __swp_entry_to_pmd(arch_entry);
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
+}
+#else
+static inline void set_pmd_migration_entry(struct page *page,
+			struct vm_area_struct *vma, unsigned long address)
+{
+}
+
+static inline int remove_migration_pmd(struct page *new, pmd_t *pmd,
+		struct vm_area_struct *vma, unsigned long addr, void *old)
+{
+	return 0;
+}
+
+static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+	return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+	pmd_t pmd = {};
+
+	return pmd;
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_MEMORY_FAILURE
 
 extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
index 0509d17..b3022b3 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
@@ -2310,3 +2310,157 @@ static int __init split_huge_pages_debugfs(void)
 }
 late_initcall(split_huge_pages_debugfs);
 #endif
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
+				unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pmd_t pmdval;
+	swp_entry_t entry;
+	spinlock_t *ptl;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		return;
+	pud = pud_offset(pgd, addr);
+	if (!pud_present(*pud))
+		return;
+	pmd = pmd_offset(pud, addr);
+	pmdval = *pmd;
+	barrier();
+	if (!pmd_present(pmdval))
+		return;
+
+	mmu_notifier_invalidate_range_start(mm, addr, addr + HPAGE_PMD_SIZE);
+	if (pmd_trans_huge(pmdval)) {
+		pmd_t pmdswp;
+
+		ptl = pmd_lock(mm, pmd);
+		if (!pmd_present(*pmd))
+			goto unlock_pmd;
+		if (unlikely(!pmd_trans_huge(*pmd)))
+			goto unlock_pmd;
+		if (pmd_page(*pmd) != page)
+			goto unlock_pmd;
+
+		pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
+		if (pmd_dirty(pmdval))
+			set_page_dirty(page);
+		entry = make_migration_entry(page, pmd_write(pmdval));
+		pmdswp = swp_entry_to_pmd(entry);
+		pmdswp = pmd_mkhuge(pmdswp);
+		set_pmd_at(mm, addr, pmd, pmdswp);
+		page_remove_rmap(page, true);
+		put_page(page);
+unlock_pmd:
+		spin_unlock(ptl);
+	} else { /* pte-mapped thp */
+		pte_t *pte;
+		pte_t pteval;
+		struct page *tmp = compound_head(page);
+		unsigned long address = addr & HPAGE_PMD_MASK;
+		pte_t swp_pte;
+		int i;
+
+		pte = pte_offset_map(pmd, address);
+		ptl = pte_lockptr(mm, pmd);
+		spin_lock(ptl);
+		for (i = 0; i < HPAGE_PMD_NR; i++, pte++, tmp++) {
+			if (!(pte_present(*pte) &&
+			      page_to_pfn(tmp) == pte_pfn(*pte)))
+				continue;
+			pteval = ptep_clear_flush(vma, address, pte);
+			if (pte_dirty(pteval))
+				set_page_dirty(tmp);
+			entry = make_migration_entry(tmp, pte_write(pteval));
+			swp_pte = swp_entry_to_pte(entry);
+			set_pte_at(mm, address, pte, swp_pte);
+			page_remove_rmap(tmp, false);
+			put_page(tmp);
+		}
+		pte_unmap_unlock(pte, ptl);
+	}
+	mmu_notifier_invalidate_range_end(mm, addr, addr + HPAGE_PMD_SIZE);
+	return;
+}
+
+int remove_migration_pmd(struct page *new, pmd_t *pmd,
+		struct vm_area_struct *vma, unsigned long addr, void *old)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	spinlock_t *ptl;
+	pmd_t pmde;
+	swp_entry_t entry;
+
+	pmde = *pmd;
+	barrier();
+
+	if (!pmd_present(pmde)) {
+		if (is_migration_entry(pmd_to_swp_entry(pmde))) {
+			unsigned long mmun_start = addr & HPAGE_PMD_MASK;
+			unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+
+			ptl = pmd_lock(mm, pmd);
+			entry = pmd_to_swp_entry(*pmd);
+			if (migration_entry_to_page(entry) != old)
+				goto unlock_ptl;
+			get_page(new);
+			pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+			if (is_write_migration_entry(entry))
+				pmde = maybe_pmd_mkwrite(pmde, vma);
+			flush_cache_range(vma, mmun_start, mmun_end);
+			page_add_anon_rmap(new, vma, mmun_start, true);
+			pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
+			set_pmd_at(mm, mmun_start, pmd, pmde);
+			flush_tlb_range(vma, mmun_start, mmun_end);
+			if (vma->vm_flags & VM_LOCKED)
+				mlock_vma_page(new);
+			update_mmu_cache_pmd(vma, addr, pmd);
+unlock_ptl:
+			spin_unlock(ptl);
+		}
+	} else { /* pte-mapped thp */
+		pte_t *ptep;
+		pte_t pte;
+		int i;
+		struct page *tmpnew = compound_head(new);
+		struct page *tmpold = compound_head((struct page *)old);
+		unsigned long address = addr & HPAGE_PMD_MASK;
+
+		ptep = pte_offset_map(pmd, addr);
+		ptl = pte_lockptr(mm, pmd);
+		spin_lock(ptl);
+
+		for (i = 0; i < HPAGE_PMD_NR;
+		     i++, ptep++, tmpnew++, tmpold++, address += PAGE_SIZE) {
+			pte = *ptep;
+			if (!is_swap_pte(pte))
+				continue;
+			entry = pte_to_swp_entry(pte);
+			if (!is_migration_entry(entry) ||
+			    migration_entry_to_page(entry) != tmpold)
+				continue;
+			get_page(tmpnew);
+			pte = pte_mkold(mk_pte(tmpnew,
+					       READ_ONCE(vma->vm_page_prot)));
+			if (pte_swp_soft_dirty(*ptep))
+				pte = pte_mksoft_dirty(pte);
+			if (is_write_migration_entry(entry))
+				pte = maybe_mkwrite(pte, vma);
+			flush_dcache_page(tmpnew);
+			set_pte_at(mm, address, ptep, pte);
+			if (PageAnon(new))
+				page_add_anon_rmap(tmpnew, vma, address, false);
+			else
+				page_add_file_rmap(tmpnew, false);
+			update_mmu_cache(vma, address, ptep);
+		}
+		pte_unmap_unlock(ptep, ptl);
+	}
+	return SWAP_AGAIN;
+}
+#endif
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
index 66ce6b4..54f2eb6 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
@@ -198,6 +198,8 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	swp_entry_t entry;
+	pgd_t *pgd;
+	pud_t *pud;
  	pmd_t *pmd;
 	pte_t *ptep, pte;
  	spinlock_t *ptl;
@@ -208,10 +210,29 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 			goto out;
 		ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
 	} else {
-		pmd = mm_find_pmd(mm, addr);
+		pmd_t pmde;
+
+		pgd = pgd_offset(mm, addr);
+		if (!pgd_present(*pgd))
+			goto out;
+		pud = pud_offset(pgd, addr);
+		if (!pud_present(*pud))
+			goto out;
+		pmd = pmd_offset(pud, addr);
 		if (!pmd)
 			goto out;
 
+		if (PageTransCompound(new)) {
+			remove_migration_pmd(new, pmd, vma, addr, old);
+			goto out;
+		}
+
+		pmde = *pmd;
+		barrier();
+
+		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
+			goto out;
+
 		ptep = pte_offset_map(pmd, addr);
 
 		/*
@@ -344,6 +365,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
 	__migration_entry_wait(mm, pte, ptl);
 }
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *ptl;
+	struct page *page;
+
+	ptl = pmd_lock(mm, pmd);
+	if (!is_pmd_migration_entry(*pmd))
+		goto unlock;
+	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+	if (!get_page_unless_zero(page))
+		goto unlock;
+	spin_unlock(ptl);
+	wait_on_page_locked(page);
+	put_page(page);
+	return;
+unlock:
+	spin_unlock(ptl);
+}
+#endif
+
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
 static bool buffer_migrate_lock_buffers(struct buffer_head *head,
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
index 71c5f91..6012343 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
@@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t pmd;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+		  !pmd_devmap(*pmdp));
 	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return pmd;
-- 
2.7.0


* [PATCH v2 06/12] mm: thp: enable thp migration in generic path
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (4 preceding siblings ...)
  2016-11-07 23:31 ` [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-28 14:33   ` Michal Hocko
  2016-11-07 23:31 ` [PATCH v2 07/12] mm: thp: check pmd migration entry in common path Naoya Horiguchi
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

This patch makes it possible to support thp migration gradually. If we fail
to allocate a destination page as a thp, we just split the source thp as we
do now, and then enter the normal page migration path. If we succeed in
allocating a destination thp, we enter the thp migration path. Subsequent
patches actually enable thp migration for each caller of page migration by
allowing its get_new_page() callback to allocate thps.
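
For reference, such a get_new_page() callback would look roughly like the
sketch below (the same allocation pattern the soft-offline patch later in
this series uses); new_thp_page() is just an illustrative name:

static struct page *new_thp_page(struct page *page, unsigned long private,
				 int **result)
{
	int nid = page_to_nid(page);

	if (thp_migration_supported() && PageTransHuge(page)) {
		struct page *thp;

		/* allocate a whole destination thp on the same node */
		thp = alloc_pages_node(nid,
			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
			HPAGE_PMD_ORDER);
		if (!thp)
			return NULL;	/* this migration attempt just fails */
		prep_transhuge_page(thp);
		return thp;
	}
	return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
}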

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/migrate.c | 2 +-
 mm/rmap.c    | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
index 54f2eb6..97ab8d9 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
@@ -1142,7 +1142,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		goto out;
 	}
 
-	if (unlikely(PageTransHuge(page))) {
+	if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
 		lock_page(page);
 		rc = split_huge_page(page);
 		unlock_page(page);
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/rmap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/rmap.c
index a4be307..a0b665c 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/rmap.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/rmap.c
@@ -1443,6 +1443,13 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	struct rmap_private *rp = arg;
 	enum ttu_flags flags = rp->flags;
 
+	if (flags & TTU_MIGRATION) {
+		if (!PageHuge(page) && PageTransCompound(page)) {
+			set_pmd_migration_entry(page, vma, address);
+			goto out;
+		}
+	}
+
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		goto out;
-- 
2.7.0


* [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (5 preceding siblings ...)
  2016-11-07 23:31 ` [PATCH v2 06/12] mm: thp: enable thp migration in generic path Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-08  0:23   ` kbuild test robot
                     ` (4 more replies)
  2016-11-07 23:31 ` [PATCH v2 08/12] mm: soft-dirty: keep soft-dirty bits over thp migration Naoya Horiguchi
                   ` (7 subsequent siblings)
  14 siblings, 5 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

Once one of the callers of page migration starts to handle thp, memory
management code starts to see pmd migration entries, so we need to prepare
for them before enabling thp migration. This patch changes the various code
points that check the status of a given pmd, in order to prevent races
between thp migration and pmd-related work.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1 -> v2:
- introduce pmd_related() (I know the naming is not good, but I can't think
  of a better name. Any suggestion is welcome.)
---
 arch/x86/mm/gup.c       |  4 +--
 fs/proc/task_mmu.c      | 23 +++++++------
 include/linux/huge_mm.h |  9 ++++-
 mm/gup.c                | 10 ++++--
 mm/huge_memory.c        | 88 ++++++++++++++++++++++++++++++++++++++++---------
 mm/madvise.c            |  2 +-
 mm/memcontrol.c         |  2 ++
 mm/memory.c             |  6 +++-
 mm/mprotect.c           |  2 ++
 mm/mremap.c             |  2 +-
 10 files changed, 114 insertions(+), 34 deletions(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
index 0d4fb3e..78a153d 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
@@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		if (!pmd_present(pmd))
 			return 0;
-		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
+		if (unlikely(pmd_large(pmd))) {
 			/*
 			 * NUMA hinting faults need to be handled in the GUP
 			 * slowpath for accounting purposes and so that they
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
index 35b92d8..c1f9cf4 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
@@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		smaps_pmd_entry(pmd, addr, walk);
+		if (pmd_present(*pmd))
+			smaps_pmd_entry(pmd, addr, walk);
 		spin_unlock(ptl);
 		return 0;
 	}
@@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 			goto out;
 		}
 
+		if (!pmd_present(*pmd))
+			goto out;
+
 		page = pmd_page(*pmd);
 
 		/* Clear accessed and referenced bits. */
@@ -1208,19 +1212,18 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	if (ptl) {
 		u64 flags = 0, frame = 0;
 		pmd_t pmd = *pmdp;
+		struct page *page;
 
 		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
 			flags |= PM_SOFT_DIRTY;
 
-		/*
-		 * Currently pmd for thp is always present because thp
-		 * can not be swapped-out, migrated, or HWPOISONed
-		 * (split in such cases instead.)
-		 * This if-check is just to prepare for future implementation.
-		 */
-		if (pmd_present(pmd)) {
-			struct page *page = pmd_page(pmd);
-
+		if (is_pmd_migration_entry(pmd)) {
+			swp_entry_t entry = pmd_to_swp_entry(pmd);
+			frame = swp_type(entry) |
+				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+			page = migration_entry_to_page(entry);
+		} else if (pmd_present(pmd)) {
+			page = pmd_page(pmd);
 			if (page_mapcount(page) == 1)
 				flags |= PM_MMAP_EXCLUSIVE;
 
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
index fcbca51..3c252cd 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
@@ -125,12 +125,19 @@ extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    long adjust_next);
 extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma);
+
+static inline int pmd_related(pmd_t pmd)
+{
+	return !pmd_none(pmd) &&
+		(!pmd_present(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd));
+}
+
 /* mmap_sem must be held on entry */
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
 	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (pmd_related(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
 		return NULL;
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
index e50178c..2dc4978 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
@@ -267,6 +267,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		return no_page_table(vma, flags);
+	if (!pmd_present(*pmd))
+		return no_page_table(vma, flags);
 	if (pmd_devmap(*pmd)) {
 		ptl = pmd_lock(mm, pmd);
 		page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -278,6 +280,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		return follow_page_pte(vma, address, pmd, flags);
 
 	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_present(*pmd))) {
+		spin_unlock(ptl);
+		return no_page_table(vma, flags);
+	}
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags);
@@ -333,7 +339,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	pud = pud_offset(pgd, address);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		return -EFAULT;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	pte = pte_offset_map(pmd, address);
@@ -1357,7 +1363,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = READ_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		if (!pmd_present(pmd))
 			return 0;
 
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
index b3022b3..4e9090c 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
@@ -825,6 +825,20 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = -EAGAIN;
 	pmd = *src_pmd;
+
+	if (unlikely(is_pmd_migration_entry(pmd))) {
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		if (is_write_migration_entry(entry)) {
+			make_migration_entry_read(&entry);
+			pmd = swp_entry_to_pmd(entry);
+			set_pmd_at(src_mm, addr, src_pmd, pmd);
+		}
+		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+		ret = 0;
+		goto out_unlock;
+	}
+
 	if (unlikely(!pmd_trans_huge(pmd))) {
 		pte_free(dst_mm, pgtable);
 		goto out_unlock;
@@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
 	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
 		goto out_unlock;
 
+	if (unlikely(!pmd_present(orig_pmd)))
+		goto out_unlock;
+
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
 	/*
@@ -1137,7 +1154,14 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		goto out;
 
-	page = pmd_page(*pmd);
+	if (is_pmd_migration_entry(*pmd)) {
+		swp_entry_t entry;
+		entry = pmd_to_swp_entry(*pmd);
+		page = pfn_to_page(swp_offset(entry));
+		if (!is_migration_entry(entry))
+			goto out;
+	} else
+		page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd);
@@ -1332,6 +1356,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	if (is_huge_zero_pmd(orig_pmd))
 		goto out;
 
+	if (unlikely(!pmd_present(orig_pmd)))
+		goto out;
+
 	page = pmd_page(orig_pmd);
 	/*
 	 * If other processes are mapping this page, we couldn't discard
@@ -1410,20 +1437,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
 	} else {
 		struct page *page = pmd_page(orig_pmd);
-		page_remove_rmap(page, true);
-		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-		VM_BUG_ON_PAGE(!PageHead(page), page);
-		if (PageAnon(page)) {
-			pgtable_t pgtable;
-			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
-			pte_free(tlb->mm, pgtable);
-			atomic_long_dec(&tlb->mm->nr_ptes);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		int migration = 0;
+
+		if (!is_pmd_migration_entry(orig_pmd)) {
+			page_remove_rmap(page, true);
+			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+			VM_BUG_ON_PAGE(!PageHead(page), page);
+			if (PageAnon(page)) {
+				pgtable_t pgtable;
+				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
+								      pmd);
+				pte_free(tlb->mm, pgtable);
+				atomic_long_dec(&tlb->mm->nr_ptes);
+				add_mm_counter(tlb->mm, MM_ANONPAGES,
+					       -HPAGE_PMD_NR);
+			} else {
+				add_mm_counter(tlb->mm, MM_FILEPAGES,
+					       -HPAGE_PMD_NR);
+			}
 		} else {
-			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+			swp_entry_t entry;
+
+			entry = pmd_to_swp_entry(orig_pmd);
+			free_swap_and_cache(entry); /* warning on failure? */
+			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			migration = 1;
 		}
 		spin_unlock(ptl);
-		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
+		if (!migration)
+			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
 	}
 	return 1;
 }
@@ -1496,14 +1538,27 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		bool preserve_write = prot_numa && pmd_write(*pmd);
 		ret = 1;
 
+		if (!pmd_present(*pmd))
+			goto unlock;
 		/*
 		 * Avoid trapping faults against the zero page. The read-only
 		 * data is likely to be read-cached on the local CPU and
 		 * local/remote hits to the zero page are not interesting.
 		 */
-		if (prot_numa && is_huge_zero_pmd(*pmd)) {
-			spin_unlock(ptl);
-			return ret;
+		if (prot_numa && is_huge_zero_pmd(*pmd))
+			goto unlock;
+
+		if (is_pmd_migration_entry(*pmd)) {
+			swp_entry_t entry = pmd_to_swp_entry(*pmd);
+
+			if (is_write_migration_entry(entry)) {
+				pmd_t newpmd;
+
+				make_migration_entry_read(&entry);
+				newpmd = swp_entry_to_pmd(entry);
+				set_pmd_at(mm, addr, pmd, newpmd);
+			}
+			goto unlock;
 		}
 
 		if (!prot_numa || !pmd_protnone(*pmd)) {
@@ -1516,6 +1571,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
 					pmd_write(entry));
 		}
+unlock:
 		spin_unlock(ptl);
 	}
 
@@ -1532,7 +1588,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+	if (likely(pmd_related(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
index 0e3828e..eaa2b02 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
@@ -274,7 +274,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	unsigned long next;
 
 	next = pmd_addr_end(addr, end);
-	if (pmd_trans_huge(*pmd))
+	if (pmd_related(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
 
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
index 91dfc7c..ebc2c42 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
@@ -4635,6 +4635,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
 	struct page *page = NULL;
 	enum mc_target_type ret = MC_TARGET_NONE;
 
+	if (unlikely(!pmd_present(pmd)))
+		return ret;
 	page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
 	if (!(mc.flags & MOVE_ANON))
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
index 94b5e2c..33fa439 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
@@ -999,7 +999,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+		if (pmd_related(*src_pmd)) {
 			int err;
 			VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
 			err = copy_huge_pmd(dst_mm, src_mm,
@@ -3591,6 +3591,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		int ret;
 
 		barrier();
+		if (unlikely(is_pmd_migration_entry(orig_pmd))) {
+			pmd_migration_entry_wait(mm, fe.pmd);
+			return 0;
+		}
 		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&fe, orig_pmd);
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
index c5ba2aa..81186e3 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
@@ -164,6 +164,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
+		if (!pmd_present(*pmd))
+			continue;
 		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
 				&& pmd_none_or_clear_bad(pmd))
 			continue;
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
index da22ad2..a94a698 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
@@ -194,7 +194,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-		if (pmd_trans_huge(*old_pmd)) {
+		if (pmd_related(*old_pmd)) {
 			if (extent == HPAGE_PMD_SIZE) {
 				bool moved;
 				/* See comment in move_ptes() */
-- 
2.7.0


* [PATCH v2 08/12] mm: soft-dirty: keep soft-dirty bits over thp migration
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (6 preceding siblings ...)
  2016-11-07 23:31 ` [PATCH v2 07/12] mm: thp: check pmd migration entry in common path Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-10  8:38   ` Anshuman Khandual
  2016-11-07 23:31 ` [PATCH v2 09/12] mm: hwpoison: soft offline supports " Naoya Horiguchi
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

The soft-dirty bit is designed to be preserved over page migration. This
patch makes it work in the same manner for thp migration too.
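
For reference, the user-visible soft-dirty interface (which this patch keeps
working across thp migration) can be exercised with a sketch like the one
below; per Documentation/vm/soft-dirty.txt, bit 55 of a pagemap entry is the
soft-dirty bit and writing "4" to clear_refs clears it:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return the soft-dirty bit (pagemap bit 55) for the page backing addr. */
static int soft_dirty(void *addr)
{
	uint64_t ent = 0;
	long psize = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return -1;
	pread(fd, &ent, sizeof(ent), ((uintptr_t)addr / psize) * 8);
	close(fd);
	return (ent >> 55) & 1;
}

int main(void)
{
	char *p = malloc(4096);
	int fd = open("/proc/self/clear_refs", O_WRONLY);

	memset(p, 0, 4096);		/* fault the page in */
	write(fd, "4", 1);		/* "4" clears all soft-dirty bits */
	close(fd);
	p[0] = 1;			/* re-dirty the page */
	printf("soft-dirty after write: %d\n", soft_dirty(p));
	return 0;
}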

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1 -> v2:
- separate diff moving _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6
- clear_soft_dirty_pmd can handle migration entry
---
 arch/x86/include/asm/pgtable.h | 17 +++++++++++++++++
 fs/proc/task_mmu.c             | 17 +++++++++++------
 include/asm-generic/pgtable.h  | 34 +++++++++++++++++++++++++++++++++-
 include/linux/swapops.h        |  2 ++
 mm/huge_memory.c               | 25 +++++++++++++++++++++++--
 5 files changed, 86 insertions(+), 9 deletions(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable.h
index 437feb4..ceec210 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable.h
@@ -948,6 +948,23 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+#endif
 #endif
 
 #define PKRU_AD_BIT 0x1
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
index c1f9cf4..85745b9 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
@@ -900,12 +900,17 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
-
-	pmd = pmd_wrprotect(pmd);
-	pmd = pmd_clear_soft_dirty(pmd);
-
-	set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	pmd_t pmd = *pmdp;
+
+	if (pmd_present(pmd)) {
+		pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
+		pmd = pmd_wrprotect(pmd);
+		pmd = pmd_clear_soft_dirty(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+		pmd = pmd_swp_clear_soft_dirty(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	}
 }
 #else
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/asm-generic/pgtable.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/asm-generic/pgtable.h
index c4f8fd2..f50e8b4 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/asm-generic/pgtable.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/asm-generic/pgtable.h
@@ -489,7 +489,24 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
-#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+#else /* !CONFIG_HAVE_ARCH_SOFT_DIRTY */
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return 0;
@@ -534,6 +551,21 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 {
 	return pte;
 }
+
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
 #endif
 
 #ifndef __HAVE_PFNMAP_TRACKING
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
index b6b22a2..db8a858 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
@@ -176,6 +176,8 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
 {
 	swp_entry_t arch_entry;
 
+	if (pmd_swp_soft_dirty(pmd))
+		pmd = pmd_swp_clear_soft_dirty(pmd);
 	arch_entry = __pmd_to_swp_entry(pmd);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
index 4e9090c..bd9c23e 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
@@ -832,6 +832,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (is_write_migration_entry(entry)) {
 			make_migration_entry_read(&entry);
 			pmd = swp_entry_to_pmd(entry);
+			if (pmd_swp_soft_dirty(pmd))
+				pmd = pmd_swp_mksoft_dirty(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
 		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1470,6 +1472,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return 1;
 }
 
+static pmd_t move_soft_dirty_pmd(pmd_t pmd)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	if (unlikely(is_pmd_migration_entry(pmd)))
+		pmd = pmd_swp_mksoft_dirty(pmd);
+	else if (pmd_present(pmd))
+		pmd = pmd_mksoft_dirty(pmd);
+#endif
+	return pmd;
+}
+
 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		  unsigned long new_addr, unsigned long old_end,
 		  pmd_t *old_pmd, pmd_t *new_pmd)
@@ -1510,7 +1523,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
 		}
-		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
+		pmd = move_soft_dirty_pmd(pmd);
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		if (new_ptl != old_ptl)
 			spin_unlock(new_ptl);
 		spin_unlock(old_ptl);
@@ -1556,6 +1570,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 				make_migration_entry_read(&entry);
 				newpmd = swp_entry_to_pmd(entry);
+				if (pmd_swp_soft_dirty(newpmd))
+					newpmd = pmd_swp_mksoft_dirty(newpmd);
 				set_pmd_at(mm, addr, pmd, newpmd);
 			}
 			goto unlock;
@@ -2408,7 +2424,8 @@ void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
 			set_page_dirty(page);
 		entry = make_migration_entry(page, pmd_write(pmdval));
 		pmdswp = swp_entry_to_pmd(entry);
-		pmdswp = pmd_mkhuge(pmdswp);
+		if (pmd_soft_dirty(pmdval))
+			pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 		set_pmd_at(mm, addr, pmd, pmdswp);
 		page_remove_rmap(page, true);
 		put_page(page);
@@ -2434,6 +2451,8 @@ void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
 				set_page_dirty(tmp);
 			entry = make_migration_entry(tmp, pte_write(pteval));
 			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pteval))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, address, pte, swp_pte);
 			page_remove_rmap(tmp, false);
 			put_page(tmp);
@@ -2466,6 +2485,8 @@ int remove_migration_pmd(struct page *new, pmd_t *pmd,
 				goto unlock_ptl;
 			get_page(new);
 			pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+			if (pmd_swp_soft_dirty(*pmd))
+				pmde = pmd_mksoft_dirty(pmde);
 			if (is_write_migration_entry(entry))
 				pmde = maybe_pmd_mkwrite(pmde, vma);
 			flush_cache_range(vma, mmun_start, mmun_end);
-- 
2.7.0


* [PATCH v2 09/12] mm: hwpoison: soft offline supports thp migration
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (7 preceding siblings ...)
  2016-11-07 23:31 ` [PATCH v2 08/12] mm: soft-dirty: keep soft-dirty bits over thp migration Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-10 10:31   ` Balbir Singh
  2016-11-07 23:31 ` [PATCH v2 10/12] mm: mempolicy: mbind and migrate_pages support " Naoya Horiguchi
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

This patch enables thp migration for soft offline.
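
For reference, a minimal user-space sketch (requires CAP_SYS_ADMIN and a
kernel with CONFIG_MEMORY_FAILURE) that triggers soft offline on a thp-backed
mapping via madvise(MADV_SOFT_OFFLINE):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101	/* from the kernel uapi, if libc lacks it */
#endif

#define HPAGE_SIZE (2UL << 20)

int main(void)
{
	/* map 4MB and carve out a 2MB-aligned, thp-hinted window */
	char *raw = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p;

	if (raw == MAP_FAILED)
		return 1;
	p = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));
	madvise(p, HPAGE_SIZE, MADV_HUGEPAGE);
	memset(p, 1, HPAGE_SIZE);	/* fault in, hopefully as a thp */

	/*
	 * Soft-offline one base page of the range; with this patch the
	 * backing thp is migrated as a whole instead of being split first.
	 */
	if (madvise(p, 4096, MADV_SOFT_OFFLINE))
		perror("madvise(MADV_SOFT_OFFLINE)");
	return 0;
}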

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory-failure.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
index 19e796d..6cc8157 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
@@ -1485,7 +1485,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
 	if (PageHuge(p))
 		return alloc_huge_page_node(page_hstate(compound_head(p)),
 						   nid);
-	else
+	else if (thp_migration_supported() && PageTransHuge(p)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(nid,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
 }
 
@@ -1687,28 +1697,11 @@ static int __soft_offline_page(struct page *page, int flags)
 static int soft_offline_in_use_page(struct page *page, int flags)
 {
 	int ret;
-	struct page *hpage = compound_head(page);
-
-	if (!PageHuge(page) && PageTransHuge(hpage)) {
-		lock_page(hpage);
-		if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
-			unlock_page(hpage);
-			if (!PageAnon(hpage))
-				pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
-			else
-				pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
-			put_hwpoison_page(hpage);
-			return -EBUSY;
-		}
-		unlock_page(hpage);
-		get_hwpoison_page(page);
-		put_hwpoison_page(hpage);
-	}
 
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page, flags);
 	else
-		ret = __soft_offline_page(page, flags);
+		ret = __soft_offline_page(compound_head(page), flags);
 
 	return ret;
 }
-- 
2.7.0


* [PATCH v2 10/12] mm: mempolicy: mbind and migrate_pages support thp migration
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (8 preceding siblings ...)
  2016-11-07 23:31 ` [PATCH v2 09/12] mm: hwpoison: soft offline supports " Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-25 12:27   ` Anshuman Khandual
  2016-11-07 23:31 ` [PATCH v2 11/12] mm: migrate: move_pages() supports " Naoya Horiguchi
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

This patch enables thp migration for mbind(2) and migrate_pages(2).
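
As a reference, a minimal user-space sketch of driving this path via mbind(2)
on an anonymous mapping that we hope is thp-backed.  The target node, the 2MB
size and the maxnode value are illustrative assumptions; build with -lnuma:

#define _GNU_SOURCE
#include <numaif.h>		/* mbind(), MPOL_* (libnuma headers) */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE (2UL << 20)

int main(void)
{
	unsigned long nodemask = 1UL << 1;	/* assumed target: node 1 */
	char *p = mmap(NULL, THP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	madvise(p, THP_SIZE, MADV_HUGEPAGE);
	memset(p, 1, THP_SIZE);			/* fault in before rebinding */

	/* rebind and ask for already-allocated pages to be migrated */
	if (mbind(p, THP_SIZE, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
		  MPOL_MF_MOVE | MPOL_MF_STRICT))
		perror("mbind");
	return 0;
}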

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1 -> v2:
- support pte-mapped and doubly-mapped thp
---
 mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 79 insertions(+), 29 deletions(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
index 77d0668..96507ee 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
@@ -94,6 +94,7 @@
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
+#include <linux/swapops.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
@@ -486,6 +487,49 @@ static inline bool queue_pages_node_check(struct page *page,
 	return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
 }
 
+static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	int ret = 0;
+	struct page *page;
+	struct queue_pages *qp = walk->private;
+	unsigned long flags;
+
+	if (unlikely(is_pmd_migration_entry(*pmd))) {
+		ret = 1;
+		goto unlock;
+	}
+	page = pmd_page(*pmd);
+	if (is_huge_zero_page(page)) {
+		spin_unlock(ptl);
+		__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
+		goto out;
+	}
+	if (!thp_migration_supported()) {
+		get_page(page);
+		spin_unlock(ptl);
+		lock_page(page);
+		ret = split_huge_page(page);
+		unlock_page(page);
+		put_page(page);
+		goto out;
+	}
+	if (queue_pages_node_check(page, qp)) {
+		ret = 1;
+		goto unlock;
+	}
+
+	ret = 1;
+	flags = qp->flags;
+	/* go to thp migration */
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_page_add(page, qp->pagelist, flags);
+unlock:
+	spin_unlock(ptl);
+out:
+	return ret;
+}
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
@@ -497,30 +541,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	struct queue_pages *qp = walk->private;
 	unsigned long flags = qp->flags;
-	int nid, ret;
+	int ret;
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge(*pmd)) {
-		ptl = pmd_lock(walk->mm, pmd);
-		if (pmd_trans_huge(*pmd)) {
-			page = pmd_page(*pmd);
-			if (is_huge_zero_page(page)) {
-				spin_unlock(ptl);
-				__split_huge_pmd(vma, pmd, addr, false, NULL);
-			} else {
-				get_page(page);
-				spin_unlock(ptl);
-				lock_page(page);
-				ret = split_huge_page(page);
-				unlock_page(page);
-				put_page(page);
-				if (ret)
-					return 0;
-			}
-		} else {
-			spin_unlock(ptl);
-		}
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
+		if (ret)
+			return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
@@ -541,7 +570,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		if (queue_pages_node_check(page, qp))
 			continue;
-		if (PageTransCompound(page)) {
+		if (PageTransCompound(page) && !thp_migration_supported()) {
 			get_page(page);
 			pte_unmap_unlock(pte, ptl);
 			lock_page(page);
@@ -959,19 +988,21 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 
 #ifdef CONFIG_MIGRATION
 /*
- * page migration
+ * page migration, thp tail pages can be passed.
  */
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags)
 {
+	struct page *head = compound_head(page);
 	/*
 	 * Avoid migrating a page that is shared with others.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
-		if (!isolate_lru_page(page)) {
-			list_add_tail(&page->lru, pagelist);
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-					    page_is_file_cache(page));
+	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
+		if (!isolate_lru_page(head)) {
+			list_add_tail(&head->lru, pagelist);
+			mod_node_page_state(page_pgdat(head),
+				NR_ISOLATED_ANON + page_is_file_cache(head),
+				hpage_nr_pages(head));
 		}
 	}
 }
@@ -981,7 +1012,17 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
 	if (PageHuge(page))
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
 					node);
-	else
+	else if (thp_migration_supported() && PageTransHuge(page)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(node,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
 						    __GFP_THISNODE, 0);
 }
@@ -1147,6 +1188,15 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
 	if (PageHuge(page)) {
 		BUG_ON(!vma);
 		return alloc_huge_page_noerr(vma, address, 1);
+	} else if (thp_migration_supported() && PageTransHuge(page)) {
+		struct page *thp;
+
+		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+					 HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
 	}
 	/*
 	 * if !vma, alloc_page_vma() will use task or system default policy
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 11/12] mm: migrate: move_pages() supports thp migration
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (9 preceding siblings ...)
  2016-11-07 23:31 ` [PATCH v2 10/12] mm: mempolicy: mbind and migrate_pages support " Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-18  0:01   ` Kirill A. Shutemov
  2016-11-07 23:31 ` [PATCH v2 12/12] mm: memory_hotplug: memory hotremove " Naoya Horiguchi
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

This patch enables thp migration for move_pages(2).
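
As a reference, a minimal user-space sketch of driving this path via
move_pages(2); the target node and the 2MB size are illustrative assumptions,
and the program needs -lnuma:

#define _GNU_SOURCE
#include <numaif.h>		/* move_pages() (libnuma headers) */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE (2UL << 20)

int main(void)
{
	void *pages[1];
	int nodes[1] = { 1 };			/* assumed target: node 1 */
	int status[1];
	char *p = mmap(NULL, THP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	madvise(p, THP_SIZE, MADV_HUGEPAGE);
	memset(p, 1, THP_SIZE);

	pages[0] = p;		/* head address of the (hoped-for) thp */
	if (move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");
	else
		printf("status[0] = %d\n", status[0]);
	return 0;
}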

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/migrate.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
index 97ab8d9..6a589b9 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
@@ -1443,7 +1443,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
 	if (PageHuge(p))
 		return alloc_huge_page_node(page_hstate(compound_head(p)),
 					pm->node);
-	else
+	else if (thp_migration_supported() && PageTransHuge(p)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(pm->node,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(pm->node,
 				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
 }
@@ -1470,6 +1480,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 	for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
 		struct vm_area_struct *vma;
 		struct page *page;
+		struct page *head;
+		unsigned int follflags;
 
 		err = -EFAULT;
 		vma = find_vma(mm, pp->addr);
@@ -1477,8 +1489,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 			goto set_status;
 
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, pp->addr,
-				FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
+		follflags = FOLL_GET | FOLL_SPLIT | FOLL_DUMP;
+		if (thp_migration_supported())
+			follflags &= ~FOLL_SPLIT;
+		page = follow_page(vma, pp->addr, follflags);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1488,7 +1502,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		if (!page)
 			goto set_status;
 
-		pp->page = page;
 		err = page_to_nid(page);
 
 		if (err == pp->node)
@@ -1503,16 +1516,22 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 			goto put_and_set;
 
 		if (PageHuge(page)) {
-			if (PageHead(page))
+			if (PageHead(page)) {
 				isolate_huge_page(page, &pagelist);
+				err = 0;
+				pp->page = page;
+			}
 			goto put_and_set;
 		}
 
-		err = isolate_lru_page(page);
+		pp->page = compound_head(page);
+		head = compound_head(page);
+		err = isolate_lru_page(head);
 		if (!err) {
-			list_add_tail(&page->lru, &pagelist);
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-					    page_is_file_cache(page));
+			list_add_tail(&head->lru, &pagelist);
+			mod_node_page_state(page_pgdat(head),
+				NR_ISOLATED_ANON + page_is_file_cache(head),
+				hpage_nr_pages(head));
 		}
 put_and_set:
 		/*
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v2 12/12] mm: memory_hotplug: memory hotremove supports thp migration
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (10 preceding siblings ...)
  2016-11-07 23:31 ` [PATCH v2 11/12] mm: migrate: move_pages() supports " Naoya Horiguchi
@ 2016-11-07 23:31 ` Naoya Horiguchi
  2016-11-08  0:29   ` kbuild test robot
  2016-11-08  0:30   ` kbuild test robot
  2016-11-09  2:32 ` [PATCH v2 00/12] mm: page migration enhancement for thp Balbir Singh
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-07 23:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Naoya Horiguchi

This patch enables thp migration for memory hotremove.
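
For reference, memory hotremove is normally driven from user space by
offlining memory blocks through sysfs; below is a minimal sketch, where the
block number is purely illustrative and root privileges are assumed:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* the block number is an assumption; pick one that exists and is removable */
	int fd = open("/sys/devices/system/memory/memory32/state", O_WRONLY);

	if (fd < 0 || write(fd, "offline", 7) != 7)
		perror("offline memory32");
	if (fd >= 0)
		close(fd);
	return 0;
}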

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1->v2:
- base code switched from alloc_migrate_target to new_node_page()
---
 mm/memory_hotplug.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory_hotplug.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory_hotplug.c
index b18dab40..a9c3fe1 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory_hotplug.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory_hotplug.c
@@ -1543,6 +1543,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 	int nid = page_to_nid(page);
 	nodemask_t nmask = node_states[N_MEMORY];
 	struct page *new_page = NULL;
+	unsigned int order = 0;
 
 	/*
 	 * TODO: allocate a destination hugepage from a nearest neighbor node,
@@ -1553,6 +1554,11 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
 					next_node_in(nid, nmask));
 
+	if (thp_migration_supported() && PageTransHuge(page)) {
+		order = HPAGE_PMD_ORDER;
+		gfp_mask |= GFP_TRANSHUGE;
+	}
+
 	node_clear(nid, nmask);
 
 	if (PageHighMem(page)
@@ -1560,12 +1566,15 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		gfp_mask |= __GFP_HIGHMEM;
 
 	if (!nodes_empty(nmask))
-		new_page = __alloc_pages_nodemask(gfp_mask, 0,
+		new_page = __alloc_pages_nodemask(gfp_mask, order,
 					node_zonelist(nid, gfp_mask), &nmask);
 	if (!new_page)
-		new_page = __alloc_pages(gfp_mask, 0,
+		new_page = __alloc_pages(gfp_mask, order,
 					node_zonelist(nid, gfp_mask));
 
+	if (new_page && order == HPAGE_PMD_ORDER)
+		prep_transhuge_page(new_page);
+
 	return new_page;
 }
 
@@ -1595,7 +1604,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			if (isolate_huge_page(page, &source))
 				move_pages -= 1 << compound_order(head);
 			continue;
-		}
+		} else if (thp_migration_supported() && PageTransHuge(page))
+			pfn = page_to_pfn(compound_head(page))
+				+ HPAGE_PMD_NR - 1;
 
 		if (!get_page_unless_zero(page))
 			continue;
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-07 23:31 ` [PATCH v2 07/12] mm: thp: check pmd migration entry in common path Naoya Horiguchi
@ 2016-11-08  0:23   ` kbuild test robot
  2016-11-08  1:22     ` Naoya Horiguchi
  2016-11-10  8:36   ` Anshuman Khandual
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 71+ messages in thread
From: kbuild test robot @ 2016-11-08  0:23 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: kbuild-all, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Dave Hansen, Andrea Arcangeli, Mel Gorman,
	Michal Hocko, Vlastimil Babka, Pavel Emelyanov, Zi Yan,
	Balbir Singh, linux-kernel, Naoya Horiguchi, Naoya Horiguchi

[-- Attachment #1: Type: text/plain, Size: 3549 bytes --]

Hi Naoya,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.9-rc4 next-20161028]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x007-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   mm/memory.c: In function 'copy_pmd_range':
>> mm/memory.c:1002:7: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
      if (pmd_related(*src_pmd)) {
          ^~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   mm/mremap.c: In function 'move_page_tables':
>> mm/mremap.c:197:7: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
      if (pmd_related(*old_pmd)) {
          ^~~~~~~~~~~
   In file included from include/asm-generic/bug.h:4:0,
                    from arch/x86/include/asm/bug.h:35,
                    from include/linux/bug.h:4,
                    from include/linux/mmdebug.h:4,
                    from include/linux/mm.h:8,
                    from mm/mremap.c:10:
>> include/linux/compiler.h:518:38: error: call to '__compiletime_assert_198' declared with attribute error: BUILD_BUG failed
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                         ^
   include/linux/compiler.h:501:4: note: in definition of macro '__compiletime_assert'
       prefix ## suffix();    \
       ^~~~~~
   include/linux/compiler.h:518:2: note: in expansion of macro '_compiletime_assert'
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
     ^~~~~~~~~~~~~~~~~~~
   include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
    #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                        ^~~~~~~~~~~~~~~~~~
   include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
    #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
                        ^~~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:183:27: note: in expansion of macro 'BUILD_BUG'
    #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
                              ^~~~~~~~~
>> mm/mremap.c:198:18: note: in expansion of macro 'HPAGE_PMD_SIZE'
       if (extent == HPAGE_PMD_SIZE) {
                     ^~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   mm/madvise.c: In function 'madvise_free_pte_range':
>> mm/madvise.c:277:6: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
     if (pmd_related(*pmd))
         ^~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/pmd_related +1002 mm/memory.c

   996		dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
   997		if (!dst_pmd)
   998			return -ENOMEM;
   999		src_pmd = pmd_offset(src_pud, addr);
  1000		do {
  1001			next = pmd_addr_end(addr, end);
> 1002			if (pmd_related(*src_pmd)) {
  1003				int err;
  1004				VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
  1005				err = copy_huge_pmd(dst_mm, src_mm,

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 25069 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 12/12] mm: memory_hotplug: memory hotremove supports thp migration
  2016-11-07 23:31 ` [PATCH v2 12/12] mm: memory_hotplug: memory hotremove " Naoya Horiguchi
@ 2016-11-08  0:29   ` kbuild test robot
  2016-11-08  0:30   ` kbuild test robot
  1 sibling, 0 replies; 71+ messages in thread
From: kbuild test robot @ 2016-11-08  0:29 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: kbuild-all, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Dave Hansen, Andrea Arcangeli, Mel Gorman,
	Michal Hocko, Vlastimil Babka, Pavel Emelyanov, Zi Yan,
	Balbir Singh, linux-kernel, Naoya Horiguchi, Naoya Horiguchi

[-- Attachment #1: Type: text/plain, Size: 8527 bytes --]

Hi Naoya,

[auto build test WARNING on mmotm/master]
[also build test WARNING on next-20161028]
[cannot apply to v4.9-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x000-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   Cyclomatic Complexity 3 mm/memory_hotplug.c:setup_memhp_default_state
   Cyclomatic Complexity 6 mm/memory_hotplug.c:find_smallest_section_pfn
   Cyclomatic Complexity 6 mm/memory_hotplug.c:find_biggest_section_pfn
   Cyclomatic Complexity 9 mm/memory_hotplug.c:shrink_pgdat_span
   Cyclomatic Complexity 1 include/linux/err.h:IS_ERR
   Cyclomatic Complexity 2 include/linux/page-flags.h:compound_head
   Cyclomatic Complexity 1 include/linux/page_ref.h:page_count
   Cyclomatic Complexity 2 include/linux/thread_info.h:test_ti_thread_flag
   Cyclomatic Complexity 1 include/linux/sched.h:test_tsk_thread_flag
   Cyclomatic Complexity 1 include/linux/sched.h:signal_pending
   Cyclomatic Complexity 0 include/linux/memory_hotplug.h:generic_alloc_nodedata
   Cyclomatic Complexity 3 mm/memory_hotplug.c:next_active_pageblock
   Cyclomatic Complexity 1 mm/memory_hotplug.c:__online_page_increment_counters
   Cyclomatic Complexity 1 include/linux/mm.h:__free_reserved_page
   Cyclomatic Complexity 1 include/linux/mm.h:free_reserved_page
   Cyclomatic Complexity 1 mm/memory_hotplug.c:__online_page_free
   Cyclomatic Complexity 1 mm/memory_hotplug.c:generic_online_page
   Cyclomatic Complexity 1 include/linux/mm.h:put_page_testzero
   Cyclomatic Complexity 4 mm/memory_hotplug.c:check_hotplug_memory_range
   Cyclomatic Complexity 1 mm/memory_hotplug.c:cmdline_parse_movable_node
   Cyclomatic Complexity 2 mm/memory_hotplug.c:ensure_zone_is_initialized
   Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqcount_begin_nested
   Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqcount_begin
   Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqlock
   Cyclomatic Complexity 1 include/linux/memory_hotplug.h:zone_span_writelock
   Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqcount_end
   Cyclomatic Complexity 1 include/linux/seqlock.h:write_sequnlock
   Cyclomatic Complexity 1 include/linux/memory_hotplug.h:zone_span_writeunlock
   Cyclomatic Complexity 3 mm/memory_hotplug.c:grow_zone_span
   Cyclomatic Complexity 4 mm/memory_hotplug.c:__add_zone
   Cyclomatic Complexity 4 mm/memory_hotplug.c:__add_section
   Cyclomatic Complexity 7 mm/memory_hotplug.c:__add_pages
   Cyclomatic Complexity 9 mm/memory_hotplug.c:shrink_zone_span
   Cyclomatic Complexity 1 mm/memory_hotplug.c:__remove_zone
   Cyclomatic Complexity 2 mm/memory_hotplug.c:resize_zone
   Cyclomatic Complexity 6 mm/memory_hotplug.c:move_pfn_range_left
   Cyclomatic Complexity 6 mm/memory_hotplug.c:move_pfn_range_right
   Cyclomatic Complexity 4 mm/memory_hotplug.c:move_pfn_range
   Cyclomatic Complexity 3 mm/memory_hotplug.c:__remove_section
   Cyclomatic Complexity 6 mm/memory_hotplug.c:__remove_pages
   Cyclomatic Complexity 2 mm/memory_hotplug.c:check_pages_isolated
   Cyclomatic Complexity 1 mm/memory_hotplug.c:offline_isolated_pages
   Cyclomatic Complexity 3 mm/memory_hotplug.c:hotadd_new_pgdat
   Cyclomatic Complexity 1 mm/memory_hotplug.c:online_memory_block
   Cyclomatic Complexity 1 mm/memory_hotplug.c:rollback_node_hotadd
   Cyclomatic Complexity 4 mm/memory_hotplug.c:register_memory_resource
   Cyclomatic Complexity 2 mm/memory_hotplug.c:release_memory_resource
   Cyclomatic Complexity 6 mm/memory_hotplug.c:scan_movable_pages
   Cyclomatic Complexity 1 include/linux/gfp.h:__alloc_pages
   Cyclomatic Complexity 1 include/linux/hugetlb.h:page_hstate
   Cyclomatic Complexity 3 include/linux/bitmap.h:bitmap_empty
   Cyclomatic Complexity 1 include/linux/nodemask.h:__nodes_empty
   Cyclomatic Complexity 12 mm/memory_hotplug.c:new_node_page
   Cyclomatic Complexity 3 include/linux/mm.h:put_page
   Cyclomatic Complexity 15 mm/memory_hotplug.c:do_migrate_range
   Cyclomatic Complexity 2 mm/memory_hotplug.c:check_pages_isolated_cb
   Cyclomatic Complexity 1 mm/memory_hotplug.c:offline_isolated_pages_cb
   Cyclomatic Complexity 2 mm/memory_hotplug.c:check_memblock_offlined_cb
   Cyclomatic Complexity 2 mm/memory_hotplug.c:get_online_mems
   Cyclomatic Complexity 6 mm/memory_hotplug.c:put_online_mems
   Cyclomatic Complexity 2 mm/memory_hotplug.c:set_online_page_callback
   Cyclomatic Complexity 2 mm/memory_hotplug.c:restore_online_page_callback
   Cyclomatic Complexity 2 mm/memory_hotplug.c:mem_hotplug_begin
   Cyclomatic Complexity 1 mm/memory_hotplug.c:mem_hotplug_done
   Cyclomatic Complexity 1 mm/memory_hotplug.c:get_page_bootmem
   Cyclomatic Complexity 2 mm/memory_hotplug.c:put_page_bootmem
   Cyclomatic Complexity 9 mm/memory_hotplug.c:zone_can_shift
   Cyclomatic Complexity 16 mm/memory_hotplug.c:online_pages
   Cyclomatic Complexity 4 mm/memory_hotplug.c:try_online_node
   Cyclomatic Complexity 2 mm/memory_hotplug.c:zone_for_memory
   Cyclomatic Complexity 3 mm/memory_hotplug.c:is_mem_section_removable
   Cyclomatic Complexity 7 mm/memory_hotplug.c:test_pages_in_a_zone
   Cyclomatic Complexity 22 mm/memory_hotplug.c:__offline_pages
   Cyclomatic Complexity 1 mm/memory_hotplug.c:offline_pages
   Cyclomatic Complexity 9 mm/memory_hotplug.c:walk_memory_range
   Cyclomatic Complexity 8 mm/memory_hotplug.c:add_memory_resource
   Cyclomatic Complexity 3 mm/memory_hotplug.c:add_memory
   Cyclomatic Complexity 1 mm/memory_hotplug.c:remove_memory
   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from mm/memory_hotplug.c:7:
   mm/memory_hotplug.c: In function 'new_node_page':
   include/linux/compiler.h:518:38: error: call to '__compiletime_assert_1575' declared with attribute error: BUILD_BUG failed
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                         ^
   include/linux/compiler.h:501:4: note: in definition of macro '__compiletime_assert'
       prefix ## suffix();    \
       ^~~~~~
   include/linux/compiler.h:518:2: note: in expansion of macro '_compiletime_assert'
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
     ^~~~~~~~~~~~~~~~~~~
   include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
    #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                        ^~~~~~~~~~~~~~~~~~
   include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
    #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
                        ^~~~~~~~~~~~~~~~
   include/linux/huge_mm.h:181:28: note: in expansion of macro 'BUILD_BUG'
    #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
                               ^~~~~~~~~
>> include/linux/huge_mm.h:56:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
>> mm/memory_hotplug.c:1575:27: note: in expansion of macro 'HPAGE_PMD_ORDER'
     if (new_page && order == HPAGE_PMD_ORDER)
                              ^~~~~~~~~~~~~~~

vim +/HPAGE_PMD_ORDER +1575 mm/memory_hotplug.c

  1559			gfp_mask |= GFP_TRANSHUGE;
  1560		}
  1561	
  1562		node_clear(nid, nmask);
  1563	
  1564		if (PageHighMem(page)
  1565		    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
  1566			gfp_mask |= __GFP_HIGHMEM;
  1567	
  1568		if (!nodes_empty(nmask))
  1569			new_page = __alloc_pages_nodemask(gfp_mask, order,
  1570						node_zonelist(nid, gfp_mask), &nmask);
  1571		if (!new_page)
  1572			new_page = __alloc_pages(gfp_mask, order,
  1573						node_zonelist(nid, gfp_mask));
  1574	
> 1575		if (new_page && order == HPAGE_PMD_ORDER)
  1576			prep_transhuge_page(new_page);
  1577	
  1578		return new_page;
  1579	}
  1580	
  1581	#define NR_OFFLINE_AT_ONCE_PAGES	(256)
  1582	static int
  1583	do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 21730 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 12/12] mm: memory_hotplug: memory hotremove supports thp migration
  2016-11-07 23:31 ` [PATCH v2 12/12] mm: memory_hotplug: memory hotremove " Naoya Horiguchi
  2016-11-08  0:29   ` kbuild test robot
@ 2016-11-08  0:30   ` kbuild test robot
  2016-11-08  1:36     ` Naoya Horiguchi
  1 sibling, 1 reply; 71+ messages in thread
From: kbuild test robot @ 2016-11-08  0:30 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: kbuild-all, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Dave Hansen, Andrea Arcangeli, Mel Gorman,
	Michal Hocko, Vlastimil Babka, Pavel Emelyanov, Zi Yan,
	Balbir Singh, linux-kernel, Naoya Horiguchi, Naoya Horiguchi

[-- Attachment #1: Type: text/plain, Size: 3756 bytes --]

Hi Naoya,

[auto build test WARNING on mmotm/master]
[also build test WARNING on next-20161028]
[cannot apply to v4.9-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x003-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   mm/memory_hotplug.c: In function 'try_offline_node':
   mm/memory_hotplug.c:2131:6: warning: unused variable 'i' [-Wunused-variable]
     int i;
         ^
   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from mm/memory_hotplug.c:7:
   mm/memory_hotplug.c: In function 'new_node_page':
   include/linux/compiler.h:518:38: error: call to '__compiletime_assert_1575' declared with attribute error: BUILD_BUG failed
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                         ^
   include/linux/compiler.h:160:16: note: in definition of macro '__trace_if'
      ______r = !!(cond);     \
                   ^~~~
>> mm/memory_hotplug.c:1575:2: note: in expansion of macro 'if'
     if (new_page && order == HPAGE_PMD_ORDER)
     ^~
   include/linux/compiler.h:506:2: note: in expansion of macro '__compiletime_assert'
     __compiletime_assert(condition, msg, prefix, suffix)
     ^~~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:518:2: note: in expansion of macro '_compiletime_assert'
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
     ^~~~~~~~~~~~~~~~~~~
   include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
    #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                        ^~~~~~~~~~~~~~~~~~
   include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
    #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
                        ^~~~~~~~~~~~~~~~
   include/linux/huge_mm.h:181:28: note: in expansion of macro 'BUILD_BUG'
    #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
                               ^~~~~~~~~
   include/linux/huge_mm.h:56:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
    #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
                             ^~~~~~~~~~~~~~~
   mm/memory_hotplug.c:1575:27: note: in expansion of macro 'HPAGE_PMD_ORDER'
     if (new_page && order == HPAGE_PMD_ORDER)
                              ^~~~~~~~~~~~~~~

vim +/if +1575 mm/memory_hotplug.c

  1559			gfp_mask |= GFP_TRANSHUGE;
  1560		}
  1561	
  1562		node_clear(nid, nmask);
  1563	
  1564		if (PageHighMem(page)
  1565		    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
  1566			gfp_mask |= __GFP_HIGHMEM;
  1567	
  1568		if (!nodes_empty(nmask))
  1569			new_page = __alloc_pages_nodemask(gfp_mask, order,
  1570						node_zonelist(nid, gfp_mask), &nmask);
  1571		if (!new_page)
  1572			new_page = __alloc_pages(gfp_mask, order,
  1573						node_zonelist(nid, gfp_mask));
  1574	
> 1575		if (new_page && order == HPAGE_PMD_ORDER)
  1576			prep_transhuge_page(new_page);
  1577	
  1578		return new_page;
  1579	}
  1580	
  1581	#define NR_OFFLINE_AT_ONCE_PAGES	(256)
  1582	static int
  1583	do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28889 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-08  0:23   ` kbuild test robot
@ 2016-11-08  1:22     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-08  1:22 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Dave Hansen, Andrea Arcangeli, Mel Gorman,
	Michal Hocko, Vlastimil Babka, Pavel Emelyanov, Zi Yan,
	Balbir Singh, linux-kernel, Naoya Horiguchi

On Tue, Nov 08, 2016 at 08:23:50AM +0800, kbuild test robot wrote:
> Hi Naoya,
> 
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on v4.9-rc4 next-20161028]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: i386-randconfig-x007-201645 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All error/warnings (new ones prefixed by >>):
> 
>    mm/memory.c: In function 'copy_pmd_range':
> >> mm/memory.c:1002:7: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
>       if (pmd_related(*src_pmd)) {
>           ^~~~~~~~~~~
>    cc1: some warnings being treated as errors
> --

I forgot to declare a noop routine for CONFIG_TRANSPARENT_HUGEPAGE=n.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -222,6 +229,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
+static inline int pmd_related(pmd_t pmd)
+{
+	return 0;
+}
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {


>    mm/mremap.c: In function 'move_page_tables':
> >> mm/mremap.c:197:7: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
>       if (pmd_related(*old_pmd)) {
>           ^~~~~~~~~~~
>    In file included from include/asm-generic/bug.h:4:0,
>                     from arch/x86/include/asm/bug.h:35,
>                     from include/linux/bug.h:4,
>                     from include/linux/mmdebug.h:4,
>                     from include/linux/mm.h:8,
>                     from mm/mremap.c:10:
> >> include/linux/compiler.h:518:38: error: call to '__compiletime_assert_198' declared with attribute error: BUILD_BUG failed
>      _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>                                          ^
>    include/linux/compiler.h:501:4: note: in definition of macro '__compiletime_assert'
>        prefix ## suffix();    \
>        ^~~~~~
>    include/linux/compiler.h:518:2: note: in expansion of macro '_compiletime_assert'
>      _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>      ^~~~~~~~~~~~~~~~~~~
>    include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
>     #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>                                         ^~~~~~~~~~~~~~~~~~
>    include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>     #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>                         ^~~~~~~~~~~~~~~~
> >> include/linux/huge_mm.h:183:27: note: in expansion of macro 'BUILD_BUG'
>     #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
>                               ^~~~~~~~~
> >> mm/mremap.c:198:18: note: in expansion of macro 'HPAGE_PMD_SIZE'
>        if (extent == HPAGE_PMD_SIZE) {
>                      ^~~~~~~~~~~~~~
>    cc1: some warnings being treated as errors

HPAGE_PMD_SIZE is available only with CONFIG_TRANSPARENT_HUGEPAGE=y, and
this code looks like it violates that rule, but it sits inside an
if (pmd_related()) block which is compiled out when
CONFIG_TRANSPARENT_HUGEPAGE=n, so it is OK only together with the above
change.
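
For what it's worth, here is a self-contained (non-kernel) sketch of the
mechanism relied on above: the build error comes from a call to an
error-attributed function, and gcc only reports it when the call survives
dead-code elimination, which the constant-0 stub guarantees at -O2.  All
names below are made up for illustration:

/* standalone illustration of the compile-out pattern, not kernel code */
static inline int stub_pmd_related(void)	/* stands for the =n stub */
{
	return 0;
}

extern void never_emitted(void) __attribute__((error("BUILD_BUG-like failure")));
#define FAKE_HPAGE_PMD_SIZE ({ never_emitted(); 0UL; })

unsigned long walk(unsigned long extent)
{
	if (stub_pmd_related()) {
		/* branch is eliminated at -O2, so never_emitted() vanishes
		 * and its error attribute never fires */
		if (extent == FAKE_HPAGE_PMD_SIZE)
			return 1;
	}
	return 0;
}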

Thanks,
Naoya Horiguchi

> --
>    mm/madvise.c: In function 'madvise_free_pte_range':
> >> mm/madvise.c:277:6: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
>      if (pmd_related(*pmd))
>          ^~~~~~~~~~~
>    cc1: some warnings being treated as errors
> 
> vim +/pmd_related +1002 mm/memory.c
> 
>    996		dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
>    997		if (!dst_pmd)
>    998			return -ENOMEM;
>    999		src_pmd = pmd_offset(src_pud, addr);
>   1000		do {
>   1001			next = pmd_addr_end(addr, end);
> > 1002			if (pmd_related(*src_pmd)) {
>   1003				int err;
>   1004				VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
>   1005				err = copy_huge_pmd(dst_mm, src_mm,
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 12/12] mm: memory_hotplug: memory hotremove supports thp migration
  2016-11-08  0:30   ` kbuild test robot
@ 2016-11-08  1:36     ` Naoya Horiguchi
  2016-11-08  1:41       ` [PATCH] mm: fix unused variable warning Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-08  1:36 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Dave Hansen, Andrea Arcangeli, Mel Gorman,
	Michal Hocko, Vlastimil Babka, Pavel Emelyanov, Zi Yan,
	Balbir Singh, linux-kernel, Naoya Horiguchi

On Tue, Nov 08, 2016 at 08:30:10AM +0800, kbuild test robot wrote:
> Hi Naoya,
> 
> [auto build test WARNING on mmotm/master]
> [also build test WARNING on next-20161028]
> [cannot apply to v4.9-rc4]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: x86_64-randconfig-x003-201645 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> All warnings (new ones prefixed by >>):
> 
>    mm/memory_hotplug.c: In function 'try_offline_node':
>    mm/memory_hotplug.c:2131:6: warning: unused variable 'i' [-Wunused-variable]
>      int i;
>          ^
>    In file included from include/uapi/linux/stddef.h:1:0,
>                     from include/linux/stddef.h:4,
>                     from mm/memory_hotplug.c:7:

This seems unrelated to my patchset, but the fix is easy.
I'll post a separate patch later.

>    mm/memory_hotplug.c: In function 'new_node_page':
>    include/linux/compiler.h:518:38: error: call to '__compiletime_assert_1575' declared with attribute error: BUILD_BUG failed
>      _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>                                          ^
>    include/linux/compiler.h:160:16: note: in definition of macro '__trace_if'
>       ______r = !!(cond);     \
>                    ^~~~
> >> mm/memory_hotplug.c:1575:2: note: in expansion of macro 'if'
>      if (new_page && order == HPAGE_PMD_ORDER)
>      ^~
>    include/linux/compiler.h:506:2: note: in expansion of macro '__compiletime_assert'
>      __compiletime_assert(condition, msg, prefix, suffix)
>      ^~~~~~~~~~~~~~~~~~~~
>    include/linux/compiler.h:518:2: note: in expansion of macro '_compiletime_assert'
>      _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>      ^~~~~~~~~~~~~~~~~~~
>    include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
>     #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>                                         ^~~~~~~~~~~~~~~~~~
>    include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>     #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>                         ^~~~~~~~~~~~~~~~
>    include/linux/huge_mm.h:181:28: note: in expansion of macro 'BUILD_BUG'
>     #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>                                ^~~~~~~~~
>    include/linux/huge_mm.h:56:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
>     #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
>                              ^~~~~~~~~~~~~~~
>    mm/memory_hotplug.c:1575:27: note: in expansion of macro 'HPAGE_PMD_ORDER'
>      if (new_page && order == HPAGE_PMD_ORDER)
>                               ^~~~~~~~~~~~~~~

HPAGE_PMD_ORDER is not available to non-thp code at the moment, so let's add
a simple wrapper so that generic code can use it.


diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3c252cdef587..b75a9a1bbf3e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -148,6 +148,12 @@ static inline int hpage_nr_pages(struct page *page)
 		return HPAGE_PMD_NR;
 	return 1;
 }
+static inline int hpage_order(struct page *page)
+{
+	if (unlikely(PageTransHuge(page)))
+		return HPAGE_PMD_ORDER;
+	return 0;
+}
 
 extern int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd);
 
@@ -183,6 +189,7 @@ static inline bool thp_migration_supported(void)
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
 #define hpage_nr_pages(x) 1
+#define hpage_order(x) 0
 
 #define transparent_hugepage_enabled(__vma) 0
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a9c3fe1b55ea..d612a75ceec4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1555,7 +1555,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 					next_node_in(nid, nmask));
 
 	if (thp_migration_supported() && PageTransHuge(page)) {
-		order = HPAGE_PMD_ORDER;
+		order = hpage_order(page);
 		gfp_mask |= GFP_TRANSHUGE;
 	}
 
@@ -1572,7 +1572,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		new_page = __alloc_pages(gfp_mask, order,
 					node_zonelist(nid, gfp_mask));
 
-	if (new_page && order == HPAGE_PMD_ORDER)
+	if (new_page && order == hpage_order(page))
 		prep_transhuge_page(new_page);
 
 	return new_page;
@@ -1606,7 +1606,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			continue;
 		} else if (thp_migration_supported() && PageTransHuge(page))
 			pfn = page_to_pfn(compound_head(page))
-				+ HPAGE_PMD_NR - 1;
+				+ hpage_nr_pages(page) - 1;
 
 		if (!get_page_unless_zero(page))
 			continue;

These changes are applied in the next version.

Thanks,
Naoya Horiguchi

> 
> vim +/if +1575 mm/memory_hotplug.c
> 
>   1559			gfp_mask |= GFP_TRANSHUGE;
>   1560		}
>   1561	
>   1562		node_clear(nid, nmask);
>   1563	
>   1564		if (PageHighMem(page)
>   1565		    || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
>   1566			gfp_mask |= __GFP_HIGHMEM;
>   1567	
>   1568		if (!nodes_empty(nmask))
>   1569			new_page = __alloc_pages_nodemask(gfp_mask, order,
>   1570						node_zonelist(nid, gfp_mask), &nmask);
>   1571		if (!new_page)
>   1572			new_page = __alloc_pages(gfp_mask, order,
>   1573						node_zonelist(nid, gfp_mask));
>   1574	
> > 1575		if (new_page && order == HPAGE_PMD_ORDER)
>   1576			prep_transhuge_page(new_page);
>   1577	
>   1578		return new_page;
>   1579	}
>   1580	
>   1581	#define NR_OFFLINE_AT_ONCE_PAGES	(256)
>   1582	static int
>   1583	do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH] mm: fix unused variable warning
  2016-11-08  1:36     ` Naoya Horiguchi
@ 2016-11-08  1:41       ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-08  1:41 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Dave Hansen, Andrea Arcangeli, Mel Gorman,
	Michal Hocko, Vlastimil Babka, Pavel Emelyanov, Zi Yan,
	Balbir Singh, linux-kernel, Naoya Horiguchi

Fix the following warning:

  mm/memory_hotplug.c: In function 'try_offline_node':
  mm/memory_hotplug.c:2131:6: warning: unused variable 'i' [-Wunused-variable]
    int i;
        ^

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory_hotplug.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 534348ddd285..d612a75ceec4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2128,7 +2128,6 @@ void try_offline_node(int nid)
 	unsigned long start_pfn = pgdat->node_start_pfn;
 	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
 	unsigned long pfn;
-	int i;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		unsigned long section_nr = pfn_to_section_nr(pfn);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-07 23:31 ` [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration Naoya Horiguchi
@ 2016-11-08  3:05   ` kbuild test robot
  2016-11-08  6:46     ` Naoya Horiguchi
  2016-11-08  8:13   ` Anshuman Khandual
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 71+ messages in thread
From: kbuild test robot @ 2016-11-08  3:05 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: kbuild-all, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Dave Hansen, Andrea Arcangeli, Mel Gorman,
	Michal Hocko, Vlastimil Babka, Pavel Emelyanov, Zi Yan,
	Balbir Singh, linux-kernel, Naoya Horiguchi, Naoya Horiguchi

[-- Attachment #1: Type: text/plain, Size: 1469 bytes --]

Hi Naoya,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.9-rc4 next-20161028]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: arm-at91_dt_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   In file included from fs/proc/task_mmu.c:14:0:
   include/linux/swapops.h: In function 'swp_entry_to_pmd':
>> include/linux/swapops.h:216:14: error: empty scalar initializer
     pmd_t pmd = {};
                 ^
   include/linux/swapops.h:216:14: note: (near initialization for 'pmd')

vim +216 include/linux/swapops.h

   210	{
   211		return swp_entry(0, 0);
   212	}
   213	
   214	static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
   215	{
 > 216		pmd_t pmd = {};
   217	
   218		return pmd;
   219	}

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 21955 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-08  3:05   ` kbuild test robot
@ 2016-11-08  6:46     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-08  6:46 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Dave Hansen, Andrea Arcangeli, Mel Gorman,
	Michal Hocko, Vlastimil Babka, Pavel Emelyanov, Zi Yan,
	Balbir Singh, linux-kernel, Naoya Horiguchi

On Tue, Nov 08, 2016 at 11:05:52AM +0800, kbuild test robot wrote:
> Hi Naoya,
> 
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on v4.9-rc4 next-20161028]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: arm-at91_dt_defconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm 
> 
> All errors (new ones prefixed by >>):
> 
>    In file included from fs/proc/task_mmu.c:14:0:
>    include/linux/swapops.h: In function 'swp_entry_to_pmd':
> >> include/linux/swapops.h:216:14: error: empty scalar initializer
>      pmd_t pmd = {};
>                  ^
>    include/linux/swapops.h:216:14: note: (near initialization for 'pmd')
> 
> vim +216 include/linux/swapops.h
> 
>    210	{
>    211		return swp_entry(0, 0);
>    212	}
>    213	
>    214	static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
>    215	{
>  > 216		pmd_t pmd = {};
>    217	
>    218		return pmd;
>    219	}

Here is an alternative:

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index db8a858cc6ff..748c9233b3a5 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -215,9 +215,7 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
 
 static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
 {
-	pmd_t pmd = {};
-
-	return pmd;
+	return (pmd_t) { 0 };
 }
 
 static inline int is_pmd_migration_entry(pmd_t pmd)

Thanks,
Naoya Horiguchi

> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-07 23:31 ` [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration Naoya Horiguchi
  2016-11-08  3:05   ` kbuild test robot
@ 2016-11-08  8:13   ` Anshuman Khandual
  2016-11-08  8:16     ` Naoya Horiguchi
  2016-11-10  8:29   ` Anshuman Khandual
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 71+ messages in thread
From: Anshuman Khandual @ 2016-11-08  8:13 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> This patch prepares thp migration's core code. These code will be open when
> unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> to allocate destination thps.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
> ---
>  arch/x86/include/asm/pgtable_64.h |   2 +
>  include/linux/swapops.h           |  61 +++++++++++++++
>  mm/huge_memory.c                  | 154 ++++++++++++++++++++++++++++++++++++++
>  mm/migrate.c                      |  44 ++++++++++-
>  mm/pgtable-generic.c              |   3 +-
>  5 files changed, 262 insertions(+), 2 deletions(-)
> 
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> index 1cc82ec..3a1b48e 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> @@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>  					 ((type) << (SWP_TYPE_FIRST_BIT)) \
>  					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
> +#define __pmd_to_swp_entry(pte)		((swp_entry_t) { pmd_val((pmd)) })

The above macro takes *pte* but evaluates *pmd*; I guess it should
be fixed either way.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-08  8:13   ` Anshuman Khandual
@ 2016-11-08  8:16     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-08  8:16 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Naoya Horiguchi, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Dave Hansen, Andrea Arcangeli, Mel Gorman,
	Michal Hocko, Vlastimil Babka, Pavel Emelyanov, Zi Yan,
	Balbir Singh, linux-kernel

On Tue, Nov 08, 2016 at 01:43:54PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > This patch prepares thp migration's core code. These code will be open when
> > unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> > to allocate destination thps.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> > ChangeLog v1 -> v2:
> > - support pte-mapped thp, doubly-mapped thp
> > ---
> >  arch/x86/include/asm/pgtable_64.h |   2 +
> >  include/linux/swapops.h           |  61 +++++++++++++++
> >  mm/huge_memory.c                  | 154 ++++++++++++++++++++++++++++++++++++++
> >  mm/migrate.c                      |  44 ++++++++++-
> >  mm/pgtable-generic.c              |   3 +-
> >  5 files changed, 262 insertions(+), 2 deletions(-)
> > 
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> > index 1cc82ec..3a1b48e 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> > @@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
> >  					 ((type) << (SWP_TYPE_FIRST_BIT)) \
> >  					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
> >  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
> > +#define __pmd_to_swp_entry(pte)		((swp_entry_t) { pmd_val((pmd)) })
> 
> The above macro takes *pte* but evaluates on *pmd*, guess its should
> be fixed either way.

Right, I'll fix it.
Thank you very much.

- Naoya

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (11 preceding siblings ...)
  2016-11-07 23:31 ` [PATCH v2 12/12] mm: memory_hotplug: memory hotremove " Naoya Horiguchi
@ 2016-11-09  2:32 ` Balbir Singh
  2016-11-09  4:59   ` Naoya Horiguchi
  2016-11-09 10:33 ` Anshuman Khandual
  2017-01-30 10:47 ` Anshuman Khandual
  14 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2016-11-09  2:32 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, linux-kernel, Naoya Horiguchi

On 08/11/16 10:31, Naoya Horiguchi wrote:
> Hi everyone,
> 
> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
> with feedbacks for ver.1.
> 
> General description (no change since ver.1)
> ===========================================
> 
> This patchset enhances page migration functionality to handle thp migration
> for various page migration's callers:
>  - mbind(2)
>  - move_pages(2)
>  - migrate_pages(2)
>  - cgroup/cpuset migration
>  - memory hotremove
>  - soft offline
> 
> The main benefit is that we can avoid unnecessary thp splits, which helps us
> avoid performance decrease when your applications handles NUMA optimization on
> their own.
> 
> The implementation is similar to that of normal page migration, the key point
> is that we modify a pmd to a pmd migration entry in swap-entry like format.
> 
> Changes / Notes
> ===============
> 
> - pmd_present() in x86 checks _PAGE_PRESENT, _PAGE_PROTNONE and _PAGE_PSE
>   bits together, which makes implementing thp migration a bit hard because
>   _PAGE_PSE bit is currently used by soft-dirty in swap-entry format.
>   I was advised to dropping _PAGE_PSE in pmd_present(), but I don't think
>   of the justification, so I keep it in this version. Instead, my approach
>   is to move _PAGE_SWP_SOFT_DIRTY to bit 6 (unused) and reserve bit 7 for
>   pmd non-present cases.

Thanks, IIRC

pmd_present = _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE

AutoNUMA balancing would change it to

pmd_present = _PAGE_PROTNONE | _PAGE_PSE

and PMD_SWP_SOFT_DIRTY would make it

pmd_present = _PAGE_PSE

What you seem to be suggesting in your comment is that

pmd_present should be _PAGE_PRESENT | _PAGE_PROTNONE

Isn't that good enough?

For THP migration I guess we use

_PAGE_PRESENT | _PAGE_PROTNONE | is_migration_entry(pmd)
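
For reference, the x86 helper we are talking about boils down to roughly the
following (reconstructed from the description above, so the exact form may
differ):

static inline int pmd_present(pmd_t pmd)
{
	/*
	 * _PAGE_PSE is included so that a pmd whose _PAGE_PRESENT bit is
	 * temporarily cleared during thp split still reads as present.
	 */
	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
}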


> 
> - this patchset still covers only x86_64. Zi Yan posted a patch for ppc64
>   and I think it's favorably received so that's fine. But there's unsolved
>   minor suggestion by Aneesh, so I don't include it in this set, expecting
>   that it will be updated/reposted.
> 
> - pte-mapped thp and doubly-mapped thp were not supported in ver.1, but
>   this version should work for such kinds of thp.
> 
> - thp page cache is not tested yet, and it's at the head of my todo list
>   for future version.
> 
> Any comments or advices are welcomed.

Balbir Singh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2016-11-09  2:32 ` [PATCH v2 00/12] mm: page migration enhancement for thp Balbir Singh
@ 2016-11-09  4:59   ` Naoya Horiguchi
  2016-11-09 21:28     ` Balbir Singh
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-09  4:59 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, linux-kernel,
	Naoya Horiguchi

On Wed, Nov 09, 2016 at 01:32:04PM +1100, Balbir Singh wrote:
> On 08/11/16 10:31, Naoya Horiguchi wrote:
> > Hi everyone,
> > 
> > I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
> > with feedbacks for ver.1.
> > 
> > General description (no change since ver.1)
> > ===========================================
> > 
> > This patchset enhances page migration functionality to handle thp migration
> > for various page migration's callers:
> >  - mbind(2)
> >  - move_pages(2)
> >  - migrate_pages(2)
> >  - cgroup/cpuset migration
> >  - memory hotremove
> >  - soft offline
> > 
> > The main benefit is that we can avoid unnecessary thp splits, which helps us
> > avoid performance decrease when your applications handles NUMA optimization on
> > their own.
> > 
> > The implementation is similar to that of normal page migration, the key point
> > is that we modify a pmd to a pmd migration entry in swap-entry like format.
> > 
> > Changes / Notes
> > ===============
> > 
> > - pmd_present() in x86 checks _PAGE_PRESENT, _PAGE_PROTNONE and _PAGE_PSE
> >   bits together, which makes implementing thp migration a bit hard because
> >   _PAGE_PSE bit is currently used by soft-dirty in swap-entry format.
> >   I was advised to dropping _PAGE_PSE in pmd_present(), but I don't think
> >   of the justification, so I keep it in this version. Instead, my approach
> >   is to move _PAGE_SWP_SOFT_DIRTY to bit 6 (unused) and reserve bit 7 for
> >   pmd non-present cases.
> 
> Thanks, IIRC
> 
> pmd_present = _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE
> 
> AutoNUMA balancing would change it to
> 
> pmd_present = _PAGE_PROTNONE | _PAGE_PSE
> 
> and PMD_SWP_SOFT_DIRTY would make it
> 
> pmd_present = _PAGE_PSE
> 
> What you seem to be suggesting in your comment is that
> 
> pmd_present should be _PAGE_PRESENT | _PAGE_PROTNONE

This (no _PAGE_PSE) was a possible solution, and as I described I gave up
this solution, because I noticed that what I actually wanted was that
pmd_present() certainly returns false during thp migration and that's done
by moving _PAGE_SWP_SOFT_DIRTY. So

  pmd_present = _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE

is still correct in this patchset.

> 
> Isn't that good enough?
> 
> For THP migration I guess we use
> 
> _PAGE_PRESENT | _PAGE_PROTNONE | is_migration_entry(pmd)

Though I might misread your notations, I hope that the following code
describes itself well.

  static inline int is_pmd_migration_entry(pmd_t pmd)                            
  {                                                                              
          return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd)); 
  }                                                                              

Thanks,
Naoya Horiguchi

> 
> 
> > 
> > - this patchset still covers only x86_64. Zi Yan posted a patch for ppc64
> >   and I think it's favorably received so that's fine. But there's unsolved
> >   minor suggestion by Aneesh, so I don't include it in this set, expecting
> >   that it will be updated/reposted.
> > 
> > - pte-mapped thp and doubly-mapped thp were not supported in ver.1, but
> >   this version should work for such kinds of thp.
> > 
> > - thp page cache is not tested yet, and it's at the head of my todo list
> >   for future version.
> > 
> > Any comments or advices are welcomed.
> 
> Balbir Singh
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (12 preceding siblings ...)
  2016-11-09  2:32 ` [PATCH v2 00/12] mm: page migration enhancement for thp Balbir Singh
@ 2016-11-09 10:33 ` Anshuman Khandual
  2016-11-09 23:52   ` Naoya Horiguchi
  2017-01-30 10:47 ` Anshuman Khandual
  14 siblings, 1 reply; 71+ messages in thread
From: Anshuman Khandual @ 2016-11-09 10:33 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> Hi everyone,
> 
> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
> with feedbacks for ver.1.
> 
> General description (no change since ver.1)
> ===========================================
> 
> This patchset enhances page migration functionality to handle thp migration
> for various page migration's callers:
>  - mbind(2)
>  - move_pages(2)
>  - migrate_pages(2)
>  - cgroup/cpuset migration
>  - memory hotremove
>  - soft offline
> 
> The main benefit is that we can avoid unnecessary thp splits, which helps us
> avoid performance decrease when your applications handles NUMA optimization on
> their own.
> 
> The implementation is similar to that of normal page migration, the key point
> is that we modify a pmd to a pmd migration entry in swap-entry like format.

Will it be better to have new THP_MIGRATE_SUCCESS and THP_MIGRATE_FAIL
VM events to capture how many times the migration worked without first
splitting the huge page and how many times it did not work ? Also do you
have a test case which demonstrates this THP migration and kind of shows
it's better than the present split and move method ?
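
(A rough sketch of how such counters could be wired up, using the event names
suggested above; this is purely illustrative and not part of the posted
series, which may end up looking different.)

  /* include/linux/vm_event_item.h: new entries in enum vm_event_item */
  	THP_MIGRATE_SUCCESS,
  	THP_MIGRATE_FAIL,

  /* mm/vmstat.c: matching strings in vmstat_text[] */
  	"thp_migrate_success",
  	"thp_migrate_fail",

  /* mm/migrate.c: bump the counters after a thp migration attempt */
  	if (rc == MIGRATEPAGE_SUCCESS)
  		count_vm_event(THP_MIGRATE_SUCCESS);
  	else
  		count_vm_event(THP_MIGRATE_FAIL);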

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2016-11-09  4:59   ` Naoya Horiguchi
@ 2016-11-09 21:28     ` Balbir Singh
  0 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2016-11-09 21:28 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, linux-kernel,
	Naoya Horiguchi



On 09/11/16 15:59, Naoya Horiguchi wrote:
> On Wed, Nov 09, 2016 at 01:32:04PM +1100, Balbir Singh wrote:
>> On 08/11/16 10:31, Naoya Horiguchi wrote:
>>> Hi everyone,
>>>
>>> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
>>> with feedbacks for ver.1.
>>>
>>> General description (no change since ver.1)
>>> ===========================================
>>>
>>> This patchset enhances page migration functionality to handle thp migration
>>> for various page migration's callers:
>>>  - mbind(2)
>>>  - move_pages(2)
>>>  - migrate_pages(2)
>>>  - cgroup/cpuset migration
>>>  - memory hotremove
>>>  - soft offline
>>>
>>> The main benefit is that we can avoid unnecessary thp splits, which helps us
>>> avoid performance decrease when your applications handles NUMA optimization on
>>> their own.
>>>
>>> The implementation is similar to that of normal page migration, the key point
>>> is that we modify a pmd to a pmd migration entry in swap-entry like format.
>>>
>>> Changes / Notes
>>> ===============
>>>
>>> - pmd_present() in x86 checks _PAGE_PRESENT, _PAGE_PROTNONE and _PAGE_PSE
>>>   bits together, which makes implementing thp migration a bit hard because
>>>   _PAGE_PSE bit is currently used by soft-dirty in swap-entry format.
>>>   I was advised to dropping _PAGE_PSE in pmd_present(), but I don't think
>>>   of the justification, so I keep it in this version. Instead, my approach
>>>   is to move _PAGE_SWP_SOFT_DIRTY to bit 6 (unused) and reserve bit 7 for
>>>   pmd non-present cases.
>>
>> Thanks, IIRC
>>
>> pmd_present = _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE
>>
>> AutoNUMA balancing would change it to
>>
>> pmd_present = _PAGE_PROTNONE | _PAGE_PSE
>>
>> and PMD_SWP_SOFT_DIRTY would make it
>>
>> pmd_present = _PAGE_PSE
>>
>> What you seem to be suggesting in your comment is that
>>
>> pmd_present should be _PAGE_PRESENT | _PAGE_PROTNONE
> 
> This (no _PAGE_PSE) was a possible solution, and as I described I gave up
> this solution, because I noticed that what I actually wanted was that
> pmd_present() certainly returns false during thp migration and that's done
> by moving _PAGE_SWP_SOFT_DIRTY. So
> 
>   pmd_present = _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE
> 
> is still correct in this patchset.
> 

Thanks, I was wondering if there is any advantage, or whether you felt it's
safer not to change pmd_present().

>>
>> Isn't that good enough?
>>
>> For THP migration I guess we use
>>
>> _PAGE_PRESENT | _PAGE_PROTNONE | is_migration_entry(pmd)
> 
> Though I might misread your notations, I hope that the following code
> describes itself well.
> 
>   static inline int is_pmd_migration_entry(pmd_t pmd)                            
>   {                                                                              
>           return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd)); 
>   }                                                                              
> 

Thanks, yes my notation is not the best.



Balbir Singh.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2016-11-09 10:33 ` Anshuman Khandual
@ 2016-11-09 23:52   ` Naoya Horiguchi
  2016-11-10 14:01     ` Zi Yan
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-09 23:52 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

Hi Anshuman,

On Wed, Nov 09, 2016 at 04:03:04PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > Hi everyone,
> > 
> > I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
> > with feedbacks for ver.1.
> > 
> > General description (no change since ver.1)
> > ===========================================
> > 
> > This patchset enhances page migration functionality to handle thp migration
> > for various page migration's callers:
> >  - mbind(2)
> >  - move_pages(2)
> >  - migrate_pages(2)
> >  - cgroup/cpuset migration
> >  - memory hotremove
> >  - soft offline
> > 
> > The main benefit is that we can avoid unnecessary thp splits, which helps us
> > avoid performance decrease when your applications handles NUMA optimization on
> > their own.
> > 
> > The implementation is similar to that of normal page migration, the key point
> > is that we modify a pmd to a pmd migration entry in swap-entry like format.
> 
> Will it be better to have new THP_MIGRATE_SUCCESS and THP_MIGRATE_FAIL
> VM events to capture how many times the migration worked without first
> splitting the huge page and how many times it did not work ?

Thank you for the suggestion.
I think that's helpful, so will try it in next version.

> Also do you
> have a test case which demonstrates this THP migration and kind of shows
> it's better than the present split and move method ?

I don't have test cases which compare thp migration and split-then-migration
with some numbers. Maybe measuring/comparing the overhead of migration is
a good start point, although I think the real benefit of thp migration comes
from workload "after migration" by avoiding thp split.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-07 23:31 ` [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration Naoya Horiguchi
  2016-11-08  3:05   ` kbuild test robot
  2016-11-08  8:13   ` Anshuman Khandual
@ 2016-11-10  8:29   ` Anshuman Khandual
  2016-11-10  9:43     ` Naoya Horiguchi
  2016-11-14 11:45   ` Kirill A. Shutemov
  2016-11-28 14:31   ` Michal Hocko
  4 siblings, 1 reply; 71+ messages in thread
From: Anshuman Khandual @ 2016-11-10  8:29 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> This patch prepares thp migration's core code. These code will be open when
> unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> to allocate destination thps.
> 

Snip

> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
> ---
>  arch/x86/include/asm/pgtable_64.h |   2 +
>  include/linux/swapops.h           |  61 +++++++++++++++
>  mm/huge_memory.c                  | 154 ++++++++++++++++++++++++++++++++++++++
>  mm/migrate.c                      |  44 ++++++++++-
>  mm/pgtable-generic.c              |   3 +-
>  5 files changed, 262 insertions(+), 2 deletions(-)


> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> index 71c5f91..6012343 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
>  {
>  	pmd_t pmd;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> +	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> +		  !pmd_devmap(*pmdp))

It's a valid VM_BUG_ON check, but is it related to THP migration or
just a regular fixup ?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 03/12] mm: thp: introduce separate TTU flag for thp freezing
  2016-11-07 23:31 ` [PATCH v2 03/12] mm: thp: introduce separate TTU flag for thp freezing Naoya Horiguchi
@ 2016-11-10  8:31   ` Anshuman Khandual
  2016-11-10  9:09     ` Naoya Horiguchi
  2016-11-11 11:16   ` Kirill A. Shutemov
  1 sibling, 1 reply; 71+ messages in thread
From: Anshuman Khandual @ 2016-11-10  8:31 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> TTU_MIGRATION is used to convert pte into migration entry until thp split
> completes. This behavior conflicts with thp migration added later patches,

Hmm, could you please explain why it conflicts with the PMD based
migration without split ? Why TTU_MIGRATION cannot be used to
freeze/hold on the PMD while it's being migrated ?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-07 23:31 ` [PATCH v2 07/12] mm: thp: check pmd migration entry in common path Naoya Horiguchi
  2016-11-08  0:23   ` kbuild test robot
@ 2016-11-10  8:36   ` Anshuman Khandual
  2016-11-10  9:12     ` Naoya Horiguchi
  2016-11-10  9:08   ` Hillf Danton
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 71+ messages in thread
From: Anshuman Khandual @ 2016-11-10  8:36 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> If one of callers of page migration starts to handle thp, memory management code
> start to see pmd migration entry, so we need to prepare for it before enabling.
> This patch changes various code point which checks the status of given pmds in
> order to prevent race between thp migration and the pmd-related works.

There are a lot of changes in this one patch. Shouldn't we split
this up into multiple patches and explain them in a bit more detail
through their commit messages ?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 08/12] mm: soft-dirty: keep soft-dirty bits over thp migration
  2016-11-07 23:31 ` [PATCH v2 08/12] mm: soft-dirty: keep soft-dirty bits over thp migration Naoya Horiguchi
@ 2016-11-10  8:38   ` Anshuman Khandual
  2016-11-10  9:13     ` Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: Anshuman Khandual @ 2016-11-10  8:38 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> Soft dirty bit is designed to keep tracked over page migration. This patch

Small nit here. s/tracked/track/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-07 23:31 ` [PATCH v2 07/12] mm: thp: check pmd migration entry in common path Naoya Horiguchi
  2016-11-08  0:23   ` kbuild test robot
  2016-11-10  8:36   ` Anshuman Khandual
@ 2016-11-10  9:08   ` Hillf Danton
  2016-11-10  9:21     ` Naoya Horiguchi
  2016-11-17 23:56   ` Kirill A. Shutemov
  2016-11-28 14:35   ` Michal Hocko
  4 siblings, 1 reply; 71+ messages in thread
From: Hillf Danton @ 2016-11-10  9:08 UTC (permalink / raw)
  To: 'Naoya Horiguchi', linux-mm
  Cc: 'Kirill A. Shutemov', 'Hugh Dickins',
	'Andrew Morton', 'Dave Hansen',
	'Andrea Arcangeli', 'Mel Gorman',
	'Michal Hocko', 'Vlastimil Babka',
	'Pavel Emelyanov', 'Zi Yan',
	'Balbir Singh', linux-kernel, 'Naoya Horiguchi',
	'Anshuman Khandual'

On Tuesday, November 08, 2016 7:32 AM Naoya Horiguchi wrote:
> 
> @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
>  	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
>  		goto out_unlock;
> 
> +	if (unlikely(!pmd_present(orig_pmd)))
> +		goto out_unlock;
> +

Can we encounter a migration entry after acquiring ptl ?

>  	page = pmd_page(orig_pmd);
>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>  	/*
[...]
> @@ -3591,6 +3591,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  		int ret;
> 
>  		barrier();
> +		if (unlikely(is_pmd_migration_entry(orig_pmd))) {
> +			pmd_migration_entry_wait(mm, fe.pmd);
> +			return 0;
> +		}
>  		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
>  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
>  				return do_huge_pmd_numa_page(&fe, orig_pmd);

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 03/12] mm: thp: introduce separate TTU flag for thp freezing
  2016-11-10  8:31   ` Anshuman Khandual
@ 2016-11-10  9:09     ` Naoya Horiguchi
  2016-11-11  3:18       ` Anshuman Khandual
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-10  9:09 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Thu, Nov 10, 2016 at 02:01:56PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > TTU_MIGRATION is used to convert pte into migration entry until thp split
> > completes. This behavior conflicts with thp migration added later patches,
>
> Hmm, could you please explain why it conflicts with the PMD based
> migration without split ? Why TTU_MIGRATION cannot be used to
> freeze/hold on the PMD while it's being migrated ?

try_to_unmap() is used both for thp split (via freeze_page()) and page
migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for
head page is like below (assuming anonymous thp):

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
     TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)

and ttu_flag given for tail pages is:

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
     TTU_MIGRATION)

__unmap_and_move() calls try_to_unmap() with ttu_flag:

    (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)

Now I'm trying to insert a branch for thp migration at the top of
try_to_unmap_one() like below


  static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                       unsigned long address, void *arg)
  {
          ...
          if (flags & TTU_MIGRATION) {
                  if (!PageHuge(page) && PageTransCompound(page)) {
                          set_pmd_migration_entry(page, vma, address);
                          goto out;
                  }
          }

, so try_to_unmap() for tail pages called by thp split can go into thp
migration code path (which converts *pmd* into migration entry), while
the expectation is to freeze thp (which converts *pte* into migration entry.)

I detected this failure as a "bad page state" error in a testcase where
split_huge_page() is called from queue_pages_pte_range().
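
(A minimal sketch of the direction described above; the flag name
TTU_SPLIT_FREEZE is only an assumed name here, and the posted patch may spell
the details differently.)

  /* mm/huge_memory.c: freeze_page() passes a dedicated freeze flag instead of
   * TTU_MIGRATION (the head page additionally gets TTU_SPLIT_HUGE_PMD, as in
   * the flag lists above): */
  	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
  				   TTU_RMAP_LOCKED | TTU_SPLIT_FREEZE;

  /* mm/rmap.c: try_to_unmap_one() -- pte freezing is done for either flag,
   * but the pmd conversion stays keyed on TTU_MIGRATION alone, so it is
   * reached only from real page migration (__unmap_and_move()): */
  	if (flags & TTU_MIGRATION) {
  		if (!PageHuge(page) && PageTransCompound(page)) {
  			set_pmd_migration_entry(page, vma, address);
  			goto out;
  		}
  	}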

Anyway, I'll add this explanation into the patch description in the next post.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-10  8:36   ` Anshuman Khandual
@ 2016-11-10  9:12     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-10  9:12 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Thu, Nov 10, 2016 at 02:06:14PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > If one of callers of page migration starts to handle thp, memory management code
> > start to see pmd migration entry, so we need to prepare for it before enabling.
> > This patch changes various code point which checks the status of given pmds in
> > order to prevent race between thp migration and the pmd-related works.
> 
> There are a lot of changes in this one patch. Shouldn't we split
> this up into multiple patches and explain them in a bit more detail
> through their commit messages ?

Yes, and I admit that I might have changed more than necessary; if a context
never encounters a migration entry for any reason, no change is needed there.
I'll dig into the details more.

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 08/12] mm: soft-dirty: keep soft-dirty bits over thp migration
  2016-11-10  8:38   ` Anshuman Khandual
@ 2016-11-10  9:13     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-10  9:13 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Thu, Nov 10, 2016 at 02:08:38PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > Soft dirty bit is designed to keep tracked over page migration. This patch
> 
> Small nit here. s/tracked/track/
> 

Fixed. thank you :)

- Naoya

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-10  9:08   ` Hillf Danton
@ 2016-11-10  9:21     ` Naoya Horiguchi
  2016-11-10  9:28       ` Hillf Danton
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-10  9:21 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-mm, 'Kirill A. Shutemov', 'Hugh Dickins',
	'Andrew Morton', 'Dave Hansen',
	'Andrea Arcangeli', 'Mel Gorman',
	'Michal Hocko', 'Vlastimil Babka',
	'Pavel Emelyanov', 'Zi Yan',
	'Balbir Singh', linux-kernel, 'Naoya Horiguchi',
	'Anshuman Khandual'

Hi Hillf,

On Thu, Nov 10, 2016 at 05:08:07PM +0800, Hillf Danton wrote:
> On Tuesday, November 08, 2016 7:32 AM Naoya Horiguchi wrote:
> > 
> > @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> >  	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> >  		goto out_unlock;
> > 
> > +	if (unlikely(!pmd_present(orig_pmd)))
> > +		goto out_unlock;
> > +
> 
> Can we encounter a migration entry after acquiring ptl ?

I think we can. thp migration code releases ptl after converting pmd into
migration entry, so other code can see it even within ptl.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-10  9:21     ` Naoya Horiguchi
@ 2016-11-10  9:28       ` Hillf Danton
  2016-11-10  9:34         ` Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: Hillf Danton @ 2016-11-10  9:28 UTC (permalink / raw)
  To: 'Naoya Horiguchi'
  Cc: linux-mm, 'Kirill A. Shutemov', 'Hugh Dickins',
	'Andrew Morton', 'Dave Hansen',
	'Andrea Arcangeli', 'Mel Gorman',
	'Michal Hocko', 'Vlastimil Babka',
	'Pavel Emelyanov', 'Zi Yan',
	'Balbir Singh', linux-kernel, 'Naoya Horiguchi',
	'Anshuman Khandual'

On Thursday, November 10, 2016 5:22 PM Naoya Horiguchi wrote:
> On Thu, Nov 10, 2016 at 05:08:07PM +0800, Hillf Danton wrote:
> > On Tuesday, November 08, 2016 7:32 AM Naoya Horiguchi wrote:
> > >
> > > @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> > >  	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> > >  		goto out_unlock;
> > >
> > > +	if (unlikely(!pmd_present(orig_pmd)))
> > > +		goto out_unlock;
> > > +
> >
> > Can we encounter a migration entry after acquiring ptl ?
> 
> I think we can. thp migration code releases ptl after converting pmd into
> migration entry, so other code can see it even within ptl.
> 
But we have a pmd_same check there, you see. 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-10  9:28       ` Hillf Danton
@ 2016-11-10  9:34         ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-10  9:34 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-mm, 'Kirill A. Shutemov', 'Hugh Dickins',
	'Andrew Morton', 'Dave Hansen',
	'Andrea Arcangeli', 'Mel Gorman',
	'Michal Hocko', 'Vlastimil Babka',
	'Pavel Emelyanov', 'Zi Yan',
	'Balbir Singh', linux-kernel, 'Naoya Horiguchi',
	'Anshuman Khandual'

On Thu, Nov 10, 2016 at 05:28:20PM +0800, Hillf Danton wrote:
> On Thursday, November 10, 2016 5:22 PM Naoya Horiguchi wrote:
> > On Thu, Nov 10, 2016 at 05:08:07PM +0800, Hillf Danton wrote:
> > > On Tuesday, November 08, 2016 7:32 AM Naoya Horiguchi wrote:
> > > >
> > > > @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> > > >  	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> > > >  		goto out_unlock;
> > > >
> > > > +	if (unlikely(!pmd_present(orig_pmd)))
> > > > +		goto out_unlock;
> > > > +
> > >
> > > Can we encounter a migration entry after acquiring ptl ?
> > 
> > I think we can. thp migration code releases ptl after converting pmd into
> > migration entry, so other code can see it even within ptl.
> > 
> But we have a pmd_same check there, you see. 

You're right. So we can omit this pmd_present check.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-10  8:29   ` Anshuman Khandual
@ 2016-11-10  9:43     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-10  9:43 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Thu, Nov 10, 2016 at 01:59:03PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > This patch prepares thp migration's core code. These code will be open when
> > unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> > to allocate destination thps.
> > 
> 
> Snip
> 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> > ChangeLog v1 -> v2:
> > - support pte-mapped thp, doubly-mapped thp
> > ---
> >  arch/x86/include/asm/pgtable_64.h |   2 +
> >  include/linux/swapops.h           |  61 +++++++++++++++
> >  mm/huge_memory.c                  | 154 ++++++++++++++++++++++++++++++++++++++
> >  mm/migrate.c                      |  44 ++++++++++-
> >  mm/pgtable-generic.c              |   3 +-
> >  5 files changed, 262 insertions(+), 2 deletions(-)
> 
> 
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> > index 71c5f91..6012343 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> > @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
> >  {
> >  	pmd_t pmd;
> >  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > -	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> > +	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> > +		  !pmd_devmap(*pmdp))
> 
> It's a valid VM_BUG_ON check, but is it related to THP migration or
> just a regular fixup ?

Without this change, this VM_BUG_ON always triggers when migration happens
on a normal thp and succeeds, so I included it here.

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/12] mm: hwpoison: soft offline supports thp migration
  2016-11-07 23:31 ` [PATCH v2 09/12] mm: hwpoison: soft offline supports " Naoya Horiguchi
@ 2016-11-10 10:31   ` Balbir Singh
  2016-11-10 23:58     ` Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2016-11-10 10:31 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, linux-kernel, Naoya Horiguchi



On 08/11/16 10:31, Naoya Horiguchi wrote:
> This patch enables thp migration for soft offline.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  mm/memory-failure.c | 31 ++++++++++++-------------------
>  1 file changed, 12 insertions(+), 19 deletions(-)
> 
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
> index 19e796d..6cc8157 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
> @@ -1485,7 +1485,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
>  	if (PageHuge(p))
>  		return alloc_huge_page_node(page_hstate(compound_head(p)),
>  						   nid);
> -	else
> +	else if (thp_migration_supported() && PageTransHuge(p)) {
> +		struct page *thp;
> +
> +		thp = alloc_pages_node(nid,
> +			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> +			HPAGE_PMD_ORDER);
> +		if (!thp)
> +			return NULL;

Just wondering: if new_page() fails, migration of that entry fails. Do we then
split and migrate? I guess this applies to THP migration in general.

Balbir Singh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2016-11-09 23:52   ` Naoya Horiguchi
@ 2016-11-10 14:01     ` Zi Yan
  2016-11-11  3:48       ` Anshuman Khandual
  0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2016-11-10 14:01 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Anshuman Khandual, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Dave Hansen, Andrea Arcangeli, Mel Gorman,
	Michal Hocko, Vlastimil Babka, Pavel Emelyanov, Balbir Singh,
	linux-kernel, Naoya Horiguchi


On 9 Nov 2016, at 18:52, Naoya Horiguchi wrote:

> Hi Anshuman,
>
> On Wed, Nov 09, 2016 at 04:03:04PM +0530, Anshuman Khandual wrote:
>> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
>>> Hi everyone,
>>>
>>> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
>>> with feedbacks for ver.1.
>>>
>>> General description (no change since ver.1)
>>> ===========================================
>>>
>>> This patchset enhances page migration functionality to handle thp migration
>>> for various page migration's callers:
>>>  - mbind(2)
>>>  - move_pages(2)
>>>  - migrate_pages(2)
>>>  - cgroup/cpuset migration
>>>  - memory hotremove
>>>  - soft offline
>>>
>>> The main benefit is that we can avoid unnecessary thp splits, which helps us
>>> avoid performance decrease when your applications handles NUMA optimization on
>>> their own.
>>>
>>> The implementation is similar to that of normal page migration, the key point
>>> is that we modify a pmd to a pmd migration entry in swap-entry like format.
>>
>> Will it be better to have new THP_MIGRATE_SUCCESS and THP_MIGRATE_FAIL
>> VM events to capture how many times the migration worked without first
>> splitting the huge page and how many times it did not work ?
>
> Thank you for the suggestion.
> I think that's helpful, so will try it in next version.
>
>> Also do you
>> have a test case which demonstrates this THP migration and kind of shows
>> it's better than the present split and move method ?
>
> I don't have test cases which compare thp migration and split-then-migration
> with some numbers. Maybe measuring/comparing the overhead of migration is
> a good start point, although I think the real benefit of thp migration comes
> from workload "after migration" by avoiding thp split.

Migrating 4KB pages has much lower (~1/3) throughput than 2MB pages.

What I get is that on average it takes 1987.38 us to migrate 512 4KB pages and
                                       658.54  us to migrate 1   2MB page.
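
(Both cases move 2MB of data, so the ratio 1987.38us / 658.54us = ~3.0 is
where the ~1/3 throughput figure above comes from.)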

I did the test in a two-socket Intel Xeon E5-2640v4 box. I used migrate_pages()
system call to migrate pages. MADV_NOHUGEPAGE and MADV_HUGEPAGE are used to
make 4KB and 2MB pages and each page’s flags are checked to make sure the page
size is 4KB or 2MB THP.

There is no split page. But the page migration time already tells the story.
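
(For anyone who wants to reproduce this kind of measurement, here is a minimal
userspace sketch; it is not the actual test program used above, and the node
numbers, region size and use of the libnuma migrate_pages() wrapper are
assumptions.)

  /* build with: gcc -O2 thp_migrate_test.c -lnuma */
  #include <numaif.h>
  #include <sys/mman.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>

  #define SZ	(2UL << 20)	/* one 2MB region */

  int main(void)
  {
  	unsigned long from = 1UL << 0;	/* node 0 */
  	unsigned long to = 1UL << 1;	/* node 1 */
  	struct timespec t0, t1;
  	char *buf;

  	buf = aligned_alloc(SZ, SZ);
  	if (!buf)
  		return 1;
  	/* use MADV_NOHUGEPAGE instead to measure the 4kB case */
  	madvise(buf, SZ, MADV_HUGEPAGE);
  	memset(buf, 1, SZ);		/* fault the region in */

  	clock_gettime(CLOCK_MONOTONIC, &t0);
  	/* moves all of this process's pages from node 0 to node 1 */
  	if (migrate_pages(getpid(), 8 * sizeof(from) + 1, &from, &to) < 0)
  		perror("migrate_pages");
  	clock_gettime(CLOCK_MONOTONIC, &t1);

  	printf("%.2f us\n", (t1.tv_sec - t0.tv_sec) * 1e6 +
  			    (t1.tv_nsec - t0.tv_nsec) / 1e3);
  	return 0;
  }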

—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 01/12] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6
  2016-11-07 23:31 ` [PATCH v2 01/12] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6 Naoya Horiguchi
@ 2016-11-10 23:29   ` Dave Hansen
  2016-11-11  1:08     ` Naoya Horiguchi
  2016-11-11 11:12     ` Kirill A. Shutemov
  0 siblings, 2 replies; 71+ messages in thread
From: Dave Hansen @ 2016-11-10 23:29 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On 11/07/2016 03:31 PM, Naoya Horiguchi wrote:
> pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid false negative
> return when it races with thp spilt (during which _PAGE_PRESENT is temporary
> cleared.) I don't think that dropping _PAGE_PSE check in pmd_present() works
> well because it can hurt optimization of tlb handling in thp split.
> In the current kernel, bit 6 is not used in non-present format because nonlinear
> file mapping is obsolete, so let's move _PAGE_SWP_SOFT_DIRTY to that bit.
> Bit 7 is used as reserved (always clear), so please don't use it for other
> purpose.
...
>  #ifdef CONFIG_MEM_SOFT_DIRTY
> -#define _PAGE_SWP_SOFT_DIRTY	_PAGE_PSE
> +#define _PAGE_SWP_SOFT_DIRTY	_PAGE_DIRTY
>  #else
>  #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
>  #endif

I'm not sure this works.  Take a look at commit 00839ee3b29 and the
erratum it works around.  I _think_ this means that a system affected by
the erratum might see an erroneous _PAGE_SWP_SOFT_DIRTY/_PAGE_DIRTY get
set in swap ptes.

There are much worse things that can happen, but I don't think bits 5
(Accessed) and 6 (Dirty) are good choices since they're affected by the
erratum.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/12] mm: hwpoison: soft offline supports thp migration
  2016-11-10 10:31   ` Balbir Singh
@ 2016-11-10 23:58     ` Naoya Horiguchi
  2016-11-14 23:22       ` Balbir Singh
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-10 23:58 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, linux-kernel,
	Naoya Horiguchi

On Thu, Nov 10, 2016 at 09:31:10PM +1100, Balbir Singh wrote:
> 
> 
> On 08/11/16 10:31, Naoya Horiguchi wrote:
> > This patch enables thp migration for soft offline.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> >  mm/memory-failure.c | 31 ++++++++++++-------------------
> >  1 file changed, 12 insertions(+), 19 deletions(-)
> > 
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
> > index 19e796d..6cc8157 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
> > @@ -1485,7 +1485,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
> >  	if (PageHuge(p))
> >  		return alloc_huge_page_node(page_hstate(compound_head(p)),
> >  						   nid);
> > -	else
> > +	else if (thp_migration_supported() && PageTransHuge(p)) {
> > +		struct page *thp;
> > +
> > +		thp = alloc_pages_node(nid,
> > +			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> > +			HPAGE_PMD_ORDER);
> > +		if (!thp)
> > +			return NULL;
> 
> Just wondering: if new_page() fails, migration of that entry fails. Do we then
> split and migrate? I guess this applies to THP migration in general.

Yes, that's not implemented yet, but it could be helpful.

I think that there are two types of callers of page migration:
one is a caller that specifies the target pages individually (like move_pages
and soft offline), and the other is a caller that specifies the target pages
on a (physical/virtual) address range basis.
Maybe the former want to fall back immediately to split-and-retry if
thp migration fails, while the latter want to retry thp migration more.
If this makes sense, we can make some more changes to the retry logic to fit
each situation.
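
(One possible shape of such an immediate fallback, sketched against the
migrate_pages() retry loop; this is illustrative only and not part of the
posted series.)

  	rc = unmap_and_move(get_new_page, put_new_page, private,
  			    page, pass > 2, mode, reason);
  	if (rc == -ENOMEM && PageTransHuge(page) && !PageHuge(page)) {
  		/* thp migration failed for lack of a huge destination page:
  		 * split onto the source list and let the retry pass move the
  		 * resulting base pages individually. */
  		lock_page(page);
  		rc = split_huge_page_to_list(page, from);
  		unlock_page(page);
  		if (!rc) {
  			list_safe_reset_next(page, page2, lru);
  			rc = -EAGAIN;
  		}
  	}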

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 01/12] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6
  2016-11-10 23:29   ` Dave Hansen
@ 2016-11-11  1:08     ` Naoya Horiguchi
  2016-11-11 11:12     ` Kirill A. Shutemov
  1 sibling, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-11  1:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On Thu, Nov 10, 2016 at 03:29:51PM -0800, Dave Hansen wrote:
> On 11/07/2016 03:31 PM, Naoya Horiguchi wrote:
> > pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid false negative
> > return when it races with thp spilt (during which _PAGE_PRESENT is temporary
> > cleared.) I don't think that dropping _PAGE_PSE check in pmd_present() works
> > well because it can hurt optimization of tlb handling in thp split.
> > In the current kernel, bit 6 is not used in non-present format because nonlinear
> > file mapping is obsolete, so let's move _PAGE_SWP_SOFT_DIRTY to that bit.
> > Bit 7 is used as reserved (always clear), so please don't use it for other
> > purpose.
> ...
> >  #ifdef CONFIG_MEM_SOFT_DIRTY
> > -#define _PAGE_SWP_SOFT_DIRTY	_PAGE_PSE
> > +#define _PAGE_SWP_SOFT_DIRTY	_PAGE_DIRTY
> >  #else
> >  #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
> >  #endif
> 
> I'm not sure this works.  Take a look at commit 00839ee3b29 and the
> erratum it works around.  I _think_ this means that a system affected by
> the erratum might see an erroneous _PAGE_SWP_SOFT_DIRTY/_PAGE_DIRTY get
> set in swap ptes.
> 
> There are much worse things that can happen, but I don't think bits 5
> (Accessed) and 6 (Dirty) are good choices since they're affected by the
> erratum.

Thank you for the information. According to 00839ee3b29, some bits which
are safe from the erratum are reclaimed, so assigning one of those bits to
_PAGE_SWP_SOFT_DIRTY seems to work. And I'll update the description.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 03/12] mm: thp: introduce separate TTU flag for thp freezing
  2016-11-10  9:09     ` Naoya Horiguchi
@ 2016-11-11  3:18       ` Anshuman Khandual
  0 siblings, 0 replies; 71+ messages in thread
From: Anshuman Khandual @ 2016-11-11  3:18 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On 11/10/2016 02:39 PM, Naoya Horiguchi wrote:
> On Thu, Nov 10, 2016 at 02:01:56PM +0530, Anshuman Khandual wrote:
>> > On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
>>> > > TTU_MIGRATION is used to convert pte into migration entry until thp split
>>> > > completes. This behavior conflicts with thp migration added later patches,
>> >
>> > Hmm, could you please explain why it conflicts with the PMD based
>> > migration without split ? Why TTU_MIGRATION cannot be used to
>> > freeze/hold on the PMD while it's being migrated ?
> try_to_unmap() is used both for thp split (via freeze_page()) and page
> migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for
> head page is like below (assuming anonymous thp):
> 
>     (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
>      TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)

Right.

> 
> and ttu_flag given for tail pages is:
> 
>     (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
>      TTU_MIGRATION)

Right.

> 
> __unmap_and_move() calls try_to_unmap() with ttu_flag:
> 
>     (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)
> 
> Now I'm trying to insert a branch for thp migration at the top of
> try_to_unmap_one() like below
> 
> 
>   static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>                        unsigned long address, void *arg)
>   {
>           ...
>           if (flags & TTU_MIGRATION) {
>                   if (!PageHuge(page) && PageTransCompound(page)) {
>                           set_pmd_migration_entry(page, vma, address);
>                           goto out;
>                   }
>           }
> 
> , so try_to_unmap() for tail pages called by thp split can go into thp
> migration code path (which converts *pmd* into migration entry), while
> the expectation is to freeze thp (which converts *pte* into migration entry.)

Got it.

> 
> I detected this failure as a "bad page state" error in a testcase where
> split_huge_page() is called from queue_pages_pte_range().
> 
> Anyway, I'll add this explanation into the patch description in the next post.

Sure, thanks for the explanation.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2016-11-10 14:01     ` Zi Yan
@ 2016-11-11  3:48       ` Anshuman Khandual
  0 siblings, 0 replies; 71+ messages in thread
From: Anshuman Khandual @ 2016-11-11  3:48 UTC (permalink / raw)
  To: Zi Yan, Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On 11/10/2016 07:31 PM, Zi Yan wrote:
> On 9 Nov 2016, at 18:52, Naoya Horiguchi wrote:
> 
>> Hi Anshuman,
>>
>> On Wed, Nov 09, 2016 at 04:03:04PM +0530, Anshuman Khandual wrote:
>>> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
>>>> Hi everyone,
>>>>
>>>> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
>>>> with feedbacks for ver.1.
>>>>
>>>> General description (no change since ver.1)
>>>> ===========================================
>>>>
>>>> This patchset enhances page migration functionality to handle thp migration
>>>> for various page migration's callers:
>>>>  - mbind(2)
>>>>  - move_pages(2)
>>>>  - migrate_pages(2)
>>>>  - cgroup/cpuset migration
>>>>  - memory hotremove
>>>>  - soft offline
>>>>
>>>> The main benefit is that we can avoid unnecessary thp splits, which helps us
>>>> avoid performance decrease when your applications handles NUMA optimization on
>>>> their own.
>>>>
>>>> The implementation is similar to that of normal page migration, the key point
>>>> is that we modify a pmd to a pmd migration entry in swap-entry like format.
>>>
>>> Will it be better to have new THP_MIGRATE_SUCCESS and THP_MIGRATE_FAIL
>>> VM events to capture how many times the migration worked without first
>>> splitting the huge page and how many times it did not work ?
>>
>> Thank you for the suggestion.
>> I think that's helpful, so will try it in next version.
>>
>>> Also do you
>>> have a test case which demonstrates this THP migration and kind of shows
>>> it's better than the present split and move method ?
>>
>> I don't have test cases which compare thp migration and split-then-migration
>> with some numbers. Maybe measuring/comparing the overhead of migration is
>> a good start point, although I think the real benefit of thp migration comes
>> from workload "after migration" by avoiding thp split.
> 
> Migrating 4KB pages has much lower (~1/3) throughput than 2MB pages.

I assume the 2MB throughput you mentioned is with this THP migration
feature enabled.

> 
> What I get is that on average it takes 1987.38 us to migrate 512 4KB pages and
>                                        658.54  us to migrate 1   2MB page.
> 
> I did the test in a two-socket Intel Xeon E5-2640v4 box. I used migrate_pages()
> system call to migrate pages. MADV_NOHUGEPAGE and MADV_HUGEPAGE are used to
> make 4KB and 2MB pages and each page’s flags are checked to make sure the page
> size is 4KB or 2MB THP.
> 
> There is no split page. But the page migration time already tells the story.

Right. Just wondering if we can add a test case which measures just
this migration-time improvement from avoiding the split, as opposed to the
TLB-based improvement which the workload will receive in addition.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 01/12] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6
  2016-11-10 23:29   ` Dave Hansen
  2016-11-11  1:08     ` Naoya Horiguchi
@ 2016-11-11 11:12     ` Kirill A. Shutemov
  1 sibling, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2016-11-11 11:12 UTC (permalink / raw)
  To: Dave Hansen, Pavel Emelyanov
  Cc: Naoya Horiguchi, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Andrew Morton, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On Thu, Nov 10, 2016 at 03:29:51PM -0800, Dave Hansen wrote:
> On 11/07/2016 03:31 PM, Naoya Horiguchi wrote:
> > pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid false negative
> > return when it races with thp spilt (during which _PAGE_PRESENT is temporary
> > cleared.) I don't think that dropping _PAGE_PSE check in pmd_present() works
> > well because it can hurt optimization of tlb handling in thp split.
> > In the current kernel, bit 6 is not used in non-present format because nonlinear
> > file mapping is obsolete, so let's move _PAGE_SWP_SOFT_DIRTY to that bit.
> > Bit 7 is used as reserved (always clear), so please don't use it for other
> > purpose.
> ...
> >  #ifdef CONFIG_MEM_SOFT_DIRTY
> > -#define _PAGE_SWP_SOFT_DIRTY	_PAGE_PSE
> > +#define _PAGE_SWP_SOFT_DIRTY	_PAGE_DIRTY
> >  #else
> >  #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
> >  #endif
> 
> I'm not sure this works.  Take a look at commit 00839ee3b29 and the
> erratum it works around.  I _think_ this means that a system affected by
> the erratum might see an erroneous _PAGE_SWP_SOFT_DIRTY/_PAGE_DIRTY get
> set in swap ptes.

But, is it destructive in any way? What is the harm if we mark swap entry
dirty by mistake?

Pavel?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 03/12] mm: thp: introduce separate TTU flag for thp freezing
  2016-11-07 23:31 ` [PATCH v2 03/12] mm: thp: introduce separate TTU flag for thp freezing Naoya Horiguchi
  2016-11-10  8:31   ` Anshuman Khandual
@ 2016-11-11 11:16   ` Kirill A. Shutemov
  1 sibling, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2016-11-11 11:16 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Tue, Nov 08, 2016 at 08:31:48AM +0900, Naoya Horiguchi wrote:
> TTU_MIGRATION is used to convert pte into migration entry until thp split
> completes. This behavior conflicts with thp migration added later patches,
> so let's introduce a new TTU flag specifically for freezing.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  2016-11-07 23:31 ` [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Naoya Horiguchi
@ 2016-11-11 11:18   ` Kirill A. Shutemov
  2016-11-15  2:05     ` Naoya Horiguchi
  2016-11-28 14:21   ` Michal Hocko
  1 sibling, 1 reply; 71+ messages in thread
From: Kirill A. Shutemov @ 2016-11-11 11:18 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Tue, Nov 08, 2016 at 08:31:49AM +0900, Naoya Horiguchi wrote:
> +static inline bool thp_migration_supported(void)
> +{
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +	return true;
> +#else
> +	return false;
> +#endif

Em..

	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);

?

Otherwise looks good to me.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-07 23:31 ` [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration Naoya Horiguchi
                     ` (2 preceding siblings ...)
  2016-11-10  8:29   ` Anshuman Khandual
@ 2016-11-14 11:45   ` Kirill A. Shutemov
  2016-11-15  4:57     ` Naoya Horiguchi
  2016-11-28 14:31   ` Michal Hocko
  4 siblings, 1 reply; 71+ messages in thread
From: Kirill A. Shutemov @ 2016-11-14 11:45 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Tue, Nov 08, 2016 at 08:31:50AM +0900, Naoya Horiguchi wrote:
> This patch prepares thp migration's core code. These code will be open when
> unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> to allocate destination thps.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
> ---
>  arch/x86/include/asm/pgtable_64.h |   2 +
>  include/linux/swapops.h           |  61 +++++++++++++++
>  mm/huge_memory.c                  | 154 ++++++++++++++++++++++++++++++++++++++
>  mm/migrate.c                      |  44 ++++++++++-
>  mm/pgtable-generic.c              |   3 +-
>  5 files changed, 262 insertions(+), 2 deletions(-)
> 
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> index 1cc82ec..3a1b48e 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> @@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>  					 ((type) << (SWP_TYPE_FIRST_BIT)) \
>  					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
> +#define __pmd_to_swp_entry(pte)		((swp_entry_t) { pmd_val((pmd)) })
>  #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
> +#define __swp_entry_to_pmd(x)		((pmd_t) { .pmd = (x).val })
>  
>  extern int kern_addr_valid(unsigned long addr);
>  extern void cleanup_highmap(void);
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> index 5c3a5f3..b6b22a2 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> @@ -163,6 +163,67 @@ static inline int is_write_migration_entry(swp_entry_t entry)
>  
>  #endif
>  
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +extern void set_pmd_migration_entry(struct page *page,
> +		struct vm_area_struct *vma, unsigned long address);
> +
> +extern int remove_migration_pmd(struct page *new, pmd_t *pmd,
> +		struct vm_area_struct *vma, unsigned long addr, void *old);
> +
> +extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> +	swp_entry_t arch_entry;
> +
> +	arch_entry = __pmd_to_swp_entry(pmd);
> +	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> +	swp_entry_t arch_entry;
> +
> +	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
> +	return __swp_entry_to_pmd(arch_entry);
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> +	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> +}
> +#else
> +static inline void set_pmd_migration_entry(struct page *page,
> +			struct vm_area_struct *vma, unsigned long address)
> +{

VM_BUG()? Or BUILD_BUG()?

> +}
> +
> +static inline int remove_migration_pmd(struct page *new, pmd_t *pmd,
> +		struct vm_area_struct *vma, unsigned long addr, void *old)
> +{
> +	return 0;

Ditto.

> +}
> +
> +static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> +	return swp_entry(0, 0);

Ditto.

> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> +	pmd_t pmd = {};

Ditto.

> +	return pmd;
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #ifdef CONFIG_MEMORY_FAILURE
>  
>  extern atomic_long_t num_poisoned_pages __read_mostly;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> index 0509d17..b3022b3 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> @@ -2310,3 +2310,157 @@ static int __init split_huge_pages_debugfs(void)
>  }
>  late_initcall(split_huge_pages_debugfs);
>  #endif
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
> +				unsigned long addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pmd_t pmdval;
> +	swp_entry_t entry;
> +	spinlock_t *ptl;
> +
> +	pgd = pgd_offset(mm, addr);
> +	if (!pgd_present(*pgd))
> +		return;
> +	pud = pud_offset(pgd, addr);
> +	if (!pud_present(*pud))
> +		return;
> +	pmd = pmd_offset(pud, addr);
> +	pmdval = *pmd;
> +	barrier();
> +	if (!pmd_present(pmdval))
> +		return;
> +
> +	mmu_notifier_invalidate_range_start(mm, addr, addr + HPAGE_PMD_SIZE);
> +	if (pmd_trans_huge(pmdval)) {
> +		pmd_t pmdswp;
> +
> +		ptl = pmd_lock(mm, pmd);
> +		if (!pmd_present(*pmd))
> +			goto unlock_pmd;
> +		if (unlikely(!pmd_trans_huge(*pmd)))
> +			goto unlock_pmd;

Just check *pmd == pmdval?

> +		if (pmd_page(*pmd) != page)
> +			goto unlock_pmd;
> +
> +		pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
> +		if (pmd_dirty(pmdval))
> +			set_page_dirty(page);
> +		entry = make_migration_entry(page, pmd_write(pmdval));
> +		pmdswp = swp_entry_to_pmd(entry);
> +		pmdswp = pmd_mkhuge(pmdswp);
> +		set_pmd_at(mm, addr, pmd, pmdswp);
> +		page_remove_rmap(page, true);
> +		put_page(page);
> +unlock_pmd:
> +		spin_unlock(ptl);
> +	} else { /* pte-mapped thp */
> +		pte_t *pte;
> +		pte_t pteval;
> +		struct page *tmp = compound_head(page);
> +		unsigned long address = addr & HPAGE_PMD_MASK;
> +		pte_t swp_pte;
> +		int i;
> +
> +		pte = pte_offset_map(pmd, address);
> +		ptl = pte_lockptr(mm, pmd);
> +		spin_lock(ptl);

pte_offset_map_lock() ?

> +		for (i = 0; i < HPAGE_PMD_NR; i++, pte++, tmp++) {
> +			if (!(pte_present(*pte) &&
> +			      page_to_pfn(tmp) == pte_pfn(*pte)))

			if (!pte_present(*pte) || pte_page(*pte) != tmp) ?

> +				continue;
> +			pteval = ptep_clear_flush(vma, address, pte);
> +			if (pte_dirty(pteval))
> +				set_page_dirty(tmp);
> +			entry = make_migration_entry(tmp, pte_write(pteval));
> +			swp_pte = swp_entry_to_pte(entry);
> +			set_pte_at(mm, address, pte, swp_pte);
> +			page_remove_rmap(tmp, false);
> +			put_page(tmp);
> +		}
> +		pte_unmap_unlock(pte, ptl);
> +	}
> +	mmu_notifier_invalidate_range_end(mm, addr, addr + HPAGE_PMD_SIZE);
> +	return;
> +}
> +
> +int remove_migration_pmd(struct page *new, pmd_t *pmd,
> +		struct vm_area_struct *vma, unsigned long addr, void *old)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	spinlock_t *ptl;
> +	pmd_t pmde;
> +	swp_entry_t entry;
> +
> +	pmde = *pmd;
> +	barrier();
> +
> +	if (!pmd_present(pmde)) {
> +		if (is_migration_entry(pmd_to_swp_entry(pmde))) {

		if (!is_migration_entry(pmd_to_swp_entry(pmde)))
			return SWAP_AGAIN;

And one level less indentation below.

> +			unsigned long mmun_start = addr & HPAGE_PMD_MASK;
> +			unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> +
> +			ptl = pmd_lock(mm, pmd);
> +			entry = pmd_to_swp_entry(*pmd);
> +			if (migration_entry_to_page(entry) != old)
> +				goto unlock_ptl;
> +			get_page(new);
> +			pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
> +			if (is_write_migration_entry(entry))
> +				pmde = maybe_pmd_mkwrite(pmde, vma);
> +			flush_cache_range(vma, mmun_start, mmun_end);
> +			page_add_anon_rmap(new, vma, mmun_start, true);
> +			pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
> +			set_pmd_at(mm, mmun_start, pmd, pmde);
> +			flush_tlb_range(vma, mmun_start, mmun_end);
> +			if (vma->vm_flags & VM_LOCKED)
> +				mlock_vma_page(new);
> +			update_mmu_cache_pmd(vma, addr, pmd);
> +unlock_ptl:
> +			spin_unlock(ptl);

			return SWAP_AGAIN;

And one level less indentation below.

> +		}
> +	} else { /* pte-mapped thp */
> +		pte_t *ptep;
> +		pte_t pte;
> +		int i;
> +		struct page *tmpnew = compound_head(new);
> +		struct page *tmpold = compound_head((struct page *)old);
> +		unsigned long address = addr & HPAGE_PMD_MASK;
> +
> +		ptep = pte_offset_map(pmd, addr);
> +		ptl = pte_lockptr(mm, pmd);
> +		spin_lock(ptl);

pte_offset_map_lock() ?

> +
> +		for (i = 0; i < HPAGE_PMD_NR;
> +		     i++, ptep++, tmpnew++, tmpold++, address += PAGE_SIZE) {
> +			pte = *ptep;
> +			if (!is_swap_pte(pte))
> +				continue;
> +			entry = pte_to_swp_entry(pte);
> +			if (!is_migration_entry(entry) ||
> +			    migration_entry_to_page(entry) != tmpold)
> +				continue;
> +			get_page(tmpnew);
> +			pte = pte_mkold(mk_pte(tmpnew,
> +					       READ_ONCE(vma->vm_page_prot)));

READ_ONCE()? We get here under mmap_sem, right?

> +			if (pte_swp_soft_dirty(*ptep))
> +				pte = pte_mksoft_dirty(pte);
> +			if (is_write_migration_entry(entry))
> +				pte = maybe_mkwrite(pte, vma);
> +			flush_dcache_page(tmpnew);
> +			set_pte_at(mm, address, ptep, pte);
> +			if (PageAnon(new))
> +				page_add_anon_rmap(tmpnew, vma, address, false);
> +			else
> +				page_add_file_rmap(tmpnew, false);
> +			update_mmu_cache(vma, address, ptep);
> +		}
> +		pte_unmap_unlock(ptep, ptl);
> +	}
> +	return SWAP_AGAIN;
> +}
> +#endif
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> index 66ce6b4..54f2eb6 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> @@ -198,6 +198,8 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	swp_entry_t entry;
> +	pgd_t *pgd;
> +	pud_t *pud;
>   	pmd_t *pmd;
>  	pte_t *ptep, pte;
>   	spinlock_t *ptl;
> @@ -208,10 +210,29 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  			goto out;
>  		ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
>  	} else {
> -		pmd = mm_find_pmd(mm, addr);
> +		pmd_t pmde;
> +
> +		pgd = pgd_offset(mm, addr);
> +		if (!pgd_present(*pgd))
> +			goto out;
> +		pud = pud_offset(pgd, addr);
> +		if (!pud_present(*pud))
> +			goto out;
> +		pmd = pmd_offset(pud, addr);
>  		if (!pmd)
>  			goto out;
>  
> +		if (PageTransCompound(new)) {
> +			remove_migration_pmd(new, pmd, vma, addr, old);
> +			goto out;
> +		}
> +
> +		pmde = *pmd;
> +		barrier();
> +
> +		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> +			goto out;
> +
>  		ptep = pte_offset_map(pmd, addr);
>  
>  		/*
> @@ -344,6 +365,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
>  	__migration_entry_wait(mm, pte, ptl);
>  }
>  
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
> +{
> +	spinlock_t *ptl;
> +	struct page *page;
> +
> +	ptl = pmd_lock(mm, pmd);
> +	if (!is_pmd_migration_entry(*pmd))
> +		goto unlock;
> +	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
> +	if (!get_page_unless_zero(page))
> +		goto unlock;
> +	spin_unlock(ptl);
> +	wait_on_page_locked(page);
> +	put_page(page);
> +	return;
> +unlock:
> +	spin_unlock(ptl);
> +}
> +#endif
> +
>  #ifdef CONFIG_BLOCK
>  /* Returns true if all buffers are successfully locked */
>  static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> index 71c5f91..6012343 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
>  {
>  	pmd_t pmd;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> +	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> +		  !pmd_devmap(*pmdp));
>  	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
>  	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>  	return pmd;
> -- 
> 2.7.0
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/12] mm: hwpoison: soft offline supports thp migration
  2016-11-10 23:58     ` Naoya Horiguchi
@ 2016-11-14 23:22       ` Balbir Singh
  0 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2016-11-14 23:22 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, linux-kernel,
	Naoya Horiguchi



On 11/11/16 10:58, Naoya Horiguchi wrote:
> On Thu, Nov 10, 2016 at 09:31:10PM +1100, Balbir Singh wrote:
>>
>>
>> On 08/11/16 10:31, Naoya Horiguchi wrote:
>>> This patch enables thp migration for soft offline.
>>>
>>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>> ---
>>>  mm/memory-failure.c | 31 ++++++++++++-------------------
>>>  1 file changed, 12 insertions(+), 19 deletions(-)
>>>
>>> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
>>> index 19e796d..6cc8157 100644
>>> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c
>>> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
>>> @@ -1485,7 +1485,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
>>>  	if (PageHuge(p))
>>>  		return alloc_huge_page_node(page_hstate(compound_head(p)),
>>>  						   nid);
>>> -	else
>>> +	else if (thp_migration_supported() && PageTransHuge(p)) {
>>> +		struct page *thp;
>>> +
>>> +		thp = alloc_pages_node(nid,
>>> +			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
>>> +			HPAGE_PMD_ORDER);
>>> +		if (!thp)
>>> +			return NULL;
>>
>> Just wondering if new_page() fails, migration of that entry fails. Do we then
>> split and migrate? I guess this applies to THP migration in general.
> 
> Yes, that's not implemented yet, but it could be helpful.
> 
> I think that there are 2 types of callers of page migration,
> one is a caller that specifies the target pages individually (like move_pages
> and soft offline), and another is a caller that specifies the target pages
> on a (physical/virtual) address range basis.
> Maybe the former ones want to fall back immediately to split and retry if
> thp migration fails, and the latter ones want to retry thp migration more.
> If this makes sense, we can make some more changes on retry logic to fit
> the situation.
> 

I think we definitely need the retry-with-split option, but maybe we can
build it on top of this series.
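For illustration only, a split-and-retry fallback around a failed thp migration
attempt might look roughly like the sketch below; migrate_one_thp() and
retry_as_base_pages() are placeholder names, not functions from this series:

	/* hypothetical fallback, sketched on top of this series */
	rc = migrate_one_thp(page, get_new_page, private);	/* placeholder */
	if (rc && PageTransHuge(page)) {
		lock_page(page);
		rc = split_huge_page(page);	/* returns 0 on success */
		unlock_page(page);
		if (!rc)
			rc = retry_as_base_pages(page);	/* placeholder: re-queue tail pages */
	}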

Balbir Singh.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  2016-11-11 11:18   ` Kirill A. Shutemov
@ 2016-11-15  2:05     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-15  2:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Fri, Nov 11, 2016 at 02:18:40PM +0300, Kirill A. Shutemov wrote:
> On Tue, Nov 08, 2016 at 08:31:49AM +0900, Naoya Horiguchi wrote:
> > +static inline bool thp_migration_supported(void)
> > +{
> > +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > +	return true;
> > +#else
> > +	return false;
> > +#endif
> 
> Em..
> 
> 	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);

Looks better, thank you.
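For reference, with IS_ENABLED() the helper collapses to just:

	static inline bool thp_migration_supported(void)
	{
		return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
	}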

- Naoya

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-14 11:45   ` Kirill A. Shutemov
@ 2016-11-15  4:57     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-15  4:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Mon, Nov 14, 2016 at 02:45:03PM +0300, Kirill A. Shutemov wrote:
> On Tue, Nov 08, 2016 at 08:31:50AM +0900, Naoya Horiguchi wrote:
> > This patch prepares thp migration's core code. This code will be enabled when
> > unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> > to allocate destination thps.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> > ChangeLog v1 -> v2:
> > - support pte-mapped thp, doubly-mapped thp
> > ---
> >  arch/x86/include/asm/pgtable_64.h |   2 +
> >  include/linux/swapops.h           |  61 +++++++++++++++
> >  mm/huge_memory.c                  | 154 ++++++++++++++++++++++++++++++++++++++
> >  mm/migrate.c                      |  44 ++++++++++-
> >  mm/pgtable-generic.c              |   3 +-
> >  5 files changed, 262 insertions(+), 2 deletions(-)
> > 
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> > index 1cc82ec..3a1b48e 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> > @@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
> >  					 ((type) << (SWP_TYPE_FIRST_BIT)) \
> >  					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
> >  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
> > +#define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
> >  #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
> > +#define __swp_entry_to_pmd(x)		((pmd_t) { .pmd = (x).val })
> >  
> >  extern int kern_addr_valid(unsigned long addr);
> >  extern void cleanup_highmap(void);
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> > index 5c3a5f3..b6b22a2 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> > @@ -163,6 +163,67 @@ static inline int is_write_migration_entry(swp_entry_t entry)
> >  
> >  #endif
> >  
> > +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > +extern void set_pmd_migration_entry(struct page *page,
> > +		struct vm_area_struct *vma, unsigned long address);
> > +
> > +extern int remove_migration_pmd(struct page *new, pmd_t *pmd,
> > +		struct vm_area_struct *vma, unsigned long addr, void *old);
> > +
> > +extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
> > +
> > +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> > +{
> > +	swp_entry_t arch_entry;
> > +
> > +	arch_entry = __pmd_to_swp_entry(pmd);
> > +	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
> > +}
> > +
> > +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> > +{
> > +	swp_entry_t arch_entry;
> > +
> > +	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
> > +	return __swp_entry_to_pmd(arch_entry);
> > +}
> > +
> > +static inline int is_pmd_migration_entry(pmd_t pmd)
> > +{
> > +	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> > +}
> > +#else
> > +static inline void set_pmd_migration_entry(struct page *page,
> > +			struct vm_area_struct *vma, unsigned long address)
> > +{
> 
> VM_BUG()? Or BUILD_BUG()?

These should be compiled out, so BUILD_BUG() seems better to me.
The 3 routines below will be handled in the same manner.
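So the !CONFIG_ARCH_ENABLE_THP_MIGRATION stubs would become something like
this sketch (and similarly for the other routines):

	static inline void set_pmd_migration_entry(struct page *page,
			struct vm_area_struct *vma, unsigned long address)
	{
		BUILD_BUG();	/* any caller must be compiled out */
	}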

> > +}
> > +
> > +static inline int remove_migration_pmd(struct page *new, pmd_t *pmd,
> > +		struct vm_area_struct *vma, unsigned long addr, void *old)
> > +{
> > +	return 0;
> 
> Ditto.
> 
> > +}
> > +
> > +static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
> > +
> > +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> > +{
> > +	return swp_entry(0, 0);
> 
> Ditto.
> 
> > +}
> > +
> > +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> > +{
> > +	pmd_t pmd = {};
> 
> Ditto.
> 
> > +	return pmd;
> > +}
> > +
> > +static inline int is_pmd_migration_entry(pmd_t pmd)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_MEMORY_FAILURE
> >  
> >  extern atomic_long_t num_poisoned_pages __read_mostly;
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> > index 0509d17..b3022b3 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> > @@ -2310,3 +2310,157 @@ static int __init split_huge_pages_debugfs(void)
> >  }
> >  late_initcall(split_huge_pages_debugfs);
> >  #endif
> > +
> > +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > +void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
> > +				unsigned long addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> > +	pmd_t *pmd;
> > +	pmd_t pmdval;
> > +	swp_entry_t entry;
> > +	spinlock_t *ptl;
> > +
> > +	pgd = pgd_offset(mm, addr);
> > +	if (!pgd_present(*pgd))
> > +		return;
> > +	pud = pud_offset(pgd, addr);
> > +	if (!pud_present(*pud))
> > +		return;
> > +	pmd = pmd_offset(pud, addr);
> > +	pmdval = *pmd;
> > +	barrier();
> > +	if (!pmd_present(pmdval))
> > +		return;
> > +
> > +	mmu_notifier_invalidate_range_start(mm, addr, addr + HPAGE_PMD_SIZE);
> > +	if (pmd_trans_huge(pmdval)) {
> > +		pmd_t pmdswp;
> > +
> > +		ptl = pmd_lock(mm, pmd);
> > +		if (!pmd_present(*pmd))
> > +			goto unlock_pmd;
> > +		if (unlikely(!pmd_trans_huge(*pmd)))
> > +			goto unlock_pmd;
> 
> Just check *pmd == pmdval?

OK.
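So the recheck under the lock could be reduced to something like this sketch
(using pmd_same() as the portable way to compare two pmd_t values):

	ptl = pmd_lock(mm, pmd);
	if (!pmd_same(*pmd, pmdval))
		goto unlock_pmd;
	if (pmd_page(*pmd) != page)
		goto unlock_pmd;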

> 
> > +		if (pmd_page(*pmd) != page)
> > +			goto unlock_pmd;
> > +
> > +		pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
> > +		if (pmd_dirty(pmdval))
> > +			set_page_dirty(page);
> > +		entry = make_migration_entry(page, pmd_write(pmdval));
> > +		pmdswp = swp_entry_to_pmd(entry);
> > +		pmdswp = pmd_mkhuge(pmdswp);
> > +		set_pmd_at(mm, addr, pmd, pmdswp);
> > +		page_remove_rmap(page, true);
> > +		put_page(page);
> > +unlock_pmd:
> > +		spin_unlock(ptl);
> > +	} else { /* pte-mapped thp */
> > +		pte_t *pte;
> > +		pte_t pteval;
> > +		struct page *tmp = compound_head(page);
> > +		unsigned long address = addr & HPAGE_PMD_MASK;
> > +		pte_t swp_pte;
> > +		int i;
> > +
> > +		pte = pte_offset_map(pmd, address);
> > +		ptl = pte_lockptr(mm, pmd);
> > +		spin_lock(ptl);
> 
> pte_offset_map_lock() ?

Right.
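i.e. the open-coded map/lock sequence collapses to:

	pte = pte_offset_map_lock(mm, pmd, address, &ptl);

with the matching pte_unmap_unlock() left as is.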

> > +		for (i = 0; i < HPAGE_PMD_NR; i++, pte++, tmp++) {
> > +			if (!(pte_present(*pte) &&
> > +			      page_to_pfn(tmp) == pte_pfn(*pte)))
> 
> 			if (!pte_present(*pte) || pte_page(*pte) != tmp) ?

Yes, this is shorter/simpler.

> 
> > +				continue;
> > +			pteval = ptep_clear_flush(vma, address, pte);
> > +			if (pte_dirty(pteval))
> > +				set_page_dirty(tmp);
> > +			entry = make_migration_entry(tmp, pte_write(pteval));
> > +			swp_pte = swp_entry_to_pte(entry);
> > +			set_pte_at(mm, address, pte, swp_pte);
> > +			page_remove_rmap(tmp, false);
> > +			put_page(tmp);
> > +		}
> > +		pte_unmap_unlock(pte, ptl);
> > +	}
> > +	mmu_notifier_invalidate_range_end(mm, addr, addr + HPAGE_PMD_SIZE);
> > +	return;
> > +}
> > +
> > +int remove_migration_pmd(struct page *new, pmd_t *pmd,
> > +		struct vm_area_struct *vma, unsigned long addr, void *old)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	spinlock_t *ptl;
> > +	pmd_t pmde;
> > +	swp_entry_t entry;
> > +
> > +	pmde = *pmd;
> > +	barrier();
> > +
> > +	if (!pmd_present(pmde)) {
> > +		if (is_migration_entry(pmd_to_swp_entry(pmde))) {
> 
> 		if (!is_migration_entry(pmd_to_swp_entry(pmde)))
> 			return SWAP_AGAIN;
> 
> And one level less indentation below.

OK.
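With the early return, the entry of remove_migration_pmd() would read roughly
like this sketch:

	pmde = *pmd;
	barrier();

	if (!pmd_present(pmde)) {
		if (!is_migration_entry(pmd_to_swp_entry(pmde)))
			return SWAP_AGAIN;

		ptl = pmd_lock(mm, pmd);
		entry = pmd_to_swp_entry(*pmd);
		if (migration_entry_to_page(entry) != old)
			goto unlock_ptl;
		/* ... install the new huge pmd as before ... */
unlock_ptl:
		spin_unlock(ptl);
		return SWAP_AGAIN;
	}

	/* pte-mapped thp handling follows, one indentation level less */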

> > +			unsigned long mmun_start = addr & HPAGE_PMD_MASK;
> > +			unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> > +
> > +			ptl = pmd_lock(mm, pmd);
> > +			entry = pmd_to_swp_entry(*pmd);
> > +			if (migration_entry_to_page(entry) != old)
> > +				goto unlock_ptl;
> > +			get_page(new);
> > +			pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
> > +			if (is_write_migration_entry(entry))
> > +				pmde = maybe_pmd_mkwrite(pmde, vma);
> > +			flush_cache_range(vma, mmun_start, mmun_end);
> > +			page_add_anon_rmap(new, vma, mmun_start, true);
> > +			pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
> > +			set_pmd_at(mm, mmun_start, pmd, pmde);
> > +			flush_tlb_range(vma, mmun_start, mmun_end);
> > +			if (vma->vm_flags & VM_LOCKED)
> > +				mlock_vma_page(new);
> > +			update_mmu_cache_pmd(vma, addr, pmd);
> > +unlock_ptl:
> > +			spin_unlock(ptl);
> 
> 			return SWAP_AGAIN;
> 
> And one level less indentation below.
> 
> > +		}
> > +	} else { /* pte-mapped thp */
> > +		pte_t *ptep;
> > +		pte_t pte;
> > +		int i;
> > +		struct page *tmpnew = compound_head(new);
> > +		struct page *tmpold = compound_head((struct page *)old);
> > +		unsigned long address = addr & HPAGE_PMD_MASK;
> > +
> > +		ptep = pte_offset_map(pmd, addr);
> > +		ptl = pte_lockptr(mm, pmd);
> > +		spin_lock(ptl);
> 
> pte_offset_map_lock() ?
> 
> > +
> > +		for (i = 0; i < HPAGE_PMD_NR;
> > +		     i++, ptep++, tmpnew++, tmpold++, address += PAGE_SIZE) {
> > +			pte = *ptep;
> > +			if (!is_swap_pte(pte))
> > +				continue;
> > +			entry = pte_to_swp_entry(pte);
> > +			if (!is_migration_entry(entry) ||
> > +			    migration_entry_to_page(entry) != tmpold)
> > +				continue;
> > +			get_page(tmpnew);
> > +			pte = pte_mkold(mk_pte(tmpnew,
> > +					       READ_ONCE(vma->vm_page_prot)));
> 
> READ_ONCE()? We get here under mmap_sem, right?

Some callers of page migration (mbind, move_pages, migrate_pages, cpuset)
do get mmap_sem, but others (memory hotremove, soft offline) don't.
For this part, I borrowed some code from remove_migration_pte() which was
updated in the following commit:

  commit 6d2329f8872f23e46a19d240930571510ce525eb
  Author: Andrea Arcangeli <aarcange@redhat.com>
  Date:   Fri Oct 7 17:01:22 2016 -0700
  
      mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE


Thank you for reviewing in detail!

Naoya Horiguchi

> > +			if (pte_swp_soft_dirty(*ptep))
> > +				pte = pte_mksoft_dirty(pte);
> > +			if (is_write_migration_entry(entry))
> > +				pte = maybe_mkwrite(pte, vma);
> > +			flush_dcache_page(tmpnew);
> > +			set_pte_at(mm, address, ptep, pte);
> > +			if (PageAnon(new))
> > +				page_add_anon_rmap(tmpnew, vma, address, false);
> > +			else
> > +				page_add_file_rmap(tmpnew, false);
> > +			update_mmu_cache(vma, address, ptep);
> > +		}
> > +		pte_unmap_unlock(ptep, ptl);
> > +	}
> > +	return SWAP_AGAIN;
> > +}
> > +#endif
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> > index 66ce6b4..54f2eb6 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> > @@ -198,6 +198,8 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	swp_entry_t entry;
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> >   	pmd_t *pmd;
> >  	pte_t *ptep, pte;
> >   	spinlock_t *ptl;
> > @@ -208,10 +210,29 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> >  			goto out;
> >  		ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
> >  	} else {
> > -		pmd = mm_find_pmd(mm, addr);
> > +		pmd_t pmde;
> > +
> > +		pgd = pgd_offset(mm, addr);
> > +		if (!pgd_present(*pgd))
> > +			goto out;
> > +		pud = pud_offset(pgd, addr);
> > +		if (!pud_present(*pud))
> > +			goto out;
> > +		pmd = pmd_offset(pud, addr);
> >  		if (!pmd)
> >  			goto out;
> >  
> > +		if (PageTransCompound(new)) {
> > +			remove_migration_pmd(new, pmd, vma, addr, old);
> > +			goto out;
> > +		}
> > +
> > +		pmde = *pmd;
> > +		barrier();
> > +
> > +		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> > +			goto out;
> > +
> >  		ptep = pte_offset_map(pmd, addr);
> >  
> >  		/*
> > @@ -344,6 +365,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
> >  	__migration_entry_wait(mm, pte, ptl);
> >  }
> >  
> > +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > +void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
> > +{
> > +	spinlock_t *ptl;
> > +	struct page *page;
> > +
> > +	ptl = pmd_lock(mm, pmd);
> > +	if (!is_pmd_migration_entry(*pmd))
> > +		goto unlock;
> > +	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
> > +	if (!get_page_unless_zero(page))
> > +		goto unlock;
> > +	spin_unlock(ptl);
> > +	wait_on_page_locked(page);
> > +	put_page(page);
> > +	return;
> > +unlock:
> > +	spin_unlock(ptl);
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_BLOCK
> >  /* Returns true if all buffers are successfully locked */
> >  static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> > index 71c5f91..6012343 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> > @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
> >  {
> >  	pmd_t pmd;
> >  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > -	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> > +	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> > +		  !pmd_devmap(*pmdp));
> >  	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
> >  	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> >  	return pmd;
> > -- 
> > 2.7.0
> > 
> 
> -- 
>  Kirill A. Shutemov
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-07 23:31 ` [PATCH v2 07/12] mm: thp: check pmd migration entry in common path Naoya Horiguchi
                     ` (2 preceding siblings ...)
  2016-11-10  9:08   ` Hillf Danton
@ 2016-11-17 23:56   ` Kirill A. Shutemov
  2016-11-29  6:46     ` Naoya Horiguchi
  2016-11-28 14:35   ` Michal Hocko
  4 siblings, 1 reply; 71+ messages in thread
From: Kirill A. Shutemov @ 2016-11-17 23:56 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Tue, Nov 08, 2016 at 08:31:52AM +0900, Naoya Horiguchi wrote:
> If one of the callers of page migration starts to handle thp, the memory management
> code starts to see pmd migration entries, so we need to prepare for that before
> enabling it. This patch changes various code points which check the status of given
> pmds in order to prevent races between thp migration and pmd-related work.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog v1 -> v2:
> - introduce pmd_related() (I know the naming is not good, but can't think of
>   a better name. Any suggestion is welcome.)
> ---
>  arch/x86/mm/gup.c       |  4 +--
>  fs/proc/task_mmu.c      | 23 +++++++------
>  include/linux/huge_mm.h |  9 ++++-
>  mm/gup.c                | 10 ++++--
>  mm/huge_memory.c        | 88 ++++++++++++++++++++++++++++++++++++++++---------
>  mm/madvise.c            |  2 +-
>  mm/memcontrol.c         |  2 ++
>  mm/memory.c             |  6 +++-
>  mm/mprotect.c           |  2 ++
>  mm/mremap.c             |  2 +-
>  10 files changed, 114 insertions(+), 34 deletions(-)
> 
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> index 0d4fb3e..78a153d 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  		pmd_t pmd = *pmdp;
>  
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_none(pmd))
> +		if (!pmd_present(pmd))
>  			return 0;
> -		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
> +		if (unlikely(pmd_large(pmd))) {
>  			/*
>  			 * NUMA hinting faults need to be handled in the GUP
>  			 * slowpath for accounting purposes and so that they
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> index 35b92d8..c1f9cf4 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  
>  	ptl = pmd_trans_huge_lock(pmd, vma);
>  	if (ptl) {
> -		smaps_pmd_entry(pmd, addr, walk);
> +		if (pmd_present(*pmd))
> +			smaps_pmd_entry(pmd, addr, walk);
>  		spin_unlock(ptl);
>  		return 0;
>  	}
> @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  			goto out;
>  		}
>  
> +		if (!pmd_present(*pmd))
> +			goto out;
> +

Hm. Looks like clear_soft_dirty_pmd() should handle !present. It doesn't.

Ah.. Found it in 08/12.

>  		page = pmd_page(*pmd);
>  
>  		/* Clear accessed and referenced bits. */
> @@ -1208,19 +1212,18 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>  	if (ptl) {
>  		u64 flags = 0, frame = 0;
>  		pmd_t pmd = *pmdp;
> +		struct page *page;
>  
>  		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
>  			flags |= PM_SOFT_DIRTY;
>  
> -		/*
> -		 * Currently pmd for thp is always present because thp
> -		 * can not be swapped-out, migrated, or HWPOISONed
> -		 * (split in such cases instead.)
> -		 * This if-check is just to prepare for future implementation.
> -		 */
> -		if (pmd_present(pmd)) {
> -			struct page *page = pmd_page(pmd);
> -
> +		if (is_pmd_migration_entry(pmd)) {
> +			swp_entry_t entry = pmd_to_swp_entry(pmd);
> +			frame = swp_type(entry) |
> +				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> +			page = migration_entry_to_page(entry);
> +		} else if (pmd_present(pmd)) {
> +			page = pmd_page(pmd);
>  			if (page_mapcount(page) == 1)
>  				flags |= PM_MMAP_EXCLUSIVE;
>  
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> index fcbca51..3c252cd 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> @@ -125,12 +125,19 @@ extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
>  				    long adjust_next);
>  extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
>  		struct vm_area_struct *vma);
> +
> +static inline int pmd_related(pmd_t pmd)
> +{
> +	return !pmd_none(pmd) &&
> +		(!pmd_present(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd));
> +}
> +

I would rather create is_swap_pmd() -- (!none && !present) and leave the
rest open-coded.
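Something like:

	static inline int is_swap_pmd(pmd_t pmd)
	{
		return !pmd_none(pmd) && !pmd_present(pmd);
	}

and then pmd_trans_huge_lock() could check
is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd) explicitly.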


>  /* mmap_sem must be held on entry */
>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>  		struct vm_area_struct *vma)
>  {
>  	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> +	if (pmd_related(*pmd))
>  		return __pmd_trans_huge_lock(pmd, vma);
>  	else
>  		return NULL;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> index e50178c..2dc4978 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> @@ -267,6 +267,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  	}
>  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>  		return no_page_table(vma, flags);
> +	if (!pmd_present(*pmd))
> +		return no_page_table(vma, flags);

Don't we need FOLL_MIGRATION handling here, like on the pte side?
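On the pte side, follow_page_pte() waits for the migration to finish and then
retries when FOLL_MIGRATION is set. A pmd-side equivalent might look roughly
like the sketch below; note it assumes a retry label in follow_page_mask(),
which doesn't exist today:

	if (!pmd_present(*pmd)) {
		if (likely(!(flags & FOLL_MIGRATION)))
			return no_page_table(vma, flags);
		if (!is_pmd_migration_entry(*pmd))
			return no_page_table(vma, flags);
		pmd_migration_entry_wait(mm, pmd);
		goto retry;	/* hypothetical label */
	}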

>  	if (pmd_devmap(*pmd)) {
>  		ptl = pmd_lock(mm, pmd);
>  		page = follow_devmap_pmd(vma, address, pmd, flags);
> @@ -278,6 +280,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  		return follow_page_pte(vma, address, pmd, flags);
>  
>  	ptl = pmd_lock(mm, pmd);
> +	if (unlikely(!pmd_present(*pmd))) {
> +		spin_unlock(ptl);
> +		return no_page_table(vma, flags);
> +	}

Ditto.

>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>  		spin_unlock(ptl);
>  		return follow_page_pte(vma, address, pmd, flags);
> @@ -333,7 +339,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
>  	pud = pud_offset(pgd, address);
>  	BUG_ON(pud_none(*pud));
>  	pmd = pmd_offset(pud, address);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		return -EFAULT;
>  	VM_BUG_ON(pmd_trans_huge(*pmd));
>  	pte = pte_offset_map(pmd, address);
> @@ -1357,7 +1363,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  		pmd_t pmd = READ_ONCE(*pmdp);
>  
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_none(pmd))
> +		if (!pmd_present(pmd))
>  			return 0;
>  
>  		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> index b3022b3..4e9090c 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> @@ -825,6 +825,20 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  
>  	ret = -EAGAIN;
>  	pmd = *src_pmd;
> +
> +	if (unlikely(is_pmd_migration_entry(pmd))) {
> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +		if (is_write_migration_entry(entry)) {
> +			make_migration_entry_read(&entry);
> +			pmd = swp_entry_to_pmd(entry);
> +			set_pmd_at(src_mm, addr, src_pmd, pmd);
> +		}

I think we should put at least a WARN_ONCE() in the 'else' here. We don't want
to miss such places when swap (or another swap entry type) gets supported.

> +		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> +		ret = 0;
> +		goto out_unlock;
> +	}
> +
>  	if (unlikely(!pmd_trans_huge(pmd))) {
>  		pte_free(dst_mm, pgtable);
>  		goto out_unlock;
> @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
>  	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
>  		goto out_unlock;
>  
> +	if (unlikely(!pmd_present(orig_pmd)))
> +		goto out_unlock;
> +
>  	page = pmd_page(orig_pmd);
>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>  	/*
> @@ -1137,7 +1154,14 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>  		goto out;
>  
> -	page = pmd_page(*pmd);
> +	if (is_pmd_migration_entry(*pmd)) {
> +		swp_entry_t entry;
> +		entry = pmd_to_swp_entry(*pmd);
> +		page = pfn_to_page(swp_offset(entry));
> +		if (!is_migration_entry(entry))
> +			goto out;

follow_page_pte() does a different thing: it waits for the page to be migrated and
retries. Any reason you don't do the same?

> +	} else
> +		page = pmd_page(*pmd);
>  	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>  	if (flags & FOLL_TOUCH)
>  		touch_pmd(vma, addr, pmd);
> @@ -1332,6 +1356,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	if (is_huge_zero_pmd(orig_pmd))
>  		goto out;
>  
> +	if (unlikely(!pmd_present(orig_pmd)))
> +		goto out;
> +
>  	page = pmd_page(orig_pmd);
>  	/*
>  	 * If other processes are mapping this page, we couldn't discard
> @@ -1410,20 +1437,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
>  	} else {
>  		struct page *page = pmd_page(orig_pmd);
> -		page_remove_rmap(page, true);
> -		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> -		VM_BUG_ON_PAGE(!PageHead(page), page);
> -		if (PageAnon(page)) {
> -			pgtable_t pgtable;
> -			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> -			pte_free(tlb->mm, pgtable);
> -			atomic_long_dec(&tlb->mm->nr_ptes);
> -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +		int migration = 0;
> +
> +		if (!is_pmd_migration_entry(orig_pmd)) {
> +			page_remove_rmap(page, true);
> +			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> +			VM_BUG_ON_PAGE(!PageHead(page), page);
> +			if (PageAnon(page)) {
> +				pgtable_t pgtable;
> +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> +								      pmd);
> +				pte_free(tlb->mm, pgtable);
> +				atomic_long_dec(&tlb->mm->nr_ptes);
> +				add_mm_counter(tlb->mm, MM_ANONPAGES,
> +					       -HPAGE_PMD_NR);
> +			} else {
> +				add_mm_counter(tlb->mm, MM_FILEPAGES,
> +					       -HPAGE_PMD_NR);
> +			}
>  		} else {
> -			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> +			swp_entry_t entry;
> +
> +			entry = pmd_to_swp_entry(orig_pmd);
> +			free_swap_and_cache(entry); /* warn on failure? */
> +			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +			migration = 1;
>  		}
>  		spin_unlock(ptl);
> -		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> +		if (!migration)
> +			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
>  	}
>  	return 1;
>  }
> @@ -1496,14 +1538,27 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		bool preserve_write = prot_numa && pmd_write(*pmd);
>  		ret = 1;
>  
> +		if (!pmd_present(*pmd))
> +			goto unlock;
>  		/*
>  		 * Avoid trapping faults against the zero page. The read-only
>  		 * data is likely to be read-cached on the local CPU and
>  		 * local/remote hits to the zero page are not interesting.
>  		 */
> -		if (prot_numa && is_huge_zero_pmd(*pmd)) {
> -			spin_unlock(ptl);
> -			return ret;
> +		if (prot_numa && is_huge_zero_pmd(*pmd))
> +			goto unlock;
> +
> +		if (is_pmd_migration_entry(*pmd)) {

Hm? But we filtered out !present above?

> +			swp_entry_t entry = pmd_to_swp_entry(*pmd);
> +
> +			if (is_write_migration_entry(entry)) {
> +				pmd_t newpmd;
> +
> +				make_migration_entry_read(&entry);
> +				newpmd = swp_entry_to_pmd(entry);
> +				set_pmd_at(mm, addr, pmd, newpmd);
> +			}
> +			goto unlock;
>  		}
>  
>  		if (!prot_numa || !pmd_protnone(*pmd)) {
> @@ -1516,6 +1571,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  			BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
>  					pmd_write(entry));
>  		}
> +unlock:
>  		spin_unlock(ptl);
>  	}
>  
> @@ -1532,7 +1588,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
>  {
>  	spinlock_t *ptl;
>  	ptl = pmd_lock(vma->vm_mm, pmd);
> -	if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> +	if (likely(pmd_related(*pmd)))
>  		return ptl;
>  	spin_unlock(ptl);
>  	return NULL;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> index 0e3828e..eaa2b02 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> @@ -274,7 +274,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	unsigned long next;
>  
>  	next = pmd_addr_end(addr, end);
> -	if (pmd_trans_huge(*pmd))
> +	if (pmd_related(*pmd))

I don't see the point in going for madvise_free_huge_pmd(), just to bail out on
!present inside.

And is it safe for devmap?

>  		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
>  			goto next;
>  
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> index 91dfc7c..ebc2c42 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> @@ -4635,6 +4635,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
>  	struct page *page = NULL;
>  	enum mc_target_type ret = MC_TARGET_NONE;
>  
> +	if (unlikely(!pmd_present(pmd)))
> +		return ret;
>  	page = pmd_page(pmd);
>  	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
>  	if (!(mc.flags & MOVE_ANON))
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> index 94b5e2c..33fa439 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> @@ -999,7 +999,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
>  	src_pmd = pmd_offset(src_pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
> +		if (pmd_related(*src_pmd)) {
>  			int err;
>  			VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
>  			err = copy_huge_pmd(dst_mm, src_mm,
> @@ -3591,6 +3591,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  		int ret;
>  
>  		barrier();
> +		if (unlikely(is_pmd_migration_entry(orig_pmd))) {
> +			pmd_migration_entry_wait(mm, fe.pmd);
> +			return 0;
> +		}
>  		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
>  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
>  				return do_huge_pmd_numa_page(&fe, orig_pmd);
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> index c5ba2aa..81186e3 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> @@ -164,6 +164,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		unsigned long this_pages;
>  
>  		next = pmd_addr_end(addr, end);
> +		if (!pmd_present(*pmd))
> +			continue;
>  		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
>  				&& pmd_none_or_clear_bad(pmd))
>  			continue;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> index da22ad2..a94a698 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> @@ -194,7 +194,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
>  		if (!new_pmd)
>  			break;
> -		if (pmd_trans_huge(*old_pmd)) {
> +		if (pmd_related(*old_pmd)) {
>  			if (extent == HPAGE_PMD_SIZE) {
>  				bool moved;
>  				/* See comment in move_ptes() */
> -- 
> 2.7.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 11/12] mm: migrate: move_pages() supports thp migration
  2016-11-07 23:31 ` [PATCH v2 11/12] mm: migrate: move_pages() supports " Naoya Horiguchi
@ 2016-11-18  0:01   ` Kirill A. Shutemov
  0 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2016-11-18  0:01 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Tue, Nov 08, 2016 at 08:31:56AM +0900, Naoya Horiguchi wrote:
> This patch enables thp migration for move_pages(2).
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  mm/migrate.c | 37 ++++++++++++++++++++++++++++---------
>  1 file changed, 28 insertions(+), 9 deletions(-)
> 
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> index 97ab8d9..6a589b9 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> @@ -1443,7 +1443,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
>  	if (PageHuge(p))
>  		return alloc_huge_page_node(page_hstate(compound_head(p)),
>  					pm->node);
> -	else
> +	else if (thp_migration_supported() && PageTransHuge(p)) {
> +		struct page *thp;
> +
> +		thp = alloc_pages_node(pm->node,
> +			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> +			HPAGE_PMD_ORDER);
> +		if (!thp)
> +			return NULL;
> +		prep_transhuge_page(thp);
> +		return thp;
> +	} else
>  		return __alloc_pages_node(pm->node,
>  				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
>  }
> @@ -1470,6 +1480,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>  	for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
>  		struct vm_area_struct *vma;
>  		struct page *page;
> +		struct page *head;
> +		unsigned int follflags;
>  
>  		err = -EFAULT;
>  		vma = find_vma(mm, pp->addr);
> @@ -1477,8 +1489,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>  			goto set_status;
>  
>  		/* FOLL_DUMP to ignore special (like zero) pages */
> -		page = follow_page(vma, pp->addr,
> -				FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
> +		follflags = FOLL_GET | FOLL_SPLIT | FOLL_DUMP;
> +		if (thp_migration_supported())
> +			follflags &= ~FOLL_SPLIT;

Nit: I would rather flip the condition -- adding a flag is easier to read
than clearing one.
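i.e.:

	follflags = FOLL_GET | FOLL_DUMP;
	if (!thp_migration_supported())
		follflags |= FOLL_SPLIT;
	page = follow_page(vma, pp->addr, follflags);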

> +		page = follow_page(vma, pp->addr, follflags);
>  
>  		err = PTR_ERR(page);
>  		if (IS_ERR(page))
> @@ -1488,7 +1502,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>  		if (!page)
>  			goto set_status;
>  
> -		pp->page = page;
>  		err = page_to_nid(page);
>  
>  		if (err == pp->node)
> @@ -1503,16 +1516,22 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>  			goto put_and_set;
>  
>  		if (PageHuge(page)) {
> -			if (PageHead(page))
> +			if (PageHead(page)) {
>  				isolate_huge_page(page, &pagelist);
> +				err = 0;
> +				pp->page = page;
> +			}
>  			goto put_and_set;
>  		}
>  
> -		err = isolate_lru_page(page);
> +		pp->page = compound_head(page);
> +		head = compound_head(page);
> +		err = isolate_lru_page(head);
>  		if (!err) {
> -			list_add_tail(&page->lru, &pagelist);
> -			inc_node_page_state(page, NR_ISOLATED_ANON +
> -					    page_is_file_cache(page));
> +			list_add_tail(&head->lru, &pagelist);
> +			mod_node_page_state(page_pgdat(head),
> +				NR_ISOLATED_ANON + page_is_file_cache(head),
> +				hpage_nr_pages(head));
>  		}
>  put_and_set:
>  		/*
> -- 
> 2.7.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/12] mm: mempolicy: mbind and migrate_pages support thp migration
  2016-11-07 23:31 ` [PATCH v2 10/12] mm: mempolicy: mbind and migrate_pages support " Naoya Horiguchi
@ 2016-11-25 12:27   ` Anshuman Khandual
  2016-11-29  7:07     ` Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: Anshuman Khandual @ 2016-11-25 12:27 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> This patch enables thp migration for mbind(2) and migrate_pages(2).
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped and doubly-mapped thp
> ---
>  mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 79 insertions(+), 29 deletions(-)
> 
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
> index 77d0668..96507ee 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
> @@ -94,6 +94,7 @@
>  #include <linux/mm_inline.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/printk.h>
> +#include <linux/swapops.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/uaccess.h>
> @@ -486,6 +487,49 @@ static inline bool queue_pages_node_check(struct page *page,
>  	return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
>  }
>  
> +static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{
> +	int ret = 0;
> +	struct page *page;
> +	struct queue_pages *qp = walk->private;
> +	unsigned long flags;
> +
> +	if (unlikely(is_pmd_migration_entry(*pmd))) {
> +		ret = 1;
> +		goto unlock;
> +	}
> +	page = pmd_page(*pmd);
> +	if (is_huge_zero_page(page)) {
> +		spin_unlock(ptl);
> +		__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
> +		goto out;
> +	}
> +	if (!thp_migration_supported()) {
> +		get_page(page);
> +		spin_unlock(ptl);
> +		lock_page(page);
> +		ret = split_huge_page(page);
> +		unlock_page(page);
> +		put_page(page);
> +		goto out;
> +	}
> +	if (queue_pages_node_check(page, qp)) {
> +		ret = 1;
> +		goto unlock;
> +	}
> +
> +	ret = 1;
> +	flags = qp->flags;
> +	/* go to thp migration */
> +	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> +		migrate_page_add(page, qp->pagelist, flags);
> +unlock:
> +	spin_unlock(ptl);
> +out:
> +	return ret;
> +}
> +
>  /*
>   * Scan through pages checking if pages follow certain conditions,
>   * and move them to the pagelist if they do.
> @@ -497,30 +541,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  	struct page *page;
>  	struct queue_pages *qp = walk->private;
>  	unsigned long flags = qp->flags;
> -	int nid, ret;
> +	int ret;
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> -	if (pmd_trans_huge(*pmd)) {
> -		ptl = pmd_lock(walk->mm, pmd);
> -		if (pmd_trans_huge(*pmd)) {
> -			page = pmd_page(*pmd);
> -			if (is_huge_zero_page(page)) {
> -				spin_unlock(ptl);
> -				__split_huge_pmd(vma, pmd, addr, false, NULL);
> -			} else {
> -				get_page(page);
> -				spin_unlock(ptl);
> -				lock_page(page);
> -				ret = split_huge_page(page);
> -				unlock_page(page);
> -				put_page(page);
> -				if (ret)
> -					return 0;
> -			}
> -		} else {
> -			spin_unlock(ptl);
> -		}
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
> +		if (ret)
> +			return 0;
>  	}

I wonder if we should introduce a pte_entry function along with the pmd_entry
function, as we first look for trans huge PMDs either to add them directly to
the migration list or to split them before looking at the PTEs.
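As a rough sketch of that idea (queue_pages_pmd_entry() and
queue_pages_pte_entry() are hypothetical names split out of today's
queue_pages_pte_range(); queue_pages_hugetlb() and queue_pages_test_walk() are
the existing callbacks):

	struct mm_walk queue_pages_walk = {
		.hugetlb_entry	= queue_pages_hugetlb,
		.pmd_entry	= queue_pages_pmd_entry,	/* thp: migrate or split */
		.pte_entry	= queue_pages_pte_entry,	/* base pages */
		.test_walk	= queue_pages_test_walk,
		.mm		= mm,
		.private	= &qp,
	};

(modulo the pagewalk details of when the pte walk is skipped for a given pmd).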

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  2016-11-07 23:31 ` [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Naoya Horiguchi
  2016-11-11 11:18   ` Kirill A. Shutemov
@ 2016-11-28 14:21   ` Michal Hocko
  2016-11-29  7:50     ` Naoya Horiguchi
  1 sibling, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2016-11-28 14:21 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On Tue 08-11-16 08:31:49, Naoya Horiguchi wrote:
> Introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
> functionality to x86_64, which should be safer as a first step.

Please make sure to describe why this has to be arch-specific and what
arches are supposed to provide in order to enable this option.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-07 23:31 ` [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration Naoya Horiguchi
                     ` (3 preceding siblings ...)
  2016-11-14 11:45   ` Kirill A. Shutemov
@ 2016-11-28 14:31   ` Michal Hocko
  2016-11-29  7:57     ` Naoya Horiguchi
  4 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2016-11-28 14:31 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On Tue 08-11-16 08:31:50, Naoya Horiguchi wrote:
> This patch prepares thp migration's core code. This code will be enabled when
> unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> to allocate destination thps.

This description is underdocumented, to say the least. Could you
provide some high-level documentation here, please?

> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
> ---
>  arch/x86/include/asm/pgtable_64.h |   2 +
>  include/linux/swapops.h           |  61 +++++++++++++++
>  mm/huge_memory.c                  | 154 ++++++++++++++++++++++++++++++++++++++
>  mm/migrate.c                      |  44 ++++++++++-
>  mm/pgtable-generic.c              |   3 +-
>  5 files changed, 262 insertions(+), 2 deletions(-)
> 
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> index 1cc82ec..3a1b48e 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> @@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>  					 ((type) << (SWP_TYPE_FIRST_BIT)) \
>  					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
> +#define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
>  #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
> +#define __swp_entry_to_pmd(x)		((pmd_t) { .pmd = (x).val })
>  
>  extern int kern_addr_valid(unsigned long addr);
>  extern void cleanup_highmap(void);
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> index 5c3a5f3..b6b22a2 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> @@ -163,6 +163,67 @@ static inline int is_write_migration_entry(swp_entry_t entry)
>  
>  #endif
>  
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +extern void set_pmd_migration_entry(struct page *page,
> +		struct vm_area_struct *vma, unsigned long address);
> +
> +extern int remove_migration_pmd(struct page *new, pmd_t *pmd,
> +		struct vm_area_struct *vma, unsigned long addr, void *old);
> +
> +extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> +	swp_entry_t arch_entry;
> +
> +	arch_entry = __pmd_to_swp_entry(pmd);
> +	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> +	swp_entry_t arch_entry;
> +
> +	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
> +	return __swp_entry_to_pmd(arch_entry);
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> +	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> +}
> +#else
> +static inline void set_pmd_migration_entry(struct page *page,
> +			struct vm_area_struct *vma, unsigned long address)
> +{
> +}
> +
> +static inline int remove_migration_pmd(struct page *new, pmd_t *pmd,
> +		struct vm_area_struct *vma, unsigned long addr, void *old)
> +{
> +	return 0;
> +}
> +
> +static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> +	return swp_entry(0, 0);
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> +	pmd_t pmd = {};
> +
> +	return pmd;
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #ifdef CONFIG_MEMORY_FAILURE
>  
>  extern atomic_long_t num_poisoned_pages __read_mostly;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> index 0509d17..b3022b3 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> @@ -2310,3 +2310,157 @@ static int __init split_huge_pages_debugfs(void)
>  }
>  late_initcall(split_huge_pages_debugfs);
>  #endif
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
> +				unsigned long addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pmd_t pmdval;
> +	swp_entry_t entry;
> +	spinlock_t *ptl;
> +
> +	pgd = pgd_offset(mm, addr);
> +	if (!pgd_present(*pgd))
> +		return;
> +	pud = pud_offset(pgd, addr);
> +	if (!pud_present(*pud))
> +		return;
> +	pmd = pmd_offset(pud, addr);
> +	pmdval = *pmd;
> +	barrier();
> +	if (!pmd_present(pmdval))
> +		return;
> +
> +	mmu_notifier_invalidate_range_start(mm, addr, addr + HPAGE_PMD_SIZE);
> +	if (pmd_trans_huge(pmdval)) {
> +		pmd_t pmdswp;
> +
> +		ptl = pmd_lock(mm, pmd);
> +		if (!pmd_present(*pmd))
> +			goto unlock_pmd;
> +		if (unlikely(!pmd_trans_huge(*pmd)))
> +			goto unlock_pmd;
> +		if (pmd_page(*pmd) != page)
> +			goto unlock_pmd;
> +
> +		pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
> +		if (pmd_dirty(pmdval))
> +			set_page_dirty(page);
> +		entry = make_migration_entry(page, pmd_write(pmdval));
> +		pmdswp = swp_entry_to_pmd(entry);
> +		pmdswp = pmd_mkhuge(pmdswp);
> +		set_pmd_at(mm, addr, pmd, pmdswp);
> +		page_remove_rmap(page, true);
> +		put_page(page);
> +unlock_pmd:
> +		spin_unlock(ptl);
> +	} else { /* pte-mapped thp */
> +		pte_t *pte;
> +		pte_t pteval;
> +		struct page *tmp = compound_head(page);
> +		unsigned long address = addr & HPAGE_PMD_MASK;
> +		pte_t swp_pte;
> +		int i;
> +
> +		pte = pte_offset_map(pmd, address);
> +		ptl = pte_lockptr(mm, pmd);
> +		spin_lock(ptl);
> +		for (i = 0; i < HPAGE_PMD_NR; i++, pte++, tmp++) {
> +			if (!(pte_present(*pte) &&
> +			      page_to_pfn(tmp) == pte_pfn(*pte)))
> +				continue;
> +			pteval = ptep_clear_flush(vma, address, pte);
> +			if (pte_dirty(pteval))
> +				set_page_dirty(tmp);
> +			entry = make_migration_entry(tmp, pte_write(pteval));
> +			swp_pte = swp_entry_to_pte(entry);
> +			set_pte_at(mm, address, pte, swp_pte);
> +			page_remove_rmap(tmp, false);
> +			put_page(tmp);
> +		}
> +		pte_unmap_unlock(pte, ptl);
> +	}
> +	mmu_notifier_invalidate_range_end(mm, addr, addr + HPAGE_PMD_SIZE);
> +	return;
> +}
> +
> +int remove_migration_pmd(struct page *new, pmd_t *pmd,
> +		struct vm_area_struct *vma, unsigned long addr, void *old)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	spinlock_t *ptl;
> +	pmd_t pmde;
> +	swp_entry_t entry;
> +
> +	pmde = *pmd;
> +	barrier();
> +
> +	if (!pmd_present(pmde)) {
> +		if (is_migration_entry(pmd_to_swp_entry(pmde))) {
> +			unsigned long mmun_start = addr & HPAGE_PMD_MASK;
> +			unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> +
> +			ptl = pmd_lock(mm, pmd);
> +			entry = pmd_to_swp_entry(*pmd);
> +			if (migration_entry_to_page(entry) != old)
> +				goto unlock_ptl;
> +			get_page(new);
> +			pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
> +			if (is_write_migration_entry(entry))
> +				pmde = maybe_pmd_mkwrite(pmde, vma);
> +			flush_cache_range(vma, mmun_start, mmun_end);
> +			page_add_anon_rmap(new, vma, mmun_start, true);
> +			pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
> +			set_pmd_at(mm, mmun_start, pmd, pmde);
> +			flush_tlb_range(vma, mmun_start, mmun_end);
> +			if (vma->vm_flags & VM_LOCKED)
> +				mlock_vma_page(new);
> +			update_mmu_cache_pmd(vma, addr, pmd);
> +unlock_ptl:
> +			spin_unlock(ptl);
> +		}
> +	} else { /* pte-mapped thp */
> +		pte_t *ptep;
> +		pte_t pte;
> +		int i;
> +		struct page *tmpnew = compound_head(new);
> +		struct page *tmpold = compound_head((struct page *)old);
> +		unsigned long address = addr & HPAGE_PMD_MASK;
> +
> +		ptep = pte_offset_map(pmd, addr);
> +		ptl = pte_lockptr(mm, pmd);
> +		spin_lock(ptl);
> +
> +		for (i = 0; i < HPAGE_PMD_NR;
> +		     i++, ptep++, tmpnew++, tmpold++, address += PAGE_SIZE) {
> +			pte = *ptep;
> +			if (!is_swap_pte(pte))
> +				continue;
> +			entry = pte_to_swp_entry(pte);
> +			if (!is_migration_entry(entry) ||
> +			    migration_entry_to_page(entry) != tmpold)
> +				continue;
> +			get_page(tmpnew);
> +			pte = pte_mkold(mk_pte(tmpnew,
> +					       READ_ONCE(vma->vm_page_prot)));
> +			if (pte_swp_soft_dirty(*ptep))
> +				pte = pte_mksoft_dirty(pte);
> +			if (is_write_migration_entry(entry))
> +				pte = maybe_mkwrite(pte, vma);
> +			flush_dcache_page(tmpnew);
> +			set_pte_at(mm, address, ptep, pte);
> +			if (PageAnon(new))
> +				page_add_anon_rmap(tmpnew, vma, address, false);
> +			else
> +				page_add_file_rmap(tmpnew, false);
> +			update_mmu_cache(vma, address, ptep);
> +		}
> +		pte_unmap_unlock(ptep, ptl);
> +	}
> +	return SWAP_AGAIN;
> +}
> +#endif
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> index 66ce6b4..54f2eb6 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> @@ -198,6 +198,8 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	swp_entry_t entry;
> +	pgd_t *pgd;
> +	pud_t *pud;
>   	pmd_t *pmd;
>  	pte_t *ptep, pte;
>   	spinlock_t *ptl;
> @@ -208,10 +210,29 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  			goto out;
>  		ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
>  	} else {
> -		pmd = mm_find_pmd(mm, addr);
> +		pmd_t pmde;
> +
> +		pgd = pgd_offset(mm, addr);
> +		if (!pgd_present(*pgd))
> +			goto out;
> +		pud = pud_offset(pgd, addr);
> +		if (!pud_present(*pud))
> +			goto out;
> +		pmd = pmd_offset(pud, addr);
>  		if (!pmd)
>  			goto out;
>  
> +		if (PageTransCompound(new)) {
> +			remove_migration_pmd(new, pmd, vma, addr, old);
> +			goto out;
> +		}
> +
> +		pmde = *pmd;
> +		barrier();
> +
> +		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> +			goto out;
> +
>  		ptep = pte_offset_map(pmd, addr);
>  
>  		/*
> @@ -344,6 +365,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
>  	__migration_entry_wait(mm, pte, ptl);
>  }
>  
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
> +{
> +	spinlock_t *ptl;
> +	struct page *page;
> +
> +	ptl = pmd_lock(mm, pmd);
> +	if (!is_pmd_migration_entry(*pmd))
> +		goto unlock;
> +	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
> +	if (!get_page_unless_zero(page))
> +		goto unlock;
> +	spin_unlock(ptl);
> +	wait_on_page_locked(page);
> +	put_page(page);
> +	return;
> +unlock:
> +	spin_unlock(ptl);
> +}
> +#endif
> +
>  #ifdef CONFIG_BLOCK
>  /* Returns true if all buffers are successfully locked */
>  static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> index 71c5f91..6012343 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
>  {
>  	pmd_t pmd;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> +	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> +		  !pmd_devmap(*pmdp));
>  	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
>  	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>  	return pmd;
> -- 
> 2.7.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/12] mm: thp: enable thp migration in generic path
  2016-11-07 23:31 ` [PATCH v2 06/12] mm: thp: enable thp migration in generic path Naoya Horiguchi
@ 2016-11-28 14:33   ` Michal Hocko
  2016-11-29  8:16     ` Naoya Horiguchi
  0 siblings, 1 reply; 71+ messages in thread
From: Michal Hocko @ 2016-11-28 14:33 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On Tue 08-11-16 08:31:51, Naoya Horiguchi wrote:
> This patch makes it possible to support thp migration gradually. If you fail
> to allocate a destination page as a thp, you just split the source thp as we
> do now, and then enter the normal page migration. If you succeed to allocate
> destination thp, you enter thp migration. Subsequent patches actually enable
> thp migration for each caller of page migration by allowing its get_new_page()
> callback to allocate thps.

Does this need to be in a separate patch? Wouldn't it make more sense to
have the full THP migration code in a single one?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-07 23:31 ` [PATCH v2 07/12] mm: thp: check pmd migration entry in common path Naoya Horiguchi
                     ` (3 preceding siblings ...)
  2016-11-17 23:56   ` Kirill A. Shutemov
@ 2016-11-28 14:35   ` Michal Hocko
  4 siblings, 0 replies; 71+ messages in thread
From: Michal Hocko @ 2016-11-28 14:35 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On Tue 08-11-16 08:31:52, Naoya Horiguchi wrote:
> If one of callers of page migration starts to handle thp, memory management code
> start to see pmd migration entry, so we need to prepare for it before enabling.
> This patch changes various code point which checks the status of given pmds in
> order to prevent race between thp migration and the pmd-related works.

Please be much more verbose about which code paths those are and why we
care. The above wording didn't really help me understand either the
problem or the solution until I started to stare into the code. And even
then I am not sure how to know that all places have been handled
properly.

> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog v1 -> v2:
> - introduce pmd_related() (I know the naming is not good, but can't think up
>   no better name. Any suggesntion is welcomed.)
> ---
>  arch/x86/mm/gup.c       |  4 +--
>  fs/proc/task_mmu.c      | 23 +++++++------
>  include/linux/huge_mm.h |  9 ++++-
>  mm/gup.c                | 10 ++++--
>  mm/huge_memory.c        | 88 ++++++++++++++++++++++++++++++++++++++++---------
>  mm/madvise.c            |  2 +-
>  mm/memcontrol.c         |  2 ++
>  mm/memory.c             |  6 +++-
>  mm/mprotect.c           |  2 ++
>  mm/mremap.c             |  2 +-
>  10 files changed, 114 insertions(+), 34 deletions(-)
> 
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> index 0d4fb3e..78a153d 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  		pmd_t pmd = *pmdp;
>  
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_none(pmd))
> +		if (!pmd_present(pmd))
>  			return 0;
> -		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
> +		if (unlikely(pmd_large(pmd))) {
>  			/*
>  			 * NUMA hinting faults need to be handled in the GUP
>  			 * slowpath for accounting purposes and so that they
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> index 35b92d8..c1f9cf4 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  
>  	ptl = pmd_trans_huge_lock(pmd, vma);
>  	if (ptl) {
> -		smaps_pmd_entry(pmd, addr, walk);
> +		if (pmd_present(*pmd))
> +			smaps_pmd_entry(pmd, addr, walk);
>  		spin_unlock(ptl);
>  		return 0;
>  	}
> @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  			goto out;
>  		}
>  
> +		if (!pmd_present(*pmd))
> +			goto out;
> +
>  		page = pmd_page(*pmd);
>  
>  		/* Clear accessed and referenced bits. */
> @@ -1208,19 +1212,18 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>  	if (ptl) {
>  		u64 flags = 0, frame = 0;
>  		pmd_t pmd = *pmdp;
> +		struct page *page;
>  
>  		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
>  			flags |= PM_SOFT_DIRTY;
>  
> -		/*
> -		 * Currently pmd for thp is always present because thp
> -		 * can not be swapped-out, migrated, or HWPOISONed
> -		 * (split in such cases instead.)
> -		 * This if-check is just to prepare for future implementation.
> -		 */
> -		if (pmd_present(pmd)) {
> -			struct page *page = pmd_page(pmd);
> -
> +		if (is_pmd_migration_entry(pmd)) {
> +			swp_entry_t entry = pmd_to_swp_entry(pmd);
> +			frame = swp_type(entry) |
> +				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> +			page = migration_entry_to_page(entry);
> +		} else if (pmd_present(pmd)) {
> +			page = pmd_page(pmd);
>  			if (page_mapcount(page) == 1)
>  				flags |= PM_MMAP_EXCLUSIVE;
>  
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> index fcbca51..3c252cd 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> @@ -125,12 +125,19 @@ extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
>  				    long adjust_next);
>  extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
>  		struct vm_area_struct *vma);
> +
> +static inline int pmd_related(pmd_t pmd)
> +{
> +	return !pmd_none(pmd) &&
> +		(!pmd_present(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd));
> +}
> +
>  /* mmap_sem must be held on entry */
>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>  		struct vm_area_struct *vma)
>  {
>  	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> +	if (pmd_related(*pmd))
>  		return __pmd_trans_huge_lock(pmd, vma);
>  	else
>  		return NULL;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> index e50178c..2dc4978 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> @@ -267,6 +267,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  	}
>  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>  		return no_page_table(vma, flags);
> +	if (!pmd_present(*pmd))
> +		return no_page_table(vma, flags);
>  	if (pmd_devmap(*pmd)) {
>  		ptl = pmd_lock(mm, pmd);
>  		page = follow_devmap_pmd(vma, address, pmd, flags);
> @@ -278,6 +280,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  		return follow_page_pte(vma, address, pmd, flags);
>  
>  	ptl = pmd_lock(mm, pmd);
> +	if (unlikely(!pmd_present(*pmd))) {
> +		spin_unlock(ptl);
> +		return no_page_table(vma, flags);
> +	}
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>  		spin_unlock(ptl);
>  		return follow_page_pte(vma, address, pmd, flags);
> @@ -333,7 +339,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
>  	pud = pud_offset(pgd, address);
>  	BUG_ON(pud_none(*pud));
>  	pmd = pmd_offset(pud, address);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		return -EFAULT;
>  	VM_BUG_ON(pmd_trans_huge(*pmd));
>  	pte = pte_offset_map(pmd, address);
> @@ -1357,7 +1363,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  		pmd_t pmd = READ_ONCE(*pmdp);
>  
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_none(pmd))
> +		if (!pmd_present(pmd))
>  			return 0;
>  
>  		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> index b3022b3..4e9090c 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> @@ -825,6 +825,20 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  
>  	ret = -EAGAIN;
>  	pmd = *src_pmd;
> +
> +	if (unlikely(is_pmd_migration_entry(pmd))) {
> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +		if (is_write_migration_entry(entry)) {
> +			make_migration_entry_read(&entry);
> +			pmd = swp_entry_to_pmd(entry);
> +			set_pmd_at(src_mm, addr, src_pmd, pmd);
> +		}
> +		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> +		ret = 0;
> +		goto out_unlock;
> +	}
> +
>  	if (unlikely(!pmd_trans_huge(pmd))) {
>  		pte_free(dst_mm, pgtable);
>  		goto out_unlock;
> @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
>  	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
>  		goto out_unlock;
>  
> +	if (unlikely(!pmd_present(orig_pmd)))
> +		goto out_unlock;
> +
>  	page = pmd_page(orig_pmd);
>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>  	/*
> @@ -1137,7 +1154,14 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>  		goto out;
>  
> -	page = pmd_page(*pmd);
> +	if (is_pmd_migration_entry(*pmd)) {
> +		swp_entry_t entry;
> +		entry = pmd_to_swp_entry(*pmd);
> +		page = pfn_to_page(swp_offset(entry));
> +		if (!is_migration_entry(entry))
> +			goto out;
> +	} else
> +		page = pmd_page(*pmd);
>  	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>  	if (flags & FOLL_TOUCH)
>  		touch_pmd(vma, addr, pmd);
> @@ -1332,6 +1356,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	if (is_huge_zero_pmd(orig_pmd))
>  		goto out;
>  
> +	if (unlikely(!pmd_present(orig_pmd)))
> +		goto out;
> +
>  	page = pmd_page(orig_pmd);
>  	/*
>  	 * If other processes are mapping this page, we couldn't discard
> @@ -1410,20 +1437,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
>  	} else {
>  		struct page *page = pmd_page(orig_pmd);
> -		page_remove_rmap(page, true);
> -		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> -		VM_BUG_ON_PAGE(!PageHead(page), page);
> -		if (PageAnon(page)) {
> -			pgtable_t pgtable;
> -			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> -			pte_free(tlb->mm, pgtable);
> -			atomic_long_dec(&tlb->mm->nr_ptes);
> -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +		int migration = 0;
> +
> +		if (!is_pmd_migration_entry(orig_pmd)) {
> +			page_remove_rmap(page, true);
> +			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> +			VM_BUG_ON_PAGE(!PageHead(page), page);
> +			if (PageAnon(page)) {
> +				pgtable_t pgtable;
> +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> +								      pmd);
> +				pte_free(tlb->mm, pgtable);
> +				atomic_long_dec(&tlb->mm->nr_ptes);
> +				add_mm_counter(tlb->mm, MM_ANONPAGES,
> +					       -HPAGE_PMD_NR);
> +			} else {
> +				add_mm_counter(tlb->mm, MM_FILEPAGES,
> +					       -HPAGE_PMD_NR);
> +			}
>  		} else {
> -			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> +			swp_entry_t entry;
> +
> +			entry = pmd_to_swp_entry(orig_pmd);
> +			free_swap_and_cache(entry); /* waring in failure? */
> +			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +			migration = 1;
>  		}
>  		spin_unlock(ptl);
> -		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> +		if (!migration)
> +			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
>  	}
>  	return 1;
>  }
> @@ -1496,14 +1538,27 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		bool preserve_write = prot_numa && pmd_write(*pmd);
>  		ret = 1;
>  
> +		if (!pmd_present(*pmd))
> +			goto unlock;
>  		/*
>  		 * Avoid trapping faults against the zero page. The read-only
>  		 * data is likely to be read-cached on the local CPU and
>  		 * local/remote hits to the zero page are not interesting.
>  		 */
> -		if (prot_numa && is_huge_zero_pmd(*pmd)) {
> -			spin_unlock(ptl);
> -			return ret;
> +		if (prot_numa && is_huge_zero_pmd(*pmd))
> +			goto unlock;
> +
> +		if (is_pmd_migration_entry(*pmd)) {
> +			swp_entry_t entry = pmd_to_swp_entry(*pmd);
> +
> +			if (is_write_migration_entry(entry)) {
> +				pmd_t newpmd;
> +
> +				make_migration_entry_read(&entry);
> +				newpmd = swp_entry_to_pmd(entry);
> +				set_pmd_at(mm, addr, pmd, newpmd);
> +			}
> +			goto unlock;
>  		}
>  
>  		if (!prot_numa || !pmd_protnone(*pmd)) {
> @@ -1516,6 +1571,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  			BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
>  					pmd_write(entry));
>  		}
> +unlock:
>  		spin_unlock(ptl);
>  	}
>  
> @@ -1532,7 +1588,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
>  {
>  	spinlock_t *ptl;
>  	ptl = pmd_lock(vma->vm_mm, pmd);
> -	if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> +	if (likely(pmd_related(*pmd)))
>  		return ptl;
>  	spin_unlock(ptl);
>  	return NULL;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> index 0e3828e..eaa2b02 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> @@ -274,7 +274,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	unsigned long next;
>  
>  	next = pmd_addr_end(addr, end);
> -	if (pmd_trans_huge(*pmd))
> +	if (pmd_related(*pmd))
>  		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
>  			goto next;
>  
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> index 91dfc7c..ebc2c42 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> @@ -4635,6 +4635,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
>  	struct page *page = NULL;
>  	enum mc_target_type ret = MC_TARGET_NONE;
>  
> +	if (unlikely(!pmd_present(pmd)))
> +		return ret;
>  	page = pmd_page(pmd);
>  	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
>  	if (!(mc.flags & MOVE_ANON))
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> index 94b5e2c..33fa439 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> @@ -999,7 +999,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
>  	src_pmd = pmd_offset(src_pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
> +		if (pmd_related(*src_pmd)) {
>  			int err;
>  			VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
>  			err = copy_huge_pmd(dst_mm, src_mm,
> @@ -3591,6 +3591,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  		int ret;
>  
>  		barrier();
> +		if (unlikely(is_pmd_migration_entry(orig_pmd))) {
> +			pmd_migration_entry_wait(mm, fe.pmd);
> +			return 0;
> +		}
>  		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
>  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
>  				return do_huge_pmd_numa_page(&fe, orig_pmd);
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> index c5ba2aa..81186e3 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> @@ -164,6 +164,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		unsigned long this_pages;
>  
>  		next = pmd_addr_end(addr, end);
> +		if (!pmd_present(*pmd))
> +			continue;
>  		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
>  				&& pmd_none_or_clear_bad(pmd))
>  			continue;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> index da22ad2..a94a698 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> @@ -194,7 +194,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
>  		if (!new_pmd)
>  			break;
> -		if (pmd_trans_huge(*old_pmd)) {
> +		if (pmd_related(*old_pmd)) {
>  			if (extent == HPAGE_PMD_SIZE) {
>  				bool moved;
>  				/* See comment in move_ptes() */
> -- 
> 2.7.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/12] mm: thp: check pmd migration entry in common path
  2016-11-17 23:56   ` Kirill A. Shutemov
@ 2016-11-29  6:46     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-29  6:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

# sorry for the late reply ...

On Fri, Nov 18, 2016 at 02:56:24AM +0300, Kirill A. Shutemov wrote:
> On Tue, Nov 08, 2016 at 08:31:52AM +0900, Naoya Horiguchi wrote:
> > If one of callers of page migration starts to handle thp, memory management code
> > start to see pmd migration entry, so we need to prepare for it before enabling.
> > This patch changes various code point which checks the status of given pmds in
> > order to prevent race between thp migration and the pmd-related works.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> > ChangeLog v1 -> v2:
> > - introduce pmd_related() (I know the naming is not good, but can't think up
> >   no better name. Any suggesntion is welcomed.)
> > ---
> >  arch/x86/mm/gup.c       |  4 +--
> >  fs/proc/task_mmu.c      | 23 +++++++------
> >  include/linux/huge_mm.h |  9 ++++-
> >  mm/gup.c                | 10 ++++--
> >  mm/huge_memory.c        | 88 ++++++++++++++++++++++++++++++++++++++++---------
> >  mm/madvise.c            |  2 +-
> >  mm/memcontrol.c         |  2 ++
> >  mm/memory.c             |  6 +++-
> >  mm/mprotect.c           |  2 ++
> >  mm/mremap.c             |  2 +-
> >  10 files changed, 114 insertions(+), 34 deletions(-)
> > 
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> > index 0d4fb3e..78a153d 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> > @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> >  		pmd_t pmd = *pmdp;
> >  
> >  		next = pmd_addr_end(addr, end);
> > -		if (pmd_none(pmd))
> > +		if (!pmd_present(pmd))
> >  			return 0;
> > -		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
> > +		if (unlikely(pmd_large(pmd))) {
> >  			/*
> >  			 * NUMA hinting faults need to be handled in the GUP
> >  			 * slowpath for accounting purposes and so that they
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> > index 35b92d8..c1f9cf4 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> > @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >  
> >  	ptl = pmd_trans_huge_lock(pmd, vma);
> >  	if (ptl) {
> > -		smaps_pmd_entry(pmd, addr, walk);
> > +		if (pmd_present(*pmd))
> > +			smaps_pmd_entry(pmd, addr, walk);
> >  		spin_unlock(ptl);
> >  		return 0;
> >  	}
> > @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> >  			goto out;
> >  		}
> >  
> > +		if (!pmd_present(*pmd))
> > +			goto out;
> > +
> 
> Hm. Looks like clear_soft_dirty_pmd() should handle !present. It doesn't.
> 
> Ah.. Found it in 08/12.
> 
> >  		page = pmd_page(*pmd);
> >  
> >  		/* Clear accessed and referenced bits. */
> > @@ -1208,19 +1212,18 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> >  	if (ptl) {
> >  		u64 flags = 0, frame = 0;
> >  		pmd_t pmd = *pmdp;
> > +		struct page *page;
> >  
> >  		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
> >  			flags |= PM_SOFT_DIRTY;
> >  
> > -		/*
> > -		 * Currently pmd for thp is always present because thp
> > -		 * can not be swapped-out, migrated, or HWPOISONed
> > -		 * (split in such cases instead.)
> > -		 * This if-check is just to prepare for future implementation.
> > -		 */
> > -		if (pmd_present(pmd)) {
> > -			struct page *page = pmd_page(pmd);
> > -
> > +		if (is_pmd_migration_entry(pmd)) {
> > +			swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +			frame = swp_type(entry) |
> > +				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> > +			page = migration_entry_to_page(entry);
> > +		} else if (pmd_present(pmd)) {
> > +			page = pmd_page(pmd);
> >  			if (page_mapcount(page) == 1)
> >  				flags |= PM_MMAP_EXCLUSIVE;
> >  
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> > index fcbca51..3c252cd 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> > @@ -125,12 +125,19 @@ extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
> >  				    long adjust_next);
> >  extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
> >  		struct vm_area_struct *vma);
> > +
> > +static inline int pmd_related(pmd_t pmd)
> > +{
> > +	return !pmd_none(pmd) &&
> > +		(!pmd_present(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd));
> > +}
> > +
> 
> I would rather create is_swap_pmd() -- (!none && !present) and leave the
> rest open-coded.

OK, I'll do this.
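Something like the following, I suppose (just a sketch):

	static inline int is_swap_pmd(pmd_t pmd)
	{
		return !pmd_none(pmd) && !pmd_present(pmd);
	}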

> 
> >  /* mmap_sem must be held on entry */
> >  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
> >  		struct vm_area_struct *vma)
> >  {
> >  	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
> > -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> > +	if (pmd_related(*pmd))
> >  		return __pmd_trans_huge_lock(pmd, vma);
> >  	else
> >  		return NULL;
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> > index e50178c..2dc4978 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> > @@ -267,6 +267,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> >  	}
> >  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> >  		return no_page_table(vma, flags);
> > +	if (!pmd_present(*pmd))
> > +		return no_page_table(vma, flags);
> 
> Don't we need FOLL_MIGRATION, like on pte side?

That's better, applied.

> >  	if (pmd_devmap(*pmd)) {
> >  		ptl = pmd_lock(mm, pmd);
> >  		page = follow_devmap_pmd(vma, address, pmd, flags);
> > @@ -278,6 +280,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> >  		return follow_page_pte(vma, address, pmd, flags);
> >  
> >  	ptl = pmd_lock(mm, pmd);
> > +	if (unlikely(!pmd_present(*pmd))) {
> > +		spin_unlock(ptl);
> > +		return no_page_table(vma, flags);
> > +	}
> 
> Ditto.
> 
> >  	if (unlikely(!pmd_trans_huge(*pmd))) {
> >  		spin_unlock(ptl);
> >  		return follow_page_pte(vma, address, pmd, flags);
> > @@ -333,7 +339,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
> >  	pud = pud_offset(pgd, address);
> >  	BUG_ON(pud_none(*pud));
> >  	pmd = pmd_offset(pud, address);
> > -	if (pmd_none(*pmd))
> > +	if (!pmd_present(*pmd))
> >  		return -EFAULT;
> >  	VM_BUG_ON(pmd_trans_huge(*pmd));
> >  	pte = pte_offset_map(pmd, address);
> > @@ -1357,7 +1363,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> >  		pmd_t pmd = READ_ONCE(*pmdp);
> >  
> >  		next = pmd_addr_end(addr, end);
> > -		if (pmd_none(pmd))
> > +		if (!pmd_present(pmd))
> >  			return 0;
> >  
> >  		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> > index b3022b3..4e9090c 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> > @@ -825,6 +825,20 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  
> >  	ret = -EAGAIN;
> >  	pmd = *src_pmd;
> > +
> > +	if (unlikely(is_pmd_migration_entry(pmd))) {
> > +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > +		if (is_write_migration_entry(entry)) {
> > +			make_migration_entry_read(&entry);
> > +			pmd = swp_entry_to_pmd(entry);
> > +			set_pmd_at(src_mm, addr, src_pmd, pmd);
> > +		}
> 
> I think we should put at least WARN_ONCE() in 'else' here. We don't want
> to miss such places when swap will be supported (or other swap entry type).

I guess you mean the 'else' branch of the outer if-block, right?

> > +		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> > +		ret = 0;
> > +		goto out_unlock;
> > +	}

Maybe inserting the below here would be OK:

        WARN_ONCE(!pmd_present(pmd), "Unknown non-present format on pmd.\n");

> > +
> >  	if (unlikely(!pmd_trans_huge(pmd))) {
> >  		pte_free(dst_mm, pgtable);
> >  		goto out_unlock;
> > @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> >  	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> >  		goto out_unlock;
> >  
> > +	if (unlikely(!pmd_present(orig_pmd)))
> > +		goto out_unlock;
> > +
> >  	page = pmd_page(orig_pmd);
> >  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> >  	/*
> > @@ -1137,7 +1154,14 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> >  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> >  		goto out;
> >  
> > -	page = pmd_page(*pmd);
> > +	if (is_pmd_migration_entry(*pmd)) {
> > +		swp_entry_t entry;
> > +		entry = pmd_to_swp_entry(*pmd);
> > +		page = pfn_to_page(swp_offset(entry));
> > +		if (!is_migration_entry(entry))
> > +			goto out;
> 
> follow_page_pte() does a different thing: it waits for the page to be
> migrated and retries. Any reason you don't do the same?

No, my self-check just wasn't thorough enough.
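
For reference, the wait-and-retry you mention would look roughly like this
on the pmd side of follow_page_mask() (just a sketch, the label name is
made up):

	retry_locked:
		ptl = pmd_lock(mm, pmd);
		if (unlikely(!pmd_present(*pmd))) {
			spin_unlock(ptl);
			if (likely(!(flags & FOLL_MIGRATION)))
				return no_page_table(vma, flags);
			pmd_migration_entry_wait(mm, pmd);
			/* migration may have finished, so look up *pmd again */
			goto retry_locked;
		}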

> > +	} else
> > +		page = pmd_page(*pmd);
> >  	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> >  	if (flags & FOLL_TOUCH)
> >  		touch_pmd(vma, addr, pmd);
> > @@ -1332,6 +1356,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  	if (is_huge_zero_pmd(orig_pmd))
> >  		goto out;
> >  
> > +	if (unlikely(!pmd_present(orig_pmd)))
> > +		goto out;
> > +
> >  	page = pmd_page(orig_pmd);
> >  	/*
> >  	 * If other processes are mapping this page, we couldn't discard
> > @@ -1410,20 +1437,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> >  	} else {
> >  		struct page *page = pmd_page(orig_pmd);
> > -		page_remove_rmap(page, true);
> > -		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> > -		VM_BUG_ON_PAGE(!PageHead(page), page);
> > -		if (PageAnon(page)) {
> > -			pgtable_t pgtable;
> > -			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> > -			pte_free(tlb->mm, pgtable);
> > -			atomic_long_dec(&tlb->mm->nr_ptes);
> > -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > +		int migration = 0;
> > +
> > +		if (!is_pmd_migration_entry(orig_pmd)) {
> > +			page_remove_rmap(page, true);
> > +			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> > +			VM_BUG_ON_PAGE(!PageHead(page), page);
> > +			if (PageAnon(page)) {
> > +				pgtable_t pgtable;
> > +				pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> > +								      pmd);
> > +				pte_free(tlb->mm, pgtable);
> > +				atomic_long_dec(&tlb->mm->nr_ptes);
> > +				add_mm_counter(tlb->mm, MM_ANONPAGES,
> > +					       -HPAGE_PMD_NR);
> > +			} else {
> > +				add_mm_counter(tlb->mm, MM_FILEPAGES,
> > +					       -HPAGE_PMD_NR);
> > +			}
> >  		} else {
> > -			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> > +			swp_entry_t entry;
> > +
> > +			entry = pmd_to_swp_entry(orig_pmd);
> > +			free_swap_and_cache(entry); /* waring in failure? */
> > +			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > +			migration = 1;
> >  		}
> >  		spin_unlock(ptl);
> > -		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> > +		if (!migration)
> > +			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> >  	}
> >  	return 1;
> >  }
> > @@ -1496,14 +1538,27 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >  		bool preserve_write = prot_numa && pmd_write(*pmd);
> >  		ret = 1;
> >  
> > +		if (!pmd_present(*pmd))
> > +			goto unlock;
> >  		/*
> >  		 * Avoid trapping faults against the zero page. The read-only
> >  		 * data is likely to be read-cached on the local CPU and
> >  		 * local/remote hits to the zero page are not interesting.
> >  		 */
> > -		if (prot_numa && is_huge_zero_pmd(*pmd)) {
> > -			spin_unlock(ptl);
> > -			return ret;
> > +		if (prot_numa && is_huge_zero_pmd(*pmd))
> > +			goto unlock;
> > +
> > +		if (is_pmd_migration_entry(*pmd)) {
> 
> Hm? But we filtered out !present above?

The !present check must come after this if-block, together with the WARN_ONCE().
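
Something along these lines (sketch only):

	if (is_pmd_migration_entry(*pmd)) {
		/* downgrade write migration entries as above */
		...
		goto unlock;
	}
	/* any other non-present pmd is unexpected here */
	if (WARN_ONCE(!pmd_present(*pmd),
		      "Unknown non-present format on pmd.\n"))
		goto unlock;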

> > +			swp_entry_t entry = pmd_to_swp_entry(*pmd);
> > +
> > +			if (is_write_migration_entry(entry)) {
> > +				pmd_t newpmd;
> > +
> > +				make_migration_entry_read(&entry);
> > +				newpmd = swp_entry_to_pmd(entry);
> > +				set_pmd_at(mm, addr, pmd, newpmd);
> > +			}
> > +			goto unlock;
> >  		}
> >  
> >  		if (!prot_numa || !pmd_protnone(*pmd)) {
> > @@ -1516,6 +1571,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >  			BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
> >  					pmd_write(entry));
> >  		}
> > +unlock:
> >  		spin_unlock(ptl);
> >  	}
> >  
> > @@ -1532,7 +1588,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
> >  {
> >  	spinlock_t *ptl;
> >  	ptl = pmd_lock(vma->vm_mm, pmd);
> > -	if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> > +	if (likely(pmd_related(*pmd)))
> >  		return ptl;
> >  	spin_unlock(ptl);
> >  	return NULL;
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> > index 0e3828e..eaa2b02 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> > @@ -274,7 +274,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  	unsigned long next;
> >  
> >  	next = pmd_addr_end(addr, end);
> > -	if (pmd_trans_huge(*pmd))
> > +	if (pmd_related(*pmd))
> 
> I don't see a point going for madvise_free_huge_pmd(), just to fall off on
> !present inside.

Sorry, this code was wrong. I should simply have done something like the following:

+	if (!pmd_present(*pmd))
+		return 0;
 	if (pmd_trans_huge(*pmd))
 		...

> And is it safe for devmap?

I'm not sure, so let's keep it as-is.

Thanks,
Naoya Horiguchi

> 
> >  		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
> >  			goto next;
> >  
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> > index 91dfc7c..ebc2c42 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> > @@ -4635,6 +4635,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
> >  	struct page *page = NULL;
> >  	enum mc_target_type ret = MC_TARGET_NONE;
> >  
> > +	if (unlikely(!pmd_present(pmd)))
> > +		return ret;
> >  	page = pmd_page(pmd);
> >  	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
> >  	if (!(mc.flags & MOVE_ANON))
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> > index 94b5e2c..33fa439 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> > @@ -999,7 +999,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
> >  	src_pmd = pmd_offset(src_pud, addr);
> >  	do {
> >  		next = pmd_addr_end(addr, end);
> > -		if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
> > +		if (pmd_related(*src_pmd)) {
> >  			int err;
> >  			VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
> >  			err = copy_huge_pmd(dst_mm, src_mm,
> > @@ -3591,6 +3591,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >  		int ret;
> >  
> >  		barrier();
> > +		if (unlikely(is_pmd_migration_entry(orig_pmd))) {
> > +			pmd_migration_entry_wait(mm, fe.pmd);
> > +			return 0;
> > +		}
> >  		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
> >  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> >  				return do_huge_pmd_numa_page(&fe, orig_pmd);
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> > index c5ba2aa..81186e3 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> > @@ -164,6 +164,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  		unsigned long this_pages;
> >  
> >  		next = pmd_addr_end(addr, end);
> > +		if (!pmd_present(*pmd))
> > +			continue;
> >  		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> >  				&& pmd_none_or_clear_bad(pmd))
> >  			continue;
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> > index da22ad2..a94a698 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> > @@ -194,7 +194,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
> >  		if (!new_pmd)
> >  			break;
> > -		if (pmd_trans_huge(*old_pmd)) {
> > +		if (pmd_related(*old_pmd)) {
> >  			if (extent == HPAGE_PMD_SIZE) {
> >  				bool moved;
> >  				/* See comment in move_ptes() */
> > -- 
> > 2.7.0
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> -- 
>  Kirill A. Shutemov
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 10/12] mm: mempolicy: mbind and migrate_pages support thp migration
  2016-11-25 12:27   ` Anshuman Khandual
@ 2016-11-29  7:07     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-29  7:07 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Pavel Emelyanov, Zi Yan, Balbir Singh,
	linux-kernel, Naoya Horiguchi

On Fri, Nov 25, 2016 at 05:57:20PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
...
> > @@ -497,30 +541,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> >  	struct page *page;
> >  	struct queue_pages *qp = walk->private;
> >  	unsigned long flags = qp->flags;
> > -	int nid, ret;
> > +	int ret;
> >  	pte_t *pte;
> >  	spinlock_t *ptl;
> >  
> > -	if (pmd_trans_huge(*pmd)) {
> > -		ptl = pmd_lock(walk->mm, pmd);
> > -		if (pmd_trans_huge(*pmd)) {
> > -			page = pmd_page(*pmd);
> > -			if (is_huge_zero_page(page)) {
> > -				spin_unlock(ptl);
> > -				__split_huge_pmd(vma, pmd, addr, false, NULL);
> > -			} else {
> > -				get_page(page);
> > -				spin_unlock(ptl);
> > -				lock_page(page);
> > -				ret = split_huge_page(page);
> > -				unlock_page(page);
> > -				put_page(page);
> > -				if (ret)
> > -					return 0;
> > -			}
> > -		} else {
> > -			spin_unlock(ptl);
> > -		}
> > +	ptl = pmd_trans_huge_lock(pmd, vma);
> > +	if (ptl) {
> > +		ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
> > +		if (ret)
> > +			return 0;
> >  	}
> 
> I wonder if we should introduce pte_entry function along with pmd_entry
> function as we are first looking for trans huge PMDs either for direct
> addition into the migration list or splitting it before looking for PTEs.

Most pagewalk users don't define pte_entry for performance reasons
(to avoid the overhead of PTRS_PER_PMD function calls).
But that could be a nice cleanup if we find a workaround.
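
To illustrate the point (hypothetical example, not from this series): a
.pmd_entry callback usually open-codes the pte loop itself, so there is a
single callback invocation per pmd instead of one per pte:

	static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
				     unsigned long end, struct mm_walk *walk)
	{
		spinlock_t *ptl;
		pte_t *pte;

		/* thp handling omitted for brevity */
		pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
		for (; addr != end; pte++, addr += PAGE_SIZE) {
			/* per-pte work goes here, with no extra function call */
		}
		pte_unmap_unlock(pte - 1, ptl);
		return 0;
	}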

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  2016-11-28 14:21   ` Michal Hocko
@ 2016-11-29  7:50     ` Naoya Horiguchi
  2016-11-29  8:45       ` Michal Hocko
  0 siblings, 1 reply; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-29  7:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On Mon, Nov 28, 2016 at 03:21:54PM +0100, Michal Hocko wrote:
> On Tue 08-11-16 08:31:49, Naoya Horiguchi wrote:
> > Introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
> > functionality to x86_64, which should be safer at the first step.
> 
> Please make sure to describe why this has to be arch specific and what
> are arches supposed to provide in order to enable this option.

OK, the text below will be added in a future version:

  Thp migration is an arch-specific feature because it depends on the
  arch-dependent format of non-present page table entries.
  What you need to do to enable this option on another arch is:
  - to define arch-specific transformation functions like __pmd_to_swp_entry()
    and __swp_entry_to_pmd(),
  - to make sure that arch-specific page table walking code can properly handle
    the !pmd_present case (gup_pmd_range() is a good example),
  - (if your arch enables CONFIG_HAVE_ARCH_SOFT_DIRTY,) to define soft-dirty
    routines like pmd_swp_mksoft_dirty().
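
For reference, roughly what these can look like on x86_64 (an illustrative
sketch, not necessarily the exact definitions in the series):

	#define __pmd_to_swp_entry(pmd)	((swp_entry_t) { pmd_val((pmd)) })
	#define __swp_entry_to_pmd(x)	((pmd_t) { .pmd = (x).val })

	static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
	{
		return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
	}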

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration
  2016-11-28 14:31   ` Michal Hocko
@ 2016-11-29  7:57     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-29  7:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On Mon, Nov 28, 2016 at 03:31:32PM +0100, Michal Hocko wrote:
> On Tue 08-11-16 08:31:50, Naoya Horiguchi wrote:
> > This patch prepares thp migration's core code. These code will be open when
> > unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> > to allocate destination thps.
> 
> this description is underdocumented to say the least. Could you
> provide a high level documentation here please?

Yes, I'll do that.

And some update to Documentation/vm/page_migration will probably be wanted,
so I'll do that too.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/12] mm: thp: enable thp migration in generic path
  2016-11-28 14:33   ` Michal Hocko
@ 2016-11-29  8:16     ` Naoya Horiguchi
  0 siblings, 0 replies; 71+ messages in thread
From: Naoya Horiguchi @ 2016-11-29  8:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On Mon, Nov 28, 2016 at 03:33:32PM +0100, Michal Hocko wrote:
> On Tue 08-11-16 08:31:51, Naoya Horiguchi wrote:
> > This patch makes it possible to support thp migration gradually. If you fail
> > to allocate a destination page as a thp, you just split the source thp as we
> > do now, and then enter the normal page migration. If you succeed to allocate
> > destination thp, you enter thp migration. Subsequent patches actually enable
> > thp migration for each caller of page migration by allowing its get_new_page()
> > callback to allocate thps.
> 
> Does this need to be in a separate patch? Wouldn't it make more sense to
> have the full THP migration code in a single one?

Actually, there's no big reason for a separate patch, so I'm OK with merging
this into patch 5/12.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  2016-11-29  7:50     ` Naoya Horiguchi
@ 2016-11-29  8:45       ` Michal Hocko
  0 siblings, 0 replies; 71+ messages in thread
From: Michal Hocko @ 2016-11-29  8:45 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Kirill A. Shutemov, Hugh Dickins, Andrew Morton,
	Dave Hansen, Andrea Arcangeli, Mel Gorman, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi

On Tue 29-11-16 07:50:06, Naoya Horiguchi wrote:
> On Mon, Nov 28, 2016 at 03:21:54PM +0100, Michal Hocko wrote:
> > On Tue 08-11-16 08:31:49, Naoya Horiguchi wrote:
> > > Introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
> > > functionality to x86_64, which should be safer at the first step.
> > 
> > Please make sure to describe why this has to be arch specific and what
> > are arches supposed to provide in order to enable this option.
> 
> OK, the below will be added in the future version:
> 
>   Thp migration is an arch-specific feature because it depends on the
>   arch-dependent format of non-present page table entries.
>   What you need to do to enable this option on another arch is:
>   - to define arch-specific transformation functions like __pmd_to_swp_entry()
>     and __swp_entry_to_pmd(),
>   - to make sure that arch-specific page table walking code can properly handle
>     the !pmd_present case (gup_pmd_range() is a good example),
>   - (if your arch enables CONFIG_HAVE_ARCH_SOFT_DIRTY,) to define soft-dirty
>     routines like pmd_swp_mksoft_dirty().

Thanks, this is _much_ better!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
                   ` (13 preceding siblings ...)
  2016-11-09 10:33 ` Anshuman Khandual
@ 2017-01-30 10:47 ` Anshuman Khandual
  2017-01-31 19:54   ` Zi Yan
  14 siblings, 1 reply; 71+ messages in thread
From: Anshuman Khandual @ 2017-01-30 10:47 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrew Morton, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Zi Yan, Balbir Singh, linux-kernel,
	Naoya Horiguchi, Zi Yan

On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> Hi everyone,
> 
> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
> with feedbacks for ver.1.

Hello Naoya,

I have been working with Zi Yan on the parallel huge page migration series
(https://lkml.org/lkml/2016/11/22/457) and planning to post it on top of
this THP migration enhancement series. Hence we were wondering whether you
have plans to post a new version of this series in the near future?

Regards
Anshuman

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2017-01-30 10:47 ` Anshuman Khandual
@ 2017-01-31 19:54   ` Zi Yan
  2017-02-03  3:12     ` Hillf Danton
  0 siblings, 1 reply; 71+ messages in thread
From: Zi Yan @ 2017-01-31 19:54 UTC (permalink / raw)
  To: Anshuman Khandual, Andrew Morton, Naoya Horiguchi, Kirill A. Shutemov
  Cc: linux-mm, Hugh Dickins, Dave Hansen, Andrea Arcangeli,
	Mel Gorman, Michal Hocko, Vlastimil Babka, Pavel Emelyanov,
	Balbir Singh, linux-kernel, Naoya Horiguchi

[-- Attachment #1: Type: text/plain, Size: 1420 bytes --]

I am also doing some tests on THP migration and have discovered that there
are some corner cases not handled in this patchset.

For example, in handle_mm_fault, without taking pmd_lock, the kernel may
see pmd_none(*pmd) during THP migration, which leads to
handle_pte_fault or even deeper into that code path. At that moment,
pmd_trans_unstable() will treat a pmd migration entry as pmd_bad and
clear it. This leads to the application crashing and to page table leaks,
since a deposited PTE page is not released when the application crashes.

Even after I add is_pmd_migration_entry() into pmd_trans_unstable(), I
still see application data corruption.
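
The change I tried is roughly the following, in
pmd_none_or_trans_huge_or_clear_bad(), which pmd_trans_unstable() calls
(sketch only, not the exact diff):

	static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
	{
		pmd_t pmdval = pmd_read_atomic(pmd);

		/* a pmd migration entry is "unstable" but not pmd_bad() */
		if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
		    is_pmd_migration_entry(pmdval))
			return 1;
		if (unlikely(pmd_bad(pmdval))) {
			pmd_clear_bad(pmd);
			return 1;
		}
		return 0;
	}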

I hope someone can shed some light on how to debug this. Should I also
look into pmd_trans_huge() call sites where pmd_migration_entry should
be handled differently?

Thanks.

Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
>> Hi everyone,
>>
>> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
>> with feedbacks for ver.1.
> 
> Hello Noaya,
> 
> I have been working with Zi Yan on the parallel huge page migration series
> (https://lkml.org/lkml/2016/11/22/457) and planning to post them on top of
> this THP migration enhancement series. Hence we were wondering if you have
> plans to post a new version of this series in near future ?
> 
> Regards
> Anshuman
> 

-- 
Best Regards,
Yan Zi


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 537 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2017-01-31 19:54   ` Zi Yan
@ 2017-02-03  3:12     ` Hillf Danton
  2017-02-03  3:24       ` Zi Yan
  0 siblings, 1 reply; 71+ messages in thread
From: Hillf Danton @ 2017-02-03  3:12 UTC (permalink / raw)
  To: 'Zi Yan', 'Anshuman Khandual',
	'Andrew Morton', 'Naoya Horiguchi',
	'Kirill A. Shutemov'
  Cc: linux-mm, 'Hugh Dickins', 'Dave Hansen',
	'Andrea Arcangeli', 'Mel Gorman',
	'Michal Hocko', 'Vlastimil Babka',
	'Pavel Emelyanov', 'Balbir Singh',
	linux-kernel, 'Naoya Horiguchi'


On February 01, 2017 3:54 AM Zi Yan wrote: 
> 
> I am also doing some tests on THP migration and discover that there are
> some corner cases not handled in this patchset.
> 
> For example, in handle_mm_fault, without taking pmd_lock, the kernel may
> see pmd_none(*pmd) during THP migrations, which leads to
> handle_pte_fault or even deeper in the code path. At that moment,
> pmd_trans_unstable() will treat a pmd_migration_entry as pmd_bad and
> clear it. This leads to application crashing and page table leaks, since
> a deposited PTE page is not released when the application crashes.
> 
> Even after I add is_pmd_migration_entry() into pmd_trans_unstable(), I
> still see application data corruptions.
> 
> I hope someone can shed some light on how to debug this. Should I also
> look into pmd_trans_huge() call sites where pmd_migration_entry should
> be handled differently?
> 
Hm ... it seems it would help more if you posted your current work as an RFC
on top of the mm tree, along with the relevant tests.

Hillf
> 
> Anshuman Khandual wrote:
> > On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> >> Hi everyone,
> >>
> >> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
> >> with feedbacks for ver.1.
> >
> > Hello Noaya,
> >
> > I have been working with Zi Yan on the parallel huge page migration series
> > (https://lkml.org/lkml/2016/11/22/457) and planning to post them on top of
> > this THP migration enhancement series. Hence we were wondering if you have
> > plans to post a new version of this series in near future ?
> >
> > Regards
> > Anshuman
> >
> 
> --
> Best Regards,
> Yan Zi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/12] mm: page migration enhancement for thp
  2017-02-03  3:12     ` Hillf Danton
@ 2017-02-03  3:24       ` Zi Yan
  0 siblings, 0 replies; 71+ messages in thread
From: Zi Yan @ 2017-02-03  3:24 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Anshuman Khandual, Andrew Morton, Naoya Horiguchi,
	Kirill A. Shutemov, linux-mm, Hugh Dickins, Dave Hansen,
	Andrea Arcangeli, Mel Gorman, Michal Hocko, Vlastimil Babka,
	Pavel Emelyanov, Balbir Singh, linux-kernel, Naoya Horiguchi

[-- Attachment #1: Type: text/plain, Size: 2075 bytes --]

On 2 Feb 2017, at 21:12, Hillf Danton wrote:

> On February 01, 2017 3:54 AM Zi Yan wrote:
>>
>> I am also doing some tests on THP migration and discover that there are
>> some corner cases not handled in this patchset.
>>
>> For example, in handle_mm_fault, without taking pmd_lock, the kernel may
>> see pmd_none(*pmd) during THP migrations, which leads to
>> handle_pte_fault or even deeper in the code path. At that moment,
>> pmd_trans_unstable() will treat a pmd_migration_entry as pmd_bad and
>> clear it. This leads to application crashing and page table leaks, since
>> a deposited PTE page is not released when the application crashes.
>>
>> Even after I add is_pmd_migration_entry() into pmd_trans_unstable(), I
>> still see application data corruptions.
>>
>> I hope someone can shed some light on how to debug this. Should I also
>> look into pmd_trans_huge() call sites where pmd_migration_entry should
>> be handled differently?
>>
> Hm ... seems it helps more if you post your current works as RFC on
> top of the mm tree, and the relevant tests as well.
>

Thanks for replying. I found that the data corruption is caused by
set_pmd_migration_entry() not flushing the TLB while changing pmd entries.
I fixed it by using pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear().
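
I.e. in set_pmd_migration_entry(), roughly this one-line change (sketch):

-		pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
+		pmdval = pmdp_huge_clear_flush(vma, addr, pmd);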

The racy pmd check in zap_pmd_range() also causes a pmd_bad() problem. I am
going to rebase Naoya's patches and send them again for comments.
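
For the zap_pmd_range() race, a sketch of one possible direction, assuming
split_huge_pmd() and zap_huge_pmd() are also taught about pmd migration
entries (the rebased patches may handle this differently):

	do {
		next = pmd_addr_end(addr, end);
		/*
		 * Route a pmd migration entry through the huge-pmd path
		 * instead of falling through to the pte level, where
		 * pmd_none_or_trans_huge_or_clear_bad() would treat it
		 * as pmd_bad and clear it.
		 */
		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
		    is_pmd_migration_entry(*pmd)) {
			if (next - addr != HPAGE_PMD_SIZE)
				split_huge_pmd(vma, pmd, addr);
			else if (zap_huge_pmd(tlb, vma, pmd, addr))
				goto next;
			/* fall through */
		}
		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
			goto next;
		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
next:
		cond_resched();
	} while (pmd++, addr = next, addr != end);

The key point is that the pte-level fallback no longer sees a pmd
migration entry and so never clears it as pmd_bad.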



> Hillf
>>
>> Anshuman Khandual wrote:
>>> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
>>>> Hi everyone,
>>>>
>>>> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
>>>> with feedbacks for ver.1.
>>>
>>> Hello Naoya,
>>>
>>> I have been working with Zi Yan on the parallel huge page migration series
>>> (https://lkml.org/lkml/2016/11/22/457) and planning to post them on top of
>>> this THP migration enhancement series. Hence we were wondering if you have
>>> plans to post a new version of this series in the near future?
>>>
>>> Regards
>>> Anshuman
>>>
>>
>> --
>> Best Regards,
>> Yan Zi


--
Best Regards
Yan Zi


^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2017-02-03  3:24 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-07 23:31 [PATCH v2 00/12] mm: page migration enhancement for thp Naoya Horiguchi
2016-11-07 23:31 ` [PATCH v2 01/12] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6 Naoya Horiguchi
2016-11-10 23:29   ` Dave Hansen
2016-11-11  1:08     ` Naoya Horiguchi
2016-11-11 11:12     ` Kirill A. Shutemov
2016-11-07 23:31 ` [PATCH v2 02/12] mm: mempolicy: add queue_pages_node_check() Naoya Horiguchi
2016-11-07 23:31 ` [PATCH v2 03/12] mm: thp: introduce separate TTU flag for thp freezing Naoya Horiguchi
2016-11-10  8:31   ` Anshuman Khandual
2016-11-10  9:09     ` Naoya Horiguchi
2016-11-11  3:18       ` Anshuman Khandual
2016-11-11 11:16   ` Kirill A. Shutemov
2016-11-07 23:31 ` [PATCH v2 04/12] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Naoya Horiguchi
2016-11-11 11:18   ` Kirill A. Shutemov
2016-11-15  2:05     ` Naoya Horiguchi
2016-11-28 14:21   ` Michal Hocko
2016-11-29  7:50     ` Naoya Horiguchi
2016-11-29  8:45       ` Michal Hocko
2016-11-07 23:31 ` [PATCH v2 05/12] mm: thp: add core routines for thp/pmd migration Naoya Horiguchi
2016-11-08  3:05   ` kbuild test robot
2016-11-08  6:46     ` Naoya Horiguchi
2016-11-08  8:13   ` Anshuman Khandual
2016-11-08  8:16     ` Naoya Horiguchi
2016-11-10  8:29   ` Anshuman Khandual
2016-11-10  9:43     ` Naoya Horiguchi
2016-11-14 11:45   ` Kirill A. Shutemov
2016-11-15  4:57     ` Naoya Horiguchi
2016-11-28 14:31   ` Michal Hocko
2016-11-29  7:57     ` Naoya Horiguchi
2016-11-07 23:31 ` [PATCH v2 06/12] mm: thp: enable thp migration in generic path Naoya Horiguchi
2016-11-28 14:33   ` Michal Hocko
2016-11-29  8:16     ` Naoya Horiguchi
2016-11-07 23:31 ` [PATCH v2 07/12] mm: thp: check pmd migration entry in common path Naoya Horiguchi
2016-11-08  0:23   ` kbuild test robot
2016-11-08  1:22     ` Naoya Horiguchi
2016-11-10  8:36   ` Anshuman Khandual
2016-11-10  9:12     ` Naoya Horiguchi
2016-11-10  9:08   ` Hillf Danton
2016-11-10  9:21     ` Naoya Horiguchi
2016-11-10  9:28       ` Hillf Danton
2016-11-10  9:34         ` Naoya Horiguchi
2016-11-17 23:56   ` Kirill A. Shutemov
2016-11-29  6:46     ` Naoya Horiguchi
2016-11-28 14:35   ` Michal Hocko
2016-11-07 23:31 ` [PATCH v2 08/12] mm: soft-dirty: keep soft-dirty bits over thp migration Naoya Horiguchi
2016-11-10  8:38   ` Anshuman Khandual
2016-11-10  9:13     ` Naoya Horiguchi
2016-11-07 23:31 ` [PATCH v2 09/12] mm: hwpoison: soft offline supports " Naoya Horiguchi
2016-11-10 10:31   ` Balbir Singh
2016-11-10 23:58     ` Naoya Horiguchi
2016-11-14 23:22       ` Balbir Singh
2016-11-07 23:31 ` [PATCH v2 10/12] mm: mempolicy: mbind and migrate_pages support " Naoya Horiguchi
2016-11-25 12:27   ` Anshuman Khandual
2016-11-29  7:07     ` Naoya Horiguchi
2016-11-07 23:31 ` [PATCH v2 11/12] mm: migrate: move_pages() supports " Naoya Horiguchi
2016-11-18  0:01   ` Kirill A. Shutemov
2016-11-07 23:31 ` [PATCH v2 12/12] mm: memory_hotplug: memory hotremove " Naoya Horiguchi
2016-11-08  0:29   ` kbuild test robot
2016-11-08  0:30   ` kbuild test robot
2016-11-08  1:36     ` Naoya Horiguchi
2016-11-08  1:41       ` [PATCH] mm: fix unused variable warning Naoya Horiguchi
2016-11-09  2:32 ` [PATCH v2 00/12] mm: page migration enhancement for thp Balbir Singh
2016-11-09  4:59   ` Naoya Horiguchi
2016-11-09 21:28     ` Balbir Singh
2016-11-09 10:33 ` Anshuman Khandual
2016-11-09 23:52   ` Naoya Horiguchi
2016-11-10 14:01     ` Zi Yan
2016-11-11  3:48       ` Anshuman Khandual
2017-01-30 10:47 ` Anshuman Khandual
2017-01-31 19:54   ` Zi Yan
2017-02-03  3:12     ` Hillf Danton
2017-02-03  3:24       ` Zi Yan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).