* [PATCH v4 00/11] mm: page migration enhancement for thp
@ 2017-03-13 15:44 ` Zi Yan
  0 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:44 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Zi Yan <zi.yan@cs.rutgers.edu>

Hi all,

The patches are rebased on mmotm-2017-03-09-16-19 with the feedback from
the v3 review. Please give comments and consider merging the series.

Hi Kirill, could you take a look at [05/11], which uses and modifies your
page_vma_mapped_walk()?


Motivations
===========================================
1. THP migration is becoming important for upcoming heterogeneous memory systems.

As David Nellans from NVIDIA pointed out in other threads
(http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349227.html),
future GPUs and other accelerators will have their memory managed by the
operating system. Moving data into and out of these memory nodes efficiently
is critical to applications that use GPUs or other accelerators. Existing page
migration only supports base pages, which yields very low memory bandwidth
utilization. My experiments (see below) show that THP migration moves pages
much more efficiently.

2. Base page migration vs THP migration throughput.

Here are cross-socket page migration results from calling the
move_pages() syscall:

On x86_64, an Intel two-socket E5-2640v3 box:
a single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
a single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
migrating 512 4KB base pages takes 1987.38 us, using 0.98 GB/s BW.

On ppc64, a two-socket Power8 box:
a single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
a single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
migrating 256 64KB base pages takes 2543.65 us, using 6.14 GB/s BW.

THP migration gives about 3x and 1.15x the throughput of base page migration
on x86_64 and ppc64, respectively.

You can test it out by using the code here:
https://github.com/x-y-z/thp-migration-bench
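
If you only want a quick sanity check rather than the full benchmark above,
a minimal sketch of timing a single move_pages() call looks like the code
below. This is only an illustration, not code from the repository; it assumes
libnuma's <numaif.h>, at least two NUMA nodes with node 1 as the destination,
THP enabled, and a build with -lnuma:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
	size_t len = 2UL << 20;			/* one 2MB THP worth of memory */
	void *buf = aligned_alloc(len, len);	/* 2MB-aligned so it can be a THP */
	void *pages[1] = { buf };
	int nodes[1] = { 1 };			/* destination node */
	int status[1] = { -1 };
	struct timespec t0, t1;
	long rc;

	if (!buf)
		return 1;
	madvise(buf, len, MADV_HUGEPAGE);	/* ask for a THP before faulting */
	memset(buf, 0, len);			/* fault the memory in locally */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	rc = move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (rc < 0)
		perror("move_pages");

	printf("migration took %ld ns, status %d\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000000L +
	       (t1.tv_nsec - t0.tv_nsec), status[0]);
	return 0;
}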

3. Existing page migration splits a THP before migration and cannot guarantee
that the migrated pages are still contiguous. Contiguity is exactly what GPUs
and accelerators look for. Without THP migration, khugepaged needs to do extra
work to reassemble the migrated base pages back into THPs.

ChangeLog
===========================================

Changes since v3:

  * I dropped my fix to zap_pmd_range(), since THP migration does not trigger
    the bug and Kirill has posted patches to fix the bug triggered by MADV_DONTNEED.

  * In Patch 9, I used !pmd_present() instead of is_pmd_migration_entry()
    in pmd_none_or_trans_huge_or_clear_bad() to avoid moving the function to
    linux/swapops.h. Currently, !pmd_present() is equivalent to
    is_pmd_migration_entry(). Any suggestions on this change are welcome.

Changes since v2:

  * I fixed a bug in zap_pmd_range() and include the fixes in Patches 1-3.
    The racy check in zap_pmd_range() can miss pmd_protnone and pmd_migration_entry,
    which leads to the PTE page table not being freed.

  * In Patch 4, I move _PAGE_SWP_SOFT_DIRTY to bit 1, because bit 6 (used in v2)
    can be set by some CPUs by mistake and the new swap entry format does not use
    bits 1-4.

  * I also adjust two core migration functions, set_pmd_migration_entry() and
    remove_migration_pmd(), to use Kirill A. Shutemov's page_vma_mapped_walk()
    function. Patch 8 needs Kirill's comments, since I also add
    pmd_migration_entry handling to his page_vma_mapped_walk() function.

  * In Patch 8, I replace pmdp_huge_get_and_clear() with pmdp_huge_clear_flush()
    in set_pmd_migration_entry() to avoid data corruption after page migration.

  * In Patch 9, I include is_pmd_migration_entry() in pmd_none_or_trans_huge_or_clear_bad().
    Otherwise, a pmd_migration_entry is treated as pmd_bad and cleared, which
    leads to the deposited PTE page table not being freed.

  * I personally use this patchset with my customized kernel to test frequent
    page migrations by replacing page reclaim with page migration.
    The bugs fixed in Patches 1-3 and 8 were discovered while I was testing
    this kernel. I ran a 16-hour stress test with ~7 billion total page
    migrations. No errors or data corruption were found.

General description
===========================================

This patchset enhances the page migration functionality to handle thp migration
for the various callers of page migration:
 - mbind(2)
 - move_pages(2)
 - migrate_pages(2)
 - cgroup/cpuset migration
 - memory hotremove
 - soft offline

The main benefit is that we can avoid unnecessary thp splits, which helps us
avoid a performance decrease when applications handle NUMA optimization on
their own.

The implementation is similar to that of normal page migration; the key point
is that we convert a pmd into a pmd migration entry in a swap-entry-like format.
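
Concretely, for a PMD-mapped thp the conversion boils down to something like
the sketch below. This is only a condensed illustration of what
set_pmd_migration_entry() and remove_migration_pmd() in patch 05 do; locking,
mmu notifiers, dirty-bit propagation and rmap accounting are omitted, and the
helper names here are made up:

/* freeze: replace the huge pmd with a non-present migration entry */
static void thp_freeze_pmd_sketch(struct vm_area_struct *vma,
		unsigned long addr, pmd_t *pmdp, struct page *page)
{
	pmd_t old = pmdp_huge_clear_flush(vma, addr, pmdp);
	swp_entry_t entry = make_migration_entry(page, pmd_write(old));

	/* swap-entry-like pmd: carries the pfn and the write permission */
	set_pmd_at(vma->vm_mm, addr, pmdp, swp_entry_to_pmd(entry));
}

/* unfreeze: after copying the data, map the new thp with a huge pmd again */
static void thp_unfreeze_pmd_sketch(struct vm_area_struct *vma,
		unsigned long addr, pmd_t *pmdp, struct page *new)
{
	swp_entry_t entry = pmd_to_swp_entry(*pmdp);
	pmd_t pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));

	if (is_write_migration_entry(entry))
		pmde = maybe_pmd_mkwrite(pmde, vma);
	set_pmd_at(vma->vm_mm, addr, pmdp, pmde);
}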

Any comments or advice are welcome.

Best Regards,
Yan Zi

Naoya Horiguchi (11):
  mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
  mm: mempolicy: add queue_pages_node_check()
  mm: thp: introduce separate TTU flag for thp freezing
  mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  mm: thp: enable thp migration in generic path
  mm: thp: check pmd migration entry in common path
  mm: soft-dirty: keep soft-dirty bits over thp migration
  mm: hwpoison: soft offline supports thp migration
  mm: mempolicy: mbind and migrate_pages support thp migration
  mm: migrate: move_pages() supports thp migration
  mm: memory_hotplug: memory hotremove supports thp migration

 arch/x86/Kconfig                     |   4 +
 arch/x86/include/asm/pgtable.h       |  17 +++
 arch/x86/include/asm/pgtable_64.h    |  14 ++-
 arch/x86/include/asm/pgtable_types.h |  10 +-
 arch/x86/mm/gup.c                    |   4 +-
 fs/proc/task_mmu.c                   |  49 +++++---
 include/asm-generic/pgtable.h        |  37 +++++-
 include/linux/huge_mm.h              |  32 ++++-
 include/linux/rmap.h                 |   3 +-
 include/linux/swapops.h              |  72 ++++++++++-
 mm/Kconfig                           |   3 +
 mm/gup.c                             |  22 +++-
 mm/huge_memory.c                     | 237 +++++++++++++++++++++++++++++++----
 mm/madvise.c                         |   2 +
 mm/memcontrol.c                      |   2 +
 mm/memory-failure.c                  |  31 ++---
 mm/memory.c                          |   9 +-
 mm/memory_hotplug.c                  |  17 ++-
 mm/mempolicy.c                       | 124 +++++++++++++-----
 mm/migrate.c                         |  66 ++++++++--
 mm/mprotect.c                        |   6 +-
 mm/mremap.c                          |   2 +-
 mm/page_vma_mapped.c                 |  13 +-
 mm/pgtable-generic.c                 |   3 +-
 mm/rmap.c                            |  16 ++-
 25 files changed, 655 insertions(+), 140 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 01/11] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:44   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:44 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid
a false negative return when it races with thp split
(during which _PAGE_PRESENT is temporarily cleared). I don't think that
dropping the _PAGE_PSE check in pmd_present() works well because it can
hurt the tlb-handling optimization in thp split.
In the current kernel, bits 1-4 are not used in the non-present format
since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
Bit 7 is reserved (always clear), so please don't use it for
other purposes.
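
As a side note, the property this layout has to preserve can be shown with a
small hedged sketch (not part of the patch): setting the swap soft-dirty bit
must not disturb the encoded type/offset, since bit 1 lies outside both fields:

static void check_swp_soft_dirty_roundtrip(swp_entry_t entry)
{
	pte_t pte = swp_entry_to_pte(entry);

	/* _PAGE_SWP_SOFT_DIRTY is now bit 1 (_PAGE_RW) */
	pte = pte_swp_mksoft_dirty(pte);
	/* type (bits 9-13) and offset (bits 14-63) must be unchanged */
	WARN_ON(pte_to_swp_entry(pte).val != entry.val);
}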

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 arch/x86/include/asm/pgtable_64.h    | 12 +++++++++---
 arch/x86/include/asm/pgtable_types.h | 10 +++++-----
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 73c7ccc38912..a5c4fc62e078 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -157,15 +157,21 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 /*
  * Encode and de-code a swap entry
  *
- * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0| <- bit number
- * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
- * | OFFSET (14->63) | TYPE (9-13)  |0|X|X|X| X| X|X|X|0| <- swp entry
+ * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
+ * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
+ * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
  * there.  We also need to avoid using A and D because of an
  * erratum where they can be incorrectly set by hardware on
  * non-present PTEs.
+ *
+ * SD (1) in swp entry is used to store soft dirty bit, which helps us
+ * remember soft dirty over page migration
+ *
+ * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
+ * but also G.
  */
 #define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
 #define SWP_TYPE_BITS 5
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 8b4de22d6429..3695abd58ef6 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -97,15 +97,15 @@
 /*
  * Tracking soft dirty bit when a page goes to a swap is tricky.
  * We need a bit which can be stored in pte _and_ not conflict
- * with swap entry format. On x86 bits 6 and 7 are *not* involved
- * into swap entry computation, but bit 6 is used for nonlinear
- * file mapping, so we borrow bit 7 for soft dirty tracking.
+ * with swap entry format. On x86 bits 1-4 are *not* involved
+ * into swap entry computation, but bit 7 is used for thp migration,
+ * so we borrow bit 1 for soft dirty tracking.
  *
  * Please note that this bit must be treated as swap dirty page
- * mark if and only if the PTE has present bit clear!
+ * mark if and only if the PTE/PMD has present bit clear!
  */
 #ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY	_PAGE_PSE
+#define _PAGE_SWP_SOFT_DIRTY	_PAGE_RW
 #else
 #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 02/11] mm: mempolicy: add queue_pages_node_check()
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:44   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:44 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Introduce a separate check routine related to the MPOL_MF_INVERT flag.
This patch is just a cleanup; there is no behavioral change.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mempolicy.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1e7873e40c9a..aa242da77fda 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -477,6 +477,15 @@ struct queue_pages {
 	struct vm_area_struct *prev;
 };
 
+static inline bool queue_pages_node_check(struct page *page,
+					struct queue_pages *qp)
+{
+	int nid = page_to_nid(page);
+	unsigned long flags = qp->flags;
+
+	return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
+}
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
@@ -530,8 +539,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		 */
 		if (PageReserved(page))
 			continue;
-		nid = page_to_nid(page);
-		if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		if (queue_pages_node_check(page, qp))
 			continue;
 		if (PageTransCompound(page)) {
 			get_page(page);
@@ -563,7 +571,6 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 #ifdef CONFIG_HUGETLB_PAGE
 	struct queue_pages *qp = walk->private;
 	unsigned long flags = qp->flags;
-	int nid;
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t entry;
@@ -573,8 +580,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 	if (!pte_present(entry))
 		goto unlock;
 	page = pte_page(entry);
-	nid = page_to_nid(page);
-	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+	if (queue_pages_node_check(page, qp))
 		goto unlock;
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 03/11] mm: thp: introduce separate TTU flag for thp freezing
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:44   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:44 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

TTU_MIGRATION is used to convert a pte into a migration entry until thp split
completes. This behavior conflicts with the thp migration added by later
patches, so let's introduce a new TTU flag specifically for freezing.

try_to_unmap() is used both for thp split (via freeze_page()) and page
migration (via __unmap_and_move()). In freeze_page(), the ttu_flag given for
the head page is as below (assuming anonymous thp):

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
     TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)

and ttu_flag given for tail pages is:

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
     TTU_MIGRATION)

__unmap_and_move() calls try_to_unmap() with ttu_flag:

    (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)

Now I'm trying to insert a branch for thp migration at the top of
try_to_unmap_one() like below

static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                       unsigned long address, void *arg)
  {
          ...
          if (flags & TTU_MIGRATION) {
                  if (!PageHuge(page) && PageTransCompound(page)) {
                          set_pmd_migration_entry(page, vma, address);
                          goto out;
                  }
          }

, so try_to_unmap() for tail pages called by thp split could go into the thp
migration code path (which converts a *pmd* into a migration entry), while
the expectation is to freeze the thp (which converts *ptes* into migration
entries).

I detected this failure as a "bad page state" error in a testcase where
split_huge_page() is called from queue_pages_pte_range().
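
For clarity, after this patch the flag combinations become (a summary derived
from the hunks below, assuming anonymous thp):

thp split (freeze_page()), head page:

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
     TTU_SPLIT_FREEZE | TTU_SPLIT_HUGE_PMD)

thp split (freeze_page()), tail pages:

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
     TTU_SPLIT_FREEZE)

page migration (__unmap_and_move()), unchanged:

    (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)

so only real page migration passes TTU_MIGRATION, and the thp migration branch
in try_to_unmap_one() can no longer be taken during a split.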

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/rmap.h | 3 ++-
 mm/huge_memory.c     | 2 +-
 mm/rmap.c            | 7 ++++---
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index fee10d744ebd..58803b6e7f82 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -93,8 +93,9 @@ enum ttu_flags {
 	TTU_BATCH_FLUSH		= 0x40,	/* Batch TLB flushes where possible
 					 * and caller guarantees they will
 					 * do a final flush if necessary */
-	TTU_RMAP_LOCKED		= 0x80	/* do not grab rmap lock:
+	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
+	TTU_SPLIT_FREEZE	= 0x100,		/* freeze pte under splitting thp */
 };
 
 #ifdef CONFIG_MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f42d4d0a3019..e32ccbd8ee3a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2142,7 +2142,7 @@ static void freeze_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
 	if (PageAnon(page))
-		ttu_flags |= TTU_MIGRATION;
+		ttu_flags |= TTU_SPLIT_FREEZE;
 
 	ret = try_to_unmap(page, ttu_flags);
 	VM_BUG_ON_PAGE(ret, page);
diff --git a/mm/rmap.c b/mm/rmap.c
index e4391100af51..555cc7ebacf6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1304,7 +1304,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
-				flags & TTU_MIGRATION, page);
+				flags & TTU_SPLIT_FREEZE, page);
 	}
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1390,7 +1390,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 */
 			dec_mm_counter(mm, mm_counter(page));
 		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
-				(flags & TTU_MIGRATION)) {
+				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
 			swp_entry_t entry;
 			pte_t swp_pte;
 			/*
@@ -1521,7 +1521,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 	 * locking requirements of exec(), migration skips
 	 * temporary VMAs until after exec() completes.
 	 */
-	if ((flags & TTU_MIGRATION) && !PageKsm(page) && PageAnon(page))
+	if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
+	    && !PageKsm(page) && PageAnon(page))
 		rwc.invalid_vma = invalid_migration_vma;
 
 	if (flags & TTU_RMAP_LOCKED)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 04/11] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:45   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
functionality to x86_64, which should be safer as a first step.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
v1 -> v2:
- fixed config name in subject and patch description
---
 arch/x86/Kconfig        |  4 ++++
 include/linux/huge_mm.h | 10 ++++++++++
 mm/Kconfig              |  3 +++
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 69188841717a..a24bc11c7aed 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2276,6 +2276,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	def_bool y
 	depends on X86_64 && HUGETLB_PAGE && MIGRATION
 
+config ARCH_ENABLE_THP_MIGRATION
+	def_bool y
+	depends on X86_64 && TRANSPARENT_HUGEPAGE && MIGRATION
+
 menu "Power management and ACPI options"
 
 config ARCH_HIBERNATION_HEADER
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a3762d49ba39..1b81cb57ff0f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -212,6 +212,11 @@ void mm_put_huge_zero_page(struct mm_struct *mm);
 
 #define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
 
+static inline bool thp_migration_supported(void)
+{
+	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
+}
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -306,6 +311,11 @@ static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
 {
 	return NULL;
 }
+
+static inline bool thp_migration_supported(void)
+{
+	return false;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb969dc..317a2f973720 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -289,6 +289,9 @@ config MIGRATION
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
 
+config ARCH_ENABLE_THP_MIGRATION
+	bool
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 05/11] mm: thp: enable thp migration in generic path
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:45   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch adds thp migration's core code, including conversions
between a PMD entry and a swap entry, setting a PMD migration entry,
removing a PMD migration entry, and waiting on PMD migration entries.

This patch makes it possible to support thp migration.
If we fail to allocate the destination page as a thp, we just split
the source thp as we do now, and then enter the normal page migration
path. If we succeed in allocating a destination thp, we enter thp
migration. Subsequent patches actually enable thp migration for each
caller of page migration by allowing its get_new_page() callback to
allocate thps.
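
For illustration, here is a hedged sketch of the kind of get_new_page()
callback the subsequent patches add. The function name is made up; the real
callbacks live in mm/mempolicy.c, mm/migrate.c and friends and may differ in
details such as the gfp flags:

static struct page *new_thp_aware_page_sketch(struct page *page,
		unsigned long private, int **result)
{
	int nid = (int)private;		/* destination node */

	if (thp_migration_supported() && PageTransHuge(page)) {
		struct page *thp;

		thp = alloc_pages_node(nid, GFP_TRANSHUGE, HPAGE_PMD_ORDER);
		if (thp) {
			prep_transhuge_page(thp);
			return thp;	/* thp migration path */
		}
		/*
		 * Fall through: handing back a base page makes
		 * unmap_and_move() split the source thp, as in the
		 * PageTransHuge(newpage) check added by this patch.
		 */
	}
	return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
}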

ChangeLog v1 -> v2:
- support pte-mapped thp, doubly-mapped thp

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

ChangeLog v2 -> v3:
- use page_vma_mapped_walk()

ChangeLog v3 -> v4:
- factor out the code of removing pte pgtable page in zap_huge_pmd()

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 arch/x86/include/asm/pgtable_64.h |   2 +
 include/linux/swapops.h           |  70 +++++++++++++++++-
 mm/huge_memory.c                  | 147 ++++++++++++++++++++++++++++++++++----
 mm/migrate.c                      |  29 +++++++-
 mm/page_vma_mapped.c              |  13 +++-
 mm/pgtable-generic.c              |   3 +-
 mm/rmap.c                         |   9 +++
 7 files changed, 252 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index a5c4fc62e078..350397fd2129 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -187,7 +187,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 					 ((type) << (SWP_TYPE_FIRST_BIT)) \
 					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
+#define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
+#define __swp_entry_to_pmd(x)		((pmd_t) { .pmd = (x).val })
 
 extern int kern_addr_valid(unsigned long addr);
 extern void cleanup_highmap(void);
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3e7eec..6625bea13869 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -103,7 +103,8 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
-	BUG_ON(!PageLocked(page));
+	BUG_ON(!PageLocked(compound_head(page)));
+
 	return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
 			page_to_pfn(page));
 }
@@ -126,7 +127,7 @@ static inline struct page *migration_entry_to_page(swp_entry_t entry)
 	 * Any use of migration entries may only occur while the
 	 * corresponding page is locked
 	 */
-	BUG_ON(!PageLocked(p));
+	BUG_ON(!PageLocked(compound_head(p)));
 	return p;
 }
 
@@ -163,6 +164,71 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 
 #endif
 
+struct page_vma_mapped_walk;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+		struct page *page);
+
+extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+		struct page *new);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+	swp_entry_t arch_entry;
+
+	arch_entry = __pmd_to_swp_entry(pmd);
+	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+	swp_entry_t arch_entry;
+
+	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
+	return __swp_entry_to_pmd(arch_entry);
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
+}
+#else
+static inline void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+		struct page *page)
+{
+	BUILD_BUG();
+}
+
+static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+		struct page *new)
+{
+	BUILD_BUG();
+	return 0;
+}
+
+static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+	BUILD_BUG();
+	return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+	BUILD_BUG();
+	return (pmd_t){ 0 };
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_MEMORY_FAILURE
 
 extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e32ccbd8ee3a..a9c2a0ef5b9b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1588,6 +1588,26 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 	atomic_long_dec(&mm->nr_ptes);
 }
 
+static inline void remove_trans_huge_pgtable(struct page *page,
+		struct mmu_gather *tlb, pmd_t *pmd)
+{
+	if (PageAnon(page)) {
+		pgtable_t pgtable;
+
+		pgtable = pgtable_trans_huge_withdraw(tlb->mm,
+							  pmd);
+		pte_free(tlb->mm, pgtable);
+		atomic_long_dec(&tlb->mm->nr_ptes);
+		add_mm_counter(tlb->mm, MM_ANONPAGES,
+				   -HPAGE_PMD_NR);
+	} else {
+		if (arch_needs_pgtable_deposit())
+			zap_deposited_table(tlb->mm, pmd);
+		add_mm_counter(tlb->mm, MM_FILEPAGES,
+				   -HPAGE_PMD_NR);
+	}
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
@@ -1618,23 +1638,27 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
 	} else {
-		struct page *page = pmd_page(orig_pmd);
-		page_remove_rmap(page, true);
-		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-		VM_BUG_ON_PAGE(!PageHead(page), page);
-		if (PageAnon(page)) {
-			pgtable_t pgtable;
-			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
-			pte_free(tlb->mm, pgtable);
-			atomic_long_dec(&tlb->mm->nr_ptes);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		struct page *page;
+		int migration = 0;
+
+		if (!is_pmd_migration_entry(orig_pmd)) {
+			page = pmd_page(orig_pmd);
+			page_remove_rmap(page, true);
+			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+			VM_BUG_ON_PAGE(!PageHead(page), page);
+			remove_trans_huge_pgtable(page, tlb, pmd);
 		} else {
-			if (arch_needs_pgtable_deposit())
-				zap_deposited_table(tlb->mm, pmd);
-			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+			swp_entry_t entry;
+
+			entry = pmd_to_swp_entry(orig_pmd);
+			page = pfn_to_page(swp_offset(entry));
+			remove_trans_huge_pgtable(page, tlb, pmd);
+			free_swap_and_cache(entry); /* warn on failure? */
+			migration = 1;
 		}
 		spin_unlock(ptl);
-		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
+		if (!migration)
+			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
 	}
 	return 1;
 }
@@ -2652,3 +2676,98 @@ static int __init split_huge_pages_debugfs(void)
 }
 late_initcall(split_huge_pages_debugfs);
 #endif
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+		struct page *page)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	pmd_t pmdval;
+	swp_entry_t entry;
+
+	if (pvmw->pmd && !pvmw->pte) {
+		pmd_t pmdswp;
+
+		mmu_notifier_invalidate_range_start(mm, address,
+				address + HPAGE_PMD_SIZE);
+
+		flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
+		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
+		if (pmd_dirty(pmdval))
+			set_page_dirty(page);
+		entry = make_migration_entry(page, pmd_write(pmdval));
+		pmdswp = swp_entry_to_pmd(entry);
+		set_pmd_at(mm, address, pvmw->pmd, pmdswp);
+		page_remove_rmap(page, true);
+		put_page(page);
+
+		mmu_notifier_invalidate_range_end(mm, address,
+				address + HPAGE_PMD_SIZE);
+	} else { /* pte-mapped thp */
+		pte_t pteval;
+		struct page *subpage = page - page_to_pfn(page) + pte_pfn(*pvmw->pte);
+		pte_t swp_pte;
+
+		pteval = ptep_clear_flush(vma, address, pvmw->pte);
+		if (pte_dirty(pteval))
+			set_page_dirty(subpage);
+		entry = make_migration_entry(subpage, pte_write(pteval));
+		swp_pte = swp_entry_to_pte(entry);
+		set_pte_at(mm, address, pvmw->pte, swp_pte);
+		page_remove_rmap(subpage, false);
+		put_page(subpage);
+		mmu_notifier_invalidate_page(mm, address);
+	}
+}
+
+void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	swp_entry_t entry;
+
+	/* PMD-mapped THP  */
+	if (pvmw->pmd && !pvmw->pte) {
+		unsigned long mmun_start = address & HPAGE_PMD_MASK;
+		unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+		pmd_t pmde;
+
+		entry = pmd_to_swp_entry(*pvmw->pmd);
+		get_page(new);
+		pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+		if (is_write_migration_entry(entry))
+			pmde = maybe_pmd_mkwrite(pmde, vma);
+
+		flush_cache_range(vma, mmun_start, mmun_end);
+		page_add_anon_rmap(new, vma, mmun_start, true);
+		pmdp_huge_clear_flush_notify(vma, mmun_start, pvmw->pmd);
+		set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
+		flush_tlb_range(vma, mmun_start, mmun_end);
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_vma_page(new);
+		update_mmu_cache_pmd(vma, address, pvmw->pmd);
+
+	} else { /* pte-mapped thp */
+		pte_t pte;
+		pte_t *ptep = pvmw->pte;
+
+		entry = pte_to_swp_entry(*pvmw->pte);
+		get_page(new);
+		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		if (pte_swp_soft_dirty(*pvmw->pte))
+			pte = pte_mksoft_dirty(pte);
+		if (is_write_migration_entry(entry))
+			pte = maybe_mkwrite(pte, vma);
+		flush_dcache_page(new);
+		set_pte_at(mm, address, ptep, pte);
+		if (PageAnon(new))
+			page_add_anon_rmap(new, vma, address, false);
+		else
+			page_add_file_rmap(new, false);
+		update_mmu_cache(vma, address, ptep);
+	}
+}
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index cda4c2778d04..0bbad6dcf95a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -211,6 +211,12 @@ static int remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		new = page - pvmw.page->index +
 			linear_page_index(vma, pvmw.address);
 
+		/* PMD-mapped THP migration entry */
+		if (!PageHuge(page) && PageTransCompound(page)) {
+			remove_migration_pmd(&pvmw, new);
+			continue;
+		}
+
 		get_page(new);
 		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
 		if (pte_swp_soft_dirty(*pvmw.pte))
@@ -324,6 +330,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
 	__migration_entry_wait(mm, pte, ptl);
 }
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *ptl;
+	struct page *page;
+
+	ptl = pmd_lock(mm, pmd);
+	if (!is_pmd_migration_entry(*pmd))
+		goto unlock;
+	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+	if (!get_page_unless_zero(page))
+		goto unlock;
+	spin_unlock(ptl);
+	wait_on_page_locked(page);
+	put_page(page);
+	return;
+unlock:
+	spin_unlock(ptl);
+}
+#endif
+
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
 static bool buffer_migrate_lock_buffers(struct buffer_head *head,
@@ -1082,7 +1109,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		goto out;
 	}
 
-	if (unlikely(PageTransHuge(page))) {
+	if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
 		lock_page(page);
 		rc = split_huge_page(page);
 		unlock_page(page);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index a23001a22c15..0ed3aee62d50 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -137,16 +137,23 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	if (!pud_present(*pud))
 		return false;
 	pvmw->pmd = pmd_offset(pud, pvmw->address);
-	if (pmd_trans_huge(*pvmw->pmd)) {
+	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		if (!pmd_present(*pvmw->pmd))
-			return not_found(pvmw);
 		if (likely(pmd_trans_huge(*pvmw->pmd))) {
 			if (pvmw->flags & PVMW_MIGRATION)
 				return not_found(pvmw);
 			if (pmd_page(*pvmw->pmd) != page)
 				return not_found(pvmw);
 			return true;
+		} else if (!pmd_present(*pvmw->pmd)) {
+			if (unlikely(is_migration_entry(pmd_to_swp_entry(*pvmw->pmd)))) {
+				swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+
+				if (migration_entry_to_page(entry) != page)
+					return not_found(pvmw);
+				return true;
+			}
+			return not_found(pvmw);
 		} else {
 			/* THP pmd was split under us: handle on pte level */
 			spin_unlock(pvmw->ptl);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4ed5908c65b0..9d550a8a0c71 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t pmd;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+		  !pmd_devmap(*pmdp));
 	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return pmd;
diff --git a/mm/rmap.c b/mm/rmap.c
index 555cc7ebacf6..2c65abbd7a0e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1298,6 +1298,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
+
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		return SWAP_AGAIN;
@@ -1308,6 +1309,14 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	while (page_vma_mapped_walk(&pvmw)) {
+		/* THP migration */
+		if (flags & TTU_MIGRATION) {
+			if (!PageHuge(page) && PageTransCompound(page)) {
+				set_pmd_migration_entry(&pvmw, page);
+				continue;
+			}
+		}
+
 		/*
 		 * If the page is mlock()d, we cannot swap it out.
 		 * If it's recently referenced (perhaps page_referenced
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 05/11] mm: thp: enable thp migration in generic path
@ 2017-03-13 15:45   ` Zi Yan
  0 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch adds thp migration's core code, including conversions
between a PMD entry and a swap entry, setting PMD migration entry,
removing PMD migration entry, and waiting on PMD migration entries.

This patch makes it possible to support thp migration.
If you fail to allocate a destination page as a thp, you just split
the source thp as we do now, and then enter the normal page migration.
If you succeed to allocate destination thp, you enter thp migration.
Subsequent patches actually enable thp migration for each caller of
page migration by allowing its get_new_page() callback to
allocate thps.

ChangeLog v1 -> v2:
- support pte-mapped thp, doubly-mapped thp

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

ChangeLog v2 -> v3:
- use page_vma_mapped_walk()

ChangeLog v3 -> v4:
- factor out the code of removing pte pgtable page in zap_huge_pmd()

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 arch/x86/include/asm/pgtable_64.h |   2 +
 include/linux/swapops.h           |  70 +++++++++++++++++-
 mm/huge_memory.c                  | 147 ++++++++++++++++++++++++++++++++++----
 mm/migrate.c                      |  29 +++++++-
 mm/page_vma_mapped.c              |  13 +++-
 mm/pgtable-generic.c              |   3 +-
 mm/rmap.c                         |   9 +++
 7 files changed, 252 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index a5c4fc62e078..350397fd2129 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -187,7 +187,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 					 ((type) << (SWP_TYPE_FIRST_BIT)) \
 					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
+#define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
+#define __swp_entry_to_pmd(x)		((pmd_t) { .pmd = (x).val })
 
 extern int kern_addr_valid(unsigned long addr);
 extern void cleanup_highmap(void);
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3e7eec..6625bea13869 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -103,7 +103,8 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
-	BUG_ON(!PageLocked(page));
+	BUG_ON(!PageLocked(compound_head(page)));
+
 	return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
 			page_to_pfn(page));
 }
@@ -126,7 +127,7 @@ static inline struct page *migration_entry_to_page(swp_entry_t entry)
 	 * Any use of migration entries may only occur while the
 	 * corresponding page is locked
 	 */
-	BUG_ON(!PageLocked(p));
+	BUG_ON(!PageLocked(compound_head(p)));
 	return p;
 }
 
@@ -163,6 +164,71 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 
 #endif
 
+struct page_vma_mapped_walk;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+		struct page *page);
+
+extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+		struct page *new);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+	swp_entry_t arch_entry;
+
+	arch_entry = __pmd_to_swp_entry(pmd);
+	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+	swp_entry_t arch_entry;
+
+	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
+	return __swp_entry_to_pmd(arch_entry);
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
+}
+#else
+static inline void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+		struct page *page)
+{
+	BUILD_BUG();
+}
+
+static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+		struct page *new)
+{
+	BUILD_BUG();
+	/* stub for !CONFIG_ARCH_ENABLE_THP_MIGRATION; never called */
+}
+
+static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+	BUILD_BUG();
+	return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+	BUILD_BUG();
+	return (pmd_t){ 0 };
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_MEMORY_FAILURE
 
 extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e32ccbd8ee3a..a9c2a0ef5b9b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1588,6 +1588,26 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 	atomic_long_dec(&mm->nr_ptes);
 }
 
+static inline void remove_trans_huge_pgtable(struct page *page,
+		struct mmu_gather *tlb, pmd_t *pmd)
+{
+	if (PageAnon(page)) {
+		pgtable_t pgtable;
+
+		pgtable = pgtable_trans_huge_withdraw(tlb->mm,
+							  pmd);
+		pte_free(tlb->mm, pgtable);
+		atomic_long_dec(&tlb->mm->nr_ptes);
+		add_mm_counter(tlb->mm, MM_ANONPAGES,
+				   -HPAGE_PMD_NR);
+	} else {
+		if (arch_needs_pgtable_deposit())
+			zap_deposited_table(tlb->mm, pmd);
+		add_mm_counter(tlb->mm, MM_FILEPAGES,
+				   -HPAGE_PMD_NR);
+	}
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
@@ -1618,23 +1638,27 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
 	} else {
-		struct page *page = pmd_page(orig_pmd);
-		page_remove_rmap(page, true);
-		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-		VM_BUG_ON_PAGE(!PageHead(page), page);
-		if (PageAnon(page)) {
-			pgtable_t pgtable;
-			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
-			pte_free(tlb->mm, pgtable);
-			atomic_long_dec(&tlb->mm->nr_ptes);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		struct page *page;
+		int migration = 0;
+
+		if (!is_pmd_migration_entry(orig_pmd)) {
+			page = pmd_page(orig_pmd);
+			page_remove_rmap(page, true);
+			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+			VM_BUG_ON_PAGE(!PageHead(page), page);
+			remove_trans_huge_pgtable(page, tlb, pmd);
 		} else {
-			if (arch_needs_pgtable_deposit())
-				zap_deposited_table(tlb->mm, pmd);
-			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+			swp_entry_t entry;
+
+			entry = pmd_to_swp_entry(orig_pmd);
+			page = pfn_to_page(swp_offset(entry));
+			remove_trans_huge_pgtable(page, tlb, pmd);
+			free_swap_and_cache(entry); /* warn on failure? */
+			migration = 1;
 		}
 		spin_unlock(ptl);
-		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
+		if (!migration)
+			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
 	}
 	return 1;
 }
@@ -2652,3 +2676,98 @@ static int __init split_huge_pages_debugfs(void)
 }
 late_initcall(split_huge_pages_debugfs);
 #endif
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+		struct page *page)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	pmd_t pmdval;
+	swp_entry_t entry;
+
+	if (pvmw->pmd && !pvmw->pte) {
+		pmd_t pmdswp;
+
+		mmu_notifier_invalidate_range_start(mm, address,
+				address + HPAGE_PMD_SIZE);
+
+		flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
+		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
+		if (pmd_dirty(pmdval))
+			set_page_dirty(page);
+		entry = make_migration_entry(page, pmd_write(pmdval));
+		pmdswp = swp_entry_to_pmd(entry);
+		set_pmd_at(mm, address, pvmw->pmd, pmdswp);
+		page_remove_rmap(page, true);
+		put_page(page);
+
+		mmu_notifier_invalidate_range_end(mm, address,
+				address + HPAGE_PMD_SIZE);
+	} else { /* pte-mapped thp */
+		pte_t pteval;
+		struct page *subpage = page - page_to_pfn(page) + pte_pfn(*pvmw->pte);
+		pte_t swp_pte;
+
+		pteval = ptep_clear_flush(vma, address, pvmw->pte);
+		if (pte_dirty(pteval))
+			set_page_dirty(subpage);
+		entry = make_migration_entry(subpage, pte_write(pteval));
+		swp_pte = swp_entry_to_pte(entry);
+		set_pte_at(mm, address, pvmw->pte, swp_pte);
+		page_remove_rmap(subpage, false);
+		put_page(subpage);
+		mmu_notifier_invalidate_page(mm, address);
+	}
+}
+
+void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	swp_entry_t entry;
+
+	/* PMD-mapped THP  */
+	if (pvmw->pmd && !pvmw->pte) {
+		unsigned long mmun_start = address & HPAGE_PMD_MASK;
+		unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+		pmd_t pmde;
+
+		entry = pmd_to_swp_entry(*pvmw->pmd);
+		get_page(new);
+		pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+		if (is_write_migration_entry(entry))
+			pmde = maybe_pmd_mkwrite(pmde, vma);
+
+		flush_cache_range(vma, mmun_start, mmun_end);
+		page_add_anon_rmap(new, vma, mmun_start, true);
+		pmdp_huge_clear_flush_notify(vma, mmun_start, pvmw->pmd);
+		set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
+		flush_tlb_range(vma, mmun_start, mmun_end);
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_vma_page(new);
+		update_mmu_cache_pmd(vma, address, pvmw->pmd);
+
+	} else { /* pte-mapped thp */
+		pte_t pte;
+		pte_t *ptep = pvmw->pte;
+
+		entry = pte_to_swp_entry(*pvmw->pte);
+		get_page(new);
+		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		if (pte_swp_soft_dirty(*pvmw->pte))
+			pte = pte_mksoft_dirty(pte);
+		if (is_write_migration_entry(entry))
+			pte = maybe_mkwrite(pte, vma);
+		flush_dcache_page(new);
+		set_pte_at(mm, address, ptep, pte);
+		if (PageAnon(new))
+			page_add_anon_rmap(new, vma, address, false);
+		else
+			page_add_file_rmap(new, false);
+		update_mmu_cache(vma, address, ptep);
+	}
+}
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index cda4c2778d04..0bbad6dcf95a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -211,6 +211,12 @@ static int remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		new = page - pvmw.page->index +
 			linear_page_index(vma, pvmw.address);
 
+		/* PMD-mapped THP migration entry */
+		if (!PageHuge(page) && PageTransCompound(page)) {
+			remove_migration_pmd(&pvmw, new);
+			continue;
+		}
+
 		get_page(new);
 		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
 		if (pte_swp_soft_dirty(*pvmw.pte))
@@ -324,6 +330,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
 	__migration_entry_wait(mm, pte, ptl);
 }
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *ptl;
+	struct page *page;
+
+	ptl = pmd_lock(mm, pmd);
+	if (!is_pmd_migration_entry(*pmd))
+		goto unlock;
+	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+	if (!get_page_unless_zero(page))
+		goto unlock;
+	spin_unlock(ptl);
+	wait_on_page_locked(page);
+	put_page(page);
+	return;
+unlock:
+	spin_unlock(ptl);
+}
+#endif
+
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
 static bool buffer_migrate_lock_buffers(struct buffer_head *head,
@@ -1082,7 +1109,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		goto out;
 	}
 
-	if (unlikely(PageTransHuge(page))) {
+	if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
 		lock_page(page);
 		rc = split_huge_page(page);
 		unlock_page(page);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index a23001a22c15..0ed3aee62d50 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -137,16 +137,23 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	if (!pud_present(*pud))
 		return false;
 	pvmw->pmd = pmd_offset(pud, pvmw->address);
-	if (pmd_trans_huge(*pvmw->pmd)) {
+	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		if (!pmd_present(*pvmw->pmd))
-			return not_found(pvmw);
 		if (likely(pmd_trans_huge(*pvmw->pmd))) {
 			if (pvmw->flags & PVMW_MIGRATION)
 				return not_found(pvmw);
 			if (pmd_page(*pvmw->pmd) != page)
 				return not_found(pvmw);
 			return true;
+		} else if (!pmd_present(*pvmw->pmd)) {
+			if (unlikely(is_migration_entry(pmd_to_swp_entry(*pvmw->pmd)))) {
+				swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+
+				if (migration_entry_to_page(entry) != page)
+					return not_found(pvmw);
+				return true;
+			}
+			return not_found(pvmw);
 		} else {
 			/* THP pmd was split under us: handle on pte level */
 			spin_unlock(pvmw->ptl);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4ed5908c65b0..9d550a8a0c71 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t pmd;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+		  !pmd_devmap(*pmdp));
 	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return pmd;
diff --git a/mm/rmap.c b/mm/rmap.c
index 555cc7ebacf6..2c65abbd7a0e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1298,6 +1298,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
+
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		return SWAP_AGAIN;
@@ -1308,6 +1309,14 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	while (page_vma_mapped_walk(&pvmw)) {
+		/* THP migration */
+		if (flags & TTU_MIGRATION) {
+			if (!PageHuge(page) && PageTransCompound(page)) {
+				set_pmd_migration_entry(&pvmw, page);
+				continue;
+			}
+		}
+
 		/*
 		 * If the page is mlock()d, we cannot swap it out.
 		 * If it's recently referenced (perhaps page_referenced
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 06/11] mm: thp: check pmd migration entry in common path
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:45   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

If one of the callers of page migration starts to handle thp,
memory management code starts to see pmd migration entries, so we need
to prepare for that before enabling it. This patch changes the various
code points that check the status of given pmds, in order to prevent
races between thp migration and the pmd-related work.
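
As a rough sketch (not part of the patch) of the pattern these call sites
end up following, a pmd walker now has to distinguish none, non-present
(possibly a migration entry), huge, and pte-table pmds. The function name
and return values below are made up for illustration; the helpers are the
ones used in this series:

static int example_check_pmd(struct mm_struct *mm, pmd_t *pmdp)
{
	pmd_t pmdval = READ_ONCE(*pmdp);

	if (pmd_none(pmdval))
		return 0;	/* nothing mapped here */

	if (!pmd_present(pmdval)) {
		/* a thp under migration: wait for it rather than touch it */
		if (is_pmd_migration_entry(pmdval))
			pmd_migration_entry_wait(mm, pmdp);
		return 0;
	}

	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
		return 1;	/* handle at pmd level, under pmd_lock() */

	return 2;		/* a normal pte page table: walk the ptes */
}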

ChangeLog v1 -> v2:
- introduce pmd_related() (I know the naming is not good, but I can't
  think of a better name. Any suggestion is welcome.)

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

ChangeLog v2 -> v3:
- add is_swap_pmd()
- a pmd entry should be a pmd pointing to pte pages, is_swap_pmd(),
  pmd_trans_huge(), pmd_devmap(), or pmd_none()
- use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear()
- call flush_cache_range() in set_pmd_migration_entry()
- pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
  true on pmd_migration_entry, so that migration entries are not
  treated as pmd page table entries.

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
---
 arch/x86/mm/gup.c             |  4 +--
 fs/proc/task_mmu.c            | 22 +++++++++------
 include/asm-generic/pgtable.h |  3 +-
 include/linux/huge_mm.h       | 14 +++++++--
 mm/gup.c                      | 22 +++++++++++++--
 mm/huge_memory.c              | 66 ++++++++++++++++++++++++++++++++++++++-----
 mm/madvise.c                  |  2 ++
 mm/memcontrol.c               |  2 ++
 mm/memory.c                   |  9 ++++--
 mm/mprotect.c                 |  6 ++--
 mm/mremap.c                   |  2 +-
 11 files changed, 124 insertions(+), 28 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 1f3b6ef105cd..23bb071f286d 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -243,9 +243,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		if (!pmd_present(pmd))
 			return 0;
-		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
+		if (unlikely(pmd_large(pmd))) {
 			/*
 			 * NUMA hinting faults need to be handled in the GUP
 			 * slowpath for accounting purposes and so that they
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5c8359704601..f2b0f3ba25ac 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -600,7 +600,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		smaps_pmd_entry(pmd, addr, walk);
+		if (pmd_present(*pmd))
+			smaps_pmd_entry(pmd, addr, walk);
 		spin_unlock(ptl);
 		return 0;
 	}
@@ -942,6 +943,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 			goto out;
 		}
 
+		if (!pmd_present(*pmd))
+			goto out;
+
 		page = pmd_page(*pmd);
 
 		/* Clear accessed and referenced bits. */
@@ -1221,19 +1225,19 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	if (ptl) {
 		u64 flags = 0, frame = 0;
 		pmd_t pmd = *pmdp;
+		struct page *page;
 
 		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
 			flags |= PM_SOFT_DIRTY;
 
-		/*
-		 * Currently pmd for thp is always present because thp
-		 * can not be swapped-out, migrated, or HWPOISONed
-		 * (split in such cases instead.)
-		 * This if-check is just to prepare for future implementation.
-		 */
-		if (pmd_present(pmd)) {
-			struct page *page = pmd_page(pmd);
+		if (is_pmd_migration_entry(pmd)) {
+			swp_entry_t entry = pmd_to_swp_entry(pmd);
 
+			frame = swp_type(entry) |
+				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+			page = migration_entry_to_page(entry);
+		} else if (pmd_present(pmd)) {
+			page = pmd_page(pmd);
 			if (page_mapcount(page) == 1)
 				flags |= PM_MMAP_EXCLUSIVE;
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f4ca23b158b3..f98a028100b6 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -790,7 +790,8 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	barrier();
 #endif
-	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
+	if (pmd_none(pmdval) || pmd_trans_huge(pmdval)
+			|| !pmd_present(pmdval))
 		return 1;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1b81cb57ff0f..6f44a2352597 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -126,7 +126,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 #define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		if (pmd_trans_huge(*____pmd)				\
+		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
 					|| pmd_devmap(*____pmd))	\
 			__split_huge_pmd(__vma, __pmd, __address,	\
 						false, NULL);		\
@@ -157,12 +157,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma);
 extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
 		struct vm_area_struct *vma);
+
+static inline int is_swap_pmd(pmd_t pmd)
+{
+	return !pmd_none(pmd) && !pmd_present(pmd);
+}
+
 /* mmap_sem must be held on entry */
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
 	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
 		return NULL;
@@ -269,6 +275,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
+static inline int is_swap_pmd(pmd_t pmd)
+{
+	return 0;
+}
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
diff --git a/mm/gup.c b/mm/gup.c
index 94fab8fa432b..2b1effb16242 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -272,6 +272,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 			return page;
 		return no_page_table(vma, flags);
 	}
+	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
+		return no_page_table(vma, flags);
+	if (!pmd_present(*pmd)) {
+retry:
+		if (likely(!(flags & FOLL_MIGRATION)))
+			return no_page_table(vma, flags);
+		pmd_migration_entry_wait(mm, pmd);
+		goto retry;
+	}
 	if (pmd_devmap(*pmd)) {
 		ptl = pmd_lock(mm, pmd);
 		page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -286,6 +295,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		return no_page_table(vma, flags);
 
 	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_present(*pmd))) {
+retry_locked:
+		if (likely(!(flags & FOLL_MIGRATION))) {
+			spin_unlock(ptl);
+			return no_page_table(vma, flags);
+		}
+		pmd_migration_entry_wait(mm, pmd);
+		goto retry_locked;
+	}
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags);
@@ -341,7 +359,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	pud = pud_offset(pgd, address);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		return -EFAULT;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	pte = pte_offset_map(pmd, address);
@@ -1369,7 +1387,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = READ_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		if (!pmd_present(pmd))
 			return 0;
 
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a9c2a0ef5b9b..3f18452f3eb1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -898,6 +898,21 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = -EAGAIN;
 	pmd = *src_pmd;
+
+	if (unlikely(is_pmd_migration_entry(pmd))) {
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		if (is_write_migration_entry(entry)) {
+			make_migration_entry_read(&entry);
+			pmd = swp_entry_to_pmd(entry);
+			set_pmd_at(src_mm, addr, src_pmd, pmd);
+		}
+		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+		ret = 0;
+		goto out_unlock;
+	}
+	WARN_ONCE(!pmd_present(pmd), "Unknown non-present format on pmd.\n");
+
 	if (unlikely(!pmd_trans_huge(pmd))) {
 		pte_free(dst_mm, pgtable);
 		goto out_unlock;
@@ -1204,6 +1219,9 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
 		goto out_unlock;
 
+	if (unlikely(!pmd_present(orig_pmd)))
+		goto out_unlock;
+
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
 	/*
@@ -1338,7 +1356,15 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		goto out;
 
-	page = pmd_page(*pmd);
+	if (is_pmd_migration_entry(*pmd)) {
+		swp_entry_t entry;
+
+		entry = pmd_to_swp_entry(*pmd);
+		page = pfn_to_page(swp_offset(entry));
+		if (!is_migration_entry(entry))
+			goto out;
+	} else
+		page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd);
@@ -1534,6 +1560,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	if (is_huge_zero_pmd(orig_pmd))
 		goto out;
 
+	if (unlikely(!pmd_present(orig_pmd)))
+		goto out;
+
 	page = pmd_page(orig_pmd);
 	/*
 	 * If other processes are mapping this page, we couldn't discard
@@ -1766,6 +1795,20 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (prot_numa && pmd_protnone(*pmd))
 		goto unlock;
 
+	if (is_pmd_migration_entry(*pmd)) {
+		swp_entry_t entry = pmd_to_swp_entry(*pmd);
+
+		if (is_write_migration_entry(entry)) {
+			pmd_t newpmd;
+
+			make_migration_entry_read(&entry);
+			newpmd = swp_entry_to_pmd(entry);
+			set_pmd_at(mm, addr, pmd, newpmd);
+		}
+		goto unlock;
+	} else if (!pmd_present(*pmd))
+		WARN_ONCE(1, "Unknown non-present format on pmd.\n");
+
 	/*
 	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
 	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
@@ -1820,7 +1863,8 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+			pmd_devmap(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
@@ -1938,14 +1982,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t _pmd;
-	bool young, write, dirty, soft_dirty;
+	bool young, write, dirty, soft_dirty, pmd_migration;
 	unsigned long addr;
 	int i;
 
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-	VM_BUG_ON(!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd));
+	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+				&& !pmd_devmap(*pmd));
 
 	count_vm_event(THP_SPLIT_PMD);
 
@@ -1970,7 +2015,14 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
-	page = pmd_page(*pmd);
+	pmd_migration = is_pmd_migration_entry(*pmd);
+	if (pmd_migration) {
+		swp_entry_t entry;
+
+		entry = pmd_to_swp_entry(*pmd);
+		page = pfn_to_page(swp_offset(entry));
+	} else
+		page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	page_ref_add(page, HPAGE_PMD_NR - 1);
 	write = pmd_write(*pmd);
@@ -1989,7 +2041,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * transferred to avoid any possibility of altering
 		 * permissions across VMAs.
 		 */
-		if (freeze) {
+		if (freeze || pmd_migration) {
 			swp_entry_t swp_entry;
 			swp_entry = make_migration_entry(page + i, write);
 			entry = swp_entry_to_pte(swp_entry);
@@ -2088,7 +2140,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		page = pmd_page(*pmd);
 		if (PageMlocked(page))
 			clear_page_mlock(page);
-	} else if (!pmd_devmap(*pmd))
+	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
 		goto out;
 	__split_huge_pmd_locked(vma, pmd, haddr, freeze);
 out:
diff --git a/mm/madvise.c b/mm/madvise.c
index a09d2d3dfae9..f410fc500486 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -311,6 +311,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	unsigned long next;
 
 	next = pmd_addr_end(addr, end);
+	if (!pmd_present(*pmd))
+		return 0;
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 712a687cda01..94eb47ca49e3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4639,6 +4639,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
 	struct page *page = NULL;
 	enum mc_target_type ret = MC_TARGET_NONE;
 
+	if (unlikely(!pmd_present(pmd)))
+		return ret;
 	page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
 	if (!(mc.flags & MOVE_ANON))
diff --git a/mm/memory.c b/mm/memory.c
index 14fc0b40f0bb..a4b247f63eb7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -998,7 +998,8 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
+			|| pmd_devmap(*src_pmd)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
 			err = copy_huge_pmd(dst_mm, src_mm,
@@ -1236,7 +1237,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
 				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
@@ -3691,6 +3692,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		pmd_t orig_pmd = *vmf.pmd;
 
 		barrier();
+		if (unlikely(is_pmd_migration_entry(orig_pmd))) {
+			pmd_migration_entry_wait(mm, vmf.pmd);
+			return 0;
+		}
 		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 118b1cd5ff1a..4a025c78fce0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -150,7 +150,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
-		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
+		if (!pmd_present(*pmd))
+			continue;
+		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
 				&& pmd_none_or_clear_bad(pmd))
 			continue;
 
@@ -160,7 +162,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range_start(mm, mni_start, end);
 		}
 
-		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
diff --git a/mm/mremap.c b/mm/mremap.c
index 8233b0105c82..5d537ce12adc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -213,7 +213,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-		if (pmd_trans_huge(*old_pmd)) {
+		if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
 			if (extent == HPAGE_PMD_SIZE) {
 				bool moved;
 				/* See comment in move_ptes() */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 07/11] mm: soft-dirty: keep soft-dirty bits over thp migration
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:45   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

The soft dirty bit is designed to be preserved across page migration. This
patch makes it work in the same manner for thp migration too.
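
As a rough sketch (not part of the patch) of how the new helpers are meant
to be used, the soft-dirty bit of a present pmd is carried onto the
swap-format pmd when the migration entry is installed, and restored when it
is removed. The two helper functions below are hypothetical and simply
mirror the code in mm/huge_memory.c:

/* present pmd -> pmd migration entry, keeping the soft-dirty bit */
static pmd_t thp_mig_entry_from_pmd(struct page *page, pmd_t pmdval)
{
	swp_entry_t entry = make_migration_entry(page, pmd_write(pmdval));
	pmd_t pmdswp = swp_entry_to_pmd(entry);

	if (pmd_soft_dirty(pmdval))
		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
	return pmdswp;
}

/* pmd migration entry -> present pmd, restoring the soft-dirty bit */
static pmd_t thp_pmd_from_mig_entry(struct page *new, pmd_t pmdswp,
				    struct vm_area_struct *vma)
{
	pmd_t pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));

	if (pmd_swp_soft_dirty(pmdswp))
		pmde = pmd_mksoft_dirty(pmde);
	if (is_write_migration_entry(pmd_to_swp_entry(pmdswp)))
		pmde = maybe_pmd_mkwrite(pmde, vma);
	return pmde;
}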

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1 -> v2:
- separate diff moving _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
- clear_soft_dirty_pmd can handle migration entry
---
 arch/x86/include/asm/pgtable.h | 17 +++++++++++++++++
 fs/proc/task_mmu.c             | 27 ++++++++++++++++-----------
 include/asm-generic/pgtable.h  | 34 +++++++++++++++++++++++++++++++++-
 include/linux/swapops.h        |  2 ++
 mm/huge_memory.c               | 22 +++++++++++++++++++++-
 5 files changed, 89 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1cfb36b8c024..e57abf8e926c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1088,6 +1088,23 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+#endif
 #endif
 
 #define PKRU_AD_BIT 0x1
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f2b0f3ba25ac..6ea9546ea8c1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -908,17 +908,22 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 {
 	pmd_t pmd = *pmdp;
 
-	/* See comment in change_huge_pmd() */
-	pmdp_invalidate(vma, addr, pmdp);
-	if (pmd_dirty(*pmdp))
-		pmd = pmd_mkdirty(pmd);
-	if (pmd_young(*pmdp))
-		pmd = pmd_mkyoung(pmd);
-
-	pmd = pmd_wrprotect(pmd);
-	pmd = pmd_clear_soft_dirty(pmd);
-
-	set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	if (pmd_present(pmd)) {
+		/* See comment in change_huge_pmd() */
+		pmdp_invalidate(vma, addr, pmdp);
+		if (pmd_dirty(*pmdp))
+			pmd = pmd_mkdirty(pmd);
+		if (pmd_young(*pmdp))
+			pmd = pmd_mkyoung(pmd);
+
+		pmd = pmd_wrprotect(pmd);
+		pmd = pmd_clear_soft_dirty(pmd);
+
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+		pmd = pmd_swp_clear_soft_dirty(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	}
 }
 #else
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f98a028100b6..7c781aef0911 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -574,7 +574,24 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
-#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+#else /* !CONFIG_HAVE_ARCH_SOFT_DIRTY */
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return 0;
@@ -619,6 +636,21 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 {
 	return pte;
 }
+
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+	return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+	return pmd;
+}
 #endif
 
 #ifndef __HAVE_PFNMAP_TRACKING
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 6625bea13869..b52674bd4173 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -179,6 +179,8 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
 {
 	swp_entry_t arch_entry;
 
+	if (pmd_swp_soft_dirty(pmd))
+		pmd = pmd_swp_clear_soft_dirty(pmd);
 	arch_entry = __pmd_to_swp_entry(pmd);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3f18452f3eb1..f8a11d634a7e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -905,6 +905,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (is_write_migration_entry(entry)) {
 			make_migration_entry_read(&entry);
 			pmd = swp_entry_to_pmd(entry);
+			if (pmd_swp_soft_dirty(pmd))
+				pmd = pmd_swp_mksoft_dirty(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
 		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1707,6 +1709,17 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
 }
 #endif
 
+static pmd_t move_soft_dirty_pmd(pmd_t pmd)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	if (unlikely(is_pmd_migration_entry(pmd)))
+		pmd = pmd_swp_mksoft_dirty(pmd);
+	else if (pmd_present(pmd))
+		pmd = pmd_mksoft_dirty(pmd);
+#endif
+	return pmd;
+}
+
 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		  unsigned long new_addr, unsigned long old_end,
 		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
@@ -1749,7 +1762,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
 		}
-		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
+		pmd = move_soft_dirty_pmd(pmd);
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		if (new_ptl != old_ptl)
 			spin_unlock(new_ptl);
 		if (force_flush)
@@ -2751,6 +2765,8 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 			set_page_dirty(page);
 		entry = make_migration_entry(page, pmd_write(pmdval));
 		pmdswp = swp_entry_to_pmd(entry);
+		if (pmd_soft_dirty(pmdval))
+			pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 		set_pmd_at(mm, address, pvmw->pmd, pmdswp);
 		page_remove_rmap(page, true);
 		put_page(page);
@@ -2767,6 +2783,8 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 			set_page_dirty(subpage);
 		entry = make_migration_entry(subpage, pte_write(pteval));
 		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
 		set_pte_at(mm, address, pvmw->pte, swp_pte);
 		page_remove_rmap(subpage, false);
 		put_page(subpage);
@@ -2790,6 +2808,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 		entry = pmd_to_swp_entry(*pvmw->pmd);
 		get_page(new);
 		pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+		if (pmd_swp_soft_dirty(*pvmw->pmd))
+			pmde = pmd_mksoft_dirty(pmde);
 		if (is_write_migration_entry(entry))
 			pmde = maybe_pmd_mkwrite(pmde, vma);
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

+		if (pmd_swp_soft_dirty(*pvmw->pmd))
+			pmde = pmd_mksoft_dirty(pmde);
 		if (is_write_migration_entry(entry))
 			pmde = maybe_pmd_mkwrite(pmde, vma);
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 08/11] mm: hwpoison: soft offline supports thp migration
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:45   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch enables thp migration for soft offline.
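
Soft offline can be exercised from userspace with madvise(MADV_SOFT_OFFLINE),
which needs CONFIG_MEMORY_FAILURE and CAP_SYS_ADMIN. A minimal sketch
(illustration only, not part of this patch; the 2MB size assumes x86_64)
that soft-offlines one base page inside a THP-backed buffer, which with this
patch should migrate the THP whole instead of splitting it first:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101
#endif

int main(void)
{
        size_t len = 2UL << 20;                 /* one PMD-sized THP */
        char *buf = aligned_alloc(len, len);

        if (!buf)
                return 1;
        madvise(buf, len, MADV_HUGEPAGE);       /* hint for a THP mapping */
        memset(buf, 1, len);                    /* fault the range in */
        if (madvise(buf, getpagesize(), MADV_SOFT_OFFLINE))
                perror("MADV_SOFT_OFFLINE");
        return 0;
}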

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory-failure.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b78d08016254..4c9f124c95c8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1483,7 +1483,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
 	if (PageHuge(p))
 		return alloc_huge_page_node(page_hstate(compound_head(p)),
 						   nid);
-	else
+	else if (thp_migration_supported() && PageTransHuge(p)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(nid,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
 }
 
@@ -1691,28 +1701,11 @@ static int __soft_offline_page(struct page *page, int flags)
 static int soft_offline_in_use_page(struct page *page, int flags)
 {
 	int ret;
-	struct page *hpage = compound_head(page);
-
-	if (!PageHuge(page) && PageTransHuge(hpage)) {
-		lock_page(hpage);
-		if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
-			unlock_page(hpage);
-			if (!PageAnon(hpage))
-				pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
-			else
-				pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
-			put_hwpoison_page(hpage);
-			return -EBUSY;
-		}
-		unlock_page(hpage);
-		get_hwpoison_page(page);
-		put_hwpoison_page(hpage);
-	}
 
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page, flags);
 	else
-		ret = __soft_offline_page(page, flags);
+		ret = __soft_offline_page(compound_head(page), flags);
 
 	return ret;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 09/11] mm: mempolicy: mbind and migrate_pages support thp migration
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:45   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch enables thp migration for mbind(2) and migrate_pages(2).
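
The mbind(2) path can be exercised with something like the sketch below
(illustration only, not part of this patch; link with -lnuma for the
mbind() wrapper, and a machine with a node 1 is assumed). With
thp_migration_supported(), the THP should be moved whole instead of being
split by queue_pages_pte_range():

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20;
        unsigned long nodemask = 1UL << 1;      /* target node 1 */
        char *buf = aligned_alloc(len, len);

        if (!buf)
                return 1;
        madvise(buf, len, MADV_HUGEPAGE);
        memset(buf, 1, len);                    /* allocate on the local node */
        if (mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
                  MPOL_MF_MOVE))
                perror("mbind");
        return 0;
}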

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1 -> v2:
- support pte-mapped and doubly-mapped thp
---
 mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 79 insertions(+), 29 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aa242da77fda..d880dc6e9c6b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -94,6 +94,7 @@
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
+#include <linux/swapops.h>
 
 #include <asm/tlbflush.h>
 #include <linux/uaccess.h>
@@ -486,6 +487,49 @@ static inline bool queue_pages_node_check(struct page *page,
 	return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
 }
 
+static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	int ret = 0;
+	struct page *page;
+	struct queue_pages *qp = walk->private;
+	unsigned long flags;
+
+	if (unlikely(is_pmd_migration_entry(*pmd))) {
+		ret = 1;
+		goto unlock;
+	}
+	page = pmd_page(*pmd);
+	if (is_huge_zero_page(page)) {
+		spin_unlock(ptl);
+		__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
+		goto out;
+	}
+	if (!thp_migration_supported()) {
+		get_page(page);
+		spin_unlock(ptl);
+		lock_page(page);
+		ret = split_huge_page(page);
+		unlock_page(page);
+		put_page(page);
+		goto out;
+	}
+	if (queue_pages_node_check(page, qp)) {
+		ret = 1;
+		goto unlock;
+	}
+
+	ret = 1;
+	flags = qp->flags;
+	/* go to thp migration */
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_page_add(page, qp->pagelist, flags);
+unlock:
+	spin_unlock(ptl);
+out:
+	return ret;
+}
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
@@ -497,30 +541,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	struct queue_pages *qp = walk->private;
 	unsigned long flags = qp->flags;
-	int nid, ret;
+	int ret;
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge(*pmd)) {
-		ptl = pmd_lock(walk->mm, pmd);
-		if (pmd_trans_huge(*pmd)) {
-			page = pmd_page(*pmd);
-			if (is_huge_zero_page(page)) {
-				spin_unlock(ptl);
-				__split_huge_pmd(vma, pmd, addr, false, NULL);
-			} else {
-				get_page(page);
-				spin_unlock(ptl);
-				lock_page(page);
-				ret = split_huge_page(page);
-				unlock_page(page);
-				put_page(page);
-				if (ret)
-					return 0;
-			}
-		} else {
-			spin_unlock(ptl);
-		}
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
+		if (ret)
+			return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
@@ -541,7 +570,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		if (queue_pages_node_check(page, qp))
 			continue;
-		if (PageTransCompound(page)) {
+		if (PageTransCompound(page) && !thp_migration_supported()) {
 			get_page(page);
 			pte_unmap_unlock(pte, ptl);
 			lock_page(page);
@@ -959,19 +988,21 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 
 #ifdef CONFIG_MIGRATION
 /*
- * page migration
+ * page migration, thp tail pages can be passed.
  */
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags)
 {
+	struct page *head = compound_head(page);
 	/*
 	 * Avoid migrating a page that is shared with others.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
-		if (!isolate_lru_page(page)) {
-			list_add_tail(&page->lru, pagelist);
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-					    page_is_file_cache(page));
+	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
+		if (!isolate_lru_page(head)) {
+			list_add_tail(&head->lru, pagelist);
+			mod_node_page_state(page_pgdat(head),
+				NR_ISOLATED_ANON + page_is_file_cache(head),
+				hpage_nr_pages(head));
 		}
 	}
 }
@@ -981,7 +1012,17 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
 	if (PageHuge(page))
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
 					node);
-	else
+	else if (thp_migration_supported() && PageTransHuge(page)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(node,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
 						    __GFP_THISNODE, 0);
 }
@@ -1147,6 +1188,15 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
 	if (PageHuge(page)) {
 		BUG_ON(!vma);
 		return alloc_huge_page_noerr(vma, address, 1);
+	} else if (thp_migration_supported() && PageTransHuge(page)) {
+		struct page *thp;
+
+		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+					 HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
 	}
 	/*
 	 * if !vma, alloc_page_vma() will use task or system default policy
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 10/11] mm: migrate: move_pages() supports thp migration
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:45   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch enables thp migration for move_pages(2).
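
A minimal sketch of the move_pages(2) path (illustration only, not part of
this patch; link with -lnuma, and a node 1 is assumed) that migrates one
2MB THP with a single call, passing only the head-page address:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20;
        char *buf = aligned_alloc(len, len);
        void *pages[1] = { buf };
        int nodes[1] = { 1 };
        int status[1] = { -1 };

        if (!buf)
                return 1;
        madvise(buf, len, MADV_HUGEPAGE);
        memset(buf, 1, len);
        /* without FOLL_SPLIT the whole THP is queued for migration */
        if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE))
                perror("move_pages");
        else
                printf("page is now on node %d\n", status[0]);
        return 0;
}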

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/migrate.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 0bbad6dcf95a..0f9a97c76298 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1410,7 +1410,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
 	if (PageHuge(p))
 		return alloc_huge_page_node(page_hstate(compound_head(p)),
 					pm->node);
-	else
+	else if (thp_migration_supported() && PageTransHuge(p)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(pm->node,
+			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+			HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
 		return __alloc_pages_node(pm->node,
 				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
 }
@@ -1437,6 +1447,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 	for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
 		struct vm_area_struct *vma;
 		struct page *page;
+		struct page *head;
+		unsigned int follflags;
 
 		err = -EFAULT;
 		vma = find_vma(mm, pp->addr);
@@ -1444,8 +1456,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 			goto set_status;
 
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, pp->addr,
-				FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
+		follflags = FOLL_GET | FOLL_DUMP;
+		if (!thp_migration_supported())
+			follflags |= FOLL_SPLIT;
+		page = follow_page(vma, pp->addr, follflags);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1455,7 +1469,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		if (!page)
 			goto set_status;
 
-		pp->page = page;
 		err = page_to_nid(page);
 
 		if (err == pp->node)
@@ -1470,16 +1483,22 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 			goto put_and_set;
 
 		if (PageHuge(page)) {
-			if (PageHead(page))
+			if (PageHead(page)) {
 				isolate_huge_page(page, &pagelist);
+				err = 0;
+				pp->page = page;
+			}
 			goto put_and_set;
 		}
 
-		err = isolate_lru_page(page);
+		pp->page = compound_head(page);
+		head = compound_head(page);
+		err = isolate_lru_page(head);
 		if (!err) {
-			list_add_tail(&page->lru, &pagelist);
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-					    page_is_file_cache(page));
+			list_add_tail(&head->lru, &pagelist);
+			mod_node_page_state(page_pgdat(head),
+				NR_ISOLATED_ANON + page_is_file_cache(head),
+				hpage_nr_pages(head));
 		}
 put_and_set:
 		/*
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 11/11] mm: memory_hotplug: memory hotremove supports thp migration
  2017-03-13 15:44 ` Zi Yan
@ 2017-03-13 15:45   ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-13 15:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

This patch enables thp migration for memory hotremove.
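
Hotremove can be exercised by offlining a memory block through sysfs, which
ends up in do_migrate_range(); with this patch any THP in the block should
be migrated out whole. A minimal sketch (illustration only, not part of
this patch; block 32 is an arbitrary assumption and must be a removable
block on the test machine):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char *state = "/sys/devices/system/memory/memory32/state";
        int fd = open(state, O_WRONLY);

        if (fd < 0 || write(fd, "offline", 7) != 7)
                perror(state);
        if (fd >= 0)
                close(fd);
        return 0;
}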

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog v1->v2:
- base code switched from alloc_migrate_target to new_node_page()
---
 include/linux/huge_mm.h |  8 ++++++++
 mm/memory_hotplug.c     | 17 ++++++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6f44a2352597..92c2161704c3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -189,6 +189,13 @@ static inline int hpage_nr_pages(struct page *page)
 	return 1;
 }
 
+static inline int hpage_order(struct page *page)
+{
+	if (unlikely(PageTransHuge(page)))
+		return HPAGE_PMD_ORDER;
+	return 0;
+}
+
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags);
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
@@ -233,6 +240,7 @@ static inline bool thp_migration_supported(void)
 #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
 
 #define hpage_nr_pages(x) 1
+#define hpage_order(x) 0
 
 #define transparent_hugepage_enabled(__vma) 0
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6fb6bd2df787..2b014017a217 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1566,6 +1566,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 	int nid = page_to_nid(page);
 	nodemask_t nmask = node_states[N_MEMORY];
 	struct page *new_page = NULL;
+	unsigned int order = 0;
 
 	/*
 	 * TODO: allocate a destination hugepage from a nearest neighbor node,
@@ -1576,6 +1577,11 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
 					next_node_in(nid, nmask));
 
+	if (thp_migration_supported() && PageTransHuge(page)) {
+		order = hpage_order(page);
+		gfp_mask |= GFP_TRANSHUGE;
+	}
+
 	node_clear(nid, nmask);
 
 	if (PageHighMem(page)
@@ -1583,12 +1589,15 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		gfp_mask |= __GFP_HIGHMEM;
 
 	if (!nodes_empty(nmask))
-		new_page = __alloc_pages_nodemask(gfp_mask, 0,
+		new_page = __alloc_pages_nodemask(gfp_mask, order,
 					node_zonelist(nid, gfp_mask), &nmask);
 	if (!new_page)
-		new_page = __alloc_pages(gfp_mask, 0,
+		new_page = __alloc_pages(gfp_mask, order,
 					node_zonelist(nid, gfp_mask));
 
+	if (new_page && order == hpage_order(page))
+		prep_transhuge_page(new_page);
+
 	return new_page;
 }
 
@@ -1618,7 +1627,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			if (isolate_huge_page(page, &source))
 				move_pages -= 1 << compound_order(head);
 			continue;
-		}
+		} else if (thp_migration_supported() && PageTransHuge(page))
+			pfn = page_to_pfn(compound_head(page))
+				+ hpage_nr_pages(page) - 1;
 
 		if (!get_page_unless_zero(page))
 			continue;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 05/11] mm: thp: enable thp migration in generic path
  2017-03-13 15:45   ` Zi Yan
  (?)
@ 2017-03-14 21:19   ` kbuild test robot
  2017-03-14 21:55       ` Zi Yan
  -1 siblings, 1 reply; 52+ messages in thread
From: kbuild test robot @ 2017-03-14 21:19 UTC (permalink / raw)
  To: Zi Yan
  Cc: kbuild-all, linux-kernel, linux-mm, kirill.shutemov, akpm,
	minchan, vbabka, mgorman, mhocko, n-horiguchi, khandual, zi.yan,
	dnellans

[-- Attachment #1: Type: text/plain, Size: 2096 bytes --]

Hi Naoya,

[auto build test WARNING on mmotm/master]
[also build test WARNING on next-20170310]
[cannot apply to v4.11-rc2]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170315-042736
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All warnings (new ones prefixed by >>):

   In file included from fs/proc/task_mmu.c:15:0:
   include/linux/swapops.h: In function 'remove_migration_pmd':
   include/linux/swapops.h:209:2: warning: 'return' with a value, in function returning void
     return 0;
     ^
   include/linux/swapops.h: In function 'swp_entry_to_pmd':
>> include/linux/swapops.h:223:2: warning: missing braces around initializer [-Wmissing-braces]
     return (pmd_t){ 0 };
     ^
   include/linux/swapops.h:223:2: warning: (near initialization for '(anonymous).pmd') [-Wmissing-braces]

vim +223 include/linux/swapops.h

   203	}
   204	
   205	static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
   206			struct page *new)
   207	{
   208		BUILD_BUG();
 > 209		return 0;
   210	}
   211	
   212	static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
   213	
   214	static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
   215	{
   216		BUILD_BUG();
   217		return swp_entry(0, 0);
   218	}
   219	
   220	static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
   221	{
   222		BUILD_BUG();
 > 223		return (pmd_t){ 0 };
   224	}
   225	
   226	static inline int is_pmd_migration_entry(pmd_t pmd)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 11920 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 05/11] mm: thp: enable thp migration in generic path
  2017-03-13 15:45   ` Zi Yan
  (?)
  (?)
@ 2017-03-14 21:26   ` kbuild test robot
  -1 siblings, 0 replies; 52+ messages in thread
From: kbuild test robot @ 2017-03-14 21:26 UTC (permalink / raw)
  To: Zi Yan
  Cc: kbuild-all, linux-kernel, linux-mm, kirill.shutemov, akpm,
	minchan, vbabka, mgorman, mhocko, n-horiguchi, khandual, zi.yan,
	dnellans

[-- Attachment #1: Type: text/plain, Size: 8071 bytes --]

Hi Naoya,

[auto build test ERROR on mmotm/master]
[also build test ERROR on next-20170310]
[cannot apply to v4.11-rc2]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170315-042736
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-s0-201711 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from fs/proc/task_mmu.c:15:0:
   include/linux/swapops.h: In function 'remove_migration_pmd':
>> include/linux/swapops.h:209:9: warning: 'return' with a value, in function returning void
     return 0;
            ^
   include/linux/swapops.h:205:20: note: declared here
    static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
                       ^~~~~~~~~~~~~~~~~~~~
--
   In file included from mm/page_vma_mapped.c:5:0:
   include/linux/swapops.h: In function 'remove_migration_pmd':
>> include/linux/swapops.h:209:9: warning: 'return' with a value, in function returning void
     return 0;
            ^
   include/linux/swapops.h:205:20: note: declared here
    static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
                       ^~~~~~~~~~~~~~~~~~~~
   In file included from include/asm-generic/bug.h:4:0,
                    from arch/x86/include/asm/bug.h:35,
                    from include/linux/bug.h:4,
                    from include/linux/mmdebug.h:4,
                    from include/linux/mm.h:8,
                    from mm/page_vma_mapped.c:1:
   In function 'pmd_to_swp_entry.isra.14',
       inlined from 'page_vma_mapped_walk' at mm/page_vma_mapped.c:149:8:
>> include/linux/compiler.h:537:38: error: call to '__compiletime_assert_216' declared with attribute error: BUILD_BUG failed
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                         ^
   include/linux/compiler.h:520:4: note: in definition of macro '__compiletime_assert'
       prefix ## suffix();    \
       ^~~~~~
   include/linux/compiler.h:537:2: note: in expansion of macro '_compiletime_assert'
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
     ^~~~~~~~~~~~~~~~~~~
   include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
    #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                        ^~~~~~~~~~~~~~~~~~
   include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
    #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
                        ^~~~~~~~~~~~~~~~
>> include/linux/swapops.h:216:2: note: in expansion of macro 'BUILD_BUG'
     BUILD_BUG();
     ^~~~~~~~~
--
   In file included from mm/rmap.c:53:0:
   include/linux/swapops.h: In function 'remove_migration_pmd':
>> include/linux/swapops.h:209:9: warning: 'return' with a value, in function returning void
     return 0;
            ^
   include/linux/swapops.h:205:20: note: declared here
    static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
                       ^~~~~~~~~~~~~~~~~~~~
   In file included from include/asm-generic/bug.h:4:0,
                    from arch/x86/include/asm/bug.h:35,
                    from include/linux/bug.h:4,
                    from include/linux/mmdebug.h:4,
                    from include/linux/mm.h:8,
                    from mm/rmap.c:48:
   In function 'set_pmd_migration_entry.isra.28',
       inlined from 'try_to_unmap_one' at mm/rmap.c:1317:5:
   include/linux/compiler.h:537:38: error: call to '__compiletime_assert_202' declared with attribute error: BUILD_BUG failed
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                         ^
   include/linux/compiler.h:520:4: note: in definition of macro '__compiletime_assert'
       prefix ## suffix();    \
       ^~~~~~
   include/linux/compiler.h:537:2: note: in expansion of macro '_compiletime_assert'
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
     ^~~~~~~~~~~~~~~~~~~
   include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
    #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                        ^~~~~~~~~~~~~~~~~~
   include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
    #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
                        ^~~~~~~~~~~~~~~~
   include/linux/swapops.h:202:2: note: in expansion of macro 'BUILD_BUG'
     BUILD_BUG();
     ^~~~~~~~~
--
   In file included from mm/migrate.c:18:0:
   include/linux/swapops.h: In function 'remove_migration_pmd':
>> include/linux/swapops.h:209:9: warning: 'return' with a value, in function returning void
     return 0;
            ^
   include/linux/swapops.h:205:20: note: declared here
    static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
                       ^~~~~~~~~~~~~~~~~~~~
   In file included from include/asm-generic/bug.h:4:0,
                    from arch/x86/include/asm/bug.h:35,
                    from include/linux/bug.h:4,
                    from include/linux/mmdebug.h:4,
                    from include/linux/mm.h:8,
                    from include/linux/migrate.h:4,
                    from mm/migrate.c:15:
   In function 'remove_migration_pmd.isra.32',
       inlined from 'remove_migration_pte' at mm/migrate.c:217:4:
   include/linux/compiler.h:537:38: error: call to '__compiletime_assert_208' declared with attribute error: BUILD_BUG failed
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                         ^
   include/linux/compiler.h:520:4: note: in definition of macro '__compiletime_assert'
       prefix ## suffix();    \
       ^~~~~~
   include/linux/compiler.h:537:2: note: in expansion of macro '_compiletime_assert'
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
     ^~~~~~~~~~~~~~~~~~~
   include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
    #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                        ^~~~~~~~~~~~~~~~~~
   include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
    #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
                        ^~~~~~~~~~~~~~~~
   include/linux/swapops.h:208:2: note: in expansion of macro 'BUILD_BUG'
     BUILD_BUG();
     ^~~~~~~~~

vim +/__compiletime_assert_216 +537 include/linux/compiler.h

9a8ab1c3 Daniel Santos  2013-02-21  531   *
9a8ab1c3 Daniel Santos  2013-02-21  532   * In tradition of POSIX assert, this macro will break the build if the
9a8ab1c3 Daniel Santos  2013-02-21  533   * supplied condition is *false*, emitting the supplied error message if the
9a8ab1c3 Daniel Santos  2013-02-21  534   * compiler has support to do so.
9a8ab1c3 Daniel Santos  2013-02-21  535   */
9a8ab1c3 Daniel Santos  2013-02-21  536  #define compiletime_assert(condition, msg) \
9a8ab1c3 Daniel Santos  2013-02-21 @537  	_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
9a8ab1c3 Daniel Santos  2013-02-21  538  
47933ad4 Peter Zijlstra 2013-11-06  539  #define compiletime_assert_atomic_type(t)				\
47933ad4 Peter Zijlstra 2013-11-06  540  	compiletime_assert(__native_word(t),				\

:::::: The code at line 537 was first introduced by commit
:::::: 9a8ab1c39970a4938a72d94e6fd13be88a797590 bug.h, compiler.h: introduce compiletime_assert & BUILD_BUG_ON_MSG

:::::: TO: Daniel Santos <daniel.santos@pobox.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27412 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 05/11] mm: thp: enable thp migration in generic path
  2017-03-14 21:19   ` kbuild test robot
@ 2017-03-14 21:55       ` Zi Yan
  0 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-14 21:55 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-kernel, linux-mm, kirill.shutemov, akpm,
	minchan, vbabka, mgorman, mhocko, n-horiguchi, khandual,
	dnellans

[-- Attachment #1: Type: text/plain, Size: 1955 bytes --]



On 03/14/2017 04:19 PM, kbuild test robot wrote:
> Hi Naoya,
>
> [auto build test WARNING on mmotm/master]
> [also build test WARNING on next-20170310]
> [cannot apply to v4.11-rc2]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170315-042736
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: m68k-sun3_defconfig (attached as .config)
> compiler: m68k-linux-gcc (GCC) 4.9.0
> reproduce:
>         wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=m68k 
>
> All warnings (new ones prefixed by >>):
>
>    In file included from fs/proc/task_mmu.c:15:0:
>    include/linux/swapops.h: In function 'remove_migration_pmd':
>    include/linux/swapops.h:209:2: warning: 'return' with a value, in function returning void
>      return 0;
>      ^
>    include/linux/swapops.h: In function 'swp_entry_to_pmd':

I will remove "return 0;" in next version.

--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -208,7 +208,6 @@ static inline void remove_migration_pmd(struct
page_vma_mapped_walk *pvmw,
                struct page *new)
 {
        BUILD_BUG();
-       return 0;
 }

 static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t
*p) { }


>>> include/linux/swapops.h:223:2: warning: missing braces around initializer [-Wmissing-braces]
>      return (pmd_t){ 0 };
>      ^
>    include/linux/swapops.h:223:2: warning: (near initialization for '(anonymous).pmd') [-Wmissing-braces]

I do not have any warning with gcc 6.3.0. This seems to be a GCC bug
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119).


-- 
Best Regards,
Yan Zi



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 537 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 05/11] mm: thp: enable thp migration in generic path
  2017-03-14 21:55       ` Zi Yan
@ 2017-03-15  9:01         ` Geert Uytterhoeven
  -1 siblings, 0 replies; 52+ messages in thread
From: Geert Uytterhoeven @ 2017-03-15  9:01 UTC (permalink / raw)
  To: Zi Yan
  Cc: kbuild test robot, kbuild-all, linux-kernel, Linux MM,
	Kirill A. Shutemov, Andrew Morton, Minchan Kim, Vlastimil Babka,
	Mel Gorman, Michal Hocko, Naoya Horiguchi, Anshuman Khandual,
	dnellans

On Tue, Mar 14, 2017 at 10:55 PM, Zi Yan <zi.yan@cs.rutgers.edu> wrote:
>>>> include/linux/swapops.h:223:2: warning: missing braces around initializer [-Wmissing-braces]
>>      return (pmd_t){ 0 };
>>      ^
>>    include/linux/swapops.h:223:2: warning: (near initialization for '(anonymous).pmd') [-Wmissing-braces]
>
> I do not have any warning with gcc 6.3.0. This seems to be a GCC bug
> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119).

I guess you need

    return (pmd_t) { { 0, }};

to kill the warning.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 05/11] mm: thp: enable thp migration in generic path
  2017-03-15  9:01         ` Geert Uytterhoeven
@ 2017-03-15 16:00           ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-15 16:00 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: kbuild test robot, kbuild-all, linux-kernel, Linux MM,
	Kirill A. Shutemov, Andrew Morton, Minchan Kim, Vlastimil Babka,
	Mel Gorman, Michal Hocko, Naoya Horiguchi, Anshuman Khandual,
	dnellans

[-- Attachment #1: Type: text/plain, Size: 715 bytes --]

On 15 Mar 2017, at 4:01, Geert Uytterhoeven wrote:

> On Tue, Mar 14, 2017 at 10:55 PM, Zi Yan <zi.yan@cs.rutgers.edu> wrote:
>>>>> include/linux/swapops.h:223:2: warning: missing braces around initializer [-Wmissing-braces]
>>>      return (pmd_t){ 0 };
>>>      ^
>>>    include/linux/swapops.h:223:2: warning: (near initialization for '(anonymous).pmd') [-Wmissing-braces]
>>
>> I do not have any warning with gcc 6.3.0. This seems to be a GCC bug
>> (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119).
>
> I guess you need
>
>     return (pmd_t) { { 0, }};
>
> to kill the warning.

Yeah, that should work. I found the same solution on StackOverflow.
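
For the record, a standalone sketch of what gcc 4.9 trips over, assuming an
m68k-style pmd_t that wraps an array (the typedef below is an illustration,
not copied from the kernel headers):

typedef struct { unsigned long pmd[16]; } pmd_t;

static inline pmd_t swp_entry_to_pmd_old(void)
{
        return (pmd_t){ 0 };            /* gcc 4.9: -Wmissing-braces */
}

static inline pmd_t swp_entry_to_pmd_new(void)
{
        return (pmd_t){ { 0, } };       /* fully braced: no warning */
}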

Thanks.

--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 04/11] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  2017-03-13 15:45   ` Zi Yan
@ 2017-03-24 14:10     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 52+ messages in thread
From: Kirill A. Shutemov @ 2017-03-24 14:10 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, mhocko, n-horiguchi, khandual, zi.yan, dnellans

On Mon, Mar 13, 2017 at 11:45:00AM -0400, Zi Yan wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> Introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
> functionality to x86_64, which should be safer at the first step.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> v1 -> v2:
> - fixed config name in subject and patch description
> ---
>  arch/x86/Kconfig        |  4 ++++
>  include/linux/huge_mm.h | 10 ++++++++++
>  mm/Kconfig              |  3 +++
>  3 files changed, 17 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 69188841717a..a24bc11c7aed 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2276,6 +2276,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>  	def_bool y
>  	depends on X86_64 && HUGETLB_PAGE && MIGRATION
>  
> +config ARCH_ENABLE_THP_MIGRATION
> +	def_bool y
> +	depends on X86_64 && TRANSPARENT_HUGEPAGE && MIGRATION
> +

TRANSPARENT_HUGEPAGE implies MIGRATION due to COMPACTION.


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 04/11] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
  2017-03-24 14:10     ` Kirill A. Shutemov
@ 2017-03-24 14:21       ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-24 14:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Zi Yan, linux-kernel, linux-mm, kirill.shutemov, akpm, minchan,
	vbabka, mgorman, mhocko, n-horiguchi, khandual, dnellans

[-- Attachment #1: Type: text/plain, Size: 1265 bytes --]



Kirill A. Shutemov wrote:
> On Mon, Mar 13, 2017 at 11:45:00AM -0400, Zi Yan wrote:
>> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> Introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
>> functionality to x86_64, which should be safer at the first step.
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> ---
>> v1 -> v2:
>> - fixed config name in subject and patch description
>> ---
>>  arch/x86/Kconfig        |  4 ++++
>>  include/linux/huge_mm.h | 10 ++++++++++
>>  mm/Kconfig              |  3 +++
>>  3 files changed, 17 insertions(+)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 69188841717a..a24bc11c7aed 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -2276,6 +2276,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>>  	def_bool y
>>  	depends on X86_64 && HUGETLB_PAGE && MIGRATION
>>  
>> +config ARCH_ENABLE_THP_MIGRATION
>> +	def_bool y
>> +	depends on X86_64 && TRANSPARENT_HUGEPAGE && MIGRATION
>> +
> 
> TRANSPARENT_HUGEPAGE implies MIGRATION due to COMPACTION.
> 

Sure. I will change it to:

+config ARCH_ENABLE_THP_MIGRATION
+	def_bool y
+	depends on X86_64 && TRANSPARENT_HUGEPAGE
+
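
For reference, the chain that makes the explicit MIGRATION dependency
redundant is roughly this in mm/Kconfig (paraphrased, not the exact text):

	config TRANSPARENT_HUGEPAGE
		select COMPACTION

	config COMPACTION
		select MIGRATION

so TRANSPARENT_HUGEPAGE=y already guarantees MIGRATION=y.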


Thanks.

-- 
Best Regards,
Yan Zi


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 537 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 05/11] mm: thp: enable thp migration in generic path
  2017-03-13 15:45   ` Zi Yan
@ 2017-03-24 14:28     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 52+ messages in thread
From: Kirill A. Shutemov @ 2017-03-24 14:28 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, mhocko, n-horiguchi, khandual, zi.yan, dnellans

On Mon, Mar 13, 2017 at 11:45:01AM -0400, Zi Yan wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> This patch adds thp migration's core code, including conversions
> between a PMD entry and a swap entry, setting PMD migration entry,
> removing PMD migration entry, and waiting on PMD migration entries.
> 
> This patch makes it possible to support thp migration.
> If you fail to allocate a destination page as a thp, you just split
> the source thp as we do now, and then enter the normal page migration.
> If you succeed to allocate destination thp, you enter thp migration.
> Subsequent patches actually enable thp migration for each caller of
> page migration by allowing its get_new_page() callback to
> allocate thps.
> 
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> ChangeLog v2 -> v3:
> - use page_vma_mapped_walk()
> 
> ChangeLog v3 -> v4:
> - factor out the code of removing pte pgtable page in zap_huge_pmd()
> 
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>

See a few questions below.

It would be nice to split it into a few patches. Probably three or four.

> ---
>  arch/x86/include/asm/pgtable_64.h |   2 +
>  include/linux/swapops.h           |  70 +++++++++++++++++-
>  mm/huge_memory.c                  | 147 ++++++++++++++++++++++++++++++++++----
>  mm/migrate.c                      |  29 +++++++-
>  mm/page_vma_mapped.c              |  13 +++-
>  mm/pgtable-generic.c              |   3 +-
>  mm/rmap.c                         |   9 +++
>  7 files changed, 252 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index a5c4fc62e078..350397fd2129 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -187,7 +187,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>  					 ((type) << (SWP_TYPE_FIRST_BIT)) \
>  					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
> +#define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
>  #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
> +#define __swp_entry_to_pmd(x)		((pmd_t) { .pmd = (x).val })
>  
>  extern int kern_addr_valid(unsigned long addr);
>  extern void cleanup_highmap(void);
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 5c3a5f3e7eec..6625bea13869 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -103,7 +103,8 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
>  #ifdef CONFIG_MIGRATION
>  static inline swp_entry_t make_migration_entry(struct page *page, int write)
>  {
> -	BUG_ON(!PageLocked(page));
> +	BUG_ON(!PageLocked(compound_head(page)));
> +
>  	return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
>  			page_to_pfn(page));
>  }
> @@ -126,7 +127,7 @@ static inline struct page *migration_entry_to_page(swp_entry_t entry)
>  	 * Any use of migration entries may only occur while the
>  	 * corresponding page is locked
>  	 */
> -	BUG_ON(!PageLocked(p));
> +	BUG_ON(!PageLocked(compound_head(p)));
>  	return p;
>  }
>  
> @@ -163,6 +164,71 @@ static inline int is_write_migration_entry(swp_entry_t entry)
>  
>  #endif
>  
> +struct page_vma_mapped_walk;
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> +		struct page *page);
> +
> +extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
> +		struct page *new);
> +
> +extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> +	swp_entry_t arch_entry;
> +
> +	arch_entry = __pmd_to_swp_entry(pmd);
> +	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> +	swp_entry_t arch_entry;
> +
> +	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
> +	return __swp_entry_to_pmd(arch_entry);
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> +	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> +}
> +#else
> +static inline void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> +		struct page *page)
> +{
> +	BUILD_BUG();
> +}
> +
> +static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
> +		struct page *new)
> +{
> +	BUILD_BUG();
> +	return 0;
> +}
> +
> +static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> +	BUILD_BUG();
> +	return swp_entry(0, 0);
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> +	BUILD_BUG();
> +	return (pmd_t){ 0 };
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #ifdef CONFIG_MEMORY_FAILURE
>  
>  extern atomic_long_t num_poisoned_pages __read_mostly;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e32ccbd8ee3a..a9c2a0ef5b9b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1588,6 +1588,26 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
>  	atomic_long_dec(&mm->nr_ptes);
>  }
>  
> +static inline void remove_trans_huge_pgtable(struct page *page,
> +		struct mmu_gather *tlb, pmd_t *pmd)
> +{
> +	if (PageAnon(page)) {
> +		pgtable_t pgtable;
> +
> +		pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> +							  pmd);
> +		pte_free(tlb->mm, pgtable);
> +		atomic_long_dec(&tlb->mm->nr_ptes);
> +		add_mm_counter(tlb->mm, MM_ANONPAGES,
> +				   -HPAGE_PMD_NR);
> +	} else {
> +		if (arch_needs_pgtable_deposit())
> +			zap_deposited_table(tlb->mm, pmd);
> +		add_mm_counter(tlb->mm, MM_FILEPAGES,
> +				   -HPAGE_PMD_NR);
> +	}
> +}
> +
>  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		 pmd_t *pmd, unsigned long addr)
>  {
> @@ -1618,23 +1638,27 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
>  	} else {
> -		struct page *page = pmd_page(orig_pmd);
> -		page_remove_rmap(page, true);
> -		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> -		VM_BUG_ON_PAGE(!PageHead(page), page);
> -		if (PageAnon(page)) {
> -			pgtable_t pgtable;
> -			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> -			pte_free(tlb->mm, pgtable);
> -			atomic_long_dec(&tlb->mm->nr_ptes);
> -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +		struct page *page;
> +		int migration = 0;
> +
> +		if (!is_pmd_migration_entry(orig_pmd)) {
> +			page = pmd_page(orig_pmd);
> +			page_remove_rmap(page, true);
> +			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> +			VM_BUG_ON_PAGE(!PageHead(page), page);
> +			remove_trans_huge_pgtable(page, tlb, pmd);
>  		} else {
> -			if (arch_needs_pgtable_deposit())
> -				zap_deposited_table(tlb->mm, pmd);
> -			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> +			swp_entry_t entry;
> +
> +			entry = pmd_to_swp_entry(orig_pmd);
> +			page = pfn_to_page(swp_offset(entry));
> +			remove_trans_huge_pgtable(page, tlb, pmd);
> +			free_swap_and_cache(entry); /* waring in failure? */
> +			migration = 1;
>  		}
>  		spin_unlock(ptl);
> -		tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> +		if (!migration)
> +			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
>  	}
>  	return 1;
>  }
> @@ -2652,3 +2676,98 @@ static int __init split_huge_pages_debugfs(void)
>  }
>  late_initcall(split_huge_pages_debugfs);
>  #endif
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> +		struct page *page)
> +{
> +	struct vm_area_struct *vma = pvmw->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long address = pvmw->address;
> +	pmd_t pmdval;
> +	swp_entry_t entry;
> +
> +	if (pvmw->pmd && !pvmw->pte) {
> +		pmd_t pmdswp;
> +
> +		mmu_notifier_invalidate_range_start(mm, address,
> +				address + HPAGE_PMD_SIZE);
> +
> +		flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> +		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
> +		if (pmd_dirty(pmdval))
> +			set_page_dirty(page);
> +		entry = make_migration_entry(page, pmd_write(pmdval));
> +		pmdswp = swp_entry_to_pmd(entry);
> +		set_pmd_at(mm, address, pvmw->pmd, pmdswp);
> +		page_remove_rmap(page, true);
> +		put_page(page);
> +
> +		mmu_notifier_invalidate_range_end(mm, address,
> +				address + HPAGE_PMD_SIZE);
> +	} else { /* pte-mapped thp */
> +		pte_t pteval;
> +		struct page *subpage = page - page_to_pfn(page) + pte_pfn(*pvmw->pte);
> +		pte_t swp_pte;
> +
> +		pteval = ptep_clear_flush(vma, address, pvmw->pte);
> +		if (pte_dirty(pteval))
> +			set_page_dirty(subpage);
> +		entry = make_migration_entry(subpage, pte_write(pteval));
> +		swp_pte = swp_entry_to_pte(entry);
> +		set_pte_at(mm, address, pvmw->pte, swp_pte);
> +		page_remove_rmap(subpage, false);
> +		put_page(subpage);
> +		mmu_notifier_invalidate_page(mm, address);
> +	}
> +}
> +
> +void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
> +{
> +	struct vm_area_struct *vma = pvmw->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long address = pvmw->address;
> +	swp_entry_t entry;
> +
> +	/* PMD-mapped THP  */
> +	if (pvmw->pmd && !pvmw->pte) {
> +		unsigned long mmun_start = address & HPAGE_PMD_MASK;
> +		unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> +		pmd_t pmde;
> +
> +		entry = pmd_to_swp_entry(*pvmw->pmd);
> +		get_page(new);
> +		pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
> +		if (is_write_migration_entry(entry))
> +			pmde = maybe_pmd_mkwrite(pmde, vma);
> +
> +		flush_cache_range(vma, mmun_start, mmun_end);
> +		page_add_anon_rmap(new, vma, mmun_start, true);
> +		pmdp_huge_clear_flush_notify(vma, mmun_start, pvmw->pmd);
> +		set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
> +		flush_tlb_range(vma, mmun_start, mmun_end);
> +		if (vma->vm_flags & VM_LOCKED)
> +			mlock_vma_page(new);
> +		update_mmu_cache_pmd(vma, address, pvmw->pmd);
> +
> +	} else { /* pte-mapped thp */
> +		pte_t pte;
> +		pte_t *ptep = pvmw->pte;
> +
> +		entry = pte_to_swp_entry(*pvmw->pte);
> +		get_page(new);
> +		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
> +		if (pte_swp_soft_dirty(*pvmw->pte))
> +			pte = pte_mksoft_dirty(pte);
> +		if (is_write_migration_entry(entry))
> +			pte = maybe_mkwrite(pte, vma);
> +		flush_dcache_page(new);
> +		set_pte_at(mm, address, ptep, pte);
> +		if (PageAnon(new))
> +			page_add_anon_rmap(new, vma, address, false);
> +		else
> +			page_add_file_rmap(new, false);
> +		update_mmu_cache(vma, address, ptep);
> +	}
> +}
> +#endif
> diff --git a/mm/migrate.c b/mm/migrate.c
> index cda4c2778d04..0bbad6dcf95a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -211,6 +211,12 @@ static int remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  		new = page - pvmw.page->index +
>  			linear_page_index(vma, pvmw.address);
>  
> +		/* PMD-mapped THP migration entry */
> +		if (!PageHuge(page) && PageTransCompound(page)) {
> +			remove_migration_pmd(&pvmw, new);
> +			continue;
> +		}
> +

Any reason not to share PTE handling of non-THP with THP?

>  		get_page(new);
>  		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
>  		if (pte_swp_soft_dirty(*pvmw.pte))
> @@ -324,6 +330,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
>  	__migration_entry_wait(mm, pte, ptl);
>  }
>  
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
> +{
> +	spinlock_t *ptl;
> +	struct page *page;
> +
> +	ptl = pmd_lock(mm, pmd);
> +	if (!is_pmd_migration_entry(*pmd))
> +		goto unlock;
> +	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
> +	if (!get_page_unless_zero(page))
> +		goto unlock;
> +	spin_unlock(ptl);
> +	wait_on_page_locked(page);
> +	put_page(page);
> +	return;
> +unlock:
> +	spin_unlock(ptl);
> +}
> +#endif
> +
>  #ifdef CONFIG_BLOCK
>  /* Returns true if all buffers are successfully locked */
>  static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> @@ -1082,7 +1109,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>  		goto out;
>  	}
>  
> -	if (unlikely(PageTransHuge(page))) {
> +	if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
>  		lock_page(page);
>  		rc = split_huge_page(page);
>  		unlock_page(page);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index a23001a22c15..0ed3aee62d50 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -137,16 +137,23 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	if (!pud_present(*pud))
>  		return false;
>  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> -	if (pmd_trans_huge(*pvmw->pmd)) {
> +	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
>  		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> -		if (!pmd_present(*pvmw->pmd))
> -			return not_found(pvmw);
>  		if (likely(pmd_trans_huge(*pvmw->pmd))) {
>  			if (pvmw->flags & PVMW_MIGRATION)
>  				return not_found(pvmw);
>  			if (pmd_page(*pvmw->pmd) != page)
>  				return not_found(pvmw);
>  			return true;
> +		} else if (!pmd_present(*pvmw->pmd)) {
> +			if (unlikely(is_migration_entry(pmd_to_swp_entry(*pvmw->pmd)))) {
> +				swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
> +
> +				if (migration_entry_to_page(entry) != page)
> +					return not_found(pvmw);
> +				return true;
> +			}
> +			return not_found(pvmw);
>  		} else {
>  			/* THP pmd was split under us: handle on pte level */
>  			spin_unlock(pvmw->ptl);
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4ed5908c65b0..9d550a8a0c71 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
>  {
>  	pmd_t pmd;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> +	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> +		  !pmd_devmap(*pmdp));

How does this work? A non-present entry has nothing in the TLB, so the _flush variant doesn't make sense for !present.

>  	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
>  	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>  	return pmd;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 555cc7ebacf6..2c65abbd7a0e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1298,6 +1298,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	int ret = SWAP_AGAIN;
>  	enum ttu_flags flags = (enum ttu_flags)arg;
>  
> +
>  	/* munlock has nothing to gain from examining un-locked vmas */
>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>  		return SWAP_AGAIN;
> @@ -1308,6 +1309,14 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	}
>  
>  	while (page_vma_mapped_walk(&pvmw)) {
> +		/* THP migration */
> +		if (flags & TTU_MIGRATION) {
> +			if (!PageHuge(page) && PageTransCompound(page)) {
> +				set_pmd_migration_entry(&pvmw, page);

Again, it would be nice to share the PTE handling. It should be rather similar,
no?

> +				continue;
> +			}
> +		}
> +
>  		/*
>  		 * If the page is mlock()d, we cannot swap it out.
>  		 * If it's recently referenced (perhaps page_referenced
> -- 
> 2.11.0
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 06/11] mm: thp: check pmd migration entry in common path
@ 2017-03-24 14:50     ` Kirill A. Shutemov
  0 siblings, 0 replies; 52+ messages in thread
From: Kirill A. Shutemov @ 2017-03-24 14:50 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kirill.shutemov, akpm, minchan, vbabka,
	mgorman, mhocko, n-horiguchi, khandual, zi.yan, dnellans

On Mon, Mar 13, 2017 at 11:45:02AM -0400, Zi Yan wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> If one of callers of page migration starts to handle thp,
> memory management code start to see pmd migration entry, so we need
> to prepare for it before enabling. This patch changes various code
> point which checks the status of given pmds in order to prevent race
> between thp migration and the pmd-related works.
> 
> ChangeLog v1 -> v2:
> - introduce pmd_related() (I know the naming is not good, but can't
>   think of a better name. Any suggestion is welcome.)
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> ChangeLog v2 -> v3:
> - add is_swap_pmd()
> - a pmd entry should be pmd pointing to pte pages, is_swap_pmd(),
>   pmd_trans_huge(), pmd_devmap(), or pmd_none()
> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear()
> - flush_cache_range() while set_pmd_migration_entry()
> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
>   true on pmd_migration_entry, so that migration entries are not
>   treated as pmd page table entries.
> 
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> ---
>  arch/x86/mm/gup.c             |  4 +--
>  fs/proc/task_mmu.c            | 22 +++++++++------
>  include/asm-generic/pgtable.h |  3 +-
>  include/linux/huge_mm.h       | 14 +++++++--
>  mm/gup.c                      | 22 +++++++++++++--
>  mm/huge_memory.c              | 66 ++++++++++++++++++++++++++++++++++++++-----
>  mm/madvise.c                  |  2 ++
>  mm/memcontrol.c               |  2 ++
>  mm/memory.c                   |  9 ++++--
>  mm/mprotect.c                 |  6 ++--
>  mm/mremap.c                   |  2 +-
>  11 files changed, 124 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 1f3b6ef105cd..23bb071f286d 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -243,9 +243,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  		pmd_t pmd = *pmdp;
>  
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_none(pmd))
> +		if (!pmd_present(pmd))
>  			return 0;
> -		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
> +		if (unlikely(pmd_large(pmd))) {
>  			/*
>  			 * NUMA hinting faults need to be handled in the GUP
>  			 * slowpath for accounting purposes and so that they
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 5c8359704601..f2b0f3ba25ac 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -600,7 +600,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  
>  	ptl = pmd_trans_huge_lock(pmd, vma);
>  	if (ptl) {
> -		smaps_pmd_entry(pmd, addr, walk);
> +		if (pmd_present(*pmd))
> +			smaps_pmd_entry(pmd, addr, walk);
>  		spin_unlock(ptl);
>  		return 0;
>  	}
> @@ -942,6 +943,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  			goto out;
>  		}
>  
> +		if (!pmd_present(*pmd))
> +			goto out;
> +
>  		page = pmd_page(*pmd);
>  
>  		/* Clear accessed and referenced bits. */
> @@ -1221,19 +1225,19 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>  	if (ptl) {
>  		u64 flags = 0, frame = 0;
>  		pmd_t pmd = *pmdp;
> +		struct page *page;
>  
>  		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
>  			flags |= PM_SOFT_DIRTY;
>  
> -		/*
> -		 * Currently pmd for thp is always present because thp
> -		 * can not be swapped-out, migrated, or HWPOISONed
> -		 * (split in such cases instead.)
> -		 * This if-check is just to prepare for future implementation.
> -		 */
> -		if (pmd_present(pmd)) {
> -			struct page *page = pmd_page(pmd);
> +		if (is_pmd_migration_entry(pmd)) {
> +			swp_entry_t entry = pmd_to_swp_entry(pmd);
>  
> +			frame = swp_type(entry) |
> +				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> +			page = migration_entry_to_page(entry);
> +		} else if (pmd_present(pmd)) {
> +			page = pmd_page(pmd);
>  			if (page_mapcount(page) == 1)
>  				flags |= PM_MMAP_EXCLUSIVE;
>  
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index f4ca23b158b3..f98a028100b6 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -790,7 +790,8 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	barrier();
>  #endif
> -	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
> +	if (pmd_none(pmdval) || pmd_trans_huge(pmdval)
> +			|| !pmd_present(pmdval))
>  		return 1;

pmd_none() check is redundant now.
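
I.e. just this should do (a sketch of the simplification; pmd_none()
implies !pmd_present()):

	if (!pmd_present(pmdval) || pmd_trans_huge(pmdval))
		return 1;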

>  	if (unlikely(pmd_bad(pmdval))) {
>  		pmd_clear_bad(pmd);
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 1b81cb57ff0f..6f44a2352597 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -126,7 +126,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  #define split_huge_pmd(__vma, __pmd, __address)				\
>  	do {								\
>  		pmd_t *____pmd = (__pmd);				\
> -		if (pmd_trans_huge(*____pmd)				\
> +		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
>  					|| pmd_devmap(*____pmd))	\
>  			__split_huge_pmd(__vma, __pmd, __address,	\
>  						false, NULL);		\
> @@ -157,12 +157,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
>  		struct vm_area_struct *vma);
>  extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
>  		struct vm_area_struct *vma);
> +
> +static inline int is_swap_pmd(pmd_t pmd)
> +{
> +	return !pmd_none(pmd) && !pmd_present(pmd);
> +}
> +
>  /* mmap_sem must be held on entry */
>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>  		struct vm_area_struct *vma)
>  {
>  	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
> -	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> +	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>  		return __pmd_trans_huge_lock(pmd, vma);
>  	else
>  		return NULL;
> @@ -269,6 +275,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>  					 long adjust_next)
>  {
>  }
> +static inline int is_swap_pmd(pmd_t pmd)
> +{
> +	return 0;
> +}
>  static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>  		struct vm_area_struct *vma)
>  {
> diff --git a/mm/gup.c b/mm/gup.c
> index 94fab8fa432b..2b1effb16242 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -272,6 +272,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  			return page;
>  		return no_page_table(vma, flags);
>  	}
> +	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> +		return no_page_table(vma, flags);
> +	if (!pmd_present(*pmd)) {
> +retry:
> +		if (likely(!(flags & FOLL_MIGRATION)))
> +			return no_page_table(vma, flags);
> +		pmd_migration_entry_wait(mm, pmd);
> +		goto retry;

This looks a lot like an endless loop if flags contain FOLL_MIGRATION. Hm?

I guess retry label should be on previous line.

> +	}
>  	if (pmd_devmap(*pmd)) {
>  		ptl = pmd_lock(mm, pmd);
>  		page = follow_devmap_pmd(vma, address, pmd, flags);
> @@ -286,6 +295,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  		return no_page_table(vma, flags);
>  
>  	ptl = pmd_lock(mm, pmd);
> +	if (unlikely(!pmd_present(*pmd))) {
> +retry_locked:
> +		if (likely(!(flags & FOLL_MIGRATION))) {
> +			spin_unlock(ptl);
> +			return no_page_table(vma, flags);
> +		}
> +		pmd_migration_entry_wait(mm, pmd);
> +		goto retry_locked;

Again. That doesn't look right.

> +	}
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>  		spin_unlock(ptl);
>  		return follow_page_pte(vma, address, pmd, flags);
> @@ -341,7 +359,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
>  	pud = pud_offset(pgd, address);
>  	BUG_ON(pud_none(*pud));
>  	pmd = pmd_offset(pud, address);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		return -EFAULT;
>  	VM_BUG_ON(pmd_trans_huge(*pmd));
>  	pte = pte_offset_map(pmd, address);
> @@ -1369,7 +1387,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  		pmd_t pmd = READ_ONCE(*pmdp);
>  
>  		next = pmd_addr_end(addr, end);
> -		if (pmd_none(pmd))
> +		if (!pmd_present(pmd))
>  			return 0;
>  
>  		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a9c2a0ef5b9b..3f18452f3eb1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -898,6 +898,21 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  
>  	ret = -EAGAIN;
>  	pmd = *src_pmd;
> +
> +	if (unlikely(is_pmd_migration_entry(pmd))) {

Shouldn't you first check that the pmd is not present?

> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +		if (is_write_migration_entry(entry)) {
> +			make_migration_entry_read(&entry);
> +			pmd = swp_entry_to_pmd(entry);
> +			set_pmd_at(src_mm, addr, src_pmd, pmd);
> +		}
> +		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> +		ret = 0;
> +		goto out_unlock;
> +	}
> +	WARN_ONCE(!pmd_present(pmd), "Uknown non-present format on pmd.\n");

Typo.

> +
>  	if (unlikely(!pmd_trans_huge(pmd))) {
>  		pte_free(dst_mm, pgtable);
>  		goto out_unlock;
> @@ -1204,6 +1219,9 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
>  		goto out_unlock;j
>  
> +	if (unlikely(!pmd_present(orig_pmd)))
> +		goto out_unlock;
> +
>  	page = pmd_page(orig_pmd);
>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>  	/*
> @@ -1338,7 +1356,15 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>  		goto out;
>  
> -	page = pmd_page(*pmd);
> +	if (is_pmd_migration_entry(*pmd)) {

Again, I don't think it's safe to check if the pmd is a migration entry
before checking if it's present.

> +		swp_entry_t entry;
> +
> +		entry = pmd_to_swp_entry(*pmd);
> +		page = pfn_to_page(swp_offset(entry));
> +		if (!is_migration_entry(entry))
> +			goto out;

I don't understand how it is supposed to work.
You take swp_offset() of the entry before checking if it's a migration entry.
What's going on?

> +	} else
> +		page = pmd_page(*pmd);
>  	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>  	if (flags & FOLL_TOUCH)
>  		touch_pmd(vma, addr, pmd);
> @@ -1534,6 +1560,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	if (is_huge_zero_pmd(orig_pmd))
>  		goto out;
>  
> +	if (unlikely(!pmd_present(orig_pmd)))
> +		goto out;
> +
>  	page = pmd_page(orig_pmd);
>  	/*
>  	 * If other processes are mapping this page, we couldn't discard
> @@ -1766,6 +1795,20 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	if (prot_numa && pmd_protnone(*pmd))
>  		goto unlock;
>  
> +	if (is_pmd_migration_entry(*pmd)) {
> +		swp_entry_t entry = pmd_to_swp_entry(*pmd);
> +
> +		if (is_write_migration_entry(entry)) {
> +			pmd_t newpmd;
> +
> +			make_migration_entry_read(&entry);
> +			newpmd = swp_entry_to_pmd(entry);
> +			set_pmd_at(mm, addr, pmd, newpmd);
> +		}
> +		goto unlock;
> +	} else if (!pmd_present(*pmd))
> +		WARN_ONCE(1, "Uknown non-present format on pmd.\n");

Another typo.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 05/11] mm: thp: enable thp migration in generic path
  2017-03-24 14:28     ` Kirill A. Shutemov
@ 2017-03-24 15:30       ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-24 15:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Zi Yan, linux-kernel, linux-mm, kirill.shutemov, akpm, minchan,
	vbabka, mgorman, mhocko, n-horiguchi, khandual, dnellans


Hi Kirill,

Kirill A. Shutemov wrote:
> On Mon, Mar 13, 2017 at 11:45:01AM -0400, Zi Yan wrote:
>> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> This patch adds thp migration's core code, including conversions
>> between a PMD entry and a swap entry, setting PMD migration entry,
>> removing PMD migration entry, and waiting on PMD migration entries.
>>
>> This patch makes it possible to support thp migration.
>> If you fail to allocate a destination page as a thp, you just split
>> the source thp as we do now, and then enter the normal page migration.
>> If you succeed to allocate destination thp, you enter thp migration.
>> Subsequent patches actually enable thp migration for each caller of
>> page migration by allowing its get_new_page() callback to
>> allocate thps.
>>
>> ChangeLog v1 -> v2:
>> - support pte-mapped thp, doubly-mapped thp
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> ChangeLog v2 -> v3:
>> - use page_vma_mapped_walk()
>>
>> ChangeLog v3 -> v4:
>> - factor out the code of removing pte pgtable page in zap_huge_pmd()
>>
>> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> 
> See few questions below.
> 
> It would be nice to split it into a few patches. Probably three or four.

This patch was two separate ones in v2:
1. introduce remove_pmd_migration_entry(), set_migration_pmd() and other
auxiliary functions,
2. enable THP migration in the migration path.

But the first of these two patches would be dead code on its own, since nothing
else uses it. Michal also suggested merging the two patches into one when he
reviewed v2.

If you have any suggestions, I am OK with splitting this patch to make it
smaller.

<snip>

>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index cda4c2778d04..0bbad6dcf95a 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -211,6 +211,12 @@ static int remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>>  		new = page - pvmw.page->index +
>>  			linear_page_index(vma, pvmw.address);
>>  
>> +		/* PMD-mapped THP migration entry */
>> +		if (!PageHuge(page) && PageTransCompound(page)) {
>> +			remove_migration_pmd(&pvmw, new);
>> +			continue;
>> +		}
>> +
> 
> Any reason not to share PTE handling of non-THP with THP?

You mean PTE-mapped THPs? I was mostly reusing Naoya's patches. But at
first look, it seems the PTE-mapped THP handling code is the same as the
existing PTE handling code.

This part of code can be changed to:

+		/* PMD-mapped THP migration entry */
+		if (!pvmw.pte && pvmw.page) {
+                       VM_BUG_ON_PAGE(!PageTransCompound(page), page);
+			remove_migration_pmd(&pvmw, new);
+			continue;
+		}
+

> 
>>  		get_page(new);
>>  		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
>>  		if (pte_swp_soft_dirty(*pvmw.pte))

<snip>

>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 4ed5908c65b0..9d550a8a0c71 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
>>  {
>>  	pmd_t pmd;
>>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
>> +	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
>> +		  !pmd_devmap(*pmdp));
> 
> How does this work? _flush doesn't make sense for !present.

Right. It should be:

-	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+	VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+		  !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
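
(equivalently, and maybe a bit easier to read, the same assertion could be
written as, untested:

	VM_BUG_ON(!pmd_present(*pmdp) ||
		  (!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp)));

just a readability suggestion.)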


> 
>>  	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
>>  	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>  	return pmd;
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 555cc7ebacf6..2c65abbd7a0e 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1298,6 +1298,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  	int ret = SWAP_AGAIN;
>>  	enum ttu_flags flags = (enum ttu_flags)arg;
>>  
>> +
>>  	/* munlock has nothing to gain from examining un-locked vmas */
>>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>>  		return SWAP_AGAIN;
>> @@ -1308,6 +1309,14 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  	}
>>  
>>  	while (page_vma_mapped_walk(&pvmw)) {
>> +		/* THP migration */
>> +		if (flags & TTU_MIGRATION) {
>> +			if (!PageHuge(page) && PageTransCompound(page)) {
>> +				set_pmd_migration_entry(&pvmw, page);
> 
> Again, it would be nice to share PTE handling. It should be rather similar,
> no?

At first look, it should work. I will change it. If it works, it will be
included in the next version.

This can also shrink the patch size.

Thanks.


-- 
Best Regards,
Yan Zi



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 06/11] mm: thp: check pmd migration entry in common path
  2017-03-24 14:50     ` Kirill A. Shutemov
@ 2017-03-24 16:09       ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-24 16:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Zi Yan, linux-kernel, linux-mm, kirill.shutemov, akpm, minchan,
	vbabka, mgorman, mhocko, n-horiguchi, khandual, dnellans




Kirill A. Shutemov wrote:
> On Mon, Mar 13, 2017 at 11:45:02AM -0400, Zi Yan wrote:
>> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> If one of callers of page migration starts to handle thp,
>> memory management code start to see pmd migration entry, so we need
>> to prepare for it before enabling. This patch changes various code
>> point which checks the status of given pmds in order to prevent race
>> between thp migration and the pmd-related works.
>>
>> ChangeLog v1 -> v2:
>> - introduce pmd_related() (I know the naming is not good, but can't
>>   think up no better name. Any suggesntion is welcomed.)
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> ChangeLog v2 -> v3:
>> - add is_swap_pmd()
>> - a pmd entry should be pmd pointing to pte pages, is_swap_pmd(),
>>   pmd_trans_huge(), pmd_devmap(), or pmd_none()
>> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear()
>> - flush_cache_range() while set_pmd_migration_entry()
>> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
>>   true on pmd_migration_entry, so that migration entries are not
>>   treated as pmd page table entries.
>>
>> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
>> ---
>>  arch/x86/mm/gup.c             |  4 +--
>>  fs/proc/task_mmu.c            | 22 +++++++++------
>>  include/asm-generic/pgtable.h |  3 +-
>>  include/linux/huge_mm.h       | 14 +++++++--
>>  mm/gup.c                      | 22 +++++++++++++--
>>  mm/huge_memory.c              | 66 ++++++++++++++++++++++++++++++++++++++-----
>>  mm/madvise.c                  |  2 ++
>>  mm/memcontrol.c               |  2 ++
>>  mm/memory.c                   |  9 ++++--
>>  mm/mprotect.c                 |  6 ++--
>>  mm/mremap.c                   |  2 +-
>>  11 files changed, 124 insertions(+), 28 deletions(-)
>>
<snip>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 94fab8fa432b..2b1effb16242 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -272,6 +272,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>>  			return page;
>>  		return no_page_table(vma, flags);
>>  	}
>> +	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>> +		return no_page_table(vma, flags);
>> +	if (!pmd_present(*pmd)) {
>> +retry:
>> +		if (likely(!(flags & FOLL_MIGRATION)))
>> +			return no_page_table(vma, flags);
>> +		pmd_migration_entry_wait(mm, pmd);
>> +		goto retry;
> 
> This looks a lot like an endless loop if flags contain FOLL_MIGRATION. Hm?
> 
> I guess the retry label should be on the previous line.

You are right. It should be:

+	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
+		return no_page_table(vma, flags);
+retry:
+	if (!pmd_present(*pmd)) {
+		if (likely(!(flags & FOLL_MIGRATION)))
+			return no_page_table(vma, flags);
+		pmd_migration_entry_wait(mm, pmd);
+		goto retry;

> 
>> +	}
>>  	if (pmd_devmap(*pmd)) {
>>  		ptl = pmd_lock(mm, pmd);
>>  		page = follow_devmap_pmd(vma, address, pmd, flags);
>> @@ -286,6 +295,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>>  		return no_page_table(vma, flags);
>>  
>>  	ptl = pmd_lock(mm, pmd);
>> +	if (unlikely(!pmd_present(*pmd))) {
>> +retry_locked:
>> +		if (likely(!(flags & FOLL_MIGRATION))) {
>> +			spin_unlock(ptl);
>> +			return no_page_table(vma, flags);
>> +		}
>> +		pmd_migration_entry_wait(mm, pmd);
>> +		goto retry_locked;
> 
> Again. That doesn't look right.

It will be changed:

 	ptl = pmd_lock(mm, pmd);
+retry_locked:
+	if (unlikely(!pmd_present(*pmd))) {
+		if (likely(!(flags & FOLL_MIGRATION))) {
+			spin_unlock(ptl);
+			return no_page_table(vma, flags);
+		}
+		pmd_migration_entry_wait(mm, pmd);
+		goto retry_locked;

> 
>> +	}
>>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>>  		spin_unlock(ptl);
>>  		return follow_page_pte(vma, address, pmd, flags);
>> @@ -341,7 +359,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
>>  	pud = pud_offset(pgd, address);
>>  	BUG_ON(pud_none(*pud));
>>  	pmd = pmd_offset(pud, address);
>> -	if (pmd_none(*pmd))
>> +	if (!pmd_present(*pmd))
>>  		return -EFAULT;
>>  	VM_BUG_ON(pmd_trans_huge(*pmd));
>>  	pte = pte_offset_map(pmd, address);
>> @@ -1369,7 +1387,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>>  		pmd_t pmd = READ_ONCE(*pmdp);
>>  
>>  		next = pmd_addr_end(addr, end);
>> -		if (pmd_none(pmd))
>> +		if (!pmd_present(pmd))
>>  			return 0;
>>  
>>  		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index a9c2a0ef5b9b..3f18452f3eb1 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -898,6 +898,21 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>  
>>  	ret = -EAGAIN;
>>  	pmd = *src_pmd;
>> +
>> +	if (unlikely(is_pmd_migration_entry(pmd))) {
> 
> Shouldn't you first check that the pmd is not present?

is_pmd_migration_entry() checks !pmd_present().

In linux/swapops.h, is_pmd_migration_entry() is defined as:

static inline int is_pmd_migration_entry(pmd_t pmd)
{
    return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
}


> 
>> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
>> +
>> +		if (is_write_migration_entry(entry)) {
>> +			make_migration_entry_read(&entry);
>> +			pmd = swp_entry_to_pmd(entry);
>> +			set_pmd_at(src_mm, addr, src_pmd, pmd);
>> +		}
>> +		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>> +		ret = 0;
>> +		goto out_unlock;
>> +	}
>> +	WARN_ONCE(!pmd_present(pmd), "Uknown non-present format on pmd.\n");
> 
> Typo.

Got it.

> 
>> +
>>  	if (unlikely(!pmd_trans_huge(pmd))) {
>>  		pte_free(dst_mm, pgtable);
>>  		goto out_unlock;
>> @@ -1204,6 +1219,9 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>>  	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
>>  		goto out_unlock;j
>>  
>> +	if (unlikely(!pmd_present(orig_pmd)))
>> +		goto out_unlock;
>> +
>>  	page = pmd_page(orig_pmd);
>>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>>  	/*
>> @@ -1338,7 +1356,15 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>>  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>>  		goto out;
>>  
>> -	page = pmd_page(*pmd);
>> +	if (is_pmd_migration_entry(*pmd)) {
> 
> Again, I don't think it's safe to check if the pmd is a migration entry
> before checking if it's present.
> 
>> +		swp_entry_t entry;
>> +
>> +		entry = pmd_to_swp_entry(*pmd);
>> +		page = pfn_to_page(swp_offset(entry));
>> +		if (!is_migration_entry(entry))
>> +			goto out;
> 
> I don't understand how it is supposed to work.
> You take swp_offset() of the entry before checking if it's a migration entry.
> What's going on?

This chunk of change inside follow_trans_huge_pmd() is not needed, because
its two callers, smaps_pmd_entry() and follow_page_mask(), guarantee that
the pmd points to a present entry.
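
(For example, in the follow_page_mask() path quoted above, the new
!pmd_present() check under the pmd lock already filters out non-present pmds
before follow_trans_huge_pmd() is reached, roughly:

	ptl = pmd_lock(mm, pmd);
	if (unlikely(!pmd_present(*pmd))) {
		/* wait for the migration entry or bail out */
	}
	if (unlikely(!pmd_trans_huge(*pmd))) {
		spin_unlock(ptl);
		return follow_page_pte(vma, address, pmd, flags);
	}
	page = follow_trans_huge_pmd(vma, address, pmd, flags);
)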

I will drop this chunk in the next version.

> 
>> +	} else
>> +		page = pmd_page(*pmd);
>>  	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>  	if (flags & FOLL_TOUCH)
>>  		touch_pmd(vma, addr, pmd);
>> @@ -1534,6 +1560,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>  	if (is_huge_zero_pmd(orig_pmd))
>>  		goto out;
>>  
>> +	if (unlikely(!pmd_present(orig_pmd)))
>> +		goto out;
>> +
>>  	page = pmd_page(orig_pmd);
>>  	/*
>>  	 * If other processes are mapping this page, we couldn't discard
>> @@ -1766,6 +1795,20 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>  	if (prot_numa && pmd_protnone(*pmd))
>>  		goto unlock;
>>  
>> +	if (is_pmd_migration_entry(*pmd)) {
>> +		swp_entry_t entry = pmd_to_swp_entry(*pmd);
>> +
>> +		if (is_write_migration_entry(entry)) {
>> +			pmd_t newpmd;
>> +
>> +			make_migration_entry_read(&entry);
>> +			newpmd = swp_entry_to_pmd(entry);
>> +			set_pmd_at(mm, addr, pmd, newpmd);
>> +		}
>> +		goto unlock;
>> +	} else if (!pmd_present(*pmd))
>> +		WARN_ONCE(1, "Uknown non-present format on pmd.\n");
> 
> Another typo.

Got it.

Thanks for all your comments.

-- 
Best Regards,
Yan Zi



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 06/11] mm: thp: check pmd migration entry in common path
  2017-03-24 16:09       ` Zi Yan
@ 2017-03-24 16:50         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 52+ messages in thread
From: Kirill A. Shutemov @ 2017-03-24 16:50 UTC (permalink / raw)
  To: Zi Yan
  Cc: Zi Yan, linux-kernel, linux-mm, kirill.shutemov, akpm, minchan,
	vbabka, mgorman, mhocko, n-horiguchi, khandual, dnellans

On Fri, Mar 24, 2017 at 11:09:25AM -0500, Zi Yan wrote:
> Kirill A. Shutemov wrote:
> > On Mon, Mar 13, 2017 at 11:45:02AM -0400, Zi Yan wrote:
> > Again. That doesn't look right.
> 
> It will be changed:
> 
>  	ptl = pmd_lock(mm, pmd);
> +retry_locked:
> +	if (unlikely(!pmd_present(*pmd))) {
> +		if (likely(!(flags & FOLL_MIGRATION))) {
> +			spin_unlock(ptl);
> +			return no_page_table(vma, flags);
> +		}
> +		pmd_migration_entry_wait(mm, pmd);
> +		goto retry_locked;

Nope. pmd_migration_entry_wait() unlocks the ptl.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 06/11] mm: thp: check pmd migration entry in common path
  2017-03-24 16:50         ` Kirill A. Shutemov
@ 2017-03-24 17:09           ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-24 17:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Zi Yan, linux-kernel, linux-mm, kirill.shutemov, akpm, minchan,
	vbabka, mgorman, mhocko, n-horiguchi, khandual, dnellans



Kirill A. Shutemov wrote:
> On Fri, Mar 24, 2017 at 11:09:25AM -0500, Zi Yan wrote:
>> Kirill A. Shutemov wrote:
>>> On Mon, Mar 13, 2017 at 11:45:02AM -0400, Zi Yan wrote:
>>> Again. That doesn't look right.
>> It will be changed:
>>
>>  	ptl = pmd_lock(mm, pmd);
>> +retry_locked:
>> +	if (unlikely(!pmd_present(*pmd))) {
>> +		if (likely(!(flags & FOLL_MIGRATION))) {
>> +			spin_unlock(ptl);
>> +			return no_page_table(vma, flags);
>> +		}
>> +		pmd_migration_entry_wait(mm, pmd);
>> +		goto retry_locked;
> 
> Nope. pmd_migration_entry_wait() unlocks the ptl.

Right. This chunk is wrong. pmd_migration_entry_wait() actually takes the
pmd lock itself, then unlocks it and waits on the page if appropriate.

A simple fix could be:

+retry_locked:
 	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_present(*pmd))) {
+	        spin_unlock(ptl);
+		if (likely(!(flags & FOLL_MIGRATION)))
+			return no_page_table(vma, flags);
+		pmd_migration_entry_wait(mm, pmd);
+		goto retry_locked;
+       }

Or would it be better to change pmd_migration_entry_wait() to
void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
spinlock_t *ptl), so that if ptl is NULL it takes and releases the pmd lock
itself, and if ptl is given it only unlocks it? This would avoid the
redundant unlock and relock in the code above when
pmd_migration_entry_wait() is called.
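
With that variant, the caller in follow_page_mask() could look roughly like
this (untested sketch, assuming the helper drops the given ptl before waiting):

retry_locked:
	ptl = pmd_lock(mm, pmd);
	if (unlikely(!pmd_present(*pmd))) {
		if (likely(!(flags & FOLL_MIGRATION))) {
			spin_unlock(ptl);
			return no_page_table(vma, flags);
		}
		/* ptl is already held; the helper unlocks it while waiting */
		pmd_migration_entry_wait(mm, pmd, ptl);
		goto retry_locked;
	}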

Thanks.

--
Best Regards,
Yan Zi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/11] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
  2017-03-13 15:44   ` Zi Yan
@ 2017-03-24 18:23     ` Tim Chen
  -1 siblings, 0 replies; 52+ messages in thread
From: Tim Chen @ 2017-03-24 18:23 UTC (permalink / raw)
  To: Zi Yan, linux-kernel, linux-mm
  Cc: kirill.shutemov, akpm, minchan, vbabka, mgorman, mhocko,
	n-horiguchi, khandual, zi.yan, dnellans

On Mon, 2017-03-13 at 11:44 -0400, Zi Yan wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid
> false negative return when it races with thp spilt
> (during which _PAGE_PRESENT is temporary cleared.) I don't think that
> dropping _PAGE_PSE check in pmd_present() works well because it can
> hurt optimization of tlb handling in thp split.
> In the current kernel, bits 1-4 are not used in non-present format
> since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
> work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
> Bit 7 is used as reserved (always clear), so please don't use it for
> other purpose.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
> ---
>  arch/x86/include/asm/pgtable_64.h    | 12 +++++++++---
>  arch/x86/include/asm/pgtable_types.h | 10 +++++-----
>  2 files changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 73c7ccc38912..a5c4fc62e078 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -157,15 +157,21 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>  /*
>   * Encode and de-code a swap entry
>   *
> - * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0| <- bit number
> - * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
> - * | OFFSET (14->63) | TYPE (9-13)  |0|X|X|X| X| X|X|X|0| <- swp entry
> + * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
> + * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
> + * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
>   *
>   * G (8) is aliased and used as a PROT_NONE indicator for
>   * !present ptes.  We need to start storing swap entries above
>   * there.  We also need to avoid using A and D because of an
>   * erratum where they can be incorrectly set by hardware on
>   * non-present PTEs.
> + *
> + * SD (1) in swp entry is used to store soft dirty bit, which helps us
> + * remember soft dirty over page migration
> + *
> + * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
> + * but also G.

but also L and G.
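
For reference, x86 pmd_present() at this point is roughly (from memory):

	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);

i.e. P (bit 0), PSE/L (bit 7), and _PAGE_PROTNONE aliasing G (bit 8), so bit 7
must stay clear in the swap entry format.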

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/11] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
  2017-03-24 18:23     ` Tim Chen
@ 2017-03-24 18:30       ` Zi Yan
  -1 siblings, 0 replies; 52+ messages in thread
From: Zi Yan @ 2017-03-24 18:30 UTC (permalink / raw)
  To: Tim Chen
  Cc: Zi Yan, linux-kernel, linux-mm, kirill.shutemov, akpm, minchan,
	vbabka, mgorman, mhocko, n-horiguchi, khandual, dnellans




Tim Chen wrote:
> On Mon, 2017-03-13 at 11:44 -0400, Zi Yan wrote:
>> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>
>> pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid
>> false negative return when it races with thp spilt
>> (during which _PAGE_PRESENT is temporary cleared.) I don't think that
>> dropping _PAGE_PSE check in pmd_present() works well because it can
>> hurt optimization of tlb handling in thp split.
>> In the current kernel, bits 1-4 are not used in non-present format
>> since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
>> work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
>> Bit 7 is used as reserved (always clear), so please don't use it for
>> other purpose.
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
>> ---
>>  arch/x86/include/asm/pgtable_64.h    | 12 +++++++++---
>>  arch/x86/include/asm/pgtable_types.h | 10 +++++-----
>>  2 files changed, 14 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
>> index 73c7ccc38912..a5c4fc62e078 100644
>> --- a/arch/x86/include/asm/pgtable_64.h
>> +++ b/arch/x86/include/asm/pgtable_64.h
>> @@ -157,15 +157,21 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>>  /*
>>   * Encode and de-code a swap entry
>>   *
>> - * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0| <- bit number
>> - * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
>> - * | OFFSET (14->63) | TYPE (9-13)  |0|X|X|X| X| X|X|X|0| <- swp entry
>> + * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
>> + * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
>> + * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
>>   *
>>   * G (8) is aliased and used as a PROT_NONE indicator for
>>   * !present ptes.  We need to start storing swap entries above
>>   * there.  We also need to avoid using A and D because of an
>>   * erratum where they can be incorrectly set by hardware on
>>   * non-present PTEs.
>> + *
>> + * SD (1) in swp entry is used to store soft dirty bit, which helps us
>> + * remember soft dirty over page migration
>> + *
>> + * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
>> + * but also G.
> 
> but also L and G.

Got it. Thanks.

-- 
Best Regards,
Yan Zi



^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2017-03-24 18:30 UTC | newest]

Thread overview: 52+ messages
2017-03-13 15:44 [PATCH v4 00/11] mm: page migration enhancement for thp Zi Yan
2017-03-13 15:44 ` Zi Yan
2017-03-13 15:44 ` [PATCH v4 01/11] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 Zi Yan
2017-03-13 15:44   ` Zi Yan
2017-03-24 18:23   ` Tim Chen
2017-03-24 18:23     ` Tim Chen
2017-03-24 18:30     ` Zi Yan
2017-03-24 18:30       ` Zi Yan
2017-03-13 15:44 ` [PATCH v4 02/11] mm: mempolicy: add queue_pages_node_check() Zi Yan
2017-03-13 15:44   ` Zi Yan
2017-03-13 15:44 ` [PATCH v4 03/11] mm: thp: introduce separate TTU flag for thp freezing Zi Yan
2017-03-13 15:44   ` Zi Yan
2017-03-13 15:45 ` [PATCH v4 04/11] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Zi Yan
2017-03-13 15:45   ` Zi Yan
2017-03-24 14:10   ` Kirill A. Shutemov
2017-03-24 14:10     ` Kirill A. Shutemov
2017-03-24 14:21     ` Zi Yan
2017-03-24 14:21       ` Zi Yan
2017-03-13 15:45 ` [PATCH v4 05/11] mm: thp: enable thp migration in generic path Zi Yan
2017-03-13 15:45   ` Zi Yan
2017-03-14 21:19   ` kbuild test robot
2017-03-14 21:55     ` Zi Yan
2017-03-14 21:55       ` Zi Yan
2017-03-15  9:01       ` Geert Uytterhoeven
2017-03-15  9:01         ` Geert Uytterhoeven
2017-03-15 16:00         ` Zi Yan
2017-03-15 16:00           ` Zi Yan
2017-03-14 21:26   ` kbuild test robot
2017-03-24 14:28   ` Kirill A. Shutemov
2017-03-24 14:28     ` Kirill A. Shutemov
2017-03-24 15:30     ` Zi Yan
2017-03-24 15:30       ` Zi Yan
2017-03-13 15:45 ` [PATCH v4 06/11] mm: thp: check pmd migration entry in common path Zi Yan
2017-03-13 15:45   ` Zi Yan
2017-03-24 14:50   ` Kirill A. Shutemov
2017-03-24 14:50     ` Kirill A. Shutemov
2017-03-24 16:09     ` Zi Yan
2017-03-24 16:09       ` Zi Yan
2017-03-24 16:50       ` Kirill A. Shutemov
2017-03-24 16:50         ` Kirill A. Shutemov
2017-03-24 17:09         ` Zi Yan
2017-03-24 17:09           ` Zi Yan
2017-03-13 15:45 ` [PATCH v4 07/11] mm: soft-dirty: keep soft-dirty bits over thp migration Zi Yan
2017-03-13 15:45   ` Zi Yan
2017-03-13 15:45 ` [PATCH v4 08/11] mm: hwpoison: soft offline supports " Zi Yan
2017-03-13 15:45   ` Zi Yan
2017-03-13 15:45 ` [PATCH v4 09/11] mm: mempolicy: mbind and migrate_pages support " Zi Yan
2017-03-13 15:45   ` Zi Yan
2017-03-13 15:45 ` [PATCH v4 10/11] mm: migrate: move_pages() supports " Zi Yan
2017-03-13 15:45   ` Zi Yan
2017-03-13 15:45 ` [PATCH v4 11/11] mm: memory_hotplug: memory hotremove " Zi Yan
2017-03-13 15:45   ` Zi Yan
