linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] hugepage migration fixes (v4)
@ 2014-09-15 22:39 Naoya Horiguchi
  2014-09-15 22:39 ` [PATCH v3 1/5] mm/hugetlb: reduce arch dependent code around follow_huge_* Naoya Horiguchi
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Naoya Horiguchi @ 2014-09-15 22:39 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi

This is version 4 of the hugepage migration fix patchset.

Major changes from v3:
- dropped locking from follow_huge_pud() and follow_huge_addr() in patch 2,
  which was buggy and is no longer necessary because they don't support FOLL_GET,
- follow_huge_pmd(FOLL_GET) can now pin and return tail pages,
- fixed bugs accidentally introduced in patches 3 and 5.
The rest are code improvements and comment/description fixes.

Two related topics (not included in this series, but to be discussed/done later):
- follow_huge_pmd() explicitly uses pmd_lockptr() instead of huge_pte_lock().
  This sheds light on the subtlety of the huge_page_size == PMD_SIZE check
  in huge_pte_lockptr() (sketched below). It causes no runtime problem now,
  but looks fragile, for example when the pmd is folded (where PUD_SIZE ==
  PMD_SIZE) or when an architecture's hugepages are not backed by page tables
  and the hugepage size just happens to equal PMD_SIZE.
- the code around is_hugetlb_entry_migration() and is_hugetlb_entry_hwpoisoned()
  is neither clean nor optimized; cleanup is needed.
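
For reference, the check in question looks roughly like this in current trees
(quoted from memory as an illustration; it is not part of this series):

  static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
                                             struct mm_struct *mm, pte_t *pte)
  {
          /* PMD-sized hugepages use the split pmd lock ... */
          if (huge_page_size(h) == PMD_SIZE)
                  return pmd_lockptr(mm, (pmd_t *) pte);
          /* ... larger hugepages fall back to mm->page_table_lock. */
          VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
          return &mm->page_table_lock;
  }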

This patchset is based on mmotm-2014-09-09-14-42 and shows no regression
in the libhugetlbfs tests.

Tree: git@github.com:Naoya-Horiguchi/linux.git
Branch: mmotm-2014-09-09-14-42/fix_follow_huge_pmd.v4

v2: http://thread.gmane.org/gmane.linux.kernel/1761065
v3: http://thread.gmane.org/gmane.linux.kernel/1776585
---
Summary:

Naoya Horiguchi (5):
      mm/hugetlb: reduce arch dependent code around follow_huge_*
      mm/hugetlb: take page table lock in follow_huge_pmd()
      mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
      mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection
      mm/hugetlb: add migration entry check in __unmap_hugepage_range

 arch/arm/mm/hugetlbpage.c     |   6 --
 arch/arm64/mm/hugetlbpage.c   |   6 --
 arch/ia64/mm/hugetlbpage.c    |   6 --
 arch/metag/mm/hugetlbpage.c   |   6 --
 arch/mips/mm/hugetlbpage.c    |  18 ------
 arch/powerpc/mm/hugetlbpage.c |   8 +++
 arch/s390/mm/hugetlbpage.c    |  20 -------
 arch/sh/mm/hugetlbpage.c      |  12 ----
 arch/sparc/mm/hugetlbpage.c   |  12 ----
 arch/tile/mm/hugetlbpage.c    |  28 ----------
 arch/x86/mm/hugetlbpage.c     |  12 ----
 include/linux/hugetlb.h       |   8 +--
 mm/gup.c                      |  25 ++-------
 mm/hugetlb.c                  | 124 ++++++++++++++++++++++++++++--------------
 mm/migrate.c                  |   3 +-
 15 files changed, 101 insertions(+), 193 deletions(-)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v3 1/5] mm/hugetlb: reduce arch dependent code around follow_huge_*
  2014-09-15 22:39 [PATCH 0/5] hugepage migration fixes (v4) Naoya Horiguchi
@ 2014-09-15 22:39 ` Naoya Horiguchi
  2014-09-16 15:56   ` Naoya Horiguchi
  2014-09-15 22:39 ` [PATCH v3 2/5] mm/hugetlb: take page table lock in follow_huge_pmd() Naoya Horiguchi
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Naoya Horiguchi @ 2014-09-15 22:39 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi, James Hogan

Currently we have many duplicated definitions of follow_huge_addr(),
follow_huge_pmd(), and follow_huge_pud(), so this patch removes the
duplication. The basic idea is to put the default implementation of these
functions in mm/hugetlb.c as weak symbols (regardless of
CONFIG_ARCH_WANT_GENERAL_HUGETLB), and to implement arch-specific code only
when the arch needs it.

For follow_huge_addr(), only powerpc and ia64 have their own implementations;
in all other architectures this function just returns ERR_PTR(-EINVAL), so
this patch makes ERR_PTR(-EINVAL) the default.

As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to always
return 0 in an architecture (as in ia64 or sparc), it is never called (the
call site is optimized away) no matter how it is implemented, so such
architectures don't need an arch-specific implementation.

In some architectures (like mips, s390 and tile), the current arch-specific
follow_huge_(pmd|pud)() is effectively identical to the common code, so this
patch lets those architectures use the common code.

One exception is metag, where pmd_huge() can return non-zero but
follow_huge_pmd() is expected to always return NULL. This means we need an
arch-specific implementation that returns NULL. This behavior looks strange
to me (because non-zero pmd_huge() implies that the architecture supports
PMD-based hugepages, so follow_huge_pmd() can/should return some relevant
value), but that's beyond this cleanup patch, so let's keep it as is.

Justification of non-trivial changes:
- in s390, follow_huge_pmd() first checks !MACHINE_HAS_HPAGE, and this patch
  removes that check. This is OK because we can assume MACHINE_HAS_HPAGE is
  true whenever follow_huge_pmd() can be called (note that pmd_huge() has the
  same check and always returns 0 for !MACHINE_HAS_HPAGE).
- in s390 and mips, the arch code uses HPAGE_MASK where the common code uses
  PMD_MASK. This patch forces these archs to use PMD_MASK, but that's OK
  because the two masks are identical on both archs.
  In s390, HPAGE_SHIFT and PMD_SHIFT are both 20.
  In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
  PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
  PTE_ORDER is always 0, so the two are identical.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
---
 arch/arm/mm/hugetlbpage.c     |  6 ------
 arch/arm64/mm/hugetlbpage.c   |  6 ------
 arch/ia64/mm/hugetlbpage.c    |  6 ------
 arch/metag/mm/hugetlbpage.c   |  6 ------
 arch/mips/mm/hugetlbpage.c    | 18 ------------------
 arch/powerpc/mm/hugetlbpage.c |  8 ++++++++
 arch/s390/mm/hugetlbpage.c    | 20 --------------------
 arch/sh/mm/hugetlbpage.c      | 12 ------------
 arch/sparc/mm/hugetlbpage.c   | 12 ------------
 arch/tile/mm/hugetlbpage.c    | 28 ----------------------------
 arch/x86/mm/hugetlbpage.c     | 12 ------------
 mm/hugetlb.c                  | 30 +++++++++++++++---------------
 12 files changed, 23 insertions(+), 141 deletions(-)

diff --git mmotm-2014-09-09-14-42.orig/arch/arm/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/arm/mm/hugetlbpage.c
index 66781bf34077..c72412415093 100644
--- mmotm-2014-09-09-14-42.orig/arch/arm/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/arm/mm/hugetlbpage.c
@@ -36,12 +36,6 @@
  * of type casting from pmd_t * to pte_t *.
  */
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
-			      int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pud_huge(pud_t pud)
 {
 	return 0;
diff --git mmotm-2014-09-09-14-42.orig/arch/arm64/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/arm64/mm/hugetlbpage.c
index 023747bf4dd7..2de9d2e59d96 100644
--- mmotm-2014-09-09-14-42.orig/arch/arm64/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/arm64/mm/hugetlbpage.c
@@ -38,12 +38,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 }
 #endif
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
-			      int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return !(pmd_val(pmd) & PMD_TABLE_BIT);
diff --git mmotm-2014-09-09-14-42.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/ia64/mm/hugetlbpage.c
index 76069c18ee42..52b7604b5215 100644
--- mmotm-2014-09-09-14-42.orig/arch/ia64/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/ia64/mm/hugetlbpage.c
@@ -114,12 +114,6 @@ int pud_huge(pud_t pud)
 	return 0;
 }
 
-struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
-{
-	return NULL;
-}
-
 void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 			unsigned long addr, unsigned long end,
 			unsigned long floor, unsigned long ceiling)
diff --git mmotm-2014-09-09-14-42.orig/arch/metag/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/metag/mm/hugetlbpage.c
index 3c52fa6d0f8e..745081427659 100644
--- mmotm-2014-09-09-14-42.orig/arch/metag/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/metag/mm/hugetlbpage.c
@@ -94,12 +94,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 	return 0;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm,
-			      unsigned long address, int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return pmd_page_shift(pmd) > PAGE_SHIFT;
diff --git mmotm-2014-09-09-14-42.orig/arch/mips/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/mips/mm/hugetlbpage.c
index 4ec8ee10d371..06e0f421b41b 100644
--- mmotm-2014-09-09-14-42.orig/arch/mips/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/mips/mm/hugetlbpage.c
@@ -68,12 +68,6 @@ int is_aligned_hugepage_range(unsigned long addr, unsigned long len)
 	return 0;
 }
 
-struct page *
-follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return (pmd_val(pmd) & _PAGE_HUGE) != 0;
@@ -83,15 +77,3 @@ int pud_huge(pud_t pud)
 {
 	return (pud_val(pud) & _PAGE_HUGE) != 0;
 }
-
-struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
-{
-	struct page *page;
-
-	page = pte_page(*(pte_t *)pmd);
-	if (page)
-		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
-	return page;
-}
diff --git mmotm-2014-09-09-14-42.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/powerpc/mm/hugetlbpage.c
index 7e70ae968e5f..9517a93a315c 100644
--- mmotm-2014-09-09-14-42.orig/arch/powerpc/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/powerpc/mm/hugetlbpage.c
@@ -706,6 +706,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	return NULL;
 }
 
+struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+		pmd_t *pmd, int write)
+{
+	BUG();
+	return NULL;
+}
+
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
 {
diff --git mmotm-2014-09-09-14-42.orig/arch/s390/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/s390/mm/hugetlbpage.c
index 0ff66a7e29bb..811e7f9a2de0 100644
--- mmotm-2014-09-09-14-42.orig/arch/s390/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/s390/mm/hugetlbpage.c
@@ -201,12 +201,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 	return 0;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
-			      int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	if (!MACHINE_HAS_HPAGE)
@@ -219,17 +213,3 @@ int pud_huge(pud_t pud)
 {
 	return 0;
 }
-
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmdp, int write)
-{
-	struct page *page;
-
-	if (!MACHINE_HAS_HPAGE)
-		return NULL;
-
-	page = pmd_page(*pmdp);
-	if (page)
-		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
-	return page;
-}
diff --git mmotm-2014-09-09-14-42.orig/arch/sh/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/sh/mm/hugetlbpage.c
index d7762349ea48..534bc978af8a 100644
--- mmotm-2014-09-09-14-42.orig/arch/sh/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/sh/mm/hugetlbpage.c
@@ -67,12 +67,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 	return 0;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm,
-			      unsigned long address, int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return 0;
@@ -82,9 +76,3 @@ int pud_huge(pud_t pud)
 {
 	return 0;
 }
-
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmd, int write)
-{
-	return NULL;
-}
diff --git mmotm-2014-09-09-14-42.orig/arch/sparc/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/sparc/mm/hugetlbpage.c
index d329537739c6..4242eab12e10 100644
--- mmotm-2014-09-09-14-42.orig/arch/sparc/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/sparc/mm/hugetlbpage.c
@@ -215,12 +215,6 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 	return entry;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm,
-			      unsigned long address, int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return 0;
@@ -230,9 +224,3 @@ int pud_huge(pud_t pud)
 {
 	return 0;
 }
-
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmd, int write)
-{
-	return NULL;
-}
diff --git mmotm-2014-09-09-14-42.orig/arch/tile/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/tile/mm/hugetlbpage.c
index e514899e1100..8a00c7b7b862 100644
--- mmotm-2014-09-09-14-42.orig/arch/tile/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/tile/mm/hugetlbpage.c
@@ -150,12 +150,6 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 	return NULL;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
-			      int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return !!(pmd_val(pmd) & _PAGE_HUGE_PAGE);
@@ -166,28 +160,6 @@ int pud_huge(pud_t pud)
 	return !!(pud_val(pud) & _PAGE_HUGE_PAGE);
 }
 
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-			     pmd_t *pmd, int write)
-{
-	struct page *page;
-
-	page = pte_page(*(pte_t *)pmd);
-	if (page)
-		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
-	return page;
-}
-
-struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
-			     pud_t *pud, int write)
-{
-	struct page *page;
-
-	page = pte_page(*(pte_t *)pud);
-	if (page)
-		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
-	return page;
-}
-
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
diff --git mmotm-2014-09-09-14-42.orig/arch/x86/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/x86/mm/hugetlbpage.c
index 8b977ebf9388..03b8a7c11817 100644
--- mmotm-2014-09-09-14-42.orig/arch/x86/mm/hugetlbpage.c
+++ mmotm-2014-09-09-14-42/arch/x86/mm/hugetlbpage.c
@@ -52,20 +52,8 @@ int pud_huge(pud_t pud)
 	return 0;
 }
 
-struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
-{
-	return NULL;
-}
 #else
 
-struct page *
-follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
-{
-	return ERR_PTR(-EINVAL);
-}
-
 int pmd_huge(pmd_t pmd)
 {
 	return !!(pmd_val(pmd) & _PAGE_PSE);
diff --git mmotm-2014-09-09-14-42.orig/mm/hugetlb.c mmotm-2014-09-09-14-42/mm/hugetlb.c
index 9fd722769927..34351251e164 100644
--- mmotm-2014-09-09-14-42.orig/mm/hugetlb.c
+++ mmotm-2014-09-09-14-42/mm/hugetlb.c
@@ -3653,7 +3653,20 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 	return (pte_t *) pmd;
 }
 
-struct page *
+#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
+
+/*
+ * These functions are overwritable if your architecture needs its own
+ * behavior.
+ */
+struct page * __weak
+follow_huge_addr(struct mm_struct *mm, unsigned long address,
+			      int write)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+struct page * __weak
 follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 		pmd_t *pmd, int write)
 {
@@ -3665,7 +3678,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	return page;
 }
 
-struct page *
+struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
 		pud_t *pud, int write)
 {
@@ -3677,19 +3690,6 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
 	return page;
 }
 
-#else /* !CONFIG_ARCH_WANT_GENERAL_HUGETLB */
-
-/* Can be overriden by architectures */
-struct page * __weak
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-	       pud_t *pud, int write)
-{
-	BUG();
-	return NULL;
-}
-
-#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
-
 #ifdef CONFIG_MEMORY_FAILURE
 
 /* Should be called in hugetlb_lock */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 2/5] mm/hugetlb: take page table lock in follow_huge_pmd()
  2014-09-15 22:39 [PATCH 0/5] hugepage migration fixes (v4) Naoya Horiguchi
  2014-09-15 22:39 ` [PATCH v3 1/5] mm/hugetlb: reduce arch dependent code around follow_huge_* Naoya Horiguchi
@ 2014-09-15 22:39 ` Naoya Horiguchi
  2014-09-30  4:32   ` Hugh Dickins
  2014-09-15 22:39 ` [PATCH v3 3/5] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault() Naoya Horiguchi
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Naoya Horiguchi @ 2014-09-15 22:39 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi, stable

We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and tries
to take their refcount without preventing concurrent freeing. This race
crashes the kernel, so this patch fixes it by moving the FOLL_GET code for
hugepages into follow_huge_pmd(), which now takes the page table lock.

This patch intentionally removes the page == NULL check after pte_page().
This is justified because pte_page() never returns NULL on any architecture
or configuration.

This patch changes the behavior of follow_huge_pmd() so that tail pages can
be pinned and returned, so callers must be updated to handle the returned
tail pages properly.

We could add similar locking to follow_huge_(addr|pud)() for consistency,
but it's not necessary because these functions currently don't support the
FOLL_GET flag; let's leave that for future development.

Here is the reproducer:

  $ cat movepages.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <err.h>
  #include <sys/types.h>
  #include <numa.h>
  #include <numaif.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000
  #define PS              0x1000

  int main(int argc, char *argv[]) {
          int i;
          int nr_hp = strtol(argv[1], NULL, 0);
          int nr_p  = nr_hp * HPS / PS;
          int ret;
          void **addrs;
          int *status;
          int *nodes;
          pid_t pid;

          pid = strtol(argv[2], NULL, 0);
          addrs  = malloc(sizeof(char *) * nr_p + 1);
          status = malloc(sizeof(char *) * nr_p + 1);
          nodes  = malloc(sizeof(char *) * nr_p + 1);

          while (1) {
                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 1;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err(1, "move_pages");

                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 0;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err(1, "move_pages");
          }
          return 0;
  }

  $ cat hugepage.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <string.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000

  int main(int argc, char *argv[]) {
          int nr_hp = strtol(argv[1], NULL, 0);
          char *p;

          while (1) {
                  p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                  if (p != (void *)ADDR_INPUT) {
                          perror("mmap");
                          break;
                  }
                  memset(p, 0, nr_hp * HPS);
                  munmap(p, nr_hp * HPS);
          }
  }

  $ sysctl vm.nr_hugepages=40
  $ ./hugepage 10 &
  $ ./movepages 10 $(pgrep -f hugepage)
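
For reference, the reproducer is expected to be built roughly like this
(movepages.c needs libnuma for numa_move_pages()):

  $ gcc -o hugepage hugepage.c
  $ gcc -o movepages movepages.c -lnuma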

Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>  # [3.12+]
---
ChangeLog v4:
- remove changes related to taking ptl from follow_huge_(addr|pud)(),
  which is not necessary because these functions don't support FOLL_GET
  at least for now
- add justification of removing page==NULL check to patch description
- stop changing parameter mm to vma in follow_huge_(pud|pmd)()
- use pmd_lockptr() instead of huge_pte_lock() in follow_huge_pmd()
- use get_page() instead of get_page_unless_zero() in follow_huge_pmd()
- use Fixes: tag and move changelog under '---'

ChangeLog v3:
- remove unnecessary if (page) check
- check (pmd|pud)_huge again after holding ptl
- do the same change also on follow_huge_pud()
- take page table lock also in follow_huge_addr()

ChangeLog v2:
- introduce follow_huge_pmd_lock() to do locking in arch-independent code.
---
 include/linux/hugetlb.h |  8 ++++----
 mm/gup.c                | 25 ++++---------------------
 mm/hugetlb.c            | 30 +++++++++++++++++++-----------
 mm/migrate.c            |  3 ++-
 4 files changed, 29 insertions(+), 37 deletions(-)

diff --git mmotm-2014-09-09-14-42.orig/include/linux/hugetlb.h mmotm-2014-09-09-14-42/include/linux/hugetlb.h
index 6e6d338641fe..14020c7796af 100644
--- mmotm-2014-09-09-14-42.orig/include/linux/hugetlb.h
+++ mmotm-2014-09-09-14-42/include/linux/hugetlb.h
@@ -99,9 +99,9 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
 struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-				pmd_t *pmd, int write);
+				pmd_t *pmd, int flags);
 struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
-				pud_t *pud, int write);
+				pud_t *pud, int flags);
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
@@ -133,8 +133,8 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 static inline void hugetlb_show_meminfo(void)
 {
 }
-#define follow_huge_pmd(mm, addr, pmd, write)	NULL
-#define follow_huge_pud(mm, addr, pud, write)	NULL
+#define follow_huge_pmd(mm, addr, pmd, flags)	NULL
+#define follow_huge_pud(mm, addr, pud, flags)	NULL
 #define prepare_hugepage_range(file, addr, len)	(-EINVAL)
 #define pmd_huge(x)	0
 #define pud_huge(x)	0
diff --git mmotm-2014-09-09-14-42.orig/mm/gup.c mmotm-2014-09-09-14-42/mm/gup.c
index 91d044b1600d..4575d23a33b9 100644
--- mmotm-2014-09-09-14-42.orig/mm/gup.c
+++ mmotm-2014-09-09-14-42/mm/gup.c
@@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	pud = pud_offset(pgd, address);
 	if (pud_none(*pud))
 		return no_page_table(vma, flags);
-	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
-		if (flags & FOLL_GET)
-			return NULL;
-		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
-		return page;
-	}
+	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB)
+		return follow_huge_pud(mm, address, pud, flags);
 	if (unlikely(pud_bad(*pud)))
 		return no_page_table(vma, flags);
 
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		return no_page_table(vma, flags);
-	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
-		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
-		if (flags & FOLL_GET) {
-			/*
-			 * Refcount on tail pages are not well-defined and
-			 * shouldn't be taken. The caller should handle a NULL
-			 * return when trying to follow tail pages.
-			 */
-			if (PageHead(page))
-				get_page(page);
-			else
-				page = NULL;
-		}
-		return page;
-	}
+	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
+		return follow_huge_pmd(mm, address, pmd, flags);
 	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
 		return no_page_table(vma, flags);
 	if (pmd_trans_huge(*pmd)) {
diff --git mmotm-2014-09-09-14-42.orig/mm/hugetlb.c mmotm-2014-09-09-14-42/mm/hugetlb.c
index 34351251e164..941832ee3d5a 100644
--- mmotm-2014-09-09-14-42.orig/mm/hugetlb.c
+++ mmotm-2014-09-09-14-42/mm/hugetlb.c
@@ -3668,26 +3668,34 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
 
 struct page * __weak
 follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd, int write)
+		pmd_t *pmd, int flags)
 {
-	struct page *page;
+	struct page *page = NULL;
+	spinlock_t *ptl;
 
-	page = pte_page(*(pte_t *)pmd);
-	if (page)
-		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
+	ptl = pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	if (!pmd_huge(*pmd))
+		goto out;
+
+	page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+
+	if (flags & FOLL_GET)
+		get_page(page);
+out:
+	spin_unlock(ptl);
 	return page;
 }
 
 struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
-		pud_t *pud, int write)
+		pud_t *pud, int flags)
 {
-	struct page *page;
+	if (flags & FOLL_GET)
+		return NULL;
 
-	page = pte_page(*(pte_t *)pud);
-	if (page)
-		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
-	return page;
+	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
 }
 
 #ifdef CONFIG_MEMORY_FAILURE
diff --git mmotm-2014-09-09-14-42.orig/mm/migrate.c mmotm-2014-09-09-14-42/mm/migrate.c
index 09d489c55c21..29f12ca176d7 100644
--- mmotm-2014-09-09-14-42.orig/mm/migrate.c
+++ mmotm-2014-09-09-14-42/mm/migrate.c
@@ -1246,7 +1246,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 			goto put_and_set;
 
 		if (PageHuge(page)) {
-			isolate_huge_page(page, &pagelist);
+			if (PageHead(page))
+				isolate_huge_page(page, &pagelist);
 			goto put_and_set;
 		}
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 3/5] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
  2014-09-15 22:39 [PATCH 0/5] hugepage migration fixes (v4) Naoya Horiguchi
  2014-09-15 22:39 ` [PATCH v3 1/5] mm/hugetlb: reduce arch dependent code around follow_huge_* Naoya Horiguchi
  2014-09-15 22:39 ` [PATCH v3 2/5] mm/hugetlb: take page table lock in follow_huge_pmd() Naoya Horiguchi
@ 2014-09-15 22:39 ` Naoya Horiguchi
  2014-09-30  4:52   ` Hugh Dickins
  2014-09-15 22:39 ` [PATCH v3 4/5] mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection Naoya Horiguchi
  2014-09-15 22:39 ` [PATCH v3 5/5] mm/hugetlb: add migration entry check in __unmap_hugepage_range Naoya Horiguchi
  4 siblings, 1 reply; 11+ messages in thread
From: Naoya Horiguchi @ 2014-09-15 22:39 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi, stable

When running the test which causes the race as shown in the previous patch,
we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().

This race happens when the pte turns into a migration entry just after the
first is_hugetlb_entry_migration() check in hugetlb_fault() has returned
false. To fix this, we need to check pte_present() again after huge_ptep_get().

This patch also reorders taking the ptl and calling pte_page(), because
pte_page() should be done under the ptl. Due to this reordering, we need to
use trylock_page() in the page != pagecache_page case to respect the locking
order.

Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>  # [3.2+]
---
ChangeLog v4:
- move !pte_present(entry) (i.e. migration/hwpoison) check before
  taking page table lock
- call wait_on_page_locked() if trylock_page() returns false
- remove unused label out_unlock_page
- fix the order of put_page() and unlock_page() after out_put_page label
- move changelog under '---'

Hugh advised on v3 that we could call migration_entry_wait_huge() when the
!pte_present(entry) check returns true, to avoid busy faulting. But it seems
that in that case only one additional page fault happens instead of busy
faulting, because the is_hugetlb_entry_migration() check in the second call
of hugetlb_fault() should return true, and then migration_entry_wait_huge()
is called (see the quoted check below). We could avoid this additional page
fault by adding another migration_entry_wait_huge() call, but then we would
have to split the pte_present() check into an is_hugetlb_entry_migration()
path and an is_hugetlb_entry_hwpoisoned() path, which complicates the code.
So let me take the simpler approach for the stable tree.
It was also suggested that we clean up is_hugetlb_entry_migration() and
is_hugetlb_entry_hwpoisoned(). That will be done in separate work, and the
migration_entry_wait_huge() question above will be revisited there.
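
For context, the early check in hugetlb_fault() that the retried fault is
expected to hit looks roughly like this (quoted for illustration; it is not
changed by this patch):

  ptep = huge_pte_offset(mm, address);
  if (ptep) {
          entry = huge_ptep_get(ptep);
          if (unlikely(is_hugetlb_entry_migration(entry))) {
                  /* wait until the migration finishes, then retry */
                  migration_entry_wait_huge(vma, mm, ptep);
                  return 0;
          } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
                  return VM_FAULT_HWPOISON_LARGE |
                         VM_FAULT_SET_HINDEX(hstate_index(h));
  }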

ChangeLog v3:
- doing pte_page() and taking refcount under page table lock
- check pte_present after taking ptl, which makes it unnecessary to use
  get_page_unless_zero()
- use trylock_page in page != pagecache_page case
- fixed target stable version
---
 mm/hugetlb.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git mmotm-2014-09-09-14-42.orig/mm/hugetlb.c mmotm-2014-09-09-14-42/mm/hugetlb.c
index 941832ee3d5a..52f001943169 100644
--- mmotm-2014-09-09-14-42.orig/mm/hugetlb.c
+++ mmotm-2014-09-09-14-42/mm/hugetlb.c
@@ -3128,6 +3128,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *pagecache_page = NULL;
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
+	int need_wait_lock = 0;
 
 	address &= huge_page_mask(h);
 
@@ -3164,6 +3165,15 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	ret = 0;
+	/*
+	 * entry could be a migration/hwpoison entry at this point, so this
+	 * check prevents the kernel from going below assuming that we have
+	 * a active hugepage in pagecache. This goto expects the 2nd page fault,
+	 * and is_hugetlb_entry_(migration|hwpoisoned) check will properly
+	 * handle it.
+	 */
+	if (!pte_present(entry))
+		goto out_mutex;
 
 	/*
 	 * If we are going to COW the mapping later, we examine the pending
@@ -3184,6 +3194,12 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 								vma, address);
 	}
 
+	ptl = huge_pte_lock(h, mm, ptep);
+
+	/* Check for a racing update before calling hugetlb_cow */
+	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+		goto out_ptl;
+
 	/*
 	 * hugetlb_cow() requires page locks of pte_page(entry) and
 	 * pagecache_page, so here we need take the former one
@@ -3192,22 +3208,19 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * so no worry about deadlock.
 	 */
 	page = pte_page(entry);
-	get_page(page);
 	if (page != pagecache_page)
-		lock_page(page);
-
-	ptl = huge_pte_lockptr(h, mm, ptep);
-	spin_lock(ptl);
-	/* Check for a racing update before calling hugetlb_cow */
-	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
-		goto out_ptl;
+		if (!trylock_page(page)) {
+			need_wait_lock = 1;
+			goto out_ptl;
+		}
 
+	get_page(page);
 
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!huge_pte_write(entry)) {
 			ret = hugetlb_cow(mm, vma, address, ptep, entry,
 					pagecache_page, ptl);
-			goto out_ptl;
+			goto out_put_page;
 		}
 		entry = huge_pte_mkdirty(entry);
 	}
@@ -3215,7 +3228,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (huge_ptep_set_access_flags(vma, address, ptep, entry,
 						flags & FAULT_FLAG_WRITE))
 		update_mmu_cache(vma, address, ptep);
-
+out_put_page:
+	if (page != pagecache_page)
+		unlock_page(page);
+	put_page(page);
 out_ptl:
 	spin_unlock(ptl);
 
@@ -3223,12 +3239,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unlock_page(pagecache_page);
 		put_page(pagecache_page);
 	}
-	if (page != pagecache_page)
-		unlock_page(page);
-	put_page(page);
-
 out_mutex:
 	mutex_unlock(&htlb_fault_mutex_table[hash]);
+	if (need_wait_lock)
+		wait_on_page_locked(page);
 	return ret;
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 4/5] mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection
  2014-09-15 22:39 [PATCH 0/5] hugepage migration fixes (v4) Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2014-09-15 22:39 ` [PATCH v3 3/5] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault() Naoya Horiguchi
@ 2014-09-15 22:39 ` Naoya Horiguchi
  2014-09-15 22:39 ` [PATCH v3 5/5] mm/hugetlb: add migration entry check in __unmap_hugepage_range Naoya Horiguchi
  4 siblings, 0 replies; 11+ messages in thread
From: Naoya Horiguchi @ 2014-09-15 22:39 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi, stable

There is a race condition between hugepage migration and change_protection():
hugetlb_change_protection() doesn't care about migration entries and wrongly
overwrites them, which causes unexpected results such as a kernel crash.
HWPoison entries can cause the same problem.

This patch adds is_hugetlb_entry_(migration|hwpoisoned)() checks to this
function so that such entries are handled properly.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> # [2.6.36+]
---
ChangeLog v4:
- s/set_pte_at/set_huge_pte_at/

ChangeLog v3:
- handle migration entry correctly (instead of just skipping)
---
 mm/hugetlb.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git mmotm-2014-09-09-14-42.orig/mm/hugetlb.c mmotm-2014-09-09-14-42/mm/hugetlb.c
index 52f001943169..461ee1e10067 100644
--- mmotm-2014-09-09-14-42.orig/mm/hugetlb.c
+++ mmotm-2014-09-09-14-42/mm/hugetlb.c
@@ -3372,7 +3372,26 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			spin_unlock(ptl);
 			continue;
 		}
-		if (!huge_pte_none(huge_ptep_get(ptep))) {
+		pte = huge_ptep_get(ptep);
+		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
+			spin_unlock(ptl);
+			continue;
+		}
+		if (unlikely(is_hugetlb_entry_migration(pte))) {
+			swp_entry_t entry = pte_to_swp_entry(pte);
+
+			if (is_write_migration_entry(entry)) {
+				pte_t newpte;
+
+				make_migration_entry_read(&entry);
+				newpte = swp_entry_to_pte(entry);
+				set_huge_pte_at(mm, address, ptep, newpte);
+				pages++;
+			}
+			spin_unlock(ptl);
+			continue;
+		}
+		if (!huge_pte_none(pte)) {
 			pte = huge_ptep_get_and_clear(mm, address, ptep);
 			pte = pte_mkhuge(huge_pte_modify(pte, newprot));
 			pte = arch_make_huge_pte(pte, vma, NULL, 0);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 5/5] mm/hugetlb: add migration entry check in __unmap_hugepage_range
  2014-09-15 22:39 [PATCH 0/5] hugepage migration fixes (v4) Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2014-09-15 22:39 ` [PATCH v3 4/5] mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection Naoya Horiguchi
@ 2014-09-15 22:39 ` Naoya Horiguchi
  4 siblings, 0 replies; 11+ messages in thread
From: Naoya Horiguchi @ 2014-09-15 22:39 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi, stable

If __unmap_hugepage_range() tries to unmap an address range over which
hugepage migration is in progress, we get the wrong page because pte_page()
doesn't work for migration entries. This patch simply clears the pte for
migration entries, as we already do for hwpoison entries.
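
For background (not part of this patch), the page behind a non-present
migration entry would have to be recovered via the swap-entry helpers rather
than pte_page(), roughly:

  swp_entry_t entry = pte_to_swp_entry(pte);
  struct page *page = migration_entry_to_page(entry);

Since the migrating hugepage is already unmapped and its refcount has been
dropped, simply clearing the pte is enough here.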

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>  # [2.6.36+]
---
ChangeLog v4:
- check hwpoisoned hugepage and migrating hugepage with the same check
  instead of doing separately
---
 mm/hugetlb.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git mmotm-2014-09-09-14-42.orig/mm/hugetlb.c mmotm-2014-09-09-14-42/mm/hugetlb.c
index 461ee1e10067..1ecb625bc498 100644
--- mmotm-2014-09-09-14-42.orig/mm/hugetlb.c
+++ mmotm-2014-09-09-14-42/mm/hugetlb.c
@@ -2653,9 +2653,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			goto unlock;
 
 		/*
-		 * HWPoisoned hugepage is already unmapped and dropped reference
+		 * Migrating hugepage or HWPoisoned hugepage is already
+		 * unmapped and its refcount is dropped, so just clear pte here.
 		 */
-		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
+		if (unlikely(!pte_present(pte))) {
 			huge_pte_clear(mm, address, ptep);
 			goto unlock;
 		}
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/5] mm/hugetlb: reduce arch dependent code around follow_huge_*
  2014-09-15 22:39 ` [PATCH v3 1/5] mm/hugetlb: reduce arch dependent code around follow_huge_* Naoya Horiguchi
@ 2014-09-16 15:56   ` Naoya Horiguchi
  0 siblings, 0 replies; 11+ messages in thread
From: Naoya Horiguchi @ 2014-09-16 15:56 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: David Rientjes, linux-mm, linux-kernel, Naoya Horiguchi, James Hogan

I found that the version in the subject of this and the later patches isn't
correct (s/v3/v4/). It's not critical, so I won't resend just to fix it, but
sorry for any confusion.

Naoya

On Mon, Sep 15, 2014 at 06:39:55PM -0400, Naoya Horiguchi wrote:
> Currently we have many duplicates in definitions around follow_huge_addr(),
> follow_huge_pmd(), and follow_huge_pud(), so this patch tries to remove them.
> The basic idea is to put the default implementation for these functions in
> mm/hugetlb.c as weak symbols (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETLB),
> and to implement arch-specific code only when the arch needs it.
> 
> For follow_huge_addr(), only powerpc and ia64 have their own implementation,
> and in all other architectures this function just returns ERR_PTR(-EINVAL).
> So this patch sets returning ERR_PTR(-EINVAL) as default.
> 
> As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to always
> return 0 in your architecture (like in ia64 or sparc,) it's never called
> (the callsite is optimized away) no matter how implemented it is.
> So in such architectures, we don't need arch-specific implementation.
> 
> In some architecture (like mips, s390 and tile,) their current arch-specific
> follow_huge_(pmd|pud)() are effectively identical with the common code,
> so this patch lets these architecture use the common code.
> 
> One exception is metag, where pmd_huge() could return non-zero but it expects
> follow_huge_pmd() to always return NULL. This means that we need arch-specific
> implementation which returns NULL. This behavior looks strange to me (because
> non-zero pmd_huge() implies that the architecture supports PMD-based hugepage,
> so follow_huge_pmd() can/should return some relevant value,) but that's beyond
> this cleanup patch, so let's keep it.
> 
> Justification of non-trivial changes:
> - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
>   patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
>   is true when follow_huge_pmd() can be called (note that pmd_huge() has
>   the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
> - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
>   code. This patch forces these archs use PMD_MASK, but it's OK because
>   they are identical in both archs.
>   In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20.
>   In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
>   PMD_SHIFT is define as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
>   PTE_ORDER is always 0, so these are identical.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: James Hogan <james.hogan@imgtec.com>
> ---
>  arch/arm/mm/hugetlbpage.c     |  6 ------
>  arch/arm64/mm/hugetlbpage.c   |  6 ------
>  arch/ia64/mm/hugetlbpage.c    |  6 ------
>  arch/metag/mm/hugetlbpage.c   |  6 ------
>  arch/mips/mm/hugetlbpage.c    | 18 ------------------
>  arch/powerpc/mm/hugetlbpage.c |  8 ++++++++
>  arch/s390/mm/hugetlbpage.c    | 20 --------------------
>  arch/sh/mm/hugetlbpage.c      | 12 ------------
>  arch/sparc/mm/hugetlbpage.c   | 12 ------------
>  arch/tile/mm/hugetlbpage.c    | 28 ----------------------------
>  arch/x86/mm/hugetlbpage.c     | 12 ------------
>  mm/hugetlb.c                  | 30 +++++++++++++++---------------
>  12 files changed, 23 insertions(+), 141 deletions(-)
> 
> diff --git mmotm-2014-09-09-14-42.orig/arch/arm/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/arm/mm/hugetlbpage.c
> index 66781bf34077..c72412415093 100644
> --- mmotm-2014-09-09-14-42.orig/arch/arm/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/arm/mm/hugetlbpage.c
> @@ -36,12 +36,6 @@
>   * of type casting from pmd_t * to pte_t *.
>   */
>  
> -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> -			      int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pud_huge(pud_t pud)
>  {
>  	return 0;
> diff --git mmotm-2014-09-09-14-42.orig/arch/arm64/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/arm64/mm/hugetlbpage.c
> index 023747bf4dd7..2de9d2e59d96 100644
> --- mmotm-2014-09-09-14-42.orig/arch/arm64/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/arm64/mm/hugetlbpage.c
> @@ -38,12 +38,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  }
>  #endif
>  
> -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> -			      int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return !(pmd_val(pmd) & PMD_TABLE_BIT);
> diff --git mmotm-2014-09-09-14-42.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/ia64/mm/hugetlbpage.c
> index 76069c18ee42..52b7604b5215 100644
> --- mmotm-2014-09-09-14-42.orig/arch/ia64/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/ia64/mm/hugetlbpage.c
> @@ -114,12 +114,6 @@ int pud_huge(pud_t pud)
>  	return 0;
>  }
>  
> -struct page *
> -follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
> -{
> -	return NULL;
> -}
> -
>  void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>  			unsigned long addr, unsigned long end,
>  			unsigned long floor, unsigned long ceiling)
> diff --git mmotm-2014-09-09-14-42.orig/arch/metag/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/metag/mm/hugetlbpage.c
> index 3c52fa6d0f8e..745081427659 100644
> --- mmotm-2014-09-09-14-42.orig/arch/metag/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/metag/mm/hugetlbpage.c
> @@ -94,12 +94,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  	return 0;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm,
> -			      unsigned long address, int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return pmd_page_shift(pmd) > PAGE_SHIFT;
> diff --git mmotm-2014-09-09-14-42.orig/arch/mips/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/mips/mm/hugetlbpage.c
> index 4ec8ee10d371..06e0f421b41b 100644
> --- mmotm-2014-09-09-14-42.orig/arch/mips/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/mips/mm/hugetlbpage.c
> @@ -68,12 +68,6 @@ int is_aligned_hugepage_range(unsigned long addr, unsigned long len)
>  	return 0;
>  }
>  
> -struct page *
> -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return (pmd_val(pmd) & _PAGE_HUGE) != 0;
> @@ -83,15 +77,3 @@ int pud_huge(pud_t pud)
>  {
>  	return (pud_val(pud) & _PAGE_HUGE) != 0;
>  }
> -
> -struct page *
> -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> -{
> -	struct page *page;
> -
> -	page = pte_page(*(pte_t *)pmd);
> -	if (page)
> -		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
> -	return page;
> -}
> diff --git mmotm-2014-09-09-14-42.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/powerpc/mm/hugetlbpage.c
> index 7e70ae968e5f..9517a93a315c 100644
> --- mmotm-2014-09-09-14-42.orig/arch/powerpc/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/powerpc/mm/hugetlbpage.c
> @@ -706,6 +706,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  	return NULL;
>  }
>  
> +struct page *
> +follow_huge_pud(struct mm_struct *mm, unsigned long address,
> +		pmd_t *pmd, int write)
> +{
> +	BUG();
> +	return NULL;
> +}
> +
>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>  				      unsigned long sz)
>  {
> diff --git mmotm-2014-09-09-14-42.orig/arch/s390/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/s390/mm/hugetlbpage.c
> index 0ff66a7e29bb..811e7f9a2de0 100644
> --- mmotm-2014-09-09-14-42.orig/arch/s390/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/s390/mm/hugetlbpage.c
> @@ -201,12 +201,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  	return 0;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> -			      int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	if (!MACHINE_HAS_HPAGE)
> @@ -219,17 +213,3 @@ int pud_huge(pud_t pud)
>  {
>  	return 0;
>  }
> -
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -			     pmd_t *pmdp, int write)
> -{
> -	struct page *page;
> -
> -	if (!MACHINE_HAS_HPAGE)
> -		return NULL;
> -
> -	page = pmd_page(*pmdp);
> -	if (page)
> -		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
> -	return page;
> -}
> diff --git mmotm-2014-09-09-14-42.orig/arch/sh/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/sh/mm/hugetlbpage.c
> index d7762349ea48..534bc978af8a 100644
> --- mmotm-2014-09-09-14-42.orig/arch/sh/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/sh/mm/hugetlbpage.c
> @@ -67,12 +67,6 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  	return 0;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm,
> -			      unsigned long address, int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return 0;
> @@ -82,9 +76,3 @@ int pud_huge(pud_t pud)
>  {
>  	return 0;
>  }
> -
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -			     pmd_t *pmd, int write)
> -{
> -	return NULL;
> -}
> diff --git mmotm-2014-09-09-14-42.orig/arch/sparc/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/sparc/mm/hugetlbpage.c
> index d329537739c6..4242eab12e10 100644
> --- mmotm-2014-09-09-14-42.orig/arch/sparc/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/sparc/mm/hugetlbpage.c
> @@ -215,12 +215,6 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>  	return entry;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm,
> -			      unsigned long address, int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return 0;
> @@ -230,9 +224,3 @@ int pud_huge(pud_t pud)
>  {
>  	return 0;
>  }
> -
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -			     pmd_t *pmd, int write)
> -{
> -	return NULL;
> -}
> diff --git mmotm-2014-09-09-14-42.orig/arch/tile/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/tile/mm/hugetlbpage.c
> index e514899e1100..8a00c7b7b862 100644
> --- mmotm-2014-09-09-14-42.orig/arch/tile/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/tile/mm/hugetlbpage.c
> @@ -150,12 +150,6 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
>  	return NULL;
>  }
>  
> -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> -			      int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return !!(pmd_val(pmd) & _PAGE_HUGE_PAGE);
> @@ -166,28 +160,6 @@ int pud_huge(pud_t pud)
>  	return !!(pud_val(pud) & _PAGE_HUGE_PAGE);
>  }
>  
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -			     pmd_t *pmd, int write)
> -{
> -	struct page *page;
> -
> -	page = pte_page(*(pte_t *)pmd);
> -	if (page)
> -		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
> -	return page;
> -}
> -
> -struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -			     pud_t *pud, int write)
> -{
> -	struct page *page;
> -
> -	page = pte_page(*(pte_t *)pud);
> -	if (page)
> -		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
> -	return page;
> -}
> -
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  {
>  	return 0;
> diff --git mmotm-2014-09-09-14-42.orig/arch/x86/mm/hugetlbpage.c mmotm-2014-09-09-14-42/arch/x86/mm/hugetlbpage.c
> index 8b977ebf9388..03b8a7c11817 100644
> --- mmotm-2014-09-09-14-42.orig/arch/x86/mm/hugetlbpage.c
> +++ mmotm-2014-09-09-14-42/arch/x86/mm/hugetlbpage.c
> @@ -52,20 +52,8 @@ int pud_huge(pud_t pud)
>  	return 0;
>  }
>  
> -struct page *
> -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> -{
> -	return NULL;
> -}
>  #else
>  
> -struct page *
> -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> -{
> -	return ERR_PTR(-EINVAL);
> -}
> -
>  int pmd_huge(pmd_t pmd)
>  {
>  	return !!(pmd_val(pmd) & _PAGE_PSE);
> diff --git mmotm-2014-09-09-14-42.orig/mm/hugetlb.c mmotm-2014-09-09-14-42/mm/hugetlb.c
> index 9fd722769927..34351251e164 100644
> --- mmotm-2014-09-09-14-42.orig/mm/hugetlb.c
> +++ mmotm-2014-09-09-14-42/mm/hugetlb.c
> @@ -3653,7 +3653,20 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
>  	return (pte_t *) pmd;
>  }
>  
> -struct page *
> +#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> +
> +/*
> + * These functions are overwritable if your architecture needs its own
> + * behavior.
> + */
> +struct page * __weak
> +follow_huge_addr(struct mm_struct *mm, unsigned long address,
> +			      int write)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +struct page * __weak
>  follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  		pmd_t *pmd, int write)
>  {
> @@ -3665,7 +3678,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  	return page;
>  }
>  
> -struct page *
> +struct page * __weak
>  follow_huge_pud(struct mm_struct *mm, unsigned long address,
>  		pud_t *pud, int write)
>  {
> @@ -3677,19 +3690,6 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
>  	return page;
>  }
>  
> -#else /* !CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> -
> -/* Can be overriden by architectures */
> -struct page * __weak
> -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -	       pud_t *pud, int write)
> -{
> -	BUG();
> -	return NULL;
> -}
> -
> -#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> -
>  #ifdef CONFIG_MEMORY_FAILURE
>  
>  /* Should be called in hugetlb_lock */
> -- 
> 1.9.3
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/5] mm/hugetlb: take page table lock in follow_huge_pmd()
  2014-09-15 22:39 ` [PATCH v3 2/5] mm/hugetlb: take page table lock in follow_huge_pmd() Naoya Horiguchi
@ 2014-09-30  4:32   ` Hugh Dickins
  2014-10-07 16:56     ` Naoya Horiguchi
  0 siblings, 1 reply; 11+ messages in thread
From: Hugh Dickins @ 2014-09-30  4:32 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Hugh Dickins, David Rientjes, linux-mm,
	linux-kernel, Naoya Horiguchi, stable

On Mon, 15 Sep 2014, Naoya Horiguchi wrote:
> We have a race condition between move_pages() and freeing hugepages,

I've been looking through these 5 today, and they're much better now,
thank you.  But a new concern below, and a minor correction to 3/5.

> --- mmotm-2014-09-09-14-42.orig/mm/gup.c
> +++ mmotm-2014-09-09-14-42/mm/gup.c
> @@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  
>  	pmd = pmd_offset(pud, address);
>  	if (pmd_none(*pmd))
>  		return no_page_table(vma, flags);
> -	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> -		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> -		if (flags & FOLL_GET) {
> -			/*
> -			 * Refcount on tail pages are not well-defined and
> -			 * shouldn't be taken. The caller should handle a NULL
> -			 * return when trying to follow tail pages.
> -			 */
> -			if (PageHead(page))
> -				get_page(page);
> -			else
> -				page = NULL;
> -		}
> -		return page;
> -	}
> +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
> +		return follow_huge_pmd(mm, address, pmd, flags);

This code here allows for pmd_none() and pmd_huge(), and for pmd_numa()
and pmd_trans_huge() below; but makes no explicit allowance for !present
migration and hwpoison entries.

Is it assumed that the pmd_bad() test in follow_page_pte() will catch
those?  But what of races? migration entries are highly volatile.  And
is it assumed that a migration entry cannot pass the pmd_huge() test?

That may be true of x86 today, I'm not certain; but if the soft-dirty
people catch up with the hugetlb-migration people, that might change
(they #define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE).

Why pmd_huge() does not itself test for present, I cannot say; but it
probably didn't matter at all before hwpoison and migration were added.

Mind you, with __get_user_pages()'s is_vm_hugetlb_page() test avoiding
all this code, maybe the only thing that can stumble here is your own
hugetlb migration code; but that appears to be guarded only by
down_read of mmap_sem, so races would be possible (if userspace
is silly enough or malicious enough to do so).

What we have here today looks too fragile to me, but it's probably
best dealt with by a separate patch.

Or I may be over-anxious, and there may be something "obvious"
that I'm missing, which saves us from further change.

>  	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
>  		return no_page_table(vma, flags);
>  	if (pmd_trans_huge(*pmd)) {
> diff --git mmotm-2014-09-09-14-42.orig/mm/hugetlb.c mmotm-2014-09-09-14-42/mm/hugetlb.c
> index 34351251e164..941832ee3d5a 100644
> --- mmotm-2014-09-09-14-42.orig/mm/hugetlb.c
> +++ mmotm-2014-09-09-14-42/mm/hugetlb.c
> @@ -3668,26 +3668,34 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  
>  struct page * __weak
>  follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> +		pmd_t *pmd, int flags)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
> +	spinlock_t *ptl;
>  
> -	page = pte_page(*(pte_t *)pmd);
> -	if (page)
> -		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +	ptl = pmd_lockptr(mm, pmd);
> +	spin_lock(ptl);
> +
> +	if (!pmd_huge(*pmd))
> +		goto out;

And similarly here.  Though at least here we now have the necessary
lock, so it's no longer racy, and maybe this pmd_huge() test just needs
to be replaced by a pmd_present() test?  Or are both needed?

> +
> +	page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +out:
> +	spin_unlock(ptl);
>  	return page;
>  }

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/5] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
  2014-09-15 22:39 ` [PATCH v3 3/5] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault() Naoya Horiguchi
@ 2014-09-30  4:52   ` Hugh Dickins
  2014-10-07 16:56     ` Naoya Horiguchi
  0 siblings, 1 reply; 11+ messages in thread
From: Hugh Dickins @ 2014-09-30  4:52 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, Hugh Dickins, David Rientjes, linux-mm,
	linux-kernel, Naoya Horiguchi, stable

On Mon, 15 Sep 2014, Naoya Horiguchi wrote:
> When running the test which causes the race as shown in the previous patch,
> we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().

Two minor comments...

> @@ -3192,22 +3208,19 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> 	 * Note that locking order is always pagecache_page -> page,
>  	 * so no worry about deadlock.

That sentence of comment is stale and should be deleted,
now that you're only doing a trylock_page(page) here.

>  out_mutex:
>  	mutex_unlock(&htlb_fault_mutex_table[hash]);
> +	if (need_wait_lock)
> +		wait_on_page_locked(page);
>  	return ret;
>  }

It will be hard to trigger any problem from this (I guess it would
need memory hotremove), but you really ought to hold a reference to the
page while doing a wait_on_page_locked(page).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/5] mm/hugetlb: take page table lock in follow_huge_pmd()
  2014-09-30  4:32   ` Hugh Dickins
@ 2014-10-07 16:56     ` Naoya Horiguchi
  0 siblings, 0 replies; 11+ messages in thread
From: Naoya Horiguchi @ 2014-10-07 16:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Rientjes, linux-mm, linux-kernel,
	Naoya Horiguchi, stable

Hi Hugh,

Sorry for the delayed response; I was away on vacation.

On Mon, Sep 29, 2014 at 09:32:20PM -0700, Hugh Dickins wrote:
> On Mon, 15 Sep 2014, Naoya Horiguchi wrote:
> > We have a race condition between move_pages() and freeing hugepages,
> 
> I've been looking through these 5 today, and they're much better now,
> thank you.  But a new concern below, and a minor correction to 3/5.
> 
> > --- mmotm-2014-09-09-14-42.orig/mm/gup.c
> > +++ mmotm-2014-09-09-14-42/mm/gup.c
> > @@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> >  
> >  	pmd = pmd_offset(pud, address);
> >  	if (pmd_none(*pmd))
> >  		return no_page_table(vma, flags);
> > -	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> > -		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> > -		if (flags & FOLL_GET) {
> > -			/*
> > -			 * Refcount on tail pages are not well-defined and
> > -			 * shouldn't be taken. The caller should handle a NULL
> > -			 * return when trying to follow tail pages.
> > -			 */
> > -			if (PageHead(page))
> > -				get_page(page);
> > -			else
> > -				page = NULL;
> > -		}
> > -		return page;
> > -	}
> > +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
> > +		return follow_huge_pmd(mm, address, pmd, flags);
> 
> This code here allows for pmd_none() and pmd_huge(), and for pmd_numa()
> and pmd_trans_huge() below; but makes no explicit allowance for !present
> migration and hwpoison entries.
> 
> Is it assumed that the pmd_bad() test in follow_page_pte() will catch
> those?

Yes, it is now.

> But what of races?

The current patch is still racy when hugepage migrations triggered for
different reasons (hotremove and mbind, for example) happen concurrently.
We need a fix for that.

> migration entries are highly volatile.  And
> is it assumed that a migration entry cannot pass the pmd_huge() test?

Yes, the _PAGE_PSE bit is always clear for migration/hwpoison entries, so
they can never pass the pmd_huge() test for now.

> That may be true of x86 today, I'm not certain; but if the soft-dirty
> people catch up with the hugetlb-migration people, that might change
> (they #define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE).

Yes, this problem is not visible now (note that currently _PAGE_SWP_SOFT_DIRTY
is never set on a pmd because hugepages are never swapped out), but it is a
potential one.

> 
> Why pmd_huge() does not itself test for present, I cannot say; but it
> probably didn't matter at all before hwpoison and migration were added.

Correct, so we need to check the _PAGE_PRESENT bit in x86_64's pmd_huge() now.
And we need to take some proper action if we find a migration/hwpoison entry
here. To do this, adding another routine like huge_pmd_present() might be
useful (pmd_present() is already used for thp).
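
To make the idea concrete, a rough sketch of what such helpers might look
like for x86_64; the exact bit tests and the huge_pmd_present() name are
only assumptions drawn from the paragraph above, not code from this series:

/* arch/x86/mm/hugetlbpage.c -- sketch only */
int pmd_huge(pmd_t pmd)
{
	/*
	 * Require both bits, so that a non-present migration/hwpoison
	 * entry (_PAGE_PRESENT and _PAGE_PSE both clear) no longer
	 * satisfies pmd_huge().
	 */
	return (pmd_val(pmd) & (_PAGE_PRESENT | _PAGE_PSE)) ==
	       (_PAGE_PRESENT | _PAGE_PSE);
}

/*
 * Hypothetical helper: true when the huge pmd maps a real page, false
 * for a migration/hwpoison entry (analogous to pmd_present() for thp).
 */
static inline int huge_pmd_present(pmd_t pmd)
{
	return !!(pmd_val(pmd) & _PAGE_PRESENT);
}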

> 
> Mind you, with __get_user_pages()'s is_vm_hugetlb_page() test avoiding
> all this code, maybe the only thing that can stumble here is your own
> hugetlb migration code; but that appears to be guarded only by
> down_read of mmap_sem, so races would be possible (if userspace
> is silly enough or malicious enough to do so).

I guess that this race is fixed by this patch together with checking
_PAGE_PRESENT in pmd_huge(). Or are you referring to another race?

> 
> What we have here today looks too fragile to me, but it's probably
> best dealt with by a separate patch.
> 
> Or I may be over-anxious, and there may be something "obvious"
> that I'm missing, which saves us from further change.

No, you found a new issue in the current code, thank you very much.

> >  	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
> >  		return no_page_table(vma, flags);
> >  	if (pmd_trans_huge(*pmd)) {
> > diff --git mmotm-2014-09-09-14-42.orig/mm/hugetlb.c mmotm-2014-09-09-14-42/mm/hugetlb.c
> > index 34351251e164..941832ee3d5a 100644
> > --- mmotm-2014-09-09-14-42.orig/mm/hugetlb.c
> > +++ mmotm-2014-09-09-14-42/mm/hugetlb.c
> > @@ -3668,26 +3668,34 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
> >  
> >  struct page * __weak
> >  follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > -		pmd_t *pmd, int write)
> > +		pmd_t *pmd, int flags)
> >  {
> > -	struct page *page;
> > +	struct page *page = NULL;
> > +	spinlock_t *ptl;
> >  
> > -	page = pte_page(*(pte_t *)pmd);
> > -	if (page)
> > -		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
> > +	ptl = pmd_lockptr(mm, pmd);
> > +	spin_lock(ptl);
> > +
> > +	if (!pmd_huge(*pmd))
> > +		goto out;
> 
> And similarly here.  Though at least here we now have the necessary
> lock, so it's no longer racy, and maybe this pmd_huge() test just needs
> to be replaced by a pmd_present() test?  Or are both needed?

This check was introduced because the first pmd_huge() check, outside
follow_huge_pmd(), is done without the page table lock, so keeping it as
a recheck after taking the lock looks correct to me.
But as I mentioned above, I'm thinking of changing x86_64's pmd_huge() to
check both _PAGE_PRESENT and _PAGE_PSE to make sure that *pmd points to a
normal hugepage, so this recheck would then internally cover both.
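
As a concrete (but only illustrative) shape of that, follow_huge_pmd()
could recheck under the lock as below; handling the migration case with
is_hugetlb_entry_migration()/__migration_entry_wait() is just one
possibility (__migration_entry_wait() would need to be made visible
outside mm/migrate.c), not what this series posts:

struct page * __weak
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
		pmd_t *pmd, int flags)
{
	struct page *page = NULL;
	spinlock_t *ptl;

retry:
	ptl = pmd_lockptr(mm, pmd);
	spin_lock(ptl);
	/*
	 * With a pmd_huge() that also checks _PAGE_PRESENT, this single
	 * recheck under the lock covers both "not a hugepage" and
	 * "not present".
	 */
	if (pmd_huge(*pmd)) {
		page = pte_page(*(pte_t *)pmd) +
			((address & ~PMD_MASK) >> PAGE_SHIFT);
		if (flags & FOLL_GET)
			get_page(page);
	} else if (is_hugetlb_entry_migration(huge_ptep_get((pte_t *)pmd))) {
		/*
		 * One way to handle a migration entry: drop the lock, wait
		 * for the migration to finish, then retry.
		 * __migration_entry_wait() takes the lock itself.
		 */
		spin_unlock(ptl);
		__migration_entry_wait(mm, (pte_t *)pmd, ptl);
		goto retry;
	}
	spin_unlock(ptl);
	return page;
}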

Thanks,
Naoya Horiguchi

> > +
> > +	page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> > +
> > +	if (flags & FOLL_GET)
> > +		get_page(page);
> > +out:
> > +	spin_unlock(ptl);
> >  	return page;
> >  }
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/5] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
  2014-09-30  4:52   ` Hugh Dickins
@ 2014-10-07 16:56     ` Naoya Horiguchi
  0 siblings, 0 replies; 11+ messages in thread
From: Naoya Horiguchi @ 2014-10-07 16:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Rientjes, linux-mm, linux-kernel,
	Naoya Horiguchi, stable

On Mon, Sep 29, 2014 at 09:52:24PM -0700, Hugh Dickins wrote:
> On Mon, 15 Sep 2014, Naoya Horiguchi wrote:
> > When running the test which causes the race as shown in the previous patch,
> > we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().
> 
> Two minor comments...
> 
> > @@ -3192,22 +3208,19 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > 	 * Note that locking order is always pagecache_page -> page,
> >  	 * so no worry about deadlock.
> 
> That sentence of comment is stale and should be deleted,
> now that you're only doing a trylock_page(page) here.

OK, I'll delete it.

> >  out_mutex:
> >  	mutex_unlock(&htlb_fault_mutex_table[hash]);
> > +	if (need_wait_lock)
> > +		wait_on_page_locked(page);
> >  	return ret;
> >  }
> 
> It will be hard to trigger any problem from this (I guess it would
> need memory hotremove), but you ought really to hold a reference to
> page while doing a wait_on_page_locked(page).

I'll do that.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-10-07 16:57 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-15 22:39 [PATCH 0/5] hugepage migration fixes (v4) Naoya Horiguchi
2014-09-15 22:39 ` [PATCH v3 1/5] mm/hugetlb: reduce arch dependent code around follow_huge_* Naoya Horiguchi
2014-09-16 15:56   ` Naoya Horiguchi
2014-09-15 22:39 ` [PATCH v3 2/5] mm/hugetlb: take page table lock in follow_huge_pmd() Naoya Horiguchi
2014-09-30  4:32   ` Hugh Dickins
2014-10-07 16:56     ` Naoya Horiguchi
2014-09-15 22:39 ` [PATCH v3 3/5] mm/hugetlb: fix getting refcount 0 page in hugetlb_fault() Naoya Horiguchi
2014-09-30  4:52   ` Hugh Dickins
2014-10-07 16:56     ` Naoya Horiguchi
2014-09-15 22:39 ` [PATCH v3 4/5] mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection Naoya Horiguchi
2014-09-15 22:39 ` [PATCH v3 5/5] mm/hugetlb: add migration entry check in __unmap_hugepage_range Naoya Horiguchi
